Does it make sense to set GOMAXPROCS higher than the number of CPU cores? Notes on GMP

These are my notes on 小徐先生's article "温故知新——Golang GMP 万字洗髓经". They are only a rough record.

What is GMP

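G: a goroutine, Go's abstraction of a coroutine. Each G has its own lifecycle, its own stack, and the function it executes.

M: Go's abstraction of a thread. An M is a kernel-level thread, created and scheduled by the operating system. An M takes Gs from a P and executes them.

P: the intermediate scheduler that coordinates threads and goroutines; it can be thought of as a logical processor in the Go world. Its most important job is to manage the queue of goroutines waiting to be assigned to an M. The number of Ps is determined by GOMAXPROCS at run time.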

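Worth noting: runnable goroutines wait in two kinds of queue, the per-P local queues and a single global queue.

P's local queue: managed by the P itself and implemented as a lock-free queue using CAS. When a new goroutine is created (go func), the G is placed on the current P's local queue first.

Once an M is bound to a P, it takes Gs from that P's local queue first.

Global queue: when the local queue is full, or when one of the more involved scheduling policies applies, a new goroutine is placed on the global queue. The global queue is protected by a lock.

There are many such policies: a goroutine returning from a system call (syscall), load balancing across several Ps, and so on.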

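The GMP scheduling flow is roughly as follows:

To run anything, a thread M must first acquire a P, that is, bind to it.
It then takes a G from the P's local run queue (LRQ).
If the LRQ has no runnable G, the M tries to move a batch of Gs from the global run queue (GRQ) into the P's local queue.
If the global queue has no runnable G either, the M picks another P at random and steals half of its local queue into its own P's local queue.
Once it has a runnable G, the M runs it. When the G finishes, the M fetches the next G from the P, and the cycle repeats.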
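
The lookup order above can be sketched as a toy model. This is a minimal illustration, not the runtime's code: the names sched, proc, and next are hypothetical, plain slices stand in for the lock-free local queue and the locked global queue, the batch size is an arbitrary choice, and the steal victim is the first non-empty P rather than a random one.

```go
package main

import "fmt"

// proc is a toy P: just a local queue of goroutine IDs.
type proc struct{ local []int }

// sched is a toy scheduler: one global queue plus a set of Ps.
type sched struct {
	global []int
	procs  []*proc
}

// next returns the G that the M bound to procs[i] should run, and where it
// came from: "local", "global", or "stolen".
func (s *sched) next(i int) (g int, from string, ok bool) {
	p := s.procs[i]
	from = "local"
	if len(p.local) == 0 {
		if len(s.global) > 0 {
			// Refill the local queue with a batch from the global queue.
			n := len(s.global)/len(s.procs) + 1
			if n > len(s.global) {
				n = len(s.global)
			}
			p.local = append(p.local, s.global[:n]...)
			s.global = s.global[n:]
			from = "global"
		} else {
			// Steal half of another P's local queue.
			for j, victim := range s.procs {
				if j == i || len(victim.local) == 0 {
					continue
				}
				half := (len(victim.local) + 1) / 2
				p.local = append(p.local, victim.local[:half]...)
				victim.local = victim.local[half:]
				from = "stolen"
				break
			}
		}
	}
	if len(p.local) == 0 {
		return 0, "", false // nothing runnable anywhere
	}
	g, p.local = p.local[0], p.local[1:]
	return g, from, true
}

func main() {
	s := &sched{
		global: []int{10},
		procs:  []*proc{{local: []int{1}}, {}, {local: []int{20, 21, 22, 23}}},
	}
	// P1 starts with an empty local queue, so it falls back step by step.
	for k := 0; k < 4; k++ {
		g, from, ok := s.next(1)
		fmt.Println(g, from, ok)
	}
}
```

Calling next(1) repeatedly shows P1 draining the global queue first and only then stealing from P2.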

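In runtime/proc.go you can see how the global queue is kept from being starved: on every 61st scheduling tick, a P checks the global queue for a G before looking at its local queue. As the source shows, 61 is a hard-coded magic number.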

// Check the global runnable queue once in a while to ensure fairness.
// Otherwise two goroutines can completely occupy the local runqueue
// by constantly respawning each other.
if pp.schedtick%61 == 0 && !sched.runq.empty() {
	lock(&sched.lock)
	gp := globrunqget()
	unlock(&sched.lock)
	if gp != nil {
		return gp, false, false
	}
}
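
To see why the periodic check matters, here is a small simulation of the situation the comment describes. It is a sketch under simplifying assumptions: ticksUntilGlobalRuns is a hypothetical helper, the local queue is a plain slice, and the tick counter starts at 1.

```go
package main

import "fmt"

const fairnessInterval = 61 // the same constant as in the excerpt above

// ticksUntilGlobalRuns simulates a P whose local queue never drains because
// two goroutines keep respawning each other. It returns the scheduling tick
// on which a goroutine waiting in the global queue finally gets picked, or
// -1 if that has not happened within maxTicks.
func ticksUntilGlobalRuns(fair bool, maxTicks int) int {
	local := []string{"ping", "pong"}
	global := []string{"waiter"}
	for tick := 1; tick <= maxTicks; tick++ {
		if fair && tick%fairnessInterval == 0 && len(global) > 0 {
			return tick // the global queue is consulted first on this tick
		}
		// Run the head of the local queue; it respawns itself at the tail.
		g := local[0]
		local = append(local[1:], g)
	}
	return -1
}

func main() {
	fmt.Println("with the check:   ", ticksUntilGlobalRuns(true, 1000))
	fmt.Println("without the check:", ticksUntilGlobalRuns(false, 1000))
}
```

Without the check the waiting goroutine is never selected in this model, because the local queue always has work.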

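Writing this, I realized there is an interesting question: a native syscall is designed at the thread level, meaning the thread that makes the call is suspended. Yet we know that calling a syscall in Go does not undermine its high concurrency. How does Go manage that?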

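Go divides potentially blocking operations into two broad categories: network I/O on one side, and regular file operations plus ordinary syscalls on the other.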

How network I/O avoids blocking

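Go puts every network file descriptor (fd) into non-blocking mode by default. The effect is that of calling fcntl(fd, F_SETFL, O_NONBLOCK) right after socket or accept; on Linux the same result is normally requested directly through the SOCK_NONBLOCK flag to socket and accept4.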

fcntl(fd, F_SETFL, O_NONBLOCK) is a POSIX system call that switches an already open file descriptor fd to non-blocking mode.
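
The effect of non-blocking mode can be observed with a pipe. The following is a minimal sketch for Unix-like systems only; readNonblocking and wouldBlock are hypothetical helper names, and a pipe is used in place of a socket purely to keep the example short. syscall.SetNonblock is the standard library's wrapper for setting the O_NONBLOCK flag.

```go
package main

import (
	"errors"
	"fmt"
	"syscall"
)

// readNonblocking creates a pipe, switches the read end to non-blocking
// mode, and tries to read while the pipe is still empty.
func readNonblocking() error {
	var p [2]int
	if err := syscall.Pipe(p[:]); err != nil {
		return err
	}
	defer syscall.Close(p[0])
	defer syscall.Close(p[1])

	// Sets O_NONBLOCK on the read end of the pipe.
	if err := syscall.SetNonblock(p[0], true); err != nil {
		return err
	}
	buf := make([]byte, 1)
	_, err := syscall.Read(p[0], buf) // nothing has been written yet
	return err
}

// wouldBlock reports whether the empty read failed with EAGAIN instead of
// suspending the calling thread.
func wouldBlock() bool {
	return errors.Is(readNonblocking(), syscall.EAGAIN)
}

func main() {
	fmt.Println("read returned EAGAIN:", wouldBlock())
}
```

Because the call returns at once with EAGAIN, the caller is free to do something else and retry when the descriptor becomes readable.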

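With the fd in non-blocking mode, a read or write that cannot complete returns EAGAIN immediately instead of suspending the thread. The fd is registered with netpoll, and the G that issued the call is parked until the fd becomes ready.

netpoll is Go's own network event poller: roughly a Go wrapper around epoll/kqueue plus hooks into the scheduler.

While the G is parked, neither the M nor the P is tied up: the M keeps its P and goes on to run other Gs. From the point of view of the G's own code, the call simply looks like a blocking one. When netpoll reports the fd as ready, the G is made runnable again.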
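
One way to observe how network waits are handled is to park many goroutines in Read on idle connections and count the OS threads the process has created. This is a rough sketch, assuming loopback TCP is available; blockedReaders is a hypothetical helper, GOMAXPROCS is pinned to 4 only to keep the numbers small, and the exact thread count will vary between machines and Go versions.

```go
package main

import (
	"fmt"
	"net"
	"runtime"
	"runtime/pprof"
	"time"
)

// blockedReaders parks n goroutines in Read on idle loopback TCP connections
// and returns the number of OS threads the process has created by then.
func blockedReaders(n int) (int, error) {
	prev := runtime.GOMAXPROCS(4)
	defer runtime.GOMAXPROCS(prev)

	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return 0, err
	}
	defer ln.Close()

	var conns []net.Conn // keep both ends referenced until we are done
	defer func() {
		for _, c := range conns {
			c.Close() // closing also wakes the parked readers
		}
	}()

	for i := 0; i < n; i++ {
		client, err := net.Dial("tcp", ln.Addr().String())
		if err != nil {
			return 0, err
		}
		conns = append(conns, client)
		server, err := ln.Accept()
		if err != nil {
			return 0, err
		}
		conns = append(conns, server)

		// The peer never writes, so this Read waits until the connection is closed.
		go func() {
			buf := make([]byte, 1)
			client.Read(buf)
		}()
	}

	time.Sleep(100 * time.Millisecond) // give every reader time to park
	return pprof.Lookup("threadcreate").Count(), nil
}

func main() {
	threads, err := blockedReaders(100)
	if err != nil {
		fmt.Println("setup failed:", err)
		return
	}
	fmt.Printf("100 goroutines waiting in Read, %d OS threads created\n", threads)
}
```

If each waiting goroutine held a thread of its own, the thread count would be at least as large as the number of readers.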

How regular files and ordinary syscalls avoid blocking

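When a G makes a blocking syscall, its P is detached from the M (either up front or, if the call does not return quickly, by the runtime's monitor thread), and the original M stays blocked in the kernel. While blocked, that M is completely idle (it cannot run any other G), but the P can be picked up by another M, so other Ps and Gs are not affected.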
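
The following sketch makes this visible with a single P. It is an illustration for Unix-like systems only, with a hypothetical helper name (blockingReadWithOneP): one goroutine blocks in a raw read on an empty pipe, and the program can only finish if the calling goroutine still gets to run and write the byte that unblocks it.

```go
package main

import (
	"fmt"
	"runtime"
	"syscall"
)

// blockingReadWithOneP limits the program to one P, blocks a goroutine in a
// raw read(2) on an empty pipe, and then writes one byte from the caller.
// It returns what the blocked goroutine eventually read.
func blockingReadWithOneP() (string, error) {
	prev := runtime.GOMAXPROCS(1)
	defer runtime.GOMAXPROCS(prev)

	var p [2]int
	if err := syscall.Pipe(p[:]); err != nil {
		return "", err
	}
	defer syscall.Close(p[0])
	defer syscall.Close(p[1])

	got := make(chan string, 1)
	go func() {
		buf := make([]byte, 1)
		for {
			// The fd is in blocking mode, so this call suspends the OS thread.
			n, err := syscall.Read(p[0], buf)
			if err == syscall.EINTR {
				continue // interrupted by a signal, try again
			}
			if err != nil {
				got <- "read failed: " + err.Error()
				return
			}
			got <- string(buf[:n])
			return
		}
	}()

	runtime.Gosched() // let the reader start and block in the syscall

	// Reaching this line with the reader blocked means the only P was taken
	// over by another M.
	if _, err := syscall.Write(p[1], []byte("x")); err != nil {
		return "", err
	}
	return <-got, nil
}

func main() {
	s, err := blockingReadWithOneP()
	fmt.Println(s, err)
}
```

The Gosched call only makes it likely that the reader blocks first; the function returns the same value either way.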

Does a GOMAXPROCS greater than the actual number of CPU cores make sense?

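I looked at a few sources, and the answer seems to depend on the workload. For the vast majority of applications, setting GOMAXPROCS above the actual number of CPU cores brings no benefit, and it can even hurt performance because the extra Ps (and the Ms running them) compete for the same cores. Some developers, however, argue that in I/O-heavy scenarios a GOMAXPROCS above the core count can improve performance.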
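
Before comparing, it helps to check what the runtime itself reports. A minimal sketch: runtime.NumCPU returns the number of logical CPUs usable by the process (hardware threads, not physical cores), and runtime.GOMAXPROCS(0) reads the current setting without changing it. The oversubscription helper is a hypothetical name used only for this illustration.

```go
package main

import (
	"fmt"
	"runtime"
)

// oversubscription returns how many Ps are configured per logical CPU.
// A value above 1 means GOMAXPROCS exceeds the CPU count.
func oversubscription(procs, cpus int) float64 {
	return float64(procs) / float64(cpus)
}

func main() {
	cpus := runtime.NumCPU()
	procs := runtime.GOMAXPROCS(0) // 0 queries the value without modifying it
	fmt.Printf("logical CPUs: %d, GOMAXPROCS: %d, ratio: %.2f\n",
		cpus, procs, oversubscription(procs, cpus))
}
```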

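To check, I ran a simple experiment. My machine has a Ryzen 7 6800H (8 cores, 16 threads), and I had Qwen3 generate two kinds of workload:

CPU-bound: computing Fibonacci numbers, without memoization in order to increase CPU load.
I/O-bound: a network server that responds after a delay.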

// cpu_bound_test.go
package main

import (
	"math/rand"
	"runtime"
	"testing"
)

// fib computes a Fibonacci number (deliberately without caching, to create CPU load)
func fib(n int) int {
	if n <= 1 {
		return n
	}
	return fib(n-1) + fib(n-2)
}

func benchmarkCPUBound(b *testing.B, gomaxprocs int) {
	// The previous value is not restored; benchmarks usually run on their own, so this is optional
	runtime.GOMAXPROCS(gomaxprocs)

	// Warm-up: run once before timing starts to reduce cold-cache effects (optional)
	_ = fib(35)

	b.ResetTimer()
	b.RunParallel(func(pb *testing.PB) {
		// Each goroutine independently computes a randomly chosen Fibonacci number
		for pb.Next() {
			n := 30 + rand.Intn(6) // 30 to 35: not so small that it finishes instantly, not so large that it takes too long
			_ = fib(n)
		}
	})
}

// One benchmark per GOMAXPROCS value under test
func BenchmarkCPUBound_GOMAXPROCS_1(b *testing.B)  { benchmarkCPUBound(b, 1) }
func BenchmarkCPUBound_GOMAXPROCS_4(b *testing.B)  { benchmarkCPUBound(b, 4) }
func BenchmarkCPUBound_GOMAXPROCS_8(b *testing.B)  { benchmarkCPUBound(b, 8) }
func BenchmarkCPUBound_GOMAXPROCS_16(b *testing.B) { benchmarkCPUBound(b, 16) }
func BenchmarkCPUBound_GOMAXPROCS_32(b *testing.B) { benchmarkCPUBound(b, 32) }

// io_bound_test.go
package main

import (
	"io"
	"net/http"
	"net/http/httptest"
	"runtime"
	"testing"
	"time"
)

// newDelayedServer creates an HTTP server that responds after a delay (simulating a remote service)
func newDelayedServer(delay time.Duration) *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(delay) // simulate network/database latency
		w.WriteHeader(http.StatusOK)
		w.Write([]byte("ok"))
	}))
}

func benchmarkIOBound(b *testing.B, gomaxprocs int, delay time.Duration, concurrency int) {
	runtime.GOMAXPROCS(gomaxprocs)

	server := newDelayedServer(delay)
	defer server.Close()

	client := &http.Client{
		Timeout: 10 * time.Second,
	}

	b.ResetTimer()
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			resp, err := client.Get(server.URL)
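			if err != nil {
				// Fatalf must not be called from RunParallel goroutines; report the error and stop this worker.
				b.Errorf("Request failed: %v", err)
				return
			}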
			io.ReadAll(resp.Body)
			resp.Body.Close()
		}
	})
}

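// Measure I/O performance under different GOMAXPROCS values (fixed 50ms delay; concurrency is determined by RunParallel, so the concurrency argument is unused)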
func BenchmarkIOBound_50ms_GOMAXPROCS_1(b *testing.B)  { benchmarkIOBound(b, 1, 50*time.Millisecond, 0) }
func BenchmarkIOBound_50ms_GOMAXPROCS_4(b *testing.B)  { benchmarkIOBound(b, 4, 50*time.Millisecond, 0) }
func BenchmarkIOBound_50ms_GOMAXPROCS_8(b *testing.B)  { benchmarkIOBound(b, 8, 50*time.Millisecond, 0) }
func BenchmarkIOBound_50ms_GOMAXPROCS_16(b *testing.B) { benchmarkIOBound(b, 16, 50*time.Millisecond, 0) }
func BenchmarkIOBound_50ms_GOMAXPROCS_32(b *testing.B) { benchmarkIOBound(b, 32, 50*time.Millisecond, 0) }

The results:

~/project/go_demo 7s
❯ go test -bench=. ./...        
goos: linux
goarch: amd64
pkg: gomaxprocs-bench
cpu: AMD Ryzen 7 6800H with Radeon Graphics         
BenchmarkCPUBound_GOMAXPROCS_1-16                    100          15270612 ns/op
testing: BenchmarkCPUBound_GOMAXPROCS_1-16 left GOMAXPROCS set to 1
BenchmarkCPUBound_GOMAXPROCS_4-16                    279           4803904 ns/op
testing: BenchmarkCPUBound_GOMAXPROCS_4-16 left GOMAXPROCS set to 4
BenchmarkCPUBound_GOMAXPROCS_8-16                    352           3154692 ns/op
testing: BenchmarkCPUBound_GOMAXPROCS_8-16 left GOMAXPROCS set to 8
BenchmarkCPUBound_GOMAXPROCS_16-16                   483           2153030 ns/op
BenchmarkCPUBound_GOMAXPROCS_32-16                   482           2204711 ns/op
testing: BenchmarkCPUBound_GOMAXPROCS_32-16 left GOMAXPROCS set to 32
BenchmarkIOBound_50ms_GOMAXPROCS_1-16                 22          50526068 ns/op
testing: BenchmarkIOBound_50ms_GOMAXPROCS_1-16 left GOMAXPROCS set to 1
BenchmarkIOBound_50ms_GOMAXPROCS_4-16                 86          13001928 ns/op
testing: BenchmarkIOBound_50ms_GOMAXPROCS_4-16 left GOMAXPROCS set to 4
BenchmarkIOBound_50ms_GOMAXPROCS_8-16                171           6604883 ns/op
testing: BenchmarkIOBound_50ms_GOMAXPROCS_8-16 left GOMAXPROCS set to 8
BenchmarkIOBound_50ms_GOMAXPROCS_16-16               356           3352537 ns/op
BenchmarkIOBound_50ms_GOMAXPROCS_32-16               711           1676945 ns/op
testing: BenchmarkIOBound_50ms_GOMAXPROCS_32-16 left GOMAXPROCS set to 32
PASS
ok      gomaxprocs-bench        16.211s

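The results are fairly clear. For the CPU-bound task, performance is best when GOMAXPROCS equals the number of logical CPUs (16 on this machine); going to 32 brings no gain and is marginally slower (2204711 ns/op versus 2153030 ns/op). In the I/O-heavy scenario, a GOMAXPROCS above the CPU count did improve the measured throughput.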
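
One caveat when reading the I/O numbers: b.RunParallel starts GOMAXPROCS goroutines by default, so in the benchmark above each larger GOMAXPROCS value also means more requests in flight. The sketch below is one way to look at the two factors separately. It is a simplified stand-in, not a rerun of the experiment: waitAll is a hypothetical helper, and time.Sleep replaces the HTTP round trip, which ignores the CPU work a real client does per request.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"time"
)

// waitAll sets GOMAXPROCS to procs, starts n goroutines that each wait for d
// (a stand-in for a slow remote call), and returns the total elapsed time.
// The previous GOMAXPROCS value is restored before returning.
func waitAll(procs, n int, d time.Duration) time.Duration {
	prev := runtime.GOMAXPROCS(procs)
	defer runtime.GOMAXPROCS(prev)

	var wg sync.WaitGroup
	start := time.Now()
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			time.Sleep(d) // the goroutine is parked here and does not occupy a P
		}()
	}
	wg.Wait()
	return time.Since(start)
}

func main() {
	// Same number of concurrent waiters, different numbers of Ps.
	for _, procs := range []int{1, 32} {
		elapsed := waitAll(procs, 32, 50*time.Millisecond)
		fmt.Printf("GOMAXPROCS=%d: 32 waits of 50ms took %v\n",
			procs, elapsed.Round(time.Millisecond))
	}
}
```

If the two timings come out close to each other, the number of concurrent waiters, rather than the number of Ps, is what determines throughput for this kind of workload.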
