Why does sync.Mutex suffer a large performance drop when goroutine contention exceeds 3400?
sync.Mutex is implemented on top of the runtime semaphore. The reason it encounters a massive performance decrease lies in the implementation of runtime.semacquire1.
Now, let's sample two representative points with go tool pprof: when the number of goroutines equals 2400 and 4800:
goos: linux
goarch: amd64
BenchmarkMutexWrite/goroutines-2400-8 50000000 46.5 ns/op
PASS
ok 2.508s
BenchmarkMutexWrite/goroutines-4800-8 50000000 317 ns/op
PASS
ok 16.020s
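The benchmark code itself is not reproduced here. A minimal sketch of what it could look like, assuming a shared counter protected by sync.Mutex and b.SetParallelism to scale the number of goroutines (the names and the GOMAXPROCS=8 assumption are illustrative, not the exact code behind the numbers above):

// mutex_bench_test.go — a minimal sketch, not the article's exact benchmark.
// Assumptions: a shared counter guarded by sync.Mutex, GOMAXPROCS=8 (hence
// the "-8" suffix in the output above), and b.SetParallelism to reach the
// desired goroutine count (RunParallel starts GOMAXPROCS*p goroutines).
package bench

import (
	"fmt"
	"sync"
	"testing"
)

func BenchmarkMutexWrite(b *testing.B) {
	for _, n := range []int{2400, 4800} {
		b.Run(fmt.Sprintf("goroutines-%d", n), func(b *testing.B) {
			var (
				mu     sync.Mutex
				shared int
			)
			b.SetParallelism(n / 8) // 8 = GOMAXPROCS on the benchmark machine
			b.RunParallel(func(pb *testing.PB) {
				for pb.Next() {
					mu.Lock()
					shared++
					mu.Unlock()
				}
			})
		})
	}
}

The CPU profiles can then be collected with, for example, go test -bench=MutexWrite -cpuprofile=cpu.out and inspected via go tool pprof cpu.out.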
2400 goroutines: [pprof graph]
4800 goroutines: [pprof graph]
As we can see, when the number of goroutines increases to 4800, the overhead of runtime.gopark becomes dominant. Let's dig deeper into the runtime source code and see who exactly calls runtime.gopark. In runtime.semacquire1:
func semacquire1(addr *uint32, lifo bool, profile semaProfileFlags, skipframes int) {
	// Fast path: try to atomically decrement the semaphore;
	// no queuing, no locking, no parking.
	if cansemacquire(addr) {
		return
	}

	// Harder case: grab a sudog to represent this goroutine and
	// find the semaphore root (the wait queue) for this address.
	s := acquireSudog()
	root := semroot(addr)
	...
	for {
		lock(&root.lock)
		// Add ourselves to nwait to disable the "easy case" in semrelease.
		atomic.Xadd(&root.nwait, 1)
		// Check cansemacquire again to avoid a missed wakeup.
		if cansemacquire(addr) {
			atomic.Xadd(&root.nwait, -1)
			unlock(&root.lock)
			break
		}
		// Slow path: enqueue this goroutine on the semaphore root and
		// park it. The runtime scheduler switches to another goroutine,
		// which is where the context-switching cost comes from.
		root.queue(addr, s, lifo)
		goparkunlock(&root.lock, waitReasonSemacquire, traceEvGoBlockSync, 4+skipframes)
		if s.ticket != 0 || cansemacquire(addr) {
			break
		}
	}
	...
}
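Here cansemacquire is the fast-path check. Its essence, as a simplified sketch based on runtime/sema.go of a comparable Go version, is a non-blocking atomic decrement attempt:

// cansemacquire succeeds only if the semaphore value is non-zero and the
// atomic decrement wins; it never blocks and never touches the scheduler.
// (Simplified sketch based on runtime/sema.go.)
func cansemacquire(addr *uint32) bool {
	for {
		v := atomic.Load(addr)
		if v == 0 {
			return false
		}
		if atomic.Cas(addr, v, v-1) {
			return true
		}
	}
}

Under heavy contention this check keeps failing, so every acquisition falls through to the queue-and-park path above.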
Based on the pprof graphs presented above, we can conclude that:

Observation: with 2400 goroutines, runtime.gopark is called rarely, while runtime.mutex is called heavily. We infer that most of the work completes before reaching the slow path.

Observation: with 4800 goroutines, runtime.gopark is called heavily. We infer that most executions enter the slow path, and once runtime.gopark is in use, the cost of runtime scheduler context switches must be taken into account.
Since channels in Go are implemented on top of OS synchronization primitives (e.g., futex on Linux) without involving the runtime scheduler, their performance decreases only linearly as the problem size grows.
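For reference, a channel-based mutex in such a comparison can be as simple as the following sketch (using a capacity-1 buffered channel as the lock is an assumption for illustration, not necessarily the exact counterpart benchmarked here):

// chanMutex is a sketch of a channel-based lock: a capacity-1 buffered
// channel; Lock sends into the channel, Unlock drains it.
// (Illustrative only; not code from this article.)
package bench

type chanMutex chan struct{}

func newChanMutex() chanMutex { return make(chanMutex, 1) }
func (m chanMutex) Lock()     { m <- struct{}{} }
func (m chanMutex) Unlock()   { <-m }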
The above explains why we observe a massive performance decrease in sync.Mutex.