Why does sync.Mutex suffer a large performance drop when goroutine contention exceeds 3400?
sync.Mutex is implemented on top of the runtime semaphore. The reason it encounters a massive performance decrease lies in the implementation of runtime.semacquire1.
Now, let's sample two representative points with go tool pprof: when the number of goroutines equals 2400 and 4800:
goos: linux
goarch: amd64
BenchmarkMutexWrite/goroutines-2400-8 50000000 46.5 ns/op
PASS
ok 2.508s
BenchmarkMutexWrite/goroutines-4800-8 50000000 317 ns/op
PASS
ok 16.020s
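The benchmark code itself is not reproduced here. A minimal sketch of what it could look like, assuming a shared counter protected by sync.Mutex and b.SetParallelism to scale the number of goroutines (the names and the GOMAXPROCS=8 assumption are illustrative, not the exact code behind the numbers above):

// mutex_bench_test.go — a minimal sketch, not the article's exact benchmark.
// Assumptions: a shared counter guarded by sync.Mutex, GOMAXPROCS=8 (hence
// the "-8" suffix in the output above), and b.SetParallelism to reach the
// desired goroutine count (RunParallel starts GOMAXPROCS*p goroutines).
package bench

import (
	"fmt"
	"sync"
	"testing"
)

func BenchmarkMutexWrite(b *testing.B) {
	for _, n := range []int{2400, 4800} {
		b.Run(fmt.Sprintf("goroutines-%d", n), func(b *testing.B) {
			var (
				mu     sync.Mutex
				shared int
			)
			b.SetParallelism(n / 8) // 8 = GOMAXPROCS on the benchmark machine
			b.RunParallel(func(pb *testing.PB) {
				for pb.Next() {
					mu.Lock()
					shared++
					mu.Unlock()
				}
			})
		})
	}
}

The CPU profiles can then be collected with, for example, go test -bench=MutexWrite -cpuprofile=cpu.out and inspected via go tool pprof cpu.out.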
2400 goroutines: [pprof graph]
4800 goroutines: [pprof graph]
As we can see, when the number of goroutines increases to 4800, the overhead of runtime.gopark becomes dominant. Let's dig deeper into the runtime source code and see who exactly calls runtime.gopark. In runtime.semacquire1:
func semacquire1(addr *uint32, lifo bool, profile semaProfileFlags, skipframes int) {
	// Fast path: try to atomically decrement the semaphore;
	// no queuing, no locking, no parking.
	if cansemacquire(addr) {
		return
	}

	// Harder case: grab a sudog to represent this goroutine and
	// find the semaphore root (the wait queue) for this address.
	s := acquireSudog()
	root := semroot(addr)
	...
	for {
		lock(&root.lock)
		// Add ourselves to nwait to disable the "easy case" in semrelease.
		atomic.Xadd(&root.nwait, 1)
		// Check cansemacquire again to avoid a missed wakeup.
		if cansemacquire(addr) {
			atomic.Xadd(&root.nwait, -1)
			unlock(&root.lock)
			break
		}
		// Slow path: enqueue this goroutine on the semaphore root and
		// park it. The runtime scheduler switches to another goroutine,
		// which is where the context-switching cost comes from.
		root.queue(addr, s, lifo)
		goparkunlock(&root.lock, waitReasonSemacquire, traceEvGoBlockSync, 4+skipframes)
		if s.ticket != 0 || cansemacquire(addr) {
			break
		}
	}
	...
}
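Here cansemacquire is the fast-path check. Its essence, as a simplified sketch based on runtime/sema.go of a comparable Go version, is a non-blocking atomic decrement attempt:

// cansemacquire succeeds only if the semaphore value is non-zero and the
// atomic decrement wins; it never blocks and never touches the scheduler.
// (Simplified sketch based on runtime/sema.go.)
func cansemacquire(addr *uint32) bool {
	for {
		v := atomic.Load(addr)
		if v == 0 {
			return false
		}
		if atomic.Cas(addr, v, v-1) {
			return true
		}
	}
}

Under heavy contention this check keeps failing, so every acquisition falls through to the queue-and-park path above.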
Based on the pprof graphs presented above, we can conclude that:

Observation: with 2400 goroutines, runtime.gopark is called rarely, while runtime.mutex is called heavily. We infer that most of the work completes before reaching the slow path.

Observation: with 4800 goroutines, runtime.gopark is called heavily. We infer that most executions enter the slow path, and once runtime.gopark is in use, the cost of runtime scheduler context switches must be taken into account.
Since channels in Go are implemented on top of OS synchronization primitives (e.g., futex on Linux) without involving the runtime scheduler, their performance decreases only linearly as the problem size grows.
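For reference, a channel-based mutex in such a comparison can be as simple as the following sketch (using a capacity-1 buffered channel as the lock is an assumption for illustration, not necessarily the exact counterpart benchmarked here):

// chanMutex is a sketch of a channel-based lock: a capacity-1 buffered
// channel; Lock sends into the channel, Unlock drains it.
// (Illustrative only; not code from this article.)
package bench

type chanMutex chan struct{}

func newChanMutex() chanMutex { return make(chanMutex, 1) }
func (m chanMutex) Lock()     { m <- struct{}{} }
func (m chanMutex) Unlock()   { <-m }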
The above explains why we observe a massive performance decrease in sync.Mutex.