Go Scheduler

What is the Go Scheduler?

Go scheduler’s purpose is to efficiently distribute goroutines over multiple OS threads.

Kernel thread is expensive; therefore, reducing and reusing kernel threads are keys.

initial goroutine stack comsumes 2 KB memory

default thread stack comsumes 1 MB memory

Scheduling Basics

P (Processor)

It is given a Logical Processor (P) for every virtual core.
Runtime keeps track of each G and maps them onto Logical Processors.
OS threads run on at most GOMAXPROCS number of processors.

M (Machine, OS thread)

M must have an associated P to execute G.

G (Goroutine)

G keeps track of its stack and status.
Status
- Idle: It was just allocated and has not yet been initialised.
- Runnable: It is on a runqueue.
- Running:
  - It is assigned an M.
  - It executes user code.
- Syscall:
  - It is assigned an M
  - It is not on a runqueue.
  - It is executing a system call and not executing user code.
- Waiting: Is is blocked in the runtime. (e.g. channel wait queue)
- Moribund unused
- Dead
- Enqueue unused
- Copy stack

Runqueue

Local runqueue
- Every P has its own local runqueue.
- Once goroutines are created, they are tracked in Ps local runqueue (FIFO).
Global runqueue
- All Ps share a global runqueue.
- A lower priority runqueue
- Go scheduler runs a background thread called sysmon to detect long-running goroutines (10+ ms, e.g. CPU-bound computation) and put them into global runqueue.

P2 has no work to do and P1 has no work to steal (we will talk about work stealing later).

Global Runqueue
┏━━━━┓
┃ G7 ┃
┗━━━━┛

┏━━━━┓  ┏━━━━┓  ┏━━━━┓         ┏━━━━┓  ┏━━━━┓  ┏━━━━┓
┃ P1 ┣━━┃ M1 ┣━━┃ G1 ┃         ┃ P2 ┣━━┃ M2 ┣━━┃    ┃
┗━┳━━┛  ┗━━━━┛  ┗━━━━┛         ┗━┳━━┛  ┗━━━━┛  ┗━━━━┛
  ┃                              ┃
  ┃    Local Runqueue            ┃    Local Runqueue
  ┃    ┏━━━━━━━┓                 ┃    ┏━━━━━━━┓
  ┗━━━━┫ empty ┃                 ┗━━━━┫ empty ┃
       ┗━━━━━━━┛                      ┗━━━━━━━┛

Since there was runnable G G7 in global runqueue, P2 got G7 from there.

Global Runqueue
┏━━━━━━━┓
┃ empty ┃
┗━━━━━━━┛

┏━━━━┓  ┏━━━━┓  ┏━━━━┓         ┏━━━━┓  ┏━━━━┓  ┏━━━━┓
┃ P1 ┣━━┃ M1 ┣━━┃ G1 ┃         ┃ P2 ┣━━┃ M2 ┣━━┃ G7 ┃
┗━┳━━┛  ┗━━━━┛  ┗━━━━┛         ┗━┳━━┛  ┗━━━━┛  ┗━━━━┛
  ┃                              ┃
  ┃    Local Runqueue            ┃    Local Runqueue
  ┃    ┏━━━━━━━┓                 ┃    ┏━━━━━━━┓
  ┗━━━━┫ empty ┃                 ┗━━━━┫ empty ┃
       ┗━━━━━━━┛                      ┗━━━━━━━┛

Context Switching

┏━━━━━━━━━━━━━━━━━━━━━━┓
┃  CPU Core            ┃
┃┏━━━━━━━━━━━━━━━━━━━━┓┃
┃┃ Thread 1           ┣╋━━┓
┃┃ ┏━━━━┓┏━━━━┓┏━━━━┓ ┃┃  ┃
┃┃ ┃ G1 ┃┃ G2 ┃┃ G3 ┃ ┃┃  ┃
┃┃ ┗━━━━┛┗━━━━┛┗━━━━┛ ┃┃  ┣━━＞OS scheduler
┃┗━━━━━━━━━━━━━━━━━━━━┛┃  ┃
┃┏━━━━━━━━━━━━━━━━━━━━┓┃  ┃
┃┃ Thread 2           ┣╋━━┛
┃┃ ┏━━━━┓┏━━━━┓┏━━━━┓ ┃┃
┃┃ ┃ G4 ┃┃ G5 ┃┃ G6 ┃ ┣╋━━━━━＞Go scheduler
┃┃ ┗━━━━┛┗━━━━┛┗━━━━┛ ┃┃
┃┗━━━━━━━━━━━━━━━━━━━━┛┃
┗━━━━━━━━━━━━━━━━━━━━━━┛

OS Threads are context-switched on and off a core.
Goroutines are context-switched on and off an M.
Speed of switching
- goroutine switches: tens of ns
- thread switches: a μs
Classes of events in Go programs that allow the scheduler to make scheduling decisions.
- The use of the keyword go (goroutine)
- Garbage collection
- System calls
- Synchronisation
- Network I/O

Asynchronous System Calls

netpoller

Use interface of network poller provided by OS to deal with network sockets.
- MacOS: kqueue
- Linux: epoll
- Windows: IOCP (IoCompletionPort)
Prevent goroutines from blocking the M when syscalls are made.
Use its own thread to do network I/O. (no need to create new M)

G1 ran on M1 and made a network call, so it was moved to the netpoller. And P1 got G2 from its local runqueue and G2 was context-switched on M1. Until G1 finish its network call, it will be moved back to the P1’s local runqueue.

netpoller
┏━━━━┓
┃ G1 ┃
┗━━━━┛

┏━━━━┓  ┏━━━━┓  ┏━━━━┓
┃ P1 ┣━━┃ M1 ┣━━┃ G2 ┃
┗━┳━━┛  ┗━━━━┛  ┗━━━━┛
  ┃
  ┃    Local Runqueue
  ┃    ┏━━━━┓  ┏━━━━┓  ┏━━━━┓
  ┗━━━━┫ G3 ┣━━┃ G4 ┣━━┃ G5 ┃
       ┗━━━━┛  ┗━━━━┛  ┗━━━━┛

Synchronous System Calls

Goroutine making the syscall will block the M.
- System call e.g. open a file
- Synchronisation (atomic, mutex, channel operation call)
Handoff
- Background monitor thread detects threads which are blocked the M.
- Start a new thread or unpark a idle thread.
- Background monitor hands off the runqueue to new thread.

G1 ran on M1 and made a system call that blocked M1. M1 with G1 was moved off from P1, then M2 was created and replace M1. Then P1 got G2 from its local runqueue and G2 was context-switched on M2. Until G1 finish its system call, it will be moved back to P1’s local runqueue and M1 will park as a idle thread for future use.

┏━━━━┓  ┏━━━━┓
┃ M1 ┣━━┃ G1 ┃
┗━━━━┛  ┗━━━━┛

┏━━━━┓  ┏━━━━┓  ┏━━━━┓
┃ P1 ┣━━┃ M2 ┣━━┃ G2 ┃
┗━┳━━┛  ┗━━━━┛  ┗━━━━┛
  ┃
  ┃    Local Runqueue
  ┃    ┏━━━━┓  ┏━━━━┓  ┏━━━━┓
  ┗━━━━┫ G3 ┣━━┃ G4 ┣━━┃ G5 ┃
       ┗━━━━┛  ┗━━━━┛  ┗━━━━┛

Spinning Thread

For the reason that hand-off and thread parking/unparking increase latency, minimise these actions for optimal thread management.

The problem is that it is impossible to predict the future whether goroutines are comming. We do not want to just park a worker thread and then unpark it.

The solution is to maintain one “idle” thread on M for incoming goroutines, and this state of worker thread called “spinning”.

Work Stealing

Balance the goroutines across all the Ps in order to keep Ms efficient.
The rules for work stealing are as follows.
- Check the local runqueue.
- If not found, try to steal work from other Ps.
- If not found, check the global runqueue.
- If not found, poll network.

P2 has finished all work and it will try to steal work from P1.

┏━━━━┓  ┏━━━━┓  ┏━━━━┓         ┏━━━━┓  ┏━━━━┓  ┏━━━━┓
┃ P1 ┣━━┃ M1 ┣━━┃ G1 ┃         ┃ P2 ┣━━┃ M2 ┣━━┃    ┃
┗━┳━━┛  ┗━━━━┛  ┗━━━━┛         ┗━┳━━┛  ┗━━━━┛  ┗━━━━┛
  ┃                              ┃
  ┃    Local Runqueue            ┃    Local Runqueue
  ┃    ┏━━━━┓  ┏━━━━┓  ┏━━━━┓    ┃    ┏━━━━━━━┓
  ┗━━━━┫ G4 ┣━━┃ G5 ┣━━┃ G6 ┃    ┗━━━━┫ empty ┃
       ┗━━━━┛  ┗━━━━┛  ┗━━━━┛         ┗━━━━━━━┛

P2 stealed half of Gs (G4 and G5) from P1’s local runqueue and G4 was context-switched on M2.

┏━━━━┓  ┏━━━━┓  ┏━━━━┓         ┏━━━━┓  ┏━━━━┓  ┏━━━━┓
┃ P1 ┣━━┃ M1 ┣━━┃ G1 ┃         ┃ P2 ┣━━┃ M2 ┣━━┃ G4 ┃
┗━┳━━┛  ┗━━━━┛  ┗━━━━┛         ┗━┳━━┛  ┗━━━━┛  ┗━━━━┛
  ┃                              ┃
  ┃    Local Runqueue            ┃    Local Runqueue
  ┃    ┏━━━━┓                    ┃    ┏━━━━┓
  ┗━━━━┫ G6 ┃                    ┗━━━━┫ G5 ┃
       ┗━━━━┛                         ┗━━━━┛

Trace go scheduler

schedtrace

$ GOMAXPROCS=2 GODEBUG=schedtrace=1000 go run main.go
...
SCHED 2009ms: gomaxprocs=2 idleprocs=0 threads=4 spinningthreads=0 idlethreads=1 runqueue=0 [8 0]
...

gomaxprocs: Number of processors
idleprocs: Number of idle processors
threads: Number of threads
idlethreads: Number of idle threads
spinningthreads: Number of spinning threads
runqueue: Number of goroutines in the global runqueue
[0 0]: Number of goroutine of respective local runqueue

schedetail

$ GOMAXPROCS=2 GODEBUG=schedtrace=1000,scheddetail=1 go run main.go
...
SCHED 0ms: gomaxprocs=2 idleprocs=0 threads=4 spinningthreads=1 idlethreads=1 runqueue=0 gcwaiting=0 nmidlelocked=0 stopwait=0 sysmonwait=0
  P0: status=1 schedtick=0 syscalltick=0 m=3 runqsize=0 gfreecnt=0
  P1: status=0 schedtick=2 syscalltick=0 m=-1 runqsize=0 gfreecnt=0
  M3: p=0 curg=-1 mallocing=0 throwing=0 preemptoff= locks=1 dying=0 helpgc=0 spinning=true blocked=false lockedg=-1
  M2: p=-1 curg=-1 mallocing=0 throwing=0 preemptoff= locks=0 dying=0 helpgc=0 spinning=false blocked=true lockedg=-1
  M1: p=-1 curg=-1 mallocing=0 throwing=0 preemptoff= locks=1 dying=0 helpgc=0 spinning=false blocked=false lockedg=-1
  M0: p=-1 curg=-1 mallocing=0 throwing=0 preemptoff= locks=0 dying=0 helpgc=0 spinning=false blocked=true lockedg=1
  G1: status=1(chan receive) m=-1 lockedm=0
  G2: status=4(force gc (idle)) m=-1 lockedm=-1
  G3: status=4(GC sweep wait) m=-1 lockedm=-1
...

0: idle
1: runnable
2: running
3: syscall
4: waiting
5: moribund unused
6: dead
7: enqueue unused
8: copy stack