GIL behavior in Python 3.7 multithreading
Python is not executed directly. It is first compiled into so-called Python bytecode, which is similar in idea to raw assembly. The interpreter then executes this bytecode.
What the GIL does is prevent two bytecode instructions from running in parallel. Some operations (e.g. I/O) do release the GIL internally, though, to allow real concurrency when it is guaranteed that nothing can break.
Now all you have to know is that count -= 1 does not compile into a single bytecode instruction. It actually compiles into four instructions:
LOAD_GLOBAL 1 (count)
LOAD_CONST 1 (1)
INPLACE_SUBTRACT
STORE_GLOBAL 1 (count)
which roughly means:
load the global variable count onto the thread's own evaluation stack
load the constant 1 onto the same stack
subtract 1 from the loaded value, leaving the result on the stack
store that result back into the global count
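The listing above can be reproduced with the standard dis module. This is a minimal sketch; the function name decrement is just an illustrative wrapper, and the exact opcode names depend on the interpreter version (for example, Python 3.11 and later emit BINARY_OP where older versions emit INPLACE_SUBTRACT).

```python
import dis

count = 10

def decrement():
    # Hypothetical wrapper so there is a code object to disassemble.
    global count
    count -= 1

# Collect the opcode names the compiler generated for decrement().
opnames = [ins.opname for ins in dis.get_instructions(decrement)]
print(opnames)

# Human-readable disassembly, similar to the listing above.
dis.dis(decrement)
```

Whatever the version, the load of count and the store back into count show up as separate instructions, and that separation is what matters here.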
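Each of these instructions is atomic, but a thread switch can happen between any two of them, so instructions from different threads get interleaved. For example, two threads can both load the same value of count, both subtract 1, and both store the same result, so one of the decrements is lost. That's why you see what you see.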
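If you need the decrement to behave as one indivisible step, one common approach is to guard it with a threading.Lock. The sketch below is an assumption about how you might structure it, not the code from the question; worker, lock and the thread and iteration counts are made up for illustration.

```python
import threading

count = 0
lock = threading.Lock()

def worker(n):
    global count
    for _ in range(n):
        # Holding the lock means no other worker can run its own
        # load/subtract/store sequence in the middle of this one.
        with lock:
            count -= 1

threads = [threading.Thread(target=worker, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(count)  # -400000
```

Without the with lock: line the final value may or may not be -400000, depending on the interpreter version and on where the thread switches happen to land.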
So what the GIL does is make the execution flow serial: instructions happen one after another, and nothing runs in parallel. So when you run multiple threads on CPU-bound code, in theory they take about as long as a single thread, plus some time spent on (so-called) context switches. My tests in Python 3.6 confirm that the execution time is similar.
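A rough way to check this yourself is to time the same CPU-bound loop once in a single thread and once split across two threads. This is only a sketch under my own assumptions (the countdown function and the value of N are arbitrary), and the numbers will vary by machine and interpreter build.

```python
import threading
import time

def countdown(n):
    # Pure-Python busy loop, so the GIL is held while it runs.
    while n > 0:
        n -= 1

N = 2_000_000

# All the work in the main thread.
start = time.perf_counter()
countdown(N)
single = time.perf_counter() - start

# The same total work split across two threads.
threads = [threading.Thread(target=countdown, args=(N // 2,)) for _ in range(2)]
start = time.perf_counter()
for t in threads:
    t.start()
for t in threads:
    t.join()
threaded = time.perf_counter() - start

print(f"single thread: {single:.3f}s, two threads: {threaded:.3f}s")
```

On a standard CPython build with the GIL enabled, I would expect the two timings to be in the same ballpark rather than the threaded one being twice as fast.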
However, in Python 2.7 my tests showed significant performance degradation with threads, about 1.5x. I don't know the reason for this. Something other than the GIL has to be happening in the background.