The past few months have seen three self-indulgent, back-to-back playthroughs of Baldur’s Gate 3. To say I regret spending so much time on it would be inaccurate, as I really enjoyed the game. But life goes on, so I shall quit the redemption of the Bhaalspawn and resume my inquiry into software source code.
I have always wanted to read Linux. It’s said to be well-crafted, and it is a crucial codebase that powers much of the world! So let’s start.
To read it, I need to configure my text editor first (VSCode, in my case).
Clangd is my language server. It uses a compilation database to learn include paths, compile flags, etc. This information comes from the build, and clangd uses it to help me navigate the codebase.
The database is derived from an actual build, so I need to build Linux first (using make). Then there’s a convenient script at scripts/clang-tools/gen_compile_commands.py which generates the database after the build.
The kernel version used here is 6.7.0-rc6.
Spinlock API
Spinlock is a common synchronization mechanism for multi-processor systems. It prevents multiple processors from entering the same region of code, known as the critical section. Explanations can be found in any generic operating systems textbook.
Here I choose three spinlock APIs in the Linux kernel:
```c
spin_lock_init(lock);	/* initialize *lock to the unlocked state */
spin_lock(lock);	/* acquire the lock, spinning if necessary */
spin_unlock(lock);	/* release the lock */
```
They are self-contained and complete: first initialize a lock, lock it, then unlock it. That’s the whole lifetime of a lock. There are actually plenty of locking APIs, but I will only inspect these three for the sake of simplicity.
There’s another macro worth mentioning:
```c
#define DEFINE_SPINLOCK(x)	spinlock_t x = __SPIN_LOCK_UNLOCKED(x)
```
That’s a shortcut for defining a spinlock which marries the definition and the initialization. In a language without constructors, this is about the only way to establish an invariant at the moment a value of a type is created.
We can use it like:
```c
static DEFINE_SPINLOCK(my_lock);	/* my_lock is an illustrative name */
```
Some language constructs
There are some weird keywords in the API above. They are unfamiliar because they are kernel-specific rather than part of the C language standard, so I will briefly explain them:
- do … while macro
That’s the statement-block idiom for macros. When a macro needs to span multiple statements, we wrap them in do { … } while (0) so the whole macro behaves like a single statement (e.g. inside an unbraced if).
- __always_inline
The kernel developers consider GCC’s inline keyword, unfortunately, broken: it is just a suggestion to the compiler, so it may not do what it is seemingly designed to do (i.e. inline the function). So they sidestep that keyword and use __always_inline to force the compiler to inline the function and do nothing else.
- mask
The mask values are generated by macros. When _Q_LOCKED_BITS == 8 and _Q_LOCKED_OFFSET == 0, the generated mask is ((1U << 8) - 1) << 0, which is 0xff. To extract the masked bits we use (a & mask), and (a & ~mask) gives the bits outside the mask.
PV locks
Linux supports virtualization natively. One hypervisor technique is para-virtualization, which the kernel supports.
Linux has a spinlock dedicated to KVM, the pvqspinlock. I will not inspect it in this article; virtualization belongs to a broader topic.
Spinlock and Raw Spinlock
The spinlock is a wrapper of raw_spinlock, which does all the real work. The wrapper exists to detect and diagnose deadlocks. For locks and unlocks, the functions simply forward to raw_spinlock. The lockdep module does the deadlock detection, but due to the limited scope of this article, I will not discuss it further.
Now let’s see the true shape of a spinlock. How much state must it store to support the locking algorithm?
```c
/* include/linux/spinlock_types.h (simplified) */
typedef struct spinlock {
	union {
		struct raw_spinlock rlock;

#ifdef CONFIG_DEBUG_LOCK_ALLOC
# define LOCK_PADSIZE (offsetof(struct raw_spinlock, dep_map))
		struct {
			u8 __padding[LOCK_PADSIZE];
			struct lockdep_map dep_map;
		};
#endif
	};
} spinlock_t;
```
As I said, I will ignore the lockdep machinery, so spinlock is just the same as raw_spinlock, which unfolds as follows:
```c
typedef struct raw_spinlock {
	arch_spinlock_t raw_lock;
#ifdef CONFIG_DEBUG_SPINLOCK
	unsigned int magic, owner_cpu;
	void *owner;
#endif
#ifdef CONFIG_DEBUG_LOCK_ALLOC
	struct lockdep_map dep_map;
#endif
} raw_spinlock_t;
```
Aside from debugging code, that structure is in turn a direct mapping of the arch_spinlock_t type, which unfolds further:
```c
typedef struct qspinlock {
	union {
		atomic_t val;

		/*
		 * Little-endian layout shown; the big-endian layout is
		 * mirrored.
		 */
		struct {
			u8	locked;
			u8	pending;
		};
		struct {
			u16	locked_pending;
			u16	tail;
		};
	};
} arch_spinlock_t;
```
It’s a four-byte struct containing a single union. The first union member, val, suggests that clients may only modify the struct atomically. I don’t think it encodes other useful information on its own, so let’s look at the rest of it.
Intuitively, the field locked says whether this spinlock is in the locked state. In that state, if another thread wants to acquire the lock, it must wait. The other fields don’t make much sense right now, so let’s look elsewhere.
Init
```c
#define spin_lock_init(_lock)				\
do {							\
	spinlock_check(_lock);				\
	*(_lock) = __SPIN_LOCK_UNLOCKED(_lock);		\
} while (0)
```
To initialize a spinlock, the code first checks whether the incoming _lock is a pointer to a spinlock, then fills it with a spinlock in the unlocked state.
Here’s a trick, in weakly typed C, to ensure some type safety:
```c
static __always_inline raw_spinlock_t *spinlock_check(spinlock_t *lock)
{
	return &lock->rlock;
}
```
That function’s value is never used: no one checks what it returns, and every compiler will remove the call during optimization. That’s fine, because the function does its job at compile time, not at runtime. When a client passes the wrong type of pointer, the compiler emits a warning (which the kernel build promotes to an error), ending the compilation. That way, the _lock a user passes to spin_lock_init() is guaranteed to be the right type.
After several macro expansions, let’s see how the bookkeeper registers an initial lock state:
```c
#define __ARCH_SPIN_LOCK_UNLOCKED	{ { .val = ATOMIC_INIT(0) } }
```
It simply zeroes the struct.
Locking
The first expansion resolves to this function:
```c
static inline void __raw_spin_lock(raw_spinlock_t *lock)
{
	preempt_disable();				/* 1 */
	spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);	/* 2 */
	LOCK_CONTENDED(lock, do_raw_spin_trylock, do_raw_spin_lock); /* 3 */
}
```
1 During a lock acquisition, we really don’t want other threads to disturb us; otherwise this piece of code could introduce more race conditions than it solves. So preemption is disabled here.
2 This one is used by the lockdep module to track deadlocks.
3 This macro does the real work:
```c
/* the non-instrumented variant of the macro */
#define LOCK_CONTENDED(_lock, try, lock) \
	lock(_lock)
```
, which points us to do_raw_spin_lock, turning out to be:
```c
static inline void do_raw_spin_lock(raw_spinlock_t *lock) __acquires(lock)
{
	__acquire(lock);			/* 1 */
	arch_spin_lock(&lock->raw_lock);	/* 2 */
	mmiowb_spin_lock();			/* 3 */
}
```
1 This is still diagnostic annotation, consumed by the sparse static checker.
2 does the real work.
3 bookkeeping for architectures where I/O writes need an extra barrier around spinlocks; a no-op on x86.
And the real work is…
```c
#define arch_spin_lock(l)		queued_spin_lock(l)
```
Then we come to qspinlock.h.
From C to x86 Assembly
The definition of queued_spin_lock is quite simple: it tries to acquire the lock, and if the lock is held by another thread, it enters a “slow path” which implements the queueing algorithm.
```c
static __always_inline void queued_spin_lock(struct qspinlock *lock)
{
	int val = 0;

	if (likely(atomic_try_cmpxchg_acquire(&lock->val, &val, _Q_LOCKED_VAL)))
		return;

	queued_spin_lock_slowpath(lock, val);
}
```
The likely is a compiler hint. By declaring that one branch of a condition is more likely to be taken, it helps the compiler arrange the code in favor of the predicted branch.
Next, I will take a closer look at atomic_try_cmpxchg_acquire. There are two interesting aspects to that function: 1. cmpxchg is an x86 assembly instruction, so it must be implemented via inline assembly. 2. acquire specifies a memory ordering. These two concepts occupy the next two sections.
Inline assembly
Stripping the instrumentation code, cmpxchg boils down to the following:
```c
/* arch/x86/include/asm/cmpxchg.h (simplified; only the 32-bit case shown) */
#define __raw_try_cmpxchg(_ptr, _pold, _new, size, lock)	\
({								\
	bool success;						\
	__typeof__(_ptr) _old = (__typeof__(_ptr))(_pold);	\
	__typeof__(*(_ptr)) __old = *_old;			\
	__typeof__(*(_ptr)) __new = (_new);			\
	volatile u32 *__ptr = (volatile u32 *)(_ptr);		\
	asm volatile(lock "cmpxchgl %[new], %[ptr]"		\
		     CC_SET(z)					\
		     : CC_OUT(z) (success),			\
		       [ptr] "+m" (*__ptr),			\
		       [old] "+a" (__old)			\
		     : [new] "r" (__new)			\
		     : "memory");				\
	if (unlikely(!success))					\
		*_old = __old;					\
	likely(success);					\
})
```
For our purpose, only the 32-bit case is relevant. The function compares the value at _ptr: if *_ptr is equal to *_pold, then *_ptr is replaced by _new. If the replacement occurs the function evaluates to true, otherwise to false (and the observed value is written back to *_pold). Worth noting that this is a valued statement block: the value of ({statements;}) is that of the last statement, in our case likely(success), or just success.
There are some sanity checks. They are necessary, but I will ignore them and dive straight into the asm block. Related macros are expanded and simplified where necessary.
```c
asm volatile("lock; cmpxchgl %[new], %[ptr]"
	     : "=@ccz" (success),
	       [ptr] "+m" (*__ptr),
	       [old] "+a" (__old)
	     : [new] "r" (__new)
	     : "memory");
```
An asm block is scary at first sight, so we need some patience and good guidance.
The first element, lock;, is an instruction prefix: it decorates the next instruction, making it atomic, as the AMD manual says:
The LOCK prefix causes certain read-modify-write instructions that access memory to occur atomically. The mechanism for doing so is implementation-dependent (for example, the mechanism may involve locking of data-cache lines that contain copies of the referenced memory operands, and/or bus signaling or packet-messaging on the bus). The prefix is intended to give the processor exclusive use of shared memory operands in a multiprocessor system.
The general structure of an asm block is like this:
```c
asm volatile("instruction template"
	     : /* output operands */
	     : /* input operands */
	     : /* clobbered registers */);
```
The first part is an instruction template. Just like printf, the template is filled in with the operands given later, after the colons. A %[<label>] pattern refers to a named operand. For example, [new] "r" (__new) specifies that __new will be read into some register to fill every %[new] in the template; which register gets used is up to gcc’s whim.
After this block, we can expect success to hold the condition code that tells us whether the exchange happened. The GCC documentation tells us:
x86 family:
The flag output constraints for the x86 family are of the form =@cc(cond) where cond is one of the standard conditions defined in the ISA manual for jcc or setcc.
In particular, z means the zero flag is set, i.e. the comparison came out equal.
Memory order
On strongly-ordered systems — x86, SPARC TSO, IBM mainframe, etc. — release-acquire ordering is automatic for the majority of operations.
We know that on modern hardware, the compiler and the CPU may reorder our code in unpredictable ways. To address that, we can instruct them to constrain their optimizations. The constraints come in four levels:
Sequentially Consistent: don’t reorder anything.
Acquire: no memory accesses after it may be moved before it.
Release: no memory accesses before it may be moved after it.
Relaxed: hardware, do whatever ye wants.
That’s my most imprecise and irresponsible interpretation. Many authors have discussed them extensively, but due to the scope of the article I can say no further.
Locking - the slow path
Now that we know the lock is held by someone else, our thread is going to spin and wait until the lock is available. The Linux kernel implements an optimized MCS locking algorithm, which adds a pending bit (actually a byte) to the lock. If the pending bit is set, it means the lock has exactly one waiter ahead of us and no queue has formed yet. Only when a third competitor appears is a queue constructed. Because that occasion is rarer, this avoids touching the per-CPU queue nodes and the cache-line misses that come with them.
So this function has two execution flows: the quicker slow path, and the slower slow path.
The quicker slow path
As we enter the slow path, either the lock is held, or the lock is claimed but not yet fully acquired by another thread.
We denote the lock state by a triple (queue tail, pending bit, lock value).
```c
	/*
	 * Wait for in-progress pending->locked hand-overs with a bounded
	 * number of spins.
	 *
	 * 0,1,0 -> 0,0,1
	 */
	if (val == _Q_PENDING_VAL) {
		int cnt = _Q_PENDING_LOOPS;
		val = atomic_cond_read_relaxed(&lock->val,
					       (VAL != _Q_PENDING_VAL) || !cnt--);
	}
```
(0, 0, 0) is the fast path. The slow path may start at (0, 1, 0), where the lock was recently released by its old owner, but the pending waiter (not us) hasn’t fully acquired it yet. In that case we have to wait until it takes the lock, after which the lock enters the state (0, 0, 1).
```c
	/*
	 * If we observe any contention; queue.
	 */
	if (val & ~_Q_LOCKED_MASK)
		goto queue;
```
Then we read the lock again and check the pending and queue-tail bits. They are set if someone else is already waiting for the lock. In that case, we are at least the third candidate, so we must queue: that’s the slower slow path. But let’s assume we are the pending waiter, so we can go through the quicker slow path.
```c
	/*
	 * trylock || pending
	 *
	 * 0,0,* -> 0,1,* -> 0,0,1 pending, trylock
	 */
	val = queued_fetch_set_pending_acquire(lock);
```
That function atomically sets the pending bit of the lock and returns the old value. We check the old value: if the lock was unfortunately contended again in the meantime, we have no choice but to join the queue after all:
```c
	/*
	 * If we observe contention, there is a concurrent locker.
	 */
	if (unlikely(val & ~_Q_LOCKED_MASK)) {
		/* Undo pending if we were not the one who set it. */
		if (!(val & _Q_PENDING_MASK))
			clear_pending(lock);

		goto queue;
	}
```
After that, we can simply spin until the former owner releases the lock, and then take it:
```c
	/*
	 * We're pending; wait for the owner to go away.
	 *
	 * 0,1,1 -> 0,1,0
	 */
	if (val & _Q_LOCKED_MASK)
		smp_cond_load_acquire(&lock->locked, !VAL);

	/*
	 * take ownership and clear the pending bit.
	 *
	 * 0,1,0 -> 0,0,1
	 */
	clear_pending_set_locked(lock);
	return;
```
The slowest path
When the lock is busy, waiters are arranged in a queue. The queue nodes are statically allocated per CPU. Here’s the definition:
```c
struct qnode {
	struct mcs_spinlock mcs;
#ifdef CONFIG_PARAVIRT_SPINLOCKS
	long reserved[2];
#endif
};

static DEFINE_PER_CPU_ALIGNED(struct qnode, qnodes[MAX_NODES]);
```
qnode is a thin wrapper around mcs_spinlock. mcs_spinlock contains a pointer to the next node in the queue, a locked flag, and a nesting count.
```c
struct mcs_spinlock {
	struct mcs_spinlock *next;
	int locked;	/* 1 if lock acquired */
	int count;	/* nesting count, see qspinlock.c */
};
```
Every CPU has four qnodes, because there are four contexts in which control flow on one CPU can change, namely:
Task context: while a spinlock is being acquired, preemption is disabled, so at most one task per CPU may be spinning on a lock.
Software interrupt (softirq)
Hardware interrupt: IO…
Non-maskable interrupt (NMI)
Our lock needs to know its position in the queue:
```c
	node = this_cpu_ptr(&qnodes[0].mcs);
	idx = node->count++;
	tail = encode_tail(smp_processor_id(), idx);
```
If one CPU has somehow handed out more than four nodes, qspinlock falls back to a plain spin on trylock:
```c
	if (unlikely(idx >= MAX_NODES)) {
		lockevent_inc(lock_no_node);
		while (!queued_spin_trylock(lock))
			cpu_relax();
		goto release;
	}
```
After that, we get from the CPU a node to enqueue with.
```c
	node = grab_mcs_node(node, idx);

	/*
	 * Ensure that we increment the head node->count before initialising
	 * the actual node.
	 */
	barrier();

	node->locked = 0;
	node->next = NULL;
```
It has been quite a while since our first attempt. During that time the lock may have been released, so we can probe again to see whether we can grab it now:
```c
	/*
	 * We touched a (possibly) cold cacheline in the per-cpu queue node;
	 * attempt the trylock once more.
	 */
	if (queued_spin_trylock(lock))
		goto release;
```
Now that we have a node, the next step is to actually enqueue it.
```c
	/*
	 * Publish the updated tail.
	 *
	 * p,*,* -> n,*,*
	 */
	old = xchg_tail(lock, tail);
	next = NULL;

	/*
	 * if there was a previous node; link in and wait until reaching the
	 * head of the waitqueue.
	 */
	if (old & _Q_TAIL_MASK) {
		prev = decode_tail(old);

		/* Link @node into the waitqueue. */
		WRITE_ONCE(prev->next, node);

		arch_mcs_spin_lock_contended(&node->locked);
	}
```
We swap in our new tail id, fetching the old one. The old id is decoded to locate the predecessor’s qnode. We need that qnode because its owner is ahead of us in the queue: when it moves on, it can find us through the next pointer and set our locked flag. Then we start waiting! The loop lives inside the macro arch_mcs_spin_lock_contended, which returns once our node has become the head of the queue:
```c
#define arch_mcs_spin_lock_contended(l)					\
do {									\
	smp_cond_load_acquire(l, VAL);					\
} while (0)
```
Once at the head, wait until the lock itself is free (neither locked nor pending):
```c
	/*
	 * we're at the head of the waitqueue, wait for the owner & pending to
	 * go away.
	 *
	 * *,x,y -> *,0,0
	 */
	val = atomic_cond_read_acquire(&lock->val,
				       !(VAL & _Q_LOCKED_PENDING_MASK));
```
Finally, claim the lock. If our node is still the tail of the queue, a single cmpxchg clears the tail and sets the locked byte at once; otherwise we only set the locked byte and hand the MCS lock over to our successor:

```c
	/*
	 * n,0,0 -> 0,0,1 : lock, uncontended
	 * *,*,0 -> *,*,1 : lock, contended
	 */
	if ((val & _Q_TAIL_MASK) == tail) {
		if (atomic_try_cmpxchg_relaxed(&lock->val, &val, _Q_LOCKED_VAL))
			goto release;	/* No contention */
	}

	/* Either somebody is queued behind us or _Q_PENDING_VAL is set. */
	set_locked(lock);

	/*
	 * contended path; wait for next if not observed yet, then release
	 * our successor from its MCS spin.
	 */
	if (!next)
		next = smp_cond_load_relaxed(&node->next, (VAL));

	arch_mcs_spin_unlock_contended(&next->locked);
```
Conclusion
Hacking on Linux is more tiring than I thought. Maybe later articles in the series will deal with the release part. The locking part alone has been a heady and rigorous investigation of the hardware/software interface and of the tradeoff between implementation and optimization.
The next article will not look at such complicated code, of course.