qspinlock, a spinlock in the Linux Kernel

December 21, 2023

CS Linux Synchronization

Recent days (well, months) have seen my self-indulgent three consecutive walkthroughs of Baldur’s Gate 3. To say I regret spending so much time on it would be inaccurate, as I really enjoyed the game. But life goes on, so I shall put aside the redemption of the Bhaalspawn and resume my inquiry into software source code.

I have always wanted to read the Linux kernel. It’s said to be well-crafted, and it is a crucial piece of source code that powers much of the world. So let’s start.

To read it, I first need to configure my text editor (VSCode, in my case).

Clangd is my language server. It uses a compile commands database to learn include paths, compile flags, etc. This information is produced during the build, and Clangd uses it to help me navigate the codebase.

The build emits the compile commands database, so I need to build Linux first (with make). Then there’s a convenient script, scripts/clang-tools/gen_compile_commands.py, which generates the database after the build.

The kernel version is 6.7-rc6.

Spinlock API

A spinlock is a common synchronization mechanism for multiprocessor systems. It prevents multiple processes from entering the same code region at once, also known as the critical section. Detailed explanations can be found in any general operating systems textbook.

Here I choose three spinlock APIs in the Linux kernel:

/*
 * From include/linux/spinlock.h
 */

# define spin_lock_init(_lock)				\
do {							\
	spinlock_check(_lock);				\
	*(_lock) = __SPIN_LOCK_UNLOCKED(_lock);		\
} while (0)

static __always_inline void spin_lock(spinlock_t *lock)
{
	raw_spin_lock(&lock->rlock);
}

static __always_inline void spin_unlock(spinlock_t *lock)
{
	raw_spin_unlock(&lock->rlock);
}

They are self-contained and complete: first initialize a lock, then lock it, then unlock it. That’s the whole lifetime of a lock. There are actually plenty more locking APIs, but I will only inspect these three for the sake of simplicity.

There’s another macro worth mentioning:

#define DEFINE_SPINLOCK(x) spinlock_t x = __SPIN_LOCK_UNLOCKED(x)

That’s a shortcut for defining a spinlock which marries definition and initialization. In a language without constructors, this is about the only way to establish an invariant at the moment a value is created.

We can use it like:

static int i = 0;
static DEFINE_SPINLOCK(lock);

void some_function(void)
{
	spin_lock(&lock);
	i++;
	spin_unlock(&lock);
}

Some language constructs

There are some weird keywords in the API above. They are unfamiliar because they are kernel-specific rather than part of the C language standard, so I will briefly explain them:

do { ... } while (0) is the standard statement-block idiom in macros. When a macro body spans multiple statements, this wrapping makes it behave like a single statement.

The kernel developers consider GCC’s inline keyword, unfortunately, broken: it is only a suggestion to the compiler, so it won’t reliably do what it is meant to do (i.e. inline the function). So they avoid relying on it and use __always_inline, which expands to inline plus the always_inline attribute and forces the compiler to inline the function unconditionally.

The mask values are generated by macros. With _Q_LOCKED_BITS == 8 and _Q_LOCKED_OFFSET == 0, the generated mask is ((1U << 8) - 1) << 0, which is 0xff. To extract the masked bits we use (a & mask), and (a & ~mask) gives the bits outside the mask; see the sketch below.
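For reference, the mask macros follow this pattern; a sketch modeled on include/asm-generic/qspinlock_types.h (double-check the exact values against your tree):

/* Field geometry of the locked byte. */
#define _Q_LOCKED_OFFSET	0
#define _Q_LOCKED_BITS		8

/* Build a mask of _BITS ones shifted up to _OFFSET. */
#define _Q_SET_MASK(type)	(((1U << _Q_ ## type ## _BITS) - 1) \
				 << _Q_ ## type ## _OFFSET)

#define _Q_LOCKED_MASK		_Q_SET_MASK(LOCKED)	/* == 0xff */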

PV locks

Linux supports virtualization natively. One virtualization technique it supports is para-virtualization.

Linux has a spinlock variant designed for para-virtualized guests (e.g. under KVM), the pvqspinlock. I will not inspect it in this article; virtualization belongs to a broader topic.

Spinlock and Raw Spinlock

The spinlock is a wrapper around raw_spinlock, which does all the real work. The wrapper exists to detect and diagnose deadlocks; for lock and unlock, the functions simply forward to the raw_spinlock operations. The lockdep module does the deadlock detection, but given the limited scope of this article, I will not discuss it further.

Now let’s look at the actual representation of a spinlock. How much state does it need to store to support the locking algorithm?

// include/linux/spinlock_types.h

typedef struct spinlock {
	union {
		struct raw_spinlock rlock;

#ifdef CONFIG_DEBUG_LOCK_ALLOC
# define LOCK_PADSIZE (offsetof(struct raw_spinlock, dep_map))
		struct {
			u8 __padding[LOCK_PADSIZE];
			struct lockdep_map dep_map;
		};
#endif
	};
} spinlock_t;

As I said, I will ignore the lockdep module, so a spinlock is effectively the same as a raw_spinlock, which unfolds as follows:

// include/linux/spinlock_types_raw.h
typedef struct raw_spinlock {
	arch_spinlock_t raw_lock;
#ifdef CONFIG_DEBUG_SPINLOCK
	unsigned int magic, owner_cpu;
	void *owner;
#endif
#ifdef CONFIG_DEBUG_LOCK_ALLOC
	struct lockdep_map dep_map;
#endif
} raw_spinlock_t;

Aside from the debugging fields, this structure is again a thin wrapper, this time around arch_spinlock_t, which unfolds further:

// include/asm-generic/qspinlock_types.h

typedef struct qspinlock {
	union {
		atomic_t val;

		/*
		 * By using the whole 2nd least significant byte for the
		 * pending bit, we can allow better optimization of the lock
		 * acquisition for the pending bit holder.
		 */
		struct {
			u8	locked;
			u8	pending;
		};
		struct {
			u16	locked_pending;
			u16	tail;
		};
	};
	// omit the layout for big-endian machines
} arch_spinlock_t;

It’s a four-byte struct containing a single union. The first union member, val, suggests that clients should only modify the struct atomically; it doesn’t encode much other information by itself, so let’s look at the rest of the union.

Intuitively, the field locked says whether the spinlock is currently held. While it is held, any other thread that wants to acquire the lock must wait. The other fields don’t make much sense yet, so let’s look elsewhere.

Init

// include/linux/spinlock.h

# define spin_lock_init(_lock)				\
do {							\
	spinlock_check(_lock);				\
	*(_lock) = __SPIN_LOCK_UNLOCKED(_lock);		\
} while (0)

To initialize a spinlock, the code first checks that the incoming _lock really is a pointer to a spinlock, then fills it with a spinlock in the unlocked state.

Here’s a trick for getting some type safety out of weakly typed C:

static __always_inline raw_spinlock_t *spinlock_check(spinlock_t *lock)
{
	return &lock->rlock;
}

The function’s result is never used: nobody checks its return value, and the compiler will almost certainly optimize the call away. That’s fine, because the function does its job at compile time, not at run time.

When a caller passes the wrong type of pointer, the compiler emits a warning (which the kernel build promotes to an error), stopping the compilation. That way, the _lock a user passes to spin_lock_init() is guaranteed to be of the right type, as the sketch below illustrates.
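A hypothetical misuse, just to illustrate the check (struct mutex stands in for any non-spinlock type):

struct mutex m;		/* not a spinlock_t */

spin_lock_init(&m);	/* expands to spinlock_check(&m): passing a
			 * 'struct mutex *' where a 'spinlock_t *' is
			 * expected triggers an incompatible-pointer-type
			 * warning, which the build treats as an error.
			 */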

After several macro expansions, let’s see what the initial lock state actually looks like:

// include/asm-generic/qspinlock_types.h

#define __ARCH_SPIN_LOCK_UNLOCKED { { .val = ATOMIC_INIT(0) } }

It simply zeroes the struct; the whole expansion chain is sketched below.
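Ignoring the debug and lockdep fields, the initializer chain boils down to the following (a rough sketch of the expansion, not the literal macro output):

/*
 *   __SPIN_LOCK_UNLOCKED(x)
 *     -> (spinlock_t) { { .rlock = { .raw_lock = __ARCH_SPIN_LOCK_UNLOCKED } } }
 *     -> (spinlock_t) { { .rlock = { .raw_lock = { { .val = ATOMIC_INIT(0) } } } } }
 *
 * i.e. a lock word of all zeroes.
 */
static DEFINE_SPINLOCK(demo_lock);	/* demo_lock starts out unlocked */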

Locking

The first expansion resolves to this function:

// include/linux/spinlock_api_smp.h
// SMP means Symmetric MultiProcessing: multiple processors sharing one main memory.

static inline void __raw_spin_lock(raw_spinlock_t *lock)
{
	preempt_disable();
	spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
	LOCK_CONTENDED(lock, do_raw_spin_trylock, do_raw_spin_lock);
}

1 During a lock acquisition, we really don’t want the current task to be preempted, otherwise this piece of code could introduce more race conditions than it solves. So preemption is disabled first.

2 This is used by the lockdep module to track potential deadlocks.

3 This macro does the real work:

// include/linux/lockdep.h

#define LOCK_CONTENDED(_lock, try, lock) \
	lock(_lock)

This hints that we should look at do_raw_spin_lock, which turns out to be:

static inline void do_raw_spin_lock(raw_spinlock_t *lock) __acquires(lock)
{
	__acquire(lock);
	arch_spin_lock(&lock->raw_lock);
	mmiowb_spin_lock();
}

1 This is again for diagnostics (a sparse annotation).

2 This does the real work.

3 This is MMIO write-barrier bookkeeping for architectures that need it; on x86 it is a no-op.

And the real work is…

#define arch_spin_lock(l)  queued_spin_lock(l)

Then we arrive at qspinlock.h.

From C to x86 Assembly

The definition of queued_spin_lock is quite simple: it tries to acquire the lock, and if the lock is already held by another thread, it enters a “slow path” which implements the queueing algorithm.

/**
 * queued_spin_lock - acquire a queued spinlock
 * @lock: Pointer to queued spinlock structure
 */
static __always_inline void queued_spin_lock(struct qspinlock *lock)
{
	int val = 0;

	if (likely(atomic_try_cmpxchg_acquire(&lock->val, &val, _Q_LOCKED_VAL)))
		return;

	queued_spin_lock_slowpath(lock, val);
}

likely() is a branch-prediction hint to the compiler. By declaring that one branch of a condition is more probable, it lets the compiler lay out the code to favor that branch; see the definitions below.
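The hints boil down to __builtin_expect. This is a simplified form of what include/linux/compiler.h defines (the real versions grow extra instrumentation when branch profiling is enabled):

#define likely(x)	__builtin_expect(!!(x), 1)
#define unlikely(x)	__builtin_expect(!!(x), 0)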

Next, I will take a closer look at atomic_try_cmpxchg_acquire. There are two interesting aspects to this function: 1. cmpxchg is an x86 instruction, so it has to be implemented with inline assembly; 2. acquire specifies a memory ordering. These two concepts occupy the next two sections.

Inline assembly

Stripping out the instrumentation, the cmpxchg boils down to the following code:

// arch/x86/include/asm/cmpxchg.h

#define __raw_try_cmpxchg(_ptr, _pold, _new, size, lock)		\
({									\
	bool success;							\
	__typeof__(_ptr) _old = (__typeof__(_ptr))(_pold);		\
	__typeof__(*(_ptr)) __old = *_old;				\
	__typeof__(*(_ptr)) __new = (_new);				\
	switch (size) {							\
	case __X86_CASE_L:						\
	{								\
		volatile u32 *__ptr = (volatile u32 *)(_ptr);		\
		asm volatile(lock "cmpxchgl %[new], %[ptr]"		\
			     CC_SET(z)					\
			     : CC_OUT(z) (success),			\
			       [ptr] "+m" (*__ptr),			\
			       [old] "+a" (__old)			\
			     : [new] "r" (__new)			\
			     : "memory");				\
		break;							\
	}								\
	default:							\
		__cmpxchg_wrong_size();					\
	}								\
	if (unlikely(!success))						\
		*_old = __old;						\
	likely(success);						\
})

For our purpose only the 32-bit case is relevant. The macro compares the value at _ptr with *_pold; if they are equal, *_ptr is replaced by _new and the whole thing evaluates to true, otherwise *_pold is updated to the value actually observed and it evaluates to false. Worth noting that this is a GCC statement expression: the value of ({ statements; }) is the value of its last statement, in our case likely(success), i.e. just success. A plain-C model of these semantics is sketched below.
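To make the semantics concrete, here is a plain-C model of what the macro does. The real thing performs the compare-and-swap in a single atomic instruction, which this sketch does not:

static bool try_cmpxchg_model(u32 *ptr, u32 *old, u32 new)
{
	u32 observed = *ptr;	/* done atomically in the real implementation */

	if (observed == *old) {
		*ptr = new;	/* swap succeeded */
		return true;
	}

	*old = observed;	/* report what we actually saw */
	return false;
}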

There are some sanity checks. They are necessary, but I will ignore them and dive into the asm block. Related macros are expanded and simplified where needed.

asm volatile(
	"lock;"
	"cmpxchgl %[new], %[ptr]"
	: "=@ccz" (success),
	  [ptr] "+m" (*__ptr),
	  [old] "+a" (__old)
	: [new] "r" (__new)
	: "memory");

An asm block is scary at first sight, so we need some patience and good guidance.

The first line, lock, is an instruction prefix: it decorates the next instruction and makes it atomic, as the AMD manual says:

The LOCK prefix causes certain read-modify-write instructions that access memory to occur atomically. The mechanism for doing so is implementation-dependent (for example, the mechanism may involve locking of data-cache lines that contain copies of the referenced memory operands, and/or bus signaling or packet-messaging on the bus). The prefix is intended to give the processor exclusive use of shared memory operands in a multiprocessor system.

The general structure of an asm block is like this:

asm [volatile] (
	"<instructions separated by ; or \n\t>"
	: output operands
	: input operands
	: clobbered registers or memory
);

The first part is an instruction template. Just like printf, the template is filled in with the operands given later, after the colons. The %[<label>] syntax refers to a named operand. For example, [new] "r" (__new) says that __new fills the %[new] slot; the "r" constraint tells GCC to place it in a general-purpose register of its choosing, while "+m" keeps *__ptr in memory and "+a" pins __old to the eax register, both read and written.

After this block, we expect success to hold the condition code that tells us whether the exchange happened. The GCC documentation says:

x86 family:
The flag output constraints for the x86 family are of the form =@cc(cond) where cond is one of the standard conditions defined in the ISA manual for jcc or setcc.

In particular, z means the zero flag (ZF), which cmpxchg sets when the comparison succeeds.
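Before flag outputs existed, the same result had to be captured with an explicit sete. This illustrative kernel-style sketch (my own, not from the tree) shows the older pattern for comparison:

static inline bool cmpxchg_u32_sketch(volatile u32 *ptr, u32 *old, u32 new)
{
	u32 prev = *old;
	bool success;

	asm volatile("lock; cmpxchgl %[new], %[ptr]\n\t"
		     "sete %[ok]"			/* ok = ZF */
		     : [ok] "=q" (success),
		       [ptr] "+m" (*ptr),
		       [old] "+a" (prev)
		     : [new] "r" (new)
		     : "memory");

	if (!success)
		*old = prev;	/* report the value we actually observed */
	return success;
}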

Memory order

On strongly-ordered systems — x86, SPARC TSO, IBM mainframe, etc. — release-acquire ordering is automatic for the majority of operations.

We know that on modern hardware, both the compiler and the CPU may reorder our memory accesses in unpredictable ways. To address that, we can constrain their optimizations. In the kernel’s atomic API, the constraints come in four levels:

- relaxed (the _relaxed suffix): no ordering guarantees, only atomicity
- acquire (the _acquire suffix): later accesses cannot be reordered before the operation
- release (the _release suffix): earlier accesses cannot be reordered after the operation
- fully ordered (no suffix): a full barrier around the operation

That’s my imprecise and irresponsible summary. Many authors have discussed memory ordering extensively, but it is beyond the scope of this article.
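As a small illustration (a sketch, not from the article’s code path), here is the same read-modify-write in the four ordering flavours declared in include/linux/atomic.h:

static void ordering_flavours_demo(void)
{
	atomic_t v = ATOMIC_INIT(0);

	atomic_cmpxchg_relaxed(&v, 0, 1);	/* atomic, but no ordering      */
	atomic_cmpxchg_acquire(&v, 0, 1);	/* later accesses stay after it */
	atomic_cmpxchg_release(&v, 0, 1);	/* earlier accesses stay before */
	atomic_cmpxchg(&v, 0, 1);		/* fully ordered                */
}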

Locking - the slow path

Now that we know the lock is held by someone else, our CPU is going to spin and wait until the lock becomes available. The Linux kernel implements an optimized MCS locking algorithm which adds a pending bit (actually a byte) to the lock word. If the pending bit is set, the current waiter has only the lock holder ahead of it and is not queued; only when a third competitor shows up is a queue actually constructed. Because that case is rarer, the common case avoids touching the per-CPU queue nodes and the extra cacheline misses they bring.

So this function has two execution flows: the quicker slow path, and the slower slow path.

The quicker slow path

As we enter the slow path, either the lock is held, or it has been claimed (pending) by another thread that hasn’t finished acquiring it yet.

We denote the lock state by a triple (queue tail, pending bit, lock value).
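For orientation, here is roughly how that triple maps onto the 32-bit lock word (sketched from include/asm-generic/qspinlock_types.h, assuming the common configuration where the pending field is a full byte, i.e. NR_CPUS < 16K):

/*
 *   bits  0- 7 : locked byte
 *   bits  8-15 : pending byte
 *   bits 16-17 : tail index (which of the 4 per-CPU qnodes)
 *   bits 18-31 : tail CPU   (cpu number + 1; 0 means no queued waiters)
 */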

/*
 * Wait for in-progress pending->locked hand-overs with a bounded
 * number of spins so that we guarantee forward progress.
 *
 * 0,1,0 -> 0,0,1
 */
if (val == _Q_PENDING_VAL) {
	int cnt = _Q_PENDING_LOOPS;
	val = atomic_cond_read_relaxed(&lock->val,
				       (VAL != _Q_PENDING_VAL) || !cnt--);
}

(0, 0, 0) is the fast path. The slow path starts with (0, 1, 0), where the lock has just been released by its old owner, but the currently pending waiter (not us) hasn’t fully acquired it yet. In that case we have to wait until it takes the lock, after which the state becomes (0, 0, 1).

if (val & ~_Q_LOCKED_MASK)
	goto queue;

Then we look at the lock value again, checking the pending and queue-tail bits. They are set if someone else is already waiting for the lock; in that case our CPU is at least the third contender, so we must queue. That’s the slower slow path. For now, let’s assume we get to be the pending waiter, so we can go through the quicker slow path.

val = queued_fetch_set_pending_acquire(lock);

That function atomically sets the pending bit of the lock and returns the old lock value. We then check that old value: if other contenders have unfortunately appeared in the meantime, our CPU has no choice but to undo the pending bit (if we were the one who set it) and join the queue, as shown in the snippet following the sketch below.
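On architectures without their own implementation, the helper is roughly this; a sketch based on the generic version in kernel/locking/qspinlock.c (x86 provides its own optimized variant):

static __always_inline u32 queued_fetch_set_pending_acquire(struct qspinlock *lock)
{
	/* atomically: old = lock->val; lock->val |= _Q_PENDING_VAL; return old */
	return atomic_fetch_or_acquire(_Q_PENDING_VAL, &lock->val);
}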


/*
 * If we observe contention, there is a concurrent locker.
 *
 * Undo and queue; our setting of PENDING might have made the
 * n,0,0 -> 0,0,0 transition fail and it will now be waiting
 * on @next to become !NULL.
 */
if (unlikely(val & ~_Q_LOCKED_MASK)) {

	/* Undo PENDING if we set it. */
	if (!(val & _Q_PENDING_MASK))
		clear_pending(lock);

	goto queue;
}

After that, our CPU simply spins until the current owner releases the lock, then takes it:

/*
 * We're pending, wait for the owner to go away.
 *
 * 0,1,1 -> *,1,0
 *
 * this wait loop must be a load-acquire such that we match the
 * store-release that clears the locked bit and create lock
 * sequentiality; this is because not all
 * clear_pending_set_locked() implementations imply full
 * barriers.
 */
if (val & _Q_LOCKED_MASK)
	smp_cond_load_acquire(&lock->locked, !VAL);

/*
 * take ownership and clear the pending bit.
 *
 * 0,1,0 -> 0,0,1
 */
clear_pending_set_locked(lock);
lockevent_inc(lock_pending);
return;

/*
 * End of pending bit optimistic spinning and beginning of MCS
 * queuing.
 */

The slower slow path

When the lock is busy, the waiters are arranged in a queue. The queue nodes are statically allocated per CPU. Here’s the definition:


// kernel/locking/mcs_spinlock.h
struct mcs_spinlock {
	struct mcs_spinlock *next;
	int locked; /* 1 if lock acquired */
	int count;  /* nesting count, see qspinlock.c */
};

// kernel/locking/qspinlock.c
struct qnode {
	struct mcs_spinlock mcs;
#ifdef CONFIG_PARAVIRT_SPINLOCKS
	long reserved[2];
#endif
};

qnode is a thin wrapper around mcs_spinlock. mcs_spinlock contains a pointer to the next node in the queue, a locked flag, and a nesting count.


#define MAX_NODES 4

static DEFINE_PER_CPU_ALIGNED(struct qnode, qnodes[MAX_NODES]);

Every CPU has four qnodes, because there are four contexts in which control flow on one CPU can be nested: task, softirq, hardirq, and NMI.

Our CPU needs to know which node it will use and encode its position into the tail:

node = this_cpu_ptr(&qnodes[0].mcs);
idx = node->count++;
tail = encode_tail(smp_processor_id(), idx);
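encode_tail() packs the CPU number and the node index into the tail bits; in kernel/locking/qspinlock.c it looks roughly like this (check your tree for the exact form):

static inline __pure u32 encode_tail(int cpu, int idx)
{
	u32 tail;

	tail  = (cpu + 1) << _Q_TAIL_CPU_OFFSET;
	tail |= idx << _Q_TAIL_IDX_OFFSET; /* assume < 4 */

	return tail;
}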

If one CPU is already nested four levels deep (it has handed out all of its qnodes), qspinlock falls back to plainly spinning on the lock:

if (unlikely(idx >= MAX_NODES)) {
	lockevent_inc(lock_no_node);
	while (!queued_spin_trylock(lock))
		cpu_relax();
	goto release;
}

After that, our CPU grabs the chosen node and initializes it for enqueueing.

node = grab_mcs_node(node, idx);

/*
* Keep counts of non-zero index values:
*/
lockevent_cond_inc(lock_use_node2 + idx - 1, idx);

/*
* Ensure that we increment the head node->count before initialising
* the actual node. If the compiler is kind enough to reorder these
* stores, then an IRQ could overwrite our assignments.
*/
barrier();

node->locked = 0;
node->next = NULL;
pv_init_node(node);

Quite some time has passed by now. Meanwhile the lock may have been released, so we probe again to see whether we can grab it right away:

if (queued_spin_trylock(lock))
	goto release;

Now that our CPU has a node ready, the next step is the actual enqueue.

/*
* Publish the updated tail.
* We have already touched the queueing cacheline; don't bother with
* pending stuff.
*
* p,*,* -> n,*,*
*/
old = xchg_tail(lock, tail);
next = NULL;

We swap our new tail id into the lock, getting back the old tail id. That old id is decoded to find the previous queue node: its owner is our predecessor in the queue, and when it is done it will find us through the next pointer and set our node’s locked flag, releasing us from the spin. A sketch of xchg_tail() follows.
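For reference, when the pending field is a full byte the tail swap is a simple 16-bit exchange; the version in kernel/locking/qspinlock.c is roughly:

static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
{
	/*
	 * Relaxed semantics are fine here: the caller has already
	 * initialized the MCS node before publishing the new tail.
	 */
	return (u32)xchg_relaxed(&lock->tail,
				 tail >> _Q_TAIL_OFFSET) << _Q_TAIL_OFFSET;
}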

Then the waiting starts. The spin loop is hidden inside the macro arch_mcs_spin_lock_contended, which guarantees that once it returns, our node is at the head of the queue.


/*
 * if there was a previous node; link it and wait until reaching the
 * head of the waitqueue.
 */
if (old & _Q_TAIL_MASK) {
	prev = decode_tail(old);

	/* Link @node into the waitqueue. */
	WRITE_ONCE(prev->next, node);

	pv_wait_node(node, prev);
	arch_mcs_spin_lock_contended(&node->locked);

	/*
	 * While waiting for the MCS lock, the next pointer may have
	 * been set by another lock waiter. We optimistically load
	 * the next pointer & prefetch the cacheline for writing
	 * to reduce latency in the upcoming MCS unlock operation.
	 */
	next = READ_ONCE(node->next);
	if (next)
		prefetchw(next);
}
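The generic definition of that spin is short; kernel/locking/mcs_spinlock.h defines it roughly as follows (architectures may override it):

#define arch_mcs_spin_lock_contended(l)					\
do {									\
	/* spin until the predecessor sets our locked flag (acquire) */	\
	smp_cond_load_acquire(l, VAL);					\
} while (0)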


Wait until the lock itself is free:

val = atomic_cond_read_acquire(&lock->val, !(VAL & _Q_LOCKED_PENDING_MASK));

locked:
	/*
	 * claim the lock:
	 *
	 * n,0,0 -> 0,0,1 : lock, uncontended
	 * *,*,0 -> *,*,1 : lock, contended
	 *
	 * If the queue head is the only one in the queue (lock value == tail)
	 * and nobody is pending, clear the tail code and grab the lock.
	 * Otherwise, we only need to grab the lock.
	 */

	/*
	 * In the PV case we might already have _Q_LOCKED_VAL set, because
	 * of lock stealing; therefore we must also allow:
	 *
	 * n,0,1 -> 0,0,1
	 *
	 * Note: at this point: (val & _Q_PENDING_MASK) == 0, because of the
	 * above wait condition, therefore any concurrent setting of
	 * PENDING will make the uncontended transition fail.
	 */
	if ((val & _Q_TAIL_MASK) == tail) {
		if (atomic_try_cmpxchg_relaxed(&lock->val, &val, _Q_LOCKED_VAL))
			goto release; /* No contention */
	}

	/*
	 * Either somebody is queued behind us or _Q_PENDING_VAL got set
	 * which will then detect the remaining tail and queue behind us
	 * ensuring we'll see a @next.
	 */
	set_locked(lock);

	/*
	 * contended path; wait for next if not observed yet, release.
	 */
	if (!next)
		next = smp_cond_load_relaxed(&node->next, (VAL));

	arch_mcs_spin_unlock_contended(&next->locked);
	pv_kick_node(lock, next);

release:
	trace_contention_end(lock, 0);

	/*
	 * release the node
	 */
	__this_cpu_dec(qnodes[0].mcs.count);

Conclusion

Hacking on Linux is more tiring than I thought. Maybe later articles in this series will deal with the release path. The locking part alone has already demanded a heady, rigorous look at the hardware/software interface and at the trade-off between implementation simplicity and optimization.

The next article will not look at such complicated code, of course.