flat assembler
Message board for the users of flat assembler.

Tutorials and Examples > Lock-free singly-linked list and thread creation in Linux

redsock
Joined: 09 Oct 2009
Posts: 430
Location: Australia
Posted: 15 Nov 2022, 23:34
As a follow-up to the previous thread about Creating threads in Linux, I wanted to show a working example of the stack-space recycling technique.

Any of you who have worked with lock-free concurrency will be familiar with the ABA problem. This implementation of a lock-free list avoids it by keeping a version counter alongside the list's head and using CMPXCHG16B to update the pair atomically.
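
To make the failure mode concrete, here is a quick sketch (mine, not part of the original post) of what a naive pop using a plain 8-byte LOCK CMPXCHG would look like, with comments marking where ABA corrupts the list when there is no version counter:
Code:
; SKETCH ONLY: naive pop without the version counter, to illustrate ABA
naive_pop:
        mov     rax, [stacks]           ; rax = current head, call it A
        test    rax, rax
        jz      .empty
        mov     rcx, [rax]              ; rcx = A's next pointer, call it B
        ; if we are preempted here, other threads can pop A, pop B and push
        ; A back: the head is A again, but B may now belong to someone else
lock    cmpxchg [stacks], rcx           ; still succeeds, because head == A...
        jnz     naive_pop
        ret                             ; ...and the possibly-in-use B becomes the head
.empty:
        ret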

Note that thread creation uses the same get_or_create_stack function; each thread then performs 1,000,000 get_or_create_stack calls, uses the block returned, and puts the acquired block back into the list by calling return_stack.

This of course represents the absolute worst-case contention, because every single thread does little more than thrash on the stack/singly-linked list itself. Still, I thought it would be a very useful educational example :)

This code compiles as-is, without my HeavyThing library, into a 1408-byte x64 binary:
Code:
; config
start_threads = 8
loop_iterations = 1000000
stack_size = 2097152

; syscalls
syscall_write = 1
syscall_mmap = 9
syscall_clone = 56
syscall_exit = 60
syscall_exit_group = 231

format  ELF64

; sections
section '.data' writeable align 16
        stacks          dq      0
        stacks_version  dq      0
        threads         dd      0

section '.text' executable align 16

public get_or_create_stack
get_or_create_stack:
        ; thread safe get or create a stack
        push    rbx
        mov     rax, [stacks]
        mov     rdx, [stacks_version]
        test    rax, rax
        jz      .new_stack
        lea     rcx, [rdx+1]
        mov     rbx, [rax]
lock    cmpxchg16b dqword [stacks]
        pop     rbx
        jnz     get_or_create_stack
        ret
.new_stack:
        pop     rbx
        mov     eax, syscall_mmap
        xor     edi, edi                ; addr = NULL, let the kernel choose
        mov     esi, stack_size         ; length = 2MB
        mov     edx, 0x3                ; PROT_READ or PROT_WRITE
        mov     r10d, 0x22              ; MAP_PRIVATE or MAP_ANONYMOUS
        mov     r8, -1                  ; fd = -1 (anonymous mapping)
        xor     r9d, r9d                ; offset = 0
        syscall
        mov     rdi, .error_mmap
        test    rax, rax
        jl      .error
        ret
.error:
        mov     eax, syscall_write
        mov     edi, 2
        mov     rsi, .error_mmap
        mov     edx, .error_mmap_len
        syscall
        mov     eax, syscall_exit_group
        mov     edi, 1
        syscall
.error_mmap:
        db 'mmap failed',10
.error_mmap_len = $ - .error_mmap


public return_stack
align 16
return_stack:
        ; thread safe push a stack
        push    rbx
        mov     rax, [stacks]
        mov     rdx, [stacks_version]
        mov     rbx, rdi
        lea     rcx, [rdx+1]
        mov     [rdi], rax
lock    cmpxchg16b dqword [stacks]
        pop     rbx
        jnz     return_stack
        ret

public thread
align 16
thread:
        ; per-thread function, also called by the main thread
lock    add     [threads], 1
        sub     rsp, 32
        mov     r8, qword [.hello1]
        mov     r9, qword [.hello2]
        mov     r10, qword [.hello3]
        mov     [rsp], r8
        mov     [rsp+8], r9
        mov     [rsp+16], r10
        lea     rsi, [rsp+20]
.display_thread_number:
        mov     eax, edi
        and     eax, 0xf
        shr     edi, 4
        mov     al, [.hexchars+rax]
        mov     [rsi], al
        sub     rsi, 1
        test    edi, edi
        jnz     .display_thread_number
        mov     eax, syscall_write
        mov     edi, 1
        mov     rsi, rsp
        mov     edx, 22
        syscall
        add     rsp, 32

        mov     ebx, loop_iterations
.loop:
        call    get_or_create_stack

        ; use the block returned
        mov     [rax], r12d
        mov     [rax+4096], r12d
        mov     [rax+8192], r12d

        mov     rdi, rax
        call    return_stack

        ; complete, do next iteration
        sub     ebx, 1
        jnz     .loop

lock    sub     [threads], 1
        jz      .exit_group
        ; just exit this thread
        mov     eax, syscall_exit
        xor     edi, edi
        syscall
        ; not reached

.exit_group:
        mov     eax, syscall_write
        mov     edi, 1
        mov     rsi, .complete
        mov     edx, .complete_len
        syscall

        mov     eax, syscall_exit_group
        xor     edi, edi
        syscall
        ; not reached
.complete:
        db 'test complete',10
.complete_len = $ - .complete
.hello1: db 'hello fr'
.hello2: db 'om threa'
.hello3: db 'd #  ',10
.hexchars: db '0123456789abcdef'

clone_vm = 0x00000100
clone_fs = 0x00000200
clone_files = 0x00000400
clone_sighand = 0x00000800
clone_thread = 0x00010000

public _start
align 16
_start:
        mov     r15d, start_threads - 1

if start_threads > 1
.newthread:
        call    get_or_create_stack
        xor     r10d, r10d              ; child_tid = NULL
        xor     r8d, r8d                ; tls = 0
        lea     rsi, [rax+stack_size-16]        ; top of the new thread's stack
        mov     qword [rsi], thread             ; child will pop: entry point...
        mov     qword [rsi+8], r15              ; ...and its thread number
        mov     edi, clone_vm or clone_fs or clone_files or clone_sighand or clone_thread
        mov     eax, syscall_clone
        xor     edx, edx                ; parent_tid = NULL
        syscall
        test    rax, rax
        jz      .inchild
        sub     r15d, 1
        jnz     .newthread
end if
        mov     edi, start_threads
        call    thread
        ; not reached
.inchild:
        xor     ebp, ebp                ; clean frame pointer for the new thread
        pop     rax                     ; entry point placed on the stack by the parent
        pop     rdi                     ; this thread's number
        call    rax
        ; not reached
    
Again, this is literally worst-case contention on the list. Running this example here on my relatively powerful AMD gives me:
Code:
perf stat ./recycle_test
hello from thread # 7
hello from thread # 6
hello from thread # 5
hello from thread # 4
hello from thread # 3
hello from thread # 2
hello from thread # 8
hello from thread # 1
test complete

 Performance counter stats for './recycle_test':

           4264.86 msec task-clock                #    6.779 CPUs utilized          
               627      context-switches          #  147.015 /sec                   
                42      cpu-migrations            #    9.848 /sec                   
                42      page-faults               #    9.848 /sec                   
       17256394688      cycles                    #    4.046 GHz                      (83.22%)
          34683453      stalled-cycles-frontend   #    0.20% frontend cycles idle     (83.32%)
       15629334883      stalled-cycles-backend    #   90.57% backend cycles idle      (83.32%)
         366542327      instructions              #    0.02  insn per cycle         
                                                  #   42.64  stalled cycles per insn  (83.31%)
          89743403      branches                  #   21.043 M/sec                    (83.56%)
           4860424      branch-misses             #    5.42% of all branches          (83.50%)

       0.629139363 seconds time elapsed

       4.253474000 seconds user
       0.000000000 seconds sys    
Notice the crazy high stalled-cycles-backend :) Here we have 8 threads running, for a total of 8M get_or_create_stack calls and 8M return_stack calls. Each thread runs exactly 1,000,000 iterations and then exits. The last thread to finish does a full exit_group.

If we consider what this is actually doing, these results are pretty impressive. Changing the number of threads to 1 so that the list is never contended is of course MUCH faster:
Code:
perf stat ./recycle_test
hello from thread # 1
test complete

 Performance counter stats for './recycle_test':

             12.06 msec task-clock                #    0.964 CPUs utilized          
                 1      context-switches          #   82.895 /sec                   
                 0      cpu-migrations            #    0.000 /sec                   
                 7      page-faults               #  580.267 /sec                   
          38502465      cycles                    #    3.192 GHz                      (66.96%)
             31103      stalled-cycles-frontend   #    0.08% frontend cycles idle     (66.85%)
          32777932      stalled-cycles-backend    #   85.13% backend cycles idle      (97.48%)
          29202501      instructions              #    0.76  insn per cycle         
                                                  #    1.12  stalled cycles per insn
           8040810      branches                  #  666.545 M/sec                  
              1036      branch-misses             #    0.01% of all branches          (68.71%)

       0.012513834 seconds time elapsed

       0.009507000 seconds user
       0.003169000 seconds sys    
And, as I concluded in the previous thread, if we leave the thread count at 1 and populate every single page of the 2MB mmap per iteration using this code:
Code:
        mov     rsi, rax
        mov     ecx, 512
.populate_every_page:
        mov     [rsi], ecx
        add     rsi, 4096
        sub     ecx, 1
        jnz     .populate_every_page    
we get a radically lower time than the previous thread's mmap/munmap-per-iteration approach:
Code:
perf stat ./recycle_test
hello from thread # 1
test complete

 Performance counter stats for './recycle_test':

           3642.29 msec task-clock                #    0.999 CPUs utilized          
               359      context-switches          #   98.564 /sec                   
                 8      cpu-migrations            #    2.196 /sec                   
               515      page-faults               #  141.395 /sec                   
       15331808242      cycles                    #    4.209 GHz                      (83.32%)
         208300085      stalled-cycles-frontend   #    1.36% frontend cycles idle     (83.31%)
        5716694902      stalled-cycles-backend    #   37.29% backend cycles idle      (83.32%)
        2085348198      instructions              #    0.14  insn per cycle         
                                                  #    2.74  stalled cycles per insn  (83.31%)
         521544823      branches                  #  143.191 M/sec                    (83.41%)
           1034367      branch-misses             #    0.20% of all branches          (83.33%)

       3.644328837 seconds time elapsed

       3.642627000 seconds user
       0.000000000 seconds sys    
3.6 seconds down from 276.9 seconds, nice. Now, if we increase the thread count to 8 and also let it continue to populate every single page per iteration, we finally arrive at:
Code:
perf stat ./recycle_test
hello from thread # 7
hello from thread # 6
hello from thread # 5
hello from thread # 4
hello from thread # 3
hello from thread # 2
hello from thread # 8
hello from thread # 1
test complete

 Performance counter stats for './recycle_test':

          35170.66 msec task-clock                #    7.362 CPUs utilized          
              4518      context-switches          #  128.459 /sec                   
               313      cpu-migrations            #    8.899 /sec                   
              1875      page-faults               #   53.311 /sec                   
      143146340094      cycles                    #    4.070 GHz                      (83.31%)
        1588628589      stalled-cycles-frontend   #    1.11% frontend cycles idle     (83.30%)
      104722605224      stalled-cycles-backend    #   73.16% backend cycles idle      (83.34%)
       16707883896      instructions              #    0.12  insn per cycle         
                                                  #    6.27  stalled cycles per insn  (83.36%)
        4180198704      branches                  #  118.855 M/sec                    (83.38%)
           8971840      branch-misses             #    0.21% of all branches          (83.34%)

       4.777462497 seconds time elapsed

      35.154405000 seconds user
       0.000000000 seconds sys    
Hopefully an educational and useful example of how this all works. :)

Cheers and all the best

_________________
2 Ton Digital - https://2ton.com.au/
bitRAKE
Joined: 21 Jul 2003
Posts: 4020
Location: vpcmpistri
Posted: 16 Nov 2022, 16:39
Excellent example. CMPXCHG16B updates RDX:RAX on failure - so the LOCK loop is re-reading the head and version needlessly.
Code:
return_stack:
        ; thread safe push the stack passed in RDI
        push    rbx
        mov     rbx, rdi
        mov     rax, [stacks]
        mov     rdx, [stacks_version]
@@:     mov     [rbx], rax
        lea     rcx, [rdx+1]
        lock cmpxchg16b dqword [stacks]
        jnz     @B
        pop     rbx
        retn    
Edit: the version is needed; I was forgetting the case of a stack being popped, used, and pushed back before the CMPXCHG16B completes. Oh, that is the ABA problem. :oops:
redsock
Joined: 09 Oct 2009
Posts: 430
Location: Australia
Posted: 16 Nov 2022, 20:00
bitRAKE wrote:
Oh, that is the ABA problem. :oops:
Yes, this is indeed the ABA problem. If we didn't need actual stack (LIFO) semantics, and instead wanted to make a concurrent queue out of it, the CMPXCHG16B and the incrementing version wouldn't be necessary: a LOCK CMPXCHG on push, and then an XCHG with NULL to pull the entire queue, makes things a bit easier. But that wouldn't work for my stack-based example, as it really isn't a queue :)
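
A minimal sketch of that queue variant (my rendering, not posted code; queue is a hypothetical qword head, and nodes link through their first qword):
Code:
; SKETCH ONLY: push via LOCK CMPXCHG, drain via XCHG with NULL
queue_push:
        ; rdi = node to push
        mov     rax, [queue]
.retry:
        mov     [rdi], rax              ; node's next = the head we observed
lock    cmpxchg [queue], rdi            ; install node if head unchanged
        jnz     .retry                  ; on failure rax already holds the new head
        ret

queue_drain:
        xor     eax, eax
        xchg    rax, [queue]            ; atomically take the whole chain, leave NULL
        ret                             ; rax = chain of nodes (LIFO order), or NULL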

_________________
2 Ton Digital - https://2ton.com.au/
Furs
Joined: 04 Mar 2016
Posts: 2493
Posted: 17 Nov 2022, 14:07
There is still no need to re-read the data, since on failure you already have the current values in rdx:rax - the version included. So you can jump back to "test rax, rax" instead.
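
In other words (my sketch of that suggestion, not posted code), the retry in get_or_create_stack can branch back to the emptiness test, since a failed CMPXCHG16B has already reloaded RDX:RAX from memory:
Code:
get_or_create_stack:
        ; thread safe get or create a stack, retrying without a re-read
        push    rbx
        mov     rax, [stacks]
        mov     rdx, [stacks_version]
.retry:
        test    rax, rax
        jz      .new_stack
        lea     rcx, [rdx+1]
        mov     rbx, [rax]
lock    cmpxchg16b dqword [stacks]
        jnz     .retry                  ; rdx:rax already hold the fresh head/version
        pop     rbx
        ret
.new_stack:
        pop     rbx
        ; ... mmap path exactly as in the original listing ...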

redsock wrote:
bitRAKE wrote:
Oh, that is the ABA problem. :oops:
Yes, this is indeed the ABA problem. If we didn't need actual stack (LIFO) semantics, and instead wanted to make a concurrent queue out of it, the CMPXCHG16B and the incrementing version wouldn't be necessary: a LOCK CMPXCHG on push, and then an XCHG with NULL to pull the entire queue, makes things a bit easier. But that wouldn't work for my stack-based example, as it really isn't a queue :)
For a concurrent/async queue, I use two XCHG instead. Pulling is the same as yours, but on push I do:
Code:
    mov rdx, rax        ; rdx = node being pushed (passed in rax)
    xchg rax, [queue]   ; atomically publish it as the new head
    mov [rdx], rax      ; next = previous queue head
Yes, there is a race condition here: another thread can pull the queue before the "next" field is populated. But the window is so rare, and literally only one instruction wide, that I just spin in the consuming thread as long as "next" is uninitialized (I use a sentinel value, like -1, to mean uninitialized as opposed to end-of-list). That almost always finishes quickly, since the producing thread only needs one more instruction.

Worst case, the producer is preempted in that window, so I only spin something like 127 times, then yield, then repeat the spin. Again, this situation is far too rare to matter, and not having to loop over and over (with cmpxchg) while the head is contended by multiple writers is a huge bonus. It's still 100% correct though.

If you benchmark it, of course, that situation will most likely never happen, since it's incredibly rare. Still, it's at least theoretically possible, so the spin and yield are there.
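
For illustration, a sketch of the consumer side being described (my reconstruction with assumed names, not posted code; it assumes each node's next field is set to the -1 sentinel before the publishing XCHG):
Code:
; SKETCH ONLY: walking a drained chain when a node may still be half-pushed
sys_sched_yield = 24

next_of:
        ; rdi = node, returns its next pointer in rax,
        ; waiting while it is still the -1 "uninitialized" sentinel
        mov     ecx, 127                ; bounded spin before yielding
.spin:
        mov     rax, [rdi]
        cmp     rax, -1
        jne     .done                   ; populated: either a node or NULL (end of list)
        pause
        sub     ecx, 1
        jnz     .spin
        mov     eax, sys_sched_yield    ; producer likely preempted mid-push: yield
        syscall                         ; (clobbers only rcx/r11, rdi survives)
        jmp     next_of
.done:
        ret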