flat assembler
Message board for the users of flat assembler.
Tutorials and Examples > Lock-free singly-linked list and thread creation in Linux
redsock 15 Nov 2022, 23:34
As a follow-up to the previous thread about Creating threads in Linux, I wanted to show a working example of the stack-space recycling technique.
Any of you who have worked with lock-free concurrency will be familiar with the ABA problem. This implementation of a lock-free list avoids the ABA problem by keeping a version counter alongside the list's head and using CMPXCHG16B to do the deed.

Note that during thread creation the same get_or_create_stack function is used, and then each thread performs 1,000,000 get_or_create_stack calls, each followed by using the block returned and then putting the acquired block back into the list by calling return_stack. This of course represents the absolute worst-case contention, because every single thread is really only thrashing on the stack/singly-linked list itself. Still, I thought it would be a very useful educational example.

This code compiles as-is, without my HeavyThing library, into a 1408 byte x64 binary:

Code:
; config
start_threads   = 8
loop_iterations = 1000000
stack_size      = 2097152

; syscalls
syscall_write      = 1
syscall_mmap       = 9
syscall_clone      = 56
syscall_exit       = 60
syscall_exit_group = 231

format ELF64

; sections
section '.data' writeable

align 16
stacks          dq 0
stacks_version  dq 0
threads         dd 0

section '.text' executable

align 16
public get_or_create_stack
get_or_create_stack:
        ; thread safe get or create a stack
        push    rbx
        mov     rax, [stacks]
        mov     rdx, [stacks_version]
        test    rax, rax
        jz      .new_stack
        lea     rcx, [rdx+1]
        mov     rbx, [rax]
        lock cmpxchg16b dqword [stacks]
        pop     rbx
        jnz     get_or_create_stack
        ret
.new_stack:
        pop     rbx
        mov     eax, syscall_mmap
        xor     edi, edi
        mov     esi, stack_size
        mov     edx, 0x3
        mov     r10d, 0x22
        mov     r8, -1
        xor     r9d, r9d
        syscall
        mov     rdi, .error_mmap
        test    rax, rax
        jl      .error
        ret
.error:
        mov     eax, syscall_write
        mov     edi, 2
        mov     rsi, .error_mmap
        mov     edx, .error_mmap_len
        syscall
        mov     eax, syscall_exit_group
        mov     edi, 1
        syscall
.error_mmap:
        db      'mmap failed',10
.error_mmap_len = $ - .error_mmap

public return_stack
align 16
return_stack:
        ; thread safe push a stack
        push    rbx
        mov     rax, [stacks]
        mov     rdx, [stacks_version]
        mov     rbx, rdi
        lea     rcx, [rdx+1]
        mov     [rdi], rax
        lock cmpxchg16b dqword [stacks]
        pop     rbx
        jnz     return_stack
        ret

public thread
align 16
thread:
        ; per-thread function, also called by the main thread
        lock add [threads], 1
        sub     rsp, 32
        mov     r8, qword [.hello1]
        mov     r9, qword [.hello2]
        mov     r10, qword [.hello3]
        mov     [rsp], r8
        mov     [rsp+8], r9
        mov     [rsp+16], r10
        lea     rsi, [rsp+20]
.display_thread_number:
        mov     eax, edi
        and     eax, 0xf
        shr     edi, 4
        mov     al, [.hexchars+rax]
        mov     [rsi], al
        sub     rsi, 1
        test    edi, edi
        jnz     .display_thread_number
        mov     eax, syscall_write
        mov     edi, 1
        mov     rsi, rsp
        mov     edx, 22
        syscall
        add     rsp, 32
        mov     ebx, loop_iterations
.loop:
        call    get_or_create_stack
        ; use the block returned
        mov     [rax], r12d
        mov     [rax+4096], r12d
        mov     [rax+8192], r12d
        mov     rdi, rax
        call    return_stack
        ; complete, do next iteration
        sub     ebx, 1
        jnz     .loop
        lock sub [threads], 1
        jz      .exit_group
        ; just exit this thread
        mov     eax, syscall_exit
        xor     edi, edi
        syscall
        ; not reached
.exit_group:
        mov     eax, syscall_write
        mov     edi, 1
        mov     rsi, .complete
        mov     edx, .complete_len
        syscall
        mov     eax, syscall_exit_group
        xor     edi, edi
        syscall
        ; not reached
.complete:
        db      'test complete',10
.complete_len = $ - .complete
.hello1:
        db      'hello fr'
.hello2:
        db      'om threa'
.hello3:
        db      'd # ',10
.hexchars:
        db      '0123456789abcdef'

clone_vm      = 0x00000100
clone_fs      = 0x00000200
clone_files   = 0x00000400
clone_sighand = 0x00000800
clone_thread  = 0x00010000

public _start
align 16
_start:
        mov     r15d, start_threads - 1
if start_threads > 1
.newthread:
        call    get_or_create_stack
        xor     r10d, r10d
        xor     r8d, r8d
        lea     rsi, [rax+stack_size-16]
        mov     qword [rsi], thread
        mov     qword [rsi+8], r15
        mov     edi, clone_vm or clone_fs or clone_files or clone_sighand or clone_thread
        mov     eax, syscall_clone
        xor     edx, edx
        syscall
        test    rax, rax
        jz      .inchild
        sub     r15d, 1
        jnz     .newthread
end if
        mov     edi, start_threads
        call    thread
        ; not reached
.inchild:
        xor     ebp, ebp
        pop     rax
        pop     rdi
        call    rax
        ; not reached
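As a rough C analogue of the two list primitives above (a sketch only; the names pop_block, push_block and vhead are invented for illustration), the same pop and push on a {head, version} pair can be written with GCC/Clang's generic __atomic builtins. Depending on the compiler and -mcx16, the 16-byte compare-exchange either inlines lock cmpxchg16b directly or goes through libatomic, which itself uses that instruction on x86-64:

Code:
/* Sketch of what get_or_create_stack's fast path and return_stack do.
   Illustration only; build with e.g. gcc -O2 -mcx16 (add -latomic if
   your toolchain asks for it). */
#include <stdint.h>
#include <stddef.h>

struct node  { struct node *next; };   /* first 8 bytes of a free block */

struct vhead {                         /* stacks + stacks_version */
    struct node *head;
    uint64_t     ver;                  /* bumped on every update (anti-ABA) */
};
static _Alignas(16) struct vhead free_stacks;

/* pop one recycled block, or NULL so the caller can mmap a fresh one */
static struct node *pop_block(void)
{
    struct vhead cur, wanted;
    for (;;) {
        __atomic_load(&free_stacks, &cur, __ATOMIC_ACQUIRE);  /* mov rax/rdx */
        if (cur.head == NULL)
            return NULL;                                      /* .new_stack path */
        wanted.head = cur.head->next;                         /* mov rbx, [rax] */
        wanted.ver  = cur.ver + 1;                            /* lea rcx, [rdx+1] */
        if (__atomic_compare_exchange(&free_stacks, &cur, &wanted, 0,
                                      __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE))
            return cur.head;
        /* lost the race: retry from the load, like jnz get_or_create_stack */
    }
}

/* push a block back, like return_stack */
static void push_block(struct node *blk)
{
    struct vhead cur, wanted;
    for (;;) {
        __atomic_load(&free_stacks, &cur, __ATOMIC_ACQUIRE);
        blk->next   = cur.head;                               /* mov [rdi], rax */
        wanted.head = blk;
        wanted.ver  = cur.ver + 1;
        if (__atomic_compare_exchange(&free_stacks, &cur, &wanted, 0,
                                      __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE))
            return;
    }
}

The version field is what defeats ABA: even if the same block address reappears as head between the snapshot and the compare-exchange, the counter has moved on, so the exchange fails and the loop retries.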
Running it under perf stat with all 8 threads:

Code:
perf stat ./recycle_test
hello from thread # 7
hello from thread # 6
hello from thread # 5
hello from thread # 4
hello from thread # 3
hello from thread # 2
hello from thread # 8
hello from thread # 1
test complete

 Performance counter stats for './recycle_test':

        4264.86 msec task-clock                #    6.779 CPUs utilized
            627      context-switches          #  147.015 /sec
             42      cpu-migrations            #    9.848 /sec
             42      page-faults               #    9.848 /sec
    17256394688      cycles                    #    4.046 GHz                      (83.22%)
       34683453      stalled-cycles-frontend   #    0.20% frontend cycles idle     (83.32%)
    15629334883      stalled-cycles-backend    #   90.57% backend cycles idle      (83.32%)
      366542327      instructions              #    0.02  insn per cycle
                                                #   42.64  stalled cycles per insn  (83.31%)
       89743403      branches                  #   21.043 M/sec                    (83.56%)
        4860424      branch-misses             #    5.42% of all branches          (83.50%)

    0.629139363 seconds time elapsed

    4.253474000 seconds user
    0.000000000 seconds sys

If we consider what this is actually doing, these results are pretty impressive. Changing the number of threads to 1, so that the list is never contended, is of course MUCH faster:

Code:
perf stat ./recycle_test
hello from thread # 1
test complete

 Performance counter stats for './recycle_test':

          12.06 msec task-clock                #    0.964 CPUs utilized
              1      context-switches          #   82.895 /sec
              0      cpu-migrations            #    0.000 /sec
              7      page-faults               #  580.267 /sec
       38502465      cycles                    #    3.192 GHz                      (66.96%)
          31103      stalled-cycles-frontend   #    0.08% frontend cycles idle     (66.85%)
       32777932      stalled-cycles-backend    #   85.13% backend cycles idle      (97.48%)
       29202501      instructions              #    0.76  insn per cycle
                                                #    1.12  stalled cycles per insn
        8040810      branches                  #  666.545 M/sec
           1036      branch-misses             #    0.01% of all branches          (68.71%)

    0.012513834 seconds time elapsed

    0.009507000 seconds user
    0.003169000 seconds sys

The next variant replaces the three per-iteration writes in .loop ("use the block returned") with a loop that touches every 4 KiB page of the 2 MiB block:

Code:
        mov     rsi, rax
        mov     ecx, 512
.populate_every_page:
        mov     [rsi], ecx
        add     rsi, 4096
        sub     ecx, 1
        jnz     .populate_every_page

With 1 thread that looks like this:

Code:
perf stat ./recycle_test
hello from thread # 1
test complete

 Performance counter stats for './recycle_test':

        3642.29 msec task-clock                #    0.999 CPUs utilized
            359      context-switches          #   98.564 /sec
              8      cpu-migrations            #    2.196 /sec
            515      page-faults               #  141.395 /sec
    15331808242      cycles                    #    4.209 GHz                      (83.32%)
      208300085      stalled-cycles-frontend   #    1.36% frontend cycles idle     (83.31%)
     5716694902      stalled-cycles-backend    #   37.29% backend cycles idle      (83.32%)
     2085348198      instructions              #    0.14  insn per cycle
                                                #    2.74  stalled cycles per insn  (83.31%)
      521544823      branches                  #  143.191 M/sec                    (83.41%)
        1034367      branch-misses             #    0.20% of all branches          (83.33%)

    3.644328837 seconds time elapsed

    3.642627000 seconds user
    0.000000000 seconds sys

And with all 8 threads again:

Code:
perf stat ./recycle_test
hello from thread # 7
hello from thread # 6
hello from thread # 5
hello from thread # 4
hello from thread # 3
hello from thread # 2
hello from thread # 8
hello from thread # 1
test complete

 Performance counter stats for './recycle_test':

       35170.66 msec task-clock                #    7.362 CPUs utilized
           4518      context-switches          #  128.459 /sec
            313      cpu-migrations            #    8.899 /sec
           1875      page-faults               #   53.311 /sec
   143146340094      cycles                    #    4.070 GHz                      (83.31%)
     1588628589      stalled-cycles-frontend   #    1.11% frontend cycles idle     (83.30%)
   104722605224      stalled-cycles-backend    #   73.16% backend cycles idle      (83.34%)
    16707883896      instructions              #    0.12  insn per cycle
                                                #    6.27  stalled cycles per insn  (83.36%)
     4180198704      branches                  #  118.855 M/sec                    (83.38%)
        8971840      branch-misses             #    0.21% of all branches          (83.34%)

    4.777462497 seconds time elapsed

   35.154405000 seconds user
    0.000000000 seconds sys
Cheers and all the best
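The thread start in _start/.inchild can also be sketched in C. This is again only an illustration (names and structure are mine, not the post's): an anonymous read/write 2 MiB mapping for the stack, exactly like the .new_stack mmap, and glibc's clone() wrapper with the same CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND | CLONE_THREAD flag set. The wrapper performs the push-function-and-argument, pop-and-call trampoline that .inchild does by hand:

Code:
/* Sketch: one raw-clone thread on a freshly mmap'ed 2 MiB stack.
   The child shares the parent's address space and has no TLS of its
   own, so it sticks to plain syscalls like write(). */
#define _GNU_SOURCE
#include <sched.h>
#include <sys/mman.h>
#include <unistd.h>

#define STACK_SIZE 2097152         /* same stack_size as the assembly */

static int worker(void *arg)
{
    (void)arg;
    static const char msg[] = "hello from a clone()d thread\n";
    write(1, msg, sizeof msg - 1);
    return 0;                      /* returning terminates just this thread */
}

int main(void)
{
    void *stk = mmap(NULL, STACK_SIZE, PROT_READ | PROT_WRITE,   /* 0x3  */
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);        /* 0x22 */
    if (stk == MAP_FAILED)
        return 1;

    /* the stack grows down, so hand clone() the top of the mapping */
    if (clone(worker, (char *)stk + STACK_SIZE,
              CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND | CLONE_THREAD,
              NULL) == -1)
        return 1;

    sleep(1);                      /* crude: main must not exit_group before
                                      the worker has printed */
    return 0;
}

In the assembly the same effect is achieved by storing the function pointer and r15 at [rsi] and [rsi+8] before the clone syscall; the child's .inchild then pops them into rax and rdi and calls through rax.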
redsock 16 Nov 2022, 20:00
bitRAKE wrote:
Oh, that is the ABA problem.
Furs 17 Nov 2022, 14:07
There is still no need to re-read the data, since you already have the values in rdx:rax. You also have the version. So you can jump back to "test rax, rax" instead.
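In C terms (again just a sketch, reusing the invented vhead/node layout from the earlier sketch), this amounts to loading the head/version pair once and letting each failed compare-exchange refresh it, just as a failed CMPXCHG16B leaves the current memory contents in rdx:rax:

Code:
/* sketch: retry without ever re-reading the head/version pair */
#include <stdint.h>
#include <stddef.h>

struct node  { struct node *next; };
struct vhead { struct node *head; uint64_t ver; };

static struct node *pop_block_noreload(struct vhead *list)
{
    struct vhead cur, wanted;
    __atomic_load(list, &cur, __ATOMIC_ACQUIRE);      /* one initial load */
    while (cur.head != NULL) {                        /* the "test rax, rax" */
        wanted.head = cur.head->next;
        wanted.ver  = cur.ver + 1;
        /* on failure the builtin rewrites cur with the value it found,
           so the loop continues with fresh data and no extra load */
        if (__atomic_compare_exchange(list, &cur, &wanted, 0,
                                      __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE))
            return cur.head;
    }
    return NULL;                                      /* empty: mmap a new block */
}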
redsock wrote:
Code:
        mov     rdx, rax
        xchg    rax, [queue]
        mov     [rdx], rax      ; next = previous queue head

Worst case, it's pre-empted, so I only spinlock like 127 times, and then yield, then repeat the spinlock. Again, this situation is way too rare, and the lack of having to loop over and over (with cmpxchg) while the lock is contended by multiple writers is a huge bonus. It's still 100% correct though. If you benchmark it, of course, that situation will most likely never happen, since it's incredibly rare. Still, it's at least theoretically possible, so there's the spinlock and yield.
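For reference, a C rendering of those three instructions (a sketch with invented names; the reader-side spin-and-yield that copes with the brief window is not shown here):

Code:
/* sketch of the xchg-based push: one unconditional swap, no CAS retry loop */
#include <stddef.h>

struct node { struct node *next; };
static struct node *queue;                 /* the [queue] cell above */

static void push_xchg(struct node *n)
{
    /* xchg rax, [queue]: n becomes the new head, the old head comes back */
    struct node *old = __atomic_exchange_n(&queue, n, __ATOMIC_ACQ_REL);
    /* mov [rdx], rax: until this store is visible, n is reachable as head
       while its next pointer is still unset -- the rare case the
       spinlock-and-yield described above has to cover */
    __atomic_store_n(&n->next, old, __ATOMIC_RELEASE);
}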