flat assembler
Message board for the users of flat assembler.

Tutorials and Examples > Creating a thread in Linux
redsock 10 Nov 2022, 19:51
In your getErrNo definition, you are testing the highest bit in rax, which is the same as it being negative. Also, since the clone syscall returns a 32-bit value, this can be simplified (my "createThread" equivalent is a function, hence my return values to the parent):
Code:
	test	eax, eax
	jz	.inchild
	; if jl, then an error occurred with the clone syscall
	; otherwise, return value is child pid and we are in the parent
	ret

My only other thought is: the concept of a guard page works well with contiguous memory areas where you might run off the end undetected. Your use of mmap2 however means that you will just crash if you run off the end, as there'll be no contiguous area for you to overrun (and hence the guard page is moot, since the crash accomplishes the same thing). Cheers and well done on figuring it all out
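For context, here is a minimal sketch of the full calling sequence that this pattern fits into, assuming raw x86-64 syscall conventions (sys_clone = 56); the child_stack_top label and the flag choice are illustrative, not Tycho's actual code:
Code:
	mov	eax, 56			; sys_clone on x86-64
	mov	edi, 0x900		; CLONE_VM or CLONE_SIGHAND (illustrative flags)
	lea	rsi, [child_stack_top]	; hypothetical label: top of the child's stack
	xor	edx, edx		; no parent_tid pointer
	xor	r10d, r10d		; no child_tid pointer
	xor	r8d, r8d		; no TLS
	syscall
	test	eax, eax		; a 32-bit test suffices: TID or -errno fits in 32 bits
	jz	.inchild		; zero: running as the new thread
	jl	.failed			; negative: -errno from the kernel
	ret				; positive: child TID, still in the parent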
I 11 Nov 2022, 07:09
I vaguely remember hunting too, ended up with
Code:
	mov	eax, sys_clone
	mov	rdi, CLONE_VM or CLONE_SIGHAND
	lea	rsi, [TStack+0x2000-8]	; if stack not needed this can be set to zero! -8 for alignment!
	xor	edx, edx
	xor	r10d, r10d
	xor	r8d, r8d
	syscall
	cmp	rax, 0		; returns zero for the child thread and the child PID for the parent
	jz	NewThread

Seems to work okay.
Tycho 11 Nov 2022, 09:36
Thanks for the feedback, redsock! What you said made me think about the getErrNo function and how it is completely useless in this situation. I removed it from the example. I also changed the calling convention of createthread(), resulting in a single ret after clone() without the need for any compare instruction.
The reason why it was the way it was is that I always placed the Windows functions in a wrapper, with the idea of one day porting this project to Linux. The function originally returns 0 on failure, which is what I was simulating here. The wrapper also restores registers, which makes it impossible to share the same return path with the newly created thread.

redsock wrote: test eax, eax

redsock wrote: Also, since the clone syscall returns a 32 bit value

redsock wrote: Your use of mmap2 however means that you will just crash if you run off the end as there'll be no contiguous area for you to overrun

On another note, originally I wanted to use the flag MAP_GROWSDOWN with mmap2, which would make perfect sense for a stack. Sadly a Google search told me this flag does not function well on most kernels.

redsock wrote: I normally don't release thread stackspace, and prefer to recycle them for later threads.

Last edited by Tycho on 11 Nov 2022, 10:26; edited 3 times in total
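If an explicit guard page were still wanted without MAP_GROWSDOWN, one possible approach (a sketch under assumed raw syscall numbers, not code from this thread) is to over-allocate and then revoke access to the lowest page with mprotect, so overrunning the stack bottom always faults instead of silently overwriting data:
Code:
	mov	eax, 9			; sys_mmap
	xor	edi, edi		; let the kernel choose the address
	mov	esi, 0x201000		; 2MB of stack plus one 4KB guard page
	mov	edx, 3			; PROT_READ or PROT_WRITE
	mov	r10d, 0x22		; MAP_PRIVATE or MAP_ANONYMOUS
	mov	r8, -1
	xor	r9d, r9d
	syscall
	mov	rbp, rax		; keep the mapping base
	mov	rdi, rax		; guard page = lowest page of the mapping
	mov	eax, 10			; sys_mprotect
	mov	esi, 0x1000		; one page
	xor	edx, edx		; PROT_NONE: any access here faults
	syscall
	lea	rsi, [rbp+0x201000]	; stack top to hand to clone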
Tycho 11 Nov 2022, 10:03
I wrote: I vaguely remember hunting too, ended up with

Good to know the stack can be set to zero, I didn't know that. Thanks

_________________
--Love Flat Assembler: small, fast and free --
I 11 Nov 2022, 13:01
Not very clear to me, plus I'm a Linux noob.
See https://linux.die.net/man/2/clone

Quote: Another difference for sys_clone is that the child_stack argument may be zero, in which case copy-on-write semantics ensure that the child gets separate copies of stack pages when either process modifies the stack. In this case, for correct operation, the CLONE_VM option should not be specified.
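In syscall terms, what the man page describes might look like the following sketch (flags reduced to SIGCHLD = 17 so the parent can wait() on the child; illustrative only):
Code:
	mov	eax, 56		; sys_clone
	mov	edi, 17		; SIGCHLD only: no CLONE_VM, fork-like behaviour
	xor	esi, esi	; child_stack = 0: child reuses the stack pages copy-on-write
	xor	edx, edx
	xor	r10d, r10d
	xor	r8d, r8d
	syscall
	test	eax, eax
	jz	.in_child	; child gets 0; parent gets the child PID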
bitRAKE 11 Nov 2022, 15:05
Code:
	syscall
	xchg	ecx,eax	; TID is s32, >0
	jrcxz	@F
	retn
@@:	; new thread

_________________
¯\(°_o)/¯ “languages are not safe - uses can be” Bjarne Stroustrup
redsock 11 Nov 2022, 19:49
Tycho wrote: I have my doubts about the return value of clone being a dword. When it fails on my system, it returns a qword (when I used the wrong flag, RAX = FFFFFFFFFFFFFFEA). I would still get the correct errno if I use neg on the dword part, but what if the function actually succeeded and I don't know that because I don't check the correct sign flag?

Tycho wrote: This is because, in my understanding, it is possible that the memory page above the memory allocated for the stack can be allocated by my program as well. This means that if the routine uses more memory than allocated for the stack, it is going to crash OR overwrite other data. I want to be sure it always crashes!

Tycho wrote: This function is so fast that recycling memory usually slows down performance.

A "recycled" 2MB stack space (that is actually sparsely populated thanks to the way mmap works) could be placed into a singly-linked list and accessed via a single userspace-only atomic operation that would be virtually cost-free compared to any syscall/kernel interop. A few years back I did a terminal video comparing various language startups and their relevant syscall counts; you might enjoy the syscall tracing parts of it: https://2ton.com.au/videos/tvs_part1/

Cheers
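As an illustration of that recycling idea (a sketch only, not redsock's actual code; free_stacks is a hypothetical qword holding the list head), the first qword of each retired stack region can serve as the link:
Code:
; push a retired 2MB region onto the freelist, rdi = region base
stack_push:
	mov	rax, [free_stacks]		; current head
.retry:
	mov	[rdi], rax			; region->next = head
	lock cmpxchg [free_stacks], rdi		; swap head to region if head unchanged
	jnz	.retry				; on failure rax already holds the new head
	ret

; pop a region, returns base in rax (0 if the list is empty)
stack_pop:
	mov	rax, [free_stacks]
.retry:
	test	rax, rax
	jz	.empty
	mov	rcx, [rax]			; next link
	lock cmpxchg [free_stacks], rcx
	jnz	.retry
.empty:
	ret

Note that the pop side of such a naive list is exposed to the classic ABA problem; production versions typically pair the head with a version counter (via cmpxchg16b) or only pop under known-quiescent conditions.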
Tycho 12 Nov 2022, 10:05
I wrote: Not very clear to me, plus I'm a Linux noob.

Good that you mention this. Not using CLONE_VM makes the child run in a separate memory space, which means it's like using fork() with a few more tweaks.

_________________
--Love Flat Assembler: small, fast and free --
Tycho 12 Nov 2022, 10:10
redsock wrote: A "recycled" 2MB stack space (that is actually sparsely populated thanks to the way mmap works) could be placed into a singly-linked list and accessed via a single userspace-only atomic operation that would be virtually cost-free compared to any syscall/kernel interop.

redsock wrote: There are many many reasons userspace memory management routines exist (see: tcmalloc, malloc, jemalloc, etc).

redsock wrote: also means that consecutive runs of your program are unlikely to receive the same mappings

redsock wrote: Since all TID/PID values for the linux kernel are < 21 bits in size (most are much smaller), it can safely be treated as a 32 bit value.
redsock 12 Nov 2022, 22:51
You have some fun times ahead of you comparing the various methods. It is clear you haven't done so yet. I'll leave you to find out for yourself for all but the following bit:
Tycho wrote: This might work if only one thread is responsible for creating the new threads, but if multiple threads are able to create a new thread you have a few race conditions to solve. And it is quite known that lock * instructions can bring down performance very much. It's just not crystal clear that your solution is the best way to go, it depends on what the program tries to accomplish and how it's done.

Code:
public _start
falign
_start:
	call	ht$init
	mov	ebx, 1000000
	mov	r12d, 2097152
.loop:
	mov	eax, syscall_mmap
	xor	edi, edi
	mov	esi, r12d
	mov	edx, 0x3
	mov	r10d, 0x22
	mov	r8, -1
	xor	r9d, r9d
	syscall
	mov	rdi, .error_mmap
	test	rax, rax
	jl	.error
	mov	esi, r12d
	mov	rdi, rax
	mov	eax, syscall_munmap
	syscall
	sub	ebx, 1
	jnz	.loop
	mov	rdi, .complete
	call	string$to_stdoutln
	mov	eax, syscall_exit
	xor	edi, edi
	syscall
.error:
	call	string$to_stdoutln
	mov	eax, syscall_exit
	mov	edi, 1
	syscall
cleartext .error_mmap, 'mmap failed'
cleartext .complete, 'test complete'

Code:
perf stat ./mmap_test
test complete

 Performance counter stats for './mmap_test':

            410.79 msec task-clock                #    0.998 CPUs utilized
                61      context-switches          #  148.494 /sec
                 6      cpu-migrations            #   14.606 /sec
                 3      page-faults               #    7.303 /sec
        1729395504      cycles                    #    4.210 GHz                      (83.45%)
          38195162      stalled-cycles-frontend   #    2.21% frontend cycles idle     (83.46%)
         130986465      stalled-cycles-backend    #    7.57% backend cycles idle      (83.46%)
        3245900681      instructions              #    1.88  insn per cycle
                                                  #    0.04  stalled cycles per insn  (83.45%)
         661909633      branches                  #    1.611 G/sec                    (83.45%)
           4006110      branch-misses             #    0.61% of all branches          (82.72%)

       0.411536377 seconds time elapsed

       0.044359000 seconds user
       0.366976000 seconds sys

Since we are not actually using any of the region returned by mmap in our contrived example, the kernel doesn't really have to do a lot of work. What we are seeing is pretty close to syscall-only overhead. LOCK prefixed atomic operations are measured in clock cycles, and even worst-case contended times are crazy better than that. 1729.4259 clock cycles per loop iteration means that even a worst-case cache miss for the CMPXCHG would let us easily run circles around the syscall-based method (provided our pages haven't been swapped out or some other nastiness, in which case there could be HUGE delays, but that is outside the scope of our tests with swap disabled).
If we populate a single dword in the first page returned by mmap by adding

Code:
	mov	[rax], r12d

the picture changes:

Code:
perf stat ./mmap_test
test complete

 Performance counter stats for './mmap_test':

           1902.54 msec task-clock                #    0.999 CPUs utilized
               218      context-switches          #  114.584 /sec
                18      cpu-migrations            #    9.461 /sec
           1000003      page-faults               #  525.615 K/sec
        8125567896      cycles                    #    4.271 GHz                      (83.19%)
         292236168      stalled-cycles-frontend   #    3.60% frontend cycles idle     (83.39%)
        1453001032      stalled-cycles-backend    #   17.88% backend cycles idle      (83.40%)
       13707552618      instructions              #    1.69  insn per cycle
                                                  #    0.11  stalled cycles per insn  (83.40%)
        2712532788      branches                  #    1.426 G/sec                    (83.40%)
           7065941      branch-misses             #    0.26% of all branches          (83.23%)

       1.903930208 seconds time elapsed

       0.159910000 seconds user
       1.743019000 seconds sys

Touching two further pages per iteration,

Code:
	mov	[rax+4096], r12d
	mov	[rax+8192], r12d

gives:

Code:
perf stat ./mmap_test
test complete

 Performance counter stats for './mmap_test':

           2706.53 msec task-clock                #    0.999 CPUs utilized
               308      context-switches          #  113.799 /sec
                 9      cpu-migrations            #    3.325 /sec
           3000004      page-faults               #    1.108 M/sec
       11748415171      cycles                    #    4.341 GHz                      (83.31%)
         705557802      stalled-cycles-frontend   #    6.01% frontend cycles idle     (83.31%)
        2384974834      stalled-cycles-backend    #   20.30% backend cycles idle      (83.31%)
       18529302125      instructions              #    1.58  insn per cycle
                                                  #    0.13  stalled cycles per insn  (83.31%)
        3447280424      branches                  #    1.274 G/sec                    (83.43%)
          11183979      branch-misses             #    0.32% of all branches          (83.33%)

       2.708347307 seconds time elapsed

       0.283877000 seconds user
       2.422956000 seconds sys

And populating every page of the 2MB region each iteration,

Code:
	mov	rsi, rax
	mov	ecx, 512
.populate_every_page:
	mov	[rsi], ecx
	add	rsi, 4096
	sub	ecx, 1
	jnz	.populate_every_page

results in:

Code:
perf stat ./mmap_test
test complete

 Performance counter stats for './mmap_test':

         276934.08 msec task-clock                #    0.999 CPUs utilized
             30238      context-switches          #  109.188 /sec
               819      cpu-migrations            #    2.957 /sec
         512000004      page-faults               #    1.849 M/sec
     1162203144724      cycles                    #    4.197 GHz                      (83.33%)
      133040834940      stalled-cycles-frontend   #   11.45% frontend cycles idle     (83.33%)
      348649831792      stalled-cycles-backend    #   30.00% backend cycles idle      (83.33%)
     1805763490028      instructions              #    1.55  insn per cycle
                                                  #    0.19  stalled cycles per insn  (83.33%)
      330735435159      branches                  #    1.194 G/sec                    (83.34%)
        1345913386      branch-misses             #    0.41% of all branches          (83.33%)

     277.141320090 seconds time elapsed

      31.005226000 seconds user
     245.875127000 seconds sys

Conclusion: I was dismayed by the first, basically syscall-only test at 2.4M/s on this machine. You can quickly see that an atomic operation, even on a contended lock, will in 99% of conceivable scenarios ridiculously outperform using mmap/munmap as a memory management technique, and especially so for clone-based thread stacks. Cheers and good luck
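For comparison on the atomic side, the loop body could be swapped for a single locked operation in the same harness. A sketch (counter is a hypothetical qword variable; no numbers are claimed here, run perf stat yourself to measure on your own machine):
Code:
	mov	ebx, 1000000
.loop:
	lock add qword [counter], 1	; stand-in for the freelist's single atomic op
	sub	ebx, 1
	jnz	.loop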
revolution 13 Nov 2022, 04:56
Props to redsock for actually running some indicative tests, instead of reading code and trying to predict how it will perform.
I wonder if people see one instruction, syscall, and like to assume it will be "quick" because it looks nice and neat. But if you have to write 20 instructions to do the same action then it becomes more complex and thus is probably "slower"? I mean it might be slower, or not, but there is no way to know without testing.

And of course reading Agner Fog's docs and seeing how "scary" lock is and assuming it must be absolutely awful, while failing to consider that the syscall also has to use lock anyway. Sometimes you just have to accept that there isn't an approach that both works in all cases and can somehow magically avoid locks or spins.
Furs 13 Nov 2022, 18:27
I mean this was kinda clear. A LOCK is only about 100 cycles, on average. And if it's contended, then the other threads are doing work, so it's ok.
Just a syscall by itself is probably 100 cycles, if not more.
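A rough way to check such estimates (a sketch, with the usual rdtsc serialization caveats; rdtscp or lfence would be needed for rigor) is to bracket a cheap syscall like getpid:
Code:
	rdtsc			; start timestamp, edx:eax
	shl	rdx, 32
	or	rax, rdx
	mov	r12, rax	; r12 survives the syscall
	mov	eax, 39		; sys_getpid, about the cheapest syscall available
	syscall
	rdtsc			; end timestamp
	shl	rdx, 32
	or	rax, rdx
	sub	rax, r12	; rough cycle count for one syscall round trip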
redsock 20 Nov 2022, 23:08
redsock wrote: Since all TID/PID values for the linux kernel are < 21 bits in size (most are much smaller), it can safely be treated as a 32 bit value.

Code:
cat /proc/sys/kernel/pid_max