flat assembler
Message board for the users of flat assembler.

Creating a thread in Linux

Tycho



Joined: 02 Mar 2008
Posts: 16
Tycho 10 Nov 2022, 11:00
Recently, after years of programming for Windows, I decided to flirt with Linux and port some software. I'm writing this to share some difficulties I had to go through that a Google search did not solve. For starters, I needed to be able to create a new thread. The easiest way would be to use pthread_create, but I want to learn Linux as well as I know Windows, so I decided to go with the system call 'clone'. This is where it got difficult, so please pay close attention.

When you search for clone() you find the following signature
Code:
       int clone(int (*fn)(void *), void *stack, int flags, void *arg, ...
                 /* pid_t *parent_tid, void *tls, pid_t *child_tid */ );    

Except for the flags, which you may struggle with for a while, it's pretty straightforward. You create your own stack, pass an argument, set the start address, and you can receive the thread ID (TID). So that's it? No hard part? No, the problem is that this signature is a glibc wrapper function. The actual system call looks like this:

x86-64:
Code:
           long clone(unsigned long flags, void *stack,
                      int *parent_tid, int *child_tid,
                      unsigned long tls);    

(on x86-32: child_tid and tls are swapped)

At first they look quite similar, but wait, where are the first and fourth parameters from the glibc version? The start address is nowhere to be found. So where does my thread start? With no prior knowledge I got stuck, and so did my program: segmentation fault! All the examples I could find (really useful if you struggle with the flags) use the glibc wrapper.

It was only after I learned about a similar system call, fork (which creates a copy of your process), that I got an idea of how clone might work. Clone copies the current thread (what exactly is shared depends on the flags). The newly created thread starts at the address where the calling thread was at the moment it was copied. This means that the new thread starts right after the syscall instruction.

The next question is: how do I detect the copied thread so I can jump to the desired start address? Luckily, this behavior is also similar to fork. All registers are the same, except for rsp and rax (and rcx and r11, which the syscall instruction itself overwrites in both threads). rsp in the new thread is set to the address you passed in the stack parameter. rax is set to zero in the new thread. The calling thread receives the new TID in rax.

That's it! When creating the stack, you put the argument(s) and the start address on it. This way, once the clone is created, it can jump to the desired start address.
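As a rough illustration of that stack layout, here is a sketch in C rather than assembly. The helper name `prepare_child_stack` and its parameters are made up for this example, and it assumes the mapping is pointer-aligned with a size that is a multiple of the pointer size. The child starts with rsp at the lowest of the three slots, so its first `ret` pops the start address, which leaves the exit routine as the return address and the argument right above it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: fills the top three pointer-sized slots of a stack
 * mapping and returns the value to pass as the "stack" argument of clone.
 *   top[0] = start address (popped by the first ret in the child)
 *   top[1] = exit routine  (return address seen by the start routine)
 *   top[2] = argument      (found at [rsp + 8] inside the start routine)
 */
uintptr_t *prepare_child_stack(void *mapping, size_t mapping_size,
                               uintptr_t start, uintptr_t exit_routine,
                               uintptr_t arg)
{
    assert(mapping != NULL);
    assert(mapping_size >= 3 * sizeof(uintptr_t));

    /* One past the highest address of the mapping. */
    uintptr_t *end = (uintptr_t *)((char *)mapping + mapping_size);
    uintptr_t *top = end - 3;   /* reserve room for three pointers */

    top[0] = start;
    top[1] = exit_routine;
    top[2] = arg;
    return top;
}
```

The returned pointer corresponds to the value that ends up in rsi before the clone syscall in the assembly version.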

Here is what my function looks like. If someone with more knowledge is able to improve it, please let me know :)

Code:
format ELF64 executable
entry Start

;
;flags
;
PROT_READ       equ 0x1
PROT_WRITE      equ 0x2
PROT_NONE       equ 0x0

MAP_ANONYMOUS equ 0x20
MAP_GROWSDOWN equ 0x100
MAP_PRIVATE equ 0x2

STACK_SIZE equ (0x1000*3)
GUARD_PAGE_SIZE equ 0x1000

CLONE_VM        equ 0x00000100
CLONE_FS        equ 0x00000200
CLONE_FILES     equ 0x00000400
CLONE_SIGHAND   equ 0x00000800
CLONE_THREAD    equ 0x00010000
CLONE_IO        equ 0x80000000

psize equ 8

segment readable writeable executable

hello1 db "[thread 1] Hello, I'm the main thread and I'm about to start a second thread!",0Ah
hello2 db "[thread 1] The second thread is created, so I'm quitting",0Ah
hello3:
dd hello3StrEnd-hello3Str
hello3Str db "[thread 2] Hello, I'm the newly created thread!",0Ah
hello3StrEnd:

;------------

Start:
;write message in console
mov rdx,hello2-hello1
mov rsi,hello1
mov edi,1           ; STDOUT
mov eax,1 ;syswrite
syscall

;create new thread
mov rsi,hello3
mov rdi,newThread
call createthread

;write message in console
mov edx,hello3-hello2
mov rsi,hello2
mov edi,1           ; STDOUT
mov eax,1 ;syswrite
syscall

;quit
mov eax,60 ;exit
syscall

;--------------

newThread:
mov rax,qword [rsp+psize] ;param

;write message from param in console
mov edx,dword [rax]
add rax,4
mov rsi,rax
mov edi,1           ; STDOUT
mov eax,1 ;syswrite
syscall

;quit
ret

;------------------------------------------------------------------------------
;This function will create a new thread with one parameter
;param:
;  rdi: base:  the base address to start the thread
;  rsi: param: the param
;
;return value: RAX: on success the TID and on failure a negative number
;-----------------------------------------------------------------------------
createthread:
push rdi ;base
push rsi ;param

xor r9,r9 ;offset
mov r8,-1 ;fd
mov r10,MAP_PRIVATE or MAP_ANONYMOUS ;flags
mov edx,PROT_READ or PROT_WRITE ;
mov esi,STACK_SIZE + GUARD_PAGE_SIZE ;length
xor edi,edi ;addr
mov eax,9 ;mmap (mmap2 is the 32-bit variant)
syscall
test rax,rax
jl errCreatethread

;make guard page not readable/writable
mov edx,PROT_NONE
mov esi,GUARD_PAGE_SIZE
mov rdi,rax ;address returned by mmap = lowest page of the mapping
mov eax,10  ;mprotect
syscall

add rdi,STACK_SIZE+GUARD_PAGE_SIZE-(psize*3) ;set stack to the last available memory spot and free space for 3 pointers

;set start address, return address, param on stack
pop qword [rdi+psize*2] ;set param
mov qword [rdi+psize],__exitThread ;set return address
pop qword [rdi] ;set start address

;clone current thread
xor r8,r8 ;tls (unused)
xor r10,r10 ;child_tid pointer (unused)
xor edx,edx ;parent_tid pointer (unused)
mov rsi,rdi ;stack
mov edi,CLONE_THREAD or CLONE_SIGHAND or CLONE_FS or CLONE_VM or CLONE_FILES or CLONE_IO
mov eax,56  ;clone
syscall

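ret  ;parent: return to the caller, child: pop the start address from its new stack and go there

errCreatethread:
add rsp,psize*2 ;mmap failed: drop the saved base and param so ret finds the return address
ret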


;------------------------------------------------------------------------------
;This function is called when a new created thread returns
;------------------------------------------------------------------------------
__exitThread:

pop rax

sub rsp,STACK_SIZE+GUARD_PAGE_SIZE

;free the stack
mov esi,STACK_SIZE + GUARD_PAGE_SIZE ;length
mov rdi,rsp ;rsp now holds the base address of the mapping
mov eax,11  ;munmap
syscall

xor edi,edi ;exit status 0
mov eax,60  ;exit (ends this thread only)
syscall
int3 ;should not happen    

_________________
--Love Flat Assembler: small, fast and free --


Last edited by Tycho on 11 Nov 2022, 10:27; edited 3 times in total
redsock



Joined: 09 Oct 2009
Posts: 430
Location: Australia
redsock 10 Nov 2022, 19:51
In your getErrNo definition, you are testing the highest bit in rax, which is the same as it being negative. Also, since the clone syscall returns a 32 bit value, this can be simplified (my "createThread" equivalent is a function, hence my return values to the parent):
Code:
        test    eax, eax
        jz      .inchild
        ; if jl, then an error occurred with the clone syscall
        ; otherwise, return value is child pid and we are in the parent
        ret    
I normally don't release thread stackspace, and prefer to recycle them for later threads. This might save the __exitThread functionality and could be replaced with the simpler non-group exit (meaning, you could remove the exitThread functionality and place that in your ".inchild" section).

My only other thought is: the concept of a guard page works well with contiguous memory areas where you might run off the end undetected. Your use of mmap2 however means that you will just crash if you run off the end as there'll be no contiguous area for you to overrun (and hence the guard page is moot since it accomplishes the same thing).

Cheers and well done on figuring it all out :)
I



Joined: 19 May 2022
Posts: 58
I 11 Nov 2022, 07:09
I vaguely remember hunting too, ended up with
Code:
        mov     eax,sys_clone
        mov     rdi,CLONE_VM or CLONE_SIGHAND
        lea     rsi,[TStack+0x2000-8]                   ; If stack not needed can set to zero!, -8 for alignment!
        xor     edx,edx
        xor     r10d,r10d
        xor     r8d,r8d
        syscall                                 
        cmp     rax,0                                   ; returns zero for child thread and child PID for parent.
        jz      NewThread                               
    

Seems to work okay.
Tycho



Joined: 02 Mar 2008
Posts: 16
Tycho 11 Nov 2022, 09:36
Thanks for the feedback redsock! What you said made me think about the getErrNo function and how it is completely useless in this situation. I removed it from the example. I also changed the calling convention of createthread(), resulting in a single ret after clone() without the need for any compare instruction.

The reason it was the way it was is that I always placed the Windows functions in a wrapper, with the idea of one day porting this project to Linux. The function originally returns 0 on failure, which is what I was simulating here :). The wrapper also restores registers, which makes it impossible to share the same return path with the newly created thread.
redsock wrote:
test eax, eax
jz .inchild
; if jl, then an error occurred with the clone syscall
; otherwise, return value is child pid and we are in the parent
ret
I'm going to use this in my project, thanks for the suggestion! I also applied it to the example above to detect an error with mmap2, something I originally forgot to do.
redsock wrote:
Also, since the clone syscall returns a 32 bit value
I have my doubts about the return value of clone being a dword. When it fails on my system, it returns a qword (when I used the wrong flag, RAX = FFFFFFFFFFFFFFEA). I would still get the correct errno if I use neg on the dword part, but what if the function actually succeeded and I don't know that because I don't check the correct sign bit?
redsock wrote:
Your use of mmap2 however means that you will just crash if you run off the end as there'll be no contiguous area for you to overrun
The reason for the guard page is to make sure that the program crashes if it uses more stack than is mapped. This is because, in my understanding, it is possible that the memory page above the memory allocated for the stack can be allocated by my program as well. This means that if the routine uses more memory than allocated for the stack, it is going to crash OR overwrite other data. I want to be sure it always crashes!
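A minimal C sketch of the guard-page idea, for readers who want to try it outside of assembly. The function names are hypothetical, and both sizes are assumed to be multiples of the system page size. It maps the stack plus one guard region, removes all access from the lowest part, and then checks in a forked child that touching the guard really ends in SIGSEGV.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

/* Maps guard_size + stack_size bytes and revokes all access to the lowest
 * guard_size bytes. Returns the base of the mapping, or NULL on failure. */
void *map_stack_with_guard(size_t stack_size, size_t guard_size)
{
    assert(stack_size > 0 && guard_size > 0);
    size_t total = stack_size + guard_size;
    char *base = mmap(NULL, total, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return NULL;
    if (mprotect(base, guard_size, PROT_NONE) != 0) {
        munmap(base, total);
        return NULL;
    }
    return base;
}

/* Returns 1 if a write into the guard region kills the writer with SIGSEGV,
 * 0 if it does not, and -1 if the setup failed. */
int guard_page_traps(size_t stack_size, size_t guard_size)
{
    size_t total = stack_size + guard_size;
    char *base = map_stack_with_guard(stack_size, guard_size);
    if (base == NULL)
        return -1;

    base[guard_size] = 1;          /* first usable byte: must be writable */

    pid_t pid = fork();
    if (pid < 0) {
        munmap(base, total);
        return -1;
    }
    if (pid == 0) {
        /* Child: touch the last byte of the guard region. */
        *(volatile char *)(base + guard_size - 1) = 1;
        _exit(0);                  /* only reached if the guard did not trap */
    }

    int status = 0;
    pid_t waited = waitpid(pid, &status, 0);
    munmap(base, total);
    if (waited != pid)
        return -1;
    return WIFSIGNALED(status) && WTERMSIG(status) == SIGSEGV;
}
```

Calling `guard_page_traps(3 * page, page)` with `page` taken from `sysconf(_SC_PAGESIZE)` mirrors the 3-page stack plus 1-page guard used in the example program.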

On another note, originally I wanted to use the MAP_GROWSDOWN flag with mmap2, which would make perfect sense for a stack. Sadly, a Google search told me this flag does not work well on most kernels.
redsock wrote:
I normally don't release thread stackspace, and prefer to recycle them for later threads.
I'm not sure about recycling the stack space. mmap(2) seems really similar to the VirtualAlloc API of Windows. This function is so fast that recycling memory usually slows down performance. I'm speaking of the situation where you want to be able to run 'unlimited' threads in theory; if you know how many threads you run simultaneously, you may get a performance boost. I will do some tests to see if mmap has the same performance as VirtualAlloc.


Last edited by Tycho on 11 Nov 2022, 10:26; edited 3 times in total
Tycho



Joined: 02 Mar 2008
Posts: 16
Tycho 11 Nov 2022, 10:03
I wrote:
I vaguely remember hunting too, ended up with
Code:
        mov     eax,sys_clone
        mov     rdi,CLONE_VM or CLONE_SIGHAND
        lea     rsi,[TStack+0x2000-8]                   ; If stack not needed can set to zero!, -8 for alignment!
        xor     edx,edx
        xor     r10d,r10d
        xor     r8d,r8d
        syscall                                 
        cmp     rax,0                                   ; returns zero for child thread and child PID for parent.
        jz      NewThread                               
    

Seems to work okay.


Good to know the stack can be set to zero, I didn't know that. Thanks :)

I



Joined: 19 May 2022
Posts: 58
I 11 Nov 2022, 13:01
Not very clear by me, plus I'm a Linux noob.

See https://linux.die.net/man/2/clone

Quote:
Another difference for sys_clone is that the child_stack argument may be zero, in which case copy-on-write semantics ensure that the child gets separate copies of stack pages when either process modifies the stack. In this case, for correct operation, the CLONE_VM option should not be specified.
bitRAKE



Joined: 21 Jul 2003
Posts: 4020
Location: vpcmpistri
bitRAKE 11 Nov 2022, 15:05
Code:
        syscall
        xchg ecx,eax ; TID is s32, >0
        jrcxz @F
        retn
@@:     ; new thread    

_________________
¯\(°_o)/¯ “languages are not safe - uses can be” Bjarne Stroustrup
redsock



Joined: 09 Oct 2009
Posts: 430
Location: Australia
redsock 11 Nov 2022, 19:49
Tycho wrote:
I have my doubts about the return value of clone being a dword. When it fails on my system, it returns a qword (when I used the wrong flag, RAX = FFFFFFFFFFFFFFEA). I would still get the correct errno if I use neg on the dword part, but what if the function actually succeeded and I don't know that because I don't check the correct sign bit?
The nice part about sign extension is that RAX = FFFFFFFFFFFFFFEA is the same value as EAX = FFFFFFEA: both = -22 (hence errno being correct). Since all TID/PID values for the linux kernel are < 21 bits in size (most are much smaller), it can safely be treated as a 32 bit value.
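The sign-extension argument can be checked with a few lines of C. This is only a sketch; the function names are invented for the example. It shows that reading the low dword of FFFFFFFFFFFFFFEA as a signed value gives -22, and that sign-extending -22 back to 64 bits reproduces the original register value.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reinterprets the low 32 bits of a 64-bit register value as a signed
 * number, which is what "test eax,eax" followed by "jl" looks at. */
int32_t low_dword_signed(uint64_t rax)
{
    uint32_t eax = (uint32_t)rax;          /* truncation, like reading eax */
    int32_t value;
    memcpy(&value, &eax, sizeof value);    /* bit-for-bit reinterpretation */
    return value;
}

/* Sign-extends a 32-bit value to 64 bits, like "movsxd rax,eax". */
int64_t sign_extend(int32_t eax)
{
    return (int64_t)eax;
}

/* Turns a negative syscall result into the positive errno number. */
int errno_from_result(uint64_t rax)
{
    int32_t value = low_dword_signed(rax);
    assert(value < 0);                     /* only meaningful for failures */
    return -value;
}
```

With the value from the post, `errno_from_result(0xFFFFFFFFFFFFFFEA)` yields 22, which is EINVAL on Linux.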
Tycho wrote:
This is because, in my understanding, it is possible that the memory page above the memory allocated for the stack can be allocated by my program as well. This means that if the routine uses more memory than allocated for the stack, it is going to crash OR overwrite other data. I want to be sure it always crashes!
What I originally said was that because of the way you are using mmap2, you will never get contiguous address mappings unless you painstakingly ask for it. Kernel address randomization also means that consecutive runs of your program are unlikely to receive the same mappings. You really don't need guard pages unless you are doing memory management with your stack a different way. :)
Tycho wrote:
This function is so fast that recycling memory usually slows down performance.
Yikes to this statement. Let's consider your comment twofold: 1) raw syscall overhead (this is to say, who cares what the kernel is actually doing in said syscall, just the userspace to ring 0 transition), and 2) what mmap actually has to accomplish to meet your request. There are many, many reasons userspace memory management routines exist (see: tcmalloc, malloc, jemalloc, etc).

A "recycled" 2MB stack space (that is actually sparsely populated thanks to the way mmap works) could be placed into a singly-linked list and accessed via a single userspace-only atomic operation that would be virtually cost-free compared to any syscall/kernel interop.
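One way such a free list could look, sketched in C11 with `<stdatomic.h>`. This is not redsock's actual implementation, and the names `stack_release` and `stack_acquire` are invented here. Each released stack stores the link to the next free stack in its own first bytes, and both operations are a compare-and-swap loop on a single head pointer. The pop side of this simple form is exposed to the ABA problem under heavy contention, so a real version would probably add a tag or counter to the head.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* The link lives inside the recycled stack memory itself. */
struct free_stack {
    struct free_stack *next;
};

/* Head of the singly-linked list of recycled stacks. */
_Atomic(struct free_stack *) free_list = NULL;

/* Puts a no-longer-used stack (identified by its base address) on the list. */
void stack_release(void *base)
{
    assert(base != NULL);
    struct free_stack *node = base;
    struct free_stack *head = atomic_load(&free_list);
    do {
        node->next = head;      /* head is refreshed by a failed exchange */
    } while (!atomic_compare_exchange_weak(&free_list, &head, node));
}

/* Takes a recycled stack off the list, or returns NULL if none is available
 * (the caller would then fall back to mmap). */
void *stack_acquire(void)
{
    struct free_stack *head = atomic_load(&free_list);
    while (head != NULL &&
           !atomic_compare_exchange_weak(&free_list, &head, head->next)) {
        /* retry with the refreshed head */
    }
    return head;
}
```

Because recycled stacks are never unmapped in this scheme, reading `head->next` stays a valid memory access even if another thread took that stack in the meantime.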

A few years back I did a terminal video comparing various language startups and their relevant syscall counts, you might enjoy the syscall tracing parts of it: https://2ton.com.au/videos/tvs_part1/
Cheers :)

_________________
2 Ton Digital - https://2ton.com.au/
Tycho



Joined: 02 Mar 2008
Posts: 16
Tycho 12 Nov 2022, 10:05
I wrote:
Not very clear by me, plus I'm a Linux noob.

See https://linux.die.net/man/2/clone

Quote:
Another difference for sys_clone is that the child_stack argument may be zero, in which case copy-on-write semantics ensure that the child gets separate copies of stack pages when either process modifies the stack. In this case, for correct operation, the CLONE_VM option should not be specified.


Good that you mention this. Not using the CLONE_VM makes the thread run in a separate memory space. This means it's like using fork() with a few more tweaks.
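To see the fork-like behavior in practice, here is a small C sketch that issues the raw clone syscall without CLONE_VM and with a zero stack. The function name and the global are made up for the example, and the all-zero trailing arguments assume the x86-64 argument order shown earlier in the thread. The child changes a global variable and exits; the parent then checks that its own copy is untouched.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

int shared_value = 1;

/* Returns 1 if the child's write stayed private to the child, 0 if the
 * parent saw it, and -1 if clone or waitpid failed. */
int child_write_is_private(void)
{
    /* flags = SIGCHLD only (no CLONE_VM), stack = 0, remaining args unused. */
    long ret = syscall(SYS_clone, (unsigned long)SIGCHLD, 0UL, 0UL, 0UL, 0UL);
    if (ret < 0)
        return -1;

    if (ret == 0) {
        /* Child: execution continues right after the syscall with result 0. */
        shared_value = 2;
        _exit(shared_value);
    }

    /* Parent: ret is the child's PID. */
    int status = 0;
    if (waitpid((pid_t)ret, &status, 0) != (pid_t)ret)
        return -1;
    assert(WIFEXITED(status));
    return WEXITSTATUS(status) == 2 && shared_value == 1;
}
```

With CLONE_VM added, the same write would be visible to the parent, which is why the child then needs its own stack.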

Tycho



Joined: 02 Mar 2008
Posts: 16
Tycho 12 Nov 2022, 10:10
redsock wrote:
A "recycled" 2MB stack space (that is actually sparsely populated thanks to the way mmap works) could be placed into a singly-linked list and accessed via a single userspace-only atomic operation that would be virtually cost-free compared to any syscall/kernel interop.
This might work if only one thread is responsible for creating the new threads, but if multiple threads are able to create a new thread you have a few race conditions to solve. And it is quite known that lock * instructions can bring down performance very much. It's just not crystal clear that your solution is the best way to go, it depends on what the program tries to accomplish and how it's done.
redsock wrote:
There are many many reasons userspace memory management routines exist (see: tcmalloc, malloc, jemalloc, etc).
Breaking up pages into smaller pieces of memory is a great thing. But again, performance-wise, the question is what your program tries to do, how it's done, and how efficiently the memory is broken up. If you need the size of a full page, there is no reason to break up the memory. Recently I was working on a program that broke up memory and filled every small gap it could find. But one routine requested a lot of memory blocks the size of a quarter or half a page. The allocator still tried to fill every small gap, but now it had to go through a lot of memory, and it slowed down too much. To resolve this issue, I created a more efficient allocator for a specific part of the program, one that did not fill every small gap, was not thread-safe, did not reassign unused memory, and had to be released all at once or not at all. My program is Speedy Gonzales now, where it was a turtle before. Just to give you an example that there is not one right way to do things! That doesn't mean that the functions you mention don't have a purpose, they really do :) !
redsock wrote:
also means that consecutive runs of your program are unlikely to receive the same mappings
For 64-bit programs there is probably only a small probability that it goes wrong, a risk I'm still not willing to take, but I understand that other people do. What I didn't tell you is that the program I'm working on is created in such a way that it can be compiled for x86-64 and x86-32. I created macros like pedx -> rdx or edx, and psize is 8 or 4 (as used in the example above). The Linux version is also going to be built the same way. I'd rather not stake the functioning of my program on randomization, given the limited address space 32-bit systems have to offer.
redsock wrote:
Since all TID/PID values for the linux kernel are < 21 bits in size (most are much smaller), it can safely be treated as a 32 bit value.
I wasn't aware of this, thanks for sharing!
redsock



Joined: 09 Oct 2009
Posts: 430
Location: Australia
redsock 12 Nov 2022, 22:51
You have some fun times ahead of you comparing the various methods. It is clear you haven't done so yet. I'll leave you to find out for yourself for all but the following bit:
Tycho wrote:
This might work if only one thread is responsible for creating the new threads, but if multiple threads are able to create a new thread you have a few race conditions to solve. And it is quite known that lock * instructions can bring down performance very much. It's just not crystal clear that your solution is the best way to go, it depends on what the program tries to accomplish and how it's done.
Consider the following code (compiled against my own HeavyThing library just for string output):
Code:
public _start
falign
_start:
        call    ht$init

        mov     ebx, 1000000
        mov     r12d, 2097152
.loop:
        mov     eax, syscall_mmap
        xor     edi, edi
        mov     esi, r12d
        mov     edx, 0x3
        mov     r10d, 0x22
        mov     r8, -1
        xor     r9d, r9d
        syscall
        mov     rdi, .error_mmap
        test    rax, rax
        jl      .error
        
        mov     esi, r12d
        mov     rdi, rax
        mov     eax, syscall_munmap
        syscall

        sub     ebx, 1
        jnz     .loop

        mov     rdi, .complete
        call    string$to_stdoutln

        mov     eax, syscall_exit
        xor     edi, edi
        syscall
.error:
        call    string$to_stdoutln
        mov     eax, syscall_exit
        mov     edi, 1
        syscall
cleartext .error_mmap, 'mmap failed'
cleartext .complete, 'test complete'
    
I have a relatively powerful AMD workstation that ran this test for me against a Linux 5.14.21 kernel at a decent pace of 4.21 GHz:
Code:
perf stat ./mmap_test
test complete

 Performance counter stats for './mmap_test':

            410.79 msec task-clock                #    0.998 CPUs utilized          
                61      context-switches          #  148.494 /sec                   
                 6      cpu-migrations            #   14.606 /sec                   
                 3      page-faults               #    7.303 /sec                   
        1729395504      cycles                    #    4.210 GHz                      (83.45%)
          38195162      stalled-cycles-frontend   #    2.21% frontend cycles idle     (83.46%)
         130986465      stalled-cycles-backend    #    7.57% backend cycles idle      (83.46%)
        3245900681      instructions              #    1.88  insn per cycle         
                                                  #    0.04  stalled cycles per insn  (83.45%)
         661909633      branches                  #    1.611 G/sec                    (83.45%)
           4006110      branch-misses             #    0.61% of all branches          (82.72%)

       0.411536377 seconds time elapsed

       0.044359000 seconds user
       0.366976000 seconds sys
    
Your misguided thought that a LOCK prefix and possible race conditions (all perfectly reasonable to deal with via normal atomic operations) are somehow worse off than that is... well, misguided. My reasonably fast CPU is taking 410.79 nanoseconds per iteration to do the 2 MB mmap and subsequent munmap, or, for a single CPU flat out doing nothing else, I can get at most ~2.4M alloc/free pairs per second.

Since we are not actually using any of the region returned by mmap in our contrived example, the kernel doesn't really have to do a lot of work. What we are seeing is pretty close to syscall-only overhead.

LOCK prefixed atomic operations are measured in clock cycles, and even worst-case contended times are crazy better than that. Roughly 1729.4 clock cycles per loop iteration (1729395504 cycles over 1,000,000 iterations) means that even a worst-case cache miss for the CMPXCHG would let us easily run circles around the syscall-based method (provided our pages haven't been swapped out or some other nastiness, in which case it could be HUGE delays, but that is outside the scope of our tests with swap disabled).

If we populate a single dword into the first page returned by mmap by adding
Code:
        mov     [rax], r12d
    
between the successful call to mmap and munmap, look at what this does to our performance:
Code:
perf stat ./mmap_test
test complete

 Performance counter stats for './mmap_test':

           1902.54 msec task-clock                #    0.999 CPUs utilized          
               218      context-switches          #  114.584 /sec                   
                18      cpu-migrations            #    9.461 /sec                   
           1000003      page-faults               #  525.615 K/sec                  
        8125567896      cycles                    #    4.271 GHz                      (83.19%)
         292236168      stalled-cycles-frontend   #    3.60% frontend cycles idle     (83.39%)
        1453001032      stalled-cycles-backend    #   17.88% backend cycles idle      (83.40%)
       13707552618      instructions              #    1.69  insn per cycle         
                                                  #    0.11  stalled cycles per insn  (83.40%)
        2712532788      branches                  #    1.426 G/sec                    (83.40%)
           7065941      branch-misses             #    0.26% of all branches          (83.23%)

       1.903930208 seconds time elapsed

       0.159910000 seconds user
       1.743019000 seconds sys
    
Indeed, this takes 4.63 times as long on my machine. Now let's populate two more pages for a total of 3 dirty pages in our 2 MB mmap by adding another two instructions after our last one:
Code:
        mov     [rax+4096], r12d
        mov     [rax+8192], r12d
    
And again, the performance stats are telling:
Code:
perf stat ./mmap_test
test complete

 Performance counter stats for './mmap_test':

           2706.53 msec task-clock                #    0.999 CPUs utilized          
               308      context-switches          #  113.799 /sec                   
                 9      cpu-migrations            #    3.325 /sec                   
           3000004      page-faults               #    1.108 M/sec                  
       11748415171      cycles                    #    4.341 GHz                      (83.31%)
         705557802      stalled-cycles-frontend   #    6.01% frontend cycles idle     (83.31%)
        2384974834      stalled-cycles-backend    #   20.30% backend cycles idle      (83.31%)
       18529302125      instructions              #    1.58  insn per cycle         
                                                  #    0.13  stalled cycles per insn  (83.31%)
        3447280424      branches                  #    1.274 G/sec                    (83.43%)
          11183979      branch-misses             #    0.32% of all branches          (83.33%)

       2.708347307 seconds time elapsed

       0.283877000 seconds user
       2.422956000 seconds sys
    
We have gone from a more-or-less syscall-only overhead rate of 2.4 million operations per second down to about 369.47 thousand per second. Even more telling, if we populate every page returned by getting rid of our three page moves and replacing them with:
Code:
        mov     rsi, rax
        mov     ecx, 512
.populate_every_page:
        mov     [rsi], ecx
        add     rsi, 4096
        sub     ecx, 1
        jnz     .populate_every_page
    
This run I had to wait a while for. Of particular note, have a look at the page-faults count here:
Code:
perf stat ./mmap_test
test complete

 Performance counter stats for './mmap_test':

         276934.08 msec task-clock                #    0.999 CPUs utilized          
             30238      context-switches          #  109.188 /sec                   
               819      cpu-migrations            #    2.957 /sec                   
         512000004      page-faults               #    1.849 M/sec                  
     1162203144724      cycles                    #    4.197 GHz                      (83.33%)
      133040834940      stalled-cycles-frontend   #   11.45% frontend cycles idle     (83.33%)
      348649831792      stalled-cycles-backend    #   30.00% backend cycles idle      (83.33%)
     1805763490028      instructions              #    1.55  insn per cycle         
                                                  #    0.19  stalled cycles per insn  (83.33%)
      330735435159      branches                  #    1.194 G/sec                    (83.34%)
        1345913386      branch-misses             #    0.41% of all branches          (83.33%)

     277.141320090 seconds time elapsed

      31.005226000 seconds user
     245.875127000 seconds sys
    
276.9 seconds!!! About 3.61 thousand loop iterations per second! I digress.
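For anyone who wants to redo the arithmetic, here is a small C sketch that converts perf's task-clock value into iterations per second and nanoseconds per iteration. The function names are invented for this example, and the only inputs are the task-clock figures printed above together with the loop count of 1,000,000 from the test program.

```c
#include <assert.h>

/* Iterations per second, given perf's task-clock in milliseconds. */
double iterations_per_second(double task_clock_msec, double iterations)
{
    assert(task_clock_msec > 0.0);
    return iterations / (task_clock_msec / 1000.0);
}

/* Average time per iteration in nanoseconds. */
double nanoseconds_per_iteration(double task_clock_msec, double iterations)
{
    assert(iterations > 0.0);
    return task_clock_msec * 1e6 / iterations;
}

/* task-clock values (msec) of the four runs, in the order they were shown:
 *   no page touched, 1 page touched, 3 pages touched, all 512 pages touched */
const double task_clock_runs[4] = { 410.79, 1902.54, 2706.53, 276934.08 };
```

Feeding the four values through `iterations_per_second` with 1e6 iterations gives roughly 2.43 million, 526 thousand, 369 thousand and 3.6 thousand mmap/munmap pairs per second.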

Conclusion: I was dismayed by the first basically syscall-only test at 2.4M/s on this machine. You can quickly see that an atomic operation even on a contended lock will in 99% of conceivable scenarios ridiculously outperform using mmap/munmap as a memory management technique, and especially so for clone-based thread stacks.

Cheers and good luck

revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20299
Location: In your JS exploiting you and your system
revolution 13 Nov 2022, 04:56
Props to redsock for actually running some indicative tests, instead of reading code and trying to predict how it will perform.

I wonder if people see one instruction, syscall, and like to assume it will be "quick" because it looks nice and neat. But if you have to write 20 instructions to do the same action then it becomes more complex and thus is probably "slower"? I mean it might be slower, or not, but there is no way to know without testing.

And of course reading Agner Fog's docs and seeing how "scary" lock is and assuming it must be absolutely awful, but failing to consider that the syscall also has to use lock anyway. Sometimes you just have to accept that there isn't an approach that both works in all cases and can somehow magically avoid locks or spins.
Furs



Joined: 04 Mar 2016
Posts: 2493
Furs 13 Nov 2022, 18:27
I mean this was kinda clear. A LOCK is only about 100 cycles, on average. And if it's contended, then the other threads are doing work, so it's ok.

Just a syscall by itself is probably 100 cycles if not more.
redsock



Joined: 09 Oct 2009
Posts: 430
Location: Australia
redsock 20 Nov 2022, 23:08
redsock wrote:
Since all TID/PID values for the linux kernel are < 21 bits in size (most are much smaller), it can safely be treated as a 32 bit value.
Update:
Code:
cat /proc/sys/kernel/pid_max    
on a Debian 11 AARCH64 machine I have here has it set to 4194304, and it is arbitrarily configurable up to 31 bits.
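A quick way to check this on your own machine from C, as a sketch with made-up function names. It reads the same /proc file and reports how many bits the largest possible PID needs, assuming (as the proc documentation describes) that pid_max is one greater than the highest PID the kernel will hand out.

```c
#include <assert.h>
#include <stdio.h>

/* Returns the kernel's pid_max, or -1 if the file cannot be read. */
long read_pid_max(void)
{
    FILE *f = fopen("/proc/sys/kernel/pid_max", "r");
    long value = -1;
    if (f == NULL)
        return -1;
    if (fscanf(f, "%ld", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}

/* Number of bits needed to represent the largest PID, which is pid_max - 1. */
int bits_needed(long pid_max)
{
    assert(pid_max > 0);
    int bits = 0;
    for (unsigned long v = (unsigned long)pid_max - 1; v != 0; v >>= 1)
        bits++;
    return bits;
}
```

For the value 4194304 quoted above, `bits_needed` returns 22, so the result still fits comfortably in a signed 32-bit register.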

Copyright © 1999-2024, Tomasz Grysztar. Also on GitHub, YouTube.

Website powered by rwasa.