flat assembler
Message board for the users of flat assembler.

Index > Main > 64bit Thread Safe Array Stack

Author
Thread Post new topic Reply to topic
r22



Joined: 27 Dec 2004
Posts: 805
r22 21 Sep 2006, 01:52
Here's a TS array stack that uses LOCK and XADD.
It's very simple but also very useful for multithreaded programming.

Code:
.data
;;concurrent array stack data
stackPtr     dq 0
stackHead    dq 0
stackEnd     dq 0  

.code
;;;INPUT rcx = size to make array stack
;;;returns NONZERO = SUCCESS  0 = FAIL
StackInit:
        shl     rcx,3
        mov     rdx,rcx ;;; size to qword
        xor     ecx,ecx
        push    rdx ;;; save
        mov     r8d,MEM_COMMIT or MEM_RESERVE
        mov     r9d,PAGE_READWRITE
        call    [VirtualAlloc]
        test    rax,rax
        jz      .fail
        mov     qword[stackPtr],rax
        mov     qword[stackHead],rax
        pop     rcx
        add     rax,rcx
        mov     qword[stackEnd],rax
        ret     0
   .fail:
        xor     eax,eax
        invoke  ExitProcess,0
        ret     0

;;;return NONZERO = SUCCESS  0 = FAIL
StackDelete:
        mov     rcx,qword[stackHead]
        xor     edx,edx
        mov     r8d,MEM_RELEASE
        call    [VirtualFree]
        test    rax,rax
        jz      .fail
        ret     0
   .fail:
        xor     eax,eax
        ret     0

;;;RCX = addr of data to put to the stack
;;;returns NONZERO = SUCCESS  0 = FAILURE
StackPush:
        mov     eax,8
   lock xadd    qword[stackPtr],rax  ;;; LOCKED exchange and add
        cmp     rax,qword[stackEnd]
        jae      .fail
        mov     qword[rax],rcx ;;;store addr of data
        ret     0
   .fail:
        xor     eax,eax
        ret     0

;;;return data addr
;;;return if underflow 0 = failure
StackPop:
        mov     rax,-8
   lock xadd    qword[stackPtr],rax ;;; locked exchange and add(sub)
        cmp     rax,qword[stackHead]
        jbe      .fail
        mov     rax,qword[rax-8]  ;;;get addr of data
        ret     0
   .fail:
        xor     eax,eax
        ret     0 
    


By extending the implementation you can allow for mutliple instances of the stack as opposed to the stack instance. But since I only needed one instance I didn't implement the extra overhead of having to pass a pointer to the stack data to every function.

Hope someone finds it useful.

EDIT: I fixed a few errors. Eventually I'll actualy test the code to make sure it works :P


Last edited by r22 on 21 Sep 2006, 15:59; edited 1 time in total
Post 21 Sep 2006, 01:52
View user's profile Send private message AIM Address Yahoo Messenger Reply with quote
Madis731



Joined: 25 Sep 2003
Posts: 2139
Location: Estonia
Madis731 21 Sep 2006, 13:05
I find it hard to use. Don't you need a database in each thread to mark all the owned stack parts. If you push or pop, you will never know what memory space was it and what has been changed by other theads in the mean time. Maybe if all the threads agree to play on constant number of stack slots.
Post 21 Sep 2006, 13:05
View user's profile Send private message Visit poster's website Yahoo Messenger MSN Messenger Reply with quote
r22



Joined: 27 Dec 2004
Posts: 805
r22 21 Sep 2006, 16:03
This is a user defined stack and ADT (abstract data type). By using LOCK XADD many threads can access it without having concurrency issues.
Post 21 Sep 2006, 16:03
View user's profile Send private message AIM Address Yahoo Messenger Reply with quote
Madis731



Joined: 25 Sep 2003
Posts: 2139
Location: Estonia
Madis731 21 Sep 2006, 18:04
Yeah, I understand how it works, but consider:
1) Thread ONE pushes "13" on the stack
2) Thread TWO pushes "4" and "71" on the stack
3) Thread THREE pushes "5" on the stack
4) Thread TWO pops ?? "5" and "71" from the stack
5) Thread ONE pops "4" from the stack
6) Thread THREE pops "13" from the stack

Isn't it a bit confusing - this way you will never know where valuable data is. If you agree on ALL threads that stack has only one position, then you can make a IPC-behaviour, but what else can it be used for?

You said something about testing - would you tell me how that test program works, maybe I get some ideas from that!?
Post 21 Sep 2006, 18:04
View user's profile Send private message Visit poster's website Yahoo Messenger MSN Messenger Reply with quote
okasvi



Joined: 18 Aug 2005
Posts: 382
Location: Finland
okasvi 21 Sep 2006, 18:56
I think it is meant for several threads pushing and one thread pop'ing?
Post 21 Sep 2006, 18:56
View user's profile Send private message MSN Messenger Reply with quote
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4624
Location: Argentina
LocoDelAssembly 21 Sep 2006, 19:01
I think it can be using to process requests in LIFO order. For example 1 thread (or more) pushing jobs to do and many other threads poping jobs to work on. It's just an example but I think that shared stack can have many useful uses (and the same for shared queues).
Post 21 Sep 2006, 19:01
View user's profile Send private message Reply with quote
r22



Joined: 27 Dec 2004
Posts: 805
r22 21 Sep 2006, 19:05
Good point, about usage.
I guess the only proper use for this setup would be as a sort of time irrelavent queue.

I'm using it as a storage for Worker Threads to get jobs off of. Since I just need ALL the jobs to get processed concurrently (in ANY order) it suits my purposes. So technically the PUSH operation (for my needs) doesn't even need to be thread safe as it'll be filled up before the worker threads are initialized. So, it's a 'Concurrent Single Use Work-Request List'.

Specifically I'm pushing structures with a file handle and encryption key and the worker threads are popping off the structures and encrypting the files.

The reason I created it was because the version I had that used a Mutex seemed to fail with WinXP64. The WaitForSingleObject api caused an exception after the CreateMutex api, so rather then waste time with it a used LOCK and XADD.

I thought it was an interesting snippet of code (since you don't see LOCK or XADD used much), but it's ability to be useful in a range of activities is definetly suspect.
Post 21 Sep 2006, 19:05
View user's profile Send private message AIM Address Yahoo Messenger Reply with quote
f0dder



Joined: 19 Feb 2004
Posts: 3175
Location: Denmark
f0dder 21 Sep 2006, 21:51
Seems like a pretty nice idea for work queues - a bus lock certainly beats WaitFor* and similar Smile
Post 21 Sep 2006, 21:51
View user's profile Send private message Visit poster's website Reply with quote
UCM



Joined: 25 Feb 2005
Posts: 285
Location: Canada
UCM 21 Sep 2006, 22:37
Why is it always "lock xadd", and nothing else?
Post 21 Sep 2006, 22:37
View user's profile Send private message Reply with quote
r22



Joined: 27 Dec 2004
Posts: 805
r22 22 Sep 2006, 00:12
UCM, XADD is the only lockable opcode that lets you acquire a value in memory and update that value in memory in the same instruction. Well other than a CMPXCHG (but you have to use a branch loop to make sure it worked correctly).

In high level code (java) it would look like.
//lock xadd mem64, reg64
synchronize{
temp = reg64;
reg64 = mem64;
mem64 += temp;
}

It would be great if someone could come up with a way of making a Queue rather then a stack (FILO as opposed to FIFO) without using OS locking mechanisms

It wouldn't be too much of a headache to implement using CMPXCHG but the looping required makes me think that it might be TOO cpu intensive and would probably be better off using a Semaphore or Mutex. If anyone knows how spin locks work with events and such feel free to enlighten me.
Post 22 Sep 2006, 00:12
View user's profile Send private message AIM Address Yahoo Messenger Reply with quote
f0dder



Joined: 19 Feb 2004
Posts: 3175
Location: Denmark
f0dder 22 Sep 2006, 07:44
r22: whether it's too CPU intensive depends on how long time will be spent in the loop. Critical Sections on windows actually tries to "spinlock" for a bit, before entering a wait-state... simply because a short spinlock can be faster than a full r3->r0->r3 + waiting.
Post 22 Sep 2006, 07:44
View user's profile Send private message Visit poster's website Reply with quote
vid
Verbosity in development


Joined: 05 Sep 2003
Posts: 7105
Location: Slovakia
vid 23 Sep 2006, 08:05
r22: nice, could you please make some template with lock() / release() functions so we can see locking part separated?
Post 23 Sep 2006, 08:05
View user's profile Send private message Visit poster's website AIM Address MSN Messenger ICQ Number Reply with quote
r22



Joined: 27 Dec 2004
Posts: 805
r22 23 Sep 2006, 19:56
Not quite sure what you mean vid.

;;;LOCK XADD
LOCK{
XCHG qword[stackPtr],rax
ADD qword[stackPtr],rax
}RELEASE{}

is the only part of the functions that are synchronized.

The stack implemented at the top of this thread is a poor example for such things. It was coded to suit my purpose for only needing Concurrent POPping of data off of it.
If you want to riddle: What case(s) would cause the above stack to lose a peice of data...
...Answer
Thread1: PUSHes DATA1
Thread2: POPes but loses context before it acquires DATA1 but after it's LOCKed XADD decrements the stackPtr
Thread1: PUSHes DATA2
Thread2: Returns with DATA2 and DATA1 is lost

The 64bit Concurrent Circular Queue is a better template/example of OS independent x86 locking because of it's NULL checks and no context related data integrity issues.
Post 23 Sep 2006, 19:56
View user's profile Send private message AIM Address Yahoo Messenger Reply with quote
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4624
Location: Argentina
LocoDelAssembly 26 Dec 2007, 02:41
What if the PUSHing thread is preempted just after "lock xadd" and then another thread fully executes POP from the beginning to the end? Wouldn't that thread get garbage?
Post 26 Dec 2007, 02:41
View user's profile Send private message Reply with quote
r22



Joined: 27 Dec 2004
Posts: 805
r22 26 Dec 2007, 04:26
Your right, the stack had issues, which was why I made the Ring Buffered Queue. I think the queue is much more stable.

http://board.flatassembler.net/topic.php?t=5887
Post 26 Dec 2007, 04:26
View user's profile Send private message AIM Address Yahoo Messenger Reply with quote
bitRAKE



Joined: 21 Jul 2003
Posts: 4073
Location: vpcmpistri
bitRAKE 26 Dec 2007, 06:22
I like the bit instructions for this kind of stuff.
Code:
; store RDX address in empty slot
    bts rdx,0
.x: add rax,8
    lock bts [rax],0
    jc .x
    mov [rax],rdx
; careful, RAX could run past end of buffer    
...usually several low-order bits are availble when using power of two data structures.

(Doesn't the partial implementation of 64-bit address space by the CPU allow some upper bits to be used as well? Not a good idea, really.)
Code:
    mov ecx,buff_items
.0: lock btr [eax+ecx*4-4],0
    dec ecx
    ja .0
    jnc .error
    mov [eax+ecx*4],edx
    retn    
...the tricky double flag branch - here a zero bit zero means busy - also stop when all buffer tested. Are there any compilers that use the bit instructions or stack flags of multiple instructions?
Post 26 Dec 2007, 06:22
View user's profile Send private message Visit poster's website Reply with quote
f0dder



Joined: 19 Feb 2004
Posts: 3175
Location: Denmark
f0dder 26 Dec 2007, 14:16
bitRAKE wrote:
(Doesn't the partial implementation of 64-bit address space by the CPU allow some upper bits to be used as well? Not a good idea, really.)

Indeed not a good idea; it would work "for now", but while we're not going to see 64 bits of physical memory anytime soon, does the "partial implementation" affect virtual addresses as well, or only physical memory?

_________________
Image - carpe noctem
Post 26 Dec 2007, 14:16
View user's profile Send private message Visit poster's website Reply with quote
bitRAKE



Joined: 21 Jul 2003
Posts: 4073
Location: vpcmpistri
bitRAKE 26 Dec 2007, 16:54
f0dder, I was thinking in light of masking the value prior to use as would need to be done if lower bits are used. Insuring the upper X bits are zero for the virtual address space used shouldn't be too difficult in most situations.
Post 26 Dec 2007, 16:54
View user's profile Send private message Visit poster's website Reply with quote
f0dder



Joined: 19 Feb 2004
Posts: 3175
Location: Denmark
f0dder 27 Dec 2007, 02:07
bitRAKE wrote:
f0dder, I was thinking in light of masking the value prior to use as would need to be done if lower bits are used. Insuring the upper X bits are zero for the virtual address space used shouldn't be too difficult in most situations.


Yes, of course you'd mask Smile, but what if windows suddenly decided it wants to allocate very high virtual addresses? b00m, you're dead. Iirc there's a tool for having windows allocate your stuff very high, at least for drivers... and a lot of other interesting things, like randomly failing memory requests etc., to test your code robustness.

Handling alignment properly and using (AND masking Smile) lower bits is okay, but please don't make too many assumptions about virtual memory space... that apps need a special PE header flag set to take advantage of 3/1 user/kernel mapping on 32bit windows shows why.

_________________
Image - carpe noctem
Post 27 Dec 2007, 02:07
View user's profile Send private message Visit poster's website Reply with quote
Display posts from previous:
Post new topic Reply to topic

Jump to:  


< Last Thread | Next Thread >
Forum Rules:
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum
You cannot attach files in this forum
You can download files in this forum


Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.

Website powered by rwasa.