flat assembler
Message board for the users of flat assembler.

Index > Main > Reading array on multi-core/-processor system

Goto page 1, 2  Next
Author
Thread Post new topic Reply to topic
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4624
Location: Argentina
LocoDelAssembly 25 Oct 2010, 16:21
Hello Very Happy

I have a question concerning multi-threading:
Code:
findItem: ; EAX = ItemID ; Returns ItemEntry in EDX:EAX
      mov        ecx, ITEMS_POOL_SIZE

.search:
      mov        edx, dword [itemsPool + (ecx - 1)*8 + 4]
      cmp        eax, dword [itemsPool + (ecx - 1)*8 + 0]
      je         .return

      loop       .search

      mov        eax, -1

.return:
      ret    
Does this code works correctly or EDX could be loaded after the value to which EAX was compared has changed? Writers of the items pool use "lock cmpxchg8b" and the array is 8-byte aligned, and when the pool is full it starts to be filled again from index 0, so it kills older elements.
Post 25 Oct 2010, 16:21
View user's profile Send private message Reply with quote
b1528932



Joined: 21 May 2010
Posts: 287
b1528932 25 Oct 2010, 21:06
itemsPool + (ecx - 1)*8 + 4
its not correct SIB syntax?


regarding your question - yes.
Post 25 Oct 2010, 21:06
View user's profile Send private message Reply with quote
ouadji



Joined: 24 Dec 2008
Posts: 1081
Location: Belgium
ouadji 25 Oct 2010, 22:19

only a reading (or compare) causes no problem with multi-tasking.
Problems arise when one writes and changes data in memory.
cmpxch8b seems totally unnecessary here.
and "lock" in addition ? my god ! Razz
useless here !

_________________
I am not young enough to know everything (Oscar Wilde)- Image
Post 25 Oct 2010, 22:19
View user's profile Send private message Send e-mail Reply with quote
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4624
Location: Argentina
LocoDelAssembly 25 Oct 2010, 22:52
Perhaps I've should have explained what I was fearing of Razz

My concern is, that considering we have processors that are capable of executing instructions out of order, and that there is no dependence between the MOV and CMP, perhaps the CPU could execute the CMP first at times, then, another core initiates the LOCK CMPXCHG8b, and then the MOV is executed, if this can happen, then I'm screwed. But perhaps the CPUs arbitrate this somehow, to avoid out of order execution issues?

Another way I though of was doing this instead:
Code:
findItem: ; EAX = ItemID ; Returns ItemEntry in EDX:EAX
      mov        ecx, ITEMS_POOL_SIZE

.search:
      cmp        eax, dword [itemsPool + (ecx - 1)*8 + 0]
      je         .found

      loop       .search

.notFound:
      mov        eax, -1
      ret

.found:
      mov        edx, dword [itemsPool + (ecx - 1)*8 + 4]
      mov        ecx, dword [itemsPool + (ecx - 1)*8 + 0]

      ; I hope no processor in the foreseeable future will optimize out the two instructions below Razz
      xor        ecx, edx
      xor        ecx, edx

      cmp        eax, ecx
      jne        .notFound

      ret    
Now, the processor should be forced to have read both values before comparing identifiers, but I think I'm still not solving the out of order problem actually... Maybe I'll have to also use CMPXCHG8B when reading?

b1528932, your right, but fasm takes care of converting it to [ecx*8 + (itemsPool - 4)] (the value inside the parentheses is resolved at compile-time) which it is what the processor can handle.
Post 25 Oct 2010, 22:52
View user's profile Send private message Reply with quote
ouadji



Joined: 24 Dec 2008
Posts: 1081
Location: Belgium
ouadji 25 Oct 2010, 23:22
Quote:
perhaps the CPU could execute the CMP first at times,
then, another core initiates the LOCK CMPXCHG8b
and then the MOV is executed
If you don't modify any data in memory
there is no risk of conflicts of sequences between cores.
If this was not the case, no software would work!

that said,
you talk about " CMPXCHG8b", but I see no " CMPXCHG8b" in your code.


_________________
I am not young enough to know everything (Oscar Wilde)- Image
Post 25 Oct 2010, 23:22
View user's profile Send private message Send e-mail Reply with quote
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4624
Location: Argentina
LocoDelAssembly 25 Oct 2010, 23:47
Code:
addItem:
      push       ebx edi

      mov        edi, 8
lock  xadd       [poolPointer], edi
      and        edi, 8*ITEMS_POOL_SIZE - 1

      mov        eax, dword [itemsPool + edi + 0]
      mov        edx, dword [itemsPool + edi + 4]
      mov        ebx, someId ; No two items in the pool have the same ID with the exception of -1 (invalid ID)
      mov        ecx, someData ; Uniqueness not guaranteed.
lock  cmpxchg8b  qword [itemsPool + edi]

      pop        edi ebx
      ret    
This may be under execution of another core at any time.

After reading the following in AMD manuals:
Quote:
Locked Instructions—Before completing a locked instruction
(an instruction executed using the LOCK prefix), all
previous reads and writes must be written to memory, and
the locked instruction must complete before completing
subsequent writes. Locked writes are never buffered,
although locked reads and writes are cacheable.

Then, perhaps this is safe (but maybe overkill):
Code:
findItem: ; EAX = ItemID ; Returns ItemEntry in EDX:EAX
      mov        ecx, ITEMS_POOL_SIZE

.search:
      cmp        eax, dword [itemsPool + (ecx - 1)*8 + 0]
      je         .found

      loop       .search

.notFound:
      mov        eax, -1
      ret

.found:
      mov        edx, dword [itemsPool + (ecx - 1)*8 + 4]

lock  cmpxchg    dword [itemsPool + (ecx - 1)*8 + 0], eax
      jnz        .notFound

      ret    
Post 25 Oct 2010, 23:47
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20338
Location: In your JS exploiting you and your system
revolution 25 Oct 2010, 23:51
You have no choice here, you have to lock the memory that you are reading. The lock must be either guaranteed atomic, or guaranteed to exclude all writers while you read.

Why are you not locking? AFIAK no one has discovered a magic code sequence that can avoid using locks. And the lock is almost cost free anyway in most circumstances. If speed is an issue and you found that avoiding the small lock penalty gave you a massive speed boost then it is time to redesign the whole app data flow.
Post 25 Oct 2010, 23:51
View user's profile Send private message Visit poster's website Reply with quote
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4624
Location: Argentina
LocoDelAssembly 26 Oct 2010, 00:29
AFAIK, all processors do perform atomic read/write without the need of locking, LOCK is required (and valid only when used) on read-modify-write instructions. The problem here is that I have two reads, and I must make sure that when the item-id comparison is about to be made, the MOV EDX, ... was already executed and won't hold the value of a different item (with different ID). If the CPU were following program order strictly, this wouldn't be an issue, but seems there are no guarantees of order preservation.

Looks that my latest code is the way to go (besides LFENCE maybe, but I don't want SSE2 dependence).

Thanks to all for following this thread Very Happy
Post 26 Oct 2010, 00:29
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20338
Location: In your JS exploiting you and your system
revolution 26 Oct 2010, 00:43
LOCK will not give any guarantee of order. And LOCK also does not guarantee order among multiple CPUs. LOCK only provides read-modify-write guarantees, it won't give you ordering guarantees.

Your problem is to read two consecutive 32-bit words and make sure that nothing else modifies any one of those values while you are reading. Is that correct?

If so then you have no choice but to protect both those reads somehow to ensure that no other CPU will get in there and make a change. So that means you either have to find an atomic 64-bit instruction (like cmpxchg8b) or use a higher level solution like a mutex or critical section.

One thing I can be sure of is that using a lock on a single dword will not solve your problem.

Note also that there are specific domain solutions that CAN read 64-bit values without a lock but they do not apply to generic situations. Reading a guaranteed monotonically increasing 64-bit counter can be done without a lock, but it is a very specific situation that relies upon the property of the counter.
Post 26 Oct 2010, 00:43
View user's profile Send private message Visit poster's website Reply with quote
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4624
Location: Argentina
LocoDelAssembly 26 Oct 2010, 05:27
Quote:

Your problem is to read two consecutive 32-bit words and make sure that nothing else modifies any one of those values while you are reading. Is that correct?

Partially, writes are always with LOCK CMPXCHG8B, but when reading, since I don't have a 64-bit load instruction, I read the higher half first, and then check if the lower half has the value I was looking for. If the processor performs these operations in program order, it should work, but this is not guaranteed. Therefore, because of the quote I posted above says and which also Intel seems to agree, it looks that my latest code should be working.

Do you see a case in which I could still have EDX loaded with higher half of a different item? (and EAX not set to -1 on return, i.e. situation undetected by the code)

PS: Besides the uniqueness property I've told about the lower part, I forgot to mention that no entry will be overwritten with an item having the same id (lower half).
Post 26 Oct 2010, 05:27
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20338
Location: In your JS exploiting you and your system
revolution 26 Oct 2010, 05:44
Can you make your code MMX dependant?
Code:
movq mm0,[location]    
Post 26 Oct 2010, 05:44
View user's profile Send private message Visit poster's website Reply with quote
b1528932



Joined: 21 May 2010
Posts: 287
b1528932 26 Oct 2010, 11:04
Quote:
If the processor performs these operations in program order, it should work

I dont think so. all access must be atomic. You read high half, then core 2 change it, and you read low half. You have 2 parts of not the same data.

use a spinlock to protect it, or use lock cmpxchg8b. There is no other way.
Post 26 Oct 2010, 11:04
View user's profile Send private message Reply with quote
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4624
Location: Argentina
LocoDelAssembly 26 Oct 2010, 16:06
revolution, no.

b1528932, in the case of happening what you describe, my code would detect the problem because the lower half would be no longer equal to EAX (writes never store the same ID the items pool entry had previously, and writes are 64-bit atomic).

All I need to make sure this will work is to ensure that the higher half is read first and the lower half last or else have both halves read at the same time (but without MMX, SSE or FPU). lock cmpxchg8b would ensure 64-bit atomic read, but it will actually write to memory as well and requires many registers too... If this is really needed then I will have to go for it, but isn't the "lock cmpxchg dword [... + 0], eax" (I wish I had a read-only way) ensuring that the "mov edx, dword [... + 4] " above completes (or at least the data is fetched) first?

To both, the implementation must be wait-free.
Post 26 Oct 2010, 16:06
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20338
Location: In your JS exploiting you and your system
revolution 26 Oct 2010, 16:16
LocoDelAssembly wrote:
(writes never store the same ID the items pool entry had previously, and writes are 64-bit atomic)
In that case then you are good to go with the clock counter solution:
Code:
do_it_again:
       mov     eax,[lower]
 mov     edx,[upper]
 mov     ecx,[lower]
 cmp     ecx,eax
     jnz     .do_it_again
; do your comparing below here    
Post 26 Oct 2010, 16:16
View user's profile Send private message Visit poster's website Reply with quote
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4624
Location: Argentina
LocoDelAssembly 26 Oct 2010, 16:27
Why the processor wouldn't reorder that like the code below?
Code:
do_it_again:
        mov     eax,[lower]
        mov     ecx,[lower]
        mov     edx,[upper] ; LOCK CMPXCHG8B from another code happening here
        cmp     ecx,eax
        jnz     .do_it_again    
Post 26 Oct 2010, 16:27
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20338
Location: In your JS exploiting you and your system
revolution 26 Oct 2010, 16:33
I forgot to show the lock, sorry.
Code:
do_it_again:
        mov     eax,[lower]
lock    mov     edx,[upper]
        mov     ecx,[lower]
        cmp     ecx,eax
        jnz     .do_it_again
; do your comparing below here    
But for anyone else reading this: This is not a general solution. This relies upon the behaviour the the domain values being unique each update cycle.
Post 26 Oct 2010, 16:33
View user's profile Send private message Visit poster's website Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20338
Location: In your JS exploiting you and your system
revolution 26 Oct 2010, 16:40
Oh, I just realised why there was no lock in the code I copied: it is from a lock free solution. If you can modify the writer code ...
Code:
writer:
;...
    ;edx:eax has a new value to store
       mov     [lower_copy],eax
;put lock or some other synchro code here
       mov     [upper],edx
 mov     [lower],eax
;...


;...
reader:
;...
    .read_it:
     mov     eax,[lower]
 mov     edx,[upper]
 mov     ecx,[lower_copy]
    cmp     ecx,eax
     jnz     .read_it    
Post 26 Oct 2010, 16:40
View user's profile Send private message Visit poster's website Reply with quote
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4624
Location: Argentina
LocoDelAssembly 26 Oct 2010, 17:13
revolution, you gave me an extremely happiness moment with your code above but some time after I realized it might not work. I think your code assumes single writer, multiple readers.
Post 26 Oct 2010, 17:13
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20338
Location: In your JS exploiting you and your system
revolution 26 Oct 2010, 17:48
Hmm, you are right. I didn't realise you have multiple writers.

My brain has been so addled this last month, so many mistakes and forgotten things. I gotta get back to my home base, but still got two more weeks here till done ... gotta keep working revolution ... don't stop now.
Post 26 Oct 2010, 17:48
View user's profile Send private message Visit poster's website Reply with quote
bitRAKE



Joined: 21 Jul 2003
Posts: 4039
Location: vpcmpistri
bitRAKE 28 Oct 2010, 07:03
LOCKs occurring within a cacheline shared by multiple cores must quiet the bus (very slow). LOCKs on isolated cachelines use bus in normal fashion. Very Happy

So, don't worry about the LOCK - just make ITEMS_POOL_SIZE large enough for all readers/writers to move around on unique cachelines! (And of course, align all accesses.) I like to make all writers also readers to minimize cache thrashing across the bus during LOCK.
Post 28 Oct 2010, 07:03
View user's profile Send private message Visit poster's website Reply with quote
Display posts from previous:
Post new topic Reply to topic

Jump to:  
Goto page 1, 2  Next

< Last Thread | Next Thread >
Forum Rules:
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum
You cannot attach files in this forum
You can download files in this forum


Copyright © 1999-2024, Tomasz Grysztar. Also on GitHub, YouTube.

Website powered by rwasa.