flat assembler
Message board for the users of flat assembler.

Index > Main > traversing memory, fastest way to find a "null"

adnimo
Joined: 18 Jul 2008
Posts: 49
hi guys,

I need a fast method to find the first null inside a memory area.

Imagine a 1D array of N elements, each element being 32 bits in length. An element may be 0 or not; if it's 0 we stop, else we continue until we reach the limit of the array or we find a 0.

I tried a simple linear iteration and, well, it ended up being the bottleneck of my application.

Any ideas on how to do this the fast way?

The code I'm working on is a simple memory pool library but as I mentioned above, the only bottleneck it has is the search.

I don't know whether it would be feasible to avoid said step at all or not, right now I need to find a way to solve this bottleneck or else the code becomes useless.


PS: the array could have 10000 elements for all we know, and that's where things slow down, since I'm traversing it on each call to the "malloc" routine.
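The linear scan being described, as a minimal C sketch (the function name is illustrative, not from the poster's actual library):

```c
#include <stddef.h>

/* O(n) baseline: return the index of the first zero dword,
   or -1 if every slot is in use. */
ptrdiff_t find_first_zero(const unsigned int *slots, size_t count)
{
    for (size_t i = 0; i < count; i++)
        if (slots[i] == 0)
            return (ptrdiff_t)i;
    return -1;
}
```

This is the O(n) baseline the rest of the thread tries to beat.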
Post 02 Nov 2008, 04:24
revolution
When all else fails, read the source
Joined: 24 Aug 2004
Posts: 17279
Location: In your JS exploiting you and your system
Faster? Under what conditions? Which CPU(s)? Multi-core aware? Do you mostly expect short searches or long searches? String array handling? How many strings? Byte, word or dword? Which OS? Are the data already in the cache or external memory? SDRAM clock speed? How big is the L1 cache? L2 cache? L3 cache?

There are so many ways in which to test for "fastest". And one person's results may not match another person's.
Post 02 Nov 2008, 04:44
adnimo
I cannot answer all the questions, since I do not control what other people use, but the OS is XP and the 'array' is just a contiguous chunk of memory; no strings here, just dwords.

the array could have 10 or 50000 elements. I need to find a way that would allow me to quickly find an empty slot so I can flag it.

right now the linear run I'm doing is just too slow, especially with a huge number of elements.

PS: in my original post I did mention 32bit elements.
Post 02 Nov 2008, 04:55
revolution
adnimo wrote:
I cannot answer to all the questions since I do not control what other people use but the OS is XP ...
So there is no answer to your original problem, for exactly that reason. High-speed memory traversal relies on many factors, and most of those factors are out of your control, so no single solution can be forthcoming.

However, instead of trying to win the impossible battle of memory traversal, perhaps you can consider a different approach. Turn the problem on its head and design a different method of doing what you need. You have posted no code so I can't give you any specific suggestions, but things like binary search trees might be able to help you. Or maybe have the worker code keep track of its highest memory usage and give that to the allocator function as a head start. Perhaps other things also; just start thinking about alternatives.
Post 02 Nov 2008, 05:04
adnimo
I have, but I'm still trying to figure out the best way to do it.

here is an example of the data and how it changes throughout runtime:

initial array:

0
0
0
0
0

after a few allocations on the pool

X
X
X
0
0

we release one of the handles

X
0
X
0
0

now, in that case the linear search would be a breeze (if we wanted to find the first 0 in the array), but imagine 50,000 elements where a 0 could only be found after traversing 30k+ elements... I know I need to rethink this, I just don't have a solid idea of what could benefit me in this case.
Post 02 Nov 2008, 05:09
revolution
Maybe use a bitmap.

Maybe when freeing a handle always move the last item into the freed position.
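A minimal C sketch of the bitmap suggestion, assuming one bit per slot with 1 meaning free (all names and sizes here are illustrative); `__builtin_ctzll` compiles down to BSF/TZCNT, so a single instruction scans 64 slots at once:

```c
#include <stdint.h>

#define NSLOTS 256
static uint64_t freemap[NSLOTS / 64];   /* 1 bit per slot, 1 = free */

void bitmap_init(void)                  /* mark every slot free */
{
    for (int i = 0; i < NSLOTS / 64; i++)
        freemap[i] = ~0ull;
}

int bitmap_alloc(void)                  /* returns slot index, or -1 if full */
{
    for (int i = 0; i < NSLOTS / 64; i++)
        if (freemap[i]) {
            int bit = __builtin_ctzll(freemap[i]);  /* lowest set bit */
            freemap[i] &= freemap[i] - 1;           /* clear it */
            return i * 64 + bit;
        }
    return -1;
}

void bitmap_free(int slot)              /* set the slot's bit again */
{
    freemap[slot / 64] |= 1ull << (slot % 64);
}
```

For 50,000 slots the bitmap is under 8 KB, so the whole search structure fits comfortably in L1 cache.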
Post 02 Nov 2008, 05:13
sinsi
Joined: 10 Aug 2007
Posts: 693
Location: Adelaide
A REP prefix is supposed to work better the larger the memory area is, so "repne scasd" with eax=0 (scan while not equal to zero) can be fast.
Post 02 Nov 2008, 06:24
r22
Joined: 27 Dec 2004
Posts: 805
Like everything else in life, SIMD is fastest.
If you can yell at your 4 kids to come inside ALL AT ONCE, it's 4x more efficient than yelling at each one individually.
Now that my original Star Trek-esque oversimplification is complete.

***NOTE***
The following code will run slower for small arrays; you can add case handling for different sizes to use REP SCASD or something.
You could also expand this code to check 64 bytes at a time (16 DWORDs).
Also, it's not tested, nor has it been compiled; frankly, I just wrote it in the web browser.

Code:
.data
ArrayPTR dd ? ;;address of the memory
;;MAKE SURE THE MEMORY IS 16-byte aligned
;;64-byte alignment would probably be even better (avg cache read etc.)
ArrayLEN dd ? ;;size in DWORDs
;;MAKE SURE ArrayLEN is a multiple of 4 DWORDs (16 bytes)
;; (because i'm too lazy to write an algorithm with corner cases)

.code
ALIGN 16
FindZero: ;;returns index of a zero element, or -1 for failure
PUSH ebx
MOV ecx,[ArrayLEN]
XOR ebx,ebx
MOV edx,[ArrayPTR]
SUB ebx,ecx ;; ebx = -count
LEA edx,[edx+ecx*4] ;;now edx+ebx*4 = first element ;;4 = DWORD size in bytes
PXOR xmm4,xmm4 ;;xmm4 = all zeros, the comparand
AMDPad16 ;;<- a hacky NOP padding macro, search the forum for it
.Search:
;;get 16 bytes (4 DWORDs) of data
MOVDQA xmm0,[edx+ebx*4]
;;compare each DWORD against zero
PCMPEQD xmm0,xmm4
;;collapse the compare result into a byte mask
PMOVMSKB eax,xmm0
BSF eax,eax ;;find lowest set bit; ZF is set if the mask was empty
JZ .Skip
;;We found a zeroed DWORD
;;eax will be either 0, 4, 8, or 12
SHR eax,2 ;;divide by 4 (eax = 0, 1, 2, or 3)
ADD ebx,ecx ;;normalize the index pointer
ADD eax,ebx
JMP .End
.Skip:
ADD ebx,4
JNZ .Search
MOV eax,-1 ;;Failed to find any zeros
.End:
POP ebx
ret 0


This snippet should get you started.
Post 02 Nov 2008, 17:29
adnimo
hi, thanks for the suggestions/code. I'll be reading the latter for a while until it makes sense to me.

today I woke up thinking about skip lists, and that's almost what I did, although it feels more like an unrolled list to me.

what I did was:

a) use "buckets" of N elements. I chose 64 elements minus 1 to round up, since I have a dword that acts as a counter at the end.

b) when in need of a new handle, traverse as before, but only do so if the bucket is not full

this gave me a 4.5x speed increase over my original linear search, but my code is still slower than if I were to allocate and free small portions of memory (using the heap under Windows in this case), at least on modern PCs.

any ideas?

PS: I mention 'modern PCs' since I ran a few benchmarks on an old P2 and my method was actually faster than doing the allocs and frees with the OS API instead of a big memory pool. but it wasn't a valid test, since it wasn't even the same OS to begin with.
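The bucket scheme described above might look like this in C: a sketch under the stated 63-slots-plus-counter layout, with all names illustrative. Full buckets cost one compare instead of a 63-slot scan:

```c
#include <stddef.h>

#define BUCKET_SLOTS 63

struct bucket {
    unsigned int slot[BUCKET_SLOTS];   /* 0 = free */
    unsigned int used;                 /* the trailing counter dword */
};

/* Find and flag a free slot; returns its global index, or -1 if full. */
ptrdiff_t pool_alloc(struct bucket *b, size_t nbuckets)
{
    for (size_t i = 0; i < nbuckets; i++) {
        if (b[i].used == BUCKET_SLOTS)
            continue;                  /* full bucket: skip without scanning */
        for (size_t j = 0; j < BUCKET_SLOTS; j++)
            if (b[i].slot[j] == 0) {
                b[i].slot[j] = 1;      /* flag the slot as taken */
                b[i].used++;
                return (ptrdiff_t)(i * BUCKET_SLOTS + j);
            }
    }
    return -1;
}
```

Note the inner scan is still linear; only the skip over full buckets is saved, which matches the observed 4.5x rather than an order-of-magnitude gain.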
Post 03 Nov 2008, 02:23
baldr
Joined: 19 Mar 2008
Posts: 1651
adnimo,

You may find it useful to have a hint variable containing index into your array, defined as follows:

Initially: hint == 0.

Allocation: search from hint for free slot in array, allocate it and store index+1 in hint.

Deallocation: free the slot, if index<hint then store it in hint.

You can even use the low bits of a (properly aligned) pointer to store the hint (VirtualAlloc() is your friend), though it will be hardware-dependent. I mean, use VirtualAlloc() to reserve/commit memory for the array. The address is guaranteed to be aligned on a 64 kB boundary, so you have 16 bits to store something useful (don't forget to mask them out for VirtualFree()). Even more, there is the PAGE_GUARD protection flag, which is fun to try out.
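The hint scheme above, sketched in C (names are illustrative): the invariant is that every slot below hint is allocated, so allocation resumes the scan there instead of at index 0.

```c
#include <stddef.h>

static size_t hint;                    /* initially 0 */

/* Search from hint for a free slot, flag it, store index+1 in hint. */
ptrdiff_t hint_alloc(unsigned int *slots, size_t count)
{
    for (size_t i = hint; i < count; i++)
        if (slots[i] == 0) {
            slots[i] = 1;
            hint = i + 1;              /* next search starts here */
            return (ptrdiff_t)i;
        }
    return -1;
}

/* Free the slot; if index < hint, lower hint to keep the invariant. */
void hint_free(unsigned int *slots, size_t index)
{
    slots[index] = 0;
    if (index < hint)
        hint = index;
}
```

In the common pattern of many allocations with occasional frees, most calls find a free slot immediately at hint.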
Post 03 Nov 2008, 10:49
adnimo
I just profiled my code; it seems the free() function is the bottleneck now, and it makes sense: it has to traverse the bucket the handle resides in just to subtract 1 from the count. Before (with the linear search) I just had to assign 0 to the handle I already had.

However, this only happens if the number of handles is small (100 or so); with a big amount (I tested with 100000 handles) the bottleneck becomes the alloc().

I'm puzzled. By bottleneck I mean 99% of the runtime is spent in the alloc(), but only if the number of items is bigger than X. This means that my skip-kind-of-list is not doing as well as I thought it would.

When running both routines in tandem (i.e. free(malloc())) I find that the bottleneck becomes the free(), since it has to do the linear traversal of the bucket.

Now, I kind of have an idea of how to solve the free() bottleneck, but I've no idea how I am going to solve the alloc() deal at all...

Tweaking the bucket size only seemed to worsen things, especially with bigger buckets (I think they just wouldn't fit inside the cache?).


One thing I noticed is that, when running in pairs of free(malloc()), most of the time is spent traversing the list, because it goes forwards and all of the bucket items are empty, so in order to reach the count value it has to go through all of them.

How fast are reverse loops? Does it even matter in which direction I iterate?

In the case where buckets will only be half empty or less, using a reverse loop would help (i.e. I would have to put the count value before the handles array instead of having it after).

The problem is, I think most of the time the buckets will be half full, and thus having the count value after the array and traversing it in a forward motion seems to make more sense. Yet a test with full buckets showed that it's much slower than the backward traverse (by a factor of 5).

I'm so lost at the moment, I don't know what else to do!

The hint idea seems usable, especially if I can get it to hint which bucket I can start from, but I think this will only give me a biased result (some buckets will always be half empty and whatnot).

Hmm... Another test showed that dynamically sized buckets would be much better (i.e. they should be bigger if the total number of items the pool can handle is big, or better said, the bucket count should be relative to the maximum number of handles allowed).
Post 03 Nov 2008, 14:46
revolution
Perhaps you might like to read about linked-lists.
Post 03 Nov 2008, 15:28
adnimo
I know what they are. I implemented single, double and unrolled linked lists, and it's the worst format you could use for this, especially given the amount of space you waste (it's about 80% waste for double linked lists and around 40% for unrolled).

Plus, a skip list would be the best way to go, yet this is a special case IMO. If you read carefully you'll notice I already have a skip-list implementation (kind of).

with a normal linked list I would gain nothing; in fact, I would lose everything. with double linked lists I sure could get to the head in a snap, but I'd still have a huge waste of cycles managing the list and whatnot.

I need to either store the head pointer every 4 bytes in the list (a big, big waste of memory) or find a better way of locating empty handles in the buckets. Right now the bottleneck is on both the alloc and free sides, depending on the test scenario.

It seems as if using dynamically sized buckets could help further optimize the memory pool handling for the mallocs, but it adds some complexity to the code; right now I think this would only make things worse, because it only solves one small problem that just happens to occur when you use a huge number of items (and that won't always be the case, since I'll have a small amount most of the time).

I'm no guru in data structures; that's why I ask for pointers on all this. I know most of you guys have a lot of experience in the field, and that's why I'm bothering you.
Post 03 Nov 2008, 16:06
revolution
I have used a simple method of allocating 64-byte chunks of memory in the past. It is essentially two linked lists: one list is the allocated chunks and the other is the free chunks (actually stored in the freed memory itself). All I did was move the pointers to the chunks from one list to the other, a simple procedure, as they were either freed or allocated. It needed only a few CPU cycles per allocation/deallocation. No scanning or searching required (except at start-up, when the memory is initially allocated and the free linked list is built for the first time).
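A sketch of this free-list approach in C (names and sizes are illustrative): the free list is threaded through the freed chunks themselves, so both alloc and free are O(1) pointer moves with no scanning at all.

```c
#include <stddef.h>

#define CHUNK_SIZE 64
#define NCHUNKS    1024

/* A chunk is either user data or, while free, a link to the next free chunk. */
union chunk {
    union chunk *next;
    unsigned char data[CHUNK_SIZE];
};

static union chunk arena[NCHUNKS];
static union chunk *free_head;

void pool_init(void)                   /* build the free list once at start-up */
{
    for (size_t i = 0; i < NCHUNKS - 1; i++)
        arena[i].next = &arena[i + 1];
    arena[NCHUNKS - 1].next = NULL;
    free_head = &arena[0];
}

void *chunk_alloc(void)                /* pop the head of the free list */
{
    union chunk *p = free_head;
    if (p)
        free_head = p->next;
    return p;
}

void chunk_free(void *p)               /* push the chunk back onto the list */
{
    union chunk *c = p;
    c->next = free_head;
    free_head = c;
}
```

The trick is that a free chunk's first bytes are unused by definition, so the list costs no extra memory beyond one head pointer.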
Post 03 Nov 2008, 16:21
adnimo
Could you please try to benchmark it against, say, the heap functions in Windows?

I'm skeptical about the speed, but also about the memory usage; linked lists waste a lot of space, especially the double ones.

Either way, I could learn from your results, if that could be possible (i.e. you benchmarking it).

I will try your concept when I get some spare time and benchmark it against my current solution, to see what's best to continue with.
Post 03 Nov 2008, 16:35
revolution
adnimo wrote:
Could you please try to benchmark it against, say, the heap functions in Windows?
Sorry, that's not possible for me to do; I have used this on an ARM system.
adnimo wrote:
I'm skeptical about the speed but also the memory usage, LLs waste a lot of data specially the double ones.
Well, you have to make a trade-off somewhere. Either you must be prepared to accept some extra memory usage, or you must be prepared to accept the performance loss. No algorithm or data structure will be perfect for every situation.
Post 03 Nov 2008, 16:43
adnimo
I'm still having trouble trying to picture your idea.
Post 03 Nov 2008, 16:58
baldr
adnimo,

What do you mean, biased? hint is simply a high-water mark; all slots with index<hint are allocated.

Mapping from hint to the corresponding bucket is straightforward as well. For example, if you have struct { void *slots[63]; int slots_used; } buckets[MAX], then hint%63 is a starting index into buckets[hint/63].slots, and so on…

Better yet, you can (as revolution suggests) use a bitmap and the bsf instruction to find an empty slot. Or use some level of indirection (a tree).

Indeed, some specification for your allocator would allow us to help you more.
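The hint-to-bucket arithmetic above, as a tiny C sketch (assuming the 63-slot buckets from the earlier posts; function names are illustrative):

```c
#define SLOTS_PER_BUCKET 63

/* Which bucket a global hint index falls into. */
unsigned hint_bucket(unsigned hint) { return hint / SLOTS_PER_BUCKET; }

/* The starting slot index within that bucket. */
unsigned hint_slot(unsigned hint)   { return hint % SLOTS_PER_BUCKET; }
```

Because 63 is not a power of two, this costs a real division; rounding the bucket up to 64 slots would reduce both operations to a shift and a mask.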
Post 03 Nov 2008, 17:07
LocoDelAssembly
Your code has a bug
Joined: 06 May 2005
Posts: 4633
Location: Argentina
Does your malloc allocate a fixed size on each call? If that is the case, I have an idea that should be O(1) for both malloc and free (which I suppose is the same idea as revolution's, except that I use only one list).

If your malloc is parameter-less, I can write the code, if you promise that you will publish your benchmarks (with the code, so we can test too).
Post 03 Nov 2008, 22:29
adnimo
I was pointed to this paper today at work: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.29.4759

It seems as if this is the holy grail for what I need, but after a quick look at the paper I still haven't had that much-needed epiphany.

LocoDelAssembly: yes, they are of fixed size. I'm trying to implement a memory pool library for my other libraries, since I'm suffering from the heap overhead (the struct appended by Windows to each memory handle is around 12 bytes long...). And this is especially noticeable on 32-bit systems, where I'm using huge linked lists and whatnot!

I would rather read a nice explanation than code in this case, since I would like to comprehend the theory and later on try to implement it by myself (i.e. learn something while I'm at it!), but thanks for the offer.

[edit] on second thought, if you want to post some code... [/edit]
Post 04 Nov 2008, 01:39

Copyright © 1999-2020, Tomasz Grysztar. Also on YouTube, Twitter.

Website powered by rwasa.