flat assembler
Message board for the users of flat assembler.
> Main > traversing memory, fastest way to find a "null"
revolution 02 Nov 2008, 04:44
Faster? Under what conditions? Which CPU(s)? Multi-core aware? Do you mostly expect short searches or long searches? String array handling? How many strings? Byte, word or dword? Which OS? Are the data already in the cache or external memory? SDRAM clock speed? How big is the L1 cache? L2 cache? L3 cache?
There are so many ways to test for "fastest", and one person's results may not match another person's results.
adnimo 02 Nov 2008, 04:55
I cannot answer all the questions since I do not control what other people use, but the OS is XP and the 'array' is just a contiguous chunk of memory: no strings here, just dwords.
The array could have 10 or 50,000 elements. I need a way to quickly find an empty slot so I can flag it. Right now the linear run I'm doing is just too slow, especially with a huge number of elements. PS: in my original post I did mention 32-bit elements.
revolution 02 Nov 2008, 05:04
adnimo wrote: I cannot answer to all the questions since I do not control what other people use but the OS is XP ...

However, instead of trying to win the impossible battle of memory traversal, perhaps you can consider a different approach. Turn the problem on its head and design a different method of doing what you need. You have posted no code so I can't give you any specific suggestions, but things like binary search trees might be able to help you. Or maybe have the worker code keep track of its highest memory usage and give that to the allocator function as a head start. Perhaps other things too; just start thinking about alternatives.
adnimo 02 Nov 2008, 05:09
I have, but I'm still trying to figure out the best way to do it.
Here is an example of the data and how it changes at runtime:

initial array:                     0 0 0 0 0
after a few allocations:           X X X 0 0
after releasing one of the handles: X 0 X 0 0

In that case a linear search would be a breeze (if we wanted to find the first 0 in the array), but imagine 50,000 elements where a 0 might only be found after traversing 30k+ of them... I know I need to rethink this, I just don't have a solid idea of what would benefit me in this case.
revolution 02 Nov 2008, 05:13
Maybe use a bitmap.
Maybe when freeing a handle always move the last item into the freed position. |
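The bitmap idea can be sketched in C roughly as follows. This is only an illustration of the concept, not code from the thread; the names (`bitmap_alloc`, `freemap`, etc.) are mine. One bit per slot means scanning 50,000 slots touches about 6 KB of bitmap instead of 200 KB of dwords, and `__builtin_ctz` is the C-level equivalent of the x86 BSF instruction revolution's suggestion implies.

```c
#include <stdint.h>

#define NSLOTS 50000
#define NWORDS ((NSLOTS + 31) / 32)

static uint32_t freemap[NWORDS];            /* bit set = slot is free */

/* Mark every slot free (call once at start-up). */
static void bitmap_init(void) {
    for (int i = 0; i < NWORDS; i++)
        freemap[i] = 0xFFFFFFFFu;
    for (int b = NSLOTS; b < NWORDS * 32; b++)   /* no slots past NSLOTS */
        freemap[b / 32] &= ~(1u << (b % 32));
}

/* Find a free slot, mark it used, return its index, or -1 if full. */
static int bitmap_alloc(void) {
    for (int i = 0; i < NWORDS; i++) {
        if (freemap[i] != 0) {
            int bit = __builtin_ctz(freemap[i]); /* like BSF */
            freemap[i] &= ~(1u << bit);          /* mark used */
            return i * 32 + bit;
        }
    }
    return -1;
}

/* Release a slot: just set its bit again. */
static void bitmap_free(int slot) {
    freemap[slot / 32] |= 1u << (slot % 32);
}
```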
sinsi 02 Nov 2008, 06:24
Using a REP prefix is supposed to work better the larger the memory area is, so using "repne scasd" with eax=0 can be fast. (It has to be repne, not rep: scasd compares eax against [edi], and repne repeats until they match, i.e. until the zero is found.)
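For comparison, here is what that string-instruction scan does, expressed as a plain C loop (my baseline sketch, not code from the thread). The SIMD and bitmap approaches discussed below are alternatives to exactly this loop.

```c
#include <stddef.h>
#include <stdint.h>

/* Return the index of the first zero dword in arr[0..n-1], or -1.
   Roughly equivalent to: xor eax,eax / mov edi,arr / mov ecx,n /
   repne scasd, then recovering the index from ecx. */
static ptrdiff_t find_zero_linear(const uint32_t *arr, size_t n) {
    for (size_t i = 0; i < n; i++)
        if (arr[i] == 0)
            return (ptrdiff_t)i;
    return -1;
}
```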
r22 02 Nov 2008, 17:29
Like everything else in life, SIMD is fastest.
If you can yell at your four kids to come inside ALL AT ONCE, it's 4x more efficient than yelling at each one individually. Now that my original Star Trek-esque oversimplification is complete...

***NOTE*** The following code will run slower for small arrays; you can add case handling for different sizes to use REP SCASD or something. You could also expand this code to check 64 bytes at a time (16 DWORDs). Also, it's not tested, nor has it been compiled; frankly, I just wrote it in the web browser.

Code:
.data
ArrayPTR dd ?          ;;address of the memory
                       ;;MAKE SURE THE MEMORY IS 16-byte aligned
                       ;;64-byte aligned would probably be even better, avg cache read etc. etc.
ArrayLEN dd ?          ;;size in DWORDs
                       ;;MAKE SURE ArrayLEN is a multiple of 4 DWORDs (16 bytes)
                       ;;(because I'm too lazy to write an algorithm with corner cases)

.code
ALIGN 16
FindZero:              ;;returns index of a zero element, or -1 for failure
        PUSH    ebx
        MOV     ecx,[ArrayLEN]
        XOR     ebx,ebx
        MOV     edx,[ArrayPTR]
        SUB     ebx,ecx          ;;-count
        LEA     edx,[edx+ecx*4]  ;;now edx+ebx*4 = first element (4 = DWORD bytes)
        PXOR    xmm4,xmm4
        AMDPad16                 ;;<- this is a hacky NOP padding macro, search the forum for it
.Search:
        ;;get data
        MOVDQA  xmm0,[edx+ebx*4]
        ;;any zeros?
        PCMPEQD xmm0,xmm4
        ;;check for zeros
        PMOVMSKB eax,xmm0
        BSF     eax,eax
        JZ      .Skip
        ;;we found a zeroed DWORD
        ;;eax will be either 0, 4, 8, or 12
        SHR     eax,2            ;;divide by 4 (eax = 0, 1, 2 or 3)
        ADD     ebx,ecx          ;;normalize the index pointer
        ADD     eax,ebx
        JMP     .End
.Skip:
        ADD     ebx,4
        JNZ     .Search
        MOV     eax,-1           ;;failed to find any zeros
.End:
        POP     ebx
        ret     0

This snippet should get you started.
adnimo 03 Nov 2008, 02:23
Hi, thanks for the suggestions/code. I'll be reading the latter for a while until it makes sense to me.
Today I woke up thinking about skip lists, and that's almost what I did, although it feels more like an unrolled list to me. What I did was:

a) use "buckets" of N elements; I chose 64 elements minus 1 to round up, since I have a dword that acts as a counter at the end.
b) when in need of a new handle, traverse as before, but only if the bucket is not full.

This gave me a 4.5x speed increase over my original linear search, but my code is still slower than allocating and freeing small portions of memory (using the heap under Windows in this case), at least on modern PCs. Any ideas?

PS: I mention 'modern PCs' since I ran a few benchmarks on an old P2 and my method was actually faster there than doing the allocs and frees through the OS API instead of a big memory pool. But it wasn't a valid test since it wasn't even the same OS to begin with.
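The bucket scheme described in a) and b) above can be sketched like this (my reconstruction with illustrative names, not adnimo's actual code). The point is that one compare on the counter skips 63 slots at once when a bucket is full.

```c
#include <stdint.h>

#define SLOTS_PER_BUCKET 63          /* 63 handles + 1 counter = 64 dwords */
#define NBUCKETS 800

struct bucket {
    uint32_t slot[SLOTS_PER_BUCKET]; /* 0 = free */
    uint32_t used;                   /* how many slots are taken */
};

static struct bucket pool[NBUCKETS];

/* Skip any bucket whose counter says it is full; scan linearly only
   inside a bucket known to have a free slot. Returns index or -1. */
static int bucket_alloc(void) {
    for (int b = 0; b < NBUCKETS; b++) {
        if (pool[b].used == SLOTS_PER_BUCKET)
            continue;                /* one compare skips 63 slots */
        for (int s = 0; s < SLOTS_PER_BUCKET; s++) {
            if (pool[b].slot[s] == 0) {
                pool[b].slot[s] = 1; /* flag it */
                pool[b].used++;
                return b * SLOTS_PER_BUCKET + s;
            }
        }
    }
    return -1;
}

static void bucket_free(int idx) {
    struct bucket *b = &pool[idx / SLOTS_PER_BUCKET];
    b->slot[idx % SLOTS_PER_BUCKET] = 0;
    b->used--;
}
```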
baldr 03 Nov 2008, 10:49
adnimo,
You may find it useful to have a hint variable containing an index into your array, defined as follows:

Initially: hint == 0.
Allocation: search from hint for a free slot in the array, allocate it and store index+1 in hint.
Deallocation: free the slot; if index < hint then store index in hint.

You can even use the low bits of a (properly aligned) pointer to store the hint (VirtualAlloc() is your friend), though it will be hardware-dependent. I mean, use VirtualAlloc() to reserve/commit memory for the array. The address is guaranteed to be aligned on a 64 kB boundary, so you have 16 bits to store something useful (don't forget to mask them out for VirtualFree()). Even more, there is the PAGE_GUARD protection flag, which is fun to try out.
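The three rules above translate to very little code. A minimal sketch (names are mine; the invariant is that every slot below `hint` is known to be taken, so searches never rescan the allocated prefix):

```c
#include <stdint.h>

#define N 50000
static uint32_t slots[N];   /* 0 = free, nonzero = taken */
static int hint = 0;        /* all slots with index < hint are taken */

static int hint_alloc(void) {
    for (int i = hint; i < N; i++) {    /* search from hint, not from 0 */
        if (slots[i] == 0) {
            slots[i] = 1;
            hint = i + 1;               /* store index+1 in hint */
            return i;
        }
    }
    return -1;
}

static void hint_free(int i) {
    slots[i] = 0;
    if (i < hint)                       /* lowest free slot moved down */
        hint = i;
}
```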
adnimo 03 Nov 2008, 14:46
I just profiled my code, and it seems the free() function is the bottleneck now. That makes sense: it has to traverse the bucket a handle resides in just to subtract 1 from the count. Before (with the linear search) I just had to assign 0 to the handle I already had.

However, this only happens if the number of handles is small (100 or so); with a big number (I tested with 100,000 handles) the bottleneck becomes the alloc(). I'm puzzled. By bottleneck I mean 99% of the runtime is spent in the alloc(), but only if the number of items is bigger than X. This means that my skip-kind-of-list is not doing as well as I thought it would.

When running both routines in tandem (i.e. free(malloc())) I find that the bottleneck becomes the free(), since it has to do the linear traversal of the bucket. Now, I kind of have an idea of how to solve the free() bottleneck, but I've no idea how I am going to solve the alloc() problem at all... Tweaking the bucket size only seemed to worsen things, especially with bigger buckets (I think they just wouldn't fit inside the cache?).

One thing I noticed is that, when running in pairs of free(malloc()), most of the time goes into traversing the list, because it goes forwards and all of the bucket items are empty, so to reach the count value it has to go through all of them. How fast are reverse loops? Does it even matter in which direction I iterate? In the case where buckets are only half empty or less, a reverse loop would help (i.e. I would have to put the count value before the handles array instead of after it). The problem is, I think most of the time the buckets will be half full, and thus having the count value after the array and traversing forwards seems to make more sense. Yet a test with full buckets showed that it's much slower than the backward traverse (by a factor of 5).

I'm so lost at the moment, I don't know what else to do! The hint idea seems usable, especially if I can get it to hint which bucket to start from, but I think this will only give me a biased result (some buckets will always be half empty and whatnot). Hmm...

Another test showed that dynamically sized buckets would be much better (i.e. they should be bigger if the total number of items the pool can handle is big; better said, the bucket count should be relative to the maximum number of handles allowed).
revolution 03 Nov 2008, 15:28
Perhaps you might like to read about linked-lists.
adnimo 03 Nov 2008, 16:06
I know what they are; I've implemented single, double and unrolled linked lists, and it's the worst structure you could use for this, especially given the amount of space you waste (about 80% waste for double linked lists and around 40% for unrolled).

Plus, a skip list would be the best way to go, yet this is a special case IMO. If you read carefully you'll notice I already have a skip-list implementation (kind of). With a normal linked list I would gain nothing; in fact I would lose everything. With double linked lists I sure could get to the head in a snap, but I'd still have a huge waste of cycles managing the list and whatnot. I need to either store the head pointer every 4 bytes in the list (a big, big waste of memory) or find a better way of locating empty handles in the buckets.

Right now the bottleneck is on both the alloc and free sides, depending on the test scenario. It seems as if using dynamically sized buckets could help further optimize the memory pool handling for the mallocs, but it adds some complexity to the code; right now I think this would only make things worse, because it only solves one small problem that happens to occur when you use a huge number of items (and that won't always be the case, since I'll have a small number most of the time).

I'm no guru in data structures; that's why I ask for pointers on all this. I know most of you guys have a lot of experience in the field, and that's why I'm bothering you.
revolution 03 Nov 2008, 16:21
I have used a simple method of allocating 64-byte chunks of memory in the past. It is essentially two linked lists: one list is the allocated chunks and the other is the free chunks (actually stored inside the freed memory itself). All I did was move the pointers to the chunks from one list to the other as they were freed or allocated, a simple procedure needing only a few CPU cycles per allocation/deallocation. No scanning or searching is required (except at start-up, when the memory is initially allocated and the free list is built for the first time).
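A minimal sketch of the free-list idea described above, in C (my illustration; revolution did not post code, and the names are mine). The key property is that the links live inside the freed chunks themselves, so there is no bookkeeping memory overhead and both operations are O(1).

```c
#include <stddef.h>

#define CHUNK  64                 /* fixed 64-byte chunks */
#define NCHUNK 1024

static unsigned char arena[CHUNK * NCHUNK];
static void *freelist;            /* head, threaded through free chunks */

/* Start-up: link every chunk into the free list. Each free chunk's
   first pointer-sized bytes hold the address of the next free chunk. */
static void pool_init(void) {
    freelist = NULL;
    for (int i = NCHUNK - 1; i >= 0; i--) {
        void *chunk = arena + (size_t)i * CHUNK;
        *(void **)chunk = freelist;
        freelist = chunk;
    }
}

/* O(1): pop the head of the free list (NULL when exhausted). */
static void *pool_alloc(void) {
    void *chunk = freelist;
    if (chunk)
        freelist = *(void **)chunk;
    return chunk;
}

/* O(1): push the chunk back onto the free list. */
static void pool_free(void *chunk) {
    *(void **)chunk = freelist;
    freelist = chunk;
}
```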
adnimo 03 Nov 2008, 16:35
Could you please try to benchmark it against, say, the heap functions in Windows?
I'm skeptical about the speed but also the memory usage; linked lists waste a lot of space, especially double ones. Either way I could learn from your results, if benchmarking it would be possible. I will try your concept when I get some spare time and benchmark it against my current solution, to see what's best to continue with.
revolution 03 Nov 2008, 16:43
adnimo wrote: Could you please try to benchmark it against, say, the heap functions in Windows?

adnimo wrote: I'm skeptical about the speed but also the memory usage, LLs waste a lot of data specially the double ones.
adnimo 03 Nov 2008, 16:58
I'm still having trouble trying to picture your idea
baldr 03 Nov 2008, 17:07
adnimo,
What do you mean, biased? hint is simply a high-water mark; all slots with index < hint are allocated. Mapping from hint to the corresponding bucket is straightforward too. For example, if you have struct { void *slots[63]; int slots_used; } buckets[MAX], then hint%63 is a starting index into buckets[hint/63].slots, and so on…

Better yet, you can (as revolution suggests) use a bitmap and the bsf instruction to find an empty slot. Or use some level of indirection (a tree). Indeed, some specification for your allocator would allow us to help you more.
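The flat-index-to-bucket mapping mentioned above is just a division and a remainder; spelled out as tiny helpers (illustrative names, matching the 63-slot bucket from the struct example):

```c
#define SLOTS_PER_BUCKET 63

/* Which bucket a flat hint index falls in. */
static int bucket_of(int hint) { return hint / SLOTS_PER_BUCKET; }

/* The starting slot inside that bucket. */
static int slot_of(int hint)   { return hint % SLOTS_PER_BUCKET; }
```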
LocoDelAssembly 03 Nov 2008, 22:29
Does your malloc allocate a fixed size on each call? If so, I have an idea that should be O(1) for both malloc and free (which I suppose is the same idea as revolution's, except that I use only one list).
If your malloc is parameter-less I can write the code, if you promise that you will publish your benchmarks (with the code, so we can test too).
adnimo 04 Nov 2008, 01:39
I was pointed to this paper today at work: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.29.4759
It seems as if this is the holy grail for what I need, but after a quick look at the paper I still haven't had the much-needed epiphany.

LocoDelAssembly: yes, they are of fixed size. I'm trying to implement a memory pool library for my other libraries, since I'm suffering from the heap overhead (the struct Windows attaches to each memory handle is around 12 bytes long...). And this is noticeable especially on 32-bit systems where I'm using huge linked lists and whatnot!

I would rather read a nice explanation than code in this case, since I would like to comprehend the theory and later try to implement it myself (i.e. learn something while I'm at it!), but thanks for the offer. [edit] On second thought, if you want to post some code... [/edit]
Copyright © 1999-2024, Tomasz Grysztar. Also on GitHub, YouTube.
Website powered by rwasa.