flat assembler
Message board for the users of flat assembler.

How to speed up hashtable? Nearly all the time L2 cache miss

peter_k



Joined: 27 Dec 2006
Posts: 6
Location: Poland
peter_k 05 Feb 2007, 13:18
I'm writing a program where one of the bottlenecks is accessing elements of a hash-table structure. My hashtable is quite big - 128 MB - and each element is about 8 bytes. The elements are accessed randomly all the time, and nearly every access is an L2 cache miss (I'm not a PC expert, but when the needed data is not in L2, does the processor read the whole 4 kB page from RAM into L2?).

Is it possible to speed up accessing the elements? I don't need to read 4 kB of memory every time (since the accesses are random); I just need those 8 bytes... I know very little about prefetch, prefetcht0, etc. Can I speed up my application using them? Can they read data bypassing L2?

EDIT: There are a few types of hashtables. I use the simplest one - just an ordinary table, but with a huge size. If you don't know what a hashtable is, you can treat mine as a plain array; the problem of accessing random elements is the same.
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4624
Location: Argentina
LocoDelAssembly 05 Feb 2007, 14:56
Quote:

does the processor read the whole 4 kB page from RAM into L2?

No, only a cache line, which can be 16, 32 or 64 bytes depending on the CPU.

If you access the same element twice, or two elements in the same line, the second access should be served from the cache (provided nothing else has flushed that cache line).
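As a rough illustration (the 64-byte line and 8-byte element sizes are my assumptions based on the first post, not measured values), one cache line covers several adjacent elements:
Code:
/* Back-of-the-envelope figures (assumed, not from the thread):       */
/* line size      = 64 bytes                                          */
/* element size   = 8 bytes                                           */
/* elements/line  = 64 / 8 = 8                                        */
/* A miss pulls in one 64-byte line, i.e. 8 adjacent elements; the    */
/* extra 7 only help if a later lookup lands in the same line before  */
/* it is evicted.                                                     */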
zir_blazer



Joined: 05 Dec 2006
Posts: 66
zir_blazer 05 Feb 2007, 15:13
Different processor architectures have different prefetching and prediction hardware, and part of the cache hit/miss ratio depends on how effective it is.
If the hash table uses 128 MB and is read randomly, then the cache miss rate will very probably be VERY high: even with 512 KB or 1 MB of the data in the L2 cache, the next access can still land anywhere in the other 127 MB, so it is far more likely (99% or more?) that it will not hit.
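Back-of-the-envelope arithmetic (mine, using the sizes quoted above) makes those odds concrete:
Code:
/* Uniformly random access, 2 MB of L2 versus a 128 MB table:         */
/* P(hit)  ~   2 / 128 ~  1.6 %                                       */
/* P(miss) ~ 126 / 128 ~ 98.4 %  - before counting TLB misses at all. */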
peter_k



Joined: 27 Dec 2006
Posts: 6
Location: Poland
peter_k 05 Feb 2007, 15:59
LocoDelAssembly wrote:
No, only a cache line, which can be 16, 32 or 64 bytes depending on the CPU.
Hmm, my book says something different ("Code Optimization: Effective Memory Usage" by Kris Kaspersky):
Quote:
(...) The minimum unit of data used when interacting with the memory is not a 32-byte burst cycle; rather, it's an entire page. The delays that occur when accessing the page for the first time are slightly smaller than the time required to read the entire page. This means that, no matter how many bytes of the page are read, processing takes the same time amount as if it were read as a whole. (...)
and
Quote:
(...) When accessing the page for the first time, the processor reads these data from the physical memory into its internal buffer, the Translation Look-aside Buffer (TLB). All further attempts to access the same page occur without delays until this information is removed from the TLB. (...)
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4624
Location: Argentina
LocoDelAssembly 05 Feb 2007, 16:50
I think Mr Kaspersky misunderstood how the TLB works (or those quotes are taken far out of context). The point of the TLB is to cache page translations so that the page tables don't have to be walked on every memory access - BUT IT DOES NOT READ THE ENTIRE 4 KB OF PAGE DATA.

AMD64 Architecture Programmer's Manual Volume 2 -- System Programming Rev 3.12 wrote:
5.5 Translation-Lookaside Buffer (TLB)
When paging is enabled, every memory access has its virtual address automatically translated into a
physical address using the page-translation hierarchy. Translation-lookaside buffers (TLBs), also
known as page-translation caches, nearly eliminate the performance penalty associated with page
translation. TLBs are special on-chip caches that hold the most-recently used virtual-to-physical
address translations. Each memory reference (instruction and data) is checked by the TLB. If the
translation is present in the TLB, it is immediately provided to the processor, thus avoiding external
memory references for accessing page tables.
TLBs take advantage of the principle of locality. That is, if a memory address is referenced, it is likely
that nearby memory addresses will be referenced in the near future. In the context of paging, the
proximity of memory addresses required for locality can be broad—it is equal to the page size. Thus, it
is possible for a large number of addresses to be translated by a small number of page translations. This
high degree of locality means that almost all translations are performed using the on-chip TLBs.
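To make that locality argument concrete for this thread's numbers (the page size and TLB size below are typical assumptions of mine, not figures from the manual):
Code:
/* 128 MB table / 4 KB pages       = 32768 pages                       */
/* ~64-entry data TLB * 4 KB       = 256 KB of coverage                */
/* So a uniformly random access to the 128 MB table misses the TLB     */
/* almost every time, adding a page-table walk on top of the cache     */
/* miss - but it still only reads one cache line, not the whole page.  */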
peter_k



Joined: 27 Dec 2006
Posts: 6
Location: Poland
peter_k 05 Feb 2007, 18:06
Ok

Is it possible that a random memory access takes > 200 cycles?

I made a small experiment (this is not the hashtable described in my first post, just another test of random memory access):

int tab[N]; // I'm doing tests for different sizes of N

I fill the table with random values from 1 to N-1 (no two elements of tab have the same value, AND tab[i] != i for all i < N). Only the last element of tab holds 0.
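For reference, here is one way such a chain can be built (a sketch of mine, not the code actually used for this test; Sattolo's algorithm yields a single random cycle with no fixed points, and cutting the cycle at the starting index gives the terminating 0):
Code:
#include <stdlib.h>

/* Hypothetical helper: turn tab[0..n-1] into one random pointer-chasing
   chain so that actual = tab[actual] visits every element once.        */
static void build_chain(int *tab, int n)
{
    int i, j, tmp;
    for (i = 0; i < n; i++)           /* start from the identity permutation */
        tab[i] = i;
    for (i = n - 1; i > 0; i--) {     /* Sattolo shuffle: j is always < i    */
        j = rand() % i;               /* a stronger RNG is needed for big n  */
        tmp = tab[i]; tab[i] = tab[j]; tab[j] = tmp;
    }
    for (i = 0; i < n; i++)           /* cut the cycle where it returns to   */
        if (tab[i] == 1) {            /* the starting index 1, so the walk   */
            tab[i] = 0;               /* ends on a 0 after n loads           */
            break;
        }
}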

Then I start profiling.
The C code looks something like this (I'm showing only the profiled part):
Code:
                        actual = 1;
                        do {
                                actual = tab[actual];   
                        } while (actual != 0);    
So this code accesses ALL the elements of tab in random order.
The innermost loop translates to just a few assembly instructions:
Code:
again:  mov  eax, [405060 + eax*4]
        test eax, eax
        jnz  again
    
I'm measuring the number of ticks spent in this piece of code, so dividing the total number of cycles by the number of iterations gives the time of one iteration of the loop.

I've run this test for different sizes of N (in kilobytes and megabytes).
After the tests I made a chart:

[Image: chart of cycles per iteration for different table sizes N]

The two instructions test eax, eax and jnz again are quite fast and take about 1-2 cycles per iteration, so nearly all the time is lost on the memory access!

And I forgot - here is my PC specification:
2 * 256MB DDR SDRAM (dual channel) <- probably the most important
Intel Pentium M Processor 730 (1.60 GHz, 2 MB Cache)
System: Microsoft Windows XP Home Edition with SP2

I have 2 MB of L2, and as you can see on the chart, up to 2048 kB the time per iteration is <= 16 cycles (an L1 cache miss).
And when N < 32 kB (fits in L1) the time per iteration is about 3-4 cycles.
That is fine. But for larger values of N the access time keeps growing!

Do you have any suggestions why the memory access time is so big? For N = 128 MB it is 275 cycles... Or maybe this is normal?
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4624
Location: Argentina
LocoDelAssembly 05 Feb 2007, 19:54
Quote:

I fill the table with random values from 1 to N-1 (no two elements of tab have the same value, AND tab[i] != i for all i < N). Only the last element of tab holds 0.


Does that guarantee that the array forces the code to iterate N times?

Probably the access pattern has less locality when N is big, producing both cache misses and TLB misses.

You forgot to specify the memory speed, but assuming your memory has a latency of 100 ns (Everest shows that latency for a Pentium M 730J at 1600 MHz with dual-channel DDR2-400), the CPU waits about 1.6 GHz x 100 ns = 160 cycles to get the data.

Try using VirtualAlloc with PAGE_NOCACHE to see what happens.
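For reference, a minimal sketch of what that allocation could look like (my code, not from the thread; PAGE_NOCACHE is combined with the normal protection flags):
Code:
#include <windows.h>

/* Hypothetical helper: allocate n ints backed by uncached pages, so
   every access goes straight to RAM and bypasses L1/L2.              */
static int *alloc_nocache_table(size_t n)
{
    return (int *)VirtualAlloc(NULL, n * sizeof(int),
                               MEM_RESERVE | MEM_COMMIT,
                               PAGE_READWRITE | PAGE_NOCACHE);
}
/* Release later with VirtualFree(ptr, 0, MEM_RELEASE). */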
peter_k



Joined: 27 Dec 2006
Posts: 6
Location: Poland
peter_k 05 Feb 2007, 20:49
LocoDelAssembly wrote:

You forgot to specify the memory speed, but assuming your memory has a latency of 100 ns (Everest shows that latency for a Pentium M 730J at 1600 MHz with dual-channel DDR2-400), the CPU waits about 1.6 GHz x 100 ns = 160 cycles to get the data.

Sorry, here it is:
[Image: memory specification screenshot]
LocoDelAssembly wrote:

Try using VirtualAlloc with PAGE_NOCACHE to see what happens.

Lol, now every memory access takes a more or less constant time.
For low values of N I'm getting about 212 cycles.
For high values of N I'm getting about 236 (without VirtualAlloc I get more or less the same here; I don't know why the chart I posted showed 275 cycles - now I'm getting lower results. Maybe I forgot to turn on code optimization before).
And that is because the processor is not using L2 when accessing my tab... Interesting. Anyway, thanks for the solution! Wink
MCD



Joined: 21 Aug 2004
Posts: 602
Location: Germany
MCD 06 Feb 2007, 19:33
You could try to prefetch the memory of the next element by inserting an MMX/SSE prefetch instruction, something like this:
Code:
;try either...
again:
        mov     eax, [tab + eax*4]
        test    eax, eax
        prefetchnta [tab + eax*4]   ; or prefetcht1/prefetcht2
        jnz     again

;...or this one
again:
        mov     eax, [tab + eax*4]
        prefetchnta [tab + eax*4]   ; or prefetcht1/prefetcht2
        test    eax, eax
        jnz     again
    

but this implies you are using assembly or have similar inline intrinsics in C Very Happy
Also, I'm not sure prefetching will help much here since the loop is so tight, and newer CPUs may already predict such accesses correctly, so the prefetch instructions may just be superfluous.

You would have to play around a bit with those prefetch instructions: as posted above, there are two possible positions for them, and for each position you should try which of prefetcht0, prefetcht1, prefetcht2 or prefetchnta works best. I'm just too lazy to look it up in the Intel docs.

ALSO keep in mind that such prefetching is CPU-DEPENDENT! It may run faster on your CPU while running worse on others.


Another thing: you shouldn't use hashing for small entries in the first place. Hashing is usually worthwhile only when the actual entries are bigger (and/or dynamic), for example when each entry is a big structure or a string whose length varies greatly (both FASM's preprocessor and assembler use such techniques, with an FNV-1a hash).
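For reference, a minimal 32-bit FNV-1a looks like this (a sketch of mine, not FASM's actual implementation):
Code:
/* 32-bit FNV-1a with the standard offset basis and prime. */
static unsigned int fnv1a(const unsigned char *data, unsigned int len)
{
    unsigned int hash = 2166136261u;   /* FNV offset basis */
    unsigned int i;
    for (i = 0; i < len; i++) {
        hash ^= data[i];
        hash *= 16777619u;             /* FNV prime        */
    }
    return hash;
}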

Also, try to complete all computations on one entry before switching to the next; the reason is to increase memory locality. This C pseudocode should explain what I mean:
Code:
// consider tab to be an array of n ints
// the doActionX functions each do something with the passed value

// not so fast
int i;
for(i=0; i<n; i++){ doAction1(&tab[i]); }
for(i=0; i<n; i++){ doAction2(&tab[i]); }
for(i=0; i<n; i++){ doAction3(&tab[i]); }

// faster
int i;
for(i=0; i<n; i++){
  doAction1(&tab[i]);
  doAction2(&tab[i]);
  doAction3(&tab[i]);
}
    


And if you absolutely must have small entries in a hash table, try to pack multiple values into one hash entry.
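For example (a sketch of mine, not MCD's), several small entries can be packed into one cache-line-sized bucket so that a single memory fetch serves several probes (the 64-byte line size is an assumption):
Code:
/* Hypothetical bucket layout: one bucket == one 64-byte cache line.   */
struct bucket {
    unsigned char count;                        /* slots in use         */
    unsigned char pad[7];                       /* keep slots aligned   */
    struct { unsigned int key, val; } slot[7];  /* 7 * 8 = 56 bytes     */
};  /* sizeof(struct bucket) == 64 on typical 32/64-bit ABIs */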

Only if none of that is possible should you use the prefetch approach described above.

_________________
MCD - the inevitable return of the Mad Computer Doggy

-||__/
.|+-~
.|| ||
r22



Joined: 27 Dec 2004
Posts: 805
r22 08 Feb 2007, 05:21
Peter, you should profile the usage of your actual code and optimize from there. For instance, if values tend to be RE-accessed from the hashtable, then a small ~256-byte ring buffer that holds the last X values fetched from the hash table could speed things up. Iterating through X values (almost always) sitting in L2 BEFORE you do a random access on 128 MB might improve performance, but ONLY if values repeat fairly often in your profiling.
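A rough sketch of that idea (names, sizes and details are mine, purely illustrative):
Code:
/* Tiny "recently used" buffer consulted before the 128 MB table.       */
#define RECENT 32                      /* 32 * 8 bytes = 256 bytes       */

struct recent_entry { unsigned int idx; unsigned int val; };
static struct recent_entry recent[RECENT];  /* assumes index 0 is never
                                               queried, or pre-fill idx
                                               with an impossible value  */

static unsigned int lookup(const unsigned int *table, unsigned int idx)
{
    unsigned int i, v;
    for (i = 0; i < RECENT; i++)       /* small scan that stays in cache */
        if (recent[i].idx == idx)
            return recent[i].val;      /* hit: no 128 MB access needed   */
    v = table[idx];                    /* miss: pay the slow random read */
    recent[idx % RECENT].idx = idx;
    recent[idx % RECENT].val = v;
    return v;
}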

But when dealing with a lookup table larger than ~1 MB you're always going to be constrained by your memory speed; no matter how well you optimize the code, it can't predict random accesses.

If you gave us more information about what exactly you are trying to accomplish with this code, someone might be able to help you (as opposed to the grasping at straws that's been going on up to this point).