flat assembler
Message board for the users of flat assembler.

Index > Main > How to flood a CPUs branch prediction buffers with garbage

Author
Thread Post new topic Reply to topic
MCD



Joined: 21 Aug 2004
Posts: 602
Location: Germany
MCD 11 Jan 2008, 17:54
This code tries to flood the branch prediction and branch target buffer
and propably other branch optimizations too.

It's main use should be to stress-test the jump prediction logic
EDIT: there is an updated version below

Code:
use32

flood_BP_BTB:
    push    eax edx
;this should be the size of the instructions of 1 round of the repeat loop
;which also must be equal for the initialization, main and finitialization part
  .code_cell_size= 12

;initialization part
  repeat       31
  rdtsc
       bt      eax,%-1
     db      0Fh,82h ;jc     dword
       dd      (%-2) * .code_cell_size
  end repeat

;random seed.
;Uncomment the %t if you're feeling lucky and this will give you
;a different code each time you compile the code
  d= 03191BCF4h        ;03197BCF1h     ;%t
  h= d
;main part
  repeat 0DAh    ;0FBh
    h= h + 9E3779B97F4A7C15h           ;~ MAX_INT * (sqrt(5)/2 - 1/2)
    d= ((d shl 43 + d shr 21) xor h) + d;some random random-number function
       rdtsc
       bt      eax,(d and 1F00h) shr 8
     db      0Fh,82h ;jc     dword
       dd      ((d and 03Fh) - 20h) * .code_cell_size
  end repeat

;finitialization part
  repeat 32
  rdtsc
       bt      eax,32-%
    db      0Fh,82h ;jc     dword
       dd      (%-33) * .code_cell_size
  end repeat
    pop     edx eax
     ret
    

Actually I find it pretty much useless. Does anyone see a better purpose for it than I do?

I have attached a linux implementation that also measures its number of CPU-cycles the above takes.
You should then clearly see that the execution time is completely random since the path the CPU takes are also completely random.

_________________
MCD - the inevitable return of the Mad Computer Doggy

-||__/
.|+-~
.|| ||


Last edited by MCD on 14 Jan 2008, 22:25; edited 1 time in total
Post 11 Jan 2008, 17:54
View user's profile Send private message Reply with quote
bitRAKE



Joined: 21 Jul 2003
Posts: 4073
Location: vpcmpistri
bitRAKE 11 Jan 2008, 18:15
Isn't having three branches in eight bytes also bad(tm)? Something to do with the way the data is stored in the chip - just rattling off the top of my head - I don't recall exactly.
Post 11 Jan 2008, 18:15
View user's profile Send private message Visit poster's website Reply with quote
Tomasz Grysztar



Joined: 16 Jun 2003
Posts: 8358
Location: Kraków, Poland
Tomasz Grysztar 11 Jan 2008, 22:00
Writing those things like this would be perhaps more flexible (and easier to modify):
Code:
;initialization part
  repeat 31
   .current_cell = $
        rdtsc
        bt      eax,%-1
        jc      near  .current_cell + (%-1) * .code_cell_size
  end repeat
  .code_cell_size = $-.current_cell    

No DD instruction making, no hand-calculated sizes, etc.
Post 11 Jan 2008, 22:00
View user's profile Send private message Visit poster's website Reply with quote
MCD



Joined: 21 Aug 2004
Posts: 602
Location: Germany
MCD 12 Jan 2008, 14:39
Tomasz Grysztar wrote:
Writing those things like this would be perhaps more flexible (and easier to modify):
Code:
;initialization part
  repeat 31
   .current_cell = $
        rdtsc
        bt      eax,%-1
        jc      near  .current_cell + (%-1) * .code_cell_size
  end repeat
  .code_cell_size = $-.current_cell    

No DD instruction making, no hand-calculated sizes, etc.

this is what I initially tried, but it didn't worked for the following reason:

The relative jump offsets are different and difficult to predict even at assembly time. Unfortunately, some of them fit into the short range and others don't (near range). Thus each JC instruction as you supposed it would have a different size and so the entire offset calculation as we both suggested it wouldn't work at all. A correct and working offset calculation would be incredibly complicated.

_________________
MCD - the inevitable return of the Mad Computer Doggy

-||__/
.|+-~
.|| ||
Post 12 Jan 2008, 14:39
View user's profile Send private message Reply with quote
Tomasz Grysztar



Joined: 16 Jun 2003
Posts: 8358
Location: Kraków, Poland
Tomasz Grysztar 12 Jan 2008, 16:47
MCD wrote:
The relative jump offsets are different and difficult to predict even at assembly time. Unfortunately, some of them fit into the short range and others don't (near range). Thus each JC instruction as you supposed it would have a different size and so the entire offset calculation as we both suggested it wouldn't work at all.

That's why I put the "near" keyword there, to ensure the instruction always has the same length (I checked the code before posting, so that it generates exactly the same bytes).
Post 12 Jan 2008, 16:47
View user's profile Send private message Visit poster's website Reply with quote
Madis731



Joined: 25 Sep 2003
Posts: 2139
Location: Estonia
Madis731 12 Jan 2008, 20:23
Have you consulted with Agner - I think when calculating the fused numbers he needs to have some code to deal with "zeroing" (or "trashing") the BTB.
You'll need to have a really clever algorithm to trash it all, because there might be some conditions under which BTB might be fushed and filling starts over again :S or ... I don't know much about the internals.
Post 12 Jan 2008, 20:23
View user's profile Send private message Visit poster's website Yahoo Messenger MSN Messenger Reply with quote
MCD



Joined: 21 Aug 2004
Posts: 602
Location: Germany
MCD 14 Jan 2008, 19:58
Tomasz Grysztar wrote:

That's why I put the "near" keyword there, to ensure the instruction always has the same length (I checked the code before posting, so that it generates exactly the same bytes).

okay. So how does it come that near jumps always have the same size while dword jumps haven't?
I really expected for "jc dword some_where" to always have be relative dword jump? how does this come?
Post 14 Jan 2008, 19:58
View user's profile Send private message Reply with quote
Tomasz Grysztar



Joined: 16 Jun 2003
Posts: 8358
Location: Kraków, Poland
Tomasz Grysztar 14 Jan 2008, 21:53
See this thread: http://board.flatassembler.net/topic.php?t=5162
In short, you can have "jc near dword" or "jc near word" to generate full-size jumps using either the EIP or IP, and "jc short dword" or "jc short word" to generate short jumps, also either EIP or IP.
Post 14 Jan 2008, 21:53
View user's profile Send private message Visit poster's website Reply with quote
MCD



Joined: 21 Aug 2004
Posts: 602
Location: Germany
MCD 14 Jan 2008, 22:12
Tomasz Grysztar wrote:
See this thread: http://board.flatassembler.net/topic.php?t=5162
In short, you can have "jc near dword" or "jc near word" to generate full-size jumps using either the EIP or IP, and "jc short dword" or "jc short word" to generate short jumps, also either EIP or IP.
thanks, I forgot about this.

But another problem I see with your suggested code is that it generates jumps out of the code into nowhere (if I'm right), this is mainly because you placed this line:
Code:
  .code_cell_size = $-.current_cell    
after the repeat loop, so that the .code_cell_size is the size of the whole initialization part, and you jump to a multiple (-1 to 30) of this .code_cell_size forward.

consider the last jump in the initialization part:
% = 31
.code_cell_size = 12 * 31 = 372

so the CPU may jump to:
.current_cell + (31-1) * 372 = .current_cell + 11160 which is definitly out of memory
Post 14 Jan 2008, 22:12
View user's profile Send private message Reply with quote
MCD



Joined: 21 Aug 2004
Posts: 602
Location: Germany
MCD 14 Jan 2008, 22:20
This is an updated version.

I removed the hand-crafted conditional jumps.
I also rearranged the code to be more easy to maintain by automatically calculating the code_cell_size.
I finally added another conditional jump in the code cells which depends on the same bit test.
Code:
use32 
 
flood_BP_BTB:
;this should be the size of all instructions of 1 round of the repeat loop
;which must also be equal for the initialization, main and finitialization part
;we can't know their exact values at this point, so we preset them to 0
init_code_cell_size = 0
code_cell_size           = 0
finit_code_cell_size     = 0

;the actual cell code
macro       init_code_cell  i {
    rdtsc
       bt      eax,i
       jc      near $+6 + (i -1) * init_code_cell_size
     jnc     near $+6 + (61-i) * init_code_cell_size
}
macro      code_cell       i {
    rdtsc
       bt      eax,i and 1Fh
       jc      near $+6 + (i shr 8  and 3Fh - 20h) * code_cell_size
        jnc     near $+6 + (i shr 16 and 3Fh - 20h) * code_cell_size
}
macro finit_code_cell i {
    rdtsc
       bt      eax,i
       jc      near $+6 - (i +    1) * finit_code_cell_size
        jnc     near $+6 - (63-(i*2)) * finit_code_cell_size
}

;define actual code cell sizes now
virtual     at 0
        init_code_cell  0
   init_code_cell_size= $
end virtual
virtual        at 0
        code_cell       0
   code_cell_size= $
end virtual
virtual     at 0
        finit_code_cell 0
   finit_code_cell_size= $
end virtual
;check if all 3 code cell sizes are equal, else code will produce
;unreliable results
if ( init_code_cell_size <> code_cell_size) | \
   (finit_code_cell_size <> code_cell_size)
  display  "ERROR: code cell sizes do not match."
  "ERROR: code cell sizes do not match."
end if

;initialization part
  repeat 31
 init_code_cell  %-1
  end repeat
;random seed:        Uncomment the %t if you're feeling lucky and this will give you
;a different code each time you compile it
  d= 03191BCF4h       ;03197BCF1h     ;%t
  h= d
;main part
  repeat 080h
    h= h + 9E3779B97F4A7C15h            ;~ MAX_INT * (sqrt(5)/2 - 1/2)
    d= ((d shl 43 + d shr 21) xor h) + d;some random random-number function
       code_cell       d
  end repeat
;finitialization part
  repeat 32
   finit_code_cell 32-%
  end repeat
        ret
    


Description: This is a linux implementation with a TSC CPU-cycle measuring
Download
Filename: flood_btb.asm
Filesize: 5.86 KB
Downloaded: 394 Time(s)


_________________
MCD - the inevitable return of the Mad Computer Doggy

-||__/
.|+-~
.|| ||
Post 14 Jan 2008, 22:20
View user's profile Send private message Reply with quote
Tomasz Grysztar



Joined: 16 Jun 2003
Posts: 8358
Location: Kraków, Poland
Tomasz Grysztar 14 Jan 2008, 22:35
MCD wrote:
But another problem I see with your suggested code is that it generates jumps out of the code into nowhere (if I'm right), this is mainly because you placed this line:
Code:
  .code_cell_size = $-.current_cell    
after the repeat loop, so that the .code_cell_size is the size of the whole initialization part, and you jump to a multiple (-1 to 30) of this .code_cell_size forward.

The ".current_cell" is a variable updated on each repetition, so when I do "$-.current_cell" I calculate the size of the last cell, not the whole code.

As I said above - I really checked that the code I gave assembles to exactly the same sequence of bytes. Wink
Post 14 Jan 2008, 22:35
View user's profile Send private message Visit poster's website Reply with quote
Madis731



Joined: 25 Sep 2003
Posts: 2139
Location: Estonia
Madis731 15 Jan 2008, 07:09
Okay, but how accurate theoretically is it now? How much can a benchmark with 0 instructions vary between CPUs & runs? I mean one RDTSC /push eax / RDTSC / sub eax,[esp] could potentially have many problems...
Post 15 Jan 2008, 07:09
View user's profile Send private message Visit poster's website Yahoo Messenger MSN Messenger Reply with quote
Display posts from previous:
Post new topic Reply to topic

Jump to:  


< Last Thread | Next Thread >
Forum Rules:
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum
You cannot attach files in this forum
You can download files in this forum


Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.

Website powered by rwasa.