flat assembler
Message board for the users of flat assembler.

FPU vs CPU with quadwords?

calpol2004 20 Feb 2008, 19:58
I've been doing some work with 64-bit quadwords, and so far I've been using the FPU to manipulate them (some aren't floating point). I've been curious how slow or fast the FPU instructions are compared to doing the same work with 32-bit registers and CPU instructions. I am aware that the SSE extensions are faster than the FPU, but I don't want to restrict my application to the Pentium III and upwards.

For example, when copying one quadword to another I do this:

Code:
fild    qword[Count]
fistp   qword[LastCount]  
    


Would it be quicker to, for example, load the value into eax and ecx and store them to the new location, or to do this?

I am about to build an application where I'll use timers to measure an FPU-intensive operation against an equivalent CPU operation, but I'm checking first to see if anyone already knows :/.
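
For comparison, here is a minimal sketch of the integer-register alternative asked about above. It assumes Count and LastCount are the qword variables from the earlier snippet; the choice of eax and ecx is only illustrative. Because no FPU conversion is involved, the 64 bits are copied unchanged.

```asm
; Sketch only: copy the qword at Count to LastCount through two 32-bit registers.
mov     eax, dword [Count]          ; low dword of Count
mov     ecx, dword [Count+4]        ; high dword of Count
mov     dword [LastCount], eax      ; store low dword
mov     dword [LastCount+4], ecx    ; store high dword
```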
revolution 20 Feb 2008, 20:05
You shouldn't use the FPU to move/copy non-floating-point numbers. The FPU likes to change them sometimes, especially when they look like SNaN/QNaN, or if you get >2^53 it will round them off. Use SSE2 integer instructions or MMX.
calpol2004 20 Feb 2008, 20:16
Hmm okay.

Would it count to just have the integers stored as floating-point numbers? I suppose not. I have read a fair bit about FPU inaccuracies and how it's essentially deprecated these days; I'm not even sure it's wise to use it for floating-point operations.

As for SSE2, I don't really want to restrict the application to a rather modern processor unless I have to (I may use it later on; there's probably a way of detecting whether it's available and using the appropriate code).

MMX seems all right though. I use the performance counter in my application, so I'm pretty much restricted to Pentium and up at the moment anyway. I'll look it up.
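
One way the run-time detection mentioned above might look, as a rough sketch: CPUID leaf 1 reports MMX in bit 23 and SSE2 in bit 26 of edx. The three branch labels are hypothetical, and the sketch assumes that the CPUID instruction itself exists (Pentium and later) and that the operating system saves SSE state.

```asm
; Sketch only: choose a code path from the CPUID feature flags.
; .use_sse2, .use_mmx and .use_plain are made-up labels for this example.
        mov     eax, 1
        cpuid                       ; feature flags in edx; clobbers eax, ebx, ecx, edx
        test    edx, 1 shl 26       ; bit 26 = SSE2
        jnz     .use_sse2
        test    edx, 1 shl 23       ; bit 23 = MMX
        jnz     .use_mmx
        jmp     .use_plain          ; fall back to integer or FPU code
```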
revolution 20 Feb 2008, 20:21
Maybe you can try the AMD docs to see if you like them better.
r22 24 Feb 2008, 22:52
Another solution may be to switch the pointers to the QWords as opposed to the actual memory.

Code:
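.data
Count     dq 0
LastCount dq 0

.code
mov esi, Count          ; esi -> current count
mov edi, LastCount      ; edi -> previous count

.Loop:
;; Modify Count through [esi] and [esi+4]
;; Perform some logic
;; Compare LastCount, through [edi] and [edi+4], to something
;; Count -> LastCount: swap the pointers instead of copying the data
xchg esi, edi
jmp .Loop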
    


Calpol - without knowing exactly what you're trying to do I think a variation of the code above could be a more efficient solution to your problem.
edfed 25 Feb 2008, 00:15
Quote:
Another solution may be to switch the pointers to the QWords as opposed to the actual memory.

Or buy an x86-64 machine;
then operating on qwords is as simple as:
add rax,[rbx+rsi*8+8]
calpol2004 28 Feb 2008, 23:35
edfed wrote:
Quote:
Another solution may be to switch the pointers to the QWords as opposed to the actual memory.

Or buy an x86-64 machine;
then operating on qwords is as simple as:
add rax,[rbx+rsi*8+8]


I am actually running on a system with a 64-bit CPU; it's rather frustrating not being able to do that :/.

Here's a simple extract of my code:

Code:
invoke   QueryPerformanceCounter,Count  
fild        qword[Count]
fild        qword[LastCount]
fsubp
fld         dword[Resolution]             
fdivp
fstp       dword[TimeElapsed]
fild        qword[Count]
fistp      qword[LastCount]
...
    



The code is executed as part of a loop. It gets the current timer count, subtracts the previous count from it to get the elapsed ticks, then divides that by the resolution to get the measurement in ms (the resolution is acquired by calling QueryPerformanceFrequency() for ticks/s and dividing that by 1000). I then save the current count in LastCount for the next iteration and use TimeElapsed in my calculations.

I don't think pointer manipulation would work in my case, although in some cases it's quite a good option. The registers would have to be preserved so that when I reference esi/edi on the next iteration they still hold the values I left in them and "remember" which one points to what. I'm using lots of API calls and calls into OpenGL DLLs, not all of which preserve registers.

What I am thinking of doing is "simulating" the quadword operations using 32-bit instructions. That shouldn't be hard for this section of code, as it's just subtraction and movement (MMX movq, or two CPU movs?), but it will take more instructions overall. I know FPU instructions are slower than CPU instructions, but how much slower?

I'm thinking of something like this for the transfer:

Code:
movq    mm0,qword[Count]
movq    qword[LastCount],mm0
emms 
    


And for the subtraction, loading the high half of Count into edx and the low half into ebx, and the high and low halves of LastCount into eax and ecx, and doing some bit magic. Would it be faster to do a scalar MMX subtraction?

Just to add, I do have more code like this; I don't want people to think I'm freaking out over a handful of CPU cycles.
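
The "bit magic" for a 64-bit subtraction on 32-bit registers is usually just a sub/sbb pair. The following is a minimal sketch assuming the Count and LastCount variables from the extract above; Delta is a hypothetical qword scratch variable introduced only for this example, and the register choice differs from the one described in the post.

```asm
; Sketch only: edx:eax = Count - LastCount using 32-bit instructions.
; Delta is a made-up qword variable that receives the result.
mov     eax, dword [Count]          ; low half of Count
mov     edx, dword [Count+4]        ; high half of Count
sub     eax, dword [LastCount]      ; subtract low halves, carry flag = borrow
sbb     edx, dword [LastCount+4]    ; subtract high halves and the borrow
mov     dword [Delta], eax
mov     dword [Delta+4], edx        ; Delta can now be loaded with fild qword [Delta]
```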
revolution 29 Feb 2008, 02:42
Since you are trying to optimise for speed, you might want to consider using fmul instead of fdiv. Pre-calculate the reciprocal and save it as InvResolution.

Also, you can use memory directly as an operand; you don't have to fld each value explicitly.
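
Applied to the extract from the earlier post, both suggestions might look like the sketch below. It assumes InvResolution already holds 1/Resolution as a single-precision float, computed once before the loop; the other names come from the original code.

```asm
; Sketch only: the loop body with the divide replaced by a multiply.
invoke  QueryPerformanceCounter,Count
fild    qword [Count]
fild    qword [LastCount]
fsubp                               ; st0 = Count - LastCount
fmul    dword [InvResolution]       ; memory operand, no separate fld needed
fstp    dword [TimeElapsed]
```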
calpol2004 29 Feb 2008, 13:04
Yes, I realized that a while ago; I'd forgotten to change them over. I just went through the entire program and saved myself a ton of instructions. fisub only works with 32-bit and 16-bit memory operands, though.

Wow, are FPU instructions that slow? So much so that you can spend a handful of CPU instructions in order to optimize away one FPU instruction? I think I'll start looking up some instruction timing tables in the AMD/Intel docs to see exactly how slow FPU instructions are.


Last edited by calpol2004 on 29 Feb 2008, 13:36; edited 2 times in total
revolution 29 Feb 2008, 13:09
calpol2004 wrote:
Are FPU instructions that slow? So much so that you can spend a handful of CPU instructions in order to optimize away one FPU instruction?
div is a very slow instruction in both the integer and floating-point domains. In a loop executing millions of them it can make a huge difference in runtime.
r22 29 Feb 2008, 21:05
Quote:

I don't think pointer manipulation would work in my case, although in some cases it's quite a good option. The registers would have to be preserved so that when I reference esi/edi on the next iteration they still hold the values I left in them and "remember" which one points to what. I'm using lots of API calls and calls into OpenGL DLLs, not all of which preserve registers.

Ironically, your situation is a perfect candidate for pointer exchange. Preserving ESI/EDI is part of the calling convention.

Code:
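.data
align 16
Count dq 0, 0           ; extra padding so PSUBQ can read a full 16 bytes
align 16
LastCount dq 0, 0       ; extra padding
Resolution dd 1/1000
align 16
TimeElapsed dd ?
...
.code
...
mov esi,Count
mov edi,LastCount
...
invoke  QueryPerformanceCounter,esi   ; pass the pointer held in esi
MOVQ    xmm0,qword[esi]               ; xmm0 = new count
PSUBQ   xmm0,[edi]                    ; xmm0 = new count - previous count
MOVQ    qword[edi],xmm0               ; store the delta over the old count
fild    qword[edi]
fmul    dword[Resolution]
fstp    dword[TimeElapsed]
XCHG    esi,edi                       ; edi -> new count for the next pass
...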
    


64-bit version:
Code:
...
invoke QueryPerformanceCounter, Count
MOV rax,[Count]
SUB rax,[LastCount]             ; rax = elapsed ticks
CVTSI2SS xmm0,rax               ; convert to single precision
MULSS xmm0,dword[Resolution]
PUSH [Count]
MOVSS dword[TimeElapsed],xmm0
POP [LastCount]                 ; LastCount = Count
...
    
asmfan 01 Mar 2008, 11:32
Strictly speaking, you should not hard-code the frequency (1/Resolution), as it is obtained via the API; thus, alas, we cannot change the div to a mul.
P.S. Is scalar SSE mul (div) really faster than the ALU one? I think only the packed forms benefit. Am I wrong?
revolution 01 Mar 2008, 11:47
asmfan:

Nobody mentioned hard-coding the frequency. Pre-calculation is not hard-coding!

Also, nobody said that SSE mul/div is faster than the ALU. What post are you reading?
asmfan 01 Mar 2008, 11:57
Everything related to the QueryPerformance* API and r22's code.
revolution 01 Mar 2008, 12:02
I think that r22's code is not to be taken literally. Where you see "Resolution dd 1/1000", the assembler will generate "Resolution dd 0". I expect it was only meant to sketch conceptual ideas for different ways to perform the task.
asmfan 01 Mar 2008, 12:15
By the way, I thought fasm was able to generate FP values from expressions.
If the API is used for measuring something, then no pre-calculation or the like is possible, because the frequency is not known in advance; it is only available via the API.
And regarding r22's code, I asked whether scalar SSE is faster than the ALU or FPU in x32/x64. Is this code really optimised for execution speed?
revolution 01 Mar 2008, 12:30
fasm does not compute floating-point expressions, only constants. 1/1000 is computed as an integer; hence 1/1000 = 0.
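
To illustrate the difference, a small sketch of the two forms side by side (the label names are made up for this example):

```asm
; Sketch only: integer expression versus floating-point literal in a dd.
AsInteger dd 1/1000     ; integer expression, evaluates to 0
AsFloat   dd 0.001      ; floating-point literal, stored as a single-precision value
```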

We can easily pre-compute the reciprocal of the resolution. Call the API once at program startup, compute 1/resolution and store it in memory. Then just use mul with the pre-computed value you stored.
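
The startup step described above could be sketched roughly as follows. Frequency and Thousand are hypothetical names introduced for this example, InvResolution is the name suggested earlier in the thread, and invoke is assumed to be the macro from fasm's Win32 headers as used in the other snippets. Since the original Resolution is ticks/s divided by 1000, its reciprocal is 1000 divided by the frequency.

```asm
; Sketch only. Data (names other than InvResolution are made up):
Frequency     dq ?                  ; ticks per second, filled in by the API
Thousand      dd 1000.0             ; floating-point literal
InvResolution dd ?                  ; milliseconds per tick

; Code, executed once at program startup:
invoke  QueryPerformanceFrequency,Frequency
fld     dword [Thousand]            ; st0 = 1000.0
fild    qword [Frequency]           ; st0 = ticks/s, st1 = 1000.0
fdivp   st1,st0                     ; st0 = 1000.0 / (ticks/s)
fstp    dword [InvResolution]       ; later: fmul dword [InvResolution]
```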

r22's code needs to use SSE and/or the FPU because we are using floats. The ALU only deals with integers. In theory the FPU and SSE units are roughly equal in speed as far as the raw computations are concerned, but in practical usage SSE is faster and easier to use.
calpol2004 02 Mar 2008, 11:47
r22 wrote:
Ironically your situation is a perfect candidate for pointer exchange. Preserving ESI/EDI is part of the calling convention.


I see. It all seems so obvious now that I've seen how you did it :p. I always forget to align and pad my variables as well. What I'm worried about now is that I wouldn't be able to use esi/edi anywhere else in the loop, and if I did I'd have to save them somewhere temporarily first. But I don't use them yet, and if I did, chances are that code wouldn't be executed on every loop iteration. A register-to-register transfer for temporary storage is pretty fast, so there would still be an increase in speed.

Calculating the reciprocal should still be easy without constants too; adding a few instructions to the initialization is worth it even for just a few cycles taken away at runtime. Exactly how slow is the MMX emms? If I use MMX then I will need this instruction to clear the FPU stack before I use the FPU again, and an entire clear of all the MMX registers seems pretty slow to me :/.
r22 02 Mar 2008, 20:11
dd 1/1000 was indeed a placeholder, as revolution concluded. Sorry for the ambiguity, BUT since the PURPOSE of the post was to outline POINTER EXCHANGING (and how it's only optimal in a 32-bit situation) and not best practice for mul vs div, I didn't feel obligated to elaborate.

calpol - EMMS use is guess-and-check: start without it and only use it if your results aren't correct or an FPU exception occurs. BUT the XMM code wasn't the point; the point was the XCHG ESI,EDI. You don't have to use ESI and EDI; you can use EBX and/or EBP, because they are all preserved by the calling convention (32-bit Windows).
rugxulo 04 Mar 2008, 15:59
calpol2004 wrote:
Exactly how slow is the MMX emms? If I use MMX then I will need this instruction to clear the FPU stack before I use the FPU again, and an entire clear of all the MMX registers seems pretty slow to me :/.

If you have an AMD CPU, use femms if you don't need it to be cleared to zero.

Copyright © 1999-2025, Tomasz Grysztar.