flat assembler
Message board for the users of flat assembler.
  
FPU vs CPU with quadwords?
revolution 20 Feb 2008, 20:05
You shouldn't use the FPU to move/copy non-floating-point numbers. The FPU likes to change them sometimes, especially when they look like SNaN/QNaN, and if they exceed 2^53 it will round them off. Use SSE2 integer instructions or MMX.
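To make the 2^53 point concrete, here is a minimal C sketch (not from the thread) that round-trips a 64-bit integer through a double, whose significand is 53 bits. The function name `roundtrip_through_double` is made up for illustration. The sketch only models a 53-bit significand; it does not model x87 precision-control settings or the SNaN/QNaN case.

```c
#include <stdint.h>

/* Hypothetical helper: copy a 64-bit integer "through" a double.
 * A double has a 53-bit significand, so integers above 2^53 may be
 * rounded on the way in. */
int64_t roundtrip_through_double(int64_t v) {
    /* volatile forces the value to be stored as a real 64-bit double,
     * so it cannot stay in a wider register. */
    volatile double d = (double)v;
    return (int64_t)d;
}
```

With this helper, 2^53 (9007199254740992) survives the round trip, while 2^53 + 1 comes back as 2^53.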
calpol2004 20 Feb 2008, 20:16
Hmm, okay.
Would it count to just have the integers stored as floating-point numbers? I suppose not. I have read a fair bit about FPU inaccuracies and how it's essentially deprecated these days; I'm not even sure it's wise to use it for floating-point operations. As for SSE2, I don't really want to restrict the application to a rather modern processor unless I have to (I may use it later on; there's probably a way of detecting whether it's available and using the appropriate code). MMX seems all right though. I use the performance counter in my application, so I'm pretty much restricted to Pentium and up at the moment anyway. I'll look it up.
revolution 20 Feb 2008, 20:21
Maybe you can try the AMD docs to see if you like them better.
r22 24 Feb 2008, 22:52
Another solution may be to switch the pointers to the QWords as opposed to the actual memory.
Code:
    .data
    Count: dq 0
    LastCount dq 0
    .code
    MOV esi, Count
    MOV edi, LastCount
    .Loop:
    ;;Modify Count [esi] and [esi+4]
    ;;Perform some logic
    ;;Compare LastCount to something [edi] and [edi+4]
    ;;Count -> LastCount
    XCHG esi,edi
    JMP .Loop
Calpol - without knowing exactly what you're trying to do, I think a variation of the code above could be a more efficient solution to your problem.
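The pointer-exchange idea can be sketched in C as follows. This is only an illustration of the technique, not r22's code: the `CounterPair` type and both function names are hypothetical, and the counter reading is passed in as a plain argument instead of coming from an API call.

```c
#include <stdint.h>

/* Hypothetical pair of counters addressed through two pointers,
 * playing the roles of esi (Count) and edi (LastCount). */
typedef struct {
    uint64_t slots[2];
    uint64_t *count;       /* where the next reading is written */
    uint64_t *last_count;  /* the previous reading */
} CounterPair;

void counter_pair_init(CounterPair *p) {
    p->slots[0] = 0;
    p->slots[1] = 0;
    p->count = &p->slots[0];
    p->last_count = &p->slots[1];
}

/* Store a new reading, compute ticks since the previous one, then swap
 * the two pointers instead of copying the 64-bit value. */
uint64_t counter_pair_update(CounterPair *p, uint64_t reading) {
    *p->count = reading;
    uint64_t elapsed = *p->count - *p->last_count;

    uint64_t *tmp = p->count;   /* the XCHG esi,edi step */
    p->count = p->last_count;
    p->last_count = tmp;
    return elapsed;
}
```

After each call the slot that held the newest reading is treated as the "last" one, so no qword ever has to be moved.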
edfed 25 Feb 2008, 00:15
Quote: Another solution may be to switch the pointers to the QWords as opposed to the actual memory.
Or buy an x86-64 machine; then operating on qwords will be as simple as add rax,[rbx+rsi*8+8]
calpol2004 28 Feb 2008, 23:35
edfed wrote:
I am actually running on a system with a 64-bit CPU; it's rather frustrating not being able to do that :/. Here's a simple extract of my code:
Code:
    invoke QueryPerformanceCounter,Count ; read the current tick count into Count
    fild qword[Count]                    ; st0 = Count
    fild qword[LastCount]                ; st0 = LastCount, st1 = Count
    fsubp                                ; st0 = Count - LastCount
    fld dword[Resolution]                ; st0 = ticks per ms
    fdivp                                ; st0 = elapsed time in ms
    fstp dword[TimeElapsed]
    fild qword[Count]                    ; copy Count to LastCount via the FPU
    fistp qword[LastCount]
    ...
The code gets executed as part of a loop. It gets the current timer count, subtracts the previous count from it to get the ticks elapsed, then divides that by the resolution to get the measurement in ms (the resolution is acquired by calling QueryPerformanceFrequency() for ticks/s and dividing that by 1000). I then save the current count in LastCount for the next iteration and use TimeElapsed for my calculations.
I don't think pointer manipulation would work in my case, although in some cases it's quite a good option, because the registers would have to be preserved so that when I reference esi/edi on the next iteration the same values are still there from when I left, and the registers "remember" which one points to what. I'm using lots of API calls and calls to OpenGL DLLs, not all of which preserve registers.
What I am thinking of doing is "simulating" the quadword operation using 32-bit instructions. That shouldn't be hard for this section of code, as it's just subtraction and movement (MMX movq? or two CPU movs?), but it will take more instructions overall. I know FPU instructions are slower than CPU instructions, but how much slower? I'm thinking something like this for the transfer:
Code:
    movq mm0,qword[Count]
    movq qword[LastCount],mm0
    emms
And for the subtraction, loading the high end of Count into edx and the low end into ebx, and the high and low ends of LastCount into eax and ecx, and doing some bit magic. Would it be faster to do a scalar MMX subtraction?
Just to add, I do have more code like this; I don't want people thinking I'm freaking out over a handful of CPU cycles.
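The "simulated" quadword subtraction described above comes down to subtracting the low halves, noting whether that borrowed, and then subtracting the high halves along with the borrow. On 32-bit x86 this is usually a `sub` on the low dwords followed by an `sbb` on the high dwords. Here is a minimal C sketch of the same arithmetic; the `U64Parts` type and the function name `sub64_via_32` are assumptions made for illustration.

```c
#include <stdint.h>

/* Hypothetical representation of a 64-bit value as two 32-bit halves. */
typedef struct {
    uint32_t lo;
    uint32_t hi;
} U64Parts;

/* Compute a - b using only 32-bit operations. */
U64Parts sub64_via_32(U64Parts a, U64Parts b) {
    U64Parts r;
    r.lo = a.lo - b.lo;                         /* like SUB on the low dwords */
    uint32_t borrow = (a.lo < b.lo) ? 1u : 0u;  /* the carry flag after SUB */
    r.hi = a.hi - b.hi - borrow;                /* like SBB on the high dwords */
    return r;
}
```

For example, subtracting 1 from 0x0000000200000000 borrows from the high half and gives 0x00000001FFFFFFFF.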
revolution 29 Feb 2008, 02:42
Since you are trying to optimise for speed, you might want to consider using fmul instead of fdiv. Pre-calculate the reciprocal and save it as InvResolution.
Also, you can use memory directly as an operand; you don't have to fld each value explicitly.
calpol2004 29 Feb 2008, 13:04
Yes, I realized that a while ago; I'd forgotten to change them over. I just went through the entire program and saved myself a ton of instructions. fisub only works with 32-bit and 16-bit memory operands though. Wow, are FPU instructions that slow? So much so that you can spend a handful of CPU instructions in order to optimize one FPU instruction? I think I'll start looking up some instruction speed tables from the AMD/Intel docs to see exactly how slow FPU instructions are.
Last edited by calpol2004 on 29 Feb 2008, 13:36; edited 2 times in total
revolution 29 Feb 2008, 13:09
calpol2004 wrote: Are fpu instructions that slow? so much so that you can spend a handful of cpu instructions in order to optimize one fpu instruction?
r22 29 Feb 2008, 21:05
Quote:
Ironically, your situation is a perfect candidate for pointer exchange. Preserving ESI/EDI is part of the calling convention.
Code:
    .data
    align 16
    Count dq 0, 0 ;extra padding so PSUBQ can read 16 aligned bytes
    align 16
    LastCount dq 0, 0 ;extra padding
    Resolution dd 1/1000
    align 16
    ElapsedTime rd 1
    ...
    .code
    ...
    mov esi,Count
    mov edi,LastCount
    ...
    invoke QueryPerformanceCounter,esi ;pass the address held in esi
    MOVQ xmm0,qword[esi]
    PSUBQ xmm0,[edi] ;xmm0 = Count - LastCount
    MOVQ qword[edi],xmm0 ;old LastCount slot now holds the difference
    fild qword[edi]
    fmul dword[Resolution]
    fstp dword[ElapsedTime]
    XCHG esi,edi ;the slot just read becomes LastCount
    ...
64-bit version
Code:
    ...
    invoke QueryPerformanceCounter, Count
    MOV rax,[Count]
    SUB rax,[LastCount]
    CVTSI2SS xmm0,rax
    MULSS xmm0,dword[Resolution]
    PUSH [Count]
    MOVSS dword[ElapsedTime],xmm0
    POP [LastCount]
    ...
asmfan 01 Mar 2008, 11:32
Strictly speaking, you should not hard-code the frequency (1/Resolution), as it is obtained via the API, so alas we cannot change div to mul.
P.S. Is SSE scalar mul (div) really faster than the ALU one? I think only packed operations benefit. Am I wrong?
revolution 01 Mar 2008, 11:47
asmfan:
Nobody mentioned hard-coding the frequency. Pre-calculation is not hard-coding! Also, nobody said that SSE mul/div is faster than the ALU. What post are you reading?
asmfan 01 Mar 2008, 11:57
Everything related to QueryPerformance* API and r22 code.
revolution 01 Mar 2008, 12:02
I think that r22's code is not to be taken literally. Where you see this "Resolution dd 1/1000" the assembler will generate "Resolution dd 0". I expect it was meant to be only conceptual ideas of different ways to perform the task.
asmfan 01 Mar 2008, 12:15
By the way, I thought fasm was able to generate FP values from expressions.
If you use the API to measure something, then no precalculation is possible, because the frequency isn't known in advance and is only available via the API. And regarding r22's code, I was asking whether scalar SSE is faster than the ALU or FPU in x32/x64. Is this code really optimised for execution speed?
revolution 01 Mar 2008, 12:30
fasm does not compute floating point expressions, only constants. 1/1000 is computed as an integer, hence 1/1000=0.
We can easily pre-compute the reciprocal of the resolution: call the API once at program startup, compute 1/resolution and store it in memory. Then just use mul with the pre-computed value you stored. r22's code needs to use the SSE and/or FPU because we are using floats. The ALU only deals with integers. In theory the FPU and SSE units are ~equal in speed as far as the raw computations are concerned. But practical usage makes the SSE faster and easier to use.
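The "pre-calculation is not hard-coding" point can be sketched in C. This is a hedged illustration, not code from the thread: the `Timer` type and both function names are hypothetical, and `ticks_per_second` stands in for whatever QueryPerformanceFrequency reports at startup.

```c
#include <stdint.h>

/* Hypothetical timer state: the reciprocal is computed once at startup. */
typedef struct {
    float ms_per_tick;  /* 1 / (ticks per millisecond) */
} Timer;

/* Called once. ticks_per_second is assumed to come from the API at run
 * time, so nothing is hard-coded. Note that the integer expression
 * 1 / 1000 would evaluate to 0, which is why this divides floats. */
void timer_init(Timer *t, int64_t ticks_per_second) {
    t->ms_per_tick = 1000.0f / (float)ticks_per_second;
}

/* Called every loop iteration: one multiply, no divide. */
float timer_elapsed_ms(const Timer *t, int64_t count, int64_t last_count) {
    return (float)(count - last_count) * t->ms_per_tick;
}
```

The division happens once in `timer_init`; the per-iteration path only multiplies by the stored value, which is the fmul-instead-of-fdiv suggestion made earlier in the thread.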
calpol2004 02 Mar 2008, 11:47
r22 wrote:
I see. It all seems so obvious now that I've seen how you did it :p. I always forget to align and pad my variables as well. What I'm worried about now is that I wouldn't be able to use esi/edi anywhere else in the loop, and if I did I'd have to put them somewhere temporarily first. But I don't use them yet, and if I did, the chances are that code wouldn't be executed on every loop iteration. A register-to-register transfer for temporary storage will be pretty fast, so there will still be an increase in speed. Calculating the reciprocal should still be easy without constants too; adding a few instructions to the initialization will be worth it even for just a few cycles taken away from runtime.
Just exactly how slow is the MMX emms? If I use MMX then I will need to use this instruction to clear the FPU stack for when I need to use the FPU again; an entire clear of all the MMX registers seems pretty slow to me :/.
r22 02 Mar 2008, 20:11
dd 1/1000 was indeed a placeholder, as revolution concluded. Sorry for the ambiguity, BUT since the PURPOSE of the post was to outline POINTER EXCHANGING (and how it's only optimal in a 32-bit situation) and not best practice for mul vs div, I didn't feel obligated to elaborate.
calpol - EMMS use is guess and check: start without it and only use it if your results aren't correct or an FPU exception occurs. BUT the XMM code wasn't the point; the point was the XCHG ESI,EDI. You don't have to use ESI and EDI; you can use EBX and/or EBP, because they are all preserved by the calling convention (32-bit Windows).
rugxulo 04 Mar 2008, 15:59
calpol2004 wrote: Just exactly how slow is the mmx emms? If i use MMX then i will need to use this instruction to clear the fpu stack for when i need to use the fpu again, an entire clear of all the mmx registers seems pretty slow to me :/.
If you have AMD, use femms if you don't need it to be cleared to zero.
Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.
Website powered by rwasa.