DavidB3

Joined: 09 Apr 2014
Posts: 3
Hi,

I've been a Delphi programmer for years, but this is my first attempt at asm.

I'm trying to write two functions similar to SumInt (which sums all items of an Integer array), but with two differences:
1) They should work with Byte and Word arrays;
2) They should sum only a part of the array.

The original SumInt function from Delphi:

Code:
```
function SumInt(const Data: array of Integer): Integer;
asm  // IN: EAX = ptr to Data, EDX = High(Data) = Count - 1
     // loop unrolled 4 times, 5 clocks per loop, 1.2 clocks per datum
      PUSH EBX
      MOV  ECX, EAX         // ecx = ptr to data
      MOV  EBX, EDX
      XOR  EAX, EAX
      AND  EDX, not 3
      AND  EBX, 3
      SHL  EDX, 2
      JMP  @Vector.Pointer[EBX*4]
@Vector:
      DD @@1
      DD @@2
      DD @@3
      DD @@4
@@4:
      ADD  EAX, [ECX+12+EDX]
      JO   RaiseOverflowError
@@3:
      ADD  EAX, [ECX+8+EDX]
      JO   RaiseOverflowError
@@2:
      ADD  EAX, [ECX+4+EDX]
      JO   RaiseOverflowError
@@1:
      ADD  EAX, [ECX+EDX]
      JO   RaiseOverflowError
      SUB  EDX,16
      JNS  @@4
      POP  EBX
end;
```

So far I've written these:

Code:
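```
function SumByte(const PDataStart: Pointer; const Count: Integer): Integer;
// EAX is PDataStart and the result, EDX is Count
asm
      MOV  ECX, EAX         // ecx = ptr to data
      XOR  EAX, EAX         // sum = 0
      CMP  EDX, 0
      JE   @end             // Count = 0: nothing to sum (a negative Count is not guarded against)
      PUSH EBX
      XOR  EBX, EBX         // clear the upper 24 bits once; only BL is loaded below
      SUB  ECX, 1           // bias the pointer so [ECX + EDX] addresses item EDX - 1
@loop:
      MOV  BL, [ECX + EDX]  // walk from the last item down to the first
      ADD  EAX, EBX
      DEC  EDX
      JNZ  @loop
      POP  EBX
@end:
end;

function SumWord(const PDataStart: Pointer; const Count: Integer): Integer;
// EAX is PDataStart and the result, EDX is Count
asm
      MOV  ECX, EAX         // ecx = ptr to data
      XOR  EAX, EAX         // sum = 0
      CMP  EDX, 0
      JE   @end             // Count = 0: nothing to sum (a negative Count is not guarded against)
      PUSH EBX
      SUB  ECX, 2           // bias the pointer by one Word
      SHL  EDX, 1           // edx = Count * 2 = byte offset just past the last item
      XOR  EBX, EBX         // clear the upper 16 bits once; only BX is loaded below
@loop:
      MOV  BX, [ECX + EDX]  // walk from the last item down to the first
      ADD  EAX, EBX
      SUB  EDX, 2
      JNZ  @loop
      POP  EBX
@end:
end;
```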

Usage: SumByte/SumWord(@Array[Start], Count)

They seem to work but:
1) I don't know if it's the fastest code. Is there a faster (but still safe) version?
2) The speed tests give some weird results. Sometimes the functions are faster than regular code, sometimes slower, and sometimes the same speed. It depends on how and where I add other code outside the test and outside these functions (which theoretically shouldn't influence the result).
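
For reference, the "regular code" that the asm routines compete against could look like the sketch below. It is not taken from the post: the names `SumBytePas` and `SumWordPas`, the pointer types, and the demo arrays are all hypothetical, chosen only so the signatures match `SumByte` and `SumWord` and the results can be compared directly.

```pascal
// Hypothetical pure-Pascal baselines; names, types and demo data are illustrative.
type
  PByteItem = ^Byte;
  PWordItem = ^Word;

const
  DemoBytes: array[0..4] of Byte = (1, 2, 3, 250, 255);
  DemoWords: array[0..3] of Word = (1, 65535, 2, 3);

function SumBytePas(const PDataStart: Pointer; const Count: Integer): Integer;
var
  P: PByteItem;
  I: Integer;
begin
  Result := 0;
  P := PDataStart;
  for I := 1 to Count do   // runs zero times when Count <= 0
  begin
    Inc(Result, P^);
    Inc(P);                // typed pointer: advances by SizeOf(Byte)
  end;
end;

function SumWordPas(const PDataStart: Pointer; const Count: Integer): Integer;
var
  P: PWordItem;
  I: Integer;
begin
  Result := 0;
  P := PDataStart;
  for I := 1 to Count do   // runs zero times when Count <= 0
  begin
    Inc(Result, P^);
    Inc(P);                // typed pointer: advances by SizeOf(Word)
  end;
end;
```

With the demo data, `SumBytePas(@DemoBytes[1], 3)` adds 2 + 3 + 250 and `SumWordPas(@DemoWords[0], 2)` adds 1 + 65535. Calling the asm versions with the same arguments should give the same sums, which is a cheap correctness check before any timing.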

Could you please help me?
Thank you.

Delphi versions and OS: Delphi 7 on Windows XP SP3, and Delphi XE5 on Windows 7 SP1 (x86)

Regards,
David

PS: I tried to use Code tags, but they don't seem to work, sorry.

Use the forward slash. Okay, I already did it for you. revolution
09 Apr 2014, 06:27
revolution
When all else fails, read the source

Joined: 24 Aug 2004
Posts: 17853
Location: In your JS exploiting you and your system
Speed tests for something like this will not be useful. As you saw the results vary a lot because internally the CPU is doing other things that sometimes do and sometimes don't affect the timings.

I'd suggest that you need to look at your entire program to see if and where it needs optimising. Little functions like this are only worthwhile looking at if such a thing is called thousands of times per second for endless hours where saving 5% runtime might actually have some realtime benefit.

Also, what might be faster on your system could be slower on another system. If your algorithm is strong then such small linear timing details won't matter much.
09 Apr 2014, 06:36
DavidB3
Thank you.

revolution wrote:
Speed tests for something like this will not be useful. As you saw the results vary a lot because internally the CPU is doing other things that sometimes do and sometimes don't affect the timings.

OK, but how can I know whether it's faster than regular code and worth using?

revolution wrote:
I'd suggest that you need to look at your entire program to see if and where it needs optimising. Little functions like this are only worthwhile looking at if such a thing is called thousands of times per second for endless hours where saving 5% runtime might actually have some realtime benefit.

I have already optimized most of the code.
And these functions are sometimes called millions of times per minute, so it's worth trying to code them in assembler.
09 Apr 2014, 07:06
revolution
DavidB3 wrote:
OK, but how can I know whether it's faster than regular code and worth using?
If you can't detect any significant changes in runtime then that is your answer: It is neither faster nor slower.

If it was significantly faster, and you are calling it millions of times, then you would know pretty quickly by the reduced runtimes.
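
One way to make "detecting a change in runtime" less sensitive to whatever else the CPU is doing is to time many batches of calls and keep only the fastest batch. The sketch below is an assumption, not something from this thread: `TSumFunc`, `BestBatchMs`, `RefSumByte`, `TestData` and `Sink` are hypothetical names, and it uses `SysUtils.Now`, whose resolution is coarse, so each batch has to run long enough to dwarf that.

```pascal
uses
  SysUtils;  // for Now

type
  TSumFunc = function(const PDataStart: Pointer; const Count: Integer): Integer;
  PByteItem = ^Byte;

var
  Sink: Integer = 0;  // global sink, so the timed calls have a visible effect

const
  TestData: array[0..7] of Byte = (1, 2, 3, 4, 5, 6, 7, 8);

// Stand-in for the routine under test (hypothetical); swap in SumByte or SumWord.
function RefSumByte(const PDataStart: Pointer; const Count: Integer): Integer;
var
  P: PByteItem;
  I: Integer;
begin
  Result := 0;
  P := PDataStart;
  for I := 1 to Count do
  begin
    Inc(Result, P^);
    Inc(P);
  end;
end;

// Runs Batches batches of CallsPerBatch calls each and returns the time of
// the fastest batch in milliseconds, i.e. the batch least disturbed by
// other activity. Returns -1 if Batches < 1.
function BestBatchMs(F: TSumFunc; P: Pointer;
  Count, CallsPerBatch, Batches: Integer): Double;
var
  B, I: Integer;
  T0: TDateTime;
  Ms: Double;
begin
  Result := -1.0;
  for B := 1 to Batches do
  begin
    T0 := Now;
    for I := 1 to CallsPerBatch do
      Sink := Sink xor F(P, Count);   // xor cannot raise an overflow error
    Ms := (Now - T0) * 86400000.0;    // TDateTime counts days; convert to ms
    if (Result < 0) or (Ms < Result) then
      Result := Ms;
  end;
end;
```

A call such as `BestBatchMs(RefSumByte, @TestData[0], 8, 1000000, 20)` can then be compared with the same call using the asm routine. On Windows, `QueryPerformanceCounter` from the `Windows` unit offers finer resolution than `Now` and could replace it here. Even the minimum will still shift when code or data alignment changes, so this reduces noise but does not remove the effects discussed in this thread.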

Optimising for speed is hard. It is not just a simple matter of writing a tighter loop or avoiding div. Things like algorithm selection, cache management and streaming data to DRAM etc. are where the majority of speed-ups are to be found. Usually only horribly inefficient code will be responsive to simplistic instruction level improvements. And these days many compilers do a reasonable job of not producing horribly inefficient code (as long as you get the algorithms right).
09 Apr 2014, 08:44
DavidB3
revolution wrote:
If you can't detect any significant changes in runtime then that is your answer: It is neither faster nor slower.

If it was significantly faster, and you are calling it millions of times, then you would know pretty quickly by the reduced runtimes.

As I mentioned, the test results are weird.
I tested directly with the application's code, and it showed a ~23% speed increase.
OK. I added some code in other areas (that has nothing to do with the code involved in the test). I tested again, and this time there was no difference in speed (?!).
I'm not new to this kind of testing, and I have ALWAYS gotten consistent results.
This happens ONLY when I start using the assembler from Delphi.
So either it's a bug in the Delphi assembler (but I doubt that), or it's a bug in my assembler code that lets it access memory it shouldn't.
09 Apr 2014, 09:16
revolution
DavidB3 wrote:
I added some code in other areas (that has nothing to do with the code involved in the test).
Well, that could be your problem. Everything the CPU does affects the other things. This can happen by loading/evicting more data or code to/from the caches, or because code or data alignment has changed, or through a number of other things that can have an effect. This is something I mentioned above: the whole program needs to be assessed to see where changes are most effective.
09 Apr 2014, 09:22
