flat assembler
Message board for the users of flat assembler.
Index
> Main > Optimizing Education...Please? Goto page 1, 2 Next |
Author |
|
r22 25 May 2005, 06:48
This was the best I could come up with.
If a couple assembler veterans post their code on this thread we can break out the benchmarking :D. Code: HexStr2Bin: .strPtr equ esp+4 push ebp push ebx mov ebp,[.strPtr+8] ;get str ptr lea edx,[ebp-1] ;copy & DEC str ptr .len: add edx,1 cmp byte[edx],0 jnz .len ;edx = last byte in str NULL xor ecx,ecx xor eax,eax .conv: dec edx ;loop through the string backwards cmp edx,ebp ;done when edx is less then ebp jl .ret movzx ebx,byte[edx] ;get the byte sub bl,30h ;65-48=17=7 97-48=49-7=42=32 cmp bl,17 ;not exact number to use but works jl .continue sub bl,7 cmp bl,17 ;not exact number to use but works jl .continue sub bl,32 .continue: ;the byte should be between 0-15 shl ebx,cl ;shift left AKA multiply by power of 16 add eax,ebx ;make the return value add cl,4 ;jump up to a new power of 16 for shift left (multiply) jmp .conv .ret: pop ebx pop ebp ret 4 |
|||
25 May 2005, 06:48 |
|
Madis731 25 May 2005, 08:25
Code: ;<<<-c810----CODE TO CONVERT STRING TO BINARY--------------->>> toBinary: ;ebx=base,esi=source,edi=dest xor ecx,ecx ;clear values so we can add up xor eax,eax ;to them later .nxtnmbr: mul ebx ;multiply with base so we can mov cl,[esi] ;get another number to add to sub cl,30h ;lower part of it. cmp cl,10 jc .nmbr ;0..9? sub cl,7 ;make A..Z=>10..35 .nmbr: cmp ecx,ebx ;is the input bigger than base-1 jc .inlimit cmp [chng],-1 ;Did we come here in previous jz .inlimit ;loop? mov [chng],-1 ;No disturbance in the next loop div ebx ;pre-divide by base (needed!) push eax ;push eax/base mov eax,ebx ;base in eax for multiply .test: mul ebx ;eax*=base push eax mov eax,edx mul ebx mov edx,eax pop eax cmp ecx,eax ;is the input still bigger? jnc .test mul dword[esp] ;it wasn't so return our eax add esp,4 ;instead of popping we emulate .inlimit: ;because pop reads memory mov [chng],0 ;continue when input is in the add eax,ecx ;limits of 0..base-1 inc esi ;point to next char cmp [esi],byte 0 ;0 is the end marker jnz .nxtnmbr ;loop if non-zero character mov [edi],eax ;result to destination pointer mov [edi+4],edx ret ;<<<-c81F----END OF CODE------------------------------------>>> Code: ;<<<-c820----CODE TO CONVERT BINARY TO STRING--------------->>> toString: ;eax=base,BINdata=source, cmp eax,2 ;STRdata=destination jnc .cont ;base_2 - the lowest that exists mov [STRdata],"<=LO" ;Easier than defining data mov [STRdata+4],"W!" ;yet perfectly legal ret .cont: cmp eax,37 ;base_36 defined as maximum in jc .cont2 ;this program mov [STRdata],"<=HI" mov [STRdata+4],"GH!" ret .cont2: push ebp ;We passed the checks xor ebx,ebx ;clear some registers xor esi,esi ;... xor edx,edx ;. mov ebp,STRdata ;define important pointers mov edi,eax ;can't be in eax, because it is mov eax,[BINdata] ;used for calculations mul/div .again: div edi ;dividing by base gives us lea ecx,[edx+30h] ;numbers in reverse order cmp ecx,3Ah ;They must be translated to jc .push ;readable characters and pushed add ecx,07h ;to stack so we can easily .push: ;reverse them mov ch,cl ;byte-by-byte: every word push push cx ;erases cl so ch preserves it inc esp ;this allows to virtually push inc ebx ;bytes (instead of words) xor edx,edx ;clear remainder of division cmp eax,0 ;eax==0 ? nothing to divide jnz .again .test: sub ebx,4 ;popping is hard jc .ovr ;n%4==0 ? lets pop 4 bytes pop [STRdata+eax] ;dword pop add eax,4 ;correct eax jmp .test ;do it again .ovr: pop [STRdata+eax] ;pop dword sub esp,4 ;correct stack cmp ebx,-3 ;check how many we left there jc .exit ;none? ok, we can finish jz .three ;jump accordingly cmp ebx,-2 ;now checking to get additional jz .two ;flags inc esp ;only 1 inc eax .two: ;2.. inc esp inc eax .three: ;3.. inc esp ;(inc esp)x3 but we increased inc eax ;one too many in the beginning .exit: ;so it's a whole dword mov [STRdata+eax], 0 ;string must end with 00h pop ebp ret ;<<<-c82F----END OF CODE------------------------------------>>> These are N conversions - 2..36 ASCII<=>BIN |
|||
25 May 2005, 08:25 |
|
smiddy 25 May 2005, 09:28
r22 & Madis731,
You've both given me a few things to think about for sure. You both use operations I normally don't so this will give me the opportunity to preview them in a working state, thanks! Both of you shortened variable names. Is this considered normal style? The reason I ask is it seems like, in FASMs case, you can take the luxury of using names that are far more descriptive. I suspect this style comes from coding from way-back when variable lengths where limited? I like the idea of checking the speed too. I only know a few things about speed. From a lot of your posts, it seems you have ideas, I think, on hyperthreading (or multiprocessor) speed increases. I say this based on the dependancies between operations and the need for one to complete before another operations. I suspect for parallel processing. Is there a quick reference on clock cycles for processor types on the net? I know for example XOR EAX,EAX has less overhead in drive space, but is it a faster operator than MOV EAX,0? Enough questions for now...I have to run to work. Thanks again for the code examples. I'll study them more closely this evening. |
|||
25 May 2005, 09:28 |
|
Madis731 25 May 2005, 19:59
No short names are bad and you should use as long ones as neccessary, but NOT too long that may confuse people.
Why I'm used to this bad habit? I really don't know , but for my defence - .toString can only be translated in one way you can convert binary to string conserning numbers . You must thank Agner Fog for the next one: http://www.agner.org/assem/ The file in the attachment is result of his work. Also you should check out 9. "64bit BCD add" from http://www.azillionmonkeys.com/qed/asmexample.html because if I ever wanted to optimize - I would loose the compares to check for characters in a HEX ASCII string. "Push them over the edge" so-to-speak. This topic is too tricky to tell you to go search the board or manuals so here it goes: ------------------------- XOR was to be thanked in the 16-bit ages where MOV reg,const would take 5 bytes so taking 3 CPU reads from the memory. Today (32-bit) it is not so important (though still faster and smaller!), because 32 bits (read 4 bytes) can be read at once so only two reads are neccessary. In case of an XOR reg,reg you can clear 2 registers in one memory-read if you're lucky(meaning 4 byte aligned code). Example is appropriate: Code: align 4 mov eax,0 ;B8 00 00 00 00 mov edx,0 ;BA 00 00 00 00 ;Result=10 bytes -> 3 to 4 memory-reads mov eax,0 ;B8 00 00 00 00 mov edx,eax ;8B D0 (STALL!) ;Result=6 bytes -> 2 to 3 memory-reads xor eax,eax ;33 C0 xor edx,edx ;33 D2 ;Result=4 bytes -> 1 to 2 memory-reads I would use XOR for speed+size I would use MOV for readability (maybe speed?) |
|||
25 May 2005, 19:59 |
|
Madis731 25 May 2005, 20:08
Sorry for double posting, but you should know another good trick when you want to avoid bad dependency checking with XOR reg,reg:
Code: and eax,0 ;83 E0 00 <= notice the optimized instruction and edx,0 ;83 E2 00 <= 1-byte 00 clears the whole 32-bits ;Result=6 bytes -> 2 to 3 memory-reads ;There is no stall and should heat less CPU because of the ;simpler circuitry of AND compared to XOR Did some preliminary testings: you first code was not so bad ~325±5 clocks. Your use of time-taking instructions for shorter code didn't pay off it makes unexpected stalls in parts of your code. I hoped it would be much more improved but it only improved 50% by the second try on that code - r22: 167±0 clocks. I like the stability though BIG "+". Last edited by Madis731 on 25 May 2005, 20:27; edited 1 time in total |
|||
25 May 2005, 20:08 |
|
FlashBurn 25 May 2005, 20:15
The best algo I know is this:
Code: xor ebx,ebx xor eax,eax .loop: lodsb test al,al jz .end make a binary number out of the ascii byte shl ebx,4 or ebx,eax jmp .loop .end: mov eax,ebx ret |
|||
25 May 2005, 20:15 |
|
Madis731 26 May 2005, 06:02
The next code is not wrapped inside CALL|RET so its not fair to test it side-by-side with other snippets, but still it only takes 111 clocks. I think that memory reads/writes and call-ret overhead won't be much. Oh, and the code itself ofcourse:
Code: mov edi,"F3D4" ;Constant1 mov eax,edi and eax,40404040h shr eax,6 imul ebx,eax,7 ;lea ebx,[eax*4+eax] ;lea ebx,[eax*2+ebx] sub edi,ebx mov eax,edi and eax,000F000Fh and edi,0F000F00h ror eax,12 rol edi,8 or eax,edi mov edi,eax shr edi,8 or eax,edi mov edi,"CA9E" ;Constant2 mov edx,edi and edx,40404040h shr edx,6 imul ebx,edx,7 ;lea ebx,[esi*4+esi] ;lea ebx,[esi*2+ebx] sub edi,ebx mov ebp,edi and ebp,000F000Fh and edi,0F000F00h ror ebp,12-8 rol edi,8+8 or ebp,edi mov edi,ebp shl edi,8 or ebp,edi shl eax,16 shr ebp,16 or eax,ebp I haven'toptimized it neither for speed nor size, because it was late last night and I just wanted to get it raedy. ...but I fell asleep anyway so...your comments|critisism, please! |
|||
26 May 2005, 06:02 |
|
smiddy 26 May 2005, 11:50
Madi731,
Thanks, I have placed http://www.azillionmonkeys.com/qed/tech.shtml in my book marks for sure. There are a few pages there I hadn't seen yet. I took a few glimpses at Paul Hsieh's pages last night. I will study them as well. So you are aware, my background is hardware (EE), so the digital info makes a lot of sense. Particularlly the ANDing of a zero to a register. I suppose there are a lot more boolian tricks that can be played. That probably comes with more experience I'm certain. This evening if I get the opportunity I'll take a closer look at your last code fragment. It is very tight indeed. Thanks again! Also, when I posted my procedure, I hadn't had the opportunity to test it myself until last night. I suppose, luckily, it worked the first try. From the standpoint of readability, my peice does that quite well I think. Although a 2:1 increase as you say isn't huge for one pass, but say with a PCI table conversion for ASCII Hex to number, with nearly 5,000 devices, 330 clocks/device x 5,000 devices/list = 1,650,000 clocks/list versus 825,000 clocks/list. On an old machine, that would be maddening to wait on, especially since their clocks/operation were longer too. This is good stuff...can you see my excitement? FlashBurn, Thanks for the snippet. I'll take a closer look at it this evening as well. |
|||
26 May 2005, 11:50 |
|
Madis731 26 May 2005, 16:45
Its time for MMX:
I found this algorithm to be very interesting, because there is no such instruction that would pack BYTES=>NIBBLES. I would write a 10clock program then Code: movq mm0,[ASCIIhex] movq mm1,[Compare1] movq mm2,[Compare2] movq mm3,[Compare3] movq mm4,[Compare4] movq mm5,mm4 psllw mm5,8 pcmpgtb mm1,mm0 psubb mm0,mm2 pandn mm1,mm3 psubb mm0,mm1 movq mm1,mm0 pand mm0,mm4 pand mm1,mm5 psrlw mm1,4 por mm0,mm1 packuswb mm0,mm0 movd eax,mm0 ASCIIhex dq "F3D4CA9E" Compare1 dq 4040404040404040h Compare2 dq 3030303030303030h Compare3 dq 0707070707070707h Compare4 dq 00FF00FF00FF00FFh |
|||
26 May 2005, 16:45 |
|
smiddy 26 May 2005, 20:52
WOW!
I haven't opened the can of 64-bits yet. I am making a few assumptions that most common 32-bit/16-bit operations are available with a prefix or suffix? Are there specifics for each register too, like MUL and DIV which use multiple registers (this leads into FPU operations too, to which I have a little experience with). Please forgive my elementary questions. I intend on getting a good book (if you know of one, please let me know) on the different instructions sets. The one book (which is fairly worn) Using Assembly Language 2nd Edition by Allen L . Wyatt (C) 1990. I've had it that long...but it doesn't have the MMX, SSE, etcetera. Although it does have quite a bit (pun intended) on FPU operators. Coding examples for most of the 32-bit operations, for example SETNAE, is just a simple: Code: SETNAE CL Without the ability to see specific flow it becomes tough to follow, unless, and I should probably do this, I use the instruction(s) one at a time to get to know them better. Again <cough> thanks for the information. I have a lot to go over just from this thread alone. |
|||
26 May 2005, 20:52 |
|
r22 27 May 2005, 04:00
I was disappointed by the MMX code :[
1 ) It reads the strings BACKWARDS string db 'F','F','F','F',0 ;non mmx versions = 65535 string1 db 'F','F','F','F','0','0','0','0',0 ;mmx version = 65535 2 ) Also it only handles UPPERCASE letters. Fixing the above two problems makes the MMX code slower than the following. Using the optimizations from others on this thread I tweaked my original code. I'm pretty sure this code can be easily ported to 64bit simply by replacing the E's with R's. Code: align 8 HexStr2Bin: .strPtr equ esp+4 mov ecx,[.strPtr] xor eax,eax .conv: movzx edx,byte[ecx] ;more speed IF following 4 lines can be DE-branchified sub dl,30h js .ret ;the byte is null return test dl,11110000b ;check for A-F a-f jz .ctn ;not an upper/lower case char skip to ctn: and dl,00001111b ;make A-F a-f = 0-5 add dl,9 ;correct 0-5 to 10-15 .ctn: shl eax,4 ;shift eax to the next power of 16 add ecx,1 ;increment the string ptr or al,dl ;OR in the new value jmp .conv ;continue the loop .ret: ret 4 The unrolled version... Code: align 8 HexStr2Bin: .strPtr equ esp+4 mov ecx,[.strPtr] xor eax,eax .conv: mov edx,[ecx] sub dl,30h js .ret test dl,11110000b jz .ctn and dl,00001111b add dl,9 .ctn: shl eax,4 or al,dl shr edx,8 sub dl,30h js .ret test dl,11110000b jz .ctn1 and dl,00001111b add dl,9 .ctn1: shl eax,4 or al,dl shr edx,8 sub dl,30h js .ret test dl,11110000b jz .ctn2 and dl,00001111b add dl,9 .ctn2: shl eax,4 or al,dl shr edx,8 sub dl,30h js .ret test dl,11110000b jz .ctn3 and dl,00001111b add dl,9 .ctn3: shl eax,4 add ecx,4 or al,dl jmp .conv .ret: ret 4 Well, I'm done kicking a dead horse. |
|||
27 May 2005, 04:00 |
|
Madis731 27 May 2005, 08:44
I don't know what is the matter with my CPU or my RAM, but the tighter one takes 162 clocks and the unrolled onw 201 clocks - does the unrolling mess up my CPUs branch prediction?
|
|||
27 May 2005, 08:44 |
|
tom tobias 27 May 2005, 10:02
smiddy wrote: I know for example XOR EAX,EAX has less overhead in drive space, but is it a faster operator than MOV EAX,0? May I ask how you know this???? I am laboring under the possibly FALSE impression, that most of the information about timings of instructions is folklore based on 80386 days..... I have writtent this before, and I will write it again. Even if XOR EAX, EAX is ONE THOUSAND times faster than MOV EAX, Zero, I will still employ the latter expression EVERY time. So many myths pervade this field. One of those myths is the silly notion of speed. Faster is better. Sure. Faster is better, so is power saving. In fact, as between saving electricity, and reducing execution time, I suspect that for MOST OF THE world's population, saving electricity is far more important than saving execution time. Unless one plans to turn off the computer sooner, saving electricity, as a result of this nanosecond faster computation, one should stick to MOV EAX, Zero. Computers today are so fast, that saving a few clock cycles at the cost of IMPROPER programming style (substituting unreadable CODE for easily understood PROGRAM) is an UNFORGIVEABLE error. But, apart from the argument that readability reigns supreme in evaluating alternate programs to accomplish the same goal, WHERE'S the timing data to support the notion that XOR EAX, EAX executes faster than MOV EAX, Zero???? In particular, where are the benchmarks published? Where's the url to a web site that explicitly defends this obsolete practice, employed religiously by legions of assembly language programmers, who no longer understand WHY Intel developed Boolean and bitwise operators in the first place! (to save precious memory!) Regards, tom |
|||
27 May 2005, 10:02 |
|
MazeGen 27 May 2005, 11:33
tom tobias wrote:
Hi tom, I'll just copy&paste some info from the Famous Pentium Optimization Manual Quote:
Quote:
According to these timings, MOV can under some circumstances a bit quicker, but XOR r32,r32 has a special hardware support: Quote:
So, what is quicker then? Last edited by MazeGen on 27 May 2005, 12:41; edited 1 time in total |
|||
27 May 2005, 11:33 |
|
smiddy 27 May 2005, 11:36
tom tobias wrote:
Hi Tom, I wasn't trying to pick a fight here. I truely want to understand. I agree that readability is, especially after say six months of coding, you return to your original and you go, "WTF was that?" But since you asked, empirically I have boot sector code I wrote that reads in the FAT and root directory. Suffice it to say, I ran out of room initially for the 512 bytes. So I was looking for ways to reduce my on disk overhead. One way was to change all the MOV register, 0 to XOR register, register. A byte per each, around twelve of them. I subsequently was able to fit everything into one sector. As for speed, well, that was my question, not a statement. Personally I prefer hard evidence too. I am an avid fan of lab work, it makes me happy... So, if we need benchmarks and such perhaps one day I'll get the opportunity to work on that and place them on the net for all to view. But alas I was merely asking the question to see what was available on information. I'm getting the impression after re-reading your post that you may not have read the entire thread. I also get the impression that you really didn't read my question either... I'm sure I understand your perspective, I'm not certain you understand mine. Was this a knee-jerk reaction on your part? I would hate to continue posing questions and be slammed as being miss-guided, when I'm actually seeking guidance. What gives? |
|||
27 May 2005, 11:36 |
|
smiddy 27 May 2005, 11:47
@MazeGen,
Thanks for posting this information and the link. I have a lot to learn... |
|||
27 May 2005, 11:47 |
|
tom tobias 27 May 2005, 20:13
Thanks MazeGen and Smiddy, good replies, both.
No, I was not seeking to belittle or argue with anyone. The reference to Pentium optimization is quite interesting. I think a proper benchmark study would be useful to sort out the answer to the question of timing for XOR versus MOV. However, I am quite certain that few of the vast majority of participants of this forum will commence employing MOV EAX,zero, instead of XOR EAX, EAX, even if a properly designed benchmark study demonstrated approximate equivalence in timing for both instructions. For the most part, people have lots of excuses for continuing to write CODE, instead of creating programs. One of the main reasons for reluctance to adopt a more readable approach to writing programs is given by smiddy, quite clearly: He needed to save space--> that was paramount for him. This notion of saving memory by writing "tight code" is found in many places on the FASM forum. Not long ago, for example, one encountered a competition based upon using the fewest instructions to implement some task. Note, no one was interested in a competition about writing the CLEAREST, most readily transparent, easiest to comprehend PROGRAM, no, what was important was saving space, i.e. memory, as if this were 1955 again. Hmm. I must be from another planet.... Seems to me, I have heard that argument, about the importance of saving memory, as if it were some kind of ADVANTAGE, for more than four decades now.....I have no interest in short, obscure instructions that whittle 39 milliseconds of time off the execution of some equally obtuse muddle of CODE, and reside in a space smaller than a thimble. I like LONG, DETAILED, INFORMATIVE PROGRAMS, written so that someone can actually read the text and understand what the programmer seeks to accomplish. I LIKE SQUANDERING memory, I am happy when I encounter VERBOSITY with actual descriptions of a procedure, or an algorithm or a data structure. Saving n bytes in order to FIT some piece of code into a narrow little space is not my idea of progress. For sure, if one is really keen on saving space, the x86 instruction set is unsuitable......regards, tom |
|||
27 May 2005, 20:13 |
|
madmatt 28 May 2005, 20:32
Fortunately , you can use the structured macros provided with the Fasm package and write some very clean and structured code without bloating the code size up, and without compromising the blinding speed that assembly provides. Thank you Privalov for a job well done!!
MadMatt |
|||
28 May 2005, 20:32 |
|
beppe85 29 May 2005, 22:57
I think it's hard to do timings, as thousands of "xor r32, r32" or "mov r32, 0" are far from being real instruction sequences. In this cases you need to look for pairings, IMO a pain. Only starting with P4 "xor" may perform better(just guessing), as "mov" is a longer encoding but break dependencies in 686. Sure they will perform equally well if they fit in cache.
And "xor r32, r32" is clear for me... You got it? "clear" hehehehe |
|||
29 May 2005, 22:57 |
|
Goto page 1, 2 Next < Last Thread | Next Thread > |
Forum Rules:
|
Copyright © 1999-2024, Tomasz Grysztar. Also on GitHub, YouTube.
Website powered by rwasa.