flat assembler
Message board for the users of flat assembler.
Index
> Main > A bunch of binary to decimal conversion routines Goto page 1, 2 Next 
Author 

wht36
I know binary (unsigned dword) to decimal conversion has been discussed many times before but I still find it very interesting as it taught me some stuff about fixed point maths and speed optimisation issues.
While in general there is no need to speed optimise a binarytodecimal routine because one does not use it frequently, I would still like to offer my version of a speedoptimised binarytodecimal routine. Code: ;BinarytoASCII Decimal Conversion Suppressing Leading Zeros ;Adapted from the algo's by Paul Dixon, Lingo, John O'Harrow ;EAX = number to convert; EDI = where to store result dw2a: cmp eax,1000000000 mov edx,0D1B71759h ; 2^45/10000 = fixed point 1/10000 with 45 digits mov ebp,eax ; save a copy of the number jb .9dig mul edx ; EDX = integer (quotient) + 45 digits after the dot shr edx,13 ; round down EDX quotient (6 high digits) mov eax,68DB9h ; 2^32/10000+1= 1/10000 with 32 digits & rounded up imul esi,edx,10000 ; ESI = quotient * 10000 mul edx ; * 1/10000 with 32 digits (so EDX = integer = first 2 digits) sub ebp,esi ; EBP = remainder (lower 4 digits) mov edx,dword [chartab+edx+edx] ; look up 2 digits mov esi,100 ; load ready for later mov [edi], edx ; write them to answer .dig8: mul esi ; get next 2 digits mov edx,dword [chartab+edx+edx] ; look up 2 digits mov [edi+2],edx ; write them to answer mul esi ; get next 2 digits .dig6a: mov edx,dword [chartab+edx+edx] ; look up 2 digits .dig4a: mov [edi+4], edx ; write them to answer .dig4: mov eax,28F5C29h ; 2^32\100 +1 mul ebp ; ebp= lower 4 digits imul eax,edx,200 movzx eax,word [chartab+ebp*2+eax] mov edx,dword [chartab+edx+edx] add edi,10 mov [edi4],edx mov [edi2],eax ret .2dig: cmp eax,10 movzx eax,word [chartab+eax+eax] ; look them up mov [edi+6],eax jae .dig2a mov [edi+6],ah .dig2a: sbb edi,8 ret .4dig: cmp eax,1000 ; 1000 to 9999 lea edi,[edi6] jae .dig4 cmp eax,100 ; 099 lea edx,[eax+eax*4] ; 100999 jb .2dig lea edx,[eax+edx*8] shr edx,12 ;EDX = n * 41/4096 = quotient, accurate up to 9990 add edi,9 imul ebx,edx,200 or dl,'0' mov [edi3],edx movzx eax,word [chartab+ebx+ebp*2] mov [edi2],eax ret .8dig: mul edx mov edx,dword [chartab+edx+edx] ; look up 2 digits mov [edi+2],edx ; write to answer ; mul esi ; get final 2 digits shr eax,7 imul edx,eax,100 ;fast multiplication add edx,1000000h ;round up before rounding down shr edx,25 jmp .dig6a .6dig: cmp eax,100000 mov ecx,28F5C29h ;((2^32)+1001)/100} mov ebx,eax ;Dividend} sbb edi,6 mul ecx ;EDX = Dividend DIV 100} mov eax,edx ;Set Next Dividend} imul edx,200 ;2 * (100 * Dividend DIV 100)} mov esi,eax ;Dividend} movzx ebx,word [chartab+ebx*2+edx] ;Dividend MOD 100 in ASCII} mul ecx ;EDX = Dividend DIV 100} imul eax,edx,200 cmp edx,10 mov eax,dword [chartab+esi*2+eax] ;Dividend MOD 100 in ASCII} mov edx,dword [chartab+edx+edx] mov [edi5],dh jb .5dig mov [edi6],edx .5dig: mov [edi4],eax mov [edi2],ebx ret .7dig: lea ebx,[edx+edx*4] lea ebx,[edx+ebx*8] shr ebx,12 lea edi,[edi1] imul ecx,ebx,200 or bl,'0' mov [edi+3],ebx movzx edx,word [chartab+ecx+edx*2] jmp .dig4a .9dig: cmp eax,10000 jb .4dig cmp eax,1000000 jb .6dig mul edx shr edx,13 ; EDX = 5 high digits mov eax,28F5C29h ; 2^32\100 +1 imul ebx,edx,10000 ; EBX = quotient * 10000 lea edi,[edi2] sub ebp,ebx ; EBP = remainder (lower 4 digits) cmp edx,1000 jb .7dig cmp edx,10000 jb .8dig mov eax,68DB9h ; 2^32/10000+1= 1/10000 with 32 digits & rounded up mul edx ; * 1/10000 with 32 digits (so EDX = integer = first 2 digits) mov esi,100 or dl,'0' inc edi mov [edi+1],edx jmp .dig8 chartab db "00","01","02","03","04","05","06","07","08","09" db "10","11","12","13","14","15","16","17","18","19" db "20","21","22","23","24","25","26","27","28","29" db "30","31","32","33","34","35","36","37","38","39" db "40","41","42","43","44","45","46","47","48","49" db "50","51","52","53","54","55","56","57","58","59" db "60","61","62","63","64","65","66","67","68","69" db "70","71","72","73","74","75","76","77","78","79" db "80","81","82","83","84","85","86","87","88","89" db "90","91","92","93","94","95","96","97","98","99" I find it instructive that a routine can be speed optimised by segregating a problem into different cases, in this case, a binary number is divided into different cases depending on how many decimal digits it has. Amont the general conversion routines I've tested the above routine appear to be the fastest for all numbers >= 10 on my computer (which is rather old, so may not be the most accurate). The other routines I have included in the attached zip file (source & exe). The source file also has some nice macros (I think), including  code timing macros for measuring in clock cycles (using rdtsc) & and milliseconds (using the system counter) [converted from MASM32]  macros to use mixed strings & numbers, to generate custom labels, and to convert binary to decimal using assembler directives [my own macros]


14 Dec 2008, 19:39 

wht36
Yes, I have forgotten to mention the effect of loading data and code into cache. Binary to decimal conversions are generally called only once or twice so I myself don't use these "optimised" routines in my programs. I suspect a simple divide by 10 algorithm has the smallest footprint and is probably as fast (humanly noticeable anyway) as the fastest routine if the routine is only called a few times in real life.
Nevertheless I find the exercise of trying to optimise a binary to decimal conversion routine "fun". While the fastest routine probably has no real life application by itself per se, I think it is nice to demonstrate the various ways of dividing a number by 10 (such as by dividing by a higher radix, multiplying by reciprocal, by LEA and shifting, and by table look up). In the general case where all numbers need to be catered for, I think branches for special case considerations can be beneficial if one structures the branches carefully (in any case a 32 bit binary to decimal conversion routine must branch if it wants to suppress the leading zeroes). I haven't read any formal documents in branching, but I think if one structures the branches so that the cases taking up the most computing (such as a 10 digit number) uses the least amount of branching (one branch test, no branch taken) then there can be some gain in performance (as in this routine). Certainly if one knows the range to convert before hand (the scope of usage) then one can tailor make the routine to only convert those ranges, allowing further optimisation. One of the reason that my routine is so big is that I try to include all the cases, which I think can be helpful in that one can select out that particular method and use in their own routine if their routine caters only for certain cases. I have also included other routines in my source so that one can examine them and use ideas from those as well. The putdec routine has the smallest size and may be useful for compo writers. The routine that displays number of clocks/milliseconds actually stores the digits in reverse. The macro routine to store assembly constants/variables into memory is rather ugly. Perhaps a more elegant method of storing an assembly constant/variable at assembly time is as follows (not tested) Code: if data eqtype 0 local dividend, divisor divisor = 10 dividend = data while dividend >= 10* divisor divisor = divisor * 10 end while while divisor > 1 db (dividend/divisor)+"0" dividend = dividend mod divisor divisor = divisor/10 end while db dividend+"0" end if EDIT: The effect of cache loading is particularly dramatic if one only runs the routine once (change the count variable to 1 and reassemble & run to see what I mean). In that case the speed appears heavily dependent on the order in which routines were called and the size of the routine, such that the first routine called is most severely penalised, especially if it is a large routine. This load penalty is not humanly noticeable, and when one is trying to speed optimise a routine the presumed premise is that the routine will be run many times. This means that the initial penalty is generally not noticeable unless one is trying to optimise an extremely large routine, in which case reducing the code size may become important (so much so that using loops may be faster than not using loops!). Last edited by wht36 on 16 Dec 2008, 06:48; edited 2 times in total 

15 Dec 2008, 05:00 

bitRAKE
IIRC, the AMD optimization manual had a branchless zero suppression algorithm  maybe it had a couple branches. I can't agree more: itoa is fun to optimize. Many times people coming to assembly from a HLL will ask how to display a value from register or memory. Making it a great debugging aid and a foundation for exploration.
Another approach is to maintain the number in ASCII  eliminating the need for conversion. Or minimizing the conversion with BCD representations (packed/unpacked). A number might have an associated units (meters, inches, bytes) and significant figures limited to make the value more humanly readable. I don't know why people don't like long hex dumps. 

15 Dec 2008, 06:20 

wht36
Yes, I remember in the old days banks and institutions that deal with decimal numbers use BCD and not binary. I think BCD is much more effective and accurate when one works with decimal numbers all the time. Not sure about the current status though. Perhaps someone who designs software for banks would know?
Hexadecimal is probably how programmers should count in and not decimals. I find it quite hard though. I guess growing up with 10 fingers made me accustomed to counting in decimals I remember something about AMD as well. I think it was Terje who wrote the algorithm? If I recall correctly, that algorithm doesn't suppress zeroes so doesn't need branching. However, I think it is not very fast for small numbers (no early out mechanism). 

15 Dec 2008, 06:42 

bitRAKE
wht36 wrote: I remember something about AMD as well. I think it was Terje who wrote the algorithm? If I recall correctly, that algorithm doesn't suppress zeroes so doesn't need branching. However, I think it is not very fast for small numbers (no early out mechanism). ...but I remember another  maybe somewhere else? edit: Just checked...25112.PDF, page 183 has the algorithm. 

15 Dec 2008, 10:05 

edfed
i have made two functions.
num2ascii: .bin: .hex: i plan on .dec: and .udec: but the main problem is: the lengh of decimal source. for .bin and .hex, i just make a nibble conversion as many times it is needed. i can convert very very long numbers in binary or hexa ascii representation. but with decimal, it is a little more difficult because of the division by 10^x first. and how can i divide very very long numbers by a power of 10? and how can i convert a string of 4233 bytes into it's decimal representation? Last edited by edfed on 15 Dec 2008, 20:40; edited 2 times in total 

15 Dec 2008, 16:21 

wht36
bitRake wrote: ...Just checked...25112.PDF, page 183 has the algorithm. Wow, thanks for the reference, that is very clever code indeed, using ebx to keep track of the leading zero and sbb to increment only when a nonzero character is encountered, thus avoiding the need to branch completely. Thanks again! I have added the AMD routine in the attachment below. By the way, while a method for speed optimisation is to remove branching and unroll loops, I have noticed that speed does not always increase, and may in fact decrease, possibly due to  lack of early out mechanism  the need to load the cache with new code (jumping back to code that is already loaded in the cache may be in fact faster!) edfed wrote: ...and how can i divide very very long numbers by a power of 10? Hmm, those are very good questions. I have not studied those at all. Let me think about them.... Edit: To divide a large number by say 10, the nonoptimised (easy to understand) process is as follows: [Picture that the large number (dividend) is composed of many dwords in memory; divide it by 10, and the quotient will also be quite big and composed of many dwords in memory.] 1. clear EDX 2. load EAX with the MOST significant dword in your dividend 3. divide by 10 4. EAX now holds the MOST significant dword of the final quotient (save it as the most significant dword part of the quotient) 5. load EAX with the next most significant dword in the dividend 6. divide by 10 again 7. EAX is the next significant dword of the quotient, save it EAX as the next siginficant dword in the quotient 8. go back to 5. until all the dwords in the dividend have been processed 9. The final EAX is the last (LEAST significant) dword in the quotient (save it). The final EDX is the actual remainder [This process is the same process one uses to divide manually as taught in school, so to understand the process it's helpful to picture in one's mind how one would divide manually. Remember that one should specify the divisor 10 as a dword (e.g. put it in a 32 bit register, e.g. ECX) so that EDX:EAX will be divided by the divisor, the remainder placed in EDX and the quotient in EAX.] To convert a huge binary number into a decimal string then, the process is then as follows: 1. divide the huge number by 10 as outlined above, save the final EDX (remainder) as the LEAST significant decimal digit in the final decimal result 2. divide the quotient by 10 again, save the remainder as the NEXT least significant decimal digit 3. repeat step 2 until the quotient is less than 10 4. store the quotient as the final (most significant) decimal digit in your final decimal string. 5. convert the decimal digits to ascii by adding an ascii '0' to each of the digits. To make this process faster, instead of using 10 as the divisor, one uses 10^9 (the largest power of 10 to fit in a dword) as the divisor and going through the same steps (1..9 and 1..5) as above one first convert the big number into a 10^9 based number (composed of an array of dwords, each less than 10^9). Then use an optimised binary to decimal converter to output each of the dword in the 10^9 series to the final decimal form. Hope my explaination is clear enough. I will try code some examples when I have time. Edit2: If the large number is not composed of a whole number of dwords (e.g. 4233 bytes), I think the method can be modified as follows: Get the size in bits of the final nondword digit= (size & 3) shl 3 e.g. 4233 & 3 shl 3 = 1 shl 3 = 8 Subtract this number from 32 e.g. 32  8 = 24 For the last dword loaded, shift EAX right that number of times to remove the extra bits e.g. shr EAX,24


15 Dec 2008, 16:51 

bitRAKE
wht36 wrote: To make this process faster, instead of using 10 as the divisor, one uses 10^9 With SSE it should be possible to extract 48 or 52 digits in one pass, but it's complicated to code, imho. Maybe someone wants to display 2^43112609  1 expanded. (The binary version fits in L2 cache, btw. So, it'd be a worth wild optimization at the instruction level as well.) 

17 Dec 2008, 04:48 

wht36
bitRAKE wrote:
Hmm, but one needs the remainder to continue dividing through the whole big number. Would the remainder still be valid with a divide and shift operation? The SSE option sound very intriguing. I must find sometime to learn about it. 

17 Dec 2008, 15:36 

edfed
this solution loks good but not far a very large number.
for a 32bits value, it is ok, but for Nbytes, it will take like: (N/2)*N*2,56 divisions. it is too much, but it seems to be the only reliable way to do it. reliable because this algorythm is so simple it can be used with any instruction set. 

17 Dec 2008, 18:30 

r22
Yep, dividing by nonpowers of 2 is stupid.
If I had a really big number that I wanted to divide by ten. I'd just ... original shift right by 4 plus original shift right by 5 plus original shift right by 7 Ignore the .0015625 error and never ponder division again. Laziness mixed with an obsessive compulsive need to be efficient is a dangerous combination. 

18 Dec 2008, 01:04 

wht36
edfed wrote: this solution loks good but not far a very large number. If speed is of the essence, then the best way is to store the number as BCD, or to limit the number of conversions. Unfortunately for conversion from a binary number to decimal, both the REMAINDER and the QUOTIENT are needed, and I cannot think of a simple and quick way of doing it when it's a huge binary to decimal number conversion. If one is converting a HUGE number, and a division is to be avoided, then I think the fastest way is to multiply using fixed point maths. However, unlike DIV which uses a qword (EDX:EAX) as the source, MUL only uses a dword (EAX) as a source . For calculation of remainder AND quotients, shifts are only efficient for small divisors. For large divisors the number of shifts and additions are too numerous and the hardware multiply is usually much quicker. I suspect the algorithm given by r22 has too much rounding error and is only valid for small numbers (the .0015625 rounding error means that you are off by 1 or 2 for every 1000 that is present in your number). Some of the alternatives to division by nonpowers of 2 are approximations and so with a HUGE number one must be very careful about validation, because the errors become very significant if the number is very large. And even after one has gotten the quotient by one of the alternatives, one still need to get the REMAINDER, either by multiplying the quotient and subtracting that from the original number, or by multiplying the fixed point result after removing the integer portion of the fixed point result. By the way, by "(N/2)*N*2,56 divisions" do you mean dividing by 10 or by dividing by 10^9? 

18 Dec 2008, 04:58 

bitRAKE
wht36 wrote: Would the remainder still be valid with a divide and shift operation? Seems like much work, but not sweeping the whole big integer array could be massive savings....and I just like the number 5. 

18 Dec 2008, 07:08 

neville
r22 wrote: Ignore the .0015625 error This is actually a fixed +1.5% error ( wht36: note not 0.15% i.e. off by 1 or 2 for every 100, not 1000) so at best only the first 2 decimal digits of any result will be correct; all the remaining digits will be meaningless. For example, dividing the decimal number 268435456 by 10 using your method would give the result as 27262976, instead of 26843545 (+ remainder 6). Or dividing decimal 1,000,000,000 by 10 would give 101,562,500 instead of 100,000,000. The correct answer in hex is 5F5E100, but the answer given would be 60DB884 in hex  not one digit the same! Of course it depends on the level of precision required, but the problem with your shifting method r22 boils down to the fact that decimal 1/10 = 0.1 can NOT be exactly represented as a binary (or hexadecimal) fraction: 0.1 (decimal) == 0.1999999999... (hex)  it is a recurring hexadecimal! Your method effectively says 0.1 (decimal) == 0.1A (hex) No valid remainder could ever be calculated using this method, because most of the preceeding (more significant) digits are wrong. The error could be reduced from +1.5% to 0.4% by replacing your last "shift right 7" with: shift right 8 plus shift right 9 which effectively equates 0.1 dec to 0.198 hex ... but it still doesn't guarantee even 3 correct decimal digits. Just to get 3 correct digits (+0.1% error) you'd have to also: plus shift right 11 which equates 0.1 dec to 0.19A hex and requires a total of 5 shifts and 4 adds... BTW, there's nothing to be gained by shifting the original each time (since the shifted bits are immediately lost)  so a progressive shift and add would be quicker and more efficient. _________________ FAMOS  the first memory operating system 

18 Dec 2008, 10:19 

edfed
i have found a little idea:
convert each Byte in decimal multiplied by it's rank. LSB => 0 to 255 LSB+1 => 0 to 65535255 granularity 256 etc etc then, just add all the strings of converted bytes together into one unique result. or maybe each bits, converted into a BCD string, and added to the total if 1, nothing if 0 and finally, converted in ascii. yes, now it's evident. to convert an infinite decimal number into a decimal signed ascii string, the best is to convert each nibble in bcd multiply by power of ten with shift left combinaisons of the bcd value ( shl4), or the ascii char ( shl8). when the nibble is converted in the buffer convert the next one, and again, shift the numbers to multiply it by power of ten. and to shift the numbers: use shl [eax],ecx Code: number dd 0000100000,500000,1000,42222222,786544,1,1,1,1,1,1,12,13 ; this number should be enormous. it is the sum of each ;[dword]*2^32*rank) .end dd 0 buffer rb 3000 .end dd 0 ... mov esi,number mov edi,buffer clear buffer ... mov bl,[esi] etc etc ... 

18 Dec 2008, 17:20 

neville
edfed wrote: i have found a little idea: edfed, what does this have to do with binary to decimal conversion _________________ FAMOS  the first memory operating system 

18 Dec 2008, 18:42 

iseyler
Here is what I use to convert a 64bit binary value into a string.
Code: ;  ; os_int_to_string  Convert a binary interger into an string string ; IN: RAX = binary integer, RDI = location to store string ; OUT: RDI = pointer to end of string ; Notes: This function is a wrapper for the os_int_to_string_worker function below. The wrapper is needed to maintain the original contents of the registers and to terminate the string. ; Min return value is 0 and max return value is 18446744073709551615 so your string needs to be able to store at least 21 characters (20 for the number and 1 for the string terminator). os_int_to_string: push rdx push rcx push rax call os_int_to_string_worker mov al, 0x00 stosb ; Store the null terminator at the end of the string pop rax pop rcx pop rdx ret ; The real worker of the os_int_to_string function. Using recursing to generate the string. Adapted from AsmLib for Linux. os_int_to_string_worker: mov rcx, 10 os_int_to_string_recurse: xor rdx, rdx div rcx ; RDX:RAX / RCX push rdx or rax, rax jz os_int_to_string_store call os_int_to_string_recurse os_int_to_string_store: pop rax or al, '0' stosb ret ;  Ian _________________ BareMetal OS  http://www.returninfinity.com/baremetalos Monotasking 64bit OS for x8664 based computers written entirely in Assembly 

18 Dec 2008, 20:04 

wht36
bitRAKE wrote:
Hmm, I am having difficulty visualizing this process of shifting the least Bbits into the remainder. For example say we have 15, we want to divide by 10^1, the expected result is quotient 1, remainder 1 Code: 15/(5^1) shr 1 = quotient 1, remainder 0 1111b / 101b shr 1 = 1b, least bits = 1b, remainder 0 How does one then shift the least bit into the remainder? By multiplying it by 5? Wouldn't this limit the size of the 5^n division, because if you add the least bits * 5 to the remainder, one risks overflowing the remainder? edfed wrote: i have found a little idea: This simply resums all the binary digits and would arrive at the same binary result as the original binary number. To sum the decimal digits, see http://www.cs.uiowa.edu/~jones/bcd/decimal.html (Extract from that webpage below) Code: Consider a 16 bit number n. This can be broken into 4 fields of 4 bits each, the base 16 digits; call them n3, n2, n1 and n0. n _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ ________________  n  n  n  n  3 2 1 0 The value of the number can be expressed in terms of these 4 fields as follows: n = 4096n3 + 256n2 + 16n1 + n0 In the same way, if the value of n, expressed in decimal, is d4d3d2d1d0, where each of d4, d3, d2, d1 and d0 are decimal digits, the value of n can be expressed as: n = 10000d4 + 1000d3 + 100d2 + 10d1 + d0 Our problem then, is to compute the di from the ni without dealing directly with the larger number n. To do this, we first note that the factors used in combining the values of ni can themselves be expressed as sums of multiples of powers of 10. Therefore, we can rewrite the original expression as: n = n3( 4×1000 + 0×100 + 9×10 + 6×1 ) + n2( 2×100 + 5×10 + 6×1 ) + n1( 1×10 + 6×1 ) + n0( 1×1 ) If distribute the ni over the expressions for each factor, then factor out the multiples of 100, we get the following: n = 1000(4n3) + 100(0n3 + 2n2) + 10(9n3 + 5n2 + 1n1) + 1(6n3 + 6n2 + 6n1 + 1n0) We can use this to arrive at first approximations ai for d3 through d0: a3 = 4n3 a2 = 0n3 + 2n2 a1 = 9n3 + 5n2 + 1n1 a0 = 6n3 + 6n2 + 6n1 + 1n0 I have tried to extend the above to a 32 bit source number .... It is not very workable, because for EACH digit, one has to calculate the result separately with a different combination of additions and multiplies. So for a 10 digit number, one needs 9 combinations of adds & multiplies. Now for a huge number with n digits, one would need n1 combinations, each combination depends on the result of the previous, and with almost certain chance that the result will overflow into the next x number of digits (try to use this method to convert a 32 bit number into decimal and you will see what I mean) .... 

19 Dec 2008, 03:25 

bitRAKE
wht36 wrote: Hmm, I am having difficulty visualizing this process of shifting the least Bbits into the remainder. A = N mod 5^B C = (NA)/5^B mod 2^B ; least bits of quotient N mod 10^B = 5^B * C + A 5^B fits in our chosen size, so 10^B will be okay in twice the size. I'll post the 64bit code, maybe before I go to sleep... 

19 Dec 2008, 04:38 

Goto page 1, 2 Next < Last Thread  Next Thread > 
Forum Rules:

Copyright © 19992020, Tomasz Grysztar. Also on GitHub, YouTube, Twitter.
Website powered by rwasa.