flat assembler
Message board for the users of flat assembler.

A bunch of binary to decimal conversion routines
Author
wht36

Joined: 18 Sep 2005
Posts: 106
wht36
I know binary (unsigned dword) to decimal conversion has been discussed many times before but I still find it very interesting as it taught me some stuff about fixed point maths and speed optimisation issues.

While in general there is no need to speed optimise a binary-to-decimal routine because one does not use it frequently, I would still like to offer my version of a speed-optimised binary-to-decimal routine.
Code:
```;Binary-to-ASCII Decimal Conversion Suppressing Leading Zeros
;Adapted from the algorithms by Paul Dixon, Lingo, John O'Harrow
;EAX = number to convert; EDI = where to store result
dw2a:   cmp     eax,1000000000
mov     edx,0D1B71759h          ; 2^45/10000 = fixed point 1/10000 with 45 digits
mov     ebp,eax                 ; save a copy of the number
jb      .9dig
mul     edx                     ; EDX = integer (quotient) + 45 digits after the dot
shr     edx,13                  ; round down EDX quotient (6 high digits)
mov     eax,68DB9h              ; 2^32/10000+1= 1/10000 with 32 digits & rounded up
imul    esi,edx,10000           ; ESI = quotient * 10000
mul     edx                     ; * 1/10000 with 32 digits (so EDX = integer = first 2 digits)
sub     ebp,esi                 ; EBP = remainder (lower 4 digits)
mov     edx,dword [chartab+edx+edx]     ; look up 2 digits
mov     [edi], edx      ; write them to answer
.dig8:    mul     esi                             ; get next 2 digits
mov     edx,dword [chartab+edx+edx]     ; look up 2 digits
mov     [edi+2],edx                     ; write them to answer
mul     esi                             ; get next 2 digits
.dig6a:      mov     edx,dword [chartab+edx+edx]     ; look up 2 digits
.dig4a:       mov     [edi+4], edx                    ; write them to answer
.dig4:    mov     eax,28F5C29h                    ; 2^32\100 +1
mul     ebp                             ; ebp= lower 4 digits
imul    eax,edx,-200
movzx   eax,word [chartab+ebp*2+eax]
mov     edx,dword [chartab+edx+edx]
mov     [edi-4],edx
mov     [edi-2],eax
ret
.2dig:       cmp     eax,10
movzx   eax,word [chartab+eax+eax]      ; look them up
mov     [edi+6],eax
jae     .dig2a
mov     [edi+6],ah
.dig2a:       sbb     edi,-8
ret
.4dig:       cmp     eax,1000                ; 1000 to 9999
lea     edi,[edi-6]
jae     .dig4
cmp     eax,100                 ; 0-99
lea     edx,[eax+eax*4]         ; 100-999
jb      .2dig
lea     edx,[eax+edx*8]
shr     edx,12                  ;EDX = n * 41/4096 = quotient, accurate up to 9990
imul    ebx,edx,-200
or      dl,'0'
mov     [edi-3],edx
movzx   eax,word [chartab+ebx+ebp*2]
mov     [edi-2],eax
ret
.8dig:       mul     edx
mov     edx,dword [chartab+edx+edx]     ; look up 2 digits
mov     [edi+2],edx                     ; write to answer
;  mul     esi                             ; get final 2 digits
shr     eax,7
imul    edx,eax,100                     ;fast multiplication
add     edx,1000000h                    ;round up before rounding down
shr     edx,25
jmp     .dig6a
.6dig:    cmp     eax,100000
mov     ecx,28F5C29h           ;((2^32)+100-1)/100
mov     ebx,eax                ;Dividend
sbb     edi,-6
mul     ecx                     ;EDX = Dividend DIV 100
mov     eax,edx                ;Set Next Dividend
imul    edx,-200               ;-2 * (100 * Dividend DIV  100)
mov     esi,eax                ;Dividend
movzx   ebx,word [chartab+ebx*2+edx] ;Dividend MOD 100 in ASCII
mul     ecx                     ;EDX = Dividend DIV 100
imul    eax,edx,-200
cmp     edx,10
mov     eax,dword [chartab+esi*2+eax] ;Dividend MOD 100 in ASCII
mov     edx,dword [chartab+edx+edx]
mov     [edi-5],dh
jb      .5dig
mov     [edi-6],edx
.5dig:       mov     [edi-4],eax
mov     [edi-2],ebx
ret
.7dig:       lea     ebx,[edx+edx*4]
lea     ebx,[edx+ebx*8]
shr     ebx,12
lea     edi,[edi-1]
imul    ecx,ebx,-200
or      bl,'0'
mov     [edi+3],ebx
movzx   edx,word [chartab+ecx+edx*2]
jmp     .dig4a
.9dig:    cmp     eax,10000
jb      .4dig
cmp     eax,1000000
jb      .6dig
mul     edx
shr     edx,13                  ; EDX = 5 high digits
mov     eax,28F5C29h                    ; 2^32\100 +1
imul    ebx,edx,10000           ; EBX = quotient * 10000
lea     edi,[edi-2]
sub     ebp,ebx                 ; EBP = remainder (lower 4 digits)
cmp     edx,1000
jb      .7dig
cmp     edx,10000
jb      .8dig
mov     eax,68DB9h              ; 2^32/10000+1= 1/10000 with 32 digits & rounded up
mul     edx                     ; * 1/10000 with 32 digits (so EDX = integer = first 2 digits)
mov     esi,100
or      dl,'0'
inc     edi
mov     [edi+1],edx
jmp     .dig8

chartab    db "00","01","02","03","04","05","06","07","08","09"
db "10","11","12","13","14","15","16","17","18","19"
db "20","21","22","23","24","25","26","27","28","29"
db "30","31","32","33","34","35","36","37","38","39"
db "40","41","42","43","44","45","46","47","48","49"
db "50","51","52","53","54","55","56","57","58","59"
db "60","61","62","63","64","65","66","67","68","69"
db "70","71","72","73","74","75","76","77","78","79"
db "80","81","82","83","84","85","86","87","88","89"
db "90","91","92","93","94","95","96","97","98","99"
```

I find it instructive that a routine can be speed optimised by segregating a problem into different cases, in this case, a binary number is divided into different cases depending on how many decimal digits it has.
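For readers who want the idea without the assembly, the two-digits-at-a-time table lookup can be sketched in a few lines of Python (an illustrative stand-in, not a translation of the routine; the assembly unrolls this recursion into explicit cases per digit count):

```python
# Sketch of the core idea behind dw2a: emit two decimal digits at a
# time from a "00".."99" lookup table, like chartab in the assembly.
CHARTAB = ["%02d" % i for i in range(100)]

def dw2a(n):
    if n < 100:
        # 1- or 2-digit case: direct lookup, suppressing a leading zero
        return str(n) if n < 10 else CHARTAB[n]
    hi, lo = divmod(n, 100)        # peel off the lowest digit pair
    return dw2a(hi) + CHARTAB[lo]  # recurse on the upper digits
```

The recursion halves the number of divisions compared with a digit-at-a-time loop, which is the same saving the table lookups buy in the assembly version.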

Among the general conversion routines I've tested, the above routine appears to be the fastest for all numbers >= 10 on my computer (which is rather old, so the results may not be representative).

The other routines are included in the attached zip file (source & exe).

The source file also has some nice macros (I think), including
- code timing macros for measuring in clock cycles (using rdtsc) & milliseconds (using the system counter) [converted from MASM32]
- macros to use mixed strings & numbers, to generate custom labels, and to convert binary to decimal using assembler directives [my own macros]

Description: Bunch of unsigned integer to decimal conversion routines, code timing macros, & message display macros. Filename: itoa's.zip (9.62 KB)

14 Dec 2008, 19:39
bitRAKE

Joined: 21 Jul 2003
Posts: 3055
Location: vpcmipstrm
bitRAKE
The average gain needs to be greater than the branch cost for special case consideration to be beneficial. It is easy to mask the costs of loading data and code into cache when performing these types of benchmarks - giving a false sense of optimization. This brings to light the importance of specifying the usage being optimized for.

This benchmark only applies beyond some large number of values converted; if we are talking about general usage, then it is useless. If itoa is only ever called infrequently enough that code/data is not cached (a very likely scenario), then what is the fastest routine?
15 Dec 2008, 00:21
wht36

Joined: 18 Sep 2005
Posts: 106
wht36
Yes, I forgot to mention the effect of loading data and code into the cache. Binary to decimal conversions are generally called only once or twice, so I myself don't use these "optimised" routines in my programs. I suspect a simple divide-by-10 algorithm has the smallest footprint and is probably as fast (humanly noticeable, anyway) as the fastest routine if the routine is only called a few times in real life.

Nevertheless I find the exercise of trying to optimise a binary to decimal conversion routine "fun". While the fastest routine probably has no real-life application by itself per se, I think it is nice to demonstrate the various ways of dividing a number by 10 (such as dividing by a higher radix, multiplying by the reciprocal, LEA and shifting, and table look-up).

In the general case, where all numbers need to be catered for, I think branches for special cases can be beneficial if one structures them carefully (in any case, a 32-bit binary to decimal conversion routine must branch if it wants to suppress the leading zeroes). I haven't read any formal documents on branching, but I think if one structures the branches so that the cases taking the most computation (such as a 10-digit number) use the least amount of branching (one branch test, no branch taken), then there can be some gain in performance (as in this routine).

Certainly if one knows the range to convert beforehand (the scope of usage), then one can tailor the routine to convert only those ranges, allowing further optimisation. One of the reasons my routine is so big is that I have tried to include all the cases, which can be helpful: one can pick out a particular method and use it in one's own routine if only certain cases need to be handled.

I have also included other routines in my source so that one can examine them and use ideas from those as well. The putdec routine has the smallest size and may be useful for compo writers. The routine that displays the number of clocks/milliseconds actually stores the digits in reverse.

The macro routine to store assembly constants/variables into memory is rather ugly. Perhaps a more elegant method of storing an assembly constant/variable at assembly time is as follows (not tested)
Code:
```if data eqtype 0
        local dividend, divisor
        divisor = 1                     ; start at 1 so a single digit gets no leading zero
        dividend = data
        while dividend >= 10*divisor    ; find the largest power of 10 <= data
                divisor = divisor * 10
        end while
        while divisor > 1
                db (dividend/divisor)+"0"
                dividend = dividend mod divisor
                divisor = divisor/10
        end while
        db dividend+"0"
end if
```

EDIT: The effect of cache loading is particularly dramatic if one only runs the routine once (change the count variable to 1 and reassemble & run to see what I mean). In that case the speed appears heavily dependent on the order in which routines were called and the size of the routine, such that the first routine called is most severely penalised, especially if it is a large routine.

This load penalty is not humanly noticeable, and when one is trying to speed-optimise a routine the premise is usually that the routine will be run many times. This means the initial penalty is generally not noticeable unless one is trying to optimise an extremely large routine, in which case reducing the code size may become important (so much so that using loops may be faster than not using loops!).

Last edited by wht36 on 16 Dec 2008, 06:48; edited 2 times in total
15 Dec 2008, 05:00
bitRAKE

Joined: 21 Jul 2003
Posts: 3055
Location: vpcmipstrm
bitRAKE
IIRC, the AMD optimization manual had a branchless zero suppression algorithm - maybe it had a couple of branches. I couldn't agree more: itoa is fun to optimize. Many times people coming to assembly from a HLL will ask how to display a value from a register or memory, making it a great debugging aid and a foundation for exploration.

Another approach is to maintain the number in ASCII, eliminating the need for conversion. Or minimizing the conversion with BCD representations (packed/unpacked). A number might have an associated unit (meters, inches, bytes) and its significant figures limited to make the value more humanly readable. I don't know why people don't like long hex dumps.
15 Dec 2008, 06:20
wht36

Joined: 18 Sep 2005
Posts: 106
wht36
Yes, I remember that in the old days banks and institutions dealing with decimal numbers used BCD and not binary. I think BCD is much more effective and accurate when one works with decimal numbers all the time. Not sure about the current status though. Perhaps someone who designs software for banks would know?

Hexadecimal is probably what programmers should count in, not decimal. I find it quite hard though. I guess growing up with 10 fingers made me accustomed to counting in decimal.

I remember something about AMD as well. I think it was Terje who wrote the algorithm? If I recall correctly, that algorithm doesn't suppress zeroes so doesn't need branching. However, I think it is not very fast for small numbers (no early out mechanism).
15 Dec 2008, 06:42
bitRAKE

Joined: 21 Jul 2003
Posts: 3055
Location: vpcmipstrm
bitRAKE
wht36 wrote:
I remember something about AMD as well. I think it was Terje who wrote the algorithm? If I recall correctly, that algorithm doesn't suppress zeroes so doesn't need branching. However, I think it is not very fast for small numbers (no early out mechanism).
http://board.flatassembler.net/topic.php?t=3924
...but I remember another - maybe somewhere else?

edit: Just checked...25112.PDF, page 183 has the algorithm.

15 Dec 2008, 10:05
edfed

Joined: 20 Feb 2006
Posts: 4242
Location: 2018
edfed

num2ascii:
.bin:
.hex:

i plan on .dec: and .udec: but the main problem is:

the length of the decimal source.

for .bin and .hex, i just do a nibble conversion as many times as needed.
i can convert very very long numbers to their binary or hex ascii representation.
but with decimal, it is a little more difficult because of the division by 10^x first.
and how can i divide very very long numbers by a power of 10?

and how can i convert a string of 4233 bytes into its decimal representation?

Last edited by edfed on 15 Dec 2008, 20:40; edited 2 times in total
15 Dec 2008, 16:21
wht36

Joined: 18 Sep 2005
Posts: 106
wht36
bitRake wrote:
...Just checked...25112.PDF, page 183 has the algorithm.

Wow, thanks for the reference, that is very clever code indeed, using ebx to keep track of the leading zeroes and sbb to increment only when a non-zero character is encountered, thus avoiding the need to branch entirely. Thanks again!

I have added the AMD routine in the attachment below.

By the way, while removing branches and unrolling loops is a common method of speed optimisation, I have noticed that speed does not always increase, and may in fact decrease, possibly due to
- lack of an early-out mechanism
- the need to load the cache with new code (jumping back to code that is already in the cache may in fact be faster!)

edfed wrote:
...and how can i divide very very long numbers by a power of 10?

Hmm, those are very good questions. I have not studied those at all. Let me think about them....

Edit: To divide a large number by say 10, the non-optimised (easy to understand) process is as follows:
[Picture that the large number (dividend) is composed of many dwords in memory; divide it by 10, and the quotient will also be quite big and composed of many dwords in memory.]
1. clear EDX
2. load EAX with the MOST significant dword of the dividend
3. divide by 10
4. EAX now holds the MOST significant dword of the final quotient (save it as the most significant dword of the quotient)
5. load EAX with the next most significant dword of the dividend
6. divide by 10 again
7. EAX is the next significant dword of the quotient (save it as the next significant dword of the quotient)
8. go back to 5. until all the dwords in the dividend have been processed
9. the final EAX is the last (LEAST significant) dword of the quotient (save it); the final EDX is the actual remainder
[This is the same process one uses to divide manually as taught in school, so to understand it, it helps to picture how one would do long division by hand.

Remember to specify the divisor 10 as a dword (e.g. put it in a 32-bit register such as ECX) so that EDX:EAX is divided by the divisor, with the remainder placed in EDX and the quotient in EAX.]
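The numbered steps can be sketched compactly (Python here for clarity; a little-endian list of 32-bit dwords stands in for the number in memory, and the running remainder plays EDX's role):

```python
# Long division of a multi-dword number by a small divisor, following
# the steps above: walk the dwords from most significant to least,
# carrying the remainder (EDX) into the next division.
def divmod_big(dwords, divisor):
    quotient = [0] * len(dwords)
    remainder = 0                                # EDX starts cleared
    for i in reversed(range(len(dwords))):       # most significant first
        edx_eax = (remainder << 32) | dwords[i]  # EDX:EAX before DIV
        quotient[i], remainder = divmod(edx_eax, divisor)
    return quotient, remainder
```

For example, 2^64 - 1 stored as [0xFFFFFFFF, 0xFFFFFFFF] divided by 10 yields the dwords of 1844674407370955161 plus remainder 5.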

To convert a huge binary number into a decimal string then, the process is then as follows:
1. divide the huge number by 10 as outlined above, save the final EDX (remainder) as the LEAST significant decimal digit in the final decimal result
2. divide the quotient by 10 again, save the remainder as the NEXT least significant decimal digit
3. repeat step 2 until the quotient is less than 10
4. store the quotient as the final (most significant) decimal digit in your final decimal string.
5. convert the decimal digits to ascii by adding an ascii '0' to each of the digits.

To make this process faster, instead of using 10 as the divisor, one uses 10^9 (the largest power of 10 that fits in a dword). Going through the same steps (1..9 and 1..5) as above, one first converts the big number into a base-10^9 number (an array of dwords, each less than 10^9), then uses an optimised binary-to-decimal converter to output each dword of the 10^9 series in its final decimal form.
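A sketch of the chunked version (assuming the same little-endian dword-array representation as before; note every limb except the most significant must be zero-padded to 9 digits):

```python
# Chunked conversion: repeatedly divide by 10**9 (the largest power of
# ten fitting in a dword), collecting base-10**9 limbs, then print the
# top limb plainly and the rest zero-padded to 9 digits each.
def divmod_big(dwords, divisor):
    quotient, remainder = [0] * len(dwords), 0
    for i in reversed(range(len(dwords))):
        quotient[i], remainder = divmod((remainder << 32) | dwords[i], divisor)
    return quotient, remainder

def big_to_decimal(dwords):
    limbs = []
    while any(dwords):
        dwords, limb = divmod_big(dwords, 10**9)
        limbs.append(limb)                 # base-10**9 "digits", low first
    if not limbs:
        return "0"
    return str(limbs[-1]) + "".join("%09d" % l for l in reversed(limbs[:-1]))
```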

Hope my explanation is clear enough.
I will try to code some examples when I have time.

Edit2: If the large number is not composed of a whole number of dwords (e.g. 4233 bytes), I think the method can be modified as follows:

Get the size in bits of the final partial dword: (size & 3) shl 3, e.g. (4233 & 3) shl 3 = 1 shl 3 = 8
Subtract this number from 32, e.g. 32 - 8 = 24
For the last dword loaded, shift EAX right by that many bits to remove the extra bits, e.g. shr EAX,24

15 Dec 2008, 16:51
bitRAKE

Joined: 21 Jul 2003
Posts: 3055
Location: vpcmipstrm
bitRAKE
wht36 wrote:
To make this process faster, instead of using 10 as the divisor, one uses 10^9
Instead of dividing by the largest power of 10, divide by a power of 5 and shift right at the same time to divide by the same power of two. This allows 13 digits to be extracted in a single pass of the big integer (27 digits in x86-64). I'm thinking the divide latency is sufficient to hide the shift and maybe register shuffling for multiple parallel passes on x86-64.

With SSE it should be possible to extract 48 or 52 digits in one pass, but it's complicated to code, imho. Maybe someone wants to display 2^43112609 - 1 expanded. (The binary version fits in L2 cache, btw. So, it'd be a worthwhile optimization at the instruction level as well.)

17 Dec 2008, 04:48
wht36

Joined: 18 Sep 2005
Posts: 106
wht36
bitRAKE wrote:
wht36 wrote:
To make this process faster, instead of using 10 as the divisor, one uses 10^9
Instead of dividing by the largest power of 10, divide by a power of 5 and shift right at the same time to divide by the same power of two. This allows 13 digits to be extracted in a single pass of the big integer (27 digits in x86-64). I'm thinking the divide latency is sufficient to hide the shift and maybe register shuffling for multiple parallel passes on x86-64.

With SSE it should be possible to extract 48 or 52 digits in one pass, but it's complicated to code, imho. Maybe someone wants to display 2^43112609 - 1 expanded. (The binary version fits in L2 cache, btw. So, it'd be a worthwhile optimization at the instruction level as well.)

Hmm, but one needs the remainder to continue dividing through the whole big number. Would the remainder still be valid with a divide-and-shift operation? The SSE option sounds very intriguing. I must find some time to learn about it.
17 Dec 2008, 15:36
edfed

Joined: 20 Feb 2006
Posts: 4242
Location: 2018
edfed
this solution looks good but not for a very large number.
for a 32bits value, it is ok, but for N bytes, it will take like:

(N/2)*N*2.56 divisions.

it is too much, but it seems to be the only reliable way to do it.
reliable because this algorithm is so simple it can be used with any instruction set.
17 Dec 2008, 18:30
r22

Joined: 27 Dec 2004
Posts: 805
r22
Yep, dividing by non-powers of 2 is stupid.

If I had a really big number that I wanted to divide by ten.
I'd just ...
original shift right by 4
plus
original shift right by 5
plus
original shift right by 7
Ignore the .0015625 error
and never ponder division again.

Laziness mixed with an obsessive compulsive need to be efficient is a dangerous combination.
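The size of that .0015625 error is easy to make concrete (a quick check of r22's shifts, not an endorsement — 1/16 + 1/32 + 1/128 = 0.1015625, about 1.5% above 1/10):

```python
# r22's shift-and-add stand-in for division by 10.
def div10_shifts(n):
    return (n >> 4) + (n >> 5) + (n >> 7)

# For 2**28 = 268435456 the true quotient is 26843545, but the shifts
# give 27262976 -- already wrong in the third significant digit.
```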
18 Dec 2008, 01:04
wht36

Joined: 18 Sep 2005
Posts: 106
wht36
edfed wrote:
this solution looks good but not for a very large number.
for a 32bits value, it is ok, but for N bytes, it will take like:

(N/2)*N*2.56 divisions.

it is too much, but it seems to be the only reliable way to do it.
reliable because this algorithm is so simple it can be used with any instruction set.

If speed is of the essence, then the best way is to store the number as BCD, or to limit the number of conversions. Unfortunately, conversion from a binary number to decimal needs both the REMAINDER and the QUOTIENT, and I cannot think of a simple and quick way of getting them for a huge binary-to-decimal conversion.

If one is converting a HUGE number and a division is to be avoided, then I think the fastest way is to multiply using fixed-point maths. However, unlike DIV, which uses a qword (EDX:EAX) as the source, MUL only uses a dword (EAX) as a source.

For calculating both the remainder AND the quotient, shifts are only efficient for small divisors. For large divisors the shifts and additions become too numerous, and the hardware multiply is usually much quicker.

I suspect the algorithm given by r22 has too much rounding error and is only valid for small numbers (the .0015625 rounding error means that you are off by 1 or 2 for every 1000 that is present in your number).

Some of the alternatives to division by non-powers of 2 are approximations and so with a HUGE number one must be very careful about validation, because the errors become very significant if the number is very large.

And even after one has obtained the quotient by one of the alternatives, one still needs to get the REMAINDER, either by multiplying the quotient back and subtracting that from the original number, or by multiplying the fractional part of the fixed-point result (after removing its integer portion).

By the way, by "(N/2)*N*2,56 divisions" do you mean dividing by 10 or by dividing by 10^9?
18 Dec 2008, 04:58
bitRAKE

Joined: 21 Jul 2003
Posts: 3055
Location: vpcmipstrm
bitRAKE
wht36 wrote:
Would the remainder still be valid with a divide and shift operation?
It is important to preserve the remainder as before, but we need the least B-bits of it to go with the remainder of the 5^B division. The shift doesn't actually need to take place - could even choose a lazy byte granular B (8,16,24,..).

Seems like much work, but not sweeping the whole big integer array could be massive savings....and I just like the number 5.
18 Dec 2008, 07:08
neville

Joined: 13 Jul 2008
Posts: 507
Location: New Zealand
neville
r22 wrote:
Ignore the .0015625 error
and never ponder division again.

This is actually a fixed +1.5% error (wht36: note, not 0.15%, i.e. off by 1 or 2 for every 100, not 1000), so at best only the first 2 decimal digits of any result will be correct; all the remaining digits will be meaningless.
For example, dividing the decimal number 268435456 by 10 using your method would give the result as 27262976, instead of 26843545 (+ remainder 6).
Or dividing decimal 1,000,000,000 by 10 would give 101,562,500 instead of 100,000,000. The correct answer in hex is 5F5E100, but the answer given would be 60DB884 in hex - not one digit the same!

Of course it depends on the level of precision required, but the problem with your shifting method r22 boils down to the fact that decimal 1/10 = 0.1 can NOT be exactly represented as a binary (or hexadecimal) fraction:
0.1 (decimal) == 0.1999999999... (hex) - it is a recurring hexadecimal!
Your method effectively says 0.1 (decimal) == 0.1A (hex)

No valid remainder could ever be calculated using this method, because most of the preceding (more significant) digits are wrong.

The error could be reduced from +1.5% to -0.4% by replacing your last "shift right 7" with:
shift right 8
plus
shift right 9
which effectively equates 0.1 dec to 0.198 hex ... but it still doesn't guarantee even 3 correct decimal digits. Just to get 3 correct digits (+0.1% error) you'd have to also:
plus
shift right 11
which equates 0.1 dec to 0.19A hex and requires a total of 5 shifts and 4 adds...

BTW, there's nothing to be gained by shifting the original each time (since the shifted bits are immediately lost) - a progressive shift-and-add would be quicker and more efficient.
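The three approximations compared above check out numerically (a quick verification; the fractions are exact binary values, so the comparisons are exact):

```python
# neville's comparison as shift-and-add sums:
# 0.1A hex  = 1/16 + 1/32 + 1/128          = 0.1015625     (+1.5%)
# 0.198 hex = 1/16 + 1/32 + 1/256 + 1/512  = 0.099609375   (-0.4%)
# 0.19A hex = 0.198 hex + 1/2048           = 0.10009765625 (+0.1%)
approx_1A = 1/16 + 1/32 + 1/128
approx_198 = 1/16 + 1/32 + 1/256 + 1/512
approx_19A = approx_198 + 1/2048
```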

_________________
FAMOS - the first memory operating system
18 Dec 2008, 10:19
edfed

Joined: 20 Feb 2006
Posts: 4242
Location: 2018
edfed
i have found a little idea:

convert each byte to decimal, multiplied by its rank.
LSB => 0 to 255
LSB+1 => 0 to 65535-255 granularity 256
etc etc

then, just add all the strings of converted bytes together into one unique result.

or maybe each bit, converted into a BCD string, and added to the total if 1, nothing if 0

and finally, converted in ascii.

yes, now it's evident.
to convert an arbitrarily long number into a signed decimal ascii string, the best is to convert each nibble to bcd,
multiplying by powers of ten with shift-left combinations of the bcd value (shl 4), or the ascii char (shl 8).

when the nibble is converted in the buffer
convert the next one, and again, shift the numbers to multiply it by power of ten.

and to shift the numbers:
use shl [eax],ecx

Code:
```
number dd 0000100000,500000,1000,42222222,786544,1,1,1,1,1,1,12,13
; this number should be enormous: it is the sum of each
; [dword] * 2^(32*rank)
.end  dd 0
buffer rb 3000
.end dd 0
...
mov esi,number
mov edi,buffer
clear buffer
...
mov bl,[esi]
etc etc ...
```
18 Dec 2008, 17:20
neville

Joined: 13 Jul 2008
Posts: 507
Location: New Zealand
neville
edfed wrote:
i have found a little idea:

convert each byte to decimal, multiplied by its rank.
LSB => 0 to 255
LSB+1 => 0 to 65535-255 granularity 256
etc etc

edfed, what does this have to do with binary to decimal conversion?

_________________
FAMOS - the first memory operating system
18 Dec 2008, 18:42
iseyler

Joined: 17 Jul 2008
Posts: 3
Location: Kitchener, Ontario
iseyler
Here is what I use to convert a 64-bit binary value into a string.

Code:
```; ---------------------------------------------------------------
; os_int_to_string -- Convert a binary integer into a string
; IN: RAX = binary integer, RDI = location to store string
; OUT: RDI = pointer to end of string
; Notes: This function is a wrapper for the os_int_to_string_worker function below. The wrapper is needed to maintain the original contents of the registers and to terminate the string.
; Min return value is 0 and max return value is 18446744073709551615 so your string needs to be able to store at least 21 characters (20 for the number and 1 for the string terminator).
os_int_to_string:
push rdx
push rcx
push rax

call os_int_to_string_worker
mov al, 0x00
stosb                   ; Store the null terminator at the end of the string

pop rax
pop rcx
pop rdx
ret

; The real worker of the os_int_to_string function. It uses recursion to generate the string. Adapted from AsmLib for Linux.
os_int_to_string_worker:
mov rcx, 10
os_int_to_string_recurse:
xor rdx, rdx
div rcx         ; RDX:RAX / RCX
push rdx
or rax, rax
jz os_int_to_string_store
call os_int_to_string_recurse
os_int_to_string_store:
pop rax
or al, '0'
stosb
ret
; ---------------------------------------------------------------
```
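The same recursion, restated in Python for readers coming from a HLL (an illustrative equivalent, not part of BareMetal OS):

```python
# Mirror of os_int_to_string_worker: divide by 10, recurse on the
# quotient first, then emit this digit, so the digits come out most
# significant first without any buffer reversal.
def int_to_string(n):
    quotient, digit = divmod(n, 10)          # div rcx / push rdx
    prefix = int_to_string(quotient) if quotient else ""
    return prefix + chr(digit + ord("0"))    # or al,'0' / stosb
```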

-Ian

_________________
BareMetal OS - http://www.returninfinity.com/bare-metal-os
Mono-tasking 64-bit OS for x86-64 based computers written entirely in Assembly
18 Dec 2008, 20:04
wht36

Joined: 18 Sep 2005
Posts: 106
wht36
bitRAKE wrote:
wht36 wrote:
Would the remainder still be valid with a divide and shift operation?
It is important to preserve the remainder as before, but we need the least B-bits of it to go with the remainder of the 5^B division. The shift doesn't actually need to take place - could even choose a lazy byte granular B (8,16,24,..).

Seems like much work, but not sweeping the whole big integer array could be massive savings....and I just like the number 5.

Hmm, I am having difficulty visualizing this process of shifting the least B bits into the remainder. For example, say we have 15 and we want to divide by 10^1; the expected result is quotient 1, remainder 5
Code:
```15/(5^1) shr 1 = quotient 1, remainder 0
1111b / 101b shr 1 = 1b, least bits = 1b, remainder 0
```

How does one then shift the least bit into the remainder? By multiplying it by 5? Wouldn't this limit the size of the 5^n division, because if you add the least bits * 5 to the remainder, one risks overflowing the remainder?

edfed wrote:
i have found a little idea:

convert each byte to decimal, multiplied by its rank.
LSB => 0 to 255
LSB+1 => 0 to 65535-255 granularity 256
etc etc

This simply re-sums all the binary digits and would arrive at the same binary result as the original binary number. To sum the decimal digits, see http://www.cs.uiowa.edu/~jones/bcd/decimal.html
(Extract from that webpage below)
Code:
```Consider a 16 bit number n. This can be broken into 4 fields of 4 bits each, the base 16 digits; call them n3, n2, n1 and n0.

         n
|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _|
|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|
|  n3   |  n2   |  n1   |  n0   |

The value of the number can be expressed in terms of these 4 fields as follows:

n = 4096n3 + 256n2 + 16n1 + n0

In the same way, if the value of n, expressed in decimal, is d4d3d2d1d0, where each of d4, d3, d2, d1 and d0 are decimal digits, the value of n can be expressed as:

n = 10000d4 + 1000d3 + 100d2 + 10d1 + d0

Our problem then, is to compute the di from the ni without dealing directly with the larger number n.

To do this, we first note that the factors used in combining the values of ni can themselves be expressed as sums of multiples of powers of 10. Therefore, we can rewrite the original expression as:

n =
n3( 4×1000 + 0×100 + 9×10 + 6×1 ) +
n2( 2×100 + 5×10 + 6×1 ) +
n1( 1×10 + 6×1 ) +
n0( 1×1 )

If we distribute the ni over the expressions for each factor, then factor out the multiples of each power of 10, we get the following:

n =
1000(4n3) +
100(0n3 + 2n2) +
10(9n3 + 5n2 + 1n1) +
1(6n3 + 6n2 + 6n1 + 1n0)

We can use this to arrive at first approximations ai for d3 through d0:

a3 = 4n3
a2 = 0n3 + 2n2
a1 = 9n3 + 5n2 + 1n1
a0 = 6n3 + 6n2 + 6n1 + 1n0     ```
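The quoted first approximations can exceed 9 per slot, so a carry-propagation step (implied but not shown in the quoted extract; the function name here is my own) is needed to get true decimal digits. With that step added, the scheme checks out:

```python
# First approximations a0..a3 from the quoted text, then carry
# propagation so each slot holds one decimal digit (d4 catches the
# final carry).
def hex_digits_to_decimal(n3, n2, n1, n0):
    a = [6*n3 + 6*n2 + 6*n1 + n0,   # a0
         9*n3 + 5*n2 + n1,          # a1
         2*n2,                      # a2  (0*n3 + 2*n2)
         4*n3,                      # a3
         0]                         # d4
    for i in range(4):              # propagate carries upward
        a[i + 1] += a[i] // 10
        a[i] %= 10
    return a                        # [d0, d1, d2, d3, d4]
```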

I have tried to extend the above to a 32-bit source number... It is not very workable, because for EACH digit one has to calculate the result separately with a different combination of additions and multiplies. So for a 10-digit number, one needs 9 combinations of adds & multiplies. For a huge number with n digits, one would need n-1 combinations, each depending on the result of the previous, and with an almost certain chance that the result will overflow into the next several digits (try using this method to convert a 32-bit number into decimal and you will see what I mean)...
19 Dec 2008, 03:25
bitRAKE

Joined: 21 Jul 2003
Posts: 3055
Location: vpcmipstrm
bitRAKE
wht36 wrote:
Hmm, I am having difficulty visualizing this process of shifting the least B-bits into the remainder.
Yeah, I kind of jumped over the math - it's better to get a feel for what is happening, imho. When you say, "to multiply by 5...overflow", I'm sure you understand:

A = N mod 5^B
C = (N-A)/5^B mod 2^B ; least bits of quotient
N mod 10^B = 5^B * C + A

5^B fits in our chosen size, so 10^B will be okay in twice the size. I'll post the 64-bit code, maybe before I go to sleep...
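The three lines above are easy to verify directly (a Python check, with B as the per-pass digit count):

```python
# bitRAKE's identity: the remainder mod 10**B is rebuilt from the
# remainder of the 5**B division (A) and the low B bits of its
# quotient (C), so no division by 10**B is ever needed.
def mod_pow10(n, b):
    a = n % 5**b                 # A = N mod 5^B
    c = (n // 5**b) % 2**b       # C = least B bits of the quotient
    return 5**b * c + a          # = N mod 10^B
```

With B = 13, 5^13 = 1220703125 still fits in 32 bits, which is what makes the 13-digits-per-pass extraction mentioned earlier possible.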

_________________