flat assembler
Message board for the users of flat assembler.

Index > Assembly > Division by multiplication - revisited

Goto page 1, 2  Next
Author
Thread Post new topic Reply to topic
Tomasz Grysztar



Joined: 16 Jun 2003
Posts: 8359
Location: Kraków, Poland
Tomasz Grysztar 06 Sep 2017, 21:31
Back on the ASM Community message board, when there was a discussion about performing division by invariant divisor with a MUL instead of DIV, I looked to create an explanation of why this works through the theory of 2-adic numbers. But recently, when I played with these ideas again, I realized that it is possible to analyze it just on the basis of pure integer arithmetic. I also made a new (at least to me) variant of the algorithm, which works correctly for every possible input and is easily provable.

This is how the algorithm looks implemented in fasm/fasmg scripting language:
Code:
; Preparation stage:

BITNESS = 32    ; number of bits in word

DIVISOR = 3     ; a number to divide by

; convert DIVISOR into MAGIC and SHIFT constants:
SHIFT = bsr (DIVISOR-1) + 1
MAGIC = ((1 shl SHIFT - DIVISOR) shl BITNESS) / DIVISOR + 1

; Division algorithm

NUMBER = 7FFFFFFFh      ; an example number to divide

PRODUCT = NUMBER * MAGIC + NUMBER shl BITNESS
QUOTIENT = PRODUCT shr (SHIFT+BITNESS)

REMAINDER = ( (PRODUCT and (1 shl (SHIFT+BITNESS) - 1)) * DIVISOR ) shr (SHIFT+BITNESS)    
You need fasmg to get this running for larger input numbers, since fasm 1 overflows easily. It also shows how to use the low bits of the product to produce the remainder from the division, though this is not used the following example.

The implementation in assembly language is a bit more complex. Read further below to learn why it has additional quirks compared to the classic ones:
Code:
; Preparation stage:
        mov     rbx,[divisor]
        mov     rcx,rbx
        dec     rcx
        bsr     rcx,rcx
        inc     cl
        mov     rdx,1
        shl     rdx,cl
        sub     rdx,rbx
        xor     rax,rax
        div     rbx
        inc     rax
        mov     [magic],rax
        dec     cl
        mov     [shift_m1],cl    
Code:
; Division:
        mov     rbx,[number]
        mov     rax,[magic]
        mov     cl,[shift_m1]
        mul     rbx
        add     rdx,rbx
        rcr     rdx,1
        shr     rdx,cl
        mov     [quotient],rdx    
Note that it stores "shift minus one" value, because one of the needed bit shifts is made through RCR in order to not loose the carry bit when dividing large numbers. The 32-bit version would be exactly the same, just replace all the 64-bit register names with 32-bit ones.
_____________________________________

And now the explanation. To analyze the problem purely from the point of view of the integer arithmetic, we can formulate it as follows.

For a given invariant divisor D < 2^B (where B is the number of bits in word) we want to find a bit shift S and a "magic" number M such that:
M*D = 2^(B+S) + R
where 0 <= R <= 2^S

If we succeed, then to get the quotient Q = N / D, we can do the following calculation:
N*M = Q*2^(B+S) + L
with 0 <= L < 2^(B+S)
(this shows that we can drop the low B+S bits and take the high ones to get the Q)

Let's take a look at the N*M*D product and substitute M*D from the definition of M:
N*M*D = N*2^(B+S) + N*R
Because N < 2^B and R <= 2^S, we have N*R < 2^(B+S), so the high bits of N*M*D contain original number N.

If we look at the components of N*M, we can evaluate the same product in a different way:
N*M*D = (Q*2^(B+S) + L)*D = Q*D*2^(B+S) + L*D
now, as L < 2^(B+S), L*D < D*2^(B+S), therefore high bits of L*D contain a number smaller than D and we can represent it as:
L*D = X*2^(B+S) + Y
where X < D and Y < 2^(B+S)

Now:
N*M*D = Q*D*2^(B+S) + X*2^(B+S) + Y
therefore high bits of N*M*D contain just Q*D + X and we already know that they also equal the original number N, so:
N = Q*D + X
and because X is positive and smaller than D, this proves that the quotient obtained this way is correct.

Moreover, this demonstrates that high bits of L*D contain the remainder.
_____________________________________

Now let's look at a classic example, this time for B=32:
For D = 3 we can take M = 0AAAAAAABh and S = 1, then
M*D = 200000001h = 2^(1+32) + 1
which fulfills the requirements, because 1 < 2 = 2^S. Therefore, to divide a number by 3, we can simply do this:
Code:
        mov     eax,[number]
        mov     edx,0AAAAAAABh
        mul     edx
        shr     edx,1
        mov     [quotient],edx    
And a couple more instructions could be added to recover the remainder from the low bits of the product, though this is more of a curiosity, since computation of remainder when the quotient is already present is a simple task anyway:
Code:
        mov     ecx,3
        mov     ebx,0
        cmovc   ebx,ecx
        mul     ecx
        add     edx,ebx
        shr     edx,1
        mov     [remainder],edx    
Note that the highest bit of the 33 low bits of the product landed in CF, and it is multiplied by 3 manually and then added the high bits of the second product.
_____________________________________

For some divisors finding M and S such that R < 2^S may be troublesome.
We can try the most straightforward route and just take smallest S such that D <= 2^S, then divide 2^(S+B) by D and increase it by one to cover the deficit caused by the truncation inherent in integer division:
M = 2^(S+B) / D + 1
then 2^(S+B) < M*D <= 2^(S+B) + D, so the requirements are fulfilled, the only problem is that M is going to be larger than 2^B, so it no longer fits in a machine word.

But, because we take S such that 2^S is the smallest power of two that is not smaller than D:
2^S / D = 1
and therefore M contains only a single bit set above the machine word. We can take M' = M - 2^B and get the "magic" number that fits in the machine word. But when multiplying by it, we have to remember that we need to actually multiply by M' + 2^B, which is the same as multiplying by M' and then adding N to the high bits.

This is how the algorithm I presented in the beginning works. The multiplier it computes is:
(2^(S+B) / D + 1) - 2^B = ((2^S - D) * 2^B) / D + 1
a then the product is adjusted by adding N*2^B before producing the quotient. This may produce carry, therefore the first of the right shifts that follows has to be RCL, to get the overflowing bit back into position.
_____________________________________

If we prefer to stick to the classic algorithm with M fitting in the machine word, it may not always be possible to find the right combination of M and S as defined above. But we can also search for a pair such that:
M*D = 2^(B+S) - R
where R <= 2^S, as previously.

What this modifies in the evaluations is that:
N*M*D = N*2^(B+S) - N*R
and since N*R again is smaller than 2^(B+S), the high bits of N*M*D this time are N-1.

So when we multiply by a number of this form, we actually compute (N-1) / D instead of N / D. The classic algorithm deals with this by simply increasing the number by one before multiplication.

For example, for D = 7 we can take M = 92492492h and S = 2. Then:
M*D = 3FFFFFFFEh = 2^(2+32) - 2
and 2 < 4 = 2^S. Therefore we can divide by 7 this way:
Code:
        mov     eax,[number]
        mov     edx,92492492h
        inc     eax
        mul     edx
        shr     edx,2
        mov     [quotient],edx    
A tiny weak spot of this routine is that it fails for input 0FFFFFFFFh. This is the reason why I decided that the more complex algorithm I presented first is also worth sharing, as it should work correctly for any input and any divisor (if there is no mistake in my proof).


Last edited by Tomasz Grysztar on 07 Feb 2019, 09:25; edited 1 time in total
Post 06 Sep 2017, 21:31
View user's profile Send private message Visit poster's website Reply with quote
Tomasz Grysztar



Joined: 16 Jun 2003
Posts: 8359
Location: Kraków, Poland
Tomasz Grysztar 07 Sep 2017, 11:12
Now instead of taking the smallest power of two that is greater or equal to D, we can take the greatest power of two not greater than D and use it as S. If we now divide 2^(S+B) by D, we get:
2^(S+B) = M * D + P, P < D
Let's set R = D - P, then
M * D = 2^(S+B) - R
If R <= 2^S, we can use the algorithm (but we need INC instruction before MUL).
If R > 2^S, then P < 2^S, because we chose S such that D < 2^(S+1). Then we can take M' = M + 1 and use the algorithm without INC, because:
M' * D = 2^(S+B) - R + D = 2^(S+B) + P

This bring us to the following implementation of the classic algorithm:
Code:
; Preparation stage:
        mov     ebx,[divisor]
        bsr     ecx,ebx
        mov     [shift],cl
        mov     edx,1
        shl     edx,cl
        xor     eax,eax
        div     ebx
        shr     edx,cl
        setc    cl
        mov     [increment],ecx
        xor     cl,1
        add     eax,ecx
        mov     [magic],eax    
Code:
; Division:
        mov     eax,[number]
        mov     edx,[magic]
        add     eax,[increment]
        mul     edx
        mov     cl,[shift]
        shr     edx,cl
        mov     [quotient],edx    
The [integer] variable has value 0 or 1, it serves as a conditional INC instruction.
Again, no floating point calculations needed, the entire reasoning is solid just in terms of integer relations.
This algorithm does not work for divisors being powers of two, since M does not fit into machine word then. These divisors would need to be detected and division turned into a simple shift instead.
_____________________________________

Note that both the algorithms given use the large values of S even if there may exist a smaller S for which the requirements would be fulfilled. An example for B=32 is number 641, where the algorithm works for S = 0 and M = 663D81h:
663D81h*641 = 2^32 + 1
R = 1 <= 2^S

So 32-bit divison by 641 is a rare case when we can do without the shift instruction:
Code:
        mov     eax,[number]
        mov     edx,663D81h
        mul     edx
        mov     [quotient],edx    
Post 07 Sep 2017, 11:12
View user's profile Send private message Visit poster's website Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20451
Location: In your JS exploiting you and your system
revolution 07 Sep 2017, 11:28
Tomasz Grysztar wrote:
Therefore we can divide by 7 this way:
Code:
        mov     eax,[number]
        mov     edx,92492492h
        inc     eax
        mul     edx
        shr     edx,2
        mov     [quotient],edx    
A tiny weak spot of this routine is that it fails for input 0FFFFFFFFh. This is the reason why I decided that the more complex algorithm I presented first is also worth sharing, as it should work correctly for any input and any divisor (if there is no mistake in my proof).
If branches are cheap then we can add a single jz:
Code:
        mov     eax,[number]
        mov     edx,92492492h
        inc     eax
        jz      @f
        mul     edx
    @@: shr     edx,2
        mov     [quotient],edx    
I use this method with ARM code with a conditional mul instruction. Although only some divisors need this, so it is somewhat wasteful to make it generic. Also if you know your numerator doesn't need it then leave out the jz.

The most common divisor 10 where M=CCC...CCD and S=3 can use the normal N*M >> S approach.
Post 07 Sep 2017, 11:28
View user's profile Send private message Visit poster's website Reply with quote
Tomasz Grysztar



Joined: 16 Jun 2003
Posts: 8359
Location: Kraków, Poland
Tomasz Grysztar 07 Sep 2017, 12:01
revolution wrote:
If branches are cheap then we can add a single jz
You are right, this correctly handles the overflow (just like RCL in my first algorithm). It only gives the correct upper half of the product, though - so if you wanted to recover remainder, you'd need to do it some other way.

With this correction in mind, the only remaining advantage of the first algorithm is that it does not require any assumptions concerning divisors. But since the problematic divisors are the powers of two, we'd be better off handling them separately anyway.
Post 07 Sep 2017, 12:01
View user's profile Send private message Visit poster's website Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20451
Location: In your JS exploiting you and your system
revolution 07 Sep 2017, 12:09
Tomasz Grysztar wrote:
It only gives the correct upper half of the product, though ...
eax will be zero, so I thought it is the required full 64-bit result of (N+1)*M?
Post 07 Sep 2017, 12:09
View user's profile Send private message Visit poster's website Reply with quote
Tomasz Grysztar



Joined: 16 Jun 2003
Posts: 8359
Location: Kraków, Poland
Tomasz Grysztar 07 Sep 2017, 12:21
revolution wrote:
Tomasz Grysztar wrote:
It only gives the correct upper half of the product, though ...
eax will be zero, so I thought it is the required full 64-bit result of (N+1)*M?
Ah, you got me there. Smile It may not even be a coincidence that all the values end up being correct.
Post 07 Sep 2017, 12:21
View user's profile Send private message Visit poster's website Reply with quote
Tomasz Grysztar



Joined: 16 Jun 2003
Posts: 8359
Location: Kraków, Poland
Tomasz Grysztar 07 Sep 2017, 12:21
revolution wrote:
Also if you know your numerator doesn't need it then leave out the jz.
If we consider additional assumptions about the numerator, then there is one more interesting thing that can be deduced from my proof.

The requirement that R <= 2^S is only needed to show that N*R < 2^(S+B). So if we can get the same condition through the assumptions on N, it may enable new choices of S, even S = 0.

So if we know that N is always small enough that N*D <= 2^B, we can take S = 0, M = 2^B / D + 1 (except when D is a power of 2, when we should drop "+ 1"). Then multiplying by M gives correct N/D in the high bits for all N < M.

For example for D = 10, if we can safely assume that N is smaller than M = 2^32 / 10 + 1 = 1999999Ah, we can just multiply by it and take the high 32 bits as a result.
Post 07 Sep 2017, 12:21
View user's profile Send private message Visit poster's website Reply with quote
Tomasz Grysztar



Joined: 16 Jun 2003
Posts: 8359
Location: Kraków, Poland
Tomasz Grysztar 08 Sep 2017, 15:05
What about dividing numbers larger than the machine word? With DIV instruction it is possible to chain multiple divisions together, the remainder from the previous division is already in EDX/RDX ready to become the high word (smaller than the divisor, so overflow never happens) of the subsequent division in chain. With such chaining it is possible to divide arbitrarily large number by a small divisor.

The high word (EDX or RDX) must be smaller than the divisor, otherwise an overflow would occur (with DIV instruction it would be an exception 0). If we take V = bsr D, then:
N < 2^(B+V+1), because high word of N is smaller than the divisor
R < D < 2^(V+1)
N*R < 2^(B+2*V+2)
So we can take S = 2*V+2. Obviously M = 2^(B+S) / D + 1 is going to be larger than machine word, so the the multiplication is no longer going to be simple.

For D = 10, we get S = 8 and M = 2^40 / 10 + 1 = 199999999Ah. This led me to the following 32-bit routine that can replace "DIV 10" (without checking for overflow, input EDX must be smaller than 10):
Code:
div10:
; this trashes EBX and ECX, you may need to preserve them
        push    eax edx
        imul    edx,19h
        mov     ecx,edx
        mov     edx,09999999Ah
        mov     ebx,eax
        mul     edx
        mov     eax,ebx
        mov     ebx,edx
        mov     edx,19h
        mul     edx
        add     ebx,eax
        adc     ecx,edx
        pop     eax
        mov     edx,09999999Ah
        mul     edx
        add     eax,ebx
        adc     ecx,edx
        shrd    eax,ecx,8
        pop     edx
        imul    ecx,eax,10
        sub     edx,ecx
        retn    
This is a long function with five multiplication instructions, but when I took a routine for converting very large integers to decimal and replaced the DIV instruction with a call to "div10", it still gave a noticeable improvement on my machine. If anything, this demonstrates how expensive DIV can sometimes be.
Post 08 Sep 2017, 15:05
View user's profile Send private message Visit poster's website Reply with quote
tthsqe



Joined: 20 May 2009
Posts: 767
tthsqe 09 Sep 2017, 09:48
Since this is a thread on division, I though I would collect my thoughts on multiprecision division. Throughout this post, let X be the word size of the machine - so usually X=2^{32} \text{ or } X=2^{64}, but any integer X \ge 4 will do. Given word-sized positive integers A_0, B_0, B_1, the x86 architecture has an instruction div for calculating

0 \le \left \lfloor{ \frac{B_1 X + B_0}{A_0}}\right \rfloor < X\text{.}
An exception is raised if this quotient is  \ge X, i.e. not word sized.

The goal of division is to calculate 
\left \lfloor{ \frac{B}{A}\right \rfloor
for multi-word integers A and B. When A is word-sized, this can be calculated by stitching together div's, so the problem is to solve the case when A is bigger than a word. In the grade school algorithm, the words of the quotient are produced from most significant to least significant. This means that the grade school algorithm is only dependent upon a solution to the following problem:

Given big integers A and B with 0 \le \frac{B}{A} < X and A \ge X, calculate the word-sized integer \lfloor{ \frac{B}{A} \rfloor.

Now, by shifting A and B, this problem can also be represented as calculating

0 \le \left \lfloor{ \frac{B+b}{A+a}\right \rfloor < X\text{,}
where A and B are integers with X^2/2 \le A < X^2 and 0 \le B < X^3, and a and b are real numbers with 0 \le a,b < 1.
Fact: With the above assumptions, it holds that
-1 \le \left \lfloor{ \frac{B+b}{A+a}}\right \rfloor - \left \lfloor{ \frac{B}{A}}\right \rfloor  \le 0.
This means that the difficult (n+1)-by-n word division \lfloor(B+b)/(A+a)\rfloor can be reduced to the simpler 3-by-2 division \lfloor B/A\rfloor. If the latter does not match the former, is must be over by only one. In the grade school algorithm, this corresponds to overestimating a quotient term, and thus having to 'fix it up'. The point is that if the quotient term is estimate incorrectly, it is overestimated and only overestimated by one.
Proof: Write B = Q A + R where Q = \lfloor B/A\rfloor and 0 \le R < A. Then,

\frac{B+b}{A+a} - Q = \frac{R + b - a Q}{A+a}\text{.}
It is easy to see that

\frac{R + b - a Q}{A+a} \le \frac{R + b}{A+a} < 1\text{,}
and some manipulation of the assumption  (Q A + R + b)/(A+a) < X shows that

\frac{R + b - a Q}{A+a} > - \frac{a X - (R + b)}{A} > - \frac{a X}{A}\text{.}
This last quantity is greater than -1 by the assumptions on A.

So now the grade school division has been reduced to the division problem

0 \le \left \lfloor{ \frac{B_2 X^2 + B_1 X + B_0}{A_1 X + A_0}}\right \rfloor \le X\text{,}
with the assumptions that the A_i and B_i are word-sized and that X/2 \le A_1 < X .
This will be covered in a future post.
Post 09 Sep 2017, 09:48
View user's profile Send private message Reply with quote
tthsqe



Joined: 20 May 2009
Posts: 767
tthsqe 09 Sep 2017, 11:07
In the previous post, large integer division was reduced to a sequence of 3-by-2 divisions followed by n-by-1 multiply-subtract operations, in true grade school fashion. The quotient of the 3-by-2 division was between 0 and X inclusive. Since the quotient terms for the grade school division are by definition word-sized, we need to calculate

q = \text{min}\left( X-1, \left \lfloor{ \frac{B_2 X^2 + B_1 X + B_0}{A_1 X + A_0}}\right \rfloor \right)\text{,}
where it is assumed that this floor is \le X and of course all of the previous assumptions, including X/2 \le A_1.

If you follow the proof given in the previous post for this problem you will see that

-2 \le  \left \lfloor{ \frac{B_2 X^2 + B_1 X + B_0}{A_1 X + A_0}}\right \rfloor - \left \lfloor{ \frac{B_2 X + B_1}{A_1}}\right \rfloor \le 0\text{.}
However, care is required as the division  \lfloor (B_2 X + B_1)/A_1 \rfloor can overflow a word.
I propose that the following algorithm computes q.
Code:
if B2 < A1
    (q, r) = div(B2:B1, A1)
    (lo, hi) = mul(q, A0)
    T0 = sub(B0, lo)
    T1 = sbb(r, hi)
    T2 = sbb(0, 0)  ; copy carry flag across T2
    if T2 <> 0
        q = dec(q)
        T0 = add(T0, A0)
        T1 = adc(T1, A1)
        T2 = adc(T2, 0)
        if T2 <> 0
            q = dec(q)
            T0 = add(T0, A0)    ; only calculated
            T1 = adc(T1, A1)    ; for the asserts
            T2 = adc(T2, 0)     ; below
        end if
    end if
    assert( T2 == 0)                    ; T2:T1:T0 is supposed
    assert( T1 * X + T0 < A2 * X + A1)  ; to contain the remainder
    return q
else
    q = X - 1
    T0 = B0
    T1 = sub(B1, A0)
    T2 = sbb(B2, A1)
    T3 = sbb(0, 0)
    T0 = add(T0, A0)
    T1 = adc(T1, A1)
    T2 = adc(T2, 0)
    T3 = adc(T3, 0)
    if T3 <> 0
        q = dec(q)
    end if
    return q
end if    


I do not know at this time why at most one fixup is required in the "else" block. Of course two fixups are required in general by the above inequality.

Also, notice that this algorithm contains a call to div with a fixed divisor for each quotient word produced in the whole grade school algorithm. Thomasz would probably want to change this invariant division into some multiplications.
Post 09 Sep 2017, 11:07
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20451
Location: In your JS exploiting you and your system
revolution 09 Sep 2017, 11:59
When I was doing this stuff a few years ago I found that Barrett and Montgomery were the most used two methods for division and modulo for multi-precision operations. There are also whole books dedicated to this subject, and many PhD theses that attempt to improve the efficiency and performance in specific CPUs.
Post 09 Sep 2017, 11:59
View user's profile Send private message Visit poster's website Reply with quote
Tomasz Grysztar



Joined: 16 Jun 2003
Posts: 8359
Location: Kraków, Poland
Tomasz Grysztar 09 Sep 2017, 13:28
tthsqe wrote:
Since this is a thread on division, I though I would collect my thoughts on multiprecision division.
I wanted to continue my thoughts on "magic multipliers" but I see that you have moved the topic in a slightly different direction already. Perhaps we should split the thread?

Also, what I wanted to focus on here is the analysis of these tricks purely on the grounds of integer arithmetic, with no even slightest mention of real (or even p-adic) numbers. This is my personal hobby. Wink
Post 09 Sep 2017, 13:28
View user's profile Send private message Visit poster's website Reply with quote
tthsqe



Joined: 20 May 2009
Posts: 767
tthsqe 09 Sep 2017, 14:50
I don't think it needs to be split. I am just giving you more to think about.
What you have been doing is the division of singe words (like the c "/" operator). You can interpret what I am saying as request for a division of two words by one word (like the x86 "div" instruction) by using multiplications with precomputed numbers.
Post 09 Sep 2017, 14:50
View user's profile Send private message Reply with quote
Tomasz Grysztar



Joined: 16 Jun 2003
Posts: 8359
Location: Kraków, Poland
Tomasz Grysztar 09 Sep 2017, 14:53
Back to the beginning: the idea behind the division by multiplication can be presented as follows.

If we want to divide integer N by divisor D, what we can try is to find number X = Q2^n + L such that 0<=L<2^n and:
XD = N2^n + Y,
0<=Y<2^n.
The same product can also be written as:
XD = QD2^n + LD
and because L<2^n:
LD<D2^n,
so:
LD = R2^n + Z,
0<=R<D,
0<=Z<2^n.
Therefore XD = (QD+R)2^n + Z and thus N = QD+R which gives us correct division with remainder, since 0<=R<D. We get the result Q simply by taking the high bits of X. We can also get the remainder R by taking high bits of LD.

Now, to find the number X we can multiply N by so-called magic multiplier M, which is a number such that:
MD = 2^n + P,
0<=P<D
and then:
XD = NMD = N2^n + NP
which gives the correct result as long as NP < 2^n.
All the algorithms presented above made this work by finding n large enough that this inequality would be fulfilled for all N in the required range.

But there is still another approach we could try. When we compute M, we also know P, and we can use this knowledge to try to adjust the result even if NP ends up being large enough to meddle with the high bits of NMD.

We can compute NP and split into high and low part:
NP = N_1{2^n} + P_1
and then try to adjust for the error by taking:
N' = N-N_1
X' = N'M = (N-N_1)M
This leads to:
X'D = N'MD = (N-N_1)MD=(N-N_1)2^n+(N-N_1)P=(N-N_1)2^n+N_1{2^n} + P_1-N_1{P}=N2^n+P_1-N_1{P}
Now, keeping in mind that:
P_1<2^n
we also need to ensure that:
N_1{P}<2^n.
Then the worst that may happen is that the high bits of X'D might contain N - 1 instead of N. This can be easily fixed after computing the remainder - if remainder ends up being larger than D we can subtract D from it and increase the quotient by one.

Let's try it on an example, with D = 10, n = 32. Then M=1999999Ah and P = 4:
MD = 1999999Ah*10 = 2^{32} + 4
To simulate a DIV instruction we need to operate on range:
N<D*2^{32}
With this in mind we have:
NP<PD*2^{32}
so
N_1<PD
and finally:
N_1{P}<P^2{D}=160
This is far from being dangerous to the high bits of X'D.

It is time for an implementation:
Code:
div10:
;        cmp     edx,10
;        jae     int0
        push    ebx ecx
        push    eax
        mov     ebx,eax
        mov     ecx,edx
        shld    edx,eax,2
        sub     ebx,edx
        sbb     ecx,0
        mov     eax,ebx
        mov     ebx,1999999Ah
        mul     ebx
        mov     eax,ecx
        imul    eax,ebx
        add     eax,edx
        pop     edx
        imul    ecx,eax,10
        sub     edx,ecx
        cmp     edx,10
        jb      .done
        sub     edx,10
        inc     eax
    .done:
        pop     ecx ebx
        retn    
This ends up being faster than the previous one, it only has three multiplications (and only one of them is a full-sized MUL) thanks to the fact that P=4 and multiplication by P can be done by shifting.
For 64-bit version just replace all 32-bit registers with their 64-bit counterparts and the constant with 199999999999999Ah (P is the same).
Post 09 Sep 2017, 14:53
View user's profile Send private message Visit poster's website Reply with quote
Tomasz Grysztar



Joined: 16 Jun 2003
Posts: 8359
Location: Kraków, Poland
Tomasz Grysztar 09 Sep 2017, 14:58
tthsqe wrote:
You can interpret what I am saying as request for a division of two words by one word (like the x86 "div" instruction) by using multiplications with precomputed numbers.
What I wrote above is about such division, though perhaps not general enough for your needs. It only works for divisors D such that:
D^3<2^n.
But when dividing large integer by another large integer, you usually need to divide by a normalized divisor, such that:
2^{n-1}<=D<2^n.
I have not yet tried to solve this variant with my integer-based approach.

When a divisor is normalized, the algorithm I presented first has the shift equal to the size of machine word (the assembly implementation I provided does not work then, because SHRD sees such shift as zero), so the result can be taken straight from the third word after multiplication. Also, multiplying the lowest word seems pointless, because it can affect the result by no more than one, and this can be corrected when computing remainder, as in my previous post. However to extend the range of algorithm, some kind of compensation for the NP term would be needed. I don't know if I can solve this variant with integer arithmetic as nicely as the previous ones.
Post 09 Sep 2017, 14:58
View user's profile Send private message Visit poster's website Reply with quote
Furs



Joined: 04 Mar 2016
Posts: 2565
Furs 28 Feb 2018, 15:56
Tomasz Grysztar wrote:
This bring us to the following implementation of the classic algorithm:
Code:
; Preparation stage:
        mov     ebx,[divisor]
        bsr     ecx,ebx
        mov     [shift],cl
        mov     edx,1
        shl     edx,cl
        xor     eax,eax
        div     ebx
        shr     edx,cl
        setc    cl
        mov     [increment],ecx
        xor     cl,1
        add     eax,ecx
        mov     [magic],eax    
Code:
; Division:
        mov     eax,[number]
        mov     edx,[magic]
        add     eax,[increment]
        mul     edx
        mov     cl,[shift]
        shr     edx,cl
        mov     [quotient],edx    
The [integer] variable has value 0 or 1, it serves as a conditional INC instruction.
Again, no floating point calculations needed, the entire reasoning is solid just in terms of integer relations.
This algorithm does not work for divisors being powers of two, since M does not fit into machine word then. These divisors would need to be detected and division turned into a simple shift instead.
Hi, I'm wondering, does this algorithm fail for anything other than power of 2s? (those can be easily handled)

I'm trying to write a compile-time tool to help with generating magic numbers and shifts in the most optimal way possible at compile-time (so I don't care about the runtime divisors), but they have to be generic (the +1 condition is an extra insn other than mul/shift). For now I use this algorithm for full-width multiplies, and your first algorithm for smaller muls, because then the mask with width+1 bits will fit and I won't have to do the +1 (inc) in some cases (just a mul and shift).

But as you said:
Tomasz Grysztar wrote:
Note that both the algorithms given use the large values of S even if there may exist a smaller S for which the requirements would be fulfilled. An example for B=32 is number 641, where the algorithm works for S = 0 and M = 663D81h:
663D81h*641 = 2^32 + 1
R = 1 <= 2^S

So 32-bit divison by 641 is a rare case when we can do without the shift instruction:
Code:
        mov     eax,[number]
        mov     edx,663D81h
        mul     edx
        mov     [quotient],edx    
Is there a way to reliably handle such cases? It doesn't really matter if it's a bit "expensive" to detect (loops?) since it is a tool done at compile-time, not runtime, which I don't care about. i.e. I just want it to spill out a magic, a shift (which can be 0 if it's more optimal) and an add (which is 0 most of the time).

Also, is there a better way to implement such divisions by constants when the numbers are smaller than full-width? (e.g. 16-bit or 8-bit numbers involved, or even 32-bit if x64). I mean, I can obviously use the entire register for an 16-bit operation so was wondering if there's a better way (I currently use your first method with the mask overflowing into 16th bit for example, for 16-bit number, and do a 32-bit mul).


What I consider optimal: least amount of extra instructions other than mul. Wink

(and yes, I will also implement this tool in C++ as a compile-time metaprogramming thing not just an asm tool, I just found out GCC can be quite retarded sometimes, so...)
Post 28 Feb 2018, 15:56
View user's profile Send private message Reply with quote
Tomasz Grysztar



Joined: 16 Jun 2003
Posts: 8359
Location: Kraków, Poland
Tomasz Grysztar 28 Feb 2018, 16:54
Furs wrote:
Hi, I'm wondering, does this algorithm fail for anything other than power of 2s? (those can be easily handled)
The algorithm only fails for powers of two because M is then 2^{32}, just outside the range of a 32-bit register. For all other divisors M is smaller than that.

Furs wrote:
Is there a way to reliably handle such cases? It doesn't really matter if it's a bit "expensive" to detect (loops?) since it is a tool done at compile-time, not runtime, which I don't care about. i.e. I just want it to spill out a magic, a shift (which can be 0 if it's more optimal) and an add (which is 0 most of the time).
As I have shown above, the algorithms with no increment give correct results for any input N such that NP < 2^n (where P is remainder from the division of 2^n by D, the quotient of that division is M). You can iterate through all possible values of n = 32 + shift (starting with shift = 0) and find the smallest one that has the remainder P small enough for the condition to hold (when shift is zero, n = 32 and P must not be larger than 1). This way you should find the smallest possible shift for which the algorithm without increment works. Sometimes the smallest shift is still too large and you are forced to use the variant with increment.

Furs wrote:
Also, is there a better way to implement such divisions by constants when the numbers are smaller than full-width? (e.g. 16-bit or 8-bit numbers involved, or even 32-bit if x64). I mean, I can obviously use the entire register for an 16-bit operation so was wondering if there's a better way (I currently use your first method with the mask overflowing into 16th bit for example, for 16-bit number, and do a 32-bit mul).
What limits the range where the algorithm works is the already mentioned NP < 2^n constraint. If you choose to not use shift and take n equal to the size of register, you can safely work with values of half that size. For example if n = 32 and both N and D are 16-bit numbers, then also P fits into 16 bits (since it is a remainder, smaller than D) and the inequality holds.

Hence to divide 16-bit N by 16-bit D, you can use M = 2^32 div D, do a 32-bit multiplication and take DX as a result (high word of EDX is always going to be zero).


Last edited by Tomasz Grysztar on 01 Mar 2018, 10:54; edited 1 time in total
Post 28 Feb 2018, 16:54
View user's profile Send private message Visit poster's website Reply with quote
Furs



Joined: 04 Mar 2016
Posts: 2565
Furs 28 Feb 2018, 18:16
I'm a bit confused. So I need to iterate all shift values, but how can I check if the condition holds if I don't know the value of N? Should I assume the highest value possible for it that it can have? I mean, N is a runtime value, only the divisor is at compile time.

And also if I divide 2^32 by 641, the remainder is 640, not 1, so how come it works?

Tomasz Grysztar wrote:
What limits the range where the algorithm works is the already mentioned NP < 2^n constraint. If you choose to not use shift and take n equal to the size of register, you can safely work with values of half that size. For example if n = 32 and both N and D are 16-bit numbers, then also P fits into 16 bits (since it is a remainder, smaller than D) and the inequality holds.

Hence to divide 16-bit N by 16-bit D, you can use M = 2^32 div D, do a 32-bit multiplication and take DX as a result (high word of EDX is always going to be zero).
Thanks, that's even better, only a mul is needed.

I think I understand it now. Sorry for being slow, I'm not that good at math-speak. Smile


EDIT: I'm guessing P is not the remainder, but rather D-remainder, am I right? I looked before at your example with 10, and you said P = 4, which fits just as well as on 641. (i.e. 10-6 = 4, and 641-640 = 1)
Post 28 Feb 2018, 18:16
View user's profile Send private message Reply with quote
pabloreda



Joined: 24 Jan 2007
Posts: 116
Location: Argentina
pabloreda 28 Feb 2018, 21:19
I implement this from
http://www.flounder.com/multiplicative_inverse.htm

I test with some values and work well, of course for power of two I use the shift transformation.

I'm implementing now transform multiplication with shift and adds, but search the optimal combination is hard, few web pages with information, I try to dig in source code of gcc but not found this.
Post 28 Feb 2018, 21:19
View user's profile Send private message Visit poster's website Reply with quote
Furs



Joined: 04 Mar 2016
Posts: 2565
Furs 28 Feb 2018, 21:37
pabloreda wrote:
I'm implementing now transform multiplication with shift and adds, but search the optimal combination is hard, few web pages with information, I try to dig in source code of gcc but not found this.
The thing is, GCC is not always optimal, that's why I'm "rolling my own" tool. Confused I used to use GCC just to generate the magic since it was easy but now I need a "hardcore" tool because GCC is not perfect, I'm personally sick of seeing compiler developers saying "not worth the effort", it's been like 20 years, how hard can it be?

Anyway, this is kind of what I have right now, in pseudo-code (I have it in C++ and inline asm (for rcr) but I figure it's not so easy to understand):

Code:
; 'max' is the max value of the numerator N  (the output depends on the programmer knowing this in advance)
; N is the numerator
; D is the divisor
if(D is power of 2) return N / D;  // compiler handles this here

for(shift in 0 ... 31) {
  temp = 2^32 << shift;
  if(max*(D - temp mod D) < temp) break;
}

if(shift < 32 && (magic, which is (2^32 << shift) / D, fits in 32 bits)
  return (N*(magic+1) >> 32) >> shift;

// check if we can't use 'inc' method
if(max == 0xFFFFFFFF)
{
  // use Tomasz's method
  shift = bsr(D-1) + 1;
  magic = (((2^32)-b) << 32)/b + 1;

  return (((N*magic >> 32) + N) rcr 1) >> (shift-1);
}
else
{
  shift = bsr(D);
  magic = (2^32 << shift) / b;

  return (N+1)*magic >> 32 >> shift;
}    
Is something like this optimal? I want to write a script in PARI/gp next so I don't have to worry about overflows or anything and it will always work (just to give the magic and whatever else is required).
Post 28 Feb 2018, 21:37
View user's profile Send private message Reply with quote
Display posts from previous:
Post new topic Reply to topic

Jump to:  
Goto page 1, 2  Next

< Last Thread | Next Thread >
Forum Rules:
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum
You cannot attach files in this forum
You can download files in this forum


Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.

Website powered by rwasa.