flat assembler
Message board for the users of flat assembler.
Index
> Windows > Specific usage of bignum multiplication Goto page 1, 2 Next 
Author 

AlexP 14 Apr 2008, 01:46
I've read all that I can find in two day's time for the most efficient way of multiplying together two bignum's, but most are for a certain range (way above my applications). So, I decided that it would be best to start a thread devoted to this specific matter, not a generic one like the bignum thread or the several division threads.
My range is for the RSA public key (n = pq) multiplication, where the size of n is between 512 and 16384 bits. This gives the bit length of p and q (assuming that they are of equal length) to be between 256 and 8192 bits. This gives the number of applications of whatever multplication algorithm I use to be between 8 and 256 (assuming it inputs two dwords). My question is what is the most efficient way to multiply the two together for this range? I have looked up every algo I could find, and I would not like to do an FFT implementation for such a low range, or a SchÃ¶nhageStrassen algorithm. I would rather have either of these choices, and I really need help on the subject because I've never done this before: http://en.wikipedia.org/wiki/Multiplication_algorithm 1) "Schoolboy multiplication", aka multiply every dword in one array with every dword in the other, running time is O(n^2) I believe. An example is in the bignum thread. http://board.flatassembler.net/topic.php?p=65726 2) "Peasant algorithm", quoted as "For each '1' bit in the multiplier, shift the multiplicand an appropriate amount and then sum the shifted values." http://en.wikipedia.org/wiki/Peasant_multiplication 3) "Karatsuba", running time O(N^1.58) or so, it uses a divideandconquer method for each of two operands. http://en.wikipedia.org/wiki/Karatsuba_algorithm 4) "Shift+Add", I do believe this is the peasant algorithm... http://courses.cs.vt.edu/~cs1104/BuildingBlocks/multiply.040.html Well, I would love advice from someone who has done this before. As I said, I do not want to choose anything more complex because I've read that it only saves time for VERY large numbers, like tens of thousands of digits. Anything at all would be helpful! 

14 Apr 2008, 01:46 

AlexP 14 Apr 2008, 02:48
Yeah, I read that FFT is for tens of thousands of digits... I'm going to stick with the schoolboy, but I will do time tests to check if the shift/add (aka peasant, aka egyption, aka russian, you get it..) algorithm would work any better. Might as well, I never settle for something that's easy .
PS: Do you think it was a good idea to allow RSA key sizes up to 16,384 bits? I don't know how long encryption/decryption would take for that size, but I'll try it out whenever I get it working... 

14 Apr 2008, 02:48 

revolution 14 Apr 2008, 02:53
For the key size, sure. I set mine for any length and the code automatically chooses the multiplication algo based upon the size of the inputs. It switches to Karatsuba at a certain point and then to FFT after that. I tested some very large pq pairs just for fun, but it did take a while (many days) to find the primes when I got to >20000 digits for p and q. Usually the GCD(p,q) was very large and I would then search for another pair.
Forget about the peasant algo, it can't be faster than the schoolboy mult, too much overhead and not enough work done for each pass. 

14 Apr 2008, 02:53 

AlexP 14 Apr 2008, 02:56
Okay, and I didn't see FFT or Karatsuba in your code... I guess I wasn't reading it right, could you explain what the Karatsuba looks like? (In the macro code you used)


14 Apr 2008, 02:56 

revolution 14 Apr 2008, 02:58
Oh no, not in the macros I posted here. The code I am talking about is asm code, which has not been posted here.


14 Apr 2008, 02:58 

AlexP 14 Apr 2008, 02:59
Okay, so your code that is not posted I'm assuming you will not even give me a glimpse of?
Quote: Usually the GCD(p,q) was very large and I would then search for another pair. 

14 Apr 2008, 02:59 

revolution 14 Apr 2008, 03:11
AlexP wrote: Okay, so your code that is not posted I'm assuming you will not even give me a glimpse of? AlexP wrote:


14 Apr 2008, 03:11 

AlexP 14 Apr 2008, 03:19
Okay, just making sure you weren't crazy about GCD(p,q).
Well, maybe one day I can spend my time working like you do, and I can have my own RSA lib! Q1: Does your company allow you to use your own code? (fear of losing it/decompilation?) Or do they do like the movie Paycheck and wipe your memory after an assignment . Q2: How early did you start work like this?! I mean, were you > 15 years of age when you started out in this line of study? 

14 Apr 2008, 03:19 

revolution 14 Apr 2008, 03:33
A1: I usually don't offer my own code.
A2: 13yrs old when I started with Z80 assembly to compute a few thousand digits of PI. 

14 Apr 2008, 03:33 

AlexP 14 Apr 2008, 11:54
R1: Ohhh, so they force it from you? (I'm kidding...)
R2: Darn! Well, maybe I'm not too late to start learning this.... I did try the digits of PI thing, but all I had to go on was some C code, and it didn't work ... It was my first main project in ASM too.. 

14 Apr 2008, 11:54 

revolution 14 Apr 2008, 12:06
AlexP wrote: R2: Darn! Well, maybe I'm not too late to start learning this.... I did try the digits of PI thing, but all I had to go on was some C code, and it didn't work ... It was my first main project in ASM too.. 

14 Apr 2008, 12:06 

AlexP 14 Apr 2008, 14:18
Actually it was C# code, and I gave up that project because it was my first, and I could not get it working! < I wonder why >. I will revisit it eventually if it hits my interest...
Question: The MillerRabin test seems nice, it will be tough to actually implement though, but how many applications do you think I should apply? I mean, if I'm scanning a range of thousands or tens of thousands of numbers (maybe more...) and testing for primality, how many times do you think I should use it? 

14 Apr 2008, 14:18 

revolution 14 Apr 2008, 15:35
AlexP wrote: Question: The MillerRabin test seems nice, it will be tough to actually implement though, but how many applications do you think I should apply? I mean, if I'm scanning a range of thousands or tens of thousands of numbers (maybe more...) and testing for primality, how many times do you think I should use it? Code: k in bits: t 0 160:34 161 163:33 164 166:32 167 169:31 170 173:30 174 177:29 178 181:28 182 185:27 186 190:26 191 195:25 196 201:24 202 208:23 209 215:22 216 222:21 223 231:20 232 241:19 242 252:18 253 264:17 265 278:16 279 294:15 295 313:14 314 334:13 335 360:12 361 392:11 393 430:10 431 479: 9 480 542: 8 543 626: 7 627 746: 6 747 926: 5 9271232: 4 12331853: 3 1854 oo: 2 

14 Apr 2008, 15:35 

r22 14 Apr 2008, 17:08
@people who like code
You can always use SSE to speed it up the multiplication. PMULUDQ Then once you have the Qwords just use an offset ADC to combine them back into a result. ***DISCLAIMER*** The following was written while at work (in the browser) so it's UNtested. The compiler that resides between my ears has validated it, but its been known to make up opcodes, misspell label names and not check array bounds. Code: .data align 16 p rd 512 ;; d1, 0, d2, 0, d3, 0, etc... q rd 512 ;; d1, 0, d2, 0, d3, 0, etc... ntemp rd 512 + 16;; padding so carry code doesn't bomb ;; q1l, q1h, q2l, q2h, etc... nresult rd 512 + 16;; padding so carry code doesn't bomb .code Mul8192By8192To16384: ;;SSE2 compatible (for legacies sake) ;; ASSUME nresult is zeroed, and p and q are filled in the correct format push ebp ebx esi edi xor esi,esi ;;multiplier index (copied dword) p [dwX][0][dwX][0] .multiplier: ;;clear ntemp pxor xmm0,xmm0 mov eax,512*816 .clearntemp: movdqa [ntemp+eax],xmm0 sub eax,16 jns .clearntemp ;;load movd xmm0, [esi+p] xor edi,edi ;;multiplicand index (unpacked dwords) q [dw1][0][dw2][0] ;;and temp result n index [dq1][dq2] pshufd xmm0, xmm0, 11000100b ;;copy dword to upper .multiply: movdqa xmm1, [edi+q] ;;multiply d2dq pmuludq xmm1,xmm0 ;;store temp result movdqa [edi+ntemp],xmm1 ;;increment add edi,16 test edi,512*4 jz .multiply ;;compress ntemp and add to nresult ;;start by adding high dword of qword with low dword of next qword and correcting for carry mov edx,4 .addwithcarrypropegation: mov eax,[edx+ntemp+4] add [edx+ntemp],eax lea ecx,[edx+8] .carry: adc [ecx+ntemp],0 lea ecx,[ecx+4] jc .carry add edx,8 cmp edx,512*4+4 jl .addwithcarrypropegation ;;compress [d1][d2][~][d3][~][d4] etc... into [d1][d2][d3][d4] etc... mov edx,8 mov ecx,12 .compress: mov eax,[ecx] mov [edx],eax add ecx,8 add edx,4 cmp ecx,512*4+4 jl .compress ;;add ntemp to nresult mov edi,esi ;;nresult offset xor ecx,ecx ;;ntemp offset shl edi,1 ;;/2 qword>dword clc ;;shl might does this ? already .addwithcarry: mov eax,[ecx+ntemp] adc [edi+nresult],eax lea edx,[edi+4] .carr: adc [edx+nresult],0 lea edx,[edx+4] jc .carr add ecx,4 add edi,4 test edi,512*4 jz .addwithcarry add esi,8 test esi,512*4 jz .multiplier pop ebp edi esi ebx ret 0 If you notice errors repost the code snippet fixed for the benefit of future readers of the thread. If it's totally unusable I'll just edit/remove it. 

14 Apr 2008, 17:08 

AlexP 14 Apr 2008, 20:35
Yeah, I was thinking about using large SSE instructions, but I have not studied SSE and decided against it. I will try your code later, thank for the go at it!
PS: Isn't there supposed to be two loops? One inner loop to multiply one dword in array_1 by every dword in array_2, and another outer loop to increment the dword in array_1? This is what I found in another implementation... Revolution: I should ask, where did you get those statistics from? Last I checked, the Rabin was 1/4 wrong when it reported prime. 

14 Apr 2008, 20:35 

revolution 15 Apr 2008, 00:36
AlexP wrote: Revolution: I should ask, where did you get those statistics from? Last I checked, the Rabin was 1/4 wrong when it reported prime. Code: IEEE P1363 / D13 (Draft Version 13). Standard Specifications for Public Key Cryptography Annex A (Informative). NumberTheoretic Background. A.15.2 Testing a Randomly Generated Integer for Primality 

15 Apr 2008, 00:36 

AlexP 15 Apr 2008, 00:55
Wow, I didn't know something like that existed. I'll have to read that. And yes, I am a newbie. I just got to the section of the 'pseudoprimes' chapter where I glanced at the other side of the page, and saw '1/4', and the relative equations. Would be nice if they were wrong though . I'll get those specs to study.
I'm glad that someone actually took the incentive to list the steps for Rabin stepbystep . I will have fun figuring out how to do modular exponentiation, but that is what researching is for. The other problem, finding n1 = w(2^v) has caught my attention as a fun little problem. I'm thinking that there HAS to be some easier way to figure that out in binary... I'll get it sooner or later, thanks again 

15 Apr 2008, 00:55 

revolution 15 Apr 2008, 01:08
If you like the IEEE P1363 then you should also get X9.62 standard. But be careful, both documents are buggy, you need to compare to each other and do your own tests to check which is the correct information. Even where the two docs agree you must also check against other sources for correctness.


15 Apr 2008, 01:08 

AlexP 15 Apr 2008, 01:29
Hmm, I tried some x9.31 I believe, apparently I have to buy them?! ...
PS: is X9.62 about ECC?! Everywhere I go I cannot get near an ANSII spec because I have to pay for them, or login. Is there some site I can go to to view ANSII specs? 

15 Apr 2008, 01:29 

Goto page 1, 2 Next < Last Thread  Next Thread > 
Forum Rules:

Copyright © 19992023, Tomasz Grysztar. Also on GitHub, YouTube, Twitter.
Website powered by rwasa.