flat assembler
Message board for the users of flat assembler.
Index
> High Level Languages > What's the math behind this CRC algo? |
Author |
|
revolution 08 Sep 2016, 13:38
Indeed. It is not a CRC in the usual sense of the acronym. It appears to be some homebrew thing, and as such is likely to not be of similar quality to a "real" CRC.
|
|||
08 Sep 2016, 13:38 |
|
bitRAKE 08 Sep 2016, 18:22
IDK, it seems fairly standard CRC code:
http://wg8.de/wg8n1454_17n3497_Ballot_3rdCD14443-3.3.pdf (page 50) https://users.ece.cmu.edu/~koopman/roses/dsn04/koopman04_crc_poly_embedded.pdf I think the optimization makes it look unusual. The second reference above talks about the math. Maybe, it's the reverse reciprocal polynomial of 8810? _________________ ¯\(°_o)/¯ “languages are not safe - uses can be” Bjarne Stroustrup |
|||
08 Sep 2016, 18:22 |
|
idle 08 Sep 2016, 19:15
tnx 4 the linx, bitRake, ps: i was curious to investigate it
Code: unsigned short UpdateCrc(unsigned char ch, unsigned short *lpwCrc) { //CRC_B case: //ch = ch xor 0xff //! = not ch, take this one as the routine seems to be bitwise // = -ch-1, arith seems non-applicable here ch = (ch^(unsigned char)((*lpwCrc) & 0x00FF)); //temp_ch = ch shl 4 , least bits 0000b, the algo seems drunky: why do we pop/split // ch = ch xor temp_ch, least bits of ch left unchanged //... //so, we took a char/byte, inverted it, lower part of inverted left same, higher part of inverted XOR'ed to the lower ch = (ch^(ch<<4)); //... //return result stage complicated - we should make it easier *lpwCrc = (*lpwCrc >> 8)^((unsigned short)ch << 8)^((unsigned short)ch<<3)^((unsigned short)ch>>4); //mind it is a CRC_B case, ch temporarily movzx-extended, and also those XORs *lpwCrc = 0xffff ^ word(ch << 8) ^ (ch<<3) ^ (ch>>4); // = not ..., it looks like, damn, sorry return; } |
|||
08 Sep 2016, 19:15 |
|
revolution 08 Sep 2016, 19:45
Oh wow, that is cool, kind of. Hard core "optimisation" if it really does compute a standard CRC function. It desperately needs extensive commenting though.
|
|||
08 Sep 2016, 19:45 |
|
bitRAKE 08 Sep 2016, 19:56
Possible x86 conversion (not optimal):
Code: UpdateCrc: ; Update CRC AX with byte CL xor cl,al ; ch = (ch^(unsigned char)((*lpwCrc) & 0x00FF)); mov ch,cl ; ch = (ch^(ch<<4)); shl cl,4 xor cl,ch ; *lpwCrc = (*lpwCrc >> 8) shl cx,8 ; ^((unsigned short)ch<<8) shr ax,8 ; ^((unsigned short)ch<<3) mov dx,cx ; ^((unsigned short)ch>>4); xor ax,cx ; shr dx,4+8 shr cx,8-3 xor ax,dx xor ax,cx retn |
|||
08 Sep 2016, 19:56 |
|
zhak 08 Sep 2016, 21:35
bitRAKE wrote: Maybe, it's the reverse reciprocal polynomial of 8810? Hmm Wiki says "But the reciprocal polynomial is not the same as the original polynomial, and the CRCs generated using it are not the same (even modulo bit reversal) as those generated by the original polynomial." However, this algo produces the same checksum as Code: mov bx, 0x8408 mov ax, 0xffff mov di, 0xffff mov rsi, msg mov ecx, 3 call crc16t jmp $ msg db 0x0f, 0xaa, 0xff ; RefIn/Out = true ; bx poly ; ax init ; di xorOut ; ecx length ; rsi message ptr ; crc16t: xor al, byte [rsi] mov dl, 8 @@: shr ax, 1 jnc short $ + 4 xor eax, ebx dec dl jnz short @b inc rsi dec ecx jnz short crc16t xor eax, edi ret Well, because it is supposed to, because it is implementation of this standard (CRC-16-X25) |
|||
08 Sep 2016, 21:35 |
|
bitRAKE 09 Sep 2016, 16:28
http://www.zlib.net/crc_v3.txt
http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/fast-crc-computation-generic-polynomials-pclmulqdq-paper.pdf http://www.hackersdelight.org/crc.pdf Seems to be a lot of research on high-speed CRC. |
|||
09 Sep 2016, 16:28 |
|
zhak 11 Sep 2016, 20:13
So, I executed a small performance test on straightforward (crc16t) and optimized (crc16t_1021) algos.
Code: ;~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ; Compute CRC-16 for RefIn/RefOut = true ; ; Parameters: ; AX Init ; BX Poly (reflected) ; ECX Message length in bytes (max. 65536) ; RSI Pointer to message ; DI XorOut ; crc16t: xor al, byte [rsi] mov dl, 8 @@: shr ax, 1 jnc short $ + 4 xor eax, ebx dec dl jnz short @b inc rsi dec ecx jnz short crc16t xor eax, edi ret ;~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ; Compute CRC-16 for Poly = 0x1021 and RefIn/RefOut = true ; ; Parameters: ; AX Init ; ECX Message length in bytes (max. 65536) ; RSI Pointer to message ; DI XorOut ; crc16t_1021: movzx edx, byte [rsi] xor dl, al mov bl, dl shl bl, 4 xor dl, bl mov ebp, edx mov ebx, edx shr ax, 8 shl ebp, 8 shl ebx, 3 shr edx, 4 xor eax, ebp xor ebx, edx xor eax, ebx inc rsi dec ecx jnz short crc16t_1021 xor eax, edi ret Test conditions: - Measure (using RDTSC) how many cycles (in average) does it take to compute CRC for random (generated with RDRAND) message - Message length = 65535 bytes - CRC for each message is computed 1000 times - Test executed on 1000 different messages - Test performed in UEFI (long mode, one active core, single thread) on Intel Core M-5 Y10c @ 0.80 GHz So, the straightforward function takes 6962181 cyclese, while the optimized one takes 551631, in average. It works 12 times faster! Frankly speaking, I didn't expect that, so I ran tests twice I'll also measure that pclmulqdq method and update results. |
|||
11 Sep 2016, 20:13 |
|
revolution 12 Sep 2016, 01:16
How does it compare to the normal table CRC method? Nobody really uses the naive bitwise CRC if performance is an issue.
|
|||
12 Sep 2016, 01:16 |
|
rugxulo 13 Sep 2016, 22:20
Just some ideas (although I am no pro):
Code: shr ax, 1 Also try "shr ax, 17". (EDIT: This may be wrong, I didn't test. Just for clarity, to show the intended point, I was thinking of this.) Code: dec dl Try "sub dl,1" (and similarly "add" instead of "inc"). Code: dec ecx jnz short crc16t Are you absolutely sure that "loop" would be worse here?? Code: movzx edx, byte [rsi] Try "xor edx, edx" and "mov dl,byte [rsi]" instead. Code: shl bl, 4 Try "add" instead (or even "lea"). (Admittedly, modern machines probably don't care anymore about such things. I've only been skimming some old docs, just thought it was curious.) |
|||
13 Sep 2016, 22:20 |
|
bitRAKE 14 Sep 2016, 16:10
There are false dependencies due to the use of multiple register sizes - that can be said for both routines though. The optimization is different in kind. I don't think unrolling the loop of the other would even make it as fast.
Table methods are faster, but only once the table is in cache and then we are talking about a memory bound process with the incoming data. There is a corner use case where the data is being generated (through decompression, etc.) where optimization makes sense. Of course, hardware optimization is another realm - most the research has that focus. |
|||
14 Sep 2016, 16:10 |
|
rugxulo 14 Sep 2016, 19:34
rugxulo wrote:
Okay, clearly it's 33 I meant instead of 17 since all shifts (nowadays) are mod 32. Of course this is mostly ridiculous, but it is claimed to help at least one processor (486), so I honestly don't know (can't know ... right, revolution?) without testing whether it would affect other, newer cpus as well. |
|||
14 Sep 2016, 19:34 |
|
revolution 14 Sep 2016, 20:09
bitRAKE wrote: Table methods are faster, but only once the table is in cache and then we are talking about a memory bound process with the incoming data. There is a corner use case where the data is being generated (through decompression, etc.) where optimization makes sense. rugxulo wrote: ... it is claimed to help at least one processor (486), so I honestly don't know (can't know ... right, revolution?) without testing whether it would affect other, newer cpus as well. |
|||
14 Sep 2016, 20:09 |
|
rugxulo 14 Sep 2016, 21:51
revolution wrote: Yes, absolutely true. Without proper testing any assumption can turn out to be very wrong. However I am willing to make the wild assumption that doing 'shr reg,1' compared to 'shr reg,33' will show no difference in performance for any CPU 486 or higher. The masked barrel shifter should completely ignore all the high order bits with no side effects. Assembly Gems (mentioned above) says this: Quote:
BBL Forth's S32.TXT says this: Quote:
HelpPC says this: Quote:
Darek Mihocka says this: Quote:
revolution wrote:
Not me, I don't have any good code. As is obvious (to me), anything that doesn't measure more than a few seconds in difference isn't significant. So I don't have any obvious benchmarks (that would have to run millions of times just to hope for a big enough difference). Truly, higher clock speed in the Gigahertz has spoiled us. |
|||
14 Sep 2016, 21:51 |
|
revolution 14 Sep 2016, 22:02
rugxulo wrote: Truly, higher clock speed in the Gigahertz has spoiled us. |
|||
14 Sep 2016, 22:02 |
|
zhak 15 Sep 2016, 10:13
Since those are not table algos and are supposed to be slow by their nature, I applied optimizations for size, not speed
|
|||
15 Sep 2016, 10:13 |
|
revolution 15 Sep 2016, 10:21
zhak wrote: Since those are not table algos and are supposed to be slow by their nature, I applied optimizations for size, not speed |
|||
15 Sep 2016, 10:21 |
|
zhak 15 Sep 2016, 10:28
*edit*
Below are results as measured by Agner Fog's PMCTest tool. (crc computed once for a random 64k message) Code: STRAIGHT (BITWISE) Processor 0 Clock Core cyc Instruct Uops 6826837 7286942 2621541 2558016 6585698 7244283 2621533 2556118 6644231 7308660 2621533 2556000 6888307 7300361 2621536 2557726 6650729 7315814 2621533 2556000 6649959 7314974 2621533 2556000 6652640 7317914 2621533 2556000 6641692 7305871 2621533 2556000 6651110 7316235 2621533 2556000 7542197 7356542 2621535 2563536 6772477 7288164 2621534 2556186 6652046 7317261 2621533 2556000 6648680 7313805 2621533 2556000 6649662 7314636 2621533 2556000 6654796 7320292 2621533 2556000 7166295 7330060 2621535 2561044 6654975 7320484 2621533 2556000 6644049 7308466 2621533 2556000 6644111 7308526 2621533 2556000 6647080 7311796 2621533 2556000 -------------------------------------------- OPTIMIZED Processor 0 Clock Core cyc Instruct Uops 526358 565746 1114127 1050415 506004 556603 1114120 1048587 505997 556603 1114120 1048587 508813 559692 1114120 1048705 506093 556703 1114120 1048587 506247 556872 1114120 1048587 506250 556873 1114120 1048587 506250 556874 1114120 1048587 506096 556706 1114120 1048587 506247 556871 1114120 1048587 582404 564981 1114120 1049299 506203 556829 1114120 1048587 506207 556826 1114120 1048587 506007 556605 1114120 1048587 505997 556605 1114120 1048587 506000 556605 1114120 1048587 506207 556828 1114120 1048587 506007 556607 1114120 1048587 506000 556606 1114120 1048587 506207 556826 1114120 1048587 -------------------------------------------- TABLE Processor 0 Clock Core cyc Instruct Uops 463139 470490 458772 526135 420240 462264 458765 524308 420164 462438 458765 524308 420241 462254 458765 524308 420207 462219 458765 524308 420216 462233 458765 524308 420130 462138 458765 524308 420281 462300 458765 524308 420204 462215 458765 524308 420216 462231 458765 524308 420121 462124 458765 524308 420198 462208 458765 524308 420130 462137 458765 524308 420274 462301 458765 524308 420121 462124 458765 524308 420151 462165 458765 524308 420167 462177 458765 524308 420339 462369 458765 524308 420130 462136 458765 524308 420198 462208 458765 524308 Table algo implementation: Code: ; ; rcx message ; rdx length ; r8 init ; r9 xorout ; stack ptr to table ; crc16_tbl: movzx eax, r8w mov r10, [rsp+40] ; shadow stack should be reserved on call @@: movzx r11d, al shr eax, 8 xor r11b, byte [rcx] xor ax, [r10 + r11 * 2] inc rcx dec edx jnz short @b xor eax, r9d ret BTW I also tried different variants of instructions, like subsituting DEC ECX/JNZ with LOOP or INC REG with ADD REG, 1; etc. Those don't make significant difference. |
|||
15 Sep 2016, 10:28 |
|
zhak 20 Sep 2016, 18:56
Final test. fast pclmulqdq implementation as described in http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/fast-crc-computation-generic-polynomials-pclmulqdq-paper.pdf
The same random 64Kbytes message Code: Processor 0 Clock Core cyc Instruct Uops 85403 34466 31788 48981 16972 18671 31781 47151 15612 17175 31781 47151 15593 17161 31781 47151 15603 17165 31781 47151 15609 17173 31781 47151 15582 17138 31781 47151 15403 16949 31781 47151 15399 16951 31781 47151 15372 16911 31781 47151 15381 16925 31781 47151 15369 16910 31781 47151 15301 16836 31781 47151 15295 16830 31781 47151 15301 16837 31781 47151 15298 16835 31781 47151 15646 17212 31781 47151 15301 16840 31781 47151 15301 16837 31781 47151 15298 16835 31781 47151 This is really impressive. Although the algo itself is huge. around 1200 bytes |
|||
20 Sep 2016, 18:56 |
|
< Last Thread | Next Thread > |
Forum Rules:
|
Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.
Website powered by rwasa.