flat assembler
Message board for the users of flat assembler.

flat assembler > High Level Languages > What's the math behind this CRC algo?

Author
Thread Post new topic Reply to topic
zhak



Joined: 12 Apr 2005
Posts: 489
Location: Belarus
I've came across one interesting CRC_B (X.25) implementation.

Algo description is the following:
width=16 poly=0x1021 init=0xffff refin=true refout=true xorout=0xffff check=0x906e

Implementation (sorry, it's in C, as original in the book):
Code:
#include <stdio.h> #include <stdlib.h> #include <string.h> #include <ctype.h> #define CRC_A 1 #define CRC_B 2 #define BYTE unsigned char unsigned short UpdateCrc(unsigned char ch, unsigned short *lpwCrc) { ch = (ch^(unsigned char)((*lpwCrc) & 0x00FF)); ch = (ch^(ch<<4)); *lpwCrc = (*lpwCrc >> 8)^((unsigned short)ch << 8)^((unsigned short)ch<<3)^((unsigned short)ch>>4); return(*lpwCrc); } void ComputeCrc(int CRCType, char *Data, int Length, BYTE *TransmitFirst, BYTE *TransmitSecond) { unsigned char chBlock; unsigned short wCrc; switch(CRCType) { case CRC_A: wCrc = 0x6363; /* ITU-V.41 */ break; case CRC_B: wCrc = 0xFFFF; /* ISO/IEC 13239 (formerly ISO/IEC 3309) */ break; default: return; } do { chBlock = *Data++; UpdateCrc(chBlock, &wCrc); } while (--Length); if (CRCType == CRC_B) wCrc = ~wCrc; /* ISO/IEC 13239 (formerly ISO/IEC 3309) */ *TransmitFirst = (BYTE) (wCrc & 0xFF); *TransmitSecond = (BYTE) ((wCrc >> 8) & 0xFF); return; }

I'm very curious about math behind UpdateCrc() func.
It doesn't involve standard 'shift one bit > check if 1 > xor with poly if true > repeat 8 times' approach, but rather involves multiple shifts and xors to compute all in one step.
Maybe anybody keen in algos and math could explain it? Maybe give some formulas? (I can't find any). My idea is to adapt this approach to use with other polynoms, not only 0x1021 (0x8408 in this implementation)
Post 08 Sep 2016, 13:22
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 15871
Location: 162173 Ryugu
Indeed. It is not a CRC in the usual sense of the acronym. It appears to be some homebrew thing, and as such is likely to not be of similar quality to a "real" CRC.
Post 08 Sep 2016, 13:38
View user's profile Send private message Visit poster's website Reply with quote
bitRAKE



Joined: 21 Jul 2003
Posts: 2653
Location: dank orb
IDK, it seems fairly standard CRC code:
http://wg8.de/wg8n1454_17n3497_Ballot_3rdCD14443-3.3.pdf (page 50)
https://users.ece.cmu.edu/~koopman/roses/dsn04/koopman04_crc_poly_embedded.pdf

I think the optimization makes it look unusual. The second reference above talks about the math. Maybe, it's the reverse reciprocal polynomial of 8810?

_________________
utf8everywhere.org
Post 08 Sep 2016, 18:22
View user's profile Send private message Visit poster's website Reply with quote
idle



Joined: 06 Jan 2011
Posts: 351
Location: ukraina
tnx 4 the linx, bitRake, ps: i was curious to investigate it
Code:
unsigned short UpdateCrc(unsigned char ch, unsigned short *lpwCrc) { //CRC_B case: //ch = ch xor 0xff //! = not ch, take this one as the routine seems to be bitwise // = -ch-1, arith seems non-applicable here ch = (ch^(unsigned char)((*lpwCrc) & 0x00FF)); //temp_ch = ch shl 4 , least bits 0000b, the algo seems drunky: why do we pop/split // ch = ch xor temp_ch, least bits of ch left unchanged //... //so, we took a char/byte, inverted it, lower part of inverted left same, higher part of inverted XOR'ed to the lower ch = (ch^(ch<<4)); //... //return result stage complicated - we should make it easier *lpwCrc = (*lpwCrc >> 8)^((unsigned short)ch << 8)^((unsigned short)ch<<3)^((unsigned short)ch>>4); //mind it is a CRC_B case, ch temporarily movzx-extended, and also those XORs *lpwCrc = 0xffff ^ word(ch << 8) ^ (ch<<3) ^ (ch>>4); // = not ..., it looks like, damn, sorry return; }
Post 08 Sep 2016, 19:15
View user's profile Send private message Send e-mail Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 15871
Location: 162173 Ryugu
Oh wow, that is cool, kind of. Hard core "optimisation" if it really does compute a standard CRC function. It desperately needs extensive commenting though.
Post 08 Sep 2016, 19:45
View user's profile Send private message Visit poster's website Reply with quote
bitRAKE



Joined: 21 Jul 2003
Posts: 2653
Location: dank orb
Possible x86 conversion (not optimal):
Code:
UpdateCrc: ; Update CRC AX with byte CL xor cl,al ; ch = (ch^(unsigned char)((*lpwCrc) & 0x00FF)); mov ch,cl ; ch = (ch^(ch<<4)); shl cl,4 xor cl,ch ; *lpwCrc = (*lpwCrc >> 8) shl cx,8 ; ^((unsigned short)ch<<8) shr ax,8 ; ^((unsigned short)ch<<3) mov dx,cx ; ^((unsigned short)ch>>4); xor ax,cx ; shr dx,4+8 shr cx,8-3 xor ax,dx xor ax,cx retn
I agree: the second line is baffling without going through the maths (which I have not done).
Post 08 Sep 2016, 19:56
View user's profile Send private message Visit poster's website Reply with quote
zhak



Joined: 12 Apr 2005
Posts: 489
Location: Belarus
bitRAKE wrote:
Maybe, it's the reverse reciprocal polynomial of 8810?

Hmm Wiki says "But the reciprocal polynomial is not the same as the original polynomial, and the CRCs generated using it are not the same (even modulo bit reversal) as those generated by the original polynomial."
However, this algo produces the same checksum as
Code:
mov bx, 0x8408 mov ax, 0xffff mov di, 0xffff mov rsi, msg mov ecx, 3 call crc16t jmp $ msg db 0x0f, 0xaa, 0xff ; RefIn/Out = true ; bx poly ; ax init ; di xorOut ; ecx length ; rsi message ptr ; crc16t: xor al, byte [rsi] mov dl, 8 @@: shr ax, 1 jnc short $ + 4 xor eax, ebx dec dl jnz short @b inc rsi dec ecx jnz short crc16t xor eax, edi ret

Well, because it is supposed to, because it is implementation of this standard (CRC-16-X25)
Post 08 Sep 2016, 21:35
View user's profile Send private message Reply with quote
bitRAKE



Joined: 21 Jul 2003
Posts: 2653
Location: dank orb
Post 09 Sep 2016, 16:28
View user's profile Send private message Visit poster's website Reply with quote
zhak



Joined: 12 Apr 2005
Posts: 489
Location: Belarus
So, I executed a small performance test on straightforward (crc16t) and optimized (crc16t_1021) algos.

Code:
;~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ; Compute CRC-16 for RefIn/RefOut = true ; ; Parameters: ; AX Init ; BX Poly (reflected) ; ECX Message length in bytes (max. 65536) ; RSI Pointer to message ; DI XorOut ; crc16t: xor al, byte [rsi] mov dl, 8 @@: shr ax, 1 jnc short $ + 4 xor eax, ebx dec dl jnz short @b inc rsi dec ecx jnz short crc16t xor eax, edi ret ;~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ; Compute CRC-16 for Poly = 0x1021 and RefIn/RefOut = true ; ; Parameters: ; AX Init ; ECX Message length in bytes (max. 65536) ; RSI Pointer to message ; DI XorOut ; crc16t_1021: movzx edx, byte [rsi] xor dl, al mov bl, dl shl bl, 4 xor dl, bl mov ebp, edx mov ebx, edx shr ax, 8 shl ebp, 8 shl ebx, 3 shr edx, 4 xor eax, ebp xor ebx, edx xor eax, ebx inc rsi dec ecx jnz short crc16t_1021 xor eax, edi ret


Test conditions:
- Measure (using RDTSC) how many cycles (in average) does it take to compute CRC for random (generated with RDRAND) message
- Message length = 65535 bytes
- CRC for each message is computed 1000 times
- Test executed on 1000 different messages
- Test performed in UEFI (long mode, one active core, single thread) on Intel Core M-5 Y10c @ 0.80 GHz

So, the straightforward function takes 6962181 cyclese, while the optimized one takes 551631, in average. It works 12 times faster! Frankly speaking, I didn't expect that, so I ran tests twice Smile
I'll also measure that pclmulqdq method and update results.
Post 11 Sep 2016, 20:13
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 15871
Location: 162173 Ryugu
How does it compare to the normal table CRC method? Nobody really uses the naive bitwise CRC if performance is an issue.
Post 12 Sep 2016, 01:16
View user's profile Send private message Visit poster's website Reply with quote
rugxulo



Joined: 09 Aug 2005
Posts: 2279
Location: Usono (aka, USA)
Just some ideas (although I am no pro):

Code:
shr ax, 1


Also try "shr ax, 17".

(EDIT: This may be wrong, I didn't test. Just for clarity, to show the intended point, I was thinking of this.)

Code:
dec dl


Try "sub dl,1" (and similarly "add" instead of "inc").

Code:
dec ecx jnz short crc16t


Are you absolutely sure that "loop" would be worse here??

Code:
movzx edx, byte [rsi]


Try "xor edx, edx" and "mov dl,byte [rsi]" instead.

Code:
shl bl, 4


Try "add" instead (or even "lea").

(Admittedly, modern machines probably don't care anymore about such things. I've only been skimming some old docs, just thought it was curious.)
Post 13 Sep 2016, 22:20
View user's profile Send private message Visit poster's website Reply with quote
bitRAKE



Joined: 21 Jul 2003
Posts: 2653
Location: dank orb
There are false dependencies due to the use of multiple register sizes - that can be said for both routines though. The optimization is different in kind. I don't think unrolling the loop of the other would even make it as fast.

Table methods are faster, but only once the table is in cache and then we are talking about a memory bound process with the incoming data. There is a corner use case where the data is being generated (through decompression, etc.) where optimization makes sense.

Of course, hardware optimization is another realm - most the research has that focus.
Post 14 Sep 2016, 16:10
View user's profile Send private message Visit poster's website Reply with quote
rugxulo



Joined: 09 Aug 2005
Posts: 2279
Location: Usono (aka, USA)
rugxulo wrote:

Code:
shr ax, 1


Also try "shr ax, 17".

(EDIT: This may be wrong, I didn't test. Just for clarity, to show the intended point, I was thinking of this.)


Okay, clearly it's 33 I meant instead of 17 since all shifts (nowadays) are mod 32. Of course this is mostly ridiculous, but it is claimed to help at least one processor (486), so I honestly don't know (can't know ... right, revolution?) without testing whether it would affect other, newer cpus as well.
Post 14 Sep 2016, 19:34
View user's profile Send private message Visit poster's website Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 15871
Location: 162173 Ryugu
bitRAKE wrote:
Table methods are faster, but only once the table is in cache and then we are talking about a memory bound process with the incoming data. There is a corner use case where the data is being generated (through decompression, etc.) where optimization makes sense.
Since we are discussing performance, and thus we can assume the programmer has already identified that this code is an important bottleneck to solve, then I think we should assume that the CRC table will be in cache. To assume otherwise would suggest that this code is not really all that important and the whole discussion is not needed.
rugxulo wrote:
... it is claimed to help at least one processor (486), so I honestly don't know (can't know ... right, revolution?) without testing whether it would affect other, newer cpus as well.
Yes, absolutely true. Without proper testing any assumption can turn out to be very wrong. However I am willing to make the wild assumption that doing 'shr reg,1' compared to 'shr reg,33' will show no difference in performance for any CPU 486 or higher. The masked barrel shifter should completely ignore all the high order bits with no side effects. So who will now prove this assumption is wrong? Wink
Post 14 Sep 2016, 20:09
View user's profile Send private message Visit poster's website Reply with quote
rugxulo



Joined: 09 Aug 2005
Posts: 2279
Location: Usono (aka, USA)
revolution wrote:
Yes, absolutely true. Without proper testing any assumption can turn out to be very wrong. However I am willing to make the wild assumption that doing 'shr reg,1' compared to 'shr reg,33' will show no difference in performance for any CPU 486 or higher. The masked barrel shifter should completely ignore all the high order bits with no side effects.


Assembly Gems (mentioned above) says this:
Quote:

On the 486, there is an anomaly in the instruction latencies. Note the following:

SHR/SAR/SHL reg,1 (opcodes D0, D1) have latency 3
SHR/SAR/SHL rem,imm8 (opcodes C0, C1) have latency 2


BBL Forth's S32.TXT says this:
Quote:

C:\BBLASM\S32.TXT 32-bit instruction timings for the 80486
...
Shifts and rotates are now relatively slow. Note that 1 bit
shifts are slower than N-bit immediate shifts. WEIRD!


HelpPC says this:
Quote:

SHR - Shift Logical Right

Clocks
Operands 808x 286 386 486

(reg,1) 2 2 3 2
(reg,immed8) - 5+n 3 3


Darek Mihocka says this:
Quote:

Analyzing the results - why the Pentium 4 fails to deliver
...
MISTAKE #6 - Shifts and rotates are slow - It seems Intel has taken yet another step back ... by eliminating the high-speed barrel shifter found in all previous 386, 486, Pentium ... chips. Instead, they created the shift/rotate execution unit, which by design operates at normal clock speed (not double clock speed), but in my testing actually operates even slower. A typical shift operation on the Pentium 4 requires 4 to 6 clock cycles to complete. Compare this with a single clock cycle on any 486, Pentium, or Athlon processor.


revolution wrote:

So who will now prove this assumption is wrong? Wink


Not me, I don't have any good code. As is obvious (to me), anything that doesn't measure more than a few seconds in difference isn't significant. So I don't have any obvious benchmarks (that would have to run millions of times just to hope for a big enough difference). Truly, higher clock speed in the Gigahertz has spoiled us.
Post 14 Sep 2016, 21:51
View user's profile Send private message Visit poster's website Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 15871
Location: 162173 Ryugu
rugxulo wrote:
Truly, higher clock speed in the Gigahertz has spoiled us.
I think that the spoiling is more from the fact that CPUs can outperform any memory currently available. We can "afford" to be lazy with individual clocks here and there because often we spend more time waiting for DRAM to give us some more data to crunch, thus hiding our laziness. It also means that cache management is almost always the most productive way to improve performance for anything non-trivial. But it is situation dependant as always.
Post 14 Sep 2016, 22:02
View user's profile Send private message Visit poster's website Reply with quote
zhak



Joined: 12 Apr 2005
Posts: 489
Location: Belarus
Since those are not table algos and are supposed to be slow by their nature, I applied optimizations for size, not speed
Post 15 Sep 2016, 10:13
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 15871
Location: 162173 Ryugu
zhak wrote:
Since those are not table algos and are supposed to be slow by their nature, I applied optimizations for size, not speed
For a size metric then a simple naive loop-within-a-loop will probably be both the slowest and the smallest. And make sure to use LOOP if possible also.
Post 15 Sep 2016, 10:21
View user's profile Send private message Visit poster's website Reply with quote
zhak



Joined: 12 Apr 2005
Posts: 489
Location: Belarus
*edit*
Below are results as measured by Agner Fog's PMCTest tool.

(crc computed once for a random 64k message)

Code:
STRAIGHT (BITWISE) Processor 0 Clock Core cyc Instruct Uops 6826837 7286942 2621541 2558016 6585698 7244283 2621533 2556118 6644231 7308660 2621533 2556000 6888307 7300361 2621536 2557726 6650729 7315814 2621533 2556000 6649959 7314974 2621533 2556000 6652640 7317914 2621533 2556000 6641692 7305871 2621533 2556000 6651110 7316235 2621533 2556000 7542197 7356542 2621535 2563536 6772477 7288164 2621534 2556186 6652046 7317261 2621533 2556000 6648680 7313805 2621533 2556000 6649662 7314636 2621533 2556000 6654796 7320292 2621533 2556000 7166295 7330060 2621535 2561044 6654975 7320484 2621533 2556000 6644049 7308466 2621533 2556000 6644111 7308526 2621533 2556000 6647080 7311796 2621533 2556000 -------------------------------------------- OPTIMIZED Processor 0 Clock Core cyc Instruct Uops 526358 565746 1114127 1050415 506004 556603 1114120 1048587 505997 556603 1114120 1048587 508813 559692 1114120 1048705 506093 556703 1114120 1048587 506247 556872 1114120 1048587 506250 556873 1114120 1048587 506250 556874 1114120 1048587 506096 556706 1114120 1048587 506247 556871 1114120 1048587 582404 564981 1114120 1049299 506203 556829 1114120 1048587 506207 556826 1114120 1048587 506007 556605 1114120 1048587 505997 556605 1114120 1048587 506000 556605 1114120 1048587 506207 556828 1114120 1048587 506007 556607 1114120 1048587 506000 556606 1114120 1048587 506207 556826 1114120 1048587 -------------------------------------------- TABLE Processor 0 Clock Core cyc Instruct Uops 463139 470490 458772 526135 420240 462264 458765 524308 420164 462438 458765 524308 420241 462254 458765 524308 420207 462219 458765 524308 420216 462233 458765 524308 420130 462138 458765 524308 420281 462300 458765 524308 420204 462215 458765 524308 420216 462231 458765 524308 420121 462124 458765 524308 420198 462208 458765 524308 420130 462137 458765 524308 420274 462301 458765 524308 420121 462124 458765 524308 420151 462165 458765 524308 420167 462177 458765 524308 420339 462369 458765 524308 420130 462136 458765 524308 420198 462208 458765 524308


Table algo implementation:
Code:
; ; rcx message ; rdx length ; r8 init ; r9 xorout ; stack ptr to table ; crc16_tbl: movzx eax, r8w mov r10, [rsp+40] ; shadow stack should be reserved on call @@: movzx r11d, al shr eax, 8 xor r11b, byte [rcx] xor ax, [r10 + r11 * 2] inc rcx dec edx jnz short @b xor eax, r9d ret

BTW I also tried different variants of instructions, like subsituting DEC ECX/JNZ with LOOP or INC REG with ADD REG, 1; etc. Those don't make significant difference.
Post 15 Sep 2016, 10:28
View user's profile Send private message Reply with quote
zhak



Joined: 12 Apr 2005
Posts: 489
Location: Belarus
Final test. fast pclmulqdq implementation as described in http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/fast-crc-computation-generic-polynomials-pclmulqdq-paper.pdf

The same random 64Kbytes message
Code:
Processor 0 Clock Core cyc Instruct Uops 85403 34466 31788 48981 16972 18671 31781 47151 15612 17175 31781 47151 15593 17161 31781 47151 15603 17165 31781 47151 15609 17173 31781 47151 15582 17138 31781 47151 15403 16949 31781 47151 15399 16951 31781 47151 15372 16911 31781 47151 15381 16925 31781 47151 15369 16910 31781 47151 15301 16836 31781 47151 15295 16830 31781 47151 15301 16837 31781 47151 15298 16835 31781 47151 15646 17212 31781 47151 15301 16840 31781 47151 15301 16837 31781 47151 15298 16835 31781 47151

This is really impressive. Although the algo itself is huge. around 1200 bytes
Post 20 Sep 2016, 18:56
View user's profile Send private message Reply with quote
Display posts from previous:
Post new topic Reply to topic

Jump to:  


< Last Thread | Next Thread >

Forum Rules:
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum
You cannot attach files in this forum
You can download files in this forum


Copyright © 2004-2018, Tomasz Grysztar.

Powered by rwasa.