flat assembler
Message board for the users of flat assembler.
 flat assembler > High Level Languages > What's the math behind this CRC algo?
Author
zhak

Joined: 12 Apr 2005
Posts: 489
Location: Belarus

# What's the math behind this CRC algo?

I've came across one interesting CRC_B (X.25) implementation.

Algo description is the following:
width=16 poly=0x1021 init=0xffff refin=true refout=true xorout=0xffff check=0x906e

Implementation (sorry, it's in C, as original in the book):
 Code: #include  #include  #include  #include  #define CRC_A 1 #define CRC_B 2 #define BYTE unsigned char  unsigned short UpdateCrc(unsigned char ch, unsigned short *lpwCrc) {         ch = (ch^(unsigned char)((*lpwCrc) & 0x00FF));         ch = (ch^(ch<<4));         *lpwCrc = (*lpwCrc >> 8)^((unsigned short)ch << 8)^((unsigned short)ch<<3)^((unsigned short)ch>>4);         return(*lpwCrc); } void ComputeCrc(int CRCType, char *Data, int Length, BYTE *TransmitFirst, BYTE *TransmitSecond) {         unsigned char chBlock;         unsigned short wCrc;         switch(CRCType) {                 case CRC_A:                         wCrc = 0x6363; /* ITU-V.41 */                         break;                 case CRC_B:                         wCrc = 0xFFFF; /* ISO/IEC 13239 (formerly ISO/IEC 3309) */                         break;                 default:                         return;         }         do {                 chBlock = *Data++;                 UpdateCrc(chBlock, &wCrc);         } while (--Length);         if (CRCType == CRC_B) wCrc = ~wCrc; /* ISO/IEC 13239 (formerly ISO/IEC 3309) */         *TransmitFirst = (BYTE) (wCrc & 0xFF);         *TransmitSecond = (BYTE) ((wCrc >> 8) & 0xFF);         return; }

I'm very curious about math behind UpdateCrc() func.
It doesn't involve standard 'shift one bit > check if 1 > xor with poly if true > repeat 8 times' approach, but rather involves multiple shifts and xors to compute all in one step.
Maybe anybody keen in algos and math could explain it? Maybe give some formulas? (I can't find any). My idea is to adapt this approach to use with other polynoms, not only 0x1021 (0x8408 in this implementation)
08 Sep 2016, 13:22
revolution
When all else fails, read the source

Joined: 24 Aug 2004
Posts: 15494
Indeed. It is not a CRC in the usual sense of the acronym. It appears to be some homebrew thing, and as such is likely to not be of similar quality to a "real" CRC.
08 Sep 2016, 13:38
bitRAKE

Joined: 21 Jul 2003
Posts: 2627
Location: dank orb
IDK, it seems fairly standard CRC code:
http://wg8.de/wg8n1454_17n3497_Ballot_3rdCD14443-3.3.pdf (page 50)
https://users.ece.cmu.edu/~koopman/roses/dsn04/koopman04_crc_poly_embedded.pdf

I think the optimization makes it look unusual. The second reference above talks about the math. Maybe, it's the reverse reciprocal polynomial of 8810?

_________________
The generation of random numbers is too important to be left to chance - Robert R Coveyou
08 Sep 2016, 18:22
idle

Joined: 06 Jan 2011
Posts: 349
Location: ukraina
tnx 4 the linx, bitRake, ps: i was curious to investigate it
 Code: unsigned short UpdateCrc(unsigned char ch, unsigned short *lpwCrc) {     //CRC_B case:         //ch = ch xor 0xff          //!  = not ch, take this one as the routine seems to be bitwise         //   = -ch-1, arith seems non-applicable here              ch = (ch^(unsigned char)((*lpwCrc) & 0x00FF));          //temp_ch = ch shl 4      , least bits 0000b, the algo seems drunky: why do we pop/split         //     ch = ch xor temp_ch, least bits of ch left unchanged         //...         //so, we took a char/byte, inverted it, lower part of inverted left same, higher part of inverted XOR'ed to the lower                  ch = (ch^(ch<<4));         //...         //return result stage complicated - we should make it easier         *lpwCrc = (*lpwCrc >> 8)^((unsigned short)ch << 8)^((unsigned short)ch<<3)^((unsigned short)ch>>4);         //mind it is a CRC_B case, ch temporarily movzx-extended, and also those XORs               *lpwCrc = 0xffff ^ word(ch << 8) ^ (ch<<3) ^ (ch>>4);         //      = not ..., it looks like, damn, sorry              return; }
08 Sep 2016, 19:15
revolution
When all else fails, read the source

Joined: 24 Aug 2004
Posts: 15494
Oh wow, that is cool, kind of. Hard core "optimisation" if it really does compute a standard CRC function. It desperately needs extensive commenting though.
08 Sep 2016, 19:45
bitRAKE

Joined: 21 Jul 2003
Posts: 2627
Location: dank orb
Possible x86 conversion (not optimal):
 Code: UpdateCrc: ; Update CRC AX with byte CL     xor cl,al           ; ch = (ch^(unsigned char)((*lpwCrc) & 0x00FF));     mov ch,cl           ; ch = (ch^(ch<<4));     shl cl,4     xor cl,ch                         ; *lpwCrc = (*lpwCrc >> 8)     shl cx,8            ;           ^((unsigned short)ch<<8)     shr ax,8            ;           ^((unsigned short)ch<<3)     mov dx,cx           ;           ^((unsigned short)ch>>4);     xor ax,cx           ;     shr dx,4+8     shr cx,8-3     xor ax,dx     xor ax,cx     retn
I agree: the second line is baffling without going through the maths (which I have not done).
08 Sep 2016, 19:56
zhak

Joined: 12 Apr 2005
Posts: 489
Location: Belarus
 bitRAKE wrote: Maybe, it's the reverse reciprocal polynomial of 8810?

Hmm Wiki says "But the reciprocal polynomial is not the same as the original polynomial, and the CRCs generated using it are not the same (even modulo bit reversal) as those generated by the original polynomial."
However, this algo produces the same checksum as
 Code: mov bx, 0x8408   mov ax, 0xffff   mov di, 0xffff   mov rsi, msg   mov ecx, 3   call crc16t   jmp \$ msg db 0x0f, 0xaa, 0xff ; RefIn/Out = true ; bx    poly ; ax    init ; di    xorOut ; ecx   length ; rsi   message ptr ; crc16t:   xor al, byte [rsi]   mov dl, 8   @@:   shr ax, 1   jnc short \$ + 4   xor eax, ebx   dec dl   jnz short @b   inc rsi   dec ecx   jnz short crc16t   xor eax, edi   ret

Well, because it is supposed to, because it is implementation of this standard (CRC-16-X25)
08 Sep 2016, 21:35
bitRAKE

Joined: 21 Jul 2003
Posts: 2627
Location: dank orb
09 Sep 2016, 16:28
zhak

Joined: 12 Apr 2005
Posts: 489
Location: Belarus
So, I executed a small performance test on straightforward (crc16t) and optimized (crc16t_1021) algos.

 Code: ;~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ; Compute CRC-16 for RefIn/RefOut = true ; ; Parameters: ; AX      Init ; BX      Poly (reflected) ; ECX     Message length in bytes (max. 65536) ; RSI     Pointer to message ; DI      XorOut ; crc16t:   xor al, byte [rsi]   mov dl, 8   @@:   shr ax, 1   jnc short \$ + 4   xor eax, ebx   dec dl   jnz short @b   inc rsi   dec ecx   jnz short crc16t   xor eax, edi   ret ;~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ; Compute CRC-16 for Poly = 0x1021 and RefIn/RefOut = true ; ; Parameters: ; AX      Init ; ECX     Message length in bytes (max. 65536) ; RSI     Pointer to message ; DI      XorOut ; crc16t_1021:   movzx edx, byte [rsi]   xor dl, al   mov bl, dl   shl bl, 4   xor dl, bl   mov ebp, edx   mov ebx, edx   shr ax, 8   shl ebp, 8   shl ebx, 3   shr edx, 4   xor eax, ebp   xor ebx, edx   xor eax, ebx   inc rsi   dec ecx   jnz short crc16t_1021   xor eax, edi   ret

Test conditions:
- Measure (using RDTSC) how many cycles (in average) does it take to compute CRC for random (generated with RDRAND) message
- Message length = 65535 bytes
- CRC for each message is computed 1000 times
- Test executed on 1000 different messages
- Test performed in UEFI (long mode, one active core, single thread) on Intel Core M-5 Y10c @ 0.80 GHz

So, the straightforward function takes 6962181 cyclese, while the optimized one takes 551631, in average. It works 12 times faster! Frankly speaking, I didn't expect that, so I ran tests twice
I'll also measure that pclmulqdq method and update results.
11 Sep 2016, 20:13
revolution
When all else fails, read the source

Joined: 24 Aug 2004
Posts: 15494
How does it compare to the normal table CRC method? Nobody really uses the naive bitwise CRC if performance is an issue.
12 Sep 2016, 01:16
rugxulo

Joined: 09 Aug 2005
Posts: 2187
Location: Usono (aka, USA)
Just some ideas (although I am no pro):

 Code: shr ax, 1

Also try "shr ax, 17".

(EDIT: This may be wrong, I didn't test. Just for clarity, to show the intended point, I was thinking of this.)

 Code: dec dl

 Code: dec ecx   jnz short crc16t

Are you absolutely sure that "loop" would be worse here??

 Code: movzx edx, byte [rsi]

Try "xor edx, edx" and "mov dl,byte [rsi]" instead.

 Code: shl bl, 4

(Admittedly, modern machines probably don't care anymore about such things. I've only been skimming some old docs, just thought it was curious.)
13 Sep 2016, 22:20
bitRAKE

Joined: 21 Jul 2003
Posts: 2627
Location: dank orb
There are false dependencies due to the use of multiple register sizes - that can be said for both routines though. The optimization is different in kind. I don't think unrolling the loop of the other would even make it as fast.

Table methods are faster, but only once the table is in cache and then we are talking about a memory bound process with the incoming data. There is a corner use case where the data is being generated (through decompression, etc.) where optimization makes sense.

Of course, hardware optimization is another realm - most the research has that focus.
14 Sep 2016, 16:10
rugxulo

Joined: 09 Aug 2005
Posts: 2187
Location: Usono (aka, USA)

rugxulo wrote:

 Code: shr ax, 1

Also try "shr ax, 17".

(EDIT: This may be wrong, I didn't test. Just for clarity, to show the intended point, I was thinking of this.)

Okay, clearly it's 33 I meant instead of 17 since all shifts (nowadays) are mod 32. Of course this is mostly ridiculous, but it is claimed to help at least one processor (486), so I honestly don't know (can't know ... right, revolution?) without testing whether it would affect other, newer cpus as well.
14 Sep 2016, 19:34
revolution
When all else fails, read the source

Joined: 24 Aug 2004
Posts: 15494
 bitRAKE wrote: Table methods are faster, but only once the table is in cache and then we are talking about a memory bound process with the incoming data. There is a corner use case where the data is being generated (through decompression, etc.) where optimization makes sense.
Since we are discussing performance, and thus we can assume the programmer has already identified that this code is an important bottleneck to solve, then I think we should assume that the CRC table will be in cache. To assume otherwise would suggest that this code is not really all that important and the whole discussion is not needed.
 rugxulo wrote: ... it is claimed to help at least one processor (486), so I honestly don't know (can't know ... right, revolution?) without testing whether it would affect other, newer cpus as well.
Yes, absolutely true. Without proper testing any assumption can turn out to be very wrong. However I am willing to make the wild assumption that doing 'shr reg,1' compared to 'shr reg,33' will show no difference in performance for any CPU 486 or higher. The masked barrel shifter should completely ignore all the high order bits with no side effects. So who will now prove this assumption is wrong?
14 Sep 2016, 20:09
rugxulo

Joined: 09 Aug 2005
Posts: 2187
Location: Usono (aka, USA)
 revolution wrote: Yes, absolutely true. Without proper testing any assumption can turn out to be very wrong. However I am willing to make the wild assumption that doing 'shr reg,1' compared to 'shr reg,33' will show no difference in performance for any CPU 486 or higher. The masked barrel shifter should completely ignore all the high order bits with no side effects.

Assembly Gems (mentioned above) says this:
 Quote: On the 486, there is an anomaly in the instruction latencies. Note the following: SHR/SAR/SHL reg,1 (opcodes D0, D1) have latency 3 SHR/SAR/SHL rem,imm8 (opcodes C0, C1) have latency 2

BBL Forth's S32.TXT says this:
 Quote: C:\BBLASM\S32.TXT 32-bit instruction timings for the 80486 ... Shifts and rotates are now relatively slow. Note that 1 bit shifts are slower than N-bit immediate shifts. WEIRD!

HelpPC says this:
 Quote: SHR - Shift Logical Right Clocks Operands 808x 286 386 486 (reg,1) 2 2 3 2 (reg,immed8) - 5+n 3 3

Darek Mihocka says this:
 Quote: Analyzing the results - why the Pentium 4 fails to deliver ... MISTAKE #6 - Shifts and rotates are slow - It seems Intel has taken yet another step back ... by eliminating the high-speed barrel shifter found in all previous 386, 486, Pentium ... chips. Instead, they created the shift/rotate execution unit, which by design operates at normal clock speed (not double clock speed), but in my testing actually operates even slower. A typical shift operation on the Pentium 4 requires 4 to 6 clock cycles to complete. Compare this with a single clock cycle on any 486, Pentium, or Athlon processor.

 revolution wrote: So who will now prove this assumption is wrong?

Not me, I don't have any good code. As is obvious (to me), anything that doesn't measure more than a few seconds in difference isn't significant. So I don't have any obvious benchmarks (that would have to run millions of times just to hope for a big enough difference). Truly, higher clock speed in the Gigahertz has spoiled us.
14 Sep 2016, 21:51
revolution
When all else fails, read the source

Joined: 24 Aug 2004
Posts: 15494
 rugxulo wrote: Truly, higher clock speed in the Gigahertz has spoiled us.
I think that the spoiling is more from the fact that CPUs can outperform any memory currently available. We can "afford" to be lazy with individual clocks here and there because often we spend more time waiting for DRAM to give us some more data to crunch, thus hiding our laziness. It also means that cache management is almost always the most productive way to improve performance for anything non-trivial. But it is situation dependant as always.
14 Sep 2016, 22:02
zhak

Joined: 12 Apr 2005
Posts: 489
Location: Belarus
Since those are not table algos and are supposed to be slow by their nature, I applied optimizations for size, not speed
15 Sep 2016, 10:13
revolution
When all else fails, read the source

Joined: 24 Aug 2004
Posts: 15494
 zhak wrote: Since those are not table algos and are supposed to be slow by their nature, I applied optimizations for size, not speed
For a size metric then a simple naive loop-within-a-loop will probably be both the slowest and the smallest. And make sure to use LOOP if possible also.
15 Sep 2016, 10:21
zhak

Joined: 12 Apr 2005
Posts: 489
Location: Belarus
*edit*
Below are results as measured by Agner Fog's PMCTest tool.

(crc computed once for a random 64k message)

 Code: STRAIGHT (BITWISE) Processor 0      Clock   Core cyc   Instruct       Uops    6826837    7286942    2621541    2558016    6585698    7244283    2621533    2556118    6644231    7308660    2621533    2556000    6888307    7300361    2621536    2557726    6650729    7315814    2621533    2556000    6649959    7314974    2621533    2556000    6652640    7317914    2621533    2556000    6641692    7305871    2621533    2556000    6651110    7316235    2621533    2556000    7542197    7356542    2621535    2563536    6772477    7288164    2621534    2556186    6652046    7317261    2621533    2556000    6648680    7313805    2621533    2556000    6649662    7314636    2621533    2556000    6654796    7320292    2621533    2556000    7166295    7330060    2621535    2561044    6654975    7320484    2621533    2556000    6644049    7308466    2621533    2556000    6644111    7308526    2621533    2556000    6647080    7311796    2621533    2556000 -------------------------------------------- OPTIMIZED Processor 0      Clock   Core cyc   Instruct       Uops     526358     565746    1114127    1050415     506004     556603    1114120    1048587     505997     556603    1114120    1048587     508813     559692    1114120    1048705     506093     556703    1114120    1048587     506247     556872    1114120    1048587     506250     556873    1114120    1048587     506250     556874    1114120    1048587     506096     556706    1114120    1048587     506247     556871    1114120    1048587     582404     564981    1114120    1049299     506203     556829    1114120    1048587     506207     556826    1114120    1048587     506007     556605    1114120    1048587     505997     556605    1114120    1048587     506000     556605    1114120    1048587     506207     556828    1114120    1048587     506007     556607    1114120    1048587     506000     556606    1114120    1048587     506207     556826    1114120    1048587 -------------------------------------------- TABLE Processor 0      Clock   Core cyc   Instruct       Uops     463139     470490     458772     526135     420240     462264     458765     524308     420164     462438     458765     524308     420241     462254     458765     524308     420207     462219     458765     524308     420216     462233     458765     524308     420130     462138     458765     524308     420281     462300     458765     524308     420204     462215     458765     524308     420216     462231     458765     524308     420121     462124     458765     524308     420198     462208     458765     524308     420130     462137     458765     524308     420274     462301     458765     524308     420121     462124     458765     524308     420151     462165     458765     524308     420167     462177     458765     524308     420339     462369     458765     524308     420130     462136     458765     524308     420198     462208     458765     524308

Table algo implementation:
 Code: ; ; rcx   message ; rdx   length ; r8    init ; r9    xorout ; stack ptr to table ; crc16_tbl:   movzx eax, r8w   mov r10, [rsp+40]    ; shadow stack should be reserved on call   @@:   movzx r11d, al   shr eax, 8   xor r11b, byte [rcx]   xor ax, [r10 + r11 * 2]   inc rcx   dec edx   jnz short @b   xor eax, r9d   ret

BTW I also tried different variants of instructions, like subsituting DEC ECX/JNZ with LOOP or INC REG with ADD REG, 1; etc. Those don't make significant difference.
15 Sep 2016, 10:28
zhak

Joined: 12 Apr 2005
Posts: 489
Location: Belarus
Final test. fast pclmulqdq implementation as described in http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/fast-crc-computation-generic-polynomials-pclmulqdq-paper.pdf

The same random 64Kbytes message
 Code: Processor 0      Clock   Core cyc   Instruct       Uops      85403      34466      31788      48981      16972      18671      31781      47151      15612      17175      31781      47151      15593      17161      31781      47151      15603      17165      31781      47151      15609      17173      31781      47151      15582      17138      31781      47151      15403      16949      31781      47151      15399      16951      31781      47151      15372      16911      31781      47151      15381      16925      31781      47151      15369      16910      31781      47151      15301      16836      31781      47151      15295      16830      31781      47151      15301      16837      31781      47151      15298      16835      31781      47151      15646      17212      31781      47151      15301      16840      31781      47151      15301      16837      31781      47151      15298      16835      31781      47151

This is really impressive. Although the algo itself is huge. around 1200 bytes
20 Sep 2016, 18:56
 Display posts from previous: All Posts1 Day7 Days2 Weeks1 Month3 Months6 Months1 Year Oldest FirstNewest First

 Jump to: Select a forum Official----------------Blog General----------------MainDOSWindowsLinuxUnixMenuetOS Specific----------------MacroinstructionsCompiler InternalsIDE DevelopmentOS ConstructionNon-x86 architecturesHigh Level LanguagesProgramming Language DesignProjects and IdeasExamples and Tutorials Other----------------FeedbackHeapTest Area

Forum Rules:
 You cannot post new topics in this forumYou cannot reply to topics in this forumYou cannot edit your posts in this forumYou cannot delete your posts in this forumYou cannot vote in polls in this forumYou cannot attach files in this forumYou can download files in this forum