flat assembler
Message board for the users of flat assembler.
 Home   FAQ   Search   Register 
 Profile   Log in to check your private messages   Log in 
flat assembler > High Level Languages > What's the math behind this CRC algo?

Author
Thread Post new topic Reply to topic
zhak



Joined: 12 Apr 2005
Posts: 479
Location: Belarus
What's the math behind this CRC algo?
I've came across one interesting CRC_B (X.25) implementation.

Algo description is the following:
width=16 poly=0x1021 init=0xffff refin=true refout=true xorout=0xffff check=0x906e

Implementation (sorry, it's in C, as original in the book):

Code:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>
#define CRC_A 1
#define CRC_B 2
#define BYTE unsigned char 

unsigned short UpdateCrc(unsigned char chunsigned short *lpwCrc)
{
        ch = (ch^(unsigned char)((*lpwCrc) & 0x00FF));
        ch = (ch^(ch<<4));
        *lpwCrc = (*lpwCrc >> 8)^((unsigned short)ch << 8)^((unsigned short)ch<<3)^((unsigned short)ch>>4);
        return(*lpwCrc);
}

void ComputeCrc(int CRCTypechar *Dataint LengthBYTE *TransmitFirstBYTE *TransmitSecond)
{
        unsigned char chBlock;
        unsigned short wCrc;
        switch(CRCType) {
                case CRC_A:
                        wCrc = 0x6363; /* ITU-V.41 */
                        break;
                case CRC_B:
                        wCrc = 0xFFFF; /* ISO/IEC 13239 (formerly ISO/IEC 3309) */
                        break;
                default:
                        return;
        }

        do {
                chBlock = *Data++;
                UpdateCrc(chBlock, &wCrc);
        } while (--Length);
        if (CRCType == CRC_BwCrc = ~wCrc; /* ISO/IEC 13239 (formerly ISO/IEC 3309) */
        *TransmitFirst = (BYTE) (wCrc & 0xFF);
        *TransmitSecond = (BYTE) ((wCrc >> 8) & 0xFF);
        return;
}



I'm very curious about math behind UpdateCrc() func.
It doesn't involve standard 'shift one bit > check if 1 > xor with poly if true > repeat 8 times' approach, but rather involves multiple shifts and xors to compute all in one step.
Maybe anybody keen in algos and math could explain it? Maybe give some formulas? (I can't find any). My idea is to adapt this approach to use with other polynoms, not only 0x1021 (0x8408 in this implementation)
Post 08 Sep 2016, 13:22
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 15241
Location: 1I/ʻOumuamua
Indeed. It is not a CRC in the usual sense of the acronym. It appears to be some homebrew thing, and as such is likely to not be of similar quality to a "real" CRC.
Post 08 Sep 2016, 13:38
View user's profile Send private message Visit poster's website Reply with quote
bitRAKE



Joined: 21 Jul 2003
Posts: 2624
Location: dank orb
IDK, it seems fairly standard CRC code:
http://wg8.de/wg8n1454_17n3497_Ballot_3rdCD14443-3.3.pdf (page 50)
https://users.ece.cmu.edu/~koopman/roses/dsn04/koopman04_crc_poly_embedded.pdf

I think the optimization makes it look unusual. The second reference above talks about the math. Maybe, it's the reverse reciprocal polynomial of 8810?

_________________
The generation of random numbers is too important to be left to chance - Robert R Coveyou
Post 08 Sep 2016, 18:22
View user's profile Send private message Visit poster's website Reply with quote
idle



Joined: 06 Jan 2011
Posts: 346
Location: ukraina
tnx 4 the linx, bitRake, ps: i was curious to investigate it

Code:
 
unsigned short UpdateCrc(unsigned char chunsigned short *lpwCrc)
{     //CRC_B case:
        //ch = ch xor 0xff 
        //!  = not chtake this one as the routine seems to be bitwise
        //   = -ch-1arith seems non-applicable here     
        ch = (ch^(unsigned char)((*lpwCrc) & 0x00FF))
        //temp_ch = ch shl 4      , least bits 0000bthe algo seems drunkywhy do we pop/split
        //     ch = ch xor temp_chleast bits of ch left unchanged
        //...
        //sowe took a char/byteinverted itlower part of inverted left samehigher part of inverted XOR'ed to the lower         
        ch = (ch^(ch<<4));
        //...
        //return result stage complicated - we should make it easier
        *lpwCrc = (*lpwCrc >> 8)^((unsigned short)ch << 8)^((unsigned short)ch<<3)^((unsigned short)ch>>4);
        //mind it is a CRC_B casech temporarily movzx-extendedand also those XORs      
        *lpwCrc = 0xffff ^ word(ch << 8^ (ch<<3^ (ch>>4);
        //      = not ...it looks likedamnsorry     
        return;
}


Post 08 Sep 2016, 19:15
View user's profile Send private message Send e-mail Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 15241
Location: 1I/ʻOumuamua
Oh wow, that is cool, kind of. Hard core "optimisation" if it really does compute a standard CRC function. It desperately needs extensive commenting though.
Post 08 Sep 2016, 19:45
View user's profile Send private message Visit poster's website Reply with quote
bitRAKE



Joined: 21 Jul 2003
Posts: 2624
Location: dank orb
Possible x86 conversion (not optimal):

Code:
UpdateCrc; Update CRC AX with byte CL
    xor cl,al           ; ch = (ch^(unsigned char)((*lpwCrc) & 0x00FF));

    mov ch,cl           ; ch = (ch^(ch<<4));
    shl cl,4
    xor cl,ch
                        ; *lpwCrc = (*lpwCrc >> 8)
    shl cx,8            ;           ^((unsigned short)ch<<8)
    shr ax,8            ;           ^((unsigned short)ch<<3)
    mov dx,cx           ;           ^((unsigned short)ch>>4);
    xor ax,cx           ;
    shr dx,4+8
    shr cx,8-3
    xor ax,dx
    xor ax,cx
    retn

I agree: the second line is baffling without going through the maths (which I have not done).
Post 08 Sep 2016, 19:56
View user's profile Send private message Visit poster's website Reply with quote
zhak



Joined: 12 Apr 2005
Posts: 479
Location: Belarus

bitRAKE wrote:
Maybe, it's the reverse reciprocal polynomial of 8810?


Hmm Wiki says "But the reciprocal polynomial is not the same as the original polynomial, and the CRCs generated using it are not the same (even modulo bit reversal) as those generated by the original polynomial."
However, this algo produces the same checksum as

Code:

  mov bx0x8408
  mov ax0xffff
  mov di0xffff
  mov rsimsg
  mov ecx3
  call crc16t
  jmp $

msg db 0x0f0xaa0xff

; RefIn/Out = true
; bx    poly
; ax    init
; di    xorOut
; ecx   length
; rsi   message ptr
;
crc16t:
  xor albyte [rsi]
  mov dl8
  @@:
  shr ax1
  jnc short $ + 4
  xor eaxebx
  dec dl
  jnz short @b
  inc rsi
  dec ecx
  jnz short crc16t
  xor eaxedi
  ret



Well, because it is supposed to, because it is implementation of this standard (CRC-16-X25)
Post 08 Sep 2016, 21:35
View user's profile Send private message Reply with quote
bitRAKE



Joined: 21 Jul 2003
Posts: 2624
Location: dank orb
Post 09 Sep 2016, 16:28
View user's profile Send private message Visit poster's website Reply with quote
zhak



Joined: 12 Apr 2005
Posts: 479
Location: Belarus
So, I executed a small performance test on straightforward (crc16t) and optimized (crc16t_1021) algos.


Code:

;~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
; Compute CRC-16 for RefIn/RefOut = true
;
; Parameters:
; AX      Init
; BX      Poly (reflected)
; ECX     Message length in bytes (max. 65536)
; RSI     Pointer to message
; DI      XorOut
;
crc16t:
  xor albyte [rsi]
  mov dl8
  @@:
  shr ax1
  jnc short $ + 4
  xor eaxebx
  dec dl
  jnz short @b
  inc rsi
  dec ecx
  jnz short crc16t
  xor eaxedi
  ret

;~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
; Compute CRC-16 for Poly = 0x1021 and RefIn/RefOut = true
;
; Parameters:
; AX      Init
; ECX     Message length in bytes (max. 65536)
; RSI     Pointer to message
; DI      XorOut
;
crc16t_1021:
  movzx edxbyte [rsi]
  xor dlal
  mov bldl
  shl bl4
  xor dlbl
  mov ebpedx
  mov ebxedx
  shr ax8
  shl ebp8
  shl ebx3
  shr edx4
  xor eaxebp
  xor ebxedx
  xor eaxebx
  inc rsi
  dec ecx
  jnz short crc16t_1021
  xor eaxedi
  ret




Test conditions:
- Measure (using RDTSC) how many cycles (in average) does it take to compute CRC for random (generated with RDRAND) message
- Message length = 65535 bytes
- CRC for each message is computed 1000 times
- Test executed on 1000 different messages
- Test performed in UEFI (long mode, one active core, single thread) on Intel Core M-5 Y10c @ 0.80 GHz

So, the straightforward function takes 6962181 cyclese, while the optimized one takes 551631, in average. It works 12 times faster! Frankly speaking, I didn't expect that, so I ran tests twice Smile
I'll also measure that pclmulqdq method and update results.
Post 11 Sep 2016, 20:13
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 15241
Location: 1I/ʻOumuamua
How does it compare to the normal table CRC method? Nobody really uses the naive bitwise CRC if performance is an issue.
Post 12 Sep 2016, 01:16
View user's profile Send private message Visit poster's website Reply with quote
rugxulo



Joined: 09 Aug 2005
Posts: 2124
Location: Usono (aka, USA)
Just some ideas (although I am no pro):


Code:

  shr ax1




Also try "shr ax, 17".

(EDIT: This may be wrong, I didn't test. Just for clarity, to show the intended point, I was thinking of this.)


Code:

  dec dl




Try "sub dl,1" (and similarly "add" instead of "inc").


Code:

  dec ecx
  jnz short crc16t




Are you absolutely sure that "loop" would be worse here??


Code:

  movzx edxbyte [rsi]




Try "xor edx, edx" and "mov dl,byte [rsi]" instead.


Code:

  shl bl4




Try "add" instead (or even "lea").

(Admittedly, modern machines probably don't care anymore about such things. I've only been skimming some old docs, just thought it was curious.)
Post 13 Sep 2016, 22:20
View user's profile Send private message Visit poster's website Reply with quote
bitRAKE



Joined: 21 Jul 2003
Posts: 2624
Location: dank orb
There are false dependencies due to the use of multiple register sizes - that can be said for both routines though. The optimization is different in kind. I don't think unrolling the loop of the other would even make it as fast.

Table methods are faster, but only once the table is in cache and then we are talking about a memory bound process with the incoming data. There is a corner use case where the data is being generated (through decompression, etc.) where optimization makes sense.

Of course, hardware optimization is another realm - most the research has that focus.
Post 14 Sep 2016, 16:10
View user's profile Send private message Visit poster's website Reply with quote
rugxulo



Joined: 09 Aug 2005
Posts: 2124
Location: Usono (aka, USA)

rugxulo wrote:


Code:

  shr ax1




Also try "shr ax, 17".

(EDIT: This may be wrong, I didn't test. Just for clarity, to show the intended point, I was thinking of this.)



Okay, clearly it's 33 I meant instead of 17 since all shifts (nowadays) are mod 32. Of course this is mostly ridiculous, but it is claimed to help at least one processor (486), so I honestly don't know (can't know ... right, revolution?) without testing whether it would affect other, newer cpus as well.
Post 14 Sep 2016, 19:34
View user's profile Send private message Visit poster's website Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 15241
Location: 1I/ʻOumuamua

bitRAKE wrote:
Table methods are faster, but only once the table is in cache and then we are talking about a memory bound process with the incoming data. There is a corner use case where the data is being generated (through decompression, etc.) where optimization makes sense.

Since we are discussing performance, and thus we can assume the programmer has already identified that this code is an important bottleneck to solve, then I think we should assume that the CRC table will be in cache. To assume otherwise would suggest that this code is not really all that important and the whole discussion is not needed.

rugxulo wrote:
... it is claimed to help at least one processor (486), so I honestly don't know (can't know ... right, revolution?) without testing whether it would affect other, newer cpus as well.

Yes, absolutely true. Without proper testing any assumption can turn out to be very wrong. However I am willing to make the wild assumption that doing 'shr reg,1' compared to 'shr reg,33' will show no difference in performance for any CPU 486 or higher. The masked barrel shifter should completely ignore all the high order bits with no side effects. So who will now prove this assumption is wrong? Wink
Post 14 Sep 2016, 20:09
View user's profile Send private message Visit poster's website Reply with quote
rugxulo



Joined: 09 Aug 2005
Posts: 2124
Location: Usono (aka, USA)

revolution wrote:
Yes, absolutely true. Without proper testing any assumption can turn out to be very wrong. However I am willing to make the wild assumption that doing 'shr reg,1' compared to 'shr reg,33' will show no difference in performance for any CPU 486 or higher. The masked barrel shifter should completely ignore all the high order bits with no side effects.



Assembly Gems (mentioned above) says this:

Quote:

On the 486, there is an anomaly in the instruction latencies. Note the following:

SHR/SAR/SHL reg,1 (opcodes D0, D1) have latency 3
SHR/SAR/SHL rem,imm8 (opcodes C0, C1) have latency 2



BBL Forth's S32.TXT says this:

Quote:

C:\BBLASM\S32.TXT 32-bit instruction timings for the 80486
...
Shifts and rotates are now relatively slow. Note that 1 bit
shifts are slower than N-bit immediate shifts. WEIRD!



HelpPC says this:

Quote:

SHR - Shift Logical Right

Clocks
Operands 808x 286 386 486

(reg,1) 2 2 3 2
(reg,immed8) - 5+n 3 3



Darek Mihocka says this:

Quote:

Analyzing the results - why the Pentium 4 fails to deliver
...
MISTAKE #6 - Shifts and rotates are slow - It seems Intel has taken yet another step back ... by eliminating the high-speed barrel shifter found in all previous 386, 486, Pentium ... chips. Instead, they created the shift/rotate execution unit, which by design operates at normal clock speed (not double clock speed), but in my testing actually operates even slower. A typical shift operation on the Pentium 4 requires 4 to 6 clock cycles to complete. Compare this with a single clock cycle on any 486, Pentium, or Athlon processor.




revolution wrote:

So who will now prove this assumption is wrong? Wink



Not me, I don't have any good code. As is obvious (to me), anything that doesn't measure more than a few seconds in difference isn't significant. So I don't have any obvious benchmarks (that would have to run millions of times just to hope for a big enough difference). Truly, higher clock speed in the Gigahertz has spoiled us.
Post 14 Sep 2016, 21:51
View user's profile Send private message Visit poster's website Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 15241
Location: 1I/ʻOumuamua

rugxulo wrote:
Truly, higher clock speed in the Gigahertz has spoiled us.

I think that the spoiling is more from the fact that CPUs can outperform any memory currently available. We can "afford" to be lazy with individual clocks here and there because often we spend more time waiting for DRAM to give us some more data to crunch, thus hiding our laziness. It also means that cache management is almost always the most productive way to improve performance for anything non-trivial. But it is situation dependant as always.
Post 14 Sep 2016, 22:02
View user's profile Send private message Visit poster's website Reply with quote
zhak



Joined: 12 Apr 2005
Posts: 479
Location: Belarus
Since those are not table algos and are supposed to be slow by their nature, I applied optimizations for size, not speed
Post 15 Sep 2016, 10:13
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 15241
Location: 1I/ʻOumuamua

zhak wrote:
Since those are not table algos and are supposed to be slow by their nature, I applied optimizations for size, not speed

For a size metric then a simple naive loop-within-a-loop will probably be both the slowest and the smallest. And make sure to use LOOP if possible also.
Post 15 Sep 2016, 10:21
View user's profile Send private message Visit poster's website Reply with quote
zhak



Joined: 12 Apr 2005
Posts: 479
Location: Belarus
*edit*
Below are results as measured by Agner Fog's PMCTest tool.

(crc computed once for a random 64k message)


Code:

STRAIGHT (BITWISE)

Processor 0
     Clock   Core cyc   Instruct       Uops
   6826837    7286942    2621541    2558016
   6585698    7244283    2621533    2556118
   6644231    7308660    2621533    2556000
   6888307    7300361    2621536    2557726
   6650729    7315814    2621533    2556000
   6649959    7314974    2621533    2556000
   6652640    7317914    2621533    2556000
   6641692    7305871    2621533    2556000
   6651110    7316235    2621533    2556000
   7542197    7356542    2621535    2563536
   6772477    7288164    2621534    2556186
   6652046    7317261    2621533    2556000
   6648680    7313805    2621533    2556000
   6649662    7314636    2621533    2556000
   6654796    7320292    2621533    2556000
   7166295    7330060    2621535    2561044
   6654975    7320484    2621533    2556000
   6644049    7308466    2621533    2556000
   6644111    7308526    2621533    2556000
   6647080    7311796    2621533    2556000

--------------------------------------------

OPTIMIZED

Processor 0
     Clock   Core cyc   Instruct       Uops
    526358     565746    1114127    1050415
    506004     556603    1114120    1048587
    505997     556603    1114120    1048587
    508813     559692    1114120    1048705
    506093     556703    1114120    1048587
    506247     556872    1114120    1048587
    506250     556873    1114120    1048587
    506250     556874    1114120    1048587
    506096     556706    1114120    1048587
    506247     556871    1114120    1048587
    582404     564981    1114120    1049299
    506203     556829    1114120    1048587
    506207     556826    1114120    1048587
    506007     556605    1114120    1048587
    505997     556605    1114120    1048587
    506000     556605    1114120    1048587
    506207     556828    1114120    1048587
    506007     556607    1114120    1048587
    506000     556606    1114120    1048587
    506207     556826    1114120    1048587

--------------------------------------------

TABLE

Processor 0
     Clock   Core cyc   Instruct       Uops
    463139     470490     458772     526135
    420240     462264     458765     524308
    420164     462438     458765     524308
    420241     462254     458765     524308
    420207     462219     458765     524308
    420216     462233     458765     524308
    420130     462138     458765     524308
    420281     462300     458765     524308
    420204     462215     458765     524308
    420216     462231     458765     524308
    420121     462124     458765     524308
    420198     462208     458765     524308
    420130     462137     458765     524308
    420274     462301     458765     524308
    420121     462124     458765     524308
    420151     462165     458765     524308
    420167     462177     458765     524308
    420339     462369     458765     524308
    420130     462136     458765     524308
    420198     462208     458765     524308




Table algo implementation:

Code:

;
; rcx   message
; rdx   length
; r8    init
; r9    xorout
; stack ptr to table
;
crc16_tbl:
  movzx eaxr8w
  mov r10, [rsp+40]    ; shadow stack should be reserved on call
  @@:
  movzx r11dal
  shr eax8
  xor r11bbyte [rcx]
  xor ax, [r10 + r11 * 2]
  inc rcx
  dec edx
  jnz short @b
  xor eaxr9d
  ret



BTW I also tried different variants of instructions, like subsituting DEC ECX/JNZ with LOOP or INC REG with ADD REG, 1; etc. Those don't make significant difference.
Post 15 Sep 2016, 10:28
View user's profile Send private message Reply with quote
zhak



Joined: 12 Apr 2005
Posts: 479
Location: Belarus
Final test. fast pclmulqdq implementation as described in http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/fast-crc-computation-generic-polynomials-pclmulqdq-paper.pdf

The same random 64Kbytes message

Code:

Processor 0
     Clock   Core cyc   Instruct       Uops
     85403      34466      31788      48981
     16972      18671      31781      47151
     15612      17175      31781      47151
     15593      17161      31781      47151
     15603      17165      31781      47151
     15609      17173      31781      47151
     15582      17138      31781      47151
     15403      16949      31781      47151
     15399      16951      31781      47151
     15372      16911      31781      47151
     15381      16925      31781      47151
     15369      16910      31781      47151
     15301      16836      31781      47151
     15295      16830      31781      47151
     15301      16837      31781      47151
     15298      16835      31781      47151
     15646      17212      31781      47151
     15301      16840      31781      47151
     15301      16837      31781      47151
     15298      16835      31781      47151



This is really impressive. Although the algo itself is huge. around 1200 bytes
Post 20 Sep 2016, 18:56
View user's profile Send private message Reply with quote
Display posts from previous:
Post new topic Reply to topic

Jump to:  


< Last Thread | Next Thread >

Forum Rules:
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum
You cannot attach files in this forum
You can download files in this forum


Powered by phpBB © 2001-2005 phpBB Group.

Main index   Download   Documentation   Examples   Message board
Copyright © 2004-2016, Tomasz Grysztar.