flat assembler
Message board for the users of flat assembler.

flat assembler > High Level Languages > What's the math behind this CRC algo?

Author
Thread Post new topic Reply to topic
zhak



Joined: 12 Apr 2005
Posts: 489
Location: Belarus
I've came across one interesting CRC_B (X.25) implementation.

Algo description is the following:
width=16 poly=0x1021 init=0xffff refin=true refout=true xorout=0xffff check=0x906e

Implementation (sorry, it's in C, as original in the book):
Code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>
#define CRC_A 1
#define CRC_B 2
#define BYTE unsigned char 

unsigned short UpdateCrc(unsigned char ch, unsigned short *lpwCrc)
{
        ch = (ch^(unsigned char)((*lpwCrc) & 0x00FF));
        ch = (ch^(ch<<4));
        *lpwCrc = (*lpwCrc >> 8)^((unsigned short)ch << 8)^((unsigned short)ch<<3)^((unsigned short)ch>>4);
        return(*lpwCrc);
}

void ComputeCrc(int CRCType, char *Data, int Length, BYTE *TransmitFirst, BYTE *TransmitSecond)
{
        unsigned char chBlock;
        unsigned short wCrc;
        switch(CRCType) {
                case CRC_A:
                        wCrc = 0x6363; /* ITU-V.41 */
                        break;
                case CRC_B:
                        wCrc = 0xFFFF; /* ISO/IEC 13239 (formerly ISO/IEC 3309) */
                        break;
                default:
                        return;
        }

        do {
                chBlock = *Data++;
                UpdateCrc(chBlock, &wCrc);
        } while (--Length);
        if (CRCType == CRC_B) wCrc = ~wCrc; /* ISO/IEC 13239 (formerly ISO/IEC 3309) */
        *TransmitFirst = (BYTE) (wCrc & 0xFF);
        *TransmitSecond = (BYTE) ((wCrc >> 8) & 0xFF);
        return;
}
    

I'm very curious about math behind UpdateCrc() func.
It doesn't involve standard 'shift one bit > check if 1 > xor with poly if true > repeat 8 times' approach, but rather involves multiple shifts and xors to compute all in one step.
Maybe anybody keen in algos and math could explain it? Maybe give some formulas? (I can't find any). My idea is to adapt this approach to use with other polynoms, not only 0x1021 (0x8408 in this implementation)
Post 08 Sep 2016, 13:22
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 15982
Location: Qo'noS
Indeed. It is not a CRC in the usual sense of the acronym. It appears to be some homebrew thing, and as such is likely to not be of similar quality to a "real" CRC.
Post 08 Sep 2016, 13:38
View user's profile Send private message Visit poster's website Reply with quote
bitRAKE



Joined: 21 Jul 2003
Posts: 2653
Location: dank orb
IDK, it seems fairly standard CRC code:
http://wg8.de/wg8n1454_17n3497_Ballot_3rdCD14443-3.3.pdf (page 50)
https://users.ece.cmu.edu/~koopman/roses/dsn04/koopman04_crc_poly_embedded.pdf

I think the optimization makes it look unusual. The second reference above talks about the math. Maybe, it's the reverse reciprocal polynomial of 8810?

_________________
utf8everywhere.org
Post 08 Sep 2016, 18:22
View user's profile Send private message Visit poster's website Reply with quote
idle



Joined: 06 Jan 2011
Posts: 352
Location: ukraina
tnx 4 the linx, bitRake, ps: i was curious to investigate it
Code:
 
unsigned short UpdateCrc(unsigned char ch, unsigned short *lpwCrc)
{     //CRC_B case:
        //ch = ch xor 0xff 
        //!  = not ch, take this one as the routine seems to be bitwise
        //   = -ch-1, arith seems non-applicable here     
        ch = (ch^(unsigned char)((*lpwCrc) & 0x00FF)); 
        //temp_ch = ch shl 4      , least bits 0000b, the algo seems drunky: why do we pop/split
        //     ch = ch xor temp_ch, least bits of ch left unchanged
        //...
        //so, we took a char/byte, inverted it, lower part of inverted left same, higher part of inverted XOR'ed to the lower         
        ch = (ch^(ch<<4));
        //...
        //return result stage complicated - we should make it easier
        *lpwCrc = (*lpwCrc >> 8)^((unsigned short)ch << 8)^((unsigned short)ch<<3)^((unsigned short)ch>>4);
        //mind it is a CRC_B case, ch temporarily movzx-extended, and also those XORs      
        *lpwCrc = 0xffff ^ word(ch << 8) ^ (ch<<3) ^ (ch>>4);
        //      = not ..., it looks like, damn, sorry     
        return;
}
    
Post 08 Sep 2016, 19:15
View user's profile Send private message Send e-mail Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 15982
Location: Qo'noS
Oh wow, that is cool, kind of. Hard core "optimisation" if it really does compute a standard CRC function. It desperately needs extensive commenting though.
Post 08 Sep 2016, 19:45
View user's profile Send private message Visit poster's website Reply with quote
bitRAKE



Joined: 21 Jul 2003
Posts: 2653
Location: dank orb
Possible x86 conversion (not optimal):
Code:
UpdateCrc: ; Update CRC AX with byte CL
    xor cl,al           ; ch = (ch^(unsigned char)((*lpwCrc) & 0x00FF));

    mov ch,cl           ; ch = (ch^(ch<<4));
    shl cl,4
    xor cl,ch
                        ; *lpwCrc = (*lpwCrc >> 8)
    shl cx,8            ;           ^((unsigned short)ch<<8)
    shr ax,8            ;           ^((unsigned short)ch<<3)
    mov dx,cx           ;           ^((unsigned short)ch>>4);
    xor ax,cx           ;
    shr dx,4+8
    shr cx,8-3
    xor ax,dx
    xor ax,cx
    retn    
I agree: the second line is baffling without going through the maths (which I have not done).
Post 08 Sep 2016, 19:56
View user's profile Send private message Visit poster's website Reply with quote
zhak



Joined: 12 Apr 2005
Posts: 489
Location: Belarus
bitRAKE wrote:
Maybe, it's the reverse reciprocal polynomial of 8810?

Hmm Wiki says "But the reciprocal polynomial is not the same as the original polynomial, and the CRCs generated using it are not the same (even modulo bit reversal) as those generated by the original polynomial."
However, this algo produces the same checksum as
Code:
  mov bx, 0x8408
  mov ax, 0xffff
  mov di, 0xffff
  mov rsi, msg
  mov ecx, 3
  call crc16t
  jmp $

msg db 0x0f, 0xaa, 0xff

; RefIn/Out = true
; bx    poly
; ax    init
; di    xorOut
; ecx   length
; rsi   message ptr
;
crc16t:
  xor al, byte [rsi]
  mov dl, 8
  @@:
  shr ax, 1
  jnc short $ + 4
  xor eax, ebx
  dec dl
  jnz short @b
  inc rsi
  dec ecx
  jnz short crc16t
  xor eax, edi
  ret
    

Well, because it is supposed to, because it is implementation of this standard (CRC-16-X25)
Post 08 Sep 2016, 21:35
View user's profile Send private message Reply with quote
bitRAKE



Joined: 21 Jul 2003
Posts: 2653
Location: dank orb
Post 09 Sep 2016, 16:28
View user's profile Send private message Visit poster's website Reply with quote
zhak



Joined: 12 Apr 2005
Posts: 489
Location: Belarus
So, I executed a small performance test on straightforward (crc16t) and optimized (crc16t_1021) algos.

Code:
;~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
; Compute CRC-16 for RefIn/RefOut = true
;
; Parameters:
; AX      Init
; BX      Poly (reflected)
; ECX     Message length in bytes (max. 65536)
; RSI     Pointer to message
; DI      XorOut
;
crc16t:
  xor al, byte [rsi]
  mov dl, 8
  @@:
  shr ax, 1
  jnc short $ + 4
  xor eax, ebx
  dec dl
  jnz short @b
  inc rsi
  dec ecx
  jnz short crc16t
  xor eax, edi
  ret

;~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
; Compute CRC-16 for Poly = 0x1021 and RefIn/RefOut = true
;
; Parameters:
; AX      Init
; ECX     Message length in bytes (max. 65536)
; RSI     Pointer to message
; DI      XorOut
;
crc16t_1021:
  movzx edx, byte [rsi]
  xor dl, al
  mov bl, dl
  shl bl, 4
  xor dl, bl
  mov ebp, edx
  mov ebx, edx
  shr ax, 8
  shl ebp, 8
  shl ebx, 3
  shr edx, 4
  xor eax, ebp
  xor ebx, edx
  xor eax, ebx
  inc rsi
  dec ecx
  jnz short crc16t_1021
  xor eax, edi
  ret
    


Test conditions:
- Measure (using RDTSC) how many cycles (in average) does it take to compute CRC for random (generated with RDRAND) message
- Message length = 65535 bytes
- CRC for each message is computed 1000 times
- Test executed on 1000 different messages
- Test performed in UEFI (long mode, one active core, single thread) on Intel Core M-5 Y10c @ 0.80 GHz

So, the straightforward function takes 6962181 cyclese, while the optimized one takes 551631, in average. It works 12 times faster! Frankly speaking, I didn't expect that, so I ran tests twice Smile
I'll also measure that pclmulqdq method and update results.
Post 11 Sep 2016, 20:13
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 15982
Location: Qo'noS
How does it compare to the normal table CRC method? Nobody really uses the naive bitwise CRC if performance is an issue.
Post 12 Sep 2016, 01:16
View user's profile Send private message Visit poster's website Reply with quote
rugxulo



Joined: 09 Aug 2005
Posts: 2302
Location: Usono (aka, USA)
Just some ideas (although I am no pro):

Code:
  shr ax, 1
    


Also try "shr ax, 17".

(EDIT: This may be wrong, I didn't test. Just for clarity, to show the intended point, I was thinking of this.)

Code:
  dec dl
    


Try "sub dl,1" (and similarly "add" instead of "inc").

Code:
  dec ecx
  jnz short crc16t
    


Are you absolutely sure that "loop" would be worse here??

Code:
  movzx edx, byte [rsi]
    


Try "xor edx, edx" and "mov dl,byte [rsi]" instead.

Code:
  shl bl, 4
    


Try "add" instead (or even "lea").

(Admittedly, modern machines probably don't care anymore about such things. I've only been skimming some old docs, just thought it was curious.)
Post 13 Sep 2016, 22:20
View user's profile Send private message Visit poster's website Reply with quote
bitRAKE



Joined: 21 Jul 2003
Posts: 2653
Location: dank orb
There are false dependencies due to the use of multiple register sizes - that can be said for both routines though. The optimization is different in kind. I don't think unrolling the loop of the other would even make it as fast.

Table methods are faster, but only once the table is in cache and then we are talking about a memory bound process with the incoming data. There is a corner use case where the data is being generated (through decompression, etc.) where optimization makes sense.

Of course, hardware optimization is another realm - most the research has that focus.
Post 14 Sep 2016, 16:10
View user's profile Send private message Visit poster's website Reply with quote
rugxulo



Joined: 09 Aug 2005
Posts: 2302
Location: Usono (aka, USA)
rugxulo wrote:

Code:
  shr ax, 1
    


Also try "shr ax, 17".

(EDIT: This may be wrong, I didn't test. Just for clarity, to show the intended point, I was thinking of this.)


Okay, clearly it's 33 I meant instead of 17 since all shifts (nowadays) are mod 32. Of course this is mostly ridiculous, but it is claimed to help at least one processor (486), so I honestly don't know (can't know ... right, revolution?) without testing whether it would affect other, newer cpus as well.
Post 14 Sep 2016, 19:34
View user's profile Send private message Visit poster's website Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 15982
Location: Qo'noS
bitRAKE wrote:
Table methods are faster, but only once the table is in cache and then we are talking about a memory bound process with the incoming data. There is a corner use case where the data is being generated (through decompression, etc.) where optimization makes sense.
Since we are discussing performance, and thus we can assume the programmer has already identified that this code is an important bottleneck to solve, then I think we should assume that the CRC table will be in cache. To assume otherwise would suggest that this code is not really all that important and the whole discussion is not needed.
rugxulo wrote:
... it is claimed to help at least one processor (486), so I honestly don't know (can't know ... right, revolution?) without testing whether it would affect other, newer cpus as well.
Yes, absolutely true. Without proper testing any assumption can turn out to be very wrong. However I am willing to make the wild assumption that doing 'shr reg,1' compared to 'shr reg,33' will show no difference in performance for any CPU 486 or higher. The masked barrel shifter should completely ignore all the high order bits with no side effects. So who will now prove this assumption is wrong? Wink
Post 14 Sep 2016, 20:09
View user's profile Send private message Visit poster's website Reply with quote
rugxulo



Joined: 09 Aug 2005
Posts: 2302
Location: Usono (aka, USA)
revolution wrote:
Yes, absolutely true. Without proper testing any assumption can turn out to be very wrong. However I am willing to make the wild assumption that doing 'shr reg,1' compared to 'shr reg,33' will show no difference in performance for any CPU 486 or higher. The masked barrel shifter should completely ignore all the high order bits with no side effects.


Assembly Gems (mentioned above) says this:
Quote:

On the 486, there is an anomaly in the instruction latencies. Note the following:

SHR/SAR/SHL reg,1 (opcodes D0, D1) have latency 3
SHR/SAR/SHL rem,imm8 (opcodes C0, C1) have latency 2


BBL Forth's S32.TXT says this:
Quote:

C:\BBLASM\S32.TXT 32-bit instruction timings for the 80486
...
Shifts and rotates are now relatively slow. Note that 1 bit
shifts are slower than N-bit immediate shifts. WEIRD!


HelpPC says this:
Quote:

SHR - Shift Logical Right

Clocks
Operands 808x 286 386 486

(reg,1) 2 2 3 2
(reg,immed8) - 5+n 3 3


Darek Mihocka says this:
Quote:

Analyzing the results - why the Pentium 4 fails to deliver
...
MISTAKE #6 - Shifts and rotates are slow - It seems Intel has taken yet another step back ... by eliminating the high-speed barrel shifter found in all previous 386, 486, Pentium ... chips. Instead, they created the shift/rotate execution unit, which by design operates at normal clock speed (not double clock speed), but in my testing actually operates even slower. A typical shift operation on the Pentium 4 requires 4 to 6 clock cycles to complete. Compare this with a single clock cycle on any 486, Pentium, or Athlon processor.


revolution wrote:

So who will now prove this assumption is wrong? Wink


Not me, I don't have any good code. As is obvious (to me), anything that doesn't measure more than a few seconds in difference isn't significant. So I don't have any obvious benchmarks (that would have to run millions of times just to hope for a big enough difference). Truly, higher clock speed in the Gigahertz has spoiled us.
Post 14 Sep 2016, 21:51
View user's profile Send private message Visit poster's website Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 15982
Location: Qo'noS
rugxulo wrote:
Truly, higher clock speed in the Gigahertz has spoiled us.
I think that the spoiling is more from the fact that CPUs can outperform any memory currently available. We can "afford" to be lazy with individual clocks here and there because often we spend more time waiting for DRAM to give us some more data to crunch, thus hiding our laziness. It also means that cache management is almost always the most productive way to improve performance for anything non-trivial. But it is situation dependant as always.
Post 14 Sep 2016, 22:02
View user's profile Send private message Visit poster's website Reply with quote
zhak



Joined: 12 Apr 2005
Posts: 489
Location: Belarus
Since those are not table algos and are supposed to be slow by their nature, I applied optimizations for size, not speed
Post 15 Sep 2016, 10:13
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 15982
Location: Qo'noS
zhak wrote:
Since those are not table algos and are supposed to be slow by their nature, I applied optimizations for size, not speed
For a size metric then a simple naive loop-within-a-loop will probably be both the slowest and the smallest. And make sure to use LOOP if possible also.
Post 15 Sep 2016, 10:21
View user's profile Send private message Visit poster's website Reply with quote
zhak



Joined: 12 Apr 2005
Posts: 489
Location: Belarus
*edit*
Below are results as measured by Agner Fog's PMCTest tool.

(crc computed once for a random 64k message)

Code:
STRAIGHT (BITWISE)

Processor 0
     Clock   Core cyc   Instruct       Uops
   6826837    7286942    2621541    2558016
   6585698    7244283    2621533    2556118
   6644231    7308660    2621533    2556000
   6888307    7300361    2621536    2557726
   6650729    7315814    2621533    2556000
   6649959    7314974    2621533    2556000
   6652640    7317914    2621533    2556000
   6641692    7305871    2621533    2556000
   6651110    7316235    2621533    2556000
   7542197    7356542    2621535    2563536
   6772477    7288164    2621534    2556186
   6652046    7317261    2621533    2556000
   6648680    7313805    2621533    2556000
   6649662    7314636    2621533    2556000
   6654796    7320292    2621533    2556000
   7166295    7330060    2621535    2561044
   6654975    7320484    2621533    2556000
   6644049    7308466    2621533    2556000
   6644111    7308526    2621533    2556000
   6647080    7311796    2621533    2556000

--------------------------------------------

OPTIMIZED

Processor 0
     Clock   Core cyc   Instruct       Uops
    526358     565746    1114127    1050415
    506004     556603    1114120    1048587
    505997     556603    1114120    1048587
    508813     559692    1114120    1048705
    506093     556703    1114120    1048587
    506247     556872    1114120    1048587
    506250     556873    1114120    1048587
    506250     556874    1114120    1048587
    506096     556706    1114120    1048587
    506247     556871    1114120    1048587
    582404     564981    1114120    1049299
    506203     556829    1114120    1048587
    506207     556826    1114120    1048587
    506007     556605    1114120    1048587
    505997     556605    1114120    1048587
    506000     556605    1114120    1048587
    506207     556828    1114120    1048587
    506007     556607    1114120    1048587
    506000     556606    1114120    1048587
    506207     556826    1114120    1048587

--------------------------------------------

TABLE

Processor 0
     Clock   Core cyc   Instruct       Uops
    463139     470490     458772     526135
    420240     462264     458765     524308
    420164     462438     458765     524308
    420241     462254     458765     524308
    420207     462219     458765     524308
    420216     462233     458765     524308
    420130     462138     458765     524308
    420281     462300     458765     524308
    420204     462215     458765     524308
    420216     462231     458765     524308
    420121     462124     458765     524308
    420198     462208     458765     524308
    420130     462137     458765     524308
    420274     462301     458765     524308
    420121     462124     458765     524308
    420151     462165     458765     524308
    420167     462177     458765     524308
    420339     462369     458765     524308
    420130     462136     458765     524308
    420198     462208     458765     524308
    


Table algo implementation:
Code:
;
; rcx   message
; rdx   length
; r8    init
; r9    xorout
; stack ptr to table
;
crc16_tbl:
  movzx eax, r8w
  mov r10, [rsp+40]    ; shadow stack should be reserved on call
  @@:
  movzx r11d, al
  shr eax, 8
  xor r11b, byte [rcx]
  xor ax, [r10 + r11 * 2]
  inc rcx
  dec edx
  jnz short @b
  xor eax, r9d
  ret
    

BTW I also tried different variants of instructions, like subsituting DEC ECX/JNZ with LOOP or INC REG with ADD REG, 1; etc. Those don't make significant difference.
Post 15 Sep 2016, 10:28
View user's profile Send private message Reply with quote
zhak



Joined: 12 Apr 2005
Posts: 489
Location: Belarus
Final test. fast pclmulqdq implementation as described in http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/fast-crc-computation-generic-polynomials-pclmulqdq-paper.pdf

The same random 64Kbytes message
Code:
Processor 0
     Clock   Core cyc   Instruct       Uops
     85403      34466      31788      48981
     16972      18671      31781      47151
     15612      17175      31781      47151
     15593      17161      31781      47151
     15603      17165      31781      47151
     15609      17173      31781      47151
     15582      17138      31781      47151
     15403      16949      31781      47151
     15399      16951      31781      47151
     15372      16911      31781      47151
     15381      16925      31781      47151
     15369      16910      31781      47151
     15301      16836      31781      47151
     15295      16830      31781      47151
     15301      16837      31781      47151
     15298      16835      31781      47151
     15646      17212      31781      47151
     15301      16840      31781      47151
     15301      16837      31781      47151
     15298      16835      31781      47151
    

This is really impressive. Although the algo itself is huge. around 1200 bytes
Post 20 Sep 2016, 18:56
View user's profile Send private message Reply with quote
Display posts from previous:
Post new topic Reply to topic

Jump to:  


< Last Thread | Next Thread >
Forum Rules:
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum
You cannot attach files in this forum
You can download files in this forum


Copyright © 1999-2018, Tomasz Grysztar.

Powered by rwasa.