flat assembler
Message board for the users of flat assembler.
Index
> Main > More faster realization of CRC32-algorithm. Goto page 1, 2 Next |
Author |
|
MCD 14 Feb 2007, 00:07
on what CPU should your code run? if it should run only on CPUs that support SSSE3/SSE4 (suplimentary SSE3), than you could do the whole stuff in just ONE instruction!
They have the "CRC32" instruction, but I still don't know its usage. else, I have optimized your code a little: Code: proc CRC32Update uses esi, CRC,Buf,Len mov eax,[CRC] push eax mov eax,dword [eax] mov esi,[Buf] mov ecx,[Len] ;<try> test ecx,ecx jz .end ;<or>;most P6 and AMD CPUs like jecxz ; jecxz .end ;</end try> add esi,ecx neg ecx .loop: mov edx,eax shr eax,24 xor al,[esi+ecx] shl edx,8 mov eax,dword [crc32_table + eax * 4] xor eax,edx inc ecx ;add ecx,1 if Pentium4 jnz .loop .end: pop esi mov [esi],eax ret crc32_table dd $00000000, $04c11db7, $09823b6e, $0d4326d9 dd $130476dc, $17c56b6b, $1a864db2, $1e475005 dd $2608edb8, $22c9f00f, $2f8ad6d6, $2b4bcb61 dd $350c9b64, $31cd86d3, $3c8ea00a, $384fbdbd dd $4c11db70, $48d0c6c7, $4593e01e, $4152fda9 dd $5f15adac, $5bd4b01b, $569796c2, $52568b75 dd $6a1936c8, $6ed82b7f, $639b0da6, $675a1011 dd $791d4014, $7ddc5da3, $709f7b7a, $745e66cd dd $9823b6e0, $9ce2ab57, $91a18d8e, $95609039 dd $8b27c03c, $8fe6dd8b, $82a5fb52, $8664e6e5 dd $be2b5b58, $baea46ef, $b7a96036, $b3687d81 dd $ad2f2d84, $a9ee3033, $a4ad16ea, $a06c0b5d dd $d4326d90, $d0f37027, $ddb056fe, $d9714b49 dd $c7361b4c, $c3f706fb, $ceb42022, $ca753d95 dd $f23a8028, $f6fb9d9f, $fbb8bb46, $ff79a6f1 dd $e13ef6f4, $e5ffeb43, $e8bccd9a, $ec7dd02d dd $34867077, $30476dc0, $3d044b19, $39c556ae dd $278206ab, $23431b1c, $2e003dc5, $2ac12072 dd $128e9dcf, $164f8078, $1b0ca6a1, $1fcdbb16 dd $018aeb13, $054bf6a4, $0808d07d, $0cc9cdca dd $7897ab07, $7c56b6b0, $71159069, $75d48dde dd $6b93dddb, $6f52c06c, $6211e6b5, $66d0fb02 dd $5e9f46bf, $5a5e5b08, $571d7dd1, $53dc6066 dd $4d9b3063, $495a2dd4, $44190b0d, $40d816ba dd $aca5c697, $a864db20, $a527fdf9, $a1e6e04e dd $bfa1b04b, $bb60adfc, $b6238b25, $b2e29692 dd $8aad2b2f, $8e6c3698, $832f1041, $87ee0df6 dd $99a95df3, $9d684044, $902b669d, $94ea7b2a dd $e0b41de7, $e4750050, $e9362689, $edf73b3e dd $f3b06b3b, $f771768c, $fa325055, $fef34de2 dd $c6bcf05f, $c27dede8, $cf3ecb31, $cbffd686 dd $d5b88683, $d1799b34, $dc3abded, $d8fba05a dd $690ce0ee, $6dcdfd59, $608edb80, $644fc637 dd $7a089632, $7ec98b85, $738aad5c, $774bb0eb dd $4f040d56, $4bc510e1, $46863638, $42472b8f dd $5c007b8a, $58c1663d, $558240e4, $51435d53 dd $251d3b9e, $21dc2629, $2c9f00f0, $285e1d47 dd $36194d42, $32d850f5, $3f9b762c, $3b5a6b9b dd $0315d626, $07d4cb91, $0a97ed48, $0e56f0ff dd $1011a0fa, $14d0bd4d, $19939b94, $1d528623 dd $f12f560e, $f5ee4bb9, $f8ad6d60, $fc6c70d7 dd $e22b20d2, $e6ea3d65, $eba91bbc, $ef68060b dd $d727bbb6, $d3e6a601, $dea580d8, $da649d6f dd $c423cd6a, $c0e2d0dd, $cda1f604, $c960ebb3 dd $bd3e8d7e, $b9ff90c9, $b4bcb610, $b07daba7 dd $ae3afba2, $aafbe615, $a7b8c0cc, $a379dd7b dd $9b3660c6, $9ff77d71, $92b45ba8, $9675461f dd $8832161a, $8cf30bad, $81b02d74, $857130c3 dd $5d8a9099, $594b8d2e, $5408abf7, $50c9b640 dd $4e8ee645, $4a4ffbf2, $470cdd2b, $43cdc09c dd $7b827d21, $7f436096, $7200464f, $76c15bf8 dd $68860bfd, $6c47164a, $61043093, $65c52d24 dd $119b4be9, $155a565e, $18197087, $1cd86d30 dd $029f3d35, $065e2082, $0b1d065b, $0fdc1bec dd $3793a651, $3352bbe6, $3e119d3f, $3ad08088 dd $2497d08d, $2056cd3a, $2d15ebe3, $29d4f654 dd $c5a92679, $c1683bce, $cc2b1d17, $c8ea00a0 dd $d6ad50a5, $d26c4d12, $df2f6bcb, $dbee767c dd $e3a1cbc1, $e760d676, $ea23f0af, $eee2ed18 dd $f0a5bd1d, $f464a0aa, $f9278673, $fde69bc4 dd $89b8fd09, $8d79e0be, $803ac667, $84fbdbd0 dd $9abc8bd5, $9e7d9662, $933eb0bb, $97ffad0c dd $afb010b1, $ab710d06, $a6322bdf, $a2f33668 dd $bcb4666d, $b8757bda, $b5365d03, $b1f740b4 endp _________________ MCD - the inevitable return of the Mad Computer Doggy -||__/ .|+-~ .|| || Last edited by MCD on 14 Feb 2007, 00:43; edited 1 time in total |
|||
14 Feb 2007, 00:07 |
|
Garthower 14 Feb 2007, 00:15
And procedure should be carried out on processors with support SSE3 a maximum.
SSE4 are not supported by any processor since still they are not realized in one chip. There is only a white-paper, that they will be At least up to processors Intel Daul-Core. I don't have information, whether so it with Intel Extreme with four cores. |
|||
14 Feb 2007, 00:15 |
|
MCD 14 Feb 2007, 01:02
hmm, maybe unrolling the loop may further fasten it?
|
|||
14 Feb 2007, 01:02 |
|
Garthower 14 Feb 2007, 07:49
Thanks for your posts, but now I understand, that this algorithm on the present to accelerate it will not be possible, it specific and is adhered to 32 bits. But I have thought up analogue. To consider the contorl sum separately on even and not even bytes, and in the end apply XOR of one control sum to another that the end result was 32 bit. Certainly, such algorithm will not be we shall combine with old, but it enables to use SIMD instructions not losing uniqueness of an end result.
|
|||
14 Feb 2007, 07:49 |
|
MCD 14 Feb 2007, 16:21
How just got a theoretical though of how to speed up the CRC32 algorithm, but I'm not sure if this is possible. You can try the following:
1.) calculate 2 CRC32 values in paralle, where 1 calculates the CRC32 of all even and the other calculates the CRC32 of all odd dwords, so that both data sources are interleaved (like the left and right signals from stereo PCM data usually are) 2.) next final step is to combine those 2 CRC32 values the problem about this is that I don't know how to properly combine those CRC32 values neither do I know if this is theoretically possible? I'm not much into hash/checksum/cryptography, so I can't say/show that. |
|||
14 Feb 2007, 16:21 |
|
Goplat 15 Feb 2007, 02:21
CRC has to be done sequentially; there isn't much room for algorithmic optimization. You could process 2 bytes at a time instead of 1 but that would require a 65536 entry table (256kB), which would probably make it slower.
|
|||
15 Feb 2007, 02:21 |
|
f0dder 15 Feb 2007, 02:25
Goplat: are you sure there isn't a way to process it in chunks, and "tie those chunks together"?
|
|||
15 Feb 2007, 02:25 |
|
r22 15 Feb 2007, 05:07
Unrolling :
- 3X unroll will probably give you a ~10 speed improvement on a loop that tight. - For the sake of branch prediction with the unrolls you may want to use the follow structure. Code: mov edx,eax .loop: ... inc ecx jz .end mov edx,eax ... inc ecx jz .end mov edx,eax ... inc ecx jz .end mov edx,eax jmp .loop .end: Addressing the LUT : - Hold crc32_table in another register [reg1 + reg2 * x] runs faster over many iterations than [Address + reg2 * x]. ~2-5% improvement Alignment : - Align [.loop:] to 16 bytes. Use correct padding with 90h and 66h bytes. Here's a poorly made macro that does it for you. Code: macro AMDPad16 { virtual align 16 a = $-$$ end virtual if a=1 db 90h end if if a=2 db 66h,90h end if if a=3 db 66h,66h,90h end if if a=4 db 66h,66h,66h,90h end if if a=5 db 66h,66h,90h,66h,90h end if if a=6 db 66h,66h,90h,66h,66h,90h end if if a=7 db 66h,66h,66h,90h,66h,66h,90h end if if a=8 db 66h,66h,66h,90h,66h,66h,66h,90h end if if a=9 db 66h,66h,90h,66h,66h,90h,66h,66h,90h end if if a=10 db 66h,66h,66h,90h,66h,66h,90h,66h,66h,90h end if if a=11 db 66h,66h,66h,90h,66h,66h,66h,90h,66h,66h,90h end if if a=12 db 66h,66h,66h,90h,66h,66h,66h,90h,66h,66h,66h,90h end if if a=13 db 66h,66h,66h,90h,66h,66h,90h,66h,66h,90h,66h,66h,90h end if if a=14 db 66h,66h,66h,90h,66h,66h,66h,90h,66h,66h,90h,66h,66h,90h end if if a=15 db 66h,66h,66h,90h,66h,66h,66h,90h,66h,66h,66h,90h,66h,66h,90h end if } 2-5% improvement. Bigger LUT : - If you can keep it under a half a MB (~65535 * then it should be fine. Before the algorithm starts just use PREFETCH loop to go through the hash in 64byte stride. Then it'll be in the cache. This would only be acceptable on newer processors which seem to always have at least 1 MB of L2 per core AND only if the buffer length was sufficiently long to make ti worth while to precache the LUT. 30% improvement under the right conditions. Why no parallelism : - The next byte from the buffer gets XORd with the first byte of the current CRC value. - The SSE4 CRC32 opcode will probably run as badly is the string opcodes Hopefully not. |
|||
15 Feb 2007, 05:07 |
|
MCD 15 Feb 2007, 16:22
r22 wrote:
sure, the CRC32 instruction won't be as fast as simple logical or arithmical instructions like XOR, TEXT, ADD, SUB..., but it will probably be faster than the loop versions. |
|||
15 Feb 2007, 16:22 |
|
Goplat 15 Feb 2007, 16:56
Garthower wrote:
This could be done with fewer instructions as Code: mov ebx,eax shr ebx,24 xor bl,[esi] shl eax,8 xor eax,dword [crc32_table + ebx * 4] |
|||
15 Feb 2007, 16:56 |
|
Octavio 15 Feb 2007, 23:24
see Zlib source code
|
|||
15 Feb 2007, 23:24 |
|
DOS386 17 Feb 2007, 01:27
Quote: on what CPU should your code run? if it should run only on CPUs that support SSSE3/SSE4 (suplimentary SSE3), than you could do the whole stuff in just ONE instruction! They have the "CRC32" instruction, but I still don't know its usage. Well, one day SSSSSSSSSSE99 will have instructions for CRC32, MD5, SHA-160, SHA-512, SHA-4096, RSA, Rijndael, Twofish, Deflate, LZMA, PPMD, MPEG, ... ... and maybe also for booting and shutting down Vista ... by now wasting much more time than "only" table-based CRC32 code in ASM No thanks, prefer 80386 compatible code ... Quote: see Zlib source code _________________ Bug Nr.: 12345 Title: Hello World program compiles to 100 KB !!! Status: Closed: NOT a Bug |
|||
17 Feb 2007, 01:27 |
|
f0dder 17 Feb 2007, 01:47
NTOSKRNL_VXE: CRC is pretty simple to implement, which is why it's now being added to silicon. And the reasoning for it is stated in the Intel provided manual. At first, I thought it was rather silly, but consider a software iSCSI stack running on a gigabit LAN...
MPEG is already being offloaded to GPUs, which are more suitable for that task. Quote:
Isn't the choice of assembler irrelevant? x86 assembly is x86 assembly. Sure, AT&T syntax is awful, but it's still readable (and there's a masm version too, anyway). |
|||
17 Feb 2007, 01:47 |
|
SomeoneNew 06 Mar 2008, 11:57
Sorry to revive an "old" thread... but I always wanted to know why 90% of the CRC32 codes I've seen uses a look-up table, is it for speed reasons?.
I have a CRC32 generator that doesn't use a look-up table and runs faster than one that actually does (although it might be a bad implementation, that I do not know). Please explain me Thanks _________________ Im new, sorry if I bothered with any stupid question |
|||
06 Mar 2008, 11:57 |
|
revolution 06 Mar 2008, 12:02
It is for speed, table methods are much faster, usually by a factor of 8 as most algos these days use an 8bitx32bit LUT.
But also 16bitx32bit LUT's are used in some cases where the ultimate speed is required (or desired), bit it needs 256k of memory just for the LUT and is usually slower on a cached CPU system. |
|||
06 Mar 2008, 12:02 |
|
SomeoneNew 06 Mar 2008, 12:18
So perhaps this overhead explains why the LUT one is slower in my case?, although as I said I don't have access to that CRC32 generator so I can't know for sure.
Are there any good crc32 codes I could take a look at that use a LUT? Thanks. |
|||
06 Mar 2008, 12:18 |
|
revolution 06 Mar 2008, 12:30
SomeoneNew wrote: So perhaps this overhead explains why the LUT one is slower in my case?, although as I said I don't have access to that CRC32 generator so I can't know for sure. SomeoneNew wrote: Are there any good crc32 codes I could take a look at that use a LUT? |
|||
06 Mar 2008, 12:30 |
|
SomeoneNew 06 Mar 2008, 13:00
Thanks, I'll take a look!
Was this the fruit of a coding exercise?, I can't see why would someone really need such an extensive crc32 library :p |
|||
06 Mar 2008, 13:00 |
|
bitRAKE 06 Mar 2008, 17:14
author of code unknown
Code: CRC32 proc lData:DWORD, ptrData:DWORD push esi push ecx push edx mov esi, ptrData xor edx, edx or eax, -1 mov ecx, lData CRC32_loop: mov dl, [esi] xor dl, al shr eax, 8 xor eax, [crc32_table + 4*edx] inc esi dec ecx jnz CRC32_loop not eax pop edx pop ecx pop esi ret ; 1K CRC32 Table crc32_table \ dd 000000000h,077073096h,0EE0E612Ch,0990951BAh,0076DC419h,0706AF48Fh,0E963A535h,09E6495A3h dd 00EDB8832h,079DCB8A4h,0E0D5E91Eh,097D2D988h,009B64C2Bh,07EB17CBDh,0E7B82D07h,090BF1D91h dd 01DB71064h,06AB020F2h,0F3B97148h,084BE41DEh,01ADAD47Dh,06DDDE4EBh,0F4D4B551h,083D385C7h dd 0136C9856h,0646BA8C0h,0FD62F97Ah,08A65C9ECh,014015C4Fh,063066CD9h,0FA0F3D63h,08D080DF5h dd 03B6E20C8h,04C69105Eh,0D56041E4h,0A2677172h,03C03E4D1h,04B04D447h,0D20D85FDh,0A50AB56Bh dd 035B5A8FAh,042B2986Ch,0DBBBC9D6h,0ACBCF940h,032D86CE3h,045DF5C75h,0DCD60DCFh,0ABD13D59h dd 026D930ACh,051DE003Ah,0C8D75180h,0BFD06116h,021B4F4B5h,056B3C423h,0CFBA9599h,0B8BDA50Fh dd 02802B89Eh,05F058808h,0C60CD9B2h,0B10BE924h,02F6F7C87h,058684C11h,0C1611DABh,0B6662D3Dh dd 076DC4190h,001DB7106h,098D220BCh,0EFD5102Ah,071B18589h,006B6B51Fh,09FBFE4A5h,0E8B8D433h dd 07807C9A2h,00F00F934h,09609A88Eh,0E10E9818h,07F6A0DBBh,0086D3D2Dh,091646C97h,0E6635C01h dd 06B6B51F4h,01C6C6162h,0856530D8h,0F262004Eh,06C0695EDh,01B01A57Bh,08208F4C1h,0F50FC457h dd 065B0D9C6h,012B7E950h,08BBEB8EAh,0FCB9887Ch,062DD1DDFh,015DA2D49h,08CD37CF3h,0FBD44C65h dd 04DB26158h,03AB551CEh,0A3BC0074h,0D4BB30E2h,04ADFA541h,03DD895D7h,0A4D1C46Dh,0D3D6F4FBh dd 04369E96Ah,0346ED9FCh,0AD678846h,0DA60B8D0h,044042D73h,033031DE5h,0AA0A4C5Fh,0DD0D7CC9h dd 05005713Ch,0270241AAh,0BE0B1010h,0C90C2086h,05768B525h,0206F85B3h,0B966D409h,0CE61E49Fh dd 05EDEF90Eh,029D9C998h,0B0D09822h,0C7D7A8B4h,059B33D17h,02EB40D81h,0B7BD5C3Bh,0C0BA6CADh dd 0EDB88320h,09ABFB3B6h,003B6E20Ch,074B1D29Ah,0EAD54739h,09DD277AFh,004DB2615h,073DC1683h dd 0E3630B12h,094643B84h,00D6D6A3Eh,07A6A5AA8h,0E40ECF0Bh,09309FF9Dh,00A00AE27h,07D079EB1h dd 0F00F9344h,08708A3D2h,01E01F268h,06906C2FEh,0F762575Dh,0806567CBh,0196C3671h,06E6B06E7h dd 0FED41B76h,089D32BE0h,010DA7A5Ah,067DD4ACCh,0F9B9DF6Fh,08EBEEFF9h,017B7BE43h,060B08ED5h dd 0D6D6A3E8h,0A1D1937Eh,038D8C2C4h,04FDFF252h,0D1BB67F1h,0A6BC5767h,03FB506DDh,048B2364Bh dd 0D80D2BDAh,0AF0A1B4Ch,036034AF6h,041047A60h,0DF60EFC3h,0A867DF55h,0316E8EEFh,04669BE79h dd 0CB61B38Ch,0BC66831Ah,0256FD2A0h,05268E236h,0CC0C7795h,0BB0B4703h,0220216B9h,05505262Fh dd 0C5BA3BBEh,0B2BD0B28h,02BB45A92h,05CB36A04h,0C2D7FFA7h,0B5D0CF31h,02CD99E8Bh,05BDEAE1Dh dd 09B64C2B0h,0EC63F226h,0756AA39Ch,0026D930Ah,09C0906A9h,0EB0E363Fh,072076785h,005005713h dd 095BF4A82h,0E2B87A14h,07BB12BAEh,00CB61B38h,092D28E9Bh,0E5D5BE0Dh,07CDCEFB7h,00BDBDF21h dd 086D3D2D4h,0F1D4E242h,068DDB3F8h,01FDA836Eh,081BE16CDh,0F6B9265Bh,06FB077E1h,018B74777h dd 088085AE6h,0FF0F6A70h,066063BCAh,011010B5Ch,08F659EFFh,0F862AE69h,0616BFFD3h,0166CCF45h dd 0A00AE278h,0D70DD2EEh,04E048354h,03903B3C2h,0A7672661h,0D06016F7h,04969474Dh,03E6E77DBh dd 0AED16A4Ah,0D9D65ADCh,040DF0B66h,037D83BF0h,0A9BCAE53h,0DEBB9EC5h,047B2CF7Fh,030B5FFE9h dd 0BDBDF21Ch,0CABAC28Ah,053B39330h,024B4A3A6h,0BAD03605h,0CDD70693h,054DE5729h,023D967BFh dd 0B3667A2Eh,0C4614AB8h,05D681B02h,02A6F2B94h,0B40BBE37h,0C30C8EA1h,05A05DF1Bh,02D02EF8Dh CRC32 endp _________________ ¯\(°_o)/¯ “languages are not safe - uses can be” Bjarne Stroustrup |
|||
06 Mar 2008, 17:14 |
|
Goto page 1, 2 Next < Last Thread | Next Thread > |
Forum Rules:
|
Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.
Website powered by rwasa.