flat assembler
Message board for the users of flat assembler.
Main > Sorting vector elements horizontally
tthsqe 22 Apr 2012, 05:04
What exactly are you asking about here?
Would you like to know if there is a shorter/more efficient way of doing this sorting using AVX?

Edit: It seems that your code implements the sorting network

a─┬──e─┬─i───
b─┼┬─g─┴─j─┬─
c─┴┼─f─┬─k─┴─
d──┴─h─┴─l───

which cannot be improved upon for 4 inputs.
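In scalar terms that network is just five compare-exchanges. A plain-C sketch for reference (the `comp`/`sort4` names are mine, chosen to match the pseudocode later in the thread):

```c
#include <assert.h>

/* One comparator of the network: afterwards *a <= *b. */
static void comp(double *a, double *b) {
    if (*a > *b) { double t = *a; *a = *b; *b = t; }
}

/* The 5-comparator network from the diagram: (a,c) and (b,d)
   run first, then (a,b) and (c,d), then (b,c).  Five comparators
   is the minimum for sorting 4 inputs. */
static void sort4(double *a, double *b, double *c, double *d) {
    comp(a, c);
    comp(b, d);
    comp(a, b);
    comp(c, d);
    comp(b, c);
}
```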
tripledot 22 Apr 2012, 07:38
Aha! Thanks, I knew there must be a name for what I was doing. I never did learn enough maths!
Yes, I'm trying to figure out how to implement the AA-sort algorithm (http://www.dia.eui.upm.es/asignatu/pro_par/articulos/AASort.pdf). The first step is to sort the values within each vector - which is what I'm trying to do here - before performing a comb sort, a transposition, and finally an odd-even merge sort on an array of such vectors. It seems like an awful lot of shuffling just for the first step. Can you suggest any improvements?

Fresh eyes this morning suggest it might be more efficient to load up 4 vectors at a time, transpose them, and perform a vertical sort, before transposing them back for the comb sort.

Edit: Thank you so much for your diagram. Having worked this out for myself, seeing a diagram like that with 'my' variable names on it has finally helped me understand a whole heap of stuff!
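The transpose idea can be sketched in plain C (my illustration, not code from the thread): treat the 4 loaded vectors as rows of a 4x4 block; after transposing, each original vector occupies one column, so lane-wise min/max operations across the four rows sort all four original vectors at once.

```c
#include <assert.h>

/* In-place transpose of a 4x4 block of doubles.  Load four
   vectors as the rows; after this, each original vector sits
   in a column, ready for a "vertical" lane-wise sort. */
static void transpose4x4(double v[4][4]) {
    for (int i = 0; i < 4; i++)
        for (int j = i + 1; j < 4; j++) {
            double t = v[i][j];
            v[i][j] = v[j][i];
            v[j][i] = t;
        }
}
```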
tripledot 22 Apr 2012, 21:06
Well it's certainly a lot faster to sort 4 vectors vertically, then transpose. But then implementing a modified comb sort, transposition, and finally a merge... it's going to take a while to get my head round it.
This sorting business is very new to me. So, switching from doubles to unsigned int64 keys, I've opted for something simpler. It's probably more appropriate, cos my data is mostly sorted anyway. So, who wants to rip up my insertion sort function?!

Code:
struct KeyValue
  key   dq ?
  value dq ?
ends
label KeyValue.both dqword at KeyValue.key

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;
; KeyValue.insertionSort_ui64
;
; inputs:
;   rcx = nPairs
;   rdx = keyValues
; outputs:
;   none
; clobbers:
;   rax, rcx, r8, r9, r10, xmm0, xmm1

align 64
KeyValue.insertionSort_ui64:
        vzeroupper
        xor     rax, rax                ; i = 0;
        mov     r8d, sizeof.KeyValue    ;
        align 16                        ;
        rept 2 { db 0x66, 0x90 }        ;
.top:                                   ; while(true) {
        add     rax, r8                 ;   i ++;
        mov     r9, [rdx+rax]           ;   r9 = keyValues[i].key;
        lea     r10, [rax-0x10]         ;   j = i - 1;
        vmovdqa xmm0, [rdx+rax]         ;   xmm0 = keyValues[i].both;
        align 16                        ;
@@:                                     ;   while(true) {
        cmp     r9, [rdx+r10]           ;     if(keyValues[i].key >= keyValues[j].key)
        jge     @f                      ;       break;
        vmovdqa xmm1, [rdx+r10]         ;     xmm1 = keyValues[j].both;
        vmovdqa [rdx+r10+0x10], xmm1    ;     keyValues[j + 1].both = keyValues[j].both;
        sub     r10, r8                 ;     j --;
        jns     @b                      ;     if(j < 0) break;
        align 8                         ;   }
@@:                                     ;
        vmovdqa [rdx+r10+0x10], xmm0    ;   keyValues[j + 1].both = keyValues[i].both;
        sub     rcx, 0x01               ;   nPairs --;
        jnz     .top                    ;   if(nPairs <= 0) break;
        vzeroupper                      ; }
        ret
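For comparison, here is my transcription of that loop's comments into plain C (a sketch, with names of my own choosing; note the C version takes the element count directly, whereas the asm starts at the second pair and runs rcx iterations):

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

typedef struct { uint64_t key; uint64_t value; } KeyValue;

/* Insertion sort over key/value pairs, following the asm's
   comments: save pair i, shift larger-keyed pairs up one slot,
   then drop pair i into the gap.  Stable, and close to O(n)
   on mostly-sorted data. */
static void insertion_sort_u64(KeyValue *kv, size_t n) {
    for (size_t i = 1; i < n; i++) {
        KeyValue cur = kv[i];
        ptrdiff_t j = (ptrdiff_t)i - 1;
        while (j >= 0 && kv[j].key > cur.key) {
            kv[j + 1] = kv[j];  /* shift up, like the vmovdqa copy */
            j--;
        }
        kv[j + 1] = cur;        /* store the saved pair */
    }
}
```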
tripledot 23 Apr 2012, 20:23
Apologies for double-posting, but this belongs in this thread. I fixed up the horizontal sort to handle 16 key/value pairs at a time.
Say you have 16 keys:

{ 7, 2, e, 1 }
{ 9, 8, c, a }
{ 3, b, 4, 0 }
{ f, 5, d, 6 }

The output is:

{ 3, 7, 9, f }
{ 2, 5, 8, b }
{ 4, c, d, e }
{ 0, 1, 6, a }

Code:
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;
; sort_4x4inRegister
;
; Description:
;   - Performs a vertical sort on key/pointer pairs in ymm registers
;   - Result is transposed to give horizontally-sorted output
;   - Keys are f64
;   - Pointers are ui64
;
; Usage:
;   - Can be used in a hybrid sort to generate input for a merge sort
;   - A more advanced hybrid sort (AA-sort) would implement a comb sort,
;     followed by a transpose operation, and finally a merge sort.
;     For more details, see:
;     http://www.research.ibm.com/trl/people/inouehrs/pdf/PACT2007-SIMDsort.pdf
;
; Inputs:
;   ymm0 = m i e a (keys)
;   ymm1 = n j f b (keys)
;   ymm2 = o k g c (keys)
;   ymm3 = p l h d (keys)
;
;   ymm4 = M I E A (pointers)
;   ymm5 = N J F B (pointers)
;   ymm6 = O K G C (pointers)
;   ymm7 = P L H D (pointers)
;
; Outputs:
;   ymm0 = d c b a (sorted keys)
;   ymm1 = h g f e (sorted keys)
;   ymm2 = l k j i (sorted keys)
;   ymm3 = p o n m (sorted keys)
;
;   ymm4 = D C B A (sorted pointers)
;   ymm5 = H G F E (sorted pointers)
;   ymm6 = L K J I (sorted pointers)
;   ymm7 = P O N M (sorted pointers)
;
; Clobbers:
;   ymm0 - ymm15
;
; Algorithm:
;   a o---o-------o-------o
;         |       |
;   b o---|---o---o---o---o
;         |   |       |
;   c o---o---|---o---o---o
;             |       |
;   d o-------o---o-------o
;
;   comp(a, b) {
;     temp = a;
;     a    = min(a, b);
;     b    = max(temp, b);
;   }
;
;   sort4(a, b, c, d) {
;     comp(a, c);
;     comp(b, d);
;     comp(a, b);
;     comp(c, d);
;     comp(b, c);
;   }

align 32
sort_4x4inRegister:                                                  ; clocks:
        ;
        vcmpgtpd   ymm15, ymm0, ymm2        ; ymm15 := mask(a > c)   ; 1
        vcmpgtpd   ymm14, ymm1, ymm3        ; ymm14 := mask(b > d)   ; 2
        ;
        ; one wasted clock here; everything else is optimally
        ; ordered for sandy bridge                                   ; 3
        ;
        ; comp(a, c);
        vblendvpd  ymm8, ymm0, ymm2, ymm15  ; ymm8 := a              ; 4
        vblendvpd  ymm0, ymm2, ymm0, ymm15  ; ymm0 := c              ; 5
        ; comp(A, C);
        vblendvpd  ymm2, ymm4, ymm6, ymm15  ; ymm2 := A              ; 6
        vblendvpd  ymm4, ymm6, ymm4, ymm15  ; ymm4 := C              ; 7
        ;
        ; comp(b, d);
        vblendvpd  ymm6, ymm1, ymm3, ymm14  ; ymm6 := b              ; 8
        vblendvpd  ymm1, ymm3, ymm1, ymm14  ; ymm1 := d              ; 9
        vcmpgtpd   ymm15, ymm8, ymm6        ; ymm15 := mask(a > b)
        ; comp(B, D);
        vblendvpd  ymm3, ymm5, ymm7, ymm14  ; ymm3 := B              ; 10
        vblendvpd  ymm5, ymm7, ymm5, ymm14  ; ymm5 := D              ; 11
        vcmpgtpd   ymm14, ymm0, ymm1        ; ymm14 := mask(c > d)
        ;
        ; comp(a, b);
        vblendvpd  ymm7, ymm8, ymm6, ymm15  ; ymm7 := a              ; 12
        vblendvpd  ymm8, ymm6, ymm8, ymm15  ; ymm8 := b              ; 13
        ; comp(A, B);
        vblendvpd  ymm6, ymm2, ymm3, ymm15  ; ymm6 := A              ; 14
        vblendvpd  ymm2, ymm3, ymm2, ymm15  ; ymm2 := B              ; 15
        ;
        ; comp(c, d);
        vblendvpd  ymm3, ymm0, ymm1, ymm14  ; ymm3 := c              ; 16
        vblendvpd  ymm0, ymm1, ymm0, ymm14  ; ymm0 := d              ; 17
        vcmpgtpd   ymm15, ymm8, ymm3        ; ymm15 := mask(b > c)
        ; comp(C, D);
        vblendvpd  ymm1, ymm4, ymm5, ymm14  ; ymm1 := C              ; 18
        vblendvpd  ymm4, ymm5, ymm4, ymm14  ; ymm4 := D              ; 19
        ;
        ; comp(b, c);
        vblendvpd  ymm5, ymm8, ymm3, ymm15  ; ymm5 := b              ; 20
        vblendvpd  ymm8, ymm3, ymm8, ymm15  ; ymm8 := c              ; 21
        ; comp(B, C);
        vblendvpd  ymm3, ymm2, ymm1, ymm15  ; ymm3 := B              ; 22
        vblendvpd  ymm2, ymm1, ymm2, ymm15  ; ymm2 := C              ; 23
        ;
        ; ymm7 = m i e a (keys)
        ; ymm5 = n j f b (keys)
        ; ymm8 = o k g c (keys)
        ; ymm0 = p l h d (keys)
        ;
        ; ymm6 = M I E A (pointers)
        ; ymm3 = N J F B (pointers)
        ; ymm2 = O K G C (pointers)
        ; ymm4 = P L H D (pointers)
        ;
        vperm2f128 ymm15, ymm7, ymm8, 0x20  ; ymm15 := g c e a       ; 24
        vperm2f128 ymm14, ymm5, ymm0, 0x20  ; ymm14 := h d f b       ; 25
        vperm2f128 ymm13, ymm7, ymm8, 0x31  ; ymm13 := o k m i       ; 26
        vperm2f128 ymm12, ymm5, ymm0, 0x31  ; ymm12 := p l n j       ; 27
        ;
        vperm2f128 ymm11, ymm6, ymm2, 0x20  ; ymm11 := G C E A       ; 28
        vperm2f128 ymm10, ymm3, ymm4, 0x20  ; ymm10 := H D F B       ; 29
        vperm2f128 ymm9, ymm6, ymm2, 0x31   ; ymm9  := O K M I       ; 30
        vperm2f128 ymm8, ymm3, ymm4, 0x31   ; ymm8  := P L N J       ; 31
        ;
        vunpcklpd  ymm0, ymm15, ymm14       ; ymm0 := d c b a        ; 32
        vunpckhpd  ymm1, ymm15, ymm14       ; ymm1 := h g f e        ; 33
        vunpcklpd  ymm2, ymm13, ymm12       ; ymm2 := l k j i        ; 34
        vunpckhpd  ymm3, ymm13, ymm12       ; ymm3 := p o n m        ; 35
        ;
        vunpcklpd  ymm4, ymm11, ymm10       ; ymm4 := D C B A        ; 36
        vunpckhpd  ymm5, ymm11, ymm10       ; ymm5 := H G F E        ; 37
        vunpcklpd  ymm6, ymm9, ymm8         ; ymm6 := L K J I        ; 38
        vunpckhpd  ymm7, ymm9, ymm8         ; ymm7 := P O N M        ; 39
        ;
        ret                                                          ; 40
                                                                     ; 41 <- not bad, eh?
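Each vcmpgtpd plus pair of vblendvpd above is a lane-wise compare-exchange that moves the keys and their pointers under the same mask. A scalar C sketch of that "vertical" comparator and the network it builds (the `vcomp`/`vsort4` names are mine):

```c
#include <assert.h>

/* One "vertical" comparator: per lane, row a keeps the min and
   row b the max, and the pointer rows pa/pb are swapped under
   the same condition -- the role of vcmpgtpd + two vblendvpd. */
static void vcomp(double a[4], double b[4],
                  double pa[4], double pb[4]) {
    for (int i = 0; i < 4; i++) {
        if (a[i] > b[i]) {
            double t = a[i];  a[i]  = b[i];  b[i]  = t;
            double p = pa[i]; pa[i] = pb[i]; pb[i] = p;
        }
    }
}

/* The sort4 network applied lane-wise across rows a..d, so all
   four columns are sorted at once. */
static void vsort4(double a[], double b[], double c[], double d[],
                   double pa[], double pb[], double pc[], double pd[]) {
    vcomp(a, c, pa, pc);
    vcomp(b, d, pb, pd);
    vcomp(a, b, pa, pb);
    vcomp(c, d, pc, pd);
    vcomp(b, c, pb, pc);
}
```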
Copyright © 1999-2024, Tomasz Grysztar. Also on GitHub, YouTube.
Website powered by rwasa.