flat assembler
Message board for the users of flat assembler.

 Index > Main > Sorting vector elements horizontally
Author
tripledot

Joined: 06 Jan 2009
Posts: 49
tripledot 21 Apr 2012, 21:14
I've been thinking about ways to sort elements within a SIMD vector, with a view to sorting key/pointer pairs (so it would have to be done using masks rather than min/max, so we can sort the pointers as well).

I'm sticking with AVX because I won't be upgrading to AVX2 anytime soon, but since I'm dealing with 64-bit pointers I have to use floating-point instructions in order to use ymm registers. The keys to be sorted are therefore stored as doubles. I think I've come up with something, but it might be terribly long-winded. Anyone mind taking a look?

Say we have a vector of 4 keys (a, b, c and d) in ymm0, with the corresponding pointers (A, B, C, and D) in ymm1:
Code:
```; ymm0 = d c b a
; ymm1 = D C B A

; perform the first round of sorting
vperm2f128  ymm2, ymm0, ymm0, 0x01  ; ymm2 = b a d c
vperm2f128  ymm3, ymm0, ymm1, 0x01  ; ymm3 = B A D C
vcmpgtpd    ymm4, ymm0, ymm2        ; ymm4 = d>b c>a b>d a>c
; 2 spare cycles here
; probably a good time to increment source pointer(s)
vblendvpd   ymm5, ymm0, ymm2, ymm4  ; ymm5 = min(b,d) min(a,c) min(b,d) min(a,c)
vblendvpd   ymm6, ymm2, ymm0, ymm4  ; ymm6 = max(b,d) max(a,c) max(b,d) max(a,c)
vblendvpd   ymm0, ymm1, ymm3, ymm4  ; ymm0 = min(B,D) min(A,C) min(B,D) min(A,C)
vblendvpd   ymm2, ymm3, ymm1, ymm4  ; ymm2 = max(B,D) max(A,C) max(B,D) max(A,C)

; let e = min(a,c)
; let f = max(a,c)
; let g = min(b,d)
; let h = max(b,d)

; axioms:
; e < f
; g < h

; ymm5 = g e g e
; ymm6 = h f h f
; ymm0 = G E G E
; ymm2 = H F H F

; perform the second round of sorting
vblendpd    ymm1, ymm5, ymm6, 1100b ; ymm1 = h f g e
vpermilpd   ymm3, ymm1, 0101b       ; ymm3 = f h e g
vcmpgtpd    ymm4, ymm1, ymm3        ; ymm4 = h>f f>h g>e e>g
vblendpd    ymm0, ymm0, ymm2, 1100b ; ymm0 = H F G E
vpermilpd   ymm2, ymm0, 0101b       ; ymm2 = F H E G
vblendvpd   ymm5, ymm1, ymm3, ymm4  ; ymm5 = min(f,h) min(f,h) min(e,g) min(e,g)
vblendvpd   ymm6, ymm3, ymm1, ymm4  ; ymm6 = max(f,h) max(f,h) max(e,g) max(e,g)
vblendvpd   ymm1, ymm0, ymm2, ymm4  ; ymm1 = min(F,H) min(F,H) min(E,G) min(E,G)
vblendvpd   ymm3, ymm0, ymm2, ymm4  ; ymm3 = max(F,H) max(F,H) max(E,G) max(E,G)

; let i = min(e,g)
; let j = max(e,g)
; let k = min(f,h)
; let l = max(f,h)

; axioms:
; i < j
; k < l
```

Here comes the tricky part...

We now have four numbers, 'i', 'j', 'k', and 'l'. According to our axioms (i < j, k < l), there are 6 possible ways these numbers can be ordered (from left to right, lowest to highest value):
(1) i j k l
(2) i k j l
(3) i k l j
(4) k i j l
(5) k i l j
(6) k l i j

Returning to our axioms for a moment...

'i' can either be 'e' or 'g'. Therefore, if 'i' is equal to 'e', 'j' must equal 'g'. If 'i' is equal to 'g', 'j' must equal 'e'.
'k' can either be 'f' or 'h'. Therefore, if 'k' is equal to 'f', 'l' must equal 'h'. If 'k' is equal to 'h', 'l' must equal 'f'.

e < f
g < h
i < j
k < l

Let's test out a few different possibilities...

Assuming 'i' = 'e', and 'k' = 'f'... 'i' must be less than 'k', and 'j' must be less than 'l', therefore only orders (1) and (2) are possible.
Assuming 'i' = 'e', and 'k' = 'h'... 'i' must be less than 'l', and 'j' must be less than 'k', therefore only order (1) is possible.
Assuming 'i' = 'g', and 'k' = 'f'... 'i' must be less than 'l', and 'j' must be less than 'k', therefore only order (1) is possible.
Assuming 'i' = 'g', and 'k' = 'h'... 'i' must be less than 'k', and 'j' must be less than 'l', therefore only orders (1) and (2) are possible.

We have reduced the number of possible ways our keys can be ordered from 6 to 2:
(1) i j k l
(2) i k j l

Spot the difference! Either j < k, or k < j. We only need to perform one more comparison:
Code:
```; ymm5 = k k i i
; ymm6 = l l j j
; ymm1 = K K I I
; ymm3 = L L J J

; perform the final round of sorting
vperm2f128  ymm0, ymm5, ymm6, 0x21  ; ymm0 = j j k k
vblendpd    ymm2, ymm5, ymm6, 0011b ; ymm2 = k k j j
vcmpgtpd    ymm4, ymm0, ymm2        ; ymm4 = j>k j>k k>j k>j
vperm2f128  ymm7, ymm1, ymm3, 0x21  ; ymm7 = J J K K
; another spare cycle...
vblendvpd   ymm0, ymm2, ymm0, ymm4  ; ymm0 = max(j,k) max(j,k) min(j,k) min(j,k)
vblendpd    ymm2, ymm1, ymm3, 0011b ; ymm2 = K K J J
vblendvpd   ymm2, ymm2, ymm7, ymm4  ; ymm2 = max(J,K) max(J,K) min(J,K) min(J,K)
vblendpd    ymm0, ymm0, ymm5, 0001b ; ymm0 = max(j,k) max(j,k) min(j,k) i
vblendpd    ymm2, ymm2, ymm1, 0001b ; ymm2 = max(J,K) max(J,K) min(J,K) I
vblendpd    ymm0, ymm0, ymm6, 1000b ; ymm0 = l max(j,k) min(j,k) i
; ymm0 = sorted keys
vblendpd    ymm1, ymm2, ymm3, 1000b ; ymm2 = L max(J,K) min(J,K) I
; ymm2 = sorted pointers
```

I've been wrong before. This could be a total waste of time.
21 Apr 2012, 21:14
tthsqe

Joined: 20 May 2009
Posts: 767
tthsqe 22 Apr 2012, 05:04
would you like to know if there is a shorter/more efficient way of doing this sorting using avx?

Edit:
It seems that your code implements the sorting network
a─┬──e─┬─i───
b─┼┬─g─┴─j─┬─
c─┴┼─f─┬─k─┴─
d──┴─h─┴─l───
which cannot be improved upon for 4 inputs.
22 Apr 2012, 05:04
tripledot

Joined: 06 Jan 2009
Posts: 49
tripledot 22 Apr 2012, 07:38
Aha! Thanks, I knew there must be a name for what I was doing. I never did learn enough maths!

Yes, I'm trying to figure out how to implement the AA-sort (http://www.dia.eui.upm.es/asignatu/pro_par/articulos/AASort.pdf) algorithm. The first step is to sort the values within each vector - which is what I'm trying to do here - before performing a comb sort, transposition, and finally a odd-even merge sort on an array of such vectors.

It seems like an awful lot of shuffling just for the first step. Can you suggest any improvements?

Fresh eyes this morning suggest it might be more efficient to load up 4 vectors at a time, transpose them, and perform a vertical sort, before tranposing them back for the comb sort.

Edit:

Thank you so much for your diagram. Having worked this out for myself, seeing a diagram like that with 'my' variable names on it has finally helped me understand a whole heap of stuff!
22 Apr 2012, 07:38
tripledot

Joined: 06 Jan 2009
Posts: 49
tripledot 22 Apr 2012, 21:06
Well it's certainly a lot faster to sort 4 vectors vertically, then transpose. But then implementing a modified comb sort, transposition, and finally a merge... it's going to take a while to get my head round it.

This sorting business is very new to me. So, switching from doubles to unsigned int64 keys, I've opted for something simpler. It's probably more appropriate, cos my data is mostly sorted anyway.

So, who wants to rip up my insertion sort function?!

Code:
```struct KeyValue
key     dq      ?
value   dq      ?
ends

label KeyValue.both     dqword at KeyValue.key

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;
; KeyValue.insertionSort_ui64
;
; inputs:
;       rcx = nPairs
;       rdx = keyValues
; outputs:
;       none
; clobbers:
;       rax, rcx, r8, r9, r10, xmm0, xmm1

align 64
KeyValue.insertionSort_ui64:
vzeroupper
xor             rax, rax                ; i = 0;
mov             r8d, sizeof.KeyValue    ;
;       align 16                                        ;
rept 2 { db 0x66, 0x90 }                        ;
.top:                                           ; while(true) {
add             rax, r8                 ;       i ++;
mov             r9, [rdx+rax]           ;       r9 = keyValues[i].key;
lea             r10, [rax-0x10]         ;       j = i - 1;
vmovdqa         xmm0, [rdx+rax]         ;       xmm0 = keyValues[i].both;
;       align 16                                        ;
@@:                                             ;       while(true) {
cmp             r9, [rdx+r10]           ;               if(keyValues[i].key >= keyValues[j].key)
jge             @f                      ;                       break;
vmovdqa         xmm1, [rdx+r10]         ;               xmm1 = keyValues[j].both;
vmovdqa         [rdx+r10+0x10], xmm1    ;               keyValues[j + 1].both = keyValues[j].both;
sub             r10, r8                 ;               j --;
jns             @b                      ;               if(j < 0) break;
;       align 8                                         ;       }
@@:                                             ;
vmovdqa         [rdx+r10+0x10], xmm0    ;       keyValues[j + 1].both = keyValues[i].both;
sub             rcx, 0x01               ;       nPairs --;
jnz             .top                    ;       if(nPairs <= 0) break;
vzeroupper                              ; }
ret
```
22 Apr 2012, 21:06
tripledot

Joined: 06 Jan 2009
Posts: 49
tripledot 23 Apr 2012, 20:23
Apologies for double-posting, but this belongs in this thread. I fixed up the horizontal sort to handle 16 key/value pairs at a time.

Say you have 16 keys: { 7, 2, e, 1 } { 9, 8, c, a } { 3, b, 4, 0 } { f, 5, d, 6 }

The output is: { 3, 7, 9, f } { 2, 5, 8, b } { 4, c, d, e } { 0, 1, 6, a }

Code:
```;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;
; sort_4x4inRegister
;
; Description:
;       - Performs a vertical sort on key/pointer pairs in ymm registers
;       - Result is transposed to give horizontally-sorted output
;       - Keys are f64
;       - Pointers are ui64
;
; Usage:
;       - Can be used in a hybrid sort to generate input for a merge sort
;       - A more advanced hybrid sort (AA-sort) would implement a comb sort,
;         followed by a transpose operation, and finally a merge sort.
;         For more details, see:
;         http://www.research.ibm.com/trl/people/inouehrs/pdf/PACT2007-SIMDsort.pdf
;
; Inputs:
;       ymm0 = m i e a (keys)
;       ymm1 = n j f b (keys)
;       ymm2 = o k g c (keys)
;       ymm3 = p l h d (keys)
;
;       ymm4 = M I E A (pointers)
;       ymm5 = N J F B (pointers)
;       ymm6 = O K G C (pointers)
;       ymm7 = P L H D (pointers)
;
; Outputs:
;       ymm0 = d c b a (sorted keys)
;       ymm1 = h g f e (sorted keys)
;       ymm2 = l k j i (sorted keys)
;       ymm3 = p o n m (sorted keys)
;
;       ymm4 = D C B A (sorted pointers)
;       ymm5 = H G F E (sorted pointers)
;       ymm6 = L K J I (sorted pointers)
;       ymm7 = P O N M (sorted pointers)
;
; Clobbers:
;       ymm0 - ymm15
;
; Algorithm:
;       a   o---o-------o-------o
;               |       |
;       b   o---|---o---o---o---o
;               |   |       |
;       c   o---o---|---o---o---o
;                   |   |
;       d   o-------o---o-------o
;
;       comp(a, b) {
;               temp = a;
;               a = min(a, b);
;               b = max(temp, b);
;       }
;
;       sort4(a, b, c, d) {
;               comp(a, c);
;               comp(b, d);
;               comp(a, b);
;               comp(c, d);
;               comp(b, c);
;       }

align 32
sort_4x4inRegister:                                                                                             ; clocks:
;
vcmpgtpd        ymm15, ymm0,  ymm2              ; ymm15 := mask(a > c)                                  ; 1
vcmpgtpd        ymm14, ymm1,  ymm3              ; ymm14 := mask(b > d)                                  ; 2
;
; one wasted clock here; everything else is optimally ordered for sandy bridge                          ; 3
;
; comp(a, c);                                                                                           ;
vblendvpd       ymm8,  ymm0,  ymm2,  ymm15      ; ymm8 := a                                             ; 4
vblendvpd       ymm0,  ymm2,  ymm0,  ymm15      ; ymm0 := c                                             ; 5
; comp(A, C);                                                                                           ;
vblendvpd       ymm2,  ymm4,  ymm6,  ymm15      ; ymm2 := A                                             ; 6
vblendvpd       ymm4,  ymm6,  ymm4,  ymm15      ; ymm4 := C                                             ; 7
;
; comp(b, d);                                                                                           ;
vblendvpd       ymm6,  ymm1,  ymm3,  ymm14      ; ymm6 := b                                             ; 8
vblendvpd       ymm1,  ymm3,  ymm1,  ymm14      ; ymm1 := d                                             ; 9
vcmpgtpd       ymm15, ymm8,  ymm6                                      ; ymm15 := mask(a > b)          ;
; comp(B, D);                                                                                           ;
vblendvpd       ymm3,  ymm5,  ymm7,  ymm14      ; ymm3 := B                                             ; 10
vcmpgtpd       ymm14, ymm0,  ymm1                                      ; ymm14 := mask(c > d)          ;
vblendvpd       ymm5,  ymm7,  ymm5,  ymm14      ; ymm5 := D                                             ; 11
;
; comp(a, b);                                                                                           ;
vblendvpd       ymm7,  ymm8,  ymm6,  ymm15      ; ymm7 := a                                             ; 12
vblendvpd       ymm8,  ymm6,  ymm8,  ymm15      ; ymm8 := b                                             ; 13
; comp(A, B);                                                                                           ;
vblendvpd       ymm6,  ymm2,  ymm3,  ymm15      ; ymm6 := A                                             ; 14
vblendvpd       ymm2,  ymm3,  ymm2,  ymm15      ; ymm2 := B                                             ; 15
;
; comp(c, d);                                                                                           ;
vblendvpd       ymm3,  ymm0,  ymm1,  ymm14      ; ymm3 := c                                             ; 16
vblendvpd       ymm0,  ymm1,  ymm0,  ymm14      ; ymm0 := d                                             ; 17
vcmpgtpd       ymm15, ymm8,  ymm3                                      ; ymm15 := mask(b > c)          ;
; comp(C, D);                                                                                           ;
vblendvpd       ymm1,  ymm4,  ymm5,  ymm14      ; ymm1 := C                                             ; 18
vblendvpd       ymm4,  ymm5,  ymm4,  ymm14      ; ymm4 := D                                             ; 19
;
; comp(b, c);                                                                                           ;
vblendvpd       ymm5,  ymm8,  ymm3,  ymm15      ; ymm5 := b                                             ; 20
vblendvpd       ymm8,  ymm3,  ymm8,  ymm15      ; ymm8 := c                                             ; 21
; comp(B, C);                                                                                           ;
vblendvpd       ymm3,  ymm2,  ymm1,  ymm15      ; ymm3 := B                                             ; 22
vblendvpd       ymm2,  ymm1,  ymm2,  ymm15      ; ymm2 := C                                             ; 23
;
; ymm7 = m i e a (keys)                                                                                 ;
; ymm5 = n j f b (keys)                                                                                 ;
; ymm8 = o k g c (keys)                                                                                 ;
; ymm0 = p l h d (keys)                                                                                 ;
;
; ymm6 = M I E A (pointers)                                                                             ;
; ymm3 = N J F B (pointers)                                                                             ;
; ymm2 = O K G C (pointers)                                                                             ;
; ymm4 = P L H D (pointers)                                                                             ;
;
vperm2f128      ymm15, ymm7,  ymm8,  0x20       ; ymm15 := g c e a                                      ; 24
vperm2f128      ymm14, ymm5,  ymm0,  0x20       ; ymm14 := h d f b                                      ; 25
vperm2f128      ymm13, ymm7,  ymm8,  0x31       ; ymm13 := o k m i                                      ; 26
vperm2f128      ymm12, ymm5,  ymm0,  0x31       ; ymm12 := p l n j                                      ; 27
;
vperm2f128      ymm11, ymm6,  ymm2,  0x20       ; ymm11 := G C E A                                      ; 28
vperm2f128      ymm10, ymm3,  ymm4,  0x20       ; ymm10 := H D F B                                      ; 29
vperm2f128      ymm9,  ymm6,  ymm2,  0x31       ; ymm9  := O K M I                                      ; 30
vperm2f128      ymm8,  ymm3,  ymm4,  0x31       ; ymm8  := P L N J                                      ; 31
;
vunpcklpd       ymm0,  ymm15, ymm14             ; ymm0 := d c b a                                       ; 32
vunpckhpd       ymm1,  ymm15, ymm14             ; ymm1 := h g f e                                       ; 33
vunpcklpd       ymm2,  ymm13, ymm12             ; ymm2 := l k j i                                       ; 34
vunpckhpd       ymm3,  ymm13, ymm12             ; ymm3 := p o n m                                       ; 35
;
vunpcklpd       ymm4,  ymm11, ymm10             ; ymm0 := D C B A                                       ; 36
vunpckhpd       ymm5,  ymm11, ymm10             ; ymm1 := H G F E                                       ; 37
vunpcklpd       ymm6,  ymm9,  ymm8              ; ymm2 := L K J I                                       ; 38
vunpckhpd       ymm7,  ymm9,  ymm8              ; ymm3 := P O N M                                       ; 39
;
ret                                                                                                     ; 40
; 41 <- not bad, eh?    ```
23 Apr 2012, 20:23
 Display posts from previous: All Posts1 Day7 Days2 Weeks1 Month3 Months6 Months1 Year Oldest FirstNewest First

 Jump to: Select a forum Official----------------AssemblyPeripheria General----------------MainTutorials and ExamplesDOSWindowsLinuxUnixMenuetOS Specific----------------MacroinstructionsOS ConstructionIDE DevelopmentProjects and IdeasNon-x86 architecturesHigh Level LanguagesProgramming Language DesignCompiler Internals Other----------------FeedbackHeapTest Area

Forum Rules:
 You cannot post new topics in this forumYou cannot reply to topics in this forumYou cannot edit your posts in this forumYou cannot delete your posts in this forumYou cannot vote in polls in this forumYou cannot attach files in this forumYou can download files in this forum