flat assembler
Message board for the users of flat assembler.

Index > Projects and Ideas > [IDEA] SSE-bubble-sort

Author
Thread Post new topic Reply to topic
Madis731



Joined: 25 Sep 2003
Posts: 2139
Location: Estonia
Madis731 01 Sep 2008, 20:49
This time I will make a short introduction and then straight to the code.
I wanted to take on a challenge to port GPR bubble-sort code to SSE-style
code, but got into trouble when I needed to *fully* sort it. current code is
only semi-sorting it:
Code:
; If your CPU is Penryn or newer then go ahead and try the SSE4.1
; Very speedy Very Happy
        movdqa  xmm0,[rdi]
        movdqa  xmm2,xmm0
        pshufd  xmm1,xmm0,01001110b
        pmaxsw  xmm0,xmm1 ;pmaxud on SSE4.1
        pminsw  xmm2,xmm1 ;pminud on SSE4.1
        movhlps xmm0,xmm2
        movdqa  xmm2,xmm0
        pshufd  xmm1,xmm0,10110001b
        pmaxsw  xmm0,xmm1 ;pmaxud on SSE4.1
        pminsw  xmm2,xmm1 ;pminud on SSE4.1
        pshufd  xmm0,xmm0,11011000b
        pshufd  xmm2,xmm2,11011000b
        movhlps xmm0,xmm2
        pshufd  xmm0,xmm0,11011000b
        movdqa  [rdi],xmm0
; traditional bubble needs minimum 3 compares (1,2,3,4)
; but in the worst case 13 cmp+6 xchg (4,3,2,1)

align 16
      array dd 354  ,235  ,7563 ,32434,435  ,45   ,2    ,355
            dd 35   ,535  ,556  ,46   ,38   ,9324 ,74   ,5
      array.len = ($-array)
    


The previous code would iterate through the array by increments of 16
sorting every 4x4B part of it. You could replace the moves with unaligned
ones, but then you'd lose the point.

One thought that I came up with it to cache the next 16-byte space and
shuffle its 4 values with current ones. This way you will always have 8 values
to compare and when you run it through N times its guaranteed to sort.

The most unfortunate fact is that we need dummy entries and remember
them or agree on a MAX_VAL = 2^32-1 so that the dummies would end up
at the same place they started from. This value could not be used then
anymore.

I think this might be helpful in Quicksorting, but I think I need to be a
chess-wizard to implement that Razz

Just to prove I'm not making things up:
http://www.trl.ibm.com/people/inouehrs/pdf/PACT2007-SIMDsort.pdf
http://www.cs.ualberta.ca/~furtak/pub/TR07-02.pdf
There are more...

Network sort is probably my best bet for SIMD approach:
http://www.cs.brandeis.edu/~hugues/sorting_networks.html
Compare Bubble-sort on a 16-wide array (up to 120 exchanges) with
Network sort (10 parallel moves in ideal, but 60 exchanges).

_________________
My updated idol Very Happy http://www.agner.org/optimize/
Post 01 Sep 2008, 20:49
View user's profile Send private message Visit poster's website Yahoo Messenger MSN Messenger Reply with quote
tripledot



Joined: 06 Jan 2009
Posts: 49
tripledot 23 Apr 2012, 16:11
Yo Madis, did you get any further with your network sort idea?
Post 23 Apr 2012, 16:11
View user's profile Send private message Reply with quote
Madis731



Joined: 25 Sep 2003
Posts: 2139
Location: Estonia
Madis731 23 Apr 2012, 18:51
Unfortunately not. I have seen your topic (http://board.flatassembler.net/topic.php?p=143190#143190) where you ended up with network sort, but I didn't find anything valuable to add. I know its not a dead subject, but it needs a lot of investigation. Maybe if we get some wizards on board like MHajduk, r22 etc who would give a second look at the topic. I don't think I have any new ideas regarding network sort. Even taking into account the capabilities of AVX and AVX2.
Post 23 Apr 2012, 18:51
View user's profile Send private message Visit poster's website Yahoo Messenger MSN Messenger Reply with quote
tripledot



Joined: 06 Jan 2009
Posts: 49
tripledot 23 Apr 2012, 19:41
After re-working my 4-input sorting network to handle 16 key/value pairs at once, I've arrived at something pretty optimal. Stage 1 of the AA-sort algorithm complete! Now for the comb-sort!

Code:
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;
; sort_4x4inRegister
;
; Description:
;       - Performs a vertical sort on key/pointer pairs in ymm registers
;       - Result is transposed to give horizontally-sorted output
;       - Keys are f64
;       - Pointers are ui64
;
; Usage:
;       - Can be used in a hybrid sort to generate input for a merge sort
;       - A more advanced hybrid sort (AA-sort) would implement a comb sort,
;         followed by a transpose operation, and finally a merge sort.
;         For more details, see:
;         http://www.research.ibm.com/trl/people/inouehrs/pdf/PACT2007-SIMDsort.pdf
;
; Inputs:
;       ymm0 = m i e a (keys)
;       ymm1 = n j f b (keys)
;       ymm2 = o k g c (keys)
;       ymm3 = p l h d (keys)
;
;       ymm4 = M I E A (pointers)
;       ymm5 = N J F B (pointers)
;       ymm6 = O K G C (pointers)
;       ymm7 = P L H D (pointers)
;
; Outputs:
;       ymm0 = d c b a (sorted keys)
;       ymm1 = h g f e (sorted keys)
;       ymm2 = l k j i (sorted keys)
;       ymm3 = p o n m (sorted keys)
;
;       ymm4 = D C B A (sorted pointers)
;       ymm5 = H G F E (sorted pointers)
;       ymm6 = L K J I (sorted pointers)
;       ymm7 = P O N M (sorted pointers)
;
; Clobbers:
;       ymm0 - ymm15
;
; Algorithm:
;       a   o---o-------o-------o
;               |       |
;       b   o---|---o---o---o---o
;               |   |       |
;       c   o---o---|---o---o---o
;                   |   |
;       d   o-------o---o-------o
;
;       comp(a, b) {
;               temp = a;
;               a = min(a, b);
;               b = max(temp, b);
;       }
;
;       sort4(a, b, c, d) {
;               comp(a, c);
;               comp(b, d);
;               comp(a, b);
;               comp(c, d);
;               comp(b, c);
;       }

align 32
sort_4x4inRegister:                                                                                             ; clocks:
                                                                                                                ;
        vcmpgtpd        ymm15, ymm0,  ymm2              ; ymm15 := mask(a > c)                                  ; 1
        vcmpgtpd        ymm14, ymm1,  ymm3              ; ymm14 := mask(b > d)                                  ; 2
                                                                                                                ;
        ; one wasted clock here; everything else is optimally ordered for sandy bridge                          ; 3
                                                                                                                ;
        ; comp(a, c);                                                                                           ;
        vblendvpd       ymm8,  ymm0,  ymm2,  ymm15      ; ymm8 := a                                             ; 4
        vblendvpd       ymm0,  ymm2,  ymm0,  ymm15      ; ymm0 := c                                             ; 5
        ; comp(A, C);                                                                                           ;
        vblendvpd       ymm2,  ymm4,  ymm6,  ymm15      ; ymm2 := A                                             ; 6
        vblendvpd       ymm4,  ymm6,  ymm4,  ymm15      ; ymm4 := C                                             ; 7
                                                                                                                ;
        ; comp(b, d);                                                                                           ;
        vblendvpd       ymm6,  ymm1,  ymm3,  ymm14      ; ymm6 := b                                             ; 8
        vblendvpd       ymm1,  ymm3,  ymm1,  ymm14      ; ymm1 := d                                             ; 9
         vcmpgtpd       ymm15, ymm8,  ymm6                                      ; ymm15 := mask(a > b)          ;
        ; comp(B, D);                                                                                           ;
        vblendvpd       ymm3,  ymm5,  ymm7,  ymm14      ; ymm3 := B                                             ; 10
         vcmpgtpd       ymm14, ymm0,  ymm1                                      ; ymm14 := mask(c > d)          ;
        vblendvpd       ymm5,  ymm7,  ymm5,  ymm14      ; ymm5 := D                                             ; 11
                                                                                                                ;
        ; comp(a, b);                                                                                           ;
        vblendvpd       ymm7,  ymm8,  ymm6,  ymm15      ; ymm7 := a                                             ; 12
        vblendvpd       ymm8,  ymm6,  ymm8,  ymm15      ; ymm8 := b                                             ; 13
        ; comp(A, B);                                                                                           ;
        vblendvpd       ymm6,  ymm2,  ymm3,  ymm15      ; ymm6 := A                                             ; 14
        vblendvpd       ymm2,  ymm3,  ymm2,  ymm15      ; ymm2 := B                                             ; 15
                                                                                                                ;
        ; comp(c, d);                                                                                           ;
        vblendvpd       ymm3,  ymm0,  ymm1,  ymm14      ; ymm3 := c                                             ; 16
        vblendvpd       ymm0,  ymm1,  ymm0,  ymm14      ; ymm0 := d                                             ; 17
         vcmpgtpd       ymm15, ymm8,  ymm3                                      ; ymm15 := mask(b > c)          ;
        ; comp(C, D);                                                                                           ;
        vblendvpd       ymm1,  ymm4,  ymm5,  ymm14      ; ymm1 := C                                             ; 18
        vblendvpd       ymm4,  ymm5,  ymm4,  ymm14      ; ymm4 := D                                             ; 19
                                                                                                                ;
        ; comp(b, c);                                                                                           ;
        vblendvpd       ymm5,  ymm8,  ymm3,  ymm15      ; ymm5 := b                                             ; 20
        vblendvpd       ymm8,  ymm3,  ymm8,  ymm15      ; ymm8 := c                                             ; 21
        ; comp(B, C);                                                                                           ;
        vblendvpd       ymm3,  ymm2,  ymm1,  ymm15      ; ymm3 := B                                             ; 22
        vblendvpd       ymm2,  ymm1,  ymm2,  ymm15      ; ymm2 := C                                             ; 23
                                                                                                                ;
        ; ymm7 = m i e a (keys)                                                                                 ;
        ; ymm5 = n j f b (keys)                                                                                 ;
        ; ymm8 = o k g c (keys)                                                                                 ;
        ; ymm0 = p l h d (keys)                                                                                 ;
                                                                                                                ;
        ; ymm6 = M I E A (pointers)                                                                             ;
        ; ymm3 = N J F B (pointers)                                                                             ;
        ; ymm2 = O K G C (pointers)                                                                             ;
        ; ymm4 = P L H D (pointers)                                                                             ;
                                                                                                                ;
        vperm2f128      ymm15, ymm7,  ymm8,  0x20       ; ymm15 := g c e a                                      ; 24
        vperm2f128      ymm14, ymm5,  ymm0,  0x20       ; ymm14 := h d f b                                      ; 25
        vperm2f128      ymm13, ymm7,  ymm8,  0x31       ; ymm13 := o k m i                                      ; 26
        vperm2f128      ymm12, ymm5,  ymm0,  0x31       ; ymm12 := p l n j                                      ; 27
                                                                                                                ;
        vperm2f128      ymm11, ymm6,  ymm2,  0x20       ; ymm11 := G C E A                                      ; 28
        vperm2f128      ymm10, ymm3,  ymm4,  0x20       ; ymm10 := H D F B                                      ; 29
        vperm2f128      ymm9,  ymm6,  ymm2,  0x31       ; ymm9  := O K M I                                      ; 30
        vperm2f128      ymm8,  ymm3,  ymm4,  0x31       ; ymm8  := P L N J                                      ; 31
                                                                                                                ;
        vunpcklpd       ymm0,  ymm15, ymm14             ; ymm0 := d c b a                                       ; 32
        vunpckhpd       ymm1,  ymm15, ymm14             ; ymm1 := h g f e                                       ; 33
        vunpcklpd       ymm2,  ymm13, ymm12             ; ymm2 := l k j i                                       ; 34
        vunpckhpd       ymm3,  ymm13, ymm12             ; ymm3 := p o n m                                       ; 35
                                                                                                                ;
        vunpcklpd       ymm4,  ymm11, ymm10             ; ymm0 := D C B A                                       ; 36
        vunpckhpd       ymm5,  ymm11, ymm10             ; ymm1 := H G F E                                       ; 37
        vunpcklpd       ymm6,  ymm9,  ymm8              ; ymm2 := L K J I                                       ; 38
        vunpckhpd       ymm7,  ymm9,  ymm8              ; ymm3 := P O N M                                       ; 39
                                                                                                                ;
        ret                                                                                                     ; 40
                                                                                                                ; 41 <- not bad, eh?    
Post 23 Apr 2012, 19:41
View user's profile Send private message Reply with quote
Display posts from previous:
Post new topic Reply to topic

Jump to:  


< Last Thread | Next Thread >
Forum Rules:
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum
You cannot attach files in this forum
You can download files in this forum


Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.

Website powered by rwasa.