flat assembler
Message board for the users of flat assembler.

Index > Heap > GPUSort better than CPUSort?

Goto page 1, 2  Next
Author
Thread Post new topic Reply to topic
Raedwulf



Joined: 13 Jul 2005
Posts: 375
Location: United Kingdom
Raedwulf
http://gamma.cs.unc.edu/GPUSORT/results.html

Hmmmm, can someone explain? I can only guess its because GPUs are less complex than CPUs and hence dedicated number crunching is better.

_________________
Raedwulf
Post 08 Feb 2008, 14:00
View user's profile Send private message MSN Messenger Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 17248
Location: In your JS exploiting you and your system
revolution
Maybe because the GPU is not doing anything else. It has isolated memory, no OS to take care of, single tasking (but perhaps multi-threaded) and deliberately designed for specific (not general purpose) tasks.

So in summary I am not surprised it can be faster. It was designed to be faster for certain things and this result would seem to show that it is faster when doing a simple job of sorting.
Post 08 Feb 2008, 14:29
View user's profile Send private message Visit poster's website Reply with quote
bitRAKE



Joined: 21 Jul 2003
Posts: 2888
Location: [RSP+8*5]
bitRAKE
Any time an algorithm can be closely matched to hardware there will be a significant jump in speed - almost as if the hardware was specifically designed for the task. Might be able to get better performance on the CPU though - quicksort is okay, but not optimal. It's great that GPUs can be useful outside of eye candy. Better AI and physics simulation is more important than sorting, or maybe voice/image processing.
Post 08 Feb 2008, 16:35
View user's profile Send private message Visit poster's website Reply with quote
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4633
Location: Argentina
LocoDelAssembly
I tested it on my GeForce 6660 but worked terribly sloooooooow, the CPU took always less than a second while the GPU took 44 seconds the last test I allowed to run because I got tired and killed the app.

But of course, my GPU is old and the memory is DDR1, so even perhaps the shaders was executed by the driver rather than the hardware (or software emulation doesn't exists?).
Post 08 Feb 2008, 17:39
View user's profile Send private message Reply with quote
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4633
Location: Argentina
LocoDelAssembly
BTW, I've tried the example3.exe again on a Vista with a GeForce 8600 GTS and again more or less the same timmings Confused (and couldn't recongize the memory amount so it defaulted to 128 MB).

I think that the shaders are run by software by some reason or I'm really idiot (more) today.

Anyone else tested it and had better results?
Post 08 Feb 2008, 22:05
View user's profile Send private message Reply with quote
f0dder



Joined: 19 Feb 2004
Posts: 3170
Location: Denmark
f0dder
Keep in mind that the time the GPU spends executing the shaders is one thing - you also have to transfer your data in and out of GPU memory.

Yes, GPU programming can speed some tasks up considerately, but I don't think simple number sorting is a very good indicative task... and imho number sorting is pretty boring anyway Smile

But hm., they say their numbers include data up/download to the card... can't run the test here (GeForce7600GT/256) though, example3.exe crashes. Probably cause I'm on XP64... it bitches about not being able to mess with nvcpl.dll
Post 09 Feb 2008, 00:35
View user's profile Send private message Visit poster's website Reply with quote
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4633
Location: Argentina
LocoDelAssembly
Quote:

Keep in mind that the time the GPU spends executing the shaders is one thing - you also have to transfer your data in and out of GPU memory.

Yes, but 44 seconds for the GPU Vs 0.05 seconds for the CPU? I think that the timmings are too far to consider this a memory transfer issue.

I'll try on Ubuntu later to see if I get better results but I can do that on my 6600 slower model ever only...
Post 09 Feb 2008, 01:01
View user's profile Send private message Reply with quote
f0dder



Joined: 19 Feb 2004
Posts: 3170
Location: Denmark
f0dder
Loco: sounds like things aren't executed natively on the GPU, but that the shaders are instead emulated by OpenGL? :-s
Post 09 Feb 2008, 08:57
View user's profile Send private message Visit poster's website Reply with quote
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4633
Location: Argentina
LocoDelAssembly
Well I've tried on Ubuntu but the sources don't compile (yes I instaled GLUT3 but still there are errors due to duplicated symbols and I don't know how much things I have no time to fix).

For the record here the timmings I was talking about:
Code:
131068 0.145451 0.145451
131069 0.661550 0.661550
131070 0.177526 0.177526
131071 0.941038 0.941038
 GPUSort Time: 44.257 sec Qsort() Time: 0.054 sec  DIFFERENCE: -44.203 sec
Input Sequence Size: 262144
ERROR - Bind()
(null) progid: 5    
Post 09 Feb 2008, 15:34
View user's profile Send private message Reply with quote
edfed



Joined: 20 Feb 2006
Posts: 4237
Location: 2018
edfed
today, GPUs have many ALU, or math processing units. in fact, they are DSPs (digital signal processor), BTW they are dedicated to number manipulations, they do only that.
a gpu don't know the jmp or call instruction, it only know mac, imul, idiv, sqrt, log, etc...
they are math processors, like the x87 copross, but with more than one unique calculus units.

about the GPUsort and CPUsort, i don't see what it is.
is it an instruction?
Post 09 Feb 2008, 16:19
View user's profile Send private message Visit poster's website Reply with quote
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4633
Location: Argentina
LocoDelAssembly
No, it is a sorting algorithm, one written in shader assembly for the GPU and the other in plain C/C++ for the CPU.

Someone has some skills with shaders? Perhaps someone could rewrite this and using HLL shaders since newer versions of OpenGL doesn't document assembly much and perhaps it is somewhat unsupported now (note that this GPUsort is from 2005 and the 2.0 version from 2006 but I think they changed very few things in the latter).
Post 09 Feb 2008, 17:21
View user's profile Send private message Reply with quote
MCD



Joined: 21 Aug 2004
Posts: 604
Location: Germany
MCD
For some reason, I got the feeling that the sorting algorithm they used in the GPU is not quicksort, because recursions(usually quicksort is recursive) are bad, especially on DSP-styled vector processors like the GPU is.

Depending on the cache and memory interface of those GPUs and the actual memory layout of the structures to sort, Heapsort or Mergesort are probably more efficient.

The problem with Heapsort is that it pretty much violates the concepts of spacial and temporal memory locality, thus it pretty easily busts naive caching algorithms like those implemented in lots of hardware.

It will require special * prefetching, and I'm not sure if modern GPUs are capable of this.
Also this special prefetching is not required if the time to access the memory is fast enough.

So basically, I would suggest using Mergesort, if only it were in place, which may become an issue if you are planning to sort a really big amount of numbers.

*: often manual, e.g. software implemted prefetching of numbers at position 4i, 4i+1, 4i+2 and 4i+3 must be done when you are currently at position i in some array and comparing/exchanging it with numbers at position 2i and 2i+1.
So the above described basically is an "on the fly 1-step-ahead" prefetching, which can be further and further extended to prefetch more and more steps in advance, but this will get more and more inefficient since you will need to prefetch more and more memory, but only use less and less of it (remember, in the heap sort, after each step only 1 "branch" is followed, which is only 1/2 in regular heaps). So this is why "block-prefetching" ** with heap sort won't work.
Damit, I should have drawn some graphs to explain it, but I have no more time left right now Sad

**: this is a simple (already posted here on the fasm board) prefetching technique for algorithms that linearly access some arrays.
Post 13 Feb 2008, 20:24
View user's profile Send private message Reply with quote
f0dder



Joined: 19 Feb 2004
Posts: 3170
Location: Denmark
f0dder
Keep in mind that quicksort can be implemented non-recursively, though - see for example Microsoft's implementation in the CRT source from Visual Studio... teaser:
Code:
/* Note: the theoretical number of stack entries required is
   no more than 1 + log2(num).  But we switch to insertion
   sort for CUTOFF elements or less, so we really only need
   1 + log2(num) - log2(CUTOFF) stack entries.  For a CUTOFF
   of 8, that means we need no more than 30 stack entries for
   32 bit platforms, and 62 for 64-bit platforms. */
    
Post 14 Feb 2008, 00:12
View user's profile Send private message Visit poster's website Reply with quote
Borsuc



Joined: 29 Dec 2005
Posts: 2466
Location: Bucharest, Romania
Borsuc
use Radix Sort...

takes linear time compared to loglinear Wink
Post 14 Feb 2008, 15:36
View user's profile Send private message Reply with quote
f0dder



Joined: 19 Feb 2004
Posts: 3170
Location: Denmark
f0dder
Radix sort is really great when it can be used Smile
Post 15 Feb 2008, 01:30
View user's profile Send private message Visit poster's website Reply with quote
Raedwulf



Joined: 13 Jul 2005
Posts: 375
Location: United Kingdom
Raedwulf
The_Grey_Beast wrote:
use Radix Sort...

takes linear time compared to loglinear Wink


but it takes linear space too!

_________________
Raedwulf
Post 15 Feb 2008, 12:56
View user's profile Send private message MSN Messenger Reply with quote
Borsuc



Joined: 29 Dec 2005
Posts: 2466
Location: Bucharest, Romania
Borsuc
well I thought quicksort required linear space too, but I looked into it and the best one I could find was a logarithmic space, so it's a bit better than radix sort in that perspective. Do you know of any other better quicksort algo (I think one needs only constant space)?
Post 16 Feb 2008, 22:55
View user's profile Send private message Reply with quote
f0dder



Joined: 19 Feb 2004
Posts: 3170
Location: Denmark
f0dder
Quicksort doesn't need additional space, it sorts in-place - that's one of it's advantages.

Often you need a "stable sort" though, one that doesn't change the order of items with equal sort-by values... that might sound a bit cryptic, but imagine you have an array with {Firstname,Lastname} records. If you use a stable sort and sort by Firstname first, then sort by Lastname, things will be "what you expect". If you use an unstable sort, items will be sorted by Lastname, but within the Lastname groups, the Firstnames can appear in any order.

And before you say "sort by both fields at once", consider situations like sort-on-click listview columns Smile
Post 16 Feb 2008, 22:59
View user's profile Send private message Visit poster's website Reply with quote
Borsuc



Joined: 29 Dec 2005
Posts: 2466
Location: Bucharest, Romania
Borsuc
the article I was referring to kept talking about 'stack' issues (being the space I meant), probably it has been implemented badly I presume.
Post 16 Feb 2008, 23:06
View user's profile Send private message Reply with quote
f0dder



Joined: 19 Feb 2004
Posts: 3170
Location: Denmark
f0dder
The_Grey_Beast wrote:
the article I was referring to kept talking about 'stack' issues (being the space I meant), probably it has been implemented badly I presume.


qsort is a recursive algorithm, so it can end up using "some stack space". As I mentioned above, though, it can be implemented iteratively (using an array for "stack", but obviously only including the things that really *needs* to be there, not all local variables).

_________________
Image - carpe noctem
Post 16 Feb 2008, 23:12
View user's profile Send private message Visit poster's website Reply with quote
Display posts from previous:
Post new topic Reply to topic

Jump to:  
Goto page 1, 2  Next

< Last Thread | Next Thread >
Forum Rules:
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum
You can attach files in this forum
You can download files in this forum


Copyright © 1999-2020, Tomasz Grysztar.

Powered by rwasa.