flat assembler
Message board for the users of flat assembler.
OpenGL - Inverse alpha?
adnimo 08 Aug 2008, 20:20
I thought about getting rid of the second allocation, but it turns out I can't: the data I load is 24bit, and if I allocate a 32bit buffer and put the 24bit data in there, I'll just have a gap at the end of the buffer, because the data is not spread out to 32 bits per pixel. So my guess is that I can't get rid of the second alloc.
But I could optimize everything else; if OGL allows the use of inverse alpha, that'll do for starters. Perhaps I should use a struct for the color key and then do a CMP on 32 bits. The last byte will be empty, but that doesn't matter, right?
Madis731 08 Aug 2008, 22:27
Well, I won't make a full-blown comment right now (I have to sleep etc.), but here are a few starters.
1) The 4-byte alignment is a very good idea - then your colours won't be RGBRGBRGBRGB... like now, but RGB_RGB_RGB_RGB_... ("_" meaning empty). This way you can compare 4 bytes really fast.
2) Have you checked MMX/SSE routines yet? There are excellent instructions to compare 2 to 4 pixels in one blow (8 bytes in MMX, 16 bytes in SSE).
3) The color-keying answer lies in MMX/SSE as well. The compare in these instruction sets will make you a nice mask that you can use conditionally or unconditionally.
Ok, good night (maybe others aren't sleeping)
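As a rough illustration of points 2) and 3), here is a minimal C sketch of the mask idea using SSE2 intrinsics. It assumes the pixels are already padded to 4 bytes each; `key_mask` and its parameters are hypothetical names, not anything from this thread. When the compiler is not targeting SSE2, the scalar loop at the end does all the work.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper: mask[i] becomes 0xFFFFFFFF when the low 24 bits of
   pixels[i] equal the key colour, and 0 otherwise. The pad byte is ignored. */
void key_mask(const uint32_t *pixels, uint32_t *mask, size_t count, uint32_t key)
{
    size_t i = 0;
    assert(pixels != NULL && mask != NULL);
    key &= 0x00FFFFFFu;
#if defined(__SSE2__)
    const __m128i rgb  = _mm_set1_epi32(0x00FFFFFF);
    const __m128i keyv = _mm_set1_epi32((int)key);
    /* Four pixels (16 bytes) per iteration. */
    for (; i + 4 <= count; i += 4) {
        __m128i px = _mm_loadu_si128((const __m128i *)(pixels + i));
        px = _mm_and_si128(px, rgb);               /* drop the pad byte   */
        _mm_storeu_si128((__m128i *)(mask + i),
                         _mm_cmpeq_epi32(px, keyv)); /* all-ones on match */
    }
#endif
    /* Scalar tail (and full fallback without SSE2). */
    for (; i < count; ++i)
        mask[i] = ((pixels[i] & 0x00FFFFFFu) == key) ? 0xFFFFFFFFu : 0u;
}
```

The resulting mask can then be ANDed with an alpha constant and ORed into the pixels, so no per-pixel branch is needed.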
adnimo 12 Aug 2008, 05:12
Hi, thanks for the suggestions.
There is one problem, however: the data itself is stored as BGRBGRBGR... and thus isn't 4-byte aligned. If you take a look, Microsoft uses RGB "quads" with an unused byte just for alignment, and that's cool, but my data is 24bit, so if I convert it to 32bit I'm afraid any speed increase will be lost in the conversion itself.
I could make my own image format or just save with an empty alpha channel, but that defeats the purpose of being able to use any existing graphics application, and it also adds yet another limitation / rule for the artist.
If I had BGR_BGR_BGR_BGR... I could obviously not only compare very fast but also get rid of the extra malloc, by using this memory buffer and simply writing my alpha value into the fourth byte. Again, I cannot do that, so what can I do?
Thanks - sleep tight by the way!
Madis731 12 Aug 2008, 07:48
Getting rid of 24-bit is your best bet. Unpacking can be done really fast. The first thing I can come up with is:
Code:
    mov ecx,DataLength      ; assumed here: size of the unpacked (32-bit) data in bytes
    mov esi,pDataSource     ; packed 24-bit source
    mov edi,pDataDest       ; 32-bit destination
@@:
    mov eax,[esi]           ; reads 4 bytes: one pixel plus the first byte of the next
    mov [edi],eax           ; (the last iteration reads 1 byte past the source)
    add esi,3               ; Here's the trick
    add edi,4               ; You don't *have* to clear the fourth byte
    sub ecx,4
    jne @b

If the unpacking takes time N and the general algorithm on the data takes time N, you can benefit from aligned data by up to 400% (maybe more), so the algorithm then takes only N/4. Compare that with your original idea of working on this data while copying it to another location: that is N (for copying) + N (for the algorithm) = 2N. My idea is ~1.1N (some more time for copying while unpacking) + N/4 (for the optimized algorithm) = ~1.35N. Q.E.D.
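For readers who want to check the loop's behaviour outside an assembler, the following is a small C model of the same unpacking step. It is only a sketch: `unpack24to32` is a hypothetical name, and unlike the assembly it copies just 3 bytes for the final pixel so that it never reads past the end of the source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical C model of the unpack loop: every 3-byte pixel is moved into
   a 4-byte slot. Like the assembly, it copies 4 bytes at a time, so the
   fourth byte of each slot holds the first byte of the next pixel (junk that
   a later step masks off). Only the very last slot gets an explicit 0. */
void unpack24to32(const uint8_t *src, uint8_t *dst, size_t pixels)
{
    assert(src != NULL && dst != NULL);
    if (pixels == 0)
        return;
    for (size_t i = 0; i + 1 < pixels; ++i)
        memcpy(dst + 4 * i, src + 3 * i, 4);   /* "add esi,3 / add edi,4" */
    /* Final pixel: copy exactly 3 bytes to stay inside the source buffer. */
    memcpy(dst + 4 * (pixels - 1), src + 3 * (pixels - 1), 3);
    dst[4 * pixels - 1] = 0;
}
```

The destination must be a separate buffer of `4 * pixels` bytes; source and destination may not overlap in this form.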
f0dder 12 Aug 2008, 10:51
Let your artists save in whatever format they want, and have a build step that converts the image data to a 32bit ARGB or RGBA format, as needed. You can of course keep 24bit loading code so you can test changes quickly, but imho if you want speed, do the conversion at build-time instead of at load-time.
And the trick is of course to automate so you don't have to do silly manual labour. Invest some time in learning a build system, SCons isn't half bad.
adnimo 18 Aug 2008, 08:07
Well, yes, that's what I suggest.
But meanwhile, what I'd like to do is speed up the 24bit code. I'll try to run a test with Madis731's example and see if it helps out.
f0dder, I automate through batch files on my Windows box; that's all I need. However, I'd like to speed up the 24bit code anyway. I would like to use 32bit all the time, but it also means increased disk usage, unless we use some sort of compression scheme. I have BLZ for compression, but I'm still trying to figure out the best balance between file size and load time... Keep in mind that I'm going for the minimalistic approach here, so while there are better compression algorithms out there, most implementations are huge and others are simply too hard for me to understand...
One thing I'm certain of is the DDS format: it's a nice solution but it's lossy (AFAIK). I will implement it anyway, since I really like it and the speed increase is very noticeable, not only at load time but also during execution.
Regarding the unpacking, is there any faster method, perhaps through extensions? Would it be a better idea to have what Madis suggested as a fall-back behind a CPUID check?
Thanks.
PS: I still don't understand the point of "build systems"... I have a bat file, which I double click, and everything that has to be compiled is compiled. I have another one that creates a zip with all the required files for that test / suite. Am I missing something?
r22 18 Aug 2008, 12:32
Do NOT use 32bit image files if you only need 24bit.
The 33% size increase will hurt your performance when reading in the textures from disk. Disk reads are the worst bottleneck, to such an extent that compressing your images on disk and then decompressing them in memory MIGHT actually improve loading performance further. Along the same lines, keeping your graphics in larger files like 2D sprite images can improve loading times, because you will have 1 file being read sequentially from the hard drive instead of many smaller files being randomly accessed.
I'll back Madis731's method.
adnimo 18 Aug 2008, 16:16
You're talking about production; I'm talking about development. In the end an image atlas is used, but throughout development separate files are used, for obvious reasons.
The bottleneck right now is in the 24bit to 32bit mask routine. One of the reasons is that the data in memory is not 32bit aligned due to its 24bit origin. Perhaps unpacking would be the way to go, but wouldn't I be processing the same data twice and thus not gaining anything in return?
Source is 24bit, destination is ALWAYS 32bit, because we're talking about masked textures, i.e. they have an 8bit alpha channel that is generated from the 24bit data (color keying).
We went from the alpha issue down the road of image atlases, build systems, compression and whatnot. Let me please be clear: I'm just worried about the color keying part being slow due to the nature of 24bit data. Everything else isn't really on topic.
r22 18 Aug 2008, 17:10
I clearly wrote
Quote: I'll back Madis731's method.

To clarify further:
- Load your 24bit texture data
- Unpack it to 32bit data
- Run an optimized color key algorithm on the 32bit data

Unpacking
Code:
    mov ecx,DataLength
    mov esi,pDataSource
    mov edi,pDataDest
@@:
    mov eax,[esi]
    mov [edi],eax
    add esi,3               ; Here's the trick
    add edi,4               ; You don't *have* to clear the fourth byte
    sub ecx,4
    jne @b

Keying - this is only slightly optimized, but now that the above code lets you align your data correctly we can improve the following pretty easily
Code:
    mov ecx,UnpackedDataLength  ; in bytes; assumed to be a nonzero multiple of 8
    mov esi,pUnpackedData
    xor ebp,ebp                 ; constant 0 for cmov, which takes no immediate operand
                                ; (ebp is clobbered; save it if it is your frame pointer)
@@:
    mov eax,[esi + ecx - 8]
    mov ebx,[esi + ecx - 4]     ; do 2 at a time
    and eax,00FFFFFFh
    and ebx,00FFFFFFh           ; mask off that last junk byte
    mov edx,0FF000000h
    mov edi,0FF000000h          ; set our branchless masking variables
    cmp eax,00FF00FFh           ; test color key
    cmovne edx,ebp              ; alpha = 0 unless the pixel equals the key
                                ; (use cmove instead for the opposite polarity)
    cmp ebx,00FF00FFh           ; test color key
    cmovne edi,ebp              ; change variable IF necessary
    or eax,edx
    or ebx,edi                  ; opaque or transparent alpha
    mov [esi + ecx - 8],eax
    mov [esi + ecx - 4],ebx     ; store the data back in the array with correct alpha
    sub ecx,8
    ja @b                       ; working through the array backwards; stop when ecx reaches 0
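The keying pass can also be written as a short C reference, which is handy for checking results from an assembly version. This is a sketch under assumptions: `apply_color_key` is a hypothetical name, and it uses the common convention that key-coloured pixels become transparent (alpha 0) while everything else becomes opaque (alpha 0xFF). Swap `!=` for `==` to get the inverse.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference for the keying pass over already-unpacked pixels.
   The junk fourth byte is masked off, then replaced by the alpha value.
   Assumed convention: key colour -> alpha 0x00, any other colour -> 0xFF. */
void apply_color_key(uint32_t *pixels, size_t count, uint32_t key)
{
    assert(pixels != NULL);
    key &= 0x00FFFFFFu;
    for (size_t i = 0; i < count; ++i) {
        uint32_t rgb = pixels[i] & 0x00FFFFFFu;
        /* Branchless select: (rgb != key) is 0 or 1, so 0 - that is either
           0 or 0xFFFFFFFF, which is then cut down to the alpha byte. */
        uint32_t alpha = (0u - (uint32_t)(rgb != key)) & 0xFF000000u;
        pixels[i] = rgb | alpha;
    }
}
```

With the key 0x00FF00FF, a pixel holding 0x12FF00FF ends up as 0x00FF00FF, and 0x34102030 ends up as 0xFF102030.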
bitRAKE 19 Aug 2008, 00:47
No reason not to do unpack and color key in one routine, imho.
_________________
¯\(°_o)/¯ “languages are not safe - uses can be” Bjarne Stroustrup
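To make the single-routine idea concrete, here is a hedged C sketch that unpacks and keys in one pass over the source. The function name and the alpha convention (key colour becomes transparent) are assumptions for illustration, not something specified in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fused routine: reads packed 3-byte pixels, compares each one
   with the key, and writes a finished 32-bit pixel with its alpha byte set.
   Assumed convention: key colour -> alpha 0x00, anything else -> 0xFF. */
void unpack_and_key(const uint8_t *src, uint32_t *dst, size_t pixels, uint32_t key)
{
    assert(src != NULL && dst != NULL);
    key &= 0x00FFFFFFu;
    for (size_t i = 0; i < pixels; ++i) {
        /* Assemble the pixel with the first source byte in the low bits,
           matching what a little-endian dword load would produce. */
        uint32_t rgb = (uint32_t)src[3 * i]
                     | ((uint32_t)src[3 * i + 1] << 8)
                     | ((uint32_t)src[3 * i + 2] << 16);
        uint32_t alpha = (0u - (uint32_t)(rgb != key)) & 0xFF000000u;
        dst[i] = rgb | alpha;
    }
}
```

Each source byte is read once and each destination dword is written once, so the data is not processed twice.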
r22 19 Aug 2008, 02:37
re bitRAKE
The reason not to is that you'd lose (or make much more difficult) the ability to optimize the color key code with SSE/SSE2. Unless you could come up with a clever/efficient way of getting from XMMX = RBGRBGRBGRBGRBGR h to XMMX = 0BGR0BGR0BGR0BGR h
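One possible answer to that rearrangement is a byte shuffle. SSE2 has no general byte shuffle, but SSSE3 adds PSHUFB, which C exposes as `_mm_shuffle_epi8`. The sketch below is an assumption-laden illustration rather than anything proposed in the thread: the function name is hypothetical, the SSSE3 path is only compiled when the compiler targets SSSE3 (for example with `-mssse3`), and a scalar loop handles the tail or the whole job otherwise.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSSE3__)
#include <tmmintrin.h>
#endif

/* Hypothetical unpack using a byte shuffle: 12 packed bytes become four
   4-byte pixels whose fourth byte is zero. */
void unpack24to32_shuffle(const uint8_t *src, uint8_t *dst, size_t pixels)
{
    size_t i = 0;
    assert(src != NULL && dst != NULL);
#if defined(__SSSE3__)
    /* Output byte n takes source byte spread[n]; -1 (high bit set) gives 0. */
    const __m128i spread = _mm_setr_epi8(0, 1, 2, -1, 3, 4, 5, -1,
                                         6, 7, 8, -1, 9, 10, 11, -1);
    /* Each load touches 16 source bytes, so only use it while at least
       16 bytes of source remain. */
    for (; (pixels - i) * 3 >= 16; i += 4) {
        __m128i v = _mm_loadu_si128((const __m128i *)(src + 3 * i));
        _mm_storeu_si128((__m128i *)(dst + 4 * i), _mm_shuffle_epi8(v, spread));
    }
#endif
    /* Scalar tail (and full fallback without SSSE3). */
    for (; i < pixels; ++i) {
        dst[4 * i + 0] = src[3 * i + 0];
        dst[4 * i + 1] = src[3 * i + 1];
        dst[4 * i + 2] = src[3 * i + 2];
        dst[4 * i + 3] = 0;
    }
}
```

Because the shuffle already zeroes the fourth byte, a keying pass that follows can skip its masking step.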
adnimo 19 Aug 2008, 12:41
OK, I did some testing where I load a raw 24bit image into memory, unpack it and then save it back to disk. The result is not quite correct: the memory doesn't seem to be unpacked, it's just the same as the 24bit data but with a bigger trail of 0s at the end.
Any ideas? The data is being processed, because otherwise I would end up with all 0s. It's just not being unpacked correctly, I think.
Madis731 19 Aug 2008, 23:11
The algorithm we gave you does not produce zeros like you say it does.
Actually it should have only ONE zero because (an example):
source=ABCDEFGHIJKL
dest=ABCDDEFGGHIJJKL_
Please show us the code you used to unpack it. Sounds like you have switched the source and destination and/or are starting from the wrong end etc.
When using dest=source with the algorithm shown here, you should start from the end so as not to overwrite your data.
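The dest=source case can be sketched in C as follows. This is an illustrative model only: the function name is hypothetical, and it assumes the buffer was allocated with room for `4 * pixels` bytes while the packed data occupies the first `3 * pixels` bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-place unpack. Pixels are processed from the last one to
   the first, so each write lands at or beyond the bytes still waiting to be
   read and nothing unread is overwritten. */
void unpack24to32_inplace(uint8_t *buf, size_t pixels)
{
    assert(buf != NULL);
    for (size_t i = pixels; i-- > 0; ) {
        /* Read the whole pixel before writing, since for small i the source
           and destination ranges of the same pixel overlap. */
        uint8_t b0 = buf[3 * i + 0];
        uint8_t b1 = buf[3 * i + 1];
        uint8_t b2 = buf[3 * i + 2];
        buf[4 * i + 0] = b0;
        buf[4 * i + 1] = b1;
        buf[4 * i + 2] = b2;
        buf[4 * i + 3] = 0;   /* cleared here; the alpha value can go in later */
    }
}
```

Running the same loop front to back would corrupt the image, because pixel 1 would be written over bytes that still belong to unread pixels.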
dagcrack 21 Nov 2021, 03:20
Hi guys, I know this is an old topic, but I've been interested in using this code on x64. Mind you, I'm just moving from x86 (I know...)
Do I only have to replace the register names, since we're dealing with the data as-is, or is there anything else one must do to convert this for 64bit? Thanks!
revolution 21 Nov 2021, 09:33
You probably just need to convert the registers used for addressing memory. The data processing can remain as 32-bit.
But even then, if your OS can assure you that all addresses are below 4G then you might be able to get by with no changes to the code as shown above.
Copyright © 1999-2024, Tomasz Grysztar. Also on GitHub, YouTube.
Website powered by rwasa.