flat assembler
Message board for the users of flat assembler.

OpenGL - Inverse alpha?

adnimo



Joined: 18 Jul 2008
Posts: 49
adnimo 08 Aug 2008, 20:08
Hi guys, this is an OpenGL question, but it relates 100% to FASM, hence I'm asking here: I know most of you are speed junkies, and speed is what I'm after.

When doing color-keying, one must convert the 24-bit color data into a 32-bit buffer that also holds the alpha information. While doing this, we set the alpha byte to 255 for pixels that are fully opaque, and if the pixel matches the color key, we don't do anything.

So that's the problem: most of my images have far more opaque pixels than keyed ones. I would like to know if it would be faster to do the opposite: set 255 on the pixels that are transparent, and then probably run the whole buffer through an optimized inverting routine or something like that. Or perhaps OpenGL allows using "inverse" alpha?

I'm stuck. Any insights, please?

I also found a bottleneck in my checking code. Is there any way to check 3 bytes against another 3 bytes really, really fast?

The bottleneck exists because my game relies on many textures, and they all use color keying. While profiling this routine I found that most of the time is spent on this check and on the memory allocation. Now, I could use alpha from the beginning (32-bit textures), but it's easier for everyone if we just use 255,0,255 as the color key instead. Even though that is quite limiting, the whole game is supposed to look and feel like DOS; we only use 24-bit data to provide a larger number of colors (so it's retro in style only).

It might not be a huge bottleneck, but in the end, if I can save even 1 second of load time, it'll be worth the effort.
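
A note on the inverse-alpha question, as a sketch rather than a definitive answer: fixed-function OpenGL blending takes a source and a destination factor, so swapping the usual pair, i.e. `glBlendFunc(GL_ONE_MINUS_SRC_ALPHA, GL_SRC_ALPHA)` instead of `glBlendFunc(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA)`, should make alpha = 255 mean "fully transparent". The C code below only models that arithmetic for one 8-bit channel on the CPU; `blend_normal` and `blend_inverse` are illustrative names, not OpenGL calls.

```c
#include <stdint.h>

/* Models GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA for one channel:
   alpha = 255 shows the source, alpha = 0 keeps the destination. */
static uint8_t blend_normal(uint8_t src, uint8_t dst, uint8_t alpha)
{
    return (uint8_t)((src * alpha + dst * (255 - alpha) + 127) / 255);
}

/* Models the swapped factors GL_ONE_MINUS_SRC_ALPHA, GL_SRC_ALPHA:
   alpha = 255 keeps the destination, alpha = 0 shows the source. */
static uint8_t blend_inverse(uint8_t src, uint8_t dst, uint8_t alpha)
{
    return (uint8_t)((src * (255 - alpha) + dst * alpha + 127) / 255);
}
```

With this, a loader could write 255 only on keyed pixels and leave every other alpha byte at 0, at the price of the texture no longer carrying conventional alpha.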
adnimo



Joined: 18 Jul 2008
Posts: 49
adnimo 08 Aug 2008, 20:20
I thought about getting rid of the second allocation, but as it turns out, I can't: the data I load is 24-bit, and if I allocate a 32-bit buffer and put the 24-bit data in there, I'll just have a gap at the end of the buffer, because the data is not spread out to 32 bits per pixel. So I can't get rid of the second alloc; that is my guess.

But I could optimize everything else; if OpenGL allows inverse alpha, that'll do for starters.

Perhaps I should use a struct for the color key and then do a CMP on 32 bits. The last byte will be empty, but that doesn't matter, right?
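
On the 32-bit CMP idea: the fourth byte only "doesn't matter" if it is masked off (or known to be zero) before the compare, because a 4-byte load from packed 24-bit data also picks up the first byte of the next pixel. A small C sketch of the check, assuming a little-endian CPU such as x86; `is_color_key` and `COLOR_KEY` are hypothetical names:

```c
#include <stdint.h>
#include <string.h>

#define COLOR_KEY 0x00FF00FFu  /* B=255, G=0, R=255 in the low three bytes */

/* Returns 1 if the pixel starting at p matches the key, 0 otherwise.
   Reads 4 bytes, so at least 4 bytes must be readable at p. */
static int is_color_key(const uint8_t *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);                 /* unaligned-safe 32-bit load */
    return (v & 0x00FFFFFFu) == COLOR_KEY;   /* ignore the fourth byte */
}
```

In assembly this corresponds to a `mov`, an `and` with 00FFFFFFh and a single `cmp`.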
Madis731



Joined: 25 Sep 2003
Posts: 2138
Location: Estonia
Madis731 08 Aug 2008, 22:27
Well, I won't make a full-blown comment right now (I have to sleep etc.), but here are a few starters.
1) The 4-byte alignment is a very good idea: then your colours won't be RGBRGBRGBRGB... like now, but RGB_RGB_RGB_RGB_... ("_" meaning empty). This way you can compare 4 bytes really fast.
2) Have you checked MMX/SSE routines yet? There are excellent instructions to compare 2 to 4 pixels in one go (8 bytes in MMX, 16 bytes in SSE2).
3) The color-keying answer lies in MMX/SSE as well. The compare in these instruction sets will make you a nice mask that you can use conditionally or unconditionally.
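
To make points 2 and 3 concrete, here is a rough sketch written with C intrinsics rather than FASM. It assumes the pixels are already unpacked to 32 bits with a junk top byte; `apply_color_key` is a made-up name, and the scalar loop doubles as the tail handler and as the fallback when SSE2 is not available at build time.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>   /* SSE2 intrinsics */
#define USE_SSE2 1
#endif

/* Sets the top byte of each pixel to 0 if its low 24 bits equal the key,
   and to 255 otherwise. */
static void apply_color_key(uint32_t *px, size_t count, uint32_t key)
{
    size_t i = 0;
    key &= 0x00FFFFFFu;
#ifdef USE_SSE2
    const __m128i rgb_mask = _mm_set1_epi32(0x00FFFFFF);
    const __m128i key_v    = _mm_set1_epi32((int)key);
    const __m128i alpha    = _mm_set1_epi32((int)0xFF000000u);
    for (; i + 4 <= count; i += 4) {
        __m128i v   = _mm_loadu_si128((const __m128i *)(px + i));
        __m128i rgb = _mm_and_si128(v, rgb_mask);    /* drop the junk byte */
        __m128i eq  = _mm_cmpeq_epi32(rgb, key_v);   /* all ones where key (PCMPEQD) */
        __m128i a   = _mm_andnot_si128(eq, alpha);   /* FF000000h where NOT key */
        _mm_storeu_si128((__m128i *)(px + i), _mm_or_si128(rgb, a));
    }
#endif
    for (; i < count; ++i) {                         /* scalar tail / fallback */
        uint32_t rgb = px[i] & 0x00FFFFFFu;
        px[i] = rgb | (rgb != key ? 0xFF000000u : 0u);
    }
}
```

The compare result is used unconditionally here: it is simply ANDed and ORed into the pixels, so there is no branch per pixel.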

Ok, good night.
(Maybe others aren't sleeping)
adnimo



Joined: 18 Jul 2008
Posts: 49
adnimo 12 Aug 2008, 05:12
Hi, thanks for the suggestions.

There is one problem, however: the data itself is stored as BGRBGRBGR... and thus isn't 4-byte aligned.

If you take a look, Microsoft uses RGB "quads" with an unused byte just for alignment, and that's cool, but my data is 24-bit, so if I convert it to 32-bit I'm afraid any speed increase will be lost in the conversion itself.

I could make my own image format, or just save with an empty alpha channel, but that defeats the purpose of being able to use any existing graphics application, and it also adds yet another limitation/rule for the artist.

If I had BGR_BGR_BGR_BGR... I could obviously not only compare very fast, but I would also get rid of the extra malloc by using this memory buffer and simply writing my alpha value into the fourth byte.

Again, I cannot do that, so what can I do?

Thanks, and sleep tight by the way!
Madis731



Joined: 25 Sep 2003
Posts: 2138
Location: Estonia
Madis731 12 Aug 2008, 07:48
Getting rid of 24-bit is your best bet. Unpacking can be done really fast. The first thing I can come up with is:
Code:
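  mov ecx,DataLength  ;assumed to be the size of the 32-bit destination in bytes (pixels*4), non-zero
  mov esi,pDataSource ;packed 24-bit source
  mov edi,pDataDest   ;32-bit destination
@@:
  mov eax,[esi]       ;load 4 bytes: one pixel plus the first byte of the next one
  mov [edi],eax       ;(the very last load reads 1 byte past the end of the source)
  add esi,3 ;Here's the trick
  add edi,4 ;You don't *have* to clear the fourth byte
  sub ecx,4
  jne @b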
    

If the unpacking takes time N and the general algorithm on the data also takes time N, you can speed the algorithm up by up to 4x (maybe more) with aligned data, so it takes only N/4.
Now compare: your original idea, copying this data to another location and then processing it, costs N (for copying) + N (for the algorithm) = 2N. My idea costs ~1.1N (a little more time for copying while unpacking) + N/4 (for the optimized algorithm) = ~1.35N.
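
For readers who prefer C, the unpack loop can be sketched as below. This is an illustration under assumptions, not the exact routine: `unpack24to32` is a hypothetical name, source and destination must not overlap, and the last pixel is copied with 3 bytes so the sketch never reads past the end of the source.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* src holds `pixels` packed 3-byte pixels; dst has room for pixels*4 bytes. */
static void unpack24to32(const uint8_t *src, uint8_t *dst, size_t pixels)
{
    if (pixels == 0)
        return;

    /* Copy 4 bytes per pixel but advance the source by only 3; the fourth
       byte written is junk (the first byte of the next pixel). */
    for (size_t i = 0; i + 1 < pixels; ++i)
        memcpy(dst + 4 * i, src + 3 * i, 4);

    /* Last pixel: copy 3 bytes only and clear the spare byte. */
    memcpy(dst + 4 * (pixels - 1), src + 3 * (pixels - 1), 3);
    dst[4 * pixels - 1] = 0;
}
```

The junk fourth bytes are harmless as long as the keying step masks them off or overwrites them with the alpha value.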

Q.E.D.
f0dder



Joined: 19 Feb 2004
Posts: 3174
Location: Denmark
f0dder 12 Aug 2008, 10:51
Let your artists save in whatever format they want, and have a build step that converts the image data to a 32bit ARGB or RGBA format, as needed. You can of course keep 24bit loading code so you can test changes quickly, but imho if you want speed, do the conversion at build-time instead of at load-time.

And the trick is of course to automate so you don't have to do silly manual labour. Invest some time in learning a build system, SCons isn't half bad.
adnimo



Joined: 18 Jul 2008
Posts: 49
adnimo 18 Aug 2008, 08:07
Well, yes, that's what I'd suggest.
But meanwhile, what I'd like to do is speed up the 24-bit code.

I'll try to run a test with Madis731's example and see if it helps.

f0dder, I automate through batch files on my Windows box; that's all I need. However, I'd like to speed up the 24-bit code anyway. I would like to use 32-bit all the time, but it also means increased disk usage, unless we use some sort of compression scheme. I have BLZ for compression, but I'm still trying to figure out the best balance between file size and load time...

Keep in mind that I'm going for the minimalistic approach here, so while there are better compression algorithms out there, most implementations are huge and others are simply too hard for me to understand...

One thing I'm certain of is the DDS format: it's a nice solution, but it's lossy (AFAIK). I will implement it anyway, since I really like it, and the speed increase is very noticeable not only at load time but also during execution.

Regarding the unpacking, is there any faster method, perhaps through extensions? Would it be a better idea to have what Madis suggested as a fallback, selected through a CPUID check?

Thanks.

PS: I still don't understand the point of "build systems"... I have a bat file which I double-click, and everything that has to be compiled is compiled. I have another one that creates a zip with all the required files for that test/suite. Am I missing something?
r22



Joined: 27 Dec 2004
Posts: 805
r22 18 Aug 2008, 12:32
Do NOT use 32bit image files if you only need 24bit.

The 33% size increase will hurt your performance when reading the textures from disk. Disk reads are the worst bottleneck, to such an extent that compressing your images on disk and then decompressing them in memory MIGHT actually improve loading performance further.

Along the same lines, keeping your graphics in larger files like 2D sprite images can improve loading times, because you will have 1 file being read sequentially from the hard drive instead of many smaller files being accessed randomly.

I'll back Madis731's method.
adnimo



Joined: 18 Jul 2008
Posts: 49
adnimo 18 Aug 2008, 16:16
You're talking about production; I'm talking about development. In the end an image atlas is used, but throughout development separate files are used, for obvious reasons.

The bottleneck right now is in the 24-bit to 32-bit mask routine. One of the reasons is that the data in memory is not 32-bit aligned due to its 24-bit origin. Perhaps unpacking would be the way to go, but wouldn't I be processing the same data twice and thus not gaining anything in return?

The source is 24-bit; the destination is ALWAYS 32-bit, because we're talking about masked textures, i.e. they have an 8-bit alpha channel that is generated from the 24-bit data (color keying).

We went from the alpha issue down the road of image atlases, build systems, compression and whatnot. Let me please be clear: I'm just worried about the color keying part being slow due to the nature of 24-bit data. Everything else isn't really on topic.
r22



Joined: 27 Dec 2004
Posts: 805
r22 18 Aug 2008, 17:10
I clearly wrote
Quote:
I'll back Madis731's method.


To clarify further
- Load your 24bit texture data
- Unpack it to 32bit data
- Run an optimized color key algorithm on the 32bit data

Unpacking
Code:
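  mov ecx,DataLength  ;assumed to be the size of the 32-bit destination in bytes (pixels*4), non-zero
  mov esi,pDataSource ;packed 24-bit source
  mov edi,pDataDest   ;32-bit destination
@@:
  mov eax,[esi]       ;load 4 bytes: one pixel plus the first byte of the next one
  mov [edi],eax       ;(the very last load reads 1 byte past the end of the source)
  add esi,3 ;Here's the trick
  add edi,4 ;You don't *have* to clear the fourth byte
  sub ecx,4
  jne @b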
    


Keying - this is only slightly optimized but now that the above code lets you align your data correctly we can improve the following pretty easily
Code:
  mov ecx,UnpackedDataLength ;in bytes; assumed to be a non-zero multiple of 8
  mov esi,pUnpackedData
  mov ebp,0FF000000h ;opaque alpha; CMOVcc cannot take an immediate, so keep it in a register (ebp is clobbered)
@@:
  mov eax,[esi + ecx - 8]
  mov ebx,[esi + ecx - 4] ;do 2 at a time
  and eax,00FFFFFFh
  and ebx,00FFFFFFh ;mask off that last junk byte
  xor edx,edx
  xor edi,edi ;default alpha = 0 (transparent); done before the cmp because xor changes the flags
  cmp eax,00FF00FFh ;test color key
  cmovne edx,ebp ;not the key: make the pixel opaque
  cmp ebx,00FF00FFh ;test color key
  cmovne edi,ebp ;not the key: make the pixel opaque
  or eax,edx
  or ebx,edi ;opaque or transparent alpha
  mov [esi + ecx - 8],eax
  mov [esi + ecx - 4],ebx ;store the data back in the array with correct alpha
  sub ecx,8
  jnz @b ;working through the array backwards; stop once ecx reaches 0
    
bitRAKE



Joined: 21 Jul 2003
Posts: 4308
Location: vpcmpistri
bitRAKE 19 Aug 2008, 00:47
No reason not to do unpack and color key in one routine, imho.
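
A minimal C sketch of what such a one-pass routine might look like, assuming a little-endian CPU and a key whose top byte is zero; `unpack_and_key` is a hypothetical name, not code from this thread:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Packed 24-bit BGR in, 32-bit BGRA out in a single pass:
   alpha = 0 for the color key and 255 for every other pixel. */
static void unpack_and_key(const uint8_t *src, uint32_t *dst, size_t pixels,
                           uint32_t key)
{
    for (size_t i = 0; i < pixels; ++i) {
        uint32_t v = 0;
        /* A 3-byte copy keeps the read inside src; a plain 4-byte load plus
           a mask is the faster variant if one extra byte is readable. */
        memcpy(&v, src + 3 * i, 3);
        uint32_t opaque = (uint32_t)(v != key);       /* 1 or 0, no branch needed */
        dst[i] = v | ((0u - opaque) & 0xFF000000u);   /* FF000000h or 0 */
    }
}
```

Each pixel is read once and written once, so the separate unpacked intermediate pass disappears.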

r22



Joined: 27 Dec 2004
Posts: 805
r22 19 Aug 2008, 02:37
re bitRAKE
The reason not to is that you'd lose the ability (or at least make it much more difficult) to optimize the color key code with SSE/SSE2.

Unless you could come up with a clever/efficient way of getting
XMMX = RBGRBGRBGRBGRBGR h
to XMMX = 0BGR0BGR0BGR0BGR h
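
One possible candidate, assuming the CPU supports SSSE3 and not just SSE2, is the byte shuffle PSHUFB: a control byte with its high bit set writes a zero, so a single shuffle can spread 12 packed bytes into four dwords with an empty top byte. The sketch below uses C intrinsics; `expand4_bgr` is a hypothetical name, and the plain loop is only a fallback for builds where SSSE3 is not enabled.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSSE3__)
#include <tmmintrin.h>   /* SSSE3: _mm_shuffle_epi8 (PSHUFB) */
#endif

/* Expands 4 packed BGR pixels (12 bytes) to 4 dwords with a zeroed top byte.
   The SSSE3 path loads 16 bytes, so 16 bytes must be readable at src. */
static void expand4_bgr(const uint8_t *src, uint32_t dst[4])
{
#if defined(__SSSE3__)
    /* Index -1 has the high bit set, which makes PSHUFB emit a zero byte. */
    const __m128i shuf = _mm_setr_epi8(0, 1, 2, -1, 3, 4, 5, -1,
                                       6, 7, 8, -1, 9, 10, 11, -1);
    __m128i v = _mm_loadu_si128((const __m128i *)src);
    _mm_storeu_si128((__m128i *)dst, _mm_shuffle_epi8(v, shuf));
#else
    /* Portable little-endian fallback when SSSE3 is not enabled. */
    for (size_t i = 0; i < 4; ++i) {
        uint32_t v = 0;
        memcpy(&v, src + 3 * i, 3);
        dst[i] = v;
    }
#endif
}
```

After the shuffle the data is in the 0BGR0BGR0BGR0BGR layout, so a packed dword compare against the key can follow directly.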
adnimo



Joined: 18 Jul 2008
Posts: 49
adnimo 19 Aug 2008, 12:41
OK, I did some testing where I load a raw 24-bit image into memory, unpack it and then save it back to disk. The result is not quite correct: the memory doesn't seem to be unpacked; it's just the same as the 24-bit data but with a longer trail of 0s at the end.

Any ideas?

The data is being processed, because otherwise I would end up with all 0s; it's just not being unpacked correctly, I think.
Madis731



Joined: 25 Sep 2003
Posts: 2138
Location: Estonia
Madis731 19 Aug 2008, 23:11
The algorithm we gave you does not produce zeros like you say it does.
Actually it should have only ONE zero because (an example):
source=ABCDEFGHIJKL
dest=ABCDDEFGGHIJJKL_
Please show us the code you used to unpack it. It sounds like you have switched the source and destination and/or are starting from the wrong end, etc. When using dest=source with the algorithm shown here, you should start from the end so as not to overwrite your data.
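
To illustrate the dest=source case, here is a small C sketch of unpacking in place from the end. It assumes the buffer was allocated with room for pixels*4 bytes and holds the packed data at its start; `unpack24to32_inplace` is a hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Expands `pixels` packed 3-byte pixels at the start of buf to 4 bytes each,
   inside the same buffer. */
static void unpack24to32_inplace(uint8_t *buf, size_t pixels)
{
    /* Walk from the last pixel to the first: the write position (4*i) is
       never below the read position (3*i), so pixels that have not been
       moved yet are never overwritten. */
    for (size_t i = pixels; i-- > 0; ) {
        memmove(buf + 4 * i, buf + 3 * i, 3);  /* regions may overlap */
        buf[4 * i + 3] = 0;                    /* clear the spare byte */
    }
}
```

Running the same loop from the front would overwrite the second pixel's first byte when the first pixel's spare byte is written.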
dagcrack



Joined: 03 Jul 2006
Posts: 1
dagcrack 21 Nov 2021, 03:20
Hi guys, I know this is an old topic, but I've been interested in using this code on x64. Mind you, I'm just moving from x86 (I know...).

Do I only have to replace the register names and nothing else, because we're dealing with the data as-is, or is there anything else one must do to convert this for 64-bit?

Thanks!
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20754
Location: In your JS exploiting you and your system
revolution 21 Nov 2021, 09:33
You probably just need to convert the registers used for addressing memory. The data processing can remain as 32-bit.

But even then, if your OS can assure you that all addresses are below 4G then you might be able to get by with no changes to the code as shown above.
Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.

Website powered by rwasa.