flat assembler
Message board for the users of flat assembler.

Critique please

tripledot 30 Mar 2012, 17:51
Would somebody mind looking this over to see if there's a better approach? I'm updating a VSTi sampler to take advantage of AVX, and I've been asked to add 8-bit sample support while I'm at it. Internal sample data is stored in f64 format. Being the obsessive type, I'm optimising the sample loader for AVX as well, so here's my first stab at batch-converting signed bytes to f64 (ranging from -1.0 to +1.0). My eyes are going funny and I can't see any more optimisations...

Any thoughts greatly appreciated. Thanks for looking!

Code:
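; Sample.load_i8_1ch
;
;       rcx  = nBytes
;      [rdx] = 1/127 (f64)
;      [rsi] = src
;      [rdi] = dst (must be 32-byte aligned for vmovntpd)
;
; Clobbers rax, rbx, ymm0, ymm1.

align 16
Sample.load_i8_1ch:
                mov             rbx, 4
                vbroadcastsd    ymm1, [rdx]             ; scale factor in all four lanes
                cmp             rcx, rbx
                jb              .tail
        align 16
        .process4:
                vpmovsxbd       xmm0, dword [rsi]       ; 4 signed bytes -> 4 dwords
                add             rsi, rbx
                vcvtdq2pd       ymm0, xmm0              ; 4 dwords -> 4 doubles
                vmulpd          ymm0, ymm0, ymm1        ; apply scale
                vmovntpd        [rdi], ymm0             ; non-temporal store
                add             rdi, 32
                sub             rcx, rbx
                cmp             rcx, rbx
                jae             .process4
        .tail:
                test            rcx, rcx                ; fix: nothing left when nBytes is 0 or a
                jz              .done                   ; multiple of 4; "sub rcx, 1" would wrap
        align 16
        .process1:
                movsx           eax, byte [rsi]         ; 1 signed byte -> dword
                add             rsi, 1
                vcvtsi2sd       xmm0, xmm0, eax         ; dword -> double
                vmulsd          xmm0, xmm0, xmm1        ; apply scale
                vmovsd          [rdi], xmm0
                add             rdi, 8
                sub             rcx, 1
                jnz             .process1
        .done:
                ret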
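
For checking the routine above against known-good output, a scalar reference can help. This is only a minimal C sketch and not part of the original post: the name `load_i8_1ch_ref` and its signature are made up for illustration. It computes what the assembly is meant to compute, namely each signed byte converted to a double and multiplied by the scale factor.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar reference for Sample.load_i8_1ch (name and
   signature are assumptions for illustration, not from the post).
   Converts n signed 8-bit samples to doubles, each multiplied by scale. */
static void load_i8_1ch_ref(const int8_t *src, double *dst, size_t n, double scale)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = (double)src[i] * scale;
}
```

Running both versions over the same buffer, including lengths that are and are not multiples of 4, should exercise both the vector loop and the scalar tail.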
    


LocoDelAssembly 30 Mar 2012, 21:30
Sorry, I tend to be extremely stupid on floating point stuff, but don't you get some byte values out of range when converted? (e.g. -128)



tripledot 30 Mar 2012, 21:46
I'm not sure I understand you. My apologies if this is obvious, but this code converts from signed bytes to doubles, not the other way around.

"vpmovsxbd" sign-extends bytes to dwords, which are converted to doubles by the "vcvtdq2pd" instruction. We now have floats in the range -128.0 to 127.0. In an ideal world, the person responsible for the code that generates 8-bit audio knows that they should not use the entire range of a byte to represent the amplitude range. Rather, they should use the (unsigned) range 1 to 255, therefore making 128 the middle (i.e. 0.0), giving you 127 values in either direction (+ve or -ve) to represent a floating point signal in fixed-point format. I multiply my doubles by 1.0/127.0 to get a range of -1.0 to +1.0 (I hope!)

In any case, I was appealing more to people with AVX/Sandy Bridge/BD experience... I've managed to get the inner loop down to 32 bytes. It's a tight loop but it's nice and small for the code cache. There are no loop-carried dependencies, so after a few iterations the latencies of the moves, conversions and multiplies should be overcome by pipelining and OoO execution. Unless somebody knows something I don't (very likely!)

But sheeeit, no need for apologies! Thanks a million for taking a look!

EDIT:

I think I see what you mean... So if a value of -128 crops up in a sound file, my code will result in there being a floating point value < -1.0...

Ack. If I multiply everything by 1.0/128.0 instead, then everything sits nicely in range (no clipping), but then I have a DC offset to worry about. I haven't been able to find much info about the standards used to store 8-bit audio data (nothing really deep, anyway). I should really fire up Audition and run some tests...
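
To make the range issue concrete, here is a small C sketch comparing the two scale factors at the extremes of a signed byte. It is an illustration only: the helper names are hypothetical, and it divides by 127.0 or 128.0 directly rather than multiplying by a precomputed reciprocal as the assembly does.

```c
#include <stdint.h>

/* Hypothetical helpers for illustration: map a signed 8-bit sample
   to a double using a divisor of 127 or of 128. */
static double scale_by_127(int8_t s) { return (double)s / 127.0; }
static double scale_by_128(int8_t s) { return (double)s / 128.0; }
```

With a divisor of 127, +127 maps to 1.0 but -128 maps to roughly -1.0079, just outside the target range. With a divisor of 128, -128 maps to exactly -1.0 and +127 to 0.9921875, so nothing leaves the range.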



Madis731 01 Apr 2012, 19:56
If you're done with the theoretical optimizing, then there's nothing more to do than test it on real hardware. From there you can optimize by reordering instructions and adding some strategically placed NOPs.



edfed 01 Apr 2012, 21:18
the -128 value (which gives -1.0078) may not be a problem, since it is not frequent in real audio to have this kind of extreme value.

saturation can be used there, before converting.

take the byte stream, do the saturation adjustment before the conversion, and you'll get something good i think.

your loop will then be bigger rather than smaller.
maybe a 64-byte loop is acceptable.

i know many say:

don't use inc and dec instructions on registers. but what is the point of such instructions if not to be used?

by replacing [rdi] with [rdi*8], and "add edi,8 / sub ecx,1" with "inc edi / loop", you will get smaller code, and then maybe faster. also, i see that in this loop ecx cannot be bigger than 4, so why use the full rcx register? cl is largely enough to do the job.


all in all, the multi-case loop can be converted into a single loop if you pad or ignore the extra bytes of your signal.
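
The saturation idea can be sketched in scalar C as follows. This is a hedged illustration, not code from the thread: the function name is hypothetical, and it assumes the goal is a symmetric range of -127..+127 before scaling by 1/127. In the vector loop the clamp would presumably be a packed signed-byte maximum against -127 (for example `vpmaxsb`), applied before the sign extension.

```c
#include <stdint.h>

/* Hypothetical helper: clamp -128 up to -127 so the input range is
   symmetric, then scale so that the result lies in [-1.0, +1.0]. */
static double convert_saturated(int8_t s)
{
    if (s < -127)
        s = -127;               /* saturate the one extra negative value */
    return (double)s / 127.0;
}
```

The cost is that -128 and -127 become indistinguishable after conversion.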



Madis731 02 Apr 2012, 11:21
@edfed - because using partial registers is sometimes dangerous. I would generally use ecx where a value is guaranteed to fit into DWORD, but I would not use parts of that register (cx, cl, ch). Even if loading a BYTE/WORD value, you are recommended to make use of the MOV*X instructions.

@tripledot - you can do two things. Either:
1) Force using only counts divisible by 4 so you don't need .process1 at all
2) or take advantage of the fact that when rsi is 16-aligned (for xmm, or 32-aligned for ymm) you won't get any errors accessing memory at rsi+rcx+00..15 (31). Then you can just make one extra round with .process4 and later discard 1, 2 or 3 results depending on the original size of rcx.

My guess is that one pass of the .process1 loop takes about the same time as one pass of .process4, so you can speed up the processing of the remaining 1..3 bytes by up to 3 times.

You can also change the mov rbx,4 to mov ebx,4, which accomplishes the same thing but brings the total footprint down (from 94) to 78 bytes. This lucky win comes from the fact that the first 4 instructions of the original take that extra byte and trip over the 16-byte barrier. mov ebx,4 will zero-extend the result, but take less space in the binary. :)

Can anyone explain to me why 1.0/127.0 is so important that a -1.0078 result is acceptable, while 1.0/128.0 sounds more natural (2^7=128, it's ½ of the 256 space) and you don't need to do any more checks on limits? Is there something wrong with 0.9921875 on the positive maximum side? 0.0 is still 0.0, isn't it?
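
The "no tail loop" suggestion can be sketched in scalar C. This is an assumption-laden illustration rather than anything posted in the thread: the names `round_up_4` and `load_i8_padded` are hypothetical, the inner loop merely stands in for one `.process4` iteration, and both buffers are assumed to have room for the rounded-up count.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: round a sample count up to the next multiple of 4. */
static size_t round_up_4(size_t n)
{
    return (n + 3) & ~(size_t)3;
}

/* Hypothetical sketch: convert whole groups of four only, with no
   scalar tail. Assumes src and dst both hold at least round_up_4(n)
   elements. Returns the number of elements written; only the first n
   results are meaningful, the rest are to be discarded. */
static size_t load_i8_padded(const int8_t *src, double *dst, size_t n, double scale)
{
    size_t padded = round_up_4(n);
    for (size_t i = 0; i < padded; i += 4) {
        /* stands in for one .process4 iteration */
        for (size_t k = 0; k < 4; ++k)
            dst[i + k] = (double)src[i + k] * scale;
    }
    return padded;
}
```

For n = 5 this performs two four-wide rounds and writes 8 results, of which the last 3 are ignored.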



tripledot 13 Apr 2012, 20:09
Really sorry for replying so late.

Thanks to all for the input, it is hugely appreciated.

@Madis: I can't believe I didn't spot "mov ebx, 4". Nice one! I haven't been writing 64-bit code for very long, and sign-extension still trips me up from time to time.

Like you, I'm very happy writing to the 32-bit portion of a register, but I avoid partial register use like the plague for words or bytes.

@both: You are both right; it seems like a total waste to even bother with the single-byte case. Since all my buffers are 32-byte aligned, it's not a big deal to pad their lengths to a multiple of 32 bytes.

And after a bit more research, it seems nobody cares about the extra negative value in fixed-point audio files. So I'll just downscale by 128 instead of 127 and be done with it. Should lead to more accurate floating-point representation of the original audio, too.

This raises an interesting question... when converting from floating-point to fixed-point, which is more evil (in terms of THD): upscaling by 127 (and introducing floating-point inaccuracies along the way), or upscaling by 128 (necessitating saturation clipping of the most positive peaks)?

Really a moot point when dealing with 8-bit audio, but for 16/24-bit I suppose this might concern the audiophiles. Not that I'd believe them if they claimed to be able to hear a difference, but still...
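
The two options for the reverse conversion can be written out as a C sketch. It does not answer the THD question; it only shows where each variant clips. The function names are hypothetical, and the rounding mode (nearest, halves away from zero) is an assumption, since the thread does not specify one.

```c
#include <stdint.h>

/* Hypothetical: scale by 127. Full-scale +1.0 and -1.0 map to +127
   and -127; the value -128 is never produced. */
static int8_t to_i8_scale127(double x)
{
    double v = x * 127.0;
    v = (v >= 0.0) ? v + 0.5 : v - 0.5;   /* round half away from zero */
    if (v > 127.0)  v = 127.0;            /* guard against out-of-range input */
    if (v < -127.0) v = -127.0;
    return (int8_t)v;                     /* cast truncates toward zero */
}

/* Hypothetical: scale by 128. -1.0 maps to -128, but +1.0 would give
   +128, so the most positive peaks are clipped to +127. */
static int8_t to_i8_scale128(double x)
{
    double v = x * 128.0;
    v = (v >= 0.0) ? v + 0.5 : v - 0.5;   /* round half away from zero */
    if (v > 127.0)  v = 127.0;            /* saturation of positive peaks */
    if (v < -128.0) v = -128.0;
    return (int8_t)v;                     /* cast truncates toward zero */
}
```

With the factor of 128, any input above 0.9921875 lands on +127, which is the clipping referred to above; with 127, no in-range input is clipped but -128 goes unused.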
