flat assembler
Message board for the users of flat assembler.

Multiply to ZERO is slower than to value

Overclick



Joined: 11 Jul 2020
Posts: 394
Location: Ukraine
Hi,
I just want to share a strange situation. My stupid CPU doesn't like multiplying by zero. The whole project runs about a quarter faster when I add a tiny fake value to the streamed samples to make sure they are never empty (zero).
Code:
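.data
  align 16                                  ; movaps needs a 16-byte aligned operand
  RequiredNoise dd 4 dup(0.0000000000001f)  ; 1e-13 in each of the four lanes
.code
  movaps        xmm9,xword[RequiredNoise]   ; load the noise constant once
...
process_loop:                               ; main processing loop
  ...
  movups        xmm1,[rbx]                  ; four samples from the stream
  addps         xmm1,xmm9                   ; add the noise so zero samples become non-zero
  ...
  mulps         xmm1,...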
Post 10 Jul 2021, 01:20
bitRAKE



Joined: 21 Jul 2003
Posts: 3307
Location: vpcmipstrm
That is strange. Which CPU?

Maybe create a constant for the smallest non-zero float. I've seen that done in other code.
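
A minimal C sketch of what such a constant could be, assuming IEEE 754 single precision: FLT_MIN from <float.h> is the smallest positive normal float (about 1.18e-38), and any non-zero value with a smaller magnitude is a denormal. The helper name is_denormal_f is made up for this illustration.

```c
#include <float.h>

/* Hypothetical helper: returns 1 when x is a denormal (subnormal) float,
   that is, non-zero but smaller in magnitude than FLT_MIN. */
static int is_denormal_f(float x)
{
    return (x != 0.0f) && (x < FLT_MIN) && (x > -FLT_MIN);
}
```

By this test the 1e-13 noise constant from the first post is an ordinary normal number, far above FLT_MIN, and zero itself is not counted as a denormal.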

_________________
¯\(°_o)/¯ unlicense.org
Post 10 Jul 2021, 03:39
Overclick
There are more factors than zero versus non-zero. The magnitude of the operands matters too: operands of similar magnitude seem to calculate faster. As you know, single precision (sp) keeps only about 7 to 8 significant decimal digits, however far they are shifted from the decimal point. It seems that a large difference in magnitude costs some extra time, or something like that. When I rebalance my project to keep the values closer to zero, it runs faster whether they are zero or not. I have to investigate it more deeply.
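
The precision limit can be sketched in C, assuming IEEE 754 single precision with its 24-bit significand: once two operands differ in magnitude by more than about 2^24, the smaller one is lost completely in an addition. This shows only the rounding behaviour and says nothing about instruction timing. The function name is made up.

```c
#include <float.h>

/* Returns 1 when adding `small` to `big` leaves `big` unchanged.
   FLT_EPSILON (2^-23) is the gap between 1.0f and the next float. */
static int is_absorbed(float big, float small)
{
    volatile float sum = big + small;   /* volatile forces rounding to float */
    return sum == big;
}
```

With this helper, is_absorbed(1.0f, 1e-13f) is true while is_absorbed(0.0f, 1e-13f) is false, so a noise constant of that size only changes samples that are already extremely close to zero.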
Post 10 Jul 2021, 05:31
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 18222
Location: In your JS exploiting you and your system
Check for denormal numbers in the pipeline, and see whether the flush-to-zero (FTZ) flag is unset, which would make the FPU do extra processing.
Post 10 Jul 2021, 05:46
Overclick
The wiki says zero is a denormal number too...
Post 10 Jul 2021, 13:25
revolution
In most FPU implementations zero is not treated in a special execution path, so it should give normal performance.

Trouble comes when denormals (non-zero very small numbers) are there either as an input or an output. Check your FTZ flag. If that is already set then another thing to look for is infinity (or divide by zero).
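
For SSE code the FTZ flag lives in the MXCSR register, next to the related DAZ (denormals-are-zero) flag. A minimal C sketch of the two control bits, assuming the register value has already been read (for example with stmxcsr, or with _mm_getcsr() from <xmmintrin.h>). The macro and function names are illustrative only.

```c
#include <stdint.h>

#define MXCSR_DAZ  (1u << 6)    /* denormal inputs are treated as zero  */
#define MXCSR_FTZ  (1u << 15)   /* denormal results are flushed to zero */

/* Returns 1 when both FTZ and DAZ are enabled in the given MXCSR value. */
static int denormals_disabled(uint32_t mxcsr)
{
    return (mxcsr & (MXCSR_FTZ | MXCSR_DAZ)) == (MXCSR_FTZ | MXCSR_DAZ);
}

/* Returns the value to write back (ldmxcsr or _mm_setcsr) so that
   FTZ and DAZ are both on; all other bits are left as they were. */
static uint32_t with_ftz_daz(uint32_t mxcsr)
{
    return mxcsr | MXCSR_FTZ | MXCSR_DAZ;
}
```

Starting from the power-on default of 0x1F80, with_ftz_daz gives 0x9FC0. DAZ is missing on the earliest SSE processors, so the MXCSR_MASK field stored by fxsave is worth checking before setting that bit.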
Post 10 Jul 2021, 13:33
Overclick
I've found a new interesting thing. Multiplying by zero takes some time to calculate, BUT multiplying by NaN just passes the NaN through. That could be used as a speed-up trick.
I'm seriously thinking of turning empty values into NaN and back to zero at the end.

My engine takes 8 channels of single-precision values at a 192k frame rate from the stream. It turns stereo into a 7.1 surround effect, with a 31-line EQ for each channel, a delay sub-engine, etc. It uses 248 virtual channels internally, with 31 mulps instructions for each line of the EQ. That is very costly, as you can see. Optimisation is my only option.
My old CPU spends up to 30% of one core on that job. I need any ideas to speed it up.
Post 10 Jul 2021, 20:26
Overclick
Quote:

Trouble comes when denormals

Denormals sometimes occur when the engine starts, but they flush out during processing and there are no more denormals after that. No exception flag gets set except PE (precision), which is expected with the huge low-frequency counters. Can PE slow my processing down, as something that is hard to calculate? I'll try switching to double precision (dp), but I believe that will be slower anyway.
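
The check described here can be sketched in C as a decode of the sticky exception flags in the low six bits of MXCSR, assuming the register value was read with stmxcsr. The names are made up for the example. The flags stay set until software clears them, so they would need to be cleared after startup before a reading says anything about the steady state.

```c
#include <stdint.h>

#define MXCSR_FLAGS_MASK 0x3Fu      /* bits 0..5: IE, DE, ZE, OE, UE, PE */
#define MXCSR_DE         (1u << 1)  /* a denormal operand was seen       */
#define MXCSR_UE         (1u << 4)  /* a result underflowed              */
#define MXCSR_PE         (1u << 5)  /* a result was rounded (inexact)    */

/* Returns 1 when PE is the only sticky exception flag that is set. */
static int only_precision_flag(uint32_t mxcsr)
{
    return (mxcsr & MXCSR_FLAGS_MASK) == MXCSR_PE;
}

/* Returns 1 when a denormal operand or an underflow has been recorded. */
static int saw_denormal_activity(uint32_t mxcsr)
{
    return (mxcsr & (MXCSR_DE | MXCSR_UE)) != 0;
}
```

PE is raised by nearly every operation whose result has to be rounded, so seeing it alone is the normal case and is generally not a sign of a slow path while the exception is masked.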
Post 13 Jul 2021, 12:28
bitRAKE
Calculate what your data bandwidth usage (in and out of the core(s)) is, and that will give you an idea of how close you are getting to ideal conditions.

192 kHz * 8 channels * 4 bytes ≈ 6 MB/s, but the virtual channels are probably consuming bandwidth as well. I imagine a scenario where all the virtual channels stay in cache while the audio data streams through. Coercing the processor to play along might be tricky, or maybe the buffers are too large to cache?
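
The same arithmetic as a small C sketch. The function name is made up, and the figure for the virtual channels assumes, purely for illustration, that each of the 248 virtual channels mentioned earlier in the thread is written once per frame; the real traffic depends on how the engine is built.

```c
#include <stdint.h>

/* Bytes per second for a stream of 4-byte single-precision samples. */
static uint64_t stream_bytes_per_second(uint64_t frame_rate_hz,
                                        uint64_t channels)
{
    const uint64_t bytes_per_sample = 4;   /* one single-precision float */
    return frame_rate_hz * channels * bytes_per_sample;
}
```

For 192000 frames per second this gives 6,144,000 bytes/s for the 8 input channels, and 190,464,000 bytes/s (about 190 MB/s) under the one-write-per-frame assumption for 248 channels.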

Post 13 Jul 2021, 13:39
Overclick
Yeah, I do it all in a single thread. There is no point in splitting it across threads because the data dependencies are very heavy.
It is all about a heavy filter design, but I process the data part by part, keeping the calculations close together, to make sure it fits in the cache. Also, every single register is in use...
Post 13 Jul 2021, 15:15
Overclick
Maybe it is not a big problem. This is how it looks on my old 6-core CPU. New processors will not even feel it.

bitRAKE, you asked which processor I use. Sorry, I was shy to answer. It is a Phenom 1100T at 4 GHz. Old, I know.


[Attachment: Capture.PNG]


Post 13 Jul 2021, 15:33
bitRAKE
Given what you've said, that looks excellent.

Post 13 Jul 2021, 21:10
Overclick
Quote:

Given what you've said, that looks excellent.

Thanks, friend. I'm trying to make the best converter on the market. It is difficult...
And it is not commissioned work. It is my f..n hobby, first of all for myself :)
Post 13 Jul 2021, 21:31
Overclick
My old version is here:
https://sourceforge.net/projects/stereo-to-7-1-converter/
But I have hated it ever since I started using the new engine :)
Post 13 Jul 2021, 21:36
Melissa



Joined: 12 Apr 2012
Posts: 125
Overclick wrote:
Hi,
I just want to share a strange situation. My stupid CPU doesn't like multiplying by zero. The whole project runs about a quarter faster when I add a tiny fake value to the streamed samples to make sure they are never empty (zero).
Code:
.data
  RequiredNoise dd 4 dup(0.0000000000001f)
.code
  movaps        xmm9,xword[RequiredNoise]
...
loop:
  ...
  movups        xmm1,[rbx]
  addps         xmm1,xmm9
  ...
  mulps         xmm1,...
    

You are not multiplying by zero here, but rather by a small number.
Post 18 Sep 2021, 11:45
Copyright © 1999-2020, Tomasz Grysztar.