flat assembler
Message board for the users of flat assembler.

Rewriting applications in Assembler for Benchmarking

zir_blazer



Joined: 05 Dec 2006
Posts: 66
zir_blazer 03 Nov 2008, 03:42
Though I regularly read and participate in this forum, I am barely an ASM programmer: I have never written anything beyond a simple Hello World. However, that doesn't mean I don't have some ideas...

I think everyone knows that hand-optimized ASM code should be the fastest thing you can aim for in any type of application. But how fast is it in the real world, on big, heavy, complex projects like games, compared to other languages? Modern games are obviously programmed in high-level languages (I suppose C++ is still the mainstream choice), but by how much could their performance improve if they were rewritten and optimized in ASM?
The first example that comes to mind is id Software's game engines, since they are usually released as open source and are good examples of complex, professional, commercial projects. If they were rewritten in ASM, how much could performance potentially increase? Has anyone attempted something like this in real life, rewriting games or other types of applications in ASM, so that there are solid numbers for comparison? I always thought that ASM could be useful for optimizing existing, demanding applications (for most of them, without source code, we would be talking about something hack-like) where you want every available resource to speed things up.
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20433
Location: In your JS exploiting you and your system
revolution 03 Nov 2008, 04:52
I always estimate about double the speed, although take that figure with a grain of salt: there are so many dependencies that generic figures are impossible.

The main thing to keep in mind is that only certain parts of a program can get a good benefit from ASM. Heavily used computation loops are a prime candidate for ASM optimisations. Fluffy stuff like GUIs and user interaction will mostly see little or no benefit from ASM optimisation.
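
One rough way to see why only the hot loops matter is Amdahl's law, which the post above does not name but which matches its reasoning. The sketch below is in C; the function name `overall_speedup` and its parameter names are made up for illustration.

```c
#include <assert.h>

/* Amdahl's law: overall speedup when a fraction `hot_fraction` (0..1)
   of the run time is made `local_speedup` times faster and the rest
   of the program is left unchanged. Names are illustrative only. */
double overall_speedup(double hot_fraction, double local_speedup)
{
    assert(hot_fraction >= 0.0 && hot_fraction <= 1.0);
    assert(local_speedup > 0.0);

    /* Untouched part keeps its cost; the hot part shrinks. */
    return 1.0 / ((1.0 - hot_fraction) + hot_fraction / local_speedup);
}
```

Under these assumptions, doubling the speed of code that accounts for half the run time gives only about 1.33x overall (`overall_speedup(0.5, 2.0)`), and the full 2x appears only if the whole program is hot (`overall_speedup(1.0, 2.0)`).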
roboman



Joined: 03 Dec 2006
Posts: 122
Location: USA
roboman 03 Nov 2008, 15:38
The other thing to remember is that many programs, such as games, already do have the speed critical sections written in inline asm
baldr



Joined: 19 Mar 2008
Posts: 1651
baldr 03 Nov 2008, 18:32
roboman,

Another thing to remember is that the operating system (under whose control these programs run) does have its speed-critical sections written in inline Visual Basic™, so don't expect much speed gain (unless your program is heavily CPU-bound). ;)
f0dder



Joined: 19 Feb 2004
Posts: 3175
Location: Denmark
f0dder 04 Nov 2008, 14:15
And keep in mind that large parts of games are GPU-dependent and not CPU-dependent nowadays...
baldr



Joined: 19 Mar 2008
Posts: 1651
baldr 04 Nov 2008, 18:10
f0dder,

Aha! It's about the GPU, "Loading..." (how did I miss it? ;)
f0dder



Joined: 19 Feb 2004
Posts: 3175
Location: Denmark
f0dder 05 Nov 2008, 07:36
baldr: loading screens might take time, but usually the CPU load isn't very high during load - at least not on any system I've had in the last few years. So I guess it's about disk I/O and throwing textures at the GPU?
zir_blazer



Joined: 05 Dec 2006
Posts: 66
zir_blazer 05 Nov 2008, 08:40
roboman wrote:
The other thing to remember is that many programs, such as games, already do have the speed critical sections written in inline asm

I recall that this was done with the Wolfenstein 3D engine (again, id Software), but that was at a time when you had to get the most out of the current hardware, not one where you just go and buy faster hardware to compensate for developer laziness or rushed products. I doubt that more than a small minority of software and games get special optimization treatment like that these days.
The only developer I know of who did a relatively modern game in ASM is the old-school Chris Sawyer ( http://www.chrissawyer.com/faq3.htm ), who I suppose, besides RollerCoaster Tycoon, did Transport Tycoon and Locomotion in ASM too. Besides games, there is console emulation (for those who ever used it): around 2002, NeoRageX for Neo Geo ROMs (from some arcade machines) was also made in ASM, and it made a world of difference on my old K6-II 500 MHz compared to the official MAME32 client. I recall being enthusiastic at the time because of Metal Slug 3: MAME32 emulated it like a PowerPoint presentation, while NeoRageX allowed full playability in real time. The same goes for No$gba running Nintendo DS games.


f0dder wrote:
And keep in mind that large parts of games are GPU-dependent and not CPU-dependent nowadays...

This depends on load balancing and varies from game to game, and also with graphics settings. At the high end of the performance spectrum the GPU matters much more, but at the low end the processor usually helps a little, except when you are seriously bottlenecked by the GPU even at minimal settings (IGPs).
If anything, I don't think the idea of a sort of hybrid, modular engine that lets the CPU take care of a few extra things when there is enough processing headroom should be dismissed (SMP viability included, but these two ideas should be ridiculously hard to implement, as they would require major rework even of engines that have source code available, and let's simply not talk about having to hack your way into improving one you don't have the source for). Again, another engine that I recall could render either in software or through OpenGL was Quake 2 (not an exact case, though, as the two were not the same version of the engine merely targeted at the CPU or the GPU; the OpenGL one was vastly superior graphically).


f0dder wrote:
baldr: loading screens might take time, but usually the CPU load isn't very high during load - at least not on any system I've had in the last few years. So I guess it's about disk I/O and throwing textures at the GPU?

But the code that does that loading could be optimized and improved, too.
Sean4CC



Joined: 15 Apr 2015
Posts: 14
Sean4CC 16 Apr 2015, 00:39
Both gcc and the java hotspot compiler will do a better job of optimizing machine code than you can, unless you have lots and lots of time on your hands. Sometimes you can do better with a smart sequence of instructions that the compiler can't possibly work out. I found that haddps works well but seems to be unknown to compilers. The Java HotSpot just-in-time compiler is much better at optimizing than gcc.
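
For anyone unfamiliar with haddps (the SSE3 horizontal add), here is a small scalar model in plain C of what the instruction computes, plus the usual trick of applying it twice to sum four floats. The function names `haddps_model` and `horizontal_sum` are hypothetical; this only models the semantics for illustration. It is not the instruction itself and says nothing about what any compiler emits.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar model of "HADDPS dst, src" on four-float vectors:
   the low two results are pairwise sums of dst,
   the high two results are pairwise sums of src. */
void haddps_model(const float dst[4], const float src[4], float out[4])
{
    assert(dst != NULL && src != NULL && out != NULL);

    /* Compute into temporaries so that out may alias dst or src. */
    float r0 = dst[0] + dst[1];
    float r1 = dst[2] + dst[3];
    float r2 = src[0] + src[1];
    float r3 = src[2] + src[3];

    out[0] = r0;
    out[1] = r1;
    out[2] = r2;
    out[3] = r3;
}

/* Sum of four floats via two horizontal adds, mirroring the
   two-instruction sequence "haddps x, x" followed by "haddps x, x". */
float horizontal_sum(const float v[4])
{
    float t[4];
    haddps_model(v, v, t);  /* t = {v0+v1, v2+v3, v0+v1, v2+v3} */
    haddps_model(t, t, t);  /* every lane now holds v0+v1+v2+v3 */
    return t[0];
}
```

In real code the same two-step reduction would be written with the instruction itself or an intrinsic; the model is only meant to show which lanes get added together.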
The problem with both of those is undefined behavior. For C:
http://blog.regehr.org/archives/213
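
As a small illustration of the C side: signed integer overflow is one commonly cited case of undefined behavior, so a check written as `a + b < a` is not reliable for signed ints, since the compiler is allowed to assume the overflow never happens. Below is a minimal sketch of a check that stays within defined behavior; `checked_add` is a hypothetical helper name, not something from the linked article.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Adds a and b without ever evaluating an overflowing expression.
   Returns true and stores the result in *sum when it fits in an int,
   returns false (leaving *sum untouched) when it would overflow. */
bool checked_add(int a, int b, int *sum)
{
    assert(sum != NULL);

    /* Compare against the limits before adding, so the addition
       below is only performed when it is known to be in range. */
    if ((b > 0 && a > INT_MAX - b) || (b < 0 && a < INT_MIN - b))
        return false;

    *sum = a + b;
    return true;
}
```

The comparisons themselves cannot overflow: `INT_MAX - b` is only evaluated for positive `b`, and `INT_MIN - b` only for negative `b`.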
With Java, low-level operations are fully defined, but memory behavior is undefined when dealing with large data sets. It seems you are supposed to just use trial and error to pick memory settings. That is nonsense and can mean your code can flake at any moment. However, Java does have memory-mapped files, which are a way around the problem.
If you are writing security related software or code that must work reliably then using assembly is a good way to have fully defined behavior, even if the code is actually a little slower than using a compiler.
gens



Joined: 18 Feb 2013
Posts: 161
gens 16 Apr 2015, 15:12
try reading some code that gcc spits out
JIT even better
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20433
Location: In your JS exploiting you and your system
revolution 16 Apr 2015, 15:32
Sean4CC wrote:
Both gcc and the java hotspot compiler will do a better job of optimizing machine code than you can, unless you have lots and lots of time on your hands ...
... or you are familiar with the common optimising techniques used by the compilers.

Compilers aren't magic. They can do okay sometimes if the algorithm can be expressed well in the language used. But they can't make use of the full CPU instruction set and will miss many little tricks and techniques.


Last edited by revolution on 17 Apr 2015, 12:39; edited 1 time in total
randall



Joined: 03 Dec 2011
Posts: 155
Location: Poland
randall 17 Apr 2015, 12:03
revolution wrote:
Sean4CC wrote:
Both gcc and the java hotspot compiler will do a better job of optimizing machine code than you can, unless you have lots and lots of time on your hands ...
... or you are familiar with the common optimising techniques used by the compilers.

Compilers aren't magic. They can do okay sometimes if the algorithm can be expressed well in the language used. But they can't make use of the full CPU instruction set and will miss many little tricks and techniques.


I agree with revolution. It is not that hard to beat the compiler. The most important thing, in my opinion, is to know the machine and use tools (IACA, VTune). Recently I have been reading a lot about Haswell, AVX, AVX2 and FMA.
I have written a simple raymarching demo in ASM and in C (using intrinsics). The hand-written ASM version is about 11% faster than the C version compiled with MSVC 2013 (in a simple test scene).

The ASM version is here if anyone is interested: https://github.com/michal-z/qjulia

It requires an AVX2- and FMA-capable CPU.
This is a WIP; my main goal is to render Quaternion Julia Sets in real time on the CPU.

Copyright © 1999-2024, Tomasz Grysztar. Also on GitHub, YouTube.

Website powered by rwasa.