flat assembler
Message board for the users of flat assembler.

Index > Main > Beauty in x86 assembly?

Goto page 1, 2  Next
Author
Thread Post new topic Reply to topic
Tomasz Grysztar



Joined: 16 Jun 2003
Posts: 8367
Location: Kraków, Poland
Tomasz Grysztar 27 Apr 2018, 20:15
A friend of mine that knows that I deal a lot with x86 assembly sent me a link to this discussion today:
What are some examples of beautiful x86 assembly code?
I wonder what would the users of this board answer. I see that someone there already mentioned HeavyThing - this would probably be my pick, too. In recent years this was the code that impressed me the most - even though, or perhaps exactly for the reason that its style is in many aspects different from mine.

And I could not overlook someone mentioning "xor eax,eax" as "the coolest one liner". How many disputes over this simple instruction have we had already? Perhaps never enough.

Speaking of which, if we talk not about beautifully structured programs, but just a tiny tricky snippets that do impressive things with at most few instructions, then my favorite is perhaps a trick to convert a nibble to hexadecimal digits without jumps of tables:
The Svin wrote:
Code:
cmp al,10 ; if x < 10, set CF = 1 
sbb al,69h ; 0-9: 96h .. 9Fh, A-F: A1h..A6h 
das ; 0-9: subtr. 66h -> 30h-39h, 
; A-F: subtr. 60h -> 41h..46h    
After seeing it for the first time I immediately started using it myself, enchanted by the magic of this technique (though, sadly, it is not possible to do the trick in long mode). It blew my mind when I tried to think how someone could design this. The original authorship might have been forgotten, that old thread suggests it may date back to 1984.
Post 27 Apr 2018, 20:15
View user's profile Send private message Visit poster's website Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20522
Location: In your JS exploiting you and your system
revolution 28 Apr 2018, 02:15
I'd not seen the two instruction Fibonacci calculator before. Very nice.
Post 28 Apr 2018, 02:15
View user's profile Send private message Visit poster's website Reply with quote
DimonSoft



Joined: 03 Mar 2010
Posts: 1228
Location: Belarus
DimonSoft 28 Apr 2018, 07:28
Tomasz Grysztar wrote:
Speaking of which, if we talk not about beautifully structured programs, but just a tiny tricky snippets that do impressive things with at most few instructions, then my favorite is perhaps a trick to convert a nibble to hexadecimal digits without jumps of tables:
The Svin wrote:
Code:
cmp al,10 ; if x < 10, set CF = 1 
sbb al,69h ; 0-9: 96h .. 9Fh, A-F: A1h..A6h 
das ; 0-9: subtr. 66h -> 30h-39h, 
; A-F: subtr. 60h -> 41h..46h    
After seeing it for the first time I immediately started using it myself, enchanted by the magic of this technique (though, sadly, it is not possible to do the trick in long mode). It blew my mind when I tried to think how someone could design this. The original authorship might have been forgotten, that old thread suggests it may date back to 1984.

AFAIK, it’s called Allison’s algorithm.
Post 28 Apr 2018, 07:28
View user's profile Send private message Visit poster's website Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20522
Location: In your JS exploiting you and your system
revolution 28 Apr 2018, 08:05
I suspect that is the only usage of das in any code. Hehe, I haven't seen all code of course, but it is one of those rarely used instructions that originally had a very narrow application scope. And suddenly it found a use in something completely unexpected.

And perhaps even rarer would be aas. I've never seen it used anywhere, ever. Except in test code designed to show what it does, but not in anything that is used in production code.
Post 28 Apr 2018, 08:05
View user's profile Send private message Visit poster's website Reply with quote
Tomasz Grysztar



Joined: 16 Jun 2003
Posts: 8367
Location: Kraków, Poland
Tomasz Grysztar 28 Apr 2018, 09:04
revolution wrote:
I'd not seen the two instruction Fibonacci calculator before. Very nice.
It is nice, though it is actually nothing more than just making XADD do what it is designed to do. It is similar to how sometimes a couple of x86 instructions can be arranged that the output of one fits perfectly as an input of another - for example when chaining DIV instructions to divide a large number. These "tricks" are there by design and the beauty lies in fact in the design of the instruction set itself.

DimonSoft wrote:
AFAIK, it’s called Allison’s algorithm.
Do you have a source? Does the name refer to the underlying method (as the "algorithm" part seems to imply) or to the specific implementation? Because all I could find is the name referring to older, 8080 variety:
Code:
ADD  AL, 90h 
DAA 
ADC  AL, 40h 
DAA     
But this one seems a bit easier to wrap one's head around. Though it perhaps gives an insight into an evolution of the trick, perhaps the final one was a bit easier to invent when there was an earlier one as an inspiration.
Post 28 Apr 2018, 09:04
View user's profile Send private message Visit poster's website Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20522
Location: In your JS exploiting you and your system
revolution 28 Apr 2018, 09:35
Tomasz Grysztar wrote:
revolution wrote:
I'd not seen the two instruction Fibonacci calculator before. Very nice.
It is nice, though it is actually nothing more than just making XADD do what it is designed to do.
Are you sure that xadd was designed to compute Fibonacci numbers? I thought its original intention was something else. And it takes a bit of inspiration to realise that it can also be used for Fibonacci.
Post 28 Apr 2018, 09:35
View user's profile Send private message Visit poster's website Reply with quote
Tomasz Grysztar



Joined: 16 Jun 2003
Posts: 8367
Location: Kraków, Poland
Tomasz Grysztar 28 Apr 2018, 10:01
revolution wrote:
Are you sure that xadd was designed to compute Fibonacci numbers? I thought its original intention was something else. And it takes a bit of inspiration to realise that it can also be used for Fibonacci.
The function of XADD is to accumulate new component into a sum while preserving the previous value of the sum for further use. And this is a core of what Fibonacci sequence is about - you just need to conflate the "component" with "previous sum". The mental jump is not very far here, at least to me it seems this way.
Post 28 Apr 2018, 10:01
View user's profile Send private message Visit poster's website Reply with quote
alexfru



Joined: 23 Mar 2014
Posts: 80
alexfru 28 Apr 2018, 10:41
I think x86 code is more of an accidental beauty. There are just a few good things about the instruction set: it's compact, it's extensible, an instruction can encode the full memory operand offset (not just 12 bits or something). The rest is pretty ugly: too few registers, a bunch of non-orthogonal instructions, some of them with hard-coded operands.

The MIPS equivalent of the hex->ASCII is just one instruction longer:
Code:
# a0 = input: 0 through 15
# v0 = output: corresponding ASCII char: '0' ... '9', 'A' ... 'F'
addiu v0, a0, 0x30 # output candidate 1: input + 0x30
addiu t0, a0, 0x37 # output candidate 2: input + 0x37
sltiu t1, a0, 10 # input < 10?
movz  v0, t0, t1 # if input >= 10, choose output candidate 2
    

No tricky instructions, no hard-coded registers. Everything's simple and regular.
Post 28 Apr 2018, 10:41
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20522
Location: In your JS exploiting you and your system
revolution 28 Apr 2018, 11:08
alexfru wrote:
I think x86 code is more of an accidental beauty. There are just a few good things about the instruction set: it's compact, it's extensible, an instruction can encode the full memory operand offset (not just 12 bits or something).
Well at least 31-bits of signed offset. But there is a full 64-bit constant load with mov.
alexfru wrote:
The MIPS equivalent of the hex->ASCII is just one instruction longer:
Code:
# a0 = input: 0 through 15
# v0 = output: corresponding ASCII char: '0' ... '9', 'A' ... 'F'
addiu v0, a0, 0x30 # output candidate 1: input + 0x30
addiu t0, a0, 0x37 # output candidate 2: input + 0x37
sltiu t1, a0, 10 # input < 10?
movz  v0, t0, t1 # if input >= 10, choose output candidate 2
    

No tricky instructions, no hard-coded registers. Everything's simple and regular.
ARM32 can do it in three instructions also, but the byte count is larger - 12 bytes.
Post 28 Apr 2018, 11:08
View user's profile Send private message Visit poster's website Reply with quote
Tomasz Grysztar



Joined: 16 Jun 2003
Posts: 8367
Location: Kraków, Poland
Tomasz Grysztar 28 Apr 2018, 11:10
alexfru wrote:
No tricky instructions, no hard-coded registers. Everything's simple and regular.
Also a bit boring. Wink Nonetheless, from CMOV to AVX-512, x86 also evolved in this direction. Some of the details, like the design of VEX prefix, I also admired as very clever.
Post 28 Apr 2018, 11:10
View user's profile Send private message Visit poster's website Reply with quote
DimonSoft



Joined: 03 Mar 2010
Posts: 1228
Location: Belarus
DimonSoft 28 Apr 2018, 20:47
Tomasz Grysztar wrote:
DimonSoft wrote:
AFAIK, it’s called Allison’s algorithm.
Do you have a source? Does the name refer to the underlying method (as the "algorithm" part seems to imply) or to the specific implementation? Because all I could find is the name referring to older, 8080 variety:
Code:
ADD  AL, 90h 
DAA 
ADC  AL, 40h 
DAA     
But this one seems a bit easier to wrap one's head around. Though it perhaps gives an insight into an evolution of the trick, perhaps the final one was a bit easier to invent when there was an earlier one as an inspiration.

Well, not much sources, but… Both variations are described in [url=https://groups.google.com/forum/#!msg/comp.lang.asm.x86/TJg1gpsY8FQ/khvpAflvzpMJ]this old comp.lang.asm.x86 thread[/url] along with explanations on how/why they work. I feel I’ve seem a page some day that told a bit more about who this Allison is/was but can’t find it now.
Post 28 Apr 2018, 20:47
View user's profile Send private message Visit poster's website Reply with quote
Furs



Joined: 04 Mar 2016
Posts: 2597
Furs 28 Apr 2018, 23:05
alexfru wrote:
The MIPS equivalent of the hex->ASCII is just one instruction longer:
Code:
# a0 = input: 0 through 15
# v0 = output: corresponding ASCII char: '0' ... '9', 'A' ... 'F'
addiu v0, a0, 0x30 # output candidate 1: input + 0x30
addiu t0, a0, 0x37 # output candidate 2: input + 0x37
sltiu t1, a0, 10 # input < 10?
movz  v0, t0, t1 # if input >= 10, choose output candidate 2
    

No tricky instructions, no hard-coded registers. Everything's simple and regular.
And bloated as hell.
Post 28 Apr 2018, 23:05
View user's profile Send private message Reply with quote
alexfru



Joined: 23 Mar 2014
Posts: 80
alexfru 29 Apr 2018, 00:20
Furs wrote:
alexfru wrote:
The MIPS equivalent of the hex->ASCII is just one instruction longer:
Code:
# a0 = input: 0 through 15
# v0 = output: corresponding ASCII char: '0' ... '9', 'A' ... 'F'
addiu v0, a0, 0x30 # output candidate 1: input + 0x30
addiu t0, a0, 0x37 # output candidate 2: input + 0x37
sltiu t1, a0, 10 # input < 10?
movz  v0, t0, t1 # if input >= 10, choose output candidate 2
    

No tricky instructions, no hard-coded registers. Everything's simple and regular.
And bloated as hell.

LOL
Post 29 Apr 2018, 00:20
View user's profile Send private message Reply with quote
bitRAKE



Joined: 21 Jul 2003
Posts: 4166
Location: vpcmpistri
bitRAKE 29 Apr 2018, 01:12
On another machine, I have an archive going back over 20 years - lots of saved snippets and whole projects from dubious sources. But for right now, when so much of the web has moved on - the MadWizard remains: Smile http://www.madwizard.org/programming/snippets

...and http://www.azillionmonkeys.com/qed/asmexample.html

Bubble sort:
Code:
; by Andrew Howe

outerloop:  lea     ebx,[edi+ecx*4]
            mov     eax,[edi]
cmploop:    sub     ebx,4
            cmp     eax,[ebx]
            jle     notyet
            xchg    eax,[ebx]
notyet:     cmp     ebx,edi
            jnz     cmploop
            stosd
            loop    outerloop    
...is one of my favorites.
Post 29 Apr 2018, 01:12
View user's profile Send private message Visit poster's website Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20522
Location: In your JS exploiting you and your system
revolution 29 Apr 2018, 01:47
That page is a good example of how we need to change our code to have it perform well on whatever is the current batch of CPUs. In many cases there the code changes radically in order to retain optimal performance. But the performance is entirely dependant upon the specific CPU in use. The situation is still the same today. If you want to keep your code running as fast as possible you have to customise it for each different CPU/memory/mobo combination.
Post 29 Apr 2018, 01:47
View user's profile Send private message Visit poster's website Reply with quote
bitRAKE



Joined: 21 Jul 2003
Posts: 4166
Location: vpcmpistri
bitRAKE 29 Apr 2018, 02:06
Couldn't agree more, revolution. Projects like GMPlib do that sort of thing. (Even though they don't always have the best algorithms, imho.)


...and a Windows 64 snip:
Code:
; structure is created on stack by ENTER instruction.
    mov rbp,$00003FFF00000008           ; INITCOMMONCONTROLSEX trickery
    enter 32,0                          ;
    mov ecx,ebp                         ; #32#
    call [InitCommonControlsEx]         ;    
Post 29 Apr 2018, 02:06
View user's profile Send private message Visit poster's website Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20522
Location: In your JS exploiting you and your system
revolution 29 Apr 2018, 02:24
Plus with things like OOO execution and LRU caches, when trying to do a short timing test, it is almost impossible to get consistent results. Added to the fact that some CPUs have decoupled the TSC from the CPU clock making it so much harder to get precise results.

The solution is to simply forget about it and concentrate on the overall performance, rather than individual parts. And even then it ain't easy; change one instruction and suddenly the whole chain of events is different and you get a huge change. All the while accounting for the fact that only on that CPU do you see such effects.

Non-determinism FTW Very Happy
Post 29 Apr 2018, 02:24
View user's profile Send private message Visit poster's website Reply with quote
bitRAKE



Joined: 21 Jul 2003
Posts: 4166
Location: vpcmpistri
bitRAKE 29 Apr 2018, 03:15
You'd think it'd be in the best interest of the CPU manufacturers to actually publish accurate CPU details, but the marketing department has convinced them that that is just not the case. This is seen most blatantly in the consumer line versus the Xeon line of Intel chips.

I don't think anyone is doing short timing test anymore - maybe for the last decade. There are different results for each cache level, and the length of the test needs to account for that. I code for size by default for that reason - code that runs once takes longer to load than to execute.
Post 29 Apr 2018, 03:15
View user's profile Send private message Visit poster's website Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20522
Location: In your JS exploiting you and your system
revolution 29 Apr 2018, 03:35
bitRAKE wrote:
You'd think it'd be in the best interest of the CPU manufacturers to actually publish accurate CPU details, but the marketing department has convinced them that that is just not the case. This is seen most blatantly in the consumer line versus the Xeon line of Intel chips.
It isn't just the CPU though. Even if you knew exactly every transistor, it is still the code you are running that affects things. On some CPUs the OOO buffer is more than 100 instructions long. So you also have to know every one of those 100+ instructions ahead of your snippet, and which port they will go into, and what instructions are currently in each port, and how many ports you have, and whether or not the memory read/write buffers are full, and the current state of the BTB and caches, whether or not another SMP instruction stream is interleaved with your stream, etc. etc. etc. It's mind bogglingly complex.
Post 29 Apr 2018, 03:35
View user's profile Send private message Visit poster's website Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20522
Location: In your JS exploiting you and your system
revolution 29 Apr 2018, 03:44
Trying to keep on topic Confused

Here is an ARM snippet to range check a value in two instructions:
Code:
        cmp     r0,'0'          ;carry is set if r0 >= '0'
        rsbcss  ip,r0,'9'       ;carry is set if r0 <= '9'. ip is not used    
Carry is set if '0' <= r0 <= '9'
Post 29 Apr 2018, 03:44
View user's profile Send private message Visit poster's website Reply with quote
Display posts from previous:
Post new topic Reply to topic

Jump to:  
Goto page 1, 2  Next

< Last Thread | Next Thread >
Forum Rules:
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum
You cannot attach files in this forum
You can download files in this forum


Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.

Website powered by rwasa.