flat assembler
Message board for the users of flat assembler.

Suggestion: double quoted string accepts escape sequences

Jessé 01 Jun 2025, 15:13
Today I woke up with a fun idea in mind, which led me to write this simple application that processes standard escape sequences:

Code:
format ELF64 executable 3
entry Start

include 'anon_label.inc'
include 'fastcall_BETA2.inc'
include 'stdio.inc'

_data                              ;   08   19   2A   3B   4C   5D   6E   7F
     escapetable:                  db 255, 255, '"', 255, 255, 255, 255, "'" ; 20 - 27
                                   db 255, 255, 255, 255, 255, 255, 255, 255 ; 28 - 2F
                                   db 254, 254, 254, 254, 254, 254, 254, 254 ; 30 - 37
                                   db 255, 255, 255, 255, 255, 255, 255, '?' ; 38 - 3F
                                   db 255, 255, 255, 255, 255, 255, 255, 255 ; 40 - 47
                                   db 255, 255, 255, 255, 255, 255, 255, 255 ; 48 - 4F
                                   db 255, 255, 255, 255, 255, 255, 255, 255 ; 50 - 57
                                   db 255, 255, 255, 255, '\', 255, 255, 255 ; 58 - 5F
                                   db 255, 007, 008, 255, 255, 027, 012, 255 ; 60 - 67
                                   db 255, 255, 255, 255, 255, 255, 010, 255 ; 68 - 6F
                                   db 255, 255, 013, 255, 009, 255, 011, 255 ; 70 - 77
                                   db 253, 255, 255, 255, 255, 255, 255, 255 ; 78 - 7F

_code
          ParseEscapedString:      endbr64   ; rdi = destination buffer; rsi = source string
                                   push      rbx
                                   lea       rbx, [escapetable-20h]
               .nextchar:          lodsb
                                   test      al, al
                                   jz        .endsuccess
                                   cmp       al, '\'
                                   je        .escape
                                   stosb
                                   jmp       .nextchar
               .endsuccess:        stosb               ; al = 0 here: terminate the output string for puts
                                   clc
                                   pop       rbx
                                   ret
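               .escape:            lodsb
                                   cmp       al, 20h   ; below the table range (control chars, NUL)?
                                   jae       .lookup
                                   dec       rsi       ; re-read that byte as plain text...
                                   mov       al, '\'   ; ...after emitting the lone backslash
                                   jmp       .store
               .lookup:            test      al, al
                                   js        .ignoreescape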
                                   xlatb
                                   test      al, -1
                                   jns       .store
                                   cmp       al, 254
                                   je        .octal
                                   cmp       al, 253
                                   je        .hex
               .ignoreescape:      mov       ax, [rsi-2]
                                   stosw
                                   jmp       .nextchar
               .store:             stosb
                                   jmp       .nextchar
               .octal:             mov       eax, [rsi-1]
                                   mov       ch, 1     ; Invalid octal flag before process
                                   cmp       al, '0'
                                   jb        .endoctal
                                   cmp       al, '7'
                                   ja        .endoctal
                                   xor       ecx, ecx  ; Valid octal + cl = number of octal chars
                                   sub       al, '0'
                                   movzx     edx, al
                                   inc       cl
                                   shr       eax, 8
                                   cmp       al, '0'
                                   jb        .endoctal
                                   cmp       al, '7'
                                   ja        .endoctal
                                   sub       al, '0'
                                   shl       edx, 3
                                   inc       cl
                                   or        dl, al
                                   shr       eax, 8
                                   cmp       al, '0'
                                   jb        .endoctal
                                   cmp       al, '7'
                                   ja        .endoctal
                                   sub       al, '0'
                                   shl       edx, 3
                                   or        dl, al
                                   inc       cl
               .endoctal:          test      ch, ch
                                   jnz       .ignoreescape
                                   lea       rsi, [rsi+rcx-1]
                                   mov       al, dl
                                   stosb
                                   jmp       .nextchar
               .hex:               mov       dx, [rsi] ; supporting 2 char hex \xNN
                                   mov       ch, 1     ; Set invalid flag before process
                                   cmp       dx, '00'
                                   jb        .endhex
                                   cmp       dx, 'ff'
                                   ja        .endhex
                                   sub       dx, '00'
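                                   cmp       dl, 9
                                   jbe       @f
                                   sub       dl, 7
                                   cmp       dl, 0Ah   ; reject ':'..'@', which would otherwise map to 3..9
                                   jb        .endhex
                                   cmp       dl, 0Fh
                                   jbe       @f
                                   sub       dl, 20h
                                   cmp       dl, 0Fh
                                   ja        .endhex
                                   cmp       dl, 0Ah
                                   jb        .endhex
                              @@   cmp       dh, 9
                                   jbe       @f
                                   sub       dh, 7
                                   cmp       dh, 0Ah   ; same check for the second digit
                                   jb        .endhex
                                   cmp       dh, 0Fh
                                   jbe       @f
                                   sub       dh, 20h
                                   cmp       dh, 0Fh
                                   ja        .endhex
                                   cmp       dh, 0Ah
                                   jb        .endhex
                              @@   xor       ch, ch    ; Valid hex escape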
               .endhex:            test      ch, ch
                                   jnz       .ignoreescape
                                   shl       dl, 4
                                   or        dl, dh
                                   mov       al, dl
                                   stosb
                                   add       rsi, 2
                                   jmp       .nextchar

          Start:                   endbr64
                                   cmp       [rsp], dword 2
                                   jne       .err0
                                   mov       rdx, [stdout]
                                   mov       rcx, [stderr]
                                   mov       rdx, [rdx]
                                   mov       rcx, [rcx]
                                   mov       [stdout], rdx
                                   mov       [stderr], rcx
                                   mov       rbp, [rsp+16]  ; argv[1]
                                   mov       rdi, [rsp+16]  ; argv[1]
                                   xor       al, al
                                   mov       ecx, -1        ; 4 GB string limit
                                   repne     scasb
                                   not       ecx            ; String size with \0 char
                                   malloc(ecx);
                                   test      rax, rax
                                   jz        .err1
                                   mov       rsi, rbp
                                   mov       rdi, rax
                                   mov       rbp, rax
                                   call      ParseEscapedString
                                   jc        .err2
                                   puts(rbp);
                                   free(rbp);
                                   exit(0);

               .err2:              free(rbp);
                                   exit(3);

               .err1:              perror("malloc failed");
               .err0:              exit(1);
    


The results are very interesting, because one may have the option of escaping any character inside a string token, without leaving the string, which is actually a practical way of defining a string. Single-quoted strings, in this case, would not be parsed: they stay as they are, with no escape sequence support.
I made the function 'ParseEscapedString' as portable as I could. The only exception is that it is 64-bit, but swapping the few 64-bit registers for their 32-bit counterparts (to make it fit the fasmg code) is safe. For the same reason, I also avoided any 64-bit-only registers inside the function.
The list of escape sequences I implemented came from this link and from other material I read online. All of them are supported except the Unicode escapes (in my example code), because Unicode characters can also be written as hex or octal escape sequences, or typed in directly, as I show in the image with the emoji example. I also made it leave the string unmodified when an invalid escape sequence is found: if the string contains '\w', then '\w' comes out. Likewise, if it contains a bare character that would normally appear in an escape sequence (e.g., '?' instead of '\?'), that character is treated as normal text and goes to the output.
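
For readers who do not read x86 assembly, the behavior described above can be sketched in C. This is only an illustration of the rules as stated in this post (simple escapes, one to three octal digits, exactly two hex digits, invalid sequences copied through unchanged), not the code from the listing; the names parse_escaped_string, simple_escape and hex_value are made up for this sketch.

```c
#include <stddef.h>

/* Hypothetical helper: value of a hex digit, or -1 if c is not one. */
static int hex_value(unsigned char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

/* Hypothetical helper: byte for a one-character escape, or -1 if unknown. */
static int simple_escape(unsigned char c)
{
    switch (c) {
    case '"':  return '"';
    case '\'': return '\'';
    case '?':  return '?';
    case '\\': return '\\';
    case 'a':  return 7;
    case 'b':  return 8;
    case 'e':  return 27;   /* common extension, not part of ISO C */
    case 'f':  return 12;
    case 'n':  return 10;
    case 'r':  return 13;
    case 't':  return 9;
    case 'v':  return 11;
    default:   return -1;
    }
}

/* Decodes src into dst and returns the number of bytes written (without NUL).
   The output is never longer than the input, so strlen(src) + 1 bytes suffice. */
size_t parse_escaped_string(char *dst, const char *src)
{
    const unsigned char *s = (const unsigned char *)src;
    char *d = dst;

    while (*s != '\0') {
        if (*s != '\\') {                          /* ordinary byte */
            *d++ = (char)*s++;
            continue;
        }
        int v = simple_escape(s[1]);
        if (v >= 0) {                              /* \n, \t, \\ ... */
            *d++ = (char)v;
            s += 2;
        } else if (s[1] >= '0' && s[1] <= '7') {   /* \N, \NN or \NNN */
            unsigned value = 0;
            int digits = 0;
            s++;
            while (digits < 3 && *s >= '0' && *s <= '7') {
                value = value * 8 + (unsigned)(*s - '0');
                s++;
                digits++;
            }
            *d++ = (char)(value & 0xFF);
        } else if (s[1] == 'x' && hex_value(s[2]) >= 0 && hex_value(s[3]) >= 0) {
            *d++ = (char)(hex_value(s[2]) * 16 + hex_value(s[3]));
            s += 4;
        } else {
            /* Invalid escape or trailing backslash: keep the backslash;
               the following bytes are then copied as ordinary text. */
            *d++ = (char)*s++;
        }
    }
    *d = '\0';
    return (size_t)(d - dst);
}
```

With this sketch, the input `\x41\101\w` decodes to `AA\w`: the hex and octal escapes both produce 'A', and the unknown `\w` passes through untouched.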

This is of course only a suggestion with a working example. I don't know whether it fits the philosophy that Tomasz and other users have for the fasmg assembler, but in my opinion it would be a good addition to the versatility of the already versatile fasmg. I also don't know whether adding this functionality would be too much of a challenge because, so far, I still have not figured out the internals of fasmg by reading its source code. Despite being assembly, it is such a sophisticated "machine"...


Attachment: escape_preprocessor_output.png
Description: Everything shown as command line argument should fit a single double quoted string.



Tomasz Grysztar 01 Jun 2025, 16:26
Jessé wrote:
The results are very interesting, because one may have the option of escaping any character inside a string token, without leaving the string, which is actually a practical way of defining a string.
The classic assembly language string syntax is even more practical in my view, since the only character that would make you leave the string is the quote itself, and you can just double it to make it part of the string. I don't count line breaks, as assembly is a line-based language, but for these assembly has legacy idioms, too.
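
As a rough illustration of the doubling rule described above, here is a minimal C sketch of reading one classic quoted string. It assumes a string may be delimited by either ' or ", and that the delimiter written twice stands for one literal quote character. The function name read_classic_string is hypothetical, and this is not how fasm or fasmg actually tokenizes.

```c
#include <stddef.h>

/* Reads one quoted string starting at src[0] and copies its contents to dst.
   Returns the number of source characters consumed, or 0 if src does not
   start with a quote or the string is unterminated (dst is then unspecified).
   dst must have room for strlen(src) bytes. */
size_t read_classic_string(char *dst, const char *src)
{
    char quote = src[0];
    size_t i = 1;

    if (quote != '\'' && quote != '"')
        return 0;

    for (;;) {
        char c = src[i];
        if (c == '\0')
            return 0;                      /* ran off the end of the line */
        if (c == quote) {
            if (src[i + 1] == quote) {     /* doubled quote: one literal quote */
                *dst++ = quote;
                i += 2;
                continue;
            }
            *dst = '\0';                   /* single quote: end of the string */
            return i + 1;
        }
        *dst++ = c;                        /* everything else, including '\', is literal */
        i++;
    }
}
```

Under these assumptions the source text `'it''s'` yields `it's`, and a backslash inside the quotes has no special meaning.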

That said, I was considering making string syntax a switchable option, similar to the RETAINCOMMENTS/ISOLATELINES directives, as this would fit fasm's design philosophy. This would not be possible without expanding the format of the internal raw token stream*, to accommodate two potentially different tokenizations of every line, and it would require a lot of attention. So far I have not felt that it would be worth the effort, but it's something I keep in mind.

___
* The fasmg engine uses two different formats for token streams: one for raw tokens created from source text, and one for preprocessed tokens created by macros and symbolic definitions. Only the first one would require changes, which helps limit the number of affected places. Still, it would require careful handling.
Jessé 01 Jun 2025, 16:42
Indeed, you're right.
This idea is taken straight from the C standard; in other words, it is entirely separate from pure assembly.
I also fully agree with the switchable option (off by default). When I posted this, I was thinking that, as a permanent feature, it could break existing code from people who already use fasmg, forcing them to adapt it. That's not great, for sure.
