flat assembler
Message board for the users of flat assembler.

Skip BOM in sources
Jin X



Joined: 06 Mar 2004
Posts: 133
Location: Russia
Jin X 10 Feb 2024, 15:45
Hello Tomasz.
Please let fasm 1 skip the Unicode BOM signature in sources.
I often use Unicode in comments and would like to add BOM signatures.
MatQuasar



Joined: 25 Oct 2023
Posts: 105
MatQuasar 11 Feb 2024, 06:19
Comments in the past may help:

revolution wrote:
But UTF-8 should work for saving and loading text files processed by fasm. If you ensure that there is not a UTF-8 signature at the start of the file then there should be no difficulty.

But even if you have the UTF-8 signature then you can make the first line be a colon to make it a label. It would look like this:

Code:
: ; <--- editor hidden UTF-8 signature now becomes a redundant label in fasm
; code goes here
...
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20299
Location: In your JS exploiting you and your system
revolution 11 Feb 2024, 06:55
You can also combine the colon with the first instruction/directive.
Code:
:format elf executable
mov eax, 1
int 0x80    
Code:
~ hd BOM.asm 
00000000  ef bb bf 3a 66 6f 72 6d  61 74 20 65 6c 66 20 65  |...:format elf e|
00000010  78 65 63 75 74 61 62 6c  65 0a 6d 6f 76 20 65 61  |xecutable.mov ea|
00000020  78 2c 20 31 0a 69 6e 74  20 30 78 38 30           |x, 1.int 0x80|
0000002d
~ fasm BOM.asm && ./BOM 
flat assembler  version 1.73.31  (16384 kilobytes memory)
1 passes, 91 bytes.
~     
macomics



Joined: 26 Jan 2021
Posts: 926
Location: Russia
macomics 11 Feb 2024, 11:11
And then what about this?
main.asm
Code:
: ; BOM
format  ELF64 executable 3
segment executable
entry   $
        call    secondProc
        mov     eax, 60
        xor     dil, dil
        syscall

include "second.asm"    

second.asm
Code:
: ; BOM
secondProc:
        mov     edx, .length
        lea     rsi, [.hello]
        push    1
        pop     rdi
        mov     eax, edi
        syscall
  .hello        db 'Hello world!'
  .length = $ - .hello    

Code:
$ fasm -m 102400 main.asm
flat assembler  version 1.73.32  (102400 kilobytes memory, x64)
second.asm [1]:
: ; BOM
processed: :
error: symbol already defined.    


You can fix it like this:
Code:
=0 ; BOM
format  ELF64 executable 3
segment executable
entry   $
        call    secondProc
        mov     eax, 60
        xor     dil, dil
        syscall

include "second.asm"    

Code:
=0 ; BOM
secondProc:
        mov     edx, .length
        lea     rsi, [.hello]
        push    1
        pop     rdi
        mov     eax, edi
        syscall
  .hello        db 'Hello world!'
  .length = $ - .hello    

Code:
$ fasm -m 102400 main.asm
flat assembler  version 1.73.32  (102400 kilobytes memory, x64)
2 passes, 166 bytes.    


But in the absence of a BOM, we get this:
Code:
$ fasm -m 102400 main.asm
flat assembler  version 1.73.32  (102400 kilobytes memory, x64)
main.asm [1]:
=0 ; BOM
processed: =0
error: illegal instruction.    
revolution 11 Feb 2024, 11:24
The fix for BOM and BOM-less sources?
Code:
BOM=0 ; works whether invisible BOM is present or not
;...    
I think the real fix is to not have the BOM. UTF-8 doesn't need the BOM, it is all bytes anyway, there is no ordering problem to solve.
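To try the BOM=0 line yourself, the sketch below (Python, not fasm; the helper names make_source and defined_name are made up for illustration) builds the same source with and without a UTF-8 signature and shows which name the first line would define in each case. It assumes, as the colon trick earlier in the thread suggests, that fasm treats the three signature bytes as ordinary symbol characters.

```python
import codecs

# First line uses the "BOM=0" trick; the rest is a placeholder body.
BODY = "BOM=0 ; works whether invisible BOM is present or not\n; code goes here\n"


def make_source(body: str, with_bom: bool) -> bytes:
    """Encode source text as UTF-8, optionally prefixed with the UTF-8 signature."""
    data = body.encode("utf-8")
    return codecs.BOM_UTF8 + data if with_bom else data


def defined_name(data: bytes) -> bytes:
    """Bytes before the first '=' on the first line, i.e. the name being defined."""
    first_line = data.split(b"\n", 1)[0]
    return first_line.split(b"=", 1)[0]


if __name__ == "__main__":
    for with_bom in (True, False):
        src = make_source(BODY, with_bom)
        print(with_bom, defined_name(src))
```

Either way the first line is a plain numeric constant definition, and constants defined with = may be redefined, which is why the same first line can appear in every included file.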


Last edited by revolution on 11 Feb 2024, 17:09; edited 1 time in total
Furs



Joined: 04 Mar 2016
Posts: 2493
Furs 11 Feb 2024, 16:41
revolution wrote:
The fix for BOM and BOM-less sources?
Code:
BOM=0 ; works whether invisible BOM is present or not
;...    
I think the real fix is to not have the BOM. UTF-8 doesn't need the BOM, it is all bytes anyway, there is no ordering problem to solve.
Some text editors don't display the file in UTF-8 without the BOM and they assume it's ASCII instead. Which to be honest is a sane default because ASCII files definitely don't have any BOM, so it's the most backwards compatible solution.
revolution 11 Feb 2024, 16:54
Furs wrote:
Some text editors don't display the file in UTF-8 without the BOM and they assume it's ASCII instead. Which to be honest is a sane default because ASCII files definitely don't have any BOM, so it's the most backwards compatible solution.
Sure, some editors are annoying.

The editor I use the most works perfectly fine to detect UTF-8 vs ASCII vs ISO-8859 without any BOM. It isn't hard; it only needs a small amount of logic.

Requiring a BOM would be worse. No scratch that, it is worse. Very few apps add a BOM for UTF-8 (because it isn't needed), so then it would be an awful experience for the user to manually try to figure out what they are looking at.
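A rough idea of the "small amount of logic" involved, as a Python sketch rather than any particular editor's code: pure 7-bit data is ASCII, data that decodes strictly as UTF-8 is treated as UTF-8, and anything else is some legacy 8-bit encoding. The function name guess_encoding and the returned labels are illustrative assumptions.

```python
import codecs


def guess_encoding(data: bytes) -> str:
    """Guess a text encoding without requiring a BOM (heuristic, not a proof)."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"          # explicit signature present
    if data.isascii():
        return "ascii"              # no byte >= 0x80 at all
    try:
        data.decode("utf-8")        # strict decoding rejects malformed sequences
    except UnicodeDecodeError:
        return "unknown-8bit"       # e.g. ISO-8859-x or a DOS code page
    return "utf-8"


if __name__ == "__main__":
    samples = [
        b"mov eax, 1",
        "; привет".encode("utf-8"),
        "; привет".encode("cp866"),
    ]
    for sample in samples:
        print(sample, "->", guess_encoding(sample))
```

This is a heuristic: a short legacy-encoded file can happen to form valid UTF-8 sequences (the bytes C3 A9, for example, are valid both as UTF-8 and in DOS code pages), so a guess of "utf-8" is likely rather than certain.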
Jin X 12 Feb 2024, 10:24
It works, but it is a crutch.
I think it's quite easy to add BOM support to the compiler.
revolution 12 Feb 2024, 10:53
Jin X wrote:
I think it's quite easy to add BOM support to the compiler.
You can prepare a solution and post a patch for consideration.
Furs 12 Feb 2024, 16:46
revolution wrote:
Sure, some editors are annoying.

The editor I use the most works perfectly fine to detect UTF-8 vs ASCII vs ISO-8859 without any BOM. It isn't hard; it only needs a small amount of logic.
How do you know those characters are UTF-8 and not extended DOS ASCII chars? Heuristics are never perfect.


Last edited by Furs on 12 Feb 2024, 16:46; edited 1 time in total
Jin X 12 Feb 2024, 16:46
Where to post, here?
revolution 12 Feb 2024, 17:14
Furs wrote:
Heuristics are never perfect.
It's better than doing nothing and leaving the user with a big "FU, you figure it out". At least it tries. And so far it has never been wrong IME. Detecting invalid UTF-8 is easy.
Jin X wrote:
Where to post, here?
Seems like a good place to me. See what happens. It can always be posted somewhere else later if needed.
Jin X 18 Feb 2024, 17:12
The BOM checker is done.
I made two versions: normal, and extended (with support for extra BOMs) in PREPROCE.EXT.INC.
All my insertions are marked with "Jin X".

The main code from PREPROCE.INC:
Code:
        mov     eax,[esi]
        cmp     ax,0FEFFh ; UTF-16 (LE) / UTF-32 (LE)
        je      unsuppoted_bom
        cmp     ax,0FFFEh ; UTF-16 (BE)
        je      unsuppoted_bom
        cmp     eax,0FFFE0000h ; UTF-32 (BE)
        je      unsuppoted_bom
        cmp     eax,3ABFBBEFh ; UTF-8 + colon char
        je      bom_no_skip ; don't skip if colon trick is used (for backward compatibility)
        and     eax,00FFFFFFh
        cmp     eax,00BFBBEFh ; UTF-8
        jne     bom_no_skip
        add     esi,3 ; skip BOM
      bom_no_skip:
        mov     ebx,esi ; moved down by Jin X
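The constants in the listing look byte-reversed because mov eax,[esi] is a little-endian load. The Python sketch below (illustrative only; as_dword and the SIGNATURES table are made-up names, not part of the patch) reproduces the value the code compares against for each signature's byte sequence.

```python
def as_dword(prefix: bytes) -> int:
    """Value EAX would hold after a little-endian 4-byte load of `prefix`.

    Shorter inputs are zero-padded, which is only for this demonstration.
    """
    return int.from_bytes(prefix[:4].ljust(4, b"\x00"), "little")


# File bytes as they appear on disk, keyed by the comment used in the listing.
SIGNATURES = {
    "UTF-16 (LE) / UTF-32 (LE)": b"\xff\xfe",
    "UTF-16 (BE)": b"\xfe\xff",
    "UTF-32 (BE)": b"\x00\x00\xfe\xff",
    "UTF-8 + colon char": b"\xef\xbb\xbf:",
    "UTF-8": b"\xef\xbb\xbf",
}

if __name__ == "__main__":
    for name, raw in SIGNATURES.items():
        print(f"{name:28} {raw.hex()} -> {as_dword(raw):08X}h")
```

The 16-bit comparisons look only at the low word of that value, and the plain UTF-8 case masks off the top byte first, which is what the and eax,00FFFFFFh line does.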
    


Attachment: bom_checker.zip (22.04 KB)

macomics 18 Feb 2024, 17:34
This trick has already been discussed. It doesn't work for BOM in multiple files.
Code:
        cmp     byte [esi+ecx],':' ; BOM + colon trick
        je      bom_no_skip ; don't skip for backward compatibility
        add     esi,ecx ; skip BOM
      bom_no_skip:    
It is better to skip the BOM along with the symbol ':'.
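One way to read this suggestion, sketched in Python rather than fasm (the function name source_start is hypothetical, and this is not the code from the attachment): reject UTF-16/UTF-32 signatures, skip a UTF-8 signature, and if the colon trick follows it, skip the colon too so that no label gets defined.

```python
import codecs

# Signatures the sketch refuses to handle, as byte sequences on disk.
UNSUPPORTED_BOMS = (
    codecs.BOM_UTF32_LE,
    codecs.BOM_UTF32_BE,
    codecs.BOM_UTF16_LE,
    codecs.BOM_UTF16_BE,
)


def source_start(data: bytes) -> int:
    """Offset of the first byte the preprocessor should look at.

    Raises ValueError for UTF-16/UTF-32 signatures.
    """
    if data.startswith(UNSUPPORTED_BOMS):
        raise ValueError("unsupported BOM (UTF-16 or UTF-32)")
    if not data.startswith(codecs.BOM_UTF8):
        return 0                      # no signature, nothing to skip
    offset = len(codecs.BOM_UTF8)     # skip the three signature bytes
    if data[offset:offset + 1] == b":":
        offset += 1                   # also skip the colon of the old workaround
    return offset


if __name__ == "__main__":
    for src in (b"mov eax,1", b"\xef\xbb\xbfmov eax,1", b"\xef\xbb\xbf: ; BOM\n"):
        print(src, "->", source_start(src))
```

With this reading, sources that already start with a signature followed by ": ; BOM" keep assembling, and the same first line in several included files no longer collides, because the signature-plus-colon label is never created.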
Jin X 18 Feb 2024, 18:08
Ok, fixed and checked (for both versions).


Attachment: bom_checker.zip (22.85 KB)

Copyright © 1999-2024, Tomasz Grysztar. Also on GitHub, YouTube.
