
[bug] UTF-8 support



vid 19 Mar 2007, 13:48
Tomasz claims that FASM supports UTF-8 encoding of sources. It does so just by treating all bytes of 0x80 and above as normal characters allowed in symbols and strings.

Unfortunately, this is not enough. There are two problems with it:

1. According to the standard, characters with overlong encodings should be rejected. That means, for example, that if a character is encoded in 2 bytes but could be encoded in one (like zero encoded as C0,80), FASM should display an error. This includes checking for an incomplete last character.

2. When using the du directive with a string initializer, FASM should convert UTF-8 to the UTF-16 used by Windows (real variable-length UTF-16, not UCS-2).

I think point 1 could be solved relatively easily in the stage where internal tokens are generated, and point 2 shouldn't be a big problem either. This would give FASM REAL support for sources in UTF-8. I believe that a tool used as internationally as FASM undoubtedly is deserves good internationalization support.
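
The check described in point 1 can be sketched as follows. This is an illustrative Python sketch, not fasm code: the function name `decode_utf8_strict` and its error messages are hypothetical. It decodes a byte string into code points and rejects overlong encodings, truncated sequences, stray continuation bytes, and surrogate code points.

```python
def decode_utf8_strict(data: bytes) -> list:
    """Decode UTF-8 bytes into code points, rejecting malformed input.

    Hypothetical helper for illustration; it is not part of fasm.
    """
    # Smallest code point that legitimately needs 1, 2, 3 or 4 bytes.
    min_value = {1: 0x00, 2: 0x80, 3: 0x800, 4: 0x10000}
    out = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            length, cp = 1, lead
        elif 0xC0 <= lead < 0xE0:
            length, cp = 2, lead & 0x1F
        elif 0xE0 <= lead < 0xF0:
            length, cp = 3, lead & 0x0F
        elif 0xF0 <= lead < 0xF8:
            length, cp = 4, lead & 0x07
        else:
            # 0x80..0xBF is a stray continuation byte, 0xF8..0xFF is never valid.
            raise ValueError(f"invalid lead byte {lead:#04x} at offset {i}")
        if i + length > len(data):
            raise ValueError(f"incomplete sequence at offset {i}")
        for b in data[i + 1:i + length]:
            if b & 0xC0 != 0x80:
                raise ValueError(f"invalid continuation byte at offset {i}")
            cp = (cp << 6) | (b & 0x3F)
        if cp < min_value[length]:
            raise ValueError(f"overlong encoding at offset {i}")
        if cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            raise ValueError(f"invalid code point {cp:#x} at offset {i}")
        out.append(cp)
        i += length
    return out
```

For example, `decode_utf8_strict("é".encode("utf-8"))` returns `[233]`, while the two-byte overlong form of zero, `b"\xc0\x80"`, raises `ValueError`.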



Tomasz Grysztar 19 Mar 2007, 16:59
1. I intentionally ignored this rule, as it matters only where security is important, or where you plan to compare the UTF-8 strings etc. I did not see any advantage in signalizing such error.

2. It already does. You may have missed this fragment of UTF8.INC:
Code:
     if wide < 10000h
      dw wide
     else
      ; above 0FFFFh: emit a surrogate pair (high word, then low word)
      dw 0D7C0h + wide shr 10,0DC00h or (wide and 3FFh)
     end if


3. The real problem with UTF-8 support is that if your editor puts a BOM at the start of your file, it won't work with fasm, because fasm will treat it as a real symbol.
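
The arithmetic in the quoted UTF8.INC fragment can be checked against a reference UTF-16 encoder. The Python sketch below mirrors the fragment under the assumption that `shr` binds more tightly than `+` in fasm, so the high word is `0D7C0h + (wide shr 10)`. The constant `0D7C0h` is `0D800h - (10000h shr 10)`, which folds the usual "subtract 10000h" step into the base. The function names are hypothetical.

```python
import struct


def utf16_units_fasm_style(wide: int) -> list:
    """Mirror the quoted fragment: one word below 10000h, else a surrogate pair."""
    if wide < 0x10000:
        return [wide]
    # 0xD7C0 == 0xD800 - (0x10000 >> 10), so no explicit subtraction is needed.
    # Subtracting 0x10000 never changes the low 10 bits, so the low word can
    # be taken straight from the original code point.
    return [0xD7C0 + (wide >> 10), 0xDC00 | (wide & 0x3FF)]


def utf16_units_reference(wide: int) -> list:
    """Reference result from Python's own UTF-16 little-endian encoder."""
    raw = chr(wide).encode("utf-16-le")
    return list(struct.unpack(f"<{len(raw) // 2}H", raw))


# Compare the two for a few non-surrogate code points, including both
# ends of the supplementary range.
for cp in (0x41, 0x20AC, 0xFFFF, 0x10000, 0x1F600, 0x10FFFF):
    assert utf16_units_fasm_style(cp) == utf16_units_reference(cp)
```

For U+1F600 both functions give the pair D83Dh, DE00h.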


vid 19 Mar 2007, 18:09
1.
Quote:
I did not see any advantage in signalizing such error.

Conformance to standards, for example?

2.
Quote:
It already does. You may have missed this fragment of UTF8.INC
Sorry, it seems I missed the "encoding" directory completely. Nice to see that FASM supports all these encodings, even though doing so via include files is a little ... uncommon :)

3. If you checked for valid UTF-8 encoding during tokenization, it would be easy to add this feature ;)
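
As a rough illustration of the BOM handling discussed in point 3, a preprocessing step could drop a leading UTF-8 BOM before the source reaches the tokenizer. This Python sketch only illustrates the idea and does not describe anything fasm does; `strip_utf8_bom` is a hypothetical name, while `codecs.BOM_UTF8` is the standard-library constant for the bytes EF BB BF.

```python
import codecs


def strip_utf8_bom(source: bytes) -> bytes:
    """Return the source without a leading UTF-8 BOM, if one is present.

    Hypothetical preprocessing helper; only a BOM at offset 0 is removed.
    """
    if source.startswith(codecs.BOM_UTF8):
        return source[len(codecs.BOM_UTF8):]
    return source
```

Only the first three bytes are examined, so the same byte sequence appearing later in the file (for example inside a string) is left untouched.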



Tomasz Grysztar 19 Mar 2007, 18:31
vid wrote:
Conformance to standards, for example?
I treat it rather as a "security hint". Accepting non-standard encodings of this kind is not really much worse than the fact that fasm accepts any non-standard line breaks regardless of the operating system.

vid wrote:
3. If you checked for valid UTF-8 encoding during tokenization, it would be easy to add this feature ;)

fasm itself is encoding-transparent; that's why the encodings are supported with include files.
Copyright © 1999-2024, Tomasz Grysztar. Also on GitHub, YouTube.
