
Unicode in FASM

revolution 19 Feb 2021, 08:43
FlierMate wrote:
I found that I cannot type the Unicode chars directly into the source file and save it as UTF-8 or Unicode. FASM can only open .ASM files saved in ANSI mode.
Unicode isn't an encoding; it is a mapping of code points to characters.

But UTF-8 should work for saving and loading text files processed by fasm. If you ensure that there is no UTF-8 signature (the byte order mark, bytes EF BB BF) at the start of the file then there should be no difficulty.

But even if you have the UTF-8 signature then you can make the first line be a colon to make it a label. It would look like this:
Code:
: ; <--- editor hidden UTF-8 signature now becomes a redundant label in fasm
; code goes here
...    
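
If you would rather remove the signature than hide it behind a label, a small script can do it before assembling. This is only a sketch in Python, not part of fasm; `strip_utf8_signature` is a name made up for illustration, and it assumes the whole source file fits in memory.

```python
import codecs

def strip_utf8_signature(data: bytes) -> bytes:
    """Return data without a leading UTF-8 signature (EF BB BF), if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

# Simulate a source file that an editor saved with a signature.
source = codecs.BOM_UTF8 + "format PE GUI 4.0\n".encode("utf-8")
cleaned = strip_utf8_signature(source)
print(cleaned)
```

The same function leaves a file without a signature untouched, so it should be safe to run on every source file.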



FlierMate 19 Feb 2021, 14:17
revolution wrote:

But even if you have the UTF-8 signature then you can make the first line be a colon to make it a label. It would look like this:
Code:
: ; <--- editor hidden UTF-8 signature now becomes a redundant label in fasm
; code goes here
...    


Thanks for the brilliant solution. Now it compiles, but the output Unicode chars are wrong...

As you advised, I put a colon followed by a semicolon on the first line:
Code:
:;
format PE GUI 4.0    

Without this first line label, FASM would complain:

Quote:
flat assembler version 1.73.27 (1089705 kilobytes memory)
wow64chs.asm [1]:
;
processed: 
error: illegal instruction.


Please see the screenshot: the Unicode chars in the MsgBox are not readable and differ from the words I typed in the source file (saved as UTF-8, yes).

However, with my previous approach, i.e. typing the Unicode code point hex values and saving the .ASM file as ANSI, the Unicode chars in the MsgBox are readable.


[Attached screenshot msgbox2.PNG: the chars in the MsgBox are scrambled, not the exact words I typed into the source file]


revolution 19 Feb 2021, 15:36
If you save as UTF-8 encoding then you won't be able to place wide characters with du. The encodings are different.

fasm sees a string of bytes:
Code:
IS64 du 0x36,0x34,0xf3,0xc6,0xa3,... ; UTF-8 encoding    
I just made up the bytes but that is what UTF-8 might look like.

Then the du converts it to this:
Code:
0x36,0x00,0x34,0x00,0xf3,0x00,0xc6,0x00,0xa3,0x00 ; fasm just inserts null bytes    
So now they are "wide" characters, but of course they are not what you wanted.

What you want to output is wide characters directly:
Code:
IS64 dw '6','4',0x34a2,... ; manually entered wide character    
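
To see the difference in actual bytes, here is a rough Python sketch (not fasm code). `zero_extend_bytes` is a made-up helper that mimics the behaviour described above, on the assumption that the built-in du simply widens each byte of the string to a 16-bit word.

```python
def zero_extend_bytes(raw: bytes) -> bytes:
    """Widen every input byte to a little-endian 16-bit word (high byte zero)."""
    out = bytearray()
    for b in raw:
        out += b.to_bytes(2, "little")
    return bytes(out)

text = "64位元"
utf8 = text.encode("utf-8")          # the bytes stored in a UTF-8 source file
wrong = zero_extend_bytes(utf8)      # each UTF-8 byte widened separately
wanted = text.encode("utf-16-le")    # the wide string the W functions expect

print("UTF-8 bytes:  ", utf8.hex(" "))
print("zero-extended:", wrong.hex(" "))
print("UTF-16 LE:    ", wanted.hex(" "))
```

The two ASCII characters come out the same either way; only the multi-byte characters differ, which is why the ASCII part of the message box text still looks right.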



FlierMate 19 Feb 2021, 15:56
So that is what I did initially.

The following...
Code:
IS64 du '64',0x4F4D,0x5143,' Windows',0
IS32 du '32',0x4F4D,0x5143,' Windows',0    


...works exactly the same as...

Code:
IS64 dw '6','4',0x4F4D,0x5143,' ','W','i','n','d','o','w','s',0
IS32 dw '3','2',0x4F4D,0x5143,' ','W','i','n','d','o','w','s',0    
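
Typing the hex values by hand gets tedious, so the operand list can be generated. The Python sketch below is one possible way to do it; `fasm_du_operands` is a hypothetical helper, and it assumes printable ASCII may stay quoted while everything else is written as a hex UTF-16 code unit.

```python
def fasm_du_operands(text: str) -> str:
    """Format text as a du operand list: quoted ASCII runs, hex for the rest."""
    parts = []
    run = ""
    for ch in text:
        code = ord(ch)
        # Printable ASCII (except the quote itself) stays inside a quoted run.
        if 0x20 <= code < 0x7F and ch != "'":
            run += ch
            continue
        if run:
            parts.append(f"'{run}'")
            run = ""
        if code > 0xFFFF:
            # Characters above the BMP need a UTF-16 surrogate pair.
            code -= 0x10000
            parts.append(f"0x{0xD800 + (code >> 10):04X}")
            parts.append(f"0x{0xDC00 + (code & 0x3FF):04X}")
        else:
            parts.append(f"0x{code:04X}")
    if run:
        parts.append(f"'{run}'")
    parts.append("0")  # terminating null word
    return ",".join(parts)

print("IS64 du " + fasm_du_operands("64位元 Windows"))
print("IS32 du " + fasm_du_operands("32位元 Windows"))
```

The output is plain ASCII, so the generated lines can be pasted into a source file saved in ANSI mode.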


Hmmm... so it means I still cannot type CJK chars directly in the source file, as FASM gives an error:

Quote:
wow64chs.asm [11]:
IS64 dw '6','4','位','元',' ','W','i','n','d','o','w','s',0
processed: IS64 dw '6','4','位','元',' ','W','i','n','d','o','w','s',0
error: value out of range.


revolution 19 Feb 2021, 16:15
Your CJK characters are UTF-8 encoded so fasm sees all the bytes of the encoding inside the single quotes:
Code:
'<UTF-8 bytes in here>'    
So if your characters require more than 2 bytes to encode in UTF-8 then they can't fit inside a 16-bit word, hence the "value out of range" error.

And even if they are 2 bytes long the stored value is still wrong. fasm doesn't reverse the bytes of the string characters, so the UTF-8 bytes end up in memory in source order and the CPU reads them as a little-endian word. Besides, the two UTF-8 bytes carry marker bits (110xxxxx 10xxxxxx), so they are not the code point in either byte order. So there is no simple way around this; you have to encode manually.
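
A quick way to check both cases is to look at the bytes in Python. This is only an illustration; the two sample characters (U+00E9 and U+4F4D) are my own choice.

```python
import struct

# U+00E9 needs two UTF-8 bytes, U+4F4D needs three.
for ch in ("\u00e9", "\u4f4d"):
    raw = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {len(raw)} UTF-8 bytes: {raw.hex(' ')}")

# Interpret the two UTF-8 bytes of U+00E9 as a 16-bit word, both ways round.
raw = "\u00e9".encode("utf-8")
as_le_word = struct.unpack("<H", raw)[0]   # how the CPU reads the stored bytes
as_be_word = struct.unpack(">H", raw)[0]   # the bytes read in source order
print(f"little-endian: 0x{as_le_word:04X}")
print(f"big-endian:    0x{as_be_word:04X}")
print(f"code point:    0x{ord(chr(0xE9)):04X}")
```

Neither word equals the code point, and the three-byte character cannot be squeezed into one word at all.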



Tomasz Grysztar 19 Feb 2021, 16:25
revolution wrote:
If you save as UTF-8 encoding then you won't be able to place wide characters with du. The encodings are different.
That is true with the basic (built-in) implementation of DU, but all you need to get a better implementation is a line like:
Code:
include 'encoding/utf8.inc'    
with standard Windows headers (see their documentation).
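
For anyone curious what a UTF-8 aware DU has to do, the general idea can be sketched in Python. This is not the code from utf8.inc, just my own illustration of decoding UTF-8 and emitting 16-bit words; `utf8_to_utf16_words` is a made-up name, and the sketch does not reject overlong or otherwise malformed sequences.

```python
def utf8_to_utf16_words(raw: bytes) -> list:
    """Decode UTF-8 bytes and return the equivalent UTF-16 code units."""
    words = []
    i = 0
    while i < len(raw):
        lead = raw[i]
        # The lead byte tells how many continuation bytes follow.
        if lead < 0x80:
            code, extra = lead, 0
        elif 0xC0 <= lead < 0xE0:
            code, extra = lead & 0x1F, 1
        elif 0xE0 <= lead < 0xF0:
            code, extra = lead & 0x0F, 2
        elif 0xF0 <= lead < 0xF8:
            code, extra = lead & 0x07, 3
        else:
            raise ValueError(f"unexpected lead byte 0x{lead:02X} at offset {i}")
        if i + 1 + extra > len(raw):
            raise ValueError("truncated UTF-8 sequence")
        for cont in raw[i + 1 : i + 1 + extra]:
            if cont & 0xC0 != 0x80:
                raise ValueError("invalid continuation byte")
            code = (code << 6) | (cont & 0x3F)  # append 6 payload bits
        i += 1 + extra
        if code < 0x10000:
            words.append(code)
        else:
            # Outside the BMP: split into a surrogate pair.
            code -= 0x10000
            words.append(0xD800 + (code >> 10))
            words.append(0xDC00 + (code & 0x3FF))
    return words

sample = "64位元 Windows"
words = utf8_to_utf16_words(sample.encode("utf-8"))
print(", ".join(f"0x{w:04X}" for w in words))
```

Packing each returned word as a little-endian 16-bit value should give the same bytes as a UTF-16 LE encoding of the string, which is what the wide Windows functions expect.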



FlierMate 19 Feb 2021, 17:45
Sincere thanks to revolution and Tomasz.

Including utf8.inc works wonders.
Now I can include CJK chars directly inside the Assembly code.
Code:
:;
format PE GUI 4.0

entry start 

include '\fasm\include\encoding\utf8.inc'
include '\fasm\include\win32w.inc'

section '.data' data readable writeable

a rb MAX_PATH
IS64 du '64位元 Windows',0
IS32 du '32位元 Windows',0
.....
.....    


This solved my problem. :)



Roman 13 Jan 2024, 17:19
This does not help me with Russian Unicode symbols.
Image

Windows 10 Notepad in UTF-8 mode stores Russian symbols as two bytes (0xD0 0x90 = А), while English symbols are stored as one byte.

What works for me is 0x0410 = Russian А.
Image



macomics 13 Jan 2024, 18:43
It must be saved in UTF-16 LE so that Windows functions with the W suffix work normally with this text. Or convert the strings to the desired format using the MultiByteToWideChar function.
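
The conversion itself is easy to check outside of Windows. In this Python sketch the built-in codecs stand in for what MultiByteToWideChar does when called with the CP_UTF8 code page; it is not the Win32 call, only the same byte transformation as I understand it.

```python
# The two bytes Notepad stores for the Russian letter when saving as UTF-8.
stored = bytes([0xD0, 0x90])

# Decode as UTF-8, then re-encode as UTF-16 LE (the layout W functions expect).
wide = stored.decode("utf-8").encode("utf-16-le")
code_unit = int.from_bytes(wide, "little")

print("UTF-8 bytes:    ", stored.hex(" "))
print("UTF-16 LE bytes:", wide.hex(" "))
print(f"as a word:       0x{code_unit:04X}")
```

The resulting word is the same value that works when entered by hand with dw.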