Unicode in FASM

kohlrak

Joined: 21 Jul 2006
Posts: 1413
Location: Uncle Sam's Pad

kohlrak 28 Jul 2007, 10:51

This topic is split from Things you hate most about FASM
Apart from wanting something friendlier for Windows resources than typing them by hand (which really isn't Tomasz's duty), my problems with fasm are people-related. The lack of direct Unicode handling is also annoying: you have to make separate files and import them with the file directive (though Tomasz has always included the source, so we could all fix things like that ourselves). Outside of those two problems I can't complain. I heard talk of people wanting a plugin system for fasmw? We really don't need that.

kohlrak 29 Jul 2007, 04:16

It does, but the problem is that it can't accept unicode as text in the source. Take the code below for example...

Code:

du "rawr",0 ;unicode string

works, right? Well the code below this won't...

Code:

Of course, I could always write it in Notepad and paste it into fasmw, where it turns into something else, but it certainly won't look right in fasmw. And I'm up the creek without a paddle if I try to do it on a system without fasmw.

LocoDelAssembly
Your code has a bug

Joined: 06 May 2005
Posts: 4623
Location: Argentina

LocoDelAssembly 29 Jul 2007, 04:54

Perhaps the problem is that FASMW (and maybe FASM itself) handles only 8-bit characters in source code?

Note that du is still useful even with this 8-bit source limitation. For example, I can write "myName du 'Hernán'" and the á will be displayed correctly on Russian or Japanese systems (unless I misunderstood the purpose of Unicode).

kohlrak 29 Jul 2007, 05:56

LocoDelAssembly wrote:

Perhaps the problem is that FASMW (and maybe FASM itself) handles only 8-bit characters in source code?

Note that du is still useful even with this 8-bit source limitation. For example, I can write "myName du 'Hernán'" and the á will be displayed correctly on Russian or Japanese systems (unless I misunderstood the purpose of Unicode).

Yeah, but the last time I used Windows functions, there were only A and W variants, not an 8-bit Unicode one... unless fasm turns the 8-bit text into regular Unicode.

LocoDelAssembly 29 Jul 2007, 13:51

Quote:

Yeah, but the last time I used Windows functions, there were only A and W variants, not an 8-bit Unicode one... unless fasm turns the 8-bit text into regular Unicode.

Well, du in my example gets translated to the equivalent of "db 'H', 0, 'e', 0, 'r', 0, 'n', 0, 'á', 0, 'n', 0". What I'm not sure about, however, is whether the translation is done properly. I saved a Unicode TXT file with my name in Notepad and, except for the leading FF FE (the byte-order mark), it had the same content as the binary produced by fasm. But will it be translated properly when assembled on other systems?

f0dder

Joined: 19 Feb 2004
Posts: 3174
Location: Denmark

f0dder 29 Jul 2007, 18:07

LocoDelAssembly wrote:

But will it be translated properly when assembled on other systems?

That will depend on the locale of the target system...

LocoDelAssembly 29 Jul 2007, 18:15

So I suppose my assumption above, that "Hernán" will be displayed correctly on Russian and Japanese systems, is also wrong?

kohlrak 29 Jul 2007, 23:16

LocoDelAssembly wrote:

So I suppose my assumption above, that "Hernán" will be displayed correctly on Russian and Japanese systems, is also wrong?

One way to find out. I changed my ANSI settings to Japanese. =p

Quote:

As for unicode, you have to include the proper file from 'include/encoding'. This way is a little weird but allows you to have sources in multiple encodings (UTF8, win1250, etc...). This suits FASM's no-command-line design well.

Well, then the problem remains for fasmw, but fasmw doesn't really matter on other systems, which don't have it. Could you provide an example source for this, though? I don't completely understand.

peter

Joined: 09 May 2006
Posts: 63

peter 29 Jul 2007, 23:23

LocoDelAssembly: No, you should use explicit Unicode codes:

db 'H', 0, 'e', 0, 'r', 0, 'n', 0, 0xE1, 0, 'n', 0

if you want the source code to be compilable under a Russian version of Windows.
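
For what it's worth, the same bytes can be written more compactly with du, since du stores every value as a 16-bit little-endian word. A minimal sketch, assuming plain fasm with no encoding include in effect; the labels are made up for illustration:

```asm
format binary                  ; raw output, so the emitted bytes are easy to inspect

; Hypothetical labels. Both lines emit the same 12 bytes:
; 48 00 65 00 72 00 6E 00 E1 00 6E 00
name_db db 'H', 0, 'e', 0, 'r', 0, 'n', 0, 0xE1, 0, 'n', 0
name_du du 'Hern', 0xE1, 'n'   ; ASCII characters plus the explicit code point U+00E1
```

Because only 7-bit ASCII appears in the source, the result should not depend on the encoding the file was saved in.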

LocoDelAssembly 29 Jul 2007, 23:47

Ah, but is "0, $E1" the correct sequence for 'á' even on Russian systems? That's why I'm not sure how safe it is: the 8-bit representation is also $E1, and if $E1 means something else on a Russian system but fasm still emits the "0, $E1" code, then it looks somewhat unsafe to me.

Example

Code:

du 'This is a Russian char: ', {some Russian char}, 0

Since FASMW will still be 8-bit, can the Russian char get translated into "0, $E1" even though it obviously wasn't 'á'?

vid mentioned the encoding includes, so I assume that when the char is prefixed with 0 it means "current locale"?

Perhaps my problem is that I think of Unicode sequences as a universal representation when that isn't true? I'm starting to believe this is the problem, especially given the different encodings.

kohlrak 29 Jul 2007, 23:49

W (unicode) is supposed to be universal while A isn't...

vid
Verbosity in development

Joined: 05 Sep 2003
Posts: 7103
Location: Slovakia

vid 30 Jul 2007, 09:13

Those macros are the proper answer. If your text editor can create files in UTF-8 encoding (the best option), then include "utf8.inc". If your editor uses an 8-bit character set based on your locale settings (like FASMW), then you should just include the proper file for your encoding.

For example, I create the following file under the WIN1250 (Central European) character set:

Code:

include "encoding\win1250.inc"
du "piča"

It uses the character "č" (c with caron). In Win1250, this character has code 0xE8. In Unicode, it is U+010D.

The source file is encoded as Win1250 text, so it contains just a single 0xE8 byte in place of that character. On a system with a different encoding, a text editor will display this source wrongly. For example, in Win1251 (the Cyrillic set), 0xE8 will display as "и".

But unlike displaying, compiling this source works under any locale settings. It will always compile to the Unicode character 0x10D in the "du" string.

To get rid of all these problems, simply use an editor that supports UTF-8. It is universal.
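
To make the difference concrete, here is a rough sketch of what happens to the 0xE8 byte with and without the include. It assumes fasm's built-in du (no encoding include in effect) and spells the values out as numbers so the snippet survives any editor; the labels are hypothetical:

```asm
format binary

; Without an encoding include, du only zero-extends each source byte:
plain    du 0xE8      ; bytes E8 00 = U+00E8, which is "è", not "č"

; With win1250.inc in effect, a 0xE8 byte inside a string is remapped,
; giving the same word as writing the code point by hand:
remapped du 0x010D    ; bytes 0D 01 = U+010D, "č"
```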

LocoDelAssembly 30 Jul 2007, 15:05

Perfect explanation, I have no doubts this time.

Thank you very much, vid and the others who tried to make me understand this.

kohlrak 31 Jul 2007, 00:15

vid wrote:

Those macros are the proper answer. If your text editor can create files in UTF-8 encoding (the best option), then include "utf8.inc". If your editor uses an 8-bit character set based on your locale settings (like FASMW), then you should just include the proper file for your encoding.

For example, I create the following file under the WIN1250 (Central European) character set:

Code:

include "encoding\win1250.inc"
du "piča"

It uses the character "č" (c with caron). In Win1250, this character has code 0xE8. In Unicode, it is U+010D.

The source file is encoded as Win1250 text, so it contains just a single 0xE8 byte in place of that character. On a system with a different encoding, a text editor will display this source wrongly. For example, in Win1251 (the Cyrillic set), 0xE8 will display as "и".

But unlike displaying, compiling this source works under any locale settings. It will always compile to the Unicode character 0x10D in the "du" string.

To get rid of all these problems, simply use an editor that supports UTF-8. It is universal.

I'm still not sure, but I think I get it. So, let's say I use WIN1257.INC and put in text from a Win1257 system. It would convert all du strings from that system's text to Unicode? If so, if I use the UTF-8 file, it'll convert all the UTF-8 into Unicode? If that's the case, it still leaves a problem when trying to use something other than ANSI for resource files, or does it work on that as well?

peter 31 Jul 2007, 00:27

Quote:

Ah, but is "0, $E1" the correct sequence for 'á' even on Russian systems?

Yes, the U+00E1 character is 'á' on every system. Unicode is one universal encoding, not a set of different encodings.

Quote:

I assume that when the char is prefixed with 0 it means "current locale"?

No, the characters U+0080..U+00FF are always the Latin-1 Supplement block. Russian (more precisely, Cyrillic) characters are always in U+0400..U+04FF. See http://www.unicode.org/charts/ for the list of blocks and the characters in them.

For example, my name in Russian will be:

db 0x1F, 0x04, 0x51, 0x04, 0x42, 0x04, 0x40, 0x04

Of course, UTF-8 source files are better than these "magic" numbers, but your text editor and compiler have to support them. When programming in C, I often use explicit codes for portability, because many C compilers still don't support UTF-8.
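
The byte pairs in that db line are just little-endian 16-bit words, so du can take the code points directly. A small sketch (the label names are hypothetical, and no encoding include is assumed):

```asm
format binary

; U+041F U+0451 U+0442 U+0440 = "Пётр"
name_db db 0x1F, 0x04, 0x51, 0x04, 0x42, 0x04, 0x40, 0x04
name_du du 0x041F, 0x0451, 0x0442, 0x0440   ; same 8 bytes as the db line
```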

vid 31 Jul 2007, 00:28

Quote:

I'm still not sure, but I think I get it. So, let's say I use WIN1257.INC and put in text from a Win1257 system. It would convert all du strings from that system's text to Unicode?

You should create the source file on a Win1257 system and include win1257.inc. Then anyone who tries to compile that file, even on a system with a different locale, will get the same result as you.

But opening the source in some editor would display it wrongly, because editors usually use the current system default encoding. Viewing a file encoded in Win1250 as if it were encoded in Win1252 doesn't work well. Trying to re-save the file in such a case would probably corrupt some characters.

Quote:

If so, if I use the UTF-8 file, it'll convert all the UTF-8 into Unicode?

If you save your source file in UTF-8 encoding and include "utf8.inc", then "du" will of course produce a proper UTF-16 Unicode string.
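
A minimal sketch of that setup, assuming the file is saved as UTF-8 and that the INCLUDE environment variable points at fasm's INCLUDE directory so the encoding files can be found; the labels are hypothetical:

```asm
format binary

include 'encoding/utf8.inc'   ; replaces du with a macro that decodes UTF-8 strings

from_text du 'č'              ; UTF-8 bytes C4 8D in the source -> word 0x010D
by_hand   du 0x010D           ; the same word written as a number
```

Both labels should end up holding the bytes 0D 01, whatever locale the assembling machine uses.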

Quote:

If that's the case, it still leaves a problem when trying to use something other than ANSI for resource files, or does it work on that as well?

I have minimal experience with resources, so I can't help you with this.

kohlrak 31 Jul 2007, 00:31

Alright, that makes sense, but then the problem *may* still remain with resource files (dialog boxes and menus, for example), as hexing them in after making the exe is rather annoying.

f0dder 31 Jul 2007, 00:34

If you use some _external_ resource format, rc.exe and a linker, then unicode resources will be just fine and work as expected. If you use FASM's PE output and resource macros, then... *shrug*.

Imho once you start doing unicode, you shouldn't be hardcoding strings in your program anymore... either use resources + loadstring, or do your own support code. Keeping localizable strings in external files pays off in the end, trust me.

kohlrak 31 Jul 2007, 00:40

Quote:

If you use FASM's PE output and resource macros, then... *shrug*.

That's where the problem comes in...

Quote:

Imho once you start doing unicode, you shouldn't be hardcoding strings in your program anymore... either use resources + loadstring, or do your own support code. Keeping localizable strings in external files pays off in the end, trust me.

If it's a small program, you should just hardcode it. No point in including files with an already incredibly small program. Especially if it was only intended for 1 target language.

FlierMate

Joined: 21 Jan 2021
Posts: 219

FlierMate 19 Feb 2021, 07:39

Thanks to this thread, I am able to hardcode Unicode chars into my assembly code, though it is a bit tricky: I have to look up each Unicode code point in hexadecimal.

Code:

IS64 du '64',0x4F4D,0x5143,' Windows',0
IS32 du '32',0x4F4D,0x5143,' Windows',0

And I make sure I use the wide-character (W) version of the Win32 API:

Code:

import user32,\
     MessageBox,'MessageBoxW'

Before I invoke:

Code:

      invoke MessageBox, 0, IS32, '', MB_OK

I found that I could not type the Unicode chars directly into the source file and save it as UTF-8 or Unicode. FASM could only open my .ASM file when it was saved in ANSI mode.
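
Going by vid's earlier advice in this thread, the same strings can probably be typed directly if the source is saved as UTF-8 and utf8.inc is included. A hedged sketch: it assumes the file is saved without a byte-order mark (as far as I know fasm does not skip one, which may be why the UTF-8 file was rejected) and that the include path resolves through the INCLUDE environment variable:

```asm
include 'encoding/utf8.inc'   ; du now decodes UTF-8 string literals to UTF-16

; Intended to produce the same words as: du '64',0x4F4D,0x5143,' Windows',0
IS64 du '64位元 Windows',0
IS32 du '32位元 Windows',0
```

Files saved as "Unicode" (UTF-16) in Notepad are a different matter; nothing in this thread suggests fasm can read those.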
