flat assembler
Message board for the users of flat assembler.

Unicode in FASM

kohlrak



Joined: 21 Jul 2006
Posts: 1421
Location: Uncle Sam's Pad
This topic is split from Things you hate most about FASM
Except for wanting something a little friendlier for Windows resources than typing them in manually (which really isn't Tomasz's duty), my problems with fasm are people-related. Also, the lack of direct Unicode handling (having to make separate files and import them with the file data directive) is annoying (but Tomasz has always included the source, so we could all fix things like that ourselves)... Outside of those two problems I can't complain. I heard talk of people wanting a plugin system for fasmw? We really don't need that.
Post 28 Jul 2007, 10:51

kohlrak
It does, but the problem is that it can't accept unicode as text in the source. Take the code below for example...

Code:
du "rawr",0 ;unicode string    


works, right? Well the code below this won't...

Code:
    


Of course, I could always write it in Notepad and then paste the code into fasmw, where it turns into something else, but it certainly won't look right in fasmw. And I'm up the creek without a paddle if I try to do it on a system without fasmw.
Post 29 Jul 2007, 04:16
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4633
Location: Argentina
Perhaps the problem is that FASMW (and maybe FASM itself) handles only 8-bit characters in source code?

Note that du is still useful even with this 8-bit source limitation. For example, I can write "myName du 'Hernán'" and the á will be displayed correctly on Russian or Japanese systems (unless I've misunderstood the purpose of Unicode :lol:).
Post 29 Jul 2007, 04:54

kohlrak
LocoDelAssembly wrote:
Perhaps the problem is that FASMW (and maybe FASM itself) handles only 8-bit characters in source code?

Note that du is still useful even with this 8-bit source limitation. For example, I can write "myName du 'Hernán'" and the á will be displayed correctly on Russian or Japanese systems (unless I've misunderstood the purpose of Unicode :lol:).


Yeah, but the last time I tried using Windows functions, there were only A and W versions, not an 8-bit one... unless fasm turns the 8-bit Unicode into regular Unicode...
Post 29 Jul 2007, 05:56

LocoDelAssembly
Quote:

Yeah, but the last time I tried using Windows functions, there were only A and W versions, not an 8-bit one... unless fasm turns the 8-bit Unicode into regular Unicode...

Well, du in my example gets translated to the equivalent of "db 'H', 0, 'e', 0, 'r', 0, 'n', 0, 'á', 0, 'n', 0". What I'm not sure about, however, is whether the translation is done properly. I saved a Unicode TXT with my name in Notepad, and except for the leading FF FE it had the same content as the binary produced by fasm. But will it be properly translated when assembled on other systems?
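
The Notepad comparison described here can be reproduced without assembling anything. The following is only a Python sketch for checking byte values, not fasm code. It assumes the source file stores 'á' as the single byte 0xE1 (as Windows-1252 does) and that Notepad's "Unicode" format is UTF-16LE with a leading FF FE byte order mark. The variable names are made up for the illustration.

```python
import codecs

name = "Hernán"

# UTF-16LE stores each 16-bit unit low byte first, so 'á' (U+00E1) becomes E1 00.
utf16 = name.encode("utf-16-le")

# The bytes of "db 'H', 0, 'e', 0, ..." when 'á' is stored in the source
# as the single byte 0xE1 (assumption: a Windows-1252 encoded source file).
db_bytes = bytes([ord("H"), 0, ord("e"), 0, ord("r"), 0,
                  ord("n"), 0, 0xE1, 0, ord("n"), 0])

# Assumption: a Notepad "Unicode" file is the BOM FF FE followed by UTF-16LE text.
notepad_file = codecs.BOM_UTF16_LE + utf16

print(utf16.hex(" "))             # the UTF-16LE bytes, space separated
print(utf16 == db_bytes)          # do the two byte sequences match?
print(notepad_file[:2].hex(" "))  # the byte order mark
```

The match works here because the first 256 Unicode code points coincide with Latin-1, so zero-extending a Windows-1252 'á' happens to give the right code unit. It says nothing yet about bytes from other code pages.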
Post 29 Jul 2007, 13:51
f0dder



Joined: 19 Feb 2004
Posts: 3170
Location: Denmark
LocoDelAssembly wrote:

but will it be properly translated when assembled on other systems?

That will depend on the locale of the target system...
Post 29 Jul 2007, 18:07

LocoDelAssembly
So I suppose my assumption above, that "Hernán" will be displayed correctly on Russian and Japanese systems, is also wrong?
Post 29 Jul 2007, 18:15

kohlrak
LocoDelAssembly wrote:
So I suppose my assumption above, that "Hernán" will be displayed correctly on Russian and Japanese systems, is also wrong?


One way to find out. I changed my ANSI settings to Japanese. =p

Quote:
As for unicode, you have to include the proper file from 'include/encoding'. This way is a little weird, but it allows you to have sources in multiple encodings (UTF8, win1250, etc...). This suits FASM's no-command-line design well.


Well, then the problem remains a problem for fasmw, but fasmw isn't really that important for other systems which don't have it. Can you provide an example source for this, though? I don't completely understand.
Post 29 Jul 2007, 23:16
peter



Joined: 09 May 2006
Posts: 63
LocoDelAssembly: No, you should use explicit Unicode code:

db 'H', 0, 'e', 0, 'r', 0, 'n', 0, 0xE1, 0, 'n', 0

if you want the source code to be compilable under a Russian version of Windows.
Post 29 Jul 2007, 23:23

LocoDelAssembly
Ah, but is "0, $E1" the correct sequence for 'á' even on Russian systems? I'm not sure how safe this is, because the 8-bit representation is also $E1. If $E1 is used for something else on a Russian system but fasm still emits the "0, $E1" code, then it looks somewhat unsafe to me.

Example
Code:
du 'This is a Russian char: ', {some Russian char}, 0    

Since FASMW will still be 8-bit, can the Russian char get translated into "0, $E1" even though it obviously wasn't 'á'?

vid mentioned the encoding includes, so I assume that when the char is prefixed with 0 it means "current locale"?

Perhaps my problem is that I think Unicode sequences have a universal representation when that isn't true? I'm starting to believe this is the problem, especially because of the different encodings :P
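
The worry about $E1 can be made concrete with a short check. This is a Python sketch, not fasm code, and it assumes that a plain du simply zero-extends each source byte to 16 bits, as the db equivalent earlier in the thread describes. The variable names are illustrative only.

```python
raw = bytes([0xE1])  # one byte, as stored in an 8-bit source file

# The same byte means different characters in different 8-bit code pages.
as_cp1252 = raw.decode("cp1252")  # Western European code page
as_cp1251 = raw.decode("cp1251")  # Cyrillic code page

# Zero-extension (assumed behaviour of a plain du) ignores the code page
# and always yields the 16-bit unit 0x00E1.
zero_extended = chr(raw[0])

print(f"cp1252 reads the byte as U+{ord(as_cp1252):04X}")
print(f"cp1251 reads the byte as U+{ord(as_cp1251):04X}")
print(f"zero-extension gives     U+{ord(zero_extended):04X}")
```

Under that assumption, a Cyrillic letter typed as byte $E1 would indeed come out as U+00E1 rather than the intended Cyrillic code point, which is the case a conversion table for the source encoding is meant to handle.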
Post 29 Jul 2007, 23:47

kohlrak
W (unicode) is supposed to be universal while A isn't...
Post 29 Jul 2007, 23:49
vid
Verbosity in development


Joined: 05 Sep 2003
Posts: 7105
Location: Slovakia
Those macros are the proper answer. If your text editor can create files in UTF-8 encoding (the best option), then include "utf8.inc". If your editor uses an 8-bit character set chosen by your locale settings (like FASMW), then you should just include the proper file for your encoding.

For example, I create the following file under the WIN1250 (Central European) character set:
Code:
include "encoding\win1250.inc"
du "piča"

It uses the character "č" (c with caron). In Win1250, this character has code 0xE8. In Unicode, it is U+010D.

The source file is encoded as Win1250 text, so it contains just a single 0xE8 byte in place of that character. On a system with a different encoding, a text editor will display this source wrongly. For example, in win1251 (the Cyrillic set), 0xE8 will display as "и".

But unlike displaying, compiling this source works under any locale settings. It will always compile to Unicode character 0x10D in the "du" string.

To get rid of all these problems, simply use an editor that supports UTF-8. It is universal.
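
The byte values in this example can be double-checked with Python's built-in code page tables. This is just a verification sketch in Python, not the contents of win1250.inc, and the variable names are made up.

```python
src_byte = bytes([0xE8])  # the single byte stored in the Win1250 source file

# What an editor shows depends on the code page it assumes for the file.
shown_in_cp1250 = src_byte.decode("cp1250")  # Central European
shown_in_cp1251 = src_byte.decode("cp1251")  # Cyrillic

print(f"cp1250: U+{ord(shown_in_cp1250):04X}")
print(f"cp1251: U+{ord(shown_in_cp1251):04X}")

# Converting with the table that matches the file's real encoding gives the
# intended character, whose UTF-16LE form is what a du string should hold.
utf16 = shown_in_cp1250.encode("utf-16-le")
print(utf16.hex(" "))
```

The conversion result depends only on the table chosen for the file, not on the locale of the machine doing the conversion, which is the point made above.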
Post 30 Jul 2007, 09:13

LocoDelAssembly
Perfect explanation, I have no doubts this time :D

Thank you very much, vid and the others who tried to make me understand this :)
Post 30 Jul 2007, 15:05

kohlrak
vid wrote:
Those macros are the proper answer. If your text editor can create files in UTF-8 encoding (the best option), then include "utf8.inc". If your editor uses an 8-bit character set chosen by your locale settings (like FASMW), then you should just include the proper file for your encoding.

For example, I create the following file under the WIN1250 (Central European) character set:
Code:
include "encoding\win1250.inc"
du "piča"

It uses the character "č" (c with caron). In Win1250, this character has code 0xE8. In Unicode, it is U+010D.

The source file is encoded as Win1250 text, so it contains just a single 0xE8 byte in place of that character. On a system with a different encoding, a text editor will display this source wrongly. For example, in win1251 (the Cyrillic set), 0xE8 will display as "и".

But unlike displaying, compiling this source works under any locale settings. It will always compile to Unicode character 0x10D in the "du" string.

To get rid of all these problems, simply use an editor that supports UTF-8. It is universal.


I'm still not sure, but I think I get it. So, let's say I use WIN1257.INC, and I put in text from a win1257 system. It would convert all du strings from that system's text to Unicode? If so, and I use the UTF-8 file, it'll convert all the UTF-8 into Unicode? If that's the case, it still leaves a problem when trying to use something other than ANSI for resource files, or does it work on that as well?
Post 31 Jul 2007, 00:15

peter
Quote:

ah, but "0, $E1" is the correct sequence for 'á' even on Russian systems?

Yes, U+00E1 character is 'á' on every system. Unicode is one universal encoding, not a set of different encodings.

Quote:

I assume that when the char is prefixed with 0 it means "current locale"?

No, U+0080..U+00FF characters are always the Latin-1 Supplement block. Russian (more precisely, Cyrillic) chars are always in U+0400..U+04FF. See http://www.unicode.org/charts/ for the list of blocks and the characters in them.

For example, my name in Russian will be:

db 0x1F, 0x04, 0x51, 0x04, 0x42, 0x04, 0x40, 0x04

Of course, UTF-8 source files are better than these "magic" numbers, but your text editor and compiler should support them. When programming in C, I often use explicit codes for portability, because many C compilers still don't support UTF-8.
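
For readers who want to see what those "magic" numbers spell, here is a small Python check (Python only because it ships with a UTF-16 decoder; it is not part of the original post's toolchain). It reads the eight bytes above as little-endian 16-bit units and tests whether each code point lies in the Cyrillic block U+0400..U+04FF.

```python
# The byte sequence from the db line above.
units = bytes([0x1F, 0x04, 0x51, 0x04, 0x42, 0x04, 0x40, 0x04])

# Each pair is one 16-bit unit stored low byte first (UTF-16LE).
name = units.decode("utf-16-le")

code_points = [f"U+{ord(ch):04X}" for ch in name]
in_cyrillic_block = all(0x0400 <= ord(ch) <= 0x04FF for ch in name)

print(code_points)
print(in_cyrillic_block)
```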
Post 31 Jul 2007, 00:27

vid
Quote:
I'm still not sure, but i think i get it. So, let's say i use WIN1257.INC, and i put text in from a win1257 system. It would convert all du strings from the text of that system to unicode?

You should create the source file on a win1257 system and include win1257.inc. Then anyone who compiles that file, even on a system with a different locale, will get the same result as you.

But opening the source in some editors would display it wrongly, because editors usually use the current system default encoding. Viewing a file encoded in win1250 as if it were encoded in win1252 doesn't work well. Re-saving the file in such a case would probably corrupt some characters.

Quote:
If so, if i use the UTF-8 file, it'll convert all the UTF into unicode?

If you encode your source file in UTF-8 encoding, and include "utf8.inc", then "du" will of course produce proper UTF-16 unicode string.
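
To give a feel for the kind of conversion involved, here is a hand-written UTF-8 to UTF-16LE converter in Python. It is only a conceptual sketch and not the code of "utf8.inc". The function name `utf8_to_utf16le` is hypothetical, the sketch handles only sequences of one to three bytes (code points up to U+FFFF), and it does not validate continuation bytes.

```python
def utf8_to_utf16le(data: bytes) -> bytes:
    """Decode UTF-8 by hand (1 to 3 byte sequences only) into UTF-16LE."""
    out = bytearray()
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:                # 0xxxxxxx: plain ASCII
            cp, length = lead, 1
        elif lead >> 5 == 0b110:       # 110xxxxx 10xxxxxx
            cp, length = lead & 0x1F, 2
        elif lead >> 4 == 0b1110:      # 1110xxxx 10xxxxxx 10xxxxxx
            cp, length = lead & 0x0F, 3
        else:
            raise ValueError("sequence not handled by this sketch")
        if i + length > len(data):
            raise ValueError("truncated UTF-8 sequence")
        # Each continuation byte contributes its low six bits.
        for k in range(1, length):
            cp = (cp << 6) | (data[i + k] & 0x3F)
        out += cp.to_bytes(2, "little")  # one 16-bit unit, low byte first
        i += length
    return bytes(out)


source_bytes = "piča".encode("utf-8")   # what a UTF-8 editor saves to disk
converted = utf8_to_utf16le(source_bytes)
print(source_bytes.hex(" "))
print(converted.hex(" "))
```

Because the UTF-8 bytes identify each character unambiguously, the result does not depend on any locale or code page setting.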

Quote:
If that's the case, it still leaves a problem when trying to use something other than ansi for resource files, or does it work on that as well?
I have minimal experience with resources, so I can't help you with this.
Post 31 Jul 2007, 00:28

kohlrak
Alright, that makes sense, but then the problem *may* still remain with the resource files (dialog boxes and menus, for example), as hexing them in after making the exe is rather annoying.
Post 31 Jul 2007, 00:31

f0dder
If you use some _external_ resource format, rc.exe and a linker, then unicode resources will be just fine and work as expected. If you use FASM's PE output and resource macros, then... *shrug*.

Imho once you start doing unicode, you shouldn't be hardcoding strings in your program anymore... either use resources + loadstring, or do your own support code. Keeping localizable strings in external files pays off in the end, trust me.
Post 31 Jul 2007, 00:34

kohlrak
Quote:
If you use FASM's PE output and resource macros, then... *shrug*.


That's where the problem comes in...

Quote:
Imho once you start doing unicode, you shouldn't be hardcoding strings in your program anymore... either use resources + loadstring, or do your own support code. Keeping localizable strings in external files pays off in the end, trust me.


If it's a small program, you should just hardcode it. No point in including files with an already incredibly small program. Especially if it was only intended for 1 target language.
Post 31 Jul 2007, 00:40
Copyright © 1999-2020, Tomasz Grysztar.