flat assembler
Message board for the users of flat assembler.

Why need more bytes for unicode?

Roman



Joined: 21 Apr 2012
Posts: 1251
Why not use simple ASCII text and two-byte Unicode?
Code:
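Text         db 'hello world!', 0   ; plain text, one byte per character
UnicodeToken dd 0                   ; extra value passed alongside the text

; printUnicode is a placeholder name, not an existing API
invoke printUnicode, Text, [UnicodeToken]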


Why do we need to use more bytes for Unicode text?
It creates many problems: text has to be converted to ASCII or some other format, and many bytes are wasted on text in different languages.
Post 29 May 2022, 11:22
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 18846
Location: In your JS exploiting you and your system
Unicode is a mess.

What do you propose to replace it with?

Note that you are mixing up terms there. ASCII is an encoding, Unicode is a character set. They are very different things.
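
To make the distinction concrete, here is a small sketch in Python (chosen only for illustration; the helper name `encodings_of` is made up for this example). It takes one character, shows the number the character set assigns to it, and then the different byte sequences that three encodings produce for that same number.

```python
def encodings_of(ch):
    """Return the code point of ch and its bytes under three encodings."""
    return {
        "code point": f"U+{ord(ch):04X}",
        "UTF-8": ch.encode("utf-8").hex(" "),
        "UTF-16LE": ch.encode("utf-16-le").hex(" "),
        "UTF-32LE": ch.encode("utf-32-le").hex(" "),
    }

# The character set assigns the number; each encoding serialises it differently.
for name, value in encodings_of("\u00c0").items():
    print(f"{name:>10}: {value}")
```

The same code point costs two, two, and four bytes respectively, which is the size trade-off the original question is about.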
Post 29 May 2022, 11:27
bitRAKE



Joined: 21 Jul 2003
Posts: 3489
Location: vpcmipstrm
We clearly need more emojis!

I wonder how many different grunts cavemen had? Can you imagine that one uppity caveman that started using more sounds - he probably got his head bashed in.

_________________
¯\(°_o)/¯ unlicense.org
Post 29 May 2022, 12:10
revolution
I suspect the real gripe here is with the encoding. Things like UTF-8, UCS-2, etc.

But Unicode has its own "difficulties" also.

Perhaps if everyone used UTF-32 and ceased using the combining, zero-width, and/or non-printing characters then everything would be fine?
Post 29 May 2022, 12:15
bitRAKE
revolution wrote:
Perhaps if everyone used UTF-32 and ceased using the combining, zero-width, and/or non-printing characters then everything would be fine?
Those codes have real uses in other industries - they don't exist because someone was trying to be difficult. Just like all those weird vowels I don't use in my language - I assume other people need them.

I suspect things will only get more complicated. We might transition to an easier system, but that transition will be difficult as well.

Post 29 May 2022, 13:15
revolution
bitRAKE wrote:
Those codes have real uses in other industries ...
Yes, of course. Some people find them useful.

But the question that should be asked, perhaps, is should Unicode be used to provide those things? Including what are essentially layout modifiers inside a character set is debatable. Things like HTML, or even the old RTF, are perhaps a better place for those things to exist.

Anyhow, we have them now. So we just have to put up with it and deal with all the confusion and bugs that come from it.
Post 29 May 2022, 13:22
revolution
Post 29 May 2022, 13:25
bitRAKE
revolution wrote:
Anyhow, we have them now. So we just have to put up with it and deal with all the confusion and bugs that come from it.
Or design and implement a system that meets one's needs. Then only translation algorithms are needed for the systems one wishes to interoperate with. We are programmers - except for the boundaries we make the rules!

Post 29 May 2022, 13:29
AsmGuru62



Joined: 28 Jan 2004
Posts: 1464
Location: Toronto, Canada
Is it true that UTF-8 is basically an IF statement on every character?
Because you have to decode the multibyte characters.
Post 29 May 2022, 17:00
revolution
AsmGuru62 wrote:
Is it true that UTF-8 is basically an IF statement on every character?
Because you have to decode the multibyte characters.
Yes, true.
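
A rough sketch of where that IF comes from, written in Python for readability (the function name `decode_utf8` is made up for this example, and the decoder is deliberately simplified: it does not reject overlong forms, surrogates, or values above 0x10FFFF). Every lead byte has to be classified before the decoder knows how many bytes follow.

```python
def decode_utf8(data):
    """Decode UTF-8 bytes into a list of code points, one branch per lead byte."""
    points = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:              # 0xxxxxxx: single byte (ASCII)
            value, extra = lead, 0
        elif lead >> 5 == 0b110:     # 110xxxxx: one continuation byte
            value, extra = lead & 0x1F, 1
        elif lead >> 4 == 0b1110:    # 1110xxxx: two continuation bytes
            value, extra = lead & 0x0F, 2
        elif lead >> 3 == 0b11110:   # 11110xxx: three continuation bytes
            value, extra = lead & 0x07, 3
        else:
            raise ValueError(f"invalid lead byte {lead:#04x} at offset {i}")
        if i + extra >= len(data):
            raise ValueError("truncated sequence")
        for cont in data[i + 1 : i + 1 + extra]:
            if cont >> 6 != 0b10:    # continuation bytes look like 10xxxxxx
                raise ValueError("invalid continuation byte")
            value = (value << 6) | (cont & 0x3F)
        points.append(value)
        i += 1 + extra
    return points

print([hex(p) for p in decode_utf8("A\u00c0\u20ac".encode("utf-8"))])
```

In assembly the same chain would presumably become a series of compares and conditional jumps on the lead byte, or a table lookup indexed by its top bits.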

Doing something "simple" like counting characters (not bytes) is a really hard problem. It shouldn't be that way IMO. So much complexity for such little benefit.
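
For what it is worth, counting code points (as opposed to what a reader would call characters) can be sketched without a full decode, assuming the input is already valid UTF-8: every byte that is not a continuation byte starts a new code point. The name `count_code_points` is made up for this Python example.

```python
def count_code_points(data):
    """Count code points in valid UTF-8 by skipping continuation bytes (10xxxxxx)."""
    return sum(1 for byte in data if (byte & 0xC0) != 0x80)

sample = "A\u0300".encode("utf-8")   # A followed by COMBINING GRAVE ACCENT
print(len(sample), count_code_points(sample))
```

The sample is three bytes and two code points, yet it displays as a single accented letter, so even this count is not the number a user would expect.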

Searching for something "simple" like a single character is just as hard. Actually it is harder, with all the combining, zero-width, and other stuff to take into account.
Post 29 May 2022, 17:07
revolution
Here is a nice example.

ÀÀA‍‍̀

Three characters, right? Perhaps they look slightly different, it depends upon your browser. The last two probably look identical, but they are different.

But look again, there are 7 code points. Could you design a search function to find A‍‍̀? What would it find? The first, the second, or the third? All three? Just the last two? None? It isn't an easy problem to solve.

This is the UTF-8 byte sequence:
Code:
c3 80 41 cc 80 41 e2 80  8d e2 80 8d cc 80    
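
The byte sequence above can be checked with a short Python sketch. It decodes the bytes, lists the code points, and then tries one possible search strategy: normalising to NFC before searching. This is only one approach among several, not a full answer to the question posed.

```python
import unicodedata

raw = bytes.fromhex("c3 80 41 cc 80 41 e2 80 8d e2 80 8d cc 80")
text = raw.decode("utf-8")

# List the code points behind the three visible characters.
code_points = [f"U+{ord(ch):04X}" for ch in text]
print(len(code_points), code_points)

needle = "\u00c0"  # precomposed LATIN CAPITAL LETTER A WITH GRAVE

# A plain code-point search only matches the first, precomposed form.
print(text.count(needle))

# NFC composes A + U+0300, so the second form now matches as well.
# The third keeps its zero-width joiners (U+200D) and still does not match.
composed = unicodedata.normalize("NFC", text)
print(composed.count(needle))
```

Whether the third form should count as a match is a policy decision that normalisation alone does not settle.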
Post 29 May 2022, 17:34
bitRAKE
Post 29 May 2022, 22:48
bitRAKE
Aren't there languages with a lot of permutations? Which make even UTF-32 not enough code space to cover all the possible characters?

I mean, if we go through the thought experiment of removing everything we don't like, and try to create an encoding that works in the way we wish it would. Is that even possible?

Post 30 May 2022, 01:09
revolution
bitRAKE wrote:
Aren't there languages with a lot of permutations? Which make even UTF-32 not enough code space to cover all the possible characters?
I don't know of any. What do you have in mind? Unicode defines up to 0x10ffff only, and that isn't close to being full yet.
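
For scale, a quick Python check of the 0x10ffff limit. The helper `is_valid_code_point` is a made-up name for this example; it relies on Python's built-in `chr()`, which raises `ValueError` outside the range 0 to 0x10FFFF.

```python
MAX_CODE_POINT = 0x10FFFF

# Total number of code points in the Unicode code space (17 planes of 65536).
print(MAX_CODE_POINT + 1)            # 1114112
# Bits needed to hold the largest code point.
print(MAX_CODE_POINT.bit_length())   # 21

def is_valid_code_point(value):
    """True if value lies in the range that chr() accepts (0 to 0x10FFFF)."""
    try:
        chr(value)
    except ValueError:
        return False
    return True

print(is_valid_code_point(0x10FFFF), is_valid_code_point(0x110000))
```

So a 32-bit unit has 11 bits to spare beyond what the defined range needs.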
bitRAKE wrote:
I mean, if we go through the thought experiment of removing everything we don't like, and try to create an encoding that works in the way we wish it would. Is that even possible?
I don't think it is a matter of "things we don't like", but more of "things that don't belong". It tries to do too much, mixing various aspects of rendering, layout, appearance, and mappings all into one blob. There are literally an infinite number of ways to encode Capital A with acute, and that is just one character. It is also possible to encode entirely new characters that don't exist in any language, just by using the combining "feature".
Post 30 May 2022, 03:13
bitRAKE
Well, my thinking was to start at some ideal and work backward. For size, let's go wide 32-bit. The greatest benefit of fixed size is character count? To have a real character count every character would need to fit in 32-bit and no control characters. The code points don't represent all characters, in all languages - some characters only exist in combination, right?
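
A small Python sketch of that point (the helper name `utf32_units` is made up for this example): a fixed 32-bit width gives a trivial count of code points, but the count still differs between two spellings of what displays as the same letter.

```python
def utf32_units(text):
    """Number of 32-bit units needed to store text as UTF-32."""
    return len(text.encode("utf-32-le")) // 4

precomposed = "\u00c0"   # single code point
combined = "A\u0300"     # A followed by COMBINING GRAVE ACCENT

print(utf32_units(precomposed), utf32_units(combined))
```

The unit count only equals the character count if every character has its own code point and combining sequences are ruled out.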

So, my other thought is to remap UTF-8 into something that is faster to scan.

I'm ignoring the fact that for many problems caching string lengths and structured text is the algorithmically correct solution. Just pondering a better wheel - this is how they are found.

Post 30 May 2022, 06:48
revolution
Thai has plain letters and additional combining vowels. I don't think all combinations are valid, but if they were, a full mixture would be about 44 * 8 possible characters. Even if it is twice that, it doesn't exhaust the UTF-32 range.

Arabic has beginning, middle, and end glyphs for each letter. All possible combinations of those don't get anywhere near UTF-8 exhaustion.

Maori uses a lot of double consonants. A full 21x21 set is only 441 + 5 vowels.

Braille has 6 main dots, and two additional dots. Giving 256 possible outputs.

If there is another language that does have some huge number of possible combinations I would be keen to discover it.
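
The arithmetic above, set against the size of the Unicode code space, in a few lines of Python. The figures are the estimates from this post, taken as given rather than checked against the languages themselves.

```python
CODE_SPACE = 0x10FFFF + 1   # number of Unicode code points

# Upper bounds quoted in the post above (the post's own estimates).
estimates = {
    "Thai (44 letters x 8 vowels)": 44 * 8,
    "Maori (21 x 21 + 5 vowels)": 21 * 21 + 5,
    "Braille (8 dots)": 2 ** 8,
}

for name, count in estimates.items():
    print(f"{name}: {count} ({count / CODE_SPACE:.4%} of the code space)")
```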
Post 30 May 2022, 07:14
FlierMate1



Joined: 31 May 2022
Posts: 118
What about Kanji characters? (But I might not understand what combinations revolution is talking about here)
Post 31 May 2022, 17:39
revolution
I'm not aware of any combining characters intended for Kanji. Do they exist? How many of them are there?

It is possible to combine any of the other things like acutes with Kanji, it isn't disallowed by the Unicode spec. And that is kind of the point I guess. There are so many random weird things that can be done with Unicode that dealing with them correctly can be a headache.
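
As an illustration in Python, using an arbitrarily chosen Kanji (U+6F22) and the combining acute accent (U+0301): the pair is accepted as ordinary text, and NFC normalisation leaves it as two code points because there is no precomposed form to collapse it into.

```python
import unicodedata

kanji = "\u6f22"   # a CJK ideograph, chosen arbitrarily for this example
acute = "\u0301"   # COMBINING ACUTE ACCENT

combo = kanji + acute

# A non-zero combining class marks the accent as a combining character.
print(unicodedata.combining(acute))

# Normalisation keeps both code points; nothing in Unicode merges them.
normalized = unicodedata.normalize("NFC", combo)
print(len(normalized), normalized == combo)
```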
Post 01 Jun 2022, 06:15
Furs



Joined: 04 Mar 2016
Posts: 1888
The point is that such combinations exist already. And can be used (or are used already), even for silly things like "Unicode art".

A flat UTF-32 with no combinations would thus not be Unicode. Nobody said it has to be for "natural languages" only.
Post 01 Jun 2022, 13:05
Copyright © 1999-2020, Tomasz Grysztar.