flat assembler
Message board for the users of flat assembler.
revolution 29 May 2022, 11:27
Unicode is a mess.
What do you propose to replace it with? Note that you are mixing up terms there. ASCII is an encoding; Unicode is a character set. They are very different things.
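To make the character set versus encoding distinction concrete, here is a minimal Python sketch (an illustration added for this thread, not code from any poster). Unicode assigns the code point; each encoding serializes that same code point as different bytes. The sample character is an arbitrary choice.

```python
# One code point, assigned by the Unicode character set.
ch = "\u00c0"  # LATIN CAPITAL LETTER A WITH GRAVE

# The same code point serialized by three different encodings.
encodings = ("utf-8", "utf-16-le", "utf-32-le")
encoded = {name: ch.encode(name) for name in encodings}

for name, data in encoded.items():
    # The code point never changes; only the byte representation does.
    print(f"U+{ord(ch):04X} as {name}: {data.hex(' ')}")
```

The code point stays U+00C0 throughout, while the byte length varies from 2 to 4 depending on the encoding.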
|||
bitRAKE 29 May 2022, 12:10
We clearly need more emojis!
I wonder how many different grunts cavemen had? Can you imagine that one uppity caveman who started using more sounds - he probably got his head bashed in.
_________________
¯\(°_o)/¯ “languages are not safe - uses can be” Bjarne Stroustrup
|||
revolution 29 May 2022, 12:15
I suspect the real gripe here is with the encodings: things like UTF-8, UCS-2, etc.
But Unicode has its own "difficulties" also. Perhaps if everyone used UTF-32 and ceased using the combining, zero-width, and/or non-printing characters then everything would be fine?
|||
bitRAKE 29 May 2022, 13:15
revolution wrote: Perhaps if everyone used UTF-32 and ceased using the combining, zero-width, and/or non-printing characters then everything would be fine?
I suspect things will only get more complicated. We might transition to an easier system, but that transition will be difficult as well.
|||
revolution 29 May 2022, 13:22
bitRAKE wrote: Those codes have real uses in other industries ...
But the question that should be asked, perhaps, is whether Unicode should be used to provide those things. Including what are essentially layout modifiers inside a character set is debatable. Things like HTML, or even the old RTF, are perhaps a better place for them.
Anyhow, we have them now. So we just have to put up with it and deal with all the confusion and bugs that come from it.
|||
revolution 29 May 2022, 13:25
|
|||
bitRAKE 29 May 2022, 13:29
revolution wrote: Anyhow, we have them now. So we just have to put up with it and deal with all the confusion and bugs that come from it.
|||
AsmGuru62 29 May 2022, 17:00
Is it true that UTF-8 is basically an IF statement on every character?
After all, you have to decode the multibyte characters.
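As a rough illustration of where the branching comes from, here is a small Python sketch (the function names are hypothetical, and it assumes well-formed UTF-8 input). Full decoding does branch on the lead byte, but merely counting code points needs only one mask-and-compare per byte, because continuation bytes always have the form 10xxxxxx.

```python
def count_code_points(data: bytes) -> int:
    """Count code points in well-formed UTF-8 by skipping continuation bytes."""
    # Continuation bytes look like 0b10xxxxxx; any other byte starts a code point.
    return sum(1 for b in data if (b & 0xC0) != 0x80)


def sequence_length(lead: int) -> int:
    """Return the byte length of the UTF-8 sequence that starts with `lead`."""
    if lead < 0x80:
        return 1  # 0xxxxxxx: plain ASCII
    if 0xC2 <= lead <= 0xDF:
        return 2  # 110xxxxx (0xC0 and 0xC1 would be overlong, so they are invalid)
    if 0xE0 <= lead <= 0xEF:
        return 3  # 1110xxxx
    if 0xF0 <= lead <= 0xF4:
        return 4  # 11110xxx, capped at U+10FFFF
    raise ValueError(f"0x{lead:02X} is not a valid UTF-8 lead byte")


sample = "a\u00c0\u20ac\U0001f600".encode("utf-8")
print(len(sample), "bytes,", count_code_points(sample), "code points")
```

This only counts code points. It says nothing yet about combining marks or zero-width characters, which is a separate problem.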
|||
revolution 29 May 2022, 17:07
AsmGuru62 wrote: Is it true that UTF-8 is basically an IF statement on every character?
Doing something "simple" like counting characters (not bytes) is a really hard problem. It shouldn't be that way IMO. So much complexity for so little benefit.
Searching for something "simple" like a single character is just as hard. Actually it is harder, with all the combining, zero-width, and other stuff to take into account.
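A small Python sketch of why "count the characters" has several answers. The helper name `approx_visible_count` is hypothetical, and its rule (skip combining marks and format characters) is a deliberate simplification assumed for illustration; it is not the full grapheme cluster algorithm from Unicode Standard Annex #29.

```python
import unicodedata


def approx_visible_count(text: str) -> int:
    """Roughly count user-perceived characters.

    Assumption: ignoring combining marks (Mn, Mc, Me) and format characters
    (Cf, which includes the zero-width joiner) is good enough for a demo.
    """
    skipped = {"Mn", "Mc", "Me", "Cf"}
    return sum(1 for ch in text if unicodedata.category(ch) not in skipped)


# e + COMBINING ACUTE ACCENT, a, ZERO WIDTH JOINER, b
sample = "e\u0301a\u200db"

print("bytes:      ", len(sample.encode("utf-8")))
print("code points:", len(sample))
print("approx. visible characters:", approx_visible_count(sample))
```

The same string gives three different lengths depending on which unit is counted, so any API that reports "length" has to pick one.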
|||
revolution 29 May 2022, 17:34
Here is a nice example.
ÀÀÀ Three characters, right? Perhaps they look slightly different, it depends upon your browser. The last two probably look identical, but they are different. But look again, there are 7 code points. Could you design a search function to find À? What would it find? The first, the second, or the third? All three? Just the last two? None? It isn't an easy problem to solve. This is the UTF-8 byte sequence: Code: c3 80 41 cc 80 41 e2 80 8d e2 80 8d cc 80 |
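The byte sequence above can be inspected with a short Python sketch (added for illustration; only the bytes come from the post). It lists the code points and then tries the most obvious search strategy, NFC normalization, which is one possible policy among several rather than a complete answer.

```python
import unicodedata

# The UTF-8 bytes quoted in the post.
raw = bytes.fromhex("c3 80 41 cc 80 41 e2 80 8d e2 80 8d cc 80")
text = raw.decode("utf-8")

# List every code point with its official Unicode name.
for ch in text:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

needle = "\u00c0"  # precomposed LATIN CAPITAL LETTER A WITH GRAVE

# A naive search sees only the precomposed form.
raw_hits = text.count(needle)

# NFC composes A + COMBINING GRAVE ACCENT, but the zero-width joiners
# in the third character block composition, so it is still missed.
nfc_hits = unicodedata.normalize("NFC", text).count(needle)

print(len(text), "code points;", raw_hits, "raw hit(s);", nfc_hits, "hit(s) after NFC")
```

Finding all three would require an extra policy decision, such as ignoring zero-width joiners before comparing, which is exactly the kind of choice that makes the search hard to specify.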
|||
bitRAKE 29 May 2022, 22:48
https://unicode-org.github.io/icu/userguide/collation/string-search.html
https://docs.microsoft.com/en-us/windows/win32/intl/international-components-for-unicode--icu-
Other programming languages: https://icu.unicode.org/related
|||
bitRAKE 30 May 2022, 01:09
Aren't there languages with a lot of permutations, which make even UTF-32 not enough code space to cover all the possible characters?
I mean, if we go through the thought experiment of removing everything we don't like, and try to create an encoding that works in the way we wish it would, is that even possible?
|||
revolution 30 May 2022, 03:13
bitRAKE wrote: Aren't there languages with a lot of permutations? Which make even UTF-32 not enough code space to cover all the possible characters?
bitRAKE wrote: I mean, if we go through the thought experiment of removing everything we don't like, and try to create an encoding that works in the way we wish it would. Is that even possible?
|||
bitRAKE 30 May 2022, 06:48
Well, my thinking was to start at some ideal and work backward. For size, let's go wide: 32-bit. Is the greatest benefit of fixed size the character count? To have a real character count, every character would need to fit in 32 bits, with no control characters. The code points don't represent all characters in all languages - some characters only exist in combination, right?
So, my other thought is to remap UTF-8 into something that is faster to scan. I'm ignoring the fact that for many problems caching string lengths and structured text is the algorithmically correct solution. Just pondering a better wheel - this is how they are found.
|||
revolution 30 May 2022, 07:14
Thai has plain letters and additional combining vowels. I don't think all combinations are valid, but if they were, a full mixture would be about 44 * 8 possible characters. Even if it is twice that, it doesn't exhaust the UTF-32 range.
Arabic has beginning, middle, and end glyphs for each letter. All possible combinations of those don't get anywhere near UTF-8 exhaustion.
Maori uses a lot of double consonants. A full 21x21 set is only 441 + 5 vowels.
Braille has 6 main dots and two additional dots, giving 256 possible outputs.
If there is another language that does have some huge number of possible combinations I would be keen to discover it.
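The arithmetic behind these estimates can be checked in a few lines of Python. The per-language figures are the rough upper bounds quoted in this post, not linguistic facts; the only outside number is the size of the Unicode code space (U+0000 through U+10FFFF, minus the surrogate range).

```python
CODE_SPACE = 0x110000          # U+0000 through U+10FFFF
SURROGATES = 0xE000 - 0xD800   # 2048 code points reserved for UTF-16 surrogates
SCALAR_VALUES = CODE_SPACE - SURROGATES

# Upper-bound estimates taken from the post above.
estimates = {
    "Thai (44 letters x 8 combining vowels)": 44 * 8,
    "Maori (21 x 21 double consonants + 5 vowels)": 21 * 21 + 5,
    "Braille (8 dots, each on or off)": 2 ** 8,
}

for label, count in estimates.items():
    print(f"{label}: {count}")

total = sum(estimates.values())
print(f"total {total} of {SCALAR_VALUES} encodable values "
      f"({100 * total / SCALAR_VALUES:.3f}%)")
```

Even summed together, these precomposed sets would occupy well under one percent of the available code space.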
|||
FlierMate1 31 May 2022, 17:39
What about Kanji characters? (But I might not understand what combinations revolution is talking about here.)
|||
revolution 01 Jun 2022, 06:15
I'm not aware of any combining characters intended for Kanji. Do they exist? How many of them are there?
It is possible to combine any of the other things, like acutes, with Kanji; it isn't disallowed by the Unicode spec. And that is kind of the point, I guess. There are so many random weird things that can be done with Unicode that dealing with them correctly can be a headache.
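A quick Python check of that claim, as an illustrative sketch with an arbitrarily chosen Kanji: a combining acute accent can follow an ideograph, the string is still valid Unicode, and NFC normalization has no precomposed form to collapse the pair into.

```python
import unicodedata

# U+6F22 (a CJK ideograph) followed by U+0301 COMBINING ACUTE ACCENT.
kanji_with_acute = "\u6f22\u0301"

nfc = unicodedata.normalize("NFC", kanji_with_acute)

print("general categories:", [unicodedata.category(ch) for ch in kanji_with_acute])
print("combining class of the accent:", unicodedata.combining("\u0301"))
print("code points before/after NFC:", len(kanji_with_acute), len(nfc))
```

Any code that assumes one code point per displayed character has to handle this pair too, even though no natural orthography calls for it.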
|||
Furs 01 Jun 2022, 13:05
The point is that such combinations exist already, and can be used (or are used already), even for silly things like "Unicode art".
A flat UTF-32 with no combinations would thus not be Unicode. Nobody said it has to be for "natural languages" only.
Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.