flat assembler
Message board for the users of flat assembler.
Projects and Ideas > IDEA: Unicode characters database representation
JohnFound 16 Feb 2013, 17:31
I have an idea, but I am not able to implement it right now.
If someone wants to help, it would be a significant contribution to FreshLib.

The only way to handle Unicode characters properly is to use the character database from the Unicode Consortium. It is the only way to provide proper ToUpper and ToLower functions, as well as case insensitive search and comparison. Another use of this database is to check the type of a character: digit, letter, punctuation, combining mark, etc.

The database is distributed as several text files, located at: http://www.unicode.org/Public/UNIDATA/ The main one, which provides most of the functionality described above, is UnicodeData.txt.

My idea is to process this database and convert it to a binary format that is more compact, easier to use from assembly language, and possibly easier to search. As this table will still be pretty big, it may be worth compressing it and decompressing it on the fly during application initialization. Together with the decompression, some hash tables might be created in order to provide fast search in this database.

I wish I could implement this project myself, but I am busy and the implementation would go very slowly. So, if someone is looking for an interesting problem - here it is. Also, any thoughts about possible ways to solve this problem would be very helpful.

_________________
Tox ID: 48C0321ADDB2FE5F644BB5E3D58B0D58C35E5BCBC81D7CD333633FEDF1047914A534256478D9
JohnFound 17 Feb 2013, 20:24
50 views and not even one opinion...
HaHaAnonymous 17 Feb 2013, 20:53
[ Post removed by author. ]
Last edited by HaHaAnonymous on 28 Feb 2015, 21:25; edited 1 time in total
AsmGuru62 18 Feb 2013, 00:29
I looked at the file, but I am not sure what to do with it.
What are these things separated by ';'? Is it like a DB TABLE with column data separated by ';', and if a field is empty then it looks like ";;;;"? If so, then you need some kind of SQL engine which will select the data from that table. Am I close?
JohnFound 18 Feb 2013, 05:53
Well, maybe the problem is not so clear. It is my fault.

The mentioned file contains information about the Unicode characters - one character per line. The fields are separated by ";" and they can be empty, as AsmGuru62 noticed. A detailed description of the fields can be found with Google. For example, there are fields for the UpperCase and LowerCase characters corresponding to the given character, and for the type of the character - whether it is a number, punctuation, letter or combining mark (there are many), etc. Using these tables is the only possible way to make a number of string functions, such as UpperCase, LowerCase, etc., work properly with Unicode.

The big problem with this file is that it is huge - more than 1MB in size and more than 24000 lines. For an application written in assembly, such a size is unacceptable, and an SQL database would be even bigger. The task is to make some kind of binary representation of this file that is both very compact and very fast to search. There are two aspects to this task: the design of the binary database, and the tool that will compile the given text files into this representation.

Such a database would be invaluable for a number of projects that need to deal with Unicode text - including OSes written in assembly, text editors, etc. I will use this table in Fresh IDE, whose new source editor is designed to support Unicode, and in the string library of FreshLib.
revolution 18 Feb 2013, 06:09
Proper Unicode has more than 1M possible characters. Using a simple 1M+ entry lookup table could be doable at runtime. You can initialise the table at startup from a compressed binary. Since most entries in the table would be zeros, it would compress well.
Code:
struct UNICODE_INFO
  lower_case dd ?
  upper_case dd ?
  type       dw ?
ends

unicode_info rb sizeof.UNICODE_INFO * 0x110000
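A lookup in such a flat table is then just a couple of instructions; a minimal sketch, assuming the structure above and a code point arriving in EAX:
Code:
; eax = code point; fetch its lower-case mapping from the flat table
imul    esi, eax, sizeof.UNICODE_INFO                       ; record offset
mov     eax, [unicode_info + esi + UNICODE_INFO.lower_case] ; lower-case form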
shoorick 18 Feb 2013, 06:58
JohnFound wrote: 50 views and not even one opinion...

There is no simple answer. Unicode is very handy for handling texts which simultaneously contain characters of different 8-bit codepages: all of them have the same 16-bit width, so there is no problem in handling strings of them. You may use the "small" and "capital" signs from that textual database to make easy upchar/lowchar functions and case insensitive search. It is just a pity that many 8-bit algorithms become inefficient for 16-bit characters...

But when you go beyond the European languages, you start to meet various control characters inside the string, for changing the text direction and so on. In the languages of the Far East you will not meet small/capital characters, but there are traditional<->simplified character sets, which intersect, and it is impossible to convert between these sets with just a table - some conversions depend on the sense of the text. There you will also find characters of the same meaning that differ just slightly in drawing (variants) and thus have different codes. Chinese text usually has no spaces between words, yet may have spaces between characters for readability which do not separate words, and the end of a line may differ from the end of a word. Finally(?), you may find characters of exactly the same look but with different codes, as they were added separately in different countries or for other reasons, like this: 兀兀 - it is kind of like this variant: MАMА & МAМA - are these words the same for the computer?

Even without going so far: look at Polish forums and you will meet grammatically correct messages that simply omit the diacritics for some reason - there is no Polish keyboard layout in the system, or the writer is just too lazy to press the right Alt. Such text will fail to be found by a regular search.

Something funny in addition: you may know that there is a Ё letter in Russian, and it is nowadays mostly ignored and replaced with Е. The funny part is that if you use Ё correctly instead of Е, many spellchecking and translating programs will still report an error and refuse to translate - while, say, ВСЕ & ВСЁ are both correct words with different meanings.

All of these issues make fully intelligent search/comparison/conversion of strings very hard; it is not just an 8<->16 bit problem. You would need a team of specialists who know the linguistic specifics of the different languages to do that. In my program I use text files to load tables of characters with the same meaning/look. It is not perfect, but I got a result; by now, though, it seems to have become abandonware.

_________________
UNICODE forever!
JohnFound 18 Feb 2013, 07:09
@revolution:
BTW, Unicode also defines a so-called TitleCase, and for some characters it is different from the UpperCase. The big disadvantage of such a straightforward solution is that it will use more than 16MB of memory (0x110000 entries at up to 16 bytes each, once a title-case field is added), mostly filled with zeros. For comparison, the whole of KolibriOS runs in 8MB of RAM. Maybe a better solution is to use variable-size records and indexes? Or binary search - it will be a little slower, of course, but will allow very small tables. The mentioned UnicodeData.txt contains information about only 24000 characters, and that seems to be enough for most cases. Of course, expanding the tables at runtime is possible, but IMHO it has to be done carefully in order not to waste memory.
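A minimal sketch of the binary-search variant, assuming sorted fixed-size 16-byte records that each start with a dd code point (the routine name and the register conventions are invented for illustration):
Code:
; in:  eax = code point, esi = table base, ecx = record count
; out: CF clear and esi -> record if found, CF set otherwise
find_char:
        xor     ebx, ebx                ; low = 0
.next:
        cmp     ebx, ecx
        jae     .not_found              ; low >= high - empty interval
        lea     edx, [ebx + ecx]
        shr     edx, 1                  ; mid = (low + high) / 2
        imul    edi, edx, 16            ; byte offset of record [mid]
        cmp     eax, [esi + edi]
        je      .found
        jb      .lower_half
        lea     ebx, [edx + 1]          ; low = mid + 1
        jmp     .next
.lower_half:
        mov     ecx, edx                ; high = mid
        jmp     .next
.found:
        add     esi, edi                ; esi -> the matching record
        clc
        ret
.not_found:
        stc
        ret
With 24000 records that is at most 15 comparisons per lookup.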
JohnFound 18 Feb 2013, 07:14
@shoorick: Of course, correct processing of linguistic information is really hard. But I am not trying to do that anyway. The goal of this project is to provide only the mechanical processing of Unicode strings - which by itself is not an easy task.
shoorick 18 Feb 2013, 07:27
then my solutions were:
1. a list of separators;
2. a list of character translations.
Or... you may use some lookup tables of 65536 bytes each. They may live in a separate Unicode support module - in our Blu-ray era that no longer looks too bloated.
JohnFound 18 Feb 2013, 07:34
@shoorick: As far as I can see, you think that Unicode is a 16-bit encoding. That is not true. The whole Unicode space is $110000 characters long. In the UTF-16 encoding, some characters take 2 words. UTF-16 as used by MS Windows is actually a variable-size encoding, not a fixed-size one.
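For reference, a sketch of decoding such a 2-word (surrogate) pair into a code point under the standard UTF-16 rules; the register assignments are arbitrary:
Code:
; code point = 0x10000 + ((high - 0xD800) shl 10) + (low - 0xDC00)
movzx   eax, ax                         ; AX = high surrogate, 0xD800..0xDBFF
movzx   edx, dx                         ; DX = low surrogate,  0xDC00..0xDFFF
sub     eax, 0xD800
shl     eax, 10
lea     eax, [eax + edx + 0x10000 - 0xDC00]   ; eax = code point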
shoorick 18 Feb 2013, 09:02
Yes, I meant exactly UTF-16. The full space is even more complex to handle; a balanced tree for the conversions could then be a solution.

It is better to balance the tree of characters according to character-frequency statistics to get more speed, and then to copy the tree into a new one so that the more frequent characters sit closer to the tree root in memory (jumping between memory pages in a large tree may slow down the search even when the tree is balanced). I would suggest keeping in the UNICODE_INFO structure not only lower_case and upper_case, but also a pointer to a character which may serve as an "alias" for searching, like this: A is the alias for ÀÁÂÃÄÅàáâãäåāĀĂ㥹ǍǎǺǻ.

BTW, a lot of characters (say ";") have no case variants; a character not found in the tree can then be treated as-is. This may reduce the size of the tree.
LocoDelAssembly 18 Feb 2013, 19:50
Not an answer about dealing with case conversion, but it might be interesting for you: Cache-Oblivious Algorithms and Data Structures. Also google "cache oblivious".
JohnFound 18 Feb 2013, 22:04
LocoDelAssembly, thanks for the reading - it is really interesting. As for the discussed problem: unfortunately, I am not able to take on this task right now, because I have several projects to work on, both at my job and among my open source projects. That is why I started this discussion - to attract someone else to do it instead of me.
cod3b453 24 Feb 2013, 23:20
This is something I had back-burnered a while ago for UTF-8 support, though more for keyboard/font translation. My thinking was that I'd start with character group maps (an array/tree of ranges) and then map the upper/lower case regions within those node entries with 2 base offsets and a limit. This might also work for the other character types (note I was only considering the Latin/Greek/Cyrillic groups; I'm not sure how well it would work for others) and should be smaller (<1k) than a total mapping of every character.
JohnFound 25 Feb 2013, 12:54
cod3b453 wrote: This is something I had back-burnered a while ago for UTF-8 support, though more for keyboard/font translation. My thinking was that I'd start with character group maps (an array/tree of ranges) and then map the upper/lower case regions within those node entries with 2 base offsets and a limit. This might also work for the other character types (note I was only considering the Latin/Greek/Cyrillic groups; I'm not sure how well it would work for others) and should be smaller (<1k) than a total mapping of every character.

I don't understand the whole idea yet. But the case tables for the whole Unicode range should not be that big, because only a small part of the Unicode characters has cases. For example, the Chinese characters have no case, so they will not be placed in the tables.
cod3b453 25 Feb 2013, 19:55
This is the general shape I was referring to:
Code:
; Range entry 0
  dd 0x00000000   ; Base
  dd 0x0000007F   ; Limit
  db 1            ; Number of case tables
; Case table 0
  dw 'A'          ; Upper offset
  dw 'a'          ; Lower offset
  dw 'Z'-'A'      ; Limit
; Range entry 1
  dd 0x00000100   ; Base
  dd 0x000002AF   ; Limit
  db 2
; Case table 0
  dw 0            ; LATIN CAPITAL LETTER A WITH MACRON
  dw 1            ; LATIN SMALL LETTER A WITH MACRON
  dw 0x37/2       ; LATIN SMALL LETTER K WITH CEDILLA
; Case table 1
; ...

Next you find the separation size between upper and lower by MIN(MAX(0,1) - MIN(0,1), 0x37/2) = 1, which implies that you use base 2 (for range 0 it would be mod 26, due to the limit of 25) and calculate 0x23 mod 2 = 1{offset}. Upper case would be where 1{offset} < 0{base}+1{separation} -> FALSE, and lower case would be where 1{offset} < 1{base}+1{separation} -> TRUE; hence the character is lower case, and the conversion to upper is to add (0{upper_base} - 1{lower_base}) = -1, i.e. giving 0x0122. (Sorry it's so convoluted, but hopefully you can see how this can be applied to different mappings in the general case.)

My revised guesstimate is somewhere around 512-768 bytes to store the upper/lower pairings. Hopefully this makes enough sense?
JohnFound 25 Feb 2013, 21:06
Well, I can't see the whole picture, but the idea is more or less clear. (Does "<=" in your post mean less-or-equal? And what does 0x37/2 mean? 0x37/2 = 0x1B.)
But there are several questions:
1. In your example you are using words for the Unicode characters, but there are small/capital letters that have to be encoded in 2 words.
2. Have you checked that the letters that have cases are always grouped in ranges?
3. How will these range structures be created from the file UnicodeData.txt?
4. For example, what would the code for ToLower look like?
cod3b453 25 Feb 2013, 21:20
JohnFound wrote: Well, I can't see the whole picture, but the idea is more or less clear. (Does "<=" in your post mean less-or-equal? And what does 0x37/2 mean? 0x37/2 = 0x1B.)