IDEA: Unicode characters database representation.

Index > Projects and Ideas > IDEA: Unicode characters database representation.

Goto page 1, 2 Next

Author

Thread

JohnFound

Joined: 16 Jun 2003
Posts: 3499
Location: Bulgaria

JohnFound 16 Feb 2013, 17:31

I have an idea, but I am not able to implement it right now.
If someone wants to help - it will be significant contribution to FreshLib.

The only possible way to properly handling Unicode characters is to use the characters database from The Unicode Consortium.

It is the only way to provide proper ToUpper, ToLower functions and also case insensitive search and comparison. Another use of this database is to check the type of the characters: such as digits, letters, punctuation, combining marks, etc.

The database is distributed as a several text file formats. The files are located at: http://www.unicode.org/Public/UNIDATA/

The main of them, which will provide the most of the functionality described above is: UnicodeData.txt

My idea is to make some processing of this database in order to convert it to binary format more compact and easy for use from assembly language and possibly easy for search.

As long as this table will be big enough, maybe it worth to make it compressed and to uncompress it on the fly on the application initialization. Together with the uncompression, some hash tables might be created in order to provide fast search in this database.

I wish I could implement this project, but I am busy and the implementation would be very slow.

So, if someone searches for interesting problem - here is it.

Also, any considerations about possible ways to solve this problem would be very helpful.

_________________
Tox ID: 48C0321ADDB2FE5F644BB5E3D58B0D58C35E5BCBC81D7CD333633FEDF1047914A534256478D9

16 Feb 2013, 17:31

JohnFound

Joined: 16 Jun 2003
Posts: 3499
Location: Bulgaria

JohnFound 17 Feb 2013, 20:24

50 views and not even one opinion...

17 Feb 2013, 20:24

HaHaAnonymous

Joined: 02 Dec 2012
Posts: 1178
Location: Unknown

HaHaAnonymous 17 Feb 2013, 20:53

[ Post removed by author. ]

Last edited by HaHaAnonymous on 28 Feb 2015, 21:25; edited 1 time in total

17 Feb 2013, 20:53

baldr

Joined: 19 Mar 2008
Posts: 1651

baldr 17 Feb 2013, 21:26

JohnFound,

Unicode addresses many problems, maybe library can chew them one at a time? As Niklaus Wirth said, "programs = algorithms + data structures". Codepoints' classification supposes something like _ctype array of bitfields, case conversion needs something completely different; ligatures, composite characters and such... I know you've got the pattern. Wink

17 Feb 2013, 21:26

AsmGuru62

Joined: 28 Jan 2004
Posts: 1817
Location: Toronto, Canada

AsmGuru62 18 Feb 2013, 00:29

I looked at file, but not sure what to do with it.
What are these things separated by ';'?

Is it like a DB TABLE with columns data separated by ';' and if data is empty then it is like: ";;;;"?

If so, then you need some kind of SQL engine which will select the data
from that table.

Am I close?

18 Feb 2013, 00:29

JohnFound

Joined: 16 Jun 2003
Posts: 3499
Location: Bulgaria

JohnFound 18 Feb 2013, 05:53

Well, maybe the problem is not so clear. It is my guilt.
The mentioned file contains an information about Unicode characters - one character at line. The fields are separated by ";" and they can be empty - as AsmGuru62 noticed.

The detailed description of the fields can be found with Google. For example, there are fields about the UpperCase and LowerCase character corresponding the given character. The type of the character - whether it is number, punctuation, letter or combining mark (they are many) etc.

To use this tables is the only possible way to provide number of string functions as UpperCase, LowerCase, etc. that to properly work with Unicode.

The big problem with this file is that it is huge - more than 1MB in size and more than 24000 lines.

But for assembly written application, such size is unacceptable. SQL database will be even bigger.

The task is to make some kind of binary representation of this file that to be both - very compact, and very fast to be searched.

There are several aspects of this task - the design of this binary database. The tool that will compile given text files to this representation.

Such database will be invaluable for number of projects that need to deal with Unicode text - including assembly written OSes, text editors, etc.
I will use this table for the project Fresh IDE with the new source editor that is designed to support Unicode and with the string library of FreshLib.

18 Feb 2013, 05:53

revolution
When all else fails, read the source

Joined: 24 Aug 2004
Posts: 21011
Location: In your JS exploiting you and your system

revolution 18 Feb 2013, 06:09

Proper Unicode has more than 1M possible characters. Using a simple 1M+ entry lookup table could be doable at runtime. You can initialise the table at startup from a compressed binary. Since most characters would be zeros in the table it would compress well.

Code:

struct UNICODE_INFO
 lower_case dd ?
 upper_case dd ?
 type dw ?
ends

unicode_info rb sizeof.UNICODE_INFO * 0x110000

18 Feb 2013, 06:09

shoorick

Joined: 25 Feb 2005
Posts: 1617
Location: Ukraine

shoorick 18 Feb 2013, 06:58

JohnFound wrote:

50 views and not even one opinion...

there is no simple answer Wink

unicode is very handy for handling texts which simultaneusly contain characters, of different 8-bit codepages, all of them have same 16-bit width, so there is no any problem to handle strings of them. you may use "small" and "capital" signs from that textual database to make easy upchar/lowchar functions and case insensitive search. just pity that many 8-bit algorythms become inefficient for 16-bit characters...

but when you go out of europian languages, you start to meet different control characters inside the string, kinda changing text direction etc.

in languages of the far east you will not meet small/capital characters, but there are traditional<->simplified character sets, which are intersecting and it is impossible to use just table to convert these sets, some conversions depends on the sense of the text. also there you will find characters of the same meaning just slightly different in drawing (variants), and thus have different code. chinese text usually have no spaces between words and may have spaces between characters for readability and they are not separating words, as well as the end of line may differ of the end of the word. finally(?) you may find characters of the exactly same look, but different code as they were added separately in different countries or other reason, like this:
兀兀
- it is kinda this variant: MАMА & МAМA - are these words the same for computer?
even going not so far: look at polish forums and you will meet grammaticaly correct messages just omiting diacritics due to some reasons: there is no polish keyboard layout in the actual system or just lazy to press right alt Wink

these text will fail to be found in regular search Wink

something funny in addition: you could know that there is a Ё letter in russian, and it is nowdays mostly ignoring and replacing with Е Wink

it is funny, but if you will use Ё correctly instead of Е, anyway many spellchecking and translating programs will report an error and refuse to translate, while, say, ВСЕ & ВСЁ are both correct words of different meaning Wink

all of these issues make complete intellectual search/comparing/conversion of strings very hard, it is not just 8<->16 bit problem Wink

you will need a command of specialists which are known of linguistic specifics of different languages to do that Wink

in my program i use texts to load tables of same meaning/looking characters. it is not perfect, but i got a result, just now it seems became abandonware Wink

_________________
UNICODE forever!

18 Feb 2013, 06:58

JohnFound

Joined: 16 Jun 2003
Posts: 3499
Location: Bulgaria

JohnFound 18 Feb 2013, 07:09

@revolution:
BTW, Unicode defines so called TitleCase as well and for some characters it is different than UpperCase.
The big disadvantage of such straightforward solution is that it will use more than 16MB (if title case included) of memory table full of 0s. For comparison - the whole KolibriOS works with 8MB RAM.
Maybe the better solution is to use variable size records and indexes? Or binary search - it will be a little bit slower of course, but will allow very small tables.
Mentioned UnicodeData.txt contains only information about 24000 characters and it seems to be enough for most cases.
Of course expanding the tables in runtime is possible, but it can be made carefully IMHO in order to not waste memory.

18 Feb 2013, 07:09

JohnFound

Joined: 16 Jun 2003
Posts: 3499
Location: Bulgaria

JohnFound 18 Feb 2013, 07:14

@shoorick: Of course correct processing of linguistic information is really hard. But I am not trying to do it anyway. The goal of this project is to provide just the mechanical processing of Unicode strings - which by itself is not easy task.

18 Feb 2013, 07:14

shoorick

Joined: 25 Feb 2005
Posts: 1617
Location: Ukraine

shoorick 18 Feb 2013, 07:27

then my solutions were:
1.list of separators
2.list of character translations

or... you may use some lookup tables of 65536 bytes. they may be in a separate module of unicode supporting Wink

in our blueray era it's already not looking too bloating Wink

18 Feb 2013, 07:27

JohnFound

Joined: 16 Jun 2003
Posts: 3499
Location: Bulgaria

JohnFound 18 Feb 2013, 07:34

@shoorick: As I can see, you think that Unicode is 16bit encoding. It is not true. The whole Unicode space is $110000 characters long. In UTF-16 encodings, some characters take 2 words. UTF-16 used by MS Windows is actually variable size encoding, not fixed size.

18 Feb 2013, 07:34

shoorick

Joined: 25 Feb 2005
Posts: 1617
Location: Ukraine

shoorick 18 Feb 2013, 09:02

yes, i ment exactly utf-16. full space is even more complex to handle, then balanced tree for conversions can be a solution.
it is better to balance the tree of chars on char appearing statistic to get more speed, then copy the tree into new one to get more frequently chars to be situated closer to the tree root in memory (jumping between memory pages in large tree may slow down search even in balanced tree).
i would suggest to use in UNICODE_INFO structure not only lower_case and upper_case, but also pointer to character which may be an "alias" for searching, like this:
A is alias for: ÀÁÂÃÄÅàáâãäåāĀĂăĄąǍǎǺǻ

+++++++++
btw, a lot of chars (say ";") has no different cases, then char not found in tree treated as is. this may reduce the size of the tree Wink

18 Feb 2013, 09:02

LocoDelAssembly
Your code has a bug

Joined: 06 May 2005
Posts: 4623
Location: Argentina

LocoDelAssembly 18 Feb 2013, 19:50

Not an answer about dealing with case conversion but might be interesting for you: Cache-Oblivious Algorithms and Data Structures. Also google "cache oblivious".

18 Feb 2013, 19:50

JohnFound

Joined: 16 Jun 2003
Posts: 3499
Location: Bulgaria

JohnFound 18 Feb 2013, 22:04

LocoDelAssembly, thanks for the reading. It is really interesting. As for the discussed problem - unfortunately, I am not able to take this task right now, because I have several projects to work both for my job and my open source projects. That is why I started this discussion - in order to attract someone else to do it instead of me. Wink

18 Feb 2013, 22:04

cod3b453

Joined: 25 Aug 2004
Posts: 618

cod3b453 24 Feb 2013, 23:20

This is something I had back burner-ed a while ago for UTF8 support but more for keyboard/font translation. My thinking was I'd start with character group maps (array/tree of ranges) and then map the upper/lower case regions within those node entries with 2 base offsets and a limit. This might also work for the other character types (note I was only considering latin/greek/cyrillic groups; not sure how well it would work for others) and should be smaller (<1k) than a total mapping of every character.

24 Feb 2013, 23:20

JohnFound

Joined: 16 Jun 2003
Posts: 3499
Location: Bulgaria

JohnFound 25 Feb 2013, 12:54

cod3b453 wrote:

This is something I had back burner-ed a while ago for UTF8 support but more for keyboard/font translation. My thinking was I'd start with character group maps (array/tree of ranges) and then map the upper/lower case regions within those node entries with 2 base offsets and a limit. This might also work for the other character types (note I was only considering latin/greek/cyrillic groups; not sure how well it would work for others) and should be smaller (<1k) than a total mapping of every character.

I didn't understand the whole idea. But the case tables for the whole unicode range should be not so big, because only small part of the UNICODE characters have cases. For example, the Chinese hieroglyphs have no case, so they will not be placed in the tables.

_________________
Tox ID: 48C0321ADDB2FE5F644BB5E3D58B0D58C35E5BCBC81D7CD333633FEDF1047914A534256478D9

25 Feb 2013, 12:54

cod3b453

Joined: 25 Aug 2004
Posts: 618

cod3b453 25 Feb 2013, 19:55

This is the general shape I was referring to:

Code:

; Range entry 0
dd 0x00000000 ; Base
dd 0x0000007F ; Limit
   db 1       ;   Number of case tables
   ; Case table 0
   dw 'A'     ;   Upper offset
   dw 'a'     ;   Lower offset
   dw 'Z'-'A' ;   Limit
; Range entry 1
dd 0x00000100 ; Base
dd 0x000002AF ; Limit
   db 2
   ; Case table 0
   dw 0       ;   LATIN CAPITAL LETTER A WITH MACRON
   dw 1       ;   LATIN SMALL LETTER A WITH MACRON
   dw 0x37/2  ;   LATIN SMALL LETTER K WITH CEDILLA
   ; Case table 1
   ; ...

To decode e.g. 0x0123 you'd see it lies in range 1 (0x0100<=0x0123<=0x2AF), case table 0 (0<=(0x0123-0x0100)<=0+0x37/2 OR 1<=(0x0123-0x0100)<=1+0x37/2).

Next you find the separation size between upper and lower by MIN(MAX(0,1) - MIN(0,1),0x37/2) = 1 which implies that you use base 2 (for range 0 it would be mod 26 due to limit of 25) and calculate 0x23 mod 2 = 1{offset}.

Upper case would be where 1{offset} < 0{base}+1{separation} -> FALSE and lower would be where 1{offset} < 1{base}+1{separation} -> TRUE; hence the character is lower and conversion to upper is add (0{upper_base} - 1{lower_base}) = -1 i.e. 0x0122.

(Sorry it's so convoluted but hopefully you can see how this can be applied to different mappings in the general case)

My revised guesstimate is somewhere around 512-768 bytes to store the upper/lower pairings.

Hopefully this makes enough sense?

25 Feb 2013, 19:55

JohnFound

Joined: 16 Jun 2003
Posts: 3499
Location: Bulgaria

JohnFound 25 Feb 2013, 21:06

Well, I can't see the whole picture, but the idea is more or less clear. (Is "<=" in your post means less or equal? and what 0x37/2 means? 0x37/2 = 0x1B)

But there are several questions:
1. In your example you are using words for unicode characters, but there are small/capital letters that have to be encoded in 2 words.
2. Are you checked that the letters that have cases are always grouped in ranges?
3. How these ranges structures will be created from the file UnicodeData.txt?
4. For example, how the code for ToLower will looks like?

25 Feb 2013, 21:06

cod3b453

Joined: 25 Aug 2004
Posts: 618

cod3b453 25 Feb 2013, 21:20

JohnFound wrote:

Well, I can't see the whole picture, but the idea is more or less clear. (Is "<=" in your post means less or equal? and what 0x37/2 means? 0x37/2 = 0x1B)
...

Yes and yes

JohnFound wrote:

...
But there are several questions:
1. In your example you are using words for unicode characters, but there are small/capital letters that have to be encoded in 2 words.
2. Are you checked that the letters that have cases are always grouped in ranges?

The range table provides a 32bit base (current UNICODE space is 31) but the assumption is that the difference will fit into 16bit - are there cases where the upper/lower are more than 2^16 apart?

JohnFound wrote:

3. How these ranges structures will be created from the file UnicodeData.txt?

To be honest, either some complicated app/script or manually

JohnFound wrote:

4. For example, how the code for ToLower will looks like?

The conversion to lower is identical to the ToUpper except that the final adjustment is add (1{lower_base} - 0{upper_base}) = 1.

25 Feb 2013, 21:20

Goto page 1, 2 Next

< Last Thread | Next Thread >

Forum Rules:

You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum
You cannot attach files in this forum
You can download files in this forum