IDEA: Unicode characters database representation.


cod3b453

Joined: 25 Aug 2004
Posts: 618

cod3b453 25 Feb 2013, 22:02

cod3b453 wrote:

...
JohnFound wrote:

3. How will these range structures be created from the file UnicodeData.txt?
To be honest, either some complicated app/script or manually...

Actually, having thought about this a little more, it should be possible to extract the upper/lower pairs from the list and then merge consecutive groups to form the tables:

grep "UPPER\|LOWER"
match 'a UPPER x' to 'b LOWER x' to form a <-> b pairs for character x
merge
EDIT: "while pairs have the same difference and increment, increment the current set's limit"
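
A minimal C sketch of the merge step, assuming the lower/upper pairs have already been extracted and sorted by upper code point. The names `case_pair`, `case_range` and `merge_pairs`, and the `MAX_STEP` cap, are hypothetical and not taken from any code in this thread.

```c
#include <stddef.h>

/* Hypothetical cap on the distance between consecutive pairs in one range. */
#define MAX_STEP 2u

/* One lower/upper pair, e.g. {0x61, 0x41} for a/A. */
struct case_pair  { unsigned lower, upper; };

/* A run of pairs sharing one lower-upper difference and one step. */
struct case_range { unsigned lower, upper, step, count; };

/* Merge pairs (sorted by upper) into ranges. Returns the number of ranges
   written to out, which must have room for npairs entries. */
size_t merge_pairs(const struct case_pair *pairs, size_t npairs,
                   struct case_range *out)
{
    size_t nout = 0;
    size_t i;

    for (i = 0; i < npairs; i++) {
        if (nout > 0) {
            struct case_range *r = &out[nout - 1];
            /* distance from the last pair already in the current range */
            unsigned du = pairs[i].upper - (r->upper + (r->count - 1) * r->step);
            unsigned dl = pairs[i].lower - (r->lower + (r->count - 1) * r->step);

            /* du == dl means the lower-upper difference is unchanged;
               a one-pair range has no step yet and adopts the first one seen */
            if (du == dl && du >= 1 && du <= MAX_STEP &&
                (r->count == 1 || du == r->step)) {
                r->step = du;
                r->count++;
                continue;
            }
        }
        /* otherwise start a new range at this pair */
        out[nout].lower = pairs[i].lower;
        out[nout].upper = pairs[i].upper;
        out[nout].step  = 1;
        out[nout].count = 1;
        nout++;
    }
    return nout;
}
```

Fed a/A, b/B, c/C followed by three interleaved pairs starting at U+0101/U+0100, this should produce two ranges: one with step 1 and one with step 2.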

Last edited by cod3b453 on 04 Mar 2013, 22:31; edited 1 time in total


Madis731

Joined: 25 Sep 2003
Posts: 2138
Location: Estonia

Madis731 28 Feb 2013, 12:00

I'm going off-topic here, but I have thought about Unicode many times. My problem was not the decoding but the representation. http://www.fileformat.info/info/unicode/char/0040/commercial_at.svg
Sometimes a glyph takes more than a KB in SVG.

On topic: isn't the easiest way to remap Unicode characters (in the database) the same as in ASCII, where you add/subtract 20h? If there are more letters, just use another bit to flip between cases. Other characters go just after the table.
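
For reference, the ASCII trick mentioned here can be sketched in C as below. The function name `ascii_flip_case` is made up for illustration; it only covers A-Z and a-z and leaves everything else alone.

```c
/* Flip the case of an ASCII letter by toggling bit 5 (20h).
   Any other code point is returned unchanged. */
unsigned ascii_flip_case(unsigned c)
{
    unsigned folded = c | 0x20u;    /* maps A-Z onto a-z, keeps a-z as-is */

    if (folded >= 'a' && folded <= 'z')
        return c ^ 0x20u;           /* same effect as add/subtract 20h */
    return c;
}
```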



cod3b453 28 Feb 2013, 23:11

Madis731 wrote:

...
On topic: isn't the easiest way to remap Unicode characters (in the database) the same as in ASCII, where you add/subtract 20h? If there are more letters, just use another bit to flip between cases. Other characters go just after the table.

Yes, but unfortunately it's only that simple for ASCII. Some groups interleave upper and lower 1 and 1, others come in blocks like 6 and 6 (see the Latin/Cyrillic/Greek groups). On top of this, they're not all aligned, so a plain bit flip won't always work, but addition is no worse than xor and works the same way.
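
To illustrate the alignment point with two ranges that also show up in the table later in the thread: in U+0100..U+012F the capitals sit on even code points, so a bit operation works, while in U+0139..U+0148 the capitals sit on odd code points and only an offset from the start of the block works. This is a sketch; the function names are hypothetical.

```c
/* U+0100..U+012F: capital on the even code point, small letter on the
   following odd one, so bit 0 selects the case. */
unsigned to_lower_0100_012F(unsigned c)
{
    if (c >= 0x0100u && c <= 0x012Fu)
        return c | 1u;
    return c;
}

/* U+0139..U+0148: the same 1-and-1 interleave, but capitals are on odd
   code points, so the case is found from the offset into the block and
   the conversion is an addition. */
unsigned to_lower_0139_0148(unsigned c)
{
    if (c >= 0x0139u && c <= 0x0148u && ((c - 0x0139u) & 1u) == 0)
        return c + 1u;
    return c;
}
```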


shoorick

Joined: 25 Feb 2005
Posts: 1614
Location: Ukraine

shoorick 01 Mar 2013, 06:55

JohnFound wrote:

For example, the Chinese hieroglyphs have no case, so they will not be placed in the tables.

you are right. there is no need to use them for to_lower or to_upper functions.

but what about case-insensitive search? what is it? it treats some characters of different look (and code) as similar. in european languages we have small and capital letters. in chinese we have variant writings of the same character, and there may be more than 2 variants. if we wish to make an efficient search in chinese text, like case-insensitive search for european text, we must handle these characters too, even if there are not many of them.

the optimisation can be provided by splitting the unicode range into parts.
the function simply looks at the code of a given character to find which range it is in and, depending on the result, uses the proper algorithm to convert it into the code used for comparing.

there can even be a table of processing modules, and the function on entry selects the matching module to process the character code.
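
A rough C sketch of such a table of processing modules, assuming each module is simply a function that folds one code point to its comparison form. All names here (`fold_fn`, `fold_module`, `fold_char`) are hypothetical, and only three simple ranges are filled in; a module for Chinese variant forms would need its own mapping data, which is not shown.

```c
#include <stddef.h>

typedef unsigned (*fold_fn)(unsigned c);

/* ASCII: capitals are 20h below the small letters. */
static unsigned fold_ascii(unsigned c)
{
    return (c >= 'A' && c <= 'Z') ? c + 0x20u : c;
}

/* U+0100..U+012F: even = capital, odd = small. */
static unsigned fold_latin_ext_a(unsigned c)
{
    return c | 1u;
}

/* Fullwidth capitals U+FF21..U+FF3A: small letters are 20h above. */
static unsigned fold_fullwidth(unsigned c)
{
    return c + 0x20u;
}

struct fold_module { unsigned first, last; fold_fn fold; };

static const struct fold_module modules[] = {
    { 0x0000u, 0x007Fu, fold_ascii       },
    { 0x0100u, 0x012Fu, fold_latin_ext_a },
    { 0xFF21u, 0xFF3Au, fold_fullwidth   },
};

/* Select the module by range, then let it convert the code point. */
unsigned fold_char(unsigned c)
{
    size_t i;

    for (i = 0; i < sizeof modules / sizeof modules[0]; i++)
        if (c >= modules[i].first && c <= modules[i].last)
            return modules[i].fold(c);
    return c;   /* no module claims this range: compare the code as-is */
}
```

With many ranges, a binary search over `modules` (kept sorted by `first`) would replace the linear scan.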

_________________
UNICODE forever!



cod3b453 04 Mar 2013, 22:30

Apologies for the use of C (and the poor code, very much a work in progress), but this code takes the unicode.txt file and splits it into upper and lower files. It then scans both for matching pairs and prints the table values:

Code:

#define _CRT_SECURE_NO_WARNINGS 1

#include <stdio.h>
#include <string.h>


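void str_replace( char * pdst, const char * psrc, const char * pfind, const char * preplace )
{
        /*
        "..."<find>"..." -> "..."<replace>"..."

        pdst may alias psrc (as in main below) only while the replacement is
        shorter than the search string: strcpy() also writes a terminating
        NUL, which must not land on source text that has not been read yet
        */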
        int flen = strlen(pfind);
        int rlen = strlen(preplace);

        for ( ; *psrc; pdst++, psrc++)
        {
                if (strncmp(psrc,pfind,flen) != 0)
                {
                        *pdst = *psrc;
                }
                else
                {
                        strcpy(pdst,preplace);

                        psrc = psrc + flen - 1;
                        pdst = pdst + rlen - 1;
                }
        }

        *pdst = 0;
}

int main(int argc,char * argv[])
{
        FILE * f;
        FILE * fu;
        FILE * fl;

        char line[1024];

        char * p;
        char * q;

        int unicode;
        char name[128];

        int b = 0;
        int dl;
        int du;
        int d;

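        f = fopen("../../unicode.txt","rb");
        fu = fopen("upper.txt","wb");
        fl = fopen("lower.txt","wb");

        if (!f || !fu || !fl)
        {
                perror("fopen");
                return 1;
        }

        /* loop on fgets() itself: feof() only turns true after a read has
           already failed, so testing it first handles the last line twice */
        while (fgets(line,sizeof(line),f))
        {
                if (sscanf(line,"%x",&unicode) != 1)
                {
                        continue;
                }

                /* the character name is the second ';'-separated field */
                p = strchr(line,';');
                q = p ? strchr(p+1,';') : NULL;

                if (!q)
                {
                        continue;
                }

                p = p + 1;
                *q = 0;

                if (strstr(line,"SMALL"))
                {
                        str_replace(p,p,"SMALL ","");
                        fprintf(fl,"%08X %s\n",unicode,p);
                }

                if (strstr(line,"CAPITAL"))
                {
                        str_replace(p,p,"CAPITAL ","");
                        fprintf(fu,"%08X %s\n",unicode,p);
                }
        }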

        fflush(fu);
        fflush(fl);

        fclose(fl);
        fclose(fu);

        fclose(f);

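        fu = fopen("upper.txt","rb");
        fl = fopen("lower.txt","rb");

        if (!fu || !fl)
        {
                perror("fopen");
                return 1;
        }

        /* state carried across iterations: the previously matched pair and
           the lower.txt offset just past the last match (declared outside
           the loops so the values are defined on every pass) */
        int puc = 0;
        int plc = 0;
        long last_l = 0;

        /* for each capital, scan forward through the small letters */
        while (fgets(line,sizeof(line),fu))
        {
                int uc;
                char uname[128];

                sscanf(line,"%08X",&uc);
                strcpy(uname,line+9);

                while (fgets(line,sizeof(line),fl))
                {
                        int lc;
                        char lname[128];

                        sscanf(line,"%08X",&lc);
                        strcpy(lname,line+9);

                        if (strcmp(lname,uname) == 0)
                        {
                                last_l = ftell(fl);

                                /* b = number of pairs in the current range */
                                switch (b)
                                {
                                        case 0:
                                        {
                                                b = 1;
                                                dl = lc;
                                                du = uc;
                                                d = lc - uc;
                                                break;
                                        }
                                        case 1:
                                        {
                                                if (d != (lc - uc))
                                                {
                                                        printf("%04X\n",b);
                                                        b = 0;
                                                }
                                                else
                                                {
                                                        /* second pair fixes the step */
                                                        b = 2;
                                                        dl = lc - dl;
                                                        du = uc - du;
                                                }
                                                break;
                                        }
                                        default:
                                        {
                                                if      (
                                                                (dl != (lc - plc))
                                                        ||      (du != (uc - puc))
                                                        ||      (d != (lc - uc))
                                                        )
                                                {
                                                        printf("%04X\n",b);
                                                        b = 0;
                                                }
                                                else
                                                {
                                                        b = b + 1;
                                                }
                                                break;
                                        }
                                }

                                /* a range was just closed: this pair starts the next one */
                                if (b == 0)
                                {
                                        b = 1;
                                        dl = lc;
                                        du = uc;
                                        d = lc - uc;
                                }

                                if (b == 1)
                                {
                                        printf("%08X %08X %04X ",
                                                lc,
                                                uc,
                                                (lc - uc));
                                }

                                puc = uc;
                                plc = lc;

                                break;
                        }
                        else if (
                                                b
                                        &&      (       (dl != (lc - plc))
                                                ||      (du != (uc - puc))
                                                ||      (d != (lc - uc))
                                                )
                                        )
                        {
                                printf("%04X\n",b);
                                b = 0;
                        }
                }

                /* resume just past the last match for the next capital */
                fseek(fl,last_l,SEEK_SET);
        }

        /* close the final range, which has no later mismatch to end it */
        if (b)
        {
                printf("%04X\n",b);
        }

        fclose(fl);
        fclose(fu);

        return 0;
}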

The output I get is [note: I've not verified all the pairs, but it seems to be correct]:

Code:

;Lower   Upper    Diff Limit
00000061 00000041 0020 001A
000000E0 000000C0 0020 0017
000000F8 000000D8 0020 0007
00000101 00000100 0001 0018
00000133 00000132 0001 0003
0000013A 00000139 0001 0008
0000014B 0000014A 0001 0017
0000017A 00000179 0001 0003
00000253 00000181 00D2 0001
00000254 00000186 00CE 0001
00000257 0000018A 00CD 0001
00000258 0000018E 00CA 0002
0000025B 00000190 00CB 0001
00000260 00000193 00CD 0001
00000263 00000194 00CF 0001
00000269 00000196 00D3 0001
0000026F 0000019C 00D3 0001
00000272 0000019D 00D5 0001
00000283 000001A9 00DA 0001
00000288 000001AE 00DA 0001
0000028A 000001B1 00D9 0002
00000292 000001B7 00DB 0001
00002C65 0000023A 2A2B 0001
00002C66 0000023E 2A28 0001
00002D00 000010A0 1C60 0026
00002D27 000010C7 1C60 0002
0000A641 0000A640 0001 0017
0000A681 0000A680 0001 000C
0000A723 0000A722 0001 0007
0000A733 0000A732 0001 001F
0000A775 0000A776 FFFF 0001
0000A77A 0000A779 0001 0002
0000A77F 0000A77E 0001 0005
0000A78C 0000A78B 0001 0001
0000A791 0000A790 0001 0002
0000A7A1 0000A7A0 0001 0005
0000FF41 0000FF21 0020 001A
00010428 00010400 0028 0028
0001D41A 0001D400 001A 001A
0001D44E 0001D434 001A 0007
0001D456 0001D43C 001A 0012
0001D482 0001D468 001A 001A
0001D4B6 0001D49C 001A 0001
0001D4B8 0001D49E 001A 0002
0001D4BF 0001D4A5 001A 0002
0001D4C3 0001D4A9 001A 0001
0001D4C5 0001D4AB 001A 0002
0001D4C8 0001D4AE 001A 0008
0001D4EA 0001D4D0 001A 001A
0001D51E 0001D504 001A 0002
0001D521 0001D507 001A 0004
0001D527 0001D50D 001A 0008
0001D530 0001D516 001A 0007
0001D552 0001D538 001A 0002
0001D555 0001D53B 001A 0004
0001D55A 0001D540 001A 0005
0001D560 0001D546 001A 0001
0001D564 0001D54A 001A 0007
0001D586 0001D56C 001A 001A
0001D5BA 0001D5A0 001A 001A
0001D5EE 0001D5D4 001A 001A
0001D622 0001D608 001A 001A
0001D656 0001D63C 001A 001A
0001D68A 0001D670 001A 001A
0001D6C2 0001D6A8 001A 0011
0001D6D4 0001D6BA 001A 0007
0001D6FC 0001D6E2 001A 0011
0001D70E 0001D6F4 001A 0007
0001D736 0001D71C 001A 0011
0001D748 0001D72E 001A 0007
0001D770 0001D756 001A 0011
0001D782 0001D768 001A 0007
0001D7AA 0001D790 001A 0011
0001D7BC 0001D7A2 001A 0007
0001D7CB 0001D7CA 0001 0001
0001F521 0001F520 0001 0001
000E0061 000E0041 0020 001A

In the best case this comes out as 77 rows x {0+3+2+1} bytes = 462 bytes (I've excluded the 1st column because it can be derived from the next two. Also, I've merged the range and upper/lower into one pair, since the actual groups are not important for upper/lower but were for my font selection).
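
One way to read the {0+3+2+1} is as bytes per row: nothing for the lower column, 3 for the upper code point, 2 for the difference and 1 for the limit. The C sketch below packs a row that way. It is an interpretation rather than the layout actually used in the attachment, and the names `packed_row`, `pack_row`, `row_upper` and `row_lower` are hypothetical. The difference is treated as a signed 16-bit value so that the FFFF row (lower one below upper) round-trips.

```c
#include <stdint.h>

/* 3 bytes upper code point, 2 bytes difference, 1 byte limit. */
struct packed_row { uint8_t bytes[6]; };

void pack_row(struct packed_row *row, uint32_t upper, int16_t diff, uint8_t limit)
{
    uint16_t raw = (uint16_t)diff;

    row->bytes[0] = (uint8_t)(upper & 0xFFu);
    row->bytes[1] = (uint8_t)((upper >> 8) & 0xFFu);
    row->bytes[2] = (uint8_t)((upper >> 16) & 0xFFu);
    row->bytes[3] = (uint8_t)(raw & 0xFFu);
    row->bytes[4] = (uint8_t)(raw >> 8);
    row->bytes[5] = limit;
}

uint32_t row_upper(const struct packed_row *row)
{
    return (uint32_t)row->bytes[0]
         | ((uint32_t)row->bytes[1] << 8)
         | ((uint32_t)row->bytes[2] << 16);
}

/* The lower column is not stored: it is upper + difference. */
uint32_t row_lower(const struct packed_row *row)
{
    uint32_t raw = (uint32_t)row->bytes[3] | ((uint32_t)row->bytes[4] << 8);
    int32_t diff = (raw >= 0x8000u) ? (int32_t)raw - 0x10000 : (int32_t)raw;

    return (uint32_t)((int32_t)row_upper(row) + diff);
}
```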


revolution
When all else fails, read the source

Joined: 24 Aug 2004
Posts: 20690
Location: In your JS exploiting you and your system

revolution 07 Mar 2013, 07:41

cod3b453: That is a nice little list. Can you confirm that it is complete?

BTW: The C-code makes my eyes bleed. Hehe, sorry. Does your compiler have a severe restriction on the length of variable names? Because all those extremely short, non-intuitive, inscrutable names make it even harder to follow. :P



cod3b453 07 Mar 2013, 19:45

All I can say at the moment is that there are more small letters than capitals, but both have characters with no pairing, so I don't know what their correct handling (if any) should be.

Right now my decoding function is telling me that stuff that's mapped isn't mapped, so I can't process the original list.



cod3b453 18 Mar 2013, 21:54

For anyone who's interested in my still crappy progress, the attached has some is/to Upper/Lower functions that seem to work for the few mappings I've looked at manually. However, the numbers I'm seeing aren't tallying up, so I think I'm still missing something. :oops:

----
EDIT: OK the issue was that one bit of information was missing (the increment between sequential upper/lower characters) so the table needs to change.
EDIT2: New code uploaded
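
To show why the increment matters, here is a small C sketch of a to-upper lookup over a range table that carries a step field. It is not the code from the attachment: the struct layout, the names `case_range` and `to_upper`, and the hand-filled step values are assumptions, and only three rows from the table posted earlier are included.

```c
#include <stddef.h>

struct case_range { unsigned lower, upper, step, count; };

/* Three rows from the earlier table, with the step added by hand. */
static const struct case_range ranges[] = {
    { 0x0061u, 0x0041u, 1u, 0x1Au },  /* a..z, contiguous           */
    { 0x0101u, 0x0100u, 2u, 0x18u },  /* U+0100..U+012F, interleaved */
    { 0xFF41u, 0xFF21u, 1u, 0x1Au },  /* fullwidth a..z              */
};

unsigned to_upper(unsigned c)
{
    size_t i;

    for (i = 0; i < sizeof ranges / sizeof ranges[0]; i++) {
        const struct case_range *r = &ranges[i];
        unsigned off;

        if (c < r->lower)
            continue;
        off = c - r->lower;
        /* inside the range and on a small-letter slot of the interleave */
        if (off < r->count * r->step && off % r->step == 0)
            return r->upper + off;
    }
    return c;   /* not a known small letter */
}
```

Without the step, the second row would wrongly treat the capitals at U+0102, U+0104, ... as small letters.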

Attachment: unicode_tables.zip (9.19 KB, downloaded 833 times)

Last edited by cod3b453 on 21 Mar 2013, 01:54; edited 3 times in total


edfed

Joined: 20 Feb 2006
Posts: 4350
Location: Now

edfed 19 Mar 2013, 10:39

as assembly coders, we can probably find (invent) a new way to encode characters that throws away the unicode ascii crap.

let's consider ascii as a base (because we cannot change the world in just one pass), and just add a static byte to control which character set to use.

for example

Code:

testString db "éàôîù",0
...
mov eax,french
call [asmicode.setCharSet]
mov eax,testString
call [printf]
mov eax,german
call [asmicode.setCharSet]
mov eax,testString
call [printf]
mov eax,chinese
call [asmicode.setCharSet]
mov eax,testString
call [printf]

as a result, testString would display differently depending on the charset selected, and would use only one byte per character for french and german. and for chinese... use one byte if bit7=0, two if bit7=1.
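
The one-or-two-byte rule proposed for chinese can be sketched in C as a character counter. This is only an illustration of the length rule: the name `count_chars` is made up, and it assumes strings are NUL-terminated and that the second byte of a two-byte character is never zero, neither of which the post specifies.

```c
#include <stddef.h>

/* Count characters in a string where a byte with bit 7 clear is a whole
   character and a byte with bit 7 set starts a two-byte character. */
size_t count_chars(const unsigned char *s)
{
    size_t n = 0;

    while (*s) {
        if (*s & 0x80u) {
            if (s[1] == 0)
                break;      /* truncated two-byte character: stop counting */
            s += 2;
        } else {
            s += 1;
        }
        n++;
    }
    return n;
}
```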

????
