flat assembler
Message board for the users of flat assembler.
Madis731 28 Feb 2013, 12:00
I'm going off-topic here, but I have thought about Unicode many times. My problem was not the decoding but the representation: http://www.fileformat.info/info/unicode/char/0040/commercial_at.svg
Sometimes a single glyph takes more than a KB in SVG. On topic: isn't the easiest way to remap the Unicode characters (in the database) as in ASCII, where you add/subtract 20h to change case? If there are more letters, just use another bit to flip between cases, and put the other characters right after the table.
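For reference, the ASCII property mentioned here can be sketched in C as below. This covers the ASCII range only; the function names are made up for illustration. The table posted later in the thread shows that the same 20h offset holds for a few other runs (for example 00C0/00E0) but not in general, which is why a remapped database or a lookup table comes up at all.

```c
#include <stdint.h>

/* ASCII only: 'A'..'Z' (41h..5Ah) and 'a'..'z' (61h..7Ah) differ in bit 5 (20h). */
uint32_t ascii_to_lower(uint32_t c)
{
    return (c >= 'A' && c <= 'Z') ? (c | 0x20u) : c;
}

uint32_t ascii_to_upper(uint32_t c)
{
    return (c >= 'a' && c <= 'z') ? (c & ~0x20u) : c;
}

/* Case-insensitive equality of two code points, ASCII letters only. */
int ascii_equal_nocase(uint32_t a, uint32_t b)
{
    return ascii_to_lower(a) == ascii_to_lower(b);
}
```

The range check matters: flipping bit 5 blindly would also turn `[` into `{`, so `ascii_equal_nocase('[', '{')` has to stay false.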
cod3b453 28 Feb 2013, 23:11
Madis731 wrote: ...
shoorick 01 Mar 2013, 06:55
JohnFound wrote: For example, the Chinese hieroglyphs have no case, so they will not be placed in the tables.
You are right: there is no need to handle them in the to_lower or to_upper functions. But what about case-insensitive search? What is it, really? It treats some characters that differ in look (and code) as the same. In European languages we have small and capital letters. In Chinese we have variant ways of writing the same character, and there may be more than two variants. If we want an efficient search in Chinese text, the equivalent of case-insensitive search for European text, we must handle these characters too, even if there are not many of them.
The optimisation can be done by splitting the Unicode range into parts. The function simply checks which range the character's code falls in and, depending on the result, uses the proper algorithm to convert it into the code used for comparison. There could even be a table of processing modules, with the function selecting the matching module on entry to process the character code.
_________________
UNICODE forever!
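The "table of processing modules" idea might look roughly like this in C. It is only a sketch: the struct, the function names and the choice of ranges are assumptions made for illustration, and the CJK entry is a placeholder that returns the code unchanged, because no variant data is given in this thread.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t (*fold_fn)(uint32_t);

/* Hypothetical per-range modules; real ones would be generated from Unicode data. */
uint32_t fold_ascii(uint32_t c)
{
    return (c >= 'A' && c <= 'Z') ? c + 0x20 : c;
}

uint32_t fold_cyrillic(uint32_t c)
{
    if (c >= 0x0410 && c <= 0x042F) return c + 0x20;  /* U+0410..U+042F -> U+0430..U+044F */
    if (c >= 0x0400 && c <= 0x040F) return c + 0x50;  /* U+0400..U+040F -> U+0450..U+045F */
    return c;
}

/* Placeholder: a real module would map variant forms to one representative. */
uint32_t fold_cjk_placeholder(uint32_t c)
{
    return c;
}

struct fold_range { uint32_t first, last; fold_fn fold; };

static const struct fold_range fold_ranges[] = {
    { 0x0000, 0x007F, fold_ascii },
    { 0x0400, 0x04FF, fold_cyrillic },
    { 0x4E00, 0x9FFF, fold_cjk_placeholder },
};

/* Select the module by range, then let it produce the code used for comparing. */
uint32_t fold_for_compare(uint32_t c)
{
    for (size_t i = 0; i < sizeof fold_ranges / sizeof fold_ranges[0]; i++)
    {
        if (c >= fold_ranges[i].first && c <= fold_ranges[i].last)
            return fold_ranges[i].fold(c);
    }
    return c;  /* no module registered for this range */
}
```

With only three ranges a linear scan is enough; a full table would probably be sorted by `first` and searched with a binary search.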
cod3b453 04 Mar 2013, 22:30
Apologies for the use of C (and the poor code, very much work in progress) but this code takes the unicode.txt file and splits it into upper and lower files. It then scans both for matching pairs and prints the table values:
Code:
    #define _CRT_SECURE_NO_WARNINGS 1
    #include <stdio.h>
    #include <string.h>

    /* "..."<find>"..." -> "..."<replace>"..."
       In-place use (pdst == psrc) is safe only while preplace is not longer than pfind. */
    void str_replace( char * pdst, const char * psrc, const char * pfind, const char * preplace )
    {
        size_t flen = strlen(pfind);
        size_t rlen = strlen(preplace);

        for ( ; *psrc; pdst++, psrc++)
        {
            if (strncmp(psrc,pfind,flen) != 0)
            {
                *pdst = *psrc;
            }
            else
            {
                memcpy(pdst,preplace,rlen);   /* no terminator, it could clobber unread input */
                psrc = psrc + flen - 1;
                pdst = pdst + rlen - 1;
            }
        }
        *pdst = 0;
    }

    int main(void)
    {
        FILE * f;
        FILE * fu;
        FILE * fl;
        char line[1024];
        char * p;
        char * q;
        unsigned int unicode;
        unsigned int b = 0;             /* length of the current run of pairs */
        unsigned int dl = 0;            /* lower start, then lower step once b >= 2 */
        unsigned int du = 0;            /* upper start, then upper step once b >= 2 */
        unsigned int d = 0;             /* lower - upper for the current run */
        unsigned int uc, lc;
        unsigned int puc = 0, plc = 0;  /* previous matched pair */
        long last_l = 0;                /* position in lower.txt after the last match */
        char uname[128];
        char lname[128];

        /* Pass 1: split unicode.txt into names containing SMALL and names containing CAPITAL. */
        f = fopen("../../unicode.txt","rb");
        fu = fopen("upper.txt","wb");
        fl = fopen("lower.txt","wb");
        if (!f || !fu || !fl)
        {
            fprintf(stderr,"cannot open input/output files\n");
            return 1;
        }

        while (fgets(line,sizeof(line),f))      /* loop on fgets, not on feof */
        {
            if (sscanf(line,"%x",&unicode) != 1) continue;
            p = strchr(line,';');
            if (!p) continue;
            p++;
            q = strchr(p,';');
            if (!q) continue;
            *q = 0;                             /* line now holds "code;name" only */

            if (strstr(line,"SMALL"))
            {
                str_replace(p,p,"SMALL ","");
                fprintf(fl,"%08X %s\n",unicode,p);
            }
            if (strstr(line,"CAPITAL"))
            {
                str_replace(p,p,"CAPITAL ","");
                fprintf(fu,"%08X %s\n",unicode,p);
            }
        }

        fclose(fl);
        fclose(fu);
        fclose(f);

        /* Pass 2: pair names that match once SMALL/CAPITAL is stripped and print runs. */
        fu = fopen("upper.txt","rb");
        fl = fopen("lower.txt","rb");
        if (!fu || !fl)
        {
            fprintf(stderr,"cannot reopen upper.txt/lower.txt\n");
            return 1;
        }

        while (fgets(line,sizeof(line),fu))
        {
            if (sscanf(line,"%x",&uc) != 1 || strlen(line) < 9) continue;
            strcpy(uname,line+9);

            while (fgets(line,sizeof(line),fl))
            {
                if (sscanf(line,"%x",&lc) != 1 || strlen(line) < 9) continue;
                strcpy(lname,line+9);

                if (strcmp(lname,uname) == 0)
                {
                    last_l = ftell(fl);

                    switch (b)
                    {
                        case 0:
                        {
                            b = 1; dl = lc; du = uc; d = lc - uc;
                            break;
                        }
                        case 1:
                        {
                            if (d != (lc - uc))
                            {
                                printf("%04X\n",b);
                                b = 0;
                            }
                            else
                            {
                                b = 2; dl = lc - dl; du = uc - du;
                            }
                            break;
                        }
                        default:
                        {
                            if ( (dl != (lc - plc)) || (du != (uc - puc)) || (d != (lc - uc)) )
                            {
                                printf("%04X\n",b);
                                b = 0;
                            }
                            else
                            {
                                b = b + 1;
                            }
                            break;
                        }
                    }

                    if (b == 0)
                    {
                        b = 1; dl = lc; du = uc; d = lc - uc;
                    }
                    if (b == 1)
                    {
                        printf("%08X %08X %04X ", lc, uc, (lc - uc));
                    }

                    puc = uc;
                    plc = lc;
                    break;
                }
                else if ( b && ( (dl != (lc - plc)) || (du != (uc - puc)) || (d != (lc - uc)) ) )
                {
                    printf("%04X\n",b);
                    b = 0;
                }
            }

            /* Resume the lower scan just after the last match; fseek also clears EOF. */
            fseek(fl,last_l,SEEK_SET);
        }

        if (b)
        {
            printf("%04X\n",b);                 /* close a run left open at end of input */
        }

        fclose(fl);
        fclose(fu);
        return 0;
    }

Code:
    ;Lower   Upper    Diff Limit
    00000061 00000041 0020 001A
    000000E0 000000C0 0020 0017
    000000F8 000000D8 0020 0007
    00000101 00000100 0001 0018
    00000133 00000132 0001 0003
    0000013A 00000139 0001 0008
    0000014B 0000014A 0001 0017
    0000017A 00000179 0001 0003
    00000253 00000181 00D2 0001
    00000254 00000186 00CE 0001
    00000257 0000018A 00CD 0001
    00000258 0000018E 00CA 0002
    0000025B 00000190 00CB 0001
    00000260 00000193 00CD 0001
    00000263 00000194 00CF 0001
    00000269 00000196 00D3 0001
    0000026F 0000019C 00D3 0001
    00000272 0000019D 00D5 0001
    00000283 000001A9 00DA 0001
    00000288 000001AE 00DA 0001
    0000028A 000001B1 00D9 0002
    00000292 000001B7 00DB 0001
    00002C65 0000023A 2A2B 0001
    00002C66 0000023E 2A28 0001
    00002D00 000010A0 1C60 0026
    00002D27 000010C7 1C60 0002
    0000A641 0000A640 0001 0017
    0000A681 0000A680 0001 000C
    0000A723 0000A722 0001 0007
    0000A733 0000A732 0001 001F
    0000A775 0000A776 FFFF 0001
    0000A77A 0000A779 0001 0002
    0000A77F 0000A77E 0001 0005
    0000A78C 0000A78B 0001 0001
    0000A791 0000A790 0001 0002
    0000A7A1 0000A7A0 0001 0005
    0000FF41 0000FF21 0020 001A
    00010428 00010400 0028 0028
    0001D41A 0001D400 001A 001A
    0001D44E 0001D434 001A 0007
    0001D456 0001D43C 001A 0012
    0001D482 0001D468 001A 001A
    0001D4B6 0001D49C 001A 0001
    0001D4B8 0001D49E 001A 0002
    0001D4BF 0001D4A5 001A 0002
    0001D4C3 0001D4A9 001A 0001
    0001D4C5 0001D4AB 001A 0002
    0001D4C8 0001D4AE 001A 0008
    0001D4EA 0001D4D0 001A 001A
    0001D51E 0001D504 001A 0002
    0001D521 0001D507 001A 0004
    0001D527 0001D50D 001A 0008
    0001D530 0001D516 001A 0007
    0001D552 0001D538 001A 0002
    0001D555 0001D53B 001A 0004
    0001D55A 0001D540 001A 0005
    0001D560 0001D546 001A 0001
    0001D564 0001D54A 001A 0007
    0001D586 0001D56C 001A 001A
    0001D5BA 0001D5A0 001A 001A
    0001D5EE 0001D5D4 001A 001A
    0001D622 0001D608 001A 001A
    0001D656 0001D63C 001A 001A
    0001D68A 0001D670 001A 001A
    0001D6C2 0001D6A8 001A 0011
    0001D6D4 0001D6BA 001A 0007
    0001D6FC 0001D6E2 001A 0011
    0001D70E 0001D6F4 001A 0007
    0001D736 0001D71C 001A 0011
    0001D748 0001D72E 001A 0007
    0001D770 0001D756 001A 0011
    0001D782 0001D768 001A 0007
    0001D7AA 0001D790 001A 0011
    0001D7BC 0001D7A2 001A 0007
    0001D7CB 0001D7CA 0001 0001
    0001F521 0001F520 0001 0001
    000E0061 000E0041 0020 001A
revolution 07 Mar 2013, 07:41
cod3b453: That is a nice little list. Can you confirm that it is complete?
BTW: The C code makes my eyes bleed. Hehe, sorry. Does your compiler have a severe restriction on the length of variable names? All those extremely short, non-intuitive, inscrutable names make it even harder to follow.
cod3b453 07 Mar 2013, 19:45
All I can say at the moment is that there are more small letters than capitals, but both sets have characters with no pairing, so I don't know what their correct handling (if any) should be.
Right now my decoding function is telling me that stuff that's mapped isn't mapped, so I can't process the original list.
cod3b453 18 Mar 2013, 21:54
For anyone who's interested in my still crappy progress, the attached has some is/to Upper/Lower functions that seem to work for the few mappings I've looked at manually. However, the numbers I'm seeing aren't tallying up, so I think I'm still missing something.
----
EDIT: OK, the issue was that one bit of information was missing (the increment between sequential upper/lower characters), so the table needs to change.
EDIT2: New code uploaded
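The attachment is not reproduced here, so the following is only a guess at what a table with the extra increment column could look like in C. The four rows come from the table posted earlier in the thread; the `step` values and all names are assumptions (step 1 for the contiguous runs, step 2 for the run at 0100h, where upper and lower letters alternate).

```c
#include <stddef.h>
#include <stdint.h>

/* One run of case pairs: count pairs, each pair step code points after the previous one. */
struct case_range { uint32_t lower, upper, count, step; };

static const struct case_range case_ranges[] = {
    { 0x0061, 0x0041, 0x1A, 1 },
    { 0x00E0, 0x00C0, 0x17, 1 },
    { 0x00F8, 0x00D8, 0x07, 1 },
    { 0x0101, 0x0100, 0x18, 2 },  /* assumed step: 0100/0101, 0102/0103, ... */
};

#define CASE_RANGE_COUNT (sizeof case_ranges / sizeof case_ranges[0])

/* True if c is one of the count code points start, start+step, start+2*step, ... */
static int in_run(uint32_t c, uint32_t start, uint32_t count, uint32_t step)
{
    return c >= start && (c - start) < count * step && (c - start) % step == 0;
}

uint32_t table_to_lower(uint32_t c)
{
    for (size_t i = 0; i < CASE_RANGE_COUNT; i++)
    {
        const struct case_range *r = &case_ranges[i];
        if (in_run(c, r->upper, r->count, r->step))
            return r->lower + (c - r->upper);
    }
    return c;  /* not an upper-case letter known to the table */
}

uint32_t table_to_upper(uint32_t c)
{
    for (size_t i = 0; i < CASE_RANGE_COUNT; i++)
    {
        const struct case_range *r = &case_ranges[i];
        if (in_run(c, r->lower, r->count, r->step))
            return r->upper + (c - r->lower);
    }
    return c;  /* not a lower-case letter known to the table */
}
```

Without the step column, the run at 0100h would wrongly treat 0101h as an upper-case letter; with it, `table_to_lower(0x0101)` leaves the code unchanged while `table_to_upper(0x0101)` gives 0100h.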
Last edited by cod3b453 on 21 Mar 2013, 01:54; edited 3 times in total |
edfed 19 Mar 2013, 10:39
As assembly coders, we can probably find (invent) a new way to encode characters that throws away the Unicode/ASCII crap.
Let's take ASCII as a base (because we cannot change the world in just one pass) and just add a static byte that controls which character set to use. For example:
Code:
    testString db "éàôîù",0
    ...
    mov eax,french
    call [asmicode.setCharSet]
    mov eax,testString
    call [printf]

    mov eax,german
    call [asmicode.setCharSet]
    mov eax,testString
    call [printf]

    mov eax,chinese
    call [asmicode.setCharSet]
    mov eax,testString
    call [printf]
As a result, testString would display differently depending on the charset in use, and would take only one byte per character for French and German. And for Chinese... use one byte if bit7=0, two if bit7=1. ????
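To make the bit7 rule concrete, here is a rough C sketch of a decoder for such a scheme. Nothing here is an existing API: the enum, the function names and the way the two bytes are combined into an index are assumptions, since the post does not define them.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical charset kinds: one byte per character, or bit 7 marks a two-byte character. */
enum charset_kind { CHARSET_SINGLE_BYTE, CHARSET_BIT7_ESCAPE };

/* Decode one character index from a zero-terminated string.
   Returns the number of bytes consumed, or 0 at the terminator or on a truncated pair. */
size_t decode_next(const uint8_t *s, enum charset_kind kind, uint32_t *index)
{
    if (s[0] == 0)
        return 0;
    if (kind == CHARSET_SINGLE_BYTE || (s[0] & 0x80) == 0)
    {
        *index = s[0];
        return 1;
    }
    if (s[1] == 0)
        return 0;                                    /* lead byte without its second byte */
    *index = ((uint32_t)(s[0] & 0x7F) << 8) | s[1];  /* assumed 15-bit index layout */
    return 2;
}

/* Count the characters in s under the given charset kind. */
size_t count_chars(const uint8_t *s, enum charset_kind kind)
{
    size_t n = 0, used;
    uint32_t index;

    while ((used = decode_next(s, kind, &index)) != 0)
    {
        s += used;
        n++;
    }
    return n;
}
```

If "éàôîù" is stored as the bytes E9h E0h F4h EEh F9h, the same buffer counts as five characters with `CHARSET_SINGLE_BYTE`, but only two complete characters (plus a dangling lead byte) with `CHARSET_BIT7_ESCAPE`. That is the practical cost of the scheme: the bytes alone do not say how to read them.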
Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.
Website powered by rwasa.