flat assembler
Message board for the users of flat assembler.

How do I print unicode string using printf

moistMaven



Joined: 02 Jun 2023
Posts: 9
moistMaven 02 Jun 2023, 12:56
format PE64 console
entry start

include './include/win64a.inc'

;======================================
section '.data' data readable writeable
;======================================

hindi db "हिन्दी",10,0




;=======================================
section '.code' code readable executable
;=======================================
start:
ccall [printf], hindi
ccall [getchar]
ccall [ExitProcess],0
;====================================
section '.idata' import data readable
;====================================

library kernel,'kernel32.dll', msvcrt,'msvcrt.dll'

import kernel, ExitProcess,'ExitProcess'

import msvcrt, printf,'printf', wprintf, 'wprintf', setlocale, 'setlocale', getchar,'_fgetchar'
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20363
Location: In your JS exploiting you and your system
revolution 02 Jun 2023, 13:08
If by "Unicode" you mean UTF-8 then for Windows you can first convert the UTF-8 encoding to WCHAR (sometimes incorrectly called UTF-16) and then use the *W APIs to print the string.
Furs



Joined: 04 Mar 2016
Posts: 2522
Furs 02 Jun 2023, 13:46
revolution wrote:
If by "Unicode" you mean UTF-8 then for Windows you can first convert the UTF-8 encoding to WCHAR (sometimes incorrectly called UTF-16) and then use the *W APIs to print the string.
Why incorrectly called UTF-16? It is UTF-16, isn't it?

BTW, newer versions of Windows support UTF-8 codepage in ANSI functions.



moistMaven 02 Jun 2023, 13:54
Quote:

Why incorrectly called UTF-16? It is UTF-16, isn't it?

Maybe because UTF-16 is a character encoding scheme while WCHAR is a data type. 🤷


revolution 02 Jun 2023, 13:56
Furs wrote:
revolution wrote:
If by "Unicode" you mean UTF-8 then for Windows you can first convert the UTF-8 encoding to WCHAR (sometimes incorrectly called UTF-16) and then use the *W APIs to print the string.
Why incorrectly called UTF-16? It is UTF-16, isn't it?
It doesn't specify an encoding, only a width.
The size of a wide character type does not dictate which text encodings a system can process, as conversions are available. (Old conversion code commonly overlooks surrogates, however.) The historical circumstances of a system's adoption of wide characters also influence which encodings it prefers.
Most system configurations will tend to default to UTF-16 encoding, so then one can use WCHAR to print Unicode characters. But that isn't always true. Use with caution if it matters.
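
To illustrate the surrogate point, here is a small sketch (an illustration only; the code point U+1F600 and the names code_point and rest are arbitrary choices, not something from the thread). A code point above U+FFFF does not fit into a single 16-bit WCHAR, so UTF-16 splits it into two units, and fasm can do that arithmetic at assembly time:

```asm
; Illustrative sketch: assembles with plain fasm into a 4-byte flat binary.
; code_point is an arbitrary example above U+FFFF (hypothetical choice).
code_point = 0x1F600
rest       = code_point - 0x10000   ; 20-bit value split across two WCHARs

dw 0xD800 + (rest shr 10)           ; high surrogate, 0xD83D for this code point
dw 0xDC00 + (rest and 0x3FF)        ; low surrogate,  0xDE00 for this code point
```

Code that assumes one WCHAR per character would treat these two units as two separate characters.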



moistMaven 02 Jun 2023, 16:50
Quote:

If by "Unicode" you mean UTF-8 then for Windows you can first convert the UTF-8 encoding to WCHAR (sometimes incorrectly called UTF-16) and then use the *W APIs to print the string.

Following your suggestion I came up with this code, but running it just opens and closes the console immediately without showing any output:
format PE console
entry _start

section '.data' data readable writeable
unicode_string db 0xE2, 0x9C, 0x93, 0x20, 0x48, 0x65, 0x6C, 0x6C, 0x6F, 0x2C, 0x20, 0x57, 0x6F, 0x72, 0x6C, 0x64, 0x21, 0x00 ; UTF-8 encoded Unicode string
wunicode_string dw 0x2654, 0x0020, 0x0048, 0x0065, 0x006C, 0x006C, 0x006F, 0x002C, 0x0020, 0x0057, 0x006F, 0x0072, 0x006C, 0x0064, 0x0021, 0x0000 ; WCHAR (UTF-16) encoded Unicode string

section '.code' code readable executable
include 'win32a.inc'


_start:
push 0 ; dwFlags (must be 0)
push 0 ; dwSrcEncoding (automatic detection)
push unicode_string ; lpSrcStr (UTF-8 encoded string)
push -1 ; cchSrc (determine the length automatically)
push 0 ; lpWideCharStr (output buffer)
push 0 ; cchWideChar (calculate the required buffer size)
ccall [MultiByteToWideChar]
mov esi, eax ; esi holds the length of the WCHAR string

; Allocate console and get its handle
push 0 ; lpSecurityAttributes (not used, set to 0)
push 0x40000000 ; dwDesiredAccess (GENERIC_READ | GENERIC_WRITE)
ccall [GetStdHandle]
mov ebx, eax ; ebx holds the console handle

; Write the WCHAR string to the console using WriteConsoleW function
push 0 ; lpNumberOfCharsWritten (output parameter, not used)
push esi ; nNumberOfCharsToWrite (length of WCHAR string)
push wunicode_string ; lpBuffer (WCHAR string)
push ebx ; hConsoleOutput (console handle)
ccall [WriteConsoleW]

; Prompt the user to press a key
push 0 ; lpBuffer (input buffer)
ccall [GetStdHandle]
mov ebx, eax ; ebx holds the standard input handle
push ebx ; hConsoleInput (standard input handle)
call [FlushConsoleInputBuffer]
push 0 ; lpNumberOfEventsRead (output parameter, not used)
push 1 ; nNumberOfEventsToRead (number of events to read)
push 0 ; lpBuffer (input buffer)
push ebx ; hConsoleInput (standard input handle)
ccall [ReadConsoleInputW]

; Exit the program
xor eax, eax
xor ebx, ebx
ccall [ExitProcess]



section '.idata' import data readable writeable
library kernel32, 'kernel32.dll', \
user32, 'user32.dll'
import kernel32, \
GetStdHandle, 'GetStdHandle', \
WriteConsoleW, 'WriteConsoleW', \
ExitProcess, 'ExitProcess', \
MultiByteToWideChar, 'MultiByteToWideChar', \
FlushConsoleInputBuffer, 'FlushConsoleInputBuffer',\
ReadConsoleInputW, 'ReadConsoleInputW'
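
For comparison, the following is a rough sketch of how the convert-then-WriteConsoleW idea could look in 32-bit fasm. It is only one possible arrangement, not a confirmed fix from the thread. It assumes the standard win32a.inc (for the invoke macro), and the names utf8_text, wide_text and written, as well as the 64-WCHAR buffer size, are made up for the example. The Win32 functions are stdcall, so they are called with invoke and the documented argument order: MultiByteToWideChar takes six arguments and GetStdHandle takes one.

```asm
format PE console
entry start

include 'win32a.inc'

section '.data' data readable writeable

utf8_text db 0xE2,0x9C,0x93,' Hello, World!',13,10,0  ; UTF-8 bytes, starts with U+2713
wide_text rw 64                                      ; output buffer: room for 64 WCHARs
written   dd ?                                       ; receives the number of WCHARs written

section '.code' code readable executable

start:
        ; 65001 = CP_UTF8; a source length of -1 means "up to and including the 0"
        invoke  MultiByteToWideChar, 65001, 0, utf8_text, -1, wide_text, 64
        test    eax, eax
        jz      .done                   ; conversion failed, nothing to print
        dec     eax                     ; returned count includes the terminating 0
        mov     esi, eax

        invoke  GetStdHandle, -11       ; -11 = STD_OUTPUT_HANDLE
        invoke  WriteConsoleW, eax, wide_text, esi, written, 0
.done:
        invoke  ExitProcess, 0

section '.idata' import data readable writeable

library kernel32,'kernel32.dll'

import kernel32,\
       MultiByteToWideChar,'MultiByteToWideChar',\
       GetStdHandle,'GetStdHandle',\
       WriteConsoleW,'WriteConsoleW',\
       ExitProcess,'ExitProcess'
```

The sketch does not wait for a key press, so it is best run from an already open console window. Whether the first character is drawn correctly still depends on the console font.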
Flier-Mate



Joined: 26 May 2023
Posts: 88
Flier-Mate 03 Jun 2023, 04:15
You can use "du" instead of "db" to specify the Unicode string, as long as you include "encoding/utf8.inc".

My 32-bit example only works with MessageBoxW (GUI); with WriteConsoleW (console) it shows question marks.

I don't know the perfect solution to it.
Code:
;format PE console
format PE GUI
entry start

include "win32a.inc"
include "encoding\utf8.inc"

section ".data" data readable writeable

hindi du "हिन्दी",13,10,0
;_len  = ($ - hindi) / 2 - 1   ; WriteConsoleW takes a count of WCHARs, not bytes; excludes the terminating 0

section ".code" code executable readable

start:

;push -11
;call [GetStdHandle]

;push 0
;push 0
;push _len
;push hindi
;push eax
;call [WriteConsoleW]

push 0x40
push hindi
push hindi
push 0    ;Desktop
call [MessageBoxW]

push 0
call [ExitProcess]

section ".idata" import readable writeable

library kernel, "kernel32.dll", \
        user, "user32.dll"

import kernel, \
       GetStdHandle, "GetStdHandle", \
       WriteConsoleW, "WriteConsoleW", \
       ExitProcess, "ExitProcess"

import user, \
       MessageBoxW, "MessageBoxW"    


[Attached image: hindi.png]





moistMaven 03 Jun 2023, 07:35
I tried to do it using wprintf. Interestingly, all the English characters are shown correctly, but non-English text (both UTF-8 and UTF-16) shows up as gibberish in the console:
Code:
format PE64 console
entry start

include './include/win64w.inc'
include './include/macro/proc64.inc'
include './include/encoding/utf8.inc'

;======================================
section '.data' data readable writeable
;======================================

wunicode_string du 0x0939, 0x093f, 0x0928, 0x094d, 0x0926, 0x0940, 0 ; "हिन्दी", zero-terminated for %ls



;=======================================
section '.code' code readable executable
;=======================================

start:
    
    mov rax, 0
    ccall [wprintf], "%ls", wunicode_string
   

    ccall   [getchar]                   ; I added this line to exit the application AFTER the user pressed any key.
    stdcall [ExitProcess],0             ; Exit the application

;====================================
section '.idata' import data readable
;====================================

library kernel,'kernel32.dll',        msvcrt,'msvcrt.dll'

import  kernel,        ExitProcess,'ExitProcess'

import  msvcrt,        printf,'printf', wprintf, 'wprintf',       getchar,'_fgetchar'
    


revolution 03 Jun 2023, 07:40
To print WCHAR you use the *W APIs.

So, for example, wsprintfW (not wsprintfA). It is in user32.dll.
Code:
;...
        user_table:
                MessageBox                      dd      RVA _MessageBoxA
                wsprintf                        dd      RVA _wsprintfW  ; <--- use the W version
;...
                                                dd      0
;...    



moistMaven 03 Jun 2023, 09:30
I feel so dumb right now: the problem was the code page used by the console. Now it works even with printf:
Code:

format PE64 console
entry start

include './include/win64a.inc'
include './include/macro/proc64.inc'

;======================================
section '.data' data readable writeable
;======================================

hello_newline    db "Hello World!",10,0


;=======================================
section '.code' code readable executable
;=======================================

start:

        ccall   [SetConsoleOutputCP], 65001
        ccall   [printf], "%s", "हिन्दी"


    ccall   [getchar]                   ; I added this line to exit the application AFTER the user pressed any key.
    stdcall [ExitProcess],0             ; Exit the application

;====================================
section '.idata' import data readable
;====================================

library kernel,'kernel32.dll',        msvcrt,'msvcrt.dll'

import  kernel,        ExitProcess,'ExitProcess', SetConsoleOutputCP, 'SetConsoleOutputCP'

import  msvcrt,        printf,'printf',        getchar,'_fgetchar'
    



Flier-Mate 08 Jun 2023, 10:22
I used the same code that I posted above, but no luck seeing Hindi characters in my Windows 10 console.

I typed "chcp 65001" and then ran hindi.exe, a program that uses the WriteConsoleW API function.

Does anybody know why?


[Attached image: Screenshot 2023-06-08 180913.png]


Picnic



Joined: 05 May 2007
Posts: 1391
Location: Piraeus, Greece
Picnic 08 Jun 2023, 11:56
It's a bit more confusing on the console.

You have to find and install a monospace font that supports the language you want, select it as the default console font, and also modify the registry.

My toy interpreter supports wide characters; below I am using DejaVu Sans Mono to display Greek and Arabic.

Code:

    screen 80,25,300
    color 0,7
    cls
    print "Good morning my friend"
    print "Καλημέρα φίλε μου"
    print REVERSE("صباح الخير يا صديقي")
    end 
    

Image



Flier-Mate 08 Jun 2023, 13:33
Thanks Picnic, it is good that DejaVu Sans Mono supports Arabic (not Hindi). I tried Lucida Console as suggested on the Internet, but no luck either.

I am facing the same issue....
Quote:
The Windows Console only allows you to select fixed pitch fonts. All the Hindi fonts that I have found are variable pitch.



Picnic 08 Jun 2023, 13:49
You're welcome. I did a quick search for a Hindi console font before I posted the message, but I had no luck either.
bitRAKE



Joined: 21 Jul 2003
Posts: 4047
Location: vpcmpistri
bitRAKE 08 Jun 2023, 23:24
Flier-Mate wrote:
I typed "chcp 65001" and then ran hindi.exe, a program that uses the WriteConsoleW API function.

Does anybody know why?
IIRC, the code page only applies to *A-type APIs; you'd need to use WriteConsoleA.

My Win11 system is configured for UTF-8 system-wide, so it's as simple as sending a byte string to the console with WriteConsoleA. I installed no special fonts. I can use UTF-8 in GUI code with no special handling. It finally works how it should have 10+ years ago, imho.

(If I am understanding that correctly, I've messed up the Arabic? Or right-to-left languages require configuring the console differently.)

Edit: the GUI looks a lot better. I think the console is lacking a lot of language features; it looks like it isn't combining the characters correctly. Playing with the Hebrew, the console switched to right-to-left automatically. I'm missing something to get it to work correctly with those languages.


[Attached images: languages2.png, languages.png]



_________________
¯\(°_o)/¯ “languages are not safe - uses can be” Bjarne Stroustrup



Flier-Mate 09 Jun 2023, 03:46
Interesting demonstration, bitRAKE.

I tested with WriteConsoleA and a "db" string; the output is the same as with WriteConsoleW and a "du" string. It still shows question marks on the console.

It is nice you got it working in your Windows 11. Maybe what Furs said is referring to Windows 11 only (not Windows 10)?
Furs wrote:

BTW, newer versions of Windows support UTF-8 codepage in ANSI functions.


Outputting a Unicode string is a lot easier in a GUI than in the console.


revolution 09 Jun 2023, 03:48
Win7 supported CP65001 perfectly fine. I'm not sure about ANSI though, perhaps that is new.



Furs 09 Jun 2023, 13:39
Flier-Mate wrote:
Interesting demonstration, bitRAKE.

I tested with WriteConsoleA and a "db" string; the output is the same as with WriteConsoleW and a "du" string. It still shows question marks on the console.

It is nice you got it working in your Windows 11. Maybe what Furs said is referring to Windows 11 only (not Windows 10)?
Furs wrote:

BTW, newer versions of Windows support UTF-8 codepage in ANSI functions.


Outputting a Unicode string is a lot easier in a GUI than in the console.
You are confusing code pages here, and ANSI/Wide versions ("Unicode"), with actual font rendering.

ANSI codepages / Wide versions are a way to encode the glyphs as text. It's just data. But just because you encode them, doesn't mean they will render.

Try to copy them, for example, and paste them into a browser: you'll likely see the proper characters (unless it copies question marks as a "hack", which would be dumb imo). Most browser fonts tend to support more glyphs because web pages need them.

Rendering such glyphs/characters requires the font to support it. It's likely your console font does not, especially if your Windows version is not Hindi in the first place. A font supporting all glyphs would be huge, maybe gigabytes.

UTF-8 is not a rendering thing, it's a way to encode those glyphs. Actually rendering them is up to the font.
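
As a concrete illustration of "it's just data", here is a tiny sketch (the labels ha_utf8 and ha_utf16 are made up for the example) that stores the same character, U+0939, in both encodings. Neither line says anything about whether the console font has a glyph for it:

```asm
; Assembles with plain fasm (no includes) into a 5-byte flat binary.
ha_utf8   db 0xE0, 0xA4, 0xB9   ; U+0939 as UTF-8: three bytes
ha_utf16  dw 0x0939             ; U+0939 as one 16-bit unit, stored as bytes 39 09
```

Both forms can be passed to the matching API (*A with code page 65001, or *W) and still come out as a box or a question mark if the selected font lacks the glyph.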



Flier-Mate 10 Jun 2023, 05:42
Thanks Furs for the helpful explanation.

Quote:
UTF-8 is not a rendering thing, it's a way to encode those glyphs. Actually rendering them is up to the font.


Maybe Windows 11 has the answer to it, both encoding and font rendering.

I asked someone to test my hindi.exe on their Windows 11, but again their web browser blocked the download because of a suspected trojan. (Sigh)



bitRAKE 10 Jun 2023, 07:45
Flier-Mate wrote:
I asked someone to test my hindi.exe on their Windows 11, but again their web browser blocked the download because of a suspected trojan. (Sigh)
GPT suggests there is a known issue with how older consoles process UTF-8, and suggests using WriteFile() to send the bytes, bypassing the buffered processing of WriteConsoleA(). Honestly, the Windows console has had so many problems in the past that I don't keep track of them and their workarounds.

The following test program works well on Win11, and if GPT-4 knows anything, it should work well on Win10 too. (To the extent that the font your console is configured with supports Hindi.)


[Attachment: ex07.zip (WriteFile() UTF-8 characters to console)]
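
The contents of ex07.zip are not shown in the thread. As a rough sketch of the WriteFile() approach described above (an assumption about how it could be done, not the attached program), a 32-bit fasm version might look like this. It assumes the standard win32a.inc and a source file saved as UTF-8; the names hindi, hindi_len and written are chosen for the example.

```asm
format PE console
entry start

include 'win32a.inc'

section '.data' data readable writeable

hindi     db "हिन्दी", 13, 10       ; UTF-8 bytes, provided the source file is saved as UTF-8
hindi_len = $ - hindi              ; length in bytes, which is what WriteFile expects
written   dd ?                     ; receives the number of bytes written

section '.code' code readable executable

start:
        invoke  SetConsoleOutputCP, 65001       ; 65001 = CP_UTF8
        invoke  GetStdHandle, -11               ; -11 = STD_OUTPUT_HANDLE
        invoke  WriteFile, eax, hindi, hindi_len, written, 0
        invoke  ExitProcess, 0

section '.idata' import data readable writeable

library kernel32,'kernel32.dll'

import kernel32,\
       SetConsoleOutputCP,'SetConsoleOutputCP',\
       GetStdHandle,'GetStdHandle',\
       WriteFile,'WriteFile',\
       ExitProcess,'ExitProcess'
```

WriteFile works on bytes rather than characters, so it also behaves the same when the output is redirected to a file. As noted earlier in the thread, the console font still decides whether the glyphs are actually drawn.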


Copyright © 1999-2024, Tomasz Grysztar. Also on GitHub, YouTube.

Website powered by rwasa.