Japanese cmd

carlos · #1 Post by **carlos** » 08 Apr 2014 03:19

Hello. I was curious about how cmd handles Unicode characters in variables, so I got hold of a Japanese Windows.
My first impression was that the path separator is different. It looks like a Y with a hyphen, and it looks the same in Notepad.
But after redirecting the output of cd. to a file and inspecting that file with a hex editor on an English Windows, I found that it is the same \ character (byte 0x5C) as always. It is only displayed as a Y with a hyphen on the Japanese Windows.

The codepage that it uses is 932.
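
The observation about the path separator can be checked outside cmd. The following is a minimal Python sketch, not part of the original posts; it assumes Python's `cp932` and `cp1252` codecs as stand-ins for the Windows code pages of the same number.

```python
# The byte 0x5C is the path separator; only the glyph drawn for it differs
# between a Japanese and a Western system.
SEP = bytes([0x5C])

# Decode the same byte under the Japanese (932) and Western (1252) code pages.
as_cp932 = SEP.decode("cp932")
as_cp1252 = SEP.decode("cp1252")

# Both decode to the same code point, U+005C.
print(f"cp932: U+{ord(as_cp932):04X}, cp1252: U+{ord(as_cp1252):04X}")
```

Both code pages map the byte to the same character, which matches what the hex editor showed.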

I can correctly save Unicode text in a variable. I also created a script, saved it as ANSI from Notepad, and it works.
I tested saving it encoded as Unicode, Unicode Big Endian, and UTF-8, but it only works from cmd when saved as ANSI.

So I asked myself how it represents the Unicode characters internally using 8-bit bytes, and my answer is that cmd translates the 8-bit character sequence to Unicode. That is also why the \ is shown as the Y-with-hyphen character.

This is a hex dump image of the working ANSI batch script that saves Unicode text in a variable.
I used the only Japanese word that I know: sokoban.

It is also interesting that it uses the raster font (or "Terminal" for programmers), yet it has the Japanese characters.

carlos · #2 Post by **carlos** » 08 Apr 2014 03:36

I also did another test.

I started a new cmd and changed the codepage to 65001, then ran the script that works under the Japanese codepage (932). It did not work: the content of the variable was not saved.
Then, from a cmd running under codepage 932, I ran the script correctly, and from that script I started a new cmd (which inherits the variables from the parent), changed the codepage to 65001, and displayed the content of the variable.

Edited 3rd time
unicode encoding: uses 2 bytes to represent a character.
multibyte encoding: uses 1 or 2 byte(s) to represent a character.
unicode text: text that uses characters outside extended ASCII.

So my final conclusion is that cmd stores Unicode text internally using the unicode encoding (without BOM). On input, in both script mode and interactive mode, it reads the text as multibyte and translates it to unicode using the current codepage (MultiByteToWideChar function). On ANSI output, it converts from unicode back to multibyte (WideCharToMultiByte function) using the current codepage.

For operations with script text saved in some codepage, the script should be run under that same codepage to ensure success; otherwise it can fail.

So for operations on text that contains Unicode characters, the text is converted from the original codepage to unicode, and after the operations it is converted back to the original codepage.
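
The model described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `read_script_bytes` and `write_ansi_output` are hypothetical helper names, Python's codecs stand in for MultiByteToWideChar and WideCharToMultiByte, and the variable name `word` is made up. The three characters are the ones whose code points appear later in the thread (U+5009, U+5EAB, U+756A).

```python
def read_script_bytes(raw: bytes, codepage: str) -> str:
    """Model of input: multibyte bytes -> Unicode text (cf. MultiByteToWideChar)."""
    return raw.decode(codepage)


def write_ansi_output(text: str, codepage: str) -> bytes:
    """Model of ANSI output: Unicode text -> multibyte bytes (cf. WideCharToMultiByte)."""
    return text.encode(codepage, errors="replace")


# Bytes of a script line as Notepad would save it as "ANSI" on a Japanese system.
script_bytes = "set word=\u5009\u5eab\u756a".encode("cp932")

# Reading and writing under the code page the file was saved in round-trips.
ok = read_script_bytes(script_bytes, "cp932")
roundtrip = write_ansi_output(ok, "cp932")

# The same bytes read under another single-byte code page give different text.
wrong = read_script_bytes(script_bytes, "cp850")

# Read as UTF-8 (code page 65001), the bytes are not even valid.
try:
    read_script_bytes(script_bytes, "utf-8")
    utf8_fails = False
except UnicodeDecodeError:
    utf8_fails = True

print(roundtrip == script_bytes, wrong == ok, utf8_fails)
```

This shows why a script saved under one code page should be run under the same one: the bytes are only meaningful relative to the code page used to decode them.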

#3 Post by **penpen** » 08 Apr 2014 08:21

You can check that the Unicode values of environment variables don't change when the codepage changes:

Code: Select all

@if (true==false) /*
@echo off
cls

chcp 1252
set test=ö
echo set test=ö

chcp 850
cscript //nologo //e:JScript "%~f0"

chcp 1252
cscript //nologo //e:JScript "%~f0"

echo =========================================
chcp 850
set test=”
echo set test=”

chcp 850
cscript //nologo //e:JScript "%~f0"

chcp 1252
cscript //nologo //e:JScript "%~f0"

chcp 850

goto :eof
*/
@end

var env = WScript.CreateObject ("WScript.Shell").Environment ("Process");
var test = env ("test");

WScript.Echo ("test = " + test + " = " + test.charCodeAt (0));

Result on my Win XP Home 32-bit (de), current patch level:

Code: Select all

Aktive Codepage: 1252.
set test=÷
Aktive Codepage: 850.
test = ö = 246
Aktive Codepage: 1252.
test = ÷ = 246
=========================================
Aktive Codepage: 850.
set test=ö
Aktive Codepage: 850.
test = ö = 246
Aktive Codepage: 1252.
test = ÷ = 246
Aktive Codepage: 850.
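
The same effect can be reproduced with a short Python sketch (an illustration, not part of the original post; it assumes Python's `cp1252` and `cp850` codecs match the Windows code pages). The code point of ö is always 246; only the byte that represents it depends on the code page.

```python
O_UMLAUT = "\u00f6"  # the character ö, code point 246

# The code point does not depend on any code page.
code_point = ord(O_UMLAUT)

# The byte that encodes it does.
byte_in_1252 = O_UMLAUT.encode("cp1252")  # F6
byte_in_850 = O_UMLAUT.encode("cp850")    # 94

# Reading a byte under the "wrong" code page yields a different character:
# F6 is the division sign in 850, 94 is a right double quote in 1252.
f6_as_850 = byte_in_1252.decode("cp850")
x94_as_1252 = byte_in_850.decode("cp1252")

print(code_point, byte_in_1252.hex(), byte_in_850.hex(),
      f"U+{ord(f6_as_850):04X}", f"U+{ord(x94_as_1252):04X}")
```

This matches the batch file above, where the same value 246 is entered once as ö under 1252 and once as ” under 850.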

penpen

Liviu · #4 Post by **Liviu** » 08 Apr 2014 11:44

carlos wrote:I started a new cmd and changed the codepage to 65001, then ran the script that works under the Japanese codepage (932). It did not work: the content of the variable was not saved.

The screenshots look like XP, and XP batch parsing is known to be broken under codepage 65001 - your script doesn't execute even its first line.

A few other random notes:
- all NT-based Windows (everything after 9x/ME) store strings as UTF16-LE internally, and that includes the console and cmd interpreter;
- UTF16 is a Unicode encoding using 1 or 2 16-bit code units (16b integer) per code point (character, loosely speaking);
- multi-byte encodings are not limited to 2 bytes per character, for example UTF-8 is treated as an MBCS in Windows and can use up to 4 bytes per character;
- Asian CJK languages use DBCS codepages, including the Japanese 932 default OEM codepage;
- interactive cmd input is fully Unicode, for example you can paste arbitrary Unicode strings to the console regardless of the active codepage;
- piped-in or redirected-from-file input is read in the active codepage, and internally converted to Unicode (MultiByteToWideChar) - and this is why you can't execute batch files saved as UTF16-LE Unicode;
- console output is fully Unicode, for example you can echo arbitrary Unicode strings to the console regardless of the active codepage;
- piped-out or redirected-to-file cmd/a output is done in the active codepage, and the internal Unicode is converted to the respective codepage (WideCharToMultiByte).
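
The code-unit and byte counts mentioned in the notes above can be verified with a small Python sketch. It is only an illustration; `utf16_units` and `utf8_bytes` are hypothetical helper names, and the sample characters are chosen here, not taken from the post.

```python
def utf16_units(ch: str) -> int:
    """Number of 16-bit code units needed for one code point in UTF-16."""
    return len(ch.encode("utf-16-le")) // 2


def utf8_bytes(ch: str) -> int:
    """Number of bytes needed for one code point in UTF-8."""
    return len(ch.encode("utf-8"))


# ASCII letter, Latin-1 letter, CJK ideograph, and a code point beyond the BMP.
samples = ["A", "\u00f6", "\u5009", "\U0001D11E"]

for ch in samples:
    print(f"U+{ord(ch):04X}: {utf16_units(ch)} UTF-16 unit(s), "
          f"{utf8_bytes(ch)} UTF-8 byte(s)")
```

Only the last sample needs two UTF-16 code units, and the UTF-8 length grows from 1 to 4 bytes across the four samples.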

You may find more tips and caveats in some older posts here http://www.dostips.com/forum/viewtopic.php?p=24089#p24089 and on the ZTree board http://www.ztw3.com/forum/forum_entry.php?id=106405.

Liviu

carlos · #5 Post by **carlos** » 08 Apr 2014 13:31

- interactive cmd input is fully Unicode, for example you can paste arbitrary Unicode strings to the console regardless of the active codepage;

see this image:

I also tested it with chcp 10000 and the same loss happens.

It shows that in interactive mode, under codepage 850, on both Japanese and English Windows, pasting Unicode text causes a lossy conversion because of the codepage. So I keep my conclusion that the input is always treated as multibyte and converted to Unicode using the codepage.

Also, I want to mention that Notepad's ANSI save option behaves differently than on an English Windows, because an English Windows warns with a message like this:

This file contains characters in Unicode format which will be lost if you save this file as an ANSI encoded text file.

but the Japanese version does not warn about this and converts the text to multibyte.

In cmd the highest codepage is UTF-8. Some say that the codepage for UTF-16 is 1200, but it is not accepted as a valid codepage by the function SetConsoleCP ( http://stackoverflow.com/questions/3222213/how-can-i-change-console-font#comment27195859_3222213 ) or by chcp (I tested this on Japanese and English Windows).

I'm convinced that cmd internally stores text using 2 bytes for each character, no more than that, because the environment block uses wide chars (wchar_t), which are 2 bytes, and the environment block is terminated with two '0' wide characters (4 bytes).
This would mean that cmd internally saves each character using 2 bytes (which leaves out the characters that need 4 bytes in UTF-8). Thus, cmd cannot store all the Unicode characters of the world. UTF-8 is treated as a codepage, not as an internal representation.

I keep my conclusions:
Cmd internally stores each character using 2 bytes.
On input it translates a byte sequence, treated as multibyte (1 or 2 bytes, no more than that) under the current codepage, to the internal 2-bytes-per-character representation.
And cmd can output this as multibyte (/a option) or Unicode (2 bytes for each character) (/u option).

So the /A output option is multibyte, not single byte.

#6 Post by **penpen** » 08 Apr 2014 14:46

Sorry carlos, but Liviu is right:

carlos wrote:
- interactive cmd input is fully Unicode, for example you can paste arbitrary Unicode strings to the console regardless of the active codepage;

see this image:
(...)
I also tested it with chcp 10000 and the same loss happens.

It shows that in interactive mode, under codepage 850, on both Japanese and English Windows, pasting Unicode text causes a lossy conversion because of the codepage. So I keep my conclusion that the input is always treated as multibyte and converted to Unicode using the codepage.

You just see a replacement character three times, used because the current codepage does not support the Unicode code points involved.
This does not mean that cmd cannot display the Unicode values; it is an artifact of copy/paste.
If the text had been converted to ANSI (codepage 932) and back to Unicode, there would be 6 characters, because codepage 850 maps every single byte to a Unicode value.

carlos wrote:Also, I want to mention that Notepad's ANSI save option behaves differently than on an English Windows, because an English Windows warns with a message like this:
This file contains characters in Unicode format which will be lost if you save this file as an ANSI encoded text file.
but the Japanese version does not warn about this and converts the text to multibyte.

It does exactly the same as the English version, just with another codepage:
When the text is saved, a reverse codepage mapping from Unicode code points to bytes is performed.
As no character lies outside codepage 932, you don't get the message/option to store the file as Unicode.
It is the same in the English version (I don't know which default codepage Notepad uses there; my guess is codepage 1252).

You can check this by opening the saved text file in a hex editor: no byte order mark => ANSI.
In addition the file contains (in hex): 91 71 8C C9 94 D4.
These are the raw bytes that codepage 932 maps to Unicode: see http://msdn.microsoft.com/en-us/goglobal/cc305152
raw bytes -> Unicode code point = UTF-16 code unit = UTF-16LE bytes (hex)
91 71 -> U+5009 = 0x5009 = 09 50
8C C9 -> U+5EAB = 0x5EAB = AB 5E
94 D4 -> U+756A = 0x756A = 6A 75
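
The mapping table above can be reproduced with a minimal Python sketch. This is an illustration only; it assumes that Python's `cp932` and `cp850` codecs agree with the Windows code pages for these bytes.

```python
# The six bytes found in the ANSI-saved file.
raw = bytes.fromhex("91 71 8C C9 94 D4")

# Code page 932 is a DBCS code page: each pair of bytes is one character here.
text = raw.decode("cp932")

for ch in text:
    cp932_bytes = ch.encode("cp932").hex(" ").upper()
    utf16le_bytes = ch.encode("utf-16-le").hex(" ").upper()
    print(f"{cp932_bytes} -> U+{ord(ch):04X} = {utf16le_bytes} (UTF-16LE)")

# Under the single-byte code page 850 the same six bytes are six characters.
as_cp850 = raw.decode("cp850")
print(len(text), len(as_cp850))
```

Three characters under 932 versus six under 850 is the difference referred to in the reply about the pasted text.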

carlos wrote:I'm convinced that cmd internally stores text using 2 bytes for each character, no more than that, because the environment block uses wide chars (wchar_t), which are 2 bytes, and the environment block is terminated with two '0' wide characters (4 bytes).
This would mean that cmd internally saves each character using 2 bytes (which leaves out the characters that need 4 bytes in UTF-8). Thus, cmd cannot store all the Unicode characters of the world. UTF-8 is treated as a codepage, not as an internal representation.

No, the UTF-16 text string is encoded as a list of code units stored in a wchar_t list/array (wchar_t* text / wchar_t text [8096]):
Each code unit fits in one wchar_t.

It is allowed to use 2 code units to represent one character; this is known as a Unicode surrogate pair (Wikipedia example):
U+1D11E (MUSICAL SYMBOL G CLEF) = 0xD834 0xDD1E (2 code units) = 34 D8 1E DD (UTF-16LE bytes, hex)
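
How the two code units are derived can be sketched in Python. The function name `to_surrogates` is hypothetical; the arithmetic is the standard UTF-16 surrogate computation.

```python
def to_surrogates(code_point: int) -> tuple[int, int]:
    """Split a code point above U+FFFF into a UTF-16 (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000      # 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low


high, low = to_surrogates(0x1D11E)
print(f"U+1D11E -> 0x{high:04X} 0x{low:04X}")

# Cross-check against Python's own UTF-16LE encoder (low byte of each unit first).
encoded = "\U0001D11E".encode("utf-16-le")
print(encoded.hex(" ").upper())
```

Each of the two code units still fits in a 16-bit wchar_t, so a 16-bit storage unit does not limit the text to 65536 characters.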

carlos wrote:I keep my conclusions:
Cmd internally stores each character using 2 bytes.
On input it translates a byte sequence, treated as multibyte (1 or 2 bytes, no more than that) under the current codepage, to the internal 2-bytes-per-character representation.
And cmd can output this as multibyte (/a option) or Unicode (2 bytes for each character) (/u option).

So the /A output option is multibyte, not single byte.

As mentioned by Liviu, there may be more than 2 bytes per character, for example when using UTF-8;
the original UTF-8 design even allowed sequences of up to 6 bytes, although the current standard (RFC 3629) limits it to 4 bytes, which is enough for every code point up to U+10FFFF.

Again the wikipedia example:
U+1D11E (MUSICAL SYMBOL G CLEF) = F0 9D 84 9E (UTF-8 hex representation)

Wikipedia examples (only 2 links allowed so i misuse the code block):

Code: Select all

http://en.wikipedia.org/wiki/UTF-16
http://de.wikipedia.org/wiki/UTF-8

penpen

Edit: Fixed some flaws.

carlos · #7 Post by **carlos** » 08 Apr 2014 14:54

Is there some real example that shows cmd holding a Unicode character that needs more than 2 bytes to represent it?
I maintain that cmd cannot do it.

#8 Post by **aGerman** » 08 Apr 2014 15:42

carlos

You have to clearly separate the unicode support of cmd.exe from the ability to display unicode characters in the console window. The latter depends on the font that you set for the console. You can check using charmap.exe what characters are actually supported. E.g. "Lucida Console" or "Consolas" can be set for the console and do support several unicode characters.

Example:
create a file called "♫.#"
change the default console font to "Lucida Console"
reopen the cmd window and navigate to the folder
type dir *.# and see what happens

Code: Select all

Microsoft Windows [Version 6.1.7601]
Copyright (c) 2009 Microsoft Corporation. Alle Rechte vorbehalten.

C:\Users\steffen>cd desktop

C:\Users\steffen\Desktop>dir *.#
 Datenträger in Laufwerk C: ist Acer
 Volumeseriennummer: 149E-FE40

 Verzeichnis von C:\Users\steffen\Desktop

08.04.2014  23:47                 0 ♫.#
               1 Datei(en),              0 Bytes
               0 Verzeichnis(se), 104.213.942.272 Bytes frei

C:\Users\steffen\Desktop>for /f %i in ('dir /b *.#') do set "var=%~ni"

C:\Users\steffen\Desktop>set "var=♫"

C:\Users\steffen\Desktop>echo %var%
♫

C:\Users\steffen\Desktop>

Regards
aGerman

#9 Post by **aGerman** » 08 Apr 2014 16:47

Addendum;

If you really want to explore the unicode support of cmd.exe you would need two things:

1) A console emulator that is some kind of front end for the cmd.exe.
This program should
- have an interface for the character streams from and to the cmd.exe process that runs in the background
- support any installed font
ConEmu

2) A font that supports almost any possible unicode character
GNU Unifont

Regards
aGerman

Liviu · #10 Post by **Liviu** » 08 Apr 2014 17:24

Carlos, and please don't take it the wrong way, but I think you are confusing yourself more than needed by jumping into Japanese codepages before fully exploring the common issues present in the - relatively simpler case of - western languages.

carlos wrote:I'm convinced that cmd internally stores text using 2 bytes for each character, no more than that, because the environment block uses wide chars (wchar_t), which are 2 bytes

That's not a matter of opinion ;-)

and, sorry, but the fact is that you are mistaken. It is true that all Windows (not just cmd) use 16b wchar_t's for storing strings - but some Unicode codepoints require 2 wchar_t's. Just look up the UTF16 specs, surrogates, extended planes etc. What you're thinking of (one 16b value per character) sounds more like UCS-2 which I believe was used in NT 3.x once upon a time, but then expanded to UTF16 since at least Win2K.

carlos wrote:On input it translates a byte sequence, treated as multibyte (1 or 2 bytes, no more than that) under the current codepage, to the internal 2-bytes-per-character representation.

There is no translation on interactive input (except in really rare cases when you copy/paste from some ancient app that only puts 8-bit text onto the clipboard). For a quick test (no Japanese required) paste this "echo ‹αß©∂€›" at a cmd prompt in any of the common codepages - 437, 850, 1252 - and then execute it. You'll get the text pasted correctly, and the output of "echo" will be correct - using any of those codepages - still neither of those codepages contains all characters in the string that was pasted. This should make it obvious that what's pasted is the fully Unicode string, without any conversion to the active codepage. (Note: as always when codepages and Unicode are involved, you should be using a Unicode/TT - not raster - font in the console. Besides aesthetics, using a raster font actually changes the automatic codepage conversions done by the console, for output in particular - see the MSDN SetConsoleOutputCP docs "if the current font is a raster font, SetConsoleOutputCP does not affect how extended characters are displayed".)
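
That none of the three codepages contains every character of the pasted string can be checked with a short Python sketch (an illustration only; `missing` is a hypothetical helper, and Python's `cp437`, `cp850` and `cp1252` codecs are assumed to match the Windows code pages).

```python
# The pasted test string from the post, written as escapes: ‹ α ß © ∂ € ›
PASTED = "\u2039\u03b1\u00df\u00a9\u2202\u20ac\u203a"


def missing(text: str, codepage: str) -> list[str]:
    """Return the characters of text that the code page cannot encode."""
    out = []
    for ch in text:
        try:
            ch.encode(codepage)
        except UnicodeEncodeError:
            out.append(ch)
    return out


gaps = {cp: missing(PASTED, cp) for cp in ("cp437", "cp850", "cp1252")}

for cp, chars in gaps.items():
    print(cp, [f"U+{ord(c):04X}" for c in chars])
```

Every code page has at least one gap, so if the pasted text went through a conversion to the active codepage, at least one character would be lost in each case.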

carlos wrote:So the /A output option is multibyte, not single byte.

The /A output is single or multi-byte depending on the codepage being SBCS (single byte) or MBCS (multi byte). All "western" codepages (437, 850, 1252 etc) are single byte.
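
The SBCS/MBCS distinction can be illustrated with a minimal Python sketch. `bytes_per_char` is a hypothetical helper, and Python's `cp850` and `cp932` codecs stand in for the Windows code pages.

```python
def bytes_per_char(text: str, codepage: str) -> list[int]:
    """Byte length of each character of text when encoded in the given code page."""
    return [len(ch.encode(codepage)) for ch in text]


# A Western single-byte code page: every supported character is one byte.
western = bytes_per_char("a\u00f6\u00df", "cp850")

# The Japanese double-byte code page: ASCII is one byte, a kanji is two.
japanese = bytes_per_char("a\u5009", "cp932")

print(western, japanese)
```

So whether ANSI output is single-byte or multi-byte is a property of the active code page, not of the /A switch itself.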

carlos wrote:Is there some real example that shows cmd holding a Unicode character that needs more than 2 bytes to represent it? I maintain that cmd cannot do it.

So far, cmd doesn't display surrogates correctly, that much is true - but that's just a shortcoming of particular cmd versions. It doesn't change what UTF16 means and how it's supported by Windows itself. To see an example of a 2-wchar_t (4-byte) Unicode codepoint:
- open http://www.alanwood.net/unicode/mathematical_alphanumeric_symbols.html;
- depending on your browser and settings, the leftmost char on the first line under "Character" should display as either a funky "A" or maybe a <box> placeholder;
- regardless of how the browser displays it, select that character and copy it to the clipboard;
- now run Wordpad and paste it;
- if you have the Cambria Math font installed, Ctrl-A and change the font to Cambria Math (if you don't have the font, see http://en.wikipedia.org/wiki/Cambria_(typeface) for ways to get it - including free MS Office viewers etc);
- if using Cambria Math, Wordpad would show that same funky "A", otherwise some <box> placeholder;
- regardless of what's displayed, do a Save-As, select type "Unicode document", give it some name and save it.
At this point, you should have a 6-byte long file, with hex contents "FF FE 35 D8 00 DC". This is the 2-byte UTF16-LE BOM "FEFF" plus the 2-double-byte "D835 DC00" UTF16 encoding of U+1D400 - the selected Unicode character "MATHEMATICAL BOLD CAPITAL A". Note that the contents of the file is one single Unicode character, which takes 4 bytes to encode as UTF16-LE.
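
For readers without Wordpad at hand, the expected file contents can be reproduced with a minimal Python sketch. This is an illustration only; the file name `bold_a.txt` is made up.

```python
import codecs
import os
import tempfile

BOLD_A = "\U0001D400"  # MATHEMATICAL BOLD CAPITAL A

# UTF-16LE byte order mark followed by the UTF-16LE encoding of the character.
data = codecs.BOM_UTF16_LE + BOLD_A.encode("utf-16-le")

# Write the bytes to a temporary file and read them back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "bold_a.txt")  # hypothetical file name
    with open(path, "wb") as f:
        f.write(data)
    with open(path, "rb") as f:
        on_disk = f.read()

print(len(on_disk), on_disk.hex(" ").upper())
```

The file holds a single Unicode character, yet needs two 16-bit code units (4 bytes) after the 2-byte BOM.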

aGerman wrote:You have to clearly separate the unicode support of cmd.exe from the ability to display unicode characters in the console window.

That's good advice, indeed. To add to that, one must also separate between the Unicode support built in and provided by Windows itself vs. what/how programs choose to use that support - and both change from version to version. For an example, the XP Notepad doesn't display surrogates correctly, though XP itself offers the necessary support (which Wordpad takes advantage of). However, the Win7 Notepad does have the proper surrogates support.

Liviu

carlos · #11 Post by **carlos** » 08 Apr 2014 18:02

Thanks. I will read this more slowly.
But you are right, the documentation of the MultiByteToWideChar function says:
Maps a character string to a UTF-16 (wide character) string.

So the internal representation is UTF-16, which can use more than 2 bytes to represent a character.

Code: Select all

http://msdn.microsoft.com/en-us/library/windows/desktop/dd374069%28v=vs.85%29.aspx

and this page says:

Code: Select all

http://msdn.microsoft.com/en-us/library/windows/desktop/dd374081%28v=vs.85%29.aspx

UTF-7, UTF-8, UTF-16, and UTF-32. Conversion of data among these encodings is lossless.

So I update my conclusion:

/U outputs text as UTF-16 (without the BOM)
/A outputs text as multibyte (converts UTF-16 to multibyte using the current codepage)

So for lossless Unicode input, the UTF-16 should be converted to multibyte using codepage 65001 (UTF-8) or 65000 (UTF-7).
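
The difference between a lossless and a lossy conversion can be sketched in Python. This is an illustration under assumptions: Python's `utf-8`, `utf-7` and `cp850` codecs stand in for codepages 65001, 65000 and 850, and the sample text is made up.

```python
import codecs

# Sample text: three kanji, a music note, and a character outside the BMP.
text = "\u5009\u5eab\u756a \u266b \U0001D400"

# Unicode transformation formats round-trip any text without loss.
lossless = {enc: text.encode(enc).decode(enc) == text
            for enc in ("utf-8", "utf-7", "utf-16-le")}

# A legacy code page replaces what it cannot represent.
via_cp850 = text.encode("cp850", errors="replace").decode("cp850")

# Encoding as "utf-16-le" emits no byte order mark, like the /U output described above.
has_bom = text.encode("utf-16-le").startswith(codecs.BOM_UTF16_LE)

print(lossless, via_cp850 == text, has_bom)
```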

Liviu · #12 Post by **Liviu** » 08 Apr 2014 22:05

carlos wrote:/U outputs text as UTF-16 (without the BOM)
/A outputs text as multibyte (converts UTF-16 to multibyte using the current codepage)

So for lossless Unicode input, the UTF-16 should be converted to multibyte using codepage 65001 (UTF-8) or 65000 (UTF-7).

That's what the code I posted as "replVar7.cmd" in the other thread (viewtopic.php?p=33517#p33517) does - (a) the child process saves the output to a UTF16-LE file, then (b) the parent process converts that file to UTF-8 and (c) reads it into a variable.
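
Steps (a) and (b) amount to a plain transcoding of a file. The following Python sketch shows only that data flow; it is not the replVar7.cmd code, and the function name, the file names and the variable `var` are hypothetical.

```python
import os
import tempfile


def utf16le_file_to_utf8(src: str, dst: str) -> str:
    """Read a BOM-less UTF-16LE text file and rewrite it as UTF-8."""
    with open(src, "r", encoding="utf-16-le", newline="") as f:
        text = f.read()
    with open(dst, "w", encoding="utf-8", newline="") as f:
        f.write(text)
    return text


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "utf16.txt")  # hypothetical name, step (a) output
    dst = os.path.join(tmp, "utf8.txt")   # hypothetical name, step (b) output

    # Step (a): what a child process writing UTF-16LE without BOM would leave behind.
    line = "set var=\u266b\r\n"
    with open(src, "wb") as f:
        f.write(line.encode("utf-16-le"))

    # Step (b): convert the intermediate file to UTF-8.
    converted = utf16le_file_to_utf8(src, dst)
    with open(dst, "rb") as f:
        utf8_on_disk = f.read()

print(converted == line, utf8_on_disk.hex(" "))
```

Step (c), reading the UTF-8 file into a cmd variable, depends on cmd itself and is not modelled here.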

A couple of notes, however...

Console-friendly apps "inherit" their output codepage from the parent, and don't make assumptions about the respective codepage being SBCS vs. MBCS. For such apps, the parent can set the codepage to 65001 before calling, and the child process could then save a UTF-8 file directly - effectively merging steps (a)+(b) into one, and avoiding the need for a UTF16 intermediate file. For example, cmd's builtin "type" command is such a console-friendly command. Unfortunately, cscript is not, and its output fails to pipe or redirect correctly under codepage 65001 - which is the reason why my code required steps (a) and (b) to be kept separate.

Also, step (c) does not work under XP, and I am not aware of any (reasonable) trick/alternative to make it work. I say "reasonable" because there is always the brute-force beyond-ugly way to do it as in http://www.dostips.com/forum/viewtopic.php?p=24160#p24160 ;-)

But that's painfully twisted, and slow enough to not count as a practical solution. Unless/until a workable alternative is found, reading a variable from a UTF-8 file remains a Win7+ fringe benefit.

Liviu

carlos · #13 Post by **carlos** » 09 Apr 2014 03:37

aGerman, I will download that unicode font.

Liviu, I checked that, as you say, the Windows XP cmd batch interpreter ends script execution when codepage 65001 or 65000 is used.

Your script ends when the line chcp 65001 is executed.
Edit:
Also, the routine that gets the current codepage with for /f works fine on the Japanese Windows.

Then I wrote a script similar to yours, but one that converts a Unicode variable to a multibyte script using the current codepage. I get the same script that I saved on the Japanese Windows using Notepad.
I tested it using codepage 932.

Code: Select all

@echo off

REM Convert unicode variable to multibyte batch script
REM using the current codepage.
REM Parameter: the name of variable
REM Output: multibyte.bat

setlocal enableextensions enabledelayedexpansion

call :genchr 255
call :genchr 254

cmd /u /d /c "echo(set "%~1=!%~1!">utf16-bom.txt"
copy 255.chr /b + 254.chr /b + utf16-bom.txt /b utf16bom.txt /b /y >nul
 
cmd /a /d /c "type utf16bom.txt >multibyte.bat"

del 255.chr 254.chr
del utf16bom.txt
del utf16-bom.txt

goto :eof


:genchr
REM This code creates one single byte. Parameter: <int>0-255
REM Teamwork of carlos, penpen, aGerman, dbenham
REM Tested under Win2000, XP, Win7, Win8
set "options=/d compress=off /d reserveperdatablocksize=26"
if %~1 neq 26  (type nul >t.tmp
makecab %options% /d reserveperfoldersize=%~1 t.tmp %~1.chr >nul
type %~1.chr | (
(for /l %%N in (1 1 38) do pause)>nul&findstr "^">t.tmp)
>nul copy /y t.tmp /a %~1.chr /b
del t.tmp
) else (copy /y nul + nul /a 26.chr /a >nul)
goto :eof

I'm sure that we can encode it as UTF-8 with VBScript, beginning with something like this:

Code: Select all

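' FileSystemObject reads the UTF-16 text; ADODB.Stream is meant to write it as UTF-8
set o=createobject("scripting.filesystemobject")
set w=createobject("adodb.stream")
' 1 = ForReading, 0 = do not create, -1 = TristateTrue (open as Unicode)
set r=o.opentextfile("utf16-bom.txt",1,0,-1)
' type=2 is adTypeText; charset only applies to text streams (type=1 is binary)
w.type=2 : w.charset="utf-8" : w.open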

but it makes no sense to use this for compatibility with Windows XP if, having the multibyte text encoded as UTF-8, we cannot use the UTF-8 codepage to translate it back to UTF-16.

carlos · #14 Post by **carlos** » 09 Apr 2014 03:46

I found a solution for using the UTF-8 codepage in a batch script on Windows XP.

This is the trick:

Code: Select all

(
chcp 65001 >nul
echo do something
rem restore the codepage
chcp 850 >nul
)

We need to start the batch file under a codepage other than 65001 or 65000. The whole parenthesized block is then loaded into memory, and at run time we change the codepage to 65001 and use it. It does not fail, because the instructions are already in memory and are not read from the batch file.
It works on XP.

carlos · #15 Post by **carlos** » 09 Apr 2014 04:01

I updated the code to use UTF-8 instead of the current codepage.
Works on XP.
Liviu, you need to add two parentheses to your script and it will work on XP.

This is the code. I tested it on the Japanese XP.

Code: Select all


@echo off

REM Convert unicode variable to multibyte batch script
REM using utf-8.
REM Parameter: the name of variable
REM Output: multibyte.bat

setlocal enableextensions enabledelayedexpansion

call :genchr 255
call :genchr 254

cmd /u /d /c "echo(set "%~1=!%~1!">utf16-bom.txt"
copy 255.chr /b + 254.chr /b + utf16-bom.txt /b utf16bom.txt /b /y >nul

:: save original codepage ('.' for some localized windows e.g. german)
for /f "tokens=2 delims=:." %%a in ('chcp') do @set /a "cp=%%~a"

(
chcp 65001 >nul
cmd /a /d /c "type utf16bom.txt >multibyte.bat"
chcp %cp% >nul
)

del 255.chr 254.chr
del utf16bom.txt
del utf16-bom.txt

goto :eof


:genchr
REM This code creates one single byte. Parameter: <int>0-255
REM Teamwork of carlos, penpen, aGerman, dbenham
REM Tested under Win2000, XP, Win7, Win8
set "options=/d compress=off /d reserveperdatablocksize=26"
if %~1 neq 26  (type nul >t.tmp
makecab %options% /d reserveperfoldersize=%~1 t.tmp %~1.chr >nul
type %~1.chr | (
(for /l %%N in (1 1 38) do pause)>nul&findstr "^">t.tmp)
>nul copy /y t.tmp /a %~1.chr /b
del t.tmp
) else (copy /y nul + nul /a 26.chr /a >nul)
goto :eof

DosTips.com
