Synthesizing Unicode strings in Windows 7 batch

Liviu
Expert
Posts: 470
Joined: 13 Jan 2012 21:24


#1 Post by Liviu » 13 Apr 2014 23:31

Given the newfound UTF-8 codepage 65001 support in Windows 7, it is now possible to build arbitrary Unicode strings in batch code from the hex representation of their UTF-8 encoding. This did not work in XP (and, as far as I can tell, could not possibly have), so it is an interesting new ability with some potential. For example, one could remove active codepage dependencies for extended ASCII, display multi-language text at the same prompt, use the full range of box-drawing characters and symbols, and so on.

Below is the $chrU.cmd that does the actual conversion and assignment. First, though, the prerequisites are (a) Windows 7 - this will not work in XP, (b) a cmd prompt set to Lucida Console or another Unicode/TrueType (not raster) font, and (c) my auxiliary $cpChars.cmd batch, which can be downloaded from https://db.tt/laqBZ7Dv. That is a shortened Dropbox link, and the batch just generates a 256-character variable where offset 0 is not used and offsets 1-255 hold the character with the respective code in the active codepage (the file is provided as a link to a .zip since it contains some control chars and extended ASCII that make it difficult to copy/paste directly). EDIT: Updated $chrU code to fix ^! return to disableDelayedExpansion context.

Code: Select all

:: $chrU  [out,ref] str  =  [in,val] hex#1 .. hex#N  ______________ 14.04.17 __
::
::      decodes sequence of 'hex#' utf-8 bytes into variable 'str'
::
:: e.g. $chrU str = 22 CE B1 22  --  sets 'str' to a quoted greek alpha u+03B1
::
:: rem  requires win7 or later, won't work under xp or earlier
:: rem  control chars 0x00-1F not supported
:: rem  undefined behavior if the input is not valid utf-8
:: rem  the '=' equal sign is only used as a delimiter, any of '= ,;' work too
:: ____________________________________________________________________________

@echo off & if "%~1"=="/?" (goto :help) else if "%~1"=="" (goto :errs)
setlocal & if "!"=="" (set "isEdx=1") else (set "isEdx=")

:: save original codepage ('.' for some localized windows e.g. german)
for /f "tokens=2 delims=:." %%a in ('chcp') do set/a "cp=%%~a"

:: save controls + ascii + extended chars to 'chrA' table
call $cpChars chrA

:setU  [out,ref] str  [in,val] hex#1 .. hex#N  --------------------------------
setlocal enableDelayedExpansion
set "u=" & for /f "tokens=1,* delims== " %%U in ("%*") do for %%W in (%%V) do (
  set/a "x=0x%%W" & for %%X in (!x!) do set "u=!u!!chrA:~%%X,1!")

@rem escape '%^<>|&!' before final for/f-echo-set conversion
if defined isEdx (set "u=!u:^=^^^^^^^^!") else (set "u=!u:^=^^^^!")
set "u=!u:%%=%%chrA:~37,1%%!"
for %%Q in ("<" ">" "|" "&") do set "u=!u:%%~Q=^^%%~Q!"
set ^"u=!u:"=""!^"
if defined isEdx (set "u=%u:!=^^^^^!%"!) else (set "u=%u:!=^^^!%"!)
set ^"u=!u:""=^^"!^"

@rem 2nd next must stay on one line, and 'cp' must be the 'chrA' codepage [1]
chcp 65001 >nul
chcp %cp% >nul & for /f delims^=^ eol^= %%V in ('echo(!u!') do (
  endlocal & endlocal & set "%~1=%%V"!)
exit /b 0

:help .........................................................................
echo(
@rem dump :: comment lines at the top of the file, skip ::: lines
for /f "usebackq delims=" %%a in ("%~f0") do (
  set "z=%%~a" & setlocal enableDelayedExpansion
  if not "!z:~0,1!"==":" endlocal & exit /b 0
  (if not "!z:~2,1!"==":" echo !z!) & endlocal
)
exit /b 0
:errs
call :help
>&2 ((echo() & (echo ** error:  bad syntax '%~0 %*'))
exit /b 1 .....................................................................

:: [1] '!u!' expands to a double-byte sequence equivalent to the utf-8 encoding
::     then 'echo' narrows it down to single-byte per inner 'cp' codepage   (a)
::     then '%%V' expands it back to double-byte per outer codepage 65001   (b)
::     effectively decoding the utf-8 byte sequence into the unicode string (*)
:: (a) the 'in' command of a 'for/f' loop runs in the codepage which is active
::     at the time the nested 'cmd' executes the command - in this case 'cp'
:: (b) the 'for/f' evaluates the loop variables according to the original
::     codepage in effect at the time the loop was parsed - in this case 65001
:: (*) works in win7, but not xp - because chcp 65001 stops batch parsing in xp
:: ____________________________________________________________________________

And this is a batch file using $chrU, with the following output copied from a win7x64.sp1 cmd prompt using the Lucida Console font.

Code: Select all

@echo off & setlocal enableDelayedExpansion

call $chrU u = E2 80 B9 CE B1    C3 9F    C2 A9    E2 88 82 E2 82 AC E2 80 BA
echo blend     !u!
call $chrU u = C3 A0    C3 A1    C3 A2    C4 81    C4 83    C4 85    C7 BB
echo latin     !u!
call $chrU u = CE B1    CE B2    CE B3    CE B4    CE B5    CE B6    CE B7
echo greek     !u!
call $chrU u = D0 B0    D0 B1    D0 B2    D0 B3    D0 B4    D0 B5    D0 B6
echo cyrillic  !u!
call $chrU u = E2 86 90 E2 86 91 E2 86 92 E2 86 93 E2 86 94 E2 86 95 E2 86 A8
echo arrows    !u!
call $chrU u = E2 96 8C E2 97 84 E2 96 B2 E2 97 8B E2 96 BC E2 96 BA E2 96 90
echo drawing   !u!
call $chrU u = C2 A2    C2 A3    C2 A4    C2 A5    E2 82 A3 E2 82 A4 E2 82 AC
echo currency  !u!
call $chrU u = C2 B1    C3 97    E2 88 82 E2 88 86 E2 88 8F E2 88 91 E2 88 92
echo math      !u!
call $chrU u = C2 AB    C2 A1    C2 BF    C2 A9    C2 AE    C2 A7    E2 80 A0
echo punct     !u!
call $chrU u = C2 BC    E2 85 9B C2 B9    E2 99 A0 E2 99 A3 E2 99 A5 E2 99 A6
echo misc      !u!
call $chrU u = 3B 25 63 64 25 28 21 63 64 21 5E 22 5E 5E 22 22 21 3C 3F 26 5E
echo ascii     !u!

endlocal & goto :eof

Code: Select all

C:\tmp>$chrU.test
blend     ‹αß©∂€›
latin     àáâāăąǻ
greek     αβγδεζη
cyrillic  абвгдеж
arrows    ←↑→↓↔↕↨
drawing   ▌◄▲○▼►▐
currency  ¢£¤¥₣₤€
math      ±×∂∆∏∑−
punct     «¡¿©®§†
misc      ¼⅛¹♠♣♥♦
ascii     ;%cd%(!cd!^"^^""!<?&^
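
As a quick cross-check outside of batch, the hex lists passed to $chrU can be decoded with a few lines of Python. This is only a sketch for verifying the inputs, not part of the scripts above; the helper names `utf8_hex_to_text` and `codepoints` are made up for illustration.

```python
def utf8_hex_to_text(hex_bytes: str) -> str:
    """Decode a space-separated list of hex bytes as UTF-8."""
    # bytes.fromhex ignores the spaces between byte values
    return bytes.fromhex(hex_bytes).decode("utf-8")


def codepoints(text: str) -> str:
    """List the characters of 'text' as U+XXXX codepoints."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


# the 'blend' line from the test batch
blend = utf8_hex_to_text("E2 80 B9 CE B1 C3 9F C2 A9 E2 88 82 E2 82 AC E2 80 BA")
print(codepoints(blend))
# prints: U+2039 U+03B1 U+00DF U+00A9 U+2202 U+20AC U+203A
```

The codepoints are printed instead of the characters themselves so the check does not depend on the console font or codepage.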

Some closing notes:
- the two key tricks that make it work are the new Windows 7 support for parsing batch code under codepage 65001, and the codepage handling around for/f loops - which hasn't changed since XP, but could not be put to good UTF-8 use until now;
- anyone not fond of my helper batch $cpChars.cmd can use any other code that generates the same "character map" of the active codepage instead; several ways to do it have been posted on dostips before;
- $chrU doesn't use temp files, and needs only one for/f-command loop running the internal 'echo';
- the code is not particularly optimized, and I tried to keep it reasonably clean - only concession being the extra lines dealing with the traditional '^!' problem characters.
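
The codepage round trip behind those two tricks can be sketched in Python, as an illustration only. It assumes codepage 437 as the active codepage, models the $cpChars table as a plain list, and uses hypothetical function names; the real work in $chrU is done by cmd itself, not by code like this.

```python
def build_table(codepage: str) -> list:
    """Model of the $cpChars table: entry n is the character whose
    single-byte code in 'codepage' is n (entry 0 is unused in batch)."""
    return [bytes([n]).decode(codepage) for n in range(256)]


def synthesize(hex_bytes: str, codepage: str = "cp437") -> str:
    table = build_table(codepage)
    # 1. table lookup: one codepage character per utf-8 byte
    as_chars = "".join(table[int(h, 16)] for h in hex_bytes.split())
    # 2. narrowing: written out under the same codepage, each character
    #    turns back into the original byte value
    raw = as_chars.encode(codepage)
    # 3. widening: the same bytes read back as utf-8 (codepage 65001)
    return raw.decode("utf-8")


alpha = synthesize("CE B1")
print(f"U+{ord(alpha):04X}")   # prints: U+03B1
```

The codepage used for steps 1 and 2 does not matter as long as it is the same in both, which mirrors the requirement that 'cp' be the codepage the table was built under.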

Liviu

P.S. As to the question of where to get the UTF-8 encoding of a given string... If the string comes from a UTF-8 encoded text file, then viewing the file in a hex viewer will show the bytes. If it comes from another document, or a web page, copying/pasting to any number of online tools (such as http://rishida.net/tools/conversion/) will show the corresponding UTF-8.
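
For those with Python at hand, another option is a short helper along these lines (a sketch, with a made-up function name), which prints the bytes in the same space-separated hex form that $chrU expects.

```python
def utf8_hex(text: str) -> str:
    """Return the UTF-8 encoding of 'text' as space-separated hex bytes."""
    return " ".join(f"{b:02X}" for b in text.encode("utf-8"))


# quoted greek alpha, written with an escape to keep this source pure ASCII
print(utf8_hex('"\u03b1"'))   # prints: 22 CE B1 22
```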

Or, save the following as $ascU.cmd (which FWIW works under XP too, not just Windows 7). EDIT #2: Updated $ascU, $ascX code below to call '%comspec%' instead of hardcoded 'cmd', plus minor/cosmetic changes.

Code: Select all

:: $ascU  [in,ref] str  /U  [out,ref,opt] utf-8-bytes-hex  ________ 14.05.23 __
::                      /W  [out,ref,opt] utf16-words-hex
::                      /A  [out,ref,opt] ext-asc-bytes-hex  [out,ref,opt] strA
::
::      returns encoding of 'str' as utf-8, utf16, or 8-bit active codepage
::      and optionally for '/A' the translation of 'str' in the given codepage
::
:: rem  '/U' is assumed by default if no '/' specified
:: rem  control chars not supported in the input string
:: rem  '/A' with 'strA' must be called from disableDelayedExpansion context
::      in order for '^!' to be returned correctly in 'strA'
::
:: 14.05.23  replaced 'cmd' with '%comSpec%' in nested calls
:: 14.04.19  checked ok under xp.sp3, win7x64.sp1
:: ____________________________________________________________________________

@echo off & if "%~1"=="/?" (goto :help) else if "%~1"=="" (goto :errs)
setlocal enableDelayedExpansion

set "str=!%~1!" & shift
set "enc=U" & for %%E in (U W A) do if /i "%~1"=="/%%E" set "enc=%%E" & shift
set "hex=%~1" & set "asc=%~2"

:: quick exit for empty string
if not defined str (
  if not defined hex (echo() else set "%hex%=" & if defined asc set "%asc%="
  endlocal & exit /b 0
)

:: save original codepage ('.' for some localized windows e.g. german)
for /f "tokens=2 delims=:." %%a in ('chcp') do set/a "cp=%%~a"

:: define BS backspace
for /f %%b in ('"prompt $H & for %%b in (1) do rem"') do set "BS=%%b"

set "f0=%temp%\%time::=%%random%.tmp" & set "f1=%temp%\%time::=%%random%.tmp"
set ^"echoA="%comSpec%" /a/v/c echo^" & set ^"echoW="%comSpec%" /u/v/c echo^"

set "hX=" & if defined asc (set "aX=" & call :asc%enc% str hX aX
            ) else         (call :asc%enc% str hX)
2>nul del "%f1%" "%f0%"

if not defined hex (endlocal & echo %hX% & exit /b 0)
if not defined asc (endlocal & set "%hex%=%hX%" & exit /b 0)
for /f delims^=^ eol^= %%X in ("!aX!") do (
  endlocal & set "%hex%=%hX%" & set "%asc%=%%X")
exit /b 0

:ascU  [in,ref] str  [out,ref] utf-8-bytes-hex  ...............................
chcp 65001 >nul & (>"%f1%" %echoA%(^^!%1^^!) & chcp %cp% >nul
call :hexA %2 & goto :eof

:ascW  [in,ref] str  [out,ref] utf16-words-hex  ...............................
(>"%f1%" %echoW%(^^!%1^^!) & call :hexW %2 & goto :eof

:ascA  [in,ref] str  [out,ref] ext-asc-bytes-hex  [out,ref,opt] ext-asc-str  ..
(>"%f1%" %echoA%(^^!%1^^!) & call :hexA %2 & if "%3"=="" goto :eof
@rem escape '%^<>|&!' before final for/f-echo-set conversion
set "u=!%1!" & set "PCT=%%"
for %%Q in ("%%=%%PCT%%" "^=^^^^^^^^") do set "u=!u:%%~Q!"
for %%Q in ("<" ">" "|" "&") do set "u=!u:%%~Q=^^%%~Q!"
set ^"u=!u:"=""!^"
set "u=%u:!=^^^^^!%"!
set ^"u=!u:""=^^"!^"
for /f delims^=^ eol^= %%V in ('echo(!u!') do set "%3=%%V"!
goto :eof

:hexA  [out,ref] hex-bytes  --  'f1' = narrow string + narrow <cr><lf>  .......
set "%1=" & for %%X in ("%f1%") do set/a len=%%~zX-2
set "z=" & for /l %%N in (1 1 !len!) do set "z=!z!!BS!"
>"%f0%" %echoA%(!z!
for /f "skip=1 tokens=2 delims=: " %%U in ('fc /b "%f1%" "%f0%"') do (
  (if defined %1 set "%1=!%1! ") & set "%1=!%1!%%U")
goto :eof

:hexW  [out,ref] hex-bytes  --  'f1' = wide string + wide <cr><lf>  ...........
set "%1=" & set "u=" & for %%X in ("%f1%") do set/a len=%%~zX-4
set "z=" & for /l %%N in (1 1 !len!) do set "z=!z!!BS!"
>"%f0%" "%comSpec%" /a/c ^<nul set/p "=!z!" & >>"%f0%" "%comSpec%" /u/c echo(
for /f "skip=1 tokens=2 delims=: " %%U in ('fc /b "%f1%" "%f0%"') do (
  if not defined u (set "u=%%U") else (
    (if defined %1 set "%1=!%1! ") & set "%1=!%1!%%U!u!" & set "u="))
goto :eof

:help .........................................................................
echo(
@rem dump :: comment lines at the top of the file, skip ::: lines
for /f "usebackq delims=" %%a in ("%~f0") do (
  set "z=%%~a" & setlocal enableDelayedExpansion
  if not "!z:~0,1!"==":" endlocal & goto :eof
  (if not "!z:~2,1!"==":" echo !z!) & endlocal
)
exit /b 0
:errs
call :help
>&2 ((echo() & (echo ** error:  bad syntax '%~0 %*'))
exit /b 1 .....................................................................

And save this as $ascX.cmd:

Code: Select all

:: $ascX  [in,ref]  var   [in,val,opt] cp#1 .. cp#N  ______________ 14.04.19 __
:: $ascX  [in,val] "str"  [in,val,opt] cp#1 .. cp#N
::
::      displays 'string' encoding as utf16, utf-8, and 8-bit codepage(s)
::      with 'string' either passed by reference in 'var'
::                    or passed by value as '"str"' inside quotes
::
:: rem  'cp' lines show '==' if string converts losslessly, '->' otherwise
:: rem  control chars not supported in the input string
:: rem  active codepage is included by default, and displayed first
::
:: 14.04.19  checked ok under xp.sp3, win7x64.sp1
:: ____________________________________________________________________________

@echo off & if "%~1"=="/?" (goto :help) else if "%~1"=="" (goto :errs)
setlocal disableDelayedExpansion

:: save original codepage ('.' for some localized windows e.g. german)
for /f "tokens=2 delims=:." %%a in ('chcp') do set/a "cp=%%~a"

:: first argument is either quoted string value, or name of string variable
(if '%1'=='"%~1"' (set "sZ=%~1" & set "sX=sZ") else (set "sX=%~1")) & shift
set "xW=" & call $ascU %sX% /W xW & set "xU=" & call $ascU %sX% /U xU

set "cx= %~1 %~2 %~3 %~4 %~5 %~6 %~7 %~8 %~9 "
setlocal enableDelayedExpansion
echo '!%sX%!' = [u+] !xW! = [utf-8] !xU!
set "cx=!cx: %cp% = !"
endlocal & set "cx=%cx%"
for %%p in (%cp% %cx%) do (
  chcp %%p >nul
  set "xA=" & set "sA=" & call $ascU %sX% /A xA sA
  setlocal enableDelayedExpansion
  set "p=cp %%~p   " & set "p=!p:~0,8!"
  if "!sA!"=="!%sX%!" (echo !p! == '!sA!' = !xA!) else (
    set "xW=" & call $ascU sA /W xW
    echo !p! -^> '!sA!' = !xA! = [u+] !xW!
  )
  endlocal
)
chcp %cp% >nul & endlocal & exit /b 0

:help .........................................................................
echo(
@rem dump :: comment lines at the top of the file, skip ::: lines
for /f "usebackq delims=" %%a in ("%~f0") do (
  set "z=%%~a" & setlocal enableDelayedExpansion
  if not "!z:~0,1!"==":" endlocal & goto :eof
  (if not "!z:~2,1!"==":" echo !z!) & endlocal
)
exit /b 0
:errs
call :help
>&2 ((echo() & (echo ** error:  bad syntax '%~0 %*'))
exit /b 1 _____________________________________________________________________

Then enter or paste any string to see its encoding in UTF16, UTF-8 and codepage(s) - note that the '==' also indicates whether a string converts losslessly to a given codepage.

Code: Select all

C:\tmp>$ascX ‹.ß©.€› 850 437 1252 28591
'‹.ß©.€›' = [u+] 2039 002E 00DF 00A9 002E 20AC 203A = [utf-8] E2 80 B9 2E C3 9F C2 A9 2E E2 82 AC E2 80 BA
cp 437   -> '<.ßc.?>' = 3C 2E E1 63 2E 3F 3E = [u+] 003C 002E 00DF 0063 002E 003F 003E
cp 850   -> '<.ß©.?>' = 3C 2E E1 B8 2E 3F 3E = [u+] 003C 002E 00DF 00A9 002E 003F 003E
cp 1252  == '‹.ß©.€›' = 8B 2E DF A9 2E 80 9B
cp 28591 -> '<.ß©.?>' = 3C 2E DF A9 2E 3F 3E = [u+] 003C 002E 00DF 00A9 002E 003F 003E
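
The '==' versus '->' test can be approximated in Python as a round trip through each codepage. This is a rough sketch with hypothetical helper names, and it is not equivalent to what cmd does: Python's 'replace' error handler substitutes '?' for every unmappable character, whereas the Windows conversion shown above also applies best-fit mappings (for example '‹' to '<'), so the hex bytes can differ for lossy codepages.

```python
def utf16_words(text: str) -> str:
    """UTF-16 code units of 'text' as 4-digit hex words."""
    data = text.encode("utf-16-be")
    return " ".join(data[i:i + 2].hex().upper() for i in range(0, len(data), 2))


def codepage_report(text: str, codepage: str):
    """Return ('==' or '->', hex bytes) for 'text' narrowed to 'codepage'."""
    narrowed = text.encode(codepage, errors="replace")
    marker = "==" if narrowed.decode(codepage) == text else "->"
    return marker, " ".join(f"{b:02X}" for b in narrowed)


sample = "\u2039.\u00df\u00a9.\u20ac\u203a"   # the string from the run above
print("[u+]", utf16_words(sample))
for cp in ("cp437", "cp850", "cp1252"):
    print(cp, *codepage_report(sample, cp))
```

For cp1252 the string survives the round trip, so the sketch reports '==' with the same bytes as the $ascX run; for 437 and 850 it reports '->'.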

Last edited by Liviu on 23 May 2014 19:24, edited 2 times in total.

aGerman
Expert
Posts: 3761
Joined: 22 Jan 2010 18:01
Location: Germany

Re: Synthesizing Unicode strings in Windows 7 batch

#2 Post by aGerman » 14 Apr 2014 17:10

Good job Liviu :) I think this will come in handy even if I still don't know the scope.

Regards
aGerman

Liviu
Expert
Posts: 470
Joined: 13 Jan 2012 21:24

Re: Synthesizing Unicode strings in Windows 7 batch

#3 Post by Liviu » 19 Apr 2014 00:43

@aGerman thanks. I know how popular Unicode is in the batch world, of course ;-) but I still think it's an overlooked topic for now, one which will become as commonplace as LFNs as time goes on.

Back to code now, and just to wrap up the original post, it is also possible to synthesize Unicode strings off their UTF16 encoding (instead of UTF-8, as $chrU does). However, it's far less efficient, uses temp files and does a char-by-char conversion, thus not too practical. But FWIW below is a $chrW.cmd that reconstructs a Unicode string from its UTF16 encoding. It has the same prerequisites as $chrU above (Windows 7, Unicode/TT console font, $cpChars.cmd).
EDIT: Thanks @penpen for the followup in the next post below, and the idea of encoding U+ into UTF-8 in batch code directly - which completely eliminates the overhead of char-by-char external calls and temp files. Updated code is at the bottom of this post, the original code in the block right below is obsolete now, and I left it here just for historical interest.

Code: Select all

:: $chrW  [out,ref] str  =  [in,val] u+#1 .. u+#N  ________________ 14.04.17 __
::
::      decodes sequence of (hex) 'u+#' unicode codepoints into variable 'str'
::
:: e.g. $chrW str = 22 3B1 0022  --  sets 'str' to a quoted greek alpha u+03B1
::
:: rem  requires win7 or later, won't work under xp or earlier
:: rem  codepoints <0x20 (controls) and >0xFFFF (outside bmp) not supported
:: rem  the '=' equal sign is only used as a delimiter, any of '= ,;' work too
:: ____________________________________________________________________________

@echo off & if "%~1"=="/?" (goto :help) else if "%~1"=="" (goto :errs)
setlocal & if "!"=="" (set "isEdx=1") else (set "isEdx=")

:: save original codepage ('.' for some localized windows e.g. german)
for /f "tokens=2 delims=:." %%a in ('chcp') do set/a "cp=%%~a"

:: save controls + ascii + extended chars in codepage 28591 to 'iso8859' table
:: where single-byte 0x## encodings match codepoints u+00## (iso/iec 8859-1)
chcp 28591 >nul & call $cpChars iso8859
:: prepare utf16-le u+FEFF bom
set "bom16le=%iso8859:~255,1%%iso8859:~254,1%"

:: build unicode string from list of u+ codepoints ............................
:strW  [out,ref] str  [in,val] u+hex#1 .. u+hex#N
set "tmpW=%temp%\%time::=.%.%random%.tmp"
setlocal enableDelayedExpansion
set "w=" & for /f "tokens=1,* delims== " %%U in ("%*") do (
  for %%W in (%%V) do call :chrW c %%W & set "w=!w!!c!")
del "%tmpW%" 2>nul & chcp %cp% >nul
for /f delims^=^ eol^= %%V in ("!w!") do (
  endlocal & endlocal & set "%~1=%%V"!)
exit /b 0

:: convert u+ codepoint to unicode char .......................................
:chrW  [out,ref] chr  [in,val] u+hex#
chcp 28591 >nul
set/a u=0x%~2, hi=u/256, lo=u%%256

if %hi%==0 (
  @rem pass through single-byte chars, since codepage 28591 matches u+00##
  set "%~1=!iso8859:~%lo%,1!"
  @rem must still escape '^!' if returning to 'enableDelayedExpansion' context
  if defined isEdx for %%Q in (33 94) do if %lo%==%%Q set "%~1=^^!%~1!"
  exit /b 0
)

@rem synthesize 2-byte encoding of target char, save to utf16-le file with bom
@rem can't use 'echo' since it adds 8-bit CR+LF which decodes as phony u+0A0D
>"%tmpW%" <nul cmd /a/c set/p "=%bom16le%!iso8859:~%lo%,1!!iso8859:~%hi%,1!"

@rem convert utf16-le file to utf-8, read wide-char off it under codepage 65001
chcp 65001 >nul
for /f delims^=^ eol^= %%c in ('type "%tmpW%"') do set "%~1=%%~c"
exit /b 0

:help .........................................................................
echo(
@rem dump :: comment lines at the top of the file, skip ::: lines
for /f "usebackq delims=" %%a in ("%~f0") do (
  set "z=%%~a" & setlocal enableDelayedExpansion
  if not "!z:~0,1!"==":" endlocal & exit /b 0
  (if not "!z:~2,1!"==":" echo !z!) & endlocal
)
exit /b 0
:errs
call :help
>&2 ((echo() & (echo ** error:  bad syntax '%~0 %*'))
exit /b 1 _____________________________________________________________________

And the following duplicates the same output as in the original test.

Code: Select all

@echo off & setlocal
for %%U in (
"blend   :  2039     03B1     00DF     00A9     2202     20AC     203A"
"latin   :  00E0     00E1     00E2     0101     0103     0105     01FB"
"greek   :  03B1     03B2     03B3     03B4     03B5     03B6     03B7"
"cyrillic:  0430     0431     0432     0433     0434     0435     0436"
"arrows  :  2190     2191     2192     2193     2194     2195     21A8"
"drawing :  258C     25C4     25B2     25CB     25BC     25BA     2590"
"currency:  00A2     00A3     00A4     00A5     20A3     20A4     20AC"
"math    :  00B1     00D7     2202     2206     220F     2211     2212"
"punct   :  00AB     00A1     00BF     00A9     00AE     00A7     2020"
"misc    :  00BC     215B     00B9     2660     2663     2665     2666"
"ascii   :  3B 25 63 64 25 28 21 63 64 21 5E 22 5E 5E 22 22 21 3C 3F 26 5E"
) do for /f "tokens=1,* delims=:" %%V in (%%U) do call :echoW "%%~V" "%%~W"
endlocal & goto :eof

:echoW
call $chrW u = %~2
setlocal enableDelayedExpansion
echo %~1  !u!
endlocal & goto :eof

Code: Select all

C:\tmp>$chrW.test
blend     ‹αß©∂€›
latin     àáâāăąǻ
greek     αβγδεζη
cyrillic  абвгдеж
arrows    ←↑→↓↔↕↨
drawing   ▌◄▲○▼►▐
currency  ¢£¤¥₣₤€
math      ±×∂∆∏∑−
punct     «¡¿©®§†
misc      ¼⅛¹♠♣♥♦
ascii     ;%cd%(!cd!^"^^""!<?&^
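
For reference, the byte layout that the original $chrW writes to its temp file for codepoints above U+00FF (a UTF-16LE byte order mark followed by the low and high bytes of the codepoint) can be reproduced in Python. A minimal sketch, with a made-up function name, limited like the batch to codepoints in the BMP:

```python
def utf16le_with_bom(codepoint: int) -> bytes:
    """BOM (FF FE) plus one UTF-16LE code unit, low byte first."""
    if not 0x20 <= codepoint <= 0xFFFF or 0xD800 <= codepoint <= 0xDFFF:
        raise ValueError("controls, surrogates and non-BMP codepoints not handled")
    hi, lo = divmod(codepoint, 256)   # same split as 'hi=u/256, lo=u%%256'
    return bytes([0xFF, 0xFE, lo, hi])


data = utf16le_with_bom(0x03B1)
print(" ".join(f"{b:02X}" for b in data))   # prints: FF FE B1 03
# the 'utf-16' codec honors the BOM, so this decodes back to greek alpha
assert data.decode("utf-16") == "\u03b1"
```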

Liviu

P.S. EDIT: This updated $chrW uses a %w2u% macro that borrows heavily from penpen's http://www.dostips.com/forum/viewtopic.php?p=34502#p34502 for much better performance.

Code: Select all

:: $chrW  [out,ref] str  =  [in,val] u+#1 .. u+#N  ________________ 14.05.22 __
::
::      decodes sequence of (hex) 'u+#' unicode codepoints into variable 'str'
::
:: e.g. $chrW str = 22 3B1 0022  --  sets 'str' to a quoted greek alpha u+03B1
::
:: rem  requires win7 or later, won't work under xp or earlier
:: rem  codepoints <0x20 (controls) not supported
:: rem  codepoints >0xFFFF (outside bmp) not displayed correctly in the console
:: rem  the '=' equal sign is only used as a delimiter, any of '= ,;' work too
:: ____________________________________________________________________________

@echo off & if "%~1"=="/?" (goto :help) else if "%~1"=="" (goto :errs)
setlocal & if "!"=="" (set "isEdx=1") else (set "isEdx=")

:: save original codepage ('.' for some localized windows e.g. german)
for /f "tokens=2 delims=:." %%a in ('chcp') do set/a "cp=%%~a"

:: save controls + ascii + extended chars to 'chrA' table
call $cpChars chrA

:: load 'w2u' macro
setlocal disableDelayedExpansion
call :set.w2u

:setW  [out,ref] str  [in,val] u+#1 .. u+#N  ----------------------------------
setlocal enableDelayedExpansion
set "u=" & for /f "tokens=1,* delims== " %%U in ("%*") do for %%W in (%%V) do (
  set "w=%%W" & %w2u% 0x!w:U+=! u0 u1 u2 u3
  for %%U in (!u0! !u1! !u2! !u3!) do set "u=!u!!chrA:~%%U,1!"
)

@rem escape '%^<>|&!' before final for/f-echo-set conversion
if defined isEdx (set "u=!u:^=^^^^^^^^!") else (set "u=!u:^=^^^^!")
set "u=!u:%%=%%chrA:~37,1%%!"
for %%Q in ("<" ">" "|" "&") do set "u=!u:%%~Q=^^%%~Q!"
set ^"u=!u:"=""!^"
if defined isEdx (set "u=%u:!=^^^^^!%"!) else (set "u=%u:!=^^^!%"!)
set ^"u=!u:""=^^"!^"

@rem 2nd next must stay on one line, and 'cp' must be the 'chrA' codepage [1]
chcp 65001 >nul
chcp %cp% >nul & for /f delims^=^ eol^= %%V in ('echo(!u!') do (
  endlocal & endlocal & endlocal & set "%~1=%%V"!)
exit /b 0

:set.w2u ......................................................................
@rem single linefeed char 0x0A (two blank lines required below)
set LF=^


@rem newline macros (linefeed + line continuation)
set ^"\n=^^^%LF%%LF%^%LF%%LF%^^"
@rem u+ ranges for utf-8 1/2/3/4 byte encodings
set/a U8x0=0x000000, U8x1=0x00007F, U8x2=0x0007FF, U8x3=0x00FFFF, U8x4=0x10FFFF
@rem must be loaded under 'disableDelayedExpansion'
for /f %%$ in ("%~1 w2u") do ^
set ^"%%$=for %%# in (1 2) do if %%#==2 (%\n%
  for /f "tokens=1-5 delims=, " %%1 in ("!args!") do (%\n%
    set "b0=" ^& if %%1 gtr %U8x0% if %%1 leq %U8x4% (%\n%
      if %%1 leq %U8x1%        (set/a "b0 = %%1" ^& set "b1=" ^& set "b2=" ^& set "b3="%\n%
      ) else if %%1 leq %U8x2% (set/a "b0 = 0xC0 | (%%1 >> 6)", ^
                                      "b1 = 0x80 | (%%1 & 0x3F)" ^& set "b2=" ^& set "b3="%\n%
      ) else if %%1 leq %U8x3% (set/a "b0 = 0xE0 | (%%1 >> 12)", ^
                                      "b1 = 0x80 | ((%%1 >> 6) & 0x3F)", ^
                                      "b2 = 0x80 | (%%1 & 0x3F)" ^& set "b3="%\n%
      ) else                   (set/a "b0 = 0xF0 | (%%1 >> 18)", ^
                                      "b1 = 0x80 | ((%%1 >> 12) & 0x3F)", ^
                                      "b2 = 0x80 | ((%%1 >> 6) & 0x3F)", ^
                                      "b3 = 0x80 | (%%1 & 0x3F)"))%\n%
    if not defined b0 (%\n%
      endlocal ^& set "%%2=" ^& set "%%3=" ^& set "%%4=" ^& set "%%5="%\n%
    ) else for /f "tokens=1-4" %%6 in ("!b0! !b1! !b2! !b3!") do (%\n%
      endlocal ^& set "%%2=%%6" ^& set "%%3=%%7" ^& set "%%4=%%8" ^& set "%%5=%%9"))%\n%
) else setlocal enableDelayedExpansion ^& set args=,^"
exit/b 0

:help .........................................................................
echo(
@rem dump :: comment lines at the top of the file, skip ::: lines
for /f "usebackq delims=" %%a in ("%~f0") do (
  set "z=%%~a" & setlocal enableDelayedExpansion
  if not "!z:~0,1!"==":" endlocal & exit /b 0
  (if not "!z:~2,1!"==":" echo !z!) & endlocal
)
exit /b 0
:errs
call :help
>&2 ((echo() & (echo ** error:  bad syntax '%~0 %*'))
exit /b 1 .....................................................................

:: [1] '!u!' expands to a double-byte sequence equivalent to the utf-8 encoding
::     then 'echo' narrows it down to single-byte per inner 'cp' codepage   (a)
::     then '%%V' expands it back to double-byte per outer codepage 65001   (b)
::     effectively decoding the utf-8 byte sequence into the unicode string (*)
:: (a) the 'in' command of a 'for/f' loop runs in the codepage which is active
::     at the time the nested 'cmd' executes the command - in this case 'cp'
:: (b) the 'for/f' evaluates the loop variables according to the original
::     codepage in effect at the time the loop was parsed - in this case 65001
:: (*) works in win7, but not xp - because chcp 65001 stops batch parsing in xp
:: ____________________________________________________________________________

Last edited by Liviu on 23 May 2014 01:18, edited 2 times in total.

penpen
Expert
Posts: 1726
Joined: 23 Jun 2013 06:15
Location: Germany

Re: Synthesizing Unicode strings in Windows 7 batch

#4 Post by penpen » 17 May 2014 17:52

This code displays the UTF16 characters given as UTF32 codepoints (only tested on patched cmd.exe under win xp 32 home):

Code: Select all

@echo off
:: example: 'COMMERCIAL AT', 'COPYRIGHT SIGN', 'EURO SIGN', 'MUSICAL SYMBOL G CLEF'
:: testcon U+40 A9 20AC 1D11E

:: example: !  "  %  &  (  )  :  <  >  ^  |
:: testcon 21 22 25 26 28 29 3A 3C 3E 5E 7C

if "%~1" == "initialized" goto :%~2

setlocal enableDelayedExpansion
set "patchedXPComSpec=cmd3"
set "executable=%ComSpec%"
if defined patchedXPComSpec set "executable=%patchedXPComSpec%"

set "codepage=850"
for /F "tokens=2 delims=:." %%c in ('chcp') do set "codepage=%%c"

echo lock> "monitor.lck"

(
   "%executable%" /A /C ""%~f0" "initialized" "MultilingualLatinI" %*"
) | (
   "%executable%" /A /C ""%~f0" "initialized" "UTF8" %*"
)

chcp %codepage% > nul
endlocal
exit /b %ErrorLevel%


:MultilingualLatinI
setlocal enableDelayedExpansion
chcp 850 > nul
rem set /P "table=" < table.dat
call $cpChars table

set /A "MIN_UTF8   = 0x000000"
set /A "MAX_UTF8_1 = 0x00007F"
set /A "MAX_UTF8_2 = 0x0007FF"
set /A "MAX_UTF8_3 = 0x00FFFF"
set /A "MAX_UTF8_4 = 0x10FFFF"


:: remove "initialized" ["MultilingualLatinI"|"UTF8"]
set "codepoints=%* "
set "codepoints=!codepoints:* =! "
set "codepoints=!codepoints:* =! "
set "UTF8String=#"

for %%a in (!codepoints!) do (
   for %%v in (a b c d) do set "%%v="
   set "cp=%%a"
   set "cp=0x!cp:*U+=!" >&2

   if !cp! LSS %MIN_UTF8% (
      echo(Codepoint currently undefined: !cp! >&2
   ) else if %MIN_UTF8% EQU !cp! (
      echo(Codepoint not supported: NUL = U+0 >&2
   ) else if !cp! LEQ %MAX_UTF8_1% (
      set /A "a = cp"
      for /F "tokens=1" %%a in ("!a!") do set "UTF8String=!UTF8String!!table:~%%a,1!"
   ) else if !cp! LEQ %MAX_UTF8_2% (
      set /A "b = 0xC0 | (cp >> 6)", "a = 0x80 | (cp & 0x3F)"
      for /F "tokens=1-2" %%a in ("!a! !b!") do set "UTF8String=!UTF8String!!table:~%%b,1!!table:~%%a,1!"
   ) else if !cp! LEQ %MAX_UTF8_3% (
      set /A "c = 0xE0 | (cp >> 12)", "b = 0x80 | ((cp >> 6) & 0x3F)", "a = 0x80 | (cp & 0x3F)"
      for /F "tokens=1-3" %%a in ("!a! !b! !c!") do set "UTF8String=!UTF8String!!table:~%%c,1!!table:~%%b,1!!table:~%%a,1!"
   ) else if !cp! LEQ %MAX_UTF8_4% (
      set /A "d = 0xF0 | (cp >> 18)", "c = 0x80 | ((cp >> 12) & 0x3F)", "b = 0x80 | ((cp >> 6) & 0x3F)", "a = 0x80 | (cp & 0x3F)"
      for /F "tokens=1-4" %%a in ("!a! !b! !c! !d!") do set "UTF8String=!UTF8String!!table:~%%d,1!!table:~%%c,1!!table:~%%b,1!!table:~%%a,1!"
   ) else (
      echo(Codepoint currently undefined: !cp! >&2
   )

rem   echo( %%a == !cp! == !d!, !c!, !b!, !a! >&2
)

set /p "=!UTF8String!#" < nul
endlocal

del "monitor.lck"
exit /b 0


:UTF8
if exist "monitor.lck" goto :UTF8

setlocal enableDelayedExpansion
chcp 65001 > nul

set "input="
set /p "input="
echo(!input:~1,-1!

endlocal
exit /b 0

Examples:

Code: Select all

Z:\>testcon U+40 A9 20AC 1D11E
@©€??

Z:\>testcon 21 22 25 26 28 29 3A 3C 3E 5E 7C
!"%&():<>^|


Sidenote: The above scripts all work on the patched cmd.exe under Windows XP (32 bit, http://www.dostips.com/forum/viewtopic.php?f=3&t=5588).
You need Liviu's $cpChars.cmd; see the original post.
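
On the '??' in the first example: U+1D11E lies outside the BMP, so in UTF-16 it takes two code units (a surrogate pair), which is presumably why the console shows two placeholders for it. The split can be sketched in Python as below; this is an illustration with a hypothetical function name, not something the batch code above does explicitly.

```python
def utf16_units(codepoint: int) -> list:
    """Split a codepoint into one or two UTF-16 code units."""
    if codepoint <= 0xFFFF:
        return [codepoint]
    v = codepoint - 0x10000
    # high surrogate carries the top 10 bits, low surrogate the bottom 10
    return [0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF)]


units = utf16_units(0x1D11E)   # MUSICAL SYMBOL G CLEF
print(" ".join(f"{u:04X}" for u in units))   # prints: D834 DD1E
```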

penpen

Edits:
- Corrected the LSS bugs (to LEQ), thanks to Liviu for spotting them, and
- removed a bug where the character with the hex value 0x7F was displayed although there was no input ("testcon.bat").
Last edited by penpen on 23 May 2014 04:42, edited 1 time in total.

Liviu
Expert
Posts: 470
Joined: 13 Jan 2012 21:24

Re: Synthesizing Unicode strings in Windows 7 batch

#5 Post by Liviu » 23 May 2014 00:50

penpen wrote: This code displays the UTF16 characters given as UTF32 codepoints
Nicely done! (Just one nitpick: those LSS actually need to be LEQ.)

Complications related to xp and carlos' patched cmd make your code a bit difficult to follow, so here is my take on the U+ to UTF-8 conversion alone - which is purely numerical, without any codepage/xp dependencies or limitations.

Code: Select all

:: $w2u.cmd    in: U+ codepoints          out: UTF-8 encoding thereof
::                 0041 2262 U+391 2E          41,E2 89 A2,CE 91,2E,
::                 D55C AD6D C5B4              ED 95 9C,EA B5 AD,EC 96 B4,
::                 65E5 672C 8A9E              E6 97 A5,E6 9C AC,E8 AA 9E,
::                 233B4                       F0 A3 8E B4
:: - leading 0s optional, U+ prefix allowed but not required for input
:: - output has characters delimited by commas, bytes within by spaces

@echo off & setLocal enableDelayedExpansion

set/a U8x0=0x000000, U8x1=0x00007F, U8x2=0x0007FF, U8x3=0x00FFFF, U8x4=0x10FFFF
set "u=" & for %%W in (%*) do (
  set "w=%%W" & call :w2u 0x!w:U+=! u0 u1 u2 u3
  if defined u0 (
    set "sep=," & for %%U in (!u0! !u1! !u2! !u3!) do (
      call :num2hex %%U hex 2
      if defined u set "u=!u!!sep!"
      set "u=!u!!hex!" & set "sep= "
  )) else (set "u=!u! "))
echo !u!

endlocal & goto :eof

:: based on http://www.dostips.com/forum/viewtopic.php?p=34502#p34502 @penpen
:: errorlevel 0 = ok, 1 = codepoint undefined, or 'U+0' which is not supported
:w2u
if %1 gtr %U8x0% if %1 leq %U8x4% (
  if %1 leq %U8x1%        (set/a "%2 = %1" & set "%3=" & set "%4=" & set "%5="
  ) else if %1 leq %U8x2% (set/a "%2 = 0xC0 | (%1 >> 6)", ^
                                 "%3 = 0x80 | (%1 & 0x3F)" & set "%4=" & set "%5="
  ) else if %1 leq %U8x3% (set/a "%2 = 0xE0 | (%1 >> 12)", ^
                                 "%3 = 0x80 | ((%1 >> 6) & 0x3F)", ^
                                 "%4 = 0x80 | (%1 & 0x3F)" & set "%5="
  ) else                  (set/a "%2 = 0xF0 | (%1 >> 18)", ^
                                 "%3 = 0x80 | ((%1 >> 12) & 0x3F)", ^
                                 "%4 = 0x80 | ((%1 >> 6) & 0x3F)", ^
                                 "%5 = 0x80 | (%1 & 0x3F)")
  exit/b 0
)
set "%2=" & set "%3=" & set "%4=" & set "%5="
exit/b 1

:: hacked off :toHex from http://www.dostips.com/DtTipsArithmetic.php#toHex
:: takes optional extra argument for number of hex digits to return, default 8
:num2hex
if not defined num2hex.0x#map set "num2hex.0x#map=0123456789ABCDEF"
setlocal enableDelayedExpansion
set/a "num = %~1" & set "hex="
for /l %%N in (1, 1, %3 8) do (
  set/a "d = num & 0x0F", "num >>= 4"
  for %%D in (!d!) do set "hex=!num2hex.0x#map:~%%D,1!!hex!")
endlocal & if "%~2" neq "" (set "%~2=%hex%") else echo(%hex% & exit/b

Test run...

Code: Select all

C:\tmp>$w2u.cmd  0041 2262 0391 002E D55C AD6D C5B4 65E5 672C 8A9E 233B4
41,E2 89 A2,CE 91,2E,ED 95 9C,EA B5 AD,EC 96 B4,E6 97 A5,E6 9C AC,E8 AA 9E,F0 A3 8E B4

...which verifies the examples in the UTF-8 RFC at http://www.ietf.org/rfc/rfc3629.txt.
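
The same arithmetic is easy to double-check in other languages. Below is a minimal Python sketch of the :w2u logic (the Python function names are hypothetical, chosen for illustration), compared against Python's built-in encoder for a few of the RFC examples.

```python
def w2u(cp: int) -> list:
    """UTF-8 bytes for codepoint 'cp', using the same ranges and bit
    operations as :w2u; returns [] for U+0 or anything above U+10FFFF."""
    if not 0 < cp <= 0x10FFFF:
        return []
    if cp <= 0x7F:
        return [cp]
    if cp <= 0x7FF:
        return [0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)]
    if cp <= 0xFFFF:
        return [0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)]
    return [0xF0 | (cp >> 18), 0x80 | ((cp >> 12) & 0x3F),
            0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)]


def format_like_w2u(codepoints) -> str:
    """Characters delimited by commas, bytes within by spaces."""
    return ",".join(" ".join(f"{b:02X}" for b in w2u(cp)) for cp in codepoints)


rfc_examples = [0x0041, 0x2262, 0x0391, 0x002E, 0x233B4]
print(format_like_w2u(rfc_examples))
# prints: 41,E2 89 A2,CE 91,2E,F0 A3 8E B4

# agreement with the built-in encoder (surrogates D800-DFFF are left out,
# since Python refuses to encode them while the arithmetic above does not)
for cp in rfc_examples:
    assert bytes(w2u(cp)) == chr(cp).encode("utf-8")
```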

I also wrapped the same :w2u function into a %w2u% macro, and updated my previous $chrW post at http://www.dostips.com/forum/viewtopic.php?p=33810#p33810 above to use it.

Liviu
