setting variable to value from utf-encoded file

#1 Post by Liviu » 06 Feb 2012 21:18

I've looked up prior art on this topic but haven't found much, except hints that it's not possible. Just thought I'd ask in case I missed something somewhere...

For a quick recap, cmd input is fully unicode, and both 'set' and 'set /p' will take a unicode string and assign it correctly, regardless of the active codepage. Once set, a variable can be safely copied with another 'set', again regardless of the chcp in effect. ( Side note, I am aware that "unicode string" is the wrong term technically, and only using it here as a shortcut for "text containing characters outside the individual codepages used during the exercise". )

The open question, however, seems to be how to get such a unicode string into a variable to begin with - other than typing it in, or pasting it interactively at the prompt. ( Another side note, the question is about an arbitrary user defined string, as opposed to an existing file/directory name which can be retrieved with the appropriate 'for' loop. )

Assuming the respective text exists in an external file (say, UTF-8 or UTF-16), Jeb's trick can read it correctly (last post at http://www.dostips.com/forum/viewtopic.php?f=3&t=1462&start=0) but for redirection purposes, only. As noted in his post, simple echo fails, and incidentally so does any attempt to 'set' it to a variable.

I have tried a number of variations on the theme, with '<file set /p', 'type | set /p' and combinations of chcp and 'cmd /u' but haven't hit the right note yet. Any further pointers welcome.

Liviu

#2 Post by jeb » 07 Feb 2012 01:49

Hi Liviu,

You can copy any content from one variable to another with delayed expansion; this is always safe.

You can get any content (except the <NUL> character) from a file with a FOR /F loop or the SET /P technique (which has some quirks with control characters at the end of a line).
You can echo any content with echo( and delayed expansion.

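Code:

setlocal DisableDelayedExpansion
for /f "delims=" %%a in (myFile.txt) do (
  rem delayed expansion is off here, so exclamation marks in the line survive
  set "line=%%a"
  setlocal EnableDelayedExpansion
  rem copying through delayed expansion is safe for any content
  set "newVar=!line!"
  (echo(!newVar!)
  endlocal
)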


This sample doesn't handle empty lines or lines beginning with ";" (the default EOL character), but both are easy to fix.

jeb

#3 Post by Liviu » 07 Feb 2012 11:21

jeb wrote: You can get any content (without the <NUL> character) from a file with a FOR/F loop or the SET/p technique

Thanks, Jeb, but I don't think either works when the file is UTF-8 or UTF-16, which is the difficult point here.

Since my phrasing of the question was a bit elliptical, here it is in more detail. I can set a variable to unicode text interactively with no problems, and once set I can echo, copy, and use it fine:

C:\tmp>chcp & set "ucs2=‹αß©∂€›" & set ucs2
Active code page: 437
ucs2=‹αß©∂€›

C:\tmp>set "ucs2=" & set /p "ucs2=" & set ucs2
‹αß©∂€›
ucs2=‹αß©∂€›

C:\tmp>echo %ucs2%
‹αß©∂€›

I can save the contents of the variable to a UTF-8 file...

C:\tmp>chcp 65001
Active code page: 65001

C:\tmp>echo %ucs2%>utf8.txt

C:\tmp>chcp 437

...or a UTF-16-LE (no BOM) file...

C:\tmp>cmd /u /c echo %ucs2%>utf16le.txt

...or a UTF-16-LE (with BOM) file.

C:\tmp>chcp 1252
Active code page: 1252

C:\tmp>(set /p =ÿþ) <nul >utf16.txt 2>nul

C:\tmp>cmd /u /c echo %ucs2%>>utf16.txt

C:\tmp>chcp 437

The byte-by-byte binary contents of the files are...

Code:

utf8.txt:

00000000  E2 80 B9 CE  B1 C3 9F C2  A9 E2 88 82  E2 82 AC E2
00000010  80 BA 0D 0A

utf16le.txt:

00000000  39 20 B1 03  DF 00 A9 00  02 22 AC 20  3A 20 0D 00
00000010  0A 00

...and utf16.txt is the same as utf16le.txt, except that it has the two BOM bytes 0xFF 0xFE at the beginning.
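
For anyone who wants to cross-check those dumps outside of cmd, here is a minimal Python sketch. Python is not part of the original exercise, and the helper names `file_bytes` and `hexdump` are made up for illustration. It encodes the same test string the way each of the three files was produced and prints the bytes in hex, assuming 'echo' ends the line with CR LF as the dumps show.

```python
# Cross-check of the hex dumps above. Python is used only as an independent
# reference; file_bytes() and hexdump() are hypothetical helpers.

TEXT = "\u2039\u03b1\u00df\u00a9\u2202\u20ac\u203a"  # the test string ‹αß©∂€›
CRLF = "\r\n"  # 'echo' terminates the line with CR LF


def file_bytes(kind: str) -> bytes:
    """Return the expected contents of utf8.txt, utf16le.txt or utf16.txt."""
    line = TEXT + CRLF
    if kind == "utf8":
        return line.encode("utf-8")
    if kind == "utf16le":
        return line.encode("utf-16-le")  # no BOM
    if kind == "utf16":
        # 'ÿþ' written under codepage 1252 is the two bytes FF FE, i.e. the BOM
        bom = "\u00ff\u00fe".encode("cp1252")
        return bom + line.encode("utf-16-le")
    raise ValueError(f"unknown kind: {kind}")


def hexdump(data: bytes) -> str:
    """Format bytes as space-separated uppercase hex pairs."""
    return " ".join(f"{b:02X}" for b in data)


if __name__ == "__main__":
    for kind in ("utf8", "utf16le", "utf16"):
        print(f"{kind}: {hexdump(file_bytes(kind))}")
```

The cp1252 step is the reason the BOM trick needs chcp 1252 first: only in that codepage do the characters ÿ and þ map to the single bytes FF and FE.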

My question is: if the ucs2 variable did not exist, but I had one of those 3 text files (with the utf-encoded text), how could ucs2 be re-created from the file?

The technique you showed in the old post works for just _copying_ the UTF-8 file, but not setting a variable from it (well, you could for example 'set ucs2new' in the 'for' loop, but the resulting variable won't match the original i.e. "%ucs2%"=="%ucs2new%" fails).

Liviu

#4 Post by Liviu » 16 Feb 2013 00:54

Just for the record, here is a tentative answer to the title question. The following demonstrates a way to convert hardcoded or read-from-file UTF-8 strings to UTF-16 and store them in a regular, usable variable. On one hand, the code is not pretty and the conversion is painfully slow. On the other hand, it does actually work (tried under xp.sp3 and win7.sp1), and only uses reg.exe and wmic.exe, which are builtins as of xp+. As far as I can tell, it's not been attempted this way before. Maybe this inspires someone to come up with a neater, faster, pure-batch solution.

Basic idea was fairly straightforward:

1. Get the string somehow merged into the registry under HKCU\Environment.

2. Pick up the newly registered environment variable from the registry, use it happily ever after ;-)

Difficulties along the way:

1.a. The UTF-8 string can be saved as either UTF-8 or UTF-16LE to an external file using known tricks, previously discussed. But the natural choice for registry manipulation "setx.exe -f" doesn't seem to do a proper codepage translation from UTF-8, nor take a UTF-16LE input file. Workaround was to manually build a UTF-16LE .reg file, then use reg.exe to merge it into the registry.

2.a. Once the new variable is added to the registry, Windows needs to be notified before it acknowledges it (http://support.microsoft.com/kb/104011 - How to propagate environment variables to the system). The batch code itself cannot send the expected WM_SETTINGCHANGE message. One would hope that setx.exe did that after effecting changes, but it doesn't appear to. Turns out that wmic.exe does it after environment changes, however.

2.b. Even once Windows is notified, the environment changes are only visible to future processes, since each current one maintains its own copy, initialized at the time it was started. So a new process is needed to pick up the changes. Unfortunately, any 'cmd' launched from the active console runs as a child process, and inherits the environment of its parent (either the current one, or the original one with 'start /i'), i.e. it is oblivious to system level environment changes. One way to start a new 'cmd' process not-as-a-child is to use 'wmic process call create'.

2.c. Once the new process is started and sees the just-added environment variable, the issue remains that it has no direct way to return it to the caller. The workaround here is to have it create a temporary file whose name is the variable's value; that file name can then be read back in the original batch. Since wmic starts the secondary 'cmd' asynchronously, the caller needs to wait until the callee completes.
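
To make step 1.a concrete, this is roughly what the finished .reg file has to contain, sketched in Python rather than batch purely so the bytes are easy to inspect. The variable name `demo_var` and the helper `build_reg_bytes` are hypothetical, the backslash and quote escaping is an addition that the batch code does not attempt, and the sketch only builds the bytes in memory; it does not touch the registry.

```python
# Sketch of the UTF-16LE .reg payload described in step 1.a.
# build_reg_bytes() and "demo_var" are illustrative names, not from set-utf8.cmd.

HKCU_ENV = r"HKEY_CURRENT_USER\Environment"


def build_reg_bytes(name: str, utf8_value: bytes) -> bytes:
    """Decode a UTF-8 value and wrap it in a version 5.00 .reg file image."""
    value = utf8_value.decode("utf-8")
    # .reg syntax escapes backslashes and double quotes inside string data
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        f"[{HKCU_ENV}]",
        f'"{name}"="{escaped}"',
        "",
    ]
    text = "\r\n".join(lines)
    # FF FE byte order mark followed by the text as UTF-16LE
    return b"\xff\xfe" + text.encode("utf-16-le")


if __name__ == "__main__":
    utf8_value = bytes.fromhex("E280B9CEB1C39FC2A9E28882E282ACE280BA")
    image = build_reg_bytes("demo_var", utf8_value)
    print(image[:2].hex(), len(image), "bytes")
```

The batch version gets to the same layout in three pieces: the BOM via 'set /p' under chcp 1252, the fixed header via 'cmd /u echo', and the value line via 'cmd /u type' under chcp 65001.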

That said, the sample set-utf8.cmd code is copied below.

Code:

:: set-utf8.cmd - convert utf-8 to utf-16 and store in an(other) variable
::
:: syntax:  set-utf8  [out,ref] string-var,  [in,ref] utf-8-string-var
::
:: - expected to fail on 'poison' (&%!) and illegal <:"\/|> path characters
::   which is fixable, but not relevant to the main point of this exercise
::
:: - otherwise checked ok under xp.sp3, win7.sp1.x64

@echo off & setLocal enableExtensions disableDelayedExpansion

if "%~2"=="" ( echo.
  @rem dump :: comment lines at the top of the file
  for /f "usebackq delims=" %%a in ("%~f0") do (
    set "z=%%~a" & setlocal enableDelayedExpansion
    if not "!z:~0,1!"==":" endlocal & goto :eof
    echo !z! & endlocal
  )
  endLocal & goto :eof
)

@rem save original codepage ('.' for some localized windows e.g. german)
for /f "tokens=2 delims=:." %%a in ('chcp') do @set /a "cp=%%~a"

@rem set global variables
set "hkcu.env=HKEY_CURRENT_USER\Environment"
@rem utf-16le bom, hex 'FF FE'  n.b. win7 requires chcp 1252, first
chcp 1252 >nul
set "bom16le=ÿþ"
chcp %cp% >nul

call :set.utf u16 "%~2"
endLocal & set "%~1=%u16%" & goto :eof

:set.utf
setLocal enableDelayedExpansion
set "var==%time::=.%.%random%"
set "tmp8=%temp%\%var%.tmp"
set "reg16=%temp%\%var%.reg"

:: build utf-16le .reg file including bom  n.b. win7 requires chcp 1252, first
chcp 1252 >nul
cmd /d /a /c (set/p "=%bom16le%") <nul >"%reg16%" 2>nul
chcp %cp% >nul
@rem save fixed header
cmd /d /u /c                                     ^
  (echo Windows Registry Editor Version 5.00) ^& ^
  (echo.)                                     ^& ^
  (echo [%hkcu.env%]) >>"%reg16%"
@rem save variable, separate echo> + cmd /u type>> required for utf-8 conversion
echo "%var%"="!%~2!" >"%tmp8%"
chcp 65001>nul & cmd /u /c type "%tmp8%" >>"%reg16%" & chcp %cp%>nul
del "%tmp8%"

:: set variable in user's environment
@rem n.b. win7 sends 'operation completed successfully' to &2, therefore 2>&1
reg import "%reg16%" >nul 2>&1

:: force an environment refresh for the next cmd to pick up the new variable
@rem create another dummy variable since under xp at least
@rem - setx doesn't broadcast the necessary wm_settingchange, and anyway
@rem   it only comes with the resource kit, not in the default install
@rem - wmic 'environment create' does broadcast the wm_settingchange, but
@rem   sometimes hangs at exit waiting for input, therefore the <nul
wmic environment create name="%var% ",variablevalue=" ",username="%username%" <nul >nul 2>&1

:: run an external (not child) cmd to create a temp file with the utf-16 name
md "%temp%\!var!"
wmic process call create '%comspec% /v /c copy nul "%temp%\!var!\^!%var%^!.tmp"' <nul >nul 2>&1

:: wait until the external cmd completes
set "u16="
:loop
for %%u in ("%temp%\!var!\*.tmp") do set "u16=%%~nu"
if not defined u16 goto :loop

:: cleanup
rd /s /q "%temp%\!var!"
reg delete "%hkcu.env%" /v "!var!" /f >nul 2>&1
@rem this removes the other dummy variable, also forces an environment refresh
wmic environment where(name="!var! ") delete <nul >nul 2>&1
del "%reg16%"

endLocal & set "%~1=%u16%" & goto :eof

A test case, set-utf8-test.cmd, is copied below. It assumes the same utf8.txt file from the previous post.

Code:

@echo off & setLocal disableDelayedExpansion & echo.

:: example of reading utf-8 from external file
@rem binary contents of 'utf8.txt' must be
@rem E2 80 B9 CE B1 C3 9F C2 A9 E2 88 82 E2 82 AC E2 80 BA 0D 0A
for /f %%s in (utf8.txt) do set "ucs2.utf8=%%s"
call set-utf8 "ucs2" "ucs2.utf8"
setLocal enableDelayedExpansion
echo "!ucs2.utf8!" [utf-8] = "!ucs2!" [utf-16]
endLocal

:: example of hardcoding utf-8 in batch itself
@rem binary contents of string below in the .cmd file must be
@rem E2 80 B9 CE B1 C3 9F C2 A9 E2 88 82 E2 82 AC E2 80 BA
set "ucs2.utf8=‹αß©∂€›"
call set-utf8 "ucs2" "ucs2.utf8"
setLocal enableDelayedExpansion
echo "!ucs2.utf8!" [utf-8] = "!ucs2!" [utf-16]
endLocal

endLocal & goto :eof

which outputs:

Code:

C:\tmp>set-utf8-test

"ΓÇ╣╬▒├ƒ┬⌐ΓêéΓé¼ΓÇ║" [utf-8] = "‹αß©∂€›" [utf-16]
"ΓÇ╣╬▒├ƒ┬⌐ΓêéΓé¼ΓÇ║" [utf-8] = "‹αß©∂€›" [utf-16]

C:\tmp>
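
The left-hand column is not random garbage: it is what the raw UTF-8 bytes look like when each byte is displayed as a codepage 437 character. A small Python sketch (again only a cross-check outside of batch, and assuming the console was at chcp 437 as in the earlier posts) derives both columns from the same 18 bytes.

```python
# Derive both columns of the test output from the same 18 bytes.
RAW = bytes.fromhex("E2 80 B9 CE B1 C3 9F C2 A9 E2 88 82 E2 82 AC E2 80 BA")

as_cp437 = RAW.decode("cp437")  # each byte taken on its own as an OEM 437 character
as_utf8 = RAW.decode("utf-8")   # bytes interpreted as the UTF-8 they really are

if __name__ == "__main__":
    # Needs a terminal able to display both box-drawing and Greek characters
    print(f'"{as_cp437}" [utf-8] = "{as_utf8}" [utf-16]')
```

This is also why a plain 'set' inside a 'for /f' loop does not reproduce the original variable: the bytes arrive intact, but cmd converts them one at a time through the active codepage instead of decoding them as UTF-8.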

Liviu
