
UTF-8 codepage 65001 in Windows 7 - part II

Posted: 07 Feb 2014 19:15
by Liviu
Continuation of the previous post viewtopic.php?p=32355#p32355... Parsing seems to work under codepage 65001 in Win7. Among other intriguing uses, this now allows arbitrary Unicode strings to be hardcoded in the batch file itself, and also to be read from external UTF-8 text files - which was not possible under XP (see for example viewtopic.php?f=3&t=2895&start=0). The sample code below shows inline assignments, set/p, and for/f loops reading from a file and from command output...

Code:

:: ‹αß©∂€›
::
:: first line in hex should be
:: 3A 3A 20 E2  80 B9 CE B1  C3 9F C2 A9  E2 88 82 E2
:: 82 AC E2 80  BA 0D 0A

@echo off & setlocal disableDelayedExpansion
for /f "tokens=2 delims=:" %%a in ('chcp') do @set /a "cp=%%~a"

@rem works in win7, fails in xp
chcp 65001 >nul
call :test
chcp %cp% >nul
call :dump

@rem works in win7, fails in xp
(chcp 65001 >nul) & call :test & (chcp %cp% >nul)
call :dump

@rem ...but this doesn't work !?
set "x="
(chcp 65001 >nul) & (<"%~f0" set /p "x=") & (chcp %cp% >nul)
setlocal enableDelayedExpansion
echo( & echo !x:~3! & endlocal

endlocal & goto :eof

:test --------------------------------
@rem inline assignment
set "a="
set "a=‹αß©∂€›"

@rem set/p from file
set "b="
<"%~f0" set /p "b="
set "b=%b:~3%"

@rem for/f from file
set "c="
for /f "usebackq delims=" %%c in ("%~f0") do (
  set "c=%%~c"
  goto :c
)
:c
set "c=%c:~3%"

@rem for/f from command output
@rem ...but 'more ^<"%~f0"' and 'type "%~f0" ^| more'
@rem both fail with 'not enough memory' !?
set "d="
for /f "delims=" %%d in ('type "%~f0"') do (
  set "d=%%~d"
  goto :d
)
:d
set "d=%d:~3%"
goto :eof

:dump  --------------------------------
echo(
setlocal enableDelayedExpansion
echo !a!
echo !b!
echo !c!
echo !d!
endlocal & goto :eof
It's probably easiest to copy/paste the code above into a Unicode-enabled editor which has a Save-As-UTF-8-No-BOM option, and then save it as a batch file, for example "w7-utf8.cmd". Or, if using just Windows 7's own Notepad (which always adds a BOM to UTF-8 files when saving), copy/paste it into Notepad and Save-As with Encoding set to Unicode as "C:\tmp\w7-utf8.txt". Then, at the cmd prompt, do the following, which will convert it to the expected UTF-8-No-BOM encoding.

Code:

C:\tmp>chcp 65001
Active code page: 65001

C:\tmp>type w7-utf8.txt >w7-utf8.cmd

C:\tmp>
Regardless of how you save it, make sure the initial byte sequence matches the one in the top comment when the file is viewed in hex.
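
For those without a hex editor at hand, the leading bytes can also be dumped from the cmd prompt. What follows is only a sketch, not part of the original test: it assumes certutil with its -encodehex verb is available on the machine, and the scratch file name w7-utf8.hex is made up for illustration. It prints the first two lines of the dump, which cover the first 32 bytes of the file.

```batch
@echo off & setlocal disableDelayedExpansion
rem file to inspect; defaults to the sample name used in this post
set "src=%~1"
if not defined src set "src=w7-utf8.cmd"
rem scratch file for the dump (hypothetical name)
set "hex=%temp%\w7-utf8.hex"

rem assumption: certutil -encodehex exists here; -f overwrites an old dump
certutil -f -encodehex "%src%" "%hex%" >nul || (
  echo certutil could not dump "%src%"
  exit /b 1
)

rem read the first two lines of the dump (16 bytes per line)
set "l1=" & set "l2="
<"%hex%" (
  set /p "l1="
  set /p "l2="
)
del "%hex%"

rem delayed expansion keeps any special characters in the dump harmless
setlocal enableDelayedExpansion
echo(!l1!
echo(!l2!
endlocal
endlocal & goto :eof
```

The byte values shown should match the ones in the top comment of the sample (the letter case of the hex digits may differ). If the file was saved with a BOM, the dump starts with EF BB BF ahead of the 3A 3A.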

Running the batch file under XP (sp3) fails at the first 'chcp 65001' line, and outputs nothing at all. In Win7 (x64.sp1) however, it yields...

Code:

C:\tmp>ver

Microsoft Windows [Version 6.1.7601]

C:\tmp>chcp
Active code page: 437

C:\tmp>w7-utf8

‹αß©∂€›
‹αß©∂€›
‹αß©∂€›
‹αß©∂€›

‹αß©∂€›
‹αß©∂€›
‹αß©∂€›
‹αß©∂€›

‹αß©∂€›

C:\tmp>

Some notes...

This does not mean, nor is it meant to imply, that Win7 now runs UTF-8 batch files natively - it does not. The parts of the sample code which contain characters outside the default codepage (in my case 437) are only ever accessed while an explicit chcp 65001 is in effect. The reason this works is that UTF-8 is a byte-oriented encoding, its first 128 codepoints are encoded exactly as in ASCII, and its multi-byte sequences never use byte values 0-127. In particular, line breaks are the same in UTF-8 and ASCII. If a string contains no control characters, its UTF-8 encoding will contain no control characters, either. If a string does not contain quotes, neither will its UTF-8 encoding. In other words, the batch parser sees the .cmd file as a regular text file, with maybe some "odd" - but harmless - characters in the 128-255 range here and there. When, and only when, the codepage is explicitly switched to 65001, those "odd" characters are interpreted as UTF-8.

The 65001 codepage should be used sparingly, just for initializing the variables or reading the necessary data, then reverted to the default codepage. External programs may misbehave if launched under the 65001 codepage. Besides...

Redirection/pipes are still broken under chcp 65001, and the codepage associated with the parser and input/output streams seems to still be decided in advance for multi-line/parenthesized blocks, rather than for individual commands later at runtime - see the "...but !?" comments in the code.

The for/f loops in the above code read the first line off the "%~f0" batch file itself. That's just so that the sample is self contained, without requiring an auxiliary data file. In real life, the same code could work with an external UTF-8 encoded file, and would not need to stop at the first line, of course.
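
As a rough illustration of that last point, here is how reading every line of an external file might look. This is an untested sketch along the lines of the sample above, not something taken from it: the data file name items.txt, the :load label and the line[n] variable names are all made up for the example, and the file is assumed to be saved as UTF-8 without BOM. As in the sample, the codepage is switched on its own line, and the reading is done in a called subroutine.

```batch
@echo off & setlocal disableDelayedExpansion
rem hypothetical data file: UTF-8 without BOM, one item per line
set "datafile=%~dp0items.txt"

rem remember the current codepage, as in the sample above
for /f "tokens=2 delims=:" %%a in ('chcp') do set /a "cp=%%~a"

rem switch to 65001 only while the file is being read
chcp 65001 >nul
call :load
chcp %cp% >nul

rem print what was read, back under the original codepage
setlocal enableDelayedExpansion
for /l %%i in (1,1,%n%) do echo(!line[%%i]!
endlocal
endlocal & goto :eof

:load --------------------------------
set "n=0"
for /f "usebackq delims=" %%L in ("%datafile%") do (
  set /a "n+=1"
  rem delayed expansion is only used to fetch the counter;
  rem the line itself is stored after endlocal, so any ! in it survives
  setlocal enableDelayedExpansion
  for %%n in (!n!) do endlocal & set "line[%%n]=%%L"
)
goto :eof
```

Note that for/f skips empty lines, and with the default eol setting also lines starting with a semicolon, so the count in n is not necessarily the number of lines in the file.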

Liviu