UTF-8 codepage 65001 in Windows 7 - part I
Posted: 07 Feb 2014 19:06
This flew under my radar for some time, but it looks like Win7 silently enhanced support for codepage 65001. Significant limitations remain - in particular, redirection and piping still fail under codepage 65001. Nevertheless, the added support opens up some exciting new possibilities.
In XP, the main point is that a for /f loop over 'dir /a /b' run under chcp 65001 returns nothing at all. A secondary point is that in XP there is no safe way (that I am aware of) to enumerate all files, including +h hidden ones. A plain for loop skips hidden files, while a for /f loop under chcp 437 returns the wrong names for characters outside the codepage. This is not just a display artifact: the filenames are in fact wrong, and %~ad comes back empty because the attributes cannot be retrieved given the wrong filename.
The difference in Win7 is that the for /f loop under chcp 65001 does in fact return the expected output - and finally provides a way to list all files safely, including hidden ones and regardless of character sets.
For background, 65001 has long been known as the UTF-8 codepage, but it was officially unsupported and mostly useless due to critical limitations (except for one-off tricks like converting text files to UTF-8 encoding, for example viewtopic.php?p=16399#p16399). Two of those critical limitations - broken parsing, and broken for loops - appear to have been lifted in Win7. This post will discuss the for loop part.
Previously under XP (and, unverified, but probably Vista, too) for loops simply did not work while codepage 65001 was active, either in batch files or even at the cmd prompt. They seem to work correctly in Win7 now, including the necessary conversions between Windows' native UTF-16 and the active UTF-8 codepage. As an example, start a cmd prompt (using Lucida Console, i.e. a non-raster font) in an initially empty C:\tmp directory, then create the following files and set some of them to +s system and/or +h hidden.
Code:
C:\tmp>(copy nul ‹αß©∂€›
More? copy nul ‹αß©∂€›.h
More? copy nul ‹αß©∂€›.s
More? copy nul ‹αß©∂€›.sh
More? attrib +h ‹αß©∂€›.h
More? attrib +s ‹αß©∂€›.s
More? attrib +s +h ‹αß©∂€›.sh)
1 file(s) copied.
1 file(s) copied.
1 file(s) copied.
1 file(s) copied.
C:\tmp>attrib *
A C:\tmp\‹αß©∂€›
A H C:\tmp\‹αß©∂€›.h
A S C:\tmp\‹αß©∂€›.s
A SH C:\tmp\‹αß©∂€›.sh
C:\tmp>
In XP (sp3) the following commands return...
Code:
C:\tmp>ver
Microsoft Windows XP [Version 5.1.2600]
C:\tmp>for %d in (*) do @echo %~ad %d
--a------ ‹αß©∂€›
--a-s---- ‹αß©∂€›.s
C:\tmp>chcp 437
Active code page: 437
C:\tmp>for /f "delims=" %d in ('dir /a /b') do @echo %~ad %d
<αßc??>
<αßc??>.h
<αßc??>.s
<αßc??>.sh
C:\tmp>chcp 65001
Active code page: 65001
C:\tmp>for /f "delims=" %d in ('dir /a /b') do @echo %~ad %d
C:\tmp>
Now, in Win7 (x64.sp1) the same commands return...
Code:
C:\tmp>ver
Microsoft Windows [Version 6.1.7601]
C:\tmp>for %d in (*) do @echo %~ad %d
--a------ ‹αß©∂€›
--a-s---- ‹αß©∂€›.s
C:\tmp>chcp 437
Active code page: 437
C:\tmp>for /f "delims=" %d in ('dir /a /b') do @echo %~ad %d
<αßc??>
<αßc??>.h
<αßc??>.s
<αßc??>.sh
C:\tmp>chcp 65001
Active code page: 65001
C:\tmp>for /f "delims=" %d in ('dir /a /b') do @echo %~ad %d
--a------ ‹αß©∂€›
--ah----- ‹αß©∂€›.h
--a-s---- ‹αß©∂€›.s
--ahs---- ‹αß©∂€›.sh
C:\tmp>
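Put together as a batch file, the Win7 behavior above might be used along these lines. This is only a sketch under a few assumptions: Win7 or later (under XP the loop returns nothing), a chcp message of the form "Active code page: NNN" (localized systems may word it differently), and a made-up variable name oldcp. The restoring chcp is deliberately left unredirected, since redirection under 65001 is one of the things reported as still broken.

```batch
@echo off
setlocal

rem Remember the active codepage so it can be restored afterwards.
rem Assumption: chcp prints "<some text>: NNN", so token 2 after the colon is the number.
set "oldcp="
for /f "tokens=2 delims=:" %%c in ('chcp') do set "oldcp=%%c"
if not defined oldcp (
    echo Could not determine the active code page.
    exit /b 1
)

rem Strip the leading space, and the trailing dot that some locales print.
set "oldcp=%oldcp: =%"
set "oldcp=%oldcp:.=%"

rem Switch to UTF-8. The redirection is still set up under the old codepage.
chcp 65001 >nul

rem List every file, hidden and system ones included, with its attributes.
for /f "delims=" %%d in ('dir /a /b') do echo %%~ad %%d

rem Restore the original codepage. No redirection here on purpose,
rem so this prints the usual "Active code page" confirmation line.
chcp %oldcp%

endlocal
```

Saved under some name of your choice and run from C:\tmp, this should list the same four files as the last Win7 loop above, provided the console uses a non-raster font.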
As noted, support is still far from complete. For example, Win7 still fails if the last for loop runs a pipe under chcp 65001...
Code:
C:\tmp>for /f "delims=" %d in ('dir /a /b ^| more') do @echo %~ad %d
Not enough memory.
C:\tmp>
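Until pipes work, one possible way around this particular case is to keep the pipe out of the for /f command altogether, and do any sorting or filtering with dir's own switches and in the loop body instead. A sketch, meant to be typed at the prompt with chcp 65001 still active; the /on sort order and the .sh extension filter are just illustrative choices, not something covered by the tests above.

```batch
rem Sort by name using dir's own /on switch, and filter by extension
rem inside the loop body, so no pipe is needed under chcp 65001.
for /f "delims=" %d in ('dir /a /b /on') do @if /i "%~xd"==".sh" echo %~ad %d
```

This obviously does not replace pipes in general, it only avoids them where dir and the loop body can do the same work.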
Liviu