UTF-8 codepage 65001 in Windows 7 - part I

Liviu · #1 Post by **Liviu** » 07 Feb 2014 19:06

This flew under my radar for some time now, but it looks like Win7 silently enhanced support for codepage 65001. Significant limitations do remain - in particular redirection and piping still fail under codepage 65001. Nevertheless, the added support opens up some new exciting possibilities.

For background, 65001 has been long known as the UTF-8 codepage, but officially unsupported and mostly useless due to critical limitations (except for one-off tricks like converting text files to UTF-8 encoding, for example viewtopic.php?p=16399#p16399). Two of those critical limitations - broken parsing, and broken for loops - appear to have been lifted in Win7. This post will discuss the for loop part.

Previously under XP (and, unverified, but probably Vista, too) for loops simply did not work while codepage 65001 was active, neither in batch nor even at the cmd prompt. They seem to work correctly in Win7 now, including the necessary conversions between Windows' native UTF-16 and the active UTF-8 codepage. As an example, start a cmd prompt (using Lucida Console i.e. a non-raster font) at an initially empty C:\tmp directory, then create the following files and set some to +s system and/or +h hidden.

Code: Select all

C:\tmp>(copy nul ‹αß©∂€›
More? copy nul ‹αß©∂€›.h
More? copy nul ‹αß©∂€›.s
More? copy nul ‹αß©∂€›.sh
More? attrib +h ‹αß©∂€›.h
More? attrib +s ‹αß©∂€›.s
More? attrib +s +h ‹αß©∂€›.sh)
        1 file(s) copied.
        1 file(s) copied.
        1 file(s) copied.
        1 file(s) copied.

C:\tmp>attrib *
A            C:\tmp\‹αß©∂€›
A   H        C:\tmp\‹αß©∂€›.h
A  S         C:\tmp\‹αß©∂€›.s
A  SH        C:\tmp\‹αß©∂€›.sh

C:\tmp>

In XP (sp3) the following commands return...

Code: Select all

C:\tmp>ver

Microsoft Windows XP [Version 5.1.2600]

C:\tmp>for %d in (*) do @echo %~ad  %d
--a------  ‹αß©∂€›
--a-s----  ‹αß©∂€›.s

C:\tmp>chcp 437
Active code page: 437

C:\tmp>for /f "delims=" %d in ('dir /a /b') do @echo %~ad  %d
  <αßc??>
  <αßc??>.h
  <αßc??>.s
  <αßc??>.sh

C:\tmp>chcp 65001
Active code page: 65001

C:\tmp>for /f "delims=" %d in ('dir /a /b') do @echo %~ad  %d

C:\tmp>

The main point is that the last "for" loop, which runs under chcp 65001, returns nothing at all. A secondary point is that in XP there is no safe way (that I am aware of) to enumerate all files including +h hidden ones. The first (plain) for loop skips hidden files. The second for loop under chcp 437 returns the wrong names for characters outside the codepage. It should be clear that this is not just a display artifact and that the filenames are in fact wrong: %~ad is empty since it can't retrieve attributes given the wrong filename.
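The lossy conversion behind the wrong names can be reproduced outside cmd. The following is a minimal Python sketch, not something from the original test: it pushes the same filename through codepage 437 and through UTF-8 and checks whether it survives. The helper name `round_trip` is hypothetical, and Python's `replace` error handler substitutes "?" for every unencodable character, whereas the Windows conversion in the transcript above also applied best-fit substitutions (giving "<" and "c").

```python
# Illustrative sketch (Python, not cmd): a name containing characters outside
# codepage 437 cannot survive a trip through that codepage.
name = "‹αß©∂€›"  # the test filename used in this post


def round_trip(text, codec):
    """Encode to the given codec, replacing unencodable characters, then decode back."""
    return text.encode(codec, errors="replace").decode(codec)


via_437 = round_trip(name, "cp437")   # only α and ß exist in codepage 437
via_utf8 = round_trip(name, "utf-8")  # UTF-8 can represent every character

print("cp437 round trip preserved the name:", via_437 == name)
print("utf-8 round trip preserved the name:", via_utf8 == name)
```

A name that does not survive the round trip no longer matches any file on disk, which is consistent with %~ad coming back empty.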

Now, in Win7 (x64.sp1) the same commands return...

Code: Select all

C:\tmp>ver

Microsoft Windows [Version 6.1.7601]

C:\tmp>for %d in (*) do @echo %~ad  %d
--a------  ‹αß©∂€›
--a-s----  ‹αß©∂€›.s

C:\tmp>chcp 437
Active code page: 437

C:\tmp>for /f "delims=" %d in ('dir /a /b') do @echo %~ad  %d
  <αßc??>
  <αßc??>.h
  <αßc??>.s
  <αßc??>.sh

C:\tmp>chcp 65001
Active code page: 65001

C:\tmp>for /f "delims=" %d in ('dir /a /b') do @echo %~ad  %d
--a------  ‹αß©∂€›
--ah-----  ‹αß©∂€›.h
--a-s----  ‹αß©∂€›.s
--ahs----  ‹αß©∂€›.sh

C:\tmp>

The difference in Win7 is that the last "for" loop does in fact return the expected output - and finally provides a way to list all files safely, including hidden ones and regardless of character sets.

As noted, support is still far from complete. For one example, Win7 still fails if the last for loop runs a pipe under chcp 65001...

Code: Select all

C:\tmp>for /f "delims=" %d in ('dir /a /b ^| more') do @echo %~ad  %d
Not enough memory.

C:\tmp>

Liviu

#2 Post by **penpen** » 08 Feb 2014 17:22

Liviu wrote:(...) , 65001 has been long known as the UTF-8 codepage, (...)

I hope 65001 is never known as the UTF-8 codepage, as UTF-8 is a character set and no codepage.

In short:
A codepage is a mapping from codepoints to glyphs.
A character set is a collection of character encodings: This could be done in various ways.
For instance a mapping between (partial) byte+ space and Unicode codepoint space is a character set.
An example of such a character set is UTF-8:
- ... ,
- (byte tuple) c2 a1 <-> 00A1 (Unicode codepoint; represents: INVERTED EXCLAMATION MARK),
- ...

Codepages may be bound to specific character encodings.
So using a specific codepage may indicate the usage of a specific character set.
But no codepage will be a specific character set.

penpen

carlos · #3 Post by **carlos** » 08 Feb 2014 18:28

@penpen: the header file winnls.h for C programming contains these macros:

#define CP_UTF7 65000
#define CP_UTF8 65001

Liviu · #4 Post by **Liviu** » 08 Feb 2014 18:31

penpen wrote:I hope 65001 is never known as the UTF-8 codepage, as UTF-8 is a character set and no codepage.

I take it you dispute the terminology, and not the substance of the post itself. But I think you are confusing things more than they need be.

My terminology is consistent with the common usage in Windows:
- INVERTED EXCLAMATION MARK is the character;
- U+00A1 is its Unicode codepoint;
- a codepage is essentially a mapping between codepoints and binary representations;
- UTF-8 is one such possible encoding using a multi-byte representation for each Unicode codepoint;
- UTF-8 is technically a MBCS (multi byte character set) codepage as far as Windows is concerned - you can lookup its identifier CP_UTF8 (where CP stands for codepage) and/or check the MSDN page for the WideCharToMultiByte API (http://msdn.microsoft.com/en-us/library ... 30(v=vs.85).aspx);
- back to your example, codepoint U+00A1 can be represented as encoding 0x00A1 in UTF-16, or 0xA1 in codepage 1252, or 0xAD in codepage 437 (http://msdn.microsoft.com/en-us/goglobal/bb964653), or 0xC2 0xA1 in UTF-8.
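The byte values in the last bullet can be checked with a short Python sketch. The codec names (`utf-16-be`, `cp1252`, `cp437`, `utf-8`) are Python's own identifiers, used here as stand-ins for the Windows encodings discussed; big-endian is chosen for UTF-16 only so that the bytes read in the same order as 0x00A1.

```python
# U+00A1 INVERTED EXCLAMATION MARK under the encodings listed above.
ch = "\u00a1"

# Label -> Python codec name (stand-ins for the Windows codepages).
encodings = {
    "UTF-16 (big-endian)": "utf-16-be",
    "codepage 1252": "cp1252",
    "codepage 437": "cp437",
    "UTF-8": "utf-8",
}

# Same codepoint, four different binary representations.
encoded = {label: ch.encode(codec) for label, codec in encodings.items()}

for label, raw in encoded.items():
    hex_bytes = " ".join(f"{b:02X}" for b in raw)
    print(f"{label:20} {hex_bytes}")
```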

Liviu

#5 Post by **penpen** » 09 Feb 2014 10:43

I assume i should have written what i mean with the word "bound"... .
Sorry for that, i'm using word by word translation of terms i use in german.

When i say "Codepage 65001 is bound to UTF-8",
it should be understood as: "Codepage 65001 performs the same mapping as UTF-8"
But UTF-8 is not a codepage, because the semantics of that function differ.

I hope that explains my post above, and the following answers:

carlos wrote:@penpen: the header file winnls.h for C programming contains these macros:

So these are just two examples of codepages bound to UTF encodings (although UTF-7 is no official part of Unicode).

Liviu wrote:I take it you dispute the terminology, and not the substance of the post itself. But I think you are confusing things more than they need be.

Confirmed.

Liviu wrote:- a codepage is essentially a mapping between codepoints and binary representations;

No, a codepage is a mapping between a codepoint and a glyph (the "image" of the character, realized by a glyph index).
The glyph index seems to be a UTF-16 representation.

Liviu wrote:- UTF-8 is one such possible encoding using a multi-byte representation for each Unicode codepoint;

If you leave "such possible" this is correct (definition: character set).
If you mean codepage 65001 then this is no byte encoding as this is a mapping from codepoints to glyphs.

Liviu wrote:UTF-8 is technically a MBCS (multi byte character set) codepage as far as Windows is concerned - you can lookup its identifier CP_UTF8 (where CP stands for codepage) and/or check the MSDN page for the WideCharToMultiByte API (...);

The codepage 65001 is a MBCS codepage.
WideCharToMultiByte does something like this for each character, so the result is the needed conversion (although semantically different, the result is the same):
converted := inverse_codepage_65001 (codepage_1200 (toConvert))
I assume they didn't want to implement the same function twice.
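The two-step conversion written above can be mimicked in Python as a rough analogy. This is only a sketch and says nothing about how WideCharToMultiByte is actually implemented: the function name `wide_to_multibyte` is hypothetical, Python codec names stand in for codepage numbers, and `utf-16-le` plays the role of codepage 1200 (UTF-16, little-endian).

```python
def wide_to_multibyte(utf16_bytes, target_codec):
    """Decode UTF-16LE bytes to codepoints, then encode them with the target codec."""
    text = utf16_bytes.decode("utf-16-le")  # step 1: UTF-16 code units -> codepoints
    return text.encode(target_codec)        # step 2: codepoints -> target bytes


# U+00A1 as it would sit in a Windows wide string (little-endian byte order).
wide = "\u00a1".encode("utf-16-le")

as_utf8 = wide_to_multibyte(wide, "utf-8")
as_437 = wide_to_multibyte(wide, "cp437")

print(as_utf8.hex(), as_437.hex())
```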

Liviu wrote:- back to your example, codepoint U+00A1 can be represented as encoding 0x00A1 in UTF-16, or 0xA1 in codepage 1252, or 0xAD in codepage 437 (...), or 0xC2 0xA1 in UTF-8.

No, it is not an encoding, it is the codepage mapping: from codepage index to glyph (or better, to glyph index, which is a Unicode codepoint):
U+00A1 == UTF-16BE (0x00A1) == codepage 1252 (0xA1) == codepage_437 (0xAD) == codepage_65001 (0xC2, 0xA1)
The linked codepages demonstrate that nicely.

UTF-8 (0xC2, 0xA1) == U+00A1, but here it is a mapping from a bytestream to a unicode codepoint.
(Technically, i agree, it is the same, but semantically different.)

That Microsoft distinguishes between these two things can be seen here (sadly only hints...):
- in the notes of: http://msdn.microsoft.com/en-us/library/dd317756%28VS.85%29.aspx
- in the examples: http://msdn.microsoft.com/en-us/library/ms524628(v=vs.90).aspx
Although these examples are a little bit boring as codepage and character set are using the same function.
But anyway it should be avoided to mix these up.

Sadly cmd only supports two character encodings (cmd /A, cmd /U), so one cannot play around too much... .

penpen

Edit: character set -> character encoding (last sentence).

Liviu · #6 Post by **Liviu** » 09 Feb 2014 12:00

penpen wrote:
Liviu wrote:- a codepage is essentially a mapping between codepoints and binary representations;
No, a codepage is a mapping between a codepoint and a glyph (the "image" of the character, realized by a glyph index).

Sorry, but you are plain wrong. An encoding, or a codepage, is a mapping between codepoints (abstract characters) and binary representations. Fonts are mappings between codepoints and visual glyphs, which is a separate, unrelated matter.

Microsoft is fairly consistent in its terminology, for example..

"Unicode does the following: (...) Defines multiple encodings of its single character set: UTF-7, UTF-8, UTF-16, and UTF-32." (http://msdn.microsoft.com/en-us/library/windows/desktop/dd374081(v=vs.85).aspx)

"Two encodings of Unicode (UTF-7 and UTF-8) are implemented as code pages. Like other code pages, each page is known by a numeric identifier and can be handled with many of the same Unicode and character set API functions. (...) In addition to SBCS and DBCS code pages, your applications have available the multibyte character set code pages (...). A multibyte character set code page goes beyond two-byte encodings of some characters, however. UTF-7 and UTF-8 use a similar approach to encode Unicode based on a 7-bit and 8-bit bytes, respectively." (http://msdn.microsoft.com/en-us/library/windows/desktop/dd317752(v=vs.85).aspx)

Code: Select all

C:\tmp>chcp 65001
Active code page: 65001

C:\tmp>mode con

Status for device CON:
----------------------
    Lines:          9999
    Columns:        132
    Keyboard rate:  31
    Keyboard delay: 1
    Code page:      65001


C:\tmp>nlsinfo
...
Windows Code Page:            437   (OEM - United States)
Console Code Page:            65001 (UTF-8)
...

Installed Code Pages:
...
          437   (OEM - United States),
          850   (OEM - Multilingual Latin I),
          858   (OEM - Multilingual Latin I + Euro),
          1252  (ANSI - Latin I),
...
          65000 (UTF-7),
          65001 (UTF-8)

C:\tmp>ver

Microsoft Windows [Version 6.1.7601]

As shown above, Windows itself lists 65001 as a codepage whose name is "UTF-8" - and I'll leave it at that, since I have no interest to pursue pointless arguments.

Liviu

#7 Post by **penpen** » 09 Feb 2014 15:09

I think we agree more than i thought... (next two points):

Liviu wrote:"Unicode does the following: (...) Defines multiple encodings of its single character set: UTF-7, UTF-8, UTF-16, and UTF-32." (http://msdn.microsoft.com/en-us/library/windows/desktop/dd374081(v=vs.85).aspx)

Here the UTF-8 (Unicode.org) definition is used, so this is a character set: This is exactly, what i've written, so you can assume i won't doubt that.
Besides, there is another passage in this linked document:

For compatibility with 8-bit and 7-bit environments, Unicode can also be encoded as UTF-8 and UTF-7, respectively. While Unicode-enabled functions in Windows use UTF-16, it is also possible to work with data encoded in UTF-8 or UTF-7, which are supported in Windows as multibyte character set code pages.

So Microsoft has added codepages that allow working with Unicode data in 8-bit and 7-bit environments via a UTF-7/8 mapping within these codepages: This is the reason why they were implemented (the link is a nice find).
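What "8-bit and 7-bit environments" means at the byte level can be sketched in Python. The sample characters are my own choice for illustration (one each needing 1, 2, 3 and 4 bytes in UTF-8), and Python's `utf-8` and `utf-7` codecs stand in for codepages 65001 and 65000.

```python
# One sample codepoint per UTF-8 length class (samples chosen for illustration only).
samples = ["A", "\u00a1", "\u20ac", "\U0001F600"]

# UTF-8 is a multibyte encoding: the number of bytes varies per codepoint.
utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]

# UTF-7 keeps every byte below 0x80, so it survives 7-bit channels.
utf7_bytes = "".join(samples).encode("utf-7")
utf7_is_7bit = all(b < 0x80 for b in utf7_bytes)

print(utf8_lengths)
print(utf7_is_7bit)
```

The variable length (beyond two bytes per character) is what the quoted documentation refers to as a multibyte character set code page, as opposed to SBCS or DBCS.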

Liviu wrote:"Two encodings of Unicode (UTF-7 and UTF-8) are implemented as code pages..."

Agreed: "Codepage 65001 performs the same mapping as UTF-8".
Same for UTF-7 but i don't know the codepage... (maybe 65000... just guessed).

Just disagree on this point (probably caused by my lazy shortening):

Liviu wrote:Sorry, but you are plain wrong. An encoding, or a codepage, is a mapping between codepoints (abstract characters) and binary representations. Fonts are mappings between codepoints and visual glyphs, which is a separate, unrelated matter.

No, i only wanted to shorten my post by being imprecise about what i thought was unimportant information: So i've shortened the rendering of characters... it seems that was an error.

So to be as precise as possible, i will use the official terminology of Microsoft, too: http://www.microsoft.com/typography/unicode/cscp.htm

A font is a collection of glyphs.
A glyph is a representation of a character.
A character set is only a collection of characters.
Characters are represented by character codes.
Each Unicode index refers unambiguously to a given character.
A codepage is a list of selected character codes in a certain order.

Unicode in the script means Unicode version 1.0.
Additional: Valid unicode character codes are called codepoints.

So assuming Windows uses Unicode as the basic coding the following is true (1-3):
1) Unicode is a character set: Mapping between codepoints and a character.
(Note: In my above post I wanted to simplify the differentiation of codepages and character sets by identifying the character codes of the codepage mapping with a unique glyph, as this is common use in codepages.
This is also possible for the character set part, as it also uses character codes.
The key idea behind this is, to use a one to one mapping between character codes and glyphs as common in most such codepage tables.
The collection of the above glyphs is combined into a font.
If you use this font only, then each character code can be identified by a unique glyph.
In addition this view makes the difference between a codepage and a character set easier to see, so i thought that shortcut was a good idea... .)
2) A codepage is a mapping from an index to a codepoint.
3) UTF-8 (as defined by Unicode.org) is the 8 bit Universal Character Set Transformation Format: It maps code values (in this case n-tuples of bytes) to codepoints (valid character codes).
The mapping of UTF-8 (Unicode.org) is unambiguous, so it is the same description as Unicode but in a dual space.
Because of that UTF-8 (Unicode.org) is a character set, too.

Code values are semantically something different than an index: For example the set of real numbers may be used as an index, but will never serve as code values.
In case of codepage 65001 and UTF-8 (Unicode.org), the difference is minimal (semantical only), as the mapping of both is identical.
Nevertheless UTF-8 is defined (by Unicode.org) as something different than a codepage.
So UTF-8 (Unicode.org) is semantically something different than a codepage.

But at the end i accept that (you and others) name the codepage 65001 the "UTF-8 codepage", although i personally dislike such a syntactical mixture as this leads to "fishing in mist" when talking about UTF-8 (Unicode.org) and codepages.

penpen
Edit: Changed the 1) part to clarify my intention. The original part can be seen in my next post in its Edit section.

Liviu · #8 Post by **Liviu** » 09 Feb 2014 16:46

Getting closer, I think ;-)

except...

penpen wrote:So assuming Windows uses Unicode as the basic coding the following is true (1-3):
1) Unicode is a character set: Mapping between codepoints and a character.
If only one font is used it is also a mapping between codepoints and glyphs.

...that's still wrong. Key point is that "glyphs do not correspond one-to-one with characters" - which both the link you quoted and, for example http://www.unicode.org/reports/tr17/#CharactersVsGlyphs, point out. As a very simple example, the same character U+0028 can be represented by different glyphs resembling "(" and ")" respectively, depending on whether "right-to-left" direction is in effect - which can change even in the middle of the text (http://www.unicode.org/reports/tr9/#L4 - with the note that "while the name indicates that it is a left parenthesis, the character really expresses an open parenthesis—the leading character in a parenthetical phrase"). Bottom line - mapping between codepoints and glyphs is the responsibility of the host renderer (and it is not related to a particular encoding).
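The Unicode character database exposes this property directly. As a small sketch, Python's standard `unicodedata` module reports U+0028 as a character that is mirrored in bidirectional text, so the same codepoint legitimately maps to more than one glyph depending on direction. The letter "A" is added only as a contrasting, non-mirrored example.

```python
import unicodedata

paren = "\u0028"

# The formal name says "left", but the character is flagged as mirrored,
# so a renderer may draw it with a ")"-shaped glyph in right-to-left text.
paren_name = unicodedata.name(paren)
paren_mirrored = unicodedata.mirrored(paren)  # 1 means mirrored in bidi text
letter_mirrored = unicodedata.mirrored("A")   # 0 means not mirrored

print(paren_name, paren_mirrored, letter_mirrored)
```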

penpen wrote:Nevertheless UTF-8 is defined (by Unicode.org) as something different than a codepage.
So UTF-8 (Unicode.org) is semantically something different than a codepage.

With this I agree, of course. UTF-8 is an encoding, not a codepage. Please note that I never wrote or even implied that UTF-8 itself is a codepage. What I did was use "UTF-8 codepage" and "codepage 65001" interchangeably, both referring to the same Windows codepage which uses the UTF-8 encoding and has an identifier of 65001. I stand by my use of this terminology as being completely proper and unambiguous. If you still don't see it that way then maybe we can just agree to disagree, before this thread goes hopelessly off topic.

Liviu

#9 Post by **penpen** » 09 Feb 2014 18:43

Liviu wrote:(...)
...that's still wrong. Key point is that "glyphs do not correspond one-to-one with characters"
(...)

In general you are right, but here we are talking about the command shell.
I doubt there are more than a handful of people who have seen more than one glyph per character within the cmd shell, if you don't change the font (including: size, boldness, ...).
So this simplification is not THAT bad.
In addition it is a common simplification (used by microsoft, too; see your link to the codepages) that a one to one mapping is assumed (exemplary instance).

penpen

Edit: Besides, this part should just explain why i have used the simplification character codes == glyphs.
I thought the intention was obvious, but i have changed this part in my post above, as it seems it is not obvious:

If only one font is used it is also a mapping between codepoints and glyphs.
In addition this view makes the difference between a codepage and a character set easier to see, so i thought that shortcut was a good idea... .
