encodings

Discussion forum for all Windows batch related topics.

Moderator: DosItHelp

taripo
Posts: 217
Joined: 01 Aug 2011 13:48

encodings

#1 Post by taripo » 27 Nov 2011 18:51

continuing from the discussion here that went off topic
viewtopic.php?f=3&t=2647&p=12098#p12098

@aGerman interesting stuff.. but how come when I change to codepage 850, go to charmap's DOS Western Europe set (presumably codepage 850), choose 0xA9 ® and paste it into the cmd prompt, it prints ⌐ (0xA9 of DOS United States, presumably codepage 437)? Similarly, doing TYPE on a file with hex 0xA9 in it prints ⌐ even when I've got codepage 850 selected.
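To see what the same byte means in each table, here is a minimal sketch in Python. It assumes Python's `cp437`, `cp850` and `cp1252` codecs match the codepages CHCP and charmap refer to; the helper name `glyphs_for` is made up for illustration.

```python
import unicodedata

def glyphs_for(value: int, codepages=("cp437", "cp850", "cp1252")) -> dict:
    """Map each codepage name to the character it assigns to one byte value."""
    return {cp: bytes([value]).decode(cp) for cp in codepages}

table = glyphs_for(0xA9)
for cp, ch in table.items():
    # Print the code point and Unicode name so the output stays plain ASCII.
    print(f"{cp}: 0xA9 -> U+{ord(ch):04X} {unicodedata.name(ch)}")
```

This only shows the table lookup; it says nothing about which table the console font actually draws from.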

@Ed,
how are you producing 0xDD in CMD? The button you push on your keyboard to make that character, can't be used to do DIR pipe to MORE can it?


Notice
C:\WINDOWS>dir ▌ more
Volume in drive C has no label.
Volume Serial Number is FC9D-4769

Directory of C:\WINDOWS


Directory of C:\WINDOWS

File Not Found

C:\WINDOWS>
Last edited by taripo on 13 Dec 2011 15:48, edited 2 times in total.

aGerman
Expert
Posts: 3875
Joined: 22 Jan 2010 18:01
Location: Germany

Re: encodings

#2 Post by aGerman » 27 Nov 2011 19:35

I experimented on my command prompt and found the same behavior:

Code:

Microsoft Windows [Version 6.1.7601]
Copyright (c) 2009 Microsoft Corporation. Alle Rechte vorbehalten.

C:\windows\system32>prompt $g$s

> rem what's the current codepage?

> chcp
Aktive Codepage: 850.

> rem try to paste the registered trademark character and echo it

> echo ®
®

> rem OK, that worked

> rem now I'm gonna change the codepage ...

> chcp 437
Aktive Codepage: 437.

> rem again paste the registered trademark character ...

> echo r
r

> rem that's a bit odd

> rem I cannot explain that behavior

>


Another strange behavior:
*.bat (Windows-1252 encoded)

Code:

@echo off &setlocal
chcp 1252>nul
echo ©
pause>nul

That returns ® even if I changed the codepage before, but ...

Code:

@echo off &setlocal
chcp 1252>nul
set "c=©"
chcp 850>nul
echo %c%
pause>nul

... returns ©.

I'm not able to explain why :( Sometimes the CMD is a miracle.
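One possible way to think about the second batch file, sketched in Python. This is only a guess at a model, not documented cmd.exe behaviour: it assumes a line is decoded with the codepage that is active when it is read, kept as Unicode, and encoded again only when it is written out.

```python
# Hypothetical model, not cmd.exe's documented behaviour.
raw = bytes([0xA9])              # the byte an editor stores for the copyright sign in Windows-1252

stored = raw.decode("cp1252")    # like: chcp 1252, then  set "c=..."
shown = stored.encode("cp850")   # like: chcp 850,  then  echo %c%
misread = raw.decode("cp850")    # the same byte read while codepage 850 is active

print(f"stored as U+{ord(stored):04X}, written out as byte 0x{shown[0]:02X}")
print(f"read under 850 instead: U+{ord(misread):04X}")
```

Under that assumption the variable still holds the copyright sign after the codepage switch, while a raw 0xA9 read under 850 turns into the registered sign.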

Regards
aGerman

taripo
Posts: 217
Joined: 01 Aug 2011 13:48

Re: encodings

#3 Post by taripo » 27 Nov 2011 20:30

I was getting some funny results because I was on the raster font; changing to Lucida Console made a difference.
With raster fonts, and this seems extremely strange, ® doesn't come up; it comes up as ⌐, as if I were in codepage 437.
But the funny thing is that in that situation (raster fonts, chcp 850, entering ®) the thing that looks like ⌐ is not ⌐, it's really ®, 'cos if I copy/paste it into notepad or here, it comes up as ®. I don't know if fonts work like this, with glyphs mapped to encodings, but it must be mapping the ⌐ glyph to the 0xA9 encoding. (Anyhow, now I'm on Lucida Console, which I think you are too, and it's not as strange; maybe it supports more.)

Maybe the clipboard uses unicode, and when one does copy of that copyright R symbol in CMD, perhaps it gives the clipboard the unicode encoding of the R-copyright symbol.. just a theory..


In the case of codepage 437 the R copyright symbol not appearing, that's because codepage 437 doesn't have it..
Here I think is codepage 437
http://ascii-table.com/ascii-extended-pc-list.php
http://www.jimprice.com/ascii-dos.gif

and in charmap, check advanced view and choose DOS United States (I guess that's codepage 437) instead of DOS Western Europe (codepage 850, I suppose).. you'll see no copyright R symbol in codepage 437.

In that encoding position, it has ⌐

When you paste in the copyright R, I think there's some mapping between different encodings which comes into play. Perhaps the clipboard has copyright R, probably in unicode.
And I think somebody decided, that mapping that symbol to codepage 437, the best match is r Not ⌐, they decided to use the closest matching glyph(the closest matching squiggle). Hence the r appears.
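A rough sketch of that "closest squiggle" idea in Python. The `BEST_FIT` table is hypothetical and contains only substitutions reported in this thread; the real Windows tables are not reproduced here, and `to_codepage` is a made-up helper name.

```python
# Hypothetical best-fit table, built only from results seen in this thread.
BEST_FIT = {
    "\u00ae": "r",   # registered sign came out as r under codepage 437
    "\u201c": '"',   # curly quotes came out as straight quotes
    "\u201d": '"',
}

def to_codepage(text: str, codepage: str = "cp437") -> bytes:
    """Encode text; fall back to a look-alike, then to '?', when a character is missing."""
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode(codepage)
        except UnicodeEncodeError:
            out += BEST_FIT.get(ch, "?").encode("ascii")
    return bytes(out)

print(to_codepage("\u00ae"))            # codepage 437 has no registered sign
print(to_codepage("\u00ae", "cp850"))   # codepage 850 has one, so no fallback is needed
```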

For those 2 bat files you give, I get the © symbol.

So that's the right response from the first bat file for me.

But for the second bat file, I also get the copyright symbol..
But, then I thought hmm, maybe it's harder to run from codepage 1252 'cos it's windows.. and even if you change to another one you can't run from it!

But then I remembered there were 2 codepage settings, in that registry entry you showed.

I tried changing ACP to 850 then starting a new cmd prompt.

Then the copyright symbol doesn't come up, it comes up as c

Ed Dyreen
Expert
Posts: 1569
Joined: 16 May 2011 08:21
Location: Flanders(Belgium)

Re: encodings

#4 Post by Ed Dyreen » 28 Nov 2011 08:11

'
I have little knowledge of those tables, is there a way to avoid these problems ?
I mean, if I save a batch in ASCII, and then use the chcp 850 command, will-can that avoid these issues ?
That way the programmer could be certain the mapping is the one intended or :?:

In the past I tried saving a batch in unicode thinking I would be able to print fancy chars, that didn't work obviously.

taripo
Posts: 217
Joined: 01 Aug 2011 13:48

Re: encodings

#5 Post by taripo » 28 Nov 2011 08:57

Ed Dyreen wrote:'
I have little knowledge of those tables, is there a way to avoid these problems ?

I mean, if I save a batch in ASCII, and then use the chcp 850 command, will-can that avoid these issues :?:


well, changing the codepage seems to require a registry change for permanence and in 2 values. But even if you did that, somebody else might have a different codepage..

The thing to do, is to not use characters that are within a funny range.
To not use funny characters..

The characters A, B, C, | and " " are accepted by anybody. They're in codepages 437 and 850, and in ANSI..

But some characters are in ANSI and unicode yet missing from codepage 437, and they would cause problems, e.g. the broken bar ¦: \u00A6 (unicode A6), ANSI 0xA6. (Codepage 850 does have it, but at 0xDD rather than 0xA6.)

and "smart quotes" / "curly quotes" \u201C and \u201D (these are in ANSI too, 0x93 and 0x94)..http://www.alanwood.net/demos/ansi.html “ ”
If you paste them into CMD.EXE window, they get converted into " " and work.
But if I put them in a bat file(Written in notepad for example), they won't
FOR %%F in (“abc”) do @ECHO %%F <-- in a bat file fails
FOR %F in (“abc”) do @ECHO %F <-- at the console works
notepad supports either ANSI or Unicode so it won't convert those characters into something else, it will store them properly as those characters.

Here are two other funny ones I see on the web ‘ ’
\u2018 and \u2019
One needs to use normal single quotes ' ' Not ‘smart’ /curly ones!


Best thing is just to use " " and ' ' for quotes. (you probably do)
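If a script has already picked up these characters, something like the following Python sketch could swap them back before saving. The mapping is my own choice, and treating a broken bar as an intended pipe is an assumption that may not suit every file.

```python
REPLACEMENTS = {
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u00a6": "|",                  # broken bar; assumes a pipe was intended
}

def plain_punctuation(text: str) -> str:
    """Replace 'smart' punctuation with the plain ASCII characters CMD expects."""
    return text.translate(str.maketrans(REPLACEMENTS))

line = "FOR %%F in (\u201cabc\u201d) do @ECHO %%F"
print(plain_punctuation(line))
```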



You need to see if your keyboard is producing any of these funny characters.

It might be producing the broken bar ¦
You should only use |
DOS doesn't really recognise the broken bar.. which is in ANSI.. and unicode.. but not in codepage 437 (codepage 850 has it, at 0xDD).

Funnily enough the bar CMD displays looks like a broken bar; that's because codepage 437 and codepage 850 display the pipe like that. But it is a pipe of course.. as you know.. DIR | MORE is DIR piped to MORE.
The broken bar won't work there.

And if you

But ANSI (aka Windows-1252 / codepage 1252) displays the pipe like | and has a separate broken bar character it displays like ¦, which funnily enough is the same way codepage 437 and 850 display the pipe. It seems that if you paste a broken bar into CMD, it won't convert it to a pipe; it converts it to 0xDD, which under codepage 437 shows as a thick white mark (▌) and not as a bar.

The first 128 chars, characters with codes 0-127 are generally ok I suppose.
But above 127, is perhaps unnecessary.

I often use Editpad Pro for its great regex support(The author is an expert in regular expressions), but it's not free. I search for funny characters if I paste things from the web and want to save them in ANSI rather than unicode, sometimes they have non-ANSI characters.. But I know I don't generally need characters >127, or can replace them with more regular ones. So I search
[\u0080-\uFFFF] i.e. any character from 128 upwards..
and I replace them manually.
I might start saving in unicode actually but it's an interesting exercise.
There is Editpad Lite but that doesn't support Regex.

You could use Notepad++, it's free.
As a test, type things in
Search for [\x00-\x7F]
that should pick up any or most characters you type in. 0-127.

To search for funny characters, use [\x80-\xFF]

You can paste your bat file into it, and make sure it doesn't have any of those characters, and if it does, then replace them.

You may even find if a particular key on your keyboard is a culprit.
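The same search can be done without an editor. A minimal Python sketch that reports every byte above 0x7F in a file's raw contents; the sample bytes and the name `find_non_ascii` are made up for illustration.

```python
def find_non_ascii(data: bytes):
    """Yield (line_number, column, byte_value) for every byte above 0x7F."""
    for line_no, line in enumerate(data.splitlines(), start=1):
        for col, value in enumerate(line, start=1):
            if value > 0x7F:
                yield line_no, col, value

# Made-up sample: a bat file saved as ANSI with curly quotes (0x93 and 0x94).
sample = b"@echo off\r\necho \x93abc\x94\r\n"
hits = list(find_non_ascii(sample))
for line_no, col, value in hits:
    print(f"line {line_no}, column {col}: 0x{value:02X}")
```

For a real file, the bytes would come from `open(path, "rb").read()`.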
Last edited by taripo on 28 Nov 2011 12:40, edited 1 time in total.

taripo
Posts: 217
Joined: 01 Aug 2011 13:48

Re: encodings

#6 Post by taripo » 28 Nov 2011 10:17

also, if you want fancy characters, maybe some are in both codepage 437 and 850 and ansi.. some might work..
This one § might work
looks like it's in codepage 437 and codepage 850 at number 21
and codepage 1252, position 167

if you write it in notepad and put it in cmd, it'll probably map § to § changing the number accordingly depending on the codepage/character set..

but check that they are in them! § looks fine.. though actually in codepage 850 it appears in 2 places, 245 and 21.. so maybe there are two, which might be problematic..

better perhaps not to use fancy characters.. unless you have really checked them.
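One way to "really check" a character is to ask an encoder where it sits in each table. A small Python sketch, assuming Python's codec tables match the Windows ones; note that Python treats bytes below 32 as control codes, so the graphic glyph at position 21 is invisible to it and the lookup may come back empty for a codepage that only has the character there.

```python
def byte_in(ch: str, codepage: str):
    """Return the byte value of ch in the codepage, or None if the codec cannot encode it."""
    try:
        return ch.encode(codepage)[0]
    except UnicodeEncodeError:
        return None

section_sign = "\u00a7"
positions = {cp: byte_in(section_sign, cp) for cp in ("cp437", "cp850", "cp1252")}
for cp, value in positions.items():
    print(cp, "not found" if value is None else value)
```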

I strongly recommend you download this
a Windows version of XXD.EXE It is included in VIM funnily enough.

then you can see a file in Hex, and even write a file in Hex, something like echo 41| xxd -p -r, from the command line, and really better understand and investigate things.
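If xxd isn't at hand, the same two directions (bytes to hex, hex to bytes) can be sketched with Python's built-in conversions. This is only a stand-in for illustration, not a replacement for xxd's options.

```python
def to_hex(data: bytes) -> str:
    """Plain hex dump, similar in spirit to xxd -p."""
    return data.hex()

def from_hex(text: str) -> bytes:
    """Turn hex digits back into bytes, similar in spirit to xxd -p -r."""
    return bytes.fromhex(text)

print(from_hex("41"))          # the byte 0x41 is the letter A
print(to_hex(b"?\r\n"))        # a question mark followed by CRLF
```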

aGerman
Expert
Posts: 3875
Joined: 22 Jan 2010 18:01
Location: Germany

Re: encodings

#7 Post by aGerman » 28 Nov 2011 17:40

I use the rastered font because it's the default in western europe, US and many others. Lucida Console supports unicode which enables characters with more than 2 Bytes width (e.g. Chinese, Japanese, perhaps Russian ...). I never experimented with it, for that reason I don't know how it changes the behavior of processing characters with a width of 2 Bytes.
I however noticed that cmd.exe is not cmd.exe. There are a lot of minor differences between the CMDs in different Windows versions but also in different languages and the same OS. Obviously these differences affect also the character handling.
It's indeed difficult :?

Regards
aGerman

taripo
Posts: 217
Joined: 01 Aug 2011 13:48

Re: encodings

#8 Post by taripo » 28 Nov 2011 21:56

Some fonts are considered unicode fonts, but lucida console isn't one of them. Saying a font supports unicode is a funny thing.. because for sure no font supports all of the characters in unicode. I notice that Lucida console doesn't support Hebrew for example. You can choose unicode in charmap, and maybe that just makes it list all the characters the font has. It's not a substantial number.

It could be that all the characters in Lucida Console are in codepages 437 and 850 and 1252 and of course all those are in unicode. perhaps it has some that aren't in unicode too.
I see Lucida Console has quite a number of characters not listed on 1252, 437 or 850.. like √ and charmap advanced view has a unicode and group by unicode subrange option which is useful..babelmap shows lucida console too.

If a font doesn't have characters that in unicode take more than 2 bytes, it doesn't mean the font doesn't support unicode.. 'cos the characters that big in unicode numbers, like >FFFF are extremely obscure, e.g. Cuneiform and Ugaritic, scripts no longer in everyday use, or scripts of peoples and languages no longer in existence. or ancient greek musical notation!
wikipedia List_of_Unicode_characters
So I guess that most "unicode fonts" don't go >FFFF /2 bytes.

If you open Babelmap, choose Fonts..Font Analysis Utility then choose a Font and click "Copy All Characters" You can see all the characters it has in notepad with the unicode number of each character. Also, it says Lucida console has 663 characters. Some of them are past the 00FF point..

But 2 bytes' worth of characters would be 65536 characters.
Also a unicode font only needs a few thousand. So needs nowhere near 2 bytes worth. (See a bit later in this post a quote from wikipedia "these fonts attempt to include many thousands of possible glyphs, so that they can be used as a single typeface across multi-lingual documents")

Also it looks like just by supporting ANSI (codepage 1252, extended ascii), a font will have some unicode characters > 00FF http://www.alanwood.net/demos/ansi.html like this trademark character ™: its ANSI code is within 0-255, but its unicode value is \u2122.
and characters unicode 0000-FFFF are pretty much any character almost anybody in the world is going to use.
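The gap between the ANSI byte and the unicode number is easy to check. A short Python sketch, assuming Python's `cp1252` codec stands in for "ANSI" here:

```python
import unicodedata

# Trademark sign and the two curly double quotes.
rows = []
for ch in ("\u2122", "\u201c", "\u201d"):
    ansi_byte = ch.encode("cp1252")[0]   # one byte in Windows-1252
    rows.append((ansi_byte, ord(ch)))
    print(f"{unicodedata.name(ch)}: ANSI 0x{ansi_byte:02X}, Unicode U+{ord(ch):04X}")
```

Each of these fits in one ANSI byte even though its unicode value is above 00FF.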

Raster fonts is a bit of a mystery.. I can't see it in charmap or babelmap or MS Word.. I don't know what font it actually is 'cos a Raster font is a font whose display is bitmap..I don't know what "Raster Fonts" is, maybe a limited font specific to the cmd prompt window.

I'm guessing a bit, but I don't think fonts have any say or anything to do with the bit encoding, like how many bytes are used to store a character. and the set of characters they have might be their own too.. which I suppose could be the whole of one codepage, the whole of another, and is certainly only ever a subset of unicode. though could perhaps have some characters not in unicode..

I just looked up what a unicode font is..
Here's
wikipedia Unicode_font#List_of_Unicode_fonts
"A Unicode font (also known as UCS font and Unicode typeface) is a computer font that contains a wide range of characters, letters, digits, glyphs, symbols, ideograms, logograms, etc., which are collectively mapped into the standard Universal Character Set, derived from many different languages and scripts from around the world. Unlike most conventional computer fonts, which are specific to a particular language or legacy character set and contain only a small subset of the UCS characters, these fonts attempt to include many thousands of possible glyphs, so that they can be used as a single typeface across multi-lingual documents."

Lucida console didn't have many scripts..e.g. it doesn't have Hebrew

Lucida console is not listed as a unicode font at that link.

And only had about 600 glyphs, not thousands as required in that definition.

There is Lucida Sans unicode(strange name since afaik sans means without!), and Lucida Grande are listed on that wikipedia page as unicode fonts. They have thousands of characters. Lucida console isn't listed there, and only has about 600.

Out of curiosity, I picked a weird character from lucida console, charmap..advanced..unicode..unicode subrange.. private.
\uF81D

It doesn't display here in Chrome! it looks like an n with a comma underneath, but it displayed in cmd prompt properly.. but funnily enough came up as 3F (it had to come up as something strange cos it is not in codepage 437 - it's not even in 1252) and it doesn't get stored properly either, it gets stored as 3F..
C:\>echo | od -x
0000000 0d3f 000a (od -x prints 16-bit words low byte first, so the bytes are 3f, then CRLF (0d 0a); the last 00 is just padding)
0000003

3F is ? I checked, (and would be the same number in any codepage 'cos the first 128, certainly the printable range, is the same ),
C:\>echo ?| od -x
0000000 0d3f 000a
0000003

C:\>

C:\>echo >a

C:\>type a
?

C:\>
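The od -x lines above are easier to read once the word grouping is clear. A Python sketch that mimics it, assuming od is grouping the bytes into little-endian 16-bit words and padding an odd byte count with a zero; `od_x` is a made-up name and only produces the data columns, not the offsets.

```python
def od_x(data: bytes) -> str:
    """Format bytes as little-endian 16-bit hex words, padding odd lengths with 0x00."""
    if len(data) % 2:
        data += b"\x00"
    words = [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]
    return " ".join(f"{w:04x}" for w in words)

print(od_x(b"?\r\n"))   # question mark, carriage return, line feed
```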

Same with \u20A3: echo it, and it displays a question mark.. so that's one way CMD deals with characters pasted in (that I suppose are outside of its codepage). Other ways are the thick white mark, an equivalent-looking character, or a square. Some unrecognised ones can manage to get pasted in, at least with Lucida, but aren't really supported right: the broken bar gets pasted in looking like itself in Lucida, but | od -x and you see it's a 0xDD; others get pasted in, but pipe them and you see they're getting stored as a question mark (0x3F). It depends on the font, the codepage and which unrecognised character it is. And if a bat file has characters >127 saved in a different encoding to the codepage in use, then many of those characters will just come out as a very different looking character: not a graphical equivalent or question mark or thick white mark, just whatever glyph sits at that integer.

The 95 printable characters 32-126 are probably the same in all codepages.. so best to stick with those. From SPACE (x20/32), !, ", ......... up to tilde (x7E/126).
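That "probably" can be checked for the three codepages in this thread. A Python sketch, again assuming Python's codec tables match the Windows ones; other codepages are not covered by it.

```python
# Bytes 0x20 (space) through 0x7E (tilde).
printable = bytes(range(0x20, 0x7F))

decoded = {cp: printable.decode(cp) for cp in ("cp437", "cp850", "cp1252", "ascii")}
all_same = len(set(decoded.values())) == 1

print(len(printable), "bytes checked; identical in every table:", all_same)
```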

aGerman
Expert
Posts: 3875
Joined: 22 Jan 2010 18:01
Location: Germany

Re: encodings

#9 Post by aGerman » 29 Nov 2011 13:30

taripo wrote:Saying a font supports unicode is a funny thing..

It's maybe not exactly fitting but it's not that funny since Lucida Console indeed supports characters wider than 2 Bytes (>0xFF).

First I opened charmap, selected Lucida Console and character set Unicode. I copied character 0x0414 (capital cyrillic letter De).
I opened my command prompt for testings (raster font yet) and tried to echo that character:

Code:

> echo ?
?

>

Now I changed the font of the console window to Lucida Console and tried the same again ...

Code:

> echo Д
Д

>

... that obviously worked.

I however agree with you the best way is to work with the "normal" printable ASCII characters. But because I'm German I also have to work with extended ASCII characters (there are strange letters like ÄÖÜäöüß).

Regards
aGerman

taripo
Posts: 217
Joined: 01 Aug 2011 13:48

Re: encodings

#10 Post by taripo » 30 Nov 2011 04:35

aGerman wrote:...


I mean, the font has more than one byte's worth of characters, but not 2 bytes' worth! I'd guess it has characters that, in its own code, are 2 bytes, and it certainly has some that in unicode have to be 2 bytes.. I meant the number of characters it has is not 2 bytes' worth. 2 bytes is 16 bits.. (65536).

You said
"..Lucida Console indeed supports characters wider than 2 Bytes (>0xFF)."

>0xFF is wider than one byte

and you said
" characters with more than 2 Bytes width (e.g. Chinese, Japanese, perhaps Russian ...)"

I doubt those characters have more than 2 bytes width.

F is a nibble - 4 bits. FF is one byte. 0xFFFF or 0xABCD is 2 bytes.
Only very obscure unicode characters are more than 2 bytes.
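To put numbers on that, here is a Python sketch that measures how many bytes a character takes. It assumes UTF-16, which is what Windows generally means by "unicode"; the sample characters are my own picks.

```python
def utf16_bytes(ch: str) -> int:
    """Number of bytes the character occupies in UTF-16 (without a byte order mark)."""
    return len(ch.encode("utf-16-le"))

samples = {
    "LATIN A": "\u0041",
    "CYRILLIC DE": "\u0414",
    "CJK IDEOGRAPH": "\u4e2d",
    "CUNEIFORM SIGN A": "\U00012000",   # above FFFF
}
widths = {name: utf16_bytes(ch) for name, ch in samples.items()}
for name, width in widths.items():
    print(f"{name}: {width} bytes")
```

Everything up to FFFF takes 2 bytes there; only characters above FFFF need 4.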




I fiddled with my registry settings, ACP and OEMCP are both set to 1252 (don't know if that's ok but anyway).. and cmd CHCP says codepage 437.
So, that's what my test is done on. And i'm on Lucida Console.

Now with your Echo.. It's getting stored as a question mark. As you can see below.

And yes it echos that character, so looking on the surface it looks like it might have worked, but further investigation as you see below, shows it is getting stored as a question mark.

Does it work on your system, with a German character and german codepage settings? As in with the test done below.


C:\blah>echo Д
Д

C:\blah>echo Д| od -x
0000000 0d3f 000a
0000003

(Now look up 3F in charmap)

C:\blah>echo Д>a.a

C:\blah>type a.a
?
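The question mark is what an encoder typically produces when the target codepage has no slot for a character. A Python sketch of that effect; it uses Python's `errors="replace"` mode as a stand-in, which is an assumption and not a claim about how cmd does the conversion. `cp866` is included only as a contrast, since it is a DOS codepage that does contain Cyrillic.

```python
de = "\u0414"   # CYRILLIC CAPITAL LETTER DE

results = {}
for cp in ("cp437", "cp850", "cp1252", "cp866"):
    # errors="replace" substitutes '?' for anything the codepage lacks.
    results[cp] = de.encode(cp, errors="replace")
    print(cp, results[cp].hex())
```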

aGerman
Expert
Posts: 3875
Joined: 22 Jan 2010 18:01
Location: Germany

Re: encodings

#11 Post by aGerman » 30 Nov 2011 15:07

taripo wrote:You said
"..Lucida Console indeed supports characters wider than 2 Bytes (>0xFF)."

>0xFF is wider than one byte

Oh, you're absolutely right. 2 Hex digits represent 1 Byte. For some strange reason I mixed it up :oops: I hope you however understood what I was talking about.

taripo wrote:I fiddled with my registry settings, ACP and OEMCP are both set to 1252 (don't know if that's ok but anyway).. and cmd CHCP says codepage 437.

Hmm, did you restart your machine?

taripo wrote:Does it work on your system, with a German character and german codepage settings? As in with the test done below.

No, that doesn't work. You need CMD /U to redirect unicode characters. Try ...

Code:

cmd /u /c "echo Д>a.a"

...on the command prompt and open the file in a Hex editor. You should find:

Code:

14 04 0D 00 0A 00

But the TYPE command will also fail because the Byte Order Mark is missing. If you prepend FF FE with your Hex editor then TYPE should work.
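The byte sequence and the effect of the byte order mark can be reproduced in a few lines of Python. This sketch only builds the bytes in memory; it assumes CMD /U writes UTF-16 little-endian, which is what the hex listing above suggests.

```python
import codecs

payload = "\u0414\r\n".encode("utf-16-le")      # the character plus CRLF, no byte order mark
with_bom = codecs.BOM_UTF16_LE + payload         # FF FE prepended

print(" ".join(f"{b:02X}" for b in payload))
print(" ".join(f"{b:02X}" for b in with_bom))

# A decoder that honours the byte order mark recovers the original text.
recovered = with_bom.decode("utf-16")
```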

Regards
aGerman

taripo
Posts: 217
Joined: 01 Aug 2011 13:48

Re: encodings

#12 Post by taripo » 13 Dec 2011 15:36

thanks aGerman, that /U helped, along with prepending FF FE.

I didn't know about that /U switch.
