a little unicode related subtopic

taripo
Posts: 227
Joined: 01 Aug 2011 13:48

Re: a little unicode related subtopic

#16 Post by taripo » 23 Jan 2012 01:32

All the tests I did were checking what is displayed in a few situations; they didn't show much about the effects of different codepages.

My tests didn't really cover your question regarding codepages, but I wanted to at least ensure we both agreed on what we were looking at!

It's not always just a question of how something is displayed, since that 'a' with a line isn't the same character that I pasted into Notepad (the broken pipe). That may or may not make a difference, but it could have consequences; it's weird behaviour. There's a simple reason for it, which I'll look into. I think I looked into it before; I just have to check my posts.

And with the raster font, when you paste the broken pipe in, it is not displaying the broken pipe as a block-like image; it is displaying a different character.

As to whether your batches work or break in different codepages, my tests didn't test that.

Liviu
Expert
Posts: 470
Joined: 13 Jan 2012 21:24

#17 Post by Liviu » 23 Jan 2012 01:33

taripo wrote:It might not make any difference, but which font are you using, and do you also get the display that picture shows when you paste it in the command prompt?

The font makes no difference, except that things look different on the screen and in screen snapshots (or, worse, look the same for different binary characters depending on the active codepage). Whenever in doubt, check the actual file contents with your favorite binary viewer.

Liviu

Ed Dyreen
Expert
Posts: 1569
Joined: 16 May 2011 08:21
Location: Flanders(Belgium)

#18 Post by Ed Dyreen » 23 Jan 2012 01:38

'
Thanks Liviu, you read faster than I write/edit.. ( I understand now, which is why I removed that question.)

I think my code works for everyone ( just guessing ) if I simply use

Code: Select all

@echo off &prompt $_$G &chcp 850 %= prompt minimalistic, codepage 850 =%
for /f "usebackq tokens=1-3 delims=¦" %%a in ( '"1"¦"2"¦"3"' ) do echo.a=%%~a_, b=%%~b_, c=%%~c_
pause

Code: Select all

a=1_, b=2_, c=3_
Druk op een toets om door te gaan. . .

If not please let me know, thanks for the quick responses :D

taripo
Posts: 227
Joined: 01 Aug 2011 13:48

#19 Post by taripo » 23 Jan 2012 01:41

Liviu wrote:
taripo wrote:It might not make any difference, but which font are you using, and do you also get the display that picture shows when you paste it in the command prompt?

The font makes no difference, except that things look different on the screen and in screen snapshots (or, worse, look the same for different binary characters depending on the active codepage). Whenever in doubt, check the actual file contents with your favorite binary viewer.

Liviu


I suppose it looks a bit like it does make a difference, leaving files aside and just looking at pasting.

But try pasting a broken pipe in.

I am using the gnuwin32 od command.

echo ¦ | od -x

See, it stores the broken pipe not as 0xA6 but as 0xDD.

That is a different character.

Also, in the encodings thread it was noticed that there are cases where a copyright symbol (c in a circle) or a registered symbol (r in a circle) can be pasted in and come out as a plain c or r, which is a different character from the one with a circle. It's not just the same character displayed differently; it's displaying a different character.

It's not displaying a broken pipe when a broken pipe ¦ is pasted in.

That od command should show the same as you would see using a binary editor.

C:\Documents and Settings\user>echo ▌ | od -x
0000000 20dd 0a0d
0000004

Putting the bytes in the "correct" (err, readable) order:

dd 20 0d 0a

See, dd not A6.

--
Though I don't know, actually, 'cos in Lucida, echo ¦| od -x also outputs dd, and also with cmd /u.

I might fiddle with it later this week and check back on prior posts, but (I recall ASCII is a 7-bit code) I stick to 7 bits, hex 0-7F, so I don't tend to run into these issues.
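
For anyone following along without od, the byte values above can be cross-checked with a minimal Python sketch. It assumes Python's built-in cp437, cp850 and cp1252 codecs match the Windows codepages for these characters (Python's codecs are strict and do not apply the console's best-fit substitutions). `od_x_words` is a hypothetical helper that mimics how `od -x` groups bytes into little-endian 16-bit words on x86.

```python
import struct
import unicodedata

BROKEN_BAR = "\u00a6"  # the broken pipe character

# The same character maps to a different byte depending on the codepage.
print(hex(BROKEN_BAR.encode("cp850")[0]))   # 0xdd
print(hex(BROKEN_BAR.encode("cp1252")[0]))  # 0xa6

# Byte 0xDD read back under codepage 437 is a different character.
print(unicodedata.name(b"\xdd".decode("cp437")))  # LEFT HALF BLOCK


def od_x_words(data: bytes) -> list:
    """Hypothetical helper: group bytes into little-endian 16-bit words, like `od -x` on x86."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    count = len(data) // 2
    return [f"{word:04x}" for word in struct.unpack(f"<{count}H", data)]


# The four bytes listed above: dd 20 0d 0a
print(od_x_words(b"\xdd\x20\x0d\x0a"))  # ['20dd', '0a0d']
```

Under these assumptions, 0xDD is what codepage 850 uses for the broken pipe, and the same byte shows up as a half block when interpreted as codepage 437.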

Ed Dyreen
Expert
Posts: 1569
Joined: 16 May 2011 08:21
Location: Flanders(Belgium)

#20 Post by Ed Dyreen » 23 Jan 2012 02:18

'
taripo wrote: I might fiddle with it later this week and check back on prior posts, but (I recall ASCII is a 7-bit code) I stick to 7 bits, hex 0-7F, so I don't tend to run into these issues.

In the set /a random topic (viewtopic.php?f=3&t=1817&hilit=set+random) you reported it didn't work on your OS with your codepage and that you solved it with an underscore.

I care whether the code is affected and whether forcing codepage 850 makes my code work.
But an underscore is not a safe delimiter; it's far more likely to appear in data than the '¦' delimiter!

taripo
Posts: 227
Joined: 01 Aug 2011 13:48

#21 Post by taripo » 23 Jan 2012 06:02

Ed Dyreen wrote:'
taripo wrote: I might fiddle with it later this week and check back on prior posts, but (I recall ASCII is a 7-bit code) I stick to 7 bits, hex 0-7F, so I don't tend to run into these issues.

In the set /a random topic (viewtopic.php?f=3&t=1817&hilit=set+random) you reported it didn't work on your OS with your codepage and that you solved it with an underscore.

I care whether the code is affected and whether forcing codepage 850 makes my code work.
But an underscore is not a safe delimiter; it's far more likely to appear in data than the '¦' delimiter!


I didn't try yours with different codepages, and there are a number of variables, like whether the cmd prompt is set to Unicode, as well as the codepage being set in two places. So just a chcp command might not be enough.

You could take the delimiter as a parameter. Or you could have a variable in the batch file that the person using it is supposed to set.

Funny things may happen with any character over 127 (7F), though some may be OK.

But many are not. I just tried the one adjacent to £, that is ☼ (see charmap), and it behaves funny on my system.
And that's not even such a high-value one; it's just adjacent to the pound.
The more popular ones may be in more character sets, and some may have the same value, which might help them behave better.

C:\crp>echo ☼>a

C:\crp>type a
&

C:\crp>

So, at this stage, I don't know which are safe, but you need to look at the mechanism by which ☼ can become &. Then you'll have good ideas of what other characters to try. I can't pull up the webpages at the moment but may be able to later this week.
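
As a rough way to screen candidate delimiters, here is a minimal Python sketch that checks whether a character has the same single byte in several codepages. It assumes Python's cp437, cp850 and cp1252 codecs match the Windows codepages; the names `CODEPAGES`, `delimiter_bytes` and `is_codepage_stable` are hypothetical. It does not explain how ☼ turned into & above; it only shows which characters depend on the codepage at all.

```python
CODEPAGES = ("cp437", "cp850", "cp1252")


def delimiter_bytes(ch: str) -> dict:
    """Map each codepage to the byte(s) for ch, or None if that codepage lacks it."""
    result = {}
    for codepage in CODEPAGES:
        try:
            result[codepage] = ch.encode(codepage)
        except UnicodeEncodeError:
            # Python refuses here; Windows may substitute a best-fit character instead.
            result[codepage] = None
    return result


def is_codepage_stable(ch: str) -> bool:
    """True if ch encodes to the same byte in every listed codepage."""
    values = set(delimiter_bytes(ch).values())
    return len(values) == 1 and None not in values


# Underscore, broken pipe, pound sign
for candidate in ("_", "\u00a6", "\u00a3"):
    print(f"U+{ord(candidate):04X}", delimiter_bytes(candidate), is_codepage_stable(candidate))
```

With these three codepages, only the 7-bit underscore comes out stable; the broken pipe and the pound sign each get a different byte (or none) depending on the codepage.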

Liviu
Expert
Posts: 470
Joined: 13 Jan 2012 21:24

#22 Post by Liviu » 23 Jan 2012 10:35

Ed Dyreen wrote:'
I think my code works for everyone ( just guessing ) if I simply use
...
If not please let me know, thanks for the quick responses :D

I'd guess it works provided it's you who save/publish/send the batch file. However, if people copy the code from your post and save the batch file themselves, they'd better make sure that it's saved while the active codepage is 850.

One other observation is that 'chcp' is not covered by setlocal, so unless you save the original codepage and restore it manually, cmd would be left at 850 after running the batch.

Liviu
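
One way to take the editor's active codepage out of the picture is to write the batch file's bytes explicitly. The following is only a sketch, assuming Python is available and that its cp850 codec matches the Windows codepage for '¦'; the helper `write_batch_cp850`, the `<D>` placeholder and the file name `delim850.cmd` are hypothetical. It covers how the file is saved only; it does not restore the codepage after the batch has run.

```python
import os
import tempfile

DELIM = "\u00a6"  # broken pipe, written as an escape so this source is plain ASCII

# The batch code from post #18, with <D> standing in for the delimiter.
TEMPLATE = '''@echo off &prompt $_$G &chcp 850 %= prompt minimalistic, codepage 850 =%
for /f "usebackq tokens=1-3 delims=<D>" %%a in ( '"1"<D>"2"<D>"3"' ) do echo.a=%%~a_, b=%%~b_, c=%%~c_
pause
'''


def write_batch_cp850(path: str) -> bytes:
    """Write the batch file encoded as codepage 850 with CRLF line ends; return its bytes."""
    text = TEMPLATE.replace("<D>", DELIM)
    with open(path, "w", encoding="cp850", newline="\r\n") as handle:
        handle.write(text)
    with open(path, "rb") as handle:
        return handle.read()


if __name__ == "__main__":
    target = os.path.join(tempfile.mkdtemp(), "delim850.cmd")
    data = write_batch_cp850(target)
    # Each delimiter should be stored as the single byte 0xDD.
    print(target, data.count(b"\xdd"))
```

Written this way, the delimiter bytes in the file agree with the `chcp 850` that the batch issues, whatever codepage the saving machine happens to use.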

Liviu
Expert
Posts: 470
Joined: 13 Jan 2012 21:24

#23 Post by Liviu » 23 Jan 2012 10:50

taripo wrote:
Liviu wrote: The font makes no difference, except that things look different on the screen and in screen snapshots (or, worse, look the same for different binary characters depending on the active codepage). Whenever in doubt, check the actual file contents with your favorite binary viewer.

I suppose it looks a bit like it does make a difference, leaving files aside and just looking at pasting.


I don't see the difference here. Below is a test run on my XP SP3 machine.

Code: Select all

C:\tmp>chcp 437
Active code page: 437

C:\tmp>echo @echo [ 437] echo copyright - ©©©
@echo [ 437] echo copyright - ©©©

C:\tmp>echo @echo [ 437] echo copyright - ©©© | more
@echo [ 437] echo copyright - ccc

C:\tmp>echo @echo [ 437] echo copyright - ©©© >copyrite.cmd

C:\tmp>chcp 850
Active code page: 850

C:\tmp>echo @echo [ 850] echo copyright - ©©©
@echo [ 850] echo copyright - ©©©

C:\tmp>echo @echo [ 850] echo copyright - ©©© | more
@echo [ 850] echo copyright - ©©©

C:\tmp>echo @echo [ 850] echo copyright - ©©© >>copyrite.cmd

C:\tmp>chcp 1252
Active code page: 1252

C:\tmp>echo @echo [1252] echo copyright - ©©©
@echo [1252] echo copyright - ©©©

C:\tmp>echo @echo [1252] echo copyright - ©©© | more
@echo [1252] echo copyright - ©©©

C:\tmp>echo @echo [1252] echo copyright - ©©© >>copyrite.cmd

C:\tmp>chcp 437
Active code page: 437

C:\tmp>copyrite
[ 437] echo copyright - ccc
[ 850] echo copyright - ╕╕╕
[1252] echo copyright - ⌐⌐⌐

C:\tmp>chcp 850
Active code page: 850

C:\tmp>copyrite
[ 437] echo copyright - ccc
[ 850] echo copyright - ©©©
[1252] echo copyright - ®®®

C:\tmp>chcp 1252
Active code page: 1252

C:\tmp>copyrite
[ 437] echo copyright - ccc
[ 850] echo copyright - ¸¸¸
[1252] echo copyright - ©©©

The points this test case is trying to make are...

1. The console and cmd themselves are fully Unicode. The "echo @echo [ 437] echo copyright - ©©©" at the cmd prompt displays correctly regardless of the active codepage, including in the case of 437, which doesn't even carry the "©" symbol.

2. The active codepage kicks in the moment the output needs to be converted to 8-bit text, such as pipes or redirection. This is why "echo @echo [ 437] echo copyright - ©©© | more" replaces "©" with "c".

2.a. This also explains why the copyrite.cmd file ends up with 3 different lines. The "©" character is saved as 0x63, 0xB8 and 0xA9 under codepages 437, 850 and 1252 respectively.

2.b. I will not discuss the "cmd /u" case here since it's not really relevant in this context: the batch file must be saved as 8-bit text since cmd does not support UTF-16 batch files.

3. The snapshot above was taken with the console font set to Lucida Console. With a Raster font, instead, the screen looks different, but the generated copyrite.cmd file is bit by bit identical.

Liviu
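
The three runs of copyrite.cmd above can be cross-checked with a minimal Python sketch that decodes the three saved bytes from point 2.a under each active codepage. It assumes Python's cp437, cp850 and cp1252 codecs match the Windows codepages; `render_table` is a hypothetical helper. Code points are printed instead of glyphs so the output does not depend on the terminal.

```python
CODEPAGES = ("cp437", "cp850", "cp1252")

# The byte saved for the copyright sign under 437, 850 and 1252 (point 2.a).
SAVED_BYTES = (b"\x63", b"\xb8", b"\xa9")


def render_table() -> list:
    """One row per active codepage; each cell is the code point a saved byte decodes to."""
    rows = []
    for active in CODEPAGES:
        rows.append([f"U+{ord(raw.decode(active)):04X}" for raw in SAVED_BYTES])
    return rows


for active, row in zip(CODEPAGES, render_table()):
    print(active, row)
# cp437 ['U+0063', 'U+2555', 'U+2310']
# cp850 ['U+0063', 'U+00A9', 'U+00AE']
# cp1252 ['U+0063', 'U+00B8', 'U+00A9']
```

U+00A9 (the copyright sign) appears only on the diagonal, that is, only when the active codepage matches the one the line was saved under, which is the pattern in the console runs above.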
