ctrl-z blues


#1 Post by **Sponge Belly** » 02 Nov 2013 08:58

Hello All! :-)

Imagine a test file called hi.txt with the contents “hi there” (no quotes and no newline).

It’s possible to print out the characters of the file one by one using a little trick I learnt from Judago:

Code:

cmd /d /u /c type hi.txt | find /v ""
h
i

t
h
e
r
e

This would be extremely useful for processing very long lines, etc, if it weren’t for one little problem… Control-Z (ASCII 26, SUB). :-(

To see what I mean, replace the space in hi.txt with Ctrl-Z and run the command above again. This time the output will be:

Code:

h
i

Is there any way round this? Is there a way to filter out the SUBs? Any suggestions appreciated.

- SB

#2 Post by **aGerman** » 02 Nov 2013 09:13

A few years ago I already found that in a German forum. Last time I used it was in the CHOICE function.

Regards
aGerman

#3 Post by **foxidrive** » 02 Nov 2013 09:17

Old editors add SUB characters to text files...

#4 Post by **Sponge Belly** » 02 Nov 2013 13:14

Hi Again,

Thanks for the replies. The Choice Function thread aGerman pointed to was riveting. How did I miss that? :?:

But I’m trying to find a way to use type with cmd /d /u /c so I can read every character in a file one by one. The only thing stopping me is Ctrl-Z (SUB). If the input file contains a SUB, output stops at the preceding character.

Is there a workaround?

- SB

#5 Post by **Squashman** » 02 Nov 2013 14:14

AFAIK TYPE will always stop reading a file when it sees the SUB.

#6 Post by **aGerman** » 02 Nov 2013 16:02

A workaround for the SUB but I fear it's not exactly what you're looking for ...

Code:

cmd /von /d /u /c "set /p $=<hi.txt&echo(!$!"|find /v ""|findstr .

Regards
aGerman

#7 Post by **foxidrive** » 02 Nov 2013 16:53

Sponge Belly wrote: The only thing stopping me is Ctrl-Z (SUB). If the input file contains a SUB, output stops at the preceding character.

SUB is also known as an end of file marker, for text files.

AIR, it was used in the early MS-DOS days to tell the various commands and programs that the file was complete and that processing should end, before the FAT was used to detect the file size.
It's still honoured in various ways by tools, and the thing is that it should *never* appear within the body of a text file.

#8 Post by **Sponge Belly** » 02 Nov 2013 17:19

Me Again!

Thanks aGerman for your suggestion. There was a recent discussion here about using set /p to slurp up long lines 1021 characters at a time. Dave Benham concluded that the method was “lossy” if the line had control characters at the end. Made for interesting reading all the same…

And thanks FoxiDrive for making re-assuring noises about never having to encounter SUB in the body of a text file. But it would be great if we could come up with a way of dealing with it because you never know! ;-)

And anyways, what’s the problem, exactly? This works:

Code:

type hi.txt | find /v ""
hi<SUB>there

As does this:

Code:

type hi.txt >con
hi<SUB>there

So why can’t I get it to work with cmd /d /u /c? :?:

- SB

#9 Post by **penpen** » 02 Nov 2013 19:07

I'm not sure, but I think the problem with the line cmd /d /u /c type hi.txt | find /v "" is caused by the use of 1A (SUB) as an EOF character under (old) DOS and Windows (it still seems to be in use by type and find).
In the good old MS-DOS 6.22 days a file was read in full (buffered or not) up to the end of the file. Once the file had been read fully, every further read attempt returned the MS-DOS end-of-file (EOF) character 0x1A, which is also used as SUB (SUBSTITUTE).
When reading the file hi.txt (stored in ANSI, I think), the content in hex is: 68 69 1A 74 68 65 72 65 0D 0A.
cmd /d /u /c type hi.txt connects a Unicode file input stream, so the read results in: 6800 6900 1A00 7400 6800 6500 7200 6500 (Unicode 1.0, UTF-16).
This is piped on as Unicode output but interpreted as ANSI: 68 00 69 00 1A 00 74 00 68 00 65 00 72 00 65 00.
The program find then interprets the 0x00 chars as line endings, so it treats 1A 00 as a complete line => EOF reached. Finished. So it outputs only the "hi" part.

The other command (type hi.txt | find /v "") does all operations in ANSI, so no single read returns only the char 0x1A, and it writes everything.
I'm not sure whether the input/output streams (file, pipe) are set up to pass data straight through, so that find can read all characters after the 0x1A, or whether it only reads the content currently in the file read buffer.
So it may cause errors if the rest of the file is too big to fit into the file input stream and find reads only the current buffer content: then everything that is not in the buffer is cut off.
The command type itself handles the SUB char as an EOF char, too; just try: type hi.txt.

Same for the command (type hi.txt >con).

penpen

Edit 1/2: You may store the file hi.txt using Unicode, and if the streams are configured to pass data through and are all connected together, then your code may work as you expect, as the input streams are initialized with FF FE (the UTF-16 byte order mark) and know that they have to autocast from Unicode to ANSI.
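
The byte sequences penpen lists in post #9 can be reproduced outside cmd. The following is only an illustrative Python sketch, not something from the thread: it assumes hi.txt holds "hi", a SUB, then "there" with no trailing newline (as in post #1), and it uses cp1252 as a stand-in for "ANSI". Splitting on NUL bytes merely models penpen's hypothesis about how find sees the stream; it does not prove what find actually does.

```python
# Illustration only: reproduce the byte sequences discussed in post #9.
# Assumption: hi.txt is "hi" + SUB (0x1A) + "there", no trailing newline.
text = "hi\x1athere"


def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated upper-case hex pairs."""
    return " ".join(f"{b:02X}" for b in data)


ansi = text.encode("cp1252")      # cp1252 stands in for "ANSI" here
utf16 = text.encode("utf-16-le")  # what a Unicode (UTF-16LE) stream carries

print(hex_bytes(ansi))   # 68 69 1A 74 68 65 72 65
print(hex_bytes(utf16))  # 68 00 69 00 1A 00 74 00 68 00 65 00 72 00 65 00

# Model of the hypothesis: if every 0x00 byte ends a "line",
# the third line consists of the SUB byte alone.
chunks = utf16.split(b"\x00")
print(chunks[2] == b"\x1a")  # True
```

If the hypothesis holds, that lone SUB "line" is where find would stop, which matches the truncated output shown in post #1.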

#10 Post by **aGerman** » 02 Nov 2013 19:12

It doesn't work for me btw (Win7 x86). The SUB stops the processing of the line using TYPE.
EDIT: Wait ... it works with the pipe to find. Seems it is the behaviour that penpen explained.

Regards
aGerman

#11 Post by **penpen** » 02 Nov 2013 19:16

Maybe the cmd shell is initialized using unicode under win 7.
Just make sure you are using ANSI by cmd /A before testing.

penpen

#12 Post by **aGerman** » 02 Nov 2013 19:18

Yes it was ANSI. I edited my previous post while you answered.

#13 Post by **foxidrive** » 03 Nov 2013 01:37

Sponge Belly wrote:And thanks FoxiDrive for making re-assuring noises

TBH you haven't explained why you have embedded SUB characters in your files, and if you have text files with them in
then it would be better to remove them with a more appropriate tool.

I understand that the guys like to deal with esoteric parts of cmd and investigating things that are outside my sphere of interest, as
I'm more of a fellow who likes to solve some problems, and show some techniques to newbies. This might pique the interest of the fellows here though.

My workaround to solve an embedded SUB would be to use a tool and remove it, but you should really explain exactly what it is that you are dealing with.
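
As one possible reading of "use a tool and remove it" in post #13, here is a minimal Python sketch that strips SUB bytes from a file. It is an assumption on my part, not a tool foxidrive named. The function names strip_sub and strip_sub_file are hypothetical, and the byte-level replace is only safe for single-byte (ANSI) text: in a UTF-16 file a 0x1A byte can be half of an unrelated character.

```python
SUB = b"\x1a"  # Ctrl-Z, ASCII 26


def strip_sub(data: bytes) -> bytes:
    """Return data with every SUB byte removed (single-byte encodings only)."""
    return data.replace(SUB, b"")


def strip_sub_file(src: str, dst: str) -> None:
    """Copy src to dst without SUB bytes (hypothetical helper).

    Binary mode is used so that no newline translation or
    end-of-file handling interferes with the raw bytes.
    """
    with open(src, "rb") as f:
        data = f.read()
    with open(dst, "wb") as f:
        f.write(strip_sub(data))


print(strip_sub(b"hi\x1athere"))  # b'hithere'
```

Note that this drops the SUB rather than replacing it with a space, so "hi" and "there" end up joined.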

#14 Post by **Sponge Belly** » 04 Nov 2013 17:01

Hi Penpen! :-)

Thanks for your reply. You were right about saving the file as Unicode (UTF-16). Everything works fine when I do this. SUB doesn’t cause any problems. But I think type is the stumbling block. This works…

Code:

cmd /d /u /c type unicode.txt > uni.out.txt

but this doesn’t:

Code:

cmd /d /u /c type ansi.txt > ansi.out.txt

The second example only gets as far as the first SUB in the ansi.txt file.

I must either find a way to convert files (that may contain SUBs) from ANSI to Unicode or remove all SUBs from the source file before processing. I’ve hit a brick wall with the former and I don’t know how to do the latter using Batch. Any suggestions appreciated at this point. :-(

And Foxi, there’s no particular reason why I’m hung up on SUBs except that I like to write robust code and I wouldn’t want to let a little thing like a control character stop play.

BFN!

- SB
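
For the ANSI-to-Unicode route mentioned in post #14, the conversion itself is straightforward outside Batch. The sketch below is in Python, not Batch, so it does not answer the pure-Batch question. It assumes the "ANSI" code page is cp1252, and the name to_utf16_le is hypothetical. It keeps any SUB characters and prepends the UTF-16LE byte order mark (FF FE).

```python
import codecs


def to_utf16_le(raw: bytes, encoding: str = "cp1252") -> bytes:
    """Re-encode single-byte text as UTF-16LE with a BOM.

    Assumption: the source bytes are cp1252. Bytes that cp1252 leaves
    undefined (such as 0x81) raise UnicodeDecodeError instead of being
    silently altered. SUB (0x1A) is passed through unchanged.
    """
    text = raw.decode(encoding)
    return codecs.BOM_UTF16_LE + text.encode("utf-16-le")


converted = to_utf16_le(b"hi\x1athere")
print(converted[:2] == b"\xff\xfe")  # True: starts with the BOM
print(b"\x1a\x00" in converted)      # True: SUB survives as a UTF-16 code unit
```

To convert a file, read it with open(path, "rb"), pass the bytes through to_utf16_le, and write the result with open(path, "wb"). Whether cmd /d /u /c type then handles the converted file as described in post #14 would still need checking on the target system.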

#15 Post by **Squashman** » 05 Nov 2013 07:06

Sponge Belly wrote: And Foxi, there’s no particular reason why I’m hung up on SUBs except that I like to write robust code and I wouldn’t want to let a little thing like a control character stop play.

BFN!

- SB

I have been doing List Processing for probably close to 15 years. I have literally processed Petabytes of data. Trillions of records. Can't say that I have ever once seen a SUB or EOF in the middle of a text file. I know we have done some quirky things with the SUB character in batch files but with pure data I just don't see it happening.

Now LINEFEEDS are another story. Clients love exporting their data with Linefeeds in the middle of a field or record.

DosTips.com
