Processing text files that have unlimited size lines!

penpen
Expert
Posts: 1991
Joined: 23 Jun 2013 06:15
Location: Germany

Re: Processing text files that have unlimited size lines!

#16 Post by penpen » 03 Oct 2013 01:54

Sometimes i'm just blind... :oops: .
penpen wrote:I can provoke an error if i remove 4 W's from the Test.txt above:
(... code block...)
But i assume this is another issue, and has nothing to do with the upper behavior.
This was the \r\n split issue (and no other), which Aacini has successfully solved with his third code.

penpen wrote:So i found it curious, and i couldn't explain that behavior:
Why does the character read amount differs in these two solutions?
And why is the amount reduced in Aacinis code:
First line, first block: 1021 characters;
Other lines, first block: 1018 characters;
that seems to be just weird!?

So: What am i missing?
I had missed that the input is the temporary file created by findstr /O, so the offsets are part of the blocks read by set /P.
That explains the whole behavior... sometimes I really feel stupid...

penpen

dbenham
Expert
Posts: 2461
Joined: 12 Feb 2011 21:02
Location: United States (east coast)

Re: Processing text files that have unlimited size lines!

#17 Post by dbenham » 03 Oct 2013 11:04

I hate to rain on everyone's parade, but I'm pretty sure it is impossible to use SET /P to reliably read text files with unlimited line length.

The FINDSTR /O technique is a good way to determine where each line begins. But there remains a SET /P problem that cannot be solved without first modifying the input file: SET /P will strip all trailing control characters from the end of the 1023 byte read buffer. The FINDSTR /O trick allows us to determine how many control characters were stripped, but it is impossible to tell what those characters were. Most text files don't have many control characters other than <tab> <CarriageReturn> <LineFeed> and <Ctrl-Z>. But that alone is enough to make it impossible to know for sure what the stripped characters might have been.
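
For illustration, the offset idea can be sketched in a few lines of batch. This is only a sketch and not code from this thread: the script name LineLengths.bat is made up, and it assumes that FOR /F copes with the line lengths involved, which is not verified here for very long lines. It prints the byte length of each line, terminator included, as the difference between consecutive FINDSTR /O offsets.

```batch
@echo off
setlocal EnableDelayedExpansion

rem Usage: LineLengths.bat file.txt   -- the script name is hypothetical
if "%~1"=="" exit /B 1
set "fileSize=%~z1"
set "prevOffset="

rem FINDSTR /O prefixes every line with its byte offset followed by a colon
for /F "delims=:" %%a in ('findstr /O "^" "%~1"') do (
   if defined prevOffset (
      set /A "lineLen=%%a-prevOffset"
      echo offset !prevOffset!: !lineLen! bytes including the line terminator
   )
   set "prevOffset=%%a"
)

rem The last line has no following offset, so it is measured against the file size
if defined prevOffset (
   set /A "lineLen=fileSize-prevOffset"
   echo offset !prevOffset!: !lineLen! bytes up to the end of the file
)
```

Comparing these lengths with the number of characters SET /P actually returns shows how many characters were stripped, but, as said above, not which ones.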

The only way I know to reliably read any file (even binary files) is a clever technique jeb showed me using FC. I used that technique to develop a New function - :hexDump that produces a hex dump of any file.


Dave Benham

penpen

Re: Processing text files that have unlimited size lines!

#18 Post by penpen » 03 Oct 2013 11:28

dbenham wrote:But there remains a SET /P problem that cannot be solved without first modifying the input file: SET /P will strip all trailing control characters from the end of the 1023 byte read buffer.
I think that is not true, but I only know this for WinXP 32 Home.
The only character that is problematic is the NUL character.
All other characters work without problems, as you can see in my order matrix computation script here:
http://www.dostips.com/forum/viewtopic.php?f=3&t=4954

penpen

Edit: You just have to open the resulting text file with a hex editor to see that all other characters can be written successfully, too.

dbenham

Re: Processing text files that have unlimited size lines!

#19 Post by dbenham » 03 Oct 2013 11:48

I know that Set /P can read any character except 0x00, but if the last character on any given "line" is a control character (<= 0x1F), then it will be stripped. Actually it strips all trailing control characters.

So a line like "A<tab><tab><tab><LineFeed>" will be read incorrectly as "A".

But a line like "A<tab><tab><tab>B<LineFeed>" will be read correctly as "A<tab><tab><tab>B"

I've tested on Win 7, and I believe it is also true on XP, but I haven't tested yet.


Dave Benham

penpen

Re: Processing text files that have unlimited size lines!

#20 Post by penpen » 03 Oct 2013 12:06

OK, you are right. Astounding that I never had an issue with that :D ,
but now I have to revise some of my old scripts... :cry:

penpen

einstein1969
Expert
Posts: 941
Joined: 15 Jun 2012 13:16
Location: Italy, Rome

Re: Processing text files that have unlimited size lines!

#21 Post by einstein1969 » 03 Oct 2013 12:24

dbenham wrote:I know that Set /P can read any character except 0x00, but if the last character on any given "line" is a control character (<= 0x1F), then it will be stripped. Actually it strips all trailing control characters.

So a line like "A<tab><tab><tab><LineFeed>" will be read incorrectly as "A".

But a line like "A<tab><tab><tab>B<LineFeed>" will be read correctly as "A<tab><tab><tab>B"

I've tested on Win 7, and I believe it is also true on XP, but I haven't tested yet.


Dave Benham


So a possible workaround would be to add a character to the end and then remove it before using it?

dbenham

Re: Processing text files that have unlimited size lines!

#22 Post by dbenham » 03 Oct 2013 13:34

einstein1969 wrote:So a possible workaround would be to add a character to the end and then remove it before using it?

Yes, but how are you going to do that without first reading the file :?: :twisted:
It would be easy if we leave the world of native batch commands and include JScript, VBS, or PowerShell. But if you are willing to do that, then why use batch at all? Another possibility is to use 3rd party utilities.

The only pure batch way to do it that I am aware of is the FC technique. Again, if you use that, then why bother writing out modified content just so you can read it again with SET /P?
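
For the narrower case where the script itself writes the file it later reads, the sentinel idea could look roughly like this. It is a minimal sketch under stated assumptions: the file name sentinel_demo.txt and the choice of "#" as sentinel are made up, and it does nothing for a file that already exists without a sentinel, which is the objection above.

```batch
@echo off
setlocal EnableDelayedExpansion

rem Hypothetical demo file; any writable path would do
set "demoFile=%TEMP%\sentinel_demo.txt"

rem Writer side: end the line with a printable sentinel "#",
rem so the line never ends in a control character
> "%demoFile%" echo Some data#

rem Reader side: read the line, then drop the last character (the sentinel)
set "line="
< "%demoFile%" set /P "line="
if defined line set "line=!line:~0,-1!"

echo [!line!]
del "%demoFile%"
```

The brackets in the final echo only make the exact extent of the recovered text visible.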

=============================

@penpen: OJBakker described the trailing control character stripping at How Set/p works


Dave Benham

penpen

Re: Processing text files that have unlimited size lines!

#23 Post by penpen » 03 Oct 2013 13:59

I have tested reading & writing using set/P under WinXp 32 home SP3:

Code: Select all

1) "x\n\r"
2) "Ax\n\r"
3) "xA\n\r"
4) "A...1020...A<TAB><TAB><TAB><TAB><TAB>A
5) "AxA\n\r"
6) "Ax...1021...xA\n\r"

The result:
In case 1) all characters with hex value in {0x01, ..., 0x1F} are replaced by 0x20 (<SPACE>).
In case 2) all characters with hex value in {0x01, ..., 0x09, 0x0B, ..., 0x1F} are deleted.
In case 3) the character with the hex value 0x22 is removed.
In case 4) 3 <TAB>s were lost.
In case 5) No changes.
In case 6) The character tested is the hex value 0x01: No changes.

penpen

Edit: Added Test 5 and 6.
Edit2: Removed unneeded info.

Aacini
Expert
Posts: 1885
Joined: 06 Dec 2011 22:15
Location: México City, México

Re: Processing text files that have unlimited size lines!

#24 Post by Aacini » 04 Oct 2013 10:04

Below is the latest version of the method to read long lines. It fixes several problems with the previous code, but it is not perfect yet!

Code: Select all

@echo off

rem ReadLongLines.bat: Process a file with unlimited size lines
rem Antonio Perez Ayala

REM Part required for PATCH
SETLOCAL DISABLEDELAYEDEXPANSION
SET QUOTE="

setlocal EnableDelayedExpansion

for %%a in ("%1") do set lastOffset=%%~Za
set "prevOffset="
set missedLF=0

< "%1" (
for /F "delims=:" %%a in ('findstr /O "^" "%1"') do (

   if defined prevOffset (

      rem Calculate the number of "SET /P" partial lines required to read this file line
      rem excluding the final CR+LF characters, but including any missed LF from previous read
      set /A "lineLen=missedLF+%%a-prevOffset, numParts=(lineLen-2 -1)/1023 +1, missedChars=lineLen%%1023"

      rem Read the next file line in numParts parts
      set "part[!numParts!]="
      for /L %%b in (1,1,!numParts!) do set /P "part[%%b]="

      rem Extract a LF missed from previous read
      if !missedLF! equ 1 (
         if defined part[1] set "part[1]=!part[1]:~1!"
         set missedLF=0
      )

      rem Extract any missed CR+LF or LF characters from end of line
      if !missedChars! lss 3 if defined part[1] (
         if !missedChars! equ 2 (
            rem Extract CR+LF from input stream
            set /P "="
         ) else if !missedChars! equ 1 (
            rem Indicate to extract LF from next read
            set missedLF=1
         )
      )

      REM PATCH: Avoid quotes at beginning or end of each part, and spaces at beginning
      call :PatchLine

      rem Process the line (just show it)
      for /L %%b in (1,1,!numParts!) do set /P "=!part[%%b]!" < NUL
      echo/

   )

   set prevOffset=%%a

)

rem Read and process the last line
set /A "lineLen=missedLF+lastOffset-prevOffset, numParts=(lineLen-2 -1)/1023 +1"
set "part[!numParts!]="
for /L %%b in (1,1,!numParts!) do set /P "part[%%b]="
if !missedLF! equ 1 if defined part[1] set "part[1]=!part[1]:~1!"
REM PATCH: Avoid quotes at beginning or end of each part
call :PatchLine
for /L %%b in (1,1,!numParts!) do set /P "=!part[%%b]!" < NUL
echo/

)
goto :EOF


:patchLine
set i=0
:nextPart
   set /A i+=1, iP1=i+1
   if %i% equ %numParts% goto endLine
   :nextChar
      if "!part[%i%]:~-1!" neq "!quote!" goto checkRightSide
      :shiftChar
      set "part[%i%]=!part[%i%]!!part[%iP1%]:~0,1!"
      set "part[%iP1%]=!part[%iP1%]:~1!"
      goto nextChar
   :checkRightSide
      if "!part[%iP1%]:~0,1!" equ "!quote!" goto shiftChar
      if "!part[%iP1%]:~0,1!" equ " " goto shiftChar
   :endPart
   goto nextPart
:endLine
exit /B

There is a new problem now: if the last read leaves a <LF> in the input stream AND the next line is empty, then the next bytes to read are <LF><CR><LF> instead of the usual <LF>data<CR><LF>. In this case the <LF><CR> characters are read as an empty line that does NOT require removing the first character, AND the following line will have a <LF> at the beginning with no warning about it! :evil:
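
To see which line lengths leave a lone <LF> behind, the part-count arithmetic from the script can be run on its own. This is a sketch only: the line lengths in the list are made-up sample values, and missedLF is assumed to be 0, so lineLen is simply the byte length of the line including its CR+LF.

```batch
@echo off
setlocal EnableDelayedExpansion

rem Hypothetical line lengths in bytes, each including the final CR+LF
for %%L in (10 1024 1025 1026 2047 2048) do (
   set "lineLen=%%L"
   rem Same formulas as in the script, with missedLF assumed to be 0
   set /A "numParts=(lineLen-2-1)/1023+1, missedChars=lineLen%%1023"
   echo lineLen=!lineLen!  numParts=!numParts!  missedChars=!missedChars!
)
```

As in the script, a missedChars value of 2 is the case where the whole CR+LF is still in the input stream, and a value of 1 is the case where only the <LF> is left for the next read.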

Still working on this... :?

Antonio

Sponge Belly
Posts: 216
Joined: 01 Oct 2012 13:32
Location: Ireland

Re: Processing text files that have unlimited size lines!

#25 Post by Sponge Belly » 30 Nov 2013 14:13

Sorry I’m late! ;-)

Here’s some beta code that reads in a file containing a single long line (NULs and newlines not supported). It splits it up and spits it back out in 4095-character chunks by misusing pause and find:

Code: Select all

@echo off & setlocal enableextensions enabledelayedexpansion
if "%~1" neq "/re-enter" (goto init) else call %2 "%~3"
goto end

:init
set "fz=%~z1"
set /a %fz% 2>nul || goto end
set /a splits=fz / 4095
set /a mod=fz %% 4095
if %mod% equ 0 set /a splits-=1

for /f delims^=^ eol^= %%a in ('
call "%~dpf0" /re-enter :split "%~1"
') do echo(%%a

:end
endlocal & goto :eof

:split
setlocal enabledelayedexpansion

for /l %%i in (0 1 %splits%) do (set /a "skip=%%i * 4095"
cmd /von /q /c type "%~1" ^| (^
(for /l %%j in (1 1 !skip!^) do @pause ^>nul^) ^& ^
find /v ""^) & if %%i neq %splits% echo()

endlocal & exit /b 0


Needs more testing. Haven’t fed it poison characters yet. Couldn’t rewrite the subroutine to work inside the main for /f loop’s in (…) clause. Not sure what to do next so I thought I’d post it here and see what you gurus can do with it.

Any suggestions gratefully appreciated. :-)

- SB
