Processing text files that have unlimited size lines!

#1 Post by **Aacini** » 29 Sep 2013 00:48

In this SO topic the OP asked how to process text files of about 30-40 KB that contain no newlines, splitting their contents into lines at each "$" character. The size of the file rules out both the FOR /F command (which can read lines of up to 8 KB) and the SET /P command (which can read lines of up to 1 KB).

After some tests I discovered that when a SET /P command reaches its maximum size and returns the characters read, it does NOT move the file pointer of the input file past the rest of the line; this means that the next SET /P command over the same file (redirected to a code block) continues reading the following characters. This makes it possible to read the contents of a text file in blocks of 1 KB and to accumulate and process them in any way, as long as the result fits in a variable, that is, is no longer than 8 KB.

Code: Select all

@echo off
setlocal EnableDelayedExpansion

set nextBlock=0
set "thisBlock="

:nextBlock

   rem Read the next block of characters
   set /A nextBlock+=1
   (for /L %%i in (1,1,%nextBlock%) do (
      set "block="
      set /P "block="
   )) < input.txt
   if not defined block goto endFile

   rem Append next block to last part of previous block
   set "thisBlock=!thisBlock!!block!"

   rem Process current block: separate lines at "$" character
   :nextLine
      for /F "tokens=1* delims=$" %%a in ("!thisBlock!") do (
         set "nextLine=%%a"
         set "thisBlock=%%b"
      )
      if defined thisBlock (
         echo !nextLine!
         goto nextLine
      )
   set "thisBlock=!nextLine!"
   goto nextBlock

:endFile

if defined thisBlock echo !thisBlock!

The previous program is not efficient, because every time it needs the next block it reads again all the data already processed, but it is an example of how to do it. This method can be modified to read the file just once.

Antonio

#2 Post by **aGerman** » 29 Sep 2013 16:57

Antonio

To get a useful tool for any kind of plain text file we have to determine where a line actually ends. Did you find a workaround? I hoped that SET /P would return an errorlevel greater than 0 if the line end isn't reached, but unfortunately it doesn't.

Regards
aGerman

#3 Post by **jeb** » 30 Sep 2013 01:58

Aacini wrote: After some tests I discovered that when a SET /P command reaches its maximum size and returns the characters read, it does NOT move the file pointer of the input file past the rest of the line; this means that the next SET /P command over the same file (redirected to a code block) continues reading the following characters. This makes it possible to read the contents of a text file in blocks of 1 KB and to accumulate and process them in any way, as long as the result fits in a variable, that is, is no longer than 8 KB.

Nice, I found out the same (SO: One line is greater than variable's max size. Workaround?), but with the same problem of detecting the line ends.

But this can be solved with findstr /o, which prints the character offset of each line, so you can create a list of the length of each line.

Code: Select all

findstr /o ^^ myFile.txt

Now you can split the input at the correct characters.
But obviously here is a bit of work left.
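
A minimal sketch of how such a list could be built, assuming the file is passed as the first argument and that FOR /F can digest every line of the findstr output (which may not hold for very long lines). The names `file`, `prev` and `len` are only illustrative. It subtracts consecutive offsets to get the byte span of each line, and uses the file size to close the last one:

```batch
@echo off
setlocal EnableDelayedExpansion

rem Sketch only: "file", "prev" and "len" are illustrative names.
rem Assumes %1 names an existing text file.
set "file=%~1"
set "prev="

rem findstr /O prefixes each line with "offset:", so token 1 is the offset
for /F "delims=:" %%o in ('findstr /O "^" "%file%"') do (
   if defined prev (
      set /A "len=%%o-prev"
      echo offset !prev!: !len! bytes
   )
   set "prev=%%o"
)

rem The last line runs from its offset to the end of the file
if defined prev (
   for %%f in ("%file%") do set /A "len=%%~zf-prev"
   echo offset !prev!: !len! bytes
)
```

Each span includes the line terminator where one is present, so for a CR+LF-terminated line the content length is the span minus 2.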

jeb

#4 Post by **penpen** » 30 Sep 2013 05:31

As far as I know (I have only used it under Win XP), the end of the line has been reached if no further data can be read:

Code: Select all

@echo off
call :processFile "Test.txt"
goto :eof


:processFile
::   %~1 filename
   setlocal
   for %%a in ("%~1") do set "attributes=%%~aa"
   if not defined attributes exit /b 1
   if not "%attributes:~0,1%" == "-" exit /b 2
   set "attributes="

   rem:>"Debug.txt"

   for /f "tokens=1* delims=:" %%a in ('findstr /N "^" "%~1"') do set "N=%%a"

   (
      for /L %%a in (1,1,%N%) do call :processLine
   ) < "%~1"
   endlocal
goto :eof


:processLine
   echo "[new line]" >> Debug.txt
:processNextPart
   set "input="
   set /P "input="
   if not defined input goto :eof
   set "\n="
   set \n=%input:~0,1%
   if not defined \n (
      (echo \n found as first char) >> Debug.txt
      set "input=%input:~1%"
   )
   (echo "%input%") >> Debug.txt

   if not defined \n (
      if "%input:~1021,1%" == "" goto :eof
   ) else (
      if "%input:~1022,1%" == "" goto :eof
   )

   goto :processNextPart

penpen

Edit: I forgot the line: if "%input:~1022,1%" == "" goto :eof
Edit2: Removed a bug that could break the batch script.

#5 Post by **jeb** » 30 Sep 2013 10:41

Hi penpen,

but the point of this thread is lines of unlimited size!
The problem is detecting the line end when the line is longer than 1021 or 8192 characters.

Lines longer than 8191 characters can't be read with FOR, as a parameter can only be expanded if its content is less than 8192 characters long.
SET /P can read lines, but at most ~1021 characters (or up to the next line end); when SET /P stops at the 1021 limit, the next SET /P reads the following characters from the same line.

Edit: Fixed my wrong understanding of penpen's post

jeb

#6 Post by **aGerman** » 30 Sep 2013 11:11

@jeb
You didn't convince me :lol:

If you have to process the output of FINDSTR /O in a FOR /F loop what's the advantage then? One of the greatest benefits of the SET /P technique is the fast execution of the FOR /L loop while reading the file. I assume you will lose that completely.

@penpen
You don't find the end of the line that way. Only an empty line could be detected.
I created a file with 4 lines.
1st line: 1025 times X
2nd line: 10 times Y
3rd line: empty
4th line: 10 times Z

The output of your code:

Code: Select all

"[new line]" 
"XXX...XXX" (1023 Xs total)
"XX" 
"YYYYYYYYYY" 
"[new line]" 
"ZZZZZZZZZZ" 
"[new line]" 
"[new line]"

Regards
aGerman

#7 Post by **penpen** » 30 Sep 2013 12:34

Sorry, I forgot one line; I've added it above. It should work now.

penpen

Edit: But I just found a bug, and I'm trying to get rid of it.
I found out that \n may be the first character of a line.
The bug can be seen with this Test.txt:
1022 times X
1023 times Y
1024 times Z

And this one doesn't produce the bug on my system (so it does not show up every time, which is why I call it a bug):
1020 times U
1021 times V
1022 times W
1023 times X
1024 times Y
1025 times Z

Edit 2: I added the now (hopefully) bug-free code to my post above.

#8 Post by **aGerman** » 30 Sep 2013 15:35

Looks OK now even if the "\n found as first char" is pretty confusing to me. I have to take a closer look into the code if I have more time.
I increased and decreased the line lengths, and also tried with and without empty lines. Everything has worked so far.

Regards
aGerman

#9 Post by **Aacini** » 30 Sep 2013 18:54

jeb wrote:
Aacini wrote: After some tests I discovered that when a SET /P command reaches its maximum size and returns the characters read, it does NOT move the file pointer of the input file past the rest of the line; this means that the next SET /P command over the same file (redirected to a code block) continues reading the following characters. This makes it possible to read the contents of a text file in blocks of 1 KB and to accumulate and process them in any way, as long as the result fits in a variable, that is, is no longer than 8 KB.

Nice, I found out the same (SO: One line is greater than variable's max size. Workaround?), but with the same problem of detecting the line ends.

But this can be solved with findstr /o, which prints the character offset of each line, so you can create a list of the length of each line.
Code: Select all
findstr /o ^^ myFile.txt
Now you can split the input at the correct characters.
But obviously here is a bit of work left.

jeb

Oops! I thought I was the first one who discovered this! :oops:

I tried to develop the method to process unlimited size lines in a general way, but I had some problems. According to my design the method should work, but it occasionally deletes blank spaces or commas. It seems that the problem is related to the characters that SET /P can show as a prompt (I am using Windows 8). This would mean that the long line is read correctly, so if the line were processed in another way (that is, not shown with "SET /P"), the result should be correct. However, I have no time to do more tests now. I am posting the code here so you can do additional tests on it.

Code: Select all

@echo off

rem ReadLongLines.bat: Process a file with unlimited size lines
rem Antonio Perez Ayala

setlocal EnableDelayedExpansion

rem Include line offsets in the file
findstr /O "^" "%1" > "%1.tmp"

rem Create an array with such offsets
set numLines=0
for /F "usebackq delims=:" %%a in ("%1.tmp") do (
   set offset[!numLines!]=%%a
   set /A numLines+=1
)

rem Add the last offset (file size)
for %%a in ("%1.tmp") do set offset[!numLines!]=%%~Za

rem Process each line
set /A prevOffset=0, offsetLen0=0
< "%1.tmp" (for /L %%a in (1,1,!numLines!) do (

   rem Calculate the number of "SET /P" partial lines required to read this file line
   rem including "offset" + ":" part, but excluding the final CR+LF characters
   for /L %%b in (!offsetLen0!,1,9) do (
      if "!prevOffset:~%%b,1!" neq "" set offsetLen0=%%b
   )
   set /A "numParts=(offset[%%a]-prevOffset+(offsetLen0+1)+1-2 -1 )/1023 +1, prevOffset=offset[%%a]"

   rem Read the next file line in numParts parts
   for /L %%b in (1,1,!numParts!) do set /P "part[%%b]="

   rem Remove the "offset:" from first part
   set "part[1]=!part[1]:*:=!"
   if not defined part[1] set numParts=0

   rem Process the line (just show it)
   for /L %%b in (1,1,!numParts!) do set /P "=!part[%%b]!" < NUL
   echo/

))

del "%1.tmp"

Antonio

#10 Post by **jeb** » 01 Oct 2013 00:32

Aacini wrote: Oops! I thought I was the first one who discovered this!

No, I suppose it was already described by Dave or OJBakker in New technic: set /p can read multiple lines from a file
and How Set/p works.

Your solution looks like the one I was thinking about, but I suppose it should be possible to use only one FOR /F loop both to read the line offsets and to read the content:

Code: Select all

< "%1.tmp" ( 
   for /F "delims=:" %%o in ('findstr /o "^" test.bat') do (
      set /a offset=%%o
           ... read the part of the line with set /p
   )
)

#11 Post by **penpen** » 01 Oct 2013 07:46

I've optimized my script a little bit:

Code: Select all

@echo off
if not "%~1" == "" goto :processFile
cmd /K read2.bat "Test.txt"
goto :eof


:processFile
::   %~1 filename
setlocal enableDelayedExpansion
for %%a in ("%~1") do set "attributes=%%~aa"
if not defined attributes exit /b 1
if not "!attributes:~0,1!" == "-" exit /b 2
set "attributes="
set \n=^


for /f "tokens=1* delims=:" %%a in ('findstr /N "^" "%~1"') do set "N=%%a"
(
   set "line=1"

   rem:>"Debug.txt"
   (
      echo [start of file: "%~1"]
      echo [line !line!]
   ) >> Debug.txt

   rem allow max filesize of 2^64 bytes: 3 for /L loops
   for /L %%a in (0,1,8) do (
   for /L %%a in (0,1,0x40000000) do (
   for /L %%a in (0,1,0x40000000) do (
      set "input="
      set "first="
      set "last="

      if !line! LEQ !N! (
         set /P "input="

         if defined input (
            set first=!input:~0,1!
            set last=!input:~1022,1!

            if "!first!" == "!\n!" (
               set input=!input:~1!
               (echo [\n ^(0x0A^) found as first char]) >> Debug.txt
            )
         )

         rem process input
         (echo "!line!: !input!") >> Debug.txt

         if not defined last (
            rem line end
            set /A "line+=1"
            (echo [line !line!]) >> Debug.txt
         )
      ) else (
         rem end of file
         (echo [end of file]) >> Debug.txt
         endlocal
         exit
      )
   )
   )
   )
) < "%~1"
endlocal
(echo unknown error) >> Debug.txt
exit

The way this script is called and exited is still a little bit odd, but it should work, and I actually have no idea how to avoid this.
But it makes no further calls, so it should be a little bit faster.

The only curious thing is: why does my solution have the problem with \n while your solution has no problem with it? Am I missing something?

penpen

#12 Post by **aGerman** » 01 Oct 2013 09:52

The only curious thing is: why does my solution have the problem with \n

It's only a guess ...
A Windows line break is made from two characters, that is Carriage Return and Line Feed.
I remember a code snippet from jeb where he used the string limit to extract a Carriage Return from the end of the string. I think it's a similar behavior with your code. If the file pointer stops between CR and LF the CR belongs to the first read block while the LF will be read into the next block. The stand-alone CR is removed by the parser as jeb explained at SO while the LF is stable.

@jeb Would you agree?

Regards
aGerman

#13 Post by **Aacini** » 01 Oct 2013 10:38

jeb wrote:
Aacini wrote: Oops! I thought I was the first one who discovered this!

No, I suppose it was already described by Dave or OJBakker in New technic: set /p can read multiple lines from a file
and How Set/p works.

You are right! However, the possibility of continuing to read a long line in parts is only briefly mentioned at the end of the second topic; the main part of those extensive topics focuses on reading lines no longer than 1023 characters...

jeb wrote: Your solution looks like the one I was thinking about, but I suppose it should be possible to use only one FOR /F loop both to read the line offsets and to read the content
Code: Select all
< "%1.tmp" ( 
   for /F "delims=:" %%o in ('findstr /o "^" test.bat') do (
      set /a offset=%%o
           ... read the part of the line with set /p
   )
)

Yes, of course! This method produces simpler code, although it requires duplicating the processing part after the FOR in order to read the last line...

I debugged the method and it now correctly reads any text file with unlimited size lines (within the known restrictions of the SET /P command on reading lines). However, if you want to write the long lines, we are limited again by the restrictions of the SET /P command on writing strings!

I tried to patch the data read in order to work around the SET /P problems. My method tries to avoid quotes at the beginning or end of each part, and spaces at the beginning of a part, by moving these characters to the adjacent part. However, there is no way to avoid these characters at the beginning of the first part or at the end of the last one! :evil:

If the required processing of the long line does not involve SET /P, it works perfectly!

Code: Select all

@echo off

rem ReadLongLines.bat: Process a file with unlimited size lines
rem Antonio Perez Ayala

REM Part required for PATCH
SETLOCAL DISABLEDELAYEDEXPANSION
SET QUOTE="

setlocal EnableDelayedExpansion

for %%a in ("%1") do set lastOffset=%%~Za
set "prevOffset="

< "%1" (
for /F "delims=:" %%a in ('findstr /O "^" "%1"') do (

   if defined prevOffset (

      rem Calculate the number of "SET /P" partial lines required to read this file line
      rem (excluding the final CR+LF characters)
      set /A "lineLen=%%a-prevOffset, numParts=(lineLen-2 -1)/1023 +1, missedChars=lineLen%%1023"

      rem Read the next file line in numParts parts
      set "part[!numParts!]="
      for /L %%b in (1,1,!numParts!) do set /P "part[%%b]="

      rem Extract any missed LF or CR+LF characters from end of line
      if !missedChars! lss 3 if defined part[1] set /P "="

      REM PATCH: Avoid quotes at beginning or end of each part, and spaces at beginning
      call :PatchLine

      rem Process the line (just show it)
      for /L %%b in (1,1,!numParts!) do set /P "=!part[%%b]!" < NUL
      echo/

   )

   set prevOffset=%%a

)

rem Read and process the last line
set /A "lineLen=lastOffset-prevOffset, numParts=(lineLen-2 -1)/1023 +1"
set "part[!numParts!]="
for /L %%b in (1,1,!numParts!) do set /P "part[%%b]="
REM PATCH: Avoid quotes at beginning or end of each part
call :PatchLine
for /L %%b in (1,1,!numParts!) do set /P "=!part[%%b]!" < NUL
echo/

)
goto :EOF


:patchLine
set i=0
:nextPart
   set /A i+=1, iP1=i+1
   if %i% equ %numParts% goto endLine
   :nextChar
      if "!part[%i%]:~-1!" neq "!quote!" goto checkRightSide
      :shiftChar
      set "part[%i%]=!part[%i%]!!part[%iP1%]:~0,1!"
      set "part[%iP1%]=!part[%iP1%]:~1!"
      goto nextChar
   :checkRightSide
      if "!part[%iP1%]:~0,1!" equ "!quote!" goto shiftChar
      if "!part[%iP1%]:~0,1!" equ " " goto shiftChar
   :endPart
   goto nextPart
:endLine
exit /B
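
The part-count arithmetic from the script above can be tried in isolation. This is a small sketch, not part of the original script; the lineLen values are made-up examples (content length plus 2 bytes for CR+LF) and are not taken from any test file in this thread:

```batch
@echo off
setlocal EnableDelayedExpansion

rem Illustrative lineLen values: content length plus 2 bytes for CR+LF
for %%n in (2 12 1025 1026 2048 2049) do (
   rem Same expressions as in the script: parts needed and leftover chars
   set /A "lineLen=%%n, numParts=(lineLen-2-1)/1023+1, missedChars=lineLen%%1023"
   echo lineLen=!lineLen! numParts=!numParts! missedChars=!missedChars!
)
```

SET /A uses integer division, so a line with 1023 content characters (lineLen 1025) gives numParts=1, while one with 1024 content characters (lineLen 1026) gives numParts=2.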

@aGerman:

The way CR+LF may be split at the end of a SET /P read is explained by Liviu at the end of the second link provided by jeb ("How Set/p works").

EDIT: It seems that my code may have an error! When a read causes a CR+LF to be split and the LF is left behind, I need to check whether this LF is read by the next SET /P and included in the next part. If so, the fix is simple: if missedChars is 1, set a flag in order to eliminate the first character from the next read! But I have no time to do that now... :cry:

Antonio

#14 Post by **aGerman** » 01 Oct 2013 11:58

How could I overlook that link?

Thanks for pointing it out.

#15 Post by **penpen** » 01 Oct 2013 14:00

Sorry, aGerman, you have misunderstood what I wanted to say about the \n problem.
I hadn't really described it, so it is my own fault, unless you could read my thoughts via the internet :oops:

.

I have used this Test.txt:

Code: Select all

_V...1434...........V
_W...1021....W
_X...1020...X
_Y...1021....Y

_Z...1021....Z

I get this result:

Code: Select all

[start of file: "Test.txt"]
[line 1]
"1: _V...1022....V"
"1: V.412.V"
[line 2]
"2: _W...1021...W"
[line 3]
[\n (0x0A) found as first char]
"3: _X...1020..X"
[line 4]
[\n (0x0A) found as first char]
"4: _Y...1021...Y"
"4: "
[line 5]
"5: "
[line 6]
"6: _Z...1021...Z"
[line 7]
[end of file]

And this is produced by Aacini's (second) code (with this last line: del "%1.tmp"); I have added a + as the first character of each SET /P output so one can see the breaks:

Code: Select all

+_V...1020.......V+V.....414.....V
+_W...1017....W+W...4....W
+_X...1017....X+X...3...X
+_Y...1017....Y+Y...4....Y

+_Z...1017....Z+Z...4....Z

And if I remove 5 W's from Test.txt to provoke the error (I too assume \r\n is the line ending and want to split it), it just runs without errors:

Code: Select all

+_V...1020.......V+V.....414.....V
+_W...1016....W
+_X...1017....X+X...3...X
+_Y...1017....Y+Y...4....Y

+_Z...1017....Z+Z...4....Z

So I couldn't provoke this error in Aacini's script, no matter what I did...

.

I can provoke an error if I remove 4 W's from the Test.txt above:

Code: Select all

+_V...1020...V+V...414...V

+_W...1017...W
+_W...1017...W+2457:_X...1017...X
+XXX+3480:_Y...1017...Y
+YYYY

But I assume this is another issue that has nothing to do with the behavior above.

I hope I've counted the resulting chars right.

So I found it curious, and I couldn't explain that behavior:
Why does the number of characters read differ between these two solutions?
And why is the number reduced in Aacini's code:
First line, first block: 1021 characters;
Other lines, first block: 1018 characters;
that just seems weird!?

So: What am I missing?

I don't think it is a timing problem, as the behavior is the same on multiple runs, on 3 PCs with different hardware (1000 runs on each of the 3, all Win XP Home), but I'm not sure.

penpen

Edit: I used Win XP Home 32-bit (all updates, including optional ones, up to now).
Edit2: Specified which of Aacini's scripts I had tested.
Edit3: Added why I find these results curious.
Edit4: Added test run result (1000 x 3).

DosTips.com
