Batchscript to extract texts from multiple lines

plasma33
Posts: 22
Joined: 26 Jul 2017 21:18

Re: Batchscript to extract texts from multiple lines

#16 Post by plasma33 » 30 Jul 2017 01:06

@rojo, your code works like a charm too. Thanks for the more efficient code.

Plasma33

Aacini
Expert
Posts: 1885
Joined: 06 Dec 2011 22:15
Location: México City, México

Re: Batchscript to extract texts from multiple lines

#17 Post by Aacini » 30 Jul 2017 11:32

I completed a pure Batch file solution for this problem, comprising two parts. The first part splits the original three-line file into three files with one line each. The first and third files hold the original long lines unchanged, but the second file has the second long line split into shorter (1023-byte) lines so that it can be processed via a FOR /F command; each of those shorter lines then matches the number of bytes read from the first and third files via SET /P commands.

Code: Select all

@echo off
setlocal EnableDelayedExpansion

rem Split a file with 3 very long lines into 3 files:
rem the first and third ones with the original lines 1 and 3 of input file
rem and the second one with shorter lines from line 2
rem http://www.dostips.com/forum/viewtopic.php?f=3&t=4945

echo Processing file, please wait...
for /F %%a in ('copy /Z "%~F0" NUL') do set "CR=%%a"
for /L %%i in (1,1,3) do del input%%i.txt 2> nul
set "in=1"
call :SplitLines < input.txt
goto :EOF


:SplitLines
echo/
echo Reading input line # %in%
set "lineNum=0"
:loopLine
   set /P "line="
   >> input%in%.txt set /P "=%line%" < NUL
   if %in% equ 2 >> input%in%.txt echo/
   set /A "lineNum+=1"
   set /P "=Output line: %lineNum%!CR!" < NUL
if "%line:~1022%" neq "" goto loopLine
echo/
set /A in+=1
if %in% leq 3 goto SplitLines
exit /B

The split method used is explained in this topic. In this case the end of each original line is detected when a SET /P command reads fewer than 1023 bytes; this means that the method will fail if an original line has a length that is a multiple of 1023.
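
To make the failure case concrete, here is a small Python model of that end-of-line test (it is not part of the Batch solution). It assumes each read returns at most 1023 characters and that the end of a line is recognised only when a read comes back shorter than that. The names `CHUNK`, `split_chunks` and `end_detected` are made up for this illustration, and the model deliberately ignores how SET /P treats line terminators.

```python
CHUNK = 1023  # characters per read, as stated in the post


def split_chunks(line: str, size: int = CHUNK) -> list[str]:
    """Cut one long line into consecutive pieces of at most `size` characters."""
    return [line[i:i + size] for i in range(0, len(line), size)]


def end_detected(line: str, size: int = CHUNK) -> bool:
    """Model of the script's test for a non-empty line:
    the end is recognised only if the last piece is shorter than `size`."""
    pieces = split_chunks(line, size)
    return len(pieces[-1]) < size


if __name__ == "__main__":
    for length in (1022, 1023, 2046, 2047):
        print(length, end_detected("C" * length))
```

In this model, lengths 1023 and 2046 report `False`: the last piece is exactly as long as a full read, so nothing marks the end of the line.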


The second part processes the three files created by the first part and generates the desired output:

Code: Select all

@echo off
setlocal EnableDelayedExpansion

rem Process input2.txt as base template and input1/input3 as additional input files,
rem merge the three files and generate output file

echo Processing files, please wait...
for /F %%a in ('copy /Z "%~F0" NUL') do set "CR=%%a"
for /L %%i in (1,1,3) do set "prev%%i="
set /A prevLen=0, line=0
echo Start: %TIME%

(for /F "delims=" %%a in (input2.txt) do (

   rem Read the lines and fix broken alignment between lines
   set /P "ln1=" & set "ln2=%%a" & set /P "ln3=" <&3
   for /L %%i in (1,1,3) do (
      set "ln%%i=.!prev%%i!!ln%%i!."
      set "prev%%i="
   )
   set /A totLen=1023+prevLen, prevLen=0, line+=1
   set /P "=Input line: !line!!CR!" < NUL > CON

   rem Generate output accordingly to template in input2
   for /L %%i in (0,1,!totLen!) do (
      if "!ln2:~%%i,2!" equ ".|" (
         set /A beg=%%i+1
      ) else if "!ln2:~%%i,2!" equ "|." (
         set /A len=%%i-beg+1
         if %%i lss !totLen! (
            for %%m in ("!beg!,!len!") do for /L %%j in (1,1,3) do echo !ln%%j:~%%~m!
            echo/
         ) else (
            rem Possible alignment broken at end of line
            for %%m in ("!beg!,!len!") do for /L %%j in (1,1,3) do set "prev%%j=!ln%%j:~%%~m!"
            set "prevLen=!len!"
         )
      )
   )

)) < input1.txt  3<input3.txt  > output.txt

echo End:   %TIME%


I tested this solution with the 2.86 MB input file and the output was generated correctly. Of course, this method is much slower than the C#, JScript or PowerShell ones, but it is simpler to modify: you just need to know Batch file programming in order to do so...
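
For anyone who wants to check the extraction rule apart from the Batch mechanics, the inner loop of the second script can be modelled in Python roughly as follows. This is only a sketch under assumptions: it treats each line as one in-memory string, it leaves out the carry-over (`prev1`..`prev3`) that repairs blocks cut at a 1023-byte boundary, and it assumes the template line uses "." for gaps and "|" for marked columns, as the script's comparisons suggest. The function name `extract_blocks` and the sample strings are invented for illustration.

```python
def extract_blocks(ln1: str, ln2: str, ln3: str) -> list[list[str]]:
    """Return, for every run of '|' in the template ln2, the matching
    column slice from all three lines (same order as the script's output)."""
    # The script wraps each line in '.' sentinels so that runs touching
    # either edge are still bracketed by ".|" and "|.".
    rows = ["." + s + "." for s in (ln1, ln2, ln3)]
    template = rows[1]
    blocks = []
    beg = 0
    for i in range(len(template) - 1):
        pair = template[i:i + 2]
        if pair == ".|":      # a run of '|' starts at the next column
            beg = i + 1
        elif pair == "|.":    # the run ends at column i
            length = i - beg + 1
            # No carry-over here: a run touching the end is emitted directly.
            blocks.append([row[beg:beg + length] for row in rows])
    return blocks


if __name__ == "__main__":
    for block in extract_blocks("ABCDEFG", "..||.|.", "abcdefg"):
        print("\n".join(block))
        print()
```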

Antonio

penpen
Expert
Posts: 1992
Joined: 23 Jun 2013 06:15
Location: Germany

Re: Batchscript to extract texts from multiple lines

#18 Post by penpen » 30 Jul 2017 13:11

There are two minor bugs in your code.

The first one is in your splitting batch, and only occurs with specific line lengths, for example 1022 characters (if you use "\r\n" as the end-of-line marker):
If the newline character ('\n') ends up as the 1024th character of the "set/p" input buffer, then your algorithm will correctly notice the end of the line,
but the next "set/p" will read the newline character and the next line's data ("\n|||||..." in the following example), producing a "\r\n" as the first characters in "input2.txt".

Sample file that provokes this issue on line 2:

Code: Select all

CCC.............1022.......C
|||.............1022.......|
CCC.............1022.......C

This issue should also be possible for line 3 (with a much bigger sample file).

The second issue is that leading spaces are ignored by set/p, so you might lose some spaces in "input2.txt":

Code: Select all

Z:\><nul set /P "= 1 " & echo(#
1 #


penpen

plasma33
Posts: 22
Joined: 26 Jul 2017 21:18

Re: Batchscript to extract texts from multiple lines

#19 Post by plasma33 » 01 Aug 2017 20:47

Thanks guys!!

Plasma33
