How to QUICKLY merge two input files into one output file

Discussion forum for all Windows batch related topics.

Moderator: DosItHelp

alan_b
Expert
Posts: 357
Joined: 04 Oct 2008 09:49

How to QUICKLY merge two input files into one output file

#1 Post by alan_b » 16 Feb 2012 09:27

This is my existing code :-

Code: Select all

REM assumes delayed expansion is enabled and that LO, HI and the V_ array are already defined
<%1 (
  SET "B_ID=!LO!" & SET "NDX2=!LO!"
  for /L %%j in (!LO!,1,!HI!) do (
    REM clear LN, then read the next line from the redirected input file
    SET "LN=" & SET /P "LN="
    if "!LN:~-2!"=="*]" (
      REM line ends in *] so it is a block header: step to the next V_ entry for this block's sort index
      SET /A B_ID+=1
      FOR %%a IN (!B_ID!) DO SET /A NDX2=!V_%%a!
    )
    ECHO !NDX2!  %%j ]!LN!
  )
) > %2

It uses an array of variables V_100001 through to V_100771 - with room for expansion :)
I dislike the exponential growth of execution time as the size of the array V_nnnnnn increases.

Is there a quick easy way to read a sequence of numbers from a file so that I can avoid using the array ?

My code prefixes each line from %1 with a block index and a line counter, giving a 6000-line output to %2, of which this is a portion:

Code: Select all

100001  100012 ]
100004  100013 ][.Thumbnails*]
100004  100014 ]LangSecRef=3021
100004  100015 ]DetectFile=%UserProfile%\.Thumbnails
100004  100016 ]Default=False
100004  100017 ]FileKey1=%UserProfile%\.Thumbnails|*.*|RECURSE
100004  100018 ]
100002  100019 ][.NET Framework Logs*]
100002  100020 ]LangSecRef=3025
100002  100021 ]DetectFile=%WinDir%\Microsoft.NET
100002  100022 ]Default=False
100002  100023 ]FileKey1=%WinDir%\Microsoft.NET|*.log|RECURSE
100002  100024 ]
100003  100025 ][.NET Reflector*]

The purpose is to prepare the file for
SORT %2 > Sorted.ini

so that block B_ID 100004 (lines 100013 through 100018) will follow block B_ID 100003, which in turn follows block B_ID 100002, and so on.
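
For illustration, after the SORT those particular lines will end up ordered like this, with block 100004 now following block 100003 (the rest of blocks 100001 and 100003 would of course slot in around them):

Code: Select all

100001  100012 ]
100002  100019 ][.NET Framework Logs*]
100002  100020 ]LangSecRef=3025
100002  100021 ]DetectFile=%WinDir%\Microsoft.NET
100002  100022 ]Default=False
100002  100023 ]FileKey1=%WinDir%\Microsoft.NET|*.log|RECURSE
100002  100024 ]
100003  100025 ][.NET Reflector*]
100004  100013 ][.Thumbnails*]
100004  100014 ]LangSecRef=3021
100004  100015 ]DetectFile=%UserProfile%\.Thumbnails
100004  100016 ]Default=False
100004  100017 ]FileKey1=%UserProfile%\.Thumbnails|*.*|RECURSE
100004  100018 ]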

Regards
Alan

Liviu
Expert
Posts: 470
Joined: 13 Jan 2012 21:24

Re: How to QUICKLY merge two input files into one output file

#2 Post by Liviu » 16 Feb 2012 21:40

alan_b wrote:I dislike the exponential growth of execution time as the size of the array V_nnnnnn increases.

There is no obvious quadratic factor in the code you posted. It's possible that just accessing a variable takes time roughly linear in the total number of variables defined, though I was not curious enough to measure.

alan_b wrote:Is there a quick easy way to read a sequence of numbers from a file so that I can avoid using the array ?

There is a neat example of that in the nearby thread "Capturing tokens only once from two .txt files". However in that case there is a 1-to-1 mapping between the lines, which makes sync'ing the reads (one off the redirected input, the other from a 'for' loop) easier, so the solution there won't translate directly to your case. And since there is no easy way (well, QUICK and easy) to break out of a 'for' loop based on an arbitrary condition, I don't see a straightforward adaptation of that idea working here.
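
For reference, the core of that technique looks roughly like this - a sketch only, with placeholder file names, delayed expansion assumed, and a strict 1-to-1 line mapping presumed:

Code: Select all

@ECHO OFF
SETLOCAL EnableDelayedExpansion
REM left.txt arrives on redirected stdin via SET /P while right.txt drives the
REM FOR /F loop, so the two reads advance in lock-step (lines containing bangs need extra care)
< left.txt (
  FOR /F "usebackq delims=" %%B IN ("right.txt") DO (
    SET "L=" & SET /P "L="
    ECHO(!L!  %%B
  )
) > merged.txt

Note that FOR /F silently skips empty lines, which is one more reason the lock-step breaks down as soon as the mapping is not strictly 1-to-1.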

Liviu

alan_b
Expert
Posts: 357
Joined: 04 Oct 2008 09:49

Re: How to QUICKLY merge two input files into one output file

#3 Post by alan_b » 17 Feb 2012 03:07

My explanation was too concise.
The quadratic growth comes from the main input file having, on average, about 11% of its lines being "headers" such as
[.Thumbnails*]
For each such header I have an indexed variable.

It takes 3.330 seconds to read the full-sized input file and write an output with each line prefixed, e.g. by "100004  100013 ]".
After using SORT to create a file "no. 2" with the lines in a suitable sequence,
it took about 3.2 seconds to read no. 2 and write no. 3 with the prefixes removed.

Immediately afterwards I used SETLOCAL at the start of creating the variables,
and ENDLOCAL after the variables had been used to create no. 2 but before reading no. 2;
it then took only 1.850 seconds to read no. 2 and write no. 3.
I deduce that ENDLOCAL removed a 1.350-second penalty incurred by the presence of an array of 771 variables.

Hence the larger the input file, the larger the array;
the larger the array, the longer it takes to read and process each line;
and so the penalty increases with the size of the array.
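
In outline, the bracketing I ended up with looks like the sketch below - the FOR /L is just a stand-in for building the real V_ array, and the two passes are only indicated by comments, not my actual code:

Code: Select all

@ECHO OFF
SETLOCAL EnableDelayedExpansion
REM stand-in for building the real V_100001 .. V_100771 array
FOR /L %%i IN (100001,1,100771) DO SET "V_%%i=%%i"

REM pass 1: use the array to prefix every line of the input, then SORT into file no. 2
REM (see the code in my first post)

ENDLOCAL
REM the 771 V_ variables are gone, so pass 2 (read no. 2, strip the prefixes,
REM write no. 3) no longer pays the penalty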

I never recognized the penalty when I developed my code with a test sample of a few hundred lines,
but now it begins to bite :cry:

A week ago I thought about picking a sequence of "variables" from a second file instead of memory,
but decided the idea would not fly.
Then I saw the thread "Capturing tokens only once from two .txt files" and spent an hour or two trying to find a way to apply it,
but I failed, so I came to my "go-to guys".

Thanks for the reply - I will now get on with my life and not live in false hope.

Regards
Alan

Aacini
Expert
Posts: 1885
Joined: 06 Dec 2011 22:15
Location: México City, México
Contact:

Re: How to QUICKLY merge two input files into one output file

#4 Post by Aacini » 18 Feb 2012 13:21

In this topic you may read an extensive description of the problem of execution times growing with a large number of environment variables, including a description of the cause and a possible method to solve it, at least in part.

If you use that method and the timing tests prove that it does indeed decrease the execution time, please post the results. This problem/solution requires much more complete testing than has been done so far.

Liviu
Expert
Posts: 470
Joined: 13 Jan 2012 21:24

Re: How to QUICKLY merge two input files into one output file

#5 Post by Liviu » 18 Feb 2012 15:19

Aacini wrote:In this topic you may read an extensive description of the problem of execution times growing with a large number of environment variables, including a description of the cause and a possible method to solve it, at least in part.

Interesting reading, thanks for the link.

Liviu

alan_b
Expert
Posts: 357
Joined: 04 Oct 2008 09:49

Re: How to QUICKLY merge two input files into one output file

#6 Post by alan_b » 19 Feb 2012 11:35

Aacini wrote:In this topic you may read an extensive description of the problem of execution times growing with a large number of environment variables, including a description of the cause and a possible method to solve it, at least in part.

If you use that method and the timing tests prove that it does indeed decrease the execution time, please post the results. This problem/solution requires much more complete testing than has been done so far.

Thanks.

I saw that topic when it was started and it left me floundering out of my depth :?

My situation is that I have a fixed array of 771 variables, which I list with the command "SET V_ > VAR.LST" as below:

Code: Select all

V_100001=100001  
V_100002=100004 
V_100003=100002 
V_100004=100003 
:: etc etc etc
V_100768=100768 
V_100769=100769 
V_100770=100770 
V_100771=100771 

The file VAR.LST is only 15 kBytes, so I guess it is minuscule compared to the 1293 kBytes that Dave first mentioned in that post :)
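
For completeness, rebuilding such an array from VAR.LST would just be the reverse of that dump - a single FOR /F, something like this sketch (not part of my actual script):

Code: Select all

REM recreate every V_nnnnnn variable from the VAR.LST dump (sketch only)
FOR /F "usebackq tokens=1* delims==" %%A IN ("VAR.LST") DO SET "%%A=%%B"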

That is a fixed array that is frozen before I launch

Code: Select all

<%1 (
  SET "B_ID=!LO!" & SET "NDX2=!LO!"
  for /L %%j in (!LO!,1,!HI!) do (
    SET "LN=" & SET /P "LN="
    if "!LN:~-2!"=="*]" (
      SET /A B_ID+=1
      FOR %%a IN (!B_ID!) DO SET /A NDX2=!V_%%a!
    )
    ECHO !NDX2!  %%j ]!LN!
  )
) > %2


6000 times the variable LN is cleared and then written by :-
SET "LN=" & SET /P "LN="
Those 6000 iterations correspond to the lines of a 181 kByte file,
each line varying from 10 to perhaps 100 bytes.

During the course of reading those 6000 lines,
there are 771 instances where one of the V_ variables is READ but not re-written, by :-
FOR %%a IN (!B_ID!) DO SET /A NDX2=!V_%%a!

I will admit to using the variable LN when reading a file before the variables V_??? were created,
so maybe the 15 kByte array of V_ variables is being shifted up and down because it is sitting on top of a changing variable LN.

I will try using brand-new, unused variables to replace !LN! and !NDX2!, and hope that avoids re-writing the entire array of V_???
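
If that hypothesis is right the substitution is mechanical - something like the sketch below, where ZLN and ZNDX2 are just hypothetical names chosen to collate after the V_ block (B_ID could be given the same treatment):

Code: Select all

<%1 (
  SET "B_ID=!LO!" & SET "ZNDX2=!LO!"
  for /L %%j in (!LO!,1,!HI!) do (
    SET "ZLN=" & SET /P "ZLN="
    if "!ZLN:~-2!"=="*]" (
      SET /A B_ID+=1
      FOR %%a IN (!B_ID!) DO SET /A ZNDX2=!V_%%a!
    )
    ECHO !ZNDX2!  %%j ]!ZLN!
  )
) > %2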

I will be back - either with success or for a suggestion of alternatives.

Regards
Alan
