Request for help to speed up batch program for 17,000 TXT files

#1 Post by **cuff123** » 23 Jan 2021 03:36

I have over 17,000 scanned pages (for a local history archive) which I have OCRed with Tesseract into individual TXT files. I want to be able to locate every page containing a search word of more than 3 lowercase letters. So for each TXT file I need to:

Delete all rubbish from the OCR text i.e. non-alphanumeric characters - jrepl "[^a-zA-Z0-9\s]" "" /x /f %%G /O -
Remove 1, 2 and 3 letter words - jrepl "\b\w{1,3}\b" "" /x /f %%G /O -
Change all characters to lower case - jrepl "(\w)" "$1.toLowerCase()" /i /j /x /f %%G /O -
To be able to sort the remaining words they need to be on separate new lines - jrepl "\s" "\n" /x /f %%G /O -
Finally sort all unique words into alphabetic order and create the modified TXT file - sort /UNIQUE %%G /O %%G

I have a batch file that does the above using JREPL, but it is very slow. It has been running for over 100 HOURS and I'm not even halfway. Any suggestions to speed up the processing? I am running Windows 10. Thanks.

This is the batch file I am using:

Code:

Setlocal EnableDelayedExpansion
for %%G in (*.txt) do (
set old=%%G
echo !old!
@echo on

rem remove non-alphanumeric
call jrepl "[^a-zA-Z0-9\s]" "" /x /f %%G /O -

rem remove 1, 2 and 3 letter words
call jrepl "\b\w{1,3}\b" "" /x /f %%G /O -

rem all to lowercase
call jrepl "(\w)" "$1.toLowerCase()" /i /j /x /f %%G /O -

rem replace spaces with new lines
call jrepl "\s" "\n" /x /f %%G /O -

rem reduce to unique words
sort /UNIQUE %%G /O %%G

)
pause
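For readers who want to see what the five steps produce, here is a minimal sketch in plain JavaScript (JREPL's regexes and replacement code run as JScript, so the patterns carry over). The sample sentence and the `indexWords` name are made up for illustration. Blank entries are dropped rather than sorted, and JavaScript's default sort order is assumed to be close enough to SORT for this small example.

```javascript
// Hypothetical sample standing in for one OCRed page.
const page = "The Old MILL, built in 1887 -- was the mill's pride!";

function indexWords(text) {
  // Step 1: delete everything that is not a letter, digit or whitespace.
  const cleaned = text.replace(/[^a-zA-Z0-9\s]/g, "");
  // Step 2: delete words of 1, 2 or 3 characters.
  const longOnly = cleaned.replace(/\b\w{1,3}\b/g, "");
  // Step 3: lower case everything.
  const lower = longOnly.toLowerCase();
  // Step 4: one word per entry (assumption: empty entries are discarded).
  const words = lower.split(/\s/).filter(function (w) { return w !== ""; });
  // Step 5: unique words in sorted order.
  return Array.from(new Set(words)).sort();
}

console.log(indexWords(page).join("\n"));
// 1887
// built
// mill
// mills
// pride
```

Note that "mill's" turns into "mills" because the apostrophe is deleted rather than treated as a separator.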

#2 Post by **dbenham** » 23 Jan 2021 13:54

OMG

The standard Windows SORT command supports the /UNIQUE option, at least on Win 10, even though it is not documented. I had no idea.

Code:

D:\test>sort /?
SORT [/R] [/+n] [/M kilobytes] [/L locale] [/REC recordbytes]
  [[drive1:][path1]filename1] [/T [drive2:][path2]]
  [/O [drive3:][path3]filename3]
  /+n                         Specifies the character number, n, to
                              begin each comparison.  /+3 indicates that
                              each comparison should begin at the 3rd
                              character in each line.  Lines with fewer
                              than n characters collate before other lines.
                              By default comparisons start at the first
                              character in each line.
  /L[OCALE] locale            Overrides the system default locale with
                              the specified one.  The "C" locale yields
                              the fastest collating sequence and is
                              currently the only alternative.  The sort
                              is always case insensitive.
  /M[EMORY] kilobytes         Specifies amount of main memory to use for
                              the sort, in kilobytes.  The memory size is
                              always constrained to be a minimum of 160
                              kilobytes.  If the memory size is specified
                              the exact amount will be used for the sort,
                              regardless of how much main memory is
                              available.

                              The best performance is usually achieved by
                              not specifying a memory size.  By default the
                              sort will be done with one pass (no temporary
                              file) if it fits in the default maximum
                              memory size, otherwise the sort will be done
                              in two passes (with the partially sorted data
                              being stored in a temporary file) such that
                              the amounts of memory used for both the sort
                              and merge passes are equal.  The default
                              maximum memory size is 90% of available main
                              memory if both the input and output are
                              files, and 45% of main memory otherwise.
  /REC[ORD_MAXIMUM] characters Specifies the maximum number of characters
                              in a record (default 4096, maximum 65535).
  /R[EVERSE]                  Reverses the sort order; that is,
                              sorts Z to A, then 9 to 0.
  [drive1:][path1]filename1   Specifies the file to be sorted.  If not
                              specified, the standard input is sorted.
                              Specifying the input file is faster than
                              redirecting the same file as standard input.
  /T[EMPORARY]
    [drive2:][path2]          Specifies the path of the directory to hold
                              the sort's working storage, in case the data
                              does not fit in main memory.  The default is
                              to use the system temporary directory.
  /O[UTPUT]
    [drive3:][path3]filename3 Specifies the file where the sorted input is
                              to be stored.  If not specified, the data is
                              written to the standard output.   Specifying
                              the output file is faster than redirecting
                              standard output to the same file.


D:\test>

I'm glad you posted your question with your code

I don't think it has much impact on performance, but there is no need to store %%G in a variable; you can echo %%G directly. That also eliminates the need for delayed expansion.

Most references to %%G should be quoted in case the file name contains spaces.

Also, the [^...] regex could give the wrong result because the batch CALL statement doubles all quoted ^ characters, so ^ becomes ^^. The first ^ is interpreted as negation as you want, but the second is a literal ^ character. One solution is to use the /XSEQ option along with the non-standard \c escape sequence. Another option is to store the find and replace strings in environment variables and use the /V option.
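To make the effect of the doubled caret concrete, here is a small sketch in plain JavaScript. It only models the regex side: `intended` is the class as typed, and `doubled` is the class as it would look after CALL has doubled the caret. The sample string is invented.

```javascript
// Hypothetical OCR fragment containing a stray caret.
const sample = "mill^pond & co.";

// The class as written in the batch file: strip everything except
// letters, digits and whitespace.
const intended = /[^a-zA-Z0-9\s]/g;

// The class after the caret has been doubled: the first ^ negates,
// the second ^ is a literal member of the "keep" set.
const doubled = /[^^a-zA-Z0-9\s]/g;

const cleanedAsIntended = sample.replace(intended, "");
const cleanedWithDoubled = sample.replace(doubled, "");

console.log(JSON.stringify(cleanedAsIntended));  // "millpond  co"
console.log(JSON.stringify(cleanedWithDoubled)); // "mill^pond  co"
```

The only difference is that carets survive the cleanup, so the problem stays invisible unless the OCR output happens to contain them.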

The speed of JREPL is relative. Compared to pure batch solutions, it is very fast, in addition to being much more powerful. But it is still using a script to do most of the work (JScript). Compared to a compiled program, it is very slow.

You could achieve much faster results with a compiled utility like the Unix sed utility, which you can find for Windows in any number of places.

But your JREPL solution could be optimized and made MUCH faster (more than 100 times faster).

Your first find/replace needs to stay pretty much the same, except for using \c with /XSEQ to prevent doubling the caret.

The /J option in your 3rd CALL must dynamically execute the toLowerCase() function via eval() for every replacement, which is very costly. Using /JQ is a bit more tedious to type, but much faster because it is able to dynamically create a replace function once via eval(), and then call it normally for all of the replacements.
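The difference between the two styles can be modelled roughly as follows. This is a simplified sketch in plain JavaScript, not JREPL's actual source: `replaceWithEval` and `replaceWithCompiled` are hypothetical names, and the wrapper built around the `$txt=` expression is an assumption about how such a function could be created once and reused.

```javascript
const text = "Quick BROWN Fox";

// /J style (simplified model): the expression string is evaluated
// again for every single match.
const slowExpr = "$1.toLowerCase()";
function replaceWithEval(input) {
  return input.replace(/(\w)/g, function ($0, $1) {
    return eval(slowExpr); // parsed and executed once per match
  });
}

// /JQ style (simplified model): the "$txt=..." expression is compiled
// into a function a single time, then called normally for each match.
const quickExpr = "$txt=$1.toLowerCase()";
function replaceWithCompiled(input) {
  const compiled = new Function(
    "$0", "$1",
    "var $txt; " + quickExpr + "; return $txt;"
  );
  return input.replace(/(\w)/g, compiled);
}

console.log(replaceWithEval(text));     // quick brown fox
console.log(replaceWithCompiled(text)); // quick brown fox
```

Both produce the same text; the second pays the parsing cost once instead of once per replacement, which matters when the pattern matches every character of the file.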

But it is possible to use /JMATCHQ instead to reduce the three remaining calls into a single one. Simply search for each word of length 4 or longer and write the lowercase form on a new line via the /JMATCHQ option.

This has not been tested, but I believe it will work

Code:

@echo off
for %%F in (*.txt) do (
  echo %%F

  rem Remove non-alphanumeric characters that aren't whitespace
  call jrepl "[\ca-zA-Z0-9\s]+" "" /xseq /f "%%F" /o -

  rem Write each remaining word >=4 characters as lowercase on a new line
  call jrepl "\S{4,}" "$txt=$0.toLowerCase()" /jmatchq /f "%%F" /o -

  rem Reduce to sorted list of unique words
  sort /unique "%%F" /o "%%F"
)
pause

Dave Benham

#3 Post by **Eureka!** » 23 Jan 2021 19:45

Code:

rem Write each remaining word >=4 characters as lowercase on a new line
  call jrepl "\S{4,}" "$txt=$0.toLowerCase()" /jmatchq /f "%%F" /o -

  rem Reduce to sorted list of unique words
  sort /unique "%%F" /o "%%F"

It might be more efficient to *not* convert the words to lowercase, as this SORT command is case-insensitive.

The remaining unique words can be converted to lowercase in a final step.

BTW:
Without knowing JREPL, doesn't this:

Code:

rem Remove non-alphanumeric characters that aren't whitespace
  call jrepl "[\ca-zA-Z0-9\s]+" "" /xseq /f "%%F" /o -

cause A%&B%&C%&D - which isn't a word - to form "ABCD"?
Maybe convert those ranges of non-alphanumeric characters to a whitespace? Like this:

Code:

  call jrepl "[\ca-zA-Z0-9\s]+" " " /xseq /f "%%F" /o -

(Notice the " " instead of "")
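The two replacement strings can be compared on that exact example with a short sketch in plain JavaScript. The `[^a-zA-Z0-9\s]+` class is what `[\ca-zA-Z0-9\s]+` stands for under /XSEQ; the `words` helper is a hypothetical stand-in for the later `\S{4,}` step.

```javascript
const token = "A%&B%&C%&D";
const junk = /[^a-zA-Z0-9\s]+/g;

// Replacing junk with nothing glues the letters together.
const glued = token.replace(junk, "");
// Replacing junk with a space keeps them apart.
const spaced = token.replace(junk, " ");

// Hypothetical helper: the words of 4 or more characters that the
// next step would pick up.
function words(s) {
  return s.match(/\S{4,}/g) || [];
}

console.log(glued, words(glued));   // ABCD [ 'ABCD' ]
console.log(spaced, words(spaced)); // A B C D []
```

With the empty replacement the page gets indexed under "abcd"; with the space it contributes no word at all.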

#4 Post by **dbenham** » 24 Jan 2021 00:25

Eureka! wrote: It might be more efficient to *not* convert the words to lowercase, as this SORT command is case-insensitive.

Sure, SORT with /UNIQUE will still sort properly and give the correct values, except the output may have mixed case. That may or may not be a problem.

Eureka! wrote: BTW:
Without knowing JREPL, doesn't this:

Code:

rem Remove non-alphanumeric characters that aren't whitespace
call jrepl "[\ca-zA-Z0-9\s]+" "" /xseq /f "%%F" /o -

cause A%&B%&C%&D - which isn't a word - to form "ABCD"?
Maybe convert those ranges of non-alphanumeric characters to a whitespace? Like this:

Code:

call jrepl "[\ca-zA-Z0-9\s]+" " " /xseq /f "%%F" /o -

(Notice the " " instead of "")

Yes indeed it does, which is the exact behavior cuff123 has in his original code. I'm assuming that is the desired behavior.

If non-alphanumeric should be treated as space, then the solution can be reduced to a single JREPL call, followed by SORT.
The JREPL could be:

Code:

call jrepl "[a-zA-Z0-9]{4,}" "$txt=$0.toLowerCase()" /jmatchq /f "%%F" /o -
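As a rough model of what that single call plus SORT /UNIQUE would yield, here is a sketch in plain JavaScript. The `indexPage` name and the sample text are invented, and JavaScript's default sort is assumed to be an acceptable stand-in for SORT on lowercase ASCII words.

```javascript
// Hypothetical page text; punctuation now acts as a word separator.
const pageText = "Mill-pond, MILL and the miller's 1887 map";

function indexPage(text) {
  // Every run of 4 or more letters/digits is a word; everything else
  // (punctuation, whitespace, short runs) is simply never matched.
  const matches = text.match(/[a-zA-Z0-9]{4,}/g) || [];
  // Lower case each match, as the $txt=$0.toLowerCase() code does.
  const lower = matches.map(function (w) { return w.toLowerCase(); });
  // Stand-in for SORT /UNIQUE.
  return Array.from(new Set(lower)).sort();
}

console.log(indexPage(pageText));
// [ '1887', 'mill', 'miller', 'pond' ]
```

Here "Mill-pond" is indexed as "mill" and "pond", and "miller's" as "miller", whereas deleting the punctuation first would have produced "millpond" and "millers".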

Dave Benham

#5 Post by **Eureka!** » 24 Jan 2021 04:40

dbenham wrote (24 Jan 2021 00:25): Sure, SORT with /UNIQUE will still sort properly and give the correct values, except the output may have mixed case. That may or may not be a problem.

That's why the ...

Eureka! wrote (23 Jan 2021 19:45): The remaining unique words can be converted to lowercase in a final step

All words would still be converted to lowercase. It's just that the 'expensive' case conversion would be used on a minimal number of words.

#6 Post by **dbenham** » 24 Jan 2021 06:49

Doh! Of course, I see now. Thanks. That might provide some additional speed. But I don't think the difference is significant enough to worry about.

With the simple J options that use the dynamic eval() it would be a huge difference. But the JQ options encapsulate the call in a user-defined function that performs well.

For example, I tested processing a 6.2 MB file with JREPL 3 different ways. I show the average time for 3 runs of each method.

Original method using slow /JMATCH that relies on eval() to convert to lower case - 235.29 seconds (pathetic performance)

Code:

jrepl ".+" "$0.toLowerCase()" /jmatch /f test.txt /o -

Much faster using /JMATCHQ to convert to lower case - 1.81 seconds (reduced the time by a factor of 130!)

Code:

jrepl ".+" "$txt=$0.toLowerCase()" /jmatchq /f test.txt /o -

Measurable but insignificant performance gain if I skip the toLowerCase() step - 1.77 seconds (only ~2% faster)

Code:

jrepl ".+" "" /match /f test.txt /o -

Back to the original problem - It very well might take longer to add an extra toLowerCase step after the SORT (total of 3 JREPL calls) than just 2 JREPL calls with toLowerCase before SORT.

Dave Benham

#7 Post by **jfl** » 24 Jan 2021 09:15

For quickly searching for strings in a large set of text files, try The Silver Searcher.
The ag.exe program in that zip file is a port for Windows that I maintain, of the Unix tool ag.
It supports full fledged regular expressions, so finding any sequence of valid characters you can think of will be easy.
It's also very fast, especially when doing successive searches in the same set of files. (It's using memory-mapped files, so it's blindingly fast when the files are already in cache!) So you can fine-tune the regular expression and quickly retry over and over again.

Finally, option -o allows displaying only the matching strings, not the rest of the line. So this allows you to get rid of the surrounding garbage, without even using any replace tool like sed.

DosTips.com
