Request for help to speed up batch program for 17,000 TXT files
Posted: 23 Jan 2021 03:36
I have over 17,000 scanned pages (for a local history archive) that I have OCRed with Tesseract into individual TXT files. I want to be able to find every page that contains a given search word of more than 3 lower-case letters. So for each TXT file I need to:
Delete all rubbish from the OCR text i.e. non-alphanumeric characters - jrepl "[^a-zA-Z0-9\s]" "" /x /f %%G /O -
Remove 1, 2 and 3 letter words - jrepl "\b\w{1,3}\b" "" /x /f %%G /O -
Change all characters to lower case - jrepl "(\w)" "$1.toLowerCase()" /i /j /x /f %%G /O -
To be able to sort the remaining words they need to be on separate new lines - jrepl "\s" "\n" /x /f %%G /O -
Finally sort all unique words into alphabetic order and create the modified TXT file - sort /UNIQUE %%G /O %%G
I have a batch file that does the above using JREPL, but it is very slow. It has been running for over 100 hours and is not even halfway through. Any suggestions on how to speed up the processing? I am running Windows 10. Thanks.
This is the batch file I am using:
Setlocal EnableDelayedExpansion
for %%G in (*.txt) do (
    set "old=%%G"
    echo !old!
    @echo on
    rem remove non-alphanumeric
    call jrepl "[^a-zA-Z0-9\s]" "" /x /f "%%G" /O -
    rem remove 1, 2 and 3 letter words
    call jrepl "\b\w{1,3}\b" "" /x /f "%%G" /O -
    rem all to lowercase
    call jrepl "(\w)" "$1.toLowerCase()" /i /j /x /f "%%G" /O -
    rem replace spaces with new lines
    call jrepl "\s" "\n" /x /f "%%G" /O -
    rem reduce to unique words (file name quoted in case it contains spaces)
    sort /UNIQUE "%%G" /O "%%G"
)
pause
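Much of the time is probably spent starting processes rather than editing text: each file goes through four JREPL runs plus a sort, so 17,000 files means roughly 85,000 launches. One possible alternative, sketched here in Python rather than batch, does all five steps in memory in a single pass per file. This is only a sketch under assumptions, not a tested replacement: the names index_words and process_folder are made up for illustration, the TXT files are assumed to be UTF-8, and the results go to a separate output folder so the original OCR text is kept.

```python
import re
from pathlib import Path

# Same character class as the first JREPL step: drop everything that is not
# an ASCII letter, an ASCII digit, or whitespace.
NON_ALNUM = re.compile(r"[^a-zA-Z0-9\s]")


def index_words(text: str) -> list[str]:
    """Return the sorted, unique, lower-cased words longer than 3 characters."""
    cleaned = NON_ALNUM.sub("", text)
    words = {word.lower() for word in cleaned.split() if len(word) > 3}
    return sorted(words)


def process_folder(src: str, dst: str) -> int:
    """Write one word list per TXT file in src into dst; return the file count.

    Both folder names are supplied by the caller (hypothetical layout). The
    originals are left untouched so the OCR output can be reprocessed later.
    """
    out_dir = Path(dst)
    out_dir.mkdir(parents=True, exist_ok=True)
    count = 0
    for path in sorted(Path(src).glob("*.txt")):
        # Assumption: Tesseract wrote UTF-8. Undecodable bytes are replaced
        # and then removed by the clean-up step.
        text = path.read_text(encoding="utf-8", errors="replace")
        word_list = "\n".join(index_words(text)) + "\n"
        (out_dir / path.name).write_text(word_list, encoding="ascii")
        count += 1
    return count
```

Calling process_folder("scans", "index") (both folder names are placeholders) would write one word list per page into the index folder. The output is sorted by character code, so words starting with a digit come before words starting with a letter. That may differ slightly from the locale-based order that sort produces.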