Extract missing record

Discussion forum for all Windows batch related topics.

Moderator: DosItHelp

darioit
Posts: 230
Joined: 02 Aug 2010 05:25

Extract missing record

#1 Post by darioit » 01 Feb 2024 11:05

Hello,
I need your help extracting missing rows. The problem is as follows:
In a large, ordered list of records, the names normally come in pairs of variable length, as in the example below. The second record of a pair must have the same xx_xx_xx name as the first, followed by a fixed X1 after the last underscore:
FILEA_FILEB_FILEC.PDF
FILEA_FILEB_FILEC_X1SOME.PDF

I need to extract all records that do not have this match. A test file follows.

file.txt:
ABCDEFG_RST1058_2021111M.pdf
CDEFGH_CJO0023_2021112M.pdf
CDEFGHO_QBC5638_2021121C.pdf
CDEFGHO_QBC5638_2021121C_X1I1234567.pdf
FGHOP_XYA7662_2022011C.pdf
FGHOP_XYA7662_2022011C_X1I23456785.pdf
EFGHOPQ_CJ21234_2021121CLKJH.pdf
EFGHOPQ_CJ21234_2021121CLKJH_X1I3456789.pdf
EFGHOPQ_CJ21234_2021121M_X1I4567890.pdf
FGHOPXR_CJU3971_2021120M.pdf

Results missing X1 record:
ABCDEFG_RST1058_2021111M.pdf
CDEFGH_CJO0023_2021112M.pdf
FGHOPXR_CJU3971_2021120M.pdf

Result only X1 record
EFGHOPQ_CJ21234_2021121M_X1I4567890.pdf

Thank you very much in advance
Dario

Squashman
Expert
Posts: 4470
Joined: 23 Dec 2011 13:59

Re: Extract missing record

#2 Post by Squashman » 01 Feb 2024 13:12

I really think you could attempt this one on your own. Just one solution would be to do the following.
1) FOR /F command to read the text file.
2) FOR /F command to split apart the base file name into multiple FOR variable tokens.
3) If token 4 is not blank, test whether tokens 1_2_3.pdf is in the file. If not, echo the file name.
4) If token 4 is blank, test whether tokens 1_2_3_X1*.pdf is in the file. If not, echo the file name.
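
The four steps above could be sketched roughly as follows. This is only an illustration, not code posted in the thread: it assumes the list is saved as file.txt in the current directory, it uses findstr for the "is in the file" tests, and the "Missing X1:" and "Only X1:" labels are made up for this example.

```batch
@echo off
setlocal
rem Assumption: file.txt holds one file name per line, as in the first post.
for /F "usebackq delims=" %%f in ("file.txt") do (
   rem Split the base name, without the extension, on underscores.
   for /F "tokens=1-4 delims=_" %%a in ("%%~Nf") do (
      if "%%d" neq "" (
         rem Token 4 is present, so this is an X1 record. Look for its base record.
         findstr /X /I /L /C:"%%a_%%b_%%c.pdf" "file.txt" >nul || echo Only X1: %%f
      ) else (
         rem Token 4 is blank, so this is a base record. Look for a matching X1 record.
         findstr /B /I /L /C:"%%a_%%b_%%c_X1" "file.txt" >nul || echo Missing X1: %%f
      )
   )
)
```

Note that this runs findstr over the whole list once per record, so the work grows with the square of the number of records. It is meant to show the idea, not to be fast.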

This is basically a similar concept to your last question. The only difference being you are reading a text file instead of parsing the DIR command.
viewtopic.php?f=3&t=10571&p=67827#p67827

darioit
Posts: 230
Joined: 02 Aug 2010 05:25

Re: Extract missing record

#3 Post by darioit » 01 Feb 2024 14:15

Yes, you are right. The problem is that with millions of records it takes too long, so I decided to do a dir of the directory first. I'll try to work on it and post the solution, to see whether I can solve the problem.

Squashman
Expert
Posts: 4470
Joined: 23 Dec 2011 13:59

Re: Extract missing record

#4 Post by Squashman » 01 Feb 2024 17:13

darioit wrote:
01 Feb 2024 14:15
Yes, you are right. The problem is that with millions of records it takes too long, so I decided to do a dir of the directory first. I'll try to work on it and post the solution, to see whether I can solve the problem.
With a file that big, you are going to see slow processing with the FOR /F command reading the file as well. The entire file is read into memory before it is parsed. The same applies to FOR /F parsing the DIR command: the DIR command has to finish before FOR /F begins parsing the output.

Your BEST bet is to use a basic FOR command to read the directory, then use a FOR /F to split up the file name. So take my original pseudo code and just do a standard FOR command first.
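
As a rough skeleton of that suggestion (a sketch only, assuming it is run from the directory that holds the PDF files; the "base record" and "X1 record" labels are made up for this example):

```batch
@echo off
rem The plain FOR hands over one file name at a time.
for %%f in (*.pdf) do (
   rem The inner FOR /F only splits that single name on underscores.
   for /F "tokens=1-4 delims=_" %%a in ("%%~Nf") do (
      if "%%d" equ "" (
         echo base record: %%f
      ) else (
         echo X1 record: %%f
      )
   )
)
```

The pairing tests from the pseudo code would then replace the two echo lines.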

Aacini
Expert
Posts: 1885
Joined: 06 Dec 2011 22:15
Location: México City, México
Contact:

Re: Extract missing record

#5 Post by Aacini » 02 Feb 2024 00:32

The solution to this problem has a subtle trick! :shock: :wink:

I think this is the fastest method to solve this problem:

Code:

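@echo off
setlocal EnableDelayedExpansion

rem The variable last holds the base name of a record still waiting for its X1 partner.
rem Orphan X1 records are written to stderr, which is collected in onlyX1.txt.
echo Results missing X1 record:
set "last="
(for %%f in (*.pdf) do (
   rem Split on _ and . so that token 4 is pdf for a base record.
   for /F "tokens=1-4 delims=_." %%a in ("%%f") do (
      if not defined last (
         if "%%d" equ "pdf" (
            set "last=%%~Nf"
         ) else (
            >&2 echo %%f
         )
      ) else (
         if "%%a_%%b_%%c" equ "!last!" (
            rem The X1 partner of the pending base record was found.
            set "last="
         ) else (
            rem The pending base record has no X1 partner, so report it.
            echo !last!.pdf
            if "%%d" equ "pdf" (
               set "last=%%~Nf"
            ) else (
               >&2 echo %%f
               set "last="
            )
         )
      )
   )
)) 2> onlyX1.txt
rem A base record left pending after the loop has no partner either.
if defined last echo %last%.pdf

echo/
echo Result only X1 record:
type onlyX1.txt
del onlyX1.txt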
Output:

Code:

Results missing X1 record:
ABCDEFG_RST1058_2021111M.pdf
CDEFGH_CJO0023_2021112M.pdf
FGHOPXR_CJU3971_2021120M.pdf

Result only X1 record:
EFGHOPQ_CJ21234_2021121M_X1I4567890.pdf
Antonio

darioit
Posts: 230
Joined: 02 Aug 2010 05:25

Re: Extract missing record

#6 Post by darioit » 02 Feb 2024 03:15

Thank you, Antonio, it works fine. The problem is that the PDF directory is a slow Windows share, and reading 1 million PDFs from it takes a long time, so I would rather run dir /b > file.txt first and then work on file.txt.

Squashman
Expert
Posts: 4470
Joined: 23 Dec 2011 13:59

Re: Extract missing record

#7 Post by Squashman » 02 Feb 2024 11:11

darioit wrote:
02 Feb 2024 03:15
Thank you, Antonio, it works fine. The problem is that the PDF directory is a slow Windows share, and reading 1 million PDFs from it takes a long time, so I would rather run dir /b > file.txt first and then work on file.txt.
As I said in my previous post, a basic FOR command should be faster than reading a million-line file into memory. The FOR command works on one file at a time. The FOR /F has to read the entire file into memory before it can begin working on it.
Regardless of whether you understand the code, you should easily be able to change Antonio's code to read the file. You would only have to change one line of code.
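
For illustration, here is what that one-line change might look like. This is a sketch rather than code posted in the thread: it assumes file.txt was created beforehand (for example with dir /b *.pdf > file.txt) and that each base record sits directly before its X1 record in that file. Only the outer FOR line differs from Antonio's script, apart from the redirection being moved to the front of the echo lines so that no trailing space is written.

```batch
@echo off
setlocal EnableDelayedExpansion

rem Assumption: file.txt already exists and lists one file name per line.
echo Results missing X1 record:
set "last="
rem The changed line: read names from file.txt instead of scanning *.pdf.
(for /F "usebackq delims=" %%f in ("file.txt") do (
   for /F "tokens=1-4 delims=_." %%a in ("%%f") do (
      if not defined last (
         if "%%d" equ "pdf" (
            set "last=%%~Nf"
         ) else (
            >&2 echo %%f
         )
      ) else (
         if "%%a_%%b_%%c" equ "!last!" (
            set "last="
         ) else (
            echo !last!.pdf
            if "%%d" equ "pdf" (
               set "last=%%~Nf"
            ) else (
               >&2 echo %%f
               set "last="
            )
         )
      )
   )
)) 2> onlyX1.txt
if defined last echo %last%.pdf

echo/
echo Result only X1 record:
type onlyX1.txt
del onlyX1.txt
```

Because delayed expansion is enabled, file names containing an exclamation mark would not be handled correctly by this sketch.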
