Find All Files with Same Filename But Different Extension

Discussion forum for all Windows batch related topics.

Moderator: DosItHelp

Samir
Posts: 375
Joined: 16 Jul 2013 12:00
Location: HSV
Contact:

Find All Files with Same Filename But Different Extension

#1 Post by Samir » 08 Feb 2021 02:11

I have files like this on a drive in various different directories:
brn008077cf04db_014868.REC
brn001ba9ca5166_015479.REC
brn008077cf04db_015727.REC
brn008077cf04db_015767.REC
brn008077cf04db_015982.REC
brn008077cf04db_016928.REC
brn008077cf04db_016600.REC
brn008077cf04db_016679.REC
brn008077cf04db_016828.REC
brn008077cf04db_017322.REC
brn008077cf04db_017431.REC
brn008077cf04db_017478.REC
These files may or may not have a .pdf extension counterpart somewhere on the drive, e.g.:
brn008077cf04db_014868.PDF
brn008077cf04db_016600.PDF
brn008077cf04db_017431.PDF
I want to get a list of all the .REC files that do not have a corresponding .PDF extension file, and ideally would like the full path to the file (i.e., like dir FILENAME /s/b would display).

So for my above example, the list would look like:
\\scanner1\brn001ba9ca5166_015479.REC
\\scanner7\brn008077cf04db_015727.REC
\\scanner1\test\brn008077cf04db_015767.REC
\\scan5\test8\brn008077cf04db_015982.REC
\\blep\blop\brn008077cf04db_016928.REC
\\drive\path\brn008077cf04db_016679.REC
\\folder\folder\folder\brn008077cf04db_016828.REC
\\camel\camel\camel\brn008077cf04db_017322.REC
\\server\share\brn008077cf04db_017478.REC
I know various ways to do this, but all are clunky (writing list to file and then searching every base filename through the whole drive or a list of all the files on the drive). I know there's an elegant (and faster/easier) way to do this, but don't have the knowledge. Any assistance appreciated. Thank you!

penpen
Expert
Posts: 1907
Joined: 23 Jun 2013 06:15
Location: Germany

Re: Find All Files with Same Filename But Different Extension

#2 Post by penpen » 08 Feb 2021 04:05

Samir wrote:
08 Feb 2021 02:11
I have files (...) in various different directories
(...)
a .pdf extension counterpart somewhere on the drive
(...)
I know various ways to do this, but all are clunky (writing list to file and then searching every base filename through the whole drive or a list of all the files on the drive). I know there's an elegant (and faster/easier) way to do this
The only way to speed that up is to search for REC-files only within the various different directories. But as long as the PDF-files can be anywhere on the drive (instead of in a specific directory), you have to search the whole drive for every filename anyway.
Therefore it (nearly) doesn't matter, if you search for both file-types:

Code: Select all

:: assuming the volume you want to search is "Z:", then create a file containing the filenames
dir /s /b "Z:\*.REC"  "Z:\*.PDF" >>"filenames.txt"
:: then sort the files
:: after that search the sorted file names for pairs (REC and PDF) in successive lines.
:: and echo those .REC files that do not have a corresponding .PDF extension file

penpen

#3 Post by Samir » 08 Feb 2021 11:53

Thank you penpen. How would I search the sorted filenames for pairs in successive lines? I can't think of what that test would look like logically. Thank you again for the help!

#4 Post by penpen » 08 Feb 2021 20:00

You use a for /f loop to traverse the lines (= filenames) of the sorted file and find out whether the current line is a matching PDF-file, another REC-file, or an unrelated file (in case you got unrelated PDF-files in that list).
That means you need one (or more) environment variable, that reflects the state you are in.
You might use a variable 'filename' (without the quotes) for that purpose:
1) If filename is undefined, then you fully processed the last pair;
in that state you need to check,
whether the actual file is a REC-file (in which case you store the name - without the extension - into the variable filename),
or if the actual file is an unrelated PDF-file (in which case you do nothing - so you don't need to implement an else case here).
2) If filename is defined then the last filename was a REC-file and you have to check (using an if-statement),
whether your actual name is the corresponding pdf file (in which case you simply undefine filename to reflect you found a pair),
whether your actual name is another REC-file (in which case you echo the content of the filename variable, as the corresponding PDF is missing and set the actual name into the variable),
or whether your actual name is an unrelated PDF-file (in which case you report the missing counterpart and undefine the filename variable).

Also note that you have to check for a missing corresponding PDF-file after the for /f loop has finished, because the last filename in the list of sorted filenames might have been a REC-file (i.e. if filename is defined, report the missing corresponding PDF-file).
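
The state logic described above could be sketched roughly as follows. This is only an illustration under assumptions: sorted.txt is a hypothetical file holding one bare file name per line (no path), sorted in reverse order so that NAME.REC comes directly before NAME.PDF, and 'pending' plays the role of the 'filename' variable from the description.

```batch
@echo off
setlocal EnableExtensions EnableDelayedExpansion
rem Assumption: sorted.txt (hypothetical name) lists bare file names,
rem reverse-sorted so that NAME.REC sits directly before NAME.PDF.
set "pending="
for /f "usebackq delims=" %%F in ("sorted.txt") do (
    if /i "%%~xF"==".REC" (
        rem Another REC while one is still pending: the pending one has no PDF.
        if defined pending echo !pending!.REC
        set "pending=%%~nF"
    ) else (
        rem Any other line: it either closes the pending pair or is unrelated.
        if defined pending (
            if /i not "%%~nF"=="!pending!" echo !pending!.REC
            set "pending="
        )
    )
)
rem The last line of the list may have been an unpaired REC.
if defined pending echo !pending!.REC
```

Delayed expansion is needed because 'pending' changes inside the loop body; the price is that file names containing an exclamation mark would be mangled, so this sketch assumes there are none.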


penpen

#5 Post by Samir » 10 Feb 2021 15:21

Thank you very much penpen. As I delved into the specifics, I realized I had some mistaken assumptions in my original data.

First, my list of files ending in .REC will have the full path. Can I use %~nxF to get just the file name from the full path read into the variable 'filename'?

Second is that if my list will have full paths, the file names won't be sorted since the full path is there. :(

This may be a really inefficient way to do this, but I was thinking maybe another way to do this is to simply have a text file with a list of all the .PDF files with their full paths. Then another text file with a list of the .REC and their full paths. I would read in each line of the text file with the .REC file names and paths, parse out to just the file name+ext and then use find to see if it exists in the text file with the list of .PDF files. If so, copy the full path string from the .REC text file to a 'RESULTS' text file. I know it would require hitting the text file with .PDF in it multiple times, but being cached it should be pretty quick, no? Thoughts?

#6 Post by penpen » 10 Feb 2021 18:03

Samir wrote:
10 Feb 2021 15:21
First, my list of files ending in .REC will have the full path. Can I use %~nxF to get just the file name from the full path read into the variable 'filename'?
If you are within an appropriate for-loop (your variable %%F has to reference the full path), then yes.
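
A minimal sketch of what that looks like inside a batch file, assuming a hypothetical list file REC.TXT with one full path per line:

```batch
@echo off
rem Assumption: REC.TXT is a hypothetical list file, one full path per line.
rem "delims=" keeps the whole line in %%F, even if the path contains spaces.
for /f "usebackq delims=" %%F in ("REC.TXT") do (
    rem The ~ modifiers pick parts out of the path held in %%F.
    echo full path: %%F
    echo name+ext : %%~nxF
    echo base name: %%~nF
)
```

The modifiers only work on for-loop variables, which is why the value has to be in %%F rather than in an ordinary environment variable.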


Samir wrote:
10 Feb 2021 15:21
Second is that if my list will have full paths, the file names won't be sorted since the full path is there. :(
The basic idea is to sort by filename (in reverse alphabetical order, so that 'R' comes before 'P'), so that you can apply the above algorithm.
You noticed that if the lines started with the filename, all would be fine (which nearly solves the task).
But you stated it in an inverted way, which demotivated you from thinking further in that direction (unhelpful in this instance; you should state descriptions in multiple variations).

Therefore you simply store all the parts of your 'complete' filenames (volume, path of parent directory, filename with extension) in the order you need (preferably with as few elements as possible, separated by a character not allowed to be part of a filename); for example:

Code: Select all

"filename.extension";"path_of_parent_directory"
Alternatively you could store the filename twice (in case you need the complete path and don't want to add special handling for files in root directories to avoid doubled backslashes):

Code: Select all

"filename.extension";"path_of_that_file"
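
One way such a name-first list might be produced and read back is sketched below. The volume Z:\ and the work-file names unsorted.txt and sorted.txt are assumptions for illustration. Note that a semicolon can legally occur in Windows file names, so this sketch additionally assumes that none of the scanned names contains one.

```batch
@echo off
setlocal EnableExtensions
rem Assumptions: Z:\ is the volume to scan; unsorted.txt and sorted.txt are
rem hypothetical work files created in the current directory.
rem Write "name.ext";"full path" for every REC and PDF file below Z:\.
(for /r Z:\ %%F in (*.rec *.pdf) do echo "%%~nxF";"%%~fF") > "unsorted.txt"

rem Reverse sort so that NAME.REC ends up directly before NAME.PDF.
sort /r "unsorted.txt" /o "sorted.txt"

rem Read the list back: token 1 is the quoted name, token 2 the quoted path.
for /f "usebackq tokens=1,2 delims=;" %%A in ("sorted.txt") do (
    echo name=%%~A  path=%%~B
)
```

Storing the full path as the second field is the second variant from the post; it avoids special handling for files in a root directory.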

Samir wrote:
10 Feb 2021 15:21
This may be a really inefficient way to do this (...)? Thoughts?
I wouldn't recommend that idea, because I agree that such an algorithm is an inefficient way to do this, though by far not the worst either.
But batch files are slow, so you should avoid relatively slow algorithms (especially in your batch portion).
If the file of PDF-files is too big (on my system > 4 MB, but I don't know whether that depends on my system or on Windows), then that file won't even be cached (I wouldn't rely on wishful thinking unless unavoidable).


penpen

#7 Post by Samir » 11 Feb 2021 14:35

Thank you again penpen. And as I worked through your suggestions, I realized another detail I missed in my initial analysis--the PDF files will contain the REC filename, but may also have other stuff appended, so now even if I can take each path and parse out the file name, the file names won't exactly match. :(

I was thinking about how FINDSTR can take multiple arguments via a file. I could feed findstr the list of REC base file names and then use that to search the text file with all the PDF file names and paths. The results should be all the files that have corresponding PDF files. It might even be faster depending on how findstr works. Thoughts?

I started working on this implementation, but can't seem to get command

Code: Select all

FINDSTR /G:RECBAS.TXT D:\PDF
to do anything. It doesn't even display an error message. The contents of RECBAS.TXT is 700+ lines of this:

Code: Select all

brn_91b3be_015016 
brn_93d8ce_039142 
brn_93d8ce_039148 
brw008092adbb92_001089 
brw008092adbb92_001104 
brw008092adbb92_001107 
brw008092adbb92_000928 
brw008092adbb92_000802 
brw008092adbb92_000852 
brw008092adbb92_000863
D:\PDF is 5MB and an excerpt of it looks like this:

Code: Select all

\\192.168.1.10\ROOT\SPROOT\INCOMING\m_000005.pdf
\\192.168.1.10\ROOT\SPROOT\INCOMING\m_000007.pdf
\\192.168.1.10\ROOT\SPROOT\PROCESS\m_000008.pdf
\\192.168.1.10\ROOT\SPROOT\PROCESS\m_000009.pdf
\\192.168.1.10\ROOT\SPROOT\PROCESS\m_000010.pdf
\\192.168.1.10\ROOT\SPROOT\PROCESS\m_000011.pdf
\\192.168.1.10\ROOT\SPROOT\PROCESS\m_000012.pdf
\\192.168.1.10\ROOT\SPROOT\PROCESS\m_000013.pdf
\\192.168.1.10\ROOT\SPROOT\INCOMING\brw008092adbb92_000014.pdf
\\192.168.1.10\ROOT\SPROOT\INCOMING\brw008092adbb92_000015.pdf
\\192.168.1.10\ROOT\SPROOT\INCOMING\brw008092adbb92_000016.pdf
\\192.168.1.10\ROOT\SCAN\INCOMING\m_000017.pdf
\\192.168.1.10\ROOT\SCAN\INCOMING\m_000018.pdf
\\192.168.1.10\ROOT\SCAN\INCOMING\m_000019.pdf
\\192.168.1.10\ROOT\SCAN\INCOMING\m_000020.pdf
\\192.168.1.10\ROOT\SCAN\INCOMING\m_000021.pdf
\\192.168.1.10\ROOT\SCAN\INCOMING\m_000022.pdf
\\192.168.1.10\ROOT\SPROOT\INCOMING\m_000023.pdf

#8 Post by Samir » 12 Feb 2021 14:01

When I run the following command using the first line in the RECBAS.TXT file, the command works fine:

Code: Select all

FINDSTR /C:brn_91b3be_015016 D:\PDF
The results are expected:

Code: Select all

\\192.168.1.10\ROOT\SPROOT\INCOMING\brn_91b3be_015016_recovered.pdf
Anyone have an idea why the file in /G in my previous post isn't being processed? Thank you in advance! (If it's some sort of little stupid error please point it out!!)

dbenham
Expert
Posts: 2447
Joined: 12 Feb 2011 21:02
Location: United States (east coast)

Re: Find All Files with Same Filename But Different Extension

#9 Post by dbenham » 12 Feb 2021 19:26

I think you have a fundamental problem with your goal requirements - as currently stated, I believe you can have ambiguous situations.

Given files abc.rec and abc1.pdf, your rules state that those should be a pair. But then what happens if later on abc1.rec is added?

You could argue the pdf can be paired with both, but I doubt that is what you want.

But if you say the pdf should only match abc1.rec, then your results for abc.rec are inconsistent over time. At first abc.rec matches abc1.pdf, and then it doesn't, despite the fact that neither file changed.


Ignoring that issue, here is how I would begin to get a handle on the problem. I would use FOR /R to list all .rec and .pdf files with the fileName.ext first, followed by a bunch of spaces (and/or path incompatible delimiter), and then the path. I would pipe that entire set to SORT and redirect output to a file. Then you will have all the like named .rec and .pdf files next to each other. You can then visually scan for matches, and decide how to deal with ambiguous situations.

Code: Select all

(for /r %f in (*.pdf *.rec) do @echo %~nxf            :        %~dpf) | sort > rec-pdf.list
You could take it a step further and try to write a FOR /F loop to scan the result and look for pairs. You are guaranteed that "matching" .pdf/.rec pairs will be next to each other. But you can't be sure which will come first, .pdf or .rec, because of the extra characters that may be appended to the .pdf file base name.

#10 Post by Samir » 12 Feb 2021 21:10

dbenham wrote:
12 Feb 2021 19:26
I think you have a fundamental problem with your goal requirements - as currently stated, I believe you can have ambiguous situations.

Given files abc.rec and abc1.pdf, your rules state that those should be a pair. But then what happens if later on abc1.rec is added?

You could argue the pdf can be paired with both, but I doubt that is what you want.

But if you say the pdf should only match abc1.rec, then your results for abc.rec are inconsistent over time. At first abc.rec matches abc1.pdf, and then it doesn't, despite the fact that neither file changed.


Ignoring that issue, here is how I would begin to get a handle on the problem. I would use FOR /R to list all .rec and .pdf files with the fileName.ext first, followed by a bunch of spaces (and/or path incompatible delimiter), and then the path. I would pipe that entire set to SORT and redirect output to a file. Then you will have all the like named .rec and .pdf files next to each other. You can then visually scan for matches, and decide how to deal with ambiguous situations.

Code: Select all

(for /r %f in (*.pdf *.rec) do @echo %~nxf            :        %~dpf) | sort > rec-pdf.list
You could take it a step further and try to write a FOR /F loop to scan the result and look for pairs. You are guaranteed that "matching" .pdf/.rec pairs will be next to each other. But you can't be sure which will come first, .pdf or .rec, because of the extra characters that may be appended to the .pdf file base name.
I re-read my original post and can't seem to find any fundamental issue with what I want. :?:

I'm not looking for pairs, I want to know if a particular PDF file exists for a particular REC base filename. In your example of abc.rec, the only PDF files that would exist would be abc.pdf or abc_something.pdf, and in the case of abc1.rec, it would be abc1.pdf or abc1_something.pdf, still allowing for a 1:1 match even though that's not what I'm looking for. I just want to know if the pdf version exists.

I've been doing the work manually already, so that's not what I want to do as I know I can get a list of only the REC files without a corresponding PDF file. If findstr would just work as it should, I would be done with this already. :evil: Worst case I can just FOR /F the file line by line into findstr, but that's a real waste since findstr should be reading in the file.

Code: Select all

for /F %f in (RECBAS.TXT) do findstr /c:"%f" d:\pdf
I went ahead and ran this to get a result, but I still don't understand why FINDSTR /G didn't work.

#11 Post by Samir » 14 Feb 2021 11:13

I started from scratch and took a completely different approach. The following worked to give me what I wanted:

Code: Select all

REM LIST OF ALL *.REC FILES ON DRIVE TO REC
DIR \\192.168.1.10\ROOT\*.REC /S/B > \\192.168.1.10\ROOT\REC

REM LIST OF ALL *.PDF FILES ON DRIVE TO PDF
DIR \\192.168.1.10\ROOT\*.PDF /S/B > \\192.168.1.10\ROOT\PDF

REM DELETE/REN EXISTING 2REC
IF EXIST \\192.168.1.10\ROOT\2REC.BAK DEL \\192.168.1.10\ROOT\2REC.BAK
IF EXIST \\192.168.1.10\ROOT\2REC REN \\192.168.1.10\ROOT\2REC *.BAK

REM USING FOR /F FIND BASE REC FILENAME IN PDFLIST AND IF NOT FOUND ECHO REC LINE TO 2REC FILE
for /f %%f in (\\192.168.1.10\ROOT\rec) do find /C /i "%%~nf" \\192.168.1.10\ROOT\PDF & IF ERRORLEVEL 1 (ECHO %%f >> \\192.168.1.10\ROOT\2REC)
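
A slightly hardened variant of the final loop might look like the sketch below. It is not part of the original post: it assumes the REC and PDF list files were already created by the DIR commands above, and ROOT is just a helper variable introduced here. Compared with the original line, "delims=" keeps paths that contain spaces intact, the output of FIND is discarded, and the result line is written without a trailing space.

```batch
@echo off
setlocal EnableExtensions
rem ROOT is a helper variable introduced for this sketch; it mirrors the
rem share used in the post. The REC and PDF list files must already exist.
set "ROOT=\\192.168.1.10\ROOT"
for /f "usebackq delims=" %%F in ("%ROOT%\REC") do (
    rem FIND sets a non-zero errorlevel when the base name is not in the list,
    rem so the part after || runs only for REC files without a PDF.
    find /i "%%~nF" "%ROOT%\PDF" >nul || >>"%ROOT%\2REC" echo %%F
)
```

FIND also returns a non-zero errorlevel when the PDF list file itself is missing, in which case every REC file would be reported.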

#12 Post by penpen » 15 Feb 2021 04:07

Samir wrote:
11 Feb 2021 14:35
And as I worked through your suggestions, I realized another detail I missed in my initial analysis--the PDF files will contain the REC filename, but may also have other stuff appended, so now even if I can take each path and parse out the file name, the file names won't exactly match. :(
You could remove the unwanted parts of the pdf base filename before comparing the strings.

To avoid the ambiguity Dave mentioned, you could perform the string comparison with a modified base filename: for example, append the next expected character, assuming it is a delimiter between the REC base filename and the rest of the PDF filename, which seems to be either a dot ('.') or an underscore ('_').

Samir wrote:
12 Feb 2021 21:10
I'm not looking for pairs, I want to know if a particular PDF file exists for a particular REC base filename.
If a particular PDF file exists for a particular REC base filename, then you have found a matching pair of the relation you are using.
The relation you seem to use is:
'The REC base filename is a substring of the PDF filename'.

Therefore you are looking for pairs (or, more specifically, the absence of pairs).

Samir wrote:
12 Feb 2021 21:10
I re-read my original post and can't seem to find any fundamental issue with what I want. :?:
(...)
In your example of abc.rec, the only PDF files that would exist would be abc.pdf or abc_something.pdf, and in the case of abc1.rec, it would be abc1.pdf or abc1_something.pdf, still allowing for a 1:1 match even though that's not what I'm looking for. I just want to know if the pdf version exists.
Dave pointed out that the REC base filename "abc" is a substring of the base filename "abc1" and therefore will match every PDF filename that "abc1" is matching to, although you don't seem to consider those a match.

Example:
Content of file "RECBAS.TXT":

Code: Select all

abc
abc1
Content of file "PDF":

Code: Select all

\\192.168.1.10\ROOT\SCAN\INCOMING\abc1.pdf
\\192.168.1.10\ROOT\SCAN\INCOMING\abc1_something.pdf
Then you will end up finding matches for the base filename "abc" with both pdf files.
Although that is not what you want, your last implementation definitely does that.
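
The delimiter idea can be tried on exactly this example. The sketch below is an illustration only: it writes the two example lists to hypothetical temporary files and searches for each base name followed by '.' or '_' instead of the bare base name.

```batch
@echo off
setlocal EnableExtensions
rem Hypothetical temporary files holding the two example lists from above.
set "RECLIST=%TEMP%\recbas_demo.txt"
set "PDFLIST=%TEMP%\pdf_demo.txt"
(
    echo abc
    echo abc1
) > "%RECLIST%"
(
    echo \\192.168.1.10\ROOT\SCAN\INCOMING\abc1.pdf
    echo \\192.168.1.10\ROOT\SCAN\INCOMING\abc1_something.pdf
) > "%PDFLIST%"

rem Search for the base name plus the expected delimiter ('.' or '_').
rem /L = literal search strings, /I = ignore case.
for /f "usebackq delims=" %%B in ("%RECLIST%") do (
    findstr /l /i /c:"%%B." /c:"%%B_" "%PDFLIST%" >nul && (
        echo %%B: PDF found
    ) || (
        echo %%B: no PDF
    )
)
del "%RECLIST%" "%PDFLIST%"
```

With the delimiter appended, "abc" should no longer match the abc1 files, while "abc1" still does. A name that merely ends in the base name (for example xabc.pdf) would still match, so this only removes the trailing-character ambiguity.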

Samir wrote:
12 Feb 2021 21:10
I still don't understand why FINDSTR /G didn't work.
If the "RECBAS.TXT" file you gave above is correct, then I suspect the space characters at the end of each line are the cause.
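
If that suspicion is right, removing the trailing spaces before calling FINDSTR might be enough. A minimal sketch, assuming RECBAS.TXT and D:\PDF are the files from the earlier posts and RECBAS_CLEAN.TXT is a hypothetical name for the cleaned copy:

```batch
@echo off
setlocal EnableExtensions
rem Assumptions: RECBAS.TXT and D:\PDF are the files from the posts above;
rem RECBAS_CLEAN.TXT is a hypothetical name for the cleaned copy.
rem The default FOR /F delimiters are space and tab, so token 1 is the base
rem name without any trailing blanks.
(for /f "usebackq tokens=1" %%A in ("RECBAS.TXT") do echo %%A) > "RECBAS_CLEAN.TXT"

rem /L = literal strings, /I = ignore case, /G = read search strings from a file.
findstr /l /i /g:RECBAS_CLEAN.TXT D:\PDF
```

The /I switch is used here partly because FINDSTR has been reported to give unreliable results for case-sensitive searches with several literal search strings. As in the earlier command, this lists the PDF lines that match some base name, not the REC files that lack a PDF.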
