Compare similar file lists - detect differences*title edited

Discussion forum for all Windows batch related topics.

Moderator: DosItHelp

Message
Author
foxidrive
Expert
Posts: 6031
Joined: 10 Feb 2012 02:20

Compare similar file lists - detect differences*title edited

#1 Post by foxidrive » 22 Apr 2015 09:06

Read my reply to Aacini's next post, as the task wasn't stated clearly.

I have a text file which is a DIR listing,
and a second DIR listing which has many similar lines.

I'd like to copy the two files together (i.e. concatenate them) and then remove every set of lines which are the same AND which contain a "/" character (it's in the date string).

So if each file contained the following line several times, then every copy of it would be removed.

Code:

29/07/2009  01:44            89,106 Doctors.jpg



Leaving the file unsorted is a requirement - and speed would be nice.
Has this been solved anyplace that someone knows?
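[Editor's note: the stated rule is easier to see outside batch. A sketch in JavaScript, under the assumed interpretation that a line is removed when it contains "/" and occurs more than once in the combined listing; the function name is illustrative.]

```javascript
// Concatenate the two listings, then drop every line that both contains "/"
// and occurs more than once in the combined list, keeping survivors in order.
function removeSharedDated(listA, listB) {
  const combined = listA.concat(listB);
  const counts = new Map();
  for (const line of combined) {
    counts.set(line, (counts.get(line) || 0) + 1);
  }
  return combined.filter(line => !(line.includes('/') && counts.get(line) > 1));
}
```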

Aacini
Expert
Posts: 1932
Joined: 06 Dec 2011 22:15
Location: México City, México
Contact:

Re: Text file: remove every line that is duplicated - leave

#2 Post by Aacini » 22 Apr 2015 12:22

Your request is complicated; you should have posted the input files and the desired output... Anyway, I think this is the solution:

Code:

@echo off
setlocal EnableDelayedExpansion

rem Load lines of first file in an array, preserving its order (9999 maximum)
set i=10000
for /F "delims=" %%a in (One.txt) do (
   set /A i+=1
   set line["%%a"]=!i:~1!
)

rem Process lines from second file
for /F "delims=" %%a in (Two.txt) do (
   rem If the line already exists in the array AND contains "/"
   set "line=%%a"
   if defined line["%%a"] if "!line:/=!" neq "!line!" (
      rem Mark this line to be removed
      set line["%%a"]=0
      set "line="
   )
   rem Else: Insert the line in the array
   if defined line (
      set /A i+=1
      set line["%%a"]=!i:~1!
   )
)

rem Output remaining lines, preserving original order
(for /F tokens^=2*^ delims^=[^"]^= %%a in ('set line[') do (
   if %%b neq 0 echo %%b:%%a
)) > tempfile.txt
for /F "tokens=1* delims=:" %%a in ('sort tempfile.txt') do echo %%b
del tempfile.txt
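[Editor's note: the `!i:~1!` idiom above is a zero-padding trick: the index starts at 10000 and the leading digit is stripped, so every array value is a fixed-width sort key and `set line[` output can be sorted back into insertion order. The same padding in JavaScript:]

```javascript
// Start at 10000, add the counter, drop the leading "1":
// the result is a fixed-width, zero-padded order key.
const pad = n => String(10000 + n).slice(1);
// pad(1)  -> "0001"
// pad(42) -> "0042"
```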

Antonio

foxidrive
Expert
Posts: 6031
Joined: 10 Feb 2012 02:20

Re: Text file: remove every line that is duplicated - leave

#3 Post by foxidrive » 23 Apr 2015 07:06

Aacini wrote:Your request is complicated; you should have posted the input files and the desired output...


Thanks for your code.

Yes, sorry, I wasn't thinking clearly.

I have two Acronis backup files: I mounted them both and did a dir /s /a-d >one.txt for the first, and the same into two.txt for the second.

What I am trying to find out is which files are causing the difference in size between the two full backups.
A problem is that there are 250,000+ files in each archive - and what I didn't realise is that I also need to tell which listing contains each set of unique files.

So in essence I just need to make the date/time/size/name lines that are the same in each file vanish.

This example isn't very useful but you can see that it's just a dir /s /a-d of each drive letter.


Code:

 Volume in drive E is C
 Volume Serial Number is F87A-310C

 Directory of e:\

12/12/2013  10:53               110 .dir
11/06/2009  07:42                24 autoexec.bat
14/06/2014  20:46           404,250 bootmgr
18/06/2013  22:18                 1 BOOTNXT
03/11/2013  20:21             8,192 BOOTSECT.BAK
11/06/2009  07:42                10 config.sys
14/05/2012  04:35                 0 IO.SYS
14/05/2012  04:35                 0 MSDOS.SYS
12/05/2012  17:37           284,360 MYXLD
12/05/2012  17:37                20 win7.ld
              10 File(s)        696,967 bytes

 Directory of e:\Boot

27/02/2015  09:24            40,960 BCD
03/11/2012  00:32            40,960 BCD.LOG
13/05/2012  11:16                 0 BCD.LOG1
13/05/2012  11:16                 0 BCD.LOG2
03/11/2013  20:21            65,536 BOOTSTAT.DAT
27/04/2014  06:15         1,192,280 memtest.exe
03/11/2012  00:08               296 reflect.cfg
               7 File(s)      1,340,032 bytes

 Directory of e:\Boot\bg-BG

22/08/2013  15:21            77,152 bootmgr.exe.mui
               1 File(s)         77,152 bytes

 Directory of e:\Boot\cs-CZ



Code:

 Volume in drive F is C
 Volume Serial Number is F87A-310C

 Directory of f:\

12/12/2013  10:53               110 .dir
11/06/2009  07:42                24 autoexec.bat
14/06/2014  20:46           404,250 bootmgr
18/06/2013  22:18                 1 BOOTNXT
03/11/2013  20:21             8,192 BOOTSECT.BAK
11/06/2009  07:42                10 config.sys
14/05/2012  04:35                 0 IO.SYS
14/05/2012  04:35                 0 MSDOS.SYS
12/05/2012  17:37           284,360 MYXLD
12/05/2012  17:37                20 win7.ld
              10 File(s)        696,967 bytes

 Directory of f:\Boot

21/03/2015  10:28            40,960 BCD
03/11/2012  00:32            28,672 BCD.LOG
13/05/2012  11:16                 0 BCD.LOG1
13/05/2012  11:16                 0 BCD.LOG2
03/11/2013  20:21            65,536 BOOTSTAT.DAT
27/04/2014  06:15         1,192,280 memtest.exe
03/11/2012  00:08               296 reflect.cfg
               7 File(s)      1,327,744 bytes

 Directory of f:\Boot\bg-BG

22/08/2013  15:21            77,152 bootmgr.exe.mui
               1 File(s)         77,152 bytes

 Directory of f:\Boot\cs-CZ



I have simple code to do it, but it takes forever.

Code:

@echo off
copy /b "f2drive.txt" file.tmp
call :check "e2drive.txt"
copy /b "e2drive.txt" file.tmp
call :check "f2drive.txt"
echo done
pause
goto :EOF

:check
(
for /f "usebackq delims=" %%a in ("file.tmp") do (
findstr /i /b /e /c:"%%a" "%~1" >nul || echo %%a
)
)>"not-in-%~1"
goto :EOF
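[Editor's note: the :check loop above is slow because FINDSTR re-scans the entire other file once per input line - roughly O(n*m) over two 250,000-line listings. A hash-set lookup does the same membership test in O(n+m); a sketch, assuming case-insensitive matching to mirror findstr /i:]

```javascript
// Return the lines of `lines` that do not appear in `otherLines`,
// comparing case-insensitively (like FINDSTR /I).
function notIn(lines, otherLines) {
  const other = new Set(otherLines.map(s => s.toLowerCase()));
  return lines.filter(s => !other.has(s.toLowerCase()));
}
```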

dbenham
Expert
Posts: 2461
Joined: 12 Feb 2011 21:02
Location: United States (east coast)

Re: Text file: remove every line that is duplicated - leave

#4 Post by dbenham » 23 Apr 2015 21:23

Give this a try. I sort the output, but I think you should still be able to make use of it. I rely on JREN.BAT to get a listing with date, time, and size in a format that sorts properly. After sorting, I use JREPL.BAT first to remove duplicate pairs, and then a 2nd time to reorganize the output to make it easier to read.

Code:

@echo off

del list.txt 2>nul
for %%A in (E:\ F:\) do (
  echo listing %%A
  call jren "^.*" "path()+'|'+path().slice(0,1)+'|'+ts({dt:'fsoModified',fmt:'{YYYY}-{MM}-{DD} {HH}:{NN}:{SS}'})+' '+size('               ')" /j /list /s /p %%A >>list.txt
)

echo sorting
sort /+2 list.txt >list2.txt

echo removing duplicates
call jrepl "^.(.*?\|).(.*)\n.\1.\2\n" "" /m /f list2.txt /o diffs.txt

echo reorganizing output
call jrepl "^(.*)\|.\|(.*)" "$2  $1" /f diffs.txt /o diffs2.txt
del list.txt list2.txt
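[Editor's note: the key step is the first JREPL regex. Each JREN line is `path|driveletter|timestamp size`; after SORT /+2 groups matching entries (ignoring the leading drive letter), the regex `^.(.*?\|).(.*)\n.\1.\2\n` deletes adjacent pairs that are identical except for the two drive-letter positions. The same regex demonstrated in JavaScript on illustrative data:]

```javascript
// Remove adjacent line pairs identical except for the drive letter:
// the first "." skips the path's drive letter, the second "." skips the
// single-letter drive field; \1 and \2 force the rest to match exactly.
const dedupe = text => text.replace(/^.(.*?\|).(.*)\n.\1.\2\n/gm, '');

const sorted =
  'E:\\autoexec.bat|E|2009-06-11 07:42 24\n' +   // identical on both drives
  'F:\\autoexec.bat|F|2009-06-11 07:42 24\n' +   // -> pair is deleted
  'E:\\Boot\\BCD|E|2015-02-27 09:24 40960\n' +   // timestamps differ
  'F:\\Boot\\BCD|F|2015-03-21 10:28 40960\n';    // -> pair survives as a diff
```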


Dave Benham

Aacini
Expert
Posts: 1932
Joined: 06 Dec 2011 22:15
Location: México City, México
Contact:

Re: Text file: remove every line that is duplicated - leave

#5 Post by Aacini » 23 Apr 2015 23:24

Try this:

Code:

@echo off
setlocal EnableDelayedExpansion

rem Load lines of first file in an array, preserving its order (999999 maximum)
set i=1000000
set line["Unique files in One.txt:"]=000000
for /F "delims=" %%a in ('findstr /C:"/" One.txt') do (
   set /A i+=1
   set line["%%a"]=!i:~1!
)

rem Process lines from second file
set /A i+=1
set line["-----------------------------"]=%i:~1%
set /A i+=1
set line["Unique files in Two.txt:"]=%i:~1%
for /F "delims=" %%a in ('findstr /C:"/" Two.txt') do (
   rem If the line from Two already exist in the array
   if defined line["%%a"] (
      rem Remove it
      set "line["%%a"]="
   ) else (
      rem Insert the unique line from Two in the array
      set /A i+=1
      set line["%%a"]=!i:~1!
   )
)

rem Output remaining lines, preserving original order
(for /F tokens^=2*^ delims^=[^"]^= %%a in ('set line[') do echo %%b:%%a) > tempfile.txt
for /F "tokens=1* delims=:" %%a in ('sort tempfile.txt') do echo %%b
del tempfile.txt

Output:

Code:

Unique files in One.txt:
27/02/2015  09:24            40,960 BCD
03/11/2012  00:32            40,960 BCD.LOG
-----------------------------
Unique files in Two.txt:
21/03/2015  10:28            40,960 BCD
03/11/2012  00:32            28,672 BCD.LOG
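[Editor's note: this second script computes a symmetric difference: each line of Two.txt that already exists from One.txt cancels its array entry, and whatever survives is unique to one file or the other. The same cancellation logic sketched in JavaScript, with illustrative names:]

```javascript
// Return the lines unique to each list. Shared lines cancel out;
// a Map preserves the insertion order of One's survivors.
function uniqueToEach(one, two) {
  const pending = new Map(one.map((line, i) => [line, i]));
  const uniqueTwo = [];
  for (const line of two) {
    if (pending.has(line)) pending.delete(line); // shared: cancel
    else uniqueTwo.push(line);                   // unique to Two
  }
  return { one: [...pending.keys()], two: uniqueTwo };
}
```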

If you want to do timing tests, please remove all comment lines.

Antonio

dbenham
Expert
Posts: 2461
Joined: 12 Feb 2011 21:02
Location: United States (east coast)

Re: Text file: remove every line that is duplicated - leave

#6 Post by dbenham » 24 Apr 2015 06:32

@Aacini - I see some potential problems with your approach.

1) It will fail if any file name contains = or !. The latter can be fixed, but CALL will definitely slow things down, and the alternative of toggling delayed expansion will likely slow things down with such a big environment.

2) There is no folder information in your output, so you can't tell where the files reside

3) I was assuming two files should be considered different if they reside in different folders. (I might be wrong on this)


@foxidrive - let me know if I should be ignoring the folder path when comparing files. My solution can easily be adapted for that rule. Also, SORT, FINDSTR or JREPL can be used to quickly and easily segregate the file information from the two drives. I wasn't sure if you wanted separate lists for each drive, or if you wanted the differences next to each other.


Dave Benham

foxidrive
Expert
Posts: 6031
Joined: 10 Feb 2012 02:20

Re: Text file: remove every line that is duplicated - leave

#7 Post by foxidrive » 24 Apr 2015 07:52

Thanks for your assistance with my tasks, guys.


dbenham wrote:if any file name contains = or !


Yes, FWIW I do use ! in names quite often here

dbenham wrote:3) I was assuming two files should be considered different if they reside in different folders.


Yes, that is a good point and is how it should be considered.

Your code works quickly indeed, thank you - and thank you to Antonio for your input also.

dbenham wrote:@foxidrive - let me know if I should be ignoring the folder path when comparing files. My solution can easily be adapted for that rule. Also, SORT, FINDSTR or JREPL can be used to quickly and easily segregate the file information from the two drives. I wasn't sure if you wanted separate lists for each drive, or if you wanted the differences next to each other.


Using sort /+20 /r <diffs2.txt >diffs3.txt

I got this at the top:

I'm wondering if I should aim for a few ways of processing this:

*) add a list that removes identical path/files with the same size (but a different date) - see files 3 and 4 from the top
*) and also a list that removes identical path/files - with a different date or a different size
*) a calculation of the total file size of each drive, before processing, would be the icing on the cake.

Where I say identical paths, the drive letter is not significant and should be ignored.



Code:

2015-03-23 01:26:23       257163264  F:\Users\Me\AppData\Roaming\Thunderbird\Profiles\z6ug6.default\global-messages-db.sqlite
2015-03-02 03:51:37       253329408  E:\Users\Me\AppData\Roaming\Thunderbird\Profiles\z6ug6.default\global-messages-db.sqlite
2015-03-07 00:47:05       222542171  F:\Users\Me\AppData\Roaming\Thunderbird\Profiles\z6ug6.default\Mail\Local Folders\Inbox.sbd\to 2015
2015-02-20 15:12:45       222542171  E:\Users\Me\AppData\Roaming\Thunderbird\Profiles\z6ug6.default\Mail\Local Folders\Inbox.sbd\to 2015
2015-03-11 10:55:27       221116192  F:\Windows\WinSxS\ManifestCache\9a63c00e8d010c4f_blobs.bin
2015-02-25 21:42:37       211747016  E:\Windows\WinSxS\ManifestCache\9a63c00e8d010c4f_blobs.bin
2015-03-09 19:46:01       119837696  F:\Windows\System32\MRT.exe
2015-02-11 21:19:50       113756392  E:\Windows\System32\MRT.exe
2015-03-02 03:59:12       108772352  E:\Users\Me\AppData\Roaming\Ditto\Ditto.db
2015-03-23 02:06:23       105982976  F:\Users\Me\AppData\Roaming\Ditto\Ditto.db
2015-03-22 22:09:16        83951616  F:\Windows\SoftwareDistribution\DataStore\DataStore.edb
2015-01-28 00:13:44        81238200  F:\Program Files\Microsoft Office 15\root\vfs\ProgramFilesCommonX86\Microsoft Shared\OFFICE15\msores.dll
2014-12-24 04:07:42        81238200  E:\Program Files\Microsoft Office 15\root\vfs\ProgramFilesCommonX86\Microsoft Shared\OFFICE15\msores.dll
2015-03-02 03:24:51        79757312  E:\Windows\SoftwareDistribution\DataStore\DataStore.edb

dbenham
Expert
Posts: 2461
Joined: 12 Feb 2011 21:02
Location: United States (east coast)

Re: Text file: remove every line that is duplicated - leave

#8 Post by dbenham » 24 Apr 2015 08:43

I thought you might be interested in your first bullet point. I didn't think the 2nd was as useful. I think you can figure out how to do them using my code as a starting point. But let me know if you get into trouble.

I don't have anything built into JREN or JREPL to give the sum of all files sizes on a drive. But it takes only a trivial amount of JScript code:

Code:

WScript.Echo((new ActiveXObject('Scripting.FileSystemObject')).GetFolder('E:\\').size);

You could use my JEVAL.BAT

Code:

for /f "delims=" %%A in ('jeval "(new ActiveXObject('Scripting.FileSystemObject')).GetFolder('E:\\').size"') do set eSize=%%A
echo E:\ size = %eSize%

JEVAL.BAT

Code:

@if (@X)==(@Y) @end /* harmless hybrid line that begins a JScript comment
@goto :batch

::************ Documentation ***********
:::
:::jEval  JScriptExpression  [/N]
:::jEval  /?
:::
:::  Evaluates a JScript expression and writes the result to stdout.
:::
:::  A newline (CR/LF) is not appended to the result unless the /N
:::  option is used.
:::
:::  The JScript expression should be enclosed in double quotes.
:::
:::  JScript string literals within the expression should be enclosed
:::  in single quotes.
:::
:::  Example:
:::
:::    call jEval "'5/4 = ' + 5/4"
:::
:::  Output:
:::
:::    5/4 = 1.25
:::

:batch
::============ Batch portion ================
@echo off

if "%~1" equ "" (
  call :err "Insufficient arguments"
  exit /b
)
if "%~2" neq "" if /i "%~2" neq "/N" (
  call :err "Invalid option"
  exit /b
)
if "%~1" equ "/?" (
  for /f "tokens=* delims=:" %%A in ('findstr "^:::" "%~f0"') do echo(%%A
  exit /b
)
cscript //E:JScript //nologo "%~f0" %*
exit /b

:err
>&2 echo ERROR: %~1. Use jeval /? to get help.
exit /b 1


************ JScript portion ***********/
if (WScript.Arguments.Named.Exists("n")) {
  WScript.StdOut.WriteLine(eval(WScript.Arguments.Unnamed(0)));
} else {
  WScript.StdOut.Write(eval(WScript.Arguments.Unnamed(0)));
}


Dave Benham

foxidrive
Expert
Posts: 6031
Joined: 10 Feb 2012 02:20

Re: Text file: remove every line that is duplicated - leave

#9 Post by foxidrive » 24 Apr 2015 09:57

dbenham wrote:I think you can figure out how to do them using my code as a starting point. But let me know if you get into trouble.


Thanks for your code, tips and hints, Dave. I will give it a shot when I feel up to it, and I should be able to figure out how it fits together.
This help, and more help in recent times, has been much appreciated.


Just commenting that there are regulars who are quite limited physically, and undoubtedly others with their own medical issues.
My own recent difficulties include the last eight months where I've had less than 3 hours sleep on most days.
I never knew that with insomnia you don't even yawn - you just keep plugging away, and your memory becomes like a thing with holes in it.

Compo
Posts: 600
Joined: 21 Mar 2014 08:50

Re: Text file: remove every line that is duplicated - leave

#10 Post by Compo » 24 Apr 2015 12:09

Is there a reason why you can't use powershell?

This will do the comparison directly against the directories, instead of taking input from dir listing files.

Code:

$fld1 = "C:\Users\foxidrive\Pictures"
$fld2 = "E:\Backups\AllPics"
$lst1 = GCI -Path $fld1 -Name -R -Attributes !D
$lst2 = GCI -Path $fld2 -Name -R -Attributes !D
$results = @(Diff -CaseSensitive $lst1 $lst2)
$results | FT -AutoSize -HideTableHeaders | Out-File -A diffs.txt
The results in diffs.txt should show any files present in only $fld1 or $fld2, followed by <= or =>

Change variables $fld1 and $fld2 to suit your purpose.

foxidrive
Expert
Posts: 6031
Joined: 10 Feb 2012 02:20

Re: Text file: remove every line that is duplicated - leave

#11 Post by foxidrive » 24 Apr 2015 22:54

Compo wrote:Is there a reason why you can't use powershell?


Thanks for the suggestion Compo, though the output has some problems with truncation etc.
I tried some modifications but I'm not too clued up with PS.

This is a section of the output

Code:

Program Files\Microsoft Office 15\Data\9E3F5A-D80-4BA-347-46513DD\en-us\hash.txt         
Program Files\Microsoft Office 15\Data\9E3F5A-D80-4BA-347-46513DD\en-us\MasterDescript...
Program Files\Microsoft Office 15\Data\9E3F5A-D80-4BA-347-46513DD\en-us\stream.x86.en-...
Program Files\Microsoft Office 15\Data\9E3F5A-D80-4BA-347-46513DD\x-none\hash.txt       
Program Files\Microsoft Office 15\Data\9E3F5A-D80-4BA-347-46513DD\x-none\MasterDescrip...
Program Files\Microsoft Office 15\Data\9E3F5BDA-D8C0-40BA-9347-46513BB455DD\x-none\stream.x86.x-...

Compo
Posts: 600
Joined: 21 Mar 2014 08:50

Re: Text file: remove every line that is duplicated - leave

#12 Post by Compo » 25 Apr 2015 09:41

...it's to do with buffers; Out-File outputs what would be represented in the console window!

Without wanting to slow down the code too much, my best suggestion would be to see if you could live with the information column wrapped:
change

Code:

FT -AutoSize -HideTableHeaders
to

Code:

FT -AutoSize -HideTableHeaders -Wrap

foxidrive
Expert
Posts: 6031
Joined: 10 Feb 2012 02:20

Re: Text file: remove every line that is duplicated - leave

#13 Post by foxidrive » 25 Apr 2015 20:10

I appreciate the tip, but the text is wrapped and you are unable to reconstruct the original path\name,
because the number of spaces used in the wrapping won't let you determine how many spaces were in the original name.

It may well be just the usual one space, but it may not be.

Code:

Program                                           =>                                               
Files\PureBasic\Examples\Personal\rpad-nbsp.exe               


Code:

Files\Util\Faststone Capture\Faststone            =>                                               
Capture.old\fsrec.db                                   

Compo
Posts: 600
Joined: 21 Mar 2014 08:50

Re: Compare similar file lists - detect differences*title ed

#14 Post by Compo » 27 Apr 2015 10:50

I don't have access to a Windows PC ATM so I cannot check, but I would think that a conversion to a CSV-type format may bypass the buffer problem.

Try this:

Code:

$fld1 = "C:\Users\foxidrive\Pictures"
$fld2 = "E:\Backups\AllPics"
$lst1 = GCI -Path $fld1 -Name -R -Attributes !D
$lst2 = GCI -Path $fld2 -Name -R -Attributes !D
$results = @(Diff -CaseSensitive $lst1 $lst2)
$results | % {if ($_.sideindicator -match '<') {$_.sideindicator = $fld1}
    else {$_.sideindicator = $fld2}
}
$results | select @{
    l='Directory';e={$_.SideIndicator}},@{l='Path\File';e={$_.InputObject}
} | Export-Csv Diffs.csv -NoTypeInformation
I've played with the layout a little which may have an effect on the overall speed!

foxidrive
Expert
Posts: 6031
Joined: 10 Feb 2012 02:20

Re: Compare similar file lists - detect differences*title ed

#15 Post by foxidrive » 29 Apr 2015 07:26

Regarding speed - Dave's took 2 minutes; Aacini's had some failures showing on the screen after 25 minutes and hadn't created the file yet at that stage (REM comment lines removed).
That may be the ! in the filenames etc. This is a sample of what was on the console, and other lines appeared every so often.

Code:

Environment variable line["25/02/2007  23:37             6,644 i:~1 not defined
Environment variable line["18/12/2002  01:52               909 Doom Maniai:~1 not defined
Environment variable line["18/12/2002  22:24           538,884 Raytracei:~1 not defined
Environment variable line["07/06/2006  04:36           311,390 LINUX administration made easyi:~1 not defined
Environment variable line["23/12/1997  12:04           122,420 TIPSi:~1 not defined
Environment variable line["13/10/1995  20:42           123,204 TIPSi:~1 not defined
Environment variable line["13/10/1995  20:37            96,501 TIPSi:~1 not defined
Environment variable line["23/12/1997  11:56            73,488 TIPSi:~1 not defined
Environment variable line["23/12/1997  11:57           198,202 TIPSi:~1 not defined
Environment variable line["10/03/1993  05:00            22,717 FORMATi:~1 not defined


This is your code I was using Antonio:

Code:

@echo off
setlocal EnableDelayedExpansion


set i=1000000
set line["Unique files in One.txt:"]=000000
for /F "delims=" %%a in ('findstr /C:"/" One.txt') do (
   set /A i+=1
   set line["%%a"]=!i:~1!
)


set /A i+=1
set line["-----------------------------"]=%i:~1%
set /A i+=1
set line["Unique files in Two.txt:"]=%i:~1%
for /F "delims=" %%a in ('findstr /C:"/" Two.txt') do (

   if defined line["%%a"] (

      set "line["%%a"]="
   ) else (

      set /A i+=1
      set line["%%a"]=!i:~1!
   )
)


(for /F tokens^=2*^ delims^=[^"]^= %%a in ('set line[') do echo %%b:%%a) > tempfile.txt
for /F "tokens=1* delims=:" %%a in ('sort tempfile.txt') do echo %%b
del tempfile.txt





For Compo's - around 10 minutes for this version and the one I show below too.


Thanks for your input Compo - it seems to provide the drive and filename OK, though date, time and size would be useful for some purposes.

Code:

"F:\","Files\report_2015-03-22.zip"
"F:\","Files\results.txt"
"F:\","Files\bat\gettime.bat"
"F:\","Files\bat\HTML pictures.BAT"


Code:

$fld1 = "E:\"
$fld2 = "F:\"
$lst1 = GCI -Path $fld1 -Name -R -Attributes !D
$lst2 = GCI -Path $fld2 -Name -R -Attributes !D
$results = @(Diff $lst1 $lst2)
$results | FT -HideTableHeaders -wrap | Out-File -A diffs6.txt


I was wondering about the case comparison - a case-insensitive compare could be more useful under Windows.


Your and Dave's solutions have a problem with a couple of non-standard characters, but that's sorta par for the course.

The two non-alpha characters below were changed.

Code:

user’s
Ehlers–Danlos
