Remove DUPLICATES without SORTING the data (in .txt files)

#1 Post by miskox » 19 May 2014 13:13

I had a list of serial numbers in a .txt file, something like this:

test.txt contains these records (and there are no empty lines):

Code: Select all

1
6
5
4
3
2
1
2
0


I wanted to remove duplicate entries without sorting the file (it was very important to keep the current order).

One way to achieve this is shown below. It assumes the data are such that findstr can never find one record inside another as a substring, so a record can only match as a whole (I have 8-digit serial numbers):

Code: Select all

@echo off
echo %time%
type nul>test.out_dir_3
for /f "tokens=1 delims=" %%f in (test.txt) do findstr /B /C:%%f test.out_dir_3>nul||>>test.out_dir_3 (echo.%%f)
echo %time%


Another way is using temporary files:

Code: Select all

@echo off
set tmpdir=TMPDIR%random%_%random%_%random%
md %tmpdir%
echo %time%
REM for each record write a temporary file
for /f "tokens=1 delims=" %%f in (test.txt) do >%tmpdir%\%%f (echo X)
REM if file exists write this record and delete this temporary file. So it means
REM that if file does not exist it has already been written to the output
REM file only once
for /f "tokens=1 delims=" %%f in (test.txt) do if exist %tmpdir%\%%f del %tmpdir%\%%f&>>test.out_dir_1 (echo %%f)
echo %time%
rd /s /q %tmpdir%


Surprisingly, the second way is much faster (I have 24,000+ records). FINDSTR has to search an ever-growing output file for every record, so the total work grows roughly with the square of the number of records. In addition, any antivirus program will delay such operations; I am sure the IF EXIST and DEL commands do not cause the antivirus to check for threats.

Of course, whether this works depends on the data you have (for the temporary-file version, each record must be a valid file name). I just thought someone might find it useful.

Saso
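
The FINDSTR version in post #1 relies on no record being a prefix of another. Where that cannot be guaranteed, a whole-line match is one possible variant. What follows is only a sketch: the output name test.out_exact is made up for illustration, and records are assumed to contain no double quotes or backslashes, because FINDSTR treats those specially inside /C:"...".

```batch
@echo off
rem Sketch only: same idea as the FINDSTR version, but matching whole lines.
rem   /X  the line must match the search string exactly
rem   /L  treat the search string literally, not as a regular expression
rem test.out_exact is a hypothetical output name.
type nul > test.out_exact
for /f "usebackq delims=" %%f in ("test.txt") do (
   findstr /X /L /C:"%%f" test.out_exact >nul || >>test.out_exact echo(%%f
)
```

This still starts FINDSTR once per record and rescans the growing output file, so it is not expected to be any faster than the original; it only removes the prefix restriction.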

#2 Post by Aacini » 19 May 2014 13:45

You can use the same approach as with your temporary files, but with variables instead; that is:

Code: Select all

@echo off
setlocal

(for /F "delims=" %%a in (test.txt) do (
   if not defined n[%%a] (
      set n[%%a]=Y
      echo %%a
   )
)) > test.out_dir_1


This method should run faster for files up to a certain size or, more precisely, up to a certain number of unique values. Beyond that point, as the number of environment variables grows, it will become progressively slower.

Antonio
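
Because the speed of the variable-based method depends on how many unique values there are, it can be useful to count them before choosing a method. A minimal sketch, assuming the same test.txt and the same n[...] naming as in post #2, and records without spaces or special characters; the counter names total and unique are introduced here for illustration only.

```batch
@echo off
setlocal
rem Sketch: count all records and the unique ones among them.
rem "total" and "unique" are hypothetical counter names.
set /a "total=0, unique=0"
for /F "usebackq delims=" %%a in ("test.txt") do (
   set /a total+=1
   if not defined n[%%a] (
      set "n[%%a]=Y"
      set /a unique+=1
   )
)
echo Total records:  %total%
echo Unique records: %unique%
endlocal
```

The number reported as unique is also the number of n[...] environment variables the deduplication itself would create.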

#3 Post by miskox » 20 May 2014 06:53

Sure, as long as there are not too many records. I have 24,000+ records, so I guess this would not be as fast as the temp-file version.

Thanks.
Saso

#4 Post by Aacini » 20 May 2014 09:58

Of those 24,000 records, how many are unique? Half, perhaps?

I think it would be worth running this test anyway... :) You can remove the brackets to get shorter variable names:

Code: Select all

set n%%a=Y

Antonio

#5 Post by aGerman » 20 May 2014 10:11

The SET /P technique is faster than FOR /F because it doesn't buffer the entire file contents first. Aacini's suggestion is the best I can think of. The question is how many DIFFERENT records your file contains. That is the number of environment variables that will be created, and only this number will ultimately affect the speed.

Untested:

Code: Select all

set "infile=test.txt"
set "outfile=test.out_dir_1"
setlocal EnableDelayedExpansion
<"!infile!" >"!outfile!" (
  for /f %%i in ('type "!infile!"^|find /c /v ""') do for /l %%j in (1 1 %%i) do (
    set "line=" &set /p "line="
    if not defined n[!line!] (
      set n[!line!]=Y
      echo !line!
    )
  )
)
endlocal

Regards
aGerman

#6 Post by foxidrive » 20 May 2014 12:37

Timings for the first four solutions posted, on my machine, with 24,000 8-byte numbers in the test.txt file:

Code: Select all

using 24,000 unique records

   1: miskox
0 days 00:04:40.113

   2: miskox
0 days 00:02:59.340

   3: aacini
0 days 00:00:25.468

   4: aGerman
0 days 00:03:00.060


Code: Select all

Using 12,000 unique records

   1: miskox
0 days 00:02:41.690

   2: miskox
0 days 00:01:01.074

   3: aacini
0 days 00:00:07.929

   4: aGerman
0 days 00:01:18.367


Aacini's method wins hands down.

All solutions gave the same result.
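
One way to check that the outputs really are identical is a byte-by-byte comparison with FC. This is only a sketch and assumes the four output files are named test.out_dir_1 to test.out_dir_4, as in the test code in this post.

```batch
@echo off
rem Sketch: compare every other output file with the first one, byte by byte.
rem FC returns errorlevel 0 only when the two files are identical.
for %%f in (test.out_dir_2 test.out_dir_3 test.out_dir_4) do (
   fc /B test.out_dir_1 %%f >nul 2>&1 && (echo %%f: identical) || (echo %%f: DIFFERENT or missing)
)
```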

The test code follows; it uses Dave's getTimestamp.bat.

Code: Select all

@echo off

REM the outputs are named test.out_dir_N, which the pattern test.out.* would not match
del test.out_dir_* 2>nul

@echo off
echo %time%
call getTimestamp -f {ums} -r t1
type nul>test.out_dir_1
for /f "tokens=1 delims=" %%f in (test.txt) do findstr /B /C:%%f test.out_dir_1 >nul||>>test.out_dir_1 (echo.%%f)
echo %time%
call getTimestamp -f {ums} -r t2
call getTimestamp -d %t2%-%t1% -f "{ud} days {hh}:{nn}:{ss}.{fff}" -u
echo.

@echo off
set tmpdir=TMPDIR%random%_%random%_%random%
md %tmpdir%
echo %time%
call getTimestamp -f {ums} -r t1
REM for each record write a temporary file
for /f "tokens=1 delims=" %%f in (test.txt) do >%tmpdir%\%%f (echo X)
REM if file exists write this record and delete this temporary file. So it means
REM that if file does not exist it has already been written to the output
REM file only once
for /f "tokens=1 delims=" %%f in (test.txt) do if exist %tmpdir%\%%f del %tmpdir%\%%f&>>test.out_dir_2 (echo %%f)
rd /s /q %tmpdir%
echo %time%
call getTimestamp -f {ums} -r t2
call getTimestamp -d %t2%-%t1% -f "{ud} days {hh}:{nn}:{ss}.{fff}" -u
echo.

@echo off
setlocal
echo %time%
call getTimestamp -f {ums} -r t1
(for /F "delims=" %%a in (test.txt) do (
   if not defined n[%%a] (
      set n[%%a]=Y
      echo %%a
   )
)) > test.out_dir_3
echo %time%
call getTimestamp -f {ums} -r t2
call getTimestamp -d %t2%-%t1% -f "{ud} days {hh}:{nn}:{ss}.{fff}" -u
echo.

@echo off
echo %time%
call getTimestamp -f {ums} -r t1
set "infile=test.txt"
set "outfile=test.out_dir_4"
setlocal EnableDelayedExpansion
<"!infile!" >"!outfile!" (
  for /f %%i in ('type "!infile!"^|find /c /v ""') do for /l %%j in (1 1 %%i) do (
    set "line=" &set /p "line="
    if not defined n!line! (
      set n!line!=Y
      echo !line!
    )
  )
)
endlocal
echo %time%
call getTimestamp -f {ums} -r t2
call getTimestamp -d %t2%-%t1% -f "{ud} days {hh}:{nn}:{ss}.{fff}" -u
pause

#7 Post by aGerman » 20 May 2014 13:18

Surprising :o Obviously it is always worth running some tests. Thanks, foxi.

Regards
aGerman

#8 Post by penpen » 20 May 2014 15:08

I don't think it is surprising: you used a more complex FOR /F loop than Aacini (with an additional pipe, which is slow too) and added a second, inner loop.

This should be the SET /P counterpart of Aacini's code:

Code: Select all

@echo off
setlocal disableDelayedExpansion
< "test.txt" > "test.out_dir_1" cmd /V:ON /C  "@echo off&for /L %%a in (0, 0, 0) do (set "input=" & set /P "input=" & (if not defined input exit) & (if not defined n[^!input^!] (set n[^!input^!]=Y & echo ^!input^!)))"
endlocal
At the moment I'm not able to test it (the same goes for speed tests).
I'm also not sure whether it is faster than Aacini's code, as it uses many more SET and IF DEFINED commands.

penpen

#9 Post by foxidrive » 20 May 2014 22:56

Here are the results for penpen's code:

Code: Select all

24,000

0 days 00:00:49.513



Code: Select all

12,000

0 days 00:00:26.289

#10 Post by Dos_Probie » 21 May 2014 06:12

Still LQQks like Aacini wins the speed test and with only 8 lines of code and no disableDelayedExpansion! :mrgreen:

#11 Post by Aacini » 21 May 2014 08:41

I think that the fastest SET /P method would be this one:

Code: Select all

@echo off
setlocal EnableDelayedExpansion

find /C /V "" < test.txt > test.out
set /P lines=< test.out

< test.txt (for /L %%a in (1,1,%lines%) do (
   set /P input=
   if not defined n[!input!] (
      set n[!input!]=Y
      echo !input!
   )
)) > test.out_dir_1

If running find.exe is faster than running cmd.exe, this method should be faster than penpen's; however, I am pretty sure it will still be slower than the first FOR /F one.

An interesting comparison would be to start timing this method only after the FIND and SET /P commands, so that just the FOR /L loop is measured. That would give us a direct comparison of the FOR /F method vs. the SET /P one!

Antonio
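
The loop-only measurement suggested in post #11 could be sketched as follows. It assumes the same getTimestamp.bat helper and options used for the other timings in this thread; the output name test.out_loop_only is made up for illustration, and test.txt is assumed to contain no empty lines.

```batch
@echo off
setlocal EnableDelayedExpansion

rem Count the lines first; this part is deliberately left untimed.
find /C /V "" < test.txt > test.out
set /P lines=< test.out
del test.out

rem Start the clock only now, so that just the FOR /L loop is measured.
call getTimestamp -f {ums} -r t1
< test.txt (for /L %%a in (1,1,%lines%) do (
   set /P input=
   if not defined n[!input!] (
      set n[!input!]=Y
      echo !input!
   )
)) > test.out_loop_only
call getTimestamp -f {ums} -r t2
call getTimestamp -d %t2%-%t1% -f "{ud} days {hh}:{nn}:{ss}.{fff}" -u
```

The loop body is kept identical to the one in post #11, so the elapsed time can be set directly against the FOR /F timing.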

#12 Post by foxidrive » 21 May 2014 09:32

Code: Select all

24000

0 days 00:00:39.146


Code: Select all

12000

0 days 00:00:17.437


Using this code:

Code: Select all

@echo off
call getTimestamp -f {ums} -r t1
setlocal EnableDelayedExpansion
find /C /V "" < test.txt > test.out
set /P lines=< test.out
del test.out
< test.txt (for /L %%a in (1,1,%lines%) do (
   set /P input=
   if not defined n[!input!] (
      set n[!input!]=Y
      echo !input!
   )
)) > test.out_dir_7
call getTimestamp -f {ums} -r t2
call getTimestamp -d %t2%-%t1% -f "{ud} days {hh}:{nn}:{ss}.{fff}" -u
pause

#13 Post by penpen » 21 May 2014 11:05

On my WinXP Home machine, this is as fast as Aacini's SET /P method on 12,000 unique entries:

Code: Select all

@echo off
setlocal disableDelayedExpansion
< "test.txt" > "test.out_dir_1.txt" cmd /Q /D /E:ON /V:ON /C "set "input= "&set "n[ ]= "&for /L %%a in () do if errorlevel 1 (exit) else set /P "input=" & if not defined n[!input!] set "n[!input!]=Y" & echo !input!"
endlocal
(But my PC is SLOW, so the result might be distorted by HDD speed, ...)

penpen

#14 Post by foxidrive » 21 May 2014 11:16

Code: Select all

24,000

0 days 00:00:37.942



Code: Select all

12,000

0 days 00:00:16.769




Test code from penpen's last post.



Code: Select all

@echo off
call getTimestamp -f {ums} -r t1
setlocal disableDelayedExpansion
< "test.txt" > "test.out_dir_1.txt" cmd /Q /D /E:ON /V:ON /C "set "input= "&set "n[ ]= "&for /L %%a in () do if errorlevel 1 (exit) else set /P "input=" & if not defined n[!input!] set "n[!input!]=Y" & echo !input!"
endlocal
call getTimestamp -f {ums} -r t2
call getTimestamp -d %t2%-%t1% -f "{ud} days {hh}:{nn}:{ss}.{fff}" -u
pause

#15 Post by miskox » 21 May 2014 12:11

Here are my test results (my computer is not very powerful: an old IBM T43 laptop). The results differ a lot from the ones above. I think the main factor is CPU power.

Code: Select all

MISKOX -version 1 (findstr) (http://www.dostips.com/forum/viewtopic.php?p=34566#p34566)
0 days 00:12:20.641

MISKOX -version 2 (temp files) (http://www.dostips.com/forum/viewtopic.php?p=34566#p34566)
0 days 00:00:51.828

AACINI (http://www.dostips.com/forum/viewtopic.php?p=34567#p34567)
0 days 00:01:10.437

AGERMAN (http://www.dostips.com/forum/viewtopic.php?p=34584#p34584)
0 days 00:02:05.281

FOXIDRIVE (http://www.dostips.com/forum/viewtopic.php?p=34603#p34603)
0 days 00:01:29.907

PENPEN (http://www.dostips.com/forum/viewtopic.php?p=34616#p34616)
0 days 00:01:30.828

PENPEN (http://www.dostips.com/forum/viewtopic.php?p=34589#p34589)
0 days 00:02:27.516


Code used for the test:

Code: Select all

@echo off
cls
del test.out_* 2>nul

call :1
call :2
call :3
call :4
call :5
call :6
call :7

goto :EOF

:: another option was to call the .bat with a parameter.
if "%1"=="1" goto :1
if "%1"=="2" goto :2
if "%1"=="3" goto :3
if "%1"=="4" goto :4
if "%1"=="5" goto :5
if "%1"=="6" goto :6
if "%1"=="7" goto :7
goto :EOF

:1
echo MISKOX -version 1 (findstr) (http://www.dostips.com/forum/viewtopic.php?p=34566#p34566)
echo 1 %time%
call getTimestamp -f {ums} -r t1
type nul>test.out_1
for /f "tokens=1 delims=" %%f in (test.txt) do findstr /B /C:%%f test.out_1 >nul||>>test.out_1 (echo.%%f)
echo 1 %time%
call getTimestamp -f {ums} -r t2
call getTimestamp -d %t2%-%t1% -f "{ud} days {hh}:{nn}:{ss}.{fff}" -u
echo.

goto :EOF

:2
echo MISKOX -version 2 (temp files) (http://www.dostips.com/forum/viewtopic.php?p=34566#p34566)
set tmpdir=TMPDIR%random%_%random%_%random%
md %tmpdir%
echo 2 %time%
call getTimestamp -f {ums} -r t1
for /f "tokens=1 delims=" %%f in (test.txt) do >%tmpdir%\%%f (echo X)
for /f "tokens=1 delims=" %%f in (test.txt) do if exist %tmpdir%\%%f del %tmpdir%\%%f&>>test.out_2 (echo %%f)
rd /s /q %tmpdir%
echo 2 %time%
call getTimestamp -f {ums} -r t2
call getTimestamp -d %t2%-%t1% -f "{ud} days {hh}:{nn}:{ss}.{fff}" -u
echo.

goto :EOF

:3
echo AACINI (http://www.dostips.com/forum/viewtopic.php?p=34567#p34567)
setlocal
echo 3 %time%
call getTimestamp -f {ums} -r t1
(for /F "delims=" %%a in (test.txt) do (
   if not defined n[%%a] (
      set n[%%a]=Y
      echo %%a
   )
)) > test.out_3
echo 3 %time%
call getTimestamp -f {ums} -r t2
call getTimestamp -d %t2%-%t1% -f "{ud} days {hh}:{nn}:{ss}.{fff}" -u
echo.

goto :EOF

:4
echo AGERMAN (http://www.dostips.com/forum/viewtopic.php?p=34584#p34584)
echo 4 %time%
call getTimestamp -f {ums} -r t1
set "infile=test.txt"
set "outfile=test.out_4"
setlocal EnableDelayedExpansion
<"!infile!" >"!outfile!" (
  for /f %%i in ('type "!infile!"^|find /c /v ""') do for /l %%j in (1 1 %%i) do (
    set "line=" &set /p "line="
    if not defined n!line! (
      set n!line!=Y
      echo !line!
    )
  )
)
endlocal
echo 4 %time%
call getTimestamp -f {ums} -r t2
call getTimestamp -d %t2%-%t1% -f "{ud} days {hh}:{nn}:{ss}.{fff}" -u
echo.

REM return here; otherwise :4 falls through into :5
goto :EOF

:5
echo FOXIDRIVE (http://www.dostips.com/forum/viewtopic.php?p=34603#p34603)
setlocal EnableDelayedExpansion
echo 5 %time%
call getTimestamp -f {ums} -r t1
find /C /V "" < test.txt > test.out
set /P lines=< test.out

< test.txt (for /L %%a in (1,1,%lines%) do (
   set /P input=
   if not defined n[!input!] (
      set n[!input!]=Y
      echo !input!
   )
)) > test.out_5
echo 5 %time%
call getTimestamp -f {ums} -r t2
call getTimestamp -d %t2%-%t1% -f "{ud} days {hh}:{nn}:{ss}.{fff}" -u
echo.

goto :EOF

:6
echo PENPEN (http://www.dostips.com/forum/viewtopic.php?p=34616#p34616)
echo 6 %time%
call getTimestamp -f {ums} -r t1
setlocal disableDelayedExpansion
< "test.txt" > "test.out_6.txt" cmd /Q /D /E:ON /V:ON /C "set "input= "&set "n[ ]= "&for /L %%a in () do if errorlevel 1 (exit) else set /P "input=" & if not defined n[!input!] set "n[!input!]=Y" & echo !input!"
endlocal
echo 6 %time%
call getTimestamp -f {ums} -r t2
call getTimestamp -d %t2%-%t1% -f "{ud} days {hh}:{nn}:{ss}.{fff}" -u
echo.

goto :EOF


:7
echo PENPEN (http://www.dostips.com/forum/viewtopic.php?p=34589#p34589)
echo 7 %time%
call getTimestamp -f {ums} -r t1
setlocal disableDelayedExpansion
< "test.txt" > "test.out_7" cmd /V:ON /C  "@echo off&for /L %%a in (0, 0, 0) do (set "input=" & set /P "input=" & (if not defined input exit) & (if not defined n[^!input^!] (set n[^!input^!]=Y & echo ^!input^!)))"
endlocal
echo 7 %time%
call getTimestamp -f {ums} -r t2
call getTimestamp -d %t2%-%t1% -f "{ud} days {hh}:{nn}:{ss}.{fff}" -u
echo.

goto :EOF


Test.txt contains 24015 records.
Test.out (duplicates removed) contains 23947 (68 duplicates)

Test.txt will be posted in the next post because it is too long.

Saso
Last edited by miskox on 21 May 2014 12:17, edited 1 time in total.
