Remove DUPLICATES without SORTING the data (in .txt files)
Posted: 19 May 2014 13:13
by miskox
I had a situation where I had a list of serial numbers in a .txt file. Something like this:
test.txt contains these records (and there are no empty lines):
I wanted to remove duplicate entries without sorting the file (it was very important to keep the current order).
One way of achieving this is shown below. Let's assume the data is unique in the sense that findstr cannot find one record inside another record as a substring, so a record can only match as a whole (I have 8-digit serial numbers):
Code: Select all
@echo off
echo %time%
type nul>test.out_dir_3
for /f "tokens=1 delims=" %%f in (test.txt) do findstr /B /C:%%f test.out_dir_3>nul||>>test.out_dir_3 (echo.%%f)
echo %time%
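If records could be prefixes of one another, /B alone would treat 1234 as already present once 12345 has been written. A minimal sketch of a whole-line variant, assuming the records contain no double quotes or backslashes (findstr treats those specially); the output name test.out_exact is made up for this example:

```batch
@echo off
rem Sketch only: same idea as the script above, but stricter matching.
rem /X = the whole line must match, /L = compare literally (no regex).
rem Assumption: records contain no double quotes or backslashes.
type nul>test.out_exact
for /f "tokens=1 delims=" %%f in (test.txt) do findstr /X /L /C:"%%f" test.out_exact >nul||>>test.out_exact (echo.%%f)
```

This does not make the method any faster; findstr still rescans the growing output file for every record.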
Another way is using temporary files:
Code: Select all
@echo off
set tmpdir=TMPDIR%random%_%random%_%random%
md %tmpdir%
echo %time%
REM for each record write a temporary file
for /f "tokens=1 delims=" %%f in (test.txt) do >%tmpdir%\%%f (echo X)
REM if file exists write this record and delete this temporary file. So it means
REM that if file does not exist it has already been written to the output
REM file only once
for /f "tokens=1 delims=" %%f in (test.txt) do if exist %tmpdir%\%%f del %tmpdir%\%%f&>>test.out_dir_1 (echo %%f)
echo %time%
rd /s /q %tmpdir%
Surprisingly, the second way is much faster (I have 24,000+ records). FINDSTR has to search an ever-growing output file for every record. In addition, any antivirus program will delay such operations; I am sure the IF EXIST and DEL commands do not cause the antivirus to check for threats.
Of course this approach depends on the data you have. Just thought maybe someone finds this approach useful.
Saso
Re: Remove DUPLICATES without SORTING the data (in .txt file
Posted: 19 May 2014 13:45
by Aacini
You may use the same approach as your temporary files, but with variables instead; that is:
Code: Select all
@echo off
setlocal
(for /F "delims=" %%a in (test.txt) do (
if not defined n[%%a] (
set n[%%a]=Y
echo %%a
)
)) > test.out_dir_1
This method should run faster with files up to a certain size or, more precisely, up to a certain number of unique records. Beyond that, once the number of environment variables grows too large, this method gets progressively slower.
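Since the slowdown depends on how many distinct records there are, it can help to count them before choosing a method. A rough sketch, assuming no other environment variables start with n[ in the session:

```batch
@echo off
setlocal
rem Build the same n[...] flags as above, without writing an output file.
for /F "delims=" %%a in (test.txt) do set "n[%%a]=Y"
rem SET followed by a prefix lists every variable whose name starts with it;
rem FIND /C "=" counts those lines, i.e. the number of distinct records.
set n[ 2>nul | find /C "="
```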
Antonio
Re: Remove DUPLICATES without SORTING the data (in .txt file
Posted: 20 May 2014 06:53
by miskox
Sure, for not too many records. I have 24,000+ records, so I guess this would not work as fast as the temp files version.
Thanks.
Saso
Re: Remove DUPLICATES without SORTING the data (in .txt file
Posted: 20 May 2014 09:58
by Aacini
Of those 24000 records, how many are unique? Half, perhaps?
I think it would be worthwhile to run this test anyway...

You may remove the brackets in order to get shorter variable names:
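The code block that belonged here is not preserved. What follows is only a sketch of the idea, assuming it was the earlier script with the brackets dropped from the variable names:

```batch
@echo off
setlocal
rem Sketch: flag variables are named n<record> instead of n[<record>].
rem Assumption: no record makes n<record> collide with an existing
rem variable name (for example a record that completes NUMBER_OF_PROCESSORS).
(for /F "delims=" %%a in (test.txt) do (
   if not defined n%%a (
      set n%%a=Y
      echo %%a
   )
)) > test.out_dir_1
```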
Antonio
Re: Remove DUPLICATES without SORTING the data (in .txt file
Posted: 20 May 2014 10:11
by aGerman
The SET /P technique is faster than FOR /F because it doesn't buffer the entire file contents first. Aacini's suggestion is the best I can think of. The question is how many DIFFERENT records are saved in your file. This is the number of environment variables that will be created, and only this will eventually influence the speed.
untested
Code: Select all
set "infile=test.txt"
set "outfile=test.out_dir_1"
setlocal EnableDelayedExpansion
<"!infile!" >"!outfile!" (
for /f %%i in ('type "!infile!"^|find /c /v ""') do for /l %%j in (1 1 %%i) do (
set "line=" &set /p "line="
if not defined n[!line!] (
set n[!line!]=Y
echo !line!
)
)
)
endlocal
Regards
aGerman
Re: Remove DUPLICATES without SORTING the data (in .txt file
Posted: 20 May 2014 12:37
by foxidrive
Using the first four solutions posted - on my machine with 24,000 8 byte numbers in the test.txt file:
Code: Select all
using 24,000 unique records
1: miskox
0 days 00:04:40.113
2: miskox
0 days 00:02:59.340
3: aacini
0 days 00:00:25.468
4: aGerman
0 days 00:03:00.060
Code: Select all
Using 12,000 unique records
1: miskox
0 days 00:02:41.690
2: miskox
0 days 00:01:01.074
3: aacini
0 days 00:00:07.929
4: aGerman
0 days 00:01:18.367
aacini's method wins hands down.
All solutions gave the same result.
Testing code follows, and uses Dave's getTimestamp.bat
Code: Select all
@echo off
del test.out_* 2>nul
@echo off
echo %time%
call getTimestamp -f {ums} -r t1
type nul>test.out_dir_1
for /f "tokens=1 delims=" %%f in (test.txt) do findstr /B /C:%%f test.out_dir_1 >nul||>>test.out_dir_1 (echo.%%f)
echo %time%
call getTimestamp -f {ums} -r t2
call getTimestamp -d %t2%-%t1% -f "{ud} days {hh}:{nn}:{ss}.{fff}" -u
echo.
@echo off
set tmpdir=TMPDIR%random%_%random%_%random%
md %tmpdir%
echo %time%
call getTimestamp -f {ums} -r t1
REM for each record write a temporary file
for /f "tokens=1 delims=" %%f in (test.txt) do >%tmpdir%\%%f (echo X)
REM if file exists write this record and delete this temporary file. So it means
REM that if file does not exist it has already been written to the output
REM file only once
for /f "tokens=1 delims=" %%f in (test.txt) do if exist %tmpdir%\%%f del %tmpdir%\%%f&>>test.out_dir_2 (echo %%f)
rd /s /q %tmpdir%
echo %time%
call getTimestamp -f {ums} -r t2
call getTimestamp -d %t2%-%t1% -f "{ud} days {hh}:{nn}:{ss}.{fff}" -u
echo.
@echo off
setlocal
echo %time%
call getTimestamp -f {ums} -r t1
(for /F "delims=" %%a in (test.txt) do (
if not defined n[%%a] (
set n[%%a]=Y
echo %%a
)
)) > test.out_dir_3
echo %time%
call getTimestamp -f {ums} -r t2
call getTimestamp -d %t2%-%t1% -f "{ud} days {hh}:{nn}:{ss}.{fff}" -u
echo.
@echo off
echo %time%
call getTimestamp -f {ums} -r t1
set "infile=test.txt"
set "outfile=test.out_dir_4"
setlocal EnableDelayedExpansion
<"!infile!" >"!outfile!" (
for /f %%i in ('type "!infile!"^|find /c /v ""') do for /l %%j in (1 1 %%i) do (
set "line=" &set /p "line="
if not defined n!line! (
set n!line!=Y
echo !line!
)
)
)
endlocal
echo %time%
call getTimestamp -f {ums} -r t2
call getTimestamp -d %t2%-%t1% -f "{ud} days {hh}:{nn}:{ss}.{fff}" -u
pause
Re: Remove DUPLICATES without SORTING the data (in .txt file
Posted: 20 May 2014 13:18
by aGerman
Surprising

Obviously always worth it to do some tests. Thanks foxi.
Regards
aGerman
Re: Remove DUPLICATES without SORTING the data (in .txt file
Posted: 20 May 2014 15:08
by penpen
I don't think it is surprising, as you have used a more complex for /f loop than Aacini (with additional piping, which is slow, too) and added a second inner loop.
This should be the "set /P" counterpart of Aacini's code:
Code: Select all
@echo off
setlocal disableDelayedExpansion
< "test.txt" > "test.out_dir_1" cmd /V:ON /C "@echo off&for /L %%a in (0, 0, 0) do (set "input=" & set /P "input=" & (if not defined input exit) & (if not defined n[^!input^!] (set n[^!input^!]=Y & echo ^!input^!)))"
endlocal
At the moment I'm not able to test it (same with speed tests...).
I'm also not sure whether it is faster than Aacini's code, as it uses many more set and "if defined" commands.
penpen
Re: Remove DUPLICATES without SORTING the data (in .txt file
Posted: 20 May 2014 22:56
by foxidrive
Here are the results for penpen's code:
Re: Remove DUPLICATES without SORTING the data (in .txt file
Posted: 21 May 2014 06:12
by Dos_Probie
Still looks like Aacini wins the speed test, with only 8 lines of code and no disableDelayedExpansion!

Re: Remove DUPLICATES without SORTING the data (in .txt file
Posted: 21 May 2014 08:41
by Aacini
I think that the fastest SET /P method would be this one:
Code: Select all
@echo off
setlocal EnableDelayedExpansion
find /C /V "" < test.txt > test.out
set /P lines=< test.out
< test.txt (for /L %%a in (1,1,%lines%) do (
set /P input=
if not defined n[!input!] (
set n[!input!]=Y
echo !input!
)
)) > test.out_dir_1
If the execution of find.exe is faster than that of cmd.exe, this method should run faster than penpen's; however, I am pretty sure this method will still be slower than the first FOR /F one.
An interesting comparison would be to start timing this method after the FIND and SET /P commands (just the FOR /L loop). This would give us a direct comparison of the FOR /F method vs. the SET /P one!
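A sketch of how that loop-only measurement could look. It reuses the getTimestamp.bat calls exactly as they appear in foxidrive's test code and only moves the first one; nothing here has been timed:

```batch
@echo off
setlocal EnableDelayedExpansion
rem Count the lines first; this part is deliberately left outside the timed region.
find /C /V "" < test.txt > test.out
set /P lines=< test.out
del test.out
rem Start the clock only now, so that just the FOR /L + SET /P loop is measured.
call getTimestamp -f {ums} -r t1
< test.txt (for /L %%a in (1,1,%lines%) do (
   set /P input=
   if not defined n[!input!] (
      set n[!input!]=Y
      echo !input!
   )
)) > test.out_dir_1
call getTimestamp -f {ums} -r t2
call getTimestamp -d %t2%-%t1% -f "{ud} days {hh}:{nn}:{ss}.{fff}" -u
```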
Antonio
Re: Remove DUPLICATES without SORTING the data (in .txt file
Posted: 21 May 2014 09:32
by foxidrive
Using this code:
Code: Select all
@echo off
call getTimestamp -f {ums} -r t1
setlocal EnableDelayedExpansion
find /C /V "" < test.txt > test.out
set /P lines=< test.out
del test.out
< test.txt (for /L %%a in (1,1,%lines%) do (
set /P input=
if not defined n[!input!] (
set n[!input!]=Y
echo !input!
)
)) > test.out_dir_7
call getTimestamp -f {ums} -r t2
call getTimestamp -d %t2%-%t1% -f "{ud} days {hh}:{nn}:{ss}.{fff}" -u
pause
Re: Remove DUPLICATES without SORTING the data (in .txt file
Posted: 21 May 2014 11:05
by penpen
On my WinXP Home, this is as fast as Aacini's SET /P method on 12,000 unique entries:
Code: Select all
@echo off
setlocal disableDelayedExpansion
< "test.txt" > "test.out_dir_1.txt" cmd /Q /D /E:ON /V:ON /C "set "input= "&set "n[ ]= "&for /L %%a in () do if errorlevel 1 (exit) else set /P "input=" & if not defined n[!input!] set "n[!input!]=Y" & echo !input!"
endlocal
(But my PC is SLOW, so the result might be skewed by HDD speed, ...)
penpen
Re: Remove DUPLICATES without SORTING the data (in .txt file
Posted: 21 May 2014 11:16
by foxidrive
Test code from penpen's last post.
Code: Select all
@echo off
call getTimestamp -f {ums} -r t1
setlocal disableDelayedExpansion
< "test.txt" > "test.out_dir_1.txt" cmd /Q /D /E:ON /V:ON /C "set "input= "&set "n[ ]= "&for /L %%a in () do if errorlevel 1 (exit) else set /P "input=" & if not defined n[!input!] set "n[!input!]=Y" & echo !input!"
endlocal
call getTimestamp -f {ums} -r t2
call getTimestamp -d %t2%-%t1% -f "{ud} days {hh}:{nn}:{ss}.{fff}" -u
pause
Re: Remove DUPLICATES without SORTING the data (in .txt file
Posted: 21 May 2014 12:11
by miskox
Here are my test results (my computer is not very powerful: an old IBM T43 laptop). Results differ a lot. I think the main factor is CPU power.
Code: Select all
MISKOX -version 1 (findstr) (http://www.dostips.com/forum/viewtopic.php?p=34566#p34566)
0 days 00:12:20.641
MISKOX -version 2 (temp files) (http://www.dostips.com/forum/viewtopic.php?p=34566#p34566)
0 days 00:00:51.828
AACINI (http://www.dostips.com/forum/viewtopic.php?p=34567#p34567)
0 days 00:01:10.437
AGERMAN (http://www.dostips.com/forum/viewtopic.php?p=34584#p34584)
0 days 00:02:05.281
FOXIDRIVE (http://www.dostips.com/forum/viewtopic.php?p=34603#p34603)
0 days 00:01:29.907
PENPEN (http://www.dostips.com/forum/viewtopic.php?p=34616#p34616)
0 days 00:01:30.828
PENPEN (http://www.dostips.com/forum/viewtopic.php?p=34589#p34589)
0 days 00:02:27.516
Code used for the test:
Code: Select all
@echo off
cls
del test.out_* 2>nul
call :1
call :2
call :3
call :4
call :5
call :6
call :7
goto :EOF
:: another option was to call the .bat with a parameter.
if "%1"=="1" goto :1
if "%1"=="2" goto :2
if "%1"=="3" goto :3
if "%1"=="4" goto :4
if "%1"=="5" goto :5
if "%1"=="6" goto :6
if "%1"=="7" goto :7
goto :EOF
:1
echo MISKOX -version 1 (findstr) (http://www.dostips.com/forum/viewtopic.php?p=34566#p34566)
echo 1 %time%
call getTimestamp -f {ums} -r t1
type nul>test.out_1
for /f "tokens=1 delims=" %%f in (test.txt) do findstr /B /C:%%f test.out_1 >nul||>>test.out_1 (echo.%%f)
echo 1 %time%
call getTimestamp -f {ums} -r t2
call getTimestamp -d %t2%-%t1% -f "{ud} days {hh}:{nn}:{ss}.{fff}" -u
echo.
goto :EOF
:2
echo MISKOX -version 2 (temp files) (http://www.dostips.com/forum/viewtopic.php?p=34566#p34566)
set tmpdir=TMPDIR%random%_%random%_%random%
md %tmpdir%
echo 2 %time%
call getTimestamp -f {ums} -r t1
for /f "tokens=1 delims=" %%f in (test.txt) do >%tmpdir%\%%f (echo X)
for /f "tokens=1 delims=" %%f in (test.txt) do if exist %tmpdir%\%%f del %tmpdir%\%%f&>>test.out_2 (echo %%f)
rd /s /q %tmpdir%
echo 2 %time%
call getTimestamp -f {ums} -r t2
call getTimestamp -d %t2%-%t1% -f "{ud} days {hh}:{nn}:{ss}.{fff}" -u
echo.
goto :EOF
:3
echo AACINI (http://www.dostips.com/forum/viewtopic.php?p=34567#p34567)
setlocal
echo 3 %time%
call getTimestamp -f {ums} -r t1
(for /F "delims=" %%a in (test.txt) do (
if not defined n[%%a] (
set n[%%a]=Y
echo %%a
)
)) > test.out_3
echo 3 %time%
call getTimestamp -f {ums} -r t2
call getTimestamp -d %t2%-%t1% -f "{ud} days {hh}:{nn}:{ss}.{fff}" -u
echo.
goto :EOF
:4
echo AGERMAN (http://www.dostips.com/forum/viewtopic.php?p=34584#p34584)
echo 4 %time%
call getTimestamp -f {ums} -r t1
set "infile=test.txt"
set "outfile=test.out_4"
setlocal EnableDelayedExpansion
<"!infile!" >"!outfile!" (
for /f %%i in ('type "!infile!"^|find /c /v ""') do for /l %%j in (1 1 %%i) do (
set "line=" &set /p "line="
if not defined n!line! (
set n!line!=Y
echo !line!
)
)
)
endlocal
echo 4 %time%
call getTimestamp -f {ums} -r t2
call getTimestamp -d %t2%-%t1% -f "{ud} days {hh}:{nn}:{ss}.{fff}" -u
echo.
goto :EOF
:5
echo FOXIDRIVE (http://www.dostips.com/forum/viewtopic.php?p=34603#p34603)
setlocal EnableDelayedExpansion
echo 5 %time%
call getTimestamp -f {ums} -r t1
find /C /V "" < test.txt > test.out
set /P lines=< test.out
< test.txt (for /L %%a in (1,1,%lines%) do (
set /P input=
if not defined n[!input!] (
set n[!input!]=Y
echo !input!
)
)) > test.out_5
echo 5 %time%
call getTimestamp -f {ums} -r t2
call getTimestamp -d %t2%-%t1% -f "{ud} days {hh}:{nn}:{ss}.{fff}" -u
echo.
goto :EOF
:6
echo PENPEN (http://www.dostips.com/forum/viewtopic.php?p=34616#p34616)
echo 6 %time%
call getTimestamp -f {ums} -r t1
setlocal disableDelayedExpansion
< "test.txt" > "test.out_6.txt" cmd /Q /D /E:ON /V:ON /C "set "input= "&set "n[ ]= "&for /L %%a in () do if errorlevel 1 (exit) else set /P "input=" & if not defined n[!input!] set "n[!input!]=Y" & echo !input!"
endlocal
echo 6 %time%
call getTimestamp -f {ums} -r t2
call getTimestamp -d %t2%-%t1% -f "{ud} days {hh}:{nn}:{ss}.{fff}" -u
echo.
goto :EOF
:7
echo PENPEN (http://www.dostips.com/forum/viewtopic.php?p=34589#p34589)
echo 7 %time%
call getTimestamp -f {ums} -r t1
setlocal disableDelayedExpansion
< "test.txt" > "test.out_7" cmd /V:ON /C "@echo off&for /L %%a in (0, 0, 0) do (set "input=" & set /P "input=" & (if not defined input exit) & (if not defined n[^!input^!] (set n[^!input^!]=Y & echo ^!input^!)))"
endlocal
echo 7 %time%
call getTimestamp -f {ums} -r t2
call getTimestamp -d %t2%-%t1% -f "{ud} days {hh}:{nn}:{ss}.{fff}" -u
echo.
goto :EOF
Test.txt contains 24015 records.
Test.out (duplicates removed) contains 23947 (68 duplicates)
Test.txt will be posted in the next post because it is too long.
Saso