Remove DUPLICATES without SORTING the data (in .txt files)

#1 Post by **miskox** » 19 May 2014 13:13

I had a situation where I had a list of serial numbers in a .txt file. Something like this:

test.txt contains these records (and there are no empty lines):

Code: Select all

I wanted to remove duplicate entries without sorting the file (it was very important to keep the current order).

One way to achieve this is shown below. It assumes the data is unique in the sense that findstr cannot find one record inside another record as a substring, so a record can only be matched as a whole (I have 8-digit serial numbers):

Code: Select all

@echo off
echo %time%
type nul>test.out_dir_3
for /f "tokens=1 delims=" %%f in (test.txt) do findstr /B /C:%%f test.out_dir_3>nul||>>test.out_dir_3 (echo.%%f)
echo %time%
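
If some records could be the beginning of other records, the /B test above would treat the shorter record as already written. A minimal sketch of a stricter variant, assuming FINDSTR's /X (whole line must match) and /L (literal search string) switches and records that contain no quotes or backslashes; the output name test.out_exact is made up for this illustration:

```batch
@echo off
rem Sketch only: same FINDSTR idea, but a record counts as "seen"
rem only when an identical whole line is already in the output file.
rem /X = the line must match exactly, /L = take the search string literally.
rem test.out_exact is a hypothetical output name.
type nul>test.out_exact
for /f "tokens=1 delims=" %%f in (test.txt) do findstr /X /L /C:"%%f" test.out_exact >nul||>>test.out_exact (echo.%%f)
```

This does not make the method any faster, because FINDSTR still rescans the growing output file for every record.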

Another way is using temporary files:

Code: Select all

@echo off
set tmpdir=TMPDIR%random%_%random%_%random%
md %tmpdir%
echo %time%
REM for each record write a temporary file
for /f "tokens=1 delims=" %%f in (test.txt) do >%tmpdir%\%%f (echo X)
REM If the marker file still exists, write the record and delete the marker.
REM A missing marker means the record has already been written, so each
REM record reaches the output file only once.
for /f "tokens=1 delims=" %%f in (test.txt) do if exist %tmpdir%\%%f del %tmpdir%\%%f&>>test.out_dir_1 (echo %%f)
echo %time%
rd /s /q %tmpdir%

Surprisingly, the second way is much faster (I have 24,000+ records). FINDSTR has to search an ever larger output file on every iteration (plus any antivirus program will delay such operations; I am sure the IF EXIST and DEL commands do not cause the antivirus to check for threats).

Of course this approach depends on the data you have. Just thought maybe someone finds this approach useful.
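
The marker-file idea can probably also be done in a single pass over test.txt. A rough sketch, not speed-tested, assuming (as above) that every record is a valid file name; the output name test.out_onepass is made up for this illustration:

```batch
@echo off
setlocal
set "tmpdir=TMPDIR%random%_%random%_%random%"
md "%tmpdir%"
rem Single pass: if no marker file exists yet, this is the first time
rem the record is seen, so create the marker and write the record.
(for /f "tokens=1 delims=" %%f in (test.txt) do (
   if not exist "%tmpdir%\%%f" (
      type nul >"%tmpdir%\%%f"
      echo %%f
   )
)) >test.out_onepass
rd /s /q "%tmpdir%"
endlocal
```

Here the markers are never deleted one by one; the RD at the end removes them together with the directory.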

Saso

#2 Post by **Aacini** » 19 May 2014 13:45

You may use the same approach of your temporary files, but using variables instead; that is:

Code: Select all

@echo off
setlocal

(for /F "delims=" %%a in (test.txt) do (
   if not defined n[%%a] (
      set n[%%a]=Y
      echo %%a
   )
)) > test.out_dir_1

This method should run faster for files up to a certain size or, more precisely, up to a certain number of unique records. Beyond that point, as the number of environment variables grows, the method gets progressively slower.
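
Since the speed depends on how many variables end up defined, it may help to count them first. A minimal sketch, assuming that SET followed by a prefix lists every variable whose name starts with that prefix, and that no other variables starting with n[ exist beforehand:

```batch
@echo off
setlocal
rem Build the same n[...] lookup variables, without writing any output.
for /F "delims=" %%a in (test.txt) do if not defined n[%%a] set "n[%%a]=Y"
rem SET n[ lists all variables whose names start with n[ ;
rem FIND /C /V "" counts the listed lines, i.e. the unique records.
set n[ 2>nul | find /C /V ""
endlocal
```

The number printed is the count of unique records, which is also the number of environment variables the method creates.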

Antonio

#3 Post by **miskox** » 20 May 2014 06:53

Sure, for not too many records. I have 24,000+ records, so I guess this would not work as fast as the temp files version.

Thanks.
Saso

#4 Post by **Aacini** » 20 May 2014 09:58

Of those 24,000 records, how many are unique? Half, perhaps?

I think it would be worth running this test anyway...

You may remove the brackets to get shorter variable names:

Code: Select all

set n%%a=Y

Antonio

#5 Post by **aGerman** » 20 May 2014 10:11

The SET /P technique is faster than FOR /F because it doesn't buffer the entire file contents first. Aacini's suggestion is the best I can think of. The question is how many DIFFERENT records are saved in your file? That is the number of environment variables that will be created, and only this will eventually influence the speed.

untested

Code: Select all

set "infile=test.txt"
set "outfile=test.out_dir_1"
setlocal EnableDelayedExpansion
<"!infile!" >"!outfile!" (
  for /f %%i in ('type "!infile!"^|find /c /v ""') do for /l %%j in (1 1 %%i) do (
    set "line=" &set /p "line="
    if not defined n[!line!] (
      set n[!line!]=Y
      echo !line!
    )
  )
)
endlocal

Regards
aGerman

#6 Post by **foxidrive** » 20 May 2014 12:37

Using the first four solutions posted - on my machine with 24,000 8 byte numbers in the test.txt file:

Code: Select all

using 24,000 unique records

   1: miskox
0 days 00:04:40.113

   2: miskox
0 days 00:02:59.340

   3: aacini
0 days 00:00:25.468

   4: aGerman
0 days 00:03:00.060

Code: Select all

Using 12,000 unique records

   1: miskox
0 days 00:02:41.690

   2: miskox
0 days 00:01:01.074

   3: aacini
0 days 00:00:07.929

   4: aGerman
0 days 00:01:18.367

aacini's method wins hands down.

All solutions gave the same result.

Testing code follows, and uses Dave's getTimestamp.bat

Code: Select all

@echo off

del test.out_dir_* 2>nul

@echo off
echo %time%
call getTimestamp -f {ums} -r t1
type nul>test.out_dir_1
for /f "tokens=1 delims=" %%f in (test.txt) do findstr /B /C:%%f test.out_dir_1 >nul||>>test.out_dir_1 (echo.%%f)
echo %time%
call getTimestamp -f {ums} -r t2
call getTimestamp -d %t2%-%t1% -f "{ud} days {hh}:{nn}:{ss}.{fff}" -u
echo.

@echo off
set tmpdir=TMPDIR%random%_%random%_%random%
md %tmpdir%
echo %time%
call getTimestamp -f {ums} -r t1
REM for each record write a temporary file
for /f "tokens=1 delims=" %%f in (test.txt) do >%tmpdir%\%%f (echo X)
REM If the marker file still exists, write the record and delete the marker.
REM A missing marker means the record has already been written, so each
REM record reaches the output file only once.
for /f "tokens=1 delims=" %%f in (test.txt) do if exist %tmpdir%\%%f del %tmpdir%\%%f&>>test.out_dir_2 (echo %%f)
rd /s /q %tmpdir%
echo %time%
call getTimestamp -f {ums} -r t2
call getTimestamp -d %t2%-%t1% -f "{ud} days {hh}:{nn}:{ss}.{fff}" -u
echo.

@echo off
setlocal
echo %time%
call getTimestamp -f {ums} -r t1
(for /F "delims=" %%a in (test.txt) do (
   if not defined n[%%a] (
      set n[%%a]=Y
      echo %%a
   )
)) > test.out_dir_3
echo %time%
call getTimestamp -f {ums} -r t2
call getTimestamp -d %t2%-%t1% -f "{ud} days {hh}:{nn}:{ss}.{fff}" -u
echo.

@echo off
echo %time%
call getTimestamp -f {ums} -r t1
set "infile=test.txt"
set "outfile=test.out_dir_4"
setlocal EnableDelayedExpansion
<"!infile!" >"!outfile!" (
  for /f %%i in ('type "!infile!"^|find /c /v ""') do for /l %%j in (1 1 %%i) do (
    set "line=" &set /p "line="
    if not defined n!line! (
      set n!line!=Y
      echo !line!
    )
  )
)
endlocal
echo %time%
call getTimestamp -f {ums} -r t2
call getTimestamp -d %t2%-%t1% -f "{ud} days {hh}:{nn}:{ss}.{fff}" -u
pause

#7 Post by **aGerman** » 20 May 2014 13:18

Surprising

Obviously always worth it to do some tests. Thanks foxi.

Regards
aGerman

#8 Post by **penpen** » 20 May 2014 15:08

I don't think it is surprising: you used a more complex for /f loop than Aacini (with additional piping, which is slow too) and added a second inner loop.

This should be the "set /P" counterpart of Aacini's code:

Code: Select all

@echo off
setlocal disableDelayedExpansion
< "test.txt" > "test.out_dir_1" cmd /V:ON /C  "@echo off&for /L %%a in (0, 0, 0) do (set "input=" & set /P "input=" & (if not defined input exit) & (if not defined n[^!input^!] (set n[^!input^!]=Y & echo ^!input^!)))"
endlocal

At the moment I'm not able to test it (same for the speed tests...).
I'm also not sure whether it is faster than Aacini's code, as it uses many more set and "if defined" commands.

penpen

#9 Post by **foxidrive** » 20 May 2014 22:56

Here are the results for penpen's code:

Code: Select all

24,000

0 days 00:00:49.513

Code: Select all

12,000

0 days 00:00:26.289

#10 Post by **Dos_Probie** » 21 May 2014 06:12

Still LQQks like Aacini wins the speed test and with only 8 lines of code and no disableDelayedExpansion! :mrgreen:

#11 Post by **Aacini** » 21 May 2014 08:41

I think that the fastest SET /P method would be this one:

Code: Select all

@echo off
setlocal EnableDelayedExpansion

find /C /V "" < test.txt > test.out
set /P lines=< test.out

< test.txt (for /L %%a in (1,1,%lines%) do (
   set /P input=
   if not defined n[!input!] (
      set n[!input!]=Y
      echo !input!
   )
)) > test.out_dir_1

If executing find.exe is faster than executing cmd.exe, this method should run faster than penpen's; however, I am pretty sure it will still be slower than the first FOR /F one.

An interesting comparison would be to take the time in this method after the FIND and SET /P commands (just the FOR /L loop). This would give us a direct comparison of FOR /F method vs. SET /P one!
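
A possible way to take that measurement, sketched under the assumption that Dave's getTimestamp.bat is available and is called with the same options as in foxidrive's test code; the output name test.out_loop_only is made up for this illustration:

```batch
@echo off
setlocal EnableDelayedExpansion

rem Count the lines first; this part is deliberately left outside the timing.
find /C /V "" < test.txt > test.out
set /P lines=< test.out
del test.out

rem Start the clock only now, so just the FOR /L + SET /P loop is measured.
call getTimestamp -f {ums} -r t1
< test.txt (for /L %%a in (1,1,%lines%) do (
   set /P input=
   if not defined n[!input!] (
      set n[!input!]=Y
      echo !input!
   )
)) > test.out_loop_only
call getTimestamp -f {ums} -r t2
call getTimestamp -d %t2%-%t1% -f "{ud} days {hh}:{nn}:{ss}.{fff}" -u
```

The elapsed time printed at the end can then be set against the time of the plain FOR /F version on the same test.txt.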

Antonio

#12 Post by **foxidrive** » 21 May 2014 09:32

Code: Select all

24000

0 days 00:00:39.146

Code: Select all

12000

0 days 00:00:17.437

Using this code:

Code: Select all

@echo off
call getTimestamp -f {ums} -r t1
setlocal EnableDelayedExpansion
find /C /V "" < test.txt > test.out
set /P lines=< test.out
del test.out
< test.txt (for /L %%a in (1,1,%lines%) do (
   set /P input=
   if not defined n[!input!] (
      set n[!input!]=Y
      echo !input!
   )
)) > test.out_dir_7
call getTimestamp -f {ums} -r t2
call getTimestamp -d %t2%-%t1% -f "{ud} days {hh}:{nn}:{ss}.{fff}" -u
pause

#13 Post by **penpen** » 21 May 2014 11:05

On my WinXP Home, this is as fast as Aacini's set /P method on 12,000 unique entries:

Code: Select all

@echo off
setlocal disableDelayedExpansion
< "test.txt" > "test.out_dir_1.txt" cmd /Q /D /E:ON /V:ON /C "set "input= "&set "n[ ]= "&for /L %%a in () do if errorlevel 1 (exit) else set /P "input=" & if not defined n[!input!] set "n[!input!]=Y" & echo !input!"
endlocal

(But my PC is SLOW, so the result might be skewed by HDD speed, ...)

penpen

#14 Post by **foxidrive** » 21 May 2014 11:16

Code: Select all

24,000

0 days 00:00:37.942

Code: Select all

12,000

0 days 00:00:16.769

Test code from penpen's last post.

Code: Select all

@echo off
call getTimestamp -f {ums} -r t1
setlocal disableDelayedExpansion
< "test.txt" > "test.out_dir_1.txt" cmd /Q /D /E:ON /V:ON /C "set "input= "&set "n[ ]= "&for /L %%a in () do if errorlevel 1 (exit) else set /P "input=" & if not defined n[!input!] set "n[!input!]=Y" & echo !input!"
endlocal
call getTimestamp -f {ums} -r t2
call getTimestamp -d %t2%-%t1% -f "{ud} days {hh}:{nn}:{ss}.{fff}" -u
pause

#15 Post by **miskox** » 21 May 2014 12:11

Here are my test results (my computer is not very powerful: an old IBM T43 laptop). The results differ a lot. I think the main factor is CPU power.

Code: Select all

MISKOX -version 1 (findstr) (http://www.dostips.com/forum/viewtopic.php?p=34566#p34566)
0 days 00:12:20.641

MISKOX -version 2 (temp files) (http://www.dostips.com/forum/viewtopic.php?p=34566#p34566)
0 days 00:00:51.828

AACINI (http://www.dostips.com/forum/viewtopic.php?p=34567#p34567)
0 days 00:01:10.437

AGERMAN (http://www.dostips.com/forum/viewtopic.php?p=34584#p34584)
0 days 00:02:05.281

FOXIDRIVE (http://www.dostips.com/forum/viewtopic.php?p=34603#p34603)
0 days 00:01:29.907

PENPEN (http://www.dostips.com/forum/viewtopic.php?p=34616#p34616)
0 days 00:01:30.828

PENPEN (http://www.dostips.com/forum/viewtopic.php?p=34589#p34589)
0 days 00:02:27.516

Code used for the test:

Code: Select all

@echo off
cls
del test.out_* 2>nul

call :1
call :2
call :3
call :4
call :5
call :6
call :7

goto :EOF

:: another option was to call the .bat with a parameter.
if "%1"=="1" goto :1
if "%1"=="2" goto :2
if "%1"=="3" goto :3
if "%1"=="4" goto :4
if "%1"=="5" goto :5
if "%1"=="6" goto :6
if "%1"=="7" goto :7
goto :EOF

:1
echo MISKOX -version 1 (findstr) (http://www.dostips.com/forum/viewtopic.php?p=34566#p34566)
echo 1 %time%
call getTimestamp -f {ums} -r t1
type nul>test.out_1
for /f "tokens=1 delims=" %%f in (test.txt) do findstr /B /C:%%f test.out_1 >nul||>>test.out_1 (echo.%%f)
echo 1 %time%
call getTimestamp -f {ums} -r t2
call getTimestamp -d %t2%-%t1% -f "{ud} days {hh}:{nn}:{ss}.{fff}" -u
echo.

goto :EOF

:2
echo MISKOX -version 2 (temp files) (http://www.dostips.com/forum/viewtopic.php?p=34566#p34566)
set tmpdir=TMPDIR%random%_%random%_%random%
md %tmpdir%
echo 2 %time%
call getTimestamp -f {ums} -r t1
for /f "tokens=1 delims=" %%f in (test.txt) do >%tmpdir%\%%f (echo X)
for /f "tokens=1 delims=" %%f in (test.txt) do if exist %tmpdir%\%%f del %tmpdir%\%%f&>>test.out_2 (echo %%f)
rd /s /q %tmpdir%
echo 2 %time%
call getTimestamp -f {ums} -r t2
call getTimestamp -d %t2%-%t1% -f "{ud} days {hh}:{nn}:{ss}.{fff}" -u
echo.

goto :EOF

:3
echo AACINI (http://www.dostips.com/forum/viewtopic.php?p=34567#p34567)
setlocal
echo 3 %time%
call getTimestamp -f {ums} -r t1
(for /F "delims=" %%a in (test.txt) do (
   if not defined n[%%a] (
      set n[%%a]=Y
      echo %%a
   )
)) > test.out_3
echo 3 %time%
call getTimestamp -f {ums} -r t2
call getTimestamp -d %t2%-%t1% -f "{ud} days {hh}:{nn}:{ss}.{fff}" -u
echo.

goto :EOF

:4
echo AGERMAN (http://www.dostips.com/forum/viewtopic.php?p=34584#p34584)
echo 4 %time%
call getTimestamp -f {ums} -r t1
set "infile=test.txt"
set "outfile=test.out_4"
setlocal EnableDelayedExpansion
<"!infile!" >"!outfile!" (
  for /f %%i in ('type "!infile!"^|find /c /v ""') do for /l %%j in (1 1 %%i) do (
    set "line=" &set /p "line="
    if not defined n!line! (
      set n!line!=Y
      echo !line!
    )
  )
)
endlocal
echo 4 %time%
call getTimestamp -f {ums} -r t2
call getTimestamp -d %t2%-%t1% -f "{ud} days {hh}:{nn}:{ss}.{fff}" -u
echo.

goto :EOF

:5
echo FOXIDRIVE (http://www.dostips.com/forum/viewtopic.php?p=34603#p34603)
setlocal EnableDelayedExpansion
echo 5 %time%
call getTimestamp -f {ums} -r t1
find /C /V "" < test.txt > test.out
set /P lines=< test.out

< test.txt (for /L %%a in (1,1,%lines%) do (
   set /P input=
   if not defined n[!input!] (
      set n[!input!]=Y
      echo !input!
   )
)) > test.out_5
echo 5 %time%
call getTimestamp -f {ums} -r t2
call getTimestamp -d %t2%-%t1% -f "{ud} days {hh}:{nn}:{ss}.{fff}" -u
echo.

goto :EOF

:6
echo PENPEN (http://www.dostips.com/forum/viewtopic.php?p=34616#p34616)
echo 6 %time%
call getTimestamp -f {ums} -r t1
setlocal disableDelayedExpansion
< "test.txt" > "test.out_6.txt" cmd /Q /D /E:ON /V:ON /C "set "input= "&set "n[ ]= "&for /L %%a in () do if errorlevel 1 (exit) else set /P "input=" & if not defined n[!input!] set "n[!input!]=Y" & echo !input!"
endlocal
echo 6 %time%
call getTimestamp -f {ums} -r t2
call getTimestamp -d %t2%-%t1% -f "{ud} days {hh}:{nn}:{ss}.{fff}" -u
echo.

goto :EOF


:7
echo PENPEN (http://www.dostips.com/forum/viewtopic.php?p=34589#p34589)
echo 7 %time%
call getTimestamp -f {ums} -r t1
setlocal disableDelayedExpansion
< "test.txt" > "test.out_7" cmd /V:ON /C  "@echo off&for /L %%a in (0, 0, 0) do (set "input=" & set /P "input=" & (if not defined input exit) & (if not defined n[^!input^!] (set n[^!input^!]=Y & echo ^!input^!)))"
endlocal
echo 7 %time%
call getTimestamp -f {ums} -r t2
call getTimestamp -d %t2%-%t1% -f "{ud} days {hh}:{nn}:{ss}.{fff}" -u
echo.

goto :EOF

Test.txt contains 24015 records.
Test.out (duplicates removed) contains 23947 (68 duplicates)

Test.txt will be posted in the next post because it is too long.

Saso
