
Batch script to remove duplicate rows while ignoring first few characters

Posted: 04 Sep 2019 08:20
by tanhy
Hi, I would like to request help modifying a batch script that removes duplicate rows from a text file.

Script of interest (obtained from https://stackoverflow.com/questions/116 ... -text-file):

Code:

@echo off
setlocal disableDelayedExpansion
set "file=%~1"
set "line=%file%.line"
set "deduped=%file%.deduped"
::Define a variable containing a linefeed character
set LF=^


::The 2 blank lines above are critical, do not remove
>"%deduped%" (
  for /f usebackq^ eol^=^%LF%%LF%^ delims^= %%A in ("%file%") do (
    set "ln=%%A"
    setlocal enableDelayedExpansion
    >"%line%" (echo !ln:\=\\!)
    >nul findstr /xlg:"%line%" "%deduped%" || (echo !ln!)
    endlocal
  )
)
>nul move /y "%deduped%" "%file%"
2>nul del "%line%"
The above script has been working nicely to remove duplicate rows with no issues. However, I now have a new requirement.

I am looking for a modification to the above script so that it detects duplicates by comparing only from the 8th character onwards; that is, it should ignore the first 7 characters even if they differ. If two rows are identical from the 8th character to the end of the line, the repeated row should be removed in its entirety, including its first 7 characters. The first 8 characters have the format "0.00.00_" (without the quotes).

Example...
For an original text file with the following 5 entries:
0.10.01_ABC_X
0.10.04_DEFG_Y
0.10.01_ABC_X
0.10.02_DEFG_Y
1.11.03_PQRST_M

I would like the output to be:
0.10.01_ABC_X
0.10.04_DEFG_Y
1.11.03_PQRST_M

Thank you very much!

Re: Batch script to remove duplicate rows while ignoring first few characters

Posted: 04 Sep 2019 13:58
by Eureka!
I can't test this against (part of) the actual text file, so it might produce wrong results.
But since you didn't provide one:

Code:

@echo off
setlocal

::__________________________________________________
::
::      SETTINGS
::__________________________________________________
::
    set LENGTH=8
    set SUFFIX=
    set INPUT=INPUT.txt
    set TUSSEN1=temp1.deleteme
    set OUTPUT=output.txt


::__________________________________________________
::
::      ACTION
::__________________________________________________
::

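    :: Start from an empty output file, because PARSELINE appends to it.
    if exist "%OUTPUT%" del "%OUTPUT%"
    :: Sort starting at column LENGTH so that lines with equal tails become adjacent.
    sort /+%LENGTH% "%INPUT%" > "%TUSSEN1%"
    for /f "usebackq delims=" %%x in ("%TUSSEN1%") DO call :PARSELINE "%%x"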

    del "%TUSSEN1%"
    type "%OUTPUT%"
goto :EOF


::==================================================
::__________________________________________________
::
        :PARSELINE
::__________________________________________________
::
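    set thisline=%~1
    :: Drop the first LENGTH characters; "call" forces a second expansion pass.
    call set tail=%%thisline:~%LENGTH%%%
    :: The file is sorted by tail, so a duplicate always directly follows its twin.
    if /i "%tail%" neq "%SUFFIX%" (
        >>"%OUTPUT%" echo %thisline%
        set SUFFIX=%tail%
    )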
goto :EOF
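
A note on the offsets, since they are easy to mix up: sort /+n counts columns from 1, while the %var:~n% substring offset counts from 0. With LENGTH=8 the sort therefore starts comparing at the underscore, and the substring starts one character later. Both give the same grouping here only because the 8th character is always "_". The snippet below is just a small illustration using one of the sample lines; the variable names are reused from the script above and nothing else is assumed.

```batch
@echo off
setlocal
set "thisline=0.10.04_DEFG_Y"
set LENGTH=8

rem Zero-based offset 8: skips "0.10.04_" and prints DEFG_Y
echo %thisline:~8%

rem Zero-based offset 7: starts at the 8th character and prints _DEFG_Y
echo %thisline:~7%

rem Same as the first echo, but with the offset taken from a variable.
rem The first pass turns the line into: set tail=%thisline:~8%
rem and "call" then expands it a second time.
call set tail=%%thisline:~%LENGTH%%%
echo %tail%
```

If the comparison really has to begin at the 8th character (for example when that character is not always an underscore), changing the substring offset to 7 while leaving the sort column at 8 should do it.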

Re: Batch script to remove duplicate rows while ignoring first few characters

Posted: 04 Sep 2019 15:48
by Aacini

Code:

@echo off
setlocal DisableDelayedExpansion

(for /F "tokens=1* delims=_" %%a in (input.txt) do (
   if not defined key["%%b"] (
      set key["%%b"]=1
      echo %%a_%%b
   )
)) > output.txt
Antonio
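
For readers wondering how this works: for /F splits each line at the first underscore, so %%a holds the "0.00.00" prefix and %%b holds everything after it. The first time a given %%b shows up, a marker variable key["..."] is defined and the line is echoed; later lines with the same %%b find the marker already defined and are skipped. The original line order is kept.

That relies on the prefix ending at the first underscore. If the key has to be taken by position instead ("from the 8th character onwards"), a variant might look like the sketch below. It is only a sketch under some assumptions: the file names are the same placeholders as above, seen[...] and key are names made up for this example, and the lines must not contain ! or = characters (delayed expansion is enabled for the whole loop, which would mangle them). Lines shorter than 8 characters all share one empty key.

```batch
@echo off
setlocal EnableDelayedExpansion

(for /F "usebackq delims=" %%a in ("input.txt") do (
   set "ln=%%a"
   rem Zero-based offset 7 = everything from the 8th character onwards.
   set "key=!ln:~7!"
   rem Keep the line only the first time this key is seen.
   if not defined seen[!key!] (
      set "seen[!key!]=1"
      echo(!ln!
   )
)) > output.txt
```

With the five sample lines from the first post this should write the three expected lines, in their original order. As with the script above, environment variable names are not case-sensitive, so keys that differ only in case are treated as duplicates.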

Re: Batch script to remove duplicate rows while ignoring first few characters

Posted: 05 Sep 2019 08:46
by tanhy
Excellent stuff! Thank you both very much! It is greatly appreciated. :D

Both Eureka!'s and Aacini's scripts work beautifully. Requirement met.