Batch script to remove duplicate rows while ignoring first few characters

Discussion forum for all Windows batch related topics.

Moderator: DosItHelp

tanhy
Posts: 2
Joined: 03 Sep 2019 12:18

Batch script to remove duplicate rows while ignoring first few characters

#1 Post by tanhy » 04 Sep 2019 08:20

Hi, I would like to request help modifying a batch script that removes duplicate rows from a text file.

Script of interest (obtained from https://stackoverflow.com/questions/116 ... -text-file):

Code:

@echo off
setlocal disableDelayedExpansion
set "file=%~1"
set "line=%file%.line"
set "deduped=%file%.deduped"
::Define a variable containing a linefeed character
set LF=^


::The 2 blank lines above are critical, do not remove
>"%deduped%" (
  for /f usebackq^ eol^=^%LF%%LF%^ delims^= %%A in ("%file%") do (
    set "ln=%%A"
    setlocal enableDelayedExpansion
    >"%line%" (echo !ln:\=\\!)
    >nul findstr /xlg:"%line%" "%deduped%" || (echo !ln!)
    endlocal
  )
)
>nul move /y "%deduped%" "%file%"
2>nul del "%line%"

The above script has been working nicely to remove duplicate rows with no issues. However, I now have a new requirement.

I am looking for a modification to the above script that removes duplicates by comparing only from the 8th character onwards; that is, it should ignore the first 7 characters even if they differ. If two rows are identical from the 8th character to the end of the line, the repeated row is to be removed in its entirety, including its first 7 characters. The first 8 characters have the format "0.00.00_" (without the quotes).

Example...
For an original text file with the following 5 entries:
0.10.01_ABC_X
0.10.04_DEFG_Y
0.10.01_ABC_X
0.10.02_DEFG_Y
1.11.03_PQRST_M

I would like the output to be:
0.10.01_ABC_X
0.10.04_DEFG_Y
1.11.03_PQRST_M

Thank you very much!

Eureka!
Posts: 136
Joined: 25 Jul 2019 18:25

Re: Batch script to remove duplicate rows while ignoring first few characters

#2 Post by Eureka! » 04 Sep 2019 13:58

Without (part of) the actual text file, this might produce false results.
But as you didn't provide that:

Code:

@echo off
setlocal

::__________________________________________________
::
::      SETTINGS
::__________________________________________________
::
    set LENGTH=8
    set SUFFIX=
    set INPUT=INPUT.txt
    set TUSSEN1=temp1.deleteme
    set OUTPUT=output.txt


::__________________________________________________
::
::      ACTION
::__________________________________________________
::

    if exist "%OUTPUT%" del "%OUTPUT%"
    sort /+%LENGTH% "%INPUT%" > "%TUSSEN1%"
    for /f "delims=" %%x in (%TUSSEN1%) DO call :PARSELINE "%%x"

    del "%TUSSEN1%"
    type "%OUTPUT%"
goto :EOF


::==================================================
::__________________________________________________
::
        :PARSELINE
::__________________________________________________
::
    set thisline=%~1
    call set tail=%%thisline:~%LENGTH%%%
    if /i "%tail%" neq "%SUFFIX%" (
        echo %thisline%>>"%OUTPUT%"
        set SUFFIX=%tail%
    ) 
goto :EOF
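
A note on the two offsets used in the script above, as a small sketch rather than part of the solution: `sort /+n` counts characters starting from 1, while the `%var:~n%` substring syntax skips n characters. With LENGTH=8, the sort therefore starts comparing at the underscore (the 8th character) and the substring starts just after it. The variable name `sample` is made up for illustration.

```batch
@echo off
setlocal
rem Hypothetical sample line in the "0.00.00_" format from post #1
set "sample=0.10.01_ABC_X"
rem :~8 skips 8 characters, so the tail starts at the 9th character
echo %sample:~8%
rem :~7 skips 7 characters, so the tail still contains the underscore
echo %sample:~7%
```

This prints ABC_X and then _ABC_X. Because the 8th character is always an underscore in this format, either offset detects the same duplicates. Note also that the script sorts its input, so the output follows the sorted order of the tails rather than the original line order.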

Aacini
Expert
Posts: 1885
Joined: 06 Dec 2011 22:15
Location: México City, México

Re: Batch script to remove duplicate rows while ignoring first few characters

#3 Post by Aacini » 04 Sep 2019 15:48

Code:

@echo off
setlocal DisableDelayedExpansion

(for /F "tokens=1* delims=_" %%a in (input.txt) do (
   if not defined key["%%b"] (
      set key["%%b"]=1
      echo %%a_%%b
   )
)) > output.txt
Antonio
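
For files where the prefix has a fixed width but does not end in a delimiter, the same lookup-variable idea could be keyed on a character offset instead. The following is only a sketch under assumptions: the names `SKIP`, `line`, `tail` and `seen[...]` are made up for illustration, the file names match the script above, and lines are assumed to contain no `!` or `=` characters (delayed expansion would drop the former, and `set` would split the variable name at the latter). Since variable names are not case-sensitive, tails that differ only in case would be treated as duplicates.

```batch
@echo off
setlocal EnableDelayedExpansion

rem SKIP = number of leading characters to ignore (assumed 8, the "0.00.00_" prefix)
set "SKIP=8"

(for /F "usebackq delims=" %%L in ("input.txt") do (
   set "line=%%L"
   rem Everything after the first SKIP characters is the comparison key
   set "tail=!line:~%SKIP%!"
   rem Keep the line only the first time its key is seen
   if not defined seen["!tail!"] (
      set seen["!tail!"]=1
      echo(!line!
   )
)) > "output.txt"
```

With the five sample lines from post #1 this should keep 0.10.01_ABC_X, 0.10.04_DEFG_Y and 1.11.03_PQRST_M, in their original order. As with any `for /F` loop using default options, empty lines and lines starting with a semicolon are skipped.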

tanhy
Posts: 2
Joined: 03 Sep 2019 12:18

Re: Batch script to remove duplicate rows while ignoring first few characters

#4 Post by tanhy » 05 Sep 2019 08:46

Excellent stuff! Thank you both very much! It is greatly appreciated. :D

Both Eureka!'s and Aacini's scripts work beautifully. Requirement met.
