Parsing taking FOREVER - possible JSCRIPT solution?
Posted: 02 Dec 2018 05:01
by SIMMS7400
Hi Folks -
I have a text file that consists of 85k rows. I need to parse that file, extract from token 2 only the strings that begin with "C-", and spool those results to a file. From there, I need to remove all duplicates from that file. The end result should be a file containing NO duplicates.
I'm using the following solution, which is taking a considerable amount of time:
Code:
@ECHO OFF
for /f "tokens=2 delims=|" %%A in (FDRII_outline.txt) do (
   ECHO "%%~A" | FINDSTR /C:"C-" >Nul 2>&1 && ECHO %%~A>>"out.txt"
)
jsort out.txt /u >out.txt.new
move /y out.txt.new out.txt >nul
Could the first portion of my code be replaced with another JSCRIPT solution? I could also leverage a VB script and use a dictionary but figured I'd ask first if anyone had any more efficient ways than my current solution.
Thank you!
EDIT:
What I mean when I say remove duplicates is that I need to remove the duplicate value AS WELL AS the original value. Essentially, the final file should be all instances that NEVER had a duplicate to begin with.
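Since the question asks about JScript, here is a minimal sketch of the extraction step only, written in ES3-style JavaScript so that it should run under both cscript and Node. The function name `extractCValues` and the sample lines in the usage note are made up for illustration, because no sample of the input file was posted. It assumes "|" is the only delimiter and mimics two behaviors of `for /F`: empty fields are skipped, and surrounding quotes are stripped as `%%~A` does. It does not mimic the default `eol=;` handling.

```javascript
// Return token 2 of each "|"-delimited line when it starts with "C-".
// Assumption: like FOR /F, consecutive delimiters count as one,
// so empty fields are skipped when counting tokens.
function extractCValues(lines) {
    var out = [];
    var i, j, fields, tokens, value;
    for (i = 0; i < lines.length; i++) {
        fields = lines[i].split("|");
        tokens = [];
        for (j = 0; j < fields.length; j++) {
            if (fields[j] !== "") {
                tokens.push(fields[j]);
            }
        }
        if (tokens.length < 2) {
            continue; // no second token on this line
        }
        value = tokens[1];
        // Strip one pair of surrounding quotes, as %%~A would.
        if (value.length >= 2 && value.charAt(0) === '"' &&
                value.charAt(value.length - 1) === '"') {
            value = value.substring(1, value.length - 1);
        }
        // Prefix test: the value must begin with "C-".
        if (value.substring(0, 2) === "C-") {
            out.push(value);
        }
    }
    return out;
}
```

For example, `extractCValues(["A|C-1|x", "B|D-2|y", "C|XC-3|z"])` keeps only `"C-1"`. Note that `FINDSTR /C:"C-"` matches the string anywhere in the token, so a value such as `XC-3` would pass the original filter but not this prefix test. Under Windows Script Host the lines would have to be read with something like `Scripting.FileSystemObject`, which is left out here.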
Re: Parsing taking FOREVER - possible JSCRIPT solution?
Posted: 02 Dec 2018 08:16
by ShadowThief
Does the order of the lines matter, or can I sort the strings alphabetically in order to make it easier for me to detect and remove duplicates?
Is the leftmost column a fixed length?
Any poison characters I need to look out for in the "C-" section?
Re: Parsing taking FOREVER - possible JSCRIPT solution?
Posted: 02 Dec 2018 08:40
by SIMMS7400
Hi Shadow -
Nope, sorting doesn't matter at all, as long as the final file has both the duplicate(s) and the original removed.
Example:
Before:
1
2
3
1
1
4
5
After:
2
3
4
5
Thanks!
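The before/after example above can be reproduced with a small counting sketch in ES3-style JavaScript (so it should also run as JScript). The name `keepSingletons` is hypothetical, and a plain object stands in for the dictionary mentioned in the first post. Values are counted in one pass, then only those seen exactly once are kept, in first-seen order.

```javascript
// Keep only the values that occur exactly once in the input.
function keepSingletons(values) {
    var counts = {};  // value -> number of occurrences
    var order = [];   // distinct values in first-seen order
    var result = [];
    var i, key;
    for (i = 0; i < values.length; i++) {
        key = values[i];
        if (Object.prototype.hasOwnProperty.call(counts, key)) {
            counts[key] += 1;
        } else {
            counts[key] = 1;
            order.push(key);
        }
    }
    for (i = 0; i < order.length; i++) {
        if (counts[order[i]] === 1) {
            result.push(order[i]);
        }
    }
    return result;
}

var before = ["1", "2", "3", "1", "1", "4", "5"];
var after = keepSingletons(before); // ["2", "3", "4", "5"]
```

The separate `order` array is a design choice: it keeps the output order independent of how the engine enumerates object keys.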
Re: Parsing taking FOREVER - possible JSCRIPT solution?
Posted: 02 Dec 2018 08:41
by Aacini
You have not given any description of the desired values: How long are they? Do they contain special characters? How many unique values could be expected in the 85K rows? All these points are needed in order to create an efficient solution. I invite you to carefully read the first post in this forum...
With no info about the values, I could only write the simplest solution that I think could run fast:
Code:
@echo off
setlocal EnableDelayedExpansion
rem Count desired values
for /F "tokens=2 delims=|" %%a in (FDRII_outline.txt) do (
   set "value=%%~a"
   if "!value:~0,2!" equ "C-" set /A "row[!value:~2!]+=1"
)
rem Output unique values
(for /F "tokens=2,3 delims=[]=" %%a in ('set row[') do (
   if %%b equ 1 echo C-%%a
)) > out.txt
This method fails if the values contain special characters that are SET /A arithmetic operators (other than the minus sign at the second position).
This method runs progressively slower if the values are very long or there is a large number of unique values. Anyway, I am pretty sure that this method will run much faster than the original code.
Ah! And this method outputs the values in sorted order.
Obviously, I could not test this code because you have not posted a sample of the input file...
Antonio
Re: Parsing taking FOREVER - possible JSCRIPT solution?
Posted: 02 Dec 2018 09:04
by Aacini
Another method that does not have the restrictions of my previous one...
Code:
@echo off
setlocal EnableDelayedExpansion
rem Extract desired values
(for /F "tokens=2 delims=|" %%a in (FDRII_outline.txt) do (
   set "value=%%~a"
   if "!value:~0,2!" equ "C-" echo !value!
)) > out1.txt
rem Sort desired values (this is faster than running SORT inside the FOR /F)
sort out1.txt > out2.txt
rem Output unique values
set "last="
set "count=0"
(
for /F "delims=" %%a in (out2.txt) do (
   if "%%a" equ "!last!" (
      set /A count+=1
   ) else (
      if !count! equ 1 echo !last!
      set count=1
   )
   set "last=%%a"
)
rem Emit the final value if it occurred only once
if !count! equ 1 echo !last!
) > out.txt
This method only fails if the values have an exclamation mark. This point can be fixed, if needed...
Antonio
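The sort-then-scan idea described in this post can also be sketched in ES3-style JavaScript, for anyone who wants to compare it with a JScript version. The name `singletonsBySortScan` is hypothetical. After sorting, equal values sit next to each other, so a single pass with a `last` value and a `count` is enough. One detail matters in any implementation of this scan: the run that is still open when the loop ends has to be emitted too.

```javascript
// Return the values that occur exactly once, in sorted order.
function singletonsBySortScan(values) {
    var sorted = values.slice().sort(); // copy, then sort as strings
    var result = [];
    var last = null;
    var count = 0;
    var i;
    for (i = 0; i < sorted.length; i++) {
        if (count > 0 && sorted[i] === last) {
            count += 1;               // same run continues
        } else {
            if (count === 1) {
                result.push(last);    // previous run had a single member
            }
            last = sorted[i];         // start a new run
            count = 1;
        }
    }
    if (count === 1) {
        result.push(last);            // flush the final run
    }
    return result;
}
```

For example, `singletonsBySortScan(["C-3", "C-1", "C-2", "C-1", "C-1"])` yields `["C-2", "C-3"]`. Unlike the SORT command, the default `sort()` here compares strings case-sensitively.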
Re: Parsing taking FOREVER - possible JSCRIPT solution?
Posted: 02 Dec 2018 09:28
by SIMMS7400
Antonio -
That worked like an absolute charm!!!!!
Thank you so much! Here is the end result:
C-12527631
C-12527651
C-12527800
C-12527825
C-12527844
C-12527883
C-12527892
C-12527902
C-12527904
C-12527906
C-12527907
C-12527908
C-12527911
C-12527913
C-12527914
C-12527920
C-12527925
C-12527926
C-12527930
C-12527934
C-12527945
C-12527947
C-12527950
C-12527955
C-12527956
C-12527960
C-12527961
C-12527965
C-12527977
C-12527978
C-12527980
C-12527983
C-12527987
C-12527989
C-12527990
C-12527991
C-12527992
C-12527993
C-12527996
C-12527997
C-12528000
C-12528004
C-12528006
Re: Parsing taking FOREVER - possible JSCRIPT solution?
Posted: 02 Dec 2018 10:41
by Aacini
I suggest you also test my first code. You could measure the time that both methods take and post the results (and also the time that your original code takes).
IMHO this topic is about efficiency, isn't it? So this point deserves full attention (not just getting the correct result).
I am very interested in this type of tests!
Antonio
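For anyone who wants to run this kind of timing test on a script-based variant, a rough harness could look like the sketch below (ES3-style JavaScript, so it should also run as JScript). The names `timeIt` and `makeSample` are hypothetical, and the sample data is synthetic, not taken from the real input file. A plain sort is timed here only as a stand-in for whichever method is being measured.

```javascript
// Run fn(arg) once and report the wall-clock time in milliseconds.
function timeIt(fn, arg) {
    var start = new Date().getTime();
    var result = fn(arg);
    return { result: result, ms: new Date().getTime() - start };
}

// Build n synthetic values that cycle through `distinct` different keys.
function makeSample(n, distinct) {
    var values = [];
    var i;
    for (i = 0; i < n; i++) {
        values.push("C-" + (i % distinct));
    }
    return values;
}

var sample = makeSample(85000, 40000);
var timing = timeIt(function (v) { return v.slice().sort(); }, sample);
// timing.ms holds the elapsed milliseconds, timing.result the sorted copy.
```

A single run is noisy, so repeating each measurement a few times and comparing the fastest runs gives a fairer picture.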