JREPL to extract unique values from 16million row file?

Discussion forum for all Windows batch related topics.

Moderator: DosItHelp

Post Reply
Message
Author
SIMMS7400
Posts: 507
Joined: 07 Jan 2016 07:47

JREPL to extract unique values from 16million row file?

#1 Post by SIMMS7400 » 03 Apr 2021 06:44

HI Dave/Team -

I have a text file that is 16 million rows and I need to extract the unique values from the 8th position of the file.

Code: Select all

"Property Gross Royalties"	"Commercial"	"Actual"	"Final"	"FY14"	"Jan"	"Periodic"	"O-20003921"	"FDR"	"PA"	120
"Sides"	"Commercial"	"Actual"	"Final"	"FY14"	"Jan"	"Periodic"	"O-20003921"	"FDR"	"PA"	1
"Volume"	"Commercial"	"Actual"	"Final"	"FY14"	"Jan"	"Periodic"	"O-20003921"	"FDR"	"PA"	17000
"Property AGC"	"Commercial"	"Actual"	"Final"	"FY14"	"Jan"	"Periodic"	"O-20003921"	"FDR"	"PA"	2000
Is JREPL the right tool for this?

Aacini
Expert
Posts: 1693
Joined: 06 Dec 2011 22:15
Location: México City, México

Re: JREPL to extract unique values from 16million row file?

#2 Post by Aacini » 04 Apr 2021 08:21

I am afraid I don't understand what exactly you want. Do you want the values of the 8th position that appear just one time? (that is, in just one line)

This program counts the number of times that appear each value of the 8th column:

Code: Select all

@if (@CodeSection == @Batch) @then

@echo off
cscript //nologo //E:jscript "%~F0" < test.txt
goto :EOF

@end

var count = new Array();
while ( ! WScript.Stdin.AtEndOfStream ) {
   var line = WScript.Stdin.ReadLine();
   // WScript.Stdout.WriteLine(line);
   var column = line.split("\t");
   // WScript.Stdout.WriteLine(column[7]);
   if (count[column[7]] == undefined) count[column[7]] = 0;
   count[column[7]] += 1;
}

for ( var item in count) {
   WScript.Stdout.WriteLine(item+": "+count[item]+" times");
}
For example:

Code: Select all

"O-20003921": 4 times
... so just add an "if (count[item] == 1)" to show just the values that appear one time...

Antonio

SIMMS7400
Posts: 507
Joined: 07 Jan 2016 07:47

Re: JREPL to extract unique values from 16million row file?

#3 Post by SIMMS7400 » 04 Apr 2021 10:19

Hi Aacini !

Thank you ! Basically I need to pull all the unique values in column 8. Is that possible?

Squashman
Expert
Posts: 4280
Joined: 23 Dec 2011 13:59

Re: JREPL to extract unique values from 16million row file?

#4 Post by Squashman » 04 Apr 2021 10:20

Technically speaking a well formatted delimited file could be read by a FOR /F command. Only drawback to using a FOR /F is that it will load the entire file into memory and will fail if the file is larger than the 32Bit limit. Roughly 2.1GB's.

Squashman
Expert
Posts: 4280
Joined: 23 Dec 2011 13:59

Re: JREPL to extract unique values from 16million row file?

#5 Post by Squashman » 04 Apr 2021 10:27

Dave's JSORT might be another option for you as well. That would give you all the unique values in the 8th delimited field but would also give you the whole record. But again, you could use a FOR /F to parse the output of JSORT from inside the IN clause.

Aacini
Expert
Posts: 1693
Joined: 06 Dec 2011 22:15
Location: México City, México

Re: JREPL to extract unique values from 16million row file?

#6 Post by Aacini » 04 Apr 2021 20:21

Use this code. I think it is the fastest way to get this result:

Code: Select all

@if (@CodeSection == @Batch) @then

@echo off
cscript //nologo //E:jscript "%~F0" < test.txt
goto :EOF

@end

var count = new Array();
while ( ! WScript.Stdin.AtEndOfStream ) {
   var line = WScript.Stdin.ReadLine();
   var column = line.split("\t");
   count[column[7]] = 0;
}

for ( var item in count) {
   WScript.Stdout.WriteLine(item);
}
Antonio

SIMMS7400
Posts: 507
Joined: 07 Jan 2016 07:47

Re: JREPL to extract unique values from 16million row file?

#7 Post by SIMMS7400 » 08 Apr 2021 02:16

Antonio -

That worked wonderfully!! 16+ millions rows very quickly. Thank you again for this, much appreciated!!

Post Reply