Page 1 of 1

JREPL to extract unique values from 16million row file?

Posted: 03 Apr 2021 06:44
by SIMMS7400
HI Dave/Team -

I have a text file that is 16 million rows and I need to extract the unique values from the 8th position of the file.

Code: Select all

"Property Gross Royalties"	"Commercial"	"Actual"	"Final"	"FY14"	"Jan"	"Periodic"	"O-20003921"	"FDR"	"PA"	120
"Sides"	"Commercial"	"Actual"	"Final"	"FY14"	"Jan"	"Periodic"	"O-20003921"	"FDR"	"PA"	1
"Volume"	"Commercial"	"Actual"	"Final"	"FY14"	"Jan"	"Periodic"	"O-20003921"	"FDR"	"PA"	17000
"Property AGC"	"Commercial"	"Actual"	"Final"	"FY14"	"Jan"	"Periodic"	"O-20003921"	"FDR"	"PA"	2000
Is JREPL the right tool for this?

Re: JREPL to extract unique values from 16million row file?

Posted: 04 Apr 2021 08:21
by Aacini
I am afraid I don't understand what exactly you want. Do you want the values of the 8th position that appear just one time? (that is, in just one line)

This program counts the number of times that appear each value of the 8th column:

Code: Select all

@if (@CodeSection == @Batch) @then

@echo off
cscript //nologo //E:jscript "%~F0" < test.txt
goto :EOF

@end

var count = new Array();
while ( ! WScript.Stdin.AtEndOfStream ) {
   var line = WScript.Stdin.ReadLine();
   // WScript.Stdout.WriteLine(line);
   var column = line.split("\t");
   // WScript.Stdout.WriteLine(column[7]);
   if (count[column[7]] == undefined) count[column[7]] = 0;
   count[column[7]] += 1;
}

for ( var item in count) {
   WScript.Stdout.WriteLine(item+": "+count[item]+" times");
}
For example:

Code: Select all

"O-20003921": 4 times
... so just add an "if (count[item] == 1)" to show just the values that appear one time...

Antonio

Re: JREPL to extract unique values from 16million row file?

Posted: 04 Apr 2021 10:19
by SIMMS7400
Hi Aacini !

Thank you ! Basically I need to pull all the unique values in column 8. Is that possible?

Re: JREPL to extract unique values from 16million row file?

Posted: 04 Apr 2021 10:20
by Squashman
Technically speaking a well formatted delimited file could be read by a FOR /F command. Only drawback to using a FOR /F is that it will load the entire file into memory and will fail if the file is larger than the 32Bit limit. Roughly 2.1GB's.

Re: JREPL to extract unique values from 16million row file?

Posted: 04 Apr 2021 10:27
by Squashman
Dave's JSORT might be another option for you as well. That would give you all the unique values in the 8th delimited field but would also give you the whole record. But again, you could use a FOR /F to parse the output of JSORT from inside the IN clause.

Re: JREPL to extract unique values from 16million row file?

Posted: 04 Apr 2021 20:21
by Aacini
Use this code. I think it is the fastest way to get this result:

Code: Select all

@if (@CodeSection == @Batch) @then

@echo off
cscript //nologo //E:jscript "%~F0" < test.txt
goto :EOF

@end

var count = new Array();
while ( ! WScript.Stdin.AtEndOfStream ) {
   var line = WScript.Stdin.ReadLine();
   var column = line.split("\t");
   count[column[7]] = 0;
}

for ( var item in count) {
   WScript.Stdout.WriteLine(item);
}
Antonio

Re: JREPL to extract unique values from 16million row file?

Posted: 08 Apr 2021 02:16
by SIMMS7400
Antonio -

That worked wonderfully!! 16+ millions rows very quickly. Thank you again for this, much appreciated!!