Page 1 of 1

Easy way to check tokens 5-16 and delete if specified pattern?

Posted: 20 Feb 2020 03:22
by SIMMS7400
Hi Folks -

I have the following data file (sample) that is created by another team and they are unwilling to work with me to clean up their process. Therefore, I need to handle cleaning up the file on my end:
E01,S00000,1040-00,2019,"","","","","","","","","","","",""
E01,S00000,1100-00,2019,267171.46,273027.23,717379.93,195954.96,296135.03,813816.07,252308.17,304287.16,698450.70,335489.17,567923.76,728968.55
E01,S00000,1100-30,2019,1889066.33,2863499.20,682036.53,1985084.97,3921360.24,980411.11,2915929.92,4929743.93,1060628.90,2744523.44,3903142.91,595767.42
E01,S00000,1130-00,2019,"","","","","","","","","","","",""
E01,S00000,1140-00,2019,"","","","","","","","","","","",""
E01,S00000,1210-30,2019,163016436.29,162153109.71,161479341.90,161241199.50,160727337.53,159780948.62,159718783.88,159194871.14,159133778.25,158872207.43,158751022.73,158882082.53
E01,S00000,1210-40,2019,-218211945.61,-217738772.03,-221904629.75,-225170015.31,-224702941.25,-225234443.63,-227407317.13,-230219433.60,-229532466.89,-217082211.44,-204252043.55,-190508732.54
E01,S00000,1210-90,2019,"",'#MISSING','#MISSING','#MISSING','#MISSING','#MISSING','#MISSING','#MISSING','#MISSING','#MISSING','#MISSING','#MISSING'
E01,S00000,1210-50,2019,1578792.53,4648868.03,-263904.89,2608853.62,2688750.22,-937903.84,147911.79,-747464.56,765309.36,656466.40,-145911.30,501938.01
E01,S00000,1210-90,2019,'#MISSING','#MISSING','#MISSING','#MISSING','#MISSING','#MISSING','#MISSING','#MISSING','#MISSING','#MISSING','#MISSING','#MISSING'
What I need to do is remove the lines where tokens 5-16 are either NULL (""), 'MISSING' or a combination of "" & 'MISSING'.

Right now, I have this code running before the data upload and it's working fine:
FINDSTR /V /C:"\"\",\"\",\"\",\"\",\"\",\"\",\"\",\"\",\"\",\"\",\"\",\"\"" "%DATAFILENAME%" > "%DATAFILENAME%out"
FINDSTR /V /C:"'#MISSING','#MISSING','#MISSING','#MISSING','#MISSING','#MISSING','#MISSING','#MISSING','#MISSING','#MISSING','#MISSING','#MISSING'" "%DATAFILENAME%out" > "%DATAFILENAME%out1"
But I'm unsure how to code for the 3rd exception I mentioned above when it would be a combination of 'MISSING' and "". Essentially, as long as there are NO numerical values in tokens 5-16, I can delete that line. Any ideas?

Re: Easy way to check tokens 5-16 and delete if specified pattern?

Posted: 20 Feb 2020 06:04
by penpen
You should use Dave's JREPL.bat:
viewtopic.php?f=3&t=6044

penpen

Re: Easy way to check tokens 5-16 and delete if specified pattern?

Posted: 20 Feb 2020 07:39
by dbenham
Your two existing FINDSTR statements are special cases of a general solution.

You have already stated the simple solution without realizing it:
SIMMS7400 wrote:Essentially, as long as there are NO numerical values in tokens 5-16, I can delete that line.
Based on your description, plus your sample data, all you need to do is delete any line that does not contain a digit after the 4th comma. Or, more appropriately, preserve any line that does have a digit after the 4th comma.

Code: Select all

findstr "^[^,]*,[^,]*,[^,]*,[^,]*,.*[0123456789]" "%DATAFILENAME%" >""%DATAFILENAME%out"
Certainly JREPL can do this easily, but there really is no need to use anything other than FINDSTR, except for the fact that the JREPL regex can be more compact.

Code: Select all

call jrepl "^(.*?,){4}.*\d" "" /k 0 /f "%DATAFILENAME%" /o "%DATAFILENAME%out"
And if you are OK with overwriting the original file, then

Code: Select all

call jrepl "^(.*?,){4}.*\d" "" /k 0 /f "%DATAFILENAME%" /o -
Dave Benham

Re: Easy way to check tokens 5-16 and delete if specified pattern?

Posted: 20 Feb 2020 11:28
by SIMMS7400
wow!!!! THat was easy enough.

Thank you Dave, I really appreciate it!!!

Re: Easy way to check tokens 5-16 and delete if specified pattern?

Posted: 20 Feb 2020 13:06
by penpen
We don't know exactly how those numbers are generated and most probably they are IEEE 754 numbers where you might get "inf", "+inf", "-inf" or "NaN" as well.
To not miss such lines i would use something like the following (untested):

Code: Select all

call jrepl "^[^,]*(,[^,]*){3}(,\x22\x22|,'#MISSING'){12}.*" "" /R 0 /f "%DATAFILENAME%"
penpen