Safely parse nearly any CSV with parseCSV.bat

Discussion forum for all Windows batch related topics.

Moderator: DosItHelp

Post Reply
Message
Author
dbenham
Expert
Posts: 2198
Joined: 12 Feb 2011 21:02
Location: United States (east coast)

Safely parse nearly any CSV with parseCSV.bat

#1 Post by dbenham » 19 Jun 2014 00:15

It is fairly common that someone wants to parse CSV using FOR /F. This is a simple task if you know all columns are populated, and there are no commas, newlines, or quotes within values. Assume there are 4 columns:

Code: Select all

@echo off
for /f "tokens=1-4 delims=," %%A in (test.csv) do (
  echo ----------------------
  echo A=%%~A
  echo B=%%~B
  echo C=%%~C
  echo D=%%~D
  echo(
)

But things become more difficult if any of the following conditions occur:

1 - Values may be empty, with consecutive commas. FOR /F treats consecutive delimiters as one, so it will throw off the column assignment.

2 - Quoted values may contain commas. FOR /F will incorrectly treat a quoted comma as a column delimiter.

3 - Quoted values may contain newlines. FOR /F will break the line at the newline and incorrectly treat the one row as two.

4 - Quoted values may contain paired quotes that represent one quote. For example, "He said, ""Hello there"". A method is needed to convert "" into ".

Then there is a secondary problems that can crop up if delayed expansion is enabled.

5 - A FOR variable %%A will be corrupted if it contains ! (or sometimes ^) if delayed expansion is enabled when the variable is expanded.

There are fairly easy solutions for some of these issues, but solving all of them is extremely difficult (and slow) with pure batch.

I have written a hybrid JScript/batch utility called parseCSV.bat that makes it easy and relatively efficient to correctly parse nearly any CSV file with FOR /F.

EDIT 2017-02-19: Updated to version 1.1 to enable CSV transformations for purposes other than parsing by FOR /F. Also now supports escape sequences.
EDIT 2017-02-20: Updated to version 1.2 to fix a bug concerning /E with /Q:N, added the /U option, and improved performance

parseCSV.bat

Code: Select all

@if (@X)==(@Y) @end /* harmless hybrid line that begins a JScrpt comment

::************ Documentation ***********
::parseCSV.bat version 1.2
:::
:::parseCSV  [/option]...
:::
:::  Parse stdin as CSV and write it to stdout in a way that can be safely
:::  parsed by FOR /F. All columns will be enclosed by quotes so that empty
:::  columns may be preserved. It also supports delimiters, newlines, and
:::  escaped quotes within quoted values. Two consecutive quotes within a
:::  quoted value are converted into one quote by default.
:::
:::  Available options:
:::
:::    /I:string = Input delimiter. Default is a comma (,)
:::
:::    /O:string = Output delimiter. Default is a comma (,)
:::
:::         The entire option must be quoted if specifying poison character
:::         or whitespace literals as a delimiters for /I or /O.
:::
:::         Examples:  pipe = "/I:|"
:::                   space = "/I: "
:::
:::         Standard JScript escape sequences can also be used.
:::
:::         Examples:       tab = /I:\t  or  /I:\x09
:::                   backslash = /I:\\
:::
:::    /E = Encode output delimiter literal within value as \D
:::         Encode newline within value as \N
:::         Encode backslash within value as \S
:::
:::    /D = escape exclamation point and caret for Delayed expansion
:::         ! becomes ^!
:::         ^ becomes ^^
:::
:::    /L = treat all input quotes as quote Literals
:::
:::    /Q:QuoteOutputFormat
:::
:::       Controls output of Quotes, where QuoteOutputFormat may be any
:::       one of the following:
:::
:::         L = all columns quoted, quote Literals output as "   (Default)
:::         E = all columns quoted, quote literals Escaped as ""
:::         N = No columns quoted, quote literals output as "
:::
:::       The /Q:E and /Q:N options are useful for transforming data for
:::       purposes other than parsing by FOR /F
:::
:::    /U = Write unix style lines with newline (\n) instead of the default
:::         Windows style of carriage return and linefeed (\r\n).
:::
:::parseCSV  /?
:::
:::  Display this help
:::
:::parseCSV  /V
:::
:::  Display the version of parseCSV.bat
:::
:::parseCSV.bat was written by Dave Benham. Updates are available at the original
:::posting site: http://www.dostips.com/forum/viewtopic.php?f=3&t=5702
:::

::************ Batch portion ***********
@echo off
if "%~1" equ "/?" (
  setlocal disableDelayedExpansion
  for /f "delims=: tokens=*" %%A in ('findstr "^:::" "%~f0"') do echo(%%A
  exit /b 0
)
if /i "%~1" equ "/V" (
  for /f "delims=:" %%A in ('findstr /bc:"::%~nx0 version " "%~f0"') do echo %%A
  exit /b 0
)
cscript //E:JScript //nologo "%~f0" %*
exit /b 0


************ JScript portion ***********/
var args     = WScript.Arguments.Named,
    stdin    = WScript.Stdin,
    stdout   = WScript.Stdout,
    escape   = args.Exists("E"),
    literalQ = args.Exists("L"),
    escapeQ  = (args.Item("Q")&&args.Item("Q").toUpperCase()=="E"),
    quoteCol = (args.Item("Q")&&args.Item("Q").toUpperCase()=="N") ? '' : '"',
    delayed  = args.Exists("D"),
    inDelim  = args.Item("I") ? eval('"'+args.Item("I")+'"') : ",",
    outDelim = args.Item("O") ? eval('"'+args.Item("O")+'"') : ",",
    newline  = args.Exists("U") ? "\n" : "\r\n",
    quote    = false,
    ln, c, n, out;
while (!stdin.AtEndOfStream) {
  ln=stdin.ReadLine();
  out="";
  if (!quote) stdout.Write(quoteCol);
  for (n=0; n<ln.length; n++ ) {
    c=ln.charAt(n);
    if (c == '"') {
      if (literalQ) {
        if (escapeQ) c+='"';
      } else if (quote && ln.charAt(n+1) == '"') {
        n++;
        if (escapeQ) c+='"';
      } else {
        quote=!quote;
        continue;
      }
    }
    else if (c == inDelim && !quote) c=quoteCol+outDelim+quoteCol;
    else if (escape) {
      if (c == outDelim) c="\\D";
      if (c == "\\") c="\\S";
    }
    else if (delayed) {
      if (c == "!") c="^!";
      if (c == "^") c="^^";
    }
    out+=c;
  }
  out += (quote) ? ((escape) ? "\\N" : newline) : quoteCol+newline;
  stdout.Write(out);
}

I have also written a script that defines a macro to assist with parsing the most problematic CSV files. See viewtopic.php?f=3&t=1827 for background information about batch macros with arguments.
EDIT 2015-01-08: Updated to version 1.1 to fix a silly mistake in the Example usage code

define_csvGetCol.bat

Code: Select all

::define_csvGetCol.bat version 1.1
::
:: Defines variable LF and macro csvGetCol to be used with
:: parseCSV.bat to parse nearly any CSV file.
::
:: This script must be called with delayedExpansion disabled.
::
:: The %csvGetCol% macro must be used with delayedExpansion enabled.
::
:: Example usage:
::
::   @echo off
::   setlocal disableDelayedExpansion
::   call define_csvGetCol
::   setlocal enableDelayedExpansion
::   for /f "tokens=1-3 delims=," %%A in ('parseCSV /d /e ^<test.csv') do (
::     %== Load and decode column values ==%
::     %csvGetCol% A "," %%A
::     %csvGetCol% B "," %%B
::     %csvGetCol% C "," %%C
::     %== Display the result ==%
::     echo ----------------------
::     for %%V in (A B C) do echo %%V=!%%V!
::     echo(
::   )
::
:: Written by Dave Benham
::

:: Delayed expansion must be disabled during macro definition

:: Define LF to contain a linefeed (0x0A) character
set ^"LF=^

^" The empty line above is critical - DO NOT REMOVE

:: define a newline with line continuation
set ^"\n=^^^%LF%%LF%^%LF%%LF%^^"

:: Define csvGetCol
:: %csvGetCol%  envVarName  "Delimiter"  FORvar
set csvGetCol=for %%# in (1 2) do if %%#==2 (%\n%
setlocal enableDelayedExpansion^&for /f "tokens=1,2*" %%1 in ("!args!") do (%\n%
  endlocal^&endlocal%\n%
  set "%%1=%%~3"!%\n%
  if defined %%1 (%\n%
    for %%L in ("!LF!") do set "%%1=!%%1:\N=%%~L!"%\n%
    set "%%1=!%%1:\D=%%~2!"%\n%
    set "%%1=!%%1:\S=\!"%\n%
  )%\n%
)) else setlocal disableDelayedExpansion ^& set args=


Usage is extremely simple if you know there are no commas or newlines in any values, and delayed expansion is not needed:

test1.csv

Code: Select all

"value1 with ""quotes""",value2: No problem!,value3: 2^3=8,value4: (2^2)!=16
value1,,value3,value4
value1,,,value4
value1,,,
,,,value4

test1.bat - no delayed expansion, no commas or newlines in values

Code: Select all

@echo off
for /f "tokens=1-4 delims=," %%A in ('parseCSV ^<test1.csv') do (
  echo -------------
  echo(A=%%~A
  echo(B=%%~B
  echo(C=%%~C
  echo(D=%%~D
  echo(
)

--OUTPUT1--

Code: Select all

-------------
A=value1 with "quotes"
B=value2: No problem!
C=value3: 2^3=8
D=value4: (2^2)!=16

-------------
A=value1
B=
C=value3
D=value4

-------------
A=value1
B=
C=
D=value4

-------------
A=value1
B=
C=
D=

-------------
A=
B=
C=
D=value4


It is also quite simple when commas are in values if you know of a character that does not exist in any value. Simply specify a unique character for the output delimiter.

test2.csv

Code: Select all

"value1 with ""quotes""","value2, No problem!","value3, 2^3=8","value4, (2^2)!=16"
value1,,value3,value4
value1,,,value4
value1,,,
,,,value4

test2.bat - no delayed expansion, no newlines or pipes in values EDIT: Note that the entire option must be quoted if the delimiter is a poison character

Code: Select all

@echo off
for /f "tokens=1-4 delims=|" %%A in ('parseCSV "/o:|" ^<test2.csv') do (
  echo -------------
  echo(A=%%~A
  echo(B=%%~B
  echo(C=%%~C
  echo(D=%%~D
  echo(
)

--OUTPUT2--

Code: Select all

-------------
A=value1 with "quotes"
B=value2, No problem!
C=value3, 2^3=8
D=value4, (2^2)!=16

-------------
A=value1
B=
C=value3
D=value4

-------------
A=value1
B=
C=
D=value4

-------------
A=value1
B=
C=
D=

-------------
A=
B=
C=
D=value4


It only takes a bit more code if values may contain newlines or if you don't know of a character that does not appear in any value. This solution encodes newlines, delimiters, and slashes as \N, \D, and \S. Delayed expansion is needed within the loop to decode the values, so ! and ^ must be escaped as ^! and ^^.

test3.csv

Code: Select all

"2^3=8","(2^2)!=16","Success!",Value4
value1,value2,value3,value4
,,,value4
"value1","value2","value3","value4"
"He said, ""Hey cutie.""","She said, ""Drop dead!""","value3 line1
value3 line2",c:\Windows

test3.bat - Allow virtually any valid CSV, without using a macro.

Code: Select all

@echo off
setlocal enableDelayedExpansion

:: Define LF to contain a linefeed (0x0A) character
set ^"LF=^

^" The empty line above is critical - DO NOT REMOVE

for /f "tokens=1-4 delims=," %%A in ('parseCSV /e /d ^<test3.csv') do (
  %== Load columns with encoded values. The trailing ! is important ==%
  set "A=%%~A"!
  set "B=%%~B"!
  set "C=%%~C"!
  set "D=%%~D"!
  %== Decode values ==%
  for %%L in ("!LF!") do for %%V in (A B C D) do if defined %%V (
    set "%%V=!%%V:\N=%%~L!"
    set "%%V=!%%V:\D=,!"
    set "%%V=!%%V:\S=\!"
  )
  %== Print results ==%
  echo ---------------------
  for %%V in (A B C D) do echo(%%V=!%%V!
  echo(
)

--OUTPUT3--

Code: Select all

---------------------
A=2^3=8
B=(2^2)!=16
C=Success!
D=Value4

---------------------
A=value1
B=value2
C=value3
D=value4

---------------------
A=
B=
C=
D=value4

---------------------
A=value1
B=value2
C=value3
D=value4

---------------------
A=He said, "Hey cutie."
B=She said, "Drop dead!"
C=value3 line1
value3 line2
D=c:\Windows

test4.bat - Allow virtually any valid CSV, but now use the %csvGetCol% macro.

Code: Select all

@echo off

:: Delayed expansion must be disabled during macro definition
setlocal disableDelayedExpansion
call define_csvGetCol

:: Delayed expansion must be enabled when using %csvGetCol%
setlocal enableDelayedExpansion
for /f "tokens=1-4 delims=," %%A in ('parseCSV /e /d ^<test3.csv') do (
  %== Load and decode column values ==%
  %csvGetCol% A "," %%A
  %csvGetCol% B "," %%B
  %csvGetCol% C "," %%C
  %csvGetCol% D "," %%D
  %== Print results ==%
  echo ---------------------
  for %%V in (A B C D) do echo(%%V=!%%V!
  echo(
)

Output is identical to test3.bat


If the CSV file is very large, then it is much more efficient to save the output of parseCSV.bat to a temporary file, and then use the FOR /F loop to read the temporary file.


There are still a couple inherent limitations that are true for all FOR /F usage:

1) A single FOR /F cannot parse more than 32 columns.

2) Batch line length restriction of 8191 characters can still be a problem.


Dave Benham
Last edited by dbenham on 08 Jan 2015 10:14, edited 3 times in total.

Samir
Posts: 341
Joined: 16 Jul 2013 12:00
Location: HSV
Contact:

Re: Safely parse nearly any CSV with parseCSV.bat

#2 Post by Samir » 08 Jul 2014 11:41

Very nice. I need to remember this for an upcoming batch where I'll need to a parse by tabs...

foxidrive
Expert
Posts: 6033
Joined: 10 Feb 2012 02:20

Re: Safely parse nearly any CSV with parseCSV.bat

#3 Post by foxidrive » 03 Sep 2014 08:16

Not sure if I have the right end of the stick - I tend to skim read large texts - but is this the way to handle pipe delimited .csv Dave?

Code: Select all

d:\>type file.csv |ParseCSV.bat /I:"|" >file2.csv
| was unexpected at this time.

Squashman
Expert
Posts: 4072
Joined: 23 Dec 2011 13:59

Re: Safely parse nearly any CSV with parseCSV.bat

#4 Post by Squashman » 03 Sep 2014 09:06

Maybe you can't pipe input to ParseCSV??? All of Dave's examples are redirected input.

dbenham
Expert
Posts: 2198
Joined: 12 Feb 2011 21:02
Location: United States (east coast)

Re: Safely parse nearly any CSV with parseCSV.bat

#5 Post by dbenham » 03 Sep 2014 09:50

Whoa :!: :shock: :? :oops: :evil:

Sorry guys. I don't know what I have done. I swear this used to work with a poison character as a delimiter. But now it clearly doesn't. I tested all those examples, and the output was real.

I must have made some change after testing that broke things, without realizing it.

I'll try to fix this to work properly with poison character delimiters within the next couple days.

BTW - It should work with either pipe or redirection, although I think redirection is a bit more efficient.


Dave Benham

dbenham
Expert
Posts: 2198
Joined: 12 Feb 2011 21:02
Location: United States (east coast)

Re: Safely parse nearly any CSV with parseCSV.bat

#6 Post by dbenham » 03 Sep 2014 10:12

OK, I must have made an invalid change to the test2.bat example after testing.

When quoting a delimiter, the entire option must be quoted, not just the value. So "/i:|" is correct, and /i:"|" fails. I don't understand the mechanism, but I don't have time to investigate at the moment.

I've edited the top post to fix test2.bat.


Dave Benham

foxidrive
Expert
Posts: 6033
Joined: 10 Feb 2012 02:20

Re: Safely parse nearly any CSV with parseCSV.bat

#7 Post by foxidrive » 03 Sep 2014 10:34

Thanks Dave.

dbenham
Expert
Posts: 2198
Joined: 12 Feb 2011 21:02
Location: United States (east coast)

Re: Safely parse nearly any CSV with parseCSV.bat

#8 Post by dbenham » 18 Feb 2017 23:28

Squashman presented a good idea in his comment to this StackOverflow question dealing with CSV format conversion. The OP there wanted to change the delimiter from semicolon to comma. Squashman suggests using parseCSV.bat to do the conversion. I had never thought of this use for my utility, and I think it is a good idea. However, there is one problem in that escaped quote literals are transformed into un-escaped literals for the purpose of FOR /F parsing. This is not desirable for a simple conversion.

So I decided to add some new options to make parseCSV.bat generally useful for nearly any type of CSV format transformation, as long as both the input and output use delimiter separated values. While I was at it, I modified the /I and /O options to accept escape sequences for the delimiters. This is especially useful for specifying tab as the input delimiter using /I:\t.

There are two new options, one of which has multiple forms:

/L - specifies that all quotes within the input are quote literals. This is useful for CSV input that does not follow the standard of quoting values that contain quotes, and doubling quote literals.

/Q:E - Escape quote literals as "" within the output. Also encloses each column within quotes.

/Q:N - Do not quote column values in the output, and quote literals are not escaped (not doubled). This is only useful if you know that none of the column values contain the output delimiter.

For completeness, there is also /Q:L - which is the same as the original default behavior: All output columns are quoted, and quote literals are not escaped (not doubled).

I've updated the code in the first post of this thread to version 1.1, with the above changes.


Dave Benham

dbenham
Expert
Posts: 2198
Joined: 12 Feb 2011 21:02
Location: United States (east coast)

Re: Safely parse nearly any CSV with parseCSV.bat

#9 Post by dbenham » 19 Feb 2017 22:53

I updated the first post to version 1.2 with the following changes:

  • Fixed a bug that caused all output delimiters to be escaped if /E was combined /Q:N.
  • Modified behavior to write Windows style lines with \r\n terminators by default, and added the /U option to specify Unix style output with \n terminators.
  • Improved performance by writing entire lines instead of one character at a time.

Dave Benham

Post Reply