Safely parse nearly any CSV with pure batch

Discussion forum for all Windows batch related topics.

Moderator: DosItHelp

Post Reply
Message
Author
dbenham
Expert
Posts: 2461
Joined: 12 Feb 2011 21:02
Location: United States (east coast)

Safely parse nearly any CSV with pure batch

#1 Post by dbenham » 30 May 2019 23:07

Quite some time ago I developed a hybrid JScript/batch parseCSV.bat utility to assist with parsing CSV files with FOR /F. It could handle any CSV as long as you didn't need to parse more than 32 tokens, and no line approached 8191 bytes.

Much later, Aacini started a thread where routines were developed to easily parse more than 32 columns. But it did nothing to handle delimiters, quotes, or newlines in values, or empty values.

Recently I decided to demonstrate how to combine the two techniques, such that you can safely parse any CSV as long as no line approaches the 8191 limit.

But the above relies on CSCRIPT JScript to prepare each line for parsing by FOR /F. Over at StackOverflow I developed a technique to parse a CSV using pure batch. The key hurdle was how to differentiate quoted delimiter literals from field delimiters. The solution is a variant of a technique developed by jeb that relies on the difference between how phase 2 treats quoted "^,^," vs unquoted ^,^,

The StackOverflow pure batch CSV parser can handle most CSV complications:
  • A line can begin with any character
  • A field may be empty
  • A field value may contain any poison characters
  • A field value may contain the field delimiter
  • A field value may contain doubled quote literals
  • A field may contain ^ and/or !, and delayed expansion can be used
The complications it cannot handle are:
  • Line length cannot approach 8191
  • No field may contain newline
  • Cannot parse more than 32 tokens
Below I demonstrate how to combine the pure batch technique to read more than 32 tokens with the StackOverflow pure batch technique to handle most CSV complications.

test.csv

Code: Select all

C1,C2,C3,C4,C5,C6,C7,C8,C9,C10,C11,C12,C13,C14,C15,C16,C17,C18,C19,C20,C21,C22,C23,C24,C25,C26,C27,C28,C29,C30,C31,C32,C33,C34,C35,C36
;,,"""quoted ,"" and non quoted ,",(<@a@c&|>)^,"(<@a@c,&|>)^ ""(<@a@c,&|>)^""",(<@a@c&|>)^!,"(<@a@c,&|>)^! ""(<@a@c,&|>)^!""",,,,,,,,,,,,,,,,,,,,,,,,,,,,,Last Column!
The code below demonstrates how to parse the CSV above and store each column for each row in a 2 dimensional array. Since it supports delayed expansion, you can do pretty much anything you want with each token within the inner-most loop.

The only things the code cannot handle are lines approaching 8191 bytes, and newlines within values. I don't think it is possible to solve either of these remaining problems with pure batch.

Code: Select all

@echo off
setlocal enableDelayedExpansion
call :defineFor For36InA A 36 ","
set "row=0"
for /f usebackq^ delims^=^ eol^= %%A in ("test.csv") do (
  set /a row+=1
  call :prepareLine A
  for /f delims^=^ eol^= %%A in ("!ln!") do (
    setlocal disableDelayedExpansion
    %For36InA% (
      endlocal
      for /l %%C in (1 1 36) do call :decodeAndStoreToken array.!row!.%%C "," !$%%C!
    )
  )
)

:: Show results
for /l %%R in (1 1 %row%) do for /l %%C in (1 1 36) do echo Row %%R Col %%C = [!array.%%R.%%C!]
exit /b

:prepareLine C
::
:: Prepares the CSV line in FOR variable %%C to be parsed by FOR /F
:: and stores the result in variable ln
::
:: Quoted , literals within a field are encoded as @c
:: @ literals are encoded as @a
:: All fields are enclosed within quotes
:: Doubled quote literals "" within a field are undoubled to "
::
:: Delayed expansion must be enabled when :prepareLine is called
::
:: If CSV delim is not , then modify lines as indicated by %= comment =%
::
setlocal disableDelayedExpansion
for %%. in (.) do set "ln=%%%1"
set "ln=,%ln:"=""%,"              %= Replace , with actual delim =%
set "ln=%ln:^=^^^^%"
set "ln=%ln:&=^&%"
set "ln=%ln:|=^|%"
set "ln=%ln:<=^<%"
set "ln=%ln:>=^>%"
set "ln=%ln:!=^^!%"
set "ln=%ln:,=^,^,%"              %= Replace , with actual delim =%
set ^"ln=%ln:""="%^"
set "ln=%ln:"=""%"
set "ln=%ln:@=@A%"
set "ln=%ln:^,^,=@D%"             %= Replace , with actual delim =%
endlocal & set "ln=%ln:""="%" !
set "ln=!ln:,,"=,,!"              %= Replace , with actual delim =%
set "ln=!ln:",,=,,!"              %= Replace , with actual delim =%
set "ln=!ln:~2,-2!"
set "ln=!ln:^=^^^^!"
set "ln=%ln:!=^^^!%"
set "ln=!ln:""="!"
set "ln="!ln:,,=","!"" !          %= Replace , with actual delim =%
exit /b


::==============================================================================================
:: The remainder of this script is constant for any CSV with a fixed number of fields per line,
:: regardless how many columns, or what delimiters are used, etc.

:decodeAndStoreToken  RtnVar  Delim  TokenName
::
:: Decodes FOR /F %%TokenName and stores result in RtnVar
::   @D becomes Delim
::   @A becomes @
::
for %%. in (.) do set "%1=%%~%~3" !
if defined %1 (
  set "%1=!%1:@D=%~2!"
  set "%1=!%1:@A=@!"
)
exit /b


:defineFor  ForMacroName  InputVar  TokenCount  [DelimChars]
::
:: Defines a macro to be used for parsing an arbitrary number of tokens from
:: a FOR variable string. The macro always parses one additional token to hold
:: any remainder of the line that lies beyond the TokenCount tokens.
::
::    ForMacroName = The name of the macro variable to be created.
::
::    InputVar = The name of the FOR variable that contains the string of tokens.
::
::    TokenCount = The number of tokens to parse.
::                 The maximum value is 2304 (256*9)
::
::    DelimChars = An optional string of one or more characters, each of which
::                 is treated as a token delimiter. Default is "<tab><space>".
::                 If <space> is included in the string, then it must be the
::                 last character in the string.
::
:: Tokens are accessed by $n variables.
:: For example, %%%$45% would represent the 45th token.
::
:: FOR /F modifiers may be freely used. For example, %%~nx%$10% would treat the
:: 10th token as a file path, and would expand to the file name and extension.
::
:: Normally, a single FOR /F is limited to 31 tokens, but the macro supports
:: many more, theoretically as many as 2304. However, each line to be parsed
:: must be less than 8191 characters in length.
::
:: This function may be called with delayed expansion enabled or disabled.
:: It is generally recommended that the macro be used with delayed expansion
:: disabled so that tokens containing ! are not corrupted.
::
:: This function automatically calls :defineForChars to define enough $n
:: variables to satisfy the TokenCount+1 tokens.
::
:: Example usage - Suppose you want to parse a well behaved CSV file named
:: test.csv that contains 300 columns. All lines must have the same number of
:: columns, and no column value may contain a comma.
::
:: The following code will correctly parse each data line of test.csv:
::
::    @echo off
::    setlocal disableDelayedExpansion
::    call :defineFor For300InA A 300 ","
::    for /f "skip=1 delims=" %%A in (test.csv) do %For300InA% (
::      echo token   1 = %%%$1%
::      echo token   2 = %%%$2%
::      echo ...
::      echo token 300 = %%%$300%
::    )
::
:: If the first token might begin with any character, including the default
:: EOL character, then the FOR /F line should be changed as follows:
::
::    for /f skip^=1^ delims^=^ eol^= %%A in (test.csv) do %For300InA% (
::
if %$max%0 gtr %~30 goto :defineForInternal
set /a "$max=(%~3)/256+1"
call :defineForChars %$max%
:defineForInternal
setlocal enableDelayedExpansion
set "delims=%~4"
if not defined delims set "delims= "
set "in=%~2"
set "macro="
set /a max=31, end=0
for /l %%N in (1 31 %~3) do (
  if %%N neq 1 set "in=!$%%N!"
  set /a end+=31
  if !end! gtr %~3 set /a "max=%~3-%%N+1"
  set "macro=!macro! for /f "eol=!delims:~0,1! tokens=1-!max!* delims=!delims!" %%!$%%N! in ("%%!in!") do"
)
for /f "delims=" %%A in ("!macro! ") do endlocal & set "%~1=%%A"
exit /b


:defineForChars  Count
::
:: Defines variables to be used as FOR /F tokens, from $1 to $n, where n = Count*256
:: Also defines $max = Count*256.
:: No other variables are defined or tampered with.
::
:: Once defined, the variables are very useful for parsing lines with many tokens, as
:: the values are guaranteed to be contiguous within the FOR /F mapping scheme.
::
:: For example, you can use $1 as a FOR variable by using %%%$1%.
::
::   FOR /F "TOKENS=1-31" %%%$1% IN (....) DO ...
::
::      %%%$1% = token 1, %%%$2% = token 2, ... %%%$31% = token 31
::
:: This routine never uses SETLOCAL, and works regardless whether delayed expansion
:: is enabled or disabled.
::
:: Three temporary files are created and deleted in the %TEMP% folder, and the active
:: code page is temporarily set to 65001, and then restored to the starting value
:: before returning. Once defined, the $n variables can be used with any code page.
::
for /f "tokens=2 delims=:." %%P in ('chcp') do call :DefineForCharsInternal %1
exit /b
:defineForCharsInternal
set /a $max=%1*256
>"%temp%\forVariables.%~1.hex.txt" (
  echo FF FE
  for %%H in (
    "0 1 2 3 4 5 6 7 8 9 A B C D E F"
  ) do for /l %%N in (1 1 %~1) do for %%A in (%%~H) do for %%B in (%%~H) do (
    echo %%A%%B 0%%N 0D 00 0A 00
  )
)
>nul certutil.exe -decodehex -f "%temp%\forVariables.%~1.hex.txt" "%temp%\forVariables.%~1.utf-16le.bom.txt"
>nul chcp 65001
>"%temp%\forVariables.%~1.utf8.txt" type "%temp%\forVariables.%~1.utf-16le.bom.txt"
<"%temp%\forVariables.%~1.utf8.txt" (for /l %%N in (1 1 %$max%) do set /p "$%%N=")
for %%. in (dummy) do >nul chcp %%P  
del "%temp%\forVariables.%~1.*.txt"
exit /b
--OUTPUT--

Code: Select all

Row 1 Col 1 = [C1]
Row 1 Col 2 = [C2]
Row 1 Col 3 = [C3]
Row 1 Col 4 = [C4]
Row 1 Col 5 = [C5]
Row 1 Col 6 = [C6]
Row 1 Col 7 = [C7]
Row 1 Col 8 = [C8]
Row 1 Col 9 = [C9]
Row 1 Col 10 = [C10]
Row 1 Col 11 = [C11]
Row 1 Col 12 = [C12]
Row 1 Col 13 = [C13]
Row 1 Col 14 = [C14]
Row 1 Col 15 = [C15]
Row 1 Col 16 = [C16]
Row 1 Col 17 = [C17]
Row 1 Col 18 = [C18]
Row 1 Col 19 = [C19]
Row 1 Col 20 = [C20]
Row 1 Col 21 = [C21]
Row 1 Col 22 = [C22]
Row 1 Col 23 = [C23]
Row 1 Col 24 = [C24]
Row 1 Col 25 = [C25]
Row 1 Col 26 = [C26]
Row 1 Col 27 = [C27]
Row 1 Col 28 = [C28]
Row 1 Col 29 = [C29]
Row 1 Col 30 = [C30]
Row 1 Col 31 = [C31]
Row 1 Col 32 = [C32]
Row 1 Col 33 = [C33]
Row 1 Col 34 = [C34]
Row 1 Col 35 = [C35]
Row 1 Col 36 = [C36]
Row 2 Col 1 = [;]
Row 2 Col 2 = []
Row 2 Col 3 = ["quoted ," and non quoted ,]
Row 2 Col 4 = [(<@a@c&|>)^]
Row 2 Col 5 = [(<@a@c,&|>)^ "(<@a@c,&|>)^"]
Row 2 Col 6 = [(<@a@c&|>)^!]
Row 2 Col 7 = [(<@a@c,&|>)^! "(<@a@c,&|>)^!"]
Row 2 Col 8 = []
Row 2 Col 9 = []
Row 2 Col 10 = []
Row 2 Col 11 = []
Row 2 Col 12 = []
Row 2 Col 13 = []
Row 2 Col 14 = []
Row 2 Col 15 = []
Row 2 Col 16 = []
Row 2 Col 17 = []
Row 2 Col 18 = []
Row 2 Col 19 = []
Row 2 Col 20 = []
Row 2 Col 21 = []
Row 2 Col 22 = []
Row 2 Col 23 = []
Row 2 Col 24 = []
Row 2 Col 25 = []
Row 2 Col 26 = []
Row 2 Col 27 = []
Row 2 Col 28 = []
Row 2 Col 29 = []
Row 2 Col 30 = []
Row 2 Col 31 = []
Row 2 Col 32 = []
Row 2 Col 33 = []
Row 2 Col 34 = []
Row 2 Col 35 = []
Row 2 Col 36 = [Last Column!]

Dave Benham

dbenham
Expert
Posts: 2461
Joined: 12 Feb 2011 21:02
Location: United States (east coast)

Re: Safely parse nearly any CSV with pure batch

#2 Post by dbenham » 10 Jun 2019 11:44

I decided to make a BatchParseCSV.bat utility script containing all pure batch code necessary to easily parse nearly any CSV.

BatchParcseCSV.bat

Code: Select all

@echo off
if "%~1" equ "" goto :Help
shift /1 & goto %1 2>nul || >&2 echo ERROR: Invalid %~nx0 function '%1'


:GetToken  ForVar  EnvVar
::
::  Gets field in %%ForVar and stores in environment variable EnvVar.
::  ForVar may contain modifiers, but not a ~
::
::  Enclosing quotes are stripped
::  Escaped ^! and ^^ are reverted to their unescaped ! and ^ forms
::
::  Delayed expansion must be enabled to use this function.
::.
for %%. in (.) do set "%~2=%%~%~1" !
exit /b


:DecodeToken  ForVar  EnvVar
::
::  Decodes field in %%ForVar and stores in environment variable EnvVar.
::
::  Enclosing quotes are stripped
::  All @c become ,
::  All @a become @
::  Escaped ^! and ^^ are reverted to their unescaped ! and ^ forms
::
::  ForVar should not include modifiers because they would be applied before
::  @c and @a are decoded.
::
::  Delayed expansion must be enabled to use this function.
::.
for %%. in (.) do set "%~2=%%~%~1" !
if defined %~2 (
  set "%~2=!%~2:@c=,!"
  set "%~2=!%~2:@a=@!"
)
exit /b


:ReformatLine  ForVarCharIn  EnvVarOut  [Delim]
::
::  Prepares CSV line stored in FOR variable %%ForVarIn to be safely parsed by
::  FOR /F with delayed expansion enabled. The result is stored in environment
::  variable EnvVarOut. Delim should only be used if you know the Delim character
::  never appears in any of the field values.
::
::    All fields are enclosed within quotes
::    All escaped quote literals "" become "
::    All ^ become ^^
::    All ! become ^!
::
::  In addition, if Delim is specified then
::
::    All , delimiters are converted to Delim.
::    Fields should be retrieved with GetToken, or else any statement that
::    references a %%Token should contain ! to ensure that ^^ are converted to ^
::
::  Else if Delim is not specified, then
::
::    All @ become @a
::    All quoted , within fields become @c
::    Fields should be retrieved by DecodeToken
::
::  Delayed expansion must be enabled to use this function.
::.
setlocal
setlocal disableDelayedExpansion
for %%. in (.) do set "ln=%%%~1"
set "ln=,%ln:"=""%,"
set "ln=%ln:^=^^^^%"
set "ln=%ln:&=^&%"
set "ln=%ln:|=^|%"
set "ln=%ln:<=^<%"
set "ln=%ln:>=^>%"
set "ln=%ln:!=^^!%"
set "ln=%ln:,=^,^,%"
set ^"ln=%ln:""="%^"
set "ln=%ln:"=""%"
set "ln=%ln:@=@a%"
set "ln=%ln:^,^,=@c%"
endlocal & set "ln=%ln:""="%" !
set "ln=!ln:,,"=,,!"
set "ln=!ln:",,=,,!"
set "ln=!ln:~2,-2!"
set "ln=!ln:^=^^^^!"
endlocal&set "%~2=%ln:!=^^^!%"
set "%~2=!%~2:""="!"
if "%~3" neq "" (
  set "%~2="!%~2:,,="%~3"!""
  set "%~2=!%~2:@c=,!"
  set "%~2=!%~2:@a=@!"
) else set "%~2="!%~2:,,=","!""
exit /b


:DefineFor  ForMacroName  InputVar  TokenCount  [DelimChars]
::
:: Defines a macro to be used for parsing an arbitrary number of tokens from
:: a string. The macro always parses one additional token to hold any remainder
:: of the line that lies beyond the TokenCount tokens. The input line should have
:: at least as many tokens as TokenCount, else you run the risk of the macro not
:: returning anything (no tokens parsed).
::
::    ForMacroName = The name of the macro variable to be created.
::
::    InputVar = The name of a variable that contains the string used to access
::                 the line to be parsed. The string should be of the form !VAR!
::                 to access the environment variable returned by ReformatLine,
::                 else %%C to access a FOR variable when ReformatLine is not
::                 used.
::
::    TokenCount = The number of tokens to parse.
::                 The maximum value is 2303 (256*9-1)
::
::    DelimChars = An optional string of one or more characters, each of which
::                 is treated as a token delimiter. Default is "<tab><space>".
::                 If <space> is included in the string, then it must be the
::                 last character in the string.
::
:: Tokens are accessed by $n variables.
:: For example, %%%$45% would represent the 45th token.
::
:: FOR /F modifiers may be freely used. For example, %%~nx%$10% would treat the
:: 10th token as a file path, and would expand to the file name and extension.
::
:: Normally, a single FOR /F is limited to 31 tokens, but the macro supports
:: many more, theoretically as many as 2303. However, each line to be parsed
:: must be less than 8191 characters in length.
::
:: This function automatically calls :defineForChars to define enough $n
:: variables to satisfy the TokenCount+1 tokens.
::
:: Example usage - Suppose you want to parse a well behaved CSV file named
:: test.csv that contains 300 columns. All lines must have the same number of
:: columns, and no column value may contain a comma.
::
:: The following code will correctly parse each data line of test.csv:
::
::    @echo off
::    setlocal disableDelayedExpansion
::    call :defineFor For300InA A 300 ","
::    for /f "skip=1 delims=" %%A in (test.csv) do %For300InA% (
::      echo token   1 = %%%$1%
::      echo token   2 = %%%$2%
::      echo ...
::      echo token 300 = %%%$300%
::    )
::
:: If the first token might begin with any character, then EOL should be
:: disabled as follows:
::
::    for /f skip^=1^ delims^=^ eol^= %%A in (test.csv) do %For300InA% (
::.
if %$max%0 gtr %~30 goto :defineForInternal
set /a "$max=(%~3+256)/256"
call :defineForChars %$max%
 :defineForInternal
setlocal
set "noDelay=!!"
setlocal enableDelayedExpansion
set "delims=%~4"
if not defined delims set "delims= "
set "in=!%~2!"
if not defined noDelay set "in=%in:!=^^^!%"
set "macro="
set /a max=31, end=0
for /l %%N in (1 31 %~3) do (
  if %%N neq 1 set "in=%%!$%%N!"
  set /a end+=31
  if !end! gtr %~3 set /a "max=%~3-%%N+1"
  set "macro=!macro! for /f "eol=!delims:~0,1! tokens=1-!max!* delims=!delims!" %%!$%%N! in ("!in!") do"
)
for /f "delims=" %%A in ("!macro! ") do endlocal & endlocal & set "%~1=%%A"
exit /b


:DefineForChars  Count
::
:: Defines variables to be used as FOR /F tokens, from $1 to $n,
:: where n = Count*256
:: Also defines $max = Count*256.
:: No other variables are defined or tampered with.
::
:: The maximum allowed Count is 9, meaning the max $max is 2304.
::
:: Once defined, the variables are very useful for parsing lines with a fixed
:: number of tokens > 31, as the values are guaranteed to be contiguous within
:: the FOR /F mapping scheme.
::
:: For example, you can use $1 as a FOR variable by using %%%$1%.
::
::   FOR /F "TOKENS=1-31" %%%$1% IN (....) DO ...
::
::      %%%$1% = token 1, %%%$2% = token 2, ... %%%$31% = token 31
::
:: This routine never uses SETLOCAL, and works regardless whether delayed expansion
:: is enabled or disabled.
::
:: Three temporary files are created and deleted in the %TEMP% folder, and the active
:: code page is temporarily set to 65001, and then restored to the starting value
:: before returning. Once defined, the $n variables can be used with any code page.
::.
for /f "tokens=2 delims=:." %%P in ('chcp') do call :DefineForCharsInternal %1
exit /b
 :defineForCharsInternal
set /a $max=%1*256
>"%temp%\forVariables.%~1.hex.txt" (
  echo FF FE
  for %%H in (
    "0 1 2 3 4 5 6 7 8 9 A B C D E F"
  ) do for /l %%N in (1 1 %~1) do for %%A in (%%~H) do for %%B in (%%~H) do (
    echo %%A%%B 0%%N 0D 00 0A 00
  )
)
>nul certutil.exe -decodehex -f "%temp%\forVariables.%~1.hex.txt" "%temp%\forVariables.%~1.utf-16le.bom.txt"
>nul chcp 65001
>"%temp%\forVariables.%~1.utf8.txt" type "%temp%\forVariables.%~1.utf-16le.bom.txt"
<"%temp%\forVariables.%~1.utf8.txt" (for /l %%N in (1 1 %$max%) do set /p "$%%N=")
for %%. in (dummy) do >nul chcp %%P
del "%temp%\forVariables.%~1.*.txt"
exit /b


:DefineVariantFor  ForMacroName  InputVar  TokenCount  [DelimChars  [DummyChar]]
::
:: Nearly the same as DefineFor, except the resultant macro is designed to work
:: with input lines that may have fewer than TokenCount tokens.
::
::    ForMacroName = The name of the macro variable to be created.
::
::    InputVar = The name of a variable that contains the string used to access
::                 the CSV line. The string should be of the form !VAR! to access
::                 the environment variable returned by ReformatLine, else %%C to
::                 to access a FOR variable when ReformatLine is not used.
::
::    TokenCount = The maximum number of tokens to parse.
::                 The maximum value is 2226.
::
::    DelimChars = An optional string of one or more characters, each of which
::                 is treated as a token delimiter. Default is "<tab><space>".
::                 If <space> is included in the string, then it must be the
::                 last character in the string.
::
::    DummyChar  = An optional string to use as a prepended dummy token for each
::                 FOR /F iteration. The DummyChar is needed to support the
::                 requested number of tokens, yet still have the macro return
::                 results if the input line has fewer than TokenCount tokens.
::                 The DummyChar must not match any character within Delims.
::                 The default is "x". This optional parameter is only needed if
::                 Delims includes "x".
::
:: Tokens are accessed by $vn variables.
:: For example, %%%$v45% would represent the 45th token.
::
:: Any remaining unparsed string is contained in one of two possible tokens:
::   - If TokenCount is divisible by 30,
::     then the remainder is in $vExtra(TokenCount+1)
::   - Else the remainder is in $v(TokenCount+1)
::
:: So if TokenCount=300 (a multiple of 30), then the remainder is in $vExtra301.
:: But if TokenCount=299 (not a multiple of 30), then the remainder is in $v300.
::
:: This function automatically calls DefineVariantForChars to define enough $vn
:: and $vExtran variables to satisfy the TokenCount+1 requirement.
::
:: All other behaviors are the same as for DefineFor
::.
if %$vmax%0 gtr %~30 goto :defineVariantForInternal
set /a "$vmax=(%~3+(%~3+30)/30+257)/256"
call :defineVariantForChars %$vmax%
 :defineVariantForInternal
setlocal
set "noDelay=!!"
setlocal enableDelayedExpansion
set "dummy=%~5"
if not defined dummy set "dummy=x"
set "delims=%~4"
if not defined delims set "delims= "
set "dummy=!dummy!!delims:~0,1!"
set "in=!%~2!"
if not defined noDelay set "in=%in:!=^^^!%"
set "macro="
set /a max=31, end=0
for /l %%N in (1 30 %~3) do (
  if %%N neq 1 set "in=!$vExtra%%N!"
  set /a end+=30
  if !end! gtr %~3 set /a "max=%~3-%%N+2"
  set "macro=!macro! for /f "eol=!delims:~0,1! tokens=1-!max!* delims=!delims!" %%!$vExtra%%N! in ("!dummy!%%!in!") do"
)
for /f "delims=" %%A in ("!macro! ") do endlocal & endlocal & set "%~1=%%A"
exit /b


:DefineVariantForChars  Count
::
:: Nearly the same as DefineForChars, except this function defines $vn, $vmax,
:: and $vExtran variables instead of $n and $max variables.
::
:: These variables are useful for parsing lines with many tokens when the number
:: of tokens may vary on each line.
::
:: For every 30 $vn variables, there is one $vExtran variable defined.
::
:: The maximum Count allowed is 9, which results in a max $vmax of 227, which
:: corresponds to the largest $vn of $v2227.
::
:: All other behaviors are the same as for DefineForChars.
::.
for /f "tokens=2 delims=:." %%P in ('chcp') do call :DefineVariantForCharsInternal %1
exit /b
 :defineVariantForCharsInternal
set /a "$vmax=%1*256, $vmax-=($vmax+30)/30"
>"%temp%\forVariables.%~1.hex.txt" (
  echo FF FE
  for %%H in (
    "0 1 2 3 4 5 6 7 8 9 A B C D E F"
  ) do for /l %%N in (1 1 %~1) do for %%A in (%%~H) do for %%B in (%%~H) do (
    echo %%A%%B 0%%N 0D 00 0A 00
  )
)
>nul certutil.exe -decodehex -f "%temp%\forVariables.%~1.hex.txt" "%temp%\forVariables.%~1.utf-16le.bom.txt"
>nul chcp 65001
>"%temp%\forVariables.%~1.utf8.txt" type "%temp%\forVariables.%~1.utf-16le.bom.txt"
<"%temp%\forVariables.%~1.utf8.txt" (for /l %%N in (1 1 %$vmax%) do (
	2>nul set /a "1/((%%N-1)%% 30)" || set /p "$vExtra%%N="
  set /p "$v%%N="
))
for %%. in (dummy) do >nul chcp %%P
del "%temp%\forVariables.%~1.*.txt"
exit /b


:Help  [Function]
::
::  Provides general help for BatchParseCSV.bat as a whole.
::
::  If Function is specified, then provides detailed help for that Function.
::.
echo(
%= General Help =%
if "%~1" equ "" (
  echo %~nx0 - Functions to safely parse nearly any CSV as long as no line approaches
  echo   the 8191 character limit, and no field contains newline characters.
  echo(
  echo   Syntax:
  for /f "delims=:" %%A in ('findstr "^:[^:]" "%~f0"') do @echo(    %~n0  %%A
  echo(
  echo   Use %~n0 Help Function to get detailed help about any given Function.
  echo   For example: %~n0 Help ReformatLine
  echo(
  echo   Alternatively, all functions except Help and Examples may be copied into
  echo   your script in case you want your script to be totally self-contained.
  echo(
  exit /b
)
%= Function Help =%
setlocal disableDelayedExpansion
(for /f "delims=:" %%N in ('findstr /rinc:"^:%~1 " /c:"^:%~1$" "%~f0"') do set /a "N=%%N-1") || (
  echo Function %1 not found
  exit /b
)
if /i "%~1" equ "Examples" set /a N+=2
for /f "usebackq skip=%N% delims=" %%A in ("%~f0") do (
  for /f "tokens=* delims=:." %%B in ("%%A") do (
    set "ln=%%B"
    setlocal enableDelayedExpansion
    if defined ln (echo(!ln:batchParseCSV=%~n0!) else echo(
    endlocal
  )
  if %%A == ::. exit /b
)
exit /b


:Examples
::
::example.csv (Used by all examples)
::
::  <1>,,"<@3>, OK",<4>,<5>,"<6> ""OK""",<7^>,<8!>,,,,,,,,,,,,,,,,,,,,,,,,,,,<35>
::
::Example 1:
::  Use ReformatLine to change delimiter to ;
::  Show tokens 3,6,7,8 without using GetToken
::  Note trailing ! is removed by delayed expansion - restores ^^ to ^
::
::    @echo off
::    setlocal enableDelayedExpansion
::    for /f "delims=" %%A in (example.csv) do (
::      call batchParseCSV ReformatLine A ln ";"
::      for /f "tokens=3,6,7,8 delims=;" %%A in ("!ln!") do (
::        echo A=%%~A!
::        echo B=%%~B!
::        echo C=%%~C!
::        echo D=%%~D!
::      )
::    )
::
::  Output:
::
::    A=<@3>, OK
::    B=<6> "OK"
::    C=<7^>
::    D=<8!>
::
::Example 2:
::  Same as before, but use GetToken
::
::    @echo off
::    setlocal enableDelayedExpansion
::    for /f "delims=" %%A in (example.csv) do (
::      call batchParseCSV ReformatLine A ln ";"
::      for /f "tokens=3,6,7,8 delims=;" %%A in ("!ln!") do (
::        for %%t in (A B C D) do (
::          call batchParseCSV GetToken %%t %%t
::          echo %%t=!%%t!
::        )
::      )
::    )
::
::  Output: Same as Example 1
::
::Example 3:
::  Use ReformatLine without changing delimiter
::  Show columns 3,6,7,8 using DecodeToken
::
::    @echo off
::    setlocal enableDelayedExpansion
::    for /f "delims=" %%A in (example.csv) do (
::      call batchParseCSV ReformatLine A ln
::      for /f "tokens=3,6,7,8 delims=," %%A in ("!ln!") do (
::        for %%t in (A B C D) do (
::          call batchParseCSV DecodeToken %%t %%t
::          echo %%t=!%%t!
::        )
::      )
::    )
::
::  Output: Same as Example 1
::
::Example 4:
::  Use DefineFor to access more than 31 tokens when all lines have all tokens
::  No escape needed because macro defined while delayed expansion is disabled
::  Use ReformatLine without changing delimiter
::  Show tokens 3,6,7,8,35 using DecodeToken
::
::    @echo off
::    setlocal disableDelayedExpansion
::    set "var=!ln!"
::    call batchParseCSV DefineFor For35Do var 35 ","
::    setlocal enableDelayedExpansion
::    for /f "delims=" %%A in (example.csv) do (
::      call batchParseCSV ReformatLine A ln
::      %For35Do% (
::        for %%t in (3 6 7 8 35) do (
::          call batchParseCSV DecodeToken !$%%t! col%%t
::          echo col%%t=!col%%t!
::        )
::      )
::    )
::
::  Output:
::
::    col3=<@3>, OK
::    col6=<6> "OK"
::    col7=<7^>
::    col8=<8!>
::    col35=<35>
::
::Example 5:
::  Use DefineVariantFor to access more than 31 tokens when some may be missing
::  Escape needed because macro defined while delayed expansion is enabled
::  Use ReformatLine without changing delimiter
::  Show tokens 3,6,7,8,35 using DecodeToken
::
::    @echo off
::    setlocal enableDelayedExpansion
::    set "var=^!ln^!"
::    call batchParseCSV DefineVariantFor For70Do var 70 ","
::    for /f "delims=" %%A in (example.csv) do (
::      call batchParseCSV ReformatLine A ln
::      %For70Do% (
::        for %%t in (3 6 7 8 35) do (
::          call batchParseCSV DecodeToken !$v%%t! col%%t
::          echo col%%t=!col%%t!
::        )
::      )
::    )
::
::  Output: Same as Example 4
::.
call :help examples
exit /b
Full documentation is available within the script. I decided to name the utility script BatchParseCSV.bat because I already have a script named ParseCSV.bat. But you can rename BatchParseCSV.bat to any name you like, and the documentation will automatically adjust to the new name.

Here is the basic help received by entering either BatchParseCSV without any arguments, or BatchParseCSV Help

Code: Select all

BatchParseCSV.bat - Functions to safely parse nearly any CSV as long as no line approaches
  the 8191 character limit, and no field contains newline characters.

  Syntax:
    BatchParseCSV  GetToken  ForVar  EnvVar
    BatchParseCSV  DecodeToken  ForVar  EnvVar
    BatchParseCSV  ReformatLine  ForVarCharIn  EnvVarOut  [Delim]
    BatchParseCSV  DefineFor  ForMacroName  InputVar  TokenCount  [DelimChars]
    BatchParseCSV  DefineForChars  Count
    BatchParseCSV  DefineVariantFor  ForMacroName  InputVar  TokenCount  [DelimChars  [DummyChar]]
    BatchParseCSV  DefineVariantForChars  Count
    BatchParseCSV  Help  [Function]
    BatchParseCSV  Examples

Code: Select all

  Use BatchParseCSV Help Function to get detailed help about any given Function.
  For example: BatchParseCSV Help ReformatLine

  Alternatively, all functions except Help and Examples may be copied into
  your script in case you want your script to be totally self-contained.

And below are the Examples from the documentation:

example.csv (Used by all examples)

Code: Select all

<1>,,"<@3>, OK",<4>,<5>,"<6> ""OK""",<7^>,<8!>,,,,,,,,,,,,,,,,,,,,,,,,,,,<35>
Example 1:
Use ReformatLine to change delimiter to ;
Show tokens 3,6,7,8 without using GetToken
Note trailing ! is removed by delayed expansion - restores ^^ to ^

Code: Select all

@echo off
setlocal enableDelayedExpansion
for /f "delims=" %%A in (example.csv) do (
  call BatchParseCSV ReformatLine A ln ";"
  for /f "tokens=3,6,7,8 delims=;" %%A in ("!ln!") do (
    echo A=%%~A!
    echo B=%%~B!
    echo C=%%~C!
    echo D=%%~D!
  )
)
Output:

Code: Select all

A=<@3>, OK
B=<6> "OK"
C=<7^>
D=<8!>
Example 2:
Same as before, but use GetToken

Code: Select all

@echo off
setlocal enableDelayedExpansion
for /f "delims=" %%A in (example.csv) do (
  call BatchParseCSV ReformatLine A ln ";"
  for /f "tokens=3,6,7,8 delims=;" %%A in ("!ln!") do (
    for %%t in (A B C D) do (
      call BatchParseCSV GetToken %%t %%t
      echo %%t=!%%t!
    )
  )
)
Output: Same as Example 1

Example 3:
Use ReformatLine without changing delimiter
Show columns 3,6,7,8 using DecodeToken

Code: Select all

@echo off
setlocal enableDelayedExpansion
for /f "delims=" %%A in (example.csv) do (
  call BatchParseCSV ReformatLine A ln
  for /f "tokens=3,6,7,8 delims=," %%A in ("!ln!") do (
    for %%t in (A B C D) do (
      call BatchParseCSV DecodeToken %%t %%t
      echo %%t=!%%t!
    )
  )
)
Output: Same as Example 1

Example 4:
Use DefineFor to access more than 31 tokens when all lines have all tokens
No escape needed because macro defined while delayed expansion is disabled
Use ReformatLine without changing delimiter
Show tokens 3,6,7,8,35 using DecodeToken

Code: Select all

@echo off
setlocal disableDelayedExpansion
set "var=!ln!"
call BatchParseCSV DefineFor For35Do var 35 ","
setlocal enableDelayedExpansion
for /f "delims=" %%A in (example.csv) do (
  call BatchParseCSV ReformatLine A ln
  %For35Do% (
    for %%t in (3 6 7 8 35) do (
      call BatchParseCSV DecodeToken !$%%t! col%%t
      echo col%%t=!col%%t!
    )
  )
)
Output:

Code: Select all

col3=<@3>, OK
col6=<6> "OK"
col7=<7^>
col8=<8!>
col35=<35>
Example 5:
Use DefineVariantFor to access more than 31 tokens when some may be missing
Escape needed because macro defined while delayed expansion is enabled
Use ReformatLine without changing delimiter
Show tokens 3,6,7,8,35 using DecodeToken

Code: Select all

@echo off
setlocal enableDelayedExpansion
set "var=^!ln^!"
call BatchParseCSV DefineVariantFor For70Do var 70 ","
for /f "delims=" %%A in (example.csv) do (
  call BatchParseCSV ReformatLine A ln
  %For70Do% (
    for %%t in (3 6 7 8 35) do (
      call BatchParseCSV DecodeToken !$v%%t! col%%t
      echo col%%t=!col%%t!
    )
  )
)
Output: Same as Example 4


Dave Benham

SIMMS7400
Posts: 546
Joined: 07 Jan 2016 07:47

Re: Safely parse nearly any CSV with pure batch

#3 Post by SIMMS7400 » 14 Apr 2021 07:04

HI Dave -

This works great for me situation, many thanks!!!

Now I understand beggars can't be choosers, but this is running "slow" when cycling through a big file. Is this expected behavior? And by slow I mean about ~1 minute for a 3k row file.

Other than, absolutely fantastic!!

Post Reply