Automatically escaping strings to survive multiple parsings

jfl

#1 Post by jfl » 05 Oct 2019 11:06

There are many cases where batch strings are parsed multiple times before being used,
for example when you build a command line from multiple parts, or pass arguments to a subroutine.
Each parsing is tricky if the command or arguments contain special characters like ^ & | ! etc.
In that case, the special characters must be escaped with additional ^ characters.
And the escaping rules are tricky too, because the number of ^ characters to add depends on the presence of ! and " characters!
Things become really horrible when there are several levels of parsing to go through, some with delayed !expansion, others without.
The worst case is macros: when writing a complex macro, you have to prepare things for at least two, often three parsings!

Anyway, after wasting too much time finding the correct escaping for a complex case, I decided to automate that:
What I needed was a routine that takes a string and escapes it so that it survives intact through a given number of parsings.

As the rules are context-dependent, this basically required writing a batch parser that duplicates the state machine used by the real cmd.exe parser and, depending on the current state at each position in the string, generates the necessary number of ^ ahead of tricky characters.
Fortunately, I didn't need a full-fledged parser. A subset modeling the tokenizer seems to be enough.
(Stage 2 in the reference on the subject: https://stackoverflow.com/a/4095133)

I decided to experiment in batch.
The first attempts were quite complex.
But after a while, I ended up with a relatively simple algorithm that seems to give good results:

Code:

:#----------------------------------------------------------------------------#
:#                                                                            #
:#  Function        EscapeCmdString                                           #
:#                                                                            #
:#  Description     Prepare a command for passing through multiple parsings   #
:#                                                                            #
:#  Arguments       %1  Name of the variable containing the command string    #
:#                  %2  Output variable name. Default: Same as input variable #
:#                  %3  Number of parsings to go through. Default: 1          #
:#                  %4  # of those with !expansion. Default: %3 if on, else 0 #
:#                                                                            #
:#  Notes           The cmd parser tokenizer removes levels of ^ escaping.    #
:#                  This routine escapes a command line, or an argument, so   #
:#                  that special characters like ^ & | > < ( ) make it        #
:#                  intact through one or more tokenizations.                 #
:#                                                                            #
:#                  Known limitation: The LF character is not managed.        #
:#                                                                            #
:#  History                                                                   #
:#   2019-10-03 JFL Initial implementation                                    #
:#                                                                            #
:#----------------------------------------------------------------------------#

:EscapeCmdString %1=CMDVAR [%2=OUTVAR] [%3=# parsings] [%4=# with !expansion]
for /f "tokens=2" %%e in ("!! 0 1") do setlocal EnableDelayedExpansion & set "CallerExp=%%e"
set "H0=^^"             &:# Return a Hat ^ with QUOTE_MODE 0=off
set "H1=^"              &:# Return a Hat ^ with QUOTE_MODE 1=on
if %CallerExp%==1 set "H0=!H0!!H0!" & set "H1=!H1!!H1!" &:# !escape our return value
set "NPESC=1"                     &:# Default number of %expansion escaping to do
if not "%~3"=="" set "NPESC=%~3"  &:# specified # of extra %expansion escaping to do
set /a "NXESC=%CallerExp%*NPESC"  &:# Default number of !expansion escaping to do
if not "%~4"=="" set "NXESC=%~4"  &:# specified # of extra !expansion escaping to do
for /l %%i in (1,1,%NXESC%) do set "H0=!H0!!H0!" & set "H1=!H1!!H1!"
for /l %%i in (1,1,%NPESC%) do set "H0=!H0!!H0!"
:# Define characters that need escaping outside of quotes
for %%c in ("<" ">" "|" "&" "(" ")") do set ^"EscapeCmdString.NE[%%c]=1^"
set ^"STRING=!%1!^"
set "OUTVAR=%2"
if not defined OUTVAR set "OUTVAR=%1"
set "RESULT="
set "QUOTE_MODE=0"      &:# 1=Inside a quoted string
set "ESCAPE=0"          &:# 1=The previous character was a ^ character
set "N=-1"
:EscapeCmdString.loop
  set /a "N+=1"
  set "C=!STRING:~%N%,1!" &:# Get the Nth character in the string
  if not defined C goto :EscapeCmdString.end
  if "!C!!C!"=="""" (
    if !ESCAPE!==0 (
      set /a "QUOTE_MODE=1-QUOTE_MODE"
    ) else ( :# Open " quotes can be escaped, but not close " quotes
      if "!QUOTE_MODE!"=="0" set "RESULT=!RESULT!!H0:~1!"
    )
  ) else if "!C!"=="^" (
    if "!QUOTE_MODE!"=="0" set /a "ESCAPE=1-ESCAPE"
    set "RESULT=!RESULT!!H%QUOTE_MODE%:~1!"
  ) else if "!C!"=="^!" (
    set "RESULT=!RESULT!!H%QUOTE_MODE%:~1!"
  ) else if defined EscapeCmdString.NE["!C!"] ( :# Characters that need escaping outside of quotes
    if "!QUOTE_MODE!"=="0" set "RESULT=!RESULT!!H0:~1!"
  )
  if not "!C!"=="^" set "ESCAPE=0"
  set "RESULT=!RESULT!!C!"
goto :EscapeCmdString.loop
:EscapeCmdString.end
endlocal & set ^"%OUTVAR%=%RESULT%^" ! = &:# The ! forces always having !escaping ^ removal in delayed expansion mode
exit /b

To test it, you can download my latest batch library, Library.bat, from GitHub; it includes the above :EscapeCmdString routine and test code.
Ex:

Code:

C:\JFL\Temp>Library.bat -te "echo R^&D !"
_INITIAL=echo R^&D !
# EnableDelayedExpansion
_ESCAPED=echo R^^^^^^^&D ^^^!
REPARSED=echo R^&D !

C:\JFL\Temp>Library.bat -te "echo R^&D !" off
_INITIAL=echo R^&D !
# DisableDelayedExpansion
_ESCAPED=echo R^^^&D ^!
REPARSED=echo R^&D !

C:\JFL\Temp>Library.bat -te "echo R^&D !" on 2
_INITIAL=echo R^&D !
# EnableDelayedExpansion
_ESCAPED=echo R^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^&D ^^^^^^^^^^^^^^^!
REPARSED=echo R^^^^^^^&D ^^^!
REPARSED=echo R^&D !

C:\JFL\Temp>

In each test, the variable _INITIAL is set from the first argument, then passed to routine :EscapeCmdString.
The result of :EscapeCmdString is stored in variable _ESCAPED.
Finally, I run (set REPARSED=%_ESCAPED%) to force one level of parsing, and display the value.
(The _ ahead of the first two variable names is there only to keep all values aligned, which makes it easier to check that the first and last strings match.)

The first two examples above show that the escaped string is not the same if delayed expansion is on or off in the test routine.
The third example shows how a string that must survive two parsings with delayed expansion on grows ridiculously long.
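
For readers who want to play with the algorithm outside of cmd.exe, here is a minimal Python sketch of the same state machine. It is a model, not the batch routine: the name escape_cmd_string and the parameters parsings and expansions are made up for this illustration, the result is the string as it would end up in the output variable, and the hat counts assume the doubling done in the routine's for /l loops. Like the routine, it does not handle LF.

```python
# Characters that only need escaping outside of quotes (same list as the routine)
SPECIAL_OUTSIDE_QUOTES = set('<>|&()')

def escape_cmd_string(text, parsings=1, expansions=0):
    """Model of :EscapeCmdString (hypothetical Python port, for illustration).

    parsings   -- number of parsings to survive (%3 in the batch routine)
    expansions -- how many of those run with delayed expansion (%4)
    """
    # Assumption: each parsing doubles the hats outside quotes, and each
    # delayed expansion pass doubles them again, inside or outside quotes.
    hats_outside = "^" * (2 ** (parsings + expansions) - 1)
    hats_in_quotes = "^" * (2 ** expansions - 1)
    result = []
    in_quotes = False
    escaped = False  # True when the previous character was an unquoted ^
    for ch in text:
        if ch == '"':
            if not escaped:
                in_quotes = not in_quotes      # a plain " toggles quote mode
            elif not in_quotes:
                result.append(hats_outside)    # an escaped " stays a literal
        elif ch == '^':
            result.append(hats_in_quotes if in_quotes else hats_outside)
            if not in_quotes:
                escaped = not escaped
        elif ch == '!':
            result.append(hats_in_quotes if in_quotes else hats_outside)
        elif ch in SPECIAL_OUTSIDE_QUOTES and not in_quotes:
            result.append(hats_outside)
        if ch != '^':
            escaped = False
        result.append(ch)
    return "".join(result)

print(escape_cmd_string("echo R^&D !", 1, 0))  # echo R^^^&D ^!
print(escape_cmd_string("echo R^&D !", 1, 1))  # echo R^^^^^^^&D ^^^!
print(len(escape_cmd_string("echo R^&D !", 2, 2)))
```

Under these assumptions, the three calls correspond to the three test runs above (off, on, on 2), with expansions set to 0 when delayed expansion is off and to the number of parsings when it is on.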

To allow testing trickier cases without losing the tricky characters in the library invocation or its argument processing loop, the test routine accepts HTML entities for tricky characters.
But I couldn't use the HTML syntax &name; as the & itself is tricky for batch. So, instead, I use [name] with brackets.
Ex:

Code:

C:\JFL\Temp>Library.bat -te "0[excl]1[quot]2[excl]3[quot]" on 2
_INITIAL=0!1"2!3"
# EnableDelayedExpansion
_ESCAPED=0^^^^^^^^^^^^^^^!1"2^^^!3"
REPARSED=0^^^!1"2^!3"
REPARSED=0!1"2!3"

C:\JFL\Temp>

Notice how the number of ^ is not the same before the first and second exclamation point.
Use (library.bat -?) to display a help screen with the list of HTML entities supported.
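
The two different hat counts before the exclamation points can be put into a small formula. This is only my reading of the doubling loops in the routine, written as a Python sketch with a made-up helper name (hats_before): outside of quotes, every parsing halves the hats once in the tokenizer and once more if delayed expansion is on, whereas inside quotes the tokenizer leaves ^ alone, so only the delayed expansion passes count.

```python
def hats_before(parsings, expansions, in_quotes):
    """Number of ^ to put ahead of a special character (illustrative helper).

    parsings   -- number of parsings the string must survive
    expansions -- how many of those parsings run with delayed expansion
    in_quotes  -- True if the character sits inside a "quoted" part
    """
    # Assumption: inside quotes only the delayed expansion passes strip hats.
    doublings = expansions if in_quotes else parsings + expansions
    return 2 ** doublings - 1

print(hats_before(2, 2, in_quotes=False))  # 15
print(hats_before(2, 2, in_quotes=True))   # 3
```

These are the 15 and 3 hats seen before the first and second ! in the "on 2" test.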

I considered using this :EscapeCmdString to pass return values through endlocal barriers.
This would work, but the performance would be poor. I don't recommend it unless you're desperate. :-)


The next step is improving it for macro support:
Routine :EscapeCmdString does not yet support LF in strings. (The [lf] entity works, but strings containing LF characters currently break :EscapeCmdString.)
I plan to add that eventually. And when it's done, we'll be able to do things like calling a macro from another macro:
Passing your favorite $macro to :EscapeCmdString will automatically generate a $$macro, hopefully usable from within other macros. :-)
As this would be done only once during the program initialization, I hope the performance would be less of an issue.

Any feedback welcome!

aGerman

Re: Automatically escaping strings to survive multiple parsings

#2 Post by aGerman » 05 Oct 2019 11:53

Nice piece of code, Jean-François!

Unfortunately things are not always that simple.

Code:

@echo off &setlocal DisableDelayedExpansion
for /f %%i in ('for /f %%j in ('for /f %%k in ('echo a^^^^^^^&echo b'^^^) do @echo %%k'^) do @echo %%j') do echo ***%%i***
pause

If you want to write a nested FOR /F like the above and want to know how to escape the code, you need to do that in several steps, one for each '...' clause. But that should be quite clear to everyone, I guess. I'd be rather interested in the reason why you escape the left parenthesis. I can't recall any situation where this is necessary. Could you shed some light on that?

Steffen

jfl

Re: Automatically escaping strings to survive multiple parsings

#3 Post by jfl » 05 Oct 2019 14:18

You're right, I had not thought about recursive cases like this, where different parts of the command string are parsed a different number of times.
Indeed it's necessary to invoke my routine multiple times, with growing strings:

Code:

C:\JFL\Temp>Library.bat -te "echo a&echo b" off
_INITIAL=echo a&echo b
# DisableDelayedExpansion
_ESCAPED=echo a^&echo b
REPARSED=echo a&echo b

C:\JFL\Temp>Library.bat -te "for /f %%k in ('echo a^&echo b') do @echo %%k" off
_INITIAL=for /f %%k in ('echo a^&echo b') do @echo %%k
# DisableDelayedExpansion
_ESCAPED=for /f %%k in ^('echo a^^^&echo b'^) do @echo %%k
REPARSED=for /f %%k in ('echo a^&echo b') do @echo %%k

C:\JFL\Temp>Library.bat -te "for /f %%j in ('for /f %%k in ^('echo a^^^&echo b'^) do @echo %%k') do @echo %%j" off
_INITIAL=for /f %%j in ('for /f %%k in ^('echo a^^&echo b'^) do @echo %%k') do @echo %%j
# DisableDelayedExpansion
_ESCAPED=for /f %%j in ^('for /f %%k in ^^^('echo a^^^^^&echo b'^^^) do @echo %%k'^) do @echo %%j
REPARSED=for /f %%j in ('for /f %%k in ^('echo a^^&echo b'^) do @echo %%k') do @echo %%j

C:\JFL\Temp>

Now here I hit a problem I don't understand yet:
Even though the library.bat argument is a quoted string, one ^ is lost in the middle before it makes it to the _INITIAL string.
I can work around that using entities, but I'll need to investigate this further tomorrow!

Code:

C:\JFL\Temp>Library.bat -te "for /f %%j in ('for /f %%k in ^('echo a[hat][hat][hat]&echo b'^) do @echo %%k') do @echo %%j" off
_INITIAL=for /f %%j in ('for /f %%k in ^('echo a^^^&echo b'^) do @echo %%k') do @echo %%j
# DisableDelayedExpansion
_ESCAPED=for /f %%j in ^('for /f %%k in ^^^('echo a^^^^^^^&echo b'^^^) do @echo %%k'^) do @echo %%j
REPARSED=for /f %%j in ('for /f %%k in ^('echo a^^^&echo b'^) do @echo %%k') do @echo %%j

C:\JFL\Temp>for /f %i in ('for /f %j in ^('for /f %k in ^^^('echo a^^^^^^^&echo b'^^^) do @echo %k'^) do @echo %j') do echo ***%i***

C:\JFL\Temp>echo ***a***
***a***

C:\JFL\Temp>echo ***b***
***b***

C:\JFL\Temp>

So maybe you're right and escaping the ( is not necessary. But apparently it works with it too. :-)
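
To see why the extra ^ ahead of ( does no harm, a rough Python model of one tokenizer pass may help. It is a sketch under simplifying assumptions: delayed expansion off, no percent expansion, no splitting of commands at &, and a trailing ^ is simply dropped. The name reparse_once is made up for this illustration.

```python
def reparse_once(text):
    """Model one tokenizer pass: outside quotes, ^ is removed and the
    character that follows it is kept literally (illustrative only)."""
    out = []
    in_quotes = False
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == '"':
            in_quotes = not in_quotes   # an unescaped " toggles quote mode
            out.append(ch)
        elif ch == '^' and not in_quotes:
            i += 1                      # drop the hat...
            if i < len(text):
                out.append(text[i])     # ...and keep the escaped character
        else:
            out.append(ch)
        i += 1
    return "".join(out)

# With or without ^ ahead of the parentheses, one pass gives the same text
print(reparse_once("^('echo a^^^&echo b'^)"))  # ('echo a^&echo b')
print(reparse_once("('echo a^^^&echo b')"))    # ('echo a^&echo b')

# The innermost command of the nested FOR /F loses one level per pass
s = "echo a^^^^^^^&echo b"
for _ in range(3):
    s = reparse_once(s)
print(s)  # echo a&echo b
```

In this model the hat ahead of ( costs one character per level of parsing and changes nothing else, which fits the result above.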

To be absolutely foolproof, I'd need a full-fledged parser that knows about special cases like the parsing of if, rem, and for.
I planned for some improvements, but not that much!
So apparently my dream of automating macro escaping is way harder than I hoped :-(
But still, the routine as is can help in many simpler cases!

aGerman

Re: Automatically escaping strings to survive multiple parsings

#4 Post by aGerman » 05 Oct 2019 15:06

jfl wrote: "So maybe you're right and escaping the ( is not necessary. But apparently it works with it too. :)"
Oh, of course it works. My question is about the performance of your code and the limited string length in batch. Both would benefit from dropping escapes that aren't needed.
jfl wrote: "So apparently my dream of automating macro escaping is way harder than I hoped :("
jeb thought it would be impossible to call macros from within a macro. Maybe you can prove him wrong.
jfl wrote: "But still, the routine as is can help in many simpler cases!"
No doubt.

Steffen