Closing parenthesis prevents escape of subsequent special character operator

dbenham
Expert
Posts: 2461
Joined: 12 Feb 2011 21:02
Location: United States (east coast)

Re: Closing parenthesis prevents escape of subsequent special character operator

#16 Post by dbenham » 26 Dec 2017 05:57

Curiouser and curiouser :!:

Test 2 is an "expected" failure of an escaped ampersand
But how pray tell is Test 5 working :?: :shock:

Code: Select all

prompt>echo 1: Expected behavior - 1st character of continued line is escaped ^
More? & so this is all part of ECHO statement
1: Expected behavior - 1st character of continued line is escaped & so this is all part of ECHO statement

prompt>(echo 2: "Expected" bug behavior) ^
More? & echo Escape of ampersand after closure fails
2: "Expected" bug behavior
Escape of ampersand after closure fails

prompt>echo 3: Expected behavior ^
More? ^& echo The 1st escaped character is a caret, so ampersand is functional
3: Expected behavior ^
The 1st escaped character is a caret, so ampersand is functional

prompt>(echo 4: Expected behavior - cannot start with escaped caret after closure) ^^& echo Syntax error expected
^ was unexpected at this time.

prompt>(echo 5: What is going on here?) ^
More? ^& echo I expect a syntax error with escaped caret, but both carets disappear, leaving functional ampersand
5: What is going on here?
I expect a syntax error with escaped caret, but both carets disappear, leaving functional ampersand

prompt>(echo 6: I finally get) ^
More? ^^& my caret syntax error
^ was unexpected at this time.
jeb - where are you :?: We need help coming up with some rules that make sense of all of this nonsense.


Dave Benham

#17 Post by dbenham » 26 Dec 2017 06:44

I just looked closer at penpen's tests in post #8, and it is totally bizarre :!:

The first & is getting stripped, so && is functioning as simple command concatenation, not as conditional command concatenation.

Here is my proof.

First I'll show normal expected behavior of && and ||. The HELP command gives bass-ackwards error return codes:

Code: Select all

prompt>help help>nul && echo SUCCESS || echo FAILURE
FAILURE
The only way I can explain the following is if the first & is dropped:

Code: Select all

prompt>(help help>nul) ^
More? &^
More? & echo This should not execute
This should not execute
Likewise, the first | must be dropped:

Code: Select all

prompt>(help help) ^
More? |^
More? | findstr /n "^"
1:Provides help information for Windows commands.
2:
3:HELP [command]
4:
5:    command - displays help information on that command.
And here I combine the effects:

Code: Select all

prompt>(help help >nul) ^
More? &^
More? & (echo This shouldn't execute) ^
More? |^
More? | findstr /n "^"
1:This shouldn't execute
Dave Benham

#18 Post by dbenham » 27 Dec 2017 00:55

Wow, it almost doesn't matter what characters appear within the intermediate continued lines, as long as each line contains a single token.
All intermediate lines are ignored :!:

Code: Select all

prompt>(call )^
More? This^
More? is^
More? all^
More? ignored^
More? &echo This is executed
This is executed
Token delimiters can be escaped

Code: Select all

prompt>(call )^
More? Token^ delimiters^ can^ be^ escaped^ to^ force^ a^ single^ token^
More? &echo This is executed
This is executed
It even works with newlines:

Code: Select all

prompt>(call )^
More?
More? Even^ with^ newlines^,^
More?
More? intermediate^ single^ token^ lines^ are^ ignored^
More?
More? &echo This is executed
This is executed
Dave Benham

#19 Post by dbenham » 30 Aug 2019 21:29

I stumbled upon this old post of mine, and I just realized the last behavior is identical to REM :shock: :!:
If there is only one following token ending with escape and end of line, then the token is thrown away. This repeats until there is more than one token on the line, or the line doesn't end with escape. Note that token delimiters can be escaped.

Note this script is run with ECHO ON

Code: Select all

rem ignore^
 ignore^ ignore^
 This is the comment

(break) ignore^
 ignore^ ignore^
 &echo This is echoed
--OUTPUT--

Code: Select all


C:\test>rem This is the comment 

C:\test>(break)  & echo This is echoed 
This is echoed
I can't believe this is a coincidence. It seems the closing parenthesis and REM implementations must share some common code. But why that behavior is there is still a mystery :?
Perhaps penpen's theory can explain this? But I don't understand it, so I can't tell.


Dave Benham

penpen
Expert
Posts: 1991
Joined: 23 Jun 2013 06:15
Location: Germany

#20 Post by penpen » 01 Sep 2019 13:14

I should explain my theory in more detail (you might want to copy-paste the ASCII graphics into Notepad).

Example:

Code: Select all

(
echo some
:label1
:label2
echo text
)
The optimized parse tree probably looks something like this (I removed some parts that are unnecessary for my explanation, to keep the example small):

Code: Select all

                     ┌─────┐
                     │ ( ) │
                     └┬┬┬┬┬┘
    ┌─────────────────┘│││└────────────────────────────┐
    │         ┌────────┘│└───────────┐                 │
┌───┴──┐   ┌──┴──┐   ┌──┴───┐    ┌───┴───┐          ┌──┴──┐
│ CRLF │   │ Cmd │   │ CRLF │    │ Label │          │ Cmd │
└──────┘   └┬───┬┘   └──────┘    └───┬───┘          └┬───┬┘
         ┌──┘   └──┐                 │            ┌──┘   └──┐
     ┌───┴──┐   ┌──┴───┐        ┌────┴────┐   ┌───┴──┐   ┌──┴───┐
     │ echo │   │ some │        │ :label2 │   │ echo │   │ text │
     └──────┘   └──────┘        └─────────┘   └──────┘   └──────┘
Stored in two arrays:
tree[] := [ (), CRLF, Cmd, CRLF, Label, Cmd, echo, some, :label2, echo, text ]
children[] := [ (1, 5), null, (6, 7), null, (8, 8), (9, 10), null, null, null, null, null ]

One advantage of such a representation is that you can optimize easily, without knowing the sublevels of the tree.


Because :label1 was part of the initial command, there must be an intermediate parse tree:

Code: Select all

                                ┌───────┐
                                │  ( )  │
                                └┬┬┬┬┬┬┬┘
    ┌────────────────────────────┘│││││└─────────────────────────────────────┐
    │         ┌───────────────────┘│││└────────────────────┐                 │
    │         │         ┌──────────┘│└─────────┐           │                 │
┌───┴──┐   ┌──┴──┐   ┌──┴───┐   ┌───┴───┐   ┌──┴───┐   ┌───┴───┐          ┌──┴──┐
│ CRLF │   │ Cmd │   │ CRLF │   │ Label │   │ CRLF │   │ Label │          │ Cmd │
└──────┘   └┬───┬┘   └──────┘   └───┬───┘   └──────┘   └───┬───┘          └┬───┬┘
         ┌──┘   └──┐                │                      │            ┌──┘   └──┐
     ┌───┴──┐   ┌──┴───┐       ┌────┴────┐            ┌────┴────┐   ┌───┴──┐   ┌──┴───┐
     │ echo │   │ some │       │ :label1 │            │ :label2 │   │ echo │   │ text │
     └──────┘   └──────┘       └─────────┘            └─────────┘   └──────┘   └──────┘
Stored:
tree[] := [ (), CRLF, Cmd, CRLF, Label, CRLF, Label, Cmd, echo, "some", :label1, :label2, echo, "text" ]
children[] := [ (1, 7), null, (8, 9), null, (10, 10), null, (11, 11), (12, 13), null, null, null, null, null, null ]

So the optimization rule most probably was "remove [Label, CRLF]" (which would be the default rule to use in such a case):
If the rule were "remove [CRLF, Label]", it would have removed both labels.
If the rule were "remove [Label]", it would have removed both labels, too.



It is very unusual to place two command tokens next to each other without a delimiter when delimiters are valid tokens.
It is also very unusual to start with a delimiter.
So this is what I would have expected the intermediate parse tree to look like:

Code: Select all

                                      ┌───────┐
                                      │  ( )  │
                                      └┬┬┬┬┬┬┬┘
         ┌─────────────────────────────┘│││││└──────────────────────────────┐
         │          ┌───────────────────┘│││└────────────────────┐          │
         │          │          ┌─────────┘│└──────────┐          │          │
      ┌──┴──┐   ┌───┴──┐   ┌───┴───┐   ┌──┴───┐   ┌───┴───┐   ┌──┴───┐   ┌──┴──┐
      │ Cmd │   │ CRLF │   │ Label │   │ CRLF │   │ Label │   │ CRLF │   │ Cmd │
      └┬───┬┘   └──────┘   └───┬───┘   └──────┘   └───┬───┘   └──────┘   └┬───┬┘
    ┌──┘   └──┐                │                      │                ┌──┘   └──┐
┌───┴──┐   ┌──┴───┐       ┌────┴────┐            ┌────┴────┐       ┌───┴──┐   ┌──┴───┐
│ echo │   │ some │       │ :label1 │            │ :label2 │       │ echo │   │ text │
└──────┘   └──────┘       └─────────┘            └─────────┘       └──────┘   └──────┘
stored:
tree[] := [ (), Cmd, CRLF, Label, CRLF, Label, CRLF, Cmd, echo, some, :label1, :label2, echo, text ]
children[] := [ (1, 7), (8, 9), null, (10, 10), null, (11, 11), null, (12, 13), null, null, null, null, null, null ]

The label removal would have worked correctly, no matter which rule was used.
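
penpen's reasoning about the three candidate rules can be checked mechanically. The following is only a toy sketch in Python: it flattens the children of the "( )" node from the diagrams into a list of (kind, text) pairs and applies each rule as a left-to-right, non-overlapping pattern removal. The helper names `remove_pattern` and `labels` are made up for this illustration, and nothing here claims to reflect how cmd.exe actually stores or optimizes its parse tree.

```python
# Toy check of the three candidate rules; the flat child lists are a
# simplification of the diagrams, not cmd.exe's real data structures.

def remove_pattern(nodes, pattern):
    """Drop each left-to-right, non-overlapping run whose kinds equal `pattern`."""
    out, i = [], 0
    while i < len(nodes):
        kinds = [kind for kind, _ in nodes[i:i + len(pattern)]]
        if kinds == pattern:
            i += len(pattern)          # skip the whole matched run
        else:
            out.append(nodes[i])
            i += 1
    return out

def labels(nodes):
    """Return the label texts that survive in `nodes`."""
    return [text for kind, text in nodes if kind == "Label"]

# Children of "( )" in the intermediate tree deduced from the observed behaviour.
observed = [("CRLF", None), ("Cmd", "echo some"), ("CRLF", None),
            ("Label", ":label1"), ("CRLF", None), ("Label", ":label2"),
            ("Cmd", "echo text")]

# Children of "( )" in the layout penpen would have expected instead.
expected = [("Cmd", "echo some"), ("CRLF", None), ("Label", ":label1"),
            ("CRLF", None), ("Label", ":label2"), ("CRLF", None),
            ("Cmd", "echo text")]

RULES = [["Label", "CRLF"], ["CRLF", "Label"], ["Label"]]

for rule in RULES:
    print(rule,
          "observed layout keeps", labels(remove_pattern(observed, rule)),
          "| expected layout keeps", labels(remove_pattern(expected, rule)))
```

Under these assumptions only "remove [Label, CRLF]" applied to the observed layout leaves :label2 behind, which matches the optimized tree; with the expected layout every rule removes both labels.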

However, it is possible that this explains the above behaviour:
If the parse tree is built from bottom to top, the " ^" part may end up as an argument of the "( )" command (if there is a phase or "read argument" code path in which the compound command is read and in which "^" is not a special character).
But I don't know anything about that building process, so this is just an unproven hypothesis based on the (from a programmer's viewpoint) misplaced CRLF and a label that isn't removed properly from the parse tree while another one is.
The only thing I know is that such errors are typical for programmers and are hard to find unless you search for them specifically.


penpen

sst
Posts: 93
Joined: 12 Apr 2018 23:45

#21 Post by sst » 02 Sep 2019 08:10

dbenham wrote:
23 Oct 2017 12:23
I am not able to escape a special character operator token if it appears immediately after a closing parenthesis. Not sure how this is useful, but I find it interesting. I would expect the escaped forms to generate a syntax error, but rather the "escaped" operator is fully functional :!:

Anyone have any ideas as to the mechanism :?:
This was interesting enough to make me spend time analyzing the parser implementation with the help of a debugger and the CMD debugging symbols.

This effect is due to how the operator precedence is implemented in CMD.

Code: Select all

Operator precedence in CMD:
1. Command Group and silence operators: (), @
2. Redirection operators: > >> <
3. Pipe operator: |
4. Success operator: &&
5. Failure operator: ||
6. Command separation operator: &
To achieve the above precedence, CMD does not parse the whole statement at once. Rather, it breaks the original statement into several parts, then scans and parses each of them individually.

Let's call the original statement Statement0
It goes like this:

Code: Select all

1. Statement0 ---> Statement1 & Statement0_new
2. Statement1 ---> Statement2 || Statement1_new
3. Statement2 ---> Statement3 && Statement2_new
4. Statement3 ---> Statement4 | Statement3_new
5. Statement4 ---> Detect leading redirection: >redir Statement5
6. Statement5 ---> ( Statement0_new ) , @Statement0_new , Statement5 (Detects IF, FOR, REM, command token, arguments, trailing and middle redirection)
The left hand side of the operands will be parsed first, so the actual parsing will begin at level 6.
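
The splitting levels can be mimicked with a small recursive-descent sketch. This is a toy model in Python, not a translation of cmd.exe: the names `parse` and `parse_command` are invented for the example, tokens are assumed to be already separated by whitespace, and redirection, @, IF, FOR and REM are left out. It also treats ) as special everywhere, whereas the real parser only does so while the parenthesis count is non-zero.

```python
# Toy model of the statement-splitting levels; not cmd.exe's real parser.
BINARY_LEVELS = ["&", "||", "&&", "|"]     # loosest binding first (levels 1-4)
STOP_TOKENS = set(BINARY_LEVELS) | {")"}

def parse(tokens, level=0):
    """Parse `tokens` (a list that is consumed in place) starting at `level`."""
    if level == len(BINARY_LEVELS):
        return parse_command(tokens)
    left = parse(tokens, level + 1)            # left side: next tighter level
    if tokens and tokens[0] == BINARY_LEVELS[level]:
        operator = tokens.pop(0)
        right = parse(tokens, level)           # right side: same level again
        return (operator, left, right)
    return left

def parse_command(tokens):
    """Parse a parenthesised group or a plain command with its arguments."""
    if tokens and tokens[0] == "(":
        tokens.pop(0)
        inner = parse(tokens, 0)               # a group restarts at the top level
        if not tokens or tokens.pop(0) != ")":
            raise SyntaxError("closing parenthesis expected")
        return ("group", inner)
    words = []
    while tokens and tokens[0] not in STOP_TOKENS:
        words.append(tokens.pop(0))
    return ("cmd", " ".join(words))

tree = parse("a & b || c && d | e".split())
print(tree)
```

Because every level first descends to the tighter levels for its left operand, the pipe ends up deepest in the tree and the plain & ends up at the root, which is the precedence order listed above.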

Now let's apply the above logic to the following simple case:

Code: Select all

(echo Hello) ^& echo World
1. The parser reads the first token and scans for a leading redirection. It fails to find one.

2. The parser checks whether the token begins with ( or @. If either is found, it sets the current parse node type to Command Group or Silent, allocates a new parse node, and begins again at level 1; otherwise it proceeds to find the command token and stops at the first occurrence of (&, |, >, <).
2-1 Since the token began with a left paren (, it increases the parenthesis count by 1. A non-zero parenthesis count makes the right paren ) a special token.

3. It begins with the newly allocated parse node at level 1, which again ends up at level 6. It finds echo Hello as the command token and its arguments.

4. The parser recursively returns to level 1, failing to find any of the (&, |, >, <) operators. Because the right paren is special, it will NOT proceed past the right paren.

5. The parser returns to level 6 with the previous parse node. It reads the next token to see whether it is the right paren. If it is not, it complains and fails the parse. Because the right paren is present, it closes the parse node and goes back through levels 5 to 1. At each level it reads the next token to look for the corresponding operator.

6. The parser is at level 1. It reads the next token (GeToken) and compares it against the string "&". The escaping is done in the GeToken function, so it returns "&" instead of "^&" and sets the token type to Literal Text (0x4000). But the operator parsing function (which is named BinaryOperator) pays no attention to the token type; it just compares the token against "&", which succeeds, so the token is interpreted as the Command Separator.

7. Since the operator parsing function detected the Command Separator, it parses the rest of the input as a new Statement0 for the right side of the & operator.

EDIT:
I forgot to include the pipe operator in the statement splitting levels. Corrected.
Last edited by sst on 05 Sep 2019 18:08, edited 1 time in total.

#22 Post by dbenham » 03 Sep 2019 05:30

sst wrote:
02 Sep 2019 08:10
6. The parser is at level 1. It reads the next token(GeToken) and compares it against the string "&". The escaping will be done in the GeToken function so it will return "&" instead of "^&" and sets the token type to Literal Text(0x4000) but the operator parsing function (which is named BinaryOperator) will not pay attention to token type, it just compares the token against "&" which will succeed, so it will be interpreted as Command Separator
So there is the key fault. It exactly describes the original behavior described in this thread. But why does the parser only ignore the token type immediately after a closing parenthesis :?:

And then there is the odd line continuation behavior whereby REM and ) continue to strip lone tokens until there are 2 or more tokens or no line continuation. I'm not seeing the cause of this either.


Dave Benham

#23 Post by penpen » 05 Sep 2019 00:48

sst wrote:
02 Sep 2019 08:10
This was interesting enough to make me spend time in analyzing the parser implementation with the help of debugger and CMD debugging symbols.

This effect is due to how the operator precedence is implemented in CMD.

Code: Select all

Operator precedence in CMD:
1. Command Group and silence operators: (), @
2. Redirection operators: > >> <
3. Pipe operator: |
4. Success operator: &&
5. Failure operator: ||
6. Command separation operator: &
It's interesting, that this list doesn't contain the escape operator (the circumflex accent character: '^').
That raises the question of whether you used something like IDA, where you see all the code, or just traced the program flow.
The first option would be nice, because then it could be possible that there is a cmd-parser phase with no escape character,
but the second option would also be an explanation.


penpen

#24 Post by sst » 06 Sep 2019 00:02

penpen wrote:
05 Sep 2019 00:48
It's interesting, that this list doesn't contain the escape operator (the circumflex accent character: '^').
That raises the question, if you used something like IDA, where you see all code, or if you just traced the program flow.
The first option would be nice, beacause then it could be possible that there is a cmd-parser phase with no escape character,
but the second option would also be an explanation.
Well '^' is not an operator in the same sense that (&& || | & < > >> () @) are.
The parser has no knowledge of the caret and doesn't care about it. The caret is handled exclusively in the tokenization stage by the lexer. When the parser requests the next token from the lexer (GeToken), the necessary escaping is done by the lexer.

About the analysis: I used both IDA and execution tracing with debuggers, OllyDbg (mostly) and x64dbg.
What I've posted is mostly the result of extensive analysis in the debugger. It doesn't take the place of a powerful tool like IDA, but without tracing and seeing what is really going on with each input, it is hard to deduce the logic of the code.

In the meantime, I was and still am in the process of translating parts of the parser code to a high-level language like C. For this task I'm using IDA. It is very difficult and time-consuming. Debugging symbols are of great help, but unfortunately the public symbols don't contain information about the data types and structures. One of the most difficult tasks is to find out the layout of the data structures and their meaning and purpose; this is where tracing helps a lot. So I'm using a combination of all the tools to construct clear C code.
I'm also studying the Debug/Checked builds of CMD (WinXP, Win7), which are full of debug output strings with lots of valuable information.

I have made good progress so far. The most relevant parts of the parser are reconstructed, except the command parser parts (FOR, IF, REM, generic command), for which I currently have no plan. The next step is to reconstruct the lexer, at least partially, to get a more complete picture. I will share more details about this in the future.
dbenham wrote:
03 Sep 2019 05:30
So there is the key fault. It exactly describes the original behavior described in this thread. But why does the parser only ignore the token type immediately after a closing parenthesis :?:
The BinaryOperator function is responsible for parsing the binary operators (& | && ||). It is simply implemented this way: it doesn't check the token type at all.
A parenthesis pair is the only case where an escaped operator reaches the BinaryOperator function. This is because when the parser is at level 6 (command parser) and sees the matching closing parenthesis, it doesn't process the input any further, so the escaped operator is not consumed by the command parser. Later the escaped operator is seen by the BinaryOperator function, which doesn't care about the token type.

Let me describe it another way:
When the parser is looking for the command token, it looks for text tokens. Any non-text token (& && | || < > >>) marks the end of the input for the command parser. Inside a parenthesized statement the closing parenthesis is also special, so it too marks the end of input for the command parser. Any token, escaped or otherwise, past the command parser's input will be seen by the binary operator parser. And as described, the binary operator parser doesn't care about the token type. It just performs the string comparison: wcscmp(L"&", TokBuf)

Code: Select all

//...
return BinaryOperator(L"&", ....);
//...

BinaryOperator(const wchar_t *_operator, ....) {
    // call the left hand parser.
    // the left hand parser will not consume the _operator
    leftHandParser();
    if( no token is left to process ) {
        // no more tokens so there is no operator
        // return the parsed left hand
        return ...;
    }
    if( !wcscmp(_operator, TokBuf) ) {
        // current token is the _operator (only the text is compared,
        // the token type is never checked)
        // get the next token and call the right hand parser
        GeToken(..);
        rightHandParser();
        //...
    } else {
        // the _operator is not present
        // return the parsed left hand
        return ...;
    }
}
This also explains why (echo Hello) ^& echo World works but (echo Hello) ^&echo World doesn't. In the former case the binary operator parser sees & as the current token; in the latter case it sees &echo as the current token. If the operator were not escaped, the lexer would return & as the current token in both cases.
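
The difference between the two cases can be reproduced with a very small model. The Python below is a hedged sketch, not cmd.exe's lexer: `next_token` and `sees_separator` are hypothetical names, only spaces, & and | are treated as special, and the token type is returned purely to show that the comparison never looks at it.

```python
# Toy model: the lexer strips the caret and tags the token as literal text,
# but the operator check compares only the token text.
LITERAL, OPERATOR = "literal", "operator"

def next_token(text, pos=0):
    """Return (token_text, token_type, new_pos) for the token starting at pos."""
    while pos < len(text) and text[pos] == " ":
        pos += 1                                   # skip leading delimiters
    if pos < len(text) and text[pos] in "&|":
        return text[pos], OPERATOR, pos + 1        # unescaped operator: own token
    chars = []
    while pos < len(text) and text[pos] not in " &|":
        if text[pos] == "^" and pos + 1 < len(text):
            pos += 1                               # drop the caret, keep next char
        chars.append(text[pos])
        pos += 1
    return "".join(chars), LITERAL, pos

def sees_separator(rest_after_paren):
    """Mimic the wcscmp check: the token type is deliberately ignored."""
    token, _token_type, _pos = next_token(rest_after_paren)
    return token == "&"

for rest in (" ^& echo World", " ^&echo World", " &echo World"):
    token, token_type, _ = next_token(rest)
    print(repr(rest), "->", repr(token), token_type, sees_separator(rest))
```

In this model " ^& echo World" yields the literal token & and passes the comparison, while " ^&echo World" yields the literal token &echo and fails it, in line with the behaviour described in the post.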
dbenham wrote:
03 Sep 2019 05:30
And then there is the odd line continuation behavior whereby REM and ) continue to strip lone tokens until there are 2 or more tokens or no line continuation. I'm not seeing the cause of this either.
I don't follow. Would you give an example?

#25 Post by dbenham » 06 Sep 2019 03:58

sst wrote:
06 Sep 2019 00:02
I don't follow. Would you give an example?
From a few posts ago - viewtopic.php?f=3&t=8198&start=15#p60206

Dave Benham

#26 Post by sst » 06 Sep 2019 08:15

dbenham wrote:
30 Aug 2019 21:29
I stumbled upon this old post of mine, and I just realized the last behavior is identical to REM :shock: :!:
If there is only one following token ending with escape and end of line, then the token is thrown away. This repeats until there is more than one token on the line, or the line doesn't end with escape. Note that token delimiters can be escaped.

Note this script is run with ECHO ON

Code: Select all

rem ignore^
 ignore^ ignore^
 This is the comment

(break) ignore^
 ignore^ ignore^
 &echo This is echoed
--OUTPUT--

Code: Select all


C:\test>rem This is the comment 

C:\test>(break)  & echo This is echoed 
This is echoed
I can't believe this is a coincidence. It seems the closing parenthesis and REM implementations must share some common code. But why that behavior is there is still a mystery :?
Perhaps penpen's theory can explain this? But I don't understand it, so I can't tell.


Dave Benham
This is related to the lexer (tokenizer) and how it manages its buffer. The CMD parser routines request the next token from the lexer via the GeToken function. There is also a function call that undoes the last call to GeToken. Let's call it UnGeToken (the actual function name is not UnGeToken; rather, it is a direct call into the lexer, but that is not important here).

The lexer holds a pointer to the current token address (LexBufPtr) in the lexer's buffer and a pointer to the previous token address (PrevLexPtr). UnGeToken restores the previous token by a simple assignment: LexBufPtr = PrevLexPtr.

When there is a line continuation, there is no more data in the lexer's buffer, so it reads the next line from StdIn or from the batch file and fills the lexer buffer, which overwrites the previous data and resets both pointers (LexBufPtr, PrevLexPtr) to the start of the lexer buffer. Ungetting the last token is then no longer possible.

The REM parser reads the first token (GeToken) and compares it against "/?". If the token is "/?", it sets the help flag to true; otherwise it ungets the token (UnGeToken) and does a second GeToken, this time requesting the whole remaining data from the lexer, which also turns off caret functionality ( GeToken(REM_FLAG) ).

The case for () is similar: it is caused by a GeToken/UnGeToken pair.
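
The buffer mechanics can be played through with a toy lexer. The Python below is a simplified sketch under stated assumptions, not a reconstruction of cmd.exe: `ToyLexer`, `parse_rem` and `rest_of_buffer` are hypothetical names, only spaces delimit tokens, the character that follows a caret (including the first character of a continued line) is taken literally, and the REM_FLAG read is modeled as "return whatever is left in the current buffer". The help check uses a substring test, as a later post in this thread describes.

```python
# Toy model of GeToken / UnGeToken with a single-line buffer.
class ToyLexer:
    def __init__(self, lines):
        self.pending = list(lines)   # source lines not yet loaded
        self.buf = ""
        self.pos = 0                 # stands in for LexBufPtr
        self.prev = 0                # stands in for PrevLexPtr
        self.fill_buf()

    def fill_buf(self):
        # A new line overwrites the buffer and resets BOTH pointers.
        self.buf = self.pending.pop(0) if self.pending else ""
        self.pos = self.prev = 0

    def get_token(self):
        self.prev = self.pos                      # remember where this call began
        while self.pos < len(self.buf) and self.buf[self.pos] == " ":
            self.pos += 1
        chars = []
        while self.pos < len(self.buf):
            ch = self.buf[self.pos]
            if ch == "^":
                self.pos += 1
                if self.pos == len(self.buf):     # caret at end of line
                    if not self.pending:
                        break
                    self.fill_buf()               # continuation: old data is gone
                if self.pos < len(self.buf):
                    chars.append(self.buf[self.pos])   # escaped char is literal
                    self.pos += 1
                continue
            if ch == " ":
                break
            if ch in "&|<>()":
                if not chars:                     # an operator is its own token
                    chars.append(ch)
                    self.pos += 1
                break
            chars.append(ch)
            self.pos += 1
        return "".join(chars)

    def unget_token(self):
        self.pos = self.prev                      # LexBufPtr = PrevLexPtr

    def rest_of_buffer(self):
        rest, self.pos = self.buf[self.pos:], len(self.buf)
        return rest

def parse_rem(lines):
    lex = ToyLexer(lines)
    lex.get_token()                               # the REM keyword itself
    first = lex.get_token()                       # may span continued lines
    if "/?" in first:
        return "<help>"
    lex.unget_token()                             # lands at start of the LAST line read
    return lex.rest_of_buffer()

print(repr(parse_rem(["rem ignore^", " ignore^ ignore^", " This is the comment"])))
```

In this model the first argument token swallows every continued line it crosses, and the unget can only rewind to the start of the most recently loaded line, so the text on the earlier lines is lost.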

jeb
Expert
Posts: 1041
Joined: 30 Aug 2007 08:05
Location: Germany, Bochum

#27 Post by jeb » 06 Sep 2019 11:05

Hi sst,

I'm impressed by your analysis/debugging. :o :!:
It explains so much.

Now I understand why the REM trick to gather the arguments works so well, and why it fails with multiline input.
sst wrote:
06 Sep 2019 08:15
The REM parser will read the first token(GeToken) and compares it against "/?". If the token is "/?" it sets the help flag to true else it ungets the token(UnGeToken) and do a second GeToken but this time requests the whole remaining data from lexer which also turns off caret functionality. ( GeToken(REM_FLAG) ).
But even with your explanation, I don't understand why in this sample the token in front of the caret is dropped.

Code: Select all

@echo off

echo one <nul two^
three four
output wrote:one three four
I suppose it has something to do with
sst wrote:
06 Sep 2019 08:15
The lexer holds a pointer to the current token address(LexBufPtr) in the lexer's buffer and holds a pointer the the previous token address(PrevLexPtr). UnGeToken restores the previous token by a simple assigment: LexBufPtr = PrevLexPtr.

When there is a line continuation, there is no more data in the lexer's buffer, so it reads the next line from StdIn or from batch file and fills the lexer buffer which overwrites previous data and resets both pointers(LexBufPtr, PrevLexPtr) to the start of the lexer buffer. It is obvious that ungetting the last token is not possible anymore.
But why does it only drop the token when a redirection is present :?:

#28 Post by dbenham » 07 Sep 2019 05:41

jeb wrote:
06 Sep 2019 11:05
Hi sst,

I'm impressed by your analysis/debugging. :o :!:
It explains so much.
Ditto that
jeb wrote:
06 Sep 2019 11:05
But why it only drops the token, when a redirection is present :?:
I'll take a stab at that.

The redirection parser requires the file path, which may be in the next token.

Code: Select all

echo hello 2> one^
 two^
 three.txt world
--OUTPUT--

Code: Select all

hello  world
Creates empty file "one two three.txt"

But the redirection parser is lazy and blindly scans the next token before checking if the redirection token already contains the file path. Then when it discovers the file within the redirection, it says "never mind, I don't need that next token", and gives it back to the command. But now the damage has been done.

Code: Select all

echo hello 2>output.txt ignore^
 ignore^
 world
--OUTPUT--

Code: Select all

hello  world
Creates empty file "output.txt"


Dave Benham

#29 Post by sst » 08 Sep 2019 05:28

jeb wrote:
06 Sep 2019 11:05
Now, I understand why the REM trick to gather the arguments works so good, and why it fails with multiline input.
sst wrote:
06 Sep 2019 08:15
The REM parser will read the first token(GeToken) and compares it against "/?". If the token is "/?" it sets the help flag to true else it ungets the token(UnGeToken) and do a second GeToken but this time requests the whole remaining data from lexer which also turns off caret functionality. ( GeToken(REM_FLAG) ).
Hi jeb,
I was not precise when I said the REM parser compares the first token against "/?". Actually it checks whether the first token contains "/?". As you already know, something like "abc/?efg" also triggers the help.

But the interesting part is that when it finds the help switch, it won't process the rest of the arguments, if there are any. The help token marks the end of the command. So if there are more tokens after the help token, they will be parsed by the redirection and binary operator parsers.

So besides the parenthesis pair there is another case in which ^& and the like work as operators:

Code: Select all

REM /? ^& echo Works
This is also the case for IF and FOR since they use the same logic as REM uses for processing the help switch.

jeb wrote:
06 Sep 2019 11:05
But why it only drops the token, when a redirection is present :?:
Dave explained it well

-------------------------------------------------------------------------------

The following should give a clearer understanding of how GeToken/UnGeToken work and show how the state of the lexer changes after each call:

Code: Select all

// Parser does not have access to LexBuf, it reads the token from TokBuf
// The TokBuf will be filled by calling GeToken


// The initial state of LexBuf
// The LineFeed (0xA) character is Token5
// 0xD is CarriageReturn
// 0x0 is the null terminator
1)  
LexBuf:                                        TokBuf:
---------------------------------------        -----------
|Token1 Token2 Token3 Token40xD0xA0x0 |        |         |
---------------------------------------        -----------
 ^
 |
 PrevLexPtr, LexBufPtr


2)
GeToken();  //GeToken copies the token(char by char) which is pointed by LexBufPtr to TokBuf

3)
LexBuf:                                        TokBuf:
---------------------------------------        -----------
|Token1 Token2 Token3 Token40xD0xA0x0 |        |Token1   |
---------------------------------------        -----------
 ^      ^
 |      |
 |      LexBufPtr
 |
 PrevLexPtr

4)
GeToken();  //GeToken copies the token(char by char) which is pointed by LexBufPtr to TokBuf


5)
LexBuf:                                        TokBuf:
---------------------------------------        -----------
|Token1 Token2 Token3 Token40xD0xA0x0 |        |Token2   |
---------------------------------------        -----------
        ^      ^
        |      |
        |      LexBufPtr
        |
        PrevLexPtr

6)
UnGeToken();  //UnGeToken sets LexBufPtr=PrevLexPtr. It has no effect on the contents of TokBuf

7)
LexBuf:                                        TokBuf:
---------------------------------------        -----------
|Token1 Token2 Token3 Token40xD0xA0x0 |        |Token2   |
---------------------------------------        -----------
        ^
        |
        PrevLexPtr, LexBufPtr

8)
GeToken();  //GeToken copies the token(char by char) which is pointed by LexBufPtr to TokBuf


9)
LexBuf:                                        TokBuf:
---------------------------------------        -----------
|Token1 Token2 Token3 Token40xD0xA0x0 |        |Token2   |
---------------------------------------        -----------
        ^      ^
        |      |
        |      LexBufPtr
        |
        PrevLexPtr


10)
GeToken();  //GeToken copies the token(char by char) which is pointed by LexBufPtr to TokBuf


11)
LexBuf:                                        TokBuf:
---------------------------------------        -----------
|Token1 Token2 Token3 Token40xD0xA0x0 |        |Token3   |
---------------------------------------        -----------
               ^      ^
               |      |
               |      LexBufPtr
               |
               PrevLexPtr


12)
GeToken();  //GeToken copies the token(char by char) which is pointed by LexBufPtr to TokBuf
            //Skips over CR(0xD)


13)
LexBuf:                                        TokBuf:
---------------------------------------        -----------
|Token1 Token2 Token3 Token40xD0xA0x0 |        |Token4   |
---------------------------------------        -----------
                      ^        ^
                      |        |
                      |        LexBufPtr
                      |
                      PrevLexPtr


14)
GeToken();  //GeToken copies the token(char by char) which is pointed by LexBufPtr to TokBuf


15)
LexBuf:                                        TokBuf:
---------------------------------------        -----------
|Token1 Token2 Token3 Token40xD0xA0x0 |        |0xA      |
---------------------------------------        -----------
                               ^  ^
                               |  |
                               |  LexBufPtr
                               |
                               PrevLexPtr



16)
//LexBufPtr now points to the null char, so GeToken will cause LexBuf to be refilled with new data
GeToken();




17)  //FillBuf()
LexBuf:                                        TokBuf:
---------------------------------------        -----------
|Token6 Token7 Token8 Token90xD0xA0x0 |        |0xA      |
---------------------------------------        -----------
 ^
 |
 PrevLexPtr, LexBufPtr



18)
LexBuf:                                        TokBuf:
---------------------------------------        -----------
|Token6 Token7 Token8 Token90xD0xA0x0 |        |Token6   |
---------------------------------------        -----------
 ^      ^
 |      |
 |      LexBufPtr
 |
 PrevLexPtr
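
The pointer bookkeeping in the walkthrough above can be sketched as a small Python model. This is only a toy, not cmd.exe code: the class name ToyLexer, the method names, and the rule that tokens are split on blanks are assumptions made for illustration. Only the roles of LexBuf, TokBuf, LexBufPtr, PrevLexPtr, GeToken, UnGeToken and FillBuf are taken from the diagrams.

```python
class ToyLexer:
    """Toy model of the LexBuf/TokBuf bookkeeping (not actual cmd.exe code)."""

    def __init__(self, lines):
        self.pending = list(lines)  # input lines not yet loaded into lex_buf
        self.lex_buf = ""           # LexBuf
        self.lex_ptr = 0            # LexBufPtr
        self.prev_ptr = 0           # PrevLexPtr
        self.tok_buf = ""           # TokBuf
        self.fill_buf()

    def fill_buf(self):
        """FillBuf: overwrite lex_buf with the next line and reset both pointers."""
        self.lex_buf = self.pending.pop(0) if self.pending else ""
        self.lex_ptr = self.prev_ptr = 0

    def ge_token(self):
        """GeToken: copy the token at lex_ptr into tok_buf and advance lex_ptr."""
        if self.lex_ptr >= len(self.lex_buf):
            self.fill_buf()              # the previous buffer contents are gone
        buf = self.lex_buf
        if not buf:
            self.tok_buf = ""
            return ""
        self.prev_ptr = start = self.lex_ptr
        if buf[start] == "\n":
            end = start + 1              # LF is returned as a token of its own
        else:
            end = start
            while end < len(buf) and buf[end] not in " \r\n":
                end += 1
        self.tok_buf = buf[start:end]
        while end < len(buf) and buf[end] in " \r":
            end += 1                     # skip blanks and CR before the next token
        self.lex_ptr = end
        return self.tok_buf

    def un_ge_token(self):
        """UnGeToken: rewind lex_ptr to prev_ptr; tok_buf is left untouched."""
        self.lex_ptr = self.prev_ptr


def demo():
    lx = ToyLexer(["Token1 Token2 Token3 Token4\r\n", "Token6 Token7\r\n"])
    seen = [lx.ge_token(), lx.ge_token()]      # Token1, Token2
    lx.un_ge_token()                           # rewind one token
    seen.append(lx.ge_token())                 # Token2 again
    seen += [lx.ge_token() for _ in range(3)]  # Token3, Token4, LF
    seen.append(lx.ge_token())                 # buffer exhausted: refill, Token6
    lx.un_ge_token()                           # rewinds only inside the new buffer
    seen.append(lx.ge_token())                 # Token6 again, never Token4
    return seen
```

Calling demo() replays the sequence from the diagrams. The point to notice is the last rewind: once FillBuf has run, PrevLexPtr refers to the new buffer, so nothing from the previous line can be read again.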
 
After studying the above, the analysis for the REM case can be better understood:

Code: Select all

//Batch Sample running with Echo on
//-------------
REM One^
 Two^ Three^
 This is the comment
//-------------

//Expected output:
//-------------
REM One Two Three This is the comment
//-------------

//Real output:
//-------------
REM This is the comment
//-------------


//Analysis:

// When flags is zero, GeToken runs in normal processing mode.
// DWORD GeToken(DWORD flags);

// Inside ParseRem function
// The state of LexBuf and TokBuf before ParseRem function begins parsing the REM arguments
// 0xA is LF
// 0xD is CR
// 0x0 is NULL char 
// 0x20 is the space char
1)  
LexBuf:                                        TokBuf:
---------------------------------------        ---------------------------
|REM One^0xD0xA0x0                    |        |REM                      |
---------------------------------------        ---------------------------
 ^   ^
 |   |
 |   LexBufPtr
 |
 PrevLexPtr


2)
//Begins parsing REM args
GeToken(0);   //Gets the next token to check for "/?" switch


3)
LexBuf:                                        TokBuf:
---------------------------------------        ---------------------------
|REM One^0xD0xA0x0                    |        |One                      |
---------------------------------------        ---------------------------
     ^         ^
     |         |
     |         LexBufPtr
     |
     PrevLexPtr


4)
//LexBufPtr points to the null char; because of the caret, GeToken refills LexBuf to read and escape the next character
//GeToken ----> FillBuf()
LexBuf:                                        TokBuf:
---------------------------------------        ---------------------------
|0x20Two^0x20Three^0xD0xA0x0          |        |One                      |
---------------------------------------        ---------------------------
 ^
 |
 PrevLexPtr, LexBufPtr


5)
//Continue reading
LexBuf:                                        TokBuf:
---------------------------------------        ---------------------------
|0x20Two^0x20Three^0xD0xA0x0          |        |One Two Three            |
---------------------------------------        ---------------------------
 ^                       ^
 |                       |
 |                       LexBufPtr
 |
 PrevLexPtr


6)
//Again LexBufPtr points to the null char; because of the caret, GeToken refills LexBuf to read and escape the next character
//GeToken ----> FillBuf() 
LexBuf:                                        TokBuf:
---------------------------------------        ---------------------------
|0x20This is the comment0xD0xA0x0     |        |One Two Three            |
---------------------------------------        ---------------------------
 ^
 |
 PrevLexPtr, LexBufPtr


7)
//Continue reading
LexBuf:                                        TokBuf:
---------------------------------------        ---------------------------
|0x20This is the comment0xD0xA0x0     |        |One Two Three This       |
---------------------------------------        ---------------------------
 ^        ^
 |        |
 |        LexBufPtr
 |
 PrevLexPtr



8)
// GeToken returned
// "One Two Three This" does not contain "/?", so set the help flag to 0 and
// unget the last token to re-read the whole command line at once.
UnGeToken();


9)
// The previous lines are lost; UnGeToken can't bring them back
LexBuf:                                        TokBuf:
---------------------------------------        ---------------------------
|0x20This is the comment0xD0xA0x0     |        |One Two Three This       |
---------------------------------------        ---------------------------
 ^
 |
 PrevLexPtr, LexBufPtr


10)
// Read the whole data without tokenization
GeToken(REM_FLAG);


11)
LexBuf:                                        TokBuf:
---------------------------------------        ---------------------------
|0x20This is the comment0xD0xA0x0     |        |0x20This is the comment  |
---------------------------------------        ---------------------------
 ^                         ^
 |                         |
 |                         LexBufPtr
 |
 PrevLexPtr


12)
// copy TokBuf to REM args buffer.
// RemArgs: " This is the comment"
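
The REM steps above can be replayed with a short Python model. It is a sketch under assumptions, not the real implementation: ToyRemLexer, its method names and the rem_mode parameter (standing in for REM_FLAG) are invented for illustration, and the caret handling is reduced to "escape the next character, and refill the buffer when that character is the line end".

```python
class ToyRemLexer:
    """Toy model of the REM parsing steps (not actual cmd.exe code)."""

    def __init__(self, lines):
        self.pending = list(lines)  # input lines not yet loaded into lex_buf
        self.lex_buf = ""           # LexBuf
        self.lex_ptr = 0            # LexBufPtr
        self.prev_ptr = 0           # PrevLexPtr
        self.tok_buf = ""           # TokBuf
        self.fill_buf()

    def fill_buf(self):
        """FillBuf: overwrite lex_buf with the next line and reset both pointers."""
        self.lex_buf = self.pending.pop(0) if self.pending else ""
        self.lex_ptr = self.prev_ptr = 0

    def ge_token(self, rem_mode=False):
        """GeToken: rem_mode=True reads the rest of the line without tokenizing."""
        self.prev_ptr = self.lex_ptr
        out = []
        escaped = False
        while self.lex_ptr < len(self.lex_buf):
            ch = self.lex_buf[self.lex_ptr]
            if rem_mode:
                if ch in "\r\n":
                    break
                out.append(ch)
            elif escaped:
                if ch == "\n":
                    self.fill_buf()      # line continuation: the old line is overwritten
                    continue             # the first char of the new line stays escaped
                if ch != "\r":           # CR is skipped, the escape stays pending
                    out.append(ch)
                    escaped = False
            elif ch == "^":
                escaped = True
            elif ch in " \r\n":
                break                    # an unescaped delimiter ends the token
            else:
                out.append(ch)
            self.lex_ptr += 1
        if not rem_mode:
            while self.lex_ptr < len(self.lex_buf) and self.lex_buf[self.lex_ptr] == " ":
                self.lex_ptr += 1        # skip blanks before the next token
        self.tok_buf = "".join(out)
        return self.tok_buf

    def un_ge_token(self):
        """UnGeToken: rewind lex_ptr to prev_ptr; tok_buf is left untouched."""
        self.lex_ptr = self.prev_ptr


def demo():
    lx = ToyRemLexer(["REM One^\r\n", " Two^ Three^\r\n", " This is the comment\r\n"])
    lx.ge_token()                         # command token: REM
    probe = lx.ge_token()                 # token that is checked for "/?"
    lx.un_ge_token()                      # rewinds only within the last line
    rem_args = lx.ge_token(rem_mode=True)
    return probe, rem_args
```

In this model the probe token spans all three physical lines, but the rewind can only reach the start of the third one, which is why the REM arguments end up as just the last line.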

penpen
Expert
Posts: 1991
Joined: 23 Jun 2013 06:15
Location: Germany

Re: Closing parenthesis prevents escape of subsequent special character operator

#30 Post by penpen » 09 Sep 2019 04:10

I don't know the cmd.exe source, but are you sure that the program doesn't copy parts to somewhere else (and work on that copy)?
What is unclear to me is why MS seems to have built a parser that behaves differently depending on the command token when splitting the line into command and argument string:
If the expected output of "REM One^\r\n Two^ Three^\r\n This is the comment\r\n" is the (command, argument) pair ("REM", " This is the comment\r\n"),
then I would expect any other command to behave the same, but "for One^\r\n Two^ Three^\r\n %%a in (1 2 3) do echo(%%~a" seems to work differently.
(On the other hand, I don't get why MS did what they did on many other occasions, so this might just be another one.)

Sidenote:
If I had to guess, I would have said that they used setjmp and longjmp and lost track of what exactly they were doing, but
it now seems more likely that they did it the classic way (function calls only, which I would always prefer, except for try/catch),
with some unexpected decisions (at least from my viewpoint).
Since you know (parts of) the code: did they use setjmp and longjmp outside any try/catch block?


penpen
