REPLVAR.BAT - regex search and replace for variables

dbenham
Expert
Posts: 2461
Joined: 12 Feb 2011 21:02
Location: United States (east coast)

REPLVAR.BAT - regex search and replace for variables

#1 Post by dbenham » 05 Apr 2014 00:12

I was thinking about the problem of replacing = characters in variable content, as well as other thorny string search and replace issues in batch. My REPL.BAT utility was primarily built to work with files (via pipes or redirection), but it also supports input via an environment variable. The difficulty is how to reliably capture the stdout output in a variable. FOR /F can be used, but carriage returns and line feeds can be troublesome. And then there is the problem of corruption of ! (and ^) when expanding FOR variables while delayed expansion is enabled.

I initially came up with a powerful solution combining the safe return technique with a series of piped REPL.BAT operations. It worked well, but it was very slow. So I decided to build a dedicated REPLVAR.BAT hybrid JScript/batch utility that manages to do all the needed replacements with a single JScript call. It borrows heavily from the REPL.BAT script.

It has most of the same options as REPL.BAT, except input is always from a variable, so no S option, and the search always uses multi-line mode, so no M option.

It has an impressive list of features:
  • Both the input and output values are passed by reference via variable names.
  • Searches can be interpreted as regular expressions, or as string literals.
  • Searches can be case sensitive or insensitive.
  • Replacement strings can reference matched content from the search.
  • Many escape sequences are supported in both the search and replacement strings; all byte codes are supported except NULL (0x00).
  • The utility can be safely called with delayed expansion enabled or disabled, and all input and output characters will be preserved.
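The search and replacement features listed above map onto JavaScript's own String.replace. Here is a minimal sketch in plain JavaScript, runnable under Node.js. The helper name regexReplace is hypothetical and not part of REPLVAR.BAT, and console.log would be WScript.Echo under cscript.

```javascript
// Hypothetical helper: a global, multi-line regex replace with an optional
// case-insensitive flag, similar in spirit to REPLVAR's default mode and I option.
function regexReplace(input, search, replacement, ignoreCase) {
  var flags = "gm" + (ignoreCase ? "i" : "");
  return input.replace(new RegExp(search, flags), replacement);
}

// $1 and $2 refer to captured submatches, $& to the whole match.
console.log(regexReplace("John Smith", "(\\w+) (\\w+)", "$2, $1", false)); // Smith, John
console.log(regexReplace("abc ABC", "b", "[$&]", true));                   // a[b]c A[B]C
```

The backslashes are doubled only because the pattern is written as a JavaScript string literal here; on the batch command line the search would be passed as-is.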

Note - REPLVAR.BAT effectively treats all strings as extended ASCII. The source variable value should map properly to the active code page. If the source value is Unicode that does not map to the active code page, then the value will be silently transformed into a different value that does map to the active code page. Also, the final output must be compatible with the active code page, otherwise an error is raised.

Usage is simple:

Code: Select all

@echo off
setlocal enableDelayedExpansion
set "input=1 + 1 = 3!"
call replVar input output "=" "<>" L
echo(!output!
--OUTPUT--

Code: Select all

1 + 1 <> 3!
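What the L option does in the call above can be sketched in plain JavaScript (runnable under Node.js). The function name literalReplace is hypothetical; the two escaping steps follow the ones in the JScript portion of the script, but this is a simplified illustration, not the script itself.

```javascript
// Sketch of a literal search and replace built on top of a regex engine.
function literalReplace(input, search, replacement) {
  // Escape regex metacharacters so the search is matched literally.
  var pattern = search.replace(/([.^$*+?()[{\\|])/g, "\\$1");
  // Double every $ so the replacement is not read as a substitution pattern.
  var safeReplacement = replacement.replace(/\$/g, "$$$$");
  return input.replace(new RegExp(pattern, "gm"), safeReplacement);
}

console.log(literalReplace("1 + 1 = 3!", "=", "<>")); // 1 + 1 <> 3!
console.log(literalReplace("a.b", ".", "$&"));        // a$&b
```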

A single call takes about 110 milliseconds on my machine. Certainly not fast, but not too bad for batch, considering the power.

Full documentation is embedded within the script.

Let me know if you find any bugs. I've done moderate testing, but I wouldn't be shocked if there are some bugs lurking somewhere.

REPLVAR.BAT
EDIT 2014-04-06, version 1.1: Detect and raise an error if the result is incompatible with the active code page. Also explicitly set ERRORLEVEL to 0 or 1 as appropriate upon return.
EDIT 2014-04-07, version 1.2: Fixed a bug with output when the input included extended ASCII values. Also dropped the V option, so the search and replace strings must now be passed as strings, never by reference using variable names.
EDIT 2014-04-08, version 1.3: Modified the documentation to better explain the limits on the source content.
EDIT 2014-04-24, version 1.4: Fixed the A option that was broken with V1.2.

Code: Select all

@if (@X)==(@Y) @end /* Harmless hybrid line that begins a JScript comment

::************ Documentation ***********
::REPLVAR.BAT version 1.4
:::
:::REPLVAR  InVar  OutVar  Search  Replace  [Options]
:::REPLVAR  /?[REGEX|REPLACE]
:::REPLVAR  /V
:::
:::  Performs a global regular expression search and replace on the contents of
:::  variable InVar and writes the result to variable OutVar.
:::
:::  REPLVAR.BAT works properly with delayed expansion enabled or disabled.
:::
:::  REPLVAR.BAT treats the source variable value as extended ASCII. The value
:::  should map properly to the active code page. Unicode source values that
:::  do not map to the active code page will be silently transformed to a new
:::  value that does map to the active code page. The result of the search and
:::  replace must be compatible with the active code page, otherwise an error
:::  is raised.
:::
:::  The maximum supported output string length usually approaches the 8191
:::  maximum for most strings. But it could be significantly less if the output
:::  string contains many % " \r or \n characters, as they must be temporarily
:::  expanded into 2 or 3 bytes. Also, ^ and ! characters are temporarily
:::  expanded into 2 bytes if delayed expansion is enabled.
:::
:::  REPLVAR.BAT returns with ERRORLEVEL 0 upon success, and ERRORLEVEL 1
:::  upon error. If the A option is used and the input was not altered then
:::  OutVar is undefined and ERRORLEVEL set to 2.
:::
:::  Each parameter may be optionally enclosed by double quotes. The double
:::  quotes are not considered part of the argument. The quotes are required
:::  if the parameter contains a batch token delimiter like space, tab, comma,
:::  semicolon. The quotes should also be used if the argument contains a
:::  batch special character like &, |, etc. so that the special character
:::  does not need to be escaped with ^.
:::
:::  If called with a single argument of /?, then prints help documentation
:::  to stdout. If a single argument of /?REGEX, then opens up Microsoft's
:::  JScript regular expression documentation within your browser. If a single
:::  argument of /?REPLACE, then opens up Microsoft's JScript REPLACE
:::  documentation within your browser.
:::
:::  If called with a single argument of /V, case insensitive, then prints
:::  the version of REPLVAR.BAT.
:::
:::  InVar   - The name of a variable containing the source string.
:::
:::  OutVar  - The name of a variable where the result should be stored.
:::
:::  Search  - By default, this is a case sensitive JScript (ECMA) regular
:::            expression expressed as a string.
:::
:::            The search is conducted using the regular expression g (global)
:::            and m (multiline) flags.
:::
:::            JScript regex syntax documentation is available at
:::            http://msdn.microsoft.com/en-us/library/ae5bf541(v=vs.80).aspx
:::
:::  Replace - By default, this is the string to be used as a replacement for
:::            each found search expression. Full support is provided for
:::            substitution patterns available to the JScript replace method.
:::
:::            For example, $& represents the portion of the source that matched
:::            the entire search pattern, $1 represents the first captured
:::            submatch, $2 the second captured submatch, etc. A $ literal
:::            can be escaped as $$.
:::
:::            An empty replacement string must be represented as "".
:::
:::            Replace substitution pattern syntax is fully documented at
:::            http://msdn.microsoft.com/en-US/library/efy6s3e6(v=vs.80).aspx
:::
:::  Options - An optional string of characters used to alter the behavior
:::            of REPLVAR. The option characters are case insensitive, and may
:::            appear in any order.
:::
:::            I - Makes the search case-insensitive.
:::
:::            L - The Search is treated as a string literal instead of a
:::                regular expression. Also, all $ found in Replace are
:::                treated as $ literals.
:::
:::            B - The Search must match the beginning of a line.
:::                Mostly used with literal searches.
:::
:::            E - The Search must match the end of a line.
:::                Mostly used with literal searches.
:::
:::            A - Only return a value if the input was altered. If not altered,
:::                then ERRORLEVEL is set to 2.
:::
:::            X - Enables extended substitution pattern syntax with support
:::                for the following escape sequences within the Replace string:
:::
:::                \\     -  Backslash
:::                \b     -  Backspace
:::                \f     -  Formfeed
:::                \n     -  Newline
:::                \q     -  Quote
:::                \r     -  Carriage Return
:::                \t     -  Horizontal Tab
:::                \v     -  Vertical Tab
:::                \xnn   -  Extended ASCII byte code expressed as 2 hex digits
:::                \unnnn -  Unicode character expressed as 4 hex digits
:::
:::                Also enables the \q escape sequence for the Search string.
:::                The other escape sequences are already standard for a regular
:::                expression Search string.
:::
:::                Also modifies the behavior of \xnn in the Search string to work
:::                properly with extended ASCII byte codes.
:::
:::                Extended escape sequences are supported even when the L option
:::                is used. Both Search and Replace support all of the extended
:::                escape sequences if both the X and L options are combined.
:::
::: REPLVAR.BAT was written by Dave Benham, with assistance from DosTips users
::: Aacini and Liviu regarding complications due to JScript's use of unicode vs.
::: cmd.exe's use of extended ASCII. REPLVAR.BAT also uses a modified form of the
::: safe return technique developed by DosTips user jeb. Updates to REPLVAR.BAT
::: will be posted to the original posting site:
::: http://www.dostips.com/forum/viewtopic.php?f=3&t=5492
:::

::************ Batch portion ***********

@echo off
if .%4 equ . (
  if "%~1" equ "/?" (
    for /f "delims=: tokens=1*" %%A in ('findstr /n "^:::" "%~f0"') do echo(%%B
    exit /b 0
  ) else if /i "%~1" equ "/?REGEX" (
    start "" "http://msdn.microsoft.com/en-us/library/ae5bf541(v=vs.80).aspx"
    exit /b 0
  ) else if /i "%~1" equ "/?REPLACE" (
    start "" "http://msdn.microsoft.com/en-US/library/efy6s3e6(v=vs.80).aspx"
    exit /b 0
  ) else if /i "%~1" equ "/V" (
    for /f "delims=: tokens=1*" %%A in ('findstr /nblc:"::REPLVAR.BAT version" "%~f0"') do echo(%%B
    exit /b 0
  ) else (
    call :err "Insufficient arguments"
    exit /b 1
  )
)
echo(%~5|findstr /i "[^ILEBXA]" >nul && (
  call :err "Invalid option(s)"
  exit /b 1
)

setlocal
set "$replVar.notDelayed=!!"
setlocal enableDelayedExpansion
for /f "delims==" %%V in ('set ~ 2^>nul') do set "%%V="
set "~=!%~1!"
setlocal disableDelayedExpansion
set "rtn="
for /f delims^=^ eol^= %%A in (
  'set ~ 2^>nul^|cscript //E:JScript //nologo "%~f0" "%$replVar.notDelayed%" %3 %4 %5'
) do set "rtn=%%A"
if defined rtn (
  set "err=%rtn:~0,1%"
  set "rtn=%rtn:~1%"
) else set "err=2"
if %err% equ 1 (echo ERROR: Result not compatible with active code page) >&2
if %err% equ 2 (echo Input not altered) >&2
setlocal enableDelayedExpansion
set ^"LF=^

^"
for /f %%A in ('copy /z "%~dpf0" nul') do set "CR=%%A"
set "replace=%% """ !CR!!CR!"
for /f "tokens=1,2,3" %%J in ("!replace!") do for %%M in ("!LF!") do (
  endlocal
  endlocal
  endlocal
  endlocal
  set "%~2=%rtn%" !
  exit /b %err%
)

:err
>&2 echo ERROR: %~1. Use replVar /? to get help.
exit /b


************* JScript portion **********/
var env=WScript.CreateObject("WScript.Shell").Environment("Process");
var args=WScript.Arguments;
var search=args.Item(1);
var replace=args.Item(2);
var options="gm";
if (args.length>3) options+=args.Item(3).toLowerCase();
var alterations=(options.indexOf("a")>=0);
if (alterations) options=options.replace(/a/g,"");
if (options.indexOf("x")>=0) {
  options=options.replace(/x/g,"");
  replace=replace.replace(/\\\\/g,"\\B");
  replace=replace.replace(/\\q/g,"\"");
  replace=replace.replace(/\\x80/g,"\\u20AC");
  replace=replace.replace(/\\x82/g,"\\u201A");
  replace=replace.replace(/\\x83/g,"\\u0192");
  replace=replace.replace(/\\x84/g,"\\u201E");
  replace=replace.replace(/\\x85/g,"\\u2026");
  replace=replace.replace(/\\x86/g,"\\u2020");
  replace=replace.replace(/\\x87/g,"\\u2021");
  replace=replace.replace(/\\x88/g,"\\u02C6");
  replace=replace.replace(/\\x89/g,"\\u2030");
  replace=replace.replace(/\\x8[aA]/g,"\\u0160");
  replace=replace.replace(/\\x8[bB]/g,"\\u2039");
  replace=replace.replace(/\\x8[cC]/g,"\\u0152");
  replace=replace.replace(/\\x8[eE]/g,"\\u017D");
  replace=replace.replace(/\\x91/g,"\\u2018");
  replace=replace.replace(/\\x92/g,"\\u2019");
  replace=replace.replace(/\\x93/g,"\\u201C");
  replace=replace.replace(/\\x94/g,"\\u201D");
  replace=replace.replace(/\\x95/g,"\\u2022");
  replace=replace.replace(/\\x96/g,"\\u2013");
  replace=replace.replace(/\\x97/g,"\\u2014");
  replace=replace.replace(/\\x98/g,"\\u02DC");
  replace=replace.replace(/\\x99/g,"\\u2122");
  replace=replace.replace(/\\x9[aA]/g,"\\u0161");
  replace=replace.replace(/\\x9[bB]/g,"\\u203A");
  replace=replace.replace(/\\x9[cC]/g,"\\u0153");
  replace=replace.replace(/\\x9[dD]/g,"\\u009D");
  replace=replace.replace(/\\x9[eE]/g,"\\u017E");
  replace=replace.replace(/\\x9[fF]/g,"\\u0178");
  replace=replace.replace(/\\b/g,"\b");
  replace=replace.replace(/\\f/g,"\f");
  replace=replace.replace(/\\n/g,"\n");
  replace=replace.replace(/\\r/g,"\r");
  replace=replace.replace(/\\t/g,"\t");
  replace=replace.replace(/\\v/g,"\v");
  replace=replace.replace(/\\x[0-9a-fA-F]{2}|\\u[0-9a-fA-F]{4}/g,
    function($0,$1,$2){
      return String.fromCharCode(parseInt("0x"+$0.substring(2)));
    }
  );
  replace=replace.replace(/\\B/g,"\\");
  search=search.replace(/\\\\/g,"\\B");
  search=search.replace(/\\q/g,"\"");
  search=search.replace(/\\x80/g,"\\u20AC");
  search=search.replace(/\\x82/g,"\\u201A");
  search=search.replace(/\\x83/g,"\\u0192");
  search=search.replace(/\\x84/g,"\\u201E");
  search=search.replace(/\\x85/g,"\\u2026");
  search=search.replace(/\\x86/g,"\\u2020");
  search=search.replace(/\\x87/g,"\\u2021");
  search=search.replace(/\\x88/g,"\\u02C6");
  search=search.replace(/\\x89/g,"\\u2030");
  search=search.replace(/\\x8[aA]/g,"\\u0160");
  search=search.replace(/\\x8[bB]/g,"\\u2039");
  search=search.replace(/\\x8[cC]/g,"\\u0152");
  search=search.replace(/\\x8[eE]/g,"\\u017D");
  search=search.replace(/\\x91/g,"\\u2018");
  search=search.replace(/\\x92/g,"\\u2019");
  search=search.replace(/\\x93/g,"\\u201C");
  search=search.replace(/\\x94/g,"\\u201D");
  search=search.replace(/\\x95/g,"\\u2022");
  search=search.replace(/\\x96/g,"\\u2013");
  search=search.replace(/\\x97/g,"\\u2014");
  search=search.replace(/\\x98/g,"\\u02DC");
  search=search.replace(/\\x99/g,"\\u2122");
  search=search.replace(/\\x9[aA]/g,"\\u0161");
  search=search.replace(/\\x9[bB]/g,"\\u203A");
  search=search.replace(/\\x9[cC]/g,"\\u0153");
  search=search.replace(/\\x9[dD]/g,"\\u009D");
  search=search.replace(/\\x9[eE]/g,"\\u017E");
  search=search.replace(/\\x9[fF]/g,"\\u0178");
  if (options.indexOf("l")>=0) {
    search=search.replace(/\\b/g,"\b");
    search=search.replace(/\\f/g,"\f");
    search=search.replace(/\\n/g,"\n");
    search=search.replace(/\\r/g,"\r");
    search=search.replace(/\\t/g,"\t");
    search=search.replace(/\\v/g,"\v");
    search=search.replace(/\\x[0-9a-fA-F]{2}|\\u[0-9a-fA-F]{4}/g,
      function($0,$1,$2){
        return String.fromCharCode(parseInt("0x"+$0.substring(2)));
      }
    );
    search=search.replace(/\\B/g,"\\");
  } else search=search.replace(/\\B/g,"\\\\");
}
if (options.indexOf("l")>=0) {
  options=options.replace(/l/g,"");
  search=search.replace(/([.^$*+?()[{\\|])/g,"\\$1");
  replace=replace.replace(/\$/g,"$$$$");
}
if (options.indexOf("b")>=0) {
  options=options.replace(/b/g,"");
  search="^"+search
}
if (options.indexOf("e")>=0) {
  options=options.replace(/e/g,"");
  search=search+"$"
}
var search=new RegExp(search,options);

var str1, str2, delay;
delay=args.Item(0);

if (!WScript.StdIn.AtEndOfStream) str1=WScript.StdIn.ReadAll(); else str1="";
str1=str1.substr(2,str1.length-4);
str2=str1.replace(search,replace);
if (!alterations || str1!=str2) {
  str2=str2.replace(/%/g,"%J");
  str2=str2.replace(/\"/g,"%~K");
  str2=str2.replace(/\r/g,"%L");
  str2=str2.replace(/\n/g,"%~M");
  if (delay=="") {
    str2=str2.replace(/\^/g,"^^");
    str2=str2.replace(/!/g,"^!");
  }
  try {
    WScript.Stdout.Write("0"+str2);
  } catch (e) {
    WScript.Stdout.Write("1");
  }
}


Dave Benham
Last edited by dbenham on 24 Apr 2014 13:02, edited 7 times in total.
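The last step of the JScript portion encodes the result so that the batch side can restore it safely through FOR variables. A sketch of that encoding in plain JavaScript (runnable under Node.js) follows. The name encodeForReturn is hypothetical, and the leading "0" stands for the status character that the batch code strips off before use.

```javascript
// Sketch: encode a result string for the batch "safe return" step.
// % " CR and LF become FOR-variable references that the batch side expands
// back into the original characters.
function encodeForReturn(value, delayedExpansionEnabled) {
  var s = value.replace(/%/g, "%J")     // must run first, the later steps add % signs
               .replace(/"/g, "%~K")
               .replace(/\r/g, "%L")
               .replace(/\n/g, "%~M");
  if (delayedExpansionEnabled) {
    // ^ and ! need escaping only when the caller has delayed expansion on.
    s = s.replace(/\^/g, "^^").replace(/!/g, "^!");
  }
  return "0" + s;                        // "0" = success status prefix
}

console.log(encodeForReturn('50% "off"!', true));  // 050%J %~Koff%~K^!
console.log(encodeForReturn("a\r\nb", false));     // 0a%L%~Mb
```

This also shows why the documented maximum output length shrinks when the result contains many of these characters: each one grows to 2 or 3 bytes in transit.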

Liviu
Expert
Posts: 470
Joined: 13 Jan 2012 21:24

Re: REPLVAR.BAT - regex search and replace for variables

#2 Post by Liviu » 05 Apr 2014 22:51

dbenham wrote:My REPL.BAT utility was primarily built to work with files (via pipes or redirection), but it also supports input via an environment variable. [...] I decided to build a dedicated REPLVAR.BAT hybrid JScript/batch utility [...] It has most of the same options as REPL.BAT, except input is always from a variable

Taking the input from a variable vs. file or piped input raises the bar considerably. The critical difference is that environment variables are natively Unicode, while file or piped input is always "narrowed down" to the current codepage first.

The second run of the following doesn't seem to work as expected.

Code: Select all

C:\tmp>set "input=<abcde>" & set input
input=<abcde>

C:\tmp>replVar input output "c" " ? " LX & set output
output=<ab ? de>

C:\tmp>set "input=‹αß©∂€›" & set input
input=‹αß©∂€›

C:\tmp>replVar input output "©" " ? " LX & set output
C:\tmp\replVar.bat(298, 3) Microsoft JScript runtime error: Invalid procedure call or argument

Environment variable output not defined

Liviu

dbenham

Re: REPLVAR.BAT - regex search and replace for variables

#3 Post by dbenham » 05 Apr 2014 23:45

Thanks Liviu for finding that.

I've done some debugging and verified that the JScript works perfectly until the very last statement that writes the final result to stdout.

I tested by writing the value of the intermediate strings to stderr. The really weird thing is, I can write the final result to stderr, but when I try to write the same final string to stdout it throws the exception.

Could WScript.Stdout.WriteLine() be going through some Unicode to code page gyrations that StdErr does not?

Perhaps the output string contains a Unicode value that does not translate to the active code page, and attempting to write to stdout throws the exception. But it seems really odd that stderr does not do the same.

My guess is that everything works as long as all output characters map properly to the active code page.

Any ideas?


Dave Benham

Liviu

Re: REPLVAR.BAT - regex search and replace for variables

#4 Post by Liviu » 06 Apr 2014 00:41

dbenham wrote:The really weird thing is, I can write the final result to stderr, but when I try to write the same final string to stdout it throws the exception.
That's odd, indeed. However, something still breaks underneath, since if I just replace "stdout" with "stderr" on the last line then I can see the output, but it's still followed by "Environment variable output not defined" right after. Don't have any constructive idea offhand, and don't know that it's actually solvable, since the for/f loop on the receiving end in the batch file flattens everything down to the active codepage.

dbenham wrote:My guess is that everything works as long as all output characters map properly to the active code page.
Possibly so, but the problem remains to detect what works in advance, or at least catch what doesn't afterwards. Because, in the failure cases, there can be wrong and active-codepage-dependent outcomes, other than hard errors.

Code: Select all

C:\tmp>set "input=<ab©de>" & set input
input=<ab©de>

C:\tmp>chcp 437
Active code page: 437

C:\tmp>replVar input output "c" " ? " LX & set output
output=<ab⌐de>

C:\tmp>chcp 850
Active code page: 850

C:\tmp>replVar input output "c" " ? " LX & set output
output=<ab®de>

Liviu

carlsomo
Posts: 91
Joined: 02 Oct 2012 17:21

Re: REPLVAR.BAT - regex search and replace for variables

#5 Post by carlsomo » 06 Apr 2014 01:01

problems with ...'s

C:\>set input=111111111111??

C:\>replvar input output "..." "ttt"

C:\>set output
output=tttttttttttt??

C:\>replvar input output "222" "ttt"

C:\>set output
output=111111111111??

dbenham

Re: REPLVAR.BAT - regex search and replace for variables

#6 Post by dbenham » 06 Apr 2014 07:48

@carlsomo - both your examples are giving the expected result: "..." is a regular expression that matches any 3 characters other than <CR> or <LF>
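The same two searches can be reproduced directly with a JavaScript regular expression. This is a small sketch in plain JavaScript (runnable under Node.js); the third line is an added illustration of how escaping the dots makes them literal.

```javascript
var input = "111111111111??";

// "..." matches any 3 characters, so the twelve 1s are consumed in four
// groups of three and the two trailing characters are left alone.
console.log(input.replace(new RegExp("...", "gm"), "ttt")); // tttttttttttt??

// "222" does not occur in the input, so nothing changes.
console.log(input.replace(new RegExp("222", "gm"), "ttt")); // 111111111111??

// Escaped dots match only literal dots.
console.log("a...b".replace(new RegExp("\\.\\.\\.", "gm"), "ttt")); // atttb
```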

@Liviu - I've done some more experiments, and the problem is with FOR /F - it cannot handle Unicode values that do not map to the active code page. If it receives them, then it must be reporting an error back to JScript. I had forgotten that this is a known issue with FOR /F. I had seen reports that FOR /F cannot parse the output of DIR properly if a file name contains Unicode that does not map to the active code page.

I believe the command shell (and/or FOR /F) will automatically map some characters that are not in the active code page to different, but related characters that are within the active code page. I don't see how this can be detected, other than to maintain a list of all possible mappings (or somehow access the mappings that CMD is using). This is not something I want to attempt.

The way the code stands now, it is up to the user to make sure the output is compatible with the active code page.

I've modified the script to detect if there was an error writing the output, in which case the output variable is not set, and the ERRORLEVEL is set to 1. I also made a few changes to ensure that all exit points properly set the ERRORLEVEL to 0 or 1 as needed. I've posted v1.1 to the first post on this thread.

There is still one aspect of REPLVAR.BAT that I am nervous about. I am using the mappings that Aacini provided for REPL.BAT to translate \xNN codes into the appropriate \uNNNN unicode such that it yields the desired extended ASCII character. But I've never been comfortable with how it works. I was worried that it might be impacted by the active code page, but my tests on REPL.BAT did not show any problems with different code pages. Even if REPL.BAT works, I'm wondering if REPLVAR.BAT might have some problems with different code pages.
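The \xNN to \uNNNN translation described here can be sketched as a lookup table in plain JavaScript (runnable under Node.js). The table values are copied from the script; they are the Windows-1252 assignments for bytes 0x80 to 0x9F, so the result is tied to that code page. The names cp1252High, byteToChar and decodeHexEscapes are hypothetical, and the sketch ignores the script's handling of escaped backslashes.

```javascript
// Bytes 0x81, 0x8D, 0x8F and 0x90 have no entry in the script's table.
var cp1252High = {
  0x80: 0x20AC, 0x82: 0x201A, 0x83: 0x0192, 0x84: 0x201E, 0x85: 0x2026,
  0x86: 0x2020, 0x87: 0x2021, 0x88: 0x02C6, 0x89: 0x2030, 0x8A: 0x0160,
  0x8B: 0x2039, 0x8C: 0x0152, 0x8E: 0x017D, 0x91: 0x2018, 0x92: 0x2019,
  0x93: 0x201C, 0x94: 0x201D, 0x95: 0x2022, 0x96: 0x2013, 0x97: 0x2014,
  0x98: 0x02DC, 0x99: 0x2122, 0x9A: 0x0161, 0x9B: 0x203A, 0x9C: 0x0153,
  0x9D: 0x009D, 0x9E: 0x017E, 0x9F: 0x0178
};

// Map one byte code to the Unicode character that stands for it.
function byteToChar(byteCode) {
  var mapped = cp1252High[byteCode];
  return String.fromCharCode(mapped === undefined ? byteCode : mapped);
}

// Replace every \xNN escape in the text with the mapped character.
function decodeHexEscapes(text) {
  return text.replace(/\\x([0-9a-fA-F]{2})/g, function (match, hex) {
    return byteToChar(parseInt(hex, 16));
  });
}

console.log(decodeHexEscapes("price: 5\\x80")); // price: 5€
```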


Dave Benham

Liviu

Re: REPLVAR.BAT - regex search and replace for variables

#7 Post by Liviu » 06 Apr 2014 17:15

Liviu wrote:
dbenham wrote:The really weird thing is, I can write the final result to stderr, but when I try to write the same final string to stdout it throws the exception.
That's odd, indeed.
Actually, it's not all that odd. What happens is that stderr is not captured by for/f and goes directly to the console as full Unicode, without any codepage translation.

dbenham wrote:I believe the command shell (and/or FOR /F) will automatically map some characters that are not in the active code page to different, but related characters that are within the active code page. I don't see how this can be detected
This can be detected with something like the following...

Code: Select all

@echo off
for /f "delims=" %%x in ('echo "%~1"') do (
  if "%%~x"=="%~1" (exit /b 0) else exit /b 1
)
...saved as, say, 'isInCP.cmd', then...

Code: Select all

C:\tmp>chcp 437
Active code page: 437

C:\tmp>(call isincp ".αß....") && (echo + OK) || (echo - err)
+ OK

C:\tmp>chcp 1252
Active code page: 1252

C:\tmp>(call isincp ".αß....") && (echo + OK) || (echo - err)
- err

C:\tmp>chcp 437
Active code page: 437

C:\tmp>(call isincp "‹.ß©.€›") && (echo + OK) || (echo - err)
- err

C:\tmp>chcp 1252
Active code page: 1252

C:\tmp>(call isincp "‹.ß©.€›") && (echo + OK) || (echo - err)
+ OK
Of course, the code is oversimplified and will fail on some of the usual poison characters, but it's just to make the point that it's technically possible.

dbenham wrote:The way the code stands now, it is up to the user to make sure the output is compatible with the active code page.
I don't think even that restriction is strong enough. Consider the following, where "αß" are valid in CP 437, but a dummy replacement still fails to return the original string. (The fact that "ß" is returned correctly under CP 1252 is also significant, will get back to that below.)

Code: Select all

C:\tmp>chcp 437
Active code page: 437

C:\tmp>set "input=.αß...." & set input
input=.αß....

C:\tmp>(call replVar input output "x" "y" LX) && (set output) || (echo *** error)
output=.a▀....

C:\tmp>chcp 1252
Active code page: 1252

C:\tmp>(call replVar input output "x" "y" LX) && (set output) || (echo *** error)
output=.aß....

dbenham wrote:There is still one aspect of REPLVAR.BAT that I am nervous about. I am using the mappings that Aacini provided for REPL.BAT to translate \xNN codes into the appropriate \uNNNN unicode such that it yields the desired extended ASCII character. But I've never been comfortable with how it works. I was worried that it might be impacted by the active code page
That hardcoded mapping does indeed make the replacement codepage dependent. In addition to the "ß" example above, here is another one.

Code: Select all

C:\tmp>set "input=‹.ß©.€›" & set input
input=‹.ß©.€›

C:\tmp>chcp 437
Active code page: 437

C:\tmp>(call replVar input output "x" "y" LX) && (set output) || (echo *** error)
output=ï.▀⌐.Ç¢

C:\tmp>chcp 1252
Active code page: 1252

C:\tmp>(call replVar input output "x" "y" LX) && (set output) || (echo *** error)
output=‹.ß©.€›
I haven't followed it all the way through, but it looks to me as if this only works reliably when (a) the string uses only characters that are present in codepage 1252, and (b) the active codepage is 1252 - which is not too common in practice, since that's an ANSI codepage, while the cmd prompt is set by default to an OEM codepage such as 437 or 850.
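One way to read the CP 437 result above is that each character is written as its codepage 1252 byte and that byte is then interpreted as CP 437 on the receiving side. This is an interpretation of the observed output, not something confirmed in the thread. A sketch in plain JavaScript (runnable under Node.js), with hypothetical names and partial tables that cover only the characters in the example:

```javascript
// Partial tables: only the five non-ASCII characters from the example.
var toCp1252Byte = {
  "\u2039": 0x8B, "\u00DF": 0xDF, "\u00A9": 0xA9, "\u20AC": 0x80, "\u203A": 0x9B
};
var fromCp437Byte = {
  0x8B: "\u00EF", 0xDF: "\u2580", 0xA9: "\u2310", 0x80: "\u00C7", 0x9B: "\u00A2"
};

// Encode each character as a 1252 byte, then decode that byte as CP 437.
function writeAs1252ReadAs437(text) {
  var out = "";
  for (var i = 0; i < text.length; i++) {
    var ch = text.charAt(i);
    if (text.charCodeAt(i) < 0x80) { out += ch; continue; } // ASCII is the same in both
    var b = toCp1252Byte[ch];
    out += (b === undefined) ? "?" : fromCp437Byte[b];      // "?" = not in this partial table
  }
  return out;
}

console.log(writeAs1252ReadAs437("\u2039.\u00DF\u00A9.\u20AC\u203A")); // ï.▀⌐.Ç¢
```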

dbenham wrote:but my tests on REPL.BAT did not show any problems with different code pages. Even if REPL.BAT works, I'm wondering if REPLVAR.BAT might have some problems with different code pages.
Repl.bat does its own output in JScript, so it's not subject to the "flattening back to codepage" issues of a for/f loop - but any caller code that tries to capture that output in a variable would have the same problems that replVar does.

Liviu

P.S. There is something odd about repl.bat, too, but I am not familiar enough with JScript to put my finger on it. For one thing, it fails with a hard error when output is redirected and includes out-of-codepage characters.

Code: Select all

C:\tmp>chcp 437
Active code page: 437

C:\tmp>repl "©" "-!-" LXS input
‹αß-!-∂€›

C:\tmp>repl "©" "-!-" LXS input >nul
C:\tmp\repl.bat(283, 5) Microsoft JScript runtime error: Invalid procedure call or argument


C:\tmp>cmd /u/c repl "©" "-!-" LXS input ^>repltest.tmp
C:\tmp\repl.bat(283, 5) Microsoft JScript runtime error: Invalid procedure call or argument

I tried a variation on the output method in repl.bat by forcing the output stream to 'Unicode mode' explicitly.

Code: Select all

var fso = new ActiveXObject("Scripting.FileSystemObject");
var stdoutW = fso.GetStandardStream(1, true)

if (srcVar) {
  str1=env(args.Item(3));
  str2=str1.replace(search,replace);
  if (!alterations || str1!=str2) if (multi) {
    stdoutW.Write(str2);
  } else {
    stdoutW.WriteLine(str2);
  }
} else { /*... */
This allowed redirection to work, but (surprisingly) the output is _always_ Unicode, even at a 'cmd /a' prompt.

Still, combined with the new UTF-8 support in Windows 7 (http://www.dostips.com/forum/viewtopic.php?f=3&t=5358) this allowed the following to use repl.bat and capture its output to a variable.

Code: Select all

C:\tmp>chcp 437  &rem start in default OEM codepage
Active code page: 437

C:\tmp>set "input=‹αß©∂€›" & set input
input=‹αß©∂€›

C:\tmp>chcp 1252  &rem necessary in order to generate the UTF16-LE BOM
Active code page: 1252

C:\tmp>(set /p =ÿþ) <nul >repl-u16.tmp 2>nul  &rem write BOM to file

C:\tmp>chcp 437  &rem return to default OEM codepage
Active code page: 437

C:\tmp>repl "©" "-!-" LXS input >>repl-u16.tmp  &rem save UTF16-LE string

C:\tmp>type repl-u16.tmp  &rem verify file contents OK
‹αß-!-∂€›

C:\tmp>chcp 65001  &rem change to UTF-8 codepage
Active code page: 65001

C:\tmp>type repl-u16.tmp >repl-u8.tmp  &rem convert file to UTF-8

C:\tmp>for /f "delims=" %s in (repl-u8.tmp) do @(set "output=%s" & set output)
output=‹αß-!-∂€›

C:\tmp>chcp 437  &rem return to default OEM codepage
Active code page: 437

C:\tmp>set "output"  &rem verify 'output' variable OK
output=‹αß-!-∂€›
Again, the above only works in Windows 7+ (not XP), and the whole 'stdoutW' trick smells hackish, so I would not recommend it for production code, at least not without a whole lot more investigation.

dbenham

Re: REPLVAR.BAT - regex search and replace for variables

#8 Post by dbenham » 06 Apr 2014 19:08

Thanks Liviu, your experiments and explanation help.

But based on your results... major bummer.

I did some experiments, and REPLVAR is able to generate all 255 byte codes properly using \xNN in the replacement string.

But if I pass in a variable containing extended ASCII, then the value gets corrupted.

I'm thinking I have to abandon reading string values directly from the environment. Instead I think I have to always deal with string literals, either piped in, or else on the command line.


Dave Benham

penpen
Expert
Posts: 1991
Joined: 23 Jun 2013 06:15
Location: Germany

Re: REPLVAR.BAT - regex search and replace for variables

#9 Post by penpen » 07 Apr 2014 17:09

Won't it work if you avoid "WScript.Stdout.Write("0"+str2);"?
Maybe this works (I'm currently using Win95, so I can't check it myself):

Code: Select all

 try {
    WScript.Echo("0"+str2);
  } catch (e) {
    WScript.Stdout.Write("1");
  }

penpen

dbenham

Re: REPLVAR.BAT - regex search and replace for variables

#10 Post by dbenham » 07 Apr 2014 19:38

@penpen - I don't think so, but I haven't tested, as I have a new version that fixes the problems.

Rightly or wrongly, I've come to the conclusion that letting JScript read environment variables directly is always going to create problems when dealing with extended ASCII codes >127 (0x7F). This is true for my REPL.BAT utility as well. I have confirmed that use of the S and V options can lead to incorrect results with REPL.BAT.

I now use SET to write the value of the source variable and pipe the result into JScript. I also removed the V option, so the search and replace strings must be passed as strings on the command line, they cannot be passed by reference using variable names.

I've updated the code at the beginning of this thread with version 1.2.


Dave Benham

Liviu

Re: REPLVAR.BAT - regex search and replace for variables

#11 Post by Liviu » 07 Apr 2014 21:12

dbenham wrote:Rightly or wrongly, I've come to the conclusion that letting JScript read environment variables directly is always going to create problems when dealing with extended ASCII codes >127 (0x7F).
That's odd. You said that the previous version was writing the correct strings to stderr, which means that JScript was receiving the input string correctly, pointing the issue at the for/f on the return path.

dbenham wrote:This is true for my REPL.BAT utility as well. I have confirmed that use of the S and V options can lead to incorrect results with REPL.BAT.
Just curious, what's a failure case for S? I haven't tested it in any depth, but the example in the P.S. of my previous post worked correctly.

dbenham wrote:I now use SET to write the value of the source variable and pipe the result into JScript.
Piping 'set' reduces the input to the active codepage upfront, before JScript ever sees it. I assume it works from there on, but it's not possible to catch codepage-violation errors without some checking done on the batch side of the code prior to calling JScript. For example, the following passes a string that's not valid in either codepage...

Code:

C:\tmp>chcp 437 >nul

C:\tmp>set "input=‹αß©∂€›" & set input
input=‹αß©∂€›

C:\tmp>(call replVar input output "a" "-" LX) && (set output) || (echo *** error)
output=<αßc??>

C:\tmp>chcp 1252 >nul

C:\tmp>(call replVar input output "a" "-" LX) && (set output) || (echo *** error)
output=‹-ß©?€›
...but replVar does not return an error, the results are different in the two cases, and both wrong - without the caller having any indication of failure.
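
To make the codepage reduction concrete outside of batch, here is a minimal Python sketch. It is not part of replVar, and the helper name `reduce_to_codepage` is hypothetical. It round-trips the same test string through code pages 437 and 1252, substituting `?` for anything that has no mapping. Python's codecs are strict, so they do not reproduce the best-fit substitutions visible in the console output above (for example © becoming c), but for this string the characters that survive intact are the same.

```python
def reduce_to_codepage(text, codepage):
    """Round-trip text through an 8-bit code page.

    Characters with no mapping in the code page come back as '?'.
    """
    return text.encode(codepage, errors="replace").decode(codepage)


sample = "‹αß©∂€›"

# Only α and ß exist in code page 437.
in_437 = reduce_to_codepage(sample, "cp437")    # "?αß????"

# Code page 1252 has ‹ ß © € › but neither α nor ∂.
in_1252 = reduce_to_codepage(sample, "cp1252")  # "‹?ß©?€›"
```

Neither result equals the original, and nothing in the conversion itself signals that information was lost, which is the silent-failure situation described above.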

Liviu

dbenham
Expert
Posts: 2461
Joined: 12 Feb 2011 21:02
Location: United States (east coast)

Re: REPLVAR.BAT - regex search and replace for variables

#12 Post by dbenham » 07 Apr 2014 22:51

Liviu wrote:
dbenham wrote:Rightly or wrongly, I've come to the conclusion that letting JScript read environment variables directly is always going to create problems when dealing with extended ASCII codes >127 (0x7F).
That's odd. You said that the previous version was writing the correct strings to stderr, which means that JScript was receiving the input string correctly, pointing the issue at the for/f on the return path.
Yes, but in my next post I said that further experiments showed the issue was with FOR /F, not any difference between stderr and stdout. The stderr was not going through FOR /F, so it appeared to be working.

I don't see any choice other than FOR /F. I could write the output to a file, but then I need to somehow read the content into a variable. SET /P has major length limitations, so I am back to using FOR /F.

Liviu wrote:
dbenham wrote:This is true for my REPL.BAT utility as well. I have confirmed that use of the S and V options can lead to incorrect results with REPL.BAT.
Just curious, what's a failure case for S? I haven't tested it in any depth, but the example in the P.S. of my previous post worked correctly.
I simply put some extended ASCII in a variable, and then did a REPL with S, redirecting output to a file. I then read the content back into a variable using SET /P, and got a different value than my original.

Liviu wrote:
dbenham wrote:I now use SET to write the value of the source variable and pipe the result into JScript.
Piping 'set' reduces the input to the active codepage upfront, before JScript ever sees it. I assume it works from there on, but it's not possible to catch codepage-violation errors without some checking done on the batch side of the code prior to calling JScript. For example, the following passes a string that's not valid in either codepage...
Yes, that is exactly my intent. REPLVAR only works properly if both the before and after values are within the active code page. I am at a loss as to how to improve it any further, given that batch inherently does not work well with environment variables containing content that does not map to the active code page.


Dave Benham

carlos
Expert
Posts: 503
Joined: 20 Aug 2010 13:57
Location: Chile

Re: REPLVAR.BAT - regex search and replace for variables

#13 Post by carlos » 07 Apr 2014 23:50

Dave, what about not using for /f and instead creating a temp file like this:

temp.cmd

Code:

set "var=content"


and call it from replvar.bat

Liviu
Expert
Posts: 470
Joined: 13 Jan 2012 21:24

Re: REPLVAR.BAT - regex search and replace for variables

#14 Post by Liviu » 08 Apr 2014 00:08

dbenham wrote:
Liviu wrote:
dbenham wrote:Rightly or wrongly, I've come to the conclusion that letting JScript read environment variables directly is always going to create problems when dealing with extended ASCII codes >127 (0x7F).
That's odd. You said that the previous version was writing the correct strings to stderr, which means that JScript was receiving the input string correctly, pointing the issue at the for/f on the return path.
Yes, but in my next post I said that further experiments showed the issue was with FOR /F, not any difference between stderr and stdout. The stderr was not going through FOR /F, so it appeared to be working.
OK, we agree on that point. But then, piping 'set' instead of reading the variable in JScript - like v1.2 does - does not really help with anything. On the contrary, it "truncates" the string to the active codepage before JScript even sees it, thus denying it any chance to work correctly with out-of-codepage characters, or at least attempt to catch and return an error in those cases.

dbenham wrote:I don't see any choice other than FOR /F. I could write the output to a file, but then I need to somehow read the content into a variable. SET /P has major length limitations, so I am back to using FOR /F.
IMHO this is the crux of the issue. Leaving aside JScript and regex, the basic problem here is that a child process (cscript) has a string value that the parent (batch) needs to capture in a variable. As far as I know, there is no portable and Unicode-safe solution to this problem. It can be partially solved with temp files in the active codepage (works on all Windows versions, but corrupts out-of-codepage characters), or with a UTF-16LE temp file (Unicode-safe, using the technique in my P.S. above plus patches to deal with the ASCII poison characters, but that only works in Win7+, not XP). I don't know that it can be done both Unicode-safe and XP-compatible, though I'd be more than happy to be proven wrong ;-)

dbenham wrote:
Liviu wrote:Just curious, what's a failure case for S? I haven't tested it in any depth, but the example in the P.S. of my previous post worked correctly.
I simply put some extended ASCII in a variable, and then did a REPL with S, redirecting output to a file. I then read the content back into a variable using SET /P, and got a different value than my original.
Redirecting to a file saves 8-bit text in the active codepage, unless you run 'cmd /u' or 'cscript //u', which could well explain the discrepancy. My test run in the previous post seemed to be working correctly. I can retest if you give me a specific example.

dbenham wrote:
Liviu wrote:For example, the following passes a string that's not valid in either codepage...
Yes, that is exactly my intent. REPLVAR only works properly if both the before and after values are within the active code page.
To be clear, I don't have any issue at all with code relying on known, documented assumptions. And I wouldn't raise a question if replVar were restricted to plain ASCII; after all, having regex work even on just the ASCII subset of characters is a nice, useful feature in itself. Still, the "within the active code page" part is too vague from a user's standpoint. Some restrictions, such as "no control characters or embedded quotes", are easy to satisfy when, for example, the string is a filename and the user knows it can't contain either. But the casual user won't know what "within the active code page" even means, and even if they do, they have no easy way to check whether an arbitrary string (or filename) is compliant. That leaves a user who calls replVar with a result and no indication of whether it's correct or just a silent failure.
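
One way a caller could check the "within the active code page" precondition up front is a strict encode attempt per character. The sketch below is in Python rather than batch, and `out_of_codepage` is a hypothetical helper, not something replVar provides. The code page name is supplied by the caller (for example `cp437` when chcp reports 437); the sketch does not detect the active code page itself.

```python
def out_of_codepage(text, codepage):
    """Return the characters of text that have no mapping in codepage."""
    missing = []
    for ch in text:
        try:
            # Strict encoding raises instead of silently substituting.
            ch.encode(codepage)
        except UnicodeEncodeError:
            missing.append(ch)
    return missing


sample = "‹αß©∂€›"
bad_437 = out_of_codepage(sample, "cp437")    # ['‹', '©', '∂', '€', '›']
bad_1252 = out_of_codepage(sample, "cp1252")  # ['α', '∂']
```

An empty list means the string survives the conversion unchanged; a non-empty list is the error indication that the caller currently never receives.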

Liviu

P.S.
carlos wrote:Dave, what about not using for /f and instead creating a temp file like this:

Code:

set "var=content"
and call it from replvar.bat
For the temp file to be call'able as a batch, it needs to be an 8-bit text file. Down-converting Unicode text to 8-bit encoding is lossy, and such a temp batch file would be subject to all the same limitations as Dave's code itself.

carlos
Expert
Posts: 503
Joined: 20 Aug 2010 13:57
Location: Chile

Re: REPLVAR.BAT - regex search and replace for variables

#15 Post by carlos » 08 Apr 2014 01:00

Liviu: but does cmd support holding Unicode text in variables, in both modes (script/interactive)?
I did an investigation here:
http://www.dostips.com/forum/viewtopic.php?f=3&t=5500
