REPLVAR.BAT - regex search and replace for variables

Discussion forum for all Windows batch related topics.

dbenham
Expert
Posts: 2461
Joined: 12 Feb 2011 21:02
Location: United States (east coast)

Re: REPLVAR.BAT - regex search and replace for variables

#16 Post by dbenham » 08 Apr 2014 06:12

I posted version 1.3 at the top of the thread. The only difference is in the built-in documentation. I attempted to better explain the limits on the source value: it is treated as extended ASCII. Unicode values that do not map to the active code page will be silently transformed into a different value that does map to the active code page.


Dave Benham

#17 Post by dbenham » 08 Apr 2014 08:19

Liviu wrote:Just curious, what's a failure case for S? I haven't tested it in any depth, but the example in the P.S. of my previous post worked correctly.
dbenham wrote:I simply put some extended ASCII in a variable, and then did a REPL with S, redirecting output to a file. I then read the content back into a variable using SET /P, and got a different value than my original.
Liviu wrote:Redirecting to a file saves 8-bit text in the active codepage, unless you run 'cmd /u' or 'cscript //u', which could well explain the discrepancy. My test run in the previous post seemed to be working correctly. I can retest if you give me a specific example.

Here is a simple test case. I use my CHARLIB.BAT script to set a variable to extended ASCII 0xC8:

Code:

D:\test>chcp 437
Active code page: 437

D:\test>charlib chr 0xC8 input

D:\test>echo %input%
È

D:\test>repl a a s input
È

D:\test>echo %input%|repl a a
E
It looks like the direct read with the S option is working, and the piped method is failing. But all is not as it seems. The glyphs are being written within JScript, and it is using a different code page to write to the screen. Your eyes cannot be trusted. :wink:

My main goal for REPL.BAT was to enable editing of files using batch. If I redirect the output of REPL.BAT to a file, and then examine the contents, we see that the reverse is actually true. The direct read of the variable is corrupting the content :!: I am using my HEXDUMP.BAT to examine the file contents.

Code:

D:\test>echo %input%>f1.txt

D:\test>hexdump f1.txt
C8 0D 0A

D:\test>repl a a s input>f2.txt

D:\test>hexdump f2.txt
2B 0D 0A

D:\test>echo %input%|repl a a>f3.txt

D:\test>hexdump f3.txt
C8 0D 0A


Dave Benham

Liviu
Expert
Posts: 470
Joined: 13 Jan 2012 21:24

#18 Post by Liviu » 08 Apr 2014 12:12

dbenham wrote:Here is a simple test case. [...] set a variable to extended ASCII 0xC8: [...] It looks like the direct read with the S option is working, and the piped method is failing. But all is not as it seems.
Why, it is exactly as it seems. If, instead of piping to repl, you do just an "echo %input% | more" it will still return a plain "E" under codepage 437. The "corruption" is caused by the pipe, not by whatever is on the right hand side of it.

dbenham wrote:The glyphs are being written within JScript, and it is using a different code page to write to the screen.
No, the console output of cscript is fully Unicode and codepage-independent (also see the old, but I believe still applicable note, at http://blogs.msdn.com/b/ericlippert/archive/2004/02/11/71472.aspx). Conversion to the active codepage only occurs when the output is piped or redirected.

dbenham wrote:My main goal for REPL.BAT was to enable editing of files using batch. If I redirect the output of REPL.BAT to a file, and then examine the contents, we see that the reverse is actually true. The direct read of the variable is corrupting the content :!:
I don't dispute the symptom, but I disagree with the diagnostic ;-) The corruption occurs due to the file redirection, and has nothing to do with reading the variable directly.

Just to prove that point beyond doubt, modify the "cscript" call in repl as follows.

Code:

::: cscript //E:JScript //nologo "%~f0" %*
cscript //U //E:JScript //nologo "%~f0" %*
Then copy the code below to, say, replVar7.cmd (with "7" as a reminder that it doesn't work under XP).

Code:

@echo off & setlocal disableDelayedExpansion

:: save original codepage ('.' for some localized windows e.g. german)
for /f "tokens=2 delims=:." %%a in ('chcp') do @set /a "cp=%%~a"

:: run repl, save output to UTF16-LE file with BOM
chcp 1252 >nul
(set /p =ÿþ) <nul >repl-u16.tmp 2>nul
call repl %3 %4 LXS %1 >>repl-u16.tmp

:: convert file to UTF-8 and read variable from it *** does NOT work in XP ***
chcp 65001 >nul
type repl-u16.tmp >repl-u8.tmp
for /f "delims=" %%s in (repl-u8.tmp) do set "output=%%s"

:: restore original codepage
chcp %cp% >nul

endlocal & set "%2=%output%" & goto :eof
Now run the following at a Win7 cmd prompt.

Code:

C:\tmp>chcp 437 >nul

C:\tmp>set "input=‹αß©∂€›" & set input
input=‹αß©∂€›

C:\tmp>(set output=) && (call replVar7 input output "x" "y") && (set output) || (echo *** error)
output=‹αß©∂€›

C:\tmp>(set output=) && (call replVar7 input output "©" "-!-") && (set output) || (echo *** error)
output=‹αß-!-∂€›

C:\tmp>chcp 1252 >nul

C:\tmp>(set output=) && (call replVar7 input output "x" "y") && (set output) || (echo *** error)
output=‹αß©∂€›

C:\tmp>(set output=) && (call replVar7 input output "©" "-!-") && (set output) || (echo *** error)
output=‹αß-!-∂€›

C:\tmp>(set output=) && (call replVar7 input output "α" "-!-") && (set output) || (echo *** error)
output=‹-!-ß©∂€›

C:\tmp>(set output=) && (call replVar7 input output "a" "-!-") && (set output) || (echo *** error)
output=‹αß©∂€›

C:\tmp>set "input=È" & set input
input=È

C:\tmp>(set output=) && (call replVar7 input output "x" "y") && (set output) || (echo *** error)
output=È

C:\tmp>(set output=) && (call replVar7 input output "È" "y") && (set output) || (echo *** error)
output=y

C:\tmp>(set output=) && (call replVar7 input output "È" "ÈÈÈ") && (set output) || (echo *** error)
output=ÈÈÈ
All output is correct, and there is no corruption. However, the only change made to repl was in its output method - so the corruption that happened before was not about getting the variable right, but rather outputting the result correctly. (Note that I am not saying that there may not be other codepage issues with repl/S - but just that your test case is not an example of such.)

The way I see it, the issue here is not about repl in particular. You could replace "repl a a s input" with "cmd /v/c echo !input!" and have pretty much the same problem - it's a child process that outputs a string to the console correctly, but the parent has no (portable, Unicode-safe) way to capture that string into a variable of its own.
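
To make the codepage limitation concrete, here is a minimal Python sketch; it is not part of REPL.BAT or replVar7.cmd. It assumes Python's `cp437`, `cp850` and `cp1252` codecs as stand-ins for the Windows codepages of the same numbers, and uses `errors="replace"` only as a rough stand-in for whatever substitution Windows applies to redirected output (the real substitution tables differ). The helper name `survives` is hypothetical.

```python
SAMPLE = "‹αß©∂€›"  # the test string used in this thread


def survives(text, codepage):
    """Return True if text round-trips through an 8-bit codepage unchanged."""
    # '?' replacement is an assumption standing in for Windows' own substitution.
    raw = text.encode(codepage, errors="replace")
    return raw.decode(codepage) == text


# No single 8-bit codepage tried here can hold the whole string...
results = {cp: survives(SAMPLE, cp) for cp in ("cp437", "cp850", "cp1252")}

# ...while the Unicode encodings round-trip it without loss.
utf16_ok = SAMPLE.encode("utf-16-le").decode("utf-16-le") == SAMPLE
utf8_ok = SAMPLE.encode("utf-8").decode("utf-8") == SAMPLE

print(results, utf16_ok, utf8_ok)
```

Under these assumptions, any capture path that passes through an 8-bit codepage loses part of this string, which is consistent with the replVar7.cmd approach of going through UTF-16 and UTF-8 files instead.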

Liviu

carlos
Expert
Posts: 503
Joined: 20 Aug 2010 13:57
Location: Chile

#19 Post by carlos » 08 Apr 2014 13:49

Dave, I think I know what the solution is.
Inside the JScript, forget about doing any codepage conversion.

Save the content of the variable to a file as Unicode (cmd /u).
Then read the Unicode file inside the JScript (it is Unicode, so the codepage does not matter).
Then save your Unicode text to a file, but add the BOM (0xFF 0xFE) at the beginning of the file.

FOR /F does not support Unicode, but it does support multibyte text.
So use the TYPE command to convert that Unicode to multibyte (it will use the current codepage).
You have two options:
use for /f "options" %%a in ('type unicodewithbomfile.tmp') do set var=%%a
or type unicodewithbomfile.tmp > multibyte.cmd && call multibyte.cmd

I tested this and it works.
Saving Japanese text in a variable using codepage 932, and using cmd /u to output the content to a file, produces the same binary file as changing the codepage to 65001 and then using cmd /u to output the content.

(Edited)
The internal Unicode representation does not change, because internally it is UTF-16, and if you save it as such, no translation is done.
But to feed it back into cmd, you need to translate the UTF-16 to multibyte using a codepage (TYPE can do this: it checks whether the file has the BOM and does the conversion using the current codepage).

I hope this is useful.
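
The two file steps described above (write UTF-16 with a BOM, then let TYPE convert to the current codepage) can be modelled in Python. This is only a sketch of the behavior as described in this post, not of cmd itself: the function names are hypothetical, and Python codecs stand in for the Windows codepages.

```python
import codecs

BOM = codecs.BOM_UTF16_LE  # the two bytes 0xFF 0xFE


def save_utf16_with_bom(text):
    """Return the bytes of a UTF-16LE file that starts with a BOM."""
    return BOM + text.encode("utf-16-le")


def type_like(data, codepage):
    """Hypothetical model of TYPE as described above.

    If the data starts with the UTF-16LE BOM, decode it and re-encode it in
    the given codepage; otherwise pass the bytes through unchanged.
    """
    if data.startswith(BOM):
        text = data[len(BOM):].decode("utf-16-le")
        return text.encode(codepage, errors="replace")
    return data


line = 'set "var=αß"\r\n'
utf16_file = save_utf16_with_bom(line)

as_utf8 = type_like(utf16_file, "utf-8")    # models chcp 65001
as_cp437 = type_like(utf16_file, "cp437")   # models chcp 437

print(as_utf8, as_cp437)
```

The same UTF-16 input yields different multibyte bytes depending on the target codepage, which is why the codepage active at the time of the TYPE call matters.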

#20 Post by carlos » 09 Apr 2014 10:03

I copied this code from my other thread. It is similar to Liviu's code.
I think this would be a way to capture a Unicode variable and feed it back into cmd from batch.

Save a Unicode variable as UTF-8 multibyte:

Code:


@echo off

REM Convert unicode variable to multibyte batch script
REM using utf-8.
REM Parameter: the name of variable
REM Output: multibyte.bat

setlocal enableextensions enabledelayedexpansion

call :genchr 255
call :genchr 254

cmd /u /d /c "echo(set "%~1=!%~1!">utf16-bom.txt"
copy 255.chr /b + 254.chr /b + utf16-bom.txt /b utf16bom.txt /b /y >nul

:: save original codepage ('.' for some localized windows e.g. german)
for /f "tokens=2 delims=:." %%a in ('chcp') do @set /a "cp=%%~a"

(
chcp 65001 >nul
cmd /a /d /c "type utf16bom.txt >multibyte.bat"
chcp %cp% >nul
)

del 255.chr 254.chr
del utf16bom.txt
del utf16-bom.txt

goto :eof


:genchr
REM This code creates one single byte. Parameter: <int>0-255
REM Teamwork of carlos, penpen, aGerman, dbenham
REM Tested under Win2000, XP, Win7, Win8
set "options=/d compress=off /d reserveperdatablocksize=26"
if %~1 neq 26  (type nul >t.tmp
makecab %options% /d reserveperfoldersize=%~1 t.tmp %~1.chr >nul
type %~1.chr | (
(for /l %%N in (1 1 38) do pause)>nul&findstr "^">t.tmp)
>nul copy /y t.tmp /a %~1.chr /b
del t.tmp
) else (copy /y nul + nul /a 26.chr /a >nul)
goto :eof


And this would be used to convert it back to Unicode (UTF-16)
(not tested):

Code:


:: save original codepage ('.' for some localized windows e.g. german)
for /f "tokens=2 delims=:." %%a in ('chcp') do @set /a "cp=%%~a"

(
chcp 65001 >nul
multibyte.bat
chcp %cp% >nul
)


#21 Post by Liviu » 09 Apr 2014 10:44

carlos wrote:I copied this code from my other thread. It is similar to Liviu's code.
I think this would be a way to capture a Unicode variable and feed it back into cmd from batch.
Your code saves the variable to a UTF-8 file, but never reads it back from the UTF-8 file - which is the real difficulty here.

In the code I posted, the relevant part is this.

Code:

chcp 65001 >nul
for /f "delims=" %%s in (repl-u8.tmp) do set "output=%%s"
The above works, but only in Win7, not XP. If you put parentheses around it, as you suggested in the other thread, then it no longer works in either XP or Win7. What you posted does not provide a solution to that which works in XP.
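
As a side illustration of why the read has to happen under codepage 65001, here is a small Python sketch. It uses Python codecs as stand-ins for the Windows codepages and does not model cmd or FOR /F themselves; it only shows what the bytes of a UTF-8 file turn into when they are interpreted under a different codepage.

```python
original = "È"
utf8_bytes = original.encode("utf-8")  # two bytes: 0xC3 0x88

# The same two bytes, interpreted under three different codepages.
as_65001 = utf8_bytes.decode("utf-8")   # the intended character
as_437 = utf8_bytes.decode("cp437")     # two unrelated characters
as_1252 = utf8_bytes.decode("cp1252")   # two other unrelated characters

print(repr(as_65001), repr(as_437), repr(as_1252))
```

Only the decode that matches the file's encoding recovers the original character, so the step that reads the file back is the one that has to run with the right codepage in effect.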

Liviu

#22 Post by carlos » 09 Apr 2014 19:46

Liviu, you are talking about the method in your code that uses for /f to read in a variable encoded as UTF-8 multibyte.
But this will work without using for /f:

Code:

:: save original codepage ('.' for some localized windows e.g. german)
for /f "tokens=2 delims=:." %%a in ('chcp') do @set /a "cp=%%~a"

(
chcp 65001 >nul
multibyte.bat
chcp %cp% >nul
)


multibyte.bat is generated with the code in my post above, the one that begins with this:
REM Convert unicode variable to multibyte batch script
REM using utf-8.


Does this method work? (I will test.)

#23 Post by Liviu » 09 Apr 2014 20:59

carlos wrote:Liviu, you are talking about the method in your code that uses for /f to read in a variable encoded as UTF-8 multibyte.
But this will work without using for /f

Please re-read my previous post: "Your code saves the variable to a UTF-8 file, but never reads it back from the UTF-8 file - which is the real difficulty here".

If you meant your code just to show how to save a variable to a UTF-8 encoded file then, yes, that part works. But that's been known to work for a long time, and has been used on several occasions in other code posted here and elsewhere.

However, your code is completely missing the read-variable-back-from-UTF-8-file part. That's the essential step in my reply to Dave about the child process returning a Unicode string, and the parent process reading it into a variable of its own. And that's the part that I showed how to work out under Win7, but don't know of a way to do it in XP. You'd be more than welcome to contribute a solution to that.

Liviu

#24 Post by carlos » 10 Apr 2014 00:06

Liviu, thanks for all the comments.
I checked, and you are right.
On XP, using this corrupts the saved variable.
I suspect that this also happens on Windows 8.

load.cmd utf8 batch

Code:

(
chcp 65001 >nul
"%~f1"
chcp %cp% >nul
)


I do not understand why the UTF-8 file without BOM, when launched with the above code under the UTF-8 codepage, runs without error yet is not translated correctly from UTF-8 to UTF-16.

Image

This is a new challenge for me. I will try to solve it.

#25 Post by dbenham » 13 Apr 2014 19:01

Liviu wrote:
dbenham wrote:Here is a simple test case. [...] set a variable to extended ASCII 0xC8: [...] It looks like the direct read with the S option is working, and the piped method is failing. But all is not as it seems.
Why, it is exactly as it seems. If, instead of piping to repl, you do just an "echo %input% | more" it will still return a plain "E" under codepage 437. The "corruption" is caused by the pipe, not by whatever is on the right hand side of it.

dbenham wrote:The glyphs are being written within JScript, and it is using a different code page to write to the screen.
No, the console output of cscript is fully Unicode and codepage-independent (also see the old, but I believe still applicable note, at http://blogs.msdn.com/b/ericlippert/archive/2004/02/11/71472.aspx). Conversion to the active codepage only occurs when the output is piped or redirected.

dbenham wrote:My main goal for REPL.BAT was to enable editing of files using batch. If I redirect the output of REPL.BAT to a file, and then examine the contents, we see that the reverse is actually true. The direct read of the variable is corrupting the content :!:
I don't dispute the symptom, but I disagree with the diagnostic ;-) The corruption occurs due to the file redirection, and has nothing to do with reading the variable directly.

...

The way I see it, the issue here is not about repl in particular. You could replace "repl a a s input" with "cmd /v/c echo !input!" and have pretty much the same problem - it's a child process that outputs a string to the console correctly, but the parent has no (portable, Unicode-safe) way to capture that string into a variable of its own.
Very lucid, well thought out explanation Liviu, thanks.

I was pretty much on the same page, but in my mind I was thinking from the perspective that the REPL or REPLVAR process is not complete until the final output is written to an (extended) ASCII file, or to a variable that logically contains extended ASCII compatible with the active code page (even though the internal representation is actually unicode). I like your explanation much better.

But I still stick with my assertion that REPL.BAT does not function the way I want if I allow it to read variables directly. But that is because I want the output to be effectively extended ASCII that is compatible with the active code page. I also want the \xNN escape sequences to represent the extended ASCII characters.

If you want unicode output, then I agree, there is no problem with reading the variables directly. But then the JScript should probably not be translating the \xNN escape sequences like my code does.

P.S. - I found the following really useful (from viewtopic.php?p=33516#p33516)
Liviu wrote:A few other random notes:
- all Windows (since 9x/ME) store strings as UTF16-LE internally, and that includes the console and cmd interpreter;
- UTF16 is a Unicode encoding using 1 or 2 16-bit code units (16b integer) per code point (character, loosely speaking);
- multi-byte encodings are not limited to 2 bytes per character, for example UTF-8 is treated as an MBCS in Windows and can use up to 4 bytes per character;
- Asian CJK languages use DBCS codepages, including the Japanese 932 default OEM codepage;
- interactive cmd input is fully Unicode, for example you can paste arbitrary Unicode strings to the console regardless of the active codepage;
- piped-in or redirected-from-file input is read in the active codepage, and internally converted to Unicode (MultiByteToWideChar) - and this is why you can't execute batch files saved as UTF16-LE Unicode;
- console output is fully Unicode, for example you can echo arbitrary Unicode strings to the console regardless of the active codepage;
- piped-out or redirected-to-file cmd/a output is done in the active codepage, and the internal Unicode is converted to the respective codepage (WideCharToMultiByte).
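
The size claims in the notes above (1 or 2 UTF-16 code units per code point, up to 4 bytes in UTF-8, 2 bytes for a kanji in a DBCS codepage) can be checked with a short Python sketch. The helper names are hypothetical, and Python's `cp932` codec is assumed as a stand-in for Windows codepage 932.

```python
def utf16_units(ch):
    """Number of 16-bit code units needed for ch in UTF-16."""
    return len(ch.encode("utf-16-le")) // 2


def byte_len(ch, encoding):
    """Number of bytes needed for ch in the given encoding."""
    return len(ch.encode(encoding))


# ASCII letter, Latin-1 letter, kanji, and a code point outside the BMP.
for ch in ("A", "È", "倉", "\U0001F600"):
    print(repr(ch), utf16_units(ch), byte_len(ch, "utf-8"))

# The kanji as a double-byte character in the Japanese codepage.
print(byte_len("倉", "cp932"))
```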


Dave Benham

#26 Post by dbenham » 24 Apr 2014 13:05

Updated code in first post to version 1.4. Fixed the A option that only outputs altered values - it had been broken starting with version 1.2.

Also, the ERRORLEVEL is now set to 2 if A option used and the input is not altered.


Dave Benham

#27 Post by carlos » 10 May 2014 01:34

carlos wrote:Liviu, thanks for all the comments.
I checked, and you are right.
On XP, using this corrupts the saved variable.
I suspect that this also happens on Windows 8.

load.cmd utf8 batch

Code:

(
chcp 65001 >nul
"%~f1"
chcp %cp% >nul
)


I do not understand why the UTF-8 file without BOM, when launched with the above code under the UTF-8 codepage, runs without error yet is not translated correctly from UTF-8 to UTF-16.

Image

This is a new challenge for me. I will try to solve it.


I tested it on a Japanese Windows 7, and the same corruption happens.
Maybe it is impossible to run a UTF-8 encoded batch file using codepage 65001.
I tried on Windows XP and Windows 7, both Japanese, and neither works correctly.
Also, Japanese Windows 7 using a raster font cannot display a variable that was set using codepage 932 when codepage 65001 is active.

Image

#28 Post by Liviu » 10 May 2014 17:33

carlos wrote:I tested it on a Japanese Windows 7, and the same corruption happens.
Maybe it is impossible to run a UTF-8 encoded batch file using codepage 65001.
I tried on Windows XP and Windows 7, both Japanese, and neither works correctly.
Also, Japanese Windows 7 using a raster font cannot display a variable that was set using codepage 932 when codepage 65001 is active.

Not sure what the point of your exercise is, but just to recycle a couple of notes from older posts:
- xp can't run a batch file under codepage 65001 (though win7 can);
- codepage sensitive work needs to be done in a prompt configured to use a Unicode/TT font (not raster).

Also, at the risk of repeating myself, you are overcomplicating things unnecessarily with the Japanese twist (btw, you don't mention what you mean by "Japanese windows" - native install vs. western windows with Japanese support added later - they are not entirely the same). Asian languages introduce their own complications (DBCS codepages etc), but in your case the choice of Japanese only obfuscates basic Unicode issues that can be demonstrated in western installations, too, and are easier to explore and figure out there. For example, setting a variable "varW=‹αß©∂€›" and then attempting to display it with "set varW" at a cmd prompt using a raster font results in the same "cannot write to the specified device" error.

Liviu

#29 Post by carlos » 10 May 2014 18:38

I used a native Japanese install on both XP and Windows 7.

Look at this file:

Code:

http://consolesoft.com/shared/soko_env_batch.zip

It contains two batch files
that both do this:

Code:

set "sokoban=倉庫番"

Both are multibyte: one uses codepage 932, and the other uses codepage 65001 (UTF-8 without BOM).

If I run 932.cmd using codepage 932, it sets the variable correctly.
If I run 65001.cmd using codepage 65001 (for convenience, on Windows 7 I run the batch file directly under codepage 65001, and on XP I start under another codepage and then switch temporarily to codepage 65001 with the parentheses trick), it does not save the variable correctly; it corrupts it, as in the previous screenshots.

Conclusion: cmd is inconsistent in its use of codepage 65001; it surely has a programming bug in the flags passed to the MultiByteToWideChar function. I will try to find a patch for it.


Also, in my experiments I found that string manipulation in cmd is codepage independent.

For example:
Using codepage 932,
I can do this in cmd interactive mode:

Code:

set sokoban=倉庫番


Then, change to codepage 437 and do this using a batch script:

Code:

set abc=%sokoban:~0,1%


If I print it using the raster font and codepage 437, it prints:
abc=?

But if I change to codepage 932, I can print it correctly.

Code:

echo %abc%
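
The experiment above can be mirrored in Python as a rough sketch. It assumes Python codecs as stand-ins for codepages 437 and 932, and `errors="replace"` as a stand-in for the "?" shown by the console; it does not model raster fonts or cmd itself. Python slices by code point rather than by UTF-16 code unit, which makes no difference for these characters.

```python
sokoban = "倉庫番"

# Substring extraction works on the Unicode string, like %sokoban:~0,1%;
# no codepage is involved at this point.
abc = sokoban[0:1]

# Only the conversion for display depends on the codepage.
shown_437 = abc.encode("cp437", errors="replace").decode("cp437")
shown_932 = abc.encode("cp932").decode("cp932")

print(repr(abc), repr(shown_437), repr(shown_932))
```

In this model the stored value is intact the whole time; only the rendering under a codepage that lacks the character degrades.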




The goal of encoding a batch script as UTF-8 without BOM and running it under codepage 65001 (it does not work correctly yet, but my idea is to patch this problem) is to support dbenham's technique of editing a variable using JScript. JScript can manipulate the value without loss, but to feed it back into cmd without loss, we need a batch script encoded as UTF-8 that sets a variable to the output.

#30 Post by Liviu » 10 May 2014 19:28

carlos wrote:If I run 65001.cmd using codepage 65001 (for convenience, on Windows 7 I run the batch file directly under codepage 65001 ...) it does not save the variable correctly; it corrupts it
65001.cmd works correctly for me in win7x64.sp1 when run under chcp 65001 at a cmd prompt using a Unicode font (note: it does not matter whether the font actually includes Japanese characters, but it must be a Unicode font - not a raster font - since using a raster font disables the automatic codepage conversions in the console).

carlos wrote:If I print it using the raster font and codepage 437, it prints: abc=?
It won't work and can't work using a raster font.

carlos wrote:The goal of encoding a batch script as UTF-8 without BOM and running it under codepage 65001 (it does not work correctly yet ...)
It does work OK for me in win7, and it will never fully work for you if you keep trying xp, or running your console with raster fonts.

Liviu

P.S. None of your examples so far is specifically related to Japanese Windows. There is nothing about "倉庫番" and codepage 932 that I've seen, which can't be demonstrated using "‹αß©∂€›" and an SBCS codepage (437, 850, 1252 etc). You would help your cause if you posted simple examples that others could duplicate more easily in a western Windows installation.
