In my directory tree there are many *.txt files that contain UTF-8 encoded text (without BOM).
The text inside such a file may contain UTF-8 encoded German umlauts, each of which consists of TWO bytes.
Example:
xC3xB6 for o Umlaut = "ö"
Now I want to convert all these occurrences in all the text files from two-byte UTF-8 encoding to one-byte ISO-8859-1 (or ISO-8859-15) encoding.
So, referring to the example above, hex xC3xB6 should be replaced by hex xF6.
How can I achieve this from a DOS batch script?
As far as I found out there are some advanced scripts for this task like:
viewtopic.php?f=3&t=3855
However this is too comprehensive and big for this small task.
Isn't there a smaller (one-liner) script command for that?
Thank you
Peter
Detecting two bytes and replacing them by one
Re: Detecting two bytes and replacing them by one
Perhaps like this one?
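Code:
@set @Batch=1 /*
@echo off
rem Hybrid Batch/JScript file: the batch part pipes each file through the JScript part below
for %%a in (*.txt) do (
   CScript //nologo //E:JScript "%~F0" < "%%a" > "%%a.new"
   REM Remove the REM on the next line to overwrite the original file
   REM move /Y "%%a.new" "%%a"
)
goto :EOF */
// JScript section: replace the two bytes C3 B6 (UTF-8 "ö") with the single byte F6
WScript.Stdout.Write(WScript.StdIn.ReadAll().replace(/\xC3\xB6/g,"\xF6"));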
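If listing every character by hand gets tedious, the replacement can probably be generalized: any UTF-8 sequence with lead byte \xC2 or \xC3 encodes a code point between U+0080 and U+00FF, and that code point is also the ISO-8859-1 byte value. Here is a minimal sketch of that arithmetic in plain JavaScript. It was written for Node.js, not tested under Windows Script Host, and the name utf8ToLatin1 is made up for illustration. It assumes the input is a string in which every character holds one raw byte value (0x00 to 0xFF).

```javascript
// Hypothetical helper: collapse two-byte UTF-8 sequences into single
// ISO-8859-1 bytes. Input is a "binary string": one character per byte.
function utf8ToLatin1(bytes) {
  // Lead bytes 0xC2 and 0xC3 cover the code points U+0080..U+00FF.
  return bytes.replace(/([\xC2\xC3])([\x80-\xBF])/g, function (match, lead, cont) {
    // 110xxxxx 10yyyyyy  ->  xxxxxyyyyyy
    var codePoint = ((lead.charCodeAt(0) & 0x1F) << 6) | (cont.charCodeAt(0) & 0x3F);
    return String.fromCharCode(codePoint);
  });
}

// "Gr\xC3\xB6\xC3\x9Fe" is the UTF-8 byte sequence for "Größe".
var converted = utf8ToLatin1("Gr\xC3\xB6\xC3\x9Fe");
console.log(converted === "Gr\xF6\xDFe");
```

Sequences with other lead bytes (three bytes and longer) are left untouched, since they have no ISO-8859-1 equivalent anyway.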
Antonio
Re: Detecting two bytes and replacing them by one
pstein wrote:As far as I found out there are some advanced scripts for this task like:
viewtopic.php?f=3&t=3855
However this is too comprehensive and big for this small task.
I don't understand that sentiment. Software like that is an already developed tool that you simply put on your machine once and then it can be used to quickly and efficiently solve a great many text processing problems.
The modern replacement for REPL.BAT is JREPL.BAT. One of its nice features is a translate option that lets you supply a list of search strings and an equal-length list of replacement strings. So the following one-liner handles all the German UTF-8 characters (upper- and lowercase A, O, U with umlaut, plus the lowercase sharp S) in a single pass.
Code:
call jrepl "\xC3\x84 \xC3\xA4 \xC3\x96 \xC3\xB6 \xC3\x9C \xC3\xBC \xC3\x9F" "\xC4 \xE4 \xD6 \xF6 \xDC \xFC \xDF" /x /t " " /f test.txt /o -
You might also have to add a search replace pair to replace the BOM with nothing.
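To make the idea of paired search and replacement lists concrete, here is a rough sketch in plain JavaScript (Node.js). It is not JREPL's actual code; the function name translate and its parameters are invented for illustration, and the search strings are treated as literal text rather than regular expressions. The BOM pair is included, with an empty string as its replacement.

```javascript
// Hypothetical illustration of a delimiter-separated translate table.
// This is NOT JREPL's implementation; it only mirrors the paired-list idea.
function translate(text, findList, replList, delimiter) {
  var find = findList.split(delimiter);
  var repl = replList.split(delimiter);
  if (find.length !== repl.length) {
    throw new Error("search and replacement lists differ in length");
  }
  // Map each literal search string to its replacement.
  var table = {};
  for (var i = 0; i < find.length; i++) {
    table[find[i]] = repl[i];
  }
  // Escape regex metacharacters so the search strings match literally,
  // then join them into one alternation so the text is scanned only once.
  var pattern = find.map(function (s) {
    return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  }).join("|");
  return text.replace(new RegExp(pattern, "g"), function (match) {
    return table[match];
  });
}

// UTF-8 BOM first (replaced by an empty string), then the seven characters.
var FIND = "\xEF\xBB\xBF \xC3\x84 \xC3\xA4 \xC3\x96 \xC3\xB6 \xC3\x9C \xC3\xBC \xC3\x9F";
var REPL = " \xC4 \xE4 \xD6 \xF6 \xDC \xFC \xDF";

var result = translate("\xEF\xBB\xBF\xC3\x9Cber", FIND, REPL, " ");
console.log(result === "\xDCber");
```

The leading space in REPL matters: splitting on the delimiter then yields an empty first entry, which lines up with the BOM in FIND.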
The command line might become unwieldy with a lot of translations, in which case the /V option can be used to pass the search and replace strings as environment variables. Here is a full script that takes the input file as the first argument and the output file as the second argument. If the second argument is missing, it simply overwrites the original input file. I've also added the search/replace pair to remove any UTF-8 BOM.
Code:
@echo off
setlocal
set in=%1
set out=%2
if not defined out set out=-
rem The first pair removes the UTF-8 BOM: its replacement is the empty string before the leading space in repl
set "find=\xEF\xBB\xBF \xC3\x84 \xC3\xA4 \xC3\x96 \xC3\xB6 \xC3\x9C \xC3\xBC \xC3\x9F"
set "repl= \xC4 \xE4 \xD6 \xF6 \xDC \xFC \xDF"
call jrepl find repl /v /x /t " " /f %in% /o %out%
Dave Benham
Re: Detecting two bytes and replacing them by one
The batch methods all cheat to an extent and use JavaScript; there's no other way in pure "batch".
If you're willing to use third-party but commonly used Linux tools, you can do much of the work yourself rather than handing the whole job to a single tool.
For example, get:
xxd, a Linux tool which converts a file to hex and back,
sed, the classic search and replace tool,
and sed or cut to remove characters from the start.
You can then check the result with the Linux 'file' command, which tells you whether a file is UTF-8 or UTF-16, with or without BOM, BE or LE.
I'm not sure off the top of my head exactly how to do what you want, but here is a general method.
xxd can dump a file as hex, or take hex as input and write out the file. So you can edit the hex and send it back through xxd to produce the modified file.
Code:
C:\>echo 4141|sed -r "s/(..)/\1 /g"
41 41
C:\>echo 4141|sed -r "s/(..)/\1 /g"| xxd -r -p
AA
C:\>echo 4141|sed -r "s/(..)/\1 /g"| sed "s/41 /42 /g"
42 42
C:\>echo 4141|sed -r "s/(..)/\1 /g"| sed "s/41 /42 /g"|xxd -r -p
BB
C:\>
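For comparison, the same hex round trip can be sketched in JavaScript (Node.js) on an in-memory buffer. This is only an illustration of the technique, with a made-up function name replaceBytesViaHex. Like the sed step above, it puts a space after every hex pair, so a search string can only match on a byte boundary.

```javascript
// Hypothetical helper mirroring the xxd + sed pipeline on an in-memory buffer.
function replaceBytesViaHex(buffer, findHex, replHex) {
  // Like "xxd -p" followed by the sed that puts a space after each pair.
  var spaced = buffer.toString("hex").replace(/(..)/g, "$1 ");
  // Give the search and replacement strings the same "pair + space" layout.
  var find = findHex.toLowerCase().replace(/(..)/g, "$1 ");
  var repl = replHex.toLowerCase().replace(/(..)/g, "$1 ");
  // Because of the trailing spaces a match can only start on a byte boundary.
  var edited = spaced.split(find).join(repl);
  // Like "xxd -r -p": drop the spaces and turn the hex back into bytes.
  return Buffer.from(edited.replace(/ /g, ""), "hex");
}

var input = Buffer.from("4141", "hex");              // the bytes of "AA"
var output = replaceBytesViaHex(input, "41", "42");
console.log(output.toString("latin1"));              // prints "BB"
```

Calling it with "c3b6" and "f6" would do the o umlaut conversion from the original question.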
You can also use regexes to place strings at the beginning or end. Being able to add something at the beginning (or not) is useful for the BOM.
C:\crp>echo 4141|sed -r "s/(..)/\1 /g"| sed "s/$/3d/"|xxd -r -p
AA=
C:\crp>echo 4141|sed -r "s/(..)/\1 /g"| sed "s/^/3d/"|xxd -r -p
=AA
C:\crp>
Code:
C:\crp>echo abcdefg|cut -c 2-
bcdefg
C:\crp>echo FEFFdsfsdds|cut -c 5-
dsfsdds
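Rather than cutting a fixed number of characters, it may be safer to check first which BOM (if any) is actually there. A minimal JavaScript (Node.js) sketch of that check follows. The name stripBom is invented for illustration, and it only knows the UTF-8 and UTF-16 marks, so a UTF-32 LE file would be misreported as UTF-16 LE.

```javascript
// Hypothetical helper: report a byte order mark and return the bytes after it.
function stripBom(buffer) {
  var boms = [
    { name: "UTF-8", hex: "efbbbf" },
    { name: "UTF-16 BE", hex: "feff" },
    { name: "UTF-16 LE", hex: "fffe" }
  ];
  // The longest BOM checked here is three bytes, so look at the first three.
  var headHex = buffer.subarray(0, 3).toString("hex");
  for (var i = 0; i < boms.length; i++) {
    if (headHex.indexOf(boms[i].hex) === 0) {
      // Each byte is two hex digits, so skip hex.length / 2 bytes.
      return { bom: boms[i].name, body: buffer.subarray(boms[i].hex.length / 2) };
    }
  }
  return { bom: null, body: buffer };
}

var checked = stripBom(Buffer.from("efbbbf4772c3b6c39f65", "hex"));
console.log(checked.bom + " " + checked.body.toString("hex")); // prints "UTF-8 4772c3b6c39f65"
```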
As far as I can see, UTF-8 starts to differ from UTF-16 at U+0080, which is \xC2\x80 in UTF-8 and \u0080 in UTF-16.
This site may help with any UTF-8 <--> Unicode code point conversion: http://www.utf8-chartable.de/ (UTF-16 matches the code point one to one up to U+FFFF, apart from the surrogate range).
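If you want to check a table entry by hand, the code point to UTF-8 mapping is simple bit shuffling. Here is a small JavaScript (Node.js) sketch for code points up to U+FFFF. The names encodeUtf8 and toHex are made up for this example, and surrogates (U+D800 to U+DFFF) are not rejected.

```javascript
// Hypothetical helper: encode a code point up to U+FFFF as UTF-8 bytes.
function encodeUtf8(codePoint) {
  if (codePoint < 0x80) {
    return [codePoint];                               // 0xxxxxxx
  }
  if (codePoint < 0x800) {
    return [0xC0 | (codePoint >> 6),                  // 110xxxxx
            0x80 | (codePoint & 0x3F)];               // 10xxxxxx
  }
  return [0xE0 | (codePoint >> 12),                   // 1110xxxx
          0x80 | ((codePoint >> 6) & 0x3F),           // 10xxxxxx
          0x80 | (codePoint & 0x3F)];                 // 10xxxxxx
}

// Format a byte array as space-separated two-digit hex.
function toHex(bytes) {
  return bytes.map(function (b) {
    return (b < 16 ? "0" : "") + b.toString(16);
  }).join(" ");
}

console.log(toHex(encodeUtf8(0x7F)));   // prints "7f"     (last one-byte code point)
console.log(toHex(encodeUtf8(0x80)));   // prints "c2 80"  (first two-byte code point)
console.log(toHex(encodeUtf8(0xF6)));   // prints "c3 b6"  (o umlaut)
```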