Detecting two bytes and replacing them by one

#1 Post by **pstein** » 22 Jul 2015 02:38

In my directory tree there are many *.txt files which contain UTF-8 encoded text (without BOM marker).

The text inside such a file may contain UTF-8 encoded German umlauts, which consist of TWO bytes.
Example:

xC3xB6 for o Umlaut = "ö"

Now I want to convert all these occurrences in all the text files from two-byte UTF-8 encoding to one-byte ISO-8859-1 (or ISO-8859-15) encoding.

So, referring to the example above, hex xC3xB6 should be replaced by hex xF6.

How can I achieve this from a DOS batch script?

As far as I found out there are some advanced scripts for this task like:
viewtopic.php?f=3&t=3855

However this is too comprehensive and big for this small task.

Isn't there a smaller (one-liner) script command for that?

Thank you
Peter

#2 Post by **Aacini** » 22 Jul 2015 08:46

Perhaps like this one?

Code:

@set @Batch=1 /*

@echo off
for %%a in (*.txt) do (
   CScript //nologo //E:JScript "%~F0" < "%%a" > "%%a.new"
   REM move /Y "%%a.new" "%%a"
)

goto :EOF    */

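// JScript section: read all of stdin, replace the two-byte UTF-8 sequence
// for "ö" (C3 B6) with the single ISO-8859-1 byte F6, and write to stdout.

WScript.Stdout.Write(WScript.StdIn.ReadAll().replace(/\xC3\xB6/g,"\xF6"));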

Antonio
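
For readers who want to cover every two-byte sequence rather than one hard-coded pair, the same replace idea can be generalized. What follows is only a sketch in plain JavaScript (not tested under CScript). It assumes the file content is held as a string in which each character stands for one byte, and the function name `utf8PairsToLatin1` is made up for illustration. A lead byte of C2 or C3 followed by a continuation byte in the range 80 to BF decodes to a code point between 80 and FF, which is also the ISO-8859-1 byte value.

```javascript
// Sketch: collapse every two-byte UTF-8 sequence that fits into ISO-8859-1.
// Assumption: `bytes` is a "binary string", one character (0x00-0xFF) per byte.
function utf8PairsToLatin1(bytes) {
  // Lead byte C2/C3 plus one continuation byte 80-BF encodes U+0080..U+00FF.
  return bytes.replace(/[\xC2\xC3][\x80-\xBF]/g, function (pair) {
    var lead = pair.charCodeAt(0) & 0x1F; // low 5 bits of the lead byte
    var cont = pair.charCodeAt(1) & 0x3F; // low 6 bits of the continuation byte
    return String.fromCharCode((lead << 6) | cont);
  });
}

// "schön" as UTF-8 bytes: 73 63 68 C3 B6 6E
var result = utf8PairsToLatin1("sch\xC3\xB6n");
console.log(result.length, result.charCodeAt(3).toString(16)); // prints: 5 f6
```

Longer sequences (for example the three bytes of the euro sign) have no ISO-8859-1 equivalent and are left untouched by this pattern.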

#3 Post by **dbenham** » 23 Jul 2015 06:01

pstein wrote:As far as I found out there are some advanced scripts for this task like:
viewtopic.php?f=3&t=3855

However this is too comprehensive and big for this small task.

I don't understand that sentiment. Software like that is an already developed tool that you simply put on your machine once and then it can be used to quickly and efficiently solve a great many text processing problems.

The modern replacement for REPL.BAT is JREPL.BAT. One of the nice features of JREPL.BAT is its translate option, which lets you supply a list of search strings and an equally long list of replacement strings. So the following one-liner will handle all the German UTF-8 characters (upper- and lowercase A, O, U with umlaut, as well as lowercase sharp S) in a single pass.

Code:

call jrepl "\xC3\x84 \xC3\xA4 \xC3\x96 \xC3\xB6 \xC3\x9C \xC3\xBC \xC3\x9F"  "\xC4 \xE4 \xD6 \xF6 \xDC \xFC \xDF" /x /t " " /f test.txt /o -

You might also have to add a search/replace pair to replace the BOM with nothing.

The command line might become unwieldy with a lot of translations, in which case the /V option can be used to pass the search and replace strings as environment variables. Here is a full script that takes the input file as the first argument and the output file as the second argument. If the second argument is missing, it simply overwrites the original input file. I've also added the search/replace pair to remove any UTF-8 BOM.

Code:

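@echo off
setlocal
rem %1 = input file, %2 = optional output file ("-" overwrites the input)
set in=%1
set out=%2
if not defined out set out=-
rem First search term is the UTF-8 BOM; the leading space in repl pairs it with an empty replacement
set "find=\xEF\xBB\xBF \xC3\x84 \xC3\xA4 \xC3\x96 \xC3\xB6 \xC3\x9C \xC3\xBC \xC3\x9F"
set "repl= \xC4 \xE4 \xD6 \xF6 \xDC \xFC \xDF"
call jrepl find repl /v /x /t " " /f %in% /o %out%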

Dave Benham
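
To illustrate what a translate-style pass with two parallel lists does, here is a rough JavaScript sketch. It is not JREPL's actual implementation, and the helper name `translate` is hypothetical. Both lists are split on the delimiter, matched up by position, and applied in one pass. The data is assumed to be a string with one character per byte.

```javascript
// Sketch of a single-pass translation driven by two delimited lists.
// Not JREPL's code; `translate` is a made-up helper for illustration.
function translate(text, findList, replList, delim) {
  var find = findList.split(delim);
  var repl = replList.split(delim);
  if (find.length !== repl.length) {
    throw new Error("search and replacement lists differ in length");
  }
  var map = {};
  for (var i = 0; i < find.length; i++) {
    map[find[i]] = repl[i];
  }
  // One alternation of all search terms. Each term is treated as a literal
  // byte sequence here, so regex metacharacters are escaped.
  var pattern = new RegExp(find.map(function (s) {
    return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  }).join("|"), "g");
  return text.replace(pattern, function (m) { return map[m]; });
}

// Same pairs as in the script above, written as literal bytes. The leading
// space in the replacement list yields an empty first entry, which drops the BOM.
var find = "\xEF\xBB\xBF \xC3\x84 \xC3\xA4 \xC3\x96 \xC3\xB6 \xC3\x9C \xC3\xBC \xC3\x9F";
var repl = " \xC4 \xE4 \xD6 \xF6 \xDC \xFC \xDF";
var out = translate("\xEF\xBB\xBFGr\xC3\xBC\xC3\x9Fe", find, repl, " ");
// out now holds the bytes 47 72 FC DF 65 ("Grüße" in ISO-8859-1, BOM removed)
```

Doing all substitutions in one pass means the output of one replacement is never scanned again by a later one, which a chain of separate replace calls would not guarantee.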

#4 Post by **taripo** » 03 Aug 2015 02:10

The batch methods all cheat to an extent and use JavaScript; there's no other way in pure "batch".

If you're willing to use third-party but commonly used Linux tools, you can do a lot of this yourself manually, without leaving all the functionality to a single tool.

For example, get a Linux tool called xxd, which dumps a file as hex and converts hex back into a file,

and sed, which is a classic tool for search and replace,

and sed or cut to remove characters from the start.

You can check how you did with the Linux 'file' command, which tells you whether a file is UTF-8 or UTF-16, with or without BOM, and BE or LE.

I'm not sure off the top of my head exactly how to do what you want, but here is a method for approaching it.

xxd can dump the hex of a file, or take the hex of a file as input and produce the file. So you can change the hex of a file and send the edited hex to xxd to write the file back.

Code:

C:\>echo 4141|sed -r "s/(..)/\1 /g"
41 41

C:\>echo 4141|sed -r "s/(..)/\1 /g"| xxd -r -p
AA
C:\>echo 4141|sed -r "s/(..)/\1 /g"| sed "s/41 /42 /g"
42 42

C:\>echo 4141|sed -r "s/(..)/\1 /g"| sed "s/41 /42 /g"|xxd -r -p
BB
C:\>
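
The same hex round trip can also be sketched without external tools, for example in JavaScript under Node.js (an assumption; the post itself uses xxd and sed). The function name `replaceBytesViaHex` is made up for illustration. Putting a space after every byte pair before replacing matters: a plain search for c3b6 in an unbroken hex string could match across a byte boundary.

```javascript
// Sketch (Node.js assumed): bytes -> spaced hex pairs -> replace -> bytes,
// mirroring the xxd/sed pipeline above.
function replaceBytesViaHex(buf, findHex, replHex) {
  // "c3 b6 " style: every byte followed by a space, like sed "s/(..)/\1 /g"
  var spaced = function (hex) {
    return hex.toLowerCase().replace(/(..)/g, "$1 ");
  };
  var dump = spaced(buf.toString("hex"));
  // Because every pair ends in a space, matches can only start at a byte boundary.
  var edited = dump.split(spaced(findHex)).join(spaced(replHex));
  // Strip the spaces again and turn the hex back into bytes (like xxd -r -p).
  return Buffer.from(edited.replace(/ /g, ""), "hex");
}

var input = Buffer.from([0x73, 0x63, 0x68, 0xC3, 0xB6, 0x6E]); // "schön" in UTF-8
var output = replaceBytesViaHex(input, "c3b6", "f6");
console.log(output.toString("hex")); // prints: 736368f66e
```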

Code:

C:\crp>echo abcdefg|cut -c 2- <ENTER>
bcdefg


C:\crp>echo FEFFdsfsdds|cut -c 5-
dsfsdds

I see UTF-8 starts differing from UTF-16 at U+0080, which UTF-8 encodes as \xc2\x80.

This site may help with any UTF-8 <--> Unicode code point conversion: http://www.utf8-chartable.de/ I guess UTF-16 roughly matches the code points, at least up to FFFF.
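
A quick way to check where the two encodings diverge, sketched in JavaScript for Node.js (an assumption, since the tools mentioned in this post are xxd and file). The helper name `hexOf` is made up for illustration. U+007F is the last code point stored as a single byte in UTF-8; from U+0080 on, UTF-8 needs two bytes while UTF-16 still stores the code point value directly.

```javascript
// Sketch (Node.js assumed): show how a code point is stored in UTF-8 and UTF-16LE.
function hexOf(text, encoding) {
  return Buffer.from(text, encoding).toString("hex");
}

console.log(hexOf("\u007F", "utf8"), hexOf("\u007F", "utf16le")); // prints: 7f 7f00
console.log(hexOf("\u0080", "utf8"), hexOf("\u0080", "utf16le")); // prints: c280 8000
console.log(hexOf("\u00F6", "utf8"), hexOf("\u00F6", "utf16le")); // prints: c3b6 f600
```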

DosTips.com
