UTF-8 To Unicode

mauro012345
Posts: 2
Joined: 18 May 2012 01:47

UTF-8 To Unicode

#1 Post by mauro012345 » 18 May 2012 01:53

Hi! :)
Is there a snippet similar to this one:

http://www.dostips.com/?t=Snippets.AnsiToUnicode

to convert a file from UTF-8 to Unicode txt format?
I have to do that with thousands of files, so I need a command I can call from a script 8)
Thanks!

aGerman
Expert
Posts: 3895
Joined: 22 Jan 2010 18:01
Location: Germany

Re: UTF-8 To Unicode

#2 Post by aGerman » 18 May 2012 09:06

Batch has horrible Unicode support and doesn't support UTF-8. If you need to convert between those character encodings, you could try VBScript.
See http://www.robvanderwoude.com/vbstech_files_utf8.php
PM me if you need help with it.

Regards
aGerman

Liviu
Expert
Posts: 470
Joined: 13 Jan 2012 21:24

Re: UTF-8 To Unicode

#3 Post by Liviu » 18 May 2012 14:27

Save the following to a batch file.

@echo off
setlocal disabledelayedexpansion

:: save original codepage
for /f "tokens=2 delims=:" %%a in ('chcp') do @set /a "cp=%%~a"

:: write utf-16le BOM
chcp 1252 >nul
rem replace with 'cmd /a /c (set ..' if called at 'cmd /u' prompt
(set /p =ÿþ) <nul >%2 2>nul
chcp %cp% >nul

:: convert utf-8 to utf-16le
rem all on one line since batch parsing fails while active codepage is utf-8
chcp 65001 >nul & cmd /u /c type %1 >>%2 & chcp %cp% >nul

Call it with the source UTF-8 encoded file as the 1st argument and the destination filename as the 2nd argument; the output is saved as UTF-16LE (including the leading BOM). I tested it under Windows XP SP3. Note, however, that this is just a minimal snippet with no error checking.

Liviu

aGerman
Expert
Posts: 3895
Joined: 22 Jan 2010 18:01
Location: Germany

Re: UTF-8 To Unicode

#4 Post by aGerman » 18 May 2012 16:39

I wasn't aware that TYPE would return a usable output. Great, Liviu!

Regards
aGerman

Squashman
Expert
Posts: 4179
Joined: 23 Dec 2011 13:59

Re: UTF-8 To Unicode

#5 Post by Squashman » 18 May 2012 19:04

I tried just the original code, and it seemed to do OK when I told my file viewing software to display the output file as UTF-16LE, but there were a few unreadable characters at the beginning.

Liviu
Expert
Posts: 470
Joined: 13 Jan 2012 21:24

Re: UTF-8 To Unicode

#6 Post by Liviu » 18 May 2012 20:00

aGerman wrote: I wasn't aware that TYPE would return a usable output.
TYPE is indeed surprisingly well behaved for a builtin command ;-) Using combinations of chcp, the half-baked 65001 codepage support, and 'cmd /u' one can use TYPE to convert text files between codepages, or 8-bit and Unicode encodings.

Squashman wrote: ...but there were a few unreadable characters at the beginning.
Maybe your input file had a UTF-8 BOM (neither required nor recommended), which TYPE doesn't like. Or maybe your viewer did not skip over the UTF-16LE BOM (both required and recommended). Or maybe you just had some characters in the test file that the viewer font does not cover.

Liviu

Squashman
Expert
Posts: 4179
Joined: 23 Dec 2011 13:59

Re: UTF-8 To Unicode

#7 Post by Squashman » 18 May 2012 21:19

I was just testing with plain old American English, just 3 sentences. I used Notepad to save it as UTF-8 and then ran the code. I was going to post a screenshot but didn't have time earlier.

Liviu
Expert
Posts: 470
Joined: 13 Jan 2012 21:24

Re: UTF-8 To Unicode

#8 Post by Liviu » 19 May 2012 09:25

Squashman wrote: I used notepad to save it as UTF-8
Notepad does indeed write a BOM to the UTF-8 file. When redirecting the output to a file, "type" converts the UTF-8 BOM to a UTF-16LE BOM. Since the original code forces a UTF-16LE BOM itself, the end result would be a UTF-16LE file mistakenly starting with two BOM sequences (0xFF 0xFE 0xFF 0xFE).
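The byte sequences described above can be checked outside of batch. The following is only a rough Python sketch, not part of the original snippet: it mimics what the conversion does to the BOM rather than calling "type" itself, and the sample text "Hello" is an arbitrary assumption.

```python
import codecs

# A UTF-8 file as Notepad saves it: BOM (EF BB BF) followed by the text.
utf8_with_bom = codecs.BOM_UTF8 + "Hello".encode("utf-8")

# Decoding as plain UTF-8 keeps the BOM as the character U+FEFF, and
# re-encoding as UTF-16LE turns that character into the bytes FF FE.
converted = utf8_with_bom.decode("utf-8").encode("utf-16-le")

# Prepending a second BOM, as the original batch code does, doubles it.
doubled = codecs.BOM_UTF16_LE + converted

print(converted[:2].hex())  # first two bytes of the converted text
print(doubled[:4].hex())    # first four bytes once a second BOM is forced
```

The second line printed shows the doubled sequence, which a viewer would likely render as a stray character at the start of the file.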

If you remove the ":: write utf-16le BOM" section from the original code, the conversion will work for UTF-8 files with an embedded BOM.
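For a folder holding a mix of UTF-8 files with and without a BOM, one alternative is to normalise the BOM explicitly instead of editing the batch file per case. The following is a minimal Python sketch under that assumption, not the batch method from this thread; the function names `utf8_to_utf16le` and `convert_folder`, and the `*.txt` pattern, are made up for illustration.

```python
import codecs
from pathlib import Path


def utf8_to_utf16le(src, dst):
    """Convert one UTF-8 text file to UTF-16LE with a single leading BOM."""
    # "utf-8-sig" decodes UTF-8 and drops a leading BOM if one is present.
    text = Path(src).read_bytes().decode("utf-8-sig")
    # "utf-16-le" emits no BOM of its own, so exactly one is written by hand.
    Path(dst).write_bytes(codecs.BOM_UTF16_LE + text.encode("utf-16-le"))


def convert_folder(src_dir, dst_dir, pattern="*.txt"):
    """Convert every matching file in src_dir, keeping the file names."""
    out = Path(dst_dir)
    out.mkdir(parents=True, exist_ok=True)
    count = 0
    for src in sorted(Path(src_dir).glob(pattern)):
        if src.is_file():
            utf8_to_utf16le(src, out / src.name)
            count += 1
    return count
```

Reading and writing bytes rather than text keeps the original line endings untouched. Like the batch snippet, this has no error handling: a file that is not valid UTF-8 raises an exception and stops the run.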

Liviu

mauro012345
Posts: 2
Joined: 18 May 2012 01:47

Re: UTF-8 To Unicode

#9 Post by mauro012345 » 21 May 2012 03:39

Let's try this!
