CONVERTCP.exe - Convert text from one code page to another

aGerman
Expert
Posts: 3066
Joined: 22 Jan 2010 18:01
Location: Germany

#1 Post by aGerman » 24 Nov 2016 17:44

This command line utility is a code page converter. It supports character sets such as single-byte code pages, UTF-8, UTF-16 LE/BE, and EBCDIC. It's also designed to process big files. It should work on Windows XP and later (tested on XP, Windows 7, Windows 8.1, and Windows 10). It's a free and open source tool.

A few days ago miskox asked me to rewrite an old 16 bit tool that he uses so that it also runs on 64 bit Windows. The tool converts text from one single-byte code page to another. I bet the native English speakers among you are wondering what such a tool is even good for. The answer is that the CMD console and Windows applications use different code pages in which non-ASCII characters have different code points. Thus, characters like Ü, É, Š, and the like show up as different/wrong characters.

Steffen

convertcp_v1.4.4.zip



~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Usage of convertcp.exe

Code:

Converts a stream of characters to another code page.

Usage:
CONVERTCP CP_In CP_Out [/i "infile.txt"] [/o "outfile.txt"] [/b|/a]
CONVERTCP /?|/l

CP_In     Code Page Identifier of the input stream
CP_Out    Code Page Identifier of the output stream
 To get a list of supported Code Page Identifiers use option /l
 Alternatively you can use 0 for the ANSI Code Page
  and 1 for the OEM Code Page of your system default settings.

/i        Introduces the source file
/o        Introduces the destination file
           (the content of an existing file will be truncated
           unless option /a was passed)
 Redirections to or from CONVERTCP can be used instead of /i and /o

/b        Add the Byte Order Mark to the output stream
           (will be ignored if CP_Out was not one of
           65001, 1200, or 1201)
/a        Append the output stream to the destination file
           (always use the same CP_Out)
 Do not combine options /b and /a

/?        Display this help message
/l        Display a list of supported Code Page Identifiers
           installed on this computer

infile    Path of a text file whose content shall be converted
outfile   Path of a text file where the converted stream
           shall be written


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Additional information:

The support of code pages is restricted ...
a) by the characters shared by both code pages. If a character read has no equivalent, the implementation of the API functions used decides whether to
- either convert it to an approximated ASCII character (e.g. Š to S)
- or replace it with a default character (usually a question mark)
b) by the maximum number of bytes used to represent a character. The second column of the table printed by option /l indicates whether or not CONVERTCP can use a code page for input streams greater than 1 MB (all listed code pages can be used for output streams regardless of their size).
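
The two fallback behaviours under a) can be illustrated with a small Python sketch. It is only an approximation of the idea: CONVERTCP leaves the decision to the Windows API, whereas the helper names below (`encode_replace`, `encode_approximated`) are hypothetical and the approximation is emulated by stripping combining marks, which is an assumption and not the API's actual best-fit table.

```python
import unicodedata


def encode_replace(text, codec):
    """Replace every character the target code page lacks with '?'."""
    return text.encode(codec, errors="replace")


def encode_approximated(text, codec):
    """Try the base letter first (e.g. S for Š), fall back to '?'."""
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode(codec)
        except UnicodeEncodeError:
            # Decompose the character and drop the combining marks (assumed
            # stand-in for the approximation done by the real API).
            decomposed = unicodedata.normalize("NFD", ch)
            base = "".join(c for c in decomposed if not unicodedata.combining(c))
            out += base.encode(codec, errors="replace")
    return bytes(out)


# Code page 437 (OEM United States) contains Ü but not Š.
print(encode_replace("ŠÜ", "cp437"))       # b'?\x9a'
print(encode_approximated("ŠÜ", "cp437"))  # b'S\x9a'
```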

The utility was written in C/WinAPI. Besides the exe files (which are 32 bit and 64 bit MinGW/GCC release builds), the source code is included in the attached ZIP file. The program flow chart is for those who want to understand how the program works (even though it's simplified and incomplete). All files are under the MIT license.

Critique is always much appreciated.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Examples

Convert the output of a command and save it in a text file.
(The output of FINDSTR /? will be converted from the default OEM code page to UTF-16 LE with BOM prepended. The converted stream will be saved in "commands.txt".)

Code:

findstr /? | convertcp 1 1200 /b /o "commands.txt"


Convert the content of a text file and save it to another text file.
(The content of "commands.txt" will be converted from UTF-16 LE to the default ANSI code page and saved in "commands2.txt")

Code:

convertcp 1200 0 /i "commands.txt" /o "commands2.txt"
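
For readers who want to see what such a file-to-file conversion amounts to, here is a rough Python sketch. It is not how CONVERTCP is implemented. The function name `convert_file` and the chunk size are assumptions, and `cp1252` merely stands in for "the default ANSI code page" of a Western European system.

```python
def convert_file(src, dst, cp_in, cp_out, chunk_chars=1 << 20):
    """Re-encode a text file in chunks so that big files fit in memory."""
    # newline="" keeps line endings exactly as they are in the source file.
    with open(src, "r", encoding=cp_in, newline="") as fin, \
         open(dst, "w", encoding=cp_out, errors="replace", newline="") as fout:
        while True:
            chunk = fin.read(chunk_chars)
            if not chunk:
                break
            fout.write(chunk)


if __name__ == "__main__":
    # "utf-16" honours a leading BOM; "cp1252" is an assumed ANSI code page.
    convert_file("commands.txt", "commands2.txt", "utf-16", "cp1252")
```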


Convert the content of a text file and output it to the console window.
(The content of "commands2.txt" will be converted from the default ANSI code page to the default OEM code page and displayed.)

Code:

convertcp 0 1 /i "commands2.txt"


Append to an existing file.
(The output of FIND /? will be converted from the default OEM code page to UTF-16 LE. The converted stream will be appended to "commands.txt".)

Code:

find /? | convertcp 1 1200 /a /o "commands.txt"


Create a file with a Byte Order Mark only.
(NUL is redirected to CONVERTCP. Thus, the input stream is empty. The input code page ID is meaningless. Because the output code page ID is for UTF-8 and option /b was passed only the UTF-8 BOM will be written to the file. This might be useful if you want to append text to the file in multiple steps afterwards.)

Code:

<nul convertcp 0 65001 /b /o "bom.txt"


List the installed code pages.
(Process the list printed by CONVERTCP /L in a FOR /F loop in order to write the values comma-separated.)

Code:

for /f "skip=3 tokens=1,3,4*" %%i in ('convertcp /l') do echo "%%i","%%j","%%l"


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Release notes:
2017/05/27 - v1.4.4.0/1 added option /l to print a list of installed code pages
2017/02/02 - v1.4.3.0/1 added option /a for appending to an existing file
2017/01/29 - v1.4.2.0/1 reduced the size of the binary files by half (kudos to carlos)
2017/01/23 - v1.4.1.0/1 minor performance improvement
2016/12/28 - v1.4.0.0/1 UTF-16 BE support added, options /i and /o added
2016/12/09 - v1.3.2.0/1 fixed bug in conversion from UTF-8
2016/12/08 - v1.3.1.0/1 ambiguous code fixed, minor optimizations, source code tidied
2016/12/05 - v1.3.0.0/1 UTF-16 LE support added
2016/12/03 - v1.2.0.0/1 UTF-8 support added, fixed misleading error message if the size of the input stream is an exact multiple of 4 MB
2016/11/28 - v1.1.4.0/1 minor optimizations, source code tidied, 64bit utility added
2016/11/25 - v1.1.3.0 fixed possible deadlock caused by unsignaled threads
2016/11/24 - v1.1.2.0 fixed possible memory leak if reallocations fail
2016/11/24 - v1.1.1.0 moved to C, multithreaded conversion added
unpublished - first versions using C++ vector containers, without multithreading

dbenham
Expert
Posts: 1961
Joined: 12 Feb 2011 21:02
Location: United States (east coast)

Re: CONVERTCP.exe - Convert text from one code page to another

#2 Post by dbenham » 25 Nov 2016 01:23

I'm a bit confused as to how this works, and/or how useful it could be. :?

So the low order ASCII code values remain the same, but the high order values vary from code page to code page. I can see how some code pages may share some characters in common, but their high order code values might be different. So your utility can do the necessary translation for characters in common. But what happens to the other characters that are not shared?

And are there frequently enough high order characters in common to make the utility worth while?

I should think there would be a number of code pages with no non-ASCII overlap at all, so I can't see how the utility could be useful in those cases.

At first I wondered how the utility works - how could it know all the correct mappings? But I looked at the source and see that it converts the text to UTF-16, and then converts back to a different single byte character set. I suppose it is the same underlying routines that cmd.exe uses to convert extended ASCII text to and from UTF-16.
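
The round trip through Unicode described above can be sketched in a few lines of Python. This is only an illustration of the principle with Python's own codec tables, not the utility's C code. The function name `convert` is hypothetical.

```python
def convert(data, cp_in, cp_out):
    """Decode the bytes to Unicode, then encode them for the target code page."""
    text = data.decode(cp_in)                      # code page bytes -> Unicode
    return text.encode(cp_out, errors="replace")   # Unicode -> code page bytes


# Š is byte 0x8A in code page 1250 and byte 0xE6 in code page 852.
print(convert(b"\x8a", "cp1250", "cp852"))  # b'\xe6'
```

Because the mapping tables come with the codecs, the converter itself does not need to know any pair of code pages in advance.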


Dave Benham

#3 Post by aGerman » 25 Nov 2016 03:03

I absolutely understand your concerns, Dave, and I know it's pretty difficult to see the benefit as long as you don't have to deal with languages that constantly use characters outside the default ASCII range. E.g. see the output of PAUSE /? on my PC:
Hält die Ausführung einer Batchdatei an und zeigt folgende Meldung an:
Drücken Sie eine beliebige Taste . . .

I agree that conversion between code pages like 1251 and 1252 makes little sense because the letters in the extended ASCII range hardly overlap. The default OEM code page and the default ANSI code page on the same system will certainly share most of the characters. That's the reason why you can pass 1 and 0 instead of the code page IDs.
If a character has no equivalent, the implementation of the API functions used decides whether to
- either convert it to the base character (e.g. Š to S)
- or replace it with a question mark
Of course one can use a combination of TYPE, CMD /U, and CHCP to convert text to UTF-16 and back to another code page. As mentioned above I wrote the utility on behalf of miskox who already converted files with hundreds of MB of text. It seems to be useful for at least some people :lol:

Steffen

miskox
Posts: 290
Joined: 28 Jun 2010 03:46

#4 Post by miskox » 25 Nov 2016 13:20

Again I must say Thank you! to aGerman for providing this program.

As he mentioned, I had a very old MS-DOS 16-bit exe which does not work on x64. I received the source code from the author (written in Turbo Pascal). aGerman said that it is easier to write a program from scratch than to try and relink it.

Back in the old days we in former Yugoslavia had 3 (yes, three!) different ways of displaying our characters that are special to our alphabet: ČŠŽ and also ĆĐ in Croatia, Serbia...

See this translation table:

Image

First I had to use character [ to display letter Š - fonts were patched to support this. After that 852 (OEM) and 1250 (ANSI) were introduced.

If I have a file a.txt with this letter Š (first letter is DEC 230, second character is DEC 138)

Code:

1250 852
Š       Ő


And I do

Code:

type a.txt


I see letter Š on the right as it should be, but letter Š on the left is not displayed correctly. If you edit this file with NOTEPAD, the letter on the left is correct but not the letter on the right.

If I have a .txt file with CP1250 character (for example Š) in it and try to find a letter (also Š) in command prompt window I will not succeed because these characters have different values in a code page table.
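
The effect described in this post can be reproduced with a short Python sketch that relies only on Python's built-in `cp852` and `cp1250` codecs. The helper name `contains` is hypothetical.

```python
# The two bytes from a.txt: DEC 230 (0xE6) and DEC 138 (0x8A).
raw = bytes([230, 138])

print(raw.decode("cp852"))   # 'ŠŐ'  (what a console using code page 852 shows)
print(raw.decode("cp1250"))  # 'ćŠ'  (what an editor using code page 1250 shows)


def contains(data, needle, codec):
    """Search raw bytes for a character as typed under the given code page."""
    return needle.encode(codec) in data


# A file saved in code page 1250 holds Š as byte 0x8A ...
ansi_file = "Š".encode("cp1250")
# ... so searching with the console's code page 852 (byte 0xE6) fails.
print(contains(ansi_file, "Š", "cp852"))   # False
print(contains(ansi_file, "Š", "cp1250"))  # True
```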

Saso

#5 Post by aGerman » 28 Nov 2016 01:11

New release with additional 64bit utility.

Steffen

jfl
Posts: 52
Joined: 26 Oct 2012 06:40

#6 Post by jfl » 01 Dec 2016 10:43

dbenham wrote:I'm a bit confused as to how this works, and/or how useful it could be. :?

+1 on aGerman's answer:
As soon as you start working with non-English documents, you'll quickly encounter some with illegible characters. This is because they are in the wrong encoding for your version of Windows.
Since I regularly face that same problem, I developed my own encoding conversion tool long ago: it's called conv.exe and is available in my system tools library at https://github.com/JFLarvoire/SysToolsLib/releases.

Steffen, Saso,
Mine also has options for converting to and from UTF-8, which is the most common encoding error I encounter nowadays.
You might also be interested in the 1clip.exe, 2clip.exe, and 12.bat tools, which allow you to use command-line tools (yours or mine) to convert data directly inside GUI apps.

#7 Post by aGerman » 01 Dec 2016 12:57

Thanks jfl

I already thought about adding UTF-8 support. The conversion to UTF-8 is quite simple. Actually it already works, except that the BOM is not prepended, although that can be fixed easily.
However, converting the other way round is much more complicated. The input stream is read in chunks of 1 MB in order to be able to process big files * . The conversion will fail if a chunk ends in the middle of a multibyte sequence of a UTF-8 stream. Currently I don't have a good idea how to solve that issue.
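
One common way to deal with a chunk that ends inside a multibyte sequence is to hold back the incomplete tail and prepend it to the next chunk. The Python sketch below shows that idea under the assumption of well-formed UTF-8 input. It is not necessarily what CONVERTCP does, and the function names are hypothetical.

```python
def split_utf8_tail(chunk):
    """Return (complete part, incomplete trailing sequence) of a UTF-8 chunk."""
    # A UTF-8 sequence has at most 4 bytes, so an incomplete one has at most 3.
    for back in range(1, min(3, len(chunk)) + 1):
        byte = chunk[-back]
        if byte & 0xC0 == 0x80:
            continue  # continuation byte, keep looking for its lead byte
        if byte >= 0xF0:
            needed = 4
        elif byte >= 0xE0:
            needed = 3
        elif byte >= 0xC0:
            needed = 2
        else:
            needed = 1  # plain ASCII byte
        if needed > back:
            return chunk[:-back], chunk[-back:]
        break
    return chunk, b""


def convert_utf8_chunks(chunks, target):
    """Convert an iterable of UTF-8 byte chunks to the target encoding."""
    carry = b""
    out = []
    for chunk in chunks:
        complete, carry = split_utf8_tail(carry + chunk)
        out.append(complete.decode("utf-8").encode(target))
    if carry:
        raise ValueError("input ends inside a multibyte sequence")
    return b"".join(out)
```

In Python the same effect is available ready-made through `codecs.getincrementaldecoder("utf-8")`. The manual version is shown only because it makes the boundary handling visible.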

Steffen

* That's where your conv.exe utility doesn't seem to work anymore. I tested with a file of only 256 MB where it ends up with a deadlock.

#8 Post by aGerman » 03 Dec 2016 19:23

I found a way to handle UTF-8. Pass 65001 as code page ID.
The UTF-8 Byte Order Mark will be prepended to the output stream if you pass /b as third argument.
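
For reference, the Byte Order Marks that belong to the three Unicode code page IDs can be looked up with a small Python sketch. The byte values come from Python's `codecs` module. The dictionary and the function name `bom_for` are hypothetical and only mirror the rule that /b is ignored for every other output code page.

```python
import codecs

# Windows code page ID -> Byte Order Mark
BOMS = {
    65001: codecs.BOM_UTF8,      # EF BB BF
    1200: codecs.BOM_UTF16_LE,   # FF FE
    1201: codecs.BOM_UTF16_BE,   # FE FF
}


def bom_for(cp_out):
    """Return the BOM for a Unicode code page ID, or b'' for any other ID."""
    return BOMS.get(cp_out, b"")


print(bom_for(65001).hex())  # efbbbf
print(bom_for(1252))         # b''
```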

Steffen

#9 Post by aGerman » 05 Dec 2016 06:11

I changed the I/O from C to WinAPI in order to have UTF-16 little endian supported also. Pass 1200 as code page ID.

Steffen

#10 Post by miskox » 06 Dec 2016 03:19

Thank you, Steffen! New release almost daily. Great!

Saso

#11 Post by aGerman » 06 Dec 2016 06:05

I try to work on it as long as it's fresh. I don't expect to get bug reports because the utility will not be found and used that often. Thus, finding questionable code and possible optimizations remains my own task. If I don't do it now, it would take me an hour to understand my own code after not looking at it for half a year.

I think in a few days I will upload one last minor release for the moment. After adding UTF-16 support there is no need to change the code that much. I'll try to find some ambiguous or uncertain code, do some minor optimizations, remove redundant code etc. Then I'll leave it as it is unless somebody finds a bug or has a request to add another feature ...

Steffen

#12 Post by aGerman » 08 Dec 2016 05:27

As already announced ...
Corrected ambiguous code for BOM removal
Moved BOM removal into a function in order to remove redundant code
Removed unnecessary memory reallocations
Replaced multiplications/divisions by two with faster bitwise shifting

Steffen

#13 Post by miskox » 08 Dec 2016 08:18

aGerman wrote:...Then I'll leave it as it is unless somebody finds a bug or has a request to add another feature ...


Maybe just an idea (probably not needed at the moment):

Add support for custom code page(s).

Code:

convertcp.exe my_private_CP1 my_private_CP2 <file_in.txt >file_out.txt


and there you have a translation table between these two private tables:

Code:

0x00 from CP1 translates into 0x12 in CP2
0x01 ---> 0x11
.
.
.


Thanks for everything.
Saso

#14 Post by aGerman » 08 Dec 2016 14:11

Saso

What you suggest is rather something like low-level cryptography and actually not the purpose of this utility. It doesn't make much sense to convert 0x00 to some other byte in a plain text file. All single-byte code pages have the same code points in the ASCII range (up to 0x7F).
If you want to have your own translation, then it should begin with 0x80 and end with 0xFF for the bytes read. Each of them having an associated other byte. Thus, you would need only one table (instead of two) with 128 pairs of values. I'm not sure if that was what you meant.
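
Such a single table of byte pairs for the range 0x80 to 0xFF could look like the following Python sketch. It illustrates the suggestion only and is not a CONVERTCP feature. The function name `build_table` and the sample pair are hypothetical.

```python
def build_table(pairs):
    """Build a 256-byte translation table that leaves ASCII untouched."""
    table = bytearray(range(256))  # identity mapping as the starting point
    for src, dst in pairs.items():
        if not 0x80 <= src <= 0xFF:
            raise ValueError("only bytes 0x80..0xFF may be remapped")
        table[src] = dst
    return bytes(table)


# Hypothetical custom mapping with a single pair: 0x8A -> 0xE6.
table = build_table({0x8A: 0xE6})
print(b"A\x8aB".translate(table))  # b'A\xe6B'
```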

Steffen

#15 Post by miskox » 09 Dec 2016 01:44

@aGerman:

My initial thought was a translation from EBCDIC to ASCII, which I had to use in the past. I did not check whether the current WinAPI can do this. So if this is not supported by the API, then we can call it a 'custom' translation table.
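
As an illustration of what an EBCDIC-to-ASCII translation involves, here is a small Python sketch that uses Python's built-in `cp037` codec (EBCDIC US/Canada). Picking code page 037 is an assumption, since the post does not say which EBCDIC variant was used. The function name is hypothetical.

```python
def ebcdic_to_ascii(data, codec="cp037"):
    """Decode EBCDIC bytes and re-encode them as ASCII ('?' if unmappable)."""
    return data.decode(codec).encode("ascii", errors="replace")


# "HELLO" in EBCDIC code page 037 is C8 C5 D3 D3 D6.
print(ebcdic_to_ascii(bytes.fromhex("C8C5D3D3D6")))  # b'HELLO'
```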

As I said: this was just an idea - the question is if it is really needed.

Thanks.

Saso
