DosTips.com

A Forum all about DOS Batch
All times are UTC-06:00




Posted: 24 Nov 2016 17:44
This command line utility is a codepage converter. It supports charsets such as single-byte code pages, UTF-8, UTF-16 LE/BE, and EBCDIC. It's designed to process big files as well. It should work on Windows XP onwards (tested on XP, Windows 7, Windows 8.1, and Windows 10). It's a free and open source tool.

A few days ago miskox asked me to rewrite an old 16 bit tool that he uses in order to make it run on 64 bit Windows as well. The tool converts text from one single-byte code page to another. I bet the native English speakers among you are wondering what such a tool is even good for. The answer is that the CMD console and Windows applications use different code pages in which non-ASCII characters have different code points. Thus, characters like Ü, É, Š, and the like show up as different/wrong characters.

Usage of convertcp.exe
Code: Select all
Converts a stream of characters to another code page.

Usage:
CONVERTCP CP_In CP_Out [/i "infile.txt"] [/o "outfile.txt"] [/b|/a]
CONVERTCP  /?

CP_In     Code Page Identifier of the input stream
CP_Out    Code Page Identifier of the output stream
 For a list of Code Page Identifiers see
  https://msdn.microsoft.com/en-us/library/dd317756.aspx
 Use with single-byte code pages, UTF-8, or UTF-16.
 Alternatively you can use 0 for the ANSI Code Page
  and 1 for the OEM Code Page of your system default settings.

/i        Introduces the source file
/o        Introduces the destination file
           (the content of an existing file will be truncated
           unless option /a was passed)
 Redirections to or from CONVERTCP can be used instead of /i and /o

/b        Add the Byte Order Mark to the output stream
           (will be ignored if CP_Out was not one of
           65001, 1200, or 1201)
/a        Append the output stream to the destination file
           (always use the same CP_Out)
 Do not combine options /b and /a

/?        Display this help message

infile    Any text file whose content shall be converted
outfile   Name of a text file where the converted stream
           shall be written

The tool is written in C/WinAPI. Besides the exe files (which are 32 bit and 64 bit MinGW/GCC release builds), the source code is included in the attached ZIP file. All files are under the MIT license.
Attachment:
convertcp_v1.4.3.zip [14.3 KiB]

Steffen

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Examples

Convert the output of a command and save it in a text file.
(The output of FINDSTR /? will be converted from the default OEM code page to UTF-16 LE with BOM prepended. The converted stream will be saved in "commands.txt".)
Code: Select all
findstr /? | convertcp 1 1200 /b /o "commands.txt"


Convert the content of a text file and save it to another text file.
(The content of "commands.txt" will be converted from UTF-16 LE to the default ANSI code page and saved in "commands2.txt")
Code: Select all
convertcp 1200 0 /i "commands.txt" /o "commands2.txt"


Convert the content of a text file and output it to the console window.
(The content of "commands2.txt" will be converted from the default ANSI code page to the default OEM code page and displayed.)
Code: Select all
convertcp 0 1 /i "commands2.txt"


Append to an existing file.
(The output of FIND /? will be converted from the default OEM code page to UTF-16 LE. The converted stream will be appended to "commands.txt".)
Code: Select all
find /? | convertcp 1 1200 /a /o "commands.txt"


Create a file with a Byte Order Mark only.
(NUL is redirected to CONVERTCP. Thus, the input stream is empty and the input code page ID is meaningless. Because the output code page ID is for UTF-8 and option /b was passed, only the UTF-8 BOM will be written to the file. This might be useful if you want to append text to the file in multiple steps afterwards.)
Code: Select all
<nul convertcp 0 65001 /b /o "bom.txt"
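The "BOM first, append later" pattern described above can be mimicked in a few lines of Python. This is only a sketch of the resulting file layout, not of what CONVERTCP does internally; the temporary path and the sample lines are made up for the illustration.

```python
import codecs
import os
import tempfile

# Hypothetical location for the demo file.
path = os.path.join(tempfile.mkdtemp(), "bom.txt")

# Step 1: create the file with nothing but the UTF-8 BOM (EF BB BF).
with open(path, "wb") as f:
    f.write(codecs.BOM_UTF8)

# Later steps: append UTF-8 text without writing another BOM.
for line in ("first line\r\n", "zweite Zeile mit Ü\r\n"):
    with open(path, "ab") as f:
        f.write(line.encode("utf-8"))

with open(path, "rb") as f:
    content = f.read()

# The BOM appears exactly once, at the very beginning of the file.
print(content.startswith(codecs.BOM_UTF8), content.count(codecs.BOM_UTF8))
```

Writing the BOM only once matters because a BOM in the middle of a file is treated as an ordinary character by most readers, which is presumably why the help text says not to combine /b and /a.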


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Release notes:
2017/02/02 - v1.4.3.0/1 added option /a for appending to an existing file
2017/01/29 - v1.4.2.0/1 reduced the size of the binary files by half (kudos to carlos)
2017/01/23 - v1.4.1.0/1 minor performance improvement
2016/12/28 - v1.4.0.0/1 UTF-16 BE support added, options /i and /o added
2016/12/09 - v1.3.2.0/1 fixed bug in conversion from UTF-8
2016/12/08 - v1.3.1.0/1 ambiguous code fixed, minor optimizations, source code tidied
2016/12/05 - v1.3.0.0/1 UTF-16 LE support added
2016/12/03 - v1.2.0.0/1 UTF-8 support added, fixed misleading error message if the input stream has a size of exact multiples of 4 MB
2016/11/28 - v1.1.4.0/1 minor optimizations, source code tidied, 64bit utility added
2016/11/25 - v1.1.3.0 fixed possible deadlock caused by unsignaled threads
2016/11/24 - v1.1.2.0 fixed possible memory leak if reallocations fail
2016/11/24 - v1.1.1.0 moved to C, multithreaded conversion added
unpublished - first versions using C++ vector containers, without multithreading


Posted: 25 Nov 2016 01:23
I'm a bit confused as to how this works, and/or how useful it could be. :?

So the low order ASCII code values remain the same, but the high order values vary from code page to code page. I can see how some code pages may share some characters in common, but their high order code values might be different. So your utility can do the necessary translation for characters in common. But what happens to the other characters that are not shared?

And are there frequently enough high order characters in common to make the utility worthwhile?

I should think there would be a number of code pages with no non-ASCII overlap at all, so I can't see how the utility could be useful in those cases.

At first I wondered how the utility works - how could it know all the correct mappings? But I looked at the source and see that it converts the text to UTF-16, and then converts back to a different single byte character set. I suppose it is the same underlying routines that cmd.exe uses to convert extended ASCII text to and from UTF-16.
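The two-step round trip described here can be sketched in Python. This is an illustration only, not the tool's C/WinAPI code: Python's built-in codecs stand in for the Windows conversion routines, the `str` type plays the role of the UTF-16 intermediate, and the function name `convert` as well as the choice of code pages 850 and 1252 are assumptions made for the example.

```python
def convert(data: bytes, cp_in: str, cp_out: str) -> bytes:
    """Re-encode bytes from one code page to another via Unicode."""
    text = data.decode(cp_in)    # step 1: source code page -> Unicode
    return text.encode(cp_out)   # step 2: Unicode -> target code page


# "Hält" as a console using OEM code page 850 would store it (ä = 0x84).
oem_bytes = b"H\x84lt"
ansi_bytes = convert(oem_bytes, "cp850", "cp1252")
print(ansi_bytes)  # in code page 1252 the same letter is the byte 0xE4
```

No per-pair mapping table is needed: each code page only has to know its own mapping to and from Unicode.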


Dave Benham


Posted: 25 Nov 2016 03:03
I absolutely understand your concerns, Dave, and I know it's pretty difficult to see the benefit as long as you don't have to deal with languages that constantly use characters outside plain ASCII. E.g. see the output of PAUSE /? on my PC:
Quote:
Hält die Ausführung einer Batchdatei an und zeigt folgende Meldung an:
Drücken Sie eine beliebige Taste . . .

I agree that you can't meaningfully convert between code pages like 1251 and 1252 because hardly any letters in the extended ASCII range overlap. The default OEM code page and the default ANSI code page on the same system will certainly share most of the characters. That's the reason why you can pass 1 and 0 instead of the code page IDs.
If a character has no equivalent, the implementation of the API functions used decides whether it
- either converts to the base character (e.g. Š to S)
- or replaces it with a question mark
Of course one can use a combination of TYPE, CMD /U, and CHCP to convert text to UTF-16 and back to another code page. As mentioned above, I wrote the utility on behalf of miskox, who has already converted files with hundreds of MB of text. It seems to be useful for at least some people :lol:
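The two fallback outcomes listed above can be imitated in Python to see what they look like. This is a sketch under assumptions: Python's codecs never pick a base character on their own, so the second function emulates that behaviour with Unicode decomposition, which is not necessarily how the Windows API decides. Both function names and the sample word are invented for the example.

```python
import unicodedata


def to_question_mark(text: str, codec: str) -> bytes:
    """Replace every character the target code page lacks with '?'."""
    return text.encode(codec, errors="replace")


def to_base_character(text: str, codec: str) -> bytes:
    """Try the base character first, fall back to '?' if that fails too."""
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode(codec)
        except UnicodeEncodeError:
            # Decompose (Š -> S + combining caron) and drop the combining marks.
            base = "".join(c for c in unicodedata.normalize("NFD", ch)
                           if not unicodedata.combining(c))
            out += base.encode(codec, errors="replace")
    return bytes(out)


# Code page 437 has no Š, so the two strategies give different results.
print(to_question_mark("Špela", "cp437"))
print(to_base_character("Špela", "cp437"))
```

The first call yields `?pela`, the second `Spela`. A character without any base form, such as a Cyrillic letter, still ends up as a question mark in both cases.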

Steffen


Posted: 25 Nov 2016 13:20
Again I must say Thank you! to aGerman for providing this program.

As he mentioned, I had a very old MS-DOS 16-bit exe which does not work on x64. I received the source code from the author (written in Turbo Pascal). aGerman said that it is easier to write a program from scratch than to try to relink it.

Back in the old days we in former Yugoslavia had 3 (yes, three!) different ways of displaying our characters that are special to our alphabet: ČŠŽ and also ĆĐ in Croatia, Serbia...

See this translation table:

[Image: translation table]

First I had to use character [ to display letter Š - fonts were patched to support this. After that 852 (OEM) and 1250 (ANSI) were introduced.

If I have an a.txt file with this letter Š in both encodings (DEC 230 in code page 852, DEC 138 in code page 1250)

Code: Select all
1250 852
Š       Ő


And I do

Code: Select all
type a.txt


I see the letter Š on the right as it should be, but the letter Š on the left is not displayed correctly. If you edit this file with NOTEPAD, the letter on the left is correct but not the letter on the right.

If I have a .txt file with a CP1250 character (for example Š) in it and try to find that letter (also Š) in a command prompt window, I will not succeed because these characters have different values in the code page tables.
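The mismatch described above is easy to reproduce with Python's `cp1250` and `cp852` codecs. This is just an illustration of the two byte values mentioned in the post; the variable names are made up.

```python
# The same two byte values, interpreted by the two code pages.
for value in (138, 230):
    raw = bytes([value])
    print(value, "cp1250:", raw.decode("cp1250"), "cp852:", raw.decode("cp852"))

# Why a search fails: the console (cp852) and the file (cp1250) disagree
# on the byte that represents Š.
needle_from_console = "Š".encode("cp852")
file_content = "Š".encode("cp1250")
found = needle_from_console in file_content
print(found)
```

Byte 138 reads as Š in 1250 but as Ő in 852, while byte 230 reads as Š in 852 but as ć in 1250, so a byte-wise search for the console's Š never matches the file's Š.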

Saso


Posted: 28 Nov 2016 01:11
New release with additional 64bit utility.

Steffen


Posted: 01 Dec 2016 10:43
dbenham wrote:
I'm a bit confused as to how this works, and/or how useful it could be. :?

+1 on aGerman's answer:
As soon as you start working with non-English documents, you'll quickly encounter some with illegible characters. This is due to them being in the wrong encoding for your version of Windows.
Since I regularly face that same problem, I also developed my own encoding conversion tool long ago. It's called conv.exe and is available in my system tools library at https://github.com/JFLarvoire/SysToolsLib/releases.

Steffen, Saso,
Mine also has options for converting to and from UTF-8, which is involved in most of the encoding errors I encounter nowadays.
You might also be interested in the 1clip.exe, 2clip.exe, and 12.bat tools, which allow using command-line tools (yours or mine) to convert data directly inside GUI apps.


Posted: 01 Dec 2016 12:57
Thanks jfl

I already thought about adding UTF-8 support. The conversion to UTF-8 is quite simple. Actually it already works, except that the BOM is not prepended, although that can be fixed easily.
However, converting in the opposite direction is much more complicated. The input stream is read in chunks of 1 MB in order to be able to process big files * . The conversion will fail if a chunk ends in the middle of a multibyte sequence of a UTF-8 stream. Currently I don't have any good idea how to solve that issue.

Steffen

* That's where your conv.exe utility doesn't seem to work anymore. I tested with a file of only 256 MB, where it ended up in a deadlock.
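One common way around the chunk boundary problem is to hold back an incomplete trailing sequence and prepend it to the next chunk. The Python sketch below shows that idea only; it is not taken from the tool's source, and the function names and the tiny chunk sizes are assumptions for the demonstration.

```python
def split_complete_utf8(buf: bytes) -> tuple:
    """Split buf into (complete part, incomplete trailing sequence)."""
    i = len(buf) - 1
    steps = 0
    # Walk back over at most 3 continuation bytes (bit pattern 10xxxxxx).
    while i >= 0 and steps < 3 and (buf[i] & 0xC0) == 0x80:
        i -= 1
        steps += 1
    if i < 0:
        return buf, b""
    lead = buf[i]
    if lead >= 0xF0:
        need = 4
    elif lead >= 0xE0:
        need = 3
    elif lead >= 0xC0:
        need = 2
    else:
        need = 1
    if len(buf) - i < need:
        return buf[:i], buf[i:]   # last sequence is cut off, keep it for later
    return buf, b""


def decode_in_chunks(data: bytes, chunk_size: int) -> str:
    """Decode UTF-8 chunk by chunk, carrying cut-off sequences over."""
    pending = b""
    parts = []
    for start in range(0, len(data), chunk_size):
        buf = pending + data[start:start + chunk_size]
        complete, pending = split_complete_utf8(buf)
        parts.append(complete.decode("utf-8"))
    if pending:
        raise ValueError("stream ends inside a multibyte sequence")
    return "".join(parts)


sample = "ČŠŽ ĆĐ € abc"
print(decode_in_chunks(sample.encode("utf-8"), 3) == sample)
```

Python users would normally reach for `codecs.getincrementaldecoder("utf-8")`, which does the same bookkeeping internally; the manual version is shown because it maps more directly onto a chunked read loop in C.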


Posted: 03 Dec 2016 19:23
I found a way to handle UTF-8. Pass 65001 as code page ID.
The UTF-8 Byte Order Mark will be prepended to the output stream if you pass /b as third argument.

Steffen


Posted: 05 Dec 2016 06:11
I changed the I/O from C to WinAPI in order to have UTF-16 little endian supported also. Pass 1200 as code page ID.

Steffen


Posted: 06 Dec 2016 03:19
Thank you, Steffen! New release almost daily. Great!

Saso


Posted: 06 Dec 2016 06:05
I try to work on it as long as it's fresh. I don't expect to get bug reports because the utility will not be found and used that often. Thus, finding uncertain code and possible optimizations remains my own task. It would take me an hour to understand my own code after half a year of not looking at it if I don't do it now.

I think in a few days I will upload one last minor release for the moment. After adding UTF-16 support there is no need to change the code that much. I'll try to find some ambiguous or uncertain code, do some minor optimizations, remove redundant code etc. Then I'll leave it as it is unless somebody finds a bug or has a request to add another feature ...

Steffen


Posted: 08 Dec 2016 05:27
As already announced ...
Corrected ambiguous code for BOM removal
Moved BOM removal into a function in order to remove redundant code
Removed unnecessary memory reallocations
Replaced multiplications/divisions by two with faster bitwise shifting
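For readers wondering what a single BOM removal function might look like, here is a rough Python sketch. It is not a translation of the tool's C source: the lookup table and the name `strip_bom` are assumptions, and only the three code page IDs named in the help text (65001, 1200, 1201) are covered.

```python
import codecs

# Code page ID -> Byte Order Mark; IDs as listed in the tool's help text.
BOMS = {
    65001: codecs.BOM_UTF8,      # EF BB BF
    1200: codecs.BOM_UTF16_LE,   # FF FE
    1201: codecs.BOM_UTF16_BE,   # FE FF
}


def strip_bom(data: bytes, code_page: int) -> bytes:
    """Drop a leading BOM if the input code page is one that can have it."""
    bom = BOMS.get(code_page)
    if bom and data.startswith(bom):
        return data[len(bom):]
    return data


print(strip_bom(b"\xef\xbb\xbfabc", 65001))
print(strip_bom(b"\xff\xfea\x00", 1200))
```

Keeping the check in one place means every input path (file or redirected stream) handles the BOM the same way. For single-byte code pages the data is returned untouched, even if it happens to start with the same bytes.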

Steffen


Posted: 08 Dec 2016 08:18
aGerman wrote:
...Then I'll leave it as it is unless somebody finds a bug or has a request to add another feature ...


Maybe just an idea (probably not needed at the moment):

Add support for custom code page(s).

Code: Select all
convertcp.exe my_private_CP1 my_private_CP2 <file_in.txt >file_out.txt


and there you have a translation table between these two private tables:

Code: Select all
0x00 from CP1 translates into 0x12 in CP2
0x01 ---> 0x11
.
.
.


Thanks for everything.
Saso


Posted: 08 Dec 2016 14:11
Saso

What you suggest is more like low-level cryptography and not really the purpose of this utility. It doesn't make much sense to convert 0x00 to some other byte in a plain text file. All ASCII-based single-byte code pages have the same code points in the ASCII range (up to 0x7F).
If you want to have your own translation, then it should begin with 0x80 and end with 0xFF for the bytes read. Each of them having an associated other byte. Thus, you would need only one table (instead of two) with 128 pairs of values. I'm not sure if that was what you meant.
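Such a single table could look like the following Python sketch. It is hypothetical and not a feature of the utility: the helper name `build_table` and the three sample pairs (Š, Č, Ž from code page 1250 to 852) are chosen only for illustration, and a real table would list all 128 pairs.

```python
def build_table(pairs: dict) -> bytes:
    """Build a 256-byte translation table; bytes below 0x80 stay unchanged."""
    table = bytearray(range(256))
    for src, dst in pairs.items():
        if not 0x80 <= src <= 0xFF:
            raise ValueError("only bytes 0x80..0xFF may be remapped")
        table[src] = dst
    return bytes(table)


# Three of the 128 possible pairs: Š, Č, Ž from code page 1250 to 852.
custom = build_table({0x8A: 0xE6, 0xC8: 0xAC, 0x8E: 0xA6})

converted = b"\x8aport".translate(custom)
print(converted.decode("cp852"))
```

`bytes.translate` applies the table in a single pass, so the cost per byte is one lookup regardless of how many pairs are defined.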

Steffen


Posted: 09 Dec 2016 01:44
@aGerman:

My initial thought was a translation from EBCDIC to ASCII, which I had to use in the past. I did not check whether the current WinAPI can do this. So if this is not supported by the API, then we can call it a 'custom' translation table.
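As a point of comparison, an EBCDIC to ASCII translation is a plain code page conversion when a matching code page definition is available. The Python sketch below uses Python's bundled `cp037` codec (EBCDIC US/Canada) as a stand-in; it says nothing about which EBCDIC code pages the Windows API offers on a given system, and the function name is an assumption.

```python
def ebcdic_to_ascii(data: bytes, ebcdic_codec: str = "cp037") -> bytes:
    """Decode EBCDIC bytes and re-encode them as ASCII ('?' if unmappable)."""
    text = data.decode(ebcdic_codec)
    return text.encode("ascii", errors="replace")


# "Hello" in EBCDIC code page 037.
sample = bytes([0xC8, 0x85, 0x93, 0x93, 0x96])
print(ebcdic_to_ascii(sample))
```

Unlike the ANSI/OEM case, even the letters A to Z sit at different byte values here, so the whole byte range takes part in the translation rather than only 0x80 to 0xFF.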

As I said: this was just an idea - the question is if it is really needed.

Thanks.

Saso

