CONVERTCP.exe - Convert text from one code page to another

aGerman
Expert
Posts: 3173
Joined: 22 Jan 2010 18:01
Location: Germany

Re: CONVERTCP.exe - Convert text from one code page to another

#16 Post by aGerman » 09 Dec 2016 04:50

Please have a look at the list of code pages:
https://msdn.microsoft.com/en-us/library/dd317756.aspx
The list already includes EBCDIC code pages such as 037, 500, 1026, 1047, and 1140-1149. If you still have some EBCDIC data, you could run some tests.
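
To get a feeling for what such a conversion does, here is a minimal sketch in Python (not the tool's own C code). It assumes that Python's built-in "cp037" codec corresponds to Windows code page 037; the sample bytes are made up for illustration.

```python
# Sketch only: Python's "cp037" codec stands in for Windows code page 037.
ebcdic_bytes = bytes([0xC8, 0x85, 0x93, 0x93, 0x96])  # "Hello" in EBCDIC (CP 037)

text = ebcdic_bytes.decode("cp037")  # EBCDIC bytes -> Unicode text
utf8_bytes = text.encode("utf-8")    # Unicode text -> UTF-8 bytes

print(text)
print(utf8_bytes)
```

The same two steps (decode with the source code page, encode with the target code page) apply to any pair of code pages.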

Steffen

#17 Post by aGerman » 09 Dec 2016 17:44

More or less by chance I found a serious bug that could occur while reading UTF-8. It is fixed in v1.3.2.

Steffen

#18 Post by aGerman » 28 Dec 2016 09:04

Version 1.4.0 adds support for UTF-16 big endian (something that batch can't handle natively). Use code page ID 1201.
You can now also specify the source and destination files directly using the options /i and /o. Of course, redirections still work.
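
For readers who want to see what a UTF-16 big endian conversion amounts to, here is a small Python sketch. It is not how CONVERTCP is implemented; the function name `utf16be_to_utf8` and the sample bytes are made up for illustration.

```python
def utf16be_to_utf8(data: bytes) -> bytes:
    """Decode UTF-16 big endian bytes and re-encode them as UTF-8."""
    # Strip a leading byte order mark (FE FF) if present.
    if data.startswith(b"\xfe\xff"):
        data = data[2:]
    return data.decode("utf-16-be").encode("utf-8")


# BOM, then "H", "i" and the EURO SIGN (U+20AC), high byte first.
sample = b"\xfe\xff\x00H\x00i\x20\xac"
print(utf16be_to_utf8(sample))
```

In big endian order the high byte of each 16-bit unit comes first, which is the opposite of the UTF-16LE layout Windows normally uses.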

Steffen

#19 Post by aGerman » 23 Jan 2017 09:05

I did some code profiling over the weekend. The outcome is that threading the conversion isn't as important as I expected. It makes more sense to separate reading from writing on the file system, because those are the slow operations. I changed the behavior so that writing is done in a parallel thread while the next chunk of data is read. Surprisingly, I got the best performance when both converting and writing run together in one thread.
To cut it short: the performance increase is small but measurable, so I'd like to share it as version 1.4.1.
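
The threading layout described above can be sketched roughly as follows. This is a Python illustration, not the tool's C source; `convert_stream`, `CHUNK_SIZE` and the queue depth are assumptions chosen for the example.

```python
import codecs
import io
import queue
import threading

CHUNK_SIZE = 64 * 1024  # hypothetical chunk size, not taken from the tool


def convert_stream(src, dst, src_enc, dst_enc, chunk_size=CHUNK_SIZE):
    """Read chunks in the calling thread; convert and write in a worker."""
    chunks = queue.Queue(maxsize=4)  # bounded, so memory use stays small
    errors = []

    def convert_and_write():
        # Incremental codecs keep a character intact even when its bytes
        # are split across two chunks.
        decoder = codecs.getincrementaldecoder(src_enc)()
        encoder = codecs.getincrementalencoder(dst_enc)()
        while True:
            chunk = chunks.get()
            if chunk is None:  # sentinel: the reader is done
                break
            if errors:
                continue  # keep draining so the reader never blocks
            try:
                dst.write(encoder.encode(decoder.decode(chunk)))
            except (UnicodeError, OSError) as exc:
                errors.append(exc)
        if not errors:
            try:
                tail = decoder.decode(b"", final=True)
                dst.write(encoder.encode(tail, final=True))
            except (UnicodeError, OSError) as exc:
                errors.append(exc)

    worker = threading.Thread(target=convert_and_write)
    worker.start()
    try:
        while True:
            chunk = src.read(chunk_size)  # overlaps with the worker's writing
            if not chunk:
                break
            chunks.put(chunk)
    finally:
        chunks.put(None)
        worker.join()
    if errors:
        raise errors[0]


source = io.BytesIO("Grüße €".encode("cp1252") * 1000)
target = io.BytesIO()
convert_stream(source, target, "cp1252", "utf-8", chunk_size=7)
print(len(target.getvalue()))
```

The bounded queue is the design choice that matters here: the reader can run ahead by a few chunks, but never loads the whole file into memory.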

Steffen

miskox
Posts: 308
Joined: 28 Jun 2010 03:46

#20 Post by miskox » 26 Jan 2017 11:33

Thanks for the update.

Saso

carlos
Posts: 431
Joined: 20 Aug 2010 13:57
Location: Chile

#21 Post by carlos » 29 Jan 2017 02:37

Great tool. I reduced the executable size to 8 KB. PM sent.

#22 Post by aGerman » 29 Jan 2017 03:30

Thank you Carlos!

I will definitely try some of the compiler options in order to reduce the size of the tool. Unfortunately the tool you sent me was immediately removed by Avira (free antivirus) :( There are some good reasons why my tool has a few extra KBs. I'll explain it via PM.

Steffen

#23 Post by aGerman » 29 Jan 2017 08:26

I managed to add carlos' size improvements. See the comments on the DECREASE_SIZE_GCC macro in the source code. That way the size of the utility was reduced by half (without a noticeable performance increase, though).
In order to preserve cross-compiler support I added a few pre-processor directives for retrieving the arguments UTF-16 encoded.

Since I don't have any experience with this kind of size optimization yet, please report whether the new version causes false positives in your antivirus software.

Steffen

#24 Post by aGerman » 30 Jan 2017 11:52

According to a test on VirusTotal, the executables uploaded with version 1.4.2 do not trigger any detections. I hope this holds true in the real world, too.

Steffen

https://www.virustotal.com/en/file/a8d6 ... 485797283/
https://www.virustotal.com/en/file/7562 ... 485797365/

#25 Post by aGerman » 02 Feb 2017 12:21

Version 1.4.3 adds the option /a to append to an existing file. See the initial post.
Again I checked the executables on VirusTotal. No false positives detected.
https://www.virustotal.com/en/file/53c0 ... 486055380/
https://www.virustotal.com/en/file/f552 ... 486055433/

As always - the updated file can be found in the initial post of this thread.

Now I'm out of ideas (and I'm tired of reading the source code over and over). I'll archive it and leave it alone :wink:

Steffen

#26 Post by aGerman » 23 Feb 2017 11:58

Quoted from this post:
viewtopic.php?f=3&t=7703&p=51312#p51310
penpen wrote: I have tested your CONVERTCP utility and read the source code:
I saw no error, but I noticed that your tool does more than just convert between code pages: it also approximates characters that are not in the target code page (which is not that bad, because cmd.exe does the same, but I would mention it somewhere).
For example, I created a file "string.txt" with this content (I hope it is not corrupted), encoded as UTF-8:

Code: Select all

ĀāĂăĄąĆćĈĉĊċČčĎďĐđĒēĔĕĖėĘęĚěĜĝĞğĠġĢģĤĥĦħĨĩ

If you convert it to codepage 850 you get:

Code: Select all

AaAaAaCcCcCcCcDdDdEeEeEeEeEeGgGgGgGgHhHhIi

As far as I know, the recommended behaviour in such cases is to use the REPLACEMENT CHARACTER, a question mark, a square, or a question mark in a square.
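
The difference between the two strategies can be illustrated with a short Python sketch. It only imitates the effect: Windows uses its own best-fit tables, whereas the hypothetical `approximate` function below falls back to Unicode decomposition, so the results are not identical in every case (for example, Đ has no decomposition and stays unmapped here, while the tool maps it to D).

```python
import unicodedata


def strict_replace(text: str, codepage: str = "cp850") -> bytes:
    """Encode, replacing every unmappable character with '?'."""
    return text.encode(codepage, errors="replace")


def approximate(text: str, codepage: str = "cp850") -> bytes:
    """Encode, trying the base letter of each unmappable character first."""
    out = []
    for ch in text:
        try:
            ch.encode(codepage)
            out.append(ch)  # representable as-is
        except UnicodeEncodeError:
            # Decompose (e.g. "Ā" -> "A" + combining macron), drop the marks.
            base = "".join(
                c for c in unicodedata.normalize("NFKD", ch)
                if not unicodedata.combining(c)
            )
            out.append(base)
    # Anything still unmappable after decomposition becomes '?'.
    return "".join(out).encode(codepage, errors="replace")


print(strict_replace("ĀāĂăĆć"))
print(approximate("ĀāĂăĆć"))
```

The first call yields only question marks, the second yields the plain base letters, which is the kind of output shown in the quoted example.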

This is by design and actually wanted behavior.

1) https://msdn.microsoft.com/en-us/library/windows/desktop/dd374130(v=vs.85).aspx

Code: Select all

int WideCharToMultiByte(
  _In_      UINT    CodePage,
  _In_      DWORD   dwFlags,
  _In_      LPCWSTR lpWideCharStr,
  _In_      int     cchWideChar,
  _Out_opt_ LPSTR   lpMultiByteStr,
  _In_      int     cbMultiByte,
  _In_opt_  LPCSTR  lpDefaultChar,
  _Out_opt_ LPBOOL  lpUsedDefaultChar
);

...
lpDefaultChar [in, optional]
...
For the CP_UTF7 and CP_UTF8 settings for CodePage, this parameter must be set to NULL. Otherwise, the function fails with ERROR_INVALID_PARAMETER.

lpUsedDefaultChar [out, optional]
...
For the CP_UTF7 and CP_UTF8 settings for CodePage, this parameter must be set to NULL. Otherwise, the function fails with ERROR_INVALID_PARAMETER.
...

That means at least for UTF-7 and UTF-8 I'm not even able to define a default character.
I noted this behavior in my first reply to Dave:
viewtopic.php?f=3&t=7570#p50285

2) The reason why I don't even want to work around it is that the utility was requested by miskox. He told me via email:
I 'patched' original .exe to make another .exe version with NOCSZ (that is NOČŠŽ) which replaces ČŠŽĐĆ characters with ordinary CZSDC - depending on the input code page.

That's why I called it "wanted behavior".

Steffen

#27 Post by aGerman » 18 Mar 2017 08:15

I was asked to add another option in order to automatically replace the original file content with the converted content. I won't do so.

The utility was designed to convert big files. That means it doesn't read the whole content into memory before it begins the conversion, both to avoid running out of RAM and to be able to read and convert data in parallel threads. Concurrent access to the same file could cause data loss, especially if the converted data is bigger than the data read.
Of course I could let the tool automatically write to a temporary file and replace the original file once the conversion has finished. But as soon as the temporary file and the original file are on different volumes, this would require physically copying the data, which wastes time and resources.

So I would rather leave it in your hands. Moving a file to a new name on the same logical drive only changes the file's directory entry; no data is copied. Example:

Code: Select all

convertcp 1 65001 /b /i "test.txt" /o "test.txt.temp~"
if not errorlevel 1 move /y "test.txt.temp~" "test.txt"
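
The same "convert to a temporary file, then move it over the original" pattern, sketched in Python for comparison. The function name `convert_in_place` is made up; creating the temporary file in the same folder as the original is the assumption that keeps the final move on one volume.

```python
import os
import tempfile


def convert_in_place(path, src_enc, dst_enc):
    """Convert to a temporary file next to `path`, then swap it in."""
    folder = os.path.dirname(os.path.abspath(path))
    # Same folder as the original, so the final move stays on one volume.
    fd, tmp_path = tempfile.mkstemp(dir=folder, suffix=".temp~")
    try:
        with os.fdopen(fd, "w", encoding=dst_enc, newline="") as dst, \
                open(path, "r", encoding=src_enc, newline="") as src:
            for line in src:  # line by line, never the whole file at once
                dst.write(line)
        os.replace(tmp_path, path)  # only reached if the conversion succeeded
    except BaseException:
        os.unlink(tmp_path)  # leave the original untouched on failure
        raise
```

As in the batch example, the original file is only replaced after the conversion has completed without an error.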


Steffen

#28 Post by aGerman » 27 May 2017 09:44

I didn't like having only a link to the list of Code Page Identifiers in the help message. That's why I added the option /l, which displays a list of the code pages installed on your computer, together with their descriptions and information on how they can be used as input code page (see section "additional information" of the initial post).

Virustotal didn't find any false positives for version 1.4.4.
x86: https://www.virustotal.com/en/file/33108943bf6f8575a49873c44d0eef7ce30ffdd4af7f8564f6c2f8339171581c/analysis/
x64: https://www.virustotal.com/en/file/961bf49a7e624709742cde83ae5739f8e1f949a6e08e0e1a9f29e1f075afa9a4/analysis/

Steffen

dbenham
Expert
Posts: 2016
Joined: 12 Feb 2011 21:02
Location: United States (east coast)

#29 Post by dbenham » 25 Sep 2017 16:41

Great tool Steffen.

I have another option for you: the new features in JREPL.BAT version 7 (currently v7.4) can also be used to transform a text file from one encoding to another. I believe it is more restrictive about which character sets can be used, because it only supports your machine's native code page, UTF-16LE, and code pages that have valid internet character set names. EDIT - Actually it is not that bad. Here is a page that lists code pages along with their internet (.NET) names. Most of the code pages have a valid name.

Here is an example that transforms 1252 to UTF-8:

Code: Select all

jrepl "^" "" /f "source.txt|Windows-1252" /o "destination.txt|UTF-8"

But JREPL has a significant advantage in that you can provide custom transformations for source characters that do not exist in the target character set. This could satisfy Saso's "custom character set" request. This is probably easiest to accomplish by using the JREPL /T option.

One thing that is pretty cool is that with the /X option, you can specify a character using the \xnn escape sequence, where nn is the hexadecimal byte code for the relevant character set. Within a search string it uses the input character set. Within a replacement string it uses the output character set. The \xnn sequence only works properly if the character set is a single byte character set.
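
Why the character set matters for a byte escape can be shown with a few lines of Python (used here only as a neutral illustration, not as JREPL syntax): the same byte value denotes different characters in different single byte code pages, and is not a complete character at all in a multi-byte encoding.

```python
byte = bytes([0xC8])  # the single raw byte that an escape like \xC8 denotes

in_1250 = byte.decode("cp1250")  # Windows-1250 (Central European)
in_1252 = byte.decode("cp1252")  # Windows-1252 (Western European)
print(in_1250, in_1252)

# In UTF-8 a lone 0xC8 is only the first half of a two-byte sequence.
try:
    byte.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False
print(valid_utf8)
```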

With the /T "FILE" option, you can place all your search terms in one file, one per line, and all your replacement (transform) terms in a 2nd file. This helps prevent out of control command line lengths. Another cool feature is you can specify that the search file matches the input character set, and the replacement file matches the output character set.

There is no need for the transformations to involve just single characters. One input character can be transformed into multiple output characters, and vice versa.

Here is an example of what a custom transformation could look like (without specifying the actual custom transformations)

Code: Select all

jrepl "1252to1250find.txt|Windows-1252" "1252to1250repl.txt|Windows-1250" /x /t file /f "source.txt|Windows-1252" /o "destination.txt|Windows-1250"
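
The idea of a custom transformation table can be sketched in Python as well. This is not how JREPL works internally; `CUSTOM_MAP` and `transform` are hypothetical names, and the mapping entries are examples only.

```python
# Hypothetical mapping; one source character may become several characters.
CUSTOM_MAP = {"Č": "C", "Š": "S", "Ž": "Z", "Đ": "D", "Ć": "C", "€": "EUR"}


def transform(text: str, target_enc: str, mapping=CUSTOM_MAP) -> bytes:
    """Encode `text`, substituting unmappable characters via `mapping`."""
    out = []
    for ch in text:
        try:
            ch.encode(target_enc)
            out.append(ch)  # exists in the target character set
        except UnicodeEncodeError:
            out.append(mapping.get(ch, "?"))  # custom substitute, else '?'
    return "".join(out).encode(target_enc)


print(transform("ČŠŽ € é", "latin-1"))
```

Characters that exist in the target character set pass through unchanged; only the unmappable ones are looked up in the table.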


Dave Benham

#30 Post by aGerman » 26 Sep 2017 10:18

Dave

I'm quite interested in JREPL.BAT, as you know. Using ADO streams was a huge improvement. And as I see it, it's a good alternative to CONVERTCP.

Of course everything has pros and cons. What I really like is that JREPL doesn't need any 3rd party tools. That's something CONVERTCP can't compete with. To compensate for this deficiency a little, I used C and WinAPI (which runs natively and doesn't depend on .NET or Java), I provided the source code (so that people can read or edit it and compile the tool themselves), and I added a program flow chart (because an executable is like a black box where you can't see how it works).
On the other hand, the main scopes of JREPL and CONVERTCP are quite different. JREPL is able (and designed) to do customized replacements, while CONVERTCP can't do that. But that narrower scope is also why CONVERTCP is so much faster for big files: in my tests, converting 360 MB of text from Windows-1252 to UTF-8 took 307 s with JREPL vs. 9 s with CONVERTCP, because CONVERTCP converts and writes efficiently in parallel threads. Converting big files was one of Saso's original requirements.

I don't want to make a fuss about CONVERTCP. Initially it was a gift to Saso, who suggested publishing it. So why not :wink: It seems to be helpful for some people. Version 1.4.4 has been downloaded ~120 times so far. That's approx. once a day and thus maybe 10 times more than I would ever have expected, but it isn't comparable to the attention JREPL gets :wink: In the end there is no "nec plus ultra". The users have to decide which tool meets their needs best. For that reason I really appreciate that you left some notes and a link to JREPL.BAT in this thread. This will give users more opportunities to find the right tool for their tasks :D

Steffen

This tangential topic about JREPL continues at JREPL.BAT, ADO Streams and big files
