CONVERTCP.exe - Convert text from one code page to another

jfl
Posts: 226
Joined: 26 Oct 2012 06:40
Location: Saint Hilaire du Touvet, France

Re: CONVERTCP.exe - Convert text from one code page to another

#121 Post by jfl » 18 Aug 2020 09:32

After how many bytes do you actually stop trying to figure out what encoding you got?
Currently I scan the whole file after loading it into memory, counting NULs at even offsets, NULs at odd offsets, non-ASCII bytes, and invalid UTF-8 sequences. At the end, based on these counts, I select the most likely encoding among UTF-16, UTF-8, and ANSI.
A simple improvement could be to abort the scan as soon as any counter passes a given limit. I'm pretty sure the impact on performance would be negligible.
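
The counting approach described above can be sketched roughly as follows. This is only an illustration, not the actual code of either tool: the function name `guess_encoding`, the default `limit`, and the final decision rules are assumptions, and the UTF-8 check is simplified (it ignores overlong forms and surrogates).

```python
def guess_encoding(data: bytes, limit: int = 1000) -> str:
    """Guess 'utf-16', 'utf-8' or 'ansi' from simple byte counters.

    Hypothetical sketch: scanning stops early once any counter reaches `limit`.
    """
    even_nuls = odd_nuls = non_ascii = invalid_utf8 = 0
    pending = 0  # continuation bytes still expected in the current UTF-8 sequence

    for i, b in enumerate(data):
        if b == 0:
            if i % 2 == 0:
                even_nuls += 1
            else:
                odd_nuls += 1
        elif b >= 0x80:
            non_ascii += 1

        # Simplified UTF-8 validation.
        if pending and 0x80 <= b <= 0xBF:
            pending -= 1                  # expected continuation byte
        else:
            if pending:                   # sequence was cut short
                invalid_utf8 += 1
                pending = 0
            if 0xC2 <= b <= 0xDF:
                pending = 1               # lead byte of a 2-byte sequence
            elif 0xE0 <= b <= 0xEF:
                pending = 2               # lead byte of a 3-byte sequence
            elif 0xF0 <= b <= 0xF4:
                pending = 3               # lead byte of a 4-byte sequence
            elif b >= 0x80:
                invalid_utf8 += 1         # stray continuation or invalid lead byte

        # Early abort: enough evidence has been collected.
        if max(even_nuls, odd_nuls, non_ascii, invalid_utf8) >= limit:
            break
    else:
        # Only when the whole input was scanned: a dangling sequence is invalid.
        if pending:
            invalid_utf8 += 1

    # Assumed decision rules; a real detector would weigh the counts instead.
    if even_nuls or odd_nuls:
        return "utf-16"
    if invalid_utf8 == 0:
        return "utf-8"                    # also covers pure ASCII
    return "ansi"
```

With a small `limit`, a large file is classified after only a few hundred bytes of evidence instead of after a full pass.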

aGerman
Expert
Posts: 4654
Joined: 22 Jan 2010 18:01
Location: Germany

Re: CONVERTCP.exe - Convert text from one code page to another

#122 Post by aGerman » 18 Aug 2020 10:12

Since I implemented my own UTF-8 conversion algorithm, I also have to validate character by character in this case, because I have to insert a replacement character whenever an invalid sequence is found. But anything beyond that would conflict with the design of CONVERTCP. Usually I read only up to 1 MB at once so that reading, converting, and writing can run in separate threads. This has been a huge performance improvement for big files. Having to validate the whole chunk of data before actually converting it is what currently keeps me from supporting multithreading for DBCS code pages. Unlike with UTF-8 and UTF-16 encoded text, you would have to iterate over the entire chunk to figure out whether it ends at a lead byte or whether you got the whole character at the buffer boundary.
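
To illustrate the buffer boundary problem, here is a minimal sketch (not CONVERTCP's code; all function names are made up for this example). For UTF-8 it is enough to look at the last three bytes of a chunk, because lead and continuation bytes are distinguishable. For a DBCS code page such as cp932, a trail byte can have the same value as a lead byte, so the chunk has to be walked from the start.

```python
def utf8_incomplete_tail(chunk: bytes) -> int:
    """Number of trailing bytes that begin a UTF-8 sequence cut off by the chunk end."""
    for back in range(1, min(3, len(chunk)) + 1):
        b = chunk[-back]
        if b < 0x80:
            return 0                      # ASCII byte: nothing is pending
        if b >= 0xC0:                     # lead byte found
            need = 2 if b < 0xE0 else 3 if b < 0xF0 else 4
            return back if back < need else 0
        # otherwise a continuation byte: keep looking backwards
    return 0


def cp932_is_lead(b: int) -> bool:
    """Lead byte ranges of code page 932 (Shift-JIS)."""
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC


def dbcs_ends_on_lead_byte(chunk: bytes, is_lead) -> bool:
    """True if the chunk ends in the middle of a double-byte character.

    The whole chunk is scanned: only by pairing bytes from the start can we
    tell whether the last byte is a lead byte or a trail byte.
    """
    i = 0
    while i < len(chunk):
        if is_lead(chunk[i]):
            if i + 1 == len(chunk):
                return True               # lead byte without its trail byte
            i += 2                        # skip lead byte and trail byte
        else:
            i += 1
    return False
```

For example, the bytes 0x82 0x9F form one cp932 character whose trail byte 0x9F lies inside the lead byte range, which is why looking only at the last byte would give the wrong answer.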

Steffen

aGerman
Expert
Posts: 4654
Joined: 22 Jan 2010 18:01
Location: Germany

Re: CONVERTCP.exe - Convert text from one code page to another

#123 Post by aGerman » 18 Aug 2020 15:46

As recently discussed with Jean-François, I think it is a pretty good idea to print the output as UTF-16 whenever the app detects a text device (such as a console or a terminal). Thank you for this suggestion!
In these cases the CP_out argument will simply not be used, even though you still have to pass it for syntax reasons. That way the possible characters are no longer restricted by the character set. The only remaining limits are the characters the font supports and the ability of the terminal to output glyphs outside of the Basic Multilingual Plane.
I tested in the usual conhost window as well as in Windows Terminal and in ConEmu. Windows Terminal even supports the output of emojis.
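
The decision itself is simple and might be sketched like this. The function name `pick_output_encoding` is hypothetical, and the sketch only shows the choice, not the actual console output. In C on Windows a check of this kind is commonly done with `GetFileType` or `GetConsoleMode`; which one CONVERTCP uses is not stated here.

```python
import sys


def pick_output_encoding(stream, cp_out: str) -> str:
    """Return 'utf-16' for a text device, otherwise the requested code page.

    `stream` is any object with an isatty() method, e.g. sys.stdout.
    """
    return "utf-16" if stream.isatty() else cp_out


if __name__ == "__main__":
    # When stdout is redirected to a file or pipe, cp_out is honoured.
    print(pick_output_encoding(sys.stdout, "cp437"))
```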

Virustotal scans of version 7.5:
x86: https://www.virustotal.com/gui/file/1c1 ... /detection
x64: https://www.virustotal.com/gui/file/ca4 ... /detection

Steffen

aGerman
Expert
Posts: 4654
Joined: 22 Jan 2010 18:01
Location: Germany

Re: CONVERTCP.exe - Convert text from one code page to another

#124 Post by aGerman » 23 Jun 2021 15:13

I can't believe I find myself doing this ... 😵

As of version 8, CONVERTCP supports codepage detection for the incoming stream. It is still more of a guess, so you should not rely on it.

Jean-François ran some tests with IMultiLanguage2::DetectInputCodepage, and the results were exciting enough to investigate how to improve on it.
viewtopic.php?f=3&t=10088

If you want to try it out, just use a question mark as the first argument.

Code: Select all

convertcp ? 65001 /i "a.txt" /o "b.txt"

CONVERTCP will try to determine the encoding of a.txt for the conversion.


You can also hint an encoding. Use the question mark followed by a preferred codepage ID. If reasonable, CONVERTCP will use it rather than the guessed encoding. However, if the guessed encoding is more likely than the hint, CONVERTCP won't use your preference.

Code: Select all

convertcp ?437 65001 /i "a.txt" /o "b.txt"

In both cases CONVERTCP will fail (no conversion performed) if it is unable to guess an encoding.


Virustotal scans of version 8.0:
x86: https://www.virustotal.com/gui/file/cac ... 96d1c78c7a
x64: https://www.virustotal.com/gui/file/8a5 ... f7ee9a5385

Steffen

findstr
Posts: 17
Joined: 09 Jun 2021 12:36

Re: CONVERTCP.exe - Convert text from one code page to another

#125 Post by findstr » 23 Jun 2021 19:56

aGerman wrote:
23 Jun 2021 15:13
...
Virustotal scans of version 8.0:
x86: https://www.virustotal.com/gui/file/cac ... 96d1c78c7a
x64: https://www.virustotal.com/gui/file/8a5 ... f7ee9a5385

Steffen

Are those false positives, or might the program actually be malware?

aGerman
Expert
Posts: 4654
Joined: 22 Jan 2010 18:01
Location: Germany

Re: CONVERTCP.exe - Convert text from one code page to another

#126 Post by aGerman » 24 Jun 2021 00:21

:lol: What do you think?
Have a look at post #72 in this thread where I hopefully explained it all. viewtopic.php?p=57914#p57914

Steffen

miskox
Posts: 553
Joined: 28 Jun 2010 03:46

Re: CONVERTCP.exe - Convert text from one code page to another

#127 Post by miskox » 24 Jun 2021 01:06

Steffen, thank you. Yes... you thought there would be no more updates...

I will give it a try, but it might take a week or two.

Thank you.

Saso

aGerman
Expert
Posts: 4654
Joined: 22 Jan 2010 18:01
Location: Germany

Re: CONVERTCP.exe - Convert text from one code page to another

#128 Post by aGerman » 24 Jun 2021 03:47

Hi Saso,

It was years ago that I thought that :lol: But actually this update is something I was quite sure I would never implement. Meanwhile I'm quite positive that codepage detection is useful as long as it is better than pure gambling. Even though it will never be foolproof ¯\_(ツ)_/¯

However, even if this is something that you may never use, it would be nice to know how backward-compatible the tool still is. So, yeah, if you still have the opportunity to try it on XP, I'd be happy to get some feedback on that :)

Steffen

aGerman
Expert
Posts: 4654
Joined: 22 Jan 2010 18:01
Location: Germany

Re: CONVERTCP.exe - Convert text from one code page to another

#129 Post by aGerman » 02 Jul 2021 12:58

Saso verified that the tool still runs on XP. Of course he also hinted that codepage detection is anything but perfect. It turns out that files with a reasonable amount of text work far better than files with only a few words. That was expected. However, files that don't consist of natural-language text also tend to fail, e.g. ANSI-encoded CSV data containing an unusually high proportion of commas (or semicolons) and digits compared to natural language. These skew the statistical byte distribution that the API functions expect to see. That's nothing I'm able to improve, though.

He also reminded me that the tool still throws error messages that are rather misleading. They should be better now. UTF-7 recognition was also somewhat broken and is fixed in v. 8.1.

EDIT: I made a mess with the error messages :lol: I'm updating the tool but leaving the version.

Virustotal scans of version 8.1:
x86: https://www.virustotal.com/gui/file/3fe ... b51ffab672
x64: https://www.virustotal.com/gui/file/94a ... a55f94e8f5

Steffen
Last edited by aGerman on 03 Jul 2021 12:14, edited 1 time in total.

aGerman
Expert
Posts: 4654
Joined: 22 Jan 2010 18:01
Location: Germany

Re: CONVERTCP.exe - Convert text from one code page to another

#130 Post by aGerman » 25 Jul 2021 10:25

Version 8.2 is a minor update. It adds the option to overwrite the original file with the converted content. Pass a single minus sign along with option /o to request this.

The minus sign is only treated as a placeholder for the input file name under the following conditions:
- option /i is specified
- option /a is not specified

The functionality is the same as previously recommended, but is automated now:
- a temporary file is created in the same directory (original file name with an appended GUID string)
- if the conversion is successful, the original file will be replaced by the temporary file
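
The same pattern, sketched under assumptions (the function name `convert_in_place` and the `convert` callback are made up for this example and say nothing about how CONVERTCP implements it):

```python
import os
import uuid


def convert_in_place(path: str, convert) -> None:
    """Replace `path` with convert(original bytes), via a temp file next to it."""
    # Temporary file in the same directory: original name plus a GUID string.
    tmp = f"{path}.{uuid.uuid4()}"
    try:
        with open(path, "rb") as src, open(tmp, "wb") as dst:
            dst.write(convert(src.read()))
        # Only reached if the conversion succeeded: swap the files.
        os.replace(tmp, path)
    except Exception:
        # On failure the original stays untouched and the temp file is removed.
        if os.path.exists(tmp):
            os.remove(tmp)
        raise
```

Keeping the temporary file in the same directory means the final replacement is a rename on the same volume rather than a copy.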

Virustotal scans of version 8.2:
x86: https://www.virustotal.com/gui/file/09d ... 4dfd848878
x64: https://www.virustotal.com/gui/file/d4d ... 4e740fc1a9

Steffen

miskox
Posts: 553
Joined: 28 Jun 2010 03:46

Re: CONVERTCP.exe - Convert text from one code page to another

#131 Post by miskox » 25 Jul 2021 11:56

Steffen, you can't stop?

I mean - thanks.
Saso

aGerman
Expert
Posts: 4654
Joined: 22 Jan 2010 18:01
Location: Germany

Re: CONVERTCP.exe - Convert text from one code page to another

#132 Post by aGerman » 18 Oct 2021 10:55

Version 8.3 improves the performance of UTF-8 identification. Furthermore, I found a way to shrink the binary size by ~20 KB.

Virustotal scans of version 8.3:
x86: https://www.virustotal.com/gui/file/4d1 ... d8fb1a3172
x64: https://www.virustotal.com/gui/file/e5a ... 8a46ec4c4a

Steffen

aGerman
Expert
Posts: 4654
Joined: 22 Jan 2010 18:01
Location: Germany

Re: CONVERTCP.exe - Convert text from one code page to another

#133 Post by aGerman » 29 May 2022 05:46

A single parenthesis in the wrong position in the source code broke the recognition of MIME names :lol:

Virustotal scans of version 8.4:
x86: https://www.virustotal.com/gui/file/470 ... 28561757c5
x64: https://www.virustotal.com/gui/file/b3c ... d7396c87db

Steffen

shantanu97
Posts: 3
Joined: 24 Apr 2022 00:45

Re: CONVERTCP.exe - Convert text from one code page to another

#134 Post by shantanu97 » 28 Jun 2022 22:40

The code below shows the error "Unable to create output file". The input folder contains a bunch of CSV files, and the output folder (UTF8Files) is where the CSV files should be saved in UTF-8 encoding. Note that my input folder and output folder are in different locations.

Code: Select all

md "C:\Users\ShantanuGupta\Documents\Test\UTF8Files"
for %%i in (C:\Users\ShantanuGupta\Desktop\UTF8\Input\*.csv) do convertcp.exe ? 65001 /i "%%~i" /o "C:\Users\ShantanuGupta\Documents\Test\UTF8Files\%%~i"
pause

Can anyone help me understand why my code is failing?

aGerman
Expert
Posts: 4654
Joined: 22 Jan 2010 18:01
Location: Germany

Re: CONVERTCP.exe - Convert text from one code page to another

#135 Post by aGerman » 29 Jun 2022 09:11

%%~i contains the whole path of the file. If you read the help message of FOR (run FOR /? in a cmd prompt) you'll find the modifiers of FOR variables. E.g. %%~nxi expands to the file name and file extension which is likely what you're looking for.
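
To see why the original command fails, the string concatenation can be reproduced outside of batch. The sketch below uses Python's `ntpath` module only to mimic what the FOR modifiers expand to; the file name `data.csv` is a made-up example. In the batch file itself the fix would be to end the /o argument with `%%~nxi` instead of `%%~i`.

```python
import ntpath

# What %%~i expands to: the full path of the input file (data.csv is hypothetical).
full = r"C:\Users\ShantanuGupta\Desktop\UTF8\Input\data.csv"
out_dir = r"C:\Users\ShantanuGupta\Documents\Test\UTF8Files"

# Output folder + %%~i: a second drive letter ends up in the middle of the path.
broken = out_dir + "\\" + full

# Output folder + %%~nxi: only the file name and extension are appended.
fixed = out_dir + "\\" + ntpath.basename(full)

print(broken)
print(fixed)
```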

Steffen
