CONVERTCP.exe - Convert text from one code page to another

jfl
Posts: 146
Joined: 26 Oct 2012 06:40
Location: Saint Hilaire du Touvet, France

Re: CONVERTCP.exe - Convert text from one code page to another

#121 Post by jfl » 18 Aug 2020 09:32

After how many bytes do you actually stop trying to figure out what encoding you got?
Currently I scan the whole file after loading it into memory, counting NULs at even offsets, NULs at odd offsets, non-ASCII bytes, and invalid UTF-8 sequences. At the end, based on these counts, I select the most likely encoding among UTF-16, UTF-8, or ANSI.
A simple improvement could be to abort the scan when any counter passes a given limit. I'm pretty sure the impact on performance would be negligible.
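To make the idea concrete, here is a minimal Python sketch of such a counting scan with an early abort. It is not the actual code of either tool. The function names, the default limit, the simplified UTF-8 check, and the final decision rules are all assumptions made for illustration.

```python
def utf8_continuations(b: int) -> int:
    """Continuation bytes expected after lead byte b, or -1 if b cannot start a sequence."""
    if b < 0x80:
        return 0
    if 0xC2 <= b <= 0xDF:
        return 1
    if 0xE0 <= b <= 0xEF:
        return 2
    if 0xF0 <= b <= 0xF4:
        return 3
    return -1  # stray continuation byte or invalid lead byte


def guess_encoding(data: bytes, limit: int = 4096) -> str:
    """Guess 'utf-16-le', 'utf-16-be', 'utf-8' or 'ansi' from simple byte counts.

    Hypothetical helper: the limit and the decision rules are assumptions.
    The UTF-8 check is simplified (no overlong or surrogate detection).
    """
    even_nuls = odd_nuls = non_ascii = invalid_utf8 = 0
    pending = 0      # continuation bytes still expected
    aborted = False
    for i, b in enumerate(data):
        if b == 0:
            if i % 2 == 0:
                even_nuls += 1
            else:
                odd_nuls += 1
        elif b >= 0x80:
            non_ascii += 1
        if pending and 0x80 <= b <= 0xBF:
            pending -= 1
        else:
            if pending:
                invalid_utf8 += 1    # sequence was cut short
            pending = utf8_continuations(b)
            if pending < 0:
                invalid_utf8 += 1
                pending = 0
        # Early abort: stop as soon as any counter passes the limit.
        if max(even_nuls, odd_nuls, non_ascii, invalid_utf8) > limit:
            aborted = True
            break
    if pending and not aborted:
        invalid_utf8 += 1            # file ends inside a sequence
    if even_nuls or odd_nuls:
        # ASCII text in UTF-16LE has its NULs at odd offsets.
        return "utf-16-le" if odd_nuls >= even_nuls else "utf-16-be"
    if non_ascii and not invalid_utf8:
        return "utf-8"
    return "ansi"
```

With these rules, `guess_encoding("héllo".encode("utf-8"))` yields `"utf-8"`, while the same text encoded as cp1252 yields `"ansi"` because the lone 0xE9 byte is not a valid UTF-8 sequence.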

aGerman
Expert
Posts: 4003
Joined: 22 Jan 2010 18:01
Location: Germany

#122 Post by aGerman » 18 Aug 2020 10:12

Since I implemented my own UTF-8 conversion algorithm, I also have to validate character by character in this case, because I have to insert a replacement character whenever an invalid sequence is found. But anything beyond that would conflict with the design of CONVERTCP. Usually I read only up to 1 MB at once so that reading, converting, and writing can run in separate threads. This has been a huge performance improvement for big files. Having to validate the whole chunk of data before actually converting it is what currently keeps me from supporting multithreading for DBCS code pages. Unlike with UTF-8 and UTF-16 encoded text, you have to iterate over the entire chunk to figure out whether it ends at a lead byte or whether you got the whole character at the buffer boundary.
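The difference at the buffer boundary can be sketched in Python as follows. This is only an illustration, not CONVERTCP's code. Both function names are hypothetical, and code page 932 (Shift-JIS) is picked as an example DBCS code page, with lead bytes 0x81-0x9F and 0xE0-0xFC. In UTF-8 a look at the last few bytes is enough. In a DBCS code page a trail byte can have the same value as a lead byte or an ASCII character, so the scan has to start at a known character boundary.

```python
def utf8_complete_length(chunk: bytes) -> int:
    """Length of the prefix of chunk that ends on a UTF-8 character boundary.

    Only the last few bytes are inspected, whatever the chunk size.
    """
    n = len(chunk)
    i = n - 1
    steps = 0
    # Step back over at most 3 continuation bytes (10xxxxxx).
    while i >= 0 and steps < 3 and 0x80 <= chunk[i] <= 0xBF:
        i -= 1
        steps += 1
    if i < 0:
        return n
    b = chunk[i]
    if b >= 0xF0:
        need = 4
    elif b >= 0xE0:
        need = 3
    elif b >= 0xC0:
        need = 2
    else:
        need = 1  # ASCII, or malformed input that is left to the converter
    return i if n - i < need else n


def is_cp932_lead(b: int) -> bool:
    """Lead byte test for code page 932 (example DBCS code page)."""
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC


def dbcs_complete_length(chunk: bytes, is_lead=is_cp932_lead) -> int:
    """Length of the prefix of chunk that ends on a DBCS character boundary.

    Has to walk the whole chunk from its start, which is assumed
    to be a character boundary.
    """
    i = 0
    n = len(chunk)
    while i < n:
        step = 2 if is_lead(chunk[i]) else 1
        if i + step > n:
            return i  # chunk ends right after a lead byte
        i += step
    return n
```

The bytes after the returned length would be carried over to the front of the next chunk. The first function costs the same for a 1 MB chunk as for a 10 byte chunk, while the second one grows with the chunk size.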

Steffen

aGerman
Expert
Posts: 4003
Joined: 22 Jan 2010 18:01
Location: Germany

#123 Post by aGerman » 18 Aug 2020 15:46

As recently discussed with Jean-François, I find it a pretty good idea to print the output as UTF-16 whenever the app detects a text device (such as a console or a terminal). Thank you for this suggestion!
In these cases the CP_out argument is simply not used, even though you still have to pass it for syntax reasons. That way the printable characters are no longer restricted by the character set of the output code page. The only remaining limits are the characters the font supports and the ability of the terminal to output glyphs outside of the Basic Multilingual Plane.
I tested in the usual conhost window as well as in Windows Terminal and in ConEmu. Windows Terminal even supports emoji output.
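The decision itself can be sketched in a few lines of Python. This only illustrates the logic and is not how CONVERTCP implements it. Both function names are hypothetical, and `isatty()` stands in for whatever text device check the tool really performs.

```python
import sys


def pick_output_encoding(stream, cp_out: str) -> str:
    """Return UTF-16 for text devices, otherwise the requested code page.

    Hypothetical helper: isatty() is an assumed stand-in for the
    tool's own console/terminal detection.
    """
    return "utf-16-le" if stream.isatty() else cp_out


def utf16_code_units(text: str) -> int:
    """Number of 16-bit code units needed to print text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2


if __name__ == "__main__":
    # cp_out is still passed, but ignored when stdout is a text device.
    print(pick_output_encoding(sys.stdout, "cp850"))
```

A character outside of the Basic Multilingual Plane, such as an emoji, takes two code units (a surrogate pair), so `utf16_code_units("😀")` is 2. Whether the pair is rendered as one glyph is then up to the terminal and the font.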

Virustotal scans of version 7.5:
x86: https://www.virustotal.com/gui/file/1c1 ... /detection
x64: https://www.virustotal.com/gui/file/ca4 ... /detection

Steffen
