CONVERTCP.exe - Convert text from one code page to another

carlos
Expert
Posts: 503
Joined: 20 Aug 2010 13:57
Location: Chile

#91 Post by carlos » 24 Dec 2019 09:54

Could CONVERTCP's XP support be affected by the following note from the documentation of the MultiByteToWideChar function?
Starting with Windows Vista, this function fully conforms with the Unicode 4.1 specification for UTF-8 and UTF-16. The function used on earlier operating systems encodes or decodes lone surrogate halves or mismatched surrogate pairs. Code written in earlier versions of Windows that rely on this behavior to encode random non-text binary data might run into problems. However, code that uses this function on valid UTF-8 strings will behave the same way as on earlier Windows operating systems.

aGerman
Expert
Posts: 4654
Joined: 22 Jan 2010 18:01
Location: Germany

#92 Post by aGerman » 24 Dec 2019 10:41

Yes, I know this bug :lol: In other words, WideCharToMultiByte converts to CESU-8 instead of UTF-8 on XP. Fortunately, this rarely matters in practice: surrogate pairs are uncommon in natural-language UTF-16 text. Only characters outside the Basic Multilingual Plane, such as some CJK characters, require them.
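To make the difference concrete, here is a minimal Python sketch (not taken from CONVERTCP, which is written in C) that encodes one character outside the BMP both ways. The helper name `cesu8_encode` is made up for illustration, and the input is assumed to be a valid string without lone surrogates.

```python
def cesu8_encode(text: str) -> bytes:
    """Encode text the CESU-8 way: every UTF-16 surrogate gets its own 3-byte sequence.

    Hypothetical helper for illustration; assumes `text` contains no lone surrogates.
    """
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            # Split the supplementary code point into a UTF-16 surrogate pair.
            v = cp - 0x10000
            high = 0xD800 + (v >> 10)
            low = 0xDC00 + (v & 0x3FF)
            for unit in (high, low):
                # "surrogatepass" makes Python emit the 3-byte form of a surrogate.
                out += chr(unit).encode("utf-8", "surrogatepass")
        else:
            out += ch.encode("utf-8")
    return bytes(out)


sample = "\U00010348"            # GOTHIC LETTER HWAIR, outside the BMP
utf8 = sample.encode("utf-8")    # proper UTF-8: one 4-byte sequence
cesu8 = cesu8_encode(sample)     # CESU-8: two 3-byte sequences
print(utf8.hex(" "))             # f0 90 8d 88
print(cesu8.hex(" "))            # ed a0 80 ed bd 88
```

For text that stays inside the BMP, both encodings produce identical bytes, which is why the bug goes unnoticed most of the time.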

Actually, the conversion from UTF-16 to UTF-8 and vice versa is simple math, though it requires an intermediate conversion to UTF-32 to avoid CESU-8. I have already developed the code for that, but I'm still hesitant to integrate it into CONVERTCP, since a hand-rolled conversion would certainly not be as fast as MultiByteToWideChar and WideCharToMultiByte. It would also only affect conversions from/to UTF-8; I still need the API functions for other charsets.
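The "simple math" can be sketched roughly like this in Python. This is only an illustration, not the actual CONVERTCP code: the function names are invented, and replacing a lone surrogate with U+FFFD is an assumption about how one might handle malformed input.

```python
def to_units(text: str) -> list:
    """Helper for the demo: split a string into UTF-16 code units (ints)."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def utf16_to_utf8(units) -> bytes:
    """Convert UTF-16 code units to UTF-8 via the code point (UTF-32) value."""
    out = bytearray()
    i, n = 0, len(units)
    while i < n:
        u = units[i]
        i += 1
        if 0xD800 <= u <= 0xDBFF and i < n and 0xDC00 <= units[i] <= 0xDFFF:
            # Valid surrogate pair: combine both halves into one code point.
            cp = 0x10000 + ((u - 0xD800) << 10) + (units[i] - 0xDC00)
            i += 1
        elif 0xD800 <= u <= 0xDFFF:
            cp = 0xFFFD  # lone surrogate: substitute U+FFFD (assumption)
        else:
            cp = u
        # Encode the code point as 1 to 4 UTF-8 bytes.
        if cp < 0x80:
            out.append(cp)
        elif cp < 0x800:
            out += bytes((0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)))
        elif cp < 0x10000:
            out += bytes((0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)))
        else:
            out += bytes((0xF0 | (cp >> 18), 0x80 | ((cp >> 12) & 0x3F),
                          0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)))
    return bytes(out)


text = "a\u00e9\u20ac\U00010348"
print(utf16_to_utf8(to_units(text)) == text.encode("utf-8"))
```

Encoding each surrogate half separately instead of combining the pair first is exactly what yields CESU-8.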

Steffen

#93 Post by carlos » 24 Dec 2019 12:05

Thanks, Steffen, for the explanation. I had never heard of CESU-8. It would be nice if CONVERTCP had code to prevent this bug from occurring. Otherwise, I think the documentation should state that UTF-8 conversion is not 100% reliable on XP, in which case the XP support is not real. Real XP support would produce the same UTF-8 conversion whether you run CONVERTCP on XP or on Windows 10.

#94 Post by aGerman » 24 Dec 2019 12:20

OK, I'll implement it in one of the next versions. This requires extensive code profiling beforehand. I'm afraid it would destroy the performance of the tool otherwise :wink:

#95 Post by carlos » 24 Dec 2019 13:31

Maybe the custom code could run only on XP, so that performance on the other platforms is not affected.

miskox
Posts: 553
Joined: 28 Jun 2010 03:46

#96 Post by miskox » 25 Dec 2019 12:25

Steffen: thanks again for all the updates. If required, you can remove the XP support. I will then use the latest version that still supports XP (as mentioned, I only need CP852 <-> CP1250 conversion).

And it looked like such a simple project...

Saso

#97 Post by miskox » 25 Dec 2019 12:38

aGerman wrote:
24 Dec 2019 06:21
I received the information that option /l is broken on XP. The update to v7.1 is supposed to fix that. Although I have to wait for feedback since I can't test on XP anymore.

Virustotal scans of version 7.1:
x86: https://www.virustotal.com/gui/file/860 ... /detection
x64: https://www.virustotal.com/gui/file/3b4 ... /detection

Steffen

/L (version 7.1 on XP 32-bit PRO) returns this:

Code   |     Supported As     | Description
Page ID| Input Stream >511 MB |
-------+----------------------+--------------------------------------------------
    37 |         Yes          | 37    (IBM EBCDIC - U.S./Canada)
   437 |         Yes          | 437   (OEM - United States)
   500 |         Yes          | 500   (IBM EBCDIC - International)
   737 |         Yes          | 737   (OEM - Greek 437G)
   775 |         Yes          | 775   (OEM - Baltic)
   850 |         Yes          | 850   (OEM - Multilingual Latin I)
   852 |         Yes          | 852   (OEM - Latin II)
   855 |         Yes          | 855   (OEM - Cyrillic)
   857 |         Yes          | 857   (OEM - Turkish)
   860 |         Yes          | 860   (OEM - Portuguese)
   861 |         Yes          | 861   (OEM - Icelandic)
   863 |         Yes          | 863   (OEM - Canadian French)
   865 |         Yes          | 865   (OEM - Nordic)
   866 |         Yes          | 866   (OEM - Russian)
   869 |         Yes          | 869   (OEM - Modern Greek)
   874 |         Yes          | 874   (ANSI/OEM - Thai)
   875 |         Yes          | 875   (IBM EBCDIC - Modern Greek)
   932 |         No           | 932   (ANSI/OEM - Japanese Shift-JIS)
   936 |         No           | 936   (ANSI/OEM - Simplified Chinese GBK)
   949 |         No           | 949   (ANSI/OEM - Korean)
   950 |         No           | 950   (ANSI/OEM - Traditional Chinese Big5)
  1026 |         Yes          | 1026  (IBM EBCDIC - Turkish (Latin-5))
  1200 |         Yes          | 1200  (UTF-16 Little Endian Byte Order)
  1201 |         Yes          | 1201  (UTF-16 Big Endian Byte Order)
  1250 |         Yes          | 1250  (ANSI - Central Europe)
  1251 |         Yes          | 1251  (ANSI - Cyrillic)
  1252 |         Yes          | 1252  (ANSI - Latin I)
  1253 |         Yes          | 1253  (ANSI - Greek)
  1254 |         Yes          | 1254  (ANSI - Turkish)
  1255 |         Yes          | 1255  (ANSI - Hebrew)
  1256 |         Yes          | 1256  (ANSI - Arabic)
  1257 |         Yes          | 1257  (ANSI - Baltic)
  1258 |         Yes          | 1258  (ANSI/OEM - Viet Nam)
 10000 |         Yes          | 10000 (MAC - Roman)
 10006 |         Yes          | 10006 (MAC - Greek I)
 10007 |         Yes          | 10007 (MAC - Cyrillic)
 10010 |         Yes          | 10010 (MAC - Romania)
 10017 |         Yes          | 10017 (MAC - Ukraine)
 10029 |         Yes          | 10029 (MAC - Latin II)
 10079 |         Yes          | 10079 (MAC - Icelandic)
 10081 |         Yes          | 10081 (MAC - Turkish)
 10082 |         Yes          | 10082 (MAC - Croatia)
 12000 |         Yes          | 12000 (UTF-32 Little Endian Byte Order)
 12001 |         Yes          | 12001 (UTF-32 Big Endian Byte Order)
 20127 |         Yes          | 20127 (US-ASCII)
 20261 |         No           | 20261 (T.61)
 20866 |         Yes          | 20866 (Russian - KOI8)
 21866 |         Yes          | 21866 (Ukrainian - KOI8-U)
 28591 |         Yes          | 28591 (ISO 8859-1 Latin I)
 28592 |         Yes          | 28592 (ISO 8859-2 Central Europe)
 28594 |         Yes          | 28594 (ISO 8859-4 Baltic)
 28595 |         Yes          | 28595 (ISO 8859-5 Cyrillic)
 28597 |         Yes          | 28597 (ISO 8859-7 Greek)
 28599 |         Yes          | 28599 (ISO 8859-9 Latin 5)
 28605 |         Yes          | 28605 (ISO 8859-15 Latin 9)
 65000 |         No           | 65000 (UTF-7)
 65001 |         Yes          | 65001 (UTF-8)
Saso

#98 Post by aGerman » 25 Dec 2019 16:20

Thanks Saso! The French user who reported this bug also confirmed that it has been fixed now.
As long as the performance doesn't suffer, I'll try to support XP.

Currently I'm working on Carlos' suggestion to work around the UTF-8 bug of the XP API functions. I got my own U8ToU16 and U16ToU8 functions to take about the same time as Microsoft's MultiByteToWideChar, but WideCharToMultiByte is still ~30% faster than my U16ToU8. I guess Microsoft used some ASM magic that I'm not able to beat using C :( I will probably end up branching the code depending on the OS version.

Steffen

#99 Post by aGerman » 26 Dec 2019 06:04

I incorporated custom conversions from UTF-8 to UTF-16 and vice versa because of the buggy API functions on XP (see the discussion above). In the end it was not necessary to determine the Windows version and branch the code. The speed of my own functions is now comparable to that of the API functions, and when used along with option /v they perform better.
Furthermore, passing 0 for the default ANSI code page was broken in v7.0 and v7.1. That's fixed now.

Virustotal scans of version 7.2:
x86: https://www.virustotal.com/gui/file/d3c ... /detection
x64: https://www.virustotal.com/gui/file/f30 ... /detection

Steffen

#100 Post by carlos » 26 Dec 2019 15:08

Wow, really nice work Steffen.
Thanks for the update. It is now a very solid piece of software.

#101 Post by miskox » 27 Dec 2019 02:25

Wow. That was fast. Thanks!

Saso

#102 Post by aGerman » 28 Dec 2019 05:55

I revised the validation of incoming UTF-8. I'm not sure whether the previous version already caught every invalid byte sequence.
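As an illustration of what "every invalid byte sequence" covers, here is a minimal Python sketch of a strict UTF-8 validator. It follows the well-formed byte ranges from the Unicode Standard; it is not the validation code used in CONVERTCP, and the function name `is_valid_utf8` is made up.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` is well-formed UTF-8 (hypothetical helper)."""
    i, n = 0, len(data)
    while i < n:
        b0 = data[i]
        if b0 < 0x80:          # ASCII, always valid
            i += 1
            continue
        # Pick the number of continuation bytes and the allowed range
        # of the first continuation byte, depending on the lead byte.
        if 0xC2 <= b0 <= 0xDF:
            need, lo, hi = 1, 0x80, 0xBF
        elif b0 == 0xE0:
            need, lo, hi = 2, 0xA0, 0xBF   # rejects overlong 3-byte forms
        elif b0 == 0xED:
            need, lo, hi = 2, 0x80, 0x9F   # rejects surrogates U+D800..U+DFFF
        elif 0xE1 <= b0 <= 0xEF:
            need, lo, hi = 2, 0x80, 0xBF
        elif b0 == 0xF0:
            need, lo, hi = 3, 0x90, 0xBF   # rejects overlong 4-byte forms
        elif 0xF1 <= b0 <= 0xF3:
            need, lo, hi = 3, 0x80, 0xBF
        elif b0 == 0xF4:
            need, lo, hi = 3, 0x80, 0x8F   # rejects code points above U+10FFFF
        else:
            return False       # 0x80..0xC1 and 0xF5..0xFF never start a sequence
        if i + need >= n:
            return False       # sequence is truncated at the end of the input
        if not lo <= data[i + 1] <= hi:
            return False
        for k in range(2, need + 1):
            if not 0x80 <= data[i + k] <= 0xBF:
                return False
        i += need + 1
    return True


print(is_valid_utf8("a\u00e9\u20ac\U00010348".encode("utf-8")))  # True
print(is_valid_utf8(b"\xed\xa0\x80"))                            # False
```

Note that the `0xED` rule is what rejects CESU-8 input, since an encoded surrogate half always starts with `ED A0`..`ED BF`.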

Virustotal scans of version 7.3:
x86: https://www.virustotal.com/gui/file/4f0 ... /detection
x64: https://www.virustotal.com/gui/file/f1a ... /detection

Steffen

#103 Post by miskox » 28 Dec 2019 14:13

FYI: the first version (.cpp) has 131 lines (including some 50 lines of comments and help). Version 7.2 (.c) has 1573 lines (the 7.3 source is not available yet).

Saso

#104 Post by aGerman » 28 Dec 2019 15:36

Oh, thanks for the reminder! I must have done something wrong when I uploaded the source file.
FWIW 1596 lines :)

Steffen

#105 Post by aGerman » 18 Jan 2020 14:11

Conversions between UTF-8 and UTF-16 usually require converting to an intermediate UTF-32 code unit (which is the same as the Unicode code point). But ASCII characters already have the same value in both UTF-8 and UTF-16, so they can be converted directly. This improves the performance for text in Latin-script languages, which contains a lot of ASCII, and especially for English, which is essentially ASCII only.
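A rough Python sketch of the idea, assuming the input has already been validated as well-formed UTF-8. The function name is invented, and the real implementation in CONVERTCP is written in C and may look quite different.

```python
def utf8_to_utf16_units(data: bytes) -> list:
    """Decode well-formed UTF-8 into UTF-16 code units, with an ASCII shortcut."""
    units = []
    i, n = 0, len(data)
    while i < n:
        b0 = data[i]
        if b0 < 0x80:
            # Fast path: an ASCII byte equals its UTF-16 code unit, copy it as is.
            units.append(b0)
            i += 1
            continue
        # General path: rebuild the code point (the intermediate UTF-32 value).
        if b0 < 0xE0:
            cp, extra = b0 & 0x1F, 1
        elif b0 < 0xF0:
            cp, extra = b0 & 0x0F, 2
        else:
            cp, extra = b0 & 0x07, 3
        for k in range(1, extra + 1):
            cp = (cp << 6) | (data[i + k] & 0x3F)
        i += extra + 1
        if cp < 0x10000:
            units.append(cp)
        else:
            # Code points above the BMP become a surrogate pair.
            v = cp - 0x10000
            units.append(0xD800 + (v >> 10))
            units.append(0xDC00 + (v & 0x3FF))
    return units


print([hex(u) for u in utf8_to_utf16_units("Hi \U00010348".encode("utf-8"))])
# ['0x48', '0x69', '0x20', '0xd800', '0xdf48']
```

The more ASCII the input contains, the more often the loop takes the short branch and skips the shift-and-mask work entirely.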

Virustotal scans of version 7.4:
x86: https://www.virustotal.com/gui/file/4bd ... /detection
x64: https://www.virustotal.com/gui/file/ed2 ... /detection

Steffen
