UTF-8 bug

carlos
Expert
Posts: 503
Joined: 20 Aug 2010 13:57
Location: Chile

Re: UTF-8 bug

#16 Post by carlos » 06 Mar 2019 15:28

I will try to fix it in a new batch utility that is coming soon.
aGerman, can you please help me? How can I determine whether the input buffer passed to MultiByteToWideChar ends with an incomplete code point?

aGerman
Expert
Posts: 4654
Joined: 22 Jan 2010 18:01
Location: Germany

Re: UTF-8 bug

#17 Post by aGerman » 06 Mar 2019 16:51

Carlos

I'll explain the math behind it here in the thread, but I'd rather discuss the C implementation via PM or e-mail because it's quite off topic in the forum.

There are 5 rules that help you:
- UTF-8 characters are limited to a length of 4 code units (bytes).
- ASCII characters (7 low bits used, Most Significant Bit always 0) consist of only one code unit.
- All code units of a multi-byte UTF-8 character have the MSB set to 1.
- The first code unit of a multi-byte character has both the MSB and the second highest bit set to 1.
- The next code units of a multi-byte character have the MSB set to 1 but the second highest bit set to 0.

To check a byte you have two possibilities:
1) Use a logical right shift (that is, cast the char to an unsigned type first). Shifting by 7 bits yields either 0 (ASCII) or 1 (multi-byte). For a byte of a multi-byte character, shifting by 6 bits yields either 3 (the first code unit) or 2 (one of the following code units).
2) Use bit masks and bitwise AND for the same tests.
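
As a minimal sketch of the two variants (the names `utf8_kind`, `classify_shift` and `classify_mask` are made up for this illustration and are not taken from the source in the zip file), both tests might look like this in C:

```c
/* Hypothetical names, for illustration only. */
enum utf8_kind { UTF8_ASCII, UTF8_LEAD, UTF8_CONT };

/* Variant 1: logical right shift on an unsigned value. */
static enum utf8_kind classify_shift(char c)
{
    unsigned char u = (unsigned char)c;    /* avoid an arithmetic shift on a negative char */
    if ((u >> 7) == 0)
        return UTF8_ASCII;                 /* 0xxxxxxx */
    return ((u >> 6) == 3) ? UTF8_LEAD     /* 11xxxxxx: first code unit */
                           : UTF8_CONT;    /* 10xxxxxx: following code unit */
}

/* Variant 2: bit masks and bitwise AND, same result. */
static enum utf8_kind classify_mask(char c)
{
    unsigned char u = (unsigned char)c;
    if ((u & 0x80) == 0)
        return UTF8_ASCII;                 /* MSB clear */
    return ((u & 0xC0) == 0xC0) ? UTF8_LEAD
                                : UTF8_CONT;
}
```

For the euro sign U+20AC, encoded as the bytes E2 82 AC, both functions should report E2 as the first code unit and 82 and AC as following code units.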

The zip file I uploaded in post #7 contains C source code in which I already implemented this test, along with the removal of the BOM. If you have any questions, just get back to me via PM.
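
Independently of that source (which is not reproduced here), the rules could be combined roughly as follows to find out whether a buffer ends in the middle of a character. This is only a sketch: `utf8_incomplete_tail` is a hypothetical helper name, the length implied by the lead byte (110xxxxx = 2, 1110xxxx = 3, 11110xxx = 4 code units) comes from the UTF-8 encoding itself rather than from the rules listed above, and malformed input is not diagnosed.

```c
#include <stddef.h>

/*
 * Returns how many bytes at the end of buf belong to an incomplete
 * UTF-8 character (0 if the buffer ends on a character boundary).
 * Hypothetical helper: it only looks at the lead/continuation bits
 * and does not validate the data otherwise.
 */
static size_t utf8_incomplete_tail(const char *buf, size_t len)
{
    size_t back = 0;   /* number of bytes examined from the end */

    /* A character has at most 4 code units, so look back at most 4 bytes. */
    while (back < len && back < 4) {
        unsigned char u = (unsigned char)buf[len - 1 - back];
        ++back;

        if ((u >> 7) == 0)
            return 0;                  /* ASCII byte: nothing is pending */

        if ((u >> 6) == 3) {           /* first code unit found */
            size_t need;
            if ((u >> 5) == 6)
                need = 2;              /* 110xxxxx */
            else if ((u >> 4) == 14)
                need = 3;              /* 1110xxxx */
            else if ((u >> 3) == 30)
                need = 4;              /* 11110xxx */
            else
                return 0;              /* not a valid first code unit */
            return (back < need) ? back : 0;
        }
        /* otherwise a following code unit (10xxxxxx): keep walking back */
    }
    return 0;   /* no first code unit within 4 bytes: malformed, ignored here */
}
```

A caller reading a file in chunks could then convert only the first `len - utf8_incomplete_tail(buf, len)` bytes with MultiByteToWideChar and carry the remaining bytes over to the front of the next chunk.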

Steffen

jfl
Posts: 226
Joined: 26 Oct 2012 06:40
Location: Saint Hilaire du Touvet, France

Re: UTF-8 bug

#18 Post by jfl » 12 Mar 2019 03:33

Arriving a bit late in this thread, but hopefully with a few interesting links:
  • The console always works in UCS-2 mode, whatever the code page you're using.
    That is, it records a 16-bit Unicode version 1 character in each cell. ASCII or non-ASCII is irrelevant: ASCII is just the first 128 Unicode characters.
    But with just 16 bits, it cannot record or display the characters that need 17 to 21 bits, which were defined in subsequent Unicode versions.
    Microsoft is well aware of that, and is currently redesigning the console to resolve this issue. They've published a very interesting document about their current work on this subject here:
    https://blogs.msdn.microsoft.com/comman ... xt-buffer/
  • They're also aware of the BOM issue, and so hopefully this should be fixed when they deliver the redesigned console (as explained in their blog post above) later this year (?).
  • The problem with the spurious characters appearing in the middle of the text is different. I think it's completely independent of the above console limitations, but is a bug in the UTF-8 to Unicode conversion code in the console output handler.
    The good news is that the console team, unlike its cmd.exe colleagues, has a GitHub site for managing issues in their code. I've opened a new bug there, referencing this thread:
    https://github.com/Microsoft/console/issues/386
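
To illustrate the 16-bit cell limit from the first point above: in UTF-16, a code point beyond U+FFFF takes two 16-bit code units (a surrogate pair), so it cannot fit into a single UCS-2 cell. A minimal sketch in C, where `utf16_encode` is a hypothetical helper name used only for this example:

```c
#include <stdint.h>

/*
 * Encodes one Unicode code point as UTF-16.
 * Returns the number of 16-bit units written to out (1 or 2),
 * or 0 if cp is not a valid code point.
 */
static int utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;                                /* out of range or a lone surrogate */
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;                   /* fits into one 16-bit unit */
        return 1;
    }
    cp -= 0x10000;                               /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));    /* high surrogate: upper 10 bits */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* low surrogate: lower 10 bits */
    return 2;
}
```

With this sketch, U+20AC comes out as the single unit 0x20AC, while U+1F600 comes out as the pair 0xD83D 0xDE00, which is more than one console cell can hold.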

carlos

Re: UTF-8 bug

#19 Post by carlos » 12 Mar 2019 04:56

Thanks, aGerman. I'm currently working with Jason Hood on a new utility that includes a feature to work around this bug.
That feature is already done.
We are working on the other features and on revisions before I publish it.
I hope to publish the utility in two weeks.

aGerman

Re: UTF-8 bug

#20 Post by aGerman » 12 Mar 2019 05:37

Very interesting!
I wasn't aware that Microsoft tracks the console issues on GitHub. Thanks for reporting this bug, Jean-François. I'll contribute to the discussion when I'm back home.

Steffen
