Re: UTF-8 bug

Posted: 06 Mar 2019 15:28
by carlos
I will try to fix it in a new batch utility that is coming soon.
aGerman, can you please help me? How can I determine whether the input buffer passed to MultiByteToWideChar contains incomplete code points?

Re: UTF-8 bug

Posted: 06 Mar 2019 16:51
by aGerman
Carlos

I'll explain the math behind it here in the thread, but I'd rather share the C implementation via PM or e-mail because it's quite off topic in this forum.

There are 5 rules that help you:
- UTF-8 characters are limited to a length of 4 code units (bytes).
- ASCII characters (7 low bits used, Most Significant Bit always 0) consist of only one code unit.
- All code units of a multi-byte UTF-8 character have the MSB set to 1.
- The first code unit of a multi-byte character has both the MSB and the second highest bit set to 1.
- The next code units of a multi-byte character have the MSB set to 1 but the second highest bit set to 0.

To check a byte you have two possibilities:
1) Use a logical right shift (that is, cast the char type to an unsigned value first). Shift by 7 bits to get either 0 (ASCII) or 1 (multi-byte). Shift by 6 bits to get either 3 (for the first code unit) or 2 (for the following code units) of a multi-byte character.
2) Use bit masks and bitwise AND for the same tests.
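
The checks above can be sketched in C roughly as follows. This is only a minimal illustration, not the code from the zip file: the names utf8_classify, utf8_expected_len and utf8_incomplete_tail are hypothetical, and the expected-length rule (the number of leading 1 bits in the first code unit gives the length of the character) is an addition that is not among the five rules listed above.

```c
#include <stddef.h>

enum utf8_kind { UTF8_ASCII, UTF8_LEAD, UTF8_CONTINUATION };

/* Classify one code unit with logical right shifts on an unsigned value.
   Equivalent mask tests: (b & 0x80) == 0, (b & 0xC0) == 0xC0, (b & 0xC0) == 0x80. */
static enum utf8_kind utf8_classify(unsigned char b)
{
    if ((b >> 7) == 0)
        return UTF8_ASCII;                 /* 0xxxxxxx */
    return (b >> 6) == 3 ? UTF8_LEAD       /* 11xxxxxx */
                         : UTF8_CONTINUATION; /* 10xxxxxx */
}

/* Assumed extra rule: the leading 1 bits of the first code unit
   give the total length of the character (2, 3 or 4 bytes). */
static size_t utf8_expected_len(unsigned char lead)
{
    if ((lead & 0xE0) == 0xC0) return 2;   /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 3;   /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 4;   /* 11110xxx */
    return 0;                              /* not a valid first code unit */
}

/* Return how many bytes at the end of buf belong to a character that is
   cut off by the end of the buffer, or 0 if no truncated character is found. */
static size_t utf8_incomplete_tail(const unsigned char *buf, size_t len)
{
    size_t back = 0;
    size_t have, need;
    unsigned char c;

    /* Step back over at most 3 trailing continuation bytes. */
    while (back < len && back < 3 &&
           utf8_classify(buf[len - 1 - back]) == UTF8_CONTINUATION)
        back++;

    if (back == len)
        return 0;                          /* empty, or no first code unit in reach */

    c = buf[len - 1 - back];
    if (utf8_classify(c) != UTF8_LEAD)
        return 0;                          /* ASCII or malformed input: not handled here */

    have = back + 1;                       /* first code unit plus continuation bytes */
    need = utf8_expected_len(c);
    return need > have ? have : 0;
}
```

A caller could pass only len minus the returned value to the conversion function and prepend the remaining bytes to the next chunk it reads. Malformed input (for example, stray continuation bytes) is deliberately not diagnosed in this sketch.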

The zip file I uploaded in post #7 contains C source code in which I already implemented this test, along with the removal of the BOM. If you have any questions, just get back to me via PM.

Steffen

Re: UTF-8 bug

Posted: 12 Mar 2019 03:33
by jfl
Arriving a bit late in this thread, but hopefully with a few interesting links:
  • The console always works in UCS-2 mode, whatever the code page you're using.
    That is, it records 16-bit Unicode version 1 characters in each cell. ASCII or non-ASCII is irrelevant: ASCII is just the first 128 Unicode characters.
    But with just 16 bits, it cannot record/display the characters that need 17 to 21 bits, which were defined in subsequent Unicode versions.
    Microsoft is well aware of that, and is currently redesigning the console to resolve this issue. They've published a very interesting document about their current work on this subject there:
    https://blogs.msdn.microsoft.com/comman ... xt-buffer/
  • They're also aware of the BOM issue, and so hopefully this should be fixed when they deliver the redesigned console (as explained in their blog post above) later this year (?).
  • The problem with the spurious characters appearing in the middle of the text is different. I think it's completely independent of the above console limitations, and is instead a bug in the UTF-8 to Unicode conversion code in the console output handler.
    The good news is that the console team, unlike its cmd.exe colleagues, has a GitHub site for managing issues in their code. I've opened a new bug there, referencing this thread:
    https://github.com/Microsoft/console/issues/386

Re: UTF-8 bug

Posted: 12 Mar 2019 04:56
by carlos
Thanks, aGerman. I'm currently working with Jason Hood on a new utility that includes a feature to solve this bug.
That feature is already done.
We are working on the other features and on revisions before I publish it.
I hope to publish the utility in two weeks.

Re: UTF-8 bug

Posted: 12 Mar 2019 05:37
by aGerman
Very interesting!
I wasn't aware that Microsoft tracks the Console issues on GitHub. Thanks for reporting this bug, Jean-François. I'll contribute to the discussion when I'm back home.

Steffen