Hi Steffen,
You still have to condition the console window for UTF-16 output and restore the old behavior when you're done.
It's the stdout file that I switch to 16-bit mode, not the console. This is done using the C library function _setmode(fileno(stdout), _O_WTEXT).
I suppose that the cleanup is done automatically when the program exits, because I don't do any cleanup myself.
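For readers who do want an explicit restore, here is a minimal sketch of the save/restore pattern around _setmode(). It is not taken from conv.c; the helper names stdout_wide_begin() and stdout_wide_end() are hypothetical, and the non-Windows branch is only a stub so the file compiles elsewhere.

```c
#include <stdio.h>
#ifdef _WIN32
#include <io.h>     /* _setmode, _fileno */
#include <fcntl.h>  /* _O_WTEXT */
#endif

/* Hypothetical helper: switch stdout to UTF-16 text mode.
   Returns the previous translation mode, or -1 on failure.
   On non-Windows builds this is a stub that returns 0. */
int stdout_wide_begin(void)
{
#ifdef _WIN32
    fflush(stdout);  /* flush pending narrow output before changing mode */
    return _setmode(_fileno(stdout), _O_WTEXT);
#else
    return 0;
#endif
}

/* Hypothetical helper: restore the mode returned by stdout_wide_begin().
   Returns -1 on failure. */
int stdout_wide_end(int previous_mode)
{
#ifdef _WIN32
    fflush(stdout);  /* flush pending wide output before switching back */
    return _setmode(_fileno(stdout), previous_mode);
#else
    (void)previous_mode;
    return 0;
#endif
}
```

Between the two calls, only the wide-character functions (wprintf, fputws, ...) should be used on stdout, since the Microsoft C runtime rejects narrow output in _O_WTEXT mode.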
I'd really favor an encoding that fully supports Unicode, preferably UTF-8.
Agreed, I need to change that in my code, at least for Windows 10.
One possible refinement would be to obey Notepad's own default encoding, stored in the registry in HKCU\Software\Microsoft\Notepad\iDefaultEncoding.
If the input has a BOM then it's easy indeed. Otherwise I have my doubts.
Indeed the BOM is a strong hint... But unfortunately it's being abandoned.
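The BOM check itself is only a few byte comparisons. A minimal sketch, assuming the first bytes of the file are already in a buffer (the enum and function names are made up for illustration; UTF-32 BOMs are deliberately ignored here, so a UTF-32LE file would be reported as UTF-16LE):

```c
#include <stddef.h>

typedef enum { ENC_UNKNOWN, ENC_UTF8, ENC_UTF16LE, ENC_UTF16BE } bom_encoding;

/* Look for a byte order mark at the start of buf.
   If bom_len is not NULL, it receives the number of BOM bytes to skip (0 if none). */
bom_encoding detect_bom(const unsigned char *buf, size_t len, size_t *bom_len)
{
    bom_encoding enc = ENC_UNKNOWN;
    size_t n = 0;

    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF) {
        enc = ENC_UTF8;     n = 3;   /* EF BB BF */
    } else if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE) {
        enc = ENC_UTF16LE;  n = 2;   /* FF FE */
    } else if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF) {
        enc = ENC_UTF16BE;  n = 2;   /* FE FF */
    }
    if (bom_len != NULL) *bom_len = n;
    return enc;
}
```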
Fortunately, a byte stream with bytes from \x80 to \xFF can be reliably validated as UTF-8 or not. The probability of a false positive is not 0, but it's very low. The more non-ASCII bytes, the lower the probability.
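To make that concrete, here is a sketch of such a validator. It is not the code from conv.c (where this is not implemented yet); the function name is hypothetical. It follows the well-formed byte ranges of the Unicode standard, so overlong forms, surrogates and values above U+10FFFF are rejected.

```c
#include <stddef.h>

/* Return 1 if s[0..len) is well-formed UTF-8, else 0.
   On success, *n_multibyte (if not NULL) receives the number of multi-byte
   sequences seen; 0 means pure ASCII, which proves nothing either way.
   A sequence cut off at the end of the buffer counts as invalid here. */
int is_valid_utf8(const unsigned char *s, size_t len, size_t *n_multibyte)
{
    size_t i = 0, count = 0, k, need;

    while (i < len) {
        unsigned char c = s[i];
        unsigned char lo = 0x80, hi = 0xBF;  /* allowed range of the 2nd byte */

        if (c < 0x80) { i++; continue; }     /* ASCII */
        else if (c >= 0xC2 && c <= 0xDF) need = 1;
        else if (c == 0xE0) { need = 2; lo = 0xA0; }   /* no overlong forms */
        else if (c >= 0xE1 && c <= 0xEC) need = 2;
        else if (c == 0xED) { need = 2; hi = 0x9F; }   /* no surrogates */
        else if (c == 0xEE || c == 0xEF) need = 2;
        else if (c == 0xF0) { need = 3; lo = 0x90; }   /* no overlong forms */
        else if (c >= 0xF1 && c <= 0xF3) need = 3;
        else if (c == 0xF4) { need = 3; hi = 0x8F; }   /* max U+10FFFF */
        else return 0;          /* 80..C1 and F5..FF never start a sequence */

        if (len - i <= need) return 0;                 /* truncated sequence */
        if (s[i + 1] < lo || s[i + 1] > hi) return 0;
        for (k = 2; k <= need; k++)
            if ((s[i + k] & 0xC0) != 0x80) return 0;   /* not a continuation */

        count++;
        i += need + 1;
    }
    if (n_multibyte != NULL) *n_multibyte = count;
    return 1;
}
```

With the test file used further down, the ANSI bytes "Fran\xE7ois" fail immediately (E7 announces a 3-byte sequence, but 'o' is not a continuation byte), while the UTF-8 bytes "Fran\xC3\xA7ois" pass with one multi-byte sequence counted.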
It may take thousands of characters before the first non-ASCII character appears.
Correct. It's a weakness of my conv program, which reads the whole file into memory. So it can't convert huge files that don't fit.
Again we're talking about heuristics for choosing defaults. We know we have to accept a (hopefully small) proportion of errors.
Using a large buffer, and testing the first part of the file, would give a good default in most cases.
In case of UTF-16 you can check for the typical alternately appearing zero bytes. But in CJK languages you'll be out of luck.
Correct. I have very little experience with CJK languages, but I hope that there are occasional spaces or digits that may help.
Note that the only UTF-16 files I usually have to deal with are Windows' own log files and registry exports. These contain mostly ASCII anyway, even on CJK versions of Windows.
How do you handle this in your source code?
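The zero-byte test for BOM-less UTF-16 could look like the sketch below. The names are hypothetical and the thresholds (more than half of the 16-bit units with a zero byte on one side, fewer than a tenth on the other) are arbitrary assumptions, to be tuned on real files. As discussed, it only works for text that is mostly ASCII or Latin-1.

```c
#include <stddef.h>

typedef enum { GUESS_NOT_UTF16, GUESS_UTF16LE, GUESS_UTF16BE } utf16_guess;

/* Guess whether buf holds BOM-less UTF-16 by counting zero bytes
   at even and odd offsets. */
utf16_guess guess_utf16(const unsigned char *buf, size_t len)
{
    size_t pairs = len / 2;          /* number of complete 16-bit units */
    size_t even_zeros = 0, odd_zeros = 0, i;

    for (i = 0; i + 1 < len; i += 2) {
        if (buf[i] == 0)     even_zeros++;
        if (buf[i + 1] == 0) odd_zeros++;
    }
    /* ASCII in UTF-16LE: low byte first, so the zeros sit at odd offsets. */
    if (odd_zeros * 2 > pairs && even_zeros * 10 < pairs)
        return GUESS_UTF16LE;
    /* ASCII in UTF-16BE: high byte first, so the zeros sit at even offsets. */
    if (even_zeros * 2 > pairs && odd_zeros * 10 < pairs)
        return GUESS_UTF16BE;
    return GUESS_NOT_UTF16;
}
```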
Well, actually only the BOM detection is implemented in conv.c. The UTF-8 / UTF-16 validation has been on my to-do list for years, but it's never made it to the top.
Finally note that I did experiment with the COM API IMultiLanguage2::DetectInputCodepage()... But the results were very poor: It was slower, and wrong more often, than my current heuristics.
OK, but at least for the pipe you're not doing it. I tested a few minutes ago and it seems you're using the ACP along with conv
Yes I do! Notice how the two non-ASCII characters are changed when conv pipes that ANSI file to dump.exe, versus dumping the file directly:
C:\JFL\Temp>chcp
Active code page: 437
C:\JFL\Temp>conv test.txt
Jean-François habite à Grenoble
C:\JFL\Temp>dump test.txt
Offset 00 04 08 0C 0 4 8 C
-------- ----------- ----------- ----------- ----------- -------- --------
00000000 4A 65 61 6E 2D 46 72 61 6E E7 6F 69 73 20 68 61 Jean-Fra n�ois ha
00000010 62 69 74 65 20 E0 20 47 72 65 6E 6F 62 6C 65 0D bite � G renoble
00000020 0A
C:\JFL\Temp>conv test.txt | dump
Offset 00 04 08 0C 0 4 8 C
-------- ----------- ----------- ----------- ----------- -------- --------
00000000 4A 65 61 6E 2D 46 72 61 6E 87 6F 69 73 20 68 61 Jean-Fra n�ois ha
00000010 62 69 74 65 20 85 20 47 72 65 6E 6F 62 6C 65 0D bite � G renoble
00000020 0A
C:\JFL\Temp>codepage 437
Code page 437: OEM - United States (SBCS) ASCII-compatible
80 Ç 90 É A0 á B0 ░ C0 └ D0 ╨ E0 α F0 ≡
81 ü 91 æ A1 í B1 ▒ C1 ┴ D1 ╤ E1 ß F1 ±
82 é 92 Æ A2 ó B2 ▓ C2 ┬ D2 ╥ E2 Γ F2 ≥
83 â 93 ô A3 ú B3 │ C3 ├ D3 ╙ E3 π F3 ≤
84 ä 94 ö A4 ñ B4 ┤ C4 ─ D4 ╘ E4 Σ F4 ⌠
85 à 95 ò A5 Ñ B5 ╡ C5 ┼ D5 ╒ E5 σ F5 ⌡
86 å 96 û A6 ª B6 ╢ C6 ╞ D6 ╓ E6 µ F6 ÷
87 ç 97 ù A7 º B7 ╖ C7 ╟ D7 ╫ E7 τ F7 ≈
88 ê 98 ÿ A8 ¿ B8 ╕ C8 ╚ D8 ╪ E8 Φ F8 °
89 ë 99 Ö A9 ⌐ B9 ╣ C9 ╔ D9 ┘ E9 Θ F9 ∙
8A è 9A Ü AA ¬ BA ║ CA ╩ DA ┌ EA Ω FA ·
8B ï 9B ¢ AB ½ BB ╗ CB ╦ DB █ EB δ FB √
8C î 9C £ AC ¼ BC ╝ CC ╠ DC ▄ EC ∞ FC ⁿ
8D ì 9D ¥ AD ¡ BD ╜ CD ═ DD ▌ ED φ FD ²
8E Ä 9E ₧ AE « BE ╛ CE ╬ DE ▐ EE ε FE ■
8F Å 9F ƒ AF » BF ┐ CF ╧ DF ▀ EF ∩ FF
C:\JFL\Temp>codepage 1252
Code page 1252: ANSI - Latin I (SBCS) ASCII-compatible
80 € 90 A0 B0 ° C0 À D0 Ð E0 à F0 ð
81 91 ‘ A1 ¡ B1 ± C1 Á D1 Ñ E1 á F1 ñ
82 ‚ 92 ’ A2 ¢ B2 ² C2 Â D2 Ò E2 â F2 ò
83 ƒ 93 “ A3 £ B3 ³ C3 Ã D3 Ó E3 ã F3 ó
84 „ 94 ” A4 ¤ B4 ´ C4 Ä D4 Ô E4 ä F4 ô
85 … 95 • A5 ¥ B5 µ C5 Å D5 Õ E5 å F5 õ
86 † 96 – A6 ¦ B6 ¶ C6 Æ D6 Ö E6 æ F6 ö
87 ‡ 97 — A7 § B7 · C7 Ç D7 × E7 ç F7 ÷
88 ˆ 98 ˜ A8 ¨ B8 ¸ C8 È D8 Ø E8 è F8 ø
89 ‰ 99 ™ A9 © B9 ¹ C9 É D9 Ù E9 é F9 ù
8A Š 9A š AA ª BA º CA Ê DA Ú EA ê FA ú
8B ‹ 9B › AB « BB » CB Ë DB Û EB ë FB û
8C Œ 9C œ AC ¬ BC ¼ CC Ì DC Ü EC ì FC ü
8D 9D AD BD ½ CD Í DD Ý ED í FD ý
8E Ž 9E ž AE ® BE ¾ CE Î DE Þ EE î FE þ
8F 9F Ÿ AF ¯ BF ¿ CF Ï DF ß EF ï FF ÿ
C:\JFL\Temp>
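The two changed bytes match the tables above: ç and à sit at E7 and E0 in code page 1252, and at 87 and 85 in code page 437. As an illustration only, here is a toy sketch of that mapping going through Unicode. It assumes the ANSI code page is 1252, covers only the CP437 range 80..9F transcribed from the table above, and is not how conv.c does it; on Windows a real tool would more likely call MultiByteToWideChar() and WideCharToMultiByte(). The function name is hypothetical.

```c
#include <stddef.h>

/* Unicode code points of the CP437 bytes 0x80..0x9F (from the table above). */
static const unsigned short cp437_80_9F[32] = {
    0x00C7, 0x00FC, 0x00E9, 0x00E2, 0x00E4, 0x00E0, 0x00E5, 0x00E7,
    0x00EA, 0x00EB, 0x00E8, 0x00EF, 0x00EE, 0x00EC, 0x00C4, 0x00C5,
    0x00C9, 0x00E6, 0x00C6, 0x00F4, 0x00F6, 0x00F2, 0x00FB, 0x00F9,
    0x00FF, 0x00D6, 0x00DC, 0x00A2, 0x00A3, 0x00A5, 0x20A7, 0x0192
};

/* Convert one CP1252 byte to CP437; returns '?' when this partial table
   has no match. */
unsigned char cp1252_to_cp437(unsigned char c)
{
    unsigned int i;

    if (c < 0x80) return c;            /* ASCII is identical in both pages */
    if (c >= 0xA0) {
        /* In CP1252, bytes A0..FF equal their Unicode code point. */
        for (i = 0; i < 32; i++)
            if (cp437_80_9F[i] == c)
                return (unsigned char)(0x80 + i);
    }
    return '?';                        /* CP1252 80..9F, or no CP437 match here */
}
```

Feeding it the bytes E7 and E0 from the first dump gives 87 and 85, as in the second dump.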
I'm really not sure if I would even consider allowing a default here.
All I can tell is that it works well for me in most cases.