Carlos
I'll explain the math behind it here in the thread, but I'd rather discuss the C implementation via PM or e-mail because it's quite off topic for this forum.
There are 5 rules that help you:
- UTF-8 characters are limited to a length of 4 code units (bytes).
- ASCII characters (7 low bits used, Most Significant Bit always 0) consist of only one code unit.
- All code units of a multi-byte UTF-8 character have the MSB set to 1.
- The first code unit of a multi-byte character has both the MSB and the second highest bit set to 1.
- The next code units of a multi-byte character have the MSB set to 1 but the second highest bit set to 0.
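To illustrate what these rules buy you, here is a minimal sketch that counts the characters in a UTF-8 string. It relies only on the last rule: bytes of the form 10xxxxxx (values 0x80 to 0xBF) never start a character, so every other byte does. The function name utf8_count_chars is just made up for this example, and the code assumes the input is already valid UTF-8.

```c
#include <stddef.h>

/* Count the characters in a NUL-terminated UTF-8 string.
   Hypothetical helper for illustration; assumes valid UTF-8 input
   and performs no validation. */
size_t utf8_count_chars(const char *s)
{
    size_t count = 0;

    for (; *s != '\0'; ++s) {
        /* Work on an unsigned value so bytes >= 0x80 are not negative. */
        unsigned char c = (unsigned char)*s;

        /* 0x80..0xBF is the bit pattern 10xxxxxx, i.e. a "next" code unit.
           Everything else is ASCII or the first code unit of a character. */
        if (c < 0x80 || c > 0xBF)
            ++count;
    }
    return count;
}
```

For example, utf8_count_chars("a\xC3\xA9") returns 2, because 0xC3 0xA9 is the two-byte encoding of a single character.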
To check a byte you have two possibilities:
1) Use Logical Right Shift. You have to cast the char type to an unsigned value first, because right-shifting a negative signed value is implementation-defined in C and usually copies the sign bit. Shift 7 bits to get either 0 (ASCII) or 1 (multi-byte). Shift 6 bits to get either 3 (for the first code unit) or 2 (for the next code units) in a multi-byte character.
2) Use bit masks and bitwise AND for the same tests.
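Both possibilities could look roughly like this in C. This is only a sketch to show the idea, not the code from the zip file; the enum and the function names are invented for this example.

```c
/* Hypothetical classification of a single UTF-8 code unit. */
enum utf8_kind { UTF8_ASCII, UTF8_FIRST, UTF8_NEXT };

/* Possibility 1: logical right shift on an unsigned value. */
enum utf8_kind utf8_kind_shift(char ch)
{
    unsigned char c = (unsigned char)ch;  /* avoid shifting a negative value */

    if ((c >> 7) == 0)                    /* 0xxxxxxx */
        return UTF8_ASCII;
    /* MSB is set: 11xxxxxx >> 6 is 3, 10xxxxxx >> 6 is 2 */
    return (c >> 6) == 3 ? UTF8_FIRST : UTF8_NEXT;
}

/* Possibility 2: bit masks and bitwise AND. */
enum utf8_kind utf8_kind_mask(char ch)
{
    unsigned char c = (unsigned char)ch;

    if ((c & 0x80) == 0)                  /* MSB clear: ASCII */
        return UTF8_ASCII;
    /* Keep the two highest bits: 0xC0 means first, 0x80 means next. */
    return (c & 0xC0) == 0xC0 ? UTF8_FIRST : UTF8_NEXT;
}
```

Both functions return the same result for every byte value, so which one to use is mostly a matter of taste. Note that neither checks whether a byte is actually allowed in UTF-8; they only look at the two highest bits.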
The zip file I uploaded in post #7 contains C source code in which I already implemented this test, along with the removal of the BOM. If you have any questions, just get back to me via PM.
Steffen