Only :ref:`ASCII`, :ref:`UTF-8` and encodings using a :ref:`BOM <bom>` (:ref:`UTF-7 <utf7>` with BOM, UTF-8 with BOM, :ref:`UTF-16 <utf16>`, and :ref:`UTF-32 <utf32>`) have reliable algorithms to get the encoding of a document. For all other encodings, you have to trust heuristics based on statistics.
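For example, the third-party ``chardet`` package (listed at the end of this section) implements such statistical heuristics. A minimal sketch, assuming the package is installed::

    import chardet   # third-party package, see the list at the end of this section

    def guess_by_statistics(data):
        # detect() returns a best-guess encoding and a confidence
        # score between 0.0 and 1.0; the guess may be wrong
        result = chardet.detect(data)
        return result['encoding'], result['confidence']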
Checking whether a document is encoded to :ref:`ASCII` is simple: test whether bit 7 of every byte is unset (``0b0xxxxxxx``).
Example in :ref:`C <c>`::
    #include <stddef.h>   /* size_t */

    int isASCII(const char *data, size_t size)
    {
        const unsigned char *str = (const unsigned char*)data;
        const unsigned char *end = str + size;
        for (; str != end; str++) {
            if (*str & 0x80)
                return 0;
        }
        return 1;
    }
In :ref:`Python`, the ASCII decoder can be used::
    def isASCII(data):
        try:
            data.decode('ASCII')
        except UnicodeDecodeError:
            return False
        else:
            return True
.. note::
   Only use the Python function on short strings, because it decodes the
   whole string in memory. For long strings, it is better to use the
   algorithm of the C function, because it doesn't allocate any memory.
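The scan can also be written in pure Python without decoding; a minimal sketch (the function name is ours, assuming Python 3, where iterating over ``bytes`` yields integers)::

    def isASCII_scan(data):
        # Test bit 7 of each byte one at a time: unlike decode('ASCII'),
        # no decoded copy of the whole string is allocated
        return all(byte <= 0x7F for byte in data)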
If the string begins with a :ref:`BOM <bom>`, the encoding can be extracted from the BOM. But there is a problem with :ref:`UTF-16-LE <utf16>` and :ref:`UTF-32-LE <utf32>`: the UTF-32-LE BOM starts with the UTF-16-LE BOM.
Example of a function written in :ref:`C <c>` to check if a BOM is present::
    #include <string.h>   /* memcmp() */

    const char *UTF_16_BE_BOM = "\xFE\xFF";
    const char *UTF_16_LE_BOM = "\xFF\xFE";
    const char *UTF_8_BOM = "\xEF\xBB\xBF";
    const char *UTF_32_BE_BOM = "\x00\x00\xFE\xFF";
    const char *UTF_32_LE_BOM = "\xFF\xFE\x00\x00";

    const char* check_bom(const char *data, size_t size)
    {
        if (size >= 3) {
            if (memcmp(data, UTF_8_BOM, 3) == 0)
                return "UTF-8";
        }
        if (size >= 4) {
            /* test the 4-byte UTF-32 BOMs before the 2-byte UTF-16 BOMs,
               because the UTF-32-LE BOM starts with the UTF-16-LE BOM */
            if (memcmp(data, UTF_32_LE_BOM, 4) == 0)
                return "UTF-32-LE";
            if (memcmp(data, UTF_32_BE_BOM, 4) == 0)
                return "UTF-32-BE";
        }
        if (size >= 2) {
            if (memcmp(data, UTF_16_LE_BOM, 2) == 0)
                return "UTF-16-LE";
            if (memcmp(data, UTF_16_BE_BOM, 2) == 0)
                return "UTF-16-BE";
        }
        return NULL;
    }
For the UTF-16-LE/UTF-32-LE BOM conflict, this function returns ``"UTF-32-LE"`` if the string begins with ``"\xFF\xFE\x00\x00"``, even though this string can also be :ref:`decoded <decode>` from UTF-16-LE.
Example in :ref:`Python` getting the BOMs from the ``codecs`` library::
    from codecs import (BOM_UTF8,
                        BOM_UTF16_BE, BOM_UTF16_LE,
                        BOM_UTF32_BE, BOM_UTF32_LE)

    BOMS = (
        (BOM_UTF8, "UTF-8"),
        (BOM_UTF32_BE, "UTF-32-BE"),
        (BOM_UTF32_LE, "UTF-32-LE"),
        (BOM_UTF16_BE, "UTF-16-BE"),
        (BOM_UTF16_LE, "UTF-16-LE"),
    )

    def check_bom(data):
        return [encoding for bom, encoding in BOMS if data.startswith(bom)]
This function is different from the C function: it returns a list. It returns ``['UTF-32-LE', 'UTF-16-LE']`` if the string begins with ``b"\xFF\xFE\x00\x00"``.
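To get the same behaviour as the C function, keep only the first match: since ``BOMS`` lists the 4-byte UTF-32 BOMs before the 2-byte UTF-16 BOMs they start with, the first match is the longest one. A minimal sketch (the function name is ours)::

    def check_bom_single(data):
        # First match wins: the UTF-32 BOMs are tested before the
        # UTF-16 BOMs, so b"\xFF\xFE\x00\x00" gives 'UTF-32-LE'
        matches = check_bom(data)
        if matches:
            return matches[0]
        return None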
:ref:`UTF-8` encoding adds markers to each byte, so it is possible to write a reliable algorithm to check whether a :ref:`byte string <bytes>` is encoded to UTF-8.
Example of a strict :ref:`C <c>` function to check if a string is encoded with UTF-8. It rejects :ref:`overlong sequences <strict utf8 decoder>` (e.g. ``0xC0 0x80``) and :ref:`surrogate characters <surrogates>` (e.g. ``0xED 0xB2 0x80``, the UTF-8 encoding of U+DC80)::
    #include <stdint.h>   /* uint32_t */
    #include <stddef.h>   /* size_t */

    int isUTF8(const char *data, size_t size)
    {
        const unsigned char *str = (const unsigned char*)data;
        const unsigned char *end = str + size;
        unsigned char byte;
        unsigned int code_length, i;
        uint32_t ch;
        while (str != end) {
            byte = *str;
            if (byte <= 0x7F) {
                /* 1 byte sequence: U+0000..U+007F */
                str += 1;
                continue;
            }

            if (0xC2 <= byte && byte <= 0xDF)
                /* 0b110xxxxx: 2 bytes sequence */
                code_length = 2;
            else if (0xE0 <= byte && byte <= 0xEF)
                /* 0b1110xxxx: 3 bytes sequence */
                code_length = 3;
            else if (0xF0 <= byte && byte <= 0xF4)
                /* 0b11110xxx: 4 bytes sequence */
                code_length = 4;
            else {
                /* invalid first byte of a multibyte character */
                return 0;
            }

            if (str + (code_length - 1) >= end) {
                /* truncated string or invalid byte sequence */
                return 0;
            }

            /* Check continuation bytes: bit 7 should be set, bit 6 should be
             * unset (0b10xxxxxx). */
            for (i = 1; i < code_length; i++) {
                if ((str[i] & 0xC0) != 0x80)
                    return 0;
            }

            if (code_length == 2) {
                /* 2 bytes sequence: U+0080..U+07FF */
                ch = ((str[0] & 0x1f) << 6) + (str[1] & 0x3f);
                /* str[0] >= 0xC2, so ch >= 0x0080.
                   str[0] <= 0xDF, (str[1] & 0x3f) <= 0x3f, so ch <= 0x07ff */
            } else if (code_length == 3) {
                /* 3 bytes sequence: U+0800..U+FFFF */
                ch = ((str[0] & 0x0f) << 12) + ((str[1] & 0x3f) << 6) +
                     (str[2] & 0x3f);
                /* (0xff & 0x0f) << 12 | (0xff & 0x3f) << 6 | (0xff & 0x3f) = 0xffff,
                   so ch <= 0xffff */
                if (ch < 0x0800)
                    return 0;

                /* surrogates (U+D800..U+DFFF) are invalid in UTF-8:
                   test if (0xD800 <= ch && ch <= 0xDFFF) */
                if ((ch >> 11) == 0x1b)
                    return 0;
            } else if (code_length == 4) {
                /* 4 bytes sequence: U+10000..U+10FFFF */
                ch = ((str[0] & 0x07) << 18) + ((str[1] & 0x3f) << 12) +
                     ((str[2] & 0x3f) << 6) + (str[3] & 0x3f);
                if ((ch < 0x10000) || (0x10FFFF < ch))
                    return 0;
            }
            str += code_length;
        }
        return 1;
    }
In :ref:`Python`, the UTF-8 decoder can be used::
    def isUTF8(data):
        try:
            data.decode('UTF-8')
        except UnicodeDecodeError:
            return False
        else:
            return True
In :ref:`Python 2 <python2>`, this function is more tolerant than the C function, because the UTF-8 decoder of Python 2 accepts surrogate characters (U+D800..U+DFFF). For example, ``isUTF8(b'\xED\xB2\x80')`` returns ``True``.
With :ref:`Python 3 <python3>`, the Python function is equivalent to the C function. If you would like to reject surrogate characters in Python 2, use the following strict function::
    def isUTF8Strict(data):
        try:
            decoded = data.decode('UTF-8')
        except UnicodeDecodeError:
            return False
        else:
            for ch in decoded:
                if 0xD800 <= ord(ch) <= 0xDFFF:
                    return False
            return True
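The checks of this section can be chained into a single guess. A minimal sketch reusing the ``check_bom``, ``isASCII`` and ``isUTF8`` functions defined above (the function name and the order of the checks are ours; the result is only a guess)::

    def guess_encoding(data):
        # A BOM is the most reliable marker; the first match is the
        # longest BOM thanks to the order of BOMS
        matches = check_bom(data)
        if matches:
            return matches[0]
        # Pure ASCII is also valid UTF-8: test the stricter case first
        if isASCII(data):
            return 'ASCII'
        if isUTF8(data):
            return 'UTF-8'
        # Fall back to statistical heuristics (e.g. chardet)
        return None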
:ref:`PHP <php>` has a builtin function to detect the encoding of a :ref:`byte string <bytes>`: ``mb_detect_encoding()``.

Tools and libraries to guess the encoding of a document:
- chardet: :ref:`Python` version of the "chardet" algorithm implemented in Mozilla
- UTRAC: command-line program (written in :ref:`C <c>`) to recognize the encoding of an input file and its end-of-line type
- charguess: Ruby library to guess the charset of a document
.. todo:: update/complete this list