Open
Changes from 5 commits
36 changes: 25 additions & 11 deletions spec.txt
@@ -294,10 +294,20 @@ In the examples, the `→` character is used to represent tabs.
Any sequence of [characters] is a valid CommonMark
document.

-A [character](@) is a Unicode code point. Although some
-code points (for example, combining accents) do not correspond to
-characters in an intuitive sense, all code points count as characters
-for purposes of this spec.
+A [character](@) is a
+[Unicode encoded character](https://www.unicode.org/glossary/#encoded_character)
+(or [assigned character](https://www.unicode.org/glossary/#assigned_character)).
+Although some code points (for example, combining accents) do not correspond to
+characters in an intuitive sense, all encoded characters count as characters
+for purposes of this spec. However,
+[surrogate code points](https://www.unicode.org/glossary/#surrogate_code_point),
+[reserved code points](https://www.unicode.org/glossary/#reserved_code_point),
+and [Unicode noncharacters](https://www.unicode.org/glossary/#noncharacter)
+are not included. If an implementation encounters a code unit that is not
+part of a character (for example, part of an
+[ill-formed code unit subsequence](https://www.unicode.org/glossary/#ill-formed_code_unit_subsequence))
+where it expects a character, the behavior is
+[undefined](https://eel.is/c++draft/defns.undefined).

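As a rough illustration of the proposed definition, the sketch below classifies a code point as a character or not. The helper names (`is_surrogate`, `is_noncharacter`, `is_character`) are hypothetical and not part of the spec, and the "reserved" check relies on the Unicode version bundled with the Python interpreter, so results for recently assigned code points may vary.

```python
import unicodedata


def is_surrogate(cp: int) -> bool:
    """True for surrogate code points U+D800..U+DFFF."""
    return 0xD800 <= cp <= 0xDFFF


def is_noncharacter(cp: int) -> bool:
    """True for the 66 noncharacters: U+FDD0..U+FDEF and U+xxFFFE/U+xxFFFF."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def is_character(cp: int) -> bool:
    """Hypothetical helper: does `cp` count as a character under the proposed wording?"""
    if not 0 <= cp <= 0x10FFFF:
        return False
    if is_surrogate(cp) or is_noncharacter(cp):
        return False
    # General category "Cn" marks code points that are unassigned (reserved)
    # in the Unicode version shipped with this Python build.
    return unicodedata.category(chr(cp)) != "Cn"
```

Under this reading, combining accents such as U+0301 and private-use code points such as U+E000 still count as characters, while U+D800 and U+FFFE do not.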
This spec does not specify an encoding; it thinks of lines as composed
of [characters] rather than bytes. A conforming parser may be limited
@@ -661,9 +671,12 @@ references and their corresponding code points.
references](@)
consist of `&#` + a string of 1--7 arabic digits + `;`. A
numeric character reference is parsed as the corresponding
-Unicode character. Invalid Unicode code points will be replaced by
-the REPLACEMENT CHARACTER (`U+FFFD`). For security reasons,
-the code point `U+0000` will also be replaced by `U+FFFD`.
+number. The parsed number is replaced with
+another Unicode scalar value according to
+[the rules stipulated in the HTML Living Standard](https://html.spec.whatwg.org/multipage/parsing.html#numeric-character-reference-end-state)
+if applicable. For example,
+[surrogate code points](https://www.unicode.org/glossary/#surrogate_code_point)
+and the code point `U+0000` will be replaced by `U+FFFD`.

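One possible reading of the proposed wording, for both decimal and hexadecimal references, is sketched below. The function names are hypothetical. The replacement rules follow the "numeric character reference end state" of the HTML Living Standard as understood here: zero, out-of-range values, and surrogates become `U+FFFD`, and certain C1 control values are remapped. Whether a CommonMark implementation should apply the C1 remapping is an assumption about the proposal's intent.

```python
import re
from typing import Optional

# C1 control replacements from the HTML numeric character reference rules.
_C1_REPLACEMENTS = {
    0x80: 0x20AC, 0x82: 0x201A, 0x83: 0x0192, 0x84: 0x201E, 0x85: 0x2026,
    0x86: 0x2020, 0x87: 0x2021, 0x88: 0x02C6, 0x89: 0x2030, 0x8A: 0x0160,
    0x8B: 0x2039, 0x8C: 0x0152, 0x8E: 0x017D, 0x91: 0x2018, 0x92: 0x2019,
    0x93: 0x201C, 0x94: 0x201D, 0x95: 0x2022, 0x96: 0x2013, 0x97: 0x2014,
    0x98: 0x02DC, 0x99: 0x2122, 0x9A: 0x0161, 0x9B: 0x203A, 0x9C: 0x0153,
    0x9E: 0x017E, 0x9F: 0x0178,
}

# Decimal: 1 to 7 digits. Hexadecimal: X or x followed by 1 to 6 hex digits.
_NUMERIC_REF = re.compile(r"&#(?:([0-9]{1,7})|[Xx]([0-9A-Fa-f]{1,6}));")


def sanitize(number: int) -> int:
    """Map a parsed number to the scalar value that is actually emitted."""
    if number == 0 or number > 0x10FFFF or 0xD800 <= number <= 0xDFFF:
        return 0xFFFD
    return _C1_REPLACEMENTS.get(number, number)


def decode_numeric_reference(text: str) -> Optional[str]:
    """Return the decoded character, or None if `text` is not a numeric reference."""
    match = _NUMERIC_REF.fullmatch(text)
    if match is None:
        return None
    decimal, hexadecimal = match.groups()
    number = int(decimal) if decimal is not None else int(hexadecimal, 16)
    return chr(sanitize(number))
```

Keeping `sanitize` separate from the regular expression means the same replacement rules serve both notations.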
```````````````````````````````` example
&#35; &#1234; &#992; &#0;
@@ -675,8 +688,9 @@ the code point `U+0000` will also be replaced by `U+FFFD`.
[Hexadecimal numeric character
references](@) consist of `&#` +
either `X` or `x` + a string of 1-6 hexadecimal digits + `;`.
-They too are parsed as the corresponding Unicode character (this
-time specified with a hexadecimal numeral instead of decimal).
+They too are parsed and sanitized as the corresponding Unicode scalar value
+according to the rules of the HTML Living Standard
+(this time specified with a hexadecimal numeral instead of decimal).

```````````````````````````````` example
&#X22; &#XD06; &#xcab;
@@ -700,7 +714,7 @@ Here are some nonentities:
````````````````````````````````


-Although HTML5 does accept some entity references
+Although the HTML Living Standard does accept some entity references
without a trailing semicolon (such as `&copy`), these are not
recognized here, because it makes the grammar too ambiguous:

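A minimal sketch of the rule above, assuming Python's `html.entities.html5` table is an acceptable stand-in for the list of named entities (it contains both `copy;` and the legacy `copy` form). The function name `decode_named_reference` is hypothetical. Requiring the semicolon in the pattern is what rejects `&copy`, and the table lookup rejects names that are not on the list.

```python
import re
from html.entities import html5
from typing import Optional

# The trailing semicolon is part of the captured key, so legacy
# semicolon-free forms never match.
_NAMED_REF = re.compile(r"&([A-Za-z][A-Za-z0-9]*;)")


def decode_named_reference(text: str) -> Optional[str]:
    """Return the replacement text, or None if `text` is not a recognized entity."""
    match = _NAMED_REF.fullmatch(text)
    if match is None:
        return None
    return html5.get(match.group(1))
```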
@@ -711,7 +725,7 @@ recognized here, because it makes the grammar too ambiguous:
````````````````````````````````


-Strings that are not on the list of HTML5 named entities are not
+Strings that are not on the list of HTML Living Standard named entities are not
recognized as entity references either:

```````````````````````````````` example