commonmark · tats-u · Mar 18, 2025 · Apr 27, 2025 · Apr 27, 2025 · May 15, 2025
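For reviewers, here is a minimal sketch of how the numeric character reference handling described in this patch could be implemented. It assumes the replacement rules are exactly those of the HTML Living Standard's "numeric character reference end state": `U+0000`, surrogates, and values above `U+10FFFF` become `U+FFFD`, and the C1 range is remapped through the Windows-1252 table. The names `sanitize_code_point` and `decode_numeric_reference` are hypothetical and do not come from any CommonMark implementation.

```python
import re

# Remapping table from the HTML "numeric character reference end state"
# (C1 control numbers mapped to their Windows-1252 counterparts).
C1_REMAP = {
    0x80: 0x20AC, 0x82: 0x201A, 0x83: 0x0192, 0x84: 0x201E, 0x85: 0x2026,
    0x86: 0x2020, 0x87: 0x2021, 0x88: 0x02C6, 0x89: 0x2030, 0x8A: 0x0160,
    0x8B: 0x2039, 0x8C: 0x0152, 0x8E: 0x017D, 0x91: 0x2018, 0x92: 0x2019,
    0x93: 0x201C, 0x94: 0x201D, 0x95: 0x2022, 0x96: 0x2013, 0x97: 0x2014,
    0x98: 0x02DC, 0x99: 0x2122, 0x9A: 0x0161, 0x9B: 0x203A, 0x9C: 0x0153,
    0x9E: 0x017E, 0x9F: 0x0178,
}

REPLACEMENT = 0xFFFD

# 1-7 decimal digits or 1-6 hexadecimal digits, as in the spec text.
NUMERIC_REF = re.compile(r"&#(?:[xX]([0-9a-fA-F]{1,6})|([0-9]{1,7}));")


def sanitize_code_point(n: int) -> int:
    """Return the scalar value that the parsed number n should produce."""
    if n == 0 or n > 0x10FFFF or 0xD800 <= n <= 0xDFFF:
        return REPLACEMENT
    # Noncharacters and other controls are left unchanged here; HTML only
    # flags them as parse errors.
    return C1_REMAP.get(n, n)


def decode_numeric_reference(ref: str):
    """Decode one reference such as '&#35;' or '&#x22;'; None if malformed."""
    m = NUMERIC_REF.fullmatch(ref)
    if m is None:
        return None
    hex_digits, dec_digits = m.groups()
    n = int(hex_digits, 16) if hex_digits is not None else int(dec_digits, 10)
    return chr(sanitize_code_point(n))
```

With this sketch, `decode_numeric_reference("&#0;")` and `decode_numeric_reference("&#xD800;")` both yield `U+FFFD`, while `decode_numeric_reference("&#35;")` yields `#`.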
diff --git a/spec.txt b/spec.txt
@@ -294,10 +294,20 @@ In the examples, the `→` character is used to represent tabs.
 Any sequence of [characters] is a valid CommonMark
 document.
 
-A [character](@) is a Unicode code point.  Although some
-code points (for example, combining accents) do not correspond to
-characters in an intuitive sense, all code points count as characters
-for purposes of this spec.
+A [character](@) is a
+[Unicode encoded character](https://www.unicode.org/glossary/#encoded_character)
+(or [assigned character](https://www.unicode.org/glossary/#assigned_character)).
+Although some code points (for example, combining accents) do not correspond to
+characters in an intuitive sense, all encoded characters count as characters
+for purposes of this spec. However,
+[surrogate code points](https://www.unicode.org/glossary/#surrogate_code_point),
+[reserved code points](https://www.unicode.org/glossary/#reserved_code_point),
+and [Unicode noncharacters](https://www.unicode.org/glossary/#noncharacter)
+are not included. If an implementation encounters a code unit that is not
+part of a character, for example, part of an
+[ill-formed code unit subsequence](https://www.unicode.org/glossary/#ill-formed_code_unit_subsequence)
+where it expects a character, the behavior is
+[undefined](https://eel.is/c++draft/defns.undefined).
 
 This spec does not specify an encoding; it thinks of lines as composed
 of [characters] rather than bytes.  A conforming parser may be limited
@@ -661,9 +671,12 @@ references and their corresponding code points.
 references](@)
 consist of `&#` + a string of 1--7 arabic digits + `;`. A
 numeric character reference is parsed as the corresponding
-Unicode character. Invalid Unicode code points will be replaced by
-the REPLACEMENT CHARACTER (`U+FFFD`).  For security reasons,
-the code point `U+0000` will also be replaced by `U+FFFD`.
+number.  Where applicable, the parsed number is replaced with
+another Unicode scalar value according to
+[the rules stipulated in the HTML Living Standard](https://html.spec.whatwg.org/multipage/parsing.html#numeric-character-reference-end-state).
+For example,
+[surrogate code points](https://www.unicode.org/glossary/#surrogate_code_point)
+and the code point `U+0000` will be replaced by `U+FFFD`.
 
 ```````````````````````````````` example
 &#35; &#1234; &#992; &#0;
@@ -675,8 +688,9 @@ the code point `U+0000` will also be replaced by `U+FFFD`.
 [Hexadecimal numeric character
 references](@) consist of `&#` +
 either `X` or `x` + a string of 1-6 hexadecimal digits + `;`.
-They too are parsed as the corresponding Unicode character (this
-time specified with a hexadecimal numeral instead of decimal).
+They too are parsed and sanitized as the corresponding Unicode scalar value
+according to the rules of the HTML Living Standard
+(this time specified with a hexadecimal numeral instead of decimal).
 
 ```````````````````````````````` example
 &#X22; &#XD06; &#xcab;
@@ -700,7 +714,7 @@ Here are some nonentities:
 ````````````````````````````````
 
 
-Although HTML5 does accept some entity references
+Although the HTML Living Standard does accept some entity references
 without a trailing semicolon (such as `&copy`), these are not
 recognized here, because it makes the grammar too ambiguous:
 
@@ -711,7 +725,7 @@ recognized here, because it makes the grammar too ambiguous:
 ````````````````````````````````
 
 
-Strings that are not on the list of HTML5 named entities are not
+Strings that are not on the list of HTML Living Standard named entities are not
 recognized as entity references either:
 
 ```````````````````````````````` example