Commit dd25392

Update README.md
First stab at an explainer
1 parent fa88886 commit dd25392

File tree

1 file changed

+55
-2
lines changed

README.md

Lines changed: 55 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -2,7 +2,7 @@
22

33
A TC39 proposal for a new method to compare strings by Unicode code points.
44

5-
# Status
5+
## Status
66

77
[The TC39 Process](https://tc39.es/process-document/)
88

@@ -12,7 +12,60 @@ A TC39 proposal for a new method to compare strings by Unicode code points.
1212
- Mathieu Hofman (@mhofman)
1313
- Mark S. Miller (@erights)
1414

15-
# See also
15+
## Motivation
16+
17+
### Background on strings in JavaScript
18+
19+
JavaScript exposes strings as sequences of 16-bit values, in the style of UCS-2. For well-formed strings, these values are UTF-16 code units. UTF-16 uses surrogate pairs to represent any Unicode code point outside the Basic Multilingual Plane.
20+
21+
That means a Unicode character with a code point in the [0x010000 - 0x10FFFF] range is represented as 2 code units: a leading surrogate in the [0xD800 - 0xDBFF] range, followed by a trailing surrogate in the [0xDC00 - 0xDFFF] range.
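
As a small illustration of the surrogate ranges above, the sketch below takes U+1F600 (chosen here only as an example of a code point outside the Basic Multilingual Plane) and shows its two code units, then recombines them by hand.

```javascript
// U+1F600 lies outside the Basic Multilingual Plane.
const grinning = '\u{1F600}';

// It occupies two UTF-16 code units.
const codeUnitLength = grinning.length;   // 2
const leading = grinning.charCodeAt(0);   // 0xD83D, in [0xD800 - 0xDBFF]
const trailing = grinning.charCodeAt(1);  // 0xDE00, in [0xDC00 - 0xDFFF]

// Recombining the two halves by hand gives back the original code point.
const recombined = 0x10000 + ((leading - 0xD800) << 10) + (trailing - 0xDC00);

// codePointAt performs the same decoding.
const decoded = grinning.codePointAt(0);  // 0x1F600
```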
22+
23+
For further reading, see @mathiasbynens's post on [JavaScript’s internal character encoding](https://mathiasbynens.be/notes/javascript-encoding).
24+
25+
### Effect of string encoding on JavaScript programs
26+
27+
This encoding choice is observed by JavaScript programs in the following main cases:
28+
- Indexed access to a string. This extends to all String APIs involving offsets or length
29+
- Matching strings using RegExp without the `u` or `v` flag
30+
- Comparing strings
31+
32+
Unlike indexed access, iterators on strings do operate on code points. Similarly, the `u` and `v` RegExp flags enable matching whole Unicode code points instead of code units / surrogate halves. However, no String API exists that allows comparing strings by code point.
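
The difference between code unit and code point views can be seen in a short sketch. The sample string is an arbitrary choice for illustration; only the `u` flag is shown, since `v` needs a more recent engine.

```javascript
const text = 'a\u{1F600}b';

// Indexed access and length count code units.
const unitCount = text.length;        // 4
const secondUnit = text[1];           // '\uD83D', a lone leading surrogate

// The string iterator walks code points.
const pointCount = [...text].length;  // 3

// Without the `u` flag, `.` matches a single code unit.
const matchesWithoutU = /^a.b$/.test(text);  // false
// With the `u` flag, `.` matches a whole code point.
const matchesWithU = /^a.b$/u.test(text);    // true
```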
33+
34+
Because JavaScript compares strings by their 16-bit code units, any code point in the range [0xE000 - 0xFFFF] will sort after a leading surrogate used to encode the first half of a code point in the [0x010000 - 0x10FFFF] range, even though its code point value is lower.
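
A minimal sketch of the mismatch, using U+FF5E and U+1F600 as arbitrary sample characters from the two ranges:

```javascript
const bmpChar = '\uFF5E';        // U+FF5E, a single code unit 0xFF5E
const astralChar = '\u{1F600}';  // U+1F600, code units 0xD83D 0xDE00

// By code unit, 0xFF5E > 0xD83D, so the BMP character sorts last.
const bmpSortsAfter = bmpChar > astralChar;  // true

// By code point, 0xFF5E < 0x1F600, so the order is reversed.
const bmpHasLowerCodePoint =
  bmpChar.codePointAt(0) < astralChar.codePointAt(0);  // true

// The default sort order follows code units.
const defaultSorted = [bmpChar, astralChar].sort();  // [astralChar, bmpChar]
```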
35+
36+
### Interoperability with other systems
37+
38+
This comparison behavior puts JavaScript at odds with other languages or systems that end up comparing strings by their Unicode code points. That includes any language or system that uses UTF-8 as its string encoding and relies on byte-wise comparison (byte order in UTF-8 preserves code point order).
39+
40+
Examples of languages using UTF-8 encoding are Swift and Go. SQLite also encodes strings using UTF-8 by default. Because of this, the sort order of strings in these systems may not match the sort order of the same strings in JavaScript.
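
To see the byte-wise UTF-8 ordering from JavaScript, the sketch below encodes two sample strings with `TextEncoder` and compares the bytes. `compareBytes` is a hypothetical helper written for this illustration, and the sketch assumes well-formed strings, since `TextEncoder` replaces lone surrogates with U+FFFD.

```javascript
// Hypothetical helper: lexicographic comparison of two byte arrays.
function compareBytes(x, y) {
  const shared = Math.min(x.length, y.length);
  for (let i = 0; i < shared; i++) {
    if (x[i] !== y[i]) return x[i] < y[i] ? -1 : 1;
  }
  // One array is a prefix of the other: the shorter one sorts first.
  return Math.sign(x.length - y.length);
}

const encoder = new TextEncoder();
const bmpBytes = encoder.encode('\uFF5E');        // EF BD 9E
const astralBytes = encoder.encode('\u{1F600}');  // F0 9F 98 80

// UTF-8 bytes order U+FF5E before U+1F600, matching code point order.
const byteOrder = compareBytes(bmpBytes, astralBytes);  // -1

// JavaScript's own comparison orders the same strings the other way.
const jsSaysGreater = '\uFF5E' > '\u{1F600}';  // true
```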
41+
42+
## Proposal
43+
44+
This proposal adds a `String.codePointCompare(a, b)` function (actual name TBD) that compares 2 strings by their code points. The function could be used as a comparator with `Array.prototype.sort`.
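
The proposed function does not exist yet, so the sketch below uses a hypothetical standalone `codePointCompare` as a stand-in to show how such a comparator might be used with `Array.prototype.sort`. It advances two string iterators in lock-step and is only one possible reading of the intended semantics.

```javascript
// Hypothetical stand-in for the proposed comparator (actual name TBD).
function codePointCompare(a, b) {
  const iterA = a[Symbol.iterator]();
  const iterB = b[Symbol.iterator]();
  for (;;) {
    const stepA = iterA.next();
    const stepB = iterB.next();
    if (stepA.done || stepB.done) {
      // The string that ends first sorts first; 0 if both end together.
      return Number(stepB.done) - Number(stepA.done);
    }
    const pointA = stepA.value.codePointAt(0);
    const pointB = stepB.value.codePointAt(0);
    if (pointA !== pointB) return pointA < pointB ? -1 : 1;
  }
}

const words = ['\u{1F600}', '\uFF5E', 'z'];

// Code point order: U+007A, U+FF5E, U+1F600.
const byCodePoint = [...words].sort(codePointCompare);
// Default (code unit) order: U+007A, U+1F600, U+FF5E.
const byCodeUnit = [...words].sort();
```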
45+
46+
## Alternatives considered
47+
48+
### Manual iteration
49+
50+
It's possible to write a comparator that manually iterates over the strings, which is the status quo today. This can be implemented either by advancing String iterators in lock-step, or by using indexed access and retrieving the code units.
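
As a rough sketch of the indexed-access approach, the hypothetical `codePointCompareIndexed` below finds the first differing code unit, steps back if that position falls in the middle of a surrogate pair, and then compares the code points found there. The function name and the handling of lone surrogates (compared by their code unit value) are assumptions made for this illustration.

```javascript
const isLeading = (unit) => unit >= 0xD800 && unit <= 0xDBFF;
const isTrailing = (unit) => unit >= 0xDC00 && unit <= 0xDFFF;

// Hypothetical comparator based on indexed access to code units.
function codePointCompareIndexed(a, b) {
  const shared = Math.min(a.length, b.length);
  let i = 0;
  // Skip the common prefix of code units.
  while (i < shared && a.charCodeAt(i) === b.charCodeAt(i)) i++;
  // One string is a prefix of the other: the shorter one sorts first.
  if (i === shared) return Math.sign(a.length - b.length);
  // If the difference is in the second half of a surrogate pair,
  // step back to the start of that code point.
  if (
    i > 0 &&
    isLeading(a.charCodeAt(i - 1)) &&
    (isTrailing(a.charCodeAt(i)) || isTrailing(b.charCodeAt(i)))
  ) {
    i--;
  }
  return a.codePointAt(i) < b.codePointAt(i) ? -1 : 1;
}
```

Compared with lock-step iterators, this variant avoids allocating iterator result objects, at the cost of handling surrogate boundaries by hand.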
51+
52+
### A "locale-less" collation
53+
54+
There are already alternative comparators in the language, e.g. `String.prototype.localeCompare` or `Intl.Collator.prototype.compare`. While these operate on code points, they take the locale into consideration and treat characters in the same equivalence class as equal.
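
A brief sketch of how locale-aware comparison differs from a plain code point or code unit comparison. It assumes an engine with `Intl` support and uses the `en` locale purely as an example.

```javascript
const collator = new Intl.Collator('en');

// Locale-aware collation orders 'a' before 'B'...
const collatorOrder = collator.compare('a', 'B');  // negative
// ...while the raw values put 'B' (0x42) before 'a' (0x61).
const rawLessThan = 'a' < 'B';                     // false

// Canonically equivalent strings compare as equal under collation,
// even though their code points differ.
const precomposed = '\u00E9';  // é as a single code point
const decomposed = 'e\u0301';  // e followed by a combining acute accent
const collatorEqual = collator.compare(precomposed, decomposed) === 0;
const strictlyEqual = precomposed === decomposed;  // false
```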
55+
56+
We could imagine a collation that does not perform any locale-specific logic, and simply enables comparing strings by their Unicode code points.
57+
58+
## Q&A
59+
60+
### What should the comparator return if it encounters malformed strings?
61+
62+
An unmatched surrogate in a string could fall back to being compared by its code unit value. There may be alternative behaviors that could be considered.
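
For reference, this is how existing APIs already surface an unmatched surrogate. The sketch only shows current behavior that a code unit fallback could build on, and does not describe what the proposal will specify.

```javascript
// A leading surrogate that is not followed by a trailing surrogate.
const malformed = '\uD83Dx';

// The string iterator yields the lone surrogate as its own element.
const elements = [...malformed];             // ['\uD83D', 'x']

// codePointAt reports the code unit value for a lone surrogate.
const loneValue = malformed.codePointAt(0);  // 0xD83D

// Under a code unit fallback, a lone surrogate would sort before U+E000.
const sortsBeforeE000 = loneValue < '\uE000'.codePointAt(0);  // true
```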
63+
64+
### What about comparison operators like `<` or `>`?
65+
66+
Because changing their behavior would be a breaking change, they would continue comparing strings lexicographically by code unit. However, a program could instead check the sign of the result of calling the proposed code point comparator function with the 2 strings.
67+
68+
## See also
1669

1770
https://github.com/endojs/endo/pull/2008
1871
