@@ -12,7 +12,60 @@ A TC39 proposal for a new method to compare strings by Unicode code points.
12
12
- Mathieu Hofman (@mhofman)
13
13
- Mark S. Miller (@erights)
14
14
15
-
# See also
15
+
## Motivation
16
+
17
+
### Background on strings in JavaScript
18
+
19
+
JavaScript exposes strings as sequences of 16-bit values (UCS-2 style "characters"). For well-formed strings, these values are UTF-16 code units. UTF-16 uses surrogate pairs to encode any Unicode code point outside the Basic Multilingual Plane (BMP).
20
+
21
+
That means that a Unicode character with a code point in the [0x010000 - 0x10FFFF] range is represented as 2 code units: a leading surrogate in the [0xD800 - 0xDBFF] range, followed by a trailing surrogate in the [0xDC00 - 0xDFFF] range.
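As an illustration of the surrogate-pair encoding described above, here is a minimal sketch of the UTF-16 arithmetic. The helper name `toSurrogatePair` is hypothetical and only serves this example; it assumes its input is already in the [0x010000 - 0x10FFFF] range.

```javascript
// Hypothetical helper: split a supplementary code point (assumed to be in
// the 0x10000..0x10FFFF range) into its UTF-16 surrogate pair.
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;          // 20-bit value
  const leading = 0xd800 + (offset >> 10);     // top 10 bits
  const trailing = 0xdc00 + (offset & 0x3ff);  // bottom 10 bits
  return [leading, trailing];
}

const grinning = "\u{1F600}"; // U+1F600, outside the BMP
const [leading, trailing] = toSurrogatePair(0x1f600);

console.log(leading.toString(16), trailing.toString(16)); // d83d de00
console.log(grinning.length);                             // 2
console.log(grinning.charCodeAt(0) === leading);          // true
console.log(grinning.charCodeAt(1) === trailing);         // true
```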
22
+
23
+
For further reading, see @mathiasbynens's post on [JavaScript’s internal character encoding](https://mathiasbynens.be/notes/javascript-encoding).
24
+
25
+
### Effect of string encoding on JavaScript programs
26
+
27
+
This encoding choice is observed by JavaScript programs in the following main cases:
28
+
- Indexed access to strings. This extends to all String APIs involving offsets or length
29
+
- Matching strings using a RegExp without the `u` or `v` flag
30
+
- Comparing strings
31
+
32
+
Unlike indexed access, iterators on strings do operate on code points. Similarly, the `u` and `v` RegExp flags enable matching whole Unicode code points instead of code units / surrogate halves. However, no String API exists that allows comparing strings by code point.
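The difference between code unit and code point views can be observed with a short sketch like the following. The sample string is arbitrary and chosen only because it contains one character outside the BMP.

```javascript
const text = "a\u{1F600}b"; // "a", U+1F600 (outside the BMP), "b"

// Indexed access and length observe UTF-16 code units.
console.log(text.length);                     // 4
console.log(text.charCodeAt(1).toString(16)); // d83d (a leading surrogate)

// The string iterator yields whole code points.
const codePoints = [...text].map((ch) => ch.codePointAt(0).toString(16));
console.log(codePoints.join(" "));            // 61 1f600 62

// Without the `u` flag, `.` matches a single code unit.
console.log(/^.$/.test("\u{1F600}"));         // false
console.log(/^.$/u.test("\u{1F600}"));        // true
```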
33
+
34
+
Because JavaScript compares strings by their 16-bit code units, any code point in the range [0xE000 - 0xFFFF] will sort after a leading surrogate used to encode the first half of a code point in the [0x010000 - 0x10FFFF] range, even though its code point value is lower.
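A small sketch makes the discrepancy concrete. The two sample characters are arbitrary picks: one BMP code point above the surrogate range, and one code point outside the BMP.

```javascript
const bmp = "\uFF61";       // U+FF61: a single code unit, 0xFF61
const astral = "\u{1F600}"; // U+1F600: code units 0xD83D 0xDE00

// By code point, U+FF61 comes before U+1F600.
console.log(bmp.codePointAt(0) < astral.codePointAt(0)); // true

// By code unit, 0xFF61 is greater than the leading surrogate 0xD83D.
console.log(bmp < astral);                               // false

// The default sort uses the same code unit ordering.
const sorted = [bmp, astral].sort();
console.log(sorted[0] === astral);                       // true
```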
35
+
36
+
### Interoperability with other systems
37
+
38
+
This comparison behavior puts JavaScript at odds with other languages or systems that end up comparing strings by their Unicode code points. That includes any language or system that uses UTF-8 as its string encoding and relies on byte-wise comparison (the byte order of UTF-8 preserves code point order).
39
+
40
+
Examples of languages using UTF-8 encoding are Swift and Go. SQLite also encodes strings using UTF-8 by default. Because of this, the sort order of strings in these systems may not match the sort order of the same strings in JavaScript.
41
+
42
+
## Proposal
43
+
44
+
A `String.codePointCompare(a, b)` function (actual name TBD) that compares 2 strings by their code points. The function could be used as a comparator with `Array.prototype.sort`.
45
+
46
+
## Alternatives considered
47
+
48
+
### Manual iteration
49
+
50
+
It's possible to write a comparator that manually iterates over the strings, which is the status quo today. This can be implemented either by manually advancing String iterators in lock-step, or by using indexed access and retrieving the code units.
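As a rough sketch of the lock-step iterator approach, a userland comparator might look like the following. The name `compareByCodePoint` is hypothetical, and the treatment of unmatched surrogates (compared by their code unit value, which is what `codePointAt` returns for them) is an assumption of this example rather than settled behavior.

```javascript
// Hypothetical userland comparator; not the proposed API itself.
function compareByCodePoint(a, b) {
  const iterA = a[Symbol.iterator]();
  const iterB = b[Symbol.iterator]();
  while (true) {
    const stepA = iterA.next();
    const stepB = iterB.next();
    if (stepA.done || stepB.done) {
      // Equal so far: a string that is a prefix of the other sorts first.
      if (stepA.done && stepB.done) return 0;
      return stepA.done ? -1 : 1;
    }
    // Each iterator value is one code point (or one unmatched surrogate).
    const cpA = stepA.value.codePointAt(0);
    const cpB = stepB.value.codePointAt(0);
    if (cpA !== cpB) return cpA < cpB ? -1 : 1;
  }
}

const words = ["\u{1F600}", "\uFF61", "z"];
const ordered = words.slice().sort(compareByCodePoint);
console.log(ordered.map((s) => s.codePointAt(0).toString(16)).join(" "));
// 7a ff61 1f600
```

With the default sort, the same array would instead place U+1F600 before U+FF61.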
51
+
52
+
### A "locale-less" collation
53
+
54
+
There are already alternative comparators in the language, e.g. `String.prototype.localeCompare` or `Intl.Collator.prototype.compare`. While these operate on code points, they take the locale into consideration and collapse characters in the same equivalence class.
55
+
56
+
We could imagine a collation that does not perform any locale-specific logic, and simply enables comparing strings by their Unicode code points.
57
+
58
+
## Q&A
59
+
60
+
### What should the comparator return if it encounters malformed strings?
61
+
62
+
An unmatched surrogate in a string could fall back to being compared by its code unit value. There may be alternative behaviors that could be considered.
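For reference, existing code point APIs already expose an unmatched surrogate as its code unit value, which is one way a code unit fallback could fall out naturally. The snippet below only demonstrates that existing behavior; it does not describe what the proposal will specify.

```javascript
const lone = "\uD83D"; // a leading surrogate with no trailing surrogate

// codePointAt returns the code unit value for an unmatched surrogate.
console.log(lone.codePointAt(0).toString(16)); // d83d

// The string iterator yields the unmatched surrogate as its own element.
console.log([..."a\uD83Db"].length);           // 3

// Under a code unit fallback, the unmatched surrogate would order below
// both U+FF61 and U+1F600.
console.log(lone.codePointAt(0) < "\uFF61".codePointAt(0));    // true
console.log(lone.codePointAt(0) < "\u{1F600}".codePointAt(0)); // true
```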
63
+
64
+
### What about comparison operators like `<` or `>`?
65
+
66
+
Because changing their behavior would be a breaking change, they would continue comparing strings lexicographically by code unit. However, a program could instead check the sign of the result of calling the proposed code point comparator function with the 2 strings.
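To sketch what checking the sign could look like, the example below uses a hypothetical stand-in named `codePointCompare`, written here with indexed access. It is only an assumption about how such a comparator might behave, not the proposed API.

```javascript
// Hypothetical stand-in for the proposed comparator, using indexed access.
function codePointCompare(a, b) {
  let i = 0;
  while (i < a.length && i < b.length) {
    const cpA = a.codePointAt(i);
    const cpB = b.codePointAt(i);
    if (cpA !== cpB) return cpA < cpB ? -1 : 1;
    // Equal code points: skip 2 code units for a surrogate pair, else 1.
    i += cpA > 0xffff ? 2 : 1;
  }
  // One string is a prefix of the other: the shorter one sorts first.
  return Math.sign(a.length - b.length);
}

const left = "\uFF61";     // U+FF61
const right = "\u{1F600}"; // U+1F600

console.log(left < right);                       // false (code unit order)
console.log(codePointCompare(left, right) < 0);  // true  (code point order)
console.log(codePointCompare(left, left) === 0); // true
```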