@eggrobin eggrobin commented Feb 11, 2025

[182-C7] Consensus: Provisionally assign 31 code points U+2E60..U+2E63, U+A7DD and U+1DF68..U+1DF81, in the Supplemental Punctuation, Latin Extended-D and Latin Extended-G blocks, to characters for EPA with names and code points as described in Section 2.2 of L2/24-277. [Ref. 1.6 in L2/25-010]

[185-C40] Consensus: UTC accepts for encoding in Unicode 18.0 the following 321 Arabic, Armenian, Bengali, Cuneiform, Devanagari, Hebrew, Kana, Khitan, Latin, Mongolian, Phonetic and other symbol characters for which code points have previously been assigned:

  1. Arabic (39 characters—ref. 180-C22, 180-C26): 10EC9..10ECF, 10ED9..10EEE, 10EF0..10EF9
  2. Armenian (3 characters—ref. 179-C46): 0558, 058B..058C
  3. Bengali (1 character—ref. 180-C30): 0984
  4. Cuneiform numerals (12 characters—ref. 182-C3): 1246F, 12475..1247F
  5. Devanagari (1 character—ref. 182-C5): 11B0A
  6. Hebrew (1 character—ref. 182-C4): 05C8
  7. Kana (7 characters—ref. 180-C6, 182-C31, 183-C54, 184-C38): 1B123..1B125, 1B126, 1B127..1B128, 1B168
  8. Khitan (5 characters—ref. 184-C5): 18CD6..18CDA
  9. Latin (54 characters—ref. 181-C8, 181-C10, 182-C6, 182-C7, 182-C8, 182-C9, 183-C8): 2E60..2E63, A7DD, A7E2, AB6C..AB6D, 1DF57..1DF59, 1DF5A..1DF66, 1DF67, 1DF68..1DF81, 1DFCD..1DFCF
  10. Mongolian (1 character—ref. 178-C30): 1879
  11. Phonetic (114 characters—ref. 179-C55, 179-C59, 179-C60, 180-C32, 180-C33, 180-C34, 180-C35, 180-C36, 180-C37, 181-C33, 181-C34, 181-C35, 181-C36, 181-C45, 183-C10): 1ADE..1ADF, 1AEC..1AF0, 208F, 209D..209F, 107BB..107BF, 1DF1F..1DF24, 1DF2B..1DF2C, 1DF2D..1DF3A, 1DF3B..1DF3D, 1DF3E..1DF3F, 1DF40..1DF56, 1DFD0, 1DFD1, 1DFD2..1DFD7, 1DFD8..1DFE8, 1DFE9..1DFF2, 1DFF3..1DFF4, 1DFF5..1DFF9, 1DFFA..1DFFF
  12. Symbols (81 characters—ref. 178-C31, 178-C36, 178-C37, 180-C38, 180-C39, 180-C40, 181-C38, 181-C39, 181-C40, 182-C10, 182-C11, 183-C12, 183-C13, 184-C18): 20C2, 1CEF1..1CEF5, 1D127..1D128, 1D1EB..1D1F6, 1D1F7..1D1FE, 1D1FF, 1D250..1D255, 1D256..1D25A, 1D25B..1D25F, 1D260, 1D261, 1D262..1D27F, 1D280..1D281, 1F1AE, 1F7DA
  13. Tangut (2 characters—ref. 183-C7, 184-C4): 18D1F..18D20

@kirkrmiller

Do pull requests need to specify when characters should be added to the PropList file for the Soft_Dotted property? Three of the characters here do: U+1DF6F, U+1DF70 and U+1DF71.

@eggrobin eggrobin added the ucd-δ-needs-revision label (have data, but UTC approved changes that need to be made) Nov 7, 2025
@eggrobin
Member Author

eggrobin commented Nov 7, 2025

Good catch. I will fix that…

@eggrobin eggrobin marked this pull request as ready for review November 11, 2025 13:57
@eggrobin
Member Author

@markusicu We had reported on the properties of these characters in L2/25-087, pp. 14 sq., but nobody had spotted the soft dotted issue; ideally I would like to include a note about the updated comparisons in the report to UTC-186, but I am not sure how to do that (I guess I could reopen the SAH issue, but that seems a bit over the top).

What do you think?

@eggrobin eggrobin requested a review from markusicu November 11, 2025 14:00
Comment on lines 36042 to 36045
1DF68;LATIN CAPITAL LETTER PHONOTYPIC A WITH SWASH;Lu;0;L;;;;;N;;;;1DF69;
1DF69;LATIN SMALL LETTER PHONOTYPIC A WITH SWASH;Ll;0;L;;;;;N;;;1DF68;;1DF68
1DF6A;LATIN CAPITAL LETTER PHONOTYPIC ROUNDTOP A;Lu;0;L;;;;;N;;;;1DF6B;
1DF6B;LATIN SMALL LETTER PHONOTYPIC ROUNDTOP A;Ll;0;L;;;;;N;;;1DF6A;;1DF6A

Too bad that we missed early on that we have another range of characters with alternating small & capital letters :-(

@roozbehp @Ken-Whistler @PeterConstable FYI
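
To make the "alternating small & capital letters" observation concrete, here is a minimal sketch that parses the four UnicodeData.txt lines quoted above and checks that each capital maps to the code point immediately after it, and back. The function names `parse` and `alternating_pairs` are made up for this illustration; only the field positions (general category in field 2, simple uppercase and lowercase mappings in fields 12 and 13) come from the UnicodeData.txt format.

```python
# The four UnicodeData.txt lines under review, copied from the diff above.
LINES = """\
1DF68;LATIN CAPITAL LETTER PHONOTYPIC A WITH SWASH;Lu;0;L;;;;;N;;;;1DF69;
1DF69;LATIN SMALL LETTER PHONOTYPIC A WITH SWASH;Ll;0;L;;;;;N;;;1DF68;;1DF68
1DF6A;LATIN CAPITAL LETTER PHONOTYPIC ROUNDTOP A;Lu;0;L;;;;;N;;;;1DF6B;
1DF6B;LATIN SMALL LETTER PHONOTYPIC ROUNDTOP A;Ll;0;L;;;;;N;;;1DF6A;;1DF6A
""".splitlines()


def parse(line):
    """Split one UnicodeData.txt line into the fields this check needs."""
    f = line.split(";")
    return {
        "cp": int(f[0], 16),
        "name": f[1],
        "gc": f[2],
        "upper": int(f[12], 16) if f[12] else None,  # Simple_Uppercase_Mapping
        "lower": int(f[13], 16) if f[13] else None,  # Simple_Lowercase_Mapping
    }


def alternating_pairs(records):
    """Return (capital, small) pairs where small == capital + 1 and the
    simple case mappings point at each other."""
    by_cp = {r["cp"]: r for r in records}
    pairs = []
    for r in records:
        if r["gc"] == "Lu" and r["lower"] == r["cp"] + 1:
            small = by_cp.get(r["lower"])
            if small and small["gc"] == "Ll" and small["upper"] == r["cp"]:
                pairs.append((r["cp"], small["cp"]))
    return pairs


records = [parse(line) for line in LINES]
pairs = alternating_pairs(records)
print([(f"{a:04X}", f"{b:04X}") for a, b in pairs])
```

Run over the whole 1DF68..1DF81 range, a check like this would list every such pair, which may help whoever maintains code that special-cases alternating ranges.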

Comment on lines 36066 to 36067
1DF80;LATIN CAPITAL LETTER A WITH TOPBAR;Lo;0;L;;;;;N;;;;;
1DF81;LATIN CAPITAL LETTER E WITH BENT TOPBAR;Lo;0;L;;;;;N;;;;;

It seems weird that the uppercase-only letters are gc=Lo while the lowercase-only letters (1DF70, 1DF71) are gc=Ll, but I see in https://github.com/unicode-org/sah/issues/456 that we discussed this...

It feels like they should at least be Other_Uppercase (and thus also Cased).

@Ken-Whistler @macchiati @PeterConstable FYI
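
For readers less familiar with how the derived case properties hang together, a small sketch of the suggestion: Uppercase is derived as Lu plus Other_Uppercase, Lowercase as Ll plus Other_Lowercase, and a character is Cased if it is Lowercase, Uppercase, or gc=Lt. The helper names below are invented for this example, and whether U+1DF80 actually gets Other_Uppercase is the open question in this thread, not something the code asserts.

```python
def is_uppercase(gc, other_uppercase=False):
    # Uppercase is derived from gc=Lu plus Other_Uppercase.
    return gc == "Lu" or other_uppercase


def is_lowercase(gc, other_lowercase=False):
    # Lowercase is derived from gc=Ll plus Other_Lowercase.
    return gc == "Ll" or other_lowercase


def is_cased(gc, other_uppercase=False, other_lowercase=False):
    # Cased: has Lowercase or Uppercase, or is a titlecase letter.
    return (
        gc == "Lt"
        or is_uppercase(gc, other_uppercase)
        or is_lowercase(gc, other_lowercase)
    )


# U+1DF80 as it stands in the diff: gc=Lo, no Other_Uppercase.
as_in_diff = (is_uppercase("Lo"), is_cased("Lo"))

# The same character if it were given Other_Uppercase (hypothetical).
with_other_uppercase = (
    is_uppercase("Lo", other_uppercase=True),
    is_cased("Lo", other_uppercase=True),
)

# U+1DF70, a lowercase-only letter with gc=Ll, is Cased with no extra help.
lowercase_only = is_cased("Ll")

print(as_in_diff, with_other_uppercase, lowercase_only)
```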

seeing no response, but I don't want to just forget about this:

2DD7 ; Cn # <reserved-2DD7>
2DDF ; Cn # <reserved-2DDF>
2E5E..2E7F ; Cn # [34] <reserved-2E5E>..<reserved-2E7F>
2E5E..2E5F ; Cn # [2] <reserved-2E5E>..<reserved-2E5F>

FYI: I think we should stop printing gc=Cn lines here.
PVA.txt does have

# @missing: 0000..10FFFF; General_Category; Unassigned

Unless you disagree, I can create a PAG issue.
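
As a rough illustration of why the explicit gc=Cn lines are redundant, the sketch below resolves General_Category from a handful of explicit entries and falls back to the `@missing` default for everything else. The `EXPLICIT` table is toy data limited to the four new punctuation code points (their categories are taken from the Pattern_Syntax lines in this PR); `parse_missing` and `general_category` are names made up for the example.

```python
import re

# The default line quoted above.
MISSING_LINE = "# @missing: 0000..10FFFF; General_Category; Unassigned"

# Toy subset of explicit assignments (not a real data file).
EXPLICIT = {0x2E60: "Po", 0x2E61: "Po", 0x2E62: "Ps", 0x2E63: "Pe"}


def parse_missing(line):
    """Parse '# @missing: START..END; Property; Default' into a tuple."""
    m = re.fullmatch(
        r"#\s*@missing:\s*([0-9A-Fa-f]+)\.\.([0-9A-Fa-f]+)\s*;\s*(\w+)\s*;\s*(\S+)",
        line.strip(),
    )
    if m is None:
        raise ValueError(f"not an @missing line: {line!r}")
    start, end, prop, default = m.groups()
    return int(start, 16), int(end, 16), prop, default


def general_category(cp, explicit, missing):
    """Explicit value if listed, otherwise the @missing default."""
    start, end, _prop, default = missing
    if cp in explicit:
        return explicit[cp]
    if start <= cp <= end:
        return default
    raise ValueError(f"U+{cp:04X} is outside the @missing range")


missing = parse_missing(MISSING_LINE)
print(general_category(0x2E5E, EXPLICIT, missing))  # reserved, takes the default
print(general_category(0x2E62, EXPLICIT, missing))  # explicitly listed
```

With the default in place, a consumer gets Unassigned (short alias Cn) for U+2E5E without any `<reserved-…>` line being printed.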

Comment on lines +1884 to +1886
2E60..2E61 ; Pattern_Syntax # Po [2] WIGGLY EXCLAMATION MARK..INVERTED WIGGLY EXCLAMATION MARK
2E62 ; Pattern_Syntax # Ps LEFT PARENTHESIS WITH MIDDLE RING
2E63 ; Pattern_Syntax # Pe RIGHT PARENTHESIS WITH MIDDLE RING

nifty new Pattern_Syntax ;-)
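
A minimal sketch of reading lines in this format, using the three Pattern_Syntax lines from the diff as input. `parse_line` and `code_points_with` are hypothetical helper names, not part of any Unicode tooling.

```python
# The three lines added in this PR, copied from the diff above.
LINES = [
    "2E60..2E61 ; Pattern_Syntax # Po [2] WIGGLY EXCLAMATION MARK..INVERTED WIGGLY EXCLAMATION MARK",
    "2E62 ; Pattern_Syntax # Ps LEFT PARENTHESIS WITH MIDDLE RING",
    "2E63 ; Pattern_Syntax # Pe RIGHT PARENTHESIS WITH MIDDLE RING",
]


def parse_line(line):
    """Return (first, last, property) for a data line, or None for a
    blank or comment-only line."""
    data = line.split("#", 1)[0].strip()  # drop the trailing comment
    if not data:
        return None
    rng, prop = (part.strip() for part in data.split(";"))
    first, _, last = rng.partition("..")
    return int(first, 16), int(last or first, 16), prop


def code_points_with(prop, lines):
    """Collect every code point that the given lines assign to `prop`."""
    result = set()
    for line in lines:
        parsed = parse_line(line)
        if parsed and parsed[2] == prop:
            first, last, _ = parsed
            result.update(range(first, last + 1))
    return result


pattern_syntax = code_points_with("Pattern_Syntax", LINES)
print(sorted(f"{cp:04X}" for cp in pattern_syntax))
```

The other binary properties in PropList.txt, Soft_Dotted among them, use the same line shape, so the same reader would apply to the Soft_Dotted additions discussed earlier in this thread.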

@eggrobin eggrobin requested a review from markusicu November 29, 2025 01:46
@markusicu
Member

CI check failures

Error:    TestTestUnicodeInvariants.testAdditionComparisons:64
 TestUnicodeInvariants.testInvariants(addition-comparisons) failed ==>
 expected: <0> but was: <1>

Error:    TestVersionedSymbolTable.testIdentityAndNullQueries:174
 Expected \p{Bidi_Paired_Bracket_Type=None} ⊇ \p{Bidi_Paired_Bracket=@none@} =
 [^()\[\]\{\}\u0F3A-\u0F3D\u169B\u169C\u2045\u2046\u207D\u207E\u208D\u208E\u2308-\u230B\u2329\u232A\u2768-\u2775\u27C5\u27C6\u27E6-\u27EF\u2983-\u2998\u29D8-\u29DB\u29FC\u29FD\u2E22-\u2E29\u2E55-\u2E5C\u3008-\u3011\u3014-\u301B\uFE59-\uFE5E\uFF08\uFF09\uFF3B\uFF3D\uFF5B\uFF5D\uFF5F\uFF60\uFF62\uFF63]
 but \p{Bidi_Paired_Bracket=@none@}
 contains unexpected [\u2E62\u2E63] ==>
 expected: <true> but was: <false>
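
One way to read the second failure, sketched with toy data: the test expects every code point with no Bidi_Paired_Bracket value to also have Bidi_Paired_Bracket_Type=None. The tables below are assumptions made up for the illustration (in particular, the Open/Close types for U+2E62 and U+2E63 are inferred from the log, not copied from a data file); they show how a Bidi_Paired_Bracket table that lacks the two new parentheses would produce exactly the reported leftover set.

```python
UNIVERSE = {0x28, 0x29, 0x41, 0x2E62, 0x2E63}

# Assumed Bidi_Paired_Bracket_Type values; anything absent counts as None.
BPT = {0x28: "Open", 0x29: "Close", 0x2E62: "Open", 0x2E63: "Close"}

# Bidi_Paired_Bracket as an out-of-date table would see it (toy data).
BPB_STALE = {0x28: 0x29, 0x29: 0x28}

# The same table with the new pair included (toy data).
BPB_FRESH = {**BPB_STALE, 0x2E62: 0x2E63, 0x2E63: 0x2E62}


def unexpected(bpt, bpb, universe):
    """Code points with no paired bracket whose bracket type is not None,
    i.e. the violations of BPT=None being a superset of BPB=<none>."""
    bpt_none = {cp for cp in universe if bpt.get(cp, "None") == "None"}
    bpb_none = {cp for cp in universe if cp not in bpb}
    return bpb_none - bpt_none


stale_violations = unexpected(BPT, BPB_STALE, UNIVERSE)
fresh_violations = unexpected(BPT, BPB_FRESH, UNIVERSE)
print(sorted(f"{cp:04X}" for cp in stale_violations), fresh_violations)
```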

@markusicu
Member

@markusicu We had reported on the properties of these characters in L2/25-087, pp. 14 sq., but nobody had spotted the soft dotted issue; ideally I would like to include a note about the updated comparisons in the report to UTC-186, but I am not sure how to do that (I guess I could reopen the SAH issue, but that seems a bit over the top).

What do you think?

I created https://github.com/unicode-org/properties/issues/496

The proposal L2/24-277 did include:

The following characters have to get the “soft-dotted” property:

U+1DF6F LATIN SMALL LETTER PHONOTYPIC DIPHTHONG AI
U+1DF70 LATIN SMALL LETTER I WITH PIGTAIL AT BOTTOM
U+1DF71 LATIN SMALL LETTER STRETCHED I

... so maybe we should just give them Soft_Dotted and move on?

@eggrobin
Member Author

... so maybe we should just give them Soft_Dotted and move on?

We definitely should, and I do not think we should block this PR on PAG rubberstamping this soft-dottedness. But since an earlier PAG report said something wrong about these characters, it might be useful to correct the record in the next PAG report. I will fill in https://github.com/unicode-org/properties/issues/496 accordingly.

@eggrobin
Member Author

Error:    TestVersionedSymbolTable.testIdentityAndNullQueries:174
 Expected \p{Bidi_Paired_Bracket_Type=None} ⊇ \p{Bidi_Paired_Bracket=@none@} =
 [^()\[\]\{\}\u0F3A-\u0F3D\u169B\u169C\u2045\u2046\u207D\u207E\u208D\u208E\u2308-\u230B\u2329\u232A\u2768-\u2775\u27C5\u27C6\u27E6-\u27EF\u2983-\u2998\u29D8-\u29DB\u29FC\u29FD\u2E22-\u2E29\u2E55-\u2E5C\u3008-\u3011\u3014-\u301B\uFE59-\uFE5E\uFF08\uFF09\uFF3B\uFF3D\uFF5B\uFF5D\uFF5F\uFF60\uFF62\uFF63]
 but \p{Bidi_Paired_Bracket=@none@}
 contains unexpected [\u2E62\u2E63] ==>
 expected: <true> but was: <false>

… I am confused by this one.

@eggrobin
Member Author

… I am confused by this one.

On my machine it works!?

@markusicu
Member

Any chance that the server sees the Unicode 17 version of Bidi_Paired_Bracket?

@eggrobin
Member Author

Maybe, but how?

@markusicu
Member

maybe comment out this one test and move on for now?

@eggrobin
Member Author

It was yet another static cache (this time of the set of unassigned code points)… 😩

From 2011, with this comment:

     * Reset the cache properties. Must be done if the version of Unicode is different than the ICU one, AND any UnicodeProperty has already been instantiated.
     * TODO make this a bit more robust.

Making this a bit more robust sure would have been nice…

Revert "CI is haunted"

This reverts commit 6e9b5f5.

Revert "moo"

This reverts commit 2eec56c.

Revert "meow?"

This reverts commit 6bff11e.

Revert "meow"

This reverts commit 35fe8e8.

Revert "more traces…"

This reverts commit 8a8d5be.

Revert "traces"

This reverts commit f88af9a.
@markusicu
Member

Poor you... but thanks for chasing this down!

Maybe setDefaultXSymbolTable() should just call ResetCacheProperties()?

@eggrobin
Member Author

Maybe setDefaultXSymbolTable() should just call ResetCacheProperties()?

The former lives in ICU, and the latter in the tools, so not really an option.

But I think UnicodeProperty is in dire need of refactoring, and one part of that could be making whatever caching there is correct (or even getting rid of it if it turns out not to be useful).
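
To illustrate the kind of bug described above and one possible direction for the refactoring, here is a toy sketch in Python (the real code is Java and is not reproduced here). All names and data are hypothetical: a single static cache of unassigned code points goes stale once a second Unicode version is queried, whereas a cache keyed by version does not need a manual reset.

```python
# Toy data: which code points are assigned in each version (illustrative only).
ASSIGNED = {
    "17.0": {0x41},
    "18.0": {0x41, 0x2E62, 0x2E63},
}
UNIVERSE = {0x41, 0x2E62, 0x2E63}

# Bug pattern: one static cache, filled by whichever version asks first.
_static_unassigned = None


def unassigned_static(version):
    global _static_unassigned
    if _static_unassigned is None:
        _static_unassigned = UNIVERSE - ASSIGNED[version]
    return _static_unassigned  # ignores `version` on later calls


# One possible fix: key the cache by version so no reset call is needed.
_unassigned_by_version = {}


def unassigned_keyed(version):
    if version not in _unassigned_by_version:
        _unassigned_by_version[version] = frozenset(UNIVERSE - ASSIGNED[version])
    return _unassigned_by_version[version]


first = unassigned_static("17.0")   # fills the static cache
stale = unassigned_static("18.0")   # same set comes back for 18.0
fresh = unassigned_keyed("18.0")    # computed for the requested version
print(sorted(stale), sorted(fresh))
```

Dropping the cache altogether, as suggested above, would be the other option if measurement shows it is not buying anything.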

@eggrobin eggrobin merged commit 7d736b1 into unicode-org:main Nov 29, 2025
15 of 16 checks passed