ICU-23257 A somewhat more structured UnicodeSet parser. #3604

eggrobin · 2025-08-21T15:30:35Z

While this is a hand-rolled recursive-descent predictive parser, it has a separate lexer, so we know that the lexing is not context-dependent.
The lexer prefigures some of the distinctions introduced in UTS61 (escapes in property-query, bracketed-element distinct from string-literal, etc.), but they then get treated the old way in the parser so nothing changes.

Both the lexer and the parser mirror the UTS61 grammar (in its version before the yellow/cyan changes).
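To make the lexer/parser split concrete, here is a minimal sketch of a context-independent lexer feeding a predictive recursive-descent parser. It covers only a toy, ASCII-only subset (nested sets, ranges, `{string}` literals, backslash escapes), not the UTS61 grammar, and it is not the code in this PR; `Kind`, `Token`, `lex`, and `Parser` are hypothetical names chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Token kinds for a toy subset of UnicodeSet syntax (ASCII only).
enum class Kind { Open, Close, Dash, Literal, String, End };

struct Token {
  Kind kind;
  std::string text;  // One character for Literal, the contents for String.
};

// Context-independent lexer: the token found at a given position never
// depends on what the parser is doing.
std::vector<Token> lex(const std::string& pattern) {
  std::vector<Token> tokens;
  for (std::size_t i = 0; i < pattern.size(); ++i) {
    const char c = pattern[i];
    if (c == ' ') continue;  // Unescaped spaces are ignored.
    if (c == '[') {
      tokens.push_back({Kind::Open, "["});
    } else if (c == ']') {
      tokens.push_back({Kind::Close, "]"});
    } else if (c == '-') {
      tokens.push_back({Kind::Dash, "-"});
    } else if (c == '{') {
      const std::size_t end = pattern.find('}', i);
      if (end == std::string::npos) throw std::invalid_argument("unterminated {");
      tokens.push_back({Kind::String, pattern.substr(i + 1, end - i - 1)});
      i = end;
    } else if (c == '\\') {
      if (++i == pattern.size()) throw std::invalid_argument("dangling escape");
      tokens.push_back({Kind::Literal, std::string(1, pattern[i])});
    } else {
      tokens.push_back({Kind::Literal, std::string(1, c)});
    }
  }
  tokens.push_back({Kind::End, ""});
  return tokens;
}

// Predictive parser: one token of lookahead decides which rule applies.
class Parser {
 public:
  explicit Parser(std::vector<Token> tokens) : tokens_(std::move(tokens)) {}

  // pattern := set End
  std::set<std::string> parse() {
    std::set<std::string> result;
    parseSet(result);
    expect(Kind::End);
    return result;
  }

 private:
  const Token& peek() const {
    assert(pos_ < tokens_.size());
    return tokens_[pos_];
  }

  Token expect(Kind kind) {
    if (peek().kind != kind) throw std::invalid_argument("unexpected token");
    return tokens_[pos_++];
  }

  // set := '[' element* ']'
  void parseSet(std::set<std::string>& out) {
    expect(Kind::Open);
    while (peek().kind != Kind::Close) parseElement(out);
    expect(Kind::Close);
  }

  // element := set | String | Literal ('-' Literal)?
  void parseElement(std::set<std::string>& out) {
    switch (peek().kind) {
      case Kind::Open:
        parseSet(out);  // A nested set is treated as a union.
        return;
      case Kind::String:
        out.insert(expect(Kind::String).text);
        return;
      case Kind::Literal: {
        const char first = expect(Kind::Literal).text[0];
        char last = first;
        if (peek().kind == Kind::Dash) {
          expect(Kind::Dash);
          last = expect(Kind::Literal).text[0];
          if (last < first) throw std::invalid_argument("inverted range");
        }
        for (int c = first; c <= last; ++c) out.insert(std::string(1, static_cast<char>(c)));
        return;
      }
      default:
        throw std::invalid_argument("unexpected token");
    }
  }

  std::vector<Token> tokens_;
  std::size_t pos_ = 0;
};
```

With these definitions, `Parser(lex("[a-c{xy}[d]]")).parse()` yields the elements `a`, `b`, `c`, `d`, and `xy`. Because `lex` never consults parser state, the same input always produces the same token stream, which is the property the description relies on.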

Technically it is possible to detect that this behaves differently, as there will be fewer calls to SymbolTable::lookup; see the changes to icu4c/source/test/intltest/usettest.cpp.

But we retain the essential property (relied on by rbbi) that with $meow=CP, the call to lookupMatcher(CP) always immediately follows a call to lookup(meow).
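The ordering property can be stated as a check over a log of calls. The sketch below assumes a hypothetical recording wrapper around a symbol table that appends one `Call` per `lookup` or `lookupMatcher` invocation; `Call` and `matcherCallsFollowLookups` are illustrative names, not ICU API, and the check is only my reading of the invariant described above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// One recorded call to a (hypothetical) recording symbol table.
struct Call {
  enum Kind { kLookup, kLookupMatcher };
  Kind kind;
  std::string variableName;  // Used by kLookup only.
  // For kLookup: the single code point the variable expands to.
  // For kLookupMatcher: the code point passed as the argument.
  char32_t codePoint;
};

// Returns true if every lookupMatcher(CP) call is immediately preceded by a
// lookup call whose result was that same code point CP.
bool matcherCallsFollowLookups(const std::vector<Call>& log) {
  for (std::size_t i = 0; i < log.size(); ++i) {
    if (log[i].kind != Call::kLookupMatcher) continue;
    if (i == 0) return false;  // No preceding call at all.
    const Call& previous = log[i - 1];
    if (previous.kind != Call::kLookup) return false;
    if (previous.codePoint != log[i].codePoint) return false;
  }
  return true;
}
```

Fewer `lookup` calls overall do not break this check; only a `lookupMatcher` call that is not directly preceded by its matching `lookup` would.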

Checklist

Required: Issue filed: ICU-23257
Required: The PR title must be prefixed with a JIRA Issue number. Example: "ICU-1234 Fix xyz"
Required: Each commit message must be prefixed with a JIRA Issue number. Example: "ICU-1234 Fix xyz"
Issue accepted (done by Technical Committee after discussion)
Tests included, if applicable
API docs and/or User Guide docs changed or added, if applicable

eggrobin · 2025-09-11T12:34:30Z

@markusicu Do we need a separate ticket for this restructuring, or should I do it as part of ICU-23179?

macchiati

LGTM

BTW we should point out in the spec that, unlike ordinary set operations, where A∖B = A∩Bᶜ, UnicodeSet does not satisfy that identity for some A and B (those containing sequences):

A-B ≠ A&[^B]

Example

[ab{cd}{ef}]-[ax{cd}{ex}]
≠
[ab{cd}{ef}]&[^ax{cd}{ex}]
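The example can be worked through in a small model. This sketch represents a set as a `std::set<std::string>` in which single-character strings stand for code points, and it assumes that `[^…]` is a complement over code points only, so the result contains no multi-character strings. That assumption is labeled in the code; the model is illustrative and is not the ICU implementation. `Set`, `kAlphabet`, and the function names are hypothetical.

```cpp
#include <cassert>
#include <set>
#include <string>

// Toy model: single-character strings stand for code points,
// longer strings for sequences.
using Set = std::set<std::string>;

// Stand-in for "all code points", kept tiny for the example.
const std::string kAlphabet = "abcdefx";

// A-B: elements of a that are not in b (code points and strings alike).
Set difference(const Set& a, const Set& b) {
  Set result;
  for (const auto& element : a) {
    if (b.count(element) == 0) result.insert(element);
  }
  return result;
}

// A&B: elements present in both sets.
Set intersection(const Set& a, const Set& b) {
  Set result;
  for (const auto& element : a) {
    if (b.count(element) != 0) result.insert(element);
  }
  return result;
}

// ASSUMPTION: the complement ranges over code points only, so any strings
// in s are ignored and the result contains no strings.
Set codePointComplement(const Set& s) {
  Set result;
  for (const char c : kAlphabet) {
    const std::string element(1, c);
    if (s.count(element) == 0) result.insert(element);
  }
  return result;
}

// The example above, evaluated in the toy model.
void checkExample() {
  const Set a = {"a", "b", "cd", "ef"};
  const Set b = {"a", "x", "cd", "ex"};
  assert((difference(a, b) == Set{"b", "ef"}));
  assert((intersection(a, codePointComplement(b)) == Set{"b"}));
}
```

In this model the string `ef` survives the difference but not the intersection with the complement. If the complement instead kept the strings of B, the right-hand side would be `b` and `cd`, which still differs from the left-hand side.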

richgillam · 2025-09-13T00:46:20Z

Wow, there's a lot here. I ran out of time to read through the whole thing and was starting to lose my mental acuity well before that, so I don't think I have anything useful to say. I can try again next week if you don't get useful review feedback from other people.

I read all the way through the spec, though, and I think I managed to keep my brain going through that. I think it looks really good-- is the plan to bring all the various implementations into complete conformance with that spec, and is this one of the first steps toward doing so? That seems like a good idea. I'm assuming you've thought through the backward-compatibility issues in doing so and everybody's comfortable with them?

I hope you'll permit me one stupid question about the spec: I'm assuming that the spec allows for sets that contain both individual code points and strings, right? I did see a lot of verbiage in there about bracketed elements and how they can represent either single code points or strings and how you want to do special things with them when they're single code points. I don't remember seeing much verbiage that talks about the behavior of unicode sets that contain both strings and single code points. At least to me, the semantics of this kind of thing are far from obvious, especially if a set contains both types. I remember a lot of this being discussed when various people were proposing changes to UnicodeSet earlier, but it seems like all of that needs to be covered in the spec, and I didn't really see it there. Is that planned? Am I wrong to be concerned about this?

macchiati · 2025-09-13T04:39:19Z

I wouldn't think that performance would be affected significantly, but we probably should check at least with a simple spot test.

eggrobin · 2025-11-06T12:25:27Z

@markusicu I filed a new ticket for this. Can you take a look?

eggrobin · 2025-11-06T12:29:39Z

> At least to me, the semantics of this kind of thing are far from obvious, especially if a set contains both types.

Good point (Mark’s earlier comment has examples of this non-obviousness). I will hammer at the draft some more…

markusicu · 2025-12-09T23:52:31Z

Looks like @poulsbo is not yet in the icu-team and so I can't add him as a reviewer, although in some ways he would be the obvious choice ;-)

markusicu

I finally started to look... I won't try to understand everything.

icu4c/source/common/uniset_props.cpp

markusicu

I got to the end. Looks plausible, modulo questions and suggestions.

icu4c/source/common/uniset_props.cpp

markusicu · 2025-12-10T03:28:39Z

Do we still need some of the existing functions in uniset_props.cpp like isPerlOpen() and resemblesPropertyPattern()?

markusicu · 2025-12-10T03:31:59Z

Do we still need applyPropertyPattern()? If so, can we rewrite it using this new parser?

eggrobin · 2025-12-10T08:03:43Z

> Do we still need applyPropertyPattern()?

For now we still call it, with a comment explaining why we do this silly thing:

icu/icu4c/source/common/uniset_props.cpp

Lines 752 to 761 in 381616c

    
    // NOTE(egg): For now, we ignore the work that the lexer did to find out where the
    // property-query or named-element ended in order to retain the existing buggy behaviour of
    // variables containing property queries.
    lexer.getCharacterIterator().skipIgnored(lexer.charsOptions());
    UnicodeSet propertyQuery;
    propertyQuery.applyPropertyPattern(lexer.getCharacterIterator(), prettyPrintedPattern, ec);
    U_UNICODESET_RETURN_IF_ERROR(ec);
    // But now, we go back to our lexing and advance through the property-query or named-element as
    // lexed.  If there was no error, the old and the new code should agree on the extent.
    lexer.advance();

> If so, can we rewrite it using this new parser?

Yes, but not if we want to retain the current insane behaviour of variables:

icu/icu4c/source/test/intltest/usettest.cpp

Lines 1842 to 1856 in 7d3c247

    
    // You should not do this, but it works.
    {{{u"privateUseOrUnassigned", u"[[:Co:][:Cn:]"}, {u"close", u"]"}},
    u"$privateUseOrUnassigned$close",
    U_ZERO_ERROR,
    u"[[:Co:][:Cn:]]"},
    // This works and it is fine.
    {{{u"privateUse", u"[[:Co:]]"}}, u"$privateUse", U_ZERO_ERROR, u"[[:Co:]]"},
    // This should work! But it does not. Note the doubled brackets on the one that works above.
    // We are not yet inside the variable when we call lookahead(), so we try to parse
    // $privateUse rather than [:Co:].
    {{{u"privateUse", u"[:Co:]"}}, u"[$privateUse]", U_ILLEGAL_ARGUMENT_ERROR, u"[]"},
    // This should not work, and it does not (we try to parse [$sad$surprised] as a
    // property-query).
    {{{u"sad", u":C"}, {u"surprised", u"o:"}},
    u"[$sad$surprised]",

So I will get rid of it in a subsequent PR (assuming the TC accepts changing the handling of variables so that they fit in the grammar).

eggrobin · 2025-12-10T08:07:23Z

> Do we still need some of the existing functions in uniset_props.cpp like isPerlOpen() and resemblesPropertyPattern()?

resemblesPropertyPattern is used by resemblesPattern, which is @stable ICU 2.4.

isPerlOpen is used by both resemblesPropertyPattern and applyPropertyPattern.

icu4c/source/common/uniset_props.cpp

markusicu · 2025-12-10T17:30:19Z

aside from another typo and the optional refactoring suggestions, this lgtm

See unicode-org#3604

jira-pull-request-webhook · 2025-12-10T18:48:54Z

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

eggrobin mentioned this pull request Sep 9, 2025

ICU-23179 Test more edge cases when mapping syntax characters to sets #3612

Merged

eggrobin marked this pull request as ready for review September 11, 2025 12:34

eggrobin changed the title ~~ICU-22851 A somewhat more structured UnicodeSet parser.~~ ICU-23179 A somewhat more structured UnicodeSet parser. Sep 11, 2025

markusicu self-assigned this Sep 11, 2025

markusicu requested review from macchiati, markusicu and richgillam September 11, 2025 16:25

macchiati previously approved these changes Sep 11, 2025

eggrobin changed the title ~~ICU-23179 A somewhat more structured UnicodeSet parser.~~ ICU-23257 A somewhat more structured UnicodeSet parser. Nov 6, 2025

markusicu reviewed Dec 10, 2025

eggrobin dismissed macchiati’s stale review via 4209151 December 10, 2025 07:46

eggrobin requested a review from markusicu December 10, 2025 13:02

markusicu reviewed Dec 10, 2025

markusicu approved these changes Dec 10, 2025

ICU-23257 A somewhat more structured UnicodeSet parser

7ca1c49

See unicode-org#3604

eggrobin force-pushed the recursive-descent-into-madness branch from a539944 to 7ca1c49 Compare December 10, 2025 18:48

eggrobin merged commit c7e6cfb into unicode-org:main Dec 11, 2025
97 checks passed
