Skip to content

Conversation

@eggrobin
Copy link
Member

@eggrobin eggrobin commented Aug 21, 2025

While this is a hand-rolled recursive-descent predictive parser, it has a separate lexer, so we know that the lexing is not context-dependent.
The lexer prefigures some of the distinctions introduced in UTS61 (escapes in property-query, bracketed-element distinct from string-literal, etc.), but they then get treated the old way in the parser so nothing changes.

Both the lexer and the parser mirror the UTS61 grammar (in its version before the yellow/cyan changes).

Technically it is possible to detect that this behaves differently, as there will be fewer calls to SymbolTable::lookup, see the changes to icu4c/source/test/intltest/usettest.cpp.

But we retain the essential property (relied on by rbbi) that with $meow=CP, the call to lookupMatcher(CP) always immediately follows a call to lookup(meow).

Checklist

  • Required: Issue filed: ICU-23257
  • Required: The PR title must be prefixed with a JIRA Issue number. Example: "ICU-1234 Fix xyz"
  • Required: Each commit message must be prefixed with a JIRA Issue number. Example: "ICU-1234 Fix xyz"
  • Issue accepted (done by Technical Committee after discussion)
  • Tests included, if applicable
  • API docs and/or User Guide docs changed or added, if applicable

@eggrobin eggrobin marked this pull request as ready for review September 11, 2025 12:34
@eggrobin
Copy link
Member Author

@markusicu Do we need a separate ticket for this restructuring, or should I do it as part of ICU-23179?

@eggrobin eggrobin changed the title ICU-22851 A somewhat more structured UnicodeSet parser. ICU-23179 A somewhat more structured UnicodeSet parser. Sep 11, 2025
@markusicu markusicu self-assigned this Sep 11, 2025
macchiati
macchiati previously approved these changes Sep 11, 2025
Copy link
Member

@macchiati macchiati left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

BTW we should point out in the spec that unlike set operations, where
A∖B = A∩Bᶜ, UnicodeSet doesn't do that: for some A and Bs (those containing sequences):

A-B ≠ A&[^B]

Example

[ab{cd}{ef}]-[ax{cd}{ex}]

[ab{cd}{ef}]&[^ax{cd}{ex}]

@richgillam
Copy link
Contributor

Wow, there's a lot here. I ran out of time to read through the whole thing ad was starting to lose my mental acuity well before that, so I don't think I have anything useful to say. I can try again next week if you don't get useful review feedback from other people.

I read all the way through the spec, though, and I think I managed to keep my brain going through that. I think it looks really good-- is the plan to bring all the various implementations into complete conformance with that spec, and is this one of the first steps toward doing so? That seems like a good idea. I'm assuming you've thought through the backward-compatibility issues in doing so and everybody's comfortable with them?

I hope you'll permit me one stupid question about the spec: I'm assuming that the spec allows for sets that contain both individual code points and strings, right? I did see a lot of verbiage in there about bracketed elements and how they can represent either single code points or strings and how you want to do special things with them when they're single code points. I don't remember seeing much verbiage that talks about the behavior of unicode sets that contain both strings and single code points. At least to me, the semantics of this kind of thing are far from obvious, especially if a set contains both types. I remember a lot of this being discussed when various people were proposing changes to UnicodeSet earlier, but it seems like all of that needs to be covered in the spec, and I didn't really see it there. Is that planned? Am I wrong to be concerned about this?

@macchiati
Copy link
Member

macchiati commented Sep 13, 2025 via email

@eggrobin eggrobin changed the title ICU-23179 A somewhat more structured UnicodeSet parser. ICU-23257 A somewhat more structured UnicodeSet parser. Nov 6, 2025
@eggrobin
Copy link
Member Author

eggrobin commented Nov 6, 2025

@markusicu I filed a new ticket for this. Can you take a look?

@eggrobin
Copy link
Member Author

eggrobin commented Nov 6, 2025

At least to me, the semantics of this kind of thing are far from obvious, especially if a set contains both types.

Good point (Mark’s earlier comment has examples of this non-obviousness). I will hammer at the draft some more…

@markusicu
Copy link
Member

Looks like @poulsbo is not yet in the icu-team and so I can't add him as a reviewer, although in some ways he would be the obvious choice ;-)

Copy link
Member

@markusicu markusicu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I finally started to look... I won't try to understand everything.

Copy link
Member

@markusicu markusicu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I got to the end. Looks plausible, modulo questions and suggestions.

@markusicu
Copy link
Member

Do we still need some of the existing functions in uniset_props.cpp like isPerlOpen() and resemblesPropertyPattern()?

@markusicu
Copy link
Member

Do we still need applyPropertyPattern()? If so, can we rewrite it using this new parser?

@eggrobin
Copy link
Member Author

eggrobin commented Dec 10, 2025

Do we still need applyPropertyPattern()?

For now we still call it, with a comment explaining why we do this silly thing:

// NOTE(egg): For now, we ignore the work that the lexer did to find out where the
// property-query or named-element ended in order to retain the existing buggy behaviour of
// variables containing property queries.
lexer.getCharacterIterator().skipIgnored(lexer.charsOptions());
UnicodeSet propertyQuery;
propertyQuery.applyPropertyPattern(lexer.getCharacterIterator(), prettyPrintedPattern, ec);
U_UNICODESET_RETURN_IF_ERROR(ec);
// But now, we go back to our lexing and advance through the property-query or named-element as
// lexed. If there was no error, the old and the new code should agree on the extent.
lexer.advance();

If so, can we rewrite it using this new parser?

Yes, but not if we want to retain the current insane behaviour of variables:

// You should not do this, but it works.
{{{u"privateUseOrUnassigned", u"[[:Co:][:Cn:]"}, {u"close", u"]"}},
u"$privateUseOrUnassigned$close",
U_ZERO_ERROR,
u"[[:Co:][:Cn:]]"},
// This works and it is fine.
{{{u"privateUse", u"[[:Co:]]"}}, u"$privateUse", U_ZERO_ERROR, u"[[:Co:]]"},
// This should work! But it does not. Note the doubled brackets on the one that works above.
// We are not yet inside the variable when we call lookahead(), so we try to parse
// $privateUse rather than [:Co:].
{{{u"privateUse", u"[:Co:]"}}, u"[$privateUse]", U_ILLEGAL_ARGUMENT_ERROR, u"[]"},
// This should not work, and it does not (we try to parse [$sad$surprised] as a
// property-query).
{{{u"sad", u":C"}, {u"surprised", u"o:"}},
u"[$sad$surprised]",

So I will get rid of it in a subsequent PR (assuming the TC accepts changing the handling of variables so that they fit in the grammar).

@eggrobin
Copy link
Member Author

Do we still need some of the existing functions in uniset_props.cpp like isPerlOpen() and resemblesPropertyPattern()?

resemblesPropertyPattern is used by resemblesPattern, which is @stable ICU 2.4.

isPerlOpen is used by both resemblesPropertyPattern and applyPropertyPattern.

@eggrobin eggrobin requested a review from markusicu December 10, 2025 13:02
@markusicu
Copy link
Member

aside from another typo and the optional refactoring suggestions, this lgtm

@eggrobin eggrobin force-pushed the recursive-descent-into-madness branch from a539944 to 7ca1c49 Compare December 10, 2025 18:48
@jira-pull-request-webhook
Copy link

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

@eggrobin eggrobin merged commit c7e6cfb into unicode-org:main Dec 11, 2025
97 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants