ICU-23257 A somewhat more structured UnicodeSet parser. #3604
@markusicu Do we need a separate ticket for this restructuring, or should I do it as part of ICU-23179? |
macchiati
left a comment
LGTM
BTW we should point out in the spec that UnicodeSet does not follow the usual set identity
A∖B = A∩Bᶜ: for some A and B (those containing sequences),
A-B ≠ A&[^B]
Example
[ab{cd}{ef}]-[ax{cd}{ex}]
≠
[ab{cd}{ef}]&[^ax{cd}{ex}]
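A rough way to see why the two expressions differ is a toy model in Python. This is only a sketch, not ICU code: `ToySet`, `make`, `difference`, `intersection`, and `code_point_complement` are hypothetical names, and the model assumes that `[^…]` takes the complement over code points only and yields a set with no strings.

```python
from typing import Iterable, NamedTuple

# The universe that a code point complement is taken over.
ALL_CODE_POINTS = frozenset(range(0x110000))


class ToySet(NamedTuple):
    """Hypothetical stand-in for a UnicodeSet: code points plus strings."""
    code_points: frozenset
    strings: frozenset


def make(chars: str, strings: Iterable[str] = ()) -> ToySet:
    return ToySet(frozenset(ord(c) for c in chars), frozenset(strings))


def difference(a: ToySet, b: ToySet) -> ToySet:
    # A-B removes B's code points and B's strings from A.
    return ToySet(a.code_points - b.code_points, a.strings - b.strings)


def intersection(a: ToySet, b: ToySet) -> ToySet:
    # A&B keeps only what both sets contain.
    return ToySet(a.code_points & b.code_points, a.strings & b.strings)


def code_point_complement(s: ToySet) -> ToySet:
    # ASSUMPTION: [^...] complements code points and carries no strings.
    return ToySet(ALL_CODE_POINTS - s.code_points, frozenset())


a = make("ab", ["cd", "ef"])   # [ab{cd}{ef}]
b = make("ax", ["cd", "ex"])   # [ax{cd}{ex}]

lhs = difference(a, b)                            # [ab{cd}{ef}]-[ax{cd}{ex}]
rhs = intersection(a, code_point_complement(b))   # [ab{cd}{ef}]&[^ax{cd}{ex}]
print(lhs == rhs)
```

In this model the difference keeps `b` and `{ef}`, while the intersection with the complement keeps only `b`, because the complement has no strings left to intersect with.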
|
Wow, there's a lot here. I ran out of time to read through the whole thing and was starting to lose my mental acuity well before that, so I don't think I have anything useful to say. I can try again next week if you don't get useful review feedback from other people.
I read all the way through the spec, though, and I think I managed to keep my brain going through that. I think it looks really good. Is the plan to bring all the various implementations into complete conformance with that spec, and is this one of the first steps toward doing so? That seems like a good idea. I'm assuming you've thought through the backward-compatibility issues in doing so and everybody's comfortable with them?
I hope you'll permit me one stupid question about the spec: I'm assuming that the spec allows for sets that contain both individual code points and strings, right? I did see a lot of verbiage in there about bracketed elements and how they can represent either single code points or strings and how you want to do special things with them when they're single code points. I don't remember seeing much verbiage that talks about the behavior of Unicode sets that contain both strings and single code points. At least to me, the semantics of this kind of thing are far from obvious, especially if a set contains both types. I remember a lot of this being discussed when various people were proposing changes to UnicodeSet earlier, but it seems like all of that needs to be covered in the spec, and I didn't really see it there. Is that planned? Am I wrong to be concerned about this?
|
I wouldn't think that performance would be affected significantly, but we probably should check at least with a simple spot test.
(In reply to richgillam's comment above, Fri, Sep 12, 2025, 17:46.)
|
|
@markusicu I filed a new ticket for this. Can you take a look? |
Good point (Mark’s earlier comment has examples of this non-obviousness). I will hammer at the draft some more… |
|
Looks like @poulsbo is not yet in the icu-team and so I can't add him as a reviewer, although in some ways he would be the obvious choice ;-) |
markusicu
left a comment
I finally started to look... I won't try to understand everything.
markusicu
left a comment
I got to the end. Looks plausible, modulo questions and suggestions.
|
Do we still need some of the existing functions in uniset_props.cpp like isPerlOpen() and resemblesPropertyPattern()? |
|
Do we still need |
For now we still call it, with a comment explaining why we do this silly thing: icu/icu4c/source/common/uniset_props.cpp Lines 752 to 761 in 381616c
Yes, but not if we want to retain the current insane behaviour of variables: icu/icu4c/source/test/intltest/usettest.cpp Lines 1842 to 1856 in 7d3c247
So I will get rid of it in a subsequent PR (assuming the TC accepts changing the handling of variables so that they fit in the grammar). |
|
|
aside from another typo and the optional refactoring suggestions, this lgtm |
Force-pushed from a539944 to 7ca1c49.
|
Hooray! The files in the branch are the same across the force-push. 😃 ~ Your Friendly Jira-GitHub PR Checker Bot |
While this is a hand-rolled recursive-descent predictive parser, it has a separate lexer, so we know that the lexing is not context-dependent.
The lexer prefigures some of the distinctions introduced in UTS61 (escapes in property-query, bracketed-element distinct from string-literal, etc.), but they then get treated the old way in the parser so nothing changes.
Both the lexer and the parser mirror the UTS61 grammar (in its version before the yellow/cyan changes).
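To illustrate the general shape described here (a lexer that needs no parser state, feeding a predictive recursive-descent parser), here is a toy Python sketch. It is not the PR's implementation and does not follow the UTS61 grammar: the token kinds, the tiny grammar (`set := open item* ']'`, `item := set | string | char ('-' char)?`), and all names are assumptions made up for illustration.

```python
import re
from typing import List, NamedTuple


class Token(NamedTuple):
    kind: str
    text: str


# Hypothetical token kinds. Order matters: "[^" is tried before "[".
TOKEN_SPEC = [
    ("OPEN_NEG", r"\[\^"),
    ("OPEN", r"\["),
    ("CLOSE", r"\]"),
    ("STRING", r"\{[^{}]*\}"),
    ("DASH", r"-"),
    ("CHAR", r"[^\s\[\]{}\-]"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{k}>{p})" for k, p in TOKEN_SPEC))


def tokenize(pattern: str) -> List[Token]:
    """Lex the whole pattern up front; no parser state is consulted."""
    tokens, pos = [], 0
    while pos < len(pattern):
        if pattern[pos].isspace():
            pos += 1
            continue
        m = TOKEN_RE.match(pattern, pos)
        if m is None:
            raise ValueError(f"cannot lex at offset {pos}")
        tokens.append(Token(m.lastgroup, m.group()))
        pos = m.end()
    tokens.append(Token("END", ""))
    return tokens


class Parser:
    """Predictive parser: one token of lookahead picks each production."""

    def __init__(self, tokens: List[Token]) -> None:
        self.tokens, self.i = tokens, 0

    def peek(self) -> Token:
        return self.tokens[self.i]

    def next(self) -> Token:
        tok = self.tokens[self.i]
        self.i += 1
        return tok

    def expect(self, kind: str) -> Token:
        tok = self.next()
        if tok.kind != kind:
            raise ValueError(f"expected {kind}, got {tok.kind}")
        return tok

    def parse_set(self):
        opener = self.next()
        if opener.kind not in ("OPEN", "OPEN_NEG"):
            raise ValueError("expected '['")
        items = []
        while self.peek().kind not in ("CLOSE", "END"):
            items.append(self.parse_item())
        self.expect("CLOSE")
        return ("set", opener.kind == "OPEN_NEG", items)

    def parse_item(self):
        tok = self.peek()
        if tok.kind in ("OPEN", "OPEN_NEG"):
            return self.parse_set()
        if tok.kind == "STRING":
            self.next()
            return ("string", tok.text[1:-1])
        if tok.kind == "CHAR":
            self.next()
            if self.peek().kind == "DASH":
                self.next()
                return ("range", tok.text, self.expect("CHAR").text)
            return ("char", tok.text)
        raise ValueError(f"unexpected {tok.kind}")


def parse(pattern: str):
    parser = Parser(tokenize(pattern))
    tree = parser.parse_set()
    parser.expect("END")
    return tree
```

For example, `parse("[ab{cd}x-z]")` yields a tuple tree whose items are two single characters, one string, and one range. Because `tokenize` runs to completion before parsing starts, the lexing visibly cannot depend on parser context.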
Technically it is possible to detect that this behaves differently, as there will be fewer calls to SymbolTable::lookup; see the changes to icu4c/source/test/intltest/usettest.cpp.
But we retain the essential property (relied on by rbbi) that with $meow=CP, the call to lookupMatcher(CP) always immediately follows a call to lookup(meow).