-
-
Notifications
You must be signed in to change notification settings - Fork 3
Port quotation denormalization unicode tests #228
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #228 +/- ##
==========================================
- Coverage 90.96% 90.88% -0.08%
==========================================
Files 334 335 +1
Lines 21431 21474 +43
==========================================
+ Hits 19494 19517 +23
- Misses 1937 1957 +20 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 11 files reviewed, 1 unresolved discussion (waiting on @ddaspit)
machine/punctuation_analysis/text_segment.py
line 87 at r4 (raw file):
class GlyphString:
I'm open to suggestions on the name of this class if we choose to keep it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 11 files reviewed, 1 unresolved discussion (waiting on @ddaspit)
machine/punctuation_analysis/text_segment.py
line 87 at r4 (raw file):
Previously, Enkidu93 (Eli C. Lowry) wrote…
I'm open to suggestions on the name of this class if we choose to keep it.
LogicalCharacter, TextElement, CombinedCharacter
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe using StringInfo.GetTextElementEnumerator
isn't the appropriate strategy if it treats combining character sequences like surrogate pairs. It feels like we are adding an unnecessary complication, since we are only interested in surrogate pairs. We should consider implementing our own version that only deals with surrogate pairs. It should be easy to do with Char.IsSurrogatePair
. What do you think?
@ddaspit reviewed 8 of 11 files at r1, 1 of 1 files at r2, 2 of 2 files at r4, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @Enkidu93)
machine/punctuation_analysis/text_segment.py
line 87 at r4 (raw file):
Previously, Enkidu93 (Eli C. Lowry) wrote…
LogicalCharacter, TextElement, CombinedCharacter
I can't think of a good name. Maybe VisualString
? At the very least, we should put a comment here describing the purpose of this class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I agree. I think either we should just allow them to operate differently since the indexing differences aren't in properties that are likely to be accessed by code outside of the module or we should do like you suggested - at least that way we'd only need a custom solution in one of machine.py & Machine. I'm happy with either one. Let me know which you'd prefer.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @ddaspit)
machine/punctuation_analysis/text_segment.py
line 87 at r4 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
I can't think of a good name. Maybe
VisualString
? At the very least, we should put a comment here describing the purpose of this class.
OK, I'll hold off on this since likely this class will be removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I would prefer to target the specific issue, i.e. surrogate pairs in C#, and do no more. So let's not use StringInfo.GetTextElementEnumerator
and make our own implementation that properly deals with surrogate pairs.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @Enkidu93)
Sounds good. I've removed the custom code here and will add commits to the parallel Machine PR to address this issue there. |
@ddaspit, I think this can be merged/re-reviewed whenever you have time. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ddaspit reviewed 7 of 7 files at r5, all commit messages.
Reviewable status:complete! all files reviewed, all discussions resolved (waiting on @Enkidu93)
7fcb23e
to
433c579
Compare
Addresses #222
Added the same tests.
Unfortunately, using the
TextElement
strategy in Machine means that combining characters are joined into preceding text elements - something which Python does not do. I updated the code to handle them the same as Machine. The alternative is just to allow there to be a difference. It's mostly internal, so it may not be an issue to do so.This change is