Skip to content

Conversation

Enkidu93
Copy link
Collaborator

@Enkidu93 Enkidu93 commented Sep 11, 2025

Addresses #222

Added the same tests.

Unfortunately, using the TextElement strategy in Machine means that combining characters are joined into preceding text elements - something which Python does not do. I updated the code to handle them the same as Machine. The alternative is just to allow there to be a difference. It's mostly internal, so it may not be an issue to do so.


This change is Reviewable

@codecov-commenter
Copy link

codecov-commenter commented Sep 12, 2025

Codecov Report

❌ Patch coverage is 74.50980% with 13 lines in your changes missing coverage. Please review.
✅ Project coverage is 90.88%. Comparing base (3a79c67) to head (433c579).

Files with missing lines Patch % Lines
tests/corpora/test_usfm_manual.py 25.00% 12 Missing ⚠️
...unctuation_analysis/quotation_mark_string_match.py 83.33% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #228      +/-   ##
==========================================
- Coverage   90.96%   90.88%   -0.08%     
==========================================
  Files         334      335       +1     
  Lines       21431    21474      +43     
==========================================
+ Hits        19494    19517      +23     
- Misses       1937     1957      +20     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Collaborator Author

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 11 files reviewed, 1 unresolved discussion (waiting on @ddaspit)


machine/punctuation_analysis/text_segment.py line 87 at r4 (raw file):

class GlyphString:

I'm open to suggestions on the name of this class if we choose to keep it.

Copy link
Collaborator Author

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 11 files reviewed, 1 unresolved discussion (waiting on @ddaspit)


machine/punctuation_analysis/text_segment.py line 87 at r4 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

I'm open to suggestions on the name of this class if we choose to keep it.

LogicalCharacter, TextElement, CombinedCharacter

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe using StringInfo.GetTextElementEnumerator isn't the appropriate strategy if it treats combining character sequences like surrogate pairs. It feels like we are adding an unnecessary complication, since we are only interested in surrogate pairs. We should consider implementing our own version that only deals with surrogate pairs. It should be easy to do with Char.IsSurrogatePair. What do you think?

@ddaspit reviewed 8 of 11 files at r1, 1 of 1 files at r2, 2 of 2 files at r4, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @Enkidu93)


machine/punctuation_analysis/text_segment.py line 87 at r4 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

LogicalCharacter, TextElement, CombinedCharacter

I can't think of a good name. Maybe VisualString? At the very least, we should put a comment here describing the purpose of this class.

Copy link
Collaborator Author

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I agree. I think either we should just allow them to operate differently since the indexing differences aren't in properties that are likely to be accessed by code outside of the module or we should do like you suggested - at least that way we'd only need a custom solution in one of machine.py & Machine. I'm happy with either one. Let me know which you'd prefer.

Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @ddaspit)


machine/punctuation_analysis/text_segment.py line 87 at r4 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I can't think of a good name. Maybe VisualString? At the very least, we should put a comment here describing the purpose of this class.

OK, I'll hold off on this since likely this class will be removed.

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I would prefer to target the specific issue, i.e. surrogate pairs in C#, and do no more. So let's not use StringInfo.GetTextElementEnumerator and make our own implementation that properly deals with surrogate pairs.

Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @Enkidu93)

@Enkidu93
Copy link
Collaborator Author

I think I would prefer to target the specific issue, i.e. surrogate pairs in C#, and do no more. So let's not use StringInfo.GetTextElementEnumerator and make our own implementation that properly deals with surrogate pairs.

Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @Enkidu93)

Sounds good. I've removed the custom code here and will add commits to the parallel Machine PR to address this issue there.

@Enkidu93
Copy link
Collaborator Author

@ddaspit, I think this can be merged/re-reviewed whenever you have time.

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

@ddaspit reviewed 7 of 7 files at r5, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @Enkidu93)

@Enkidu93 Enkidu93 force-pushed the port_qd_unicode_tests branch from 7fcb23e to 433c579 Compare September 16, 2025 20:49
@Enkidu93 Enkidu93 merged commit f1778e1 into main Sep 16, 2025
13 of 14 checks passed
@Enkidu93 Enkidu93 deleted the port_qd_unicode_tests branch September 16, 2025 21:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants