Port quotation denormalization unicode tests #228

Enkidu93 · 2025-09-11T22:03:30Z

Addresses #222

Added the same tests.

Unfortunately, using the TextElement strategy in Machine means that combining characters are joined into the preceding text element, which Python does not do. I updated the code to handle them the same way as Machine. The alternative is simply to allow the two to differ. The difference is mostly internal, so that may not be an issue.
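To illustrate the discrepancy described above, here is a minimal sketch, not the code from this PR. Python indexes strings by code point, so a combining mark is its own element. The hypothetical helper `split_text_elements` attaches each combining mark to the preceding element, using `unicodedata.combining` as a rough approximation of .NET text elements (it is an assumption that this is close enough for illustration; it does not implement full grapheme cluster rules).

```python
import unicodedata
from typing import List


def split_text_elements(text: str) -> List[str]:
    """Hypothetical helper: group each combining mark with the element before it.

    This only approximates .NET text elements; it checks the canonical
    combining class and ignores other grapheme cluster rules.
    """
    elements: List[str] = []
    for ch in text:
        if elements and unicodedata.combining(ch) != 0:
            # Combining mark: join it onto the preceding element.
            elements[-1] += ch
        else:
            elements.append(ch)
    return elements


# "e" followed by U+0301 COMBINING ACUTE ACCENT, then "x".
sample = "e\u0301x"
code_points = list(sample)              # plain Python view: one item per code point
elements = split_text_elements(sample)  # text-element-like view
```

With this input, the plain Python view has three items while the grouped view has two, so any index past the accent differs by one between the two views.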

codecov-commenter · 2025-09-12T13:34:54Z

Codecov Report

❌ Patch coverage is 74.50980% with 13 lines in your changes missing coverage. Please review.
✅ Project coverage is 90.88%. Comparing base (3a79c67) to head (433c579).

Files with missing lines	Patch %	Lines
tests/corpora/test_usfm_manual.py	25.00%	12 Missing ⚠️
...unctuation_analysis/quotation_mark_string_match.py	83.33%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #228      +/-   ##
==========================================
- Coverage   90.96%   90.88%   -0.08%     
==========================================
  Files         334      335       +1     
  Lines       21431    21474      +43     
==========================================
+ Hits        19494    19517      +23     
- Misses       1937     1957      +20

Enkidu93

Reviewable status: 0 of 11 files reviewed, 1 unresolved discussion (waiting on @ddaspit)

machine/punctuation_analysis/text_segment.py line 87 at r4 (raw file):

class GlyphString:

I'm open to suggestions on the name of this class if we choose to keep it.

Enkidu93

Reviewable status: 0 of 11 files reviewed, 1 unresolved discussion (waiting on @ddaspit)

machine/punctuation_analysis/text_segment.py line 87 at r4 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

I'm open to suggestions on the name of this class if we choose to keep it.

LogicalCharacter, TextElement, CombinedCharacter

ddaspit

Maybe using StringInfo.GetTextElementEnumerator isn't the appropriate strategy if it treats combining character sequences like surrogate pairs. It feels like we are adding an unnecessary complication, since we are only interested in surrogate pairs. We should consider implementing our own version that only deals with surrogate pairs. It should be easy to do with Char.IsSurrogatePair. What do you think?
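The distinction drawn above can be sketched in Python, purely as an illustration and not as the Machine implementation. A C# string is a sequence of UTF-16 code units, so a character outside the Basic Multilingual Plane occupies two units (a surrogate pair), while Python counts it as one code point. The hypothetical helpers below encode a string to UTF-16 and group only surrogate pairs, leaving combining marks as separate elements, which is the behavior Python strings already have.

```python
import struct
from typing import List


def utf16_code_units(text: str) -> List[int]:
    """Return the UTF-16 code units of text, as a C# string would store them."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


def is_high_surrogate(unit: int) -> bool:
    return 0xD800 <= unit <= 0xDBFF


def is_low_surrogate(unit: int) -> bool:
    return 0xDC00 <= unit <= 0xDFFF


def group_surrogate_pairs(units: List[int]) -> List[List[int]]:
    """Hypothetical helper: merge each surrogate pair into one element, nothing else."""
    groups: List[List[int]] = []
    i = 0
    while i < len(units):
        if (
            i + 1 < len(units)
            and is_high_surrogate(units[i])
            and is_low_surrogate(units[i + 1])
        ):
            groups.append(units[i : i + 2])
            i += 2
        else:
            groups.append(units[i : i + 1])
            i += 1
    return groups


# "a", U+1F600 (outside the BMP), "e", U+0301 COMBINING ACUTE ACCENT.
sample = "a\U0001F600e\u0301"
units = utf16_code_units(sample)
groups = group_surrogate_pairs(units)
```

Here the UTF-16 view has five code units, but grouping surrogate pairs alone yields four elements, the same count as `len(sample)` in Python. The combining accent stays separate in both views.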

@ddaspit reviewed 8 of 11 files at r1, 1 of 1 files at r2, 2 of 2 files at r4, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @Enkidu93)

machine/punctuation_analysis/text_segment.py line 87 at r4 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

LogicalCharacter, TextElement, CombinedCharacter

I can't think of a good name. Maybe VisualString? At the very least, we should put a comment here describing the purpose of this class.

Enkidu93

Yeah, I agree. Either we allow them to operate differently, since the indexing differences aren't in properties that are likely to be accessed by code outside of the module, or we do as you suggested. That way we'd only need a custom solution in one of machine.py and Machine. I'm happy with either one. Let me know which you'd prefer.

Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @ddaspit)

machine/punctuation_analysis/text_segment.py line 87 at r4 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I can't think of a good name. Maybe VisualString? At the very least, we should put a comment here describing the purpose of this class.

OK, I'll hold off on this since this class will likely be removed.

ddaspit

I think I would prefer to target the specific issue, i.e. surrogate pairs in C#, and do no more. So let's not use StringInfo.GetTextElementEnumerator and make our own implementation that properly deals with surrogate pairs.

Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @Enkidu93)

Enkidu93 · 2025-09-15T22:17:57Z

I think I would prefer to target the specific issue, i.e. surrogate pairs in C#, and do no more. So let's not use StringInfo.GetTextElementEnumerator and make our own implementation that properly deals with surrogate pairs.

Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @Enkidu93)

Sounds good. I've removed the custom code here and will add commits to the parallel Machine PR to address this issue there.

Enkidu93 · 2025-09-16T20:20:51Z

@ddaspit, I think this can be merged/re-reviewed whenever you have time.

ddaspit

@ddaspit reviewed 7 of 7 files at r5, all commit messages.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @Enkidu93)

Enkidu93 requested a review from ddaspit September 11, 2025 22:03

Enkidu93 mentioned this pull request Sep 11, 2025

Add test to specifically cover surrogate pairs, not just combining characters sillsdev/machine#335

Merged

Enkidu93 commented Sep 15, 2025

ddaspit reviewed Sep 15, 2025

Enkidu93 commented Sep 15, 2025

ddaspit reviewed Sep 15, 2025

ddaspit approved these changes Sep 16, 2025

Enkidu93 added 6 commits September 16, 2025 16:49

Port unicode string-related tests to Python; address discrepancy regarding combining characters in Python strings

905a056

Remove redundant string property

c7c8c3a

Fix formatting

f930050

Use Optional[]

a12b917

Change class name

38292fd

Remove custom combining character handling

433c579

Enkidu93 force-pushed the port_qd_unicode_tests branch from 7fcb23e to 433c579 Compare September 16, 2025 20:49

Enkidu93 merged commit f1778e1 into main Sep 16, 2025
13 of 14 checks passed

Enkidu93 deleted the port_qd_unicode_tests branch September 16, 2025 21:00

Enkidu93 mentioned this pull request Sep 18, 2025

Port test cases added in Machine to cover C# Unicode-handling #222

Closed
