Migrate to ICU4X #436
Conversation
- condense all byte indexes to char indexes in a single loop - track a minimal set of LineSegmenters (per LineBreakWordOption), and create as needed
- clean up tests - add tests for multi-character graphemes
- group all word boundary logic together
- fix incorrect start truncation for multi-style strings which aren't multi-wb style + test for this - test naming/grouping
- compute `force_normalize`
- simplify ClusterInfo to just `is_emoji` - more clean-up
- remove unused conversion methods
- merge main (conflicts resolved in: Cargo.lock, parley/src/context.rs, parley/src/layout/data.rs, parley/src/shape/mod.rs, parley/src/swash_convert.rs)
- remove LayoutContext.has_bidi
This isn't a full review (it doesn't really cover the analysis logic, which I'm probably not qualified to review), but I did a first pass on "rust-level things".
Some general notes:
- This currently increases the binary size of a release build of the `vello_editor` example by 2-3mb. That seems very unfortunate, but it also looks like there are a few easy wins to reduce this, and we should probably re-evaluate the impact once those are fixed.
- I think depending on the top-level `icu` crate is wrong. We should be depending on sub-crates like `icu_segmenter`, `icu_normalizer`, etc. directly.
- We should also be disabling default features of those crates where appropriate. Currently I'm pretty sure we're compiling in the default data providers (pulled in by default features) AND our custom subset (generated by the build.rs).
- We should check how much we're saving by using the custom baked data, because any application which depends on icu4x for other purposes may well end up pulling in the default data sets anyway, which would also lead to duplicated data. It may or may not be worth the saving, and we may also wish to offer both options.
- We should fully eliminate the Swash dependency (only a couple of trivial uses remaining).
- Benchmarks are currently broken due to `LayoutContext` no longer being `Sync`. We should work out whether we want to go down the route of making `LayoutContext`s always thread-local or whether we want to find a way to make it `Sync` again.
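Of those two routes, the always-thread-local option could be sketched with std's `thread_local!`. This is an illustration only: `LayoutContext` is stubbed here with a single field, and `with_layout_context` is a hypothetical helper, not Parley's actual API.

```rust
// A hedged sketch of "always thread-local" LayoutContexts. Because each
// thread owns its own context behind a RefCell, no `Sync` bound is needed.
use std::cell::RefCell;

struct LayoutContext {
    scratch: String, // stand-in for the real context state
}

thread_local! {
    static LAYOUT_CONTEXT: RefCell<LayoutContext> =
        RefCell::new(LayoutContext { scratch: String::new() });
}

// Callers borrow the per-thread context for the duration of a closure.
fn with_layout_context<R>(f: impl FnOnce(&mut LayoutContext) -> R) -> R {
    LAYOUT_CONTEXT.with(|ctx| f(&mut ctx.borrow_mut()))
}
```

The trade-off is that each thread pays for its own context's allocations, but all `Sync`/`Send` questions disappear.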
parley/Cargo.toml (Outdated)

```toml
accesskit = { workspace = true, optional = true }
hashbrown = { workspace = true }
harfrust = { workspace = true }
icu = { version = "2.0.0" }
```
Depending on the icu crate with default features enabled doesn't seem right to me. That's going to pull in a lot of stuff which we surely aren't using?
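A hedged sketch of what the alternative could look like, using the sub-crates named elsewhere in this review. The versions and the assumption that default features can simply be disabled here are things to verify against the icu4x 2.0 docs:

```toml
# Sketch only: depend on sub-crates directly, with default features off so
# the default compiled-data providers are not pulled in alongside the
# custom baked data generated by build.rs.
icu_segmenter = { version = "2.0", default-features = false }
icu_normalizer = { version = "2.0", default-features = false }
icu_properties = { version = "2.0", default-features = false }
```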
```rust
}

#[rustfmt::skip]
const FONTIQUE_SCRIPT_TAGS: [[u8; 4]; 193] = [
```
I believe this can be replaced with the `short_name` method on icu's `Script` type, but we should check that it's completely equivalent. There may be a couple of special cases.
```rust
loop {
    if !parser.next(&mut cluster) {
        // End of current item - process final segment
        break;
    }
    cluster = match clusters_iter.next() {
        Some(c) => c,
        None => break, // End of current item - process final segment
    };
```
This looks like it could become a `while let` loop.
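For illustration, the suggested shape on a stand-in iterator (not the real cluster parser; `last_cluster` and its body are made up for this sketch):

```rust
// The loop { match iter.next() { Some(c) => c, None => break } } pattern
// collapses into a single `while let` head.
fn last_cluster(clusters: &[&str]) -> Option<String> {
    let mut clusters_iter = clusters.iter();
    let mut last = None;
    while let Some(cluster) = clusters_iter.next() {
        // ... per-cluster segment logic would go here ...
        last = Some(cluster.to_string());
    }
    last
}
```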
```rust
if has_emoji && has_zwj {
}
```
This if was probably meant to be removed?
parley/src/lib.rs (Outdated)

```rust
extern crate alloc;

pub use fontique;
pub use swash;
```
This should go if we're not using swash anymore
There seems to be one more use of swash, for the `swash::Setting` type in parley/src/style/font.rs. I would suggest replacing this with a custom struct in Parley. This type does also exist in font-types, but I think we want to avoid putting font-types in the public API at least until it hits 1.0.
```rust
    });
}

#[cfg(test)]
```
Somewhat relieved to see how much of this file is tests! I was dreading reviewing 1000 lines. We may wish to consider moving these to separate files in the tests directory, although that's subjective and others (including you) may have other opinions (I personally find it harder to navigate code files with a mix of lots of tests and lots of code).
```rust
#[derive(Debug)]
pub(crate) struct CharCluster {
    pub chars: Vec<Char>,
```
We should back this with benchmark/profile data, but we should consider using either an `ArrayVec` with a capacity of `MAX_CLUSTER_SIZE`, or a `SmallVec` with a capacity tuned to "big enough for most clusters" here.
(Or the same array + len setup used for Form)
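For illustration, the array + len alternative could look roughly like this. Everything here is a sketch: `MAX_CLUSTER_SIZE`, the type name, and using plain `char` in place of Parley's `Char` are all assumptions.

```rust
// Hedged sketch of an inline "array + len" cluster buffer: fixed storage,
// no heap allocation, a u8 length tracking how much of the array is live.
const MAX_CLUSTER_SIZE: usize = 16; // illustrative value

#[derive(Debug)]
struct ClusterBuf {
    chars: [char; MAX_CLUSTER_SIZE],
    len: u8,
}

impl ClusterBuf {
    fn new() -> Self {
        Self { chars: ['\0'; MAX_CLUSTER_SIZE], len: 0 }
    }

    /// Returns false (dropping the char) when the buffer is full.
    fn push(&mut self, c: char) -> bool {
        if (self.len as usize) < MAX_CLUSTER_SIZE {
            self.chars[self.len as usize] = c;
            self.len += 1;
            true
        } else {
            false
        }
    }

    fn as_slice(&self) -> &[char] {
        &self.chars[..self.len as usize]
    }
}
```

Unlike `SmallVec`, this never spills to the heap, so the real code would need a policy for clusters longer than the cap.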
Simply using a SmallVec of capacity 1 saw these results - seems like a great suggestion. I didn't play around too much with different sizes cc @conor-93
I believe we can have a capacity of 4 without using any more memory (although there are other similar crates that could get us a size reduction for smaller capacities).
I only managed to review cluster.rs today. I'll continue with more tomorrow.
As I review, I'm experimenting and learning with this branch (which has a lot of the changes suggested).
This is where that branch is compared to this branch currently. Huge props to @nicoburns for that suggestion re using SmallVec. I think using SmallVec for Form had a huge effect with everything else being mostly marginal gains.
```toml
icu_locale = { workspace = true }
icu_normalizer = { workspace = true }
icu_properties = { workspace = true, features = ["unicode_bidi"] }
icu_provider = { workspace = true }
```
```toml
icu_provider = { workspace = true }
```

Does this need to be a dependency (both here and in the workspace)?
Its only usages are `use icu_provider::prelude::icu_locale_core::LanguageIdentifier;`, but they can be replaced with `use icu_locale::LanguageIdentifier;`.
`icu_locale_core::LanguageIdentifier` would be even better.
```rust
pub struct BakedProvider;
impl_data_provider!(BakedProvider);

pub(crate) static PROVIDER: BakedProvider = BakedProvider;
```
So is this the provider that we would need to pass from the consumer if we didn't want to bake the data?
```rust
pub(crate) fn analyze_text<B: Brush>(lcx: &mut LayoutContext<B>, text: &str) {
    // See: https://github.com/unicode-org/icu4x/blob/ee5399a77a6b94efb5d4b60678bb458c5eedb25d/components/segmenter/src/line.rs#L338-L351
    fn is_mandatory_line_break(line_break: LineBreak) -> bool {
```
Nit: Could you please define this closer to where it's used?
```rust
        continue;
    }
    all_boundaries_byte_indexed[wb] = Boundary::Word;
}
```
We may not need to pay the cost of the `all_boundaries_byte_indexed` allocation: make this loop an iterator, and instead push line boundary positions to a vector. There's probably a way to make the line boundary positions an iterator too, to avoid the vec we push to, but I think it's fine to leave that as a TODO.
```diff
diff --git a/parley/src/analysis/mod.rs b/parley/src/analysis/mod.rs
index f903b20..6ddbee2 100644
--- a/parley/src/analysis/mod.rs
+++ b/parley/src/analysis/mod.rs
@@ -273,16 +273,15 @@ pub(crate) fn analyze_text<B: Brush>(lcx: &mut LayoutContext<B>, text: &str) {
     let mut line_segmenters = core::mem::take(&mut lcx.analysis_data_sources.line_segmenters);

-    let mut all_boundaries_byte_indexed = vec![Boundary::None; text.len()];
-
-    // Word boundaries:
-    for wb in lcx.analysis_data_sources.word_segmenter().segment_str(text) {
+    // Collect boundary byte positions compactly
+    let mut wb_iter = lcx.analysis_data_sources.word_segmenter().segment_str(text).filter_map(|wb| {
         // icu produces a word boundary trailing the string, which we don't use.
         if wb == text.len() {
-            continue;
+            None
+        } else {
+            Some(wb)
         }
-        all_boundaries_byte_indexed[wb] = Boundary::Word;
-    }
+    }).peekable();

     // Line boundaries (word break naming refers to the line boundary determination config).
     //
@@ -298,27 +297,40 @@ pub(crate) fn analyze_text<B: Brush>(lcx: &mut LayoutContext<B>, text: &str) {
         &first_style
     );
     let mut global_offset = 0;
+    let mut line_boundary_positions: Vec<usize> = Vec::new();
+    // LINE BOUNDARIES COLLECTION
     for (substring_index, (substring, word_break_strength, last)) in contiguous_word_break_substrings.enumerate() {
-        let line_boundaries: Vec<usize> = lcx.analysis_data_sources
-            .line_segmenter(word_break_strength)
-            .segment_str(substring)
-            .collect();
         // Fast path for text with a single word-break option.
         if substring_index == 0 && last {
-            // icu adds leading and trailing line boundaries, which we don't use.
-            let Some((_first, rest)) = line_boundaries.split_first() else {
+            let mut lb_iter = line_segmenters.get(word_break_strength).segment_str(substring);
+
+            let _first = lb_iter.next();
+            let second = lb_iter.next();
+
+            if second.is_none() {
                 continue;
-            };
-            let Some((_last, middle)) = rest.split_last() else {
+            }
+
+            let third = lb_iter.next();
+
+            if third.is_none() {
                 continue;
-            };
-            for &b in middle {
-                all_boundaries_byte_indexed[b] = Boundary::Line;
+            }
+
+            let iter = [second.unwrap(), third.unwrap()].into_iter().chain(lb_iter);
+            line_boundary_positions.extend(iter);
+            line_boundary_positions.pop();
-            }
             break;
         }
+        let line_boundaries_iter = line_segmenters.get(word_break_strength).segment_str(substring);
+
         let mut substring_chars = substring.chars();
         if substring_index != 0 {
             global_offset -= substring_chars.next().unwrap().len_utf8();
@@ -328,9 +340,9 @@ pub(crate) fn analyze_text<B: Brush>(lcx: &mut LayoutContext<B>, text: &str) {
         let last_len = substring_chars.next_back().unwrap().len_utf8();

         // Mark line boundaries (overriding word boundaries where present).
-        for (index, &pos) in line_boundaries.iter().enumerate() {
+        for (index, pos) in line_boundaries_iter.enumerate() {
             // icu adds leading and trailing line boundaries, which we don't use.
-            if index == 0 || index == line_boundaries.len() - 1 {
+            if index == 0 || pos == substring.len() {
                 continue;
             }
@@ -340,7 +352,7 @@ pub(crate) fn analyze_text<B: Brush>(lcx: &mut LayoutContext<B>, text: &str) {
             if !last && pos == substring.len() - last_len {
                 continue;
             }
-            all_boundaries_byte_indexed[pos + global_offset] = Boundary::Line;
+            line_boundary_positions.push(pos + global_offset);
         }

         if !last {
@@ -351,11 +363,35 @@ pub(crate) fn analyze_text<B: Brush>(lcx: &mut LayoutContext<B>, text: &str) {

     // BiDi embedding levels:
     let bidi_embedding_levels = unicode_bidi::BidiInfo::new_with_data_source(&lcx.analysis_data_sources.bidi_class(), text, None).levels;

+    // Merge boundaries - line takes precedence over word
+    let mut lb_iter = line_boundary_positions.iter().peekable();
     let boundaries_and_levels_iter = text.char_indices()
-        .map(|(byte_pos, _)| (
-            all_boundaries_byte_indexed.get(byte_pos).unwrap(),
-            bidi_embedding_levels.get(byte_pos).unwrap()
-        ));
+        .map(|(byte_pos, _)| {
+            // advance any stale word boundary positions
+            while let Some(&w) = wb_iter.peek() {
+                if w < byte_pos { _ = wb_iter.next(); } else { break; }
+            }
+            // advance any stale line boundary positions
+            while let Some(&l) = lb_iter.peek() {
+                if *l < byte_pos { _ = lb_iter.next(); } else { break; }
+            }
+
+            let mut boundary = Boundary::None;
+            if let Some(&w) = wb_iter.peek() {
+                if w == byte_pos {
+                    boundary = Boundary::Word;
+                    _ = wb_iter.next();
+                }
+            }
+            if let Some(&l) = lb_iter.peek() {
+                if *l == byte_pos {
+                    boundary = Boundary::Line;
+                    _ = lb_iter.next();
+                }
+            }
+
+            (boundary, bidi_embedding_levels.get(byte_pos).unwrap())
+        });

     fn unicode_data_iterator<'a, T: TrieValue>(
         text: &'a str,
```

```rust
let line_boundaries: Vec<usize> = lcx.analysis_data_sources
    .line_segmenter(word_break_strength)
    .segment_str(substring)
    .collect();
```
We can probably skip this allocation entirely using something like:
```rust
for (substring_index, (substring, word_break_strength, last)) in contiguous_word_break_substrings.enumerate() {
    // Fast path for text with a single word-break option.
    if substring_index == 0 && last {
        let mut lb_iter = line_segmenters.get(word_break_strength).segment_str(substring);
        // CAUTION: bad names, draft code ahead
        let _first = lb_iter.next();
        let second = lb_iter.next();
        if second.is_none() {
            continue;
        }
        let third = lb_iter.next();
        if third.is_none() {
            continue;
        }
        let iter = [second.unwrap(), third.unwrap()].into_iter().chain(lb_iter);
        for b in iter {
            if b == substring.len() {
                continue;
            }
            line_boundary_positions.push(b);
        }
        break;
    }
    let line_boundaries_iter = line_segmenters.get(word_break_strength).segment_str(substring);
    let mut substring_chars = substring.chars();
    if substring_index != 0 {
        global_offset -= substring_chars.next().unwrap().len_utf8();
    }
    // There will always be at least two characters if we are not taking the fast path for
    // a single word break style substring.
    let last_len = substring_chars.next_back().unwrap().len_utf8();
    // Mark line boundaries (overriding word boundaries where present).
    for (index, pos) in line_boundaries_iter.enumerate() {
        // icu adds leading and trailing line boundaries, which we don't use.
        if index == 0 || pos == substring.len() {
            continue;
        }
        // For all but the last substring, we ignore line boundaries caused by the last
        // character, as this character is carried back from the next substring, and will be
        // accounted for there.
        if !last && pos == substring.len() - last_len {
            continue;
        }
        line_boundary_positions.push(pos + global_offset);
    }
    if !last {
        global_offset += substring.len() - last_len;
    }
}
```

```rust
let Some((first_style, rest)) = lcx.styles.split_first() else {
    panic!("No style info");
};
let contiguous_word_break_substrings = WordBreakSegmentIter::new(
```
If I understand WordBreakSegmentIter correctly, then we still iterate through the string even when there's no change in word break style. Is there a way to avoid that by doing an iteration through styles first before proceeding with a per character iteration?
```rust
if lcx.styles.iter().any(|s| s.style.word_break != LineBreakWordOption::Normal) {
    // Evaluate subranges
} else {
    // Fast path
}
```

Or simply update the iterator to first check whether it needs to yield subranges (by performing this check internally).
One thing that may be worth looking at is the difference between internal and external iteration (it's possible that Iterator::fold (or Iterator::try_fold) might be faster than Iterator::next() / Iterator::for_each())
```rust
/// Whether the character
pub is_control_character: bool,
/// True if the character should be considered when mapping glyphs.
pub contributes_to_shaping: bool,
```
Let's use bit flags for these, either via the `bitflags` crate or manually.
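A minimal sketch of the manual option, packing the two bools above into one byte. The type and flag names are illustrative, not Parley's:

```rust
// Two bool fields become two bits of a u8; adding more flags later costs
// nothing until the ninth bit.
#[derive(Clone, Copy, Default)]
struct CharFlags(u8);

impl CharFlags {
    const CONTROL_CHARACTER: u8 = 1 << 0;
    const CONTRIBUTES_TO_SHAPING: u8 = 1 << 1;

    fn set(&mut self, flag: u8, value: bool) {
        if value {
            self.0 |= flag;
        } else {
            self.0 &= !flag;
        }
    }

    fn is_control_character(self) -> bool {
        self.0 & Self::CONTROL_CHARACTER != 0
    }

    fn contributes_to_shaping(self) -> bool {
        self.0 & Self::CONTRIBUTES_TO_SHAPING != 0
    }
}
```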
I spent some time to create a custom provider in build.rs from existing ICU data sources that allows for 1 lookup on the character (instead of 1 lookup per property per character) and saw very good performance improvements.
As per our conversation yesterday, I'll finish this review and prepare that work to be incorporated here
```rust
self.decomp.state = FormState::Invalid;

// Create a string from the original characters to normalize
let mut orig_str = String::with_capacity(self.len as usize * 4);
```
Instead of allocating these strings in `composed` and `decomposed`, could we instead pass a `scratch_string` (or similar) to this and `composed`, so that we can reuse one allocation?
Something like 352225f?
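The scratch-buffer pattern being suggested is roughly this. `normalize_into` is an illustrative stand-in, not the actual `composed`/`decomposed` signature:

```rust
// The caller owns one String and lends it out on every call; clear() empties
// it but keeps the capacity, so after warm-up no further allocation occurs.
fn normalize_into(scratch: &mut String, chars: &[char]) -> usize {
    scratch.clear(); // retains previously grown capacity
    for &c in chars {
        // ... the real code would push the normalized form of `c` here ...
        scratch.push(c);
    }
    scratch.len()
}
```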
```rust
}

// BiDi embedding levels:
let bidi_embedding_levels = unicode_bidi::BidiInfo::new_with_data_source(&lcx.analysis_data_sources.bidi_class(), text, None).levels;
```
Is it true that we might only need to run this iff one of the characters is from a `BidiClass` that requires resolution? If so, when we have the composite property trie, we could add a fast path for this.
```rust
    }
}

fn is_emoji_grapheme(analysis_data_sources: &AnalysisDataSources, grapheme: &str) -> bool {
```
This function shows up significantly in profiles, but I'm wondering whether we can address that with the composite property provider. I.e., we store whether some character is an emoji and fast path out of this with something like:
```rust
let mut is_emoji_or_pictograph = false;
let chars = segment_text.char_indices().zip(item_infos_iter.by_ref()).map(|((_, ch), (info, style_index))| {
    // ...
    is_emoji_or_pictograph |= info.is_emoji_or_pictograph;
    // ...
});
// ...
let cluster = CharCluster::new(
    // ...
    is_emoji_or_pictograph
        || (segment_text.len() > 1 && is_emoji_grapheme(analysis_data_sources, segment_text)),
    // ...
);
```

where `is_emoji_or_pictograph` is a new field on item infos.
```rust
let cluster = CharCluster::new(
    chars,
    is_emoji_grapheme(analysis_data_sources, segment_text),
    len,
```
Do we need `len` if it can be obtained from `chars.len()`?
```rust
}

// BiDi embedding levels:
let bidi_embedding_levels = unicode_bidi::BidiInfo::new_with_data_source(&lcx.analysis_data_sources.bidi_class(), text, None).levels;
```
This bidi impl doesn't enable passing in allocations (it allocates within its impl). I created an issue in the upstream repo asking whether it's possible to pass allocations in: servo/unicode-bidi#146
Migration of text analysis from Swash → ICU4X
Overview
ICU4X enables text analysis and internationalisation. For Parley, this includes locale and language recognition,
bidirectional text evaluation, text segmentation, emoji recognition, NFC/NFD normalisation and other Unicode character information.
ICU4X is developed and maintained by a trusted authority in the space of text internationalisation: the ICU4X Technical Committee (ICU4X-TC) in the Unicode Consortium. It is targeted at resource-constrained environments, which matters for Parley.
Notable changes
- `select_font` emoji detection improvements: flag emoji ("🇺🇸") and keycap sequences (e.g. 0️⃣ through 9️⃣) are now supported in cluster detection; Swash did not support these.

Performance/binary size
- `vello_editor` is ~22kB larger (9642kB vs 9620kB).

Other details
- `Language` parsing is more tolerant, e.g. it permits extra, invalid subtags (like in "en-Latn-US-a-b-c-d").
- (`vello_editor` compilation testing.) In order to potentially support correct word breaking across all languages, without seeing a huge compilation size increase, we would need a way for users to attach only the locale data they need at runtime. This locale data could be generated (with `icu4x-datagen`) and attached (using `DataProvider`s) at runtime in the future.