
Commit 7f1e626

Rollup merge of #70486 - Mark-Simulacrum:unicode-shrink, r=dtolnay
Shrink Unicode tables (even more)

This shrinks the Unicode tables further, building on the wins in #68232 (the previous counts differ due to an interim Unicode version update; see #69929).

The new data structure is around 3x slower on a benchmark that sequentially looks up every Unicode scalar value in every included data set. Note that for ASCII, the exposed functions on `char` optimize with direct branches, so ASCII retains the same performance regardless of internal optimizations (or the reverse). Also note that the size reduction due to the skip list (which is where the performance loss comes from) is around 40%; as a result, I believe the performance loss is acceptable, since the routines are still quite fast. Anywhere this is hot should probably be using a custom data structure anyway (e.g., a raw bitset), or something optimized for frequently seen values.

This PR updates the bitset data structure and introduces a new data structure similar to a skip list. For more details, see the [main.rs] of the table generator, which describes both.

The commits mostly work individually and document size wins. As before, this is tested on all valid chars to verify the same results as nightly (and the canonical Unicode data sets); happily, no bugs were found.

[main.rs]: https://github.com/rust-lang/rust/blob/fb4a715e18b/src/tools/unicode-table-generator/src/main.rs

| Set             | Previous (bytes) | New (bytes) | % of old | Codepoints | Ranges |
|-----------------|-----------------:|------------:|---------:|-----------:|-------:|
| Alphabetic      | 3055             | 1599        | 52%      | 132875     | 695    |
| Case Ignorable  | 2136             | 949         | 44%      | 2413       | 410    |
| Cased           | 934              | 359         | 38%      | 4286       | 141    |
| Cc              | 43               | 9           | 20%      | 65         | 2      |
| Grapheme Extend | 1774             | 813         | 46%      | 1979       | 344    |
| Lowercase       | 985              | 867         | 88%      | 2344       | 652    |
| N               | 1266             | 419         | 33%      | 1781       | 133    |
| Uppercase       | 934              | 777         | 83%      | 1911       | 643    |
| White_Space     | 140              | 37          | 26%      | 25         | 10     |
| Total           | 11267            | 5829        | 51%      | -          | -      |
2 parents bbd3634 + ad679a7 commit 7f1e626
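For context, these tables sit behind the stable Unicode-property methods on `char` mentioned above. The snippet below is purely illustrative (it is not part of this commit) and shows the kind of lookups whose size/speed tradeoff the PR changes:

```rust
fn main() {
    // ASCII is answered with direct branches in the `char` methods,
    // so these calls never reach the generated tables.
    assert!('A'.is_alphabetic());
    assert!('a'.is_lowercase());
    assert!(' '.is_whitespace());

    // Non-ASCII scalar values go through the generated lookup tables
    // (compressed bitset or skip list, depending on the property).
    assert!('é'.is_alphabetic());
    assert!('Δ'.is_uppercase());
    assert!('٣'.is_numeric()); // U+0663 ARABIC-INDIC DIGIT THREE
}
```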

File tree

7 files changed (+1152, −652 lines)


src/libcore/unicode/mod.rs

−25 lines
@@ -32,28 +32,3 @@ pub use unicode_data::lowercase::lookup as Lowercase;
 pub use unicode_data::n::lookup as N;
 pub use unicode_data::uppercase::lookup as Uppercase;
 pub use unicode_data::white_space::lookup as White_Space;
-
-#[inline(always)]
-fn range_search<const N: usize, const N1: usize, const N2: usize>(
-    needle: u32,
-    chunk_idx_map: &[u8; N],
-    (last_chunk_idx, last_chunk_mapping): (u16, u8),
-    bitset_chunk_idx: &[[u8; 16]; N1],
-    bitset: &[u64; N2],
-) -> bool {
-    let bucket_idx = (needle / 64) as usize;
-    let chunk_map_idx = bucket_idx / 16;
-    let chunk_piece = bucket_idx % 16;
-    let chunk_idx = if chunk_map_idx >= N {
-        if chunk_map_idx == last_chunk_idx as usize {
-            last_chunk_mapping
-        } else {
-            return false;
-        }
-    } else {
-        chunk_idx_map[chunk_map_idx]
-    };
-    let idx = bitset_chunk_idx[(chunk_idx as usize)][chunk_piece];
-    let word = bitset[(idx as usize)];
-    (word & (1 << (needle % 64) as u64)) != 0
-}
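The removed `range_search` above is the old three-level bitset lookup. As a rough illustration of its indexing arithmetic (the constants 64 and 16 come from the code above; the sample needle is arbitrary and the table contents are not shown here):

```rust
fn main() {
    // Sample needle: U+1F600 (GRINNING FACE) = 128512.
    let needle: u32 = 0x1F600;

    // Each u64 word of the bitset covers 64 codepoints...
    let bucket_idx = (needle / 64) as usize;
    // ...and bucket bytes are grouped into fixed chunks of 16.
    let chunk_map_idx = bucket_idx / 16;
    let chunk_piece = bucket_idx % 16;

    assert_eq!(bucket_idx, 2008);
    assert_eq!(chunk_map_idx, 125);
    assert_eq!(chunk_piece, 8);

    // chunk_idx_map[125] selects a deduplicated chunk, that chunk's 8th byte
    // selects a word in `bitset`, and bit (needle % 64) of that word -- here
    // bit 0 -- answers the membership query.
}
```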

src/libcore/unicode/unicode_data.rs

+443 −514 lines
Large diffs are not rendered by default.

src/tools/unicode-table-generator/src/main.rs

+182 −4 lines
@@ -1,9 +1,83 @@
+//! This implements the core logic of the compression scheme used to compactly
+//! encode Unicode properties.
+//!
+//! We have two primary goals with the encoding: we want to be compact, because
+//! these tables often end up in ~every Rust program (especially the
+//! grapheme_extend table, used for str debugging), including those for embedded
+//! targets (where space is important). We also want to be relatively fast,
+//! though this is more of a nice to have rather than a key design constraint.
+//! It is expected that libraries/applications which are performance-sensitive
+//! to Unicode property lookups are extremely rare, and those that care may find
+//! the tradeoff of the raw bitsets worth it. For most applications, a
+//! relatively fast but much smaller (and as such less cache-impacting, etc.)
+//! data set is likely preferable.
+//!
+//! We have two separate encoding schemes: a skiplist-like approach, and a
+//! compressed bitset. The datasets we consider mostly use the skiplist (it's
+//! smaller) but the lowercase and uppercase sets are sufficiently sparse for
+//! the bitset to be worthwhile -- for those sets the bitset is a 2x size win.
+//! Since the bitset is also faster, this seems an obvious choice. (As a
+//! historical note, the bitset was also the prior implementation, so its
+//! relative complexity had already been paid).
+//!
+//! ## The bitset
+//!
+//! The primary idea is that we 'flatten' the Unicode ranges into an enormous
+//! bitset. To represent any arbitrary codepoint in a raw bitset, we would need
+//! over 17 kilobytes of data per character set -- way too much for our
+//! purposes.
+//!
+//! First, the raw bitset (one bit for every valid `char`, from 0 to 0x10FFFF,
+//! not skipping the small 'gap') is associated into words (u64) and
+//! deduplicated. On random data, this would be useless; on our data, this is
+//! incredibly beneficial -- our data sets have (far) less than 256 unique
+//! words.
+//!
+//! This gives us an array that maps `u8 -> word`; the current algorithm does
+//! not handle the case of more than 256 unique words, but we are relatively far
+//! from coming that close.
+//!
+//! With that scheme, we now have a single byte for every 64 codepoints.
+//!
+//! We further chunk these by some constant N (between 1 and 64 per group,
+//! dynamically chosen for smallest size), and again deduplicate and store in an
+//! array (u8 -> [u8; N]).
+//!
+//! The bytes of this array map into the words from the bitset above, but we
+//! apply another trick here: some of these words are similar enough that they
+//! can be represented by some function of another word. The particular
+//! functions chosen are rotation, inversion, and shifting (right).
+//!
+//! ## The skiplist
+//!
+//! The skip list arose out of the desire for an even smaller encoding than the
+//! bitset -- and was the answer to the question "what is the smallest
+//! representation we can imagine?". However, it is not necessarily the
+//! smallest, and if you have a better proposal, please do suggest it!
+//!
+//! This is a relatively straightforward encoding. First, we break up all the
+//! ranges in the input data into offsets from each other, essentially a gap
+//! encoding. In practice, most gaps are small -- less than u8::MAX -- so we
+//! store those directly. We make use of the larger gaps (which are nicely
+//! interspersed already) throughout the dataset to index this data set.
+//!
+//! In particular, each run of small gaps (terminating in a large gap) is
+//! indexed in a separate dataset. That data set stores an index into the
+//! primary offset list and a prefix sum of that offset list. These are packed
+//! into a single u32 (11 bits for the offset, 21 bits for the prefix sum).
+//!
+//! Lookup proceeds via a binary search in the index and then a straightforward
+//! linear scan (adding up the offsets) until we reach the needle, and then the
+//! index of that offset is utilized as the answer to whether we're in the set
+//! or not.
+
 use std::collections::{BTreeMap, HashMap};
 use std::ops::Range;
 use ucd_parse::Codepoints;
 
 mod case_mapping;
 mod raw_emitter;
+mod skiplist;
 mod unicode_download;
 
 use raw_emitter::{emit_codepoints, RawEmitter};
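To make the word-deduplication and "derived word" tricks described in the doc comment concrete, here is a small standalone sketch. The toy data set, the function names, and the brute-force `derivable` check are all illustrative; this is not the generator's actual code:

```rust
use std::collections::BTreeMap;

/// Flatten a sorted set of codepoints into 64-bit words and deduplicate them,
/// returning the unique words plus, per 64-codepoint bucket, an index into them.
fn dedup_words(codepoints: &[u32], max_codepoint: u32) -> (Vec<u64>, Vec<u8>) {
    let buckets = (max_codepoint as usize / 64) + 1;
    let mut words = vec![0u64; buckets];
    for &cp in codepoints {
        words[(cp / 64) as usize] |= 1 << (cp % 64);
    }
    let mut unique: Vec<u64> = Vec::new();
    let mut index_of: BTreeMap<u64, u8> = BTreeMap::new();
    let mut bucket_to_word = Vec::with_capacity(buckets);
    for w in words {
        let idx = *index_of.entry(w).or_insert_with(|| {
            unique.push(w);
            // The real generator likewise relies on there being < 256 unique words.
            (unique.len() - 1) as u8
        });
        bucket_to_word.push(idx);
    }
    (unique, bucket_to_word)
}

/// Can `word` be produced from `base` by rotation, inversion, or a right shift?
fn derivable(base: u64, word: u64) -> bool {
    (0..64u32).any(|r| base.rotate_left(r) == word)
        || (0..64u32).any(|r| (!base).rotate_left(r) == word)
        || (1..64u32).any(|s| (base >> s) == word)
}

fn main() {
    // Toy "property": the ASCII letters A-Z and a-z.
    let set: Vec<u32> = (0x41..=0x5A).chain(0x61..=0x7A).collect();
    let (unique, bucket_to_word) = dedup_words(&set, 0xFF);
    // 4 buckets of 64 codepoints each, but only 2 unique words (the empty word
    // and the one holding the letters), so the per-bucket index fits in a u8.
    assert_eq!(bucket_to_word.len(), 4);
    assert_eq!(unique.len(), 2);
    // A word that is a rotation of an existing canonical word need not be
    // stored at all -- only (index, mapping-byte) metadata.
    assert!(derivable(unique[1], unique[1].rotate_left(3)));
}
```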
@@ -152,9 +226,17 @@ fn main() {
         std::process::exit(1);
     });
 
+    // Optional test path, which is a Rust source file testing that the unicode
+    // property lookups are correct.
+    let test_path = std::env::args().nth(2);
+
     let unicode_data = load_data();
     let ranges_by_property = &unicode_data.ranges;
 
+    if let Some(path) = test_path {
+        std::fs::write(&path, generate_tests(&write_location, &ranges_by_property)).unwrap();
+    }
+
     let mut total_bytes = 0;
     let mut modules = Vec::new();
     for (property, ranges) in ranges_by_property {
@@ -163,7 +245,16 @@
         emit_codepoints(&mut emitter, &ranges);
 
         modules.push((property.to_lowercase().to_string(), emitter.file));
-        println!("{:15}: {} bytes, {} codepoints", property, emitter.bytes_used, datapoints,);
+        println!(
+            "{:15}: {} bytes, {} codepoints in {} ranges ({} - {}) using {}",
+            property,
+            emitter.bytes_used,
+            datapoints,
+            ranges.len(),
+            ranges.first().unwrap().start,
+            ranges.last().unwrap().end,
+            emitter.desc,
+        );
         total_bytes += emitter.bytes_used;
     }
 
@@ -173,7 +264,10 @@
         "///! This file is generated by src/tools/unicode-table-generator; do not edit manually!\n",
     );
 
-    table_file.push_str("use super::range_search;\n\n");
+    // Include the range search function
+    table_file.push('\n');
+    table_file.push_str(include_str!("range_search.rs"));
+    table_file.push('\n');
 
     table_file.push_str(&version());
 
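The switch from `use super::range_search;` to `include_str!` means the helper functions are now textually embedded in the generated `unicode_data.rs` rather than imported. `include_str!` yields the named file's contents as a `&'static str` at compile time, resolved relative to the invoking source file. A minimal illustration (it assumes a sibling `range_search.rs`, as in the generator):

```rust
fn main() {
    // Embeds the file's text into the binary at compile time.
    let snippet: &'static str = include_str!("range_search.rs");
    println!("{} bytes of helper source embedded", snippet.len());
}
```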
@@ -236,26 +330,110 @@ fn fmt_list<V: std::fmt::Debug>(values: impl IntoIterator<Item = V>) -> String {
     out
 }
 
+fn generate_tests(data_path: &str, ranges: &[(&str, Vec<Range<u32>>)]) -> String {
+    let mut s = String::new();
+    s.push_str("#![allow(incomplete_features, unused)]\n");
+    s.push_str("#![feature(const_generics)]\n\n");
+    s.push_str("\n#[allow(unused)]\nuse std::hint;\n");
+    s.push_str(&format!("#[path = \"{}\"]\n", data_path));
+    s.push_str("mod unicode_data;\n\n");
+
+    s.push_str("\nfn main() {\n");
+
+    for (property, ranges) in ranges {
+        s.push_str(&format!(r#"    println!("Testing {}");"#, property));
+        s.push('\n');
+        s.push_str(&format!("    {}_true();\n", property.to_lowercase()));
+        s.push_str(&format!("    {}_false();\n", property.to_lowercase()));
+        let mut is_true = Vec::new();
+        let mut is_false = Vec::new();
+        for ch_num in 0..(std::char::MAX as u32) {
+            if std::char::from_u32(ch_num).is_none() {
+                continue;
+            }
+            if ranges.iter().any(|r| r.contains(&ch_num)) {
+                is_true.push(ch_num);
+            } else {
+                is_false.push(ch_num);
+            }
+        }
+
+        s.push_str(&format!("    fn {}_true() {{\n", property.to_lowercase()));
+        generate_asserts(&mut s, property, &is_true, true);
+        s.push_str("    }\n\n");
+        s.push_str(&format!("    fn {}_false() {{\n", property.to_lowercase()));
+        generate_asserts(&mut s, property, &is_false, false);
+        s.push_str("    }\n\n");
+    }
+
+    s.push_str("}");
+    s
+}
+
+fn generate_asserts(s: &mut String, property: &str, points: &[u32], truthy: bool) {
+    for range in ranges_from_set(points) {
+        if range.end == range.start + 1 {
+            s.push_str(&format!(
+                "        assert!({}unicode_data::{}::lookup({:?}), \"{}\");\n",
+                if truthy { "" } else { "!" },
+                property.to_lowercase(),
+                std::char::from_u32(range.start).unwrap(),
+                range.start,
+            ));
+        } else {
+            s.push_str(&format!("        for chn in {:?}u32 {{\n", range));
+            s.push_str(&format!(
+                "            assert!({}unicode_data::{}::lookup(std::char::from_u32(chn).unwrap()), \"{{:?}}\", chn);\n",
+                if truthy { "" } else { "!" },
+                property.to_lowercase(),
+            ));
+            s.push_str("        }\n");
+        }
+    }
+}
+
+fn ranges_from_set(set: &[u32]) -> Vec<Range<u32>> {
+    let mut ranges = set.iter().map(|e| (*e)..(*e + 1)).collect::<Vec<Range<u32>>>();
+    merge_ranges(&mut ranges);
+    ranges
+}
+
 fn merge_ranges(ranges: &mut Vec<Range<u32>>) {
     loop {
         let mut new_ranges = Vec::new();
         let mut idx_iter = 0..(ranges.len() - 1);
+        let mut should_insert_last = true;
         while let Some(idx) = idx_iter.next() {
             let cur = ranges[idx].clone();
             let next = ranges[idx + 1].clone();
             if cur.end == next.start {
-                let _ = idx_iter.next(); // skip next as we're merging it in
+                if idx_iter.next().is_none() {
+                    // We're merging the last element
+                    should_insert_last = false;
+                }
                 new_ranges.push(cur.start..next.end);
             } else {
+                // We're *not* merging the last element
+                should_insert_last = true;
                 new_ranges.push(cur);
             }
         }
-        new_ranges.push(ranges.last().unwrap().clone());
+        if should_insert_last {
+            new_ranges.push(ranges.last().unwrap().clone());
+        }
         if new_ranges.len() == ranges.len() {
             *ranges = new_ranges;
             break;
         } else {
             *ranges = new_ranges;
         }
     }
+
+    let mut last_end = None;
+    for range in ranges {
+        if let Some(last) = last_end {
+            assert!(range.start > last, "{:?}", range);
+        }
+        last_end = Some(range.end);
+    }
 }
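For intuition about `ranges_from_set` and `merge_ranges` above: per-codepoint ranges whose end meets the next range's start collapse into one. Below is a simplified, standalone single-pass equivalent for illustration only; it is not the generator's loop-until-fixpoint implementation:

```rust
use std::ops::Range;

// Simplified illustration of the merging behavior.
fn merge_ranges_simple(ranges: &[Range<u32>]) -> Vec<Range<u32>> {
    let mut out: Vec<Range<u32>> = Vec::new();
    for r in ranges {
        if let Some(last) = out.last_mut() {
            if last.end == r.start {
                // Adjacent ranges: extend the previous one instead of pushing a new one.
                last.end = r.end;
                continue;
            }
        }
        out.push(r.clone());
    }
    out
}

fn main() {
    // Per-codepoint ranges, as `ranges_from_set` would produce for {1, 2, 3, 7, 8}.
    let input = vec![1..2, 2..3, 3..4, 7..8, 8..9];
    assert_eq!(merge_ranges_simple(&input), vec![1..4, 7..9]);
}
```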
src/tools/unicode-table-generator/src/range_search.rs (new file)

+93 lines
@@ -0,0 +1,93 @@
+#[inline(always)]
+fn bitset_search<
+    const N: usize,
+    const CHUNK_SIZE: usize,
+    const N1: usize,
+    const CANONICAL: usize,
+    const CANONICALIZED: usize,
+>(
+    needle: u32,
+    chunk_idx_map: &[u8; N],
+    bitset_chunk_idx: &[[u8; CHUNK_SIZE]; N1],
+    bitset_canonical: &[u64; CANONICAL],
+    bitset_canonicalized: &[(u8, u8); CANONICALIZED],
+) -> bool {
+    let bucket_idx = (needle / 64) as usize;
+    let chunk_map_idx = bucket_idx / CHUNK_SIZE;
+    let chunk_piece = bucket_idx % CHUNK_SIZE;
+    let chunk_idx = if let Some(&v) = chunk_idx_map.get(chunk_map_idx) {
+        v
+    } else {
+        return false;
+    };
+    let idx = bitset_chunk_idx[chunk_idx as usize][chunk_piece] as usize;
+    let word = if let Some(word) = bitset_canonical.get(idx) {
+        *word
+    } else {
+        let (real_idx, mapping) = bitset_canonicalized[idx - bitset_canonical.len()];
+        let mut word = bitset_canonical[real_idx as usize];
+        let should_invert = mapping & (1 << 6) != 0;
+        if should_invert {
+            word = !word;
+        }
+        // Lower 6 bits
+        let quantity = mapping & ((1 << 6) - 1);
+        if mapping & (1 << 7) != 0 {
+            // shift
+            word >>= quantity as u64;
+        } else {
+            word = word.rotate_left(quantity as u32);
+        }
+        word
+    };
+    (word & (1 << (needle % 64) as u64)) != 0
+}
+
+fn decode_prefix_sum(short_offset_run_header: u32) -> u32 {
+    short_offset_run_header & ((1 << 21) - 1)
+}
+
+fn decode_length(short_offset_run_header: u32) -> usize {
+    (short_offset_run_header >> 21) as usize
+}
+
+#[inline(always)]
+fn skip_search<const SOR: usize, const OFFSETS: usize>(
+    needle: u32,
+    short_offset_runs: &[u32; SOR],
+    offsets: &[u8; OFFSETS],
+) -> bool {
+    // Note that this *cannot* be past the end of the array, as the last
+    // element is greater than std::char::MAX (the largest possible needle).
+    //
+    // So, we cannot have found it (i.e. Ok(idx) + 1 != length) and the correct
+    // location cannot be past it, so Err(idx) != length either.
+    //
+    // This means that we can avoid bounds checking for the accesses below, too.
+    let last_idx =
+        match short_offset_runs.binary_search_by_key(&(needle << 11), |header| header << 11) {
+            Ok(idx) => idx + 1,
+            Err(idx) => idx,
+        };
+
+    let mut offset_idx = decode_length(short_offset_runs[last_idx]);
+    let length = if let Some(next) = short_offset_runs.get(last_idx + 1) {
+        decode_length(*next) - offset_idx
+    } else {
+        offsets.len() - offset_idx
+    };
+    let prev =
+        last_idx.checked_sub(1).map(|prev| decode_prefix_sum(short_offset_runs[prev])).unwrap_or(0);
+
+    let total = needle - prev;
+    let mut prefix_sum = 0;
+    for _ in 0..(length - 1) {
+        let offset = offsets[offset_idx];
+        prefix_sum += offset as u32;
+        if prefix_sum > total {
+            break;
+        }
+        offset_idx += 1;
+    }
+    offset_idx % 2 == 1
+}
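The `short_offset_runs` headers decoded above pack two fields into one `u32`: the upper 11 bits are an index into `offsets` (read by `decode_length`) and the lower 21 bits are a prefix sum over the gap encoding (read by `decode_prefix_sum`). A small round-trip check; the `encode_header` helper is written here for illustration and is not part of the diff:

```rust
fn decode_prefix_sum(short_offset_run_header: u32) -> u32 {
    short_offset_run_header & ((1 << 21) - 1)
}

fn decode_length(short_offset_run_header: u32) -> usize {
    (short_offset_run_header >> 21) as usize
}

// Illustrative inverse of the two decode functions above.
fn encode_header(offset_index: usize, prefix_sum: u32) -> u32 {
    assert!(offset_index < (1 << 11)); // 11 bits for the index into `offsets`
    assert!(prefix_sum < (1 << 21)); // 21 bits cover every codepoint up to 0x10FFFF
    ((offset_index as u32) << 21) | prefix_sum
}

fn main() {
    let header = encode_header(42, 0x1F600);
    assert_eq!(decode_length(header), 42);
    assert_eq!(decode_prefix_sum(header), 0x1F600);
    println!("header = {:#010x}", header);
}
```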
