Skip to content

Conversation

@Kmeakin
Copy link
Contributor

@Kmeakin Kmeakin commented Oct 13, 2025

Minor refactors to unicode_data that occured to me while trying to reduce the size of the tables. Splitting into a separate PR. NFC

@rustbot
Copy link
Collaborator

rustbot commented Oct 13, 2025

library/core/src/unicode/unicode_data.rs is generated by the src/tools/unicode-table-generator tool.

If you want to modify unicode_data.rs, please modify the tool then regenerate the library source file via ./x run src/tools/unicode-table-generator instead of editing unicode_data.rs manually.

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-libs Relevant to the library team, which will review and decide on the PR/issue. labels Oct 13, 2025
@rustbot
Copy link
Collaborator

rustbot commented Oct 13, 2025

r? @joboet

rustbot has assigned @joboet.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

@Kmeakin Kmeakin force-pushed the km/unicode-data/refactors branch from 2c5244e to 90adbe2 Compare October 13, 2025 01:07
@Kmeakin Kmeakin force-pushed the km/unicode-data/refactors branch from 90adbe2 to 1a646cf Compare October 13, 2025 20:31
@rustbot

This comment has been minimized.

}

fn rustfmt(path: &str) {
std::process::Command::new("rustfmt").arg(path).status().expect("rustfmt failed");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is rustfmt really always in PATH when this command is run? Otherwise, I think it'd be easier to slap a big #[rustfmt::skip] on the mod unicode_data.

Copy link
Contributor Author

@Kmeakin Kmeakin Oct 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The intention is to keep the generated unicode_data.rs readable without having to carefully construct well-formatted code in the metaprogram. For that, we need to run rustfmt

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I personally would rather just add #![rustfmt::skip] to the top of the file than run it through rustfmt, personally. I don't really think that the file is readable, rustfmt or not, due to the sheer amount of data included.

@Kmeakin Kmeakin force-pushed the km/unicode-data/refactors branch 3 times, most recently from 1b56c98 to dc0dcf5 Compare October 19, 2025 23:42
@rustbot rustbot added the T-bootstrap Relevant to the bootstrap subteam: Rust's build system (x.py and src/bootstrap) label Oct 19, 2025
@rustbot
Copy link
Collaborator

rustbot commented Oct 19, 2025

This PR was rebased onto a different master commit. Here's a range-diff highlighting what actually changed.

Rebasing is a normal part of keeping PRs up to date, so no action is needed—this note is just to help reviewers.

@rust-log-analyzer

This comment has been minimized.

@Kmeakin Kmeakin force-pushed the km/unicode-data/refactors branch from dc0dcf5 to d2c9773 Compare October 20, 2025 01:02
@rust-log-analyzer

This comment has been minimized.

@Kmeakin Kmeakin force-pushed the km/unicode-data/refactors branch 2 times, most recently from 41d988f to ba38f37 Compare October 21, 2025 23:32
@Kmeakin
Copy link
Contributor Author

Kmeakin commented Oct 27, 2025

@joboet ping?

Copy link
Member

@joboet joboet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay. This looks great!
r=me after addressing the nit (or just ignore it, it's fine)

View changes since this review

@joboet
Copy link
Member

joboet commented Oct 31, 2025

@bors delegate+

@bors
Copy link
Collaborator

bors commented Oct 31, 2025

✌️ @Kmeakin, you can now approve this pull request!

If @joboet told you to "r=me" after making some further change, please make that change, then do @bors r=@joboet

Instead of `include_str!()`ing `range_search.rs`, just make it a normal
module under `core::unicode`. This means the same source code doesn't
have to be checked in twice, and it plays nicer with IDEs.

Also rename it to `rt` since it includes functions for searching the
bitsets and case conversion tables as well as the range
represesentation.
Remove `#[rustfmt::skip]` from all the generated modules in
`unicode_data.rs`. This means we won't have to worry so much about
getting indetation and formatting right when generating code.

Exempted for now some tables which would be too big when formatted by
`rustfmt`.
This check was made redundant (it will always be true) when we removed
all ASCII characters from the tables
(rust-lang@a8c6694).
To make the final output code easier to see:
* Get rid of the unnecessary line-noise of `.unwrap()`ing calls to
  `write!()` by moving the `.unwrap()` into a macro.
* Join consecutive `write!()` calls using a single multiline format
  string.
* Replace `.push()` and `.push_str(format!())` with `write!()`.
* If after doing all of the above, there is only a single `write!()`
  call in the function, just construct the string directly with
  `format!()`.
Instead of generating a standalone executable to test `unicode_data`,
generate normal tests in `coretests`. This ensures tests are always
generated, and will be run as part of the normal testsuite.

Also change the generated tests to loop over lookup tables, rather than
generating a separate `assert_eq!()` statement for every codepoint. The
old approach produced a massive (20,000 lines plus) file which took
minutes to compile!
@Kmeakin Kmeakin force-pushed the km/unicode-data/refactors branch from ba38f37 to 0e6131c Compare October 31, 2025 14:20
@Kmeakin
Copy link
Contributor Author

Kmeakin commented Oct 31, 2025

@bors r+

@bors
Copy link
Collaborator

bors commented Oct 31, 2025

📌 Commit 0e6131c has been approved by Kmeakin

It is now in the queue for this repository.

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Oct 31, 2025
@joboet
Copy link
Member

joboet commented Oct 31, 2025

@bors r-

@bors bors added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. labels Oct 31, 2025
@joboet
Copy link
Member

joboet commented Oct 31, 2025

@bors r=joboet

@bors
Copy link
Collaborator

bors commented Oct 31, 2025

📌 Commit 0e6131c has been approved by joboet

It is now in the queue for this repository.

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels Oct 31, 2025
matthiaskrgr added a commit to matthiaskrgr/rust that referenced this pull request Oct 31, 2025
… r=joboet

`unicode_data` refactors

Minor refactors to `unicode_data` that occured to me while trying to reduce the size of the tables. Splitting into a separate PR. NFC
bors added a commit that referenced this pull request Oct 31, 2025
Rollup of 9 pull requests

Successful merges:

 - #139310 (add first HelenOS compilation targets)
 - #147161 (implement VecDeque extend_from_within and prepend_from_within)
 - #147622 (`unicode_data` refactors)
 - #147780 (Implement VecDeque::extract_if)
 - #147942 (Enable regression labeling aliases)
 - #147986 (Use fstatat() in DirEntry::metadata on Apple platforms)
 - #148103 (cg_llvm: Pass `debuginfo_compression` through FFI as an enum)
 - #148319 (docs: Fix argument names for `carrying_mul_add`)
 - #148322 (Enable file locking support in illumos)

r? `@ghost`
`@rustbot` modify labels: rollup
Zalathar added a commit to Zalathar/rust that referenced this pull request Nov 1, 2025
… r=joboet

`unicode_data` refactors

Minor refactors to `unicode_data` that occured to me while trying to reduce the size of the tables. Splitting into a separate PR. NFC
Zalathar added a commit to Zalathar/rust that referenced this pull request Nov 1, 2025
… r=joboet

`unicode_data` refactors

Minor refactors to `unicode_data` that occured to me while trying to reduce the size of the tables. Splitting into a separate PR. NFC
Zalathar added a commit to Zalathar/rust that referenced this pull request Nov 1, 2025
… r=joboet

`unicode_data` refactors

Minor refactors to `unicode_data` that occured to me while trying to reduce the size of the tables. Splitting into a separate PR. NFC
bors added a commit that referenced this pull request Nov 1, 2025
Rollup of 9 pull requests

Successful merges:

 - #139310 (add first HelenOS compilation targets)
 - #147161 (implement VecDeque extend_from_within and prepend_from_within)
 - #147622 (`unicode_data` refactors)
 - #147780 (Implement VecDeque::extract_if)
 - #147942 (Enable regression labeling aliases)
 - #147986 (Use fstatat() in DirEntry::metadata on Apple platforms)
 - #148103 (cg_llvm: Pass `debuginfo_compression` through FFI as an enum)
 - #148319 (docs: Fix argument names for `carrying_mul_add`)
 - #148322 (Enable file locking support in illumos)

r? `@ghost`
`@rustbot` modify labels: rollup
@bors bors merged commit 75fbbd3 into rust-lang:master Nov 1, 2025
11 checks passed
@rustbot rustbot added this to the 1.93.0 milestone Nov 1, 2025
rust-timer added a commit that referenced this pull request Nov 1, 2025
Rollup merge of #147622 - Kmeakin:km/unicode-data/refactors, r=joboet

`unicode_data` refactors

Minor refactors to `unicode_data` that occured to me while trying to reduce the size of the tables. Splitting into a separate PR. NFC
@Kmeakin Kmeakin deleted the km/unicode-data/refactors branch November 1, 2025 11:26
@Zalathar
Copy link
Member

Zalathar commented Nov 3, 2025

I think this PR might have caused the x86_64-gnu-aux CI job to go from ~2 hours to ~3.5 hours, making that job the slowest CI job.

let mut out = String::new();
out.push_str("pub const UNICODE_VERSION: (u8, u8, u8) = ");
fn rustfmt(path: &str) {
std::process::Command::new("rustfmt").arg(path).status().expect("rustfmt failed");
Copy link
Member

@jieyouxu jieyouxu Nov 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is surprising to me, this will pick up whichever rustfmt happens to be in PATH? That we run rustfmt or not on the generated code is not the problem, the problem is that we need to run the same rustfmt for this generated code and also for the rest of standard library.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. T-bootstrap Relevant to the bootstrap subteam: Rust's build system (x.py and src/bootstrap) T-libs Relevant to the library team, which will review and decide on the PR/issue.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants