Skip to content

Conversation

@fw-immunant
Copy link
Contributor

Currently this is just the C-side implementation; I'll add commits moving the impl of Rust splitting and merging from the crisp repository tomorrow.

@fw-immunant fw-immunant force-pushed the fw/c2rust-postprocess-split-merge branch from 16f6f85 to 4ffc28d Compare November 26, 2025 02:11
@ahomescu
Copy link
Contributor

Not against this approach per se, just wanted to bring up the alternative approach of using the existing c2rust::src_loc annotations for this. Do we not want to use those?

@thedataking thedataking force-pushed the feature/c2rust-postprocess-take-two branch 2 times, most recently from d2cf708 to 5b517cb Compare November 27, 2025 01:40
@fw-immunant
Copy link
Contributor Author

Not against this approach per se, just wanted to bring up the alternative approach of using the existing c2rust::src_loc annotations for this. Do we not want to use those?

I discussed this with @thedataking in a call.

One problem is that we only have the c2rust::src_loc annotations if the transpiler was invoked with --reorganize-definitions, which may not always be the case. We could just say that we require this flag (or another that would also cause these attributes to be emitted) for comment reinsertion, but then it may be the case that these attributes themselves are surprising to the LLM and damage performance. Emitting this information out of band keeps this functionality independent and simplifies using it in other tools; our other tooling on the CRISP side currently handles a similar format of a flat JSON string->string map.

The other problem is that even given said annotations, we only know the start of the C definition, not its entire bounds (nor bounds of any relevant preceding comments). Most of the logic here is in getting precise bounds from the information in the Clang AST, which is otherwise lost. So we would need to add that logic somewhere even if we were going to use the c2rust::src_loc attribute for this.

@fw-immunant fw-immunant force-pushed the fw/c2rust-postprocess-split-merge branch from e751b54 to e51268a Compare December 1, 2025 14:46
@fw-immunant
Copy link
Contributor Author

We can't practically integrate the Rust splitting/merging tools into our Cargo.toml because they use Rust 2024 and have dependencies that also do. But I don't think it's a problem to just have them live in their own subdirectories for now; they just have to be built before the c2rust-postprocess tool can invoke them.

Copy link
Contributor

@kkysen kkysen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't practically integrate the Rust splitting/merging tools into our Cargo.toml because they use Rust 2024 and have dependencies that also do. But I don't think it's a problem to just have them live in their own subdirectories for now; they just have to be built before the c2rust-postprocess tool can invoke them.

That seems fine for now. When we get around to #1227, we can add it back.

@thedataking thedataking force-pushed the feature/c2rust-postprocess-take-two branch 7 times, most recently from 85b2bc4 to 4fe4d8d Compare December 4, 2025 02:43
@fw-immunant fw-immunant force-pushed the fw/c2rust-postprocess-split-merge branch from e51268a to edcb0bf Compare December 4, 2025 19:28
@fw-immunant fw-immunant changed the base branch from feature/c2rust-postprocess-take-two to master December 4, 2025 19:29
@fw-immunant fw-immunant force-pushed the fw/c2rust-postprocess-split-merge branch from 485b683 to 6f349c1 Compare December 6, 2025 15:47
@fw-immunant
Copy link
Contributor Author

We now seem to extract correct source ranges for all the definitions in libxml2 and lua in the testsuite.

this is being used for local definitions instead of the `static` keyword, so it should encompass the newer helpers we've defined recently
Clang's SourceRange ends at the beginning of the final token, but this is not useful if we want to look at every character of a declaration, as we may when considering the spelling of the corresponding C definition
otherwise, we treat them as one character, which means the next definition's source range will expand to include them as if they were leading comments for it
this can occur if a macro is defined inside a top-level function
otherwise, inner decls like local variables confuse our tracking of definition bounds
@fw-immunant fw-immunant force-pushed the fw/c2rust-postprocess-split-merge branch from c56e84d to 9d65be7 Compare December 7, 2025 21:04

## Building

These tools rely on rust-analyzer's libraries, so they need a relatively recent version of Rust. Run `cargo build --release` in their respective directories to build them.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For me, that picks up the c2rust nightly toolchain and fails because it is too old. I had to run cargo +stable build --release.

Copy link
Contributor Author

@fw-immunant fw-immunant Dec 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, that's because I was running actual cargo, not rustup aliased to cargo. In the latter case (which is probably more common) your command is the right one; I'll clarify. Or alternately, rustup might pick up a rust-toolchain.toml if we placed one somewhere.

Copy link
Contributor

@kkysen kkysen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do I test this? Also, do we need to merge the --emit-c-decl-map changes to c2rust-transpile/c2rust-ast-exporter at the same time as the new tools/? The former looks mostly good to me, but I haven't reviewed the latter. Also, is it worth it to review split_ffi_entry_points, as FWIU, it's already in crisp?

Comment on lines +313 to +317
decls_sorted.sort_by(|v1, v2| {
self.c_decls[v1]
.begin_loc()
.cmp(&self.c_decls[v2].begin_loc())
});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
decls_sorted.sort_by(|v1, v2| {
self.c_decls[v1]
.begin_loc()
.cmp(&self.c_decls[v2].begin_loc())
});
decls_sorted.sort_by_key(|v| self.c_decls[v].begin_loc());

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you also move the sorting to before name_loc_map and prev_src_loc, which are only used in the loop?

),
};

match serde_json::ser::to_writer(file, &decl_map) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's probably best to just write to a Vec<u8>/String and then write the file all at once. You could also write to a buffered writer, but that's easy to forget, like here.

Comment on lines +620 to +622
"Unable to write C declaration map to file {}: {}",
output_path.display(),
e
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"Unable to write C declaration map to file {}: {}",
output_path.display(),
e
"Unable to write C declaration map to file {}: {e}",
output_path.display(),

}

let file_content =
std::fs::read(&t.ast_context.get_file_path(t.main_file).unwrap()).unwrap();
Copy link
Contributor

@kkysen kkysen Dec 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we use a .expect() here or fs_err? A file missing is a fairly common error, so better context is important.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably a good idea, will do.

let mut begin_offset = src_loc_to_byte_offset(&line_end_offsets, begin);
let mut end_offset = src_loc_to_byte_offset(&line_end_offsets, end);
assert!(begin_offset <= end_offset);
const VT: u8 = 11;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's VT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Vertical tab?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, ASCII VT, which has gone out of fashion so Rust lacks an escape for it.

})
.collect::<HashMap<_, _>>();

// Generate a map from Rust items to the source code of their C declarations.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we put this in its own function? I think it'd be easier to follow that way.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, an emit_c_decl_map function would probably be good to avoid bloating fn translate.

.collect::<HashMap<_, _>>();

// Generate a map from Rust items to the source code of their C declarations.
let decl_map = if let Some(decl_source_ranges) = decl_source_ranges {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

decl_source_ranges.map(|decl_sources_ranges| {})?


let file_content =
std::fs::read(&t.ast_context.get_file_path(t.main_file).unwrap()).unwrap();
let line_end_offsets = //memchr::memchr_iter(file_content, '\n')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the memchr for?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be a more efficient equivalent, but I don't think we have the dependency and this hasn't been a bottleneck.

assert!(begin_offset <= end_offset);
const VT: u8 = 11;
/* Skip whitespace and any trailing semicolons after the previous decl. */
while let Some(b'\t' | b'\n' | &VT | b'\r' | b' ' | b';') =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you use u8::is_ascii_whitespace + ;?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't include \v, but this could be expressed using that function if we wanted. I don't know if it's clearer, given the specific set of characters we care about.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. is_ascii_whitespace also handles FF but not VT. Is VT worth handling at all? I guess it doesn't matter that much. I don't imagine anything will actually produce VTs or FFs.

@fw-immunant
Copy link
Contributor Author

How do I test this? Also, do we need to merge the --emit-c-decl-map changes to c2rust-transpile/c2rust-ast-exporter at the same time as the new tools/? The former looks mostly good to me, but I haven't reviewed the latter. Also, is it worth it to review split_ffi_entry_points, as FWIU, it's already in crisp?

The flag is independent from tools but intended to be used alongside them in the CRISP loop. This PR doesn't include any automated testing yet, but I've been manually testing just by invoking rustc with the flag and examining the produced .c_decls.json files.

In addition to the testsuite, I've been using the following test case which I'll probably add as a snapshot test:

#include <stddef.h>
#include <stdint.h>
int a;int bb;/*comment for cc*/int ccc;
/*
123456789012345678901234567890
*/
typedef uintptr_t ngx_uint_t;
void f(void){}void g(void){}/*comment for h*/void h(void){}int d;int e;;;;;;;;int another;
//comment for ngx_hash_init
void
ngx_hash_init(void)
{
    size_t len;
    ngx_uint_t align = len & ~sizeof(void *);
}
int
anotherfunc(void)
{
    int var = 0;
    return 4 + var;
}
int
funcholdingdefine(void)
{
#define FOO 4000+50
    return FOO;
}
void forwarddeclbeforeotherdef(int x);
void funcafterdecl(int n) {}
/*real defn*/
void forwarddeclbeforeotherdef(int x) { }

@fw-immunant fw-immunant mentioned this pull request Dec 9, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants