
refactor(mercury): Optimize decoding time by skipping renaming #681

Closed
wants to merge 3 commits

Conversation

el-ev (Contributor) commented Nov 12, 2024

In the save_to_file function of cache_object.rs, we previously wrote data to a temporary file and then renamed it to the final path. This process introduced additional I/O overhead due to the creation and renaming of temporary files (see the flamegraph).

flamegraph

This change modifies the function to write directly to the target file. By doing so, we eliminate the need for temporary files and the renaming operation, reducing I/O operations and improving performance.

The decoding time of the test test_pack_decode_with_large_file_with_delta_without_ref dropped from ~26s to ~21s (a roughly 20% improvement) on a MacBook Air M2.

@el-ev el-ev changed the title [mercury] refactor: Optimize decoding time by skipping renaming refactor(mercury): Optimize decoding time by skipping renaming Nov 12, 2024
Hou-Xiaoxuan (Contributor) commented:
Thank you for your contribution. In fact, the previous version was deliberately designed around a technique known as "atomic writes" to prevent partially written files from being read. (On most Unix systems, mv/rename is an atomic operation.) By first writing to a temporary file and then renaming it to the target path, we ensure that readers only ever access fully written files, avoiding incomplete reads and potential race conditions.

Additionally, the execution time of test_pack_decode_with_large_file_with_delta_without_ref is actually quite variable due to the complex concurrent calls and scheduling at higher levels. In reality, the rename operation itself doesn’t significantly impact performance.

We appreciate your insights and analysis. If you have further suggestions or alternative ideas for optimizing performance while maintaining atomicity, we’d be glad to discuss them.

el-ev (Contributor, Author) commented Nov 12, 2024

Thank you for the clarification on atomic writes. I now see that data races may occur if the file is read before it is completely written. I will close this PR and propose a new one when I have a better idea for optimization; perhaps in-memory atomic data structures or file locks could be used.


Besides, I'm currently working on a student project related to the mega project, more specifically the mercury crate. However, the instructions for contributing to this crate are not clear to me (for instance, the README file is empty and there are few to no related issues). @Hou-Xiaoxuan @genedna Could you please provide me with some guidance on how to contribute to this crate?

Thanks in advance.

Hou-Xiaoxuan (Contributor) commented:

@el-ev Mercury serves as a foundational module within the project rather than an entry point. Broadly speaking, it contains basic data structures and functionality related to Git, such as encoding and decoding for Git index files. There are currently some performance issues with the decoding function, as noted in Issue #600, for which there is a temporary but suboptimal solution. If you're interested in optimization, you could look into implementing a more efficient diff algorithm, though this is quite challenging.

Alternatively, if you’re open to contributing beyond the Mercury module, the Libra module might be a good option. Libra is a Rust-based implementation of Git, with some commands yet to be implemented. Additionally, it has more comprehensive documentation and contribution guidelines.

@el-ev el-ev closed this Nov 12, 2024
@el-ev el-ev deleted the decode_fileio branch December 18, 2024 15:56