fix: prevent same-name file collision in parser output directories (#51) #203
teamauresta wants to merge 2 commits into HKUDS:main
Conversation
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 267b6955ec
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
    """
    file_path = Path(file_path).resolve()
    stem = file_path.stem
    path_hash = hashlib.md5(str(file_path).encode()).hexdigest()[:8]
Use a non-truncated hash for unique output directories
_unique_output_dir() truncates the MD5 digest to 8 hex chars, which gives only 32 bits of space; with larger batch runs, different absolute paths will eventually collide and be mapped to the same base_output_dir, causing the same output overwrite/data-loss behavior this patch is trying to eliminate (all updated parse paths now rely on this helper). In practice, birthday collisions become plausible around tens of thousands of files, so this can silently corrupt experiment outputs in high-volume ingestion jobs.
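The "tens of thousands of files" claim follows directly from the birthday bound. A quick self-contained check (the function name `collision_prob` is illustrative, not part of the PR):

```python
import math

def collision_prob(n: int, bits: int = 32) -> float:
    """Probability of at least one collision among n uniform hashes
    drawn from a 2**bits space, via the standard birthday approximation
    P(collision) ~= 1 - exp(-n*(n-1) / (2 * 2**bits))."""
    space = 2 ** bits
    return 1.0 - math.exp(-n * (n - 1) / (2 * space))

print(f"{collision_prob(10_000):.3%}")       # ~1.2% with 10k files at 32 bits
print(f"{collision_prob(100_000):.1%}")      # ~69% with 100k files at 32 bits
print(f"{collision_prob(100_000, 64):.2e}")  # negligible at 64 bits
```

At 8 hex chars (32 bits), a 100k-file ingestion run collides with high probability; keeping 16 hex chars of a SHA-256 digest (64 bits) already makes collisions negligible, and the full digest removes the concern entirely.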
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 82d8689a0b
    """
    file_path = Path(file_path).resolve()
    stem = file_path.stem
    path_hash = hashlib.md5(str(file_path).encode()).hexdigest()[:8]
Avoid MD5 for output-dir hashing
Using hashlib.md5(...) here can raise at runtime on FIPS-enabled Python/OpenSSL builds, which means any parse call with output_dir set will fail before running MinerU/Docling. This regression is environment-specific but impactful in hardened production setups; prefer a FIPS-safe hash (for example SHA-256 truncated) or pass usedforsecurity=False where supported.
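A FIPS-tolerant version of the hashing step could look like the sketch below (the helper name `path_hash` and the fallback strategy are illustrative assumptions, not the PR's code):

```python
import hashlib

def path_hash(path_str: str, length: int = 16) -> str:
    """Short, stable hex digest of a path that does not crash on
    FIPS-enabled builds.

    usedforsecurity=False (Python 3.9+) tells OpenSSL the digest is
    not used for security, which keeps MD5 available under FIPS.
    On older interpreters (TypeError) or builds that still reject MD5
    (ValueError), fall back to SHA-256, which is always permitted.
    """
    data = path_str.encode("utf-8")
    try:
        h = hashlib.md5(data, usedforsecurity=False)
    except (TypeError, ValueError):
        h = hashlib.sha256(data)
    return h.hexdigest()[:length]
```

Note the caveat: the two branches produce different digests for the same path, so a deployment should settle on one scheme (SHA-256 truncated is the simplest) rather than mix environments that take different branches.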
When multiple files share the same name (e.g. dir1/paper.pdf and dir2/paper.pdf), their parser output was written to the same directory, causing data loss.
Add _unique_output_dir() that creates a unique subdirectory per file by appending a short hash of the file's absolute path (e.g. paper_a1b2c3d4/). This ensures each file gets its own isolated output directory.