fix: prevent same-name file collision in parser output directories (#51) #203
teamauresta wants to merge 2 commits into HKUDS:main
Conversation
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 267b6955ec
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
    """
    file_path = Path(file_path).resolve()
    stem = file_path.stem
    path_hash = hashlib.md5(str(file_path).encode()).hexdigest()[:8]
Use a non-truncated hash for unique output directories
_unique_output_dir() truncates the MD5 digest to 8 hex chars, which gives only 32 bits of space; with larger batch runs, different absolute paths will eventually collide and be mapped to the same base_output_dir, causing the same output overwrite/data-loss behavior this patch is trying to eliminate (all updated parse paths now rely on this helper). In practice, birthday collisions become plausible around tens of thousands of files, so this can silently corrupt experiment outputs in high-volume ingestion jobs.
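The "tens of thousands of files" claim follows directly from the birthday bound. A quick self-contained check (the function name `collision_prob` is illustrative, not part of the PR):

```python
import math

def collision_prob(n: int, bits: int = 32) -> float:
    """Probability of at least one collision among n uniform hashes
    drawn from a 2**bits space, via the standard birthday approximation
    P(collision) ~= 1 - exp(-n*(n-1) / (2 * 2**bits))."""
    space = 2 ** bits
    return 1.0 - math.exp(-n * (n - 1) / (2 * space))

print(f"{collision_prob(10_000):.3%}")       # ~1.2% with 10k files at 32 bits
print(f"{collision_prob(100_000):.1%}")      # ~69% with 100k files at 32 bits
print(f"{collision_prob(100_000, 64):.2e}")  # negligible at 64 bits
```

At 8 hex chars (32 bits), a 100k-file ingestion run collides with high probability; keeping 16 hex chars of a SHA-256 digest (64 bits) already makes collisions negligible, and the full digest removes the concern entirely.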
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 82d8689a0b
    """
    file_path = Path(file_path).resolve()
    stem = file_path.stem
    path_hash = hashlib.md5(str(file_path).encode()).hexdigest()[:8]
Avoid MD5 for output-dir hashing
Using hashlib.md5(...) here can raise at runtime on FIPS-enabled Python/OpenSSL builds, which means any parse call with output_dir set will fail before running MinerU/Docling. This regression is environment-specific but impactful in hardened production setups; prefer a FIPS-safe hash (for example SHA-256 truncated) or pass usedforsecurity=False where supported.
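A FIPS-tolerant version of the hashing step could look like the sketch below (the helper name `path_hash` and the fallback strategy are illustrative assumptions, not the PR's code):

```python
import hashlib

def path_hash(path_str: str, length: int = 16) -> str:
    """Short, stable hex digest of a path that does not crash on
    FIPS-enabled builds.

    usedforsecurity=False (Python 3.9+) tells OpenSSL the digest is
    not used for security, which keeps MD5 available under FIPS.
    On older interpreters (TypeError) or builds that still reject MD5
    (ValueError), fall back to SHA-256, which is always permitted.
    """
    data = path_str.encode("utf-8")
    try:
        h = hashlib.md5(data, usedforsecurity=False)
    except (TypeError, ValueError):
        h = hashlib.sha256(data)
    return h.hexdigest()[:length]
```

Note the caveat: the two branches produce different digests for the same path, so a deployment should settle on one scheme (SHA-256 truncated is the simplest) rather than mix environments that take different branches.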
When multiple files share the same name (e.g. dir1/paper.pdf and dir2/paper.pdf), their parser output was written to the same directory, causing data loss.
Add _unique_output_dir() that creates a unique subdirectory per file by appending a short hash of the file's absolute path (e.g. paper_a1b2c3d4/). This ensures each file gets its own isolated output directory.