use custom gzip compression to tolerate FUSE-mounted OSS/S3 #980

Open
ideal wants to merge 3 commits into datajuicer:main from ideal:loguru

Conversation

@ideal
Contributor

ideal commented May 15, 2026

Loguru's built-in "gz" compression fails on K8s + Fluid OSS/S3 mounts because the post-compression unlink can raise OSError with errno 34 (ERANGE) even though the source file has actually been removed. This PR replaces it with a custom callback that swallows that specific failure so log compression works under FUSE.

Loguru's built-in "gz" compression fails on K8s + Fluid OSS/S3 mounts
because the post-compression unlink returns OSError errno 34 even though
the source file is removed. Replace it with a custom callback that
swallows that specific failure so log compression works under FUSE.
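
A minimal sketch of what such a callback could look like, using only the standard library. It assumes Loguru's `compression` option accepts a callable that receives the path of the closed log file. The function name `gzip_compression` follows the review summary, but the body here is illustrative and is not the code from this PR.

```python
import errno
import gzip
import os
import shutil


def gzip_compression(path):
    """Gzip `path` into `path + ".gz"`, then remove the original file."""
    with open(path, "rb") as src, gzip.open(path + ".gz", "wb") as dst:
        shutil.copyfileobj(src, dst)
    try:
        os.remove(path)
    except OSError as exc:
        # Assumption: on the FUSE mount the unlink reports errno 34 (ERANGE)
        # although the file is already gone, so only that case is ignored.
        if exc.errno != errno.ERANGE:
            raise
```

With Loguru, a callable like this would presumably be wired in as `logger.add(log_path, compression=gzip_compression)` in place of `compression="gz"`, where `log_path` is a placeholder.
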
Contributor

gemini-code-assist Bot left a comment

Code Review

This pull request introduces a custom gzip_compression function to replace the default loguru 'gz' compression, specifically addressing an issue where file removal fails on FUSE-mounted filesystems with a 'Numerical result out of range' error. The feedback suggests refining the exception handling in this new function by importing the errno module and explicitly checking for errno.ERANGE instead of catching all OSError instances, which ensures that other legitimate filesystem errors are not masked.

Comment thread data_juicer/utils/logger_utils.py
Comment thread data_juicer/utils/logger_utils.py Outdated
Swallowing all OSError can mask real failures (permission denied, disk
full, etc.). Restrict the suppression to the specific errnos observed on
FUSE-mounted OSS/S3 (errno 34 / ERANGE, plus ENOSYS) and re-raise
anything else.
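
The restriction described in this commit message can be sketched as a small removal helper. The names `TOLERATED_ERRNOS` and `remove_tolerating_fuse`, and the injectable `remove` parameter, are hypothetical and exist only for illustration; the set of errnos is taken from the message above.

```python
import errno
import os

# Assumption: the errnos to tolerate, as listed in the commit message.
TOLERATED_ERRNOS = frozenset({errno.ERANGE, errno.ENOSYS})


def remove_tolerating_fuse(path, remove=os.remove):
    """Remove `path`, suppressing only the tolerated errnos.

    Returns True if the removal succeeded cleanly and False if a tolerated
    error was suppressed. Any other OSError is re-raised unchanged.
    """
    try:
        remove(path)
    except OSError as exc:
        if exc.errno not in TOLERATED_ERRNOS:
            raise  # e.g. permission denied or disk full must stay visible
        return False
    return True
```

Returning a flag rather than `None` lets the caller log or double-check the suppressed case, for example by testing `os.path.exists(path)` afterwards.
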
@cmgzn
Collaborator

cmgzn commented May 22, 2026

Thanks for the detailed fix and context.

I agree this is a real issue, but the root cause appears to be the FUSE/Fluid mount returning ERANGE from unlink. I’m hesitant to replace Loguru’s built-in compression="gz" globally in Data-Juicer for a filesystem-specific behavior, since that would diverge from Loguru’s native compression semantics.

On the Data-Juicer side, I’m a bit concerned that swallowing filesystem errors here (even if limited to ERANGE) isn’t the cleanest/most reliable long-term approach, since it’s hard to be confident we’re not hiding a real deletion failure in some cases.

I’d prefer to address this in the FUSE/Fluid layer if possible.
