@devin-ai-integration
Contributor

Adds python-magic-based MIME type detection from file contents, following the pattern used in https://github.com/vellum-ai/vellum/pull/17154.


Summary

When uploading files with generic MIME types (specifically application/octet-stream), the SDK now uses python-magic to detect the actual file type from the file contents. This enables proper file extension inference for files without explicit MIME type information.

Key changes:

  • Added python-magic==0.4.27 dependency
  • Modified ensure_filename_with_extension() to accept optional contents parameter (bytes or IO)
  • Added logic to detect MIME type using python-magic when MIME type is application/octet-stream
  • Updated call sites in upload.py to pass file contents for detection
  • Added comprehensive tests with mocking for the new functionality

Behavior:

  • Only activates when MIME type is exactly application/octet-stream AND contents are provided
  • Gracefully falls back to existing behavior if python-magic is not available (try/except import)
  • Strips charset parameters from detected MIME types (e.g., text/html; charset=utf-8 → text/html)
  • Preserves existing filename extensions (doesn't override if extension already exists)
  • Handles both bytes and IO objects, seeking back to original position for seekable streams
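
A minimal sketch of the behavior listed above, not the actual SDK code. The signature, the `detector` hook, the default name `file`, the 2048-byte sample size, and the use of the standard-library `mimetypes` module for the extension lookup are all assumptions made for illustration. The only python-magic call used is `magic.from_buffer(..., mime=True)`.

```python
import mimetypes
import os
from typing import IO, Callable, Optional, Union

GENERIC_MIME = "application/octet-stream"
Detector = Callable[[bytes], str]


def _load_magic_detector() -> Optional[Detector]:
    """Return a python-magic based detector, or None if it cannot be imported."""
    try:
        import magic  # ImportError if python-magic (or libmagic) is missing
    except ImportError:
        return None
    return lambda data: magic.from_buffer(data, mime=True)


def _read_sample(contents: Union[bytes, IO[bytes]], size: int = 2048) -> bytes:
    """Read up to `size` bytes, restoring the position of seekable streams."""
    if isinstance(contents, bytes):
        return contents[:size]
    if contents.seekable():
        start = contents.tell()
        try:
            return contents.read(size)
        finally:
            contents.seek(start)
    return contents.read(size)  # non-seekable: these bytes are consumed


def ensure_filename_with_extension(
    filename: Optional[str],
    mime_type: str,
    contents: Optional[Union[bytes, IO[bytes]]] = None,
    detector: Optional[Detector] = None,  # hypothetical hook, handy for tests
) -> str:
    name = filename or "file"  # hypothetical default name
    if os.path.splitext(name)[1]:
        return name  # an existing extension is never overridden

    # Content sniffing only runs for the generic type, and only with contents.
    if mime_type == GENERIC_MIME and contents is not None:
        detector = detector or _load_magic_detector()
        if detector is not None:
            detected = detector(_read_sample(contents))
            # Drop parameters such as "; charset=utf-8".
            mime_type = detected.split(";", 1)[0].strip() or mime_type

    if mime_type == GENERIC_MIME:
        return name + ".bin"
    return name + (mimetypes.guess_extension(mime_type) or ".bin")
```

Passing the detector in as a parameter is a design choice of this sketch only; it keeps the function testable on machines without libmagic.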

Review & Testing Checklist for Human

  • Verify libmagic is installed in CI/production environments - python-magic is just a Python wrapper around the system libmagic library. Without it, the detection will silently fail and fall back to .bin extension. Check that CI images and production environments have libmagic installed (e.g., libmagic1 on Debian/Ubuntu).

  • Test with real files - The tests use mocking, so they don't verify actual python-magic integration. Test uploading files with application/octet-stream MIME type (e.g., files without extensions) to verify detection works correctly in a real environment.

  • Review scope limitation - The implementation only uses python-magic for application/octet-stream MIME type. Consider whether this should be expanded to other generic MIME types or if this narrow scope is intentional.
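
For the first checklist item, a quick probe like the following could be run inside a CI image or container. It is only a rough sketch: it uses the standard-library `ctypes.util.find_library`, which may not search every location that python-magic itself checks, so a `None` result is a hint to investigate and not proof that detection will fail. The function name is hypothetical.

```python
import ctypes.util
from typing import Optional


def find_libmagic() -> Optional[str]:
    """Return the name or path of the system libmagic library, or None if not found."""
    return ctypes.util.find_library("magic")


if __name__ == "__main__":
    location = find_libmagic()
    if location is None:
        print("libmagic not found; detection would likely fall back to .bin")
    else:
        print(f"libmagic found: {location}")
```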

Test Plan

  1. Upload a file with application/octet-stream MIME type (e.g., a PDF without extension)
  2. Verify the file gets the correct extension (e.g., .pdf)
  3. Test in an environment without libmagic to verify graceful fallback to .bin
  4. Test with files that have charset in detected MIME type (e.g., HTML, text files)

Notes

  • The implementation matches the pattern from the vellum repo PR #17154
  • Tests use mocking to avoid dependency on libmagic being installed during test runs
  • Backward compatible - existing calls without contents parameter continue to work unchanged
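
To illustrate the mocking approach mentioned in the notes, here is a hedged sketch of how such tests might look. `sniff_mime` is a hypothetical stand-in for the SDK's detection step, and the PR's real tests may be structured differently. Replacing the `magic` entry in `sys.modules` lets the tests run whether or not libmagic is installed.

```python
import sys
from typing import Optional
from unittest.mock import MagicMock, patch


def sniff_mime(data: bytes) -> Optional[str]:
    """Hypothetical helper: detect a MIME type, or return None without python-magic."""
    try:
        import magic
    except ImportError:
        return None
    return magic.from_buffer(data, mime=True)


def test_detects_pdf_with_mocked_magic() -> None:
    fake_magic = MagicMock()
    fake_magic.from_buffer.return_value = "application/pdf"
    # While patched, `import magic` resolves to the fake module.
    with patch.dict(sys.modules, {"magic": fake_magic}):
        assert sniff_mime(b"%PDF-1.7") == "application/pdf"
    fake_magic.from_buffer.assert_called_once_with(b"%PDF-1.7", mime=True)


def test_falls_back_when_magic_is_missing() -> None:
    # A None entry in sys.modules makes `import magic` raise ImportError.
    with patch.dict(sys.modules, {"magic": None}):
        assert sniff_mime(b"%PDF-1.7") is None
```

As the checklist above points out, tests of this kind confirm the control flow only; they do not exercise the real libmagic integration.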

- Add python-magic==0.4.27 dependency to pyproject.toml
- Modify ensure_filename_with_extension to accept optional contents parameter
- Use python-magic to detect MIME type when provided MIME type is application/octet-stream
- Update call sites in upload.py to pass file contents
- Add comprehensive tests with mocking for python-magic functionality
- Handle cases where libmagic is not available with graceful fallback

Co-Authored-By: [email protected] <[email protected]>
@vellum-automation
Contributor

@codex review

@devin-ai-integration
Contributor Author

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR that start with 'DevinAI' or '@devin'.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines 1 to +5
"""Tests for file extension inference utilities."""

import pytest
from io import BytesIO
from unittest.mock import patch

P2 Badge Place new extension tests under collected test paths

Pytest is configured via pyproject.toml with testpaths = ["tests"], and the default make test just runs pytest, so modules outside that directory are skipped. This new test module lives under src/vellum/utils/files/tests, meaning none of the python-magic coverage will execute in CI. Consider moving it under tests/ or broadening testpaths so these tests actually run.

Contributor

@siddseethepalli siddseethepalli left a comment

Seems reasonable to me. We'll need to add the libmagic dep to the PWS Dockerfile for this to work, right?

@siddseethepalli
Contributor

@noanflaherty what do you plan on doing with this?

Contributor

@siddseethepalli siddseethepalli left a comment

This is probably ok, but if we just do this on the Django side, we can accomplish the same thing, right?

filename: Optional filename provided by the user
mime_type: The MIME type of the file (e.g., "application/pdf", "image/png"). This'll be used to infer the
extension if the filename lacks one.
contents: Optional file contents (bytes or file-like object) to use for MIME type detection via python-magic
Contributor

[q] Echoing Sidd's question, especially since this will require Dockerfile changes: do we need to use the inferred MIME type on the SDK side before we call the upload API?
