Bug: malformed RAG PDFs escape validation as unhandled 500s #975

@eshaanag

Summary

POST /api/v1/rag/ingest validates filename, content type, file count, and byte limits, but a malformed or password-protected file can still pass those checks and then trigger an exception in the PDF parser. The document-loader call is not wrapped in any exception mapping, so the parser error escapes as an unhandled HTTP 500 instead of a controlled client-facing response.

Evidence from current main

  • backend/app/api/v1/rag.py:83-113 accepts uploads based on the .pdf extension, MIME type, and size limits.
  • backend/app/api/v1/rag.py:125 calls load_documents_from_paths(saved_paths) outside an exception handler.
  • backend/app/modules/rag/document_loader.py:23-28 constructs PyPDFLoader and calls loader.load() without handling parser failures.
  • The endpoint only catches failures from create_vector_store() at backend/app/api/v1/rag.py:135-141.
  • backend/tests/test_rag_ingest.py covers non-PDF files, an empty loader result, and FAISS build failures, but not a loader/parser exception.

A request using filename broken.pdf, content type application/pdf, and invalid PDF bytes passes the endpoint validation before reaching PyPDFLoader.load(). Parser exceptions then bubble through FastAPI as an unhandled 500.
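The gap can be illustrated with a minimal standalone sketch. Everything here is hypothetical and not taken from the repository: `Upload`, `passes_upload_checks`, `MAX_UPLOAD_BYTES`, and `stand_in_parse` only mimic the shape of the endpoint checks and of a parser that rejects bytes lacking the `%PDF-` header.

```python
from dataclasses import dataclass

MAX_UPLOAD_BYTES = 10 * 1024 * 1024  # hypothetical limit, not taken from the repo


@dataclass
class Upload:
    filename: str
    content_type: str
    data: bytes


def passes_upload_checks(upload: Upload) -> bool:
    """Metadata-only checks: extension, MIME type, and size."""
    return (
        upload.filename.lower().endswith(".pdf")
        and upload.content_type == "application/pdf"
        and 0 < len(upload.data) <= MAX_UPLOAD_BYTES
    )


def stand_in_parse(data: bytes) -> str:
    """Stand-in for the real parser: rejects bytes without a PDF header."""
    if not data.startswith(b"%PDF-"):
        raise ValueError("not a PDF")
    return "parsed"


broken = Upload("broken.pdf", "application/pdf", b"definitely not a PDF")
print(passes_upload_checks(broken))  # metadata checks never inspect the bytes
try:
    stand_in_parse(broken.data)
except ValueError as exc:
    # In the endpoint, nothing catches the equivalent parser error.
    print(f"parser raised: {exc}")
```

None of the metadata checks look at the file contents, so the first component that actually reads the bytes is the parser.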

Expected behavior

Malformed, encrypted, or otherwise unreadable PDFs should receive a controlled 4xx response with a generic message. Parser internals should not be exposed, create_vector_store() should not run, and temporary files should still be cleaned up. Unexpected infrastructure failures should remain distinguishable from invalid client files.
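One possible shape for this behavior, as a framework-free sketch rather than the project's actual handler: `PdfParseError`, `ingest`, and the injected `load_documents` and `build_store` callables are all hypothetical names, and the `(status, message)` tuples stand in for whatever HTTP error type the endpoint raises. With pypdf, the caught class would presumably come from `pypdf.errors` (for example `PdfReadError`), but which exceptions actually surface through the loader is an assumption to verify.

```python
import logging
import tempfile
from pathlib import Path
from typing import Callable, Iterable, List, Tuple

logger = logging.getLogger("rag.ingest")


class PdfParseError(Exception):
    """Hypothetical stand-in for the parser's 'unreadable PDF' errors."""


def ingest(
    files: Iterable[Tuple[str, bytes]],
    load_documents: Callable[[List[Path]], list],
    build_store: Callable[[list], object],
) -> Tuple[int, str]:
    # The context manager removes the directory on every exit path,
    # including the early returns below.
    with tempfile.TemporaryDirectory() as tmp:
        paths = []
        for name, data in files:
            path = Path(tmp) / Path(name).name  # drop any client-supplied directories
            path.write_bytes(data)
            paths.append(path)
        try:
            docs = load_documents(paths)
        except PdfParseError:
            # Client-side problem: generic message, no parser details.
            logger.warning("Rejected unreadable PDF upload")
            return 400, "One or more files could not be read as a PDF."
        except Exception:
            # Anything else is treated as an infrastructure failure.
            logger.exception("Document loader failed unexpectedly")
            return 500, "Document ingestion failed."
        build_store(docs)
        return 200, "ok"
```

Catching the parser-specific class before the broad `except Exception` is what keeps invalid client files distinguishable from unexpected loader failures, and returning before `build_store` guarantees the vector store is never built from a failed load.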

Suggested acceptance criteria

  • Map known PDF parsing/read failures to a generic HTTP 400 response.
  • Do not expose raw parser exception details in the response.
  • Keep unexpected loader infrastructure failures logged and mapped separately.
  • Add a regression test where load_documents_from_paths() raises a parser-style exception and assert that vector-store creation is not called.
  • Preserve the existing temporary-directory cleanup behavior.
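The regression test in the criteria above could look roughly like the following sketch. It uses only `unittest.mock` and a hypothetical `ingest_paths` function with injected dependencies so that it runs standalone; `ParserError` is likewise a made-up stand-in. In the repository the test would more likely patch `load_documents_from_paths` and `create_vector_store` where the endpoint module imports them and call the route through a test client, which is an assumption about the existing test setup.

```python
from unittest.mock import Mock


class ParserError(Exception):
    """Hypothetical parser-style exception used only by this sketch."""


def ingest_paths(paths, load_documents_from_paths, create_vector_store):
    """Simplified stand-in for the endpoint body, with injected dependencies."""
    try:
        docs = load_documents_from_paths(paths)
    except ParserError:
        return 400, "One or more files could not be read as a PDF."
    create_vector_store(docs)
    return 200, "ok"


def test_parser_error_returns_400_and_skips_vector_store():
    loader = Mock(side_effect=ParserError("bad xref at offset 42"))
    store = Mock()

    status, detail = ingest_paths(["broken.pdf"], loader, store)

    assert status == 400
    assert "xref" not in detail       # raw parser details are not leaked
    loader.assert_called_once_with(["broken.pdf"])
    store.assert_not_called()         # vector-store creation never runs


test_parser_error_returns_400_and_skips_vector_store()
```

Asserting on `store.assert_not_called()` covers the "vector-store creation is not called" criterion directly, while the check on `detail` guards against parser messages leaking into the response.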

Duplicate check

Searched existing issues for RAG ingest malformed PDF, RAG ingest invalid PDF, corrupt PDF RAG, password protected PDF RAG, and parser error ingest; no matching issue was found.

This is an intermediate backend/RAG reliability task raised for GSSoC 2026. Please add or adjust the program labels if appropriate.
