File upload support: PDF, docx, markdown for admin and participant inputs #5

@Par-t

Problem

Plain text is currently the only supported input for both the admin flow (hackathon setup) and the participant flow (idea submission). Real-world usage requires document uploads: hackathon rules as PDFs, ideas written in docx or markdown, and so on.

Current state

  • Admin: text chat with DeepSeek via /init endpoint
  • Participant: idea_text: str field in submission JSON
  • No file upload endpoint exists
  • Frontend will have a file input prop ready (non-functional until backend support lands)

Proposed solution

Participant side (ingestion/normalization node)

  • Accept file uploads (PDF, docx, markdown) as base64 in submission payload
  • Ingestion node extracts text based on file type → conditionally summarizes if too long
  • Normalized text feeds into embeddings + scoring
  • Tools: extract_pdf, extract_docx, parse_markdown — agent picks the right one based on input
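
A minimal sketch of the participant-side flow above, under some assumptions: the dispatch table is a deterministic stand-in for the agent's tool choice, `MAX_CHARS` and `normalize_submission` are hypothetical names, and the summarizer is passed in as a plain callable because the issue does not say how summarization is done. pdfplumber and python-docx are imported lazily so the markdown path runs without them.

```python
import base64
import io
import re
from typing import Callable

MAX_CHARS = 4000  # assumed threshold; the issue does not specify one


def extract_pdf(data: bytes) -> str:
    import pdfplumber  # third-party, imported lazily

    with pdfplumber.open(io.BytesIO(data)) as pdf:
        # extract_text() may return None for pages without a text layer
        return "\n".join((page.extract_text() or "") for page in pdf.pages)


def extract_docx(data: bytes) -> str:
    import docx  # provided by the python-docx package

    document = docx.Document(io.BytesIO(data))
    return "\n".join(p.text for p in document.paragraphs)


def parse_markdown(data: bytes) -> str:
    text = data.decode("utf-8")
    text = re.sub(r"^#{1,6}\s*", "", text, flags=re.MULTILINE)  # heading markers
    text = re.sub(r"!?\[([^\]]*)\]\([^)]*\)", r"\1", text)  # links/images -> label
    text = re.sub(r"[*`]+", "", text)  # emphasis asterisks and code ticks
    return text.strip()


EXTRACTORS = {"pdf": extract_pdf, "docx": extract_docx, "markdown": parse_markdown}


def normalize_submission(
    file_b64: str, file_type: str, summarize: Callable[[str], str]
) -> str:
    """Decode the upload, extract its text, and summarize only if it is long."""
    extractor = EXTRACTORS.get(file_type.lower())
    if extractor is None:
        raise ValueError(f"unsupported file type: {file_type!r}")
    data = base64.b64decode(file_b64, validate=True)  # raises on malformed base64
    text = extractor(data).strip()
    if not text:
        raise ValueError("no text could be extracted from the file")
    return summarize(text) if len(text) > MAX_CHARS else text
```

In the real node the agent would choose among the three tools itself; the table only shows what each choice resolves to.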

Admin side (init handler)

  • Accept file upload (hackathon rules/guidelines document) in /init payload
  • Quick text extraction → keyword scan for criteria/guidelines/theme sections
  • Send relevant sections to LLM for structured config extraction
  • If document is missing required info, ask admin for clarification in follow-up turn
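
The keyword scan could look roughly like the sketch below. It assumes extracted text where section headings sit on their own short lines; `looks_like_heading`, `find_relevant_sections`, and `missing_sections` are hypothetical helpers, and the heading heuristic is deliberately crude. Only the keyword list (criteria, guidelines, theme) comes from this issue.

```python
from typing import Dict, List, Optional, Sequence

SECTION_KEYWORDS = ("criteria", "guidelines", "theme")


def looks_like_heading(line: str) -> bool:
    """Heuristic: a short line that is not a bullet and does not end a sentence."""
    s = line.strip()
    if not s or len(s) > 60:
        return False
    if s[0] in "-*•":
        return False
    return not s.endswith((".", ",", ";", "!", "?"))


def find_relevant_sections(
    text: str, keywords: Sequence[str] = SECTION_KEYWORDS
) -> Dict[str, str]:
    """Collect the body lines under each heading that mentions a keyword."""
    sections: Dict[str, List[str]] = {}
    current: Optional[str] = None
    for line in text.splitlines():
        if looks_like_heading(line):
            lowered = line.lower()
            # A heading without any keyword closes the current section
            current = next((k for k in keywords if k in lowered), None)
            if current is not None:
                sections.setdefault(current, [])
            continue
        if current is not None and line.strip():
            sections[current].append(line.strip())
    return {k: "\n".join(v) for k, v in sections.items()}


def missing_sections(
    found: Dict[str, str], keywords: Sequence[str] = SECTION_KEYWORDS
) -> List[str]:
    """Keywords with no content; candidates for a follow-up question to the admin."""
    return [k for k in keywords if not found.get(k)]
```

Only the sections returned by `find_relevant_sections` would be sent to the LLM, and a non-empty `missing_sections` result would trigger the clarification turn.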

Libraries

  • PDF: pdfplumber
  • Docx: python-docx
  • Markdown: regex strip or mistune

Scope

  • Add idea_file + idea_file_type optional fields to submission model
  • Add file + file_type optional fields to init request
  • Extraction tools as agent-callable functions inside TEE
  • Conditional summarization for long extracted text
  • Error handling for malformed/empty files
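
As a rough illustration of the model changes and the malformed/empty checks, here is a sketch using a plain dataclass. The project's actual model class is not shown in this issue, so this is a stand-in; it also assumes that a submission may carry either text or a file, and that `pdf`, `docx`, and `markdown` are the accepted `idea_file_type` values.

```python
import base64
import binascii
from dataclasses import dataclass
from typing import Optional

ALLOWED_FILE_TYPES = {"pdf", "docx", "markdown"}  # assumed spelling of the type values


@dataclass
class Submission:
    idea_text: str = ""
    idea_file: Optional[str] = None  # base64-encoded file content
    idea_file_type: Optional[str] = None

    def __post_init__(self) -> None:
        # The two file fields only make sense as a pair
        if (self.idea_file is None) != (self.idea_file_type is None):
            raise ValueError("idea_file and idea_file_type must be provided together")
        if self.idea_file_type is not None:
            if self.idea_file_type.lower() not in ALLOWED_FILE_TYPES:
                raise ValueError(f"unsupported idea_file_type: {self.idea_file_type!r}")
        if self.idea_file is not None:
            try:
                raw = base64.b64decode(self.idea_file, validate=True)
            except binascii.Error as exc:
                raise ValueError("idea_file is not valid base64") from exc
            if not raw:
                raise ValueError("idea_file is empty")
        # Assumption: at least one of text or file must be present
        if not self.idea_text and self.idea_file is None:
            raise ValueError("submission needs idea_text or idea_file")
```

The same pairing and base64 checks would apply to the `file` and `file_type` fields on the init request.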

Context

Part of making the pipeline genuinely agentic — the ingestion node makes non-deterministic tool call decisions based on input format and content length. Also unblocks real-world usage where participants have their ideas in documents rather than raw text.
