Commit 2664aee

feat(api): add file_processor API skeleton

This change adds a file_processor API skeleton that provides a foundation for converting files into structured content for vector store ingestion, with support for chunking strategies and optional embedding generation.

Signed-off-by: Alina Ryan <[email protected]>

1 parent 6147321, commit 2664aee

File tree

21 files changed: +258 −0 lines changed

Lines changed: 10 additions & 0 deletions (new file)

```markdown
---
sidebar_label: File Processor
title: File_Processor
---

# File_Processor

## Overview

This section contains documentation for all available providers for the **file_processor** API.
```
Lines changed: 17 additions & 0 deletions (new file)

````markdown
---
description: "Reference file processor implementation (placeholder for development)"
sidebar_label: Reference
title: inline::reference
---

# inline::reference

## Description

Reference file processor implementation (placeholder for development)

## Sample Configuration

```yaml
{}
```
````

src/llama_stack/apis/datatypes.py

Lines changed: 1 addition & 0 deletions

```diff
@@ -127,6 +127,7 @@ class Api(Enum, metaclass=DynamicApiMeta):
     files = "files"
     prompts = "prompts"
     conversations = "conversations"
+    file_processor = "file_processor"

     # built-in API
     inspect = "inspect"
```
Lines changed: 7 additions & 0 deletions (new file)

```python
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.

from .file_processor import *
```
Lines changed: 96 additions & 0 deletions (new file)

```python
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.

from typing import Any, Protocol, runtime_checkable

from pydantic import BaseModel

from llama_stack.apis.common.tracing import telemetry_traceable
from llama_stack.apis.vector_io.vector_io import Chunk, VectorStoreChunkingStrategy
from llama_stack.apis.version import LLAMA_STACK_API_V1ALPHA
from llama_stack.schema_utils import json_schema_type, webmethod


@json_schema_type
class ProcessFileRequest(BaseModel):
    """Request for processing a file into structured content."""

    file_data: bytes
    """Raw file data to process."""

    filename: str
    """Original filename for format detection and processing hints."""

    options: dict[str, Any] | None = None
    """Optional processing options. Provider-specific parameters."""

    chunking_strategy: VectorStoreChunkingStrategy | None = None
    """Optional chunking strategy for splitting content into chunks."""

    include_embeddings: bool = False
    """Whether to generate embeddings for chunks."""


@json_schema_type
class ProcessedContent(BaseModel):
    """Result of file processing operation."""

    content: str
    """Extracted text content from the file."""

    chunks: list[Chunk] | None = None
    """Optional chunks if chunking strategy was provided."""

    embeddings: list[list[float]] | None = None
    """Optional embeddings for chunks if requested."""

    metadata: dict[str, Any]
    """Processing metadata including processor name, timing, and provider-specific data."""


@telemetry_traceable
@runtime_checkable
class FileProcessor(Protocol):
    """
    File Processor API for converting files into structured, processable content.

    This API provides a flexible interface for processing various file formats
    (PDFs, documents, images, etc.) into text content that can be used for
    vector store ingestion, RAG applications, or standalone content extraction.

    The API supports:
    - Multiple file formats through an extensible provider architecture
    - Configurable processing options per provider
    - Integration with vector store chunking strategies
    - Optional embedding generation for chunks
    - Rich metadata about processing results

    Future providers can extend this interface to support additional formats,
    processing capabilities, and optimization strategies.
    """

    @webmethod(route="/file-processor/process", method="POST", level=LLAMA_STACK_API_V1ALPHA)
    async def process_file(
        self,
        file_data: bytes,
        filename: str,
        options: dict[str, Any] | None = None,
        chunking_strategy: VectorStoreChunkingStrategy | None = None,
        include_embeddings: bool = False,
    ) -> ProcessedContent:
        """
        Process a file into structured content with optional chunking and embeddings.

        This method processes raw file data and converts it into text content for applications such as vector store ingestion.

        :param file_data: Raw bytes of the file to process.
        :param filename: Original filename for format detection.
        :param options: Provider-specific processing options (e.g., OCR settings, output format).
        :param chunking_strategy: Optional strategy for splitting content into chunks.
        :param include_embeddings: Whether to generate embeddings for chunks.
        :returns: ProcessedContent with extracted text, optional chunks, and metadata.
        """
        ...
```

src/llama_stack/core/resolver.py

Lines changed: 2 additions & 0 deletions

```diff
@@ -16,6 +16,7 @@
 from llama_stack.apis.datasets import Datasets
 from llama_stack.apis.datatypes import ExternalApiSpec
 from llama_stack.apis.eval import Eval
+from llama_stack.apis.file_processor import FileProcessor
 from llama_stack.apis.files import Files
 from llama_stack.apis.inference import Inference, InferenceProvider
 from llama_stack.apis.inspect import Inspect
@@ -96,6 +97,7 @@ def api_protocol_map(external_apis: dict[Api, ExternalApiSpec] | None = None) ->
         Api.files: Files,
         Api.prompts: Prompts,
         Api.conversations: Conversations,
+        Api.file_processor: FileProcessor,
     }

     if external_apis:
```

src/llama_stack/distributions/ci-tests/build.yaml

Lines changed: 2 additions & 0 deletions

```diff
@@ -29,6 +29,8 @@ distribution_spec:
   - provider_type: remote::weaviate
   files:
   - provider_type: inline::localfs
+  file_processor:
+  - provider_type: inline::reference
   safety:
   - provider_type: inline::llama-guard
   - provider_type: inline::code-scanner
```

src/llama_stack/distributions/ci-tests/run.yaml

Lines changed: 4 additions & 0 deletions

```diff
@@ -5,6 +5,7 @@ apis:
 - batches
 - datasetio
 - eval
+- file_processor
 - files
 - inference
 - post_training
@@ -154,6 +155,9 @@ providers:
     metadata_store:
       table_name: files_metadata
       backend: sql_default
+  file_processor:
+  - provider_id: reference
+    provider_type: inline::reference
   safety:
   - provider_id: llama-guard
     provider_type: inline::llama-guard
```

src/llama_stack/distributions/starter-gpu/build.yaml

Lines changed: 2 additions & 0 deletions

```diff
@@ -30,6 +30,8 @@ distribution_spec:
   - provider_type: remote::weaviate
   files:
   - provider_type: inline::localfs
+  file_processor:
+  - provider_type: inline::reference
   safety:
   - provider_type: inline::llama-guard
   - provider_type: inline::code-scanner
```

src/llama_stack/distributions/starter-gpu/run-with-postgres-store.yaml

Lines changed: 4 additions & 0 deletions

```diff
@@ -5,6 +5,7 @@ apis:
 - batches
 - datasetio
 - eval
+- file_processor
 - files
 - inference
 - post_training
@@ -154,6 +155,9 @@ providers:
     metadata_store:
      table_name: files_metadata
      backend: sql_default
+  file_processor:
+  - provider_id: reference
+    provider_type: inline::reference
   safety:
   - provider_id: llama-guard
     provider_type: inline::llama-guard
```
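With the API enabled in a distribution, `process_file` is exposed as a POST route at `/file-processor/process` (v1alpha level, per the `@webmethod` decorator). The following is a hypothetical sketch of the client-side request body; the server URL, version prefix, and base64 encoding of `file_data` are assumptions, while the field names come from the `process_file` signature.

```python
# Hypothetical request sketch; URL prefix and byte encoding are assumptions.
import base64
import json

payload = {
    "file_data": base64.b64encode(b"hello world").decode("ascii"),
    "filename": "notes.txt",
    "options": None,
    "chunking_strategy": None,
    "include_embeddings": False,
}
body = json.dumps(payload)

# Assumed endpoint shape for a locally running stack:
url = "http://localhost:8321/v1alpha/file-processor/process"
print(url)
```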
