Part of the GPT-RAG solution.
The GPT-RAG Data Ingestion service automates the processing of diverse document types (PDFs, images, spreadsheets, transcripts, and SharePoint files) and prepares them for indexing in Azure AI Search. It uses intelligent chunking strategies tailored to each format, generates text and image embeddings, and enables rich, multimodal retrieval experiences for agent-based RAG applications.
The service performs the following steps:
- Scan sources: Detects new or updated content in configured sources
- Process content: Chunks and enriches data for retrieval
- Index documents: Writes processed chunks into Azure AI Search
- Schedule execution: Runs on a CRON-based scheduler defined by environment variables
Supported data sources:
- Blob Storage
- NL2SQL Metadata
- SharePoint
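The scan step described above (detecting new or updated content) can be illustrated with a minimal sketch. It assumes each source item exposes a last-modified timestamp and that the service keeps a record of when each item was last indexed. The function name `find_changed` and both dictionaries are hypothetical and are not taken from the service's code.

```python
from datetime import datetime, timezone


def find_changed(source_items: dict[str, datetime],
                 indexed_at: dict[str, datetime]) -> list[str]:
    """Return the names of items that are new or were modified after their last indexing."""
    changed = []
    for name, last_modified in source_items.items():
        previous = indexed_at.get(name)
        # An item qualifies if it was never indexed or changed since the last run.
        if previous is None or last_modified > previous:
            changed.append(name)
    return sorted(changed)


# Hypothetical snapshot of a source and of the indexing record.
source_items = {
    "report.pdf": datetime(2024, 5, 2, tzinfo=timezone.utc),
    "notes.md": datetime(2024, 5, 1, tzinfo=timezone.utc),
    "call.vtt": datetime(2024, 5, 3, tzinfo=timezone.utc),
}
indexed_at = {
    "report.pdf": datetime(2024, 5, 1, tzinfo=timezone.utc),  # older than the source: updated
    "notes.md": datetime(2024, 5, 1, tzinfo=timezone.utc),    # same timestamp: unchanged
}

print(find_changed(source_items, indexed_at))  # ['call.vtt', 'report.pdf']
```

Only the changed items would then move on to the process and index steps, which keeps scheduled runs incremental.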
The ingestion service selects a chunker based on the file extension, ensuring each document is processed with the most suitable method.
- .pdf files: Processed by the DocAnalysisChunker using the Document Intelligence API. Structured elements such as tables and sections are extracted and converted into Markdown, then segmented with LangChain splitters. When Document Intelligence API 4.0 is enabled, .docx and .pptx files are handled the same way.
- Image files (.bmp, .png, .jpeg, .tiff): The DocAnalysisChunker applies OCR to extract text before chunking.
- Text-based files (.txt, .md, .json, .csv): Processed by the LangChainChunker, which splits content into paragraphs or sections.
- Specialized formats:
  - .vtt (video transcripts): Handled by the TranscriptionChunker, which splits content by time codes.
  - .xlsx (spreadsheets): Processed by the SpreadsheetChunker, chunked by rows or sheets.
Before deploying the application, you must provision the infrastructure as described in the GPT-RAG repo. This includes creating the Azure resources that the application needs at runtime.
The machine used to customize or deploy the service should have:
- Azure CLI: Install Azure CLI
- Azure Developer CLI (optional, if using azd): Install azd
- Git: Download Git
- Python 3.12: Download Python 3.12
- Docker CLI: Install Docker
- VS Code (recommended): Download VS Code
To customize the service, your user should have the following roles:
Resource | Role | Description |
---|---|---|
App Configuration Store | App Configuration Data Owner | Full control over configuration settings |
Container Registry | AcrPush | Push and pull container images |
AI Search Service | Search Index Data Contributor | Read and write index data |
Storage Account | Storage Blob Data Contributor | Read and write blob data |
Cosmos DB | Cosmos DB Built-in Data Contributor | Read and write documents in Cosmos DB |
To deploy the service, assign these roles to your user or service principal:
Resource | Role | Description |
---|---|---|
App Configuration Store | App Configuration Data Reader | Read config |
Container Registry | AcrPush | Push images |
Azure Container App | Azure Container Apps Contributor | Manage Container Apps |
Ensure the deployment identity has these roles at the correct scope (subscription or resource group).
Make sure you're logged in to Azure before anything else:
az login
Initialize the template:
azd init -t azure/gpt-rag-ingestion
Important: Use the same environment name with azd init as in the infrastructure deployment to keep components consistent.
Refresh the environment variables, then deploy:
azd env refresh
azd deploy
Important: Run azd env refresh with the same subscription and resource group used in the infrastructure deployment.
To deploy using a script, first clone the repository, set the App Configuration endpoint, and then run the deployment script.
PowerShell (Windows):
git clone https://github.com/Azure/gpt-rag-ingestion.git
$env:APP_CONFIG_ENDPOINT = "https://<your-app-config-name>.azconfig.io"
cd gpt-rag-ingestion
.\scripts\deploy.ps1
Bash (Linux/macOS):
git clone https://github.com/Azure/gpt-rag-ingestion.git
export APP_CONFIG_ENDPOINT="https://<your-app-config-name>.azconfig.io"
cd gpt-rag-ingestion
./scripts/deploy.sh
Note: For earlier versions, use the corresponding release in the GitHub repository (e.g., v1.0.0 for the initial version).
We appreciate contributions! See CONTRIBUTING for guidelines on submitting pull requests.
This project may contain trademarks or logos. Authorized use of Microsoft trademarks or logos must follow Microsoft’s Trademark & Brand Guidelines. Modified versions must not imply sponsorship or cause confusion. Third-party trademarks are subject to their own policies.