
GPT-RAG Data Ingestion

Part of the GPT-RAG solution.

The GPT-RAG Data Ingestion service automates the processing of diverse document types—such as PDFs, images, spreadsheets, transcripts, and SharePoint files—preparing them for indexing in Azure AI Search. It uses intelligent chunking strategies tailored to each format, generates text and image embeddings, and enables rich, multimodal retrieval experiences for agent-based RAG applications.

How data ingestion works

The service performs the following steps:

  • Scan sources: Detects new or updated content in configured sources
  • Process content: Chunks and enriches data for retrieval
  • Index documents: Writes processed chunks into Azure AI Search
  • Schedule execution: Runs on a CRON-based scheduler defined by environment variables
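The steps above can be sketched as a single scheduled run. This is a minimal illustration, not the service's actual code: function names like `scan_sources`, `chunk_document`, and `run_ingestion`, the fixed-size chunking, and the `CRON_RUN_SCHEDULE` variable name are all assumptions for clarity.

```python
import os
from dataclasses import dataclass

@dataclass
class Chunk:
    source: str
    content: str

def scan_sources(sources):
    """Detect documents in each configured source (real scans track new/updated items)."""
    for source in sources:
        for doc in source.get("documents", []):
            yield source["name"], doc

def chunk_document(source_name, text, max_chars=200):
    """Split a document into fixed-size chunks; the real chunkers are format-aware."""
    return [Chunk(source_name, text[i:i + max_chars])
            for i in range(0, len(text), max_chars)]

def run_ingestion(sources, index):
    """One scheduled run: scan, chunk, and write the chunks into the index."""
    for name, doc in scan_sources(sources):
        index.extend(chunk_document(name, doc))
    return index

# The schedule itself would come from an environment variable
# (variable name here is hypothetical):
schedule = os.environ.get("CRON_RUN_SCHEDULE", "0 * * * *")
```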

Supported data sources

Supported formats and chunkers

The ingestion service selects a chunker based on the file extension, ensuring each document is processed with the most suitable method.

  • .pdf files — Processed by the DocAnalysisChunker using the Document Intelligence API. Structured elements such as tables and sections are extracted and converted into Markdown, then segmented with LangChain splitters. When Document Intelligence API 4.0 is enabled, .docx and .pptx files are handled the same way.

  • Image files (.bmp, .png, .jpeg, .tiff) — The DocAnalysisChunker applies OCR to extract text before chunking.

  • Text-based files (.txt, .md, .json, .csv) — Processed by the LangChainChunker, which splits content into paragraphs or sections.

  • Specialized formats:

    • .vtt (video transcripts) — Handled by the TranscriptionChunker, which splits content by time codes.
    • .xlsx (spreadsheets) — Processed by the SpreadsheetChunker, chunked by rows or sheets.
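The extension-to-chunker mapping above can be summarized as a small dispatch table. The chunker class names follow the list, but the dispatch code itself is an illustrative sketch, and the fallback to `LangChainChunker` for unknown extensions is an assumption.

```python
import os

# Map file extensions to the chunker described for each format.
CHUNKER_BY_EXTENSION = {
    ".pdf": "DocAnalysisChunker",
    ".bmp": "DocAnalysisChunker", ".png": "DocAnalysisChunker",
    ".jpeg": "DocAnalysisChunker", ".tiff": "DocAnalysisChunker",
    ".txt": "LangChainChunker", ".md": "LangChainChunker",
    ".json": "LangChainChunker", ".csv": "LangChainChunker",
    ".vtt": "TranscriptionChunker",
    ".xlsx": "SpreadsheetChunker",
}

def select_chunker(filename, docint_v4=False):
    """Pick a chunker name from the file extension."""
    ext = os.path.splitext(filename)[1].lower()
    # With Document Intelligence API 4.0 enabled, .docx and .pptx
    # take the same path as .pdf.
    if docint_v4 and ext in (".docx", ".pptx"):
        return "DocAnalysisChunker"
    return CHUNKER_BY_EXTENSION.get(ext, "LangChainChunker")  # assumed fallback
```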

How to deploy the data ingestion service

Prerequisites

Before deploying the application, you must provision the infrastructure as described in the GPT-RAG repo. This includes creating all necessary Azure resources required to support the application runtime.

Click to view software prerequisites
The machine used to customize and/or deploy the service should have:
Click to view permissions requirements
To customize the service, your user should have the following roles:
| Resource | Role | Description |
| --- | --- | --- |
| App Configuration Store | App Configuration Data Owner | Full control over configuration settings |
| Container Registry | AcrPush | Push and pull container images |
| AI Search Service | Search Index Data Contributor | Read and write index data |
| Storage Account | Storage Blob Data Contributor | Read and write blob data |
| Cosmos DB | Cosmos DB Built-in Data Contributor | Read and write documents in Cosmos DB |

To deploy the service, assign these roles to your user or service principal:

| Resource | Role | Description |
| --- | --- | --- |
| App Configuration Store | App Configuration Data Reader | Read configuration settings |
| Container Registry | AcrPush | Push container images |
| Azure Container App | Azure Container Apps Contributor | Manage Container Apps |

Ensure the deployment identity has these roles at the correct scope (subscription or resource group).

Deployment steps

Make sure you're logged in to Azure before anything else:

```shell
az login
```

Deploying the app with azd (recommended)

Initialize the template:

```shell
azd init -t azure/gpt-rag-ingestion
```

Important

Use the same environment name with azd init as in the infrastructure deployment to keep components consistent.

Update environment variables, then deploy:

```shell
azd env refresh
azd deploy
```

Important

Run azd env refresh with the same subscription and resource group used in the infrastructure deployment.

Deploying the app with a shell script

To deploy using a script, first clone the repository, set the App Configuration endpoint, and then run the deployment script.

PowerShell (Windows):

```powershell
git clone https://github.com/Azure/gpt-rag-ingestion.git
$env:APP_CONFIG_ENDPOINT = "https://<your-app-config-name>.azconfig.io"
cd gpt-rag-ingestion
.\scripts\deploy.ps1
```

Bash (Linux/macOS):

```bash
git clone https://github.com/Azure/gpt-rag-ingestion.git
export APP_CONFIG_ENDPOINT="https://<your-app-config-name>.azconfig.io"
cd gpt-rag-ingestion
./scripts/deploy.sh
```

Previous Releases

Note

For earlier versions, use the corresponding release in the GitHub repository (e.g., v1.0.0 for the initial version).

🤝 Contributing

We appreciate contributions! See CONTRIBUTING for guidelines on submitting pull requests.

Trademarks

This project may contain trademarks or logos. Authorized use of Microsoft trademarks or logos must follow Microsoft’s Trademark & Brand Guidelines. Modified versions must not imply sponsorship or cause confusion. Third-party trademarks are subject to their own policies.
