The RAG (Retrieval-Augmented Generation) System provides high-quality, contextually relevant responses to user queries by combining multiple information retrieval methods with large language models (LLMs). It is designed as a production-grade, scalable solution for enterprise knowledge management and question answering.
- Information Accuracy: Ensure responses are grounded in factual information from trusted sources
- Contextual Understanding: Maintain context across multiple retrieval techniques
- Performance and Scalability: Support high throughput with low latency
- Security and Compliance: Protect sensitive information and adhere to enterprise standards
- Observability: Provide comprehensive monitoring and evaluation metrics
- Extensibility: Support multiple data modalities (text, images, audio, video)
The Enhanced RAG System follows a microservices-based architecture with the following major components:
- API Layer: REST endpoints for user interaction and administration
- Core RAG Engine: Processing pipeline for retrieval and generation
- Document Processing: Multi-modal content processing and storage
- Storage Layer: Vector database and graph database for different retrieval methods
- Evaluation System: Continuous evaluation and feedback incorporation
- Monitoring and Observability: Metrics, logging, and tracing infrastructure
The API layer provides RESTful endpoints for interacting with the RAG system:
- RagApiController: Main controller for query processing and document ingestion
  - `/api/v1/query`: Process queries and generate responses
  - `/api/v1/query/stream`: Stream responses in real time
  - `/api/v1/ingest`: Ingest documents into the system
  - `/api/v1/feedback/{queryId}`: Submit user feedback
  - `/api/v1/health`: Health check endpoint
- AdminController: Administrative operations
  - `/api/v1/admin/stats`: System statistics
  - `/api/v1/admin/reindex`: Trigger reindexing
  - `/api/v1/admin/evaluation/metrics`: Get evaluation metrics
- Security Components:
  - API key authentication
  - Role-based access control
  - Rate limiting
Orchestrates the overall RAG workflow:
- Processes user queries
- Coordinates retrieval strategies
- Generates responses using LLMs
- Handles streaming responses
Advanced retrieval system with multiple strategies:
- VectorStoreFactory: Manages different vector stores
- EnhancedWeaviateVectorStore: Optimized vector similarity search
- EnhancedNeo4jGraphRepository: Graph-based context retrieval
- RetrievalStrategy: Interface for different retrieval methods
- HybridStrategy: Combines vector and graph retrieval
- HydeStrategy: Hypothetical Document Embeddings (HyDE)
- DecompositionStrategy: Query decomposition
- AdvancedStrategy: Comprehensive strategy using all techniques
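The hybrid strategy can be pictured as a weighted merge of the two ranked result lists. The sketch below is illustrative only; the class and method names are assumptions, not the actual implementation.

```java
import java.util.*;
import java.util.stream.*;

// Illustrative merge of vector-search and graph-retrieval results:
// weight the two score sources, de-duplicate by document id, and keep
// the top results.
public class HybridMerge {
    public record Scored(String docId, double score) {}

    public static List<Scored> merge(List<Scored> vector, List<Scored> graph,
                                     double vectorWeight, int maxResults) {
        Map<String, Double> combined = new HashMap<>();
        // A document found by both retrievers accumulates both weighted scores.
        vector.forEach(s -> combined.merge(s.docId(), vectorWeight * s.score(), Double::sum));
        graph.forEach(s -> combined.merge(s.docId(), (1 - vectorWeight) * s.score(), Double::sum));
        return combined.entrySet().stream()
                .map(e -> new Scored(e.getKey(), e.getValue()))
                .sorted(Comparator.comparingDouble(Scored::score).reversed())
                .limit(maxResults)
                .collect(Collectors.toList());
    }
}
```

Documents surfaced by both retrievers rank higher than either source alone would place them, which is the main benefit of the hybrid approach.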
Integrates with OpenAI models through Spring AI:
- Query processing
- Response generation
- Entity extraction
- Document analysis
Processes text-based documents:
- Document chunking with configurable strategies
- Semantic chunking
- Section-aware splitting
- Metadata extraction
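The simplest configurable strategy is a sliding window with overlap, matching the `chunkSize`/`chunkOverlap` options used elsewhere in this document. This is a character-based sketch for illustration; the real pipeline may chunk on tokens or sentence boundaries.

```java
import java.util.ArrayList;
import java.util.List;

// Sliding-window chunker: each chunk starts (chunkSize - overlap)
// characters after the previous one, so consecutive chunks share
// `overlap` characters of context.
public class Chunker {
    public static List<String> chunk(String text, int chunkSize, int overlap) {
        if (overlap >= chunkSize) throw new IllegalArgumentException("overlap must be < chunkSize");
        List<String> chunks = new ArrayList<>();
        int step = chunkSize - overlap;
        for (int start = 0; start < text.length(); start += step) {
            chunks.add(text.substring(start, Math.min(start + chunkSize, text.length())));
            if (start + chunkSize >= text.length()) break; // last chunk reached end of text
        }
        return chunks;
    }
}
```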
Handles non-text content:
- Image processing and description
- Audio transcription
- Video processing
Extracts entities and relationships from documents:
- Named entity recognition
- Relationship extraction
- Topic identification
- Caching for performance
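Because extraction typically involves an LLM call, caching by chunk content avoids re-extracting identical text. A minimal memoization sketch (the extractor function here is a stand-in for the real entity extractor):

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Caches extraction results keyed by chunk text, so repeated chunks
// skip the expensive extractor call.
public class CachedExtractor {
    private final Map<String, List<String>> cache = new ConcurrentHashMap<>();
    private final Function<String, List<String>> extractor;
    private int misses = 0;

    public CachedExtractor(Function<String, List<String>> extractor) {
        this.extractor = extractor;
    }

    public List<String> extractEntities(String chunk) {
        // computeIfAbsent only invokes the extractor on a cache miss.
        return cache.computeIfAbsent(chunk, c -> { misses++; return extractor.apply(c); });
    }

    public int misses() { return misses; }
}
```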
Stores document embeddings for similarity search:
- High-performance vector operations
- Efficient batch processing
- Configurable similarity algorithms
- Metadata filtering
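Similarity search ultimately reduces to a distance metric over embeddings; cosine similarity is the usual choice. Shown here as a plain-Java sketch for clarity (in practice Weaviate computes this server-side over its index):

```java
// Cosine similarity between two embedding vectors: 1.0 for identical
// directions, 0.0 for orthogonal vectors.
public class Similarity {
    public static double cosine(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```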
Stores knowledge graph for semantic relationships:
- Document nodes
- Entity nodes
- Typed relationships
- Graph traversal capabilities
Stores system metadata:
- Evaluation results
- User feedback
- System statistics
Evaluates RAG performance:
- Multiple evaluation metrics (relevance, completeness, faithfulness)
- Asynchronous evaluation
- Human feedback incorporation
Persists evaluation results:
- Multiple storage backends (database, file, in-memory)
- Historical data retention
- Statistical aggregation
Collects performance metrics:
- Query processing times
- Document processing statistics
- Retrieval effectiveness
- Resource utilization
- Prometheus for metrics collection
- Grafana for visualization
- Loki for log aggregation
- Tempo for distributed tracing
- Documents are uploaded via the API
- Multi-modal processing identifies document type and extracts content
- Content is processed and split into chunks
- Entity extraction identifies entities and relationships
- Document embeddings are generated and stored in vector database
- Knowledge graph is updated with document and entity nodes
- User submits query via API
- Query is analyzed and potentially transformed or decomposed
- Retrieval pipeline fetches relevant context:
- Vector database provides similarity matches
- Graph database provides semantic relationships
- Context is assembled and formatted
- LLM generates response based on context and query
- Response is returned to user (standard or streaming)
- (Optional) Response is evaluated for quality
- RAG response is generated
- Asynchronous evaluation assesses quality metrics
- Results are stored for monitoring and improvement
- User feedback is incorporated into evaluation data
- Programming Language: Java 17
- Framework: Spring Boot 3.2.x
- AI Integration: Spring AI
- Build System: Gradle
- Vector Database: Weaviate
- Graph Database: Neo4j
- Relational Database: PostgreSQL
- Caching: Redis + Caffeine
- Text Embedding: OpenAI text-embedding-ada-002
- Chat Completion: OpenAI GPT-4 or equivalent
- Vision Analysis: OpenAI Vision API
- Metrics: Prometheus + Micrometer
- Visualization: Grafana
- Logging: Logback + Loki
- Tracing: OpenTelemetry + Tempo
- Containerization: Docker
- Orchestration: Kubernetes
- CI/CD: GitHub Actions
- Load Balancing: NGINX or Kubernetes Ingress
- SSL Termination: Cert-Manager
The system is designed for deployment in Kubernetes with the following components:
- RAG Application Deployment (with auto-scaling)
- Neo4j StatefulSet
- Weaviate StatefulSet
- PostgreSQL StatefulSet
- Monitoring stack (Prometheus, Grafana, etc.)
- Ingress for external access
Minimum recommended resources:
- RAG Application: 1-2 CPU cores, 2-4GB RAM per instance
- Neo4j: 4 CPU cores, 8GB RAM
- Weaviate: 4 CPU cores, 8GB RAM
- PostgreSQL: 2 CPU cores, 4GB RAM
- Storage: 100GB for vector database, 100GB for graph database
- Horizontal scaling for the RAG application
- Vertical scaling for databases with potential for clustering
- Auto-scaling based on CPU and memory metrics
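The auto-scaling behavior described above could be expressed with a HorizontalPodAutoscaler; this manifest is an illustrative sketch, and the resource names and thresholds are assumptions to be adapted to the actual deployment.

```yaml
# Illustrative HPA for the RAG application tier (names/thresholds assumed).
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: rag-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: rag-app
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```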
- API Key authentication for service access
- Role-based access control for administrative functions
- JWT token support for web application integration
- TLS encryption for all communications
- Internal service communications within Kubernetes network
- Ingress with SSL termination
- Encrypted storage for sensitive data
- Secrets management with Kubernetes secrets
- Controlled access to document repositories
- Token bucket algorithm for rate limiting
- Configurable limits per client/API key
- Circuit breakers for dependent services
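The token bucket algorithm mentioned above can be sketched in a few lines; a production limiter would be maintained per client/API key and would typically use a library such as Bucket4j rather than this hand-rolled version.

```java
// Minimal token bucket: tokens refill continuously at a fixed rate up to
// a capacity; each request consumes one token or is rejected.
public class TokenBucket {
    private final double capacity;
    private final double refillPerNano;
    private double tokens;
    private long lastRefill;

    public TokenBucket(double capacity, double refillPerSecond) {
        this.capacity = capacity;
        this.refillPerNano = refillPerSecond / 1_000_000_000.0;
        this.tokens = capacity;
        this.lastRefill = System.nanoTime();
    }

    public synchronized boolean tryAcquire() {
        long now = System.nanoTime();
        // Add tokens accrued since the last call, capped at capacity.
        tokens = Math.min(capacity, tokens + (now - lastRefill) * refillPerNano);
        lastRefill = now;
        if (tokens >= 1.0) { tokens -= 1.0; return true; }
        return false;
    }
}
```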
- Efficient vector indexing with HNSW algorithm
- Graph database indexing for common query patterns
- Caching for frequent queries and embeddings
- Asynchronous document processing
- Batch processing for vector operations
- Parallel entity extraction
- Streaming responses for long-running generations
- Context truncation to manage token limits
- Adjustable retrieval parameters based on query characteristics
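Context truncation can be implemented as greedy assembly under a token budget: take chunks in relevance order until the budget is exhausted. Token counting is crudely approximated as length/4 here; the real system would use the model's tokenizer.

```java
import java.util.ArrayList;
import java.util.List;

// Greedy context assembly: keep the highest-ranked chunks that fit
// within the prompt's token budget.
public class ContextBudget {
    // Rough heuristic: ~4 characters per token for English text.
    static int approxTokens(String s) { return s.length() / 4 + 1; }

    public static List<String> fit(List<String> rankedChunks, int maxTokens) {
        List<String> kept = new ArrayList<>();
        int used = 0;
        for (String chunk : rankedChunks) {
            int cost = approxTokens(chunk);
            if (used + cost > maxTokens) break; // budget exhausted
            kept.add(chunk);
            used += cost;
        }
        return kept;
    }
}
```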
- Stateless application tier for easy scaling
- Connection pooling for database connections
- Kubernetes autoscaling based on metrics
- Configurable JVM memory settings
- Database resource allocation based on workload
- Optimized container resource requests/limits
- Efficient chunking strategies for large documents
- Pagination for large result sets
- Incremental indexing for large document repositories
- Query processing times
- Retrieval effectiveness
- Document processing statistics
- Evaluation scores
- JVM metrics (memory, GC, threads)
- Database performance
- API endpoint latency
- Resource utilization
- Query success rates
- User satisfaction (via feedback)
- Content coverage
- Response quality trends
- Error rate thresholds
- Latency thresholds
- Resource utilization warnings
- Integration with incident management systems
- Integration with document management systems
- Web crawling capabilities
- Structured database connectors
- Support for additional vector databases
- Alternative embedding models
- Custom retrieval algorithms
- Enhanced support for tabular data
- Structured form extraction
- Interactive visualization
- Database backup strategies
- Configuration version control
- Disaster recovery plan
- Reindexing procedures
- Model update strategy
- Database optimization tasks
- Centralized logging
- Tracing for request flows
- Diagnostic endpoints
- Conversation memory/history
- Active learning from feedback
- Domain-specific fine-tuning
- API Gateway integration
- SSO/SAML authentication
- Enterprise data connectors
- Interactive UI components
- Feedback collection improvements
- Response explanation features
- Application Nodes: 3x (4 CPU cores, 8GB RAM)
- Vector Database: 8 CPU cores, 16GB RAM
- Graph Database: 8 CPU cores, 16GB RAM
- Test Dataset: 100,000 documents, average size 5KB
The system exhibits the following scaling characteristics:
| Component | Scaling Method | Bottleneck | Solution |
|---|---|---|---|
| RAG Application | Horizontal | CPU during LLM calls | Add more pods, optimize context size |
| Vector Database | Vertical + Sharding | Memory for embeddings | Increase memory, shard by collection |
| Graph Database | Vertical | I/O for large graphs | SSD storage, optimize queries |
| PostgreSQL | Vertical | Concurrent writes | Connection pooling, batch operations |
- Start with 3 RAG application replicas for every 100 concurrent users
- Allocate 2GB RAM for every 100,000 document chunks in vector database
- Allocate 4GB RAM for every 1,000,000 nodes/relationships in graph database
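The rules of thumb above can be expressed as simple sizing helpers. These are starting points for capacity planning, not guarantees.

```java
// Capacity-planning heuristics from the guidelines above.
public class Capacity {
    // 3 replicas per 100 concurrent users.
    public static int appReplicas(int concurrentUsers) {
        return 3 * (int) Math.ceil(concurrentUsers / 100.0);
    }
    // 2GB RAM per 100,000 document chunks in the vector database.
    public static int vectorDbRamGb(long documentChunks) {
        return (int) Math.ceil(documentChunks / 100_000.0) * 2;
    }
    // 4GB RAM per 1,000,000 nodes/relationships in the graph database.
    public static int graphDbRamGb(long nodesAndRels) {
        return (int) Math.ceil(nodesAndRels / 1_000_000.0) * 4;
    }
}
```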
Process a query and generate a response.
Request Body:
```json
{
  "query": "What is RAG?",
  "strategyType": "hybrid",
  "options": {
    "vectorTopK": 5,
    "graphTopK": 3,
    "maxResults": 8,
    "rerankerEnabled": true
  },
  "evaluationEnabled": true
}
```
Response:
```json
{
  "queryId": "550e8400-e29b-41d4-a716-446655440000",
  "query": "What is RAG?",
  "answer": "RAG (Retrieval Augmented Generation) is a technique that combines...",
  "sources": [
    {
      "sourceId": "doc-123",
      "sourceName": "introduction_to_rag.pdf",
      "score": 0.92,
      "snippet": "RAG stands for Retrieval Augmented Generation..."
    }
  ],
  "timestamp": "2023-09-15T14:32:21.539Z",
  "processingTimeMs": 437
}
```
Stream a response to a query.
Request Body: Same as /api/v1/query
Response: Server-Sent Events stream with chunks:
```
event: chunk
data: {"queryId":"550e8400-e29b-41d4-a716-446655440000","content":"RAG ","timestamp":"2023-09-15T14:32:21.639Z"}

event: chunk
data: {"queryId":"550e8400-e29b-41d4-a716-446655440000","content":"(Retrieval ","timestamp":"2023-09-15T14:32:21.689Z"}

event: chunk
data: {"queryId":"550e8400-e29b-41d4-a716-446655440000","content":"Augmented ","timestamp":"2023-09-15T14:32:21.739Z"}
```
Ingest documents into the system.
Request Form Data:
- `files`: One or more files to ingest
- `options`: JSON string with options
Options Example:
```json
{
  "preprocessingEnabled": true,
  "semanticChunking": true,
  "chunkSize": 512,
  "chunkOverlap": 128,
  "vectorStoreName": "default"
}
```
Response:
```json
{
  "ingestionId": "7b2ff780-f56c-45e2-a9b1-32cdf1b3d0cc",
  "status": "processing",
  "files": ["document1.pdf", "document2.txt"],
  "timestamp": 1694789541000
}
```
Submit user feedback for a response.
Request Body:
```json
{
  "rating": 4,
  "comment": "Very helpful response!"
}
```
Response:
```json
{
  "status": "success",
  "message": "Feedback recorded",
  "queryId": "550e8400-e29b-41d4-a716-446655440000",
  "timestamp": 1694789641000
}
```
Get system statistics.
Query Parameters:
- `timeRange`: Time range in minutes (0 for all time)
Response:
```json
{
  "evaluation": {
    "count": 1250,
    "averageScore": 0.87,
    "metricScores": {
      "context_relevance": 0.92,
      "answer_faithfulness": 0.88,
      "answer_completeness": 0.85,
      "answer_conciseness": 0.83
    },
    "humanFeedbackCount": 320,
    "averageHumanRating": 0.9
  }
}
```
Trigger reindexing operation.
Response:
```json
{
  "jobId": "9b1deb4d-3b7d-4bad-9bdd-2b0d7b3dcb6d",
  "status": "started",
  "message": "Reindexing job started",
  "timestamp": 1694789741000
}
```
Get evaluation metrics.
Response:
```json
{
  "context_relevance": {
    "name": "Context Relevance",
    "description": "Measures how relevant the retrieved context is to the query",
    "prompt": "Evaluate how relevant the provided context documents are to the user query..."
  },
  "answer_faithfulness": {
    "name": "Answer Faithfulness",
    "description": "Measures if the answer is faithful to the retrieved context",
    "prompt": "Evaluate how faithful the answer is to the provided context..."
  }
}
```
Check system health.
Response:
```json
{
  "status": "UP",
  "timestamp": 1694789841000,
  "version": "2.0.0",
  "env": "production"
}
```
| Property | Description | Default | Environment Variable |
|---|---|---|---|
| `ragapp.vector-store.default-store-name` | Default vector store | `weaviate` | `VECTOR_STORE_NAME` |
| `ragapp.vector-store.class-name` | Vector class name | `Document` | `VECTOR_CLASS_NAME` |
| `ragapp.document.chunk-size` | Document chunk size | `512` | `DOCUMENT_CHUNK_SIZE` |
| `ragapp.document.chunk-overlap` | Chunk overlap size | `128` | `DOCUMENT_CHUNK_OVERLAP` |
| `ragapp.evaluation.enabled` | Enable evaluation | `true` | `EVALUATION_ENABLED` |
| `ragapp.multimodal.vision-enabled` | Enable vision | `true` | `VISION_ENABLED` |
| `ragapp.rate-limit.enabled` | Enable rate limiting | `true` | `RATE_LIMIT_ENABLED` |
| `ragapp.rate-limit.requests-per-second` | Rate limit | `10` | `RATE_LIMIT_RPS` |
| Property | Description | Default | Environment Variable |
|---|---|---|---|
| `spring.ai.openai.api-key` | OpenAI API key | - | `OPENAI_API_KEY` |
| `spring.ai.openai.chat.options.model` | Chat model | `gpt-4` | `OPENAI_CHAT_MODEL` |
| `spring.ai.openai.embedding.options.model` | Embedding model | `text-embedding-ada-002` | `OPENAI_EMBEDDING_MODEL` |
| Property | Description | Default | Environment Variable |
|---|---|---|---|
| `spring.neo4j.uri` | Neo4j URI | `bolt://localhost:7687` | `NEO4J_URI` |
| `spring.neo4j.authentication.username` | Neo4j username | `neo4j` | `NEO4J_USERNAME` |
| `spring.neo4j.authentication.password` | Neo4j password | `password` | `NEO4J_PASSWORD` |
| `spring.datasource.url` | JDBC URL | `jdbc:postgresql://localhost:5432/ragapp` | `JDBC_URL` |
| `spring.datasource.username` | Database username | `postgres` | `JDBC_USERNAME` |
| `spring.datasource.password` | Database password | `postgres` | `JDBC_PASSWORD` |
| Issue | Possible Cause | Solution |
|---|---|---|
| Slow vector search | High dimensionality, large dataset | Reduce embedding dimensions, add more RAM, optimize HNSW parameters |
| Connection timeouts to LLM | Network issues, rate limiting | Add retries, backoff strategy, use async processing |
| Out of memory errors | Large documents, high concurrency | Adjust JVM heap size, improve chunking, add more application nodes |
| Neo4j query timeouts | Complex graph traversals | Optimize Cypher queries, add indexes, limit hop count |
| PDF parsing failures | Corrupt PDFs, unsupported formats | Add error handling, fallback to OCR, validate input files |
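The retry-with-backoff mitigation for LLM timeouts can be sketched as below; in a Spring application this would normally be handled by Spring Retry or Resilience4j rather than hand-rolled code.

```java
import java.util.function.Supplier;

// Retries a transient-failure-prone call with exponential backoff:
// delays of base, 2x, 4x, ... between attempts.
public class Retry {
    public static <T> T withBackoff(Supplier<T> call, int maxAttempts, long baseDelayMs) {
        RuntimeException last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return call.get();
            } catch (RuntimeException e) {
                last = e;
                try {
                    Thread.sleep(baseDelayMs << attempt); // exponential backoff
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new RuntimeException(ie);
                }
            }
        }
        throw last; // all attempts exhausted
    }
}
```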
| Metric | Warning Threshold | Critical Threshold | Action |
|---|---|---|---|
| CPU Usage | >70% for 5min | >85% for 5min | Scale out application |
| Memory Usage | >80% for 5min | >90% for 2min | Increase memory or scale out |
| 5xx Error Rate | >1% for 5min | >5% for 2min | Check logs, restart services |
| Response Time | >2s for 5min | >5s for 2min | Optimize queries, check external services |
| Queue Depth | >100 for 5min | >500 for 2min | Add more workers, check bottlenecks |
Common error patterns to look for in logs:

```
ERROR [EnhancedRagService] Error generating response for query [<queryId>]: Connection refused
```

Indicates OpenAI API connectivity issues.

```
WARN [WeaviateVectorStore] Error searching vector store: timeout
```

Indicates Weaviate performance or connectivity issues.

```
ERROR [DocumentProcessor] Error processing document: Out of memory
```

Indicates memory pressure during document processing.