A comprehensive web application for exploring and comparing different chunking strategies for Retrieval-Augmented Generation (RAG) systems. This tool helps developers and researchers optimize RAG performance by visualizing how different chunking approaches affect document processing.
- PDF Upload & Text Extraction: Upload PDF documents and extract text content for analysis
- Multiple Chunking Strategies: Explore six chunking approaches:
  - Fixed-Size Chunking
  - Recursive Character Splitting
  - Sentence-Based Chunking
  - Paragraph-Based Chunking
  - Semantic Sliding Window
  - Fixed Overlap Chunking
- Strategy Explanations: Detailed explanations of each strategy's principles, best use cases, and trade-offs
- Chunk Visualization: Interactive visualization of chunks with metadata including size, overlap, and positioning
- Performance Optimized: Efficient processing of large PDF documents
- Responsive Design: Works seamlessly across desktop and mobile devices
- Real-time Analysis: Instant chunking results with comprehensive metadata
- Error Handling: Robust error handling for invalid files and processing issues
Fixed-Size Chunking
- Principle: Splits text into chunks of a fixed character count; only the final chunk may be shorter
- Best For: Uniform processing requirements, predictable chunk sizes
- Trade-offs: May break semantic meaning mid-sentence or mid-word
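A minimal sketch of splitting by a fixed character count. The function name `fixedSizeChunks` and its parameters are hypothetical, not taken from this repository, and "character" here means a UTF-16 code unit, as in JavaScript strings.

```typescript
// Hypothetical helper: split text into consecutive slices of `size` characters.
function fixedSizeChunks(text: string, size: number): string[] {
  if (!Number.isInteger(size) || size <= 0) {
    throw new RangeError("size must be a positive integer");
  }
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += size) {
    // slice() clamps at the end of the string, so the last chunk may be shorter.
    chunks.push(text.slice(start, start + size));
  }
  return chunks;
}
```

Calling `fixedSizeChunks("abcdefghij", 4)` cuts purely by position, which is why a word or sentence can end up split across two chunks.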
Recursive Character Splitting
- Principle: Splits text recursively, preferring paragraph and sentence boundaries over arbitrary cut points
- Best For: Maintaining semantic coherence while controlling chunk size
- Trade-offs: More complex processing, variable chunk sizes
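One common way to implement boundary-preserving splitting is to try coarse separators first and fall back to finer ones only for pieces that are still too long. The sketch below assumes that approach; the name `recursiveSplit`, the default separator list, and the hard-cut fallback are illustrative choices, not this project's code.

```typescript
// Hypothetical sketch: split on coarse separators first, recurse with finer
// ones only for pieces that still exceed maxSize.
function recursiveSplit(
  text: string,
  maxSize: number,
  separators: string[] = ["\n\n", "\n", ". ", " "],
): string[] {
  if (maxSize <= 0) throw new RangeError("maxSize must be positive");
  if (text.length <= maxSize) return text.length > 0 ? [text] : [];
  if (separators.length === 0) {
    // No separators left: hard-cut by position as a last resort.
    const cut: string[] = [];
    for (let i = 0; i < text.length; i += maxSize) cut.push(text.slice(i, i + maxSize));
    return cut;
  }
  const sep = separators[0];
  const finer = separators.slice(1);
  const chunks: string[] = [];
  let current = "";
  for (const piece of text.split(sep)) {
    // Greedily merge neighbouring pieces while the result still fits.
    const candidate = current ? current + sep + piece : piece;
    if (candidate.length <= maxSize) {
      current = candidate;
      continue;
    }
    if (current) chunks.push(current);
    if (piece.length <= maxSize) {
      current = piece;
    } else {
      // The piece alone is too long: split it with the finer separators.
      chunks.push(...recursiveSplit(piece, maxSize, finer));
      current = "";
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```

Every returned chunk is at most `maxSize` characters long. In this sketch the separator that falls on a chunk boundary is dropped, which is usually acceptable for whitespace but worth knowing for the `". "` case.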
Sentence-Based Chunking
- Principle: Groups complete sentences together into chunks (3-8 sentences per chunk)
- Best For: Maintaining grammatical and semantic integrity
- Trade-offs: Highly variable chunk sizes; long sentences can produce very large chunks
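Grouping whole sentences can be sketched as follows. The sentence detection is deliberately naive (terminal `.`, `!` or `?`), and the name `sentenceChunks` and the default of 5 sentences per chunk are assumptions chosen from within the 3-8 range mentioned above.

```typescript
// Hypothetical sketch: detect sentences with a simple regex, then group them.
function sentenceChunks(text: string, sentencesPerChunk = 5): string[] {
  if (!Number.isInteger(sentencesPerChunk) || sentencesPerChunk <= 0) {
    throw new RangeError("sentencesPerChunk must be a positive integer");
  }
  // A sentence is a run of text ending in ., ! or ?; trailing text without
  // a terminator is kept as a final sentence.
  const matches: string[] = text.match(/[^.!?]+[.!?]+|[^.!?]+$/g) ?? [];
  const sentences = matches.map((s) => s.trim()).filter((s) => s.length > 0);
  const chunks: string[] = [];
  for (let i = 0; i < sentences.length; i += sentencesPerChunk) {
    chunks.push(sentences.slice(i, i + sentencesPerChunk).join(" "));
  }
  return chunks;
}
```

A regex like this will mis-split abbreviations and decimals ("e.g.", "3.14"); a production splitter would need a more careful sentence tokenizer.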
Paragraph-Based Chunking
- Principle: Uses paragraph breaks as natural chunk boundaries
- Best For: Documents with clear paragraph structure and topics
- Trade-offs: Highly variable sizes; a single long paragraph becomes a single very large chunk
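Using paragraph breaks as boundaries might look like the sketch below, which assumes paragraphs are separated by at least one blank line in the extracted text. The name `paragraphChunks` is hypothetical.

```typescript
// Hypothetical sketch: treat one or more blank lines as a paragraph break.
function paragraphChunks(text: string): string[] {
  return text
    .split(/\n\s*\n/) // a newline, optional whitespace, then another newline
    .map((paragraph) => paragraph.trim())
    .filter((paragraph) => paragraph.length > 0);
}
```

Single line breaks inside a paragraph are left untouched, so the approach depends on the extraction step preserving blank lines between paragraphs.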
Semantic Sliding Window
- Principle: Creates overlapping chunks to maintain context across boundaries
- Best For: Ensuring no information is lost at chunk boundaries
- Trade-offs: Increased storage requirements due to overlap
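Overlapping chunks of this kind can be sketched as a window sliding over a list of text units such as sentences. This README does not specify how the application chooses its units or window parameters, so `slidingWindowChunks`, `windowSize` and `overlap` below are illustrative assumptions.

```typescript
// Hypothetical sketch: slide a window of `windowSize` units forward by
// (windowSize - overlap) units, so adjacent chunks share `overlap` units.
function slidingWindowChunks(units: string[], windowSize: number, overlap: number): string[] {
  if (windowSize <= 0 || overlap < 0 || overlap >= windowSize) {
    throw new RangeError("require windowSize > 0 and 0 <= overlap < windowSize");
  }
  const step = windowSize - overlap;
  const chunks: string[] = [];
  for (let start = 0; start < units.length; start += step) {
    chunks.push(units.slice(start, start + windowSize).join(" "));
    // Stop once a window has reached the end, to avoid a redundant tail chunk.
    if (start + windowSize >= units.length) break;
  }
  return chunks;
}
```

With five sentences, a window of 3 and an overlap of 1, the third sentence appears in both chunks, which is what carries context across the boundary.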
Fixed Overlap Chunking
- Principle: Large chunks with significant overlap for maximum context retention
- Best For: Complex documents where context is crucial
- Trade-offs: High redundancy, increased processing and storage costs
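To make the redundancy trade-off concrete, here is a sketch of character-based chunks with a fixed overlap, plus a helper that measures how much extra text is stored. The names `fixedOverlapChunks` and `storageRatio` and all parameter values are hypothetical.

```typescript
interface OverlapChunk {
  text: string;
  start: number; // inclusive offset into the source text
  end: number; // exclusive offset into the source text
}

// Hypothetical sketch: chunks of `size` characters, each sharing `overlap`
// characters with the previous chunk.
function fixedOverlapChunks(text: string, size: number, overlap: number): OverlapChunk[] {
  if (size <= 0 || overlap < 0 || overlap >= size) {
    throw new RangeError("require size > 0 and 0 <= overlap < size");
  }
  const step = size - overlap;
  const chunks: OverlapChunk[] = [];
  for (let start = 0; start < text.length; start += step) {
    const end = Math.min(start + size, text.length);
    chunks.push({ text: text.slice(start, end), start, end });
    if (end === text.length) break; // the source is fully covered
  }
  return chunks;
}

// Stored characters divided by source characters; 1 means no redundancy.
function storageRatio(chunks: OverlapChunk[], sourceLength: number): number {
  const stored = chunks.reduce((sum, chunk) => sum + (chunk.end - chunk.start), 0);
  return sourceLength > 0 ? stored / sourceLength : 1;
}
```

For a 10-character input with `size = 6` and `overlap = 3`, the chunks hold 16 characters in total, so the ratio is 1.6. The larger the overlap relative to the chunk size, the higher the storage and processing cost.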
- Node.js 18+
- npm or yarn
- Clone the repository:
  git clone https://github.com/Pranav452/RAG-Chunking-Strategy.git
  cd RAG-Chunking-Strategy
- Install dependencies:
  npm install
- Run the development server:
  npm run dev
- Open http://localhost:3000 in your browser
- Upload PDF: Click the upload area and select a PDF document
- Select Strategy: Choose a chunking strategy from the dropdown menu
- Apply Chunking: Click "Apply Chunking Strategy" to process the document
- Analyze Results: Review the chunk visualization and metadata
- Next.js 14: React framework with App Router
- TypeScript: Type-safe development
- Tailwind CSS: Utility-first CSS framework
- Radix UI: Accessible component primitives
- Lucide React: Beautiful icons
- app/page.tsx: Main application interface
- app/api/extract-pdf/route.ts: PDF text extraction API endpoint
- components/ui/*: Reusable UI components
Each chunking strategy is implemented as a pure function that takes text input and returns structured chunk data with metadata. The implementations focus on:
- Semantic Preservation: Maintaining meaning across chunk boundaries
- Performance: Efficient processing of large documents
- Flexibility: Configurable parameters for different use cases
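As an illustration of the "pure function returning chunks with metadata" shape, the sketch below defines one possible chunk type and a helper that derives metadata from character spans. The `Chunk` fields, the `ChunkingStrategy` signature, and the helper names are assumptions for illustration; the actual types in this project may differ.

```typescript
// Hypothetical chunk shape: the text plus the metadata shown in the UI.
interface Chunk {
  index: number;
  text: string;
  start: number; // inclusive offset into the source text
  end: number; // exclusive offset into the source text
  size: number; // length in characters
  overlap: number; // characters shared with the previous chunk
}

// A strategy is a pure function: same input text, same chunks, no side effects.
type ChunkingStrategy = (text: string) => Chunk[];

// Build chunk metadata from [start, end) spans without mutating any input.
function buildChunks(text: string, spans: ReadonlyArray<readonly [number, number]>): Chunk[] {
  const chunks: Chunk[] = [];
  let previousEnd = 0;
  for (const [start, end] of spans) {
    chunks.push({
      index: chunks.length,
      text: text.slice(start, end),
      start,
      end,
      size: end - start,
      overlap: chunks.length === 0 ? 0 : Math.max(0, previousEnd - start),
    });
    previousEnd = end;
  }
  return chunks;
}

// Trivial example strategy: the whole document as a single chunk.
const wholeDocument: ChunkingStrategy = (text) =>
  buildChunks(text, text.length > 0 ? [[0, text.length]] : []);
```

Keeping span selection separate from metadata construction means each strategy only has to decide where chunks start and end.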
Fixed-Size Chunking
- Simple documents with uniform structure
- When predictable chunk sizes are required
- Resource-constrained environments
Recursive Character Splitting
- General-purpose chunking for most documents
- When balancing semantic coherence with size control
- Mixed content types (paragraphs, lists, etc.)
Sentence-Based Chunking
- Question-answering systems
- Documents where sentence integrity is crucial
- Legal or academic documents
Paragraph-Based Chunking
- Well-structured documents with clear topics
- When preserving document hierarchy is important
- Blog posts, articles, reports
Semantic Sliding Window
- Complex technical documents
- When context preservation is critical
- Multi-step reasoning tasks
Fixed Overlap Chunking
- High-stakes applications where information loss is costly
- Documents with interconnected concepts
- Research and analysis tasks
The application provides several metrics to evaluate chunking effectiveness:
- Total Chunks: Number of chunks generated
- Average Size: Mean chunk size in characters
- Average Overlap: Mean overlap between adjacent chunks
- Coverage: Percentage of original text covered
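The four metrics could be computed from chunk offsets roughly as follows. The exact definitions used by the application are not stated here, so this sketch assumes that overlap is measured between adjacent chunks ordered by start offset and that coverage counts each source character at most once. `Span` and `chunkMetrics` are hypothetical names.

```typescript
interface Span {
  start: number; // inclusive offset into the source text
  end: number; // exclusive offset into the source text
}

interface ChunkMetrics {
  totalChunks: number;
  averageSize: number; // characters
  averageOverlap: number; // characters, averaged over adjacent pairs
  coverage: number; // percentage of the source text, 0 to 100
}

function chunkMetrics(spans: Span[], textLength: number): ChunkMetrics {
  // Sort a copy so the caller's array is left untouched.
  const sorted = [...spans].sort((a, b) => a.start - b.start);
  let totalSize = 0;
  let totalOverlap = 0;
  let covered = 0;
  let coveredEnd = 0; // everything before this offset is already counted
  let previousEnd: number | null = null;
  for (const { start, end } of sorted) {
    totalSize += end - start;
    if (previousEnd !== null) totalOverlap += Math.max(0, previousEnd - start);
    // Count only the part of this chunk that earlier chunks did not cover.
    const from = Math.max(start, coveredEnd);
    if (end > from) {
      covered += end - from;
      coveredEnd = end;
    }
    previousEnd = end;
  }
  const n = sorted.length;
  return {
    totalChunks: n,
    averageSize: n > 0 ? totalSize / n : 0,
    averageOverlap: n > 1 ? totalOverlap / (n - 1) : 0,
    coverage: textLength > 0 ? (covered / textLength) * 100 : 0,
  };
}
```

For two chunks spanning offsets 0-6 and 4-10 of a 20-character text, this yields an average size of 6, an average overlap of 2, and 50% coverage.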
