FileFlux Architecture Guide

Architectural overview of the document processing SDK for RAG systems

Design Principles

1. Clean Architecture

FileFlux follows clean architecture principles with a two-package structure:

FileFlux.Core: Pure document extraction with zero AI dependencies
- Standard document readers (PDF, DOCX, XLSX, PPTX, MD, TXT, JSON, CSV, HTML)
- Core interfaces and domain models
- AI service interface definitions (no implementations)
FileFlux: Full RAG pipeline (interface-driven)
- MultiModal document readers (AI-enhanced)
- AI service interfaces for consumer implementation
- Chunking strategies (FluxCurator)
- Content enhancement (FluxImprover)
- Processing orchestration

2. Interface-Driven Design

Extensible plugin architecture
Loose coupling through dependency injection
Strategy and Factory patterns

3. Domain Model Optimization

Simplified property names: Properties → Props, ChunkIndex → Index
Props dictionary pattern: Extensible metadata storage
Guid-based traceability: Track entire pipeline stages
Simplified Quality: Changed from complex object to double type
Unified API: Integrated batch/streaming through IDocumentProcessor

System Architecture

graph TB
    A[Client Application] --> B[IDocumentProcessor]
    B --> C[DocumentProcessor]

    C --> D[IDocumentReaderFactory]
    C --> E[IChunkingStrategyFactory]
    C --> M[IMetadataEnricher]

    D --> F[PdfReader]
    D --> G[WordReader]
    D --> H[ExcelReader]
    D --> I[PowerPointReader]
    D --> J[MarkdownReader]
    D --> K[TextReader]
    D --> L[JsonReader]
    D --> N[CsvReader]
    D --> O[HtmlReader]

    E --> P[AutoChunkingStrategy]
    E --> Q[SmartChunkingStrategy]
    E --> R[IntelligentChunkingStrategy]
    E --> S[SemanticChunkingStrategy]
    E --> T[ParagraphChunkingStrategy]
    E --> U[FixedSizeChunkingStrategy]
    E --> V[MemoryOptimizedIntelligentStrategy]

    M --> X[AIMetadataEnricher]
    M --> Y[RuleBasedMetadataExtractor]

    C --> W[DocumentChunk[]]

    style A fill:#e1f5fe
    style B fill:#f3e5f5
    style C fill:#fff3e0
    style W fill:#e8f5e8
    style M fill:#e8eaf6

Project Structure

FileFlux uses a two-package architecture for flexibility:

FileFlux.Core/                    # Extraction-Only Package (Zero AI Dependencies)
├── Exceptions/                   # Exception types
│   ├── FileFluxException
│   ├── DocumentProcessingException
│   └── UnsupportedFileFormatException
├── Infrastructure/
│   └── Readers/                  # Standard Document Readers
│       ├── PdfDocumentReader
│       ├── WordDocumentReader
│       ├── ExcelDocumentReader
│       ├── PowerPointDocumentReader
│       ├── MarkdownDocumentReader
│       ├── HtmlDocumentReader
│       ├── TextDocumentReader
│       ├── JsonDocumentReader
│       └── CsvDocumentReader
├── Utils/                        # Utilities
│   └── FileNameHelper
├── IDocumentReader.cs            # Reader interface
├── IDocumentParser.cs            # Parser interface
├── IChunkingStrategy.cs          # Strategy interface
├── DocumentChunk.cs              # Chunk model
├── RawContent.cs                 # Extraction result model
├── ParsedContent.cs              # Parsed content model
└── ChunkingOptions.cs            # Options model

FileFlux/                         # Full RAG Pipeline Package
├── Core/                         # AI Service Interfaces
│   ├── IDocumentProcessor
│   ├── IDocumentAnalysisService    # AI text generation interface
│   ├── IImageToTextService       # Vision AI interface
│   ├── IImageRelevanceEvaluator  # Image relevance interface
│   ├── IEmbeddingService         # Embedding generation interface
│   ├── IMetadataEnricher
│   └── Factories/
├── Infrastructure/
│   ├── Readers/                  # MultiModal Readers (AI-enhanced)
│   │   ├── MultiModalPdfDocumentReader
│   │   ├── MultiModalWordDocumentReader
│   │   ├── MultiModalExcelDocumentReader
│   │   └── MultiModalPowerPointDocumentReader
│   ├── Strategies/               # Chunking Strategies
│   │   ├── AutoChunkingStrategy
│   │   ├── SmartChunkingStrategy
│   │   ├── IntelligentChunkingStrategy
│   │   └── SemanticChunkingStrategy
│   ├── Languages/                # Language Profiles
│   │   └── LanguageProfiles.cs
│   ├── Services/                 # Processing Services
│   │   ├── AIMetadataEnricher
│   │   ├── FluxCurator
│   │   └── FluxImprover
│   └── Factories/                # Factory implementations
└── DocumentProcessor.cs          # Main orchestrator

Layer Structure

┌─────────────────────────────────────────────────────────────┐
│                    Client Layer                             │
│  • Application Code                                         │
│  • RAG Systems Integration                                  │
│  • AI Service Implementation                                │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌─────────────────────────┐  ┌─────────────────────────┐  │
│  │   FileFlux.Core         │  │     FileFlux            │  │
│  │  (Extraction Only)      │  │  (Full RAG Pipeline)    │  │
│  │                         │  │                         │  │
│  │  • Document Readers     │──│  • Chunking Strategies  │  │
│  │  • Core Interfaces      │  │  • FluxCurator          │  │
│  │  • Domain Models        │  │  • FluxImprover         │  │
│  │  • AI Service Contracts │  │  • DocumentProcessor    │  │
│  │  • Zero AI Dependencies │  │  • Orchestration        │  │
│  │                         │  │                         │  │
│  └─────────────────────────┘  └─────────────────────────┘  │
│                                                             │
│   Use Case:                    Use Case:                    │
│   - Extract documents only     - Full processing pipeline   │
│   - Implement own chunking     - Use built-in strategies    │
│   - Minimal dependencies       - AI-enhanced features       │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Core Components

1. IDocumentProcessor (Main Interface)

Role: Single entry point for all document processing with explicit state management

Key Methods:

ExtractAsync(): Stage 1 - Extract raw content
RefineAsync(): Stage 2 - Refine and structure analysis
ChunkAsync(): Stage 3 - Apply chunking strategy
EnrichAsync(): Stage 4 - LLM-powered enrichment
ProcessAsync(): Run complete pipeline

Properties:

State: Current processor state (Created → Extracted → Refined → Chunked → Enriched)
Result: Accumulated results across all stages
FilePath: Source document path

Responsibilities: Pipeline orchestration, state management, error handling, result validation

2. StatefulDocumentProcessor (v0.9.0+)

5-Stage Processing Pipeline:

📂 Extract → 🔄 Refine → 🤖 LLM-Refine → 📦 Chunk → ✨ Enrich

Stage	Interface	Description	Output
Extract	`IDocumentReader`	Raw content extraction from files	`RawContent`
Refine	`IDocumentRefiner`	Text cleaning, normalization, structure analysis	`RefinedContent`
LLM-Refine	`ILlmRefiner`	LLM-powered noise removal, sentence restoration (optional, skipped if LLM unavailable)	`LlmRefinedContent`
Chunk	`IChunkerFactory`	Text segmentation into chunks	`IReadOnlyList<DocumentChunk>`
Enrich	`IDocumentEnricher`	LLM-powered summaries, keywords, contextual text	`DocumentGraph`

State Flow:

Created → Extracted → Refined → LlmRefined → Chunked → Enriched → Disposed

Auto-Dependency Resolution:

Calling RefineAsync() automatically runs ExtractAsync() if not already completed
Calling LlmRefineAsync() automatically runs ExtractAsync() and RefineAsync() if needed
Calling ChunkAsync() automatically runs ExtractAsync() and RefineAsync() if needed
Each stage is idempotent - calling twice returns cached results

Dependency Management:

Optional logger support via NullLogger pattern
Optional IDocumentRefiner for refinement stage
Optional IDocumentEnricher for enrichment stage
All optional dependencies use graceful fallback strategies

Stage Responsibilities and RawContent Definition

RawContent Definition:

RawContent is not the original file bytes. It is "normalized text with semantic structure preserved, extracted without AI dependencies".

This design decision optimizes for RAG pipeline quality:

HTML tags are noise for RAG → Extract converts to Markdown
PDF layout info is valuable → Extract preserves structure hints
Markdown is already structured → Extract preserves as-is

Extract vs Refine Role Separation:

Stage	Primary Responsibility	Examples
Extract	Format normalization	HTML→Markdown, metadata extraction, noise tag removal
Refine	Quality enhancement	OCR correction, image→text, whitespace cleanup, header/footer removal

Reader-specific Extract Output:

Reader	Extract Output	Rationale
`HtmlDocumentReader`	Markdown (headings, lists, links, code blocks)	HTML tags are RAG noise; preserve semantic structure only
`MarkdownDocumentReader`	Original Markdown (parsed and validated)	Already structured; no conversion needed
`PdfDocumentReader`	Text + structural hints (has_tables, page_count)	Layout info aids chunking decisions
`TextDocumentReader`	Original text	No transformation required
`JsonDocumentReader`	Flattened text with key paths	Preserve hierarchy context
`CsvDocumentReader`	Text with row/column structure	Tabular context for RAG

Example: HTML Extract Output:

# Document Title

## Section 1
This is paragraph content with [a link](https://example.com).

- List item 1
- List item 2
  - Nested item

--- TABLE: User Data ---
Name | Age | Role
John | 30 | Developer
--- END TABLE ---

![Image alt text](image.png)

```javascript
console.log("Code block preserved");


**Why HTML Extract outputs Markdown**:
1. **Single conversion point**: HTML→Markdown happens once in Extract, not twice
2. **RAG optimization**: Chunking stage receives structured input for better segmentation
3. **Consistency**: All Readers output text suitable for immediate chunking
4. **Performance**: No intermediate format transformation overhead

### 3. IDocumentRefiner (Stage 2)

**Role**: Content refinement and structure extraction

**Key Methods**:
- `RefineAsync(RawContent, RefineOptions)`: Refine raw content

**Output (RefinedContent)**:
- `Text`: Cleaned and normalized text
- `Sections`: Document structure (headings, paragraphs)
- `Structures`: Structured elements (code blocks, tables, images)
- `Metadata`: Document metadata (filename, type, created date)
- `Quality`: Refinement quality scores

**RefineOptions**:
- `CleanNoise`: Remove extra whitespace, normalize formatting
- `ConvertToMarkdown`: Convert to markdown for better structure
- `BuildSections`: Extract document sections from headings
- `ExtractStructures`: Extract code blocks, tables, images

### 4. IDocumentEnricher (Stage 4)

**Role**: LLM-powered chunk enrichment and graph building

**Key Methods**:
- `EnrichAsync(chunks, refinedContent, options)`: Enrich chunks with LLM
- `EnrichStreamAsync(chunks, refinedContent)`: Streaming enrichment
- `BuildGraphAsync(enrichedChunks, options)`: Build document graph

**Output (EnrichmentResult)**:
- `Chunks`: List of enriched chunks with metadata
- `Graph`: Document graph showing inter-chunk relationships
- `Stats`: Enrichment statistics

**EnrichedDocumentChunk**:
- `Chunk`: Original DocumentChunk
- `Summary`: LLM-generated chunk summary
- `Keywords`: Extracted keywords with relevance scores
- `ContextualText`: Context for RAG retrieval
- `SearchableText`: Combined text for search

**DocumentGraph**:
- `Nodes`: Chunk nodes with metadata
- `Edges`: Relationships (sequential, hierarchical, semantic)
- `NodeCount`, `EdgeCount`: Graph statistics

### 5. Legacy DocumentProcessor

**5-Stage Processing Pipeline** (maintained for backward compatibility):

📂 Extract → 📄 Parse → 🔄 Refine → 📦 Chunk → ✨ Enhance


| Stage | Description | Component |
|-------|-------------|-----------|
| **Extract** | Raw content extraction from files | IDocumentReader |
| **Parse** | Structure analysis and parsing | IDocumentParser |
| **Refine** | Content transformation (Markdown, Image-to-Text) | IMarkdownConverter, IImageToTextService |
| **Chunk** | Text segmentation into chunks | IChunkerFactory (FluxCurator) |
| **Enhance** | Metadata enrichment and quality scoring | FluxImprover |

**Refine Stage (Default Enabled)**:
- `ConvertToMarkdown = true`: Markdown conversion for structure preservation (enabled by default)
- `ProcessImagesToText = false`: Image text extraction (opt-in, cost consideration)

### 6. IDocumentReader (Content Extraction)

**Current Implementations**:
- **PdfDocumentReader**: PDF text and image extraction
- **WordDocumentReader**: DOCX with style preservation
- **ExcelDocumentReader**: XLSX multi-sheet support
- **PowerPointDocumentReader**: PPTX slide extraction
- **MarkdownDocumentReader**: Markdown structure preservation
- **HtmlDocumentReader**: HTML content extraction
- **TextDocumentReader**: Plain text processing
- **JsonDocumentReader**: JSON structured data
- **CsvDocumentReader**: CSV table data

### 7. IChunkingStrategy (Content Splitting)

**Strategy Types**:
- **AutoChunkingStrategy**: Automatic strategy selection (recommended)
- **SmartChunkingStrategy**: Sentence boundary-based with high completeness
- **IntelligentChunkingStrategy**: LLM-based semantic boundary detection
- **MemoryOptimizedIntelligentChunkingStrategy**: Memory-efficient intelligent chunking
- **SemanticChunkingStrategy**: Sentence-based semantic chunking
- **ParagraphChunkingStrategy**: Paragraph-level segmentation
- **FixedSizeChunkingStrategy**: Fixed-size token-based chunking

### 8. ILanguageProfile (Multilingual Text Segmentation)

**Purpose**: Language-specific rules for accurate sentence boundary detection and text segmentation.

**Key Properties**:
- `LanguageCode`: ISO 639-1 code (en, ko, zh, ja, etc.)
- `ScriptCode`: ISO 15924 script code (Latn, Hang, Hans, Arab, Deva, Cyrl, Jpan)
- `WritingDirection`: LTR, RTL, or TopToBottom
- `NumberFormat`: Decimal/thousands separator conventions
- `QuotationMarks`: Language-specific quote characters
- `SentenceEndPattern`: Regex for sentence boundaries
- `Abbreviations`: Non-breaking abbreviation list
- `CategorizedAbbreviations`: Typed abbreviations (Prepositive/Postpositive/General)

**Supported Languages** (11):
| Language | Script | Direction | Number Format |
|----------|--------|-----------|---------------|
| English | Latn | LTR | Standard (1,234.56) |
| Korean | Hang | LTR | Standard |
| Chinese | Hans | LTR | NoGrouping |
| Japanese | Jpan | LTR | Standard |
| Spanish | Latn | LTR | European (1.234,56) |
| French | Latn | LTR | SpaceSeparated (1 234,56) |
| German | Latn | LTR | European |
| Arabic | Arab | **RTL** | Standard |
| Hindi | Deva | LTR | Standard |
| Portuguese | Latn | LTR | European |
| Russian | Cyrl | LTR | SpaceSeparated |

**Provider Pattern**:
- `ILanguageProfileProvider`: Manages language profile lookup and auto-detection
- `DefaultLanguageProfileProvider`: Built-in provider with Unicode script analysis
- Auto-detection analyzes text Unicode ranges to determine language

## Processing Pipeline

```mermaid
graph TB
    A[Document Input] --> B[Type Detection]
    B --> C[Reader Selection]
    C --> D[Content Extraction]
    D --> E[Metadata Enrichment]
    E --> F[Structure Parsing]
    F --> G[Strategy Selection]
    G --> H[Chunking Process]
    H --> I[Post Processing]
    I --> J[DocumentChunk[]]

    style A fill:#e1f5fe
    style E fill:#e8eaf6
    style J fill:#e8f5e8

1. Input Processing

File path or stream input support
File existence and access permission validation
Supported format verification

2. Content Extraction

Dedicated reader for each document type
Text content and metadata extraction
Document structure preservation

3. Metadata Enrichment (Optional)

AI-powered metadata extraction with IDocumentAnalysisService
Three-tier fallback: AI → Hybrid → Rule-based
Automatic caching based on file content hash
Schema-based extraction (General, ProductManual, TechnicalDoc)
Enriched metadata stored in CustomProperties with "enriched_" prefix

4. Chunking Processing

Content splitting based on selected strategy
Overlap between chunks
Metadata propagation and indexing

Factory Patterns

DocumentReaderFactory

File extension-based reader selection
New reader registration and management
Unsupported format exception handling
Extension discovery API

ChunkingStrategyFactory

Strategy name-based selection system
Default and fallback strategy management
Dynamic strategy registration support

Configuration and Options

ChunkingOptions

Main Settings:

Strategy: Chunking strategy name ("Auto", "Smart", "Intelligent", etc.)
MaxChunkSize: Maximum chunk size (default: 1024 tokens)
OverlapSize: Overlap size between chunks (default: 128 tokens)
PreserveStructure: Whether to preserve document structure
StrategyOptions: Strategy-specific detailed options
CustomProperties: Extensible configuration dictionary for features like metadata enrichment

Metadata Enrichment Configuration:

var options = new ChunkingOptions
{
    Strategy = "Auto",
    CustomProperties = new Dictionary<string, object>
    {
        ["enableMetadataEnrichment"] = true,
        ["metadataSchema"] = MetadataSchema.General,
        ["metadataOptions"] = new MetadataEnrichmentOptions
        {
            ExtractionStrategy = MetadataExtractionStrategy.Smart,
            MinConfidence = 0.7
        }
    }
};

Dependency Injection Setup

Basic Registration (no AI):

services.AddFileFlux();  // Pure extraction + chunking

With AI Services (Consumer-provided implementations):

// Use your own AI service implementations
services.AddScoped<IDocumentAnalysisService, YourLLMService>();
services.AddScoped<IImageToTextService, YourVisionService>();
services.AddScoped<IEmbeddingService, YourEmbeddingService>();
services.AddFileFlux();

With LMSupply (CLI example - local AI processing):

// LMSupply is NOT a dependency of FileFlux
// Consumer applications reference LMSupply directly
// See cli/FileFlux.CLI/Services/LMSupply for implementation examples

var lmSupplyOptions = new LMSupplyOptions
{
    UseGpuAcceleration = true,
    EmbeddingModel = "default",
    GeneratorModel = "microsoft/Phi-4-mini-instruct-onnx"
};

// Create and register LMSupply services
var embedder = await LMSupplyEmbedderService.CreateAsync(lmSupplyOptions);
services.AddSingleton<IEmbeddingService>(embedder);
services.AddFileFlux();

Extension Registration: Add custom readers/strategies

Performance Considerations

Memory Management

Stream-based processing for large files
IDisposable pattern for resource cleanup
ConfigureAwait(false) to minimize context switching

Concurrency

All public interfaces are thread-safe
Factories use immutable collections
No shared mutable state

Scalability

Minimal memory allocation
Efficient string processing
Reusable component design

Extension Points

Adding Custom Reader

Implement IDocumentReader interface
Register in DI container
Implement SupportedExtensions and CanRead methods

Example:

public class CustomDocumentReader : IDocumentReader
{
    public string ReaderType => "CustomReader";
    public IEnumerable<string> SupportedExtensions => [".custom"];

    public bool CanRead(string fileName) =>
        Path.GetExtension(fileName).Equals(".custom", StringComparison.OrdinalIgnoreCase);

    public Task<ReadResult> ReadAsync(string filePath, CancellationToken cancellationToken)
    {
        // Stage 0: Return document structure (page count, metadata)
        // Implementation
    }

    public Task<RawContent> ExtractAsync(string filePath, ExtractOptions? options = null, CancellationToken cancellationToken = default)
    {
        // Stage 1: Return extracted text content
        // Implementation
    }
}

// Registration
services.AddTransient<IDocumentReader, CustomDocumentReader>();

Adding Custom Chunking Strategy

FileFlux uses IChunker from FluxCurator for chunking. Register a custom IChunker implementation to provide a custom chunking strategy:

Implement IChunker from FluxCurator.Core.Core
Define StrategyName
Implement chunking logic in ChunkAsync

Example:

using FluxCurator.Core.Core;
using FluxCurator.Core.Domain;

public class CustomChunker : IChunker
{
    public string StrategyName => "Custom";
    public bool RequiresEmbedder => false;

    public async Task<IReadOnlyList<DocumentChunk>> ChunkAsync(
        string text,
        ChunkOptions options,
        CancellationToken cancellationToken = default)
    {
        // Implementation
    }

    public int EstimateChunkCount(string text, ChunkOptions options) => 1;
}

// Registration
services.AddTransient<IChunker, CustomChunker>();

Error Handling

Exception Hierarchy

FileFluxException: Base class for all exceptions
UnsupportedFileFormatException: Unsupported file format
DocumentProcessingException: Error during document processing
ChunkingException: Error during chunking process

Error Handling Patterns

Early error detection through input validation
Meaningful error messages with context
Preserve inner exceptions for cause tracking
Include debugging information (filename, strategy name)

RAG System Integration

Processing Result

DocumentChunk:

Id (Guid): Unique chunk identifier
Content (string): Chunk text content
Index (int): Chunk order index
Location (SourceLocation): StartChar/EndChar position info
Quality (double): Quality score (0.0~1.0)
Props (Dictionary<string, object>): Extensible metadata

RawContent:

Text: Extracted raw text
File (SourceFileInfo): File information
ReaderType: Reader type used
ExtractedAt: Extraction timestamp
Warnings: Processing warnings
Hints: Processing hints

ParsedDocumentContent:

Content: Parsed text content
Sections: Structured sections (optional)
RawId: Reference to RawContent.Id
Props: Additional metadata

SourceFileInfo:

Name: Filename
Extension: File extension
Size: File size
Path: File path (optional)

Integration Patterns

Streaming Processing: Sequential processing per chunk with ProcessStreamAsync
Batch Processing: Collect all chunks then batch process
Pipeline Processing: Simultaneous chunk generation and embedding generation

Extensibility Pattern

// Extensible metadata with Props dictionary
chunk.Props["ContextualHeader"] = "Document: Technical";
chunk.Props["DocumentDomain"] = "Technical";
chunk.Props["HasImages"] = true;

// Maintain backward compatibility with extension methods
public static string? ContextualHeader(this DocumentChunk chunk)
    => chunk.Props.TryGetValue("ContextualHeader", out var v) ? v?.ToString() : null;

Metadata Enrichment Pattern

// Enriched metadata storage in CustomProperties
chunk.Metadata.CustomProperties["enriched_topics"] = new[] { "AI", "Machine Learning" };
chunk.Metadata.CustomProperties["enriched_keywords"] = new[] { "neural networks", "deep learning" };
chunk.Metadata.CustomProperties["enriched_description"] = "Introduction to AI concepts";
chunk.Metadata.CustomProperties["enriched_confidence"] = 0.92;
chunk.Metadata.CustomProperties["enriched_extractionMethod"] = "ai";

// Access enriched metadata
var topics = chunk.Metadata.CustomProperties.GetValueOrDefault("enriched_topics") as string[];
var confidence = Convert.ToDouble(chunk.Metadata.CustomProperties.GetValueOrDefault("enriched_confidence", 0.0));

Pipeline Traceability

RawContent.Id (Guid)
    ↓
ParsedDocumentContent.RawId → RawContent.Id
    ↓
DocumentChunk.RawId → RawContent.Id
DocumentChunk.ParsedId → ParsedDocumentContent.Id

Design Philosophy

FileFlux focuses on transforming documents into structured chunks optimized for RAG systems.

Two-Package Strategy:

FileFlux.Core: Pure extraction, zero AI dependencies
- Standard document readers
- Core interfaces and models
- AI service interface definitions
- For users implementing custom pipelines
FileFlux: Full RAG pipeline (interface-driven)
- MultiModal readers (AI-enhanced)
- FluxCurator and FluxImprover
- No direct AI service implementations

Interface-Driven AI: FileFlux defines AI service interfaces without implementations:

IDocumentAnalysisService: Text generation for intelligent chunking
IImageToTextService: Image captioning and OCR
IEmbeddingService: Embedding generation for semantic search

Consumer Responsibility: Applications provide AI service implementations:

Use OpenAI, Anthropic, Azure OpenAI, or other cloud providers
Use LMSupply for local AI processing (see CLI for examples)
Implement custom providers as needed

Package Selection Guide:

// Extraction only - implement your own chunking
using FileFlux.Core;
var reader = new PdfDocumentReader();
var rawContent = await reader.ReadAsync("document.pdf");

// Full RAG pipeline with custom AI providers
using FileFlux;
services.AddScoped<IDocumentAnalysisService, OpenAIService>();
services.AddScoped<IEmbeddingService, OpenAIEmbeddingService>();
services.AddFileFlux();
var processor = serviceProvider.GetRequiredService<IDocumentProcessor>();
var chunks = await processor.ProcessAsync("document.pdf", options);

FAQ

Why does HTML Extract output Markdown instead of raw HTML?

Short answer: HTML tags are noise for RAG systems. Extract converts to Markdown to preserve semantic structure while removing presentation markup.

Detailed explanation:

RAG Optimization: Raw HTML contains <div>, <span>, CSS classes, and other tags that add no semantic value for retrieval. Markdown preserves meaning (headings, lists, links) without noise.
Chunking Quality: The Chunk stage needs structured text input. If Extract outputs raw HTML, the Refine stage would need to parse and convert it, adding complexity and potential errors.
Single Conversion Point: Converting HTML→Markdown once in Extract is more efficient than maintaining an intermediate format and converting again in Refine.
Consistency: All Readers output text that can be immediately chunked. HTML Reader's Markdown output follows this pattern.

What if I need the original HTML?

Read the file directly using File.ReadAllText() before calling FileFlux
FileFlux is designed for RAG preprocessing, not general-purpose HTML processing

What's the difference between Extract and Refine stages?

Stage	Focus	AI Required
Extract	Format normalization (HTML→Markdown, PDF→Text)	No
Refine	Quality enhancement (OCR fix, image→text, cleanup)	Optional

Extract uses deterministic libraries (HtmlAgilityPack, PdfPig). Refine can optionally use AI services for advanced processing.

Why does `RefiningOptions.ConvertToMarkdown` exist if Extract already outputs Markdown?

ConvertToMarkdown in Refine handles cases where:

Extract output is plain text (from TextReader, legacy sources)
Additional structure detection is needed (e.g., inferring headings from formatting)
If input is already Markdown, Refine performs cleanup only (no re-conversion)

FilesExpand file tree

ARCHITECTURE.md

Latest commit

History