Skip to content

douinc/azure-ai-documentintelligence-html

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

1 Commit
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

Azure AI Document Intelligence to HTML Parser

A lightweight Python library for converting Azure AI Document Intelligence AnalyzeResult objects into well-formatted HTML documents.

Features

  • πŸ“„ Clean HTML Generation: Converts document analysis results to semantic HTML
  • 🎨 Customizable Styling: Support for custom CSS styling
  • πŸ“Š Table Support: Handles complex tables with row/column spans
  • πŸ“ Section Preservation: Maintains document structure with sections and paragraphs
  • πŸ”§ Flexible API: Works with JSON files, stdin/stdout, or Python objects

Installation

pip install azure-ai-documentintelligence-html

Quick Start

From Python Code

from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.core.credentials import AzureKeyCredential
from azure_ai_doc_intel_html import parse_analyze_result

# Analyze document with Azure Document Intelligence
client = DocumentIntelligenceClient(
    endpoint="<your-endpoint>",
    credential=AzureKeyCredential("<your-key>")
)

poller = client.begin_analyze_document("prebuilt-layout", document_url)
result = poller.result()

# Convert to HTML
html = parse_analyze_result(result)

# Save to file
with open("output.html", "w") as f:
    f.write(html)

# Or save directly
parse_analyze_result(result, output_html="output.html")

From Command Line

# From JSON file to HTML file
python -m azure_ai_doc_intel_html input.json output.html

# From stdin to stdout
cat input.json | python -m azure_ai_doc_intel_html > output.html

From JSON File

from azure_ai_doc_intel_html import parse

# Parse from JSON file
html = parse("analyze_result.json")

# Save to file
parse("analyze_result.json", output_html="output.html")

# With custom title
html = parse("analyze_result.json", title="My Document Analysis")

# With custom CSS
custom_css = """
    body { font-family: 'Segoe UI', Tahoma, sans-serif; }
    h1 { color: #0078D4; }
    table { border-color: #0078D4; }
"""
html = parse("analyze_result.json", custom_css=custom_css)

API Reference

Core Functions

parse_analyze_result(analyze_result, output_html=None, title=None, custom_css=None)

Convert an AnalyzeResult object to HTML.

Parameters:

  • analyze_result: AnalyzeResult object from Azure AI Document Intelligence
  • output_html: Optional path to save HTML file
  • title: Optional custom title (defaults to auto-detection from document)
  • custom_css: Optional custom CSS styles

Returns: HTML string

parse(path_or_stdin, output_html=None, title=None, custom_css=None)

Parse JSON file or stdin to HTML.

Parameters:

  • path_or_stdin: Path to JSON file or None for stdin
  • output_html: Optional path to save HTML file
  • title: Optional custom title
  • custom_css: Optional custom CSS styles

Returns: HTML string

Helper Functions

  • load_json(path_or_stdin): Load JSON from file or stdin
  • write_html(path_or_stdout, html): Write HTML to file or stdout
  • render_table_html(table): Render table object to HTML
  • render_paragraph_html(paragraph): Render paragraph object to HTML
  • render_sections_html(data): Render all sections to HTML
  • build_html_doc(title, sections_html, custom_css): Build complete HTML document

Output Structure

The generated HTML includes:

  • Document Title: Extracted from page headers or custom-specified
  • Sections: Logical document sections with proper hierarchy
  • Paragraphs: Text content with role-based styling (headers, footers, body text)
  • Tables: Complex tables with proper row/column spans and headers
  • Semantic HTML: Uses appropriate HTML5 elements for better accessibility

Default CSS Classes

  • .doc-table: Document tables
  • .section: Document sections
  • .page-header: Page header content
  • .page-footer: Page footer content
  • .summary: Summary sections with definition lists

Custom Styling Example

custom_css = """
    body { 
        font-family: 'Segoe UI', Tahoma, Geneva, sans-serif;
        max-width: 1200px;
        margin: 0 auto;
        padding: 20px;
    }
    h1 { 
        color: #0078D4;
        border-bottom: 2px solid #0078D4;
        padding-bottom: 10px;
    }
    .doc-table { 
        box-shadow: 0 2px 4px rgba(0,0,0,0.1);
        border-radius: 4px;
        overflow: hidden;
    }
    .doc-table th { 
        background: #0078D4;
        color: white;
    }
    .section { 
        background: #f9f9f9;
        padding: 15px;
        margin: 20px 0;
        border-radius: 4px;
    }
"""

html = parse_analyze_result(result, custom_css=custom_css)

Integration with Azure Document Intelligence

This library is designed to work seamlessly with Azure AI Document Intelligence. Here's a complete example:

from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.core.credentials import AzureKeyCredential
from azure_ai_doc_intel_html import parse_analyze_result

# Setup client
endpoint = "https://your-resource.cognitiveservices.azure.com/"
key = "your-api-key"
client = DocumentIntelligenceClient(endpoint, AzureKeyCredential(key))

# Analyze document
with open("document.pdf", "rb") as f:
    poller = client.begin_analyze_document(
        "prebuilt-layout",
        analyze_request=f,
        content_type="application/pdf"
    )
result = poller.result()

# Convert to HTML with custom title
html = parse_analyze_result(
    result,
    title="Financial Report Analysis",
    output_html="report.html"
)

print(f"HTML generated: {len(html)} characters")

Supported Document Elements

  • βœ… Paragraphs with roles (headers, footers, section headings)
  • βœ… Tables with complex spans
  • βœ… Sections with hierarchical structure
  • βœ… Text content with proper escaping
  • βœ… Page structure preservation

Requirements

  • Python 3.7+
  • azure-ai-documentintelligence>=1.0.0

License

MIT

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Support

For issues and questions, please use the GitHub issue tracker.

About

Parses Azure Document Intelligence results to HTML

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors