HTML to Markdown Test Files for lightfeed-extract
This repository contains test files for validating the conversion of HTML to LLM-extractor-ready Markdown. It covers three conversion variants:
- Basic Conversion - Converting all HTML content to Markdown (without images)
- Main Content Extraction - Extracting and converting only the main content from HTML files (without images)
- Conversion with Images - Converting all HTML content to Markdown including images
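The three variants can be pictured as combinations of two switches: whether to keep only the main content, and whether to keep images. The following is a minimal sketch under that assumption. The option names `mainContentOnly` and `includeImages`, the `Converter` type, and `convertAllVariants` are all hypothetical and are not taken from lightfeed-extract; the converter is passed in as a parameter so the sketch does not rely on any real API.

```typescript
// Hypothetical option flags; the real lightfeed-extract option names may differ.
interface ConversionOptions {
  mainContentOnly: boolean;
  includeImages: boolean;
}

type VariantName = "basic" | "main" | "images";

// One entry per variant listed above.
const VARIANTS: Record<VariantName, ConversionOptions> = {
  basic: { mainContentOnly: false, includeImages: false },
  main: { mainContentOnly: true, includeImages: false },
  images: { mainContentOnly: false, includeImages: true },
};

// The converter is injected by the caller (an assumption of this sketch).
type Converter = (html: string, options: ConversionOptions) => string;

// Run the same HTML through all three variants and collect the outputs.
function convertAllVariants(
  html: string,
  convert: Converter,
): Record<VariantName, string> {
  return {
    basic: convert(html, VARIANTS.basic),
    main: convert(html, VARIANTS.main),
    images: convert(html, VARIANTS.images),
  };
}
```

Each output could then be compared against the matching expected Markdown file for that variant.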
├── html/                      # Source HTML files
│   ├── forum/                 # Forum HTML samples
│   │   ├── tech-0.html
│   │   └── ...
│   └── ...
│
└── groundtruth/               # Expected Markdown output files
    ├── forum/                 # Expected forum conversion results
    │   ├── tech-0.md          # Basic conversion expected output
    │   ├── tech-0.main.md     # Main-content-only expected output
    │   ├── tech-0.images.md   # Conversion with images expected output
    │   └── ...
    └── ...
Files follow a specific naming pattern to clearly indicate their purpose:
- html/[category]/[file-name].html: Original HTML source files
- groundtruth/[category]/[file-name].md: Expected output for basic HTML conversion
- groundtruth/[category]/[file-name].main.md: Expected output for main content extraction
- groundtruth/[category]/[file-name].images.md: Expected output for conversion with images
For example:
- html/forum/tech.html: Original forum HTML file
- groundtruth/forum/tech.md: Expected Markdown after basic conversion (no images)
- groundtruth/forum/tech.main.md: Expected Markdown when only extracting main content (no images)
- groundtruth/forum/tech.images.md: Expected Markdown with images included
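The naming pattern is regular enough to derive the expected-output paths from a source path mechanically. The sketch below illustrates that mapping; `expectedPaths` and `ExpectedPaths` are hypothetical names introduced here and are not part of lightfeed-extract, and the sketch assumes forward-slash paths relative to the repository root.

```typescript
// Hypothetical helper types and names, for illustration only.
interface ExpectedPaths {
  basic: string;  // [file-name].md
  main: string;   // [file-name].main.md
  images: string; // [file-name].images.md
}

// Map html/[category]/[file-name].html to its three groundtruth files.
function expectedPaths(htmlPath: string): ExpectedPaths {
  const match = /^html\/(.+)\.html$/.exec(htmlPath);
  if (match === null) {
    throw new Error(`Not a test HTML path: ${htmlPath}`);
  }
  // match[1] is "[category]/[file-name]" without the extension.
  const stem = `groundtruth/${match[1]}`;
  return {
    basic: `${stem}.md`,
    main: `${stem}.main.md`,
    images: `${stem}.images.md`,
  };
}

// Usage: expectedPaths("html/forum/tech.html").main
// evaluates to "groundtruth/forum/tech.main.md".
```

Rejecting paths that do not match the pattern keeps a misplaced or misnamed file from silently mapping to a nonexistent expected output.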
The HTML test files in this repository are used solely for testing. All files have been sanitized: personal information and sensitive content are replaced with generic placeholders, while the structure and formatting of the HTML are preserved.