A FastMCP server for intelligent hyperlink validation with semantic analysis, visual verification, and automated reporting. Built with Python and UV for seamless integration with AI coding assistants.
- Features
- Quick Start
- MCP Integration
- Available Tools
- Validation Scoring
- Project Structure
- Testing
- Usage Examples
- Progress Tracking & Job Management
- Technical Details
- Troubleshooting
- FAQ
- License
## Features

- 🔍 Smart Validation: Semantic analysis using TF-IDF and cosine similarity scoring
- 🎯 Context-Aware: Enhanced validation using surrounding text and purpose context
- 📸 Visual Verification: Full-page screenshot capture with Playwright
- 🤖 LLM-Ready: Structured outputs and analysis prompts for AI workflows
- ⚡ Batch Processing: Efficient parallel validation of multiple URLs
- 📝 Markdown Support: Extract and validate links with automatic context extraction
- 📊 Progress Tracking: Job-based polling for long-running operations
- 🛑 Cancellation Support: Graceful cancellation with partial results preserved
- 📄 Report Generation: Google Docs-compatible HTML reports with embedded screenshots
## Quick Start

```bash
# Clone and setup
git clone <repository-url>
cd url-checker

# Install dependencies
uv sync

# Install Playwright browsers (for screenshots)
uv run playwright install chromium

# Start the MCP server
uv run main.py

# Run example demonstrations
uv run examples/llm_screenshot_analysis_demo.py
uv run examples/google_docs_report_example.py

# Run tests
uv run tests/run_tests.py
```

## MCP Integration

Add to your Windsurf MCP settings:
```json
{
  "mcpServers": {
    "url-checker": {
      "command": "uv",
      "args": ["--directory", "/absolute/path/to/url-checker", "run", "main.py"]
    }
  }
}
```

Add to your Cursor MCP settings:
```json
{
  "mcp.servers": {
    "url-checker": {
      "command": "uv",
      "args": ["--directory", "/absolute/path/to/url-checker", "run", "main.py"]
    }
  }
}
```

## Available Tools

| Tool | Purpose |
|---|---|
| `extract_hyperlinks_from_markdown` | Parse markdown content and extract links with context |
| `validate_hyperlinks_batch` | Validate multiple URLs in parallel with semantic analysis |
| `capture_full_page_screenshot_with_llm_analysis` | Capture full-page screenshots with structured LLM prompts |
| `generate_google_docs_report` | Generate HTML reports with embedded screenshots |
| `check_job_status` | Monitor progress of long-running operations |
| `cancel_job` | Cancel jobs gracefully with partial results |
| `get_validation_recommendation` | Get recommended actions based on validation results |
Typical workflow:

```mermaid
graph TD
    A[extract_hyperlinks_from_markdown] --> B[validate_hyperlinks_batch]
    B --> C{Needs Visual?}
    C -->|Yes| D[capture_full_page_screenshot_with_llm_analysis]
    C -->|No| E[Accept/Reject]
    D --> F[LLM Analysis]

    G[generate_google_docs_report] --> H[Returns job_id]
    H --> I[check_job_status polling]
    I --> J{Status?}
    J -->|running| I
    J -->|running| K[cancel_job optional]
    K --> L[Partial results]
    J -->|done| M[Get report_path]
    J -->|failed| N[Handle error]
    J -->|cancelled| L
```
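For a concrete picture, here is a minimal end-to-end sketch of the validation branch of this workflow, written against the core functions shown in the usage examples below. It assumes batch results come back in input order and that inconclusive results carry the `needs_screenshot` flag used later in this README; treat it as illustrative, not the server's canonical flow:

```python
import asyncio

from url_checker import (
    extract_hyperlinks_from_markdown_core,
    validate_hyperlinks_batch_core,
    capture_full_page_screenshot_with_llm_analysis_core,
)

async def validate_document(markdown: str) -> None:
    # 1. Extract links (with surrounding-text context) from markdown
    extracted = extract_hyperlinks_from_markdown_core(markdown)
    links = extracted["hyperlinks"]

    # 2. Validate all links in one parallel batch
    results = await validate_hyperlinks_batch_core(links)

    # 3. Fall back to visual verification for inconclusive results
    #    (assumes results are returned in the same order as the input links)
    for link, result in zip(links, results["results"]):
        if result.get("needs_screenshot"):
            shot = await capture_full_page_screenshot_with_llm_analysis_core(
                url=link["url"],
                name=link["name"],
                context=link.get("context", ""),
            )
            print(f"Screenshot saved: {shot.get('screenshot_path')}")

asyncio.run(validate_document("See the [Python docs](https://docs.python.org/)."))
```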
## Validation Scoring

The system uses weighted semantic analysis with optional context enhancement:
Without Context:
- Keyword Overlap (30%), Semantic Similarity (25%), Meta Analysis (25%), Content Analysis (20%)
With Context (Enhanced):
- Title Analysis (45%), Content Analysis (22%), Meta Analysis (18%), Context Analysis (15%)
Score thresholds:

- ≥ 0.7: High confidence (coherent) - Link matches content
- 0.4-0.69: Medium confidence (inconclusive) - Visual verification recommended
- < 0.4: Low confidence (non-coherent) - Likely incorrect link
Context enhancement improves accuracy for generic link names, disambiguates similar URLs, and provides automatic context extraction from markdown.
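To make the arithmetic concrete, this illustrative snippet combines placeholder component scores using the weights and thresholds above; in the server, the component values come from the TF-IDF and cosine-similarity analysis:

```python
# Weights from the tables above
WEIGHTS_NO_CONTEXT = {"keyword": 0.30, "semantic": 0.25, "meta": 0.25, "content": 0.20}
WEIGHTS_WITH_CONTEXT = {"title": 0.45, "content": 0.22, "meta": 0.18, "context": 0.15}

def coherence_score(components: dict[str, float], weights: dict[str, float]) -> float:
    # Weighted sum of per-signal scores, each assumed to be in [0, 1]
    return sum(weights[name] * components.get(name, 0.0) for name in weights)

def classify(score: float) -> str:
    if score >= 0.7:
        return "coherent"        # high confidence
    if score >= 0.4:
        return "inconclusive"    # visual verification recommended
    return "non-coherent"        # likely incorrect link

# Placeholder component scores for a hypothetical link
score = coherence_score(
    {"title": 0.9, "content": 0.6, "meta": 0.5, "context": 0.8},
    WEIGHTS_WITH_CONTEXT,
)
print(f"{score:.3f} -> {classify(score)}")  # 0.747 -> coherent
```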
## Project Structure

```
url-checker/
├── src/url_checker/       # Core package
│   ├── __init__.py        # Package exports
│   └── server.py          # FastMCP server implementation
├── tests/                 # Test suite
│   ├── test_url_checker.py  # Comprehensive tests
│   └── run_tests.py         # Test runner
├── examples/              # Usage examples
│   ├── llm_screenshot_analysis_demo.py  # Comprehensive LLM demo
│   ├── google_docs_report_example.py    # Report generation demo
│   ├── test_job_polling.py              # Job progress tracking demo
│   ├── test_cancellation.py             # Cancellation behavior demo
│   └── test_cancel_tool.py              # cancel_job MCP tool demo
├── reports/               # Generated HTML reports (created at runtime)
│   └── [session_id]/      # Session-specific reports
├── screenshots/           # Captured screenshots (created at runtime)
│   └── [session_id]/      # Session-specific screenshots
├── main.py                # Server entry point
├── pyproject.toml         # Project configuration
├── .gitignore             # Git ignore patterns
└── README.md              # This documentation
```
## Testing

```bash
# Run basic tests (no network required)
uv run tests/run_tests.py basic

# All tests including network and browser
uv run tests/run_tests.py full

# Or use pytest
uv run pytest

# Run with verbose output
uv run pytest -v
```

Test categories:

- Basic Tests: Text processing, scoring logic (no network)
- Network Tests: Real URL validation (requires internet)
- Browser Tests: Screenshot capture (requires Playwright)
- Job Tests: Progress tracking and cancellation
Test files:

- `tests/test_url_checker.py`: Comprehensive standalone test suite
- `tests/run_tests.py`: Simple test runner with options
- `examples/test_job_polling.py`: Job-based progress tracking demo
- `examples/test_cancellation.py`: Cancellation behavior demo
Setup:

```bash
# Install main dependencies
uv sync

# Install development dependencies (for pytest)
uv sync --group dev

# Install Playwright browsers (for screenshot tests)
uv run playwright install chromium
```

Note: Screenshot tests may fail in headless environments or if Playwright browsers are not installed. This is expected and handled gracefully.
Example test output:

```
🚀 Starting URL Checker Test Suite
============================================================
🔍 Testing URLAnalyzer Basic Functions
----------------------------------------
✅ URLAnalyzer.clean_text
✅ URLAnalyzer.extract_keywords
✅ URLAnalyzer.calculate_coherence_score

🔍 Testing validate_url_basic_core
----------------------------------------
✅ validate_url_basic_core - valid URL
✅ validate_url_basic_core - invalid URL

... (more tests) ...

============================================================
TEST SUMMARY
============================================================
Tests run: 25
Passed: 25
Failed: 0
Duration: 15.23s
Success rate: 100.0%

🎉 All tests passed!
```
Debugging test failures:

```bash
# Enable verbose output
uv run pytest -v -s

# Check dependencies
uv run python -c "from src.url_checker import URLAnalyzer; print('✅ Imports OK')"
```

For CI/CD pipelines, use basic tests to avoid network and browser dependencies:

```bash
uv run tests/run_tests.py basic
```

## Usage Examples

```python
from url_checker import validate_url_basic_core
# Basic validation
result = await validate_url_basic_core(
    name="Python Documentation",
    url="https://docs.python.org/"
)

# Enhanced validation with context
result_with_context = await validate_url_basic_core(
    name="Python Documentation",
    url="https://docs.python.org/",
    context="Official Python programming language documentation with tutorials and API reference"
)

if result_with_context["validation_result"] == "coherent":
    print("✅ Valid link")
elif result_with_context["needs_screenshot"]:
    print("📸 Needs visual verification")
```

```python
# Manual context for precise validation
result = await validate_url_basic_core(
    name="API Docs",
    url="https://docs.example.com",
    context="REST API documentation for authentication endpoints"
)

# Automatic context extraction from markdown
from url_checker import extract_hyperlinks_from_markdown_core

markdown_content = """
# Development Guide

For API integration, see [API Documentation](https://docs.example.com)
which provides comprehensive endpoint references.
"""

links = extract_hyperlinks_from_markdown_core(markdown_content)

# Each link now includes surrounding text as context
print(links['hyperlinks'][0]['context'])
# Output: "For API integration, see which provides comprehensive endpoint references."
```

```python
from url_checker import validate_hyperlinks_batch_core
# Batch validation with context
links = [
    {
        "name": "Python",
        "url": "https://python.org",
        "context": "Official Python programming language website"
    },
    {
        "name": "GitHub",
        "url": "https://github.com",
        "context": "Version control platform for code hosting"
    }
]

results = await validate_hyperlinks_batch_core(links)
print(f"Processed {len(results['results'])} links")
```

```python
from url_checker import extract_hyperlinks_from_markdown_core
markdown = "Check out [Python](https://python.org) and [GitHub](https://github.com)"
links = extract_hyperlinks_from_markdown_core(markdown)
print(f"Found {links['total_count']} links")from url_checker import capture_full_page_screenshot_with_llm_analysis_core
# Capture full-page screenshot with structured LLM analysis prompt
result = await capture_full_page_screenshot_with_llm_analysis_core(
    url="https://fastapi.tiangolo.com/",
    name="FastAPI Documentation",
    context="Modern Python web framework documentation"
)

if result['success']:
    screenshot_path = result['screenshot_path']    # Local file path
    screenshot_b64 = result['screenshot_base64']   # Base64 for LLM
    llm_prompt = result['llm_analysis_prompt']     # Structured prompt
```

Features: Full-page capture, local storage with timestamps, base64 encoding, structured LLM prompts, context-aware analysis.
```python
import asyncio

from url_checker import generate_google_docs_report_core, check_job_status_core

# Start report generation (returns job_id immediately)
result = await generate_google_docs_report_core(
    job_id="unique-job-id",
    urls=[
        {"url": "https://example.com", "name": "Example", "context": "Main site"},
        {"url": "https://docs.example.com", "name": "Docs"}
    ],
    ctx=None,
    report_title="Website Validation Report"
)

# Poll for progress
while True:
    status = await check_job_status_core(result["job_id"])
    print(f"Progress: {status['progress']}% - {status['message']}")
    if status["status"] == "done":
        print(f"Report: {status['output']['report_path']}")
        break
    elif status["status"] == "failed":
        print(f"Failed: {status['error']}")
        break
    await asyncio.sleep(1)
```

Features: Google Docs compatible HTML, embedded screenshots, progress tracking, cancellation support, session-based organization.
Import to Google Docs: File → Import → Upload the generated HTML file.
## Progress Tracking & Job Management

Long-running operations use a job-based polling system:

- Start job → get a `job_id` immediately
- Poll `check_job_status(job_id)` repeatedly
- Monitor progress (0-100%)
- Retrieve results when status is `"done"`

Progress stages:

- 0%: Initialization
- 0-90%: URL processing (proportional: `(processed/total) * 90`)
- 90%: HTML generation
- 100%: Complete

Job statuses:

- `"running"`: Job in progress
- `"done"`: Completed successfully
- `"failed"`: Error occurred
- `"cancelled"`: User cancelled
Cancel long-running operations gracefully with `cancel_job`:

```python
# Start and cancel a job
result = await generate_google_docs_report(urls, "My Report")
cancel_result = await cancel_job(result["job_id"])

# Check final status - returns partial report
status = await check_job_status(result["job_id"])
```

- Graceful: Current URL completes before stopping
- Partial Results: Processed URLs saved to `_PARTIAL.html`
- Preserved Data: All screenshots and progress maintained
- Status Transition: `running` → `cancelling` → `cancelled`
Run the cancellation demos:

```bash
uv run examples/test_cancel_tool.py
uv run examples/test_cancellation.py
```

## Technical Details

Dependencies:

- FastMCP: MCP server framework
- httpx: Async HTTP client
- BeautifulSoup4: HTML parsing
- Playwright: Browser automation
- scikit-learn: Semantic similarity
- NLTK: Natural language processing
Performance:

- Async/await for non-blocking operations
- Parallel batch validation with asyncio
- NLTK data caching
- Job-based background processing
- Session-based file organization
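As a sketch of the parallel-batch point above (illustrative of the pattern, not the server's exact implementation), the fan-out with `asyncio.gather` looks like this:

```python
import asyncio

from url_checker import validate_url_basic_core

async def validate_links_concurrently(links: list[dict]) -> list[dict]:
    # One validation coroutine per link, awaited concurrently
    tasks = [
        validate_url_basic_core(
            name=link["name"],
            url=link["url"],
            context=link.get("context", ""),
        )
        for link in links
    ]
    return await asyncio.gather(*tasks)
```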
Development setup:

```bash
uv sync --group dev
uv run playwright install chromium
```

Development workflow:

- Edit code in `src/url_checker/`
- Run tests: `uv run pytest` or `uv run tests/run_tests.py`
- Test examples to verify functionality
Contribution guidelines:

- Add tests for new features
- Update documentation
- Follow existing patterns
- Add type hints and docstrings
- Include positive and negative test cases
- Test error conditions and edge cases
- Keep tool descriptions concise
## Troubleshooting

Import errors:

```bash
uv sync
uv run python -c "from src.url_checker import validate_url_basic_core; print('✅ OK')"
```

Screenshot failures:

```bash
uv run playwright install chromium
```

Network test failures:

- Check internet connectivity
- Use basic tests: `uv run tests/run_tests.py basic`

Report generation issues:

- Ensure write permissions for `./screenshots` and `./reports`
- Check disk space availability

Job management issues:

- Cancellation not working: Wait for current URL to complete
- Partial report missing: Check write permissions and disk space
- Status not updating: Verify `job_id` and polling interval (1-2s)
## FAQ

Q: Do I need Playwright for basic validation?
A: No. Playwright is only required for screenshot capture.
Q: Can I use this without an MCP client?
A: Yes. Import core functions directly from `src.url_checker` in Python code.
Q: How accurate is the validation?
A: 85-90% typical accuracy using semantic similarity, keyword overlap, and meta analysis. Visual verification recommended for inconclusive results.
Q: What do the validation scores mean?
A: ≥ 0.7 = coherent (high confidence), 0.4-0.69 = inconclusive (needs visual check), < 0.4 = non-coherent (likely incorrect).
Q: How does progress tracking work?
A: Long-running operations return a job_id immediately. Poll check_job_status for progress updates (0-100%).
Q: Can I cancel report generation?
A: Yes. Use cancel_job to stop gracefully and get a partial report with processed URLs.
Q: How are screenshots stored?
A: Saved to ./screenshots/[session_id]/ with timestamps. Both PNG and optimized JPEG versions, plus base64 encoding.
Q: Can I import reports into Google Docs?
A: Yes. File → Import → Upload the generated HTML file. Screenshots are embedded.
Q: Which IDEs support MCP?
A: Windsurf IDE, Cursor IDE, and Claude Desktop have built-in MCP support.
## License

Open source - modify and distribute freely.
Built with ❤️ using UV and modern Python practices