arturdj/url-checker

URL Checker

A FastMCP server for intelligent hyperlink validation with semantic analysis, visual verification, and automated reporting. Built with Python and UV for seamless integration with AI coding assistants.

Features

  • πŸ” Smart Validation: Semantic analysis using TF-IDF and cosine similarity scoring
  • 🎯 Context-Aware: Enhanced validation using surrounding text and purpose context
  • πŸ“Έ Visual Verification: Full-page screenshot capture with Playwright
  • πŸ€– LLM-Ready: Structured outputs and analysis prompts for AI workflows
  • ⚑ Batch Processing: Efficient parallel validation of multiple URLs
  • πŸ“ Markdown Support: Extract and validate links with automatic context extraction
  • πŸ“Š Progress Tracking: Job-based polling for long-running operations
  • πŸ›‘ Cancellation Support: Graceful cancellation with partial results preserved
  • πŸ“„ Report Generation: Google Docs-compatible HTML reports with embedded screenshots

Quick Start

Installation

# Clone and setup
git clone <repository-url>
cd url-checker

# Install dependencies
uv sync

# Install Playwright browsers (for screenshots)
uv run playwright install chromium

Basic Usage

# Start the MCP server
uv run main.py

# Run example demonstrations
uv run examples/llm_screenshot_analysis_demo.py
uv run examples/google_docs_report_example.py

# Run tests
uv run tests/run_tests.py

MCP Integration

Windsurf IDE

Add to your Windsurf MCP settings:

{
  "mcpServers": {
    "url-checker": {
      "command": "uv",
      "args": ["--directory", "/absolute/path/to/url-checker", "run", "main.py"]
    }
  }
}

Cursor IDE

Add to your Cursor MCP settings:

{
  "mcpServers": {
    "url-checker": {
      "command": "uv",
      "args": ["--directory", "/absolute/path/to/url-checker", "run", "main.py"]
    }
  }
}

Available Tools

Primary Tools

| Tool | Purpose |
| --- | --- |
| extract_hyperlinks_from_markdown | Parse markdown content and extract links with context |
| validate_hyperlinks_batch | Validate multiple URLs in parallel with semantic analysis |
| capture_full_page_screenshot_with_llm_analysis | Capture full-page screenshots with structured LLM prompts |
| generate_google_docs_report | Generate HTML reports with embedded screenshots |
| check_job_status | Monitor progress of long-running operations |
| cancel_job | Cancel jobs gracefully with partial results |
| get_validation_recommendation | Get recommended actions based on validation results |

Tool Workflow

graph TD
    A[extract_hyperlinks_from_markdown] --> B[validate_hyperlinks_batch]
    B --> C{Needs Visual?}
    C -->|Yes| D[capture_full_page_screenshot_with_llm_analysis]
    C -->|No| E[Accept/Reject]
    D --> F[LLM Analysis]
    
    G[generate_google_docs_report] --> H[Returns job_id]
    H --> I[check_job_status polling]
    I --> J{Status?}
    J -->|running| I
    J -->|running| K[cancel_job optional]
    K --> L[Partial results]
    J -->|done| M[Get report_path]
    J -->|failed| N[Handle error]
    J -->|cancelled| L

Validation Scoring

The system uses weighted semantic analysis with optional context enhancement:
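The TF-IDF and cosine-similarity scoring mentioned in Features can be sketched with scikit-learn (a listed dependency). This is an illustration only, not the server's actual implementation; `semantic_similarity` is a hypothetical helper:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def semantic_similarity(link_name: str, page_text: str) -> float:
    """Cosine similarity between TF-IDF vectors of a link's name and page text."""
    vectors = TfidfVectorizer(stop_words="english").fit_transform([link_name, page_text])
    return float(cosine_similarity(vectors[0], vectors[1])[0, 0])

# A name that matches the page scores well above an unrelated one.
match = semantic_similarity(
    "Python Documentation",
    "The official Python documentation: tutorials and API reference",
)
mismatch = semantic_similarity(
    "Cooking Recipes",
    "The official Python documentation: tutorials and API reference",
)
print(round(match, 2), round(mismatch, 2))
```

In practice this single similarity is one component; it is blended with keyword, meta, and content signals using the weights below.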

Scoring Components

Without Context:

  • Keyword Overlap (30%), Semantic Similarity (25%), Meta Analysis (25%), Content Analysis (20%)

With Context (Enhanced):

  • Title Analysis (45%), Content Analysis (22%), Meta Analysis (18%), Context Analysis (15%)

Score Interpretation

  • ≥ 0.7: High confidence (coherent) - Link matches content
  • 0.4-0.69: Medium confidence (inconclusive) - Visual verification recommended
  • < 0.4: Low confidence (non-coherent) - Likely incorrect link
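A minimal sketch of how the weights and thresholds above combine (illustrative only; the function names are hypothetical, not the server's API):

```python
# Weights from "Without Context" above (30/25/25/20); thresholds from
# "Score Interpretation".
def combined_score(keyword: float, semantic: float, meta: float, content: float) -> float:
    return 0.30 * keyword + 0.25 * semantic + 0.25 * meta + 0.20 * content

def interpret_score(score: float) -> str:
    if score >= 0.7:
        return "coherent"        # high confidence
    if score >= 0.4:
        return "inconclusive"    # visual verification recommended
    return "non-coherent"        # likely incorrect link

score = combined_score(keyword=0.8, semantic=0.7, meta=0.6, content=0.5)
print(round(score, 3), interpret_score(score))  # -> 0.665 inconclusive
```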

Context Benefits

Context enhancement improves accuracy for generic link names, disambiguates similar URLs, and provides automatic context extraction from markdown.

Project Structure

url-checker/
├── src/url_checker/        # Core package
│   ├── __init__.py         # Package exports
│   └── server.py           # FastMCP server implementation
├── tests/                  # Test suite
│   ├── test_url_checker.py # Comprehensive tests
│   └── run_tests.py        # Test runner
├── examples/               # Usage examples
│   ├── llm_screenshot_analysis_demo.py # Comprehensive LLM demo
│   ├── google_docs_report_example.py   # Report generation demo
│   ├── test_job_polling.py             # Job progress tracking demo
│   ├── test_cancellation.py            # Cancellation behavior demo
│   └── test_cancel_tool.py             # cancel_job MCP tool demo
├── reports/                # Generated HTML reports (created at runtime)
│   └── [session_id]/       # Session-specific reports
├── screenshots/            # Captured screenshots (created at runtime)
│   └── [session_id]/       # Session-specific screenshots
├── main.py                 # Server entry point
├── pyproject.toml          # Project configuration
├── .gitignore              # Git ignore patterns
└── README.md               # This documentation

Testing

Quick Test

# Run basic tests (no network required)
uv run tests/run_tests.py basic

Full Test Suite

# All tests including network and browser
uv run tests/run_tests.py full

# Or use pytest
uv run pytest

# Run with verbose output
uv run pytest -v

Test Categories

  • Basic Tests: Text processing, scoring logic (no network)
  • Network Tests: Real URL validation (requires internet)
  • Browser Tests: Screenshot capture (requires Playwright)
  • Job Tests: Progress tracking and cancellation

Test Files

  • tests/test_url_checker.py: Comprehensive standalone test suite
  • tests/run_tests.py: Simple test runner with options
  • examples/test_job_polling.py: Job-based progress tracking demo
  • examples/test_cancellation.py: Cancellation behavior demo

Prerequisites

# Install main dependencies
uv sync

# Install development dependencies (for pytest)
uv sync --group dev

# Install Playwright browsers (for screenshot tests)
uv run playwright install chromium

Note: Screenshot tests may fail in headless environments or if Playwright browsers are not installed. This is expected and handled gracefully.

Expected Test Output

🚀 Starting URL Checker Test Suite
============================================================

📋 Testing URLAnalyzer Basic Functions
----------------------------------------
✅ URLAnalyzer.clean_text
✅ URLAnalyzer.extract_keywords
✅ URLAnalyzer.calculate_coherence_score

🔍 Testing validate_url_basic_core
----------------------------------------
✅ validate_url_basic_core - valid URL
✅ validate_url_basic_core - invalid URL

... (more tests) ...

============================================================
TEST SUMMARY
============================================================
Tests run: 25
Passed: 25
Failed: 0
Duration: 15.23s
Success rate: 100.0%

🎉 All tests passed!

Debugging Test Failures

# Enable verbose output
uv run pytest -v -s

# Check dependencies
uv run python -c "from src.url_checker import URLAnalyzer; print('✅ Imports OK')"

Continuous Integration

For CI/CD pipelines, use basic tests to avoid network and browser dependencies:

uv run tests/run_tests.py basic

Usage Examples

Single URL Validation

from url_checker import validate_url_basic_core

# Basic validation
result = await validate_url_basic_core(
    name="Python Documentation", 
    url="https://docs.python.org/"
)

# Enhanced validation with context
result_with_context = await validate_url_basic_core(
    name="Python Documentation", 
    url="https://docs.python.org/",
    context="Official Python programming language documentation with tutorials and API reference"
)

if result_with_context["validation_result"] == "coherent":
    print("✅ Valid link")
elif result_with_context["needs_screenshot"]:
    print("📸 Needs visual verification")

Context-Aware Validation

# Manual context for precise validation
result = await validate_url_basic_core(
    name="API Docs",
    url="https://docs.example.com",
    context="REST API documentation for authentication endpoints"
)

# Automatic context extraction from markdown
from url_checker import extract_hyperlinks_from_markdown_core

markdown_content = """
# Development Guide
For API integration, see [API Documentation](https://docs.example.com) 
which provides comprehensive endpoint references.
"""

links = extract_hyperlinks_from_markdown_core(markdown_content)
# Each link now includes surrounding text as context
print(links['hyperlinks'][0]['context'])
# Output: "For API integration, see which provides comprehensive endpoint references."

Batch Processing

from url_checker import validate_hyperlinks_batch_core

# Batch validation with context
links = [
    {
        "name": "Python", 
        "url": "https://python.org",
        "context": "Official Python programming language website"
    },
    {
        "name": "GitHub", 
        "url": "https://github.com",
        "context": "Version control platform for code hosting"
    }
]

results = await validate_hyperlinks_batch_core(links)
print(f"Processed {len(results['results'])} links")

Markdown Processing

from url_checker import extract_hyperlinks_from_markdown_core

markdown = "Check out [Python](https://python.org) and [GitHub](https://github.com)"
links = extract_hyperlinks_from_markdown_core(markdown)
print(f"Found {links['total_count']} links")

Screenshot Analysis

from url_checker import capture_full_page_screenshot_with_llm_analysis_core

# Capture full-page screenshot with structured LLM analysis prompt
result = await capture_full_page_screenshot_with_llm_analysis_core(
    url="https://fastapi.tiangolo.com/",
    name="FastAPI Documentation",
    context="Modern Python web framework documentation"
)

if result['success']:
    screenshot_path = result['screenshot_path']      # Local file path
    screenshot_b64 = result['screenshot_base64']     # Base64 for LLM
    llm_prompt = result['llm_analysis_prompt']       # Structured prompt

Features: Full-page capture, local storage with timestamps, base64 encoding, structured LLM prompts, context-aware analysis.

Report Generation

import asyncio

from url_checker import generate_google_docs_report_core, check_job_status_core

# Start report generation (returns job_id immediately)
result = await generate_google_docs_report_core(
    job_id="unique-job-id",
    urls=[
        {"url": "https://example.com", "name": "Example", "context": "Main site"},
        {"url": "https://docs.example.com", "name": "Docs"}
    ],
    ctx=None,
    report_title="Website Validation Report"
)

# Poll for progress
while True:
    status = await check_job_status_core(result["job_id"])
    print(f"Progress: {status['progress']}% - {status['message']}")
    
    if status["status"] == "done":
        print(f"Report: {status['output']['report_path']}")
        break
    elif status["status"] == "failed":
        print(f"Failed: {status['error']}")
        break
    
    await asyncio.sleep(1)

Features: Google Docs compatible HTML, embedded screenshots, progress tracking, cancellation support, session-based organization.

Import to Google Docs: File → Import → Upload the generated HTML file.

Progress Tracking & Job Management

Long-running operations use a job-based polling system:

Workflow

  1. Start job → Get job_id immediately
  2. Poll check_job_status(job_id) repeatedly
  3. Monitor progress (0-100%)
  4. Retrieve results when status is "done"

Progress Stages

  • 0%: Initialization
  • 0-90%: URL processing (proportional: (processed/total) * 90)
  • 90%: HTML generation
  • 100%: Complete
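The proportional formula above can be expressed as a short sketch (illustrative, not the server's code; `report_progress` is a hypothetical name):

```python
def report_progress(processed: int, total: int, html_done: bool = False) -> int:
    """Progress per the stages above: URL phase fills 0-90%, completion is 100%."""
    if html_done:
        return 100                                 # report generation finished
    if total == 0:
        return 0                                   # initialization
    return min(90, int((processed / total) * 90))  # URL phase caps at 90%

print(report_progress(5, 10))   # -> 45
print(report_progress(10, 10))  # -> 90
```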

Status Values

  • "running": Job in progress
  • "done": Completed successfully
  • "failed": Error occurred
  • "cancelled": User cancelled

Cancellation Support

Cancel long-running operations gracefully with cancel_job:

# Start and cancel a job
result = await generate_google_docs_report(urls, "My Report")
cancel_result = await cancel_job(result["job_id"])

# Check final status - returns partial report
status = await check_job_status(result["job_id"])

Behavior

  • Graceful: Current URL completes before stopping
  • Partial Results: Processed URLs saved to _PARTIAL.html
  • Preserved Data: All screenshots and progress maintained
  • Status Transition: running → cancelling → cancelled

Testing Cancellation

uv run examples/test_cancel_tool.py
uv run examples/test_cancellation.py

Technical Details

Dependencies

  • FastMCP: MCP server framework
  • httpx: Async HTTP client
  • BeautifulSoup4: HTML parsing
  • Playwright: Browser automation
  • scikit-learn: Semantic similarity
  • NLTK: Natural language processing

Performance

  • Async/await for non-blocking operations
  • Parallel batch validation with asyncio
  • NLTK data caching
  • Job-based background processing
  • Session-based file organization
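The parallel batch validation mentioned above can be illustrated with asyncio.gather (an illustrative sketch with hypothetical helper names, not the server's implementation):

```python
import asyncio

async def validate(url: str) -> dict:
    # Stand-in for an HTTP fetch plus scoring (hypothetical helper).
    await asyncio.sleep(0)
    return {"url": url, "ok": True}

async def validate_batch(urls: list[str]) -> list[dict]:
    # All validations are scheduled concurrently and awaited together.
    return await asyncio.gather(*(validate(u) for u in urls))

results = asyncio.run(validate_batch(["https://a.example", "https://b.example"]))
print(len(results))  # -> 2
```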

Contributing

Setup

uv sync --group dev
uv run playwright install chromium

Workflow

  1. Edit code in src/url_checker/
  2. Run tests: uv run pytest or uv run tests/run_tests.py
  3. Test examples to verify functionality
  4. Add tests for new features
  5. Update documentation

Guidelines

  • Follow existing patterns
  • Add type hints and docstrings
  • Include positive and negative test cases
  • Test error conditions and edge cases
  • Keep tool descriptions concise

Troubleshooting

Import Errors

uv sync
uv run python -c "from src.url_checker import validate_url_basic_core; print('✅ OK')"

Playwright Errors

uv run playwright install chromium

Network Timeouts

  • Check internet connectivity
  • Use basic tests: uv run tests/run_tests.py basic

Permission Errors

  • Ensure write permissions for ./screenshots and ./reports
  • Check disk space availability

Job Issues

  • Cancellation not working: Wait for current URL to complete
  • Partial report missing: Check write permissions and disk space
  • Status not updating: Verify job_id and polling interval (1-2s)

FAQ

Q: Do I need Playwright for basic validation?
A: No. Playwright is only required for screenshot capture.

Q: Can I use this without an MCP client?
A: Yes. Import core functions directly from src.url_checker in Python code.

Q: How accurate is the validation?
A: 85-90% typical accuracy using semantic similarity, keyword overlap, and meta analysis. Visual verification recommended for inconclusive results.

Q: What do the validation scores mean?
A: ≥0.7 = coherent (high confidence), 0.4-0.69 = inconclusive (needs visual check), <0.4 = non-coherent (likely incorrect).

Q: How does progress tracking work?
A: Long-running operations return a job_id immediately. Poll check_job_status for progress updates (0-100%).

Q: Can I cancel report generation?
A: Yes. Use cancel_job to stop gracefully and get a partial report with processed URLs.

Q: How are screenshots stored?
A: Saved to ./screenshots/[session_id]/ with timestamps. Both PNG and optimized JPEG versions, plus base64 encoding.

Q: Can I import reports into Google Docs?
A: Yes. File → Import → Upload the generated HTML file. Screenshots are embedded.

Q: Which IDEs support MCP?
A: Windsurf IDE, Cursor IDE, and Claude Desktop have built-in MCP support.

License

Open source - modify and distribute freely.


Built with ❤️ using UV and modern Python practices
