arturdj/url-checker

URL Checker

A FastMCP server for intelligent hyperlink validation with semantic analysis, visual verification, and automated reporting. Built with Python and UV for seamless integration with AI coding assistants.

Features

  • πŸ” Smart Validation: Semantic analysis using TF-IDF and cosine similarity scoring
  • 🎯 Context-Aware: Enhanced validation using surrounding text and purpose context
  • πŸ“Έ Visual Verification: Full-page screenshot capture with Playwright
  • πŸ€– LLM-Ready: Structured outputs and analysis prompts for AI workflows
  • ⚑ Batch Processing: Efficient parallel validation of multiple URLs
  • πŸ“ Markdown Support: Extract and validate links with automatic context extraction
  • πŸ“Š Progress Tracking: Job-based polling for long-running operations
  • πŸ›‘ Cancellation Support: Graceful cancellation with partial results preserved
  • πŸ“„ Report Generation: Google Docs-compatible HTML reports with embedded screenshots

Quick Start

Installation

# Clone and setup
git clone <repository-url>
cd url-checker

# Install dependencies
uv sync

# Install Playwright browsers (for screenshots)
uv run playwright install chromium

Basic Usage

# Start the MCP server
uv run main.py

# Run example demonstrations
uv run examples/llm_screenshot_analysis_demo.py
uv run examples/google_docs_report_example.py

# Run tests
uv run tests/run_tests.py

MCP Integration

Windsurf IDE

Add to your Windsurf MCP settings:

{
  "mcpServers": {
    "url-checker": {
      "command": "uv",
      "args": ["--directory", "/absolute/path/to/url-checker", "run", "main.py"]
    }
  }
}

Cursor IDE

Add to your Cursor MCP settings:

{
  "mcpServers": {
    "url-checker": {
      "command": "uv",
      "args": ["--directory", "/absolute/path/to/url-checker", "run", "main.py"]
    }
  }
}

Available Tools

Primary Tools

| Tool | Purpose |
| --- | --- |
| extract_hyperlinks_from_markdown | Parse markdown content and extract links with context |
| validate_hyperlinks_batch | Validate multiple URLs in parallel with semantic analysis |
| capture_full_page_screenshot_with_llm_analysis | Capture full-page screenshots with structured LLM prompts |
| generate_google_docs_report | Generate HTML reports with embedded screenshots |
| check_job_status | Monitor progress of long-running operations |
| cancel_job | Cancel jobs gracefully with partial results |
| get_validation_recommendation | Get recommended actions based on validation results |

Tool Workflow

graph TD
    A[extract_hyperlinks_from_markdown] --> B[validate_hyperlinks_batch]
    B --> C{Needs Visual?}
    C -->|Yes| D[capture_full_page_screenshot_with_llm_analysis]
    C -->|No| E[Accept/Reject]
    D --> F[LLM Analysis]
    
    G[generate_google_docs_report] --> H[Returns job_id]
    H --> I[check_job_status polling]
    I --> J{Status?}
    J -->|running| I
    J -->|running| K[cancel_job optional]
    K --> L[Partial results]
    J -->|done| M[Get report_path]
    J -->|failed| N[Handle error]
    J -->|cancelled| L

Validation Scoring

The system uses weighted semantic analysis with optional context enhancement:
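The TF-IDF and cosine-similarity scoring mentioned in Features can be sketched with scikit-learn (a listed dependency). This is an illustration only, not the server's actual implementation; `semantic_similarity` is a hypothetical helper:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def semantic_similarity(link_name: str, page_text: str) -> float:
    """Cosine similarity between TF-IDF vectors of a link's name and page text."""
    vectors = TfidfVectorizer(stop_words="english").fit_transform([link_name, page_text])
    return float(cosine_similarity(vectors[0], vectors[1])[0, 0])

# A name that matches the page scores well above an unrelated one.
match = semantic_similarity(
    "Python Documentation",
    "The official Python documentation: tutorials and API reference",
)
mismatch = semantic_similarity(
    "Cooking Recipes",
    "The official Python documentation: tutorials and API reference",
)
print(round(match, 2), round(mismatch, 2))
```

In practice this single similarity is one component; it is blended with keyword, meta, and content signals using the weights below.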

Scoring Components

Without Context:

  • Keyword Overlap (30%), Semantic Similarity (25%), Meta Analysis (25%), Content Analysis (20%)

With Context (Enhanced):

  • Title Analysis (45%), Content Analysis (22%), Meta Analysis (18%), Context Analysis (15%)

Score Interpretation

  • ≥ 0.7: High confidence (coherent) - Link matches content
  • 0.4-0.69: Medium confidence (inconclusive) - Visual verification recommended
  • < 0.4: Low confidence (non-coherent) - Likely incorrect link
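A minimal sketch of how the weights and thresholds above combine (illustrative only; the function names are hypothetical, not the server's API):

```python
# Weights from "Without Context" above (30/25/25/20); thresholds from
# "Score Interpretation".
def combined_score(keyword: float, semantic: float, meta: float, content: float) -> float:
    return 0.30 * keyword + 0.25 * semantic + 0.25 * meta + 0.20 * content

def interpret_score(score: float) -> str:
    if score >= 0.7:
        return "coherent"        # high confidence
    if score >= 0.4:
        return "inconclusive"    # visual verification recommended
    return "non-coherent"        # likely incorrect link

score = combined_score(keyword=0.8, semantic=0.7, meta=0.6, content=0.5)
print(round(score, 3), interpret_score(score))  # -> 0.665 inconclusive
```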

Context Benefits

Context enhancement improves accuracy for generic link names, disambiguates similar URLs, and provides automatic context extraction from markdown.

Project Structure

url-checker/
├── src/url_checker/        # Core package
│   ├── __init__.py         # Package exports
│   └── server.py           # FastMCP server implementation
├── tests/                  # Test suite
│   ├── test_url_checker.py # Comprehensive tests
│   └── run_tests.py        # Test runner
├── examples/               # Usage examples
│   ├── llm_screenshot_analysis_demo.py # Comprehensive LLM demo
│   ├── google_docs_report_example.py   # Report generation demo
│   ├── test_job_polling.py             # Job progress tracking demo
│   ├── test_cancellation.py            # Cancellation behavior demo
│   └── test_cancel_tool.py             # cancel_job MCP tool demo
├── reports/                # Generated HTML reports (created at runtime)
│   └── [session_id]/       # Session-specific reports
├── screenshots/            # Captured screenshots (created at runtime)
│   └── [session_id]/       # Session-specific screenshots
├── main.py                 # Server entry point
├── pyproject.toml          # Project configuration
├── .gitignore              # Git ignore patterns
└── README.md               # This documentation

Testing

Quick Test

# Run basic tests (no network required)
uv run tests/run_tests.py basic

Full Test Suite

# All tests including network and browser
uv run tests/run_tests.py full

# Or use pytest
uv run pytest

# Run with verbose output
uv run pytest -v

Test Categories

  • Basic Tests: Text processing, scoring logic (no network)
  • Network Tests: Real URL validation (requires internet)
  • Browser Tests: Screenshot capture (requires Playwright)
  • Job Tests: Progress tracking and cancellation

Test Files

  • tests/test_url_checker.py: Comprehensive standalone test suite
  • tests/run_tests.py: Simple test runner with options
  • examples/test_job_polling.py: Job-based progress tracking demo
  • examples/test_cancellation.py: Cancellation behavior demo

Prerequisites

# Install main dependencies
uv sync

# Install development dependencies (for pytest)
uv sync --group dev

# Install Playwright browsers (for screenshot tests)
uv run playwright install chromium

Note: Screenshot tests may fail in headless environments or if Playwright browsers are not installed. This is expected and handled gracefully.

Expected Test Output

🚀 Starting URL Checker Test Suite
============================================================

📋 Testing URLAnalyzer Basic Functions
----------------------------------------
✅ URLAnalyzer.clean_text
✅ URLAnalyzer.extract_keywords
✅ URLAnalyzer.calculate_coherence_score

🔍 Testing validate_url_basic_core
----------------------------------------
✅ validate_url_basic_core - valid URL
✅ validate_url_basic_core - invalid URL

... (more tests) ...

============================================================
TEST SUMMARY
============================================================
Tests run: 25
Passed: 25
Failed: 0
Duration: 15.23s
Success rate: 100.0%

🎉 All tests passed!

Debugging Test Failures

# Enable verbose output
uv run pytest -v -s

# Check dependencies
uv run python -c "from src.url_checker import URLAnalyzer; print('✅ Imports OK')"

Continuous Integration

For CI/CD pipelines, use basic tests to avoid network and browser dependencies:

uv run tests/run_tests.py basic

Usage Examples

Single URL Validation

from url_checker import validate_url_basic_core

# Basic validation
result = await validate_url_basic_core(
    name="Python Documentation", 
    url="https://docs.python.org/"
)

# Enhanced validation with context
result_with_context = await validate_url_basic_core(
    name="Python Documentation", 
    url="https://docs.python.org/",
    context="Official Python programming language documentation with tutorials and API reference"
)

if result_with_context["validation_result"] == "coherent":
    print("✅ Valid link")
elif result_with_context["needs_screenshot"]:
    print("📸 Needs visual verification")

Context-Aware Validation

# Manual context for precise validation
result = await validate_url_basic_core(
    name="API Docs",
    url="https://docs.example.com",
    context="REST API documentation for authentication endpoints"
)

# Automatic context extraction from markdown
from url_checker import extract_hyperlinks_from_markdown_core

markdown_content = """
# Development Guide
For API integration, see [API Documentation](https://docs.example.com) 
which provides comprehensive endpoint references.
"""

links = extract_hyperlinks_from_markdown_core(markdown_content)
# Each link now includes surrounding text as context
print(links['hyperlinks'][0]['context'])
# Output: "For API integration, see which provides comprehensive endpoint references."

Batch Processing

from url_checker import validate_hyperlinks_batch_core

# Batch validation with context
links = [
    {
        "name": "Python", 
        "url": "https://python.org",
        "context": "Official Python programming language website"
    },
    {
        "name": "GitHub", 
        "url": "https://github.com",
        "context": "Version control platform for code hosting"
    }
]

results = await validate_hyperlinks_batch_core(links)
print(f"Processed {len(results['results'])} links")

Markdown Processing

from url_checker import extract_hyperlinks_from_markdown_core

markdown = "Check out [Python](https://python.org) and [GitHub](https://github.com)"
links = extract_hyperlinks_from_markdown_core(markdown)
print(f"Found {links['total_count']} links")

Screenshot Analysis

from url_checker import capture_full_page_screenshot_with_llm_analysis_core

# Capture full-page screenshot with structured LLM analysis prompt
result = await capture_full_page_screenshot_with_llm_analysis_core(
    url="https://fastapi.tiangolo.com/",
    name="FastAPI Documentation",
    context="Modern Python web framework documentation"
)

if result['success']:
    screenshot_path = result['screenshot_path']      # Local file path
    screenshot_b64 = result['screenshot_base64']     # Base64 for LLM
    llm_prompt = result['llm_analysis_prompt']       # Structured prompt

Features: Full-page capture, local storage with timestamps, base64 encoding, structured LLM prompts, context-aware analysis.

Report Generation

import asyncio

from url_checker import generate_google_docs_report_core, check_job_status_core

# Start report generation (returns job_id immediately)
result = await generate_google_docs_report_core(
    job_id="unique-job-id",
    urls=[
        {"url": "https://example.com", "name": "Example", "context": "Main site"},
        {"url": "https://docs.example.com", "name": "Docs"}
    ],
    ctx=None,
    report_title="Website Validation Report"
)

# Poll for progress
while True:
    status = await check_job_status_core(result["job_id"])
    print(f"Progress: {status['progress']}% - {status['message']}")
    
    if status["status"] == "done":
        print(f"Report: {status['output']['report_path']}")
        break
    elif status["status"] == "failed":
        print(f"Failed: {status['error']}")
        break
    
    await asyncio.sleep(1)

Features: Google Docs compatible HTML, embedded screenshots, progress tracking, cancellation support, session-based organization.

Import to Google Docs: File → Import → Upload the generated HTML file.

Progress Tracking & Job Management

Long-running operations use a job-based polling system:

Workflow

  1. Start job → Get job_id immediately
  2. Poll check_job_status(job_id) repeatedly
  3. Monitor progress (0-100%)
  4. Retrieve results when status is "done"

Progress Stages

  • 0%: Initialization
  • 0-90%: URL processing (proportional: (processed/total) * 90)
  • 90%: HTML generation
  • 100%: Complete
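The proportional formula above can be expressed as a short sketch (illustrative, not the server's code; `report_progress` is a hypothetical name):

```python
def report_progress(processed: int, total: int, html_done: bool = False) -> int:
    """Progress per the stages above: URL phase fills 0-90%, completion is 100%."""
    if html_done:
        return 100                                 # report generation finished
    if total == 0:
        return 0                                   # initialization
    return min(90, int((processed / total) * 90))  # URL phase caps at 90%

print(report_progress(5, 10))   # -> 45
print(report_progress(10, 10))  # -> 90
```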

Status Values

  • "running": Job in progress
  • "done": Completed successfully
  • "failed": Error occurred
  • "cancelled": User cancelled

Cancellation Support

Cancel long-running operations gracefully with cancel_job:

# Start and cancel a job
result = await generate_google_docs_report(urls, "My Report")
cancel_result = await cancel_job(result["job_id"])

# Check final status - returns partial report
status = await check_job_status(result["job_id"])

Behavior

  • Graceful: Current URL completes before stopping
  • Partial Results: Processed URLs saved to _PARTIAL.html
  • Preserved Data: All screenshots and progress maintained
  • Status Transition: running → cancelling → cancelled

Testing Cancellation

uv run examples/test_cancel_tool.py
uv run examples/test_cancellation.py

Technical Details

Dependencies

  • FastMCP: MCP server framework
  • httpx: Async HTTP client
  • BeautifulSoup4: HTML parsing
  • Playwright: Browser automation
  • scikit-learn: Semantic similarity
  • NLTK: Natural language processing

Performance

  • Async/await for non-blocking operations
  • Parallel batch validation with asyncio
  • NLTK data caching
  • Job-based background processing
  • Session-based file organization
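The parallel batch validation mentioned above can be illustrated with asyncio.gather (an illustrative sketch with hypothetical helper names, not the server's implementation):

```python
import asyncio

async def validate(url: str) -> dict:
    # Stand-in for an HTTP fetch plus scoring (hypothetical helper).
    await asyncio.sleep(0)
    return {"url": url, "ok": True}

async def validate_batch(urls: list[str]) -> list[dict]:
    # All validations are scheduled concurrently and awaited together.
    return await asyncio.gather(*(validate(u) for u in urls))

results = asyncio.run(validate_batch(["https://a.example", "https://b.example"]))
print(len(results))  # -> 2
```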

Contributing

Setup

uv sync --group dev
uv run playwright install chromium

Workflow

  1. Edit code in src/url_checker/
  2. Run tests: uv run pytest or uv run tests/run_tests.py
  3. Test examples to verify functionality
  4. Add tests for new features
  5. Update documentation

Guidelines

  • Follow existing patterns
  • Add type hints and docstrings
  • Include positive and negative test cases
  • Test error conditions and edge cases
  • Keep tool descriptions concise

Troubleshooting

Import Errors

uv sync
uv run python -c "from src.url_checker import validate_url_basic_core; print('✅ OK')"

Playwright Errors

uv run playwright install chromium

Network Timeouts

  • Check internet connectivity
  • Use basic tests: uv run tests/run_tests.py basic

Permission Errors

  • Ensure write permissions for ./screenshots and ./reports
  • Check disk space availability

Job Issues

  • Cancellation not working: Wait for current URL to complete
  • Partial report missing: Check write permissions and disk space
  • Status not updating: Verify job_id and polling interval (1-2s)

FAQ

Q: Do I need Playwright for basic validation?
A: No. Playwright is only required for screenshot capture.

Q: Can I use this without an MCP client?
A: Yes. Import core functions directly from src.url_checker in Python code.

Q: How accurate is the validation?
A: 85-90% typical accuracy using semantic similarity, keyword overlap, and meta analysis. Visual verification recommended for inconclusive results.

Q: What do the validation scores mean?
A: ≥0.7 = coherent (high confidence), 0.4-0.69 = inconclusive (needs visual check), <0.4 = non-coherent (likely incorrect).

Q: How does progress tracking work?
A: Long-running operations return a job_id immediately. Poll check_job_status for progress updates (0-100%).

Q: Can I cancel report generation?
A: Yes. Use cancel_job to stop gracefully and get a partial report with processed URLs.

Q: How are screenshots stored?
A: Saved to ./screenshots/[session_id]/ with timestamps. Both PNG and optimized JPEG versions, plus base64 encoding.

Q: Can I import reports into Google Docs?
A: Yes. File → Import → Upload the generated HTML file. Screenshots are embedded.

Q: Which IDEs support MCP?
A: Windsurf IDE, Cursor IDE, and Claude Desktop have built-in MCP support.

License

Open source - modify and distribute freely.


Built with ❤️ using UV and modern Python practices
