@Mearman Mearman commented Jul 30, 2025

Summary

Implements content freshness detection for external links to identify potentially stale documentation even when links return 200 OK status codes. This addresses real-world problems like outdated Firebase docs, deprecated GitHub Actions syntax, and moved API documentation.

Key Features

  • Content Age Analysis: Detects stale content using HTTP Last-Modified headers with configurable age thresholds
  • Pattern-Based Staleness Detection: Identifies content with deprecation indicators like "deprecated", "no longer supported", "archived", etc.
  • Domain-Specific Thresholds: Different staleness thresholds for different domains (e.g., 6 months for GitHub/Firebase vs 2 years default)
  • Content Change Detection: SHA-256 hashing with normalization to detect significant content changes between validation runs
  • Performance Optimization: File-based caching with TTL to avoid redundant analysis
  • CLI Integration: New options --check-content-freshness and --freshness-threshold <days>
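
To make the age check concrete, here is a minimal sketch of how a Last-Modified header might be compared against a per-domain threshold. All names (`thresholdDaysFor`, `isStaleByAge`, `DOMAIN_THRESHOLD_DAYS`) and the exact hostnames are assumptions for illustration, not the PR's actual API. Only the 6-month and 2-year values come from the description above.

```typescript
// Hypothetical sketch; names and hostnames are assumed, not taken from the PR.
const DAY_MS = 24 * 60 * 60 * 1000;

// Thresholds in days: roughly 6 months for GitHub/Firebase, 2 years otherwise.
const DOMAIN_THRESHOLD_DAYS: Record<string, number> = {
  "github.com": 180,
  "firebase.google.com": 180,
};
const DEFAULT_THRESHOLD_DAYS = 730;

// Pick the threshold for a URL, matching the domain itself or any subdomain.
function thresholdDaysFor(url: string, defaultDays: number = DEFAULT_THRESHOLD_DAYS): number {
  const host = new URL(url).hostname;
  for (const [domain, days] of Object.entries(DOMAIN_THRESHOLD_DAYS)) {
    if (host === domain || host.endsWith("." + domain)) return days;
  }
  return defaultDays;
}

// Returns true/false when the header is usable, undefined when age is unknown.
function isStaleByAge(
  url: string,
  lastModified: string | null,
  now: Date = new Date(),
): boolean | undefined {
  if (!lastModified) return undefined;
  const modified = Date.parse(lastModified);
  if (Number.isNaN(modified)) return undefined;
  const ageDays = (now.getTime() - modified) / DAY_MS;
  return ageDays > thresholdDaysFor(url);
}
```

Returning `undefined` for a missing or unparseable header keeps "unknown age" distinct from "fresh", so a caller can fall back to other signals such as pattern matching.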

Implementation Details

  • ContentFreshnessDetector: Core freshness detection with configurable thresholds and pattern matching
  • Enhanced LinkValidator: Integrated freshness detection with GET request support for content analysis
  • Updated validate command: Added freshness statistics counting and enhanced output formatting with [STALE] indicators
  • Comprehensive testing: 47 test cases covering unit tests, integration tests, and end-to-end scenarios
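
As an illustration of the pattern matching mentioned above, the following sketch scans fetched page text for a few deprecation phrases. The function name `detectStalenessPhrases` and the phrase list are assumptions. The PR's real pattern set is likely larger and may be structured differently.

```typescript
// Hypothetical sketch; the phrase list is a small assumed subset.
const STALENESS_PHRASES: string[] = ["deprecated", "no longer supported", "archived"];

// Return the phrases that occur as whole words in the content (case-insensitive).
function detectStalenessPhrases(content: string): string[] {
  const text = content.toLowerCase();
  return STALENESS_PHRASES.filter((phrase) =>
    // The phrases contain no regex metacharacters, so no escaping is needed here.
    new RegExp(`\\b${phrase}\\b`).test(text),
  );
}
```

Word boundaries avoid matching a phrase embedded in a longer word, although a plain phrase match can still flag pages that merely discuss deprecation.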

Example Usage

# Enable freshness detection with default 2-year threshold
markmv validate "**/*.md" --check-external --check-content-freshness

# Use custom threshold of 1 year
markmv validate docs/ --check-external --check-content-freshness --freshness-threshold 365

# Show detailed freshness information
markmv validate README.md --check-external --check-content-freshness --verbose

Test Coverage

  • Unit Tests: ContentFreshnessDetector functionality with 22 test cases
  • Integration Tests: LinkValidator integration with 15 test cases
  • End-to-End Tests: Complete validation pipeline with 10 test cases
  • Cross-platform compatibility: Ensures consistent behavior across environments

Resolves #35

Test Plan

  • All existing tests pass
  • New comprehensive test suites for freshness detection (47 tests total)
  • Manual testing with real-world URLs
  • CLI integration testing with various options
  • Performance testing with caching scenarios
  • Cross-platform compatibility verification

Implements a comprehensive content freshness detection system to identify
potentially stale external documentation even when links return 200 OK.

## Key Features
- **Last-Modified Detection**: Checks HTTP Last-Modified headers against
  configurable staleness thresholds (default: 2 years)
- **Content Pattern Detection**: Identifies deprecation notices, moved pages,
  legacy documentation, and other staleness indicators
- **Content Change Detection**: Tracks content changes between validations
  using SHA-256 content hashes with normalization
- **Domain-Specific Thresholds**: Customizable staleness periods for different
  domains (e.g., 6 months for GitHub/Firebase vs 2 years for general sites)
- **Smart Caching**: Caches validation results with TTL and content-based
  invalidation to improve performance
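
A minimal sketch of hash-based change detection, assuming the normalization strips markup, collapses whitespace, and lowercases the text before hashing. The actual normalization rules are not spelled out here, and the function names are hypothetical.

```typescript
import { createHash } from "node:crypto";

// Assumed normalization: drop scripts, styles, and tags, collapse whitespace, lowercase.
function normalizeContent(html: string): string {
  return html
    .replace(/<script[\s\S]*?<\/script>/gi, " ")
    .replace(/<style[\s\S]*?<\/style>/gi, " ")
    .replace(/<[^>]+>/g, " ")
    .replace(/\s+/g, " ")
    .trim()
    .toLowerCase();
}

// SHA-256 hex digest of the normalized content.
function contentHash(html: string): string {
  return createHash("sha256").update(normalizeContent(html), "utf8").digest("hex");
}

// A change is reported only when a previous hash exists and differs.
function hasContentChanged(previousHash: string | undefined, html: string): boolean {
  return previousHash !== undefined && previousHash !== contentHash(html);
}
```

Hashing normalized text rather than raw bytes means cosmetic differences such as whitespace or casing do not register as a content change.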

## CLI Integration
- `--check-content-freshness`: Enable staleness detection
- `--freshness-threshold <days>`: Configure staleness threshold (default: 730)
- Enhanced output showing fresh vs stale link counts with detailed warnings
- Stale links are marked with a [STALE] indicator and include suggestions

## Implementation
- ContentFreshnessDetector class with configurable thresholds and patterns
- Enhanced LinkValidator with GET requests for content analysis
- Extended BrokenLink type with freshness information
- Comprehensive test coverage with unit, integration, and CLI tests
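
The TTL-based caching listed under Key Features could look roughly like the sketch below: a JSON file keyed by URL, where an entry is reused only while it is younger than the TTL. The file layout, the `CacheEntry` shape, and all function names are assumptions, not the PR's implementation.

```typescript
import { existsSync, readFileSync, writeFileSync } from "node:fs";

// Hypothetical cache record: the last content hash and when it was computed.
interface CacheEntry {
  hash: string;
  checkedAt: number; // epoch milliseconds
}
type CacheFile = Record<string, CacheEntry>;

// Load the cache, treating a missing or corrupt file as empty.
function loadCache(path: string): CacheFile {
  if (!existsSync(path)) return {};
  try {
    return JSON.parse(readFileSync(path, "utf8")) as CacheFile;
  } catch {
    return {};
  }
}

// Return the entry only if it is still within its TTL.
function getFresh(
  cache: CacheFile,
  url: string,
  ttlMs: number,
  now: number = Date.now(),
): CacheEntry | undefined {
  const entry = cache[url];
  return entry && now - entry.checkedAt <= ttlMs ? entry : undefined;
}

// Persist the cache back to disk.
function saveCache(path: string, cache: CacheFile): void {
  writeFileSync(path, JSON.stringify(cache, null, 2), "utf8");
}
```

Passing `now` as a parameter keeps expiry logic testable without waiting for real time to elapse.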

## Output Example
```
📊 Validation Summary
Files processed: 3
Total links found: 15
Broken links: 2
Fresh external links: 8
Stale external links: 2

🔗 Broken Links Found:
  📄 docs/api.md (1 broken):
    ❌ [external] https://api.example.com/deprecated (line 42) [STALE]
       Warning: Content contains staleness indicators
       Suggestion: Review content for updates or alternatives
       Detected patterns: deprecated, no longer supported
```

Resolves #35
