Feature Request: Add web page to markdown conversion (like Markdown Web Clipper) #38

@Mearman

Feature Description

Add functionality to download and convert web pages to markdown, similar to browser extensions like Markdown Web Clipper. This would expand markmv's capabilities beyond file operations to include content acquisition, making it a more complete markdown workflow tool.

Proposed Solution

Add a new command, clip (or fetch), that downloads web pages and converts them to clean markdown.

Example Usage

# Basic usage - download and save with auto-generated filename
npx markmv clip https://example.com/article

# Specify output filename
npx markmv clip https://example.com/article -o article.md

# Download multiple URLs listed in a file
npx markmv clip urls.txt --batch

# Download with specific options
npx markmv clip https://example.com/article \
  --format clean \
  --images download \
  --output-dir docs/articles/

# Extract only article content (Readability-style extraction)
npx markmv clip https://example.com/article --article-only

# Include metadata in frontmatter
npx markmv clip https://example.com/article --metadata
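
The first example relies on an auto-generated filename. One way that could work is sketched below. The function name filenameFromUrl and the slug rules are assumptions for illustration, not existing markmv behavior.

```typescript
// Hypothetical helper: derive a markdown filename from the clipped URL.
// The slug rules are an assumption, not something markmv defines today.
function filenameFromUrl(rawUrl: string): string {
  const url = new URL(rawUrl);
  // Use the last non-empty path segment; fall back to the hostname for bare domains.
  const segments = url.pathname.split("/").filter((s) => s.length > 0);
  const last = segments[segments.length - 1];
  const base =
    last !== undefined
      ? last.replace(/\.[a-z0-9]+$/i, "") // drop an extension such as .html
      : url.hostname;
  const slug = base
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-") // collapse anything non-alphanumeric into a dash
    .replace(/^-+|-+$/g, ""); // trim leading and trailing dashes
  return `${slug || "untitled"}.md`;
}
```

A real implementation would probably prefer the page title when one is available and would need to handle name collisions in the output directory.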

Core Features

  1. Clean markdown extraction
    • Remove unnecessary HTML elements
    • Preserve article structure
    • Convert common HTML patterns to markdown
  2. Image handling
    • Download images locally
    • Update image paths in markdown
    • Optional: skip image download for text-only output
  3. Metadata preservation
    • Title, author, date
    • Optional frontmatter generation
    • Source URL tracking
  4. Content extraction modes
    • Full page conversion
    • Article extraction (Readability-style)
    • Custom selectors for specific content
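
To make the metadata item concrete, here is a minimal sketch of frontmatter generation. The ClipMetadata shape and the toFrontmatter name are hypothetical. The field names simply mirror the list above (title, author, date, source URL).

```typescript
// Hypothetical metadata shape; field names follow the list above.
interface ClipMetadata {
  title: string;
  source: string; // URL the page was clipped from
  author?: string;
  date?: string;
}

// Render YAML frontmatter, skipping fields that are not set.
function toFrontmatter(meta: ClipMetadata): string {
  const entries: Array<[string, string | undefined]> = [
    ["title", meta.title],
    ["author", meta.author],
    ["date", meta.date],
    ["source", meta.source],
  ];
  const lines = entries
    .filter((e): e is [string, string] => e[1] !== undefined)
    // JSON string syntax is also a valid YAML double-quoted scalar,
    // so JSON.stringify handles quoting and escaping.
    .map(([key, value]) => `${key}: ${JSON.stringify(value)}`);
  return ["---", ...lines, "---", ""].join("\n");
}
```

The result would be prepended to the converted markdown when --metadata (or frontmatter: true) is set.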

Implementation Suggestions

Dependencies to Consider

  • @mozilla/readability - For article extraction
  • turndown - HTML to markdown conversion
  • node-fetch or axios - HTTP requests
  • cheerio - HTML parsing if needed
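
One possible way to keep these dependencies swappable is to hide each one behind a small function type. The sketch below assumes that design. Every name in it (Extractor, Converter, clipHtml) is hypothetical, and the stub stages are crude regex stand-ins so the example runs without third-party packages. In practice the extractor would wrap @mozilla/readability and the converter would wrap turndown.

```typescript
// Hypothetical stage signatures; none of these names exist in markmv.
interface ExtractedArticle {
  title: string;
  html: string;
}
type Extractor = (html: string, url: string) => ExtractedArticle; // e.g. backed by @mozilla/readability
type Converter = (html: string) => string; // e.g. backed by turndown

interface ClipResult {
  title: string;
  markdown: string;
  source: string;
}

// Compose the stages: extract the main content, then convert it to markdown.
function clipHtml(html: string, url: string, extract: Extractor, convert: Converter): ClipResult {
  const article = extract(html, url);
  return { title: article.title, markdown: convert(article.html).trim() + "\n", source: url };
}

// Stand-in stages for illustration only; far too naive for real pages.
const stubExtract: Extractor = (html) => {
  const title = /<title>(.*?)<\/title>/i.exec(html)?.[1] ?? "Untitled";
  const body = /<article>([\s\S]*?)<\/article>/i.exec(html)?.[1] ?? html;
  return { title, html: body };
};
const stubConvert: Converter = (html) =>
  html
    .replace(/<h1>(.*?)<\/h1>/gi, "# $1\n\n")
    .replace(/<p>(.*?)<\/p>/gi, "$1\n\n")
    .replace(/<[^>]+>/g, ""); // strip any remaining tags
```

Fetching the page (with node-fetch, axios, or the built-in fetch) would happen before clipHtml is called, which keeps the conversion step easy to test offline.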

Configuration Options

# .markmvrc or markmv.config.js
clip:
  output_dir: "./clipped"
  image_dir: "./clipped/images"
  frontmatter: true
  format: "clean" # clean, raw, article
  timeout: 30000 # milliseconds (30 s)
  user_agent: "Mozilla/5.0..."
  ignore_patterns:
    - "*.pdf"
    - "mailto:*"
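
The ignore_patterns entries look like simple globs. Assuming they are matched against whole URLs and that * means any run of characters, the check could be sketched as follows. Both function names are hypothetical.

```typescript
// Convert a simple glob (only * is special) into an anchored regular expression.
function globToRegExp(pattern: string): RegExp {
  // Escape regex metacharacters except *, which is handled below.
  const escaped = pattern.replace(/[.+?^${}()|[\]\\]/g, "\\$&");
  return new RegExp(`^${escaped.replace(/\*/g, ".*")}$`, "i");
}

// True when the URL matches any ignore pattern.
function isIgnored(url: string, patterns: string[]): boolean {
  return patterns.some((p) => globToRegExp(p).test(url));
}
```

With this interpretation, a URL such as https://example.com/paper.pdf?download=1 would not match "*.pdf", so a real implementation might match against the URL path instead of the full string.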

Benefits

  1. Complete workflow - From content discovery to organization
  2. Consistency - Same tool for acquiring and managing markdown
  3. Integration - Clipped content automatically benefits from markmv's link management
  4. Automation - Script documentation gathering from multiple sources

Additional Features to Consider

  • Authentication support - For paywalled content (cookies, headers)
  • Rate limiting - Respectful crawling with delays
  • CSS selector support - Extract specific page sections
  • Template system - Custom markdown output formats
  • Link preservation - Convert relative URLs to absolute URLs
  • Code block detection - Properly format code snippets
  • Table support - Convert HTML tables to markdown tables
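
As an illustration of the link preservation item, the sketch below rewrites relative link and image targets in already-converted markdown against the source page URL. The name absolutizeLinks is hypothetical, and the regex only covers inline links without titles. Reference-style links and raw HTML are out of scope here.

```typescript
// Resolve relative link and image targets against the page they were clipped from.
function absolutizeLinks(markdown: string, baseUrl: string): string {
  // Matches [text](href) and ![alt](href). Links with a title, such as
  // [text](href "title"), do not match and are left unchanged.
  const inlineLink = /(!?\[[^\]]*\]\()([^)\s]+)(\))/g;
  return markdown.replace(inlineLink, (match, open: string, href: string, close: string) => {
    try {
      // The URL constructor returns absolute hrefs unchanged and resolves relative ones.
      return `${open}${new URL(href, baseUrl).href}${close}`;
    } catch {
      return match; // leave unparseable targets untouched
    }
  });
}
```

When images are downloaded locally, image targets would instead be rewritten to the local image directory, so this step would apply only to the remaining links.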

Use Cases

  1. Documentation aggregation - Collect API docs, guides, tutorials
  2. Research compilation - Save articles for offline reading
  3. Knowledge base building - Archive important web content
  4. Blog migration - Convert HTML posts to markdown
  5. Tutorial collection - Save programming tutorials locally

Integration with Existing Features

After clipping content, users could:

  • Use validate to check all links in clipped content
  • Use move to organize clipped files
  • Use index to generate navigation for clipped content
  • Use convert to standardize link formats

This feature would position markmv as a comprehensive markdown toolkit, handling the full lifecycle from content acquisition to maintenance.
