Feature Description
Add functionality to download and convert web pages to markdown, similar to browser extensions like Markdown Web Clipper. This would expand markmv's capabilities beyond file operations to include content acquisition, making it a more complete markdown workflow tool.
Proposed Solution
Add a new command, `clip` (or `fetch`), that downloads web pages and converts them to clean markdown:
Example Usage
# Basic usage - download and save with auto-generated filename
npx markmv clip https://example.com/article
# Specify output filename
npx markmv clip https://example.com/article -o article.md
# Download multiple URLs
npx markmv clip urls.txt --batch
# Download with specific options
npx markmv clip https://example.com/article \
  --format clean \
  --images download \
  --output-dir docs/articles/
# Extract only article content (using readability)
npx markmv clip https://example.com/article --article-only
# Include metadata in frontmatter
npx markmv clip https://example.com/article --metadata
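The basic usage above relies on an auto-generated filename, but the proposal does not say how it would be derived. One possible approach, sketched here as an assumption rather than a decided design, is to slugify the last path segment of the URL and fall back to the hostname. The helper name `filenameFromUrl` is hypothetical.

```typescript
// Hypothetical helper: derive a markdown filename from the clipped URL.
function safeDecode(segment: string): string {
  try {
    return decodeURIComponent(segment);
  } catch {
    return segment; // malformed percent-encoding: keep the raw segment
  }
}

function filenameFromUrl(rawUrl: string): string {
  const url = new URL(rawUrl);
  // Use the last non-empty path segment, falling back to the hostname.
  const segments = url.pathname.split("/").filter((s) => s.length > 0);
  const base = segments.pop() ?? url.hostname;
  const slug = safeDecode(base)
    .replace(/\.(html?|php|aspx?)$/i, "") // drop common page extensions
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-") // collapse other characters to hyphens
    .replace(/^-+|-+$/g, ""); // trim leading and trailing hyphens
  return `${slug || "untitled"}.md`;
}

console.log(filenameFromUrl("https://example.com/article"));
```

A real implementation would probably also need to handle name collisions in the output directory, which this sketch ignores.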
Core Features
- Clean markdown extraction
  - Remove unnecessary HTML elements
  - Preserve article structure
  - Convert common HTML patterns to markdown
- Image handling
  - Download images locally
  - Update image paths in markdown
  - Optional: skip image download for text-only
- Metadata preservation
  - Title, author, date
  - Optional frontmatter generation
  - Source URL tracking
- Content extraction modes
  - Full page conversion
  - Article extraction (Readability-style)
  - Custom selectors for specific content
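To make the metadata item concrete, here is a minimal sketch of frontmatter generation. The field names (`title`, `author`, `date`, `source`) and the function `toFrontmatter` are assumptions for illustration; the proposal only lists which pieces of metadata to keep. Values are quoted with `JSON.stringify`, relying on YAML accepting JSON-style double-quoted strings.

```typescript
// Hypothetical shape for metadata gathered while clipping a page.
interface ClipMetadata {
  title: string;
  source: string; // URL the content was clipped from
  author?: string;
  date?: string;
}

// Render the metadata as a YAML frontmatter block, skipping missing fields.
function toFrontmatter(meta: ClipMetadata): string {
  const fields: Array<[string, string | undefined]> = [
    ["title", meta.title],
    ["author", meta.author],
    ["date", meta.date],
    ["source", meta.source],
  ];
  const lines = ["---"];
  for (const [key, value] of fields) {
    if (value === undefined) continue;
    lines.push(`${key}: ${JSON.stringify(value)}`);
  }
  lines.push("---", "");
  return lines.join("\n");
}

console.log(toFrontmatter({ title: "Example", source: "https://example.com/article" }));
```

The returned string ends with a newline so the converted markdown body can be appended directly after it.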
Implementation Suggestions
Dependencies to Consider
- @mozilla/readability - For article extraction
- turndown - HTML to markdown conversion
- node-fetch or axios - HTTP requests
- cheerio - HTML parsing if needed
Configuration Options
# .markmvrc or markmv.config.js
clip:
  output_dir: "./clipped"
  image_dir: "./clipped/images"
  frontmatter: true
  format: "clean" # clean, raw, article
  timeout: 30000
  user_agent: "Mozilla/5.0..."
  ignore_patterns:
    - "*.pdf"
    - "mailto:*"
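As a rough sketch of how these options might be typed and applied, the following mirrors the keys above in a TypeScript interface, merges user overrides onto defaults, and tests a URL against `ignore_patterns`. The names `ClipConfig`, `resolveClipConfig`, and `isIgnored` are hypothetical, the timeout is assumed to be in milliseconds, and `*` is assumed to be the only wildcard.

```typescript
// Hypothetical typed view of the proposed `clip` configuration section.
interface ClipConfig {
  output_dir: string;
  image_dir: string;
  frontmatter: boolean;
  format: "clean" | "raw" | "article";
  timeout: number; // assumed to be milliseconds
  user_agent: string;
  ignore_patterns: string[];
}

const DEFAULT_CLIP_CONFIG: ClipConfig = {
  output_dir: "./clipped",
  image_dir: "./clipped/images",
  frontmatter: true,
  format: "clean",
  timeout: 30000,
  user_agent: "Mozilla/5.0...", // elided in the proposal, kept as written
  ignore_patterns: ["*.pdf", "mailto:*"],
};

// Shallow-merge user overrides onto the defaults.
function resolveClipConfig(overrides: Partial<ClipConfig> = {}): ClipConfig {
  return { ...DEFAULT_CLIP_CONFIG, ...overrides };
}

// Translate a simple glob (only `*` is special) into an anchored RegExp.
function globToRegExp(pattern: string): RegExp {
  const escaped = pattern.replace(/[.+?^${}()|[\]\\]/g, "\\$&");
  return new RegExp(`^${escaped.replace(/\*/g, ".*")}$`, "i");
}

// True when the URL matches any ignore pattern.
function isIgnored(url: string, patterns: string[]): boolean {
  return patterns.some((p) => globToRegExp(p).test(url));
}

const config = resolveClipConfig({ format: "article" });
console.log(isIgnored("https://example.com/paper.pdf", config.ignore_patterns));
```

Matching is done against the whole URL string, so a pattern such as `*.pdf` would not match a PDF link that carries a query string. Whether that is the desired behavior is an open design question.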
Benefits
- Complete workflow - From content discovery to organization
- Consistency - Same tool for acquiring and managing markdown
- Integration - Clipped content automatically benefits from markmv's link management
- Automation - Script documentation gathering from multiple sources
Additional Features to Consider
- Authentication support - For paywalled content (cookies, headers)
- Rate limiting - Respectful crawling with delays
- CSS selector support - Extract specific page sections
- Template system - Custom markdown output formats
- Link preservation - Convert relative to absolute URLs
- Code block detection - Properly format code snippets
- Table support - Convert HTML tables to markdown tables
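For the link preservation item, one way to convert relative targets to absolute URLs is to resolve each inline link against the page URL with the standard `URL` constructor. This is a minimal sketch, not a full markdown parser: the function name `absolutizeLinks` is hypothetical, and reference-style links and targets containing parentheses are not handled.

```typescript
// Hypothetical post-processing step: resolve inline link and image targets
// against the URL of the page they were clipped from.
function absolutizeLinks(markdown: string, baseUrl: string): string {
  // Matches [text](target) and ![alt](target "optional title").
  const inlineLink = /(!?\[[^\]]*\]\()([^)\s]+)((?:\s+"[^"]*")?\))/g;
  return markdown.replace(
    inlineLink,
    (match: string, open: string, target: string, close: string) => {
      if (target.startsWith("#")) return match; // keep in-page anchors
      try {
        return `${open}${new URL(target, baseUrl).href}${close}`;
      } catch {
        return match; // leave unparseable targets untouched
      }
    }
  );
}

console.log(absolutizeLinks("[Docs](../guide/intro.html)", "https://example.com/blog/post/"));
```

If images are downloaded locally, this step would presumably apply only to regular links, with image paths rewritten to the local image directory instead.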
Use Cases
- Documentation aggregation - Collect API docs, guides, tutorials
- Research compilation - Save articles for offline reading
- Knowledge base building - Archive important web content
- Blog migration - Convert HTML posts to markdown
- Tutorial collection - Save programming tutorials locally
Integration with Existing Features
After clipping content, users could:
- Use `validate` to check all links in clipped content
- Use `move` to organize clipped files
- Use `index` to generate navigation for clipped content
- Use `convert` to standardize link formats
This feature would position markmv as a comprehensive markdown toolkit, handling the full lifecycle from content acquisition to maintenance.