A Go-based tool for scraping and downloading media (images, videos, and other files) from Lemmy instances. Features intelligent deduplication using content hashing and comprehensive metadata storage.
Warning
This project is almost entirely vibe-coded as an experiment. It works and I've reviewed the code, but be careful...
- Multi-instance support: Connect to any Lemmy instance
- Community-specific scraping: Target specific communities or scrape from the hot page
- Intelligent deduplication: Uses SHA-256 content hashing to avoid downloading duplicates
- Comprehensive metadata: Stores post details, community info, author data, and more in SQLite
- Multiple run modes: One-time execution or continuous monitoring
- Flexible media filtering: Choose which media types to download (images, videos, other)
- Organized storage: Files automatically organized by community
- Smart pagination: Configurable limits with optional stopping at previously seen posts
- Web UI: Browse and manage downloaded media with a modern HTMX-based interface
- Full-text search: Fast FTS5-powered search across titles, communities, creators, and URLs
- Tag system: Organize media with user-defined tags or AI-powered auto-tagging
- Thumbnail generation: Automatic thumbnails for images and videos (FFmpeg required for videos)
- Image recognition: Optional AI-powered classification using Ollama vision models
- Statistics dashboard: Timeline charts, top creators, storage breakdown, and more
- Real-time progress: WebSocket-based live updates during scraping
- Go 1.21 or later
- SQLite3
- A Lemmy account on the instance you want to scrape
Download the latest release for your platform from the Releases page.
Linux (x86_64):

```bash
wget https://github.com/ST2Projects/lemmy-media-scraper/releases/latest/download/lemmy-scraper_*_Linux_x86_64.tar.gz
tar -xzf lemmy-scraper_*_Linux_x86_64.tar.gz
./lemmy-scraper -config config.example.yaml
```

macOS (Apple Silicon):

```bash
wget https://github.com/ST2Projects/lemmy-media-scraper/releases/latest/download/lemmy-scraper_*_Darwin_arm64.tar.gz
tar -xzf lemmy-scraper_*_Darwin_arm64.tar.gz
./lemmy-scraper -config config.example.yaml
```

Each release includes:

- The `lemmy-scraper` binary
- `config.example.yaml` - Example configuration file
- `README.md` - Documentation
Run with Docker Compose (recommended):
```bash
git clone https://github.com/ST2Projects/lemmy-media-scraper.git
cd lemmy-media-scraper
mkdir -p config downloads
cp config.docker.yaml config/config.yaml
# Edit config/config.yaml with your credentials
docker-compose up -d
```

Or use the pre-built image:

```bash
docker pull ghcr.io/ST2Projects/lemmy-media-scraper:latest
```

See README.Docker.md for detailed Docker deployment instructions.
```bash
git clone https://github.com/ST2Projects/lemmy-media-scraper.git
cd lemmy-media-scraper

# Build with full-text search support (recommended)
make build-fts5

# Or build manually
CGO_ENABLED=1 go build -tags fts5 -o lemmy-scraper ./cmd/scraper
```

Note: The `fts5` build tag is required for full-text search functionality. Without it, the application will work but search will be disabled.
Create a `config.yaml` file based on the provided example:

```bash
cp config.example.yaml config.yaml
```

Edit `config.yaml` with your settings:
```yaml
lemmy:
  instance: "lemmy.ml"          # Lemmy instance (without https://)
  username: "your_username"     # Your Lemmy username
  password: "your_password"     # Your Lemmy password
  communities: []               # Leave empty for hot page, or list communities

storage:
  base_directory: "./downloads" # Where to save media files

database:
  path: "./lemmy-scraper.db"    # SQLite database path

scraper:
  max_posts_per_run: 100        # Maximum posts to scrape per run
  stop_at_seen_posts: true      # Stop when encountering seen posts
  sort_type: "Hot"              # Hot, New, TopDay, TopWeek, etc.
  include_images: true          # Download images
  include_videos: true          # Download videos
  include_other_media: true     # Download other media types

run_mode:
  mode: "once"                  # "once" or "continuous"
  interval: "30m"               # Interval for continuous mode
```

- instance: The Lemmy instance hostname (e.g., `lemmy.ml`, `lemmy.world`)
- username: Your Lemmy account username (required for authentication)
- password: Your Lemmy account password
- communities: List of communities to scrape. Examples:
  - `[]` - Empty list scrapes from the instance hot page
  - `["technology", "linux"]` - Scrapes specific communities
  - `["[email protected]", "[email protected]"]` - Scrapes communities from specific instances
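Community entries can carry an instance suffix in the `name@instance` form. A minimal sketch of splitting such a spec into a name and an optional host (the `splitCommunity` helper and the example community names are illustrative, not part of the codebase):

```go
package main

import (
	"fmt"
	"strings"
)

// splitCommunity separates a community spec such as "technology@lemmy.ml"
// into its name and optional instance host. A bare name like "linux"
// returns an empty host, meaning "use the configured instance".
func splitCommunity(spec string) (name, host string) {
	if n, h, ok := strings.Cut(spec, "@"); ok {
		return n, h
	}
	return spec, ""
}

func main() {
	name, host := splitCommunity("technology@lemmy.ml")
	fmt.Println(name, host) // technology lemmy.ml

	name, host = splitCommunity("linux")
	fmt.Printf("%s %q\n", name, host) // linux ""
}
```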
- base_directory: Root directory for downloaded media. Files are organized as:

```
downloads/
├── technology/
│   ├── 12345_image.jpg
│   └── 12346_video.mp4
└── linux/
    └── 12347_photo.png
```
- path: Location of the SQLite database file for tracking scraped media
- max_posts_per_run: Maximum number of posts to process per community/run
- stop_at_seen_posts: Stop scraping when encountering a previously processed post
- sort_type: How to sort posts. Options:
  - `Hot` - Currently trending posts
  - `New` - Newest posts first
  - `TopDay` - Top posts from the last day
  - `TopWeek` - Top posts from the last week
  - `TopMonth` - Top posts from the last month
  - `TopYear` - Top posts from the last year
  - `TopAll` - Top posts of all time
- include_images: Download image files
- include_videos: Download video files
- include_other_media: Download other media types
- mode: Execution mode
  - `once` - Run once and exit (useful for cron jobs)
  - `continuous` - Run continuously on an interval
- interval: Time between runs in continuous mode (e.g., `5m`, `1h`, `30m`)
Run with default config file:

```bash
./lemmy-scraper
```

Specify a custom config file:

```bash
./lemmy-scraper -config /path/to/config.yaml
```

Enable verbose logging:

```bash
./lemmy-scraper -verbose
```

Display statistics about downloaded media:

```bash
./lemmy-scraper -stats
```

Output example:
```
=== Lemmy Media Scraper Statistics ===

Total media files: 245

By media type:
  image: 198
  video: 42
  other: 5

Top communities:
  technology: 89
  linux: 67
  programming: 45
  ...
```
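The per-type and per-community counts above are simple group-by aggregations over the media table; a minimal in-memory sketch (the real tool reads these from SQLite, and the `MediaRecord` struct here is illustrative only):

```go
package main

import "fmt"

// MediaRecord is an illustrative stand-in for one row of stored metadata.
type MediaRecord struct {
	MediaType string
	Community string
}

// tally groups records by media type and by community.
func tally(records []MediaRecord) (byType, byCommunity map[string]int) {
	byType = map[string]int{}
	byCommunity = map[string]int{}
	for _, r := range records {
		byType[r.MediaType]++
		byCommunity[r.Community]++
	}
	return byType, byCommunity
}

func main() {
	records := []MediaRecord{
		{"image", "technology"}, {"image", "linux"},
		{"video", "technology"}, {"other", "programming"},
	}
	byType, byCommunity := tally(records)
	fmt.Println("Total media files:", len(records))
	fmt.Println("images:", byType["image"], "in technology:", byCommunity["technology"])
}
```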
Create a systemd service file at `/etc/systemd/system/lemmy-scraper.service`:

```ini
[Unit]
Description=Lemmy Media Scraper
After=network.target

[Service]
Type=simple
User=your-user
WorkingDirectory=/path/to/lemmy-media-scraper
ExecStart=/path/to/lemmy-media-scraper/lemmy-scraper -config /path/to/config.yaml
Restart=on-failure
RestartSec=10

[Install]
WantedBy=multi-user.target
```

Enable and start the service:

```bash
sudo systemctl enable lemmy-scraper
sudo systemctl start lemmy-scraper
sudo systemctl status lemmy-scraper
```

Add to crontab for hourly execution:

```bash
0 * * * * cd /path/to/lemmy-media-scraper && ./lemmy-scraper -config config.yaml
```

- Authentication: Connects to the specified Lemmy instance and authenticates using your credentials
- Post Retrieval: Fetches posts from either:
  - The instance's hot page (if no communities are specified)
  - Specific communities (if listed in the config)
- Media Extraction: Identifies media URLs in posts:
  - Direct post URLs (e.g., image/video links)
  - Thumbnail URLs
  - Embedded video URLs
- Deduplication: Before storing a file:
  - Downloads the file content
  - Computes its SHA-256 hash
  - Checks whether the hash exists in the database
  - Skips the file if it was already downloaded
- Storage: If the file is new:
  - Saves it to `{base_directory}/{community_name}/{post_id}_{filename}`
  - Records its metadata in the SQLite database
- Metadata: Stores comprehensive information:
  - Post details (ID, title, URL, score, creation date)
  - Community info (name, ID)
  - Author info (name, ID)
  - File info (path, size, hash, type)
  - Download timestamp
Scrape specific communities once:

```yaml
lemmy:
  instance: "lemmy.ml"
  username: "myuser"
  password: "mypass"
  communities: ["technology", "linux", "programming"]

run_mode:
  mode: "once"
```

Continuously monitor the instance hot page:

```yaml
lemmy:
  instance: "lemmy.world"
  username: "myuser"
  password: "mypass"
  communities: []   # Empty = hot page

run_mode:
  mode: "continuous"
  interval: "15m"
```

Download images only:

```yaml
lemmy:
  instance: "lemmy.ml"
  username: "myuser"
  password: "mypass"
  communities: ["pics", "photography"]

scraper:
  include_images: true
  include_videos: false
  include_other_media: false
```

- Verify your username and password are correct
- Check if the Lemmy instance is accessible
- Ensure the instance URL doesn't include `https://` or trailing slashes
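Normalizing the configured hostname defensively avoids this class of error; a small sketch (the `normalizeInstance` helper is hypothetical, not part of the codebase):

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeInstance strips a scheme prefix and trailing slashes so that
// "https://lemmy.ml/" and "lemmy.ml" are treated the same.
func normalizeInstance(raw string) string {
	s := strings.TrimPrefix(raw, "https://")
	s = strings.TrimPrefix(s, "http://")
	return strings.TrimRight(s, "/")
}

func main() {
	fmt.Println(normalizeInstance("https://lemmy.ml/")) // lemmy.ml
	fmt.Println(normalizeInstance("lemmy.world"))       // lemmy.world
}
```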
- Enable verbose logging with the `-verbose` flag
- Check if posts actually contain media URLs
- Verify media type filters are enabled
- Try scraping from a community known to have media content
- Ensure only one instance of the scraper is running
- Check file permissions on the database file
- If using continuous mode, ensure the database path is accessible
```
lemmy-media-scraper/
├── cmd/
│   └── scraper/          # Main application entry point
│       └── main.go
├── internal/
│   ├── api/              # Lemmy API client
│   ├── config/           # Configuration management
│   ├── database/         # SQLite database operations
│   ├── downloader/       # Media download and deduplication
│   └── scraper/          # Core scraping logic
├── pkg/
│   └── models/           # Data models
├── config.example.yaml   # Example configuration
├── go.mod
├── go.sum
└── README.md
```
Contributions are welcome! Please feel free to submit issues or pull requests.
This tool is for personal use only. Please respect the Lemmy instance's terms of service and be mindful of rate limiting. Always use reasonable scraping intervals to avoid overloading the server.