Lemmy Media Scraper

A Go-based tool for scraping and downloading media (images, videos, and other files) from Lemmy instances. Features intelligent deduplication using content hashing and comprehensive metadata storage.

Warning

This project is almost entirely vibe-coded as an experiment. It works and I've reviewed the code, but be careful...

Features

Multi-instance support: Connect to any Lemmy instance
Community-specific scraping: Target specific communities or scrape from the hot page
Intelligent deduplication: Uses SHA-256 content hashing to avoid downloading duplicates
Comprehensive metadata: Stores post details, community info, author data, and more in SQLite
Multiple run modes: One-time execution or continuous monitoring
Flexible media filtering: Choose which media types to download (images, videos, other)
Organized storage: Files automatically organized by community
Smart pagination: Configurable limits with optional stopping at previously seen posts
Web UI: Browse and manage downloaded media with a modern HTMX-based interface
Full-text search: Fast FTS5-powered search across titles, communities, creators, and URLs
Tag system: Organize media with user-defined tags or AI-powered auto-tagging
Thumbnail generation: Automatic thumbnails for images and videos (FFmpeg required for videos)
Image recognition: Optional AI-powered classification using Ollama vision models
Statistics dashboard: Timeline charts, top creators, storage breakdown, and more
Real-time progress: WebSocket-based live updates during scraping

Requirements

Go 1.21 or later
SQLite3
A Lemmy account on the instance you want to scrape

Installation

Pre-built Binaries

Download the latest release for your platform from the Releases page.

Linux (x86_64):

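# Replace * with the version shown on the Releases page; wget does not expand wildcards in HTTP URLs
wget https://github.com/ST2Projects/lemmy-media-scraper/releases/latest/download/lemmy-scraper_*_Linux_x86_64.tar.gz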
tar -xzf lemmy-scraper_*_Linux_x86_64.tar.gz
./lemmy-scraper -config config.example.yaml

macOS (Apple Silicon):

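# Replace * with the version shown on the Releases page; wget does not expand wildcards in HTTP URLs
wget https://github.com/ST2Projects/lemmy-media-scraper/releases/latest/download/lemmy-scraper_*_Darwin_arm64.tar.gz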
tar -xzf lemmy-scraper_*_Darwin_arm64.tar.gz
./lemmy-scraper -config config.example.yaml

Each release includes:

The lemmy-scraper binary
config.example.yaml - Example configuration file
README.md - Documentation

Docker

Run with Docker Compose (recommended):

git clone https://github.com/ST2Projects/lemmy-media-scraper.git
cd lemmy-media-scraper
mkdir -p config downloads
cp config.docker.yaml config/config.yaml
# Edit config/config.yaml with your credentials
docker-compose up -d

Or use the pre-built image:

# Docker image references must be lowercase
docker pull ghcr.io/st2projects/lemmy-media-scraper:latest

See README.Docker.md for detailed Docker deployment instructions.

From Source

git clone https://github.com/ST2Projects/lemmy-media-scraper.git
cd lemmy-media-scraper

# Build with full-text search support (recommended)
make build-fts5

# Or build manually
CGO_ENABLED=1 go build -tags fts5 -o lemmy-scraper ./cmd/scraper

Note: The fts5 build tag is required for full-text search functionality. Without it, the application will work but search will be disabled.

Configuration

Create a config.yaml file based on the provided example:

cp config.example.yaml config.yaml

Edit config.yaml with your settings:

lemmy:
  instance: "lemmy.ml"              # Lemmy instance (without https://)
  username: "your_username"          # Your Lemmy username
  password: "your_password"          # Your Lemmy password
  communities: []                    # Leave empty for hot page, or list communities

storage:
  base_directory: "./downloads"      # Where to save media files

database:
  path: "./lemmy-scraper.db"        # SQLite database path

scraper:
  max_posts_per_run: 100            # Maximum posts to scrape per run
  stop_at_seen_posts: true          # Stop when encountering seen posts
  sort_type: "Hot"                  # Hot, New, TopDay, TopWeek, etc.
  include_images: true              # Download images
  include_videos: true              # Download videos
  include_other_media: true         # Download other media types

run_mode:
  mode: "once"                      # "once" or "continuous"
  interval: "30m"                   # Interval for continuous mode

Configuration Options

Lemmy Settings

instance: The Lemmy instance hostname (e.g., lemmy.ml, lemmy.world)
username: Your Lemmy account username (required for authentication)
password: Your Lemmy account password
communities: List of communities to scrape. Examples:
- [] - Empty list scrapes from the instance hot page
- ["technology", "linux"] - Scrapes specific communities
- ["community@instance", ...] - Scrapes communities hosted on specific instances, written as community@instance
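
As a rough illustration of how a communities entry could be interpreted, the sketch below splits a community@instance string into its two parts and falls back to the configured instance when no @ is present. The communityRef type, the parseCommunity function, and the sample values are hypothetical and are not taken from the project's code.

```go
package main

import (
	"fmt"
	"strings"
)

// communityRef is a hypothetical type used only for this sketch.
type communityRef struct {
	Name     string // community name, e.g. "technology"
	Instance string // host the community lives on
}

// parseCommunity splits "name@instance" into its parts. Entries without an
// "@" (or with nothing after it) are assumed to live on defaultInstance.
func parseCommunity(entry, defaultInstance string) communityRef {
	name, instance, found := strings.Cut(strings.TrimSpace(entry), "@")
	if !found || instance == "" {
		return communityRef{Name: name, Instance: defaultInstance}
	}
	return communityRef{Name: name, Instance: instance}
}

func main() {
	// Sample entries are illustrative only.
	for _, entry := range []string{"technology", "linux@lemmy.world"} {
		ref := parseCommunity(entry, "lemmy.ml")
		fmt.Printf("%s -> community %q on %q\n", entry, ref.Name, ref.Instance)
	}
}
```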

Storage Settings

base_directory: Root directory for downloaded media. Files are organized as:

downloads/
├── technology/
│   ├── 12345_image.jpg
│   └── 12346_video.mp4
└── linux/
    └── 12347_photo.png

Database Settings

path: Location of the SQLite database file for tracking scraped media

Scraper Settings

max_posts_per_run: Maximum number of posts to process per community/run
stop_at_seen_posts: Stop scraping when encountering a previously processed post
sort_type: How to sort posts. Options:
- Hot - Currently trending posts
- New - Newest posts first
- TopDay - Top posts from the last day
- TopWeek - Top posts from the last week
- TopMonth - Top posts from the last month
- TopYear - Top posts from the last year
- TopAll - Top posts of all time
include_images: Download image files
include_videos: Download video files
include_other_media: Download other media types
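
To make the interaction of these settings concrete, here is a hedged sketch of how they might be applied. The extension lists, the filterConfig struct, and the functions classify, wanted, and selectNew are all assumptions made for illustration; in particular, this sketch counts only unseen posts toward max_posts_per_run, which may differ from the actual behaviour.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// filterConfig mirrors the three include_* options (hypothetical struct).
type filterConfig struct {
	IncludeImages, IncludeVideos, IncludeOther bool
}

// classify guesses a media type from the URL's file extension.
// The extension lists are illustrative assumptions.
func classify(rawURL string) string {
	clean, _, _ := strings.Cut(rawURL, "?") // ignore any query string
	switch strings.ToLower(path.Ext(clean)) {
	case ".jpg", ".jpeg", ".png", ".gif", ".webp":
		return "image"
	case ".mp4", ".webm", ".mov":
		return "video"
	default:
		return "other"
	}
}

// wanted reports whether the given media type is enabled in cfg.
func wanted(cfg filterConfig, mediaType string) bool {
	switch mediaType {
	case "image":
		return cfg.IncludeImages
	case "video":
		return cfg.IncludeVideos
	default:
		return cfg.IncludeOther
	}
}

// selectNew walks post IDs in listing order and returns the unseen ones,
// stopping at maxPosts results or, if stopAtSeen is set, at the first seen post.
func selectNew(postIDs []int64, seen map[int64]bool, maxPosts int, stopAtSeen bool) []int64 {
	var out []int64
	for _, id := range postIDs {
		if len(out) >= maxPosts {
			break
		}
		if seen[id] {
			if stopAtSeen {
				break
			}
			continue
		}
		out = append(out, id)
	}
	return out
}

func main() {
	cfg := filterConfig{IncludeImages: true}
	for _, u := range []string{"https://example.com/a.png", "https://example.com/b.mp4"} {
		kind := classify(u)
		fmt.Printf("%s: type=%s download=%v\n", u, kind, wanted(cfg, kind))
	}
	fmt.Println(selectNew([]int64{5, 4, 3, 2, 1}, map[int64]bool{3: true}, 100, true))
}
```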

Run Mode Settings

mode: Execution mode
- once - Run once and exit (useful for cron jobs)
- continuous - Run continuously on an interval
interval: Time between runs in continuous mode (e.g., 5m, 1h, 30m)

Usage

Basic Usage

Run with default config file:

./lemmy-scraper

Specify a custom config file:

./lemmy-scraper -config /path/to/config.yaml

Enable verbose logging:

./lemmy-scraper -verbose

View Statistics

Display statistics about downloaded media:

./lemmy-scraper -stats

Output example:

=== Lemmy Media Scraper Statistics ===

Total media files: 245

By media type:
  image: 198
  video: 42
  other: 5

Top communities:
  technology: 89
  linux: 67
  programming: 45
  ...

Running as a Service

Using systemd (Linux)

Create a systemd service file at /etc/systemd/system/lemmy-scraper.service:

[Unit]
Description=Lemmy Media Scraper
After=network.target

[Service]
Type=simple
User=your-user
WorkingDirectory=/path/to/lemmy-media-scraper
ExecStart=/path/to/lemmy-media-scraper/lemmy-scraper -config /path/to/config.yaml
Restart=on-failure
RestartSec=10

[Install]
WantedBy=multi-user.target

Enable and start the service:

sudo systemctl enable lemmy-scraper
sudo systemctl start lemmy-scraper
sudo systemctl status lemmy-scraper

Using cron (One-time mode)

Add to crontab for hourly execution:

0 * * * * cd /path/to/lemmy-media-scraper && ./lemmy-scraper -config config.yaml

How It Works

Authentication: Connects to the specified Lemmy instance and authenticates using your credentials
Post Retrieval: Fetches posts from either:
- The instance's hot page (if no communities specified)
- Specific communities (if listed in config)
Media Extraction: Identifies media URLs in posts:
- Direct post URLs (e.g., image/video links)
- Thumbnail URLs
- Embedded video URLs
Deduplication: Before saving a file to disk:
- Downloads the file content
- Computes SHA-256 hash
- Checks if hash exists in database
- Skips if already downloaded
Storage: If new:
- Saves file to {base_directory}/{community_name}/{post_id}_{filename}
- Records metadata in SQLite database
Metadata: Stores comprehensive information:
- Post details (ID, title, URL, score, creation date)
- Community info (name, ID)
- Author info (name, ID)
- File info (path, size, hash, type)
- Download timestamp
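
The deduplication step described above can be sketched as follows. This is a minimal illustration, assuming an in-memory map in place of the SQLite lookup; hashIndex, contentHash, and storeIfNew are hypothetical names and do not come from the project's code.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// hashIndex stands in for the database: it maps a content hash to the path
// of the file already stored with that content.
type hashIndex map[string]string

// contentHash returns the hex-encoded SHA-256 digest of data.
func contentHash(data []byte) string {
	sum := sha256.Sum256(data)
	return hex.EncodeToString(sum[:])
}

// storeIfNew records path under the content's hash unless that hash is
// already known. It returns the hash and whether the content was new.
func (idx hashIndex) storeIfNew(data []byte, path string) (hash string, isNew bool) {
	hash = contentHash(data)
	if _, exists := idx[hash]; exists {
		return hash, false // duplicate content: skip saving
	}
	idx[hash] = path
	return hash, true
}

func main() {
	idx := hashIndex{}
	_, isNew := idx.storeIfNew([]byte("same bytes"), "downloads/technology/12345_image.jpg")
	fmt.Println("first copy is new:", isNew)
	_, isNew = idx.storeIfNew([]byte("same bytes"), "downloads/linux/12347_photo.png")
	fmt.Println("second copy is new:", isNew)
}
```

Because the hash is computed over the file content rather than the URL, the same image posted in two communities is detected as a duplicate even when its filename differs.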

Examples

Scrape specific communities once

lemmy:
  instance: "lemmy.ml"
  username: "myuser"
  password: "mypass"
  communities: ["technology", "linux", "programming"]

run_mode:
  mode: "once"

Continuous monitoring of hot page

lemmy:
  instance: "lemmy.world"
  username: "myuser"
  password: "mypass"
  communities: []  # Empty = hot page

run_mode:
  mode: "continuous"
  interval: "15m"

Download only images from specific communities

lemmy:
  instance: "lemmy.ml"
  username: "myuser"
  password: "mypass"
  communities: ["pics", "photography"]

scraper:
  include_images: true
  include_videos: false
  include_other_media: false

Troubleshooting

Authentication fails

Verify your username and password are correct
Check if the Lemmy instance is accessible
Ensure the instance URL doesn't include https:// or trailing slashes
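
If you are unsure whether your instance value is in the expected form, the following sketch shows the kind of normalization implied by the last point. The normalizeInstance function is hypothetical; the scraper itself may or may not perform this cleanup, so it is safest to write the bare hostname in config.yaml.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeInstance strips surrounding whitespace, a leading http:// or
// https:// scheme, and any trailing slashes, leaving just the hostname.
func normalizeInstance(raw string) string {
	host := strings.TrimSpace(raw)
	host = strings.TrimPrefix(host, "https://")
	host = strings.TrimPrefix(host, "http://")
	return strings.TrimRight(host, "/")
}

func main() {
	for _, raw := range []string{"https://lemmy.ml/", "lemmy.world"} {
		fmt.Printf("%q -> %q\n", raw, normalizeInstance(raw))
	}
}
```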

No media being downloaded

Enable verbose logging with the -verbose flag
Check if posts actually contain media URLs
Verify media type filters are enabled
Try scraping from a community known to have media content

Database locked errors

Ensure only one instance of the scraper is running
Check file permissions on the database file
If using continuous mode, ensure the database path is accessible

Project Structure

lemmy-media-scraper/
├── cmd/
│   └── scraper/          # Main application entry point
│       └── main.go
├── internal/
│   ├── api/             # Lemmy API client
│   ├── config/          # Configuration management
│   ├── database/        # SQLite database operations
│   ├── downloader/      # Media download and deduplication
│   └── scraper/         # Core scraping logic
├── pkg/
│   └── models/          # Data models
├── config.example.yaml  # Example configuration
├── go.mod
├── go.sum
└── README.md

Contributing

Contributions are welcome! Please feel free to submit issues or pull requests.

Disclaimer

This tool is for personal use only. Please respect the Lemmy instance's terms of service and be mindful of rate limiting. Always use reasonable scraping intervals to avoid overloading the server.
