A robust asynchronous web scraper that extracts manga information from AnimeClick.it. Built with Python, it handles pagination, tag-based browsing, and detailed manga information extraction while respecting the site's resources. It uses DataImpulse for proxy management and anti-detection measures.

## Features
- 🚀 Asynchronous web scraping with AsyncWebCrawler
- 📚 Complete manga information extraction
- 🏷️ Tag-based manga categorization
- 💾 Structured JSON and CSV output
- 🤖 Anti-bot detection measures via DataImpulse proxies
- ⏱️ Rate limiting and polite crawling
- 🔄 Automatic deduplication of manga entries
## Installation

- Clone the repository:

  ```bash
  git clone https://github.com/yourusername/animeclick-manga-scraper.git
  cd animeclick-manga-scraper
  ```

- Create and activate a virtual environment:

  ```bash
  python -m venv .venv
  source .venv/bin/activate  # On Windows: .venv\Scripts\activate
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Configure environment variables:

  ```bash
  cp .env.example .env
  # Edit .env with your DataImpulse proxy credentials and other settings
  ```
## Usage

The scraper provides three main functions:
### 1. Extract manga tags

```python
await process_manga_tags()
```

- Extracts all manga tags from the site
- Creates a master list in `data/manga_tags.json`
- Generates individual tag files in `data/manga_by_tag/`
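The per-tag output could be written with a helper along these lines (an illustrative sketch with hypothetical names; the scraper's actual code may differ):

```python
import json
from datetime import datetime
from pathlib import Path

def save_tag_file(tag, tag_url, manga_list, out_dir="data/manga_by_tag"):
    """Write one tag's manga list as JSON, following the tag-file format
    documented below. `manga_list` holds {"href", "title"} dicts.
    (Hypothetical helper, not part of the actual scraper.)"""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    payload = {
        "tag": tag,
        "tag_url": tag_url,
        "extraction_date": datetime.now().isoformat(),
        "manga_count": len(manga_list),
        "manga_list": manga_list,
    }
    path = out / f"{tag}.json"
    path.write_text(json.dumps(payload, ensure_ascii=False, indent=2),
                    encoding="utf-8")
    return path
```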
### 2. Extract manga details

```python
await extract_manga_details()
```

- Processes manga from tag files
- Extracts comprehensive information
- Saves individual manga files in `data/manga_details/`
### 3. Build the dataset

```bash
python dataset_maker.py
```

- Combines all manga details into a single dataset
- Removes duplicate entries based on URL
- Generates two output files:
  - `animeclick_manga_dataset_20250228.json` (JSON format)
  - `animeclick_manga_dataset_20250228.csv` (CSV format)
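URL-based deduplication can be sketched as a dict keyed by URL (illustrative only; `dataset_maker.py`'s actual implementation may differ):

```python
import json
from pathlib import Path

def dedupe_by_url(details_dir):
    """Load per-manga JSON files and keep exactly one record per URL.

    Dict keys guarantee uniqueness; a later duplicate overwrites an
    earlier one. (Sketch of the deduplication step, not the real code.)
    """
    seen = {}
    for path in sorted(Path(details_dir).glob("*.json")):
        record = json.loads(path.read_text(encoding="utf-8"))
        seen[record["url"]] = record
    return list(seen.values())
```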
## Output Formats

### Tag file (`data/manga_by_tag/`)

```json
{
  "tag": "robot",
  "tag_url": "/manga/tags/robot",
  "extraction_date": "2024-02-27T12:00:00",
  "manga_count": 42,
  "manga_list": [
    {
      "href": "/manga/12345/manga-title",
      "title": "Manga Title"
    }
  ]
}
```
### Manga detail file (`data/manga_details/`)

```json
{
  "url": "https://www.animeclick.it/manga/12345/manga-title",
  "extraction_date": "2024-02-27T12:00:00",
  "details": {
    "titolo_originale": "Original Japanese Title",
    "titolo_inglese": "English Title",
    "titolo_kanji": "漢字タイトル",
    "nazionalita": "Giappone",
    "casa_editrice": "Publisher Name",
    "storia": "Story Author",
    "disegni": "Artist Name",
    "categorie": ["Shounen", "Seinen"],
    "generi": ["Action", "Adventure"],
    "anno": "2024",
    "volumi": "10",
    "capitoli": "42",
    "stato_patria": "completato",
    "stato_italia": "inedito",
    "serializzato_su": "Magazine Name",
    "tag_generici": ["action", "fantasy"],
    "trama": "Plot summary of the manga..."
  }
}
```
## Dataset Fields

The final dataset files contain the following fields for each manga:

- `url`: Unique identifier and source URL
- `titolo_originale`: Original title
- `titolo_inglese`: English title
- `titolo_kanji`: Title in kanji
- `nazionalita`: Nationality/origin
- `casa_editrice`: Publisher
- `storia`: Story author
- `disegni`: Artist
- `anno`: Year of publication
- `stato_patria`: Status in original country
- `stato_italia`: Status in Italy
- `serializzato_su`: Serialization magazine
- `trama`: Plot summary
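For the CSV output, list-valued fields such as `categorie` and `generi` must be reduced to strings. One common approach (a sketch, not necessarily what `dataset_maker.py` does) is to join them with a separator:

```python
def flatten_record(record, list_sep="; "):
    """Flatten one manga record (url + nested details) into a flat dict
    suitable for a CSV row. List fields are joined with `list_sep`;
    missing values become empty strings rather than None.
    (Hypothetical helper for illustration.)"""
    row = {"url": record.get("url", "")}
    for key, value in record.get("details", {}).items():
        if isinstance(value, list):
            row[key] = list_sep.join(value)   # e.g. ["Action", "Adventure"] -> "Action; Adventure"
        elif value is None:
            row[key] = ""
        else:
            row[key] = value
    return row
```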
## Technical Details

### Browser Configuration

- Headless mode for efficient operation
- DataImpulse proxy integration for IP rotation
- Anti-detection measures
- Configurable viewport settings
### Data Extraction

- Precise CSS selectors for data extraction
- Data cleaning and transformation
- Null value handling
- Error recovery mechanisms
### Rate Limiting

- 2-second delay between manga requests
- 1-second delay between tag requests
- Configurable delay settings
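The delays above can be applied with a simple asyncio pattern (a sketch of the idea; the actual scraper wires its delays into the crawl loop):

```python
import asyncio

MANGA_DELAY = 2.0  # seconds between manga requests
TAG_DELAY = 1.0    # seconds between tag requests

async def crawl_politely(urls, fetch, delay=MANGA_DELAY):
    """Fetch URLs sequentially, sleeping between requests.

    `fetch` is any coroutine taking a URL. The delay is configurable,
    matching the settings listed above. (Illustrative helper, not the
    scraper's real code.)
    """
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            await asyncio.sleep(delay)  # be polite: space out requests
        results.append(await fetch(url))
    return results
```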
### Error Handling

- Proxy rotation via DataImpulse
- Network error recovery
- JSON validation
- Empty result detection
- Continuous operation on individual failures
- Proxy fallback mechanisms
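The retry-and-continue behaviour described above might look like the following sketch (illustrative; the scraper's real error handling lives in its crawl loop):

```python
import asyncio

async def fetch_with_retries(fetch, url, retries=3, backoff=1.0):
    """Try a request up to `retries` times with linear backoff.

    Returns None instead of raising, so one failing manga does not stop
    the whole crawl ("continuous operation on individual failures").
    (Hypothetical helper for illustration.)
    """
    for attempt in range(1, retries + 1):
        try:
            result = await fetch(url)
            if not result:  # empty result detection
                raise ValueError(f"empty result for {url}")
            return result
        except Exception as exc:
            if attempt == retries:
                print(f"giving up on {url}: {exc}")
                return None
            await asyncio.sleep(backoff * attempt)  # back off before retrying
```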
## Configuration

The following environment variables are required in the `.env` file:

```env
PROXY_USERNAME=YOUR_USER
PROXY_PASSWORD=YOUR_PASSWORD
PROXY_ADDRESS=gw.dataimpulse.com
PROXY_PORT=823
```
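These variables might be assembled into an authenticated proxy URL like so (a sketch; the variable names follow the `.env` above, while the `http://` scheme and the exact credential format are assumptions):

```python
import os

def proxy_url():
    """Build an authenticated proxy URL from the PROXY_* environment
    variables documented above. (Illustrative helper; the scraper's
    real proxy setup may differ.)"""
    user = os.environ["PROXY_USERNAME"]
    password = os.environ["PROXY_PASSWORD"]
    address = os.environ.get("PROXY_ADDRESS", "gw.dataimpulse.com")
    port = os.environ.get("PROXY_PORT", "823")
    return f"http://{user}:{password}@{address}:{port}"
```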
## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
## License

This project is licensed under the MIT License; see the LICENSE file for details.
## Disclaimer

This scraper is for educational purposes only. Please respect AnimeClick.it's terms of service and robots.txt when using this tool.