A Python module that scrapes web content from URLs listed in a CSV file and saves the extracted content as JSON.
- Scrapes web pages using Scrapy spiders for crawling and parsing
- Reads URLs from a CSV file
- Crawls and extracts content from internal links within the same domain
- Configurable crawling depth and limits
- Extracts title, text, and metadata from web pages
- Handles JavaScript-rendered pages and animations using Selenium
- Waits for pages to fully load before extracting content
- Saves scraped data in JSON format
- Configurable using Pydantic models
- Comprehensive logging with Loguru
- Python 3.12 or higher
- uv - A Python package installer and resolver
This project uses pyproject.toml for dependency management with uv, which is a faster alternative to pip.
- Clone the repository:
    git clone https://github.com/yourusername/web-scraper.git
    cd web-scraper
- Install the package and its dependencies using uv:
    uv pip install -e .
If you only want to run the script and do not need an editable (development) install, you can install the project and its required dependencies using uv:
uv pip install .
This will install all the necessary packages (scrapy, loguru, pydantic, pandas, selenium, webdriver-manager) defined in pyproject.toml needed to run the web scraper.
The web scraper can be configured using environment variables. A sample .env.sample file is provided with all available configuration options. To use it:
- Copy the sample file to create your own .env file:
    cp .env.sample .env
- Edit the .env file to customize your settings.
- Install the web scraper with the optional env dependency to load environment variables from the .env file:
    uv pip install ".[env]"
  Alternatively, you can install python-dotenv separately:
    uv pip install python-dotenv
- The environment variables will be automatically loaded from the .env file when you run the scraper.
- Make sure you have installed the required dependencies:
    uv pip install .
- Create a CSV file with a list of URLs to scrape. The CSV file should have a column named url. For example:
    url
    https://www.example.com
    https://www.python.org
    https://www.wikipedia.org
- Run the scraper:
    python main.py --csv-file your_urls.csv
  Alternatively, you can set the SCRAPER_CSV_FILE_PATH environment variable in your .env file and run:
    python main.py
- The scraped data will be saved in the output directory by default.
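As an illustration of the CSV format described above, here is a minimal sketch of reading the url column with the standard library. The function name read_urls is hypothetical, and the project's own csv_reader.py may work differently (for example, it may use pandas); this only shows the expected file shape.

```python
import csv
from pathlib import Path


def read_urls(csv_path: str) -> list[str]:
    """Return the non-empty values of the `url` column, in file order.

    Hypothetical helper for illustration; not the project's actual reader.
    """
    with Path(csv_path).open(newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        # An empty file has no header row, so fieldnames is None.
        if reader.fieldnames is None or "url" not in reader.fieldnames:
            raise ValueError(f"{csv_path} has no 'url' column")
        urls = []
        for row in reader:
            value = (row.get("url") or "").strip()
            if value:  # skip rows where the url cell is empty
                urls.append(value)
        return urls
```

Calling read_urls("your_urls.csv") on the example file above would return the three URLs as a list of strings.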
The web scraper can crawl and extract content from internal links within the same domain. This feature is enabled by default and can be configured using the following environment variables:
- SCRAPER_FOLLOW_INTERNAL_LINKS: Whether to follow internal links within the same domain (default: True)
- SCRAPER_MAX_DEPTH: Maximum depth for crawling internal links (default: 2)
- SCRAPER_MAX_PAGES_PER_DOMAIN: Maximum number of pages to crawl per domain (default: 100)
For example, to crawl with a maximum depth of 3 and limit to 50 pages per domain, set the following in your .env file:
SCRAPER_FOLLOW_INTERNAL_LINKS=True
SCRAPER_MAX_DEPTH=3
SCRAPER_MAX_PAGES_PER_DOMAIN=50
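To make the effect of the depth and page limits concrete, here is a small sketch of a breadth-first crawl over an in-memory link graph. It is not the project's Scrapy spider: the function crawl and the get_links callback are hypothetical, and the sketch assumes that the initial URL has depth 0 (consistent with the crawl_depth field in the output) and that the page limit counts the initial URL.

```python
from collections import deque
from typing import Callable, Iterable


def crawl(
    start_url: str,
    get_links: Callable[[str], Iterable[str]],
    max_depth: int = 2,
    max_pages: int = 100,
) -> list[tuple[str, int]]:
    """Visit pages breadth-first and return (url, depth) pairs in visit order."""
    seen = {start_url}
    queue = deque([(start_url, 0)])
    visited: list[tuple[str, int]] = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append((url, depth))
        if depth >= max_depth:
            continue  # pages at the maximum depth are scraped but not expanded
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


# Toy link graph standing in for real pages.
graph = {"a": ["b", "c"], "b": ["d"], "c": [], "d": ["e"], "e": []}
pages = crawl("a", lambda url: graph.get(url, []), max_depth=2)
```

With max_depth=2 the page "e" is never visited, because it is three links away from the start.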
The crawler only follows links that start with the same path as the initial URL. For example:
- If the initial URL is https://example.com/blog-details, the crawler will only follow links that start with https://example.com/blog-details or https://example.com/blog-details/.
- If the initial URL is https://example.com/blog-details?id=123, the crawler will only follow links that start with https://example.com/blog-details.
- If the initial URL is https://example.com/ (or just https://example.com), the crawler will follow all links on the domain.
This ensures that the crawler stays within the specific section of the website that you're interested in, rather than crawling the entire domain. It's particularly useful for scraping specific sections of large websites.
The crawler will only follow links within the same domains as the initial URLs provided in the CSV file. This prevents the crawler from wandering off to external websites.
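The domain and path rules above can be sketched as a single predicate. This is an illustrative approximation built on urllib.parse, not the spider's actual code: the name is_allowed_link is hypothetical, and the sketch assumes a plain prefix match on the path and ignores the URL scheme.

```python
from urllib.parse import urlsplit


def is_allowed_link(start_url: str, link: str) -> bool:
    """Return True if `link` is on the same domain and under the start path."""
    start = urlsplit(start_url)
    target = urlsplit(link)
    # Never leave the domain of the initial URL.
    if target.netloc.lower() != start.netloc.lower():
        return False
    # Query strings are not part of .path, so ?id=123 is dropped automatically.
    base_path = start.path.rstrip("/")
    if not base_path:
        return True  # start URL is the site root: follow every link on the domain
    # Assumption: simple prefix match, as the examples above describe.
    return target.path.startswith(base_path)
```

For example, with https://example.com/blog-details as the initial URL, a link to https://example.com/blog-details/post-1 is accepted, while https://example.com/about is rejected.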
You can also configure crawling behavior using command-line arguments:
# Crawl with a maximum depth of 3
python main.py --csv-file your_urls.csv --max-depth 3
# Disable following internal links
python main.py --csv-file your_urls.csv --no-follow-internal-links
# Limit to 50 pages per domain
python main.py --csv-file your_urls.csv --max-pages-per-domain 50
Command-line arguments take precedence over environment variables.
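One common way to implement this precedence is to let the command-line value default to None and fall back to the environment variable only when the flag was not given. The sketch below assumes that approach; resolve_int_setting is a hypothetical helper, and the project's Pydantic-based configuration may resolve values differently.

```python
import os
from typing import Mapping, Optional


def resolve_int_setting(
    cli_value: Optional[int],
    env_name: str,
    default: int,
    env: Mapping[str, str] = os.environ,
) -> int:
    """Pick the CLI value first, then the environment variable, then the default."""
    if cli_value is not None:
        return cli_value
    raw = env.get(env_name)
    if raw is not None and raw.strip():
        return int(raw)
    return default


# --max-depth 3 on the command line wins over SCRAPER_MAX_DEPTH=5 in the environment.
depth = resolve_int_setting(3, "SCRAPER_MAX_DEPTH", 2, {"SCRAPER_MAX_DEPTH": "5"})
```

Passing the environment as a mapping keeps the helper easy to test without touching the real process environment.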
The web scraper uses Selenium to render pages with JavaScript and wait for them to fully load before extracting content. This is particularly useful for scraping modern websites that use animations and dynamically load content.
By default, all pages are rendered using Selenium, which ensures that even content loaded via JavaScript is properly extracted. The scraper waits for the page to be fully loaded and then waits an additional period to allow any animations to complete.
You can configure the maximum time to wait for a page to load using the SELENIUM_WAIT_TIME environment variable in your .env file:
SELENIUM_WAIT_TIME=15 # Wait up to 15 seconds for pages to load
Alternatively, you can set the wait time using the command-line argument:
python main.py --csv-file your_urls.csv --selenium-wait-time 15
The default wait time is 10 seconds, which is sufficient for most websites. If you're scraping websites with particularly heavy JavaScript or slow-loading animations, you may need to increase this value.
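The wait behavior described above can be approximated as a polling loop that checks document.readyState and then pauses briefly for animations. This is a hand-rolled sketch so that it runs without a browser; it is not the project's middleware. The name wait_for_page and the settle and poll parameters are assumptions, and the only thing it requires from the driver is Selenium's execute_script method. In real code, Selenium's WebDriverWait serves the same purpose.

```python
import time


def wait_for_page(driver, max_wait: float = 10.0, settle: float = 0.5, poll: float = 0.1) -> bool:
    """Poll until the document reports 'complete', then pause for animations.

    Returns True if the page finished loading within `max_wait` seconds,
    and False if the deadline passed first.
    """
    deadline = time.monotonic() + max_wait
    while time.monotonic() < deadline:
        if driver.execute_script("return document.readyState") == "complete":
            time.sleep(settle)  # assumed extra pause so animations can finish
            return True
        time.sleep(poll)
    return False
```

Raising max_wait here corresponds to increasing SELENIUM_WAIT_TIME for slow pages.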
python main.py --help
This will display the available command line options:
usage: main.py [-h] --csv-file CSV_FILE [--output-dir OUTPUT_DIR] [--log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}] [--log-file LOG_FILE]
               [--follow-internal-links] [--no-follow-internal-links] [--max-depth MAX_DEPTH] [--max-pages-per-domain MAX_PAGES_PER_DOMAIN]
               [--selenium-wait-time SELENIUM_WAIT_TIME]

Web Scraper

options:
  -h, --help            show this help message and exit
  --csv-file CSV_FILE, -c CSV_FILE
                        Path to the CSV file containing URLs to scrape
  --output-dir OUTPUT_DIR, -o OUTPUT_DIR
                        Directory to save scraped data
  --log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}, -l {DEBUG,INFO,WARNING,ERROR,CRITICAL}
                        Log level
  --log-file LOG_FILE   Path to the log file. If not provided, a default log file will be created.
  --follow-internal-links
                        Follow internal links within the same domain
  --no-follow-internal-links
                        Do not follow internal links
  --max-depth MAX_DEPTH
                        Maximum depth for crawling internal links
  --max-pages-per-domain MAX_PAGES_PER_DOMAIN
                        Maximum number of pages to crawl per domain
  --selenium-wait-time SELENIUM_WAIT_TIME
                        Maximum time to wait for a page to load (in seconds)
The project follows Object-Oriented Programming principles with different files for different functionalities:
- main.py: Entry point for the scraper
- src/scraper/config.py: Configuration using Pydantic
- src/scraper/csv_reader.py: Module to read URLs from CSV files
- src/scraper/items.py: Definition of scraped data structure
- src/scraper/pipelines.py: Processing and saving scraped data
- src/scraper/logger.py: Logging setup using Loguru
- src/scraper/middlewares.py: Middleware for handling JavaScript rendering using Selenium
- src/scraper/spiders/web_spider.py: Scrapy spider for crawling and parsing
The scraper saves each scraped page as a separate JSON file with the following structure:
{
  "url": "https://www.example.com",
  "title": "Example Domain",
  "text": "This domain is for use in illustrative examples in documents...",
  "metadata": {
    "description": "Example website description",
    "keywords": "example, domain",
    "domain": "www.example.com",
    "crawl_depth": 0
  },
  "timestamp": "2023-06-01T12:34:56.789012"
}

Additionally, all scraped pages are combined into a single JSON file named all_items.json.
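As a rough sketch of this output layout, the snippet below writes one JSON file per item plus a combined all_items.json. It is not the project's pipeline: the save_items function and the page_0001.json naming scheme are hypothetical, and it assumes that all_items.json holds a JSON array of the per-page objects.

```python
import json
from pathlib import Path


def save_items(items: list[dict], output_dir: str) -> Path:
    """Write each item to its own file and all items to all_items.json."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    for index, item in enumerate(items, start=1):
        # Hypothetical file naming; the real pipeline may name files differently.
        page_file = out / f"page_{index:04d}.json"
        page_file.write_text(json.dumps(item, indent=2, ensure_ascii=False), encoding="utf-8")
    combined = out / "all_items.json"
    # Assumption: the combined file is a JSON array of the per-page objects.
    combined.write_text(json.dumps(items, indent=2, ensure_ascii=False), encoding="utf-8")
    return combined
```

Reading the combined file back with json.loads then yields a list of dictionaries with the url, title, text, metadata, and timestamp keys shown above.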
This project is licensed under the MIT License - see the LICENSE file for details.