This repository contains a web scraping solution designed to help BricoSimplon, an e-commerce platform specializing in DIY and home improvement, monitor competitors' pricing strategies. By collecting and analyzing real-time product and category data from competitor websites (in this case, Castorama), this project supports BricoSimplon in optimizing its pricing policies and maintaining competitiveness in the market.
The scraping system was developed using Scrapy, focusing on clean and efficient data extraction, processing, and export. It ensures compliance with ethical scraping practices, avoiding server overload and respecting website policies.
- Category Scraping:
- Extracts all categories and subcategories from target websites.
- Differentiates between categories leading to subcategories and those linking directly to product lists.
- Product Scraping:
- Collects product data (name, price, URL, availability, etc.) from category-specific pages.
- Handles pagination to scrape all products listed.
- Data Processing Pipeline:
- Cleans and validates scraped data to ensure quality.
- Removes duplicates using unique identifiers for categories and products.
- Data Export:
- Categories exported to `categories.csv`.
- Products exported to `products.csv`.
- Error Handling:
- Includes error-catching mechanisms to manage missing or inconsistent data.
- Logs errors for debugging.
- Database Integration:
- Stores data in an SQLite relational database.
- Maintains tables for categories and products, linked by foreign keys.
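The two linked tables described above might look like the following sketch. The table and column names here are assumptions inferred from the exported CSV fields, not the project's actual schema:

```python
import sqlite3

# Minimal sketch of the relational schema: a categories table and a
# products table linked by a foreign key. Column names mirror the CSV
# exports and are assumptions, not the project's exact DDL.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("""
    CREATE TABLE categories (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        url TEXT UNIQUE,
        is_page_list INTEGER DEFAULT 0
    )
""")
conn.execute("""
    CREATE TABLE products (
        unique_id TEXT PRIMARY KEY,
        category_id INTEGER NOT NULL,
        title TEXT,
        price REAL,
        url TEXT,
        FOREIGN KEY (category_id) REFERENCES categories(id)
    )
""")
conn.execute("INSERT INTO categories (name, url) VALUES (?, ?)",
             ("Outillage", "https://example.com/outillage"))
cat_id = conn.execute("SELECT id FROM categories WHERE name = ?",
                      ("Outillage",)).fetchone()[0]
conn.execute("INSERT INTO products VALUES (?, ?, ?, ?, ?)",
             ("p-001", cat_id, "Perceuse", 79.90, "https://example.com/p-001"))
row = conn.execute("""
    SELECT p.title, c.name FROM products p
    JOIN categories c ON p.category_id = c.id
""").fetchone()
print(row)  # ('Perceuse', 'Outillage')
```

The foreign key lets a product row always resolve back to its category, which is what keeps the two CSV exports consistent when loaded into the database.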
- Python 3.8 or higher
- Scrapy
- SQLite (optional, for database functionality)
- ipython (to enhance the interactive debugging and development process)
Install dependencies with:

```shell
pip install -r requirements.txt
```
- Categories Spider: Extracts all categories and subcategories.

  ```shell
  scrapy crawl castospider
  ```

- Product List Spider: Scrapes product data from category-specific pages.

  ```shell
  scrapy crawl productspider
  ```
- Exported to `categories.csv`.
- Contains fields: `category`, `url`, and `is_page_list` (which indicates whether a category links directly to a final product list page).
- Exported to `products.csv`.
- Contains fields: `unique_id`, `category`, `subcategory`, `subsubcategory`, `subsubsubcategory`, `title`, `price`, and `url`.
- Compliance:
- Implements `DOWNLOAD_DELAY` and `USER_AGENT` to avoid overloading servers.
- Respects the `robots.txt` of the target websites.
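In a Scrapy project these compliance knobs live in `settings.py`. A sketch follows; the specific values and user-agent string are illustrative, not the project's actual configuration:

```python
# settings.py (illustrative values, not this project's actual configuration)

# Obey the robots.txt rules published by the target site.
ROBOTSTXT_OBEY = True

# Wait between consecutive requests to avoid hammering the server.
DOWNLOAD_DELAY = 1.5

# Identify the crawler honestly instead of using Scrapy's default UA string.
USER_AGENT = "BricoSimplonBot/1.0 (+contact@example.com)"
```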
- Deduplication:
- Uses unique identifiers to ensure no duplicate categories or products.
- Handles cases where a product belongs to multiple categories.
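Deduplication by unique identifier is typically a "seen IDs" set inside an item pipeline. A self-contained sketch follows; a real Scrapy pipeline would raise `scrapy.exceptions.DropItem` rather than the local stand-in exception defined here:

```python
class DropItem(Exception):
    """Stand-in for scrapy.exceptions.DropItem so this sketch runs standalone."""


class DedupPipelineSketch:
    """Drops items whose unique_id has already been seen in this crawl."""

    def __init__(self):
        self.seen_ids = set()

    def process_item(self, item, spider=None):
        uid = item.get("unique_id")
        if uid in self.seen_ids:
            # Second sighting of the same product: discard it.
            raise DropItem(f"Duplicate product: {uid}")
        self.seen_ids.add(uid)
        return item


pipeline = DedupPipelineSketch()
pipeline.process_item({"unique_id": "p-001", "title": "Perceuse"})
try:
    pipeline.process_item({"unique_id": "p-001", "title": "Perceuse"})
    duplicate_dropped = False
except DropItem:
    duplicate_dropped = True
print(duplicate_dropped)  # True
```

Because a product that appears under several categories shares one `unique_id`, this same mechanism also prevents it from being stored more than once.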
- Modularity:
- Code is modular, making it easy to extend and maintain.
- Add support for additional competitors.
- Implement a more advanced scheduler for real-time data scraping.
- Integrate machine learning to detect pricing trends and insights.
This project was collaboratively developed and managed by Michael Adebayo (@MichAdebayo) and David Scott (@Daviddavid-sudo).
