Scrapy (Python)

Web scraping examples using Scrapy framework.

Run on Intuned

APIs

| API | Description |
| --- | --- |
| scrapy-crawler | Scrapes static websites using Scrapy's built-in HTTP request system and CSS/XPath selectors |
| scrapy-crawler-js | Renders JavaScript-heavy pages with Playwright, then parses the HTML output with Scrapy |

Getting started

Install dependencies

uv sync

If the intuned CLI is not installed, install it globally:

npm install -g @intuned/cli

After installation, the `intuned` command should be available in your environment.

Run an API

intuned dev run api scrapy-crawler .parameters/api/scrapy-crawler/default.json
intuned dev run api scrapy-crawler-js .parameters/api/scrapy-crawler-js/default.json

Save project

intuned dev provision

Deploy

intuned dev deploy

Project structure

/
├── api/
│   ├── scrapy-crawler.py     # Scrapy crawler using Scrapy's HTTP requests
│   └── scrapy-crawler-js.py  # Scrapy crawler using Playwright + Scrapy parsing
├── collector/
│   └── item_collector.py     # Collects scraped items via Scrapy signals
├── utils/
│   └── types_and_schemas.py  # Pydantic models for parameters and data
├── intuned-resources/
│   └── jobs/
│       ├── scrapy-crawler.job.jsonc    # Job for static site crawling
│       └── scrapy-crawler-js.job.jsonc # Job for JS-rendered crawling
├── .parameters/api/          # Test parameters
├── Intuned.jsonc             # Project config
├── pyproject.toml            # Python dependencies
└── README.md

Key features

  • scrapy-crawler: Best for static websites — uses Scrapy's CrawlerRunner for HTTP requests, CSS selectors, and pagination
  • scrapy-crawler-js: Best for JavaScript-heavy sites — uses Playwright to render pages before Scrapy parses the HTML

Related