Now that the foundational architecture and Scrapy-Playwright pipeline for the Glassdoor spider are merged in #55, the next step is to make the spider reliable against the live site.
During local testing, the spider consistently hits TimeoutError failures and Cloudflare challenge pages at browser initialization. To make the scraper usable in production, we need to add anti-bot handling and proxy middleware.
Proposed Changes / Checklist:
[ ] Integrate a proxy rotation middleware (e.g., Scrapy rotating proxies, Tor, or a custom proxy service provider).
[ ] Implement anti-bot handling techniques (e.g., randomizing User-Agents, managing custom headers, or adjusting Playwright navigation timeouts).
[ ] Harden the extraction logic to detect network-level failures and challenge pages and handle them gracefully, without crashing the spider.
[ ] Verify end-to-end data ingestion into the jobs:raw Redis stream once blocks are bypassed.
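
As a starting point for the anti-bot and hardening items above, here is a minimal sketch of two helpers, not a definitive implementation. The names `USER_AGENTS`, `CHALLENGE_MARKERS`, `looks_like_challenge`, and `build_request_kwargs` are hypothetical, and the challenge markers and status codes are heuristics I am assuming from commonly seen Cloudflare interstitials. The `playwright` and `playwright_page_goto_kwargs` meta keys are the ones scrapy-playwright reads from a request.

```python
import random

# Hypothetical User-Agent pool; real values would need to be kept current.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.4 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:125.0) Gecko/20100101 Firefox/125.0",
]

# Assumed heuristics: substrings often present in Cloudflare challenge pages.
CHALLENGE_MARKERS = ("just a moment", "cf-chl", "challenge-platform")

# Assumed heuristics: statuses a challenge or block is usually served with.
BLOCK_STATUSES = (403, 429, 503)


def looks_like_challenge(status: int, body: str) -> bool:
    """Return True if the response resembles an anti-bot challenge page."""
    if status not in BLOCK_STATUSES:
        return False
    lowered = body.lower()
    return any(marker in lowered for marker in CHALLENGE_MARKERS)


def build_request_kwargs(timeout_ms: int = 60_000, rng=random) -> dict:
    """Build headers and meta to pass to scrapy.Request(url, **kwargs)."""
    return {
        # Randomize the User-Agent per request.
        "headers": {"User-Agent": rng.choice(USER_AGENTS)},
        "meta": {
            # Route the request through scrapy-playwright.
            "playwright": True,
            # Forwarded to Playwright's page.goto(); timeout is in milliseconds.
            "playwright_page_goto_kwargs": {
                "timeout": timeout_ms,
                "wait_until": "domcontentloaded",
            },
        },
    }
```

In the spider callback, a check such as `if looks_like_challenge(response.status, response.text):` could log the block and return early instead of running the extraction. Note that Scrapy filters out non-2xx responses by default, so the spider would need `handle_httpstatus_list = [403, 429, 503]` (or the equivalent request meta) for these responses to reach the callback at all.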
If this looks good, please assign this issue to me!