[Feature/Enhancement] Implement proxy rotation and anti-bot handling for Glassdoor spider

Now that the foundational architecture and Scrapy-Playwright pipeline for the Glassdoor spider are merged in [#55](https://github.com/sharmavaibhav31/arachnode/pull/55), the next step is to ensure live reliability.

During local testing, the spider consistently hits `TimeoutError` failures and Cloudflare challenges at browser initialization. To make the scraper usable in production, we need to integrate anti-bot handling and a proxy middleware.

Proposed Changes / Checklist:

- [ ] Integrate a proxy rotation middleware (e.g., Scrapy rotating proxies, Tor, or a custom proxy service provider).
- [ ] Implement anti-bot handling techniques (e.g., randomizing User-Agents, managing custom headers, or adjusting Playwright navigation timeouts).
- [ ] Harden the extraction logic to gracefully catch network-level or challenge-page blocks without crashing the spider runtime.
- [ ] Verify end-to-end data ingestion into the `jobs:raw` Redis stream once blocks are bypassed.
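
To make the first three checklist items concrete, here is a rough sketch of how rotation and block detection could look. It is not the final design. It assumes we route proxies through scrapy-playwright's `playwright_context` / `playwright_context_kwargs` request meta keys (Playwright contexts accept `proxy` and `user_agent` options) rather than through Scrapy's own proxy middleware. `ProxyRotator`, `looks_like_challenge`, the proxy URLs, the User-Agent strings, and the challenge markers are all hypothetical placeholders, and the marker list is only a heuristic guess at what a Cloudflare interstitial contains.

```python
import itertools
import random

# Hypothetical placeholder values; real proxies and User-Agents would come from settings.
PROXIES = ["http://proxy-a.example:8000", "http://proxy-b.example:8000"]
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/1.0",
]

# Assumed heuristic markers for a challenge page; not an exhaustive or official list.
CHALLENGE_MARKERS = ("just a moment", "cf-chl", "challenge-platform")
# Status codes treated as "blocked" in this sketch.
BLOCK_STATUSES = (403, 429, 503)


class ProxyRotator:
    """Round-robin over proxies, one named Playwright context per proxy."""

    def __init__(self, proxies, user_agents, seed=None):
        if not proxies or not user_agents:
            raise ValueError("need at least one proxy and one User-Agent")
        rng = random.Random(seed)
        # Context kwargs only apply when a context is first created, so each
        # proxy gets a single randomly chosen User-Agent up front.
        slots = [
            (f"proxy-{index}", proxy, rng.choice(list(user_agents)))
            for index, proxy in enumerate(proxies)
        ]
        self._slots = itertools.cycle(slots)

    def next_meta(self):
        """Return request meta selecting the next proxy-bound context."""
        name, proxy, user_agent = next(self._slots)
        return {
            "playwright": True,
            "playwright_context": name,
            "playwright_context_kwargs": {
                "proxy": {"server": proxy},
                "user_agent": user_agent,
            },
        }


def looks_like_challenge(status, body):
    """Heuristically decide whether a response is a block or challenge page."""
    if status in BLOCK_STATUSES:
        return True
    lowered = body.lower()
    return any(marker in lowered for marker in CHALLENGE_MARKERS)
```

In the spider, each request would be built with `meta=rotator.next_meta()`, and the parse callback could call `looks_like_challenge(response.status, response.text)` and retry or skip instead of raising. Reusing one context per proxy keeps the number of open browser contexts bounded, at the cost of a fixed User-Agent per proxy.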

If this looks good, please assign this issue to me!

[Feature/Enhancement] Implement proxy rotation and anti-bot handling for Glassdoor spider #94