
[Feature/Enhancement] Implement proxy rotation and anti-bot handling for Glassdoor spider #94

@krupalibachudasama2537-lgtm

Description

Now that the foundational architecture and Scrapy-Playwright pipeline for the Glassdoor spider are merged in #55, the next step is to ensure live reliability.

During local testing, the spider consistently fails with TimeoutError and hits Cloudflare challenge pages at browser initialization. To make the scraper usable in production, we need to add anti-bot handling and a proxy middleware.

Proposed Changes / Checklist:
[ ] Integrate a proxy rotation middleware (e.g., scrapy-rotating-proxies, Tor, or a custom proxy service provider).

[ ] Implement anti-bot handling techniques (e.g., randomizing User-Agents, managing custom headers, or adjusting Playwright navigation timeouts).

[ ] Harden the extraction logic to gracefully catch network-level or challenge-page blocks without crashing the spider runtime.

[ ] Verify end-to-end data ingestion into the jobs:raw Redis stream once blocks are bypassed.
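
A rough sketch of how the proxy and User-Agent rotation from the first two items could look with scrapy-playwright. It is only an illustration, not a final design. The proxy URLs, the `PROFILES` list, and the `build_request_kwargs` helper are all hypothetical. The meta keys (`playwright_context`, `playwright_context_kwargs`, `playwright_page_goto_kwargs`) are the ones scrapy-playwright documents, and the sketch assumes the installed version forwards `proxy` and `user_agent` to Playwright's context creation. Each proxy is paired with one fixed User-Agent, so the request header and the browser context stay consistent.

```python
import itertools

# Hypothetical (proxy, User-Agent) profiles; real values would come from
# project settings or a proxy provider.
PROFILES = [
    ("http://proxy-a.example:8000",
     "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
     "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"),
    ("http://proxy-b.example:8000",
     "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
     "(KHTML, like Gecko) Version/17.4 Safari/605.1.15"),
]

# Round-robin over profile indexes.
_profile_cycle = itertools.cycle(range(len(PROFILES)))


def build_request_kwargs(timeout_ms=60_000):
    """Return headers and meta for one request, using the next profile."""
    index = next(_profile_cycle)
    proxy, user_agent = PROFILES[index]
    return {
        "headers": {"User-Agent": user_agent},
        "meta": {
            "playwright": True,
            # One named browser context per profile, each with its own cookies.
            "playwright_context": f"proxy-{index}",
            # Only used when scrapy-playwright first creates that context.
            "playwright_context_kwargs": {
                "proxy": {"server": proxy},
                "user_agent": user_agent,
            },
            # Forwarded to Playwright's page.goto().
            "playwright_page_goto_kwargs": {"timeout": timeout_ms},
        },
    }
```

In the spider this would be used as `scrapy.Request(url, **build_request_kwargs())`. Round-robin is used here instead of random choice so that the behavior is easy to reason about; swapping in random selection would be a small change.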

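For the hardening item in the checklist, one possible approach is to check each response for block or challenge signals before running the extraction, and to return no items instead of raising. The sketch below is a plain-Python illustration. The marker strings and status codes are assumptions about what a Cloudflare interstitial looks like and would need to be checked against real blocked responses. `is_challenge_page` and `parse_safely` are hypothetical helper names.

```python
# Heuristic signals only; assumptions to be verified against real blocked pages.
CHALLENGE_MARKERS = ("just a moment", "cf-chl", "challenge-platform")
BLOCK_STATUSES = {403, 429, 503}


def is_challenge_page(status, body):
    """Return True when a response looks like a block or challenge page."""
    text = body.lower()
    has_marker = any(marker in text for marker in CHALLENGE_MARKERS)
    return has_marker or status in BLOCK_STATUSES


def parse_safely(status, body, extract):
    """Run extract(body) and return its items as a list.

    Returns an empty list for blocked pages, or when extraction fails
    because the page does not have the expected structure.
    """
    if is_challenge_page(status, body):
        return []
    try:
        return list(extract(body))
    except (AttributeError, IndexError, KeyError, ValueError):
        return []
```

In a Scrapy callback, `status` and `body` would come from `response.status` and `response.text`. Scrapy's default HttpError middleware drops non-2xx responses before they reach the callback, so the status check would only matter if those codes are allowed through (for example via `handle_httpstatus_list`). Navigation timeouts would still need a request `errback`.
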
If this looks good, please assign this issue to me!
