Implement robots.txt checker instance for crawler requests #41

Add a dedicated robots.txt checker instance that validates whether a URL is allowed to be crawled before it is requested.

The crawler should fetch and parse the corresponding robots.txt file for a target domain and evaluate the rules for the configured user agent. Before visiting a page, the crawler should check whether the requested path is permitted.
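A minimal sketch of what such a checker could look like, using Python's standard `urllib.robotparser`. The class name `RobotsChecker`, the `is_allowed` method, and the optional `fetch` hook are hypothetical names chosen for illustration, not part of the existing crawler. The sketch caches one parsed robots.txt per origin so the file is fetched only once per domain.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


class RobotsChecker:
    """Hypothetical checker: one cached robots.txt parser per origin."""

    def __init__(self, user_agent, fetch=None):
        self.user_agent = user_agent
        # Assumed hook: callable taking an origin such as
        # "https://example.com" and returning the robots.txt body as text.
        # When omitted, RobotFileParser.read() fetches over the network.
        self._fetch = fetch
        self._parsers = {}

    def _parser_for(self, url):
        parts = urlsplit(url)
        origin = f"{parts.scheme}://{parts.netloc}"
        parser = self._parsers.get(origin)
        if parser is None:
            parser = RobotFileParser(origin + "/robots.txt")
            if self._fetch is not None:
                parser.parse(self._fetch(origin).splitlines())
            else:
                parser.read()
            self._parsers[origin] = parser
        return parser

    def is_allowed(self, url):
        """Return True if the configured user agent may request `url`."""
        return self._parser_for(url).can_fetch(self.user_agent, url)
```

The crawler would then call `checker.is_allowed(url)` before each request and skip the URL when it returns `False`. How fetch errors, cache expiry, and missing robots.txt files should be handled is left open here and would need to be decided as part of this issue.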

This is needed so that the crawler respects each site's robots.txt rules and avoids requesting URLs that are explicitly disallowed.
