Add a dedicated robots.txt checker that validates whether a URL may be crawled before it is requested.
The crawler should fetch and parse the corresponding robots.txt file for a target domain and evaluate the rules for the configured user agent. Before visiting a page, the crawler should check whether the requested path is permitted.
This is needed to make crawling behavior compliant with the Robots Exclusion Protocol and to avoid requesting URLs that a site explicitly disallows.
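
A minimal sketch of what such a checker could look like, assuming a Python crawler and the standard library's `urllib.robotparser`. The class name `RobotsChecker`, the `is_allowed` method, and the injectable `fetch` callable are hypothetical names chosen for illustration and are not part of the existing code base. The sketch caches one parsed robots.txt per origin so the file is fetched only once per site.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


class RobotsChecker:
    """Hypothetical helper: answers 'may this URL be crawled?' per user agent."""

    def __init__(self, user_agent, fetch=None):
        self.user_agent = user_agent
        # Optional callable: fetch(robots_url) -> robots.txt body as text.
        # Assumption: injected so the crawler's own HTTP client (or a test
        # stub) can be reused; when omitted, urllib performs the request.
        self._fetch = fetch
        # One parsed robots.txt per origin (scheme + host[:port]).
        self._parsers = {}

    def _parser_for(self, url):
        parts = urlsplit(url)
        origin = f"{parts.scheme}://{parts.netloc}"
        parser = self._parsers.get(origin)
        if parser is None:
            robots_url = origin + "/robots.txt"
            parser = RobotFileParser(robots_url)
            if self._fetch is None:
                parser.read()  # network request through urllib
            else:
                parser.parse(self._fetch(robots_url).splitlines())
            self._parsers[origin] = parser
        return parser

    def is_allowed(self, url):
        # Evaluate the cached rules for the configured user agent.
        return self._parser_for(url).can_fetch(self.user_agent, url)
```

The crawler would call `is_allowed(url)` immediately before each request and skip the URL when it returns `False`. How to treat a missing or unreachable robots.txt, and how long to keep cached rules, are left open here and would need to be decided for the real implementation.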