We currently crawl only absolute URLs (http(s)://…). Pages often use relative links (/docs, ../page.html, ch01.html…), which are skipped, so a large part of the site is never crawled.
The goal is to extract every <a href> value and resolve it against the page URL into a validated absolute http or https URL. Skip fragment-only anchors (#section) and non-web schemes (for example mailto: or javascript:).
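
A minimal sketch of that resolution step, using only the Python standard library. The names extract_links and _HrefCollector are hypothetical and not taken from the existing crawler. Two behaviors here are assumptions: "validated" is taken to mean "has an http or https scheme and a host", and any #fragment is dropped from the resolved URL.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlsplit


class _HrefCollector(HTMLParser):
    """Collects the raw href value of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value.strip())


def extract_links(page_url, html):
    """Return absolute http(s) URLs found in html, resolved against page_url."""
    collector = _HrefCollector()
    collector.feed(html)

    links = []
    for href in collector.hrefs:
        if not href or href.startswith("#"):
            continue  # fragment-only anchor pointing into the same page

        # Resolve relative forms (/docs, ../page.html, ch01.html) against
        # the page URL, then drop any #fragment (an assumption, see above).
        absolute, _fragment = urldefrag(urljoin(page_url, href))

        parts = urlsplit(absolute)
        if parts.scheme not in ("http", "https") or not parts.netloc:
            continue  # mailto:, javascript:, tel:, and similar schemes

        links.append(absolute)
    return links
```

For a page at https://example.com/guide/intro.html, the hrefs /docs, ../page.html and ch01.html resolve to https://example.com/docs, https://example.com/page.html and https://example.com/guide/ch01.html. The sketch does not deduplicate links or restrict them to the same host; whether that belongs here or in the crawl queue is left open.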