Resolve and crawl relative links #3

@jakobx0

Description

We currently crawl only absolute URLs (http(s)://…). Pages often use relative links (/docs, ../page.html, ch01.html…), and the crawler skips all of them, so we miss a large part of the site.

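The goal is to extract the href of every anchor (a) element and resolve it against the page URL into a validated absolute http or https URL. Links that are only in-page anchors (#…) and links with non-web schemes (mailto:, javascript:, tel:, and so on) should be skipped.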
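
A minimal sketch of one way this could work, using only the Python standard library. The names `extract_links` and `_HrefCollector` are hypothetical and not taken from the existing crawler. Two behaviours here are assumptions rather than stated requirements: fragments are stripped from resolved URLs, and duplicates are dropped while preserving order.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlsplit


class _HrefCollector(HTMLParser):
    """Collects the raw href value of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # value is None for a bare attribute such as <a href>
            if name == "href" and value:
                self.hrefs.append(value.strip())


def extract_links(page_url, html):
    """Return absolute http(s) URLs found in html, resolved against page_url."""
    collector = _HrefCollector()
    collector.feed(html)

    seen = set()
    links = []
    for href in collector.hrefs:
        if href.startswith("#"):
            continue  # in-page anchor only, nothing new to crawl

        # Resolve relative to the page URL, then drop any fragment
        # (assumption: page.html#a and page.html are the same crawl target).
        absolute, _fragment = urldefrag(urljoin(page_url, href))

        parts = urlsplit(absolute)
        if parts.scheme not in ("http", "https") or not parts.netloc:
            continue  # mailto:, javascript:, tel:, malformed URLs, ...

        if absolute not in seen:  # assumption: de-duplicate, keep first-seen order
            seen.add(absolute)
            links.append(absolute)
    return links
```

Called with the page URL https://example.com/book/index.html, the relative links from the description would resolve as follows: /docs to https://example.com/docs, ../page.html to https://example.com/page.html, and ch01.html to https://example.com/book/ch01.html.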
