[Bug] Relative URLs Not Resolved - Links Reference Non-Existent Paths #138

@anshul23102

Description

When the scraper encounters a relative URL (e.g., "/products/item"), it stores the value as-is instead of resolving it to an absolute URL. The stored link is broken in any context that does not know the page it came from, so downstream systems cannot follow it.

Steps to Reproduce

  1. Scrape page: https://example.com/blog/post
  2. Page contains a link whose href is "/products/item"
  3. Scraper extracts: "/products/item"
  4. Link is stored in the database unchanged
  5. Requesting "/products/item" without a base URL fails (404)
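
The problem in step 5 can be checked without running the scraper. The following is a minimal sketch, assuming Node.js with the standard WHATWG `URL` global; `isAbsoluteUrl` is a hypothetical helper introduced only for illustration and is not part of the project.

```javascript
// Hypothetical helper: true only if the string parses as an absolute URL.
function isAbsoluteUrl(value) {
  try {
    new URL(value); // throws a TypeError when there is no scheme and no base
    return true;
  } catch {
    return false;
  }
}

const stored = "/products/item"; // the value the scraper currently saves
console.log(isAbsoluteUrl(stored)); // false
console.log(isAbsoluteUrl("https://example.com/products/item")); // true
```

A check like this could also be used to find rows in the database that were already stored in relative form.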

Environment Information

  • Framework: Cheerio
  • URL handling: No resolution
  • Storage: Relative URLs saved
  • Application version: Current main branch

Expected Behavior

All relative URLs are resolved to absolute URLs, using the URL of the scraped page as the base. Stored as: https://example.com/products/item
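
To make the expected results concrete, here is a short sketch of how the standard `URL` constructor resolves several kinds of href against the page from the reproduction steps. The `resolve` wrapper and the sample hrefs other than "/products/item" are illustrative assumptions, not values taken from the project.

```javascript
// Page URL from the reproduction steps, used as the base for resolution.
const pageUrl = "https://example.com/blog/post";

// Illustrative wrapper around the standard WHATWG URL constructor.
const resolve = (href) => new URL(href, pageUrl).href;

console.log(resolve("/products/item"));          // https://example.com/products/item
console.log(resolve("related"));                 // https://example.com/blog/related
console.log(resolve("../about"));                // https://example.com/about
console.log(resolve("?page=2"));                 // https://example.com/blog/post?page=2
console.log(resolve("//cdn.example.com/a.png")); // https://cdn.example.com/a.png
console.log(resolve("https://other.example/x")); // https://other.example/x
```

Path-relative links such as "related" depend on the full page URL, not only the domain, which is why the page URL is used as the base here.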

Actual Behavior

File: src/services/linkExtractor.js
Extracts href without resolving: const url = link.attr('href')

Code Reference

File: src/services/linkExtractor.js
Missing: URL resolution using base URL

Additional Context

Resolve URLs:

const absoluteUrl = new URL(relativeUrl, baseUrl).href;
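
The one-liner above throws if the href is missing or malformed, or if the base URL is invalid. A slightly more defensive sketch might look like the following; `toAbsoluteUrl` and its return-`null`-on-failure policy are assumptions for illustration, not existing code in `src/services/linkExtractor.js`.

```javascript
// Hypothetical helper (name and null-on-failure policy are assumptions).
// Resolves href against baseUrl; returns null when resolution is not possible.
function toAbsoluteUrl(href, baseUrl) {
  // attr('href') can return undefined when the attribute is absent.
  if (typeof href !== "string" || href.trim() === "") return null;
  try {
    return new URL(href.trim(), baseUrl).href;
  } catch {
    return null; // malformed href or invalid baseUrl
  }
}

console.log(toAbsoluteUrl("/products/item", "https://example.com/blog/post"));
// https://example.com/products/item
console.log(toAbsoluteUrl(undefined, "https://example.com/blog/post")); // null
```

In the extractor this could replace the bare `link.attr('href')` read, for example `const url = toAbsoluteUrl(link.attr('href'), pageUrl)`, assuming the URL of the scraped page is available there as `pageUrl`.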

GSSoC Points Estimate: Level 1 (Bug/URL Handling)

Suggested Labels

  • gssoc:approved
  • type:bug
  • severity:medium
  • area:data-processing
