Description
When the scraper encounters relative URLs (e.g., "/products/item"), it stores them as-is instead of resolving them to absolute URLs. The stored links break as soon as they are used outside the context of the source page, so downstream systems cannot follow them.
Steps to Reproduce
- Scrape the page https://example.com/blog/post
- The page contains a link whose href is "/products/item"
- The scraper extracts "/products/item"
- The link is stored in the database unchanged
- Requesting "/products/item" without the base URL fails with a 404
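The last step can be illustrated without a network call. This is a minimal sketch, assuming the downstream consumer only has the stored string and no knowledge of the source page; the function name `isUsableOnItsOwn` is hypothetical and not part of the codebase. It relies only on the built-in `URL` constructor, which throws when given a relative string and no base.

```javascript
// Hypothetical check (not from the repo): can a stored value be parsed
// as a URL when no base URL is available, as in a downstream consumer?
function isUsableOnItsOwn(storedUrl) {
  try {
    new URL(storedUrl); // no base supplied on purpose
    return true;
  } catch {
    // The URL constructor throws a TypeError for relative input without a base.
    return false;
  }
}

console.log(isUsableOnItsOwn('/products/item'));
console.log(isUsableOnItsOwn('https://example.com/products/item'));
```

The first call reports the stored relative value as unusable, while the absolute form of the same link parses fine.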
Environment Information
- Framework: Cheerio
- URL handling: No resolution
- Storage: Relative URLs saved
- Application version: Current main branch
Expected Behavior
All relative URLs are resolved to absolute URLs against the URL of the page they were scraped from, before being stored. Stored as: https://example.com/products/item
Actual Behavior
File: src/services/linkExtractor.js
The extractor reads the href attribute and stores it without resolving it: const url = link.attr('href')
Code Reference
File: src/services/linkExtractor.js
Missing: resolution of each extracted href against the base URL of the scraped page
Additional Context
Resolve each extracted href against the scraped page's URL with the built-in URL constructor:
const absoluteUrl = new URL(relativeUrl, baseUrl).href;
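A slightly more defensive version of the one-liner above might look like the following. This is only a sketch under stated assumptions: the helper name `resolveHref` and the decision to skip non-HTTP(S) links (such as `mailto:`) are illustrative choices, not existing behavior of `linkExtractor.js`. It uses only the built-in `URL` constructor.

```javascript
// Hypothetical helper (name and filtering policy are assumptions):
// resolve an extracted href against the URL of the page it came from.
function resolveHref(href, pageUrl) {
  // Missing or empty href attributes cannot be resolved.
  if (typeof href !== 'string' || href.trim() === '') return null;
  try {
    const resolved = new URL(href.trim(), pageUrl);
    // Assumption: only web links are worth storing; skip mailto:, tel:, etc.
    if (resolved.protocol !== 'http:' && resolved.protocol !== 'https:') {
      return null;
    }
    return resolved.href;
  } catch {
    // Thrown when href or pageUrl cannot be parsed into a valid URL.
    return null;
  }
}

const page = 'https://example.com/blog/post';
console.log(resolveHref('/products/item', page));
console.log(resolveHref('../about', page));
console.log(resolveHref('mailto:team@example.com', page));
```

Resolving against the full page URL rather than just the domain matters for path-relative links such as "../about", which depend on the path of the source page. In the extractor, the value returned by `link.attr('href')` could be passed through a helper like this before storage.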
GSSoC Points Estimate: Level 1 (Bug/URL Handling)
Suggested Labels
- gssoc:approved
- type:bug
- severity:medium
- area:data-processing