Description
The scraper hangs indefinitely when parsing large HTML documents (>50 MB). Memory usage grows without bound and eventually causes an out-of-memory crash. No timeout or resource limits are implemented.
Steps to Reproduce
- Scrape a website that serves a large HTML document (50 MB or more)
- The parser loads the entire document into memory
- Memory usage grows to 2-4 GB
- The process crashes with an out-of-memory error
Environment Information
- Framework: Node.js
- Parser: Cheerio/jsdom
- Memory limit: None (usage is unbounded)
- Application version: Current main branch
Expected Behavior
The parser should stream large documents instead of buffering them whole. A maximum file size (10 MB) and a request timeout (30 s) should be enforced, and parsing should proceed in chunks to keep memory usage low.
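The 10 MB cap could be enforced while streaming, without first buffering the document. Below is a minimal sketch using only Node's built-in stream module. The names createSizeLimiter and MAX_BYTES are hypothetical and are not existing code in src/scrapers/parser.js.

```javascript
const { Transform } = require('stream');

// Assumed cap, taken from the 10 MB limit in the expected behavior.
const MAX_BYTES = 10 * 1024 * 1024;

// Hypothetical helper: passes chunks through unchanged until the byte
// budget is exceeded, then fails the stream so downstream parsing stops.
function createSizeLimiter(maxBytes = MAX_BYTES) {
  let seen = 0;
  return new Transform({
    transform(chunk, _encoding, callback) {
      seen += chunk.length;
      if (seen > maxBytes) {
        callback(new Error(`Document exceeds ${maxBytes} bytes`));
        return;
      }
      callback(null, chunk);
    },
  });
}
```

Placed between the read stream and the parser, a limiter like this would abort as soon as the cap is crossed, rather than after the full document is already in memory.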
Actual Behavior
File: src/scrapers/parser.js
The entire HTML document is loaded into memory in a single call: fs.readFileSync(large_file)
Code Reference
File: src/scrapers/parser.js
Missing: Stream processing, memory limits, timeout
Additional Context
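Use streams. stream.pipeline reports errors from either stream and destroys both on failure (parseStream stands for the streaming parser, which is not defined here):
const fs = require('fs');
const { pipeline } = require('stream');
pipeline(fs.createReadStream('large.html'), parseStream, (err) => {
  if (err) console.error('Parsing failed:', err.message);
});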
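The 30 s timeout from the expected behavior could be wired into the stream with an AbortController. This is a rough sketch, assuming Node 15+ for the promise-based pipeline and its signal option. parseWithTimeout and createCountingSink are hypothetical names, and the sink only counts bytes as a stand-in for a real streaming HTML parser.

```javascript
const fs = require('fs');
const { Writable } = require('stream');
const { pipeline } = require('stream/promises');

// Stand-in for a streaming parser: consumes chunks and records their size.
function createCountingSink(stats) {
  return new Writable({
    write(chunk, _encoding, callback) {
      stats.bytes += chunk.length;
      callback();
    },
  });
}

// Streams the file through the sink and aborts if it takes longer than
// timeoutMs. On abort, pipeline destroys both streams and rejects.
async function parseWithTimeout(filePath, timeoutMs = 30_000) {
  const stats = { bytes: 0 };
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    await pipeline(
      fs.createReadStream(filePath),
      createCountingSink(stats),
      { signal: controller.signal }
    );
    return stats;
  } finally {
    clearTimeout(timer); // avoid keeping the process alive after completion
  }
}
```

Because chunks are consumed as they arrive, memory use stays close to the read stream's buffer size instead of the document size.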
GSSoC Points Estimate: Level 2 (Performance/Memory)
Suggested Labels
- gssoc:approved
- type:bug
- severity:high
- area:performance