[Bug] Web Scraper Hangs on Large Documents - Memory Leak in HTML Parser #134

@anshul23102

Description

The scraper hangs indefinitely when parsing large HTML documents (>50MB). Memory usage grows without bound and eventually causes an out-of-memory crash. No timeout or resource limits are implemented.

Steps to Reproduce

  1. Scrape a website that serves a large HTML document (50MB+)
  2. The parser loads the entire document into memory
  3. Memory usage grows to 2-4 GB
  4. The process crashes
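
For anyone trying to reproduce this locally without a real 50MB+ page, a test file can be generated first. This is only a sketch: `writeLargeHtml` and the row markup are made up for illustration and are not part of the repository.

```javascript
const fs = require('node:fs');

// Hypothetical helper (not in the repo): writes at least `targetBytes` of
// repetitive HTML to `path` and returns the number of bytes written.
function writeLargeHtml(path, targetBytes) {
  const row = '<div class="row"><p>Lorem ipsum dolor sit amet</p></div>\n';
  const fd = fs.openSync(path, 'w');
  try {
    // fs.writeSync returns the number of bytes it wrote.
    let written = fs.writeSync(fd, '<html><body>\n');
    while (written < targetBytes) {
      written += fs.writeSync(fd, row);
    }
    written += fs.writeSync(fd, '</body></html>\n');
    return written;
  } finally {
    fs.closeSync(fd);
  }
}
```

Calling `writeLargeHtml('large.html', 50 * 1024 * 1024)` should produce a file just over 50MB that can then be fed to the parser.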

Environment Information

  • Framework: Node.js
  • Parser: Cheerio/jsdom
  • Memory: Unbounded
  • Application version: Current main branch

Expected Behavior

The parser should stream large documents instead of buffering them. A maximum file size (10MB) should be enforced, requests should time out after 30s, and parsing should be chunked to stay memory-efficient.
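
One possible way to enforce the size cap while streaming is a pass-through stage that counts bytes and fails once the limit is crossed. This is a rough sketch using only Node's built-in `stream` module; `addWithinLimit` and `createByteLimit` are hypothetical names, and the 10MB default is taken from the expectation above.

```javascript
const { Transform } = require('node:stream');

const MAX_BYTES = 10 * 1024 * 1024; // 10MB cap proposed in this issue

// Hypothetical helper: returns the new running total, or throws once the
// total would exceed `maxBytes`.
function addWithinLimit(seenBytes, chunkBytes, maxBytes = MAX_BYTES) {
  const total = seenBytes + chunkBytes;
  if (total > maxBytes) {
    throw new RangeError(`Document exceeds ${maxBytes} bytes`);
  }
  return total;
}

// Hypothetical helper: a Transform that forwards chunks unchanged and errors
// the stream as soon as more than `maxBytes` have passed through.
function createByteLimit(maxBytes = MAX_BYTES) {
  let seen = 0;
  return new Transform({
    transform(chunk, _encoding, callback) {
      try {
        // Chunks arrive as Buffers by default, so .length is a byte count.
        seen = addWithinLimit(seen, chunk.length, maxBytes);
      } catch (err) {
        callback(err);
        return;
      }
      callback(null, chunk);
    },
  });
}
```

Placed between the source stream and the parser, this would stop reading shortly after the limit is hit instead of buffering the whole document, which also covers HTTP responses whose size is not known up front.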

Actual Behavior

File: src/scrapers/parser.js
Loads the entire HTML file into memory with fs.readFileSync(large_file)
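
As a stopgap until streaming lands, the synchronous read could at least be guarded by a size check. A minimal sketch, assuming the input is a file on disk; `readHtmlWithLimit` is a hypothetical name and I have not checked how parser.js is actually structured.

```javascript
const fs = require('node:fs');

const MAX_BYTES = 10 * 1024 * 1024; // 10MB cap proposed in this issue

// Hypothetical helper: refuses to read files larger than `maxBytes`,
// otherwise returns the file contents as a UTF-8 string.
function readHtmlWithLimit(path, maxBytes = MAX_BYTES) {
  const { size } = fs.statSync(path);
  if (size > maxBytes) {
    throw new RangeError(`${path} is ${size} bytes; limit is ${maxBytes}`);
  }
  return fs.readFileSync(path, 'utf8');
}
```

This does not fix the underlying memory behavior; it only turns the out-of-memory crash into an early, explicit error for oversized files.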

Code Reference

File: src/scrapers/parser.js
Missing: Stream processing, memory limits, timeout

Additional Context

Use streams:

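const fs = require('node:fs');
const { pipeline } = require('node:stream');
// parseStream: the streaming parser (not defined in this issue)
pipeline(fs.createReadStream('large.html'), parseStream, (err) => {
  if (err) console.error('Parsing failed:', err.message); // errors from any stage land here
});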
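
To make the streaming idea more concrete, here is a slightly fuller sketch that processes the file chunk by chunk and aborts after a timeout. `ChunkedCounter` is only a stand-in for a real streaming HTML parser (it counts occurrences of a string such as `<a ` across chunk boundaries), and `countInFile` is a hypothetical name. The 30s default comes from the expected behavior above. It relies on `AbortSignal.timeout`, which needs a reasonably recent Node.js.

```javascript
const fs = require('node:fs');
const { Writable } = require('node:stream');
const { pipeline } = require('node:stream/promises');

// Hypothetical stand-in for a streaming parser: counts non-overlapping
// occurrences of `needle`, keeping only a short tail between chunks.
class ChunkedCounter {
  constructor(needle) {
    if (!needle) throw new TypeError('needle must be a non-empty string');
    this.needle = needle;
    this.tail = '';
    this.count = 0;
  }

  push(text) {
    const data = this.tail + text;
    let from = 0;
    let at;
    while ((at = data.indexOf(this.needle, from)) !== -1) {
      this.count += 1;
      from = at + this.needle.length;
    }
    // Keep at most needle.length - 1 characters, and never text that was
    // already consumed by a match, so boundary matches are counted once.
    this.tail = data.slice(Math.max(from, data.length - (this.needle.length - 1)));
  }
}

// Streams `path` through the counter; rejects if it takes longer than
// `timeoutMs` or if any stage fails.
async function countInFile(path, needle, timeoutMs = 30_000) {
  const counter = new ChunkedCounter(needle);
  const sink = new Writable({
    decodeStrings: false, // receive the decoded strings as-is
    write(chunk, _encoding, callback) {
      counter.push(chunk);
      callback();
    },
  });
  await pipeline(
    fs.createReadStream(path, { encoding: 'utf8' }),
    sink,
    { signal: AbortSignal.timeout(timeoutMs) },
  );
  return counter.count;
}
```

Memory use here stays roughly at one chunk plus a few characters of tail, regardless of file size. A real fix would swap the counter for an actual streaming HTML parser, but the pipeline and timeout wiring should look similar.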

GSSoC Points Estimate: Level 2 (Performance/Memory)

Suggested Labels

  • gssoc:approved
  • type:bug
  • severity:high
  • area:performance
