Skip to content

fix: wrap pdfParse in timeout to prevent event loop DoS#4138

Open
taniy8 wants to merge 1 commit into
anurag3407:mainfrom
taniy8:fix/pdf-parse-timeout
Open

fix: wrap pdfParse in timeout to prevent event loop DoS#4138
taniy8 wants to merge 1 commit into
anurag3407:mainfrom
taniy8:fix/pdf-parse-timeout

Conversation

@taniy8

@taniy8 taniy8 commented Jun 16, 2026

Copy link
Copy Markdown

User description

Description

Both POST /upload and POST /upload/extract-text endpoints were calling pdfParse(fileBuffer)
with no execution timeout. A crafted PDF with deeply nested object streams or thousands of pages
could block the Node.js event loop indefinitely, causing a DoS for all concurrent requests.
Added a parseWithTimeout() helper that wraps pdfParse in a Promise.race with an 8-second
timeout, rejecting with an error if parsing exceeds the limit.

Type of Change

  • Bug fix
  • New feature
  • Documentation update
  • Performance improvement
  • Other (describe)

Related Issue

Fixes #4133

Testing

  • Unit tests pass
  • Tested Locally
  • New tests added

Screenshots (MANDATORY for UI/UX changes)

  • Attached Screenshots or Screen Recordings (showing before and after)
  • If your PR does not make any visual/UI changes, please write "N/A"

Checklist

  • Code follows project style guidelines
  • Self-reviewed my code
  • Added comments where necessary
  • Updated documentation
  • No new warnings generated

CodeAnt-AI Description

Stop PDF uploads from hanging the server

What Changed

  • PDF uploads now stop after 8 seconds if text extraction takes too long
  • Both upload endpoints use the same timeout, so a badly formed or overly complex PDF no longer blocks requests indefinitely
  • Users now get a clear failure instead of an upload that can freeze the app

Impact

✅ Fewer upload hangs
✅ Lower risk of server lockups
✅ Clearer PDF upload failures

💡 Usage Guide

Checking Your Pull Request

Every time you make a pull request, our system automatically looks through it. We check for security issues, mistakes in how you're setting up your infrastructure, and common code problems. We do this to make sure your changes are solid and won't cause any trouble later.

Talking to CodeAnt AI

Got a question or need a hand with something in your pull request? You can easily get in touch with CodeAnt AI right here. Just type the following in a comment on your pull request, and replace "Your question here" with whatever you want to ask:

@codeant-ai ask: Your question here

This lets you have a chat with CodeAnt AI about your pull request, making it easier to understand and improve your code.

Example

@codeant-ai ask: Can you suggest a safer alternative to storing this secret?

Preserve Org Learnings with CodeAnt

You can record team preferences so CodeAnt AI applies them in future reviews. Reply directly to the specific CodeAnt AI suggestion (in the same thread) and replace "Your feedback here" with your input:

@codeant-ai: Your feedback here

This helps CodeAnt AI learn and adapt to your team's coding style and standards.

Example

@codeant-ai: Do not flag unused imports.

Retrigger review

Ask CodeAnt AI to review the PR again, by typing:

@codeant-ai: review

Check Your Repository Health

To analyze the health of your code repository, visit our dashboard at https://app.codeant.ai. This tool helps you identify potential issues and areas for improvement in your codebase, ensuring your repository maintains high standards of code health.

Summary by CodeRabbit

Release Notes

  • Bug Fixes
    • Added timeout protection for PDF parsing operations with a default 8-second limit, improving reliability of document uploads and text extraction.

@vercel

vercel Bot commented Jun 16, 2026

Copy link
Copy Markdown

@taniy8 is attempting to deploy a commit to the Anurag Mishra's projects Team on Vercel.

A member of the Team first needs to authorize it.

@codeant-ai

codeant-ai Bot commented Jun 16, 2026

Copy link
Copy Markdown

Thanks for using CodeAnt! 🎉

We're free for open-source projects. if you're enjoying it, help us grow by sharing.

Share on X ·
Reddit ·
LinkedIn

@coderabbitai

coderabbitai Bot commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

📝 Walkthrough

Walkthrough

A parseWithTimeout helper is added to backend/src/routes/upload.js, wrapping pdfParse in a Promise.race that rejects after 8 seconds. Both the POST / and POST /extract-text route handlers are updated to call parseWithTimeout(fileBuffer) instead of pdfParse(fileBuffer) directly.

Changes

PDF Parse Timeout Guard

Layer / File(s) Summary
parseWithTimeout helper and call-site updates
backend/src/routes/upload.js
Adds parseWithTimeout(buffer, ms = 8000) that races pdfParse(buffer) against a timer-based rejection. Replaces the direct pdfParse(fileBuffer) calls in both the POST / upload handler (line 26) and the POST /extract-text handler (line 64) with parseWithTimeout(fileBuffer).

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Poem

🐇 A rabbit once waited and waited in vain,
While a PDF parsed through an infinite chain.
So I wrapped it in race() with a clock on the wall —
Eight seconds, no more, then I drop the whole call.
No more blocked event loops, no event-loop pain! 🕐

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main change: adding a timeout wrapper around pdfParse to prevent event loop DoS attacks.
Linked Issues check ✅ Passed The PR implements the exact fix specified in #4133: wrapping pdfParse in parseWithTimeout with Promise.race and 8-second timeout, addressing the DoS vulnerability.
Out of Scope Changes check ✅ Passed All changes are directly scoped to addressing the timeout vulnerability in #4133; no unrelated modifications or scope creep detected.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@codeant-ai codeant-ai Bot added the size:XS This PR changes 0-9 lines, ignoring generated files label Jun 16, 2026
Comment on lines +11 to +17
const parseWithTimeout = (buffer, ms = 8000) =>
Promise.race([
pdfParse(buffer),
new Promise((_, reject) =>
setTimeout(() => reject(new Error('PDF parsing timed out')), ms)
)
]);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: The timeout wrapper does not actually stop pdfParse; Promise.race only rejects early, but the parser continues running in the background. If parsing is CPU-bound (the exact DoS case described), this still blocks the event loop and the timeout cannot preempt it, so the DoS mitigation is ineffective. Run parsing in an isolated worker/process and terminate that worker on timeout (or use a parser that supports abort/cancellation). [incomplete implementation]

Severity Level: Critical 🚨
- ❌ Malicious PDF can still block /api/upload handlers.
- ❌ Event loop blocked preventing all other API responses.
- ⚠️ DoS risk remains despite added parse timeout wrapper.
Steps of Reproduction ✅
1. Start the backend server so that `backend/src/index.js` mounts the upload router at
`app.use('/api/upload', uploadRoutes);` (lines 78–81), which exposes `POST /api/upload/`
and `POST /api/upload/extract-text` via `backend/src/routes/upload.js`.

2. From a client, send a crafted, pathologically complex PDF (e.g., deeply nested objects
/ huge page count) to `POST /api/upload/` so the request is handled by the route at
`backend/src/routes/upload.js:19-55`, which reads the file buffer at line 25 and calls
`const pdfData = await parseWithTimeout(fileBuffer);` at line 26.

3. Inside `parseWithTimeout` at `backend/src/routes/upload.js:11-17`, `pdfParse(buffer)`
(line 13) begins heavy, CPU-bound parsing in the main Node.js process; while this
synchronous work is executing, the event loop is occupied and the competing timeout
created by `setTimeout(..., ms)` at lines 14-15 cannot actually run until parsing yields
back to the event loop.

4. Under these conditions, observe that other requests (e.g., `GET /health` implemented in
`backend/src/index.js:68-73`) hang and the process CPU remains saturated even after the
intended 8-second timeout, demonstrating that the `Promise.race` wrapper does not preempt
or cancel the underlying `pdfParse` work and the DoS risk from worst-case PDFs remains.

Fix in Cursor Fix in VSCode Claude

(Use Cmd/Ctrl + Click for best experience)

Prompt for AI Agent 🤖
This is a comment left during a code review.

**Path:** backend/src/routes/upload.js
**Line:** 11:17
**Comment:**
	*Incomplete Implementation: The timeout wrapper does not actually stop `pdfParse`; `Promise.race` only rejects early, but the parser continues running in the background. If parsing is CPU-bound (the exact DoS case described), this still blocks the event loop and the timeout cannot preempt it, so the DoS mitigation is ineffective. Run parsing in an isolated worker/process and terminate that worker on timeout (or use a parser that supports abort/cancellation).

Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise.
Once fix is implemented, also check other comments on the same PR, and ask user if the user wants to fix the rest of the comments as well. if said yes, then fetch all the comments validate the correctness and implement a minimal fix
👍 | 👎

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
backend/src/routes/upload.js (1)

43-45: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Both endpoints fail to distinguish timeout errors from parsing errors.

Both the POST / and POST /extract-text routes catch all PDF parsing errors and throw a generic 400 Bad Request, but the PR objectives specify that timeout errors should return HTTP status 408 (Request Timeout) or 422 (Unprocessable Entity). Without this distinction, clients cannot implement appropriate retry logic or differentiate between malformed PDFs and legitimately complex files that exceed the time limit.

  • backend/src/routes/upload.js#L43-L45: Add a check for error.message === 'PDF parsing timed out' and throw ApiError(408, 'PDF parsing exceeded the time limit. The file may be too large or complex.') before the generic 400 error.
  • backend/src/routes/upload.js#L73-L75: Add the same timeout error check and throw ApiError(408, ...) before the generic 400 error.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/src/routes/upload.js` around lines 43 - 45, The catch blocks in both
PDF parsing error handlers fail to distinguish timeout errors from other parsing
errors, which prevents clients from implementing appropriate retry logic. In
backend/src/routes/upload.js at lines 43-45 (the POST / endpoint) and lines
73-75 (the POST /extract-text endpoint), add a check before the generic 400
error response to detect if error.message equals 'PDF parsing timed out'. If a
timeout is detected, throw an ApiError with status 408 and message 'PDF parsing
exceeded the time limit. The file may be too large or complex.' before falling
through to the existing generic 400 error. This allows the same timeout
detection logic to be applied consistently at both locations without modifying
the error sources themselves.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@backend/src/routes/upload.js`:
- Around line 11-17: The parseWithTimeout function has a resource leak where the
setTimeout timer continues running even after pdfParse completes successfully,
causing unnecessary timer callbacks to accumulate under high load. Store the
timeout ID returned by setTimeout, then use Promise.then() or a finally block to
explicitly clear the timer with clearTimeout once the Promise.race settles
(regardless of whether pdfParse succeeds or the timeout fires), ensuring the
timer callback is never executed after the race is resolved.
- Around line 11-17: The parseWithTimeout function does not clean up the
setTimeout when pdfParse completes successfully, and it does not cancel the
pdfParse operation when the timeout fires, both of which cause resource leaks.
Modify parseWithTimeout to explicitly clear the timeout when pdfParse resolves
first, following the cleanup pattern used in backend/src/config/redis.js.
Additionally, investigate if pdfParse supports cancellation (such as through
AbortSignal) and pass an abort signal to stop the operation when the timeout
fires. If cancellation is not supported by the PDF parsing library, consider
offloading PDF parsing to a background job queue using the existing BullMQ
pattern elsewhere in the codebase to prevent background operations from blocking
the event loop and degrading server performance during attacks.

---

Outside diff comments:
In `@backend/src/routes/upload.js`:
- Around line 43-45: The catch blocks in both PDF parsing error handlers fail to
distinguish timeout errors from other parsing errors, which prevents clients
from implementing appropriate retry logic. In backend/src/routes/upload.js at
lines 43-45 (the POST / endpoint) and lines 73-75 (the POST /extract-text
endpoint), add a check before the generic 400 error response to detect if
error.message equals 'PDF parsing timed out'. If a timeout is detected, throw an
ApiError with status 408 and message 'PDF parsing exceeded the time limit. The
file may be too large or complex.' before falling through to the existing
generic 400 error. This allows the same timeout detection logic to be applied
consistently at both locations without modifying the error sources themselves.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 551e2ccf-d200-4c8b-8952-d4b7743408b3

📥 Commits

Reviewing files that changed from the base of the PR and between 1c7ce5e and 831bcd6.

📒 Files selected for processing (1)
  • backend/src/routes/upload.js

Comment on lines +11 to +17
const parseWithTimeout = (buffer, ms = 8000) =>
Promise.race([
pdfParse(buffer),
new Promise((_, reject) =>
setTimeout(() => reject(new Error('PDF parsing timed out')), ms)
)
]);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Clear the setTimeout timer to prevent minor resource leak.

When pdfParse completes before the timeout, the setTimeout timer continues running until it fires (8 seconds later), at which point it attempts to reject an already-settled promise (a no-op). While Node.js will eventually garbage-collect the timer, explicitly clearing it is cleaner and prevents unnecessary timer callbacks from piling up under high load.

⏱️ Proposed fix to clear the timer
-const parseWithTimeout = (buffer, ms = 8000) =>
-  Promise.race([
-    pdfParse(buffer),
-    new Promise((_, reject) =>
-      setTimeout(() => reject(new Error('PDF parsing timed out')), ms)
-    )
-  ]);
+const parseWithTimeout = (buffer, ms = 8000) => {
+  let timeoutId;
+  const timeoutPromise = new Promise((_, reject) => {
+    timeoutId = setTimeout(() => reject(new Error('PDF parsing timed out')), ms);
+  });
+  
+  return Promise.race([
+    pdfParse(buffer).finally(() => clearTimeout(timeoutId)),
+    timeoutPromise
+  ]);
+};
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
const parseWithTimeout = (buffer, ms = 8000) =>
Promise.race([
pdfParse(buffer),
new Promise((_, reject) =>
setTimeout(() => reject(new Error('PDF parsing timed out')), ms)
)
]);
const parseWithTimeout = (buffer, ms = 8000) => {
let timeoutId;
const timeoutPromise = new Promise((_, reject) => {
timeoutId = setTimeout(() => reject(new Error('PDF parsing timed out')), ms);
});
return Promise.race([
pdfParse(buffer).finally(() => clearTimeout(timeoutId)),
timeoutPromise
]);
};
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/src/routes/upload.js` around lines 11 - 17, The parseWithTimeout
function has a resource leak where the setTimeout timer continues running even
after pdfParse completes successfully, causing unnecessary timer callbacks to
accumulate under high load. Store the timeout ID returned by setTimeout, then
use Promise.then() or a finally block to explicitly clear the timer with
clearTimeout once the Promise.race settles (regardless of whether pdfParse
succeeds or the timeout fires), ensuring the timer callback is never executed
after the race is resolved.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Description: Check for existing worker thread usage and pdf-parse internals

# Check if worker_threads are used anywhere
echo "=== Checking for worker_threads usage ==="
rg -n "worker_threads|Worker\(" --type=js

# Check pdf-parse package details and dependencies
echo -e "\n=== Checking pdf-parse in package.json ==="
fd package.json --exec cat {} \; | jq '.dependencies["pdf-parse"], .devDependencies["pdf-parse"]'

# Look for any existing timeout or worker patterns in other routes
echo -e "\n=== Checking for similar timeout patterns ==="
rg -n "Promise\.race|setTimeout.*reject|parseWith" --type=js -g '!node_modules'

Repository: anurag3407/career-pilot

Length of output: 1477


🏁 Script executed:

#!/bin/bash

# Check pdf-parse library details
echo "=== Checking package.json for queue/worker dependencies ==="
cat backend/package.json | jq '.dependencies | keys[] | select(. | contains("bull") or contains("queue") or contains("worker"))'

# Look at error handling in the upload routes to verify claim about generic ApiError(400)
echo -e "\n=== Checking error handling in upload.js ==="
sed -n '43,45p;73,75p' backend/src/routes/upload.js

# Check asyncHandler implementation
echo -e "\n=== Checking asyncHandler wrapper ==="
rg -A 5 "const asyncHandler" backend/src --type=js

# Look for any HTTP status codes used for timeout errors elsewhere
echo -e "\n=== Checking HTTP status code patterns for timeout ==="
rg "408|422|timeout" backend/src/routes --type=js -i

Repository: anurag3407/career-pilot

Length of output: 1341


🏁 Script executed:

#!/bin/bash

# Check what pdfParse actually is and how it's imported
echo "=== Checking pdfParse import and usage ==="
head -20 backend/src/routes/upload.js | grep -A 2 "pdfParse"

# Look for any cleanup logic or resource management in finally blocks
echo -e "\n=== Checking cleanup in finally blocks (lines 46-54, 76-84) ==="
sed -n '46,54p;76,84p' backend/src/routes/upload.js

# Check if there are any comments about pdf-parse behavior or performance
echo -e "\n=== Searching for pdf-parse related comments ==="
rg -B 2 -A 2 "pdf-parse|pdfParse" backend/src/routes/upload.js --type=js

# Check actual Promise.race implementation to see timer cleanup
echo -e "\n=== Checking Promise.race implementation (lines 11-17) ==="
sed -n '11,17p' backend/src/routes/upload.js

Repository: anurag3407/career-pilot

Length of output: 1486


🏁 Script executed:

#!/bin/bash

# Check pdf-parse npm package info to understand its actual behavior
echo "=== Checking pdf-parse implementation details ==="
rg "pdfParse" backend/src/routes/upload.js -B 3 -A 8

# Look for any other files that might handle PDF processing differently
echo -e "\n=== Searching for other PDF handling in codebase ==="
fd -e js -e ts | xargs rg "pdf-parse|pdfParse" --type=js 2>/dev/null

# Check if there are any documented performance issues or TODOs
echo -e "\n=== Checking for performance TODOs or comments ==="
rg "TODO|FIXME|NOTE|BUG|PERF" backend/src/routes/upload.js --type=js

# Verify actual file structure and test if parseWithTimeout is used
echo -e "\n=== Verifying complete upload.js structure ==="
wc -l backend/src/routes/upload.js

Repository: anurag3407/career-pilot

Length of output: 1358


🏁 Script executed:

#!/bin/bash

# Check if there's any documentation or comments about pdf-parse's async behavior
# Also check how pdfParse is used in the input.controller.js
echo "=== Checking pdf-parse usage in input.controller.js ==="
sed -n '1,50p' backend/src/controllers/input.controller.js | grep -A 10 "pdfParse"

# Search for any performance tests or metrics
echo -e "\n=== Checking test files for pdf-parse behavior ==="
fd -e test.js -e spec.js -e .test.js | xargs rg "pdf-parse|pdfParse" --type=js 2>/dev/null | head -20

# Let's verify what the actual API error status codes are in the errorHandler
echo -e "\n=== Checking ApiError implementation ==="
rg -A 15 "class ApiError|export.*ApiError" backend/src/middleware/errorHandler.js --type=js

Repository: anurag3407/career-pilot

Length of output: 1156


🏁 Script executed:

#!/bin/bash

# Check git history or PR/issue context for status code requirements
echo "=== Checking for PR objectives or issue context ==="
git log --all --oneline --grep="408\|422\|timeout\|status" | head -20

# Look for any existing timeout error handling patterns
echo -e "\n=== Checking how other timeouts are handled ==="
rg "new Error.*timeout|timeout.*Error" backend/src --type=js -i -B 2 -A 2

# Check if there's a pattern for setting specific HTTP status codes for timeouts
echo -e "\n=== Checking status codes in error handler ==="
rg "408|422|\.statusCode" backend/src/middleware/errorHandler.js --type=js

# Verify the exact behavior - does the timer callback fire after resolution?
echo -e "\n=== Checking Node.js documentation comment or any mitigation ==="
rg -B 5 "Promise.race" backend/src/routes/upload.js --type=js

Repository: anurag3407/career-pilot

Length of output: 4174


Promise.race only partially mitigates the DoS vulnerability; timer cleanup also leaks.

After the timeout fires and rejects, the pdfParse(buffer) operation continues executing in the background because Promise.race cannot cancel the losing promise. This allows an attacker to degrade server performance by uploading multiple crafted PDFs that trigger timeouts—each parse operation will run to completion in the background while blocking the event loop.

Additionally, the setTimeout is never cleared when pdfParse completes first, causing the timer to fire later and attempt to reject an already-settled promise (a minor resource leak). Compare this to the timeout pattern used in backend/src/config/redis.js, which explicitly clears the timeout on success.

The current implementation improves user experience by returning an error within 8 seconds rather than hanging indefinitely, but it does not prevent resource exhaustion. Consider offloading PDF parsing to a background job queue (similar to the existing BullMQ pattern used elsewhere in the codebase) or implement explicit timer cleanup at minimum.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/src/routes/upload.js` around lines 11 - 17, The parseWithTimeout
function does not clean up the setTimeout when pdfParse completes successfully,
and it does not cancel the pdfParse operation when the timeout fires, both of
which cause resource leaks. Modify parseWithTimeout to explicitly clear the
timeout when pdfParse resolves first, following the cleanup pattern used in
backend/src/config/redis.js. Additionally, investigate if pdfParse supports
cancellation (such as through AbortSignal) and pass an abort signal to stop the
operation when the timeout fires. If cancellation is not supported by the PDF
parsing library, consider offloading PDF parsing to a background job queue using
the existing BullMQ pattern elsewhere in the codebase to prevent background
operations from blocking the event loop and degrading server performance during
attacks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size:XS This PR changes 0-9 lines, ignoring generated files

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Security]: pdfParse() called without timeout

1 participant