fix: wrap pdfParse in timeout to prevent event loop DoS#4138
Conversation
|
@taniy8 is attempting to deploy a commit to the Anurag Mishra's projects Team on Vercel. A member of the Team first needs to authorize it. |
Thanks for using CodeAnt! 🎉We're free for open-source projects. if you're enjoying it, help us grow by sharing. Share on X · |
📝 WalkthroughWalkthroughA ChangesPDF Parse Timeout Guard
Estimated code review effort🎯 1 (Trivial) | ⏱️ ~3 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
| const parseWithTimeout = (buffer, ms = 8000) => | ||
| Promise.race([ | ||
| pdfParse(buffer), | ||
| new Promise((_, reject) => | ||
| setTimeout(() => reject(new Error('PDF parsing timed out')), ms) | ||
| ) | ||
| ]); |
There was a problem hiding this comment.
Suggestion: The timeout wrapper does not actually stop pdfParse; Promise.race only rejects early, but the parser continues running in the background. If parsing is CPU-bound (the exact DoS case described), this still blocks the event loop and the timeout cannot preempt it, so the DoS mitigation is ineffective. Run parsing in an isolated worker/process and terminate that worker on timeout (or use a parser that supports abort/cancellation). [incomplete implementation]
Severity Level: Critical 🚨
- ❌ Malicious PDF can still block /api/upload handlers.
- ❌ Event loop blocked preventing all other API responses.
- ⚠️ DoS risk remains despite added parse timeout wrapper.Steps of Reproduction ✅
1. Start the backend server so that `backend/src/index.js` mounts the upload router at
`app.use('/api/upload', uploadRoutes);` (lines 78–81), which exposes `POST /api/upload/`
and `POST /api/upload/extract-text` via `backend/src/routes/upload.js`.
2. From a client, send a crafted, pathologically complex PDF (e.g., deeply nested objects
/ huge page count) to `POST /api/upload/` so the request is handled by the route at
`backend/src/routes/upload.js:19-55`, which reads the file buffer at line 25 and calls
`const pdfData = await parseWithTimeout(fileBuffer);` at line 26.
3. Inside `parseWithTimeout` at `backend/src/routes/upload.js:11-17`, `pdfParse(buffer)`
(line 13) begins heavy, CPU-bound parsing in the main Node.js process; while this
synchronous work is executing, the event loop is occupied and the competing timeout
created by `setTimeout(..., ms)` at lines 14-15 cannot actually run until parsing yields
back to the event loop.
4. Under these conditions, observe that other requests (e.g., `GET /health` implemented in
`backend/src/index.js:68-73`) hang and the process CPU remains saturated even after the
intended 8-second timeout, demonstrating that the `Promise.race` wrapper does not preempt
or cancel the underlying `pdfParse` work and the DoS risk from worst-case PDFs remains.(Use Cmd/Ctrl + Click for best experience)
Prompt for AI Agent 🤖
This is a comment left during a code review.
**Path:** backend/src/routes/upload.js
**Line:** 11:17
**Comment:**
*Incomplete Implementation: The timeout wrapper does not actually stop `pdfParse`; `Promise.race` only rejects early, but the parser continues running in the background. If parsing is CPU-bound (the exact DoS case described), this still blocks the event loop and the timeout cannot preempt it, so the DoS mitigation is ineffective. Run parsing in an isolated worker/process and terminate that worker on timeout (or use a parser that supports abort/cancellation).
Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise.
Once fix is implemented, also check other comments on the same PR, and ask user if the user wants to fix the rest of the comments as well. if said yes, then fetch all the comments validate the correctness and implement a minimal fixThere was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
backend/src/routes/upload.js (1)
43-45:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winBoth endpoints fail to distinguish timeout errors from parsing errors.
Both the
POST /andPOST /extract-textroutes catch all PDF parsing errors and throw a generic 400 Bad Request, but the PR objectives specify that timeout errors should return HTTP status 408 (Request Timeout) or 422 (Unprocessable Entity). Without this distinction, clients cannot implement appropriate retry logic or differentiate between malformed PDFs and legitimately complex files that exceed the time limit.
backend/src/routes/upload.js#L43-L45: Add a check forerror.message === 'PDF parsing timed out'and throwApiError(408, 'PDF parsing exceeded the time limit. The file may be too large or complex.')before the generic 400 error.backend/src/routes/upload.js#L73-L75: Add the same timeout error check and throwApiError(408, ...)before the generic 400 error.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/src/routes/upload.js` around lines 43 - 45, The catch blocks in both PDF parsing error handlers fail to distinguish timeout errors from other parsing errors, which prevents clients from implementing appropriate retry logic. In backend/src/routes/upload.js at lines 43-45 (the POST / endpoint) and lines 73-75 (the POST /extract-text endpoint), add a check before the generic 400 error response to detect if error.message equals 'PDF parsing timed out'. If a timeout is detected, throw an ApiError with status 408 and message 'PDF parsing exceeded the time limit. The file may be too large or complex.' before falling through to the existing generic 400 error. This allows the same timeout detection logic to be applied consistently at both locations without modifying the error sources themselves.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@backend/src/routes/upload.js`:
- Around line 11-17: The parseWithTimeout function has a resource leak where the
setTimeout timer continues running even after pdfParse completes successfully,
causing unnecessary timer callbacks to accumulate under high load. Store the
timeout ID returned by setTimeout, then use Promise.then() or a finally block to
explicitly clear the timer with clearTimeout once the Promise.race settles
(regardless of whether pdfParse succeeds or the timeout fires), ensuring the
timer callback is never executed after the race is resolved.
- Around line 11-17: The parseWithTimeout function does not clean up the
setTimeout when pdfParse completes successfully, and it does not cancel the
pdfParse operation when the timeout fires, both of which cause resource leaks.
Modify parseWithTimeout to explicitly clear the timeout when pdfParse resolves
first, following the cleanup pattern used in backend/src/config/redis.js.
Additionally, investigate if pdfParse supports cancellation (such as through
AbortSignal) and pass an abort signal to stop the operation when the timeout
fires. If cancellation is not supported by the PDF parsing library, consider
offloading PDF parsing to a background job queue using the existing BullMQ
pattern elsewhere in the codebase to prevent background operations from blocking
the event loop and degrading server performance during attacks.
---
Outside diff comments:
In `@backend/src/routes/upload.js`:
- Around line 43-45: The catch blocks in both PDF parsing error handlers fail to
distinguish timeout errors from other parsing errors, which prevents clients
from implementing appropriate retry logic. In backend/src/routes/upload.js at
lines 43-45 (the POST / endpoint) and lines 73-75 (the POST /extract-text
endpoint), add a check before the generic 400 error response to detect if
error.message equals 'PDF parsing timed out'. If a timeout is detected, throw an
ApiError with status 408 and message 'PDF parsing exceeded the time limit. The
file may be too large or complex.' before falling through to the existing
generic 400 error. This allows the same timeout detection logic to be applied
consistently at both locations without modifying the error sources themselves.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 551e2ccf-d200-4c8b-8952-d4b7743408b3
📒 Files selected for processing (1)
backend/src/routes/upload.js
| const parseWithTimeout = (buffer, ms = 8000) => | ||
| Promise.race([ | ||
| pdfParse(buffer), | ||
| new Promise((_, reject) => | ||
| setTimeout(() => reject(new Error('PDF parsing timed out')), ms) | ||
| ) | ||
| ]); |
There was a problem hiding this comment.
Clear the setTimeout timer to prevent minor resource leak.
When pdfParse completes before the timeout, the setTimeout timer continues running until it fires (8 seconds later), at which point it attempts to reject an already-settled promise (a no-op). While Node.js will eventually garbage-collect the timer, explicitly clearing it is cleaner and prevents unnecessary timer callbacks from piling up under high load.
⏱️ Proposed fix to clear the timer
-const parseWithTimeout = (buffer, ms = 8000) =>
- Promise.race([
- pdfParse(buffer),
- new Promise((_, reject) =>
- setTimeout(() => reject(new Error('PDF parsing timed out')), ms)
- )
- ]);
+const parseWithTimeout = (buffer, ms = 8000) => {
+ let timeoutId;
+ const timeoutPromise = new Promise((_, reject) => {
+ timeoutId = setTimeout(() => reject(new Error('PDF parsing timed out')), ms);
+ });
+
+ return Promise.race([
+ pdfParse(buffer).finally(() => clearTimeout(timeoutId)),
+ timeoutPromise
+ ]);
+};📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| const parseWithTimeout = (buffer, ms = 8000) => | |
| Promise.race([ | |
| pdfParse(buffer), | |
| new Promise((_, reject) => | |
| setTimeout(() => reject(new Error('PDF parsing timed out')), ms) | |
| ) | |
| ]); | |
| const parseWithTimeout = (buffer, ms = 8000) => { | |
| let timeoutId; | |
| const timeoutPromise = new Promise((_, reject) => { | |
| timeoutId = setTimeout(() => reject(new Error('PDF parsing timed out')), ms); | |
| }); | |
| return Promise.race([ | |
| pdfParse(buffer).finally(() => clearTimeout(timeoutId)), | |
| timeoutPromise | |
| ]); | |
| }; |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@backend/src/routes/upload.js` around lines 11 - 17, The parseWithTimeout
function has a resource leak where the setTimeout timer continues running even
after pdfParse completes successfully, causing unnecessary timer callbacks to
accumulate under high load. Store the timeout ID returned by setTimeout, then
use Promise.then() or a finally block to explicitly clear the timer with
clearTimeout once the Promise.race settles (regardless of whether pdfParse
succeeds or the timeout fires), ensuring the timer callback is never executed
after the race is resolved.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Description: Check for existing worker thread usage and pdf-parse internals
# Check if worker_threads are used anywhere
echo "=== Checking for worker_threads usage ==="
rg -n "worker_threads|Worker\(" --type=js
# Check pdf-parse package details and dependencies
echo -e "\n=== Checking pdf-parse in package.json ==="
fd package.json --exec cat {} \; | jq '.dependencies["pdf-parse"], .devDependencies["pdf-parse"]'
# Look for any existing timeout or worker patterns in other routes
echo -e "\n=== Checking for similar timeout patterns ==="
rg -n "Promise\.race|setTimeout.*reject|parseWith" --type=js -g '!node_modules'Repository: anurag3407/career-pilot
Length of output: 1477
🏁 Script executed:
#!/bin/bash
# Check pdf-parse library details
echo "=== Checking package.json for queue/worker dependencies ==="
cat backend/package.json | jq '.dependencies | keys[] | select(. | contains("bull") or contains("queue") or contains("worker"))'
# Look at error handling in the upload routes to verify claim about generic ApiError(400)
echo -e "\n=== Checking error handling in upload.js ==="
sed -n '43,45p;73,75p' backend/src/routes/upload.js
# Check asyncHandler implementation
echo -e "\n=== Checking asyncHandler wrapper ==="
rg -A 5 "const asyncHandler" backend/src --type=js
# Look for any HTTP status codes used for timeout errors elsewhere
echo -e "\n=== Checking HTTP status code patterns for timeout ==="
rg "408|422|timeout" backend/src/routes --type=js -iRepository: anurag3407/career-pilot
Length of output: 1341
🏁 Script executed:
#!/bin/bash
# Check what pdfParse actually is and how it's imported
echo "=== Checking pdfParse import and usage ==="
head -20 backend/src/routes/upload.js | grep -A 2 "pdfParse"
# Look for any cleanup logic or resource management in finally blocks
echo -e "\n=== Checking cleanup in finally blocks (lines 46-54, 76-84) ==="
sed -n '46,54p;76,84p' backend/src/routes/upload.js
# Check if there are any comments about pdf-parse behavior or performance
echo -e "\n=== Searching for pdf-parse related comments ==="
rg -B 2 -A 2 "pdf-parse|pdfParse" backend/src/routes/upload.js --type=js
# Check actual Promise.race implementation to see timer cleanup
echo -e "\n=== Checking Promise.race implementation (lines 11-17) ==="
sed -n '11,17p' backend/src/routes/upload.jsRepository: anurag3407/career-pilot
Length of output: 1486
🏁 Script executed:
#!/bin/bash
# Check pdf-parse npm package info to understand its actual behavior
echo "=== Checking pdf-parse implementation details ==="
rg "pdfParse" backend/src/routes/upload.js -B 3 -A 8
# Look for any other files that might handle PDF processing differently
echo -e "\n=== Searching for other PDF handling in codebase ==="
fd -e js -e ts | xargs rg "pdf-parse|pdfParse" --type=js 2>/dev/null
# Check if there are any documented performance issues or TODOs
echo -e "\n=== Checking for performance TODOs or comments ==="
rg "TODO|FIXME|NOTE|BUG|PERF" backend/src/routes/upload.js --type=js
# Verify actual file structure and test if parseWithTimeout is used
echo -e "\n=== Verifying complete upload.js structure ==="
wc -l backend/src/routes/upload.jsRepository: anurag3407/career-pilot
Length of output: 1358
🏁 Script executed:
#!/bin/bash
# Check if there's any documentation or comments about pdf-parse's async behavior
# Also check how pdfParse is used in the input.controller.js
echo "=== Checking pdf-parse usage in input.controller.js ==="
sed -n '1,50p' backend/src/controllers/input.controller.js | grep -A 10 "pdfParse"
# Search for any performance tests or metrics
echo -e "\n=== Checking test files for pdf-parse behavior ==="
fd -e test.js -e spec.js -e .test.js | xargs rg "pdf-parse|pdfParse" --type=js 2>/dev/null | head -20
# Let's verify what the actual API error status codes are in the errorHandler
echo -e "\n=== Checking ApiError implementation ==="
rg -A 15 "class ApiError|export.*ApiError" backend/src/middleware/errorHandler.js --type=jsRepository: anurag3407/career-pilot
Length of output: 1156
🏁 Script executed:
#!/bin/bash
# Check git history or PR/issue context for status code requirements
echo "=== Checking for PR objectives or issue context ==="
git log --all --oneline --grep="408\|422\|timeout\|status" | head -20
# Look for any existing timeout error handling patterns
echo -e "\n=== Checking how other timeouts are handled ==="
rg "new Error.*timeout|timeout.*Error" backend/src --type=js -i -B 2 -A 2
# Check if there's a pattern for setting specific HTTP status codes for timeouts
echo -e "\n=== Checking status codes in error handler ==="
rg "408|422|\.statusCode" backend/src/middleware/errorHandler.js --type=js
# Verify the exact behavior - does the timer callback fire after resolution?
echo -e "\n=== Checking Node.js documentation comment or any mitigation ==="
rg -B 5 "Promise.race" backend/src/routes/upload.js --type=jsRepository: anurag3407/career-pilot
Length of output: 4174
Promise.race only partially mitigates the DoS vulnerability; timer cleanup also leaks.
After the timeout fires and rejects, the pdfParse(buffer) operation continues executing in the background because Promise.race cannot cancel the losing promise. This allows an attacker to degrade server performance by uploading multiple crafted PDFs that trigger timeouts—each parse operation will run to completion in the background while blocking the event loop.
Additionally, the setTimeout is never cleared when pdfParse completes first, causing the timer to fire later and attempt to reject an already-settled promise (a minor resource leak). Compare this to the timeout pattern used in backend/src/config/redis.js, which explicitly clears the timeout on success.
The current implementation improves user experience by returning an error within 8 seconds rather than hanging indefinitely, but it does not prevent resource exhaustion. Consider offloading PDF parsing to a background job queue (similar to the existing BullMQ pattern used elsewhere in the codebase) or implement explicit timer cleanup at minimum.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@backend/src/routes/upload.js` around lines 11 - 17, The parseWithTimeout
function does not clean up the setTimeout when pdfParse completes successfully,
and it does not cancel the pdfParse operation when the timeout fires, both of
which cause resource leaks. Modify parseWithTimeout to explicitly clear the
timeout when pdfParse resolves first, following the cleanup pattern used in
backend/src/config/redis.js. Additionally, investigate if pdfParse supports
cancellation (such as through AbortSignal) and pass an abort signal to stop the
operation when the timeout fires. If cancellation is not supported by the PDF
parsing library, consider offloading PDF parsing to a background job queue using
the existing BullMQ pattern elsewhere in the codebase to prevent background
operations from blocking the event loop and degrading server performance during
attacks.
User description
Description
Both
POST /uploadandPOST /upload/extract-textendpoints were callingpdfParse(fileBuffer)with no execution timeout. A crafted PDF with deeply nested object streams or thousands of pages
could block the Node.js event loop indefinitely, causing a DoS for all concurrent requests.
Added a
parseWithTimeout()helper that wrapspdfParsein aPromise.racewith an 8-secondtimeout, rejecting with an error if parsing exceeds the limit.
Type of Change
Related Issue
Fixes #4133
Testing
Screenshots (MANDATORY for UI/UX changes)
Checklist
CodeAnt-AI Description
Stop PDF uploads from hanging the server
What Changed
Impact
✅ Fewer upload hangs✅ Lower risk of server lockups✅ Clearer PDF upload failures💡 Usage Guide
Checking Your Pull Request
Every time you make a pull request, our system automatically looks through it. We check for security issues, mistakes in how you're setting up your infrastructure, and common code problems. We do this to make sure your changes are solid and won't cause any trouble later.
Talking to CodeAnt AI
Got a question or need a hand with something in your pull request? You can easily get in touch with CodeAnt AI right here. Just type the following in a comment on your pull request, and replace "Your question here" with whatever you want to ask:
This lets you have a chat with CodeAnt AI about your pull request, making it easier to understand and improve your code.
Example
Preserve Org Learnings with CodeAnt
You can record team preferences so CodeAnt AI applies them in future reviews. Reply directly to the specific CodeAnt AI suggestion (in the same thread) and replace "Your feedback here" with your input:
This helps CodeAnt AI learn and adapt to your team's coding style and standards.
Example
Retrigger review
Ask CodeAnt AI to review the PR again, by typing:
Check Your Repository Health
To analyze the health of your code repository, visit our dashboard at https://app.codeant.ai. This tool helps you identify potential issues and areas for improvement in your codebase, ensuring your repository maintains high standards of code health.
Summary by CodeRabbit
Release Notes