[Enhancement]: Add retry policy for scan enqueuing to handle queue capacity limits #1349

@wterpstra

Description

Contact Details

[email protected]

What problem does this solve?

We operate a monorepo setup with many tiny projects that need to be scanned. While we have sufficient concurrent scan capacity (scans complete within minutes), we experience queue capacity spikes when enqueuing many scans simultaneously.

Current Situation:

  • Our concurrent scan capacity can handle the workload efficiently
  • Individual scans are fast and complete quickly
  • However, the initial enqueuing of many small projects creates a spike that exceeds the maximum queue capacity
  • When queue capacity is exceeded, scan creation fails immediately without retries

Impact:

  • Every team using our shared Checkmarx One subscription must implement their own retry logic
  • This creates duplicated effort across teams
  • Tools that depend on the CLI (like the Azure DevOps plugin) cannot easily benefit from retry logic
  • Inconsistent retry implementations across different teams and tools

Root Cause:
The CLI currently only retries on specific HTTP errors (502, 401) as seen in internal/wrappers/client.go:83-105, but does not retry on queue capacity errors during scan creation.
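
For reference, the retry condition described above boils down to a status-code allowlist, sketched below in simplified form (this is an illustration of the described behavior, not the actual client.go code):

    package main

    import (
        "fmt"
        "net/http"
    )

    // isRetryableStatus mirrors the behavior described above: only a couple of
    // HTTP statuses (502, 401) are retried today, so a queue-capacity rejection
    // during scan creation is never retried and fails immediately.
    func isRetryableStatus(statusCode int) bool {
        return statusCode == http.StatusBadGateway || statusCode == http.StatusUnauthorized
    }

    func main() {
        // Whatever status the queue-capacity rejection carries, it is not in the list.
        for _, code := range []int{502, 401, 429} {
            fmt.Printf("HTTP %d retryable today: %v\n", code, isRetryableStatus(code))
        }
    }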

Proposed Solution

Add configurable retry functionality specifically for scan enqueuing failures due to queue capacity limits.

Proposed Flags:

  • --scan-enqueue-retries <count> - Number of retry attempts (default: 0 to maintain backward compatibility)
  • --scan-enqueue-retry-delay <seconds> - Delay in seconds between retry attempts (default: a reasonable value such as 5-10 seconds); example usage below
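
For example, a CI job could opt in with something like the following (hypothetical usage, since neither flag exists yet; the surrounding scan arguments are abbreviated and illustrative):

    cx scan create --project-name "monorepo/service-a" -s "." --branch "main" \
      --scan-enqueue-retries 5 --scan-enqueue-retry-delay 10

With these values, a queue-capacity rejection would be retried up to 5 times with a 10-second wait (or a growing backoff, per the retry strategy below) before the command fails.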

Implementation Approach:

  1. Extend the retry logic in internal/wrappers/scans-http.go:32 (ScansHTTPWrapper.Create())
  2. Detect queue capacity errors in the error response from Checkmarx One API
  3. Apply retry logic similar to the existing SCM rate limit handling in internal/wrappers/rate-limit.go
  4. Respect the new flags when retrying scan creation requests
  5. Log retry attempts for visibility (e.g., "Scan creation failed due to queue capacity, retrying (attempt 1/5)..."); a sketch of this flow follows below
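
A minimal sketch of this flow in Go, assuming the new flags are plumbed through as a retry count and a delay; the queue-capacity error detection and the createScanOnce placeholder are illustrative, not the actual ScansHTTPWrapper code:

    package main

    import (
        "errors"
        "fmt"
        "log"
        "time"
    )

    // errQueueFull stands in for whatever error the Checkmarx One API returns when
    // the scan queue is at capacity; real detection would inspect the API error response.
    var errQueueFull = errors.New("scan queue capacity exceeded")

    // createScanOnce is a placeholder for a single ScansHTTPWrapper.Create() call.
    // It fails twice with a capacity error and then succeeds, purely to drive the demo.
    func createScanOnce(attempt int) error {
        if attempt < 2 {
            return errQueueFull
        }
        return nil
    }

    // createScanWithRetries retries scan creation only on queue-capacity errors,
    // honoring the proposed --scan-enqueue-retries and --scan-enqueue-retry-delay values.
    // With retries set to 0 (the proposed default) it behaves exactly like today.
    func createScanWithRetries(retries int, delay time.Duration) error {
        for attempt := 0; ; attempt++ {
            err := createScanOnce(attempt)
            if err == nil {
                return nil
            }
            // Any other failure, or an exhausted retry budget, surfaces immediately.
            if !errors.Is(err, errQueueFull) || attempt >= retries {
                return err
            }
            log.Printf("Scan creation failed due to queue capacity, retrying (attempt %d/%d)...",
                attempt+1, retries)
            time.Sleep(delay)
        }
    }

    func main() {
        // A short delay keeps the demo quick; the flag default would be closer to 5-10 seconds.
        if err := createScanWithRetries(5, 2*time.Second); err != nil {
            fmt.Println("scan creation failed:", err)
            return
        }
        fmt.Println("scan enqueued")
    }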

Retry Strategy:
Reuse the existing exponential backoff implementation from internal/wrappers/client.go:83-105, which calculates the delay as baseDelayInMilliSec * (1 << attempt) (illustrated below). This provides:

  • Consistent behavior with existing retry logic
  • Proven exponential backoff pattern already in the codebase
  • Reduced queue pressure as delays increase progressively
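
As a concrete illustration of the delay formula (the 500 ms base is an assumed value for the example; the real constant is defined in client.go):

    package main

    import (
        "fmt"
        "time"
    )

    func main() {
        const baseDelayInMilliSec = 500 // assumed base value, for illustration only
        for attempt := 0; attempt < 5; attempt++ {
            // Same doubling pattern as the existing retry logic:
            // delay = baseDelayInMilliSec * (1 << attempt)
            delay := time.Duration(baseDelayInMilliSec*(1<<attempt)) * time.Millisecond
            fmt.Printf("attempt %d: wait %v\n", attempt, delay)
        }
        // Prints 500ms, 1s, 2s, 4s, 8s: pressure on the queue eases with each attempt.
    }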

Benefits:

  • Centralized retry logic that all teams can use
  • Works automatically for Azure DevOps plugin and other tools built on the CLI
  • Backward compatible (default of 0 retries maintains current behavior)
  • Reduces burden on individual teams to implement retry logic
  • Consistent behavior across all teams using the shared subscription

Importance Level

Critical

Additional Information

Technical Context:

Use Case Details:

  • Multiple teams on shared subscription
  • Monorepo with 100+ small projects
  • Automated CI/CD pipelines triggering many scans concurrently
  • The queue drains quickly, but the initial enqueuing spike causes failures

Related Functionality:
The CLI already has similar flags for other operations:

  • --wait-delay for polling intervals
  • --scan-timeout for scan timeouts
  • The proposed flags follow this existing pattern

Alternatives Considered:

  • Implementing retry logic in each team's CI/CD pipeline → Creates duplication
  • Increasing queue capacity → Not feasible, as queue capacity is tied to concurrent scan and developer licenses (which are adequate for our workload)

This enhancement would make the CLI more robust for enterprise environments with high-volume scan requirements.
