[Enhancement]: Add retry policy for scan enqueuing to handle queue capacity limits #1349

@wterpstra

Description

Contact Details

[email protected]

What problem does this solve?

We operate a monorepo setup with many tiny projects that need to be scanned. While we have sufficient concurrent scan capacity (scans complete within minutes), we experience queue capacity spikes when enqueuing many scans simultaneously.

Current Situation:

  • Our concurrent scan capacity can handle the workload efficiently
  • Individual scans are fast and complete quickly
  • However, the initial enqueuing of many small projects creates a spike that exceeds the maximum queue capacity
  • When queue capacity is exceeded, scan creation fails immediately without retries

Impact:

  • Every team using our shared Checkmarx One subscription must implement their own retry logic
  • This creates duplicated effort across teams
  • Tools that depend on the CLI (like the Azure DevOps plugin) cannot easily benefit from retry logic
  • Inconsistent retry implementations across different teams and tools

Root Cause:
The CLI currently only retries on specific HTTP errors (502, 401) as seen in internal/wrappers/client.go:83-105, but does not retry on queue capacity errors during scan creation.
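
For reference, the retry condition described above boils down to a status-code allowlist, sketched below in simplified form (this is an illustration of the described behavior, not the actual client.go code):

    package main

    import (
        "fmt"
        "net/http"
    )

    // isRetryableStatus mirrors the behavior described above: only a couple of
    // HTTP statuses (502, 401) are retried today, so a queue-capacity rejection
    // during scan creation is never retried and fails immediately.
    func isRetryableStatus(statusCode int) bool {
        return statusCode == http.StatusBadGateway || statusCode == http.StatusUnauthorized
    }

    func main() {
        // Whatever status the queue-capacity rejection carries, it is not in the list.
        for _, code := range []int{502, 401, 429} {
            fmt.Printf("HTTP %d retryable today: %v\n", code, isRetryableStatus(code))
        }
    }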

Proposed Solution

Add configurable retry functionality specifically for scan enqueuing failures due to queue capacity limits.

Proposed Flags:

  • --scan-enqueue-retries <count> - Number of retry attempts (default: 0 to maintain backward compatibility)
  • --scan-enqueue-retry-delay <seconds> - Delay in seconds between retry attempts (default: a reasonable value such as 5-10 seconds); example usage below
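
For example, a CI job could opt in with something like the following (hypothetical usage, since neither flag exists yet; the surrounding scan arguments are abbreviated and illustrative):

    cx scan create --project-name "monorepo/service-a" -s "." --branch "main" \
      --scan-enqueue-retries 5 --scan-enqueue-retry-delay 10

With these values, a queue-capacity rejection would be retried up to 5 times with a 10-second wait (or a growing backoff, per the retry strategy below) before the command fails.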

Implementation Approach:

  1. Extend the retry logic in internal/wrappers/scans-http.go:32 (ScansHTTPWrapper.Create())
  2. Detect queue capacity errors in the error response from Checkmarx One API
  3. Apply retry logic similar to the existing SCM rate limit handling in internal/wrappers/rate-limit.go
  4. Respect the new flags when retrying scan creation requests
  5. Log retry attempts for visibility (e.g., "Scan creation failed due to queue capacity, retrying (attempt 1/5)..."); a sketch of this flow follows below
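
A minimal sketch of this flow in Go, assuming the new flags are plumbed through as a retry count and a delay; the queue-capacity error detection and the createScanOnce placeholder are illustrative, not the actual ScansHTTPWrapper code:

    package main

    import (
        "errors"
        "fmt"
        "log"
        "time"
    )

    // errQueueFull stands in for whatever error the Checkmarx One API returns when
    // the scan queue is at capacity; real detection would inspect the API error response.
    var errQueueFull = errors.New("scan queue capacity exceeded")

    // createScanOnce is a placeholder for a single ScansHTTPWrapper.Create() call.
    // It fails twice with a capacity error and then succeeds, purely to drive the demo.
    func createScanOnce(attempt int) error {
        if attempt < 2 {
            return errQueueFull
        }
        return nil
    }

    // createScanWithRetries retries scan creation only on queue-capacity errors,
    // honoring the proposed --scan-enqueue-retries and --scan-enqueue-retry-delay values.
    // With retries set to 0 (the proposed default) it behaves exactly like today.
    func createScanWithRetries(retries int, delay time.Duration) error {
        for attempt := 0; ; attempt++ {
            err := createScanOnce(attempt)
            if err == nil {
                return nil
            }
            // Any other failure, or an exhausted retry budget, surfaces immediately.
            if !errors.Is(err, errQueueFull) || attempt >= retries {
                return err
            }
            log.Printf("Scan creation failed due to queue capacity, retrying (attempt %d/%d)...",
                attempt+1, retries)
            time.Sleep(delay)
        }
    }

    func main() {
        // A short delay keeps the demo quick; the flag default would be closer to 5-10 seconds.
        if err := createScanWithRetries(5, 2*time.Second); err != nil {
            fmt.Println("scan creation failed:", err)
            return
        }
        fmt.Println("scan enqueued")
    }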

Retry Strategy:
Reuse the existing exponential backoff implementation from internal/wrappers/client.go:83-105, which calculates the delay as baseDelayInMilliSec * (1 << attempt) (illustrated below). This provides:

  • Consistent behavior with existing retry logic
  • Proven exponential backoff pattern already in the codebase
  • Reduced queue pressure as delays increase progressively
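
As a concrete illustration of the delay formula (the 500 ms base is an assumed value for the example; the real constant is defined in client.go):

    package main

    import (
        "fmt"
        "time"
    )

    func main() {
        const baseDelayInMilliSec = 500 // assumed base value, for illustration only
        for attempt := 0; attempt < 5; attempt++ {
            // Same doubling pattern as the existing retry logic:
            // delay = baseDelayInMilliSec * (1 << attempt)
            delay := time.Duration(baseDelayInMilliSec*(1<<attempt)) * time.Millisecond
            fmt.Printf("attempt %d: wait %v\n", attempt, delay)
        }
        // Prints 500ms, 1s, 2s, 4s, 8s: pressure on the queue eases with each attempt.
    }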

Benefits:

  • Centralized retry logic that all teams can use
  • Works automatically for Azure DevOps plugin and other tools built on the CLI
  • Backward compatible (default of 0 retries maintains current behavior)
  • Reduces burden on individual teams to implement retry logic
  • Consistent behavior across all teams using the shared subscription

Importance Level

Critical

Additional Information

Technical Context:

Use Case Details:

  • Multiple teams on shared subscription
  • Monorepo with 100+ small projects
  • Automated CI/CD pipelines triggering many scans concurrently
  • The queue drains quickly, but the initial enqueuing spike causes failures

Related Functionality:
The CLI already has similar flags for other operations:

  • --wait-delay for polling intervals
  • --scan-timeout for scan timeouts
  • The proposed flags follow this existing pattern

Alternatives Considered:

  • Implementing retry logic in each team's CI/CD pipeline → Creates duplication
  • Increasing queue capacity → Not feasible, as queue capacity is tied to concurrent scan and developer licenses (which are adequate for our workload)

This enhancement would make the CLI more robust for enterprise environments with high-volume scan requirements.
