diff --git a/hyperfleet-e2e-scenario/hyperfleet-api-e2e-scenario.md b/hyperfleet-e2e-scenario/hyperfleet-api-e2e-scenario.md new file mode 100644 index 0000000..00039fb --- /dev/null +++ b/hyperfleet-e2e-scenario/hyperfleet-api-e2e-scenario.md @@ -0,0 +1,483 @@ +User Story: https://issues.redhat.com/browse/HYPERFLEET-75 + +--- + +## 1. MVP-Critical E2E Test Scenarios (Happy Path) + +### Part 1: Cluster Lifecycle + +### E2E-001: Full Cluster Creation Flow on GCP + +**Objective**: Validate end-to-end cluster creation from API request to Ready state on GCP. + +**Test Steps**: +1. Submit cluster creation request via `POST /api/hyperfleet/v1/clusters` + - Provider: GCP + - Region: us-east1 + - NodeCount: 3 + - Labels: {environment: "test", team: "platform"} + +2. Verify API response + - HTTP 201 Created + - Cluster ID generated + - status.phase = "Not Ready" (MVP: only "Ready" or "Not Ready") + - status.adapters = [] (no adapters reported yet) + - status.lastUpdated set + - generation = 1 + +3. Monitor cluster status via `GET /api/hyperfleet/v1/clusters/{id}` + - Verify phase remains "Not Ready" until all adapters complete + - Monitor status.adapters array as adapters report their status + +4. Monitor adapter statuses via `GET /api/hyperfleet/v1/clusters/{id}/statuses` + - This returns ONE ClusterStatus object containing all adapter statuses + - Verify ClusterStatus.adapterStatuses array is populated by each adapter + - Each adapter reports conditions: Available, Applied, Health + - Validation Adapter conditions: + - Available: False (JobRunning) → True (JobSucceeded) + - Applied: True (JobLaunched) + - Health: True (NoErrors) + - DNS Adapter conditions: + - Available: False (JobRunning) → True (JobSucceeded) + - Applied: True (JobLaunched) + - Health: True (NoErrors) + - Placement Adapter conditions: + - Available: False (JobRunning) → True (JobSucceeded) + - Applied: True (JobLaunched) + - Health: True (NoErrors) + - Pull Secret Adapter conditions: + - Available: False (JobRunning) → True (JobSucceeded) + - Applied: True (JobLaunched) + - Health: True (NoErrors) + - HyperShift Adapter conditions: + - Available: False (JobRunning) → True (JobSucceeded) + - Applied: True (JobLaunched) + - Health: True (NoErrors) +5. Verify final state + - Cluster status.phase = "Ready" + - Cluster status.adapters shows all adapters with: + - name: adapter name + - available: "True" + - observedGeneration: 1 (matches cluster.generation) + - ClusterStatus.adapterStatuses array contains all adapter statuses + - All adapters have Available condition = True + - Cluster API and console are accessible and functional + +**Expected Duration**: Average time + +**Success Criteria**: +- Cluster transitions to Ready state +- All adapters complete successfully +- No errors in logs (API, Sentinel, Adapters, Jobs) +- Kubernetes Jobs complete successfully + +--- + +### E2E-002: Cluster Configuration Update (Post-MVP) + +**Objective**: Validate cluster spec update triggers reconciliation and completes successfully. + +**Test Steps**: +1. Create cluster (using E2E-001) +2. Wait for Ready state +3. Update cluster via `PATCH /api/hyperfleet/v1/clusters/{id}` + - Change nodeCount: 3 → 5 +4. Verify phase transition: Ready → Not Ready → Ready +5. 
Verify adapter reconciliation (observedGeneration increments to 2)
   - Validation Adapter re-validates updated spec (observedGeneration: 2)
   - DNS Adapter reconciles the updated spec (observedGeneration: 2)
   - Infrastructure adapters apply changes (observedGeneration: 2)
   - Each adapter reports Available: False → True as they reconcile
6. Verify final state
   - Cluster spec updated correctly
   - Cluster generation = 2
   - Cluster status.phase = "Ready"
   - All adapters in status.adapters have:
     - available: "True"
     - observedGeneration: 2 (matches cluster.generation)
   - ClusterStatus shows all adapters with Available condition = True

**Expected Duration**: Average time

**Success Criteria**:
- Spec updates applied successfully
- All adapters reconcile changes
- Cluster returns to Ready state

---

### E2E-003: Cluster Deletion (Post-MVP)

**Objective**: Validate complete cluster deletion and resource cleanup.

**Test Steps**:
1. Create cluster (using E2E-001)
2. Wait for Ready state
3. Delete cluster via `DELETE /api/hyperfleet/v1/clusters/{id}`
4. Verify API response
   - HTTP 202 Accepted or 204 No Content
   - Note: MVP deletion behavior TBD - cluster may be deleted immediately or marked for deletion with deletionTimestamp
   - If cluster still exists after DELETE: status.phase = "Not Ready" (MVP has no "Terminating" phase)
5. Monitor deletion progress via `GET /api/hyperfleet/v1/clusters/{id}/statuses`
   - Adapters execute cleanup (typically in reverse order)
   - Monitor ClusterStatus.adapterStatuses for cleanup progress
   - Each adapter reports cleanup via conditions:
     - HyperShift Adapter: Available: True (cleanup complete)
     - Pull Secret Adapter: Available: True (cleanup complete)
     - Placement Adapter: Available: True (cleanup complete)
     - DNS Adapter: Available: True (cleanup complete)
   - Infrastructure resources removed
6. Verify cluster deletion
   - `GET /api/hyperfleet/v1/clusters/{id}` returns HTTP 404
   - Database record deleted
   - ClusterStatus also deleted
   - All Kubernetes Jobs cleaned up
   - Cloud provider resources removed

**Expected Duration**: Average time

**Success Criteria**:
- Cluster fully deleted from system
- No orphaned resources in cloud provider
- No orphaned Kubernetes resources

---

### Part 2: Nodepool Lifecycle

### E2E-004: Full Nodepool Creation Flow

**Objective**: Validate end-to-end nodepool creation from API request to Ready state for an existing cluster.

**Test Steps**:
1. Prerequisites: Create cluster via E2E-001 and wait for Ready state
2. Submit nodepool creation request via `POST /api/hyperfleet/v1/clusters/{cluster_id}/nodepools`
   - Name: "gpu-nodepool"
   - MachineType: "n1-standard-8"
   - Replicas: 2
   - Labels: {workload: "gpu", tier: "compute"}

3. Verify API response
   - HTTP 201 Created
   - Nodepool ID generated
   - status.phase = "Not Ready" (MVP: only "Ready" or "Not Ready")
   - status.adapters = [] (no adapters reported yet)
   - status.lastUpdated set
   - generation = 1

4. Verify nodepool appears in list via `GET /api/hyperfleet/v1/clusters/{cluster_id}/nodepools`
   - Nodepool included in response
   - Can filter by labels

5. Monitor nodepool status via `GET /api/hyperfleet/v1/clusters/{cluster_id}/nodepools/{id}`
   - Verify phase remains "Not Ready" until all adapters complete
   - Monitor status.adapters array as adapters report their status

6. 
Monitor adapter statuses via `GET /api/hyperfleet/v1/clusters/{cluster_id}/nodepools/{id}/statuses` + - This returns ONE NodepoolStatus object containing all adapter statuses + - Verify NodepoolStatus.adapterStatuses array is populated by each adapter + - Each adapter reports conditions: Available, Applied, Health + - Validation Adapter conditions: + - Available: False (JobRunning) → True (JobSucceeded) + - Applied: True (JobLaunched) + - Health: True (NoErrors) + - Nodepool Adapter conditions: + - Available: False (JobRunning) → True (JobSucceeded) + - Applied: True (JobLaunched) + - Health: True (NoErrors) + +7. Verify final state + - Nodepool status.phase = "Ready" + - Nodepool status.adapters shows all adapters with: + - name: adapter name + - available: "True" + - observedGeneration: 1 (matches nodepool.generation) + - NodepoolStatus.adapterStatuses array contains all adapter statuses + - All adapters have Available condition = True + - Nodepool nodes are running and joined to cluster + +**Expected Duration**: Average time + +**Success Criteria**: +- Nodepool transitions to Ready state +- All adapters complete successfully +- Nodes are created and healthy in the cluster +- No errors in logs (API, Sentinel, Adapters, Jobs) +- Kubernetes Jobs complete successfully + +--- + +### E2E-005: Nodepool Configuration Update (Post-MVP) + +**Objective**: Validate nodepool spec update triggers reconciliation and completes successfully. + +**Test Steps**: +1. Create nodepool (using E2E-004) +2. Wait for Ready state +3. Update nodepool via `PATCH /api/hyperfleet/v1/clusters/{cluster_id}/nodepools/{id}` + - Change replicas: 2 → 4 +4. Verify adapter reconciliation (observedGeneration increments to 2) + - Validation Adapter re-validates updated spec (observedGeneration: 2) + - Nodepool Adapter applies changes (observedGeneration: 2) + - Each adapter reports Available: False → True as they reconcile +5. Verify final state + - Nodepool spec updated correctly + - Nodepool generation = 2 + - Nodepool status.phase = "Ready" + - All adapters in status.adapters have: + - available: "True" + - observedGeneration: 2 (matches nodepool.generation) + - NodepoolStatus shows all adapters with Available condition = True + - 4 nodes are running in the cluster + +**Expected Duration**: Average time + +**Success Criteria**: +- Spec updates applied successfully +- All adapters reconcile changes +- Nodepool returns to Ready state +- Correct number of nodes running + +--- + +### E2E-006: Nodepool Deletion (Post-MVP) + +**Objective**: Validate complete nodepool deletion and resource cleanup. + +**Test Steps**: +1. Create nodepool (using E2E-004) +2. Wait for Ready state +3. Delete nodepool via `DELETE /api/hyperfleet/v1/clusters/{cluster_id}/nodepools/{id}` +4. Verify API response + - HTTP 202 Accepted or 204 No Content + - Note: MVP deletion behavior TBD - nodepool may be deleted immediately or marked for deletion with deletionTimestamp + - If nodepool still exists after DELETE: status.phase = "Not Ready" (MVP has no "Terminating" phase) +5. Monitor deletion progress via `GET /api/hyperfleet/v1/clusters/{cluster_id}/nodepools/{id}/statuses` + - Adapters execute cleanup (typically in reverse order) + - Monitor NodepoolStatus.adapterStatuses for cleanup progress + - Each adapter reports cleanup via conditions: + - Nodepool Adapter: Available: True (cleanup complete) + - Nodepool resources removed +6. 
Verify nodepool deletion
   - `GET /api/hyperfleet/v1/clusters/{cluster_id}/nodepools/{id}` returns HTTP 404
   - Database record deleted
   - NodepoolStatus also deleted
   - All Kubernetes Jobs cleaned up
   - Nodes removed from cluster

**Expected Duration**: Average time

**Success Criteria**:
- Nodepool fully deleted from system
- No orphaned nodes in cluster
- No orphaned Kubernetes resources

---

## 2. Failure Scenario Tests

### E2E-FAIL-001: Adapter Failed (Business Logic)

**Objective**: Validate system handles validation failures with proper status reporting, distinguishing them from adapter health issues.

**Test Steps**:
1. Prepare a cluster spec with a missing prerequisite (e.g., Route53 zone not configured for the specified domain)
2. Submit cluster creation request via `POST /api/hyperfleet/v1/clusters`
3. Monitor adapter status via `GET /api/hyperfleet/v1/clusters/{id}/statuses`
   - Verify Validation Adapter conditions in ClusterStatus.adapterStatuses:
     - Available: False (reason: "ValidationFailed", message: "Route53 zone not found for specified domain")
     - Applied: True (reason: "JobLaunched", message: "Kubernetes Job created successfully")
     - Health: True (reason: "NoErrors", message: "Adapter executed normally (validation logic failed, not adapter error)")
   - Verify cluster status.phase = "Not Ready"
   - Verify cluster status.adapters shows validation adapter with available: "False"
4. Verify data field contains detailed validation results:
   ```json
   {
     "validationResults": {
       "route53ZoneFound": false,
       "s3BucketAccessible": true,
       "quotaSufficient": true
     },
     "checksPerformed": 15,
     "checksPassed": 14,
     "checksFailed": 1,
     "failedChecks": ["route53_zone"]
   }
   ```

**Success Criteria**:
- Validation failure reported with Health: True (business logic failure, not adapter error)
- Available: False indicates work incomplete
- Detailed validation results in data field
- Clear distinction between business logic failures and adapter health issues

---

### E2E-FAIL-002: Adapter Failed (Unexpected Error)

**Objective**: Validate system handles resource quota failures as unexpected errors with proper status reporting, covering the pattern where Job creation itself fails.

**Test Steps**:
1. Configure namespace resource quota limits (e.g., CPU limit in hyperfleet-jobs namespace)
2. Create multiple clusters to consume available quota
3. Create another cluster via `POST /api/hyperfleet/v1/clusters` that will trigger quota exceeded
4. Monitor Validation Adapter status via `GET /api/hyperfleet/v1/clusters/{id}/statuses`
   - Verify Validation Adapter cannot create Job
   - Verify Validation Adapter conditions in ClusterStatus.adapterStatuses:
     - Available: False (reason: "ResourceCreationFailed", message: "Failed to create validation Job")
     - Applied: False (reason: "ResourceQuotaExceeded", message: "Failed to create Job: namespace resource quota exceeded (cpu limit reached)")
     - Health: False (reason: "UnexpectedError", message: "Adapter could not complete due to resource quota limits")
   - Verify cluster status.phase = "Not Ready"
   - Verify cluster status.adapters shows validation adapter with available: "False"
5. 
Verify data field contains detailed error information: + ```json + { + "error": { + "type": "ResourceQuotaExceeded", + "message": "CPU limit reached", + "namespace": "hyperfleet-jobs" + } + } + ``` + +**Success Criteria**: +- Resource quota failure detected and reported via Health: False condition +- Available: False indicates work incomplete +- Applied: False shows Job was NOT created (key distinction from timeout scenarios) +- Detailed error information in data field with error type and context +- Clear distinction: quota exceeded is unexpected infrastructure error (Health: False), not business logic failure + +--- + +### E2E-FAIL-003: Database Connection Failure + +**Objective**: Validate API handles database connection failures gracefully. + +**Test Steps**: +1. Simulate database connection failure (stop PostgreSQL) +2. Attempt cluster operations via API + - GET /clusters (should return 503) + - POST /clusters (should return 503) +3. Verify API error responses + - HTTP 503 Service Unavailable + - Appropriate error messages +4. Restore database connection +5. Verify API operations resume normally +6. Create cluster and verify success + +**Success Criteria**: +- API returns 503 errors during outage +- API doesn't crash +- Operations resume after recovery +- No data corruption + +--- + +### E2E-FAIL-004: Adapter Precondition Not Met + +**Objective**: Validate adapter correctly skips execution when preconditions not met. + +**Test Steps**: +1. Create cluster +2. Monitor HyperShift Adapter (depends on DNS, Placement completing) +3. Simulate DNS Adapter stuck in "Running" phase +4. Verify HyperShift Adapter behavior + - Consumes event from broker + - Evaluates preconditions + - Preconditions not met (DNS not Complete) + - Does NOT create Job + - Acknowledges message +5. Complete DNS Adapter +6. Verify HyperShift Adapter processes next event + - Preconditions now met + - Job created and executed + +**Success Criteria**: +- Adapter correctly evaluates preconditions +- No Job created when preconditions not met +- Adapter processes event when preconditions met +- No deadlocks or stuck states + +--- + +### E2E-FAIL-005: Network Partition Between Components + +**Objective**: Validate system resilience during network partitions. + +**Test Steps**: +1. Create cluster (reaches Ready state) +2. Simulate network partition scenarios: + - Scenario A: Sentinel cannot reach API + - Scenario B: Adapter cannot reach API + - Scenario C: Adapter cannot reach broker +3. Monitor component behavior during partition + - Verify components log connection errors + - Verify components implement retry with backoff + - Verify no crashes or data loss +4. Restore network connectivity +5. Verify components resume normal operation +6. Create new cluster and verify success + +**Success Criteria**: +- Components handle network failures gracefully +- Retry mechanisms work correctly +- No data corruption or loss +- System recovers automatically + +--- + +### E2E-FAIL-006: Cluster Sentinel Operator Crash and Recovery + +**Objective**: Validate system continues functioning after Sentinel restarts. + +**Test Steps**: +1. Create 2 clusters (both in progress, not Ready) +2. Kill Sentinel Operator pod +3. Monitor cluster progress + - Verify no new events published during Sentinel downtime + - Verify adapters continue processing existing events +4. Kubernetes restarts Sentinel pod +5. Monitor Sentinel recovery + - Sentinel resumes polling API + - Sentinel publishes events for both clusters +6. 
Verify both clusters eventually reach Ready state + +**Success Criteria**: +- Clusters continue progressing during downtime +- Sentinel recovers automatically (Kubernetes restart) +- No events lost +- Clusters complete successfully + +--- + +### E2E-FAIL-007: Nodepool Sentinel Operator Crash and Recovery + +**Objective**: Validate system continues functioning for nodepool operations after Sentinel restarts. + +**Test Steps**: +1. Create cluster (using E2E-001) and wait for Ready state +2. Create 2 nodepools for the cluster (both in progress, not Ready) + - Nodepool A: "compute-pool" with 3 replicas + - Nodepool B: "gpu-pool" with 2 replicas +3. Kill Sentinel Operator pod +4. Monitor nodepool progress + - Verify no new events published during Sentinel downtime + - Verify adapters continue processing existing events for both nodepools +5. Kubernetes restarts Sentinel pod +6. Monitor Sentinel recovery + - Sentinel resumes polling API + - Sentinel publishes events for both nodepools +7. Verify both nodepools eventually reach Ready state + - Via `GET /api/hyperfleet/v1/clusters/{cluster_id}/nodepools/{id}` + - Verify status.phase = "Ready" for both nodepools + - Via `GET /api/hyperfleet/v1/clusters/{cluster_id}/nodepools/{id}/statuses` + - Verify all adapters have Available condition = True + +**Success Criteria**: +- Nodepools continue progressing during Sentinel downtime +- Sentinel recovers automatically (Kubernetes restart) +- No events lost for nodepool operations +- Both nodepools complete successfully +- Nodes are created and joined to cluster for both nodepools + +---
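
### Illustrative Test Helper: Polling for Ready (Non-Normative Sketch)

Most scenarios above repeat the same pattern: poll the resource until status.phase = "Ready", then check that every adapter reports Available = True at the expected observedGeneration via the `/statuses` endpoint. The sketch below is a minimal illustration of that pattern, not a prescribed implementation. The endpoint paths and field terms (status.phase, status.adapters, adapterStatuses, conditions, observedGeneration) come from this document; everything else is an assumption: the base URL, authentication (omitted), the exact JSON casing and condition shape (a Kubernetes-style `type`/`status`/`reason`/`message` object is assumed), the `requests` library, and the helper names `wait_for_phase` / `assert_all_adapters_available` themselves.

```python
"""Illustrative polling helpers for the E2E scenarios above (assumptions noted in comments)."""
import time

import requests  # third-party HTTP client; assumed available in the test environment

BASE_URL = "https://hyperfleet.example.com/api/hyperfleet/v1"  # placeholder, not a real endpoint


def wait_for_phase(resource_path: str, expected_phase: str = "Ready",
                   timeout_seconds: int = 3600, poll_interval: int = 30) -> dict:
    """Poll GET {BASE_URL}{resource_path} until status.phase matches expected_phase.

    resource_path examples (from the scenarios above):
      /clusters/{id}
      /clusters/{cluster_id}/nodepools/{id}
    """
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        resp = requests.get(f"{BASE_URL}{resource_path}", timeout=30)
        resp.raise_for_status()
        resource = resp.json()
        # Field casing assumed; the document only names status.phase as "Ready"/"Not Ready".
        if resource.get("status", {}).get("phase") == expected_phase:
            return resource
        time.sleep(poll_interval)
    raise TimeoutError(
        f"{resource_path} did not reach phase {expected_phase!r} within {timeout_seconds}s")


def assert_all_adapters_available(resource_path: str, expected_generation: int) -> None:
    """Fetch the single status object from {resource_path}/statuses and check that every
    adapter reports Available=True at the expected observedGeneration."""
    resp = requests.get(f"{BASE_URL}{resource_path}/statuses", timeout=30)
    resp.raise_for_status()
    status_obj = resp.json()
    # adapterStatuses and the condition object shape are assumptions based on the terms above.
    for adapter in status_obj.get("adapterStatuses", []):
        conditions = {c.get("type"): c for c in adapter.get("conditions", [])}
        available = conditions.get("Available", {})
        assert available.get("status") == "True", (
            f"adapter {adapter.get('name')} not Available: "
            f"{available.get('reason')} - {available.get('message')}")
        assert adapter.get("observedGeneration") == expected_generation, (
            f"adapter {adapter.get('name')} observedGeneration "
            f"{adapter.get('observedGeneration')} != {expected_generation}")
```

Used against E2E-001, for example, this would look like `wait_for_phase(f"/clusters/{cluster_id}")` followed by `assert_all_adapters_available(f"/clusters/{cluster_id}", expected_generation=1)`, and the same two calls with the nodepool path for E2E-004; the real tests should substitute the actual API schema and client once those are finalized.

---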