Skip to content

Commit eb37c6a

Browse files
feat: Add support for grace-period for failed/pending pods [KO-455] (#415)
* feat: Add support for grace-period for failed/pending pods [KO-455] * add test-cases along with github action * fix test-case timing issue * define podState type to make code flow more readable
1 parent d6b1c0a commit eb37c6a

File tree

14 files changed

+911
-43
lines changed

14 files changed

+911
-43
lines changed

.github/workflows/README.md

Lines changed: 52 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,52 @@
1+
# GitHub Actions Workflows
2+
3+
This directory contains GitHub Actions workflows for CI/CD automation.
4+
5+
## Available Workflows
6+
7+
### 1. Unit tests (`pkg-unit-tests.yaml`)
8+
**Unit testing workflow for pkg directory**
9+
10+
**Triggers:**
11+
- Push to `master` branch
12+
- Pull requests to `master` branch
13+
- **Only when files in `pkg/` directory change**
14+
15+
**Features:**
16+
- ✅ Runs all unit tests in `pkg/` directory with race detection
17+
- ✅ Basic coverage reporting
18+
- ✅ Fast execution for quick feedback
19+
- ✅ Minimal configuration
20+
21+
### 2. GolangCI Lint (`golangci-lint.yaml`)
22+
**Code quality and linting checks**
23+
24+
**Triggers:**
25+
- Push to `master` branch or version tags
26+
- Pull requests to `master` branch
27+
28+
**Features:**
29+
- ✅ Runs golangci-lint with comprehensive checks
30+
- ✅ 5-minute timeout for large codebases
31+
32+
### 4. CodeQL Analysis (`codeql-analysis.yml`)
33+
**Security and code quality analysis**
34+
35+
**Triggers:**
36+
- Scheduled runs and code changes
37+
38+
**Features:**
39+
- ✅ Security vulnerability scanning
40+
- ✅ Code quality analysis
41+
42+
### 5. Docker Image Release (`docker-image-release.yaml`)
43+
**Container image building and publishing**
44+
45+
**Triggers:**
46+
- Version tag pushes
47+
48+
**Features:**
49+
- ✅ Multi-architecture Docker image builds
50+
- ✅ Image publishing to container registry
51+
52+
Lines changed: 36 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,36 @@
1+
name: unit tests
2+
3+
on:
4+
push:
5+
branches:
6+
- master
7+
paths:
8+
- 'pkg/**'
9+
- '.github/workflows/pkg-unit-tests.yaml'
10+
pull_request:
11+
branches:
12+
- master
13+
paths:
14+
- 'pkg/**'
15+
- '.github/workflows/pkg-unit-tests.yaml'
16+
17+
jobs:
18+
pkg-unit-tests:
19+
name: pkg
20+
runs-on: ubuntu-latest
21+
22+
steps:
23+
- name: Checkout code
24+
uses: actions/checkout@v4
25+
with:
26+
submodules: true
27+
28+
- name: Setup Go
29+
uses: actions/setup-go@v5
30+
with:
31+
go-version-file: go.mod
32+
cache: true
33+
34+
- name: Run pkg unit tests
35+
run: make pkg-test
36+

Makefile

Lines changed: 12 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -141,6 +141,18 @@ go-lint-fix: golangci-lint ## Run golangci-lint linter and perform fixes
141141
.PHONY: all-test
142142
all-test: manifests generate fmt vet setup-envtest cluster-test backup-service-test backup-test restore-test ## Run tests.
143143

144+
.PHONY: pkg-test
145+
pkg-test: ## Run unit tests for pkg directory
146+
@echo "Running pkg unit tests..."
147+
go test -v -race -coverprofile=coverage.out ./pkg/...
148+
@echo "\nCoverage Summary:"
149+
@go tool cover -func=coverage.out | tail -1
150+
151+
.PHONY: pkg-test-coverage
152+
pkg-test-coverage: pkg-test ## Run pkg unit tests and open coverage report in browser
153+
@echo "Opening coverage report in browser..."
154+
go tool cover -html=coverage.out
155+
144156
.PHONY: cluster-test
145157
cluster-test: manifests generate fmt vet setup-envtest ## Run tests.
146158
KUBEBUILDER_ASSETS="$(shell $(ENVTEST) use $(ENVTEST_K8S_VERSION) --bin-dir $(LOCALBIN) -p path)" cd $(shell pwd)/test/cluster; mkdir -p ../test-results; go run github.com/onsi/ginkgo/v2/ginkgo --grace-period=10m -p --procs=8 -coverprofile ascover.out -v -show-node-events --focus="$(FOCUS)" -timeout=5h0m0s --junit-report=../test-results/junit-cluster.xml -- ${ARGS}

TESTING.md

Lines changed: 123 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,123 @@
1+
# Testing Guide
2+
3+
This document describes how to run tests for the Aerospike Kubernetes Operator.
4+
5+
## Quick Reference
6+
7+
| Command | Description |
8+
|-----------------------------------|------------------------------------------|
9+
| `make pkg-test` | Run pkg unit tests (CI command) |
10+
| `make pkg-test-coverage` | Run tests + open coverage report |
11+
| `make cluster-test` | Run cluster integration tests |
12+
| `make backup-service-test` | Run backup-service integration tests |
13+
| `make backup-test` | Run backup integration tests |
14+
| `make restore-test` | Run restore integration tests |
15+
| `make all-test` | Run all tests |
16+
| `go test ./pkg/...` | Run all pkg tests manually |
17+
| `go test -v -race ./pkg/utils` | Run specific package with race detection |
18+
| `go test -run TestName ./pkg/...` | Run specific test |
19+
20+
## Quick Start
21+
22+
### Run PKG Unit Tests (Recommended)
23+
24+
```bash
25+
# Run pkg unit tests (same command used in CI)
26+
make pkg-test
27+
28+
# Run pkg unit tests and open coverage report in browser
29+
make pkg-test-coverage
30+
```
31+
32+
## Available Test Targets
33+
34+
### Unit Tests
35+
36+
#### `make pkg-test`
37+
Runs all unit tests in the `pkg/` directory with race detection and coverage reporting.
38+
39+
**What it does:**
40+
- Runs `go test -v -race -coverprofile=coverage.out ./pkg/...`
41+
- Displays coverage summary at the end
42+
- Same command used by GitHub Actions CI
43+
44+
**Example output:**
45+
```
46+
Running pkg unit tests...
47+
=== RUN TestGetFailedPodGracePeriod
48+
--- PASS: TestGetFailedPodGracePeriod (0.00s)
49+
...
50+
PASS
51+
coverage: 29.8% of statements
52+
53+
Coverage Summary:
54+
total: (statements) 18.6%
55+
```
56+
57+
#### `make pkg-test-coverage`
58+
Runs pkg unit tests and opens an HTML coverage report in your browser.
59+
60+
**Use this when:**
61+
- You want to see detailed line-by-line coverage
62+
- You're improving test coverage
63+
- You need to identify untested code paths
64+
65+
### Integration Tests
66+
67+
#### `make cluster-test`
68+
Runs cluster integration tests using Ginkgo.
69+
70+
#### `make backup-service--test`
71+
Runs backup-service integration tests.
72+
73+
#### `make backup-test`
74+
Runs backup integration tests.
75+
76+
#### `make restore-test`
77+
Runs restore integration tests.
78+
79+
#### `make all-test`
80+
Runs all tests (unit + integration).
81+
82+
## Running Tests Manually
83+
84+
### Run all pkg tests:
85+
```bash
86+
go test -v -race ./pkg/...
87+
```
88+
89+
### Run specific package:
90+
```bash
91+
go test -v -race ./pkg/utils
92+
go test -v -race ./pkg/jsonpatch
93+
go test -v -race ./pkg/merge
94+
```
95+
96+
### Run specific test:
97+
```bash
98+
go test -v -race ./pkg/utils -run TestGetFailedPodGracePeriod
99+
```
100+
101+
### Run with coverage:
102+
```bash
103+
go test -v -race -coverprofile=coverage.out ./pkg/...
104+
go tool cover -func=coverage.out
105+
go tool cover -html=coverage.out
106+
```
107+
108+
## Test Coverage
109+
110+
### Current Coverage
111+
112+
Run `make pkg-test` to see current coverage:
113+
```
114+
pkg/jsonpatch: 64.1% of statements
115+
pkg/merge: 89.3% of statements
116+
pkg/utils: 29.8% of statements
117+
total: 18.6% of statements
118+
```
119+
120+
## Additional Resources
121+
- [Go Testing Documentation](https://golang.org/pkg/testing/)
122+
- [GitHub Actions Workflows](.github/workflows/README.md)
123+

api/v1/utils.go

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -35,6 +35,9 @@ const (
3535
AdminPortName = "admin"
3636

3737
InfoPortName = "info"
38+
39+
DefaultFailedPodGracePeriodSeconds = 60
40+
RequeueIntervalSeconds10 = 10
3841
)
3942

4043
const (

config/manager/manager.yaml

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -60,6 +60,9 @@ spec:
6060
- name: WATCH_NAMESPACE
6161
# for watching multiple namespaces by operator, give a list of namespaces (e.g. aerospike,test,test1,test2)
6262
value: aerospike
63+
- name: FAILED_POD_GRACE_PERIOD_SECONDS
64+
# for setting the grace period to delete/recover failed pods, default is 60 seconds
65+
value: 60
6366
- name: AEROSPIKE_KUBERNETES_INIT_REGISTRY
6467
# this is the registry used to pull aerospike-init image
6568
value: docker.io

helm-charts/aerospike-kubernetes-operator/templates/aerospike-operator-controller-manager-deployment.yaml

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -61,6 +61,8 @@ spec:
6161
env:
6262
- name: WATCH_NAMESPACE
6363
value: {{ .Values.watchNamespaces | quote }}
64+
- name: FAILED_POD_GRACE_PERIOD_SECONDS
65+
value: {{ .Values.failedPodGracePeriodSeconds | quote }}
6466
- name: AEROSPIKE_KUBERNETES_INIT_REGISTRY
6567
value: {{ .Values.aerospikeKubernetesInitRegistry }}
6668
- name: AEROSPIKE_KUBERNETES_INIT_REGISTRY_NAMESPACE

helm-charts/aerospike-kubernetes-operator/values.yaml

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -34,6 +34,9 @@ certs:
3434
## Operator configurations
3535
watchNamespaces: "default,aerospike"
3636

37+
# Grace period to delete/recover failed pods (in seconds)
38+
failedPodGracePeriodSeconds: "60"
39+
3740
# Registry used to pull aerospike-init image
3841
aerospikeKubernetesInitRegistry: "docker.io"
3942

internal/controller/cluster/pod.go

Lines changed: 35 additions & 10 deletions
Original file line numberDiff line numberDiff line change
@@ -261,7 +261,7 @@ func (r *SingleClusterReconciler) rollingRestartPods(
261261
rackState *RackState, podsToRestart []*corev1.Pod, ignorablePodNames sets.Set[string],
262262
restartTypeMap map[string]RestartType,
263263
) common.ReconcileResult {
264-
failedPods, activePods := getFailedAndActivePods(podsToRestart)
264+
failedPods, failedWithinGracePeriodPods, activePods := getFailedAndActivePods(podsToRestart, true)
265265

266266
// If already dead node (failed pod) then no need to check node safety, migration
267267
if len(failedPods) != 0 {
@@ -317,6 +317,15 @@ func (r *SingleClusterReconciler) rollingRestartPods(
317317
}
318318
}
319319

320+
if len(failedWithinGracePeriodPods) != 0 {
321+
r.Log.Info(
322+
"Pods are in failed state but within grace period, will not delete",
323+
"pods", getPodNames(failedWithinGracePeriodPods),
324+
)
325+
326+
return common.ReconcileRequeueAfter(asdbv1.RequeueIntervalSeconds10)
327+
}
328+
320329
return common.ReconcileSuccess()
321330
}
322331

@@ -548,22 +557,28 @@ func (r *SingleClusterReconciler) ensurePodsRunningAndReady(podsToCheck []*corev
548557
podNames,
549558
)
550559

551-
return common.ReconcileRequeueAfter(10)
560+
return common.ReconcileRequeueAfter(asdbv1.RequeueIntervalSeconds10)
552561
}
553562

554-
func getFailedAndActivePods(pods []*corev1.Pod) (failedPods, activePods []*corev1.Pod) {
563+
func getFailedAndActivePods(
564+
pods []*corev1.Pod, withGracePeriod bool) (failedPods, failedWithinGracePeriodPods, activePods []*corev1.Pod,
565+
) {
555566
for idx := range pods {
556567
pod := pods[idx]
557568

558-
if err := utils.CheckPodFailed(pod); err != nil {
569+
podState := utils.CheckPodFailedWithGrace(pod, withGracePeriod)
570+
571+
switch podState.State {
572+
case utils.PodHealthy:
573+
activePods = append(activePods, pod)
574+
case utils.PodFailedInGrace:
575+
failedWithinGracePeriodPods = append(failedWithinGracePeriodPods, pod)
576+
case utils.PodFailed:
559577
failedPods = append(failedPods, pod)
560-
continue
561578
}
562-
563-
activePods = append(activePods, pod)
564579
}
565580

566-
return failedPods, activePods
581+
return failedPods, failedWithinGracePeriodPods, activePods
567582
}
568583

569584
func getNonIgnorablePods(pods []*corev1.Pod, ignorablePodNames sets.Set[string],
@@ -585,7 +600,7 @@ func getNonIgnorablePods(pods []*corev1.Pod, ignorablePodNames sets.Set[string],
585600
func (r *SingleClusterReconciler) safelyDeletePodsAndEnsureImageUpdated(
586601
rackState *RackState, podsToUpdate []*corev1.Pod, ignorablePodNames sets.Set[string],
587602
) common.ReconcileResult {
588-
failedPods, activePods := getFailedAndActivePods(podsToUpdate)
603+
failedPods, failedWithinGracePeriodPods, activePods := getFailedAndActivePods(podsToUpdate, true)
589604

590605
// If already dead node (failed pod) then no need to check node safety, migration
591606
if len(failedPods) != 0 {
@@ -640,6 +655,15 @@ func (r *SingleClusterReconciler) safelyDeletePodsAndEnsureImageUpdated(
640655
}
641656
}
642657

658+
if len(failedWithinGracePeriodPods) != 0 {
659+
r.Log.Info(
660+
"Pods are in failed state but within grace period, will not delete",
661+
"pods", getPodNames(failedWithinGracePeriodPods),
662+
)
663+
664+
return common.ReconcileRequeueAfter(asdbv1.RequeueIntervalSeconds10)
665+
}
666+
643667
return common.ReconcileSuccess()
644668
}
645669

@@ -720,6 +744,7 @@ func (r *SingleClusterReconciler) ensurePodsImageUpdated(podsToCheck []*corev1.P
720744
return common.ReconcileError(err)
721745
}
722746

747+
// For existing cluster operations, no grace period for immediate responsiveness
723748
if err := utils.CheckPodFailed(updatedPod); err != nil {
724749
return common.ReconcileError(err)
725750
}
@@ -746,7 +771,7 @@ func (r *SingleClusterReconciler) ensurePodsImageUpdated(podsToCheck []*corev1.P
746771
podNames,
747772
)
748773

749-
return common.ReconcileRequeueAfter(10)
774+
return common.ReconcileRequeueAfter(asdbv1.RequeueIntervalSeconds10)
750775
}
751776

752777
// cleanupPods checks pods and status before scale-up to detect and fix any

0 commit comments

Comments
 (0)