Merged
55 changes: 55 additions & 0 deletions .github/workflows/resilience-nightly.yml
@@ -0,0 +1,55 @@
name: resilience-nightly

on:
  schedule:
    - cron: "17 3 * * *"
  workflow_dispatch:
    inputs:
      mode:
        description: "dry-run or live"
        required: false
        default: dry-run
      namespace:
        description: "Kubernetes namespace for live mode"
        required: false
        default: cefas-resilience
      release:
        description: "Helm release name"
        required: false
        default: cefas-resilience
      kube_context:
        description: "kubectl context for live mode"
        required: false
        default: ""

permissions:
  contents: read

concurrency:
  group: resilience-nightly-${{ github.ref }}
  cancel-in-progress: false

jobs:
  k8s-resilience:
    name: kubernetes resilience suite
    runs-on: ubuntu-latest
    timeout-minutes: 30
    env:
      MODE: ${{ github.event_name == 'workflow_dispatch' && inputs.mode || 'dry-run' }}
      NAMESPACE: ${{ github.event_name == 'workflow_dispatch' && inputs.namespace || 'cefas-resilience' }}
      RELEASE: ${{ github.event_name == 'workflow_dispatch' && inputs.release || 'cefas-resilience' }}
      KUBE_CONTEXT: ${{ github.event_name == 'workflow_dispatch' && inputs.kube_context || '' }}
      ARTIFACT_DIR: ${{ runner.temp }}/cefas-resilience/${{ github.run_id }}
    steps:
      - uses: actions/checkout@v6
      - uses: azure/setup-helm@v4
      - uses: azure/setup-kubectl@v4
      - name: Run resilience suite
        run: scripts/k8s_resilience_suite.sh
      - name: Upload resilience artifacts
        if: always()
        uses: actions/upload-artifact@v7
        with:
          name: resilience-suite
          path: ${{ runner.temp }}/cefas-resilience
          if-no-files-found: warn
5 changes: 4 additions & 1 deletion Makefile
@@ -5,7 +5,7 @@ BIN_DIR := ./bin
VERSION := $(shell git describe --tags --always --dirty 2>/dev/null || echo dev)
LDFLAGS := -ldflags "-s -w -X main.Version=$(VERSION)"

.PHONY: help build server cli install clean fmt lint vet test cover mut sec bench helm-test tools ci
.PHONY: help build server cli install clean fmt lint vet test cover mut sec bench helm-test k8s-resilience tools ci

help: ## List available targets.
	@awk 'BEGIN {FS = ":.*?## "} /^[a-zA-Z_-]+:.*?## / {printf " %-12s %s\n", $$1, $$2}' $(MAKEFILE_LIST)
@@ -76,4 +76,7 @@ bench: ## Run benchmarks across all packages.
helm-test: ## Render-test Helm resilience profiles.
	scripts/test_helm_resilience.sh

k8s-resilience: ## Run the Kubernetes resilience suite in dry-run mode by default.
	scripts/k8s_resilience_suite.sh

ci: vet lint test cover sec ## Full quality gate (mirror of CI workflow).
6 changes: 6 additions & 0 deletions docs/helm-resilience.md
@@ -96,3 +96,9 @@ Run the chart smoke tests:
```sh
scripts/test_helm_resilience.sh
```

Run the Kubernetes resilience acceptance suite in CI-safe dry-run mode:

```sh
MODE=dry-run scripts/k8s_resilience_suite.sh
```
64 changes: 64 additions & 0 deletions docs/resilience-acceptance-matrix.md
@@ -0,0 +1,64 @@
# Kubernetes Resilience Acceptance Matrix

This is the acceptance matrix for Kubernetes and Talos-style failure testing
at replication factor 3 (RF=3). The suite is implemented by
`scripts/k8s_resilience_suite.sh`.

Run the CI-safe render check:

```sh
MODE=dry-run scripts/k8s_resilience_suite.sh
```

Run against a real cluster:

```sh
MODE=live \
KUBE_CONTEXT=hack \
NAMESPACE=cefas-resilience \
RELEASE=cefas-resilience \
KEEP_CLUSTER=1 \
scripts/k8s_resilience_suite.sh
```

Destructive node and provider faults require an explicit opt-in:

```sh
MODE=live \
KUBE_CONTEXT=hack \
ALLOW_DESTRUCTIVE=1 \
NODE_SHUTDOWN_COMMAND='./ops/talos-shutdown-one-node "$TARGET_NODE"' \
NODE_RESTORE_COMMAND='./ops/talos-power-on-one-node "$TARGET_NODE"' \
scripts/k8s_resilience_suite.sh
```

The hook commands run with `TARGET_POD`, `TARGET_NODE`, `NAMESPACE`,
`RELEASE`, `APP`, `SELECTOR`, and `KUBE_CONTEXT` exported.
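
As a rough illustration of how a hook might consume those variables, here is a
minimal sketch. The function name `shutdown_hook` and its output are
hypothetical; only the variable names come from the list above, and the real
provider call is left as a placeholder.

```sh
#!/usr/bin/env sh
# Hypothetical hook sketch: names and messages here are illustrative only.
set -eu

shutdown_hook() {
  # Prefer the positional argument, as in NODE_SHUTDOWN_COMMAND above,
  # and fall back to the exported TARGET_NODE.
  node="${1:-${TARGET_NODE:-}}"
  if [ -z "$node" ]; then
    echo "no target node given" >&2
    return 1
  fi
  # Placeholder for the real provider call; this sketch only reports intent.
  echo "would shut down node ${node} (namespace ${NAMESPACE:-unset}, context ${KUBE_CONTEXT:-default})"
}
```

Refusing to run without a target node keeps a misconfigured hook from acting
on an unintended host.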

| Scenario | Fault | Expected Service Behavior | Recovery | Stop Condition |
| --- | --- | --- | --- | --- |
| `healthy-baseline` | None | All database Pods are ready; `cefas-manager doctor` is healthy or degraded-but-serving. | Not applicable. | Doctor fails or reports unsafe. |
| `pod-kill` | Delete one CefasDB Pod and keep PVCs. | StatefulSet recreates the Pod and the RF=3 cluster remains serving. | Wait for rollout and startup probe completion. | Ready quorum is not restored or doctor reports unsafe. |
| `node-drain` | Cordon one node hosting a database Pod and drain CefasDB Pods from it. | One failure domain can be unavailable while the remaining majority stays serving. | Uncordon the node and let Kubernetes reschedule. | Quorum-ready Pod count is lost or doctor reports unsafe. |
| `node-shutdown` | Provider/Talos hook shuts down one node. | One physical host loss remains serving or explicitly degraded-but-serving. | Provider/Talos hook powers the node back on. | Doctor reports unsafe for a one-node loss. |
| `orphan-process` | Provider/Talos hook leaves or simulates a stale process with an old raft identity. | The stale process is fenced by the Kubernetes lease and the active cluster remains serving. | Provider/Talos hook removes the stale process. | The stale process can serve or CPU-spins without a valid lease. |
| `disk-pressure` | Provider/Talos hook applies disk pressure on one database node. | Manager reports pressure and the database stays serving or degraded-but-serving. | Provider/Talos hook removes pressure and confirms PVC health. | Pods spin indefinitely or doctor reports unsafe for a one-node fault. |
| `network-partition` | Provider/Talos hook partitions one database node from the cluster. | The majority side remains serving and the isolated node is not accepted as healthy. | Provider/Talos hook removes the partition and raft catches up. | Split brain is observed or manager cannot identify unsafe state. |
| `two-node-failure` | Two failure domains unavailable at the same time. | RF=3 fails safely with clear quorum or unsafe reporting. | Restore nodes/hooks and wait for raft catch-up. | Pods CPU-spin without functional quorum or health reporting is ambiguous. |

## Artifacts

Each run writes:

- `summary.md`: scenario table and statuses
- `matrix.json`: machine-readable acceptance matrix
- `rendered.yaml`: Helm render used for the run
- `helm-lint.txt`: chart lint output
- `scenarios/<name>/doctor.json`: manager health report when available
- `scenarios/<name>/*-pods.txt`, `*-events.txt`, `*-nodes.txt`, and logs for failures
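
A quick way to sanity-check a run directory is to confirm that the top-level
files listed above exist. This is a sketch only: `check_artifacts` is a
hypothetical helper, not part of the suite, and it assumes all four top-level
files are written in every mode.

```sh
# Sketch: report any top-level artifact missing from a run directory.
# The function name and the directory argument are illustrative.
check_artifacts() {
  dir="$1"
  missing=0
  for f in summary.md matrix.json rendered.yaml helm-lint.txt; do
    if [ ! -f "$dir/$f" ]; then
      echo "missing: $f"
      missing=1
    fi
  done
  # Non-zero exit status when at least one file is absent.
  return "$missing"
}
```

For example, `check_artifacts "$ARTIFACT_DIR"` could be run after the suite
finishes, assuming `ARTIFACT_DIR` points at the run's output directory.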

## Acceptance Rules

Losing one failure domain must leave the cluster serving or explicitly
degraded-but-serving. With RF=3, the cluster is not required to serve after
losing two failure domains, but it must fail safely: the manager must report
quorum loss or an unsafe state, and Pods must not remain CPU-bound while
nonfunctional.
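
The arithmetic behind these rules can be sketched as follows, assuming the
cluster needs a strict majority of its RF replicas to serve (the usual raft
rule). `quorum_state` is a hypothetical helper for illustration, not something
the suite provides.

```sh
# Sketch of majority-quorum arithmetic; assumes a strict majority is required.
quorum_state() {
  rf="$1"      # replication factor, e.g. 3
  failed="$2"  # number of failure domains currently lost
  majority=$(( rf / 2 + 1 ))
  alive=$(( rf - failed ))
  if [ "$alive" -ge "$majority" ]; then
    echo "serving"
  else
    echo "quorum-lost"
  fi
}
```

With RF=3 the majority is 2, so `quorum_state 3 1` prints `serving` and
`quorum_state 3 2` prints `quorum-lost`, matching the one-domain and two-domain
cases above.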