CefasDB · Osvaldo Andrade (osvaldoandrade) · Jun 26, 2026 · Jun 26, 2026
diff --git a/.github/workflows/resilience-nightly.yml b/.github/workflows/resilience-nightly.yml
@@ -0,0 +1,55 @@
+name: resilience-nightly
+
+on:
+  schedule:
+    - cron: "17 3 * * *"
+  workflow_dispatch:
+    inputs:
+      mode:
+        description: "dry-run or live"
+        required: false
+        default: dry-run
+      namespace:
+        description: "Kubernetes namespace for live mode"
+        required: false
+        default: cefas-resilience
+      release:
+        description: "Helm release name"
+        required: false
+        default: cefas-resilience
+      kube_context:
+        description: "kubectl context for live mode"
+        required: false
+        default: ""
+
+permissions:
+  contents: read
+
+concurrency:
+  group: resilience-nightly-${{ github.ref }}
+  cancel-in-progress: false
+
+jobs:
+  k8s-resilience:
+    name: kubernetes resilience suite
+    runs-on: ubuntu-latest
+    timeout-minutes: 30
+    env:
+      MODE: ${{ github.event_name == 'workflow_dispatch' && inputs.mode || 'dry-run' }}
+      NAMESPACE: ${{ github.event_name == 'workflow_dispatch' && inputs.namespace || 'cefas-resilience' }}
+      RELEASE: ${{ github.event_name == 'workflow_dispatch' && inputs.release || 'cefas-resilience' }}
+      KUBE_CONTEXT: ${{ github.event_name == 'workflow_dispatch' && inputs.kube_context || '' }}
+    steps:
+      - uses: actions/checkout@v6
+      - uses: azure/setup-helm@v4
+      - uses: azure/setup-kubectl@v4
+      - name: Run resilience suite
+        # The runner context is not available in job-level env, so ARTIFACT_DIR is set here.
+        run: ARTIFACT_DIR="$RUNNER_TEMP/cefas-resilience/$GITHUB_RUN_ID" scripts/k8s_resilience_suite.sh
+      - name: Upload resilience artifacts
+        if: always()
+        uses: actions/upload-artifact@v7
+        with:
+          name: resilience-suite
+          path: ${{ runner.temp }}/cefas-resilience
+          if-no-files-found: warn
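
Reviewer note on the `env` expressions in this workflow: each `event == 'workflow_dispatch' && inputs.X || 'default'` expression falls back to the default on scheduled runs and also when a dispatch input is left empty, because an empty string is falsy in GitHub Actions expressions. A rough shell equivalent, offered only as a sketch (`resolve_input` is a hypothetical helper and is not part of this PR):

```sh
# Hypothetical helper mirroring the workflow expression
#   ${{ github.event_name == 'workflow_dispatch' && inputs.X || 'default' }}
# $1 = event name, $2 = input value (may be empty), $3 = default value
resolve_input() {
  if [ "$1" = "workflow_dispatch" ] && [ -n "$2" ]; then
    # Manual dispatch with a non-empty input: the input wins.
    printf '%s\n' "$2"
  else
    # Scheduled run, or an empty input: fall back to the default.
    printf '%s\n' "$3"
  fi
}

resolve_input schedule "" dry-run             # prints dry-run
resolve_input workflow_dispatch live dry-run  # prints live
resolve_input workflow_dispatch "" dry-run    # prints dry-run
```

The practical consequence is that the nightly schedule can never run in live mode; only an explicit dispatch with `mode` set to `live` can.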
diff --git a/Makefile b/Makefile
@@ -5,7 +5,7 @@ BIN_DIR := ./bin
 VERSION := $(shell git describe --tags --always --dirty 2>/dev/null || echo dev)
 LDFLAGS := -ldflags "-s -w -X main.Version=$(VERSION)"
 
-.PHONY: help build server cli install clean fmt lint vet test cover mut sec bench helm-test tools ci
+.PHONY: help build server cli install clean fmt lint vet test cover mut sec bench helm-test k8s-resilience tools ci
 
 help: ## List available targets.
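-	@awk 'BEGIN {FS = ":.*?## "} /^[a-zA-Z_-]+:.*?## / {printf "  %-12s %s\n", $$1, $$2}' $(MAKEFILE_LIST)
+	@awk 'BEGIN {FS = ":.*?## "} /^[a-zA-Z0-9_-]+:.*?## / {printf "  %-12s %s\n", $$1, $$2}' $(MAKEFILE_LIST)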
@@ -76,4 +76,7 @@ bench: ## Run benchmarks across all packages.
 helm-test: ## Render-test Helm resilience profiles.
 	scripts/test_helm_resilience.sh
 
+k8s-resilience: ## Run the Kubernetes resilience suite in dry-run mode by default.
+	scripts/k8s_resilience_suite.sh
+
 ci: vet lint test cover sec ## Full quality gate (mirror of CI workflow).
diff --git a/docs/helm-resilience.md b/docs/helm-resilience.md
@@ -96,3 +96,9 @@ Run the chart smoke tests:
 ```sh
 scripts/test_helm_resilience.sh
 ```
+
+Run the Kubernetes resilience acceptance suite in CI-safe dry-run mode:
+
+```sh
+MODE=dry-run scripts/k8s_resilience_suite.sh
+```
diff --git a/docs/resilience-acceptance-matrix.md b/docs/resilience-acceptance-matrix.md
@@ -0,0 +1,64 @@
+# Kubernetes Resilience Acceptance Matrix
+
+This is the RF=3 acceptance matrix for Kubernetes and Talos-style failure
+testing. The suite is implemented by `scripts/k8s_resilience_suite.sh`.
+
+Run the CI-safe render check:
+
+```sh
+MODE=dry-run scripts/k8s_resilience_suite.sh
+```
+
+Run against a real cluster:
+
+```sh
+MODE=live \
+KUBE_CONTEXT=hack \
+NAMESPACE=cefas-resilience \
+RELEASE=cefas-resilience \
+KEEP_CLUSTER=1 \
+scripts/k8s_resilience_suite.sh
+```
+
+Destructive node and provider faults require an explicit opt-in:
+
+```sh
+MODE=live \
+KUBE_CONTEXT=hack \
+ALLOW_DESTRUCTIVE=1 \
+NODE_SHUTDOWN_COMMAND='./ops/talos-shutdown-one-node "$TARGET_NODE"' \
+NODE_RESTORE_COMMAND='./ops/talos-power-on-one-node "$TARGET_NODE"' \
+scripts/k8s_resilience_suite.sh
+```
+
+The hook commands run with `TARGET_POD`, `TARGET_NODE`, `NAMESPACE`,
+`RELEASE`, `APP`, `SELECTOR`, and `KUBE_CONTEXT` exported.
+
+| Scenario | Fault | Expected Service Behavior | Recovery | Stop Condition |
+| --- | --- | --- | --- | --- |
+| `healthy-baseline` | None | All database Pods are ready; `cefas-manager doctor` is healthy or degraded-but-serving. | Not applicable. | Doctor fails or reports unsafe. |
+| `pod-kill` | Delete one CefasDB Pod and keep PVCs. | StatefulSet recreates the Pod and the RF=3 cluster remains serving. | Wait for rollout and startup probe completion. | Ready quorum is not restored or doctor reports unsafe. |
+| `node-drain` | Cordon one node hosting a database Pod and drain CefasDB Pods from it. | One failure domain can be unavailable while the remaining majority stays serving. | Uncordon the node and let Kubernetes reschedule. | The ready Pod count drops below quorum or doctor reports unsafe. |
+| `node-shutdown` | Provider/Talos hook shuts down one node. | One physical host loss remains serving or explicitly degraded-but-serving. | Provider/Talos hook powers the node back on. | Doctor reports unsafe for a one-node loss. |
+| `orphan-process` | Provider/Talos hook leaves or simulates a stale process with an old raft identity. | The stale process is fenced by the Kubernetes lease and the active cluster remains serving. | Provider/Talos hook removes the stale process. | The stale process can serve or CPU-spins without a valid lease. |
+| `disk-pressure` | Provider/Talos hook applies disk pressure on one database node. | Manager reports pressure and the database stays serving or degraded-but-serving. | Provider/Talos hook removes pressure and confirms PVC health. | Pods spin indefinitely or doctor reports unsafe for a one-node fault. |
+| `network-partition` | Provider/Talos hook partitions one database node from the cluster. | The majority side remains serving and the isolated node is not accepted as healthy. | Provider/Talos hook removes the partition and raft catches up. | Split brain is observed or the manager cannot identify the unsafe state. |
+| `two-node-failure` | Two failure domains unavailable at the same time. | RF=3 fails safely: the manager clearly reports quorum loss or an unsafe state. | Restore nodes/hooks and wait for raft catch-up. | Pods CPU-spin without functional quorum or health reporting is ambiguous. |
+
+## Artifacts
+
+Each run writes:
+
+- `summary.md`: scenario table and statuses
+- `matrix.json`: machine-readable acceptance matrix
+- `rendered.yaml`: Helm render used for the run
+- `helm-lint.txt`: chart lint output
+- `scenarios/<name>/doctor.json`: manager health report when available
+- `scenarios/<name>/*-pods.txt`, `*-events.txt`, `*-nodes.txt`, and logs for failures
+
+## Acceptance Rules
+
+Losing one failure domain must leave the cluster serving or explicitly
+degraded-but-serving. Losing two failure domains need not keep an RF=3 cluster
+serving, but it must fail safely: the manager must report quorum loss or unsafe
+state, and Pods must not remain CPU-bound while nonfunctional.
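
Reviewer note on the acceptance rules: they reduce to a small decision table over the number of failure domains lost and the status the manager reports. The sketch below is illustrative only. `accept_scenario` and the status strings (`healthy`, `degraded-but-serving`, `quorum-lost`, `unsafe`) are assumptions made for this example; the actual `doctor.json` schema is not shown in this diff. It also does not model the CPU-spin check, which needs live Pod metrics.

```sh
# Hypothetical helper: does a scenario result satisfy the acceptance rules?
# $1 = failure domains lost (0, 1, or 2)
# $2 = manager status (assumed values: healthy, degraded-but-serving,
#      quorum-lost, unsafe)
accept_scenario() {
  lost=$1
  status=$2
  case "$lost" in
    0|1)
      # Up to one failure domain lost: the cluster must keep serving.
      case "$status" in
        healthy|degraded-but-serving) echo pass ;;
        *) echo fail ;;
      esac
      ;;
    2)
      # Two failure domains lost: serving is not required, but the manager
      # must say so explicitly. Any other status counts as ambiguous here.
      case "$status" in
        quorum-lost|unsafe) echo pass ;;
        *) echo fail ;;
      esac
      ;;
    *)
      # Anything outside the RF=3 matrix is rejected.
      echo fail
      ;;
  esac
}

accept_scenario 1 degraded-but-serving  # prints pass
accept_scenario 2 healthy               # prints fail
```

Treating a `healthy` report under two-domain loss as a failure is a deliberately strict reading of "health reporting is ambiguous" in the matrix; the suite itself may be more lenient.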