Add chaos testing setup and experiment documentation #3
- Updated README.md with prerequisites, environment setup, and chaos experiment instructions.
- Created EXPERIMENT-GUIDE.md for detailed chaos experiment execution and monitoring.
- Added YAML files for chaos experiments: cnpg-primary-pod-delete.yaml, cnpg-random-pod-delete.yaml, and cnpg-replica-pod-delete.yaml.
- Implemented Litmus RBAC configuration in litmus-rbac.yaml.
- Configured PostgreSQL cluster in pg-eu-cluster.yaml.
- Developed scripts for environment verification (check-environment.sh) and chaos results retrieval (get-chaos-results.sh).
- Enhanced status check script (status-check.sh) for Litmus installation verification.
Signed-off-by: XploY04 <[email protected]>
Signed-off-by: XploY04 <[email protected]>
…TARGET_PODS Signed-off-by: XploY04 <[email protected]>
… updating documentation. Added support for chaos experiments without hard-coded pod names, improved README and quick start guides, and introduced monitoring scripts for better visibility during chaos experiments. Signed-off-by: XploY04 <[email protected]>
…a consistency verification
- Implemented `setup-cnp-bench.sh` for configuring cnp-bench with detailed instructions for benchmarking CloudNativePG.
- Created `setup-prometheus-monitoring.sh` to apply PodMonitor configurations for Prometheus metrics scraping.
- Developed `verify-data-consistency.sh` to check data integrity after chaos experiments, including various consistency tests.
- Added `pgbench-continuous-job.yaml` for running continuous pgbench workloads during chaos testing, with options for custom workloads.
Signed-off-by: XploY04 <[email protected]>
… Prometheus Signed-off-by: XploY04 <[email protected]>
…for consistency in chaos experiments Signed-off-by: XploY04 <[email protected]>
- Created a Kubernetes Job definition for running the Jepsen PostgreSQL consistency test against a CloudNativePG cluster.
- The job includes environment variables for configuration, command execution for testing, and result handling.
- Added a PersistentVolumeClaim for storing Jepsen test results with a request for 2Gi of storage.
Signed-off-by: XploY04 <[email protected]>
Please use the latest version of the operator in the installation instructions:
…oring enhancements Signed-off-by: XploY04 <[email protected]>
- Implemented a comprehensive bash script to orchestrate Jepsen consistency testing with chaos experiments.
- The script includes pre-flight checks, database cleanup, PVC management, Jepsen job deployment, chaos experiment application, and result extraction.
- Added logging functionality with color-coded output for better readability.
- Integrated error handling and cleanup procedures to ensure graceful exits and resource management.
- Provided detailed usage instructions and exit codes for user guidance.
Signed-off-by: XploY04 <[email protected]>
…ration Signed-off-by: XploY04 <[email protected]>
…ting
- Introduced a new ChaosEngine configuration () for running Jepsen tests without Prometheus probes, allowing for chaos testing in environments lacking monitoring.
- Updated existing to remove unnecessary probe configurations and ensure compatibility with the new no-probes variant.
- Modified to include a Service definition for metrics collection and changed PodMonitor to ServiceMonitor for better integration with Prometheus.
- Removed obsolete and Jepsen job configurations that are no longer needed.
- Deleted scripts for fetching chaos results and monitoring CNPG pods, streamlining the testing process.
- Enhanced to include namespace and context parameters for improved flexibility.
Signed-off-by: XploY04 <[email protected]>
…d improve primary pod identification logic Signed-off-by: XploY04 <[email protected]>
… replication monitoring Signed-off-by: XploY04 <[email protected]>
README.md
Outdated
| > | ||
| > ```bash | ||
| > VERSION="v1.27.1" | ||
| > curl -L "https://github.com/cloudnative-pg/cloudnative-pg/releases/download/${VERSION}/kubectl-cnpg_${VERSION}_linux_amd64.tar.gz" -o /tmp/kubectl-cnpg.tar.gz |
This command doesn't work: the version in the asset name is not `v1.27.1`, and `amd64` isn't in the name of the binary to download.
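One way to avoid guessing the asset name is to list the download URLs that a release actually publishes and pick the matching one. The sketch below is only an illustration: `list_asset_urls` is a hypothetical helper, and it assumes the release JSON comes from the GitHub REST API "latest release" endpoint, where each asset carries a `browser_download_url` field.

```bash
# Print every asset download URL found in a GitHub release JSON document
# read from stdin. Plain grep/cut keeps the sketch free of extra tools.
list_asset_urls() {
  grep -o '"browser_download_url": *"[^"]*"' | cut -d'"' -f4
}

# Hypothetical usage (needs network access):
#   curl -fsSL https://api.github.com/repos/cloudnative-pg/cloudnative-pg/releases/latest \
#     | list_asset_urls | grep 'kubectl-cnpg.*linux'
```

The printed URLs show the exact version and architecture strings used in the file names, which can then be pasted into the `curl -L` command.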
README.md
Outdated
|
|
||
| ```bash | ||
| # Re-export the playground kubeconfig if you opened a new shell | ||
| export KUBECONFIG=/path/to/cnpg-playground/k8s/kube-config.yaml |
Probably something like
export KUBECONFIG=$PWD/k8s/kube-config.yaml
since you are already inside the cnpg-playground directory.
README.md
Outdated
| # Apply the 1.27.1 operator manifest exactly as documented | ||
| kubectl apply --server-side -f \ | ||
| https://raw.githubusercontent.com/cloudnative-pg/cloudnative-pg/release-1.27/releases/cnpg-1.27.1.yaml | ||
|
|
||
| # Alternatively, generate a custom manifest via the kubectl cnpg plugin | ||
| kubectl cnpg install generate --control-plane \ | ||
| | kubectl apply --context kind-k8s-eu -f - --server-side |
I would keep one or the other, but not both. It's confusing: my first question was "Why am I installing the same thing twice?"
| > Follow these sections in order; each references the authoritative upstream documentation to keep this README concise. | ||
| ### 1. Bootstrap the CNPG Playground |
For this test it's important to mention that the max open files limit must be increased, otherwise it doesn't work. This is in the playground, IIRC.
I didn't think about this. I have added that.
Where was this added? I can't see it.
- Increase max open files limit if needed (required for Jepsen on some systems):
ulimit -n 65536
In Prerequisites before running the script.
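A guarded version of that prerequisite could look like the sketch below. `ensure_nofile_limit` is a hypothetical helper, not something from the PR; it only raises the soft limit of the current shell, and whether that is sufficient for the playground is an assumption.

```bash
# Raise the soft open-files limit of the current shell to at least $1.
# Returns non-zero if the hard limit does not allow the requested value.
ensure_nofile_limit() {
  local want=$1 current
  current=$(ulimit -n)
  # "unlimited" or an already sufficient value needs no change.
  if [ "$current" = "unlimited" ] || [ "$current" -ge "$want" ]; then
    return 0
  fi
  # -S touches only the soft limit and leaves the hard limit alone.
  ulimit -Sn "$want"
}
```

Calling `ensure_nofile_limit 65536 || echo "raise the hard limit first" >&2` before the test script makes the requirement explicit instead of leaving it as an optional note.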
README.md
Outdated
| kubectl apply -n litmus -f https://hub.litmuschaos.io/api/chaos/master?file=faults/kubernetes/pod-delete/fault.yaml | ||
|
|
||
| # OR install from local file (if you need customization) | ||
| kubectl apply -n litmus -f chaosexperiments/pod-delete-cnpg.yaml |
This failed with the following error:
error: the namespace from the provided object "default" does not match the namespace "litmus". You must pass '--namespace=default' to perform this operation.
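The error says the manifest declares `namespace: default` while the command asks for `litmus`. One possible workaround, sketched here under the assumption that the local file should end up in `litmus`, is to rewrite the namespace field before applying. `retarget_namespace` is a hypothetical helper.

```bash
# Rewrite every "namespace: <old>" field in a manifest read from stdin
# so that it names the target namespace instead. Other lines pass through.
retarget_namespace() {
  local old=$1 new=$2
  sed -E "s/^([[:space:]]*namespace:[[:space:]]*)${old}[[:space:]]*\$/\1${new}/"
}

# Hypothetical usage:
#   retarget_namespace default litmus < chaosexperiments/pod-delete-cnpg.yaml \
#     | kubectl apply -f -
```

Note that this rewrites every matching `namespace:` line in the file, so it is worth reviewing the output once. Fixing the namespace in the manifest itself is the more durable option.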
| # Watch the chaos runner pod start (refreshes every 2s) | ||
| watch -n2 'kubectl -n litmus get pods | grep cnpg-jepsen-chaos-noprobes-runner' |
And then what do I do? It isn't clear whether I should exit at some point; I'm guessing yes.
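A command that returns on its own removes the question of when to press Ctrl+C. The following is a minimal sketch, assuming the goal is only to block until the runner pod shows up; `wait_for_pod_named` and its default timeout are hypothetical.

```bash
# Poll until a pod whose name contains $2 exists in namespace $1, then return.
# Gives up after $3 seconds (default 120) so the command never hangs forever.
wait_for_pod_named() {
  local ns=$1 pattern=$2 timeout=${3:-120} waited=0
  until kubectl -n "$ns" get pods -o name 2>/dev/null | grep -q -- "$pattern"; do
    [ "$waited" -ge "$timeout" ] && return 1
    sleep 2
    waited=$((waited + 2))
  done
}
```

For example, `wait_for_pod_named litmus cnpg-jepsen-chaos-noprobes-runner` returns as soon as the pod is listed, and the next README step can follow immediately.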
README.md
Outdated
| # Check experiment logs to see pod deletions (ensure a pod exists first) | ||
| runner_pod=$(kubectl -n litmus get pods -l chaos-runner-name=cnpg-jepsen-chaos-noprobes -o jsonpath='{.items[0].metadata.name}') && \ | ||
| kubectl -n litmus logs -f "$runner_pod" |
This failed with the following error:
runner_pod=$(kubectl -n litmus get pods -l chaos-runner-name=cnpg-jepsen-chaos-noprobes -o jsonpath='{.items[0].metadata.name}') && \
kubectl -n litmus logs -f "$runner_pod"
error: error executing jsonpath "{.items[0].metadata.name}": Error executing template: array index out of bounds: index 0, length 0. Printing more information for debugging the template:
template was:
{.items[0].metadata.name}
object given to jsonpath engine was:
map[string]interface {}{"apiVersion":"v1", "items":[]interface {}{}, "kind":"List", "metadata":map[string]interface {}{"resourceVersion":""}}
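The failure comes from indexing `items[0]` on an empty list, which happens whenever no pod matches the label at that moment. A lookup that tolerates an empty result could be sketched as follows; `first_pod_by_label` is a hypothetical helper, and the label value is taken from the README snippet above.

```bash
# Print the name of the first pod matching label selector $2 in namespace $1.
# A jsonpath range is used instead of items[0], so an empty list yields empty
# output and a non-zero return rather than an "index out of bounds" error.
first_pod_by_label() {
  local ns=$1 selector=$2 name
  name=$(kubectl -n "$ns" get pods -l "$selector" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' | head -n 1)
  [ -n "$name" ] && printf '%s\n' "$name"
}

# Hypothetical usage:
#   runner_pod=$(first_pod_by_label litmus chaos-runner-name=cnpg-jepsen-chaos-noprobes) \
#     && kubectl -n litmus logs -f "$runner_pod"
```

With this form the `&&` chain simply stops when no runner pod exists yet, which is easier to interpret than the template error.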
README.md
Outdated
| Expose the CNPG metrics port (9187) through a dedicated Service + ServiceMonitor bundle, then verify Prometheus scrapes it. Manual management keeps you aligned with the operator deprecation of `spec.monitoring.enablePodMonitor` and dodges the PodMonitor regression in kube-prometheus-stack v79 where CNPG pods only advertise the `postgresql` and `status` ports: | ||
|
|
||
| ```bash | ||
| kubectl create namespace monitoring --dry-run=client -o yaml | kubectl apply -f - |
This throws the following warning:
kubectl create namespace monitoring --dry-run=client -o yaml | kubectl apply -f -
Warning: resource namespaces/monitoring is missing the kubectl.kubernetes.io/last-applied-configuration annotation which is required by kubectl apply. kubectl apply should only be used on resources created declaratively by either kubectl create --save-config or kubectl apply. The missing annotation will be patched automatically.
namespace/monitoring configured
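The warning appears when `kubectl apply` meets a namespace that already exists without the last-applied annotation. A sketch that sidesteps it by creating the namespace only when it is missing is shown below; `ensure_namespace` is a hypothetical helper.

```bash
# Create namespace $1 only when it does not exist yet. Skipping the
# create-then-apply pipeline avoids the missing-annotation warning.
ensure_namespace() {
  local ns=$1
  kubectl get namespace "$ns" >/dev/null 2>&1 || kubectl create namespace "$ns"
}
```

`ensure_namespace monitoring` is safe to run repeatedly: the second run finds the namespace and does nothing.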
|
|
||
| > **Note:** Keep using `experiments/cnpg-jepsen-chaos-noprobes.yaml` until Section 5 installs Prometheus/Grafana. Once monitoring is online, switch to `experiments/cnpg-jepsen-chaos.yaml` (probes enabled) for full observability. | ||
| ### 5. Configure monitoring (Prometheus + Grafana) |
At this step I ran out of disk space with 20G available. This should be clarified; my test stopped here because it ran out of space.
I added this now.
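A pre-flight check can surface this before the monitoring stack starts pulling images. The sketch below is illustrative only: `free_gib` and `require_free_gib` are hypothetical helpers, and the thread does not state how much space is actually needed, only that 20G was not enough.

```bash
# Print the free space, in whole GiB, of the filesystem that holds path $1.
free_gib() {
  df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024 / 1024) }'
}

# Succeed only if at least $2 GiB are free on the filesystem holding $1.
require_free_gib() {
  local path=$1 need=$2 have
  have=$(free_gib "$path")
  if [ "$have" -lt "$need" ]; then
    echo "only ${have} GiB free on ${path}; ${need} GiB wanted" >&2
    return 1
  fi
}
```

Point it at the directory where the container runtime stores its data and choose the threshold from a measured run, for example `require_free_gib /var/lib/docker 30`, where both the path and the number are placeholders.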
…, refine Jepsen prerequisites, and improve various command examples. Signed-off-by: XploY04 <[email protected]>
…refine chaos result summary output. Signed-off-by: XploY04 <[email protected]>
… corresponding setup instructions. Signed-off-by: XploY04 <[email protected]>
…p and removing optional Litmus UI and advanced CNPG install details. Signed-off-by: XploY04 <[email protected]>
Signed-off-by: XploY04 <[email protected]>
README.md
Outdated
| export KUBECONFIG=$PWD/k8s/kube-config.yaml | ||
| kubectl config use-context kind-k8s-eu | ||
|
|
||
| # Apply the 1.27.1 operator manifest exactly as documented |
Given that you have the plugin installed, I would rely on the plugin to install the latest version of the operator. See https://github.com/cloudnative-pg/cnpg-playground/blob/main/demo/setup.sh#L65
…est runner with EOT probe checks, and streamline cluster credential handling. Signed-off-by: XploY04 <[email protected]>
…the for the latest version. Signed-off-by: XploY04 <[email protected]>
…date and upgrade logic. Signed-off-by: XploY04 <[email protected]>
Signed-off-by: Gabriele Bartolini <[email protected]>
…Hub for . Signed-off-by: XploY04 <[email protected]>
| 2. **Install CNPG** → Deploy operator + sample cluster (section 2) | ||
| 3. **Install Litmus** → Install operator, experiments, and RBAC (sections 3, 3.5, 3.6) | ||
| 4. **Smoke-test chaos** → Run the quick pod-delete check without monitoring (section 4) | ||
| 5. **Add monitoring** → Install Prometheus for probe validation (section 5; required before section 6 with probes enabled) |
Ideally this section should be moved into the CNPG Playground.
Signed-off-by: Gabriele Bartolini <[email protected]>
| kubectl krew update | ||
| kubectl krew install cnpg || kubectl krew upgrade cnpg | ||
| kubectl cnpg version | ||
| ``` |
Install and then upgrade? Doesn't install already fetch the latest version by default?
| kubectl --context kind-k8s-eu apply -f - --server-side | ||
|
|
||
| # Verify the controller rollout kubectl --context kind-k8s-eu rollout status deployment \ | ||
| -n cnpg-system cnpg-controller-manager |
Clearly a typo: there's a missing newline before the kubectl command.
Fixed
| kubectl apply -f clusters/cnpg-config.yaml | ||
| kubectl rollout restart -n cnpg-system deployment cnpg-controller-manager |
This doesn't work: that directory doesn't exist and is never created.
@gbartolini added it in the latest commit.
…ple in README. Signed-off-by: XploY04 <[email protected]>
…nd disk space cleanup for chaos testing. Signed-off-by: XploY04 <[email protected]>
This PR introduces a production-ready chaos testing framework for CloudNativePG clusters, combining Jepsen (formal consistency verification) with Litmus Chaos (pod deletion) to provide mathematical proof of data integrity under failure conditions.
Part of LFX Mentorship Program 2025/3 to strengthen CloudNativePG resilience through rigorous testing.
Closes #2
What's Included
Test Workflow
`:valid? true` (PASS) or `:valid? false` (FAIL)
Quick Start
Expected Results
Successful Test
{:valid? true :anomaly-types [] :not #{}}
Statistics
Result Files
- results.edn - Consistency verdict (:valid? true/false)
- history.edn - Complete operation log (3-6 MB)
- timeline.html - Interactive visualization
- STATISTICS.txt - High-level summary
- jepsen.log - Full test execution logs
Technical Highlights
Dynamic Pod Targeting
Comprehensive Probes
Configurable Parameters
Testing Performed
Future Enhancements
Signed-off-by: XploY04 [email protected]
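The PASS/FAIL verdict described above lives in the `:valid?` key of results.edn, so it can be read from a script. The following is a rough sketch, not the PR's own tooling: `jepsen_verdict` is a hypothetical helper, and it conservatively reports FAIL if any `:valid? false` appears anywhere in the file.

```bash
# Print PASS or FAIL for a Jepsen results file, based on the :valid? key.
# Returns 0 for PASS, 1 for FAIL, and 2 when no true/false verdict is found.
jepsen_verdict() {
  local file=$1
  # -F treats the pattern as a fixed string, so "?" is matched literally.
  if grep -qF ':valid? false' "$file"; then
    echo FAIL
    return 1
  elif grep -qF ':valid? true' "$file"; then
    echo PASS
  else
    echo "no :valid? verdict in $file" >&2
    return 2
  fi
}
```

Running `jepsen_verdict results.edn` after a test gives an exit code that a wrapper script or CI job can act on.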