diff --git a/content/post/shift-left-on-performance/index.adoc b/content/post/shift-left-on-performance/index.adoc
new file mode 100644
index 0000000..7e3b81c
--- /dev/null
+++ b/content/post/shift-left-on-performance/index.adoc
@@ -0,0 +1,176 @@
+---
+title: "Shift Left on Performance"
+date: 2025-03-04T00:00:00Z
+categories: ['performance', 'methodology', 'CI/CD']
+summary: 'To speed up delivery of software using CD pipelines, performance testing needs to be performed continually, with the correct bounds, to add value to the software development process'
+image: 'shift-left-banner.png'
+related: ['']
+authors:
+  - John O'Hara
+---
+= Shift Left on Performance
+:icons: font
+
+Performance testing is often a major bottleneck in the productization of software. Performance tests are normally run late in the development lifecycle, require a lot of manual intervention, and regressions are often detected only after a long bake time.
+
+Shifting from versioned products to continually delivered services changes the risk profile of performance regressions and requires a paradigm shift in how performance testing is managed.
+
+
+== Problem Statement
+
+How can performance engineering teams enable Eng / QE / SRE to integrate performance testing into their workflows, to reduce the risk that performance regressions propagate through to production services?
+
+
+== Typical Product Workflow
+
+A typical "boxed product"footnote:[A product that is shipped to customers, either physically or electronically, typically with a multi-month/year release cadence] productization workflow can be represented by:
+
+
+image::typical_workflow.png[Typical Workflow,,,float="right",align="center"]
+
+The key issues with this type of workflow:
+
+* There is a *break in continuity* between the `tag build` and `release` stages of the CI build pipeline
+* Development, build and performance testing are performed by different teams, each passing *async* messages between teams
+* The feedback loop to developers is *manual* and *slow*
+* A lot of *manual analysis* is performed, often with ad-hoc data capture and reporting
+
+This scenario generally develops due to a number of factors:
+
+* Dedicated performance environments are costly and difficult to set up and manage
+* Performance analysis (including system performance monitoring and analysis) is generally a specialized role, concentrated in small teams
+* The time required to maintain reliable, accurate benchmarks is often a significant time sink
+
+'''
+
+=== What's the problem?
+
+image::shift-left-chart.png[]
+
+The further into the development cycle performance testing occurs, the more costly it is to fix performance bugs.footnote:[https://www.nist.gov/system/files/documents/director/planning/report02-3.pdf]
+
+Over the years, methodologies have been developed that allow functional tests to be performed earlier in the development lifecycle, reducing the time between functional regressions being introduced and discovered.
+
+This has the following benefits:
+
+* Tests are pushed earlier into the development cycle
+* Quality issues are discovered more quickly
+* The cost to fix is reduced
+* Fewer test & deploy cycles are required
+
+Functional issues are typically easier to fix than performance issues because they involve specific, reproducible errors in the software's behavior; therefore, performance testing should be "shifted left" in the same way that functional testing has been.
+
+'''
+
+=== What does it mean to Shift Left?
+
+In the traditional Waterfall model for software development, shift left means pushing tests earlier into the development cycle:
+
+image::shift-left-waterfall.jpeg[]
+
+source: https://insights.sei.cmu.edu/blog/four-types-of-shift-left-testing/
+
+==== In the Agile world
+
+For continually delivered services, "shifting left" includes an additional dimension:
+
+image::shift-left-agile.jpeg[Agile Shift Left,,,float="right"]
+
+source: https://insights.sei.cmu.edu/blog/four-types-of-shift-left-testing/
+
+Not only do we want to run performance tests earlier in the dev/release cycle, we also want to ensure that the full suite of performance tests (or any proxy performance testsfootnote:[Citation needed]) captures performance regressions before multiple release cycles have occurred.
+
+== Risks in the managed service world
+
+Managed services change the risk profile associated with a software product:
+
+* *Multiple, rapid dev cycles*: the time period between development and release is greatly reduced
+
+* The probability of releasing a product with a performance regression is *increased*
+
+* Performance regressions will affect *all* customers, *immediately*
+
+
+'''
+
+== Performance Shift-Left Workflow
+
+In order to manage the changed risk profile of managed services compared to boxed products, a new methodology is required:
+
+image::shift-workflow.png[Shift-Left Workflow,,,float="right"]
+
+In a "shifted-left" model:
+
+* *Code Repository Bots* allow performance engineers to initiate *Upstream Performance Tests* against open Pull Requests, returning comparative performance data to the workflow that the engineer already uses in their day-to-day job.
+* *Integrated Performance Threshold* tests provide automated gating on acceptable levels of performance
+* *Continual Performance Testing* allows for trend analysis over time, as well as scale, soak and chaos-type testing, performed asynchronously from the CI/CD build pipeline
+* *Automated Regression Detection* provides automated tooling for detecting catastrophic performance regressions related to a single commit, as well as gradual performance degradation over time
+
+Continual analysis is still performed by experienced engineers, but the process does not require manual intervention for each release.
+
+Engineers are free to focus on implementing features rather than worrying about performance regressions. When regressions are detected, the information needed to identify the root cause is readily available, in a suitable format.
+
+== Code Repository Bots
+
+Code Repository Bots initiate performance tests against PRs. Their purpose is to allow engineers to make an informed decision on whether or not to merge a PR. The results need to be actionable by engineers, and profiling data should also be provided so that engineers can understand what their changes are doing.
+
+Engineers receive a report and analysis of the impact of their changes on key performance metrics, together with automatically captured profiling data of the system under load, allowing them to *see* what their changes are doing under realistic scenarios.
+
+* Triggered from the CI/CD pipeline
+* Automatic / manual invocation
+* Performance results reported in the PR
+* Actionable data for engineers (results/profiles added to the PR to keep all information co-located for each PR)
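+
+For illustration, a minimal sketch of what such a bot can look like, assuming a hypothetical `run-benchmark.sh` harness that prints a single throughput number per run, and using the `requests` package with the GitHub REST API to post the comparison back to the PR:
+
+[source,python]
+----
+import json
+import os
+import statistics
+import subprocess
+
+import requests
+
+GITHUB_API = "https://api.github.com"
+REPO = os.environ["PERF_BOT_REPO"]    # e.g. "my-org/my-service" (hypothetical)
+TOKEN = os.environ["PERF_BOT_TOKEN"]  # token for the bot account
+
+
+def run_benchmark(git_ref: str, iterations: int = 5) -> float:
+    """Check out a ref and return the mean score of a hypothetical benchmark harness."""
+    subprocess.run(["git", "checkout", git_ref], check=True)
+    scores = []
+    for _ in range(iterations):
+        out = subprocess.run(["./run-benchmark.sh"],  # hypothetical harness, prints one number
+                             capture_output=True, text=True, check=True)
+        scores.append(float(out.stdout.strip()))
+    return statistics.mean(scores)
+
+
+def comment_on_pr(pr_number: int, body: str) -> None:
+    """Post the comparison back to the PR so the results stay co-located with the change."""
+    resp = requests.post(
+        f"{GITHUB_API}/repos/{REPO}/issues/{pr_number}/comments",
+        headers={"Authorization": f"Bearer {TOKEN}",
+                 "Accept": "application/vnd.github+json"},
+        data=json.dumps({"body": body}),
+        timeout=30,
+    )
+    resp.raise_for_status()
+
+
+def compare_and_report(pr_number: int, base_ref: str, pr_ref: str) -> None:
+    """Benchmark the base branch and the PR branch, then report the delta on the PR."""
+    base = run_benchmark(base_ref)
+    candidate = run_benchmark(pr_ref)
+    delta_pct = (candidate - base) / base * 100.0
+    comment_on_pr(pr_number, (
+        f"Benchmark (mean of 5 runs): base={base:.1f} ops/s, "
+        f"PR={candidate:.1f} ops/s ({delta_pct:+.1f}%)"
+    ))
+
+
+if __name__ == "__main__":
+    # Hypothetical invocation from a CI job that exports these variables
+    compare_and_report(pr_number=int(os.environ["PR_NUMBER"]),
+                       base_ref="main", pr_ref=os.environ["PR_HEAD_REF"])
+----
+
+A real bot would also attach profiling artifacts (flamegraphs, GC logs, etc.) to the same comment, but the shape is the same: run the workload on both refs, compare, and report the result where the engineer already works.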
+
+== Integrated Performance Thresholds
+
+The aim of Integrated Performance Tests is to determine whether a release meets acceptable levels of performance with respect to customer expectations, not to capture changes over time. The results need to be automatically calculated and should provide a boolean Pass/Fail result.
+
+* Pass/Fail criteria - as with functional tests, the performance should either be acceptable or not acceptable
+* Fully automated - no manual intervention / analysis
+* Focused on user experience
+* Threshold based
+* Integrated with QE tools
+* Portable tests
+* Limits / thresholds defined by CPT
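+
+A minimal sketch of such a gate, assuming a hypothetical results file produced by the benchmark run; the threshold values are illustrative only:
+
+[source,python]
+----
+import json
+import sys
+
+# Illustrative thresholds only; real limits are agreed per product/service.
+THRESHOLDS = {
+    "p99_latency_ms": {"max": 250.0},
+    "throughput_rps": {"min": 1500.0},
+    "error_rate_pct": {"max": 0.1},
+}
+
+
+def check_thresholds(results: dict) -> bool:
+    """Return True only if every metric is within its acceptable bounds."""
+    ok = True
+    for metric, bounds in THRESHOLDS.items():
+        value = results[metric]
+        too_high = "max" in bounds and value > bounds["max"]
+        too_low = "min" in bounds and value < bounds["min"]
+        if too_high or too_low:
+            print(f"FAIL {metric}={value} (bounds {bounds})")
+            ok = False
+        else:
+            print(f"PASS {metric}={value}")
+    return ok
+
+
+if __name__ == "__main__":
+    # e.g. results.json: {"p99_latency_ms": 212.0, "throughput_rps": 1640.0, "error_rate_pct": 0.02}
+    with open(sys.argv[1]) as f:
+        passed = check_thresholds(json.load(f))
+    # A non-zero exit code fails the pipeline stage, exactly like a failing functional test.
+    sys.exit(0 if passed else 1)
+----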
+
+== Continual Performance Testing
+
+The aim of Continual Performance Testing is to run larger-scale performance workloads that can take time to perform.
+
+These tests can include:
+
+* Large-scale end-to-end testing
+* Soak tests
+* Chaos testing
+* Trend analysis
+* Scale testing
+* Automated tuning of the environment
+* Detailed profiling and analysis work
+
+== Automatic Change Detection
+
+Automated tools allow changes in key performance metrics to be detected over time. Tools such as Horreumfootnote:[https://horreum.hyperfoil.io/] can be configured to monitor key performance metrics for particular products, and can be integrated into existing workflows and tools to raise alerts or block build pipelines when a significant change is detected.
+
+The key to incorporating automated tools into the CI/CD pipeline is their ability to integrate seamlessly into existing pipelines and to provide accurate, actionable events.
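+
+To illustrate the shape of the check (Horreum and similar tools use considerably more robust statistics), a minimal sketch that flags the latest result when it deviates from a baseline of recent runs by more than a configurable number of standard deviations, using made-up throughput numbers:
+
+[source,python]
+----
+import statistics
+
+
+def detect_change(history: list[float], latest: float,
+                  min_samples: int = 5, threshold_sigma: float = 3.0) -> bool:
+    """Flag `latest` if it deviates from the recent baseline by more than
+    `threshold_sigma` standard deviations."""
+    if len(history) < min_samples:
+        return False  # not enough data to establish a baseline
+    mean = statistics.mean(history)
+    stdev = statistics.stdev(history)
+    if stdev == 0:
+        return latest != mean
+    return abs(latest - mean) > threshold_sigma * stdev
+
+
+# Mean throughput (requests/s) of recent nightly runs vs. tonight's run - illustrative values
+baseline = [1520.0, 1498.0, 1505.0, 1512.0, 1490.0, 1508.0]
+print(detect_change(baseline, 1502.0))  # False: within normal variation
+print(detect_change(baseline, 1310.0))  # True: significant regression, raise an alert
+----
+
+The important property is not the specific statistic, but that the detection runs on every result, emits an event the pipeline can act on, and points the engineer at the metric and the run that changed.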
+
+== Continual Profiling & Monitoring
+
+Not all performance issues will be caught during the development lifecycle. It is crucial that production systems capture sufficient performance-related data to allow performance issues to be identified in production. The data needs to be of sufficient resolution to support root cause analysis during a post mortem, or to provide the information needed to test for the performance issue in the CI/CD pipeline.
+
+== Integration with Analytical Tools
+
+In order to understand the performance characteristics of a service running in production, all of the performance metrics captured at the different stages of testing (dev, CI/CD, production) need to be accessible for a performance engineer to analyze.
+
+This requires the performance metrics to be co-located and available for analysis by tools, e.g. statistical analysis tools.
+
+== Further Tooling to assist Product Teams
+
+Other tools that can help product teams with performance-related issues include:
+
+* *Performance Bisect*: perform an automated bisect of the source repository, running performance test(s) at each step to automatically identify the code merge that introduced a performance regression
+* *Automated profiling analysis*: AI/ML models to automatically spot performance issues in profiling data
+* *Proxy Metrics*: system metrics captured during functional testing that provide an indication that a performance/scale issue will manifest at runtime
+* *Automatic tuning of service configuration*: using Hyper-Parameter Optimizationfootnote:[https://github.com/kruize/hpo] to automatically tune the configuration space of a service to optimize performance for a given target environment/workload
\ No newline at end of file
diff --git a/content/post/shift-left-on-performance/shift-left-agile.jpeg b/content/post/shift-left-on-performance/shift-left-agile.jpeg
new file mode 100644
index 0000000..bd14203
Binary files /dev/null and b/content/post/shift-left-on-performance/shift-left-agile.jpeg differ
diff --git a/content/post/shift-left-on-performance/shift-left-banner.png b/content/post/shift-left-on-performance/shift-left-banner.png
new file mode 100644
index 0000000..f69ae90
Binary files /dev/null and b/content/post/shift-left-on-performance/shift-left-banner.png differ
diff --git a/content/post/shift-left-on-performance/shift-left-chart.png b/content/post/shift-left-on-performance/shift-left-chart.png
new file mode 100644
index 0000000..f73749c
Binary files /dev/null and b/content/post/shift-left-on-performance/shift-left-chart.png differ
diff --git a/content/post/shift-left-on-performance/shift-left-waterfall.jpeg b/content/post/shift-left-on-performance/shift-left-waterfall.jpeg
new file mode 100644
index 0000000..f2da31d
Binary files /dev/null and b/content/post/shift-left-on-performance/shift-left-waterfall.jpeg differ
diff --git a/content/post/shift-left-on-performance/shift-workflow.png b/content/post/shift-left-on-performance/shift-workflow.png
new file mode 100644
index 0000000..c4bc69e
Binary files /dev/null and b/content/post/shift-left-on-performance/shift-workflow.png differ
diff --git a/content/post/shift-left-on-performance/typical_workflow.png b/content/post/shift-left-on-performance/typical_workflow.png
new file mode 100644
index 0000000..69a0cec
Binary files /dev/null and b/content/post/shift-left-on-performance/typical_workflow.png differ