From 1be4fe80209ba73a5edfc825b107f9015491130c Mon Sep 17 00:00:00 2001 From: Colleen McGinnis Date: Thu, 26 Oct 2023 10:48:47 -0500 Subject: [PATCH 1/3] first draft --- docs/processing-performance.asciidoc | 112 ++++++++++++++++++++++----- 1 file changed, 91 insertions(+), 21 deletions(-) diff --git a/docs/processing-performance.asciidoc b/docs/processing-performance.asciidoc index 8ab84d60f38..f77e20a5d24 100644 --- a/docs/processing-performance.asciidoc +++ b/docs/processing-performance.asciidoc @@ -5,40 +5,110 @@ APM Server performance depends on a number of factors: memory and CPU available, network latency, transaction sizes, workload patterns, agent and server settings, versions, and protocol. -Let's look at a simple example that makes the following assumptions: +// Assumptions +// TBD -* The load is generated in the same region as where APM Server and {es} are deployed. -* We're using the default settings in cloud. -* A small number of agents are reporting. +// Method +To help you understand how sizing APM Server impacts performance, we tested several scenarios: -This leaves us with relevant variables like payload and instance sizes. -See the table below for approximations. -As a reminder, events are +* Using the default hardware template on AWS, GCP and Azure on {ecloud}. +* Comparing the performance when using OpenTelemetry (OTel) events on at least one of the service providers listed above (in this case, GCP). +* For each hardware template, testing with several sizes: 1 GB, 4 GB, 8 GB, and 32 GB. +* For each size, using a fixed number of APM agents: 10 agents for 1 GB, 30 agents for 4 GB, 60 agents for 8 GB, and 240 agents for 32 GB. +* In all scenarios, using medium sized events. Events include <> and <>. +// This leaves us with relevant variables like payload and instance sizes. 
+ +// Results +You can use the results of our tests to guide your sizing decisions, however, performance will vary based on factors unique to your use case like your specific setup, the size of APM event data, and the exact number of agents. + +:hardbreaks-option: + [options="header"] -|======================================================================= -|Transaction/Instance |512 MB Instance |2 GB Instance |8 GB Instance -|Small transactions +|==== +| Profile / Cloud | AWS | Azure | GCP (Agent) | GCP (OTel) + +| *1 GB* +(10 agents) +| 9,600 +events/second +| 6,400 +events/second +| 9,600 +events/second +| 28,000 +events/second + +| *4 GB* +(30 agents) +| 25,500 +events/second +| 18,000 +events/second +| 17,800 +events/second +| 49,300 +events/second + +| *8 GB* +(60 agents) +| 40,500 +events/second +| 26,000 +events/second +| 25,700 +events/second +| 89,300 +events/second -_5 spans with 5 stack frames each_ |600 events/second |1200 events/second |4800 events/second -|Medium transactions +| *16 GB* +(120 agents) +| 72,000 +events/second +| 51,000 +events/second +| 45,300 +events/second +| 145,000 +events/second -_15 spans with 15 stack frames each_ |300 events/second |600 events/second |2400 events/second -|Large transactions +| *32 GB* +(240 agents) +| 135,000 +events/second +| 95,000 +events/second +| 95,000 +events/second +| 381,000 +events/second -_30 spans with 30 stack frames each_ |150 events/second |300 events/second |1400 events/second -|======================================================================= +|==== -In other words, a 512 MB instance can process \~3 MB per second, -while an 8 GB instance can process ~20 MB per second. +:!hardbreaks-option: -APM Server is CPU bound, so it scales better from 2 GB to 8 GB than it does from 512 MB to 2 GB. -This is because larger instance types in {ecloud} come with much more computing power. 
+// In other words, a 512 MB instance can process \~3 MB per second, +// while an 8 GB instance can process ~20 MB per second. + +APM Server is CPU bound and larger instance types in {ecloud} come with much more computing power. +// so it scales better from 2 GB to 8 GB than it does from 512 MB to 2 GB. Don't forget that the APM Server is stateless. Several instances running do not need to know about each other. This means that with a properly sized {es} instance, APM Server scales out linearly. -NOTE: RUM deserves special consideration. The RUM agent runs in browsers, and there can be many thousands reporting to an APM Server with very variable network latency. \ No newline at end of file +NOTE: RUM deserves special consideration. The RUM agent runs in browsers, and there can be many thousands reporting to an APM Server with very variable network latency. + +// [discrete] +// ==== Troubleshoot sizing issues + +// [discrete] +// ===== Scale the APM Server and ES + +// [discrete] +// ===== Apply head based sampling + +// [discrete] +// ===== Apply tail based sampling From 2cce0153cd3acd58b2123ca545f24597247f9c60 Mon Sep 17 00:00:00 2001 From: Colleen McGinnis Date: Mon, 27 Nov 2023 09:43:14 -0600 Subject: [PATCH 2/3] address initial feedback --- docs/processing-performance.asciidoc | 36 ++++------------------------ 1 file changed, 5 insertions(+), 31 deletions(-) diff --git a/docs/processing-performance.asciidoc b/docs/processing-performance.asciidoc index f77e20a5d24..6ca2556c8d3 100644 --- a/docs/processing-performance.asciidoc +++ b/docs/processing-performance.asciidoc @@ -9,18 +9,15 @@ agent and server settings, versions, and protocol. 
// TBD // Method -To help you understand how sizing APM Server impacts performance, we tested several scenarios: +We tested several scenarios to help you understand how to size the APM Server so that it can keep up with the load that your Elastic APM agents are sending: * Using the default hardware template on AWS, GCP and Azure on {ecloud}. -* Comparing the performance when using OpenTelemetry (OTel) events on at least one of the service providers listed above (in this case, GCP). * For each hardware template, testing with several sizes: 1 GB, 4 GB, 8 GB, and 32 GB. * For each size, using a fixed number of APM agents: 10 agents for 1 GB, 30 agents for 4 GB, 60 agents for 8 GB, and 240 agents for 32 GB. * In all scenarios, using medium sized events. Events include <> and <>. -// This leaves us with relevant variables like payload and instance sizes. - // Results You can use the results of our tests to guide your sizing decisions, however, performance will vary based on factors unique to your use case like your specific setup, the size of APM event data, and the exact number of agents. 
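To illustrate how benchmark figures like these might inform a sizing decision, here is a minimal Python sketch, not part of the tested setup. The throughput map uses the AWS column of the events/second table in this section (a synthetic workload), and the 20% headroom factor is an arbitrary assumption, not a recommendation:

```python
# Rough instance-size picker based on the benchmark table in this section.
# Throughput numbers are the AWS column (events/second) from a synthetic
# workload; treat them as a starting point, not a guarantee.
AWS_THROUGHPUT = {
    "1 GB": 9_600,
    "4 GB": 25_500,
    "8 GB": 40_500,
    "16 GB": 72_000,
    "32 GB": 135_000,
}

def pick_instance_size(expected_events_per_second, headroom=0.2):
    """Return the smallest benchmarked size whose measured throughput
    covers the expected load plus a safety margin (dicts preserve
    insertion order, so sizes are checked smallest first)."""
    needed = expected_events_per_second * (1 + headroom)
    for size, eps in AWS_THROUGHPUT.items():
        if eps >= needed:
            return size
    return None  # load exceeds one 32 GB instance; consider scaling out

print(pick_instance_size(20_000))  # -> "4 GB"
```

Your own workload will almost certainly map to different numbers; the point is only the shape of the calculation.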
@@ -28,7 +25,7 @@ You can use the results of our tests to guide your sizing decisions, however, pe [options="header"] |==== -| Profile / Cloud | AWS | Azure | GCP (Agent) | GCP (OTel) +| Profile / Cloud | AWS | Azure | GCP | *1 GB* (10 agents) @@ -38,8 +35,6 @@ events/second events/second | 9,600 events/second -| 28,000 -events/second | *4 GB* (30 agents) @@ -49,9 +44,7 @@ events/second events/second | 17,800 events/second -| 49,300 -events/second - + | *8 GB* (60 agents) | 40,500 @@ -60,8 +53,6 @@ events/second events/second | 25,700 events/second -| 89,300 -events/second | *16 GB* (120 agents) @@ -71,8 +62,6 @@ events/second events/second | 45,300 events/second -| 145,000 -events/second | *32 GB* (240 agents) @@ -82,18 +71,12 @@ events/second events/second | 95,000 events/second -| 381,000 -events/second |==== :!hardbreaks-option: -// In other words, a 512 MB instance can process \~3 MB per second, -// while an 8 GB instance can process ~20 MB per second. - APM Server is CPU bound and larger instance types in {ecloud} come with much more computing power. -// so it scales better from 2 GB to 8 GB than it does from 512 MB to 2 GB. Don't forget that the APM Server is stateless. Several instances running do not need to know about each other. @@ -101,14 +84,5 @@ This means that with a properly sized {es} instance, APM Server scales out linea NOTE: RUM deserves special consideration. The RUM agent runs in browsers, and there can be many thousands reporting to an APM Server with very variable network latency. -// [discrete] -// ==== Troubleshoot sizing issues - -// [discrete] -// ===== Scale the APM Server and ES - -// [discrete] -// ===== Apply head based sampling - -// [discrete] -// ===== Apply tail based sampling +If your APM Server cannot be scaled to support the load that you are expecting, consider +decreasing the ingestion volume. Read more in <>. 
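The advice above about decreasing the ingestion volume can be made concrete with a back-of-envelope calculation: estimate the head-based sample rate that would bring the incoming event rate under a server's measured capacity. A minimal sketch in Python; the assumption that ingested event volume scales linearly with the sample rate is a simplification, and the example numbers are illustrative only:

```python
# Estimate the head-based sampling rate needed so that the APM event
# rate fits within a given APM Server capacity. Simplifying assumption:
# ingested event volume scales linearly with the sample rate; actual
# savings depend on which events sampling drops in your setup.

def required_sample_rate(current_events_per_second, capacity_events_per_second):
    if current_events_per_second <= capacity_events_per_second:
        return 1.0  # load already fits; no sampling needed
    return capacity_events_per_second / current_events_per_second

# Example: 60,000 events/s arriving at an instance measured at 40,000 events/s.
print(round(required_sample_rate(60_000, 40_000), 2))  # -> 0.67
```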
From 3adc800be06de714ad092de94f5633c9bb21c0a0 Mon Sep 17 00:00:00 2001 From: Colleen McGinnis Date: Tue, 28 Nov 2023 14:09:16 -0600 Subject: [PATCH 3/3] address more feedback --- docs/processing-performance.asciidoc | 32 +++++++++++++--------------- 1 file changed, 15 insertions(+), 17 deletions(-) diff --git a/docs/processing-performance.asciidoc b/docs/processing-performance.asciidoc index 6ca2556c8d3..27e66e3afc6 100644 --- a/docs/processing-performance.asciidoc +++ b/docs/processing-performance.asciidoc @@ -5,10 +5,6 @@ APM Server performance depends on a number of factors: memory and CPU available, network latency, transaction sizes, workload patterns, agent and server settings, versions, and protocol. -// Assumptions -// TBD - -// Method We tested several scenarios to help you understand how to size the APM Server so that it can keep up with the load that your Elastic APM agents are sending: * Using the default hardware template on AWS, GCP and Azure on {ecloud}. @@ -18,8 +14,12 @@ We tested several scenarios to help you understand how to size the APM Server so <> and <>. -// Results -You can use the results of our tests to guide your sizing decisions, however, performance will vary based on factors unique to your use case like your specific setup, the size of APM event data, and the exact number of agents. +NOTE: You will also need to scale up {es} accordingly, potentially with an increased number of shards configured. +For more details on scaling {es}, refer to the {ref}/scalability.html[{es} documentation]. + +The results below include numbers for a synthetic workload. You can use the results of our tests to guide +your sizing decisions, however, *performance will vary based on factors unique to your use case* like your +specific setup, the size of APM event data, and the exact number of agents. 
:hardbreaks-option: @@ -29,29 +29,29 @@ You can use the results of our tests to guide your sizing decisions, however, pe | *1 GB* (10 agents) -| 9,600 +| 9,000 events/second -| 6,400 +| 6,000 events/second -| 9,600 +| 9,000 events/second | *4 GB* (30 agents) -| 25,500 +| 25,000 events/second | 18,000 events/second -| 17,800 +| 17,000 events/second | *8 GB* (60 agents) -| 40,500 +| 40,000 events/second | 26,000 events/second -| 25,700 +| 25,000 events/second | *16 GB* @@ -60,7 +60,7 @@ events/second events/second | 51,000 events/second -| 45,300 +| 45,000 events/second | *32 GB* @@ -76,13 +76,11 @@ events/second :!hardbreaks-option: -APM Server is CPU bound and larger instance types in {ecloud} come with much more computing power. - Don't forget that the APM Server is stateless. Several instances running do not need to know about each other. This means that with a properly sized {es} instance, APM Server scales out linearly. NOTE: RUM deserves special consideration. The RUM agent runs in browsers, and there can be many thousands reporting to an APM Server with very variable network latency. -If your APM Server cannot be scaled to support the load that you are expecting, consider +Alternatively or in addition to scaling the APM Server, consider decreasing the ingestion volume. Read more in <>.
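Because APM Server is stateless and scales out linearly, a first-pass estimate of horizontal scale is simple division. A hedged sketch, assuming a per-instance throughput you have measured for your own workload (the 40,000 events/second figure and the 20% headroom below are only examples, not recommendations):

```python
import math

def instances_needed(expected_events_per_second,
                     per_instance_events_per_second,
                     headroom=0.2):
    """Estimate how many stateless APM Server instances are needed,
    keeping some headroom. Assumes linear scale-out and an
    appropriately sized Elasticsearch cluster behind it."""
    needed = expected_events_per_second * (1 + headroom)
    return math.ceil(needed / per_instance_events_per_second)

print(instances_needed(100_000, 40_000))  # -> 3
```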