docs: Update performance guide #11969

Merged
merged 6 commits into from
Nov 29, 2023
112 changes: 91 additions & 21 deletions docs/processing-performance.asciidoc
@@ -5,40 +5,110 @@ APM Server performance depends on a number of factors: memory and CPU available,
network latency, transaction sizes, workload patterns,
agent and server settings, versions, and protocol.

Let's look at a simple example that makes the following assumptions:
// Assumptions
// TBD

* The load is generated in the same region as where APM Server and {es} are deployed.
* We're using the default settings in cloud.
* A small number of agents are reporting.
// Method
To help you understand how sizing APM Server impacts performance, we tested several scenarios:

This leaves us with relevant variables like payload and instance sizes.
See the table below for approximations.
As a reminder, events are
* Using the default hardware template on AWS, GCP and Azure on {ecloud}.
* Comparing the performance when using OpenTelemetry (OTel) events on at least one of the service providers listed above (in this case, GCP).
* For each hardware template, testing with several sizes: 1 GB, 4 GB, 8 GB, 16 GB, and 32 GB.
* For each size, using a fixed number of APM agents: 10 agents for 1 GB, 30 agents for 4 GB, 60 agents for 8 GB, 120 agents for 16 GB, and 240 agents for 32 GB.
* In all scenarios, using medium-sized events. Events include
<<data-model-transactions,transactions>> and
<<data-model-spans,spans>>.

// This leaves us with relevant variables like payload and instance sizes.

// Results
You can use the results of our tests to guide your sizing decisions. However, performance will vary based on factors unique to your use case, such as your specific setup, the size of APM event data, and the exact number of agents.

:hardbreaks-option:

[options="header"]
|=======================================================================
|Transaction/Instance |512 MB Instance |2 GB Instance |8 GB Instance
|Small transactions
|====
| Profile / Cloud | AWS | Azure | GCP (Agent) | GCP (OTel)

| *1 GB*
(10 agents)
| 9,600
events/second
| 6,400
events/second
| 9,600
events/second
| 28,000
events/second

| *4 GB*
(30 agents)
| 25,500
events/second
| 18,000
events/second
| 17,800
events/second
| 49,300
events/second

| *8 GB*
(60 agents)
| 40,500
events/second
| 26,000
events/second
| 25,700
events/second
| 89,300
events/second

_5 spans with 5 stack frames each_ |600 events/second |1200 events/second |4800 events/second
|Medium transactions
| *16 GB*
(120 agents)
| 72,000
events/second
| 51,000
events/second
| 45,300
events/second
| 145,000
events/second

_15 spans with 15 stack frames each_ |300 events/second |600 events/second |2400 events/second
|Large transactions
| *32 GB*
(240 agents)
| 135,000
events/second
| 95,000
events/second
| 95,000
events/second
| 381,000
events/second

_30 spans with 30 stack frames each_ |150 events/second |300 events/second |1400 events/second
|=======================================================================
|====

In other words, a 512 MB instance can process \~3 MB per second,
while an 8 GB instance can process ~20 MB per second.
:!hardbreaks-option:
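
To compare the profiles in the table, it can help to normalize throughput by agent count. The following is a minimal sketch, not part of the test methodology: the names `RESULTS`, `AGENTS`, and `events_per_agent` are hypothetical, and the figures are copied from the table as-is.

```python
# Throughput in events/second, copied from the results table.
# Keys of the inner dicts are instance sizes in GB.
RESULTS = {
    "AWS": {1: 9_600, 4: 25_500, 8: 40_500, 16: 72_000, 32: 135_000},
    "Azure": {1: 6_400, 4: 18_000, 8: 26_000, 16: 51_000, 32: 95_000},
    "GCP (Agent)": {1: 9_600, 4: 17_800, 8: 25_700, 16: 45_300, 32: 95_000},
    "GCP (OTel)": {1: 28_000, 4: 49_300, 8: 89_300, 16: 145_000, 32: 381_000},
}

# Number of APM agents used for each instance size (GB) in the tests.
AGENTS = {1: 10, 4: 30, 8: 60, 16: 120, 32: 240}


def events_per_agent(provider: str, size_gb: int) -> float:
    """Return the measured events/second divided by the number of agents."""
    return RESULTS[provider][size_gb] / AGENTS[size_gb]


if __name__ == "__main__":
    for provider, by_size in RESULTS.items():
        for size_gb in by_size:
            rate = events_per_agent(provider, size_gb)
            print(f"{provider:12} {size_gb:>2} GB: {rate:8.1f} events/second per agent")
```

Because the agent count changes with each instance size, these ratios describe the tested configurations only and should not be read as a per-agent limit.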

APM Server is CPU bound, so it scales better from 2 GB to 8 GB than it does from 512 MB to 2 GB.
This is because larger instance types in {ecloud} come with much more computing power.
// In other words, a 512 MB instance can process \~3 MB per second,
// while an 8 GB instance can process ~20 MB per second.

APM Server is CPU bound, and larger instance types in {ecloud} come with much more computing power.
// so it scales better from 2 GB to 8 GB than it does from 512 MB to 2 GB.

Don't forget that the APM Server is stateless.
Several instances running do not need to know about each other.
This means that with a properly sized {es} instance, APM Server scales out linearly.
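
If scale-out is taken to be linear, a rough instance count follows from simple division. The sketch below assumes linear scaling and a properly sized {es} deployment; the function name `instances_needed` and the `headroom` parameter are hypothetical, and the per-instance figure should come from your own measurements.

```python
import math


def instances_needed(target_eps: float, per_instance_eps: float, headroom: float = 0.0) -> int:
    """Estimate how many APM Server instances cover a target throughput.

    Assumes throughput adds up linearly across stateless instances.
    `headroom` is an optional safety margin, e.g. 0.25 for 25% extra capacity.
    """
    if per_instance_eps <= 0:
        raise ValueError("per_instance_eps must be positive")
    return math.ceil(target_eps * (1 + headroom) / per_instance_eps)


if __name__ == "__main__":
    # Example: 100,000 events/second target, 40,500 events/second per instance.
    print(instances_needed(100_000, 40_500))
    print(instances_needed(100_000, 40_500, headroom=0.25))
```

Treat the result as a starting point: actual throughput also depends on event size, network latency, and {es} capacity.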

NOTE: RUM deserves special consideration. The RUM agent runs in browsers, and there can be many thousands reporting to an APM Server with very variable network latency.

// [discrete]
// ==== Troubleshoot sizing issues

// [discrete]
// ===== Scale the APM Server and ES

// [discrete]
// ===== Apply head based sampling

// [discrete]
// ===== Apply tail based sampling