Instrumentation: fix log trace inconsistent status code with timeout check when writing the response #15123

1pkg · 2025-01-04T00:04:44Z

Motivation/summary

This fix is branched from #15117.

This PR replaces TimeoutMiddleware with a proactive check for context timeout before the response result is written. The rationale is that a request timeout only matters if it happens before the response is written. Otherwise the client no longer cares about the response, so it is logical for the server to emit one of the two error signals consistently in self-instrumentation at this stage.

As a result of this change, the outcome depends on when the request timeout happens. If it happens before the response is written, self-instrumentation emits a timeout error transaction with an error log. If it happens after, self-instrumentation emits the original error transaction with an error log.

This is an alternative to the error-chaining PR #15122, which would preserve both error logs. Update: after a brief discussion we agreed to move forward with this option for the fix instead of the more complex error chaining.
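
For readers skimming the thread, here is a minimal sketch of the idea described above: check the request context once, right before the response is written, and let that single decision drive what gets written and reported. The names `result`, `finalize`, `writeResult`, and `statusWritten` are hypothetical and are not the apm-server types or functions; the 400 and 503 status codes are assumptions chosen for illustration.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// result is a hypothetical stand-in for a handler outcome.
// It is NOT the apm-server request.Result type.
type result struct {
	status int
	err    error
}

// finalize picks the result to report. If the request context is already
// done before anything was written, the timeout replaces the original
// result, so every observer sees the same status.
func finalize(ctx context.Context, orig result) result {
	if err := ctx.Err(); err != nil {
		// 503 for a timeout is an assumption made for this sketch.
		return result{status: http.StatusServiceUnavailable, err: err}
	}
	return orig
}

// writeResult writes the finalized status and returns the result that was
// actually sent, so callers can log or trace exactly that value.
func writeResult(ctx context.Context, w http.ResponseWriter, orig result) result {
	res := finalize(ctx, orig)
	w.WriteHeader(res.status)
	return res
}

// statusWritten runs writeResult against a recorder and returns the status
// the client would observe. The original result is a hypothetical 400.
func statusWritten(cancelled bool) int {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	if cancelled {
		cancel() // simulate the request context ending before the write
	}
	rec := httptest.NewRecorder()
	writeResult(ctx, rec, result{status: http.StatusBadRequest})
	return rec.Code
}

func main() {
	fmt.Println(statusWritten(false)) // 400: original error is reported
	fmt.Println(statusWritten(true))  // 503: timeout replaces the original error
}
```

Because the decision is made in one place, there is no later middleware that can rewrite the status after another observer has already recorded it.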

Checklist

Update CHANGELOG.asciidoc
Documentation has been updated

For functional changes, consider:

Is it observable through the addition of either logging or metrics?
Is its use being published in telemetry to enable product improvement?
Have system tests been added to avoid regression?

How to test these changes

This PR includes a unit test that simulates the condition behind the issue. To reproduce against a real instance of APM Server, follow the recipe from this comment: #14232 (comment).

Related issues

#15122
#15117
Fixes #14232

mergify · 2025-01-04T00:05:18Z

This pull request does not have a backport label. Could you fix it @1pkg? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

backport-7.17 is the label to automatically backport to the 7.17 branch.
backport-8./d is the label to automatically backport to the 8./d branch. /d is the digit.
backport-8.x is the label to automatically backport to the 8.x branch.

mergify · 2025-01-04T00:05:19Z

backport-8.x has been added to help with the transition to the new branch 8.x.
If you don't need it please use backport-skip label.

carsonip

I like the simplicity of this fix. Then I realized we may also have been tracking the wrong status code in MonitoringMiddleware, since the timeout middleware is after the monitoring middleware. We should also add assertions in the same test, TestWrapServerAPMInstrumentationTimeout.

I'm still thinking about whether not logging a body write timeout is a deal breaker, since this PR would hide body write timeouts in log_middleware. I don't mind an inconsistent status code between log and trace when it is a body write timeout.
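
To illustrate the middleware-ordering concern raised in this comment, here is a hypothetical sketch of one way such a mismatch can arise. It does not reproduce the actual apm-server middleware chain; `reqCtx`, `monitoring`, `timeout`, `chain`, and `run` are invented for illustration, and the 400 and 503 values are assumptions.

```go
package main

import "fmt"

// reqCtx is hypothetical per-request state shared by all middleware.
type reqCtx struct {
	status   int
	timedOut bool
}

type handler func(*reqCtx)
type middleware func(handler) handler

// monitoring records the status it sees once the wrapped handler returns.
func monitoring(seen *[]int) middleware {
	return func(next handler) handler {
		return func(c *reqCtx) {
			next(c)
			*seen = append(*seen, c.status)
		}
	}
}

// timeout rewrites the status to 503 after the wrapped handler returns,
// if the request timed out.
func timeout() middleware {
	return func(next handler) handler {
		return func(c *reqCtx) {
			next(c)
			if c.timedOut {
				c.status = 503
			}
		}
	}
}

// chain applies middleware so that the first element is outermost.
func chain(h handler, mws ...middleware) handler {
	for i := len(mws) - 1; i >= 0; i-- {
		h = mws[i](h)
	}
	return h
}

// run returns the status that monitoring recorded and the final status.
func run(timeoutOutermost bool) (recorded, final int) {
	var seen []int
	h := handler(func(c *reqCtx) { c.status = 400; c.timedOut = true })
	mws := []middleware{monitoring(&seen), timeout()}
	if timeoutOutermost {
		mws = []middleware{timeout(), monitoring(&seen)}
	}
	c := &reqCtx{}
	chain(h, mws...)(c)
	return seen[0], c.status
}

func main() {
	fmt.Println(run(true))  // 400 503: monitoring recorded a stale status
	fmt.Println(run(false)) // 503 503: monitoring saw the rewritten status
}
```

When the status is rewritten by a layer that runs after monitoring has already recorded it, the metric and the final response disagree. Deciding the status once, before the write, avoids depending on the ordering at all.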

internal/beater/server_test.go

carsonip · 2025-01-06T16:18:21Z

Then I realized we may also have been tracking the wrong status code in MonitoringMiddleware, since the timeout middleware is after the monitoring middleware. We should also add assertions in the same test, TestWrapServerAPMInstrumentationTimeout.

After banging my head against the wall for a bit, here you go:

 
+       monitoringtest.ClearRegistry(intake.MonitoringMap)
+
        req, err := http.NewRequestWithContext(reqCtx, http.MethodPost, srv.URL+api.IntakePath, bytes.NewReader(testData))
        require.NoError(t, err)
        req.Header.Add("Content-Type", "application/x-ndjson")
@@ -622,6 +627,15 @@ func TestWrapServerAPMInstrumentationTimeout(t *testing.T) {
        case <-found:
 
        }
+
+       equal, result := monitoringtest.CompareMonitoringInt(map[request.ResultID]int{
+               request.IDRequestCount:          2,
+               request.IDResponseCount:         2,
+               request.IDResponseErrorsCount:   1,
+               request.IDResponseErrorsTimeout: 1, // test data POST /intake/v2/events
+               request.IDResponseValidAccepted: 1, // self-instrumentation
+       }, intake.MonitoringMap)
+       assert.True(t, equal, result)
 }

carsonip

At a high level this is fine. The edge case about timeout during writing body can actually be handled on line 229 in c.errOnWrite(err), where log middleware can log the timeout, but I don't think we can correct tracing at that point. After all, it is kind of debatable how we should handle write timeouts. If an original response status code is 200, and you have successfully written that to the socket, which means the client actually receives 200, but then times out during the body, does it mean the logged / traced status code should then become 5xx? I'm not sure.

I believe in this PR we should focus on fixing the common case where context is canceled before writing status code, and ensuring logged, traced, and monitored status codes are consistent in this case.
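
The write-timeout edge case discussed above can be sketched as follows, under stated assumptions. `failingBody`, `respond`, and `clientStatusAfterBodyFailure` are hypothetical helpers, and the body write failure is simulated rather than caused by a real timeout. The sketch relies on `httptest.ResponseRecorder` keeping the first status code passed to `WriteHeader`, which mirrors the fact that a status already sent to the client cannot be changed.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// failingBody wraps a ResponseWriter and fails every body write,
// simulating a timeout that hits while the body is being sent.
type failingBody struct {
	http.ResponseWriter
}

func (f failingBody) Write(p []byte) (int, error) {
	return 0, errors.New("simulated body write timeout")
}

// respond writes a 200 status and then the body. On a body write error it
// tries to switch to 503, but the status has already been written.
func respond(w http.ResponseWriter) error {
	w.WriteHeader(http.StatusOK)
	if _, err := w.Write([]byte("{}")); err != nil {
		w.WriteHeader(http.StatusServiceUnavailable) // too late: ignored
		return err
	}
	return nil
}

// clientStatusAfterBodyFailure reports the status the client would see and
// whether the body write failed.
func clientStatusAfterBodyFailure() (int, bool) {
	rec := httptest.NewRecorder()
	err := respond(failingBody{rec})
	return rec.Code, err != nil
}

func main() {
	code, failed := clientStatusAfterBodyFailure()
	fmt.Println(code, failed) // 200 true
}
```

The client-visible status stays 200 even though the body write failed, which is why it is debatable whether the logged or traced status should become 5xx in that case.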

1pkg · 2025-01-06T21:56:00Z

Excellent suggestion @carsonip, I updated the tests to validate metrics middleware results per your comment.

carsonip

lgtm, thanks. Great clean fix. Can you update the PR description testing section please?

internal/beater/server_test.go

inge4pres

👍🏼 thanks for working on it and fixing it folks 🙏🏼

…hen dealing with request timeout (#15123) Replace TimeoutMiddleware with direct check for request cancelation when writing the response. To prevent events inconsistency in self instrumentation. --------- Co-authored-by: Carson Ip <[email protected]> Co-authored-by: Carson Ip <[email protected]> (cherry picked from commit f2b3894)

…hen dealing with request timeout (#15123) (#15164) Replace TimeoutMiddleware with direct check for request cancelation when writing the response. To prevent events inconsistency in self instrumentation. --------- Co-authored-by: Carson Ip <[email protected]> Co-authored-by: Carson Ip <[email protected]> (cherry picked from commit f2b3894) Co-authored-by: Kostiantyn Masliuk <[email protected]>

…hen dealing with request timeout (#15123) (#15163) Replace TimeoutMiddleware with direct check for request cancelation when writing the response. To prevent events inconsistency in self instrumentation. --------- Co-authored-by: Carson Ip <[email protected]> Co-authored-by: Carson Ip <[email protected]> (cherry picked from commit f2b3894) Co-authored-by: Kostiantyn Masliuk <[email protected]>

…hen dealing with request timeout (#15123) (#15165) Replace TimeoutMiddleware with direct check for request cancelation when writing the response. To prevent events inconsistency in self instrumentation. --------- Co-authored-by: Carson Ip <[email protected]> Co-authored-by: Carson Ip <[email protected]> (cherry picked from commit f2b3894) Co-authored-by: Kostiantyn Masliuk <[email protected]>

carsonip and others added 3 commits January 3, 2025 19:04

Add test to demonstrate issue

4737a6f

Merge branch 'main' into fix-log-trace-inconsistent-status-code

b2eac66

self instrumentation: replace timeout middleware with writeResult check

39076d7

1pkg self-assigned this Jan 4, 2025

1pkg requested a review from a team as a code owner January 4, 2025 00:04

mergify bot added the backport-8.x Automated backport to the 8.x branch with mergify label Jan 4, 2025

Merge branch 'main' into fix-log-trace-inconsistent-status-code-timeout

0929040

1pkg mentioned this pull request Jan 4, 2025

Instrumentation: fix log trace inconsistent status code with error result chaining #15122

Closed

2 tasks

lint: make fmt update

84ae2e9

1pkg added backport-8.16 Automated backport with mergify backport-8.17 Automated backport with mergify labels Jan 4, 2025

link: fix check-approvals

a31f905

1pkg assigned carsonip and inge4pres Jan 4, 2025

1pkg mentioned this pull request Jan 4, 2025

Fix incorrect timeout status code in trace produced by self instrumentation #15117

Closed

2 tasks

carsonip reviewed Jan 6, 2025

internal/beater/server_test.go (outdated, resolved)

carsonip mentioned this pull request Jan 6, 2025

test: Fix missing assertions for default result IDs in monitoringtest #15135

Merged

2 tasks

carsonip reviewed Jan 6, 2025

1pkg added 2 commits January 6, 2025 13:06

Merge branch 'main' into fix-log-trace-inconsistent-status-code-timeout

883fa6c

self instrumentation: update TestWrapServerAPMInstrumentationTimeout to validate metrics middleware

afa0f87

1pkg requested a review from carsonip January 6, 2025 21:56

carsonip previously approved these changes Jan 7, 2025

1pkg enabled auto-merge (squash) January 7, 2025 00:43

1pkg disabled auto-merge January 7, 2025 01:46

Merge branch 'fix-log-trace-inconsistent-status-code-timeout' of github.com:1pkg/apm-server into fix-log-trace-inconsistent-status-code-timeout

677069e

1pkg force-pushed the fix-log-trace-inconsistent-status-code-timeout branch from af85818 to 677069e Compare January 7, 2025 01:47

kruskall reviewed Jan 7, 2025

internal/beater/server_test.go (outdated, resolved)

inge4pres previously approved these changes Jan 7, 2025

self instumentation: override ELASTIC_APM_API_REQUEST_TIME value to speed up the timeout test

c2bfe1d

1pkg dismissed stale reviews from inge4pres and carsonip via c2bfe1d January 7, 2025 17:42

Merge branch 'main' into fix-log-trace-inconsistent-status-code-timeout

0c8e25b

1pkg requested review from kruskall, carsonip and inge4pres January 7, 2025 17:44

carsonip approved these changes Jan 7, 2025

Merge branch 'main' into fix-log-trace-inconsistent-status-code-timeout

094d220

1pkg enabled auto-merge (squash) January 7, 2025 18:32

1pkg merged commit f2b3894 into elastic:main Jan 7, 2025
11 checks passed

mergify bot mentioned this pull request Jan 7, 2025

[8.x] Instrumentation: fix log trace inconsistent status code with timeout check when writing the response (backport #15123) #15163

Merged

2 tasks

1pkg deleted the fix-log-trace-inconsistent-status-code-timeout branch January 7, 2025 19:06

1pkg mentioned this pull request Jan 7, 2025

Intake v2 request timeout causes inconsistent log and self-instrumented trace #14232

Closed

This was referenced Jan 13, 2025

APM Server 8.16.3 test plan #15225

Closed

APM Server 8.17.1 test plan #15226

Closed
