Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Instrumentation: fix log trace inconsistent status code with timeout check when writing the response #15123

Merged

Conversation

1pkg
Copy link
Member

@1pkg 1pkg commented Jan 4, 2025

Motivation/summary

This fix is branched from #15117.

This PR replaces TimeoutMiddleware with proactively checking for context timeout before writing the response result. The rationale behind this change is that request timeout only ever matters if it happened before the response was written. Otherwise, the client won't care about the response anyway and it's logical for the server to emit either of two error signals consistently in self instrumentation at this stage.

As a result of this changes depending if request timeout happened before or after response was written the self instrumentation will either emit original error transaction with an error log or timeout error transaction with an error log.

This is alternative to errors chaining PR #15122 which will preserve both error logs. Update: After some brief discussion we agreed to move forward with this option for the fix instead of more complex error chaining.

Checklist

For functional changes, consider:

  • Is it observable through the addition of either logging or metrics?
  • Is its use being published in telemetry to enable product improvement?
  • Have system tests been added to avoid regression?

How to test these changes

This PR includes a unit test that encapsulates the condition to simulate the issue. In order to reproduce against a real instance of APM Server follow the recipe from this comment #14232 (comment).

Related issues

#15122
#15117
Fixes #14232

@1pkg 1pkg self-assigned this Jan 4, 2025
@1pkg 1pkg requested a review from a team as a code owner January 4, 2025 00:04
Copy link
Contributor

mergify bot commented Jan 4, 2025

This pull request does not have a backport label. Could you fix it @1pkg? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-7.17 is the label to automatically backport to the 7.17 branch.
  • backport-8./d is the label to automatically backport to the 8./d branch. /d is the digit.
  • backport-8.x is the label to automatically backport to the 8.x branch.

Copy link
Contributor

mergify bot commented Jan 4, 2025

backport-8.x has been added to help with the transition to the new branch 8.x.
If you don't need it please use backport-skip label.

@mergify mergify bot added the backport-8.x Automated backport to the 8.x branch with mergify label Jan 4, 2025
@1pkg 1pkg added backport-8.16 Automated backport with mergify backport-8.17 Automated backport with mergify labels Jan 4, 2025
Copy link
Member

@carsonip carsonip left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the simplicity of this fix. Then I realize we may also be tracking the wrong status code in MonitoringMiddleware in the past as timeout middleware is after monitoring middleware. We should also add assertions in the same test TestWrapServerAPMInstrumentationTimeout.

I'm still thinking whether not logging a body write timeout is a deal breaker, as this PR would hide body write timeouts in log_middleware. I don't mind inconsistent status code between log and trace when it is a body write timeout.

internal/beater/server_test.go Outdated Show resolved Hide resolved
@carsonip
Copy link
Member

carsonip commented Jan 6, 2025

Then I realize we may also be tracking the wrong status code in MonitoringMiddleware in the past as timeout middleware is after monitoring middleware. We should also add assertions in the same test TestWrapServerAPMInstrumentationTimeout.

After banging my head against the wall for a bit, here you go:

 
+       monitoringtest.ClearRegistry(intake.MonitoringMap)
+
        req, err := http.NewRequestWithContext(reqCtx, http.MethodPost, srv.URL+api.IntakePath, bytes.NewReader(testData))
        require.NoError(t, err)
        req.Header.Add("Content-Type", "application/x-ndjson")
@@ -622,6 +627,15 @@ func TestWrapServerAPMInstrumentationTimeout(t *testing.T) {
        case <-found:
 
        }
+
+       equal, result := monitoringtest.CompareMonitoringInt(map[request.ResultID]int{
+               request.IDRequestCount:          2,
+               request.IDResponseCount:         2,
+               request.IDResponseErrorsCount:   1,
+               request.IDResponseErrorsTimeout: 1, // test data POST /intake/v2/events
+               request.IDResponseValidAccepted: 1, // self-instrumentation
+       }, intake.MonitoringMap)
+       assert.True(t, equal, result)
 }
 

Copy link
Member

@carsonip carsonip left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At a high level this is fine. The edge case about timeout during writing body can actually be handled on line 229 in c.errOnWrite(err), where log middleware can log the timeout, but I don't think we can correct tracing at that point. After all, it is kind of debatable how we should handle write timeouts. If an original response status code is 200, and you have successfully written that to the socket, which means the client actually receives 200, but then times out during the body, does it mean the logged / traced status code should then become 5xx? I'm not sure.

I believe in this PR we should focus on fixing the common case where context is canceled before writing status code, and ensuring logged, traced, and monitored status codes are consistent in this case.

@1pkg
Copy link
Member Author

1pkg commented Jan 6, 2025

Then I realize we may also be tracking the wrong status code in MonitoringMiddleware in the past as timeout middleware is after monitoring middleware. We should also add assertions in the same test TestWrapServerAPMInstrumentationTimeout.

After banging my head against the wall for a bit, here you go:

 
+       monitoringtest.ClearRegistry(intake.MonitoringMap)
+
        req, err := http.NewRequestWithContext(reqCtx, http.MethodPost, srv.URL+api.IntakePath, bytes.NewReader(testData))
        require.NoError(t, err)
        req.Header.Add("Content-Type", "application/x-ndjson")
@@ -622,6 +627,15 @@ func TestWrapServerAPMInstrumentationTimeout(t *testing.T) {
        case <-found:
 
        }
+
+       equal, result := monitoringtest.CompareMonitoringInt(map[request.ResultID]int{
+               request.IDRequestCount:          2,
+               request.IDResponseCount:         2,
+               request.IDResponseErrorsCount:   1,
+               request.IDResponseErrorsTimeout: 1, // test data POST /intake/v2/events
+               request.IDResponseValidAccepted: 1, // self-instrumentation
+       }, intake.MonitoringMap)
+       assert.True(t, equal, result)
 }
 

Excellent suggestion @carsonip, I updated the tests to validate metrics middleware results per your comment.

@1pkg 1pkg requested a review from carsonip January 6, 2025 21:56
carsonip
carsonip previously approved these changes Jan 7, 2025
Copy link
Member

@carsonip carsonip left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, thanks. Great clean fix. Can you update the PR description testing section please?

@1pkg 1pkg enabled auto-merge (squash) January 7, 2025 00:43
@1pkg 1pkg disabled auto-merge January 7, 2025 01:46
…ub.com:1pkg/apm-server into fix-log-trace-inconsistent-status-code-timeout
@1pkg 1pkg force-pushed the fix-log-trace-inconsistent-status-code-timeout branch from af85818 to 677069e Compare January 7, 2025 01:47
inge4pres
inge4pres previously approved these changes Jan 7, 2025
Copy link
Contributor

@inge4pres inge4pres left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍🏼 thanks for working on it and fixing it folks 🙏🏼

@1pkg 1pkg dismissed stale reviews from inge4pres and carsonip via c2bfe1d January 7, 2025 17:42
@1pkg 1pkg enabled auto-merge (squash) January 7, 2025 18:32
@1pkg 1pkg merged commit f2b3894 into elastic:main Jan 7, 2025
11 checks passed
mergify bot pushed a commit that referenced this pull request Jan 7, 2025
…hen dealing with request timeout (#15123)

Replace TimeoutMiddleware with direct check for request cancelation when writing the response. To prevent events inconsistency in self instrumentation.

---------

Co-authored-by: Carson Ip <[email protected]>
Co-authored-by: Carson Ip <[email protected]>
(cherry picked from commit f2b3894)
mergify bot pushed a commit that referenced this pull request Jan 7, 2025
…hen dealing with request timeout (#15123)

Replace TimeoutMiddleware with direct check for request cancelation when writing the response. To prevent events inconsistency in self instrumentation.

---------

Co-authored-by: Carson Ip <[email protected]>
Co-authored-by: Carson Ip <[email protected]>
(cherry picked from commit f2b3894)
mergify bot pushed a commit that referenced this pull request Jan 7, 2025
…hen dealing with request timeout (#15123)

Replace TimeoutMiddleware with direct check for request cancelation when writing the response. To prevent events inconsistency in self instrumentation.

---------

Co-authored-by: Carson Ip <[email protected]>
Co-authored-by: Carson Ip <[email protected]>
(cherry picked from commit f2b3894)
@1pkg 1pkg deleted the fix-log-trace-inconsistent-status-code-timeout branch January 7, 2025 19:06
mergify bot added a commit that referenced this pull request Jan 7, 2025
…hen dealing with request timeout (#15123) (#15164)

Replace TimeoutMiddleware with direct check for request cancelation when writing the response. To prevent events inconsistency in self instrumentation.

---------

Co-authored-by: Carson Ip <[email protected]>
Co-authored-by: Carson Ip <[email protected]>
(cherry picked from commit f2b3894)

Co-authored-by: Kostiantyn Masliuk <[email protected]>
mergify bot added a commit that referenced this pull request Jan 7, 2025
…hen dealing with request timeout (#15123) (#15163)

Replace TimeoutMiddleware with direct check for request cancelation when writing the response. To prevent events inconsistency in self instrumentation.

---------

Co-authored-by: Carson Ip <[email protected]>
Co-authored-by: Carson Ip <[email protected]>
(cherry picked from commit f2b3894)

Co-authored-by: Kostiantyn Masliuk <[email protected]>
mergify bot added a commit that referenced this pull request Jan 7, 2025
…hen dealing with request timeout (#15123) (#15165)

Replace TimeoutMiddleware with direct check for request cancelation when writing the response. To prevent events inconsistency in self instrumentation.

---------

Co-authored-by: Carson Ip <[email protected]>
Co-authored-by: Carson Ip <[email protected]>
(cherry picked from commit f2b3894)

Co-authored-by: Kostiantyn Masliuk <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-8.x Automated backport to the 8.x branch with mergify backport-8.16 Automated backport with mergify backport-8.17 Automated backport with mergify
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Intake v2 request timeout causes inconsistent log and self-instrumented trace
4 participants