-
Notifications
You must be signed in to change notification settings - Fork 528
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Instrumentation: fix log trace inconsistent status code with timeout check when writing the response #15123
Instrumentation: fix log trace inconsistent status code with timeout check when writing the response #15123
Conversation
This pull request does not have a backport label. Could you fix it @1pkg? 🙏
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the simplicity of this fix. Then I realize we may also be tracking the wrong status code in MonitoringMiddleware in the past as timeout middleware is after monitoring middleware. We should also add assertions in the same test TestWrapServerAPMInstrumentationTimeout
.
I'm still thinking whether not logging a body write timeout is a deal breaker, as this PR would hide body write timeouts in log_middleware. I don't mind inconsistent status code between log and trace when it is a body write timeout.
After banging my head against the wall for a bit, here you go:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At a high level this is fine. The edge case about timeout during writing body can actually be handled on line 229 in c.errOnWrite(err)
, where log middleware can log the timeout, but I don't think we can correct tracing at that point. After all, it is kind of debatable how we should handle write timeouts. If an original response status code is 200, and you have successfully written that to the socket, which means the client actually receives 200, but then times out during the body, does it mean the logged / traced status code should then become 5xx? I'm not sure.
I believe in this PR we should focus on fixing the common case where context is canceled before writing status code, and ensuring logged, traced, and monitored status codes are consistent in this case.
Excellent suggestion @carsonip, I updated the tests to validate metrics middleware results per your comment. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm, thanks. Great clean fix. Can you update the PR description testing section please?
…ub.com:1pkg/apm-server into fix-log-trace-inconsistent-status-code-timeout
af85818
to
677069e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍🏼 thanks for working on it and fixing it folks 🙏🏼
speed up the timeout test
…hen dealing with request timeout (#15123) Replace TimeoutMiddleware with direct check for request cancelation when writing the response. To prevent events inconsistency in self instrumentation. --------- Co-authored-by: Carson Ip <[email protected]> Co-authored-by: Carson Ip <[email protected]> (cherry picked from commit f2b3894)
…hen dealing with request timeout (#15123) Replace TimeoutMiddleware with direct check for request cancelation when writing the response. To prevent events inconsistency in self instrumentation. --------- Co-authored-by: Carson Ip <[email protected]> Co-authored-by: Carson Ip <[email protected]> (cherry picked from commit f2b3894)
…hen dealing with request timeout (#15123) Replace TimeoutMiddleware with direct check for request cancelation when writing the response. To prevent events inconsistency in self instrumentation. --------- Co-authored-by: Carson Ip <[email protected]> Co-authored-by: Carson Ip <[email protected]> (cherry picked from commit f2b3894)
…hen dealing with request timeout (#15123) (#15164) Replace TimeoutMiddleware with direct check for request cancelation when writing the response. To prevent events inconsistency in self instrumentation. --------- Co-authored-by: Carson Ip <[email protected]> Co-authored-by: Carson Ip <[email protected]> (cherry picked from commit f2b3894) Co-authored-by: Kostiantyn Masliuk <[email protected]>
…hen dealing with request timeout (#15123) (#15163) Replace TimeoutMiddleware with direct check for request cancelation when writing the response. To prevent events inconsistency in self instrumentation. --------- Co-authored-by: Carson Ip <[email protected]> Co-authored-by: Carson Ip <[email protected]> (cherry picked from commit f2b3894) Co-authored-by: Kostiantyn Masliuk <[email protected]>
…hen dealing with request timeout (#15123) (#15165) Replace TimeoutMiddleware with direct check for request cancelation when writing the response. To prevent events inconsistency in self instrumentation. --------- Co-authored-by: Carson Ip <[email protected]> Co-authored-by: Carson Ip <[email protected]> (cherry picked from commit f2b3894) Co-authored-by: Kostiantyn Masliuk <[email protected]>
Motivation/summary
This fix is branched from #15117.
This PR replaces
TimeoutMiddleware
with proactively checking for context timeout before writing the response result. The rationale behind this change is that request timeout only ever matters if it happened before the response was written. Otherwise, the client won't care about the response anyway and it's logical for the server to emit either of two error signals consistently in self instrumentation at this stage.As a result of this changes depending if request timeout happened before or after response was written the self instrumentation will either emit original error transaction with an error log or timeout error transaction with an error log.
This is alternative to errors chaining PR #15122 which will preserve both error logs. Update: After some brief discussion we agreed to move forward with this option for the fix instead of more complex error chaining.
Checklist
For functional changes, consider:
How to test these changes
This PR includes a unit test that encapsulates the condition to simulate the issue. In order to reproduce against a real instance of APM Server follow the recipe from this comment #14232 (comment).
Related issues
#15122
#15117
Fixes #14232