
[do not merge] potential fixes for deferred response error handling #7453


Draft
wants to merge 10 commits into
base: dev
from


carodewig
Contributor

This has all unraveled after I pulled an innocuous-looking string. There are two semi-related changes in this PR that should probably be split into separate PRs. But as I'd like some early feedback on my approach to both of them, I'm raising this draft PR.

  • Issue 1: errors from deferred responses aren't propagated out of the execution stage (@defer: expected error not returned in specific case #2329).
  • Issue 2: the on_graphql_error selector doesn't work at the supergraph stage (notably for coprocessors, but I suspect this is also true for logging, etc.).

Issue 1

This arises because of the way we filter errors within split_incremental_response - error_path.starts_with(&path) never returns true (given the error_path and path values I've observed), so all errors are filtered out.

Current state:

error.path = Path([Key("topProducts", None), Flatten(None)])
path = Path([Key("topProducts", None), Index(0)])

error_path.starts_with(&path) => false

I believe the fix for this is to remove the trailing index from path, but I need to test this on more complex deferred queries. My concern is that a path like [Key("top"), Index(0), Key("value"), Index(1)] is possible, and I don't know whether Index(0) will be present in error.path. (f913ee9)
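To make the mismatch concrete, here is a minimal sketch using simplified stand-ins for the path types. The `PathElement` and `Path` definitions, the strict element-wise `starts_with`, and the `without_trailing_index` helper are all assumptions for illustration; the router's real types and matching rules differ.

```rust
// Simplified stand-ins for the router's path types (assumption: the real
// definitions carry extra data and may match elements differently).
#[derive(Debug, Clone, PartialEq)]
enum PathElement {
    Key(String),
    Index(usize),
    Flatten,
}

#[derive(Debug, Clone, PartialEq)]
struct Path(Vec<PathElement>);

impl Path {
    /// Strict element-wise prefix check (assumed semantics): `Flatten`
    /// and `Index(0)` are different elements, so they never match.
    fn starts_with(&self, prefix: &Path) -> bool {
        self.0.starts_with(&prefix.0)
    }

    /// Hypothetical helper: return a copy of the path with a trailing
    /// `Index` element removed, if there is one.
    fn without_trailing_index(&self) -> Path {
        let mut elements = self.0.clone();
        if matches!(elements.last(), Some(PathElement::Index(_))) {
            elements.pop();
        }
        Path(elements)
    }
}
```

With `error_path = [Key("topProducts"), Flatten]` and `path = [Key("topProducts"), Index(0)]`, `error_path.starts_with(&path)` is false, while `error_path.starts_with(&path.without_trailing_index())` is true. Under these assumed semantics, a nested path such as `[Key("top"), Index(0), Key("value"), Index(1)]` still would not match an error path that has `Flatten` in place of `Index(0)`, which is the case that needs testing.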

Issue 2

The on_graphql_error selector currently relies on CONTAINS_GRAPHQL_ERROR within the response context. That value isn't actually set until the telemetry layer of the supergraph stage, so it works for router coprocessors but not for supergraph coprocessors.

I'm generally concerned about using a global context value for on_graphql_error, since once issue 1 is fixed we can surface errors at any point in a deferred response. I'm not yet sure how to handle it at the router stage (where we're dealing with bytes), but at the supergraph stage I think it would be much better to make decisions via response.errors and response.incremental[*].errors rather than relying on CONTAINS_GRAPHQL_ERROR. (a6d801f and 4b7fdcd)
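As a rough sketch of what a per-response check could look like at the supergraph stage: the `Response`, `IncrementalResponse`, and `GraphQLError` structs below are simplified placeholders (not the router's actual types), and `contains_graphql_error` is a hypothetical helper name.

```rust
// Placeholder types, assumed for illustration only.
#[derive(Debug, Default)]
struct GraphQLError {
    message: String,
}

#[derive(Debug, Default)]
struct IncrementalResponse {
    errors: Vec<GraphQLError>,
}

#[derive(Debug, Default)]
struct Response {
    errors: Vec<GraphQLError>,
    incremental: Vec<IncrementalResponse>,
}

/// Hypothetical helper: true if this response chunk carries errors either
/// at the top level or inside any of its incremental entries.
fn contains_graphql_error(response: &Response) -> bool {
    !response.errors.is_empty()
        || response
            .incremental
            .iter()
            .any(|chunk| !chunk.errors.is_empty())
}
```

Because the check looks only at the chunk it is given, it would be evaluated once per chunk of a deferred response rather than once per request, which is what a global context flag cannot express.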

I don't like my tentative solution of always returning true for on_graphql_error in on_response and relying on callers to know to use on_response_event, but I don't have a better idea at the moment.

Bonus Issues / TBD

  • If we make the changes I think we should make for issue 2, the on_graphql_error selector will mean different things at the router and supergraph stages. This will need to be either fixed or thoroughly documented.
  • Telemetry probably doesn't use on_response_event, so it will need to be updated; otherwise every response will be logged when on_graphql_error: true is configured.

Checklist

Complete the checklist (and note appropriate exceptions) before the PR is marked ready-for-review.

  • Changes are compatible [1]
  • Documentation [2] completed
  • Performance impact assessed and acceptable
  • Tests added and passing [3]
    • Unit Tests
    • Integration Tests
    • Manual Tests

Exceptions

Note any exceptions here

Notes

Footnotes

  1. It may be appropriate to bring upcoming changes to the attention of other (impacted) groups. Please endeavour to do this before seeking PR approval. The mechanism for doing this will vary considerably, so use your judgement as to how and when to do this.

  2. Configuration is an important part of many changes. Where applicable please try to document configuration examples.

  3. Tick whichever testing boxes are applicable. If you are adding Manual Tests, please document the manual testing (extensively) in the Exceptions.

Contributor

github-actions bot commented May 9, 2025

@carodewig, please consider creating a changeset entry in /.changesets/. These instructions describe the process and tooling.

Collaborator

svc-apollo-docs commented May 9, 2025

✅ Docs preview ready

The preview is ready to be viewed. View the preview

File Changes

5 new, 30 changed, 17 removed
+ graphos/routing/(latest)/customization/rhai.mdx
+ graphos/routing/(latest)/customization/rhai-reference.mdx
+ graphos/routing/(latest)/operations/index.mdx
+ graphos/routing/(latest)/self-hosted/install.mdx
+ graphos/routing/(latest)/about-router.mdx
* (developer-tools)/apollo-mcp-server/(latest)/command-reference.mdx
* (developer-tools)/apollo-mcp-server/(latest)/quickstart.mdx
* (developer-tools)/apollo-mcp-server/(latest)/guides/index.mdx
* graphos/routing/(latest)/cloud/dedicated.mdx
* graphos/routing/(latest)/customization/coprocessor.mdx
* graphos/routing/(latest)/customization/custom-binary.mdx
* graphos/routing/(latest)/observability/telemetry/instrumentation/standard-instruments.mdx
* graphos/routing/(latest)/observability/telemetry/trace-exporters/datadog.mdx
* graphos/routing/(latest)/observability/telemetry/trace-exporters/dynatrace.mdx
* graphos/routing/(latest)/observability/telemetry/trace-exporters/new-relic.mdx
* graphos/routing/(latest)/observability/telemetry/trace-exporters/jaeger.mdx
* graphos/routing/(latest)/observability/client-id-enforcement.mdx
* graphos/routing/(latest)/observability/debugging-subgraph-requests.mdx
* graphos/routing/(latest)/operations/defer.mdx
* graphos/routing/(latest)/operations/file-upload.mdx
* graphos/routing/(latest)/performance/caching/distributed.mdx
* graphos/routing/(latest)/performance/caching/index.mdx
* graphos/routing/(latest)/query-planning/native-query-planner.mdx
* graphos/routing/(latest)/security/demand-control.mdx
* graphos/routing/(latest)/security/authorization.mdx
* graphos/routing/(latest)/security/tls.mdx
* graphos/routing/(latest)/self-hosted/containerization/index.mdx
* graphos/routing/(latest)/self-hosted/containerization/docker.mdx
* graphos/routing/(latest)/self-hosted/index.mdx
* graphos/routing/(latest)/upgrade/from-router-v1.mdx
* graphos/routing/(latest)/graphos-reporting.mdx
* graphos/routing/(latest)/request-lifecycle.mdx
* graphos/routing/(latest)/about-v2.mdx
* graphos/routing/(latest)/federation-version-support.mdx
* graphos/routing/(latest)/graphos-features.mdx
- graphos/routing/(latest)/configuration/overview.mdx
- graphos/routing/(latest)/configuration/envvars.mdx
- graphos/routing/(latest)/configuration/cli.mdx
- graphos/routing/(latest)/configuration/yaml.mdx
- graphos/routing/(latest)/customization/coprocessor/index.mdx
- graphos/routing/(latest)/customization/coprocessor/reference.mdx
- graphos/routing/(latest)/customization/rhai/index.mdx
- graphos/routing/(latest)/customization/rhai/reference.mdx
- graphos/routing/(latest)/operations/subscriptions/overview.mdx
- graphos/routing/(latest)/operations/subscriptions/configuration.mdx
- graphos/routing/(latest)/security/authorization-overview.mdx
- graphos/routing/(latest)/self-hosted/containerization/gcp.mdx
- graphos/routing/(latest)/self-hosted/containerization/aws.mdx
- graphos/routing/(latest)/self-hosted/containerization/azure.mdx
- graphos/routing/(latest)/changelog.mdx
- graphos/routing/(latest)/get-started.mdx
- graphos/routing/(latest)/license.mdx

Build ID: 32829321deb5d21eafb01224

URL: https://www.apollographql.com/docs/deploy-preview/32829321deb5d21eafb01224

Comment on lines +373 to +378
// Always return `true` for `on_graphql_error` on the `response` selector.
// Each chunk of a response (if a stream) or the full request should also be checked
// with `on_response_event`.
// TODO: this could well be a terrible idea but it's currently my only idea for how
// to do error handling on deferred errors
Some(opentelemetry::Value::Bool(true))
Contributor


This might not work when this selector is used in a condition, for example for instruments or events. I'm not sure, but I'm suspicious.

Contributor


Or maybe you could just ignore this selector in on_response, and also remove it from Stage::Response in the is_active method. I think that could do the job.
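A minimal sketch of that suggestion, under heavy assumptions: the `Stage` and `Selector` enums below are toy models with hypothetical variants and signatures, not the router's actual selector trait. The idea is only that `on_graphql_error` reports itself inactive for `Stage::Response`, yields nothing from `on_response`, and is evaluated per chunk in `on_response_event`.

```rust
// Toy model of stage-gated selectors; names and signatures are assumptions.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Stage {
    Request,
    Response,
    ResponseEvent,
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum Selector {
    OnGraphQLError,
    StaticBool(bool),
}

impl Selector {
    /// `OnGraphQLError` is only active for response events, so conditions
    /// built on it are not evaluated at `Stage::Response`.
    fn is_active(&self, stage: Stage) -> bool {
        match self {
            Selector::OnGraphQLError => matches!(stage, Stage::ResponseEvent),
            Selector::StaticBool(_) => true,
        }
    }

    /// Ignore `OnGraphQLError` here instead of always returning `true`.
    fn on_response(&self) -> Option<bool> {
        match self {
            Selector::OnGraphQLError => None,
            Selector::StaticBool(value) => Some(*value),
        }
    }

    /// Evaluate `OnGraphQLError` against each chunk of the response.
    fn on_response_event(&self, chunk_has_errors: bool) -> Option<bool> {
        match self {
            Selector::OnGraphQLError => Some(chunk_has_errors),
            Selector::StaticBool(value) => Some(*value),
        }
    }
}
```

Returning `None` from `on_response` avoids a condition that is unconditionally true at the response stage, which is the case the previous comment was worried about for instruments and events.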


3 participants