fixes pg replica error: "canceling statement due to conflict with recovery" #2233

Merged
merged 1 commit into from
Feb 4, 2025

Conversation

vroldanbet
Contributor

A customer observed infrequent but persistent errors returned by SpiceDB when using Postgres strict replicas mode.

Searching for the error typically points to a conflict between WAL changes being applied on the replica and queries that are still in flight there.

Two parameters can be tuned to reduce the likelihood: max_standby_archive_delay and max_standby_streaming_delay. The Postgres docs also describe what is going on in https://www.postgresql.org/docs/current/hot-standby.html#HOT-STANDBY-CONFLICT

A relevant part from the docs:

On the primary server, these cases simply result in waiting;
and the user might choose to cancel either of the conflicting actions.
However, on the standby there is no choice: the WAL-logged action
already occurred on the primary so the standby must not fail to apply
it. Furthermore, allowing WAL application to wait indefinitely may be
very undesirable, because the standby's state will become increasingly
far behind the primary's. Therefore, a mechanism is provided to forcibly
cancel standby queries that conflict with to-be-applied WAL records.

These are similar to serialization errors; in fact, we observed pgx reporting them with SQLSTATE 40001 (serialization_failure).

There are at least two strategies we could take:

  • retry the request
  • redirect the request to the primary

I don't believe retrying is a good idea here. If I understand correctly, those parameters give a replica query a grace period to complete before Postgres forcibly cancels it. I suspect retrying on the replica could have an undesirable side effect: each retried query can again delay WAL application, which increases replication lag and in turn makes SpiceDB fall back to the primary more often.

Therefore, to avoid adding pressure that could increase the lag, I suggest not retrying and directly falling back to the primary.

@vroldanbet vroldanbet requested a review from a team as a code owner February 4, 2025 18:47
@vroldanbet vroldanbet self-assigned this Feb 4, 2025
@github-actions github-actions bot added area/datastore Affects the storage system area/dependencies Affects dependencies labels Feb 4, 2025
@vroldanbet vroldanbet force-pushed the fix-pg-replica-serialization-errors branch from 76bf3e5 to 654e634 Compare February 4, 2025 19:35
@github-actions github-actions bot added the area/tooling Affects the dev or user toolchain (e.g. tests, ci, build tools) label Feb 4, 2025
@vroldanbet vroldanbet force-pushed the fix-pg-replica-serialization-errors branch from 654e634 to 964526f Compare February 4, 2025 19:38
Member

@josephschorr josephschorr left a comment

LGTM

@vroldanbet vroldanbet force-pushed the fix-pg-replica-serialization-errors branch from 964526f to 19fdfd7 Compare February 4, 2025 19:44
@vroldanbet vroldanbet enabled auto-merge February 4, 2025 19:51
@vroldanbet vroldanbet added this pull request to the merge queue Feb 4, 2025
Merged via the queue into main with commit 20638b4 Feb 4, 2025
39 checks passed
@vroldanbet vroldanbet deleted the fix-pg-replica-serialization-errors branch February 4, 2025 20:05
@github-actions github-actions bot locked and limited conversation to collaborators Feb 4, 2025
Labels
area/datastore Affects the storage system area/dependencies Affects dependencies area/tooling Affects the dev or user toolchain (e.g. tests, ci, build tools)

2 participants