fixes pg replica error: "canceling statement due to conflict with recovery" #2233

vroldanbet · 2025-02-04T18:47:01Z

A customer observed infrequent but persistent errors returned by SpiceDB when using Postgres strict replicas mode.

Searching for the error typically points to a conflict between WAL changes being applied to the replica and in-flight queries on that replica.

Two parameters could be tweaked to reduce the likelihood: max_standby_archive_delay and max_standby_streaming_delay. The Postgres docs also describe what is going on: https://www.postgresql.org/docs/current/hot-standby.html#HOT-STANDBY-CONFLICT

A relevant part from the docs:

On the primary server, these cases simply result in waiting;
and the user might choose to cancel either of the conflicting actions.
However, on the standby there is no choice: the WAL-logged action
already occurred on the primary so the standby must not fail to apply
it. Furthermore, allowing WAL application to wait indefinitely may be
very undesirable, because the standby's state will become increasingly
far behind the primary's. Therefore, a mechanism is provided to forcibly
cancel standby queries that conflict with to-be-applied WAL records.

These are similar to serialization errors; in fact, we observed pgx reporting them as SQL error code 40001.
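As a minimal sketch of how such an error could be told apart from an ordinary serialization failure (this is an illustration, not necessarily what this PR implements): the check below matches on SQLSTATE 40001 plus the "conflict with recovery" text. It assumes the driver error exposes a SQLState() method, as pgx's *pgconn.PgError does, and that its Error() text includes the server message. The names isRecoveryConflict, sqlStateError, and stubPgError are hypothetical.

```go
package main

import (
	"errors"
	"strings"
)

// sqlStateError is a locally declared interface (an assumption of this sketch).
// pgx's *pgconn.PgError satisfies it through its SQLState() method, so no
// third-party import is needed here.
type sqlStateError interface {
	error
	SQLState() string
}

const (
	// SQLSTATE 40001 is serialization_failure, the code observed for this error.
	sqlStateSerializationFailure = "40001"
	// Fragment of the message Postgres emits when a standby cancels a query.
	recoveryConflictMessage = "conflict with recovery"
)

// isRecoveryConflict reports whether err looks like a standby query
// cancellation caused by WAL replay, rather than a regular serialization error.
func isRecoveryConflict(err error) bool {
	var stateErr sqlStateError
	if !errors.As(err, &stateErr) {
		return false
	}
	return stateErr.SQLState() == sqlStateSerializationFailure &&
		strings.Contains(stateErr.Error(), recoveryConflictMessage)
}

// stubPgError is a hypothetical stand-in for a driver error, used only to
// exercise the sketch without a database.
type stubPgError struct {
	code, message string
}

func (e stubPgError) Error() string    { return e.message }
func (e stubPgError) SQLState() string { return e.code }
```

For example, isRecoveryConflict(stubPgError{code: "40001", message: "canceling statement due to conflict with recovery"}) returns true, while the same code with a "could not serialize access" message returns false.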

There are at least two strategies we could take:

- retry the request
- redirect the request to the primary

I don't believe retrying is a good idea here. If I understand correctly, Postgres provides those settings as a grace period for a query to complete before it is forcibly canceled. I suspect retrying could have an undesirable side effect: it increases the odds that WAL application is delayed, which in turn increases replication lag and would make SpiceDB fall back to the primary more often.

Therefore, to avoid adding pressure that could increase the lag, I suggest not retrying and directly falling back to the primary.
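The suggested behavior can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the actual SpiceDB datastore code: queryFunc, readWithFallback, and the errRecoveryConflict sentinel are hypothetical names, and a real implementation would classify the driver error instead of comparing against a sentinel.

```go
package main

import "errors"

// errRecoveryConflict is a hypothetical sentinel standing in for the
// "conflict with recovery" error returned by a replica.
var errRecoveryConflict = errors.New("canceling statement due to conflict with recovery")

// queryFunc runs a read against one database node (hypothetical type).
type queryFunc func() (string, error)

// readWithFallback tries the replica exactly once. On a recovery conflict it
// goes straight to the primary instead of retrying on the replica, so no
// additional replica queries are issued that could further delay WAL
// application.
func readWithFallback(replica, primary queryFunc) (string, error) {
	result, err := replica()
	if err == nil {
		return result, nil
	}
	if errors.Is(err, errRecoveryConflict) {
		return primary()
	}
	// In this sketch, any other error is returned unchanged.
	return "", err
}
```

The key design choice is that the replica is never called a second time for the same request: a conflict costs one primary read instead of an unbounded number of replica retries.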

josephschorr

LGTM

vroldanbet requested a review from a team as a code owner February 4, 2025 18:47

vroldanbet self-assigned this Feb 4, 2025

github-actions bot added the area/datastore (Affects the storage system) and area/dependencies (Affects dependencies) labels Feb 4, 2025

vroldanbet force-pushed the fix-pg-replica-serialization-errors branch from 76bf3e5 to 654e634 Compare February 4, 2025 19:35

github-actions bot added the area/tooling (Affects the dev or user toolchain, e.g. tests, ci, build tools) label Feb 4, 2025

vroldanbet force-pushed the fix-pg-replica-serialization-errors branch from 654e634 to 964526f Compare February 4, 2025 19:38

josephschorr approved these changes Feb 4, 2025

vroldanbet force-pushed the fix-pg-replica-serialization-errors branch from 964526f to 19fdfd7 Compare February 4, 2025 19:44

vroldanbet enabled auto-merge February 4, 2025 19:51

vroldanbet added this pull request to the merge queue Feb 4, 2025

Merged via the queue into main with commit 20638b4 Feb 4, 2025
39 checks passed

vroldanbet deleted the fix-pg-replica-serialization-errors branch February 4, 2025 20:05

github-actions bot locked and limited conversation to collaborators Feb 4, 2025
