fixes pg replica error: "canceling statement due to conflict with recovery" #2233
A customer observed infrequent but persistent errors returned by SpiceDB when using Postgres strict replicas mode.
Searching for the error typically points to a conflict between WAL changes being applied on the replica and queries that are still in flight on it.
Two parameters can be tuned to reduce the likelihood: max_standby_archive_delay and max_standby_streaming_delay. The Postgres docs also describe what is going on: https://www.postgresql.org/docs/current/hot-standby.html#HOT-STANDBY-CONFLICT
A relevant part from the docs:
These are similar to serialization errors; in fact, we observed pgx reporting them as SQL error code 40001.
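To illustrate how such an error could be told apart from other failures, here is a minimal sketch rather than the code in this PR. It assumes the driver error exposes a `SQLState()` method (pgx's `*pgconn.PgError` does) and that the server message contains the English text "conflict with recovery". The `sqlStater` interface, `isRecoveryConflict`, and `fakePgError` are hypothetical names used only for this example.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// sqlStater is a hypothetical local interface. pgx's *pgconn.PgError
// satisfies it because it has an SQLState() method.
type sqlStater interface {
	error
	SQLState() string
}

// SQLSTATE 40001 is serialization_failure, the code observed for this error.
const sqlStateSerializationFailure = "40001"

// isRecoveryConflict reports whether err looks like a replica query that was
// canceled by WAL replay. Matching on the message text is an assumption: it
// depends on the server emitting English messages.
func isRecoveryConflict(err error) bool {
	var se sqlStater
	if !errors.As(err, &se) {
		return false
	}
	return se.SQLState() == sqlStateSerializationFailure &&
		strings.Contains(se.Error(), "conflict with recovery")
}

// fakePgError stands in for a driver error so the example runs without a database.
type fakePgError struct {
	code string
	msg  string
}

func (e *fakePgError) Error() string    { return e.msg }
func (e *fakePgError) SQLState() string { return e.code }

func main() {
	conflict := fmt.Errorf("read failed: %w", &fakePgError{
		code: "40001",
		msg:  "canceling statement due to conflict with recovery",
	})
	plain := &fakePgError{code: "40001", msg: "could not serialize access"}

	fmt.Println(isRecoveryConflict(conflict))
	fmt.Println(isRecoveryConflict(plain))
}
```

Checking the message in addition to the code keeps genuine serialization failures, which share SQLSTATE 40001, from being treated as recovery conflicts.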
There are at least two strategies we could take:
I don't believe retrying is a good idea here. If I understood correctly, Postgres gives you those flags as a grace period for a query to complete before being forcefully canceled. I suspect retrying could have an undesirable side-effect: it increases the odds that WAL changes are delayed before application, which in turn increases lag and would make SpiceDB fall back more often to the primary.
Therefore, to avoid adding pressure that could increase the lag, I suggest not retrying and directly falling back to the primary.
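The no-retry fallback described above could be sketched roughly as follows. This is an illustrative outline under stated assumptions, not SpiceDB's actual datastore code: `readFunc`, `readWithFallback`, `simulate`, and the sentinel errors are hypothetical names, and the replica and primary are simulated with plain functions.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// Hypothetical sentinel errors used to simulate driver failures.
var (
	errRecoveryConflict = errors.New("canceling statement due to conflict with recovery")
	errUnrelated        = errors.New("connection refused")
)

// readFunc is a hypothetical stand-in for a read against one Postgres node.
type readFunc func(ctx context.Context) (string, error)

// readWithFallback tries the replica exactly once. On a recovery conflict it
// goes straight to the primary instead of retrying the replica, so no extra
// query is added that could further delay WAL application.
func readWithFallback(ctx context.Context, replica, primary readFunc) (string, error) {
	res, err := replica(ctx)
	if err == nil {
		return res, nil
	}
	if errors.Is(err, errRecoveryConflict) {
		return primary(ctx)
	}
	// Any other error is returned to the caller unchanged.
	return "", err
}

// simulate runs one read where the replica fails with replicaErr (nil means
// success) and reports how many times each node was called.
func simulate(replicaErr error) (result string, replicaCalls, primaryCalls int, err error) {
	replica := func(ctx context.Context) (string, error) {
		replicaCalls++
		if replicaErr != nil {
			return "", fmt.Errorf("replica read: %w", replicaErr)
		}
		return "from replica", nil
	}
	primary := func(ctx context.Context) (string, error) {
		primaryCalls++
		return "from primary", nil
	}
	result, err = readWithFallback(context.Background(), replica, primary)
	return result, replicaCalls, primaryCalls, err
}

func main() {
	for _, e := range []error{nil, errRecoveryConflict, errUnrelated} {
		res, r, p, err := simulate(e)
		fmt.Printf("result=%q replicaCalls=%d primaryCalls=%d err=%v\n", res, r, p, err)
	}
}
```

In this sketch the replica is never called more than once per read, which is the property the argument above relies on.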