[#26504] YSQL, QueryDiagnostics: Resolve the race condition in creation and killing of yb_query_diagnostics bgworker
Summary:
A race condition can occur between the creation and killing of the yb_query_diagnostics bgworker in the following scenario:
A backend adds a new entry.
The bgworker processes the entry and removes it from the hash table.
The bgworker sees that there are no entries left, so it decides to kill itself.
The bgworker releases the lock, but has not killed itself yet.
A backend acquires the lock.
The backend adds another entry.
The backend determines that the bgworker is currently running, so it does not register a new one, and releases the lock.
```
	/* Worker was never initialized (invalid slot and generation) */
	if (bg_worker_handle->slot == -1 && bg_worker_handle->generation == -1)
		BgWorkerRegister();
	/* Worker was initialized but not currently running */
	else if (GetBackgroundWorkerPid(bg_worker_handle, &pid) != BGWH_STARTED)
		BgWorkerRegister();
} /* closes an enclosing block not shown in this excerpt */
LWLockRelease(bundles_in_progress_lock);
```
The bgworker then kills itself, as it had already decided to do.
At this point we are stuck: the second entry stays in the hash table with no bgworker to process it, and a new bgworker will only be created when yet another entry is added.
To resolve this, we introduce a shared variable called bg_worker_should_be_active, which indicates the expected state of the background worker at any given time. When inserting a new bundle, if we expect the background worker to be inactive but find that it is still running, we wait for 5 seconds. If it remains active after this period, we raise an ereport(ERROR, ...).
| backend                                              | bgworker                     |
| ---------------------------------------------------- | ---------------------------- |
| Acquire exclusive lock                               |                              |
| Start bgworker as it doesn't exist already           |                              |
| Add an entry to the hash table                       |                              |
| Release lock                                         |                              |
| Collects the data                                    | ...                          |
| (interval over)                                      | (interval over)              |
|                                                      | Process and dump the data    |
|                                                      | Remove entry from hash table |
|                                                      | Acquire exclusive lock       |
|                                                      | Mark bgworker to be killed   |
|                                                      | Release lock                 |
| Tries to add another entry                           |                              |
| Sees bgworker marked to be killed but not yet killed |                              |
| So waits for 5 sec for it to be killed               |                              |
|                                                      | Bgworker killed              |
| Acquire exclusive lock                               |                              |
| Start bgworker as it doesn't exist already           |                              |
| Add an entry to the hash table                       |                              |
| Release lock                                         |                              |
Also note that since we take an EXCLUSIVE lock while creating the bgworker, multiple clients should not be able to start the bgworker concurrently.
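The handshake described in this summary can be illustrated with a small single-process model. This is only a sketch under stated assumptions, not the code in this diff: `SharedState`, `poll_once`, `worker_finish_entry`, `insert_bundle`, and `polls_until_exit` are hypothetical names, the `worker_running` flag stands in for the `GetBackgroundWorkerPid` check, a bounded poll count stands in for the 5-second wait, a return code stands in for `ereport(ERROR, ...)`, and locking is omitted entirely. Only `bg_worker_should_be_active` comes from the summary.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the shared-memory state described above. */
typedef struct SharedState
{
    bool bg_worker_should_be_active; /* expected state of the bgworker */
    bool worker_running;             /* actual state (stand-in for GetBackgroundWorkerPid) */
    int  num_entries;                /* stand-in for the hash table */
    int  polls_until_exit;           /* simulation only: polls before a dying worker exits */
} SharedState;

typedef enum { INSERT_OK, INSERT_TIMEOUT } InsertResult;

/* Simulation only: one poll interval passes; a worker marked inactive eventually exits. */
static void
poll_once(SharedState *s)
{
    if (s->worker_running && !s->bg_worker_should_be_active)
    {
        if (s->polls_until_exit > 0)
            s->polls_until_exit--;
        if (s->polls_until_exit == 0)
            s->worker_running = false;
    }
}

/* Worker side: after removing the last entry, record that it is about to exit. */
static void
worker_finish_entry(SharedState *s)
{
    assert(s->num_entries > 0);
    s->num_entries--;
    if (s->num_entries == 0)
        s->bg_worker_should_be_active = false;
}

/* Backend side: never rely on a worker that has already decided to exit. */
static InsertResult
insert_bundle(SharedState *s, int max_polls)
{
    if (!s->bg_worker_should_be_active)
    {
        int polls = 0;

        /* Worker is expected to be gone; give it a bounded time to actually exit. */
        while (s->worker_running && polls < max_polls)
        {
            poll_once(s);
            polls++;
        }
        if (s->worker_running)
            return INSERT_TIMEOUT;   /* stand-in for ereport(ERROR, ...) */

        s->worker_running = true;    /* stand-in for registering a new bgworker */
        s->bg_worker_should_be_active = true;
    }
    s->num_entries++;
    return INSERT_OK;
}
```

In this model an entry is only added when a worker that intends to keep running exists, which is the property the original check on `GetBackgroundWorkerPid` alone could not guarantee.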
Jira: DB-15871
Test Plan:
yb_build.sh --java-test 'org.yb.pgsql.TestYbQueryDiagnostics#testBgworkerRaceConditionResolved'
yb_build.sh --java-test 'org.yb.pgsql.TestYbQueryDiagnostics#testBgworkerRaceConditionTimeout'
Reviewers: asaha, telgersma
Reviewed By: telgersma
Subscribers: yql
Differential Revision: https://phorge.dev.yugabyte.com/D42180