Skip to content
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
28 changes: 14 additions & 14 deletions src/server/server.cc
Original file line number Diff line number Diff line change
Expand Up @@ -242,20 +242,7 @@ Status Server::Start() {
return Status::OK();
}

void Server::Stop() {
stop_ = true;

slaveof_mu_.lock();
if (replication_thread_) replication_thread_->Stop();
slaveof_mu_.unlock();

for (const auto &worker : worker_threads_) {
worker->Stop(0 /* immediately terminate */);
}

rocksdb::CancelAllBackgroundWork(storage->GetDB(), true);
task_runner_.Cancel();
}
void Server::Stop() { stop_ = true; }

void Server::Join() {
if (auto s = util::ThreadJoin(cron_thread_); !s) {
Expand Down Expand Up @@ -853,6 +840,19 @@ void Server::cron() {
CleanupExitedSlaves();
recordInstantaneousMetrics();
}

assert(stop_);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's weird to move those codes from Stop() to cron(). I suspect the issue was caused by the server being stopped before the replication was started.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's weird. But I don't find a better place to impl the logic, so I read the Redis code and find Redis is also impl the logic in its serverCron().

Copy link
Contributor Author

@cxljs cxljs Mar 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect the issue was caused by the server being stopped before the replication was started.

I think before the replication was started, the ReplicationThread::stop_flag_ is true, so ReplicationThread::Stop() will not call the join().

And according to the log:

I20250226 11:50:45.351392 140030983013952 main.cc:50] Signal Terminated (15) received, stopping the server
W20250226 11:50:45.351526 140030983013952 replication.cc:371] Replication thread operation failed: thread #140030983013952 cannot be `join`ed: Resource deadlock avoided
I20250226 11:50:45.351537 140030983013952 replication.cc:373] [replication] Stopped

3952 (the repl thread) receive the signal -> call Server::Stop() -> join itself.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

3952 (the repl thread) receive the signal -> call Server::Stop() -> join itself.

The repl thread won't receive the terminated signal and stop the server.

Copy link
Contributor Author

@cxljs cxljs Mar 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The repl thread won't receive the terminated signal and stop the server.

Why do you have this conclusion?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I20250226 11:50:45.351392 140030983013952 main.cc:50] Signal Terminated (15) received, stopping the server

Only the main thread will handle the signal. After receiving the terminated signal, the server->Stop() will be called, and it will then stop the replication thread afterward.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


slaveof_mu_.lock();
if (replication_thread_) replication_thread_->Stop();
slaveof_mu_.unlock();

for (const auto &worker : worker_threads_) {
worker->Stop(0 /* immediately terminate */);
}

rocksdb::CancelAllBackgroundWork(storage->GetDB(), true);
task_runner_.Cancel();
}

Server::InfoEntries Server::GetRocksDBInfo() {
Expand Down
Loading