Skip to content

Conversation

@LindaSummer
Copy link
Member

@LindaSummer LindaSummer commented Aug 1, 2024

Issue

#1389

Proposed Changes

  • refactor worker's connection structure from std:::map to tbb::concurrent_hash_map
  • remove redundant mutex

Comment

During my testing and researching on tbb, I find that the tbb::concurrent_hash_map is not thread safe on traversing and erasing.
Here is the related issue: uxlfoundation/oneTBB#183.
So, I have to keep the lock on iteration and erasing.

After reading the tbb code, I find that tbb::concurrent_hashmap should use range() with tbb::parallel_for and tbb::parallel_reduce for traversing.

    //------------------------------------------------------------------------
    // Parallel algorithm support
    //------------------------------------------------------------------------
    range_type range( size_type grainsize=1 ) {
        return range_type( *this, grainsize );
    }
    const_range_type range( size_type grainsize=1 ) const {
        return const_range_type( *this, grainsize );
    }

    //------------------------------------------------------------------------
    // STL support - not thread-safe methods
    //------------------------------------------------------------------------
    iterator begin() { return iterator( *this, 0, this->my_embedded_segment, this->my_embedded_segment->node_list.load(std::memory_order_relaxed) ); }
    const_iterator begin() const { return const_iterator( *this, 0, this->my_embedded_segment, this->my_embedded_segment->node_list.load(std::memory_order_relaxed) ); }
    const_iterator cbegin() const { return const_iterator( *this, 0, this->my_embedded_segment, this->my_embedded_segment->node_list.load(std::memory_order_relaxed) ); }
    iterator end() { return iterator( *this, 0, nullptr, nullptr ); }
    const_iterator end() const { return const_iterator( *this, 0, nullptr, nullptr ); }
    const_iterator cend() const { return const_iterator( *this, 0, nullptr, nullptr ); }

@LindaSummer LindaSummer force-pushed the feature/tbb_worker_with_mutex branch from ee539f1 to b42ca2d Compare August 1, 2024 13:57
@LindaSummer LindaSummer force-pushed the feature/tbb_worker_with_mutex branch from 8c8951b to 1104241 Compare August 4, 2024 15:28
@LindaSummer LindaSummer changed the title feat: replace worker connection's mutex partially with tbb::concurrent_hash_map feat: replace worker connection's mutex with tbb::concurrent_hash_map Aug 4, 2024
@git-hulk git-hulk requested a review from mapleFU August 5, 2024 02:04
@mapleFU
Copy link
Member

mapleFU commented Aug 5, 2024

Would you mind try redis-benchmark to test some commands? I don't know how this would improve our performance, it would great to have some data

@LindaSummer
Copy link
Member Author

Would you mind try redis-benchmark to test some commands? I don't know how this would improve our performance, it would great to have some data

Hi @mapleFU ,

Thanks for your suggestion!
In fact I'm also not sure how much will tbb hashmap affect the performance. 😆
I will make a benchmark with redis benchmark later today.

Best Regards,
Edward

}

std::vector<int> Worker::getConnFds() const {
return tbb::parallel_reduce(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO parallel_for looks like some cpu-bound operations, and would it occupies more threads than expected here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @mapleFU ,
Thanks for your review suggestion.
I read the tbb's official document, and find the schedule uses all cores on default.
Maybe we should set a limitation for tbb schedule.
In fact, I'm not sure do we really need tbb hashmap to replace current mutex now. 🤔

Best Regards,
Edward

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fact, I'm not sure do we really need tbb hashmap to replace current mutex now.

I think it's worth doing if we don't traverse all conns_ frequently.

@LindaSummer
Copy link
Member Author

Would you mind try redis-benchmark to test some commands? I don't know how this would improve our performance, it would great to have some data

Hi @mapleFU ,
Sorry for delay reply.
I'm not so familiar with redis-benchmark.
I use the default redis benchmark command and generate two csv reports here.

It seems that the performance just has a slight change.

Here is my computer basic info.

OS: Rocky Linux 9.1 (Blue Onyx) x86_64
Kernel: 5.14.0-162.6.1.el9_1.0.1.x86_64
CPU: AMD Ryzen 5 1600 (12) @ 3.200GHz
Memory: 9970MiB / 31785MiB

This is the unstable branch's report with commit 6473d6c.

"test","rps","avg_latency_ms","min_latency_ms","p50_latency_ms","p95_latency_ms","p99_latency_ms","max_latency_ms"
"PING_INLINE","64808.82","0.404","0.072","0.391","0.511","0.671","3.647"
"PING_MBULK","65402.22","0.400","0.096","0.391","0.503","0.567","3.191"
"SET","64391.50","0.418","0.080","0.407","0.527","0.607","1.647"
"GET","64641.24","0.407","0.080","0.399","0.503","0.583","0.999"
"INCR","62500.00","0.559","0.056","0.503","0.951","1.215","2.127"
"LPUSH","56980.06","0.846","0.048","0.855","1.375","1.591","2.135"
"RPUSH","58685.45","0.801","0.048","0.799","1.199","1.391","2.511"
"LPOP","51572.98","0.939","0.048","0.967","1.479","1.783","3.247"
"RPOP","50838.84","0.960","0.048","0.927","1.607","1.967","3.743"
"SADD","62421.97","0.521","0.056","0.471","0.911","1.575","2.199"
"HSET","40064.10","1.229","0.056","0.999","2.655","3.023","3.999"
"SPOP","65146.58","0.403","0.088","0.399","0.511","0.567","0.903"
"ZADD","64061.50","0.627","0.040","0.575","1.111","1.279","1.775"
"ZPOPMIN","66225.16","0.398","0.080","0.391","0.511","0.599","1.031"
"LPUSH (needed to benchmark LRANGE)","54436.58","0.893","0.048","0.839","1.575","1.815","2.687"
"LRANGE_100 (first 100 elements)","40404.04","0.680","0.256","0.687","0.823","0.879","1.343"
"LRANGE_300 (first 300 elements)","19872.81","1.377","0.432","1.367","1.735","1.935","5.535"
"LRANGE_500 (first 500 elements)","13666.80","2.029","0.664","2.023","2.535","2.847","5.327"
"LRANGE_600 (first 600 elements)","12012.01","2.336","1.072","2.159","3.471","3.847","6.159"
"MSET (10 keys)","46772.68","1.030","0.048","1.007","2.047","2.519","18.703"

This is current tbb feature branch's report.

"test","rps","avg_latency_ms","min_latency_ms","p50_latency_ms","p95_latency_ms","p99_latency_ms","max_latency_ms"
"PING_INLINE","66269.05","0.395","0.072","0.391","0.479","0.551","3.055"
"PING_MBULK","65616.80","0.398","0.080","0.391","0.487","0.575","1.015"
"SET","61538.46","0.436","0.064","0.431","0.535","0.591","1.343"
"GET","65274.15","0.402","0.088","0.399","0.511","0.567","0.839"
"INCR","62695.92","0.572","0.056","0.503","1.039","1.295","2.103"
"LPUSH","58719.91","0.810","0.048","0.767","1.399","1.639","3.903"
"RPUSH","57703.40","0.829","0.048","0.879","1.415","1.655","3.031"
"LPOP","51177.07","0.952","0.048","0.951","1.615","1.927","3.175"
"RPOP","49776.01","0.981","0.048","0.999","1.839","2.223","3.391"
"SADD","63897.76","0.572","0.040","0.503","1.007","1.215","2.103"
"HSET","40866.37","1.205","0.056","1.223","1.999","2.287","3.295"
"SPOP","63171.20","0.416","0.088","0.407","0.535","0.639","1.175"
"ZADD","65274.15","0.685","0.048","0.679","1.103","1.263","1.975"
"ZPOPMIN","63451.78","0.414","0.064","0.415","0.511","0.575","0.927"
"LPUSH (needed to benchmark LRANGE)","54436.58","0.883","0.048","0.863","1.559","1.831","2.591"
"LRANGE_100 (first 100 elements)","38358.27","0.710","0.280","0.719","0.847","0.895","1.431"
"LRANGE_300 (first 300 elements)","17927.57","1.496","0.696","1.471","1.887","2.039","2.951"
"LRANGE_500 (first 500 elements)","12485.95","2.157","1.056","2.183","2.615","2.959","6.047"
"LRANGE_600 (first 600 elements)","11053.39","2.435","0.784","2.423","3.015","3.375","7.471"
"MSET (10 keys)","47505.94","1.030","0.056","1.087","1.567","1.783","2.719"

Here is the delta report of tbb minus unstable.

test,rps_diff,avg_latency_ms_diff,min_latency_ms_diff,p50_latency_ms_diff,p95_latency_ms_diff,p99_latency_ms_diff,max_latency_ms_diff
PING_INLINE,1460.23,-0.009,0.0,0.0,-0.032,-0.12,-0.592
PING_MBULK,214.58,-0.002,-0.016,0.0,-0.016,0.008,-2.176
SET,-2853.04,0.018,-0.016,0.024,0.008,-0.016,-0.304
GET,632.91,-0.005,0.008,0.0,0.008,-0.016,-0.16
INCR,195.92,0.013,0.0,0.0,0.088,0.08,-0.024
LPUSH,1739.85,-0.036,0.0,-0.088,0.024,0.048,1.768
RPUSH,-982.05,0.028,0.0,0.08,0.216,0.264,0.52
LPOP,-395.91,0.013,0.0,-0.016,0.136,0.144,-0.072
RPOP,-1062.83,0.021,0.0,0.072,0.232,0.256,-0.352
SADD,1475.79,0.051,-0.016,0.032,0.096,-0.36,-0.096
HSET,802.27,-0.024,0.0,0.224,-0.656,-0.736,-0.704
SPOP,-1975.38,0.013,0.0,0.008,0.024,0.072,0.272
ZADD,1212.65,0.058,0.008,0.104,-0.008,-0.016,0.2
ZPOPMIN,-2773.38,0.016,-0.016,0.024,0.0,-0.024,-0.104
LPUSH (needed to benchmark LRANGE),0.0,-0.01,0.0,0.024,-0.016,0.016,-0.096
LRANGE_100 (first 100 elements),-2045.77,0.03,0.024,0.032,0.024,0.016,0.088
LRANGE_300 (first 300 elements),-1945.24,0.119,0.264,0.104,0.152,0.104,-2.584
LRANGE_500 (first 500 elements),-1180.85,0.128,0.392,0.16,0.08,0.112,0.72
LRANGE_600 (first 600 elements),-958.62,0.099,-0.288,0.264,-0.456,-0.472,1.312
MSET (10 keys),733.26,0.0,0.008,0.08,-0.48,-0.736,-15.984

Best Regards,
Edward

@mapleFU
Copy link
Member

mapleFU commented Aug 7, 2024

Thanks for your benchmarking! Maybe I'd like to run this and trying to perf the mutex contention when I have time.

I've check out the benchmark result but the benchmark result turns to be unstable 🤔 The graph below shows that in different percentage result would be different...So maybe we should analysis the performance bottleneck for connections handling and check whether it could bring up performance improvement

QQ_1723045383522

@LindaSummer
Copy link
Member Author

Thanks for your benchmarking! Maybe I'd like to run this and trying to perf the mutex contention when I have time.

I've check out the benchmark result but the benchmark result turns to be unstable 🤔 The graph below shows that in different percentage result would be different...So maybe we should analysis the performance bottleneck for connections handling and check whether it could bring up performance improvement

QQ_1723045383522

Hi @mapleFU ,

Thanks very much for your help! 😊
I will learn more about redis-benchamrk and try to use a larger test sample for report.

Best Regards,
Edward

@git-hulk
Copy link
Member

git-hulk commented Aug 8, 2024

@mapleFU This PR generally looks good to me. Huge thanks for your efforts @LindaSummer.

@mapleFU
Copy link
Member

mapleFU commented Aug 8, 2024

I don't have time for this today, I'll take a round tomorrow

for (const auto &iter : monitor_conns_) {
conns.emplace_back(iter.second);
auto collect_conns_fn = [](ConnMap &conns) {
return parallel_reduce(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is parallel_reduce necessary here?

Copy link
Member Author

@LindaSummer LindaSummer Aug 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @PragmaTwice ,

I use a reduce action before iteration because the parallel_for will acquire a range rather than one item.
So, I want to move the time costly action Close() outside of the parallel_for.

But since this function is inside the dtor, if the ownership is managed correctly, it should be the only function uses internal state of the Worker object.
I think the parallel_reduce can be replaced by parallel_for or even removed.

I will refactor it and run tests before pushing to current PR.

Best Regards,
Edward

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean maybe a single-thread for is enough here? didn't have a deep look yet cc @mapleFU

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @PragmaTwice ,

I have removed the parallel_reduce from dtor and all tests are passed in my repo's GitHub Actions.
I think in dtor the concurrency protection of internal state may be redundant since all references should be released except the dtor itself.

Best Regards,
Edward

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean maybe a single-thread for is enough here? didn't have a deep look yet

Yeah parallel task also has a dop (degree of parallel), and I guess default of parallel_for would be high and be good at writing cpu-bound tasks like OpenMP

I think maybe lock is prefered here

}

std::vector<int> Worker::getConnFds() const {
return tbb::parallel_reduce(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fact, I'm not sure do we really need tbb hashmap to replace current mutex now.

I think it's worth doing if we don't traverse all conns_ frequently.

Copy link
Member

@mapleFU mapleFU left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other part LGTM

@mapleFU
Copy link
Member

mapleFU commented Aug 13, 2024

General LGTM, maybe I'd testing it this week, if the operations is just raw read/erase it would works better

Copy link
Member

@mapleFU mapleFU left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So sorry for delay replying

After an-around of thinking, I think:

Pros:

  1. Maybe faster on conn. When conn mutex is a bottleneck on throughput, it could actually making it faster
  2. In the code holds the mutex in a worker for a longtime, the new tbb will brings up smaller critical section and making the front-end user thread get faster response, which benifits the latency

Cons:

  1. Would conns be a large number? If conn is in a little number, it would brings up inconsistency between monitor_conns_ and conns_. It's much more difficult to figure out the correctness than std::mutex. Especially on the function like KickouIdleClients. It improves the latency which blocked by the global lock, but when connections will not be so many, it doesn't make any sense @git-hulk
  2. tbb::parallel_for would goes here: https://github.com/oneapi-src/oneTBB/blob/2d516c871b79d81032f0ca44d76e5072bc62d479/src/tbb/task_dispatcher.h#L243 which requires task_scheduler and an independent pool.

@LindaSummer
Copy link
Member Author

Hi @mapleFU ,

Thanks very much for your review suggestions! 😄

it would brings up inconsistency between monitor_conns_ and conns_

I agree on this concern. Removing the lock will lead to two connection collection's inconsistency.
It can be mitigated if we fetch the monitor_conns_ accessor inside the related accessor scope of conns_.
But frankly I think this way increased the complexity of current logic.
If the connections quantity is not so much, the performance may not have improvement.

tbb::parallel_for would goes here: https://github.com/oneapi-src/oneTBB/blob/2d516c871b79d81032f0ca44d76e5072bc62d479/src/tbb/task_dispatcher.h#L243 which requires task_scheduler and an independent pool.

Thanks for your investigation!
I think that I should follow current's way and construct a single thread execution pool for the parallel_for and parallel_reduce.

By the way, after reading our codebase, I think that the tbb container may more suitable for containers which just need to protect themselves rather than more related states.
Maybe states like block_keys_ and blocked_stream_consumers_ are more suitable for tbb containers.

Best Regards,
Edward

return {Status::NotOK, "connection doesn't exist"};
}

void Worker::BecomeMonitorConn(redis::Connection *conn) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @mapleFU ,

Sorry for delay reply.
I'd like to use this way on all accessors to mitigate potential inconsistency in monitor_conns_ and conns_.

Suggested change
void Worker::BecomeMonitorConn(redis::Connection *conn) {
void Worker::BecomeMonitorConn(redis::Connection *conn) {
ConnMap::accessor accessor;
ConnMap::accessor monitor_accessor;
bool find_conn = conns_.find(accessor, conn->GetFD());
bool find_monitor = monitor_conns_.find(monitor_accessor, conn->GetFD());
if (find_conn) {
conns_.erase(accessor);
}
if (find_monitor) {
monitor_accessor->second = conn;
} else {
monitor_conns_.insert(monitor_accessor, std::make_pair(conn->GetFD(), conn));
}
srv->IncrMonitorClientNum();
conn->EnableFlag(redis::Connection::kMonitor);
}

Best Regards,
Edward

@LindaSummer
Copy link
Member Author

Hi @mapleFU ,

Sorry for delay reply.
I have added constraints for parallel_for and parallel_reduce.
It should now only use one thread for execution.

Best Regards,
Edward

@sonarqubecloud
Copy link

sonarqubecloud bot commented Sep 3, 2024

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants