-
Notifications
You must be signed in to change notification settings - Fork 575
feat(tdigest): add the support of TDIGEST.REVRANK command #3130
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: unstable
Are you sure you want to change the base?
Conversation
|
Thank you for your contribution. Could you add some golang test cases for it? Refer to https://github.com/apache/kvrocks/blob/unstable/tests/gocase/unit/type/tdigest/tdigest_test.go. |
|
@PragmaTwice ok,I will add some golang test |
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| { | ||
| LockGuard guard(storage_->GetLockManager(), ns_key); | ||
|
|
||
| if (auto status = getMetaDataByNsKey(ctx, ns_key, &metadata); !status.ok()) { | ||
| return status; | ||
| } | ||
|
|
||
| if (metadata.total_observations == 0) { | ||
| result->resize(inputs.size(), -2); | ||
| return rocksdb::Status::OK(); | ||
| } | ||
|
|
||
| if (metadata.unmerged_nodes > 0) { | ||
| auto batch = storage_->GetWriteBatchBase(); | ||
| WriteBatchLogData log_data(kRedisTDigest); | ||
| if (auto status = batch->PutLogData(log_data.Encode()); !status.ok()) { | ||
| return status; | ||
| } | ||
|
|
||
| if (auto status = mergeCurrentBuffer(ctx, ns_key, batch, &metadata); !status.ok()) { | ||
| return status; | ||
| } | ||
|
|
||
| std::string metadata_bytes; | ||
| metadata.Encode(&metadata_bytes); | ||
| if (auto status = batch->Put(metadata_cf_handle_, ns_key, metadata_bytes); !status.ok()) { | ||
| return status; | ||
| } | ||
|
|
||
| if (auto status = storage_->Write(ctx, storage_->DefaultWriteOptions(), batch->GetWriteBatch()); !status.ok()) { | ||
| return status; | ||
| } | ||
|
|
||
| ctx.RefreshLatestSnapshot(); | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @donghao526 ,
The merge action could be refactored into a function to reduce duplication of the same logic.
Best Regards,
Edward
src/types/redis_tdigest.cc
Outdated
| for (auto value : inputs) { | ||
| auto status_or_rank = TDigestRevRank(dump_centroids, value); | ||
| if (!status_or_rank) { | ||
| return rocksdb::Status::InvalidArgument(status_or_rank.Msg()); | ||
| } | ||
| result->push_back(*status_or_rank); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @donghao526 ,
We could sort the inputs and get the ranks with just one scan of the centroids since it's sorted.
Best Regards,
Edward
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @LindaSummer
I encountered a problem when I was testing. After the nodes merged, are there two adjacent centroids can be with the same mean?
I Test with
TDIGEST.CREATE s COMPRESSION 1000
TDIGEST.ADD s 10 10 10 10 20 20
I found the centroids after merged are:
(1) mean: 10 weight: 1
(2) mean: 10 weight: 1
(3) mean: 10 weight: 1
(4) mean: 10 weight: 1
(5) mean: 20 weight: 1
(6) mean: 20 weight: 1
Is this as expected or a bug?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @donghao526 ,
It is expected, and you could refer to #2878 for more details.
So we need a stable way for both serialization and deserialization.
The trigger for the merge is the weight, not the mean. So we could treat the mean only as a label of one centroid. The whole logic is driven by weight.
Best Regards,
Edward
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
got it
src/types/tdigest.h
Outdated
| } | ||
|
|
||
| template <typename TD> | ||
| inline StatusOr<int> TDigestRevRank(TD&& td, double value) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @donghao526 ,
We need to use a stable way to compare between doubles.
It will be tough to assume that the two double numbers are equal to or greater than.
After solving this, we should add some test cases for this corner case.
Best Regards,
Edward
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @donghao526 ,
Since the other code snippets use this way now. You could leave it with the current logic.
I will try to create a new PR to solve the unstable comparison problem in this file.
Best Regards,
Edward
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, after your new PR, I can help to fix here.
…ks into feature/tdigest-revrank
Co-authored-by: hulk <[email protected]>
… commands (apache#3107) Co-authored-by: Twice <[email protected]>
…e#3136) The usage of this function was removed in apache#395. Co-authored-by: Twice <[email protected]>
…pache#3139) Accessing undeclared keys in lua scripting may lead to unexpected behavior in the current design of kvrocks (also in redis, refer to https://redis.io/docs/latest/commands/eval/), so in this PR we add a new option `lua-strict-key-accessing` to prevent users to access undeclared keys in lua, e.g. ``` EVAL "return redis.call('set', 'a', 1)" // ERROR! EVAL "return redis.call('set', KEYS[1], 1)" 1 a // ok ``` This check is performed in both lua scripting and lua functions.
Also, use hardcoded IDs when creating the user and group, to ensure they remain stable. Closes apache#3135. Co-authored-by: Aleks Lozovyuk <[email protected]>
…lt (apache#3144) To align with the description in kvrocks.conf. https://github.com/apache/kvrocks/blob/9446fbde325767f7b64896c4539035402d0f8f7a/kvrocks.conf#L1016-L1017
…()` (apache#3145) Historically, it was used for doing some checks. https://github.com/apache/kvrocks/blob/2bbfe5aa9531f6e76bfd10a2cc450b9bfa0f15d9/src/storage.cc#L175-L182 But these checks no longer exist today, so this operation should be unnecessary. `ListColumnFamilies()` needs to iterate through the MANIFEST, and might be costly for a large one. For example, 80MB MANIFEST, took 1.6 seconds.
…che#3146) Pattern-based SCAN iterations may skip remaining keys in a hash slot, causing incomplete key retrieval. **Bug Reproduction:** - 10 keys in the same hash slot: ["119483", "166988", "210695", "223656", "48063", "59022", "65976", "74937", "88379", "99338"] - Initial SCAN with pattern `2*`: - Returns cursor C1 and **empty keyset** (no keys match `2*`) - Records "119483" as last scanned key - Subsequent SCAN with cursor C1 and same pattern: 1. RocksDB iterator seeks to "119483" 2. Calls `Next()` → gets "166988" (next key in slot) 3. "166988" ∉ `2*` pattern → no key returned 4. **Error**: Scan incorrectly increments slot index 5. **Result**: Remaining 8 keys in slot are skipped **Bug Fix Implementation:** When scanning with a previous scan cursor and match pattern: 1. If the last scanned key is lexicographically before the pattern's start range: → Use the pattern's minimum matching key as the seek key → Instead of using the last scanned key **Example:** - Pattern: `2*` → Minimum matching key = `"2"` (hex: \x32) - Last scanned key: `"119483"` (hex: \x31\x31...) - Since `"119483"` < `"2"` lexically: ✓ **Correct:** Seek to `"2"` ✗ **Buggy:** Seek to `"119483"` --------- Co-authored-by: Twice <[email protected]>
…ks into feature/tdigest-revrank
|
@LindaSummer I have modified the code according to your suggestion, please review.
Best Regards |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| return {Status::RedisExecErr, s.ToString()}; | ||
| } | ||
|
|
||
| if (result.data()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| if (result.data()) { | |
| if (!result.empty()) { |
src/types/tdigest.h
Outdated
| i--; | ||
|
|
||
| // handle the prev inputs which has the same value | ||
| while ((i > 0) && (inputs[indices[i]] == inputs[indices[i - 1]])) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we could use a map<double, std::vector<size_t>> to construct a sorted input and group redundant input with rank into a group, and it may simplify our logic.
src/types/tdigest.h
Outdated
| i--; | ||
| } | ||
| return Status::OK(); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a newline to the file.
| } | ||
|
|
||
| template <typename TD> | ||
| inline Status TDigestRank(TD&& td, const std::vector<double>& inputs, std::vector<int>& result) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function seems specialized for RevRank rather than Rank.
Could we implement the Rank, then wrap the Iterator to a reverse version with the same logic to construct RevRank?


ISSUE
It closes #3063.
Proposed Changes
Add TDIGEST.REVRANK command implementation
Add cpp unit tests