[HUDI-8648] Fix a bug for secondary index deletion #12447
Conversation
8200082 to b71c47f (compare)
HoodieMetadataPayload payload = record.getData();
if (!payload.isDeleted()) { // process only valid records.
  if (keySet.contains(recordKey)) {
    logRecordsMap.put(recordKey, record);
Should we also do deletedRecordsFromLogs.remove(recordKey); within the if block? Just in case a record is deleted first and added back later, we should not treat it as deleted.
No, we cannot. The single usage of deletedRecordsFromLogs is to filter records from the base file. E.g., we have br, lr1 (deleted), and lr2 (non-deleted). When we handle lr2, we cannot remove its rk, since rk is used to filter br.
I think it's ok because the log records are handled in temporal order so it won't be treated as deleted.
Do we have a test for this scenario? I.e., add a record to logfile1, delete it in logfile2, and then add it back in logfile3. If the test works, I am good.
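A quick standalone simulation of that three-log-file sequence (again with plain strings instead of the Hudi types; the names and structure are illustrative only), showing that processing the logs in temporal order leaves the re-added record live:

import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ReAddedRecordSketch {
  public static void main(String[] args) {
    // Entries are (recordKey, isDeleted), one list per log file, in commit order.
    List<List<SimpleEntry<String, Boolean>>> logFiles = Arrays.asList(
        Arrays.asList(new SimpleEntry<>("rk", false)),  // logfile1: add
        Arrays.asList(new SimpleEntry<>("rk", true)),   // logfile2: delete
        Arrays.asList(new SimpleEntry<>("rk", false))); // logfile3: add back

    Set<String> deletedRecordsFromLogs = new HashSet<>();
    Set<String> liveLogRecords = new HashSet<>(); // stands in for logRecordsMap

    for (List<SimpleEntry<String, Boolean>> logFile : logFiles) {
      for (SimpleEntry<String, Boolean> rec : logFile) {
        if (rec.getValue()) {
          deletedRecordsFromLogs.add(rec.getKey());
          liveLogRecords.remove(rec.getKey());
        } else {
          liveLogRecords.add(rec.getKey());
        }
      }
    }

    // The re-added record is live again; deletedRecordsFromLogs still holds "rk",
    // but that set is only consulted when filtering base-file records.
    System.out.println(liveLogRecords.contains("rk"));         // true
    System.out.println(deletedRecordsFromLogs.contains("rk")); // true
  }
}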
if (baseFileRecords != null) {
  baseFileRecords.forEach((key, value) -> {
    if (!deletedRecordsFromLogs.contains(key)) {
      recordKeyMap.put(key, SecondaryIndexKeyUtils.getSecondaryKeyFromSecondaryIndexKey(value.getRecordKey()));
Why are we not merging the records from the base file into the log records? I understand that in the case of the secondary index a record is either created or deleted and there is no real merge, but let's keep the original code as is and just add fixes on top of that.
For example, I went to master and made the edits below. All the tests you have added in this patch passed.
+  @VisibleForTesting
+  public static Map<String, String> reverseLookupSecondaryKeysInternal(List<String> keySet, Map<String, HoodieRecord<HoodieMetadataPayload>> baseFileRecords,
+                                                                       HoodieMetadataLogRecordReader logRecordScanner) {
+    Set<String> deletedRecordsFromLogs = new HashSet<>();
+    // Map of recordKey (primaryKey) -> log record that is not deleted for all input recordKeys
+    Map<String, HoodieRecord<HoodieMetadataPayload>> logRecordsMap = new HashMap<>();
+    logRecordScanner.getRecords().forEach(record -> {
+      String recordKey = SecondaryIndexKeyUtils.getRecordKeyFromSecondaryIndexKey(record.getRecordKey());
+      HoodieMetadataPayload payload = record.getData();
+      if (!payload.isDeleted()) { // process only valid records.
+        if (keySet.contains(recordKey)) {
+          logRecordsMap.put(recordKey, record);
+          deletedRecordsFromLogs.remove(recordKey); // we can check if its present and then remove if need be
+        }
+      } else {
+        deletedRecordsFromLogs.add(recordKey);
+        logRecordsMap.remove(recordKey);
+      }
+    });
+
+    Map<String, String> recordKeyMap = new HashMap<>();
+    if (baseFileRecords == null || baseFileRecords.isEmpty()) {
+      logRecordsMap.forEach((key1, value1) -> {
+        if (!value1.getData().isDeleted() && !deletedRecordsFromLogs.contains(key1)) {
+          recordKeyMap.put(key1, SecondaryIndexKeyUtils.getSecondaryKeyFromSecondaryIndexKey(value1.getRecordKey()));
+        }
+      });
+    } else {
+      // Iterate over all provided log-records, merging them into existing records
+      logRecordsMap.forEach((key1, value1) -> baseFileRecords.merge(key1, value1, (oldRecord, newRecord) -> {
+        Option<HoodieRecord<HoodieMetadataPayload>> mergedRecord = HoodieMetadataPayload.combineSecondaryIndexRecord(oldRecord, newRecord);
+        return mergedRecord.orElse(null);
+      }));
+      baseFileRecords.forEach((key, value) -> {
+        if (!deletedRecordsFromLogs.contains(key)) {
+          recordKeyMap.put(key, SecondaryIndexKeyUtils.getSecondaryKeyFromSecondaryIndexKey(value.getRecordKey()));
+        }
+      });
+    }
     return recordKeyMap;
   }
I can use the above code, but it is more complex than needed.
@nsivabalan, the reason we are discussing this here is that the logic of this function is too complex to understand, so we should try to make it as simple as possible.
Both are fine IMO, but maybe I am biased due to prior work. Ideally, I would like to retain the semantics of combineSecondaryIndexRecord, but the current patch is also simple to understand. Plus, if we filter out tombstone records before calling combineSecondaryIndexRecord, then there is no point in keeping it. Please remove this method.
Yeah, I will adopt Siva's approach to avoid back-and-forth arguments.
minor feedback
                                                        Map<String, HoodieRecord<HoodieMetadataPayload>> baseFileRecords,
                                                        HoodieMetadataLogRecordReader logRecordScanner) {
    Map<String, String> recordKeyMap = new HashMap<>();
    Set<String> keySet = new TreeSet<>(recordKeys);
Do we need to create a TreeSet of recordKeys? I don't think the order of the record keys matters here. I believe it was there already, probably to optimize the contains check. But even a simple HashSet would do, right?
Yeah, the order of the record keys does not matter here. I can check which one is more performant for the lookup.
Generally HashSet is faster than TreeSet; will change it.
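A rough, standalone comparison of the two lookup structures for this contains-heavy usage (not a rigorous benchmark; the key count and key format are arbitrary):

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class ContainsLookupSketch {
  public static void main(String[] args) {
    List<String> recordKeys = new ArrayList<>();
    for (int i = 0; i < 1_000_000; i++) {
      recordKeys.add("key-" + i);
    }

    Set<String> hashSet = new HashSet<>(recordKeys); // contains() is O(1) on average
    Set<String> treeSet = new TreeSet<>(recordKeys); // contains() is O(log n), plus ordering we do not need

    long hits = 0;
    long t0 = System.nanoTime();
    for (String key : recordKeys) {
      if (hashSet.contains(key)) {
        hits++;
      }
    }
    long t1 = System.nanoTime();
    for (String key : recordKeys) {
      if (treeSet.contains(key)) {
        hits++;
      }
    }
    long t2 = System.nanoTime();

    System.out.printf("hits=%d, HashSet: %d ms, TreeSet: %d ms%n",
        hits, (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000);
  }
}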
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadata.java (outdated; resolved)
179649b to f2fb700 (compare)
@linliu-code After refactoring - Test Secondary Index With Updates Compaction Clustering Deletes *** FAILED ***
https://github.com/apache/hudi/actions/runs/12380514412/job/34557088618?pr=12447#step:5:8624
Maybe the refactoring missed the actual fix you had originally. Could you run the test locally multiple times to ensure the flakiness is completely fixed?
Yeah, it may be triggering the same bug as before, or a different one. Will check.
@codope After the refactoring, the test failed quickly. It should be a regression. Let me change back to my previous version to confirm.
8f39643 to f34959a (compare)
3767203 to 75c0092 (compare)
Change Logs
When we search for the SI records for a given record key, the way we look up the secondary key could cause some SI records to be missed, which in turn causes some SI records to not be deleted correctly.
Impact
Correct SI behavior.
Risk level (write none, low medium or high below)
Medium.
Documentation Update
Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none".
Contributor's checklist