
[HUDI-8648] Fix a bug for secondary index deletion #12447

Merged
8 commits merged into apache:master on Dec 28, 2024

Conversation

linliu-code (Contributor) commented Dec 8, 2024

Change Logs

When we search for the SI records that correspond to a given record key, the way we look up the secondary key can cause some SI records to be missed, which in turn causes some SI records not to be deleted correctly.
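For context, a minimal sketch of the reverse lookup this patch touches. The helper names, class name, and the "<secondaryKey>$<recordKey>" layout below are illustrative assumptions; the real code goes through SecondaryIndexKeyUtils, which also handles escaping.

import java.util.*;

// Sketch only: reverse lookup of secondary-index (SI) records for a set of record keys.
class SecondaryIndexLookupSketch {
  // Assumed composite layout: <secondaryKey>$<recordKey> (separator is illustrative).
  static String recordKeyOf(String siKey) {
    return siKey.substring(siKey.lastIndexOf('$') + 1);
  }

  static String secondaryKeyOf(String siKey) {
    return siKey.substring(0, siKey.lastIndexOf('$'));
  }

  // For each requested record key, find the secondary key it currently maps to.
  static Map<String, String> reverseLookup(List<String> siKeys, Set<String> wantedRecordKeys) {
    Map<String, String> result = new HashMap<>();
    for (String siKey : siKeys) {
      String recordKey = recordKeyOf(siKey);
      if (wantedRecordKeys.contains(recordKey)) {
        result.put(recordKey, secondaryKeyOf(siKey));
      }
    }
    return result;
  }

  public static void main(String[] args) {
    List<String> siKeys = Arrays.asList("blue$rk1", "red$rk2", "blue$rk3");
    // Entries: rk1=blue, rk3=blue. An SI record missed at this step means the
    // corresponding delete is never emitted, which is the bug being fixed.
    System.out.println(reverseLookup(siKeys, new HashSet<>(Arrays.asList("rk1", "rk3"))));
  }
}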

Impact

Correct SI behavior.

Risk level (write none, low, medium or high below)

Medium.

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none".

  • The config description must be updated if new configs are added or the default value of a config is changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the instructions to make changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@github-actions github-actions bot added the size:S PR with lines of changes in (10, 100] label Dec 8, 2024
@github-actions github-actions bot added size:M PR with lines of changes in (100, 300] and removed size:S PR with lines of changes in (10, 100] labels Dec 9, 2024
@linliu-code linliu-code changed the title [HUDI-8643] Fix a read bug for SI [HUDI-8648] Fix a read bug for SI Dec 9, 2024
@linliu-code linliu-code changed the title [HUDI-8648] Fix a read bug for SI [HUDI-8648] Fix a bug for secondary index deletion Dec 9, 2024
      HoodieMetadataPayload payload = record.getData();
      if (!payload.isDeleted()) { // process only valid records.
        if (keySet.contains(recordKey)) {
          logRecordsMap.put(recordKey, record);
Contributor:
Should we also do

deletedRecordsFromLogs.remove(recordKey);

within the if block, just in case a record is deleted first and added back later? We should not treat it as deleted.

Contributor Author (linliu-code):

No, we cannot. The single usage of deletedRecordsFromLogs is to filter records from the base file. E.g., suppose for the same record key rk we have br (a base record), lr1 (a deleted log record), and lr2 (a non-deleted log record). When we handle lr2, we cannot remove its rk from the set, since rk is still needed to filter br.
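A minimal sketch of that scenario as described (br = base record, lr1/lr2 = log records, rk = record key); the "<secondaryKey>$<recordKey>" layout and the variable names are simplifications for illustration, assuming the set is only consulted when filtering base-file records:

import java.util.*;

class DeletedSetSketch {
  public static void main(String[] args) {
    Set<String> deletedRecordsFromLogs = new HashSet<>();
    Map<String, String> logRecordsMap = new HashMap<>(); // rk -> SI key from logs

    deletedRecordsFromLogs.add("rk");    // effect of lr1 (deleted)
    logRecordsMap.put("rk", "newSK$rk"); // effect of lr2 (valid); rk is NOT removed from the set

    Map<String, String> recordKeyMap = new HashMap<>();
    // Base-file pass: the stale base record "oldSK$rk" is filtered out because
    // rk is still present in deletedRecordsFromLogs.
    if (!deletedRecordsFromLogs.contains("rk")) {
      recordKeyMap.put("rk", "oldSK");
    }
    // Log-record pass: the fresh mapping from lr2 is emitted.
    recordKeyMap.put("rk", "newSK");

    System.out.println(recordKeyMap); // {rk=newSK}
  }
}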

Member:

I think it's ok because the log records are handled in temporal order so it won't be treated as deleted.

Contributor:

Do we have a test for this scenario? I.e., add a record in logfile1, delete it in logfile2, and then add it back in logfile3. If that test works, I am good.

    if (baseFileRecords != null) {
      baseFileRecords.forEach((key, value) -> {
        if (!deletedRecordsFromLogs.contains(key)) {
          recordKeyMap.put(key, SecondaryIndexKeyUtils.getSecondaryKeyFromSecondaryIndexKey(value.getRecordKey()));
Contributor:

Why are we not merging the records from the base file into the log records? I understand that in the case of the secondary index a record is either created or deleted and there is no real merge, but let's keep the original code as is and just add fixes on top of that.

Contributor:

For example, I went to master and made the edits below; all the tests you have added in this patch passed.

+  @VisibleForTesting
+  public static Map<String, String> reverseLookupSecondaryKeysInternal(List<String> keySet, Map<String, HoodieRecord<HoodieMetadataPayload>> baseFileRecords,
+                                                                HoodieMetadataLogRecordReader logRecordScanner) {
+    Set<String> deletedRecordsFromLogs = new HashSet<>();
+    // Map of recordKey (primaryKey) -> log record that is not deleted for all input recordKeys
+    Map<String, HoodieRecord<HoodieMetadataPayload>> logRecordsMap = new HashMap<>();
+    logRecordScanner.getRecords().forEach(record -> {
+      String recordKey = SecondaryIndexKeyUtils.getRecordKeyFromSecondaryIndexKey(record.getRecordKey());
+      HoodieMetadataPayload payload = record.getData();
+      if (!payload.isDeleted()) { // process only valid records.
+        if (keySet.contains(recordKey)) {
+          logRecordsMap.put(recordKey, record);
+          deletedRecordsFromLogs.remove(recordKey); // we can check if its present and then remove if need be
+        }
+      } else {
+        deletedRecordsFromLogs.add(recordKey);
+        logRecordsMap.remove(recordKey);
+      }
+    });
+
+    Map<String, String> recordKeyMap = new HashMap<>();
+    if (baseFileRecords == null || baseFileRecords.isEmpty()) {
+      logRecordsMap.forEach((key1, value1) -> {
+        if (!value1.getData().isDeleted() && !deletedRecordsFromLogs.contains(key1)) {
+          recordKeyMap.put(key1, SecondaryIndexKeyUtils.getSecondaryKeyFromSecondaryIndexKey(value1.getRecordKey()));
+        }
+      });
+    } else {
+      // Iterate over all provided log-records, merging them into existing records
+      logRecordsMap.forEach((key1, value1) -> baseFileRecords.merge(key1, value1, (oldRecord, newRecord) -> {
+        Option<HoodieRecord<HoodieMetadataPayload>> mergedRecord = HoodieMetadataPayload.combineSecondaryIndexRecord(oldRecord, newRecord);
+        return mergedRecord.orElse(null);
+      }));
+      baseFileRecords.forEach((key, value) -> {
+        if (!deletedRecordsFromLogs.contains(key)) {
+          recordKeyMap.put(key, SecondaryIndexKeyUtils.getSecondaryKeyFromSecondaryIndexKey(value.getRecordKey()));
+        }
+      });
+    }
     return recordKeyMap;
   }

linliu-code (Contributor Author) commented Dec 12, 2024:

I can use the above code, but it is more complex than needed.

Contributor Author:

@nsivabalan, the reason we discussed here is that the logic of this function is too complex to understand, so we should try to make it as simple as possible.

Member:

Both are fine IMO, but maybe I am biased due to prior work. Ideally, I would like to retain the semantics of combineSecondaryIndexRecord but the current patch is also simple to understand. Plus, if we filter out tombstone records before calling combineSecondaryIndexRecord then there is no point in keeping it. Please remove this method.
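For reference, a sketch of the semantics being discussed as I read them; this is not the actual HoodieMetadataPayload implementation. SiRecord stands in for HoodieRecord<HoodieMetadataPayload>, and java.util.Optional stands in for Hudi's Option. The assumed behavior: the newer record wins, and a newer tombstone erases the mapping entirely.

import java.util.Optional;

class CombineSketch {
  static class SiRecord {
    final String key;
    final boolean deleted;
    SiRecord(String key, boolean deleted) { this.key = key; this.deleted = deleted; }
    public String toString() { return key + (deleted ? " (deleted)" : ""); }
  }

  // Assumed semantics of combineSecondaryIndexRecord: keep the newer record
  // unless it is a delete, in which case drop the mapping.
  static Optional<SiRecord> combineSecondaryIndexRecord(SiRecord older, SiRecord newer) {
    return newer.deleted ? Optional.empty() : Optional.of(newer);
  }

  public static void main(String[] args) {
    SiRecord base = new SiRecord("sk1$rk", false);
    SiRecord logDelete = new SiRecord("sk1$rk", true);
    System.out.println(combineSecondaryIndexRecord(base, logDelete)); // Optional.empty
    // If tombstones are filtered out before merging, the delete branch above can
    // never fire, which is the argument for removing the method.
  }
}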

linliu-code (Contributor Author) commented Dec 17, 2024:

Yeah, I will adopt Siva's approach to avoid back-and-forth arguments.

nsivabalan (Contributor) left a review:

minor feedback

      Map<String, HoodieRecord<HoodieMetadataPayload>> baseFileRecords,
      HoodieMetadataLogRecordReader logRecordScanner) {
    Map<String, String> recordKeyMap = new HashMap<>();
    Set<String> keySet = new TreeSet<>(recordKeys);
codope (Member) commented Dec 17, 2024:

Do we need to create a TreeSet of recordKeys? I don't think the order of record keys matters here. I believe it was there already, probably to optimize the contains check, but even a simple HashSet would do, right?

Contributor Author:

Yeah, the order of the record keys does not matter here. I can check which one is more performant for the lookup.

Contributor Author:

Generally HashSet is faster than TreeSet; will change it.
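The change under discussion amounts to a one-line swap: contains on a HashSet is expected O(1) versus O(log n) on a TreeSet, and the lookup never relies on iteration order. A small sketch (variable names are illustrative):

import java.util.*;

class KeySetChoice {
  public static void main(String[] args) {
    List<String> recordKeys = Arrays.asList("rk3", "rk1", "rk2");

    // Before: sorted set, O(log n) contains; the ordering is unused by the lookup.
    Set<String> treeKeySet = new TreeSet<>(recordKeys);

    // After: unordered set, expected O(1) contains.
    Set<String> hashKeySet = new HashSet<>(recordKeys);

    System.out.println(treeKeySet.contains("rk2") && hashKeySet.contains("rk2")); // true
  }
}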


codope (Member) left a review:

@linliu-code After refactoring - Test Secondary Index With Updates Compaction Clustering Deletes *** FAILED ***
https://github.com/apache/hudi/actions/runs/12380514412/job/34557088618?pr=12447#step:5:8624

Maybe the refactoring missed the actual fix you had originally. Could you run the test locally multiple times to ensure the flakiness is completely fixed?

linliu-code (Contributor Author):
Yeah, it may be triggering the same bug as before, or a different one. Will check.

linliu-code (Contributor Author):
@codope After the refactoring, the test failed quickly; it should be a regression. Let me change back to my previous version to confirm.

hudi-bot:

CI report:

Bot commands (@hudi-bot supports the following commands):
  • @hudi-bot run azure: re-run the last Azure build

@codope codope merged commit 66a9401 into apache:master Dec 28, 2024
43 checks passed