
[HUDI-8648] Fix a bug for secondary index deletion #12447

Merged
8 commits merged into apache:master on Dec 28, 2024

Conversation

linliu-code (Contributor) commented Dec 8, 2024

Change Logs

When we search for the SI records that correspond to a given record key, the way we look up the secondary key can cause some SI records to be missed, which in turn causes some SI records not to be deleted correctly.
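For context, a minimal sketch of the reverse lookup this patch touches. The helper names, class name, and the "<secondaryKey>$<recordKey>" layout below are illustrative assumptions; the real code goes through SecondaryIndexKeyUtils, which also handles escaping.

import java.util.*;

// Sketch only: reverse lookup of secondary-index (SI) records for a set of record keys.
class SecondaryIndexLookupSketch {
  // Assumed composite layout: <secondaryKey>$<recordKey> (separator is illustrative).
  static String recordKeyOf(String siKey) {
    return siKey.substring(siKey.lastIndexOf('$') + 1);
  }

  static String secondaryKeyOf(String siKey) {
    return siKey.substring(0, siKey.lastIndexOf('$'));
  }

  // For each requested record key, find the secondary key it currently maps to.
  static Map<String, String> reverseLookup(List<String> siKeys, Set<String> wantedRecordKeys) {
    Map<String, String> result = new HashMap<>();
    for (String siKey : siKeys) {
      String recordKey = recordKeyOf(siKey);
      if (wantedRecordKeys.contains(recordKey)) {
        result.put(recordKey, secondaryKeyOf(siKey));
      }
    }
    return result;
  }

  public static void main(String[] args) {
    List<String> siKeys = Arrays.asList("blue$rk1", "red$rk2", "blue$rk3");
    // Entries: rk1=blue, rk3=blue. An SI record missed at this step means the
    // corresponding delete is never emitted, which is the bug being fixed.
    System.out.println(reverseLookup(siKeys, new HashSet<>(Arrays.asList("rk1", "rk3"))));
  }
}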

Impact

Correct SI behavior.

Risk level (write none, low, medium or high below)

Medium.

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none".

  • The config description must be updated if new configs are added or the default value of a config is changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the instructions to make changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@github-actions github-actions bot added the size:S PR with lines of changes in (10, 100] label Dec 8, 2024
@github-actions github-actions bot added size:M PR with lines of changes in (100, 300] and removed size:S PR with lines of changes in (10, 100] labels Dec 9, 2024
@linliu-code linliu-code changed the title [HUDI-8643] Fix a read bug for SI [HUDI-8648] Fix a read bug for SI Dec 9, 2024
@linliu-code linliu-code changed the title [HUDI-8648] Fix a read bug for SI [HUDI-8648] Fix a bug for secondary index deletion Dec 9, 2024
      HoodieMetadataPayload payload = record.getData();
      if (!payload.isDeleted()) { // process only valid records.
        if (keySet.contains(recordKey)) {
          logRecordsMap.put(recordKey, record);
Contributor:
Should we also do

deletedRecordsFromLogs.remove(recordKey);

within the if block, just in case a record is deleted first and added back later? We should not treat it as deleted.

Contributor Author (linliu-code):

No, we cannot. The single usage of deletedRecordsFromLogs is to filter records from the base file. E.g., suppose for the same record key rk we have br (a base record), lr1 (a deleted log record), and lr2 (a non-deleted log record). When we handle lr2, we cannot remove its rk from the set, since rk is still needed to filter br.
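A minimal sketch of that scenario as described (br = base record, lr1/lr2 = log records, rk = record key); the "<secondaryKey>$<recordKey>" layout and the variable names are simplifications for illustration, assuming the set is only consulted when filtering base-file records:

import java.util.*;

class DeletedSetSketch {
  public static void main(String[] args) {
    Set<String> deletedRecordsFromLogs = new HashSet<>();
    Map<String, String> logRecordsMap = new HashMap<>(); // rk -> SI key from logs

    deletedRecordsFromLogs.add("rk");    // effect of lr1 (deleted)
    logRecordsMap.put("rk", "newSK$rk"); // effect of lr2 (valid); rk is NOT removed from the set

    Map<String, String> recordKeyMap = new HashMap<>();
    // Base-file pass: the stale base record "oldSK$rk" is filtered out because
    // rk is still present in deletedRecordsFromLogs.
    if (!deletedRecordsFromLogs.contains("rk")) {
      recordKeyMap.put("rk", "oldSK");
    }
    // Log-record pass: the fresh mapping from lr2 is emitted.
    recordKeyMap.put("rk", "newSK");

    System.out.println(recordKeyMap); // {rk=newSK}
  }
}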

Member:

I think it's ok because the log records are handled in temporal order so it won't be treated as deleted.

Contributor:

Do we have a test for this scenario? I.e., add a record in logfile1, delete it in logfile2, and then add it back in logfile3. If that test works, I am good.

    if (baseFileRecords != null) {
      baseFileRecords.forEach((key, value) -> {
        if (!deletedRecordsFromLogs.contains(key)) {
          recordKeyMap.put(key, SecondaryIndexKeyUtils.getSecondaryKeyFromSecondaryIndexKey(value.getRecordKey()));
Contributor:

Why are we not merging the records from the base file into the log records? I understand that in the case of the secondary index a record is either created or deleted and there is no real merge, but let's keep the original code as is and just add fixes on top of that.

Contributor:

For example, I went to master and made the edits below; all the tests you have added in this patch passed.

+  @VisibleForTesting
+  public static Map<String, String> reverseLookupSecondaryKeysInternal(List<String> keySet, Map<String, HoodieRecord<HoodieMetadataPayload>> baseFileRecords,
+                                                                HoodieMetadataLogRecordReader logRecordScanner) {
+    Set<String> deletedRecordsFromLogs = new HashSet<>();
+    // Map of recordKey (primaryKey) -> log record that is not deleted for all input recordKeys
+    Map<String, HoodieRecord<HoodieMetadataPayload>> logRecordsMap = new HashMap<>();
+    logRecordScanner.getRecords().forEach(record -> {
+      String recordKey = SecondaryIndexKeyUtils.getRecordKeyFromSecondaryIndexKey(record.getRecordKey());
+      HoodieMetadataPayload payload = record.getData();
+      if (!payload.isDeleted()) { // process only valid records.
+        if (keySet.contains(recordKey)) {
+          logRecordsMap.put(recordKey, record);
+          deletedRecordsFromLogs.remove(recordKey); // we can check if its present and then remove if need be
+        }
+      } else {
+        deletedRecordsFromLogs.add(recordKey);
+        logRecordsMap.remove(recordKey);
+      }
+    });
+
+    Map<String, String> recordKeyMap = new HashMap<>();
+    if (baseFileRecords == null || baseFileRecords.isEmpty()) {
+      logRecordsMap.forEach((key1, value1) -> {
+        if (!value1.getData().isDeleted() && !deletedRecordsFromLogs.contains(key1)) {
+          recordKeyMap.put(key1, SecondaryIndexKeyUtils.getSecondaryKeyFromSecondaryIndexKey(value1.getRecordKey()));
+        }
+      });
+    } else {
+      // Iterate over all provided log-records, merging them into existing records
+      logRecordsMap.forEach((key1, value1) -> baseFileRecords.merge(key1, value1, (oldRecord, newRecord) -> {
+        Option<HoodieRecord<HoodieMetadataPayload>> mergedRecord = HoodieMetadataPayload.combineSecondaryIndexRecord(oldRecord, newRecord);
+        return mergedRecord.orElse(null);
+      }));
+      baseFileRecords.forEach((key, value) -> {
+        if (!deletedRecordsFromLogs.contains(key)) {
+          recordKeyMap.put(key, SecondaryIndexKeyUtils.getSecondaryKeyFromSecondaryIndexKey(value.getRecordKey()));
+        }
+      });
+    }
     return recordKeyMap;
   }

linliu-code (Contributor Author) commented Dec 12, 2024:

I can use the above code, but it is more complex than needed.

Contributor Author:

@nsivabalan, the reason we discussed here is that the logic of this function is too complex to understand, so we should try to make it as simple as possible.

Member:

Both are fine IMO, but maybe I am biased due to prior work. Ideally, I would like to retain the semantics of combineSecondaryIndexRecord but the current patch is also simple to understand. Plus, if we filter out tombstone records before calling combineSecondaryIndexRecord then there is no point in keeping it. Please remove this method.
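For reference, a sketch of the semantics being discussed as I read them; this is not the actual HoodieMetadataPayload implementation. SiRecord stands in for HoodieRecord<HoodieMetadataPayload>, and java.util.Optional stands in for Hudi's Option. The assumed behavior: the newer record wins, and a newer tombstone erases the mapping entirely.

import java.util.Optional;

class CombineSketch {
  static class SiRecord {
    final String key;
    final boolean deleted;
    SiRecord(String key, boolean deleted) { this.key = key; this.deleted = deleted; }
    public String toString() { return key + (deleted ? " (deleted)" : ""); }
  }

  // Assumed semantics of combineSecondaryIndexRecord: keep the newer record
  // unless it is a delete, in which case drop the mapping.
  static Optional<SiRecord> combineSecondaryIndexRecord(SiRecord older, SiRecord newer) {
    return newer.deleted ? Optional.empty() : Optional.of(newer);
  }

  public static void main(String[] args) {
    SiRecord base = new SiRecord("sk1$rk", false);
    SiRecord logDelete = new SiRecord("sk1$rk", true);
    System.out.println(combineSecondaryIndexRecord(base, logDelete)); // Optional.empty
    // If tombstones are filtered out before merging, the delete branch above can
    // never fire, which is the argument for removing the method.
  }
}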

linliu-code (Contributor Author) commented Dec 17, 2024:

Yeah, I will adopt Siva's approach to avoid back-and-forth arguments.

nsivabalan (Contributor) left a review:

minor feedback

      Map<String, HoodieRecord<HoodieMetadataPayload>> baseFileRecords,
      HoodieMetadataLogRecordReader logRecordScanner) {
    Map<String, String> recordKeyMap = new HashMap<>();
    Set<String> keySet = new TreeSet<>(recordKeys);
codope (Member) commented Dec 17, 2024:

Do we need to create a TreeSet of recordKeys? I don't think the order of record keys matters here. I believe it was there already, probably to optimize the contains check, but even a simple HashSet would do, right?

Contributor Author:

Yeah, the order of the record keys does not matter here. I can check which one is more performant for the lookup.

Contributor Author:

Generally HashSet is faster than TreeSet; will change it.
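The change under discussion amounts to a one-line swap: contains on a HashSet is expected O(1) versus O(log n) on a TreeSet, and the lookup never relies on iteration order. A small sketch (variable names are illustrative):

import java.util.*;

class KeySetChoice {
  public static void main(String[] args) {
    List<String> recordKeys = Arrays.asList("rk3", "rk1", "rk2");

    // Before: sorted set, O(log n) contains; the ordering is unused by the lookup.
    Set<String> treeKeySet = new TreeSet<>(recordKeys);

    // After: unordered set, expected O(1) contains.
    Set<String> hashKeySet = new HashSet<>(recordKeys);

    System.out.println(treeKeySet.contains("rk2") && hashKeySet.contains("rk2")); // true
  }
}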


codope (Member) left a review:

@linliu-code After refactoring - Test Secondary Index With Updates Compaction Clustering Deletes *** FAILED ***
https://github.com/apache/hudi/actions/runs/12380514412/job/34557088618?pr=12447#step:5:8624

Maybe the refactoring missed the actual fix you had originally. Could you run the test locally multiple times to ensure the flakiness is completely fixed?

linliu-code (Contributor Author):
Yeah, it may be triggering the same bug as before, or a different one. Will check.

linliu-code (Contributor Author):
@codope After the refactoring, the test failed quickly; it should be a regression. Let me change back to my previous version to confirm.

hudi-bot:

CI report:

Bot commands (@hudi-bot supports the following commands):
  • @hudi-bot run azure: re-run the last Azure build

@codope codope merged commit 66a9401 into apache:master Dec 28, 2024
43 checks passed