
Conversation

@sadanand48 (Contributor)

What changes were proposed in this pull request?

Because of the fixes in HDDS-13905, an entire BG service task can be blocked while a bootstrap operation runs on a leader OM. One possible improvement is to only create a hardlink to each file under the bootstrap lock (which we already do, to ensure the file doesn't get deleted by RocksDB operations), instead of also writing the file into the tarball stream synchronously, as the current code does:

```java
public static long linkAndIncludeFile(File file, String entryName,
    ArchiveOutputStream<TarArchiveEntry> archiveOutput, Path tmpDir) throws IOException {
  // Hardlink the file into tmpDir so RocksDB cannot delete it underneath us.
  File link = tmpDir.resolve(entryName).toFile();
  long bytes = 0;
  try {
    Files.createLink(link.toPath(), file.toPath());
    // The file is then streamed into the tarball synchronously, under the lock.
    TarArchiveEntry entry = archiveOutput.createArchiveEntry(link, entryName);
    archiveOutput.putArchiveEntry(entry);
    try (InputStream input = Files.newInputStream(link.toPath())) {
      bytes = IOUtils.copyLarge(input, archiveOutput);
    }
    archiveOutput.closeArchiveEntry();
  } catch (IOException ioe) {
    LOG.error("Couldn't create hardlink for file {} while including it in tarball.",
        file.getAbsolutePath(), ioe);
    throw ioe;
  } finally {
    Files.deleteIfExists(link.toPath());
  }
  return bytes;
}
```
Instead, only the hardlink would be created in a tmp directory under the bootstrap lock, and the tar entries corresponding to the links would be written outside the lock (each batch written should also include the hardlink file created for the previous batch).
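
A minimal sketch of how that split might look; the `linkFile` / `writeLinkedFile` names and the `LinkThenWrite` holder are illustrative, not the PR's actual API:

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

import org.apache.commons.compress.archivers.ArchiveOutputStream;
import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.io.IOUtils;

final class LinkThenWrite {
  private LinkThenWrite() { }

  /** Lock phase: only create the hardlink; metadata-only, so the lock is held briefly. */
  static Path linkFile(File file, String entryName, Path tmpDir) throws IOException {
    Path link = tmpDir.resolve(entryName);
    Files.createLink(link, file.toPath());
    return link;
  }

  /** Unlocked phase: stream a previously created link into the tarball. */
  static long writeLinkedFile(Path link, String entryName,
      ArchiveOutputStream<TarArchiveEntry> archiveOutput) throws IOException {
    long bytes;
    try {
      TarArchiveEntry entry = archiveOutput.createArchiveEntry(link.toFile(), entryName);
      archiveOutput.putArchiveEntry(entry);
      try (InputStream input = Files.newInputStream(link)) {
        bytes = IOUtils.copyLarge(input, archiveOutput);
      }
      archiveOutput.closeArchiveEntry();
    } finally {
      Files.deleteIfExists(link); // the link is only needed until it is streamed
    }
    return bytes;
  }
}
```

Since Files.createLink is a metadata-only operation, the lock phase stays cheap regardless of file size; all of the data copying moves outside the lock.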

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-13906

How was this patch tested?

Will add tests

@swamirishi self-requested a review January 5, 2026 14:11
@swamirishi (Contributor) left a comment

Thanks for the patch @sadanand48. I have a few doubts about the implementation.

@sadanand48 added the snapshot (https://issues.apache.org/jira/browse/HDDS-6517) label Jan 5, 2026
@swamirishi (Contributor)

@rnblough Do you wanna take a look at this patch?

Copilot AI left a comment

Pull request overview

This PR reduces the Bootstrap write lock time during OM bootstrapping by refactoring the checkpoint streaming process. The key change is separating the file collection phase (which creates hardlinks while holding the lock) from the actual tar streaming phase (which now happens outside the lock).

Changes:

  • Introduced OMDBArchiver class to manage hardlink creation and deferred tar writing
  • Refactored OMDBCheckpointServletInodeBasedXfer to collect files under lock and write outside lock
  • Updated tests to work with the new two-phase approach

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 15 comments.

| File | Description |
| --- | --- |
| hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/OMDBArchiver.java | New class that manages hardlink creation during the lock phase and tar writing outside the lock |
| hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/OMDBCheckpointServletInodeBasedXfer.java | Refactored to use a two-phase approach: collectDbDataToTransfer (under lock) and writeToArchive (outside lock) |
| hadoop-ozone/ozone-manager/src/test/java/org/apache/hadoop/ozone/om/TestOMDBArchiver.java | New test class covering OMDBArchiver functionality |
| hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/om/TestOMDbCheckpointServletInodeBasedXfer.java | Updated integration tests to work with the new method signatures and two-phase approach |
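
For orientation, a caller-side sketch of how the two phases named above could fit together; collectDbDataToTransfer and writeToArchive are the method names from this review, but their signatures, the Archiver stub, and the lock handling here are assumptions:

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Path;
import java.util.concurrent.locks.Lock;

// Stub with assumed signatures, for illustration only.
interface Archiver {
  void collectDbDataToTransfer(Path tmpDir) throws IOException;
  void writeToArchive(OutputStream out) throws IOException;
}

final class CheckpointFlow {
  private CheckpointFlow() { }

  static void streamCheckpoint(Lock bootstrapLock, Archiver archiver,
      Path tmpDir, OutputStream out) throws IOException {
    bootstrapLock.lock();
    try {
      // Phase 1 (under lock): only hardlink DB files into tmpDir; no tar I/O.
      archiver.collectDbDataToTransfer(tmpDir);
    } finally {
      bootstrapLock.unlock();
    }
    // Phase 2 (outside lock): stream the linked files into the tar response.
    archiver.writeToArchive(out);
  }
}
```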
Comments suppressed due to low confidence (1)

hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/OMDBCheckpointServletInodeBasedXfer.java:192

  • Potential resource leak: If an IOException occurs in the second try block (lines 173-182) after successfully creating the OMDBArchiver and collecting data, the hardlinks created in the tmpDir may not be cleaned up properly. The cleanup in the finally block only deletes the tmpDir, but if writeToArchive fails, the hardlink files stored in filesToWriteIntoTarball map won't be explicitly deleted since the cleanup happens in writeToArchive's finally block. Consider adding explicit cleanup of the archiver's resources in the outer finally block.
```java
    try {
      Instant start = Instant.now();
      OutputStream outputStream = response.getOutputStream();
      omdbArchiver.writeToArchive(getConf(), outputStream);
      Instant end = Instant.now();
      long duration = Duration.between(start, end).toMillis();
      LOG.info("Time taken to write the checkpoint to response output " +
          "stream: {} milliseconds", duration);
    } catch (IOException e) {
      LOG.error("unable to write to archive stream", e);
    } finally {
      try {
        if (tmpdir != null) {
          FileUtils.deleteDirectory(tmpdir.toFile());
        }
      } catch (IOException e) {
        LOG.error("unable to delete: " + tmpdir, e.toString());
      }
    }
  }
```
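
A hedged sketch of the explicit cleanup suggested here, assuming the archiver exposes its collected links as the filesToWriteIntoTarball map mentioned above; the ArchiverCleanup helper and its method are hypothetical names, not part of the PR:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;

import org.apache.commons.io.FileUtils;

// Hypothetical helper: filesToWriteIntoTarball is the field name cited in the
// review comment above; everything else is illustrative.
final class ArchiverCleanup {
  private ArchiverCleanup() { }

  /** Best-effort: delete any hardlinks that were collected but never streamed. */
  static void cleanup(Map<String, Path> filesToWriteIntoTarball, Path tmpDir) {
    for (Path link : filesToWriteIntoTarball.values()) {
      try {
        Files.deleteIfExists(link);
      } catch (IOException e) {
        // keep going so the remaining links are still removed
      }
    }
    try {
      if (tmpDir != null) {
        FileUtils.deleteDirectory(tmpDir.toFile());
      }
    } catch (IOException e) {
      // best-effort; the tmp dir is recreated on the next bootstrap attempt
    }
  }
}
```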


