@AnudeepKonaboina AnudeepKonaboina commented Dec 2, 2025

Which Delta project/connector is this regarding?

  • Spark
  • Standalone
  • Flink
  • Kernel
  • Other (fill in here)

Description

Improve observability of Delta Lake VACUUM operations by including the table identifier (table ID / table path) in all VACUUM‑related log messages. This helps users understand exactly which table is being vacuumed when multiple Delta tables are processed within the same job. Although the table path already appears in the logs when the VACUUM job starts, paths are often very long, which makes them hard to search for and awkward to debug with. This fixes #5594.

Currently, when a user runs VACUUM on multiple tables, the log messages do not include a table ID, and the final "Deleted ..." message does not include the table path either. When multiple tables are processed in a single job, users cannot reliably determine from the logs alone which table is being cleaned up:

Current logs:

25/11/27 05:05:14 INFO VacuumCommand: Starting garbage collection (dryRun = false) of untracked files older than 20 Nov 2025 05:05:14 GMT in /tmp/delta/vacuum
25/11/27 05:05:38 INFO VacuumCommand: Starting garbage collection (dryRun = false) of untracked files older than 20 Nov 2025 05:05:38 GMT in  /tmp/delta/orders
25/11/27 05:05:44 INFO VacuumCommand: Deleting untracked files and empty directories in  /tmp/delta/orders The amount of data to be deleted is 0 (in bytes)
25/11/27 05:05:26 INFO VacuumCommand: Deleting untracked files and empty directories in  /tmp/delta/vacuum. The amount of data to be deleted is 0 (in bytes)
25/11/27 05:05:29 INFO VacuumCommand: Deleted 0 files (0 bytes) and directories in a total of 1 directories. Vacuum stats: DeltaVacuumStats(false,None,604800000,1763615114082,1,4,0,0,10038,1158,1764219914030,1764219929008,8,8,8,false,0,0,7,None,None,FULL)

25/11/27 05:05:45 INFO VacuumCommand: Deleted 0 files (0 bytes) and directories in a total of 1 directories. Vacuum stats: DeltaVacuumStats(false,None,604800000,1763615138428,1,4,0,0,4887,938,1764219938390,1764219945316,8,8,8,false,0,0,5,None,None,LITE)

How was this patch tested?

Ran the code below on delta-3.3.0 before the change, then ran the same code against an assembly jar built with the fix. The difference in the logs is shown below.

Code:

import scala.sys.process._
import scala.util.Try

val path = "/tmp/delta_vacuum"

// Clean up from previous runs (best effort; any failure here is ignored)
Try(Seq("rm", "-rf", path).!)

// Write a simple Delta table
spark.range(10).write.format("delta").mode("overwrite").save(path)

// Make a few commits so VACUUM has something to work with
spark.range(10, 20).write.format("delta").mode("append").save(path)
spark.range(20, 30).write.format("delta").mode("append").save(path)

// Disable retention check for testing
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")

// Run VACUUM
spark.sql(s"VACUUM delta.`$path` RETAIN 0 HOURS")

Logs before the fix:

25/11/27 05:05:38 INFO VacuumCommand: Starting garbage collection (dryRun = false) of untracked files older than 20 Nov 2025 05:05:38 GMT in  /tmp/delta/orders
25/11/27 05:05:44 INFO VacuumCommand: Deleting untracked files and empty directories in  /tmp/delta/orders The amount of data to be deleted is 0 (in bytes)
25/11/27 05:05:45 INFO VacuumCommand: Deleted 0 files (0 bytes) and directories in a total of 1 directories. Vacuum stats: DeltaVacuumStats(false,None,604800000,1763615138428,1,4,0,0,4887,938,1764219938390,1764219945316,8,8,8,false,0,0,5,None,None,FULL)

Logs with this fix:

25/11/27 14:36:18 INFO VacuumCommand: [tableId=e63846fa] [VACUUM_FULL] Starting garbage collection (dryRun = false) of untracked files older than 20 Nov 2025 05:05:38 GMT in  /tmp/delta/orders
25/11/27 14:36:30 INFO VacuumCommand: [tableId=e63846fa] [VACUUM_FULL] Deleting untracked files and empty directories in  /tmp/delta/orders The amount of data to be deleted is 0 (in bytes)
25/11/27 14:36:40 INFO VacuumCommand: [tableId=e63846fa] [VACUUM_FULL] Deleted 13 files (6473 bytes) and directories in a total of 1 directories. Vacuum stats: DeltaVacuumStats(false,Some(0),604800000,1764254189829,1,25,13,6473,7982,581,1764254189825,1764254199336,0,13,None,None,FULL)
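
The prefix seen in the log lines above could be produced by a small helper along the following lines. This is only a sketch, not the code in this PR: the object name `VacuumLogPrefix`, the method `prefix`, and the choice to shorten the table ID to the part before the first hyphen are all assumptions made for illustration, based solely on the `[tableId=e63846fa] [VACUUM_FULL]` shape shown in the logs.

```scala
// Hypothetical helper, for illustration only; not the implementation in this PR.
object VacuumLogPrefix {

  /** Builds a prefix such as "[tableId=e63846fa] [VACUUM_FULL]".
    *
    * @param tableId    the table's ID (assumed here to be UUID-like)
    * @param vacuumType the vacuum mode, e.g. "FULL" or "LITE"
    */
  def prefix(tableId: String, vacuumType: String): String = {
    // Assumption: only the segment before the first '-' is kept,
    // which keeps the prefix short and easy to search for.
    val shortId = tableId.takeWhile(_ != '-')
    s"[tableId=$shortId] [VACUUM_${vacuumType.toUpperCase}]"
  }

  /** Prepends the prefix to an existing log message. */
  def withPrefix(tableId: String, vacuumType: String, message: String): String =
    s"${prefix(tableId, vacuumType)} $message"
}
```

With a made-up ID, `VacuumLogPrefix.withPrefix("e63846fa-1111-2222-3333-444444444444", "full", "Starting garbage collection")` would yield a message beginning with `[tableId=e63846fa] [VACUUM_FULL]`. Building the prefix in one place keeps every VACUUM message consistent, so a single search string matches all of them.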

Does this PR introduce any user-facing changes?

Yes, it improves observability and debugging. To get all the VACUUM logs for a specific table, users can now search for the string "VacuumCommand: [tableId=<table_id>]".
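
As a rough illustration of that search, the sketch below filters log lines down to the VACUUM messages for one table. The object name `VacuumLogFilter` and the method `forTable` are hypothetical and exist only for this example; the only thing taken from the PR is the search string itself.

```scala
// Hypothetical utility, for illustration only.
object VacuumLogFilter {

  /** Keeps only the VacuumCommand log lines tagged with the given table ID. */
  def forTable(lines: Iterator[String], tableId: String): Iterator[String] = {
    // Search string as described above: "VacuumCommand: [tableId=<table_id>]"
    val marker = s"VacuumCommand: [tableId=$tableId]"
    lines.filter(_.contains(marker))
  }
}
```

For a driver log on disk, this could be used as `VacuumLogFilter.forTable(scala.io.Source.fromFile(logPath).getLines(), "e63846fa")`, where `logPath` stands in for wherever the logs are written. The same result can be had with any plain-text search tool; the point is that one fixed string now selects every VACUUM message for a table.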

@AnudeepKonaboina AnudeepKonaboina changed the title [Feature] Vacuum table_id prefix logging [Feature] Vacuum table_id prefix logging for observability Dec 2, 2025
@AnudeepKonaboina AnudeepKonaboina changed the title [Feature] Vacuum table_id prefix logging for observability [Feature] Adding table_id prefix for VACUUM logging for observability Dec 2, 2025
@AnudeepKonaboina AnudeepKonaboina force-pushed the feature/issue-5594 branch 6 times, most recently from 31e75b3 to 4e64721 Compare December 3, 2025 14:41

Development

Successfully merging this pull request may close these issues.

[Feature Request][Spark] Add table identifier for VACUUM log messages for better observability
