
Commit bdb46d9

Anshul759 and pjain155_uhg authored
feat(databricks): adds Azure oauth to Databricks (#15117)
Co-authored-by: pjain155_uhg <[email protected]>
1 parent 71daaa9 commit bdb46d9

9 files changed: +345 −34 lines changed

metadata-ingestion/docs/sources/databricks/unity-catalog_pre.md

Lines changed: 38 additions & 21 deletions
@@ -4,27 +4,44 @@
 - Create a [Databricks Service Principal](https://docs.databricks.com/administration-guide/users-groups/service-principals.html#what-is-a-service-principal)
   - You can skip this step and use your own account to get things running quickly,
     but we strongly recommend creating a dedicated service principal for production use.
+
+#### Authentication Options
+
+You can authenticate with Databricks using either a Personal Access Token or Azure authentication:
+
+**Option 1: Personal Access Token (PAT)**
+
 - Generate a Databricks Personal Access token following the following guides:
   - [Service Principals](https://docs.databricks.com/administration-guide/users-groups/service-principals.html#personal-access-tokens)
   - [Personal Access Tokens](https://docs.databricks.com/dev-tools/auth.html#databricks-personal-access-tokens)
-- Provision your service account:
-  - To ingest your workspace's metadata and lineage, your service principal must have all of the following:
-    - One of: metastore admin role, ownership of, or `USE CATALOG` privilege on any catalogs you want to ingest
-    - One of: metastore admin role, ownership of, or `USE SCHEMA` privilege on any schemas you want to ingest
-    - Ownership of or `SELECT` privilege on any tables and views you want to ingest
-    - [Ownership documentation](https://docs.databricks.com/data-governance/unity-catalog/manage-privileges/ownership.html)
-    - [Privileges documentation](https://docs.databricks.com/data-governance/unity-catalog/manage-privileges/privileges.html)
-  - To ingest legacy hive_metastore catalog (`include_hive_metastore` - enabled by default), your service principal must have all of the following:
-    - `READ_METADATA` and `USAGE` privilege on `hive_metastore` catalog
-    - `READ_METADATA` and `USAGE` privilege on schemas you want to ingest
-    - `READ_METADATA` and `USAGE` privilege on tables and views you want to ingest
-    - [Hive Metastore Privileges documentation](https://docs.databricks.com/en/sql/language-manual/sql-ref-privileges-hms.html)
-  - To ingest your workspace's notebooks and respective lineage, your service principal must have `CAN_READ` privileges on the folders containing the notebooks you want to ingest: [guide](https://docs.databricks.com/en/security/auth-authz/access-control/workspace-acl.html#folder-permissions).
-  - To `include_usage_statistics` (enabled by default), your service principal must have one of the following:
-    - `CAN_MANAGE` permissions on any SQL Warehouses you want to ingest: [guide](https://docs.databricks.com/security/auth-authz/access-control/sql-endpoint-acl.html).
-    - When `usage_data_source` is set to `SYSTEM_TABLES` or `AUTO` (default) with `warehouse_id` configured: `SELECT` privilege on `system.query.history` table for improved performance with large query volumes and multi-workspace setups.
-  - To ingest `profiling` information with `method: ge`, you need `SELECT` privileges on all profiled tables.
-  - To ingest `profiling` information with `method: analyze` and `call_analyze: true` (enabled by default), your service principal must have ownership or `MODIFY` privilege on any tables you want to profile.
-    - Alternatively, you can run [ANALYZE TABLE](https://docs.databricks.com/sql/language-manual/sql-ref-syntax-aux-analyze-table.html) yourself on any tables you want to profile, then set `call_analyze` to `false`.
-      You will still need `SELECT` privilege on those tables to fetch the results.
-- Check the starter recipe below and replace `workspace_url` and `token` with your information from the previous steps.
+
+**Option 2: Azure Authentication (for Azure Databricks)**
+
+- Create an Azure Active Directory application:
+  - Follow the [Azure AD app registration guide](https://docs.microsoft.com/en-us/azure/active-directory/develop/quickstart-register-app)
+  - Note down the `client_id` (Application ID), `tenant_id` (Directory ID), and create a `client_secret`
+- Grant the Azure AD application access to your Databricks workspace:
+  - Add the service principal to your Databricks workspace following [this guide](https://docs.databricks.com/administration-guide/users-groups/service-principals.html#add-a-service-principal-to-your-azure-databricks-account-using-the-account-console)
+
+#### Provision your service account:
+
+- To ingest your workspace's metadata and lineage, your service principal must have all of the following:
+  - One of: metastore admin role, ownership of, or `USE CATALOG` privilege on any catalogs you want to ingest
+  - One of: metastore admin role, ownership of, or `USE SCHEMA` privilege on any schemas you want to ingest
+  - Ownership of or `SELECT` privilege on any tables and views you want to ingest
+  - [Ownership documentation](https://docs.databricks.com/data-governance/unity-catalog/manage-privileges/ownership.html)
+  - [Privileges documentation](https://docs.databricks.com/data-governance/unity-catalog/manage-privileges/privileges.html)
+- To ingest legacy hive_metastore catalog (`include_hive_metastore` - enabled by default), your service principal must have all of the following:
+  - `READ_METADATA` and `USAGE` privilege on `hive_metastore` catalog
+  - `READ_METADATA` and `USAGE` privilege on schemas you want to ingest
+  - `READ_METADATA` and `USAGE` privilege on tables and views you want to ingest
+  - [Hive Metastore Privileges documentation](https://docs.databricks.com/en/sql/language-manual/sql-ref-privileges-hms.html)
+- To ingest your workspace's notebooks and respective lineage, your service principal must have `CAN_READ` privileges on the folders containing the notebooks you want to ingest: [guide](https://docs.databricks.com/en/security/auth-authz/access-control/workspace-acl.html#folder-permissions).
+- To `include_usage_statistics` (enabled by default), your service principal must have one of the following:
+  - `CAN_MANAGE` permissions on any SQL Warehouses you want to ingest: [guide](https://docs.databricks.com/security/auth-authz/access-control/sql-endpoint-acl.html).
+  - When `usage_data_source` is set to `SYSTEM_TABLES` or `AUTO` (default) with `warehouse_id` configured: `SELECT` privilege on `system.query.history` table for improved performance with large query volumes and multi-workspace setups.
+- To ingest `profiling` information with `method: ge`, you need `SELECT` privileges on all profiled tables.
+- To ingest `profiling` information with `method: analyze` and `call_analyze: true` (enabled by default), your service principal must have ownership or `MODIFY` privilege on any tables you want to profile.
+  - Alternatively, you can run [ANALYZE TABLE](https://docs.databricks.com/sql/language-manual/sql-ref-syntax-aux-analyze-table.html) yourself on any tables you want to profile, then set `call_analyze` to `false`.
+    You will still need `SELECT` privilege on those tables to fetch the results.
+- Check the starter recipe below and replace `workspace_url` and either `token` (for PAT authentication) or `azure_auth` credentials (for Azure authentication) with your information from the previous steps.

metadata-ingestion/docs/sources/databricks/unity-catalog_recipe.yml

Lines changed: 9 additions & 0 deletions
@@ -2,7 +2,16 @@ source:
   type: databricks
   config:
     workspace_url: https://my-workspace.cloud.databricks.com
+
+    # Authentication Option 1: Personal Access Token
     token: "<token>"
+
+    # Authentication Option 2: Azure Authentication (for Azure Databricks)
+    # Uncomment the following section and comment out the token above to use Azure auth
+    # azure_auth:
+    #   client_id: "<azure_client_id>"
+    #   tenant_id: "<azure_tenant_id>"
+    #   client_secret: "<azure_client_secret>"
     include_metastore: false
     include_ownership: true
     include_ml_model_aliases: false
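
As a quick sanity check, the `source.config` block of a recipe like the one above can be parsed straight into the source's config model. A minimal sketch, assuming the recipe is saved locally as `recipe.yml`, that `UnityCatalogSourceConfig` lives in `datahub.ingestion.source.unity.config` (the file changed below), and pydantic v2's `model_validate`:

```python
import yaml

from datahub.ingestion.source.unity.config import UnityCatalogSourceConfig

with open("recipe.yml") as f:
    recipe = yaml.safe_load(f)

# Validation also enforces the new auth rule added in this commit:
# exactly one of 'token' or 'azure_auth' must be present.
config = UnityCatalogSourceConfig.model_validate(recipe["source"]["config"])
print(config.workspace_url)
```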
metadata-ingestion/src/datahub/ingestion/source/unity/azure_auth_config.py

Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
+from pydantic import Field, SecretStr
+
+from datahub.configuration import ConfigModel
+
+
+class AzureAuthConfig(ConfigModel):
+    client_secret: SecretStr = Field(
+        description="Azure application client secret used for authentication. This is a confidential credential that should be kept secure."
+    )
+    client_id: str = Field(
+        description="Azure application (client) ID. This is the unique identifier for the registered Azure AD application.",
+    )
+    tenant_id: str = Field(
+        description="Azure tenant (directory) ID. This identifies the Azure AD tenant where the application is registered.",
+    )
metadata-ingestion/src/datahub/ingestion/source/unity/config.py

Lines changed: 18 additions & 0 deletions
@@ -413,6 +413,24 @@ def workspace_url_should_start_with_http_scheme(cls, workspace_url: str) -> str:
             )
         return workspace_url

+    @model_validator(mode="before")
+    def either_token_or_azure_auth_provided(cls, values: dict) -> dict:
+        token = values.get("token")
+        azure_auth = values.get("azure_auth")
+
+        # Check if exactly one of the authentication methods is provided
+        if not token and not azure_auth:
+            raise ValueError(
+                "Either 'azure_auth' or 'token' (personal access token) must be provided in the configuration."
+            )
+
+        if token and azure_auth:
+            raise ValueError(
+                "Cannot specify both 'token' and 'azure_auth'. Please provide only one authentication method."
+            )
+
+        return values
+
     @field_validator("include_metastore", mode="after")
     @classmethod
     def include_metastore_warning(cls, v: bool) -> bool:
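
A sketch of what this `before`-mode validator rejects, assuming only `workspace_url` is otherwise required; because the validator runs before field validation and raises `ValueError`, the messages above surface inside pydantic's `ValidationError`:

```python
import pytest
from pydantic import ValidationError

from datahub.ingestion.source.unity.config import UnityCatalogSourceConfig

BASE = {"workspace_url": "https://my-workspace.cloud.databricks.com"}

# Neither auth method: rejected.
with pytest.raises(ValidationError, match="Either 'azure_auth' or 'token'"):
    UnityCatalogSourceConfig.model_validate(BASE)

# Both at once: also rejected.
with pytest.raises(ValidationError, match="Cannot specify both"):
    UnityCatalogSourceConfig.model_validate(
        {
            **BASE,
            "token": "<token>",
            "azure_auth": {
                "client_id": "<azure_client_id>",
                "tenant_id": "<azure_tenant_id>",
                "client_secret": "<azure_client_secret>",
            },
        }
    )
```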

metadata-ingestion/src/datahub/ingestion/source/unity/connection.py

Lines changed: 7 additions & 1 deletion
@@ -8,6 +8,7 @@

 from datahub.configuration.common import ConfigModel
 from datahub.ingestion.source.sql.sqlalchemy_uri import make_sqlalchemy_uri
+from datahub.ingestion.source.unity.azure_auth_config import AzureAuthConfig

 DATABRICKS = "databricks"

@@ -19,7 +20,12 @@ class UnityCatalogConnectionConfig(ConfigModel):
     """

     scheme: str = DATABRICKS
-    token: str = pydantic.Field(description="Databricks personal access token")
+    token: Optional[str] = pydantic.Field(
+        default=None, description="Databricks personal access token"
+    )
+    azure_auth: Optional[AzureAuthConfig] = Field(
+        default=None, description="Azure configuration"
+    )
     workspace_url: str = pydantic.Field(
         description="Databricks workspace url. e.g. https://my-workspace.cloud.databricks.com"
     )
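
For illustration, the two shapes the connection config now accepts; all values are placeholders, and a nested dict is parsed into `AzureAuthConfig` by pydantic:

```python
from datahub.ingestion.source.unity.connection import UnityCatalogConnectionConfig

# Option 1: personal access token.
pat_conn = UnityCatalogConnectionConfig(
    workspace_url="https://my-workspace.cloud.databricks.com",
    token="<token>",
)

# Option 2: Azure service principal.
azure_conn = UnityCatalogConnectionConfig(
    workspace_url="https://my-workspace.azuredatabricks.net",
    azure_auth={
        "client_id": "<azure_client_id>",
        "tenant_id": "<azure_tenant_id>",
        "client_secret": "<azure_client_secret>",
    },
)
```

Note that the either/or check shown above lives in `config.py`, so both fields here are independently optional.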

metadata-ingestion/src/datahub/ingestion/source/unity/connection_test.py

Lines changed: 1 addition & 1 deletion
@@ -16,10 +16,10 @@ def __init__(self, config: UnityCatalogSourceConfig):
         self.report = UnityCatalogReport()
         self.proxy = UnityCatalogApiProxy(
             self.config.workspace_url,
-            self.config.token,
             self.config.profiling.warehouse_id,
             report=self.report,
             databricks_api_page_size=self.config.databricks_api_page_size,
+            personal_access_token=self.config.token,
         )

     def get_connection_test(self) -> TestConnectionReport:

metadata-ingestion/src/datahub/ingestion/source/unity/proxy.py

Lines changed: 19 additions & 7 deletions
@@ -44,6 +44,7 @@
 from datahub._version import nice_version_name
 from datahub.api.entities.external.unity_catalog_external_entites import UnityCatalogTag
 from datahub.emitter.mce_builder import parse_ts_millis
+from datahub.ingestion.source.unity.azure_auth_config import AzureAuthConfig
 from datahub.ingestion.source.unity.config import (
     LineageDataSource,
     UsageDataSource,
@@ -169,20 +170,31 @@ class UnityCatalogApiProxy(UnityCatalogProxyProfilingMixin):
     def __init__(
         self,
         workspace_url: str,
-        personal_access_token: str,
         warehouse_id: Optional[str],
         report: UnityCatalogReport,
         hive_metastore_proxy: Optional[HiveMetastoreProxy] = None,
         lineage_data_source: LineageDataSource = LineageDataSource.AUTO,
         usage_data_source: UsageDataSource = UsageDataSource.AUTO,
         databricks_api_page_size: int = 0,
+        personal_access_token: Optional[str] = None,
+        azure_auth: Optional[AzureAuthConfig] = None,
     ):
-        self._workspace_client = WorkspaceClient(
-            host=workspace_url,
-            token=personal_access_token,
-            product="datahub",
-            product_version=nice_version_name(),
-        )
+        if azure_auth:
+            self._workspace_client = WorkspaceClient(
+                host=workspace_url,
+                azure_tenant_id=azure_auth.tenant_id,
+                azure_client_id=azure_auth.client_id,
+                azure_client_secret=azure_auth.client_secret.get_secret_value(),
+                product="datahub",
+                product_version=nice_version_name(),
+            )
+        else:
+            self._workspace_client = WorkspaceClient(
+                host=workspace_url,
+                token=personal_access_token,
+                product="datahub",
+                product_version=nice_version_name(),
+            )
         self.warehouse_id = warehouse_id or ""
         self.report = report
         self.hive_metastore_proxy = hive_metastore_proxy
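
For context, a standalone sketch of the same Azure branch, assuming the `databricks-sdk` package that DataHub wraps here; the `azure_tenant_id`, `azure_client_id`, and `azure_client_secret` keyword arguments select the SDK's Azure client-secret (OAuth client-credentials) auth:

```python
from databricks.sdk import WorkspaceClient

# Placeholder values; in the proxy these come from AzureAuthConfig.
client = WorkspaceClient(
    host="https://my-workspace.azuredatabricks.net",
    azure_tenant_id="<azure_tenant_id>",
    azure_client_id="<azure_client_id>",
    azure_client_secret="<azure_client_secret>",
)

# Smoke test: any authenticated call exercises the OAuth flow.
print(client.current_user.me().user_name)
```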

metadata-ingestion/src/datahub/ingestion/source/unity/source.py

Lines changed: 2 additions & 1 deletion
@@ -211,13 +211,14 @@ def __init__(self, ctx: PipelineContext, config: UnityCatalogSourceConfig):

         self.unity_catalog_api_proxy = UnityCatalogApiProxy(
             config.workspace_url,
-            config.token,
             config.warehouse_id,
             report=self.report,
             hive_metastore_proxy=self.hive_metastore_proxy,
             lineage_data_source=config.lineage_data_source,
             usage_data_source=config.usage_data_source,
             databricks_api_page_size=config.databricks_api_page_size,
+            personal_access_token=config.token if config.token else None,
+            azure_auth=config.azure_auth if config.azure_auth else None,
         )

         self.external_url_base = urljoin(self.config.workspace_url, "/explore/data")
