
Commit bdb46d9

Anshul759 and pjain155_uhg authored
feat(databricks): adds Azure oauth to Databricks (#15117)
Co-authored-by: pjain155_uhg <[email protected]>
1 parent 71daaa9 commit bdb46d9

9 files changed: +345 −34 lines changed

metadata-ingestion/docs/sources/databricks/unity-catalog_pre.md

Lines changed: 38 additions & 21 deletions
@@ -4,27 +4,44 @@
 - Create a [Databricks Service Principal](https://docs.databricks.com/administration-guide/users-groups/service-principals.html#what-is-a-service-principal)
   - You can skip this step and use your own account to get things running quickly,
     but we strongly recommend creating a dedicated service principal for production use.
+
+#### Authentication Options
+
+You can authenticate with Databricks using either a Personal Access Token or Azure authentication:
+
+**Option 1: Personal Access Token (PAT)**
+
 - Generate a Databricks Personal Access token following the following guides:
   - [Service Principals](https://docs.databricks.com/administration-guide/users-groups/service-principals.html#personal-access-tokens)
   - [Personal Access Tokens](https://docs.databricks.com/dev-tools/auth.html#databricks-personal-access-tokens)
-- Provision your service account:
-  - To ingest your workspace's metadata and lineage, your service principal must have all of the following:
-    - One of: metastore admin role, ownership of, or `USE CATALOG` privilege on any catalogs you want to ingest
-    - One of: metastore admin role, ownership of, or `USE SCHEMA` privilege on any schemas you want to ingest
-    - Ownership of or `SELECT` privilege on any tables and views you want to ingest
-    - [Ownership documentation](https://docs.databricks.com/data-governance/unity-catalog/manage-privileges/ownership.html)
-    - [Privileges documentation](https://docs.databricks.com/data-governance/unity-catalog/manage-privileges/privileges.html)
-  - To ingest legacy hive_metastore catalog (`include_hive_metastore` - enabled by default), your service principal must have all of the following:
-    - `READ_METADATA` and `USAGE` privilege on `hive_metastore` catalog
-    - `READ_METADATA` and `USAGE` privilege on schemas you want to ingest
-    - `READ_METADATA` and `USAGE` privilege on tables and views you want to ingest
-    - [Hive Metastore Privileges documentation](https://docs.databricks.com/en/sql/language-manual/sql-ref-privileges-hms.html)
-  - To ingest your workspace's notebooks and respective lineage, your service principal must have `CAN_READ` privileges on the folders containing the notebooks you want to ingest: [guide](https://docs.databricks.com/en/security/auth-authz/access-control/workspace-acl.html#folder-permissions).
-  - To `include_usage_statistics` (enabled by default), your service principal must have one of the following:
-    - `CAN_MANAGE` permissions on any SQL Warehouses you want to ingest: [guide](https://docs.databricks.com/security/auth-authz/access-control/sql-endpoint-acl.html).
-    - When `usage_data_source` is set to `SYSTEM_TABLES` or `AUTO` (default) with `warehouse_id` configured: `SELECT` privilege on `system.query.history` table for improved performance with large query volumes and multi-workspace setups.
-  - To ingest `profiling` information with `method: ge`, you need `SELECT` privileges on all profiled tables.
-  - To ingest `profiling` information with `method: analyze` and `call_analyze: true` (enabled by default), your service principal must have ownership or `MODIFY` privilege on any tables you want to profile.
-    - Alternatively, you can run [ANALYZE TABLE](https://docs.databricks.com/sql/language-manual/sql-ref-syntax-aux-analyze-table.html) yourself on any tables you want to profile, then set `call_analyze` to `false`.
-      You will still need `SELECT` privilege on those tables to fetch the results.
-- Check the starter recipe below and replace `workspace_url` and `token` with your information from the previous steps.
+
+**Option 2: Azure Authentication (for Azure Databricks)**
+
+- Create an Azure Active Directory application:
+  - Follow the [Azure AD app registration guide](https://docs.microsoft.com/en-us/azure/active-directory/develop/quickstart-register-app)
+  - Note down the `client_id` (Application ID), `tenant_id` (Directory ID), and create a `client_secret`
+- Grant the Azure AD application access to your Databricks workspace:
+  - Add the service principal to your Databricks workspace following [this guide](https://docs.databricks.com/administration-guide/users-groups/service-principals.html#add-a-service-principal-to-your-azure-databricks-account-using-the-account-console)
+
+#### Provision your service account:
+
+- To ingest your workspace's metadata and lineage, your service principal must have all of the following:
+  - One of: metastore admin role, ownership of, or `USE CATALOG` privilege on any catalogs you want to ingest
+  - One of: metastore admin role, ownership of, or `USE SCHEMA` privilege on any schemas you want to ingest
+  - Ownership of or `SELECT` privilege on any tables and views you want to ingest
+  - [Ownership documentation](https://docs.databricks.com/data-governance/unity-catalog/manage-privileges/ownership.html)
+  - [Privileges documentation](https://docs.databricks.com/data-governance/unity-catalog/manage-privileges/privileges.html)
+- To ingest legacy hive_metastore catalog (`include_hive_metastore` - enabled by default), your service principal must have all of the following:
+  - `READ_METADATA` and `USAGE` privilege on `hive_metastore` catalog
+  - `READ_METADATA` and `USAGE` privilege on schemas you want to ingest
+  - `READ_METADATA` and `USAGE` privilege on tables and views you want to ingest
+  - [Hive Metastore Privileges documentation](https://docs.databricks.com/en/sql/language-manual/sql-ref-privileges-hms.html)
+- To ingest your workspace's notebooks and respective lineage, your service principal must have `CAN_READ` privileges on the folders containing the notebooks you want to ingest: [guide](https://docs.databricks.com/en/security/auth-authz/access-control/workspace-acl.html#folder-permissions).
+- To `include_usage_statistics` (enabled by default), your service principal must have one of the following:
+  - `CAN_MANAGE` permissions on any SQL Warehouses you want to ingest: [guide](https://docs.databricks.com/security/auth-authz/access-control/sql-endpoint-acl.html).
+  - When `usage_data_source` is set to `SYSTEM_TABLES` or `AUTO` (default) with `warehouse_id` configured: `SELECT` privilege on `system.query.history` table for improved performance with large query volumes and multi-workspace setups.
+- To ingest `profiling` information with `method: ge`, you need `SELECT` privileges on all profiled tables.
+- To ingest `profiling` information with `method: analyze` and `call_analyze: true` (enabled by default), your service principal must have ownership or `MODIFY` privilege on any tables you want to profile.
+  - Alternatively, you can run [ANALYZE TABLE](https://docs.databricks.com/sql/language-manual/sql-ref-syntax-aux-analyze-table.html) yourself on any tables you want to profile, then set `call_analyze` to `false`.
+    You will still need `SELECT` privilege on those tables to fetch the results.
+- Check the starter recipe below and replace `workspace_url` and either `token` (for PAT authentication) or `azure_auth` credentials (for Azure authentication) with your information from the previous steps.

metadata-ingestion/docs/sources/databricks/unity-catalog_recipe.yml

Lines changed: 9 additions & 0 deletions
@@ -2,7 +2,16 @@ source:
   type: databricks
   config:
     workspace_url: https://my-workspace.cloud.databricks.com
+
+    # Authentication Option 1: Personal Access Token
     token: "<token>"
+
+    # Authentication Option 2: Azure Authentication (for Azure Databricks)
+    # Uncomment the following section and comment out the token above to use Azure auth
+    # azure_auth:
+    #   client_id: "<azure_client_id>"
+    #   tenant_id: "<azure_tenant_id>"
+    #   client_secret: "<azure_client_secret>"
     include_metastore: false
     include_ownership: true
     include_ml_model_aliases: false
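
As a quick sanity check, the `source.config` block of a recipe like the one above can be parsed straight into the source's config model. A minimal sketch, assuming the recipe is saved locally as `recipe.yml`, that `UnityCatalogSourceConfig` lives in `datahub.ingestion.source.unity.config` (the file changed below), and pydantic v2's `model_validate`:

```python
import yaml

from datahub.ingestion.source.unity.config import UnityCatalogSourceConfig

with open("recipe.yml") as f:
    recipe = yaml.safe_load(f)

# Validation also enforces the new auth rule added in this commit:
# exactly one of 'token' or 'azure_auth' must be present.
config = UnityCatalogSourceConfig.model_validate(recipe["source"]["config"])
print(config.workspace_url)
```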
metadata-ingestion/src/datahub/ingestion/source/unity/azure_auth_config.py

Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
+from pydantic import Field, SecretStr
+
+from datahub.configuration import ConfigModel
+
+
+class AzureAuthConfig(ConfigModel):
+    client_secret: SecretStr = Field(
+        description="Azure application client secret used for authentication. This is a confidential credential that should be kept secure."
+    )
+    client_id: str = Field(
+        description="Azure application (client) ID. This is the unique identifier for the registered Azure AD application.",
+    )
+    tenant_id: str = Field(
+        description="Azure tenant (directory) ID. This identifies the Azure AD tenant where the application is registered.",
+    )
metadata-ingestion/src/datahub/ingestion/source/unity/config.py

Lines changed: 18 additions & 0 deletions
@@ -413,6 +413,24 @@ def workspace_url_should_start_with_http_scheme(cls, workspace_url: str) -> str:
             )
         return workspace_url

+    @model_validator(mode="before")
+    def either_token_or_azure_auth_provided(cls, values: dict) -> dict:
+        token = values.get("token")
+        azure_auth = values.get("azure_auth")
+
+        # Check if exactly one of the authentication methods is provided
+        if not token and not azure_auth:
+            raise ValueError(
+                "Either 'azure_auth' or 'token' (personal access token) must be provided in the configuration."
+            )
+
+        if token and azure_auth:
+            raise ValueError(
+                "Cannot specify both 'token' and 'azure_auth'. Please provide only one authentication method."
+            )
+
+        return values
+
     @field_validator("include_metastore", mode="after")
     @classmethod
     def include_metastore_warning(cls, v: bool) -> bool:
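
A sketch of what this `before`-mode validator rejects, assuming only `workspace_url` is otherwise required; because the validator runs before field validation and raises `ValueError`, the messages above surface inside pydantic's `ValidationError`:

```python
import pytest
from pydantic import ValidationError

from datahub.ingestion.source.unity.config import UnityCatalogSourceConfig

BASE = {"workspace_url": "https://my-workspace.cloud.databricks.com"}

# Neither auth method: rejected.
with pytest.raises(ValidationError, match="Either 'azure_auth' or 'token'"):
    UnityCatalogSourceConfig.model_validate(BASE)

# Both at once: also rejected.
with pytest.raises(ValidationError, match="Cannot specify both"):
    UnityCatalogSourceConfig.model_validate(
        {
            **BASE,
            "token": "<token>",
            "azure_auth": {
                "client_id": "<azure_client_id>",
                "tenant_id": "<azure_tenant_id>",
                "client_secret": "<azure_client_secret>",
            },
        }
    )
```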

metadata-ingestion/src/datahub/ingestion/source/unity/connection.py

Lines changed: 7 additions & 1 deletion
@@ -8,6 +8,7 @@

 from datahub.configuration.common import ConfigModel
 from datahub.ingestion.source.sql.sqlalchemy_uri import make_sqlalchemy_uri
+from datahub.ingestion.source.unity.azure_auth_config import AzureAuthConfig

 DATABRICKS = "databricks"

@@ -19,7 +20,12 @@ class UnityCatalogConnectionConfig(ConfigModel):
     """

     scheme: str = DATABRICKS
-    token: str = pydantic.Field(description="Databricks personal access token")
+    token: Optional[str] = pydantic.Field(
+        default=None, description="Databricks personal access token"
+    )
+    azure_auth: Optional[AzureAuthConfig] = Field(
+        default=None, description="Azure configuration"
+    )
     workspace_url: str = pydantic.Field(
         description="Databricks workspace url. e.g. https://my-workspace.cloud.databricks.com"
     )
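
For illustration, the two shapes the connection config now accepts; all values are placeholders, and a nested dict is parsed into `AzureAuthConfig` by pydantic:

```python
from datahub.ingestion.source.unity.connection import UnityCatalogConnectionConfig

# Option 1: personal access token.
pat_conn = UnityCatalogConnectionConfig(
    workspace_url="https://my-workspace.cloud.databricks.com",
    token="<token>",
)

# Option 2: Azure service principal.
azure_conn = UnityCatalogConnectionConfig(
    workspace_url="https://my-workspace.azuredatabricks.net",
    azure_auth={
        "client_id": "<azure_client_id>",
        "tenant_id": "<azure_tenant_id>",
        "client_secret": "<azure_client_secret>",
    },
)
```

Note that the either/or check shown above lives in `config.py`, so both fields here are independently optional.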

metadata-ingestion/src/datahub/ingestion/source/unity/connection_test.py

Lines changed: 1 addition & 1 deletion
@@ -16,10 +16,10 @@ def __init__(self, config: UnityCatalogSourceConfig):
         self.report = UnityCatalogReport()
         self.proxy = UnityCatalogApiProxy(
             self.config.workspace_url,
-            self.config.token,
             self.config.profiling.warehouse_id,
             report=self.report,
             databricks_api_page_size=self.config.databricks_api_page_size,
+            personal_access_token=self.config.token,
         )

     def get_connection_test(self) -> TestConnectionReport:

metadata-ingestion/src/datahub/ingestion/source/unity/proxy.py

Lines changed: 19 additions & 7 deletions
@@ -44,6 +44,7 @@
 from datahub._version import nice_version_name
 from datahub.api.entities.external.unity_catalog_external_entites import UnityCatalogTag
 from datahub.emitter.mce_builder import parse_ts_millis
+from datahub.ingestion.source.unity.azure_auth_config import AzureAuthConfig
 from datahub.ingestion.source.unity.config import (
     LineageDataSource,
     UsageDataSource,
@@ -169,20 +170,31 @@ class UnityCatalogApiProxy(UnityCatalogProxyProfilingMixin):
     def __init__(
         self,
         workspace_url: str,
-        personal_access_token: str,
         warehouse_id: Optional[str],
         report: UnityCatalogReport,
         hive_metastore_proxy: Optional[HiveMetastoreProxy] = None,
         lineage_data_source: LineageDataSource = LineageDataSource.AUTO,
         usage_data_source: UsageDataSource = UsageDataSource.AUTO,
         databricks_api_page_size: int = 0,
+        personal_access_token: Optional[str] = None,
+        azure_auth: Optional[AzureAuthConfig] = None,
     ):
-        self._workspace_client = WorkspaceClient(
-            host=workspace_url,
-            token=personal_access_token,
-            product="datahub",
-            product_version=nice_version_name(),
-        )
+        if azure_auth:
+            self._workspace_client = WorkspaceClient(
+                host=workspace_url,
+                azure_tenant_id=azure_auth.tenant_id,
+                azure_client_id=azure_auth.client_id,
+                azure_client_secret=azure_auth.client_secret.get_secret_value(),
+                product="datahub",
+                product_version=nice_version_name(),
+            )
+        else:
+            self._workspace_client = WorkspaceClient(
+                host=workspace_url,
+                token=personal_access_token,
+                product="datahub",
+                product_version=nice_version_name(),
+            )
         self.warehouse_id = warehouse_id or ""
         self.report = report
         self.hive_metastore_proxy = hive_metastore_proxy
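
For context, a standalone sketch of the same Azure branch, assuming the `databricks-sdk` package that DataHub wraps here; the `azure_tenant_id`, `azure_client_id`, and `azure_client_secret` keyword arguments select the SDK's Azure client-secret (OAuth client-credentials) auth:

```python
from databricks.sdk import WorkspaceClient

# Placeholder values; in the proxy these come from AzureAuthConfig.
client = WorkspaceClient(
    host="https://my-workspace.azuredatabricks.net",
    azure_tenant_id="<azure_tenant_id>",
    azure_client_id="<azure_client_id>",
    azure_client_secret="<azure_client_secret>",
)

# Smoke test: any authenticated call exercises the OAuth flow.
print(client.current_user.me().user_name)
```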

metadata-ingestion/src/datahub/ingestion/source/unity/source.py

Lines changed: 2 additions & 1 deletion
@@ -211,13 +211,14 @@ def __init__(self, ctx: PipelineContext, config: UnityCatalogSourceConfig):

         self.unity_catalog_api_proxy = UnityCatalogApiProxy(
             config.workspace_url,
-            config.token,
             config.warehouse_id,
             report=self.report,
             hive_metastore_proxy=self.hive_metastore_proxy,
             lineage_data_source=config.lineage_data_source,
             usage_data_source=config.usage_data_source,
             databricks_api_page_size=config.databricks_api_page_size,
+            personal_access_token=config.token if config.token else None,
+            azure_auth=config.azure_auth if config.azure_auth else None,
         )

         self.external_url_base = urljoin(self.config.workspace_url, "/explore/data")
