59 changes: 38 additions & 21 deletions metadata-ingestion/docs/sources/databricks/unity-catalog_pre.md
@@ -4,27 +4,44 @@
- Create a [Databricks Service Principal](https://docs.databricks.com/administration-guide/users-groups/service-principals.html#what-is-a-service-principal)
- You can skip this step and use your own account to get things running quickly,
but we strongly recommend creating a dedicated service principal for production use.

#### Authentication Options

You can authenticate with Databricks using either a Personal Access Token or Azure authentication:

**Option 1: Personal Access Token (PAT)**

- Generate a Databricks Personal Access Token following one of these guides:
- [Service Principals](https://docs.databricks.com/administration-guide/users-groups/service-principals.html#personal-access-tokens)
- [Personal Access Tokens](https://docs.databricks.com/dev-tools/auth.html#databricks-personal-access-tokens)
- Provision your service account:
- To ingest your workspace's metadata and lineage, your service principal must have all of the following:
- One of: metastore admin role, ownership of, or `USE CATALOG` privilege on any catalogs you want to ingest
- One of: metastore admin role, ownership of, or `USE SCHEMA` privilege on any schemas you want to ingest
- Ownership of or `SELECT` privilege on any tables and views you want to ingest
- [Ownership documentation](https://docs.databricks.com/data-governance/unity-catalog/manage-privileges/ownership.html)
- [Privileges documentation](https://docs.databricks.com/data-governance/unity-catalog/manage-privileges/privileges.html)
- To ingest legacy hive_metastore catalog (`include_hive_metastore` - enabled by default), your service principal must have all of the following:
- `READ_METADATA` and `USAGE` privilege on `hive_metastore` catalog
- `READ_METADATA` and `USAGE` privilege on schemas you want to ingest
- `READ_METADATA` and `USAGE` privilege on tables and views you want to ingest
- [Hive Metastore Privileges documentation](https://docs.databricks.com/en/sql/language-manual/sql-ref-privileges-hms.html)
- To ingest your workspace's notebooks and respective lineage, your service principal must have `CAN_READ` privileges on the folders containing the notebooks you want to ingest: [guide](https://docs.databricks.com/en/security/auth-authz/access-control/workspace-acl.html#folder-permissions).
- To `include_usage_statistics` (enabled by default), your service principal must have one of the following:
- `CAN_MANAGE` permissions on any SQL Warehouses you want to ingest: [guide](https://docs.databricks.com/security/auth-authz/access-control/sql-endpoint-acl.html).
- When `usage_data_source` is set to `SYSTEM_TABLES` or `AUTO` (default) with `warehouse_id` configured: `SELECT` privilege on `system.query.history` table for improved performance with large query volumes and multi-workspace setups.
- To ingest `profiling` information with `method: ge`, you need `SELECT` privileges on all profiled tables.
- To ingest `profiling` information with `method: analyze` and `call_analyze: true` (enabled by default), your service principal must have ownership or `MODIFY` privilege on any tables you want to profile.
- Alternatively, you can run [ANALYZE TABLE](https://docs.databricks.com/sql/language-manual/sql-ref-syntax-aux-analyze-table.html) yourself on any tables you want to profile, then set `call_analyze` to `false`.
You will still need `SELECT` privilege on those tables to fetch the results.
- Check the starter recipe below and replace `workspace_url` and `token` with your information from the previous steps.

**Option 2: Azure Authentication (for Azure Databricks)**

- Create an Azure Active Directory application:
- Follow the [Azure AD app registration guide](https://docs.microsoft.com/en-us/azure/active-directory/develop/quickstart-register-app)
- Note down the `client_id` (Application ID), `tenant_id` (Directory ID), and create a `client_secret`
- Grant the Azure AD application access to your Databricks workspace:
- Add the service principal to your Databricks workspace following [this guide](https://docs.databricks.com/administration-guide/users-groups/service-principals.html#add-a-service-principal-to-your-azure-databricks-account-using-the-account-console)

#### Provision your service account:

- To ingest your workspace's metadata and lineage, your service principal must have all of the following:
- One of: metastore admin role, ownership of, or `USE CATALOG` privilege on any catalogs you want to ingest
- One of: metastore admin role, ownership of, or `USE SCHEMA` privilege on any schemas you want to ingest
- Ownership of or `SELECT` privilege on any tables and views you want to ingest
- [Ownership documentation](https://docs.databricks.com/data-governance/unity-catalog/manage-privileges/ownership.html)
- [Privileges documentation](https://docs.databricks.com/data-governance/unity-catalog/manage-privileges/privileges.html)
- To ingest the legacy `hive_metastore` catalog (`include_hive_metastore`, enabled by default), your service principal must have all of the following:
- `READ_METADATA` and `USAGE` privilege on `hive_metastore` catalog
- `READ_METADATA` and `USAGE` privilege on schemas you want to ingest
- `READ_METADATA` and `USAGE` privilege on tables and views you want to ingest
- [Hive Metastore Privileges documentation](https://docs.databricks.com/en/sql/language-manual/sql-ref-privileges-hms.html)
- To ingest your workspace's notebooks and their lineage, your service principal must have `CAN_READ` privileges on the folders containing the notebooks you want to ingest: [guide](https://docs.databricks.com/en/security/auth-authz/access-control/workspace-acl.html#folder-permissions).
- To ingest usage statistics (`include_usage_statistics`, enabled by default), your service principal must have one of the following:
- `CAN_MANAGE` permissions on any SQL Warehouses you want to ingest: [guide](https://docs.databricks.com/security/auth-authz/access-control/sql-endpoint-acl.html).
- When `usage_data_source` is set to `SYSTEM_TABLES` or `AUTO` (default) with `warehouse_id` configured: `SELECT` privilege on `system.query.history` table for improved performance with large query volumes and multi-workspace setups.
- To ingest `profiling` information with `method: ge`, you need `SELECT` privileges on all profiled tables.
- To ingest `profiling` information with `method: analyze` and `call_analyze: true` (enabled by default), your service principal must have ownership or `MODIFY` privilege on any tables you want to profile.
- Alternatively, you can run [ANALYZE TABLE](https://docs.databricks.com/sql/language-manual/sql-ref-syntax-aux-analyze-table.html) yourself on any tables you want to profile, then set `call_analyze` to `false`.
You will still need `SELECT` privilege on those tables to fetch the results.
- Check the starter recipe below and replace `workspace_url` and either `token` (for PAT authentication) or `azure_auth` credentials (for Azure authentication) with your information from the previous steps.
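The same options can also be driven programmatically. Below is a hedged sketch using DataHub's `Pipeline` API; the sink block and its server address are assumptions for illustration and not part of this PR, while the source config mirrors the starter recipe that follows.

from datahub.ingestion.run.pipeline import Pipeline

# Azure-auth variant of the starter recipe; all credential values are placeholders.
pipeline = Pipeline.create(
    {
        "source": {
            "type": "databricks",
            "config": {
                "workspace_url": "https://my-workspace.cloud.databricks.com",
                "azure_auth": {
                    "client_id": "<azure_client_id>",
                    "tenant_id": "<azure_tenant_id>",
                    "client_secret": "<azure_client_secret>",
                },
            },
        },
        # Assumed sink for illustration; point this at your own DataHub instance.
        "sink": {"type": "datahub-rest", "config": {"server": "http://localhost:8080"}},
    }
)
pipeline.run()
pipeline.raise_from_status()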
@@ -2,7 +2,16 @@ source:
  type: databricks
  config:
    workspace_url: https://my-workspace.cloud.databricks.com

    # Authentication Option 1: Personal Access Token
    token: "<token>"

    # Authentication Option 2: Azure Authentication (for Azure Databricks)
    # Uncomment the following section and comment out the token above to use Azure auth
    # azure_auth:
    #   client_id: "<azure_client_id>"
    #   tenant_id: "<azure_tenant_id>"
    #   client_secret: "<azure_client_secret>"
    include_metastore: false
    include_ownership: true
    include_ml_model_aliases: false
15 changes: 15 additions & 0 deletions metadata-ingestion/src/datahub/ingestion/source/unity/azure_auth_config.py
@@ -0,0 +1,15 @@
from pydantic import Field, SecretStr

from datahub.configuration import ConfigModel


class AzureAuthConfig(ConfigModel):
    client_secret: SecretStr = Field(
        description="Azure application client secret used for authentication. This is a confidential credential that should be kept secure."
    )
    client_id: str = Field(
        description="Azure application (client) ID. This is the unique identifier for the registered Azure AD application.",
    )
    tenant_id: str = Field(
        description="Azure tenant (directory) ID. This identifies the Azure AD tenant where the application is registered.",
    )
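As a quick, hypothetical illustration of this model (placeholder values, not part of the PR): pydantic coerces the plain string into a `SecretStr`, which stays masked in reprs and logs until it is explicitly unwrapped.

from datahub.ingestion.source.unity.azure_auth_config import AzureAuthConfig

# Placeholder credentials for illustration only.
auth = AzureAuthConfig(
    client_id="00000000-0000-0000-0000-000000000000",
    tenant_id="11111111-1111-1111-1111-111111111111",
    client_secret="my-client-secret",
)
print(auth.client_secret)                     # masked: **********
print(auth.client_secret.get_secret_value())  # raw secret, only on explicit access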
18 changes: 18 additions & 0 deletions metadata-ingestion/src/datahub/ingestion/source/unity/config.py
@@ -413,6 +413,24 @@ def workspace_url_should_start_with_http_scheme(cls, workspace_url: str) -> str:
            )
        return workspace_url

    @model_validator(mode="before")
    def either_token_or_azure_auth_provided(cls, values: dict) -> dict:
        token = values.get("token")
        azure_auth = values.get("azure_auth")

        # Check if exactly one of the authentication methods is provided
        if not token and not azure_auth:
            raise ValueError(
                "Either 'azure_auth' or 'token' (personal access token) must be provided in the configuration."
            )

        if token and azure_auth:
            raise ValueError(
                "Cannot specify both 'token' and 'azure_auth'. Please provide only one authentication method."
            )

        return values
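In short, `token` and `azure_auth` are mutually exclusive and at least one must be provided. A hedged sketch of the combinations this validator accepts and rejects, showing only the auth-related keys with placeholder values:

# Sketch of accepted / rejected auth combinations for the validator above.
accepted_pat = {"token": "<personal-access-token>"}
accepted_azure = {
    "azure_auth": {
        "client_id": "<azure_client_id>",
        "tenant_id": "<azure_tenant_id>",
        "client_secret": "<azure_client_secret>",
    }
}
rejected_neither = {}                               # neither auth method -> validation error
rejected_both = {**accepted_pat, **accepted_azure}  # both methods -> validation error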

    @field_validator("include_metastore", mode="after")
    @classmethod
    def include_metastore_warning(cls, v: bool) -> bool:
@@ -8,6 +8,7 @@

from datahub.configuration.common import ConfigModel
from datahub.ingestion.source.sql.sqlalchemy_uri import make_sqlalchemy_uri
from datahub.ingestion.source.unity.azure_auth_config import AzureAuthConfig

DATABRICKS = "databricks"

@@ -19,7 +20,12 @@ class UnityCatalogConnectionConfig(ConfigModel):
    """

    scheme: str = DATABRICKS
    token: str = pydantic.Field(description="Databricks personal access token")
    token: Optional[str] = pydantic.Field(
        default=None, description="Databricks personal access token"
    )
    azure_auth: Optional[AzureAuthConfig] = Field(
        default=None, description="Azure configuration"
    )
Comment on lines +23 to +28

Contributor: With both being optional, could you add a pydantic validator that one or the other is required? That is, raise a validation error if both are None.

Contributor Author: Added a validator in metadata-ingestion/src/datahub/ingestion/source/unity/config.py.

Contributor: Cool. I have just merged a PR that prevents usage of pydantic v1 validators, so heads up: you will need to update your pydantic.root_validator to model_validator.

Contributor Author: Thanks for the heads up. I have updated the root_validator to model_validator.

    workspace_url: str = pydantic.Field(
        description="Databricks workspace url. e.g. https://my-workspace.cloud.databricks.com"
    )
@@ -16,10 +16,10 @@ def __init__(self, config: UnityCatalogSourceConfig):
        self.report = UnityCatalogReport()
        self.proxy = UnityCatalogApiProxy(
            self.config.workspace_url,
            self.config.token,
            self.config.profiling.warehouse_id,
            report=self.report,
            databricks_api_page_size=self.config.databricks_api_page_size,
            personal_access_token=self.config.token,
        )

    def get_connection_test(self) -> TestConnectionReport:
26 changes: 19 additions & 7 deletions metadata-ingestion/src/datahub/ingestion/source/unity/proxy.py
@@ -44,6 +44,7 @@
from datahub._version import nice_version_name
from datahub.api.entities.external.unity_catalog_external_entites import UnityCatalogTag
from datahub.emitter.mce_builder import parse_ts_millis
from datahub.ingestion.source.unity.azure_auth_config import AzureAuthConfig
from datahub.ingestion.source.unity.config import (
    LineageDataSource,
    UsageDataSource,
@@ -169,20 +170,31 @@ class UnityCatalogApiProxy(UnityCatalogProxyProfilingMixin):
    def __init__(
        self,
        workspace_url: str,
        personal_access_token: str,
        warehouse_id: Optional[str],
        report: UnityCatalogReport,
        hive_metastore_proxy: Optional[HiveMetastoreProxy] = None,
        lineage_data_source: LineageDataSource = LineageDataSource.AUTO,
        usage_data_source: UsageDataSource = UsageDataSource.AUTO,
        databricks_api_page_size: int = 0,
        personal_access_token: Optional[str] = None,
        azure_auth: Optional[AzureAuthConfig] = None,
    ):
Collaborator: Validation required -- if azure_auth is provided, all three fields (client_id, client_secret, tenant_id) must be present.

Contributor Author: The fields defined in metadata-ingestion/src/datahub/ingestion/source/unity/azure_auth_config.py are neither Optional nor given a default value, so I believe they are already mandatory whenever azure_auth is provided.

Contributor Author: Also added unit test cases for the same.

        self._workspace_client = WorkspaceClient(
            host=workspace_url,
            token=personal_access_token,
            product="datahub",
            product_version=nice_version_name(),
        )
        if azure_auth:
            self._workspace_client = WorkspaceClient(
                host=workspace_url,
                azure_tenant_id=azure_auth.tenant_id,
                azure_client_id=azure_auth.client_id,
                azure_client_secret=azure_auth.client_secret.get_secret_value(),
                product="datahub",
                product_version=nice_version_name(),
            )
        else:
            self._workspace_client = WorkspaceClient(
                host=workspace_url,
                token=personal_access_token,
                product="datahub",
                product_version=nice_version_name(),
            )
        self.warehouse_id = warehouse_id or ""
        self.report = report
        self.hive_metastore_proxy = hive_metastore_proxy
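For illustration, a hypothetical sketch of constructing the proxy with Azure service-principal credentials, following the constructor signature shown above; the workspace URL is a placeholder and the `UnityCatalogReport` import path is an assumption.

from datahub.ingestion.source.unity.azure_auth_config import AzureAuthConfig
from datahub.ingestion.source.unity.proxy import UnityCatalogApiProxy
from datahub.ingestion.source.unity.report import UnityCatalogReport  # import path assumed

# Placeholder credentials; in a real run these come from the azure_auth config block.
azure_auth = AzureAuthConfig(
    client_id="<azure_client_id>",
    tenant_id="<azure_tenant_id>",
    client_secret="<azure_client_secret>",
)

proxy = UnityCatalogApiProxy(
    workspace_url="https://adb-1234567890123456.7.azuredatabricks.net",  # placeholder URL
    warehouse_id=None,
    report=UnityCatalogReport(),
    azure_auth=azure_auth,  # when omitted, personal_access_token is used instead
)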
@@ -211,13 +211,14 @@ def __init__(self, ctx: PipelineContext, config: UnityCatalogSourceConfig):

        self.unity_catalog_api_proxy = UnityCatalogApiProxy(
            config.workspace_url,
            config.token,
            config.warehouse_id,
            report=self.report,
            hive_metastore_proxy=self.hive_metastore_proxy,
            lineage_data_source=config.lineage_data_source,
            usage_data_source=config.usage_data_source,
            databricks_api_page_size=config.databricks_api_page_size,
            personal_access_token=config.token if config.token else None,
            azure_auth=config.azure_auth if config.azure_auth else None,
        )

        self.external_url_base = urljoin(self.config.workspace_url, "/explore/data")