[DOCS] Update syncing DataHub docs (#12504)
sgomezvillamor authored Dec 18, 2024
1 parent d518dd6 commit 20b733b
Showing 1 changed file with 25 additions and 2 deletions.
27 changes: 25 additions & 2 deletions website/docs/syncing_datahub.md
@@ -9,8 +9,26 @@ observability, federated governance, etc.
Since Hudi 0.11.0, you can now sync to a DataHub instance by setting `DataHubSyncTool` as one of the sync tool classes
for `HoodieStreamer`.

The target Hudi table will be sync'ed to DataHub as a `Dataset`. The Hudi table's avro schema will be sync'ed, along
with the commit timestamp when running the sync.
The target Hudi table will be sync'ed to DataHub as a `Dataset`, which will be created with the following properties:

* Hudi table properties and partitioning information
* Spark-related properties
* User-defined properties
* The last commit and the last commit completion timestamps

Additionally, the `Dataset` object will include the following metadata:

* sub-type as `Table`
* browse path
* parent container
* Avro schema
* optionally, an attached `Domain` object

Also, the parent database will be sync'ed to DataHub as a `Container`, including the following metadata:

* sub-type as `Database`
* browse paths
* optionally, an attached `Domain` object

### Configurations

@@ -27,6 +45,11 @@ By default, the sync config's database name and table name will be used to make
Subclass `HoodieDataHubDatasetIdentifier` and set it to `hoodie.meta.sync.datahub.dataset.identifier.class` to customize
the URN creation.

Optionally, sync'ed `Dataset` and `Container` objects can have a `Domain` object attached. To do this, set
`hoodie.meta.sync.datahub.domain.name` to a valid `Domain` URN. A sync'ed `Dataset` can also carry
user-defined properties. To do this, set `hoodie.meta.sync.datahub.table.properties` to a comma-separated string of
key-value pairs (_e.g._ `key1=val1,key2=val2`).
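
To make the expected value formats concrete, the following Python sketch builds the two settings described above and shows how a `key1=val1,key2=val2` string breaks down into individual properties. It is illustrative only: `parse_table_properties` is a hypothetical helper, not Hudi's actual parser, and the domain URN and property values are made-up examples.

```python
from typing import Dict


def parse_table_properties(raw: str) -> Dict[str, str]:
    """Split a 'key1=val1,key2=val2' string into a dict.

    Hypothetical helper for illustration; Hudi's own parsing may differ.
    """
    props: Dict[str, str] = {}
    for pair in raw.split(","):
        pair = pair.strip()
        if not pair:
            continue  # tolerate a trailing comma or empty segment
        key, sep, value = pair.partition("=")
        if not sep or not key.strip():
            raise ValueError(f"expected key=value, got {pair!r}")
        props[key.strip()] = value.strip()
    return props


# Example values are assumptions; only the config keys come from the docs.
sync_config = {
    "hoodie.meta.sync.datahub.domain.name": "urn:li:domain:marketing",
    "hoodie.meta.sync.datahub.table.properties": "owner=data-eng,tier=gold",
}

custom_properties = parse_table_properties(
    sync_config["hoodie.meta.sync.datahub.table.properties"]
)
print(custom_properties)
```

Because pairs are split on commas and on the first `=`, this sketch cannot represent keys or values that contain a comma; whether Hudi supports escaping is not covered here.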

### Example

The following shows an example configuration to run `HoodieStreamer` with `DataHubSyncTool`.