From 93a4d2abc211b9a434b5d0dbcf1c7d994849df61 Mon Sep 17 00:00:00 2001 From: Aditya Goenka <63430370+ad1happy2go@users.noreply.github.com> Date: Fri, 3 Jan 2025 18:41:30 +0530 Subject: [PATCH] [DOCS] Added reference blogs to hudi docs (#12505) * Added reference blogs to hudi docs * Uniformed formatting * Fixed a duplicate entry under reference --- website/docs/azure_hoodie.md | 5 +++++ website/docs/cleaning.md | 4 ++++ website/docs/cli.md | 5 +++++ website/docs/clustering.md | 5 +++++ website/docs/compaction.md | 6 ++++++ website/docs/concepts.md | 5 +++++ website/docs/concurrency_control.md | 5 +++++ website/docs/indexes.md | 5 +++++ website/docs/key_generation.md | 4 +++- website/docs/markers.md | 5 +++++ website/docs/metadata.md | 2 +- website/docs/performance.md | 6 ++++++ website/docs/precommit_validator.md | 5 ++++- website/docs/record_merger.md | 4 ++++ website/docs/timeline.md | 5 +++++ website/docs/writing_tables_streaming_writes.md | 6 ++++++ 16 files changed, 74 insertions(+), 3 deletions(-) diff --git a/website/docs/azure_hoodie.md b/website/docs/azure_hoodie.md index f28ec609c70d8..31e1fa916042a 100644 --- a/website/docs/azure_hoodie.md +++ b/website/docs/azure_hoodie.md @@ -48,3 +48,8 @@ This combination works out of the box. No extra config needed. .format("org.apache.hudi") .load("/mountpoint/hudi-tables/customer") ``` + +## Related Resources + +

+<h3>Blogs</h3>

+* [How to use Apache Hudi with Databricks](https://www.onehouse.ai/blog/how-to-use-apache-hudi-with-databricks) \ No newline at end of file diff --git a/website/docs/cleaning.md b/website/docs/cleaning.md index 5f6ea4b369728..557540cf4b998 100644 --- a/website/docs/cleaning.md +++ b/website/docs/cleaning.md @@ -148,6 +148,10 @@ cleans run --sparkMaster local --hoodieConfigs hoodie.cleaner.policy=KEEP_LATEST You can find more details and the relevant code for these commands in [`org.apache.hudi.cli.commands.CleansCommand`](https://github.com/apache/hudi/blob/master/hudi-cli/src/main/java/org/apache/hudi/cli/commands/CleansCommand.java) class. ## Related Resources + +

+<h3>Blogs</h3>

+* [Cleaner and Archival in Apache Hudi](https://medium.com/@simpsons/cleaner-and-archival-in-apache-hudi-9e15b08b2933) +

<h3>Videos</h3>

* [Cleaner Service: Save up to 40% on data lake storage costs | Hudi Labs](https://youtu.be/mUvRhJDoO3w) diff --git a/website/docs/cli.md b/website/docs/cli.md index def32b11a8e3a..fbae38e18b7ca 100644 --- a/website/docs/cli.md +++ b/website/docs/cli.md @@ -753,3 +753,8 @@ table change-table-type COW ║ hoodie.timeline.layout.version │ 1 │ 1 ║ ╚════════════════════════════════════════════════╧══════════════════════════════════════╧══════════════════════════════════════╝ ``` + +## Related Resources + +

+<h3>Blogs</h3>

+* [Getting Started: Manage your Hudi tables with the admin Hudi-CLI tool](https://www.onehouse.ai/blog/getting-started-manage-your-hudi-tables-with-the-admin-hudi-cli-tool) diff --git a/website/docs/clustering.md b/website/docs/clustering.md index 0bbbad9781a91..64dbdb02fca08 100644 --- a/website/docs/clustering.md +++ b/website/docs/clustering.md @@ -341,6 +341,11 @@ and execution strategy `org.apache.hudi.client.clustering.run.strategy.JavaSortA out-of-the-box. Note that as of now only linear sort is supported in Java execution strategy. ## Related Resources + +

+<h3>Blogs</h3>

+* [Apache Hudi Z-Order and Hilbert Space Filling Curves](https://www.onehouse.ai/blog/apachehudi-z-order-and-hilbert-space-filling-curves)
+* [Hudi Z-Order and Hilbert Space-filling Curves](https://medium.com/apache-hudi-blogs/hudi-z-order-and-hilbert-space-filling-curves-68fa28bffaf0)
+

<h3>Videos</h3>

* [Understanding Clustering in Apache Hudi and the Benefits of Asynchronous Clustering](https://www.youtube.com/watch?v=R_sm4wlGXuE) diff --git a/website/docs/compaction.md b/website/docs/compaction.md index 7859030052aa6..3e1a6186876ce 100644 --- a/website/docs/compaction.md +++ b/website/docs/compaction.md @@ -226,3 +226,9 @@ Offline compaction needs to submit the Flink task on the command line. The progr | `--seq` | `LIFO` (Optional) | The order in which compaction tasks are executed. Executing from the latest compaction plan by default. `LIFO`: executing from the latest plan. `FIFO`: executing from the oldest plan. | | `--service` | `false` (Optional) | Whether to start a monitoring service that checks and schedules new compaction task in configured interval. | | `--min-compaction-interval-seconds` | `600(s)` (optional) | The checking interval for service mode, by default 10 minutes. | + +## Related Resources + +

+<h3>Blogs</h3>

+[Apache Hudi Compaction](https://medium.com/@simpsons/apache-hudi-compaction-6e6383790234) +[Standalone HoodieCompactor Utility](https://medium.com/@simpsons/standalone-hoodiecompactor-utility-890198e4c539) \ No newline at end of file diff --git a/website/docs/concepts.md b/website/docs/concepts.md index 8d0adf8dd5a1b..32e6f322f12f5 100644 --- a/website/docs/concepts.md +++ b/website/docs/concepts.md @@ -169,4 +169,9 @@ The intention of merge on read table is to enable near real-time processing dire data out to specialized systems, which may not be able to handle the data volume. There are also a few secondary side benefits to this table such as reduced write amplification by avoiding synchronous merge of data, i.e, the amount of data written per 1 bytes of data in a batch +## Related Resources +

+<h3>Blogs</h3>

+* [Comparing Apache Hudi's MOR and COW Tables: Use Cases from Uber and Shopee](https://www.onehouse.ai/blog/comparing-apache-hudis-mor-and-cow-tables-use-cases-from-uber-and-shopee) +* [Hudi Metafields demystified](https://www.onehouse.ai/blog/hudi-metafields-demystified) +* [File Naming conventions in Apache Hudi](https://medium.com/@simpsons/file-naming-conventions-in-apache-hudi-cd1cdd95f5e7) \ No newline at end of file diff --git a/website/docs/concurrency_control.md b/website/docs/concurrency_control.md index 549f1ddd17eb1..90fe990ba98e3 100644 --- a/website/docs/concurrency_control.md +++ b/website/docs/concurrency_control.md @@ -333,6 +333,11 @@ If you are using the `WriteClient` API, please note that multiple writes to the It is **NOT** recommended to use the same instance of the write client to perform multi writing. ## Related Resources + +

+<h3>Blogs</h3>

+* [Data Lakehouse Concurrency Control](https://www.onehouse.ai/blog/lakehouse-concurrency-control-are-we-too-optimistic) +* [Multi-writer support with Apache Hudi](https://medium.com/@simpsons/multi-writer-support-with-apache-hudi-e1b75dca29e6) +

<h3>Videos</h3>

* [Hands on Lab with using DynamoDB as lock table for Apache Hudi Data Lakes](https://youtu.be/JP0orl9_0yQ) diff --git a/website/docs/indexes.md b/website/docs/indexes.md index 73310e431b686..03a57ceeb1b3b 100644 --- a/website/docs/indexes.md +++ b/website/docs/indexes.md @@ -219,6 +219,11 @@ partition path value could change due to an update e.g users table partitioned b ## Related Resources + +

+<h3>Blogs</h3>

+ +* [Global vs Non-global index in Apache Hudi](https://medium.com/@simpsons/global-vs-non-global-index-in-apache-hudi-ac880b031cbc) +

<h3>Videos</h3>

* [Global Bloom Index: Remove duplicates & guarantee uniquness - Hudi Labs](https://youtu.be/XlRvMFJ7g9c) diff --git a/website/docs/key_generation.md b/website/docs/key_generation.md index 2e4fa4263876a..3a7b109c33639 100644 --- a/website/docs/key_generation.md +++ b/website/docs/key_generation.md @@ -212,4 +212,6 @@ Partition path generated from key generator: "04/01/2020" ## Related Resources -* [Hudi metafields demystified](https://www.onehouse.ai/blog/hudi-metafields-demystified) \ No newline at end of file +

+<h3>Blogs</h3>

+* [Hudi metafields demystified](https://www.onehouse.ai/blog/hudi-metafields-demystified) +* [Primary key and Partition Generators with Apache Hudi](https://medium.com/@simpsons/primary-key-and-partition-generators-with-apache-hudi-f0e4d71d9d26) \ No newline at end of file diff --git a/website/docs/markers.md b/website/docs/markers.md index 71321d70c1910..2710546ae9907 100644 --- a/website/docs/markers.md +++ b/website/docs/markers.md @@ -89,3 +89,8 @@ with direct markers because the file system metadata is efficiently cached in me | `hoodie.markers.timeline_server_based.batch.num_threads` | 20 | Number of threads to use for batch processing marker creation requests at the timeline server. | | `hoodie.markers.timeline_server_based.batch.interval_ms` | 50 | The batch interval in milliseconds for marker creation batch processing. | + +## Related Resources + +

+<h3>Blogs</h3>

+[Timeline Server in Apache Hudi](https://medium.com/@simpsons/timeline-server-in-apache-hudi-b5be25f85e47) diff --git a/website/docs/metadata.md b/website/docs/metadata.md index 47661f314114d..8f3b403112ac2 100644 --- a/website/docs/metadata.md +++ b/website/docs/metadata.md @@ -129,6 +129,6 @@ metadata table across all writers. ## Related Resources

<h3>Blogs</h3>

- * [Table service deployment models in Apache Hudi](https://medium.com/@simpsons/table-service-deployment-models-in-apache-hudi-9cfa5a44addf) * [Multi Modal Indexing for the Data Lakehouse](https://www.onehouse.ai/blog/introducing-multi-modal-index-for-the-lakehouse-in-apache-hudi) +* [How to Optimize Performance for Your Open Data Lakehouse](https://www.onehouse.ai/blog/how-to-optimize-performance-for-your-open-data-lakehouse) diff --git a/website/docs/performance.md b/website/docs/performance.md index 0663535c07d7d..89bcd3ee75b03 100644 --- a/website/docs/performance.md +++ b/website/docs/performance.md @@ -131,3 +131,9 @@ To enable Data Skipping in your queries make sure to set following properties to - `hoodie.enable.data.skipping` (to control data skipping, enabled by default) - `hoodie.metadata.enable` (to enable metadata table use on the read path, enabled by default) - `hoodie.metadata.index.column.stats.enable` (to enable column stats index use on the read path) + +## Related Resources + +

+<h3>Blogs</h3>

+* [Hudi’s Column Stats Index and Data Skipping feature help speed up queries by an orders of magnitude!](https://www.onehouse.ai/blog/hudis-column-stats-index-and-data-skipping-feature-help-speed-up-queries-by-an-orders-of-magnitude) +* [Top 3 Things You Can Do to Get Fast Upsert Performance in Apache Hudi](https://www.onehouse.ai/blog/top-3-things-you-can-do-to-get-fast-upsert-performance-in-apache-hudi) \ No newline at end of file diff --git a/website/docs/precommit_validator.md b/website/docs/precommit_validator.md index d5faf61057dee..0ec4c6cacc53d 100644 --- a/website/docs/precommit_validator.md +++ b/website/docs/precommit_validator.md @@ -96,6 +96,9 @@ Hudi offers a [commit notification service](platform_services_post_commit_callba The commit notification service can be combined with pre-commit validators to send a notification when a commit fails a validation. This is possible by passing details about the validation as a custom value to the HTTP endpoint. ## Related Resources -

<h3>Videos</h3>

+

+<h3>Blogs</h3>

+* [Apply Pre-Commit Validation for Data Quality in Apache Hudi](https://www.onehouse.ai/blog/apply-pre-commit-validation-for-data-quality-in-apache-hudi) + +

+<h3>Videos</h3>

* [Learn About Apache Hudi Pre Commit Validator with Hands on Lab](https://www.youtube.com/watch?v=KNzs9dj_Btc) diff --git a/website/docs/record_merger.md b/website/docs/record_merger.md index 378c5575ad19c..5dfc70f08e78c 100644 --- a/website/docs/record_merger.md +++ b/website/docs/record_merger.md @@ -251,3 +251,7 @@ example, [`MySqlDebeziumAvroPayload`](https://github.com/apache/hudi/blob/e76dd1 captured via Debezium for MySQL and PostgresDB. [`AWSDmsAvroPayload`](https://github.com/apache/hudi/blob/e76dd102bcaf8aec5a932e7277ccdbfd73ce1a32/hudi-common/src/main/java/org/apache/hudi/common/model/AWSDmsAvroPayload.java) provides support for applying changes captured via Amazon Database Migration Service onto S3. For full configurations, go [here](/docs/configurations#RECORD_PAYLOAD) and please check out [this FAQ](faq_writing_tables/#can-i-implement-my-own-logic-for-how-input-records-are-merged-with-record-on-storage) if you want to implement your own custom payloads. +## Related Resources + +

+<h3>Blogs</h3>

+* [How to define your own merge logic with Apache Hudi](https://medium.com/@simpsons/how-to-define-your-own-merge-logic-with-apache-hudi-622ee5ccab1e) diff --git a/website/docs/timeline.md b/website/docs/timeline.md index 75226295930b9..47d35a50eafc8 100644 --- a/website/docs/timeline.md +++ b/website/docs/timeline.md @@ -151,3 +151,8 @@ Flink jobs using the SQL can be configured through the options in WITH clause. T Refer [here](https://hudi.apache.org/docs/next/configurations#Flink-Options) for more details. +## Related Resources + +

+<h3>Blogs</h3>

+* [Apache Hudi Timeline: Foundational pillar for ACID transactions](https://medium.com/@simpsons/hoodie-timeline-foundational-pillar-for-acid-transactions-be871399cbae) + diff --git a/website/docs/writing_tables_streaming_writes.md b/website/docs/writing_tables_streaming_writes.md index 1dbf6b6dc6d6f..86a790705e1c1 100644 --- a/website/docs/writing_tables_streaming_writes.md +++ b/website/docs/writing_tables_streaming_writes.md @@ -93,3 +93,9 @@ df.writeStream.format("hudi") +## Related Resources + +

+<h3>Blogs</h3>

+* [An Introduction to the Hudi and Flink Integration](https://www.onehouse.ai/blog/intro-to-hudi-and-flink) +* [Bulk Insert Sort Modes with Apache Hudi](https://medium.com/@simpsons/bulk-insert-sort-modes-with-apache-hudi-c781e77841bc) +