diff --git a/README.md b/README.md index 9b011e8c7165a..5686cf42d7ef2 100644 --- a/README.md +++ b/README.md @@ -129,16 +129,35 @@ versioned_sidebars/version-0.7.0-sidebars.json ``` ### Linking docs - -- Remember to include the `.md` extension. -- Files will be linked to correct corresponding version. -- Relative paths work as well. - -```md -The [@hello](hello.md#paginate) document is great! - -See the [Tutorial](../getting-started/tutorial.md) for more info. -``` +Relative paths work well. - Files will be linked to the correct corresponding version. + - PREFER RELATIVE PATHS to keep linking consistent. + - **Good example of linking.** + For example, say we are updating an older 0.12.0 version doc. + ```md + A [callback notification](writing_data#commit-notifications) is exposed + ``` + This automatically resolves to /docs/0.12.0/writing_data#commit-notifications. + - **Bad example of linking.** + For example, say we are updating an older 0.12.0 version doc. + ```md + A [callback notification](/docs/writing_data#commit-notifications) is exposed + ``` + This will resolve to the most recent release, specifically /docs/writing_data#commit-notifications. We do not want a 0.12.0 doc page to point to a page from a later release. + - DO NOT use the /docs/next/ prefix when linking. + - Good example of linking when you are working on the unreleased version (the next version). + ```md + Hudi adopts Multiversion Concurrency Control (MVCC), where [compaction](compaction) action merges logs and base files to produce new + file slices and [cleaning](cleaning) action gets rid of unused/older file slices to reclaim space on the file system. + ``` + This automatically resolves to the /docs/next/compaction and /docs/next/cleaning pages. + + - Bad example of linking when you are working on the unreleased version (the next version). + ```md + Hudi adopts Multiversion Concurrency Control (MVCC), where [compaction](/docs/next/compaction) action merges logs and base files to produce new + file slices and [cleaning](/docs/next/cleaning) action gets rid of unused/older file slices to reclaim space on the file system. + ``` + Even though it directly points to /docs/next, which is the intended target, this accumulates as tech debt: when this copy of the docs gets released, we will have an older doc always pointing to /docs/next/. + ## Versions diff --git a/content/docs/0.8.0/concurrency_control/.index.html.swp b/content/docs/0.8.0/concurrency_control/.index.html.swp new file mode 100644 index 0000000000000..122ce09e20f5e Binary files /dev/null and b/content/docs/0.8.0/concurrency_control/.index.html.swp differ diff --git a/website/blog/2019-09-09-ingesting-database-changes.md b/website/blog/2019-09-09-ingesting-database-changes.md index 2c8b068e5a2b2..79373fd776810 100644 --- a/website/blog/2019-09-09-ingesting-database-changes.md +++ b/website/blog/2019-09-09-ingesting-database-changes.md @@ -44,5 +44,5 @@ inputDataset.write.format("org.apache.hudi”) .save("/path/on/dfs"); ``` -Alternatively, you can also use the Hudi [DeltaStreamer](https://hudi.apache.org/writing_data#deltastreamer) tool with the DFSSource. +Alternatively, you can also use the Hudi [DeltaStreamer](https://hudi.apache.org/docs/hoodie_streaming_ingestion#hudi-streamer) tool with the DFSSource. 
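As a reference for the DeltaStreamer-with-DFSSource path mentioned above, a minimal invocation might look like the sketch below. The class names, flags, and config keys come from the Hudi utilities bundle; the jar version, paths, table name, and record key/ordering/partition field names are placeholders to adapt to your setup.

```bash
# Sketch only: ingest parquet files landing on DFS/S3 into a Hudi table via DeltaStreamer.
# Replace the jar version, paths, table name and field names with values for your environment.
spark-submit \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  hudi-utilities-bundle_2.12-<version>.jar \
  --table-type COPY_ON_WRITE \
  --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
  --source-ordering-field updated_at \
  --target-base-path /path/on/dfs/hudi_table \
  --target-table hudi_table \
  --hoodie-conf hoodie.deltastreamer.source.dfs.root=/path/to/incoming/parquet \
  --hoodie-conf hoodie.datasource.write.recordkey.field=id \
  --hoodie-conf hoodie.datasource.write.partitionpath.field=created_date
```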
diff --git a/website/blog/2020-01-20-change-capture-using-aws.md b/website/blog/2020-01-20-change-capture-using-aws.md index c7ff93ec1b846..29b2589174d7c 100644 --- a/website/blog/2020-01-20-change-capture-using-aws.md +++ b/website/blog/2020-01-20-change-capture-using-aws.md @@ -20,7 +20,7 @@ In this blog, we will build an end-end solution for capturing changes from a MyS We can break up the problem into two pieces. 1. **Extracting change logs from MySQL** : Surprisingly, this is still a pretty tricky problem to solve and often Hudi users get stuck here. Thankfully, at-least for AWS users, there is a [Database Migration service](https://aws.amazon.com/dms/) (DMS for short), that does this change capture and uploads them as parquet files on S3 -2. **Applying these change logs to your data lake table** : Once there are change logs in some form, the next step is to apply them incrementally to your table. This mundane task can be fully automated using the Hudi [DeltaStreamer](http://hudi.apache.org/docs/writing_data#deltastreamer) tool. +2. **Applying these change logs to your data lake table** : Once there are change logs in some form, the next step is to apply them incrementally to your table. This mundane task can be fully automated using the Hudi [DeltaStreamer](http://hudi.apache.org/docs/hoodie_streaming_ingestion#hudi-streamer) tool. diff --git a/website/blog/2021-01-27-hudi-clustering-intro.md b/website/blog/2021-01-27-hudi-clustering-intro.md index f1af4433e5e50..d227d729ce342 100644 --- a/website/blog/2021-01-27-hudi-clustering-intro.md +++ b/website/blog/2021-01-27-hudi-clustering-intro.md @@ -17,7 +17,7 @@ Apache Hudi brings stream processing to big data, providing fresh data while bei ## Clustering Architecture -At a high level, Hudi provides different operations such as insert/upsert/bulk_insert through it’s write client API to be able to write data to a Hudi table. To be able to choose a trade-off between file size and ingestion speed, Hudi provides a knob `hoodie.parquet.small.file.limit` to be able to configure the smallest allowable file size. Users are able to configure the small file [soft limit](https://hudi.apache.org/docs/configurations#compactionSmallFileSize) to `0` to force new data to go into a new set of filegroups or set it to a higher value to ensure new data gets “padded” to existing files until it meets that limit that adds to ingestion latencies. +At a high level, Hudi provides different operations such as insert/upsert/bulk_insert through it’s write client API to be able to write data to a Hudi table. To be able to choose a trade-off between file size and ingestion speed, Hudi provides a knob `hoodie.parquet.small.file.limit` to be able to configure the smallest allowable file size. Users are able to configure the small file [soft limit](https://hudi.apache.org/docs/configurations#hoodieparquetsmallfilelimit) to `0` to force new data to go into a new set of filegroups or set it to a higher value to ensure new data gets “padded” to existing files until it meets that limit that adds to ingestion latencies. diff --git a/website/blog/2021-03-01-hudi-file-sizing.md b/website/blog/2021-03-01-hudi-file-sizing.md index 304c2aaaabb94..808688b37339c 100644 --- a/website/blog/2021-03-01-hudi-file-sizing.md +++ b/website/blog/2021-03-01-hudi-file-sizing.md @@ -36,9 +36,9 @@ For illustration purposes, we are going to consider only COPY_ON_WRITE table. 
Configs of interest before we dive into the algorithm: -- [Max file size](/docs/configurations#limitFileSize): Max size for a given data file. Hudi will try to maintain file sizes to this configured value
-- [Soft file limit](/docs/configurations#compactionSmallFileSize): Max file size below which a given data file is considered to a small file
-- [Insert split size](/docs/configurations#insertSplitSize): Number of inserts grouped for a single partition. This value should match +- [Max file size](/docs/configurations#hoodieparquetmaxfilesize): Max size for a given data file. Hudi will try to maintain file sizes to this configured value
+- [Soft file limit](/docs/configurations#hoodieparquetsmallfilelimit): Max file size below which a given data file is considered a small file
+- [Insert split size](/docs/configurations#hoodiecopyonwriteinsertsplitsize): Number of inserts grouped for a single partition. This value should match the number of records in a single file (you can determine based on max file size and per record size) For instance, if your first config value is 120MB and 2nd config value is set to 100MB, any file whose size is < 100MB diff --git a/website/blog/2021-08-16-kafka-custom-deserializer.md b/website/blog/2021-08-16-kafka-custom-deserializer.md index c8146b6343d07..7ed4bf3e03a2c 100644 --- a/website/blog/2021-08-16-kafka-custom-deserializer.md +++ b/website/blog/2021-08-16-kafka-custom-deserializer.md @@ -18,7 +18,7 @@ In our case a Confluent schema registry is used to maintain the schema and as sc ## What do we want to achieve? -We have multiple instances of DeltaStreamer running, consuming many topics with different schemas ingesting to multiple Hudi tables. Deltastreamer is a utility in Hudi to assist in ingesting data from multiple sources like DFS, kafka, etc into Hudi. If interested, you can read more about DeltaStreamer tool [here](https://hudi.apache.org/docs/writing_data#deltastreamer) +We have multiple instances of DeltaStreamer running, consuming many topics with different schemas ingesting to multiple Hudi tables. Deltastreamer is a utility in Hudi to assist in ingesting data from multiple sources like DFS, kafka, etc into Hudi. If interested, you can read more about DeltaStreamer tool [here](https://hudi.apache.org/docs/hoodie_streaming_ingestion#hudi-streamer) Ideally every topic should be able to evolve the schema to match new business requirements. Producers start producing data with a new schema version and the DeltaStreamer picks up the new schema and ingests the data with the new schema. For this to work, we run our DeltaStreamer instances with the latest schema version available from the Schema Registry to ensure that we always use the freshest schema with all attributes. A prerequisites is that all the mentioned Schema evolutions must be `BACKWARD_TRANSITIVE` compatible (see [Schema Evolution and Compatibility of Avro Schema changes](https://docs.confluent.io/platform/current/schema-registry/avro.html). This ensures that every record in the kafka topic can always be read using the latest schema. diff --git a/website/docs/cli.md b/website/docs/cli.md index 1c30b9b6fa6e0..7cc4cdd92b0c2 100644 --- a/website/docs/cli.md +++ b/website/docs/cli.md @@ -578,7 +578,7 @@ Compaction successfully repaired ### Savepoint and Restore As the name suggest, "savepoint" saves the table as of the commit time, so that it lets you restore the table to this -savepoint at a later point in time if need be. You can read more about savepoints and restore [here](/docs/next/disaster_recovery) +savepoint at a later point in time if need be. You can read more about savepoints and restore [here](disaster_recovery) To trigger savepoint for a hudi table ```java diff --git a/website/docs/concurrency_control.md b/website/docs/concurrency_control.md index 8550888e734fe..d9867be88a8ec 100644 --- a/website/docs/concurrency_control.md +++ b/website/docs/concurrency_control.md @@ -8,7 +8,7 @@ last_modified_at: 2021-03-19T15:59:57-04:00 --- Concurrency control defines how different writers/readers/table services coordinate access to a Hudi table. Hudi ensures atomic writes, by way of publishing commits atomically to the timeline, stamped with an instant time that denotes the time at which the action is deemed to have occurred. 
Unlike general purpose file version control, Hudi draws clear distinction between -writer processes that issue [write operations](/docs/next/write_operations) and table services that (re)write data/metadata to optimize/perform bookkeeping and +writer processes that issue [write operations](write_operations) and table services that (re)write data/metadata to optimize/perform bookkeeping and readers (that execute queries and read data). Hudi provides @@ -23,7 +23,7 @@ We’ll also describe ways to ingest data into a Hudi Table from multiple writer ## Distributed Locking A pre-requisite for distributed co-ordination in Hudi, like many other distributed database systems is a distributed lock provider, that different processes can use to plan, schedule and -execute actions on the Hudi timeline in a concurrent fashion. Locks are also used to [generate TrueTime](/docs/next/timeline#truetime-generation), as discussed before. +execute actions on the Hudi timeline in a concurrent fashion. Locks are also used to [generate TrueTime](timeline#truetime-generation), as discussed before. External locking is typically used in conjunction with optimistic concurrency control because it provides a way to prevent conflicts that might occur when two or more transactions (commits in our case) attempt to modify the same resource concurrently. @@ -204,7 +204,7 @@ Multiple writers can operate on the table with non-blocking conflict resolution. file group with the conflicts resolved automatically by the query reader and the compactor. The new concurrency mode is currently available for preview in version 1.0.0-beta only with the caveat that conflict resolution is not supported yet between clustering and ingestion. It works for compaction and ingestion, and we can see an example of that with Flink -writers [here](/docs/next/sql_dml#non-blocking-concurrency-control-experimental). +writers [here](sql_dml#non-blocking-concurrency-control-experimental). ## Early conflict Detection diff --git a/website/docs/deployment.md b/website/docs/deployment.md index 9bafde59c4658..7785f4ceaca1f 100644 --- a/website/docs/deployment.md +++ b/website/docs/deployment.md @@ -136,7 +136,7 @@ Here is an example invocation for reading from kafka topic in a continuous mode ### Spark Datasource Writer Jobs -As described in [Batch Writes](/docs/next/writing_data#spark-datasource-api), you can use spark datasource to ingest to hudi table. This mechanism allows you to ingest any spark dataframe in Hudi format. Hudi Spark DataSource also supports spark streaming to ingest a streaming source to Hudi table. For Merge On Read table types, inline compaction is turned on by default which runs after every ingestion run. The compaction frequency can be changed by setting the property "hoodie.compact.inline.max.delta.commits". +As described in [Batch Writes](writing_data#spark-datasource-api), you can use spark datasource to ingest to hudi table. This mechanism allows you to ingest any spark dataframe in Hudi format. Hudi Spark DataSource also supports spark streaming to ingest a streaming source to Hudi table. For Merge On Read table types, inline compaction is turned on by default which runs after every ingestion run. The compaction frequency can be changed by setting the property "hoodie.compact.inline.max.delta.commits". 
Here is an example invocation using spark datasource diff --git a/website/docs/faq.md b/website/docs/faq.md index 1378839b81cb8..26c3eb50d2147 100644 --- a/website/docs/faq.md +++ b/website/docs/faq.md @@ -6,10 +6,10 @@ keywords: [hudi, writing, reading] The FAQs are split into following pages. Please refer to the specific pages for more info. -- [General](/docs/next/faq_general) -- [Design & Concepts](/docs/next/faq_design_and_concepts) -- [Writing Tables](/docs/next/faq_writing_tables) -- [Reading Tables](/docs/next/faq_reading_tables) -- [Table Services](/docs/next/faq_table_services) -- [Storage](/docs/next/faq_storage) -- [Integrations](/docs/next/faq_integrations) +- [General](faq_general) +- [Design & Concepts](faq_design_and_concepts) +- [Writing Tables](faq_writing_tables) +- [Reading Tables](faq_reading_tables) +- [Table Services](faq_table_services) +- [Storage](faq_storage) +- [Integrations](faq_integrations) diff --git a/website/docs/faq_general.md b/website/docs/faq_general.md index 61b6c12a4b5db..9f0a6c7d5153a 100644 --- a/website/docs/faq_general.md +++ b/website/docs/faq_general.md @@ -61,7 +61,7 @@ Nonetheless, Hudi is designed very much like a database and provides similar fun ### How do I model the data stored in Hudi? -When writing data into Hudi, you model the records like how you would on a key-value store - specify a key field (unique for a single partition/across table), a partition field (denotes partition to place key into) and preCombine/combine logic that specifies how to handle duplicates in a batch of records written. This model enables Hudi to enforce primary key constraints like you would get on a database table. See [here](/docs/next/writing_data) for an example. +When writing data into Hudi, you model the records like how you would on a key-value store - specify a key field (unique for a single partition/across table), a partition field (denotes partition to place key into) and preCombine/combine logic that specifies how to handle duplicates in a batch of records written. This model enables Hudi to enforce primary key constraints like you would get on a database table. See [here](writing_data) for an example. When querying/reading data, Hudi just presents itself as a json-like hierarchical table, everyone is used to querying using Hive/Spark/Presto over Parquet/Json/Avro. diff --git a/website/docs/faq_table_services.md b/website/docs/faq_table_services.md index 0ca730094e4f1..7ff398687e392 100644 --- a/website/docs/faq_table_services.md +++ b/website/docs/faq_table_services.md @@ -50,6 +50,6 @@ Hudi runs cleaner to remove old file versions as part of writing data either in Yes. Hudi provides the ability to post a callback notification about a write commit. You can use a http hook or choose to -be notified via a Kafka/pulsar topic or plug in your own implementation to get notified. Please refer [here](/docs/next/platform_services_post_commit_callback) +be notified via a Kafka/pulsar topic or plug in your own implementation to get notified. Please refer [here](platform_services_post_commit_callback) for details diff --git a/website/docs/faq_writing_tables.md b/website/docs/faq_writing_tables.md index bed07a16e57a6..2374006d95533 100644 --- a/website/docs/faq_writing_tables.md +++ b/website/docs/faq_writing_tables.md @@ -6,7 +6,7 @@ keywords: [hudi, writing, reading] ### What are some ways to write a Hudi table? -Typically, you obtain a set of partial updates/inserts from your source and issue [write operations](/docs/write_operations/) against a Hudi table. 
If you ingesting data from any of the standard sources like Kafka, or tailing DFS, the [delta streamer](/docs/hoodie_streaming_ingestion#hudi-streamer) tool is invaluable and provides an easy, self-managed solution to getting data written into Hudi. You can also write your own code to capture data from a custom source using the Spark datasource API and use a [Hudi datasource](/docs/next/writing_data#spark-datasource-api) to write into Hudi. +Typically, you obtain a set of partial updates/inserts from your source and issue [write operations](/docs/write_operations/) against a Hudi table. If you are ingesting data from any of the standard sources like Kafka, or tailing DFS, the [delta streamer](/docs/hoodie_streaming_ingestion#hudi-streamer) tool is invaluable and provides an easy, self-managed solution to getting data written into Hudi. You can also write your own code to capture data from a custom source using the Spark datasource API and use a [Hudi datasource](writing_data#spark-datasource-api) to write into Hudi. ### How is a Hudi writer job deployed? @@ -68,7 +68,7 @@ As you could see, ([combineAndGetUpdateValue(), getInsertValue()](https://github ### How do I delete records in the dataset using Hudi? -GDPR has made deletes a must-have tool in everyone's data management toolbox. Hudi supports both soft and hard deletes. For details on how to actually perform them, see [here](/docs/next/writing_data#deletes). +GDPR has made deletes a must-have tool in everyone's data management toolbox. Hudi supports both soft and hard deletes. For details on how to actually perform them, see [here](writing_data#deletes). ### Should I need to worry about deleting all copies of the records in case of duplicates? @@ -147,7 +147,7 @@ a) **Auto Size small files during ingestion**: This solution trades ingest/writing time to keep Hudi has the ability to maintain a configured target file size, when performing **upsert/insert** operations. (Note: **bulk_insert** operation does not provide this functionality and is designed as a simpler replacement for normal `spark.write.parquet` ) -For **copy-on-write**, this is as simple as configuring the [maximum size for a base/parquet file](/docs/configurations#hoodieparquetmaxfilesize) and the [soft limit](/docs/configurations#hoodieparquetsmallfilelimit) below which a file should be considered a small file. For the initial bootstrap to Hudi table, tuning record size estimate is also important to ensure sufficient records are bin-packed in a parquet file. For subsequent writes, Hudi automatically uses average record size based on previous commit. Hudi will try to add enough records to a small file at write time to get it to the configured maximum limit. 
For e.g., with `hoodie.parquet.small.file.limit=100MB` and `hoodie.parquet.max.file.size=120MB`, Hudi will pick all files < 100MB and try to get them up to 120MB. For **merge-on-read**, there are few more configs to set. MergeOnRead works differently for different INDEX choices. @@ -183,7 +183,7 @@ No, Hudi does not expose uncommitted files/blocks to the readers. Further, Hudi ### How are conflicts detected in Hudi between multiple writers? -Hudi employs [optimistic concurrency control](/docs/concurrency_control#supported-concurrency-controls) between writers, while implementing MVCC based concurrency control between writers and the table services. Concurrent writers to the same table need to be configured with the same lock provider configuration, to safely perform writes. By default (implemented in “[SimpleConcurrentFileWritesConflictResolutionStrategy](https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/transaction/SimpleConcurrentFileWritesConflictResolutionStrategy.java)”), Hudi allows multiple writers to concurrently write data and commit to the timeline if there is no conflicting writes to the same underlying file group IDs. This is achieved by holding a lock, checking for changes that modified the same file IDs. Hudi then supports a pluggable interface “[ConflictResolutionStrategy](https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/transaction/ConflictResolutionStrategy.java)” that determines how conflicts are handled. By default, the later conflicting write is aborted. Hudi also support eager conflict detection to help speed up conflict detection and release cluster resources back early to reduce costs. +Hudi employs [optimistic concurrency control](concurrency_control) between writers, while implementing MVCC based concurrency control between writers and the table services. Concurrent writers to the same table need to be configured with the same lock provider configuration, to safely perform writes. By default (implemented in “[SimpleConcurrentFileWritesConflictResolutionStrategy](https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/transaction/SimpleConcurrentFileWritesConflictResolutionStrategy.java)”), Hudi allows multiple writers to concurrently write data and commit to the timeline if there are no conflicting writes to the same underlying file group IDs. This is achieved by holding a lock, checking for changes that modified the same file IDs. Hudi then supports a pluggable interface “[ConflictResolutionStrategy](https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/transaction/ConflictResolutionStrategy.java)” that determines how conflicts are handled. By default, the later conflicting write is aborted. Hudi also supports eager conflict detection to help speed up conflict detection and release cluster resources back early to reduce costs. ### Can single-writer inserts have duplicates? diff --git a/website/docs/file_sizing.md b/website/docs/file_sizing.md index c637a5a630cc3..62ad0f7a43208 100644 --- a/website/docs/file_sizing.md +++ b/website/docs/file_sizing.md @@ -148,7 +148,7 @@ while the clustering service runs. :::note Hudi always creates immutable files on storage. To be able to do auto-sizing or clustering, Hudi will always create a -newer version of the smaller file, resulting in 2 versions of the same file. 
The [cleaner service](/docs/next/cleaning) +newer version of the smaller file, resulting in 2 versions of the same file. The [cleaner service](cleaning) will later kick in and delete the older version small file and keep the latest one. ::: diff --git a/website/docs/flink-quick-start-guide.md b/website/docs/flink-quick-start-guide.md index 0ab2322d766e1..1cfda067c71c5 100644 --- a/website/docs/flink-quick-start-guide.md +++ b/website/docs/flink-quick-start-guide.md @@ -449,19 +449,19 @@ feature is that it now lets you author streaming pipelines on streaming or batch ## Where To Go From Here? - **Quick Start** : Read [Quick Start](#quick-start) to get started quickly Flink sql client to write to(read from) Hudi. -- **Configuration** : For [Global Configuration](/docs/next/flink_tuning#global-configurations), sets up through `$FLINK_HOME/conf/flink-conf.yaml`. For per job configuration, sets up through [Table Option](/docs/next/flink_tuning#table-options). -- **Writing Data** : Flink supports different modes for writing, such as [CDC Ingestion](/docs/next/ingestion_flink#cdc-ingestion), [Bulk Insert](/docs/next/ingestion_flink#bulk-insert), [Index Bootstrap](/docs/next/ingestion_flink#index-bootstrap), [Changelog Mode](/docs/next/ingestion_flink#changelog-mode) and [Append Mode](/docs/next/ingestion_flink#append-mode). Flink also supports multiple streaming writers with [non-blocking concurrency control](/docs/next/sql_dml#non-blocking-concurrency-control-experimental). -- **Reading Data** : Flink supports different modes for reading, such as [Streaming Query](/docs/sql_queries#streaming-query) and [Incremental Query](/docs/sql_queries#incremental-query). -- **Tuning** : For write/read tasks, this guide gives some tuning suggestions, such as [Memory Optimization](/docs/next/flink_tuning#memory-optimization) and [Write Rate Limit](/docs/next/flink_tuning#write-rate-limit). +- **Configuration** : For [Global Configuration](flink_tuning#global-configurations), sets up through `$FLINK_HOME/conf/flink-conf.yaml`. For per job configuration, sets up through [Table Option](flink_tuning#table-options). +- **Writing Data** : Flink supports different modes for writing, such as [CDC Ingestion](ingestion_flink#cdc-ingestion), [Bulk Insert](ingestion_flink#bulk-insert), [Index Bootstrap](ingestion_flink#index-bootstrap), [Changelog Mode](ingestion_flink#changelog-mode) and [Append Mode](ingestion_flink#append-mode). Flink also supports multiple streaming writers with [non-blocking concurrency control](sql_dml#non-blocking-concurrency-control-experimental). +- **Reading Data** : Flink supports different modes for reading, such as [Streaming Query](sql_queries#streaming-query) and [Incremental Query](/docs/sql_queries#incremental-query). +- **Tuning** : For write/read tasks, this guide gives some tuning suggestions, such as [Memory Optimization](flink_tuning#memory-optimization) and [Write Rate Limit](flink_tuning#write-rate-limit). - **Optimization**: Offline compaction is supported [Offline Compaction](/docs/compaction#flink-offline-compaction). -- **Query Engines**: Besides Flink, many other engines are integrated: [Hive Query](/docs/syncing_metastore#flink-setup), [Presto Query](/docs/querying_data#prestodb). +- **Query Engines**: Besides Flink, many other engines are integrated: [Hive Query](/docs/syncing_metastore#flink-setup), [Presto Query](sql_queries#presto). - **Catalog**: A Hudi specific catalog is supported: [Hudi Catalog](/docs/sql_ddl/#create-catalog). 
If you are relatively new to Apache Hudi, it is important to be familiar with a few core concepts: - - [Hudi Timeline](/docs/next/timeline) – How Hudi manages transactions and other table services - - [Hudi Storage Layout](/docs/next/storage_layouts) - How the files are laid out on storage - - [Hudi Table Types](/docs/next/table_types) – `COPY_ON_WRITE` and `MERGE_ON_READ` - - [Hudi Query Types](/docs/next/table_types#query-types) – Snapshot Queries, Incremental Queries, Read-Optimized Queries + - [Hudi Timeline](timeline) – How Hudi manages transactions and other table services + - [Hudi Storage Layout](storage_layouts) - How the files are laid out on storage + - [Hudi Table Types](table_types) – `COPY_ON_WRITE` and `MERGE_ON_READ` + - [Hudi Query Types](table_types#query-types) – Snapshot Queries, Incremental Queries, Read-Optimized Queries See more in the "Concepts" section of the docs. diff --git a/website/docs/hudi_stack.md b/website/docs/hudi_stack.md index ab2408f431648..59517ede41dac 100644 --- a/website/docs/hudi_stack.md +++ b/website/docs/hudi_stack.md @@ -157,7 +157,7 @@ Platform services offer functionality that is specific to data and workloads, an Services, like [Hudi Streamer](./hoodie_streaming_ingestion#hudi-streamer) (or its Flink counterpart), are specialized in handling data and workloads, seamlessly integrating with Kafka streams and various formats to build data lakes. They support functionalities like automatic checkpoint management, integration with major schema registries (including Confluent), and deduplication of data. Hudi Streamer also offers features for backfills, one-off runs, and continuous mode operation with Spark/Flink streaming writers. Additionally, -Hudi provides tools for [snapshotting](./snapshot_exporter) and incrementally [exporting](./snapshot_exporter#examples) Hudi tables, importing new tables, and [post-commit callback](/docs/next/platform_services_post_commit_callback) for analytics or +Hudi provides tools for [snapshotting](./snapshot_exporter) and incrementally [exporting](./snapshot_exporter#examples) Hudi tables, importing new tables, and [post-commit callback](platform_services_post_commit_callback) for analytics or workflow management, enhancing the deployment of production-grade incremental pipelines. Apart from these services, Hudi also provides broad support for different catalogs such as [Hive Metastore](./syncing_metastore), [AWS Glue](./syncing_aws_glue_data_catalog/), [Google BigQuery](./gcp_bigquery), [DataHub](./syncing_datahub), etc. that allows syncing of Hudi tables to be queried by interactive engines such as Trino and Presto. diff --git a/website/docs/indexes.md b/website/docs/indexes.md index 512242ba811d6..c2284f2d473eb 100644 --- a/website/docs/indexes.md +++ b/website/docs/indexes.md @@ -19,8 +19,8 @@ Only clustering or cross-partition updates that are implemented as deletes + ins file group at any completed instant on the timeline. ## Need for indexing -For [Copy-On-Write tables](/docs/next/table_types#copy-on-write-table), indexing enables fast upsert/delete operations, by avoiding the need to join against the entire dataset to determine which files to rewrite. -For [Merge-On-Read tables](/docs/next/table_types#merge-on-read-table), indexing allows Hudi to bound the amount of change records any given base file needs to be merged against. 
Specifically, a given base file needs to merged +For [Copy-On-Write tables](table_types#copy-on-write-table), indexing enables fast upsert/delete operations, by avoiding the need to join against the entire dataset to determine which files to rewrite. +For [Merge-On-Read tables](table_types#merge-on-read-table), indexing allows Hudi to bound the amount of change records any given base file needs to be merged against. Specifically, a given base file needs to be merged only against updates for records that are part of that base file. ![Fact table](/assets/images/blog/hudi-indexes/with_without_index.png) @@ -28,7 +28,7 @@ only against updates for records that are part of that base file. In contrast, - Designs without an indexing component (e.g: [Apache Hive/Apache Iceberg](https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions)) end up having to merge all the base files against all incoming updates/delete records - (10-100x more [read amplification](/docs/next/table_types#comparison)). + (10-100x more [read amplification](table_types#comparison)). - Designs that implement heavily write-optimized OLTP data structures like LSM trees do not require an indexing component. But they perform poorly scan heavy workloads against cloud storage making them unsuitable for serving analytical queries. @@ -42,8 +42,8 @@ implemented by enhancing the metadata table with the flexibility to extend to ne along with an [asynchronous index](https://hudi.apache.org/docs/metadata_indexing/#setup-async-indexing) building Hudi supports a multi-modal index by augmenting the metadata table with the capability to incorporate new types of indexes, complemented by an -asynchronous mechanism for [index construction](/docs/next/metadata_indexing). This enhancement supports a range of indexes within -the [metadata table](/docs/next/metadata#metadata-table), significantly improving the efficiency of both writing to and reading from the table. +asynchronous mechanism for [index construction](metadata_indexing). This enhancement supports a range of indexes within +the [metadata table](metadata#metadata-table), significantly improving the efficiency of both writing to and reading from the table.

Figure: Indexes in Hudi

@@ -68,7 +68,7 @@ the [metadata table](/docs/next/metadata#metadata-table), significantly improvin An [expression index](https://github.com/apache/hudi/blob/3789840be3d041cbcfc6b24786740210e4e6d6ac/rfc/rfc-63/rfc-63.md) is an index on a function of a column. If a query has a predicate on a function of a column, the expression index can be used to speed up the query. Expression index is stored in *expr_index_* prefixed partitions (one for each expression index) under metadata table. Expression index can be created using SQL syntax. Please checkout SQL DDL - docs [here](/docs/next/sql_ddl#create-functional-index-experimental) for more details. + docs [here](sql_ddl#create-expression-index) for more details. ### Secondary Index diff --git a/website/docs/metadata.md b/website/docs/metadata.md index fb79f19799acb..0295489e348b9 100644 --- a/website/docs/metadata.md +++ b/website/docs/metadata.md @@ -62,7 +62,7 @@ Following are the different types of metadata currently supported. ``` -To try out these features, refer to the [SQL guide](/docs/next/sql_ddl#create-partition-stats-and-secondary-index-experimental). +To try out these features, refer to the [SQL guide](sql_ddl#create-partition-stats-index). ## Metadata Tracking on Writers @@ -153,7 +153,7 @@ process which cannot rely on the in-process lock provider. ### Deployment Model C: Multi-writer -If your current deployment model is [multi-writer](/docs/concurrency_control#model-c-multi-writer) along with a lock +If your current deployment model is [multi-writer](concurrency_control#full-on-multi-writer--async-table-services) along with a lock provider and other required configs set for every writer as follows, there is no additional configuration required. You can bring up the writers sequentially after stopping the writers for enabling metadata table. Applying the proper configurations to only partial writers leads to loss of data from the inconsistent writer. So, ensure you enable diff --git a/website/docs/metadata_indexing.md b/website/docs/metadata_indexing.md index ee0609965fbe1..d1978c1e486ee 100644 --- a/website/docs/metadata_indexing.md +++ b/website/docs/metadata_indexing.md @@ -31,7 +31,7 @@ asynchronous indexing. To learn more about the design of asynchronous indexing f ## Index Creation Using SQL Currently indexes like secondary index, expression index and record index can be created using SQL create index command. -For more information on these indexes please refer [metadata section](/docs/next/metadata/#types-of-table-metadata) +For more information on these indexes please refer [metadata section](metadata/#types-of-table-metadata) :::note Please note in order to create secondary index: @@ -54,7 +54,7 @@ CREATE INDEX idx_column_ts ON hudi_indexed_table USING column_stats(ts) OPTIONS( CREATE INDEX idx_bloom_driver ON hudi_indexed_table USING bloom_filters(driver) OPTIONS(expr='identity'); ``` -For more information on index creation using SQL refer [SQL DDL](/docs/next/sql_ddl#create-index) +For more information on index creation using SQL refer [SQL DDL](sql_ddl#create-index) ## Index Creation Using Datasource @@ -182,8 +182,8 @@ us schedule the indexing for COLUMN_STATS index. First we need to define a prope As mentioned before, metadata indices are pluggable. One can add any index at any point in time depending on changing business requirements. Some configurations to enable particular indices are listed below. 
Currently, available indices under -metadata table can be explored [here](/docs/next/metadata/#types-of-table-metadata) along with [configs](/docs/next/metadata#enable-hudi-metadata-table-and-multi-modal-index-in-write-side) -to enable them. The full set of metadata configurations can be explored [here](/docs/next/configurations/#Metadata-Configs). +metadata table can be explored [here](indexes#multi-modal-indexing) along with [configs](metadata#metadata-tracking-on-writers) +to enable them. The full set of metadata configurations can be explored [here](configurations/#Metadata-Configs). :::note Enabling the metadata table and configuring a lock provider are the prerequisites for using async indexer. Checkout a sample diff --git a/website/docs/precommit_validator.md b/website/docs/precommit_validator.md index 5e13fca3dc0e2..d5faf61057dee 100644 --- a/website/docs/precommit_validator.md +++ b/website/docs/precommit_validator.md @@ -91,7 +91,7 @@ void validateRecordsBeforeAndAfter(Dataset before, ``` ## Additional Monitoring with Notifications -Hudi offers a [commit notification service](/docs/next/platform_services_post_commit_callback) that can be configured to trigger notifications about write commits. +Hudi offers a [commit notification service](platform_services_post_commit_callback) that can be configured to trigger notifications about write commits. The commit notification service can be combined with pre-commit validators to send a notification when a commit fails a validation. This is possible by passing details about the validation as a custom value to the HTTP endpoint. diff --git a/website/docs/procedures.md b/website/docs/procedures.md index 1dbeb899b14fa..19d6566801117 100644 --- a/website/docs/procedures.md +++ b/website/docs/procedures.md @@ -472,10 +472,10 @@ archive commits. |------------------------------------------------------------------------|---------|----------|---------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | table | String | N | None | Hudi table name | | path | String | N | None | Path of table | -| [min_commits](/docs/next/configurations#hoodiekeepmincommits) | Int | N | 20 | Similar to hoodie.keep.max.commits, but controls the minimum number of instants to retain in the active timeline. | -| [max_commits](/docs/next/configurations#hoodiekeepmaxcommits) | Int | N | 30 | Archiving service moves older entries from timeline into an archived log after each write, to keep the metadata overhead constant, even as the table size grows. This config controls the maximum number of instants to retain in the active timeline. | -| [retain_commits](/docs/next/configurations#hoodiecommitsarchivalbatch) | Int | N | 10 | Archiving of instants is batched in best-effort manner, to pack more instants into a single archive log. This config controls such archival batch size. | -| [enable_metadata](/docs/next/configurations#hoodiemetadataenable) | Boolean | N | false | Enable the internal metadata table | +| [min_commits](configurations#hoodiekeepmincommits) | Int | N | 20 | Similar to hoodie.keep.max.commits, but controls the minimum number of instants to retain in the active timeline. 
| +| [max_commits](configurations#hoodiekeepmaxcommits) | Int | N | 30 | Archiving service moves older entries from timeline into an archived log after each write, to keep the metadata overhead constant, even as the table size grows. This config controls the maximum number of instants to retain in the active timeline. | +| [retain_commits](configurations#hoodiecommitsarchivalbatch) | Int | N | 10 | Archiving of instants is batched in best-effort manner, to pack more instants into a single archive log. This config controls such archival batch size. | +| [enable_metadata](configurations#hoodiemetadataenable) | Boolean | N | false | Enable the internal metadata table | **Output** @@ -672,7 +672,7 @@ copy table to a temporary view. | Parameter Name | Type | Required | Default Value | Description | |-------------------------------------------------------------------|---------|----------|---------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | table | String | Y | None | Hudi table name | -| [query_type](/docs/next/configurations#hoodiedatasourcequerytype) | String | N | "snapshot" | Whether data needs to be read, in `incremental` mode (new data since an instantTime) (or) `read_optimized` mode (obtain latest view, based on base files) (or) `snapshot` mode (obtain latest view, by merging base and (if any) log files) | +| [query_type](configurations#hoodiedatasourcequerytype) | String | N | "snapshot" | Whether data needs to be read, in `incremental` mode (new data since an instantTime) (or) `read_optimized` mode (obtain latest view, based on base files) (or) `snapshot` mode (obtain latest view, by merging base and (if any) log files) | | view_name | String | Y | None | Name of view | | begin_instance_time | String | N | "" | Begin instance time | | end_instance_time | String | N | "" | End instance time | @@ -705,7 +705,7 @@ copy table to a new table. | Parameter Name | Type | Required | Default Value | Description | |-------------------------------------------------------------------|--------|----------|---------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | table | String | Y | None | Hudi table name | -| [query_type](/docs/next/configurations#hoodiedatasourcequerytype) | String | N | "snapshot" | Whether data needs to be read, in `incremental` mode (new data since an instantTime) (or) `read_optimized` mode (obtain latest view, based on base files) (or) `snapshot` mode (obtain latest view, by merging base and (if any) log files) | +| [query_type](configurations#hoodiedatasourcequerytype) | String | N | "snapshot" | Whether data needs to be read, in `incremental` mode (new data since an instantTime) (or) `read_optimized` mode (obtain latest view, based on base files) (or) `snapshot` mode (obtain latest view, by merging base and (if any) log files) | | new_table | String | Y | None | Name of new table | | begin_instance_time | String | N | "" | Begin instance time | | end_instance_time | String | N | "" | End instance time | @@ -1535,10 +1535,10 @@ Run cleaner on a hoodie table. 
|---------------------------------------------------------------------------------------|---------|----------|---------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | table | String | Y | None | Name of table to be cleaned | | schedule_in_line | Boolean | N | true | Set "true" if you want to schedule and run a clean. Set false if you have already scheduled a clean and want to run that. | -| [clean_policy](/docs/next/configurations#hoodiecleanerpolicy) | String | N | None | org.apache.hudi.common.model.HoodieCleaningPolicy: Cleaning policy to be used. The cleaner service deletes older file slices files to re-claim space. Long running query plans may often refer to older file slices and will break if those are cleaned, before the query has had a chance to run. So, it is good to make sure that the data is retained for more than the maximum query execution time. By default, the cleaning policy is determined based on one of the following configs explicitly set by the user (at most one of them can be set; otherwise, KEEP_LATEST_COMMITS cleaning policy is used). KEEP_LATEST_FILE_VERSIONS: keeps the last N versions of the file slices written; used when "hoodie.cleaner.fileversions.retained" is explicitly set only. KEEP_LATEST_COMMITS(default): keeps the file slices written by the last N commits; used when "hoodie.cleaner.commits.retained" is explicitly set only. KEEP_LATEST_BY_HOURS: keeps the file slices written in the last N hours based on the commit time; used when "hoodie.cleaner.hours.retained" is explicitly set only. | -| [retain_commits](/docs/next/configurations#hoodiecleanercommitsretained) | Int | N | None | When KEEP_LATEST_COMMITS cleaning policy is used, the number of commits to retain, without cleaning. This will be retained for num_of_commits * time_between_commits (scheduled). This also directly translates into how much data retention the table supports for incremental queries. | -| [hours_retained](/docs/next/configurations#hoodiecleanerhoursretained) | Int | N | None | When KEEP_LATEST_BY_HOURS cleaning policy is used, the number of hours for which commits need to be retained. This config provides a more flexible option as compared to number of commits retained for cleaning service. Setting this property ensures all the files, but the latest in a file group, corresponding to commits with commit times older than the configured number of hours to be retained are cleaned. 
| -| [file_versions_retained](/docs/next/configurations#hoodiecleanerfileversionsretained) | Int | N | None | When KEEP_LATEST_FILE_VERSIONS cleaning policy is used, the minimum number of file slices to retain in each file group, during cleaning. | +| [clean_policy](configurations#hoodiecleanerpolicy) | String | N | None | org.apache.hudi.common.model.HoodieCleaningPolicy: Cleaning policy to be used. The cleaner service deletes older file slices files to re-claim space. Long running query plans may often refer to older file slices and will break if those are cleaned, before the query has had a chance to run. So, it is good to make sure that the data is retained for more than the maximum query execution time. By default, the cleaning policy is determined based on one of the following configs explicitly set by the user (at most one of them can be set; otherwise, KEEP_LATEST_COMMITS cleaning policy is used). KEEP_LATEST_FILE_VERSIONS: keeps the last N versions of the file slices written; used when "hoodie.cleaner.fileversions.retained" is explicitly set only. KEEP_LATEST_COMMITS(default): keeps the file slices written by the last N commits; used when "hoodie.cleaner.commits.retained" is explicitly set only. KEEP_LATEST_BY_HOURS: keeps the file slices written in the last N hours based on the commit time; used when "hoodie.cleaner.hours.retained" is explicitly set only. | +| [retain_commits](configurations#hoodiecleanercommitsretained) | Int | N | None | When KEEP_LATEST_COMMITS cleaning policy is used, the number of commits to retain, without cleaning. This will be retained for num_of_commits * time_between_commits (scheduled). This also directly translates into how much data retention the table supports for incremental queries. | +| [hours_retained](configurations#hoodiecleanerhoursretained) | Int | N | None | When KEEP_LATEST_BY_HOURS cleaning policy is used, the number of hours for which commits need to be retained. This config provides a more flexible option as compared to number of commits retained for cleaning service. Setting this property ensures all the files, but the latest in a file group, corresponding to commits with commit times older than the configured number of hours to be retained are cleaned. | +| [file_versions_retained](configurations#hoodiecleanerfileversionsretained) | Int | N | None | When KEEP_LATEST_FILE_VERSIONS cleaning policy is used, the minimum number of file slices to retain in each file group, during cleaning. | | [trigger_strategy](/docs/next/configurations#hoodiecleantriggerstrategy) | String | N | None | org.apache.hudi.table.action.clean.CleaningTriggerStrategy: Controls when cleaning is scheduled. NUM_COMMITS(default): Trigger the cleaning service every N commits, determined by `hoodie.clean.max.commits` | | [trigger_max_commits](/docs/next/configurations/#hoodiecleanmaxcommits) | Int | N | None | Number of commits after the last clean operation, before scheduling of a new clean is attempted. | | [options](/docs/next/configurations/#Clean-Configs) | String | N | None | comma separated list of Hudi configs for cleaning in the format "config1=value1,config2=value2" | diff --git a/website/docs/querying_data.md b/website/docs/querying_data.md index 31069822df761..83a03a4a1121f 100644 --- a/website/docs/querying_data.md +++ b/website/docs/querying_data.md @@ -7,7 +7,7 @@ last_modified_at: 2019-12-30T15:59:57-04:00 --- :::danger -This page is no longer maintained. 
Please refer to Hudi [SQL DDL](/docs/next/sql_ddl), [SQL DML](/docs/next/sql_dml), [SQL Queries](/docs/next/sql_queries) and [Procedures](/docs/next/procedures) for the latest documentation. +This page is no longer maintained. Please refer to Hudi [SQL DDL](sql_ddl), [SQL DML](sql_dml), [SQL Queries](sql_queries) and [Procedures](procedures) for the latest documentation. ::: Conceptually, Hudi stores data physically once on DFS, while providing 3 different ways of querying, as explained [before](/docs/concepts#query-types). diff --git a/website/docs/quick-start-guide.md b/website/docs/quick-start-guide.md index 7c4c2e8077d90..4ddb4005df319 100644 --- a/website/docs/quick-start-guide.md +++ b/website/docs/quick-start-guide.md @@ -257,7 +257,7 @@ CREATE TABLE hudi_table ( PARTITIONED BY (city); ``` -For more options for creating Hudi tables or if you're running into any issues, please refer to [SQL DDL](/docs/next/sql_ddl) reference guide. +For more options for creating Hudi tables or if you're running into any issues, please refer to [SQL DDL](sql_ddl) reference guide. @@ -301,7 +301,7 @@ inserts.write.format("hudi"). ``` :::info Mapping to Hudi write operations -Hudi provides a wide range of [write operations](/docs/next/write_operations) - both batch and incremental - to write data into Hudi tables, +Hudi provides a wide range of [write operations](write_operations) - both batch and incremental - to write data into Hudi tables, with different semantics and performance. When record keys are not configured (see [keys](#keys) below), `bulk_insert` will be chosen as the write operation, matching the out-of-behavior of Spark's Parquet Datasource. ::: @@ -334,7 +334,7 @@ inserts.write.format("hudi"). \ ``` :::info Mapping to Hudi write operations -Hudi provides a wide range of [write operations](/docs/next/write_operations) - both batch and incremental - to write data into Hudi tables, +Hudi provides a wide range of [write operations](write_operations) - both batch and incremental - to write data into Hudi tables, with different semantics and performance. When record keys are not configured (see [keys](#keys) below), `bulk_insert` will be chosen as the write operation, matching the out-of-behavior of Spark's Parquet Datasource. ::: @@ -343,7 +343,7 @@ the write operation, matching the out-of-behavior of Spark's Parquet Datasource. -Users can use 'INSERT INTO' to insert data into a Hudi table. See [Insert Into](/docs/next/sql_dml#insert-into) for more advanced options. +Users can use 'INSERT INTO' to insert data into a Hudi table. See [Insert Into](sql_dml#insert-into) for more advanced options. ```sql INSERT INTO hudi_table @@ -455,7 +455,7 @@ Notice that the save mode is now `Append`. In general, always use append mode un -Hudi table can be update using a regular UPDATE statement. See [Update](/docs/next/sql_dml#update) for more advanced options. +A Hudi table can be updated using a regular UPDATE statement. See [Update](sql_dml#update) for more advanced options. ```sql UPDATE hudi_table SET fare = 25.0 WHERE rider = 'rider-D'; @@ -485,7 +485,7 @@ Notice that the save mode is now `Append`. In general, always use append mode un -[Querying](#querying) the data again will now show updated records. Each write operation generates a new [commit](/docs/next/concepts). +[Querying](#querying) the data again will now show updated records. Each write operation generates a new [commit](concepts). 
Look for changes in `_hoodie_commit_time`, `fare` fields for the given `_hoodie_record_key` value from a previous commit. ## Merging Data {#merge} @@ -1264,7 +1264,7 @@ PARTITIONED BY (city); > :::note Implications of defining record keys -Configuring keys for a Hudi table, has a new implications on the table. If record key is set by the user, `upsert` is chosen as the [write operation](/docs/next/write_operations). +Configuring keys for a Hudi table has a few implications on the table. If record key is set by the user, `upsert` is chosen as the [write operation](write_operations). Also if a record key is configured, then it's also advisable to specify a precombine or ordering field, to correctly handle cases where the source data has multiple records with the same key. See section below. ::: @@ -1276,8 +1276,8 @@ Hudi also uses this mechanism to support out-of-order data arrival into a table, For e.g. using a _created_at_ timestamp field as the precombine field will prevent older versions of a record from overwriting newer ones or being exposed to queries, even if they are written at a later commit time to the table. This is one of the key features, that makes Hudi, best suited for dealing with streaming data. -To enable different merge semantics, Hudi supports [merge modes](/docs/next/record_merger). Commit time and event time based merge modes are supported out of the box. -Users can also define their own custom merge strategies, see [here](/docs/next/sql_ddl#create-table-with-record-merge-mode). +To enable different merge semantics, Hudi supports [merge modes](record_merger). Commit time and event time based merge modes are supported out of the box. +Users can also define their own custom merge strategies; see [here](sql_ddl#create-table-with-record-merge-mode). `(see also [build with scala 2.12](https://github.com/apache/hudi#build-with-different-spark-versions)) -for more info. If you are looking for ways to migrate your existing data to Hudi, refer to [migration guide](/docs/next/migration_guide). +for more info. If you are looking for ways to migrate your existing data to Hudi, refer to [migration guide](migration_guide). ### Spark SQL Reference -For advanced usage of spark SQL, please refer to [Spark SQL DDL](/docs/next/sql_ddl) and [Spark SQL DML](/docs/next/sql_dml) reference guides. -For alter table commands, check out [this](/docs/next/sql_ddl#spark-alter-table). Stored procedures provide a lot of powerful capabilities using Hudi SparkSQL to assist with monitoring, managing and operating Hudi tables, please check [this](/docs/next/procedures) out. +For advanced usage of spark SQL, please refer to [Spark SQL DDL](sql_ddl) and [Spark SQL DML](sql_dml) reference guides. +For alter table commands, check out [this](sql_ddl#spark-alter-table). Stored procedures provide a lot of powerful capabilities using Hudi SparkSQL to assist with monitoring, managing and operating Hudi tables, please check [this](procedures) out. ### Streaming workloads @@ -1355,9 +1355,9 @@ Hudi provides industry-leading performance and functionality for streaming data. from various different sources in a streaming manner, with powerful built-in capabilities like auto checkpointing, schema enforcement via schema provider, transformation support, automatic table services and so on. -**Structured Streaming** - Hudi supports Spark Structured Streaming reads and writes as well. Please see [here](/docs/next/writing_tables_streaming_writes#spark-streaming) for more. 
+**Structured Streaming** - Hudi supports Spark Structured Streaming reads and writes as well. Please see [here](writing_tables_streaming_writes#spark-streaming) for more. -Check out more information on [modeling data in Hudi](/docs/next/faq_general#how-do-i-model-the-data-stored-in-hudi) and different ways to perform [batch writes](/docs/writing_data) and [streaming writes](/docs/next/writing_tables_streaming_writes). +Check out more information on [modeling data in Hudi](faq_general#how-do-i-model-the-data-stored-in-hudi) and different ways to perform [batch writes](/docs/writing_data) and [streaming writes](writing_tables_streaming_writes). ### Dockerized Demo Even as we showcased the core capabilities, Hudi supports a lot more advanced functionality that can make it easy diff --git a/website/docs/record_merger.md b/website/docs/record_merger.md index d98a5fc462a6d..378c5575ad19c 100644 --- a/website/docs/record_merger.md +++ b/website/docs/record_merger.md @@ -6,7 +6,7 @@ toc_min_heading_level: 2 toc_max_heading_level: 4 --- -Hudi handles mutations to records and streaming data, as we briefly touched upon in [timeline ordering](/docs/next/timeline#ordering-of-actions) section. +Hudi handles mutations to records and streaming data, as we briefly touched upon in [timeline ordering](timeline#ordering-of-actions) section. To provide users full-fledged support for stream processing, Hudi goes all the way making the storage engine and the underlying storage format understand how to merge changes to the same record key, that may arrive even in different order at different times. With the rise of mobile applications and IoT, these scenarios have become the normal than an exception. For e.g. a social networking application uploading user events several hours after they happened, @@ -54,7 +54,7 @@ With event time ordering, the merging picks the record with the highest value on In the example above, two microservices product change records about orders at different times, that can arrive out-of-order. As color coded, this can lead to application-level inconsistent states in the table if simply merged in commit time order like a cancelled order being re-created or a paid order moved back to just created state expecting payment again. Event time ordering helps by ignoring older state changes that arrive late and -avoiding order status from "jumping back" in time. Combined with [non-blocking concurrency control](/docs/next/concurrency_control#non-blocking-concurrency-control-mode), +avoiding order status from "jumping back" in time. Combined with [non-blocking concurrency control](concurrency_control#non-blocking-concurrency-control-mode), this provides a very powerful way for processing such data streams efficiently and correctly. ### CUSTOM @@ -249,5 +249,5 @@ Payload class can be specified using the below configs. For more advanced config There are also quite a few other implementations. Developers may be interested in looking at the hierarchy of `HoodieRecordPayload` interface. For example, [`MySqlDebeziumAvroPayload`](https://github.com/apache/hudi/blob/e76dd102bcaf8aec5a932e7277ccdbfd73ce1a32/hudi-common/src/main/java/org/apache/hudi/common/model/debezium/MySqlDebeziumAvroPayload.java) and [`PostgresDebeziumAvroPayload`](https://github.com/apache/hudi/blob/e76dd102bcaf8aec5a932e7277ccdbfd73ce1a32/hudi-common/src/main/java/org/apache/hudi/common/model/debezium/PostgresDebeziumAvroPayload.java) provides support for seamlessly applying changes captured via Debezium for MySQL and PostgresDB. 
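As a rough illustration of how one of these payload implementations is wired in (not taken from the docs being changed here): the sketch below points a Spark datasource write at the Postgres Debezium payload class. The DataFrame `df`, table name, path and field names are placeholders; the option keys are the standard Hudi write configs referenced elsewhere on this page.

```scala
// Sketch: selecting a specific payload implementation for merging change records on write.
// `df` is assumed to be a DataFrame of Debezium change events for the target table.
df.write.format("hudi").
  option("hoodie.table.name", "hudi_table").
  option("hoodie.datasource.write.recordkey.field", "id").
  option("hoodie.datasource.write.precombine.field", "ts").
  option("hoodie.datasource.write.payload.class",
    "org.apache.hudi.common.model.debezium.PostgresDebeziumAvroPayload").
  mode("append").
  save("/tmp/hudi_table")
```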
[`AWSDmsAvroPayload`](https://github.com/apache/hudi/blob/e76dd102bcaf8aec5a932e7277ccdbfd73ce1a32/hudi-common/src/main/java/org/apache/hudi/common/model/AWSDmsAvroPayload.java) provides support for applying changes captured via Amazon Database Migration Service onto S3. -For full configurations, go [here](/docs/configurations#RECORD_PAYLOAD) and please check out [this FAQ](/docs/next/faq_writing_tables/#can-i-implement-my-own-logic-for-how-input-records-are-merged-with-record-on-storage) if you want to implement your own custom payloads. +For full configurations, go [here](/docs/configurations#RECORD_PAYLOAD) and please check out [this FAQ](faq_writing_tables/#can-i-implement-my-own-logic-for-how-input-records-are-merged-with-record-on-storage) if you want to implement your own custom payloads. diff --git a/website/docs/sql_ddl.md b/website/docs/sql_ddl.md index e04f64d68b9d0..c00b815ac4e85 100644 --- a/website/docs/sql_ddl.md +++ b/website/docs/sql_ddl.md @@ -105,7 +105,7 @@ TBLPROPERTIES ( ### Create table with merge modes {#create-table-with-record-merge-mode} -Hudi supports different [record merge modes](/docs/next/record_merger) to handle merge of incoming records with existing +Hudi supports different [record merge modes](record_merger) to handle merge of incoming records with existing records. To create a table with specific record merge mode, you can set `recordMergeMode` option. ```sql @@ -127,7 +127,7 @@ LOCATION 'file:///tmp/hudi_table_merge_mode/'; With `EVENT_TIME_ORDERING`, the record with the larger event time (`precombineField`) overwrites the record with the smaller event time on the same key, regardless of transaction's commit time. Users can set `CUSTOM` mode to provide their own merge logic. With `CUSTOM` merge mode, you can provide a custom class that implements the merge logic. The interfaces -to implement is explained in detail [here](/docs/next/record_merger#custom). +to implement is explained in detail [here](record_merger#custom). ```sql CREATE TABLE IF NOT EXISTS hudi_table_merge_mode_custom ( @@ -236,7 +236,7 @@ AS SELECT * FROM parquet_table; ### Create Index Hudi supports creating and dropping different types of indexes on a table. For more information on different -type of indexes please refer [multi-modal indexing](/docs/next/indexes#multi-modal-indexing). Secondary +type of indexes please refer [multi-modal indexing](indexes#multi-modal-indexing). Secondary index, expression index and record indexes can be created using SQL create index command. ```sql @@ -529,7 +529,7 @@ CREATE INDEX idx_bloom_rider ON hudi_indexed_table USING bloom_filters(rider) OP - Secondary index can only be used for tables using OverwriteWithLatestAvroPayload payload or COMMIT_TIME_ORDERING merge mode - Column stats Expression Index can not be created using `identity` expression with SQL. Users can leverage column stat index using Datasource instead. - Index update can fail with schema evolution. -- Only one index can be created at a time using [async indexer](/docs/next/metadata_indexing). +- Only one index can be created at a time using [async indexer](metadata_indexing). ### Setting Hudi configs @@ -592,7 +592,7 @@ Users can set table properties while creating a table. 
The important table prope #### Passing Lock Providers for Concurrent Writers Hudi requires a lock provider to support concurrent writers or asynchronous table services when using OCC -and [NBCC](/docs/next/concurrency_control#non-blocking-concurrency-control-mode-experimental) (Non-Blocking Concurrency Control) +and [NBCC](concurrency_control#non-blocking-concurrency-control) (Non-Blocking Concurrency Control) concurrency mode. For NBCC mode, locking is only used to write the commit metadata file in the timeline. Writes are serialized by completion time. Users can pass these table properties into *TBLPROPERTIES* as well. Below is an example for a Zookeeper based configuration. @@ -843,7 +843,7 @@ WITH ( ### Create Table in Non-Blocking Concurrency Control Mode -The following is an example of creating a Flink table in [Non-Blocking Concurrency Control mode](/docs/next/concurrency_control#non-blocking-concurrency-control). +The following is an example of creating a Flink table in [Non-Blocking Concurrency Control mode](concurrency_control#non-blocking-concurrency-control). ```sql -- This is a datagen source that can generate records continuously @@ -911,7 +911,7 @@ ALTER TABLE tableA RENAME TO tableB; ### Setting Hudi configs #### Using table options -You can configure hoodie configs in table options when creating a table. You can refer Flink specific hoodie configs [here](/docs/next/configurations#FLINK_SQL) +You can configure hoodie configs in table options when creating a table. You can refer Flink specific hoodie configs [here](configurations#FLINK_SQL) These configs will be applied to all the operations on that table. ```sql diff --git a/website/docs/sql_dml.md b/website/docs/sql_dml.md index 43d5d940fb379..6f5fe28a3eba1 100644 --- a/website/docs/sql_dml.md +++ b/website/docs/sql_dml.md @@ -12,7 +12,7 @@ import TabItem from '@theme/TabItem'; SparkSQL provides several Data Manipulation Language (DML) actions for interacting with Hudi tables. These operations allow you to insert, update, merge and delete data from your Hudi tables. Let's explore them one by one. -Please refer to [SQL DDL](/docs/next/sql_ddl) for creating Hudi tables using SQL. +Please refer to [SQL DDL](sql_ddl) for creating Hudi tables using SQL. ### Insert Into @@ -25,7 +25,7 @@ SELECT FROM ; :::note Deprecations From 0.14.0, `hoodie.sql.bulk.insert.enable` and `hoodie.sql.insert.mode` are deprecated. Users are expected to use `hoodie.spark.sql.insert.into.operation` instead. -To manage duplicates with `INSERT INTO`, please check out [insert dup policy config](/docs/next/configurations#hoodiedatasourceinsertduppolicy). +To manage duplicates with `INSERT INTO`, please check out [insert dup policy config](configurations#hoodiedatasourceinsertduppolicy). ::: Examples: @@ -384,7 +384,7 @@ INSERT INTO hudi_table select ... from ...; Hudi Flink supports a new non-blocking concurrency control mode, where multiple writer tasks can be executed concurrently without blocking each other. One can read more about this mode in -the [concurrency control](/docs/next/concurrency_control#model-c-multi-writer) docs. Let us see it in action here. +the [concurrency control](concurrency_control#model-c-multi-writer) docs. Let us see it in action here. In the below example, we have two streaming ingestion pipelines that concurrently update the same table. 
One of the pipeline is responsible for the compaction and cleaning table services, while the other pipeline is just for data diff --git a/website/docs/sql_queries.md b/website/docs/sql_queries.md index b0bda5b6d11d7..f96ddca3bb6d2 100644 --- a/website/docs/sql_queries.md +++ b/website/docs/sql_queries.md @@ -196,7 +196,7 @@ DROP INDEX partition_stats on hudi_indexed_table; ### Snapshot Query with Event Time Ordering -Hudi supports different [record merge modes](/docs/next/record_merger) for merging the records from the same key. Event +Hudi supports different [record merge modes](record_merger) for merging the records from the same key. Event time ordering is one of the merge modes where the records are merged based on the event time. Let's create a table with event time ordering merge mode. diff --git a/website/docs/storage_layouts.md b/website/docs/storage_layouts.md index 64feb755ba5e3..a7395ed6e858e 100644 --- a/website/docs/storage_layouts.md +++ b/website/docs/storage_layouts.md @@ -12,8 +12,8 @@ The following describes the general organization of files in storage for a Hudi * Each slice contains a **_base file_** (parquet/orc/hfile) (defined by the config - [hoodie.table.base.file.format](https://hudi.apache.org/docs/next/configurations/#hoodietablebasefileformat) ) written by a commit that completed at a certain instant, along with set of **_log files_** (*.log.*) written by commits that completed before the next base file's requested instant. -* Hudi employs Multiversion Concurrency Control (MVCC), where [compaction](/docs/next/compaction) action merges logs and base files to produce new - file slices and [cleaning](/docs/next/cleaning) action gets rid of unused/older file slices to reclaim space on the file system. +* Hudi employs Multiversion Concurrency Control (MVCC), where [compaction](compaction) action merges logs and base files to produce new + file slices and [cleaning](cleaning) action gets rid of unused/older file slices to reclaim space on the file system. * All metadata including timeline, metadata table are stored in a special `.hoodie` directory under the base path. ![file groups in a table partition](/assets/images/MOR_new.png) diff --git a/website/docs/use_cases.md b/website/docs/use_cases.md index 9ad0d255b3d9d..8120c7436078c 100644 --- a/website/docs/use_cases.md +++ b/website/docs/use_cases.md @@ -65,10 +65,10 @@ together to build out the platform. Such an open platform is also essential for - While open data formats help, Hudi unlocks complete freedom by also providing open compute services for ingesting, optimizing, indexing and querying data. For e.g Hudi's writers come with a self-managing table service runtime that can maintain tables automatically in the background on each write. Often times, Hudi and your favorite open query engine is all you need to get an open data platform up and running. -- Examples of open services that make performance optimization or management easy include: [auto file sizing](/docs/next/file_sizing) to solve the "small files" problem, - [clustering](/docs/next/clustering) to co-locate data next to each other, [compaction](/docs/next/compaction) to allow tuning of low latency ingestion + fast read queries, - [indexing](/docs/next/indexes) - for faster writes/queries, Multi-Dimensional Partitioning (Z-Ordering), automatic cleanup of uncommitted data with marker mechanism, - [auto cleaning](/docs/next/cleaning) to automatically removing old versions of files. 
+- Examples of open services that make performance optimization or management easy include: [auto file sizing](file_sizing) to solve the "small files" problem, + [clustering](clustering) to co-locate data next to each other, [compaction](compaction) to allow tuning of low latency ingestion + fast read queries, + [indexing](indexes) - for faster writes/queries, Multi-Dimensional Partitioning (Z-Ordering), automatic cleanup of uncommitted data with marker mechanism, + [auto cleaning](cleaning) to automatically removing old versions of files. - Hudi provides rich options for pre-sorting/loading data efficiently and then follow on with rich set of data clustering techniques to manage file sizes and data distribution within a table. In each case, Hudi provides high-degree of configurability in terms of when/how often these services are scheduled, planned and executed. For e.g. Hudi ships with a handful of common planning strategies for compaction and clustering. - Along with compatibility with other open table formats like [Apache Iceberg](https://iceberg.apache.org/)/[Delta Lake](https://delta.io/), and catalog sync services to various data catalogs, Hudi is one of the most open choices for your data foundation. diff --git a/website/docs/write_operations.md b/website/docs/write_operations.md index 0ffff5713e57a..d4e8f8fedf250 100644 --- a/website/docs/write_operations.md +++ b/website/docs/write_operations.md @@ -120,16 +120,16 @@ Here are the basic configs relevant to the write operations types mentioned abov The following is an inside look on the Hudi write path and the sequence of events that occur during a write. 1. [Deduping](/docs/configurations#hoodiecombinebeforeinsert) : First your input records may have duplicate keys within the same batch and duplicates need to be combined or reduced by key. -2. [Index Lookup](/docs/next/indexes) : Next, an index lookup is performed to try and match the input records to identify which file groups they belong to. -3. [File Sizing](/docs/next/file_sizing): Then, based on the average size of previous commits, Hudi will make a plan to add enough records to a small file to get it close to the configured maximum limit. -4. [Partitioning](/docs/next/storage_layouts): We now arrive at partitioning where we decide what file groups certain updates and inserts will be placed in or if new file groups will be created +2. [Index Lookup](indexes) : Next, an index lookup is performed to try and match the input records to identify which file groups they belong to. +3. [File Sizing](file_sizing): Then, based on the average size of previous commits, Hudi will make a plan to add enough records to a small file to get it close to the configured maximum limit. +4. [Partitioning](storage_layouts): We now arrive at partitioning where we decide what file groups certain updates and inserts will be placed in or if new file groups will be created 5. Write I/O :Now we actually do the write operations which is either creating a new base file, appending to the log file, or versioning an existing base file. -6. Update [Index](/docs/next/indexes): Now that the write is performed, we will go back and update the index. -7. Commit: Finally we commit all of these changes atomically. ([Post-commit callback](/docs/next/platform_services_post_commit_callback) can be configured.) -8. [Clean](/docs/next/cleaning) (if needed): Following the commit, cleaning is invoked if needed. -9. 
[Compaction](/docs/next/compaction): If you are using MOR tables, compaction will either run inline, or be scheduled asynchronously -10. Archive : Lastly, we perform an archival step which moves old [timeline](/docs/next/timeline) items to an archive folder. +6. Update [Index](indexes): Now that the write is performed, we will go back and update the index. +7. Commit: Finally we commit all of these changes atomically. ([Post-commit callback](platform_services_post_commit_callback) can be configured.) +8. [Clean](cleaning) (if needed): Following the commit, cleaning is invoked if needed. +9. [Compaction](compaction): If you are using MOR tables, compaction will either run inline, or be scheduled asynchronously +10. Archive : Lastly, we perform an archival step which moves old [timeline](timeline) items to an archive folder. Here is a diagramatic representation of the flow. diff --git a/website/docs/writing_data.md b/website/docs/writing_data.md index 308de8ca78acb..81462307a7f56 100644 --- a/website/docs/writing_data.md +++ b/website/docs/writing_data.md @@ -83,7 +83,7 @@ df.write.format("hudi"). You can check the data generated under `/tmp/hudi_trips_cow////`. We provided a record key (`uuid` in [schema](https://github.com/apache/hudi/blob/6f9b02decb5bb2b83709b1b6ec04a97e4d102c11/hudi-spark-datasource/hudi-spark/src/main/java/org/apache/hudi/QuickstartUtils.java#L60)), partition field (`region/country/city`) and combine logic (`ts` in [schema](https://github.com/apache/hudi/blob/6f9b02decb5bb2b83709b1b6ec04a97e4d102c11/hudi-spark-datasource/hudi-spark/src/main/java/org/apache/hudi/QuickstartUtils.java#L60)) to ensure trip records are unique within each partition. For more info, refer to -[Modeling data stored in Hudi](/docs/next/faq_general/#how-do-i-model-the-data-stored-in-hudi) +[Modeling data stored in Hudi](faq_general/#how-do-i-model-the-data-stored-in-hudi) and for info on ways to ingest data into Hudi, refer to [Writing Hudi Tables](/docs/hoodie_streaming_ingestion). Here we are using the default write operation : `upsert`. If you have a workload without updates, you can also issue `insert` or `bulk_insert` operations which could be faster. To know more, refer to [Write operations](/docs/write_operations) @@ -119,7 +119,7 @@ df.write.format("hudi"). You can check the data generated under `/tmp/hudi_trips_cow////`. We provided a record key (`uuid` in [schema](https://github.com/apache/hudi/blob/2e6e302efec2fa848ded4f88a95540ad2adb7798/hudi-spark-datasource/hudi-spark/src/main/java/org/apache/hudi/QuickstartUtils.java#L60)), partition field (`region/country/city`) and combine logic (`ts` in [schema](https://github.com/apache/hudi/blob/2e6e302efec2fa848ded4f88a95540ad2adb7798/hudi-spark-datasource/hudi-spark/src/main/java/org/apache/hudi/QuickstartUtils.java#L60)) to ensure trip records are unique within each partition. For more info, refer to -[Modeling data stored in Hudi](/docs/next/faq_general/#how-do-i-model-the-data-stored-in-hudi) +[Modeling data stored in Hudi](faq_general/#how-do-i-model-the-data-stored-in-hudi) and for info on ways to ingest data into Hudi, refer to [Writing Hudi Tables](/docs/hoodie_streaming_ingestion). Here we are using the default write operation : `upsert`. If you have a workload without updates, you can also issue `insert` or `bulk_insert` operations which could be faster. 
To know more, refer to [Write operations](/docs/write_operations) diff --git a/website/releases/release-0.11.0.md b/website/releases/release-0.11.0.md index fbea4897b45c8..dce7ba212fe37 100644 --- a/website/releases/release-0.11.0.md +++ b/website/releases/release-0.11.0.md @@ -59,7 +59,7 @@ latency with data skipping. Two new indices are added to the metadata table They are disabled by default. You can enable them by setting `hoodie.metadata.index.bloom.filter.enable` and `hoodie.metadata.index.column.stats.enable` to `true`, respectively. -*Refer to the [metadata table guide](/docs/metadata#deployment-considerations) for detailed instructions on upgrade and +*Refer to the [metadata table guide](/docs/metadata#deployment-considerations-for-metadata-table) for detailed instructions on upgrade and deployment.* ### Data Skipping with Metadata Table diff --git a/website/releases/release-0.7.0.md b/website/releases/release-0.7.0.md index 3573b331093c0..1eb9274ec689a 100644 --- a/website/releases/release-0.7.0.md +++ b/website/releases/release-0.7.0.md @@ -64,7 +64,7 @@ Specifically, the `HoodieFlinkStreamer` allows for Hudi Copy-On-Write table to b derived/ETL pipelines similar to data [sensors](https://airflow.apache.org/docs/apache-airflow/stable/_api/airflow/sensors/index) in Apache Airflow. - **Insert Overwrite/Insert Overwrite Table**: We have added these two new write operation types, predominantly to help existing batch ETL jobs, which typically overwrite entire tables/partitions each run. These operations are much cheaper, than having to issue upserts, given they are bulk replacing the target table. - Check [here](/docs/quick-start-guide#insert-overwrite-table) for examples. + Check [here](/docs/0.7.0/quick-start-guide#insert-overwrite-table) for examples. - **Delete Partition**: For users of WriteClient/RDD level apis, we have added an API to delete an entire partition, again without issuing deletes at the record level. - The current default `OverwriteWithLatestAvroPayload` will overwrite the value in storage, even if for e.g the upsert was reissued for an older value of the key. Added a new `DefaultHoodieRecordPayload` and a new payload config `hoodie.payload.ordering.field` helps specify a field, that the incoming upsert record can be compared with diff --git a/website/releases/release-0.9.0.md b/website/releases/release-0.9.0.md index 50b7005e2ff8e..dbb4b5cc67e54 100644 --- a/website/releases/release-0.9.0.md +++ b/website/releases/release-0.9.0.md @@ -45,7 +45,7 @@ Hudi tables are now registered with Hive as spark datasource tables, meaning Spa instead of relying on the Hive fallbacks within Spark, which are ill-maintained/cumbersome. This unlocks many optimizations such as the use of Hudi's own [FileIndex](https://github.com/apache/hudi/blob/bf5a52e51bbeaa089995335a0a4c55884792e505/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/HoodieFileIndex.scala#L46) implementation for optimized caching and the use of the Hudi metadata table, for faster listing of large tables. We have also added support for -[timetravel query](/docs/quick-start-guide#time-travel-query), for spark datasource. +[timetravel query](/docs/0.9.0/quick-start-guide#time-travel-query), for spark datasource. 
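For readers unfamiliar with the feature, a time travel read through the Spark datasource looks roughly like the sketch below; the `as.of.instant` option is the commonly documented switch, and the instant string and base path are placeholders rather than values from these release notes.

```scala
// Sketch: read a snapshot of the table as of an earlier commit instant.
// Both the instant string and the base path below are placeholders.
val asOfDF = spark.read.format("hudi").
  option("as.of.instant", "20210728141108").
  load("/tmp/hudi_trips_cow")
asOfDF.show(false)
```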
### Writer side improvements diff --git a/website/releases/release-1.0.0-beta1.md b/website/releases/release-1.0.0-beta1.md index fa8c371b1e3c1..7fc1b1f6e5263 100644 --- a/website/releases/release-1.0.0-beta1.md +++ b/website/releases/release-1.0.0-beta1.md @@ -45,7 +45,7 @@ changes in this release: - Completed actions, their plans and completion metadata are stored in a more scalable [LSM tree](https://en.wikipedia.org/wiki/Log-structured_merge-tree) based timeline organized in an * *_archived_** storage location under the .hoodie metadata path. It consists of Apache Parquet files with action - instant data and bookkeeping metadata files, in the following manner. Checkout [timeline](/docs/next/timeline#lsm-timeline) docs for more details. + instant data and bookkeeping metadata files, in the following manner. Checkout [timeline](/docs/next/timeline#lsm-timeline-history) docs for more details. #### Log File Format @@ -68,7 +68,7 @@ A new concurrency control mode called `NON_BLOCKING_CONCURRENCY_CONTROL` is intr OCC, multiple writers can operate on the table with non-blocking conflict resolution. The writers can write into the same file group with the conflicts resolved automatically by the query reader and the compactor. The new concurrency mode is currently available for preview in version 1.0.0-beta only. You can read more about it under -section [Model C: Multi-writer](/docs/next/concurrency_control#non-blocking-concurrency-control-mode-experimental). A complete example with multiple +section [Model C: Multi-writer](/docs/next/concurrency_control#non-blocking-concurrency-control). A complete example with multiple Flink streaming writers is available [here](/docs/next/sql_dml#non-blocking-concurrency-control-experimental). You can follow the [RFC](https://github.com/apache/hudi/blob/master/rfc/rfc-66/rfc-66.md) and the [JIRA](https://issues.apache.org/jira/browse/HUDI-6640) for more details. diff --git a/website/releases/release-1.0.0-beta2.md b/website/releases/release-1.0.0-beta2.md index bea04c3bfd189..698b2aa3c6e4b 100644 --- a/website/releases/release-1.0.0-beta2.md +++ b/website/releases/release-1.0.0-beta2.md @@ -57,7 +57,7 @@ queries with predicate on columns other than record key columns. Partition stats index aggregates statistics at the partition level for the columns for which it is enabled. This helps in efficient partition pruning even for non-partition fields. -To try out these features, refer to the [SQL guide](/docs/next/sql_ddl#create-partition-stats-and-secondary-index-experimental). +To try out these features, refer to the [SQL guide](/docs/next/sql_ddl#create-partition-stats-index). ### API Changes diff --git a/website/versioned_docs/version-0.10.0/clustering.md b/website/versioned_docs/version-0.10.0/clustering.md index f9bdd572751ac..e630e92445d5b 100644 --- a/website/versioned_docs/version-0.10.0/clustering.md +++ b/website/versioned_docs/version-0.10.0/clustering.md @@ -12,7 +12,7 @@ Apache Hudi brings stream processing to big data, providing fresh data while bei ## Clustering Architecture -At a high level, Hudi provides different operations such as insert/upsert/bulk_insert through it’s write client API to be able to write data to a Hudi table. To be able to choose a trade-off between file size and ingestion speed, Hudi provides a knob `hoodie.parquet.small.file.limit` to be able to configure the smallest allowable file size. 
Users are able to configure the small file [soft limit](https://hudi.apache.org/docs/configurations#compactionSmallFileSize) to `0` to force new data to go into a new set of filegroups or set it to a higher value to ensure new data gets “padded” to existing files until it meets that limit that adds to ingestion latencies. +At a high level, Hudi provides different operations such as insert/upsert/bulk_insert through it’s write client API to be able to write data to a Hudi table. To be able to choose a trade-off between file size and ingestion speed, Hudi provides a knob `hoodie.parquet.small.file.limit` to be able to configure the smallest allowable file size. Users are able to configure the small file [soft limit](https://hudi.apache.org/docs/configurations#hoodieparquetsmallfilelimit) to `0` to force new data to go into a new set of filegroups or set it to a higher value to ensure new data gets “padded” to existing files until it meets that limit that adds to ingestion latencies. diff --git a/website/versioned_docs/version-0.10.0/compaction.md b/website/versioned_docs/version-0.10.0/compaction.md index 015d21ec68221..e7689b7fdcb3d 100644 --- a/website/versioned_docs/version-0.10.0/compaction.md +++ b/website/versioned_docs/version-0.10.0/compaction.md @@ -95,7 +95,7 @@ is enabled by default. ::: ### Hudi Compactor Utility -Hudi provides a standalone tool to execute specific compactions asynchronously. Below is an example and you can read more in the [deployment guide](/docs/deployment#compactions) +Hudi provides a standalone tool to execute specific compactions asynchronously. Below is an example and you can read more in the [deployment guide](/docs/cli#compactions) Example: ```properties diff --git a/website/versioned_docs/version-0.10.0/concurrency_control.md b/website/versioned_docs/version-0.10.0/concurrency_control.md index 4602ea5df44ef..54e3f23df4a97 100644 --- a/website/versioned_docs/version-0.10.0/concurrency_control.md +++ b/website/versioned_docs/version-0.10.0/concurrency_control.md @@ -19,13 +19,13 @@ between multiple table service writers and readers. Additionally, using MVCC, Hu the same Hudi Table. Hudi supports `file level OCC`, i.e., for any 2 commits (or writers) happening to the same table, if they do not have writes to overlapping files being changed, both writers are allowed to succeed. This feature is currently *experimental* and requires either Zookeeper or HiveMetastore to acquire locks. -It may be helpful to understand the different guarantees provided by [write operations](/docs/writing_data#write-operations) via Hudi datasource or the delta streamer. +It may be helpful to understand the different guarantees provided by [write operations](/docs/write_operations) via Hudi datasource or the delta streamer. ## Single Writer Guarantees - *UPSERT Guarantee*: The target table will NEVER show duplicates. - - *INSERT Guarantee*: The target table wilL NEVER have duplicates if [dedup](/docs/configurations#INSERT_DROP_DUPS_OPT_KEY) is enabled. - - *BULK_INSERT Guarantee*: The target table will NEVER have duplicates if [dedup](/docs/configurations#INSERT_DROP_DUPS_OPT_KEY) is enabled. + - *INSERT Guarantee*: The target table wilL NEVER have duplicates if [dedup](configurations#hoodiedatasourcewriteinsertdropduplicates) is enabled. + - *BULK_INSERT Guarantee*: The target table will NEVER have duplicates if [dedup](configurations#hoodiedatasourcewriteinsertdropduplicates) is enabled. - *INCREMENTAL PULL Guarantee*: Data consumption and checkpoints are NEVER out of order. 
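For context, the dedup switch referenced in the guarantees above is set on the write path. A minimal sketch (table name, path and field names are placeholders) might look like the following:

```scala
// Sketch: an insert with dedup enabled, so the single-writer "no duplicates" guarantee applies.
df.write.format("hudi").
  option("hoodie.table.name", "hudi_table").
  option("hoodie.datasource.write.operation", "insert").
  option("hoodie.datasource.write.insert.drop.duplicates", "true").
  option("hoodie.datasource.write.recordkey.field", "uuid").
  option("hoodie.datasource.write.precombine.field", "ts").
  mode("append").
  save("/tmp/hudi_table")
```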
## Multi Writer Guarantees @@ -33,8 +33,8 @@ It may be helpful to understand the different guarantees provided by [write oper With multiple writers using OCC, some of the above guarantees change as follows - *UPSERT Guarantee*: The target table will NEVER show duplicates. -- *INSERT Guarantee*: The target table MIGHT have duplicates even if [dedup](/docs/configurations#INSERT_DROP_DUPS_OPT_KEY) is enabled. -- *BULK_INSERT Guarantee*: The target table MIGHT have duplicates even if [dedup](/docs/configurations#INSERT_DROP_DUPS_OPT_KEY) is enabled. +- *INSERT Guarantee*: The target table MIGHT have duplicates even if [dedup](configurations#hoodiedatasourcewriteinsertdropduplicates) is enabled. +- *BULK_INSERT Guarantee*: The target table MIGHT have duplicates even if [dedup](configurations#hoodiedatasourcewriteinsertdropduplicates) is enabled. - *INCREMENTAL PULL Guarantee*: Data consumption and checkpoints MIGHT be out of order due to multiple writer jobs finishing at different times. ## Enabling Multi Writing diff --git a/website/versioned_docs/version-0.10.0/deployment.md b/website/versioned_docs/version-0.10.0/deployment.md index c3f3de84e88c9..7614b28c439ce 100644 --- a/website/versioned_docs/version-0.10.0/deployment.md +++ b/website/versioned_docs/version-0.10.0/deployment.md @@ -25,9 +25,9 @@ With Merge_On_Read Table, Hudi ingestion needs to also take care of compacting d ### DeltaStreamer -[DeltaStreamer](/docs/writing_data#deltastreamer) is the standalone utility to incrementally pull upstream changes from varied sources such as DFS, Kafka and DB Changelogs and ingest them to hudi tables. It runs as a spark application in 2 modes. +[DeltaStreamer](hoodie_deltastreamer#deltastreamer) is the standalone utility to incrementally pull upstream changes from varied sources such as DFS, Kafka and DB Changelogs and ingest them to hudi tables. It runs as a spark application in 2 modes. - - **Run Once Mode** : In this mode, Deltastreamer performs one ingestion round which includes incrementally pulling events from upstream sources and ingesting them to hudi table. Background operations like cleaning old file versions and archiving hoodie timeline are automatically executed as part of the run. For Merge-On-Read tables, Compaction is also run inline as part of ingestion unless disabled by passing the flag "--disable-compaction". By default, Compaction is run inline for every ingestion run and this can be changed by setting the property "hoodie.compact.inline.max.delta.commits". You can either manually run this spark application or use any cron trigger or workflow orchestrator (most common deployment strategy) such as Apache Airflow to spawn this application. See command line options in [this section](/docs/writing_data#deltastreamer) for running the spark application. + - **Run Once Mode** : In this mode, Deltastreamer performs one ingestion round which includes incrementally pulling events from upstream sources and ingesting them to hudi table. Background operations like cleaning old file versions and archiving hoodie timeline are automatically executed as part of the run. For Merge-On-Read tables, Compaction is also run inline as part of ingestion unless disabled by passing the flag "--disable-compaction". By default, Compaction is run inline for every ingestion run and this can be changed by setting the property "hoodie.compact.inline.max.delta.commits". 
You can either manually run this spark application or use any cron trigger or workflow orchestrator (most common deployment strategy) such as Apache Airflow to spawn this application. See command line options in [this section](hoodie_deltastreamer#deltastreamer) for running the spark application. Here is an example invocation for reading from kafka topic in a single-run mode and writing to Merge On Read table type in a yarn cluster. @@ -126,7 +126,7 @@ Here is an example invocation for reading from kafka topic in a continuous mode ### Spark Datasource Writer Jobs -As described in [Writing Data](/docs/writing_data#datasource-writer), you can use spark datasource to ingest to hudi table. This mechanism allows you to ingest any spark dataframe in Hudi format. Hudi Spark DataSource also supports spark streaming to ingest a streaming source to Hudi table. For Merge On Read table types, inline compaction is turned on by default which runs after every ingestion run. The compaction frequency can be changed by setting the property "hoodie.compact.inline.max.delta.commits". +As described in [Writing Data](writing_data#spark-datasource-writer), you can use spark datasource to ingest to hudi table. This mechanism allows you to ingest any spark dataframe in Hudi format. Hudi Spark DataSource also supports spark streaming to ingest a streaming source to Hudi table. For Merge On Read table types, inline compaction is turned on by default which runs after every ingestion run. The compaction frequency can be changed by setting the property "hoodie.compact.inline.max.delta.commits". Here is an example invocation using spark datasource diff --git a/website/versioned_docs/version-0.10.0/faq.md b/website/versioned_docs/version-0.10.0/faq.md index 0af77241155c8..44d2cabf82893 100644 --- a/website/versioned_docs/version-0.10.0/faq.md +++ b/website/versioned_docs/version-0.10.0/faq.md @@ -284,7 +284,7 @@ a) **Auto Size small files during ingestion**: This solution trades ingest/writi Hudi has the ability to maintain a configured target file size, when performing **upsert/insert** operations. (Note: **bulk_insert** operation does not provide this functionality and is designed as a simpler replacement for normal `spark.write.parquet` ) -For **copy-on-write**, this is as simple as configuring the [maximum size for a base/parquet file](https://hudi.apache.org/docs/configurations#hoodieparquetmaxfilesize) and the [soft limit](https://hudi.apache.org/docs/configurations#hoodieparquetsmallfilelimit) below which a file should be considered a small file. For the initial bootstrap to Hudi table, tuning record size estimate is also important to ensure sufficient records are bin-packed in a parquet file. For subsequent writes, Hudi automatically uses average record size based on previous commit. Hudi will try to add enough records to a small file at write time to get it to the configured maximum limit. For e.g , with `compactionSmallFileSize=100MB` and limitFileSize=120MB, Hudi will pick all files < 100MB and try to get them upto 120MB. +For **copy-on-write**, this is as simple as configuring the [maximum size for a base/parquet file](https://hudi.apache.org/docs/configurations#hoodieparquetmaxfilesize) and the [soft limit](https://hudi.apache.org/docs/configurations#hoodieparquetsmallfilelimit) below which a file should be considered a small file. For the initial bootstrap to Hudi table, tuning record size estimate is also important to ensure sufficient records are bin-packed in a parquet file. 
For subsequent writes, Hudi automatically uses average record size based on previous commit. Hudi will try to add enough records to a small file at write time to get it to the configured maximum limit. For e.g , with `hoodie.parquet.max.file.size=100MB` and hoodie.parquet.small.file.limit=120MB, Hudi will pick all files < 100MB and try to get them upto 120MB. For **merge-on-read**, there are few more configs to set. MergeOnRead works differently for different INDEX choices. diff --git a/website/versioned_docs/version-0.10.0/file_sizing.md b/website/versioned_docs/version-0.10.0/file_sizing.md index e7935445d9e6d..58831e4b29959 100644 --- a/website/versioned_docs/version-0.10.0/file_sizing.md +++ b/website/versioned_docs/version-0.10.0/file_sizing.md @@ -21,7 +21,7 @@ and the [soft limit](/docs/configurations#hoodieparquetsmallfilelimit) below whi be considered a small file. For the initial bootstrap of a Hudi table, tuning record size estimate is also important to ensure sufficient records are bin-packed in a parquet file. For subsequent writes, Hudi automatically uses average record size based on previous commit. Hudi will try to add enough records to a small file at write time to get it to the -configured maximum limit. For e.g , with `compactionSmallFileSize=100MB` and limitFileSize=120MB, Hudi will pick all +configured maximum limit. For e.g , with `hoodie.parquet.max.file.size=100MB` and hoodie.parquet.small.file.limit=120MB, Hudi will pick all files < 100MB and try to get them upto 120MB. ### For Merge-On-Read diff --git a/website/versioned_docs/version-0.10.0/performance.md b/website/versioned_docs/version-0.10.0/performance.md index 53152730bd84a..274ed9dc3fd40 100644 --- a/website/versioned_docs/version-0.10.0/performance.md +++ b/website/versioned_docs/version-0.10.0/performance.md @@ -14,10 +14,10 @@ column statistics etc. Even on some cloud data stores, there is often cost to li Here are some ways to efficiently manage the storage of your Hudi tables. -- The [small file handling feature](/docs/configurations#compactionSmallFileSize) in Hudi, profiles incoming workload +- The [small file handling feature](/docs/configurations#hoodieparquetsmallfilelimit) in Hudi, profiles incoming workload and distributes inserts to existing file groups instead of creating new file groups, which can lead to small files. -- Cleaner can be [configured](/docs/configurations#retainCommits) to clean up older file slices, more or less aggressively depending on maximum time for queries to run & lookback needed for incremental pull -- User can also tune the size of the [base/parquet file](/docs/configurations#limitFileSize), [log files](/docs/configurations#logFileMaxSize) & expected [compression ratio](/docs/configurations#parquetCompressionRatio), +- Cleaner can be [configured](/docs/configurations#hoodiecleanercommitsretained) to clean up older file slices, more or less aggressively depending on maximum time for queries to run & lookback needed for incremental pull +- User can also tune the size of the [base/parquet file](/docs/configurations#hoodieparquetmaxfilesize), [log files](configurations/#hoodielogfilemaxsize) & expected [compression ratio](/docs/configurations#parquetCompressionRatio), such that sufficient number of inserts are grouped into the same file group, resulting in well sized base files ultimately. - Intelligently tuning the [bulk insert parallelism](/docs/configurations#withBulkInsertParallelism), can again in nicely sized initial file groups. 
It is in fact critical to get this right, since the file groups once created cannot be deleted, but simply expanded as explained before. diff --git a/website/versioned_docs/version-0.10.0/querying_data.md b/website/versioned_docs/version-0.10.0/querying_data.md index 0a4dea319568e..3854fb8594b09 100644 --- a/website/versioned_docs/version-0.10.0/querying_data.md +++ b/website/versioned_docs/version-0.10.0/querying_data.md @@ -17,7 +17,7 @@ In sections, below we will discuss specific setup to access different query type The Spark Datasource API is a popular way of authoring Spark ETL pipelines. Hudi tables can be queried via the Spark datasource with a simple `spark.read.parquet`. See the [Spark Quick Start](/docs/quick-start-guide) for more examples of Spark datasource reading queries. -To setup Spark for querying Hudi, see the [Query Engine Setup](/docs/0.10.0/query_engine_setup#Spark-DataSource) page. +To setup Spark for querying Hudi, see the [Query Engine Setup](/docs/0.10.0/query_engine_setup#spark) page. ### Snapshot query {#spark-snap-query} Retrieve the data table at the present point in time. @@ -49,7 +49,7 @@ spark.sql("select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from hu ``` For examples, refer to [Incremental Queries](/docs/quick-start-guide#incremental-query) in the Spark quickstart. -Please refer to [configurations](/docs/configurations#spark-datasource) section, to view all datasource options. +Please refer to [configurations](/docs/configurations/#SPARK_DATASOURCE) section, to view all datasource options. Additionally, `HoodieReadClient` offers the following functionality using Hudi's implicit indexing. @@ -171,10 +171,10 @@ would ensure Map Reduce execution is chosen for a Hive query, which combines par separated) and calls InputFormat.listStatus() only once with all those partitions. ## PrestoDB -To setup PrestoDB for querying Hudi, see the [Query Engine Setup](/docs/0.10.0/query_engine_setup#PrestoDB) page. +To setup PrestoDB for querying Hudi, see the [Query Engine Setup](/docs/0.10.0/query_engine_setup#prestodb) page. ## Trino -To setup Trino for querying Hudi, see the [Query Engine Setup](/docs/0.10.0/query_engine_setup#Trino) page. +To setup Trino for querying Hudi, see the [Query Engine Setup](/docs/0.10.0/query_engine_setup#trino) page. ## Impala (3.4 or later) diff --git a/website/versioned_docs/version-0.10.0/quick-start-guide.md b/website/versioned_docs/version-0.10.0/quick-start-guide.md index 281be622248a9..44b9d38a585c4 100644 --- a/website/versioned_docs/version-0.10.0/quick-start-guide.md +++ b/website/versioned_docs/version-0.10.0/quick-start-guide.md @@ -409,7 +409,7 @@ You can check the data generated under `/tmp/hudi_trips_cow///< [Modeling data stored in Hudi](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=113709185#FAQ-HowdoImodelthedatastoredinHudi) and for info on ways to ingest data into Hudi, refer to [Writing Hudi Tables](/docs/writing_data). Here we are using the default write operation : `upsert`. If you have a workload without updates, you can also issue -`insert` or `bulk_insert` operations which could be faster. To know more, refer to [Write operations](/docs/writing_data#write-operations) +`insert` or `bulk_insert` operations which could be faster. 
To know more, refer to [Write operations](/docs/write_operations) ::: @@ -445,7 +445,7 @@ You can check the data generated under `/tmp/hudi_trips_cow///< [Modeling data stored in Hudi](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=113709185#FAQ-HowdoImodelthedatastoredinHudi) and for info on ways to ingest data into Hudi, refer to [Writing Hudi Tables](/docs/writing_data). Here we are using the default write operation : `upsert`. If you have a workload without updates, you can also issue -`insert` or `bulk_insert` operations which could be faster. To know more, refer to [Write operations](/docs/writing_data#write-operations) +`insert` or `bulk_insert` operations which could be faster. To know more, refer to [Write operations](/docs/write_operations) ::: diff --git a/website/versioned_docs/version-0.10.0/write_operations.md b/website/versioned_docs/version-0.10.0/write_operations.md index eb3cb9a452202..952fe3b119699 100644 --- a/website/versioned_docs/version-0.10.0/write_operations.md +++ b/website/versioned_docs/version-0.10.0/write_operations.md @@ -37,7 +37,7 @@ Hudi supports implementing two types of deletes on data stored in Hudi tables, b ## Writing path The following is an inside look on the Hudi write path and the sequence of events that occur during a write. -1. [Deduping](/docs/configurations/#writeinsertdeduplicate) +1. [Deduping](configurations#hoodiecombinebeforeinsert) 1. First your input records may have duplicate keys within the same batch and duplicates need to be combined or reduced by key. 2. [Index Lookup](/docs/indexing) 1. Next, an index lookup is performed to try and match the input records to identify which file groups they belong to. @@ -51,7 +51,7 @@ The following is an inside look on the Hudi write path and the sequence of event 6. Update [Index](/docs/indexing) 1. Now that the write is performed, we will go back and update the index. 7. Commit - 1. Finally we commit all of these changes atomically. (A [callback notification](/docs/writing_data#commit-notifications) is exposed) + 1. Finally we commit all of these changes atomically. (A [callback notification](writing_data#commit-notifications) is exposed) 8. [Clean](/docs/hoodie_cleaner) (if needed) 1. Following the commit, cleaning is invoked if needed. 9. [Compaction](/docs/compaction) diff --git a/website/versioned_docs/version-0.10.0/writing_data.md b/website/versioned_docs/version-0.10.0/writing_data.md index 719813360c4c4..9806ef7064849 100644 --- a/website/versioned_docs/version-0.10.0/writing_data.md +++ b/website/versioned_docs/version-0.10.0/writing_data.md @@ -93,7 +93,7 @@ You can check the data generated under `/tmp/hudi_trips_cow///< [Modeling data stored in Hudi](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=113709185#FAQ-HowdoImodelthedatastoredinHudi) and for info on ways to ingest data into Hudi, refer to [Writing Hudi Tables](/docs/writing_data). Here we are using the default write operation : `upsert`. If you have a workload without updates, you can also issue -`insert` or `bulk_insert` operations which could be faster. To know more, refer to [Write operations](/docs/writing_data#write-operations) +`insert` or `bulk_insert` operations which could be faster. 
To know more, refer to [Write operations](/docs/write_operations) ::: @@ -129,7 +129,7 @@ You can check the data generated under `/tmp/hudi_trips_cow///< [Modeling data stored in Hudi](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=113709185#FAQ-HowdoImodelthedatastoredinHudi) and for info on ways to ingest data into Hudi, refer to [Writing Hudi Tables](/docs/writing_data). Here we are using the default write operation : `upsert`. If you have a workload without updates, you can also issue -`insert` or `bulk_insert` operations which could be faster. To know more, refer to [Write operations](/docs/writing_data#write-operations) +`insert` or `bulk_insert` operations which could be faster. To know more, refer to [Write operations](/docs/write_operations) ::: diff --git a/website/versioned_docs/version-0.10.1/clustering.md b/website/versioned_docs/version-0.10.1/clustering.md index f9bdd572751ac..e630e92445d5b 100644 --- a/website/versioned_docs/version-0.10.1/clustering.md +++ b/website/versioned_docs/version-0.10.1/clustering.md @@ -12,7 +12,7 @@ Apache Hudi brings stream processing to big data, providing fresh data while bei ## Clustering Architecture -At a high level, Hudi provides different operations such as insert/upsert/bulk_insert through it’s write client API to be able to write data to a Hudi table. To be able to choose a trade-off between file size and ingestion speed, Hudi provides a knob `hoodie.parquet.small.file.limit` to be able to configure the smallest allowable file size. Users are able to configure the small file [soft limit](https://hudi.apache.org/docs/configurations#compactionSmallFileSize) to `0` to force new data to go into a new set of filegroups or set it to a higher value to ensure new data gets “padded” to existing files until it meets that limit that adds to ingestion latencies. +At a high level, Hudi provides different operations such as insert/upsert/bulk_insert through it’s write client API to be able to write data to a Hudi table. To be able to choose a trade-off between file size and ingestion speed, Hudi provides a knob `hoodie.parquet.small.file.limit` to be able to configure the smallest allowable file size. Users are able to configure the small file [soft limit](https://hudi.apache.org/docs/configurations#hoodieparquetsmallfilelimit) to `0` to force new data to go into a new set of filegroups or set it to a higher value to ensure new data gets “padded” to existing files until it meets that limit that adds to ingestion latencies. diff --git a/website/versioned_docs/version-0.10.1/compaction.md b/website/versioned_docs/version-0.10.1/compaction.md index c56df32c186d9..5267f1209844d 100644 --- a/website/versioned_docs/version-0.10.1/compaction.md +++ b/website/versioned_docs/version-0.10.1/compaction.md @@ -95,7 +95,7 @@ is enabled by default. ::: ### Hudi Compactor Utility -Hudi provides a standalone tool to execute specific compactions asynchronously. Below is an example and you can read more in the [deployment guide](/docs/deployment#compactions) +Hudi provides a standalone tool to execute specific compactions asynchronously. 
Below is an example and you can read more in the [deployment guide](/docs/cli#compactions) Example: ```properties diff --git a/website/versioned_docs/version-0.10.1/concurrency_control.md b/website/versioned_docs/version-0.10.1/concurrency_control.md index 6ea34baa9aab1..b926babecb892 100644 --- a/website/versioned_docs/version-0.10.1/concurrency_control.md +++ b/website/versioned_docs/version-0.10.1/concurrency_control.md @@ -19,13 +19,13 @@ between multiple table service writers and readers. Additionally, using MVCC, Hu the same Hudi Table. Hudi supports `file level OCC`, i.e., for any 2 commits (or writers) happening to the same table, if they do not have writes to overlapping files being changed, both writers are allowed to succeed. This feature is currently *experimental* and requires either Zookeeper or HiveMetastore to acquire locks. -It may be helpful to understand the different guarantees provided by [write operations](/docs/writing_data#write-operations) via Hudi datasource or the delta streamer. +It may be helpful to understand the different guarantees provided by [write operations](/docs/write_operations) via Hudi datasource or the delta streamer. ## Single Writer Guarantees - *UPSERT Guarantee*: The target table will NEVER show duplicates. - - *INSERT Guarantee*: The target table wilL NEVER have duplicates if [dedup](/docs/configurations#INSERT_DROP_DUPS_OPT_KEY) is enabled. - - *BULK_INSERT Guarantee*: The target table will NEVER have duplicates if [dedup](/docs/configurations#INSERT_DROP_DUPS_OPT_KEY) is enabled. + - *INSERT Guarantee*: The target table wilL NEVER have duplicates if [dedup](configurations#hoodiedatasourcewriteinsertdropduplicates) is enabled. + - *BULK_INSERT Guarantee*: The target table will NEVER have duplicates if [dedup](configurations#hoodiedatasourcewriteinsertdropduplicates) is enabled. - *INCREMENTAL PULL Guarantee*: Data consumption and checkpoints are NEVER out of order. ## Multi Writer Guarantees @@ -33,8 +33,8 @@ It may be helpful to understand the different guarantees provided by [write oper With multiple writers using OCC, some of the above guarantees change as follows - *UPSERT Guarantee*: The target table will NEVER show duplicates. -- *INSERT Guarantee*: The target table MIGHT have duplicates even if [dedup](/docs/configurations#INSERT_DROP_DUPS_OPT_KEY) is enabled. -- *BULK_INSERT Guarantee*: The target table MIGHT have duplicates even if [dedup](/docs/configurations#INSERT_DROP_DUPS_OPT_KEY) is enabled. +- *INSERT Guarantee*: The target table MIGHT have duplicates even if [dedup](configurations#hoodiedatasourcewriteinsertdropduplicates) is enabled. +- *BULK_INSERT Guarantee*: The target table MIGHT have duplicates even if [dedup](configurations#hoodiedatasourcewriteinsertdropduplicates) is enabled. - *INCREMENTAL PULL Guarantee*: Data consumption and checkpoints MIGHT be out of order due to multiple writer jobs finishing at different times. 
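As a rough sketch of what enabling OCC multi-writing involves: the options below are the commonly documented Zookeeper-based lock provider settings; hosts, ports, paths and the table name are placeholders, and the "Enabling Multi Writing" section that follows remains the authoritative reference.

```scala
// Sketch: write options typically required for optimistic concurrency control
// with a Zookeeper-based lock provider. All values below are placeholders.
df.write.format("hudi").
  option("hoodie.table.name", "hudi_table").
  option("hoodie.write.concurrency.mode", "optimistic_concurrency_control").
  option("hoodie.cleaner.policy.failed.writes", "LAZY").
  option("hoodie.write.lock.provider",
    "org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider").
  option("hoodie.write.lock.zookeeper.url", "zk-host").
  option("hoodie.write.lock.zookeeper.port", "2181").
  option("hoodie.write.lock.zookeeper.lock_key", "hudi_table").
  option("hoodie.write.lock.zookeeper.base_path", "/hudi/locks").
  mode("append").
  save("/tmp/hudi_table")
```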
## Enabling Multi Writing diff --git a/website/versioned_docs/version-0.10.1/deployment.md b/website/versioned_docs/version-0.10.1/deployment.md index c3f3de84e88c9..7614b28c439ce 100644 --- a/website/versioned_docs/version-0.10.1/deployment.md +++ b/website/versioned_docs/version-0.10.1/deployment.md @@ -25,9 +25,9 @@ With Merge_On_Read Table, Hudi ingestion needs to also take care of compacting d ### DeltaStreamer -[DeltaStreamer](/docs/writing_data#deltastreamer) is the standalone utility to incrementally pull upstream changes from varied sources such as DFS, Kafka and DB Changelogs and ingest them to hudi tables. It runs as a spark application in 2 modes. +[DeltaStreamer](hoodie_deltastreamer#deltastreamer) is the standalone utility to incrementally pull upstream changes from varied sources such as DFS, Kafka and DB Changelogs and ingest them to hudi tables. It runs as a spark application in 2 modes. - - **Run Once Mode** : In this mode, Deltastreamer performs one ingestion round which includes incrementally pulling events from upstream sources and ingesting them to hudi table. Background operations like cleaning old file versions and archiving hoodie timeline are automatically executed as part of the run. For Merge-On-Read tables, Compaction is also run inline as part of ingestion unless disabled by passing the flag "--disable-compaction". By default, Compaction is run inline for every ingestion run and this can be changed by setting the property "hoodie.compact.inline.max.delta.commits". You can either manually run this spark application or use any cron trigger or workflow orchestrator (most common deployment strategy) such as Apache Airflow to spawn this application. See command line options in [this section](/docs/writing_data#deltastreamer) for running the spark application. + - **Run Once Mode** : In this mode, Deltastreamer performs one ingestion round which includes incrementally pulling events from upstream sources and ingesting them to hudi table. Background operations like cleaning old file versions and archiving hoodie timeline are automatically executed as part of the run. For Merge-On-Read tables, Compaction is also run inline as part of ingestion unless disabled by passing the flag "--disable-compaction". By default, Compaction is run inline for every ingestion run and this can be changed by setting the property "hoodie.compact.inline.max.delta.commits". You can either manually run this spark application or use any cron trigger or workflow orchestrator (most common deployment strategy) such as Apache Airflow to spawn this application. See command line options in [this section](hoodie_deltastreamer#deltastreamer) for running the spark application. Here is an example invocation for reading from kafka topic in a single-run mode and writing to Merge On Read table type in a yarn cluster. @@ -126,7 +126,7 @@ Here is an example invocation for reading from kafka topic in a continuous mode ### Spark Datasource Writer Jobs -As described in [Writing Data](/docs/writing_data#datasource-writer), you can use spark datasource to ingest to hudi table. This mechanism allows you to ingest any spark dataframe in Hudi format. Hudi Spark DataSource also supports spark streaming to ingest a streaming source to Hudi table. For Merge On Read table types, inline compaction is turned on by default which runs after every ingestion run. The compaction frequency can be changed by setting the property "hoodie.compact.inline.max.delta.commits". 
+As described in [Writing Data](writing_data#spark-datasource-writer), you can use spark datasource to ingest to hudi table. This mechanism allows you to ingest any spark dataframe in Hudi format. Hudi Spark DataSource also supports spark streaming to ingest a streaming source to Hudi table. For Merge On Read table types, inline compaction is turned on by default which runs after every ingestion run. The compaction frequency can be changed by setting the property "hoodie.compact.inline.max.delta.commits". Here is an example invocation using spark datasource diff --git a/website/versioned_docs/version-0.10.1/faq.md b/website/versioned_docs/version-0.10.1/faq.md index d9691b8bdeb4d..f6feaa2534351 100644 --- a/website/versioned_docs/version-0.10.1/faq.md +++ b/website/versioned_docs/version-0.10.1/faq.md @@ -284,7 +284,7 @@ a) **Auto Size small files during ingestion**: This solution trades ingest/writi Hudi has the ability to maintain a configured target file size, when performing **upsert/insert** operations. (Note: **bulk_insert** operation does not provide this functionality and is designed as a simpler replacement for normal `spark.write.parquet` ) -For **copy-on-write**, this is as simple as configuring the [maximum size for a base/parquet file](https://hudi.apache.org/docs/configurations#hoodieparquetmaxfilesize) and the [soft limit](https://hudi.apache.org/docs/configurations#hoodieparquetsmallfilelimit) below which a file should be considered a small file. For the initial bootstrap to Hudi table, tuning record size estimate is also important to ensure sufficient records are bin-packed in a parquet file. For subsequent writes, Hudi automatically uses average record size based on previous commit. Hudi will try to add enough records to a small file at write time to get it to the configured maximum limit. For e.g , with `compactionSmallFileSize=100MB` and limitFileSize=120MB, Hudi will pick all files < 100MB and try to get them upto 120MB. +For **copy-on-write**, this is as simple as configuring the [maximum size for a base/parquet file](https://hudi.apache.org/docs/configurations#hoodieparquetmaxfilesize) and the [soft limit](https://hudi.apache.org/docs/configurations#hoodieparquetsmallfilelimit) below which a file should be considered a small file. For the initial bootstrap to Hudi table, tuning record size estimate is also important to ensure sufficient records are bin-packed in a parquet file. For subsequent writes, Hudi automatically uses average record size based on previous commit. Hudi will try to add enough records to a small file at write time to get it to the configured maximum limit. For e.g , with `hoodie.parquet.max.file.size=100MB` and hoodie.parquet.small.file.limit=120MB, Hudi will pick all files < 100MB and try to get them upto 120MB. For **merge-on-read**, there are few more configs to set. MergeOnRead works differently for different INDEX choices. diff --git a/website/versioned_docs/version-0.10.1/file_sizing.md b/website/versioned_docs/version-0.10.1/file_sizing.md index e7935445d9e6d..58831e4b29959 100644 --- a/website/versioned_docs/version-0.10.1/file_sizing.md +++ b/website/versioned_docs/version-0.10.1/file_sizing.md @@ -21,7 +21,7 @@ and the [soft limit](/docs/configurations#hoodieparquetsmallfilelimit) below whi be considered a small file. For the initial bootstrap of a Hudi table, tuning record size estimate is also important to ensure sufficient records are bin-packed in a parquet file. 
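The deployment hunk above notes that Merge On Read tables compact inline on every ingestion run unless the frequency is changed via "hoodie.compact.inline.max.delta.commits". A minimal sketch of a datasource write that sets this, assuming an existing DataFrame `inputDF` and an illustrative table name and path:

```scala
// Merge-On-Read write that compacts inline only after every 4 delta commits.
inputDF.write.format("hudi").
  option("hoodie.table.name", "hudi_trips_mor").                 // assumed table name
  option("hoodie.datasource.write.table.type", "MERGE_ON_READ").
  option("hoodie.datasource.write.recordkey.field", "uuid").
  option("hoodie.datasource.write.precombine.field", "ts").
  option("hoodie.datasource.write.partitionpath.field", "partitionpath").
  option("hoodie.compact.inline", "true").
  option("hoodie.compact.inline.max.delta.commits", "4").        // the frequency knob from the hunk above
  mode("append").
  save("/tmp/hudi_trips_mor")                                    // assumed base path
```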
For subsequent writes, Hudi automatically uses average record size based on previous commit. Hudi will try to add enough records to a small file at write time to get it to the -configured maximum limit. For e.g , with `compactionSmallFileSize=100MB` and limitFileSize=120MB, Hudi will pick all +configured maximum limit. For e.g , with `hoodie.parquet.max.file.size=100MB` and hoodie.parquet.small.file.limit=120MB, Hudi will pick all files < 100MB and try to get them upto 120MB. ### For Merge-On-Read diff --git a/website/versioned_docs/version-0.10.1/performance.md b/website/versioned_docs/version-0.10.1/performance.md index 53152730bd84a..274ed9dc3fd40 100644 --- a/website/versioned_docs/version-0.10.1/performance.md +++ b/website/versioned_docs/version-0.10.1/performance.md @@ -14,10 +14,10 @@ column statistics etc. Even on some cloud data stores, there is often cost to li Here are some ways to efficiently manage the storage of your Hudi tables. -- The [small file handling feature](/docs/configurations#compactionSmallFileSize) in Hudi, profiles incoming workload +- The [small file handling feature](/docs/configurations#hoodieparquetsmallfilelimit) in Hudi, profiles incoming workload and distributes inserts to existing file groups instead of creating new file groups, which can lead to small files. -- Cleaner can be [configured](/docs/configurations#retainCommits) to clean up older file slices, more or less aggressively depending on maximum time for queries to run & lookback needed for incremental pull -- User can also tune the size of the [base/parquet file](/docs/configurations#limitFileSize), [log files](/docs/configurations#logFileMaxSize) & expected [compression ratio](/docs/configurations#parquetCompressionRatio), +- Cleaner can be [configured](/docs/configurations#hoodiecleanercommitsretained) to clean up older file slices, more or less aggressively depending on maximum time for queries to run & lookback needed for incremental pull +- User can also tune the size of the [base/parquet file](/docs/configurations#hoodieparquetmaxfilesize), [log files](configurations/#hoodielogfilemaxsize) & expected [compression ratio](/docs/configurations#parquetCompressionRatio), such that sufficient number of inserts are grouped into the same file group, resulting in well sized base files ultimately. - Intelligently tuning the [bulk insert parallelism](/docs/configurations#withBulkInsertParallelism), can again in nicely sized initial file groups. It is in fact critical to get this right, since the file groups once created cannot be deleted, but simply expanded as explained before. diff --git a/website/versioned_docs/version-0.10.1/querying_data.md b/website/versioned_docs/version-0.10.1/querying_data.md index a4fe212de99b5..4a120b3423f35 100644 --- a/website/versioned_docs/version-0.10.1/querying_data.md +++ b/website/versioned_docs/version-0.10.1/querying_data.md @@ -17,7 +17,7 @@ In sections, below we will discuss specific setup to access different query type The Spark Datasource API is a popular way of authoring Spark ETL pipelines. Hudi tables can be queried via the Spark datasource with a simple `spark.read.parquet`. See the [Spark Quick Start](/docs/quick-start-guide) for more examples of Spark datasource reading queries. -To setup Spark for querying Hudi, see the [Query Engine Setup](/docs/0.10.1/query_engine_setup#Spark-DataSource) page. +To setup Spark for querying Hudi, see the [Query Engine Setup](/docs/0.10.1/query_engine_setup#spark) page. 
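The file-sizing and performance hunks above switch to the full config names `hoodie.parquet.small.file.limit`, `hoodie.parquet.max.file.size` and `hoodie.cleaner.commits.retained`; per the surrounding text, the soft limit (100MB in the example) is the threshold below which a file counts as small, and sits below the maximum file size (120MB) that files are padded up to. A sketch of setting them on a copy-on-write write, with sizes in bytes and an assumed DataFrame `inputDF`:

```scala
// Copy-On-Write sizing: files under the 100MB soft limit are considered small and are
// padded toward the 120MB maximum file size; the cleaner retains the last 10 commits.
inputDF.write.format("hudi").
  option("hoodie.table.name", "hudi_trips_cow").
  option("hoodie.datasource.write.recordkey.field", "uuid").
  option("hoodie.datasource.write.precombine.field", "ts").
  option("hoodie.datasource.write.partitionpath.field", "partitionpath").
  option("hoodie.parquet.small.file.limit", 100L * 1024 * 1024).  // 100MB soft limit
  option("hoodie.parquet.max.file.size", 120L * 1024 * 1024).     // 120MB max/target base file size
  option("hoodie.cleaner.commits.retained", "10").                // illustrative retention
  mode("append").
  save("/tmp/hudi_trips_cow")
```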
### Snapshot query {#spark-snap-query} Retrieve the data table at the present point in time. @@ -49,7 +49,7 @@ spark.sql("select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from hu ``` For examples, refer to [Incremental Queries](/docs/quick-start-guide#incremental-query) in the Spark quickstart. -Please refer to [configurations](/docs/configurations#spark-datasource) section, to view all datasource options. +Please refer to [configurations](/docs/configurations/#SPARK_DATASOURCE) section, to view all datasource options. Additionally, `HoodieReadClient` offers the following functionality using Hudi's implicit indexing. @@ -170,10 +170,10 @@ would ensure Map Reduce execution is chosen for a Hive query, which combines par separated) and calls InputFormat.listStatus() only once with all those partitions. ## PrestoDB -To setup PrestoDB for querying Hudi, see the [Query Engine Setup](/docs/0.10.1/query_engine_setup#PrestoDB) page. +To setup PrestoDB for querying Hudi, see the [Query Engine Setup](/docs/0.10.1/query_engine_setup#prestodb) page. ## Trino -To setup Trino for querying Hudi, see the [Query Engine Setup](/docs/0.10.1/query_engine_setup#Trino) page. +To setup Trino for querying Hudi, see the [Query Engine Setup](/docs/0.10.1/query_engine_setup#trino) page. ## Impala (3.4 or later) diff --git a/website/versioned_docs/version-0.10.1/quick-start-guide.md b/website/versioned_docs/version-0.10.1/quick-start-guide.md index fc4f17202ff3f..d286cc3e61cec 100644 --- a/website/versioned_docs/version-0.10.1/quick-start-guide.md +++ b/website/versioned_docs/version-0.10.1/quick-start-guide.md @@ -426,7 +426,7 @@ You can check the data generated under `/tmp/hudi_trips_cow///< [Modeling data stored in Hudi](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=113709185#FAQ-HowdoImodelthedatastoredinHudi) and for info on ways to ingest data into Hudi, refer to [Writing Hudi Tables](/docs/writing_data). Here we are using the default write operation : `upsert`. If you have a workload without updates, you can also issue -`insert` or `bulk_insert` operations which could be faster. To know more, refer to [Write operations](/docs/writing_data#write-operations) +`insert` or `bulk_insert` operations which could be faster. To know more, refer to [Write operations](/docs/write_operations) ::: @@ -462,7 +462,7 @@ You can check the data generated under `/tmp/hudi_trips_cow///< [Modeling data stored in Hudi](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=113709185#FAQ-HowdoImodelthedatastoredinHudi) and for info on ways to ingest data into Hudi, refer to [Writing Hudi Tables](/docs/writing_data). Here we are using the default write operation : `upsert`. If you have a workload without updates, you can also issue -`insert` or `bulk_insert` operations which could be faster. To know more, refer to [Write operations](/docs/writing_data#write-operations) +`insert` or `bulk_insert` operations which could be faster. 
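The querying hunk above points to the incremental query examples in the quickstart; here is a minimal spark-shell sketch of such a pull via the Spark datasource, assuming the quickstart base path and a purely illustrative begin instant:

```scala
// Pull only records written after the given commit time (placeholder instant).
val incDF = spark.read.format("hudi").
  option("hoodie.datasource.query.type", "incremental").
  option("hoodie.datasource.read.begin.instanttime", "20230101000000").  // assumed begin instant
  load("/tmp/hudi_trips_cow")

incDF.createOrReplaceTempView("hudi_trips_incremental")
spark.sql("select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from hudi_trips_incremental").show()
```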
To know more, refer to [Write operations](/docs/write_operations) ::: diff --git a/website/versioned_docs/version-0.10.1/tuning-guide.md b/website/versioned_docs/version-0.10.1/tuning-guide.md index 4affeafda663d..12b68098e0600 100644 --- a/website/versioned_docs/version-0.10.1/tuning-guide.md +++ b/website/versioned_docs/version-0.10.1/tuning-guide.md @@ -17,7 +17,7 @@ Writing data via Hudi happens as a Spark job and thus general rules of spark deb **Spark Memory** : Typically, hudi needs to be able to read a single file into memory to perform merges or compactions and thus the executor memory should be sufficient to accomodate this. In addition, Hoodie caches the input to be able to intelligently place data and thus leaving some `spark.memory.storageFraction` will generally help boost performance. -**Sizing files**: Set `limitFileSize` above judiciously, to balance ingest/write latency vs number of files & consequently metadata overhead associated with it. +**Sizing files**: Set `hoodie.parquet.small.file.limit` above judiciously, to balance ingest/write latency vs number of files & consequently metadata overhead associated with it. **Timeseries/Log data** : Default configs are tuned for database/nosql changelogs where individual record sizes are large. Another very popular class of data is timeseries/event/log data that tends to be more volumnious with lot more records per partition. In such cases consider tuning the bloom filter accuracy via `.bloomFilterFPP()/bloomFilterNumEntries()` to achieve your target index look up time. Also, consider making a key that is prefixed with time of the event, which will enable range pruning & significantly speeding up index lookup. diff --git a/website/versioned_docs/version-0.10.1/write_operations.md b/website/versioned_docs/version-0.10.1/write_operations.md index eb3cb9a452202..952fe3b119699 100644 --- a/website/versioned_docs/version-0.10.1/write_operations.md +++ b/website/versioned_docs/version-0.10.1/write_operations.md @@ -37,7 +37,7 @@ Hudi supports implementing two types of deletes on data stored in Hudi tables, b ## Writing path The following is an inside look on the Hudi write path and the sequence of events that occur during a write. -1. [Deduping](/docs/configurations/#writeinsertdeduplicate) +1. [Deduping](configurations#hoodiecombinebeforeinsert) 1. First your input records may have duplicate keys within the same batch and duplicates need to be combined or reduced by key. 2. [Index Lookup](/docs/indexing) 1. Next, an index lookup is performed to try and match the input records to identify which file groups they belong to. @@ -51,7 +51,7 @@ The following is an inside look on the Hudi write path and the sequence of event 6. Update [Index](/docs/indexing) 1. Now that the write is performed, we will go back and update the index. 7. Commit - 1. Finally we commit all of these changes atomically. (A [callback notification](/docs/writing_data#commit-notifications) is exposed) + 1. Finally we commit all of these changes atomically. (A [callback notification](writing_data#commit-notifications) is exposed) 8. [Clean](/docs/hoodie_cleaner) (if needed) 1. Following the commit, cleaning is invoked if needed. 9. 
[Compaction](/docs/compaction) diff --git a/website/versioned_docs/version-0.10.1/writing_data.md b/website/versioned_docs/version-0.10.1/writing_data.md index 719813360c4c4..9806ef7064849 100644 --- a/website/versioned_docs/version-0.10.1/writing_data.md +++ b/website/versioned_docs/version-0.10.1/writing_data.md @@ -93,7 +93,7 @@ You can check the data generated under `/tmp/hudi_trips_cow///< [Modeling data stored in Hudi](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=113709185#FAQ-HowdoImodelthedatastoredinHudi) and for info on ways to ingest data into Hudi, refer to [Writing Hudi Tables](/docs/writing_data). Here we are using the default write operation : `upsert`. If you have a workload without updates, you can also issue -`insert` or `bulk_insert` operations which could be faster. To know more, refer to [Write operations](/docs/writing_data#write-operations) +`insert` or `bulk_insert` operations which could be faster. To know more, refer to [Write operations](/docs/write_operations) ::: @@ -129,7 +129,7 @@ You can check the data generated under `/tmp/hudi_trips_cow///< [Modeling data stored in Hudi](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=113709185#FAQ-HowdoImodelthedatastoredinHudi) and for info on ways to ingest data into Hudi, refer to [Writing Hudi Tables](/docs/writing_data). Here we are using the default write operation : `upsert`. If you have a workload without updates, you can also issue -`insert` or `bulk_insert` operations which could be faster. To know more, refer to [Write operations](/docs/writing_data#write-operations) +`insert` or `bulk_insert` operations which could be faster. To know more, refer to [Write operations](/docs/write_operations) ::: diff --git a/website/versioned_docs/version-0.11.0/compaction.md b/website/versioned_docs/version-0.11.0/compaction.md index a6249b7ae7c48..e99cc2082c5fe 100644 --- a/website/versioned_docs/version-0.11.0/compaction.md +++ b/website/versioned_docs/version-0.11.0/compaction.md @@ -95,7 +95,7 @@ is enabled by default. ::: ### Hudi Compactor Utility -Hudi provides a standalone tool to execute specific compactions asynchronously. Below is an example and you can read more in the [deployment guide](/docs/deployment#compactions) +Hudi provides a standalone tool to execute specific compactions asynchronously. Below is an example and you can read more in the [deployment guide](/docs/cli#compactions) Example: ```properties diff --git a/website/versioned_docs/version-0.11.0/deployment.md b/website/versioned_docs/version-0.11.0/deployment.md index 24ea35e3999fc..7fbc595b8b2b2 100644 --- a/website/versioned_docs/version-0.11.0/deployment.md +++ b/website/versioned_docs/version-0.11.0/deployment.md @@ -135,7 +135,7 @@ Here is an example invocation for reading from kafka topic in a continuous mode ### Spark Datasource Writer Jobs -As described in [Writing Data](/docs/writing_data#spark-datasource-writer), you can use spark datasource to ingest to hudi table. This mechanism allows you to ingest any spark dataframe in Hudi format. Hudi Spark DataSource also supports spark streaming to ingest a streaming source to Hudi table. For Merge On Read table types, inline compaction is turned on by default which runs after every ingestion run. The compaction frequency can be changed by setting the property "hoodie.compact.inline.max.delta.commits". +As described in [Writing Data](writing_data#spark-datasource-writer), you can use spark datasource to ingest to hudi table. 
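The quickstart note at the top of this hunk says workloads without updates can use `insert` or `bulk_insert` instead of the default `upsert`; a short sketch of switching the operation on a datasource write (spark-shell, DataFrame `inputDF` assumed):

```scala
// bulk_insert: fastest initial load, but (per the FAQ hunks) it does not auto-size small files.
inputDF.write.format("hudi").
  option("hoodie.table.name", "hudi_trips_cow").
  option("hoodie.datasource.write.recordkey.field", "uuid").
  option("hoodie.datasource.write.precombine.field", "ts").
  option("hoodie.datasource.write.partitionpath.field", "partitionpath").
  option("hoodie.datasource.write.operation", "bulk_insert").   // or "insert"; default is "upsert"
  mode("append").
  save("/tmp/hudi_trips_cow")
```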
This mechanism allows you to ingest any spark dataframe in Hudi format. Hudi Spark DataSource also supports spark streaming to ingest a streaming source to Hudi table. For Merge On Read table types, inline compaction is turned on by default which runs after every ingestion run. The compaction frequency can be changed by setting the property "hoodie.compact.inline.max.delta.commits". Here is an example invocation using spark datasource diff --git a/website/versioned_docs/version-0.11.0/faq.md b/website/versioned_docs/version-0.11.0/faq.md index 6c2c86fef5d6a..32469d64e81f3 100644 --- a/website/versioned_docs/version-0.11.0/faq.md +++ b/website/versioned_docs/version-0.11.0/faq.md @@ -284,7 +284,7 @@ a) **Auto Size small files during ingestion**: This solution trades ingest/writi Hudi has the ability to maintain a configured target file size, when performing **upsert/insert** operations. (Note: **bulk_insert** operation does not provide this functionality and is designed as a simpler replacement for normal `spark.write.parquet` ) -For **copy-on-write**, this is as simple as configuring the [maximum size for a base/parquet file](https://hudi.apache.org/docs/configurations#hoodieparquetmaxfilesize) and the [soft limit](https://hudi.apache.org/docs/configurations#hoodieparquetsmallfilelimit) below which a file should be considered a small file. For the initial bootstrap to Hudi table, tuning record size estimate is also important to ensure sufficient records are bin-packed in a parquet file. For subsequent writes, Hudi automatically uses average record size based on previous commit. Hudi will try to add enough records to a small file at write time to get it to the configured maximum limit. For e.g , with `compactionSmallFileSize=100MB` and limitFileSize=120MB, Hudi will pick all files < 100MB and try to get them upto 120MB. +For **copy-on-write**, this is as simple as configuring the [maximum size for a base/parquet file](https://hudi.apache.org/docs/configurations#hoodieparquetmaxfilesize) and the [soft limit](https://hudi.apache.org/docs/configurations#hoodieparquetsmallfilelimit) below which a file should be considered a small file. For the initial bootstrap to Hudi table, tuning record size estimate is also important to ensure sufficient records are bin-packed in a parquet file. For subsequent writes, Hudi automatically uses average record size based on previous commit. Hudi will try to add enough records to a small file at write time to get it to the configured maximum limit. For e.g , with `hoodie.parquet.max.file.size=100MB` and hoodie.parquet.small.file.limit=120MB, Hudi will pick all files < 100MB and try to get them upto 120MB. For **merge-on-read**, there are few more configs to set. MergeOnRead works differently for different INDEX choices. diff --git a/website/versioned_docs/version-0.11.0/file_sizing.md b/website/versioned_docs/version-0.11.0/file_sizing.md index e7935445d9e6d..58831e4b29959 100644 --- a/website/versioned_docs/version-0.11.0/file_sizing.md +++ b/website/versioned_docs/version-0.11.0/file_sizing.md @@ -21,7 +21,7 @@ and the [soft limit](/docs/configurations#hoodieparquetsmallfilelimit) below whi be considered a small file. For the initial bootstrap of a Hudi table, tuning record size estimate is also important to ensure sufficient records are bin-packed in a parquet file. For subsequent writes, Hudi automatically uses average record size based on previous commit. 
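The deployment hunk above also mentions that the Hudi Spark DataSource supports Spark streaming ingestion; below is a minimal structured-streaming sketch under stated assumptions, using a stand-in rate source and placeholder table name, checkpoint location and base path:

```scala
import org.apache.spark.sql.functions.{col, to_date}

// Stand-in streaming source; in practice this would be Kafka or another real source.
val streamingDF = spark.readStream.format("rate").load().
  withColumn("ts", col("timestamp").cast("long")).
  withColumn("dt", to_date(col("timestamp")))

streamingDF.writeStream.format("hudi").
  option("hoodie.table.name", "hudi_streaming_tbl").               // assumed table name
  option("hoodie.datasource.write.recordkey.field", "value").
  option("hoodie.datasource.write.precombine.field", "ts").
  option("hoodie.datasource.write.partitionpath.field", "dt").
  option("checkpointLocation", "/tmp/hudi_streaming_checkpoint").  // assumed checkpoint path
  outputMode("append").
  start("/tmp/hudi_streaming_tbl")                                 // assumed base path
```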
Hudi will try to add enough records to a small file at write time to get it to the -configured maximum limit. For e.g , with `compactionSmallFileSize=100MB` and limitFileSize=120MB, Hudi will pick all +configured maximum limit. For e.g , with `hoodie.parquet.max.file.size=100MB` and hoodie.parquet.small.file.limit=120MB, Hudi will pick all files < 100MB and try to get them upto 120MB. ### For Merge-On-Read diff --git a/website/versioned_docs/version-0.11.0/querying_data.md b/website/versioned_docs/version-0.11.0/querying_data.md index 6ad05015e7535..3a81bc22a17ad 100644 --- a/website/versioned_docs/version-0.11.0/querying_data.md +++ b/website/versioned_docs/version-0.11.0/querying_data.md @@ -17,7 +17,7 @@ In sections, below we will discuss specific setup to access different query type The Spark Datasource API is a popular way of authoring Spark ETL pipelines. Hudi tables can be queried via the Spark datasource with a simple `spark.read.parquet`. See the [Spark Quick Start](/docs/quick-start-guide) for more examples of Spark datasource reading queries. -To setup Spark for querying Hudi, see the [Query Engine Setup](/docs/0.11.0/query_engine_setup#Spark-DataSource) page. +To setup Spark for querying Hudi, see the [Query Engine Setup](/docs/0.11.0/query_engine_setup#spark) page. ### Snapshot query {#spark-snap-query} Retrieve the data table at the present point in time. diff --git a/website/versioned_docs/version-0.11.0/tuning-guide.md b/website/versioned_docs/version-0.11.0/tuning-guide.md index 4affeafda663d..12b68098e0600 100644 --- a/website/versioned_docs/version-0.11.0/tuning-guide.md +++ b/website/versioned_docs/version-0.11.0/tuning-guide.md @@ -17,7 +17,7 @@ Writing data via Hudi happens as a Spark job and thus general rules of spark deb **Spark Memory** : Typically, hudi needs to be able to read a single file into memory to perform merges or compactions and thus the executor memory should be sufficient to accomodate this. In addition, Hoodie caches the input to be able to intelligently place data and thus leaving some `spark.memory.storageFraction` will generally help boost performance. -**Sizing files**: Set `limitFileSize` above judiciously, to balance ingest/write latency vs number of files & consequently metadata overhead associated with it. +**Sizing files**: Set `hoodie.parquet.small.file.limit` above judiciously, to balance ingest/write latency vs number of files & consequently metadata overhead associated with it. **Timeseries/Log data** : Default configs are tuned for database/nosql changelogs where individual record sizes are large. Another very popular class of data is timeseries/event/log data that tends to be more volumnious with lot more records per partition. In such cases consider tuning the bloom filter accuracy via `.bloomFilterFPP()/bloomFilterNumEntries()` to achieve your target index look up time. Also, consider making a key that is prefixed with time of the event, which will enable range pruning & significantly speeding up index lookup. diff --git a/website/versioned_docs/version-0.11.0/write_operations.md b/website/versioned_docs/version-0.11.0/write_operations.md index baa6d7dbf8483..9ff8431384cad 100644 --- a/website/versioned_docs/version-0.11.0/write_operations.md +++ b/website/versioned_docs/version-0.11.0/write_operations.md @@ -51,7 +51,7 @@ The following is an inside look on the Hudi write path and the sequence of event 6. Update [Index](/docs/indexing) 1. Now that the write is performed, we will go back and update the index. 7. Commit - 1. 
Finally we commit all of these changes atomically. (A [callback notification](/docs/writing_data#commit-notifications) is exposed) + 1. Finally we commit all of these changes atomically. (A [callback notification](writing_data#commit-notifications) is exposed) 8. [Clean](/docs/hoodie_cleaner) (if needed) 1. Following the commit, cleaning is invoked if needed. 9. [Compaction](/docs/compaction) diff --git a/website/versioned_docs/version-0.11.1/compaction.md b/website/versioned_docs/version-0.11.1/compaction.md index 9d73e31bd5b02..7b84502c973d5 100644 --- a/website/versioned_docs/version-0.11.1/compaction.md +++ b/website/versioned_docs/version-0.11.1/compaction.md @@ -95,7 +95,7 @@ is enabled by default. ::: ### Hudi Compactor Utility -Hudi provides a standalone tool to execute specific compactions asynchronously. Below is an example and you can read more in the [deployment guide](/docs/deployment#compactions) +Hudi provides a standalone tool to execute specific compactions asynchronously. Below is an example and you can read more in the [deployment guide](/docs/cli#compactions) Example: ```properties diff --git a/website/versioned_docs/version-0.11.1/deployment.md b/website/versioned_docs/version-0.11.1/deployment.md index c8c4e5cefdc63..bce07498029b7 100644 --- a/website/versioned_docs/version-0.11.1/deployment.md +++ b/website/versioned_docs/version-0.11.1/deployment.md @@ -135,7 +135,7 @@ Here is an example invocation for reading from kafka topic in a continuous mode ### Spark Datasource Writer Jobs -As described in [Writing Data](/docs/writing_data#spark-datasource-writer), you can use spark datasource to ingest to hudi table. This mechanism allows you to ingest any spark dataframe in Hudi format. Hudi Spark DataSource also supports spark streaming to ingest a streaming source to Hudi table. For Merge On Read table types, inline compaction is turned on by default which runs after every ingestion run. The compaction frequency can be changed by setting the property "hoodie.compact.inline.max.delta.commits". +As described in [Writing Data](writing_data#spark-datasource-writer), you can use spark datasource to ingest to hudi table. This mechanism allows you to ingest any spark dataframe in Hudi format. Hudi Spark DataSource also supports spark streaming to ingest a streaming source to Hudi table. For Merge On Read table types, inline compaction is turned on by default which runs after every ingestion run. The compaction frequency can be changed by setting the property "hoodie.compact.inline.max.delta.commits". Here is an example invocation using spark datasource diff --git a/website/versioned_docs/version-0.11.1/faq.md b/website/versioned_docs/version-0.11.1/faq.md index b081d0fe1b03f..095480d298424 100644 --- a/website/versioned_docs/version-0.11.1/faq.md +++ b/website/versioned_docs/version-0.11.1/faq.md @@ -295,7 +295,7 @@ a) **Auto Size small files during ingestion**: This solution trades ingest/writi Hudi has the ability to maintain a configured target file size, when performing **upsert/insert** operations. (Note: **bulk_insert** operation does not provide this functionality and is designed as a simpler replacement for normal `spark.write.parquet` ) -For **copy-on-write**, this is as simple as configuring the [maximum size for a base/parquet file](https://hudi.apache.org/docs/configurations#hoodieparquetmaxfilesize) and the [soft limit](https://hudi.apache.org/docs/configurations#hoodieparquetsmallfilelimit) below which a file should be considered a small file. 
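Step 7 of the write path above notes that a commit callback notification is exposed; a sketch of enabling the HTTP callback through write options, with the endpoint purely a placeholder and `inputDF` assumed:

```scala
// Fire an HTTP callback once the commit in step 7 completes (URL is illustrative).
inputDF.write.format("hudi").
  option("hoodie.table.name", "hudi_trips_cow").
  option("hoodie.datasource.write.recordkey.field", "uuid").
  option("hoodie.datasource.write.precombine.field", "ts").
  option("hoodie.datasource.write.partitionpath.field", "partitionpath").
  option("hoodie.write.commit.callback.on", "true").
  option("hoodie.write.commit.callback.http.url", "https://example.com/hudi/commits").  // placeholder endpoint
  mode("append").
  save("/tmp/hudi_trips_cow")
```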
For the initial bootstrap to Hudi table, tuning record size estimate is also important to ensure sufficient records are bin-packed in a parquet file. For subsequent writes, Hudi automatically uses average record size based on previous commit. Hudi will try to add enough records to a small file at write time to get it to the configured maximum limit. For e.g , with `compactionSmallFileSize=100MB` and limitFileSize=120MB, Hudi will pick all files < 100MB and try to get them upto 120MB. +For **copy-on-write**, this is as simple as configuring the [maximum size for a base/parquet file](https://hudi.apache.org/docs/configurations#hoodieparquetmaxfilesize) and the [soft limit](https://hudi.apache.org/docs/configurations#hoodieparquetsmallfilelimit) below which a file should be considered a small file. For the initial bootstrap to Hudi table, tuning record size estimate is also important to ensure sufficient records are bin-packed in a parquet file. For subsequent writes, Hudi automatically uses average record size based on previous commit. Hudi will try to add enough records to a small file at write time to get it to the configured maximum limit. For e.g , with `hoodie.parquet.max.file.size=100MB` and hoodie.parquet.small.file.limit=120MB, Hudi will pick all files < 100MB and try to get them upto 120MB. For **merge-on-read**, there are few more configs to set. MergeOnRead works differently for different INDEX choices. diff --git a/website/versioned_docs/version-0.11.1/file_sizing.md b/website/versioned_docs/version-0.11.1/file_sizing.md index e7935445d9e6d..58831e4b29959 100644 --- a/website/versioned_docs/version-0.11.1/file_sizing.md +++ b/website/versioned_docs/version-0.11.1/file_sizing.md @@ -21,7 +21,7 @@ and the [soft limit](/docs/configurations#hoodieparquetsmallfilelimit) below whi be considered a small file. For the initial bootstrap of a Hudi table, tuning record size estimate is also important to ensure sufficient records are bin-packed in a parquet file. For subsequent writes, Hudi automatically uses average record size based on previous commit. Hudi will try to add enough records to a small file at write time to get it to the -configured maximum limit. For e.g , with `compactionSmallFileSize=100MB` and limitFileSize=120MB, Hudi will pick all +configured maximum limit. For e.g , with `hoodie.parquet.max.file.size=100MB` and hoodie.parquet.small.file.limit=120MB, Hudi will pick all files < 100MB and try to get them upto 120MB. ### For Merge-On-Read diff --git a/website/versioned_docs/version-0.11.1/querying_data.md b/website/versioned_docs/version-0.11.1/querying_data.md index 4cae617a5b6e0..b9c2294a83a4b 100644 --- a/website/versioned_docs/version-0.11.1/querying_data.md +++ b/website/versioned_docs/version-0.11.1/querying_data.md @@ -17,7 +17,7 @@ In sections, below we will discuss specific setup to access different query type The Spark Datasource API is a popular way of authoring Spark ETL pipelines. Hudi tables can be queried via the Spark datasource with a simple `spark.read.parquet`. See the [Spark Quick Start](/docs/quick-start-guide) for more examples of Spark datasource reading queries. -To setup Spark for querying Hudi, see the [Query Engine Setup](/docs/0.11.1/query_engine_setup#Spark-DataSource) page. +To setup Spark for querying Hudi, see the [Query Engine Setup](/docs/0.11.1/query_engine_setup#spark) page. ### Snapshot query {#spark-snap-query} Retrieve the data table at the present point in time. 
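For the snapshot query heading that closes the hunk above, a minimal spark-shell read sketch against the quickstart path (view name and predicate follow the quickstart style and are only illustrative):

```scala
// Snapshot query: read the table at the present point in time (the default query type).
val tripsDF = spark.read.format("hudi").load("/tmp/hudi_trips_cow")
tripsDF.createOrReplaceTempView("hudi_trips_snapshot")
spark.sql("select count(*) from hudi_trips_snapshot where fare > 20.0").show()
```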
diff --git a/website/versioned_docs/version-0.11.1/tuning-guide.md b/website/versioned_docs/version-0.11.1/tuning-guide.md index 4affeafda663d..12b68098e0600 100644 --- a/website/versioned_docs/version-0.11.1/tuning-guide.md +++ b/website/versioned_docs/version-0.11.1/tuning-guide.md @@ -17,7 +17,7 @@ Writing data via Hudi happens as a Spark job and thus general rules of spark deb **Spark Memory** : Typically, hudi needs to be able to read a single file into memory to perform merges or compactions and thus the executor memory should be sufficient to accomodate this. In addition, Hoodie caches the input to be able to intelligently place data and thus leaving some `spark.memory.storageFraction` will generally help boost performance. -**Sizing files**: Set `limitFileSize` above judiciously, to balance ingest/write latency vs number of files & consequently metadata overhead associated with it. +**Sizing files**: Set `hoodie.parquet.small.file.limit` above judiciously, to balance ingest/write latency vs number of files & consequently metadata overhead associated with it. **Timeseries/Log data** : Default configs are tuned for database/nosql changelogs where individual record sizes are large. Another very popular class of data is timeseries/event/log data that tends to be more volumnious with lot more records per partition. In such cases consider tuning the bloom filter accuracy via `.bloomFilterFPP()/bloomFilterNumEntries()` to achieve your target index look up time. Also, consider making a key that is prefixed with time of the event, which will enable range pruning & significantly speeding up index lookup. diff --git a/website/versioned_docs/version-0.11.1/write_operations.md b/website/versioned_docs/version-0.11.1/write_operations.md index baa6d7dbf8483..9ff8431384cad 100644 --- a/website/versioned_docs/version-0.11.1/write_operations.md +++ b/website/versioned_docs/version-0.11.1/write_operations.md @@ -51,7 +51,7 @@ The following is an inside look on the Hudi write path and the sequence of event 6. Update [Index](/docs/indexing) 1. Now that the write is performed, we will go back and update the index. 7. Commit - 1. Finally we commit all of these changes atomically. (A [callback notification](/docs/writing_data#commit-notifications) is exposed) + 1. Finally we commit all of these changes atomically. (A [callback notification](writing_data#commit-notifications) is exposed) 8. [Clean](/docs/hoodie_cleaner) (if needed) 1. Following the commit, cleaning is invoked if needed. 9. [Compaction](/docs/compaction) diff --git a/website/versioned_docs/version-0.12.0/compaction.md b/website/versioned_docs/version-0.12.0/compaction.md index 9d73e31bd5b02..7b84502c973d5 100644 --- a/website/versioned_docs/version-0.12.0/compaction.md +++ b/website/versioned_docs/version-0.12.0/compaction.md @@ -95,7 +95,7 @@ is enabled by default. ::: ### Hudi Compactor Utility -Hudi provides a standalone tool to execute specific compactions asynchronously. Below is an example and you can read more in the [deployment guide](/docs/deployment#compactions) +Hudi provides a standalone tool to execute specific compactions asynchronously. 
Below is an example and you can read more in the [deployment guide](/docs/cli#compactions) Example: ```properties diff --git a/website/versioned_docs/version-0.12.0/deployment.md b/website/versioned_docs/version-0.12.0/deployment.md index 8964d2f913568..1476b051a628e 100644 --- a/website/versioned_docs/version-0.12.0/deployment.md +++ b/website/versioned_docs/version-0.12.0/deployment.md @@ -135,7 +135,7 @@ Here is an example invocation for reading from kafka topic in a continuous mode ### Spark Datasource Writer Jobs -As described in [Writing Data](/docs/writing_data#spark-datasource-writer), you can use spark datasource to ingest to hudi table. This mechanism allows you to ingest any spark dataframe in Hudi format. Hudi Spark DataSource also supports spark streaming to ingest a streaming source to Hudi table. For Merge On Read table types, inline compaction is turned on by default which runs after every ingestion run. The compaction frequency can be changed by setting the property "hoodie.compact.inline.max.delta.commits". +As described in [Writing Data](writing_data#spark-datasource-writer), you can use spark datasource to ingest to hudi table. This mechanism allows you to ingest any spark dataframe in Hudi format. Hudi Spark DataSource also supports spark streaming to ingest a streaming source to Hudi table. For Merge On Read table types, inline compaction is turned on by default which runs after every ingestion run. The compaction frequency can be changed by setting the property "hoodie.compact.inline.max.delta.commits". Here is an example invocation using spark datasource diff --git a/website/versioned_docs/version-0.12.0/faq.md b/website/versioned_docs/version-0.12.0/faq.md index b43043d89cd04..cb3225571c2f9 100644 --- a/website/versioned_docs/version-0.12.0/faq.md +++ b/website/versioned_docs/version-0.12.0/faq.md @@ -322,7 +322,7 @@ a) **Auto Size small files during ingestion**: This solution trades ingest/writi Hudi has the ability to maintain a configured target file size, when performing **upsert/insert** operations. (Note: **bulk_insert** operation does not provide this functionality and is designed as a simpler replacement for normal `spark.write.parquet` ) -For **copy-on-write**, this is as simple as configuring the [maximum size for a base/parquet file](https://hudi.apache.org/docs/configurations#hoodieparquetmaxfilesize) and the [soft limit](https://hudi.apache.org/docs/configurations#hoodieparquetsmallfilelimit) below which a file should be considered a small file. For the initial bootstrap to Hudi table, tuning record size estimate is also important to ensure sufficient records are bin-packed in a parquet file. For subsequent writes, Hudi automatically uses average record size based on previous commit. Hudi will try to add enough records to a small file at write time to get it to the configured maximum limit. For e.g , with `compactionSmallFileSize=100MB` and limitFileSize=120MB, Hudi will pick all files < 100MB and try to get them upto 120MB. +For **copy-on-write**, this is as simple as configuring the [maximum size for a base/parquet file](https://hudi.apache.org/docs/configurations#hoodieparquetmaxfilesize) and the [soft limit](https://hudi.apache.org/docs/configurations#hoodieparquetsmallfilelimit) below which a file should be considered a small file. For the initial bootstrap to Hudi table, tuning record size estimate is also important to ensure sufficient records are bin-packed in a parquet file. 
For subsequent writes, Hudi automatically uses average record size based on previous commit. Hudi will try to add enough records to a small file at write time to get it to the configured maximum limit. For e.g , with `hoodie.parquet.max.file.size=100MB` and hoodie.parquet.small.file.limit=120MB, Hudi will pick all files < 100MB and try to get them upto 120MB. For **merge-on-read**, there are few more configs to set. MergeOnRead works differently for different INDEX choices. diff --git a/website/versioned_docs/version-0.12.0/file_sizing.md b/website/versioned_docs/version-0.12.0/file_sizing.md index 0bb0d9b003b14..1c1c12fe20711 100644 --- a/website/versioned_docs/version-0.12.0/file_sizing.md +++ b/website/versioned_docs/version-0.12.0/file_sizing.md @@ -21,7 +21,7 @@ and the [soft limit](/docs/configurations#hoodieparquetsmallfilelimit) below whi be considered a small file. For the initial bootstrap of a Hudi table, tuning record size estimate is also important to ensure sufficient records are bin-packed in a parquet file. For subsequent writes, Hudi automatically uses average record size based on previous commit. Hudi will try to add enough records to a small file at write time to get it to the -configured maximum limit. For e.g , with `compactionSmallFileSize=100MB` and limitFileSize=120MB, Hudi will pick all +configured maximum limit. For e.g , with `hoodie.parquet.max.file.size=100MB` and hoodie.parquet.small.file.limit=120MB, Hudi will pick all files < 100MB and try to get them upto 120MB. ### For Merge-On-Read diff --git a/website/versioned_docs/version-0.12.0/flink-quick-start-guide.md b/website/versioned_docs/version-0.12.0/flink-quick-start-guide.md index 4b7e642099f11..75bae79386e89 100644 --- a/website/versioned_docs/version-0.12.0/flink-quick-start-guide.md +++ b/website/versioned_docs/version-0.12.0/flink-quick-start-guide.md @@ -12,7 +12,7 @@ This guide helps you quickly start using Flink on Hudi, and learn different mode - **Quick Start** : Read [Quick Start](#quick-start) to get started quickly Flink sql client to write to(read from) Hudi. - **Configuration** : For [Global Configuration](/docs/0.12.0/flink_configuration#global-configurations), sets up through `$FLINK_HOME/conf/flink-conf.yaml`. For per job configuration, sets up through [Table Option](/docs/0.12.0/flink_configuration#table-options). - **Writing Data** : Flink supports different modes for writing, such as [CDC Ingestion](/docs/0.12.0/hoodie_deltastreamer#cdc-ingestion), [Bulk Insert](/docs/0.12.0/hoodie_deltastreamer#bulk-insert), [Index Bootstrap](/docs/0.12.0/hoodie_deltastreamer#index-bootstrap), [Changelog Mode](/docs/0.12.0/hoodie_deltastreamer#changelog-mode) and [Append Mode](/docs/0.12.0/hoodie_deltastreamer#append-mode). -- **Querying Data** : Flink supports different modes for reading, such as [Streaming Query](/docs/querying_data#streaming-query) and [Incremental Query](/docs/querying_data#incremental-query). +- **Querying Data** : Flink supports different modes for reading, such as [Streaming Query](querying_data#streaming-query) and [Incremental Query](/docs/querying_data#incremental-query). - **Tuning** : For write/read tasks, this guide gives some tuning suggestions, such as [Memory Optimization](/docs/0.12.0/flink_configuration#memory-optimization) and [Write Rate Limit](/docs/0.12.0/flink_configuration#write-rate-limit). - **Optimization**: Offline compaction is supported [Offline Compaction](/docs/compaction#flink-offline-compaction). 
- **Query Engines**: Besides Flink, many other engines are integrated: [Hive Query](/docs/syncing_metastore#flink-setup), [Presto Query](/docs/0.12.0/query_engine_setup#prestodb). diff --git a/website/versioned_docs/version-0.12.0/querying_data.md b/website/versioned_docs/version-0.12.0/querying_data.md index 074d9a2e7c436..70adabf40a637 100644 --- a/website/versioned_docs/version-0.12.0/querying_data.md +++ b/website/versioned_docs/version-0.12.0/querying_data.md @@ -17,7 +17,7 @@ In sections, below we will discuss specific setup to access different query type The Spark Datasource API is a popular way of authoring Spark ETL pipelines. Hudi tables can be queried via the Spark datasource with a simple `spark.read.parquet`. See the [Spark Quick Start](/docs/quick-start-guide) for more examples of Spark datasource reading queries. -To setup Spark for querying Hudi, see the [Query Engine Setup](/docs/0.12.0/query_engine_setup#Spark-DataSource) page. +To setup Spark for querying Hudi, see the [Query Engine Setup](/docs/0.12.0/query_engine_setup#spark) page. ### Snapshot query {#spark-snap-query} Retrieve the data table at the present point in time. diff --git a/website/versioned_docs/version-0.12.0/tuning-guide.md b/website/versioned_docs/version-0.12.0/tuning-guide.md index 4affeafda663d..12b68098e0600 100644 --- a/website/versioned_docs/version-0.12.0/tuning-guide.md +++ b/website/versioned_docs/version-0.12.0/tuning-guide.md @@ -17,7 +17,7 @@ Writing data via Hudi happens as a Spark job and thus general rules of spark deb **Spark Memory** : Typically, hudi needs to be able to read a single file into memory to perform merges or compactions and thus the executor memory should be sufficient to accomodate this. In addition, Hoodie caches the input to be able to intelligently place data and thus leaving some `spark.memory.storageFraction` will generally help boost performance. -**Sizing files**: Set `limitFileSize` above judiciously, to balance ingest/write latency vs number of files & consequently metadata overhead associated with it. +**Sizing files**: Set `hoodie.parquet.small.file.limit` above judiciously, to balance ingest/write latency vs number of files & consequently metadata overhead associated with it. **Timeseries/Log data** : Default configs are tuned for database/nosql changelogs where individual record sizes are large. Another very popular class of data is timeseries/event/log data that tends to be more volumnious with lot more records per partition. In such cases consider tuning the bloom filter accuracy via `.bloomFilterFPP()/bloomFilterNumEntries()` to achieve your target index look up time. Also, consider making a key that is prefixed with time of the event, which will enable range pruning & significantly speeding up index lookup. diff --git a/website/versioned_docs/version-0.12.0/write_operations.md b/website/versioned_docs/version-0.12.0/write_operations.md index baa6d7dbf8483..9ff8431384cad 100644 --- a/website/versioned_docs/version-0.12.0/write_operations.md +++ b/website/versioned_docs/version-0.12.0/write_operations.md @@ -51,7 +51,7 @@ The following is an inside look on the Hudi write path and the sequence of event 6. Update [Index](/docs/indexing) 1. Now that the write is performed, we will go back and update the index. 7. Commit - 1. Finally we commit all of these changes atomically. (A [callback notification](/docs/writing_data#commit-notifications) is exposed) + 1. Finally we commit all of these changes atomically. 
(A [callback notification](writing_data#commit-notifications) is exposed) 8. [Clean](/docs/hoodie_cleaner) (if needed) 1. Following the commit, cleaning is invoked if needed. 9. [Compaction](/docs/compaction) diff --git a/website/versioned_docs/version-0.12.1/compaction.md b/website/versioned_docs/version-0.12.1/compaction.md index 9d73e31bd5b02..7b84502c973d5 100644 --- a/website/versioned_docs/version-0.12.1/compaction.md +++ b/website/versioned_docs/version-0.12.1/compaction.md @@ -95,7 +95,7 @@ is enabled by default. ::: ### Hudi Compactor Utility -Hudi provides a standalone tool to execute specific compactions asynchronously. Below is an example and you can read more in the [deployment guide](/docs/deployment#compactions) +Hudi provides a standalone tool to execute specific compactions asynchronously. Below is an example and you can read more in the [deployment guide](/docs/cli#compactions) Example: ```properties diff --git a/website/versioned_docs/version-0.12.1/deployment.md b/website/versioned_docs/version-0.12.1/deployment.md index edd7bc69305e1..4f01c1b397548 100644 --- a/website/versioned_docs/version-0.12.1/deployment.md +++ b/website/versioned_docs/version-0.12.1/deployment.md @@ -135,7 +135,7 @@ Here is an example invocation for reading from kafka topic in a continuous mode ### Spark Datasource Writer Jobs -As described in [Writing Data](/docs/writing_data#spark-datasource-writer), you can use spark datasource to ingest to hudi table. This mechanism allows you to ingest any spark dataframe in Hudi format. Hudi Spark DataSource also supports spark streaming to ingest a streaming source to Hudi table. For Merge On Read table types, inline compaction is turned on by default which runs after every ingestion run. The compaction frequency can be changed by setting the property "hoodie.compact.inline.max.delta.commits". +As described in [Writing Data](writing_data#spark-datasource-writer), you can use spark datasource to ingest to hudi table. This mechanism allows you to ingest any spark dataframe in Hudi format. Hudi Spark DataSource also supports spark streaming to ingest a streaming source to Hudi table. For Merge On Read table types, inline compaction is turned on by default which runs after every ingestion run. The compaction frequency can be changed by setting the property "hoodie.compact.inline.max.delta.commits". Here is an example invocation using spark datasource diff --git a/website/versioned_docs/version-0.12.1/faq.md b/website/versioned_docs/version-0.12.1/faq.md index 41b76ec6c15d4..9245d723aa216 100644 --- a/website/versioned_docs/version-0.12.1/faq.md +++ b/website/versioned_docs/version-0.12.1/faq.md @@ -322,7 +322,7 @@ a) **Auto Size small files during ingestion**: This solution trades ingest/writi Hudi has the ability to maintain a configured target file size, when performing **upsert/insert** operations. (Note: **bulk_insert** operation does not provide this functionality and is designed as a simpler replacement for normal `spark.write.parquet` ) -For **copy-on-write**, this is as simple as configuring the [maximum size for a base/parquet file](https://hudi.apache.org/docs/configurations#hoodieparquetmaxfilesize) and the [soft limit](https://hudi.apache.org/docs/configurations#hoodieparquetsmallfilelimit) below which a file should be considered a small file. For the initial bootstrap to Hudi table, tuning record size estimate is also important to ensure sufficient records are bin-packed in a parquet file. 
For subsequent writes, Hudi automatically uses average record size based on previous commit. Hudi will try to add enough records to a small file at write time to get it to the configured maximum limit. For e.g , with `compactionSmallFileSize=100MB` and limitFileSize=120MB, Hudi will pick all files < 100MB and try to get them upto 120MB. +For **copy-on-write**, this is as simple as configuring the [maximum size for a base/parquet file](https://hudi.apache.org/docs/configurations#hoodieparquetmaxfilesize) and the [soft limit](https://hudi.apache.org/docs/configurations#hoodieparquetsmallfilelimit) below which a file should be considered a small file. For the initial bootstrap to Hudi table, tuning record size estimate is also important to ensure sufficient records are bin-packed in a parquet file. For subsequent writes, Hudi automatically uses average record size based on previous commit. Hudi will try to add enough records to a small file at write time to get it to the configured maximum limit. For e.g , with `hoodie.parquet.max.file.size=100MB` and hoodie.parquet.small.file.limit=120MB, Hudi will pick all files < 100MB and try to get them upto 120MB. For **merge-on-read**, there are few more configs to set. MergeOnRead works differently for different INDEX choices. diff --git a/website/versioned_docs/version-0.12.1/file_sizing.md b/website/versioned_docs/version-0.12.1/file_sizing.md index 0bb0d9b003b14..1c1c12fe20711 100644 --- a/website/versioned_docs/version-0.12.1/file_sizing.md +++ b/website/versioned_docs/version-0.12.1/file_sizing.md @@ -21,7 +21,7 @@ and the [soft limit](/docs/configurations#hoodieparquetsmallfilelimit) below whi be considered a small file. For the initial bootstrap of a Hudi table, tuning record size estimate is also important to ensure sufficient records are bin-packed in a parquet file. For subsequent writes, Hudi automatically uses average record size based on previous commit. Hudi will try to add enough records to a small file at write time to get it to the -configured maximum limit. For e.g , with `compactionSmallFileSize=100MB` and limitFileSize=120MB, Hudi will pick all +configured maximum limit. For e.g , with `hoodie.parquet.max.file.size=100MB` and hoodie.parquet.small.file.limit=120MB, Hudi will pick all files < 100MB and try to get them upto 120MB. ### For Merge-On-Read diff --git a/website/versioned_docs/version-0.12.1/querying_data.md b/website/versioned_docs/version-0.12.1/querying_data.md index 332368fcd33bb..374502e96d2d3 100644 --- a/website/versioned_docs/version-0.12.1/querying_data.md +++ b/website/versioned_docs/version-0.12.1/querying_data.md @@ -17,7 +17,7 @@ In sections, below we will discuss specific setup to access different query type The Spark Datasource API is a popular way of authoring Spark ETL pipelines. Hudi tables can be queried via the Spark datasource with a simple `spark.read.parquet`. See the [Spark Quick Start](/docs/quick-start-guide) for more examples of Spark datasource reading queries. -To setup Spark for querying Hudi, see the [Query Engine Setup](/docs/0.12.1/query_engine_setup#Spark-DataSource) page. +To setup Spark for querying Hudi, see the [Query Engine Setup](/docs/0.12.1/query_engine_setup#spark) page. ### Snapshot query {#spark-snap-query} Retrieve the data table at the present point in time. 
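The FAQ hunk above stresses that, for the initial bootstrap, tuning the record size estimate matters because there is no previous commit to derive an average from (and that `bulk_insert` skips auto-sizing altogether). A sketch of pinning the estimate for a first copy-on-write load, with `inputDF` and the byte value purely illustrative:

```scala
// First load into an empty table: give Hudi an average record-size hint (bytes) for bin-packing.
inputDF.write.format("hudi").
  option("hoodie.table.name", "hudi_trips_cow").
  option("hoodie.datasource.write.recordkey.field", "uuid").
  option("hoodie.datasource.write.precombine.field", "ts").
  option("hoodie.datasource.write.partitionpath.field", "partitionpath").
  option("hoodie.datasource.write.operation", "insert").        // insert/upsert auto-size; bulk_insert does not
  option("hoodie.copyonwrite.record.size.estimate", "512").     // assumed ~512 bytes per record
  mode("overwrite").
  save("/tmp/hudi_trips_cow")
```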
diff --git a/website/versioned_docs/version-0.12.1/tuning-guide.md b/website/versioned_docs/version-0.12.1/tuning-guide.md index 4affeafda663d..12b68098e0600 100644 --- a/website/versioned_docs/version-0.12.1/tuning-guide.md +++ b/website/versioned_docs/version-0.12.1/tuning-guide.md @@ -17,7 +17,7 @@ Writing data via Hudi happens as a Spark job and thus general rules of spark deb **Spark Memory** : Typically, hudi needs to be able to read a single file into memory to perform merges or compactions and thus the executor memory should be sufficient to accomodate this. In addition, Hoodie caches the input to be able to intelligently place data and thus leaving some `spark.memory.storageFraction` will generally help boost performance. -**Sizing files**: Set `limitFileSize` above judiciously, to balance ingest/write latency vs number of files & consequently metadata overhead associated with it. +**Sizing files**: Set `hoodie.parquet.small.file.limit` above judiciously, to balance ingest/write latency vs number of files & consequently metadata overhead associated with it. **Timeseries/Log data** : Default configs are tuned for database/nosql changelogs where individual record sizes are large. Another very popular class of data is timeseries/event/log data that tends to be more volumnious with lot more records per partition. In such cases consider tuning the bloom filter accuracy via `.bloomFilterFPP()/bloomFilterNumEntries()` to achieve your target index look up time. Also, consider making a key that is prefixed with time of the event, which will enable range pruning & significantly speeding up index lookup. diff --git a/website/versioned_docs/version-0.12.1/write_operations.md b/website/versioned_docs/version-0.12.1/write_operations.md index baa6d7dbf8483..9ff8431384cad 100644 --- a/website/versioned_docs/version-0.12.1/write_operations.md +++ b/website/versioned_docs/version-0.12.1/write_operations.md @@ -51,7 +51,7 @@ The following is an inside look on the Hudi write path and the sequence of event 6. Update [Index](/docs/indexing) 1. Now that the write is performed, we will go back and update the index. 7. Commit - 1. Finally we commit all of these changes atomically. (A [callback notification](/docs/writing_data#commit-notifications) is exposed) + 1. Finally we commit all of these changes atomically. (A [callback notification](writing_data#commit-notifications) is exposed) 8. [Clean](/docs/hoodie_cleaner) (if needed) 1. Following the commit, cleaning is invoked if needed. 9. [Compaction](/docs/compaction) diff --git a/website/versioned_docs/version-0.12.2/compaction.md b/website/versioned_docs/version-0.12.2/compaction.md index a6249b7ae7c48..e99cc2082c5fe 100644 --- a/website/versioned_docs/version-0.12.2/compaction.md +++ b/website/versioned_docs/version-0.12.2/compaction.md @@ -95,7 +95,7 @@ is enabled by default. ::: ### Hudi Compactor Utility -Hudi provides a standalone tool to execute specific compactions asynchronously. Below is an example and you can read more in the [deployment guide](/docs/deployment#compactions) +Hudi provides a standalone tool to execute specific compactions asynchronously. 
Below is an example and you can read more in the [deployment guide](/docs/cli#compactions) Example: ```properties diff --git a/website/versioned_docs/version-0.12.2/deployment.md b/website/versioned_docs/version-0.12.2/deployment.md index 18d9259f745e0..57f1ed35cb467 100644 --- a/website/versioned_docs/version-0.12.2/deployment.md +++ b/website/versioned_docs/version-0.12.2/deployment.md @@ -135,7 +135,7 @@ Here is an example invocation for reading from kafka topic in a continuous mode ### Spark Datasource Writer Jobs -As described in [Writing Data](/docs/writing_data#spark-datasource-writer), you can use spark datasource to ingest to hudi table. This mechanism allows you to ingest any spark dataframe in Hudi format. Hudi Spark DataSource also supports spark streaming to ingest a streaming source to Hudi table. For Merge On Read table types, inline compaction is turned on by default which runs after every ingestion run. The compaction frequency can be changed by setting the property "hoodie.compact.inline.max.delta.commits". +As described in [Writing Data](writing_data#spark-datasource-writer), you can use spark datasource to ingest to hudi table. This mechanism allows you to ingest any spark dataframe in Hudi format. Hudi Spark DataSource also supports spark streaming to ingest a streaming source to Hudi table. For Merge On Read table types, inline compaction is turned on by default which runs after every ingestion run. The compaction frequency can be changed by setting the property "hoodie.compact.inline.max.delta.commits". Here is an example invocation using spark datasource diff --git a/website/versioned_docs/version-0.12.2/faq.md b/website/versioned_docs/version-0.12.2/faq.md index 2752b49e3a793..0cf53d918d4cd 100644 --- a/website/versioned_docs/version-0.12.2/faq.md +++ b/website/versioned_docs/version-0.12.2/faq.md @@ -342,7 +342,7 @@ a) **Auto Size small files during ingestion**: This solution trades ingest/writi Hudi has the ability to maintain a configured target file size, when performing **upsert/insert** operations. (Note: **bulk_insert** operation does not provide this functionality and is designed as a simpler replacement for normal `spark.write.parquet` ) -For **copy-on-write**, this is as simple as configuring the [maximum size for a base/parquet file](https://hudi.apache.org/docs/configurations#hoodieparquetmaxfilesize) and the [soft limit](https://hudi.apache.org/docs/configurations#hoodieparquetsmallfilelimit) below which a file should be considered a small file. For the initial bootstrap to Hudi table, tuning record size estimate is also important to ensure sufficient records are bin-packed in a parquet file. For subsequent writes, Hudi automatically uses average record size based on previous commit. Hudi will try to add enough records to a small file at write time to get it to the configured maximum limit. For e.g , with `compactionSmallFileSize=100MB` and limitFileSize=120MB, Hudi will pick all files < 100MB and try to get them upto 120MB. +For **copy-on-write**, this is as simple as configuring the [maximum size for a base/parquet file](https://hudi.apache.org/docs/configurations#hoodieparquetmaxfilesize) and the [soft limit](https://hudi.apache.org/docs/configurations#hoodieparquetsmallfilelimit) below which a file should be considered a small file. For the initial bootstrap to Hudi table, tuning record size estimate is also important to ensure sufficient records are bin-packed in a parquet file. 
For subsequent writes, Hudi automatically uses average record size based on previous commit. Hudi will try to add enough records to a small file at write time to get it to the configured maximum limit. For e.g , with `hoodie.parquet.small.file.limit=100MB` and `hoodie.parquet.max.file.size=120MB`, Hudi will pick all files < 100MB and try to get them upto 120MB. For **merge-on-read**, there are few more configs to set. MergeOnRead works differently for different INDEX choices. diff --git a/website/versioned_docs/version-0.12.2/file_sizing.md index e7935445d9e6d..58831e4b29959 100644 --- a/website/versioned_docs/version-0.12.2/file_sizing.md +++ b/website/versioned_docs/version-0.12.2/file_sizing.md @@ -21,7 +21,7 @@ and the [soft limit](/docs/configurations#hoodieparquetsmallfilelimit) below whi be considered a small file. For the initial bootstrap of a Hudi table, tuning record size estimate is also important to ensure sufficient records are bin-packed in a parquet file. For subsequent writes, Hudi automatically uses average record size based on previous commit. Hudi will try to add enough records to a small file at write time to get it to the -configured maximum limit. For e.g , with `compactionSmallFileSize=100MB` and limitFileSize=120MB, Hudi will pick all +configured maximum limit. For e.g , with `hoodie.parquet.small.file.limit=100MB` and `hoodie.parquet.max.file.size=120MB`, Hudi will pick all files < 100MB and try to get them upto 120MB. ### For Merge-On-Read diff --git a/website/versioned_docs/version-0.12.2/flink-quick-start-guide.md index 41fb1dc503a42..3d0944ccd2b6f 100644 --- a/website/versioned_docs/version-0.12.2/flink-quick-start-guide.md +++ b/website/versioned_docs/version-0.12.2/flink-quick-start-guide.md @@ -12,7 +12,7 @@ This guide helps you quickly start using Flink on Hudi, and learn different mode - **Quick Start** : Read [Quick Start](#quick-start) to get started quickly Flink sql client to write to(read from) Hudi. - **Configuration** : For [Global Configuration](/docs/0.12.2/flink_configuration#global-configurations), sets up through `$FLINK_HOME/conf/flink-conf.yaml`. For per job configuration, sets up through [Table Option](/docs/0.12.2/flink_configuration#table-options). - **Writing Data** : Flink supports different modes for writing, such as [CDC Ingestion](/docs/0.12.2/hoodie_deltastreamer#cdc-ingestion), [Bulk Insert](/docs/0.12.2/hoodie_deltastreamer#bulk-insert), [Index Bootstrap](/docs/0.12.2/hoodie_deltastreamer#index-bootstrap), [Changelog Mode](/docs/0.12.2/hoodie_deltastreamer#changelog-mode) and [Append Mode](/docs/0.12.2/hoodie_deltastreamer#append-mode). -- **Querying Data** : Flink supports different modes for reading, such as [Streaming Query](/docs/querying_data#streaming-query) and [Incremental Query](/docs/querying_data#incremental-query). +- **Querying Data** : Flink supports different modes for reading, such as [Streaming Query](querying_data#streaming-query) and [Incremental Query](/docs/querying_data#incremental-query). - **Tuning** : For write/read tasks, this guide gives some tuning suggestions, such as [Memory Optimization](/docs/0.12.2/flink_configuration#memory-optimization) and [Write Rate Limit](/docs/0.12.2/flink_configuration#write-rate-limit). - **Optimization**: Offline compaction is supported [Offline Compaction](/docs/compaction#flink-offline-compaction).
- **Query Engines**: Besides Flink, many other engines are integrated: [Hive Query](/docs/syncing_metastore#flink-setup), [Presto Query](/docs/0.12.2/query_engine_setup#prestodb). diff --git a/website/versioned_docs/version-0.12.2/querying_data.md b/website/versioned_docs/version-0.12.2/querying_data.md index fff64bc0bad2a..23d1835010a6d 100644 --- a/website/versioned_docs/version-0.12.2/querying_data.md +++ b/website/versioned_docs/version-0.12.2/querying_data.md @@ -17,7 +17,7 @@ In sections, below we will discuss specific setup to access different query type The Spark Datasource API is a popular way of authoring Spark ETL pipelines. Hudi tables can be queried via the Spark datasource with a simple `spark.read.parquet`. See the [Spark Quick Start](/docs/quick-start-guide) for more examples of Spark datasource reading queries. -To setup Spark for querying Hudi, see the [Query Engine Setup](/docs/0.12.2/query_engine_setup#Spark-DataSource) page. +To setup Spark for querying Hudi, see the [Query Engine Setup](/docs/0.12.2/query_engine_setup#spark) page. ### Snapshot query {#spark-snap-query} Retrieve the data table at the present point in time. diff --git a/website/versioned_docs/version-0.12.2/quick-start-guide.md b/website/versioned_docs/version-0.12.2/quick-start-guide.md index 0143f3b9f896a..aa108a75a50d8 100644 --- a/website/versioned_docs/version-0.12.2/quick-start-guide.md +++ b/website/versioned_docs/version-0.12.2/quick-start-guide.md @@ -1099,7 +1099,7 @@ For CoW tables, table services work in inline mode by default. For MoR tables, some async services are enabled by default. :::note -Since Hudi 0.11 Metadata Table is enabled by default. When using async table services with Metadata Table enabled you must use Optimistic Concurrency Control to avoid the risk of data loss (even in single writer scenario). See [Metadata Table deployment considerations](/docs/metadata#deployment-considerations) for detailed instructions. +Since Hudi 0.11 Metadata Table is enabled by default. When using async table services with Metadata Table enabled you must use Optimistic Concurrency Control to avoid the risk of data loss (even in single writer scenario). See [Metadata Table deployment considerations](metadata#deployment-considerations) for detailed instructions. If you're using Foreach or ForeachBatch streaming sink you must use inline table services, async table services are not supported. ::: diff --git a/website/versioned_docs/version-0.12.2/tuning-guide.md b/website/versioned_docs/version-0.12.2/tuning-guide.md index 4affeafda663d..12b68098e0600 100644 --- a/website/versioned_docs/version-0.12.2/tuning-guide.md +++ b/website/versioned_docs/version-0.12.2/tuning-guide.md @@ -17,7 +17,7 @@ Writing data via Hudi happens as a Spark job and thus general rules of spark deb **Spark Memory** : Typically, hudi needs to be able to read a single file into memory to perform merges or compactions and thus the executor memory should be sufficient to accomodate this. In addition, Hoodie caches the input to be able to intelligently place data and thus leaving some `spark.memory.storageFraction` will generally help boost performance. -**Sizing files**: Set `limitFileSize` above judiciously, to balance ingest/write latency vs number of files & consequently metadata overhead associated with it. +**Sizing files**: Set `hoodie.parquet.small.file.limit` above judiciously, to balance ingest/write latency vs number of files & consequently metadata overhead associated with it. 
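To make the file sizing knobs above concrete, here is a minimal sketch of a Spark datasource write that sets both the small file soft limit and the maximum base file size. It assumes a Spark shell with the Hudi bundle on the classpath; the table name, key/partition fields and base path are placeholders.

```scala
import org.apache.spark.sql.SaveMode

// Illustrative only: table name, field names and path are placeholders.
inputDF.write.format("hudi").
  option("hoodie.table.name", "my_cow_table").
  option("hoodie.datasource.write.recordkey.field", "uuid").
  option("hoodie.datasource.write.precombine.field", "ts").
  option("hoodie.datasource.write.partitionpath.field", "partitionpath").
  // Files below ~100MB are treated as small and padded with incoming records...
  option("hoodie.parquet.small.file.limit", (100 * 1024 * 1024).toString).
  // ...up to the target base file size of ~120MB.
  option("hoodie.parquet.max.file.size", (120 * 1024 * 1024).toString).
  mode(SaveMode.Append).
  save("/tmp/hudi/my_cow_table")
```

The two values work together: files below the small file limit are padded with incoming records until they approach the maximum file size, trading some ingest latency for fewer files.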
**Timeseries/Log data** : Default configs are tuned for database/nosql changelogs where individual record sizes are large. Another very popular class of data is timeseries/event/log data that tends to be more volumnious with lot more records per partition. In such cases consider tuning the bloom filter accuracy via `.bloomFilterFPP()/bloomFilterNumEntries()` to achieve your target index look up time. Also, consider making a key that is prefixed with time of the event, which will enable range pruning & significantly speeding up index lookup. diff --git a/website/versioned_docs/version-0.12.2/write_operations.md b/website/versioned_docs/version-0.12.2/write_operations.md index baa6d7dbf8483..9ff8431384cad 100644 --- a/website/versioned_docs/version-0.12.2/write_operations.md +++ b/website/versioned_docs/version-0.12.2/write_operations.md @@ -51,7 +51,7 @@ The following is an inside look on the Hudi write path and the sequence of event 6. Update [Index](/docs/indexing) 1. Now that the write is performed, we will go back and update the index. 7. Commit - 1. Finally we commit all of these changes atomically. (A [callback notification](/docs/writing_data#commit-notifications) is exposed) + 1. Finally we commit all of these changes atomically. (A [callback notification](writing_data#commit-notifications) is exposed) 8. [Clean](/docs/hoodie_cleaner) (if needed) 1. Following the commit, cleaning is invoked if needed. 9. [Compaction](/docs/compaction) diff --git a/website/versioned_docs/version-0.12.3/compaction.md b/website/versioned_docs/version-0.12.3/compaction.md index a6249b7ae7c48..e99cc2082c5fe 100644 --- a/website/versioned_docs/version-0.12.3/compaction.md +++ b/website/versioned_docs/version-0.12.3/compaction.md @@ -95,7 +95,7 @@ is enabled by default. ::: ### Hudi Compactor Utility -Hudi provides a standalone tool to execute specific compactions asynchronously. Below is an example and you can read more in the [deployment guide](/docs/deployment#compactions) +Hudi provides a standalone tool to execute specific compactions asynchronously. Below is an example and you can read more in the [deployment guide](/docs/cli#compactions) Example: ```properties diff --git a/website/versioned_docs/version-0.12.3/deployment.md b/website/versioned_docs/version-0.12.3/deployment.md index 998dafa23ed62..cd51d9c9cb5c5 100644 --- a/website/versioned_docs/version-0.12.3/deployment.md +++ b/website/versioned_docs/version-0.12.3/deployment.md @@ -135,7 +135,7 @@ Here is an example invocation for reading from kafka topic in a continuous mode ### Spark Datasource Writer Jobs -As described in [Writing Data](/docs/writing_data#spark-datasource-writer), you can use spark datasource to ingest to hudi table. This mechanism allows you to ingest any spark dataframe in Hudi format. Hudi Spark DataSource also supports spark streaming to ingest a streaming source to Hudi table. For Merge On Read table types, inline compaction is turned on by default which runs after every ingestion run. The compaction frequency can be changed by setting the property "hoodie.compact.inline.max.delta.commits". +As described in [Writing Data](writing_data#spark-datasource-writer), you can use spark datasource to ingest to hudi table. This mechanism allows you to ingest any spark dataframe in Hudi format. Hudi Spark DataSource also supports spark streaming to ingest a streaming source to Hudi table. For Merge On Read table types, inline compaction is turned on by default which runs after every ingestion run. 
The compaction frequency can be changed by setting the property "hoodie.compact.inline.max.delta.commits". Here is an example invocation using spark datasource diff --git a/website/versioned_docs/version-0.12.3/faq.md index 05b60c270c790..5d5aafa0ed15c 100644 --- a/website/versioned_docs/version-0.12.3/faq.md +++ b/website/versioned_docs/version-0.12.3/faq.md @@ -342,7 +342,7 @@ a) **Auto Size small files during ingestion**: This solution trades ingest/writi Hudi has the ability to maintain a configured target file size, when performing **upsert/insert** operations. (Note: **bulk_insert** operation does not provide this functionality and is designed as a simpler replacement for normal `spark.write.parquet` ) -For **copy-on-write**, this is as simple as configuring the [maximum size for a base/parquet file](https://hudi.apache.org/docs/configurations#hoodieparquetmaxfilesize) and the [soft limit](https://hudi.apache.org/docs/configurations#hoodieparquetsmallfilelimit) below which a file should be considered a small file. For the initial bootstrap to Hudi table, tuning record size estimate is also important to ensure sufficient records are bin-packed in a parquet file. For subsequent writes, Hudi automatically uses average record size based on previous commit. Hudi will try to add enough records to a small file at write time to get it to the configured maximum limit. For e.g , with `compactionSmallFileSize=100MB` and limitFileSize=120MB, Hudi will pick all files < 100MB and try to get them upto 120MB. +For **copy-on-write**, this is as simple as configuring the [maximum size for a base/parquet file](https://hudi.apache.org/docs/configurations#hoodieparquetmaxfilesize) and the [soft limit](https://hudi.apache.org/docs/configurations#hoodieparquetsmallfilelimit) below which a file should be considered a small file. For the initial bootstrap to Hudi table, tuning record size estimate is also important to ensure sufficient records are bin-packed in a parquet file. For subsequent writes, Hudi automatically uses average record size based on previous commit. Hudi will try to add enough records to a small file at write time to get it to the configured maximum limit. For e.g , with `hoodie.parquet.small.file.limit=100MB` and `hoodie.parquet.max.file.size=120MB`, Hudi will pick all files < 100MB and try to get them upto 120MB. For **merge-on-read**, there are few more configs to set. MergeOnRead works differently for different INDEX choices. diff --git a/website/versioned_docs/version-0.12.3/file_sizing.md index e7935445d9e6d..58831e4b29959 100644 --- a/website/versioned_docs/version-0.12.3/file_sizing.md +++ b/website/versioned_docs/version-0.12.3/file_sizing.md @@ -21,7 +21,7 @@ and the [soft limit](/docs/configurations#hoodieparquetsmallfilelimit) below whi be considered a small file. For the initial bootstrap of a Hudi table, tuning record size estimate is also important to ensure sufficient records are bin-packed in a parquet file. For subsequent writes, Hudi automatically uses average record size based on previous commit. Hudi will try to add enough records to a small file at write time to get it to the -configured maximum limit.
For e.g , with `compactionSmallFileSize=100MB` and limitFileSize=120MB, Hudi will pick all +configured maximum limit. For e.g , with `hoodie.parquet.small.file.limit=100MB` and `hoodie.parquet.max.file.size=120MB`, Hudi will pick all files < 100MB and try to get them upto 120MB. ### For Merge-On-Read diff --git a/website/versioned_docs/version-0.12.3/flink-quick-start-guide.md index afffd7f244e59..1795181452268 100644 --- a/website/versioned_docs/version-0.12.3/flink-quick-start-guide.md +++ b/website/versioned_docs/version-0.12.3/flink-quick-start-guide.md @@ -12,7 +12,7 @@ This guide helps you quickly start using Flink on Hudi, and learn different mode - **Quick Start** : Read [Quick Start](#quick-start) to get started quickly Flink sql client to write to(read from) Hudi. - **Configuration** : For [Global Configuration](/docs/0.12.3/flink_configuration#global-configurations), sets up through `$FLINK_HOME/conf/flink-conf.yaml`. For per job configuration, sets up through [Table Option](/docs/0.12.3/flink_configuration#table-options). - **Writing Data** : Flink supports different modes for writing, such as [CDC Ingestion](/docs/0.12.3/hoodie_deltastreamer#cdc-ingestion), [Bulk Insert](/docs/0.12.3/hoodie_deltastreamer#bulk-insert), [Index Bootstrap](/docs/0.12.3/hoodie_deltastreamer#index-bootstrap), [Changelog Mode](/docs/0.12.3/hoodie_deltastreamer#changelog-mode) and [Append Mode](/docs/0.12.3/hoodie_deltastreamer#append-mode). -- **Querying Data** : Flink supports different modes for reading, such as [Streaming Query](/docs/querying_data#streaming-query) and [Incremental Query](/docs/querying_data#incremental-query). +- **Querying Data** : Flink supports different modes for reading, such as [Streaming Query](querying_data#streaming-query) and [Incremental Query](/docs/querying_data#incremental-query). - **Tuning** : For write/read tasks, this guide gives some tuning suggestions, such as [Memory Optimization](/docs/0.12.3/flink_configuration#memory-optimization) and [Write Rate Limit](/docs/0.12.3/flink_configuration#write-rate-limit). - **Optimization**: Offline compaction is supported [Offline Compaction](/docs/compaction#flink-offline-compaction). - **Query Engines**: Besides Flink, many other engines are integrated: [Hive Query](/docs/syncing_metastore#flink-setup), [Presto Query](/docs/0.12.3/query_engine_setup#prestodb). diff --git a/website/versioned_docs/version-0.12.3/querying_data.md index ddd7b5ced1319..470fc4f5df9df 100644 --- a/website/versioned_docs/version-0.12.3/querying_data.md +++ b/website/versioned_docs/version-0.12.3/querying_data.md @@ -17,7 +17,7 @@ In sections, below we will discuss specific setup to access different query type The Spark Datasource API is a popular way of authoring Spark ETL pipelines. Hudi tables can be queried via the Spark datasource with a simple `spark.read.parquet`. See the [Spark Quick Start](/docs/quick-start-guide) for more examples of Spark datasource reading queries. -To setup Spark for querying Hudi, see the [Query Engine Setup](/docs/0.12.3/query_engine_setup#Spark-DataSource) page. +To setup Spark for querying Hudi, see the [Query Engine Setup](/docs/0.12.3/query_engine_setup#spark) page. ### Snapshot query {#spark-snap-query} Retrieve the data table at the present point in time.
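For reference, a snapshot query through the Spark datasource typically looks like the minimal sketch below, assuming an active `spark` session with the Hudi bundle on the classpath; the base path and view name are placeholders.

```scala
// Snapshot query: read the table as of the latest completed commit.
val basePath = "/tmp/hudi/my_table"
val snapshotDF = spark.read.format("hudi").load(basePath)

// Register a temporary view so the snapshot can be queried with SQL.
snapshotDF.createOrReplaceTempView("hudi_snapshot")
spark.sql("SELECT count(*) FROM hudi_snapshot").show()
```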
diff --git a/website/versioned_docs/version-0.12.3/quick-start-guide.md b/website/versioned_docs/version-0.12.3/quick-start-guide.md index 3a990aa74431b..67418ddd7cedc 100644 --- a/website/versioned_docs/version-0.12.3/quick-start-guide.md +++ b/website/versioned_docs/version-0.12.3/quick-start-guide.md @@ -1099,7 +1099,7 @@ For CoW tables, table services work in inline mode by default. For MoR tables, some async services are enabled by default. :::note -Since Hudi 0.11 Metadata Table is enabled by default. When using async table services with Metadata Table enabled you must use Optimistic Concurrency Control to avoid the risk of data loss (even in single writer scenario). See [Metadata Table deployment considerations](/docs/metadata#deployment-considerations) for detailed instructions. +Since Hudi 0.11 Metadata Table is enabled by default. When using async table services with Metadata Table enabled you must use Optimistic Concurrency Control to avoid the risk of data loss (even in single writer scenario). See [Metadata Table deployment considerations](metadata#deployment-considerations) for detailed instructions. If you're using Foreach or ForeachBatch streaming sink you must use inline table services, async table services are not supported. ::: diff --git a/website/versioned_docs/version-0.12.3/tuning-guide.md b/website/versioned_docs/version-0.12.3/tuning-guide.md index 4affeafda663d..12b68098e0600 100644 --- a/website/versioned_docs/version-0.12.3/tuning-guide.md +++ b/website/versioned_docs/version-0.12.3/tuning-guide.md @@ -17,7 +17,7 @@ Writing data via Hudi happens as a Spark job and thus general rules of spark deb **Spark Memory** : Typically, hudi needs to be able to read a single file into memory to perform merges or compactions and thus the executor memory should be sufficient to accomodate this. In addition, Hoodie caches the input to be able to intelligently place data and thus leaving some `spark.memory.storageFraction` will generally help boost performance. -**Sizing files**: Set `limitFileSize` above judiciously, to balance ingest/write latency vs number of files & consequently metadata overhead associated with it. +**Sizing files**: Set `hoodie.parquet.small.file.limit` above judiciously, to balance ingest/write latency vs number of files & consequently metadata overhead associated with it. **Timeseries/Log data** : Default configs are tuned for database/nosql changelogs where individual record sizes are large. Another very popular class of data is timeseries/event/log data that tends to be more volumnious with lot more records per partition. In such cases consider tuning the bloom filter accuracy via `.bloomFilterFPP()/bloomFilterNumEntries()` to achieve your target index look up time. Also, consider making a key that is prefixed with time of the event, which will enable range pruning & significantly speeding up index lookup. diff --git a/website/versioned_docs/version-0.12.3/write_operations.md b/website/versioned_docs/version-0.12.3/write_operations.md index baa6d7dbf8483..9ff8431384cad 100644 --- a/website/versioned_docs/version-0.12.3/write_operations.md +++ b/website/versioned_docs/version-0.12.3/write_operations.md @@ -51,7 +51,7 @@ The following is an inside look on the Hudi write path and the sequence of event 6. Update [Index](/docs/indexing) 1. Now that the write is performed, we will go back and update the index. 7. Commit - 1. Finally we commit all of these changes atomically. 
(A [callback notification](/docs/writing_data#commit-notifications) is exposed) + 1. Finally we commit all of these changes atomically. (A [callback notification](writing_data#commit-notifications) is exposed) 8. [Clean](/docs/hoodie_cleaner) (if needed) 1. Following the commit, cleaning is invoked if needed. 9. [Compaction](/docs/compaction) diff --git a/website/versioned_docs/version-0.13.0/compaction.md b/website/versioned_docs/version-0.13.0/compaction.md index a6249b7ae7c48..e99cc2082c5fe 100644 --- a/website/versioned_docs/version-0.13.0/compaction.md +++ b/website/versioned_docs/version-0.13.0/compaction.md @@ -95,7 +95,7 @@ is enabled by default. ::: ### Hudi Compactor Utility -Hudi provides a standalone tool to execute specific compactions asynchronously. Below is an example and you can read more in the [deployment guide](/docs/deployment#compactions) +Hudi provides a standalone tool to execute specific compactions asynchronously. Below is an example and you can read more in the [deployment guide](/docs/cli#compactions) Example: ```properties diff --git a/website/versioned_docs/version-0.13.0/deployment.md b/website/versioned_docs/version-0.13.0/deployment.md index 8ccc654b4f2ad..2837cc92d43da 100644 --- a/website/versioned_docs/version-0.13.0/deployment.md +++ b/website/versioned_docs/version-0.13.0/deployment.md @@ -135,7 +135,7 @@ Here is an example invocation for reading from kafka topic in a continuous mode ### Spark Datasource Writer Jobs -As described in [Writing Data](/docs/writing_data#spark-datasource-writer), you can use spark datasource to ingest to hudi table. This mechanism allows you to ingest any spark dataframe in Hudi format. Hudi Spark DataSource also supports spark streaming to ingest a streaming source to Hudi table. For Merge On Read table types, inline compaction is turned on by default which runs after every ingestion run. The compaction frequency can be changed by setting the property "hoodie.compact.inline.max.delta.commits". +As described in [Writing Data](writing_data#spark-datasource-writer), you can use spark datasource to ingest to hudi table. This mechanism allows you to ingest any spark dataframe in Hudi format. Hudi Spark DataSource also supports spark streaming to ingest a streaming source to Hudi table. For Merge On Read table types, inline compaction is turned on by default which runs after every ingestion run. The compaction frequency can be changed by setting the property "hoodie.compact.inline.max.delta.commits". Here is an example invocation using spark datasource diff --git a/website/versioned_docs/version-0.13.0/faq.md b/website/versioned_docs/version-0.13.0/faq.md index b0011e893621d..6daa604c7a840 100644 --- a/website/versioned_docs/version-0.13.0/faq.md +++ b/website/versioned_docs/version-0.13.0/faq.md @@ -342,7 +342,7 @@ a) **Auto Size small files during ingestion**: This solution trades ingest/writi Hudi has the ability to maintain a configured target file size, when performing **upsert/insert** operations. (Note: **bulk_insert** operation does not provide this functionality and is designed as a simpler replacement for normal `spark.write.parquet` ) -For **copy-on-write**, this is as simple as configuring the [maximum size for a base/parquet file](https://hudi.apache.org/docs/configurations#hoodieparquetmaxfilesize) and the [soft limit](https://hudi.apache.org/docs/configurations#hoodieparquetsmallfilelimit) below which a file should be considered a small file. 
For the initial bootstrap to Hudi table, tuning record size estimate is also important to ensure sufficient records are bin-packed in a parquet file. For subsequent writes, Hudi automatically uses average record size based on previous commit. Hudi will try to add enough records to a small file at write time to get it to the configured maximum limit. For e.g , with `compactionSmallFileSize=100MB` and limitFileSize=120MB, Hudi will pick all files < 100MB and try to get them upto 120MB. +For **copy-on-write**, this is as simple as configuring the [maximum size for a base/parquet file](https://hudi.apache.org/docs/configurations#hoodieparquetmaxfilesize) and the [soft limit](https://hudi.apache.org/docs/configurations#hoodieparquetsmallfilelimit) below which a file should be considered a small file. For the initial bootstrap to Hudi table, tuning record size estimate is also important to ensure sufficient records are bin-packed in a parquet file. For subsequent writes, Hudi automatically uses average record size based on previous commit. Hudi will try to add enough records to a small file at write time to get it to the configured maximum limit. For e.g , with `hoodie.parquet.small.file.limit=100MB` and `hoodie.parquet.max.file.size=120MB`, Hudi will pick all files < 100MB and try to get them upto 120MB. For **merge-on-read**, there are few more configs to set. MergeOnRead works differently for different INDEX choices. diff --git a/website/versioned_docs/version-0.13.0/file_sizing.md index e7935445d9e6d..58831e4b29959 100644 --- a/website/versioned_docs/version-0.13.0/file_sizing.md +++ b/website/versioned_docs/version-0.13.0/file_sizing.md @@ -21,7 +21,7 @@ and the [soft limit](/docs/configurations#hoodieparquetsmallfilelimit) below whi be considered a small file. For the initial bootstrap of a Hudi table, tuning record size estimate is also important to ensure sufficient records are bin-packed in a parquet file. For subsequent writes, Hudi automatically uses average record size based on previous commit. Hudi will try to add enough records to a small file at write time to get it to the -configured maximum limit. For e.g , with `compactionSmallFileSize=100MB` and limitFileSize=120MB, Hudi will pick all +configured maximum limit. For e.g , with `hoodie.parquet.small.file.limit=100MB` and `hoodie.parquet.max.file.size=120MB`, Hudi will pick all files < 100MB and try to get them upto 120MB. ### For Merge-On-Read diff --git a/website/versioned_docs/version-0.13.0/flink-quick-start-guide.md index f9f91a4c1e4dc..8cae9919bc061 100644 --- a/website/versioned_docs/version-0.13.0/flink-quick-start-guide.md +++ b/website/versioned_docs/version-0.13.0/flink-quick-start-guide.md @@ -12,7 +12,7 @@ This guide helps you quickly start using Flink on Hudi, and learn different mode - **Quick Start** : Read [Quick Start](#quick-start) to get started quickly Flink sql client to write to(read from) Hudi. - **Configuration** : For [Global Configuration](/docs/0.13.0/flink_configuration#global-configurations), sets up through `$FLINK_HOME/conf/flink-conf.yaml`. For per job configuration, sets up through [Table Option](/docs/0.13.0/flink_configuration#table-options).
- **Writing Data** : Flink supports different modes for writing, such as [CDC Ingestion](/docs/0.13.0/hoodie_deltastreamer#cdc-ingestion), [Bulk Insert](/docs/0.13.0/hoodie_deltastreamer#bulk-insert), [Index Bootstrap](/docs/0.13.0/hoodie_deltastreamer#index-bootstrap), [Changelog Mode](/docs/0.13.0/hoodie_deltastreamer#changelog-mode) and [Append Mode](/docs/0.13.0/hoodie_deltastreamer#append-mode). -- **Querying Data** : Flink supports different modes for reading, such as [Streaming Query](/docs/querying_data#streaming-query) and [Incremental Query](/docs/querying_data#incremental-query). +- **Querying Data** : Flink supports different modes for reading, such as [Streaming Query](querying_data#streaming-query) and [Incremental Query](/docs/querying_data#incremental-query). - **Tuning** : For write/read tasks, this guide gives some tuning suggestions, such as [Memory Optimization](/docs/0.13.0/flink_configuration#memory-optimization) and [Write Rate Limit](/docs/0.13.0/flink_configuration#write-rate-limit). - **Optimization**: Offline compaction is supported [Offline Compaction](/docs/compaction#flink-offline-compaction). - **Query Engines**: Besides Flink, many other engines are integrated: [Hive Query](/docs/syncing_metastore#flink-setup), [Presto Query](/docs/0.13.0/query_engine_setup#prestodb). diff --git a/website/versioned_docs/version-0.13.0/querying_data.md b/website/versioned_docs/version-0.13.0/querying_data.md index b62b5e9f63e3d..d95f1b4f71f54 100644 --- a/website/versioned_docs/version-0.13.0/querying_data.md +++ b/website/versioned_docs/version-0.13.0/querying_data.md @@ -17,7 +17,7 @@ In sections, below we will discuss specific setup to access different query type The Spark Datasource API is a popular way of authoring Spark ETL pipelines. Hudi tables can be queried via the Spark datasource with a simple `spark.read.parquet`. See the [Spark Quick Start](/docs/quick-start-guide) for more examples of Spark datasource reading queries. -To setup Spark for querying Hudi, see the [Query Engine Setup](/docs/0.13.0/query_engine_setup#Spark-DataSource) page. +To setup Spark for querying Hudi, see the [Query Engine Setup](/docs/0.13.0/query_engine_setup#spark) page. ### Snapshot query {#spark-snap-query} Retrieve the data table at the present point in time. diff --git a/website/versioned_docs/version-0.13.0/quick-start-guide.md b/website/versioned_docs/version-0.13.0/quick-start-guide.md index d4b55283e3f25..839747cfbaa55 100644 --- a/website/versioned_docs/version-0.13.0/quick-start-guide.md +++ b/website/versioned_docs/version-0.13.0/quick-start-guide.md @@ -1103,7 +1103,7 @@ For CoW tables, table services work in inline mode by default. For MoR tables, some async services are enabled by default. :::note -Since Hudi 0.11 Metadata Table is enabled by default. When using async table services with Metadata Table enabled you must use Optimistic Concurrency Control to avoid the risk of data loss (even in single writer scenario). See [Metadata Table deployment considerations](/docs/metadata#deployment-considerations) for detailed instructions. +Since Hudi 0.11 Metadata Table is enabled by default. When using async table services with Metadata Table enabled you must use Optimistic Concurrency Control to avoid the risk of data loss (even in single writer scenario). See [Metadata Table deployment considerations](metadata#deployment-considerations) for detailed instructions. 
If you're using Foreach or ForeachBatch streaming sink you must use inline table services, async table services are not supported. ::: diff --git a/website/versioned_docs/version-0.13.0/tuning-guide.md b/website/versioned_docs/version-0.13.0/tuning-guide.md index 4affeafda663d..12b68098e0600 100644 --- a/website/versioned_docs/version-0.13.0/tuning-guide.md +++ b/website/versioned_docs/version-0.13.0/tuning-guide.md @@ -17,7 +17,7 @@ Writing data via Hudi happens as a Spark job and thus general rules of spark deb **Spark Memory** : Typically, hudi needs to be able to read a single file into memory to perform merges or compactions and thus the executor memory should be sufficient to accomodate this. In addition, Hoodie caches the input to be able to intelligently place data and thus leaving some `spark.memory.storageFraction` will generally help boost performance. -**Sizing files**: Set `limitFileSize` above judiciously, to balance ingest/write latency vs number of files & consequently metadata overhead associated with it. +**Sizing files**: Set `hoodie.parquet.small.file.limit` above judiciously, to balance ingest/write latency vs number of files & consequently metadata overhead associated with it. **Timeseries/Log data** : Default configs are tuned for database/nosql changelogs where individual record sizes are large. Another very popular class of data is timeseries/event/log data that tends to be more volumnious with lot more records per partition. In such cases consider tuning the bloom filter accuracy via `.bloomFilterFPP()/bloomFilterNumEntries()` to achieve your target index look up time. Also, consider making a key that is prefixed with time of the event, which will enable range pruning & significantly speeding up index lookup. diff --git a/website/versioned_docs/version-0.13.0/write_operations.md b/website/versioned_docs/version-0.13.0/write_operations.md index baa6d7dbf8483..9ff8431384cad 100644 --- a/website/versioned_docs/version-0.13.0/write_operations.md +++ b/website/versioned_docs/version-0.13.0/write_operations.md @@ -51,7 +51,7 @@ The following is an inside look on the Hudi write path and the sequence of event 6. Update [Index](/docs/indexing) 1. Now that the write is performed, we will go back and update the index. 7. Commit - 1. Finally we commit all of these changes atomically. (A [callback notification](/docs/writing_data#commit-notifications) is exposed) + 1. Finally we commit all of these changes atomically. (A [callback notification](writing_data#commit-notifications) is exposed) 8. [Clean](/docs/hoodie_cleaner) (if needed) 1. Following the commit, cleaning is invoked if needed. 9. [Compaction](/docs/compaction) diff --git a/website/versioned_docs/version-0.13.1/compaction.md b/website/versioned_docs/version-0.13.1/compaction.md index a6249b7ae7c48..e99cc2082c5fe 100644 --- a/website/versioned_docs/version-0.13.1/compaction.md +++ b/website/versioned_docs/version-0.13.1/compaction.md @@ -95,7 +95,7 @@ is enabled by default. ::: ### Hudi Compactor Utility -Hudi provides a standalone tool to execute specific compactions asynchronously. Below is an example and you can read more in the [deployment guide](/docs/deployment#compactions) +Hudi provides a standalone tool to execute specific compactions asynchronously. 
Below is an example and you can read more in the [deployment guide](/docs/cli#compactions) Example: ```properties diff --git a/website/versioned_docs/version-0.13.1/deployment.md b/website/versioned_docs/version-0.13.1/deployment.md index 7554cbfa85094..3a90bd9bcaa4d 100644 --- a/website/versioned_docs/version-0.13.1/deployment.md +++ b/website/versioned_docs/version-0.13.1/deployment.md @@ -135,7 +135,7 @@ Here is an example invocation for reading from kafka topic in a continuous mode ### Spark Datasource Writer Jobs -As described in [Writing Data](/docs/writing_data#spark-datasource-writer), you can use spark datasource to ingest to hudi table. This mechanism allows you to ingest any spark dataframe in Hudi format. Hudi Spark DataSource also supports spark streaming to ingest a streaming source to Hudi table. For Merge On Read table types, inline compaction is turned on by default which runs after every ingestion run. The compaction frequency can be changed by setting the property "hoodie.compact.inline.max.delta.commits". +As described in [Writing Data](writing_data#spark-datasource-writer), you can use spark datasource to ingest to hudi table. This mechanism allows you to ingest any spark dataframe in Hudi format. Hudi Spark DataSource also supports spark streaming to ingest a streaming source to Hudi table. For Merge On Read table types, inline compaction is turned on by default which runs after every ingestion run. The compaction frequency can be changed by setting the property "hoodie.compact.inline.max.delta.commits". Here is an example invocation using spark datasource diff --git a/website/versioned_docs/version-0.13.1/faq.md b/website/versioned_docs/version-0.13.1/faq.md index bd6ba91094c24..40cdd44df9726 100644 --- a/website/versioned_docs/version-0.13.1/faq.md +++ b/website/versioned_docs/version-0.13.1/faq.md @@ -342,7 +342,7 @@ a) **Auto Size small files during ingestion**: This solution trades ingest/writi Hudi has the ability to maintain a configured target file size, when performing **upsert/insert** operations. (Note: **bulk_insert** operation does not provide this functionality and is designed as a simpler replacement for normal `spark.write.parquet` ) -For **copy-on-write**, this is as simple as configuring the [maximum size for a base/parquet file](https://hudi.apache.org/docs/configurations#hoodieparquetmaxfilesize) and the [soft limit](https://hudi.apache.org/docs/configurations#hoodieparquetsmallfilelimit) below which a file should be considered a small file. For the initial bootstrap to Hudi table, tuning record size estimate is also important to ensure sufficient records are bin-packed in a parquet file. For subsequent writes, Hudi automatically uses average record size based on previous commit. Hudi will try to add enough records to a small file at write time to get it to the configured maximum limit. For e.g , with `compactionSmallFileSize=100MB` and limitFileSize=120MB, Hudi will pick all files < 100MB and try to get them upto 120MB. +For **copy-on-write**, this is as simple as configuring the [maximum size for a base/parquet file](https://hudi.apache.org/docs/configurations#hoodieparquetmaxfilesize) and the [soft limit](https://hudi.apache.org/docs/configurations#hoodieparquetsmallfilelimit) below which a file should be considered a small file. For the initial bootstrap to Hudi table, tuning record size estimate is also important to ensure sufficient records are bin-packed in a parquet file. 
For subsequent writes, Hudi automatically uses average record size based on previous commit. Hudi will try to add enough records to a small file at write time to get it to the configured maximum limit. For e.g , with `hoodie.parquet.small.file.limit=100MB` and `hoodie.parquet.max.file.size=120MB`, Hudi will pick all files < 100MB and try to get them upto 120MB. For **merge-on-read**, there are few more configs to set. MergeOnRead works differently for different INDEX choices. diff --git a/website/versioned_docs/version-0.13.1/file_sizing.md index e7935445d9e6d..58831e4b29959 100644 --- a/website/versioned_docs/version-0.13.1/file_sizing.md +++ b/website/versioned_docs/version-0.13.1/file_sizing.md @@ -21,7 +21,7 @@ and the [soft limit](/docs/configurations#hoodieparquetsmallfilelimit) below whi be considered a small file. For the initial bootstrap of a Hudi table, tuning record size estimate is also important to ensure sufficient records are bin-packed in a parquet file. For subsequent writes, Hudi automatically uses average record size based on previous commit. Hudi will try to add enough records to a small file at write time to get it to the -configured maximum limit. For e.g , with `compactionSmallFileSize=100MB` and limitFileSize=120MB, Hudi will pick all +configured maximum limit. For e.g , with `hoodie.parquet.small.file.limit=100MB` and `hoodie.parquet.max.file.size=120MB`, Hudi will pick all files < 100MB and try to get them upto 120MB. ### For Merge-On-Read diff --git a/website/versioned_docs/version-0.13.1/flink-quick-start-guide.md index 62c5671b1a1e0..e30598a9e2309 100644 --- a/website/versioned_docs/version-0.13.1/flink-quick-start-guide.md +++ b/website/versioned_docs/version-0.13.1/flink-quick-start-guide.md @@ -12,11 +12,11 @@ This guide helps you quickly start using Flink on Hudi, and learn different mode - **Quick Start** : Read [Quick Start](#quick-start) to get started quickly Flink sql client to write to(read from) Hudi. - **Configuration** : For [Global Configuration](/docs/0.13.1/flink_configuration#global-configurations), sets up through `$FLINK_HOME/conf/flink-conf.yaml`. For per job configuration, sets up through [Table Option](/docs/0.13.1/flink_configuration#table-options). - **Writing Data** : Flink supports different modes for writing, such as [CDC Ingestion](/docs/0.13.1/hoodie_deltastreamer#cdc-ingestion), [Bulk Insert](/docs/0.13.1/hoodie_deltastreamer#bulk-insert), [Index Bootstrap](/docs/0.13.1/hoodie_deltastreamer#index-bootstrap), [Changelog Mode](/docs/0.13.1/hoodie_deltastreamer#changelog-mode) and [Append Mode](/docs/0.13.1/hoodie_deltastreamer#append-mode). -- **Querying Data** : Flink supports different modes for reading, such as [Streaming Query](/docs/querying_data#streaming-query) and [Incremental Query](/docs/querying_data#incremental-query). +- **Querying Data** : Flink supports different modes for reading, such as [Streaming Query](querying_data#streaming-query) and [Incremental Query](/docs/querying_data#incremental-query). - **Tuning** : For write/read tasks, this guide gives some tuning suggestions, such as [Memory Optimization](/docs/0.13.1/flink_configuration#memory-optimization) and [Write Rate Limit](/docs/0.13.1/flink_configuration#write-rate-limit). - **Optimization**: Offline compaction is supported [Offline Compaction](/docs/compaction#flink-offline-compaction).
-- **Query Engines**: Besides Flink, many other engines are integrated: [Hive Query](/docs/syncing_metastore#flink-setup), [Presto Query](/docs/querying_data#prestodb). -- **Catalog**: A Hudi specific catalog is supported: [Hudi Catalog](/docs/querying_data/#hudi-catalog). +- **Query Engines**: Besides Flink, many other engines are integrated: [Hive Query](/docs/syncing_metastore#flink-setup), [Presto Query](querying_data#prestodb). +- **Catalog**: A Hudi specific catalog is supported: [Hudi Catalog](querying_data/#hudi-catalog). ## Quick Start diff --git a/website/versioned_docs/version-0.13.1/quick-start-guide.md b/website/versioned_docs/version-0.13.1/quick-start-guide.md index acba285387860..297b05d20bd7a 100644 --- a/website/versioned_docs/version-0.13.1/quick-start-guide.md +++ b/website/versioned_docs/version-0.13.1/quick-start-guide.md @@ -1103,7 +1103,7 @@ For CoW tables, table services work in inline mode by default. For MoR tables, some async services are enabled by default. :::note -Since Hudi 0.11 Metadata Table is enabled by default. When using async table services with Metadata Table enabled you must use Optimistic Concurrency Control to avoid the risk of data loss (even in single writer scenario). See [Metadata Table deployment considerations](/docs/metadata#deployment-considerations) for detailed instructions. +Since Hudi 0.11 Metadata Table is enabled by default. When using async table services with Metadata Table enabled you must use Optimistic Concurrency Control to avoid the risk of data loss (even in single writer scenario). See [Metadata Table deployment considerations](metadata#deployment-considerations) for detailed instructions. If you're using Foreach or ForeachBatch streaming sink you must use inline table services, async table services are not supported. ::: diff --git a/website/versioned_docs/version-0.13.1/record_payload.md b/website/versioned_docs/version-0.13.1/record_payload.md index 48c3f0e6b79da..750e198586315 100644 --- a/website/versioned_docs/version-0.13.1/record_payload.md +++ b/website/versioned_docs/version-0.13.1/record_payload.md @@ -139,5 +139,5 @@ Amazon Database Migration Service onto S3. Record payloads are tunable to suit many use cases. Please check out the configurations listed [here](/docs/configurations#RECORD_PAYLOAD). Moreover, if users want to implement their own custom merge logic, please check -out [this FAQ](/docs/faq/#can-i-implement-my-own-logic-for-how-input-records-are-merged-with-record-on-storage). In a +out [this FAQ](faq/#can-i-implement-my-own-logic-for-how-input-records-are-merged-with-record-on-storage). In a separate document, we will talk about a new record merger API for optimized payload handling. diff --git a/website/versioned_docs/version-0.13.1/tuning-guide.md b/website/versioned_docs/version-0.13.1/tuning-guide.md index 4affeafda663d..12b68098e0600 100644 --- a/website/versioned_docs/version-0.13.1/tuning-guide.md +++ b/website/versioned_docs/version-0.13.1/tuning-guide.md @@ -17,7 +17,7 @@ Writing data via Hudi happens as a Spark job and thus general rules of spark deb **Spark Memory** : Typically, hudi needs to be able to read a single file into memory to perform merges or compactions and thus the executor memory should be sufficient to accomodate this. In addition, Hoodie caches the input to be able to intelligently place data and thus leaving some `spark.memory.storageFraction` will generally help boost performance. 
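As a rough illustration of the memory guidance above, the sketch below shows where executor memory and `spark.memory.storageFraction` would be set when building a session. The values are placeholders, not recommendations, and in practice these settings are usually passed to `spark-submit` instead.

```scala
import org.apache.spark.sql.SparkSession

// Illustrative values only; tune to the size of the largest file group being merged.
val spark = SparkSession.builder().
  appName("hudi-writer").
  config("spark.executor.memory", "8g").          // enough headroom to merge a base file in memory
  config("spark.memory.fraction", "0.6").         // unified memory pool (execution + storage)
  config("spark.memory.storageFraction", "0.5").  // keep some storage memory so Hudi can cache its input
  getOrCreate()
```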
-**Sizing files**: Set `limitFileSize` above judiciously, to balance ingest/write latency vs number of files & consequently metadata overhead associated with it. +**Sizing files**: Set `hoodie.parquet.small.file.limit` above judiciously, to balance ingest/write latency vs number of files & consequently metadata overhead associated with it. **Timeseries/Log data** : Default configs are tuned for database/nosql changelogs where individual record sizes are large. Another very popular class of data is timeseries/event/log data that tends to be more volumnious with lot more records per partition. In such cases consider tuning the bloom filter accuracy via `.bloomFilterFPP()/bloomFilterNumEntries()` to achieve your target index look up time. Also, consider making a key that is prefixed with time of the event, which will enable range pruning & significantly speeding up index lookup. diff --git a/website/versioned_docs/version-0.13.1/write_operations.md b/website/versioned_docs/version-0.13.1/write_operations.md index baa6d7dbf8483..9ff8431384cad 100644 --- a/website/versioned_docs/version-0.13.1/write_operations.md +++ b/website/versioned_docs/version-0.13.1/write_operations.md @@ -51,7 +51,7 @@ The following is an inside look on the Hudi write path and the sequence of event 6. Update [Index](/docs/indexing) 1. Now that the write is performed, we will go back and update the index. 7. Commit - 1. Finally we commit all of these changes atomically. (A [callback notification](/docs/writing_data#commit-notifications) is exposed) + 1. Finally we commit all of these changes atomically. (A [callback notification](writing_data#commit-notifications) is exposed) 8. [Clean](/docs/hoodie_cleaner) (if needed) 1. Following the commit, cleaning is invoked if needed. 9. [Compaction](/docs/compaction) diff --git a/website/versioned_docs/version-0.14.0/deployment.md b/website/versioned_docs/version-0.14.0/deployment.md index f5ad89c7f8170..b400e413a7d3b 100644 --- a/website/versioned_docs/version-0.14.0/deployment.md +++ b/website/versioned_docs/version-0.14.0/deployment.md @@ -136,7 +136,7 @@ Here is an example invocation for reading from kafka topic in a continuous mode ### Spark Datasource Writer Jobs -As described in [Writing Data](/docs/writing_data#spark-datasource-writer), you can use spark datasource to ingest to hudi table. This mechanism allows you to ingest any spark dataframe in Hudi format. Hudi Spark DataSource also supports spark streaming to ingest a streaming source to Hudi table. For Merge On Read table types, inline compaction is turned on by default which runs after every ingestion run. The compaction frequency can be changed by setting the property "hoodie.compact.inline.max.delta.commits". +As described in [Writing Data](writing_data#spark-datasource-writer), you can use spark datasource to ingest to hudi table. This mechanism allows you to ingest any spark dataframe in Hudi format. Hudi Spark DataSource also supports spark streaming to ingest a streaming source to Hudi table. For Merge On Read table types, inline compaction is turned on by default which runs after every ingestion run. The compaction frequency can be changed by setting the property "hoodie.compact.inline.max.delta.commits". 
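A minimal sketch of how that compaction frequency property might be set on a MERGE_ON_READ datasource write is shown below; the table name, field names and path are placeholders.

```scala
import org.apache.spark.sql.SaveMode

// Placeholder table/fields/path; the compaction-related keys are the point here.
inputDF.write.format("hudi").
  option("hoodie.table.name", "my_mor_table").
  option("hoodie.datasource.write.table.type", "MERGE_ON_READ").
  option("hoodie.datasource.write.recordkey.field", "uuid").
  option("hoodie.datasource.write.precombine.field", "ts").
  // Inline compaction kicks in once this many delta commits have accumulated.
  option("hoodie.compact.inline", "true").
  option("hoodie.compact.inline.max.delta.commits", "4").
  mode(SaveMode.Append).
  save("/tmp/hudi/my_mor_table")
```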
Here is an example invocation using spark datasource diff --git a/website/versioned_docs/version-0.14.0/faq.md b/website/versioned_docs/version-0.14.0/faq.md index c64c63bda988f..74bf66d3ae7ab 100644 --- a/website/versioned_docs/version-0.14.0/faq.md +++ b/website/versioned_docs/version-0.14.0/faq.md @@ -162,7 +162,7 @@ Further - Hudi’s commit time can be a logical time and need not strictly be a ### What are some ways to write a Hudi table? -Typically, you obtain a set of partial updates/inserts from your source and issue [write operations](https://hudi.apache.org/docs/write_operations/) against a Hudi table. If you ingesting data from any of the standard sources like Kafka, or tailing DFS, the [delta streamer](https://hudi.apache.org/docs/hoodie_streaming_ingestion#deltastreamer) tool is invaluable and provides an easy, self-managed solution to getting data written into Hudi. You can also write your own code to capture data from a custom source using the Spark datasource API and use a [Hudi datasource](https://hudi.apache.org/docs/writing_data/#spark-datasource-writer) to write into Hudi. +Typically, you obtain a set of partial updates/inserts from your source and issue [write operations](https://hudi.apache.org/docs/write_operations/) against a Hudi table. If you ingesting data from any of the standard sources like Kafka, or tailing DFS, the [delta streamer](https://hudi.apache.org/docs/0.14.0/hoodie_streaming_ingestion#hudi-streamer) tool is invaluable and provides an easy, self-managed solution to getting data written into Hudi. You can also write your own code to capture data from a custom source using the Spark datasource API and use a [Hudi datasource](https://hudi.apache.org/docs/writing_data/#spark-datasource-writer) to write into Hudi. ### How is a Hudi writer job deployed? @@ -303,7 +303,7 @@ a) **Auto Size small files during ingestion**: This solution trades ingest/writi Hudi has the ability to maintain a configured target file size, when performing **upsert/insert** operations. (Note: **bulk\_insert** operation does not provide this functionality and is designed as a simpler replacement for normal `spark.write.parquet` ) -For **copy-on-write**, this is as simple as configuring the [maximum size for a base/parquet file](https://hudi.apache.org/docs/configurations#hoodieparquetmaxfilesize) and the [soft limit](https://hudi.apache.org/docs/configurations#hoodieparquetsmallfilelimit) below which a file should be considered a small file. For the initial bootstrap to Hudi table, tuning record size estimate is also important to ensure sufficient records are bin-packed in a parquet file. For subsequent writes, Hudi automatically uses average record size based on previous commit. Hudi will try to add enough records to a small file at write time to get it to the configured maximum limit. For e.g , with `compactionSmallFileSize=100MB` and limitFileSize=120MB, Hudi will pick all files < 100MB and try to get them upto 120MB. +For **copy-on-write**, this is as simple as configuring the [maximum size for a base/parquet file](https://hudi.apache.org/docs/configurations#hoodieparquetmaxfilesize) and the [soft limit](https://hudi.apache.org/docs/configurations#hoodieparquetsmallfilelimit) below which a file should be considered a small file. For the initial bootstrap to Hudi table, tuning record size estimate is also important to ensure sufficient records are bin-packed in a parquet file. For subsequent writes, Hudi automatically uses average record size based on previous commit. 
Hudi will try to add enough records to a small file at write time to get it to the configured maximum limit. For e.g , with `hoodie.parquet.small.file.limit=100MB` and `hoodie.parquet.max.file.size=120MB`, Hudi will pick all files < 100MB and try to get them upto 120MB. For **merge-on-read**, there are few more configs to set. MergeOnRead works differently for different INDEX choices. diff --git a/website/versioned_docs/version-0.14.0/flink-quick-start-guide.md index 54afa766a19b3..c64f84a4e6023 100644 --- a/website/versioned_docs/version-0.14.0/flink-quick-start-guide.md +++ b/website/versioned_docs/version-0.14.0/flink-quick-start-guide.md @@ -11,11 +11,11 @@ This guide helps you quickly start using Flink on Hudi, and learn different mode - **Quick Start** : Read [Quick Start](#quick-start) to get started quickly Flink sql client to write to(read from) Hudi. - **Configuration** : For [Global Configuration](/docs/flink_tuning#global-configurations), sets up through `$FLINK_HOME/conf/flink-conf.yaml`. For per job configuration, sets up through [Table Option](/docs/flink_tuning#table-options). -- **Writing Data** : Flink supports different modes for writing, such as [CDC Ingestion](/docs/hoodie_streaming_ingestion#cdc-ingestion), [Bulk Insert](/docs/hoodie_streaming_ingestion#bulk-insert), [Index Bootstrap](/docs/hoodie_streaming_ingestion#index-bootstrap), [Changelog Mode](/docs/hoodie_streaming_ingestion#changelog-mode) and [Append Mode](/docs/hoodie_streaming_ingestion#append-mode). -- **Querying Data** : Flink supports different modes for reading, such as [Streaming Query](/docs/querying_data#streaming-query) and [Incremental Query](/docs/querying_data#incremental-query). +- **Writing Data** : Flink supports different modes for writing, such as [CDC Ingestion](hoodie_streaming_ingestion#cdc-ingestion), [Bulk Insert](hoodie_streaming_ingestion#bulk-insert), [Index Bootstrap](hoodie_streaming_ingestion#index-bootstrap), [Changelog Mode](hoodie_streaming_ingestion#changelog-mode) and [Append Mode](hoodie_streaming_ingestion#append-mode). +- **Querying Data** : Flink supports different modes for reading, such as [Streaming Query](sql_queries#streaming-query) and [Incremental Query](/docs/querying_data#incremental-query). - **Tuning** : For write/read tasks, this guide gives some tuning suggestions, such as [Memory Optimization](/docs/flink_tuning#memory-optimization) and [Write Rate Limit](/docs/flink_tuning#write-rate-limit). - **Optimization**: Offline compaction is supported [Offline Compaction](/docs/compaction#flink-offline-compaction). -- **Query Engines**: Besides Flink, many other engines are integrated: [Hive Query](/docs/syncing_metastore#flink-setup), [Presto Query](/docs/querying_data#prestodb). +- **Query Engines**: Besides Flink, many other engines are integrated: [Hive Query](/docs/syncing_metastore#flink-setup), [Presto Query](sql_queries#presto). - **Catalog**: A Hudi specific catalog is supported: [Hudi Catalog](/docs/sql_ddl/#create-catalog). ## Quick Start diff --git a/website/versioned_docs/version-0.14.0/quick-start-guide.md index 1512bab6e0503..32c89b2afe075 100644 --- a/website/versioned_docs/version-0.14.0/quick-start-guide.md +++ b/website/versioned_docs/version-0.14.0/quick-start-guide.md @@ -1123,9 +1123,9 @@ Hudi provides industry-leading performance and functionality for streaming data.
from various different sources in a streaming manner, with powerful built-in capabilities like auto checkpointing, schema enforcement via schema provider, transformation support, automatic table services and so on. -**Structured Streaming** - Hudi supports Spark Structured Streaming reads and writes as well. Please see [here](/docs/hoodie_streaming_ingestion#structured-streaming) for more. +**Structured Streaming** - Hudi supports Spark Structured Streaming reads and writes as well. Please see [here](hoodie_streaming_ingestion#structured-streaming) for more. -Check out more information on [modeling data in Hudi](/docs/faq#how-do-i-model-the-data-stored-in-hudi) and different ways to [writing Hudi Tables](/docs/writing_data). +Check out more information on [modeling data in Hudi](faq#how-do-i-model-the-data-stored-in-hudi) and different ways to [writing Hudi Tables](/docs/writing_data). ### Dockerized Demo Even as we showcased the core capabilities, Hudi supports a lot more advanced functionality that can make it easy diff --git a/website/versioned_docs/version-0.14.0/record_payload.md b/website/versioned_docs/version-0.14.0/record_payload.md index 1ed47b2ca9676..fb63c8f52939a 100644 --- a/website/versioned_docs/version-0.14.0/record_payload.md +++ b/website/versioned_docs/version-0.14.0/record_payload.md @@ -172,5 +172,5 @@ provides support for applying changes captured via Amazon Database Migration Ser Record payloads are tunable to suit many use cases. Please check out the configurations listed [here](/docs/configurations#RECORD_PAYLOAD). Moreover, if users want to implement their own custom merge logic, -please check out [this FAQ](/docs/faq/#can-i-implement-my-own-logic-for-how-input-records-are-merged-with-record-on-storage). In a +please check out [this FAQ](faq/#can-i-implement-my-own-logic-for-how-input-records-are-merged-with-record-on-storage). In a separate document, we will talk about a new record merger API for optimized payload handling. diff --git a/website/versioned_docs/version-0.14.0/write_operations.md b/website/versioned_docs/version-0.14.0/write_operations.md index abd1bdb66db7d..29132a38f5a6c 100644 --- a/website/versioned_docs/version-0.14.0/write_operations.md +++ b/website/versioned_docs/version-0.14.0/write_operations.md @@ -100,7 +100,7 @@ The following is an inside look on the Hudi write path and the sequence of event 6. Update [Index](/docs/indexing) 1. Now that the write is performed, we will go back and update the index. 7. Commit - 1. Finally we commit all of these changes atomically. (A [callback notification](/docs/writing_data#commit-notifications) is exposed) + 1. Finally we commit all of these changes atomically. (A [callback notification](writing_data#commit-notifications) is exposed) 8. [Clean](/docs/hoodie_cleaner) (if needed) 1. Following the commit, cleaning is invoked if needed. 9. [Compaction](/docs/compaction) diff --git a/website/versioned_docs/version-0.14.1/cli.md b/website/versioned_docs/version-0.14.1/cli.md index 1c30b9b6fa6e0..7cc4cdd92b0c2 100644 --- a/website/versioned_docs/version-0.14.1/cli.md +++ b/website/versioned_docs/version-0.14.1/cli.md @@ -578,7 +578,7 @@ Compaction successfully repaired ### Savepoint and Restore As the name suggest, "savepoint" saves the table as of the commit time, so that it lets you restore the table to this -savepoint at a later point in time if need be. You can read more about savepoints and restore [here](/docs/next/disaster_recovery) +savepoint at a later point in time if need be. 
You can read more about savepoints and restore [here](disaster_recovery) To trigger savepoint for a hudi table ```java diff --git a/website/versioned_docs/version-0.14.1/concurrency_control.md b/website/versioned_docs/version-0.14.1/concurrency_control.md index dd4e217829e25..3efcc04924946 100644 --- a/website/versioned_docs/version-0.14.1/concurrency_control.md +++ b/website/versioned_docs/version-0.14.1/concurrency_control.md @@ -77,7 +77,7 @@ Multiple writers can operate on the table with non-blocking conflict resolution. file group with the conflicts resolved automatically by the query reader and the compactor. The new concurrency mode is currently available for preview in version 1.0.0-beta only with the caveat that conflict resolution is not supported yet between clustering and ingestion. It works for compaction and ingestion, and we can see an example of that with Flink -writers [here](/docs/next/writing_data#non-blocking-concurrency-control-experimental). +writers [here](writing_data#non-blocking-concurrency-control-experimental). ## Enabling Multi Writing diff --git a/website/versioned_docs/version-0.14.1/deployment.md b/website/versioned_docs/version-0.14.1/deployment.md index f5ad89c7f8170..b400e413a7d3b 100644 --- a/website/versioned_docs/version-0.14.1/deployment.md +++ b/website/versioned_docs/version-0.14.1/deployment.md @@ -136,7 +136,7 @@ Here is an example invocation for reading from kafka topic in a continuous mode ### Spark Datasource Writer Jobs -As described in [Writing Data](/docs/writing_data#spark-datasource-writer), you can use spark datasource to ingest to hudi table. This mechanism allows you to ingest any spark dataframe in Hudi format. Hudi Spark DataSource also supports spark streaming to ingest a streaming source to Hudi table. For Merge On Read table types, inline compaction is turned on by default which runs after every ingestion run. The compaction frequency can be changed by setting the property "hoodie.compact.inline.max.delta.commits". +As described in [Writing Data](writing_data#spark-datasource-writer), you can use spark datasource to ingest to hudi table. This mechanism allows you to ingest any spark dataframe in Hudi format. Hudi Spark DataSource also supports spark streaming to ingest a streaming source to Hudi table. For Merge On Read table types, inline compaction is turned on by default which runs after every ingestion run. The compaction frequency can be changed by setting the property "hoodie.compact.inline.max.delta.commits". Here is an example invocation using spark datasource diff --git a/website/versioned_docs/version-0.14.1/faq_writing_tables.md b/website/versioned_docs/version-0.14.1/faq_writing_tables.md index 40c3a99fa99f2..ae23bbf1e7d29 100644 --- a/website/versioned_docs/version-0.14.1/faq_writing_tables.md +++ b/website/versioned_docs/version-0.14.1/faq_writing_tables.md @@ -6,7 +6,7 @@ keywords: [hudi, writing, reading] ### What are some ways to write a Hudi table? -Typically, you obtain a set of partial updates/inserts from your source and issue [write operations](/docs/write_operations/) against a Hudi table. If you ingesting data from any of the standard sources like Kafka, or tailing DFS, the [delta streamer](/docs/hoodie_streaming_ingestion#deltastreamer) tool is invaluable and provides an easy, self-managed solution to getting data written into Hudi. 
You can also write your own code to capture data from a custom source using the Spark datasource API and use a [Hudi datasource](/docs/writing_data/#spark-datasource-writer) to write into Hudi. +Typically, you obtain a set of partial updates/inserts from your source and issue [write operations](/docs/write_operations/) against a Hudi table. If you are ingesting data from any of the standard sources like Kafka, or tailing DFS, the [delta streamer](hoodie_streaming_ingestion#deltastreamer) tool is invaluable and provides an easy, self-managed solution to getting data written into Hudi. You can also write your own code to capture data from a custom source using the Spark datasource API and use a [Hudi datasource](writing_data/#spark-datasource-writer) to write into Hudi. ### How is a Hudi writer job deployed? @@ -147,7 +147,7 @@ a) **Auto Size small files during ingestion**: This solution trades ingest/writi Hudi has the ability to maintain a configured target file size, when performing **upsert/insert** operations. (Note: **bulk\_insert** operation does not provide this functionality and is designed as a simpler replacement for normal `spark.write.parquet` ) -For **copy-on-write**, this is as simple as configuring the [maximum size for a base/parquet file](/docs/configurations#hoodieparquetmaxfilesize) and the [soft limit](/docs/configurations#hoodieparquetsmallfilelimit) below which a file should be considered a small file. For the initial bootstrap to Hudi table, tuning record size estimate is also important to ensure sufficient records are bin-packed in a parquet file. For subsequent writes, Hudi automatically uses average record size based on previous commit. Hudi will try to add enough records to a small file at write time to get it to the configured maximum limit. For e.g , with `compactionSmallFileSize=100MB` and limitFileSize=120MB, Hudi will pick all files < 100MB and try to get them upto 120MB. +For **copy-on-write**, this is as simple as configuring the [maximum size for a base/parquet file](/docs/configurations#hoodieparquetmaxfilesize) and the [soft limit](/docs/configurations#hoodieparquetsmallfilelimit) below which a file should be considered a small file. For the initial bootstrap to Hudi table, tuning record size estimate is also important to ensure sufficient records are bin-packed in a parquet file. For subsequent writes, Hudi automatically uses average record size based on previous commit. Hudi will try to add enough records to a small file at write time to get it to the configured maximum limit. For example, with `hoodie.parquet.small.file.limit=100MB` and `hoodie.parquet.max.file.size=120MB`, Hudi will pick all files < 100MB and try to get them up to 120MB.
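For illustration only (not part of the documentation changes in this diff), a minimal spark-shell sketch of the file-sizing knobs discussed above. The two `hoodie.parquet.*` config keys come from the surrounding text; the table name, path and columns are made up for this example:

```scala
// Assumes a spark-shell started with the Hudi Spark bundle; names below are illustrative.
import org.apache.spark.sql.SaveMode
import spark.implicits._

val df = Seq((1, "rider-A", 27.70, 1695159649087L)).toDF("id", "rider", "fare", "ts")

df.write.format("hudi").
  option("hoodie.table.name", "hudi_trips_cow").
  option("hoodie.datasource.write.recordkey.field", "id").
  option("hoodie.datasource.write.precombine.field", "ts").
  // base files below this size (bytes) are treated as "small" and picked for padding
  option("hoodie.parquet.small.file.limit", (100 * 1024 * 1024).toString).
  // target maximum size (bytes) the padded base files are grown towards
  option("hoodie.parquet.max.file.size", (120 * 1024 * 1024).toString).
  mode(SaveMode.Append).
  save("/tmp/hudi_trips_cow")
```

Both values are plain byte counts, hence the explicit multiplication in the sketch.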
By default (implemented in “[SimpleConcurrentFileWritesConflictResolutionStrategy](https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/transaction/SimpleConcurrentFileWritesConflictResolutionStrategy.java)”), Hudi allows multiple writers to concurrently write data and commit to the timeline if there is no conflicting writes to the same underlying file group IDs. This is achieved by holding a lock, checking for changes that modified the same file IDs. Hudi then supports a pluggable interface “[ConflictResolutionStrategy](https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/transaction/ConflictResolutionStrategy.java)” that determines how conflicts are handled. By default, the later conflicting write is aborted. Hudi also support eager conflict detection to help speed up conflict detection and release cluster resources back early to reduce costs. +Hudi employs [optimistic concurrency control](concurrency_control) between writers, while implementing MVCC based concurrency control between writers and the table services. Concurrent writers to the same table need to be configured with the same lock provider configuration, to safely perform writes. By default (implemented in “[SimpleConcurrentFileWritesConflictResolutionStrategy](https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/transaction/SimpleConcurrentFileWritesConflictResolutionStrategy.java)”), Hudi allows multiple writers to concurrently write data and commit to the timeline if there is no conflicting writes to the same underlying file group IDs. This is achieved by holding a lock, checking for changes that modified the same file IDs. Hudi then supports a pluggable interface “[ConflictResolutionStrategy](https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/transaction/ConflictResolutionStrategy.java)” that determines how conflicts are handled. By default, the later conflicting write is aborted. Hudi also support eager conflict detection to help speed up conflict detection and release cluster resources back early to reduce costs. ### Can single-writer inserts have duplicates? diff --git a/website/versioned_docs/version-0.14.1/file_layouts.md b/website/versioned_docs/version-0.14.1/file_layouts.md index 71ee6d563079c..b8b5ca7c342a1 100644 --- a/website/versioned_docs/version-0.14.1/file_layouts.md +++ b/website/versioned_docs/version-0.14.1/file_layouts.md @@ -10,8 +10,8 @@ The following describes the general file layout structure for Apache Hudi. Pleas * Each file group contains several file slices * Each slice contains a base file (*.parquet/*.orc) (defined by the config - [hoodie.table.base.file.format](https://hudi.apache.org/docs/next/configurations/#hoodietablebasefileformat) ) produced at a certain commit/compaction instant time, along with set of log files (*.log.*) that contain inserts/updates to the base file since the base file was produced. -Hudi adopts Multiversion Concurrency Control (MVCC), where [compaction](/docs/next/compaction) action merges logs and base files to produce new -file slices and [cleaning](/docs/next/cleaning) action gets rid of unused/older file slices to reclaim space on the file system. 
+Hudi adopts Multiversion Concurrency Control (MVCC), where [compaction](compaction) action merges logs and base files to produce new +file slices and [cleaning](hoodie_cleaner) action gets rid of unused/older file slices to reclaim space on the file system. ![Partition On HDFS](/assets/images/hudi_partitions_HDFS.png) diff --git a/website/versioned_docs/version-0.14.1/file_sizing.md b/website/versioned_docs/version-0.14.1/file_sizing.md index c637a5a630cc3..a451b09b6c58f 100644 --- a/website/versioned_docs/version-0.14.1/file_sizing.md +++ b/website/versioned_docs/version-0.14.1/file_sizing.md @@ -148,7 +148,7 @@ while the clustering service runs. :::note Hudi always creates immutable files on storage. To be able to do auto-sizing or clustering, Hudi will always create a -newer version of the smaller file, resulting in 2 versions of the same file. The [cleaner service](/docs/next/cleaning) +newer version of the smaller file, resulting in 2 versions of the same file. The [cleaner service](hoodie_cleaner) will later kick in and delete the older version small file and keep the latest one. ::: diff --git a/website/versioned_docs/version-0.14.1/flink-quick-start-guide.md b/website/versioned_docs/version-0.14.1/flink-quick-start-guide.md index 02e5a19e5f6c0..74ecc9d73a9fc 100644 --- a/website/versioned_docs/version-0.14.1/flink-quick-start-guide.md +++ b/website/versioned_docs/version-0.14.1/flink-quick-start-guide.md @@ -453,19 +453,19 @@ feature is that it now lets you author streaming pipelines on streaming or batch ## Where To Go From Here? - **Quick Start** : Read [Quick Start](#quick-start) to get started quickly Flink sql client to write to(read from) Hudi. -- **Configuration** : For [Global Configuration](/docs/next/flink_tuning#global-configurations), sets up through `$FLINK_HOME/conf/flink-conf.yaml`. For per job configuration, sets up through [Table Option](/docs/next/flink_tuning#table-options). -- **Writing Data** : Flink supports different modes for writing, such as [CDC Ingestion](/docs/hoodie_streaming_ingestion#cdc-ingestion), [Bulk Insert](/docs/hoodie_streaming_ingestion#bulk-insert), [Index Bootstrap](/docs/hoodie_streaming_ingestion#index-bootstrap), [Changelog Mode](/docs/hoodie_streaming_ingestion#changelog-mode) and [Append Mode](/docs/hoodie_streaming_ingestion#append-mode). Flink also supports multiple streaming writers with [non-blocking concurrency control](/docs/next/writing_data#non-blocking-concurrency-control-experimental). -- **Querying Data** : Flink supports different modes for reading, such as [Streaming Query](/docs/querying_data#streaming-query) and [Incremental Query](/docs/querying_data#incremental-query). -- **Tuning** : For write/read tasks, this guide gives some tuning suggestions, such as [Memory Optimization](/docs/next/flink_tuning#memory-optimization) and [Write Rate Limit](/docs/next/flink_tuning#write-rate-limit). +- **Configuration** : For [Global Configuration](flink_tuning#global-configurations), sets up through `$FLINK_HOME/conf/flink-conf.yaml`. For per job configuration, sets up through [Table Option](flink_tuning#table-options). +- **Writing Data** : Flink supports different modes for writing, such as [CDC Ingestion](hoodie_streaming_ingestion#cdc-ingestion), [Bulk Insert](hoodie_streaming_ingestion#bulk-insert), [Index Bootstrap](hoodie_streaming_ingestion#index-bootstrap), [Changelog Mode](hoodie_streaming_ingestion#changelog-mode) and [Append Mode](hoodie_streaming_ingestion#append-mode). 
Flink also supports multiple streaming writers with [non-blocking concurrency control](writing_data#non-blocking-concurrency-control-experimental). +- **Querying Data** : Flink supports different modes for reading, such as [Streaming Query](sql_queries#streaming-query) and [Incremental Query](/docs/querying_data#incremental-query). +- **Tuning** : For write/read tasks, this guide gives some tuning suggestions, such as [Memory Optimization](flink_tuning#memory-optimization) and [Write Rate Limit](flink_tuning#write-rate-limit). - **Optimization**: Offline compaction is supported [Offline Compaction](/docs/compaction#flink-offline-compaction). -- **Query Engines**: Besides Flink, many other engines are integrated: [Hive Query](/docs/syncing_metastore#flink-setup), [Presto Query](/docs/querying_data#prestodb). +- **Query Engines**: Besides Flink, many other engines are integrated: [Hive Query](/docs/syncing_metastore#flink-setup), [Presto Query](sql_queries#presto). - **Catalog**: A Hudi specific catalog is supported: [Hudi Catalog](/docs/sql_ddl/#create-catalog). If you are relatively new to Apache Hudi, it is important to be familiar with a few core concepts: - - [Hudi Timeline](/docs/next/timeline) – How Hudi manages transactions and other table services - - [Hudi File Layout](/docs/next/storage_layouts) - How the files are laid out on storage - - [Hudi Table Types](/docs/next/table_types) – `COPY_ON_WRITE` and `MERGE_ON_READ` - - [Hudi Query Types](/docs/next/table_types#query-types) – Snapshot Queries, Incremental Queries, Read-Optimized Queries + - [Hudi Timeline](timeline) – How Hudi manages transactions and other table services + - [Hudi File Layout](file_layouts) - How the files are laid out on storage + - [Hudi Table Types](table_types) – `COPY_ON_WRITE` and `MERGE_ON_READ` + - [Hudi Query Types](table_types#query-types) – Snapshot Queries, Incremental Queries, Read-Optimized Queries See more in the "Concepts" section of the docs. diff --git a/website/versioned_docs/version-0.14.1/indexing.md b/website/versioned_docs/version-0.14.1/indexing.md index 034246ad5805c..53e883c385616 100644 --- a/website/versioned_docs/version-0.14.1/indexing.md +++ b/website/versioned_docs/version-0.14.1/indexing.md @@ -11,9 +11,9 @@ Hudi provides efficient upserts, by mapping a given hoodie key (record key + par This mapping between record key and file group/file id, never changes once the first version of a record has been written to a file. In short, the mapped file group contains all versions of a group of records. -For [Copy-On-Write tables](/docs/next/table_types#copy-on-write-table), this enables fast upsert/delete operations, by +For [Copy-On-Write tables](table_types#copy-on-write-table), this enables fast upsert/delete operations, by avoiding the need to join against the entire dataset to determine which files to rewrite. -For [Merge-On-Read tables](/docs/next/table_types#merge-on-read-table), this design allows Hudi to bound the amount of +For [Merge-On-Read tables](table_types#merge-on-read-table), this design allows Hudi to bound the amount of records any given base file needs to be merged against. Specifically, a given base file needs to merged only against updates for records that are part of that base file. 
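For illustration only (not part of this diff), a hedged sketch of how the index behind this key-to-file-group mapping is typically selected on a Spark upsert. `hoodie.index.type` and the `BLOOM` value are standard Hudi configs; the table, path and columns here are invented:

```scala
// Assumes a spark-shell with the Hudi Spark bundle; names below are illustrative.
import org.apache.spark.sql.SaveMode
import spark.implicits._

val updates = Seq((1, "rider-A", 33.90, 1695159650000L)).toDF("id", "rider", "fare", "ts")

updates.write.format("hudi").
  option("hoodie.table.name", "hudi_trips_cow").
  option("hoodie.datasource.write.recordkey.field", "id").
  option("hoodie.datasource.write.precombine.field", "ts").
  option("hoodie.datasource.write.operation", "upsert").
  // index used to map incoming record keys to existing file groups;
  // other values such as GLOBAL_BLOOM, SIMPLE or BUCKET can be chosen instead
  option("hoodie.index.type", "BLOOM").
  mode(SaveMode.Append).
  save("/tmp/hudi_trips_cow")
```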
In contrast, designs without an indexing component (e.g: [Apache Hive ACID](https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions)), diff --git a/website/versioned_docs/version-0.14.1/metadata_indexing.md b/website/versioned_docs/version-0.14.1/metadata_indexing.md index 5b96ed07bd407..2a0bbfca06f04 100644 --- a/website/versioned_docs/version-0.14.1/metadata_indexing.md +++ b/website/versioned_docs/version-0.14.1/metadata_indexing.md @@ -78,8 +78,8 @@ us schedule the indexing for COLUMN_STATS index. First we need to define a prope As mentioned before, metadata indices are pluggable. One can add any index at any point in time depending on changing business requirements. Some configurations to enable particular indices are listed below. Currently, available indices under -metadata table can be explored [here](/docs/next/metadata#metadata-table-indices) along with [configs](/docs/next/metadata#enable-hudi-metadata-table-and-multi-modal-index-in-write-side) -to enable them. The full set of metadata configurations can be explored [here](/docs/next/configurations/#Metadata-Configs). +metadata table can be explored [here](metadata#metadata-table-indices) along with [configs](metadata#enable-hudi-metadata-table-and-multi-modal-index-in-write-side) +to enable them. The full set of metadata configurations can be explored [here](configurations/#Metadata-Configs). :::note Enabling the metadata table and configuring a lock provider are the prerequisites for using async indexer. Checkout a sample diff --git a/website/versioned_docs/version-0.14.1/procedures.md b/website/versioned_docs/version-0.14.1/procedures.md index c2cd0dea7c11e..0a895560df091 100644 --- a/website/versioned_docs/version-0.14.1/procedures.md +++ b/website/versioned_docs/version-0.14.1/procedures.md @@ -472,10 +472,10 @@ archive commits. |------------------------------------------------------------------------|---------|----------|---------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | table | String | N | None | Hudi table name | | path | String | N | None | Path of table | -| [min_commits](/docs/next/configurations#hoodiekeepmincommits) | Int | N | 20 | Similar to hoodie.keep.max.commits, but controls the minimum number of instants to retain in the active timeline. | -| [max_commits](/docs/next/configurations#hoodiekeepmaxcommits) | Int | N | 30 | Archiving service moves older entries from timeline into an archived log after each write, to keep the metadata overhead constant, even as the table size grows. This config controls the maximum number of instants to retain in the active timeline. | -| [retain_commits](/docs/next/configurations#hoodiecommitsarchivalbatch) | Int | N | 10 | Archiving of instants is batched in best-effort manner, to pack more instants into a single archive log. This config controls such archival batch size. | -| [enable_metadata](/docs/next/configurations#hoodiemetadataenable) | Boolean | N | false | Enable the internal metadata table | +| [min_commits](configurations#hoodiekeepmincommits) | Int | N | 20 | Similar to hoodie.keep.max.commits, but controls the minimum number of instants to retain in the active timeline. 
| +| [max_commits](configurations#hoodiekeepmaxcommits) | Int | N | 30 | Archiving service moves older entries from timeline into an archived log after each write, to keep the metadata overhead constant, even as the table size grows. This config controls the maximum number of instants to retain in the active timeline. | +| [retain_commits](configurations#hoodiecommitsarchivalbatch) | Int | N | 10 | Archiving of instants is batched in best-effort manner, to pack more instants into a single archive log. This config controls such archival batch size. | +| [enable_metadata](configurations#hoodiemetadataenable) | Boolean | N | false | Enable the internal metadata table | **Output** @@ -672,7 +672,7 @@ copy table to a temporary view. | Parameter Name | Type | Required | Default Value | Description | |-------------------------------------------------------------------|---------|----------|---------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | table | String | Y | None | Hudi table name | -| [query_type](/docs/next/configurations#hoodiedatasourcequerytype) | String | N | "snapshot" | Whether data needs to be read, in `incremental` mode (new data since an instantTime) (or) `read_optimized` mode (obtain latest view, based on base files) (or) `snapshot` mode (obtain latest view, by merging base and (if any) log files) | +| [query_type](configurations#hoodiedatasourcequerytype) | String | N | "snapshot" | Whether data needs to be read, in `incremental` mode (new data since an instantTime) (or) `read_optimized` mode (obtain latest view, based on base files) (or) `snapshot` mode (obtain latest view, by merging base and (if any) log files) | | view_name | String | Y | None | Name of view | | begin_instance_time | String | N | "" | Begin instance time | | end_instance_time | String | N | "" | End instance time | @@ -705,7 +705,7 @@ copy table to a new table. | Parameter Name | Type | Required | Default Value | Description | |-------------------------------------------------------------------|--------|----------|---------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | table | String | Y | None | Hudi table name | -| [query_type](/docs/next/configurations#hoodiedatasourcequerytype) | String | N | "snapshot" | Whether data needs to be read, in `incremental` mode (new data since an instantTime) (or) `read_optimized` mode (obtain latest view, based on base files) (or) `snapshot` mode (obtain latest view, by merging base and (if any) log files) | +| [query_type](configurations#hoodiedatasourcequerytype) | String | N | "snapshot" | Whether data needs to be read, in `incremental` mode (new data since an instantTime) (or) `read_optimized` mode (obtain latest view, based on base files) (or) `snapshot` mode (obtain latest view, by merging base and (if any) log files) | | new_table | String | Y | None | Name of new table | | begin_instance_time | String | N | "" | Begin instance time | | end_instance_time | String | N | "" | End instance time | @@ -1533,13 +1533,13 @@ Run cleaner on a hoodie table. 
|---------------------------------------------------------------------------------------|---------|----------|---------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | table | String | Y | None | Name of table to be cleaned | | schedule_in_line | Boolean | N | true | Set "true" if you want to schedule and run a clean. Set false if you have already scheduled a clean and want to run that. | -| [clean_policy](/docs/next/configurations#hoodiecleanerpolicy) | String | N | None | org.apache.hudi.common.model.HoodieCleaningPolicy: Cleaning policy to be used. The cleaner service deletes older file slices files to re-claim space. Long running query plans may often refer to older file slices and will break if those are cleaned, before the query has had a chance to run. So, it is good to make sure that the data is retained for more than the maximum query execution time. By default, the cleaning policy is determined based on one of the following configs explicitly set by the user (at most one of them can be set; otherwise, KEEP_LATEST_COMMITS cleaning policy is used). KEEP_LATEST_FILE_VERSIONS: keeps the last N versions of the file slices written; used when "hoodie.cleaner.fileversions.retained" is explicitly set only. KEEP_LATEST_COMMITS(default): keeps the file slices written by the last N commits; used when "hoodie.cleaner.commits.retained" is explicitly set only. KEEP_LATEST_BY_HOURS: keeps the file slices written in the last N hours based on the commit time; used when "hoodie.cleaner.hours.retained" is explicitly set only. | -| [retain_commits](/docs/next/configurations#hoodiecleanercommitsretained) | Int | N | None | When KEEP_LATEST_COMMITS cleaning policy is used, the number of commits to retain, without cleaning. This will be retained for num_of_commits * time_between_commits (scheduled). This also directly translates into how much data retention the table supports for incremental queries. | -| [hours_retained](/docs/next/configurations#hoodiecleanerhoursretained) | Int | N | None | When KEEP_LATEST_BY_HOURS cleaning policy is used, the number of hours for which commits need to be retained. This config provides a more flexible option as compared to number of commits retained for cleaning service. Setting this property ensures all the files, but the latest in a file group, corresponding to commits with commit times older than the configured number of hours to be retained are cleaned. 
| -| [file_versions_retained](/docs/next/configurations#hoodiecleanerfileversionsretained) | Int | N | None | When KEEP_LATEST_FILE_VERSIONS cleaning policy is used, the minimum number of file slices to retain in each file group, during cleaning. | -| [trigger_strategy](/docs/next/configurations#hoodiecleantriggerstrategy) | String | N | None | org.apache.hudi.table.action.clean.CleaningTriggerStrategy: Controls when cleaning is scheduled. NUM_COMMITS(default): Trigger the cleaning service every N commits, determined by `hoodie.clean.max.commits` | -| [trigger_max_commits](/docs/next/configurations/#hoodiecleanmaxcommits) | Int | N | None | Number of commits after the last clean operation, before scheduling of a new clean is attempted. | -| [options](/docs/next/configurations/#Clean-Configs) | String | N | None | comma separated list of Hudi configs for cleaning in the format "config1=value1,config2=value2" | +| [clean_policy](configurations#hoodiecleanerpolicy) | String | N | None | org.apache.hudi.common.model.HoodieCleaningPolicy: Cleaning policy to be used. The cleaner service deletes older file slices files to re-claim space. Long running query plans may often refer to older file slices and will break if those are cleaned, before the query has had a chance to run. So, it is good to make sure that the data is retained for more than the maximum query execution time. By default, the cleaning policy is determined based on one of the following configs explicitly set by the user (at most one of them can be set; otherwise, KEEP_LATEST_COMMITS cleaning policy is used). KEEP_LATEST_FILE_VERSIONS: keeps the last N versions of the file slices written; used when "hoodie.cleaner.fileversions.retained" is explicitly set only. KEEP_LATEST_COMMITS(default): keeps the file slices written by the last N commits; used when "hoodie.cleaner.commits.retained" is explicitly set only. KEEP_LATEST_BY_HOURS: keeps the file slices written in the last N hours based on the commit time; used when "hoodie.cleaner.hours.retained" is explicitly set only. | +| [retain_commits](configurations#hoodiecleanercommitsretained) | Int | N | None | When KEEP_LATEST_COMMITS cleaning policy is used, the number of commits to retain, without cleaning. This will be retained for num_of_commits * time_between_commits (scheduled). This also directly translates into how much data retention the table supports for incremental queries. | +| [hours_retained](configurations#hoodiecleanerhoursretained) | Int | N | None | When KEEP_LATEST_BY_HOURS cleaning policy is used, the number of hours for which commits need to be retained. This config provides a more flexible option as compared to number of commits retained for cleaning service. Setting this property ensures all the files, but the latest in a file group, corresponding to commits with commit times older than the configured number of hours to be retained are cleaned. | +| [file_versions_retained](configurations#hoodiecleanerfileversionsretained) | Int | N | None | When KEEP_LATEST_FILE_VERSIONS cleaning policy is used, the minimum number of file slices to retain in each file group, during cleaning. | +| [trigger_strategy](configurations#hoodiecleantriggerstrategy) | String | N | None | org.apache.hudi.table.action.clean.CleaningTriggerStrategy: Controls when cleaning is scheduled. 
NUM_COMMITS(default): Trigger the cleaning service every N commits, determined by `hoodie.clean.max.commits` | +| [trigger_max_commits](configurations/#hoodiecleanmaxcommits) | Int | N | None | Number of commits after the last clean operation, before scheduling of a new clean is attempted. | +| [options](configurations/#Clean-Configs) | String | N | None | comma separated list of Hudi configs for cleaning in the format "config1=value1,config2=value2" | **Output** @@ -1633,12 +1633,12 @@ Sync the table's latest schema to Hive metastore. | metastore_uri | String | N | "" | Metastore_uri | | username | String | N | "" | User name | | password | String | N | "" | Password | -| [use_jdbc](/docs/next/configurations#hoodiedatasourcehive_syncuse_jdbc) | String | N | "" | Use JDBC when hive synchronization is enabled | -| [mode](/docs/next/configurations#hoodiedatasourcehive_syncmode) | String | N | "" | Mode to choose for Hive ops. Valid values are hms, jdbc and hiveql. | -| [partition_fields](/docs/next/configurations#hoodiedatasourcehive_syncpartition_fields) | String | N | "" | Field in the table to use for determining hive partition columns. | | -| [partition_extractor_class](/docs/next/configurations#hoodiedatasourcehive_syncpartition_extractor_class) | String | N | "" | Class which implements PartitionValueExtractor to extract the partition values, default 'org.apache.hudi.hive.MultiPartKeysValueExtractor'. | -| [strategy](/docs/next/configurations#hoodiedatasourcehive_synctablestrategy) | String | N | "" | Hive table synchronization strategy. Available option: RO, RT, ALL. | -| [sync_incremental](/docs/next/configurations#hoodiemetasyncincremental) | String | N | "" | Whether to incrementally sync the partitions to the metastore, i.e., only added, changed, and deleted partitions based on the commit metadata. If set to `false`, the meta sync executes a full partition sync operation when partitions are lost. | +| [use_jdbc](configurations#hoodiedatasourcehive_syncuse_jdbc) | String | N | "" | Use JDBC when hive synchronization is enabled | +| [mode](configurations#hoodiedatasourcehive_syncmode) | String | N | "" | Mode to choose for Hive ops. Valid values are hms, jdbc and hiveql. | +| [partition_fields](configurations#hoodiedatasourcehive_syncpartition_fields) | String | N | "" | Field in the table to use for determining hive partition columns. | | +| [partition_extractor_class](configurations#hoodiedatasourcehive_syncpartition_extractor_class) | String | N | "" | Class which implements PartitionValueExtractor to extract the partition values, default 'org.apache.hudi.hive.MultiPartKeysValueExtractor'. | +| [strategy](configurations#hoodiedatasourcehive_synctablestrategy) | String | N | "" | Hive table synchronization strategy. Available option: RO, RT, ALL. | +| [sync_incremental](configurations#hoodiemetasyncincremental) | String | N | "" | Whether to incrementally sync the partitions to the metastore, i.e., only added, changed, and deleted partitions based on the commit metadata. If set to `false`, the meta sync executes a full partition sync operation when partitions are lost. | @@ -1848,18 +1848,18 @@ Convert an existing table to Hudi. 
|------------------------------------------------------------------------------|---------|----------|-------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | table | String | Y | None | Name of table to be clustered | | table_type | String | Y | None | Table type, MERGE_ON_READ or COPY_ON_WRITE | -| [bootstrap_path](/docs/next/configurations#hoodiebootstrapbasepath) | String | Y | None | Base path of the dataset that needs to be bootstrapped as a Hudi table | +| [bootstrap_path](configurations#hoodiebootstrapbasepath) | String | Y | None | Base path of the dataset that needs to be bootstrapped as a Hudi table | | base_path | String | Y | None | Base path | | rowKey_field | String | Y | None | Primary key field | | base_file_format | String | N | "PARQUET" | Format of base file | | partition_path_field | String | N | "" | Partitioned column field | -| [bootstrap_index_class](/docs/next/configurations#hoodiebootstrapindexclass) | String | N | "org.apache.hudi.common.bootstrap.index.HFileBootstrapIndex" | Implementation to use, for mapping a skeleton base file to a bootstrap base file. | -| [selector_class](/docs/next/configurations#hoodiebootstrapmodeselector) | String | N | "org.apache.hudi.client.bootstrap.selector.MetadataOnlyBootstrapModeSelector" | Selects the mode in which each file/partition in the bootstrapped dataset gets bootstrapped | +| [bootstrap_index_class](configurations#hoodiebootstrapindexclass) | String | N | "org.apache.hudi.common.bootstrap.index.HFileBootstrapIndex" | Implementation to use, for mapping a skeleton base file to a bootstrap base file. | +| [selector_class](configurations#hoodiebootstrapmodeselector) | String | N | "org.apache.hudi.client.bootstrap.selector.MetadataOnlyBootstrapModeSelector" | Selects the mode in which each file/partition in the bootstrapped dataset gets bootstrapped | | key_generator_class | String | N | "org.apache.hudi.keygen.SimpleKeyGenerator" | Class of key generator | | full_bootstrap_input_provider | String | N | "org.apache.hudi.bootstrap.SparkParquetBootstrapDataProvider" | Class of full bootstrap input provider | | schema_provider_class | String | N | "" | Class of schema provider | | payload_class | String | N | "org.apache.hudi.common.model.OverwriteWithLatestAvroPayload" | Class of payload | -| [parallelism](/docs/next/configurations#hoodiebootstrapparallelism) | Int | N | 1500 | For metadata-only bootstrap, Hudi parallelizes the operation so that each table partition is handled by one Spark task. This config limits the number of parallelism. We pick the configured parallelism if the number of table partitions is larger than this configured value. The parallelism is assigned to the number of table partitions if it is smaller than the configured value. 
For full-record bootstrap, i.e., BULK_INSERT operation of the records, this configured value is passed as the BULK_INSERT shuffle parallelism (`hoodie.bulkinsert.shuffle.parallelism`), determining the BULK_INSERT write behavior. If you see that the bootstrap is slow due to the limited parallelism, you can increase this. | +| [parallelism](configurations#hoodiebootstrapparallelism) | Int | N | 1500 | For metadata-only bootstrap, Hudi parallelizes the operation so that each table partition is handled by one Spark task. This config limits the number of parallelism. We pick the configured parallelism if the number of table partitions is larger than this configured value. The parallelism is assigned to the number of table partitions if it is smaller than the configured value. For full-record bootstrap, i.e., BULK_INSERT operation of the records, this configured value is passed as the BULK_INSERT shuffle parallelism (`hoodie.bulkinsert.shuffle.parallelism`), determining the BULK_INSERT write behavior. If you see that the bootstrap is slow due to the limited parallelism, you can increase this. | | enable_hive_sync | Boolean | N | false | Whether to enable hive sync | | props_file_path | String | N | "" | Path of properties file | | bootstrap_overwrite | Boolean | N | false | Overwrite bootstrap path | diff --git a/website/versioned_docs/version-0.14.1/querying_data.md b/website/versioned_docs/version-0.14.1/querying_data.md index c43ee1fd7f45e..ee330fede3a99 100644 --- a/website/versioned_docs/version-0.14.1/querying_data.md +++ b/website/versioned_docs/version-0.14.1/querying_data.md @@ -7,7 +7,7 @@ last_modified_at: 2019-12-30T15:59:57-04:00 --- :::danger -This page is no longer maintained. Please refer to Hudi [SQL DDL](/docs/next/sql_ddl), [SQL DML](/docs/next/sql_dml), [SQL Queries](/docs/next/sql_queries) and [Procedures](/docs/next/procedures) for the latest documentation. +This page is no longer maintained. Please refer to Hudi [SQL DDL](sql_ddl), [SQL DML](sql_dml), [SQL Queries](sql_queries) and [Procedures](procedures) for the latest documentation. ::: Conceptually, Hudi stores data physically once on DFS, while providing 3 different ways of querying, as explained [before](/docs/concepts#query-types). diff --git a/website/versioned_docs/version-0.14.1/quick-start-guide.md b/website/versioned_docs/version-0.14.1/quick-start-guide.md index 6f07e7363b682..3c78608036fa7 100644 --- a/website/versioned_docs/version-0.14.1/quick-start-guide.md +++ b/website/versioned_docs/version-0.14.1/quick-start-guide.md @@ -223,7 +223,7 @@ CREATE TABLE hudi_table ( PARTITIONED BY (city); ``` -For more options for creating Hudi tables or if you're running into any issues, please refer to [SQL DDL](/docs/next/sql_ddl) reference guide. +For more options for creating Hudi tables or if you're running into any issues, please refer to [SQL DDL](sql_ddl) reference guide. @@ -267,7 +267,7 @@ inserts.write.format("hudi"). ``` :::info Mapping to Hudi write operations -Hudi provides a wide range of [write operations](/docs/next/write_operations) - both batch and incremental - to write data into Hudi tables, +Hudi provides a wide range of [write operations](write_operations) - both batch and incremental - to write data into Hudi tables, with different semantics and performance. When record keys are not configured (see [keys](#keys) below), `bulk_insert` will be chosen as the write operation, matching the out-of-behavior of Spark's Parquet Datasource. ::: @@ -300,7 +300,7 @@ inserts.write.format("hudi"). 
\ ``` :::info Mapping to Hudi write operations -Hudi provides a wide range of [write operations](/docs/next/write_operations) - both batch and incremental - to write data into Hudi tables, +Hudi provides a wide range of [write operations](write_operations) - both batch and incremental - to write data into Hudi tables, with different semantics and performance. When record keys are not configured (see [keys](#keys) below), `bulk_insert` will be chosen as the write operation, matching the out-of-behavior of Spark's Parquet Datasource. ::: @@ -309,7 +309,7 @@ the write operation, matching the out-of-behavior of Spark's Parquet Datasource. -Users can use 'INSERT INTO' to insert data into a Hudi table. See [Insert Into](/docs/next/sql_dml#insert-into) for more advanced options. +Users can use 'INSERT INTO' to insert data into a Hudi table. See [Insert Into](sql_dml#insert-into) for more advanced options. ```sql INSERT INTO hudi_table @@ -421,7 +421,7 @@ Notice that the save mode is now `Append`. In general, always use append mode un -Hudi table can be update using a regular UPDATE statement. See [Update](/docs/next/sql_dml#update) for more advanced options. +Hudi table can be update using a regular UPDATE statement. See [Update](sql_dml#update) for more advanced options. ```sql UPDATE hudi_table SET fare = 25.0 WHERE rider = 'rider-D'; @@ -451,7 +451,7 @@ Notice that the save mode is now `Append`. In general, always use append mode un -[Querying](#querying) the data again will now show updated records. Each write operation generates a new [commit](/docs/next/concepts). +[Querying](#querying) the data again will now show updated records. Each write operation generates a new [commit](concepts). Look for changes in `_hoodie_commit_time`, `fare` fields for the given `_hoodie_record_key` value from a previous commit. ## Merging Data {#merge} @@ -539,7 +539,7 @@ MERGE statement either using `SET *` or using `SET column1 = expression1 [, colu ## Delete data {#deletes} Delete operation removes the records specified from the table. For example, this code snippet deletes records -for the HoodieKeys passed in. Check out the [deletion section](/docs/next/writing_data#deletes) for more details. +for the HoodieKeys passed in. Check out the [deletion section](writing_data#deletes) for more details. :::note Implications of defining record keys -Configuring keys for a Hudi table, has a new implications on the table. If record key is set by the user, `upsert` is chosen as the [write operation](/docs/next/write_operations). +Configuring keys for a Hudi table, has a new implications on the table. If record key is set by the user, `upsert` is chosen as the [write operation](write_operations). Also if a record key is configured, then it's also advisable to specify a precombine or ordering field, to correctly handle cases where the source data has multiple records with the same key. See section below. ::: @@ -1108,29 +1108,29 @@ PARTITIONED BY (city); ## Where to go from here? You can also [build hudi yourself](https://github.com/apache/hudi#building-apache-hudi-from-source) and try this quickstart using `--jars `(see also [build with scala 2.12](https://github.com/apache/hudi#build-with-different-spark-versions)) -for more info. If you are looking for ways to migrate your existing data to Hudi, refer to [migration guide](/docs/next/migration_guide). +for more info. If you are looking for ways to migrate your existing data to Hudi, refer to [migration guide](migration_guide). 
### Spark SQL Reference -For advanced usage of spark SQL, please refer to [Spark SQL DDL](/docs/next/sql_ddl) and [Spark SQL DML](/docs/next/sql_dml) reference guides. -For alter table commands, check out [this](/docs/next/sql_ddl#spark-alter-table). Stored procedures provide a lot of powerful capabilities using Hudi SparkSQL to assist with monitoring, managing and operating Hudi tables, please check [this](/docs/next/procedures) out. +For advanced usage of spark SQL, please refer to [Spark SQL DDL](sql_ddl) and [Spark SQL DML](sql_dml) reference guides. +For alter table commands, check out [this](sql_ddl#spark-alter-table). Stored procedures provide a lot of powerful capabilities using Hudi SparkSQL to assist with monitoring, managing and operating Hudi tables, please check [this](procedures) out. ### Streaming workloads Hudi provides industry-leading performance and functionality for streaming data. -**Hudi Streamer** - Hudi provides an incremental ingestion/ETL tool - [HoodieStreamer](/docs/next/hoodie_streaming_ingestion#hudi-streamer), to assist with ingesting data into Hudi +**Hudi Streamer** - Hudi provides an incremental ingestion/ETL tool - [HoodieStreamer](hoodie_streaming_ingestion#hudi-streamer), to assist with ingesting data into Hudi from various different sources in a streaming manner, with powerful built-in capabilities like auto checkpointing, schema enforcement via schema provider, transformation support, automatic table services and so on. -**Structured Streaming** - Hudi supports Spark Structured Streaming reads and writes as well. Please see [here](/docs/next/hoodie_streaming_ingestion#structured-streaming) for more. +**Structured Streaming** - Hudi supports Spark Structured Streaming reads and writes as well. Please see [here](hoodie_streaming_ingestion#structured-streaming) for more. -Check out more information on [modeling data in Hudi](/docs/next/faq_general#how-do-i-model-the-data-stored-in-hudi) and different ways to [writing Hudi Tables](/docs/next/writing_data). +Check out more information on [modeling data in Hudi](faq_general#how-do-i-model-the-data-stored-in-hudi) and different ways to [writing Hudi Tables](writing_data). ### Dockerized Demo Even as we showcased the core capabilities, Hudi supports a lot more advanced functionality that can make it easy to get your transactional data lakes up and running quickly, across a variety query engines like Hive, Flink, Spark, Presto, Trino and much more. We have put together a [demo video](https://www.youtube.com/watch?v=VhNgUsxdrD0) that showcases all of this on a docker based setup with all dependent systems running locally. We recommend you replicate the same setup and run the demo yourself, by following -steps [here](/docs/next/docker_demo) to get a taste for it. +steps [here](docker_demo) to get a taste for it. diff --git a/website/versioned_docs/version-0.14.1/record_payload.md b/website/versioned_docs/version-0.14.1/record_payload.md index 105a87ae9a02c..0f514dced09e5 100644 --- a/website/versioned_docs/version-0.14.1/record_payload.md +++ b/website/versioned_docs/version-0.14.1/record_payload.md @@ -172,6 +172,6 @@ provides support for applying changes captured via Amazon Database Migration Ser Record payloads are tunable to suit many use cases. Please check out the configurations listed [here](/docs/configurations#RECORD_PAYLOAD). 
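For illustration only (not part of this diff), a minimal sketch of wiring a record payload into a Spark write. The payload class shown ships with Hudi (it is also referenced elsewhere in this diff); the config key is the standard datasource write option, and the table, path and columns are invented:

```scala
// Assumes a spark-shell with the Hudi Spark bundle; names below are illustrative.
import org.apache.spark.sql.SaveMode
import spark.implicits._

val batch = Seq((1, "rider-A", 19.10, 1695159651000L)).toDF("id", "rider", "fare", "ts")

batch.write.format("hudi").
  option("hoodie.table.name", "hudi_trips_cow").
  option("hoodie.datasource.write.recordkey.field", "id").
  option("hoodie.datasource.write.precombine.field", "ts").
  // the payload decides how an incoming record is merged with the record on storage;
  // OverwriteWithLatestAvroPayload keeps the record with the latest precombine value
  option("hoodie.datasource.write.payload.class",
    "org.apache.hudi.common.model.OverwriteWithLatestAvroPayload").
  mode(SaveMode.Append).
  save("/tmp/hudi_trips_cow")
```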
Moreover, if users want to implement their own custom merge logic, -please check out [this FAQ](/docs/next/faq_writing_tables/#can-i-implement-my-own-logic-for-how-input-records-are-merged-with-record-on-storage). In a +please check out [this FAQ](faq_writing_tables/#can-i-implement-my-own-logic-for-how-input-records-are-merged-with-record-on-storage). In a separate document, we will talk about a new record merger API for optimized payload handling. diff --git a/website/versioned_docs/version-0.14.1/sql_ddl.md b/website/versioned_docs/version-0.14.1/sql_ddl.md index 0c953905fcb4d..eb44b5da7c1d3 100644 --- a/website/versioned_docs/version-0.14.1/sql_ddl.md +++ b/website/versioned_docs/version-0.14.1/sql_ddl.md @@ -104,7 +104,7 @@ TBLPROPERTIES ( ``` ### Create table from an external location -Often, Hudi tables are created from streaming writers like the [streamer tool](/docs/next/hoodie_streaming_ingestion#hudi-streamer), which +Often, Hudi tables are created from streaming writers like the [streamer tool](hoodie_streaming_ingestion#hudi-streamer), which may later need some SQL statements to run on them. You can create an External table using the `location` statement. ```sql @@ -389,7 +389,7 @@ Users can set table properties while creating a table. The important table prope #### Passing Lock Providers for Concurrent Writers Hudi requires a lock provider to support concurrent writers or asynchronous table services when using OCC -and [NBCC](/docs/next/concurrency_control#non-blocking-concurrency-control-mode-experimental) (Non-Blocking Concurrency Control) +and [NBCC](concurrency_control#non-blocking-concurrency-control) (Non-Blocking Concurrency Control) concurrency mode. For NBCC mode, locking is only used to write the commit metadata file in the timeline. Writes are serialized by completion time. Users can pass these table properties into *TBLPROPERTIES* as well. Below is an example for a Zookeeper based configuration. @@ -612,7 +612,7 @@ ALTER TABLE tableA RENAME TO tableB; ### Setting Hudi configs #### Using table options -You can configure hoodie configs in table options when creating a table. You can refer Flink specific hoodie configs [here](/docs/next/configurations#FLINK_SQL) +You can configure hoodie configs in table options when creating a table. You can refer Flink specific hoodie configs [here](configurations#FLINK_SQL) These configs will be applied to all the operations on that table. ```sql diff --git a/website/versioned_docs/version-0.14.1/sql_dml.md b/website/versioned_docs/version-0.14.1/sql_dml.md index 1021984356424..fec050936e41f 100644 --- a/website/versioned_docs/version-0.14.1/sql_dml.md +++ b/website/versioned_docs/version-0.14.1/sql_dml.md @@ -12,7 +12,7 @@ import TabItem from '@theme/TabItem'; SparkSQL provides several Data Manipulation Language (DML) actions for interacting with Hudi tables. These operations allow you to insert, update, merge and delete data from your Hudi tables. Let's explore them one by one. -Please refer to [SQL DDL](/docs/next/sql_ddl) for creating Hudi tables using SQL. +Please refer to [SQL DDL](sql_ddl) for creating Hudi tables using SQL. ### Insert Into @@ -25,7 +25,7 @@ SELECT FROM ; :::note Deprecations From 0.14.0, `hoodie.sql.bulk.insert.enable` and `hoodie.sql.insert.mode` are deprecated. Users are expected to use `hoodie.spark.sql.insert.into.operation` instead. -To manage duplicates with `INSERT INTO`, please check out [insert dup policy config](/docs/next/configurations#hoodiedatasourceinsertduppolicy). 
+To manage duplicates with `INSERT INTO`, please check out [insert dup policy config](configurations#hoodiedatasourceinsertduppolicy). ::: Examples: diff --git a/website/versioned_docs/version-0.14.1/use_cases.md b/website/versioned_docs/version-0.14.1/use_cases.md index edf37a3a4b00b..4d06f1e571a68 100644 --- a/website/versioned_docs/version-0.14.1/use_cases.md +++ b/website/versioned_docs/version-0.14.1/use_cases.md @@ -85,7 +85,7 @@ stream processing world to ensure pipelines don't break from non backwards compa ### ACID Transactions Along with a table, Apache Hudi brings ACID transactional guarantees to a data lake. -Hudi ensures atomic writes, by way of publishing commits atomically to a [timeline](/docs/next/timeline), stamped with an +Hudi ensures atomic writes, by way of publishing commits atomically to a [timeline](timeline), stamped with an instant time that denotes the time at which the action is deemed to have occurred. Unlike general purpose file version control, Hudi draws clear distinction between writer processes (that issue user’s upserts/deletes), table services (that write data/metadata to optimize/perform bookkeeping) and readers @@ -127,12 +127,12 @@ cost savings for your data lake. Some examples of the Apache Hudi services that make this performance optimization easy include: -- [Auto File Sizing](/docs/next/file_sizing) - to solve the "small files" problem. -- [Clustering](/docs/next/clustering) - to co-locate data next to each other. -- [Compaction](/docs/next/compaction) - to allow tuning of low latency ingestion and fast read queries. -- [Indexing](/docs/next/indexes) - for efficient upserts and deletes. +- [Auto File Sizing](file_sizing) - to solve the "small files" problem. +- [Clustering](clustering) - to co-locate data next to each other. +- [Compaction](compaction) - to allow tuning of low latency ingestion and fast read queries. +- [Indexing](indexing) - for efficient upserts and deletes. - Multi-Dimensional Partitioning (Z-Ordering) - Traditional folder style partitioning on low-cardinality, while also Z-Ordering data within files based on high-cardinality - Metadata Table - No more slow S3 file listings or throttling. -- [Auto Cleaning](/docs/next/cleaning) - Keeps your storage costs in check by automatically removing old versions of files. +- [Auto Cleaning](hoodie_cleaner) - Keeps your storage costs in check by automatically removing old versions of files. diff --git a/website/versioned_docs/version-0.14.1/write_operations.md b/website/versioned_docs/version-0.14.1/write_operations.md index d340c590b12c3..492cd52046cf9 100644 --- a/website/versioned_docs/version-0.14.1/write_operations.md +++ b/website/versioned_docs/version-0.14.1/write_operations.md @@ -88,25 +88,25 @@ The following is an inside look on the Hudi write path and the sequence of event 1. [Deduping](/docs/configurations#hoodiecombinebeforeinsert) 1. First your input records may have duplicate keys within the same batch and duplicates need to be combined or reduced by key. -2. [Index Lookup](/docs/next/indexes) +2. [Index Lookup](indexing) 1. Next, an index lookup is performed to try and match the input records to identify which file groups they belong to. -3. [File Sizing](/docs/next/file_sizing) +3. [File Sizing](file_sizing) 1. Then, based on the average size of previous commits, Hudi will make a plan to add enough records to a small file to get it close to the configured maximum limit. -4. [Partitioning](/docs/next/storage_layouts) +4. [Partitioning](file_layouts) 1. 
We now arrive at partitioning where we decide what file groups certain updates and inserts will be placed in or if new file groups will be created 5. Write I/O 1. Now we actually do the write operations which is either creating a new base file, appending to the log file, or versioning an existing base file. -6. Update [Index](/docs/next/indexes) +6. Update [Index](indexing) 1. Now that the write is performed, we will go back and update the index. 7. Commit - 1. Finally we commit all of these changes atomically. (A [callback notification](/docs/next/writing_data#commit-notifications) is exposed) -8. [Clean](/docs/next/cleaning) (if needed) + 1. Finally we commit all of these changes atomically. (A [callback notification](writing_data#commit-notifications) is exposed) +8. [Clean](hoodie_cleaner) (if needed) 1. Following the commit, cleaning is invoked if needed. -9. [Compaction](/docs/next/compaction) +9. [Compaction](compaction) 1. If you are using MOR tables, compaction will either run inline, or be scheduled asynchronously 10. Archive - 1. Lastly, we perform an archival step which moves old [timeline](/docs/next/timeline) items to an archive folder. + 1. Lastly, we perform an archival step which moves old [timeline](timeline) items to an archive folder. Here is a diagramatic representation of the flow. diff --git a/website/versioned_docs/version-0.14.1/writing_data.md b/website/versioned_docs/version-0.14.1/writing_data.md index 10226cf9747cb..b8613bac3d6b8 100644 --- a/website/versioned_docs/version-0.14.1/writing_data.md +++ b/website/versioned_docs/version-0.14.1/writing_data.md @@ -12,12 +12,12 @@ In this section, we will cover ways to ingest new changes from external sources Currently Hudi supports following ways to write the data. - [Hudi Streamer](/docs/hoodie_streaming_ingestion#hudi-streamer) - [Spark Hudi Datasource](#spark-datasource-writer) -- [Spark Structured Streaming](/docs/hoodie_streaming_ingestion#structured-streaming) -- [Spark SQL](/docs/next/sql_ddl#spark-sql) -- [Flink Writer](/docs/next/hoodie_streaming_ingestion#flink-ingestion) -- [Flink SQL](/docs/next/sql_ddl#flink) +- [Spark Structured Streaming](hoodie_streaming_ingestion#structured-streaming) +- [Spark SQL](sql_ddl#spark-sql) +- [Flink Writer](hoodie_streaming_ingestion#flink-ingestion) +- [Flink SQL](sql_ddl#flink) - [Java Writer](#java-writer) -- [Kafka Connect](/docs/next/hoodie_streaming_ingestion#kafka-connect-sink) +- [Kafka Connect](hoodie_streaming_ingestion#kafka-connect-sink) ## Spark Datasource Writer @@ -98,7 +98,7 @@ df.write.format("hudi"). You can check the data generated under `/tmp/hudi_trips_cow////`. We provided a record key (`uuid` in [schema](https://github.com/apache/hudi/blob/6f9b02decb5bb2b83709b1b6ec04a97e4d102c11/hudi-spark-datasource/hudi-spark/src/main/java/org/apache/hudi/QuickstartUtils.java#L60)), partition field (`region/country/city`) and combine logic (`ts` in [schema](https://github.com/apache/hudi/blob/6f9b02decb5bb2b83709b1b6ec04a97e4d102c11/hudi-spark-datasource/hudi-spark/src/main/java/org/apache/hudi/QuickstartUtils.java#L60)) to ensure trip records are unique within each partition. For more info, refer to -[Modeling data stored in Hudi](/docs/next/faq_general/#how-do-i-model-the-data-stored-in-hudi) +[Modeling data stored in Hudi](faq_general/#how-do-i-model-the-data-stored-in-hudi) and for info on ways to ingest data into Hudi, refer to [Writing Hudi Tables](/docs/writing_data). Here we are using the default write operation : `upsert`. 
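For illustration only (not part of this diff), the same datasource write with the operation spelled out explicitly instead of relying on the `upsert` default; the config keys are the standard `hoodie.datasource.write.*` options, and the table, path and columns are invented:

```scala
// Assumes a spark-shell with the Hudi Spark bundle; names below are illustrative.
import org.apache.spark.sql.SaveMode
import spark.implicits._

val df = Seq(("u2", "rider-B", "san_francisco", 27.70, 1695159652000L)).
  toDF("uuid", "rider", "city", "fare", "ts")

df.write.format("hudi").
  option("hoodie.table.name", "hudi_trips_cow").
  option("hoodie.datasource.write.recordkey.field", "uuid").
  option("hoodie.datasource.write.partitionpath.field", "city").
  option("hoodie.datasource.write.precombine.field", "ts").
  // upsert is the default; insert or bulk_insert trade update handling for faster ingest
  option("hoodie.datasource.write.operation", "bulk_insert").
  mode(SaveMode.Append).
  save("/tmp/hudi_trips_cow")
```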
If you have a workload without updates, you can also issue `insert` or `bulk_insert` operations which could be faster. To know more, refer to [Write operations](/docs/write_operations) @@ -134,7 +134,7 @@ df.write.format("hudi"). You can check the data generated under `/tmp/hudi_trips_cow////`. We provided a record key (`uuid` in [schema](https://github.com/apache/hudi/blob/2e6e302efec2fa848ded4f88a95540ad2adb7798/hudi-spark-datasource/hudi-spark/src/main/java/org/apache/hudi/QuickstartUtils.java#L60)), partition field (`region/country/city`) and combine logic (`ts` in [schema](https://github.com/apache/hudi/blob/2e6e302efec2fa848ded4f88a95540ad2adb7798/hudi-spark-datasource/hudi-spark/src/main/java/org/apache/hudi/QuickstartUtils.java#L60)) to ensure trip records are unique within each partition. For more info, refer to -[Modeling data stored in Hudi](/docs/next/faq_general/#how-do-i-model-the-data-stored-in-hudi) +[Modeling data stored in Hudi](faq_general/#how-do-i-model-the-data-stored-in-hudi) and for info on ways to ingest data into Hudi, refer to [Writing Hudi Tables](/docs/writing_data). Here we are using the default write operation : `upsert`. If you have a workload without updates, you can also issue `insert` or `bulk_insert` operations which could be faster. To know more, refer to [Write operations](/docs/write_operations) @@ -547,7 +547,7 @@ INSERT INTO hudi_table select ... from ...; Hudi Flink supports a new non-blocking concurrency control mode, where multiple writer tasks can be executed concurrently without blocking each other. One can read more about this mode in -the [concurrency control](/docs/next/concurrency_control#model-c-multi-writer) docs. Let us see it in action here. +the [concurrency control](concurrency_control#model-c-multi-writer) docs. Let us see it in action here. In the below example, we have two streaming ingestion pipelines that concurrently update the same table. One of the pipeline is responsible for the compaction and cleaning table services, while the other pipeline is just for data diff --git a/website/versioned_docs/version-0.15.0/cli.md b/website/versioned_docs/version-0.15.0/cli.md index 1c30b9b6fa6e0..7cc4cdd92b0c2 100644 --- a/website/versioned_docs/version-0.15.0/cli.md +++ b/website/versioned_docs/version-0.15.0/cli.md @@ -578,7 +578,7 @@ Compaction successfully repaired ### Savepoint and Restore As the name suggest, "savepoint" saves the table as of the commit time, so that it lets you restore the table to this -savepoint at a later point in time if need be. You can read more about savepoints and restore [here](/docs/next/disaster_recovery) +savepoint at a later point in time if need be. You can read more about savepoints and restore [here](disaster_recovery) To trigger savepoint for a hudi table ```java diff --git a/website/versioned_docs/version-0.15.0/concurrency_control.md b/website/versioned_docs/version-0.15.0/concurrency_control.md index 64c9af85b6690..78724b99165fa 100644 --- a/website/versioned_docs/version-0.15.0/concurrency_control.md +++ b/website/versioned_docs/version-0.15.0/concurrency_control.md @@ -77,7 +77,7 @@ Multiple writers can operate on the table with non-blocking conflict resolution. file group with the conflicts resolved automatically by the query reader and the compactor. The new concurrency mode is currently available for preview in version 1.0.0-beta only with the caveat that conflict resolution is not supported yet between clustering and ingestion. 
It works for compaction and ingestion, and we can see an example of that with Flink -writers [here](/docs/next/sql_dml#non-blocking-concurrency-control-experimental). +writers [here](sql_dml#non-blocking-concurrency-control-experimental). ## Enabling Multi Writing diff --git a/website/versioned_docs/version-0.15.0/deployment.md b/website/versioned_docs/version-0.15.0/deployment.md index 9bafde59c4658..7785f4ceaca1f 100644 --- a/website/versioned_docs/version-0.15.0/deployment.md +++ b/website/versioned_docs/version-0.15.0/deployment.md @@ -136,7 +136,7 @@ Here is an example invocation for reading from kafka topic in a continuous mode ### Spark Datasource Writer Jobs -As described in [Batch Writes](/docs/next/writing_data#spark-datasource-api), you can use spark datasource to ingest to hudi table. This mechanism allows you to ingest any spark dataframe in Hudi format. Hudi Spark DataSource also supports spark streaming to ingest a streaming source to Hudi table. For Merge On Read table types, inline compaction is turned on by default which runs after every ingestion run. The compaction frequency can be changed by setting the property "hoodie.compact.inline.max.delta.commits". +As described in [Batch Writes](writing_data#spark-datasource-api), you can use spark datasource to ingest to hudi table. This mechanism allows you to ingest any spark dataframe in Hudi format. Hudi Spark DataSource also supports spark streaming to ingest a streaming source to Hudi table. For Merge On Read table types, inline compaction is turned on by default which runs after every ingestion run. The compaction frequency can be changed by setting the property "hoodie.compact.inline.max.delta.commits". Here is an example invocation using spark datasource diff --git a/website/versioned_docs/version-0.15.0/faq.md b/website/versioned_docs/version-0.15.0/faq.md index 1378839b81cb8..b128401893094 100644 --- a/website/versioned_docs/version-0.15.0/faq.md +++ b/website/versioned_docs/version-0.15.0/faq.md @@ -6,7 +6,7 @@ keywords: [hudi, writing, reading] The FAQs are split into following pages. Please refer to the specific pages for more info. -- [General](/docs/next/faq_general) +- [General](faq_general) - [Design & Concepts](/docs/next/faq_design_and_concepts) - [Writing Tables](/docs/next/faq_writing_tables) - [Reading Tables](/docs/next/faq_reading_tables) diff --git a/website/versioned_docs/version-0.15.0/faq_general.md b/website/versioned_docs/version-0.15.0/faq_general.md index 2682d17e95068..e9d649e363ec8 100644 --- a/website/versioned_docs/version-0.15.0/faq_general.md +++ b/website/versioned_docs/version-0.15.0/faq_general.md @@ -61,7 +61,7 @@ Nonetheless, Hudi is designed very much like a database and provides similar fun ### How do I model the data stored in Hudi? -When writing data into Hudi, you model the records like how you would on a key-value store - specify a key field (unique for a single partition/across table), a partition field (denotes partition to place key into) and preCombine/combine logic that specifies how to handle duplicates in a batch of records written. This model enables Hudi to enforce primary key constraints like you would get on a database table. See [here](/docs/next/writing_data) for an example. 
+When writing data into Hudi, you model the records like how you would on a key-value store - specify a key field (unique for a single partition/across table), a partition field (denotes partition to place key into) and preCombine/combine logic that specifies how to handle duplicates in a batch of records written. This model enables Hudi to enforce primary key constraints like you would get on a database table. See [here](writing_data) for an example. When querying/reading data, Hudi just presents itself as a json-like hierarchical table, everyone is used to querying using Hive/Spark/Presto over Parquet/Json/Avro. diff --git a/website/versioned_docs/version-0.15.0/faq_table_services.md b/website/versioned_docs/version-0.15.0/faq_table_services.md index 0ca730094e4f1..7ff398687e392 100644 --- a/website/versioned_docs/version-0.15.0/faq_table_services.md +++ b/website/versioned_docs/version-0.15.0/faq_table_services.md @@ -50,6 +50,6 @@ Hudi runs cleaner to remove old file versions as part of writing data either in Yes. Hudi provides the ability to post a callback notification about a write commit. You can use a http hook or choose to -be notified via a Kafka/pulsar topic or plug in your own implementation to get notified. Please refer [here](/docs/next/platform_services_post_commit_callback) +be notified via a Kafka/pulsar topic or plug in your own implementation to get notified. Please refer [here](platform_services_post_commit_callback) for details diff --git a/website/versioned_docs/version-0.15.0/faq_writing_tables.md b/website/versioned_docs/version-0.15.0/faq_writing_tables.md index f898e0975fa1e..1edc6b06520c3 100644 --- a/website/versioned_docs/version-0.15.0/faq_writing_tables.md +++ b/website/versioned_docs/version-0.15.0/faq_writing_tables.md @@ -6,7 +6,7 @@ keywords: [hudi, writing, reading] ### What are some ways to write a Hudi table? -Typically, you obtain a set of partial updates/inserts from your source and issue [write operations](/docs/write_operations/) against a Hudi table. If you ingesting data from any of the standard sources like Kafka, or tailing DFS, the [delta streamer](/docs/hoodie_streaming_ingestion#hudi-streamer) tool is invaluable and provides an easy, self-managed solution to getting data written into Hudi. You can also write your own code to capture data from a custom source using the Spark datasource API and use a [Hudi datasource](/docs/next/writing_data#spark-datasource-api) to write into Hudi. +Typically, you obtain a set of partial updates/inserts from your source and issue [write operations](/docs/write_operations/) against a Hudi table. If you ingesting data from any of the standard sources like Kafka, or tailing DFS, the [delta streamer](/docs/hoodie_streaming_ingestion#hudi-streamer) tool is invaluable and provides an easy, self-managed solution to getting data written into Hudi. You can also write your own code to capture data from a custom source using the Spark datasource API and use a [Hudi datasource](writing_data#spark-datasource-api) to write into Hudi. ### How is a Hudi writer job deployed? @@ -68,7 +68,7 @@ As you could see, ([combineAndGetUpdateValue(), getInsertValue()](https://github ### How do I delete records in the dataset using Hudi? -GDPR has made deletes a must-have tool in everyone's data management toolbox. Hudi supports both soft and hard deletes. For details on how to actually perform them, see [here](/docs/next/writing_data#deletes). +GDPR has made deletes a must-have tool in everyone's data management toolbox. 
Hudi supports both soft and hard deletes. For details on how to actually perform them, see [here](writing_data#deletes). ### Should I need to worry about deleting all copies of the records in case of duplicates? @@ -147,7 +147,7 @@ a) **Auto Size small files during ingestion**: This solution trades ingest/writi Hudi has the ability to maintain a configured target file size, when performing **upsert/insert** operations. (Note: **bulk\_insert** operation does not provide this functionality and is designed as a simpler replacement for normal `spark.write.parquet` ) -For **copy-on-write**, this is as simple as configuring the [maximum size for a base/parquet file](/docs/configurations#hoodieparquetmaxfilesize) and the [soft limit](/docs/configurations#hoodieparquetsmallfilelimit) below which a file should be considered a small file. For the initial bootstrap to Hudi table, tuning record size estimate is also important to ensure sufficient records are bin-packed in a parquet file. For subsequent writes, Hudi automatically uses average record size based on previous commit. Hudi will try to add enough records to a small file at write time to get it to the configured maximum limit. For e.g , with `compactionSmallFileSize=100MB` and limitFileSize=120MB, Hudi will pick all files < 100MB and try to get them upto 120MB. +For **copy-on-write**, this is as simple as configuring the [maximum size for a base/parquet file](/docs/configurations#hoodieparquetmaxfilesize) and the [soft limit](/docs/configurations#hoodieparquetsmallfilelimit) below which a file should be considered a small file. For the initial bootstrap to Hudi table, tuning record size estimate is also important to ensure sufficient records are bin-packed in a parquet file. For subsequent writes, Hudi automatically uses average record size based on previous commit. Hudi will try to add enough records to a small file at write time to get it to the configured maximum limit. For e.g , with `hoodie.parquet.small.file.limit=100MB` and `hoodie.parquet.max.file.size=120MB`, Hudi will pick all files < 100MB and try to get them upto 120MB. For **merge-on-read**, there are few more configs to set. MergeOnRead works differently for different INDEX choices. @@ -183,7 +183,7 @@ No, Hudi does not expose uncommitted files/blocks to the readers. Further, Hudi ### How are conflicts detected in Hudi between multiple writers? -Hudi employs [optimistic concurrency control](/docs/concurrency_control#supported-concurrency-controls) between writers, while implementing MVCC based concurrency control between writers and the table services. Concurrent writers to the same table need to be configured with the same lock provider configuration, to safely perform writes. By default (implemented in “[SimpleConcurrentFileWritesConflictResolutionStrategy](https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/transaction/SimpleConcurrentFileWritesConflictResolutionStrategy.java)”), Hudi allows multiple writers to concurrently write data and commit to the timeline if there is no conflicting writes to the same underlying file group IDs. This is achieved by holding a lock, checking for changes that modified the same file IDs. Hudi then supports a pluggable interface “[ConflictResolutionStrategy](https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/transaction/ConflictResolutionStrategy.java)” that determines how conflicts are handled. 
By default, the later conflicting write is aborted. Hudi also support eager conflict detection to help speed up conflict detection and release cluster resources back early to reduce costs. +Hudi employs [optimistic concurrency control](concurrency_control) between writers, while implementing MVCC based concurrency control between writers and the table services. Concurrent writers to the same table need to be configured with the same lock provider configuration, to safely perform writes. By default (implemented in “[SimpleConcurrentFileWritesConflictResolutionStrategy](https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/transaction/SimpleConcurrentFileWritesConflictResolutionStrategy.java)”), Hudi allows multiple writers to concurrently write data and commit to the timeline if there is no conflicting writes to the same underlying file group IDs. This is achieved by holding a lock, checking for changes that modified the same file IDs. Hudi then supports a pluggable interface “[ConflictResolutionStrategy](https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/transaction/ConflictResolutionStrategy.java)” that determines how conflicts are handled. By default, the later conflicting write is aborted. Hudi also support eager conflict detection to help speed up conflict detection and release cluster resources back early to reduce costs. ### Can single-writer inserts have duplicates? diff --git a/website/versioned_docs/version-0.15.0/file_layouts.md b/website/versioned_docs/version-0.15.0/file_layouts.md index 478130fbd7e7a..6fca19dec4910 100644 --- a/website/versioned_docs/version-0.15.0/file_layouts.md +++ b/website/versioned_docs/version-0.15.0/file_layouts.md @@ -10,8 +10,8 @@ The following describes the general file layout structure for Apache Hudi. Pleas * Each file group contains several file slices * Each slice contains a base file (*.parquet/*.orc) (defined by the config - [hoodie.table.base.file.format](https://hudi.apache.org/docs/next/configurations/#hoodietablebasefileformat) ) produced at a certain commit/compaction instant time, along with set of log files (*.log.*) that contain inserts/updates to the base file since the base file was produced. -Hudi adopts Multiversion Concurrency Control (MVCC), where [compaction](/docs/next/compaction) action merges logs and base files to produce new -file slices and [cleaning](/docs/next/cleaning) action gets rid of unused/older file slices to reclaim space on the file system. +Hudi adopts Multiversion Concurrency Control (MVCC), where [compaction](compaction) action merges logs and base files to produce new +file slices and [cleaning](hoodie_cleaner) action gets rid of unused/older file slices to reclaim space on the file system. ![Partition On HDFS](/assets/images/MOR_new.png) diff --git a/website/versioned_docs/version-0.15.0/file_sizing.md b/website/versioned_docs/version-0.15.0/file_sizing.md index c637a5a630cc3..a451b09b6c58f 100644 --- a/website/versioned_docs/version-0.15.0/file_sizing.md +++ b/website/versioned_docs/version-0.15.0/file_sizing.md @@ -148,7 +148,7 @@ while the clustering service runs. :::note Hudi always creates immutable files on storage. To be able to do auto-sizing or clustering, Hudi will always create a -newer version of the smaller file, resulting in 2 versions of the same file. The [cleaner service](/docs/next/cleaning) +newer version of the smaller file, resulting in 2 versions of the same file. 
The [cleaner service](hoodie_cleaner) will later kick in and delete the older version small file and keep the latest one. ::: diff --git a/website/versioned_docs/version-0.15.0/flink-quick-start-guide.md b/website/versioned_docs/version-0.15.0/flink-quick-start-guide.md index c18b0ec1838c5..cc4cfe9005a49 100644 --- a/website/versioned_docs/version-0.15.0/flink-quick-start-guide.md +++ b/website/versioned_docs/version-0.15.0/flink-quick-start-guide.md @@ -448,19 +448,19 @@ feature is that it now lets you author streaming pipelines on streaming or batch ## Where To Go From Here? - **Quick Start** : Read [Quick Start](#quick-start) to get started quickly Flink sql client to write to(read from) Hudi. -- **Configuration** : For [Global Configuration](/docs/next/flink_tuning#global-configurations), sets up through `$FLINK_HOME/conf/flink-conf.yaml`. For per job configuration, sets up through [Table Option](/docs/next/flink_tuning#table-options). -- **Writing Data** : Flink supports different modes for writing, such as [CDC Ingestion](/docs/next/ingestion_flink#cdc-ingestion), [Bulk Insert](/docs/next/ingestion_flink#bulk-insert), [Index Bootstrap](/docs/next/ingestion_flink#index-bootstrap), [Changelog Mode](/docs/next/ingestion_flink#changelog-mode) and [Append Mode](/docs/next/ingestion_flink#append-mode). Flink also supports multiple streaming writers with [non-blocking concurrency control](/docs/next/sql_dml#non-blocking-concurrency-control-experimental). -- **Reading Data** : Flink supports different modes for reading, such as [Streaming Query](/docs/sql_queries#streaming-query) and [Incremental Query](/docs/sql_queries#incremental-query). -- **Tuning** : For write/read tasks, this guide gives some tuning suggestions, such as [Memory Optimization](/docs/next/flink_tuning#memory-optimization) and [Write Rate Limit](/docs/next/flink_tuning#write-rate-limit). +- **Configuration** : For [Global Configuration](flink_tuning#global-configurations), sets up through `$FLINK_HOME/conf/flink-conf.yaml`. For per job configuration, sets up through [Table Option](flink_tuning#table-options). +- **Writing Data** : Flink supports different modes for writing, such as [CDC Ingestion](ingestion_flink#cdc-ingestion), [Bulk Insert](ingestion_flink#bulk-insert), [Index Bootstrap](ingestion_flink#index-bootstrap), [Changelog Mode](ingestion_flink#changelog-mode) and [Append Mode](ingestion_flink#append-mode). Flink also supports multiple streaming writers with [non-blocking concurrency control](sql_dml#non-blocking-concurrency-control-experimental). +- **Reading Data** : Flink supports different modes for reading, such as [Streaming Query](sql_queries#streaming-query) and [Incremental Query](/docs/sql_queries#incremental-query). +- **Tuning** : For write/read tasks, this guide gives some tuning suggestions, such as [Memory Optimization](flink_tuning#memory-optimization) and [Write Rate Limit](flink_tuning#write-rate-limit). - **Optimization**: Offline compaction is supported [Offline Compaction](/docs/compaction#flink-offline-compaction). -- **Query Engines**: Besides Flink, many other engines are integrated: [Hive Query](/docs/syncing_metastore#flink-setup), [Presto Query](/docs/querying_data#prestodb). +- **Query Engines**: Besides Flink, many other engines are integrated: [Hive Query](/docs/syncing_metastore#flink-setup), [Presto Query](sql_queries#presto). - **Catalog**: A Hudi specific catalog is supported: [Hudi Catalog](/docs/sql_ddl/#create-catalog). 
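Tying together the auto file sizing and cleaner behavior described in the file sizing note above, a hedged sketch of the relevant knobs on a Spark write is shown below; the values are only illustrative defaults, and `df`, the table name and the path are placeholders.

```scala
// Sketch only: tune small-file padding and cleaning retention on a Hudi write.
// Record key/partition/precombine options are omitted for brevity (see the earlier sketch);
// `df`, the table name and the path are placeholders.
import org.apache.spark.sql.SaveMode

df.write.format("hudi").
  option("hoodie.table.name", "hudi_table").
  option("hoodie.parquet.small.file.limit", "104857600").  // ~100MB: files below this are treated as small and padded
  option("hoodie.parquet.max.file.size", "125829120").     // ~120MB: target upper bound for base files
  option("hoodie.cleaner.commits.retained", "10").         // cleaner keeps file slices of the last 10 commits
  mode(SaveMode.Append).
  save("/tmp/hudi_table")
```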
If you are relatively new to Apache Hudi, it is important to be familiar with a few core concepts: - - [Hudi Timeline](/docs/next/timeline) – How Hudi manages transactions and other table services - - [Hudi File Layout](/docs/next/storage_layouts) - How the files are laid out on storage - - [Hudi Table Types](/docs/next/table_types) – `COPY_ON_WRITE` and `MERGE_ON_READ` - - [Hudi Query Types](/docs/next/table_types#query-types) – Snapshot Queries, Incremental Queries, Read-Optimized Queries + - [Hudi Timeline](timeline) – How Hudi manages transactions and other table services + - [Hudi File Layout](file_layouts) - How the files are laid out on storage + - [Hudi Table Types](table_types) – `COPY_ON_WRITE` and `MERGE_ON_READ` + - [Hudi Query Types](table_types#query-types) – Snapshot Queries, Incremental Queries, Read-Optimized Queries See more in the "Concepts" section of the docs. diff --git a/website/versioned_docs/version-0.15.0/hudi_stack.md b/website/versioned_docs/version-0.15.0/hudi_stack.md index 203e8ce5947d8..ab856396db794 100644 --- a/website/versioned_docs/version-0.15.0/hudi_stack.md +++ b/website/versioned_docs/version-0.15.0/hudi_stack.md @@ -93,7 +93,7 @@ Hudi provides snapshot isolation for writers and readers, enabling consistent ta ![Platform Services](/assets/images/blog/hudistack/platform_2.png)

Figure: Various platform services in Hudi

-Platform services offer functionality that is specific to data and workloads, and they sit directly on top of the table services, interfacing with writers and readers. Services, like [Hudi Streamer](./hoodie_streaming_ingestion#hudi-streamer), are specialized in handling data and workloads, seamlessly integrating with Kafka streams and various formats to build data lakes. They support functionalities like automatic checkpoint management, integration with major schema registries (including Confluent), and deduplication of data. Hudi Streamer also offers features for backfills, one-off runs, and continuous mode operation with Spark/Flink streaming writers. Additionally, Hudi provides tools for [snapshotting](./snapshot_exporter) and incrementally [exporting](./snapshot_exporter#examples) Hudi tables, importing new tables, and [post-commit callback](/docs/next/platform_services_post_commit_callback) for analytics or workflow management, enhancing the deployment of production-grade incremental pipelines. Apart from these services, Hudi also provides broad support for different catalogs such as [Hive Metastore](./syncing_metastore), [AWS Glue](./syncing_aws_glue_data_catalog/), [Google BigQuery](./gcp_bigquery), [DataHub](./syncing_datahub), etc. that allows syncing of Hudi tables to be queried by interactive engines such as Trino and Presto. +Platform services offer functionality that is specific to data and workloads, and they sit directly on top of the table services, interfacing with writers and readers. Services, like [Hudi Streamer](./hoodie_streaming_ingestion#hudi-streamer), are specialized in handling data and workloads, seamlessly integrating with Kafka streams and various formats to build data lakes. They support functionalities like automatic checkpoint management, integration with major schema registries (including Confluent), and deduplication of data. Hudi Streamer also offers features for backfills, one-off runs, and continuous mode operation with Spark/Flink streaming writers. Additionally, Hudi provides tools for [snapshotting](./snapshot_exporter) and incrementally [exporting](./snapshot_exporter#examples) Hudi tables, importing new tables, and [post-commit callback](platform_services_post_commit_callback) for analytics or workflow management, enhancing the deployment of production-grade incremental pipelines. Apart from these services, Hudi also provides broad support for different catalogs such as [Hive Metastore](./syncing_metastore), [AWS Glue](./syncing_aws_glue_data_catalog/), [Google BigQuery](./gcp_bigquery), [DataHub](./syncing_datahub), etc. that allows syncing of Hudi tables to be queried by interactive engines such as Trino and Presto. ### Query Engines Apache Hudi is compatible with a wide array of query engines, catering to various analytical needs. For distributed ETL batch processing, Apache Spark is frequently utilized, leveraging its efficient handling of large-scale data. In the realm of streaming use cases, compute engines such as Apache Flink and Apache Spark's Structured Streaming provide robust support when paired with Hudi. Moreover, Hudi supports modern data lake query engines such as Trino and Presto, as well as modern analytical databases such as ClickHouse and StarRocks. This diverse support of compute engines positions Apache Hudi as a flexible and adaptable platform for a broad spectrum of use cases. 
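As a rough illustration of the post-commit callback mentioned above, the sketch below enables an HTTP callback on a Spark write; the config keys are recalled from the configurations reference and should be verified there, and the endpoint, DataFrame, table name and path are hypothetical.

```scala
// Sketch, assuming the commit-callback config keys below exist as named;
// the endpoint URL, `df`, table name and path are hypothetical.
import org.apache.spark.sql.SaveMode

df.write.format("hudi").
  option("hoodie.table.name", "hudi_table").
  option("hoodie.write.commit.callback.on", "true").                                   // turn on post-commit callbacks
  option("hoodie.write.commit.callback.http.url", "https://example.com/hudi-commits"). // hypothetical HTTP endpoint
  mode(SaveMode.Append).
  save("/tmp/hudi_table")
```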
\ No newline at end of file diff --git a/website/versioned_docs/version-0.15.0/indexing.md b/website/versioned_docs/version-0.15.0/indexing.md index 009750d95966f..4a233123530dc 100644 --- a/website/versioned_docs/version-0.15.0/indexing.md +++ b/website/versioned_docs/version-0.15.0/indexing.md @@ -11,9 +11,9 @@ Hudi provides efficient upserts, by mapping a given hoodie key (record key + par This mapping between record key and file group/file id, never changes once the first version of a record has been written to a file. In short, the mapped file group contains all versions of a group of records. -For [Copy-On-Write tables](/docs/next/table_types#copy-on-write-table), this enables fast upsert/delete operations, by +For [Copy-On-Write tables](table_types#copy-on-write-table), this enables fast upsert/delete operations, by avoiding the need to join against the entire dataset to determine which files to rewrite. -For [Merge-On-Read tables](/docs/next/table_types#merge-on-read-table), this design allows Hudi to bound the amount of +For [Merge-On-Read tables](table_types#merge-on-read-table), this design allows Hudi to bound the amount of records any given base file needs to be merged against. Specifically, a given base file needs to merged only against updates for records that are part of that base file. In contrast, designs without an indexing component (e.g: [Apache Hive ACID](https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions)), diff --git a/website/versioned_docs/version-0.15.0/metadata_indexing.md b/website/versioned_docs/version-0.15.0/metadata_indexing.md index 5b96ed07bd407..2a0bbfca06f04 100644 --- a/website/versioned_docs/version-0.15.0/metadata_indexing.md +++ b/website/versioned_docs/version-0.15.0/metadata_indexing.md @@ -78,8 +78,8 @@ us schedule the indexing for COLUMN_STATS index. First we need to define a prope As mentioned before, metadata indices are pluggable. One can add any index at any point in time depending on changing business requirements. Some configurations to enable particular indices are listed below. Currently, available indices under -metadata table can be explored [here](/docs/next/metadata#metadata-table-indices) along with [configs](/docs/next/metadata#enable-hudi-metadata-table-and-multi-modal-index-in-write-side) -to enable them. The full set of metadata configurations can be explored [here](/docs/next/configurations/#Metadata-Configs). +metadata table can be explored [here](metadata#metadata-table-indices) along with [configs](metadata#enable-hudi-metadata-table-and-multi-modal-index-in-write-side) +to enable them. The full set of metadata configurations can be explored [here](configurations/#Metadata-Configs). :::note Enabling the metadata table and configuring a lock provider are the prerequisites for using async indexer. Checkout a sample diff --git a/website/versioned_docs/version-0.15.0/precommit_validator.md b/website/versioned_docs/version-0.15.0/precommit_validator.md index 5e13fca3dc0e2..d5faf61057dee 100644 --- a/website/versioned_docs/version-0.15.0/precommit_validator.md +++ b/website/versioned_docs/version-0.15.0/precommit_validator.md @@ -91,7 +91,7 @@ void validateRecordsBeforeAndAfter(Dataset before, ``` ## Additional Monitoring with Notifications -Hudi offers a [commit notification service](/docs/next/platform_services_post_commit_callback) that can be configured to trigger notifications about write commits. 
+Hudi offers a [commit notification service](platform_services_post_commit_callback) that can be configured to trigger notifications about write commits. The commit notification service can be combined with pre-commit validators to send a notification when a commit fails a validation. This is possible by passing details about the validation as a custom value to the HTTP endpoint. diff --git a/website/versioned_docs/version-0.15.0/procedures.md b/website/versioned_docs/version-0.15.0/procedures.md index 4877e7d300127..938ec385d8a33 100644 --- a/website/versioned_docs/version-0.15.0/procedures.md +++ b/website/versioned_docs/version-0.15.0/procedures.md @@ -472,10 +472,10 @@ archive commits. |------------------------------------------------------------------------|---------|----------|---------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | table | String | N | None | Hudi table name | | path | String | N | None | Path of table | -| [min_commits](/docs/next/configurations#hoodiekeepmincommits) | Int | N | 20 | Similar to hoodie.keep.max.commits, but controls the minimum number of instants to retain in the active timeline. | -| [max_commits](/docs/next/configurations#hoodiekeepmaxcommits) | Int | N | 30 | Archiving service moves older entries from timeline into an archived log after each write, to keep the metadata overhead constant, even as the table size grows. This config controls the maximum number of instants to retain in the active timeline. | -| [retain_commits](/docs/next/configurations#hoodiecommitsarchivalbatch) | Int | N | 10 | Archiving of instants is batched in best-effort manner, to pack more instants into a single archive log. This config controls such archival batch size. | -| [enable_metadata](/docs/next/configurations#hoodiemetadataenable) | Boolean | N | false | Enable the internal metadata table | +| [min_commits](configurations#hoodiekeepmincommits) | Int | N | 20 | Similar to hoodie.keep.max.commits, but controls the minimum number of instants to retain in the active timeline. | +| [max_commits](configurations#hoodiekeepmaxcommits) | Int | N | 30 | Archiving service moves older entries from timeline into an archived log after each write, to keep the metadata overhead constant, even as the table size grows. This config controls the maximum number of instants to retain in the active timeline. | +| [retain_commits](configurations#hoodiecommitsarchivalbatch) | Int | N | 10 | Archiving of instants is batched in best-effort manner, to pack more instants into a single archive log. This config controls such archival batch size. | +| [enable_metadata](configurations#hoodiemetadataenable) | Boolean | N | false | Enable the internal metadata table | **Output** @@ -672,7 +672,7 @@ copy table to a temporary view. 
| Parameter Name | Type | Required | Default Value | Description | |-------------------------------------------------------------------|---------|----------|---------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | table | String | Y | None | Hudi table name | -| [query_type](/docs/next/configurations#hoodiedatasourcequerytype) | String | N | "snapshot" | Whether data needs to be read, in `incremental` mode (new data since an instantTime) (or) `read_optimized` mode (obtain latest view, based on base files) (or) `snapshot` mode (obtain latest view, by merging base and (if any) log files) | +| [query_type](configurations#hoodiedatasourcequerytype) | String | N | "snapshot" | Whether data needs to be read, in `incremental` mode (new data since an instantTime) (or) `read_optimized` mode (obtain latest view, based on base files) (or) `snapshot` mode (obtain latest view, by merging base and (if any) log files) | | view_name | String | Y | None | Name of view | | begin_instance_time | String | N | "" | Begin instance time | | end_instance_time | String | N | "" | End instance time | @@ -705,7 +705,7 @@ copy table to a new table. | Parameter Name | Type | Required | Default Value | Description | |-------------------------------------------------------------------|--------|----------|---------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | table | String | Y | None | Hudi table name | -| [query_type](/docs/next/configurations#hoodiedatasourcequerytype) | String | N | "snapshot" | Whether data needs to be read, in `incremental` mode (new data since an instantTime) (or) `read_optimized` mode (obtain latest view, based on base files) (or) `snapshot` mode (obtain latest view, by merging base and (if any) log files) | +| [query_type](configurations#hoodiedatasourcequerytype) | String | N | "snapshot" | Whether data needs to be read, in `incremental` mode (new data since an instantTime) (or) `read_optimized` mode (obtain latest view, based on base files) (or) `snapshot` mode (obtain latest view, by merging base and (if any) log files) | | new_table | String | Y | None | Name of new table | | begin_instance_time | String | N | "" | Begin instance time | | end_instance_time | String | N | "" | End instance time | @@ -1533,13 +1533,13 @@ Run cleaner on a hoodie table. 
|---------------------------------------------------------------------------------------|---------|----------|---------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | table | String | Y | None | Name of table to be cleaned | | schedule_in_line | Boolean | N | true | Set "true" if you want to schedule and run a clean. Set false if you have already scheduled a clean and want to run that. | -| [clean_policy](/docs/next/configurations#hoodiecleanerpolicy) | String | N | None | org.apache.hudi.common.model.HoodieCleaningPolicy: Cleaning policy to be used. The cleaner service deletes older file slices files to re-claim space. Long running query plans may often refer to older file slices and will break if those are cleaned, before the query has had a chance to run. So, it is good to make sure that the data is retained for more than the maximum query execution time. By default, the cleaning policy is determined based on one of the following configs explicitly set by the user (at most one of them can be set; otherwise, KEEP_LATEST_COMMITS cleaning policy is used). KEEP_LATEST_FILE_VERSIONS: keeps the last N versions of the file slices written; used when "hoodie.cleaner.fileversions.retained" is explicitly set only. KEEP_LATEST_COMMITS(default): keeps the file slices written by the last N commits; used when "hoodie.cleaner.commits.retained" is explicitly set only. KEEP_LATEST_BY_HOURS: keeps the file slices written in the last N hours based on the commit time; used when "hoodie.cleaner.hours.retained" is explicitly set only. | -| [retain_commits](/docs/next/configurations#hoodiecleanercommitsretained) | Int | N | None | When KEEP_LATEST_COMMITS cleaning policy is used, the number of commits to retain, without cleaning. This will be retained for num_of_commits * time_between_commits (scheduled). This also directly translates into how much data retention the table supports for incremental queries. | -| [hours_retained](/docs/next/configurations#hoodiecleanerhoursretained) | Int | N | None | When KEEP_LATEST_BY_HOURS cleaning policy is used, the number of hours for which commits need to be retained. This config provides a more flexible option as compared to number of commits retained for cleaning service. Setting this property ensures all the files, but the latest in a file group, corresponding to commits with commit times older than the configured number of hours to be retained are cleaned. 
| -| [file_versions_retained](/docs/next/configurations#hoodiecleanerfileversionsretained) | Int | N | None | When KEEP_LATEST_FILE_VERSIONS cleaning policy is used, the minimum number of file slices to retain in each file group, during cleaning. | -| [trigger_strategy](/docs/next/configurations#hoodiecleantriggerstrategy) | String | N | None | org.apache.hudi.table.action.clean.CleaningTriggerStrategy: Controls when cleaning is scheduled. NUM_COMMITS(default): Trigger the cleaning service every N commits, determined by `hoodie.clean.max.commits` | -| [trigger_max_commits](/docs/next/configurations/#hoodiecleanmaxcommits) | Int | N | None | Number of commits after the last clean operation, before scheduling of a new clean is attempted. | -| [options](/docs/next/configurations/#Clean-Configs) | String | N | None | comma separated list of Hudi configs for cleaning in the format "config1=value1,config2=value2" | +| [clean_policy](configurations#hoodiecleanerpolicy) | String | N | None | org.apache.hudi.common.model.HoodieCleaningPolicy: Cleaning policy to be used. The cleaner service deletes older file slices files to re-claim space. Long running query plans may often refer to older file slices and will break if those are cleaned, before the query has had a chance to run. So, it is good to make sure that the data is retained for more than the maximum query execution time. By default, the cleaning policy is determined based on one of the following configs explicitly set by the user (at most one of them can be set; otherwise, KEEP_LATEST_COMMITS cleaning policy is used). KEEP_LATEST_FILE_VERSIONS: keeps the last N versions of the file slices written; used when "hoodie.cleaner.fileversions.retained" is explicitly set only. KEEP_LATEST_COMMITS(default): keeps the file slices written by the last N commits; used when "hoodie.cleaner.commits.retained" is explicitly set only. KEEP_LATEST_BY_HOURS: keeps the file slices written in the last N hours based on the commit time; used when "hoodie.cleaner.hours.retained" is explicitly set only. | +| [retain_commits](configurations#hoodiecleanercommitsretained) | Int | N | None | When KEEP_LATEST_COMMITS cleaning policy is used, the number of commits to retain, without cleaning. This will be retained for num_of_commits * time_between_commits (scheduled). This also directly translates into how much data retention the table supports for incremental queries. | +| [hours_retained](configurations#hoodiecleanerhoursretained) | Int | N | None | When KEEP_LATEST_BY_HOURS cleaning policy is used, the number of hours for which commits need to be retained. This config provides a more flexible option as compared to number of commits retained for cleaning service. Setting this property ensures all the files, but the latest in a file group, corresponding to commits with commit times older than the configured number of hours to be retained are cleaned. | +| [file_versions_retained](configurations#hoodiecleanerfileversionsretained) | Int | N | None | When KEEP_LATEST_FILE_VERSIONS cleaning policy is used, the minimum number of file slices to retain in each file group, during cleaning. | +| [trigger_strategy](configurations#hoodiecleantriggerstrategy) | String | N | None | org.apache.hudi.table.action.clean.CleaningTriggerStrategy: Controls when cleaning is scheduled. 
NUM_COMMITS(default): Trigger the cleaning service every N commits, determined by `hoodie.clean.max.commits` | +| [trigger_max_commits](configurations/#hoodiecleanmaxcommits) | Int | N | None | Number of commits after the last clean operation, before scheduling of a new clean is attempted. | +| [options](configurations/#Clean-Configs) | String | N | None | comma separated list of Hudi configs for cleaning in the format "config1=value1,config2=value2" | **Output** @@ -1633,12 +1633,12 @@ Sync the table's latest schema to Hive metastore. | metastore_uri | String | N | "" | Metastore_uri | | username | String | N | "" | User name | | password | String | N | "" | Password | -| [use_jdbc](/docs/next/configurations#hoodiedatasourcehive_syncuse_jdbc) | String | N | "" | Use JDBC when hive synchronization is enabled | -| [mode](/docs/next/configurations#hoodiedatasourcehive_syncmode) | String | N | "" | Mode to choose for Hive ops. Valid values are hms, jdbc and hiveql. | -| [partition_fields](/docs/next/configurations#hoodiedatasourcehive_syncpartition_fields) | String | N | "" | Field in the table to use for determining hive partition columns. | | -| [partition_extractor_class](/docs/next/configurations#hoodiedatasourcehive_syncpartition_extractor_class) | String | N | "" | Class which implements PartitionValueExtractor to extract the partition values, default 'org.apache.hudi.hive.MultiPartKeysValueExtractor'. | -| [strategy](/docs/next/configurations#hoodiedatasourcehive_synctablestrategy) | String | N | "" | Hive table synchronization strategy. Available option: RO, RT, ALL. | -| [sync_incremental](/docs/next/configurations#hoodiemetasyncincremental) | String | N | "" | Whether to incrementally sync the partitions to the metastore, i.e., only added, changed, and deleted partitions based on the commit metadata. If set to `false`, the meta sync executes a full partition sync operation when partitions are lost. | +| [use_jdbc](configurations#hoodiedatasourcehive_syncuse_jdbc) | String | N | "" | Use JDBC when hive synchronization is enabled | +| [mode](configurations#hoodiedatasourcehive_syncmode) | String | N | "" | Mode to choose for Hive ops. Valid values are hms, jdbc and hiveql. | +| [partition_fields](configurations#hoodiedatasourcehive_syncpartition_fields) | String | N | "" | Field in the table to use for determining hive partition columns. | | +| [partition_extractor_class](configurations#hoodiedatasourcehive_syncpartition_extractor_class) | String | N | "" | Class which implements PartitionValueExtractor to extract the partition values, default 'org.apache.hudi.hive.MultiPartKeysValueExtractor'. | +| [strategy](configurations#hoodiedatasourcehive_synctablestrategy) | String | N | "" | Hive table synchronization strategy. Available option: RO, RT, ALL. | +| [sync_incremental](configurations#hoodiemetasyncincremental) | String | N | "" | Whether to incrementally sync the partitions to the metastore, i.e., only added, changed, and deleted partitions based on the commit metadata. If set to `false`, the meta sync executes a full partition sync operation when partitions are lost. | @@ -1848,18 +1848,18 @@ Convert an existing table to Hudi. 
|------------------------------------------------------------------------------|---------|----------|-------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | table | String | Y | None | Name of table to be clustered | | table_type | String | Y | None | Table type, MERGE_ON_READ or COPY_ON_WRITE | -| [bootstrap_path](/docs/next/configurations#hoodiebootstrapbasepath) | String | Y | None | Base path of the dataset that needs to be bootstrapped as a Hudi table | +| [bootstrap_path](configurations#hoodiebootstrapbasepath) | String | Y | None | Base path of the dataset that needs to be bootstrapped as a Hudi table | | base_path | String | Y | None | Base path | | rowKey_field | String | Y | None | Primary key field | | base_file_format | String | N | "PARQUET" | Format of base file | | partition_path_field | String | N | "" | Partitioned column field | -| [bootstrap_index_class](/docs/next/configurations#hoodiebootstrapindexclass) | String | N | "org.apache.hudi.common.bootstrap.index.HFileBootstrapIndex" | Implementation to use, for mapping a skeleton base file to a bootstrap base file. | -| [selector_class](/docs/next/configurations#hoodiebootstrapmodeselector) | String | N | "org.apache.hudi.client.bootstrap.selector.MetadataOnlyBootstrapModeSelector" | Selects the mode in which each file/partition in the bootstrapped dataset gets bootstrapped | +| [bootstrap_index_class](configurations#hoodiebootstrapindexclass) | String | N | "org.apache.hudi.common.bootstrap.index.HFileBootstrapIndex" | Implementation to use, for mapping a skeleton base file to a bootstrap base file. | +| [selector_class](configurations#hoodiebootstrapmodeselector) | String | N | "org.apache.hudi.client.bootstrap.selector.MetadataOnlyBootstrapModeSelector" | Selects the mode in which each file/partition in the bootstrapped dataset gets bootstrapped | | key_generator_class | String | N | "org.apache.hudi.keygen.SimpleKeyGenerator" | Class of key generator | | full_bootstrap_input_provider | String | N | "org.apache.hudi.bootstrap.SparkParquetBootstrapDataProvider" | Class of full bootstrap input provider | | schema_provider_class | String | N | "" | Class of schema provider | | payload_class | String | N | "org.apache.hudi.common.model.OverwriteWithLatestAvroPayload" | Class of payload | -| [parallelism](/docs/next/configurations#hoodiebootstrapparallelism) | Int | N | 1500 | For metadata-only bootstrap, Hudi parallelizes the operation so that each table partition is handled by one Spark task. This config limits the number of parallelism. We pick the configured parallelism if the number of table partitions is larger than this configured value. The parallelism is assigned to the number of table partitions if it is smaller than the configured value. 
For full-record bootstrap, i.e., BULK_INSERT operation of the records, this configured value is passed as the BULK_INSERT shuffle parallelism (`hoodie.bulkinsert.shuffle.parallelism`), determining the BULK_INSERT write behavior. If you see that the bootstrap is slow due to the limited parallelism, you can increase this. | +| [parallelism](configurations#hoodiebootstrapparallelism) | Int | N | 1500 | For metadata-only bootstrap, Hudi parallelizes the operation so that each table partition is handled by one Spark task. This config limits the number of parallelism. We pick the configured parallelism if the number of table partitions is larger than this configured value. The parallelism is assigned to the number of table partitions if it is smaller than the configured value. For full-record bootstrap, i.e., BULK_INSERT operation of the records, this configured value is passed as the BULK_INSERT shuffle parallelism (`hoodie.bulkinsert.shuffle.parallelism`), determining the BULK_INSERT write behavior. If you see that the bootstrap is slow due to the limited parallelism, you can increase this. | | enable_hive_sync | Boolean | N | false | Whether to enable hive sync | | props_file_path | String | N | "" | Path of properties file | | bootstrap_overwrite | Boolean | N | false | Overwrite bootstrap path | diff --git a/website/versioned_docs/version-0.15.0/querying_data.md b/website/versioned_docs/version-0.15.0/querying_data.md index 31069822df761..83a03a4a1121f 100644 --- a/website/versioned_docs/version-0.15.0/querying_data.md +++ b/website/versioned_docs/version-0.15.0/querying_data.md @@ -7,7 +7,7 @@ last_modified_at: 2019-12-30T15:59:57-04:00 --- :::danger -This page is no longer maintained. Please refer to Hudi [SQL DDL](/docs/next/sql_ddl), [SQL DML](/docs/next/sql_dml), [SQL Queries](/docs/next/sql_queries) and [Procedures](/docs/next/procedures) for the latest documentation. +This page is no longer maintained. Please refer to Hudi [SQL DDL](sql_ddl), [SQL DML](sql_dml), [SQL Queries](sql_queries) and [Procedures](procedures) for the latest documentation. ::: Conceptually, Hudi stores data physically once on DFS, while providing 3 different ways of querying, as explained [before](/docs/concepts#query-types). diff --git a/website/versioned_docs/version-0.15.0/quick-start-guide.md b/website/versioned_docs/version-0.15.0/quick-start-guide.md index 0ff3225da1cad..b100fb9a85e5c 100644 --- a/website/versioned_docs/version-0.15.0/quick-start-guide.md +++ b/website/versioned_docs/version-0.15.0/quick-start-guide.md @@ -255,7 +255,7 @@ CREATE TABLE hudi_table ( PARTITIONED BY (city); ``` -For more options for creating Hudi tables or if you're running into any issues, please refer to [SQL DDL](/docs/next/sql_ddl) reference guide. +For more options for creating Hudi tables or if you're running into any issues, please refer to [SQL DDL](sql_ddl) reference guide.
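As a companion to the quickstart DDL above, the sketch below (issued via `spark.sql` from the Spark shell) shows how a record key and precombine field are typically declared as table properties; the column list mirrors the quickstart example, and the property names are assumptions to confirm against the SQL DDL guide.

```scala
// Sketch: declare the record key and precombine field as table properties at create time.
// Column list mirrors the quickstart table; property names are assumptions to verify.
spark.sql(
  """
    |CREATE TABLE IF NOT EXISTS hudi_table (
    |  ts BIGINT,
    |  uuid STRING,
    |  rider STRING,
    |  driver STRING,
    |  fare DOUBLE,
    |  city STRING
    |) USING hudi
    |PARTITIONED BY (city)
    |TBLPROPERTIES (
    |  type = 'cow',
    |  primaryKey = 'uuid',
    |  preCombineField = 'ts'
    |)
    |""".stripMargin)
```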
@@ -299,7 +299,7 @@ inserts.write.format("hudi"). ``` :::info Mapping to Hudi write operations -Hudi provides a wide range of [write operations](/docs/next/write_operations) - both batch and incremental - to write data into Hudi tables, +Hudi provides a wide range of [write operations](write_operations) - both batch and incremental - to write data into Hudi tables, with different semantics and performance. When record keys are not configured (see [keys](#keys) below), `bulk_insert` will be chosen as the write operation, matching the out-of-behavior of Spark's Parquet Datasource. ::: @@ -332,7 +332,7 @@ inserts.write.format("hudi"). \ ``` :::info Mapping to Hudi write operations -Hudi provides a wide range of [write operations](/docs/next/write_operations) - both batch and incremental - to write data into Hudi tables, +Hudi provides a wide range of [write operations](write_operations) - both batch and incremental - to write data into Hudi tables, with different semantics and performance. When record keys are not configured (see [keys](#keys) below), `bulk_insert` will be chosen as the write operation, matching the out-of-behavior of Spark's Parquet Datasource. ::: @@ -341,7 +341,7 @@ the write operation, matching the out-of-behavior of Spark's Parquet Datasource. -Users can use 'INSERT INTO' to insert data into a Hudi table. See [Insert Into](/docs/next/sql_dml#insert-into) for more advanced options. +Users can use 'INSERT INTO' to insert data into a Hudi table. See [Insert Into](sql_dml#insert-into) for more advanced options. ```sql INSERT INTO hudi_table @@ -453,7 +453,7 @@ Notice that the save mode is now `Append`. In general, always use append mode un -Hudi table can be update using a regular UPDATE statement. See [Update](/docs/next/sql_dml#update) for more advanced options. +Hudi table can be update using a regular UPDATE statement. See [Update](sql_dml#update) for more advanced options. ```sql UPDATE hudi_table SET fare = 25.0 WHERE rider = 'rider-D'; @@ -483,7 +483,7 @@ Notice that the save mode is now `Append`. In general, always use append mode un -[Querying](#querying) the data again will now show updated records. Each write operation generates a new [commit](/docs/next/concepts). +[Querying](#querying) the data again will now show updated records. Each write operation generates a new [commit](concepts). Look for changes in `_hoodie_commit_time`, `fare` fields for the given `_hoodie_record_key` value from a previous commit. ## Merging Data {#merge} @@ -1067,7 +1067,7 @@ PARTITIONED BY (city); > :::note Implications of defining record keys -Configuring keys for a Hudi table, has a new implications on the table. If record key is set by the user, `upsert` is chosen as the [write operation](/docs/next/write_operations). +Configuring keys for a Hudi table, has a new implications on the table. If record key is set by the user, `upsert` is chosen as the [write operation](write_operations). Also if a record key is configured, then it's also advisable to specify a precombine or ordering field, to correctly handle cases where the source data has multiple records with the same key. See section below. ::: @@ -1140,12 +1140,12 @@ PARTITIONED BY (city); ## Where to go from here? You can also [build hudi yourself](https://github.com/apache/hudi#building-apache-hudi-from-source) and try this quickstart using `--jars `(see also [build with scala 2.12](https://github.com/apache/hudi#build-with-different-spark-versions)) -for more info. 
If you are looking for ways to migrate your existing data to Hudi, refer to [migration guide](/docs/next/migration_guide). +for more info. If you are looking for ways to migrate your existing data to Hudi, refer to [migration guide](migration_guide). ### Spark SQL Reference -For advanced usage of spark SQL, please refer to [Spark SQL DDL](/docs/next/sql_ddl) and [Spark SQL DML](/docs/next/sql_dml) reference guides. -For alter table commands, check out [this](/docs/next/sql_ddl#spark-alter-table). Stored procedures provide a lot of powerful capabilities using Hudi SparkSQL to assist with monitoring, managing and operating Hudi tables, please check [this](/docs/next/procedures) out. +For advanced usage of spark SQL, please refer to [Spark SQL DDL](sql_ddl) and [Spark SQL DML](sql_dml) reference guides. +For alter table commands, check out [this](sql_ddl#spark-alter-table). Stored procedures provide a lot of powerful capabilities using Hudi SparkSQL to assist with monitoring, managing and operating Hudi tables, please check [this](procedures) out. ### Streaming workloads @@ -1155,14 +1155,14 @@ Hudi provides industry-leading performance and functionality for streaming data. from various different sources in a streaming manner, with powerful built-in capabilities like auto checkpointing, schema enforcement via schema provider, transformation support, automatic table services and so on. -**Structured Streaming** - Hudi supports Spark Structured Streaming reads and writes as well. Please see [here](/docs/next/writing_tables_streaming_writes#spark-streaming) for more. +**Structured Streaming** - Hudi supports Spark Structured Streaming reads and writes as well. Please see [here](writing_tables_streaming_writes#spark-streaming) for more. -Check out more information on [modeling data in Hudi](/docs/next/faq_general#how-do-i-model-the-data-stored-in-hudi) and different ways to perform [batch writes](/docs/writing_data) and [streaming writes](/docs/next/writing_tables_streaming_writes). +Check out more information on [modeling data in Hudi](faq_general#how-do-i-model-the-data-stored-in-hudi) and different ways to perform [batch writes](/docs/writing_data) and [streaming writes](writing_tables_streaming_writes). ### Dockerized Demo Even as we showcased the core capabilities, Hudi supports a lot more advanced functionality that can make it easy to get your transactional data lakes up and running quickly, across a variety query engines like Hive, Flink, Spark, Presto, Trino and much more. We have put together a [demo video](https://www.youtube.com/watch?v=VhNgUsxdrD0) that showcases all of this on a docker based setup with all dependent systems running locally. We recommend you replicate the same setup and run the demo yourself, by following -steps [here](/docs/next/docker_demo) to get a taste for it. +steps [here](docker_demo) to get a taste for it. diff --git a/website/versioned_docs/version-0.15.0/record_payload.md b/website/versioned_docs/version-0.15.0/record_payload.md index 105a87ae9a02c..0f514dced09e5 100644 --- a/website/versioned_docs/version-0.15.0/record_payload.md +++ b/website/versioned_docs/version-0.15.0/record_payload.md @@ -172,6 +172,6 @@ provides support for applying changes captured via Amazon Database Migration Ser Record payloads are tunable to suit many use cases. Please check out the configurations listed [here](/docs/configurations#RECORD_PAYLOAD). 
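Following the record payload discussion above, here is a hedged sketch of choosing a payload class on the write path; the option key and payload class are recalled from the record payload and configuration docs and should be treated as assumptions, with `df`, the table name and the path as placeholders.

```scala
// Sketch: select a payload implementation so merges honor the ordering (precombine) field.
// Option key and class name are assumptions based on the payload docs; `df`, table and path are placeholders.
import org.apache.spark.sql.SaveMode

df.write.format("hudi").
  option("hoodie.table.name", "hudi_table").
  option("hoodie.datasource.write.recordkey.field", "uuid").
  option("hoodie.datasource.write.precombine.field", "ts").
  option("hoodie.datasource.write.payload.class",
    "org.apache.hudi.common.model.DefaultHoodieRecordPayload"). // ordering-field-aware merging of updates
  option("hoodie.datasource.write.operation", "upsert").
  mode(SaveMode.Append).
  save("/tmp/hudi_table")
```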
Moreover, if users want to implement their own custom merge logic, -please check out [this FAQ](/docs/next/faq_writing_tables/#can-i-implement-my-own-logic-for-how-input-records-are-merged-with-record-on-storage). In a +please check out [this FAQ](faq_writing_tables/#can-i-implement-my-own-logic-for-how-input-records-are-merged-with-record-on-storage). In a separate document, we will talk about a new record merger API for optimized payload handling. diff --git a/website/versioned_docs/version-0.15.0/sql_ddl.md b/website/versioned_docs/version-0.15.0/sql_ddl.md index 4651e819cdb31..56dd6a6775656 100644 --- a/website/versioned_docs/version-0.15.0/sql_ddl.md +++ b/website/versioned_docs/version-0.15.0/sql_ddl.md @@ -389,7 +389,7 @@ Users can set table properties while creating a table. The important table prope #### Passing Lock Providers for Concurrent Writers Hudi requires a lock provider to support concurrent writers or asynchronous table services when using OCC -and [NBCC](/docs/next/concurrency_control#non-blocking-concurrency-control-mode-experimental) (Non-Blocking Concurrency Control) +and [NBCC](concurrency_control#non-blocking-concurrency-control) (Non-Blocking Concurrency Control) concurrency mode. For NBCC mode, locking is only used to write the commit metadata file in the timeline. Writes are serialized by completion time. Users can pass these table properties into *TBLPROPERTIES* as well. Below is an example for a Zookeeper based configuration. @@ -612,7 +612,7 @@ ALTER TABLE tableA RENAME TO tableB; ### Setting Hudi configs #### Using table options -You can configure hoodie configs in table options when creating a table. You can refer Flink specific hoodie configs [here](/docs/next/configurations#FLINK_SQL) +You can configure hoodie configs in table options when creating a table. You can refer Flink specific hoodie configs [here](configurations#FLINK_SQL) These configs will be applied to all the operations on that table. ```sql diff --git a/website/versioned_docs/version-0.15.0/sql_dml.md b/website/versioned_docs/version-0.15.0/sql_dml.md index edb63730b135c..b94b382df68dd 100644 --- a/website/versioned_docs/version-0.15.0/sql_dml.md +++ b/website/versioned_docs/version-0.15.0/sql_dml.md @@ -12,7 +12,7 @@ import TabItem from '@theme/TabItem'; SparkSQL provides several Data Manipulation Language (DML) actions for interacting with Hudi tables. These operations allow you to insert, update, merge and delete data from your Hudi tables. Let's explore them one by one. -Please refer to [SQL DDL](/docs/next/sql_ddl) for creating Hudi tables using SQL. +Please refer to [SQL DDL](sql_ddl) for creating Hudi tables using SQL. ### Insert Into @@ -25,7 +25,7 @@ SELECT FROM ; :::note Deprecations From 0.14.0, `hoodie.sql.bulk.insert.enable` and `hoodie.sql.insert.mode` are deprecated. Users are expected to use `hoodie.spark.sql.insert.into.operation` instead. -To manage duplicates with `INSERT INTO`, please check out [insert dup policy config](/docs/next/configurations#hoodiedatasourceinsertduppolicy). +To manage duplicates with `INSERT INTO`, please check out [insert dup policy config](configurations#hoodiedatasourceinsertduppolicy). ::: Examples: @@ -317,7 +317,7 @@ INSERT INTO hudi_table select ... from ...; Hudi Flink supports a new non-blocking concurrency control mode, where multiple writer tasks can be executed concurrently without blocking each other. One can read more about this mode in -the [concurrency control](/docs/next/concurrency_control#model-c-multi-writer) docs. 
Let us see it in action here. +the [concurrency control](concurrency_control#model-c-multi-writer) docs. Let us see it in action here. In the below example, we have two streaming ingestion pipelines that concurrently update the same table. One of the pipeline is responsible for the compaction and cleaning table services, while the other pipeline is just for data diff --git a/website/versioned_docs/version-0.15.0/use_cases.md b/website/versioned_docs/version-0.15.0/use_cases.md index edf37a3a4b00b..4d06f1e571a68 100644 --- a/website/versioned_docs/version-0.15.0/use_cases.md +++ b/website/versioned_docs/version-0.15.0/use_cases.md @@ -85,7 +85,7 @@ stream processing world to ensure pipelines don't break from non backwards compa ### ACID Transactions Along with a table, Apache Hudi brings ACID transactional guarantees to a data lake. -Hudi ensures atomic writes, by way of publishing commits atomically to a [timeline](/docs/next/timeline), stamped with an +Hudi ensures atomic writes, by way of publishing commits atomically to a [timeline](timeline), stamped with an instant time that denotes the time at which the action is deemed to have occurred. Unlike general purpose file version control, Hudi draws clear distinction between writer processes (that issue user’s upserts/deletes), table services (that write data/metadata to optimize/perform bookkeeping) and readers @@ -127,12 +127,12 @@ cost savings for your data lake. Some examples of the Apache Hudi services that make this performance optimization easy include: -- [Auto File Sizing](/docs/next/file_sizing) - to solve the "small files" problem. -- [Clustering](/docs/next/clustering) - to co-locate data next to each other. -- [Compaction](/docs/next/compaction) - to allow tuning of low latency ingestion and fast read queries. -- [Indexing](/docs/next/indexes) - for efficient upserts and deletes. +- [Auto File Sizing](file_sizing) - to solve the "small files" problem. +- [Clustering](clustering) - to co-locate data next to each other. +- [Compaction](compaction) - to allow tuning of low latency ingestion and fast read queries. +- [Indexing](indexing) - for efficient upserts and deletes. - Multi-Dimensional Partitioning (Z-Ordering) - Traditional folder style partitioning on low-cardinality, while also Z-Ordering data within files based on high-cardinality - Metadata Table - No more slow S3 file listings or throttling. -- [Auto Cleaning](/docs/next/cleaning) - Keeps your storage costs in check by automatically removing old versions of files. +- [Auto Cleaning](hoodie_cleaner) - Keeps your storage costs in check by automatically removing old versions of files. diff --git a/website/versioned_docs/version-0.15.0/write_operations.md b/website/versioned_docs/version-0.15.0/write_operations.md index df0a22e01f9a4..3d48a9f4ebe39 100644 --- a/website/versioned_docs/version-0.15.0/write_operations.md +++ b/website/versioned_docs/version-0.15.0/write_operations.md @@ -87,25 +87,25 @@ The following is an inside look on the Hudi write path and the sequence of event 1. [Deduping](/docs/configurations#hoodiecombinebeforeinsert) 1. First your input records may have duplicate keys within the same batch and duplicates need to be combined or reduced by key. -2. [Index Lookup](/docs/next/indexes) +2. [Index Lookup](indexing) 1. Next, an index lookup is performed to try and match the input records to identify which file groups they belong to. -3. [File Sizing](/docs/next/file_sizing) +3. [File Sizing](file_sizing) 1. 
Then, based on the average size of previous commits, Hudi will make a plan to add enough records to a small file to get it close to the configured maximum limit. -4. [Partitioning](/docs/next/storage_layouts) +4. [Partitioning](file_layouts) 1. We now arrive at partitioning where we decide what file groups certain updates and inserts will be placed in or if new file groups will be created 5. Write I/O 1. Now we actually do the write operations which is either creating a new base file, appending to the log file, or versioning an existing base file. -6. Update [Index](/docs/next/indexes) +6. Update [Index](indexing) 1. Now that the write is performed, we will go back and update the index. 7. Commit - 1. Finally we commit all of these changes atomically. ([Post-commit callback](/docs/next/platform_services_post_commit_callback) can be configured.) -8. [Clean](/docs/next/cleaning) (if needed) + 1. Finally we commit all of these changes atomically. ([Post-commit callback](platform_services_post_commit_callback) can be configured.) +8. [Clean](hoodie_cleaner) (if needed) 1. Following the commit, cleaning is invoked if needed. -9. [Compaction](/docs/next/compaction) +9. [Compaction](compaction) 1. If you are using MOR tables, compaction will either run inline, or be scheduled asynchronously 10. Archive - 1. Lastly, we perform an archival step which moves old [timeline](/docs/next/timeline) items to an archive folder. + 1. Lastly, we perform an archival step which moves old [timeline](timeline) items to an archive folder. Here is a diagramatic representation of the flow. diff --git a/website/versioned_docs/version-0.15.0/writing_data.md b/website/versioned_docs/version-0.15.0/writing_data.md index fd5e6582ad3b7..2630cc299dbbe 100644 --- a/website/versioned_docs/version-0.15.0/writing_data.md +++ b/website/versioned_docs/version-0.15.0/writing_data.md @@ -83,7 +83,7 @@ df.write.format("hudi"). You can check the data generated under `/tmp/hudi_trips_cow////`. We provided a record key (`uuid` in [schema](https://github.com/apache/hudi/blob/6f9b02decb5bb2b83709b1b6ec04a97e4d102c11/hudi-spark-datasource/hudi-spark/src/main/java/org/apache/hudi/QuickstartUtils.java#L60)), partition field (`region/country/city`) and combine logic (`ts` in [schema](https://github.com/apache/hudi/blob/6f9b02decb5bb2b83709b1b6ec04a97e4d102c11/hudi-spark-datasource/hudi-spark/src/main/java/org/apache/hudi/QuickstartUtils.java#L60)) to ensure trip records are unique within each partition. For more info, refer to -[Modeling data stored in Hudi](/docs/next/faq_general/#how-do-i-model-the-data-stored-in-hudi) +[Modeling data stored in Hudi](faq_general/#how-do-i-model-the-data-stored-in-hudi) and for info on ways to ingest data into Hudi, refer to [Writing Hudi Tables](/docs/hoodie_streaming_ingestion). Here we are using the default write operation : `upsert`. If you have a workload without updates, you can also issue `insert` or `bulk_insert` operations which could be faster. To know more, refer to [Write operations](/docs/write_operations) @@ -119,7 +119,7 @@ df.write.format("hudi"). You can check the data generated under `/tmp/hudi_trips_cow////`. 
We provided a record key (`uuid` in [schema](https://github.com/apache/hudi/blob/2e6e302efec2fa848ded4f88a95540ad2adb7798/hudi-spark-datasource/hudi-spark/src/main/java/org/apache/hudi/QuickstartUtils.java#L60)), partition field (`region/country/city`) and combine logic (`ts` in [schema](https://github.com/apache/hudi/blob/2e6e302efec2fa848ded4f88a95540ad2adb7798/hudi-spark-datasource/hudi-spark/src/main/java/org/apache/hudi/QuickstartUtils.java#L60)) to ensure trip records are unique within each partition. For more info, refer to -[Modeling data stored in Hudi](/docs/next/faq_general/#how-do-i-model-the-data-stored-in-hudi) +[Modeling data stored in Hudi](faq_general/#how-do-i-model-the-data-stored-in-hudi) and for info on ways to ingest data into Hudi, refer to [Writing Hudi Tables](/docs/hoodie_streaming_ingestion). Here we are using the default write operation : `upsert`. If you have a workload without updates, you can also issue `insert` or `bulk_insert` operations which could be faster. To know more, refer to [Write operations](/docs/write_operations) diff --git a/website/versioned_docs/version-0.5.0/querying_data.md b/website/versioned_docs/version-0.5.0/querying_data.md index 49f6dd4cc30cb..1985ab3d2f744 100644 --- a/website/versioned_docs/version-0.5.0/querying_data.md +++ b/website/versioned_docs/version-0.5.0/querying_data.md @@ -129,7 +129,7 @@ A sample incremental pull, that will obtain all records written since `beginInst .load(tablePath); // For incremental view, pass in the root/base path of dataset ``` -Please refer to [configurations](/docs/configurations#spark-datasource) section, to view all datasource options. +Please refer to [configurations](/docs/configurations#spark-datasource-configs) section, to view all datasource options. Additionally, `HoodieReadClient` offers the following functionality using Hudi's implicit indexing. diff --git a/website/versioned_docs/version-0.5.0/quick-start-guide.md b/website/versioned_docs/version-0.5.0/quick-start-guide.md index 92660aea93615..7318eae83d3cb 100644 --- a/website/versioned_docs/version-0.5.0/quick-start-guide.md +++ b/website/versioned_docs/version-0.5.0/quick-start-guide.md @@ -64,7 +64,7 @@ You can check the data generated under `/tmp/hudi_cow_table///< [Modeling data stored in Hudi](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=113709185#FAQ-HowdoImodelthedatastoredinHudi) and for info on ways to ingest data into Hudi, refer to [Writing Hudi Datasets](/docs/writing_data). Here we are using the default write operation : `upsert`. If you have a workload without updates, you can also issue -`insert` or `bulk_insert` operations which could be faster. To know more, refer to [Write operations](/docs/writing_data#write-operations) +`insert` or `bulk_insert` operations which could be faster. To know more, refer to [Write operations](writing_data#write-operations) \{: .notice--info} ## Query data diff --git a/website/versioned_docs/version-0.5.0/writing_data.md b/website/versioned_docs/version-0.5.0/writing_data.md index 7510c3408dc66..61cf8b83ba09a 100644 --- a/website/versioned_docs/version-0.5.0/writing_data.md +++ b/website/versioned_docs/version-0.5.0/writing_data.md @@ -212,10 +212,10 @@ column statistics etc. Even on some cloud data stores, there is often cost to li Here are some ways to efficiently manage the storage of your Hudi datasets. 
- - The [small file handling feature](/docs/configurations#compactionSmallFileSize) in Hudi, profiles incoming workload + - The [small file handling feature](/docs/configurations#hoodieparquetsmallfilelimit) in Hudi, profiles incoming workload and distributes inserts to existing file groups instead of creating new file groups, which can lead to small files. - - Cleaner can be [configured](/docs/configurations#retainCommits) to clean up older file slices, more or less aggressively depending on maximum time for queries to run & lookback needed for incremental pull - - User can also tune the size of the [base/parquet file](/docs/configurations#limitFileSize), [log files](/docs/configurations#logFileMaxSize) & expected [compression ratio](/docs/configurations#parquetCompressionRatio), + - Cleaner can be [configured](configurations#retaincommitsno_of_commits_to_retain--24) to clean up older file slices, more or less aggressively depending on maximum time for queries to run & lookback needed for incremental pull + - User can also tune the size of the [base/parquet file](/docs/configurations#hoodieparquetmaxfilesize), [log files](configurations#logfilemaxsizelogfilesize--1gb) & expected [compression ratio](/docs/configurations#parquetCompressionRatio), such that sufficient number of inserts are grouped into the same file group, resulting in well sized base files ultimately. - Intelligently tuning the [bulk insert parallelism](/docs/configurations#withBulkInsertParallelism), can again in nicely sized initial file groups. It is in fact critical to get this right, since the file groups once created cannot be deleted, but simply expanded as explained before. diff --git a/website/versioned_docs/version-0.5.1/deployment.md b/website/versioned_docs/version-0.5.1/deployment.md index 72ce29c86971f..3bda49ed6310e 100644 --- a/website/versioned_docs/version-0.5.1/deployment.md +++ b/website/versioned_docs/version-0.5.1/deployment.md @@ -29,9 +29,9 @@ With Merge_On_Read Table, Hudi ingestion needs to also take care of compacting d ### DeltaStreamer -[DeltaStreamer](/docs/writing_data#deltastreamer) is the standalone utility to incrementally pull upstream changes from varied sources such as DFS, Kafka and DB Changelogs and ingest them to hudi tables. It runs as a spark application in 2 modes. +[DeltaStreamer](writing_data#deltastreamer) is the standalone utility to incrementally pull upstream changes from varied sources such as DFS, Kafka and DB Changelogs and ingest them to hudi tables. It runs as a spark application in 2 modes. - - **Run Once Mode** : In this mode, Deltastreamer performs one ingestion round which includes incrementally pulling events from upstream sources and ingesting them to hudi table. Background operations like cleaning old file versions and archiving hoodie timeline are automatically executed as part of the run. For Merge-On-Read tables, Compaction is also run inline as part of ingestion unless disabled by passing the flag "--disable-compaction". By default, Compaction is run inline for every ingestion run and this can be changed by setting the property "hoodie.compact.inline.max.delta.commits". You can either manually run this spark application or use any cron trigger or workflow orchestrator (most common deployment strategy) such as Apache Airflow to spawn this application. See command line options in [this section](/docs/writing_data#deltastreamer) for running the spark application. 
+ - **Run Once Mode** : In this mode, Deltastreamer performs one ingestion round which includes incrementally pulling events from upstream sources and ingesting them to hudi table. Background operations like cleaning old file versions and archiving hoodie timeline are automatically executed as part of the run. For Merge-On-Read tables, Compaction is also run inline as part of ingestion unless disabled by passing the flag "--disable-compaction". By default, Compaction is run inline for every ingestion run and this can be changed by setting the property "hoodie.compact.inline.max.delta.commits". You can either manually run this spark application or use any cron trigger or workflow orchestrator (most common deployment strategy) such as Apache Airflow to spawn this application. See command line options in [this section](writing_data#deltastreamer) for running the spark application. Here is an example invocation for reading from kafka topic in a single-run mode and writing to Merge On Read table type in a yarn cluster. @@ -130,7 +130,7 @@ Here is an example invocation for reading from kafka topic in a continuous mode ### Spark Datasource Writer Jobs -As described in [Writing Data](/docs/writing_data#datasource-writer), you can use spark datasource to ingest to hudi table. This mechanism allows you to ingest any spark dataframe in Hudi format. Hudi Spark DataSource also supports spark streaming to ingest a streaming source to Hudi table. For Merge On Read table types, inline compaction is turned on by default which runs after every ingestion run. The compaction frequency can be changed by setting the property "hoodie.compact.inline.max.delta.commits". +As described in [Writing Data](writing_data#datasource-writer), you can use spark datasource to ingest to hudi table. This mechanism allows you to ingest any spark dataframe in Hudi format. Hudi Spark DataSource also supports spark streaming to ingest a streaming source to Hudi table. For Merge On Read table types, inline compaction is turned on by default which runs after every ingestion run. The compaction frequency can be changed by setting the property "hoodie.compact.inline.max.delta.commits". Here is an example invocation using spark datasource diff --git a/website/versioned_docs/version-0.5.1/querying_data.md b/website/versioned_docs/version-0.5.1/querying_data.md index 1f1cba44c1a0b..978fbdc60f9a6 100644 --- a/website/versioned_docs/version-0.5.1/querying_data.md +++ b/website/versioned_docs/version-0.5.1/querying_data.md @@ -154,7 +154,7 @@ spark.sql("select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from hu ``` For examples, refer to [Setup spark-shell in quickstart](/docs/quick-start-guide#setup-spark-shell). -Please refer to [configurations](/docs/configurations#spark-datasource) section, to view all datasource options. +Please refer to [configurations](/docs/configurations#spark-datasource-configs) section, to view all datasource options. Additionally, `HoodieReadClient` offers the following functionality using Hudi's implicit indexing. 
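To make the incremental pull referenced above concrete, here is a rough spark-shell sketch; the option keys follow current releases (the 0.5.x pages being patched use older key names), and `basePath` plus the begin instant are placeholders:

```scala
// Rough sketch of an incremental query: read only records written after a given instant time.
val basePath = "/tmp/hudi_trips_cow"  // placeholder table location
val hudiIncQueryDF = spark.read.format("hudi").
  option("hoodie.datasource.query.type", "incremental").
  option("hoodie.datasource.read.begin.instanttime", "20240101000000").  // placeholder instant
  load(basePath)
hudiIncQueryDF.createOrReplaceTempView("hudi_trips_incremental")
spark.sql("select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts " +
  "from hudi_trips_incremental where fare > 20.0").show()
```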
diff --git a/website/versioned_docs/version-0.5.1/quick-start-guide.md b/website/versioned_docs/version-0.5.1/quick-start-guide.md index 90a8bec85b8b5..a39dc627aba9d 100644 --- a/website/versioned_docs/version-0.5.1/quick-start-guide.md +++ b/website/versioned_docs/version-0.5.1/quick-start-guide.md @@ -75,7 +75,7 @@ You can check the data generated under `/tmp/hudi_trips_cow///< [Modeling data stored in Hudi](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=113709185#FAQ-HowdoImodelthedatastoredinHudi) and for info on ways to ingest data into Hudi, refer to [Writing Hudi Tables](/docs/writing_data). Here we are using the default write operation : `upsert`. If you have a workload without updates, you can also issue -`insert` or `bulk_insert` operations which could be faster. To know more, refer to [Write operations](/docs/writing_data#write-operations) +`insert` or `bulk_insert` operations which could be faster. To know more, refer to [Write operations](writing_data#write-operations) \{: .notice--info} ## Query data diff --git a/website/versioned_docs/version-0.5.1/writing_data.md b/website/versioned_docs/version-0.5.1/writing_data.md index a6c40b65a1173..3b023898ede0e 100644 --- a/website/versioned_docs/version-0.5.1/writing_data.md +++ b/website/versioned_docs/version-0.5.1/writing_data.md @@ -242,10 +242,10 @@ column statistics etc. Even on some cloud data stores, there is often cost to li Here are some ways to efficiently manage the storage of your Hudi tables. - - The [small file handling feature](/docs/configurations#compactionSmallFileSize) in Hudi, profiles incoming workload + - The [small file handling feature](/docs/configurations#hoodieparquetsmallfilelimit) in Hudi, profiles incoming workload and distributes inserts to existing file groups instead of creating new file groups, which can lead to small files. - - Cleaner can be [configured](/docs/configurations#retainCommits) to clean up older file slices, more or less aggressively depending on maximum time for queries to run & lookback needed for incremental pull - - User can also tune the size of the [base/parquet file](/docs/configurations#limitFileSize), [log files](/docs/configurations#logFileMaxSize) & expected [compression ratio](/docs/configurations#parquetCompressionRatio), + - Cleaner can be [configured](configurations#retaincommitsno_of_commits_to_retain--24) to clean up older file slices, more or less aggressively depending on maximum time for queries to run & lookback needed for incremental pull + - User can also tune the size of the [base/parquet file](/docs/configurations#hoodieparquetmaxfilesize), [log files](configurations#logfilemaxsizelogfilesize--1gb) & expected [compression ratio](/docs/configurations#parquetCompressionRatio), such that sufficient number of inserts are grouped into the same file group, resulting in well sized base files ultimately. - Intelligently tuning the [bulk insert parallelism](/docs/configurations#withBulkInsertParallelism), can again in nicely sized initial file groups. It is in fact critical to get this right, since the file groups once created cannot be deleted, but simply expanded as explained before. 
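To ground the sizing knobs listed in the bullets above, a hedged spark-shell sketch follows; the config keys are the current names of the linked options, the values are placeholders rather than recommendations, and `df`/`basePath` are assumed to exist:

```scala
// Illustrative values only: the file-sizing and cleaning knobs discussed above.
df.write.format("hudi").
  option("hoodie.table.name", "hudi_table").
  option("hoodie.parquet.small.file.limit", "104857600").   // ~100 MB: smaller base files get padded with new inserts
  option("hoodie.parquet.max.file.size", "125829120").      // ~120 MB target base/parquet file size
  option("hoodie.parquet.compression.ratio", "0.35").       // expected parquet compression ratio
  option("hoodie.logfile.max.size", "1073741824").          // 1 GB max log file size
  option("hoodie.cleaner.commits.retained", "24").          // cleaner lookback, in commits
  mode("append").
  save(basePath)
```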
diff --git a/website/versioned_docs/version-0.5.2/deployment.md b/website/versioned_docs/version-0.5.2/deployment.md index ad4843f0639d6..af83d1c375244 100644 --- a/website/versioned_docs/version-0.5.2/deployment.md +++ b/website/versioned_docs/version-0.5.2/deployment.md @@ -29,9 +29,9 @@ With Merge_On_Read Table, Hudi ingestion needs to also take care of compacting d ### DeltaStreamer -[DeltaStreamer](/docs/writing_data#deltastreamer) is the standalone utility to incrementally pull upstream changes from varied sources such as DFS, Kafka and DB Changelogs and ingest them to hudi tables. It runs as a spark application in 2 modes. +[DeltaStreamer](writing_data#deltastreamer) is the standalone utility to incrementally pull upstream changes from varied sources such as DFS, Kafka and DB Changelogs and ingest them to hudi tables. It runs as a spark application in 2 modes. - - **Run Once Mode** : In this mode, Deltastreamer performs one ingestion round which includes incrementally pulling events from upstream sources and ingesting them to hudi table. Background operations like cleaning old file versions and archiving hoodie timeline are automatically executed as part of the run. For Merge-On-Read tables, Compaction is also run inline as part of ingestion unless disabled by passing the flag "--disable-compaction". By default, Compaction is run inline for every ingestion run and this can be changed by setting the property "hoodie.compact.inline.max.delta.commits". You can either manually run this spark application or use any cron trigger or workflow orchestrator (most common deployment strategy) such as Apache Airflow to spawn this application. See command line options in [this section](/docs/writing_data#deltastreamer) for running the spark application. + - **Run Once Mode** : In this mode, Deltastreamer performs one ingestion round which includes incrementally pulling events from upstream sources and ingesting them to hudi table. Background operations like cleaning old file versions and archiving hoodie timeline are automatically executed as part of the run. For Merge-On-Read tables, Compaction is also run inline as part of ingestion unless disabled by passing the flag "--disable-compaction". By default, Compaction is run inline for every ingestion run and this can be changed by setting the property "hoodie.compact.inline.max.delta.commits". You can either manually run this spark application or use any cron trigger or workflow orchestrator (most common deployment strategy) such as Apache Airflow to spawn this application. See command line options in [this section](writing_data#deltastreamer) for running the spark application. Here is an example invocation for reading from kafka topic in a single-run mode and writing to Merge On Read table type in a yarn cluster. @@ -130,7 +130,7 @@ Here is an example invocation for reading from kafka topic in a continuous mode ### Spark Datasource Writer Jobs -As described in [Writing Data](/docs/writing_data#datasource-writer), you can use spark datasource to ingest to hudi table. This mechanism allows you to ingest any spark dataframe in Hudi format. Hudi Spark DataSource also supports spark streaming to ingest a streaming source to Hudi table. For Merge On Read table types, inline compaction is turned on by default which runs after every ingestion run. The compaction frequency can be changed by setting the property "hoodie.compact.inline.max.delta.commits". 
+As described in [Writing Data](writing_data#datasource-writer), you can use spark datasource to ingest to hudi table. This mechanism allows you to ingest any spark dataframe in Hudi format. Hudi Spark DataSource also supports spark streaming to ingest a streaming source to Hudi table. For Merge On Read table types, inline compaction is turned on by default which runs after every ingestion run. The compaction frequency can be changed by setting the property "hoodie.compact.inline.max.delta.commits". Here is an example invocation using spark datasource diff --git a/website/versioned_docs/version-0.5.2/querying_data.md b/website/versioned_docs/version-0.5.2/querying_data.md index 7887c888e1f36..98b507b64bb12 100644 --- a/website/versioned_docs/version-0.5.2/querying_data.md +++ b/website/versioned_docs/version-0.5.2/querying_data.md @@ -155,8 +155,8 @@ hudiIncQueryDF.createOrReplaceTempView("hudi_trips_incremental") spark.sql("select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from hudi_trips_incremental where fare > 20.0").show() ``` -For examples, refer to [Setup spark-shell in quickstart](/docs/quick-start-guide#setup-spark-shell). -Please refer to [configurations](/docs/configurations#spark-datasource) section, to view all datasource options. +For examples, refer to [Setup spark-shell in quickstart](quick-start-guide#setup). +Please refer to [configurations](/docs/configurations#spark-datasource-configs) section, to view all datasource options. Additionally, `HoodieReadClient` offers the following functionality using Hudi's implicit indexing. diff --git a/website/versioned_docs/version-0.5.2/quick-start-guide.md b/website/versioned_docs/version-0.5.2/quick-start-guide.md index dee782961647d..8a76b8189f755 100644 --- a/website/versioned_docs/version-0.5.2/quick-start-guide.md +++ b/website/versioned_docs/version-0.5.2/quick-start-guide.md @@ -75,7 +75,7 @@ You can check the data generated under `/tmp/hudi_trips_cow///< [Modeling data stored in Hudi](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=113709185#FAQ-HowdoImodelthedatastoredinHudi) and for info on ways to ingest data into Hudi, refer to [Writing Hudi Tables](/docs/writing_data). Here we are using the default write operation : `upsert`. If you have a workload without updates, you can also issue -`insert` or `bulk_insert` operations which could be faster. To know more, refer to [Write operations](/docs/writing_data#write-operations) +`insert` or `bulk_insert` operations which could be faster. To know more, refer to [Write operations](writing_data#write-operations) \{: .notice--info} ## Query data diff --git a/website/versioned_docs/version-0.5.2/writing_data.md b/website/versioned_docs/version-0.5.2/writing_data.md index d5a7771ab72dd..82c86111c357f 100644 --- a/website/versioned_docs/version-0.5.2/writing_data.md +++ b/website/versioned_docs/version-0.5.2/writing_data.md @@ -242,10 +242,10 @@ column statistics etc. Even on some cloud data stores, there is often cost to li Here are some ways to efficiently manage the storage of your Hudi tables. - - The [small file handling feature](/docs/configurations#compactionSmallFileSize) in Hudi, profiles incoming workload + - The [small file handling feature](/docs/configurations#hoodieparquetsmallfilelimit) in Hudi, profiles incoming workload and distributes inserts to existing file groups instead of creating new file groups, which can lead to small files. 
- - Cleaner can be [configured](/docs/configurations#retainCommits) to clean up older file slices, more or less aggressively depending on maximum time for queries to run & lookback needed for incremental pull - - User can also tune the size of the [base/parquet file](/docs/configurations#limitFileSize), [log files](/docs/configurations#logFileMaxSize) & expected [compression ratio](/docs/configurations#parquetCompressionRatio), + - Cleaner can be [configured](configurations#retaincommitsno_of_commits_to_retain--24) to clean up older file slices, more or less aggressively depending on maximum time for queries to run & lookback needed for incremental pull + - User can also tune the size of the [base/parquet file](/docs/configurations#hoodieparquetmaxfilesize), [log files](configurations#logfilemaxsizelogfilesize--1gb) & expected [compression ratio](/docs/configurations#parquetCompressionRatio), such that sufficient number of inserts are grouped into the same file group, resulting in well sized base files ultimately. - Intelligently tuning the [bulk insert parallelism](/docs/configurations#withBulkInsertParallelism), can again in nicely sized initial file groups. It is in fact critical to get this right, since the file groups once created cannot be deleted, but simply expanded as explained before. diff --git a/website/versioned_docs/version-0.5.3/deployment.md b/website/versioned_docs/version-0.5.3/deployment.md index 018a21b5e2130..aeeea7b24d36c 100644 --- a/website/versioned_docs/version-0.5.3/deployment.md +++ b/website/versioned_docs/version-0.5.3/deployment.md @@ -29,9 +29,9 @@ With Merge_On_Read Table, Hudi ingestion needs to also take care of compacting d ### DeltaStreamer -[DeltaStreamer](/docs/writing_data#deltastreamer) is the standalone utility to incrementally pull upstream changes from varied sources such as DFS, Kafka and DB Changelogs and ingest them to hudi tables. It runs as a spark application in 2 modes. +[DeltaStreamer](writing_data#deltastreamer) is the standalone utility to incrementally pull upstream changes from varied sources such as DFS, Kafka and DB Changelogs and ingest them to hudi tables. It runs as a spark application in 2 modes. - - **Run Once Mode** : In this mode, Deltastreamer performs one ingestion round which includes incrementally pulling events from upstream sources and ingesting them to hudi table. Background operations like cleaning old file versions and archiving hoodie timeline are automatically executed as part of the run. For Merge-On-Read tables, Compaction is also run inline as part of ingestion unless disabled by passing the flag "--disable-compaction". By default, Compaction is run inline for every ingestion run and this can be changed by setting the property "hoodie.compact.inline.max.delta.commits". You can either manually run this spark application or use any cron trigger or workflow orchestrator (most common deployment strategy) such as Apache Airflow to spawn this application. See command line options in [this section](/docs/writing_data#deltastreamer) for running the spark application. + - **Run Once Mode** : In this mode, Deltastreamer performs one ingestion round which includes incrementally pulling events from upstream sources and ingesting them to hudi table. Background operations like cleaning old file versions and archiving hoodie timeline are automatically executed as part of the run. For Merge-On-Read tables, Compaction is also run inline as part of ingestion unless disabled by passing the flag "--disable-compaction". 
By default, Compaction is run inline for every ingestion run and this can be changed by setting the property "hoodie.compact.inline.max.delta.commits". You can either manually run this spark application or use any cron trigger or workflow orchestrator (most common deployment strategy) such as Apache Airflow to spawn this application. See command line options in [this section](writing_data#deltastreamer) for running the spark application. Here is an example invocation for reading from kafka topic in a single-run mode and writing to Merge On Read table type in a yarn cluster. @@ -130,7 +130,7 @@ Here is an example invocation for reading from kafka topic in a continuous mode ### Spark Datasource Writer Jobs -As described in [Writing Data](/docs/writing_data#datasource-writer), you can use spark datasource to ingest to hudi table. This mechanism allows you to ingest any spark dataframe in Hudi format. Hudi Spark DataSource also supports spark streaming to ingest a streaming source to Hudi table. For Merge On Read table types, inline compaction is turned on by default which runs after every ingestion run. The compaction frequency can be changed by setting the property "hoodie.compact.inline.max.delta.commits". +As described in [Writing Data](writing_data#datasource-writer), you can use spark datasource to ingest to hudi table. This mechanism allows you to ingest any spark dataframe in Hudi format. Hudi Spark DataSource also supports spark streaming to ingest a streaming source to Hudi table. For Merge On Read table types, inline compaction is turned on by default which runs after every ingestion run. The compaction frequency can be changed by setting the property "hoodie.compact.inline.max.delta.commits". Here is an example invocation using spark datasource diff --git a/website/versioned_docs/version-0.5.3/querying_data.md b/website/versioned_docs/version-0.5.3/querying_data.md index bec2c13f152dc..845478d4a6f18 100644 --- a/website/versioned_docs/version-0.5.3/querying_data.md +++ b/website/versioned_docs/version-0.5.3/querying_data.md @@ -154,8 +154,8 @@ hudiIncQueryDF.createOrReplaceTempView("hudi_trips_incremental") spark.sql("select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from hudi_trips_incremental where fare > 20.0").show() ``` -For examples, refer to [Setup spark-shell in quickstart](/docs/quick-start-guide#setup-spark-shell). -Please refer to [configurations](/docs/configurations#spark-datasource) section, to view all datasource options. +For examples, refer to [Setup spark-shell in quickstart](quick-start-guide#setup). +Please refer to [configurations](/docs/configurations#spark-datasource-configs) section, to view all datasource options. Additionally, `HoodieReadClient` offers the following functionality using Hudi's implicit indexing. diff --git a/website/versioned_docs/version-0.5.3/quick-start-guide.md b/website/versioned_docs/version-0.5.3/quick-start-guide.md index a2a10bd5146b4..b3eca64655b1e 100644 --- a/website/versioned_docs/version-0.5.3/quick-start-guide.md +++ b/website/versioned_docs/version-0.5.3/quick-start-guide.md @@ -79,7 +79,7 @@ You can check the data generated under `/tmp/hudi_trips_cow///< [Modeling data stored in Hudi](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=113709185#FAQ-HowdoImodelthedatastoredinHudi) and for info on ways to ingest data into Hudi, refer to [Writing Hudi Tables](/docs/writing_data). Here we are using the default write operation : `upsert`. 
If you have a workload without updates, you can also issue -`insert` or `bulk_insert` operations which could be faster. To know more, refer to [Write operations](/docs/writing_data#write-operations) +`insert` or `bulk_insert` operations which could be faster. To know more, refer to [Write operations](writing_data#write-operations) \{: .notice--info} ## Query data @@ -285,7 +285,7 @@ You can check the data generated under `/tmp/hudi_trips_cow///< [Modeling data stored in Hudi](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=113709185#FAQ-HowdoImodelthedatastoredinHudi) and for info on ways to ingest data into Hudi, refer to [Writing Hudi Tables](/docs/writing_data). Here we are using the default write operation : `upsert`. If you have a workload without updates, you can also issue -`insert` or `bulk_insert` operations which could be faster. To know more, refer to [Write operations](/docs/writing_data#write-operations) +`insert` or `bulk_insert` operations which could be faster. To know more, refer to [Write operations](writing_data#write-operations) \{: .notice--info} ## Query data diff --git a/website/versioned_docs/version-0.5.3/writing_data.md b/website/versioned_docs/version-0.5.3/writing_data.md index 85aebf23c9d40..cbec1ae4effe1 100644 --- a/website/versioned_docs/version-0.5.3/writing_data.md +++ b/website/versioned_docs/version-0.5.3/writing_data.md @@ -242,10 +242,10 @@ column statistics etc. Even on some cloud data stores, there is often cost to li Here are some ways to efficiently manage the storage of your Hudi tables. - - The [small file handling feature](/docs/configurations#compactionSmallFileSize) in Hudi, profiles incoming workload + - The [small file handling feature](/docs/configurations#hoodieparquetsmallfilelimit) in Hudi, profiles incoming workload and distributes inserts to existing file groups instead of creating new file groups, which can lead to small files. - - Cleaner can be [configured](/docs/configurations#retainCommits) to clean up older file slices, more or less aggressively depending on maximum time for queries to run & lookback needed for incremental pull - - User can also tune the size of the [base/parquet file](/docs/configurations#limitFileSize), [log files](/docs/configurations#logFileMaxSize) & expected [compression ratio](/docs/configurations#parquetCompressionRatio), + - Cleaner can be [configured](configurations#retaincommitsno_of_commits_to_retain--24) to clean up older file slices, more or less aggressively depending on maximum time for queries to run & lookback needed for incremental pull + - User can also tune the size of the [base/parquet file](/docs/configurations#hoodieparquetmaxfilesize), [log files](configurations#logfilemaxsizelogfilesize--1gb) & expected [compression ratio](/docs/configurations#parquetCompressionRatio), such that sufficient number of inserts are grouped into the same file group, resulting in well sized base files ultimately. - Intelligently tuning the [bulk insert parallelism](/docs/configurations#withBulkInsertParallelism), can again in nicely sized initial file groups. It is in fact critical to get this right, since the file groups once created cannot be deleted, but simply expanded as explained before. 
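The bulk insert parallelism bullet above can likewise be sketched; this is an assumption-laden example (placeholder `df`, `basePath`, and parallelism value) of seeding a table with `bulk_insert` so that the initial file groups come out well sized:

```scala
// Illustrative sketch: initial load via bulk_insert with a tuned shuffle parallelism.
df.write.format("hudi").
  option("hoodie.table.name", "hudi_table").
  option("hoodie.datasource.write.operation", "bulk_insert").
  option("hoodie.bulkinsert.shuffle.parallelism", "200").  // roughly input volume / target file size
  mode("overwrite").
  save(basePath)
```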
diff --git a/website/versioned_docs/version-0.6.0/deployment.md b/website/versioned_docs/version-0.6.0/deployment.md index 8112b63ac11e9..54a5c52be7de5 100644 --- a/website/versioned_docs/version-0.6.0/deployment.md +++ b/website/versioned_docs/version-0.6.0/deployment.md @@ -29,9 +29,9 @@ With Merge_On_Read Table, Hudi ingestion needs to also take care of compacting d ### DeltaStreamer -[DeltaStreamer](/docs/writing_data#deltastreamer) is the standalone utility to incrementally pull upstream changes from varied sources such as DFS, Kafka and DB Changelogs and ingest them to hudi tables. It runs as a spark application in 2 modes. +[DeltaStreamer](writing_data#deltastreamer) is the standalone utility to incrementally pull upstream changes from varied sources such as DFS, Kafka and DB Changelogs and ingest them to hudi tables. It runs as a spark application in 2 modes. - - **Run Once Mode** : In this mode, Deltastreamer performs one ingestion round which includes incrementally pulling events from upstream sources and ingesting them to hudi table. Background operations like cleaning old file versions and archiving hoodie timeline are automatically executed as part of the run. For Merge-On-Read tables, Compaction is also run inline as part of ingestion unless disabled by passing the flag "--disable-compaction". By default, Compaction is run inline for every ingestion run and this can be changed by setting the property "hoodie.compact.inline.max.delta.commits". You can either manually run this spark application or use any cron trigger or workflow orchestrator (most common deployment strategy) such as Apache Airflow to spawn this application. See command line options in [this section](/docs/writing_data#deltastreamer) for running the spark application. + - **Run Once Mode** : In this mode, Deltastreamer performs one ingestion round which includes incrementally pulling events from upstream sources and ingesting them to hudi table. Background operations like cleaning old file versions and archiving hoodie timeline are automatically executed as part of the run. For Merge-On-Read tables, Compaction is also run inline as part of ingestion unless disabled by passing the flag "--disable-compaction". By default, Compaction is run inline for every ingestion run and this can be changed by setting the property "hoodie.compact.inline.max.delta.commits". You can either manually run this spark application or use any cron trigger or workflow orchestrator (most common deployment strategy) such as Apache Airflow to spawn this application. See command line options in [this section](writing_data#deltastreamer) for running the spark application. Here is an example invocation for reading from kafka topic in a single-run mode and writing to Merge On Read table type in a yarn cluster. @@ -130,7 +130,7 @@ Here is an example invocation for reading from kafka topic in a continuous mode ### Spark Datasource Writer Jobs -As described in [Writing Data](/docs/writing_data#datasource-writer), you can use spark datasource to ingest to hudi table. This mechanism allows you to ingest any spark dataframe in Hudi format. Hudi Spark DataSource also supports spark streaming to ingest a streaming source to Hudi table. For Merge On Read table types, inline compaction is turned on by default which runs after every ingestion run. The compaction frequency can be changed by setting the property "hoodie.compact.inline.max.delta.commits". 
+As described in [Writing Data](writing_data#datasource-writer), you can use spark datasource to ingest to hudi table. This mechanism allows you to ingest any spark dataframe in Hudi format. Hudi Spark DataSource also supports spark streaming to ingest a streaming source to Hudi table. For Merge On Read table types, inline compaction is turned on by default which runs after every ingestion run. The compaction frequency can be changed by setting the property "hoodie.compact.inline.max.delta.commits". Here is an example invocation using spark datasource diff --git a/website/versioned_docs/version-0.6.0/querying_data.md b/website/versioned_docs/version-0.6.0/querying_data.md index 412be1c38a12b..766ada39fb8d1 100644 --- a/website/versioned_docs/version-0.6.0/querying_data.md +++ b/website/versioned_docs/version-0.6.0/querying_data.md @@ -166,8 +166,8 @@ hudiIncQueryDF.createOrReplaceTempView("hudi_trips_incremental") spark.sql("select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from hudi_trips_incremental where fare > 20.0").show() ``` -For examples, refer to [Setup spark-shell in quickstart](/docs/quick-start-guide#setup-spark-shell). -Please refer to [configurations](/docs/configurations#spark-datasource) section, to view all datasource options. +For examples, refer to [Setup spark-shell in quickstart](quick-start-guide#setup). +Please refer to [configurations](/docs/configurations#spark-datasource-configs) section, to view all datasource options. Additionally, `HoodieReadClient` offers the following functionality using Hudi's implicit indexing. diff --git a/website/versioned_docs/version-0.6.0/quick-start-guide.md b/website/versioned_docs/version-0.6.0/quick-start-guide.md index 9bf766915cd0e..072da7c866137 100644 --- a/website/versioned_docs/version-0.6.0/quick-start-guide.md +++ b/website/versioned_docs/version-0.6.0/quick-start-guide.md @@ -79,7 +79,7 @@ You can check the data generated under `/tmp/hudi_trips_cow///< [Modeling data stored in Hudi](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=113709185#FAQ-HowdoImodelthedatastoredinHudi) and for info on ways to ingest data into Hudi, refer to [Writing Hudi Tables](/docs/writing_data). Here we are using the default write operation : `upsert`. If you have a workload without updates, you can also issue -`insert` or `bulk_insert` operations which could be faster. To know more, refer to [Write operations](/docs/writing_data#write-operations) +`insert` or `bulk_insert` operations which could be faster. To know more, refer to [Write operations](writing_data#write-operations) \{: .notice--info} ## Query data @@ -290,7 +290,7 @@ You can check the data generated under `/tmp/hudi_trips_cow///< [Modeling data stored in Hudi](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=113709185#FAQ-HowdoImodelthedatastoredinHudi) and for info on ways to ingest data into Hudi, refer to [Writing Hudi Tables](/docs/writing_data). Here we are using the default write operation : `upsert`. If you have a workload without updates, you can also issue -`insert` or `bulk_insert` operations which could be faster. To know more, refer to [Write operations](/docs/writing_data#write-operations) +`insert` or `bulk_insert` operations which could be faster. 
To know more, refer to [Write operations](writing_data#write-operations) \{: .notice--info} ## Query data diff --git a/website/versioned_docs/version-0.6.0/writing_data.md b/website/versioned_docs/version-0.6.0/writing_data.md index 0a5b5933917be..5253d6bb0858b 100644 --- a/website/versioned_docs/version-0.6.0/writing_data.md +++ b/website/versioned_docs/version-0.6.0/writing_data.md @@ -380,10 +380,10 @@ column statistics etc. Even on some cloud data stores, there is often cost to li Here are some ways to efficiently manage the storage of your Hudi tables. - - The [small file handling feature](/docs/configurations#compactionSmallFileSize) in Hudi, profiles incoming workload + - The [small file handling feature](/docs/configurations#hoodieparquetsmallfilelimit) in Hudi, profiles incoming workload and distributes inserts to existing file groups instead of creating new file groups, which can lead to small files. - - Cleaner can be [configured](/docs/configurations#retainCommits) to clean up older file slices, more or less aggressively depending on maximum time for queries to run & lookback needed for incremental pull - - User can also tune the size of the [base/parquet file](/docs/configurations#limitFileSize), [log files](/docs/configurations#logFileMaxSize) & expected [compression ratio](/docs/configurations#parquetCompressionRatio), + - Cleaner can be [configured](configurations#retaincommitsno_of_commits_to_retain--24) to clean up older file slices, more or less aggressively depending on maximum time for queries to run & lookback needed for incremental pull + - User can also tune the size of the [base/parquet file](/docs/configurations#hoodieparquetmaxfilesize), [log files](configurations#logfilemaxsizelogfilesize--1gb) & expected [compression ratio](/docs/configurations#parquetCompressionRatio), such that sufficient number of inserts are grouped into the same file group, resulting in well sized base files ultimately. - Intelligently tuning the [bulk insert parallelism](/docs/configurations#withBulkInsertParallelism), can again in nicely sized initial file groups. It is in fact critical to get this right, since the file groups once created cannot be deleted, but simply expanded as explained before. diff --git a/website/versioned_docs/version-0.7.0/deployment.md b/website/versioned_docs/version-0.7.0/deployment.md index ba1c3a16fec89..a24cdc248e832 100644 --- a/website/versioned_docs/version-0.7.0/deployment.md +++ b/website/versioned_docs/version-0.7.0/deployment.md @@ -29,9 +29,9 @@ With Merge_On_Read Table, Hudi ingestion needs to also take care of compacting d ### DeltaStreamer -[DeltaStreamer](/docs/writing_data#deltastreamer) is the standalone utility to incrementally pull upstream changes from varied sources such as DFS, Kafka and DB Changelogs and ingest them to hudi tables. It runs as a spark application in 2 modes. +[DeltaStreamer](writing_data#deltastreamer) is the standalone utility to incrementally pull upstream changes from varied sources such as DFS, Kafka and DB Changelogs and ingest them to hudi tables. It runs as a spark application in 2 modes. - - **Run Once Mode** : In this mode, Deltastreamer performs one ingestion round which includes incrementally pulling events from upstream sources and ingesting them to hudi table. Background operations like cleaning old file versions and archiving hoodie timeline are automatically executed as part of the run. 
For Merge-On-Read tables, Compaction is also run inline as part of ingestion unless disabled by passing the flag "--disable-compaction". By default, Compaction is run inline for every ingestion run and this can be changed by setting the property "hoodie.compact.inline.max.delta.commits". You can either manually run this spark application or use any cron trigger or workflow orchestrator (most common deployment strategy) such as Apache Airflow to spawn this application. See command line options in [this section](/docs/writing_data#deltastreamer) for running the spark application.
+ - **Run Once Mode** : In this mode, Deltastreamer performs one ingestion round which includes incrementally pulling events from upstream sources and ingesting them to hudi table. Background operations like cleaning old file versions and archiving hoodie timeline are automatically executed as part of the run. For Merge-On-Read tables, Compaction is also run inline as part of ingestion unless disabled by passing the flag "--disable-compaction". By default, Compaction is run inline for every ingestion run and this can be changed by setting the property "hoodie.compact.inline.max.delta.commits". You can either manually run this spark application or use any cron trigger or workflow orchestrator (most common deployment strategy) such as Apache Airflow to spawn this application. See command line options in [this section](writing_data#deltastreamer) for running the spark application.
Here is an example invocation for reading from kafka topic in a single-run mode and writing to Merge On Read table type in a yarn cluster.
@@ -130,7 +130,7 @@ Here is an example invocation for reading from kafka topic in a continuous mode
### Spark Datasource Writer Jobs
-As described in [Writing Data](/docs/writing_data#datasource-writer), you can use spark datasource to ingest to hudi table. This mechanism allows you to ingest any spark dataframe in Hudi format. Hudi Spark DataSource also supports spark streaming to ingest a streaming source to Hudi table. For Merge On Read table types, inline compaction is turned on by default which runs after every ingestion run. The compaction frequency can be changed by setting the property "hoodie.compact.inline.max.delta.commits".
+As described in [Writing Data](writing_data#datasource-writer), you can use spark datasource to ingest to hudi table. This mechanism allows you to ingest any spark dataframe in Hudi format. Hudi Spark DataSource also supports spark streaming to ingest a streaming source to Hudi table. For Merge On Read table types, inline compaction is turned on by default which runs after every ingestion run. The compaction frequency can be changed by setting the property "hoodie.compact.inline.max.delta.commits".
Here is an example invocation using spark datasource
diff --git a/website/versioned_docs/version-0.7.0/querying_data.md b/website/versioned_docs/version-0.7.0/querying_data.md
index 8d227dc7bcf4d..5657ae107abb9 100644
--- a/website/versioned_docs/version-0.7.0/querying_data.md
+++ b/website/versioned_docs/version-0.7.0/querying_data.md
@@ -166,8 +166,8 @@ hudiIncQueryDF.createOrReplaceTempView("hudi_trips_incremental")
spark.sql("select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from hudi_trips_incremental where fare > 20.0").show() ```
-For examples, refer to [Setup spark-shell in quickstart](/docs/quick-start-guide#setup-spark-shell).
-Please refer to [configurations](/docs/configurations#spark-datasource) section, to view all datasource options.
+For examples, refer to [Setup spark-shell in quickstart](quick-start-guide#setup).
+Please refer to [configurations](/docs/configurations#spark-datasource-configs) section, to view all datasource options.
Additionally, `HoodieReadClient` offers the following functionality using Hudi's implicit indexing.
diff --git a/website/versioned_docs/version-0.7.0/quick-start-guide.md b/website/versioned_docs/version-0.7.0/quick-start-guide.md
index 55719b0746e7c..fcc69aaacc55b 100644
--- a/website/versioned_docs/version-0.7.0/quick-start-guide.md
+++ b/website/versioned_docs/version-0.7.0/quick-start-guide.md
@@ -79,7 +79,7 @@ You can check the data generated under `/tmp/hudi_trips_cow///<
[Modeling data stored in Hudi](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=113709185#FAQ-HowdoImodelthedatastoredinHudi) and for info on ways to ingest data into Hudi, refer to [Writing Hudi Tables](/docs/writing_data). Here we are using the default write operation : `upsert`. If you have a workload without updates, you can also issue
-`insert` or `bulk_insert` operations which could be faster. To know more, refer to [Write operations](/docs/writing_data#write-operations)
+`insert` or `bulk_insert` operations which could be faster. To know more, refer to [Write operations](writing_data#write-operations)
\{: .notice--info}
## Query data
@@ -362,7 +362,7 @@ You can check the data generated under `/tmp/hudi_trips_cow///<
[Modeling data stored in Hudi](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=113709185#FAQ-HowdoImodelthedatastoredinHudi) and for info on ways to ingest data into Hudi, refer to [Writing Hudi Tables](/docs/writing_data). Here we are using the default write operation : `upsert`. If you have a workload without updates, you can also issue
-`insert` or `bulk_insert` operations which could be faster. To know more, refer to [Write operations](/docs/writing_data#write-operations)
+`insert` or `bulk_insert` operations which could be faster. To know more, refer to [Write operations](writing_data#write-operations)
\{: .notice--info}
## Query data
diff --git a/website/versioned_docs/version-0.7.0/writing_data.md b/website/versioned_docs/version-0.7.0/writing_data.md
index fb100aeb8f519..7750c93f7f415 100644
--- a/website/versioned_docs/version-0.7.0/writing_data.md
+++ b/website/versioned_docs/version-0.7.0/writing_data.md
@@ -382,10 +382,10 @@ column statistics etc. Even on some cloud data stores, there is often cost to li
Here are some ways to efficiently manage the storage of your Hudi tables.
- - The [small file handling feature](/docs/configurations#compactionSmallFileSize) in Hudi, profiles incoming workload
+ - The [small file handling feature](/docs/configurations#hoodieparquetsmallfilelimit) in Hudi, profiles incoming workload
and distributes inserts to existing file groups instead of creating new file groups, which can lead to small files.
- - Cleaner can be [configured](/docs/configurations#retainCommits) to clean up older file slices, more or less aggressively depending on maximum time for queries to run & lookback needed for incremental pull
- - User can also tune the size of the [base/parquet file](/docs/configurations#limitFileSize), [log files](/docs/configurations#logFileMaxSize) & expected [compression ratio](/docs/configurations#parquetCompressionRatio),
+ - Cleaner can be [configured](configurations#retaincommitsno_of_commits_to_retain--24) to clean up older file slices, more or less aggressively depending on maximum time for queries to run & lookback needed for incremental pull
+ - User can also tune the size of the [base/parquet file](/docs/configurations#hoodieparquetmaxfilesize), [log files](configurations#logfilemaxsizelogfilesize--1gb) & expected [compression ratio](/docs/configurations#parquetCompressionRatio),
such that sufficient number of inserts are grouped into the same file group, resulting in well sized base files ultimately.
- Intelligently tuning the [bulk insert parallelism](/docs/configurations#withBulkInsertParallelism), can again in nicely sized initial file groups. It is in fact critical to get this right, since the file groups once created cannot be deleted, but simply expanded as explained before.
diff --git a/website/versioned_docs/version-0.8.0/concurrency_control.md b/website/versioned_docs/version-0.8.0/concurrency_control.md
index aafaf5807adb5..a3789940e919d 100644
--- a/website/versioned_docs/version-0.8.0/concurrency_control.md
+++ b/website/versioned_docs/version-0.8.0/concurrency_control.md
@@ -20,13 +20,13 @@ between multiple table service writers and readers. Additionally, using MVCC, Hu
the same Hudi Table. Hudi supports `file level OCC`, i.e., for any 2 commits (or writers) happening to the same table, if they do not have writes to overlapping files being changed, both writers are allowed to succeed. This feature is currently *experimental* and requires either Zookeeper or HiveMetastore to acquire locks.
-It may be helpful to understand the different guarantees provided by [write operations](/docs/writing_data#write-operations) via Hudi datasource or the delta streamer.
+It may be helpful to understand the different guarantees provided by [write operations](writing_data#write-operations) via Hudi datasource or the delta streamer.
## Single Writer Guarantees
- *UPSERT Guarantee*: The target table will NEVER show duplicates.
- - *INSERT Guarantee*: The target table wilL NEVER have duplicates if [dedup](/docs/configurations#INSERT_DROP_DUPS_OPT_KEY) is enabled.
- - *BULK_INSERT Guarantee*: The target table will NEVER have duplicates if [dedup](/docs/configurations#INSERT_DROP_DUPS_OPT_KEY) is enabled.
+ - *INSERT Guarantee*: The target table will NEVER have duplicates if [dedup](configurations#insert_drop_dups_opt_key) is enabled.
+ - *BULK_INSERT Guarantee*: The target table will NEVER have duplicates if [dedup](configurations#insert_drop_dups_opt_key) is enabled.
- *INCREMENTAL PULL Guarantee*: Data consumption and checkpoints are NEVER out of order.
## Multi Writer Guarantees
@@ -34,8 +34,8 @@ It may be helpful to understand the different guarantees provided by [write oper
With multiple writers using OCC, some of the above guarantees change as follows
- *UPSERT Guarantee*: The target table will NEVER show duplicates.
-- *INSERT Guarantee*: The target table MIGHT have duplicates even if [dedup](/docs/configurations#INSERT_DROP_DUPS_OPT_KEY) is enabled.
-- *BULK_INSERT Guarantee*: The target table MIGHT have duplicates even if [dedup](/docs/configurations#INSERT_DROP_DUPS_OPT_KEY) is enabled.
+- *INSERT Guarantee*: The target table MIGHT have duplicates even if [dedup](configurations#insert_drop_dups_opt_key) is enabled.
+- *BULK_INSERT Guarantee*: The target table MIGHT have duplicates even if [dedup](configurations#insert_drop_dups_opt_key) is enabled.
- *INCREMENTAL PULL Guarantee*: Data consumption and checkpoints MIGHT be out of order due to multiple writer jobs finishing at different times.
## Enabling Multi Writing
diff --git a/website/versioned_docs/version-0.8.0/deployment.md b/website/versioned_docs/version-0.8.0/deployment.md
index e25b5dab4c761..1eba743a957f0 100644
--- a/website/versioned_docs/version-0.8.0/deployment.md
+++ b/website/versioned_docs/version-0.8.0/deployment.md
@@ -29,9 +29,9 @@ With Merge_On_Read Table, Hudi ingestion needs to also take care of compacting d
### DeltaStreamer
-[DeltaStreamer](/docs/writing_data#deltastreamer) is the standalone utility to incrementally pull upstream changes from varied sources such as DFS, Kafka and DB Changelogs and ingest them to hudi tables. It runs as a spark application in 2 modes.
+[DeltaStreamer](writing_data#deltastreamer) is the standalone utility to incrementally pull upstream changes from varied sources such as DFS, Kafka and DB Changelogs and ingest them to hudi tables. It runs as a spark application in 2 modes.
- - **Run Once Mode** : In this mode, Deltastreamer performs one ingestion round which includes incrementally pulling events from upstream sources and ingesting them to hudi table. Background operations like cleaning old file versions and archiving hoodie timeline are automatically executed as part of the run. For Merge-On-Read tables, Compaction is also run inline as part of ingestion unless disabled by passing the flag "--disable-compaction". By default, Compaction is run inline for every ingestion run and this can be changed by setting the property "hoodie.compact.inline.max.delta.commits". You can either manually run this spark application or use any cron trigger or workflow orchestrator (most common deployment strategy) such as Apache Airflow to spawn this application. See command line options in [this section](/docs/writing_data#deltastreamer) for running the spark application.
+ - **Run Once Mode** : In this mode, Deltastreamer performs one ingestion round which includes incrementally pulling events from upstream sources and ingesting them to hudi table. Background operations like cleaning old file versions and archiving hoodie timeline are automatically executed as part of the run. For Merge-On-Read tables, Compaction is also run inline as part of ingestion unless disabled by passing the flag "--disable-compaction". By default, Compaction is run inline for every ingestion run and this can be changed by setting the property "hoodie.compact.inline.max.delta.commits". You can either manually run this spark application or use any cron trigger or workflow orchestrator (most common deployment strategy) such as Apache Airflow to spawn this application. See command line options in [this section](writing_data#deltastreamer) for running the spark application.
Here is an example invocation for reading from kafka topic in a single-run mode and writing to Merge On Read table type in a yarn cluster.
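The deployment hunks above describe inline compaction for Merge-On-Read ingestion and the `hoodie.compact.inline.max.delta.commits` knob, but the example invocations themselves are not part of this diff. As a rough, hedged illustration only, a Merge-On-Read write through the Spark datasource that lowers the compaction frequency might look like the spark-shell sketch below; the table name, base path, record key, precombine and partition fields are assumptions, not values taken from the docs.

```scala
// Rough sketch, meant for spark-shell with the Hudi spark bundle on the classpath.
// Table name, base path and field names are assumptions, not values from the docs.
import org.apache.spark.sql.SaveMode
import spark.implicits._

// Toy batch standing in for data pulled from an upstream source.
val inputDF = Seq(("id-1", "2021-06-01 00:00:00", "asia/chennai", 42.0))
  .toDF("uuid", "ts", "partitionpath", "fare")

inputDF.write.format("org.apache.hudi").
  option("hoodie.table.name", "example_mor_table").
  option("hoodie.datasource.write.table.type", "MERGE_ON_READ").
  option("hoodie.datasource.write.recordkey.field", "uuid").
  option("hoodie.datasource.write.precombine.field", "ts").
  option("hoodie.datasource.write.partitionpath.field", "partitionpath").
  // Keep compaction inline, but trigger it only after every 4 delta commits
  // instead of after every ingestion run.
  option("hoodie.compact.inline.max.delta.commits", "4").
  // Overwrite so the toy example (re)creates the table at this path.
  mode(SaveMode.Overwrite).
  save("/tmp/example_mor_table")
```

The same property can equally be set in the properties file passed to DeltaStreamer for the run-once and continuous modes described above.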
@@ -130,7 +130,7 @@ Here is an example invocation for reading from kafka topic in a continuous mode
### Spark Datasource Writer Jobs
-As described in [Writing Data](/docs/writing_data#datasource-writer), you can use spark datasource to ingest to hudi table. This mechanism allows you to ingest any spark dataframe in Hudi format. Hudi Spark DataSource also supports spark streaming to ingest a streaming source to Hudi table. For Merge On Read table types, inline compaction is turned on by default which runs after every ingestion run. The compaction frequency can be changed by setting the property "hoodie.compact.inline.max.delta.commits".
+As described in [Writing Data](writing_data#datasource-writer), you can use spark datasource to ingest to hudi table. This mechanism allows you to ingest any spark dataframe in Hudi format. Hudi Spark DataSource also supports spark streaming to ingest a streaming source to Hudi table. For Merge On Read table types, inline compaction is turned on by default which runs after every ingestion run. The compaction frequency can be changed by setting the property "hoodie.compact.inline.max.delta.commits".
Here is an example invocation using spark datasource
diff --git a/website/versioned_docs/version-0.8.0/querying_data.md b/website/versioned_docs/version-0.8.0/querying_data.md
index 75df56f2b527d..450b96ba66bfc 100644
--- a/website/versioned_docs/version-0.8.0/querying_data.md
+++ b/website/versioned_docs/version-0.8.0/querying_data.md
@@ -168,8 +168,8 @@ hudiIncQueryDF.createOrReplaceTempView("hudi_trips_incremental")
spark.sql("select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from hudi_trips_incremental where fare > 20.0").show() ```
-For examples, refer to [Setup spark-shell in quickstart](/docs/quick-start-guide#setup-spark-shell).
-Please refer to [configurations](/docs/configurations#spark-datasource) section, to view all datasource options.
+For examples, refer to [Setup spark-shell in quickstart](quick-start-guide#setup).
+Please refer to [configurations](/docs/configurations#spark-datasource-configs) section, to view all datasource options.
Additionally, `HoodieReadClient` offers the following functionality using Hudi's implicit indexing.
diff --git a/website/versioned_docs/version-0.8.0/quick-start-guide.md b/website/versioned_docs/version-0.8.0/quick-start-guide.md
index 3a7c7a244178f..0b5d1e0d2308d 100644
--- a/website/versioned_docs/version-0.8.0/quick-start-guide.md
+++ b/website/versioned_docs/version-0.8.0/quick-start-guide.md
@@ -182,7 +182,7 @@ You can check the data generated under `/tmp/hudi_trips_cow///<
[Modeling data stored in Hudi](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=113709185#FAQ-HowdoImodelthedatastoredinHudi) and for info on ways to ingest data into Hudi, refer to [Writing Hudi Tables](/docs/writing_data). Here we are using the default write operation : `upsert`. If you have a workload without updates, you can also issue
-`insert` or `bulk_insert` operations which could be faster. To know more, refer to [Write operations](/docs/writing_data#write-operations)
+`insert` or `bulk_insert` operations which could be faster. To know more, refer to [Write operations](writing_data#write-operations)
:::
## Query data
diff --git a/website/versioned_docs/version-0.8.0/writing_data.md b/website/versioned_docs/version-0.8.0/writing_data.md
index fd2d5094f7a07..75ad24fc26ff3 100644
--- a/website/versioned_docs/version-0.8.0/writing_data.md
+++ b/website/versioned_docs/version-0.8.0/writing_data.md
@@ -416,10 +416,10 @@ column statistics etc. Even on some cloud data stores, there is often cost to li
Here are some ways to efficiently manage the storage of your Hudi tables.
- - The [small file handling feature](/docs/configurations#compactionSmallFileSize) in Hudi, profiles incoming workload
+ - The [small file handling feature](/docs/configurations#hoodieparquetsmallfilelimit) in Hudi, profiles incoming workload
and distributes inserts to existing file groups instead of creating new file groups, which can lead to small files.
- - Cleaner can be [configured](/docs/configurations#retainCommits) to clean up older file slices, more or less aggressively depending on maximum time for queries to run & lookback needed for incremental pull
- - User can also tune the size of the [base/parquet file](/docs/configurations#limitFileSize), [log files](/docs/configurations#logFileMaxSize) & expected [compression ratio](/docs/configurations#parquetCompressionRatio),
+ - Cleaner can be [configured](configurations#retaincommitsno_of_commits_to_retain--24) to clean up older file slices, more or less aggressively depending on maximum time for queries to run & lookback needed for incremental pull
+ - User can also tune the size of the [base/parquet file](/docs/configurations#hoodieparquetmaxfilesize), [log files](configurations#logfilemaxsizelogfilesize--1gb) & expected [compression ratio](/docs/configurations#parquetCompressionRatio),
such that sufficient number of inserts are grouped into the same file group, resulting in well sized base files ultimately.
- Intelligently tuning the [bulk insert parallelism](/docs/configurations#withBulkInsertParallelism), can again in nicely sized initial file groups. It is in fact critical to get this right, since the file groups once created cannot be deleted, but simply expanded as explained before.
diff --git a/website/versioned_docs/version-0.9.0/concurrency_control.md b/website/versioned_docs/version-0.9.0/concurrency_control.md
index 3a816b6d089f9..8186e64013a6c 100644
--- a/website/versioned_docs/version-0.9.0/concurrency_control.md
+++ b/website/versioned_docs/version-0.9.0/concurrency_control.md
@@ -19,13 +19,13 @@ between multiple table service writers and readers. Additionally, using MVCC, Hu
the same Hudi Table. Hudi supports `file level OCC`, i.e., for any 2 commits (or writers) happening to the same table, if they do not have writes to overlapping files being changed, both writers are allowed to succeed. This feature is currently *experimental* and requires either Zookeeper or HiveMetastore to acquire locks.
-It may be helpful to understand the different guarantees provided by [write operations](/docs/writing_data#write-operations) via Hudi datasource or the delta streamer.
+It may be helpful to understand the different guarantees provided by [write operations](writing_data#write-operations) via Hudi datasource or the delta streamer.
## Single Writer Guarantees
- *UPSERT Guarantee*: The target table will NEVER show duplicates.
- - *INSERT Guarantee*: The target table wilL NEVER have duplicates if [dedup](/docs/configurations#INSERT_DROP_DUPS) is enabled.
- - *BULK_INSERT Guarantee*: The target table will NEVER have duplicates if [dedup](/docs/configurations#INSERT_DROP_DUPS) is enabled.
+ - *INSERT Guarantee*: The target table will NEVER have duplicates if [dedup](configurations#hoodiedatasourcewriteinsertdropduplicates) is enabled.
+ - *BULK_INSERT Guarantee*: The target table will NEVER have duplicates if [dedup](configurations#hoodiedatasourcewriteinsertdropduplicates) is enabled.
- *INCREMENTAL PULL Guarantee*: Data consumption and checkpoints are NEVER out of order.
## Multi Writer Guarantees
@@ -33,8 +33,8 @@ It may be helpful to understand the different guarantees provided by [write oper
With multiple writers using OCC, some of the above guarantees change as follows
- *UPSERT Guarantee*: The target table will NEVER show duplicates.
-- *INSERT Guarantee*: The target table MIGHT have duplicates even if [dedup](/docs/configurations#INSERT_DROP_DUPS) is enabled.
-- *BULK_INSERT Guarantee*: The target table MIGHT have duplicates even if [dedup](/docs/configurations#INSERT_DROP_DUPS) is enabled.
+- *INSERT Guarantee*: The target table MIGHT have duplicates even if [dedup](configurations#hoodiedatasourcewriteinsertdropduplicates) is enabled.
+- *BULK_INSERT Guarantee*: The target table MIGHT have duplicates even if [dedup](configurations#hoodiedatasourcewriteinsertdropduplicates) is enabled.
- *INCREMENTAL PULL Guarantee*: Data consumption and checkpoints MIGHT be out of order due to multiple writer jobs finishing at different times.
## Enabling Multi Writing
diff --git a/website/versioned_docs/version-0.9.0/deployment.md b/website/versioned_docs/version-0.9.0/deployment.md
index f5eba2c1ab117..5d954b4140a3a 100644
--- a/website/versioned_docs/version-0.9.0/deployment.md
+++ b/website/versioned_docs/version-0.9.0/deployment.md
@@ -27,9 +27,9 @@ With Merge_On_Read Table, Hudi ingestion needs to also take care of compacting d
### DeltaStreamer
-[DeltaStreamer](/docs/writing_data#deltastreamer) is the standalone utility to incrementally pull upstream changes from varied sources such as DFS, Kafka and DB Changelogs and ingest them to hudi tables. It runs as a spark application in 2 modes.
+[DeltaStreamer](writing_data#deltastreamer) is the standalone utility to incrementally pull upstream changes from varied sources such as DFS, Kafka and DB Changelogs and ingest them to hudi tables. It runs as a spark application in 2 modes.
- - **Run Once Mode** : In this mode, Deltastreamer performs one ingestion round which includes incrementally pulling events from upstream sources and ingesting them to hudi table. Background operations like cleaning old file versions and archiving hoodie timeline are automatically executed as part of the run. For Merge-On-Read tables, Compaction is also run inline as part of ingestion unless disabled by passing the flag "--disable-compaction". By default, Compaction is run inline for every ingestion run and this can be changed by setting the property "hoodie.compact.inline.max.delta.commits". You can either manually run this spark application or use any cron trigger or workflow orchestrator (most common deployment strategy) such as Apache Airflow to spawn this application. See command line options in [this section](/docs/writing_data#deltastreamer) for running the spark application.
+ - **Run Once Mode** : In this mode, Deltastreamer performs one ingestion round which includes incrementally pulling events from upstream sources and ingesting them to hudi table. Background operations like cleaning old file versions and archiving hoodie timeline are automatically executed as part of the run. For Merge-On-Read tables, Compaction is also run inline as part of ingestion unless disabled by passing the flag "--disable-compaction". By default, Compaction is run inline for every ingestion run and this can be changed by setting the property "hoodie.compact.inline.max.delta.commits". You can either manually run this spark application or use any cron trigger or workflow orchestrator (most common deployment strategy) such as Apache Airflow to spawn this application. See command line options in [this section](writing_data#deltastreamer) for running the spark application.
Here is an example invocation for reading from kafka topic in a single-run mode and writing to Merge On Read table type in a yarn cluster.
@@ -128,7 +128,7 @@ Here is an example invocation for reading from kafka topic in a continuous mode
### Spark Datasource Writer Jobs
-As described in [Writing Data](/docs/writing_data#datasource-writer), you can use spark datasource to ingest to hudi table. This mechanism allows you to ingest any spark dataframe in Hudi format. Hudi Spark DataSource also supports spark streaming to ingest a streaming source to Hudi table. For Merge On Read table types, inline compaction is turned on by default which runs after every ingestion run. The compaction frequency can be changed by setting the property "hoodie.compact.inline.max.delta.commits".
+As described in [Writing Data](writing_data#datasource-writer), you can use spark datasource to ingest to hudi table. This mechanism allows you to ingest any spark dataframe in Hudi format. Hudi Spark DataSource also supports spark streaming to ingest a streaming source to Hudi table. For Merge On Read table types, inline compaction is turned on by default which runs after every ingestion run. The compaction frequency can be changed by setting the property "hoodie.compact.inline.max.delta.commits".
Here is an example invocation using spark datasource
diff --git a/website/versioned_docs/version-0.9.0/querying_data.md b/website/versioned_docs/version-0.9.0/querying_data.md
index ede7d04f4d95d..f5e3769217819 100644
--- a/website/versioned_docs/version-0.9.0/querying_data.md
+++ b/website/versioned_docs/version-0.9.0/querying_data.md
@@ -173,8 +173,8 @@ hudiIncQueryDF.createOrReplaceTempView("hudi_trips_incremental")
spark.sql("select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from hudi_trips_incremental where fare > 20.0").show() ```
-For examples, refer to [Setup spark-shell in quickstart](/docs/quick-start-guide#setup-spark-shell).
-Please refer to [configurations](/docs/configurations#spark-datasource) section, to view all datasource options.
+For examples, refer to [Setup spark-shell in quickstart](quick-start-guide#setup).
+Please refer to [configurations](/docs/configurations/#SPARK_DATASOURCE) section, to view all datasource options.
Additionally, `HoodieReadClient` offers the following functionality using Hudi's implicit indexing.
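The querying_data hunks above reference the spark-shell incremental query and point readers at the spark datasource read options. For orientation only, a minimal incremental read using those options might look like the following sketch; the base path and the begin instant time are assumptions.

```scala
// Rough sketch, meant for spark-shell with the Hudi spark bundle on the classpath.
// The base path and begin instant time are assumptions.
val basePath = "/tmp/hudi_trips_cow"   // path of an existing Hudi table
val beginTime = "20210601000000"       // commit time to start the incremental pull from

val hudiIncQueryDF = spark.read.format("org.apache.hudi").
  // Ask only for records committed after `beginTime`, instead of a full snapshot.
  option("hoodie.datasource.query.type", "incremental").
  option("hoodie.datasource.read.begin.instanttime", beginTime).
  load(basePath)

hudiIncQueryDF.createOrReplaceTempView("hudi_trips_incremental")
spark.sql("select `_hoodie_commit_time`, fare from hudi_trips_incremental where fare > 20.0").show()
```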
diff --git a/website/versioned_docs/version-0.9.0/quick-start-guide.md b/website/versioned_docs/version-0.9.0/quick-start-guide.md
index 3ab254501eb2f..a345030baffe9 100644
--- a/website/versioned_docs/version-0.9.0/quick-start-guide.md
+++ b/website/versioned_docs/version-0.9.0/quick-start-guide.md
@@ -368,7 +368,7 @@ You can check the data generated under `/tmp/hudi_trips_cow///<
[Modeling data stored in Hudi](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=113709185#FAQ-HowdoImodelthedatastoredinHudi) and for info on ways to ingest data into Hudi, refer to [Writing Hudi Tables](/docs/writing_data). Here we are using the default write operation : `upsert`. If you have a workload without updates, you can also issue
-`insert` or `bulk_insert` operations which could be faster. To know more, refer to [Write operations](/docs/writing_data#write-operations)
+`insert` or `bulk_insert` operations which could be faster. To know more, refer to [Write operations](writing_data#write-operations)
:::
@@ -404,7 +404,7 @@ You can check the data generated under `/tmp/hudi_trips_cow///<
[Modeling data stored in Hudi](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=113709185#FAQ-HowdoImodelthedatastoredinHudi) and for info on ways to ingest data into Hudi, refer to [Writing Hudi Tables](/docs/writing_data). Here we are using the default write operation : `upsert`. If you have a workload without updates, you can also issue
-`insert` or `bulk_insert` operations which could be faster. To know more, refer to [Write operations](/docs/writing_data#write-operations)
+`insert` or `bulk_insert` operations which could be faster. To know more, refer to [Write operations](writing_data#write-operations)
:::
diff --git a/website/versioned_docs/version-0.9.0/writing_data.md b/website/versioned_docs/version-0.9.0/writing_data.md
index 7671593bacf9a..8f95514ef73b2 100644
--- a/website/versioned_docs/version-0.9.0/writing_data.md
+++ b/website/versioned_docs/version-0.9.0/writing_data.md
@@ -415,10 +415,10 @@ column statistics etc. Even on some cloud data stores, there is often cost to li
Here are some ways to efficiently manage the storage of your Hudi tables.
- - The [small file handling feature](/docs/configurations#compactionSmallFileSize) in Hudi, profiles incoming workload
+ - The [small file handling feature](/docs/configurations#hoodieparquetsmallfilelimit) in Hudi, profiles incoming workload
and distributes inserts to existing file groups instead of creating new file groups, which can lead to small files.
- - Cleaner can be [configured](/docs/configurations#retainCommits) to clean up older file slices, more or less aggressively depending on maximum time for queries to run & lookback needed for incremental pull
- - User can also tune the size of the [base/parquet file](/docs/configurations#limitFileSize), [log files](/docs/configurations#logFileMaxSize) & expected [compression ratio](/docs/configurations#parquetCompressionRatio),
+ - Cleaner can be [configured](configurations#retaincommitsno_of_commits_to_retain--24) to clean up older file slices, more or less aggressively depending on maximum time for queries to run & lookback needed for incremental pull
+ - User can also tune the size of the [base/parquet file](/docs/configurations#hoodieparquetmaxfilesize), [log files](configurations#hoodielogfilemaxsize) & expected [compression ratio](/docs/configurations#parquetCompressionRatio),
such that sufficient number of inserts are grouped into the same file group, resulting in well sized base files ultimately.
- Intelligently tuning the [bulk insert parallelism](/docs/configurations#withBulkInsertParallelism), can again in nicely sized initial file groups. It is in fact critical to get this right, since the file groups once created cannot be deleted, but simply expanded as explained before.
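The writing_data hunks above touch several storage-management knobs (small file limit, cleaner retention, base and log file sizes, compression ratio, bulk insert parallelism). As a rough sketch of how such properties might be passed on a Spark datasource write, with every value, the table name and the path being illustrative assumptions rather than recommendations:

```scala
// Rough sketch, meant for spark-shell with the Hudi spark bundle on the classpath.
// All values, the table name and the path are illustrative assumptions, not recommendations.
import org.apache.spark.sql.SaveMode
import spark.implicits._

// Toy batch standing in for an incoming write.
val df = Seq(("id-1", "2021-06-01 00:00:00", "asia/chennai", 17.5))
  .toDF("uuid", "ts", "partitionpath", "fare")

df.write.format("org.apache.hudi").
  option("hoodie.table.name", "example_table").
  option("hoodie.datasource.write.recordkey.field", "uuid").
  option("hoodie.datasource.write.precombine.field", "ts").
  option("hoodie.datasource.write.partitionpath.field", "partitionpath").
  // Small file handling: base files below this size are candidates for receiving new inserts.
  option("hoodie.parquet.small.file.limit", "104857600").   // ~100 MB
  // Target size for base/parquet files.
  option("hoodie.parquet.max.file.size", "125829120").      // ~120 MB
  // Max size of a Merge-On-Read log file before rolling over (ignored for Copy-On-Write).
  option("hoodie.logfile.max.size", "1073741824").          // ~1 GB
  // Expected compression ratio of input data to parquet, used for file sizing estimates.
  option("hoodie.parquet.compression.ratio", "0.1").
  // Cleaner: number of commits whose file slices are retained for readers and incremental pulls.
  option("hoodie.cleaner.commits.retained", "24").
  // Shuffle parallelism used for bulk_insert, which shapes the initial file groups.
  option("hoodie.bulkinsert.shuffle.parallelism", "200").
  // Overwrite so the toy example (re)creates the table at this path.
  mode(SaveMode.Overwrite).
  save("/tmp/example_table")
```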