From b72675ea627406d5117fd2019b01030562ff7e35 Mon Sep 17 00:00:00 2001
From: Bhavani Sudha Saktheeswaran <2179254+bhasudha@users.noreply.github.com>
Date: Tue, 26 Nov 2024 14:37:20 -0800
Subject: [PATCH] [site] Fix broken links across versions
---
README.md | 39 +++++++++++----
.../0.8.0/concurrency_control/.index.html.swp | Bin 0 -> 4096 bytes
.../2019-09-09-ingesting-database-changes.md | 2 +-
.../2020-01-20-change-capture-using-aws.md | 2 +-
.../blog/2021-01-27-hudi-clustering-intro.md | 2 +-
website/blog/2021-03-01-hudi-file-sizing.md | 6 +--
.../2021-08-16-kafka-custom-deserializer.md | 2 +-
website/docs/cli.md | 2 +-
website/docs/concurrency_control.md | 6 +--
website/docs/deployment.md | 2 +-
website/docs/faq.md | 14 +++---
website/docs/faq_general.md | 2 +-
website/docs/faq_table_services.md | 2 +-
website/docs/faq_writing_tables.md | 8 +--
website/docs/file_sizing.md | 2 +-
website/docs/flink-quick-start-guide.md | 18 +++----
website/docs/hudi_stack.md | 2 +-
website/docs/indexes.md | 12 ++---
website/docs/metadata.md | 4 +-
website/docs/metadata_indexing.md | 8 +--
website/docs/precommit_validator.md | 2 +-
website/docs/procedures.md | 20 ++++----
website/docs/querying_data.md | 2 +-
website/docs/quick-start-guide.md | 28 +++++------
website/docs/record_merger.md | 6 +--
website/docs/sql_ddl.md | 14 +++---
website/docs/sql_dml.md | 6 +--
website/docs/sql_queries.md | 2 +-
website/docs/storage_layouts.md | 4 +-
website/docs/use_cases.md | 8 +--
website/docs/write_operations.md | 16 +++---
website/docs/writing_data.md | 4 +-
website/releases/release-0.11.0.md | 2 +-
website/releases/release-0.7.0.md | 2 +-
website/releases/release-0.9.0.md | 2 +-
website/releases/release-1.0.0-beta1.md | 4 +-
website/releases/release-1.0.0-beta2.md | 2 +-
.../version-0.10.0/clustering.md | 2 +-
.../version-0.10.0/compaction.md | 2 +-
.../version-0.10.0/concurrency_control.md | 10 ++--
.../version-0.10.0/deployment.md | 6 +--
website/versioned_docs/version-0.10.0/faq.md | 2 +-
.../version-0.10.0/file_sizing.md | 2 +-
.../version-0.10.0/performance.md | 6 +--
.../version-0.10.0/querying_data.md | 8 +--
.../version-0.10.0/quick-start-guide.md | 4 +-
.../version-0.10.0/write_operations.md | 4 +-
.../version-0.10.0/writing_data.md | 4 +-
.../version-0.10.1/clustering.md | 2 +-
.../version-0.10.1/compaction.md | 2 +-
.../version-0.10.1/concurrency_control.md | 10 ++--
.../version-0.10.1/deployment.md | 6 +--
website/versioned_docs/version-0.10.1/faq.md | 2 +-
.../version-0.10.1/file_sizing.md | 2 +-
.../version-0.10.1/performance.md | 6 +--
.../version-0.10.1/querying_data.md | 8 +--
.../version-0.10.1/quick-start-guide.md | 4 +-
.../version-0.10.1/tuning-guide.md | 2 +-
.../version-0.10.1/write_operations.md | 4 +-
.../version-0.10.1/writing_data.md | 4 +-
.../version-0.11.0/compaction.md | 2 +-
.../version-0.11.0/deployment.md | 2 +-
website/versioned_docs/version-0.11.0/faq.md | 2 +-
.../version-0.11.0/file_sizing.md | 2 +-
.../version-0.11.0/querying_data.md | 2 +-
.../version-0.11.0/tuning-guide.md | 2 +-
.../version-0.11.0/write_operations.md | 2 +-
.../version-0.11.1/compaction.md | 2 +-
.../version-0.11.1/deployment.md | 2 +-
website/versioned_docs/version-0.11.1/faq.md | 2 +-
.../version-0.11.1/file_sizing.md | 2 +-
.../version-0.11.1/querying_data.md | 2 +-
.../version-0.11.1/tuning-guide.md | 2 +-
.../version-0.11.1/write_operations.md | 2 +-
.../version-0.12.0/compaction.md | 2 +-
.../version-0.12.0/deployment.md | 2 +-
website/versioned_docs/version-0.12.0/faq.md | 2 +-
.../version-0.12.0/file_sizing.md | 2 +-
.../version-0.12.0/flink-quick-start-guide.md | 2 +-
.../version-0.12.0/querying_data.md | 2 +-
.../version-0.12.0/tuning-guide.md | 2 +-
.../version-0.12.0/write_operations.md | 2 +-
.../version-0.12.1/compaction.md | 2 +-
.../version-0.12.1/deployment.md | 2 +-
website/versioned_docs/version-0.12.1/faq.md | 2 +-
.../version-0.12.1/file_sizing.md | 2 +-
.../version-0.12.1/querying_data.md | 2 +-
.../version-0.12.1/tuning-guide.md | 2 +-
.../version-0.12.1/write_operations.md | 2 +-
.../version-0.12.2/compaction.md | 2 +-
.../version-0.12.2/deployment.md | 2 +-
website/versioned_docs/version-0.12.2/faq.md | 2 +-
.../version-0.12.2/file_sizing.md | 2 +-
.../version-0.12.2/flink-quick-start-guide.md | 2 +-
.../version-0.12.2/querying_data.md | 2 +-
.../version-0.12.2/quick-start-guide.md | 2 +-
.../version-0.12.2/tuning-guide.md | 2 +-
.../version-0.12.2/write_operations.md | 2 +-
.../version-0.12.3/compaction.md | 2 +-
.../version-0.12.3/deployment.md | 2 +-
website/versioned_docs/version-0.12.3/faq.md | 2 +-
.../version-0.12.3/file_sizing.md | 2 +-
.../version-0.12.3/flink-quick-start-guide.md | 2 +-
.../version-0.12.3/querying_data.md | 2 +-
.../version-0.12.3/quick-start-guide.md | 2 +-
.../version-0.12.3/tuning-guide.md | 2 +-
.../version-0.12.3/write_operations.md | 2 +-
.../version-0.13.0/compaction.md | 2 +-
.../version-0.13.0/deployment.md | 2 +-
website/versioned_docs/version-0.13.0/faq.md | 2 +-
.../version-0.13.0/file_sizing.md | 2 +-
.../version-0.13.0/flink-quick-start-guide.md | 2 +-
.../version-0.13.0/querying_data.md | 2 +-
.../version-0.13.0/quick-start-guide.md | 2 +-
.../version-0.13.0/tuning-guide.md | 2 +-
.../version-0.13.0/write_operations.md | 2 +-
.../version-0.13.1/compaction.md | 2 +-
.../version-0.13.1/deployment.md | 2 +-
website/versioned_docs/version-0.13.1/faq.md | 2 +-
.../version-0.13.1/file_sizing.md | 2 +-
.../version-0.13.1/flink-quick-start-guide.md | 6 +--
.../version-0.13.1/quick-start-guide.md | 2 +-
.../version-0.13.1/record_payload.md | 2 +-
.../version-0.13.1/tuning-guide.md | 2 +-
.../version-0.13.1/write_operations.md | 2 +-
.../version-0.14.0/deployment.md | 2 +-
website/versioned_docs/version-0.14.0/faq.md | 4 +-
.../version-0.14.0/flink-quick-start-guide.md | 6 +--
.../version-0.14.0/quick-start-guide.md | 4 +-
.../version-0.14.0/record_payload.md | 2 +-
.../version-0.14.0/write_operations.md | 2 +-
website/versioned_docs/version-0.14.1/cli.md | 2 +-
.../version-0.14.1/concurrency_control.md | 2 +-
.../version-0.14.1/deployment.md | 2 +-
.../version-0.14.1/faq_writing_tables.md | 6 +--
.../version-0.14.1/file_layouts.md | 4 +-
.../version-0.14.1/file_sizing.md | 2 +-
.../version-0.14.1/flink-quick-start-guide.md | 18 +++----
.../versioned_docs/version-0.14.1/indexing.md | 4 +-
.../version-0.14.1/metadata_indexing.md | 4 +-
.../version-0.14.1/procedures.md | 46 +++++++++---------
.../version-0.14.1/querying_data.md | 2 +-
.../version-0.14.1/quick-start-guide.md | 30 ++++++------
.../version-0.14.1/record_payload.md | 2 +-
.../versioned_docs/version-0.14.1/sql_ddl.md | 6 +--
.../versioned_docs/version-0.14.1/sql_dml.md | 4 +-
.../version-0.14.1/use_cases.md | 12 ++---
.../version-0.14.1/write_operations.md | 16 +++---
.../version-0.14.1/writing_data.md | 16 +++---
website/versioned_docs/version-0.15.0/cli.md | 2 +-
.../version-0.15.0/concurrency_control.md | 2 +-
.../version-0.15.0/deployment.md | 2 +-
website/versioned_docs/version-0.15.0/faq.md | 2 +-
.../version-0.15.0/faq_general.md | 2 +-
.../version-0.15.0/faq_table_services.md | 2 +-
.../version-0.15.0/faq_writing_tables.md | 8 +--
.../version-0.15.0/file_layouts.md | 4 +-
.../version-0.15.0/file_sizing.md | 2 +-
.../version-0.15.0/flink-quick-start-guide.md | 18 +++----
.../version-0.15.0/hudi_stack.md | 2 +-
.../versioned_docs/version-0.15.0/indexing.md | 4 +-
.../version-0.15.0/metadata_indexing.md | 4 +-
.../version-0.15.0/precommit_validator.md | 2 +-
.../version-0.15.0/procedures.md | 46 +++++++++---------
.../version-0.15.0/querying_data.md | 2 +-
.../version-0.15.0/quick-start-guide.md | 26 +++++-----
.../version-0.15.0/record_payload.md | 2 +-
.../versioned_docs/version-0.15.0/sql_ddl.md | 4 +-
.../versioned_docs/version-0.15.0/sql_dml.md | 6 +--
.../version-0.15.0/use_cases.md | 12 ++---
.../version-0.15.0/write_operations.md | 16 +++---
.../version-0.15.0/writing_data.md | 4 +-
.../version-0.5.0/querying_data.md | 2 +-
.../version-0.5.0/quick-start-guide.md | 2 +-
.../version-0.5.0/writing_data.md | 6 +--
.../version-0.5.1/deployment.md | 6 +--
.../version-0.5.1/querying_data.md | 2 +-
.../version-0.5.1/quick-start-guide.md | 2 +-
.../version-0.5.1/writing_data.md | 6 +--
.../version-0.5.2/deployment.md | 6 +--
.../version-0.5.2/querying_data.md | 4 +-
.../version-0.5.2/quick-start-guide.md | 2 +-
.../version-0.5.2/writing_data.md | 6 +--
.../version-0.5.3/deployment.md | 6 +--
.../version-0.5.3/querying_data.md | 4 +-
.../version-0.5.3/quick-start-guide.md | 4 +-
.../version-0.5.3/writing_data.md | 6 +--
.../version-0.6.0/deployment.md | 6 +--
.../version-0.6.0/querying_data.md | 4 +-
.../version-0.6.0/quick-start-guide.md | 4 +-
.../version-0.6.0/writing_data.md | 6 +--
.../version-0.7.0/deployment.md | 6 +--
.../version-0.7.0/querying_data.md | 4 +-
.../version-0.7.0/quick-start-guide.md | 4 +-
.../version-0.7.0/writing_data.md | 6 +--
.../version-0.8.0/concurrency_control.md | 10 ++--
.../version-0.8.0/deployment.md | 6 +--
.../version-0.8.0/querying_data.md | 4 +-
.../version-0.8.0/quick-start-guide.md | 2 +-
.../version-0.8.0/writing_data.md | 6 +--
.../version-0.9.0/concurrency_control.md | 10 ++--
.../version-0.9.0/deployment.md | 6 +--
.../version-0.9.0/querying_data.md | 4 +-
.../version-0.9.0/quick-start-guide.md | 4 +-
.../version-0.9.0/writing_data.md | 6 +--
205 files changed, 524 insertions(+), 505 deletions(-)
create mode 100644 content/docs/0.8.0/concurrency_control/.index.html.swp
diff --git a/README.md b/README.md
index 9b011e8c7165..5686cf42d7ef 100644
--- a/README.md
+++ b/README.md
@@ -129,16 +129,35 @@ versioned_sidebars/version-0.7.0-sidebars.json
```
### Linking docs
-
-- Remember to include the `.md` extension.
-- Files will be linked to correct corresponding version.
-- Relative paths work as well.
-
-```md
-The [@hello](hello.md#paginate) document is great!
-
-See the [Tutorial](../getting-started/tutorial.md) for more info.
-```
+- Relative paths work well, and files will be linked to the correct corresponding version.
+- PREFER RELATIVE PATHS to be consistent with linking.
+ - **Good Example of linking.**
+   For example, say we are updating a doc for an older version, such as 0.12.0.
+ ```md
+ A [callback notification](writing_data#commit-notifications) is exposed
+ ```
+ This automatically resolves to /docs/0.12.0/writing_data#commit-notifications.
+ - **Bad example of linking.**
+   Again, say we are updating a doc for the older 0.12.0 version.
+ ```md
+ A [callback notification](/docs/writing_data#commit-notifications) is exposed
+ ```
+   This will resolve to the most recent release, i.e. /docs/writing_data#commit-notifications. We do not want a 0.12.0 doc page to point to a page from a later release.
+ - DO NOT use absolute /docs/next paths when linking.
+ - **Good example of linking when working on the unreleased (next) version.**
+ ```md
+ Hudi adopts Multiversion Concurrency Control (MVCC), where [compaction](compaction) action merges logs and base files to produce new
+ file slices and [cleaning](cleaning) action gets rid of unused/older file slices to reclaim space on the file system.
+ ```
+ This automatically resolves to /docs/next/compaction and /docs/next/cleaning pages.
+
+ - **Bad example of linking when working on the unreleased (next) version.**
+ ```md
+ Hudi adopts Multiversion Concurrency Control (MVCC), where [compaction](/docs/next/compaction) action merges logs and base files to produce new
+ file slices and [cleaning](/docs/next/cleaning) action gets rid of unused/older file slices to reclaim space on the file system.
+ ```
+   Even though this points directly to /docs/next, which is the intended target, it accumulates as tech debt: when this copy of the docs gets released, we will have an older doc permanently pointing to /docs/next/. A quick way to catch such links is sketched below.
+
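+   As a sanity check before publishing a versioned copy of the docs, a small script along these lines can flag absolute /docs/next links (a sketch; it only assumes the `website/versioned_docs/` layout used in this repo):
+
+   ```python
+   from pathlib import Path
+   import re
+
+   # Flag absolute /docs/next/... links inside versioned docs; per the guidance
+   # above, these should be relative links instead.
+   pattern = re.compile(r"\]\(/docs/next/[^)]*\)")
+
+   for path in Path("website/versioned_docs").rglob("*.md"):
+       text = path.read_text(encoding="utf-8")
+       for lineno, line in enumerate(text.splitlines(), start=1):
+           if pattern.search(line):
+               print(f"{path}:{lineno}: {line.strip()}")
+   ```
+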
## Versions
-- [Soft file limit](/docs/configurations#compactionSmallFileSize): Max file size below which a given data file is considered to a small file
-- [Insert split size](/docs/configurations#insertSplitSize): Number of inserts grouped for a single partition. This value should match
+- [Max file size](/docs/configurations#hoodieparquetmaxfilesize): Max size for a given data file. Hudi will try to maintain file sizes to this configured value
+- [Soft file limit](/docs/configurations#hoodieparquetsmallfilelimit): Max file size below which a given data file is considered a small file
+- [Insert split size](/docs/configurations#hoodiecopyonwriteinsertsplitsize): Number of inserts grouped for a single partition. This value should match
the number of records in a single file (you can determine based on max file size and per record size)
For instance, if your first config value is 120MB and 2nd config value is set to 100MB, any file whose size is < 100MB
diff --git a/website/blog/2021-08-16-kafka-custom-deserializer.md b/website/blog/2021-08-16-kafka-custom-deserializer.md
index c8146b6343d0..7ed4bf3e03a2 100644
--- a/website/blog/2021-08-16-kafka-custom-deserializer.md
+++ b/website/blog/2021-08-16-kafka-custom-deserializer.md
@@ -18,7 +18,7 @@ In our case a Confluent schema registry is used to maintain the schema and as sc
## What do we want to achieve?
-We have multiple instances of DeltaStreamer running, consuming many topics with different schemas ingesting to multiple Hudi tables. Deltastreamer is a utility in Hudi to assist in ingesting data from multiple sources like DFS, kafka, etc into Hudi. If interested, you can read more about DeltaStreamer tool [here](https://hudi.apache.org/docs/writing_data#deltastreamer)
+We have multiple instances of DeltaStreamer running, consuming many topics with different schemas and ingesting into multiple Hudi tables. DeltaStreamer is a utility in Hudi that assists in ingesting data from multiple sources like DFS, Kafka, etc. into Hudi. If interested, you can read more about the DeltaStreamer tool [here](https://hudi.apache.org/docs/hoodie_streaming_ingestion#hudi-streamer)
Ideally every topic should be able to evolve the schema to match new business requirements. Producers start producing data with a new schema version and the DeltaStreamer picks up the new schema and ingests the data with the new schema. For this to work, we run our DeltaStreamer instances with the latest schema version available from the Schema Registry to ensure that we always use the freshest schema with all attributes.
A prerequisites is that all the mentioned Schema evolutions must be `BACKWARD_TRANSITIVE` compatible (see [Schema Evolution and Compatibility of Avro Schema changes](https://docs.confluent.io/platform/current/schema-registry/avro.html). This ensures that every record in the kafka topic can always be read using the latest schema.
diff --git a/website/docs/cli.md b/website/docs/cli.md
index 1c30b9b6fa6e..7cc4cdd92b0c 100644
--- a/website/docs/cli.md
+++ b/website/docs/cli.md
@@ -578,7 +578,7 @@ Compaction successfully repaired
### Savepoint and Restore
As the name suggest, "savepoint" saves the table as of the commit time, so that it lets you restore the table to this
-savepoint at a later point in time if need be. You can read more about savepoints and restore [here](/docs/next/disaster_recovery)
+savepoint at a later point in time if need be. You can read more about savepoints and restore [here](disaster_recovery)
To trigger savepoint for a hudi table
```java
diff --git a/website/docs/concurrency_control.md b/website/docs/concurrency_control.md
index 8550888e734f..d9867be88a8e 100644
--- a/website/docs/concurrency_control.md
+++ b/website/docs/concurrency_control.md
@@ -8,7 +8,7 @@ last_modified_at: 2021-03-19T15:59:57-04:00
---
Concurrency control defines how different writers/readers/table services coordinate access to a Hudi table. Hudi ensures atomic writes, by way of publishing commits atomically to the timeline,
stamped with an instant time that denotes the time at which the action is deemed to have occurred. Unlike general purpose file version control, Hudi draws clear distinction between
-writer processes that issue [write operations](/docs/next/write_operations) and table services that (re)write data/metadata to optimize/perform bookkeeping and
+writer processes that issue [write operations](write_operations) and table services that (re)write data/metadata to optimize/perform bookkeeping and
readers (that execute queries and read data).
Hudi provides
@@ -23,7 +23,7 @@ We’ll also describe ways to ingest data into a Hudi Table from multiple writer
## Distributed Locking
A pre-requisite for distributed co-ordination in Hudi, like many other distributed database systems is a distributed lock provider, that different processes can use to plan, schedule and
-execute actions on the Hudi timeline in a concurrent fashion. Locks are also used to [generate TrueTime](/docs/next/timeline#truetime-generation), as discussed before.
+execute actions on the Hudi timeline in a concurrent fashion. Locks are also used to [generate TrueTime](timeline#truetime-generation), as discussed before.
External locking is typically used in conjunction with optimistic concurrency control
because it provides a way to prevent conflicts that might occur when two or more transactions (commits in our case) attempt to modify the same resource concurrently.
@@ -204,7 +204,7 @@ Multiple writers can operate on the table with non-blocking conflict resolution.
file group with the conflicts resolved automatically by the query reader and the compactor. The new concurrency mode is
currently available for preview in version 1.0.0-beta only with the caveat that conflict resolution is not supported yet
between clustering and ingestion. It works for compaction and ingestion, and we can see an example of that with Flink
-writers [here](/docs/next/sql_dml#non-blocking-concurrency-control-experimental).
+writers [here](sql_dml#non-blocking-concurrency-control-experimental).
## Early conflict Detection
diff --git a/website/docs/deployment.md b/website/docs/deployment.md
index 9bafde59c465..7785f4ceaca1 100644
--- a/website/docs/deployment.md
+++ b/website/docs/deployment.md
@@ -136,7 +136,7 @@ Here is an example invocation for reading from kafka topic in a continuous mode
### Spark Datasource Writer Jobs
-As described in [Batch Writes](/docs/next/writing_data#spark-datasource-api), you can use spark datasource to ingest to hudi table. This mechanism allows you to ingest any spark dataframe in Hudi format. Hudi Spark DataSource also supports spark streaming to ingest a streaming source to Hudi table. For Merge On Read table types, inline compaction is turned on by default which runs after every ingestion run. The compaction frequency can be changed by setting the property "hoodie.compact.inline.max.delta.commits".
+As described in [Batch Writes](writing_data#spark-datasource-api), you can use the Spark datasource to ingest into a Hudi table. This mechanism allows you to ingest any Spark dataframe in Hudi format. The Hudi Spark DataSource also supports Spark streaming to ingest a streaming source into a Hudi table. For Merge On Read table types, inline compaction is turned on by default and runs after every ingestion run. The compaction frequency can be changed by setting the property "hoodie.compact.inline.max.delta.commits".
Here is an example invocation using spark datasource
diff --git a/website/docs/faq.md b/website/docs/faq.md
index 1378839b81cb..26c3eb50d214 100644
--- a/website/docs/faq.md
+++ b/website/docs/faq.md
@@ -6,10 +6,10 @@ keywords: [hudi, writing, reading]
The FAQs are split into following pages. Please refer to the specific pages for more info.
-- [General](/docs/next/faq_general)
-- [Design & Concepts](/docs/next/faq_design_and_concepts)
-- [Writing Tables](/docs/next/faq_writing_tables)
-- [Reading Tables](/docs/next/faq_reading_tables)
-- [Table Services](/docs/next/faq_table_services)
-- [Storage](/docs/next/faq_storage)
-- [Integrations](/docs/next/faq_integrations)
+- [General](faq_general)
+- [Design & Concepts](faq_design_and_concepts)
+- [Writing Tables](faq_writing_tables)
+- [Reading Tables](faq_reading_tables)
+- [Table Services](faq_table_services)
+- [Storage](faq_storage)
+- [Integrations](faq_integrations)
diff --git a/website/docs/faq_general.md b/website/docs/faq_general.md
index 61b6c12a4b5d..9f0a6c7d5153 100644
--- a/website/docs/faq_general.md
+++ b/website/docs/faq_general.md
@@ -61,7 +61,7 @@ Nonetheless, Hudi is designed very much like a database and provides similar fun
### How do I model the data stored in Hudi?
-When writing data into Hudi, you model the records like how you would on a key-value store - specify a key field (unique for a single partition/across table), a partition field (denotes partition to place key into) and preCombine/combine logic that specifies how to handle duplicates in a batch of records written. This model enables Hudi to enforce primary key constraints like you would get on a database table. See [here](/docs/next/writing_data) for an example.
+When writing data into Hudi, you model the records like how you would on a key-value store - specify a key field (unique for a single partition/across table), a partition field (denotes partition to place key into) and preCombine/combine logic that specifies how to handle duplicates in a batch of records written. This model enables Hudi to enforce primary key constraints like you would get on a database table. See [here](writing_data) for an example.
When querying/reading data, Hudi just presents itself as a json-like hierarchical table, everyone is used to querying using Hive/Spark/Presto over Parquet/Json/Avro.
diff --git a/website/docs/faq_table_services.md b/website/docs/faq_table_services.md
index 0ca730094e4f..7ff398687e39 100644
--- a/website/docs/faq_table_services.md
+++ b/website/docs/faq_table_services.md
@@ -50,6 +50,6 @@ Hudi runs cleaner to remove old file versions as part of writing data either in
Yes. Hudi provides the ability to post a callback notification about a write commit. You can use a http hook or choose to
-be notified via a Kafka/pulsar topic or plug in your own implementation to get notified. Please refer [here](/docs/next/platform_services_post_commit_callback)
+be notified via a Kafka/pulsar topic or plug in your own implementation to get notified. Please refer [here](platform_services_post_commit_callback)
for details
diff --git a/website/docs/faq_writing_tables.md b/website/docs/faq_writing_tables.md
index bed07a16e57a..2374006d9553 100644
--- a/website/docs/faq_writing_tables.md
+++ b/website/docs/faq_writing_tables.md
@@ -6,7 +6,7 @@ keywords: [hudi, writing, reading]
### What are some ways to write a Hudi table?
-Typically, you obtain a set of partial updates/inserts from your source and issue [write operations](/docs/write_operations/) against a Hudi table. If you ingesting data from any of the standard sources like Kafka, or tailing DFS, the [delta streamer](/docs/hoodie_streaming_ingestion#hudi-streamer) tool is invaluable and provides an easy, self-managed solution to getting data written into Hudi. You can also write your own code to capture data from a custom source using the Spark datasource API and use a [Hudi datasource](/docs/next/writing_data#spark-datasource-api) to write into Hudi.
+Typically, you obtain a set of partial updates/inserts from your source and issue [write operations](/docs/write_operations/) against a Hudi table. If you are ingesting data from any of the standard sources like Kafka, or tailing DFS, the [delta streamer](/docs/hoodie_streaming_ingestion#hudi-streamer) tool is invaluable and provides an easy, self-managed solution to getting data written into Hudi. You can also write your own code to capture data from a custom source using the Spark datasource API and use a [Hudi datasource](writing_data#spark-datasource-api) to write into Hudi.
### How is a Hudi writer job deployed?
@@ -68,7 +68,7 @@ As you could see, ([combineAndGetUpdateValue(), getInsertValue()](https://github
### How do I delete records in the dataset using Hudi?
-GDPR has made deletes a must-have tool in everyone's data management toolbox. Hudi supports both soft and hard deletes. For details on how to actually perform them, see [here](/docs/next/writing_data#deletes).
+GDPR has made deletes a must-have tool in everyone's data management toolbox. Hudi supports both soft and hard deletes. For details on how to actually perform them, see [here](writing_data#deletes).
### Should I need to worry about deleting all copies of the records in case of duplicates?
@@ -147,7 +147,7 @@ a) **Auto Size small files during ingestion**: This solution trades ingest/writi
Hudi has the ability to maintain a configured target file size, when performing **upsert/insert** operations. (Note: **bulk_insert** operation does not provide this functionality and is designed as a simpler replacement for normal `spark.write.parquet` )
-For **copy-on-write**, this is as simple as configuring the [maximum size for a base/parquet file](/docs/configurations#hoodieparquetmaxfilesize) and the [soft limit](/docs/configurations#hoodieparquetsmallfilelimit) below which a file should be considered a small file. For the initial bootstrap to Hudi table, tuning record size estimate is also important to ensure sufficient records are bin-packed in a parquet file. For subsequent writes, Hudi automatically uses average record size based on previous commit. Hudi will try to add enough records to a small file at write time to get it to the configured maximum limit. For e.g , with `compactionSmallFileSize=100MB` and limitFileSize=120MB, Hudi will pick all files < 100MB and try to get them upto 120MB.
+For **copy-on-write**, this is as simple as configuring the [maximum size for a base/parquet file](/docs/configurations#hoodieparquetmaxfilesize) and the [soft limit](/docs/configurations#hoodieparquetsmallfilelimit) below which a file should be considered a small file. For the initial bootstrap to a Hudi table, tuning the record size estimate is also important to ensure sufficient records are bin-packed in a parquet file. For subsequent writes, Hudi automatically uses the average record size based on the previous commit. Hudi will try to add enough records to a small file at write time to get it to the configured maximum limit. For example, with `hoodie.parquet.small.file.limit=100MB` and `hoodie.parquet.max.file.size=120MB`, Hudi will pick all files < 100MB and try to get them up to 120MB.
For **merge-on-read**, there are few more configs to set. MergeOnRead works differently for different INDEX choices.
@@ -183,7 +183,7 @@ No, Hudi does not expose uncommitted files/blocks to the readers. Further, Hudi
### How are conflicts detected in Hudi between multiple writers?
-Hudi employs [optimistic concurrency control](/docs/concurrency_control#supported-concurrency-controls) between writers, while implementing MVCC based concurrency control between writers and the table services. Concurrent writers to the same table need to be configured with the same lock provider configuration, to safely perform writes. By default (implemented in “[SimpleConcurrentFileWritesConflictResolutionStrategy](https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/transaction/SimpleConcurrentFileWritesConflictResolutionStrategy.java)”), Hudi allows multiple writers to concurrently write data and commit to the timeline if there is no conflicting writes to the same underlying file group IDs. This is achieved by holding a lock, checking for changes that modified the same file IDs. Hudi then supports a pluggable interface “[ConflictResolutionStrategy](https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/transaction/ConflictResolutionStrategy.java)” that determines how conflicts are handled. By default, the later conflicting write is aborted. Hudi also support eager conflict detection to help speed up conflict detection and release cluster resources back early to reduce costs.
+Hudi employs [optimistic concurrency control](concurrency_control) between writers, while implementing MVCC-based concurrency control between writers and the table services. Concurrent writers to the same table need to be configured with the same lock provider configuration, to safely perform writes. By default (implemented in “[SimpleConcurrentFileWritesConflictResolutionStrategy](https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/transaction/SimpleConcurrentFileWritesConflictResolutionStrategy.java)”), Hudi allows multiple writers to concurrently write data and commit to the timeline if there are no conflicting writes to the same underlying file group IDs. This is achieved by holding a lock, checking for changes that modified the same file IDs. Hudi then supports a pluggable interface “[ConflictResolutionStrategy](https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/transaction/ConflictResolutionStrategy.java)” that determines how conflicts are handled. By default, the later conflicting write is aborted. Hudi also supports eager conflict detection to help speed up conflict detection and release cluster resources back early to reduce costs.
### Can single-writer inserts have duplicates?
diff --git a/website/docs/file_sizing.md b/website/docs/file_sizing.md
index c637a5a630cc..62ad0f7a4320 100644
--- a/website/docs/file_sizing.md
+++ b/website/docs/file_sizing.md
@@ -148,7 +148,7 @@ while the clustering service runs.
:::note
Hudi always creates immutable files on storage. To be able to do auto-sizing or clustering, Hudi will always create a
-newer version of the smaller file, resulting in 2 versions of the same file. The [cleaner service](/docs/next/cleaning)
+newer version of the smaller file, resulting in 2 versions of the same file. The [cleaner service](cleaning)
will later kick in and delete the older version small file and keep the latest one.
:::
diff --git a/website/docs/flink-quick-start-guide.md b/website/docs/flink-quick-start-guide.md
index 0ab2322d766e..1cfda067c71c 100644
--- a/website/docs/flink-quick-start-guide.md
+++ b/website/docs/flink-quick-start-guide.md
@@ -449,19 +449,19 @@ feature is that it now lets you author streaming pipelines on streaming or batch
## Where To Go From Here?
- **Quick Start** : Read [Quick Start](#quick-start) to get started quickly Flink sql client to write to(read from) Hudi.
-- **Configuration** : For [Global Configuration](/docs/next/flink_tuning#global-configurations), sets up through `$FLINK_HOME/conf/flink-conf.yaml`. For per job configuration, sets up through [Table Option](/docs/next/flink_tuning#table-options).
-- **Writing Data** : Flink supports different modes for writing, such as [CDC Ingestion](/docs/next/ingestion_flink#cdc-ingestion), [Bulk Insert](/docs/next/ingestion_flink#bulk-insert), [Index Bootstrap](/docs/next/ingestion_flink#index-bootstrap), [Changelog Mode](/docs/next/ingestion_flink#changelog-mode) and [Append Mode](/docs/next/ingestion_flink#append-mode). Flink also supports multiple streaming writers with [non-blocking concurrency control](/docs/next/sql_dml#non-blocking-concurrency-control-experimental).
-- **Reading Data** : Flink supports different modes for reading, such as [Streaming Query](/docs/sql_queries#streaming-query) and [Incremental Query](/docs/sql_queries#incremental-query).
-- **Tuning** : For write/read tasks, this guide gives some tuning suggestions, such as [Memory Optimization](/docs/next/flink_tuning#memory-optimization) and [Write Rate Limit](/docs/next/flink_tuning#write-rate-limit).
+- **Configuration** : For [Global Configuration](flink_tuning#global-configurations), set up through `$FLINK_HOME/conf/flink-conf.yaml`. For per-job configuration, set up through [Table Option](flink_tuning#table-options).
+- **Writing Data** : Flink supports different modes for writing, such as [CDC Ingestion](ingestion_flink#cdc-ingestion), [Bulk Insert](ingestion_flink#bulk-insert), [Index Bootstrap](ingestion_flink#index-bootstrap), [Changelog Mode](ingestion_flink#changelog-mode) and [Append Mode](ingestion_flink#append-mode). Flink also supports multiple streaming writers with [non-blocking concurrency control](sql_dml#non-blocking-concurrency-control-experimental).
+- **Reading Data** : Flink supports different modes for reading, such as [Streaming Query](sql_queries#streaming-query) and [Incremental Query](sql_queries#incremental-query).
+- **Tuning** : For write/read tasks, this guide gives some tuning suggestions, such as [Memory Optimization](flink_tuning#memory-optimization) and [Write Rate Limit](flink_tuning#write-rate-limit).
- **Optimization**: Offline compaction is supported [Offline Compaction](/docs/compaction#flink-offline-compaction).
-- **Query Engines**: Besides Flink, many other engines are integrated: [Hive Query](/docs/syncing_metastore#flink-setup), [Presto Query](/docs/querying_data#prestodb).
+- **Query Engines**: Besides Flink, many other engines are integrated: [Hive Query](/docs/syncing_metastore#flink-setup), [Presto Query](sql_queries#presto).
- **Catalog**: A Hudi specific catalog is supported: [Hudi Catalog](/docs/sql_ddl/#create-catalog).
If you are relatively new to Apache Hudi, it is important to be familiar with a few core concepts:
- - [Hudi Timeline](/docs/next/timeline) – How Hudi manages transactions and other table services
- - [Hudi Storage Layout](/docs/next/storage_layouts) - How the files are laid out on storage
- - [Hudi Table Types](/docs/next/table_types) – `COPY_ON_WRITE` and `MERGE_ON_READ`
- - [Hudi Query Types](/docs/next/table_types#query-types) – Snapshot Queries, Incremental Queries, Read-Optimized Queries
+ - [Hudi Timeline](timeline) – How Hudi manages transactions and other table services
+ - [Hudi Storage Layout](storage_layouts) - How the files are laid out on storage
+ - [Hudi Table Types](table_types) – `COPY_ON_WRITE` and `MERGE_ON_READ`
+ - [Hudi Query Types](table_types#query-types) – Snapshot Queries, Incremental Queries, Read-Optimized Queries
See more in the "Concepts" section of the docs.
diff --git a/website/docs/hudi_stack.md b/website/docs/hudi_stack.md
index ab2408f43164..59517ede41da 100644
--- a/website/docs/hudi_stack.md
+++ b/website/docs/hudi_stack.md
@@ -157,7 +157,7 @@ Platform services offer functionality that is specific to data and workloads, an
Services, like [Hudi Streamer](./hoodie_streaming_ingestion#hudi-streamer) (or its Flink counterpart), are specialized in handling data and workloads, seamlessly integrating with Kafka streams and various
formats to build data lakes. They support functionalities like automatic checkpoint management, integration with major schema registries (including Confluent), and
deduplication of data. Hudi Streamer also offers features for backfills, one-off runs, and continuous mode operation with Spark/Flink streaming writers. Additionally,
-Hudi provides tools for [snapshotting](./snapshot_exporter) and incrementally [exporting](./snapshot_exporter#examples) Hudi tables, importing new tables, and [post-commit callback](/docs/next/platform_services_post_commit_callback) for analytics or
+Hudi provides tools for [snapshotting](./snapshot_exporter) and incrementally [exporting](./snapshot_exporter#examples) Hudi tables, importing new tables, and [post-commit callback](platform_services_post_commit_callback) for analytics or
workflow management, enhancing the deployment of production-grade incremental pipelines. Apart from these services, Hudi also provides broad support for different
catalogs such as [Hive Metastore](./syncing_metastore), [AWS Glue](./syncing_aws_glue_data_catalog/), [Google BigQuery](./gcp_bigquery), [DataHub](./syncing_datahub), etc. that allows syncing of Hudi tables to be queried by
interactive engines such as Trino and Presto.
diff --git a/website/docs/indexes.md b/website/docs/indexes.md
index 512242ba811d..c2284f2d473e 100644
--- a/website/docs/indexes.md
+++ b/website/docs/indexes.md
@@ -19,8 +19,8 @@ Only clustering or cross-partition updates that are implemented as deletes + ins
file group at any completed instant on the timeline.
## Need for indexing
-For [Copy-On-Write tables](/docs/next/table_types#copy-on-write-table), indexing enables fast upsert/delete operations, by avoiding the need to join against the entire dataset to determine which files to rewrite.
-For [Merge-On-Read tables](/docs/next/table_types#merge-on-read-table), indexing allows Hudi to bound the amount of change records any given base file needs to be merged against. Specifically, a given base file needs to merged
+For [Copy-On-Write tables](table_types#copy-on-write-table), indexing enables fast upsert/delete operations, by avoiding the need to join against the entire dataset to determine which files to rewrite.
+For [Merge-On-Read tables](table_types#merge-on-read-table), indexing allows Hudi to bound the amount of change records any given base file needs to be merged against. Specifically, a given base file needs to be merged
only against updates for records that are part of that base file.
![Fact table](/assets/images/blog/hudi-indexes/with_without_index.png)
@@ -28,7 +28,7 @@ only against updates for records that are part of that base file.
In contrast,
- Designs without an indexing component (e.g: [Apache Hive/Apache Iceberg](https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions)) end up having to merge all the base files against all incoming updates/delete records
- (10-100x more [read amplification](/docs/next/table_types#comparison)).
+ (10-100x more [read amplification](table_types#comparison)).
- Designs that implement heavily write-optimized OLTP data structures like LSM trees do not require an indexing component. But they perform poorly scan heavy workloads
against cloud storage making them unsuitable for serving analytical queries.
@@ -42,8 +42,8 @@ implemented by enhancing the metadata table with the flexibility to extend to ne
along with an [asynchronous index](https://hudi.apache.org/docs/metadata_indexing/#setup-async-indexing) building
Hudi supports a multi-modal index by augmenting the metadata table with the capability to incorporate new types of indexes, complemented by an
-asynchronous mechanism for [index construction](/docs/next/metadata_indexing). This enhancement supports a range of indexes within
-the [metadata table](/docs/next/metadata#metadata-table), significantly improving the efficiency of both writing to and reading from the table.
+asynchronous mechanism for [index construction](metadata_indexing). This enhancement supports a range of indexes within
+the [metadata table](metadata#metadata-table), significantly improving the efficiency of both writing to and reading from the table.
![Indexes](/assets/images/hudi-stack-indexes.png)
Figure: Indexes in Hudi
@@ -68,7 +68,7 @@ the [metadata table](/docs/next/metadata#metadata-table), significantly improvin
An [expression index](https://github.com/apache/hudi/blob/3789840be3d041cbcfc6b24786740210e4e6d6ac/rfc/rfc-63/rfc-63.md) is an index on a function of a column. If a query has a predicate on a function of a column, the expression index can
be used to speed up the query. Expression index is stored in *expr_index_* prefixed partitions (one for each
expression index) under metadata table. Expression index can be created using SQL syntax. Please checkout SQL DDL
- docs [here](/docs/next/sql_ddl#create-functional-index-experimental) for more details.
+ docs [here](sql_ddl#create-expression-index) for more details.
### Secondary Index
diff --git a/website/docs/metadata.md b/website/docs/metadata.md
index fb79f19799ac..0295489e348b 100644
--- a/website/docs/metadata.md
+++ b/website/docs/metadata.md
@@ -62,7 +62,7 @@ Following are the different types of metadata currently supported.
```
-To try out these features, refer to the [SQL guide](/docs/next/sql_ddl#create-partition-stats-and-secondary-index-experimental).
+To try out these features, refer to the [SQL guide](sql_ddl#create-partition-stats-index).
## Metadata Tracking on Writers
@@ -153,7 +153,7 @@ process which cannot rely on the in-process lock provider.
### Deployment Model C: Multi-writer
-If your current deployment model is [multi-writer](/docs/concurrency_control#model-c-multi-writer) along with a lock
+If your current deployment model is [multi-writer](concurrency_control#full-on-multi-writer--async-table-services) along with a lock
provider and other required configs set for every writer as follows, there is no additional configuration required. You
can bring up the writers sequentially after stopping the writers for enabling metadata table. Applying the proper
configurations to only partial writers leads to loss of data from the inconsistent writer. So, ensure you enable
diff --git a/website/docs/metadata_indexing.md b/website/docs/metadata_indexing.md
index ee0609965fbe..d1978c1e486e 100644
--- a/website/docs/metadata_indexing.md
+++ b/website/docs/metadata_indexing.md
@@ -31,7 +31,7 @@ asynchronous indexing. To learn more about the design of asynchronous indexing f
## Index Creation Using SQL
Currently indexes like secondary index, expression index and record index can be created using SQL create index command.
-For more information on these indexes please refer [metadata section](/docs/next/metadata/#types-of-table-metadata)
+For more information on these indexes, please refer to the [metadata section](metadata/#types-of-table-metadata)
:::note
Please note in order to create secondary index:
@@ -54,7 +54,7 @@ CREATE INDEX idx_column_ts ON hudi_indexed_table USING column_stats(ts) OPTIONS(
CREATE INDEX idx_bloom_driver ON hudi_indexed_table USING bloom_filters(driver) OPTIONS(expr='identity');
```
-For more information on index creation using SQL refer [SQL DDL](/docs/next/sql_ddl#create-index)
+For more information on index creation using SQL, refer to [SQL DDL](sql_ddl#create-index)
## Index Creation Using Datasource
@@ -182,8 +182,8 @@ us schedule the indexing for COLUMN_STATS index. First we need to define a prope
As mentioned before, metadata indices are pluggable. One can add any index at any point in time depending on changing
business requirements. Some configurations to enable particular indices are listed below. Currently, available indices under
-metadata table can be explored [here](/docs/next/metadata/#types-of-table-metadata) along with [configs](/docs/next/metadata#enable-hudi-metadata-table-and-multi-modal-index-in-write-side)
-to enable them. The full set of metadata configurations can be explored [here](/docs/next/configurations/#Metadata-Configs).
+metadata table can be explored [here](indexes#multi-modal-indexing) along with [configs](metadata#metadata-tracking-on-writers)
+to enable them. The full set of metadata configurations can be explored [here](configurations/#Metadata-Configs).
:::note
Enabling the metadata table and configuring a lock provider are the prerequisites for using async indexer. Checkout a sample
diff --git a/website/docs/precommit_validator.md b/website/docs/precommit_validator.md
index 5e13fca3dc0e..d5faf61057de 100644
--- a/website/docs/precommit_validator.md
+++ b/website/docs/precommit_validator.md
@@ -91,7 +91,7 @@ void validateRecordsBeforeAndAfter(Dataset before,
```
## Additional Monitoring with Notifications
-Hudi offers a [commit notification service](/docs/next/platform_services_post_commit_callback) that can be configured to trigger notifications about write commits.
+Hudi offers a [commit notification service](platform_services_post_commit_callback) that can be configured to trigger notifications about write commits.
The commit notification service can be combined with pre-commit validators to send a notification when a commit fails a validation. This is possible by passing details about the validation as a custom value to the HTTP endpoint.
diff --git a/website/docs/procedures.md b/website/docs/procedures.md
index 1dbeb899b14f..19d656680111 100644
--- a/website/docs/procedures.md
+++ b/website/docs/procedures.md
@@ -472,10 +472,10 @@ archive commits.
|------------------------------------------------------------------------|---------|----------|---------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| table | String | N | None | Hudi table name |
| path | String | N | None | Path of table |
-| [min_commits](/docs/next/configurations#hoodiekeepmincommits) | Int | N | 20 | Similar to hoodie.keep.max.commits, but controls the minimum number of instants to retain in the active timeline. |
-| [max_commits](/docs/next/configurations#hoodiekeepmaxcommits) | Int | N | 30 | Archiving service moves older entries from timeline into an archived log after each write, to keep the metadata overhead constant, even as the table size grows. This config controls the maximum number of instants to retain in the active timeline. |
-| [retain_commits](/docs/next/configurations#hoodiecommitsarchivalbatch) | Int | N | 10 | Archiving of instants is batched in best-effort manner, to pack more instants into a single archive log. This config controls such archival batch size. |
-| [enable_metadata](/docs/next/configurations#hoodiemetadataenable) | Boolean | N | false | Enable the internal metadata table |
+| [min_commits](configurations#hoodiekeepmincommits) | Int | N | 20 | Similar to hoodie.keep.max.commits, but controls the minimum number of instants to retain in the active timeline. |
+| [max_commits](configurations#hoodiekeepmaxcommits) | Int | N | 30 | Archiving service moves older entries from timeline into an archived log after each write, to keep the metadata overhead constant, even as the table size grows. This config controls the maximum number of instants to retain in the active timeline. |
+| [retain_commits](configurations#hoodiecommitsarchivalbatch) | Int | N | 10 | Archiving of instants is batched in best-effort manner, to pack more instants into a single archive log. This config controls such archival batch size. |
+| [enable_metadata](configurations#hoodiemetadataenable) | Boolean | N | false | Enable the internal metadata table |
**Output**
@@ -672,7 +672,7 @@ copy table to a temporary view.
| Parameter Name | Type | Required | Default Value | Description |
|-------------------------------------------------------------------|---------|----------|---------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| table | String | Y | None | Hudi table name |
-| [query_type](/docs/next/configurations#hoodiedatasourcequerytype) | String | N | "snapshot" | Whether data needs to be read, in `incremental` mode (new data since an instantTime) (or) `read_optimized` mode (obtain latest view, based on base files) (or) `snapshot` mode (obtain latest view, by merging base and (if any) log files) |
+| [query_type](configurations#hoodiedatasourcequerytype) | String | N | "snapshot" | Whether data needs to be read, in `incremental` mode (new data since an instantTime) (or) `read_optimized` mode (obtain latest view, based on base files) (or) `snapshot` mode (obtain latest view, by merging base and (if any) log files) |
| view_name | String | Y | None | Name of view |
| begin_instance_time | String | N | "" | Begin instance time |
| end_instance_time | String | N | "" | End instance time |
@@ -705,7 +705,7 @@ copy table to a new table.
| Parameter Name | Type | Required | Default Value | Description |
|-------------------------------------------------------------------|--------|----------|---------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| table | String | Y | None | Hudi table name |
-| [query_type](/docs/next/configurations#hoodiedatasourcequerytype) | String | N | "snapshot" | Whether data needs to be read, in `incremental` mode (new data since an instantTime) (or) `read_optimized` mode (obtain latest view, based on base files) (or) `snapshot` mode (obtain latest view, by merging base and (if any) log files) |
+| [query_type](configurations#hoodiedatasourcequerytype) | String | N | "snapshot" | Whether data needs to be read, in `incremental` mode (new data since an instantTime) (or) `read_optimized` mode (obtain latest view, based on base files) (or) `snapshot` mode (obtain latest view, by merging base and (if any) log files) |
| new_table | String | Y | None | Name of new table |
| begin_instance_time | String | N | "" | Begin instance time |
| end_instance_time | String | N | "" | End instance time |
@@ -1535,10 +1535,10 @@ Run cleaner on a hoodie table.
|---------------------------------------------------------------------------------------|---------|----------|---------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| table | String | Y | None | Name of table to be cleaned |
| schedule_in_line | Boolean | N | true | Set "true" if you want to schedule and run a clean. Set false if you have already scheduled a clean and want to run that. |
-| [clean_policy](/docs/next/configurations#hoodiecleanerpolicy) | String | N | None | org.apache.hudi.common.model.HoodieCleaningPolicy: Cleaning policy to be used. The cleaner service deletes older file slices files to re-claim space. Long running query plans may often refer to older file slices and will break if those are cleaned, before the query has had a chance to run. So, it is good to make sure that the data is retained for more than the maximum query execution time. By default, the cleaning policy is determined based on one of the following configs explicitly set by the user (at most one of them can be set; otherwise, KEEP_LATEST_COMMITS cleaning policy is used). KEEP_LATEST_FILE_VERSIONS: keeps the last N versions of the file slices written; used when "hoodie.cleaner.fileversions.retained" is explicitly set only. KEEP_LATEST_COMMITS(default): keeps the file slices written by the last N commits; used when "hoodie.cleaner.commits.retained" is explicitly set only. KEEP_LATEST_BY_HOURS: keeps the file slices written in the last N hours based on the commit time; used when "hoodie.cleaner.hours.retained" is explicitly set only. |
-| [retain_commits](/docs/next/configurations#hoodiecleanercommitsretained) | Int | N | None | When KEEP_LATEST_COMMITS cleaning policy is used, the number of commits to retain, without cleaning. This will be retained for num_of_commits * time_between_commits (scheduled). This also directly translates into how much data retention the table supports for incremental queries. |
-| [hours_retained](/docs/next/configurations#hoodiecleanerhoursretained) | Int | N | None | When KEEP_LATEST_BY_HOURS cleaning policy is used, the number of hours for which commits need to be retained. This config provides a more flexible option as compared to number of commits retained for cleaning service. Setting this property ensures all the files, but the latest in a file group, corresponding to commits with commit times older than the configured number of hours to be retained are cleaned. |
-| [file_versions_retained](/docs/next/configurations#hoodiecleanerfileversionsretained) | Int | N | None | When KEEP_LATEST_FILE_VERSIONS cleaning policy is used, the minimum number of file slices to retain in each file group, during cleaning. |
+| [clean_policy](configurations#hoodiecleanerpolicy) | String | N | None | org.apache.hudi.common.model.HoodieCleaningPolicy: Cleaning policy to be used. The cleaner service deletes older file slices files to re-claim space. Long running query plans may often refer to older file slices and will break if those are cleaned, before the query has had a chance to run. So, it is good to make sure that the data is retained for more than the maximum query execution time. By default, the cleaning policy is determined based on one of the following configs explicitly set by the user (at most one of them can be set; otherwise, KEEP_LATEST_COMMITS cleaning policy is used). KEEP_LATEST_FILE_VERSIONS: keeps the last N versions of the file slices written; used when "hoodie.cleaner.fileversions.retained" is explicitly set only. KEEP_LATEST_COMMITS(default): keeps the file slices written by the last N commits; used when "hoodie.cleaner.commits.retained" is explicitly set only. KEEP_LATEST_BY_HOURS: keeps the file slices written in the last N hours based on the commit time; used when "hoodie.cleaner.hours.retained" is explicitly set only. |
+| [retain_commits](configurations#hoodiecleanercommitsretained) | Int | N | None | When KEEP_LATEST_COMMITS cleaning policy is used, the number of commits to retain, without cleaning. This will be retained for num_of_commits * time_between_commits (scheduled). This also directly translates into how much data retention the table supports for incremental queries. |
+| [hours_retained](configurations#hoodiecleanerhoursretained) | Int | N | None | When KEEP_LATEST_BY_HOURS cleaning policy is used, the number of hours for which commits need to be retained. This config provides a more flexible option as compared to number of commits retained for cleaning service. Setting this property ensures all the files, but the latest in a file group, corresponding to commits with commit times older than the configured number of hours to be retained are cleaned. |
+| [file_versions_retained](configurations#hoodiecleanerfileversionsretained) | Int | N | None | When KEEP_LATEST_FILE_VERSIONS cleaning policy is used, the minimum number of file slices to retain in each file group, during cleaning. |
| [trigger_strategy](/docs/next/configurations#hoodiecleantriggerstrategy) | String | N | None | org.apache.hudi.table.action.clean.CleaningTriggerStrategy: Controls when cleaning is scheduled. NUM_COMMITS(default): Trigger the cleaning service every N commits, determined by `hoodie.clean.max.commits` |
| [trigger_max_commits](/docs/next/configurations/#hoodiecleanmaxcommits) | Int | N | None | Number of commits after the last clean operation, before scheduling of a new clean is attempted. |
| [options](/docs/next/configurations/#Clean-Configs) | String | N | None | comma separated list of Hudi configs for cleaning in the format "config1=value1,config2=value2" |
diff --git a/website/docs/querying_data.md b/website/docs/querying_data.md
index 31069822df76..83a03a4a1121 100644
--- a/website/docs/querying_data.md
+++ b/website/docs/querying_data.md
@@ -7,7 +7,7 @@ last_modified_at: 2019-12-30T15:59:57-04:00
---
:::danger
-This page is no longer maintained. Please refer to Hudi [SQL DDL](/docs/next/sql_ddl), [SQL DML](/docs/next/sql_dml), [SQL Queries](/docs/next/sql_queries) and [Procedures](/docs/next/procedures) for the latest documentation.
+This page is no longer maintained. Please refer to Hudi [SQL DDL](sql_ddl), [SQL DML](sql_dml), [SQL Queries](sql_queries) and [Procedures](procedures) for the latest documentation.
:::
Conceptually, Hudi stores data physically once on DFS, while providing 3 different ways of querying, as explained [before](/docs/concepts#query-types).
diff --git a/website/docs/quick-start-guide.md b/website/docs/quick-start-guide.md
index 7c4c2e8077d9..4ddb4005df31 100644
--- a/website/docs/quick-start-guide.md
+++ b/website/docs/quick-start-guide.md
@@ -257,7 +257,7 @@ CREATE TABLE hudi_table (
PARTITIONED BY (city);
```
-For more options for creating Hudi tables or if you're running into any issues, please refer to [SQL DDL](/docs/next/sql_ddl) reference guide.
+For more options for creating Hudi tables or if you're running into any issues, please refer to [SQL DDL](sql_ddl) reference guide.
@@ -301,7 +301,7 @@ inserts.write.format("hudi").
```
:::info Mapping to Hudi write operations
-Hudi provides a wide range of [write operations](/docs/next/write_operations) - both batch and incremental - to write data into Hudi tables,
+Hudi provides a wide range of [write operations](write_operations) - both batch and incremental - to write data into Hudi tables,
with different semantics and performance. When record keys are not configured (see [keys](#keys) below), `bulk_insert` will be chosen as
the write operation, matching the out-of-box behavior of Spark's Parquet Datasource.
:::
@@ -334,7 +334,7 @@ inserts.write.format("hudi"). \
```
:::info Mapping to Hudi write operations
-Hudi provides a wide range of [write operations](/docs/next/write_operations) - both batch and incremental - to write data into Hudi tables,
+Hudi provides a wide range of [write operations](write_operations) - both batch and incremental - to write data into Hudi tables,
with different semantics and performance. When record keys are not configured (see [keys](#keys) below), `bulk_insert` will be chosen as
the write operation, matching the out-of-box behavior of Spark's Parquet Datasource.
:::
@@ -343,7 +343,7 @@ the write operation, matching the out-of-box behavior of Spark's Parquet Datasource.
-Users can use 'INSERT INTO' to insert data into a Hudi table. See [Insert Into](/docs/next/sql_dml#insert-into) for more advanced options.
+Users can use 'INSERT INTO' to insert data into a Hudi table. See [Insert Into](sql_dml#insert-into) for more advanced options.
```sql
INSERT INTO hudi_table
@@ -455,7 +455,7 @@ Notice that the save mode is now `Append`. In general, always use append mode un
-Hudi table can be update using a regular UPDATE statement. See [Update](/docs/next/sql_dml#update) for more advanced options.
+A Hudi table can be updated using a regular UPDATE statement. See [Update](sql_dml#update) for more advanced options.
```sql
UPDATE hudi_table SET fare = 25.0 WHERE rider = 'rider-D';
@@ -485,7 +485,7 @@ Notice that the save mode is now `Append`. In general, always use append mode un
-[Querying](#querying) the data again will now show updated records. Each write operation generates a new [commit](/docs/next/concepts).
+[Querying](#querying) the data again will now show updated records. Each write operation generates a new [commit](concepts).
Look for changes in `_hoodie_commit_time`, `fare` fields for the given `_hoodie_record_key` value from a previous commit.
## Merging Data {#merge}
@@ -1264,7 +1264,7 @@ PARTITIONED BY (city);
>
:::note Implications of defining record keys
-Configuring keys for a Hudi table, has a new implications on the table. If record key is set by the user, `upsert` is chosen as the [write operation](/docs/next/write_operations).
+Configuring keys for a Hudi table has important implications for the table. If a record key is set by the user, `upsert` is chosen as the [write operation](write_operations).
If a record key is configured, it's also advisable to specify a precombine or ordering field, to correctly handle cases where the source data has
multiple records with the same key. See section below.
:::
@@ -1276,8 +1276,8 @@ Hudi also uses this mechanism to support out-of-order data arrival into a table,
For e.g. using a _created_at_ timestamp field as the precombine field will prevent older versions of a record from overwriting newer ones or being exposed to queries, even
if they are written at a later commit time to the table. This is one of the key features that makes Hudi best suited for dealing with streaming data.
-To enable different merge semantics, Hudi supports [merge modes](/docs/next/record_merger). Commit time and event time based merge modes are supported out of the box.
-Users can also define their own custom merge strategies, see [here](/docs/next/sql_ddl#create-table-with-record-merge-mode).
+To enable different merge semantics, Hudi supports [merge modes](record_merger). Commit time and event time based merge modes are supported out of the box.
+Users can also define their own custom merge strategies, see [here](sql_ddl#create-table-with-record-merge-mode).
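
As a rough sketch of the keys and ordering-field discussion above, using the standard Spark datasource options (the DataFrame, field names, table name and path below are illustrative assumptions):

```scala
import org.apache.spark.sql.SaveMode

// Sketch: with a record key configured, `upsert` is used, and the precombine
// field keeps late-arriving rows with an older `created_at` from overwriting
// newer state. `updates`, field names and paths are placeholders.
updates.write.format("hudi").
  option("hoodie.table.name", "hudi_table").
  option("hoodie.datasource.write.recordkey.field", "uuid").
  option("hoodie.datasource.write.partitionpath.field", "city").
  option("hoodie.datasource.write.precombine.field", "created_at").
  mode(SaveMode.Append).
  save("/tmp/hudi_table")
```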
`(see also [build with scala 2.12](https://github.com/apache/hudi#build-with-different-spark-versions))
-for more info. If you are looking for ways to migrate your existing data to Hudi, refer to [migration guide](/docs/next/migration_guide).
+for more info. If you are looking for ways to migrate your existing data to Hudi, refer to [migration guide](migration_guide).
### Spark SQL Reference
-For advanced usage of spark SQL, please refer to [Spark SQL DDL](/docs/next/sql_ddl) and [Spark SQL DML](/docs/next/sql_dml) reference guides.
-For alter table commands, check out [this](/docs/next/sql_ddl#spark-alter-table). Stored procedures provide a lot of powerful capabilities using Hudi SparkSQL to assist with monitoring, managing and operating Hudi tables, please check [this](/docs/next/procedures) out.
+For advanced usage of Spark SQL, please refer to the [Spark SQL DDL](sql_ddl) and [Spark SQL DML](sql_dml) reference guides.
+For alter table commands, check out [this](sql_ddl#spark-alter-table). Stored procedures provide a lot of powerful capabilities using Hudi SparkSQL to assist with monitoring, managing and operating Hudi tables; please check [this](procedures) out.
### Streaming workloads
@@ -1355,9 +1355,9 @@ Hudi provides industry-leading performance and functionality for streaming data.
from various different sources in a streaming manner, with powerful built-in capabilities like auto checkpointing, schema enforcement via schema provider,
transformation support, automatic table services and so on.
-**Structured Streaming** - Hudi supports Spark Structured Streaming reads and writes as well. Please see [here](/docs/next/writing_tables_streaming_writes#spark-streaming) for more.
+**Structured Streaming** - Hudi supports Spark Structured Streaming reads and writes as well. Please see [here](writing_tables_streaming_writes#spark-streaming) for more.
-Check out more information on [modeling data in Hudi](/docs/next/faq_general#how-do-i-model-the-data-stored-in-hudi) and different ways to perform [batch writes](/docs/writing_data) and [streaming writes](/docs/next/writing_tables_streaming_writes).
+Check out more information on [modeling data in Hudi](faq_general#how-do-i-model-the-data-stored-in-hudi) and different ways to perform [batch writes](/docs/writing_data) and [streaming writes](writing_tables_streaming_writes).
### Dockerized Demo
Even as we showcased the core capabilities, Hudi supports a lot more advanced functionality that can make it easy
diff --git a/website/docs/record_merger.md b/website/docs/record_merger.md
index d98a5fc462a6..378c5575ad19 100644
--- a/website/docs/record_merger.md
+++ b/website/docs/record_merger.md
@@ -6,7 +6,7 @@ toc_min_heading_level: 2
toc_max_heading_level: 4
---
-Hudi handles mutations to records and streaming data, as we briefly touched upon in [timeline ordering](/docs/next/timeline#ordering-of-actions) section.
+Hudi handles mutations to records and streaming data, as we briefly touched upon in the [timeline ordering](timeline#ordering-of-actions) section.
To provide users full-fledged support for stream processing, Hudi goes all the way, making the storage engine and the underlying storage format
understand how to merge changes to the same record key, even when they arrive out of order at different times. With the rise of mobile applications
and IoT, these scenarios have become the norm rather than the exception. For e.g. a social networking application uploading user events several hours after they happened,
@@ -54,7 +54,7 @@ With event time ordering, the merging picks the record with the highest value on
In the example above, two microservices produce change records about orders at different times, which can arrive out of order. As color coded,
this can lead to inconsistent application-level states in the table if records are simply merged in commit time order, such as a cancelled order being re-created or
a paid order moving back to the just-created state expecting payment again. Event time ordering helps by ignoring older state changes that arrive late and
-avoiding order status from "jumping back" in time. Combined with [non-blocking concurrency control](/docs/next/concurrency_control#non-blocking-concurrency-control-mode),
+preventing the order status from "jumping back" in time. Combined with [non-blocking concurrency control](concurrency_control#non-blocking-concurrency-control-mode),
this provides a very powerful way for processing such data streams efficiently and correctly.
### CUSTOM
@@ -249,5 +249,5 @@ Payload class can be specified using the below configs. For more advanced config
There are also quite a few other implementations. Developers may be interested in looking at the hierarchy of `HoodieRecordPayload` interface. For
example, [`MySqlDebeziumAvroPayload`](https://github.com/apache/hudi/blob/e76dd102bcaf8aec5a932e7277ccdbfd73ce1a32/hudi-common/src/main/java/org/apache/hudi/common/model/debezium/MySqlDebeziumAvroPayload.java) and [`PostgresDebeziumAvroPayload`](https://github.com/apache/hudi/blob/e76dd102bcaf8aec5a932e7277ccdbfd73ce1a32/hudi-common/src/main/java/org/apache/hudi/common/model/debezium/PostgresDebeziumAvroPayload.java) provide support for seamlessly applying changes
captured via Debezium for MySQL and PostgreSQL. [`AWSDmsAvroPayload`](https://github.com/apache/hudi/blob/e76dd102bcaf8aec5a932e7277ccdbfd73ce1a32/hudi-common/src/main/java/org/apache/hudi/common/model/AWSDmsAvroPayload.java) provides support for applying changes captured via the AWS Database Migration Service onto S3.
-For full configurations, go [here](/docs/configurations#RECORD_PAYLOAD) and please check out [this FAQ](/docs/next/faq_writing_tables/#can-i-implement-my-own-logic-for-how-input-records-are-merged-with-record-on-storage) if you want to implement your own custom payloads.
+For full configurations, go [here](/docs/configurations#RECORD_PAYLOAD) and please check out [this FAQ](faq_writing_tables/#can-i-implement-my-own-logic-for-how-input-records-are-merged-with-record-on-storage) if you want to implement your own custom payloads.
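
As a hedged illustration of wiring one of these payloads into a write (the DataFrame, field names and path below are placeholders introduced for this sketch, not from the original docs):

```scala
import org.apache.spark.sql.SaveMode

// Sketch: apply MySQL Debezium change events using the payload class above.
// `cdcEvents`, field names and the base path are illustrative placeholders.
cdcEvents.write.format("hudi").
  option("hoodie.table.name", "orders_cdc").
  option("hoodie.datasource.write.recordkey.field", "order_id").
  option("hoodie.datasource.write.precombine.field", "_event_seq").
  option("hoodie.datasource.write.payload.class",
    "org.apache.hudi.common.model.debezium.MySqlDebeziumAvroPayload").
  mode(SaveMode.Append).
  save("/tmp/orders_cdc")
```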
diff --git a/website/docs/sql_ddl.md b/website/docs/sql_ddl.md
index e04f64d68b9d..c00b815ac4e8 100644
--- a/website/docs/sql_ddl.md
+++ b/website/docs/sql_ddl.md
@@ -105,7 +105,7 @@ TBLPROPERTIES (
### Create table with merge modes {#create-table-with-record-merge-mode}
-Hudi supports different [record merge modes](/docs/next/record_merger) to handle merge of incoming records with existing
+Hudi supports different [record merge modes](record_merger) to handle merging of incoming records with existing
records. To create a table with a specific record merge mode, you can set the `recordMergeMode` option.
```sql
@@ -127,7 +127,7 @@ LOCATION 'file:///tmp/hudi_table_merge_mode/';
With `EVENT_TIME_ORDERING`, the record with the larger event time (`precombineField`) overwrites the record with the
smaller event time on the same key, regardless of the transaction's commit time. Users can set the `CUSTOM` mode to provide their own
merge logic by supplying a custom class that implements it. The interfaces
-to implement is explained in detail [here](/docs/next/record_merger#custom).
+to implement are explained in detail [here](record_merger#custom).
```sql
CREATE TABLE IF NOT EXISTS hudi_table_merge_mode_custom (
@@ -236,7 +236,7 @@ AS SELECT * FROM parquet_table;
### Create Index
Hudi supports creating and dropping different types of indexes on a table. For more information on different
-type of indexes please refer [multi-modal indexing](/docs/next/indexes#multi-modal-indexing). Secondary
+types of indexes, please refer to [multi-modal indexing](indexes#multi-modal-indexing). Secondary
index, expression index and record index can be created using the SQL create index command.
```sql
@@ -529,7 +529,7 @@ CREATE INDEX idx_bloom_rider ON hudi_indexed_table USING bloom_filters(rider) OP
- Secondary index can only be used for tables using OverwriteWithLatestAvroPayload payload or COMMIT_TIME_ORDERING merge mode
- Column stats Expression Index can not be created using `identity` expression with SQL. Users can leverage column stat index using Datasource instead.
- Index update can fail with schema evolution.
-- Only one index can be created at a time using [async indexer](/docs/next/metadata_indexing).
+- Only one index can be created at a time using [async indexer](metadata_indexing).
### Setting Hudi configs
@@ -592,7 +592,7 @@ Users can set table properties while creating a table. The important table prope
#### Passing Lock Providers for Concurrent Writers
Hudi requires a lock provider to support concurrent writers or asynchronous table services when using OCC
-and [NBCC](/docs/next/concurrency_control#non-blocking-concurrency-control-mode-experimental) (Non-Blocking Concurrency Control)
+and [NBCC](concurrency_control#non-blocking-concurrency-control) (Non-Blocking Concurrency Control)
concurrency mode. For NBCC mode, locking is only used to write the commit metadata file in the timeline. Writes are
serialized by completion time. Users can pass these table properties into *TBLPROPERTIES* as well. Below is an example
for a Zookeeper based configuration.
@@ -843,7 +843,7 @@ WITH (
### Create Table in Non-Blocking Concurrency Control Mode
-The following is an example of creating a Flink table in [Non-Blocking Concurrency Control mode](/docs/next/concurrency_control#non-blocking-concurrency-control).
+The following is an example of creating a Flink table in [Non-Blocking Concurrency Control mode](concurrency_control#non-blocking-concurrency-control).
```sql
-- This is a datagen source that can generate records continuously
@@ -911,7 +911,7 @@ ALTER TABLE tableA RENAME TO tableB;
### Setting Hudi configs
#### Using table options
-You can configure hoodie configs in table options when creating a table. You can refer Flink specific hoodie configs [here](/docs/next/configurations#FLINK_SQL)
+You can configure hoodie configs in table options when creating a table. You can refer to Flink-specific hoodie configs [here](configurations#FLINK_SQL).
These configs will be applied to all the operations on that table.
```sql
diff --git a/website/docs/sql_dml.md b/website/docs/sql_dml.md
index 43d5d940fb37..6f5fe28a3eba 100644
--- a/website/docs/sql_dml.md
+++ b/website/docs/sql_dml.md
@@ -12,7 +12,7 @@ import TabItem from '@theme/TabItem';
SparkSQL provides several Data Manipulation Language (DML) actions for interacting with Hudi tables. These operations allow you to insert, update, merge and delete data
from your Hudi tables. Let's explore them one by one.
-Please refer to [SQL DDL](/docs/next/sql_ddl) for creating Hudi tables using SQL.
+Please refer to [SQL DDL](sql_ddl) for creating Hudi tables using SQL.
### Insert Into
@@ -25,7 +25,7 @@ SELECT FROM
@@ -445,7 +445,7 @@ You can check the data generated under `/tmp/hudi_trips_cow/<region>/<country>/<city>/`
[Modeling data stored in Hudi](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=113709185#FAQ-HowdoImodelthedatastoredinHudi)
and for info on ways to ingest data into Hudi, refer to [Writing Hudi Tables](/docs/writing_data).
Here we are using the default write operation : `upsert`. If you have a workload without updates, you can also issue
-`insert` or `bulk_insert` operations which could be faster. To know more, refer to [Write operations](/docs/writing_data#write-operations)
+`insert` or `bulk_insert` operations which could be faster. To know more, refer to [Write operations](/docs/write_operations)
:::
diff --git a/website/versioned_docs/version-0.10.0/write_operations.md b/website/versioned_docs/version-0.10.0/write_operations.md
index eb3cb9a45220..952fe3b11969 100644
--- a/website/versioned_docs/version-0.10.0/write_operations.md
+++ b/website/versioned_docs/version-0.10.0/write_operations.md
@@ -37,7 +37,7 @@ Hudi supports implementing two types of deletes on data stored in Hudi tables, b
## Writing path
The following is an inside look on the Hudi write path and the sequence of events that occur during a write.
-1. [Deduping](/docs/configurations/#writeinsertdeduplicate)
+1. [Deduping](configurations#hoodiecombinebeforeinsert)
1. First your input records may have duplicate keys within the same batch and duplicates need to be combined or reduced by key.
2. [Index Lookup](/docs/indexing)
1. Next, an index lookup is performed to try and match the input records to identify which file groups they belong to.
@@ -51,7 +51,7 @@ The following is an inside look on the Hudi write path and the sequence of event
6. Update [Index](/docs/indexing)
1. Now that the write is performed, we will go back and update the index.
7. Commit
- 1. Finally we commit all of these changes atomically. (A [callback notification](/docs/writing_data#commit-notifications) is exposed)
+ 1. Finally we commit all of these changes atomically. (A [callback notification](writing_data#commit-notifications) is exposed)
8. [Clean](/docs/hoodie_cleaner) (if needed)
1. Following the commit, cleaning is invoked if needed.
9. [Compaction](/docs/compaction)
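
A minimal sketch of the deduping knob from step 1 above, assuming a Spark datasource insert (the DataFrame, table name and path are placeholders for this sketch):

```scala
import org.apache.spark.sql.SaveMode

// Sketch: combine records sharing a key within the incoming batch before the
// insert is written, per step 1 of the write path. `batch` is a placeholder.
batch.write.format("hudi").
  option("hoodie.table.name", "hudi_trips_cow").
  option("hoodie.datasource.write.operation", "insert").
  option("hoodie.combine.before.insert", "true").
  mode(SaveMode.Append).
  save("/tmp/hudi_trips_cow")
```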
diff --git a/website/versioned_docs/version-0.10.0/writing_data.md b/website/versioned_docs/version-0.10.0/writing_data.md
index 719813360c4c..9806ef706484 100644
--- a/website/versioned_docs/version-0.10.0/writing_data.md
+++ b/website/versioned_docs/version-0.10.0/writing_data.md
@@ -93,7 +93,7 @@ You can check the data generated under `/tmp/hudi_trips_cow/<region>/<country>/<city>/`
[Modeling data stored in Hudi](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=113709185#FAQ-HowdoImodelthedatastoredinHudi)
and for info on ways to ingest data into Hudi, refer to [Writing Hudi Tables](/docs/writing_data).
Here we are using the default write operation : `upsert`. If you have a workload without updates, you can also issue
-`insert` or `bulk_insert` operations which could be faster. To know more, refer to [Write operations](/docs/writing_data#write-operations)
+`insert` or `bulk_insert` operations which could be faster. To know more, refer to [Write operations](/docs/write_operations)
:::
@@ -129,7 +129,7 @@ You can check the data generated under `/tmp/hudi_trips_cow/<region>/<country>/<city>/`
[Modeling data stored in Hudi](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=113709185#FAQ-HowdoImodelthedatastoredinHudi)
and for info on ways to ingest data into Hudi, refer to [Writing Hudi Tables](/docs/writing_data).
Here we are using the default write operation : `upsert`. If you have a workload without updates, you can also issue
-`insert` or `bulk_insert` operations which could be faster. To know more, refer to [Write operations](/docs/writing_data#write-operations)
+`insert` or `bulk_insert` operations which could be faster. To know more, refer to [Write operations](/docs/write_operations)
:::
diff --git a/website/versioned_docs/version-0.10.1/clustering.md b/website/versioned_docs/version-0.10.1/clustering.md
index f9bdd572751a..e630e92445d5 100644
--- a/website/versioned_docs/version-0.10.1/clustering.md
+++ b/website/versioned_docs/version-0.10.1/clustering.md
@@ -12,7 +12,7 @@ Apache Hudi brings stream processing to big data, providing fresh data while bei
## Clustering Architecture
-At a high level, Hudi provides different operations such as insert/upsert/bulk_insert through it’s write client API to be able to write data to a Hudi table. To be able to choose a trade-off between file size and ingestion speed, Hudi provides a knob `hoodie.parquet.small.file.limit` to be able to configure the smallest allowable file size. Users are able to configure the small file [soft limit](https://hudi.apache.org/docs/configurations#compactionSmallFileSize) to `0` to force new data to go into a new set of filegroups or set it to a higher value to ensure new data gets “padded” to existing files until it meets that limit that adds to ingestion latencies.
+At a high level, Hudi provides different operations such as insert/upsert/bulk_insert through its write client API to write data to a Hudi table. To choose a trade-off between file size and ingestion speed, Hudi provides a knob `hoodie.parquet.small.file.limit` to configure the smallest allowable file size. Users are able to configure the small file [soft limit](https://hudi.apache.org/docs/configurations#hoodieparquetsmallfilelimit) to `0` to force new data to go into a new set of filegroups, or set it to a higher value to ensure new data gets “padded” to existing files until it meets that limit, which adds to ingestion latencies.
diff --git a/website/versioned_docs/version-0.10.1/compaction.md b/website/versioned_docs/version-0.10.1/compaction.md
index c56df32c186d..5267f1209844 100644
--- a/website/versioned_docs/version-0.10.1/compaction.md
+++ b/website/versioned_docs/version-0.10.1/compaction.md
@@ -95,7 +95,7 @@ is enabled by default.
:::
### Hudi Compactor Utility
-Hudi provides a standalone tool to execute specific compactions asynchronously. Below is an example and you can read more in the [deployment guide](/docs/deployment#compactions)
+Hudi provides a standalone tool to execute specific compactions asynchronously. Below is an example and you can read more in the [deployment guide](/docs/cli#compactions)
Example:
```properties
diff --git a/website/versioned_docs/version-0.10.1/concurrency_control.md b/website/versioned_docs/version-0.10.1/concurrency_control.md
index 6ea34baa9aab..b926babecb89 100644
--- a/website/versioned_docs/version-0.10.1/concurrency_control.md
+++ b/website/versioned_docs/version-0.10.1/concurrency_control.md
@@ -19,13 +19,13 @@ between multiple table service writers and readers. Additionally, using MVCC, Hu
the same Hudi Table. Hudi supports `file level OCC`, i.e., for any 2 commits (or writers) happening to the same table, if they do not have writes to overlapping files being changed, both writers are allowed to succeed.
This feature is currently *experimental* and requires either Zookeeper or HiveMetastore to acquire locks.
-It may be helpful to understand the different guarantees provided by [write operations](/docs/writing_data#write-operations) via Hudi datasource or the delta streamer.
+It may be helpful to understand the different guarantees provided by [write operations](/docs/write_operations) via Hudi datasource or the delta streamer.
## Single Writer Guarantees
- *UPSERT Guarantee*: The target table will NEVER show duplicates.
- - *INSERT Guarantee*: The target table wilL NEVER have duplicates if [dedup](/docs/configurations#INSERT_DROP_DUPS_OPT_KEY) is enabled.
- - *BULK_INSERT Guarantee*: The target table will NEVER have duplicates if [dedup](/docs/configurations#INSERT_DROP_DUPS_OPT_KEY) is enabled.
+ - *INSERT Guarantee*: The target table will NEVER have duplicates if [dedup](configurations#hoodiedatasourcewriteinsertdropduplicates) is enabled.
+ - *BULK_INSERT Guarantee*: The target table will NEVER have duplicates if [dedup](configurations#hoodiedatasourcewriteinsertdropduplicates) is enabled.
- *INCREMENTAL PULL Guarantee*: Data consumption and checkpoints are NEVER out of order.
## Multi Writer Guarantees
@@ -33,8 +33,8 @@ It may be helpful to understand the different guarantees provided by [write oper
With multiple writers using OCC, some of the above guarantees change as follows
- *UPSERT Guarantee*: The target table will NEVER show duplicates.
-- *INSERT Guarantee*: The target table MIGHT have duplicates even if [dedup](/docs/configurations#INSERT_DROP_DUPS_OPT_KEY) is enabled.
-- *BULK_INSERT Guarantee*: The target table MIGHT have duplicates even if [dedup](/docs/configurations#INSERT_DROP_DUPS_OPT_KEY) is enabled.
+- *INSERT Guarantee*: The target table MIGHT have duplicates even if [dedup](configurations#hoodiedatasourcewriteinsertdropduplicates) is enabled.
+- *BULK_INSERT Guarantee*: The target table MIGHT have duplicates even if [dedup](configurations#hoodiedatasourcewriteinsertdropduplicates) is enabled.
- *INCREMENTAL PULL Guarantee*: Data consumption and checkpoints MIGHT be out of order due to multiple writer jobs finishing at different times.
## Enabling Multi Writing
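
A minimal sketch of the writer-side configs this section goes on to describe, assuming a Zookeeper based lock provider (the DataFrame, Zookeeper endpoint and paths are illustrative placeholders):

```scala
import org.apache.spark.sql.SaveMode

// Sketch: enable OCC for this writer and point it at a Zookeeper lock provider.
// `writerDf`, the Zookeeper host/port and paths are illustrative placeholders.
writerDf.write.format("hudi").
  option("hoodie.table.name", "hudi_table").
  option("hoodie.write.concurrency.mode", "optimistic_concurrency_control").
  option("hoodie.cleaner.policy.failed.writes", "LAZY").
  option("hoodie.write.lock.provider",
    "org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider").
  option("hoodie.write.lock.zookeeper.url", "zk-host").
  option("hoodie.write.lock.zookeeper.port", "2181").
  option("hoodie.write.lock.zookeeper.lock_key", "hudi_table").
  option("hoodie.write.lock.zookeeper.base_path", "/hudi/locks").
  mode(SaveMode.Append).
  save("/tmp/hudi_table")
```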
diff --git a/website/versioned_docs/version-0.10.1/deployment.md b/website/versioned_docs/version-0.10.1/deployment.md
index c3f3de84e88c..7614b28c439c 100644
--- a/website/versioned_docs/version-0.10.1/deployment.md
+++ b/website/versioned_docs/version-0.10.1/deployment.md
@@ -25,9 +25,9 @@ With Merge_On_Read Table, Hudi ingestion needs to also take care of compacting d
### DeltaStreamer
-[DeltaStreamer](/docs/writing_data#deltastreamer) is the standalone utility to incrementally pull upstream changes from varied sources such as DFS, Kafka and DB Changelogs and ingest them to hudi tables. It runs as a spark application in 2 modes.
+[DeltaStreamer](hoodie_deltastreamer#deltastreamer) is the standalone utility to incrementally pull upstream changes from varied sources such as DFS, Kafka and DB Changelogs and ingest them to hudi tables. It runs as a spark application in 2 modes.
- - **Run Once Mode** : In this mode, Deltastreamer performs one ingestion round which includes incrementally pulling events from upstream sources and ingesting them to hudi table. Background operations like cleaning old file versions and archiving hoodie timeline are automatically executed as part of the run. For Merge-On-Read tables, Compaction is also run inline as part of ingestion unless disabled by passing the flag "--disable-compaction". By default, Compaction is run inline for every ingestion run and this can be changed by setting the property "hoodie.compact.inline.max.delta.commits". You can either manually run this spark application or use any cron trigger or workflow orchestrator (most common deployment strategy) such as Apache Airflow to spawn this application. See command line options in [this section](/docs/writing_data#deltastreamer) for running the spark application.
+ - **Run Once Mode** : In this mode, Deltastreamer performs one ingestion round which includes incrementally pulling events from upstream sources and ingesting them to hudi table. Background operations like cleaning old file versions and archiving hoodie timeline are automatically executed as part of the run. For Merge-On-Read tables, Compaction is also run inline as part of ingestion unless disabled by passing the flag "--disable-compaction". By default, Compaction is run inline for every ingestion run and this can be changed by setting the property "hoodie.compact.inline.max.delta.commits". You can either manually run this spark application or use any cron trigger or workflow orchestrator (most common deployment strategy) such as Apache Airflow to spawn this application. See command line options in [this section](hoodie_deltastreamer#deltastreamer) for running the spark application.
Here is an example invocation for reading from kafka topic in a single-run mode and writing to Merge On Read table type in a yarn cluster.
@@ -126,7 +126,7 @@ Here is an example invocation for reading from kafka topic in a continuous mode
### Spark Datasource Writer Jobs
-As described in [Writing Data](/docs/writing_data#datasource-writer), you can use spark datasource to ingest to hudi table. This mechanism allows you to ingest any spark dataframe in Hudi format. Hudi Spark DataSource also supports spark streaming to ingest a streaming source to Hudi table. For Merge On Read table types, inline compaction is turned on by default which runs after every ingestion run. The compaction frequency can be changed by setting the property "hoodie.compact.inline.max.delta.commits".
+As described in [Writing Data](writing_data#spark-datasource-writer), you can use spark datasource to ingest to hudi table. This mechanism allows you to ingest any spark dataframe in Hudi format. Hudi Spark DataSource also supports spark streaming to ingest a streaming source to Hudi table. For Merge On Read table types, inline compaction is turned on by default which runs after every ingestion run. The compaction frequency can be changed by setting the property "hoodie.compact.inline.max.delta.commits".
Here is an example invocation using spark datasource
diff --git a/website/versioned_docs/version-0.10.1/faq.md b/website/versioned_docs/version-0.10.1/faq.md
index d9691b8bdeb4..f6feaa253435 100644
--- a/website/versioned_docs/version-0.10.1/faq.md
+++ b/website/versioned_docs/version-0.10.1/faq.md
@@ -284,7 +284,7 @@ a) **Auto Size small files during ingestion**: This solution trades ingest/writi
Hudi has the ability to maintain a configured target file size, when performing **upsert/insert** operations. (Note: **bulk_insert** operation does not provide this functionality and is designed as a simpler replacement for normal `spark.write.parquet` )
-For **copy-on-write**, this is as simple as configuring the [maximum size for a base/parquet file](https://hudi.apache.org/docs/configurations#hoodieparquetmaxfilesize) and the [soft limit](https://hudi.apache.org/docs/configurations#hoodieparquetsmallfilelimit) below which a file should be considered a small file. For the initial bootstrap to Hudi table, tuning record size estimate is also important to ensure sufficient records are bin-packed in a parquet file. For subsequent writes, Hudi automatically uses average record size based on previous commit. Hudi will try to add enough records to a small file at write time to get it to the configured maximum limit. For e.g , with `compactionSmallFileSize=100MB` and limitFileSize=120MB, Hudi will pick all files < 100MB and try to get them upto 120MB.
+For **copy-on-write**, this is as simple as configuring the [maximum size for a base/parquet file](https://hudi.apache.org/docs/configurations#hoodieparquetmaxfilesize) and the [soft limit](https://hudi.apache.org/docs/configurations#hoodieparquetsmallfilelimit) below which a file should be considered a small file. For the initial bootstrap of a Hudi table, tuning record size estimate is also important to ensure sufficient records are bin-packed in a parquet file. For subsequent writes, Hudi automatically uses average record size based on previous commit. Hudi will try to add enough records to a small file at write time to get it to the configured maximum limit. For e.g., with `hoodie.parquet.small.file.limit=100MB` and `hoodie.parquet.max.file.size=120MB`, Hudi will pick all files < 100MB and try to get them up to 120MB.
For **merge-on-read**, there are few more configs to set. MergeOnRead works differently for different INDEX choices.
diff --git a/website/versioned_docs/version-0.10.1/file_sizing.md b/website/versioned_docs/version-0.10.1/file_sizing.md
index e7935445d9e6..58831e4b2995 100644
--- a/website/versioned_docs/version-0.10.1/file_sizing.md
+++ b/website/versioned_docs/version-0.10.1/file_sizing.md
@@ -21,7 +21,7 @@ and the [soft limit](/docs/configurations#hoodieparquetsmallfilelimit) below whi
be considered a small file. For the initial bootstrap of a Hudi table, tuning record size estimate is also important to
ensure sufficient records are bin-packed in a parquet file. For subsequent writes, Hudi automatically uses average
record size based on previous commit. Hudi will try to add enough records to a small file at write time to get it to the
-configured maximum limit. For e.g , with `compactionSmallFileSize=100MB` and limitFileSize=120MB, Hudi will pick all
+configured maximum limit. For e.g., with `hoodie.parquet.small.file.limit=100MB` and `hoodie.parquet.max.file.size=120MB`, Hudi will pick all
files < 100MB and try to get them up to 120MB.
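
A hedged sketch of that 100MB/120MB example expressed as Spark datasource options (the DataFrame and base path are placeholders introduced here):

```scala
import org.apache.spark.sql.SaveMode

// Sketch: files under 100MB are considered small and are padded toward the
// 120MB maximum on subsequent writes. `inserts` and the path are placeholders.
inserts.write.format("hudi").
  option("hoodie.table.name", "hudi_table").
  option("hoodie.parquet.small.file.limit", (100 * 1024 * 1024).toString).
  option("hoodie.parquet.max.file.size", (120 * 1024 * 1024).toString).
  mode(SaveMode.Append).
  save("/tmp/hudi_table")
```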
### For Merge-On-Read
diff --git a/website/versioned_docs/version-0.10.1/performance.md b/website/versioned_docs/version-0.10.1/performance.md
index 53152730bd84..274ed9dc3fd4 100644
--- a/website/versioned_docs/version-0.10.1/performance.md
+++ b/website/versioned_docs/version-0.10.1/performance.md
@@ -14,10 +14,10 @@ column statistics etc. Even on some cloud data stores, there is often cost to li
Here are some ways to efficiently manage the storage of your Hudi tables.
-- The [small file handling feature](/docs/configurations#compactionSmallFileSize) in Hudi, profiles incoming workload
+- The [small file handling feature](/docs/configurations#hoodieparquetsmallfilelimit) in Hudi, profiles incoming workload
and distributes inserts to existing file groups instead of creating new file groups, which can lead to small files.
-- Cleaner can be [configured](/docs/configurations#retainCommits) to clean up older file slices, more or less aggressively depending on maximum time for queries to run & lookback needed for incremental pull
-- User can also tune the size of the [base/parquet file](/docs/configurations#limitFileSize), [log files](/docs/configurations#logFileMaxSize) & expected [compression ratio](/docs/configurations#parquetCompressionRatio),
+- Cleaner can be [configured](/docs/configurations#hoodiecleanercommitsretained) to clean up older file slices, more or less aggressively depending on maximum time for queries to run & lookback needed for incremental pull
+- User can also tune the size of the [base/parquet file](/docs/configurations#hoodieparquetmaxfilesize), [log files](configurations/#hoodielogfilemaxsize) & expected [compression ratio](/docs/configurations#parquetCompressionRatio),
such that sufficient number of inserts are grouped into the same file group, resulting in well sized base files ultimately.
- Intelligently tuning the [bulk insert parallelism](/docs/configurations#withBulkInsertParallelism) can again result in nicely sized initial file groups. It is in fact critical to get this right, since the file groups
once created cannot be deleted, but simply expanded as explained before.
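
Taken together, a rough sketch of these storage-management knobs on a single writer; all values, the DataFrame and the path are illustrative, not recommendations:

```scala
import org.apache.spark.sql.SaveMode

// Sketch: base file target, log file cap, cleaner lookback and bulk insert
// parallelism tuned in one place. Values are illustrative placeholders.
inserts.write.format("hudi").
  option("hoodie.table.name", "hudi_table").
  option("hoodie.parquet.max.file.size", (120 * 1024 * 1024).toString).
  option("hoodie.logfile.max.size", (1024 * 1024 * 1024).toString).
  option("hoodie.cleaner.commits.retained", "10").
  option("hoodie.bulkinsert.shuffle.parallelism", "200").
  mode(SaveMode.Append).
  save("/tmp/hudi_table")
```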
diff --git a/website/versioned_docs/version-0.10.1/querying_data.md b/website/versioned_docs/version-0.10.1/querying_data.md
index a4fe212de99b..4a120b3423f3 100644
--- a/website/versioned_docs/version-0.10.1/querying_data.md
+++ b/website/versioned_docs/version-0.10.1/querying_data.md
@@ -17,7 +17,7 @@ In sections, below we will discuss specific setup to access different query type
The Spark Datasource API is a popular way of authoring Spark ETL pipelines. Hudi tables can be queried via the Spark datasource with a simple `spark.read.parquet`.
See the [Spark Quick Start](/docs/quick-start-guide) for more examples of Spark datasource reading queries.
-To setup Spark for querying Hudi, see the [Query Engine Setup](/docs/0.10.1/query_engine_setup#Spark-DataSource) page.
+To setup Spark for querying Hudi, see the [Query Engine Setup](/docs/0.10.1/query_engine_setup#spark) page.
### Snapshot query {#spark-snap-query}
Retrieve the data table at the present point in time.
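
A minimal snapshot-read sketch, assuming a `spark-shell` session where `spark` is in scope and the quickstart base path has been written; the glob depth depends on your partitioning and the path is a placeholder:

```scala
// Sketch: read the table's current snapshot via the Spark datasource and
// query it as a temp view. The base path and glob are placeholders.
val snapshotDF = spark.read.format("hudi").load("/tmp/hudi_trips_cow/*/*/*/*")
snapshotDF.createOrReplaceTempView("hudi_trips_snapshot")
spark.sql("select _hoodie_commit_time, fare from hudi_trips_snapshot").show()
```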
@@ -49,7 +49,7 @@ spark.sql("select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from hu
```
For examples, refer to [Incremental Queries](/docs/quick-start-guide#incremental-query) in the Spark quickstart.
-Please refer to [configurations](/docs/configurations#spark-datasource) section, to view all datasource options.
+Please refer to [configurations](/docs/configurations/#SPARK_DATASOURCE) section, to view all datasource options.
Additionally, `HoodieReadClient` offers the following functionality using Hudi's implicit indexing.
@@ -170,10 +170,10 @@ would ensure Map Reduce execution is chosen for a Hive query, which combines par
separated) and calls InputFormat.listStatus() only once with all those partitions.
## PrestoDB
-To setup PrestoDB for querying Hudi, see the [Query Engine Setup](/docs/0.10.1/query_engine_setup#PrestoDB) page.
+To setup PrestoDB for querying Hudi, see the [Query Engine Setup](/docs/0.10.1/query_engine_setup#prestodb) page.
## Trino
-To setup Trino for querying Hudi, see the [Query Engine Setup](/docs/0.10.1/query_engine_setup#Trino) page.
+To setup Trino for querying Hudi, see the [Query Engine Setup](/docs/0.10.1/query_engine_setup#trino) page.
## Impala (3.4 or later)
diff --git a/website/versioned_docs/version-0.10.1/quick-start-guide.md b/website/versioned_docs/version-0.10.1/quick-start-guide.md
index fc4f17202ff3..d286cc3e61ce 100644
--- a/website/versioned_docs/version-0.10.1/quick-start-guide.md
+++ b/website/versioned_docs/version-0.10.1/quick-start-guide.md
@@ -426,7 +426,7 @@ You can check the data generated under `/tmp/hudi_trips_cow/<region>/<country>/<city>/`
[Modeling data stored in Hudi](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=113709185#FAQ-HowdoImodelthedatastoredinHudi)
and for info on ways to ingest data into Hudi, refer to [Writing Hudi Tables](/docs/writing_data).
Here we are using the default write operation : `upsert`. If you have a workload without updates, you can also issue
-`insert` or `bulk_insert` operations which could be faster. To know more, refer to [Write operations](/docs/writing_data#write-operations)
+`insert` or `bulk_insert` operations which could be faster. To know more, refer to [Write operations](/docs/write_operations)
:::
@@ -462,7 +462,7 @@ You can check the data generated under `/tmp/hudi_trips_cow/<region>/<country>/<city>/`
[Modeling data stored in Hudi](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=113709185#FAQ-HowdoImodelthedatastoredinHudi)
and for info on ways to ingest data into Hudi, refer to [Writing Hudi Tables](/docs/writing_data).
Here we are using the default write operation : `upsert`. If you have a workload without updates, you can also issue
-`insert` or `bulk_insert` operations which could be faster. To know more, refer to [Write operations](/docs/writing_data#write-operations)
+`insert` or `bulk_insert` operations which could be faster. To know more, refer to [Write operations](/docs/write_operations)
:::
diff --git a/website/versioned_docs/version-0.10.1/tuning-guide.md b/website/versioned_docs/version-0.10.1/tuning-guide.md
index 4affeafda663..12b68098e060 100644
--- a/website/versioned_docs/version-0.10.1/tuning-guide.md
+++ b/website/versioned_docs/version-0.10.1/tuning-guide.md
@@ -17,7 +17,7 @@ Writing data via Hudi happens as a Spark job and thus general rules of spark deb
**Spark Memory** : Typically, Hudi needs to be able to read a single file into memory to perform merges or compactions, and thus the executor memory should be sufficient to accommodate this. In addition, Hoodie caches the input to be able to intelligently place data, and thus leaving some `spark.memory.storageFraction` will generally help boost performance.
-**Sizing files**: Set `limitFileSize` above judiciously, to balance ingest/write latency vs number of files & consequently metadata overhead associated with it.
+**Sizing files**: Set `hoodie.parquet.max.file.size` above judiciously, to balance ingest/write latency vs the number of files & consequently the metadata overhead associated with it.
**Timeseries/Log data** : Default configs are tuned for database/nosql changelogs where individual record sizes are large. Another very popular class of data is timeseries/event/log data that tends to be more voluminous with a lot more records per partition. In such cases consider tuning the bloom filter accuracy via `.bloomFilterFPP()/bloomFilterNumEntries()` to achieve your target index look up time. Also, consider making a key that is prefixed with time of the event, which will enable range pruning & significantly speed up index lookup.
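
A hedged sketch of the bloom filter tuning mentioned above, expressed as write options rather than builder calls (the DataFrame, values and path are illustrative assumptions):

```scala
import org.apache.spark.sql.SaveMode

// Sketch: size the bloom filter for high record counts per file and a tighter
// false positive rate. `events`, values and the path are placeholders.
events.write.format("hudi").
  option("hoodie.table.name", "events").
  option("hoodie.index.type", "BLOOM").
  option("hoodie.index.bloom.num_entries", "200000").
  option("hoodie.index.bloom.fpp", "0.000001").
  mode(SaveMode.Append).
  save("/tmp/events")
```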
diff --git a/website/versioned_docs/version-0.10.1/write_operations.md b/website/versioned_docs/version-0.10.1/write_operations.md
index eb3cb9a45220..952fe3b11969 100644
--- a/website/versioned_docs/version-0.10.1/write_operations.md
+++ b/website/versioned_docs/version-0.10.1/write_operations.md
@@ -37,7 +37,7 @@ Hudi supports implementing two types of deletes on data stored in Hudi tables, b
## Writing path
The following is an inside look on the Hudi write path and the sequence of events that occur during a write.
-1. [Deduping](/docs/configurations/#writeinsertdeduplicate)
+1. [Deduping](configurations#hoodiecombinebeforeinsert)
1. First your input records may have duplicate keys within the same batch and duplicates need to be combined or reduced by key.
2. [Index Lookup](/docs/indexing)
1. Next, an index lookup is performed to try and match the input records to identify which file groups they belong to.
@@ -51,7 +51,7 @@ The following is an inside look on the Hudi write path and the sequence of event
6. Update [Index](/docs/indexing)
1. Now that the write is performed, we will go back and update the index.
7. Commit
- 1. Finally we commit all of these changes atomically. (A [callback notification](/docs/writing_data#commit-notifications) is exposed)
+ 1. Finally we commit all of these changes atomically. (A [callback notification](writing_data#commit-notifications) is exposed)
8. [Clean](/docs/hoodie_cleaner) (if needed)
1. Following the commit, cleaning is invoked if needed.
9. [Compaction](/docs/compaction)
diff --git a/website/versioned_docs/version-0.10.1/writing_data.md b/website/versioned_docs/version-0.10.1/writing_data.md
index 719813360c4c..9806ef706484 100644
--- a/website/versioned_docs/version-0.10.1/writing_data.md
+++ b/website/versioned_docs/version-0.10.1/writing_data.md
@@ -93,7 +93,7 @@ You can check the data generated under `/tmp/hudi_trips_cow/<region>/<country>/<city>/`
[Modeling data stored in Hudi](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=113709185#FAQ-HowdoImodelthedatastoredinHudi)
and for info on ways to ingest data into Hudi, refer to [Writing Hudi Tables](/docs/writing_data).
Here we are using the default write operation : `upsert`. If you have a workload without updates, you can also issue
-`insert` or `bulk_insert` operations which could be faster. To know more, refer to [Write operations](/docs/writing_data#write-operations)
+`insert` or `bulk_insert` operations which could be faster. To know more, refer to [Write operations](/docs/write_operations)
:::
@@ -129,7 +129,7 @@ You can check the data generated under `/tmp/hudi_trips_cow/<region>/<country>/<city>/`
[Modeling data stored in Hudi](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=113709185#FAQ-HowdoImodelthedatastoredinHudi)
and for info on ways to ingest data into Hudi, refer to [Writing Hudi Tables](/docs/writing_data).
Here we are using the default write operation : `upsert`. If you have a workload without updates, you can also issue
-`insert` or `bulk_insert` operations which could be faster. To know more, refer to [Write operations](/docs/writing_data#write-operations)
+`insert` or `bulk_insert` operations which could be faster. To know more, refer to [Write operations](/docs/write_operations)
:::
diff --git a/website/versioned_docs/version-0.11.0/compaction.md b/website/versioned_docs/version-0.11.0/compaction.md
index a6249b7ae7c4..e99cc2082c5f 100644
--- a/website/versioned_docs/version-0.11.0/compaction.md
+++ b/website/versioned_docs/version-0.11.0/compaction.md
@@ -95,7 +95,7 @@ is enabled by default.
:::
### Hudi Compactor Utility
-Hudi provides a standalone tool to execute specific compactions asynchronously. Below is an example and you can read more in the [deployment guide](/docs/deployment#compactions)
+Hudi provides a standalone tool to execute specific compactions asynchronously. Below is an example and you can read more in the [deployment guide](/docs/cli#compactions)
Example:
```properties
diff --git a/website/versioned_docs/version-0.11.0/deployment.md b/website/versioned_docs/version-0.11.0/deployment.md
index 24ea35e3999f..7fbc595b8b2b 100644
--- a/website/versioned_docs/version-0.11.0/deployment.md
+++ b/website/versioned_docs/version-0.11.0/deployment.md
@@ -135,7 +135,7 @@ Here is an example invocation for reading from kafka topic in a continuous mode
### Spark Datasource Writer Jobs
-As described in [Writing Data](/docs/writing_data#spark-datasource-writer), you can use spark datasource to ingest to hudi table. This mechanism allows you to ingest any spark dataframe in Hudi format. Hudi Spark DataSource also supports spark streaming to ingest a streaming source to Hudi table. For Merge On Read table types, inline compaction is turned on by default which runs after every ingestion run. The compaction frequency can be changed by setting the property "hoodie.compact.inline.max.delta.commits".
+As described in [Writing Data](writing_data#spark-datasource-writer), you can use spark datasource to ingest to hudi table. This mechanism allows you to ingest any spark dataframe in Hudi format. Hudi Spark DataSource also supports spark streaming to ingest a streaming source to Hudi table. For Merge On Read table types, inline compaction is turned on by default which runs after every ingestion run. The compaction frequency can be changed by setting the property "hoodie.compact.inline.max.delta.commits".
Here is an example invocation using spark datasource
diff --git a/website/versioned_docs/version-0.11.0/faq.md b/website/versioned_docs/version-0.11.0/faq.md
index 6c2c86fef5d6..32469d64e81f 100644
--- a/website/versioned_docs/version-0.11.0/faq.md
+++ b/website/versioned_docs/version-0.11.0/faq.md
@@ -284,7 +284,7 @@ a) **Auto Size small files during ingestion**: This solution trades ingest/writi
Hudi has the ability to maintain a configured target file size, when performing **upsert/insert** operations. (Note: **bulk_insert** operation does not provide this functionality and is designed as a simpler replacement for normal `spark.write.parquet` )
-For **copy-on-write**, this is as simple as configuring the [maximum size for a base/parquet file](https://hudi.apache.org/docs/configurations#hoodieparquetmaxfilesize) and the [soft limit](https://hudi.apache.org/docs/configurations#hoodieparquetsmallfilelimit) below which a file should be considered a small file. For the initial bootstrap to Hudi table, tuning record size estimate is also important to ensure sufficient records are bin-packed in a parquet file. For subsequent writes, Hudi automatically uses average record size based on previous commit. Hudi will try to add enough records to a small file at write time to get it to the configured maximum limit. For e.g , with `compactionSmallFileSize=100MB` and limitFileSize=120MB, Hudi will pick all files < 100MB and try to get them upto 120MB.
+For **copy-on-write**, this is as simple as configuring the [maximum size for a base/parquet file](https://hudi.apache.org/docs/configurations#hoodieparquetmaxfilesize) and the [soft limit](https://hudi.apache.org/docs/configurations#hoodieparquetsmallfilelimit) below which a file should be considered a small file. For the initial bootstrap of a Hudi table, tuning record size estimate is also important to ensure sufficient records are bin-packed in a parquet file. For subsequent writes, Hudi automatically uses average record size based on previous commit. Hudi will try to add enough records to a small file at write time to get it to the configured maximum limit. For e.g., with `hoodie.parquet.small.file.limit=100MB` and `hoodie.parquet.max.file.size=120MB`, Hudi will pick all files < 100MB and try to get them up to 120MB.
For **merge-on-read**, there are few more configs to set. MergeOnRead works differently for different INDEX choices.
diff --git a/website/versioned_docs/version-0.11.0/file_sizing.md b/website/versioned_docs/version-0.11.0/file_sizing.md
index e7935445d9e6..58831e4b2995 100644
--- a/website/versioned_docs/version-0.11.0/file_sizing.md
+++ b/website/versioned_docs/version-0.11.0/file_sizing.md
@@ -21,7 +21,7 @@ and the [soft limit](/docs/configurations#hoodieparquetsmallfilelimit) below whi
be considered a small file. For the initial bootstrap of a Hudi table, tuning record size estimate is also important to
ensure sufficient records are bin-packed in a parquet file. For subsequent writes, Hudi automatically uses average
record size based on previous commit. Hudi will try to add enough records to a small file at write time to get it to the
-configured maximum limit. For e.g , with `compactionSmallFileSize=100MB` and limitFileSize=120MB, Hudi will pick all
+configured maximum limit. For e.g., with `hoodie.parquet.small.file.limit=100MB` and `hoodie.parquet.max.file.size=120MB`, Hudi will pick all
files < 100MB and try to get them up to 120MB.
### For Merge-On-Read
diff --git a/website/versioned_docs/version-0.11.0/querying_data.md b/website/versioned_docs/version-0.11.0/querying_data.md
index 6ad05015e753..3a81bc22a17a 100644
--- a/website/versioned_docs/version-0.11.0/querying_data.md
+++ b/website/versioned_docs/version-0.11.0/querying_data.md
@@ -17,7 +17,7 @@ In sections, below we will discuss specific setup to access different query type
The Spark Datasource API is a popular way of authoring Spark ETL pipelines. Hudi tables can be queried via the Spark datasource with a simple `spark.read.parquet`.
See the [Spark Quick Start](/docs/quick-start-guide) for more examples of Spark datasource reading queries.
-To setup Spark for querying Hudi, see the [Query Engine Setup](/docs/0.11.0/query_engine_setup#Spark-DataSource) page.
+To setup Spark for querying Hudi, see the [Query Engine Setup](/docs/0.11.0/query_engine_setup#spark) page.
### Snapshot query {#spark-snap-query}
Retrieve the data table at the present point in time.
diff --git a/website/versioned_docs/version-0.11.0/tuning-guide.md b/website/versioned_docs/version-0.11.0/tuning-guide.md
index 4affeafda663..12b68098e060 100644
--- a/website/versioned_docs/version-0.11.0/tuning-guide.md
+++ b/website/versioned_docs/version-0.11.0/tuning-guide.md
@@ -17,7 +17,7 @@ Writing data via Hudi happens as a Spark job and thus general rules of spark deb
**Spark Memory** : Typically, Hudi needs to be able to read a single file into memory to perform merges or compactions, and thus the executor memory should be sufficient to accommodate this. In addition, Hoodie caches the input to be able to intelligently place data, and thus leaving some `spark.memory.storageFraction` will generally help boost performance.
-**Sizing files**: Set `limitFileSize` above judiciously, to balance ingest/write latency vs number of files & consequently metadata overhead associated with it.
+**Sizing files**: Set `hoodie.parquet.max.file.size` above judiciously, to balance ingest/write latency vs the number of files & consequently the metadata overhead associated with it.
**Timeseries/Log data** : Default configs are tuned for database/nosql changelogs where individual record sizes are large. Another very popular class of data is timeseries/event/log data that tends to be more voluminous with a lot more records per partition. In such cases consider tuning the bloom filter accuracy via `.bloomFilterFPP()/bloomFilterNumEntries()` to achieve your target index look up time. Also, consider making a key that is prefixed with time of the event, which will enable range pruning & significantly speed up index lookup.
diff --git a/website/versioned_docs/version-0.11.0/write_operations.md b/website/versioned_docs/version-0.11.0/write_operations.md
index baa6d7dbf848..9ff8431384ca 100644
--- a/website/versioned_docs/version-0.11.0/write_operations.md
+++ b/website/versioned_docs/version-0.11.0/write_operations.md
@@ -51,7 +51,7 @@ The following is an inside look on the Hudi write path and the sequence of event
6. Update [Index](/docs/indexing)
1. Now that the write is performed, we will go back and update the index.
7. Commit
- 1. Finally we commit all of these changes atomically. (A [callback notification](/docs/writing_data#commit-notifications) is exposed)
+ 1. Finally we commit all of these changes atomically. (A [callback notification](writing_data#commit-notifications) is exposed)
8. [Clean](/docs/hoodie_cleaner) (if needed)
1. Following the commit, cleaning is invoked if needed.
9. [Compaction](/docs/compaction)
diff --git a/website/versioned_docs/version-0.11.1/compaction.md b/website/versioned_docs/version-0.11.1/compaction.md
index 9d73e31bd5b0..7b84502c973d 100644
--- a/website/versioned_docs/version-0.11.1/compaction.md
+++ b/website/versioned_docs/version-0.11.1/compaction.md
@@ -95,7 +95,7 @@ is enabled by default.
:::
### Hudi Compactor Utility
-Hudi provides a standalone tool to execute specific compactions asynchronously. Below is an example and you can read more in the [deployment guide](/docs/deployment#compactions)
+Hudi provides a standalone tool to execute specific compactions asynchronously. Below is an example and you can read more in the [deployment guide](/docs/cli#compactions)
Example:
```properties
diff --git a/website/versioned_docs/version-0.11.1/deployment.md b/website/versioned_docs/version-0.11.1/deployment.md
index c8c4e5cefdc6..bce07498029b 100644
--- a/website/versioned_docs/version-0.11.1/deployment.md
+++ b/website/versioned_docs/version-0.11.1/deployment.md
@@ -135,7 +135,7 @@ Here is an example invocation for reading from kafka topic in a continuous mode
### Spark Datasource Writer Jobs
-As described in [Writing Data](/docs/writing_data#spark-datasource-writer), you can use spark datasource to ingest to hudi table. This mechanism allows you to ingest any spark dataframe in Hudi format. Hudi Spark DataSource also supports spark streaming to ingest a streaming source to Hudi table. For Merge On Read table types, inline compaction is turned on by default which runs after every ingestion run. The compaction frequency can be changed by setting the property "hoodie.compact.inline.max.delta.commits".
+As described in [Writing Data](writing_data#spark-datasource-writer), you can use spark datasource to ingest to hudi table. This mechanism allows you to ingest any spark dataframe in Hudi format. Hudi Spark DataSource also supports spark streaming to ingest a streaming source to Hudi table. For Merge On Read table types, inline compaction is turned on by default which runs after every ingestion run. The compaction frequency can be changed by setting the property "hoodie.compact.inline.max.delta.commits".
Here is an example invocation using spark datasource
diff --git a/website/versioned_docs/version-0.11.1/faq.md b/website/versioned_docs/version-0.11.1/faq.md
index b081d0fe1b03..095480d29842 100644
--- a/website/versioned_docs/version-0.11.1/faq.md
+++ b/website/versioned_docs/version-0.11.1/faq.md
@@ -295,7 +295,7 @@ a) **Auto Size small files during ingestion**: This solution trades ingest/writi
Hudi has the ability to maintain a configured target file size, when performing **upsert/insert** operations. (Note: **bulk_insert** operation does not provide this functionality and is designed as a simpler replacement for normal `spark.write.parquet` )
-For **copy-on-write**, this is as simple as configuring the [maximum size for a base/parquet file](https://hudi.apache.org/docs/configurations#hoodieparquetmaxfilesize) and the [soft limit](https://hudi.apache.org/docs/configurations#hoodieparquetsmallfilelimit) below which a file should be considered a small file. For the initial bootstrap to Hudi table, tuning record size estimate is also important to ensure sufficient records are bin-packed in a parquet file. For subsequent writes, Hudi automatically uses average record size based on previous commit. Hudi will try to add enough records to a small file at write time to get it to the configured maximum limit. For e.g , with `compactionSmallFileSize=100MB` and limitFileSize=120MB, Hudi will pick all files < 100MB and try to get them upto 120MB.
+For **copy-on-write**, this is as simple as configuring the [maximum size for a base/parquet file](https://hudi.apache.org/docs/configurations#hoodieparquetmaxfilesize) and the [soft limit](https://hudi.apache.org/docs/configurations#hoodieparquetsmallfilelimit) below which a file should be considered a small file. For the initial bootstrap of a Hudi table, tuning record size estimate is also important to ensure sufficient records are bin-packed in a parquet file. For subsequent writes, Hudi automatically uses average record size based on previous commit. Hudi will try to add enough records to a small file at write time to get it to the configured maximum limit. For e.g., with `hoodie.parquet.small.file.limit=100MB` and `hoodie.parquet.max.file.size=120MB`, Hudi will pick all files < 100MB and try to get them up to 120MB.
For **merge-on-read**, there are few more configs to set. MergeOnRead works differently for different INDEX choices.
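To make the two properties concrete, here is a minimal Spark datasource write sketch in Scala that sets both limits explicitly. The table name, record key, precombine field and paths are placeholder assumptions, and the byte values correspond to roughly 100MB and 120MB.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("hudi-file-sizing-sketch").getOrCreate()
val df = spark.read.parquet("/tmp/input")   // assumed input data

df.write.format("hudi").
  option("hoodie.table.name", "my_table").                          // assumed table name
  option("hoodie.datasource.write.recordkey.field", "uuid").        // assumed record key
  option("hoodie.datasource.write.precombine.field", "ts").         // assumed ordering field
  option("hoodie.parquet.small.file.limit", 100L * 1024 * 1024).    // ~100MB: files below this are candidates for bin-packing
  option("hoodie.parquet.max.file.size", 120L * 1024 * 1024).       // ~120MB: ceiling Hudi grows small files towards
  mode(SaveMode.Append).
  save("/tmp/hudi/my_table")                                        // assumed base path
```

Both properties take sizes in bytes, so the human-readable `100MB`/`120MB` shorthand in the prose maps to the byte products shown in the sketch.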
diff --git a/website/versioned_docs/version-0.11.1/file_sizing.md b/website/versioned_docs/version-0.11.1/file_sizing.md
index e7935445d9e6..58831e4b2995 100644
--- a/website/versioned_docs/version-0.11.1/file_sizing.md
+++ b/website/versioned_docs/version-0.11.1/file_sizing.md
@@ -21,7 +21,7 @@ and the [soft limit](/docs/configurations#hoodieparquetsmallfilelimit) below whi
be considered a small file. For the initial bootstrap of a Hudi table, tuning record size estimate is also important to
ensure sufficient records are bin-packed in a parquet file. For subsequent writes, Hudi automatically uses average
record size based on previous commit. Hudi will try to add enough records to a small file at write time to get it to the
-configured maximum limit. For e.g , with `compactionSmallFileSize=100MB` and limitFileSize=120MB, Hudi will pick all
+configured maximum limit. For e.g , with `hoodie.parquet.small.file.limit=100MB` and `hoodie.parquet.max.file.size=120MB`, Hudi will pick all
files < 100MB and try to get them upto 120MB.
### For Merge-On-Read
diff --git a/website/versioned_docs/version-0.11.1/querying_data.md b/website/versioned_docs/version-0.11.1/querying_data.md
index 4cae617a5b6e..b9c2294a83a4 100644
--- a/website/versioned_docs/version-0.11.1/querying_data.md
+++ b/website/versioned_docs/version-0.11.1/querying_data.md
@@ -17,7 +17,7 @@ In sections, below we will discuss specific setup to access different query type
The Spark Datasource API is a popular way of authoring Spark ETL pipelines. Hudi tables can be queried via the Spark datasource with a simple `spark.read.parquet`.
See the [Spark Quick Start](/docs/quick-start-guide) for more examples of Spark datasource reading queries.
-To setup Spark for querying Hudi, see the [Query Engine Setup](/docs/0.11.1/query_engine_setup#Spark-DataSource) page.
+To setup Spark for querying Hudi, see the [Query Engine Setup](/docs/0.11.1/query_engine_setup#spark) page.
### Snapshot query {#spark-snap-query}
Retrieve the data table at the present point in time.
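A snapshot read of the kind described here can be sketched with the Spark datasource directly; the base path below is an assumption, and the query type option is spelled out even though snapshot is the default.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("hudi-snapshot-read-sketch").getOrCreate()

// Snapshot query: read the latest committed state of the table.
val snapshotDF = spark.read.format("hudi").
  option("hoodie.datasource.query.type", "snapshot").   // default query type, shown for clarity
  load("/tmp/hudi/my_table")                            // assumed base path

snapshotDF.createOrReplaceTempView("hudi_snapshot")
spark.sql("select count(*) from hudi_snapshot").show()
```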
diff --git a/website/versioned_docs/version-0.11.1/tuning-guide.md b/website/versioned_docs/version-0.11.1/tuning-guide.md
index 4affeafda663..12b68098e060 100644
--- a/website/versioned_docs/version-0.11.1/tuning-guide.md
+++ b/website/versioned_docs/version-0.11.1/tuning-guide.md
@@ -17,7 +17,7 @@ Writing data via Hudi happens as a Spark job and thus general rules of spark deb
**Spark Memory** : Typically, hudi needs to be able to read a single file into memory to perform merges or compactions and thus the executor memory should be sufficient to accomodate this. In addition, Hoodie caches the input to be able to intelligently place data and thus leaving some `spark.memory.storageFraction` will generally help boost performance.
-**Sizing files**: Set `limitFileSize` above judiciously, to balance ingest/write latency vs number of files & consequently metadata overhead associated with it.
+**Sizing files**: Set `hoodie.parquet.max.file.size` above judiciously, to balance ingest/write latency vs number of files & consequently metadata overhead associated with it.
**Timeseries/Log data** : Default configs are tuned for database/nosql changelogs where individual record sizes are large. Another very popular class of data is timeseries/event/log data that tends to be more volumnious with lot more records per partition. In such cases consider tuning the bloom filter accuracy via `.bloomFilterFPP()/bloomFilterNumEntries()` to achieve your target index look up time. Also, consider making a key that is prefixed with time of the event, which will enable range pruning & significantly speeding up index lookup.
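For the timeseries/log case above, the sketch below shows what that tuning can look like as plain Spark write options (the config-key equivalents of the builder methods); the entry count, false-positive rate and field names are illustrative assumptions, not recommendations.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

// Illustrative bloom-filter tuning for high-volume event/log data.
def writeEvents(events: DataFrame, basePath: String): Unit = {
  events.write.format("hudi").
    option("hoodie.table.name", "events").                              // assumed table name
    option("hoodie.datasource.write.recordkey.field", "event_ts_key").  // time-prefixed key, enabling range pruning
    option("hoodie.datasource.write.precombine.field", "event_ts").     // assumed ordering field
    option("hoodie.index.type", "BLOOM").
    option("hoodie.index.bloom.num_entries", "150000").                 // size the filter for more keys per file
    option("hoodie.index.bloom.fpp", "0.000001").                       // tighter false-positive rate, larger filter
    mode(SaveMode.Append).
    save(basePath)
}
```

Lowering `hoodie.index.bloom.fpp` trades filter size for fewer false positives during index lookup.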
diff --git a/website/versioned_docs/version-0.11.1/write_operations.md b/website/versioned_docs/version-0.11.1/write_operations.md
index baa6d7dbf848..9ff8431384ca 100644
--- a/website/versioned_docs/version-0.11.1/write_operations.md
+++ b/website/versioned_docs/version-0.11.1/write_operations.md
@@ -51,7 +51,7 @@ The following is an inside look on the Hudi write path and the sequence of event
6. Update [Index](/docs/indexing)
1. Now that the write is performed, we will go back and update the index.
7. Commit
- 1. Finally we commit all of these changes atomically. (A [callback notification](/docs/writing_data#commit-notifications) is exposed)
+ 1. Finally we commit all of these changes atomically. (A [callback notification](writing_data#commit-notifications) is exposed)
8. [Clean](/docs/hoodie_cleaner) (if needed)
1. Following the commit, cleaning is invoked if needed.
9. [Compaction](/docs/compaction)
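The callback notification in step 7 can be wired up with a couple of write options. A minimal sketch follows; the HTTP endpoint is a placeholder and the remaining options are only what a runnable write needs.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("hudi-commit-callback-sketch").getOrCreate()
val df = spark.read.parquet("/tmp/input")   // assumed input data

df.write.format("hudi").
  option("hoodie.table.name", "my_table").                                              // assumed table name
  option("hoodie.datasource.write.recordkey.field", "uuid").                            // assumed record key
  option("hoodie.datasource.write.precombine.field", "ts").                             // assumed ordering field
  option("hoodie.write.commit.callback.on", "true").                                    // enable the post-commit callback
  option("hoodie.write.commit.callback.http.url", "http://callback.example.com/hudi").  // placeholder endpoint
  mode(SaveMode.Append).
  save("/tmp/hudi/my_table")                                                            // assumed base path
```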
diff --git a/website/versioned_docs/version-0.12.0/compaction.md b/website/versioned_docs/version-0.12.0/compaction.md
index 9d73e31bd5b0..7b84502c973d 100644
--- a/website/versioned_docs/version-0.12.0/compaction.md
+++ b/website/versioned_docs/version-0.12.0/compaction.md
@@ -95,7 +95,7 @@ is enabled by default.
:::
### Hudi Compactor Utility
-Hudi provides a standalone tool to execute specific compactions asynchronously. Below is an example and you can read more in the [deployment guide](/docs/deployment#compactions)
+Hudi provides a standalone tool to execute specific compactions asynchronously. Below is an example and you can read more in the [CLI guide](/docs/cli#compactions)
Example:
```properties
diff --git a/website/versioned_docs/version-0.12.0/deployment.md b/website/versioned_docs/version-0.12.0/deployment.md
index 8964d2f91356..1476b051a628 100644
--- a/website/versioned_docs/version-0.12.0/deployment.md
+++ b/website/versioned_docs/version-0.12.0/deployment.md
@@ -135,7 +135,7 @@ Here is an example invocation for reading from kafka topic in a continuous mode
### Spark Datasource Writer Jobs
-As described in [Writing Data](/docs/writing_data#spark-datasource-writer), you can use spark datasource to ingest to hudi table. This mechanism allows you to ingest any spark dataframe in Hudi format. Hudi Spark DataSource also supports spark streaming to ingest a streaming source to Hudi table. For Merge On Read table types, inline compaction is turned on by default which runs after every ingestion run. The compaction frequency can be changed by setting the property "hoodie.compact.inline.max.delta.commits".
+As described in [Writing Data](writing_data#spark-datasource-writer), you can use spark datasource to ingest to hudi table. This mechanism allows you to ingest any spark dataframe in Hudi format. Hudi Spark DataSource also supports spark streaming to ingest a streaming source to Hudi table. For Merge On Read table types, inline compaction is turned on by default which runs after every ingestion run. The compaction frequency can be changed by setting the property "hoodie.compact.inline.max.delta.commits".
Here is an example invocation using spark datasource
diff --git a/website/versioned_docs/version-0.12.0/faq.md b/website/versioned_docs/version-0.12.0/faq.md
index b43043d89cd0..cb3225571c2f 100644
--- a/website/versioned_docs/version-0.12.0/faq.md
+++ b/website/versioned_docs/version-0.12.0/faq.md
@@ -322,7 +322,7 @@ a) **Auto Size small files during ingestion**: This solution trades ingest/writi
Hudi has the ability to maintain a configured target file size, when performing **upsert/insert** operations. (Note: **bulk_insert** operation does not provide this functionality and is designed as a simpler replacement for normal `spark.write.parquet` )
-For **copy-on-write**, this is as simple as configuring the [maximum size for a base/parquet file](https://hudi.apache.org/docs/configurations#hoodieparquetmaxfilesize) and the [soft limit](https://hudi.apache.org/docs/configurations#hoodieparquetsmallfilelimit) below which a file should be considered a small file. For the initial bootstrap to Hudi table, tuning record size estimate is also important to ensure sufficient records are bin-packed in a parquet file. For subsequent writes, Hudi automatically uses average record size based on previous commit. Hudi will try to add enough records to a small file at write time to get it to the configured maximum limit. For e.g , with `compactionSmallFileSize=100MB` and limitFileSize=120MB, Hudi will pick all files < 100MB and try to get them upto 120MB.
+For **copy-on-write**, this is as simple as configuring the [maximum size for a base/parquet file](https://hudi.apache.org/docs/configurations#hoodieparquetmaxfilesize) and the [soft limit](https://hudi.apache.org/docs/configurations#hoodieparquetsmallfilelimit) below which a file should be considered a small file. For the initial bootstrap to Hudi table, tuning record size estimate is also important to ensure sufficient records are bin-packed in a parquet file. For subsequent writes, Hudi automatically uses average record size based on previous commit. Hudi will try to add enough records to a small file at write time to get it to the configured maximum limit. For e.g , with `hoodie.parquet.small.file.limit=100MB` and `hoodie.parquet.max.file.size=120MB`, Hudi will pick all files < 100MB and try to get them upto 120MB.
For **merge-on-read**, there are few more configs to set. MergeOnRead works differently for different INDEX choices.
diff --git a/website/versioned_docs/version-0.12.0/file_sizing.md b/website/versioned_docs/version-0.12.0/file_sizing.md
index 0bb0d9b003b1..1c1c12fe2071 100644
--- a/website/versioned_docs/version-0.12.0/file_sizing.md
+++ b/website/versioned_docs/version-0.12.0/file_sizing.md
@@ -21,7 +21,7 @@ and the [soft limit](/docs/configurations#hoodieparquetsmallfilelimit) below whi
be considered a small file. For the initial bootstrap of a Hudi table, tuning record size estimate is also important to
ensure sufficient records are bin-packed in a parquet file. For subsequent writes, Hudi automatically uses average
record size based on previous commit. Hudi will try to add enough records to a small file at write time to get it to the
-configured maximum limit. For e.g , with `compactionSmallFileSize=100MB` and limitFileSize=120MB, Hudi will pick all
+configured maximum limit. For e.g , with `hoodie.parquet.small.file.limit=100MB` and `hoodie.parquet.max.file.size=120MB`, Hudi will pick all
files < 100MB and try to get them upto 120MB.
### For Merge-On-Read
diff --git a/website/versioned_docs/version-0.12.0/flink-quick-start-guide.md b/website/versioned_docs/version-0.12.0/flink-quick-start-guide.md
index 4b7e642099f1..75bae79386e8 100644
--- a/website/versioned_docs/version-0.12.0/flink-quick-start-guide.md
+++ b/website/versioned_docs/version-0.12.0/flink-quick-start-guide.md
@@ -12,7 +12,7 @@ This guide helps you quickly start using Flink on Hudi, and learn different mode
- **Quick Start** : Read [Quick Start](#quick-start) to get started quickly Flink sql client to write to(read from) Hudi.
- **Configuration** : For [Global Configuration](/docs/0.12.0/flink_configuration#global-configurations), sets up through `$FLINK_HOME/conf/flink-conf.yaml`. For per job configuration, sets up through [Table Option](/docs/0.12.0/flink_configuration#table-options).
- **Writing Data** : Flink supports different modes for writing, such as [CDC Ingestion](/docs/0.12.0/hoodie_deltastreamer#cdc-ingestion), [Bulk Insert](/docs/0.12.0/hoodie_deltastreamer#bulk-insert), [Index Bootstrap](/docs/0.12.0/hoodie_deltastreamer#index-bootstrap), [Changelog Mode](/docs/0.12.0/hoodie_deltastreamer#changelog-mode) and [Append Mode](/docs/0.12.0/hoodie_deltastreamer#append-mode).
-- **Querying Data** : Flink supports different modes for reading, such as [Streaming Query](/docs/querying_data#streaming-query) and [Incremental Query](/docs/querying_data#incremental-query).
+- **Querying Data** : Flink supports different modes for reading, such as [Streaming Query](querying_data#streaming-query) and [Incremental Query](querying_data#incremental-query); a minimal streaming-read sketch follows this list.
- **Tuning** : For write/read tasks, this guide gives some tuning suggestions, such as [Memory Optimization](/docs/0.12.0/flink_configuration#memory-optimization) and [Write Rate Limit](/docs/0.12.0/flink_configuration#write-rate-limit).
- **Optimization**: Offline compaction is supported [Offline Compaction](/docs/compaction#flink-offline-compaction).
- **Query Engines**: Besides Flink, many other engines are integrated: [Hive Query](/docs/syncing_metastore#flink-setup), [Presto Query](/docs/0.12.0/query_engine_setup#prestodb).
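As referenced in the list above, here is a hedged Scala sketch of the streaming-query mode; the schema, table path and starting instant are assumptions, and the `WITH` options mirror the Flink Hudi connector's streaming-read settings.

```scala
import org.apache.flink.table.api.{EnvironmentSettings, TableEnvironment}

object HudiStreamingReadSketch {
  def main(args: Array[String]): Unit = {
    val settings = EnvironmentSettings.newInstance().inStreamingMode().build()
    val tEnv = TableEnvironment.create(settings)

    tEnv.executeSql(
      """CREATE TABLE hudi_events (
        |  uuid STRING,
        |  name STRING,
        |  ts TIMESTAMP(3),
        |  PRIMARY KEY (uuid) NOT ENFORCED
        |) WITH (
        |  'connector' = 'hudi',
        |  'path' = 'file:///tmp/hudi/hudi_events',   -- assumed table path
        |  'table.type' = 'MERGE_ON_READ',
        |  'read.streaming.enabled' = 'true',         -- streaming query: keep emitting new commits
        |  'read.start-commit' = 'earliest'           -- assumed starting instant
        |)""".stripMargin)

    // Blocks and prints rows as new commits land on the table.
    tEnv.executeSql("SELECT uuid, name, ts FROM hudi_events").print()
  }
}
```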
diff --git a/website/versioned_docs/version-0.12.0/querying_data.md b/website/versioned_docs/version-0.12.0/querying_data.md
index 074d9a2e7c43..70adabf40a63 100644
--- a/website/versioned_docs/version-0.12.0/querying_data.md
+++ b/website/versioned_docs/version-0.12.0/querying_data.md
@@ -17,7 +17,7 @@ In sections, below we will discuss specific setup to access different query type
The Spark Datasource API is a popular way of authoring Spark ETL pipelines. Hudi tables can be queried via the Spark datasource with a simple `spark.read.parquet`.
See the [Spark Quick Start](/docs/quick-start-guide) for more examples of Spark datasource reading queries.
-To setup Spark for querying Hudi, see the [Query Engine Setup](/docs/0.12.0/query_engine_setup#Spark-DataSource) page.
+To setup Spark for querying Hudi, see the [Query Engine Setup](/docs/0.12.0/query_engine_setup#spark) page.
### Snapshot query {#spark-snap-query}
Retrieve the data table at the present point in time.
diff --git a/website/versioned_docs/version-0.12.0/tuning-guide.md b/website/versioned_docs/version-0.12.0/tuning-guide.md
index 4affeafda663..12b68098e060 100644
--- a/website/versioned_docs/version-0.12.0/tuning-guide.md
+++ b/website/versioned_docs/version-0.12.0/tuning-guide.md
@@ -17,7 +17,7 @@ Writing data via Hudi happens as a Spark job and thus general rules of spark deb
**Spark Memory** : Typically, hudi needs to be able to read a single file into memory to perform merges or compactions and thus the executor memory should be sufficient to accomodate this. In addition, Hoodie caches the input to be able to intelligently place data and thus leaving some `spark.memory.storageFraction` will generally help boost performance.
-**Sizing files**: Set `limitFileSize` above judiciously, to balance ingest/write latency vs number of files & consequently metadata overhead associated with it.
+**Sizing files**: Set `hoodie.parquet.max.file.size` above judiciously, to balance ingest/write latency vs number of files & consequently metadata overhead associated with it.
**Timeseries/Log data** : Default configs are tuned for database/nosql changelogs where individual record sizes are large. Another very popular class of data is timeseries/event/log data that tends to be more volumnious with lot more records per partition. In such cases consider tuning the bloom filter accuracy via `.bloomFilterFPP()/bloomFilterNumEntries()` to achieve your target index look up time. Also, consider making a key that is prefixed with time of the event, which will enable range pruning & significantly speeding up index lookup.
diff --git a/website/versioned_docs/version-0.12.0/write_operations.md b/website/versioned_docs/version-0.12.0/write_operations.md
index baa6d7dbf848..9ff8431384ca 100644
--- a/website/versioned_docs/version-0.12.0/write_operations.md
+++ b/website/versioned_docs/version-0.12.0/write_operations.md
@@ -51,7 +51,7 @@ The following is an inside look on the Hudi write path and the sequence of event
6. Update [Index](/docs/indexing)
1. Now that the write is performed, we will go back and update the index.
7. Commit
- 1. Finally we commit all of these changes atomically. (A [callback notification](/docs/writing_data#commit-notifications) is exposed)
+ 1. Finally we commit all of these changes atomically. (A [callback notification](writing_data#commit-notifications) is exposed)
8. [Clean](/docs/hoodie_cleaner) (if needed)
1. Following the commit, cleaning is invoked if needed.
9. [Compaction](/docs/compaction)
diff --git a/website/versioned_docs/version-0.12.1/compaction.md b/website/versioned_docs/version-0.12.1/compaction.md
index 9d73e31bd5b0..7b84502c973d 100644
--- a/website/versioned_docs/version-0.12.1/compaction.md
+++ b/website/versioned_docs/version-0.12.1/compaction.md
@@ -95,7 +95,7 @@ is enabled by default.
:::
### Hudi Compactor Utility
-Hudi provides a standalone tool to execute specific compactions asynchronously. Below is an example and you can read more in the [deployment guide](/docs/deployment#compactions)
+Hudi provides a standalone tool to execute specific compactions asynchronously. Below is an example and you can read more in the [CLI guide](/docs/cli#compactions)
Example:
```properties
diff --git a/website/versioned_docs/version-0.12.1/deployment.md b/website/versioned_docs/version-0.12.1/deployment.md
index edd7bc69305e..4f01c1b39754 100644
--- a/website/versioned_docs/version-0.12.1/deployment.md
+++ b/website/versioned_docs/version-0.12.1/deployment.md
@@ -135,7 +135,7 @@ Here is an example invocation for reading from kafka topic in a continuous mode
### Spark Datasource Writer Jobs
-As described in [Writing Data](/docs/writing_data#spark-datasource-writer), you can use spark datasource to ingest to hudi table. This mechanism allows you to ingest any spark dataframe in Hudi format. Hudi Spark DataSource also supports spark streaming to ingest a streaming source to Hudi table. For Merge On Read table types, inline compaction is turned on by default which runs after every ingestion run. The compaction frequency can be changed by setting the property "hoodie.compact.inline.max.delta.commits".
+As described in [Writing Data](writing_data#spark-datasource-writer), you can use spark datasource to ingest to hudi table. This mechanism allows you to ingest any spark dataframe in Hudi format. Hudi Spark DataSource also supports spark streaming to ingest a streaming source to Hudi table. For Merge On Read table types, inline compaction is turned on by default which runs after every ingestion run. The compaction frequency can be changed by setting the property "hoodie.compact.inline.max.delta.commits".
Here is an example invocation using spark datasource
diff --git a/website/versioned_docs/version-0.12.1/faq.md b/website/versioned_docs/version-0.12.1/faq.md
index 41b76ec6c15d..9245d723aa21 100644
--- a/website/versioned_docs/version-0.12.1/faq.md
+++ b/website/versioned_docs/version-0.12.1/faq.md
@@ -322,7 +322,7 @@ a) **Auto Size small files during ingestion**: This solution trades ingest/writi
Hudi has the ability to maintain a configured target file size, when performing **upsert/insert** operations. (Note: **bulk_insert** operation does not provide this functionality and is designed as a simpler replacement for normal `spark.write.parquet` )
-For **copy-on-write**, this is as simple as configuring the [maximum size for a base/parquet file](https://hudi.apache.org/docs/configurations#hoodieparquetmaxfilesize) and the [soft limit](https://hudi.apache.org/docs/configurations#hoodieparquetsmallfilelimit) below which a file should be considered a small file. For the initial bootstrap to Hudi table, tuning record size estimate is also important to ensure sufficient records are bin-packed in a parquet file. For subsequent writes, Hudi automatically uses average record size based on previous commit. Hudi will try to add enough records to a small file at write time to get it to the configured maximum limit. For e.g , with `compactionSmallFileSize=100MB` and limitFileSize=120MB, Hudi will pick all files < 100MB and try to get them upto 120MB.
+For **copy-on-write**, this is as simple as configuring the [maximum size for a base/parquet file](https://hudi.apache.org/docs/configurations#hoodieparquetmaxfilesize) and the [soft limit](https://hudi.apache.org/docs/configurations#hoodieparquetsmallfilelimit) below which a file should be considered a small file. For the initial bootstrap to Hudi table, tuning record size estimate is also important to ensure sufficient records are bin-packed in a parquet file. For subsequent writes, Hudi automatically uses average record size based on previous commit. Hudi will try to add enough records to a small file at write time to get it to the configured maximum limit. For e.g , with `hoodie.parquet.small.file.limit=100MB` and `hoodie.parquet.max.file.size=120MB`, Hudi will pick all files < 100MB and try to get them upto 120MB.
For **merge-on-read**, there are few more configs to set. MergeOnRead works differently for different INDEX choices.
diff --git a/website/versioned_docs/version-0.12.1/file_sizing.md b/website/versioned_docs/version-0.12.1/file_sizing.md
index 0bb0d9b003b1..1c1c12fe2071 100644
--- a/website/versioned_docs/version-0.12.1/file_sizing.md
+++ b/website/versioned_docs/version-0.12.1/file_sizing.md
@@ -21,7 +21,7 @@ and the [soft limit](/docs/configurations#hoodieparquetsmallfilelimit) below whi
be considered a small file. For the initial bootstrap of a Hudi table, tuning record size estimate is also important to
ensure sufficient records are bin-packed in a parquet file. For subsequent writes, Hudi automatically uses average
record size based on previous commit. Hudi will try to add enough records to a small file at write time to get it to the
-configured maximum limit. For e.g , with `compactionSmallFileSize=100MB` and limitFileSize=120MB, Hudi will pick all
+configured maximum limit. For e.g , with `hoodie.parquet.small.file.limit=100MB` and `hoodie.parquet.max.file.size=120MB`, Hudi will pick all
files < 100MB and try to get them upto 120MB.
### For Merge-On-Read
diff --git a/website/versioned_docs/version-0.12.1/querying_data.md b/website/versioned_docs/version-0.12.1/querying_data.md
index 332368fcd33b..374502e96d2d 100644
--- a/website/versioned_docs/version-0.12.1/querying_data.md
+++ b/website/versioned_docs/version-0.12.1/querying_data.md
@@ -17,7 +17,7 @@ In sections, below we will discuss specific setup to access different query type
The Spark Datasource API is a popular way of authoring Spark ETL pipelines. Hudi tables can be queried via the Spark datasource with a simple `spark.read.parquet`.
See the [Spark Quick Start](/docs/quick-start-guide) for more examples of Spark datasource reading queries.
-To setup Spark for querying Hudi, see the [Query Engine Setup](/docs/0.12.1/query_engine_setup#Spark-DataSource) page.
+To setup Spark for querying Hudi, see the [Query Engine Setup](/docs/0.12.1/query_engine_setup#spark) page.
### Snapshot query {#spark-snap-query}
Retrieve the data table at the present point in time.
diff --git a/website/versioned_docs/version-0.12.1/tuning-guide.md b/website/versioned_docs/version-0.12.1/tuning-guide.md
index 4affeafda663..12b68098e060 100644
--- a/website/versioned_docs/version-0.12.1/tuning-guide.md
+++ b/website/versioned_docs/version-0.12.1/tuning-guide.md
@@ -17,7 +17,7 @@ Writing data via Hudi happens as a Spark job and thus general rules of spark deb
**Spark Memory** : Typically, hudi needs to be able to read a single file into memory to perform merges or compactions and thus the executor memory should be sufficient to accomodate this. In addition, Hoodie caches the input to be able to intelligently place data and thus leaving some `spark.memory.storageFraction` will generally help boost performance.
-**Sizing files**: Set `limitFileSize` above judiciously, to balance ingest/write latency vs number of files & consequently metadata overhead associated with it.
+**Sizing files**: Set `hoodie.parquet.max.file.size` above judiciously, to balance ingest/write latency vs number of files & consequently metadata overhead associated with it.
**Timeseries/Log data** : Default configs are tuned for database/nosql changelogs where individual record sizes are large. Another very popular class of data is timeseries/event/log data that tends to be more volumnious with lot more records per partition. In such cases consider tuning the bloom filter accuracy via `.bloomFilterFPP()/bloomFilterNumEntries()` to achieve your target index look up time. Also, consider making a key that is prefixed with time of the event, which will enable range pruning & significantly speeding up index lookup.
diff --git a/website/versioned_docs/version-0.12.1/write_operations.md b/website/versioned_docs/version-0.12.1/write_operations.md
index baa6d7dbf848..9ff8431384ca 100644
--- a/website/versioned_docs/version-0.12.1/write_operations.md
+++ b/website/versioned_docs/version-0.12.1/write_operations.md
@@ -51,7 +51,7 @@ The following is an inside look on the Hudi write path and the sequence of event
6. Update [Index](/docs/indexing)
1. Now that the write is performed, we will go back and update the index.
7. Commit
- 1. Finally we commit all of these changes atomically. (A [callback notification](/docs/writing_data#commit-notifications) is exposed)
+ 1. Finally we commit all of these changes atomically. (A [callback notification](writing_data#commit-notifications) is exposed)
8. [Clean](/docs/hoodie_cleaner) (if needed)
1. Following the commit, cleaning is invoked if needed.
9. [Compaction](/docs/compaction)
diff --git a/website/versioned_docs/version-0.12.2/compaction.md b/website/versioned_docs/version-0.12.2/compaction.md
index a6249b7ae7c4..e99cc2082c5f 100644
--- a/website/versioned_docs/version-0.12.2/compaction.md
+++ b/website/versioned_docs/version-0.12.2/compaction.md
@@ -95,7 +95,7 @@ is enabled by default.
:::
### Hudi Compactor Utility
-Hudi provides a standalone tool to execute specific compactions asynchronously. Below is an example and you can read more in the [deployment guide](/docs/deployment#compactions)
+Hudi provides a standalone tool to execute specific compactions asynchronously. Below is an example and you can read more in the [CLI guide](/docs/cli#compactions)
Example:
```properties
diff --git a/website/versioned_docs/version-0.12.2/deployment.md b/website/versioned_docs/version-0.12.2/deployment.md
index 18d9259f745e..57f1ed35cb46 100644
--- a/website/versioned_docs/version-0.12.2/deployment.md
+++ b/website/versioned_docs/version-0.12.2/deployment.md
@@ -135,7 +135,7 @@ Here is an example invocation for reading from kafka topic in a continuous mode
### Spark Datasource Writer Jobs
-As described in [Writing Data](/docs/writing_data#spark-datasource-writer), you can use spark datasource to ingest to hudi table. This mechanism allows you to ingest any spark dataframe in Hudi format. Hudi Spark DataSource also supports spark streaming to ingest a streaming source to Hudi table. For Merge On Read table types, inline compaction is turned on by default which runs after every ingestion run. The compaction frequency can be changed by setting the property "hoodie.compact.inline.max.delta.commits".
+As described in [Writing Data](writing_data#spark-datasource-writer), you can use spark datasource to ingest to hudi table. This mechanism allows you to ingest any spark dataframe in Hudi format. Hudi Spark DataSource also supports spark streaming to ingest a streaming source to Hudi table. For Merge On Read table types, inline compaction is turned on by default which runs after every ingestion run. The compaction frequency can be changed by setting the property "hoodie.compact.inline.max.delta.commits".
Here is an example invocation using spark datasource
diff --git a/website/versioned_docs/version-0.12.2/faq.md b/website/versioned_docs/version-0.12.2/faq.md
index 2752b49e3a79..0cf53d918d4c 100644
--- a/website/versioned_docs/version-0.12.2/faq.md
+++ b/website/versioned_docs/version-0.12.2/faq.md
@@ -342,7 +342,7 @@ a) **Auto Size small files during ingestion**: This solution trades ingest/writi
Hudi has the ability to maintain a configured target file size, when performing **upsert/insert** operations. (Note: **bulk_insert** operation does not provide this functionality and is designed as a simpler replacement for normal `spark.write.parquet` )
-For **copy-on-write**, this is as simple as configuring the [maximum size for a base/parquet file](https://hudi.apache.org/docs/configurations#hoodieparquetmaxfilesize) and the [soft limit](https://hudi.apache.org/docs/configurations#hoodieparquetsmallfilelimit) below which a file should be considered a small file. For the initial bootstrap to Hudi table, tuning record size estimate is also important to ensure sufficient records are bin-packed in a parquet file. For subsequent writes, Hudi automatically uses average record size based on previous commit. Hudi will try to add enough records to a small file at write time to get it to the configured maximum limit. For e.g , with `compactionSmallFileSize=100MB` and limitFileSize=120MB, Hudi will pick all files < 100MB and try to get them upto 120MB.
+For **copy-on-write**, this is as simple as configuring the [maximum size for a base/parquet file](https://hudi.apache.org/docs/configurations#hoodieparquetmaxfilesize) and the [soft limit](https://hudi.apache.org/docs/configurations#hoodieparquetsmallfilelimit) below which a file should be considered a small file. For the initial bootstrap to Hudi table, tuning record size estimate is also important to ensure sufficient records are bin-packed in a parquet file. For subsequent writes, Hudi automatically uses average record size based on previous commit. Hudi will try to add enough records to a small file at write time to get it to the configured maximum limit. For e.g , with `hoodie.parquet.small.file.limit=100MB` and `hoodie.parquet.max.file.size=120MB`, Hudi will pick all files < 100MB and try to get them upto 120MB.
For **merge-on-read**, there are few more configs to set. MergeOnRead works differently for different INDEX choices.
diff --git a/website/versioned_docs/version-0.12.2/file_sizing.md b/website/versioned_docs/version-0.12.2/file_sizing.md
index e7935445d9e6..58831e4b2995 100644
--- a/website/versioned_docs/version-0.12.2/file_sizing.md
+++ b/website/versioned_docs/version-0.12.2/file_sizing.md
@@ -21,7 +21,7 @@ and the [soft limit](/docs/configurations#hoodieparquetsmallfilelimit) below whi
be considered a small file. For the initial bootstrap of a Hudi table, tuning record size estimate is also important to
ensure sufficient records are bin-packed in a parquet file. For subsequent writes, Hudi automatically uses average
record size based on previous commit. Hudi will try to add enough records to a small file at write time to get it to the
-configured maximum limit. For e.g , with `compactionSmallFileSize=100MB` and limitFileSize=120MB, Hudi will pick all
+configured maximum limit. For e.g , with `hoodie.parquet.small.file.limit=100MB` and `hoodie.parquet.max.file.size=120MB`, Hudi will pick all
files < 100MB and try to get them upto 120MB.
### For Merge-On-Read
diff --git a/website/versioned_docs/version-0.12.2/flink-quick-start-guide.md b/website/versioned_docs/version-0.12.2/flink-quick-start-guide.md
index 41fb1dc503a4..3d0944ccd2b6 100644
--- a/website/versioned_docs/version-0.12.2/flink-quick-start-guide.md
+++ b/website/versioned_docs/version-0.12.2/flink-quick-start-guide.md
@@ -12,7 +12,7 @@ This guide helps you quickly start using Flink on Hudi, and learn different mode
- **Quick Start** : Read [Quick Start](#quick-start) to get started quickly Flink sql client to write to(read from) Hudi.
- **Configuration** : For [Global Configuration](/docs/0.12.2/flink_configuration#global-configurations), sets up through `$FLINK_HOME/conf/flink-conf.yaml`. For per job configuration, sets up through [Table Option](/docs/0.12.2/flink_configuration#table-options).
- **Writing Data** : Flink supports different modes for writing, such as [CDC Ingestion](/docs/0.12.2/hoodie_deltastreamer#cdc-ingestion), [Bulk Insert](/docs/0.12.2/hoodie_deltastreamer#bulk-insert), [Index Bootstrap](/docs/0.12.2/hoodie_deltastreamer#index-bootstrap), [Changelog Mode](/docs/0.12.2/hoodie_deltastreamer#changelog-mode) and [Append Mode](/docs/0.12.2/hoodie_deltastreamer#append-mode).
-- **Querying Data** : Flink supports different modes for reading, such as [Streaming Query](/docs/querying_data#streaming-query) and [Incremental Query](/docs/querying_data#incremental-query).
+- **Querying Data** : Flink supports different modes for reading, such as [Streaming Query](querying_data#streaming-query) and [Incremental Query](querying_data#incremental-query).
- **Tuning** : For write/read tasks, this guide gives some tuning suggestions, such as [Memory Optimization](/docs/0.12.2/flink_configuration#memory-optimization) and [Write Rate Limit](/docs/0.12.2/flink_configuration#write-rate-limit).
- **Optimization**: Offline compaction is supported [Offline Compaction](/docs/compaction#flink-offline-compaction).
- **Query Engines**: Besides Flink, many other engines are integrated: [Hive Query](/docs/syncing_metastore#flink-setup), [Presto Query](/docs/0.12.2/query_engine_setup#prestodb).
diff --git a/website/versioned_docs/version-0.12.2/querying_data.md b/website/versioned_docs/version-0.12.2/querying_data.md
index fff64bc0bad2..23d1835010a6 100644
--- a/website/versioned_docs/version-0.12.2/querying_data.md
+++ b/website/versioned_docs/version-0.12.2/querying_data.md
@@ -17,7 +17,7 @@ In sections, below we will discuss specific setup to access different query type
The Spark Datasource API is a popular way of authoring Spark ETL pipelines. Hudi tables can be queried via the Spark datasource with a simple `spark.read.parquet`.
See the [Spark Quick Start](/docs/quick-start-guide) for more examples of Spark datasource reading queries.
-To setup Spark for querying Hudi, see the [Query Engine Setup](/docs/0.12.2/query_engine_setup#Spark-DataSource) page.
+To setup Spark for querying Hudi, see the [Query Engine Setup](/docs/0.12.2/query_engine_setup#spark) page.
### Snapshot query {#spark-snap-query}
Retrieve the data table at the present point in time.
diff --git a/website/versioned_docs/version-0.12.2/quick-start-guide.md b/website/versioned_docs/version-0.12.2/quick-start-guide.md
index 0143f3b9f896..aa108a75a50d 100644
--- a/website/versioned_docs/version-0.12.2/quick-start-guide.md
+++ b/website/versioned_docs/version-0.12.2/quick-start-guide.md
@@ -1099,7 +1099,7 @@ For CoW tables, table services work in inline mode by default.
For MoR tables, some async services are enabled by default.
:::note
-Since Hudi 0.11 Metadata Table is enabled by default. When using async table services with Metadata Table enabled you must use Optimistic Concurrency Control to avoid the risk of data loss (even in single writer scenario). See [Metadata Table deployment considerations](/docs/metadata#deployment-considerations) for detailed instructions.
+Since Hudi 0.11 Metadata Table is enabled by default. When using async table services with Metadata Table enabled you must use Optimistic Concurrency Control to avoid the risk of data loss (even in single writer scenario). See [Metadata Table deployment considerations](metadata#deployment-considerations) for detailed instructions.
If you're using Foreach or ForeachBatch streaming sink you must use inline table services, async table services are not supported.
:::
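As a minimal sketch of the lock-provider settings the note above calls for (ZooKeeper-based here; the endpoint, port and paths are placeholders, and any supported lock provider can be substituted):

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("hudi-occ-sketch").getOrCreate()
val df = spark.read.parquet("/tmp/input")   // assumed input data

df.write.format("hudi").
  option("hoodie.table.name", "my_table").                                   // assumed table name
  option("hoodie.datasource.write.recordkey.field", "uuid").                 // assumed record key
  option("hoodie.datasource.write.precombine.field", "ts").                  // assumed ordering field
  option("hoodie.write.concurrency.mode", "optimistic_concurrency_control").
  option("hoodie.cleaner.policy.failed.writes", "LAZY").
  option("hoodie.write.lock.provider", "org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider").
  option("hoodie.write.lock.zookeeper.url", "zk1.example.com").              // placeholder ZooKeeper host
  option("hoodie.write.lock.zookeeper.port", "2181").
  option("hoodie.write.lock.zookeeper.lock_key", "my_table").
  option("hoodie.write.lock.zookeeper.base_path", "/hudi/locks").            // placeholder znode path
  mode(SaveMode.Append).
  save("/tmp/hudi/my_table")                                                 // assumed base path
```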
diff --git a/website/versioned_docs/version-0.12.2/tuning-guide.md b/website/versioned_docs/version-0.12.2/tuning-guide.md
index 4affeafda663..12b68098e060 100644
--- a/website/versioned_docs/version-0.12.2/tuning-guide.md
+++ b/website/versioned_docs/version-0.12.2/tuning-guide.md
@@ -17,7 +17,7 @@ Writing data via Hudi happens as a Spark job and thus general rules of spark deb
**Spark Memory** : Typically, hudi needs to be able to read a single file into memory to perform merges or compactions and thus the executor memory should be sufficient to accomodate this. In addition, Hoodie caches the input to be able to intelligently place data and thus leaving some `spark.memory.storageFraction` will generally help boost performance.
-**Sizing files**: Set `limitFileSize` above judiciously, to balance ingest/write latency vs number of files & consequently metadata overhead associated with it.
+**Sizing files**: Set `hoodie.parquet.max.file.size` above judiciously, to balance ingest/write latency vs number of files & consequently metadata overhead associated with it.
**Timeseries/Log data** : Default configs are tuned for database/nosql changelogs where individual record sizes are large. Another very popular class of data is timeseries/event/log data that tends to be more volumnious with lot more records per partition. In such cases consider tuning the bloom filter accuracy via `.bloomFilterFPP()/bloomFilterNumEntries()` to achieve your target index look up time. Also, consider making a key that is prefixed with time of the event, which will enable range pruning & significantly speeding up index lookup.
diff --git a/website/versioned_docs/version-0.12.2/write_operations.md b/website/versioned_docs/version-0.12.2/write_operations.md
index baa6d7dbf848..9ff8431384ca 100644
--- a/website/versioned_docs/version-0.12.2/write_operations.md
+++ b/website/versioned_docs/version-0.12.2/write_operations.md
@@ -51,7 +51,7 @@ The following is an inside look on the Hudi write path and the sequence of event
6. Update [Index](/docs/indexing)
1. Now that the write is performed, we will go back and update the index.
7. Commit
- 1. Finally we commit all of these changes atomically. (A [callback notification](/docs/writing_data#commit-notifications) is exposed)
+ 1. Finally we commit all of these changes atomically. (A [callback notification](writing_data#commit-notifications) is exposed)
8. [Clean](/docs/hoodie_cleaner) (if needed)
1. Following the commit, cleaning is invoked if needed.
9. [Compaction](/docs/compaction)
diff --git a/website/versioned_docs/version-0.12.3/compaction.md b/website/versioned_docs/version-0.12.3/compaction.md
index a6249b7ae7c4..e99cc2082c5f 100644
--- a/website/versioned_docs/version-0.12.3/compaction.md
+++ b/website/versioned_docs/version-0.12.3/compaction.md
@@ -95,7 +95,7 @@ is enabled by default.
:::
### Hudi Compactor Utility
-Hudi provides a standalone tool to execute specific compactions asynchronously. Below is an example and you can read more in the [deployment guide](/docs/deployment#compactions)
+Hudi provides a standalone tool to execute specific compactions asynchronously. Below is an example and you can read more in the [CLI guide](/docs/cli#compactions)
Example:
```properties
diff --git a/website/versioned_docs/version-0.12.3/deployment.md b/website/versioned_docs/version-0.12.3/deployment.md
index 998dafa23ed6..cd51d9c9cb5c 100644
--- a/website/versioned_docs/version-0.12.3/deployment.md
+++ b/website/versioned_docs/version-0.12.3/deployment.md
@@ -135,7 +135,7 @@ Here is an example invocation for reading from kafka topic in a continuous mode
### Spark Datasource Writer Jobs
-As described in [Writing Data](/docs/writing_data#spark-datasource-writer), you can use spark datasource to ingest to hudi table. This mechanism allows you to ingest any spark dataframe in Hudi format. Hudi Spark DataSource also supports spark streaming to ingest a streaming source to Hudi table. For Merge On Read table types, inline compaction is turned on by default which runs after every ingestion run. The compaction frequency can be changed by setting the property "hoodie.compact.inline.max.delta.commits".
+As described in [Writing Data](writing_data#spark-datasource-writer), you can use spark datasource to ingest to hudi table. This mechanism allows you to ingest any spark dataframe in Hudi format. Hudi Spark DataSource also supports spark streaming to ingest a streaming source to Hudi table. For Merge On Read table types, inline compaction is turned on by default which runs after every ingestion run. The compaction frequency can be changed by setting the property "hoodie.compact.inline.max.delta.commits".
Here is an example invocation using spark datasource
diff --git a/website/versioned_docs/version-0.12.3/faq.md b/website/versioned_docs/version-0.12.3/faq.md
index 05b60c270c79..5d5aafa0ed15 100644
--- a/website/versioned_docs/version-0.12.3/faq.md
+++ b/website/versioned_docs/version-0.12.3/faq.md
@@ -342,7 +342,7 @@ a) **Auto Size small files during ingestion**: This solution trades ingest/writi
Hudi has the ability to maintain a configured target file size, when performing **upsert/insert** operations. (Note: **bulk_insert** operation does not provide this functionality and is designed as a simpler replacement for normal `spark.write.parquet` )
-For **copy-on-write**, this is as simple as configuring the [maximum size for a base/parquet file](https://hudi.apache.org/docs/configurations#hoodieparquetmaxfilesize) and the [soft limit](https://hudi.apache.org/docs/configurations#hoodieparquetsmallfilelimit) below which a file should be considered a small file. For the initial bootstrap to Hudi table, tuning record size estimate is also important to ensure sufficient records are bin-packed in a parquet file. For subsequent writes, Hudi automatically uses average record size based on previous commit. Hudi will try to add enough records to a small file at write time to get it to the configured maximum limit. For e.g , with `compactionSmallFileSize=100MB` and limitFileSize=120MB, Hudi will pick all files < 100MB and try to get them upto 120MB.
+For **copy-on-write**, this is as simple as configuring the [maximum size for a base/parquet file](https://hudi.apache.org/docs/configurations#hoodieparquetmaxfilesize) and the [soft limit](https://hudi.apache.org/docs/configurations#hoodieparquetsmallfilelimit) below which a file should be considered a small file. For the initial bootstrap to Hudi table, tuning record size estimate is also important to ensure sufficient records are bin-packed in a parquet file. For subsequent writes, Hudi automatically uses average record size based on previous commit. Hudi will try to add enough records to a small file at write time to get it to the configured maximum limit. For e.g , with `hoodie.parquet.small.file.limit=100MB` and `hoodie.parquet.max.file.size=120MB`, Hudi will pick all files < 100MB and try to get them upto 120MB.
For **merge-on-read**, there are few more configs to set. MergeOnRead works differently for different INDEX choices.
diff --git a/website/versioned_docs/version-0.12.3/file_sizing.md b/website/versioned_docs/version-0.12.3/file_sizing.md
index e7935445d9e6..58831e4b2995 100644
--- a/website/versioned_docs/version-0.12.3/file_sizing.md
+++ b/website/versioned_docs/version-0.12.3/file_sizing.md
@@ -21,7 +21,7 @@ and the [soft limit](/docs/configurations#hoodieparquetsmallfilelimit) below whi
be considered a small file. For the initial bootstrap of a Hudi table, tuning record size estimate is also important to
ensure sufficient records are bin-packed in a parquet file. For subsequent writes, Hudi automatically uses average
record size based on previous commit. Hudi will try to add enough records to a small file at write time to get it to the
-configured maximum limit. For e.g , with `compactionSmallFileSize=100MB` and limitFileSize=120MB, Hudi will pick all
+configured maximum limit. For e.g , with `hoodie.parquet.small.file.limit=100MB` and `hoodie.parquet.max.file.size=120MB`, Hudi will pick all
files < 100MB and try to get them upto 120MB.
### For Merge-On-Read
diff --git a/website/versioned_docs/version-0.12.3/flink-quick-start-guide.md b/website/versioned_docs/version-0.12.3/flink-quick-start-guide.md
index afffd7f244e5..179518145226 100644
--- a/website/versioned_docs/version-0.12.3/flink-quick-start-guide.md
+++ b/website/versioned_docs/version-0.12.3/flink-quick-start-guide.md
@@ -12,7 +12,7 @@ This guide helps you quickly start using Flink on Hudi, and learn different mode
- **Quick Start** : Read [Quick Start](#quick-start) to get started quickly Flink sql client to write to(read from) Hudi.
- **Configuration** : For [Global Configuration](/docs/0.12.3/flink_configuration#global-configurations), sets up through `$FLINK_HOME/conf/flink-conf.yaml`. For per job configuration, sets up through [Table Option](/docs/0.12.3/flink_configuration#table-options).
- **Writing Data** : Flink supports different modes for writing, such as [CDC Ingestion](/docs/0.12.3/hoodie_deltastreamer#cdc-ingestion), [Bulk Insert](/docs/0.12.3/hoodie_deltastreamer#bulk-insert), [Index Bootstrap](/docs/0.12.3/hoodie_deltastreamer#index-bootstrap), [Changelog Mode](/docs/0.12.3/hoodie_deltastreamer#changelog-mode) and [Append Mode](/docs/0.12.3/hoodie_deltastreamer#append-mode).
-- **Querying Data** : Flink supports different modes for reading, such as [Streaming Query](/docs/querying_data#streaming-query) and [Incremental Query](/docs/querying_data#incremental-query).
+- **Querying Data** : Flink supports different modes for reading, such as [Streaming Query](querying_data#streaming-query) and [Incremental Query](querying_data#incremental-query).
- **Tuning** : For write/read tasks, this guide gives some tuning suggestions, such as [Memory Optimization](/docs/0.12.3/flink_configuration#memory-optimization) and [Write Rate Limit](/docs/0.12.3/flink_configuration#write-rate-limit).
- **Optimization**: Offline compaction is supported [Offline Compaction](/docs/compaction#flink-offline-compaction).
- **Query Engines**: Besides Flink, many other engines are integrated: [Hive Query](/docs/syncing_metastore#flink-setup), [Presto Query](/docs/0.12.3/query_engine_setup#prestodb).
diff --git a/website/versioned_docs/version-0.12.3/querying_data.md b/website/versioned_docs/version-0.12.3/querying_data.md
index ddd7b5ced131..470fc4f5df9d 100644
--- a/website/versioned_docs/version-0.12.3/querying_data.md
+++ b/website/versioned_docs/version-0.12.3/querying_data.md
@@ -17,7 +17,7 @@ In sections, below we will discuss specific setup to access different query type
The Spark Datasource API is a popular way of authoring Spark ETL pipelines. Hudi tables can be queried via the Spark datasource with a simple `spark.read.parquet`.
See the [Spark Quick Start](/docs/quick-start-guide) for more examples of Spark datasource reading queries.
-To setup Spark for querying Hudi, see the [Query Engine Setup](/docs/0.12.3/query_engine_setup#Spark-DataSource) page.
+To setup Spark for querying Hudi, see the [Query Engine Setup](/docs/0.12.3/query_engine_setup#spark) page.
### Snapshot query {#spark-snap-query}
Retrieve the data table at the present point in time.
diff --git a/website/versioned_docs/version-0.12.3/quick-start-guide.md b/website/versioned_docs/version-0.12.3/quick-start-guide.md
index 3a990aa74431..67418ddd7ced 100644
--- a/website/versioned_docs/version-0.12.3/quick-start-guide.md
+++ b/website/versioned_docs/version-0.12.3/quick-start-guide.md
@@ -1099,7 +1099,7 @@ For CoW tables, table services work in inline mode by default.
For MoR tables, some async services are enabled by default.
:::note
-Since Hudi 0.11 Metadata Table is enabled by default. When using async table services with Metadata Table enabled you must use Optimistic Concurrency Control to avoid the risk of data loss (even in single writer scenario). See [Metadata Table deployment considerations](/docs/metadata#deployment-considerations) for detailed instructions.
+Since Hudi 0.11 Metadata Table is enabled by default. When using async table services with Metadata Table enabled you must use Optimistic Concurrency Control to avoid the risk of data loss (even in single writer scenario). See [Metadata Table deployment considerations](metadata#deployment-considerations) for detailed instructions.
If you're using Foreach or ForeachBatch streaming sink you must use inline table services, async table services are not supported.
:::
diff --git a/website/versioned_docs/version-0.12.3/tuning-guide.md b/website/versioned_docs/version-0.12.3/tuning-guide.md
index 4affeafda663..12b68098e060 100644
--- a/website/versioned_docs/version-0.12.3/tuning-guide.md
+++ b/website/versioned_docs/version-0.12.3/tuning-guide.md
@@ -17,7 +17,7 @@ Writing data via Hudi happens as a Spark job and thus general rules of spark deb
**Spark Memory** : Typically, hudi needs to be able to read a single file into memory to perform merges or compactions and thus the executor memory should be sufficient to accomodate this. In addition, Hoodie caches the input to be able to intelligently place data and thus leaving some `spark.memory.storageFraction` will generally help boost performance.
-**Sizing files**: Set `limitFileSize` above judiciously, to balance ingest/write latency vs number of files & consequently metadata overhead associated with it.
+**Sizing files**: Set `hoodie.parquet.max.file.size` above judiciously, to balance ingest/write latency vs number of files & consequently metadata overhead associated with it.
**Timeseries/Log data** : Default configs are tuned for database/nosql changelogs where individual record sizes are large. Another very popular class of data is timeseries/event/log data that tends to be more volumnious with lot more records per partition. In such cases consider tuning the bloom filter accuracy via `.bloomFilterFPP()/bloomFilterNumEntries()` to achieve your target index look up time. Also, consider making a key that is prefixed with time of the event, which will enable range pruning & significantly speeding up index lookup.
diff --git a/website/versioned_docs/version-0.12.3/write_operations.md b/website/versioned_docs/version-0.12.3/write_operations.md
index baa6d7dbf848..9ff8431384ca 100644
--- a/website/versioned_docs/version-0.12.3/write_operations.md
+++ b/website/versioned_docs/version-0.12.3/write_operations.md
@@ -51,7 +51,7 @@ The following is an inside look on the Hudi write path and the sequence of event
6. Update [Index](/docs/indexing)
1. Now that the write is performed, we will go back and update the index.
7. Commit
- 1. Finally we commit all of these changes atomically. (A [callback notification](/docs/writing_data#commit-notifications) is exposed)
+ 1. Finally we commit all of these changes atomically. (A [callback notification](writing_data#commit-notifications) is exposed)
8. [Clean](/docs/hoodie_cleaner) (if needed)
1. Following the commit, cleaning is invoked if needed.
9. [Compaction](/docs/compaction)
diff --git a/website/versioned_docs/version-0.13.0/compaction.md b/website/versioned_docs/version-0.13.0/compaction.md
index a6249b7ae7c4..e99cc2082c5f 100644
--- a/website/versioned_docs/version-0.13.0/compaction.md
+++ b/website/versioned_docs/version-0.13.0/compaction.md
@@ -95,7 +95,7 @@ is enabled by default.
:::
### Hudi Compactor Utility
-Hudi provides a standalone tool to execute specific compactions asynchronously. Below is an example and you can read more in the [deployment guide](/docs/deployment#compactions)
+Hudi provides a standalone tool to execute specific compactions asynchronously. Below is an example and you can read more in the [CLI guide](/docs/cli#compactions)
Example:
```properties
diff --git a/website/versioned_docs/version-0.13.0/deployment.md b/website/versioned_docs/version-0.13.0/deployment.md
index 8ccc654b4f2a..2837cc92d43d 100644
--- a/website/versioned_docs/version-0.13.0/deployment.md
+++ b/website/versioned_docs/version-0.13.0/deployment.md
@@ -135,7 +135,7 @@ Here is an example invocation for reading from kafka topic in a continuous mode
### Spark Datasource Writer Jobs
-As described in [Writing Data](/docs/writing_data#spark-datasource-writer), you can use spark datasource to ingest to hudi table. This mechanism allows you to ingest any spark dataframe in Hudi format. Hudi Spark DataSource also supports spark streaming to ingest a streaming source to Hudi table. For Merge On Read table types, inline compaction is turned on by default which runs after every ingestion run. The compaction frequency can be changed by setting the property "hoodie.compact.inline.max.delta.commits".
+As described in [Writing Data](writing_data#spark-datasource-writer), you can use spark datasource to ingest to hudi table. This mechanism allows you to ingest any spark dataframe in Hudi format. Hudi Spark DataSource also supports spark streaming to ingest a streaming source to Hudi table. For Merge On Read table types, inline compaction is turned on by default which runs after every ingestion run. The compaction frequency can be changed by setting the property "hoodie.compact.inline.max.delta.commits".
Here is an example invocation using spark datasource
diff --git a/website/versioned_docs/version-0.13.0/faq.md b/website/versioned_docs/version-0.13.0/faq.md
index b0011e893621..6daa604c7a84 100644
--- a/website/versioned_docs/version-0.13.0/faq.md
+++ b/website/versioned_docs/version-0.13.0/faq.md
@@ -342,7 +342,7 @@ a) **Auto Size small files during ingestion**: This solution trades ingest/writi
Hudi has the ability to maintain a configured target file size, when performing **upsert/insert** operations. (Note: **bulk_insert** operation does not provide this functionality and is designed as a simpler replacement for normal `spark.write.parquet` )
-For **copy-on-write**, this is as simple as configuring the [maximum size for a base/parquet file](https://hudi.apache.org/docs/configurations#hoodieparquetmaxfilesize) and the [soft limit](https://hudi.apache.org/docs/configurations#hoodieparquetsmallfilelimit) below which a file should be considered a small file. For the initial bootstrap to Hudi table, tuning record size estimate is also important to ensure sufficient records are bin-packed in a parquet file. For subsequent writes, Hudi automatically uses average record size based on previous commit. Hudi will try to add enough records to a small file at write time to get it to the configured maximum limit. For e.g , with `compactionSmallFileSize=100MB` and limitFileSize=120MB, Hudi will pick all files < 100MB and try to get them upto 120MB.
+For **copy-on-write**, this is as simple as configuring the [maximum size for a base/parquet file](https://hudi.apache.org/docs/configurations#hoodieparquetmaxfilesize) and the [soft limit](https://hudi.apache.org/docs/configurations#hoodieparquetsmallfilelimit) below which a file should be considered a small file. For the initial bootstrap to Hudi table, tuning record size estimate is also important to ensure sufficient records are bin-packed in a parquet file. For subsequent writes, Hudi automatically uses average record size based on previous commit. Hudi will try to add enough records to a small file at write time to get it to the configured maximum limit. For e.g , with `hoodie.parquet.small.file.limit=100MB` and `hoodie.parquet.max.file.size=120MB`, Hudi will pick all files < 100MB and try to get them upto 120MB.
For **merge-on-read**, there are few more configs to set. MergeOnRead works differently for different INDEX choices.
diff --git a/website/versioned_docs/version-0.13.0/file_sizing.md b/website/versioned_docs/version-0.13.0/file_sizing.md
index e7935445d9e6..58831e4b2995 100644
--- a/website/versioned_docs/version-0.13.0/file_sizing.md
+++ b/website/versioned_docs/version-0.13.0/file_sizing.md
@@ -21,7 +21,7 @@ and the [soft limit](/docs/configurations#hoodieparquetsmallfilelimit) below whi
be considered a small file. For the initial bootstrap of a Hudi table, tuning record size estimate is also important to
ensure sufficient records are bin-packed in a parquet file. For subsequent writes, Hudi automatically uses average
record size based on previous commit. Hudi will try to add enough records to a small file at write time to get it to the
-configured maximum limit. For e.g , with `compactionSmallFileSize=100MB` and limitFileSize=120MB, Hudi will pick all
+configured maximum limit. For e.g , with `hoodie.parquet.small.file.limit=100MB` and `hoodie.parquet.max.file.size=120MB`, Hudi will pick all
files < 100MB and try to get them upto 120MB.
### For Merge-On-Read
diff --git a/website/versioned_docs/version-0.13.0/flink-quick-start-guide.md b/website/versioned_docs/version-0.13.0/flink-quick-start-guide.md
index f9f91a4c1e4d..8cae9919bc06 100644
--- a/website/versioned_docs/version-0.13.0/flink-quick-start-guide.md
+++ b/website/versioned_docs/version-0.13.0/flink-quick-start-guide.md
@@ -12,7 +12,7 @@ This guide helps you quickly start using Flink on Hudi, and learn different mode
- **Quick Start** : Read [Quick Start](#quick-start) to get started quickly Flink sql client to write to(read from) Hudi.
- **Configuration** : For [Global Configuration](/docs/0.13.0/flink_configuration#global-configurations), sets up through `$FLINK_HOME/conf/flink-conf.yaml`. For per job configuration, sets up through [Table Option](/docs/0.13.0/flink_configuration#table-options).
- **Writing Data** : Flink supports different modes for writing, such as [CDC Ingestion](/docs/0.13.0/hoodie_deltastreamer#cdc-ingestion), [Bulk Insert](/docs/0.13.0/hoodie_deltastreamer#bulk-insert), [Index Bootstrap](/docs/0.13.0/hoodie_deltastreamer#index-bootstrap), [Changelog Mode](/docs/0.13.0/hoodie_deltastreamer#changelog-mode) and [Append Mode](/docs/0.13.0/hoodie_deltastreamer#append-mode).
-- **Querying Data** : Flink supports different modes for reading, such as [Streaming Query](/docs/querying_data#streaming-query) and [Incremental Query](/docs/querying_data#incremental-query).
+- **Querying Data** : Flink supports different modes for reading, such as [Streaming Query](querying_data#streaming-query) and [Incremental Query](/docs/querying_data#incremental-query).
- **Tuning** : For write/read tasks, this guide gives some tuning suggestions, such as [Memory Optimization](/docs/0.13.0/flink_configuration#memory-optimization) and [Write Rate Limit](/docs/0.13.0/flink_configuration#write-rate-limit).
- **Optimization**: Offline compaction is supported [Offline Compaction](/docs/compaction#flink-offline-compaction).
- **Query Engines**: Besides Flink, many other engines are integrated: [Hive Query](/docs/syncing_metastore#flink-setup), [Presto Query](/docs/0.13.0/query_engine_setup#prestodb).
diff --git a/website/versioned_docs/version-0.13.0/querying_data.md b/website/versioned_docs/version-0.13.0/querying_data.md
index b62b5e9f63e3..d95f1b4f71f5 100644
--- a/website/versioned_docs/version-0.13.0/querying_data.md
+++ b/website/versioned_docs/version-0.13.0/querying_data.md
@@ -17,7 +17,7 @@ In sections, below we will discuss specific setup to access different query type
The Spark Datasource API is a popular way of authoring Spark ETL pipelines. Hudi tables can be queried via the Spark datasource with a simple `spark.read.parquet`.
See the [Spark Quick Start](/docs/quick-start-guide) for more examples of Spark datasource reading queries.
-To setup Spark for querying Hudi, see the [Query Engine Setup](/docs/0.13.0/query_engine_setup#Spark-DataSource) page.
+To setup Spark for querying Hudi, see the [Query Engine Setup](/docs/0.13.0/query_engine_setup#spark) page.
### Snapshot query {#spark-snap-query}
Retrieve the data table at the present point in time.
diff --git a/website/versioned_docs/version-0.13.0/quick-start-guide.md b/website/versioned_docs/version-0.13.0/quick-start-guide.md
index d4b55283e3f2..839747cfbaa5 100644
--- a/website/versioned_docs/version-0.13.0/quick-start-guide.md
+++ b/website/versioned_docs/version-0.13.0/quick-start-guide.md
@@ -1103,7 +1103,7 @@ For CoW tables, table services work in inline mode by default.
For MoR tables, some async services are enabled by default.
:::note
-Since Hudi 0.11 Metadata Table is enabled by default. When using async table services with Metadata Table enabled you must use Optimistic Concurrency Control to avoid the risk of data loss (even in single writer scenario). See [Metadata Table deployment considerations](/docs/metadata#deployment-considerations) for detailed instructions.
+Since Hudi 0.11, the Metadata Table is enabled by default. When using async table services with the Metadata Table enabled, you must use Optimistic Concurrency Control to avoid the risk of data loss (even in a single-writer scenario). See [Metadata Table deployment considerations](metadata#deployment-considerations) for detailed instructions.
If you're using Foreach or ForeachBatch streaming sink you must use inline table services, async table services are not supported.
:::
diff --git a/website/versioned_docs/version-0.13.0/tuning-guide.md b/website/versioned_docs/version-0.13.0/tuning-guide.md
index 4affeafda663..12b68098e060 100644
--- a/website/versioned_docs/version-0.13.0/tuning-guide.md
+++ b/website/versioned_docs/version-0.13.0/tuning-guide.md
@@ -17,7 +17,7 @@ Writing data via Hudi happens as a Spark job and thus general rules of spark deb
**Spark Memory** : Typically, hudi needs to be able to read a single file into memory to perform merges or compactions and thus the executor memory should be sufficient to accomodate this. In addition, Hoodie caches the input to be able to intelligently place data and thus leaving some `spark.memory.storageFraction` will generally help boost performance.
-**Sizing files**: Set `limitFileSize` above judiciously, to balance ingest/write latency vs number of files & consequently metadata overhead associated with it.
+**Sizing files**: Set `hoodie.parquet.max.file.size` above judiciously, to balance ingest/write latency vs the number of files and, consequently, the metadata overhead associated with it.
**Timeseries/Log data** : Default configs are tuned for database/nosql changelogs where individual record sizes are large. Another very popular class of data is timeseries/event/log data that tends to be more volumnious with lot more records per partition. In such cases consider tuning the bloom filter accuracy via `.bloomFilterFPP()/bloomFilterNumEntries()` to achieve your target index look up time. Also, consider making a key that is prefixed with time of the event, which will enable range pruning & significantly speeding up index lookup.
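
As a rough illustration of the bloom-filter tuning mentioned above, the same knobs are exposed as write configs (`hoodie.index.bloom.num_entries` / `hoodie.index.bloom.fpp` back the `bloomFilterNumEntries()`/`bloomFilterFPP()` builder methods). This assumes a spark-shell style session where an input DataFrame `df` already exists; the table, columns, and values below are placeholders, not recommendations.

```scala
// Illustrative sketch only: tune bloom index accuracy for high-volume event/log data,
// where each partition carries many more (smaller) records than a typical changelog.
// Assumes `df`, an input DataFrame of events, is already defined.
df.write.format("hudi").
  option("hoodie.table.name", "events").                         // hypothetical table name
  option("hoodie.datasource.write.recordkey.field", "event_id"). // hypothetical, time-prefixed key
  option("hoodie.datasource.write.precombine.field", "ts").
  option("hoodie.index.type", "BLOOM").
  option("hoodie.index.bloom.num_entries", "100000").            // placeholder sizing
  option("hoodie.index.bloom.fpp", "0.000000001").               // placeholder false-positive rate
  mode("append").
  save("/tmp/hudi/events")                                       // hypothetical base path
```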
diff --git a/website/versioned_docs/version-0.13.0/write_operations.md b/website/versioned_docs/version-0.13.0/write_operations.md
index baa6d7dbf848..9ff8431384ca 100644
--- a/website/versioned_docs/version-0.13.0/write_operations.md
+++ b/website/versioned_docs/version-0.13.0/write_operations.md
@@ -51,7 +51,7 @@ The following is an inside look on the Hudi write path and the sequence of event
6. Update [Index](/docs/indexing)
1. Now that the write is performed, we will go back and update the index.
7. Commit
- 1. Finally we commit all of these changes atomically. (A [callback notification](/docs/writing_data#commit-notifications) is exposed)
+ 1. Finally we commit all of these changes atomically. (A [callback notification](writing_data#commit-notifications) is exposed)
8. [Clean](/docs/hoodie_cleaner) (if needed)
1. Following the commit, cleaning is invoked if needed.
9. [Compaction](/docs/compaction)
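
The callback notification mentioned in step 7 is driven by write configs; below is a hedged, illustrative sketch that assumes an existing DataFrame `df`, with the endpoint URL, table name, and path all being placeholders.

```scala
// Illustrative sketch only: enable an HTTP commit callback so a downstream system is
// notified after each successful commit. Endpoint URL and table details are placeholders.
df.write.format("hudi").
  option("hoodie.table.name", "trips").
  option("hoodie.datasource.write.recordkey.field", "uuid").
  option("hoodie.datasource.write.precombine.field", "ts").
  option("hoodie.write.commit.callback.on", "true").
  option("hoodie.write.commit.callback.http.url", "https://example.com/hudi/commit-events").
  mode("append").
  save("/tmp/hudi/trips")
```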
diff --git a/website/versioned_docs/version-0.13.1/compaction.md b/website/versioned_docs/version-0.13.1/compaction.md
index a6249b7ae7c4..e99cc2082c5f 100644
--- a/website/versioned_docs/version-0.13.1/compaction.md
+++ b/website/versioned_docs/version-0.13.1/compaction.md
@@ -95,7 +95,7 @@ is enabled by default.
:::
### Hudi Compactor Utility
-Hudi provides a standalone tool to execute specific compactions asynchronously. Below is an example and you can read more in the [deployment guide](/docs/deployment#compactions)
+Hudi provides a standalone tool to execute specific compactions asynchronously. Below is an example, and you can read more in the [CLI guide](/docs/cli#compactions).
Example:
```properties
diff --git a/website/versioned_docs/version-0.13.1/deployment.md b/website/versioned_docs/version-0.13.1/deployment.md
index 7554cbfa8509..3a90bd9bcaa4 100644
--- a/website/versioned_docs/version-0.13.1/deployment.md
+++ b/website/versioned_docs/version-0.13.1/deployment.md
@@ -135,7 +135,7 @@ Here is an example invocation for reading from kafka topic in a continuous mode
### Spark Datasource Writer Jobs
-As described in [Writing Data](/docs/writing_data#spark-datasource-writer), you can use spark datasource to ingest to hudi table. This mechanism allows you to ingest any spark dataframe in Hudi format. Hudi Spark DataSource also supports spark streaming to ingest a streaming source to Hudi table. For Merge On Read table types, inline compaction is turned on by default which runs after every ingestion run. The compaction frequency can be changed by setting the property "hoodie.compact.inline.max.delta.commits".
+As described in [Writing Data](writing_data#spark-datasource-writer), you can use the Spark datasource to ingest data into a Hudi table. This mechanism allows you to ingest any Spark dataframe in Hudi format. The Hudi Spark datasource also supports Spark Streaming to ingest a streaming source into a Hudi table. For Merge On Read table types, inline compaction is turned on by default and runs after every ingestion run. The compaction frequency can be changed by setting the property "hoodie.compact.inline.max.delta.commits".
Here is an example invocation using spark datasource
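
As a rough sketch of such an invocation (the table name, fields, and path are hypothetical, and an input DataFrame `df` is assumed), a Spark datasource upsert into a merge-on-read table that also tunes the inline compaction frequency mentioned above might look like this:

```scala
// Illustrative sketch only: upsert into a (hypothetical) MOR table, triggering inline
// compaction after every 5 delta commits instead of the default. Assumes `df` exists.
df.write.format("hudi").
  option("hoodie.table.name", "trips_mor").
  option("hoodie.datasource.write.table.type", "MERGE_ON_READ").
  option("hoodie.datasource.write.recordkey.field", "uuid").
  option("hoodie.datasource.write.partitionpath.field", "region").   // hypothetical partition column
  option("hoodie.datasource.write.precombine.field", "ts").
  option("hoodie.datasource.write.operation", "upsert").
  option("hoodie.compact.inline.max.delta.commits", "5").            // the knob called out in the text
  mode("append").
  save("/tmp/hudi/trips_mor")
```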
diff --git a/website/versioned_docs/version-0.13.1/faq.md b/website/versioned_docs/version-0.13.1/faq.md
index bd6ba91094c2..40cdd44df972 100644
--- a/website/versioned_docs/version-0.13.1/faq.md
+++ b/website/versioned_docs/version-0.13.1/faq.md
@@ -342,7 +342,7 @@ a) **Auto Size small files during ingestion**: This solution trades ingest/writi
Hudi has the ability to maintain a configured target file size, when performing **upsert/insert** operations. (Note: **bulk_insert** operation does not provide this functionality and is designed as a simpler replacement for normal `spark.write.parquet` )
-For **copy-on-write**, this is as simple as configuring the [maximum size for a base/parquet file](https://hudi.apache.org/docs/configurations#hoodieparquetmaxfilesize) and the [soft limit](https://hudi.apache.org/docs/configurations#hoodieparquetsmallfilelimit) below which a file should be considered a small file. For the initial bootstrap to Hudi table, tuning record size estimate is also important to ensure sufficient records are bin-packed in a parquet file. For subsequent writes, Hudi automatically uses average record size based on previous commit. Hudi will try to add enough records to a small file at write time to get it to the configured maximum limit. For e.g , with `compactionSmallFileSize=100MB` and limitFileSize=120MB, Hudi will pick all files < 100MB and try to get them upto 120MB.
+For **copy-on-write**, this is as simple as configuring the [maximum size for a base/parquet file](https://hudi.apache.org/docs/configurations#hoodieparquetmaxfilesize) and the [soft limit](https://hudi.apache.org/docs/configurations#hoodieparquetsmallfilelimit) below which a file should be considered a small file. For the initial bootstrap to a Hudi table, tuning the record size estimate is also important to ensure sufficient records are bin-packed in a parquet file. For subsequent writes, Hudi automatically uses the average record size from the previous commit. Hudi will try to add enough records to a small file at write time to get it to the configured maximum limit. For example, with `hoodie.parquet.small.file.limit=100MB` and `hoodie.parquet.max.file.size=120MB`, Hudi will pick all files < 100MB and try to get them up to 120MB.
For **merge-on-read**, there are few more configs to set. MergeOnRead works differently for different INDEX choices.
diff --git a/website/versioned_docs/version-0.13.1/file_sizing.md b/website/versioned_docs/version-0.13.1/file_sizing.md
index e7935445d9e6..58831e4b2995 100644
--- a/website/versioned_docs/version-0.13.1/file_sizing.md
+++ b/website/versioned_docs/version-0.13.1/file_sizing.md
@@ -21,7 +21,7 @@ and the [soft limit](/docs/configurations#hoodieparquetsmallfilelimit) below whi
be considered a small file. For the initial bootstrap of a Hudi table, tuning record size estimate is also important to
ensure sufficient records are bin-packed in a parquet file. For subsequent writes, Hudi automatically uses average
record size based on previous commit. Hudi will try to add enough records to a small file at write time to get it to the
-configured maximum limit. For e.g , with `compactionSmallFileSize=100MB` and limitFileSize=120MB, Hudi will pick all
+configured maximum limit. For example, with `hoodie.parquet.small.file.limit=100MB` and `hoodie.parquet.max.file.size=120MB`, Hudi will pick all
files < 100MB and try to get them upto 120MB.
### For Merge-On-Read
diff --git a/website/versioned_docs/version-0.13.1/flink-quick-start-guide.md b/website/versioned_docs/version-0.13.1/flink-quick-start-guide.md
index 62c5671b1a1e..e30598a9e230 100644
--- a/website/versioned_docs/version-0.13.1/flink-quick-start-guide.md
+++ b/website/versioned_docs/version-0.13.1/flink-quick-start-guide.md
@@ -12,11 +12,11 @@ This guide helps you quickly start using Flink on Hudi, and learn different mode
- **Quick Start** : Read [Quick Start](#quick-start) to get started quickly Flink sql client to write to(read from) Hudi.
- **Configuration** : For [Global Configuration](/docs/0.13.1/flink_configuration#global-configurations), sets up through `$FLINK_HOME/conf/flink-conf.yaml`. For per job configuration, sets up through [Table Option](/docs/0.13.1/flink_configuration#table-options).
- **Writing Data** : Flink supports different modes for writing, such as [CDC Ingestion](/docs/0.13.1/hoodie_deltastreamer#cdc-ingestion), [Bulk Insert](/docs/0.13.1/hoodie_deltastreamer#bulk-insert), [Index Bootstrap](/docs/0.13.1/hoodie_deltastreamer#index-bootstrap), [Changelog Mode](/docs/0.13.1/hoodie_deltastreamer#changelog-mode) and [Append Mode](/docs/0.13.1/hoodie_deltastreamer#append-mode).
-- **Querying Data** : Flink supports different modes for reading, such as [Streaming Query](/docs/querying_data#streaming-query) and [Incremental Query](/docs/querying_data#incremental-query).
+- **Querying Data** : Flink supports different modes for reading, such as [Streaming Query](querying_data#streaming-query) and [Incremental Query](/docs/querying_data#incremental-query).
- **Tuning** : For write/read tasks, this guide gives some tuning suggestions, such as [Memory Optimization](/docs/0.13.1/flink_configuration#memory-optimization) and [Write Rate Limit](/docs/0.13.1/flink_configuration#write-rate-limit).
- **Optimization**: Offline compaction is supported [Offline Compaction](/docs/compaction#flink-offline-compaction).
-- **Query Engines**: Besides Flink, many other engines are integrated: [Hive Query](/docs/syncing_metastore#flink-setup), [Presto Query](/docs/querying_data#prestodb).
-- **Catalog**: A Hudi specific catalog is supported: [Hudi Catalog](/docs/querying_data/#hudi-catalog).
+- **Query Engines**: Besides Flink, many other engines are integrated: [Hive Query](/docs/syncing_metastore#flink-setup), [Presto Query](querying_data#prestodb).
+- **Catalog**: A Hudi specific catalog is supported: [Hudi Catalog](querying_data/#hudi-catalog).
## Quick Start
diff --git a/website/versioned_docs/version-0.13.1/quick-start-guide.md b/website/versioned_docs/version-0.13.1/quick-start-guide.md
index acba28538786..297b05d20bd7 100644
--- a/website/versioned_docs/version-0.13.1/quick-start-guide.md
+++ b/website/versioned_docs/version-0.13.1/quick-start-guide.md
@@ -1103,7 +1103,7 @@ For CoW tables, table services work in inline mode by default.
For MoR tables, some async services are enabled by default.
:::note
-Since Hudi 0.11 Metadata Table is enabled by default. When using async table services with Metadata Table enabled you must use Optimistic Concurrency Control to avoid the risk of data loss (even in single writer scenario). See [Metadata Table deployment considerations](/docs/metadata#deployment-considerations) for detailed instructions.
+Since Hudi 0.11, the Metadata Table is enabled by default. When using async table services with the Metadata Table enabled, you must use Optimistic Concurrency Control to avoid the risk of data loss (even in a single-writer scenario). See [Metadata Table deployment considerations](metadata#deployment-considerations) for detailed instructions.
If you're using Foreach or ForeachBatch streaming sink you must use inline table services, async table services are not supported.
:::
diff --git a/website/versioned_docs/version-0.13.1/record_payload.md b/website/versioned_docs/version-0.13.1/record_payload.md
index 48c3f0e6b79d..750e19858631 100644
--- a/website/versioned_docs/version-0.13.1/record_payload.md
+++ b/website/versioned_docs/version-0.13.1/record_payload.md
@@ -139,5 +139,5 @@ Amazon Database Migration Service onto S3.
Record payloads are tunable to suit many use cases. Please check out the configurations
listed [here](/docs/configurations#RECORD_PAYLOAD). Moreover, if users want to implement their own custom merge logic,
please check
-out [this FAQ](/docs/faq/#can-i-implement-my-own-logic-for-how-input-records-are-merged-with-record-on-storage). In a
+out [this FAQ](faq/#can-i-implement-my-own-logic-for-how-input-records-are-merged-with-record-on-storage). In a
separate document, we will talk about a new record merger API for optimized payload handling.
diff --git a/website/versioned_docs/version-0.13.1/tuning-guide.md b/website/versioned_docs/version-0.13.1/tuning-guide.md
index 4affeafda663..12b68098e060 100644
--- a/website/versioned_docs/version-0.13.1/tuning-guide.md
+++ b/website/versioned_docs/version-0.13.1/tuning-guide.md
@@ -17,7 +17,7 @@ Writing data via Hudi happens as a Spark job and thus general rules of spark deb
**Spark Memory** : Typically, hudi needs to be able to read a single file into memory to perform merges or compactions and thus the executor memory should be sufficient to accomodate this. In addition, Hoodie caches the input to be able to intelligently place data and thus leaving some `spark.memory.storageFraction` will generally help boost performance.
-**Sizing files**: Set `limitFileSize` above judiciously, to balance ingest/write latency vs number of files & consequently metadata overhead associated with it.
+**Sizing files**: Set `hoodie.parquet.max.file.size` above judiciously, to balance ingest/write latency vs the number of files and, consequently, the metadata overhead associated with it.
**Timeseries/Log data** : Default configs are tuned for database/nosql changelogs where individual record sizes are large. Another very popular class of data is timeseries/event/log data that tends to be more volumnious with lot more records per partition. In such cases consider tuning the bloom filter accuracy via `.bloomFilterFPP()/bloomFilterNumEntries()` to achieve your target index look up time. Also, consider making a key that is prefixed with time of the event, which will enable range pruning & significantly speeding up index lookup.
diff --git a/website/versioned_docs/version-0.13.1/write_operations.md b/website/versioned_docs/version-0.13.1/write_operations.md
index baa6d7dbf848..9ff8431384ca 100644
--- a/website/versioned_docs/version-0.13.1/write_operations.md
+++ b/website/versioned_docs/version-0.13.1/write_operations.md
@@ -51,7 +51,7 @@ The following is an inside look on the Hudi write path and the sequence of event
6. Update [Index](/docs/indexing)
1. Now that the write is performed, we will go back and update the index.
7. Commit
- 1. Finally we commit all of these changes atomically. (A [callback notification](/docs/writing_data#commit-notifications) is exposed)
+ 1. Finally we commit all of these changes atomically. (A [callback notification](writing_data#commit-notifications) is exposed)
8. [Clean](/docs/hoodie_cleaner) (if needed)
1. Following the commit, cleaning is invoked if needed.
9. [Compaction](/docs/compaction)
diff --git a/website/versioned_docs/version-0.14.0/deployment.md b/website/versioned_docs/version-0.14.0/deployment.md
index f5ad89c7f817..b400e413a7d3 100644
--- a/website/versioned_docs/version-0.14.0/deployment.md
+++ b/website/versioned_docs/version-0.14.0/deployment.md
@@ -136,7 +136,7 @@ Here is an example invocation for reading from kafka topic in a continuous mode
### Spark Datasource Writer Jobs
-As described in [Writing Data](/docs/writing_data#spark-datasource-writer), you can use spark datasource to ingest to hudi table. This mechanism allows you to ingest any spark dataframe in Hudi format. Hudi Spark DataSource also supports spark streaming to ingest a streaming source to Hudi table. For Merge On Read table types, inline compaction is turned on by default which runs after every ingestion run. The compaction frequency can be changed by setting the property "hoodie.compact.inline.max.delta.commits".
+As described in [Writing Data](writing_data#spark-datasource-writer), you can use the Spark datasource to ingest data into a Hudi table. This mechanism allows you to ingest any Spark dataframe in Hudi format. The Hudi Spark datasource also supports Spark Streaming to ingest a streaming source into a Hudi table. For Merge On Read table types, inline compaction is turned on by default and runs after every ingestion run. The compaction frequency can be changed by setting the property "hoodie.compact.inline.max.delta.commits".
Here is an example invocation using spark datasource
diff --git a/website/versioned_docs/version-0.14.0/faq.md b/website/versioned_docs/version-0.14.0/faq.md
index c64c63bda988..74bf66d3ae7a 100644
--- a/website/versioned_docs/version-0.14.0/faq.md
+++ b/website/versioned_docs/version-0.14.0/faq.md
@@ -162,7 +162,7 @@ Further - Hudi’s commit time can be a logical time and need not strictly be a
### What are some ways to write a Hudi table?
-Typically, you obtain a set of partial updates/inserts from your source and issue [write operations](https://hudi.apache.org/docs/write_operations/) against a Hudi table. If you ingesting data from any of the standard sources like Kafka, or tailing DFS, the [delta streamer](https://hudi.apache.org/docs/hoodie_streaming_ingestion#deltastreamer) tool is invaluable and provides an easy, self-managed solution to getting data written into Hudi. You can also write your own code to capture data from a custom source using the Spark datasource API and use a [Hudi datasource](https://hudi.apache.org/docs/writing_data/#spark-datasource-writer) to write into Hudi.
+Typically, you obtain a set of partial updates/inserts from your source and issue [write operations](https://hudi.apache.org/docs/write_operations/) against a Hudi table. If you are ingesting data from any of the standard sources like Kafka, or tailing DFS, the [delta streamer](https://hudi.apache.org/docs/0.14.0/hoodie_streaming_ingestion#hudi-streamer) tool is invaluable and provides an easy, self-managed solution to getting data written into Hudi. You can also write your own code to capture data from a custom source using the Spark datasource API and use a [Hudi datasource](https://hudi.apache.org/docs/writing_data/#spark-datasource-writer) to write into Hudi.
### How is a Hudi writer job deployed?
@@ -303,7 +303,7 @@ a) **Auto Size small files during ingestion**: This solution trades ingest/writi
Hudi has the ability to maintain a configured target file size, when performing **upsert/insert** operations. (Note: **bulk\_insert** operation does not provide this functionality and is designed as a simpler replacement for normal `spark.write.parquet` )
-For **copy-on-write**, this is as simple as configuring the [maximum size for a base/parquet file](https://hudi.apache.org/docs/configurations#hoodieparquetmaxfilesize) and the [soft limit](https://hudi.apache.org/docs/configurations#hoodieparquetsmallfilelimit) below which a file should be considered a small file. For the initial bootstrap to Hudi table, tuning record size estimate is also important to ensure sufficient records are bin-packed in a parquet file. For subsequent writes, Hudi automatically uses average record size based on previous commit. Hudi will try to add enough records to a small file at write time to get it to the configured maximum limit. For e.g , with `compactionSmallFileSize=100MB` and limitFileSize=120MB, Hudi will pick all files < 100MB and try to get them upto 120MB.
+For **copy-on-write**, this is as simple as configuring the [maximum size for a base/parquet file](https://hudi.apache.org/docs/configurations#hoodieparquetmaxfilesize) and the [soft limit](https://hudi.apache.org/docs/configurations#hoodieparquetsmallfilelimit) below which a file should be considered a small file. For the initial bootstrap to a Hudi table, tuning the record size estimate is also important to ensure sufficient records are bin-packed in a parquet file. For subsequent writes, Hudi automatically uses the average record size from the previous commit. Hudi will try to add enough records to a small file at write time to get it to the configured maximum limit. For example, with `hoodie.parquet.small.file.limit=100MB` and `hoodie.parquet.max.file.size=120MB`, Hudi will pick all files < 100MB and try to get them up to 120MB.
For **merge-on-read**, there are few more configs to set. MergeOnRead works differently for different INDEX choices.
diff --git a/website/versioned_docs/version-0.14.0/flink-quick-start-guide.md b/website/versioned_docs/version-0.14.0/flink-quick-start-guide.md
index 54afa766a19b..c64f84a4e602 100644
--- a/website/versioned_docs/version-0.14.0/flink-quick-start-guide.md
+++ b/website/versioned_docs/version-0.14.0/flink-quick-start-guide.md
@@ -11,11 +11,11 @@ This guide helps you quickly start using Flink on Hudi, and learn different mode
- **Quick Start** : Read [Quick Start](#quick-start) to get started quickly Flink sql client to write to(read from) Hudi.
- **Configuration** : For [Global Configuration](/docs/flink_tuning#global-configurations), sets up through `$FLINK_HOME/conf/flink-conf.yaml`. For per job configuration, sets up through [Table Option](/docs/flink_tuning#table-options).
-- **Writing Data** : Flink supports different modes for writing, such as [CDC Ingestion](/docs/hoodie_streaming_ingestion#cdc-ingestion), [Bulk Insert](/docs/hoodie_streaming_ingestion#bulk-insert), [Index Bootstrap](/docs/hoodie_streaming_ingestion#index-bootstrap), [Changelog Mode](/docs/hoodie_streaming_ingestion#changelog-mode) and [Append Mode](/docs/hoodie_streaming_ingestion#append-mode).
-- **Querying Data** : Flink supports different modes for reading, such as [Streaming Query](/docs/querying_data#streaming-query) and [Incremental Query](/docs/querying_data#incremental-query).
+- **Writing Data** : Flink supports different modes for writing, such as [CDC Ingestion](hoodie_streaming_ingestion#cdc-ingestion), [Bulk Insert](hoodie_streaming_ingestion#bulk-insert), [Index Bootstrap](hoodie_streaming_ingestion#index-bootstrap), [Changelog Mode](hoodie_streaming_ingestion#changelog-mode) and [Append Mode](hoodie_streaming_ingestion#append-mode).
+- **Querying Data** : Flink supports different modes for reading, such as [Streaming Query](sql_queries#streaming-query) and [Incremental Query](/docs/querying_data#incremental-query).
- **Tuning** : For write/read tasks, this guide gives some tuning suggestions, such as [Memory Optimization](/docs/flink_tuning#memory-optimization) and [Write Rate Limit](/docs/flink_tuning#write-rate-limit).
- **Optimization**: Offline compaction is supported [Offline Compaction](/docs/compaction#flink-offline-compaction).
-- **Query Engines**: Besides Flink, many other engines are integrated: [Hive Query](/docs/syncing_metastore#flink-setup), [Presto Query](/docs/querying_data#prestodb).
+- **Query Engines**: Besides Flink, many other engines are integrated: [Hive Query](/docs/syncing_metastore#flink-setup), [Presto Query](sql_queries#presto).
- **Catalog**: A Hudi specific catalog is supported: [Hudi Catalog](/docs/sql_ddl/#create-catalog).
## Quick Start
diff --git a/website/versioned_docs/version-0.14.0/quick-start-guide.md b/website/versioned_docs/version-0.14.0/quick-start-guide.md
index 1512bab6e050..32c89b2afe07 100644
--- a/website/versioned_docs/version-0.14.0/quick-start-guide.md
+++ b/website/versioned_docs/version-0.14.0/quick-start-guide.md
@@ -1123,9 +1123,9 @@ Hudi provides industry-leading performance and functionality for streaming data.
from various different sources in a streaming manner, with powerful built-in capabilities like auto checkpointing, schema enforcement via schema provider,
transformation support, automatic table services and so on.
-**Structured Streaming** - Hudi supports Spark Structured Streaming reads and writes as well. Please see [here](/docs/hoodie_streaming_ingestion#structured-streaming) for more.
+**Structured Streaming** - Hudi supports Spark Structured Streaming reads and writes as well. Please see [here](hoodie_streaming_ingestion#structured-streaming) for more.
-Check out more information on [modeling data in Hudi](/docs/faq#how-do-i-model-the-data-stored-in-hudi) and different ways to [writing Hudi Tables](/docs/writing_data).
+Check out more information on [modeling data in Hudi](faq#how-do-i-model-the-data-stored-in-hudi) and different ways of [writing Hudi Tables](/docs/writing_data).
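
For the Structured Streaming write path mentioned above, a hedged sketch follows (the rate source, checkpoint location, and table details are placeholders, and a spark-shell style `spark` session is assumed; a real pipeline would read from Kafka or files instead):

```scala
// Illustrative sketch only: stream a toy "rate" source into a Hudi table via Spark
// Structured Streaming. Checkpoint location, table name, and base path are placeholders.
import org.apache.spark.sql.streaming.Trigger

val streamDf = spark.readStream.format("rate").load().
  withColumnRenamed("value", "uuid")             // pretend the generated long is a record key

val query = streamDf.writeStream.format("hudi").
  option("hoodie.table.name", "stream_tbl").
  option("hoodie.datasource.write.recordkey.field", "uuid").
  option("hoodie.datasource.write.precombine.field", "timestamp"). // the rate source emits `timestamp`
  option("checkpointLocation", "/tmp/hudi/checkpoints/stream_tbl").
  outputMode("append").
  trigger(Trigger.ProcessingTime("30 seconds")).
  start("/tmp/hudi/stream_tbl")
// query.awaitTermination()                       // block until the stream is stopped
```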
### Dockerized Demo
Even as we showcased the core capabilities, Hudi supports a lot more advanced functionality that can make it easy
diff --git a/website/versioned_docs/version-0.14.0/record_payload.md b/website/versioned_docs/version-0.14.0/record_payload.md
index 1ed47b2ca967..fb63c8f52939 100644
--- a/website/versioned_docs/version-0.14.0/record_payload.md
+++ b/website/versioned_docs/version-0.14.0/record_payload.md
@@ -172,5 +172,5 @@ provides support for applying changes captured via Amazon Database Migration Ser
Record payloads are tunable to suit many use cases. Please check out the configurations
listed [here](/docs/configurations#RECORD_PAYLOAD). Moreover, if users want to implement their own custom merge logic,
-please check out [this FAQ](/docs/faq/#can-i-implement-my-own-logic-for-how-input-records-are-merged-with-record-on-storage). In a
+please check out [this FAQ](faq/#can-i-implement-my-own-logic-for-how-input-records-are-merged-with-record-on-storage). In a
separate document, we will talk about a new record merger API for optimized payload handling.
diff --git a/website/versioned_docs/version-0.14.0/write_operations.md b/website/versioned_docs/version-0.14.0/write_operations.md
index abd1bdb66db7..29132a38f5a6 100644
--- a/website/versioned_docs/version-0.14.0/write_operations.md
+++ b/website/versioned_docs/version-0.14.0/write_operations.md
@@ -100,7 +100,7 @@ The following is an inside look on the Hudi write path and the sequence of event
6. Update [Index](/docs/indexing)
1. Now that the write is performed, we will go back and update the index.
7. Commit
- 1. Finally we commit all of these changes atomically. (A [callback notification](/docs/writing_data#commit-notifications) is exposed)
+ 1. Finally we commit all of these changes atomically. (A [callback notification](writing_data#commit-notifications) is exposed)
8. [Clean](/docs/hoodie_cleaner) (if needed)
1. Following the commit, cleaning is invoked if needed.
9. [Compaction](/docs/compaction)
diff --git a/website/versioned_docs/version-0.14.1/cli.md b/website/versioned_docs/version-0.14.1/cli.md
index 1c30b9b6fa6e..7cc4cdd92b0c 100644
--- a/website/versioned_docs/version-0.14.1/cli.md
+++ b/website/versioned_docs/version-0.14.1/cli.md
@@ -578,7 +578,7 @@ Compaction successfully repaired
### Savepoint and Restore
As the name suggest, "savepoint" saves the table as of the commit time, so that it lets you restore the table to this
-savepoint at a later point in time if need be. You can read more about savepoints and restore [here](/docs/next/disaster_recovery)
+savepoint at a later point in time if need be. You can read more about savepoints and restore [here](disaster_recovery)
To trigger savepoint for a hudi table
```java
diff --git a/website/versioned_docs/version-0.14.1/concurrency_control.md b/website/versioned_docs/version-0.14.1/concurrency_control.md
index dd4e217829e2..3efcc0492494 100644
--- a/website/versioned_docs/version-0.14.1/concurrency_control.md
+++ b/website/versioned_docs/version-0.14.1/concurrency_control.md
@@ -77,7 +77,7 @@ Multiple writers can operate on the table with non-blocking conflict resolution.
file group with the conflicts resolved automatically by the query reader and the compactor. The new concurrency mode is
currently available for preview in version 1.0.0-beta only with the caveat that conflict resolution is not supported yet
between clustering and ingestion. It works for compaction and ingestion, and we can see an example of that with Flink
-writers [here](/docs/next/writing_data#non-blocking-concurrency-control-experimental).
+writers [here](writing_data#non-blocking-concurrency-control-experimental).
## Enabling Multi Writing
diff --git a/website/versioned_docs/version-0.14.1/deployment.md b/website/versioned_docs/version-0.14.1/deployment.md
index f5ad89c7f817..b400e413a7d3 100644
--- a/website/versioned_docs/version-0.14.1/deployment.md
+++ b/website/versioned_docs/version-0.14.1/deployment.md
@@ -136,7 +136,7 @@ Here is an example invocation for reading from kafka topic in a continuous mode
### Spark Datasource Writer Jobs
-As described in [Writing Data](/docs/writing_data#spark-datasource-writer), you can use spark datasource to ingest to hudi table. This mechanism allows you to ingest any spark dataframe in Hudi format. Hudi Spark DataSource also supports spark streaming to ingest a streaming source to Hudi table. For Merge On Read table types, inline compaction is turned on by default which runs after every ingestion run. The compaction frequency can be changed by setting the property "hoodie.compact.inline.max.delta.commits".
+As described in [Writing Data](writing_data#spark-datasource-writer), you can use the Spark datasource to ingest data into a Hudi table. This mechanism allows you to ingest any Spark dataframe in Hudi format. The Hudi Spark datasource also supports Spark Streaming to ingest a streaming source into a Hudi table. For Merge On Read table types, inline compaction is turned on by default and runs after every ingestion run. The compaction frequency can be changed by setting the property "hoodie.compact.inline.max.delta.commits".
Here is an example invocation using spark datasource
diff --git a/website/versioned_docs/version-0.14.1/faq_writing_tables.md b/website/versioned_docs/version-0.14.1/faq_writing_tables.md
index 40c3a99fa99f..ae23bbf1e7d2 100644
--- a/website/versioned_docs/version-0.14.1/faq_writing_tables.md
+++ b/website/versioned_docs/version-0.14.1/faq_writing_tables.md
@@ -6,7 +6,7 @@ keywords: [hudi, writing, reading]
### What are some ways to write a Hudi table?
-Typically, you obtain a set of partial updates/inserts from your source and issue [write operations](/docs/write_operations/) against a Hudi table. If you ingesting data from any of the standard sources like Kafka, or tailing DFS, the [delta streamer](/docs/hoodie_streaming_ingestion#deltastreamer) tool is invaluable and provides an easy, self-managed solution to getting data written into Hudi. You can also write your own code to capture data from a custom source using the Spark datasource API and use a [Hudi datasource](/docs/writing_data/#spark-datasource-writer) to write into Hudi.
+Typically, you obtain a set of partial updates/inserts from your source and issue [write operations](/docs/write_operations/) against a Hudi table. If you are ingesting data from any of the standard sources like Kafka, or tailing DFS, the [delta streamer](hoodie_streaming_ingestion#deltastreamer) tool is invaluable and provides an easy, self-managed solution to getting data written into Hudi. You can also write your own code to capture data from a custom source using the Spark datasource API and use a [Hudi datasource](writing_data/#spark-datasource-writer) to write into Hudi.
### How is a Hudi writer job deployed?
@@ -147,7 +147,7 @@ a) **Auto Size small files during ingestion**: This solution trades ingest/writi
Hudi has the ability to maintain a configured target file size, when performing **upsert/insert** operations. (Note: **bulk\_insert** operation does not provide this functionality and is designed as a simpler replacement for normal `spark.write.parquet` )
-For **copy-on-write**, this is as simple as configuring the [maximum size for a base/parquet file](/docs/configurations#hoodieparquetmaxfilesize) and the [soft limit](/docs/configurations#hoodieparquetsmallfilelimit) below which a file should be considered a small file. For the initial bootstrap to Hudi table, tuning record size estimate is also important to ensure sufficient records are bin-packed in a parquet file. For subsequent writes, Hudi automatically uses average record size based on previous commit. Hudi will try to add enough records to a small file at write time to get it to the configured maximum limit. For e.g , with `compactionSmallFileSize=100MB` and limitFileSize=120MB, Hudi will pick all files < 100MB and try to get them upto 120MB.
+For **copy-on-write**, this is as simple as configuring the [maximum size for a base/parquet file](/docs/configurations#hoodieparquetmaxfilesize) and the [soft limit](/docs/configurations#hoodieparquetsmallfilelimit) below which a file should be considered a small file. For the initial bootstrap to a Hudi table, tuning the record size estimate is also important to ensure sufficient records are bin-packed in a parquet file. For subsequent writes, Hudi automatically uses the average record size from the previous commit. Hudi will try to add enough records to a small file at write time to get it to the configured maximum limit. For example, with `hoodie.parquet.small.file.limit=100MB` and `hoodie.parquet.max.file.size=120MB`, Hudi will pick all files < 100MB and try to get them up to 120MB.
For **merge-on-read**, there are few more configs to set. MergeOnRead works differently for different INDEX choices.
@@ -183,7 +183,7 @@ No, Hudi does not expose uncommitted files/blocks to the readers. Further, Hudi
### How are conflicts detected in Hudi between multiple writers?
-Hudi employs [optimistic concurrency control](/docs/concurrency_control#supported-concurrency-controls) between writers, while implementing MVCC based concurrency control between writers and the table services. Concurrent writers to the same table need to be configured with the same lock provider configuration, to safely perform writes. By default (implemented in “[SimpleConcurrentFileWritesConflictResolutionStrategy](https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/transaction/SimpleConcurrentFileWritesConflictResolutionStrategy.java)”), Hudi allows multiple writers to concurrently write data and commit to the timeline if there is no conflicting writes to the same underlying file group IDs. This is achieved by holding a lock, checking for changes that modified the same file IDs. Hudi then supports a pluggable interface “[ConflictResolutionStrategy](https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/transaction/ConflictResolutionStrategy.java)” that determines how conflicts are handled. By default, the later conflicting write is aborted. Hudi also support eager conflict detection to help speed up conflict detection and release cluster resources back early to reduce costs.
+Hudi employs [optimistic concurrency control](concurrency_control) between writers, while implementing MVCC-based concurrency control between writers and the table services. Concurrent writers to the same table need to be configured with the same lock provider configuration to safely perform writes. By default (implemented in “[SimpleConcurrentFileWritesConflictResolutionStrategy](https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/transaction/SimpleConcurrentFileWritesConflictResolutionStrategy.java)”), Hudi allows multiple writers to concurrently write data and commit to the timeline if there are no conflicting writes to the same underlying file group IDs. This is achieved by holding a lock and checking for changes that modified the same file IDs. Hudi then supports a pluggable interface “[ConflictResolutionStrategy](https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/transaction/ConflictResolutionStrategy.java)” that determines how conflicts are handled. By default, the later conflicting write is aborted. Hudi also supports eager conflict detection to help speed up conflict detection and release cluster resources back early to reduce costs.
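
For reference, a hedged sketch of the kind of multi-writer configuration this answer refers to, using the ZooKeeper-based lock provider (an existing DataFrame `df` is assumed, and the ZooKeeper endpoint, paths, and table details are placeholders):

```scala
// Illustrative sketch only: write options enabling optimistic concurrency control with
// a ZooKeeper-based lock provider. Assumes `df` exists; host/paths are placeholders.
df.write.format("hudi").
  option("hoodie.table.name", "shared_tbl").
  option("hoodie.datasource.write.recordkey.field", "uuid").
  option("hoodie.datasource.write.precombine.field", "ts").
  option("hoodie.write.concurrency.mode", "optimistic_concurrency_control").
  option("hoodie.cleaner.policy.failed.writes", "LAZY").
  option("hoodie.write.lock.provider",
    "org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider").
  option("hoodie.write.lock.zookeeper.url", "zk-host").            // placeholder
  option("hoodie.write.lock.zookeeper.port", "2181").
  option("hoodie.write.lock.zookeeper.lock_key", "shared_tbl").
  option("hoodie.write.lock.zookeeper.base_path", "/hudi/locks").  // placeholder
  mode("append").
  save("/tmp/hudi/shared_tbl")
```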
### Can single-writer inserts have duplicates?
diff --git a/website/versioned_docs/version-0.14.1/file_layouts.md b/website/versioned_docs/version-0.14.1/file_layouts.md
index 71ee6d563079..b8b5ca7c342a 100644
--- a/website/versioned_docs/version-0.14.1/file_layouts.md
+++ b/website/versioned_docs/version-0.14.1/file_layouts.md
@@ -10,8 +10,8 @@ The following describes the general file layout structure for Apache Hudi. Pleas
* Each file group contains several file slices
* Each slice contains a base file (*.parquet/*.orc) (defined by the config - [hoodie.table.base.file.format](https://hudi.apache.org/docs/next/configurations/#hoodietablebasefileformat) ) produced at a certain commit/compaction instant time, along with set of log files (*.log.*) that contain inserts/updates to the base file since the base file was produced.
-Hudi adopts Multiversion Concurrency Control (MVCC), where [compaction](/docs/next/compaction) action merges logs and base files to produce new
-file slices and [cleaning](/docs/next/cleaning) action gets rid of unused/older file slices to reclaim space on the file system.
+Hudi adopts Multiversion Concurrency Control (MVCC), where [compaction](compaction) action merges logs and base files to produce new
+file slices and [cleaning](hoodie_cleaner) action gets rid of unused/older file slices to reclaim space on the file system.
![Partition On HDFS](/assets/images/hudi_partitions_HDFS.png)
diff --git a/website/versioned_docs/version-0.14.1/file_sizing.md b/website/versioned_docs/version-0.14.1/file_sizing.md
index c637a5a630cc..a451b09b6c58 100644
--- a/website/versioned_docs/version-0.14.1/file_sizing.md
+++ b/website/versioned_docs/version-0.14.1/file_sizing.md
@@ -148,7 +148,7 @@ while the clustering service runs.
:::note
Hudi always creates immutable files on storage. To be able to do auto-sizing or clustering, Hudi will always create a
-newer version of the smaller file, resulting in 2 versions of the same file. The [cleaner service](/docs/next/cleaning)
+newer version of the smaller file, resulting in 2 versions of the same file. The [cleaner service](hoodie_cleaner)
will later kick in and delete the older version small file and keep the latest one.
:::
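
A small, hedged illustration of the cleaner retention knob that decides when those older file versions are removed (an existing DataFrame `df` is assumed; the table details and retention value are placeholders, not recommendations):

```scala
// Illustrative sketch only: retain file slices from the last 10 commits; older versions
// left behind by auto-sizing/clustering become eligible for cleaning after that.
// Assumes `df` exists; table name and path are placeholders.
df.write.format("hudi").
  option("hoodie.table.name", "sized_tbl").
  option("hoodie.datasource.write.recordkey.field", "uuid").
  option("hoodie.datasource.write.precombine.field", "ts").
  option("hoodie.clean.automatic", "true").
  option("hoodie.cleaner.policy", "KEEP_LATEST_COMMITS").
  option("hoodie.cleaner.commits.retained", "10").
  mode("append").
  save("/tmp/hudi/sized_tbl")
```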
diff --git a/website/versioned_docs/version-0.14.1/flink-quick-start-guide.md b/website/versioned_docs/version-0.14.1/flink-quick-start-guide.md
index 02e5a19e5f6c..74ecc9d73a9f 100644
--- a/website/versioned_docs/version-0.14.1/flink-quick-start-guide.md
+++ b/website/versioned_docs/version-0.14.1/flink-quick-start-guide.md
@@ -453,19 +453,19 @@ feature is that it now lets you author streaming pipelines on streaming or batch
## Where To Go From Here?
- **Quick Start** : Read [Quick Start](#quick-start) to get started quickly Flink sql client to write to(read from) Hudi.
-- **Configuration** : For [Global Configuration](/docs/next/flink_tuning#global-configurations), sets up through `$FLINK_HOME/conf/flink-conf.yaml`. For per job configuration, sets up through [Table Option](/docs/next/flink_tuning#table-options).
-- **Writing Data** : Flink supports different modes for writing, such as [CDC Ingestion](/docs/hoodie_streaming_ingestion#cdc-ingestion), [Bulk Insert](/docs/hoodie_streaming_ingestion#bulk-insert), [Index Bootstrap](/docs/hoodie_streaming_ingestion#index-bootstrap), [Changelog Mode](/docs/hoodie_streaming_ingestion#changelog-mode) and [Append Mode](/docs/hoodie_streaming_ingestion#append-mode). Flink also supports multiple streaming writers with [non-blocking concurrency control](/docs/next/writing_data#non-blocking-concurrency-control-experimental).
-- **Querying Data** : Flink supports different modes for reading, such as [Streaming Query](/docs/querying_data#streaming-query) and [Incremental Query](/docs/querying_data#incremental-query).
-- **Tuning** : For write/read tasks, this guide gives some tuning suggestions, such as [Memory Optimization](/docs/next/flink_tuning#memory-optimization) and [Write Rate Limit](/docs/next/flink_tuning#write-rate-limit).
+- **Configuration** : For [Global Configuration](flink_tuning#global-configurations), sets up through `$FLINK_HOME/conf/flink-conf.yaml`. For per job configuration, sets up through [Table Option](flink_tuning#table-options).
+- **Writing Data** : Flink supports different modes for writing, such as [CDC Ingestion](hoodie_streaming_ingestion#cdc-ingestion), [Bulk Insert](hoodie_streaming_ingestion#bulk-insert), [Index Bootstrap](hoodie_streaming_ingestion#index-bootstrap), [Changelog Mode](hoodie_streaming_ingestion#changelog-mode) and [Append Mode](hoodie_streaming_ingestion#append-mode). Flink also supports multiple streaming writers with [non-blocking concurrency control](writing_data#non-blocking-concurrency-control-experimental).
+- **Querying Data** : Flink supports different modes for reading, such as [Streaming Query](sql_queries#streaming-query) and [Incremental Query](/docs/querying_data#incremental-query).
+- **Tuning** : For write/read tasks, this guide gives some tuning suggestions, such as [Memory Optimization](flink_tuning#memory-optimization) and [Write Rate Limit](flink_tuning#write-rate-limit).
- **Optimization**: Offline compaction is supported [Offline Compaction](/docs/compaction#flink-offline-compaction).
-- **Query Engines**: Besides Flink, many other engines are integrated: [Hive Query](/docs/syncing_metastore#flink-setup), [Presto Query](/docs/querying_data#prestodb).
+- **Query Engines**: Besides Flink, many other engines are integrated: [Hive Query](/docs/syncing_metastore#flink-setup), [Presto Query](sql_queries#presto).
- **Catalog**: A Hudi specific catalog is supported: [Hudi Catalog](/docs/sql_ddl/#create-catalog).
If you are relatively new to Apache Hudi, it is important to be familiar with a few core concepts:
- - [Hudi Timeline](/docs/next/timeline) – How Hudi manages transactions and other table services
- - [Hudi File Layout](/docs/next/storage_layouts) - How the files are laid out on storage
- - [Hudi Table Types](/docs/next/table_types) – `COPY_ON_WRITE` and `MERGE_ON_READ`
- - [Hudi Query Types](/docs/next/table_types#query-types) – Snapshot Queries, Incremental Queries, Read-Optimized Queries
+ - [Hudi Timeline](timeline) – How Hudi manages transactions and other table services
+ - [Hudi File Layout](file_layouts) - How the files are laid out on storage
+ - [Hudi Table Types](table_types) – `COPY_ON_WRITE` and `MERGE_ON_READ`
+ - [Hudi Query Types](table_types#query-types) – Snapshot Queries, Incremental Queries, Read-Optimized Queries
See more in the "Concepts" section of the docs.
diff --git a/website/versioned_docs/version-0.14.1/indexing.md b/website/versioned_docs/version-0.14.1/indexing.md
index 034246ad5805..53e883c38561 100644
--- a/website/versioned_docs/version-0.14.1/indexing.md
+++ b/website/versioned_docs/version-0.14.1/indexing.md
@@ -11,9 +11,9 @@ Hudi provides efficient upserts, by mapping a given hoodie key (record key + par
This mapping between record key and file group/file id, never changes once the first version of a record has been written to a file. In short, the
mapped file group contains all versions of a group of records.
-For [Copy-On-Write tables](/docs/next/table_types#copy-on-write-table), this enables fast upsert/delete operations, by
+For [Copy-On-Write tables](table_types#copy-on-write-table), this enables fast upsert/delete operations, by
avoiding the need to join against the entire dataset to determine which files to rewrite.
-For [Merge-On-Read tables](/docs/next/table_types#merge-on-read-table), this design allows Hudi to bound the amount of
+For [Merge-On-Read tables](table_types#merge-on-read-table), this design allows Hudi to bound the amount of
records any given base file needs to be merged against.
Specifically, a given base file needs to merged only against updates for records that are part of that base file. In contrast,
designs without an indexing component (e.g: [Apache Hive ACID](https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions)),
diff --git a/website/versioned_docs/version-0.14.1/metadata_indexing.md b/website/versioned_docs/version-0.14.1/metadata_indexing.md
index 5b96ed07bd40..2a0bbfca06f0 100644
--- a/website/versioned_docs/version-0.14.1/metadata_indexing.md
+++ b/website/versioned_docs/version-0.14.1/metadata_indexing.md
@@ -78,8 +78,8 @@ us schedule the indexing for COLUMN_STATS index. First we need to define a prope
As mentioned before, metadata indices are pluggable. One can add any index at any point in time depending on changing
business requirements. Some configurations to enable particular indices are listed below. Currently, available indices under
-metadata table can be explored [here](/docs/next/metadata#metadata-table-indices) along with [configs](/docs/next/metadata#enable-hudi-metadata-table-and-multi-modal-index-in-write-side)
-to enable them. The full set of metadata configurations can be explored [here](/docs/next/configurations/#Metadata-Configs).
+metadata table can be explored [here](metadata#metadata-table-indices) along with [configs](metadata#enable-hudi-metadata-table-and-multi-modal-index-in-write-side)
+to enable them. The full set of metadata configurations can be explored [here](configurations/#Metadata-Configs).
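
To make this concrete, a hedged sketch of write-side options that enable the metadata table and request column-stats and bloom-filter indices, deferring index building to the async indexer (an existing DataFrame `df` is assumed; table details are placeholders):

```scala
// Illustrative sketch only: enable metadata-table indices on the write side and let the
// async indexer build them. Assumes `df` exists; table name and path are placeholders.
df.write.format("hudi").
  option("hoodie.table.name", "indexed_tbl").
  option("hoodie.datasource.write.recordkey.field", "uuid").
  option("hoodie.datasource.write.precombine.field", "ts").
  option("hoodie.metadata.enable", "true").
  option("hoodie.metadata.index.column.stats.enable", "true").
  option("hoodie.metadata.index.bloom.filter.enable", "true").
  option("hoodie.metadata.index.async", "true").                 // hand off index building to the async indexer
  mode("append").
  save("/tmp/hudi/indexed_tbl")
```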
:::note
Enabling the metadata table and configuring a lock provider are the prerequisites for using async indexer. Checkout a sample
diff --git a/website/versioned_docs/version-0.14.1/procedures.md b/website/versioned_docs/version-0.14.1/procedures.md
index c2cd0dea7c11..0a895560df09 100644
--- a/website/versioned_docs/version-0.14.1/procedures.md
+++ b/website/versioned_docs/version-0.14.1/procedures.md
@@ -472,10 +472,10 @@ archive commits.
|------------------------------------------------------------------------|---------|----------|---------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| table | String | N | None | Hudi table name |
| path | String | N | None | Path of table |
-| [min_commits](/docs/next/configurations#hoodiekeepmincommits) | Int | N | 20 | Similar to hoodie.keep.max.commits, but controls the minimum number of instants to retain in the active timeline. |
-| [max_commits](/docs/next/configurations#hoodiekeepmaxcommits) | Int | N | 30 | Archiving service moves older entries from timeline into an archived log after each write, to keep the metadata overhead constant, even as the table size grows. This config controls the maximum number of instants to retain in the active timeline. |
-| [retain_commits](/docs/next/configurations#hoodiecommitsarchivalbatch) | Int | N | 10 | Archiving of instants is batched in best-effort manner, to pack more instants into a single archive log. This config controls such archival batch size. |
-| [enable_metadata](/docs/next/configurations#hoodiemetadataenable) | Boolean | N | false | Enable the internal metadata table |
+| [min_commits](configurations#hoodiekeepmincommits) | Int | N | 20 | Similar to hoodie.keep.max.commits, but controls the minimum number of instants to retain in the active timeline. |
+| [max_commits](configurations#hoodiekeepmaxcommits) | Int | N | 30 | Archiving service moves older entries from timeline into an archived log after each write, to keep the metadata overhead constant, even as the table size grows. This config controls the maximum number of instants to retain in the active timeline. |
+| [retain_commits](configurations#hoodiecommitsarchivalbatch) | Int | N | 10 | Archiving of instants is batched in best-effort manner, to pack more instants into a single archive log. This config controls such archival batch size. |
+| [enable_metadata](configurations#hoodiemetadataenable) | Boolean | N | false | Enable the internal metadata table |
**Output**
@@ -672,7 +672,7 @@ copy table to a temporary view.
| Parameter Name | Type | Required | Default Value | Description |
|-------------------------------------------------------------------|---------|----------|---------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| table | String | Y | None | Hudi table name |
-| [query_type](/docs/next/configurations#hoodiedatasourcequerytype) | String | N | "snapshot" | Whether data needs to be read, in `incremental` mode (new data since an instantTime) (or) `read_optimized` mode (obtain latest view, based on base files) (or) `snapshot` mode (obtain latest view, by merging base and (if any) log files) |
+| [query_type](configurations#hoodiedatasourcequerytype) | String | N | "snapshot" | Whether data needs to be read, in `incremental` mode (new data since an instantTime) (or) `read_optimized` mode (obtain latest view, based on base files) (or) `snapshot` mode (obtain latest view, by merging base and (if any) log files) |
| view_name | String | Y | None | Name of view |
| begin_instance_time | String | N | "" | Begin instance time |
| end_instance_time | String | N | "" | End instance time |
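
A hedged example of calling this procedure from Spark SQL with the parameters listed above (a spark-shell style `spark` session is assumed, and the table and view names are placeholders):

```scala
// Illustrative sketch only: expose a read-optimized view of a (hypothetical) Hudi table
// as a temporary view, then query it.
spark.sql(
  "call copy_to_temp_view(table => 'hudi_trips', query_type => 'read_optimized', view_name => 'trips_ro_view')"
).show(false)

spark.sql("select count(*) from trips_ro_view").show()
```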
@@ -705,7 +705,7 @@ copy table to a new table.
| Parameter Name | Type | Required | Default Value | Description |
|-------------------------------------------------------------------|--------|----------|---------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| table | String | Y | None | Hudi table name |
-| [query_type](/docs/next/configurations#hoodiedatasourcequerytype) | String | N | "snapshot" | Whether data needs to be read, in `incremental` mode (new data since an instantTime) (or) `read_optimized` mode (obtain latest view, based on base files) (or) `snapshot` mode (obtain latest view, by merging base and (if any) log files) |
+| [query_type](configurations#hoodiedatasourcequerytype) | String | N | "snapshot" | Whether data needs to be read, in `incremental` mode (new data since an instantTime) (or) `read_optimized` mode (obtain latest view, based on base files) (or) `snapshot` mode (obtain latest view, by merging base and (if any) log files) |
| new_table | String | Y | None | Name of new table |
| begin_instance_time | String | N | "" | Begin instance time |
| end_instance_time | String | N | "" | End instance time |
@@ -1533,13 +1533,13 @@ Run cleaner on a hoodie table.
|---------------------------------------------------------------------------------------|---------|----------|---------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| table | String | Y | None | Name of table to be cleaned |
| schedule_in_line | Boolean | N | true | Set "true" if you want to schedule and run a clean. Set false if you have already scheduled a clean and want to run that. |
-| [clean_policy](/docs/next/configurations#hoodiecleanerpolicy) | String | N | None | org.apache.hudi.common.model.HoodieCleaningPolicy: Cleaning policy to be used. The cleaner service deletes older file slices files to re-claim space. Long running query plans may often refer to older file slices and will break if those are cleaned, before the query has had a chance to run. So, it is good to make sure that the data is retained for more than the maximum query execution time. By default, the cleaning policy is determined based on one of the following configs explicitly set by the user (at most one of them can be set; otherwise, KEEP_LATEST_COMMITS cleaning policy is used). KEEP_LATEST_FILE_VERSIONS: keeps the last N versions of the file slices written; used when "hoodie.cleaner.fileversions.retained" is explicitly set only. KEEP_LATEST_COMMITS(default): keeps the file slices written by the last N commits; used when "hoodie.cleaner.commits.retained" is explicitly set only. KEEP_LATEST_BY_HOURS: keeps the file slices written in the last N hours based on the commit time; used when "hoodie.cleaner.hours.retained" is explicitly set only. |
-| [retain_commits](/docs/next/configurations#hoodiecleanercommitsretained) | Int | N | None | When KEEP_LATEST_COMMITS cleaning policy is used, the number of commits to retain, without cleaning. This will be retained for num_of_commits * time_between_commits (scheduled). This also directly translates into how much data retention the table supports for incremental queries. |
-| [hours_retained](/docs/next/configurations#hoodiecleanerhoursretained) | Int | N | None | When KEEP_LATEST_BY_HOURS cleaning policy is used, the number of hours for which commits need to be retained. This config provides a more flexible option as compared to number of commits retained for cleaning service. Setting this property ensures all the files, but the latest in a file group, corresponding to commits with commit times older than the configured number of hours to be retained are cleaned. |
-| [file_versions_retained](/docs/next/configurations#hoodiecleanerfileversionsretained) | Int | N | None | When KEEP_LATEST_FILE_VERSIONS cleaning policy is used, the minimum number of file slices to retain in each file group, during cleaning. |
-| [trigger_strategy](/docs/next/configurations#hoodiecleantriggerstrategy) | String | N | None | org.apache.hudi.table.action.clean.CleaningTriggerStrategy: Controls when cleaning is scheduled. NUM_COMMITS(default): Trigger the cleaning service every N commits, determined by `hoodie.clean.max.commits` |
-| [trigger_max_commits](/docs/next/configurations/#hoodiecleanmaxcommits) | Int | N | None | Number of commits after the last clean operation, before scheduling of a new clean is attempted. |
-| [options](/docs/next/configurations/#Clean-Configs) | String | N | None | comma separated list of Hudi configs for cleaning in the format "config1=value1,config2=value2" |
+| [clean_policy](configurations#hoodiecleanerpolicy) | String | N | None | org.apache.hudi.common.model.HoodieCleaningPolicy: Cleaning policy to be used. The cleaner service deletes older file slices to reclaim space. Long running queries may refer to older file slices and will break if those are cleaned before the query has had a chance to run, so make sure data is retained for longer than the maximum query execution time. By default, the cleaning policy is determined based on one of the following configs explicitly set by the user (at most one of them can be set; otherwise, the KEEP_LATEST_COMMITS cleaning policy is used). KEEP_LATEST_FILE_VERSIONS: keeps the last N versions of the file slices written; used only when "hoodie.cleaner.fileversions.retained" is explicitly set. KEEP_LATEST_COMMITS (default): keeps the file slices written by the last N commits; used only when "hoodie.cleaner.commits.retained" is explicitly set. KEEP_LATEST_BY_HOURS: keeps the file slices written in the last N hours based on the commit time; used only when "hoodie.cleaner.hours.retained" is explicitly set. |
+| [retain_commits](configurations#hoodiecleanercommitsretained) | Int | N | None | When the KEEP_LATEST_COMMITS cleaning policy is used, the number of commits to retain without cleaning. Data will be retained for num_of_commits * time_between_commits (scheduled). This also directly translates into how much data retention the table supports for incremental queries. |
+| [hours_retained](configurations#hoodiecleanerhoursretained) | Int | N | None | When the KEEP_LATEST_BY_HOURS cleaning policy is used, the number of hours for which commits need to be retained. This config provides a more flexible option than the number of commits retained by the cleaning service. Setting this property ensures that, in each file group, all files except the latest one that correspond to commits older than the configured number of hours are cleaned. |
+| [file_versions_retained](configurations#hoodiecleanerfileversionsretained) | Int | N | None | When KEEP_LATEST_FILE_VERSIONS cleaning policy is used, the minimum number of file slices to retain in each file group, during cleaning. |
+| [trigger_strategy](configurations#hoodiecleantriggerstrategy) | String | N | None | org.apache.hudi.table.action.clean.CleaningTriggerStrategy: Controls when cleaning is scheduled. NUM_COMMITS(default): Trigger the cleaning service every N commits, determined by `hoodie.clean.max.commits` |
+| [trigger_max_commits](configurations/#hoodiecleanmaxcommits) | Int | N | None | Number of commits after the last clean operation, before scheduling of a new clean is attempted. |
+| [options](configurations/#Clean-Configs) | String | N | None | comma separated list of Hudi configs for cleaning in the format "config1=value1,config2=value2" |
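As a hedged illustration of how these parameters combine, the cleaner procedure can be invoked from Spark SQL roughly as follows (the procedure name `run_clean`, the table name and the retention values are assumptions for this sketch; only one retention config should be set, matching the chosen policy, and the Hudi Spark session extensions are assumed to be enabled as in the quickstart):

```scala
// Schedule and run a clean that keeps the file slices written by the last 10 commits
spark.sql(
  """CALL run_clean(
    |  table => 'hudi_table',
    |  clean_policy => 'KEEP_LATEST_COMMITS',
    |  retain_commits => 10
    |)""".stripMargin).show(false)
```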
**Output**
@@ -1633,12 +1633,12 @@ Sync the table's latest schema to Hive metastore.
| metastore_uri | String | N | "" | Metastore_uri |
| username | String | N | "" | User name |
| password | String | N | "" | Password |
-| [use_jdbc](/docs/next/configurations#hoodiedatasourcehive_syncuse_jdbc) | String | N | "" | Use JDBC when hive synchronization is enabled |
-| [mode](/docs/next/configurations#hoodiedatasourcehive_syncmode) | String | N | "" | Mode to choose for Hive ops. Valid values are hms, jdbc and hiveql. |
-| [partition_fields](/docs/next/configurations#hoodiedatasourcehive_syncpartition_fields) | String | N | "" | Field in the table to use for determining hive partition columns. | |
-| [partition_extractor_class](/docs/next/configurations#hoodiedatasourcehive_syncpartition_extractor_class) | String | N | "" | Class which implements PartitionValueExtractor to extract the partition values, default 'org.apache.hudi.hive.MultiPartKeysValueExtractor'. |
-| [strategy](/docs/next/configurations#hoodiedatasourcehive_synctablestrategy) | String | N | "" | Hive table synchronization strategy. Available option: RO, RT, ALL. |
-| [sync_incremental](/docs/next/configurations#hoodiemetasyncincremental) | String | N | "" | Whether to incrementally sync the partitions to the metastore, i.e., only added, changed, and deleted partitions based on the commit metadata. If set to `false`, the meta sync executes a full partition sync operation when partitions are lost. |
+| [use_jdbc](configurations#hoodiedatasourcehive_syncuse_jdbc) | String | N | "" | Use JDBC when hive synchronization is enabled |
+| [mode](configurations#hoodiedatasourcehive_syncmode) | String | N | "" | Mode to choose for Hive ops. Valid values are hms, jdbc and hiveql. |
+| [partition_fields](configurations#hoodiedatasourcehive_syncpartition_fields) | String | N | "" | Field in the table to use for determining hive partition columns. |
+| [partition_extractor_class](configurations#hoodiedatasourcehive_syncpartition_extractor_class) | String | N | "" | Class which implements PartitionValueExtractor to extract the partition values, default 'org.apache.hudi.hive.MultiPartKeysValueExtractor'. |
+| [strategy](configurations#hoodiedatasourcehive_synctablestrategy) | String | N | "" | Hive table synchronization strategy. Available options: RO, RT, ALL. |
+| [sync_incremental](configurations#hoodiemetasyncincremental) | String | N | "" | Whether to incrementally sync the partitions to the metastore, i.e., only added, changed, and deleted partitions based on the commit metadata. If set to `false`, the meta sync executes a full partition sync operation when partitions are lost. |
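For illustration, a hedged sketch of calling the sync procedure with a subset of the parameters above (the procedure name `hive_sync` and all argument values are assumptions for this sketch):

```scala
// Sync table metadata and partitions to a Hive metastore using HMS mode
spark.sql(
  """CALL hive_sync(
    |  table => 'hudi_table',
    |  metastore_uri => 'thrift://localhost:9083',
    |  mode => 'hms',
    |  partition_fields => 'city',
    |  partition_extractor_class => 'org.apache.hudi.hive.MultiPartKeysValueExtractor'
    |)""".stripMargin).show(false)
```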
@@ -1848,18 +1848,18 @@ Convert an existing table to Hudi.
|------------------------------------------------------------------------------|---------|----------|-------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| table | String | Y | None | Name of table to be clustered |
| table_type | String | Y | None | Table type, MERGE_ON_READ or COPY_ON_WRITE |
-| [bootstrap_path](/docs/next/configurations#hoodiebootstrapbasepath) | String | Y | None | Base path of the dataset that needs to be bootstrapped as a Hudi table |
+| [bootstrap_path](configurations#hoodiebootstrapbasepath) | String | Y | None | Base path of the dataset that needs to be bootstrapped as a Hudi table |
| base_path | String | Y | None | Base path |
| rowKey_field | String | Y | None | Primary key field |
| base_file_format | String | N | "PARQUET" | Format of base file |
| partition_path_field | String | N | "" | Partitioned column field |
-| [bootstrap_index_class](/docs/next/configurations#hoodiebootstrapindexclass) | String | N | "org.apache.hudi.common.bootstrap.index.HFileBootstrapIndex" | Implementation to use, for mapping a skeleton base file to a bootstrap base file. |
-| [selector_class](/docs/next/configurations#hoodiebootstrapmodeselector) | String | N | "org.apache.hudi.client.bootstrap.selector.MetadataOnlyBootstrapModeSelector" | Selects the mode in which each file/partition in the bootstrapped dataset gets bootstrapped |
+| [bootstrap_index_class](configurations#hoodiebootstrapindexclass) | String | N | "org.apache.hudi.common.bootstrap.index.HFileBootstrapIndex" | Implementation to use, for mapping a skeleton base file to a bootstrap base file. |
+| [selector_class](configurations#hoodiebootstrapmodeselector) | String | N | "org.apache.hudi.client.bootstrap.selector.MetadataOnlyBootstrapModeSelector" | Selects the mode in which each file/partition in the bootstrapped dataset gets bootstrapped |
| key_generator_class | String | N | "org.apache.hudi.keygen.SimpleKeyGenerator" | Class of key generator |
| full_bootstrap_input_provider | String | N | "org.apache.hudi.bootstrap.SparkParquetBootstrapDataProvider" | Class of full bootstrap input provider |
| schema_provider_class | String | N | "" | Class of schema provider |
| payload_class | String | N | "org.apache.hudi.common.model.OverwriteWithLatestAvroPayload" | Class of payload |
-| [parallelism](/docs/next/configurations#hoodiebootstrapparallelism) | Int | N | 1500 | For metadata-only bootstrap, Hudi parallelizes the operation so that each table partition is handled by one Spark task. This config limits the number of parallelism. We pick the configured parallelism if the number of table partitions is larger than this configured value. The parallelism is assigned to the number of table partitions if it is smaller than the configured value. For full-record bootstrap, i.e., BULK_INSERT operation of the records, this configured value is passed as the BULK_INSERT shuffle parallelism (`hoodie.bulkinsert.shuffle.parallelism`), determining the BULK_INSERT write behavior. If you see that the bootstrap is slow due to the limited parallelism, you can increase this. |
+| [parallelism](configurations#hoodiebootstrapparallelism) | Int | N | 1500 | For metadata-only bootstrap, Hudi parallelizes the operation so that each table partition is handled by one Spark task. This config caps that parallelism: the configured value is used if the number of table partitions is larger than it; otherwise, the parallelism is set to the number of table partitions. For full-record bootstrap, i.e., BULK_INSERT operation of the records, this configured value is passed as the BULK_INSERT shuffle parallelism (`hoodie.bulkinsert.shuffle.parallelism`), determining the BULK_INSERT write behavior. If you see that the bootstrap is slow due to the limited parallelism, you can increase this. |
| enable_hive_sync | Boolean | N | false | Whether to enable hive sync |
| props_file_path | String | N | "" | Path of properties file |
| bootstrap_overwrite | Boolean | N | false | Overwrite bootstrap path |
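A hedged sketch of bootstrapping an existing Parquet dataset using the parameters above (the procedure name `run_bootstrap`, the paths and the field names are assumptions; per the table, the default selector performs a metadata-only bootstrap):

```scala
// Convert an existing parquet dataset into a Hudi COPY_ON_WRITE table (metadata-only by default)
spark.sql(
  """CALL run_bootstrap(
    |  table => 'hudi_table',
    |  table_type => 'COPY_ON_WRITE',
    |  bootstrap_path => '/data/source_parquet_table',
    |  base_path => '/data/hudi_table',
    |  rowKey_field => 'uuid',
    |  partition_path_field => 'city'
    |)""".stripMargin).show(false)
```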
diff --git a/website/versioned_docs/version-0.14.1/querying_data.md b/website/versioned_docs/version-0.14.1/querying_data.md
index c43ee1fd7f45..ee330fede3a9 100644
--- a/website/versioned_docs/version-0.14.1/querying_data.md
+++ b/website/versioned_docs/version-0.14.1/querying_data.md
@@ -7,7 +7,7 @@ last_modified_at: 2019-12-30T15:59:57-04:00
---
:::danger
-This page is no longer maintained. Please refer to Hudi [SQL DDL](/docs/next/sql_ddl), [SQL DML](/docs/next/sql_dml), [SQL Queries](/docs/next/sql_queries) and [Procedures](/docs/next/procedures) for the latest documentation.
+This page is no longer maintained. Please refer to Hudi [SQL DDL](sql_ddl), [SQL DML](sql_dml), [SQL Queries](sql_queries) and [Procedures](procedures) for the latest documentation.
:::
Conceptually, Hudi stores data physically once on DFS, while providing 3 different ways of querying, as explained [before](/docs/concepts#query-types).
diff --git a/website/versioned_docs/version-0.14.1/quick-start-guide.md b/website/versioned_docs/version-0.14.1/quick-start-guide.md
index 6f07e7363b68..3c78608036fa 100644
--- a/website/versioned_docs/version-0.14.1/quick-start-guide.md
+++ b/website/versioned_docs/version-0.14.1/quick-start-guide.md
@@ -223,7 +223,7 @@ CREATE TABLE hudi_table (
PARTITIONED BY (city);
```
-For more options for creating Hudi tables or if you're running into any issues, please refer to [SQL DDL](/docs/next/sql_ddl) reference guide.
+For more options for creating Hudi tables or if you're running into any issues, please refer to the [SQL DDL](sql_ddl) reference guide.
@@ -267,7 +267,7 @@ inserts.write.format("hudi").
```
:::info Mapping to Hudi write operations
-Hudi provides a wide range of [write operations](/docs/next/write_operations) - both batch and incremental - to write data into Hudi tables,
+Hudi provides a wide range of [write operations](write_operations) - both batch and incremental - to write data into Hudi tables,
with different semantics and performance. When record keys are not configured (see [keys](#keys) below), `bulk_insert` will be chosen as
the write operation, matching the out-of-the-box behavior of Spark's Parquet Datasource.
:::
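To make the mapping concrete, here is a minimal sketch of overriding the default choice through the documented `hoodie.datasource.write.operation` config (reusing the `inserts` DataFrame from the snippet above; the table name and path are placeholders):

```scala
// Force bulk_insert (or insert/upsert) instead of relying on the default operation mapping
inserts.write.format("hudi")
  .option("hoodie.table.name", "hudi_table")
  .option("hoodie.datasource.write.operation", "bulk_insert")
  .mode("append")
  .save("/tmp/hudi_table")
```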
@@ -300,7 +300,7 @@ inserts.write.format("hudi"). \
```
:::info Mapping to Hudi write operations
-Hudi provides a wide range of [write operations](/docs/next/write_operations) - both batch and incremental - to write data into Hudi tables,
+Hudi provides a wide range of [write operations](write_operations) - both batch and incremental - to write data into Hudi tables,
with different semantics and performance. When record keys are not configured (see [keys](#keys) below), `bulk_insert` will be chosen as
the write operation, matching the out-of-the-box behavior of Spark's Parquet Datasource.
:::
@@ -309,7 +309,7 @@ the write operation, matching the out-of-behavior of Spark's Parquet Datasource.
-Users can use 'INSERT INTO' to insert data into a Hudi table. See [Insert Into](/docs/next/sql_dml#insert-into) for more advanced options.
+Users can use 'INSERT INTO' to insert data into a Hudi table. See [Insert Into](sql_dml#insert-into) for more advanced options.
```sql
INSERT INTO hudi_table
@@ -421,7 +421,7 @@ Notice that the save mode is now `Append`. In general, always use append mode un
-Hudi table can be update using a regular UPDATE statement. See [Update](/docs/next/sql_dml#update) for more advanced options.
+A Hudi table can be updated using a regular UPDATE statement. See [Update](sql_dml#update) for more advanced options.
```sql
UPDATE hudi_table SET fare = 25.0 WHERE rider = 'rider-D';
@@ -451,7 +451,7 @@ Notice that the save mode is now `Append`. In general, always use append mode un
-[Querying](#querying) the data again will now show updated records. Each write operation generates a new [commit](/docs/next/concepts).
+[Querying](#querying) the data again will now show updated records. Each write operation generates a new [commit](concepts).
Look for changes in `_hoodie_commit_time`, `fare` fields for the given `_hoodie_record_key` value from a previous commit.
## Merging Data {#merge}
@@ -539,7 +539,7 @@ MERGE statement either using `SET *` or using `SET column1 = expression1 [, colu
## Delete data {#deletes}
Delete operation removes the records specified from the table. For example, this code snippet deletes records
-for the HoodieKeys passed in. Check out the [deletion section](/docs/next/writing_data#deletes) for more details.
+for the HoodieKeys passed in. Check out the [deletion section](writing_data#deletes) for more details.
:::note Implications of defining record keys
-Configuring keys for a Hudi table, has a new implications on the table. If record key is set by the user, `upsert` is chosen as the [write operation](/docs/next/write_operations).
+Configuring keys for a Hudi table has implications on the table. If a record key is set by the user, `upsert` is chosen as the [write operation](write_operations).
Also if a record key is configured, then it's also advisable to specify a precombine or ordering field, to correctly handle cases where the source data has
multiple records with the same key. See section below.
:::
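For reference, a minimal sketch of configuring a record key together with a precombine/ordering field on the write path (the config keys are the documented datasource options; the field names, table name and path are placeholders):

```scala
// With a record key set, upsert is chosen; the precombine field (here `ts`)
// decides which record wins when a batch carries duplicates of the same key
inserts.write.format("hudi")
  .option("hoodie.table.name", "hudi_table")
  .option("hoodie.datasource.write.recordkey.field", "uuid")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .mode("append")
  .save("/tmp/hudi_table")
```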
@@ -1108,29 +1108,29 @@ PARTITIONED BY (city);
## Where to go from here?
You can also [build hudi yourself](https://github.com/apache/hudi#building-apache-hudi-from-source) and try this quickstart using `--jars `(see also [build with scala 2.12](https://github.com/apache/hudi#build-with-different-spark-versions))
-for more info. If you are looking for ways to migrate your existing data to Hudi, refer to [migration guide](/docs/next/migration_guide).
+for more info. If you are looking for ways to migrate your existing data to Hudi, refer to the [migration guide](migration_guide).
### Spark SQL Reference
-For advanced usage of spark SQL, please refer to [Spark SQL DDL](/docs/next/sql_ddl) and [Spark SQL DML](/docs/next/sql_dml) reference guides.
-For alter table commands, check out [this](/docs/next/sql_ddl#spark-alter-table). Stored procedures provide a lot of powerful capabilities using Hudi SparkSQL to assist with monitoring, managing and operating Hudi tables, please check [this](/docs/next/procedures) out.
+For advanced usage of Spark SQL, please refer to the [Spark SQL DDL](sql_ddl) and [Spark SQL DML](sql_dml) reference guides.
+For alter table commands, check out [this](sql_ddl#spark-alter-table). Stored procedures provide a lot of powerful capabilities using Hudi SparkSQL to assist with monitoring, managing and operating Hudi tables; please check [this](procedures) out.
### Streaming workloads
Hudi provides industry-leading performance and functionality for streaming data.
-**Hudi Streamer** - Hudi provides an incremental ingestion/ETL tool - [HoodieStreamer](/docs/next/hoodie_streaming_ingestion#hudi-streamer), to assist with ingesting data into Hudi
+**Hudi Streamer** - Hudi provides an incremental ingestion/ETL tool - [HoodieStreamer](hoodie_streaming_ingestion#hudi-streamer), to assist with ingesting data into Hudi
from various different sources in a streaming manner, with powerful built-in capabilities like auto checkpointing, schema enforcement via schema provider,
transformation support, automatic table services and so on.
-**Structured Streaming** - Hudi supports Spark Structured Streaming reads and writes as well. Please see [here](/docs/next/hoodie_streaming_ingestion#structured-streaming) for more.
+**Structured Streaming** - Hudi supports Spark Structured Streaming reads and writes as well. Please see [here](hoodie_streaming_ingestion#structured-streaming) for more.
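As a rough sketch (the paths, checkpoint location and field names are placeholders), a Hudi table can be consumed and written with Structured Streaming like so:

```scala
// Stream records out of one Hudi table and continuously upsert them into another
val streamingDF = spark.readStream.format("hudi").load("/tmp/hudi_source_table")

streamingDF.writeStream.format("hudi")
  .option("hoodie.table.name", "hudi_sink_table")
  .option("hoodie.datasource.write.recordkey.field", "uuid")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .option("checkpointLocation", "/tmp/checkpoints/hudi_sink_table")
  .outputMode("append")
  .start("/tmp/hudi_sink_table")
```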
-Check out more information on [modeling data in Hudi](/docs/next/faq_general#how-do-i-model-the-data-stored-in-hudi) and different ways to [writing Hudi Tables](/docs/next/writing_data).
+Check out more information on [modeling data in Hudi](faq_general#how-do-i-model-the-data-stored-in-hudi) and different ways of [writing Hudi Tables](writing_data).
### Dockerized Demo
Even as we showcased the core capabilities, Hudi supports a lot more advanced functionality that can make it easy
to get your transactional data lakes up and running quickly, across a variety of query engines like Hive, Flink, Spark, Presto, Trino and much more.
We have put together a [demo video](https://www.youtube.com/watch?v=VhNgUsxdrD0) that showcases all of this on a docker based setup with all
dependent systems running locally. We recommend you replicate the same setup and run the demo yourself, by following
-steps [here](/docs/next/docker_demo) to get a taste for it.
+the steps [here](docker_demo) to get a taste for it.
diff --git a/website/versioned_docs/version-0.14.1/record_payload.md b/website/versioned_docs/version-0.14.1/record_payload.md
index 105a87ae9a02..0f514dced09e 100644
--- a/website/versioned_docs/version-0.14.1/record_payload.md
+++ b/website/versioned_docs/version-0.14.1/record_payload.md
@@ -172,6 +172,6 @@ provides support for applying changes captured via Amazon Database Migration Ser
Record payloads are tunable to suit many use cases. Please check out the configurations
listed [here](/docs/configurations#RECORD_PAYLOAD). Moreover, if users want to implement their own custom merge logic,
-please check out [this FAQ](/docs/next/faq_writing_tables/#can-i-implement-my-own-logic-for-how-input-records-are-merged-with-record-on-storage). In a
+please check out [this FAQ](faq_writing_tables/#can-i-implement-my-own-logic-for-how-input-records-are-merged-with-record-on-storage). In a
separate document, we will talk about a new record merger API for optimized payload handling.
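For example, the payload implementation can be chosen per write via the documented `hoodie.datasource.write.payload.class` config; a minimal sketch (the `upserts` DataFrame, table name and path are placeholders):

```scala
// Use DefaultHoodieRecordPayload so the ordering field is honored when merging;
// a custom payload class could be plugged in the same way
upserts.write.format("hudi")
  .option("hoodie.table.name", "hudi_table")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .option("hoodie.datasource.write.payload.class",
    "org.apache.hudi.common.model.DefaultHoodieRecordPayload")
  .mode("append")
  .save("/tmp/hudi_table")
```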
diff --git a/website/versioned_docs/version-0.14.1/sql_ddl.md b/website/versioned_docs/version-0.14.1/sql_ddl.md
index 0c953905fcb4..eb44b5da7c1d 100644
--- a/website/versioned_docs/version-0.14.1/sql_ddl.md
+++ b/website/versioned_docs/version-0.14.1/sql_ddl.md
@@ -104,7 +104,7 @@ TBLPROPERTIES (
```
### Create table from an external location
-Often, Hudi tables are created from streaming writers like the [streamer tool](/docs/next/hoodie_streaming_ingestion#hudi-streamer), which
+Often, Hudi tables are created from streaming writers like the [streamer tool](hoodie_streaming_ingestion#hudi-streamer), which
may later need some SQL statements to run on them. You can create an External table using the `location` statement.
```sql
@@ -389,7 +389,7 @@ Users can set table properties while creating a table. The important table prope
#### Passing Lock Providers for Concurrent Writers
Hudi requires a lock provider to support concurrent writers or asynchronous table services when using OCC
-and [NBCC](/docs/next/concurrency_control#non-blocking-concurrency-control-mode-experimental) (Non-Blocking Concurrency Control)
+and [NBCC](concurrency_control#non-blocking-concurrency-control) (Non-Blocking Concurrency Control)
concurrency mode. For NBCC mode, locking is only used to write the commit metadata file in the timeline. Writes are
serialized by completion time. Users can pass these table properties into *TBLPROPERTIES* as well. Below is an example
for a Zookeeper based configuration.
@@ -612,7 +612,7 @@ ALTER TABLE tableA RENAME TO tableB;
### Setting Hudi configs
#### Using table options
-You can configure hoodie configs in table options when creating a table. You can refer Flink specific hoodie configs [here](/docs/next/configurations#FLINK_SQL)
+You can configure hoodie configs in table options when creating a table. You can refer to the Flink-specific hoodie configs [here](configurations#FLINK_SQL).
These configs will be applied to all the operations on that table.
```sql
diff --git a/website/versioned_docs/version-0.14.1/sql_dml.md b/website/versioned_docs/version-0.14.1/sql_dml.md
index 102198435642..fec050936e41 100644
--- a/website/versioned_docs/version-0.14.1/sql_dml.md
+++ b/website/versioned_docs/version-0.14.1/sql_dml.md
@@ -12,7 +12,7 @@ import TabItem from '@theme/TabItem';
SparkSQL provides several Data Manipulation Language (DML) actions for interacting with Hudi tables. These operations allow you to insert, update, merge and delete data
from your Hudi tables. Let's explore them one by one.
-Please refer to [SQL DDL](/docs/next/sql_ddl) for creating Hudi tables using SQL.
+Please refer to [SQL DDL](sql_ddl) for creating Hudi tables using SQL.
### Insert Into
@@ -25,7 +25,7 @@ SELECT FROM
@@ -299,7 +299,7 @@ inserts.write.format("hudi").
```
:::info Mapping to Hudi write operations
-Hudi provides a wide range of [write operations](/docs/next/write_operations) - both batch and incremental - to write data into Hudi tables,
+Hudi provides a wide range of [write operations](write_operations) - both batch and incremental - to write data into Hudi tables,
with different semantics and performance. When record keys are not configured (see [keys](#keys) below), `bulk_insert` will be chosen as
the write operation, matching the out-of-the-box behavior of Spark's Parquet Datasource.
:::
@@ -332,7 +332,7 @@ inserts.write.format("hudi"). \
```
:::info Mapping to Hudi write operations
-Hudi provides a wide range of [write operations](/docs/next/write_operations) - both batch and incremental - to write data into Hudi tables,
+Hudi provides a wide range of [write operations](write_operations) - both batch and incremental - to write data into Hudi tables,
with different semantics and performance. When record keys are not configured (see [keys](#keys) below), `bulk_insert` will be chosen as
the write operation, matching the out-of-the-box behavior of Spark's Parquet Datasource.
:::
@@ -341,7 +341,7 @@ the write operation, matching the out-of-behavior of Spark's Parquet Datasource.
-Users can use 'INSERT INTO' to insert data into a Hudi table. See [Insert Into](/docs/next/sql_dml#insert-into) for more advanced options.
+Users can use 'INSERT INTO' to insert data into a Hudi table. See [Insert Into](sql_dml#insert-into) for more advanced options.
```sql
INSERT INTO hudi_table
@@ -453,7 +453,7 @@ Notice that the save mode is now `Append`. In general, always use append mode un
+A Hudi table can be updated using a regular UPDATE statement. See [Update](sql_dml#update) for more advanced options.
+Hudi table can be update using a regular UPDATE statement. See [Update](sql_dml#update) for more advanced options.
```sql
UPDATE hudi_table SET fare = 25.0 WHERE rider = 'rider-D';
@@ -483,7 +483,7 @@ Notice that the save mode is now `Append`. In general, always use append mode un
-[Querying](#querying) the data again will now show updated records. Each write operation generates a new [commit](/docs/next/concepts).
+[Querying](#querying) the data again will now show updated records. Each write operation generates a new [commit](concepts).
Look for changes in `_hoodie_commit_time`, `fare` fields for the given `_hoodie_record_key` value from a previous commit.
## Merging Data {#merge}
@@ -1067,7 +1067,7 @@ PARTITIONED BY (city);
>
:::note Implications of defining record keys
-Configuring keys for a Hudi table, has a new implications on the table. If record key is set by the user, `upsert` is chosen as the [write operation](/docs/next/write_operations).
+Configuring keys for a Hudi table has implications on the table. If a record key is set by the user, `upsert` is chosen as the [write operation](write_operations).
Also if a record key is configured, then it's also advisable to specify a precombine or ordering field, to correctly handle cases where the source data has
multiple records with the same key. See section below.
:::
@@ -1140,12 +1140,12 @@ PARTITIONED BY (city);
## Where to go from here?
You can also [build hudi yourself](https://github.com/apache/hudi#building-apache-hudi-from-source) and try this quickstart using `--jars `(see also [build with scala 2.12](https://github.com/apache/hudi#build-with-different-spark-versions))
-for more info. If you are looking for ways to migrate your existing data to Hudi, refer to [migration guide](/docs/next/migration_guide).
+for more info. If you are looking for ways to migrate your existing data to Hudi, refer to the [migration guide](migration_guide).
### Spark SQL Reference
-For advanced usage of spark SQL, please refer to [Spark SQL DDL](/docs/next/sql_ddl) and [Spark SQL DML](/docs/next/sql_dml) reference guides.
-For alter table commands, check out [this](/docs/next/sql_ddl#spark-alter-table). Stored procedures provide a lot of powerful capabilities using Hudi SparkSQL to assist with monitoring, managing and operating Hudi tables, please check [this](/docs/next/procedures) out.
+For advanced usage of Spark SQL, please refer to the [Spark SQL DDL](sql_ddl) and [Spark SQL DML](sql_dml) reference guides.
+For alter table commands, check out [this](sql_ddl#spark-alter-table). Stored procedures provide a lot of powerful capabilities using Hudi SparkSQL to assist with monitoring, managing and operating Hudi tables; please check [this](procedures) out.
### Streaming workloads
@@ -1155,14 +1155,14 @@ Hudi provides industry-leading performance and functionality for streaming data.
from various different sources in a streaming manner, with powerful built-in capabilities like auto checkpointing, schema enforcement via schema provider,
transformation support, automatic table services and so on.
-**Structured Streaming** - Hudi supports Spark Structured Streaming reads and writes as well. Please see [here](/docs/next/writing_tables_streaming_writes#spark-streaming) for more.
+**Structured Streaming** - Hudi supports Spark Structured Streaming reads and writes as well. Please see [here](writing_tables_streaming_writes#spark-streaming) for more.
-Check out more information on [modeling data in Hudi](/docs/next/faq_general#how-do-i-model-the-data-stored-in-hudi) and different ways to perform [batch writes](/docs/writing_data) and [streaming writes](/docs/next/writing_tables_streaming_writes).
+Check out more information on [modeling data in Hudi](faq_general#how-do-i-model-the-data-stored-in-hudi) and different ways to perform [batch writes](/docs/writing_data) and [streaming writes](writing_tables_streaming_writes).
### Dockerized Demo
Even as we showcased the core capabilities, Hudi supports a lot more advanced functionality that can make it easy
to get your transactional data lakes up and running quickly, across a variety of query engines like Hive, Flink, Spark, Presto, Trino and much more.
We have put together a [demo video](https://www.youtube.com/watch?v=VhNgUsxdrD0) that showcases all of this on a docker based setup with all
dependent systems running locally. We recommend you replicate the same setup and run the demo yourself, by following
-steps [here](/docs/next/docker_demo) to get a taste for it.
+the steps [here](docker_demo) to get a taste for it.
diff --git a/website/versioned_docs/version-0.15.0/record_payload.md b/website/versioned_docs/version-0.15.0/record_payload.md
index 105a87ae9a02..0f514dced09e 100644
--- a/website/versioned_docs/version-0.15.0/record_payload.md
+++ b/website/versioned_docs/version-0.15.0/record_payload.md
@@ -172,6 +172,6 @@ provides support for applying changes captured via Amazon Database Migration Ser
Record payloads are tunable to suit many use cases. Please check out the configurations
listed [here](/docs/configurations#RECORD_PAYLOAD). Moreover, if users want to implement their own custom merge logic,
-please check out [this FAQ](/docs/next/faq_writing_tables/#can-i-implement-my-own-logic-for-how-input-records-are-merged-with-record-on-storage). In a
+please check out [this FAQ](faq_writing_tables/#can-i-implement-my-own-logic-for-how-input-records-are-merged-with-record-on-storage). In a
separate document, we will talk about a new record merger API for optimized payload handling.
diff --git a/website/versioned_docs/version-0.15.0/sql_ddl.md b/website/versioned_docs/version-0.15.0/sql_ddl.md
index 4651e819cdb3..56dd6a677565 100644
--- a/website/versioned_docs/version-0.15.0/sql_ddl.md
+++ b/website/versioned_docs/version-0.15.0/sql_ddl.md
@@ -389,7 +389,7 @@ Users can set table properties while creating a table. The important table prope
#### Passing Lock Providers for Concurrent Writers
Hudi requires a lock provider to support concurrent writers or asynchronous table services when using OCC
-and [NBCC](/docs/next/concurrency_control#non-blocking-concurrency-control-mode-experimental) (Non-Blocking Concurrency Control)
+and [NBCC](concurrency_control#non-blocking-concurrency-control) (Non-Blocking Concurrency Control)
concurrency mode. For NBCC mode, locking is only used to write the commit metadata file in the timeline. Writes are
serialized by completion time. Users can pass these table properties into *TBLPROPERTIES* as well. Below is an example
for a Zookeeper based configuration.
@@ -612,7 +612,7 @@ ALTER TABLE tableA RENAME TO tableB;
### Setting Hudi configs
#### Using table options
-You can configure hoodie configs in table options when creating a table. You can refer Flink specific hoodie configs [here](/docs/next/configurations#FLINK_SQL)
+You can configure hoodie configs in table options when creating a table. You can refer to the Flink-specific hoodie configs [here](configurations#FLINK_SQL).
These configs will be applied to all the operations on that table.
```sql
diff --git a/website/versioned_docs/version-0.15.0/sql_dml.md b/website/versioned_docs/version-0.15.0/sql_dml.md
index edb63730b135..b94b382df68d 100644
--- a/website/versioned_docs/version-0.15.0/sql_dml.md
+++ b/website/versioned_docs/version-0.15.0/sql_dml.md
@@ -12,7 +12,7 @@ import TabItem from '@theme/TabItem';
SparkSQL provides several Data Manipulation Language (DML) actions for interacting with Hudi tables. These operations allow you to insert, update, merge and delete data
from your Hudi tables. Let's explore them one by one.
-Please refer to [SQL DDL](/docs/next/sql_ddl) for creating Hudi tables using SQL.
+Please refer to [SQL DDL](sql_ddl) for creating Hudi tables using SQL.
### Insert Into
@@ -25,7 +25,7 @@ SELECT FROM
@@ -404,7 +404,7 @@ You can check the data generated under `/tmp/hudi_trips_cow///<
[Modeling data stored in Hudi](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=113709185#FAQ-HowdoImodelthedatastoredinHudi)
and for info on ways to ingest data into Hudi, refer to [Writing Hudi Tables](/docs/writing_data).
Here we are using the default write operation : `upsert`. If you have a workload without updates, you can also issue
-`insert` or `bulk_insert` operations which could be faster. To know more, refer to [Write operations](/docs/writing_data#write-operations)
+`insert` or `bulk_insert` operations which could be faster. To know more, refer to [Write operations](writing_data#write-operations)
:::
diff --git a/website/versioned_docs/version-0.9.0/writing_data.md b/website/versioned_docs/version-0.9.0/writing_data.md
index 7671593bacf9..8f95514ef73b 100644
--- a/website/versioned_docs/version-0.9.0/writing_data.md
+++ b/website/versioned_docs/version-0.9.0/writing_data.md
@@ -415,10 +415,10 @@ column statistics etc. Even on some cloud data stores, there is often cost to li
Here are some ways to efficiently manage the storage of your Hudi tables.
- - The [small file handling feature](/docs/configurations#compactionSmallFileSize) in Hudi, profiles incoming workload
+ - The [small file handling feature](/docs/configurations#hoodieparquetsmallfilelimit) in Hudi profiles incoming workload
and distributes inserts to existing file groups instead of creating new file groups, which can lead to small files.
- - Cleaner can be [configured](/docs/configurations#retainCommits) to clean up older file slices, more or less aggressively depending on maximum time for queries to run & lookback needed for incremental pull
- - User can also tune the size of the [base/parquet file](/docs/configurations#limitFileSize), [log files](/docs/configurations#logFileMaxSize) & expected [compression ratio](/docs/configurations#parquetCompressionRatio),
+ - Cleaner can be [configured](configurations#retaincommitsno_of_commits_to_retain--24) to clean up older file slices, more or less aggressively depending on maximum time for queries to run & lookback needed for incremental pull
+ - Users can also tune the size of the [base/parquet file](/docs/configurations#hoodieparquetmaxfilesize), [log files](configurations#hoodielogfilemaxsize) & expected [compression ratio](/docs/configurations#parquetCompressionRatio),
such that sufficient number of inserts are grouped into the same file group, resulting in well sized base files ultimately.
 - Intelligently tuning the [bulk insert parallelism](/docs/configurations#withBulkInsertParallelism) can again result in nicely sized initial file groups (see the sketch after this list). It is in fact critical to get this right, since the file groups
once created cannot be deleted, but simply expanded as explained before.
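Pulling these knobs together, a hedged sketch of a write tuned for file sizing and retention (all values are placeholders to adapt per workload; the config keys are the documented ones):

```scala
// Target ~120MB base files, treat files under 100MB as small-file candidates,
// cap log files at 256MB, and keep 24 commits around for incremental pulls
df.write.format("hudi")
  .option("hoodie.table.name", "hudi_table")
  .option("hoodie.parquet.max.file.size", "125829120")       // 120MB
  .option("hoodie.parquet.small.file.limit", "104857600")    // 100MB
  .option("hoodie.logfile.max.size", "268435456")            // 256MB
  .option("hoodie.cleaner.commits.retained", "24")
  .option("hoodie.bulkinsert.shuffle.parallelism", "200")
  .mode("append")
  .save("/tmp/hudi_table")
```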