[site] Fix broken links across versions #12470

Merged: 1 commit, Dec 11, 2024
39 changes: 29 additions & 10 deletions README.md
@@ -129,16 +129,35 @@ versioned_sidebars/version-0.7.0-sidebars.json
```

### Linking docs

- Remember to include the `.md` extension.
- Files will be linked to the correct corresponding version.
- Relative paths work as well.

```md
The [@hello](hello.md#paginate) document is great!

See the [Tutorial](../getting-started/tutorial.md) for more info.
```
- Relative paths work well.
- Files will be linked to the correct corresponding version.
- PREFER RELATIVE PATHS to keep linking consistent.
- **Good example of linking.**
  For example, say we are updating a 0.12.0 version doc (an older release).
```md
A [callback notification](writing_data#commit-notifications) is exposed
```
This automatically resolves to /docs/0.12.0/writing_data#commit-notifications.
- **Bad example of linking.**
  For example, say we are updating a 0.12.0 version doc (an older release).
```md
A [callback notification](/docs/writing_data#commit-notifications) is exposed
```
This will resolve to the most recent release, specifically /docs/writing_data#commit-notifications. We do not want a 0.12.0 doc page to point to a page from a later release.
- DO NOT use the `next` version when linking.
**Member** commented:

Using caution box syntax to highlight? https://github.com/orgs/community/discussions/16925

**Contributor (author)** replied:

Just saw this. Will fix the README in a future PR.

- **Good example of linking when you are working on the unreleased version (the `next` docs).**
```md
Hudi adopts Multiversion Concurrency Control (MVCC), where [compaction](compaction) action merges logs and base files to produce new
file slices and [cleaning](cleaning) action gets rid of unused/older file slices to reclaim space on the file system.
```
This automatically resolves to the /docs/next/compaction and /docs/next/cleaning pages.

- **Bad example of linking when you are working on the unreleased version (the `next` docs).**
```md
Hudi adopts Multiversion Concurrency Control (MVCC), where [compaction](/docs/next/compaction) action merges logs and base files to produce new
file slices and [cleaning](/docs/next/cleaning) action gets rid of unused/older file slices to reclaim space on the file system.
```
Even though it points directly to /docs/next, which is the intended target, this accumulates as tech debt: once this copy of the docs is released, we will have an older doc permanently pointing to /docs/next/.


## Versions

2 changes: 1 addition & 1 deletion website/blog/2019-09-09-ingesting-database-changes.md
@@ -44,5 +44,5 @@ inputDataset.write.format("org.apache.hudi")
.save("/path/on/dfs");
```

Alternatively, you can also use the Hudi [DeltaStreamer](https://hudi.apache.org/writing_data#deltastreamer) tool with the DFSSource.
Alternatively, you can also use the Hudi [DeltaStreamer](https://hudi.apache.org/docs/hoodie_streaming_ingestion#hudi-streamer) tool with the DFSSource.

2 changes: 1 addition & 1 deletion website/blog/2020-01-20-change-capture-using-aws.md
@@ -20,7 +20,7 @@ In this blog, we will build an end-end solution for capturing changes from a MySQL
We can break up the problem into two pieces.

1. **Extracting change logs from MySQL** : Surprisingly, this is still a pretty tricky problem to solve and often Hudi users get stuck here. Thankfully, at-least for AWS users, there is a [Database Migration service](https://aws.amazon.com/dms/) (DMS for short), that does this change capture and uploads them as parquet files on S3
2. **Applying these change logs to your data lake table** : Once there are change logs in some form, the next step is to apply them incrementally to your table. This mundane task can be fully automated using the Hudi [DeltaStreamer](http://hudi.apache.org/docs/writing_data#deltastreamer) tool.
2. **Applying these change logs to your data lake table** : Once there are change logs in some form, the next step is to apply them incrementally to your table. This mundane task can be fully automated using the Hudi [DeltaStreamer](http://hudi.apache.org/docs/hoodie_streaming_ingestion#hudi-streamer) tool.



2 changes: 1 addition & 1 deletion website/blog/2021-01-27-hudi-clustering-intro.md
@@ -17,7 +17,7 @@ Apache Hudi brings stream processing to big data, providing fresh data while being

## Clustering Architecture

At a high level, Hudi provides different operations such as insert/upsert/bulk_insert through it’s write client API to be able to write data to a Hudi table. To be able to choose a trade-off between file size and ingestion speed, Hudi provides a knob `hoodie.parquet.small.file.limit` to be able to configure the smallest allowable file size. Users are able to configure the small file [soft limit](https://hudi.apache.org/docs/configurations#compactionSmallFileSize) to `0` to force new data to go into a new set of filegroups or set it to a higher value to ensure new data gets “padded” to existing files until it meets that limit that adds to ingestion latencies.
At a high level, Hudi provides different operations such as insert/upsert/bulk_insert through it’s write client API to be able to write data to a Hudi table. To be able to choose a trade-off between file size and ingestion speed, Hudi provides a knob `hoodie.parquet.small.file.limit` to be able to configure the smallest allowable file size. Users are able to configure the small file [soft limit](https://hudi.apache.org/docs/configurations#hoodieparquetsmallfilelimit) to `0` to force new data to go into a new set of filegroups or set it to a higher value to ensure new data gets “padded” to existing files until it meets that limit that adds to ingestion latencies.
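
As a hedged illustration only (the options map and the example values below are assumptions, not taken from the post), the two behaviours described above might be selected through Spark DataSource write options like so:

```scala
// Sketch: the small-file knob discussed above, expressed as Hudi write options.
// "0" forces new data into fresh file groups; a higher byte value pads existing small files.
val forceNewFileGroups: Map[String, String] = Map(
  "hoodie.parquet.small.file.limit" -> "0"
)
val padSmallFilesUpTo100Mb: Map[String, String] = Map(
  "hoodie.parquet.small.file.limit" -> String.valueOf(100 * 1024 * 1024)
)
// Usage (illustrative): df.write.format("hudi").options(padSmallFilesUpTo100Mb)...save(basePath)
```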



6 changes: 3 additions & 3 deletions website/blog/2021-03-01-hudi-file-sizing.md
@@ -36,9 +36,9 @@ For illustration purposes, we are going to consider only COPY_ON_WRITE table.

Configs of interest before we dive into the algorithm:

- [Max file size](/docs/configurations#limitFileSize): Max size for a given data file. Hudi will try to maintain file sizes to this configured value <br/>
- [Soft file limit](/docs/configurations#compactionSmallFileSize): Max file size below which a given data file is considered to a small file <br/>
- [Insert split size](/docs/configurations#insertSplitSize): Number of inserts grouped for a single partition. This value should match
- [Max file size](/docs/configurations#hoodieparquetmaxfilesize): Max size for a given data file. Hudi will try to maintain file sizes to this configured value <br/>
- [Soft file limit](/docs/configurations#hoodieparquetsmallfilelimit): Max file size below which a given data file is considered to a small file <br/>
- [Insert split size](/docs/configurations#hoodiecopyonwriteinsertsplitsize): Number of inserts grouped for a single partition. This value should match
the number of records in a single file (you can determine based on max file size and per record size)
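
A minimal sketch, assuming the Spark DataSource write path (the table name, key field and concrete values are illustrative assumptions, not recommendations), of how these three configs might be set together:

```scala
// Sketch: 120MB max file size, 100MB small-file threshold, 500k-record insert splits.
// Only the three sizing keys come from the list above; everything else is illustrative.
import org.apache.spark.sql.{DataFrame, SaveMode}

def writeWithFileSizing(df: DataFrame, basePath: String): Unit = {
  df.write
    .format("hudi")
    .option("hoodie.table.name", "example_table")               // hypothetical table name
    .option("hoodie.datasource.write.recordkey.field", "uuid")  // hypothetical key field
    .option("hoodie.parquet.max.file.size", String.valueOf(120 * 1024 * 1024))
    .option("hoodie.parquet.small.file.limit", String.valueOf(100 * 1024 * 1024))
    .option("hoodie.copyonwrite.insert.split.size", "500000")
    .mode(SaveMode.Append)
    .save(basePath)
}
```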

For instance, if your first config value is 120MB and 2nd config value is set to 100MB, any file whose size is < 100MB
2 changes: 1 addition & 1 deletion website/blog/2021-08-16-kafka-custom-deserializer.md
@@ -18,7 +18,7 @@ In our case a Confluent schema registry is used to maintain the schema and as sc
<!--truncate-->

## What do we want to achieve?
We have multiple instances of DeltaStreamer running, consuming many topics with different schemas ingesting to multiple Hudi tables. Deltastreamer is a utility in Hudi to assist in ingesting data from multiple sources like DFS, kafka, etc into Hudi. If interested, you can read more about DeltaStreamer tool [here](https://hudi.apache.org/docs/writing_data#deltastreamer)
We have multiple instances of DeltaStreamer running, consuming many topics with different schemas ingesting to multiple Hudi tables. Deltastreamer is a utility in Hudi to assist in ingesting data from multiple sources like DFS, kafka, etc into Hudi. If interested, you can read more about DeltaStreamer tool [here](https://hudi.apache.org/docs/hoodie_streaming_ingestion#hudi-streamer)
Ideally every topic should be able to evolve the schema to match new business requirements. Producers start producing data with a new schema version and the DeltaStreamer picks up the new schema and ingests the data with the new schema. For this to work, we run our DeltaStreamer instances with the latest schema version available from the Schema Registry to ensure that we always use the freshest schema with all attributes.
A prerequisites is that all the mentioned Schema evolutions must be `BACKWARD_TRANSITIVE` compatible (see [Schema Evolution and Compatibility of Avro Schema changes](https://docs.confluent.io/platform/current/schema-registry/avro.html). This ensures that every record in the kafka topic can always be read using the latest schema.

2 changes: 1 addition & 1 deletion website/docs/cli.md
@@ -578,7 +578,7 @@ Compaction successfully repaired

### Savepoint and Restore
As the name suggest, "savepoint" saves the table as of the commit time, so that it lets you restore the table to this
savepoint at a later point in time if need be. You can read more about savepoints and restore [here](/docs/next/disaster_recovery)
savepoint at a later point in time if need be. You can read more about savepoints and restore [here](disaster_recovery)

To trigger savepoint for a hudi table
```java
6 changes: 3 additions & 3 deletions website/docs/concurrency_control.md
@@ -8,7 +8,7 @@ last_modified_at: 2021-03-19T15:59:57-04:00
---
Concurrency control defines how different writers/readers/table services coordinate access to a Hudi table. Hudi ensures atomic writes, by way of publishing commits atomically to the timeline,
stamped with an instant time that denotes the time at which the action is deemed to have occurred. Unlike general purpose file version control, Hudi draws clear distinction between
writer processes that issue [write operations](/docs/next/write_operations) and table services that (re)write data/metadata to optimize/perform bookkeeping and
writer processes that issue [write operations](write_operations) and table services that (re)write data/metadata to optimize/perform bookkeeping and
readers (that execute queries and read data).

Hudi provides
@@ -23,7 +23,7 @@ We’ll also describe ways to ingest data into a Hudi Table from multiple writer

## Distributed Locking
A pre-requisite for distributed co-ordination in Hudi, like many other distributed database systems is a distributed lock provider, that different processes can use to plan, schedule and
execute actions on the Hudi timeline in a concurrent fashion. Locks are also used to [generate TrueTime](/docs/next/timeline#truetime-generation), as discussed before.
execute actions on the Hudi timeline in a concurrent fashion. Locks are also used to [generate TrueTime](timeline#truetime-generation), as discussed before.

External locking is typically used in conjunction with optimistic concurrency control
because it provides a way to prevent conflicts that might occur when two or more transactions (commits in our case) attempt to modify the same resource concurrently.
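
For illustration only (the ZooKeeper endpoint, lock key and base path below are placeholders; treat this as a sketch rather than a reference configuration), a multi-writer job using optimistic concurrency control with an external lock provider might carry options along these lines:

```scala
// Sketch: optimistic concurrency control backed by a ZooKeeper-based lock provider.
// Endpoint values are placeholders; adjust to your lock provider of choice.
val lockOptions: Map[String, String] = Map(
  "hoodie.write.concurrency.mode"         -> "optimistic_concurrency_control",
  "hoodie.cleaner.policy.failed.writes"   -> "LAZY",
  "hoodie.write.lock.provider"            -> "org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider",
  "hoodie.write.lock.zookeeper.url"       -> "zk-host",         // placeholder host
  "hoodie.write.lock.zookeeper.port"      -> "2181",
  "hoodie.write.lock.zookeeper.lock_key"  -> "example_table",   // placeholder lock key
  "hoodie.write.lock.zookeeper.base_path" -> "/hudi_locks"      // placeholder ZK path
)
// Usage (illustrative): df.write.format("hudi").options(lockOptions)...save(basePath)
```
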
@@ -204,7 +204,7 @@ Multiple writers can operate on the table with non-blocking conflict resolution.
file group with the conflicts resolved automatically by the query reader and the compactor. The new concurrency mode is
currently available for preview in version 1.0.0-beta only with the caveat that conflict resolution is not supported yet
between clustering and ingestion. It works for compaction and ingestion, and we can see an example of that with Flink
writers [here](/docs/next/sql_dml#non-blocking-concurrency-control-experimental).
writers [here](sql_dml#non-blocking-concurrency-control-experimental).

## Early conflict Detection

2 changes: 1 addition & 1 deletion website/docs/deployment.md
@@ -136,7 +136,7 @@ Here is an example invocation for reading from kafka topic in a continuous mode

### Spark Datasource Writer Jobs

As described in [Batch Writes](/docs/next/writing_data#spark-datasource-api), you can use spark datasource to ingest to hudi table. This mechanism allows you to ingest any spark dataframe in Hudi format. Hudi Spark DataSource also supports spark streaming to ingest a streaming source to Hudi table. For Merge On Read table types, inline compaction is turned on by default which runs after every ingestion run. The compaction frequency can be changed by setting the property "hoodie.compact.inline.max.delta.commits".
As described in [Batch Writes](writing_data#spark-datasource-api), you can use spark datasource to ingest to hudi table. This mechanism allows you to ingest any spark dataframe in Hudi format. Hudi Spark DataSource also supports spark streaming to ingest a streaming source to Hudi table. For Merge On Read table types, inline compaction is turned on by default which runs after every ingestion run. The compaction frequency can be changed by setting the property "hoodie.compact.inline.max.delta.commits".

Here is an example invocation using spark datasource
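
The invocation itself is collapsed in this diff view; purely as a hedged sketch (the table name and the commit count are assumptions), the Merge On Read compaction knobs mentioned above might be passed like this:

```scala
// Sketch: datasource write to a MOR table, compacting inline every 4 delta commits.
import org.apache.spark.sql.{DataFrame, SaveMode}

def morWrite(df: DataFrame, basePath: String): Unit = {
  df.write
    .format("hudi")
    .option("hoodie.table.name", "example_mor_table")             // hypothetical table name
    .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
    .option("hoodie.compact.inline", "true")
    .option("hoodie.compact.inline.max.delta.commits", "4")
    .mode(SaveMode.Append)
    .save(basePath)
}
```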

14 changes: 7 additions & 7 deletions website/docs/faq.md
@@ -6,10 +6,10 @@ keywords: [hudi, writing, reading]

The FAQs are split into following pages. Please refer to the specific pages for more info.

- [General](/docs/next/faq_general)
- [Design & Concepts](/docs/next/faq_design_and_concepts)
- [Writing Tables](/docs/next/faq_writing_tables)
- [Reading Tables](/docs/next/faq_reading_tables)
- [Table Services](/docs/next/faq_table_services)
- [Storage](/docs/next/faq_storage)
- [Integrations](/docs/next/faq_integrations)
- [General](faq_general)
- [Design & Concepts](faq_design_and_concepts)
- [Writing Tables](faq_writing_tables)
- [Reading Tables](faq_reading_tables)
- [Table Services](faq_table_services)
- [Storage](faq_storage)
- [Integrations](faq_integrations)
2 changes: 1 addition & 1 deletion website/docs/faq_general.md
@@ -61,7 +61,7 @@ Nonetheless, Hudi is designed very much like a database and provides similar fun

### How do I model the data stored in Hudi?

When writing data into Hudi, you model the records like how you would on a key-value store - specify a key field (unique for a single partition/across table), a partition field (denotes partition to place key into) and preCombine/combine logic that specifies how to handle duplicates in a batch of records written. This model enables Hudi to enforce primary key constraints like you would get on a database table. See [here](/docs/next/writing_data) for an example.
When writing data into Hudi, you model the records like how you would on a key-value store - specify a key field (unique for a single partition/across table), a partition field (denotes partition to place key into) and preCombine/combine logic that specifies how to handle duplicates in a batch of records written. This model enables Hudi to enforce primary key constraints like you would get on a database table. See [here](writing_data) for an example.
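
A hedged sketch of what that modeling looks like in practice (the table and field names here are assumptions, not from the FAQ):

```scala
// Sketch: record key, partition path and precombine field declared on a Spark DataSource upsert.
import org.apache.spark.sql.{DataFrame, SaveMode}

def upsertOrders(df: DataFrame, basePath: String): Unit = {
  df.write
    .format("hudi")
    .option("hoodie.table.name", "orders")                                // hypothetical table
    .option("hoodie.datasource.write.recordkey.field", "order_id")        // primary key field
    .option("hoodie.datasource.write.partitionpath.field", "order_date")  // partition field
    .option("hoodie.datasource.write.precombine.field", "updated_at")     // resolves duplicates in a batch
    .option("hoodie.datasource.write.operation", "upsert")
    .mode(SaveMode.Append)
    .save(basePath)
}
```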

When querying/reading data, Hudi just presents itself as a json-like hierarchical table, everyone is used to querying using Hive/Spark/Presto over Parquet/Json/Avro.

2 changes: 1 addition & 1 deletion website/docs/faq_table_services.md
@@ -50,6 +50,6 @@ Hudi runs cleaner to remove old file versions as part of writing data either in

Yes. Hudi provides the ability to post a callback notification about a write commit. You can use a http hook or choose to

be notified via a Kafka/pulsar topic or plug in your own implementation to get notified. Please refer [here](/docs/next/platform_services_post_commit_callback)
be notified via a Kafka/pulsar topic or plug in your own implementation to get notified. Please refer [here](platform_services_post_commit_callback)

for details
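
For illustration (a sketch with a placeholder endpoint; see the linked page for the authoritative set of callback configs), an HTTP callback might be switched on with options along these lines:

```scala
// Sketch: enable post-commit callbacks and point them at an HTTP hook.
// The URL is a placeholder; Kafka/Pulsar or custom callback classes are configured similarly.
val callbackOptions: Map[String, String] = Map(
  "hoodie.write.commit.callback.on"       -> "true",
  "hoodie.write.commit.callback.http.url" -> "https://example.com/hudi/commit-hook" // placeholder URL
)
// Usage (illustrative): df.write.format("hudi").options(callbackOptions)...save(basePath)
```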