[site] Fix broken links across versions #12470

Merged: 1 commit, Dec 11, 2024
39 changes: 29 additions & 10 deletions README.md
@@ -129,16 +129,35 @@ versioned_sidebars/version-0.7.0-sidebars.json
```

### Linking docs

- Remember to include the `.md` extension.
- Files will be linked to the correct corresponding version.
- Relative paths work as well.

```md
The [@hello](hello.md#paginate) document is great!

See the [Tutorial](../getting-started/tutorial.md) for more info.
```
- Relative paths work well.
- Files will be linked to the correct corresponding version.
- PREFER RELATIVE PATHS to keep linking consistent.
- **Good example of linking.**
  For example, say we are updating a 0.12.0 version doc (an older release).
```md
A [callback notification](writing_data#commit-notifications) is exposed
```
This automatically resolves to /docs/0.12.0/writing_data#commit-notifications.
- **Bad example of linking.**
  For example, say we are updating a 0.12.0 version doc (an older release).
```md
A [callback notification](/docs/writing_data#commit-notifications) is exposed
```
This will resolve to the most recent release, specifically /docs/writing_data#commit-notifications. We do not want a 0.12.0 doc page to point to a page from a later release.
- DO NOT use the `next` version when linking.
**Member** commented:

Using caution box syntax to highlight? https://github.com/orgs/community/discussions/16925

**Contributor (author)** replied:

Just saw this. Will fix the README in a future PR.

- **Good example of linking when you are working on the unreleased version (the `next` docs).**
```md
Hudi adopts Multiversion Concurrency Control (MVCC), where [compaction](compaction) action merges logs and base files to produce new
file slices and [cleaning](cleaning) action gets rid of unused/older file slices to reclaim space on the file system.
```
This automatically resolves to the /docs/next/compaction and /docs/next/cleaning pages.

- **Bad example of linking when you are working on the unreleased version (the `next` docs).**
```md
Hudi adopts Multiversion Concurrency Control (MVCC), where [compaction](/docs/next/compaction) action merges logs and base files to produce new
file slices and [cleaning](/docs/next/cleaning) action gets rid of unused/older file slices to reclaim space on the file system.
```
Even though it points directly to /docs/next, which is the intended target, this accumulates as tech debt: once this copy of the docs is released, we will have an older doc permanently pointing to /docs/next/.


## Versions

2 changes: 1 addition & 1 deletion website/blog/2019-09-09-ingesting-database-changes.md
@@ -44,5 +44,5 @@ inputDataset.write.format("org.apache.hudi")
.save("/path/on/dfs");
```

Alternatively, you can also use the Hudi [DeltaStreamer](https://hudi.apache.org/writing_data#deltastreamer) tool with the DFSSource.
Alternatively, you can also use the Hudi [DeltaStreamer](https://hudi.apache.org/docs/hoodie_streaming_ingestion#hudi-streamer) tool with the DFSSource.

2 changes: 1 addition & 1 deletion website/blog/2020-01-20-change-capture-using-aws.md
@@ -20,7 +20,7 @@ In this blog, we will build an end-end solution for capturing changes from a MySQL
We can break up the problem into two pieces.

1. **Extracting change logs from MySQL** : Surprisingly, this is still a pretty tricky problem to solve and often Hudi users get stuck here. Thankfully, at-least for AWS users, there is a [Database Migration service](https://aws.amazon.com/dms/) (DMS for short), that does this change capture and uploads them as parquet files on S3
2. **Applying these change logs to your data lake table** : Once there are change logs in some form, the next step is to apply them incrementally to your table. This mundane task can be fully automated using the Hudi [DeltaStreamer](http://hudi.apache.org/docs/writing_data#deltastreamer) tool.
2. **Applying these change logs to your data lake table** : Once there are change logs in some form, the next step is to apply them incrementally to your table. This mundane task can be fully automated using the Hudi [DeltaStreamer](http://hudi.apache.org/docs/hoodie_streaming_ingestion#hudi-streamer) tool.



2 changes: 1 addition & 1 deletion website/blog/2021-01-27-hudi-clustering-intro.md
@@ -17,7 +17,7 @@ Apache Hudi brings stream processing to big data, providing fresh data while being

## Clustering Architecture

At a high level, Hudi provides different operations such as insert/upsert/bulk_insert through it’s write client API to be able to write data to a Hudi table. To be able to choose a trade-off between file size and ingestion speed, Hudi provides a knob `hoodie.parquet.small.file.limit` to be able to configure the smallest allowable file size. Users are able to configure the small file [soft limit](https://hudi.apache.org/docs/configurations#compactionSmallFileSize) to `0` to force new data to go into a new set of filegroups or set it to a higher value to ensure new data gets “padded” to existing files until it meets that limit that adds to ingestion latencies.
At a high level, Hudi provides different operations such as insert/upsert/bulk_insert through it’s write client API to be able to write data to a Hudi table. To be able to choose a trade-off between file size and ingestion speed, Hudi provides a knob `hoodie.parquet.small.file.limit` to be able to configure the smallest allowable file size. Users are able to configure the small file [soft limit](https://hudi.apache.org/docs/configurations#hoodieparquetsmallfilelimit) to `0` to force new data to go into a new set of filegroups or set it to a higher value to ensure new data gets “padded” to existing files until it meets that limit that adds to ingestion latencies.
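
As a hedged illustration only (the options map and the example values below are assumptions, not taken from the post), the two behaviours described above might be selected through Spark DataSource write options like so:

```scala
// Sketch: the small-file knob discussed above, expressed as Hudi write options.
// "0" forces new data into fresh file groups; a higher byte value pads existing small files.
val forceNewFileGroups: Map[String, String] = Map(
  "hoodie.parquet.small.file.limit" -> "0"
)
val padSmallFilesUpTo100Mb: Map[String, String] = Map(
  "hoodie.parquet.small.file.limit" -> String.valueOf(100 * 1024 * 1024)
)
// Usage (illustrative): df.write.format("hudi").options(padSmallFilesUpTo100Mb)...save(basePath)
```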



6 changes: 3 additions & 3 deletions website/blog/2021-03-01-hudi-file-sizing.md
@@ -36,9 +36,9 @@ For illustration purposes, we are going to consider only COPY_ON_WRITE table.

Configs of interest before we dive into the algorithm:

- [Max file size](/docs/configurations#limitFileSize): Max size for a given data file. Hudi will try to maintain file sizes to this configured value <br/>
- [Soft file limit](/docs/configurations#compactionSmallFileSize): Max file size below which a given data file is considered to a small file <br/>
- [Insert split size](/docs/configurations#insertSplitSize): Number of inserts grouped for a single partition. This value should match
- [Max file size](/docs/configurations#hoodieparquetmaxfilesize): Max size for a given data file. Hudi will try to maintain file sizes to this configured value <br/>
- [Soft file limit](/docs/configurations#hoodieparquetsmallfilelimit): Max file size below which a given data file is considered to a small file <br/>
- [Insert split size](/docs/configurations#hoodiecopyonwriteinsertsplitsize): Number of inserts grouped for a single partition. This value should match
the number of records in a single file (you can determine based on max file size and per record size)
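
A minimal sketch, assuming the Spark DataSource write path (the table name, key field and concrete values are illustrative assumptions, not recommendations), of how these three configs might be set together:

```scala
// Sketch: 120MB max file size, 100MB small-file threshold, 500k-record insert splits.
// Only the three sizing keys come from the list above; everything else is illustrative.
import org.apache.spark.sql.{DataFrame, SaveMode}

def writeWithFileSizing(df: DataFrame, basePath: String): Unit = {
  df.write
    .format("hudi")
    .option("hoodie.table.name", "example_table")               // hypothetical table name
    .option("hoodie.datasource.write.recordkey.field", "uuid")  // hypothetical key field
    .option("hoodie.parquet.max.file.size", String.valueOf(120 * 1024 * 1024))
    .option("hoodie.parquet.small.file.limit", String.valueOf(100 * 1024 * 1024))
    .option("hoodie.copyonwrite.insert.split.size", "500000")
    .mode(SaveMode.Append)
    .save(basePath)
}
```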

For instance, if your first config value is 120MB and 2nd config value is set to 100MB, any file whose size is < 100MB
2 changes: 1 addition & 1 deletion website/blog/2021-08-16-kafka-custom-deserializer.md
@@ -18,7 +18,7 @@ In our case a Confluent schema registry is used to maintain the schema and as sc
<!--truncate-->

## What do we want to achieve?
We have multiple instances of DeltaStreamer running, consuming many topics with different schemas ingesting to multiple Hudi tables. Deltastreamer is a utility in Hudi to assist in ingesting data from multiple sources like DFS, kafka, etc into Hudi. If interested, you can read more about DeltaStreamer tool [here](https://hudi.apache.org/docs/writing_data#deltastreamer)
We have multiple instances of DeltaStreamer running, consuming many topics with different schemas ingesting to multiple Hudi tables. Deltastreamer is a utility in Hudi to assist in ingesting data from multiple sources like DFS, kafka, etc into Hudi. If interested, you can read more about DeltaStreamer tool [here](https://hudi.apache.org/docs/hoodie_streaming_ingestion#hudi-streamer)
Ideally every topic should be able to evolve the schema to match new business requirements. Producers start producing data with a new schema version and the DeltaStreamer picks up the new schema and ingests the data with the new schema. For this to work, we run our DeltaStreamer instances with the latest schema version available from the Schema Registry to ensure that we always use the freshest schema with all attributes.
A prerequisites is that all the mentioned Schema evolutions must be `BACKWARD_TRANSITIVE` compatible (see [Schema Evolution and Compatibility of Avro Schema changes](https://docs.confluent.io/platform/current/schema-registry/avro.html). This ensures that every record in the kafka topic can always be read using the latest schema.

2 changes: 1 addition & 1 deletion website/docs/cli.md
@@ -578,7 +578,7 @@ Compaction successfully repaired

### Savepoint and Restore
As the name suggest, "savepoint" saves the table as of the commit time, so that it lets you restore the table to this
savepoint at a later point in time if need be. You can read more about savepoints and restore [here](/docs/next/disaster_recovery)
savepoint at a later point in time if need be. You can read more about savepoints and restore [here](disaster_recovery)

To trigger savepoint for a hudi table
```java
6 changes: 3 additions & 3 deletions website/docs/concurrency_control.md
@@ -8,7 +8,7 @@ last_modified_at: 2021-03-19T15:59:57-04:00
---
Concurrency control defines how different writers/readers/table services coordinate access to a Hudi table. Hudi ensures atomic writes, by way of publishing commits atomically to the timeline,
stamped with an instant time that denotes the time at which the action is deemed to have occurred. Unlike general purpose file version control, Hudi draws clear distinction between
writer processes that issue [write operations](/docs/next/write_operations) and table services that (re)write data/metadata to optimize/perform bookkeeping and
writer processes that issue [write operations](write_operations) and table services that (re)write data/metadata to optimize/perform bookkeeping and
readers (that execute queries and read data).

Hudi provides
@@ -23,7 +23,7 @@ We’ll also describe ways to ingest data into a Hudi Table from multiple writer

## Distributed Locking
A pre-requisite for distributed co-ordination in Hudi, like many other distributed database systems is a distributed lock provider, that different processes can use to plan, schedule and
execute actions on the Hudi timeline in a concurrent fashion. Locks are also used to [generate TrueTime](/docs/next/timeline#truetime-generation), as discussed before.
execute actions on the Hudi timeline in a concurrent fashion. Locks are also used to [generate TrueTime](timeline#truetime-generation), as discussed before.

External locking is typically used in conjunction with optimistic concurrency control
because it provides a way to prevent conflicts that might occur when two or more transactions (commits in our case) attempt to modify the same resource concurrently.
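
For illustration only (the ZooKeeper endpoint, lock key and base path below are placeholders; treat this as a sketch rather than a reference configuration), a multi-writer job using optimistic concurrency control with an external lock provider might carry options along these lines:

```scala
// Sketch: optimistic concurrency control backed by a ZooKeeper-based lock provider.
// Endpoint values are placeholders; adjust to your lock provider of choice.
val lockOptions: Map[String, String] = Map(
  "hoodie.write.concurrency.mode"         -> "optimistic_concurrency_control",
  "hoodie.cleaner.policy.failed.writes"   -> "LAZY",
  "hoodie.write.lock.provider"            -> "org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider",
  "hoodie.write.lock.zookeeper.url"       -> "zk-host",         // placeholder host
  "hoodie.write.lock.zookeeper.port"      -> "2181",
  "hoodie.write.lock.zookeeper.lock_key"  -> "example_table",   // placeholder lock key
  "hoodie.write.lock.zookeeper.base_path" -> "/hudi_locks"      // placeholder ZK path
)
// Usage (illustrative): df.write.format("hudi").options(lockOptions)...save(basePath)
```
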
@@ -204,7 +204,7 @@ Multiple writers can operate on the table with non-blocking conflict resolution.
file group with the conflicts resolved automatically by the query reader and the compactor. The new concurrency mode is
currently available for preview in version 1.0.0-beta only with the caveat that conflict resolution is not supported yet
between clustering and ingestion. It works for compaction and ingestion, and we can see an example of that with Flink
writers [here](/docs/next/sql_dml#non-blocking-concurrency-control-experimental).
writers [here](sql_dml#non-blocking-concurrency-control-experimental).

## Early conflict Detection

2 changes: 1 addition & 1 deletion website/docs/deployment.md
@@ -136,7 +136,7 @@ Here is an example invocation for reading from kafka topic in a continuous mode

### Spark Datasource Writer Jobs

As described in [Batch Writes](/docs/next/writing_data#spark-datasource-api), you can use spark datasource to ingest to hudi table. This mechanism allows you to ingest any spark dataframe in Hudi format. Hudi Spark DataSource also supports spark streaming to ingest a streaming source to Hudi table. For Merge On Read table types, inline compaction is turned on by default which runs after every ingestion run. The compaction frequency can be changed by setting the property "hoodie.compact.inline.max.delta.commits".
As described in [Batch Writes](writing_data#spark-datasource-api), you can use spark datasource to ingest to hudi table. This mechanism allows you to ingest any spark dataframe in Hudi format. Hudi Spark DataSource also supports spark streaming to ingest a streaming source to Hudi table. For Merge On Read table types, inline compaction is turned on by default which runs after every ingestion run. The compaction frequency can be changed by setting the property "hoodie.compact.inline.max.delta.commits".

Here is an example invocation using spark datasource
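
The invocation itself is collapsed in this diff view; purely as a hedged sketch (the table name and the commit count are assumptions), the Merge On Read compaction knobs mentioned above might be passed like this:

```scala
// Sketch: datasource write to a MOR table, compacting inline every 4 delta commits.
import org.apache.spark.sql.{DataFrame, SaveMode}

def morWrite(df: DataFrame, basePath: String): Unit = {
  df.write
    .format("hudi")
    .option("hoodie.table.name", "example_mor_table")             // hypothetical table name
    .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
    .option("hoodie.compact.inline", "true")
    .option("hoodie.compact.inline.max.delta.commits", "4")
    .mode(SaveMode.Append)
    .save(basePath)
}
```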

14 changes: 7 additions & 7 deletions website/docs/faq.md
@@ -6,10 +6,10 @@ keywords: [hudi, writing, reading]

The FAQs are split into following pages. Please refer to the specific pages for more info.

- [General](/docs/next/faq_general)
- [Design & Concepts](/docs/next/faq_design_and_concepts)
- [Writing Tables](/docs/next/faq_writing_tables)
- [Reading Tables](/docs/next/faq_reading_tables)
- [Table Services](/docs/next/faq_table_services)
- [Storage](/docs/next/faq_storage)
- [Integrations](/docs/next/faq_integrations)
- [General](faq_general)
- [Design & Concepts](faq_design_and_concepts)
- [Writing Tables](faq_writing_tables)
- [Reading Tables](faq_reading_tables)
- [Table Services](faq_table_services)
- [Storage](faq_storage)
- [Integrations](faq_integrations)
2 changes: 1 addition & 1 deletion website/docs/faq_general.md
@@ -61,7 +61,7 @@ Nonetheless, Hudi is designed very much like a database and provides similar fun

### How do I model the data stored in Hudi?

When writing data into Hudi, you model the records like how you would on a key-value store - specify a key field (unique for a single partition/across table), a partition field (denotes partition to place key into) and preCombine/combine logic that specifies how to handle duplicates in a batch of records written. This model enables Hudi to enforce primary key constraints like you would get on a database table. See [here](/docs/next/writing_data) for an example.
When writing data into Hudi, you model the records like how you would on a key-value store - specify a key field (unique for a single partition/across table), a partition field (denotes partition to place key into) and preCombine/combine logic that specifies how to handle duplicates in a batch of records written. This model enables Hudi to enforce primary key constraints like you would get on a database table. See [here](writing_data) for an example.
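
A hedged sketch of what that modeling looks like in practice (the table and field names here are assumptions, not from the FAQ):

```scala
// Sketch: record key, partition path and precombine field declared on a Spark DataSource upsert.
import org.apache.spark.sql.{DataFrame, SaveMode}

def upsertOrders(df: DataFrame, basePath: String): Unit = {
  df.write
    .format("hudi")
    .option("hoodie.table.name", "orders")                                // hypothetical table
    .option("hoodie.datasource.write.recordkey.field", "order_id")        // primary key field
    .option("hoodie.datasource.write.partitionpath.field", "order_date")  // partition field
    .option("hoodie.datasource.write.precombine.field", "updated_at")     // resolves duplicates in a batch
    .option("hoodie.datasource.write.operation", "upsert")
    .mode(SaveMode.Append)
    .save(basePath)
}
```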

When querying/reading data, Hudi just presents itself as a json-like hierarchical table, everyone is used to querying using Hive/Spark/Presto over Parquet/Json/Avro.

2 changes: 1 addition & 1 deletion website/docs/faq_table_services.md
@@ -50,6 +50,6 @@ Hudi runs cleaner to remove old file versions as part of writing data either in

Yes. Hudi provides the ability to post a callback notification about a write commit. You can use a http hook or choose to

be notified via a Kafka/pulsar topic or plug in your own implementation to get notified. Please refer [here](/docs/next/platform_services_post_commit_callback)
be notified via a Kafka/pulsar topic or plug in your own implementation to get notified. Please refer [here](platform_services_post_commit_callback)

for details
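
For illustration (a sketch with a placeholder endpoint; see the linked page for the authoritative set of callback configs), an HTTP callback might be switched on with options along these lines:

```scala
// Sketch: enable post-commit callbacks and point them at an HTTP hook.
// The URL is a placeholder; Kafka/Pulsar or custom callback classes are configured similarly.
val callbackOptions: Map[String, String] = Map(
  "hoodie.write.commit.callback.on"       -> "true",
  "hoodie.write.commit.callback.http.url" -> "https://example.com/hudi/commit-hook" // placeholder URL
)
// Usage (illustrative): df.write.format("hudi").options(callbackOptions)...save(basePath)
```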