Add Delta Lake ACID Transactions blog #530
base: main
Signed-off-by: Avril Aysha <[email protected]>
✅ Deploy Preview for deltaio-site ready!
| This article explains how Delta Lake uses ACID transactions to make your data workloads more secure and reliable. | ||
| When you work with valuable data, you need to be able to trust that it's accurate and safe. ACID transactions provide this kind of reliability guarantee so that you don't have to worry about corrupted data, accidental deletes or partial overwrites. |
Not entirely sure ACID gives us all of that, or at least only in a narrower sense. I.e., I can easily corrupt data with an erroneous merge or similar while still being fully ACID compliant.
The time travel feature might allow us to recover accidentally deleted data, but someone might still accidentally delete things?
| A **transaction** is any operation or set of operations that modify your data, for example updating a table or writing new records. Transactions should be processed reliably. | ||
| **ACID** is an acronym that stands for: **A**tomicity, **C**onsistency, **I**solation, and **D**urability. They are the four guarantees that keep your data reliable even if something goes wrong, like an accidental delete or partial overwrite. Here's what each guarantee means: |
Again, I might be wrong, but I believe the main focus of ACID is qualifying transaction properties. One might still accidentally delete something in an ACID transaction.
| - **Atomicity** means that a transaction is all or nothing. If one part fails, the entire operation is rolled back, so you never end up with incomplete data. |
In Delta's case there technically is no rollback; the change is just not committed in case of conflicts.
| ## A brief history of ACID transactions | ||
| ACID transactions were first developed in the 1970s to keep database operations reliable. They quickly became the standard for database management. But in the 2010s, many organizations shifted to data lakes for storing large-scale data. Unlike databases, traditional data lakes were built on cloud object stores and [didn't enforce ACID guarantees](https://delta.io/blog/delta-lake-vs-data-lake/). This created problems with incomplete writes, data corruption, and conflicting updates. |
I think they were defined to establish what a transaction in a database MUST have, i.e. without them there are no (transactional) databases. Not sure if "keep ... reliable" captures that?
| Delta Lake solves these problems. It brings ACID transactions to your data lake. Every data change is guaranteed to be either fully completed or undone. This keeps your data consistent and secure, no matter what happens. |
How does "secure" play into this?
| ## How does Delta Lake use ACID transactions? | ||
| Delta Lake brings ACID transactions to your data lake by using a transaction log. This log is stored in a lightweight JSON file and tracks every change to your data. This makes sure that every operation is **atomic** (all or nothing), **consistent** (follows defined rules), **isolated** (avoids conflicts), and **durable** (permanently stored). |
| Delta Lake brings ACID transactions to your data lake by using a transaction log. This log is stored in a lightweight JSON file and tracks every change to your data. This makes sure that every operation is **atomic** (all or nothing), **consistent** (follows defined rules), **isolated** (avoids conflicts), and **durable** (permanently stored). | |
| Delta Lake brings ACID transactions to your data lake by using a transaction log. Every commit is stored in a lightweight JSON file and tracks every change to your data. This makes sure that every operation is **atomic** (all or nothing), **consistent** (follows defined rules), **isolated** (avoids conflicts), and **durable** (permanently stored). |
There are quite a few steps between tracking commits in JSON files and achieving these properties. Is it OK to just hand-wave this here?
| This approach gives you the following peace-of-mind guarantees: | ||
| - **No Partial Writes** - If a job fails mid-operation, Delta Lake ensures that incomplete changes are rolled back, so your data remains clean. | ||
| - **Consistent Reads** - Since all operations are tracked, queries always see a valid and up-to-date view of the data. |
In particular, they see one consistent state throughout the entire transaction, even if some other writer made a new commit.
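A toy sketch of what I mean, in plain Python rather than Delta Lake's real reader (the `snapshot_files` helper and the simplified string-valued `add`/`remove` actions are made up for illustration): the reader pins a version when its query starts and replays the log only up to that version, so a later commit does not change what it sees.

```python
def snapshot_files(commits, version):
    """Replay add/remove actions from version 0 through `version`; return the live file paths."""
    live = set()
    for v in range(version + 1):
        for action in commits[v]:
            if "add" in action:
                live.add(action["add"])
            elif "remove" in action:
                live.discard(action["remove"])
    return live


# Hypothetical log: one list of actions per committed version.
commits = [
    [{"add": "part-0.parquet"}],
    [{"add": "part-1.parquet"}],
]

pinned = len(commits) - 1  # the reader pins version 1 when its query starts
before = snapshot_files(commits, pinned)

# Another writer commits version 2 while the query is still running.
commits.append([{"remove": "part-0.parquet"}, {"add": "part-2.parquet"}])

after = snapshot_files(commits, pinned)        # still the version-1 view
latest = snapshot_files(commits, len(commits) - 1)  # what a new query would see
```

So "valid" holds for the whole query, while "up-to-date" only holds as of the version the reader started from.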
| - **Concurrent Writes Without Conflicts** - Multiple users can modify data safely without interfering with each other, preventing corruption. |
They might interfere, but this would fail the commit, and the loser then either has to resolve the conflict (in the case of fully serialisable commits) or retry its operation if new data arrived that it should have read during the operation.
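Roughly, and only as a simplified sketch under my own assumptions (the `InMemoryLog` class, `commit_with_retry`, and the `conflicts_with` callback are hypothetical names, not Delta Lake APIs): a writer that loses the race looks at what the winner committed, moves on to the next version if the two changes are independent, and otherwise has to redo its operation on fresh data.

```python
class InMemoryLog:
    """Toy transaction log: maps version number -> list of actions."""

    def __init__(self):
        self.commits = {}

    def latest_version(self):
        return max(self.commits, default=-1)

    def put_if_absent(self, version, actions):
        """Record `actions` at `version` unless that version is already taken."""
        if version in self.commits:
            return False
        self.commits[version] = actions
        return True


def commit_with_retry(log, read_version, actions, conflicts_with, max_attempts=5):
    """Try to commit after `read_version`; on a lost race, check the winner and retry."""
    attempt_version = read_version + 1
    for _ in range(max_attempts):
        if log.put_if_absent(attempt_version, actions):
            return attempt_version
        # Lost the race for this version: inspect what the winner wrote.
        winner = log.commits[attempt_version]
        if conflicts_with(winner):
            raise RuntimeError("conflicting commit; redo the operation on fresh data")
        attempt_version += 1
    raise RuntimeError("gave up after too many attempts")


log = InMemoryLog()
log.put_if_absent(0, [{"add": "part-0.parquet"}])
read_version = log.latest_version()                # our writer reads the table at version 0
log.put_if_absent(1, [{"add": "part-1.parquet"}])  # another writer wins version 1

# A blind append does not depend on what the winner wrote, so it lands at version 2.
committed = commit_with_retry(
    log, read_version, [{"add": "part-2.parquet"}], lambda winner: False
)
```

The point for the blog text is that "without conflicts" is too strong: conflicts are detected at commit time, and one of the writers pays for them.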
| ## Do ACID transactions affect performance? | ||
| A common concern with ACID transactions is performance overhead, but Delta Lake minimizes this impact. Since transactions are recorded in lightweight log files, updates are efficient. Delta Lake also uses concurrency control to let multiple users work on the same data without locking files to reduce bottlenecks. |
Hmm... technically the actual commit requires put-if-absent or rename-if-absent, which is not quite a lock but also handles coordination. This is where the optimistic approach comes in, i.e. write all data first, hoping the later commit will succeed.
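For illustration only, here is put-if-absent on a local filesystem in plain Python. It is a sketch of the idea, not how any Delta Lake implementation does it: `try_commit` is a made-up helper, the `add` actions are trimmed to just a path, and it ignores failures while writing the commit file itself. Object stores need their own conditional-write mechanism.

```python
import json
import os
import tempfile


def try_commit(log_dir: str, version: int, actions: list) -> bool:
    """Create the commit file for `version`; return False if it already exists."""
    path = os.path.join(log_dir, f"{version:020d}.json")
    try:
        # O_CREAT | O_EXCL makes creation fail if the file exists (put-if-absent).
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        for action in actions:
            f.write(json.dumps(action) + "\n")
    return True


log_dir = tempfile.mkdtemp()

# Both writers have already written their data files (the optimistic part)
# and now race to claim version 0. Neither takes a lock.
first = try_commit(log_dir, 0, [{"add": {"path": "part-a.parquet"}}])
second = try_commit(log_dir, 0, [{"add": {"path": "part-b.parquet"}}])
```

Here `first` succeeds and `second` is rejected, so the second writer's data file exists on disk but is never referenced by the log.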