Add DynamoDB item Time to Live #99
base: master
Conversation
…nd fix and a potential bug
…conds for clarity and add comment
At least one pull request committer is not linked to a user. See https://help.github.com/en/articles/why-are-my-commits-linked-to-the-wrong-user#commits-are-not-linked-to-any-user
Thanks!
This looks cool. :) Definitely useful to have.
Note: "TTL" stands for "Time To Live" not "Time To Leave"
My biggest concern is that TTL expiry and removal are only eventually consistent across multiple partitions.
This is related to how TTL is implemented in DynamoDB. The docs for TTL state "Shortly after the date and time of the specified timestamp". The TTL is implemented by a per-partition background scan, which means that items are not actually deleted in global TTL order but only per partition. This may result in odd recovery behavior.
Similarly, the TTL docs have this note: "Items that have expired, but haven't yet been deleted by TTL, still appear in reads, queries, and scans."
Suppose the log (in insert time order) for a persistent entity is (E - event, S - snapshot):
`E0, E1, E2, S0, E3, E4, S1`
As the Es may be on multiple partitions, the TTL expiry may result in:
`E0, E2, S0, E3, E4, S1`
Note that `E1` was expired prior to `E0`, which is entirely possible given Amazon's description of how TTL is implemented.
Presuming the above problem is accurate, some mitigations:
- A separate TTL for snapshots. If the journal TTL is configured to be a bit larger than the snapshot TTL, any out-of-order TTL removal of a journal item can only occur prior to the snapshot item, which should be safe for recovery.
- Checking the `expiresAt` during read.
- Documenting this risk. This behavior may be totally fine for some applications.
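The second mitigation could be sketched like this (illustrative names only, not the PR's actual code; it assumes the expiry attribute is stored as epoch seconds, the format DynamoDB TTL uses):

```scala
import java.time.Instant

// DynamoDB deletes expired items lazily, so expired-but-not-yet-deleted items
// can still appear in reads. Filter them out at read time by comparing the
// stored expiry (epoch seconds) against the current time.
final case class JournalItem(sequenceNr: Long, expiresAt: Option[Long])

def notExpired(item: JournalItem, now: Instant): Boolean =
  item.expiresAt.forall(_ > now.getEpochSecond)

// e.g. during recovery: items.filter(notExpired(_, Instant.now()))
```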
A similar concern relates to usage with existing systems:
What happens if this is enabled on an existing system? Suppose an existing system has persisted events and snapshots, and then the developers configure the system to use a TTL. What will happen during recovery? In the case of no snapshot, would the recovery fail?
This also applies the TTL to all items using DDB persistence. Is there a way to configure this per persistent entity?
I don't think any of the above is blocking. Documentation, warnings and asserts could be added to ensure it's reasonably safe.
I'm going to consider this further before approval.
```scala
def readTTLConfig(c: Config): Option[DynamoDBTTLConfig] = {
  for {
    fieldName <- Try(c getString configFieldName).toOption
    if fieldName.trim.nonEmpty
```
This guard uses `fieldName.trim`, but below the `fieldName` is not trimmed. Consider a map after `toOption`: `toOption.map(_.trim)`.
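The suggested change could look roughly like this (a sketch; `readString` stands in for the PR's `c getString configFieldName`, which is not reproduced in full here):

```scala
import scala.util.Try

// Trim once right after toOption so the non-empty guard and every later
// use of fieldName see the same (trimmed) value.
def readFieldName(readString: => String): Option[String] =
  Try(readString).toOption
    .map(_.trim)
    .filter(_.nonEmpty)
```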
Good catch! Should be good now.
```diff
- fut.map { serialized =>
+ serializePersistentRepr(reprPayload, serializer).map { serialized =>
```
nice simplification!
thanks!
README.md
Outdated

> Note that `E1` was expired prior to `E0`. Which is entirely possible given the description from Amazon on how TTL is implemented.
>
> As a consequence, a separate TTL for snapshot should be preferred. The journal TTL should be configured to be a bit larger than the snapshot. Thus, any out-of-order TTL removal of journal item can be prior to the snapshot item.
Could you help me understand exactly what having the snapshot TTL a bit larger than the journal one would change?
Let's say we have: `E1, E2, E3, S1, E4, E5`
If the journal TTL is lower than the snapshot TTL, `E4` and `E5` could expire before `S1`, and thus information could be lost.
Yes. That is correct.
Considering this further: I don't think the requirement "the snapshot TTL < journal TTL" is sufficient by itself. There also needs to be consideration of the snapshot interval when defining the TTL.
The property I'm thinking is valuable: the persistent entity can be recovered by replaying the events since any persisted snapshot.
So, if there is a sequence like `E1, E2, S1, E3, E4, S2, E5` for a persistent entity, then an equal entity state can be recovered from the replay `S1, E3, E4, E5` or `S2, E5`.
I think the minimum requirement is:
- snapshot interval much less than snapshot TTL

With the requirement:
- snapshot TTL less than journal TTL, at least 48hrs less

being necessary only to ensure the entity can be recovered by replaying the events since any persisted snapshot.
Whew! Lots of consideration for a user of this, but still valuable to have the TTL. Especially since even a nice TTL of, say, 20d is going to be much, much larger than the snapshot interval. The potential inconsistency seems really unlikely to occur in a real-world application?
EG: suppose the journal TTL was less than the snapshot interval. This satisfies the requirement "snapshot TTL less than journal TTL", but not the property above.
Prior to any TTL expiry: `E1, E2, E3, E4`. Note no snapshot yet. Expiry and delete occurs for `E1` and `E2`: `E3, E4`. Oops! We can no longer recover the persistent entity state.
Compare to the snapshot interval << journal TTL: `E1, E2, E3, E4`. Expiring hasn't occurred yet. The snapshot interval must elapse first (per the given above): `E1, E2, E3, E4, S1`. Expiry and delete occurs for `E1` and `E2`: `E3, E4, S1`. Still safe: the persistent entity can be entirely recovered from `S1`. Expiry and delete occurs for `E3` and `E4`: `S1`. Still safe: the persistent entity can be entirely recovered from `S1`.
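The two requirements discussed in this thread could be encoded as a validation check, roughly like this (an illustrative sketch, not part of the PR; the 48h margin and the function name are assumptions):

```scala
import scala.concurrent.duration._

// Requirements from the discussion:
// 1. snapshot interval << snapshot TTL, so a snapshot exists before the
//    oldest events can expire;
// 2. snapshot TTL + margin <= journal TTL, so replay from *any* persisted
//    snapshot works; the margin hedges against DynamoDB's lazy,
//    per-partition TTL deletion.
def validateTtlConfig(
    snapshotInterval: FiniteDuration,
    snapshotTtl: FiniteDuration,
    journalTtl: FiniteDuration,
    margin: FiniteDuration = 48.hours): Either[String, Unit] =
  if (snapshotInterval >= snapshotTtl)
    Left(s"snapshot interval ($snapshotInterval) must be much less than the snapshot TTL ($snapshotTtl)")
  else if (snapshotTtl + margin > journalTtl)
    Left(s"journal TTL ($journalTtl) should exceed the snapshot TTL ($snapshotTtl) by at least $margin")
  else
    Right(())
```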
Thanks very much for your detailed feedback @coreyoconnor! I pushed a few commits to try to address the feedback. I added a section in the README to describe how to use this and what to be cautious about. There is one thing I am not sure about in the readme. See my comment https://github.com/akka/akka-persistence-dynamodb/pull/99/files#r731892138
EDIT: Also, do you think
Thank you for your patience. I'm working to review this this week. Looks like I still have some CI fixes to push first.
Add a note to the docs that an appropriate TTL also depends on snapshot interval. Then LGTM! Thanks!
I still have to fix CI before merging tho. Working on that.
> Before enabling the TTL feature, make sure to have an understanding of the potential impacts detailed below.
>
> #### Don't rely on the TTL for business logic related to the expiration
> As per specified in the [AWS DynamoDB configuration](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/howitworks-ttl.html),
Nice description of the design considerations required!
README.md
Outdated

> Note that `E1` was expired prior to `E0`. Which is entirely possible given the description from Amazon on how TTL is implemented.
>
> As a consequence, a separate TTL for snapshot should be preferred. The journal TTL should be configured to be a bit larger than the snapshot. Thus, any out-of-order TTL removal of journal item can be prior to the snapshot item.
With the above consideration I'd rewrite: "As a consequence, a separate TTL for snapshot should be preferred. The journal TTL should be configured to be a bit larger than the snapshot. Thus, any out-of-order TTL removal of journal item can be prior to the snapshot item."
To something like:
As a consequence, configuring the TTLs requires careful consideration. For an entity to be recovered from the persisted snapshot and journal events, the snapshot interval must be less than the snapshot TTL. Further, to ensure the useful property that the entity can be recovered by replaying the events since any persisted snapshot, the journal TTL should be larger than the snapshot TTL.
README.md
Outdated

> In DynamoDB, this feature can be activated in a per-table basis by specifying the name of the field containing the expiring date.
>
> Expiring items is available for journal and snapshot tables. In order to activate it, `dynamodb-item-ttl-config.field-name` and `dynamodb-item-ttl-config.ttl` need to be specified:
"need to be specified. For example, given a system with a snapshot interval of 10 days. The TTLs could be configured:"
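A concrete example along those lines could look like this (hypothetical values; the key names come from the PR description, but the nesting under the journal and snapshot plugin sections is an assumption):

```hocon
# Illustrative only: snapshot interval of 10 days, snapshot TTL of 20 days,
# journal TTL of 30 days, satisfying
# "snapshot interval << snapshot TTL < journal TTL".
my-dynamodb-journal {
  dynamodb-item-ttl-config {
    field-name = "expiresAt"
    ttl = 30d
  }
}

my-dynamodb-snapshot-store {
  dynamodb-item-ttl-config {
    field-name = "expiresAt"
    ttl = 20d
  }
}
```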
# Conflicts:
#	src/main/scala/akka/persistence/dynamodb/DynamoDBConfig.scala
#	src/main/scala/akka/persistence/dynamodb/journal/DynamoDBJournalRequests.scala
#	src/main/scala/akka/persistence/dynamodb/snapshot/DynamoDBSnapshotConfig.scala
Hi folks,
I'd like to propose a change to add a TTL to each item to leverage DynamoDB's TTL feature. I am not aware of any prior discussion about this subject.
To add more details, we've added some optional configuration for journal and snapshot tables to specify the TTL:
With this configuration, it would add a field `expiresAt` to each journal item inserted, with value `now + 30d`.
To facilitate the review process, this can be separated into two parts:
I also tested it against dynamo in AWS:
Note: I am checking with my company for the CLA
Cheers
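For illustration, the `expiresAt` value described above (insert time plus the configured TTL) could be computed like this (a sketch with assumed names, not the PR's actual code; DynamoDB's TTL comparison uses epoch seconds):

```scala
import java.time.Instant
import scala.concurrent.duration._

// DynamoDB TTL compares the named attribute against the current epoch time
// in *seconds*, so the stored value must be epoch seconds.
def expiresAtEpochSeconds(ttl: FiniteDuration, now: Instant): Long =
  now.getEpochSecond + ttl.toSeconds

// e.g. a 30-day TTL relative to insert time:
// val expiresAt = expiresAtEpochSeconds(30.days, Instant.now())
```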