Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add DynamoDB item Time to Live #99

Open
wants to merge 23 commits into
base: master
Choose a base branch
from

Conversation

Dnomyar
Copy link

@Dnomyar Dnomyar commented Oct 11, 2021

Hi folks,

I'd like to propose a change to add a TTL to each item to leverage DynamoDB's feature. I am not aware of any discussion about this subject.

To add more details, we've added some optional configuration for journal and snapshot tables to specify the TTL:

my-dynamodb-journal { // also available for snapshots tables
  dynamodb-item-ttl-config {
    field-name = "expiresAt" // in the AWS UI, we need to set the field name https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/time-to-live-ttl-how-to.html
    ttl = 30d
  }
}

With this configuration, it would add a field "expiresAt" for each journal item inserted with value now + 30d.

To facilitate the review process, this can be separated into two parts:

  • the first part contains the bulk of the work: handling configuration & adding the field
  • the second part contains the part that updates the alert related to the item size (<400k). Correct me if I am wrong, but I haven't found any similar alert for the snapshots.

I also tested it against dynamo in AWS:

Screenshot 2021-10-11 at 16 48 50

Note: I am checking with my company for the CLA

Cheers

@lightbend-cla-validator

At least one pull request committer is not linked to a user. See https://help.github.com/en/articles/why-are-my-commits-linked-to-the-wrong-user#commits-are-not-linked-to-any-user

@Dnomyar Dnomyar marked this pull request as draft October 11, 2021 16:21
@Dnomyar Dnomyar marked this pull request as ready for review October 11, 2021 16:25
@Dnomyar Dnomyar mentioned this pull request Oct 15, 2021
Copy link
Member

@coreyoconnor coreyoconnor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!
This looks cool. :) Definitely useful to have.
Note: "TTL" stands for "Time To Live" not "Time To Leave"

My biggest concern is that the TTL expire and removal is eventually consistent across multiple partitions.

This is related to how TTL is implemented in dynamodb. The docs for TTL state "Shortly after the date and time of the specified timestamp". The TTL is implemented by a per-partition background scan. This means that items are not actually deleted in global TTL order but only per-partition. Which may result in odd recover behavior.

Similarly, the TTL docs have this note: "Items that have expired, but haven’t yet been deleted by TTL, still appear in reads, queries, and scans."

Suppose the log (in insert time order) for a persistent entity is (E - event, S - snapshot):
E0, E1, E2, S0, E3, E4, S1.

As the Es may be on multiple partitions the TTL expire may result in:

E0, E2, S0, E3, E4, S1.

Note that E1 was expired prior to E0. Which is entirely possible given the description from Amazon on how TTL is implemented.

Presuming the above problem is accurate, some mitigations:

  1. A separate TTL for snapshots. If the journal TTL can be configured to be a bit larger than the snapshot any out-of-order TTL removal of journal item can be prior to the snapshot item. Which should be safe for recovery.
  2. Checking the expireAt during read.
  3. Document this risk. This behavior may be totally fine for some applications.

A similar concern related to usage with existing systems:

What happens if this is enabled on an existing system. Suppose an existing system has persisted events and snapshots. Then the developers configure the system to use a TTL. What will happen during recovery? In the case of no snapshot, would the recovery fail?

This also applies the TTL for all items using DDB persistence. Is there a way to configure this per persistent entity?

I don't think any of the above is blocking. Documentation, warning and asserts could be added that ensure it's reasonably safe.

I'm going to consider this further before approval.

def readTTLConfig(c: Config): Option[DynamoDBTTLConfig] = {
for {
fieldName <- Try(c getString configFieldName).toOption
if fieldName.trim.nonEmpty
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this guard uses fieldName.trim but below the fieldName is not trimmed. Consider a map after toOption: toOption.map(_.trim).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! Should be good now.


fut.map { serialized =>
serializePersistentRepr(reprPayload, serializer).map { serialized =>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice simplification!

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks!

@Dnomyar Dnomyar changed the title Add DynamoDB item Time to Leave Add DynamoDB item Time to Live Oct 18, 2021
@lightbend-cla-validator

At least one pull request committer is not linked to a user. See https://help.github.com/en/articles/why-are-my-commits-linked-to-the-wrong-user#commits-are-not-linked-to-any-user

@lightbend-cla-validator

At least one pull request committer is not linked to a user. See https://help.github.com/en/articles/why-are-my-commits-linked-to-the-wrong-user#commits-are-not-linked-to-any-user

@lightbend-cla-validator

At least one pull request committer is not linked to a user. See https://help.github.com/en/articles/why-are-my-commits-linked-to-the-wrong-user#commits-are-not-linked-to-any-user

@lightbend-cla-validator

At least one pull request committer is not linked to a user. See https://help.github.com/en/articles/why-are-my-commits-linked-to-the-wrong-user#commits-are-not-linked-to-any-user

README.md Outdated

Note that `E1` was expired prior to `E0`. Which is entirely possible given the description from Amazon on how TTL is implemented.

As a consequence, a separate TTL for snapshot should be preferred. The journal TTL should be configured to be a bit larger than the snapshot. Thus, any out-of-order TTL removal of journal item can be prior to the snapshot item.
Copy link
Author

@Dnomyar Dnomyar Oct 19, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you help me understand exactly what having the snapshot TTL a bit larger than the journal one would change?
Let's say we have:

E1, E2, E3, S1, E4, E5

If the journal TTL is lower than the snapshot TTL, E4 and E5 could expire before S1 and thus information could be lost.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. That is correct.

Considering this further: I don't think that requirement: "the snapshot TTL < journal TTL" is sufficient by itself. There also needs to be consideration of the snapshot interval when defining the TTL.

The property I'm thinking is valuable: The persistent entity can be recovered by replaying the events since any persisted snapshot.

So, if there is a sequence like E1, E2, S1, E3, E4, S2, E5 for a persistent entity. Then an equal entity state can be recovered from the replay S1, E3, E4, E5 or S2, E5.

I think the minimum requirement is:

  • snapshot interval much less than snapshot TTL

With the requirement:

  • snapshot TTL less than journal TTL. At least 48hrs less.

being necessary only to ensure the entity can be recovered by replaying the events since any persisted snapshot.

Whew! Lots of consideration for a user of this but still valuable to have the TTL. Especially since even a nice TTL of like 20d is going to be much, much larger than the snapshot interval. The potential inconsistency seems really unlikely to occur in a real-world application?

EG: suppose the journal TTL was less than the snapshot interval. This satisfies the requirement "snapshot TTL less than journal TTL", but not the property above.

Prior to any TTL expiry:

E1, E2, E3, E4

Note no snapshot yet.... Expiry and delete occurs for E1 and E2:

E3, E4

Oops! We can no longer recover the persistent entity state.

Compare to the snapshot interval << journal TTL:

E1, E2, E3, E4

Expiring hasn't occurred yet. Snapshot interval must occur first (per given above):

E1, E2, E3, E4, S1

Expiry and delete occurs for E1 and E2:

E3, E4, S1

Still safe: The persistent entity can be entirely recovered from S1. Expiry and delete occurs for E3 and E4:

S1

Still safe: The persistent entity can be entirety recovered from S1.

@Dnomyar
Copy link
Author

Dnomyar commented Oct 19, 2021

Thanks very much for your detailed feedback @coreyoconnor!

I pushed a few commits to try to address the feedback. I added a section in the README to describe how to use this and what are the things to be cautious about.

There is one thing I am not sure about in the readme. See my comment https://github.com/akka/akka-persistence-dynamodb/pull/99/files#r731892138

EDIT: Also, do you think Checking the expireAt during read. should be part of this PR or are you happy to raise an issue to implement that later on?

@Dnomyar Dnomyar requested a review from coreyoconnor November 10, 2021 14:53
@coreyoconnor
Copy link
Member

Thank you for your patience. I'm working to review this this week. Looks like I still have some CI fixes to push first.

Copy link
Member

@coreyoconnor coreyoconnor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a note to the docs that an appropriate TTL also depends on snapshot interval. Then LGTM! Thanks!

I still have to fix CI before merging tho. Working on that.

README.md Outdated

Note that `E1` was expired prior to `E0`. Which is entirely possible given the description from Amazon on how TTL is implemented.

As a consequence, a separate TTL for snapshot should be preferred. The journal TTL should be configured to be a bit larger than the snapshot. Thus, any out-of-order TTL removal of journal item can be prior to the snapshot item.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. That is correct.

Considering this further: I don't think that requirement: "the snapshot TTL < journal TTL" is sufficient by itself. There also needs to be consideration of the snapshot interval when defining the TTL.

The property I'm thinking is valuable: The persistent entity can be recovered by replaying the events since any persisted snapshot.

So, if there is a sequence like E1, E2, S1, E3, E4, S2, E5 for a persistent entity. Then an equal entity state can be recovered from the replay S1, E3, E4, E5 or S2, E5.

I think the minimum requirement is:

  • snapshot interval much less than snapshot TTL

With the requirement:

  • snapshot TTL less than journal TTL. At least 48hrs less.

being necessary only to ensure the entity can be recovered by replaying the events since any persisted snapshot.

Whew! Lots of consideration for a user of this but still valuable to have the TTL. Especially since even a nice TTL of like 20d is going to be much, much larger than the snapshot interval. The potential inconsistency seems really unlikely to occur in a real-world application?

EG: suppose the journal TTL was less than the snapshot interval. This satisfies the requirement "snapshot TTL less than journal TTL", but not the property above.

Prior to any TTL expiry:

E1, E2, E3, E4

Note no snapshot yet.... Expiry and delete occurs for E1 and E2:

E3, E4

Oops! We can no longer recover the persistent entity state.

Compare to the snapshot interval << journal TTL:

E1, E2, E3, E4

Expiring hasn't occurred yet. Snapshot interval must occur first (per given above):

E1, E2, E3, E4, S1

Expiry and delete occurs for E1 and E2:

E3, E4, S1

Still safe: The persistent entity can be entirely recovered from S1. Expiry and delete occurs for E3 and E4:

S1

Still safe: The persistent entity can be entirety recovered from S1.

Before enabling the TTL feature, make sure to have an understanding of the potential impacts detailed below.

#### Don't rely on the TTL for business logic related to the expiration
As per specified in the [AWS DynamoDB configuration](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/howitworks-ttl.html),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice description of the design considerations required!

README.md Outdated

Note that `E1` was expired prior to `E0`. Which is entirely possible given the description from Amazon on how TTL is implemented.

As a consequence, a separate TTL for snapshot should be preferred. The journal TTL should be configured to be a bit larger than the snapshot. Thus, any out-of-order TTL removal of journal item can be prior to the snapshot item.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the above consideration I'd rewrite: "As a consequence, a separate TTL for snapshot should be preferred. The journal TTL should be configured to be a bit larger than the snapshot. Thus, any out-of-order TTL removal of journal item can be prior to the snapshot item."

To something like:

As a consequence, configuring the TTLs requires careful consideration. For an entity to be recovered from the persisted snapshot and and journal events the snapshot interval must be less than snapshot TTL. Further, to ensure the useful property: The entity can be recovered by replaying the events since any persisted snapshot. The journal TTL should be larger than the snapshot TTL.

README.md Outdated

In DynamoDB, this feature can be activated in a per-table basis by specifying the name of the field containing the expiring date.

Expiring items is available for journal and snapshot tables. In order to activate it, `dynamodb-item-ttl-config.field-name` and `dynamodb-item-ttl-config.ttl` need to be specified:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"need to be specified. For example, given a system with a snapshot interval of 10 days. The TTLs could be configured:"

@lightbend-cla-validator

At least one pull request committer is not linked to a user. See https://help.github.com/en/articles/why-are-my-commits-linked-to-the-wrong-user#commits-are-not-linked-to-any-user

# Conflicts:
#	src/main/scala/akka/persistence/dynamodb/DynamoDBConfig.scala
#	src/main/scala/akka/persistence/dynamodb/journal/DynamoDBJournalRequests.scala
#	src/main/scala/akka/persistence/dynamodb/snapshot/DynamoDBSnapshotConfig.scala
@lightbend-cla-validator

At least one pull request committer is not linked to a user. See https://help.github.com/en/articles/why-are-my-commits-linked-to-the-wrong-user#commits-are-not-linked-to-any-user

@lightbend-cla-validator

At least one pull request committer is not linked to a user. See https://help.github.com/en/articles/why-are-my-commits-linked-to-the-wrong-user#commits-are-not-linked-to-any-user

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants