Simplify data updates in DynamoDB with DynamoDB Streams, and retain your sanity

Luc van Donkersgoed

Unlike in relational databases, data duplication is okay in DynamoDB. In many cases it is actually a best practice for performance. But duplicated data is harder to maintain and reason about: you need to keep track of all the locations where the data is stored and update it in each of those locations. In this article I will describe how to simplify this process with DynamoDB Streams.

Preface 1: This article is a follow-up to The Spiraling Complexity of DynamoDB Data Duplication. In that post I tried to sketch the ways data duplication can complicate your application. To simplify the topic I steered clear of partition key design. However, by doing so, I ended up with an unrealistic database design, and the ‘problems’ I discussed in that article were partly a result of it. In this article we will use a proper partition key scheme and a design like you would use in a real application.

Preface 2: The example below is deliberately simple. A blogging website might not require the techniques illustrated here. However, a more complex example would be harder to read and understand, which would make the concepts of data duplication and DynamoDB Streams harder to explain.

Preface 3: One final note: I will still skip key sharding. Although it is an essential topic for performance at scale, it has no real implications for table design and data updates. Leaving sharding out keeps this article more readable.

With that out of the way, let’s dive in!

Our use case

The data we’re modeling is for a blogging website. The wireframes for the website are displayed below.
Wireframes

From the wireframes we can define the following access patterns:

  • Fetch all authors (ordered by name)
  • Fetch all blog posts and their authors (ordered by date)
  • Fetch all blog posts by a specific author (ordered by date)
  • Fetch a single blog with its author, related blogs, and their authors

To accommodate these access patterns we define two indices: the primary index which consists of a partition key (PK) and a sort key (SK), and a Global Secondary Index (GSI) which consists of a separate partition key (GSI1-PK) and sort key (GSI1-SK).
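For reference, a table with this key schema could be defined roughly as in the sketch below. The table name, billing mode, and use of boto3 are my own illustrative assumptions; the article itself does not prescribe how the table is provisioned.

import boto3

# Rough sketch of the table design described above. The table name and
# billing mode are illustrative assumptions, not part of the article.
dynamodb = boto3.client("dynamodb")

dynamodb.create_table(
    TableName="BlogTable",
    AttributeDefinitions=[
        {"AttributeName": "PK", "AttributeType": "S"},
        {"AttributeName": "SK", "AttributeType": "S"},
        {"AttributeName": "GSI1-PK", "AttributeType": "S"},
        {"AttributeName": "GSI1-SK", "AttributeType": "S"},
    ],
    KeySchema=[
        {"AttributeName": "PK", "KeyType": "HASH"},   # partition key
        {"AttributeName": "SK", "KeyType": "RANGE"},  # sort key
    ],
    GlobalSecondaryIndexes=[
        {
            "IndexName": "GSI1",
            "KeySchema": [
                {"AttributeName": "GSI1-PK", "KeyType": "HASH"},
                {"AttributeName": "GSI1-SK", "KeyType": "RANGE"},
            ],
            "Projection": {"ProjectionType": "ALL"},
        }
    ],
    BillingMode="PAY_PER_REQUEST",
)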

The full table looks like this:

| PK | SK | GSI1-PK | GSI1-SK | Author | Bio | Excerpt | Title |
| --- | --- | --- | --- | --- | --- | --- | --- |
| AUTHOR#aabb-1123 | AUTHOR#aabb-1123 | AUTHOR | Luc | | Luc is the managing editor of the blog | | |
| AUTHOR#aabb-1123 | BLOGPOST#2021-04-24T11:32:00Z | BLOGPOST | 2021-04-24T11:32:00Z | {"Name": "Luc", "Role": "Managing Editor", "Bio": "Luc is the managing editor of the blog"} | | A short text | My First Article |
| AUTHOR#aabb-1123 | BLOGPOST#2021-04-28T12:03:00Z | BLOGPOST | 2021-04-28T12:03:00Z | {"Name": "Luc", "Role": "Managing Editor", "Bio": "Luc is the managing editor of the blog"} | | Another short text | My Second Article |
| AUTHOR#uuas-7437 | AUTHOR#uuas-7437 | AUTHOR | Jane | | Jane is the science editor of the blog | | |
| AUTHOR#uuas-7437 | BLOGPOST#2021-04-27T09:02:01Z | BLOGPOST | 2021-04-27T09:02:01Z | {"Name": "Jane", "Role": "Science Editor", "Bio": "Jane is the science editor of the blog"} | | A short science text | A Science Article |
| AUTHOR#uuas-7437 | BLOGPOST#2021-04-28T12:03:00Z#1234-5321#RELATEDPOST#2021-04-24T11:32:00Z | RELATEDPOST | 2021-04-28T12:03:00Z#AUTHOR#aabb-1123 | {"Name": "Luc", "Role": "Managing Editor", "Bio": "Luc is the managing editor of the blog"} | | A short text | My First Article |

Please note that I left out sharding for GSI1-PK for readability. For more information about sharding, see the DynamoDB Docs Using Write Sharding to Distribute Workloads Evenly or this article by Alex DeBrie: Leaderboard & Write Sharding.

To access the data for the access patterns above, we can execute the following queries:
  • Fetch all authors (ordered by name):
    Query(Index=GSI1, PK.eq('AUTHOR'))
  • Fetch all blog posts and their authors (ordered by date):
    Query(Index=GSI1, PK.eq('BLOGPOST'))
  • Fetch all blog posts by a specific author (ordered by date):
    Query(Index=MainIndex, PK.eq('AUTHOR#<id>'), SK.beginswith('BLOGPOST#')) (condition: GSI1-PK.eq('BLOGPOST'))
  • Fetch a single blog with its author, related blogs, and their authors:
    Query(Index=MainIndex, PK.eq('AUTHOR#<id>'), SK.beginswith('BLOGPOST#<blog date>'))
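Translated to boto3, these queries could look roughly like the following sketch. The table name BlogTable is an assumption carried over from the definition sketch above, and pagination is omitted.

import boto3
from boto3.dynamodb.conditions import Attr, Key

# Sketch of the four access patterns with boto3; the table and index
# names follow the (assumed) definition sketch above.
table = boto3.resource("dynamodb").Table("BlogTable")

# Fetch all authors (ordered by name)
authors = table.query(
    IndexName="GSI1",
    KeyConditionExpression=Key("GSI1-PK").eq("AUTHOR"),
)

# Fetch all blog posts and their authors (ordered by date)
posts = table.query(
    IndexName="GSI1",
    KeyConditionExpression=Key("GSI1-PK").eq("BLOGPOST"),
)

# Fetch all blog posts by a specific author (ordered by date);
# the filter drops the RELATEDPOST items that share the SK prefix
author_posts = table.query(
    KeyConditionExpression=Key("PK").eq("AUTHOR#aabb-1123")
    & Key("SK").begins_with("BLOGPOST#"),
    FilterExpression=Attr("GSI1-PK").eq("BLOGPOST"),
)

# Fetch a single blog with its author, related blogs, and their authors
single_post = table.query(
    KeyConditionExpression=Key("PK").eq("AUTHOR#aabb-1123")
    & Key("SK").begins_with("BLOGPOST#2021-04-24T11:32:00Z"),
)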

In the data table we see two types of duplication: the author is duplicated into every blog post, and blog posts themselves are duplicated into related posts. This has the following significant performance benefits:

  • When retrieving all blog posts, we do not need to separately retrieve all their authors
  • When displaying a single post, we can retrieve the post and all of its related posts (and their authors) in a single query.

This article is about managing updates to duplicated data, so let’s take a look at the different update (and delete) operations we can perform.

Data mutations

In our example use case we see three data types: Authors, Blog Posts, and Related Posts. The complete list of mutations for these data types is:

| Mutation | Data Type | Affects duplicated data | DDB Operations |
| --- | --- | --- | --- |
| Create Author | AUTHOR | No | Create item: Author |
| Update Author | AUTHOR | Yes | Update item: Author; Update items: Blog Posts by Author; Update Items: Related Posts by Author |
| Delete Author | AUTHOR | Yes | Delete item: Author; Delete items: Blog Posts by Author; Delete Items: Related Posts by Author |
| Create Blog Post | BLOGPOST | No | Create item: Blog Post |
| Update Blog Post | BLOGPOST | Yes | Update item: Blog Post; Update items: Related Posts matching Blog Post |
| Delete Blog Post | BLOGPOST | Yes | Delete item: Blog Post; Delete items: Related Posts matching Blog Post |
| Add Related Post | RELATEDPOST | No | Create item: Related Blog Post |
| Remove Related Post | RELATEDPOST | No | Delete item: Related Blog Post |

As you can see, half of the mutations impact multiple items in DynamoDB. There are two ways to approach these updates: either put all the database mutations in application logic, or use an event-based stream to post-process mutations.

DynamoDB Streams

In this article we will focus on the solution based on DynamoDB Streams, because it offers clear benefits over the in-application solution:

  • The business logic is cleaner. When you want to update an Author, you just update the Author and don’t worry about data propagation.
  • Future API changes are supported without code changes. If another process also mutates an Author, its changes will be propagated automatically. The developer working on that change does not need to be aware of all the implications for the data model.
  • The initial mutation is faster. When a user updates an Author, the application will do just that and return a response quickly. The user does not need to wait for all the data to propagate, which improves user experience.
  • Multi-layered updates (update an author, then all its blog posts, then all blog posts relating to those blog posts) are much easier to implement, as we will see below.
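For this to work, the table needs a stream with NEW_AND_OLD_IMAGES enabled and the stream processor Lambda must be subscribed to it. Below is a minimal sketch with boto3, assuming a table called BlogTable and a function called stream-processor already exist; in a real project you would define this in your infrastructure-as-code and also grant the function permission to read from the stream.

import boto3

# Sketch: enable a stream on an existing table and subscribe a Lambda
# function to it. Table and function names are illustrative assumptions.
dynamodb = boto3.client("dynamodb")
lambda_client = boto3.client("lambda")

# Emit both the old and new item images, as used later in this article
response = dynamodb.update_table(
    TableName="BlogTable",
    StreamSpecification={
        "StreamEnabled": True,
        "StreamViewType": "NEW_AND_OLD_IMAGES",
    },
)
stream_arn = response["TableDescription"]["LatestStreamArn"]

# Point the stream processor Lambda at the stream
lambda_client.create_event_source_mapping(
    EventSourceArn=stream_arn,
    FunctionName="stream-processor",
    StartingPosition="LATEST",
    BatchSize=100,
)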

The DynamoDB Streams format

In our architecture, the function that holds the business logic (for example, the function that updates an Author) only mutates the base data. In the case of UpdateBlogPost, this might look like the following pseudo code:

Table.update(
    Key={
        PK="AUTHOR#aabb-1123",
        SK="BLOGPOST#2021-04-24T11:32:00Z"
    },
    UpdatedValues={
        Excerpt="An updated short text"
    }
)
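With boto3, this pseudo code might translate to something like the sketch below. Note that only the base item is written; propagation is left entirely to the stream processor. The table name is an assumption carried over from the earlier sketches.

import boto3

# Sketch of the same update with boto3; only the base item is touched,
# data propagation is left to the stream processor.
table = boto3.resource("dynamodb").Table("BlogTable")

table.update_item(
    Key={
        "PK": "AUTHOR#aabb-1123",
        "SK": "BLOGPOST#2021-04-24T11:32:00Z",
    },
    UpdateExpression="SET Excerpt = :excerpt",
    ExpressionAttributeValues={":excerpt": "An updated short text"},
)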

This mutation will generate an event in the DynamoDB Stream. When this event is received by the stream processor Lambda, it will look like this:

{
    "Records": [
        {
            "eventID": "980d4f075f139e2b240275ef54296fe0",
            "eventName": "MODIFY",
            "eventVersion": "1.1",
            "eventSource": "aws:dynamodb",
            "awsRegion": "eu-west-1",
            "dynamodb": {
                "ApproximateCreationDateTime": 1619032671,
                "Keys": {
                    "PK": {
                        "S": "AUTHOR#aabb-1123"
                    },
                    "SK": {
                        "S": "BLOGPOST#2021-04-24T11:32:00Z"
                    }
                },
                "NewImage": {
                    "PK": {
                        "S": "AUTHOR#aabb-1123"
                    },
                    "SK": {
                        "S": "BLOGPOST#2021-04-24T11:32:00Z"
                    },
                    "GSI1-PK": "BLOGPOST",
                    "GSI1-SK": "2021-04-24T11:32:00Z",
                    "Author": "{\"Name\": \"Luc\", \"Role\": \"Managing Editor\", \"Bio\": \"Luc is the managing editor of the blog\"}",
                    "Excerpt": "An updated short text",
                    "Title": "My First Article"
                },
                "OldImage": {
                    "PK": {
                        "S": "AUTHOR#aabb-1123"
                    },
                    "SK": {
                        "S": "BLOGPOST#2021-04-24T11:32:00Z"
                    },
                    "GSI1-PK": "BLOGPOST",
                    "GSI1-SK": "2021-04-24T11:32:00Z",
                    "Author": "{\"Name\": \"Luc\", \"Role\": \"Managing Editor\", \"Bio\": \"Luc is the managing editor of the blog\"}",
                    "Excerpt": "A short text",
                    "Title": "My First Article"
                },
                "SequenceNumber": "69314500000000025048375850",
                "SizeBytes": 615,
                "StreamViewType": "NEW_AND_OLD_IMAGES"
            },
            "eventSourceARN": "arn:aws:dynamodb:eu-west-1:123412341234:table/mytable/stream/2021-04-19T11:48:47.369"
        }
    ]
}

Important parts of this payload are "eventName": "MODIFY" (this can also be INSERT or DELETE), the “NewImage” and the “OldImage”. These fields tell us all we need to know to propagate the data. Our mutations table above shows that when a blog post is updated, the next step is to “Update items: Related Posts matching Blog Post”.

In pseudo code, we can achieve this as follows:

New Inbound Event
- if eventName == "MODIFY"
  - if GSI1-PK.eq("BLOGPOST")
    - For all items where GSI1-PK.eq("RELATEDPOST")
      and GSI1-SK.eq(<updated blogpost date>#AUTHOR#<author id>)
      - Update Related Blog Post

Let’s walk through this step by step. First, we check if the event is a modify event. If it is, we check its values. If the GSI1-PK equals BLOGPOST we can be sure this is an updated blog post. We know we need to update all the related blog posts, so using the GSI1 index we retrieve all items where GSI1-PK matches RELATEDPOST and GSI1-SK matches the date and author of the blog post we’re updating. For all of these items, we update the contents with the new values received in NewImage.

When this process has completed, all related posts contain the same info as the updated original post.
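A stream processor Lambda implementing this branch could look roughly like the sketch below. It follows the pseudo code above and the attribute names from the example table; the table name is an assumption, and pagination, batching, and error handling are left out for brevity.

import boto3
from boto3.dynamodb.conditions import Key

# Sketch of the stream processor branch for updated blog posts.
# The table name is an assumption; attribute names follow the design above.
table = boto3.resource("dynamodb").Table("BlogTable")


def handler(event, context):
    for record in event["Records"]:
        if record["eventName"] != "MODIFY":
            continue
        new_image = record["dynamodb"]["NewImage"]
        if new_image.get("GSI1-PK", {}).get("S") != "BLOGPOST":
            continue

        blog_date = new_image["GSI1-SK"]["S"]  # e.g. 2021-04-24T11:32:00Z
        author_id = new_image["PK"]["S"]       # e.g. AUTHOR#aabb-1123

        # Find every related-post copy of this blog post via GSI1
        related = table.query(
            IndexName="GSI1",
            KeyConditionExpression=Key("GSI1-PK").eq("RELATEDPOST")
            & Key("GSI1-SK").eq(f"{blog_date}#{author_id}"),
        )

        # Copy the new values into each related-post item
        for item in related["Items"]:
            table.update_item(
                Key={"PK": item["PK"], "SK": item["SK"]},
                UpdateExpression="SET Title = :t, Excerpt = :e, Author = :a",
                ExpressionAttributeValues={
                    ":t": new_image["Title"]["S"],
                    ":e": new_image["Excerpt"]["S"],
                    ":a": new_image["Author"]["S"],
                },
            )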

Next, let’s look at updating Authors. An Author might be updated through the following pseudo code:

Table.update(
    Key={
        PK="AUTHOR#aabb-1123",
        SK="AUTHOR#aabb-1123"
    },
    UpdatedValues={
        GSI1-SK="Luc",
        Bio="This is an updated bio"
    }
)

This will generate the following event in the DynamoDB stream:

{
    "Records": [
        {
            "eventID": "980d4f075f139e2b240275ef54296fe0",
            "eventName": "MODIFY",
            "eventVersion": "1.1",
            "eventSource": "aws:dynamodb",
            "awsRegion": "eu-west-1",
            "dynamodb": {
                "ApproximateCreationDateTime": 1619032671,
                "Keys": {
                    "PK": {
                        "S": "AUTHOR#aabb-1123"
                    },
                    "SK": {
                        "S": "AUTHOR#aabb-1123"
                    }
                },
                "NewImage": {
                    "PK": {
                        "S": "AUTHOR#aabb-1123"
                    },
                    "SK": {
                        "S": "AUTHOR#aabb-1123"
                    },
                    "GSI1-SK": {
                        "S": "Luc"
                    },
                    "Bio": {
                        "S": "This is an updated bio"
                    },
                },
                "OldImage": {
                    "PK": {
                        "S": "AUTHOR#aabb-1123"
                    },
                    "SK": {
                        "S": "AUTHOR#aabb-1123"
                    },
                    "GSI1-SK": {
                        "S": "Luc"
                    },
                    "Bio": {
                        "S": "Luc is the managing editor of the blog"
                    },
                },
                "SequenceNumber": "69314500000000025048375850",
                "SizeBytes": 615,
                "StreamViewType": "NEW_AND_OLD_IMAGES"
            },
            "eventSourceARN": "arn:aws:dynamodb:eu-west-1:123412341234:table/mytable/stream/2021-04-19T11:48:47.369"
        }
    ]
}

The mutations table shows that when we update an Author we need to post-process it with:

  • Update items: Blog Posts by Author
  • Update Items: Related Posts by Author

In pseudo code, the first step looks like this:

New Inbound Event
- If eventName == MODIFY
  - If GSI1-PK.eq("AUTHOR")
    - For all items where PK.eq(<event author>)
      and SK.beginswith("BLOGPOST#") and GSI1-PK.eq("BLOGPOST")
      - Update Author
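Under the same assumptions as the earlier sketch, this branch of the stream processor could look like the code below. The duplicated Author value is rebuilt from only the attributes present in the example event (name and bio), and it deliberately touches only the author’s blog posts, because the related-post copies are handled by the cascade described next.

import json

import boto3
from boto3.dynamodb.conditions import Attr, Key

# Sketch of the stream processor branch for updated authors; it rebuilds
# the duplicated Author value and writes it to every blog post in the
# author's partition. Table name and the Author JSON layout are taken
# from the example data above; pagination and error handling are omitted.
table = boto3.resource("dynamodb").Table("BlogTable")


def handle_author_update(record):
    new_image = record["dynamodb"]["NewImage"]
    author_pk = new_image["PK"]["S"]  # e.g. AUTHOR#aabb-1123

    # Rebuild the duplicated Author blob from the updated author item
    author_blob = json.dumps({
        "Name": new_image["GSI1-SK"]["S"],
        "Bio": new_image["Bio"]["S"],
    })

    # All blog posts in this author's partition (exclude RELATEDPOST items)
    posts = table.query(
        KeyConditionExpression=Key("PK").eq(author_pk)
        & Key("SK").begins_with("BLOGPOST#"),
        FilterExpression=Attr("GSI1-PK").eq("BLOGPOST"),
    )

    for item in posts["Items"]:
        table.update_item(
            Key={"PK": item["PK"], "SK": item["SK"]},
            UpdateExpression="SET Author = :a",
            ExpressionAttributeValues={":a": author_blob},
        )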

This will update all the blog posts by this author. The beauty of this system is that every updated blog post will itself generate an UpdateBlogPost event, just like the one discussed above. This will automatically perform the “Update Items: Related Posts by Author” change. Let’s look at that process in a flow chart:

Flow Chart

So when an Author is updated (1), the stream processor updates that author’s blog posts (2, 3, 4). The blog post updates then generate the next wave of stream events (5), which are again picked up by the stream processor (6, 7), which updates all the related blog post references (8). These updates generate yet another wave of events, but these are ignored (9, 10, 11, 12).

When this process has completed, the Author info has been updated in every blog post and related blog post in the table. And all the application had to do was kick-start the process by updating the Author.

Conclusion

Duplication in DynamoDB is hard. Maintaining duplicated data is even harder. If you expect multiple developers to keep track of all the places data is stored and duplicated, you’re in for a rough ride. DynamoDB Streams can alleviate a lot of this pain. Of course, the stream processor still needs to be maintained, but it is an isolated domain. It’s easier to remember that when you update the data model, you also need to update the stream processor. And finally, you only need to update a single element of the application; you don’t need to scour the code base for every function that might touch the duplicated data. In conclusion, I believe that DynamoDB Streams are the only sane way to manage duplicated data in a DynamoDB environment.

I share posts like these and smaller news articles on Twitter; follow me there for regular updates! If you have questions or remarks, or would just like to get in touch, you can also find me on LinkedIn.
