Building a multi-value semaphore with DynamoDB

Building a multi-value semaphore with DynamoDB

Luc van Donkersgoed

Luc van Donkersgoed

Semaphores are a concept invented by fellow Dutchie Edsger Dijkstra. It describes the way a process can ‘claim’ a resource in a multi-process environment. This allows a system to avoid multiple processes simultaneously accessing a resource that can only handle one process or execution at a time. An example: a hard drive (the spinning disk type) can only read one disk location at a time, so a semaphore locks drive interactions until the current process is complete. When the first process is complete, the next process gets the lock. Semaphores are closely related to queues, because there might be backlog of processes waiting to acquire a lock on a resource.

A simple implementation of a semaphore on AWS involves SQS FIFO queues. The queue contains a list of tasks that need to be executed, but only one task can be performed at the same time. A queue worker takes a task from the top of the queue. The task then checks if a process is already running. If it is already running, the task is returned to the queue (in SQS terminology: made visible again). If no process is running, the task is executed and deleted from the queue. It’s that easy.

FIFO Queue

This process works fine until the tasks on the queue require different resources to be locked. For example, the items on the queue might look like this:

Two Resources

In this example task 1 is taken off the queue and locks resource A. Then task 2 is taken off the queue, but it is immediately returned because resource A is locked. This continues until task 1 completes. During this entire time task 3 has been idle on the queue, even though it doesn’t contend for the same resources as task 1 and 2.

This is a complex problem that can’t be solved with simple queues. So let’s build a multi-value semaphore solution with a more powerful tool. But first, let me tell you about a real-world use case to give the concept a bit more body.

A parallel deployment pipeline in AWS Step Functions

In my previous article “AWS Step Functions: the Deployment Orchestrator that CodePipeline should have been” I laid out the case for Step Functions as a deployment pipeline. One of the challenges I encountered was the fact that Step Functions are an explicitly parallel technology: they are designed to run as soon as you trigger them, regardless if another state machine execution is already running. This is great when you’re deploying to multiple environments (e.g. acceptance and production), but quite terrible when two people try to deploy to the same environment at the same time. This problem can be solved by a multi-value semaphore: an environment (a deployment target) is a ‘resource’, and multiple environments can be considered multiple, non-conflicting resources.

The simple approach would be to create a dedicated SQS queue for every environment. Each queue is processed by a separate queue worker, and the shared deployment pipeline is started when the requested environment / resource is free. However, this doesn’t scale well when the amount of environments / resources is dynamic. It would require a new queue and queue worker to be deployed for every new environment. So instead, we will use DynamoDB.

A multi-value semaphore in DynamoDB

Let’s start with a quick architectural overview.
Solution Overview
A new deployment is queued through an API call. This call will trigger a Lambda Function which puts the deployment details in a DynamoDB table. A time-based event periodically triggers another Lambda Function. This function checks for queued deployments in DynamoDB. If any are found, it will check if a deployment for the specified environment is already running. If so, it will back off. If not, it will trigger the event and delete it from the DynamoDB table.

A slightly more robust solution would involve EventBridge. Instead of removing the item from the queue as soon as the deployment starts, we only remove the item from DynamoDB when an “Execution succeeded” event is received by EventBridge. When an “Execution failed” event occurs we can choose to retry the deployment or put the event on a Dead Letter Queue.
Solution EventBridge

DynamoDB schema design

The best practice for DynamoDB is to use Single Table Design. This means that different types of objects and resources are stored in a single table, and a composite primary key (consisting of a Partition Key or PK and a Sort Key or SK) is used to retrieve these resources. With Single Table Designs you design your PK and SK around access patterns - you determine how you will query the data, and store resources with a PK and SK which facilitates those access patterns. So what does that look like for our multi-value semaphore?

We can define two access patterns: the first is to answer “which environments exist on the queue”, the second is to answer “what is the next queued deployment for each environment”. The term ‘environment’ is specific to the deployment pipelines use case. In a more generic sense, the word ‘environment’ is interchangeable with ‘resource’ and a ‘deployment’ is a ‘process’: the semaphore is locking resources and we’re using the queue mechanism to fetch the next process to access the resource.

To be able to answer the first query, we store every environment under the PK ENVIRONMENT and the SK <name_of_the_environment>:

PK SK QueuedBuilds
ENVIRONMENT Acceptance 2
ENVIRONMENT Production 1

We will also add a QueuedBuilds attribute. When a deployment for a new environment is pushed onto the queue this value will be initialized at 1. If a build for a previously used environment is added to the queue, the QueuedBuilds attribute will be incremented by one.

To be able to answer the second question we need to store queued deployments. The query is “what is the next queued item for each environment”, so we will store them like this:

PK SK Parameters
ENVIRONMENT#Acceptance 2021-03-20T20:01:45Z {“skip_api_v1”: true}
ENVIRONMENT#Acceptance 2021-03-20T20:02:12Z {}
ENVIRONMENT#Production 2021-03-20T21:03:21Z {}

For each environment we retrieved from the previous query we can request all builds by querying PK = 'ENVIRONMENT#<environment_name'. By setting the result limit at 1, we only retrieve the oldest build.

Implementation in Python

To put a new deployment on the queue, we first store the environment, or increment the QueuedBuilds attribute by 1.

from botocore.exceptions import ClientError

try:
    table.put_item(
        Item={
            'PK': 'ENVIRONMENT',
            'SK': environment,
            'QueuedBuilds': 1
        },
        ConditionExpression='attribute_not_exists(PK) AND attribute_not_exists(SK)'
    )
except ClientError as e:
    if e.response['Error']['Code'] == "ConditionalCheckFailedException":
        table.update_item(
            Key={
                'PK': 'ENVIRONMENT',
                'SK': environment
            },
            UpdateExpression="set QueuedBuilds = QueuedBuilds + :val",
            ExpressionAttributeValues={':val': 1}
        )
    else:
        raise

Then we store the build’s details:

from datetime import datetime

table.put_item(
    Item={
        'PK': f'ENVIRONMENT#{environment}',
        'SK': datetime.now().isoformat(),
        'Parameters': parameters
    }
)

To fetch the items at the top of the queue, we first retrieve all active environments using a FilterExpression:

from boto3.dynamodb.conditions import Key, Attr

response = table.query(
    KeyConditionExpression=Key('PK').eq('ENVIRONMENT'),
    FilterExpression=Attr('QueuedBuilds').gt(0)
)
return response['Items']

At this point we check if we have any State Machines running for the given environments. If we do, we drop the environment from the process. Then we retrieve the oldest build for every remaining environment:

from boto3.dynamodb.conditions import Key, Attr

response = table.query(
    KeyConditionExpression=Key('PK').eq(f'ENVIRONMENT#{environment}'),
    Limit=1
)
return response['Items']

These builds can then be triggered, which means the State Machines will be invoked. When the build has started (or completed, in the EventBridge example), we delete the build from DynamoDB and decrement the amount of QueuedBuilds.

Conclusion

With the design and code above we have implemented a multi-value semaphore in DynamoDB. This will allow us to lock distinct resources, while at the same time allowing for parallelism for separate resources. The use case displayed in this article is a deployment pipeline, but the concept applies to any system that requires locking multiple resources. If you only need to lock single resources or a known and unchanging amount of resources one or more simple SQS queues might suffice. However, I hope this article has shown that a DynamoDB-based solution is not hard to implement, and allows for more flexibility and powerful extensions to your queues.

I share posts like these and smaller news articles on Twitter, follow me there for regular updates! If you have questions or remarks, or would just like to get in touch, you can also find me on LinkedIn.