Serverless is an event-driven world

Years ago, we all hosted long-running daemons on servers. Now we are entering a serverless era, in which everything is triggered by events.

This is obvious if you think about normal website traffic. A user hits the URL of your website or API endpoint, API Gateway triggers a Lambda function, and the Lambda function then calls DynamoDB to update or retrieve data. Everything starts from the user.

However, if your system is big enough, you will sooner or later face cases that require scheduled actions. How can we adopt a serverless architecture for those cases?

My past project: EV charger control system

One of my past projects was to build a cloud-based system that remotely controls EV (Electric Vehicle) chargers. The main feature is to time each charging session and stop the charger when the duration expires (users have to select the duration and pay for it before the session starts).

My design is simple. Users interact with the system through API Gateway. A Lambda function handles the logic and sends commands to an on-site Raspberry Pi through an SQS queue, and the Raspberry Pi then controls the chargers over the on-site LAN.

The main problem is how to trigger an event when a session expires. If I were hosting a classic server, I could run a CRON job every minute. Or, at the extreme, I could run an infinite loop that checks for expiring sessions every second.

import time

while True:
    terminate_expired_sessions()  # placeholder: stop every session past its expiry time
    time.sleep(1)

What about in a serverless architecture? What should we do?

Option 1: CRON

In the old days, we used CRON for time-based jobs, so why not use the same method? In AWS, we can use a CloudWatch Event Rule to trigger Lambda functions on a schedule. In Azure, we can set up a Function App with a timer trigger.

We can use CloudWatch Event Rule to trigger Lambda regularly

The trade-off is how often you want your scheduled function to run. If the interval is too long, you cannot schedule the event precisely. If it is too short, you invoke the function often while it does nothing most of the time.

Another problem is that you cannot set intervals shorter than 1 minute. In my case this is not a big deal: giving the customer up to one extra minute is not a big loss, so the option is still workable.
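To make the precision trade-off concrete, here is a minimal sketch that computes how long a session overruns when it can only be terminated on the next scheduled tick. The function name `overrun_seconds` and the assumption that ticks fire at whole multiples of the interval since midnight are mine, not part of the original system.

```python
from datetime import datetime


def overrun_seconds(expire_time: datetime, interval_seconds: int = 60) -> float:
    """Seconds between a session's expiry and the next scheduled tick.

    Assumes ticks fire at whole multiples of interval_seconds since midnight.
    """
    midnight = expire_time.replace(hour=0, minute=0, second=0, microsecond=0)
    elapsed = (expire_time - midnight).total_seconds()
    remainder = elapsed % interval_seconds
    # A session that expires exactly on a tick is terminated without delay.
    return 0.0 if remainder == 0 else interval_seconds - remainder


# A session expiring at 14:02:30 under a 1-minute and a 15-minute schedule.
print(overrun_seconds(datetime(2020, 1, 1, 14, 2, 30)))       # 30.0
print(overrun_seconds(datetime(2020, 1, 1, 14, 2, 30), 900))  # 750.0
```

With a 1-minute schedule the customer gets at most one free minute; with a 15-minute schedule the overrun can approach a quarter of an hour.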

Option 2: Just wait

If CloudWatch Event Rule cannot give us sub-minute precision, how about controlling the timing inside the Lambda function?

You may think it is crazy to let a function wait through a full 2-hour charging session just to terminate it.

But what if we use Option 1 to schedule a 1-minute CRON job that checks for sessions expiring within the next minute, and let the function wait until the exact termination time?

This is a common approach on hosted servers. Making a process wait is easier than scheduling a task. A waiting process also does no harm to the system: we have already provisioned the resources, and waiting processes consume very little of them.
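The waiting step could be sketched as follows. This is only an illustration of the idea: `wait_then_terminate` and its `terminate` callback are hypothetical names, and the `now` and `sleep` parameters exist only so that the function can be exercised without actually sleeping.

```python
import time
from datetime import datetime


def wait_then_terminate(expire_time, terminate, now=datetime.now, sleep=time.sleep):
    """Sleep until expire_time, then call terminate().

    Returns the number of seconds spent waiting (0.0 if already expired).
    """
    remaining = (expire_time - now()).total_seconds()
    if remaining > 0:
        # The process is idle here, which costs nothing on a provisioned server.
        sleep(remaining)
    terminate()
    return max(remaining, 0.0)
```

On a server, the CRON job would call this once for each session that expires within the next minute.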

Anti-patterns

However, doing so in Lambda makes us fall into two anti-patterns: 1) long-running tasks, and 2) not keeping the function busy.

AWS Lambda allows up to 1,000 concurrent executions per region by default. If we use it for long-running tasks, we will quickly exhaust that limit. Imagine 600 charging sessions expiring in the next minute: we invoke 600 functions that wait for their termination times, and for that period we have only 400 concurrent executions left to serve other requests.

Worse, we are paying for execution time to do nothing. Lambda is billed in 100ms increments, which is what makes the architecture cost-effective. By invoking a function just to wait, we lose this benefit.

Option 3: Step Functions

Another option people suggest is AWS Step Functions. It is a great tool when several actions require coordination.

For example, suppose you are building a travel booking system and have built functions to book hotel rooms and flights. If you invoke these two functions separately, how can you ensure that both succeed? I’m sure you don’t want your customers to fly to the destination and then realise their hotel booking was unsuccessful.

We can easily implement the saga pattern (here is a great talk by Chris Richardson) with Step Functions. You define the logic of the state transitions, e.g. if the flight booking succeeds, proceed to book a hotel room; if the hotel booking fails, cancel the flight.

Using Step Functions to implement saga pattern
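The saga logic itself does not depend on Step Functions. Here is a minimal plain-Python sketch of the idea: each step is paired with a compensating action, and a failure undoes the steps that already succeeded. The `run_saga` helper and the booking functions are hypothetical and only illustrate the pattern; they are not how Step Functions is configured.

```python
def run_saga(steps):
    """Run (action, compensation) pairs in order.

    If an action raises, run the compensations of the steps that already
    succeeded, in reverse order, and report failure.
    """
    completed = []
    for action, compensation in steps:
        try:
            action()
        except Exception:
            for undo in reversed(completed):
                undo()
            return False
        completed.append(compensation)
    return True


# Hypothetical booking functions, for illustration only.
log = []

def book_flight():
    log.append("flight booked")

def cancel_flight():
    log.append("flight cancelled")

def book_hotel():
    raise RuntimeError("no rooms available")

def cancel_hotel():
    log.append("hotel cancelled")

ok = run_saga([(book_flight, cancel_flight), (book_hotel, cancel_hotel)])
print(ok, log)  # False ['flight booked', 'flight cancelled']
```

In Step Functions, the same transitions are declared in the state machine definition instead of being written as a loop.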

We can add waiting periods into Step Functions

Of course, we can add wait periods to our Step Functions, e.g. if the airline doesn’t allow you to cancel a newly booked flight, you can add a waiting period before cancelling it.
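As a rough sketch of what a wait period looks like, the snippet below builds a two-state definition in Amazon States Language: a Wait state that pauses until a timestamp taken from the input, followed by a Task state. The state names, the `expire_time` input field, and the function ARN are assumptions made up for this illustration.

```python
import json

# Hypothetical ARN, for illustration only.
TERMINATE_FUNCTION_ARN = (
    "arn:aws:lambda:us-east-1:111111111111:function:terminate_session"
)

state_machine = {
    "Comment": "Wait until a timestamp, then run a task",
    "StartAt": "WaitUntilExpiry",
    "States": {
        "WaitUntilExpiry": {
            "Type": "Wait",
            # Reads an ISO 8601 timestamp from the execution input.
            "TimestampPath": "$.expire_time",
            "Next": "TerminateSession",
        },
        "TerminateSession": {
            "Type": "Task",
            "Resource": TERMINATE_FUNCTION_ARN,
            "End": True,
        },
    },
}

definition = json.dumps(state_machine, indent=2)
print(definition)
```

An execution started with an input such as `{"expire_time": "2020-01-01T14:02:30Z"}` would sit in the Wait state until that moment and then invoke the task.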

It comes with a cost

Step Functions is a powerful tool, but it is not cheap: every 1,000 state transitions cost $0.025.

In my case, I just want the charger to stop at a certain time. No complex coordination is required, so Step Functions seems like overkill.

Option 4: CRON + SQS

Finally, I came up with this solution: use CloudWatch Events to implement CRON and SQS to trigger events with one-second precision.

It sort of combines the first three options: CloudWatch Events schedules a CRON job, which then creates a waiting period so that the charging session is terminated at the exact time. Instead of using the Lambda function itself or Step Functions, I chose SQS to implement the waiting period.

First, I created an SQS queue to store the scheduled actions. I didn’t use the existing command queue because I don’t want to push the command out directly. I want to execute a function that does a final check for any change during the waiting period. E.g. the user may have added more hours to the charging session, in which case I don’t want the system to terminate it.

AWS SQS allows a message to be delayed by up to 15 minutes, so I schedule my find_expiring_sessions function every 15 minutes.

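from datetime import datetime, timedelta

import boto3

sqs_client = boto3.client('sqs')

def find_expiring_sessions():
    now = datetime.now()
    fifteen_minutes_later = now + timedelta(minutes=15)
    # Sessions is the application's own session model
    expiring_sessions = Sessions.filter(
        expire_time_lte=fifteen_minutes_later,
        expire_time_gt=now
    )
    for expiring_session in expiring_sessions:
        # DelaySeconds must be an integer number of seconds (0 to 900)
        delay = int((expiring_session.expire_time - now).total_seconds())
        sqs_client.send_message(
            QueueUrl='xxxxxxxxxx',
            DelaySeconds=delay,
            MessageBody=str(expiring_session.id)
        )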

In the function, I find all the charging sessions that expire in the coming 15 minutes. For each session, I then push an SQS message to the queue with its own delay time.

Because SQS is pull-based, the function that is triggered by the queue needs permission to pull the message. So I attached a policy to the execution role of my handle_expiring_session function:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "sqs:DeleteMessage",
                "sqs:ReceiveMessage",
                "sqs:GetQueueAttributes"
            ],
            "Resource": "arn:aws:sqs:us-east-1:111111111111:ExpireSessionWaitQueue"
        }
    ]
}

I then configured a trigger from the queue to the handle_expiring_session function.

Now, whenever the charging time expires, my handle_expiring_session function is triggered and can immediately start the termination process.

The CRON + SQS architecture design
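A minimal sketch of what the queue-triggered function might look like, including the final check for sessions that were extended during the waiting period. The in-memory `SESSIONS` store and the `terminate_session` helper are stand-ins I made up so that the example runs on its own; the real system would read the session from its database and push a command to the command queue. The event shape (`Records` with a `body` field) is the one Lambda passes to functions triggered by SQS.

```python
from datetime import datetime

# Stand-ins for the real session store and command sender (assumptions).
SESSIONS = {"42": {"expire_time": datetime(2020, 1, 1, 14, 2, 30)}}
terminated = []


def terminate_session(session_id):
    terminated.append(session_id)


def handle_expiring_session(event, context=None, now=datetime.now):
    """Lambda handler for an SQS trigger: one record per delayed message."""
    for record in event["Records"]:
        session_id = record["body"]
        session = SESSIONS.get(session_id)
        # Final check: skip sessions that were removed or extended meanwhile.
        if session is None or session["expire_time"] > now():
            continue
        terminate_session(session_id)


handle_expiring_session({"Records": [{"body": "42"}]})
print(terminated)  # ['42']
```

If the user extends a session, its stored `expire_time` moves into the future, so the delayed message for the old expiry time is simply ignored.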

Comparing different options

I will use the following scenario to compare these 4 options:

240 session terminations per hour, evenly distributed
Every session terminates at the 30th second of a minute (e.g. 14:02:30)
Each execution takes 500ms and 128MB of memory
All charging sessions have the same duration

For option 1, there will be 60 CRON invocations per hour. Each invocation executes the Lambda function once. Because the terminations are executed in batches, we can assume the time needed remains 500ms.

The cost of option 1 would be: $0.00006 (60 CloudWatch Events) + $0.000012 (60 Lambda executions) + $0.00006249 (30s total execution time) = $0.00013449 per hour.

For option 2, there will be 60 CRON invocations per hour. Each invocation starts 4 Lambda functions (4 terminations per minute), and each function runs for 30.5 seconds (30 seconds of waiting plus 500ms of execution).

The cost of option 2 would be: $0.00006 (60 CloudWatch Events) + $0.000048 (240 Lambda executions) + $0.01524756 (240 x 30.5s Lambda execution time) = $0.01535556 per hour.

For option 3, given that Step Functions allows executions of up to 1 year, I assume that we don’t need CRON: we just schedule the termination when the session starts and let the execution wait until the session ends. There are 240 sessions per hour, so we will have 240 state transitions per hour.

The cost of option 3 would be: $0.006 (240 state transitions) per hour.

For option 4, there will be 4 CRON invocations per hour (15-minute interval). Together they generate 240 SQS messages (240 terminations per hour), and each message eventually invokes the termination function.

The cost of option 4 would be: $0.000004 (4 CloudWatch Events) + $0.000096 (240 SQS messages) + $0.000048 (240 Lambda executions) + $0.00024996 (240 x 500ms execution time) = $0.00039796 per hour.

Comparison of 4 different options
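The arithmetic behind these four figures can be reproduced with a short script. The unit prices below are the ones implied by the numbers in this article; treat them as assumptions and check current AWS pricing before reusing them. The variable and function names are mine.

```python
# Unit prices in USD, as implied by the figures in this article.
EVENT = 1.00 / 1_000_000           # per CloudWatch Event
LAMBDA_REQUEST = 0.20 / 1_000_000  # per Lambda invocation
LAMBDA_100MS = 0.0000002083        # per 100ms of execution at 128MB
SQS_MESSAGE = 0.40 / 1_000_000     # per SQS message
TRANSITION = 0.025 / 1000          # per Step Functions state transition


def lambda_time(seconds):
    """Cost of the given number of seconds of 128MB Lambda execution."""
    return seconds * 10 * LAMBDA_100MS


costs = {
    "option_1": 60 * EVENT + 60 * LAMBDA_REQUEST + lambda_time(60 * 0.5),
    "option_2": 60 * EVENT + 240 * LAMBDA_REQUEST + lambda_time(240 * 30.5),
    "option_3": 240 * TRANSITION,
    "option_4": (4 * EVENT + 240 * SQS_MESSAGE
                 + 240 * LAMBDA_REQUEST + lambda_time(240 * 0.5)),
}

for name, cost in costs.items():
    print(f"{name}: ${cost:.8f} per hour")
```

Changing the number of sessions per hour or the wait time in this script shows how quickly option 2 grows compared with the others.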

Among those options, using Lambda to wait is the worst one: it costs far more, and the 1,000-concurrency quota limits its scalability. Its only advantage is ease of use, since you can implement the waiting logic however you want inside your code.

Simple CRON is cheap and scalable. The downside is that it gives you at best 1-minute precision.

Step Functions is good at handling complex coordination, but it is overkill for simple time-based events.

CRON + SQS is the most suitable option if you want time-based events scheduled with one-second precision.

This article was originally published by Richard Fan on medium.