How We Spent $2,000 on AWS Lambda in Two Weeks — The Luxury of Provisioned Concurrency

Momodu Afegbua
Published in Towards AWS
3 min read · Jul 12, 2021


“Hey, I noticed a spike — over 1000% — in AWS Serverless budget. Did you notice any anomaly in the pipeline? bugs? anything?”

Working at a startup that processes at least 2.5 billion data points every single day can be fun, with a sprinkle of nightmares. And if you are a DevOps Engineer at such a company, you know damn well that you need to avoid outages at all costs. “At all costs”, though, still means keeping in the back of your mind that you don’t have all the millions in the world to spend on cloud bills. What better way to achieve this than through one of the cheapest cloud offerings: serverless computing?

In one of our applications, we had a plethora of APIs and backend event-based jobs run by AWS Lambda, some of which were exposed to the internet through AWS API Gateway. The code, being primarily Python, was deployed using Zappa and Serverless Framework. Using these tools meant the DevOps Engineers had little to do on the infrastructure side, as the frameworks provisioned all the resources they needed on their own. What was left was managing IAM, creating actionable metrics, building monitoring dashboards, and setting up alerts for each of the Lambda functions deployed.

At first, everything was smooth. But with time, the logic of some of the APIs became complex. With the API Gateway timeout set at a maximum of 30 seconds by default, coupled with Lambda cold starts, we started seeing spikes in 504 errors. Some of the complex APIs were averaging 32 seconds on a cold start. With a microservice architecture, a single API call could invoke more than 20 Lambda functions, any of which could hit a cold start. To keep them warm, we deployed another set of functions that pinged these APIs every 9 minutes. While the number of 504 errors dropped, some mission-critical APIs still timed out during peak periods.
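The keep-warm approach described above can be sketched roughly as follows. This is a minimal illustration, not the code we actually ran: the endpoint URLs, the `X-Warmup` header, and the function names are all hypothetical, and the handler is assumed to be triggered by a scheduled rule such as `rate(9 minutes)`.

```python
import json
import urllib.request
from typing import Callable, Dict, Iterable

# Hypothetical endpoints; the real API paths are not listed in this article.
WARM_TARGETS = [
    "https://api.example.com/orders/health",
    "https://api.example.com/reports/health",
]


def http_ping(url: str, timeout: float = 5.0) -> int:
    """Send a GET request and return the HTTP status code."""
    # The X-Warmup header is an assumed convention that lets the target
    # function recognise a warm-up call and return early.
    request = urllib.request.Request(url, headers={"X-Warmup": "true"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


def warm(targets: Iterable[str],
         ping: Callable[[str], int] = http_ping) -> Dict[str, str]:
    """Ping every target once and record either its status or its error."""
    results: Dict[str, str] = {}
    for url in targets:
        try:
            results[url] = str(ping(url))
        except Exception as exc:  # keep warming the rest if one target fails
            results[url] = f"error: {exc}"
    return results


def handler(event, context):
    """Lambda entry point, assumed to run on a schedule (e.g. every 9 minutes)."""
    results = warm(WARM_TARGETS)
    print(json.dumps(results))  # ends up in CloudWatch Logs
    return results
```

One caveat worth keeping in mind: a single ping keeps only one execution environment per function warm. When several requests arrive at once, the extra ones can still land on cold environments, which is consistent with the peak-period timeouts we kept seeing.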

The initial proposal was to break these APIs down into even smaller microservices, since their dependencies kept growing. But hey, did I mention mission-critical APIs were involved? There was no time to start decoupling them. We also considered Lambda container images, which were introduced in Q4 2020, but we suspected containers would also have cold starts.

In came Provisioned Concurrency… According to the official documentation, provisioned concurrency is:

…a feature that keeps functions initialized and hyper-ready to respond in double-digit milliseconds. This is ideal for implementing interactive services, such as web and mobile backends, latency-sensitive microservices, or synchronous APIs.

Provisioned concurrency seemed like the solution to all our problems. So we created an Epic and implemented provisioned concurrency in the Production environment. Before moving to production, we had tested it in Dev and Staging, and it was all good to go. However, the pricing I was seeing was unbelievable. I actually thought it was a calculation error because, at a provisioned concurrency of 100, the projected monthly cost was way over what we had spent on Lambda in Q3 and Q4 of 2020. With rates of $0.015 per GB-hour for the provisioned concurrency itself and $0.035 per GB-hour for function duration, it baffled me how provisioned concurrency could be so expensive, especially as the estimated bill was heading past $2,000 per month. But then, we had an Epic to close and 504 errors to solve. So we opted for it. Plus, I was looking forward to analyzing the bill after a month to understand the workings of provisioned concurrency.
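To see how small per-GB-hour rates add up, here is a back-of-the-envelope estimate. It is only a sketch: the two rates are the ones quoted above, but the 1 GB memory size, the 730-hour month, and the `busy_fraction` parameter are assumptions for illustration, not our actual configuration. Per-request charges are left out.

```python
HOURS_PER_MONTH = 730  # assumed average month length, a common approximation

# Rates quoted in this article, in USD per GB-hour.
PROVISIONED_RATE = 0.015  # charged for keeping environments initialized
DURATION_RATE = 0.035     # charged while provisioned environments run code


def monthly_cost(concurrency: int,
                 memory_gb: float,
                 busy_fraction: float = 0.0,
                 hours: float = HOURS_PER_MONTH) -> float:
    """Estimate the monthly bill in USD for one function.

    busy_fraction is a hypothetical knob: the share of time the provisioned
    environments spend executing requests (0.0 means completely idle).
    """
    gb_hours = concurrency * memory_gb * hours
    standing = gb_hours * PROVISIONED_RATE               # paid even when idle
    duration = gb_hours * busy_fraction * DURATION_RATE  # paid only while busy
    return round(standing + duration, 2)
```

Under these assumptions, `monthly_cost(100, 1.0)` comes to 1095.0, so a single 1 GB function held at a provisioned concurrency of 100 costs about $1,095 a month before it serves one request. The standing charge runs around the clock, which is why a handful of such functions can push a bill past $2,000 quickly.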

API Gateway 504 timeout errors dropped drastically, and peak-period traffic grew sharply, all within two weeks. We had decided to decouple the applications further, using cloud-native approaches. This meant migrating them and deploying them as containers, using a container orchestration tool: Kubernetes.

But then, one morning… I got the message from a teammate who happened to be monitoring the cost dashboard that fateful day, bringing to my notice that we’d accrued a $2,000 bill in two weeks, for a partial solution.

Let’s not go into how I always convert Dollars to Naira whenever I am on the billing dashboard…


Cloud Architect | DevOps Evangelist | CKA, CKAD | I mostly write things in here so I can read them again when I get lost — eventually.