Like most SaaS startups, our stack is built on AWS, and it's one of our biggest expenses. And like many startups, the initial product was hastily built and resource optimization was a secondary concern.
Being a bootstrapped company, every dollar matters, and in 2019 our AWS costs were getting out of control.
We had recently added several large customers, and when the AWS bill came, it was depressing to see how much of that new revenue was going to AWS.
Diagnosing the problem
Our backend is made up of a host of services, primarily backed by standard EC2 instances in an auto scaling group running Docker. Most services ran in isolation on their own instances.
Because of how some of the logic was written, memory and CPU usage could spike at times, so we were forced to run large-capacity instances to handle the variable load.
Basically, we had a lot of wasted capacity that we were paying for.
How we fixed it
1. Optimize the bad code
Our system processes a lot of data, and in several areas of the code we were pulling large amounts of data into memory.
After identifying the problematic logic, we switched the code to use Node.js streams. This took a while and was not always straightforward to implement, but the result was a much more stable run profile with decreased resource requirements.
2. Move to ECS
Instead of having individual services on their own instances, we moved to AWS ECS. It's AWS's own container orchestration service (similar in spirit to Kubernetes, which AWS offers separately as EKS), and it enables us to easily run multiple services on a single instance.
This minimizes wasted resources and reduces the number of instances we have to run.
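For a sense of what this looks like, here is a pared-down ECS task definition (the names, image URI, and sizes are hypothetical). Each service declares the CPU and memory it needs, and ECS packs tasks onto shared instances accordingly:

```json
{
  "family": "api-service",
  "containerDefinitions": [
    {
      "name": "api",
      "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/api:latest",
      "cpu": 256,
      "memory": 512,
      "essential": true,
      "portMappings": [{ "containerPort": 3000 }]
    }
  ]
}
```

Because several small tasks like this can share one instance, the per-service headroom we used to reserve on dedicated instances goes away.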
3. Spot instances
We were aware of spot instances, but had never seriously looked into them. The concern was that spot instances were unreliable and could be shut down at any time.
After further investigation, we discovered that our concerns were unfounded. In fact, it's fair to say that in most cases, there's no reason NOT to use spot instances.
Since many of our processes are APIs or task queues with relatively short jobs, there was nothing to worry about if the backing instance was going to shut down. AWS provides an interruption notice before reclaiming a spot instance, which makes it easy to gracefully drain the worker.
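EC2 publishes a spot interruption notice at the instance metadata endpoint (`http://169.254.169.254/latest/meta-data/spot/instance-action`) roughly two minutes before reclaiming the instance. A worker can poll it and drain itself when a notice appears. A sketch of that pattern, assuming hypothetical `fetchNotice` and `drain` functions supplied by the caller:

```javascript
// Parses a spot instance-action notice (or null if the metadata endpoint
// returned 404, i.e. no interruption pending) and returns the number of
// seconds remaining before termination, or null if there is nothing to do.
function secondsUntilTermination(noticeJson, now = new Date()) {
  if (!noticeJson) return null;
  const { action, time } = JSON.parse(noticeJson);
  if (action !== 'terminate' && action !== 'stop') return null;
  return Math.max(0, (new Date(time) - now) / 1000);
}

// Polling loop: fetchNotice() should return the raw metadata response body
// (or null on 404); drain(secs) should stop accepting new jobs and finish
// in-flight work within the remaining window.
async function pollForInterruption(fetchNotice, drain, intervalMs = 5000) {
  for (;;) {
    const secs = secondsUntilTermination(await fetchNotice());
    if (secs !== null) {
      await drain(secs);
      return;
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

With short jobs, the interruption window is almost always enough to finish what's in flight and let the auto scaling group replace the instance.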
Spot instance pricing varies with demand, but at the extreme end, the discount can reach up to 90% off the on-demand price.
4. Reserved instances
As we scaled up and stabilized the system, the resource requirements slowly became more predictable. This made it possible to forecast how much capacity we would need and opened the door to reserved instances.
With reserved instances, you commit to using a resource for a one- or three-year term and receive substantial discounts, especially for paying upfront.
In our case, we were able to identify multiple areas where we could not foresee needing to change the underlying resources for at least a year. This included several EC2, RDS, and ElastiCache instance types, which all together worked out to a ~40% cost savings when paying upfront.
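The math behind a savings figure like that is straightforward. With illustrative numbers (these rates are made up, not our actual bill):

```javascript
// Hypothetical instance: $0.10/hr on-demand vs. a 1-year all-upfront
// reservation priced at $526.
const onDemandHourly = 0.10;
const hoursPerYear = 24 * 365; // 8,760 hours
const onDemandYearly = onDemandHourly * hoursPerYear; // $876/year
const allUpfrontYearly = 526; // assumed 1-year all-upfront price
const savings = 1 - allUpfrontYearly / onDemandYearly; // ~0.40, i.e. ~40%
```

The commitment only pays off if the instance actually runs for the full term, which is why we reserved only the resources we were confident wouldn't change for at least a year.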
5. Graviton based instances
Graviton instance types are AWS's newer instances built on 64-bit Arm Neoverse cores. AWS claims that they offer up to 40% better price performance over the current generation of x86-based instances in a variety of workloads.
After investigating further, we found that our workloads were good candidates, and our tests showed solid performance.
By using the new docker buildx plugin, it's easy to build for multiple platforms at once, which made the transition simple:
docker buildx build --platform linux/amd64,linux/arm64 .
Overall, we found that it gave us a ~20% cost savings.
There's still more optimization that can be done, but we feel that these changes gave us the biggest bang for our buck. Our stack now runs more reliably without spikes, is simpler, and our AWS bill is down ~70%.
Altogether, this is a big component of how we're able to offer our product at a competitive price and run a profitable company.