I always remember my professor Alessandro Brun’s advice on my graduation day:
“When you start working for a company, focus on cutting at least 1% of unnecessary costs. This approach will help you grow personally and support the company’s growth.”
When working in a start-up, it’s crucial not only to focus on generating revenue but also to find ways to optimize operating costs. This balance is especially important in the early stages of business development, where every dollar saved can contribute significantly to the company’s long-term sustainability.
In 2023, our infrastructure cost hit an all-time high. I took charge as Site Reliability Engineer (SRE) in January 2024, by which time we had taken on an OKR to reduce our infrastructure cost by 40%. Within three months we achieved this: our bill for the entire quarter came to what we had earlier been paying for a single month. Here’s a breakdown of the steps we took to identify and reduce unnecessary infrastructure cost:
Study and Analysis of the Current System:
To reduce costs effectively, it’s essential to gain a clear understanding of the current system: how it is deployed, why it is used, and which specific resources are involved. By leveraging dashboards and cost management tools, we can monitor daily activity and track resource utilization closely. This allows us to identify areas of inefficiency and make data-driven decisions to optimize our resource usage and minimize unnecessary expenses.
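As an illustration of this kind of analysis, here is a minimal sketch that groups a billing export by service or by day. The rows and field names (`date`, `service`, `cost`) are made up for the example; a real export from a cost management tool will use its own column names.

```python
from collections import defaultdict

# Hypothetical billing rows, invented for illustration only.
SAMPLE_ROWS = [
    {"date": "2024-01-01", "service": "virtual-machines", "cost": 120.0},
    {"date": "2024-01-01", "service": "storage", "cost": 30.0},
    {"date": "2024-01-02", "service": "virtual-machines", "cost": 125.0},
    {"date": "2024-01-02", "service": "app-service", "cost": 45.0},
]


def cost_by_key(rows, key):
    """Sum the 'cost' field of each row, grouped by the given key."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += row["cost"]
    # Largest spend first, so the biggest contributors sit at the top.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


for service, total in cost_by_key(SAMPLE_ROWS, "service"):
    print(f"{service:20s} {total:10.2f}")
```

Grouping the same rows by `date` instead of `service` gives a day-by-day view, which is useful for spotting sudden jumps in spend.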
Detailed Planning:
Based on this analysis, we came up with a three-stage plan:
Short-term Plan:
We consolidated all resources and subscriptions into a unified structure. This approach gave us greater visibility and control over our entire infrastructure cost, enabling us to manage expenses more effectively. Consolidating also put us in a position to leverage potential cost-saving opportunities and discounts offered by cloud providers, optimizing our overall spending.
We focused on the low-hanging fruit first, promptly removing any resources that were underutilized or had zero usage. These could be resources we originally created for testing purposes but forgot to release or delete. Identifying and eliminating such unused resources immediately reduces unnecessary costs and streamlines the infrastructure.
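A check for idle resources can be sketched roughly as follows. The `Resource` fields, the sample inventory, and the 5% CPU threshold are all assumptions chosen for illustration; the right signals and cut-offs depend on what your platform exposes.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    name: str
    avg_cpu_percent: float   # average CPU over the review window
    requests_last_30d: int   # any activity signal the platform exposes
    monthly_cost: float


# Hypothetical inventory, invented for illustration only.
INVENTORY = [
    Resource("test-vm-old", 0.5, 0, 70.0),
    Resource("api-prod", 45.0, 120000, 300.0),
    Resource("staging-db", 2.0, 15, 110.0),
]


def find_idle(resources, cpu_threshold=5.0):
    """Return resources with no requests or very low CPU, costliest first."""
    idle = [
        r for r in resources
        if r.requests_last_30d == 0 or r.avg_cpu_percent < cpu_threshold
    ]
    return sorted(idle, key=lambda r: r.monthly_cost, reverse=True)


for res in find_idle(INVENTORY):
    print(f"candidate for removal: {res.name} ({res.monthly_cost:.2f}/month)")
```

A list like this is only a set of candidates. Each entry still needs a human check with the owning team before anything is deleted.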
Optimized Resource Utilization:
We carefully analyzed our infrastructure usage and identified inefficiencies. This helped us adjust resources to match actual demand, rather than over-provisioning and incurring unnecessary costs.
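One way to turn usage data into a sizing decision is sketched below. It sizes for the 95th percentile of observed CPU plus some headroom. The size ladder, the 30% headroom, and the choice of the 95th percentile are assumptions for the example, not provider guidance.

```python
import math

# Hypothetical ladder of available instance sizes, in vCPUs.
SIZES_VCPU = [1, 2, 4, 8, 16]


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def recommend_vcpus(current_vcpus, cpu_percent_samples, headroom=1.3):
    """Pick the smallest size that covers p95 CPU demand plus headroom."""
    p95 = percentile(cpu_percent_samples, 95)
    # Convert utilization percent into vCPUs actually needed.
    needed = current_vcpus * p95 / 100 * headroom
    for size in SIZES_VCPU:
        if size >= needed:
            return size
    return SIZES_VCPU[-1]  # demand exceeds the ladder; use the largest size


samples = [10, 12, 15, 20, 18, 11, 14, 16, 13, 19]
print(recommend_vcpus(8, samples))
```

Sizing on a high percentile rather than the average keeps short peaks from being starved, while still shrinking instances that sit mostly idle.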
Eliminated Unnecessary Resources:
We conducted a thorough review of all resources in use and removed those that were redundant or underutilized. This clean-up ensured we were only paying for what we needed.
Medium-term Plan:
We monitored our performance metrics daily, paying particular attention to CPU and memory usage, and aimed to optimize performance accordingly. Additionally, it’s important to regularly upgrade to the latest instance versions, as older versions may incur higher costs. By staying up to date with the latest technologies, we can not only improve efficiency but also ensure that we are taking advantage of cost-effective solutions provided by cloud providers.
Targeted High-Cost Resources:
We identified the resources that were the biggest contributors to our operating costs and explored alternative, more cost-effective solutions. This step allowed us to prioritize the reduction of our largest expenses without compromising performance or service quality.
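Finding the biggest contributors can be as simple as ranking resources by cost and taking the smallest group that covers most of the bill. The sketch below does that. The resource names, the amounts, and the 80% cut-off are made up for illustration.

```python
# Hypothetical monthly cost per resource, invented for illustration only.
SAMPLE_COSTS = {
    "sql-db": 520.0,
    "vm-pool": 310.0,
    "storage": 100.0,
    "cdn": 40.0,
    "dns": 30.0,
}


def top_contributors(costs, share=0.8):
    """Return the costliest items that together reach `share` of total cost.

    Each entry is (name, cost, fraction_of_total).
    """
    total = sum(costs.values())
    ranked = sorted(costs.items(), key=lambda kv: kv[1], reverse=True)
    selected, running = [], 0.0
    for name, cost in ranked:
        if running >= share * total:
            break  # the items picked so far already cover the target share
        selected.append((name, cost, cost / total))
        running += cost
    return selected


for name, cost, fraction in top_contributors(SAMPLE_COSTS):
    print(f"{name:10s} {cost:8.2f} {fraction:6.1%}")
```

With the sample numbers, two of the five resources account for more than 80% of the spend, which is where the search for alternatives would start.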
Long-term Plan:
We should also explore the use of reservations and auto-scaling for compute resources like app services and virtual machines. Auto-scaling lets us scale resources dynamically based on actual demand, ensuring that we only consume what is necessary while keeping costs under control. Reservations, in line with cloud provider recommendations, let us capitalize on long-term commitments, such as reserving resources for one to three years, which often leads to substantial cost savings in the long run.
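The trade-off between the two options comes down to simple arithmetic, sketched below under stated assumptions. The 730-hour month, the hourly rate, and the discount are illustrative values, not real prices. The pay-as-you-go function assumes the resource is shut down, and so not billed, when it is not in use. `desired_instances` is a simple proportional rule for illustration, not any particular provider’s auto-scaling algorithm.

```python
import math

HOURS_PER_MONTH = 730  # common approximation for an average month


def monthly_cost_on_demand(hourly_rate, utilization):
    """Pay-as-you-go: billed only for the fraction of hours actually used."""
    return hourly_rate * HOURS_PER_MONTH * utilization


def monthly_cost_reserved(hourly_rate, discount):
    """Reservation: billed for every hour at a discounted rate, used or not."""
    return hourly_rate * HOURS_PER_MONTH * (1 - discount)


def break_even_utilization(discount):
    """Utilization above which the reservation beats pay-as-you-go."""
    return 1 - discount


def desired_instances(current, observed_cpu, target_cpu, minimum=1, maximum=10):
    """Proportional rule: resize the pool so average CPU lands near target."""
    wanted = math.ceil(current * observed_cpu / target_cpu)
    return max(minimum, min(maximum, wanted))


# Illustrative numbers: 0.10 per hour, 40% reservation discount.
print(monthly_cost_on_demand(0.10, 0.5), monthly_cost_reserved(0.10, 0.4))
print(desired_instances(current=4, observed_cpu=90, target_cpu=60))
```

Under these assumptions, a workload that runs less than about 60% of the time is cheaper on pay-as-you-go with scaling, while a steady, always-on workload is the better candidate for a reservation.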
Leveraging Cloud Provider Recommendations:
We have proactively followed the recommendations and best practices provided by our cloud service providers. These insights have guided us in optimizing resource configurations, scaling effectively, and identifying cost-saving opportunities, allowing us to maximize both efficiency and cost-effectiveness.
Documentation:
It is essential to document the entire process and educate team members on adhering to the established guidelines. By introducing a centralized point of control for resource creation, optimization, and the approval process, we can streamline operations and ensure consistency across the team. As part of this initiative, Site Reliability Engineers (SREs) should actively participate in multiple forums and communities to stay informed about industry best practices and learn from the experiences of others. Regularly updating these insights and incorporating them into our workflows will help maintain a culture of continuous improvement and ensure that we are always aligned with the latest advancements and strategies.
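A centralized control point for resource creation could start as a small automated check like the sketch below. The required tags, the cost threshold, and the request fields are hypothetical. They stand in for whatever guidelines the team actually documents.

```python
# Hypothetical policy values, invented for illustration only.
REQUIRED_TAGS = {"owner", "purpose", "expires_on"}
APPROVAL_THRESHOLD = 100.0  # estimated monthly cost that triggers SRE approval


def review_request(request):
    """Return a list of problems; an empty list means the request may proceed."""
    problems = []
    missing = REQUIRED_TAGS - set(request.get("tags", {}))
    if missing:
        problems.append("missing tags: " + ", ".join(sorted(missing)))
    too_costly = request.get("estimated_monthly_cost", 0.0) > APPROVAL_THRESHOLD
    if too_costly and not request.get("approved_by"):
        problems.append("needs SRE approval before creation")
    return problems


request = {
    "name": "load-test-vm",
    "tags": {"owner": "team-a", "purpose": "load test"},
    "estimated_monthly_cost": 250.0,
}
for problem in review_request(request):
    print(problem)
```

Requiring an owner and an expiry date at creation time makes later clean-up much easier, because forgotten test resources can be traced back and removed on schedule.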
Why Is This Important?
As a start-up, our primary focus should be not only on generating revenue but also on efficiently managing our expenses. Early-stage companies often face financial constraints, and minimizing operating costs is essential for achieving profitability and long-term success. By addressing unnecessary costs before scaling, we free up resources that can be reinvested into product development, marketing, or other areas that directly contribute to revenue generation. In the long run, a leaner and more cost-efficient operation enables a healthier bottom line and a better chance at sustained growth.