Kubernetes · 8 min read

Kubernetes FinOps: Stop Burning Money on Overprovisioned Clusters

Most Kubernetes clusters run at 20-30% utilization. Here's how to implement FinOps practices that cut costs without sacrificing reliability.

THNKBIG Team

Engineering Insights


Key Takeaways

  • FinOps for Kubernetes applies financial accountability to cloud-native infrastructure — making engineering teams aware of and responsible for their cloud costs.
  • The three pillars of Kubernetes FinOps are visibility (who is spending what), optimization (eliminating waste), and governance (enforcing cost policies at admission time).
  • Teams that implement FinOps practices alongside cost tooling reduce Kubernetes cloud spend by 30-50% within the first quarter.


Kubernetes FinOps: Stop Paying for Idle Capacity

Most Kubernetes clusters are dramatically over-provisioned. Developers request generous CPU and memory "just in case," the scheduler reserves it, and the autoscaler dutifully adds more nodes. The result: you pay for capacity that sits idle while your CFO keeps asking why the bill is so high.

The Overprovisioning Trap

Kubernetes schedules based on requested resources, not what workloads actually use.

If a pod requests 1 CPU but averages 100m, the scheduler still blocks off the full CPU. Multiply that pattern across hundreds of pods and you end up with:

  • Nodes that look busy to the scheduler but are mostly idle in reality
  • Cluster autoscaler adding more nodes to fit inflated requests
  • 30–70% of allocated CPU and memory sitting unused
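The gap between requested and used resources lives in the pod spec. A minimal sketch (the numbers are illustrative, not from a real workload):

```yaml
# Hypothetical container spec: the scheduler reserves the full request
# on a node, regardless of what the container actually consumes.
resources:
  requests:
    cpu: "1"          # a full CPU is blocked off for this pod
    memory: 2Gi
  limits:
    cpu: "2"
    memory: 2Gi
# If observed p95 usage is ~100m CPU / 400Mi memory, roughly 900m of
# that requested CPU is paid for on every replica but never used.
```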

Why this happens:

  • Developers don’t know real production needs, so they guess high
  • Ops teams avoid risk and don’t challenge inflated requests
  • The “safe” choice is to over-allocate — and it’s expensive

Right-Sizing with Actual Usage Data

You can’t fix this with guesswork. You need usage data.

The Vertical Pod Autoscaler (VPA) analyzes historical CPU and memory usage and recommends right-sized requests/limits.

How to use it safely:

  1. Run VPA in recommendation mode only at first
  2. Let it collect data for 2–3 weeks
  3. Review recommended requests/limits per workload
  4. Roll out changes gradually (per namespace or per team)
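Step 1 above maps to a single field in the VPA object. A sketch using the `autoscaling.k8s.io/v1` API (the Deployment name is hypothetical):

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: billing-service-vpa      # hypothetical name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: billing-service
  updatePolicy:
    updateMode: "Off"            # recommendation mode: observe and suggest,
                                 # never evict or resize pods
```

After a couple of weeks, `kubectl describe vpa billing-service-vpa` shows the recommended requests alongside the current ones, which is the data you review before any phased rollout.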

What teams typically discover:

  • Web services: CPU requests are often 3–5x higher than actual usage
  • Memory: limits padded for rare spikes that happen once a month
  • Batch jobs: hold resources while waiting on external APIs or queues

Pair VPA with cost-visibility tools:

  • Kubecost / OpenCost: show cost by namespace, label, or workload
  • Example: “The billing service costs $500/month and uses 15% of its allocated resources.”

Once you can quantify waste per service or team, the conversation shifts from opinion to data. Data beats assumptions.
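The arithmetic behind that shift is simple enough to sketch. Assuming cost tooling gives you a monthly cost and a utilization fraction per service, idle spend is just the unused share:

```python
def idle_spend(monthly_cost: float, utilization: float) -> float:
    """Estimate the monthly cost of allocated-but-unused capacity.

    utilization is the fraction of requested resources actually used
    (e.g. 0.15 for the 15% billing-service example above).
    """
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be a fraction between 0 and 1")
    return monthly_cost * (1.0 - utilization)

# The billing-service example from the text: $500/month at 15% utilization.
print(f"${idle_spend(500, 0.15):.2f}/month idle")  # $425.00/month idle
```

Run per namespace or per team, this turns a raw Kubecost export into a ranked list of where right-sizing pays off first.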

Cheap Capacity: Spot Instances & Preemptible VMs

For stateless, fault-tolerant workloads, spot instances (or preemptible VMs) are an easy win:

  • 60–90% cheaper than on-demand
  • Can be terminated with ~2 minutes’ notice

Make spot reliable by design:

  • Use PodDisruptionBudgets (PDBs) to maintain a minimum healthy replica count
  • Add preStop hooks for graceful shutdown and connection draining
  • Spread replicas across multiple instance types and AZs
  • Use Karpenter (AWS) or CAST AI for intelligent instance selection and fallback
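The first two bullets translate to a few lines of manifest. A sketch (labels and replica counts are hypothetical):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: 2          # keep at least 2 replicas up during evictions
  selector:
    matchLabels:
      app: web
---
# In the pod template of the corresponding Deployment: give the app a
# window to drain in-flight connections before SIGTERM arrives.
lifecycle:
  preStop:
    exec:
      command: ["sh", "-c", "sleep 10"]
```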

Good candidates for spot:

  • Web/API servers behind a load balancer
  • Stateless workers and async consumers

Keep these on on-demand:

  • Databases and stateful sets
  • Latency-sensitive or single-instance critical workloads

Bin Packing and Node Efficiency

The Cluster Autoscaler adds nodes when pods can’t be scheduled and removes them when nodes are underutilized. But default scale-down behavior is often conservative:

  • Nodes just above the utilization threshold (e.g., at 51% with the default 50% cutoff) never scale down
  • You end up with many half-empty nodes

Optimization levers:

  • Lower the scale-down utilization threshold
  • Reduce scale-down delay so consolidation happens sooner
  • Use pod priority to ensure critical workloads survive consolidation
  • Consider Karpenter instead of Cluster Autoscaler:
    • Continuously right-sizes nodes
    • Packs pods more efficiently
    • Chooses optimal instance types on the fly
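The first two levers correspond to Cluster Autoscaler startup flags. A sketch of the container args in its deployment (values are examples to tune, not recommendations; flag names per the Cluster Autoscaler FAQ):

```yaml
# Cluster Autoscaler deployment args (fragment)
- --scale-down-utilization-threshold=0.6   # default is 0.5
- --scale-down-unneeded-time=5m            # default is 10m
- --scale-down-delay-after-add=5m          # default is 10m
```

Raising the threshold and shortening the delays makes consolidation kick in sooner, at the cost of more pod churn; roll changes out gradually and watch eviction rates.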

Governance: Quotas, Defaults, and Cost Awareness

You can’t rely on every developer to be a capacity planner. Kubernetes gives you guardrails:

  • ResourceQuota (per namespace)
    • Caps total CPU, memory, and object counts per team/namespace
    • Align quotas with team capacity budgets
  • LimitRange
    • Sets default requests/limits for containers
    • Enforces min/max bounds so no single pod can request absurd resources
  • Showback reports
    • Use Kubecost/OpenCost to allocate costs by namespace/team
    • Share monthly cost reports with teams (no chargeback required)

Once a team sees, “We requested 32 CPUs and use 20% of it, costing $X/month,” they become highly motivated to right-size.
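Both guardrails are small, standard objects. A sketch for a hypothetical `team-a` namespace (the specific numbers are placeholders to align with your own budgets):

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "16"        # total CPU the namespace may request
    requests.memory: 32Gi
    limits.cpu: "32"
---
apiVersion: v1
kind: LimitRange
metadata:
  name: team-a-defaults
  namespace: team-a
spec:
  limits:
    - type: Container
      defaultRequest:         # applied when a container omits requests
        cpu: 100m
        memory: 128Mi
      default:                # applied when a container omits limits
        cpu: 500m
        memory: 512Mi
      max:                    # hard ceiling per container
        cpu: "4"
        memory: 8Gi
```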

Hidden Kubernetes Cost Landmines

Beyond CPU and memory over-provisioning, several silent cost drivers accumulate over time:

  • Orphaned PVCs
    • PersistentVolumeClaims that survive long after their workloads are deleted
    • Continue to incur storage charges
  • Idle load balancers
    • Services created for dev/test, never cleaned up
    • Each cloud LB adds a recurring monthly cost
  • Over-configured autoscaler minimums
    • Node pools pinned to high minSize after a traffic spike
    • Cluster never scales back to true baseline
These rarely show up clearly in standard cloud billing views, but they add up.
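Finding orphaned PVCs is bookkeeping: compare every claim in the cluster against the claims pods actually mount. A sketch of that logic, assuming you feed it names extracted from `kubectl get pvc -A -o json` and from each pod's `volumes[].persistentVolumeClaim.claimName` (the sample data below is hypothetical):

```python
def orphaned_pvcs(all_pvcs, mounted_claims):
    """Return (namespace, pvc_name) pairs that no pod currently mounts.

    all_pvcs: iterable of (namespace, pvc_name) for every PVC in the cluster.
    mounted_claims: iterable of (namespace, pvc_name) referenced by any pod.
    """
    return sorted(set(all_pvcs) - set(mounted_claims))

pvcs = [("dev", "pg-data"), ("dev", "old-scratch"), ("prod", "pg-data")]
mounted = [("dev", "pg-data"), ("prod", "pg-data")]
print(orphaned_pvcs(pvcs, mounted))  # [('dev', 'old-scratch')]
```

Note that "not mounted right now" is not proof a claim is garbage (CronJobs and suspended workloads mount claims intermittently), so treat the output as a review list, not a delete list.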

A Practical FinOps Stack for Kubernetes

Combine tools and process:

  1. Visibility
    • Kubecost / OpenCost for per-namespace, per-service cost
    • Dashboards that correlate cost with utilization and requests
  2. Optimization
    • VPA in recommendation mode for 2–3 weeks, then phased rollout
    • Aggressive node consolidation via Karpenter or tuned Cluster Autoscaler
    • Spot/preemptible capacity for stateless workloads
  3. Governance
    • ResourceQuotas per namespace aligned to team budgets
    • LimitRanges enforcing sane defaults and bounds
    • Showback reports to drive behavioral change

Teams that implement this stack routinely see 30–50% Kubernetes cost reduction in a quarter, without sacrificing reliability.

What We Do

Our Kubernetes consulting practice has helped enterprises across Texas and California:

  • Stand up FinOps visibility with Kubecost/OpenCost
  • Run VPA safely in recommendation mode and roll out right-sizing
  • Introduce spot capacity and Karpenter for intelligent node provisioning
  • Design ResourceQuotas, LimitRanges, and showback processes

On average, clients cut Kubernetes cloud costs by ~35% while improving reliability and predictability.

If you want a concrete, data-driven view of where your clusters are wasting money, schedule a cost audit. We’ll quantify over-provisioning, model savings scenarios, and give you an actionable roadmap to reclaim wasted spend.


THNKBIG Team

Engineering Insights

Expert infrastructure engineers at THNKBIG, specializing in Kubernetes, cloud platforms, and AI/ML operations.
