Kubernetes Operations: Best Practices
A complete guide to Kubernetes day-two operations covering cluster upgrades, node management, autoscaling, backup, disaster recovery, and GitOps.
THNKBIG Team
Engineering Insights
Day Two Is Where Kubernetes Gets Real
Deploying a Kubernetes cluster takes a day. Operating it for years takes a strategy. Day-two operations — upgrades, scaling, monitoring, backup, and disaster recovery — determine whether Kubernetes accelerates your team or becomes a liability.
This post covers the operational practices that separate teams running Kubernetes successfully from teams drowning in toil. Each section includes specific tools and processes that we have seen work across hundreds of production clusters.
Cluster Upgrades: Plan or Pay Later
Kubernetes releases a new minor version every four months. Each version is supported for fourteen months. If you skip upgrades, you fall behind, and catching up becomes a multi-sprint project instead of a routine operation.
Upgrade every minor version sequentially. Kubernetes does not support skipping versions (1.26 to 1.28, for example). Read the changelog for every version. Pay attention to API deprecations — a removed API will break any manifest that references it. Use tools like pluto or kubent to scan your cluster for deprecated APIs before upgrading.
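A pre-upgrade scan might look like the sketch below. The commands and flags are taken from the pluto and kubent documentation as we use them, and the target version is a placeholder; verify against the releases you have installed.

```sh
# Scan rendered manifests in a directory for deprecated or removed API versions
pluto detect-files -d ./manifests

# Scan Helm releases deployed in the cluster
pluto detect-helm -o wide

# Scan live cluster objects against the Kubernetes version you are upgrading to
kubent --target-version 1.28.0
```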
Test upgrades in a staging cluster that mirrors production. Run your integration tests against the upgraded staging cluster before touching production. For managed Kubernetes (EKS, GKE, AKS), test the control plane upgrade first, then upgrade node pools in a rolling fashion with PodDisruptionBudgets to maintain availability.
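PodDisruptionBudgets are what keep a rolling node pool upgrade from taking a service to zero. A minimal sketch, with a hypothetical app label, namespace, and replica floor:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-server-pdb           # hypothetical name
  namespace: production          # hypothetical namespace
spec:
  minAvailable: 2                # never drain below 2 ready replicas
  selector:
    matchLabels:
      app: api-server            # must match the Deployment's pod labels
```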
Node Management and Capacity Planning
Right-sizing nodes is an ongoing exercise. Nodes that are too large waste resources on unused capacity. Nodes that are too small cause scheduling failures and fragmentation. Monitor the ratio of requested resources to allocatable capacity across your node pools.
Use the Cluster Autoscaler or Karpenter to add and remove nodes based on pending pod demand. Karpenter (AWS-native) provisions nodes faster than Cluster Autoscaler and selects instance types dynamically based on workload requirements. For GKE, use the built-in autoscaler with node auto-provisioning.
Separate workloads by node pool. Stateful workloads (databases, message queues) run on nodes with local SSDs and do not get auto-scaled. Stateless workloads (API servers, workers) run on auto-scaling node pools with spot/preemptible instances for cost savings. Use taints and tolerations to enforce this separation.
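One way to enforce that separation, sketched below with hypothetical pool labels and workload names: taint the stateful node pool, then give only the stateful workloads the matching toleration plus a nodeSelector so the scheduler places them there and nothing else lands on those nodes.

```yaml
# Nodes in the stateful pool carry the taint:
#   kubectl taint nodes <node> workload-class=stateful:NoSchedule
# (with managed node pools, set the taint in the pool definition instead)
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres                      # hypothetical workload
spec:
  serviceName: postgres
  replicas: 1
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      nodeSelector:
        workload-class: stateful      # node label applied to the stateful pool
      tolerations:
        - key: workload-class
          operator: Equal
          value: stateful
          effect: NoSchedule
      containers:
        - name: postgres
          image: postgres:16
```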
Resource Quotas, Requests, and Limits
Every container should have CPU and memory requests defined. Requests determine scheduling — the scheduler places pods on nodes with enough allocatable resources. Without requests, the scheduler has no information and will overpack nodes until OOM kills start.
Set memory limits. Be cautious with CPU limits. Memory limits prevent a single container from consuming all node memory. When a container exceeds its memory limit, it gets OOM-killed. CPU limits, however, cause CPU throttling even when the node has idle CPU capacity. Many teams are removing CPU limits entirely and relying on CPU requests for scheduling. Test both approaches for your workloads.
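A container spec following that guidance might look like this sketch: requests on CPU and memory for scheduling, a memory limit to bound the blast radius, and no CPU limit. The image and numbers are placeholders; adjust if your own testing favors CPU limits.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: worker-example                          # hypothetical
spec:
  containers:
    - name: worker
      image: registry.example.com/worker:1.4.2  # placeholder image
      resources:
        requests:
          cpu: "250m"          # used by the scheduler for placement
          memory: "512Mi"
        limits:
          memory: "1Gi"        # container is OOM-killed above this
          # no cpu limit: avoids throttling; idle CPU is shared by request weight
```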
Use ResourceQuotas at the namespace level to prevent any single team from consuming the entire cluster. LimitRanges set default requests and limits for containers that do not specify them. Both are essential for multi-tenant clusters.
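For a multi-tenant namespace, a quota plus default limits might look roughly like this; the namespace name and numbers are placeholders to size against your own nodes.

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 64Gi
    limits.memory: 128Gi
    pods: "200"
---
apiVersion: v1
kind: LimitRange
metadata:
  name: team-a-defaults
  namespace: team-a
spec:
  limits:
    - type: Container
      defaultRequest:          # applied when a container omits requests
        cpu: 100m
        memory: 128Mi
      default:                 # applied when a container omits limits
        memory: 256Mi
```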
Autoscaling: HPA, VPA, and KEDA
The Horizontal Pod Autoscaler (HPA) adds or removes pod replicas based on metrics — typically CPU utilization, memory, or custom metrics from Prometheus. Configure HPA with a stabilization window to prevent flapping: rapid scale-up followed by immediate scale-down.
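A sketch of an autoscaling/v2 HPA with a scale-down stabilization window; the target Deployment, replica bounds, and thresholds are placeholders.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 3
  maxReplicas: 30
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300   # wait 5 minutes of low load before scaling down
```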
The Vertical Pod Autoscaler (VPA) adjusts CPU and memory requests for individual containers based on observed usage. Run VPA in recommendation mode first to understand what it would change before enabling automatic updates. VPA and HPA should not target the same metric for the same deployment — they will conflict.
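If you run the VPA add-on, recommendation-only mode is a matter of setting updateMode to Off, roughly as below; the target name is a placeholder and the CRD must already be installed in the cluster.

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: worker-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: worker
  updatePolicy:
    updateMode: "Off"      # recommend only; do not evict or mutate pods
```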
KEDA (Kubernetes Event-Driven Autoscaling) scales based on external event sources: queue depth in RabbitMQ, message lag in Kafka, or HTTP request rate. KEDA can scale deployments to zero, which HPA cannot. For event-driven architectures, KEDA is the right autoscaler.
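A KEDA ScaledObject scaling a consumer on Kafka lag might look roughly like this; broker address, consumer group, topic, and thresholds are placeholders, and the exact trigger metadata should be checked against the KEDA scaler docs for your version.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: orders-consumer-scaler
spec:
  scaleTargetRef:
    name: orders-consumer            # Deployment to scale
  minReplicaCount: 0                 # scale to zero when the topic is idle
  maxReplicaCount: 20
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: kafka.example.svc:9092
        consumerGroup: orders-consumer
        topic: orders
        lagThreshold: "50"           # desired lag per replica
```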
Backup and Disaster Recovery
Your cluster state lives in etcd. Your application state lives in persistent volumes. Both need backup strategies. Velero is the standard tool for Kubernetes backup and restore. It backs up Kubernetes resources (manifests) and persistent volume data to object storage.
Schedule Velero backups for all namespaces. Test restores regularly: a backup that has never been restored is not a backup; it is a hope. Run quarterly DR drills where you restore a full namespace to a different cluster and verify application functionality.
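With the Velero CLI, the schedule and a restore drill might look like this sketch; the schedule name, cron expression, retention, and backup name are placeholders.

```sh
# Nightly backup of every namespace, retained for 30 days
velero schedule create nightly-all \
  --schedule "0 2 * * *" \
  --ttl 720h

# During a DR drill: restore one namespace from a specific backup
velero restore create drill-$(date +%Y%m%d) \
  --from-backup nightly-all-20240101020000 \
  --include-namespaces payments
```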
For etcd, take regular snapshots using etcdctl snapshot save. Store snapshots in a separate location from the cluster. If you use managed Kubernetes, the cloud provider handles etcd backups, but you are still responsible for application-level data and configuration backup.
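For self-managed control planes, a snapshot sketch; the certificate paths follow the common kubeadm layout and may differ in your environment.

```sh
ETCDCTL_API=3 etcdctl snapshot save /var/backups/etcd-$(date +%F).db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Copy the snapshot off the node, for example to object storage in another region
```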
GitOps: ArgoCD and Flux
GitOps uses Git as the single source of truth for cluster state. Every change — deployments, config maps, RBAC policies — is a Git commit. A GitOps controller running inside the cluster reconciles the live state with the desired state in Git.
ArgoCD provides a web UI for visualizing application state, sync status, and deployment history. It supports Helm, Kustomize, and raw manifests. ArgoCD's Application and ApplicationSet resources make it manageable for clusters with hundreds of applications.
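A minimal ArgoCD Application pointing at a Kustomize path, as a sketch; the repository URL, path, and destination namespace are placeholders.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payments
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example-org/platform-config.git   # placeholder repo
    targetRevision: main
    path: apps/payments/overlays/production
  destination:
    server: https://kubernetes.default.svc
    namespace: payments
  syncPolicy:
    automated:
      prune: true       # delete resources removed from Git
      selfHeal: true    # revert manual drift in the cluster
```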
Flux is a CNCF graduated project that takes a more CLI-driven approach. It is lighter than ArgoCD and integrates tightly with Helm and Kustomize. Flux is a strong choice for teams that prefer infrastructure-as-code over web dashboards.
Both tools support multi-cluster management, automated image updates, and progressive delivery. The choice between them is largely a matter of team preference. Either is vastly better than manually running kubectl apply in production. Our Kubernetes consulting practice helps teams implement GitOps workflows from repository structure to promotion strategies.
Operational Maturity: Where Does Your Team Stand?
Level 1 — Manual. Deployments are kubectl apply. Monitoring is ad hoc. Upgrades are scary. Most teams start here, and some stay here for years.
Level 2 — Automated. CI/CD pipelines handle deployments. Prometheus and Grafana provide monitoring. Cluster upgrades follow a documented runbook. Backup exists but is rarely tested.
Level 3 — Resilient. GitOps drives all changes. Autoscaling handles demand spikes. DR is tested quarterly. Runbooks exist for the top twenty incident types. On-call rotations are sustainable.
Most enterprise teams are between Level 1 and Level 2. The goal is not to reach Level 3 overnight. It is to identify the highest-impact gap and close it this quarter.
Operate Kubernetes with Confidence
Day-two operations are where Kubernetes either pays off or becomes a drain. If your team is spending more time maintaining the platform than shipping features, the operational foundation needs attention.
Talk to an engineer about an operational assessment for your Kubernetes platform.
Explore Our Solutions