Running GPU Workloads on Kubernetes: A Practical Guide
GPUs on Kubernetes require more than just installing drivers. Learn how to schedule, share, and optimize GPU resources for AI/ML workloads at scale.
THNKBIG Team
Engineering Insights
Kubernetes was built for stateless web apps. GPUs don't fit that mold. Running AI/ML workloads on Kubernetes requires understanding how the scheduler sees GPUs, how to share expensive hardware efficiently, and how to avoid the bill shock that comes from idle A100s.
The GPU Scheduling Problem
Kubernetes treats GPUs as extended resources. The scheduler sees "nvidia.com/gpu: 1" and assigns the pod to a node with a GPU. Simple. But real AI workloads are messier. Training jobs need multiple GPUs with NVLink. Inference pods need fractional GPU access. Batch jobs should scavenge unused capacity without blocking production traffic.
The default Kubernetes scheduler doesn't understand any of this. A pod requesting one GPU gets an entire A100 — even if it only needs 5GB of memory. That's $30/hour of waste.
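Here's what that looks like in practice. A minimal pod spec that claims a whole GPU (the pod name and image are placeholders, not from a real cluster):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: inference-example            # hypothetical name
spec:
  containers:
    - name: model-server
      image: registry.example.com/model-server:latest   # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1          # claims one entire physical GPU
```

Note that GPU counts go under limits: Kubernetes defaults the request to the limit, and a GPU request that differs from its limit is rejected.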
GPU Sharing: MIG vs Time-Slicing
NVIDIA Multi-Instance GPU (MIG) physically partitions an A100 or H100 into up to seven isolated instances. Each instance has dedicated memory and compute. This is ideal for inference workloads that need predictable latency — you get hardware isolation without the cost of a full GPU.
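With MIG enabled and the NVIDIA device plugin running in its "mixed" strategy, each MIG profile shows up as its own extended resource, so a pod can ask for a slice instead of a whole card. A rough sketch (pod name and image are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: small-inference              # hypothetical name
spec:
  containers:
    - name: model-server
      image: registry.example.com/model-server:latest   # placeholder image
      resources:
        limits:
          nvidia.com/mig-1g.5gb: 1   # one 1g.5gb MIG slice instead of a full A100
```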
Time-slicing is a softer form of sharing: multiple pods take turns on the same GPU. Context switching adds overhead, and there is no memory isolation, so one greedy pod can crash its neighbors. For batch inference or development workloads, though, it's often good enough. We typically deploy time-slicing for dev/test environments and MIG for production inference.
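Time-slicing is configured through the NVIDIA device plugin. A minimal sketch of the plugin's sharing config, assuming you hand it to the plugin via a ConfigMap (the exact wiring depends on how you deploy the plugin, e.g. its Helm chart, so check the docs for your plugin version):

```yaml
# NVIDIA device plugin config enabling time-slicing.
version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu
        replicas: 4                  # each physical GPU is advertised as 4 schedulable GPUs
```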
Node Affinity and Taints
GPU nodes are expensive. You don't want random pods landing on them. Use taints to repel non-GPU workloads and tolerations to allow GPU pods through. Node affinity rules can target specific GPU types — route your LLM inference to A100s while smaller models run on T4s.
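A sketch of both pieces together, assuming GPU nodes are tainted with nvidia.com/gpu=present:NoSchedule and labeled with a gpu-type label of your own (both are conventions we use for illustration, not Kubernetes built-ins):

```yaml
# Pod that tolerates the GPU taint and pins itself to A100 nodes.
apiVersion: v1
kind: Pod
metadata:
  name: llm-inference                # hypothetical name
spec:
  tolerations:
    - key: nvidia.com/gpu            # matches the taint you applied to GPU nodes
      operator: Exists
      effect: NoSchedule
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: gpu-type        # label you apply to nodes yourself
                operator: In
                values: ["a100"]
  containers:
    - name: model-server
      image: registry.example.com/llm-server:latest     # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1
```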
Autoscaling GPU Nodes
The Kubernetes Cluster Autoscaler works, but it's slow. GPU nodes take minutes to provision. Karpenter (on AWS) or GKE Autopilot respond faster by provisioning nodes directly for pending pods instead of resizing fixed node groups. For predictable traffic patterns, we schedule scale-up before peak hours.
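As a hedged sketch, a Karpenter NodePool dedicated to GPU capacity might look roughly like this. Field names follow the v1beta1 API as we understand it, and the instance types, limits, and names are illustrative, so check the docs for your Karpenter version:

```yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: gpu                          # illustrative name
spec:
  template:
    spec:
      nodeClassRef:
        kind: EC2NodeClass
        name: default                # assumes an EC2NodeClass named "default" exists
      requirements:
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["g5.xlarge", "p4d.24xlarge"]   # illustrative instance types
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
      taints:
        - key: nvidia.com/gpu
          value: present
          effect: NoSchedule         # keep non-GPU pods off these nodes
  limits:
    nvidia.com/gpu: 16               # cap total GPUs this pool can provision
```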
Spot instances cut GPU costs by 60-90%. The tradeoff is interruption. Training jobs need checkpointing — save state every N minutes so you can resume after preemption. For inference, we maintain a baseline of on-demand capacity with spot instances handling overflow.
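Here's a sketch of a preemption-tolerant training Job pinned to spot capacity. The karpenter.sh/capacity-type label is Karpenter's; EKS managed node groups and GKE use their own capacity labels, and the checkpoint flag and PVC name below are hypothetical:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: train-llm                    # hypothetical name
spec:
  backoffLimit: 10                   # tolerate repeated spot preemptions by retrying
  template:
    spec:
      restartPolicy: OnFailure
      nodeSelector:
        karpenter.sh/capacity-type: spot
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: trainer
          image: registry.example.com/trainer:latest    # placeholder image
          args: ["--resume-from-checkpoint", "/ckpt"]   # hypothetical flag: resume after preemption
          volumeMounts:
            - name: ckpt
              mountPath: /ckpt
          resources:
            limits:
              nvidia.com/gpu: 4
      volumes:
        - name: ckpt
          persistentVolumeClaim:
            claimName: training-checkpoints             # assumes an existing PVC for checkpoints
```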
Monitoring and Cost Visibility
You can't optimize what you can't see. Deploy DCGM Exporter to expose GPU metrics to Prometheus: utilization, memory, temperature. Build dashboards that show cost per model, cost per team. Set alerts when utilization drops below 50% — that's money burning.
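That alert can be expressed as a PrometheusRule, assuming DCGM Exporter metrics are scraped via the Prometheus Operator. DCGM_FI_DEV_GPU_UTIL is the exporter's per-GPU utilization gauge; the threshold, duration, and label choices here are ours to adapt:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-utilization
spec:
  groups:
    - name: gpu
      rules:
        - alert: GPUUnderutilized
          # Average utilization per GPU per node below 50% for half an hour.
          expr: avg by (Hostname, gpu) (DCGM_FI_DEV_GPU_UTIL) < 50
          for: 30m
          labels:
            severity: warning
          annotations:
            summary: "GPU {{ $labels.gpu }} on {{ $labels.Hostname }} below 50% utilization for 30 minutes"
```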
GPU workloads on Kubernetes aren't plug-and-play. But with the right scheduling, sharing, and cost controls, you can run AI/ML at scale without the budget blowouts. Start with visibility, add sharing for efficiency, and use spot instances for the final cost optimization.