Kubernetes · 8 min read

Running GPU Workloads on Kubernetes: A Practical Guide

GPUs on Kubernetes require more than just installing drivers. Learn how to schedule, share, and optimize GPU resources for AI/ML workloads at scale.

THNKBIG Team

Engineering Insights


Kubernetes was built for stateless web apps. GPUs don't care. Running AI/ML workloads on Kubernetes requires understanding how the scheduler sees GPUs, how to share expensive hardware efficiently, and how to avoid the bill shock that comes from idle A100s.

The GPU Scheduling Problem

Kubernetes treats GPUs as extended resources. The scheduler sees "nvidia.com/gpu: 1" and assigns the pod to a node with a GPU. Simple. But real AI workloads are messier. Training jobs need multiple GPUs with NVLink. Inference pods need fractional GPU access. Batch jobs should scavenge unused capacity without blocking production traffic.
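Here's what that looks like in practice: a minimal pod spec requesting a single GPU through the NVIDIA device plugin. The pod name and image below are placeholders.

```yaml
# Minimal pod requesting one whole GPU as an extended resource.
# The scheduler only counts nvidia.com/gpu; it knows nothing about NVLink
# topology or how much GPU memory the container actually uses.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-inference                # hypothetical name
spec:
  containers:
    - name: model-server
      image: registry.example.com/model-server:latest   # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1          # a whole GPU, even if you only need 5GB of it
```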

The default Kubernetes scheduler doesn't understand any of this. A pod requesting one GPU gets an entire A100 — even if it only needs 5GB of memory. That's $30/hour of waste.

GPU Sharing: MIG vs Time-Slicing

NVIDIA Multi-Instance GPU (MIG) physically partitions an A100 or H100 into up to seven isolated instances. Each instance has dedicated memory and compute. This is ideal for inference workloads that need predictable latency — you get hardware isolation without the cost of a full GPU.
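With the device plugin running in the "mixed" MIG strategy, each MIG profile shows up as its own extended resource. A sketch of a pod requesting a single 1g.5gb slice follows; the names are illustrative, and the exact resource names depend on how your MIG profiles are configured.

```yaml
# Assumes the device plugin / GPU Operator runs MIG in "mixed" strategy,
# which exposes each MIG profile as its own extended resource.
apiVersion: v1
kind: Pod
metadata:
  name: mig-inference                # hypothetical name
spec:
  containers:
    - name: model-server
      image: registry.example.com/model-server:latest   # placeholder image
      resources:
        limits:
          nvidia.com/mig-1g.5gb: 1   # one isolated 1g.5gb slice instead of a full A100
```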

Time-slicing is softer. Multiple pods share the same GPU by taking turns. Context switching adds overhead, but for batch inference or development workloads, it's often good enough. We typically deploy time-slicing for dev/test environments and MIG for production inference.
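Time-slicing is enabled through the device plugin's sharing config. A rough sketch of the ConfigMap is below, assuming the GPU Operator is pointed at it; the ConfigMap name, namespace, and profile key are placeholders.

```yaml
# Hypothetical ConfigMap for the NVIDIA device plugin's time-slicing feature:
# each physical GPU is advertised to the scheduler as 4 replicas.
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config          # placeholder name; reference it from the GPU Operator
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4              # 4 pods share each GPU; no memory or fault isolation
```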

Node Affinity and Taints

GPU nodes are expensive. You don't want random pods landing on them. Use taints to repel non-GPU workloads and tolerations to allow GPU pods through. Node affinity rules can target specific GPU types — route your LLM inference to A100s while smaller models run on T4s.
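A sketch of that pattern: taint the GPU nodes, then give GPU pods a matching toleration and a node-affinity rule on the GPU product label. We assume the label set by NVIDIA's GPU Feature Discovery here; adjust to whatever labels your nodes actually carry.

```yaml
# Taint GPU nodes so non-GPU workloads stay off:
#   kubectl taint nodes <gpu-node> nvidia.com/gpu=present:NoSchedule
# Then GPU pods tolerate the taint and pin themselves to a specific GPU model.
apiVersion: v1
kind: Pod
metadata:
  name: llm-inference                # hypothetical name
spec:
  tolerations:
    - key: nvidia.com/gpu
      operator: Equal
      value: present
      effect: NoSchedule
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: nvidia.com/gpu.product    # label from GPU Feature Discovery (assumption)
                operator: In
                values:
                  - NVIDIA-A100-SXM4-80GB
  containers:
    - name: model-server
      image: registry.example.com/llm-server:latest     # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1
```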

Autoscaling GPU Nodes

The Kubernetes Cluster Autoscaler works, but it's slow: it only reacts once pods are already pending, and GPU nodes take minutes to provision. Karpenter (on AWS) and GKE Autopilot respond faster by provisioning right-sized nodes directly instead of resizing node groups. For predictable traffic patterns, we go further and schedule scale-up before peak hours so capacity is already warm.
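On AWS with Karpenter, a dedicated GPU NodePool keeps this contained. A hedged sketch follows; API versions and field names vary across Karpenter releases, and the EC2NodeClass reference is an assumption.

```yaml
# Hedged sketch of a Karpenter NodePool dedicated to GPU capacity on AWS.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu                          # hypothetical name
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu                    # assumes an EC2NodeClass named "gpu" exists
      taints:
        - key: nvidia.com/gpu        # matches the toleration shown earlier
          value: present
          effect: NoSchedule
      requirements:
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["g5.12xlarge", "p4d.24xlarge"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand", "spot"]
  limits:
    nvidia.com/gpu: 64               # hard cap on total GPUs this pool can provision
```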

Spot instances cut GPU costs by 60-90%. The tradeoff is interruption. Training jobs need checkpointing — save state every N minutes so you can resume after preemption. For inference, we maintain a baseline of on-demand capacity with spot instances handling overflow.
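A checkpoint-friendly training Job targeting spot capacity might look like the sketch below. The capacity-type label is Karpenter's (GKE and other platforms use different labels), and the checkpoint flags and PVC name are assumptions about your training image and storage.

```yaml
# Sketch of a training Job pushed onto spot capacity; the training script is
# assumed to checkpoint to shared storage so it can resume after preemption.
apiVersion: batch/v1
kind: Job
metadata:
  name: train-llm                    # hypothetical name
spec:
  backoffLimit: 10                   # tolerate repeated spot interruptions
  template:
    spec:
      restartPolicy: OnFailure
      nodeSelector:
        karpenter.sh/capacity-type: spot
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: trainer
          image: registry.example.com/trainer:latest     # placeholder image
          args: ["--checkpoint-dir=/ckpt", "--checkpoint-interval=10m"]   # hypothetical flags
          resources:
            limits:
              nvidia.com/gpu: 8
          volumeMounts:
            - name: ckpt
              mountPath: /ckpt
      volumes:
        - name: ckpt
          persistentVolumeClaim:
            claimName: training-checkpoints              # assumes an existing shared PVC
```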

Monitoring and Cost Visibility

You can't optimize what you can't see. Deploy DCGM Exporter to expose GPU metrics to Prometheus: utilization, memory, temperature. Build dashboards that show cost per model, cost per team. Set alerts when utilization drops below 50% — that's money burning.
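As a starting point, a PrometheusRule like the sketch below (assuming the Prometheus Operator and DCGM Exporter defaults) fires when average utilization stays under 50% for an hour. Label names can differ between DCGM Exporter versions, so treat the aggregation as a template.

```yaml
# Hedged sketch of a PrometheusRule built on DCGM Exporter's
# DCGM_FI_DEV_GPU_UTIL metric; adjust labels to your exporter version.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-utilization-alerts       # hypothetical name
  namespace: monitoring
spec:
  groups:
    - name: gpu.rules
      rules:
        - alert: GPUUnderutilized
          expr: avg by (Hostname) (DCGM_FI_DEV_GPU_UTIL) < 50
          for: 1h
          labels:
            severity: warning
          annotations:
            summary: "GPU utilization on {{ $labels.Hostname }} has been below 50% for an hour"
```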

GPU workloads on Kubernetes aren't plug-and-play. But with the right scheduling, sharing, and cost controls, you can run AI/ML at scale without the budget blowouts. Start with visibility, add sharing for efficiency, and use spot instances for the final cost optimization.


THNKBIG Team

Engineering Insights

Expert infrastructure engineers at THNKBIG, specializing in Kubernetes, cloud platforms, and AI/ML operations.

Ready to make AI operational?

Whether you're planning GPU infrastructure, stabilizing Kubernetes, or moving AI workloads into production — we'll assess where you are and what it takes to get there.

US-based team · All US citizens · Continental United States only