Kubernetes · 8 min read

Running GPU Workloads on Kubernetes: A Practical Guide

GPUs on Kubernetes require more than just installing drivers. Learn how to schedule, share, and optimize GPU resources for AI/ML workloads at scale.

THNKBIG Team

Engineering Insights


Kubernetes was built for stateless web apps. GPUs don't care. Running AI/ML workloads on Kubernetes requires understanding how the scheduler sees GPUs, how to share expensive hardware efficiently, and how to avoid the bill shock that comes from idle A100s.

The GPU Scheduling Problem

Kubernetes treats GPUs as extended resources. The scheduler sees "nvidia.com/gpu: 1" and assigns the pod to a node with a GPU. Simple. But real AI workloads are messier. Training jobs need multiple GPUs with NVLink. Inference pods need fractional GPU access. Batch jobs should scavenge unused capacity without blocking production traffic.

The default Kubernetes scheduler doesn't understand any of this. A pod requesting one GPU gets an entire A100, even if it only needs 5GB of memory. On-demand, that GPU costs several dollars an hour; on an 8-GPU node, roughly $30/hour of capacity can sit mostly idle.
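This whole-GPU granularity is visible in the pod spec itself. A minimal example (pod name and image are illustrative) — the key line is the `nvidia.com/gpu` limit, which only accepts whole integers:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cuda-smoke-test          # hypothetical name
spec:
  restartPolicy: Never
  containers:
  - name: gpu-check
    image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1        # whole GPUs only — no fractional values here
```

Requesting `nvidia.com/gpu: 0.5` is rejected outright; fractional access requires one of the sharing mechanisms below.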

GPU Sharing: MIG vs Time-Slicing

NVIDIA Multi-Instance GPU (MIG) physically partitions an A100 or H100 into up to seven isolated instances. Each instance has dedicated memory and compute. This is ideal for inference workloads that need predictable latency — you get hardware isolation without the cost of a full GPU.

Time-slicing is softer. Multiple pods share the same GPU by taking turns. Context switching adds overhead, but for batch inference or development workloads, it's often good enough. We typically deploy time-slicing for dev/test environments and MIG for production inference.
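Time-slicing is enabled through the NVIDIA device plugin's configuration. A minimal sketch of that config (replica count is illustrative; in practice it's delivered via a ConfigMap referenced by the device plugin or GPU Operator):

```yaml
# NVIDIA device plugin config: advertise each physical GPU as
# multiple schedulable nvidia.com/gpu resources.
version: v1
sharing:
  timeSlicing:
    resources:
    - name: nvidia.com/gpu
      replicas: 4          # each physical GPU appears as 4 GPUs to the scheduler
```

With MIG, by contrast, pods request a specific slice as its own extended resource — for example a limit of `nvidia.com/mig-1g.5gb: 1` for one 5GB instance of an A100 — so the scheduler places workloads onto hardware-isolated partitions rather than time-shared ones.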

Node Affinity and Taints

GPU nodes are expensive. You don't want random pods landing on them. Use taints to repel non-GPU workloads and tolerations to allow GPU pods through. Node affinity rules can target specific GPU types — route your LLM inference to A100s while smaller models run on T4s.
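The taint-and-toleration half of this looks roughly like the following (node and pod names are hypothetical; the taint key just needs to match the toleration):

```yaml
# Taint GPU nodes so ordinary pods are repelled (hypothetical node name):
#   kubectl taint nodes gpu-node-1 nvidia.com/gpu=present:NoSchedule
apiVersion: v1
kind: Pod
metadata:
  name: llm-inference            # hypothetical name
spec:
  tolerations:
  - key: nvidia.com/gpu          # must match the taint key above
    operator: Exists
    effect: NoSchedule
  containers:
  - name: server
    image: my-registry/llm-server:latest    # placeholder image
    resources:
      limits:
        nvidia.com/gpu: 1
```

Note the toleration only permits scheduling onto tainted nodes; it doesn't force it. Targeting a specific GPU type additionally requires node labels and selectors or affinity rules.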

Autoscaling GPU Nodes

The Kubernetes Cluster Autoscaler works, but it's slow, and GPU nodes take minutes to provision. Karpenter (on AWS) and GKE's node auto-provisioning respond faster by provisioning right-sized nodes directly rather than resizing fixed node groups. For predictable traffic patterns, we schedule scale-up before peak hours.

Spot instances cut GPU costs by 60-90%. The tradeoff is interruption. Training jobs need checkpointing — save state every N minutes so you can resume after preemption. For inference, we maintain a baseline of on-demand capacity with spot instances handling overflow.

Monitoring and Cost Visibility

You can't optimize what you can't see. Deploy DCGM Exporter to expose GPU metrics to Prometheus: utilization, memory, temperature. Build dashboards that show cost per model, cost per team. Set alerts when utilization drops below 50% — that's money burning.
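With DCGM Exporter feeding Prometheus, the under-50% alert can be expressed as a PrometheusRule (assuming the Prometheus Operator; rule names and label usage are illustrative — DCGM Exporter publishes per-GPU utilization as `DCGM_FI_DEV_GPU_UTIL`, a 0–100 gauge):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-utilization-alerts       # hypothetical name
spec:
  groups:
  - name: gpu.rules
    rules:
    - alert: GPUUnderutilized
      # Average utilization over 30 minutes below 50% = money burning
      expr: avg_over_time(DCGM_FI_DEV_GPU_UTIL[30m]) < 50
      for: 30m
      labels:
        severity: warning
      annotations:
        summary: "GPU utilization below 50% for 30 minutes"
```

The same metric, joined with node or namespace labels, drives the cost-per-model and cost-per-team dashboards.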

GPU workloads on Kubernetes aren't plug-and-play. But with the right scheduling, sharing, and cost controls, you can run AI/ML at scale without the budget blowouts. Start with visibility, add sharing for efficiency, and use spot instances for the final cost optimization.

Key Takeaways

  • Running GPU workloads on Kubernetes requires careful node pool configuration, driver management, and workload-specific scheduling policies.
  • AI training and inference have different infrastructure requirements — training needs large, exclusive GPU allocations while inference benefits from fractional sharing and autoscaling.
  • NVIDIA, AMD, and Intel GPU support all exist in the Kubernetes ecosystem, each with different operator toolchains and scheduling behaviors.

Configuring GPU Node Pools

GPU nodes require the NVIDIA GPU Operator (or AMD ROCm operator for AMD hardware) installed before workloads can be scheduled. The operator manages GPU driver installation, the device plugin DaemonSet, DCGM exporter for GPU metrics, and the container toolkit that enables GPU access from within containers. Attempting to manage drivers manually across a fleet of GPU nodes is error-prone and does not scale.

Isolate GPU nodes from CPU-only workloads using node taints and tolerations. Label GPU nodes with hardware specifications (gpu-type: a100, gpu-memory: 80gb) so workloads that have specific hardware requirements can use node selectors or node affinity rules. This prevents memory-intensive training jobs from landing on nodes equipped only for inference.
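The label-and-selector half of this pattern, sketched with hypothetical node and pod names:

```yaml
# Label GPU nodes with their hardware specs (hypothetical node name):
#   kubectl label nodes gpu-node-1 gpu-type=a100 gpu-memory=80gb
apiVersion: v1
kind: Pod
metadata:
  name: big-training-job       # hypothetical name
spec:
  nodeSelector:
    gpu-type: a100             # only schedule onto A100 nodes
    gpu-memory: 80gb
  containers:
  - name: trainer
    image: my-registry/trainer:latest    # placeholder image
    resources:
      limits:
        nvidia.com/gpu: 4
```

A `nodeSelector` is a hard requirement; use node affinity instead when a preference (rather than a mandate) for certain hardware is enough.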

Training vs. Inference Workload Patterns

Training jobs are batch workloads with predictable completion times. They benefit from preemptible or spot GPU instances where the cost savings (60-70% vs on-demand) outweigh the risk of interruption. Using checkpointing frameworks (PyTorch Lightning's built-in checkpointing, Determined AI) lets training jobs resume from the last checkpoint rather than restarting from scratch if a spot instance is reclaimed.
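The checkpoint-and-resume pattern can be sketched in plain Python. This is framework-free and illustrative — the file path, step counts, and "loss" computation are stand-ins for a real training loop; frameworks like PyTorch Lightning implement the same idea for model and optimizer state:

```python
import os
import pickle
import tempfile

# Illustrative checkpoint location; real jobs write to durable storage
# (e.g. a PersistentVolume or object store) that survives node preemption.
CKPT = os.path.join(tempfile.gettempdir(), "train_ckpt.pkl")

def save_checkpoint(state, path=CKPT):
    # Write to a temp file and rename, so a preemption mid-write
    # can't leave a corrupt checkpoint behind.
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)

def load_checkpoint(path=CKPT):
    # Resume from the last checkpoint if one exists, else start fresh.
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return {"step": 0, "loss_history": []}

def train(total_steps=100, checkpoint_every=10):
    state = load_checkpoint()   # picks up where a preempted run left off
    while state["step"] < total_steps:
        state["step"] += 1
        # Stand-in for one real training step.
        state["loss_history"].append(1.0 / state["step"])
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(state)
    return state

state = train()
print(state["step"])  # 100
```

If a spot node is reclaimed mid-run, the restarted pod loses at most `checkpoint_every` steps of work instead of everything.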

Inference workloads are latency-sensitive and require horizontal autoscaling based on request queue depth or GPU utilization. KEDA's Prometheus scaler can trigger replica addition when GPU utilization exceeds 70% or when pending inference requests exceed a threshold. Unlike training, inference commonly benefits from GPU sharing — serving a medium-sized LLM on a fractional A100 GPU is both cost-effective and achieves acceptable latency for most enterprise use cases.
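A KEDA ScaledObject wiring GPU utilization to replica count might look like this (names, the Prometheus address, and the 70% threshold are illustrative; `DCGM_FI_DEV_GPU_UTIL` is the DCGM Exporter utilization gauge):

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: llm-inference-scaler       # hypothetical name
spec:
  scaleTargetRef:
    name: llm-inference            # the inference Deployment to scale
  minReplicaCount: 1
  maxReplicaCount: 10
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring:9090   # assumed Prometheus endpoint
      query: avg(DCGM_FI_DEV_GPU_UTIL)                   # average GPU utilization, 0-100
      threshold: "70"              # add replicas above 70% average utilization
```

A queue-depth trigger works the same way: swap the query for one over pending inference requests.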

THNKBIG's AI/MLOps practice helps enterprises design GPU infrastructure that handles both training and inference efficiently, reducing per-inference cost while maintaining the compute capacity data science teams need. Talk to our team.


Expert infrastructure engineers at THNKBIG, specializing in Kubernetes, cloud platforms, and AI/ML operations.

Ready to make AI operational?

Whether you're planning GPU infrastructure, stabilizing Kubernetes, or moving AI workloads into production — we'll assess where you are and what it takes to get there.
