Running GPU Workloads on Kubernetes: A Practical Guide
GPUs on Kubernetes require more than just installing drivers. Learn how to schedule, share, and optimize GPU resources for AI/ML workloads at scale.
THNKBIG Team
Engineering Insights
Kubernetes was built for stateless web apps. GPUs don't care. Running AI/ML workloads on Kubernetes requires understanding how the scheduler sees GPUs, how to share expensive hardware efficiently, and how to avoid the bill shock that comes from idle A100s.
The GPU Scheduling Problem
Kubernetes treats GPUs as extended resources. The scheduler sees "nvidia.com/gpu: 1" and assigns the pod to a node with a GPU. Simple. But real AI workloads are messier. Training jobs need multiple GPUs with NVLink. Inference pods need fractional GPU access. Batch jobs should scavenge unused capacity without blocking production traffic.
The default Kubernetes scheduler doesn't understand any of this. A pod requesting one GPU gets an entire A100 — even if it only needs 5GB of memory. That's $30/hour of waste.
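The "extended resource" request looks like this in a pod spec. A minimal sketch, built as a plain Python dict so the shape is easy to see; `nvidia.com/gpu` is the standard resource name advertised by NVIDIA's device plugin, while the pod name and image here are hypothetical:

```python
# Minimal sketch of a pod manifest requesting whole GPUs via the
# nvidia.com/gpu extended resource. The pod name and image are
# placeholders; the resource key is the one the NVIDIA device
# plugin advertises.
def gpu_pod_spec(name, image, gpu_count=1):
    """Build a pod manifest dict requesting `gpu_count` whole GPUs."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [{
                "name": name,
                "image": image,
                "resources": {
                    # Extended resources go in limits; the scheduler
                    # only places the pod on a node whose device plugin
                    # advertises enough free GPUs.
                    "limits": {"nvidia.com/gpu": gpu_count},
                },
            }],
        },
    }

pod = gpu_pod_spec("llm-inference", "my-registry/llm:latest")
print(pod["spec"]["containers"][0]["resources"]["limits"])
# {'nvidia.com/gpu': 1}
```

Note that the request is all-or-nothing: `nvidia.com/gpu: 1` claims the whole device, which is exactly the waste problem described above.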
GPU Sharing: MIG vs Time-Slicing
NVIDIA Multi-Instance GPU (MIG) physically partitions an A100 or H100 into up to seven isolated instances. Each instance has dedicated memory and compute. This is ideal for inference workloads that need predictable latency — you get hardware isolation without the cost of a full GPU.
Time-slicing is softer. Multiple pods share the same GPU by taking turns. Context switching adds overhead, but for batch inference or development workloads, it's often good enough. We typically deploy time-slicing for dev/test environments and MIG for production inference.
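From the workload's point of view, the difference shows up in the resource request. A sketch, assuming the GPU Operator's "mixed" MIG strategy, where each MIG profile is advertised as its own extended resource (e.g. `nvidia.com/mig-1g.5gb` on a 40GB A100); under time-slicing, pods still request plain `nvidia.com/gpu` and the device plugin oversubscribes it:

```python
# Sketch: picking the resource request for MIG vs time-sliced sharing.
# Assumes the GPU Operator's "mixed" MIG strategy, where each MIG
# profile becomes its own extended resource; the default profile name
# here (1g.5gb) is one valid A100 profile, used as an illustration.
def gpu_resource_request(mode, mig_profile="1g.5gb"):
    if mode == "mig":
        # A MIG slice: dedicated memory and compute, hardware isolation.
        return {f"nvidia.com/mig-{mig_profile}": 1}
    # Time-sliced or exclusive access: plain GPU request. With
    # time-slicing configured, several such pods can land on one GPU.
    return {"nvidia.com/gpu": 1}

print(gpu_resource_request("mig"))        # {'nvidia.com/mig-1g.5gb': 1}
print(gpu_resource_request("timeslice"))  # {'nvidia.com/gpu': 1}
```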
Node Affinity and Taints
GPU nodes are expensive. You don't want random pods landing on them. Use taints to repel non-GPU workloads and tolerations to allow GPU pods through. Node affinity rules can target specific GPU types — route your LLM inference to A100s while smaller models run on T4s.
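The taint/toleration pairing can be sketched like this. The taint key `nvidia.com/gpu` with effect `NoSchedule` is a common convention (it is also what the GPU Operator ecosystem typically uses), while the `gpu-type` label is an assumed custom label:

```python
# Sketch of the taint/toleration pairing for GPU nodes. The taint key
# and NoSchedule effect follow common convention; the gpu-type label
# is a hypothetical custom label.
node_taint = {
    "key": "nvidia.com/gpu",
    "value": "present",
    "effect": "NoSchedule",  # repel pods that lack a matching toleration
}

gpu_pod_overrides = {
    # The toleration lets the GPU pod through the taint...
    "tolerations": [{
        "key": "nvidia.com/gpu",
        "operator": "Exists",
        "effect": "NoSchedule",
    }],
    # ...and a node selector routes it to the right GPU class,
    # e.g. LLM inference to A100 nodes while smaller models use T4s.
    "nodeSelector": {"gpu-type": "a100"},
}
```

The taint keeps random CPU workloads off expensive nodes; the selector does the reverse, keeping GPU pods off the wrong hardware.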
Autoscaling GPU Nodes
The Kubernetes Cluster Autoscaler works, but it's slow: GPU nodes take minutes to provision. Karpenter (on AWS) and GKE Autopilot respond faster by provisioning nodes directly rather than resizing pre-defined node groups. For predictable traffic patterns, we schedule scale-up before peak hours so capacity is warm before demand arrives.
Spot instances cut GPU costs by 60-90%. The tradeoff is interruption. Training jobs need checkpointing — save state every N minutes so you can resume after preemption. For inference, we maintain a baseline of on-demand capacity with spot instances handling overflow.
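The "checkpoint every N minutes, resume after preemption" loop can be sketched in plain Python. Framework-agnostic and deliberately simplified (real training would checkpoint model and optimizer state with something like `torch.save`); the state shape and file path are illustrative:

```python
import json
import os
import tempfile
import time

# Time-based checkpointing sketch for spot-instance training: persist
# state every `interval_s` seconds so a preempted job resumes from the
# last checkpoint instead of from scratch. Path and state shape are
# illustrative; real jobs would write model/optimizer state to durable
# storage, not a local temp file.
CKPT = os.path.join(tempfile.gettempdir(), "train_ckpt.json")

def save_checkpoint(step, state, path=CKPT):
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step, "state": state}, f)
    os.replace(tmp, path)  # atomic rename: never leave a torn checkpoint

def load_checkpoint(path=CKPT):
    if not os.path.exists(path):
        return 0, {}  # fresh start, no prior checkpoint
    with open(path) as f:
        ckpt = json.load(f)
    return ckpt["step"], ckpt["state"]

def train(total_steps, interval_s=600):
    step, state = load_checkpoint()  # resume here after a preemption
    last_save = time.monotonic()
    while step < total_steps:
        step += 1
        state["loss"] = 1.0 / step   # stand-in for a real training step
        if time.monotonic() - last_save >= interval_s:
            save_checkpoint(step, state)
            last_save = time.monotonic()
    save_checkpoint(step, state)     # final checkpoint on completion
    return step
```

The atomic rename matters on spot instances: a preemption mid-write should never corrupt the only checkpoint you have.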
Monitoring and Cost Visibility
You can't optimize what you can't see. Deploy DCGM Exporter to expose GPU metrics to Prometheus: utilization, memory, temperature. Build dashboards that show cost per model, cost per team. Set alerts when utilization drops below 50% — that's money burning.
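The under-50% alert reduces to a threshold check over DCGM samples. A sketch: `DCGM_FI_DEV_GPU_UTIL` is the DCGM exporter's real per-GPU utilization gauge, but the query shape, label name, and threshold are assumptions you'd tune per cluster:

```python
# Sketch of a low-utilization check over DCGM metrics scraped by
# Prometheus. DCGM_FI_DEV_GPU_UTIL is the exporter's per-GPU
# utilization gauge; the aggregation label and 50% threshold are
# assumptions. In practice you'd run UTIL_QUERY against Prometheus's
# /api/v1/query endpoint and feed the parsed results in here.
UTIL_QUERY = "avg by (Hostname) (DCGM_FI_DEV_GPU_UTIL)"

def underutilized(samples, threshold=50.0):
    """Return hostnames whose average GPU utilization is below threshold.

    `samples` maps hostname -> utilization percent, e.g. parsed from
    the Prometheus query response for UTIL_QUERY.
    """
    return sorted(h for h, util in samples.items() if util < threshold)

print(underutilized({"gpu-node-1": 35.0, "gpu-node-2": 82.5}))
# ['gpu-node-1']
```

In a real setup the same query would back a Prometheus alerting rule rather than an ad hoc script, but the logic is identical: idle A100s page someone.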
GPU workloads on Kubernetes aren't plug-and-play. But with the right scheduling, sharing, and cost controls, you can run AI/ML at scale without the budget blowouts. Start with visibility, add sharing for efficiency, and use spot instances for the final cost optimization.
Key Takeaways
- Running GPU workloads on Kubernetes requires careful node pool configuration, driver management, and workload-specific scheduling policies.
- AI training and inference have different infrastructure requirements — training needs large, exclusive GPU allocations while inference benefits from fractional sharing and autoscaling.
- NVIDIA, AMD, and Intel GPU support all exist in the Kubernetes ecosystem, each with different operator toolchains and scheduling behaviors.
Configuring GPU Node Pools
GPU nodes require the NVIDIA GPU Operator (or AMD ROCm operator for AMD hardware) installed before workloads can be scheduled. The operator manages GPU driver installation, the device plugin DaemonSet, DCGM exporter for GPU metrics, and the container toolkit that enables GPU access from within containers. Attempting to manage drivers manually across a fleet of GPU nodes is error-prone and does not scale.
Isolate GPU nodes from CPU-only workloads using node taints and tolerations. Label GPU nodes with hardware specifications (gpu-type: a100, gpu-memory: 80gb) so workloads that have specific hardware requirements can use node selectors or node affinity rules. This prevents memory-intensive training jobs from landing on nodes equipped only for inference.
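A node-affinity rule targeting those labels might look like this. A sketch using the label keys from the text (`gpu-type`, `gpu-memory`); the values and the hard `required...` constraint are illustrative choices:

```python
# Sketch of a hard node-affinity rule pinning a workload to 80GB A100
# nodes, using the custom gpu-type / gpu-memory labels described above.
# The field names are standard Kubernetes pod-spec affinity fields;
# the label values are illustrative.
affinity = {
    "nodeAffinity": {
        # "required" makes this a hard constraint: the pod stays
        # Pending rather than landing on the wrong hardware.
        "requiredDuringSchedulingIgnoredDuringExecution": {
            "nodeSelectorTerms": [{
                "matchExpressions": [
                    {"key": "gpu-type", "operator": "In", "values": ["a100"]},
                    {"key": "gpu-memory", "operator": "In", "values": ["80gb"]},
                ],
            }],
        },
    },
}
```

A `preferredDuringSchedulingIgnoredDuringExecution` variant would soften this to a scheduling preference, which suits workloads that can tolerate fallback hardware.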
Training vs. Inference Workload Patterns
Training jobs are batch workloads with predictable completion times. They benefit from preemptible or spot GPU instances where the cost savings (60-70% vs on-demand) outweigh the risk of interruption. Using checkpointing frameworks (PyTorch Lightning's built-in checkpointing, Determined AI) lets training jobs resume from the last checkpoint rather than restarting from scratch if a spot instance is reclaimed.
Inference workloads are latency-sensitive and require horizontal autoscaling based on request queue depth or GPU utilization. KEDA's Prometheus scaler can trigger replica addition when GPU utilization exceeds 70% or when pending inference requests exceed a threshold. Unlike training, inference commonly benefits from GPU sharing — serving a medium-sized LLM on a fractional A100 GPU is both cost-effective and achieves acceptable latency for most enterprise use cases.
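A KEDA ScaledObject wiring the Prometheus scaler to DCGM utilization could look like the following sketch (shown as a Python dict mirroring the manifest). The `keda.sh/v1alpha1` API, `prometheus` trigger type, and its `serverAddress`/`query`/`threshold` metadata fields are real KEDA concepts; the Deployment name, Prometheus address, replica bounds, and 70% threshold are assumptions:

```python
# Sketch of a KEDA ScaledObject that scales an inference Deployment on
# average GPU utilization from DCGM metrics. The trigger type and
# metadata keys follow KEDA's Prometheus scaler; names, addresses, and
# thresholds are placeholders.
scaled_object = {
    "apiVersion": "keda.sh/v1alpha1",
    "kind": "ScaledObject",
    "metadata": {"name": "llm-inference-scaler"},
    "spec": {
        "scaleTargetRef": {"name": "llm-inference"},  # the Deployment to scale
        "minReplicaCount": 1,
        "maxReplicaCount": 8,
        "triggers": [{
            "type": "prometheus",
            "metadata": {
                "serverAddress": "http://prometheus.monitoring:9090",
                # add replicas when average GPU utilization passes 70%
                "query": "avg(DCGM_FI_DEV_GPU_UTIL)",
                "threshold": "70",  # KEDA metadata values are strings
            },
        }],
    },
}
```

Queue-depth scaling follows the same shape: swap the query for one over pending-request metrics exposed by the inference server.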
THNKBIG's AI/MLOps practice helps enterprises design GPU infrastructure that handles both training and inference efficiently, reducing per-inference cost while maintaining the compute capacity data science teams need. Talk to our team.
Expert infrastructure engineers at THNKBIG, specializing in Kubernetes, cloud platforms, and AI/ML operations.