Advanced GPU Scheduling in Kubernetes: Beyond the Basics
The default Kubernetes scheduler wastes GPUs. Learn about priority classes, preemption, gang scheduling, and topology-aware placement for AI workloads.
THNKBIG Team
Engineering Insights
Kubernetes knows how to schedule pods. It doesn't know that your distributed training job needs 8 GPUs with NVLink, or that inference should preempt batch jobs, or that spreading GPUs across nodes kills performance. Let's fix that.
Priority Classes and Preemption
Not all workloads are equal. Production inference serves customers. Training jobs can wait. Batch processing can scavenge leftover capacity. Priority classes let you define this hierarchy. When cluster resources are tight, Kubernetes evicts lower-priority pods to make room for higher-priority ones.
We typically define three tiers: critical (production inference, SLA-bound), normal (scheduled training, experiments), and scavenger (batch jobs that tolerate interruption). The scavenger tier is where you put workloads that should run on otherwise-idle GPUs without blocking production.
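Roughly, those tiers might look like the following PriorityClass objects. This is a sketch: the names and values are illustrative, not a prescription.

```yaml
# Illustrative priority tiers; names and values are examples, not a prescription.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: inference-critical
value: 1000000              # highest: SLA-bound production inference
preemptionPolicy: PreemptLowerPriority
description: "Production inference; may preempt lower tiers."
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: training-normal
value: 100000               # scheduled training and experiments
description: "Training jobs; can wait, can be preempted by inference."
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch-scavenger
value: 1000                 # lowest: batch work that tolerates interruption
preemptionPolicy: Never     # scavengers never evict anything else
description: "Opportunistic batch jobs on otherwise-idle GPUs."
```

Pods opt into a tier with spec.priorityClassName. Setting preemptionPolicy: Never on the scavenger class means batch pods can still be evicted, but will never evict anyone themselves.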
Gang Scheduling for Distributed Training
Distributed training needs all its GPUs at once. If your job creates 8 pods with 1 GPU each and only 6 GPUs are free, the default scheduler places 6 of them and leaves 2 Pending. Those 6 pods hold their GPUs while doing nothing, waiting for siblings that might never arrive.
Gang scheduling is all-or-nothing: either schedule all pods in the group, or schedule none. The Kubernetes community has several solutions — Volcano, Coscheduling plugin, and the newer JobSet controller. We prefer Volcano for complex training workflows because it also handles job queuing and fair-share scheduling.
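If you go the Volcano route, a gang-scheduled training job might look roughly like this. Treat it as a sketch: the image, queue name, and priority class are placeholders, and it assumes Volcano is installed with a matching queue.

```yaml
# Illustrative Volcano Job; image, queue, and priority class are placeholders.
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: ddp-train
spec:
  schedulerName: volcano
  minAvailable: 8              # gang constraint: schedule all 8 workers or none
  queue: training              # assumes a 'training' queue has been created
  tasks:
  - name: worker
    replicas: 8
    template:
      spec:
        priorityClassName: training-normal   # the illustrative tier from above
        restartPolicy: Never
        containers:
        - name: trainer
          image: registry.example.com/ddp-train:latest   # placeholder image
          resources:
            limits:
              nvidia.com/gpu: 1
```

With minAvailable equal to the replica count, Volcano holds the whole group until 8 GPUs can be claimed together, instead of letting a partial gang squat on capacity.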
Topology-Aware Scheduling
GPU placement matters. Two GPUs on the same node with NVLink communicate at 600 GB/s. Two GPUs across nodes talk through the network at maybe 25 GB/s. For training jobs that synchronize gradients every iteration, this difference is enormous.
Topology-aware scheduling respects these constraints. The scheduler understands which GPUs are connected via NVLink, which share PCIe switches, and which are on the same NUMA node. It places pods to maximize communication bandwidth. This isn't built into vanilla Kubernetes — you need the Topology Aware Scheduling plugin or a custom scheduler.
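One piece of this is available on the node side today: the kubelet's Topology Manager can keep a pod's GPUs and exclusive CPUs aligned to the same NUMA node. Here's a minimal KubeletConfiguration sketch; cross-node NVLink and PCIe awareness still needs a topology-aware scheduler plugin.

```yaml
# KubeletConfiguration fragment for NUMA alignment on GPU nodes.
# Handles intra-node placement only; cluster-level topology still
# requires a scheduler plugin or custom scheduler.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static                  # pin exclusive CPUs for Guaranteed pods
topologyManagerPolicy: single-numa-node   # keep GPUs and CPUs on one NUMA node
topologyManagerScope: pod                 # align all containers in the pod together
```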
Bin Packing vs Spreading
The default scheduler spreads pods across nodes for availability. For GPUs, this is usually wrong. Bin packing — filling up nodes before using new ones — keeps communication local and enables autoscaling down when nodes empty out.
Configure the scheduler to prefer nodes that already have your workload's pods. Use pod affinity rules to co-locate related workloads. For inference, consider anti-affinity to spread replicas for fault tolerance — but keep training consolidated.
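One way to get bin-packing behavior is a second scheduler profile that scores nodes by how allocated they already are. A sketch, where the profile name and weights are our own assumptions:

```yaml
# Illustrative scheduler profile that bin-packs GPU workloads.
# Profile name and weights are examples; tune for your cluster.
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: gpu-binpack
  pluginConfig:
  - name: NodeResourcesFit
    args:
      scoringStrategy:
        type: MostAllocated          # prefer nodes that are already full
        resources:
        - name: nvidia.com/gpu
          weight: 10                 # GPU utilization dominates the score
        - name: cpu
          weight: 1
        - name: memory
          weight: 1
```

Training pods then set schedulerName: gpu-binpack, while inference deployments keep the default profile plus their anti-affinity spread.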
Queue Management
When GPU demand exceeds supply, jobs need to queue. Without proper queue management, you get chaos — whoever submits first wins, regardless of priority or fair share. Volcano and Kueue provide job queuing with quotas, priorities, and fair-share policies.
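As a sketch of the Kueue flavor of this, assuming Kueue is installed and with queue names and quotas as placeholders:

```yaml
# Illustrative Kueue setup: one cluster-wide GPU quota, one team-facing queue.
# Names and quota numbers are placeholders.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: a100
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: gpu-cluster-queue
spec:
  namespaceSelector: {}              # admit workloads from any namespace
  resourceGroups:
  - coveredResources: ["nvidia.com/gpu"]
    flavors:
    - name: a100
      resources:
      - name: nvidia.com/gpu
        nominalQuota: 32             # total GPUs this queue may admit at once
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: ml-team
  namespace: ml-team
spec:
  clusterQueue: gpu-cluster-queue
```

Jobs opt in with the kueue.x-k8s.io/queue-name label and stay suspended until the ClusterQueue has quota to admit them, which is exactly the behavior you want when demand outstrips supply.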
Stock Kubernetes scheduling is a starting point, not a solution. For serious GPU workloads, invest in gang scheduling, topology awareness, and proper queue management. The performance and cost improvements are substantial.