Kubernetes · 10 min read

Advanced GPU Scheduling in Kubernetes: Beyond the Basics

The default Kubernetes scheduler wastes GPUs. Learn about priority classes, preemption, gang scheduling, and topology-aware placement for AI workloads.

THNKBIG Team

Engineering Insights


Kubernetes knows how to schedule pods. It doesn't know that your distributed training job needs 8 GPUs with NVLink, or that inference should preempt batch jobs, or that spreading GPUs across nodes kills performance. Let's fix that.

Priority Classes and Preemption

Not all workloads are equal. Production inference serves customers. Training jobs can wait. Batch processing can scavenge leftover capacity. Priority classes let you define this hierarchy. When cluster resources are tight, Kubernetes evicts lower-priority pods to make room for higher-priority ones.

We typically define three tiers: critical (production inference, SLA-bound), normal (scheduled training, experiments), and scavenger (batch jobs that tolerate interruption). The scavenger tier is where you put workloads that should run on otherwise-idle GPUs without blocking production.
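As a rough sketch, that hierarchy might look like the PriorityClass objects below. The names, values, and the choice of `preemptionPolicy: Never` for the scavenger tier are illustrative, not prescriptive:

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: gpu-critical
value: 1000000            # highest: production inference, SLA-bound
description: "Production inference; may preempt lower tiers"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: gpu-normal
value: 100000             # scheduled training and experiments
description: "Training jobs and experiments"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: gpu-scavenger
value: 1000
preemptionPolicy: Never   # scavenger pods never evict others; they only wait
description: "Interruptible batch work on otherwise-idle GPUs"
```

Workloads opt in by setting `priorityClassName` in their pod spec. The scavenger tier's `preemptionPolicy: Never` means those pods wait for free capacity rather than evicting anything themselves, while still being the first to go when a critical pod needs room.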

Gang Scheduling for Distributed Training

Distributed training needs all its GPUs at once. If you request 8 pods with 1 GPU each and only 6 GPUs are available, the default scheduler places 6 pods and leaves 2 Pending. Those 6 pods hold their GPUs idle, waiting for siblings that might never arrive.

Gang scheduling is all-or-nothing: either schedule all pods in the group, or schedule none. The Kubernetes community has several solutions, including Volcano, the Coscheduling scheduler plugin, and the newer JobSet controller. We prefer Volcano for complex training workflows because it also handles job queuing and fair-share scheduling.
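Here's a minimal sketch of a gang-scheduled Volcano Job, assuming the Volcano scheduler is installed in the cluster. The queue name, image, and replica count are placeholders for your own setup:

```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: ddp-train
spec:
  schedulerName: volcano
  minAvailable: 8           # gang condition: bind all 8 workers or none of them
  queue: training           # assumes a 'training' Volcano queue exists
  tasks:
    - name: worker
      replicas: 8
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: trainer
              image: registry.example.com/ddp-trainer:latest  # placeholder image
              resources:
                limits:
                  nvidia.com/gpu: 1
```

The `minAvailable: 8` field is what makes this a gang: Volcano holds the whole job until it can place every worker, so partial allocations never sit idle.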

Topology-Aware Scheduling

GPU placement matters. Two GPUs on the same node with NVLink communicate at 600 GB/s. Two GPUs across nodes talk through the network at maybe 25 GB/s. For training jobs that synchronize gradients every iteration, this difference is enormous.

Topology-aware scheduling respects these constraints. The scheduler understands which GPUs are connected via NVLink, which share PCIe switches, and which are on the same NUMA node. It places pods to maximize communication bandwidth. This isn't built into vanilla Kubernetes — you need the Topology Aware Scheduling plugin or a custom scheduler.
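The node-local half of this is the easier part to configure today. Below is a sketch of a kubelet configuration that uses the built-in Topology Manager to keep a pod's GPUs and CPUs on the same NUMA node; NVLink-aware placement across GPUs and nodes still needs the device plugin's topology support or a scheduler plugin, and the reserved CPU range here is a placeholder:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Require all of a pod's resources to come from a single NUMA node,
# so a GPU and the CPUs feeding it share a PCIe/NUMA domain.
topologyManagerPolicy: single-numa-node
topologyManagerScope: pod
cpuManagerPolicy: static        # CPU pinning must be on for alignment to matter
reservedSystemCPUs: "0-1"       # placeholder: CPUs set aside for system daemons
```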

Bin Packing vs Spreading

The default scheduler spreads pods across nodes for availability. For GPUs, this is usually wrong. Bin packing — filling up nodes before using new ones — keeps communication local and enables autoscaling down when nodes empty out.
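One way to get bin-packing behavior is a scheduler profile that scores nodes with the MostAllocated strategy, weighted toward GPUs. A sketch, with the profile name and weights as illustrative choices:

```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: gpu-binpack        # pods opt in via spec.schedulerName
    pluginConfig:
      - name: NodeResourcesFit
        args:
          scoringStrategy:
            type: MostAllocated       # prefer nodes that are already well used
            resources:
              - name: nvidia.com/gpu
                weight: 5             # weight GPU utilization heavily
              - name: cpu
                weight: 1
              - name: memory
                weight: 1
```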

Configure the scheduler to prefer nodes that already have your workload's pods. Use pod affinity rules to co-locate related workloads. For inference, consider anti-affinity to spread replicas for fault tolerance — but keep training consolidated.
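For example, a hypothetical inference Deployment might spread its replicas with preferred anti-affinity while training workers co-locate with soft affinity. These are `spec.affinity` stanzas to drop into the respective pod templates, and the labels (`app: llm-inference`, `job: ddp-train`) are placeholders:

```yaml
# Inference pod template: prefer to spread replicas across nodes
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: llm-inference
          topologyKey: kubernetes.io/hostname
---
# Training pod template: prefer to land next to sibling workers
affinity:
  podAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              job: ddp-train
          topologyKey: kubernetes.io/hostname
```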

Queue Management

When GPU demand exceeds supply, jobs need to queue. Without proper queue management, you get chaos — whoever submits first wins, regardless of priority or fair share. Volcano and Kueue provide job queuing with quotas, priorities, and fair-share policies.
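A minimal Kueue sketch, assuming Kueue is installed: a ClusterQueue holds the GPU quota and a namespaced LocalQueue feeds jobs into it. The names, namespace, and quotas are placeholders:

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: default-flavor
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: team-ml
spec:
  namespaceSelector: {}              # admit workloads from any namespace
  resourceGroups:
    - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
      flavors:
        - name: default-flavor
          resources:
            - name: cpu
              nominalQuota: 256
            - name: memory
              nominalQuota: 1Ti
            - name: nvidia.com/gpu
              nominalQuota: 16       # this team's fair-share GPU quota
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: training
  namespace: ml-team
spec:
  clusterQueue: team-ml
```

Jobs join the queue with the `kueue.x-k8s.io/queue-name: training` label and stay suspended until Kueue admits them against the quota.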

Stock Kubernetes scheduling is a starting point, not a solution. For serious GPU workloads, invest in gang scheduling, topology awareness, and proper queue management. The performance and cost improvements are substantial.


THNKBIG Team

Engineering Insights

Expert infrastructure engineers at THNKBIG, specializing in Kubernetes, cloud platforms, and AI/ML operations.
