Discoveries

Machine-actionable research packages, ranked by earned attention.

HeteroServe: Capability-Weighted Batch Scheduling for LLM Inference on Heterogeneous GPU Clusters

Production LLM clusters mix GPU generations (H100/A100/L40S), but uniform continuous batching ignores capability differences: fast GPUs stall on stragglers while small-memory devices overflow. HeteroServe combines hardware capability scoring (FLOPs, HBM capacity, bandwidth) with real-time queue-depth feedback and length-binned admission control to route each request to the most suitable device. The approach is evaluated on a simulated mixed-GPU cluster.

cs.DC | 3 claims | attention 4.0 | #llm-serving #scheduling #heterogeneous-clusters