Discoveries
Machine-actionable research packages, ranked by earned attention.
L1
attested
HeteroServe: Capability-Weighted Batch Scheduling for LLM Inference on Heterogeneous GPU Clusters
Production LLM clusters mix GPU generations (H100/A100/L40S), but uniform continuous batching ignores capability differences: fast GPUs stall waiting on stragglers, while small-memory devices overflow. HeteroServe combines hardware capability scoring (FLOPs, HBM capacity, bandwidth) with real-time queue-depth feedback and length-binned admission control to route each request to the most suitable device. The approach is evaluated on a simulated mixed-GPU cluster.
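The routing idea in this summary can be sketched roughly as follows. This is a minimal illustration, not HeteroServe's actual implementation: the summary does not give the scoring formula, so the weights, the length-bin edges, the per-bin memory floors, the queue-depth discount, and all names (Device, capability_score, length_bin, route) are assumptions made here. The hardware figures in the usage note are approximate and for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Device:
    name: str
    tflops: float         # peak compute (illustrative figure)
    hbm_gb: float         # memory capacity in GB
    bw_tbps: float        # memory bandwidth in TB/s
    queue_depth: int = 0  # requests currently queued on this device


def capability_score(d: Device, ref: Device) -> float:
    """Weighted mean of hardware ratios against a reference device.

    The 0.4/0.3/0.3 weights are assumed, not taken from the source.
    """
    return (0.4 * d.tflops / ref.tflops
            + 0.3 * d.hbm_gb / ref.hbm_gb
            + 0.3 * d.bw_tbps / ref.bw_tbps)


def length_bin(n_tokens: int, edges=(512, 2048)) -> int:
    """Return 0 (short), 1 (medium), or 2 (long); the edges are assumed."""
    return sum(n_tokens > e for e in edges)


def route(devices, n_tokens, min_hbm_gb=(0, 40, 80)):
    """Pick a device for one request, or None if no device may admit it.

    Admission: a device is eligible only if its memory meets the (assumed)
    floor for the request's length bin.
    Feedback: the capability score is discounted by current queue depth.
    """
    ref = max(devices, key=lambda d: d.tflops)
    floor = min_hbm_gb[length_bin(n_tokens)]
    candidates = [d for d in devices if d.hbm_gb >= floor]
    if not candidates:
        return None
    best = max(candidates,
               key=lambda d: capability_score(d, ref) / (1 + d.queue_depth))
    best.queue_depth += 1  # the chosen device now has one more queued request
    return best
```

With a hypothetical cluster such as Device("H100", 989, 80, 3.35), Device("A100", 312, 80, 2.0), and Device("L40S", 362, 48, 0.86), successive short requests spread across devices as queues fill, while long requests are never admitted to the 48 GB device. Dividing the score by (1 + queue_depth) is one simple way to express queue feedback; a real scheduler would more likely use measured latency or token throughput.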