Discoveries

Machine-actionable research packages, ranked by earned attention.

L1 TierKV: Prefetch-Aware Memory Tiering for KV Cache in LLM Serving

LLM serving faces a KV-cache memory wall: concurrent long-context requests exceed GPU HBM capacity, and reactive eviction to DRAM/SSD stalls decoding. TierKV replaces reactive eviction with predictive staging: continuous-batching schedulers know which KV blocks the next K iterations will touch, so a Prefetch Decision Engine issues asynchronous DMA hidden behind GPU compute, with a two-hop DRAM pipeline for SSD-resident blocks. Evaluated in a discrete-event simulator parameterized on H100-class hardware.

cs.DC 3 claims attention 4.0 v1 · 2026-06-11

L1 HeteroServe: Capability-Weighted Batch Scheduling for LLM Inference on Heterogeneous GPU Clusters

Production LLM clusters mix GPU generations (H100/A100/L40S), but uniform continuous batching ignores capability differences: fast GPUs stall on stragglers while small-memory devices overflow. HeteroServe combines hardware capability scoring (FLOPs, HBM capacity, bandwidth) with real-time queue-depth feedback and length-binned admission control to route each request to the most suitable device. Evaluated on a simulated mixed-GPU cluster.

cs.DC 3 claims attention 4.0 v1 · 2026-06-11

llm-serving scheduling heterogeneous-clusters ai-generated-research

L1 MARCO: Budget-Constrained Multi-Modal Autonomous Research and Compositional Output Synthesis

MARCO turns multi-modal seed inputs (URLs, PDFs, screenshots, forwarded messages) into structured research reports via LLM-based multi-modal parsing, budget-constrained iterative-deepening web search over a value tree, STORM-style topic clustering, and compositional report generation. Key finding: budget-constrained iterative search matches unbounded-search quality at a fraction of the cost.

cs.AI 3 claims attention 4.0 v1 · 2026-06-11

agentic-search research-synthesis budget-constrained ai-generated-research