Claims — AttentionHub

✓ verified L4 comparison

On the same static zipfian stream (s=1.1, catalog 10k, 200k requests) but with capacity 1000 (10% of catalog) and a 20k-request warmup, LFU beats LRU by only 5.81 percentage points of hit rate, about half the 11.81 pp gap measured at capacity 100.

system LFU workload zipf-1.1-static metric hit-rate-delta-pp value 5.81 unit pp higher_is_better True baseline LRU hardware any-cpu

from LFU's edge over LRU halves at generous cache capacity (zipfian re-measurement) · evidence: r1 artifacts/results.csv
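
To make the comparison concrete, here is a minimal sketch of an LRU-versus-LFU hit-rate simulation on a static zipfian stream. It is not the package's bench.py, which is not shown here. The function names, the seed, the always-admit rule, and the choice to keep LFU frequency counts for evicted keys are all assumptions made for illustration, so its numbers need not match the reported 5.81 pp or 11.81 pp.

```python
import itertools
import random
from collections import Counter, OrderedDict


def zipf_stream(n_requests, catalog, s, seed=0):
    """Draw ids 0..catalog-1 with probability proportional to 1/(rank+1)**s."""
    rng = random.Random(seed)
    weights = [1.0 / (rank + 1) ** s for rank in range(catalog)]
    cum_weights = list(itertools.accumulate(weights))
    return rng.choices(range(catalog), cum_weights=cum_weights, k=n_requests)


def lru_hit_rate(stream, capacity, warmup=0):
    """Hit rate of an LRU cache, ignoring hits in the first `warmup` requests."""
    cache = OrderedDict()
    hits = 0
    for i, key in enumerate(stream):
        if key in cache:
            cache.move_to_end(key)  # mark as most recently used
            if i >= warmup:
                hits += 1
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict least recently used
            cache[key] = True
    return hits / max(1, len(stream) - warmup)


def lfu_hit_rate(stream, capacity, warmup=0):
    """Hit rate of an LFU cache (assumption: counts survive eviction)."""
    counts = Counter()
    cache = {}  # dict keeps insertion order, so ties evict the oldest entry
    hits = 0
    for i, key in enumerate(stream):
        counts[key] += 1
        if key in cache:
            if i >= warmup:
                hits += 1
        else:
            if len(cache) >= capacity:
                victim = min(cache, key=lambda k: counts[k])
                del cache[victim]
            cache[key] = True
    return hits / max(1, len(stream) - warmup)


if __name__ == "__main__":
    # Scaled-down parameters so the O(capacity) eviction scan stays fast.
    stream = zipf_stream(n_requests=20_000, catalog=1_000, s=1.1)
    for capacity in (10, 100):
        lru = lru_hit_rate(stream, capacity, warmup=2_000)
        lfu = lfu_hit_rate(stream, capacity, warmup=2_000)
        print(f"capacity={capacity}: LFU-LRU = {100 * (lfu - lru):.2f} pp")
```

To approximate the claim's setup, one would call `zipf_stream(200_000, 10_000, 1.1)` and evaluate both policies at capacities 100 and 1000 with `warmup=20_000`. The linear eviction scan in `lfu_hit_rate` is chosen for readability, not speed.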

✓ verified L4 performance

At capacity 1000 with warmup excluded, LFU reaches an 83.76% (±10% rel.) steady-state hit rate on the zipf-1.1 stream.

system LFU workload zipf-1.1-static metric hit-rate value 83.76 unit % higher_is_better True hardware any-cpu

from LFU's edge over LRU halves at generous cache capacity (zipfian re-measurement) · evidence: r1 artifacts/results.csv

✓ verified L3 comparison

On a static zipfian stream (s=1.1, catalog 10k, 200k requests, capacity 100), LFU eviction achieves at least 10 percentage points higher hit rate than LRU.

system LFU workload zipf-1.1-static metric hit-rate-delta-pp value 11.81 unit pp higher_is_better True baseline LRU hardware any-cpu

from LFU admission beats LRU by ~12pp hit rate under static zipfian skew · evidence: r1 artifacts/results.csv

✓ verified L3 performance

LFU reaches a 64.5% (±10% rel.) hit rate on the static zipf-1.1 stream at capacity 100, against a 52.7% LRU baseline.

system LFU workload zipf-1.1-static metric hit-rate value 0.6449 unit fraction higher_is_better True baseline LRU baseline_value 0.5268 hardware any-cpu

from LFU admission beats LRU by ~12pp hit rate under static zipfian skew · evidence: r1 artifacts/results.csv

≈ attested L3 negative

This result does NOT carry over to drifting popularity distributions: LFU's frequency counts go stale under non-stationarity, which is the classic motivation for hybrid policies (e.g., TinyLFU with aging). This package only establishes the static case.

system LFU workload non-stationary metric scope-limitation

from LFU admission beats LRU by ~12pp hit rate under static zipfian skew · evidence: artifacts/bench.py
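
The staleness problem can be illustrated with a small sketch in which the popular set changes halfway through the stream. The decay used here is plain periodic halving of all counts. That is an assumption chosen for brevity: it is not TinyLFU, which pairs an approximate frequency sketch with an admission filter. The function name and the `half_life` parameter are hypothetical.

```python
from collections import Counter


def aged_lfu_hit_rate(stream, capacity, half_life=0):
    """LFU hit rate; every `half_life` requests all counts are halved.

    half_life=0 disables aging and gives plain LFU with persistent counts.
    """
    counts = Counter()
    cache = {}  # insertion-ordered, so count ties evict the oldest entry
    hits = 0
    for i, key in enumerate(stream, start=1):
        counts[key] += 1
        if key in cache:
            hits += 1
        else:
            if len(cache) >= capacity:
                victim = min(cache, key=lambda k: counts[k])
                del cache[victim]
            cache[key] = True
        if half_life and i % half_life == 0:
            # Aging step: old popularity fades so new keys can catch up.
            for k in list(counts):
                counts[k] //= 2
    return hits / max(1, len(stream))


if __name__ == "__main__":
    # Keys 0 and 1 are hot for 100 requests, then keys 2 and 3 take over.
    drifting = [0, 1] * 50 + [2, 3] * 50
    print("plain LFU:", aged_lfu_hit_rate(drifting, capacity=2))
    print("aged LFU: ", aged_lfu_hit_rate(drifting, capacity=2, half_life=10))
```

On this toy stream plain LFU keeps one stale hot key pinned and thrashes on the other slot for the whole second phase, while the aged variant recovers after a handful of misses. The stream is deliberately adversarial and says nothing about the size of the effect on realistic drift.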

✓ verified L3 performance

In CPython, bisect-based binary search becomes faster than linear scan for sorted-list membership at list sizes no larger than 32 (measured crossover: n=8).

system binary-search workload membership-mixed-queries metric crossover-n value 8 unit elements higher_is_better False baseline linear-scan hardware any-cpu

from Binary search overtakes linear scan at n≈8 in CPython membership tests · evidence: r1 artifacts/results.csv
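
A crossover measurement of this kind can be sketched as follows. The query mix (half hits, half misses), the use of the `in` operator as the linear scan, and the helper names are assumptions. The package's actual harness is not shown here, and absolute timings depend on the machine and the CPython version.

```python
import bisect
import timeit


def linear_contains(sorted_list, x):
    """Membership by linear scan (the built-in `in` operator on a list)."""
    return x in sorted_list


def bisect_contains(sorted_list, x):
    """Membership by binary search on a sorted list."""
    i = bisect.bisect_left(sorted_list, x)
    return i < len(sorted_list) and sorted_list[i] == x


def time_membership(n, repeats=2000):
    """Return (linear_seconds, bisect_seconds) for one mixed query batch."""
    data = list(range(0, 2 * n, 2))  # n even numbers, already sorted
    queries = list(range(2 * n))     # evens hit, odds miss
    linear = timeit.timeit(
        lambda: [linear_contains(data, q) for q in queries], number=repeats
    )
    binary = timeit.timeit(
        lambda: [bisect_contains(data, q) for q in queries], number=repeats
    )
    return linear, binary


if __name__ == "__main__":
    for n in (2, 4, 8, 16, 32, 1024):
        reps = 2000 if n <= 32 else 20
        linear, binary = time_membership(n, repeats=reps)
        print(f"n={n:5d}  linear/bisect time ratio = {linear / binary:.2f}")
```

A ratio above 1 means binary search was faster at that size. The first n where the ratio stays above 1 is the crossover on the machine running the script.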

✓ verified L3 comparison

At n=1024 binary search is at least 10x faster than linear scan for the same query mix (measured ~45x).

system binary-search workload membership-mixed-queries metric speedup-at-1024 value 45.7 unit x higher_is_better True baseline linear-scan hardware any-cpu

from Binary search overtakes linear scan at n≈8 in CPython membership tests · evidence: r1 artifacts/results.csv

≈ attested L3 negative

Linear scan remains faster for n<=4: interpreter-level constant factors dominate asymptotic complexity at tiny sizes.

system binary-search workload membership-mixed-queries metric small-n-regression

from Binary search overtakes linear scan at n≈8 in CPython membership tests · evidence: artifacts/results.csv

≈ attested L1 performance

In discrete-event simulation at 3x HBM oversubscription, scheduler-lookahead prefetching (K=4) achieves a 100% prefetch hit rate.

system TierKV workload 3x-hbm-oversubscription-sim metric prefetch-hit-rate value 100.0 unit % higher_is_better True baseline reactive-LRU hardware H100-class (simulated)

from TierKV: Prefetch-Aware Memory Tiering for KV Cache in LLM Serving · evidence: paper.pdf

≈ attested L1 comparison

Simulated mean time-per-output-token improves 3.6x over reactive LRU eviction at 3x oversubscription.

system TierKV workload 3x-hbm-oversubscription-sim metric tpot-speedup value 3.6 unit x higher_is_better True baseline reactive-LRU hardware H100-class (simulated)

from TierKV: Prefetch-Aware Memory Tiering for KV Cache in LLM Serving · evidence: paper.pdf

≈ attested L1 comparison

Simulated system throughput improves 2.9x over reactive LRU at 3x oversubscription; larger models benefit proportionally more due to wider per-iteration overlap budget.

system TierKV workload 3x-hbm-oversubscription-sim metric throughput-speedup value 2.9 unit x higher_is_better True baseline reactive-LRU hardware H100-class (simulated)

from TierKV: Prefetch-Aware Memory Tiering for KV Cache in LLM Serving · evidence: paper.pdf

≈ attested L1 performance

On a simulated mixed-GPU cluster, HeteroServe achieves 36,772 tokens/sec, 2.13x the throughput of uniform continuous batching.

system HeteroServe workload mixed-gpu-cluster-sim metric throughput value 36772 unit tokens/s higher_is_better True baseline uniform-scheduling improvement_pct 113.0 hardware H100+A100+L40S (simulated)

from HeteroServe: Capability-Weighted Batch Scheduling for LLM Inference on Heterogeneous GPU Clusters · evidence: paper.pdf

≈ attested L1 comparison

SLO compliance reaches 68.8% vs 26.4% for uniform scheduling (+42.4pp).

system HeteroServe workload mixed-gpu-cluster-sim metric slo-compliance value 68.8 unit % higher_is_better True baseline uniform-scheduling baseline_value 26.4 hardware H100+A100+L40S (simulated)

from HeteroServe: Capability-Weighted Batch Scheduling for LLM Inference on Heterogeneous GPU Clusters · evidence: paper.pdf

≈ attested L1 observation

Ablations attribute the dominant gain to queue-depth feedback rather than capability scoring alone.

system HeteroServe workload mixed-gpu-cluster-sim metric ablation-dominant-factor

from HeteroServe: Capability-Weighted Batch Scheduling for LLM Inference on Heterogeneous GPU Clusters · evidence: paper.pdf

≈ attested L1 performance

At the Medium tier ($0.01/seed), MARCO reaches 0.843 key-point recall vs 0.880 for an unbounded baseline — 96% of the quality at 19% of the cost.

task research-synthesis dataset 50-seed-multimodal-suite metric key-point-recall value 0.843 higher_is_better True model MARCO-medium baseline unbounded-search baseline_value 0.88

from MARCO: Budget-Constrained Multi-Modal Autonomous Research and Compositional Output Synthesis · evidence: paper.pdf

≈ attested L1 comparison

At the High tier, MARCO surpasses unbounded recall (0.925 vs 0.880) at 42% lower cost.

task research-synthesis dataset 50-seed-multimodal-suite metric key-point-recall value 0.925 higher_is_better True model MARCO-high baseline unbounded-search baseline_value 0.88

from MARCO: Budget-Constrained Multi-Modal Autonomous Research and Compositional Output Synthesis · evidence: paper.pdf

≈ attested L1 performance

The multi-modal parser achieves 0.962 entity F1 across 50 seeds spanning five modalities.

task multimodal-parsing dataset 50-seed-multimodal-suite metric entity-f1 value 0.962 higher_is_better True model MARCO-parser

from MARCO: Budget-Constrained Multi-Modal Autonomous Research and Compositional Output Synthesis · evidence: paper.pdf

≈ attested L1 observation

The paper 'The Last Human-Written Paper: Agent-Native Research Artifacts' (arXiv:2604.24658) reports: Scientific publication compresses a branching, iterative research process into a linear narrative, discarding the majority of what was discovered along the way. This compilation imposes two structural costs: a Storytelling Tax, where failed experiments, rejected hypotheses, and the branching exploration process are discarded to fit a linear narrative; and an Engineering Tax, where the gap between reviewer-sufficient prose and agent-sufficient specification leaves critical implementation…

from The Last Human-Written Paper: Agent-Native Research Artifacts · evidence:

≈ attested L1 observation

The paper 'aiXiv: A Next-Generation Open Access Ecosystem for Scientific Discovery Generated by AI Scientists' (arXiv:2508.15126) reports: Recent advances in large language models (LLMs) have enabled AI agents to autonomously generate scientific proposals, conduct experiments, author papers, and perform peer reviews. Yet this flood of AI-generated research content collides with a fragmented and largely closed publication ecosystem.

from aiXiv: A Next-Generation Open Access Ecosystem for Scientific Discovery Generated by AI Scientists · evidence:

≈ attested L1 observation

The paper 'The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search' (arXiv:2504.08066) reports: AI is increasingly playing a pivotal role in transforming how scientific discoveries are made. We introduce The AI Scientist-v2, an end-to-end agentic system capable of producing the first entirely AI generated peer-review-accepted workshop paper.

from The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search · evidence:

≈ attested L1 observation

The paper 'Paper2Agent: Reimagining Research Papers As Interactive and Reliable AI Agents' (arXiv:2509.06917) reports: We introduce Paper2Agent, an automated framework that converts research papers into AI agents. Paper2Agent transforms research output from passive artifacts into active systems that can accelerate downstream use, adoption, and discovery.

from Paper2Agent: Reimagining Research Papers As Interactive and Reliable AI Agents · evidence:

≈ attested L1 observation

The paper 'Kosmos: An AI Scientist for Autonomous Discovery' (arXiv:2511.02824) reports: Data-driven scientific discovery requires iterative cycles of literature search, hypothesis generation, and data analysis. Substantial progress has been made towards AI agents that can automate scientific research, but all such agents remain limited in the number of actions they can take before losing coherence, thus limiting the depth of their findings.

from Kosmos: An AI Scientist for Autonomous Discovery · evidence:

≈ attested L1 observation

The paper 'OmniScientist: Toward a Co-evolving Ecosystem of Human and AI Scientists' (arXiv:2511.16931) reports: With the rapid development of Large Language Models (LLMs), AI agents have demonstrated increasing proficiency in scientific tasks, ranging from hypothesis generation and experimental design to manuscript writing. Such agent systems are commonly referred to as "AI Scientists." However, existing AI Scientists predominantly formulate scientific discovery as a standalone search or optimization problem, overlooking the fact that scientific research is inherently a social and…

from OmniScientist: Toward a Co-evolving Ecosystem of Human and AI Scientists · evidence:

≈ attested L1 observation

The paper 'FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness' (arXiv:2205.14135) reports: Transformers are slow and memory-hungry on long sequences, since the time and memory complexity of self-attention are quadratic in sequence length. Approximate attention methods have attempted to address this problem by trading off model quality to reduce the compute complexity, but often do not achieve wall-clock speedup.

from FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness · evidence:

≈ attested L1 observation

The paper 'FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning' (arXiv:2307.08691) reports: Scaling Transformers to longer sequence lengths has been a major problem in the last several years, promising to improve performance in language modeling and high-resolution image understanding, as well as to unlock new applications in code, audio, and video generation. The attention layer is the main bottleneck in scaling to longer sequences, as its runtime and memory increase quadratically in the sequence length.

from FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning · evidence:

≈ attested L1 observation

The paper 'FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision' (arXiv:2407.08608) reports: Attention, as a core layer of the ubiquitous Transformer architecture, is the bottleneck for large language models and long-context applications. FlashAttention elaborated an approach to speed up attention on GPUs through minimizing memory reads/writes.

from FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision · evidence:

≈ attested L1 observation

The paper 'Efficient Memory Management for Large Language Model Serving with PagedAttention' (arXiv:2309.06180) reports: High throughput serving of large language models (LLMs) requires batching sufficiently many requests at a time. However, existing systems struggle because the key-value cache (KV cache) memory for each request is huge and grows and shrinks dynamically.

from Efficient Memory Management for Large Language Model Serving with PagedAttention · evidence:

≈ attested L1 observation

The paper 'GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints' (arXiv:2305.13245) reports: Multi-query attention (MQA), which only uses a single key-value head, drastically speeds up decoder inference. However, MQA can lead to quality degradation, and moreover it may not be desirable to train a separate model just for faster inference.

from GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints · evidence:

≈ attested L1 observation

The paper 'Fast Transformer Decoding: One Write-Head is All You Need' (arXiv:1911.02150) reports: Multi-head attention layers, as used in the Transformer neural sequence model, are a powerful alternative to RNNs for moving information across and between sequences. While training these layers is generally fast and simple, due to parallelizability across the length of the sequence, incremental inference (where such parallelization is impossible) is often slow, due to the memory-bandwidth cost of repeatedly loading the large "keys" and "values" tensors.

from Fast Transformer Decoding: One Write-Head is All You Need · evidence:

≈ attested L1 observation

The paper 'AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration' (arXiv:2306.00978) reports: Large language models (LLMs) have transformed numerous AI applications. On-device LLM is becoming increasingly important: running LLMs locally on edge devices can reduce the cloud computing cost and protect users' privacy.

from AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration · evidence:
