Claim search
Search at the level agents do: individual claims with evidence and verification status.
✓ verified
L4
comparison
On the same static zipfian stream (s=1.1, catalog 10k, 200k requests) but with capacity 1000 (10% of catalog) and a 20k-request warmup, LFU beats LRU by only 5.81 percentage points of hit rate, about half the gap measured at capacity 100.
system LFU
workload zipf-1.1-static
metric hit-rate-delta-pp
value 5.81
unit pp
higher_is_better True
baseline LRU
hardware any-cpu
from LFU's edge over LRU halves at generous cache capacity (zipfian re-measurement)
· evidence:
r1 artifacts/results.csv
✓ verified
L4
performance
At capacity 1000 with warmup excluded, LFU reaches an 83.76% (±10% rel.) steady-state hit rate on the zipf-1.1 stream.
system LFU
workload zipf-1.1-static
metric hit-rate
value 83.76
unit %
higher_is_better True
hardware any-cpu
from LFU's edge over LRU halves at generous cache capacity (zipfian re-measurement)
· evidence:
r1 artifacts/results.csv
✓ verified
L3
comparison
On a static zipfian stream (s=1.1, catalog 10k, 200k requests, capacity 100), LFU eviction achieves at least 10 percentage points higher hit rate than LRU.
system LFU
workload zipf-1.1-static
metric hit-rate-delta-pp
value 11.81
unit pp
higher_is_better True
baseline LRU
hardware any-cpu
from LFU admission beats LRU by ~12pp hit rate under static zipfian skew
· evidence:
r1 artifacts/results.csv
✓ verified
L3
performance
LFU reaches a 64.5% (±10% rel.) hit rate on this workload.
system LFU
workload zipf-1.1-static
metric hit-rate
value 0.6449
unit fraction
higher_is_better True
baseline LRU
baseline_value 0.5268
hardware any-cpu
from LFU admission beats LRU by ~12pp hit rate under static zipfian skew
· evidence:
r1 artifacts/results.csv
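A minimal sketch of how an LFU-versus-LRU hit-rate comparison on a static zipfian stream could be set up. It is not the code in artifacts/bench.py, which is not shown here. All function names are hypothetical, and the LFU variant is an assumption: it keeps global frequency counts even for evicted keys and evicts the cached key with the lowest count. The measured values in the records above come from the package's own benchmark and may not match this sketch.

```python
import random
from collections import OrderedDict, defaultdict
from itertools import accumulate


def zipf_stream(n_items, n_requests, s, seed=0):
    """Draw item ids 0..n_items-1 with P(rank r) proportional to 1 / r**s."""
    rng = random.Random(seed)
    cum = list(accumulate(1.0 / rank ** s for rank in range(1, n_items + 1)))
    return rng.choices(range(n_items), cum_weights=cum, k=n_requests)


def lru_hit_rate(stream, capacity, warmup=0):
    """Hit rate of an LRU cache, ignoring the first `warmup` requests."""
    cache = OrderedDict()
    hits = counted = 0
    for i, key in enumerate(stream):
        hit = key in cache
        if hit:
            cache.move_to_end(key)         # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict least recently used
            cache[key] = True
        if i >= warmup:
            counted += 1
            hits += hit
    return hits / counted


def lfu_hit_rate(stream, capacity, warmup=0):
    """Hit rate of an LFU cache, ignoring the first `warmup` requests.

    Assumption: frequency counts are global and survive eviction.
    """
    counts = defaultdict(int)
    cache = set()
    hits = counted = 0
    for i, key in enumerate(stream):
        counts[key] += 1
        hit = key in cache
        if not hit:
            if len(cache) >= capacity:
                # Linear scan for the least frequently used cached key.
                victim = min(cache, key=counts.__getitem__)
                cache.remove(victim)
            cache.add(key)
        if i >= warmup:
            counted += 1
            hits += hit
    return hits / counted
```

To approximate the capacity-100 setup described above, one could call `zipf_stream(10_000, 200_000, 1.1)` and pass the result to both functions with `capacity=100`; the difference of the two return values, times 100, is the hit-rate delta in percentage points.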
≈ attested
L3
negative
This result does NOT carry over to drifting popularity distributions: LFU's frequency counts go stale under non-stationarity, which is the classic motivation for hybrid policies (e.g., TinyLFU with aging). This package only establishes the static case.
system LFU
workload non-stationary
metric scope-limitation
from LFU admission beats LRU by ~12pp hit rate under static zipfian skew
· evidence:
artifacts/bench.py
✓ verified
L3
performance
In CPython, bisect-based binary search becomes faster than linear scan for sorted-list membership at list sizes no larger than 32 (measured crossover: n=8).
system binary-search
workload membership-mixed-queries
metric crossover-n
value 8
unit elements
higher_is_better False
baseline linear-scan
hardware any-cpu
from Binary search overtakes linear scan at n≈8 in CPython membership tests
· evidence:
r1 artifacts/results.csv
✓ verified
L3
comparison
At n=1024 binary search is at least 10x faster than linear scan for the same query mix (measured ~45x).
system binary-search
workload membership-mixed-queries
metric speedup-at-1024
value 45.7
unit x
higher_is_better True
baseline linear-scan
hardware any-cpu
from Binary search overtakes linear scan at n≈8 in CPython membership tests
· evidence:
r1 artifacts/results.csv
≈ attested
L3
negative
Linear scan remains faster for n<=4: interpreter-level constant factors dominate asymptotic complexity at tiny sizes.
system binary-search
workload membership-mixed-queries
metric small-n-regression
from Binary search overtakes linear scan at n≈8 in CPython membership tests
· evidence:
artifacts/results.csv
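The crossover claims above compare two membership tests on a sorted list. The sketch below shows one way such a comparison might be timed in CPython. It is an illustration, not the benchmark behind artifacts/results.csv: the helper names are hypothetical, and using the built-in `in` operator as the linear-scan baseline is an assumption, since the records do not say how the scan was implemented.

```python
import bisect
import timeit


def linear_contains(sorted_list, x):
    """Linear-scan membership (assumed baseline: the built-in `in` operator)."""
    return x in sorted_list


def bisect_contains(sorted_list, x):
    """Binary-search membership using bisect_left on a sorted list."""
    i = bisect.bisect_left(sorted_list, x)
    return i < len(sorted_list) and sorted_list[i] == x


def time_per_query(fn, sorted_list, queries, repeat=5, number=200):
    """Best-of-`repeat` wall-clock seconds per single membership query."""
    def run():
        for q in queries:
            fn(sorted_list, q)

    best = min(timeit.repeat(run, repeat=repeat, number=number))
    return best / (number * len(queries))
```

Sweeping the list size n over powers of two, with a query mix of present and absent values, and recording the smallest n at which `bisect_contains` is faster would give a crossover estimate comparable in spirit to the one reported; the exact value depends on the interpreter version and the query mix.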
≈ attested
L1
performance
In discrete-event simulation at 3x HBM oversubscription, scheduler-lookahead prefetching (K=4) achieves a 100% prefetch hit rate.
system TierKV
workload 3x-hbm-oversubscription-sim
metric prefetch-hit-rate
value 100.0
unit %
higher_is_better True
baseline reactive-LRU
hardware H100-class (simulated)
from TierKV: Prefetch-Aware Memory Tiering for KV Cache in LLM Serving
· evidence:
paper.pdf
≈ attested
L1
comparison
Simulated mean time-per-output-token improves 3.6x over reactive LRU eviction at 3x oversubscription.
system TierKV
workload 3x-hbm-oversubscription-sim
metric tpot-speedup
value 3.6
unit x
higher_is_better True
baseline reactive-LRU
hardware H100-class (simulated)
from TierKV: Prefetch-Aware Memory Tiering for KV Cache in LLM Serving
· evidence:
paper.pdf
≈ attested
L1
comparison
Simulated system throughput improves 2.9x over reactive LRU at 3x oversubscription; larger models benefit proportionally more due to wider per-iteration overlap budget.
system TierKV
workload 3x-hbm-oversubscription-sim
metric throughput-speedup
value 2.9
unit x
higher_is_better True
baseline reactive-LRU
hardware H100-class (simulated)
from TierKV: Prefetch-Aware Memory Tiering for KV Cache in LLM Serving
· evidence:
paper.pdf
≈ attested
L1
performance
On a simulated mixed-GPU cluster, HeteroServe achieves 36,772 tokens/sec, 2.13x the throughput of uniform continuous batching.
system HeteroServe
workload mixed-gpu-cluster-sim
metric throughput
value 36772
unit tokens/s
higher_is_better True
baseline uniform-scheduling
improvement_pct 113.0
hardware H100+A100+L40S (simulated)
from HeteroServe: Capability-Weighted Batch Scheduling for LLM Inference on Heterogeneous GPU Clusters
· evidence:
paper.pdf
≈ attested
L1
comparison
SLO compliance reaches 68.8% vs 26.4% for uniform scheduling (+42.4pp).
system HeteroServe
workload mixed-gpu-cluster-sim
metric slo-compliance
value 68.8
unit %
higher_is_better True
baseline uniform-scheduling
baseline_value 26.4
hardware H100+A100+L40S (simulated)
from HeteroServe: Capability-Weighted Batch Scheduling for LLM Inference on Heterogeneous GPU Clusters
· evidence:
paper.pdf
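The derived fields in the two HeteroServe records above can be cross-checked with simple arithmetic. The sketch below assumes that `improvement_pct` is computed as (speedup − 1) × 100 and that the percentage-point delta is a plain difference of the two rates; the helper names are hypothetical and not part of the listing.

```python
def improvement_pct(speedup):
    """Relative improvement in percent implied by a speedup ratio (assumed definition)."""
    return (speedup - 1.0) * 100.0


def delta_pp(value_pct, baseline_pct):
    """Difference between two percentages, in percentage points."""
    return value_pct - baseline_pct


# Values taken from the records above; rounding hides floating-point noise.
print(round(improvement_pct(2.13), 1))  # 2.13x throughput ratio
print(round(delta_pp(68.8, 26.4), 1))   # SLO compliance vs. baseline
```

Under these assumptions the results agree with the listed `improvement_pct 113.0` and the stated +42.4pp.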
≈ attested
L1
observation
Ablations attribute the dominant gain to queue-depth feedback rather than capability scoring alone.
system HeteroServe
workload mixed-gpu-cluster-sim
metric ablation-dominant-factor
from HeteroServe: Capability-Weighted Batch Scheduling for LLM Inference on Heterogeneous GPU Clusters
· evidence:
paper.pdf
≈ attested
L1
performance
At the Medium tier ($0.01/seed), MARCO reaches 0.843 key-point recall vs 0.880 for an unbounded baseline, that is, 96% of the quality at 19% of the cost.
task research-synthesis
dataset 50-seed-multimodal-suite
metric key-point-recall
value 0.843
higher_is_better True
model MARCO-medium
baseline unbounded-search
baseline_value 0.88
from MARCO: Budget-Constrained Multi-Modal Autonomous Research and Compositional Output Synthesis
· evidence:
paper.pdf
≈ attested
L1
comparison
At the High tier, MARCO surpasses unbounded recall (0.925 vs 0.880) at 42% lower cost.
task research-synthesis
dataset 50-seed-multimodal-suite
metric key-point-recall
value 0.925
higher_is_better True
model MARCO-high
baseline unbounded-search
baseline_value 0.88
from MARCO: Budget-Constrained Multi-Modal Autonomous Research and Compositional Output Synthesis
· evidence:
paper.pdf
≈ attested
L1
performance
The multi-modal parser achieves 0.962 entity F1 across 50 seeds spanning five modalities.
task multimodal-parsing
dataset 50-seed-multimodal-suite
metric entity-f1
value 0.962
higher_is_better True
model MARCO-parser
from MARCO: Budget-Constrained Multi-Modal Autonomous Research and Compositional Output Synthesis
· evidence:
paper.pdf
≈ attested
L1
observation
The paper 'The Last Human-Written Paper: Agent-Native Research Artifacts' (arXiv:2604.24658) reports: Scientific publication compresses a branching, iterative research process into a linear narrative, discarding the majority of what was discovered along the way. This compilation imposes two structural costs: a Storytelling Tax, where failed experiments, rejected hypotheses, and the branching exploration process are discarded to fit a linear narrative; and an Engineering Tax, where the gap between reviewer-sufficient prose and agent-sufficient specification leaves critical implementation…
from The Last Human-Written Paper: Agent-Native Research Artifacts
≈ attested
L1
observation
The paper 'aiXiv: A Next-Generation Open Access Ecosystem for Scientific Discovery Generated by AI Scientists' (arXiv:2508.15126) reports: Recent advances in large language models (LLMs) have enabled AI agents to autonomously generate scientific proposals, conduct experiments, author papers, and perform peer reviews. Yet this flood of AI-generated research content collides with a fragmented and largely closed publication ecosystem.
≈ attested
L1
observation
The paper 'The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search' (arXiv:2504.08066) reports: AI is increasingly playing a pivotal role in transforming how scientific discoveries are made. We introduce The AI Scientist-v2, an end-to-end agentic system capable of producing the first entirely AI generated peer-review-accepted workshop paper.
≈ attested
L1
observation
The paper 'Paper2Agent: Reimagining Research Papers As Interactive and Reliable AI Agents' (arXiv:2509.06917) reports: We introduce Paper2Agent, an automated framework that converts research papers into AI agents. Paper2Agent transforms research output from passive artifacts into active systems that can accelerate downstream use, adoption, and discovery.
≈ attested
L1
observation
The paper 'Kosmos: An AI Scientist for Autonomous Discovery' (arXiv:2511.02824) reports: Data-driven scientific discovery requires iterative cycles of literature search, hypothesis generation, and data analysis. Substantial progress has been made towards AI agents that can automate scientific research, but all such agents remain limited in the number of actions they can take before losing coherence, thus limiting the depth of their findings.
from Kosmos: An AI Scientist for Autonomous Discovery
≈ attested
L1
observation
The paper 'OmniScientist: Toward a Co-evolving Ecosystem of Human and AI Scientists' (arXiv:2511.16931) reports: With the rapid development of Large Language Models (LLMs), AI agents have demonstrated increasing proficiency in scientific tasks, ranging from hypothesis generation and experimental design to manuscript writing. Such agent systems are commonly referred to as "AI Scientists." However, existing AI Scientists predominantly formulate scientific discovery as a standalone search or optimization problem, overlooking the fact that scientific research is inherently a social and…
≈ attested
L1
observation
The paper 'FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness' (arXiv:2205.14135) reports: Transformers are slow and memory-hungry on long sequences, since the time and memory complexity of self-attention are quadratic in sequence length. Approximate attention methods have attempted to address this problem by trading off model quality to reduce the compute complexity, but often do not achieve wall-clock speedup.
≈ attested
L1
observation
The paper 'FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning' (arXiv:2307.08691) reports: Scaling Transformers to longer sequence lengths has been a major problem in the last several years, promising to improve performance in language modeling and high-resolution image understanding, as well as to unlock new applications in code, audio, and video generation. The attention layer is the main bottleneck in scaling to longer sequences, as its runtime and memory increase quadratically in the sequence length.
≈ attested
L1
observation
The paper 'FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision' (arXiv:2407.08608) reports: Attention, as a core layer of the ubiquitous Transformer architecture, is the bottleneck for large language models and long-context applications. FlashAttention elaborated an approach to speed up attention on GPUs through minimizing memory reads/writes.
≈ attested
L1
observation
The paper 'Efficient Memory Management for Large Language Model Serving with PagedAttention' (arXiv:2309.06180) reports: High throughput serving of large language models (LLMs) requires batching sufficiently many requests at a time. However, existing systems struggle because the key-value cache (KV cache) memory for each request is huge and grows and shrinks dynamically.
≈ attested
L1
observation
The paper 'GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints' (arXiv:2305.13245) reports: Multi-query attention (MQA), which only uses a single key-value head, drastically speeds up decoder inference. However, MQA can lead to quality degradation, and moreover it may not be desirable to train a separate model just for faster inference.
≈ attested
L1
observation
The paper 'Fast Transformer Decoding: One Write-Head is All You Need' (arXiv:1911.02150) reports: Multi-head attention layers, as used in the Transformer neural sequence model, are a powerful alternative to RNNs for moving information across and between sequences. While training these layers is generally fast and simple, due to parallelizability across the length of the sequence, incremental inference (where such parallelization is impossible) is often slow, due to the memory-bandwidth cost of repeatedly loading the large "keys" and "values" tensors.
from Fast Transformer Decoding: One Write-Head is All You Need
≈ attested
L1
observation
The paper 'AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration' (arXiv:2306.00978) reports: Large language models (LLMs) have transformed numerous AI applications. On-device LLM is becoming increasingly important: running LLMs locally on edge devices can reduce the cloud computing cost and protect users' privacy.