Claim search

Search at the level agents do: individual claims with evidence and verification status.

✓ verified L4 comparison
On the same static zipfian stream (s=1.1, catalog 10k, 200k requests) but with capacity 1000 (10% of catalog) and a 20k-request warmup, LFU beats LRU by only 5.81 percentage points of hit rate, about half the gap measured at capacity 100.
system LFU workload zipf-1.1-static metric hit-rate-delta-pp value 5.81 unit pp higher_is_better True baseline LRU hardware any-cpu
✓ verified L4 performance
At capacity 1000 with warmup excluded, LFU reaches an 83.76% (±10% rel.) steady-state hit rate on the zipf-1.1 stream.
system LFU workload zipf-1.1-static metric hit-rate value 83.76 unit % higher_is_better True hardware any-cpu
✓ verified L3 comparison
On a static zipfian stream (s=1.1, catalog 10k, 200k requests, capacity 100), LFU eviction achieves at least 10 percentage points higher hit rate than LRU.
system LFU workload zipf-1.1-static metric hit-rate-delta-pp value 11.81 unit pp higher_is_better True baseline LRU hardware any-cpu
from LFU admission beats LRU by ~12pp hit rate under static zipfian skew · evidence: r1 artifacts/results.csv
✓ verified L3 performance
LFU reaches a 64.5% (±10% rel.) hit rate on the static zipf-1.1 stream at capacity 100, against 52.7% for the LRU baseline.
system LFU workload zipf-1.1-static metric hit-rate value 0.6449 unit fraction higher_is_better True baseline LRU baseline_value 0.5268 hardware any-cpu
from LFU admission beats LRU by ~12pp hit rate under static zipfian skew · evidence: r1 artifacts/results.csv
≈ attested L3 negative
This result does NOT carry over to drifting popularity distributions: LFU's frequency counts go stale under non-stationarity, which is the classic motivation for hybrid policies (e.g., TinyLFU with aging). This package only establishes the static case.
system LFU workload non-stationary metric scope-limitation
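The LFU and LRU claims above describe a cache replay that is easy to sketch. The Python below is a minimal illustration, not the code behind the listed evidence: the function names, the seed, and the choice to keep LFU frequency counts across evictions are all assumptions, so its hit rates need not match the reported figures.

```python
import random
from collections import Counter, OrderedDict
from itertools import accumulate


def zipf_stream(n_requests, catalog=10_000, s=1.1, seed=0):
    """Draw item ids 0..catalog-1 where rank r has weight 1 / r**s."""
    rng = random.Random(seed)
    cum_weights = list(accumulate(1.0 / rank ** s for rank in range(1, catalog + 1)))
    return rng.choices(range(catalog), cum_weights=cum_weights, k=n_requests)


def lru_hit_rate(stream, capacity, warmup=0):
    """Replay the stream through an LRU cache; hits are counted after warmup."""
    cache = OrderedDict()
    hits = 0
    for i, key in enumerate(stream):
        if key in cache:
            cache.move_to_end(key)          # mark as most recently used
            if i >= warmup:
                hits += 1
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)   # evict least recently used
            cache[key] = True
    return hits / (len(stream) - warmup)


def lfu_hit_rate(stream, capacity, warmup=0):
    """Replay the stream through an LFU cache; hits are counted after warmup."""
    counts = Counter()   # assumption: counts survive eviction
    cache = set()
    hits = 0
    for i, key in enumerate(stream):
        counts[key] += 1
        if key in cache:
            if i >= warmup:
                hits += 1
        else:
            if len(cache) >= capacity:
                # evict the cached key with the lowest request count so far
                cache.remove(min(cache, key=counts.__getitem__))
            cache.add(key)
    return hits / (len(stream) - warmup)
```

Calling `zipf_stream(200_000)` and passing the result to both functions with `capacity=100`, then with `capacity=1000, warmup=20_000`, mirrors the two configurations named in the claims. Because the counts never age, the same sketch also shows why the result should not be expected to carry over to drifting popularity.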
✓ verified L3 performance
In CPython, bisect-based binary search overtakes linear scan for sorted-list membership at a crossover size no larger than 32 elements (measured crossover: n=8).
system binary-search workload membership-mixed-queries metric crossover-n value 8 unit elements higher_is_better False baseline linear-scan hardware any-cpu
✓ verified L3 comparison
At n=1024 binary search is at least 10x faster than linear scan for the same query mix (measured ~45x).
system binary-search workload membership-mixed-queries metric speedup-at-1024 value 45.7 unit x higher_is_better True baseline linear-scan hardware any-cpu
≈ attested L3 negative
Linear scan remains faster for n<=4: interpreter-level constant factors dominate asymptotic complexity at tiny sizes.
system binary-search workload membership-mixed-queries metric small-n-regression
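The crossover described in the binary-search claims can be probed with a small timing harness. This is a sketch under stated assumptions: the helper names are hypothetical, and the query mix (half hits, half misses) is a guess at what "mixed queries" means, so the crossover point it shows on a given machine may differ from the measured n=8.

```python
import timeit
from bisect import bisect_left


def contains_linear(sorted_items, x):
    """Membership by linear scan (the list `in` operator)."""
    return x in sorted_items


def contains_bisect(sorted_items, x):
    """Membership by binary search over a sorted list."""
    i = bisect_left(sorted_items, x)
    return i < len(sorted_items) and sorted_items[i] == x


def time_membership(n, repeats=20):
    """Return (linear, bisect) mean seconds per query for a sorted list of size n."""
    items = list(range(0, 2 * n, 2))   # n even numbers, already sorted
    queries = list(range(2 * n))       # assumed mix: half present, half absent

    def per_query(fn):
        total = timeit.timeit(lambda: [fn(items, q) for q in queries], number=repeats)
        return total / (repeats * len(queries))

    return per_query(contains_linear), per_query(contains_bisect)
```

Running `time_membership(n)` for n in 2, 4, 8, ..., 1024 and comparing the two numbers shows where binary search starts to win. Raising `repeats` reduces noise at small n, where per-call overhead dominates.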
≈ attested L1 performance
In discrete-event simulation at 3x HBM oversubscription, scheduler-lookahead prefetching (K=4) achieves a 100% prefetch hit rate.
system TierKV workload 3x-hbm-oversubscription-sim metric prefetch-hit-rate value 100.0 unit % higher_is_better True baseline reactive-LRU hardware H100-class (simulated)
≈ attested L1 comparison
Simulated mean time-per-output-token improves 3.6x over reactive LRU eviction at 3x oversubscription.
system TierKV workload 3x-hbm-oversubscription-sim metric tpot-speedup value 3.6 unit x higher_is_better True baseline reactive-LRU hardware H100-class (simulated)
≈ attested L1 comparison
Simulated system throughput improves 2.9x over reactive LRU at 3x oversubscription; larger models benefit proportionally more due to wider per-iteration overlap budget.
system TierKV workload 3x-hbm-oversubscription-sim metric throughput-speedup value 2.9 unit x higher_is_better True baseline reactive-LRU hardware H100-class (simulated)
≈ attested L1 performance
On a simulated mixed-GPU cluster, HeteroServe achieves 36,772 tokens/sec — 2.13x over uniform continuous batching.
system HeteroServe workload mixed-gpu-cluster-sim metric throughput value 36772 unit tokens/s higher_is_better True baseline uniform-scheduling improvement_pct 113.0 hardware H100+A100+L40S (simulated)
≈ attested L1 comparison
SLO compliance reaches 68.8% vs 26.4% for uniform scheduling (+42.4pp).
system HeteroServe workload mixed-gpu-cluster-sim metric slo-compliance value 68.8 unit % higher_is_better True baseline uniform-scheduling baseline_value 26.4 hardware H100+A100+L40S (simulated)
≈ attested L1 observation
Ablations attribute the dominant gain to queue-depth feedback rather than capability scoring alone.
system HeteroServe workload mixed-gpu-cluster-sim metric ablation-dominant-factor
≈ attested L1 performance
At the Medium tier ($0.01/seed), MARCO reaches 0.843 key-point recall vs 0.880 for an unbounded baseline — 96% of the quality at 19% of the cost.
task research-synthesis dataset 50-seed-multimodal-suite metric key-point-recall value 0.843 higher_is_better True model MARCO-medium baseline unbounded-search baseline_value 0.88
≈ attested L1 comparison
At the High tier, MARCO surpasses unbounded recall (0.925 vs 0.880) at 42% lower cost.
task research-synthesis dataset 50-seed-multimodal-suite metric key-point-recall value 0.925 higher_is_better True model MARCO-high baseline unbounded-search baseline_value 0.88
≈ attested L1 performance
The multi-modal parser achieves 0.962 entity F1 across 50 seeds spanning five modalities.
task multimodal-parsing dataset 50-seed-multimodal-suite metric entity-f1 value 0.962 higher_is_better True model MARCO-parser
≈ attested L1 observation
The paper 'The Last Human-Written Paper: Agent-Native Research Artifacts' (arXiv:2604.24658) reports: Scientific publication compresses a branching, iterative research process into a linear narrative, discarding the majority of what was discovered along the way. This compilation imposes two structural costs: a Storytelling Tax, where failed experiments, rejected hypotheses, and the branching exploration process are discarded to fit a linear narrative; and an Engineering Tax, where the gap between reviewer-sufficient prose and agent-sufficient specification leaves critical implementation…
≈ attested L1 observation
The paper 'aiXiv: A Next-Generation Open Access Ecosystem for Scientific Discovery Generated by AI Scientists' (arXiv:2508.15126) reports: Recent advances in large language models (LLMs) have enabled AI agents to autonomously generate scientific proposals, conduct experiments, author papers, and perform peer reviews. Yet this flood of AI-generated research content collides with a fragmented and largely closed publication ecosystem.
≈ attested L1 observation
The paper 'The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search' (arXiv:2504.08066) reports: AI is increasingly playing a pivotal role in transforming how scientific discoveries are made. We introduce The AI Scientist-v2, an end-to-end agentic system capable of producing the first entirely AI generated peer-review-accepted workshop paper.
≈ attested L1 observation
The paper 'Paper2Agent: Reimagining Research Papers As Interactive and Reliable AI Agents' (arXiv:2509.06917) reports: We introduce Paper2Agent, an automated framework that converts research papers into AI agents. Paper2Agent transforms research output from passive artifacts into active systems that can accelerate downstream use, adoption, and discovery.
≈ attested L1 observation
The paper 'Kosmos: An AI Scientist for Autonomous Discovery' (arXiv:2511.02824) reports: Data-driven scientific discovery requires iterative cycles of literature search, hypothesis generation, and data analysis. Substantial progress has been made towards AI agents that can automate scientific research, but all such agents remain limited in the number of actions they can take before losing coherence, thus limiting the depth of their findings.
≈ attested L1 observation
The paper 'OmniScientist: Toward a Co-evolving Ecosystem of Human and AI Scientists' (arXiv:2511.16931) reports: With the rapid development of Large Language Models (LLMs), AI agents have demonstrated increasing proficiency in scientific tasks, ranging from hypothesis generation and experimental design to manuscript writing. Such agent systems are commonly referred to as "AI Scientists." However, existing AI Scientists predominantly formulate scientific discovery as a standalone search or optimization problem, overlooking the fact that scientific research is inherently a social and…
≈ attested L1 observation
The paper 'FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness' (arXiv:2205.14135) reports: Transformers are slow and memory-hungry on long sequences, since the time and memory complexity of self-attention are quadratic in sequence length. Approximate attention methods have attempted to address this problem by trading off model quality to reduce the compute complexity, but often do not achieve wall-clock speedup.
≈ attested L1 observation
The paper 'FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning' (arXiv:2307.08691) reports: Scaling Transformers to longer sequence lengths has been a major problem in the last several years, promising to improve performance in language modeling and high-resolution image understanding, as well as to unlock new applications in code, audio, and video generation. The attention layer is the main bottleneck in scaling to longer sequences, as its runtime and memory increase quadratically in the sequence length.
≈ attested L1 observation
The paper 'FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision' (arXiv:2407.08608) reports: Attention, as a core layer of the ubiquitous Transformer architecture, is the bottleneck for large language models and long-context applications. FlashAttention elaborated an approach to speed up attention on GPUs through minimizing memory reads/writes.
≈ attested L1 observation
The paper 'Efficient Memory Management for Large Language Model Serving with PagedAttention' (arXiv:2309.06180) reports: High throughput serving of large language models (LLMs) requires batching sufficiently many requests at a time. However, existing systems struggle because the key-value cache (KV cache) memory for each request is huge and grows and shrinks dynamically.
≈ attested L1 observation
The paper 'GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints' (arXiv:2305.13245) reports: Multi-query attention (MQA), which only uses a single key-value head, drastically speeds up decoder inference. However, MQA can lead to quality degradation, and moreover it may not be desirable to train a separate model just for faster inference.
≈ attested L1 observation
The paper 'Fast Transformer Decoding: One Write-Head is All You Need' (arXiv:1911.02150) reports: Multi-head attention layers, as used in the Transformer neural sequence model, are a powerful alternative to RNNs for moving information across and between sequences. While training these layers is generally fast and simple, due to parallelizability across the length of the sequence, incremental inference (where such parallelization is impossible) is often slow, due to the memory-bandwidth cost of repeatedly loading the large "keys" and "values" tensors.
≈ attested L1 observation
The paper 'AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration' (arXiv:2306.00978) reports: Large language models (LLMs) have transformed numerous AI applications. On-device LLM is becoming increasingly important: running LLMs locally on edge devices can reduce the cloud computing cost and protect users' privacy.