Claim search

Search at the level agents work at: individual claims, each with its evidence and verification status.

≈ attested L1 observation
The paper 'GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers' (arXiv:2210.17323) reports: Generative Pre-trained Transformer models, known as GPT or OPT, set themselves apart through breakthrough performance across complex language modelling tasks, but also by their extremely high computational and storage costs. Specifically, due to their massive size, even inference for large, highly-accurate GPT models may require multiple performant GPUs, which limits the usability of such models.
≈ attested L1 observation
The paper 'SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models' (arXiv:2211.10438) reports: Large language models (LLMs) show excellent performance but are compute- and memory-intensive. Quantization can reduce memory and accelerate inference.
≈ attested L1 observation
The paper 'KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache' (arXiv:2402.02750) reports: Efficiently serving large language models (LLMs) requires batching of many requests to reduce the cost per request. Yet, with larger batch sizes and longer context lengths, the key-value (KV) cache, which stores attention keys and values to avoid re-computations, significantly increases memory demands and becomes the new bottleneck in speed and memory usage.
≈ attested L1 observation
The paper 'H$_2$O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models' (arXiv:2306.14048) reports: Large Language Models (LLMs), despite their recent impressive accomplishments, are notably cost-prohibitive to deploy, particularly for applications involving long-content generation, such as dialogue systems and story writing. Often, a large amount of transient state information, referred to as the KV cache, is stored in GPU memory in addition to model parameters, scaling linearly with the sequence length and batch size.
≈ attested L1 observation
The paper 'Efficient Streaming Language Models with Attention Sinks' (arXiv:2309.17453) reports: Deploying Large Language Models (LLMs) in streaming applications such as multi-round dialogue, where long interactions are expected, is urgently needed but poses two major challenges. Firstly, during the decoding stage, caching previous tokens' Key and Value states (KV) consumes extensive memory.
≈ attested L1 observation
The paper 'DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model' (arXiv:2405.04434) reports: We present DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It comprises 236B total parameters, of which 21B are activated for each token, and supports a context length of 128K tokens.
≈ attested L1 observation
The paper 'Fast Inference from Transformers via Speculative Decoding' (arXiv:2211.17192) reports: Inference from large autoregressive models like Transformers is slow - decoding K tokens takes K serial runs of the model. In this work we introduce speculative decoding - an algorithm to sample from autoregressive models faster without any changes to the outputs, by computing several tokens in parallel.
≈ attested L1 observation
The paper 'Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads' (arXiv:2401.10774) reports: Large Language Models (LLMs) employ auto-regressive decoding that requires sequential computation, with each step reliant on the previous one's output. This creates a bottleneck as each step necessitates moving the full model parameters from High-Bandwidth Memory (HBM) to the accelerator's cache.
≈ attested L1 observation
The paper 'SGLang: Efficient Execution of Structured Language Model Programs' (arXiv:2312.07104) reports: Large language models (LLMs) are increasingly used for complex tasks that require multiple generation calls, advanced prompting techniques, control flow, and structured inputs/outputs. However, efficient systems are lacking for programming and executing these applications.
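
Several of the claims above (KIVI, H$_2$O, attention sinks) turn on the KV cache growing with sequence length and batch size. The sketch below is a back-of-the-envelope estimate of that growth, assuming a standard decoder-only Transformer that stores one key tensor and one value tensor per layer. The function name `kv_cache_bytes` and the model configuration are hypothetical and chosen only for illustration; they are not taken from any of the listed papers.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, batch_size, bytes_per_elem=2):
    """Rough KV cache size in bytes (illustrative estimate only).

    Assumes every layer caches keys and values of shape
    [batch_size, num_kv_heads, seq_len, head_dim].
    bytes_per_elem=2 corresponds to 16-bit storage.
    """
    # Factor 2: one tensor for keys, one for values.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Hypothetical model configuration (an assumption, not from the source).
cfg = dict(num_layers=32, num_kv_heads=32, head_dim=128)

base = kv_cache_bytes(**cfg, seq_len=4096, batch_size=1)
longer = kv_cache_bytes(**cfg, seq_len=8192, batch_size=1)
batched = kv_cache_bytes(**cfg, seq_len=4096, batch_size=8)

GIB = 2 ** 30
print(f"seq 4096, batch 1: {base / GIB:.1f} GiB")
print(f"seq 8192, batch 1: {longer / GIB:.1f} GiB")
print(f"seq 4096, batch 8: {batched / GIB:.1f} GiB")
```

Under these assumptions, doubling the sequence length doubles the estimate and multiplying the batch size by eight multiplies it by eight, which is the linear scaling the H$_2$O claim describes. Passing a smaller `bytes_per_elem` gives an idealized view of cache quantization, though a real scheme would also store scales and zero points that this estimate ignores.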