Now Available in Production Frameworks


SuffixDecoding: Extreme Speculative Decoding
for Emerging AI Applications

NeurIPS 2025 Spotlight πŸ”¦
Snowflake AI Research · Carnegie Mellon University
[Figure: SuffixDecoding Overview]
Overview of SuffixDecoding's algorithm. Two suffix trees track ongoing inference (top-left) and previous outputs (bottom-left). SuffixDecoding uses these trees to find matching patterns based on recently generated tokens. It constructs a speculation tree (middle) by selecting the most likely continuations, scoring them based on frequency statistics. Finally, the best candidate is verified by the LLM in a single forward pass (right), with accepted tokens (shown in green) being added to the output and used for the next round of speculation.

Abstract

Speculative decoding is widely adopted to reduce latency in large language model (LLM) inference by leveraging smaller draft models capable of handling diverse user tasks. However, emerging AI applications, such as LLM-based agents, present unique workload characteristics: instead of diverse independent requests, agentic frameworks typically submit repetitive inference requests, such as multi-agent pipelines performing similar subtasks or self-refinement loops iteratively enhancing outputs. These workloads result in long and highly predictable sequences, which current speculative decoding methods do not effectively exploit.

To address this gap, we introduce SuffixDecoding, a novel method that utilizes efficient suffix trees to cache long token sequences from prompts and previous outputs. By adaptively speculating more tokens when acceptance likelihood is high and fewer when it is low, SuffixDecoding effectively exploits opportunities for longer speculations while conserving computation when those opportunities are limited.

Evaluations on agentic benchmarks, including SWE-Bench and AgenticSQL (a multi-agent SQL generation workflow), demonstrate that SuffixDecoding achieves speedups of up to 5.3Γ—, outperforming state-of-the-art methodsβ€”2.8Γ— faster than model-based approaches like EAGLE-2/3 and 1.9Γ— faster than model-free approaches such as Token Recycling.


Method

Suffix Tree Construction and Pattern Matching

The goal of SuffixDecoding is to enable fast, adaptive speculative decoding over long sequences, particularly suited for agentic applications where repeated inference calls often contain highly predictable and overlapping token sequences. To support fast speculation, SuffixDecoding builds a suffix tree over the tokens in the current and prior requests, and uses the suffix tree to generate speculative tokens.

SuffixDecoding maintains two suffix trees: a global tree for previously generated outputs, and a separate per-request tree for the current ongoing inference request. This avoids the complexity and overhead of synchronizing suffix-tree updates across multiple concurrent requests. The global tree can be constructed offline in O(n) time, while the per-request tree can be efficiently constructed and updated online.
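
To make the structure concrete, here is a minimal Python sketch of a token-level suffix structure with per-node frequency counts. For simplicity it inserts every suffix into a plain trie (O(n²) worst case), whereas the paper's global tree uses an efficient O(n) construction; all class and parameter names here are illustrative, not the actual implementation's API.

```python
class SuffixTreeNode:
    """Trie node keyed by token ID, with a frequency count."""
    __slots__ = ("children", "count")

    def __init__(self):
        self.children = {}  # token_id -> SuffixTreeNode
        self.count = 0      # number of cached suffixes passing through

class SuffixTree:
    """Simplified suffix structure over token sequences (hypothetical API)."""

    def __init__(self, max_depth=64):
        self.root = SuffixTreeNode()
        self.max_depth = max_depth  # bound on insertion depth (assumption)

    def insert(self, tokens):
        """Insert every suffix of `tokens`, up to max_depth tokens deep."""
        for start in range(len(tokens)):
            node = self.root
            for tok in tokens[start:start + self.max_depth]:
                node = node.children.setdefault(tok, SuffixTreeNode())
                node.count += 1

    def match(self, pattern):
        """Walk `pattern` from the root; return (deepest node, match length)."""
        node, matched = self.root, 0
        for tok in pattern:
            child = node.children.get(tok)
            if child is None:
                break
            node, matched = child, matched + 1
        return node, matched
```

In this sketch, the global tree would be populated offline from prior outputs with `insert`, while the per-request tree would call `insert` incrementally as new tokens are generated.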


Adaptive Speculation Strategy

[Figures: Pattern Match Length vs. Accepted Tokens (left); Adaptive MAX_SPEC (right)]

Left: The mean number of accepted tokens increases with the length of the pattern match, which motivates MAX_SPEC = Ξ±p. Right: Using an adaptive MAX_SPEC achieves a better trade-off between acceptance rate and speculative speedup compared to a constant MAX_SPEC.

The key innovation is adaptive speculation: SuffixDecoding dynamically adjusts the number of speculated tokens based on the quality of matches found in the suffix tree. When the method detects strong pattern repetition, it speculates aggressively (up to hundreds of tokens), maximizing parallelism in verification. When patterns are less predictable, it reduces speculation to avoid wasted computation.
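
In code, the adaptive budget from the figure caption above (MAX_SPEC = αp) reduces to a one-line rule. The α value and cap below are illustrative placeholders, not tuned settings from the paper:

```python
def adaptive_max_spec(pattern_match_len, alpha=8, cap=256):
    """Scale the speculation budget with the matched pattern length p.

    Long matches signal strong repetition, so speculate aggressively
    (up to hundreds of tokens); short matches get a small budget to
    avoid wasted verification compute.
    """
    return min(alpha * pattern_match_len, cap)
```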


Speculation Tree Expansion and Scoring

[Figure: Speculation Tree Example]

Example speculation tree containing 66 tokens for the AgenticSQL Extract task. SuffixDecoding builds the speculation tree by greedily adding leaf nodes that it believes to be the most likely to be accepted during verification, based on frequency statistics captured within the suffix trees.

To select a smaller, more likely sub-tree for practical speculative verification, SuffixDecoding starts from a pattern-match node and grows a sub-tree greedily by expanding one leaf node at a time. The method scores each potential expansion with empirical probability estimates and selects the nodes most likely to be accepted.
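
A best-first sketch of this expansion, reusing the `SuffixTree` nodes from the Method sketch above. Scoring a path by the product of empirical conditional frequencies along it is an assumption consistent with the frequency-based scoring described here, not necessarily the paper's exact formula:

```python
import heapq

def build_speculation_tree(match_node, budget=64):
    """Greedily grow a speculation sub-tree, one most-promising leaf at a time.

    Returns a list of (token_path, score) pairs rooted at `match_node`;
    the tree shape is implied by the paths.
    """
    spec = []
    # Max-heap via negated scores; the counter breaks ties deterministically.
    frontier = [(-1.0, 0, (), match_node)]
    counter = 1
    while frontier and len(spec) < budget:
        neg_score, _, path, node = heapq.heappop(frontier)
        score = -neg_score
        if path:  # skip the match node itself
            spec.append((path, score))
        total = sum(child.count for child in node.children.values())
        for tok, child in node.children.items():
            # Empirical conditional probability of this continuation.
            p = score * child.count / total
            heapq.heappush(frontier, (-p, counter, path + (tok,), child))
            counter += 1
    return spec
```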


Results

Performance on Agentic Benchmarks

[Figure: Benchmark Comparison]

Speculative speedups (top) and mean accepted tokens per step (bottom) compared to vanilla decoding for SuffixDecoding and baseline methods on three benchmarks: Spec-Bench, AgenticSQL, and SWE-Bench.

On agentic workloads, SuffixDecoding outperforms all baselines. On AgenticSQL, SuffixDecoding obtains a mean speedup of 5.3Γ— over vanilla decoding, a 2.8Γ— improvement over EAGLE-2/3, and a 1.9Γ— improvement over Token Recycling. On SWE-Bench, SuffixDecoding obtains a mean speedup of 2.5Γ— over vanilla decoding and a 1.7Γ— improvement over Prompt-Lookup Decoding. SuffixDecoding's superior performance can be attributed to its consistently higher mean accepted tokens per decoding step.

The hybrid approach of SuffixDecoding + EAGLE-3 achieves the best of both worlds: we speculate with the faster SuffixDecoding method whenever possible and fall back to EAGLE-3 when the speculation score is too low. The hybrid approach performs well across both agentic and non-agentic workloads.
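
A hedged sketch of the hybrid dispatch, building on the earlier sketches: draft from the suffix tree when the match is strong enough, otherwise defer to the model-based drafter. `eagle_draft` is a stand-in callable, and using the match length as the speculation score is an illustrative simplification of the frequency-based scoring:

```python
def propose_draft(context_tokens, suffix_tree, eagle_draft, tau=5, window=32):
    """Hybrid SuffixDecoding + EAGLE-3 drafting policy (sketch)."""
    best_node, best_len = suffix_tree.root, 0
    # Try suffixes of the recent context; keep the longest full match
    # (the window size is an assumption, not from the paper).
    for start in range(max(0, len(context_tokens) - window), len(context_tokens)):
        suffix = context_tokens[start:]
        node, matched = suffix_tree.match(suffix)
        if matched == len(suffix) and matched > best_len:
            best_node, best_len = node, matched
    if best_len >= tau:
        return build_speculation_tree(best_node)  # fast, model-free path
    return eagle_draft(context_tokens)            # fall back to EAGLE-3
```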


End-to-End SWE-Bench on vLLM

[Figure: vLLM SWE-Bench Results]

End-to-end task-completion time of the OpenHands agent on SWE-Bench Verified, with 8 tasks running concurrently. vLLM is deployed on 4 H100 GPUs with 4-way tensor parallelism and prefix caching enabled.

SuffixDecoding integrates efficiently into vLLM, a popular inference system used in production deployments, and effectively accelerates end-to-end agentic task completion. Decoding (i.e., output generation) accounts for the majority of the runtime across all SWE-Bench tasks, dominating both prefill and agentic actions. In this end-to-end scenario, SuffixDecoding outperforms PLD by 1.3–3Γ—, yielding a 1.8–4.5Γ— speculative speedup over vanilla decoding.


Ablation Studies

[Figures: Per-Request Tree Ablation (top); Global Tree Size vs. Speedup and vs. Acceptance (bottom)]

Top: Speedup factor for the tasks in AgenticSQL using only the global suffix tree, only the per-request suffix tree, and both. Bottom Left & Right: Speedup and acceptance rate vs global suffix tree size for Magicoder and Wildchat.

Ablation studies demonstrate the importance of each component in SuffixDecoding. Using both the global and per-request suffix trees performs better than using just one. The global tree outperforms the per-request tree on most tasks. Additionally, the speedup from SuffixDecoding continues to increase with more previous output examples, while the acceptance rate holds steady, indicating that SuffixDecoding can learn useful patterns even in workloads with lower token repetition.


Frequently Asked Questions

How much memory does SuffixDecoding require?

SuffixDecoding is highly memory efficient. The suffix trees scale linearly with the number of cached tokens:

  • 20K requests (~572M tokens): ~6.15 GB of CPU memory
  • Per-token memory: ~10.75 bytes/token
  • Typical capacity: Can cache 31 days of continuous generation on a standard A100 system (144GB CPU memory) before requiring cache eviction

Lookup and update times remain fast (~4 microseconds per token for updates, ~12 microseconds for lookups) even with large trees.
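
As a quick sanity check, the per-token figure reproduces the headline number:

```python
BYTES_PER_TOKEN = 10.75                    # measured average reported above

def tree_memory_gb(num_tokens):
    """Estimate suffix-tree CPU memory from the bytes-per-token figure."""
    return num_tokens * BYTES_PER_TOKEN / 1e9

print(f"{tree_memory_gb(572e6):.2f} GB")   # ~6.15 GB for ~572M cached tokens
```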

What happens when the input distribution shifts?

SuffixDecoding adapts online to distribution shifts. In our experiments, when switching from WildChat to SpiderSQL (a significant distribution shift), SuffixDecoding initially achieves only a 1.5Γ— speedup. However, it quickly adapts by inserting new outputs into its global suffix tree: after processing just 500 requests from the new distribution, its performance becomes almost indistinguishable from building the tree on that distribution offline. This demonstrates strong online adaptation.

What is AgenticSQL and why does SuffixDecoding work so well on it?

AgenticSQL is a proprietary multi-agent workflow designed for complex SQL generation tasks. It consists of multiple LLM-powered stages working together to convert natural language questions into accurate SQL queries.

[Figure: AgenticSQL Architecture]

The workflow includes several specialized stages:

  • Classify & Extract: Analyzes the user question to extract useful features and metadata
  • Enrich: Uses retrieval-augmented generation (RAG) to supplement the question with relevant context from documentation or schema information
  • SQL Generation (SQL 1...N): Multiple parallel text-to-SQL agents propose different SQL solutions to the question, each with feedback from an error corrector
  • Combine: Synthesizes all proposed SQL candidates into a final optimized SQL query and generates a natural language response

SuffixDecoding achieves exceptional performance on AgenticSQL (up to 10.41Γ— speedup on the Enrich task and 9.85Γ— speedup on Extract) because:

  1. Structured outputs: Each stage generates highly structured responses (JSON for Extract/Classify, SQL queries, etc.) that follow consistent patterns
  2. Repetitive patterns: Similar user questions lead to similar intermediate outputs, creating strong token-level repetition across requests
  3. Schema consistency: Database schemas, table names, and common SQL patterns repeat frequently within and across queries
  4. Multi-stage workflow: Outputs from early stages (like Classify and Extract) are frequently reused in later stages, creating additional opportunities for speculation

This demonstrates SuffixDecoding's strength on production agentic workflows where structured generation and repetitive patterns are common.

How can I predict if SuffixDecoding will work well for my application?

You can measure the "structuredness" of your task using empirical entropy with just 100 example outputs (a code sketch follows the reference values below):

  1. Create a suffix tree from 100 example model outputs
  2. Calculate the entropy of each node's output distribution
  3. Compute a weighted average across all nodes

Lower average entropy indicates more predictable outputs and better SuffixDecoding performance. For reference:

  • AgenticSQL Extract: 0.086 entropy β†’ 9.85Γ— speedup
  • AgenticSQL Enrich: 0.171 entropy β†’ 10.41Γ— speedup
  • SpiderSQL: 2.50 entropy β†’ 2.19Γ— speedup
  • WildChat (open-ended chat): 3.43 entropy β†’ modest speedup

Should I use SuffixDecoding alone or in hybrid mode?

The choice depends on your workload:

  • Highly repetitive agentic tasks: Use SuffixDecoding alone (threshold Ο„ = 0) for best performance. On AgenticSQL, this achieves 5.35Γ— speedup.
  • Mixed or open-ended workloads: Use hybrid mode with threshold Ο„ β‰ˆ mean accepted tokens of your fallback model-based speculator. For example, with EAGLE-3 (MAT β‰ˆ 4.65), setting Ο„ = 5-7 achieves 2.5Γ— speedup on Spec-Bench.

Can SuffixDecoding work with batch serving?

Yes! SuffixDecoding is compatible with batch-level speculation control methods like TurboSpec and AdaServe. By dynamically adjusting the number of speculative tokens per request based on the suffix tree scores, SuffixDecoding can be integrated to maximize batch-wise goodput or meet per-request SLO targets. The statistics-based scoring in SuffixDecoding can help determine which requests in a batch should receive more speculation budget.
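
As one hedged illustration of such an integration, a batch-wide speculation budget could be split across requests in proportion to their suffix-tree scores. The `Request` fields are hypothetical, and TurboSpec and AdaServe each define their own batch-level policies:

```python
from dataclasses import dataclass

@dataclass
class Request:
    id: int
    spec_score: float   # e.g. expected accepted tokens from the suffix tree

def allocate_spec_budget(requests, total_budget):
    """Distribute a speculative-token budget proportionally to request scores."""
    scores = {r.id: max(r.spec_score, 0.0) for r in requests}
    total = sum(scores.values()) or 1.0
    return {rid: int(total_budget * s / total) for rid, s in scores.items()}
```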

How long does it take to build the suffix tree?

Suffix tree construction is fast and only needs to be done once when starting the inference server:

  • 1,000 examples: 0.30 seconds, 137 MB
  • 10,000 examples: 4.82 seconds, 1.4 GB
  • 100,000 examples: 61.95 seconds, 14.7 GB

This is a one-time cost, and the tree can be updated incrementally during serving with minimal overhead.

Does SuffixDecoding preserve the output distribution?

Yes! SuffixDecoding uses standard speculative decoding verification, which is mathematically guaranteed to preserve the exact output distribution of the target LLM. In our end-to-end vLLM experiments on SWE-Bench Verified, SuffixDecoding achieved the same 37.2% task-completion rate as vanilla decoding while providing a 1.8–4.5Γ— speedup.


BibTeX

@inproceedings{oliaro2025suffixdecoding,
  author    = {Gabriele Oliaro and Zhihao Jia and Daniel Campos and Aurick Qiao},
  title     = {SuffixDecoding: Extreme Speculative Decoding for Emerging AI Applications},
  booktitle = {The Thirty-ninth Annual Conference on Neural Information Processing Systems},
  year      = {2025},
  url       = {https://arxiv.org/abs/2411.04975}
}