From Draft to Target: Optimizing Language Model Decoding with Speculative Sampling

Introduction

In the realm of machine learning, large language models have transformed our capabilities. However, decoding these behemoths efficiently remains a challenge. Enter Speculative Sampling, a technique that speeds up decoding without changing the distribution of the text the model produces.

The Problem with Traditional Decoding

Transformers have revolutionized the field of natural language processing, but with great power come great computational demands. Traditional auto-regressive decoding produces one token per forward pass of the full model, so latency grows with both output length and model size, a serious barrier to real-time applications.

What is Speculative Sampling?

Speculative Sampling isn't just a fancy term; it's a direct answer to the problem above. The technique uses a small "draft model" to generate preliminary token predictions cheaply. These predictions are then scored and validated by the more powerful "target model" in a single parallel forward pass, so several tokens can be produced per expensive target call while the output distribution remains that of the target model.

The Role of the Draft Model

The draft model in Speculative Sampling acts as the rapid-response unit. It quickly produces token predictions which are then assessed by the target model. But how do we choose this draft model? The paper suggests several strategies, ranging from using a scaled-down version of the target model to incorporating draft generation capabilities directly into the target model itself.

Choice of Draft Models

The paper explores multiple strategies for creating or choosing the draft model:

  1. Incorporating draft generation directly into the target model, for example by training it from the start to predict several tokens in parallel (as in blockwise parallel decoding).
  2. Using sequence-level distillation to train a separate, smaller draft model on the target model's outputs.
  3. Simply using a smaller model from the same family, trained on the same data with the same tokenizer. This is the route taken in the paper, which pairs the 70B-parameter Chinchilla target with a 4-billion-parameter draft model.

Speculative Sampling in Action: Algorithm 2

The magic of Speculative Sampling is best understood through its algorithm. Here's a breakdown:

  1. The process starts with a lookahead \( K \) (the number of draft tokens generated per iteration) and a minimum target sequence length \( T \).
  2. Using the draft model, a short draft of \( K \) tokens is generated auto-regressively.
  3. The draft is then scored by the target model in a single parallel forward pass, giving the target's distribution at each of the \( K + 1 \) positions.
  4. A modified rejection sampling scheme is applied to each draft token in turn. A random number \( r \) is sampled uniformly from \( [0, 1] \). If \( r < \min\!\left(1, \frac{q(x)}{p(x)}\right) \), where \( q(x) \) is the target model's probability of the draft token and \( p(x) \) is the draft model's, the token is accepted. Otherwise, a replacement token is sampled from the normalized positive difference of the two distributions, \( (q - p)_+ \), and the remaining draft tokens are discarded.
  5. If all \( K \) draft tokens are accepted, one extra token is sampled from the target model's distribution at the final position. The loop then repeats from step 2 until at least \( T \) tokens have been generated.
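
To make the loop concrete, here is a minimal NumPy sketch of one speculative step following the scheme above. The function name, the toy interface (pre-computed arrays of draft and target distributions), and the variable names are illustrative assumptions, not code from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def speculative_step(draft_probs, target_probs, draft_tokens):
    """One speculative step (illustrative toy interface, not the paper's code).

    draft_probs:  array (K, V), draft-model distribution p at each draft position
    target_probs: array (K + 1, V), target-model distribution q at the same
                  positions plus one extra final position
    draft_tokens: array (K,), tokens already sampled from the draft model
    Returns the tokens emitted by this step.
    """
    K, V = draft_probs.shape
    emitted = []
    for t in range(K):
        x = draft_tokens[t]
        p, q = draft_probs[t], target_probs[t]
        # Accept the draft token with probability min(1, q(x) / p(x)).
        if rng.uniform() < min(1.0, q[x] / p[x]):
            emitted.append(x)
            continue
        # Rejected: resample from the normalized residual (q - p)+ and stop.
        residual = np.maximum(q - p, 0.0)
        emitted.append(rng.choice(V, p=residual / residual.sum()))
        return emitted
    # All K draft tokens accepted: draw one extra token from the target's
    # distribution at the final position, so each step yields 1 to K+1 tokens.
    emitted.append(rng.choice(V, p=target_probs[K]))
    return emitted
```

Note the key property: one call to the expensive target model validates up to \( K + 1 \) tokens, while the \( K \) extra forward passes are made by the much cheaper draft model.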

Delving Deeper: Why the Difference in Rejection Sampling?

One might wonder: why sample from the difference between the two distributions in the rejection scheme? The reason is exactness. Accepting a draft token \( x \sim p \) with probability \( \min(1, q(x)/p(x)) \) keeps exactly \( \min(p(x), q(x)) \) of the probability mass on each token, so the mass still missing relative to the target distribution is precisely \( (q(x) - p(x))_+ \). Resampling from this normalized residual on rejection therefore makes the overall procedure equivalent to sampling directly from \( q \); the paper proves that this modified rejection sampling scheme recovers the target model's distribution exactly.

Furthermore, the question arises: why not simply resample from \( q \) (the target model's distribution)? Because that would break the equivalence: tokens the draft model already favours would receive both their accepted mass and an extra share of the resampled mass, so the output would be over-weighted toward them and would no longer follow \( q \). Only the residual \( (q - p)_+ \) compensates exactly for what the acceptance step leaves out, as the toy check below makes concrete.
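
Here is a tiny worked check (toy numbers of my own, not from the paper) showing that the accept-then-residual-resample procedure reproduces the target distribution \( q \) exactly, while resampling from \( q \) itself does not.

```python
import numpy as np

# Toy 3-token vocabulary: p is the draft distribution, q the target distribution.
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.5, 0.3])

# Probability that the procedure emits token x:
#   p(x) * min(1, q(x) / p(x))         draft token x accepted
# + P(reject) * (q(x) - p(x))+ / Z     token x resampled from the residual
accept = p * np.minimum(1.0, q / p)
reject_mass = 1.0 - accept.sum()
residual = np.maximum(q - p, 0.0)
emitted = accept + reject_mass * residual / residual.sum()

print(emitted)                   # [0.2 0.5 0.3]  -- exactly q
print(accept + reject_mass * q)  # ~[0.26 0.45 0.29] -- resampling from q is biased
```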

Acceptance and Rejection in Speculative Sampling

Not all drafts make the cut. Speculative Sampling employs the rejection mechanism above to guarantee the quality of the final output: if a draft token is too unlikely under the target model, it is rejected and replaced by a token drawn from the residual distribution. The result is output that follows the target model's distribution, not merely output that looks plausible.

Benefits of Speculative Sampling

So, why go through all this trouble? Speculative Sampling offers a tantalizing trade-off: it substantially accelerates decoding (the paper reports roughly a 2-2.5x speedup for Chinchilla 70B in a distributed setup) while leaving the output distribution unchanged. It's like having your cake and eating it too!
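
A rough back-of-the-envelope illustrates the trade-off (a simplified model with an independence assumption of my own, not the paper's exact analysis): if each draft token is accepted with probability \( \alpha \), one call to the expensive target model emits on average

\[
1 + \alpha + \alpha^{2} + \dots + \alpha^{K} \;=\; \frac{1 - \alpha^{K+1}}{1 - \alpha}
\]

tokens. With \( \alpha = 0.8 \) and \( K = 4 \), that is about 3.4 tokens per target forward pass instead of one, with the remaining cost being the \( K \) much cheaper draft forward passes.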

Conclusion

In this work, the authors present a new algorithm and workflow that substantially speed up the decoding of large language models. What makes Speculative Sampling stand out is its ability to raise decoding speed without any changes to the target language model's parameters or architecture. The method is lossless within numerics and scales well with a suitable draft model. Its application to the Chinchilla 70B model, using an easily trained draft model, underscores its practicality, and the empirical results validate its efficiency, showing substantial speedups across benchmark tasks and decoding methods. It's a testament to the future of efficient and high-quality language model decoding.

Reference

Chen et al., Accelerating Large Language Model Decoding with Speculative Sampling, 2023. arXiv:2302.01318.

Created 2023-09-04T10:06:16-07:00, updated 2023-11-02T21:42:25-07:00