Monday, 28 April 2025

How Do Sampling Techniques Help Control Randomness in LLMs?

 

Sampling, in the context of Large Language Models (LLMs), refers to the process of selecting the next token (a word, subword, or character) during text generation. It is the technique through which randomness is introduced into, and controlled in, the model's output. Without sampling, the model would always pick the same next token for a given input sequence, resulting in repetitive, deterministic output. Sampling introduces variability, making the output more diverse and natural.

 

1. What is Sampling in LLMs?

In LLMs, the model generates text by predicting the next token based on the previous context. Sampling is the process of selecting a token from the probability distribution of possible next tokens, which the model provides. The probability distribution is based on the model's understanding of language patterns, context, and prior training data.

 

For instance, if you're generating text about "cats," the model may predict that the next token could be “are” with a high probability, but “rats” might also be a possible, though less likely, token. Sampling helps to decide which token to choose, allowing for diverse and creative outputs.

 

2. How Is Sampling Used to Choose the Next Token?

Given a sequence of input tokens, the model computes a probability distribution for the next token: it outputs a set of possible tokens with associated probabilities. Sampling techniques are used to pick the next token from this distribution. Let's walk through the example below.

 

1.   Context: The model receives an input sequence, such as "I am suffering from fever."

2.   Probability Distribution: Based on the context, the model outputs a distribution of likely next tokens, e.g.,

·       "and" (0.4)

·       "but" (0.35)

·       "so" (0.25)

3.   Sampling: Instead of always selecting the highest-probability token ("and"), sampling may choose "but" or "so", introducing variability and allowing the model to generate more diverse and natural continuations, as the sketch below illustrates.
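To make this step concrete, here is a minimal Python sketch. The logits are hypothetical values, chosen so that softmax roughly reproduces the 0.4/0.35/0.25 distribution above; a real model would produce logits over its entire vocabulary.

import math
import random

# Hypothetical raw scores (logits) for candidate next tokens after
# "I am suffering from fever"; chosen so softmax gives ~0.4/0.35/0.25.
logits = {"and": 2.0, "but": 1.87, "so": 1.53}

# Softmax turns the logits into the probability distribution.
z = sum(math.exp(v) for v in logits.values())
probs = {tok: math.exp(v) / z for tok, v in logits.items()}

# Sampling: draw one token, weighted by its probability.
tokens, weights = zip(*probs.items())
next_token = random.choices(tokens, weights=weights, k=1)[0]
print("I am suffering from fever", next_token)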

 

3. Different Types of Sampling Techniques

There are several sampling techniques used in LLMs to generate the next token. Here are the most common ones.

 

3.1 Greedy Sampling

Greedy sampling picks the token with the highest probability for the next word based on the current context.

 

For example, let’s take the input "I am suffering from fever". The probabilities of the next tokens are given below.

·       "and" (0.4)

·       "but" (0.35)

·       "so" (0.25)

 

The model selects the most probable token, "and", so the final output is "I am suffering from fever and".
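As a minimal sketch, greedy sampling is just an argmax over the toy distribution above (the values are illustrative, not real model output):

# Toy next-token distribution (illustrative values).
probs = {"and": 0.4, "but": 0.35, "so": 0.25}

# Greedy sampling: always take the highest-probability token.
next_token = max(probs, key=probs.get)
print("I am suffering from fever", next_token)  # always "... and"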

 

3.2 Random Sampling

Random sampling selects the next token at random according to the given probabilities. The model samples from the candidate next tokens, and tokens with higher probabilities are more likely to be chosen.

 

For example, let’s take the input "I am suffering from fever". The probabilities of the next tokens are given below.

·       "and" (0.4)

·       "but" (0.35)

·       "so" (0.25)

 

The model randomly picks a token, for example,

·       Output 1: “I am suffering from fever and” (40% chance)

·       Output 2: “I am suffering from fever but” (35% chance)

·       Output 3: “I am suffering from fever so” (25% chance)
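A quick sketch using the same toy distribution; repeating the draw many times shows the 40/35/25 split emerge.

import random
from collections import Counter

tokens = ["and", "but", "so"]
probs = [0.4, 0.35, 0.25]

# One generation step draws a single token:
print("I am suffering from fever", random.choices(tokens, weights=probs, k=1)[0])

# Repeating the draw 10,000 times makes the 40/35/25 split visible.
print(Counter(random.choices(tokens, weights=probs, k=10_000)))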

 

 

3.3 Top-K Sampling

Top-k sampling limits the pool of next possible tokens to the top k most probable ones and then samples from them.

 

If k = 2, the model will sample from the top 2 most probable tokens.

 

For example, let’s take the input "I am suffering from fever". The probabilities of the next tokens are given below.

·       "and" (0.4)

·       "but" (0.35)

·       "so" (0.25)

 

The model will randomly select between "and" and "but", as they are the top 2 most probable tokens.

·       Output 1: "I am suffering from fever and"

·       Output 2: "I am suffering from fever but"
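A minimal sketch of top-k filtering. Note that after discarding the other tokens, the surviving probabilities are typically renormalized so they sum to 1 before sampling.

import random

def top_k_sample(probs, k):
    # Keep the k most probable tokens and renormalize their probabilities.
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    tokens = [t for t, _ in top]
    weights = [p / total for _, p in top]
    return random.choices(tokens, weights=weights, k=1)[0]

probs = {"and": 0.4, "but": 0.35, "so": 0.25}  # illustrative values
print("I am suffering from fever", top_k_sample(probs, k=2))

With k = 2, "and" and "but" are kept with renormalized probabilities of about 0.53 and 0.47, and "so" can never be chosen.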

 

3.4 Top-p (Nucleus) Sampling

In top-p sampling, the model picks the smallest set of tokens whose cumulative probability is greater than or equal to p, and samples from this set.

 

If p = 0.9, the model will select the smallest subset of tokens that cumulatively make up 90% of the probability.

 

For example, let’s take the input "I am suffering from fever". The probabilities of the next tokens are given below.

·       "and" (0.4)

·       "but" (0.35)

·       "so" (0.25)

 

Cumulative Probability:

·       "and" (0.4) + "but" (0.35) = 0.75 (not enough for p = 0.9)

·       "and" (0.4) + "but" (0.35) + "so" (0.25) = 1.0 (this exceeds p = 0.9)

 

Since the cumulative probability only reaches the threshold p = 0.9 once "so" is added, all three tokens are included in the selection pool.

 

Top-p Sampling Output: The model will randomly choose one of these three tokens.

·       Output 1: "I am suffering from fever and"

·       Output 2: "I am suffering from fever but"

·       Output 3: "I am suffering from fever so"
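A sketch of the same logic in Python; as in top-k, the surviving tokens are renormalized before the draw.

import random

def top_p_sample(probs, p):
    # Walk the tokens from most to least probable, keeping tokens
    # until their cumulative probability reaches p.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    pool, cumulative = [], 0.0
    for token, prob in ranked:
        pool.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in pool)
    tokens = [t for t, _ in pool]
    weights = [prob / total for _, prob in pool]
    return random.choices(tokens, weights=weights, k=1)[0]

probs = {"and": 0.4, "but": 0.35, "so": 0.25}  # illustrative values
print("I am suffering from fever", top_p_sample(probs, p=0.9))

With p = 0.9 the loop only stops after "so" is added, so all three tokens stay in the pool, matching the walkthrough above; with p = 0.7, only "and" and "but" would survive.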

 

3.5 Temperature Sampling

Temperature sampling rescales the probability distribution before sampling: the model's logits are divided by a temperature value T before the softmax is applied. Lower temperatures (T < 1) sharpen the distribution and make the model more confident in its top predictions, while higher temperatures (T > 1) flatten it and add more randomness.

 

For example, let’s take the input "I am suffering from fever". The probabilities of the next tokens are given below.

·       "and" (0.4)

·       "but" (0.35)

·       "so" (0.25)

 

Temperature = 0.5 (Low Temperature): The model becomes more confident in its predictions, so "and" will be favoured.

·       Output: "I am suffering from fever and" (more likely).

 

Temperature = 1.5 (High Temperature): The probabilities become more spread out, and all tokens are more likely to be chosen.

·       Output 1: "I am suffering from fever and"

·       Output 2: "I am suffering from fever but"

·       Output 3: "I am suffering from fever so"
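The effect is easy to verify in a short sketch. Dividing log-probabilities by the temperature and renormalizing is equivalent to dividing the logits, since the constant shift between logits and log-probabilities cancels in the softmax.

import math

def apply_temperature(probs, temperature):
    # Divide log-probabilities by T and re-apply softmax.
    scaled = {t: math.log(p) / temperature for t, p in probs.items()}
    z = sum(math.exp(v) for v in scaled.values())
    return {t: math.exp(v) / z for t, v in scaled.items()}

probs = {"and": 0.4, "but": 0.35, "so": 0.25}  # illustrative values
print(apply_temperature(probs, 0.5))  # sharper: ~{and: 0.46, but: 0.36, so: 0.18}
print(apply_temperature(probs, 1.5))  # flatter: ~{and: 0.38, but: 0.35, so: 0.28}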

 

3.6 Combining Top-k/Top-p with Temperature

Combining Top-k/Top-p with temperature is supported by most modern large language models (LLMs) and is commonly used to control both the diversity and the quality of the generated text. Models such as GPT-3, GPT-4, and other advanced transformers allow Top-k, Top-p (nucleus sampling), and temperature to be set together, generating text that balances creativity, relevance, and consistency.

 

Why use the combination?

·       Top-k + Temperature: This combination provides a fixed set of options (top-k) while adjusting the randomness and creativity through temperature. It's useful when you want to restrict the choice of tokens but still want to allow some variability.

·       Top-p + Temperature: Top-p, combined with temperature, dynamically adjusts the size of the token pool based on probability distribution, allowing for more nuanced control over creativity and coherence. It lets the model generate varied outputs without falling into too much randomness.

·       Top-k + Top-p + Temperature: This approach combines the benefits of all three methods, allowing a more precise, fine-tuned balance between coherence, creativity, and diversity. Top-k limits the pool of possible tokens, Top-p ensures flexible, adaptive token selection, and temperature controls the randomness of the output, making the combination a highly effective way to steer the model's behavior (see the sketch below).
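As an illustration, here is how the three parameters might be combined with the Hugging Face transformers library (a sketch, assuming transformers is installed and using "gpt2" as a stand-in model; the parameter values are arbitrary examples):

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("I am suffering from fever", return_tensors="pt")
outputs = model.generate(
    **inputs,
    do_sample=True,     # sample instead of greedy decoding
    top_k=50,           # keep only the 50 most probable tokens
    top_p=0.9,          # then keep the smallest set covering 90% probability
    temperature=0.8,    # slightly sharpen the distribution
    max_new_tokens=20,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))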

 

In summary,

·      Use Greedy Sampling when you need the most deterministic, reliable, and fluent output.

·      Use Random Sampling when you want more diverse, creative, or exploratory outputs, and the specific choice of the next token does not need to be deterministic.

·      Use Top-k Sampling when you want a balance between controlled diversity and fluency in the output, while still limiting randomness.

·      Use Top-p (Nucleus) Sampling when you want an adaptive, flexible approach to token selection that ensures consistency while introducing variability, without being limited to a fixed number of top tokens.

·      Use Temperature Sampling when you need to control the level of randomness in the output, adjusting how adventurous or conservative the model’s predictions are.

 

