How LLMs Pick the Next Word — Token Sampling From First Principles
Every LLM call ends the same way — the model produces a probability distribution over its entire vocabulary, and a sampler picks one token from that distribution. That single mechanic is why the same prompt can return slightly different answers, why temperature exists, and why code generation works best at one setting and brainstorming at another. This post walks through the sampling stack — greedy decoding, temperature, top-k, top-p, min-p, beam search, and the repetition penalties — and gives you a recipe for each task type.
Key takeaways
- An LLM doesn't "know" the next word — it outputs a probability over all tokens, then a sampler picks one.
- Temperature scales the logits before softmax. Lower = sharper distribution = more deterministic.
- Top-k keeps the k highest-probability tokens. Top-p keeps the smallest set whose probabilities sum to p. Min-p keeps tokens within a fraction of the top token.
- OpenAI and Anthropic both recommend tuning temperature OR top_p, not both.
- Reasoning models — OpenAI o-series, GPT-5, and Claude Opus 4.7 — disable most sampling knobs. Their internal deliberation does the work the knobs used to do.
What's happening under the hood
When you send a prompt, the model runs a forward pass and produces a logit vector — one real number per token in the vocabulary. Vocabularies are typically 100,000 to 200,000 tokens for modern frontier models. Those logits pass through a softmax to become a probability distribution that sums to 1.
So at every step, the model is staring at something like:
the → 0.31
a → 0.14
this → 0.08
my → 0.05
... (200,000 more tokens, most near zero)
The sampler's job is to pick one of those tokens, append it to the output, and run the next forward pass. Sampling is what turns a distribution into language. Change the sampling strategy and you change the personality of the output — same model, same weights, same prompt.
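The logit-to-probability step described above can be sketched in a few lines. This is a minimal illustration, not any provider's implementation; the four-token vocabulary and the logit values are made up for the example.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    # Subtracting the max logit avoids overflow and does not change the result.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a tiny four-token vocabulary.
vocab = ["the", "a", "this", "my"]
logits = [2.0, 1.2, 0.6, 0.1]
probs = softmax(logits)

for token, p in zip(vocab, probs):
    print(f"{token} -> {p:.2f}")
```

A real model does the same thing over its full vocabulary at every step; only the size of the list changes.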
Greedy decoding vs sampling
The simplest strategy is greedy decoding — at every step, pick the highest-probability token. It's deterministic and fast. It's also boring, and worse, it tends to get stuck in repetitive loops because once a high-probability path opens, greedy will keep walking it forever.
Modern chat LLMs almost never run pure greedy. They sample — they roll a weighted die against the probability distribution. That randomness is the reason "regenerate" produces a different answer. It's also the reason factual queries can be slightly wrong on one run and correct on the next.
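To make the contrast concrete, here is a small sketch of both strategies over one hypothetical next-token distribution. The token names and probabilities are assumptions for illustration, and the seed is there only so the sketch is repeatable.

```python
import random

# Hypothetical next-token distribution (already normalised).
probs = {"the": 0.55, "a": 0.25, "this": 0.15, "my": 0.05}

def greedy_pick(probs):
    """Always return the highest-probability token."""
    return max(probs, key=probs.get)

def sample_pick(probs, rng):
    """Draw one token at random, weighted by its probability."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

rng = random.Random(0)  # seeded only to make this sketch repeatable
greedy_runs = [greedy_pick(probs) for _ in range(5)]
sampled_runs = [sample_pick(probs, rng) for _ in range(5)]

print("greedy: ", greedy_runs)
print("sampled:", sampled_runs)
```

Greedy returns the same token on every call, while the sampled runs can land on any token in proportion to its weight.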
The sampling parameters below are all variations on the same theme: shape the probability distribution before you draw from it.
Temperature — the master dial
Temperature scales the logits before the softmax. The formula is:
softmax(logits / temperature)
Divide by a small number (say 0.1) and the logits get amplified — the top token's probability shoots up, everything else collapses toward zero. Divide by a large number (say 2.0) and the logits flatten — the distribution becomes closer to uniform, less-likely tokens get a real chance.
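The formula above can be tried directly. The sketch below assumes a hypothetical four-token logit vector and rejects a temperature of exactly zero, since the division is undefined there.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Compute softmax(logits / temperature) for a positive temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; treat 0 as greedy argmax")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.2, 0.6, 0.1]  # hypothetical values
sharp = softmax_with_temperature(logits, 0.1)
raw = softmax_with_temperature(logits, 1.0)
flat = softmax_with_temperature(logits, 2.0)

for name, dist in [("T=0.1", sharp), ("T=1.0", raw), ("T=2.0", flat)]:
    print(name, [round(p, 3) for p in dist])
```

With these numbers the top token takes more than 99% of the mass at 0.1, and its share shrinks steadily as the temperature rises.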
A quick mental model:
| Temperature | Distribution shape | Behaviour |
|---|---|---|
| 0.0 | Spike on the top token | Effectively greedy, mostly deterministic |
| 0.3 | Sharply peaked | Focused, factual, low variation |
| 0.7 | Moderate spread | Default chat behaviour |
| 1.0 | Original distribution | Unchanged from the model's raw output |
| 1.5+ | Flattened | Wild, sometimes incoherent |
Note that even temperature=0 is not a guarantee of identical outputs. Anthropic states this explicitly in its API docs — floating-point non-determinism, batched inference, and mixture-of-experts routing all introduce small amounts of randomness even at zero. OpenAI is the same.
Temperature ranges vary by provider. OpenAI and Google Gemini accept 0–2. Anthropic caps at 0–1. Most local-inference engines will accept anything you give them, which is its own kind of problem.
Top-k — keep the best k
top_k is the simplest truncation strategy. After computing the distribution, sort tokens by probability, keep the top k, set the rest to zero, renormalise, then sample.
- top_k=1 is greedy decoding.
- top_k=40 is a common default in open-weights models; it leaves enough headroom for variety while cutting the long tail of weird tokens.
- top_k=0 means "no truncation."
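The sort, keep, zero, renormalise sequence can be sketched as a small filter. The function name and the example distribution are hypothetical, and treating k=0 as "no truncation" follows the convention described above rather than any specific library.

```python
def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalise; k <= 0 means no truncation."""
    if k <= 0 or k >= len(probs):
        return dict(probs)
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept = ranked[:k]
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}

# Hypothetical distribution over five tokens.
probs = {"the": 0.5, "a": 0.2, "this": 0.15, "my": 0.1, "zebra": 0.05}
top2 = top_k_filter(probs, 2)
print(top2)
```

After filtering, the sampler draws from the renormalised survivors only, so the discarded tail can never be picked.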
Top-k was the original truncation knob in early language models. Most hosted chat APIs have moved on. OpenAI's chat completion API does not expose top_k at all. Anthropic exposes it but flags it as "advanced use only." Google Gemini exposes it. Open-weights stacks like Hugging Face Transformers, vLLM, and llama.cpp all expose it.
The weakness of top-k is that k is constant regardless of how confident the model is. If the model is 95% certain about the next token, you still consider k options. If it's flat across thousands of plausible tokens, you only consider k. Top-p fixes that.
Top-p (nucleus sampling) — keep the most-likely set
top_p sorts tokens by probability and keeps the smallest set whose cumulative probability reaches p. Everything below the cutoff is discarded.
If top_p=0.9:
- When the model is confident, the top 2 or 3 tokens might already sum to 0.9. You sample from those.
- When the model is uncertain, you might need 200 tokens to reach 0.9. You sample from all 200.
That adaptiveness is why top-p, also called nucleus sampling, became the dominant truncation method in modern chat APIs. OpenAI exposes it. Anthropic exposes it. Gemini exposes it.
A common, sensible default is top_p=0.9 or 0.95. Pushing it lower (say 0.5) tightens output toward the high-probability core — good for factual answers. Pushing it to 1.0 is the same as no truncation.
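The adaptive behaviour is easiest to see side by side. This sketch assumes two made-up distributions, one confident and one uncertain, and applies the same top_p to both.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break  # the nucleus is complete
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}

# Hypothetical distributions: one confident, one uncertain.
confident = {"Paris": 0.92, "Lyon": 0.04, "Nice": 0.02, "Rome": 0.02}
uncertain = {"red": 0.30, "blue": 0.25, "green": 0.20, "gold": 0.18, "grey": 0.07}

nucleus_confident = top_p_filter(confident, 0.9)
nucleus_uncertain = top_p_filter(uncertain, 0.9)
print(len(nucleus_confident), len(nucleus_uncertain))
```

The same threshold keeps one token in the confident case and four in the uncertain one, which is the behaviour a fixed k cannot give you.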
The classic guidance from both OpenAI and Anthropic: tune temperature OR top_p, not both. The interaction between them is hard to reason about, and tuning both simultaneously is a recipe for non-reproducible outputs. Pick one, leave the other at its default.
Min-p — the newer alternative
min_p is younger than the others. It was introduced in a 2024 paper, Turning Up the Heat: Min-p Sampling for Creative and Coherent LLM Outputs, which was selected as an Oral at ICLR 2025.
Min-p sets a threshold relative to the top token. If min_p=0.05 and the top token has probability 0.4, every token below 0.05 × 0.4 = 0.02 gets cut. The intuition: at low confidence, absolute probabilities are all small, so a fixed top-p threshold either keeps too much junk or strips everything to one option. A relative threshold scales with model confidence.
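The arithmetic in that example can be sketched as a filter. This is an illustrative reading of the rule as described here, not the reference implementation from the paper; the token names and probabilities are invented so the top token sits at 0.4.

```python
def min_p_filter(probs, min_p):
    """Drop tokens whose probability is below min_p times the top probability."""
    threshold = min_p * max(probs.values())
    kept = {token: p for token, p in probs.items() if p >= threshold}
    total = sum(kept.values())
    return {token: p / total for token, p in kept.items()}

# Hypothetical distribution; the top token has probability 0.4,
# so min_p=0.05 gives a cutoff of 0.05 * 0.4 = 0.02.
probs = {"the": 0.4, "a": 0.3, "this": 0.2, "my": 0.085, "zebra": 0.015}
filtered = min_p_filter(probs, 0.05)
print(sorted(filtered))
```

Because the cutoff is tied to the top token, a flatter distribution lowers the threshold and lets more candidates through.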
Min-p is integrated into Hugging Face Transformers, vLLM, and SGLang. A follow-up critical paper in 2025 questioned whether min-p meaningfully improves on top-p in practice. Honest summary: it's a real option in open-source stacks, the wins are debated, and none of the hosted commercial APIs (OpenAI, Anthropic, Gemini) expose it.
Beam search — the academic outlier
Beam search keeps k partial sequences alive at every step and finally picks the one with the highest joint probability. It was the standard for machine translation in the pre-LLM era. In modern chat LLMs it's mostly absent — it produces safe, repetitive, hedged outputs, and none of the major hosted chat APIs expose it. You'll occasionally see it in specialised inference servers for translation or summarisation where exact phrasing matters more than range.
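A toy example shows how beam search can beat greedy on joint probability. The lookup table below is a hypothetical stand-in for a model, with probabilities chosen so the locally best first token leads to a worse overall sequence.

```python
import math

# Hypothetical toy "model": next-token probabilities keyed by the sequence so far.
TOY_MODEL = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"x": 0.5, "y": 0.5},
    ("B",): {"z": 0.9, "w": 0.1},
}

def beam_search(model, beam_width, steps):
    """Keep the beam_width best partial sequences by joint log-probability."""
    beams = [((), 0.0)]  # (sequence, log-probability)
    for _ in range(steps):
        candidates = []
        for seq, score in beams:
            for token, p in model[seq].items():
                candidates.append((seq + (token,), score + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0]

# A beam width of 1 is equivalent to greedy decoding.
greedy_seq, greedy_lp = beam_search(TOY_MODEL, beam_width=1, steps=2)
beam_seq, beam_lp = beam_search(TOY_MODEL, beam_width=2, steps=2)
print(greedy_seq, round(math.exp(greedy_lp), 2))
print(beam_seq, round(math.exp(beam_lp), 2))
```

Greedy commits to "A" and ends with a joint probability of 0.30, while a beam of width 2 keeps "B" alive and finds the 0.36 sequence.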
Repetition, presence, and frequency penalties
Three related parameters discourage the model from repeating itself. They're most visible on OpenAI and open-weights stacks; Anthropic doesn't expose them directly.
- Repetition penalty: rescales the logits of tokens that have already appeared, typically dividing positive logits by the penalty and multiplying negative ones, so a repeated token always becomes less likely. Common in llama.cpp and vLLM. Values typically 1.0–1.3.
- Presence penalty (OpenAI): a flat per-token penalty once a token has appeared at all. Encourages talking about new things.
- Frequency penalty (OpenAI): a penalty proportional to how many times the token has already appeared. Suppresses chronic repetition.
Useful when a model gets stuck in a loop or keeps reusing a phrase. Crank them too high and the output becomes incoherent because the model is forced to dodge perfectly correct word choices.
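The three penalties can be sketched as adjustments to the logits before sampling. This is an approximation under stated assumptions: the repetition penalty follows the divide-positive, multiply-negative convention used by common open-weights implementations, and the presence and frequency terms are subtracted in the additive form OpenAI documents. The function name, logits, and history are hypothetical.

```python
from collections import Counter

def apply_penalties(logits, generated, repetition_penalty=1.0,
                    presence_penalty=0.0, frequency_penalty=0.0):
    """Return a copy of logits with penalties applied to already-generated tokens."""
    counts = Counter(generated)
    adjusted = dict(logits)
    for token, count in counts.items():
        if token not in adjusted:
            continue
        value = adjusted[token]
        # Repetition penalty: shrink positive logits, push negative ones lower.
        value = value / repetition_penalty if value > 0 else value * repetition_penalty
        # Presence is a flat cost; frequency scales with the number of occurrences.
        value -= presence_penalty + frequency_penalty * count
        adjusted[token] = value
    return adjusted

logits = {"cat": 2.0, "dog": 1.5, "sat": -0.5}  # hypothetical logits
history = ["cat", "cat", "sat"]                 # tokens generated so far

additive = apply_penalties(logits, history, presence_penalty=0.5, frequency_penalty=0.25)
multiplicative = apply_penalties(logits, history, repetition_penalty=2.0)
print(additive)
print(multiplicative)
```

Tokens that never appeared, like "dog" here, keep their original logit under all three penalties.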
Recipes by task
The right setting depends entirely on the job. Here's a working playbook.
| Task | temperature | top_p | Notes |
|---|---|---|---|
| Code generation | 0.0–0.2 | default | Determinism matters more than creativity. Tests should pass on rerun. |
| Analysis, extraction, classification | 0.0 | default | Same as code. You want the model's best guess, not a different one each time. |
| RAG / question answering over docs | 0.0–0.3 | default | Stay close to the retrieved evidence. |
| Default chat | 0.7 | 0.9–1.0 | The standard fluent-but-not-wild range. |
| Creative writing, marketing copy | 0.7–1.0 | 0.9 | Want range without going off the rails. |
| Brainstorming, idea expansion | 0.8–1.2 | 0.95–1.0 | Diversity is the goal. Some output will be junk, that's fine. |
| Adversarial / red-team prompting | 1.2+ | 1.0 | Surface failure modes the safe distribution hides. |
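A note on "temperature 0 for code." Strictly speaking, no sampler divides by exactly zero, because softmax(logits / 0) is undefined. Inference stacks typically either special-case zero as a plain argmax or clamp the temperature to a small positive value. Either way, the behaviour is effectively greedy.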
Same prompt, different output — a worked example
Here's the same call to a chat model at three different temperatures, using the OpenAI SDK. The prompt is deliberately open-ended.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
prompt = "Write one sentence about a coffee shop on a rainy Tuesday."

# Run the same prompt three times at each temperature to compare the spread.
for temp in [0.0, 0.7, 1.3]:
    print(f"\n--- temperature={temp} ---")
    for run in range(3):
        resp = client.chat.completions.create(
            model="gpt-4o",
            temperature=temp,
            messages=[{"role": "user", "content": prompt}],
        )
        print(f"  run {run+1}: {resp.choices[0].message.content}")
What you'd typically see across runs:
- temperature=0.0: all three runs produce nearly the same sentence. Tiny variations only, often identical.
- temperature=0.7: each run produces a different but coherent sentence. Same scene, different angles: one focuses on the barista, another on a customer, another on the smell.
- temperature=1.3: outputs drift. One sentence is unexpectedly poetic, another introduces a character out of nowhere, a third may stumble grammatically. Diversity is high, coherence drops.
That spread is sampling in action. The model didn't "know" the answer at 0 and "forget" it at 1.3 — the underlying distribution is the same. Different settings draw from it differently.
Anthropic caps temperature at 1.0, so the wildest setting available there is the model's raw distribution, not a flattened one. OpenAI and Gemini accept up to 2.0, which is where the genuinely chaotic outputs live.
Why "deterministic" doesn't mean "identical"
Setting temperature to zero is the closest you get to deterministic output, but it isn't a guarantee. Floating-point non-determinism on the GPU, batching effects, and mixture-of-experts routing all introduce small differences across runs even at temp 0. If you need true reproducibility, you need a seed parameter — and that's only partially supported. OpenAI exposes seed on chat completions with a "best-effort" disclaimer. Anthropic does not expose a stable seed. Local inference (llama.cpp, vLLM) gives you the strongest reproducibility because you control the entire stack.
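The reason local inference is the most reproducible comes down to owning the random number generator. The sketch below is a simplified illustration with a hypothetical fixed distribution: the same seed on the same stack replays the same draws. It does not model the hardware and batching variance that hosted APIs add on top.

```python
import random

def generate(probs, steps, seed):
    """Sample `steps` tokens from a fixed distribution using a seeded generator."""
    rng = random.Random(seed)  # a private generator, isolated from global state
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return [rng.choices(tokens, weights=weights, k=1)[0] for _ in range(steps)]

# Hypothetical distribution reused at every step, purely for illustration.
probs = {"rain": 0.4, "steam": 0.3, "quiet": 0.2, "jazz": 0.1}

run_a = generate(probs, steps=8, seed=42)
run_b = generate(probs, steps=8, seed=42)
print(run_a)
print(run_a == run_b)
```

The two runs match because nothing outside the seed influences the draws, which is the property a "best-effort" hosted seed cannot fully promise.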
Reasoning models don't expose the knobs
This is the big 2026 twist. OpenAI's o-series and GPT-5 series, plus Claude Opus 4.7, have removed most sampling controls.
For OpenAI reasoning models (o1, o3, o4-mini, GPT-5):
- temperature is fixed at 1 and any non-default value returns a 400 error
- top_p, n, frequency_penalty, presence_penalty, logit_bias, and logprobs are all locked out
- New parameters replace them: reasoning_effort (low/medium/high) and verbosity
For Claude Opus 4.7:
- temperature, top_p, and top_k are deprecated entirely; setting any of them returns a 400 error
- Extended thinking budgets are gone, replaced by adaptive thinking with an effort parameter (low/medium/high) in output_config
- You're expected to control behaviour through prompting, not sampling
Reasoning models run internal multi-step deliberation before producing the visible answer — generating, verifying, and selecting across candidate paths. Exposing temperature would let users collapse those paths to a single greedy one, which is exactly what reasoning was designed to avoid. The model is doing the work the temperature knob used to do.
If you're migrating from gpt-4o or claude-sonnet-4-5 to o3 or claude-opus-4-7, strip sampling parameters out of your request. Don't search for "the right temperature for o3." There isn't one.
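One way to handle the migration is a small helper that drops the locked parameters before the request is sent. This is a hypothetical sketch: the helper name is invented, the key list is taken from the parameters named in this post, and the request is a plain dict rather than a call into any SDK.

```python
# Parameters this post lists as locked or deprecated on reasoning models (assumed list).
SAMPLING_KEYS = {
    "temperature", "top_p", "top_k", "n",
    "frequency_penalty", "presence_penalty", "logit_bias", "logprobs",
}

def strip_sampling_params(request):
    """Return a copy of the request without sampling parameters."""
    return {key: value for key, value in request.items() if key not in SAMPLING_KEYS}

# A request written for a non-reasoning chat model.
legacy_request = {
    "model": "o3",
    "temperature": 0.2,
    "top_p": 0.9,
    "frequency_penalty": 0.5,
    "messages": [{"role": "user", "content": "Summarise this changelog."}],
}

cleaned = strip_sampling_params(legacy_request)
print(sorted(cleaned))
```

The helper returns a new dict and leaves the original untouched, so the same base request can still be used with models that accept the knobs.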
For a broader look at how AI is reshaping testing workflows, the best AI testing tools 2026 post covers how these models are landing in QA stacks.
Putting it together — a decision flow
When you're not sure what to set, this is the order to think through:
- Are you on a reasoning model? If yes, you have no sampling knobs to set. Use reasoning_effort / effort and write a sharper prompt. Stop.
- Is the task deterministic-style? Code, extraction, classification, RAG over evidence: set temperature=0. Leave everything else default.
- Is the task creative or open-ended? Pick either temperature=0.7–1.0 or top_p=0.9. Not both.
- Is the model repeating itself? Add a frequency penalty of 0.5. If it persists, fix the prompt.
- Do you need cross-run reproducibility? Use the lowest temperature available, set a seed if the provider supports it, and accept that exact bit-equality is rare on hosted APIs.
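The flow above can be written down as a small function. This is one possible encoding, not a standard: the function name, the task labels, and the choice of 0.8 as the creative temperature (a point inside the 0.7–1.0 range) are all assumptions for illustration.

```python
DETERMINISTIC_TASKS = {"code", "extraction", "classification", "rag"}

def choose_sampling(task, reasoning_model=False, repeating=False, seed=None):
    """Return suggested sampling parameters following the decision flow."""
    if reasoning_model:
        return {}  # no sampling knobs; control behaviour via effort and prompt
    if task in DETERMINISTIC_TASKS:
        params = {"temperature": 0.0}
    else:
        params = {"temperature": 0.8}  # tune temperature only, leave top_p alone
    if repeating:
        params["frequency_penalty"] = 0.5
    if seed is not None:
        params["seed"] = seed  # only meaningful where the provider supports it
    return params

print(choose_sampling("code"))
print(choose_sampling("brainstorm", repeating=True))
print(choose_sampling("code", reasoning_model=True))
```

Returning an empty dict for reasoning models keeps the caller from sending parameters that would be rejected.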
For a related deep dive into how the AI test-generation layer interacts with sampling settings, see AI test generation with LLMs.
FAQ
What's the difference between top-k and top-p?
Top-k keeps a fixed number of tokens (the highest-probability k). Top-p keeps a variable-size set whose probabilities sum to p. Top-p adapts to model confidence; top-k doesn't. Modern chat APIs prefer top-p, often called nucleus sampling.
Should I set temperature or top_p?
Pick one. Both OpenAI and Anthropic recommend changing temperature OR top_p, not both. Most developers tune temperature and leave top_p at its default of 1.0.
Why does the same prompt give different answers each time?
Because sampling is random by default. Even at the same temperature, the model is rolling a weighted die against its probability distribution each step. Setting temperature=0 reduces — but doesn't eliminate — the randomness, due to floating-point and batching effects in the inference stack.
Does temperature 0 guarantee identical output?
No. Anthropic states this explicitly in its docs, and OpenAI behaves the same. GPU non-determinism, batched inference, and mixture-of-experts routing all introduce small differences run to run. For tighter reproducibility, use a seed parameter where supported, or run inference locally.
Why can't I set temperature on o1 or Claude Opus 4.7?
Both providers locked sampling parameters on their reasoning models. Reasoning models run multi-step internal deliberation, and exposing temperature would interfere with that process. Use reasoning_effort (OpenAI) or effort inside output_config (Anthropic Opus 4.7) instead.
Is min-p worth using?
If you're running open-source inference (vLLM, llama.cpp, Hugging Face Transformers), it's available and worth experimenting with — especially for creative tasks. A 2025 critical re-analysis questioned whether min-p reliably beats top-p, so the wins aren't settled. Hosted APIs don't expose it at all.
What's the best temperature for code generation?
0 or as close as the provider allows. Code needs determinism — the same input should produce the same output, and unit tests should pass on rerun. Crank higher only if you're explicitly brainstorming alternative implementations.
Where Crosscheck fits
Sampling parameters explain how the model picks words, but they don't catch the bugs your AI features ship with. When an LLM-powered flow misfires in production — a hallucinated JSON field, a stale RAG answer, a temperature-too-high response that bricks the UI — someone has to capture what actually happened in the browser. Crosscheck is the Chrome extension QA teams reach for to file those reports: console errors, network calls, user actions, and the full diagnostic payload, packaged for Jira, Linear, or your AI coding assistant via MCP.



