Getting Reliable JSON From an LLM: The 2026 Playbook
LLM structured output is the set of techniques that force a language model to return data matching a developer-supplied schema — typically JSON — instead of free-form prose. Every major provider supports it natively in 2026: OpenAI through response_format with strict: true, Anthropic through output_format (beta) and tool use, Gemini through responseSchema. The same primitive powers all three: constrained decoding masks any next-token choice that would violate the schema. This guide compares the three approaches, shows working code for each provider, and explains why you still need Zod or Pydantic on the receiving end.
Key takeaways
- Free-text LLM output is unsuitable for any code path that parses or branches on the response — schema enforcement is now table stakes.
- Three approaches dominate: function/tool calling, native JSON-schema response modes, and constrained decoding at inference time.
- OpenAI's response_format with type: "json_schema" and strict: true, and Anthropic's output_format (beta header structured-outputs-2025-11-13), compile the schema into a grammar and constrain token sampling. Gemini does the equivalent through responseMimeType: "application/json" plus responseSchema.
- Provider guarantees cover shape, never truth: validate downstream with Zod or Pydantic and treat refusals as first-class error paths.
- Reach for XGrammar, Outlines, or llama.cpp grammars when you're self-hosting open models or need a schema feature the hosted APIs don't support.
Why free-text responses break
Ask a model to "extract the customer's name and email from this paragraph" without any structure constraint and you get three flavours of answer at random: a prose reply, a clean JSON object, or a refusal preamble followed by JSON in a code fence. For a human all three are fine. For a downstream JSON.parse() they're three different failure modes.
The old fix was prompt engineering: "respond with only valid JSON, no markdown." That raised the success rate but never to 100%. Modern structured-output features fix it at the API level by changing how the model samples tokens.
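To make those three failure modes concrete, here's a small sketch that feeds each flavour of reply to a naive json.loads. The replies are invented for illustration, and the code fence inside the third one is assembled at runtime only so it doesn't collide with this listing.

```python
import json

FENCE = "`" * 3  # markdown code fence, built at runtime

# Three invented replies to the same extraction prompt (illustrative only).
replies = {
    "prose": "The customer is Jane Doe and her email is jane.doe@example.com.",
    "clean_json": '{"name": "Jane Doe", "email": "jane.doe@example.com"}',
    "fenced_json": (
        "Sure! Here you go:\n"
        + FENCE + "json\n"
        + '{"name": "Jane Doe", "email": "jane.doe@example.com"}\n'
        + FENCE
    ),
}


def try_parse(text):
    """Return (ok, parsed_value_or_error_message) for a naive json.loads."""
    try:
        return True, json.loads(text)
    except json.JSONDecodeError as exc:
        return False, exc.msg


results = {label: try_parse(text) for label, text in replies.items()}
for label, (ok, value) in results.items():
    print(f"{label:12} parsed={ok} -> {value}")
```

Only the clean reply survives. The fenced one contains perfectly good JSON, but the parser never gets that far.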
How structured output actually works
Every transformer-based LLM ends each step by sampling one token from a probability distribution over its vocabulary. Constrained decoding intercepts that distribution — at each step it masks any token that would make the running output invalid against the schema, then samples from what remains. The model can't choose "Sure! " as the next token if the schema says the output must start with {.
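A toy version of that masking step, assuming a four-token vocabulary with made-up scores. Real decoders work over the full vocabulary and usually sample rather than take the top score; this sketch picks greedily so the result is deterministic.

```python
import math


def masked_choice(logits, allowed):
    """Pick the highest-scoring token among those the grammar allows.

    logits:  dict mapping token -> raw model score for this step
    allowed: set of tokens that keep the output valid against the schema
    """
    masked = {
        token: (score if token in allowed else -math.inf)
        for token, score in logits.items()
    }
    return max(masked, key=masked.get)


# Hypothetical scores for the first token of a reply.
step_logits = {"Sure": 3.2, "Here": 2.7, "{": 1.9, "[": 0.4}

# Without a constraint every token is eligible, so the chatty opener wins.
unconstrained = masked_choice(step_logits, allowed=set(step_logits))

# A schema that demands a JSON object leaves "{" as the only valid first token.
constrained = masked_choice(step_logits, allowed={"{"})

print(unconstrained, constrained)
```

The model's preferences are untouched. The mask only removes options, which is why the values inside the JSON can still be wrong even when the shape is guaranteed.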
The constraint can live in three places: inside a tool-calling pipeline (the schema becomes a function signature), inside a native JSON response mode (passed through response_format / output_format / responseSchema), or outside the provider at inference time (open-source runtimes like vLLM and llama.cpp expose grammar backends). The three approaches are siblings conceptually but differ in cost, latency, and how they fit into the rest of your code.
Approach 1 — Function / tool calling
Tool calling was the original workaround for "make the model return structured data." You describe a function the model can call, the model returns a JSON object matching the function's parameter schema, and your code reads that object. You don't have to actually execute the function — extracting the parsed input is a valid use on its own.
All three providers support it. OpenAI's variant has an explicit strict: true flag that promotes the parameter schema from "hint" to "constraint." Anthropic's tool use accepts an input_schema, and pairing it with tool_choice: { type: "tool", name: "..." } forces the call. Gemini exposes the same primitive as function_declarations. A forced tool call is structured output wearing a tool-call hat, and because it works on every tool-capable model, a lot of production code still uses it. Tool calling beats native JSON mode when you actually have multiple tools the model picks between, when the call is optional, or when you're on an older model that doesn't support strict JSON-schema responses.
Approach 2 — Native JSON schema modes
In 2024–2025 each major provider shipped a dedicated parameter for "respond in this exact JSON shape, no tool wrapper." This is the path most new code should take when you only need structured data, not actual tool routing.
OpenAI: response_format with strict: true
OpenAI's Structured Outputs feature on Chat Completions takes a JSON schema with strict: true and guarantees that the output either matches the schema or comes back as a refusal. Every property must be listed in required, and additionalProperties: false is mandatory; those constraints are what let the backend compile the schema into a grammar deterministically. The first call with a new schema adds latency while the grammar is compiled and cached (typically a second or two, up to roughly ten for larger schemas); subsequent calls run at full speed.
import json

from openai import OpenAI

client = OpenAI()

# Strict mode: every property listed in "required", no additional properties.
schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "email": {"type": "string"},
    },
    "required": ["name", "email"],
    "additionalProperties": False,
}

response = client.chat.completions.create(
    model="gpt-4o-2024-08-06",
    messages=[
        {"role": "system", "content": "Extract the contact details from the user's message."},
        {"role": "user", "content": "Hey, I'm Jane Doe, reach me at jane.doe@example.com after Friday."},
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "Contact",
            "strict": True,
            "schema": schema,
        },
    },
)

message = response.choices[0].message

# A refusal arrives in its own field, so check it before parsing the content.
if message.refusal:
    raise RuntimeError(f"Model refused: {message.refusal}")

contact = json.loads(message.content)
# {"name": "Jane Doe", "email": "jane.doe@example.com"}
On OpenAI's newer Responses API the setting moved from response_format to text.format. The fields are the same, but name, strict, and schema sit directly beside type instead of under a nested json_schema key.
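If you're migrating existing Chat Completions code, the change amounts to a small dict transformation. A minimal sketch, assuming the Responses API expects name, strict, and schema at the same level as type (the helper name is hypothetical):

```python
def to_responses_format(response_format):
    """Convert a Chat Completions json_schema response_format into the
    dict that goes under text.format on the Responses API."""
    if response_format.get("type") != "json_schema":
        raise ValueError("this sketch only handles json_schema formats")
    # Lift the nested json_schema fields up next to "type".
    return {"type": "json_schema", **response_format["json_schema"]}


chat_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "Contact",
        "strict": True,
        "schema": {"type": "object"},
    },
}

# This dict would be passed as the `text` argument of a Responses API call.
text_param = {"format": to_responses_format(chat_format)}
print(text_param)
```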
Anthropic: output_format (beta) or forced tool use
Anthropic's native Structured Outputs is in public beta and requires the header anthropic-beta: structured-outputs-2025-11-13. It supports Claude Sonnet 4.5 and Claude Opus 4.1 at launch, with Haiku 4.5 following. The schema is compiled into a grammar and cached for 24 hours — a first-call penalty of roughly 100–300 ms, then negligible overhead.
import json

import anthropic

client = anthropic.Anthropic()

schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "email": {"type": "string"},
    },
    "required": ["name", "email"],
    "additionalProperties": False,
}

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=512,
    messages=[
        {"role": "user", "content": "Hey, I'm Jane Doe, reach me at jane.doe@example.com after Friday."},
    ],
    # Beta feature: the header opts in, and output_format rides along in the request body.
    extra_headers={"anthropic-beta": "structured-outputs-2025-11-13"},
    extra_body={
        "output_format": {
            "type": "json_schema",
            "schema": schema,
        }
    },
)

contact = json.loads(response.content[0].text)
# {"name": "Jane Doe", "email": "jane.doe@example.com"}
On older Claude models, the long-standing fallback is forced tool use: declare a single tool whose input_schema is the data you want, then set tool_choice={"type": "tool", "name": "..."}. The model returns a tool_use block whose input is the schema-conforming object. The Anthropic cookbook has recommended this pattern since 2024 and it still works on every Claude version.
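The forced-tool-use pattern comes down to two pieces: the extra arguments you send and the block you read back. A minimal sketch that builds both as plain dicts; the tool name record_contact and both helper names are made up for this example, and the response is mocked in the shape of the raw JSON (with the SDK you'd read block.type and block.input as attributes instead).

```python
def forced_tool_kwargs(schema, tool_name="record_contact"):
    """Extra keyword arguments for messages.create that force a single tool call."""
    return {
        "tools": [
            {
                "name": tool_name,
                "description": "Record the contact details found in the message.",
                "input_schema": schema,
            }
        ],
        "tool_choice": {"type": "tool", "name": tool_name},
    }


def extract_tool_input(content_blocks, tool_name="record_contact"):
    """Return the input of the first tool_use block for the given tool."""
    for block in content_blocks:
        if block.get("type") == "tool_use" and block.get("name") == tool_name:
            return block["input"]
    raise LookupError(f"no tool_use block for {tool_name!r}")


schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "email": {"type": "string"}},
    "required": ["name", "email"],
}
kwargs = forced_tool_kwargs(schema)

# Mocked response content, standing in for what the API would return.
mock_content = [
    {
        "type": "tool_use",
        "name": "record_contact",
        "input": {"name": "Jane Doe", "email": "jane.doe@example.com"},
    },
]
contact = extract_tool_input(mock_content)
print(contact)
```

Nothing ever executes the tool. The tool definition exists only to carry the schema.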
Gemini: responseSchema plus responseMimeType
Gemini's structured output takes a different shape: set responseMimeType: "application/json" and pass a schema as either responseSchema (OpenAPI-style, simpler) or responseJsonSchema (full JSON Schema, supported on Gemini 3). Fields are optional unless declared in required. Per Google's own guidance, the schema lives in the request config and not in the prompt — duplicating it in the prompt actively hurts output quality.
import { GoogleGenAI, Type } from '@google/genai';

// ai.models.generateContent belongs to the @google/genai SDK, so the client comes from there.
const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

const schema = {
  type: Type.OBJECT,
  properties: {
    name: { type: Type.STRING },
    email: { type: Type.STRING },
  },
  required: ['name', 'email'],
};

const result = await ai.models.generateContent({
  model: 'gemini-3-flash',
  contents: "Hey, I'm Jane Doe, reach me at jane.doe@example.com after Friday.",
  config: {
    responseMimeType: 'application/json',
    responseSchema: schema,
  },
});

// In this SDK the response text is a property, not a method.
const contact = JSON.parse(result.text);
// {"name": "Jane Doe", "email": "jane.doe@example.com"}
Gemini also supports enum responses through responseMimeType: "text/x.enum", useful when the model should pick from a fixed list rather than emit free text.
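For the enum case, the request config and the check on the reply are both tiny. A sketch in Python, assuming the snake_case config keys used by Google's Python SDK; the priority labels and helper names are invented for illustration.

```python
def enum_request_config(labels):
    """Generation config asking the model to answer with exactly one label."""
    return {
        "response_mime_type": "text/x.enum",
        "response_schema": {"type": "STRING", "enum": list(labels)},
    }


def parse_enum_reply(text, labels):
    """Return the label from a plain-text reply, rejecting anything off-list."""
    value = text.strip()
    if value not in labels:
        raise ValueError(f"unexpected label: {value!r}")
    return value


PRIORITIES = ["low", "medium", "high"]  # hypothetical label set

config = enum_request_config(PRIORITIES)
label = parse_enum_reply("high\n", PRIORITIES)
print(config, label)
```

The reply is bare text rather than JSON, so there's nothing to parse. The membership check is still worth keeping as a downstream guard.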
Approach 3 — Constrained decoding for self-hosted models
When you're running open-weight models on your own infrastructure — Llama 4, Qwen, DeepSeek, anything served by vLLM, llama.cpp, or TGI — the schema-enforcement primitive is exposed directly. You install a grammar backend, hand it a schema, and the runtime masks invalid tokens at sampling time. No provider in the loop, no beta to depend on.
The major backends in 2026:
- XGrammar — the vLLM default. JIT-compiled grammars via pushdown automata, compilation in native C. Faster than Outlines on most schemas.
- Outlines — finite-state-machine-based, the library that popularised constrained decoding. Strong for complex schemas reused across thousands of requests, where the FSM compile cost amortises.
- lm-format-enforcer — token-masking approach with faster first requests but slower steady state on complex schemas.
- llama.cpp GBNF — used directly inside llama.cpp and any wrapper around it (Ollama, LM Studio).
A vLLM call with XGrammar:
# Written against vLLM's GuidedDecodingParams API; newer releases may expose
# this under a different name, so check the version you have installed.
from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams

llm = LLM(model="meta-llama/Llama-3.3-70B-Instruct")

schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "email": {"type": "string"},
    },
    "required": ["name", "email"],
}

params = SamplingParams(
    temperature=0.1,
    max_tokens=256,
    guided_decoding=GuidedDecodingParams(json=schema, backend="xgrammar"),
)

output = llm.generate(
    "Extract contact info as JSON: Hey, I'm Jane Doe, reach me at jane.doe@example.com.",
    sampling_params=params,
)
print(output[0].outputs[0].text)
Outlines can also be used standalone — against Hugging Face models, llama.cpp, or behind an OpenAI-compatible proxy. The right backend depends on your serving stack; there's no single winner across all workloads.
How the approaches compare
| Approach | Guarantee | First-call latency | Schema features | Best for |
|---|---|---|---|---|
| OpenAI response_format strict | Schema-enforced | +1–10 s (cached after) | JSON Schema subset | New OpenAI code where you only need data |
| Anthropic output_format (beta) | Schema-enforced | +100–300 ms (cached 24 h) | JSON Schema subset | Claude Sonnet 4.5 / Opus 4.1 data extraction |
| Anthropic forced tool use | Strong | Negligible | Standard JSON Schema | Older Claude models, multi-tool flows |
| Gemini responseSchema | Schema-enforced | Negligible | OpenAPI subset (responseJsonSchema for full schema) | Gemini 3 / Flash data extraction, enum outputs |
| Provider tool calling | Strong | Negligible | Standard JSON Schema | When the model should actually pick between tools |
| XGrammar (vLLM) | Schema-enforced | Low (JIT) | JSON Schema, regex, CFG | Self-hosted, high-throughput inference |
| Outlines | Schema-enforced | Moderate (FSM compile) | JSON Schema, regex, CFG | Self-hosted with reusable complex schemas |
| llama.cpp GBNF | Schema-enforced | Negligible | GBNF (custom CFG format) | Local llama.cpp, Ollama, LM Studio |
The hosted-API guarantees are roughly equivalent — pick by which model you're using, not by which API is "best." Self-hosting buys you grammar features the hosted APIs don't (full regex, recursive CFGs) at the cost of running your own GPUs.
For a wider view of the AI tooling QA teams pair with structured output in test-generation workflows, see best AI testing tools 2026 and AI test generation with LLMs.
Pitfalls that show up in production
Structured output fixes the "is it valid JSON" problem. It doesn't fix everything else.
Deep nesting hurts quality. Schemas with five-plus levels of nested objects produce noticeably worse field values, even when the shape is correct. Flatten where you can. If you really need depth, split into two calls.
Optional fields confuse the model. OpenAI's strict mode requires every property in required. If you genuinely need an optional field, model it as a ["string", "null"] union and let the model emit null. "Just omit the field" is not an option under strict mode.
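The nullable-union workaround is mechanical enough to automate. A minimal sketch for flat object schemas; the helper name to_strict is made up, and nested objects would need the same treatment applied recursively.

```python
import copy


def to_strict(schema, optional=()):
    """Return a copy of a flat object schema adapted for strict mode.

    Every property ends up in "required"; properties named in `optional`
    get a nullable type so the model can emit null instead of omitting them.
    """
    strict = copy.deepcopy(schema)
    props = strict.get("properties", {})
    for name in optional:
        current = props[name]["type"]
        types = current if isinstance(current, list) else [current]
        if "null" not in types:
            types = [*types, "null"]
        props[name]["type"] = types
    strict["required"] = list(props)
    strict["additionalProperties"] = False
    return strict


loose = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "phone": {"type": "string"},
    },
    "required": ["name"],
}

strict = to_strict(loose, optional=["phone"])
print(strict)
```

On the receiving end, treat null as "not present" so the rest of your code doesn't have to care which mode produced the object.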
Long enum lists degrade output. Twenty enum values is fine. Two hundred and you'll see hallucinated values that don't match any entry, paired with the model's confident insistence that they do.
Schema constraints aren't all enforced. Anthropic and Gemini silently drop unsupported keywords like minimum, maxLength, pattern, and format. The model receives a simplified schema and your downstream validator catches the gap. Don't assume pattern: "^\\d{5}$" will be enforced at sampling time.
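One way to keep track of the gap is to scan your schema for the keywords named above and make sure each one has a matching downstream check. A rough sketch that only follows properties and items, so it's a starting point rather than a full JSON Schema walker; the function name is hypothetical.

```python
# Keywords the text above names as possibly dropped at sampling time.
FLAGGED = {"minimum", "maxLength", "pattern", "format"}


def find_flagged_keywords(schema, keywords=FLAGGED, path="$"):
    """List (path, keyword) pairs that should be re-validated downstream."""
    hits = []
    if not isinstance(schema, dict):
        return hits
    for key, value in schema.items():
        if key == "properties" and isinstance(value, dict):
            # Property names are data, not keywords: only descend into subschemas.
            for prop, sub in value.items():
                hits.extend(find_flagged_keywords(sub, keywords, f"{path}.{prop}"))
        elif key == "items":
            hits.extend(find_flagged_keywords(value, keywords, f"{path}[]"))
        elif key in keywords:
            hits.append((path, key))
    return hits


schema = {
    "type": "object",
    "properties": {
        "zip": {"type": "string", "pattern": "^\\d{5}$"},
        "age": {"type": "integer", "minimum": 0},
        "tags": {"type": "array", "items": {"type": "string", "maxLength": 20}},
    },
}

hits = find_flagged_keywords(schema)
for location, keyword in hits:
    print(f"{location}: {keyword} may not be enforced by the provider")
```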
Refusals are not exceptions. OpenAI can return a refusal field instead of the structured output if the request hits a safety policy. Anthropic and Gemini have equivalents. These arrive as 200 OK responses with no parseable body, so nothing is raised at the HTTP layer. Code that calls JSON.parse(message.content) without checking for a refusal first will throw on exactly those responses, which tends to surface as intermittent crashes in production.
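A small guard makes the refusal path explicit. This sketch works on a plain dict shaped like a Chat Completions message (refusal and content keys); the ModelRefusal exception and the helper name are inventions for the example, and other providers would need their own field checks.

```python
import json


class ModelRefusal(Exception):
    """Raised when the model declined instead of returning structured data."""


def parse_structured(message):
    """Check for a refusal first, then parse the JSON content."""
    refusal = message.get("refusal")
    if refusal:
        raise ModelRefusal(refusal)
    content = message.get("content")
    if not content:
        raise ValueError("response has neither a refusal nor content")
    return json.loads(content)


ok_message = {"refusal": None, "content": '{"name": "Jane Doe"}'}
refused_message = {"refusal": "I can't help with that.", "content": None}

parsed = parse_structured(ok_message)

try:
    parse_structured(refused_message)
    outcome = "parsed"
except ModelRefusal as exc:
    # Handle this like any other business outcome: log it, surface it, move on.
    outcome = f"refused: {exc}"

print(parsed, outcome)
```

Giving refusals their own exception type lets callers tell "the model said no" apart from "the payload was broken" and handle each differently.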
Multilingual edge cases. A schema field labelled "city" against a French-language source may emit "ville" as the value, especially with lower-temperature settings on smaller models. Add explicit field descriptions in English to anchor the output.
For a deeper read on how the model picks tokens under the hood — and why temperature still matters even when the schema is locked — see LLM token sampling.
Validate downstream regardless
The provider guarantees schema shape, never truth. The model can still emit "email": "not-actually-an-email" and your code will happily parse it. Run every structured response through a validator on the receiving end.
In TypeScript, Zod is the de facto choice. In Python, Pydantic does the same job.
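// TypeScript: raw is the JSON string returned by the model
import { z } from 'zod';

const ContactSchema = z.object({
  name: z.string().min(1),
  email: z.string().email(),
});

const contact = ContactSchema.parse(JSON.parse(raw));

# Python: raw is the JSON string returned by the model
# (EmailStr needs the optional email-validator package)
from pydantic import BaseModel, EmailStr

class Contact(BaseModel):
    name: str
    email: EmailStr

contact = Contact.model_validate_json(raw)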
Treat validation as the boundary between "an LLM said this" and "trusted data." It's also the right place to attach retry logic: if email fails validation, retry once with the model's previous response appended as context. A common pattern: define the schema in Zod or Pydantic first, generate the JSON schema from it (zod-to-json-schema for Zod, or the model class's model_json_schema() method in Pydantic, as in Contact.model_json_schema()), and pass that to the LLM API. One source of truth, validated on both sides.
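Here's roughly what the retry-on-validation-failure loop can look like. To stay dependency-free this sketch swaps in a hand-rolled validator with a deliberately loose email regex where you'd normally use Pydantic, and call_model is a hypothetical callable that takes a message list and returns the raw JSON string.

```python
import json
import re

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")  # loose check, not RFC-complete


def validate_contact(raw):
    """Parse and validate a contact payload, raising ValueError on any problem."""
    data = json.loads(raw)  # JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    if not isinstance(data.get("name"), str) or not data["name"]:
        raise ValueError("name must be a non-empty string")
    if not isinstance(data.get("email"), str) or not EMAIL_RE.match(data["email"]):
        raise ValueError("email is not a valid address")
    return data


def extract_with_retry(call_model, prompt, max_retries=1):
    """Call the model, validate, and retry with the failed reply as context."""
    messages = [{"role": "user", "content": prompt}]
    for attempt in range(max_retries + 1):
        raw = call_model(messages)
        try:
            return validate_contact(raw)
        except ValueError as exc:
            if attempt == max_retries:
                raise
            messages += [
                {"role": "assistant", "content": raw},
                {"role": "user", "content": f"That failed validation ({exc}). Return corrected JSON."},
            ]


# Stand-in for a real API call: bad email first, corrected on the retry.
canned = iter([
    '{"name": "Jane Doe", "email": "not-actually-an-email"}',
    '{"name": "Jane Doe", "email": "jane.doe@example.com"}',
])
message_counts = []


def fake_model(messages):
    message_counts.append(len(messages))
    return next(canned)


contact = extract_with_retry(fake_model, "Extract the contact details.")
print(contact, message_counts)
```

Capping retries at one keeps a persistently bad response from turning into a loop of paid calls. If the second attempt fails too, the error propagates to the caller.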
Picking an approach
A short decision guide:
- GPT-4o or newer, need data only: OpenAI response_format with strict: true (or text.format on the Responses API).
- Claude Sonnet 4.5 or Opus 4.1: Anthropic output_format with the beta header.
- Older Claude models: forced tool use with input_schema.
- Gemini 2.x / 3.x: responseMimeType + responseSchema.
- Multiple tools the model should choose between: tool calling, not native JSON mode.
- Self-hosting: XGrammar with vLLM, or llama.cpp GBNF for local setups.
Whichever path you pick, pair it with Zod or Pydantic on the way out.
FAQ
What's the difference between JSON mode and structured outputs?
JSON mode (OpenAI's older response_format: { type: "json_object" }) guarantees the output parses as JSON but does not guarantee it matches a specific schema. Structured outputs (type: "json_schema" with strict: true) guarantees both. New code should use the schema version; JSON mode is now legacy.
Does strict: true make the LLM more accurate?
It makes the shape more accurate — the response is guaranteed to match the schema. It doesn't make the values inside more truthful. The model can still hallucinate names, dates, or fields. That's why downstream validation with Zod or Pydantic remains essential.
Can I use structured outputs with streaming?
Yes, on all three providers. Most SDKs expose a helper that lets you read partial parsed output as it's generated — useful for long extractions in chat UIs.
Does structured output add to my token bill?
Yes. The schema counts as input tokens. A contact-extraction schema runs 100–200 tokens; a 20-field schema with enums can be 500–1,000. Small compared to retry costs from malformed output, but worth knowing when you're tuning prompt caches.
What happens if the model refuses?
OpenAI returns a refusal field instead of the content payload. Anthropic and Gemini have similar mechanisms. Code must check for these before parsing — missing it is one of the most common production bugs in LLM apps.
Should I describe the schema in the prompt too?
No. Providers explicitly warn against it. Duplicating the schema confuses the model and degrades output quality. Field descriptions inside the schema are the right place to add semantic hints.
Where Crosscheck fits
Structured outputs are how AI tooling stops being a demo and starts being something a build can depend on. The Crosscheck Chrome extension takes the same posture for bug reports — every report carries console logs, network traces, and user-action replay in a defined structure, so the AI assistants that read them (via Crosscheck's MCP integration with Claude, Cursor, and Windsurf) get a predictable shape every time, not an ad-hoc paragraph of "here's what I think went wrong."
When your structured-output pipeline misbehaves in production and you need to file a clean bug to the team that owns it, this is where Crosscheck lives.



