What Are Embeddings in LLMs? Vectors and Semantic Search Explained

Written By  Crosscheck Team

Content Team

May 22, 2026 11 minutes

Embeddings Explained: How LLMs Turn Meaning Into Math

An embedding is a list of numbers — a vector — that represents the meaning of a piece of text in a form a computer can compare. Similar meanings produce vectors close together; different meanings sit far apart. That trick — meaning as geometry — powers semantic search, RAG, recommendations, clustering, and most of the "AI that finds the right document" features shipping in 2026.

Key takeaways

  • An embedding is a fixed-length vector of floats that encodes meaning — "similar meaning" becomes "close in vector space".
  • Modern models output 256 to 3,072 dimensions, many supporting Matryoshka truncation so you can shrink vectors without re-encoding.
  • Cosine similarity is the default similarity metric for text; dot product and Euclidean are appropriate in narrower cases.
  • RAG fits in one line (chunk, embed, store, query, retrieve top-k, feed to LLM), but every step has a chunking, indexing, or fusion choice that decides whether it works.
  • 2026 brought late chunking, ColBERT-style late interaction, and hybrid sparse + dense into the production mainstream. Pure single-vector dense is no longer the safe default.

What an embedding actually is

When you send a sentence to an embedding model, it passes through a neural network and emits a fixed-length list of floats — OpenAI's text-embedding-3-large returns a 3,072-dimension vector by default. That vector is the embedding.

The numbers are not human-readable; dimension 412 does not mean "is about sports". The layout is learned during training and gives the vector its useful property: related meanings end up close together.

Embed "how do I reset my password" and "i forgot my login" and the two vectors sit near each other even though they share almost no words. Lexical search would say there is zero overlap; embedding-based search says they mean the same thing.


Why we need them — and what they replaced

Before dense embeddings became cheap at scale, search was lexical. BM25, the term-weighting algorithm published in 1994 and still built into most serious search engines, scores a document by how well its vocabulary matches the query's. It is precise and fast, but it has a hard limit: if the user does not type a word from the document, BM25 cannot find it. A query about "cardiac arrhythmia" does not retrieve "irregular heartbeats".
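
To make the lexical limit concrete, here is a minimal sketch of BM25 scoring in plain Python. The function name bm25_scores, the whitespace tokeniser, and the k1 and b values are assumptions chosen for illustration, not what any particular search engine ships.

```python
import math
from collections import Counter

def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each document against the query with a basic BM25 formula."""
    tokenized = [d.lower().split() for d in docs]  # naive whitespace tokeniser (assumption)
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue  # no shared word, no contribution to the score
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    "patients reported irregular heartbeats after exercise",
    "cardiac arrhythmia is treated with beta blockers",
]
scores = bm25_scores("cardiac arrhythmia", docs)
print(scores)  # the first document scores 0.0: it shares no word with the query
```

The first document is about the same condition, but it gets a score of exactly zero because none of the query terms appear in it.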

Embeddings solve the synonym problem because the model has seen, in training, that the two phrases appear in similar contexts. The cost is precision — dense retrieval is generous and surfaces documents that are conceptually related but technically wrong. That tradeoff is why production systems in 2026 rarely use embeddings alone. The 2026 default is hybrid search: run BM25 and dense retrieval in parallel, then fuse the results.


Dimensions and what they cost

Embedding dimension is the length of the vector. More dimensions give the model more room for fine distinctions; fewer cost less to store and search. Where the major providers landed in 2026:

Provider / model | Default dimensions | Other supported dimensions | Context window | Notable
OpenAI text-embedding-3-small | 1,536 | 512, 256 (Matryoshka) | 8K tokens | The price-performance default
OpenAI text-embedding-3-large | 3,072 | 1,024, 512, 256 (Matryoshka) | 8K tokens | ~$0.13 / 1M tokens, MTEB 64.6
Voyage voyage-3-large | 1,024 | 2,048, 512, 256 | 32K tokens | Anthropic's recommended provider; MTEB 65.1
Cohere embed-v4 | 1,536 | 1,024, 512, 256 (Matryoshka) | 128K tokens | Multimodal text + image; MTEB 65.2
Google gemini-embedding-001 | 3,072 | 1,536, 768, 256 (Matryoshka) | 8K tokens | MTEB Multilingual leader (68.32)
Mistral mistral-embed | 1,024 | - | 8K tokens | Strong on European languages

The standout feature is Matryoshka representation learning. Models trained with Matryoshka front-load the most important semantic information into the earliest dimensions, so you can truncate a stored 3,072-dimension vector down to 512, re-normalise it, and keep most of the quality without calling the embedding model again. Query vectors must be truncated to the same length as the stored ones.
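
Truncation itself is only a slice plus a re-normalisation. The sketch below uses a toy 4-dimension vector and a hypothetical helper named truncate_embedding; it assumes the vector came from a Matryoshka-trained model, since truncating an ordinary embedding this way generally loses much more quality.

```python
import math

def truncate_embedding(vec: list[float], dims: int) -> list[float]:
    """Keep the first `dims` values and rescale the result to unit length."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # nothing to rescale
    return [x / norm for x in head]

full = [0.6, 0.0, 0.8, 0.0]          # toy unit vector standing in for a real embedding
short = truncate_embedding(full, 2)  # first two dimensions, rescaled to length 1
print(short)
```

Re-normalising matters if you rank by dot product, because the truncated slice is shorter than unit length.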

The storage math matters more than people expect. A 3,072-dim float32 vector is 12 KB. At 100 million chunks that is 1.2 TB. A 1,024-dim vector at the same scale is 400 GB. int8 cuts that 4x; binary embeddings cut it 32x with a modest accuracy hit. Storage is now where most of the RAG bill lives, not the embedding API.
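
The arithmetic above can be reproduced in a few lines. This is a back-of-the-envelope sketch: index_size_bytes is a hypothetical helper, and it counts raw vector bytes only, ignoring index overhead, metadata, and replication.

```python
def index_size_bytes(n_vectors: int, dims: int, bits_per_dim: int = 32) -> int:
    """Raw vector storage in bytes, ignoring index overhead and metadata."""
    return n_vectors * dims * bits_per_dim // 8

n = 100_000_000  # 100 million chunks
for label, dims, bits in [
    ("3,072-dim float32", 3072, 32),
    ("1,024-dim float32", 1024, 32),
    ("1,024-dim int8", 1024, 8),
    ("1,024-dim binary", 1024, 1),
]:
    print(f"{label}: {index_size_bytes(n, dims, bits) / 1e9:.1f} GB")
# 3,072-dim float32: 1228.8 GB
# 1,024-dim float32: 409.6 GB
# 1,024-dim int8: 102.4 GB
# 1,024-dim binary: 12.8 GB
```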


Similarity: cosine, dot product, Euclidean

Once you have two vectors, you need a way to score how close they are. Three metrics dominate.

Cosine similarity measures the angle between two vectors, ignoring magnitude. It is the default for text embeddings because what matters is direction in semantic space. Ranges from -1 to 1, computed as the dot product divided by the product of magnitudes.

Dot product is cosine similarity without the magnitude normalisation. If your model returns unit-normalised vectors — OpenAI, Voyage, and Cohere do — dot product and cosine give the same ranking, and dot product is faster.

Euclidean distance measures the straight-line distance between two vectors. Appropriate when magnitudes carry meaning, rare for LLM text embeddings.

Rule of thumb: cosine by default, dot product for speed if pre-normalised, Euclidean only when you have a specific reason.
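
The three metrics are short enough to write out by hand. The sketch below uses plain Python lists and toy 2-dimension vectors; the function names are illustrative, and a real system would use a vectorised library or the database's built-in operators.

```python
import math

def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product divided by the product of magnitudes."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def euclidean_distance(a: list[float], b: list[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

a = [1.0, 0.0]
b = [2.0, 0.0]  # same direction as a, twice the magnitude
c = [0.0, 1.0]  # orthogonal to a

print(cosine_similarity(a, b), dot(a, b), euclidean_distance(a, b))
print(cosine_similarity(a, c), dot(a, c), euclidean_distance(a, c))
```

Cosine treats a and b as identical because they point the same way, while dot product and Euclidean distance both react to the difference in magnitude.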


The 2026 vector database landscape

Once you have embeddings, you need somewhere to put them that supports nearest-neighbour search fast. The market has consolidated around a handful of options.

Database | Hosting | Strengths | Limitations | Hybrid search
pgvector | Self-host on Postgres | One database for app + vectors, no extra service | HNSW/IVFFlat indexes capped at 2,000 dimensions | Via tsvector + manual fusion
Pinecone | Managed only | Zero ops, predictable latency (~45ms p95 at 10M) | Closed source, can get expensive at scale | Yes
Weaviate | Self-host or managed | Native hybrid BM25 + dense, modular vectorisers | Higher memory footprint | Yes (native)
Qdrant | Self-host or managed | Best raw price-performance, Rust core, ~22ms p95 | Smaller ecosystem than Pinecone | Yes
Chroma | Embedded or self-host | Simplest to start, great for prototyping | Single-node ceiling around 5-10M vectors | Limited
Turbopuffer | Managed serverless | Unlimited namespaces, ~10x cheaper for multi-tenant | $64/mo minimum, no free tier, cold-start latency | Yes (BM25 + vector)

A few things worth saying. pgvector is the right answer more often than people admit — if you already run Postgres, CREATE EXTENSION vector; and you avoid a second database. The 2,000-dim cap on HNSW/IVFFlat is real; for 3,072-dim models, either truncate via Matryoshka or reach for pgvectorscale / pg_diskann. Turbopuffer is the surprise of 2026 — Cursor, Notion, and Linear run on it because per-tenant cost is roughly an order of magnitude lower than alternatives and namespaces are unlimited. Chroma is great for prototyping, not where you finish.

HNSW vs IVFFlat — both pgvector and most vector DBs offer the choice. HNSW builds a multi-layered graph and gives the best recall-latency tradeoff at the cost of higher memory and slower build times. IVFFlat partitions into clusters and searches only the nearest at query time — faster to build, lower recall. For low-latency production read paths, HNSW almost always wins.
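
The recall cost of cluster-based search is easy to see in a toy version. The sketch below is an illustration of the IVF idea only, not pgvector's implementation: the centroids are hand-picked rather than trained with k-means, and the function names and the nprobe parameter are assumptions for the example.

```python
def sq_dist(a: list[float], b: list[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))

def exact_search(query, vectors, k):
    """Brute force: compare the query with every vector."""
    ranked = sorted(range(len(vectors)), key=lambda i: sq_dist(query, vectors[i]))
    return ranked[:k]

def build_ivf(vectors, centroids):
    """Assign every vector to the list of its nearest centroid."""
    lists = [[] for _ in centroids]
    for i, v in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: sq_dist(v, centroids[c]))
        lists[nearest].append(i)
    return lists

def ivf_search(query, vectors, centroids, lists, k, nprobe=1):
    """Search only the `nprobe` lists whose centroids are closest to the query."""
    probe = sorted(range(len(centroids)), key=lambda c: sq_dist(query, centroids[c]))[:nprobe]
    candidates = [i for c in probe for i in lists[c]]
    candidates.sort(key=lambda i: sq_dist(query, vectors[i]))
    return candidates[:k]

vectors = [[0.0, 0.0], [0.1, 0.0], [0.9, 1.0], [1.0, 1.0], [0.45, 0.5]]
centroids = [[0.0, 0.0], [1.0, 1.0]]  # hand-picked for the example
lists = build_ivf(vectors, centroids)
query = [0.55, 0.5]

print(exact_search(query, vectors, k=1))                           # [4]
print(ivf_search(query, vectors, centroids, lists, k=1, nprobe=1)) # [2]
print(ivf_search(query, vectors, centroids, lists, k=1, nprobe=2)) # [4]
```

The true nearest neighbour sits just across a cluster boundary, so probing a single list misses it. Probing more lists recovers it at the cost of more distance computations, which is the recall-versus-speed dial this kind of index exposes.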


The RAG pattern, end to end

Almost every production AI feature that grounds an LLM in private data is six steps:

chunk → embed → store → query → top-k → feed to LLM
  1. Chunk source documents into pieces small enough for the embedding model and focused enough to be about one thing.
  2. Embed each chunk; one vector per chunk.
  3. Store vectors with the original text and metadata.
  4. Query — embed the user's question with the same model, run nearest-neighbour search.
  5. Top-k — take the 5-20 most similar chunks.
  6. Feed them into the LLM's prompt as context.

The whole pattern in working code, using OpenAI's embedding API and pgvector:

import os
from openai import OpenAI
import psycopg
from pgvector.psycopg import register_vector

client = OpenAI()
conn = psycopg.connect(os.environ["DATABASE_URL"])
register_vector(conn)

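# --- One-time: chunk, embed, store ---
# Assumes a table docs(doc_id, chunk_idx, text, embedding) already exists,
# with embedding declared as vector(1536) to match text-embedding-3-small.
# chunk_recursive is not defined in this article; it stands for any
# recursive splitter that returns a list of text chunks.
def index_document(doc_id: str, text: str):
    chunks = chunk_recursive(text, target_size=512, overlap=50)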
    for i, chunk in enumerate(chunks):
        emb = client.embeddings.create(
            model="text-embedding-3-small",
            input=chunk,
        ).data[0].embedding
        conn.execute(
            "INSERT INTO docs (doc_id, chunk_idx, text, embedding) "
            "VALUES (%s, %s, %s, %s)",
            (doc_id, i, chunk, emb),
        )
    conn.commit()

# --- At query time: embed question, retrieve top-k, answer ---
def answer(question: str, k: int = 8) -> str:
    q_emb = client.embeddings.create(
        model="text-embedding-3-small",
        input=question,
    ).data[0].embedding

    rows = conn.execute(
        "SELECT text FROM docs "
        "ORDER BY embedding <=> %s::vector "  # cosine distance
        "LIMIT %s",
        (q_emb, k),
    ).fetchall()
    context = "\n\n---\n\n".join(r[0] for r in rows)

    completion = client.chat.completions.create(
        model="gpt-5",
        messages=[
            {"role": "system",
             "content": "Answer only from the provided context. "
                        "If the answer is not present, say so."},
            {"role": "user",
             "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return completion.choices[0].message.content

That is the full pattern. Everything sophisticated — reranking, hybrid retrieval, late chunking, query rewriting, agentic retrieval — is layered on top of those two functions. The <=> operator is pgvector's cosine distance; <-> is Euclidean, <#> is negative inner product. Pick the one matching your embedding model — for OpenAI, Voyage, and Cohere, that is cosine.


Chunking strategies — the most under-rated lever in RAG

Pick the wrong chunk size and the best embedding model cannot save you.

Fixed-size chunking splits text into pieces of a constant token count with 50-100 tokens of overlap. It is the simplest approach, but it blows up on structured text, slicing through tables and code blocks.
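
A minimal sketch of fixed-size chunking with overlap follows. The helper name chunk_fixed is hypothetical, and whitespace-separated words stand in for model tokens; a real pipeline would count tokens with the embedding model's tokeniser.

```python
def chunk_fixed(tokens: list[str], size: int = 512, overlap: int = 50) -> list[list[str]]:
    """Slide a window of `size` tokens, stepping by `size - overlap`."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks

words = "a b c d e f g h i j".split()
print(chunk_fixed(words, size=4, overlap=1))
# [['a', 'b', 'c', 'd'], ['d', 'e', 'f', 'g'], ['g', 'h', 'i', 'j']]
```

Each chunk repeats the last token of the previous one, which is the overlap doing its job. Nothing in the loop looks at sentence or table boundaries, which is exactly why this approach cuts through structured text.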

Recursive character splitting tries natural boundaries first — paragraphs, then sentences, then words — falling back to character splits only when none fit. The production default for prose.
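
The fallback order can be sketched as a short recursive function. This is an illustration under simplifying assumptions, not the code of any library: split_recursive is a hypothetical name, sizes are measured in characters rather than tokens, and overlap is left out to keep the logic visible.

```python
def split_recursive(text: str, max_chars: int = 2000,
                    separators: tuple[str, ...] = ("\n\n", ". ", " ")) -> list[str]:
    """Split on the coarsest separator that keeps pieces within max_chars."""
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    if not separators:
        # No natural boundary left: fall back to a hard character split.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= max_chars:
            current = candidate  # keep packing pieces into the current chunk
            continue
        if current:
            chunks.append(current)
        if len(piece) <= max_chars:
            current = piece
        else:
            # A single piece is still too long: retry with finer separators.
            chunks.extend(split_recursive(piece, max_chars, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks

sample = "aaaa bbbb\n\ncccc dddd eeee"
print(split_recursive(sample, max_chars=10))
# ['aaaa bbbb', 'cccc dddd', 'eeee']
```

The first paragraph fits and stays whole. The second is too long, so it is split again on spaces, and only a piece with no separators at all would reach the hard character split.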

Semantic chunking uses an embedding model to find meaning shifts, splitting where the cosine distance between adjacent sentences jumps. More expensive but groups related ideas together. Worth it for long-form content.
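
In code, the idea reduces to comparing each sentence's vector with the previous one. The sketch below takes sentence vectors as input so it can run without an embedding API; the toy 2-dimension vectors and the fixed threshold of 0.3 are assumptions for the example, and real implementations often derive the threshold from the distribution of distances instead.

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

def semantic_chunks(sentences: list[str], vectors: list[list[float]],
                    threshold: float = 0.3) -> list[str]:
    """Start a new chunk wherever adjacent sentences are farther apart than threshold."""
    if not sentences:
        return []
    groups = [[sentences[0]]]
    for i in range(1, len(sentences)):
        if cosine_distance(vectors[i - 1], vectors[i]) > threshold:
            groups.append([])  # meaning shifted: open a new chunk
        groups[-1].append(sentences[i])
    return [" ".join(g) for g in groups]

sentences = [
    "Reset your password from the login page.",
    "A reset link arrives by email.",
    "Invoices are issued monthly.",
    "Billing runs on the first of the month.",
]
# Toy vectors: the first two point one way, the last two another.
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
for chunk in semantic_chunks(sentences, vectors):
    print(chunk)
```

With these vectors the split lands between the password sentences and the billing sentences, giving two chunks of two sentences each.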

Late chunking is the development that genuinely changes the pattern. Introduced by Jina AI in 2024, it is by 2026 standard in jina-embeddings-v3 and v4 and integrated in Elasticsearch, Milvus, and Qdrant. It inverts the order: embed the whole document first with a long-context model, then pool token-level embeddings into chunks afterwards. Every chunk's vector knows about the rest of the document, so when the text says "the city" three paragraphs after introducing Berlin, the chunk gets an embedding that points toward Berlin too.
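
The pooling step can be sketched in isolation. The code below assumes you already have one contextualised vector per token from a single pass of a long-context model over the whole document, plus the token span of each chunk. The names late_chunk and mean_pool, the toy vectors, and the choice of mean pooling are assumptions for the example.

```python
def mean_pool(vectors: list[list[float]]) -> list[float]:
    """Average a list of equal-length vectors dimension by dimension."""
    dims = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dims)]

def late_chunk(token_embeddings: list[list[float]],
               spans: list[tuple[int, int]]) -> list[list[float]]:
    """Pool contextualised token vectors into one vector per (start, end) span."""
    return [mean_pool(token_embeddings[start:end]) for start, end in spans]

# Toy stand-in for token-level output from one pass over a 4-token document.
token_embeddings = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
spans = [(0, 2), (2, 4)]  # chunk boundaries expressed as token indices

print(late_chunk(token_embeddings, spans))
# [[2.0, 0.0], [0.0, 3.0]]
```

The difference from ordinary chunking is where the token vectors come from: they were computed with the whole document in view, so each pooled chunk vector carries that wider context.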

Default for 2026: recursive splitting at 400-600 tokens with 50-token overlap, semantic chunking as an upgrade where boundaries matter, late chunking enabled if supported. For code and tabular data, chunk on natural structural boundaries.


Pitfalls that bite teams in production

A RAG demo is easy; a production RAG system is hard, and most of the difficulty is retrieval.

Chunks too small lose context — retrieved fragments match semantically but lack detail the LLM needs. Chunks too large blur the signal — too many ideas in one vector, recall drops.

Single embedding model for everything. Code, legal text, financial filings, and chat have their own distributions. Voyage's voyage-code-3, voyage-finance-2, and voyage-law-2 exist for this reason.

Pure dense retrieval where lexical matters. If your domain involves exact identifiers — SKUs, error codes, ticker symbols — pure dense loses to BM25 because the embedding model treats ERR_CERT_AUTHORITY_INVALID and ERR_CONNECTION_REFUSED as semantically similar. They are not, to the user looking up an error.

Not re-embedding when the model changes. Embeddings are model-specific, and vectors from different models cannot be compared. If you swap from text-embedding-3-small to text-embedding-3-large, re-embed everything.

Skipping the reranker. Retrieval is a high-recall problem; reranking is high-precision. Pull 100 candidates with hybrid retrieval, run a cross-encoder reranker (Cohere rerank-v3, BGE-reranker, Voyage rerank-2) to get the top 10 that go to the LLM.


What's actually new in 2026

Three threads moved the state of retrieval forward in 2026.

ColBERT-style late interaction. Traditional dense retrieval encodes each document into a single vector. ColBERT — and its successors ColBERTv2, PLAID, ModernColBERT, and ColBERT-Att (March 2026) — encode each token of the document and each token of the query separately, then compute a fine-grained late-interaction score. The cost is storage; the benefit is better recall on out-of-domain queries, low-resource languages, and long documents. The first dedicated workshop (LIR @ ECIR 2026) signals practitioners are deploying it. For general RAG, single-vector dense remains default. For hard retrieval — legal, scientific, multi-language — ColBERT-style is now worth the cost.
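
The late-interaction score is usually described as MaxSim: for each query token, take its best match among the document tokens, then sum those maxima. A minimal sketch with toy 2-dimension token vectors follows; maxsim_score is an illustrative name, and real systems add compression and candidate pruning on top.

```python
def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def maxsim_score(query_tokens: list[list[float]], doc_tokens: list[list[float]]) -> float:
    """Sum, over query tokens, of the best dot product with any document token."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)

query_tokens = [[1.0, 0.0], [0.0, 1.0]]           # two query tokens
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]      # has a strong match for both
doc_b = [[1.0, 0.0], [0.7, 0.1]]                  # matches only the first well

print(maxsim_score(query_tokens, doc_a))
print(maxsim_score(query_tokens, doc_b))
```

doc_a scores higher because every query token finds a close document token. A single pooled vector per document cannot make that per-token distinction, and storing one vector per token is where the extra storage cost comes from.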

Hybrid sparse + dense as the production default. The 2026 consensus: do not pick between BM25 and embeddings. Run both, fuse with Reciprocal Rank Fusion (k=60), then optionally rerank. RRF is rank-based and sidesteps the score-scale mismatch. On a 2025 Elasticsearch benchmark (Wands furniture dataset), hybrid added roughly 1.3% NDCG with plain RRF and 7.5% NDCG with a tuned tiered approach over BM25 baseline. On financial documents, BM25 beats dense even against text-embedding-3-large on every metric except Recall@20, and hybrid wins on all of them.
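
Reciprocal Rank Fusion is small enough to show in full. In this sketch each retriever contributes 1 / (k + rank) for every document it returns, with k=60 as in the text. The document ids and the two input rankings are made up for the example, and the alphabetical tie-break is an assumption added to keep the output deterministic.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of document ids using ranks only, never raw scores."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id.
    return sorted(scores, key=lambda d: (-scores[d], d))

bm25_ranking = ["d3", "d1", "d2"]    # hypothetical lexical results
dense_ranking = ["d1", "d4", "d3"]   # hypothetical embedding results

print(reciprocal_rank_fusion([bm25_ranking, dense_ranking]))
# ['d1', 'd3', 'd4', 'd2']
```

d1 wins because both retrievers rank it near the top, even though neither BM25 scores nor cosine similarities were ever compared directly.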

Late chunking in production stacks. Native in jina-embeddings-v3 and v4 with integrations for Elasticsearch, Milvus, and Qdrant. No longer experimental.

A defensible 2026 default: recursive chunking at 500 tokens with overlap, text-embedding-3-small or voyage-3 for embeddings, pgvector with HNSW on Postgres or Qdrant otherwise, hybrid BM25 + dense with RRF, Cohere or Voyage cross-encoder rerank on top 50-100 candidates. Add late chunking and ColBERT-style retrieval only when the existing pipeline is the proven bottleneck.

For broader AI tooling context, see our best AI testing tools for 2026 and AI test generation with LLMs.


FAQ

Do I need embeddings if my LLM has a huge context window?

Usually yes. Stuffing 200K tokens into every prompt is slow, expensive, and dilutes the model's attention. Embeddings let you retrieve only the 5-20 most relevant chunks — faster, cheaper, better answers on most benchmarks.

What's the difference between an embedding and a vector?

Every embedding is a vector, and in retrieval work the terms are used interchangeably. "Vector" is the math term; "embedding" specifies that the vector was produced by a model to represent input in a meaningful space.

How do I pick the embedding dimension?

Start with the model's default, ship, then tune down via Matryoshka truncation if storage hurts. Most production teams in 2026 use 1,024 or 1,536 dimensions.

Should I use pgvector or a dedicated vector database?

If you already run Postgres and your scale is under ~10M vectors with 1,024-dim or fewer embeddings, pgvector with HNSW is the cleanest answer. Above that scale, or with 3,072-dim embeddings, look at Qdrant (self-hosted) or Pinecone / Turbopuffer / Weaviate (managed).

When should I add a reranker?

When recall is fine but answer quality is worse than the retrieved chunks would suggest. The right chunk is in the top 50 but not the top 5 — a cross-encoder reranker (Cohere rerank-v3, Voyage rerank-2, BGE-reranker-v2) fixes it. Adds 50-200ms latency but is often the single biggest quality lever.


Where Crosscheck fits

Crosscheck is a free Chrome extension for visual bug reporting — not a retrieval system. If you are building an LLM-powered product on embeddings and RAG, the bugs you ship need reproduction context — the question the user typed, the chunks retrieved, the model's response, the metadata at each step. Crosscheck captures page state, console logs, network traffic, and a screen recording in a single click, then pushes the report into Jira, Linear, ClickUp, GitHub, or Slack. When the retrieval layer misbehaves, that context is what makes the bug debuggable.
