How semantic search works (and why keyword search fails)

How semantic search works in 2026 — embeddings, vectors, and hybrid retrieval explained plainly. Why keyword search keeps missing your saved bookmarks, and what fixes it.

SavedThat team · 12 min read

Type "how much does it cost to acquire customers" into a saved-bookmarks search and watch nothing match. The video you're looking for is right there in your library — you saved it. The speaker said "CAC" sixteen times. Your search said "customer acquisition cost." The words don't overlap. The match fails.

This failure mode has a name: the vocabulary mismatch problem. It's why keyword search keeps losing, and it's why semantic search is now the default in every serious tool that searches anything text-shaped. Below: how semantic search actually works in 2026 — plain English, no math degree required — and why most consumer apps that ship "AI search" still get it wrong.

The vocabulary mismatch problem in 60 seconds

Old search systems matched literal strings. You typed customer acquisition cost; the system found documents containing those exact words. Documents that said "CAC" or "the cost of bringing in new buyers" or "what we pay per signup" stayed invisible.

In information retrieval research this has been measured for forty years. Typical vocabulary mismatch is somewhere around 70-80%: the classic study (Furnas et al., 1987) found that two people pick the same term for the same concept less than 20% of the time. If your search only matches literal strings, you miss four out of five relevant documents. That's a polite way of saying "keyword search has been broken since 1985."

Two things fixed it. The first is embeddings: a way to represent meaning as numbers, so two phrases that mean the same thing end up "close" even if they share no words. The second is vector databases: storage and indexes that can find the closest vectors fast, even when the corpus is millions of documents.

Together, they're what makes "semantic search" a thing.

What an embedding is, plainly

Take a sentence. Run it through a neural network. The network returns a list of numbers — typically 384, 768, or 1,536 of them. That list is the embedding of the sentence.

The trick is that the network is trained so that sentences with similar meanings produce similar lists. A literal example, with three phrases:

  A. "customer acquisition cost"
  B. "what we pay to bring in each new customer"
  C. "best sourdough starter recipe"

Embeddings A and B will be close in the 1,536-dimensional space they live in. C will be far away. Closeness is measured by cosine similarity — how aligned the two vectors are in that high-dimensional space. Numbers near 1.0 mean almost identical meaning; numbers near 0 mean unrelated.

You can think of it as plotting every phrase as a point in a 1,536-dimensional galaxy where direction encodes meaning. The actual coordinates are meaningless on their own — what matters is which other phrases land near which.
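
To make that concrete, here's a minimal sketch using OpenAI's Python client and the A/B/C phrases above (exact similarity numbers vary by model version; treat them as illustrative):

```python
# Minimal sketch: embed three phrases and compare them with cosine similarity.
# Assumes OPENAI_API_KEY is set in the environment.
import math
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> list[float]:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return resp.data[0].embedding  # 1,536 floats

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

a = embed("customer acquisition cost")                  # A
b = embed("what we pay to bring in each new customer")  # B
c = embed("best sourdough starter recipe")              # C

print(cosine(a, b))  # high: same concept, zero shared words
print(cosine(a, c))  # low: unrelated
```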

The model that does this

Several models are competitive in 2026. OpenAI's text-embedding-3-small ships 1,536 dimensions natively and supports Matryoshka representation learning (MRL): you can truncate to 768 or 256 dimensions for storage savings without retraining. Cohere's embed-v3 is competitive on multilingual content. Open-source alternatives like bge-large and nomic-embed-text are within a few accuracy points and run on a Mac.

For multilingual workloads — like saved-videos in 30 languages with queries in any of them — text-embedding-3-small is currently the best balance of accuracy, cost, and dimensions. At SavedThat we use the 768-dimension Matryoshka truncation: half the storage, ~99% of the search quality. (Read more on Matryoshka representation.)
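
The truncation needs no post-processing on your side: the text-embedding-3 models accept a `dimensions` parameter and return the shortened, renormalised vector directly. A minimal sketch:

```python
# Sketch: request a Matryoshka-truncated embedding straight from the API.
from openai import OpenAI

client = OpenAI()
resp = client.embeddings.create(
    model="text-embedding-3-small",
    input="how much does it cost to acquire customers",
    dimensions=768,  # half the storage for ~99% of the search quality
)
vec = resp.data[0].embedding
assert len(vec) == 768
```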

What "AI search" actually means

Plenty of consumer apps in 2026 advertise "AI search." Most of them mean one of three things:

  1. Real semantic search — embeddings + vector retrieval, as described above.
  2. LLM-on-top of keyword search — the system runs a keyword search, gets results, then asks an LLM to rephrase or summarise them. The retrieval is still keyword-broken; the LLM just paints over the failures.
  3. Pure LLM Q&A — you ask a question, the LLM answers from its general training, your saved content is not consulted at all.

Only the first one actually solves vocabulary mismatch on your data. The other two are productised but not useful for finding what you saved.

A quick diagnostic: if searching for a concept fails but searching for one of the document's specific words succeeds, the tool is doing #2 or #3 wrapped as "AI search." If you can find the document via a paraphrase, it's #1.

Vector databases: storing and finding embeddings fast

Embeddings on their own don't help if you have to compare every query against millions of vectors one at a time. A naive scan of a 50,000-document index takes hundreds of milliseconds — way too slow for interactive use.

The fix is approximate nearest neighbour (ANN) indexing. Specialised data structures — HNSW (Hierarchical Navigable Small World), IVF, ScaNN — store vectors so the closest matches can be found in under 10 milliseconds even on libraries with tens of millions of vectors.
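
Concretely, with Postgres and pgvector this is one index and one operator. A sketch (psycopg 3; the `chunks` table and column names are illustrative, not our actual schema):

```python
# Sketch: build an HNSW index and run an approximate nearest-neighbour query
# with Postgres + pgvector. Connection string and schema are illustrative.
import psycopg

conn = psycopg.connect("postgresql://localhost/savedthat")

# One-time: build the HNSW index (pgvector's default build parameters).
conn.execute("""
    CREATE INDEX IF NOT EXISTS chunks_embedding_hnsw
    ON chunks USING hnsw (embedding vector_cosine_ops)
""")
conn.commit()

def nearest_chunks(query_vec: list[float], k: int = 10):
    # <=> is pgvector's cosine-distance operator. With the HNSW index this is
    # an approximate lookup, not a scan over every vector in the table.
    return conn.execute(
        "SELECT id, content FROM chunks ORDER BY embedding <=> %s::vector LIMIT %s",
        (str(query_vec), k),
    ).fetchall()
```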

As of May 2026, HNSW is the de facto default for production semantic search: it's the index pgvector, Qdrant, Weaviate, and most managed vector stores build.

For SavedThat-scale (currently ~120 saved videos × ~50 chunks each = 6,000 vectors), the index fits in RAM and lookups are sub-millisecond.

The catch: semantic search alone is wrong about exact phrases

Now the punchline that most tutorials skip: semantic search loses to keyword search on exact phrases.

If you search for "text-embedding-3-small" (an exact product name), keyword search returns the document mentioning that exact string at rank 1. Semantic search may rank a document that talks about "OpenAI embeddings in general" higher because it's semantically closer to the average meaning of the query, even though the literal string is missing.

This is real. Public benchmarks consistently show pure semantic underperforms keyword for queries with high specificity: product names, function signatures, exact quotes, dates, brand names. The thing semantic was supposed to fix (vocabulary mismatch) doesn't exist for those queries because the user typed the literal string they wanted.

The fix is to run both and merge the results.

Hybrid retrieval with reciprocal rank fusion

Hybrid retrieval is the technique that's quietly become the default in serious production search systems since 2024:

  1. Run a semantic search (vector similarity). Get the top-K results ranked by cosine.
  2. Run a full-text search (keyword, tsvector, BM25, whatever). Get the top-K results ranked by relevance score.
  3. Merge the two lists using reciprocal rank fusion (RRF): each document's RRF score is 1 / (60 + rank), summed across the two methods. Documents that rank well in both lists get a meaningful boost; documents appearing in only one list still contribute.
  4. Sort by combined RRF score.

The constant 60 is a magic number that's been benchmarked to death and works across most corpora. You can tune it; you usually don't need to.
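
In code, the fusion step is about ten lines. A minimal sketch (document ids are illustrative):

```python
# Minimal reciprocal rank fusion: merge two ranked lists of document ids
# (best first). k=60 is the standard constant discussed above.
def rrf_merge(semantic: list[str], keyword: list[str], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in (semantic, keyword):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# A document that ranks well in both lists beats one that tops only one list:
print(rrf_merge(["a", "b", "c"], ["b", "c", "d"]))  # ['b', 'c', 'a', 'd']
```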

The result: paraphrased queries still find concept-matches (via the semantic side), and exact-phrase queries still rank the literal match at #1 (via the keyword side). Each method's blindspot is covered by the other.

This is what SavedThat ships as the default search. The search_chunks RPC in our Postgres runs both queries in parallel, computes RRF scores, and returns ranked hits in ~80ms for our current corpus size. (Microsoft on hybrid search)
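
For the curious, here's roughly what the two legs look like in application code, reusing `conn` from the pgvector sketch and `rrf_merge` from above. This approximates what the RPC does in SQL; it is not the actual search_chunks implementation:

```python
# Sketch: hybrid search = vector leg + full-text leg, fused with RRF.
# In production you'd store a precomputed tsvector column; this recomputes
# it inline for brevity.
def hybrid_search(query: str, query_vec: list[float], k: int = 20) -> list[str]:
    semantic = [row[0] for row in conn.execute(
        "SELECT id FROM chunks ORDER BY embedding <=> %s::vector LIMIT %s",
        (str(query_vec), k),
    )]
    keyword = [row[0] for row in conn.execute(
        """
        SELECT id FROM chunks
        WHERE to_tsvector('english', content) @@ plainto_tsquery('english', %s)
        ORDER BY ts_rank(to_tsvector('english', content),
                         plainto_tsquery('english', %s)) DESC
        LIMIT %s
        """,
        (query, query, k),
    )]
    return rrf_merge(semantic, keyword)
```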

Where this still gets fancy: re-ranking

For high-stakes use cases — legal discovery, medical literature, enterprise search over millions of documents — hybrid RRF is usually the second-to-last layer. The last layer is re-ranking with a cross-encoder model.

A cross-encoder is a smaller, slower model that takes the full query and the full candidate document and computes a tight relevance score by attending across both jointly. You can't run it on a million documents (too slow); you run it on the top 50-200 results from hybrid retrieval to re-order them precisely.

Cohere ships rerank-v3 for this. Open-source alternatives: bge-reranker-large, jina-reranker-v2. For consumer-scale workloads like SavedThat, the marginal gain doesn't justify the 100ms latency hit. For enterprise legal-discovery search, cross-encoder re-ranking is now standard.
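
If you do reach for it, the open-source route is a few lines with sentence-transformers. A sketch using bge-reranker-large:

```python
# Sketch: re-rank the hybrid top-K with a cross-encoder.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-large")

def rerank(query: str, candidates: list[str], top_n: int = 10) -> list[str]:
    # The cross-encoder scores each (query, document) pair jointly: slow per
    # pair, which is why it only sees the top 50-200 hybrid results.
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_n]]
```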

We'll add it when our corpus or our use case demands it. Not before.

The cost picture

The whole pipeline is roughly:

  - Embedding 1,000 chunks (~50 videos): ~$0.02 in OpenAI credits
  - Storing 1M chunks in pgvector: ~$25/mo on a small Postgres instance
  - HNSW index build for 1M chunks: ~30 seconds, one-time per major data change
  - One search query (embed + HNSW lookup + FTS + RRF): ~$0.0001 and ~80ms

At SavedThat's pricing, one Pro subscription ($6.99/mo) covers thousands of searches and several hundred new saves. The economics are fine for consumer-scale; what doesn't work is trying to fund this with a free-forever-unlimited plan because the OpenAI bill grows linearly with content while revenue caps at zero.

Why most "second brain" tools don't ship this

Three reasons:

  1. Engineering complexity. Hybrid retrieval with RRF requires running two retrieval systems and merging them. Most teams ship one. The one is usually keyword because it's older and the stack is mature.
  2. Storage cost. Embeddings at 1,536 dimensions × 4 bytes/dim = ~6KB per chunk. A library of 100,000 chunks is 600MB of vector storage on top of the original text. halfvec (2 bytes/dim) and MRL truncation each halve that, but most tools never implement either.
  3. Inertia. Most existing note tools shipped before sentence embeddings were a commodity. Retrofitting semantic search into Evernote or Pocket would have required a vector database project that the org never prioritised.

The tools shipping semantic search from day one — Mem, Reflect, Glasp's premium tier, SavedThat — are mostly post-2022 builds because that's when the embedding cost dropped below the threshold where consumer-scale could be priced sustainably.

What this means practically

When you evaluate a search-inside-X tool in 2026, ask whether it ships four things:

  1. Real embedding-based retrieval over your content, not an LLM pasted over keyword results.
  2. Keyword/full-text search, so exact strings (product names, quotes) still rank first.
  3. Hybrid merging of the two result lists (RRF or similar), not one method alone.
  4. Multilingual embeddings, if your content isn't all in English.

For saving videos and finding them later: SavedThat ships all four (architecture overview here). Glasp ships keyword only (free tier) and semantic (paid). Mem ships semantic only, without hybrid. Notion's "AI search" is meaning #2 from earlier: an LLM on keyword.

That's the lay of the land in May 2026.


Frequently asked questions (2026)

What's the difference between an embedding and a vector?

Functionally none — in practice the terms are used interchangeably. Strictly, an embedding is the *representation* a model produces, and a vector is just a list of numbers. Every embedding is a vector; not every vector is an embedding (a column of pixel intensities is a vector but not an embedding). Most papers and engineering docs use 'vector' as the storage term and 'embedding' as the conceptual term.

Does semantic search work in any language?

Depends on the embedding model. OpenAI's text-embedding-3-small was trained on 100+ languages and produces vectors that are roughly aligned across them — a Russian query and an English document about the same concept land in similar regions of vector space. Smaller open-source models often degrade significantly on non-English. If multilingual is a requirement, verify the model card explicitly mentions cross-lingual alignment.

How is hybrid search different from running two searches and showing both?

Running two searches and concatenating the results just gives you twice as many results. Hybrid search merges them by score (RRF or weighted fusion) so the final ranking reflects both methods' agreement. A document at rank 1 in keyword and rank 1 in semantic ends up at rank 1 overall; a document at rank 1 in keyword but rank 50 in semantic falls in the middle. Without merging, you can't tell which results to trust.

Why does pure semantic search lose to keyword for exact phrases?

Embeddings encode semantic meaning, not literal strings. A query like 'text-embedding-3-small' has its meaning in 'OpenAI embeddings family member' more than in the literal token sequence. The vector for the query is similar to vectors for documents discussing OpenAI embeddings broadly, which may rank higher than the specific document mentioning the exact product name. Keyword search has zero ambiguity on exact strings — they either appear or don't. Hybrid covers both modes.

Is RAG the same thing as semantic search?

Related but not the same. Semantic search is the retrieval step — find documents close to a query in vector space. RAG (Retrieval-Augmented Generation) is a system that uses semantic search to fetch documents, then passes them to a generative LLM that writes an answer using the documents as context. RAG includes semantic search as one component; you can have semantic search without RAG (when you just want to find the documents, not generate text on top).

Do I need a GPU to run semantic search?

For inference — searching against a built index — no. CPU is fine for retrieval at consumer scale. For building the index — running the embedding model on your content — a GPU is fast but not required. OpenAI's API does the embedding in their datacentre, returning the vector for you to store; the only thing on your end is the vector database and the query embedding (one short API call per query). Cost is pennies per thousand searches.