AI Search That Actually Finds Things

Keyword search misses the obvious, and bolted-on AI search returns confident nonsense. Hybrid retrieval, ranking, and where semantic search helps versus hurts.

A user searches your product for "cancel subscription." Your keyword search returns nothing, because the help doc is titled "Ending Your Plan." The user concludes the feature doesn't exist and churns. So you bolt on "AI search" — swap keywords for embeddings — and now searching for the exact product name "Cadrium Pro v2" returns fuzzy semantic cousins while the page literally titled "Cadrium Pro v2" sits on result four. You traded one failure mode for another and called it an upgrade.

Both of these are the same mistake in different clothes: picking one retrieval method and pretending it's a search system. Keyword search and semantic search fail in opposite directions. The fix isn't choosing the better one. It's building the system that uses both.

The teardown: how each approach fails alone

Keyword search fails on meaning. It matches strings, not concepts. "Cancel subscription" doesn't match "ending your plan" because they share no words. It misses synonyms, paraphrases, and the gap between how users talk and how your content is written. Users don't search with your vocabulary; they search with theirs, and lexical search punishes them for it.

Semantic search fails on specifics. Embeddings capture meaning, which is exactly why they blur precision. Search for an exact SKU, a product name, an error code, a person's name, and semantic search returns things that are about the same topic rather than the exact match. It's confidently approximate. For "TLS handshake error 525" the user wants the doc about error 525, not five docs about TLS that are semantically nearby. Semantic search also hallucinates relevance — it always returns something, ranked by similarity, even when nothing is actually relevant, so the user gets confident nonsense at position one.
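To make the "always returns something" failure concrete, here is a minimal sketch. The three-dimensional vectors, the document ids, and the min_score cutoff are all made up for illustration; real embeddings have hundreds of dimensions, and a usable cutoff has to be tuned against your own data rather than copied from here.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, doc_vecs, k=3, min_score=None):
    """Rank documents by similarity; optionally drop weak matches.

    min_score is a hypothetical cutoff, not a recommended value.
    """
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    if min_score is not None:
        scored = [pair for pair in scored if pair[1] >= min_score]
    return scored[:k]

# Toy vectors standing in for real embeddings (assumption for illustration).
DOCS = {
    "tls-overview": [0.9, 0.1, 0.0],
    "tls-certificates": [0.8, 0.2, 0.1],
    "billing-faq": [0.1, 0.9, 0.2],
}

# A query that is not really about any of the documents.
off_topic_query = [0.0, 0.1, 1.0]

always_something = top_k(off_topic_query, DOCS)             # three results, all weak
honest_empty = top_k(off_topic_query, DOCS, min_score=0.5)  # nothing clears the bar
```

Without a cutoff, the off-topic query still gets a full result list with "billing-faq" at position one, even though its similarity is low. With the cutoff, the list comes back empty, which is the honest answer.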

"AI search" bolted on naively is usually semantic search alone, which means you swapped the keyword failure mode for the semantic one. Exact-match queries now degrade. The demo looked magic because the demo queries were conceptual. Production queries are half conceptual and half "I know exactly what I want, give me that."

The fix: hybrid retrieval

A search system that works runs both retrieval methods and combines them. Lexical search (keyword matching scored by a ranking function such as BM25) catches exact matches, codes, names, and rare terms. Semantic search catches meaning, synonyms, and paraphrase. Each covers the other's blind spot.

The combination is the engineering. You run both, then fuse the results into one ranked list — a method like reciprocal rank fusion, which blends the two orderings without needing their scores to be on the same scale. The exact-match doc that lexical search ranks first stays near the top. The semantically-relevant doc that lexical search missed entirely gets surfaced. The user searching "cancel subscription" finds "Ending Your Plan" and the user searching "Cadrium Pro v2" gets the exact page first.
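Reciprocal rank fusion is short enough to sketch in full. Each document earns 1 / (k + rank) from every list it appears in, and the sums decide the final order. The document ids and the two input rankings below are invented for illustration; k=60 is a commonly used default, not a value this article prescribes.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids into one list.

    Only ranks are used, so the retrievers' raw scores never need
    to be on a comparable scale.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by id so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

# Hypothetical results for the query "Cadrium Pro v2".
lexical = ["cadrium-pro-v2", "cadrium-pro-v1", "pricing"]
semantic = ["cadrium-overview", "pricing", "cadrium-pro-v2"]

fused = reciprocal_rank_fusion([lexical, semantic])
print(fused)
```

The exact-match page that lexical search ranked first stays on top because it appears in both lists, while "cadrium-overview", which lexical search missed entirely, still makes the fused list.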

Hybrid retrieval is not the fancy option. It's the baseline for any search system that has both conceptual queries and precise queries — which is every real search system.

Ranking: retrieval finds candidates, ranking decides what wins

Retrieval gets you a candidate set. It does not get you the right order, and order is what users actually experience — almost nobody scrolls past the first few results.

Above raw retrieval, a ranking layer reorders candidates using signals retrieval ignores. A cross-encoder re-ranker reads the query and each candidate together and scores relevance far more precisely than the first-pass similarity. Because it is expensive, you run it only on the top candidates (say, the top 50), not the whole corpus, so you pay for accuracy only where it counts. Then business signals layer on: recency (the current doc beats the deprecated one), popularity (what other users clicked for similar queries), and authority (official docs over forum posts).

The pattern is cheap-and-broad retrieval to gather candidates, then expensive-and-precise ranking to order the few that matter. Skip the ranking layer and even good retrieval delivers the right answer at position seven, where nobody looks.
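A rough sketch of that two-stage shape follows. Everything in it is an assumption for illustration: toy_relevance is a token-overlap stand-in for a real cross-encoder, and the penalty and boost weights are placeholders you would tune against an evaluation set.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    doc_id: str
    text: str
    deprecated: bool = False
    official: bool = True

# Hypothetical weights for the business signals.
DEPRECATED_PENALTY = 0.3
OFFICIAL_BOOST = 0.1

def toy_relevance(query, text):
    """Stand-in for a cross-encoder: share of query tokens present in the text."""
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0

def rerank(query, candidates, score_fn=toy_relevance, top_n=50):
    """Re-score only the first top_n candidates; leave the tail in its original order."""
    head, tail = candidates[:top_n], candidates[top_n:]

    def final_score(c):
        score = score_fn(query, c.text)
        if c.deprecated:
            score -= DEPRECATED_PENALTY  # the current doc beats the deprecated one
        if c.official:
            score += OFFICIAL_BOOST      # official docs over forum posts
        return score

    return sorted(head, key=final_score, reverse=True) + tail

# First-pass retrieval order (invented example data).
first_pass = [
    Candidate("forum-post", "how do i reset my api key", official=False),
    Candidate("deprecated-doc", "reset api key legacy dashboard", deprecated=True),
    Candidate("current-doc", "reset api key in settings"),
]

reranked = rerank("reset api key", first_pass)
print([c.doc_id for c in reranked])
```

All three candidates match the query equally well on text alone, so the business signals decide the order: the current official doc moves from third to first. In a real system you would pass an actual cross-encoder's scoring function as score_fn.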

Query understanding: the part before retrieval

The query itself often needs work before you search with it. Users type fragments, misspellings, and ambiguity.

Query understanding handles spelling correction (so a typo'd product name still matches), expansion (adding synonyms and related terms so "laptop" also finds "notebook"), and intent detection — recognizing when a query is a precise lookup ("invoice #4502") versus an exploratory question ("how do refunds work"), and weighting lexical versus semantic retrieval accordingly. A lookup leans lexical; a question leans semantic. Detecting which is which, and tuning the blend per query, is a meaningful slice of the quality.
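One way to sketch the lookup-versus-question decision is a small heuristic that returns a blend weight per query. The regular expression, the question-word list, and the weights below are all assumptions for illustration; a production system would more likely learn this from labeled query logs.

```python
import re

# Rough signals that a query is a precise lookup: "#4502", a long number,
# a version tag like "v2", or a ticket-style id like "ABC-123".
IDENTIFIER = re.compile(r"#\d+|\b\d{3,}\b|\bv\d+\b|\b[A-Z]{2,}-\d+\b")
QUESTION_WORDS = {"how", "why", "what", "when", "where", "can", "does", "do", "is"}

def retrieval_weights(query):
    """Return hypothetical lexical/semantic weights for one query."""
    tokens = query.lower().split()
    if IDENTIFIER.search(query):
        # A lookup leans lexical.
        return {"lexical": 0.8, "semantic": 0.2}
    if tokens and (tokens[0] in QUESTION_WORDS or len(tokens) >= 4):
        # A question leans semantic.
        return {"lexical": 0.3, "semantic": 0.7}
    # No strong signal either way: keep the blend balanced.
    return {"lexical": 0.5, "semantic": 0.5}

for q in ["invoice #4502", "how do refunds work", "cancel subscription"]:
    print(q, retrieval_weights(q))
```

The weights can then scale each retriever's contribution during fusion. A heuristic this crude will misclassify some queries, which is one more reason to score any change against an evaluation set before shipping it.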

For natural-language questions, this is also where you decide whether search returns documents or whether it feeds retrieval into a generated answer. Different products, different choices — but the retrieval underneath is the same hybrid system either way.

Evaluation: the part that makes it engineering

Without measurement, "search quality" is whatever the last person to complain felt. You cannot tune a search system on vibes, and every change is a gamble until you can score it.

Build an evaluation set: real queries paired with their known-correct results, drawn from your actual query logs and spanning the real mix — conceptual, exact-match, typo'd, ambiguous, long-tail. Then track the metrics that capture what users feel:

  • Recall@k: is the right result in the top k at all? If it's not retrieved, ranking can't save it.
  • MRR / NDCG (mean reciprocal rank, normalized discounted cumulative gain): is the right result near the top, where users look? This is where ranking quality shows up.
  • Zero-result rate: how often search returns nothing, the keyword failure made visible.
  • Click-through and reformulation rate: in production, do users click the top results, or do they immediately retype? Reformulation is the loudest signal that search failed.
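The offline metrics in that list are small enough to compute by hand. The sketch below assumes an evaluation set shaped as (query, set of correct doc ids) pairs and a search function that returns a ranked list of ids; the toy_search stub and its canned results are invented so the example runs on its own.

```python
import math

def recall_at_k(ranked, relevant, k):
    """Fraction of the known-correct results that appear in the top k."""
    hits = len(set(ranked[:k]) & set(relevant))
    return hits / len(relevant) if relevant else 0.0

def reciprocal_rank(ranked, relevant):
    """1 / position of the first correct result, or 0 if none was returned."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked, gains, k):
    """NDCG with graded gains given as {doc_id: gain}."""
    dcg = sum(gains.get(d, 0.0) / math.log2(i + 1)
              for i, d in enumerate(ranked[:k], start=1))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 1) for i, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg else 0.0

def evaluate(eval_set, search_fn, k=5):
    """Average the per-query metrics over the whole evaluation set."""
    if not eval_set:
        raise ValueError("evaluation set is empty")
    recalls, rrs, empty = [], [], 0
    for query, relevant in eval_set:
        ranked = search_fn(query)
        if not ranked:
            empty += 1
        recalls.append(recall_at_k(ranked, relevant, k))
        rrs.append(reciprocal_rank(ranked, relevant))
    n = len(eval_set)
    return {
        "recall_at_k": sum(recalls) / n,
        "mrr": sum(rrs) / n,
        "zero_result_rate": empty / n,
    }

# Invented evaluation data and a stub search function.
EVAL_SET = [
    ("cancel subscription", {"ending-your-plan"}),
    ("Cadrium Pro v2", {"cadrium-pro-v2"}),
    ("TLS handshake error 525", {"error-525"}),
]

def toy_search(query):
    canned = {
        "cancel subscription": ["billing-faq", "ending-your-plan"],
        "Cadrium Pro v2": ["cadrium-pro-v2", "cadrium-overview"],
    }
    return canned.get(query, [])

report = evaluate(EVAL_SET, toy_search, k=5)
print(report)
```

Click-through and reformulation rate come from production logs rather than an offline set, so they are not covered here.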

Every change to retrieval, ranking, or query understanding runs against this set before it ships. That's the difference between tuning a search system and guessing at one.

Where semantic search helps and where it hurts

Semantic search is not a universal upgrade. It earns its place on some query types and degrades others.

It helps when queries are conceptual, vocabulary mismatch is common, the corpus is natural-language prose, and users describe what they want in their own words. Support content, documentation, knowledge bases, anything where "ending your plan" needs to match "cancel subscription."

It hurts when queries are precise identifiers — SKUs, codes, names, IDs — where the user wants an exact match and semantic blurring actively buries it. It also adds latency and embedding cost, which isn't worth paying for a corpus where lexical search already nails it. A product catalog searched mostly by exact part number doesn't need semantics first; it needs fast, precise lexical match with semantics as a fallback for the occasional descriptive query.

The right answer is almost never "semantic only." It's hybrid, with the blend tuned to your actual query distribution — which you only know by looking at your logs, not by assuming.

What fixed looks like

A search system that runs lexical and semantic retrieval together, fuses them, re-ranks the top candidates with a cross-encoder and business signals, and understands the query before it searches — correcting spelling, detecting intent, and weighting the methods to match. Exact-match queries return exact matches at the top. Conceptual queries find the right content even when the words don't line up. Nonsense queries return an honest empty state instead of confident garbage.

You have an evaluation set and a dashboard. Every change is scored before it ships. Zero-result rate falls, reformulation rate falls, click-through on top results climbs. Users find what they're looking for, stop assuming the feature doesn't exist, and stop churning over a doc that was there the whole time under a different title.

This is for you if

Search is core to your product or your internal tooling, your users have both precise and conceptual queries, and "search is bad" is a recurring complaint that's costing you usage or churn. The build — hybrid retrieval, a ranking layer, query understanding, and an evaluation harness wired to your real query logs — runs $50k+, scaling toward $100k+ with large corpora, strict latency budgets, and multi-language or multi-tenant requirements.

This is not for you if your corpus is small and your users search by exact term. A well-configured lexical search with synonyms will serve you for a fraction of the cost, and adding semantics buys you latency and a bigger bill, not relevance. It's also not for you if you want to swap in "AI search" as a single embedding lookup and ship it. That's the bolt-on that returns confident nonsense, and we'd be replacing it within the quarter.