Cosine Similarity Decides Whether Your Page Gets Cited

How LLM search engines use embeddings and cosine similarity to choose which pages they cite, and three writing changes that raise your retrieval odds.

When Perplexity or ChatGPT Search handles a question, it converts the query into an embedding, a long list of numbers that places the question at a point in high-dimensional space. Your content lives in the same space, chopped into passages and embedded the same way. The engine measures how closely each passage's vector aligns with the query vector, usually with cosine similarity, and shortlists the passages that sit nearest. Whatever survives the shortlist becomes the raw material for the answer, and for the citation.

Keyword frequency never enters that calculation. A page can repeat "best CRM for SMEs" thirty times and still sit far from the query vector if the surrounding prose is vague. The target moved. Traditional SEO optimized for a string matcher. GEO optimizes for geometry.

How an answer engine finds your page

The retrieval pipeline behind most LLM search products runs in three stages. First, the index. Crawled pages get split into passages, typically a few hundred tokens each, and every passage is embedded into the same vector space the queries use. Second, retrieval. The engine pulls the passages whose vectors score highest against the query, cosine similarity being the standard measure because it compares direction rather than length. Third, generation. A reranker trims the shortlist, and the model writes its answer from what remains, citing the passages it actually used.
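The retrieval stage can be sketched in a few lines. This is a minimal illustration, not any engine's actual code: the three-dimensional vectors and the passage ids are made up, and real embeddings have hundreds or thousands of dimensions. The function names `cosine_similarity` and `top_k` are my own.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)

def top_k(query_vec, passages, k=2):
    """Return the ids of the k passages that score highest against the query."""
    scored = [(cosine_similarity(query_vec, vec), pid) for pid, vec in passages.items()]
    scored.sort(reverse=True)
    return [pid for _, pid in scored[:k]]

# Hypothetical vectors, invented for illustration only.
query = [1.0, 0.2, 0.0]
passages = {
    "crm-pricing": [0.9, 0.3, 0.1],
    "crm-pricing-long": [9.0, 3.0, 1.0],  # same direction, ten times the length
    "company-history": [0.1, 0.2, 0.9],
}

for pid, vec in passages.items():
    print(pid, round(cosine_similarity(query, vec), 3))
print(top_k(query, passages))
```

The two pricing vectors score the same because they point the same way. Scaling a vector changes its length, not its direction, which is why cosine similarity is a natural fit for comparing passages of different sizes.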

Two consequences follow for anyone writing content.

The unit of competition is the passage, not the page. An engine doesn't retrieve your 2,000-word guide. It retrieves the 300-token fragment of it that aligned with the query. A page full of strong, self-contained sections beats a page with one strong introduction and ten sections that only make sense after reading it.

And alignment is semantic, so sloppy prose has a measurable cost. Every token in a passage moves its embedding. Filler moves it away from the concept you want to be found for.
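To see what "the passage is the unit" means in practice, here is a rough sketch of how a page might be chunked before embedding. It assumes whitespace-separated words as a stand-in for tokens and a fixed 300-token window. Real engines use subword tokenizers and chunking schemes they don't publish, so treat `split_into_passages` as a hypothetical helper, not a description of any product.

```python
def split_into_passages(text, max_tokens=300):
    """Greedy split on whitespace tokens; a crude stand-in for a real chunker."""
    words = text.split()
    return [
        " ".join(words[i:i + max_tokens])
        for i in range(0, len(words), max_tokens)
    ]

# A placeholder 2,000-word guide.
guide = " ".join(f"word{i}" for i in range(2000))
chunks = split_into_passages(guide)

print(len(chunks))             # number of passages competing separately
print(len(chunks[-1].split())) # size of the final, shorter passage
```

Each chunk is embedded and scored on its own. A passage that leans on context from an earlier chunk loses that context at retrieval time.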

Three ways to engineer the distance

| Tactic | Retrieval failure it prevents |
| --- | --- |
| Answer in the first 150 tokens | The model skips material buried mid-passage |
| Cut filler and passive voice | Mixed-signal embeddings that sit between topics and match nothing well |
| Modifiers next to their entities | Attributes attached to the wrong product or company |

The first 150 tokens carry the section

Bottom line up front. The format is older than the internet (the US military drilled BLUF into memo writing decades ago), and embeddings gave it a second justification. Liu et al. showed in "Lost in the Middle" (TACL, 2024) that language models use information at the start or end of a context far more reliably than information buried in the middle, where accuracy sags even for models with long context windows. A section that opens with its conclusion gets read by the machine. A section that builds suspense gets skimmed past.

So state the dense, complete version of the answer in the opening sentences of each section, then elaborate. If the heading asks a question, the first sentence after it answers the question.
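A quick way to check this on your own pages is to pull out the opening of each section and see whether the answer is in it. The sketch below approximates tokens with whitespace-separated words, which undercounts compared with a real subword tokenizer. `opening_tokens` and `answer_is_front_loaded` are hypothetical helpers, and the key terms are ones you choose by hand.

```python
def opening_tokens(section_text, limit=150):
    """Return roughly the first `limit` tokens, using whitespace words as tokens."""
    words = section_text.split()
    return " ".join(words[:limit])

def answer_is_front_loaded(section_text, key_terms, limit=150):
    """True if every key term appears within the opening of the section."""
    opening = opening_tokens(section_text, limit).lower()
    return all(term.lower() in opening for term in key_terms)

answer = "Caching cut median response time from 480ms to 210ms."
padding = "background " * 200

front_loaded = answer + " " + padding
buried = padding + answer

print(answer_is_front_loaded(front_loaded, ["caching", "210ms"]))
print(answer_is_front_loaded(buried, ["caching", "210ms"]))
```

A substring check is blunt. It tells you the facts are present in the opening, not that the opening reads as a complete answer, so a human read is still the real test.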

Cut the words that blur the vector

An embedding is a weighted blend of everything in the passage. Conversational filler, hedges, and redundant adverbs all pull the blend toward generic conversational space and away from your topic. The fix is mechanical. Strip the throat-clearing, prefer active voice, and make every sentence point in the same conceptual direction.

Compare the two versions.

Before: It could perhaps be argued that, generally speaking, response times might be improved somewhat by caching.

After: Caching cut median response time from 480ms to 210ms.

The second sentence is shorter, names an agent, and carries two retrievable facts. Its vector points at caching performance. The first points at hedging.
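The blending effect can be illustrated with a toy model. Everything here is an assumption made for the demonstration: the two-dimensional word vectors are invented (axis 0 stands for "caching performance", axis 1 for "generic hedging"), and averaging word vectors is a deliberate simplification. Production embedding models are neural networks, not simple averages, so this shows the intuition and nothing more.

```python
import math

# Hypothetical word vectors: (caching-performance axis, generic-hedging axis).
TOY_VECTORS = {
    "caching": (1.0, 0.0),
    "median": (0.9, 0.1),
    "response": (0.9, 0.1),
    "time": (0.8, 0.2),
    "perhaps": (0.0, 1.0),
    "generally": (0.0, 1.0),
    "somewhat": (0.0, 1.0),
    "argued": (0.1, 0.9),
}

def embed(text):
    """Average the toy vectors of the known words; unknown words are ignored."""
    vecs = [TOY_VECTORS[w] for w in text.lower().split() if w in TOY_VECTORS]
    if not vecs:
        return (0.0, 0.0)
    n = len(vecs)
    return (sum(v[0] for v in vecs) / n, sum(v[1] for v in vecs) / n)

def cosine(a, b):
    """Cosine similarity between two 2-D vectors."""
    dot = a[0] * b[0] + a[1] * b[1]
    norms = math.hypot(*a) * math.hypot(*b)
    return dot / norms if norms else 0.0

query = embed("caching response time")
hedged = embed(
    "it could perhaps be argued that generally response time "
    "might be improved somewhat by caching"
)
direct = embed("caching cut median response time")

print("hedged:", round(cosine(query, hedged), 3))
print("direct:", round(cosine(query, direct), 3))
```

In this toy setup the hedging words drag the averaged vector toward the second axis, so the hedged sentence scores lower against the query even though it mentions the same topic.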

Modifiers drift when clauses run long

Embedding models associate words that appear near each other. Separate a modifier from its entity with a long relative clause and the model can bind the attribute to the wrong thing. A sentence like "Acme's platform, which unlike the legacy tools most enterprises still run was rebuilt in 2021, encrypts data at rest" invites the model to associate "legacy" with Acme. Split it. "Acme rebuilt its platform in 2021. It encrypts data at rest." Proximity is the binding signal, so keep claims short and adjacent to their subjects.

Where the vector talk stops helping

Some honesty from the practitioner side. Most of what works here is disciplined writing that good editors demanded long before anyone embedded a sentence. The math explains why it works and lets you stop arguing about taste. It doesn't unlock secret tricks.

Be skeptical of anyone selling "vector optimization" as a proprietary audit. You can't inspect the embedding model behind Perplexity or ChatGPT Search, and chunking schemes change without notice. What you can do is write self-contained, front-loaded, low-noise passages and then measure citations directly. The best-known controlled study in this space, Aggarwal et al.'s GEO paper (KDD, 2024), found that adding citations, quotations, and statistics lifted visibility in generative answers by up to 40% across their benchmark. Keyword stuffing, tested in the same study, underperformed all three sourcing tactics.

The practical next step is an audit of your highest-value pages at the passage level. Take each section, read its first 150 tokens, and ask whether they would stand alone as an answer. If your pages also depend on JavaScript to render, first check what AI crawlers actually extract. The free AI Visibility Check on this site shows the gap between what humans see and what the machines embed. If you want a second pair of eyes on which pages are losing the retrieval race, that is what my consultancy work is for.

