RAG Architecture Lessons from Practice

“What’s in my biography” failed to find the biography file. “Show me my biography” worked fine. That apostrophe broke everything.

Building ragbot, my open-source personal RAG system, with Claude Code taught me lessons that tutorials skip. The standard architecture — chunk documents, embed the chunks, retrieve similar chunks — works in demos. Production use reveals the gaps.

Contractions are more important than expected

The query “what’s in my biography” failed to retrieve the biography file. Changing it to “show me my biography” worked.

The difference? The apostrophe in “what’s”.

MiniLM embeddings (like most embedding models) tokenize “what’s” differently from “what is”, so the contraction produced a meaningfully different embedding. Keyword matching also failed: the keyword index treats “what’s” as a single term, so it never matches the standalone word “what”.

The fix was embarrassingly simple. I directed Claude to add normalization:

query = query.replace("what's", "what is")

Similar normalizations for common contractions (“can’t”, “won’t”, “it’s”) eliminated a class of failures that had seemed mysterious before diagnosis.
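
A minimal sketch of what that normalization pass can look like, assuming a small lookup table; the names here (CONTRACTIONS, normalize_query) are illustrative rather than ragbot’s actual code:

# Illustrative normalization pass; the real implementation may differ.
CONTRACTIONS = {
    "what's": "what is",
    "can't": "cannot",
    "won't": "will not",
    "it's": "it is",
}

def normalize_query(query: str) -> str:
    """Expand common contractions so embedding and keyword search see canonical forms."""
    normalized = query.lower()
    for contraction, expansion in CONTRACTIONS.items():
        normalized = normalized.replace(contraction, expansion)
    return normalized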

Lesson: text normalization matters for hybrid search. Linguistic variations that seem minor can completely break retrieval.

Not everything needs an LLM

To classify queries as “document lookup” vs “information synthesis,” I initially assumed I’d need an LLM call. Understanding intent seemed like a language understanding problem.

It’s not. I asked Claude what alternatives to an LLM call might work for intent classification. Claude suggested pattern matching:

DOCUMENT_LOOKUP_PATTERNS = [
    r"^show\s+(?:me\s+)?(?:my\s+|the\s+)?(.+)$",
    r"^what(?:'s| is)\s+in\s+(?:my\s+|the\s+)?(.+)$",
    r"^(?:get|fetch|retrieve)\s+(?:my\s+|the\s+)?(.+)$",
]

These patterns catch most document lookup queries. They’re fast — no API call, no latency. They’re deterministic — same query always classifies the same way. They’re cheaper — no token costs.

When the patterns don’t match, fall back to the general retrieval path. But for the common cases, simple heuristics outperform LLM classification on speed, cost, and consistency.
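
As a sketch, assuming a helper named classify_query (an illustrative name, not ragbot’s actual API), the classifier can simply loop over those patterns and return the captured document name, or None to signal the fallback path:

import re

def classify_query(query: str):
    """Return the requested document name for lookup queries, else None."""
    cleaned = query.strip().lower().rstrip("?.!")
    for pattern in DOCUMENT_LOOKUP_PATTERNS:
        match = re.match(pattern, cleaned)
        if match:
            return match.group(1).strip()  # e.g. "biography"
    return None  # no match: fall back to general retrieval

Here classify_query("Show me my biography?") would return "biography"; anything unmatched falls through to the synthesis path.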

The instinct to reach for an LLM whenever language is involved is often wrong. Ask first: can a pattern or rule handle this? LLMs are for the residual complexity that simple methods can’t address.

Full document retrieval should be first-class

Standard RAG returns chunks. You ask a question, the system finds relevant fragments, and the LLM synthesizes an answer from those fragments.

This works for questions like “How do I configure authentication?” where the answer might span multiple documents or exist in a subsection of a larger document.

It fails for requests like “Show me my biography.”

When a user asks for a specific document, they want the whole document, not the three most relevant chunks. Chunked retrieval produces partial, disjointed results.

I asked Claude for options on how to handle this. Claude proposed a dual-path architecture:

  1. Information synthesis path — Standard chunked retrieval. Used for questions requiring synthesis across sources.

  2. Document lookup path — Full document retrieval. Used when the user names a specific document.

Query classification (using those simple patterns) routes requests to the appropriate path. Document lookup bypasses chunking entirely and returns the complete file.
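
A sketch of that router, reusing the illustrative helpers above and assuming hypothetical load_full_document and retrieve_chunks functions (not ragbot’s actual API):

def retrieve_context(query: str, budget_tokens: int = 16_000) -> str:
    """Route between the document lookup path and the synthesis path."""
    doc_name = classify_query(normalize_query(query))
    if doc_name is not None:
        # Document lookup path: bypass chunking and return the whole file.
        return load_full_document(doc_name)
    # Information synthesis path: standard chunked retrieval.
    chunks = retrieve_chunks(query, budget_tokens=budget_tokens)
    return "\n\n".join(chunk.text for chunk in chunks)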

This seems obvious in retrospect. But the standard RAG architecture doesn’t include it, and I had to learn through user frustration why it mattered.

Context budget is usually too conservative

Claude initially implemented 2K tokens of retrieved context. That’s what some tutorials suggested.

Modern models have 200K+ token context windows. 2K is 1% of available capacity.

I directed Claude to increase to 16K — still less than 10% of available context. The results improved immediately:

  • More complete document coverage
  • Less information loss from truncation
  • Answers that felt more comprehensive

The cost increase was negligible for personal use. Even for production systems, the difference between 2K and 16K is marginal compared to overall API costs.

Don’t over-optimize context budget in quality-first applications. The models can handle much more than default configurations provide. If quality matters more than cost, give the model more context.
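
One way to apply such a budget, sketched here with a hypothetical count_tokens helper and chunks assumed to arrive ranked best-first:

def pack_context(chunks, budget_tokens: int = 16_000):
    """Greedily pack ranked chunks until the token budget is spent."""
    packed, used = [], 0
    for chunk in chunks:
        cost = count_tokens(chunk.text)  # hypothetical tokenizer helper
        if used + cost > budget_tokens:
            break
        packed.append(chunk)
        used += cost
    return packed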

Re-ranking parameters need tuning

Semantic similarity isn’t enough. A chunk might be semantically similar to the query but not actually what the user wants.

When someone asks for “my biography,” a chunk from a biography file should rank higher than a chunk from a different file that happens to mention biographical details.

I asked Claude to implement re-ranking boosts:

  • Filename match: +0.5 per matching term
  • Title match: +0.3 per matching term

Claude’s initial values (0.2 and 0.15) weren’t strong enough. Semantically similar but wrong documents still outranked exact filename matches.

Tuning is required. Start with higher boosts for explicit matches — filename, title, section headers — since users mentioning these expect exact matches. Adjust based on observed failures.
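
A sketch of how those boosts can combine with the similarity score, using the tuned values above; the chunk attributes (similarity, filename, title) are illustrative:

FILENAME_BOOST = 0.5  # per matching query term
TITLE_BOOST = 0.3     # per matching query term

def rerank_score(chunk, query_terms) -> float:
    """Add explicit filename/title boosts on top of semantic similarity."""
    score = chunk.similarity  # cosine similarity from the embedding search
    for term in query_terms:
        if term in chunk.filename.lower():
            score += FILENAME_BOOST
        if term in chunk.title.lower():
            score += TITLE_BOOST
    return score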

Defaults must update everywhere

After Claude implemented 16K context in the core retrieval module, the web UI still showed 2K and returned truncated results.

Claude had updated the default value in only two places. It existed in five:

  1. Core retrieval function — updated
  2. Chat function — updated
  3. Streaming chat kwargs — missed
  4. Pydantic request model — missed
  5. React state initialization — missed

Missing three of the five is failure. The system behaves according to whichever code path doesn’t get the update.

The diagnostic:

grep -r "2000" --include="*.py" --include="*.tsx" | grep -i "rag\|token\|context"

Then trace the data flow from UI to API to backend to library. Test from the UI, not just unit tests — unit tests often use defaults that bypass intermediate layers.

Full-stack applications require full-stack awareness when changing defaults. Grep is your friend.
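
One mitigation, sketched here as an assumption rather than how ragbot is structured, is to define the default once and have the other layers import or fetch it:

# config.py (hypothetical): single source of truth for the context budget.
RAG_CONTEXT_TOKENS = 16_000

# Other Python layers import it instead of repeating the literal:
#   from config import RAG_CONTEXT_TOKENS
#   def retrieve(query: str, budget_tokens: int = RAG_CONTEXT_TOKENS): ...
# The frontend can read the same value from an API response rather than
# hard-coding it in React state.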

What I’d do differently

Start with full document mode from the beginning. The information-synthesis-only architecture was a false assumption from tutorials.

Add query preprocessing earlier. The contraction issue was obvious once diagnosed but took too long to discover.

Test with real user queries sooner. “What’s” vs “what is” only emerged from actual usage, not synthetic test cases.

Patterns that generalized

Several lessons apply beyond RAG:

Simple heuristics before LLMs. Query classification didn’t need machine learning. Many problems don’t.

Test edge cases in natural language. Contractions, typos, synonyms — the variations users actually produce.

Trace data flow on configuration changes. Any default that exists in multiple places must be updated in all of them.

Give the system more capacity than you think it needs. Modern models handle large contexts. Don’t artificially constrain them based on outdated guidance.

RAG architectures are still evolving. The patterns that work come from practice, not theory.


This article is part of the synthesis coding series. For related content on AI pipelines, see Data Format Contracts for AI Pipelines.


Rajiv Pant is President of Flatiron Software and Snapshot AI, where he leads organizational growth and AI innovation. He is former Chief Product & Technology Officer at The Wall Street Journal, The New York Times, and Hearst Magazines. Earlier in his career, he headed technology for Condé Nast’s brands including Reddit. Rajiv coined the terms “synthesis engineering” and “synthesis coding” to describe the systematic integration of human expertise with AI capabilities in professional software development. Connect with him on LinkedIn or read more at rajiv.com.