A language model knows only what was present in its training data and what is supplied in the current prompt. For most practical applications—answering questions about internal documentation, summarizing a company’s policies, or reasoning over a knowledge base—neither source is sufficient on its own. Retrieval-augmented generation, commonly abbreviated as RAG, addresses this gap by fetching relevant information at query time and placing it into the prompt. This article explains how RAG works, when it is the appropriate choice, and the components required to build it.
The Core Idea
The central insight of RAG is that a model does not need to memorize your data; it needs to be handed the right portion of that data at the moment of a query. Rather than fine-tuning a model on a corpus, an expensive process that must be repeated whenever the corpus changes, RAG keeps the data external and retrieves only the passages relevant to the current question. The model then generates an answer grounded in those passages. The result is a system that can cite current information, can be updated by editing the underlying documents and re-indexing them, and steers the model toward the supplied source material rather than its own recollection.
Why Not Just Put Everything in the Prompt
Modern models accept large context windows, which raises a reasonable question: why retrieve at all, rather than including the entire corpus in every prompt? Two constraints make this impractical. First, cost and latency scale with the number of input tokens, so submitting an entire knowledge base on every request is wasteful. Second, answer quality tends to degrade as irrelevant material accumulates; a model handed fifty documents to answer a question about one of them will often perform worse than a model handed the single pertinent passage. Retrieval is, in effect, a relevance filter that improves both economy and accuracy.
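The cost argument can be made concrete with a little arithmetic. The sketch below compares the input cost of one request that carries the whole corpus with one that carries only a few retrieved chunks. Every number in it (the price per million tokens, the corpus size, the chunk and question sizes) is an illustrative assumption rather than a real price or measurement, and input_cost_usd is a hypothetical helper.

```python
def input_cost_usd(tokens: int, usd_per_million_tokens: float) -> float:
    """Cost of sending `tokens` input tokens at a flat per-token price."""
    return tokens / 1_000_000 * usd_per_million_tokens


# All figures below are illustrative assumptions, not real prices or measurements.
PRICE = 3.0                # hypothetical USD per million input tokens
CORPUS_TOKENS = 500_000    # hypothetical size of the whole knowledge base
RETRIEVED_TOKENS = 2_000   # hypothetical size of a few retrieved chunks
QUESTION_TOKENS = 50       # hypothetical size of the user's question

# Cost per request when the whole corpus is sent versus only retrieved chunks.
full = input_cost_usd(CORPUS_TOKENS + QUESTION_TOKENS, PRICE)
rag = input_cost_usd(RETRIEVED_TOKENS + QUESTION_TOKENS, PRICE)

print(f"full corpus per request:      ${full:.4f}")
print(f"retrieved chunks per request: ${rag:.4f}")
print(f"ratio: {full / rag:.0f}x")
```

Under these made-up figures the full-corpus request costs roughly 240 times as much per query as the retrieved one, and the gap widens as the corpus grows while the retrieved context stays the same size.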
The Components of a Pipeline
A working RAG system comprises several stages. First, documents are split into chunks of manageable size, because retrieval operates on passages rather than whole files. Each chunk is then converted into a numerical representation, called an embedding, that captures its meaning; both OpenAI and Anthropic describe embedding-based retrieval in their developer documentation. These embeddings are stored in a vector index that supports similarity search. At query time, the user’s question is embedded with the same embedding model used for the chunks, the index returns the most similar chunks, and those chunks are inserted into the prompt alongside the question. The language model generates its answer from this assembled context.
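The stages above can be sketched end to end with the standard library alone. This is a toy under stated assumptions, not a production design: embed here is a word-count stand-in for a real embedding model, VectorIndex is a hypothetical in-memory list scanned linearly rather than a real vector database, and the sample documents are invented. The final step only assembles the prompt; sending it to a language model is left out.

```python
import math
import re
from collections import Counter


def chunk_text(text: str, max_words: int = 40) -> list[str]:
    """Split text into consecutive chunks of at most `max_words` words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts.
    A real system would call an embedding model here instead."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class VectorIndex:
    """Hypothetical in-memory index: stores (chunk, vector) pairs and scans them all."""

    def __init__(self) -> None:
        self.entries: list[tuple[str, Counter]] = []

    def add(self, chunk: str) -> None:
        self.entries.append((chunk, embed(chunk)))

    def search(self, query: str, k: int = 2) -> list[str]:
        query_vec = embed(query)  # same "model" as used for the chunks
        ranked = sorted(self.entries, key=lambda e: cosine(query_vec, e[1]), reverse=True)
        return [chunk for chunk, _ in ranked[:k]]


# Invented sample documents, for illustration only.
documents = [
    "Refunds are issued within 14 days of a return request. Contact billing to start a refund.",
    "The office is closed on public holidays. Remote work requires manager approval.",
    "Passwords must be rotated every 90 days. Use the identity portal to reset a password.",
]

# Indexing: chunk each document, embed each chunk, store it.
index = VectorIndex()
for doc in documents:
    for chunk in chunk_text(doc, max_words=40):
        index.add(chunk)

# Query time: embed the question, retrieve the closest chunk, assemble the prompt.
question = "How do I reset my password?"
top_chunks = index.search(question, k=1)
prompt = "Context:\n" + "\n".join(top_chunks) + f"\n\nQuestion: {question}"
print(prompt)
```

Word-count vectors only match on shared vocabulary, so this sketch would miss a paraphrased question; capturing meaning beyond exact words is what a learned embedding model contributes in a real pipeline.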
Practical Considerations
The quality of a RAG system depends heavily on choices that are easy to overlook. Chunk size involves a trade-off: chunks that are too small fragment ideas across passages, while chunks that are too large dilute relevance and consume context budget. The instruction given to the model matters as well; it should direct the model to answer only from the supplied passages and to state plainly when the passages do not contain the answer, which curtails fabrication. Including the source of each passage in the prompt allows the system to present citations, an important feature where users must verify claims.
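The two prompt-related points, restricting the model to the supplied passages and carrying each passage's source, can be illustrated with a small prompt builder. This is a minimal sketch: the Passage type, the build_prompt function, the wording of INSTRUCTION, and the file names are all hypothetical, and the exact phrasing that works best will vary by model.

```python
from dataclasses import dataclass


@dataclass
class Passage:
    source: str  # hypothetical identifier, such as a file name or URL
    text: str


# Illustrative wording only; tune the instruction for the model in use.
INSTRUCTION = (
    "Answer using only the passages below. "
    "Cite the source label in square brackets after each claim. "
    "If the passages do not contain the answer, say that you do not know."
)


def build_prompt(question: str, passages: list[Passage]) -> str:
    """Assemble instruction, source-labelled passages, and the question."""
    blocks = [f"[{p.source}]\n{p.text}" for p in passages]
    return f"{INSTRUCTION}\n\nPassages:\n" + "\n\n".join(blocks) + f"\n\nQuestion: {question}"


# Invented passages, for illustration only.
passages = [
    Passage("handbook.txt", "Passwords must be rotated every 90 days."),
    Passage("it-faq.txt", "Use the identity portal to reset a password."),
]
prompt = build_prompt("How often must passwords be rotated?", passages)
print(prompt)
```

Because each passage arrives with its label, the application can map a bracketed citation in the answer back to the original document and show it to the user.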
When RAG Is Appropriate
RAG is the right tool when answers must be grounded in a specific, changing body of knowledge that the model was not trained on—internal wikis, product manuals, legal documents, or support archives. It is less suitable when the task requires the model to learn a new skill or output style rather than to recall facts; that is the province of fine-tuning. In my work, the majority of practical knowledge-base applications are served by RAG precisely because the underlying documents change frequently and must remain authoritative.
Conclusion
Retrieval-augmented generation reconciles a fixed, pre-trained model with the dynamic, proprietary data that real applications depend upon. By retrieving relevant passages at query time and grounding generation in them, RAG delivers current, verifiable answers without the cost of retraining. Understanding its components—chunking, embedding, indexing, and retrieval—and the trade-offs within each is what separates a fragile prototype from a system that answers reliably from your own data.