What is RAG (Retrieval-Augmented Generation)?

RAG (Retrieval-Augmented Generation)

A technique that enhances AI model responses by retrieving relevant information from external knowledge sources and including it in the prompt, reducing hallucinations and enabling access to up-to-date or private data.

Retrieval-augmented generation (RAG) combines the generative capabilities of language models with information retrieval from external knowledge bases. Instead of relying solely on what the model learned during training, RAG systems first search for relevant documents or data, then include that information in the prompt as context for the model to reference when generating its response. This approach grounds the model's output in actual source material.

A typical RAG pipeline works in several steps. Documents are first chunked into manageable pieces and converted into embeddings stored in a vector database. When a user asks a question, the system embeds the query, searches the vector database for the most semantically similar document chunks, retrieves the top results, and prepends them to the prompt as context. The language model then generates a response based on both the question and the retrieved information, often citing its sources.
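The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production design: the bag-of-words `embed` function stands in for a real embedding model, the in-memory list stands in for a vector database, and all function names, document ids, and sample texts are hypothetical. The final prompt would be sent to a language model, which is omitted here.

```python
import math
import re
from collections import Counter


def chunk_text(text, max_words=40):
    """Split text into fixed-size chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text):
    """Toy embedding: term counts. A real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(documents):
    """Chunk every document and store (doc_id, chunk, embedding) entries."""
    return [
        (doc_id, chunk, embed(chunk))
        for doc_id, text in documents.items()
        for chunk in chunk_text(text)
    ]


def retrieve(index, query, top_k=2):
    """Embed the query and return the top_k most similar chunks."""
    query_vec = embed(query)
    ranked = sorted(index, key=lambda entry: cosine(query_vec, entry[2]), reverse=True)
    return [(doc_id, chunk) for doc_id, chunk, _ in ranked[:top_k]]


def build_prompt(query, hits):
    """Prepend the retrieved chunks to the question as labeled context."""
    context = "\n".join(f"[{doc_id}] {chunk}" for doc_id, chunk in hits)
    return (
        "Answer using only the context below and cite sources by their bracketed id.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )


# Hypothetical sample data for illustration only.
documents = {
    "vacation-policy": "Employees accrue vacation days monthly and may carry over five unused vacation days.",
    "expense-policy": "Expense reports must be submitted within thirty days with receipts attached.",
}
question = "How many vacation days can I carry over?"
index = build_index(documents)
hits = retrieve(index, question)
prompt = build_prompt(question, hits)
```

Labeling each chunk with its source id in the prompt is what makes it possible for the model to cite where an answer came from.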

RAG addresses several critical problems. Language models have a knowledge cutoff: they do not know about events after their training data was collected, and RAG gives them access to current information. Models sometimes hallucinate plausible-sounding but incorrect facts; RAG reduces this by grounding responses in real documents. Models cannot access private or proprietary data on their own; RAG enables question answering over internal company documents, databases, and knowledge bases without fine-tuning.

Building an effective RAG system involves many design decisions: chunking strategy (size, overlap, semantic vs. fixed-size), embedding model selection, retrieval algorithm (vector search, hybrid search with BM25, re-ranking), the number of chunks to retrieve, and how to format the context in the prompt. Advanced techniques include multi-step retrieval (using initial results to refine the search), hypothetical document embeddings (HyDE), and agentic RAG, where the model decides when and what to retrieve. Despite its complexity, RAG is one of the most common patterns for building production AI applications that require access to specific knowledge bases.
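Two of these decisions are easy to illustrate in isolation. The sketch below shows fixed-size chunking with overlap, and one possible way to merge a vector-search ranking with a BM25 ranking for hybrid search. Reciprocal rank fusion is used here as an assumption; it is only one of several ways to combine rankings. The function names, the sample chunk ids, and the constant `k=60` are illustrative choices, not requirements.

```python
def chunk_with_overlap(words, size, overlap):
    """Fixed-size chunking: each chunk shares `overlap` words with the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + size])
        if start + size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of ids; items ranked high in any list score well."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical rankings from a vector search and a BM25 keyword search.
vector_ranking = ["c1", "c2", "c3"]
bm25_ranking = ["c3", "c1", "c4"]
fused = reciprocal_rank_fusion([vector_ranking, bm25_ranking])

chunks = chunk_with_overlap(list("abcdefghij"), size=4, overlap=1)
```

Overlap trades a larger index for a lower chance that a relevant sentence is split across a chunk boundary, and rank-based fusion avoids having to normalize similarity scores that come from different retrieval methods.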
