# Tutorial 08: Basic RAG (Retrieval-Augmented Generation)
This tutorial introduces the foundational RAG pattern - combining document retrieval with LLM generation for grounded, accurate responses.
## Overview
RAG (Retrieval-Augmented Generation) enhances LLM responses by:
- Retrieving relevant documents for the user's question
- Augmenting the prompt with retrieved context
- Generating an answer grounded in the retrieved information
This addresses key LLM limitations:
- Knowledge cutoffs (outdated information)
- Hallucinations (making up facts)
- Lack of domain-specific knowledge
## Architecture

### RAG Pipeline Components
#### 1. Document Loading
Load documents from various formats:
```python
from langgraph_ollama_local.rag import DocumentLoader

loader = DocumentLoader()

# Load single file
docs = loader.load_pdf("paper.pdf")

# Load directory
docs = loader.load_directory("sources/")
```

Supported formats:
- PDF (`.pdf`)
- Text (`.txt`)
- Markdown (`.md`, `.markdown`)
#### 2. Document Chunking
Split documents into searchable pieces:
```python
from langgraph_ollama_local.rag import DocumentIndexer
from langgraph_ollama_local.rag.indexer import IndexerConfig

config = IndexerConfig(
    chunk_size=1000,     # Characters per chunk
    chunk_overlap=200,   # Overlap between chunks
)

indexer = DocumentIndexer(config=config)
chunks = indexer.chunk_documents(documents)
```

Chunking considerations:
| Parameter | Recommendation | Trade-off |
|---|---|---|
| `chunk_size` | 500-1500 chars | Larger = more context, but less precise |
| `chunk_overlap` | 10-20% of size | More overlap = better continuity |
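
To see what this chunking looks like outside the project's wrapper, LangChain's `RecursiveCharacterTextSplitter` performs the same character-based splitting. The sketch below is illustrative; the assumption that `DocumentIndexer` wraps a splitter like this is ours, not stated by the library:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

# ~1000-character chunks with 200-character (20%) overlap
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(documents)
print(f"{len(documents)} documents -> {len(chunks)} chunks")
```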
#### 3. Embeddings
Convert text to vectors using sentence-transformers:
```python
from langgraph_ollama_local.rag import LocalEmbeddings

embeddings = LocalEmbeddings(model_name="all-mpnet-base-v2")

# Embed documents
vectors = embeddings.embed_documents(["text1", "text2"])

# Embed query
query_vector = embeddings.embed_query("What is RAG?")
```

Available models:
| Model | Dimensions | Quality | Size |
|---|---|---|---|
| `all-mpnet-base-v2` | 768 | High | 420MB |
| `all-MiniLM-L6-v2` | 384 | Good | 90MB |
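
Retrieval works by comparing the query vector against each chunk vector with cosine similarity. A minimal NumPy sketch of that comparison, reusing the `embeddings` object above (the vector store normally does this scoring for you):

```python
import numpy as np

doc_vectors = np.array(embeddings.embed_documents([
    "RAG retrieves documents before generating an answer.",
    "Cats sleep most of the day.",
]))
query_vector = np.array(embeddings.embed_query("What is RAG?"))

# Cosine similarity: dot product of L2-normalized vectors
doc_norm = doc_vectors / np.linalg.norm(doc_vectors, axis=1, keepdims=True)
query_norm = query_vector / np.linalg.norm(query_vector)
scores = doc_norm @ query_norm

print(scores)           # Higher score = more semantically similar
print(scores.argmax())  # Index of the best-matching chunk
```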
#### 4. Vector Storage (ChromaDB)
Store and query embeddings:
```python
# Index documents
indexer.index_documents(chunks)

# Query later
from langgraph_ollama_local.rag import LocalRetriever

retriever = LocalRetriever()
results = retriever.retrieve("query", k=4)
```

ChromaDB features:
- Persistent storage (survives restarts)
- Cosine similarity search
- Metadata filtering
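
Metadata filtering can also be exercised against the ChromaDB collection directly. A minimal sketch, assuming the default `.chromadb` persist directory and `documents` collection name from the Configuration section, and reusing the `embeddings` object from the Embeddings step (the `filename` metadata key mirrors the citation example later in this tutorial):

```python
import chromadb

client = chromadb.PersistentClient(path=".chromadb")
collection = client.get_or_create_collection("documents")

# Restrict the similarity search to chunks from a single source file
results = collection.query(
    query_embeddings=[embeddings.embed_query("What is RAG?")],
    n_results=4,
    where={"filename": "paper.pdf"},
)
print(results["documents"][0])
```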
#### 5. RAG Generation
Combine retrieval with LLM generation:
```python
RAG_PROMPT = """Answer based on the context.

Context:
{context}

Question: {question}

Answer:"""
```

## Complete Implementation
### State Definition
```python
from typing import List
from typing_extensions import TypedDict
from langchain_core.documents import Document


class RAGState(TypedDict):
    question: str              # User's question
    documents: List[Document]  # Retrieved documents
    generation: str            # Generated answer
```

### Node Functions
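
The node functions below assume a `retriever`, a `rag_prompt`, and an `llm` already exist in scope. A minimal setup sketch, assuming `ChatOllama` from `langchain-ollama` as the local model (the model name is a placeholder) and the `RAG_PROMPT` string defined above:

```python
from langchain_core.prompts import ChatPromptTemplate
from langchain_ollama import ChatOllama
from langgraph_ollama_local.rag import LocalRetriever

retriever = LocalRetriever()
rag_prompt = ChatPromptTemplate.from_template(RAG_PROMPT)
llm = ChatOllama(model="llama3.1", temperature=0)  # temperature=0 for factual answers
```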
```python
def retrieve(state: RAGState) -> dict:
    """Retrieve relevant documents."""
    docs = retriever.retrieve_documents(state["question"], k=4)
    return {"documents": docs}


def generate(state: RAGState) -> dict:
    """Generate answer using context."""
    context = "\n\n".join([d.page_content for d in state["documents"]])
    messages = rag_prompt.format_messages(
        context=context,
        question=state["question"]
    )
    response = llm.invoke(messages)
    return {"generation": response.content}
```

### Graph Construction
```python
from langgraph.graph import StateGraph, START, END

graph = StateGraph(RAGState)
graph.add_node("retrieve", retrieve)
graph.add_node("generate", generate)

graph.add_edge(START, "retrieve")
graph.add_edge("retrieve", "generate")
graph.add_edge("generate", END)

rag_app = graph.compile()
```

### Usage
```python
result = rag_app.invoke({"question": "What is Self-RAG?"})
print(result["generation"])
```

## Adding Source Citations
Track and display sources for transparency:
```python
def format_sources(documents: List[Document]) -> str:
    """Format sources for citation."""
    sources = []
    for doc in documents:
        filename = doc.metadata.get('filename', 'Unknown')
        page = doc.metadata.get('page', '')
        if page:
            sources.append(f"- {filename} (page {page})")
        else:
            sources.append(f"- {filename}")
    return "\n".join(sources)
```
One-time setup to index your documents:
```python
from langgraph_ollama_local.rag import DocumentIndexer, DocumentLoader

# 1. Load documents
loader = DocumentLoader()
docs = loader.load_directory("sources/")

# 2. Create indexer
indexer = DocumentIndexer()

# 3. Index (chunks, embeds, and stores)
indexer.index_directory("sources/")

# 4. Check stats
print(indexer.get_stats())
```

## Configuration
Environment variables for customization:
```bash
# .env
RAG_CHUNK_SIZE=1000
RAG_CHUNK_OVERLAP=200
RAG_COLLECTION_NAME=documents
RAG_PERSIST_DIRECTORY=.chromadb
EMBEDDING_MODEL_NAME=all-mpnet-base-v2
```

## Limitations
Basic RAG has limitations addressed in later tutorials:
| Limitation | Solution | Tutorial |
|---|---|---|
| No relevance check | Document grading | Self-RAG (09) |
| Hallucinations | Answer grading | Self-RAG (09) |
| Retrieval failures | Web search fallback | CRAG (10) |
| Single strategy | Query routing | Adaptive RAG (11) |
| Single retrieval | Multi-step retrieval | Agentic RAG (12) |
## Best Practices
- Chunk size tuning: Start with 1000 chars, adjust based on results
- Overlap: Use 10-20% overlap to maintain context
- K value: Start with k=4, increase for complex questions
- Temperature: Use 0 for factual RAG, higher for creative tasks
- Prompt engineering: Be explicit about using only the context
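
Taken together, these practices translate into defaults like the following sketch (the `ChatOllama` model name is a placeholder; the other values come from this tutorial):

```python
from langgraph_ollama_local.rag import DocumentIndexer, LocalRetriever
from langgraph_ollama_local.rag.indexer import IndexerConfig
from langchain_ollama import ChatOllama

# ~20% overlap on 1000-character chunks
config = IndexerConfig(chunk_size=1000, chunk_overlap=200)
indexer = DocumentIndexer(config=config)

# Start with k=4 and increase only for complex questions
retriever = LocalRetriever()
docs = retriever.retrieve("What is RAG?", k=4)

# Deterministic generation for factual, context-grounded answers
llm = ChatOllama(model="llama3.1", temperature=0)
```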
## Graph Visualization
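
LangGraph can render the compiled graph directly; a minimal sketch using the built-in Mermaid export:

```python
# Print a Mermaid diagram of the compiled graph (paste into any Mermaid renderer)
print(rag_app.get_graph().draw_mermaid())

# Or save a PNG (requires the optional drawing dependencies)
# rag_app.get_graph().draw_mermaid_png(output_file_path="rag_graph.png")
```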
## Quiz
Test your understanding of Basic RAG:
1. What are the three main steps in the RAG pipeline?
2. What is the recommended chunk overlap percentage for document chunking?
3. Which limitation of Basic RAG is not addressed until a later tutorial?
4. What is the purpose of using embeddings in RAG?
5. True or False: Basic RAG always retrieves the most relevant documents for any question.