
Tutorial 10: CRAG (Corrective RAG)

CRAG extends RAG with a corrective fallback: when local retrieval fails to produce relevant documents, it searches the web for answers instead.

Overview

CRAG (Corrective RAG) adds a corrective mechanism:

  1. Retrieve from local documents
  2. Grade document relevance
  3. If insufficient, search the web
  4. Combine knowledge sources
  5. Generate answer
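
The five steps above can be sketched as plain-Python control flow, independent of any framework. The retriever, grader, searcher, and generator below are hypothetical stand-ins for the components built later in this tutorial:

```python
from typing import Callable, List

def crag_pipeline(
    question: str,
    retrieve: Callable[[str], List[str]],       # step 1: local retrieval
    grade: Callable[[str, str], bool],          # step 2: relevance grading
    web_search: Callable[[str], List[str]],     # step 3: web fallback
    generate: Callable[[str, List[str]], str],  # step 5: answer generation
    min_relevant: int = 2,
) -> str:
    docs = retrieve(question)
    relevant = [d for d in docs if grade(question, d)]
    if len(relevant) < min_relevant:
        # steps 3-4: supplement local knowledge with web results
        relevant = relevant + web_search(question)
    return generate(question, relevant)
```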

Architecture

(Diagram omitted. The flow is: retrieve_local → grade_documents → generate, with a detour through web_search when local retrieval is insufficient.)

When to Use CRAG

  • Document corpus may not cover all topics
  • Users ask about recent events
  • Need to supplement local with external knowledge
  • Building research assistants

State Definition

python
from typing import List
from typing_extensions import TypedDict

from langchain_core.documents import Document

class CRAGState(TypedDict):
    question: str                      # User's question
    documents: List[Document]          # Local documents
    web_results: List[Document]        # Web search results
    combined_documents: List[Document] # Merged for generation
    knowledge_source: str              # "local", "web", "combined"
    generation: str                    # Final answer

Web Search Integration

python
import os
from typing import List

from langchain_core.documents import Document
from tavily import TavilyClient

def web_search(query: str, max_results: int = 3) -> List[Document]:
    client = TavilyClient(api_key=os.environ["TAVILY_API_KEY"])
    response = client.search(query, max_results=max_results)

    return [
        Document(
            page_content=r["content"],
            metadata={"source": r["url"], "title": r["title"], "type": "web"}
        )
        for r in response["results"]
    ]

Using DuckDuckGo (Free)

python
from typing import List

from duckduckgo_search import DDGS
from langchain_core.documents import Document

def web_search(query: str, max_results: int = 3) -> List[Document]:
    with DDGS() as ddgs:
        results = list(ddgs.text(query, max_results=max_results))
        return [
            Document(
                page_content=r["body"],
                metadata={"source": r["href"], "title": r["title"], "type": "web"}
            )
            for r in results
        ]

Node Functions

Grade and Decide

python
def grade_documents(state: CRAGState) -> dict:
    """Grade documents and decide knowledge source."""
    # doc_grader: the document relevance grader built in the earlier tutorials
    relevant, _ = doc_grader.grade_documents(
        state["documents"],
        state["question"]
    )

    if len(relevant) >= 2:
        return {"combined_documents": relevant, "knowledge_source": "local"}
    elif len(relevant) == 1:
        return {"combined_documents": relevant, "knowledge_source": "combined"}
    else:
        return {"combined_documents": [], "knowledge_source": "web"}
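
The thresholds above can be isolated into a small helper to make the three branches easy to test. This helper is illustrative (the `min_local` knob is an assumption, not part of the tutorial's graph):

```python
def decide_source(num_relevant: int, min_local: int = 2) -> str:
    """Mirror the branching in grade_documents: enough local docs -> "local",
    exactly one -> "combined" (local plus web), none -> "web" only."""
    if num_relevant >= min_local:
        return "local"
    elif num_relevant == 1:
        return "combined"
    return "web"
```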

Web Search Node

python
def search_web(state: CRAGState) -> dict:
    """Search the web for additional information."""
    web_docs = web_search(state["question"], max_results=3)

    # Combine with any existing relevant docs
    combined = state["combined_documents"] + web_docs

    return {
        "web_results": web_docs,
        "combined_documents": combined,
    }
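
Local and web documents can overlap (for example, a web result pointing at a page already in the local corpus). One option is a dedup-by-source pass before generation; this helper is an illustration, not part of the tutorial's graph:

```python
def dedup_by_source(docs):
    """Keep only the first document seen for each metadata['source']."""
    seen = set()
    unique = []
    for doc in docs:
        source = doc.metadata.get("source")
        if source in seen:
            continue
        seen.add(source)
        unique.append(doc)
    return unique
```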

Routing Logic

python
def route_after_grading(state: CRAGState) -> str:
    """Route based on knowledge source decision."""
    if state["knowledge_source"] == "local":
        return "generate"
    else:
        return "web_search"

Graph Construction

python
from langgraph.graph import StateGraph, START, END

graph = StateGraph(CRAGState)

# Nodes
graph.add_node("retrieve_local", retrieve_local)
graph.add_node("grade_documents", grade_documents)
graph.add_node("web_search", search_web)
graph.add_node("generate", generate)

# Edges
graph.add_edge(START, "retrieve_local")
graph.add_edge("retrieve_local", "grade_documents")
graph.add_conditional_edges(
    "grade_documents",
    route_after_grading,
    {"generate": "generate", "web_search": "web_search"}
)
graph.add_edge("web_search", "generate")
graph.add_edge("generate", END)

crag = graph.compile()

Source Attribution

Track where answers come from:

python
def generate(state: CRAGState) -> dict:
    """Generate with source attribution."""
    context_parts = []
    for i, doc in enumerate(state["combined_documents"], 1):
        source_type = doc.metadata.get("type", "local")
        source_name = doc.metadata.get("filename", doc.metadata.get("title", "Unknown"))
        context_parts.append(f"[Source {i} ({source_type}): {source_name}]\n{doc.page_content}")

    context = "\n\n".join(context_parts)
    # Generate with context...
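
To see the context format this produces, here is a self-contained run with a minimal stand-in for langchain's Document (same `.page_content`/`.metadata` attributes; the sample documents are invented for illustration):

```python
from dataclasses import dataclass, field

@dataclass
class Doc:  # stand-in for langchain_core.documents.Document
    page_content: str
    metadata: dict = field(default_factory=dict)

docs = [
    Doc("Local fact.", {"type": "local", "filename": "notes.md"}),
    Doc("Web fact.", {"type": "web", "title": "Some Page", "source": "https://example.com"}),
]

context_parts = []
for i, doc in enumerate(docs, 1):
    source_type = doc.metadata.get("type", "local")
    source_name = doc.metadata.get("filename", doc.metadata.get("title", "Unknown"))
    context_parts.append(f"[Source {i} ({source_type}): {source_name}]\n{doc.page_content}")

context = "\n\n".join(context_parts)
# context begins with "[Source 1 (local): notes.md]"
```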

Configuration

bash
# Environment variables
TAVILY_API_KEY=your-key-here
CRAG_MIN_RELEVANT_DOCS=2
CRAG_WEB_RESULTS_COUNT=3
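
One way to read these variables in Python, with the values above as defaults (the constant names on the left are illustrative):

```python
import os

# Required only when using the Tavily backend
TAVILY_API_KEY = os.environ.get("TAVILY_API_KEY", "")

# Thresholds with the tutorial's defaults as fallbacks
MIN_RELEVANT_DOCS = int(os.environ.get("CRAG_MIN_RELEVANT_DOCS", "2"))
WEB_RESULTS_COUNT = int(os.environ.get("CRAG_WEB_RESULTS_COUNT", "3"))
```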

Best Practices

  1. Rate limiting: Respect web search API limits
  2. Caching: Cache web results for repeated queries
  3. Source diversity: Balance local and web sources
  4. Freshness: Prefer web for time-sensitive queries
  5. Attribution: Always cite web sources
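
Points 1 and 2 can be addressed together: a small TTL cache in front of the search call avoids repeated hits against the API for the same query. The wrapper and the TTL value below are illustrative:

```python
import time
from typing import Callable, List

def cached_search(search_fn: Callable[[str], List], ttl_seconds: float = 300.0):
    """Wrap a search function with a simple per-query TTL cache."""
    cache = {}  # query -> (timestamp, results)

    def wrapper(query: str):
        now = time.monotonic()
        hit = cache.get(query)
        if hit is not None and now - hit[0] < ttl_seconds:
            return hit[1]  # fresh cached result: no API call made
        results = search_fn(query)
        cache[query] = (now, results)
        return results

    return wrapper
```

Used as `web_search = cached_search(web_search)`, repeated queries within the TTL window are served from memory.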

Comparison

| Aspect                | Self-RAG | CRAG              |
|-----------------------|----------|-------------------|
| Primary focus         | Quality  | Coverage          |
| Failure handling      | Retry    | Fallback          |
| External dependencies | None     | Web search API    |
| Best for              | Accuracy | Comprehensiveness |

Quiz

Test your understanding of CRAG (Corrective RAG):

Knowledge Check

What does CRAG do when local document retrieval is insufficient?

A. Returns an error message
B. Falls back to web search
C. Retries local retrieval with different parameters
D. Uses a default pre-written answer

Knowledge Check

Which web search API is recommended in the tutorial for CRAG?

A. Google Custom Search API
B. Bing Web Search API
C. Tavily
D. SerpAPI

Knowledge Check

What is the primary focus difference between Self-RAG and CRAG?

A. Self-RAG focuses on speed, CRAG on accuracy
B. Self-RAG focuses on quality, CRAG on coverage
C. Self-RAG uses local docs, CRAG only uses web
D. Self-RAG is for questions, CRAG is for summarization

Knowledge Check

What are the three possible values for the 'knowledge_source' field in CRAGState?

A. primary, secondary, fallback
B. local, web, combined
C. fast, medium, slow
D. cached, fresh, mixed

Knowledge Check (True/False)

True or False: CRAG requires an external web search API to function.

T. True
F. False