# Tutorial 13: Perplexity-Style Research Assistant
Build a full-featured research assistant with in-text citations, source metadata, and follow-up suggestions.
## Overview
This tutorial combines all RAG patterns into a polished research experience:
- In-text citations [1], [2]
- Source cards with metadata
- Multi-source synthesis
- Follow-up question suggestions
## Architecture
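The assistant is a linear three-node LangGraph pipeline (built in full under Graph Construction below):

```
START → gather_sources → generate_answer → generate_followups → END
```

`gather_sources` pulls excerpts from local documents and the web, `generate_answer` writes a cited answer from them, and `generate_followups` suggests related questions.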
### Source Data Model

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Source:
    index: int              # Citation number [1], [2], etc.
    title: str              # Source title
    url: str                # URL or file path
    content: str            # Relevant excerpt
    source_type: str        # "local" or "web"
    page: Optional[int]     # Page number, if applicable
    relevance_score: float  # Similarity score
```
### State Definition

```python
from typing import List, TypedDict


class ResearchState(TypedDict):
    question: str                   # User's question
    sources: List[Source]           # All gathered sources
    answer: str                     # Answer with citations
    follow_up_questions: List[str]  # Suggested questions
```
### Citation Prompt

```python
RESEARCH_PROMPT = """You are a research assistant.

IMPORTANT: Cite sources using [1], [2], etc. inline.
Every factual claim should have a citation.

Sources:
{sources}

Question: {question}

Answer with inline citations:"""
```
## Web Search Setup

### Tavily API (Recommended)

- Sign up at https://tavily.com
- Get your free API key
- Add it to `.env`: `TAVILY_API_KEY=tvly-your-key-here`
### Usage in Code

```python
import os

from tavily import TavilyClient

client = TavilyClient(api_key=os.environ["TAVILY_API_KEY"])

query = "What is Self-RAG and how does it differ from CRAG?"
results = client.search(query, max_results=3)
```
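The node functions below call a `web_search` helper that isn't defined in the snippets above. Here is a minimal sketch of what it might look like, assuming Tavily's response shape (a `"results"` list of dicts with `title`, `url`, `content`, and `score` keys):

```python
def web_search(query: str, max_results: int = 3) -> list:
    """Run a Tavily search and normalize results to plain dicts."""
    response = client.search(query, max_results=max_results)
    return [
        {
            "title": r.get("title", "Untitled"),
            "url": r.get("url", ""),
            "content": r.get("content", ""),
            "score": r.get("score", 0.0),
        }
        for r in response.get("results", [])
    ]
```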
## Node Functions

### Gather Sources

```python
def gather_sources(state: ResearchState) -> dict:
    """Gather sources from local documents and web search."""
    sources = []

    # Local documents (retriever comes from the earlier tutorials;
    # see the sketch below)
    local_docs = retriever.retrieve_documents(state["question"], k=3)
    for i, doc in enumerate(local_docs, 1):
        sources.append(Source(
            index=i,
            title=doc.metadata.get("filename", "Unknown"),
            url=doc.metadata.get("source", ""),
            content=doc.page_content,
            source_type="local",
            page=doc.metadata.get("page"),
            relevance_score=doc.metadata.get("score", 0.0),
        ))

    # Web search (citation numbering continues after the local sources)
    web_results = web_search(state["question"], max_results=3)
    for j, result in enumerate(web_results, len(sources) + 1):
        sources.append(Source(
            index=j,
            title=result["title"],
            url=result["url"],
            content=result["content"],
            source_type="web",
            page=None,
            relevance_score=result.get("score", 0.0),
        ))

    return {"sources": sources}
```
### Generate Answer with Citations

```python
def generate_answer(state: ResearchState) -> dict:
    """Generate an answer with inline citations."""
    # Format sources for the prompt
    sources_text = "\n\n".join([
        f"[{s.index}] {s.title}\n{s.content}"
        for s in state["sources"]
    ])

    # Generate the answer
    prompt = RESEARCH_PROMPT.format(
        sources=sources_text,
        question=state["question"],
    )
    response = llm.invoke(prompt)
    return {"answer": response.content}
```
### Generate Follow-up Questions

```python
FOLLOWUP_PROMPT = """Based on this question and answer, suggest 3 related follow-up questions.

Original Question: {question}

Answer: {answer}

Follow-up questions (one per line):"""


def generate_followups(state: ResearchState) -> dict:
    """Generate follow-up question suggestions."""
    prompt = FOLLOWUP_PROMPT.format(
        question=state["question"],
        answer=state["answer"],
    )
    response = llm.invoke(prompt)

    # Parse questions (one per line), stripping list markers like "1." or "-"
    questions = [
        q.strip().lstrip("0123456789.-) ")
        for q in response.content.split("\n")
        if q.strip()
    ][:3]
    return {"follow_up_questions": questions}
```
## Graph Construction

```python
from langgraph.graph import StateGraph, START, END

graph = StateGraph(ResearchState)

# Nodes
graph.add_node("gather_sources", gather_sources)
graph.add_node("generate_answer", generate_answer)
graph.add_node("generate_followups", generate_followups)

# Edges: a simple linear pipeline
graph.add_edge(START, "gather_sources")
graph.add_edge("gather_sources", "generate_answer")
graph.add_edge("generate_answer", "generate_followups")
graph.add_edge("generate_followups", END)

research_assistant = graph.compile()
```
## Output Formatting

```python
def format_response(result: dict) -> str:
    output = []

    # Answer with citations
    output.append(result["answer"])

    # Sources section
    output.append("\n" + "─" * 50)
    output.append("Sources:")
    for src in result["sources"]:
        icon = "🌐" if src.source_type == "web" else "📄"
        score = f"[{src.relevance_score * 100:.0f}%]"
        if src.page is not None:
            output.append(f"[{src.index}] {icon} {src.title} (page {src.page}) {score}")
        else:
            output.append(f"[{src.index}] {icon} {src.title} {score}")
        if src.url:
            output.append(f"    {src.url}")

    # Follow-ups
    output.append("\n" + "─" * 50)
    output.append("Related Questions:")
    for q in result["follow_up_questions"]:
        output.append(f"• {q}")

    return "\n".join(output)
```
## Usage

```python
result = research_assistant.invoke({
    "question": "What is Self-RAG and how does it differ from CRAG?"
})
print(format_response(result))
```
## Example Output

```
Self-RAG is a framework that enhances LLMs with self-reflection [1].
It grades retrieved documents for relevance and checks answers for
hallucinations [1][2]. CRAG differs by adding web search as a
fallback mechanism [3].

──────────────────────────────────────────────────
Sources:
[1] 📄 self_rag_paper.pdf (page 3) [92%]
    /path/to/self_rag_paper.pdf
[2] 📄 rag_survey.pdf (page 12) [87%]
    /path/to/rag_survey.pdf
[3] 🌐 "Corrective RAG Explained" [85%]
    https://example.com/crag-explained

──────────────────────────────────────────────────
Related Questions:
• What are the performance benchmarks for Self-RAG?
• How does CRAG handle web search failures?
• Can Self-RAG and CRAG be combined?
```
## Advanced Features

### Source Ranking

```python
def rank_sources(sources: List[Source]) -> List[Source]:
    """Rank sources by relevance score, preferring web sources on ties."""
    return sorted(
        sources,
        key=lambda s: (s.relevance_score, s.source_type == "web"),
        reverse=True,
    )
```
### Source Deduplication

```python
def deduplicate_sources(sources: List[Source]) -> List[Source]:
    """Remove duplicate or very similar sources."""
    unique_sources = []
    seen_content = set()
    for source in sources:
        # Simple dedup based on the first 100 characters of content
        content_hash = hash(source.content[:100])
        if content_hash not in seen_content:
            unique_sources.append(source)
            seen_content.add(content_hash)
    return unique_sources
```
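Both helpers reorder or drop sources, which would orphan the [n] citations if applied after the answer is written. If you use them, call them inside `gather_sources` before `generate_answer` runs, and reassign the indices afterwards. A minimal sketch, using a hypothetical `renumber_sources` helper:

```python
from dataclasses import replace


def renumber_sources(sources: List[Source]) -> List[Source]:
    """Reassign contiguous citation numbers after ranking/deduplication."""
    return [replace(s, index=i) for i, s in enumerate(sources, 1)]


# e.g. at the end of gather_sources:
# sources = renumber_sources(rank_sources(deduplicate_sources(sources)))
```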
## Configuration

```bash
# .env
TAVILY_API_KEY=tvly-your-key-here
RAG_COLLECTION_NAME=documents
EMBEDDING_MODEL_NAME=all-mpnet-base-v2
RESEARCH_MAX_LOCAL_SOURCES=3
RESEARCH_MAX_WEB_SOURCES=3
```
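The snippets above hardcode `k=3` and `max_results=3`; for the two `RESEARCH_MAX_*` variables to take effect, read them at startup. A small sketch, assuming the `.env` file has already been loaded (e.g. with `python-dotenv`):

```python
import os

MAX_LOCAL_SOURCES = int(os.getenv("RESEARCH_MAX_LOCAL_SOURCES", "3"))
MAX_WEB_SOURCES = int(os.getenv("RESEARCH_MAX_WEB_SOURCES", "3"))

# then in gather_sources:
# local_docs = retriever.retrieve_documents(state["question"], k=MAX_LOCAL_SOURCES)
# web_results = web_search(state["question"], max_results=MAX_WEB_SOURCES)
```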
## Best Practices

- Source diversity: Balance local and web sources
- Citation verification: Ensure all claims are cited (see the sketch below)
- Source quality: Filter low-relevance sources
- User experience: Format output for readability
- Rate limiting: Respect API limits for web search
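For citation verification, one cheap check is to confirm that every [n] marker in the answer refers to a gathered source, and to flag answers with no citations at all. A minimal sketch, using a hypothetical `check_citations` helper:

```python
import re


def check_citations(answer: str, sources: List[Source]) -> List[int]:
    """Return citation numbers in the answer that match no gathered source."""
    cited = {int(n) for n in re.findall(r"\[(\d+)\]", answer)}
    known = {s.index for s in sources}
    if not cited:
        print("Warning: answer contains no citations")
    return sorted(cited - known)
```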
## Congratulations!
You've completed the RAG Patterns tutorial series. You can now build:
- Basic RAG systems
- Self-reflective RAG with quality grading
- Corrective RAG with web fallback
- Adaptive RAG with query routing
- Agentic RAG with agent control
- Full research assistants with citations
All running locally with Ollama!
## Quiz

Test your understanding of the Perplexity-Style Research Assistant:

1. What are the three main components of the research assistant's output?
2. How are citations formatted in the research assistant's answers?
3. What fields does the Source data model include?
4. What is the purpose of the follow-up questions feature?
5. True or False: The research assistant only uses local documents, not web search.