RAG and LLMs in 2026: Retrieval-Augmented Generation for Data Science Interviews

Retrieval-Augmented Generation (RAG) explained for data science interviews in 2026. Covers vector databases, chunking strategies, embedding models, agentic RAG, Graph RAG, and production-ready pipeline architecture.

Retrieval-Augmented Generation (RAG) has become the standard architecture for grounding LLM outputs in factual, up-to-date data. For data science and AI engineering interviews in 2026, RAG questions now appear alongside traditional ML topics — testing both system design thinking and hands-on implementation skills.

What Is RAG in One Sentence?

Retrieval-Augmented Generation combines a retrieval system (vector search over a knowledge base) with an LLM generator, so the model answers from real documents instead of relying on memorized training data.

How the RAG Pipeline Works End to End

A RAG system operates in two phases: an offline ingestion phase that builds the knowledge base, and an online query phase that retrieves relevant context and generates an answer.

During ingestion, raw documents pass through cleaning, chunking, and embedding before storage in a vector database. During inference, the user query follows the same embedding path, and nearest-neighbor search retrieves the most relevant chunks. These chunks become part of the LLM prompt, grounding the generated answer in source material.

python
# rag_pipeline.py
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough

# Offline: ingest documents into the vector store
def build_index(documents: list[str]) -> Chroma:
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=512,       # characters per chunk (the default length function is len, not a tokenizer)
        chunk_overlap=64,     # overlap preserves context at boundaries
        separators=["\n\n", "\n", ". ", " "]
    )
    chunks = splitter.create_documents(documents)
    embeddings = OpenAIEmbeddings(model="text-embedding-3-large")
    return Chroma.from_documents(chunks, embeddings)

# Online: retrieve + generate
def query(vectorstore: Chroma, question: str) -> str:
    retriever = vectorstore.as_retriever(
        search_type="mmr",    # Maximal Marginal Relevance for diversity
        search_kwargs={"k": 5, "fetch_k": 20}
    )
    prompt = ChatPromptTemplate.from_template(
        "Answer based on this context only:\n{context}\n\nQuestion: {question}"
    )
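    chain = (
        {
            # Join chunk texts; passing raw Document objects would inject their repr into the prompt
            "context": retriever | (lambda docs: "\n\n".join(d.page_content for d in docs)),
            "question": RunnablePassthrough(),
        }
        | prompt
        | ChatOpenAI(model="gpt-4o", temperature=0)
    )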
    return chain.invoke(question).content

This pipeline covers the core RAG loop: chunk, embed, store, retrieve, generate. The search_type="mmr" parameter ensures retrieved chunks are both relevant and diverse, reducing redundancy in the context window.
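
To make the relevance-versus-diversity idea behind MMR concrete, here is a minimal pure-Python sketch of the selection loop. It is an illustration under assumptions, not LangChain's implementation: the helper names, the toy 2-D vectors, and the `lambda_mult` weighting are all chosen for the example.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def mmr_select(query_vec: list[float], doc_vecs: list[list[float]],
               k: int = 2, lambda_mult: float = 0.5) -> list[int]:
    """Pick k document indices, trading relevance against redundancy."""
    candidates = list(range(len(doc_vecs)))
    selected: list[int] = []
    while candidates and len(selected) < k:
        def mmr_score(i: int) -> float:
            relevance = cosine(query_vec, doc_vecs[i])
            # Redundancy: similarity to the closest already-selected document
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Toy data: documents 0 and 1 are near-duplicates, document 2 is different
docs = [[1.0, 0.0], [0.99, 0.05], [0.6, 0.8]]
query = [1.0, 0.1]
diverse = mmr_select(query, docs, k=2, lambda_mult=0.5)
relevance_only = mmr_select(query, docs, k=2, lambda_mult=1.0)
```

With `lambda_mult=1.0` the loop degenerates to plain top-k by similarity and returns both near-duplicates; at `0.5` the second pick is pushed toward the document that adds new information.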

Chunking Strategies That Actually Matter

Chunking determines retrieval quality more than any other component. Poor chunking means the retriever returns fragments that either lack context or dilute the signal with irrelevant content.

Three chunking approaches dominate production systems in 2026:

Fixed-size chunking splits text at a set token count (typically 256-512 tokens) with overlap. Simple to implement, but splits sentences and ideas mid-thought.

Semantic chunking detects topic boundaries by measuring embedding similarity between consecutive sentences. When similarity drops below a threshold, a new chunk begins. Each chunk carries a coherent idea rather than an arbitrary slice of text.

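Late chunking applies the transformer model to the entire document first, producing contextual token embeddings, and only then splits the text into chunks, pooling the token embeddings inside each chunk span into a single vector. This preserves long-range dependencies that traditional chunking destroys, but it requires an embedding model whose context window can hold the whole document.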
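
One way to picture late chunking: once the model has produced contextual token embeddings for the whole document, each chunk vector is derived from the tokens in its span. The sketch below assumes mean pooling and uses a hand-written matrix as a stand-in for real encoder output; the function names and spans are hypothetical.

```python
def mean_pool(vectors: list[list[float]]) -> list[float]:
    """Average a list of equal-length vectors dimension by dimension."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

def late_chunk(token_embeddings: list[list[float]],
               spans: list[tuple[int, int]]) -> list[list[float]]:
    """Pool contextual token embeddings into one vector per (start, end) span."""
    return [mean_pool(token_embeddings[start:end]) for start, end in spans]

# Stand-in for the per-token output of a long-context encoder (4 tokens, 2 dims)
token_embeddings = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
spans = [(0, 2), (2, 4)]          # chunk boundaries chosen after encoding
chunk_vectors = late_chunk(token_embeddings, spans)
```

The difference from ordinary chunking is only the order of operations: here the split happens after encoding, so every token vector has already seen the full document.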

python
# semantic_chunking.py
import numpy as np
from sentence_transformers import SentenceTransformer

def semantic_chunk(text: str, threshold: float = 0.3) -> list[str]:
    """Split text where semantic similarity drops below threshold."""
    model = SentenceTransformer("all-MiniLM-L6-v2")
    sentences = text.split(". ")
    embeddings = model.encode(sentences)

    chunks, current_chunk = [], [sentences[0]]

    for i in range(1, len(sentences)):
        # Cosine similarity between consecutive sentences
        sim = np.dot(embeddings[i-1], embeddings[i]) / (
            np.linalg.norm(embeddings[i-1]) * np.linalg.norm(embeddings[i])
        )
        if sim < threshold:       # topic shift detected
            chunks.append(". ".join(current_chunk))
            current_chunk = [sentences[i]]
        else:
            current_chunk.append(sentences[i])

    chunks.append(". ".join(current_chunk))  # final chunk
    return chunks

The key interview insight: chunk size is a precision-recall tradeoff. Small chunks (100 tokens) improve retrieval precision but fragment context. Large chunks (1000 tokens) preserve context but dilute embedding specificity. Most production systems land at 256-512 tokens with 10-20% overlap.
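
The size and overlap numbers above are easier to reason about with a small sliding-window sketch. This version assumes whitespace-separated words as a rough proxy for model tokens; a production system would count tokens with the embedding model's own tokenizer.

```python
def fixed_size_chunks(text: str, chunk_size: int = 256, overlap: int = 32) -> list[str]:
    """Split text into windows of chunk_size words that share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    tokens = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks

words = " ".join(f"w{i}" for i in range(10))
chunks = fixed_size_chunks(words, chunk_size=4, overlap=1)
```

Each chunk starts with the last word of the previous one, which is the boundary context that overlap is meant to preserve.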

Vector Databases and Embedding Models in Production

The vector database stores embeddings and supports fast approximate nearest-neighbor (ANN) search. Choosing the right combination of embedding model and vector database directly impacts retrieval latency and accuracy.

Embedding models in 2026 have converged around a few high-performing options. Proprietary models such as OpenAI's text-embedding-3-large (3072 dimensions) and Cohere's embed-v4, along with open-source alternatives like bge-m3 from BAAI, offer strong multilingual retrieval. The MTEB leaderboard remains the standard benchmark for comparing embedding quality.

Vector databases like Pinecone, Weaviate, Milvus, Qdrant, and pgvector each make different tradeoffs:

| Database | Indexing | Managed | Strength |
|----------|----------|---------|----------|
| Pinecone | Proprietary | Yes | Simplicity, serverless scaling |
| Weaviate | HNSW | Yes/Self | Hybrid search (vector + BM25) |
| Milvus | IVF, HNSW | Yes/Self | Billion-scale datasets |
| Qdrant | HNSW | Yes/Self | Filtering + payload storage |
| pgvector | IVF, HNSW | Self | PostgreSQL integration |

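For interview discussions, the critical point is understanding the HNSW (Hierarchical Navigable Small World) algorithm: it builds a multi-layer graph in which sparse upper layers provide long-range links and the dense bottom layer connects each node to its nearest neighbors. Search descends greedily from the top layer, which gives roughly O(log n) query time at the cost of higher memory usage and approximate rather than exact results.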
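
The greedy walk at the heart of HNSW can be sketched on a single layer. The code below is a toy, not HNSW itself: it builds neighbor lists by brute force and has no layer hierarchy, so it shows only the "move to whichever neighbor is closer to the query" step that a real index repeats on each layer. All names and data are illustrative.

```python
import math

Point = tuple[float, float]

def build_knn_graph(points: list[Point], m: int = 2) -> dict[int, list[int]]:
    """Brute-force neighbor lists: each node links to its m closest points."""
    graph = {}
    for i, p in enumerate(points):
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(points[j], p),
        )
        graph[i] = others[:m]
    return graph

def greedy_search(points: list[Point], graph: dict[int, list[int]],
                  query: Point, entry: int = 0) -> int:
    """Walk the graph, always moving to the neighbor closest to the query."""
    current = entry
    while True:
        best = min(graph[current], key=lambda j: math.dist(points[j], query))
        if math.dist(points[best], query) >= math.dist(points[current], query):
            return current  # local minimum: no neighbor is closer
        current = best

points = [(float(i), 0.0) for i in range(10)]
graph = build_knn_graph(points, m=2)
nearest = greedy_search(points, graph, query=(7.2, 0.0), entry=0)
```

On one layer the walk takes many short hops; the upper layers in HNSW exist to replace most of those hops with a few long jumps. A greedy walk can also stop at a local minimum, which is why the results are approximate.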

Hybrid Retrieval: Combining Dense and Sparse Search

Pure vector search fails on exact keyword matches and rare terms. Pure lexical search (BM25) misses semantic similarity. Hybrid retrieval combines both, and in 2026 this is the default for production RAG systems.

The standard pattern uses Reciprocal Rank Fusion (RRF) to merge ranked results from both retrievers:

python
# hybrid_retrieval.py

def reciprocal_rank_fusion(
    dense_results: list[str],
    sparse_results: list[str],
    k: int = 60
) -> list[str]:
    """Merge dense (vector) and sparse (BM25) results using RRF."""
    scores: dict[str, float] = {}

    for rank, doc_id in enumerate(dense_results):
        scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)

    for rank, doc_id in enumerate(sparse_results):
        scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)

    # Sort by combined RRF score, highest first
    return sorted(scores.keys(), key=lambda d: scores[d], reverse=True)

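Hybrid retrieval covers both failure modes. Vocabulary mismatch, where a user asks about "canceling a subscription" but the relevant document uses "account termination policy," defeats BM25 because the two phrasings share no terms, while vector search captures the semantic relationship. Exact identifiers and rare terms are the reverse case: BM25 catches the literal term overlap that dense embeddings tend to blur.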
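
The sparse side of this comparison can be made concrete with a small BM25 scorer. This is a simplified sketch, assuming lowercase whitespace tokenization and the smoothed IDF variant `ln((N - n + 0.5) / (n + 0.5) + 1)`; the documents and the `E4021` error code are invented for the example.

```python
import math
from collections import Counter

def bm25_scores(query: str, docs: list[str],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every document against the query with BM25."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Number of documents that contain each term at least once
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores

docs = [
    "account termination policy and refunds",
    "error code E4021 means the payment gateway timed out",
    "how to update billing details",
]
exact_term = bm25_scores("E4021", docs)
paraphrase = bm25_scores("canceling a subscription", docs)
```

The rare identifier scores only the document that contains it, while the paraphrased query scores zero everywhere because it shares no terms with the relevant document. That second case is the one the dense retriever is expected to cover.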

Reranking: The Second-Stage Filter

Retrieval returns candidates. Reranking sorts them by true relevance. Cross-encoder rerankers like Cohere Rerank or bge-reranker-v2.5-gemma2-lightweight score each query-document pair jointly, producing far more accurate relevance scores than bi-encoder similarity.

The two-stage retrieval pipeline — broad first-stage recall (top 50-100 candidates via vector + BM25), then precise reranking (top 5-10 for the prompt) — is standard in production. This keeps latency manageable: the first stage uses fast ANN search, while the expensive cross-encoder only processes a small candidate set.
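
The control flow of the two-stage pipeline can be sketched with pluggable scoring functions. Both scorers here are placeholders chosen so the example runs without a model: token overlap stands in for fast first-stage retrieval, and Jaccard similarity stands in for a cross-encoder. The names and corpus are hypothetical.

```python
from typing import Callable

Scorer = Callable[[str, str], float]

def two_stage_retrieve(query: str, corpus: list[str],
                       cheap_score: Scorer, expensive_score: Scorer,
                       recall_k: int = 50, final_k: int = 5) -> list[str]:
    """Broad recall with a cheap scorer, then precise reranking of the survivors."""
    # Stage 1: score the whole corpus cheaply and keep the top recall_k
    candidates = sorted(corpus, key=lambda d: cheap_score(query, d), reverse=True)[:recall_k]
    # Stage 2: run the expensive scorer only on those candidates
    reranked = sorted(candidates, key=lambda d: expensive_score(query, d), reverse=True)
    return reranked[:final_k]

def token_overlap(query: str, doc: str) -> float:
    """Placeholder first-stage score: number of shared words."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))

expensive_calls: list[str] = []  # records which documents reach stage 2

def jaccard(query: str, doc: str) -> float:
    """Placeholder second-stage score standing in for a cross-encoder."""
    expensive_calls.append(doc)
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d)

corpus = [
    "how to reset a forgotten password",
    "password rules and password reset steps for admins and contractors on shared devices",
    "shipping times for international orders",
    "reset the router",
]
top = two_stage_retrieve("password reset", corpus, token_overlap, jaccard,
                         recall_k=3, final_k=1)
```

The `expensive_calls` list shows the latency argument directly: the second-stage scorer runs once per surviving candidate, never once per document in the corpus.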

Reranking Interview Insight

Cross-encoders are more accurate than bi-encoders because they process the query and document together through all transformer layers. Bi-encoders embed them independently, losing fine-grained interaction signals. The tradeoff is speed: cross-encoders cannot be pre-indexed.

Agentic RAG: Beyond Single-Shot Retrieval

Naive RAG retrieves once and generates. Agentic RAG treats the LLM as a reasoning agent that decides when to retrieve, what to retrieve, and whether the retrieved context is sufficient.

In 2026, agentic RAG is the dominant pattern for complex queries that require multi-step reasoning. The agent can:

  • Self-assess: Evaluate whether retrieved documents answer the question
  • Re-query: Reformulate the search query if initial results are insufficient
  • Route: Choose between different knowledge sources (vector DB, SQL database, API)
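  • Verify: Cross-check facts across multiple retrieved passages

python
# agentic_rag.py
# Assumes `vectorstore` (e.g. the Chroma index from rag_pipeline.py) and
# `llm` (a chat model such as ChatOpenAI) are defined elsewhere.
from langgraph.graph import StateGraph, END
from typing import TypedDict

MAX_RETRIES = 2  # cap the rewrite loop so the graph always terminates

class RAGState(TypedDict):
    question: str
    documents: list[str]
    generation: str
    retries: int

def retrieve(state: RAGState) -> dict:
    """Retrieve documents from vector store."""
    docs = vectorstore.similarity_search(state["question"], k=5)
    return {"documents": [d.page_content for d in docs]}

def grade_documents(state: RAGState) -> str:
    """Routing function: return the name of the next node."""
    if state.get("retries", 0) >= MAX_RETRIES:
        return "generate"  # stop retrying and answer with what was found
    prompt = (
        f"Are these documents relevant to: {state['question']}?\n"
        "Reply with exactly 'relevant' or 'not_relevant'.\n"
        + "\n".join(state["documents"])
    )
    verdict = llm.invoke(prompt).content.strip().lower()
    # Prefix check: a substring test for 'relevant' would also match 'not_relevant'
    return "generate" if verdict.startswith("relevant") else "rewrite"

def rewrite_query(state: RAGState) -> dict:
    """Reformulate the query for better retrieval."""
    new_query = llm.invoke(
        f"Rewrite this query for better search results: {state['question']}"
    )
    return {"question": new_query.content, "retries": state.get("retries", 0) + 1}

def generate_answer(state: RAGState) -> dict:
    """Answer the question from the retrieved documents only."""
    context = "\n\n".join(state["documents"])
    answer = llm.invoke(
        f"Answer based on this context only:\n{context}\n\nQuestion: {state['question']}"
    )
    return {"generation": answer.content}

# Build the agent graph
workflow = StateGraph(RAGState)
workflow.add_node("retrieve", retrieve)
workflow.add_node("rewrite", rewrite_query)
workflow.add_node("generate", generate_answer)

workflow.set_entry_point("retrieve")
# Grading is a router, not a node: it only decides which edge to follow
workflow.add_conditional_edges("retrieve", grade_documents,
    {"generate": "generate", "rewrite": "rewrite"})
workflow.add_edge("rewrite", "retrieve")          # retry loop
workflow.add_edge("generate", END)

app = workflow.compile()
# app.invoke({"question": "What is RAG?", "documents": [], "generation": "", "retries": 0})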

This pattern (retrieve, grade, optionally rewrite and retry) is a simplified form of Corrective RAG (CRAG), which in its original formulation also falls back to web search when the retrieved documents are graded as irrelevant. The LangGraph framework models the workflow as a directed graph that may contain cycles, with conditional branching, making it straightforward to add verification steps, human-in-the-loop checkpoints, or multi-source routing.

Graph RAG: Structured Knowledge Retrieval

Graph RAG extracts entities and relationships from documents into a knowledge graph, then queries both the graph and the vector store. This architecture reduces hallucination on factual queries by grounding answers in explicit entity relationships rather than unstructured text similarity.

The ingestion pipeline extracts triples (subject, predicate, object) from each document chunk. At query time, the system identifies relevant entities in the question, traverses the knowledge graph for connected facts, and combines graph-retrieved context with vector-retrieved passages.

Graph RAG excels at multi-hop reasoning questions like "Which team leads the project that uses the framework mentioned in document X?" — queries that require connecting facts across multiple documents. Pure vector search struggles here because no single chunk contains the complete answer.
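
The multi-hop traversal can be sketched with an in-memory triple store and a breadth-first walk. The triples below are invented to mirror the example question, and the `inverse:` prefix is just this sketch's way of making edges walkable in both directions; a real system would use a graph database and an extraction model.

```python
from collections import defaultdict, deque

Triple = tuple[str, str, str]

def build_graph(triples: list[Triple]) -> dict[str, list[tuple[str, str]]]:
    """Index triples by entity, adding inverse edges so traversal works both ways."""
    graph: dict[str, list[tuple[str, str]]] = defaultdict(list)
    for subj, pred, obj in triples:
        graph[subj].append((pred, obj))
        graph[obj].append((f"inverse:{pred}", subj))
    return dict(graph)

def multi_hop_facts(graph: dict[str, list[tuple[str, str]]],
                    start: str, max_hops: int = 2) -> list[Triple]:
    """Collect every fact whose source entity is fewer than max_hops from start."""
    facts: list[Triple] = []
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the hop budget
        for pred, neighbor in graph.get(node, []):
            facts.append((node, pred, neighbor))
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, depth + 1))
    return facts

triples = [
    ("document X", "mentions", "FrameworkA"),
    ("ProjectB", "uses", "FrameworkA"),
    ("TeamC", "leads", "ProjectB"),
]
graph = build_graph(triples)
three_hops = multi_hop_facts(graph, "document X", max_hops=3)
two_hops = multi_hop_facts(graph, "document X", max_hops=2)
```

The fact linking the project to its team appears only once the hop budget reaches three, which is the chain (document, framework, project, team) that no single text chunk contains.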

Graph RAG Tradeoff

Graph RAG significantly improves factual accuracy (up to 40% reduction in hallucination on entity-heavy queries) but requires a mature entity extraction pipeline. Noisy extraction produces a noisy graph, which can degrade results below naive RAG.

Evaluating RAG Systems: Metrics That Matter

RAG evaluation splits into retrieval metrics and generation metrics. Both must be measured independently to diagnose failures.

Retrieval metrics:

  • Recall@k: Did the relevant documents appear in the top k results?
  • MRR (Mean Reciprocal Rank): How high is the first relevant result ranked?
  • NDCG: Does the ranking match the ideal relevance ordering?
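
The three retrieval metrics above are short enough to write out. This sketch assumes binary relevance for Recall@k and MRR and graded gains for NDCG, using the common `gain / log2(rank + 1)` discount; the document IDs are made up.

```python
import math

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average reciprocal rank over several (retrieved, relevant) query runs."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)

def ndcg_at_k(retrieved: list[str], gains: dict[str, float], k: int) -> float:
    """Discounted gain of the ranking divided by that of the ideal ordering."""
    dcg = sum(gains.get(d, 0.0) / math.log2(i + 2) for i, d in enumerate(retrieved[:k]))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg else 0.0

retrieved = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
recall = recall_at_k(retrieved, relevant, k=3)
mrr = mean_reciprocal_rank([(retrieved, relevant), (["d2", "d9"], {"d2"})])
ndcg = ndcg_at_k(retrieved, {"d1": 1.0, "d2": 1.0}, k=4)
```

Recall@k ignores ordering inside the top k, MRR looks only at the first hit, and NDCG rewards placing every relevant document early, so the three can disagree on the same ranking.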

Generation metrics:

  • Faithfulness: Does the answer only use information from retrieved context? (Measures hallucination)
  • Answer relevance: Does the answer address the original question?
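  • Context precision: Are the retrieved chunks relevant to the question, and are the relevant ones ranked first? (Strictly a retrieval-side signal, but usually reported alongside the generation metrics)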

Frameworks like Ragas and DeepEval automate these evaluations using LLM-as-judge patterns. Data science interviews increasingly include RAG evaluation design, so expect to explain how to measure whether a RAG system is working correctly.

Production Failure Modes and Debugging

RAG systems fail in predictable ways. Knowing these patterns is essential for both interviews and real-world deployment.

Context window pollution happens when too many retrieved chunks dilute the relevant signal. The LLM receives 10 chunks but only 2 contain useful information. The fix: use a reranker to filter, and reduce top-k from the retriever.

Chunking artifacts occur when fixed-size splitting breaks sentences, tables, or code blocks mid-element. The retrieved chunk is syntactically incomplete and semantically useless. Semantic chunking or document-aware splitting (respecting headers, paragraphs, code fences) solves this.
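
Document-aware splitting can be as simple as refusing to break inside a structural element. The sketch below handles one case only, Markdown code fences, and splits on blank lines everywhere else; the function name and sample document are illustrative.

```python
FENCE = "`" * 3  # Markdown code-fence marker

def split_blocks(markdown: str) -> list[str]:
    """Split on blank lines, but never inside a fenced code block."""
    blocks: list[str] = []
    current: list[str] = []
    in_fence = False
    for line in markdown.splitlines():
        if line.strip().startswith(FENCE):
            in_fence = not in_fence  # entering or leaving a code block
        if not line.strip() and not in_fence:
            if current:
                blocks.append("\n".join(current))
                current = []
        else:
            current.append(line)  # blank lines inside a fence are kept
    if current:
        blocks.append("\n".join(current))
    return blocks

doc = "\n".join([
    "Intro paragraph.",
    "",
    FENCE + "python",
    "x = 1",
    "",
    "y = 2",
    FENCE,
    "",
    "Closing paragraph.",
])
blocks = split_blocks(doc)
```

The blank line inside the code block does not trigger a split, so the block reaches the index as one syntactically complete unit. Tables and headed sections can be protected the same way.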

Embedding drift emerges when the embedding model is updated but the vector store still contains embeddings from the old model. Queries encoded with the new model search a vector space built by the old one, degrading retrieval quality. Solution: re-embed the entire corpus after any model change.
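
A cheap guard against this failure is to record which model built the index and refuse to mix vectors from another one. The sketch below is a hypothetical in-memory index, and the model names `embed-v1` and `embed-v2` are placeholders; with a real vector database the model name would typically live in collection metadata.

```python
from dataclasses import dataclass, field

class EmbeddingModelMismatch(ValueError):
    """Raised when a vector comes from a different model than the index."""

@dataclass
class VersionedIndex:
    model_name: str
    vectors: dict[str, list[float]] = field(default_factory=dict)

    def _check(self, model_name: str) -> None:
        if model_name != self.model_name:
            raise EmbeddingModelMismatch(
                f"index built with {self.model_name!r}, got {model_name!r}; "
                "re-embed the corpus before querying"
            )

    def add(self, doc_id: str, vector: list[float], model_name: str) -> None:
        self._check(model_name)
        self.vectors[doc_id] = vector

    def query(self, vector: list[float], model_name: str) -> str:
        """Return the ID of the closest stored vector (squared Euclidean distance)."""
        self._check(model_name)
        return min(
            self.vectors,
            key=lambda d: sum((a - b) ** 2 for a, b in zip(self.vectors[d], vector)),
        )

index = VersionedIndex(model_name="embed-v1")
index.add("doc-a", [1.0, 0.0], model_name="embed-v1")
index.add("doc-b", [0.0, 1.0], model_name="embed-v1")
hit = index.query([0.9, 0.1], model_name="embed-v1")
```

Failing loudly on a mismatch turns a silent quality regression into an error that shows up in the first query after the model change.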

Stale indexes deliver outdated information because the ingestion pipeline lagged behind document updates. In machine learning systems, this is analogous to training-serving skew: the retrieval system sees a different data distribution than what exists in production. The fix is to re-index incrementally whenever source documents change and to monitor index freshness.
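
One way to detect stale entries is to store a content hash next to each indexed document and compare it with the current source. This is a minimal sketch, assuming documents are addressable by a stable ID; the function names and sample data are hypothetical.

```python
import hashlib

def content_hash(text: str) -> str:
    """Stable fingerprint of a document's current content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def plan_reindex(source_docs: dict[str, str],
                 indexed_hashes: dict[str, str]) -> dict[str, list[str]]:
    """Compare sources with the index and list what must be re-embedded or removed."""
    upsert = sorted(
        doc_id for doc_id, text in source_docs.items()
        if indexed_hashes.get(doc_id) != content_hash(text)  # new or changed
    )
    delete = sorted(doc_id for doc_id in indexed_hashes if doc_id not in source_docs)
    return {"upsert": upsert, "delete": delete}

source_docs = {"a": "unchanged text", "b": "edited text", "c": "brand new document"}
indexed_hashes = {
    "a": content_hash("unchanged text"),
    "b": content_hash("original text"),
    "d": content_hash("document that was deleted upstream"),
}
plan = plan_reindex(source_docs, indexed_hashes)
```

Only changed and new documents are re-embedded, which keeps the refresh cheap enough to run on every ingestion cycle.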

Conclusion

  • RAG combines retrieval (vector search over a knowledge base) with LLM generation to produce grounded, factual answers without retraining the model
  • Chunking strategy has the highest impact on retrieval quality — semantic chunking and late chunking outperform fixed-size splitting for most use cases
  • Hybrid retrieval (dense vectors + sparse BM25) with Reciprocal Rank Fusion is the production default: dense search handles the vocabulary mismatch that defeats BM25, while BM25 catches the exact terms and rare identifiers that pure vector search misses
  • Cross-encoder rerankers add a precision layer after broad retrieval, processing only a small candidate set to keep latency acceptable
  • Agentic RAG (retrieve, grade, rewrite, retry) and Graph RAG (entity-relationship extraction) represent the two major architectural advances in 2026, handling complex multi-hop queries that naive RAG fails on
  • Evaluation must separate retrieval metrics (Recall@k, MRR) from generation metrics (faithfulness, answer relevance) to diagnose where the pipeline breaks
  • The most common production failures — context pollution, chunking artifacts, embedding drift, stale indexes — all have straightforward fixes once identified
