Retrieval-Augmented Generation (RAG) Explained: How to Connect LLMs to Your Own Data (Python Tutorial)

Why LLMs Fail at Private Data — And Why RAG Solves It

Large language models like GPT-5 or Claude are trained on data up to a certain date. They don’t know what’s inside your company’s internal documentation, your product database, or last quarter’s sales report. They also can’t browse a private Notion workspace or read your Slack messages.

This creates a very real gap: you have a powerful AI assistant, but it’s essentially blind to the most important data you need it to use.

Retrieval-Augmented Generation — RAG — bridges that gap. Instead of fine-tuning the model (expensive, slow) or cramming everything into the prompt (impractical), RAG dynamically fetches only the relevant chunks of your data at query time, then hands those chunks to the LLM as context.

The result: your LLM answers questions using your actual data, as current as your most recent index update, without retraining anything.

What Is RAG? (The Non-Textbook Definition)

Forget the academic definition for a moment. Think of RAG like this:

Imagine you’re a consultant who forgets everything between projects. Before each client call, your assistant pulls the 5 most relevant files from the archive and hands them to you. You walk into the meeting fully briefed, without having memorized the entire archive. That’s RAG.

Technically, RAG has three core steps:

  • Indexing: Your documents are split into chunks, converted to vector embeddings, and stored in a vector database.
  • Retrieval: When a user asks a question, that question is also embedded, and the most semantically similar document chunks are retrieved.
  • Generation: Those retrieved chunks are injected into the LLM prompt as context, and the model generates a response grounded in them.
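To make the first two steps concrete, here is a minimal, dependency-free sketch. It assumes a toy bag-of-words "embedding" in place of a real embedding model and a plain Python list in place of a vector database. The chunks are made-up example text, and the names embed, cosine, and retrieve are illustrative only.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy embedding: a bag-of-words count vector (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# 1. Indexing: embed each chunk once and keep (vector, text) pairs.
chunks = [
    "Employees receive 16 weeks of paid parental leave.",  # made-up example text
    "The office is closed on public holidays.",
    "Expense reports are due by the fifth of each month.",
]
index = [(embed(c), c) for c in chunks]

# 2. Retrieval: embed the question and rank chunks by similarity.
def retrieve(question, k=2):
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[0]), reverse=True)
    return [text for _, text in ranked[:k]]

top = retrieve("How many weeks of parental leave do employees get?", k=1)
print(top)
# ['Employees receive 16 weeks of paid parental leave.']
```

Step 3 (generation) would then place the returned chunks into the LLM prompt. A real embedding model matches on meaning rather than shared words, but the index-then-rank shape stays the same.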

RAG Architecture: The Full Picture

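Here’s how the pieces connect. At indexing time: documents → chunks → embeddings → vector database. At query time: question → embedding → similarity search → top-k chunks → prompt → LLM → answer.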

The key insight: the LLM never searches the database. It only sees what the retrieval step hands it.
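One way to picture this: the prompt is just a string assembled from the retrieved chunks plus the question. The sketch below is an assumption about what such a prompt could look like, not the template any particular framework uses; build_prompt and the sample chunks are hypothetical.

```python
def build_prompt(question, retrieved_chunks):
    """Assemble the only text the LLM will see: instructions, context, question."""
    context = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(retrieved_chunks, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

prompt = build_prompt(
    "What is our parental leave policy?",
    [
        "Parental leave is described in section 4 of the handbook.",
        "Leave requests go through the HR portal.",
    ],
)
print(prompt)
```

If a relevant chunk is not in retrieved_chunks, the model has no way to find it. That is why retrieval quality sets the ceiling for answer quality.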

Step-by-Step Python Tutorial: Build a RAG System

Step 1 — Install Dependencies

pip install langchain openai chromadb tiktoken pypdf

Step 2 — Load and Chunk Your Documents

RAG starts with your documents. Here’s how to load a PDF and split it into manageable chunks:

# Note: these import paths match older LangChain releases. Newer releases move
# them to langchain_community.document_loaders and langchain_text_splitters.
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load your document (one Document per PDF page)
loader = PyPDFLoader('company_handbook.pdf')
documents = loader.load()

# Split into chunks (overlap helps preserve context at boundaries)
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50
)
chunks = splitter.split_documents(documents)

print(f'Created {len(chunks)} chunks')

Why 500 characters? It’s a practical starting point: small enough to be specific, large enough to carry meaning. Note that chunk_size here counts characters, not tokens. You’ll tune it based on your data.
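To see what chunk_size and chunk_overlap actually do, here is a simplified fixed-window splitter. It is an illustration only: RecursiveCharacterTextSplitter prefers to break on paragraph and sentence separators, so its chunks usually come out somewhat shorter than the limit. The function name chunk_text is hypothetical.

```python
def chunk_text(text, chunk_size=500, chunk_overlap=50):
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks

# 1,200 characters of repeating digits so the overlap is easy to verify.
sample = "".join(str(i % 10) for i in range(1200))
pieces = chunk_text(sample)
print([len(p) for p in pieces])
# [500, 500, 300]
print(pieces[0][-50:] == pieces[1][:50])
# True
```

The last 50 characters of each chunk reappear at the start of the next one, so a sentence that straddles a boundary is still seen whole in at least one chunk.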

Step 3 — Create Embeddings and Store in a Vector DB

from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

embeddings = OpenAIEmbeddings()  # or use a local model

# Embed all chunks and store them
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory='./chroma_db'
)

# Explicit persist() is needed on older Chroma versions;
# newer versions write to persist_directory automatically.
vectorstore.persist()

print('Vector store created and saved.')

Chroma is a great local option for getting started. For production, consider Pinecone, Weaviate, or pgvector (if you’re already on Postgres).

Step 4 — Build the Retrieval + Generation Chain

from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA

llm = ChatOpenAI(model_name='gpt-4', temperature=0)

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type='stuff',  # 'stuff' = inject all context into one prompt
    retriever=vectorstore.as_retriever(search_kwargs={'k': 4}),
    return_source_documents=True
)

# Ask a question
result = qa_chain({'query': 'What is our parental leave policy?'})
print(result['result'])
print('Sources:', [doc.metadata for doc in result['source_documents']])

Notice return_source_documents=True — always show users where answers came from. It builds trust and makes debugging easier.
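Raw metadata dicts are not very readable for end users. The sketch below turns them into short citations. It assumes each dict has a 'source' key and a 0-indexed 'page' key, which is what PDF loaders such as PyPDFLoader typically produce; check your own loader's metadata before relying on it. format_sources is a hypothetical helper name.

```python
def format_sources(metadatas):
    """Turn chunk metadata dicts into de-duplicated, human-readable citations."""
    seen = set()
    citations = []
    for meta in metadatas:
        source = meta.get("source", "unknown source")
        page = meta.get("page")
        # Assumption: page numbers are 0-indexed, so add 1 for display.
        label = f"{source}, p. {page + 1}" if page is not None else source
        if label not in seen:
            seen.add(label)
            citations.append(label)
    return citations

example = [
    {"source": "company_handbook.pdf", "page": 11},
    {"source": "company_handbook.pdf", "page": 11},  # two chunks from the same page
    {"source": "company_handbook.pdf", "page": 12},
]
print(format_sources(example))
# ['company_handbook.pdf, p. 12', 'company_handbook.pdf, p. 13']
```

With the chain above, you would pass in [doc.metadata for doc in result['source_documents']].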

Step 5 — Load From Existing Store (Don’t Re-Index Every Time)

# On subsequent runs, load from disk instead of re-embedding
vectorstore = Chroma(
    persist_directory='./chroma_db',
    embedding_function=OpenAIEmbeddings()
)
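In a script that runs repeatedly, you can decide between building and loading automatically. Here is a small sketch under one assumption: a non-empty persist directory means a usable index already exists. The names get_or_build_store, build_fn, and load_fn are hypothetical; in practice build_fn would wrap the code from Steps 2 and 3, and load_fn the snippet above.

```python
import os
import tempfile

def get_or_build_store(persist_dir, build_fn, load_fn):
    """Load an existing index from persist_dir, or build it if none exists yet."""
    if os.path.isdir(persist_dir) and os.listdir(persist_dir):
        return load_fn(persist_dir)
    return build_fn(persist_dir)

# Demo with stand-in functions instead of real embedding calls.
with tempfile.TemporaryDirectory() as tmp:
    db_dir = os.path.join(tmp, "chroma_db")

    def fake_build(path):
        os.makedirs(path)
        open(os.path.join(path, "index.bin"), "w").close()
        return "built"

    def fake_load(path):
        return "loaded"

    first = get_or_build_store(db_dir, fake_build, fake_load)
    second = get_or_build_store(db_dir, fake_build, fake_load)

print(first, second)
# built loaded
```

The first call pays the embedding cost once. Every later call skips straight to loading.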

Real-World Example: Querying a Product Manual

Imagine you have a 200-page product manual. A customer support agent types: “How do I reset the device to factory settings?”

Without RAG: The LLM either guesses (and often hallucinates) or says it doesn’t know.

With RAG: The system retrieves the 4 most relevant chunks from the manual, the LLM reads those specific sections, and answers accurately — often citing the exact page or section.

The difference isn’t subtle. In production, this sharply reduces one major category of LLM failures: confident answers with no basis in your documents.

Common RAG Mistakes to Avoid

  • Chunk size too large: 1,000+ character chunks often retrieve too much irrelevant noise.
  • No overlap between chunks: Without overlap, a sentence split at a boundary loses its context.
  • Retrieving too few chunks: k=1 or k=2 often misses relevant information; start with k=4 and tune.
  • Never validating retrieved context: Log what gets retrieved during dev. Garbage in, garbage out.
  • Skipping source attribution: Users can’t trust answers they can’t verify.
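For the validation point in particular, a small logging helper goes a long way during development. This sketch assumes retrieval results arrive as (text, score) pairs. LangChain vector stores offer similarity_search_with_score, which returns (Document, score) pairs you could adapt, but whether a higher score means a better match depends on the store. log_retrieval is a hypothetical helper.

```python
import logging

logger = logging.getLogger("rag.retrieval")

def log_retrieval(query, results, preview_chars=80):
    """Log each retrieved chunk with its rank, score, and a short preview.

    results is assumed to be a list of (text, score) pairs.
    """
    lines = []
    for rank, (text, score) in enumerate(results, start=1):
        preview = " ".join(text.split())[:preview_chars]  # collapse whitespace
        lines.append(f"#{rank} score={score:.3f} {preview}")
    logger.info("query=%r retrieved %d chunks\n%s", query, len(results), "\n".join(lines))
    return lines

logging.basicConfig(level=logging.INFO)
summary = log_retrieval(
    "How do I reset the device?",
    [
        ("Hold the power button for ten seconds\nto reset.", 0.8123),
        ("Warranty terms are listed in the appendix.", 0.4001),
    ],
)
```

Reading these logs for a few dozen real questions usually tells you quickly whether the problem is chunking, k, or the embedding model.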

When to Use RAG vs. Fine-Tuning

This is one of the most common questions in applied LLM work:

  • Use RAG when: your data updates frequently, you need source attribution, or you want to get started quickly.
  • Use fine-tuning when: you need a specific writing style or tone, or the model consistently fails at a task type even with good context.

For most business use cases — internal search, customer support, document Q&A — RAG is the right starting point.

FAQ

Do I need an OpenAI API key to use RAG?

No. You can use open-source embedding models (like sentence-transformers) and local LLMs (via Ollama or LM Studio) to build a fully local RAG pipeline with zero API costs.

How much does it cost to run RAG with OpenAI?

Embedding costs are very low, typically fractions of a cent per document page. The main cost is the LLM calls at query time. For GPT-4, expect $0.01–$0.05 per query depending on context size. At low to moderate query volumes this is negligible; at high volume it adds up, so it is worth measuring.
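A back-of-the-envelope estimate shows where a figure like that comes from. Every number below is an assumption for illustration: roughly 4 characters per token, 100 tokens of instructions and question, a 200-token answer, and per-1K-token prices of $0.03 (input) and $0.06 (output). Check your provider's current price list before budgeting.

```python
def estimate_query_cost(prompt_tokens, completion_tokens,
                        input_price_per_1k, output_price_per_1k):
    """Estimate the dollar cost of one LLM call from token counts and per-1K prices."""
    input_cost = (prompt_tokens / 1000) * input_price_per_1k
    output_cost = (completion_tokens / 1000) * output_price_per_1k
    return input_cost + output_cost

# Assumed numbers for illustration only.
context_tokens = 4 * 125   # k=4 chunks of ~500 characters at ~4 characters per token
overhead_tokens = 100      # instructions plus the question (a guess)
cost = estimate_query_cost(
    prompt_tokens=context_tokens + overhead_tokens,
    completion_tokens=200,
    input_price_per_1k=0.03,
    output_price_per_1k=0.06,
)
print(f"${cost:.3f} per query")
# $0.030 per query
```

Doubling k or chunk size roughly doubles the context portion of the bill, which is one more reason to retrieve only what you need.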

What’s the difference between RAG and just pasting documents into the prompt?

Pasting everything hits context limits fast and is expensive. RAG is selective: it only retrieves the relevant pieces. A 500-page document overflows many context windows, and even where it fits you pay to process every page on every query; RAG sends only the few chunks that matter.

How do I handle documents that change frequently?

Re-index on a schedule or trigger re-indexing on document update events. Many teams run nightly re-indexing jobs. Chroma and most vector DBs support deleting and re-adding specific document chunks.
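A nightly job does not have to re-embed everything. One common approach, sketched here under the assumption that you store a content hash per document at index time, is to compare hashes and touch only what changed. plan_reindex and the file names are hypothetical; the actual delete and add calls depend on your vector DB client.

```python
import hashlib

def content_hash(text):
    """Stable fingerprint of a document's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def plan_reindex(current_docs, indexed_hashes):
    """Compare current documents against the hashes stored at last index time.

    current_docs: dict of doc_id -> text
    indexed_hashes: dict of doc_id -> hash recorded when the doc was indexed
    Returns (to_upsert, to_delete) as sorted lists of doc ids.
    """
    to_upsert = sorted(
        doc_id for doc_id, text in current_docs.items()
        if indexed_hashes.get(doc_id) != content_hash(text)
    )
    to_delete = sorted(set(indexed_hashes) - set(current_docs))
    return to_upsert, to_delete

indexed = {
    "a.md": content_hash("old text"),
    "b.md": content_hash("same"),
    "c.md": content_hash("gone"),
}
current = {"a.md": "new text", "b.md": "same", "d.md": "brand new"}
print(plan_reindex(current, indexed))
# (['a.md', 'd.md'], ['c.md'])
```

For each id in to_upsert you would delete its old chunks and add the freshly embedded ones; for each id in to_delete you would only delete.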

Can RAG work without LangChain?

Absolutely. LangChain is a convenience layer. You can build RAG with raw OpenAI SDK calls, a vector DB client, and a few dozen lines of Python. LangChain just speeds up the scaffolding.

What if my retrieved context is still wrong?

This usually comes down to chunking strategy, embedding quality, or retrieval k-value. Start by logging exactly what gets retrieved, then adjust chunk size, overlap, or retrieval parameters.
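Logging tells you what went wrong for one question; a tiny evaluation set tells you whether a change helped overall. The sketch below computes a hit rate at k, assuming you have hand-labeled a few questions with the chunk id that should be retrieved. hit_rate_at_k, fake_retrieve, and the chunk ids are hypothetical stand-ins.

```python
def hit_rate_at_k(retrieve, labeled_questions, k):
    """Fraction of questions whose expected chunk id appears in the top-k results.

    retrieve(question, k) must return a list of chunk ids, best match first.
    labeled_questions is a list of (question, expected_chunk_id) pairs.
    """
    if not labeled_questions:
        return 0.0
    hits = sum(
        1 for question, expected in labeled_questions
        if expected in retrieve(question, k)
    )
    return hits / len(labeled_questions)

# Hypothetical retriever backed by a fixed ranking per question.
rankings = {
    "reset device": ["c7", "c2", "c9", "c1"],
    "warranty length": ["c4", "c5", "c3", "c8"],
}

def fake_retrieve(question, k):
    return rankings[question][:k]

labeled = [("reset device", "c2"), ("warranty length", "c3")]
for k in (1, 2, 4):
    print(k, hit_rate_at_k(fake_retrieve, labeled, k))
# 1 0.0
# 2 0.5
# 4 1.0
```

Rerun the same measurement after each change to chunk size, overlap, or k, and keep the setting that raises the hit rate without flooding the prompt.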

Niklas Lang

I have been working as a machine learning engineer and software developer since 2020 and am passionate about the world of data, algorithms and software development. In addition to my work in the field, I teach at several German universities, including the IU International University of Applied Sciences and the Baden-Württemberg Cooperative State University, in the fields of data science, mathematics and business analytics.

My goal is to present complex topics such as statistics and machine learning in a way that makes them not only understandable, but also exciting and tangible. I combine practical experience from industry with sound theoretical foundations to prepare my students in the best possible way for the challenges of the data world.
