7 Steps to Build a Simple RAG System from Scratch


 

Introduction

 
These days, almost everyone uses ChatGPT, Gemini, or another large language model (LLM). They make life easier but can still get things wrong. For example, I remember asking a generative model who won the most recent U.S. presidential election and getting the previous president’s name back. It sounded confident, but the model simply relied on training data before the election took place. This is where retrieval-augmented generation (RAG) helps LLMs give more accurate and up-to-date responses. Instead of depending only on the model’s internal knowledge, it pulls information from external sources — such as PDFs, documents, or APIs — and uses that to build a more contextual and reliable answer. In this guide, I’ll walk you through seven practical steps to build a simple RAG system from scratch.

 

Understanding the Retrieval-Augmented Generation Workflow

 
Before we proceed to code, here’s the idea in plain terms. A RAG system has two core pieces: the retriever and the generator. The retriever searches your knowledge base and pulls out the most relevant chunks of text. The generator is the language model that takes those snippets and turns them into a natural, useful answer. The process is straightforward, as follows:

  1. A user asks a question.
  2. The retriever searches your indexed documents or database and returns the best matching passages.
  3. Those passages are handed to the LLM as context.
  4. The LLM then generates a response grounded in that retrieved context.

Now we will break that flow down into seven simple steps and build it end-to-end.
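
Put in code, the whole loop is only a few lines. Here is a rough sketch; retrieve() and generate() are placeholders standing in for the pieces we will actually build in the steps below:

def answer_question(query, retrieve, generate):
    """retrieve(query, top_k) -> list of text chunks; generate(prompt) -> str."""
    # 1) Retrieve the most relevant chunks from the knowledge base
    context = "\n\n".join(retrieve(query, top_k=3))

    # 2) Combine the retrieved context with the question into a single prompt
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

    # 3) Let the LLM generate a response grounded in that context
    return generate(prompt)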

 

Step 1: Preprocessing the Data

 
Even though large language models already know a lot from textbooks and web data, they don’t have access to your private or newly generated information like research notes, company documents, or project files. RAG helps you feed the model your own data, reducing hallucinations and making responses more accurate and up-to-date. For the sake of this article, we’ll keep things simple and use a few short text files about machine learning concepts.

data/
 ├── supervised_learning.txt
 └── unsupervised_learning.txt

 

supervised_learning.txt:
In this type of machine learning (supervised), the model is trained on labeled data. 
In simple terms, every training example has an input and an associated output label. 
The objective is to build a model that generalizes well on unseen data. 
Common algorithms include:
- Linear Regression
- Decision Trees
- Random Forests
- Support Vector Machines

Classification and regression tasks are performed in supervised machine learning.
For example: spam detection (classification) and house price prediction (regression).
They can be evaluated using accuracy, F1-score, precision, recall, or mean squared error.

 

unsupervised_learning.txt:
In this type of machine learning (unsupervised), the model is trained on unlabeled data. 
Popular algorithms include:
- K-Means
- Principal Component Analysis (PCA)
- Autoencoders

There are no predefined output labels; the algorithm automatically detects 
underlying patterns or structures within the data.
Typical use cases include anomaly detection, customer clustering, 
and dimensionality reduction.
Performance can be measured qualitatively or with metrics such as silhouette score 
and reconstruction error.

 
The next task is to load this data. For that, we will create a Python file, load_data.py:

import os

def load_documents(folder_path):
    docs = []
    for file in os.listdir(folder_path):
        if file.endswith(".txt"):
            with open(os.path.join(folder_path, file), 'r', encoding='utf-8') as f:
                docs.append(f.read())
    return docs

 
Before we use the data, we will clean it. If the text is messy, the retriever may surface irrelevant or noisy passages, which increases the chance of hallucinations. Now, let’s create another Python file, clean_data.py:

import re

def clean_text(text: str) -> str:
    text = re.sub(r'\s+', ' ', text)
    text = re.sub(r'[^\x00-\x7F]+', ' ', text)
    return text.strip()
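
For a quick illustration of what these two regular expressions do, here is a tiny, optional check (the input string is made up): the messy whitespace is collapsed and the non-ASCII character is dropped.

from clean_data import clean_text

# Collapses the messy whitespace and removes the non-ASCII "™"
print(clean_text("  Hello\n\tworld™  "))   # prints: Hello world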

 
Finally, combine everything into a new file called prepare_data.py to load and clean your documents together:

from load_data import load_documents
from clean_data import clean_text

def prepare_docs(folder_path="data/"):
    """
    Loads and cleans all text documents from the given folder.
    """
    # Load Documents
    raw_docs = load_documents(folder_path)

    # Clean Documents
    cleaned_docs = [clean_text(doc) for doc in raw_docs]

    print(f"Prepared {len(cleaned_docs)} documents.")
    return cleaned_docs
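
If you want to verify this stage on its own before moving on, you can add a small guard to the bottom of prepare_data.py and run it directly (optional):

if __name__ == "__main__":
    docs = prepare_docs("data/")
    # Peek at the first 100 characters of the first cleaned document
    print(docs[0][:100])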

 

Step 2: Converting Text into Chunks

 
LLMs have a limited context window, meaning they can only process a certain amount of text at once. We work around this by dividing long documents into short, overlapping pieces (typically a few hundred words or characters per chunk; our code uses 500 characters). We’ll use LangChain’s RecursiveCharacterTextSplitter, which prefers to split at natural boundaries like paragraphs and sentences. Each chunk stays coherent on its own, so the retriever can later pull out just the pieces relevant to a question.

split_text.py

from langchain.text_splitter import RecursiveCharacterTextSplitter

def split_docs(documents, chunk_size=500, chunk_overlap=100):

    # define the splitter
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap
    )

    # use the splitter to split docs into chunks
    chunks = splitter.create_documents(documents)
    print(f"Total chunks created: {len(chunks)}")

    return chunks

 
Chunking lets the model work with manageable, self-contained pieces of text. The overlap between consecutive chunks matters too: without it, sentences that straddle a chunk boundary get cut in half, and the retriever may miss the context needed for a sensible answer.
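
To see the overlap in action, here is a small, optional experiment using a made-up passage (the exact amount of shared text varies with where the splitter finds natural break points):

from split_text import split_docs

# A made-up passage, repeated just to force several chunks
text = "Machine learning is a broad field of study. " * 40

chunks = split_docs([text], chunk_size=200, chunk_overlap=50)

# The tail of one chunk and the head of the next share overlapping text
print(repr(chunks[0].page_content[-60:]))
print(repr(chunks[1].page_content[:60]))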

 

Step 3: Creating and Storing Vector Embeddings

 
A computer does not understand textual information; it only understands numbers. So, we need to convert our text chunks into numbers. These numbers are called vector embeddings, and they help the computer understand the meaning behind the text. We can use tools like OpenAI, SentenceTransformers, or Hugging Face for this. Let’s create a new file called create_embeddings.py and use SentenceTransformers to generate embeddings.

from sentence_transformers import SentenceTransformer
import numpy as np

def get_embeddings(text_chunks):

    # Load embedding model
    model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

    print(f"Creating embeddings for {len(text_chunks)} chunks:")
    embeddings = model.encode(text_chunks, show_progress_bar=True)

    print(f"Embeddings shape: {embeddings.shape}")
    return np.array(embeddings)

 
Each embedding captures the semantic meaning of its chunk, so similar chunks end up close to each other in vector space. We then store the embeddings in a vector database like FAISS (Facebook AI Similarity Search), Chroma, or Pinecone, which makes similarity search fast. Here we’ll use FAISS, a lightweight option that runs locally. You can install it using:

pip install faiss-cpu
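
Optionally, before indexing anything, you can convince yourself that “close in vector space” holds in practice. The quick check below uses made-up sentences; the related pair should score a noticeably higher cosine similarity than the unrelated one:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
embeddings = model.encode([
    "Supervised learning trains models on labeled data.",   # A
    "Models learn from labeled examples.",                   # B: related to A
    "I had pasta for dinner yesterday.",                      # C: unrelated
])

# The related pair (A, B) should score clearly higher than the unrelated pair (A, C)
print("A vs B:", util.cos_sim(embeddings[0], embeddings[1]).item())
print("A vs C:", util.cos_sim(embeddings[0], embeddings[2]).item())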

 
Next, let’s create a file called store_faiss.py. First, we make necessary imports:

import faiss
import numpy as np
import pickle

 
Now we’ll create a FAISS index from our embeddings using the function build_faiss_index().

def build_faiss_index(embeddings, save_path="faiss_index"):
    """
    Builds FAISS index and saves it.
    """
    dim = embeddings.shape[1]
    print(f"Building FAISS index with dimension: {dim}")

    # Use a simple flat L2 index
    index = faiss.IndexFlatL2(dim)
    index.add(embeddings.astype('float32'))

    # Save FAISS index
    faiss.write_index(index, f"{save_path}.index")
    print(f"Saved FAISS index to {save_path}.index")

    return index

 
Each stored vector corresponds to one text chunk, and FAISS will later return the positions of the nearest vectors when a user asks a question. Because FAISS stores only the vectors themselves, we also save the text chunks as metadata in a pickle file so they can be reloaded later and mapped back to readable text at retrieval time.

def save_metadata(text_chunks, path="faiss_metadata.pkl"):
    """
    Saves the mapping of vector positions to text chunks.
    """
    with open(path, "wb") as f:
        pickle.dump(text_chunks, f)
    print(f"Saved text metadata to {path}")

 

Step 4: Retrieving Relevant Information

 
In this step, the user’s question is first converted into a vector, exactly as we did with the text chunks. That query vector is then compared against the stored chunk vectors to find the closest matches. This process is called similarity search.

Let’s create a new file called retrieve_faiss.py and add the necessary imports:

import faiss
import pickle
import numpy as np
from sentence_transformers import SentenceTransformer

 
Now, create a function to load the previously saved FAISS index from disk so it can be searched.

def load_faiss_index(index_path="faiss_index.index"):
    """
    Loads the saved FAISS index from disk.
    """
    print("Loading FAISS index.")
    return faiss.read_index(index_path)

 

We’ll also need another function that loads the metadata, which contains the text chunks we stored earlier.

def load_metadata(metadata_path="faiss_metadata.pkl"):
    """
    Loads text chunk metadata (the actual text pieces).
    """
    print("Loading text metadata.")
    with open(metadata_path, "rb") as f:
        return pickle.load(f)

 

The original text chunks live in the metadata file (faiss_metadata.pkl) and are used to map FAISS results back to readable text. Next, we create the function that embeds a user’s query and finds the top matching chunks in the FAISS index; this is where the semantic search actually happens.

def retrieve_similar_chunks(query, index, text_chunks, top_k=3):
    """
    Retrieves top_k most relevant chunks for a given query.
  
    Parameters:
        query (str): The user's input question.
        index (faiss.Index): FAISS index object.
        text_chunks (list): Original text chunks.
        top_k (int): Number of top results to return.
  
    Returns:
        list: Top matching text chunks.
    """
  
    # Embed the query
    model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
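    # Note: reloading the embedding model on every call keeps the example simple;
    # for repeated queries you would load it once and reuse it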
    # Ensure query vector is float32 as required by FAISS
    query_vector = model.encode([query]).astype('float32')
  
    # Search FAISS for nearest vectors
    distances, indices = index.search(query_vector, top_k)
  
    print(f"Retrieved top {top_k} similar chunks.")
    return [text_chunks[i] for i in indices[0]]

 
This gives you the top three most relevant text chunks to use as context.
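
Putting this step together, a standalone retrieval call looks like the sketch below (the query string is just an example, and the FAISS index and metadata from Step 3 must already exist on disk):

from retrieve_faiss import load_faiss_index, load_metadata, retrieve_similar_chunks

index = load_faiss_index()
text_chunks = load_metadata()

results = retrieve_similar_chunks("What is supervised learning?", index, text_chunks, top_k=3)
for i, chunk in enumerate(results, 1):
    print(f"[{i}] {chunk[:80]}...")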

 

Step 5: Combining the Retrieved Context

 
Once we have the most relevant chunks, the next step is to combine them into a single context block. This context is then appended to the user’s query before passing it to the LLM. This step ensures that the model has all the necessary information to generate accurate and grounded responses. You can combine the chunks like this:

context_chunks = retrieve_similar_chunks(query, index, text_chunks, top_k=3)
context = "\n\n".join(context_chunks)

 
This merged context will later be used when building the final prompt for the LLM.

 

Step 6: Using a Large Language Model to Generate the Answer

 
Now, we combine the retrieved context with the user query and feed it into an LLM to generate the final answer. Here, we’ll use a freely available open-source model from Hugging Face, but you can use any model you prefer.

Let’s create a new file called generate_answer.py and add the imports:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
from retrieve_faiss import load_faiss_index, load_metadata, retrieve_similar_chunks

 
Now define a function generate_answer() that performs the complete process:

def generate_answer(query, top_k=3):
    """
    Retrieves relevant chunks and generates a final answer.
    """
    # Load FAISS index and metadata
    index = load_faiss_index()
    text_chunks = load_metadata()

    # Retrieve top relevant chunks
    context_chunks = retrieve_similar_chunks(query, index, text_chunks, top_k=top_k)
    context = "\n\n".join(context_chunks)

    # Load open-source LLM
    print("Loading LLM...")
    model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
    # Load tokenizer and model, using a device map for efficient loading
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")

    # Build the prompt
    prompt = f"""
    Context:
    {context}
    Question:
    {query}
    Answer:
    """

    # Generate output
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    # Use the correct input for model generation
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=200, pad_token_id=tokenizer.eos_token_id)
    
    # Decode and clean up the answer, removing the original prompt
    full_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    
    # Simple way to remove the prompt part from the output
    answer = full_text.split("Answer:")[1].strip() if "Answer:" in full_text else full_text.strip()
    
    print("\nFinal Answer:")
    print(answer)
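
If you’d like to test this step before wiring up the full pipeline, you can add a quick guard at the bottom of generate_answer.py and run it with a sample question (the question below is illustrative, and it assumes the FAISS index and metadata from Step 3 are already on disk):

if __name__ == "__main__":
    generate_answer("What algorithms are used in unsupervised learning?", top_k=3)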

 

Step 7: Running the Full Retrieval-Augmented Generation Pipeline

 
This final step brings everything together. We’ll create a main.py file that automates the entire workflow from data loading to generating the final answer.

# Data preparation
from prepare_data import prepare_docs
from split_text import split_docs

# Embedding and storage
from create_embeddings import get_embeddings
from store_faiss import build_faiss_index, save_metadata

# Retrieval and answer generation
from generate_answer import generate_answer

 

Now define the main function:

def run_pipeline():
    """
    Runs the full end-to-end RAG workflow.
    """
    print("\nLoad and Clean Data:")
    documents = prepare_docs("data/")
    print(f"Loaded {len(documents)} clean documents.\n")

    print("Split Text into Chunks:")
    # documents is a list of strings, but split_docs expects a list of documents
    # For this simple example where documents are small, we pass them as strings
    chunks_as_text = split_docs(documents, chunk_size=500, chunk_overlap=100)
    # In this case, chunks_as_text is a list of LangChain Document objects

    # Extract text content from LangChain Document objects
    texts = [c.page_content for c in chunks_as_text]
    print(f"Created {len(texts)} text chunks.\n")

    print("Generate Embeddings:")
    embeddings = get_embeddings(texts)
  
    print("Store Embeddings in FAISS:")
    index = build_faiss_index(embeddings)
    save_metadata(texts)
    print("Stored embeddings and metadata successfully.\n")

    print("Retrieve & Generate Answer:")
    query = "Does unsupervised ML cover regression tasks?"
    generate_answer(query)

 

Finally, run the pipeline:

if __name__ == "__main__":
    run_pipeline()

 

Output:
 

Screenshot of the Output | Image by Author

 

Wrapping Up

 
RAG closes the gap between what an LLM “already knows” and the constantly changing information out in the world. We have built a deliberately basic pipeline here so you can see how the pieces of RAG fit together. At the enterprise level, more advanced techniques come into play, such as guardrails, hybrid search, streaming, and context optimization. If you’re interested in exploring further, here are a few of my personal favorites:

 
 

Kanwal Mehreen is a machine learning engineer and a technical writer with a profound passion for data science and the intersection of AI with medicine. She co-authored the ebook “Maximizing Productivity with ChatGPT”. As a Google Generation Scholar 2022 for APAC, she champions diversity and academic excellence. She’s also recognized as a Teradata Diversity in Tech Scholar, Mitacs Globalink Research Scholar, and Harvard WeCode Scholar. Kanwal is an ardent advocate for change, having founded FEMCodes to empower women in STEM fields.