I recently built a lightweight Retrieval-Augmented Generation (RAG) API using FastAPI, LangChain, and Hugging Face embeddings, allowing users to query a PDF document with natural language questions. Here’s the breakdown:
How It Works
1️⃣ Load a PDF – Extracts text from the document using PyPDF2.
2️⃣ Chunk the Text – Uses RecursiveCharacterTextSplitter to break it into manageable pieces (a short chunking sketch follows this list).
3️⃣ Create an In-Memory Vector Store – Uses Chroma with sentence-transformers/all-MiniLM-L6-v2 embeddings.
4️⃣ Run a Local LLM – Uses Ollama to power responses with deepseek-r1:1.5b.
5️⃣ FastAPI Endpoint – A /ask endpoint retrieves relevant chunks and generates concise answers.
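To make step 2️⃣ concrete, here is a minimal, self-contained sketch of how RecursiveCharacterTextSplitter breaks text into overlapping chunks. The sample string and the chunk sizes are made up purely for demonstration; the real code later uses chunk_size=500 and chunk_overlap=50.

from langchain_text_splitters import RecursiveCharacterTextSplitter

# Hypothetical sample text, just to show the splitter's behaviour
sample_text = "RAG combines retrieval and generation. " * 20

splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=20)
chunks = splitter.split_text(sample_text)

print(len(chunks))   # number of chunks produced
print(chunks[0])     # each chunk is at most ~100 characters, overlapping its neighbour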
What is RAG (Retrieval-Augmented Generation)?
Retrieval-Augmented Generation (RAG) is an AI technique that combines retrieval and generation to improve the quality and accuracy of responses from a language model.
1️⃣ Retrieve: The system searches for relevant documents or text chunks related to a user's query (e.g., from a PDF, database, or knowledge base).
2️⃣ Augment: The retrieved information is added to the LLM’s prompt to give it more context.
3️⃣ Generate: The LLM uses this enriched prompt to produce a more accurate, context-aware response (sketched below).
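In code, the three steps map onto a small pattern. The sketch below is conceptual: retriever and llm stand in for any vector-store retriever and any LangChain LLM, and it mirrors the /ask endpoint shown later.

def answer_with_rag(question: str, retriever, llm) -> str:
    # 1) Retrieve: find the chunks most similar to the question
    docs = retriever.get_relevant_documents(question)
    # 2) Augment: fold the retrieved text into the prompt
    context = "\n\n".join(doc.page_content for doc in docs)
    prompt = f"Context:\n{context}\n\nQuestion: {question}\n\nAnswer:"
    # 3) Generate: let the LLM answer using the enriched prompt
    return llm.invoke(prompt)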
What are FastAPI, LangChain, and Ollama?
FastAPI ⚡ – A high-performance Python web framework, perfect for building APIs.
LangChain 🧠 – A framework that helps connect LLMs with external data sources, like PDFs.
Ollama 🏗️ – A tool for running local LLMs on your machine without needing external API calls.
The Code
from fastapi import FastAPI, Query
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain.chains import RetrievalQA
from langchain_community.llms import Ollama
import PyPDF2

app = FastAPI()

def load_pdf(pdf_path):
    # Extract text from every page that actually contains text
    with open(pdf_path, "rb") as f:
        reader = PyPDF2.PdfReader(f)
        return "\n".join(page.extract_text() for page in reader.pages if page.extract_text())

pdf_text = load_pdf("mypdf.pdf")

# Split the document into overlapping chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = text_splitter.split_text(pdf_text)

# Embed the chunks and keep them in an in-memory Chroma vector store
embedding_function = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vector_store = Chroma.from_texts(chunks, embedding_function)

# Local LLM served by Ollama
llm = Ollama(model="deepseek-r1:1.5b")
retriever = vector_store.as_retriever()

# Ready-made RetrievalQA chain (the endpoint below builds its own prompt instead)
qa_chain = RetrievalQA.from_chain_type(llm, retriever=retriever)

@app.get("/ask")
def ask_question(question: str = Query(...)):
    # Retrieve the most relevant chunks, augment the prompt, then generate
    relevant_chunks = retriever.get_relevant_documents(question)
    context = "\n\n".join(doc.page_content for doc in relevant_chunks)
    prompt = f"Context:\n{context}\n\nQuestion: {question}\n\nAnswer:"
    response = llm.invoke(prompt)
    return {"question": question, "answer": response}
Installation and Setup
1️⃣ Install the Python dependencies:
pip install fastapi uvicorn langchain langchain_community PyPDF2 chromadb sentence-transformers
2️⃣ Install Ollama (for local LLMs) – download it from the Ollama website, then pull the model used in the code:
# Pull the model used in the code
ollama pull deepseek-r1:1.5b
# List available models
ollama list
Run the API (assuming the code above is saved as rag_api.py):
uvicorn rag_api:app --host 0.0.0.0 --port 8000 --reload
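With the server running, a quick way to sanity-check the endpoint is a small client script; the question below is hypothetical, and the requests package is assumed to be installed.

import requests

# Ask the /ask endpoint a natural-language question about the PDF
resp = requests.get(
    "http://localhost:8000/ask",
    params={"question": "What is the main topic of the document?"},  # hypothetical question
    timeout=120,  # local models can take a while on the first request
)
print(resp.json()["answer"])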