Building a RAG Chatbot — The Parts the Tutorials Skip

Most tutorials on Retrieval-Augmented Generation (RAG) make it look incredibly simple.

Install LangChain, call OpenAIEmbeddings(), load a few text chunks into a vector store, and boom—you have an intelligent AI assistant.

When you build a RAG system for a real production app, however, you quickly hit walls that these tutorials completely ignore. You face:

LLM hallucinations where the AI confidently makes up facts about your work
Massive cold starts where the local embedding model takes several seconds to load on the first user query
Complex semantic retrieval mismatches where direct queries fail to return the right context
Unhandled API rate limits that crash your streaming servers

Here is how I built the Ask My AI chatbot for my portfolio, covering the parts of RAG that tutorials skip and the specific engineering strategies I used to handle them.

🔬 Local Vector Encoding: Why I Skipped OpenAI

Many builders immediately reach for text-embedding-3-small by OpenAI. However, for a high-performance personal portfolio chatbot, calling an external embedding API introduces a slow network hop (usually adding 150-300ms of latency) and creates a direct billing dependency.

To avoid both, I run the sentence-transformers/all-MiniLM-L6-v2 model locally on the server. It maps each sentence to a 384-dimensional dense vector in under 15ms.

But there's a major catch: Model Cold Start.

After each server start, the first user to hit the chat widget triggered the loading of the PyTorch weights. The container would hang for 5 to 7 seconds while the neural network initialized:

Loading weights:   0%|          | 0/103 [00:00<?, ?it/s]

To solve this, I implemented a pre-warming step that runs in the FastAPI startup event. Running a silent dummy embedding query on container boot loads the model fully into memory before the first user request ever lands:

# backend/main.py
from fastapi import FastAPI
from services.rag import initialize_knowledge_base, model

app = FastAPI()

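# on_event is deprecated in newer FastAPI releases in favor of a lifespan handler; it is kept here for brevity
@app.on_event("startup")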
async def startup_event():
    # Pre-warm sentence-transformers model in memory
    print("📚 Pre-warming local embedding model...")
    model.encode("warmup query")
    print("✓ Model successfully pre-warmed!")
    
    # Initialize database knowledge base
    await initialize_knowledge_base()
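To see why the warm-up matters, here is a minimal, self-contained sketch of the cold-versus-warm difference. It does not load the real model: `FakeEmbeddingModel` is a hypothetical stand-in whose first `encode()` call sleeps to imitate weight loading, and the 0.2 second delay is an arbitrary assumption. The sketch also runs the blocking call through `asyncio.to_thread`, one possible way to keep the event loop free while the weights load.

```python
import asyncio
import time


class FakeEmbeddingModel:
    """Hypothetical stand-in for a sentence-transformers model.

    The first encode() call sleeps to imitate loading weights from disk.
    """

    def __init__(self, load_seconds: float = 0.2) -> None:
        self._load_seconds = load_seconds
        self._loaded = False

    def encode(self, text: str) -> list[float]:
        if not self._loaded:
            time.sleep(self._load_seconds)  # simulated cold start
            self._loaded = True
        return [0.0] * 384  # same dimensionality as all-MiniLM-L6-v2


async def timed_encode(model: FakeEmbeddingModel, text: str) -> float:
    """Run the blocking encode() in a worker thread and return elapsed seconds."""
    start = time.perf_counter()
    await asyncio.to_thread(model.encode, text)
    return time.perf_counter() - start


async def main() -> tuple[float, float]:
    model = FakeEmbeddingModel()
    cold = await timed_encode(model, "warmup query")  # pays the load cost
    warm = await timed_encode(model, "Has Anush worked with React?")
    return cold, warm


cold_seconds, warm_seconds = asyncio.run(main())
print(f"cold: {cold_seconds:.3f}s, warm: {warm_seconds:.3f}s")
```

With the warm-up done at boot, only the startup routine ever pays the slow first call; every user request sees the warm path.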

🧮 Solving Context Retrieval: The Reranking Formula

Plain vector similarity lookups (such as ranking by cosine similarity alone) often retrieve the wrong context when the user's question is indirect.

For instance, if a user asks, "Has Anush worked with React?", a pure vector search might return general facts about my front-end preferences but miss a highly important, specific project because the project description used the word "Next.js" instead of the word "React".

To solve this, I designed a composite scoring and reranking algorithm that weights semantic vector matching, database-configured importance levels, and direct keyword/tag overlaps:

Final Score = (Cosine Similarity * 0.75) + (Fact Importance * 0.15) + (Tag Overlap * 0.10)
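As a quick worked example of the formula, the sketch below scores two facts by hand. The similarity, importance, and tag-match numbers are made up for illustration, and `composite_score` is a hypothetical helper, not part of the original codebase. Importance is assumed to be on a 1-10 scale and tag overlap is assumed to saturate at three matches, which mirrors the implementation that follows.

```python
def composite_score(similarity: float, importance: int, tag_matches: int) -> float:
    """Apply the 0.75 / 0.15 / 0.10 weighting from the formula above."""
    importance_score = importance / 10.0         # 1-10 scale -> 0.1-1.0
    overlap_score = min(tag_matches / 3.0, 1.0)  # saturates at 3 matching tags
    return similarity * 0.75 + importance_score * 0.15 + overlap_score * 0.10


# Hypothetical facts: a specific, important project vs. a generic statement.
project = composite_score(similarity=0.60, importance=9, tag_matches=1)
generic = composite_score(similarity=0.70, importance=3, tag_matches=0)

print(f"{project:.3f}")  # 0.618
print(f"{generic:.3f}")  # 0.570
```

The project fact wins even though its cosine similarity is lower, because importance and one tag match together add about 0.17 to its score.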

Here is the exact Python implementation of the reranker:

# backend/services/rag.py
import numpy as np

def cosine_similarity(v1: list[float], v2: list[float]) -> float:
    a1, a2 = np.array(v1), np.array(v2)
    norm1, norm2 = np.linalg.norm(a1), np.linalg.norm(a2)
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return float(np.dot(a1, a2) / (norm1 * norm2))

def rerank_results(query_vector: list[float], query_text: str, facts: list[dict], limit: int = 4) -> list[dict]:
    scored_facts = []
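    # Strip surrounding punctuation so a token like "react?" still matches the tag "react"
    query_words = {w.strip("?!.,;:\"'()") for w in query_text.lower().split()}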
    
    for fact in facts:
        # 1. Semantic Cosine Similarity (0.75 weight)
        sim = cosine_similarity(query_vector, fact["embedding"])
        
        # 2. Fact Importance (0.15 weight) - scale 1-10 to 0.1-1.0
        importance = (fact.get("importance", 5) / 10.0)
        
        # 3. Direct Keyword/Tag Overlap (0.10 weight)
        tags = set(t.lower() for t in fact.get("tags", []))
        overlap = len(query_words.intersection(tags))
        overlap_score = min(overlap / 3.0, 1.0) # Cap at 1.0 for 3 matching keywords
        
        # Calculate composite score
        final_score = (sim * 0.75) + (importance * 0.15) + (overlap_score * 0.10)
        
        scored_facts.append({
            "content": fact["content"],
            "score": final_score
        })
        
    scored_facts.sort(key=lambda x: x["score"], reverse=True)
    return scored_facts[:limit]

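This formula markedly improved retrieval accuracy. It does not guarantee ordering, since cosine similarity still carries 75% of the weight, but highly important, tagged facts now outrank loose semantic matches that score only slightly higher on similarity.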

🚫 Preventing AI Hallucinations: System Prompt Guardrails

Large Language Models (like Gemini) are inherently designed to generate fluent, creative text. In a portfolio RAG application, this creativity is a liability. You do not want the chatbot confidently inventing a Master's degree in Computer Science or stating you have 10 years of experience with Rust when you only built a basic utility script in it.

To solve this, I applied a highly restrictive, defensively engineered system prompt that binds the LLM strictly to the retrieved context, stripping its creative freedom entirely.

Here is the exact prompt blueprint:

You are Antigravity, the highly polished AI avatar and technical assistant of full-stack engineer Anush Kharel.
Your task is to answer user questions about Anush's career, projects, and experience using ONLY the retrieved facts provided below.

RETRIEVED FACTS:
{context}

CRITICAL RULES:
1. Rely ONLY on the retrieved facts above. If a user asks a question that cannot be answered using the retrieved facts, respond exactly with: "I'm sorry, I don't have that information in my knowledge base yet."
2. NEVER invent, exaggerate, or assume details about Anush's experience, technologies, or background.
3. Keep your tone professional, conversational, and direct.
4. Format all replies using clean, elegant Markdown.

By explicitly instructing the model to return a fixed fallback sentence instead of guessing, I saw no hallucinations in my production test runs.
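One way the `{context}` placeholder could be filled is sketched below. This is an assumption about the wiring, not the original code: `build_prompt`, the abbreviated `PROMPT_TEMPLATE`, and the `min_score` threshold of 0.35 are all hypothetical, and the threshold would need tuning against real scores. The idea is that when no retrieved fact is relevant enough, the backend can return the fallback sentence directly and skip the LLM call.

```python
from typing import Optional

FALLBACK = "I'm sorry, I don't have that information in my knowledge base yet."

# Abbreviated version of the prompt above; only {context} is a placeholder.
PROMPT_TEMPLATE = (
    "Answer using ONLY the retrieved facts provided below.\n\n"
    "RETRIEVED FACTS:\n"
    "{context}\n\n"
    "If the facts do not answer the question, respond exactly with: "
    '"' + FALLBACK + '"'
)


def build_prompt(facts: list[dict], min_score: float = 0.35) -> Optional[str]:
    """Return a filled prompt, or None when no fact clears the threshold."""
    relevant = [fact for fact in facts if fact["score"] >= min_score]
    if not relevant:
        return None  # caller can answer with FALLBACK without calling the LLM
    context = "\n".join(f"- {fact['content']}" for fact in relevant)
    return PROMPT_TEMPLATE.format(context=context)


# Hypothetical reranker output, shaped like the dicts rerank_results returns.
facts = [
    {"content": "Built a Next.js dashboard", "score": 0.8},
    {"content": "Likes tea", "score": 0.1},
]
prompt = build_prompt(facts)
empty = build_prompt([])
```

Short-circuiting on an empty context keeps the fallback deterministic, since the model never gets the chance to improvise when there is nothing to ground it.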

🚀 True Streaming: UI/UX Orchestration

Waiting for a full LLM completion before sending any text leaves the user with no feedback for several seconds, which feels like a terrible delay. To make the interface feel alive, the chatbot uses FastAPI's StreamingResponse on the backend and reads the stream chunk-by-chunk on the React frontend.

Each time a network chunk arrives, the component decodes the bytes with TextDecoder and appends the text to the last message in state, giving the user a real-time typing effect:

// frontend/src/components/AskMe.tsx
const handleChatSubmit = async (message: string) => {
  const response = await fetch("https://api.mydomain.com.np/api/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ message })
  });

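  // Bail out on HTTP errors (for example a 429 from the backend) as well as a missing body
  if (!response.ok || !response.body) return;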
  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  
  while (true) {
    const { value, done } = await reader.read();
    if (done) break;
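    // stream: true keeps multi-byte UTF-8 characters intact when they span two chunks
    const chunk = decoder.decode(value, { stream: true });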
    
    // Update active message state in React dynamically
    setMessages(prev => {
      const last = prev[prev.length - 1];
      return [...prev.slice(0, -1), { ...last, content: last.content + chunk }];
    });
  }
};
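The backend half of the stream is not shown above, so the following is only a sketch of how it might guard against the rate-limit crashes mentioned in the introduction. `UpstreamRateLimitError` is a hypothetical stand-in for whatever exception the LLM client raises on an HTTP 429, and `fake_llm_stream` simulates a provider that fails mid-answer. The wrapper relays chunks as they arrive and ends the stream with a readable message instead of an unhandled exception.

```python
import asyncio
from typing import AsyncIterator


class UpstreamRateLimitError(Exception):
    """Hypothetical stand-in for the LLM client's rate-limit (HTTP 429) error."""


RATE_LIMIT_MESSAGE = "\n\nToo many requests right now. Please try again in a moment."


async def safe_stream(tokens: AsyncIterator[str]) -> AsyncIterator[str]:
    """Relay chunks from the LLM; on a rate-limit error, end the stream politely."""
    try:
        async for token in tokens:
            yield token
    except UpstreamRateLimitError:
        yield RATE_LIMIT_MESSAGE


async def fake_llm_stream() -> AsyncIterator[str]:
    """Simulated provider stream that is rate limited after two chunks."""
    yield "Anush has "
    yield "worked with "
    raise UpstreamRateLimitError()


async def collect() -> list[str]:
    return [chunk async for chunk in safe_stream(fake_llm_stream())]


chunks = asyncio.run(collect())
print(chunks)
```

In a FastAPI route, a generator like this could be returned as `StreamingResponse(safe_stream(llm_stream), media_type="text/plain")`, where `llm_stream` is whatever async iterator the LLM client provides.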

💡 Key Lesson

RAG systems look simple in tutorials because they only demonstrate the happy path. In production, RAG quality is 80% about your knowledge base engineering and 20% about your code framework.

By running local embeddings, pre-warming your models on startup, adding importance and tag-overlap weights to your vector scores, and using defensive system prompts, you can build a robust AI chatbot with fast retrieval that truly represents you!