How I built a RAG chatbot with FastAPI and Gemini

Designing a responsive, streaming AI assistant for a personal portfolio website touches every layer of the stack. In this article, I want to pull back the curtain on how I built the Ask My AI chatbot that powers this site, and walk through the engineering tradeoffs between backend latency, vector database operations, and real-time response generation.

🏗️ The Architectural Blueprint

Instead of relying on a heavy orchestration library like LangChain, which can add overhead and slow down startup, I built a custom, lightweight asynchronous pipeline in Python on top of FastAPI.

The workflow proceeds in three sequential stages:

Semantic Search: Encoding user queries and matching them against pre-seeded knowledge in PostgreSQL.
Context Injection: Incorporating matching snippets into a highly constrained system prompt.
Response Streaming: Using the official Google GenAI SDK to stream markdown chunks directly to the Next.js client.
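The context-injection stage can be sketched as a small pure function. This is a minimal illustration, not the prompt used on this site: the wording of the instructions, the bullet layout, and the empty-context fallback are all assumptions.

```python
def build_system_prompt(snippets: list[str], question: str) -> str:
    """Assemble a constrained prompt from retrieved snippets.

    The instruction text below is hypothetical; only the overall shape
    (rules, then context, then the question) reflects the stage described.
    """
    if snippets:
        # One bullet per retrieved fact keeps the context easy to scan
        context = "\n".join(f"- {s.strip()}" for s in snippets)
    else:
        context = "(no matching facts found)"
    return (
        "You are the assistant for a personal portfolio site.\n"
        "Answer using ONLY the facts listed under CONTEXT. "
        "If the answer is not there, say you do not know.\n\n"
        f"CONTEXT:\n{context}\n\n"
        f"QUESTION: {question.strip()}"
    )
```

Keeping this step free of I/O makes the prompt easy to unit test and to tweak without touching the retrieval or streaming code.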

🔬 Local Vector Encoding with Sentence Transformers

While managed services like OpenAI's text-embedding-3-small are easy to use, they add a network hop and a per-request API cost. To hit my target of sub-50 ms query encoding, I run the sentence-transformers/all-MiniLM-L6-v2 model locally on the server.

This model maps sentences and short paragraphs to a 384-dimensional dense vector space. The vectors are compact to store and compare, and the model works well for paragraph-length portfolio facts. By default it truncates input longer than 256 word pieces, so chunks should stay short.

from sentence_transformers import SentenceTransformer

# Load the model once at import time so every request reuses it
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def embed_text(text: str) -> list[float]:
    # Replace newlines so the chunk is encoded as one continuous passage
    clean_text = text.replace("\n", " ")
    # encode() returns a NumPy array of 384 floats for a single string
    embedding = model.encode(clean_text)
    return embedding.tolist()

🗄️ Relational-Semantic Hybrid DB Layer

A key design choice was using PostgreSQL to manage both transactional logs and semantic knowledge. Although pgvector is the standard choice for vector search in PostgreSQL, I wanted to show that a custom cosine-similarity engine could do the job without it. The schema has three tables:

portfolio_embeddings: Houses the chunks of my work experience, tags, and calculated vectors.
chat_logs: Logs active session histories to allow multi-turn context retention.
app_settings: Stores configuration parameters.
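A custom cosine-similarity search of this kind can be sketched in a few lines. This is only an illustration under stated assumptions: it presumes the stored vectors have already been loaded from portfolio_embeddings into Python as (text, vector) pairs, and the function names cosine_similarity and top_k are hypothetical. It uses the standard library only; a NumPy version would vectorise the same arithmetic.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero vector has no direction; treat it as unrelated
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(
    query_vec: list[float],
    rows: list[tuple[str, list[float]]],
    k: int = 3,
) -> list[tuple[float, str]]:
    """Return the k (score, text) pairs most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in rows]
    # Sort on the score only, highest first
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

A brute-force scan like this is linear in the number of stored chunks, which is a reasonable tradeoff for a small knowledge base; a large corpus would call for an index such as the ones pgvector provides.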

To rank candidate snippets, the retrieval engine applies a custom scoring model that combines semantic vector similarity, an admin-assigned importance for each fact, and keyword overlap with the fact's tags:

Final Score = (Semantic Cosine Similarity * 0.75) + (Fact Importance * 0.15) + (Tag Overlap * 0.10)

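This weighting favors snippets that are semantically close to the question while still giving a boost to important facts and exact tag matches. Combined with the constrained system prompt, it reduces the chance that the model invents details that are not pre-seeded in the database, although no ranking formula can rule out hallucination entirely.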
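The weighted score above translates directly into code. In this sketch the weights come from the formula, but everything else is an assumption: the helper names tag_overlap and final_score are hypothetical, importance is assumed to be normalised to the range 0 to 1, and tag overlap is assumed to mean the fraction of a fact's tags that appear as words in the query.

```python
import re

def tag_overlap(query: str, tags: list[str]) -> float:
    """Fraction of a fact's tags that appear as words in the query.

    Assumed definition; multi-word tags never match in this simple version.
    """
    if not tags:
        return 0.0
    # Lowercase and split on non-word characters so "Python?" matches "python"
    words = set(re.findall(r"\w+", query.lower()))
    hits = sum(1 for tag in tags if tag.lower() in words)
    return hits / len(tags)

def final_score(cosine: float, importance: float, overlap: float) -> float:
    """Combine the three signals using the weights from the formula."""
    return 0.75 * cosine + 0.15 * importance + 0.10 * overlap
```

For example, a fact with cosine similarity 0.8, importance 1.0, and half of its tags matched scores 0.60 + 0.15 + 0.05 = 0.80. Because the weights sum to 1, the final score stays between 0 and 1 as long as each input does.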

🚀 Low-Latency Streaming over HTTP

Waiting for the full LLM reply to generate before sending anything to the front-end makes the chat feel slow. I designed the FastAPI router to use Python's async generators together with FastAPI's StreamingResponse, so text reaches the browser as soon as the model produces it.

On the front-end, a standard Web Streams API (ReadableStream) reader decodes the incoming text and appends each chunk to a React 19 state array, which produces a real-time typing effect.

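from fastapi import APIRouter
from fastapi.responses import StreamingResponse
from google import genai
from pydantic import BaseModel

router = APIRouter()
client = genai.Client()  # picks up the API key from the environment

class ChatRequest(BaseModel):
    # Only the field used by this endpoint is shown
    message: str

async def generate_response_stream(prompt: str):
    # Awaiting the call returns an async iterator of partial responses
    response = await client.aio.models.generate_content_stream(
        model="gemini-1.5-flash",
        contents=prompt,
    )
    async for chunk in response:
        # Some chunks carry no text; skip them rather than yield None
        if chunk.text:
            yield chunk.text

@router.post("/api/chat")
async def chat_endpoint(request: ChatRequest):
    # Retrieve context from the DB based on the question.
    # retrieve_rag_context and build_system_prompt are project helpers defined elsewhere.
    context = await retrieve_rag_context(request.message)
    prompt = build_system_prompt(context, request.message)

    return StreamingResponse(
        generate_response_stream(prompt),
        media_type="text/plain",
    )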
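The async-generator pattern can be exercised without calling the model at all, which is handy for testing the streaming path. The sketch below is an assumption-laden stand-in: FakeChunk, fake_model_stream, text_chunks, and run_demo are hypothetical names, and the fake stream simply mimics an SDK that yields objects with an optional text attribute.

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncIterator, Optional

@dataclass
class FakeChunk:
    """Stand-in for a streamed model chunk; text may be missing."""
    text: Optional[str]

async def fake_model_stream() -> AsyncIterator[FakeChunk]:
    for piece in ["Hello", None, ", ", "world"]:
        await asyncio.sleep(0)  # hand control back to the event loop, as a network read would
        yield FakeChunk(piece)

async def text_chunks(stream: AsyncIterator[FakeChunk]) -> AsyncIterator[str]:
    """Yield only the chunks that actually contain text."""
    async for chunk in stream:
        if chunk.text:
            yield chunk.text

async def collect(stream: AsyncIterator[str]) -> list[str]:
    # Drain the generator the way a streaming response body would
    return [piece async for piece in stream]

def run_demo() -> list[str]:
    return asyncio.run(collect(text_chunks(fake_model_stream())))
```

Calling run_demo() drains the generator piece by piece and drops the empty chunk. An async generator like text_chunks could be passed to StreamingResponse in place of a real model stream during tests.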

💡 Lessons Learned

Building a RAG system entirely from scratch showed me that operational simplicity often beats an over-engineered framework. By relying on a lightweight local embedding model, async database pools, and direct streaming APIs, I ended up with a fast chatbot that works both as a portfolio feature and as a demonstration of practical AI engineering.
