Designing a responsive, streaming AI assistant for a personal portfolio website is a good exercise in full-stack workflow orchestration. In this article, I explain how I built the Ask My AI chatbot that powers this site and walk through the engineering tradeoffs between backend latency, vector database operations, and real-time response generation.
🏗️ The Architectural Blueprint
Instead of relying on a heavy orchestration library like LangChain, which can introduce unwanted overhead and startup delays, I built a custom, lightweight asynchronous pipeline in Python with the FastAPI framework.
The workflow proceeds in three sequential stages:
- Semantic Search: Encoding user queries and matching them against pre-seeded knowledge in PostgreSQL.
- Context Injection: Incorporating matching snippets into a highly constrained system prompt.
- Response Streaming: Using the official Google GenAI SDK to stream markdown chunks directly to the Next.js client.
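To make the Context Injection stage concrete, here is a minimal sketch of a prompt builder. The prompt wording and the assumption that the retrieved context arrives as a list of strings are mine for illustration; the function name matches the helper called later in this article, but the production version may differ.

```python
def build_system_prompt(context: list[str], question: str) -> str:
    """Assemble a constrained prompt from retrieved snippets.

    Hypothetical wording: the instructions below are illustrative,
    not the exact prompt used on the site.
    """
    if context:
        # One bullet per retrieved snippet keeps the facts easy to cite
        facts = "\n".join(f"- {snippet}" for snippet in context)
    else:
        facts = "(no matching facts found)"
    return (
        "You are an assistant for a personal portfolio website.\n"
        "Answer using ONLY the facts listed below. If the facts do not "
        "cover the question, say that you do not know.\n\n"
        f"Facts:\n{facts}\n\n"
        f"Question: {question}"
    )
```

Keeping the facts and the question in clearly separated sections makes it easier to tell the model which text is trusted knowledge and which is user input.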
🔬 Local Vector Encoding with Sentence Transformers
While managed services like OpenAI's text-embedding-3-small are easy to use, they add a network hop and per-request API costs. To keep query encoding under 50 ms, I run the sentence-transformers/all-MiniLM-L6-v2 model locally on the server.
This model maps sentences and short paragraphs to a 384-dimensional dense vector space. It is lightweight, yet precise enough for paragraph-length portfolio facts.
from sentence_transformers import SentenceTransformer

# Load the model once at startup so every request reuses it
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def embed_text(text: str) -> list[float]:
    # Replace newlines for uniform sentence representation
    clean_text = text.replace("\n", " ")
    # encode() returns a NumPy array; convert it to a plain list for storage
    embedding = model.encode(clean_text)
    return embedding.tolist()
🗄️ Relational-Semantic Hybrid DB Layer
A key design choice was using PostgreSQL to manage both transactional logs and semantic knowledge. Although pgvector is the standard choice for vector search in PostgreSQL, I wanted to show that a custom cosine-similarity engine could do the job just as well here. The schema has three tables:
- portfolio_embeddings: Houses the chunks of my work experience, tags, and calculated vectors.
- chat_logs: Logs active session histories to allow multi-turn context retention.
- app_settings: Stores configuration parameters.
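The custom cosine-similarity engine mentioned above can be sketched in a few lines. This is a standard-library illustration, not the site's actual code: the `(row_id, vector)` row shape and the `top_k` helper are assumptions, and in practice the same math could be vectorized with NumPy over vectors loaded from portfolio_embeddings.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # A zero vector has no direction; treat it as unrelated
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(
    query: list[float],
    rows: list[tuple[int, list[float]]],  # hypothetical (row_id, vector) pairs
    k: int = 3,
) -> list[tuple[float, int]]:
    """Return the k best (score, row_id) pairs, highest score first."""
    scored = [(cosine_similarity(query, vec), row_id) for row_id, vec in rows]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

A brute-force scan like this is linear in the number of stored chunks, which is reasonable for a small, hand-curated knowledge base; a much larger corpus would be the point to reconsider an indexed approach such as pgvector.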
To keep answers accurate, the retrieval engine applies a custom scoring model that combines semantic vector similarity, an administrator-assigned importance weight, and keyword (tag) overlap:
Final Score = (Semantic Cosine Similarity * 0.75) + (Fact Importance * 0.15) + (Tag Overlap * 0.10)
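This formula keeps semantic similarity dominant while letting importance and tag overlap break ties between similar facts. Combined with the constrained system prompt, it helps keep responses grounded and reduces the risk of the model hallucinating details that are not pre-seeded in the database.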
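As a worked illustration of the scoring formula, the sketch below applies the same three weights. It assumes all three inputs are normalized to the range 0 to 1, and the `tag_overlap` definition (the fraction of a fact's tags that appear in the query) is my assumption, since the exact overlap measure is not spelled out here.

```python
def tag_overlap(query_terms: set[str], tags: set[str]) -> float:
    """Fraction of a fact's tags found in the query (assumed definition)."""
    if not tags:
        return 0.0
    return len(query_terms & tags) / len(tags)

def final_score(similarity: float, importance: float, overlap: float) -> float:
    """Weighted sum from the formula above; inputs assumed to lie in [0, 1]."""
    return similarity * 0.75 + importance * 0.15 + overlap * 0.10
```

For example, a fact with cosine similarity 0.8, importance 1.0, and tag overlap 0.5 scores 0.60 + 0.15 + 0.05 = 0.80. Because the weights sum to 1, the final score stays in the same 0 to 1 range as its inputs.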
🚀 Low-Latency Streaming over HTTP
For a responsive user experience, waiting for the full LLM reply to generate before sending anything to the front-end is unacceptable. I designed the FastAPI router to combine Python's async generators with FastAPI's StreamingResponse.
On the front-end, a standard Web Streams API (ReadableStream) reader reads the incoming text and appends each chunk to a React 19 state array, which produces a real-time typing animation.
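from fastapi import APIRouter
from fastapi.responses import StreamingResponse
from google import genai
from pydantic import BaseModel

router = APIRouter()
client = genai.Client()  # reads the API key from the environment

class ChatRequest(BaseModel):
    # Request body; only the field used below is shown
    message: str

async def generate_response_stream(prompt: str):
    # Open an async stream of partial responses from the model
    response = await client.aio.models.generate_content_stream(
        model="gemini-1.5-flash",
        contents=prompt
    )
    async for chunk in response:
        # Some chunks may carry no text; skip them
        if chunk.text:
            yield chunk.text

@router.post("/api/chat")
async def chat_endpoint(request: ChatRequest):
    # Retrieve context from DB based on question
    # (retrieve_rag_context and build_system_prompt are project helpers not shown here)
    context = await retrieve_rag_context(request.message)
    prompt = build_system_prompt(context, request.message)
    return StreamingResponse(
        generate_response_stream(prompt),
        media_type="text/plain"
    )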
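One detail worth keeping in mind on the consuming side: a text/plain stream arrives as raw bytes, and a chunk boundary can fall in the middle of a multi-byte UTF-8 character. In the browser, a streaming text decoder handles this. The sketch below is a Python analogue of the same idea, using only the standard library; the `decode_stream` name is hypothetical and this is not the site's client code.

```python
import codecs
from typing import Iterable, Iterator

def decode_stream(byte_chunks: Iterable[bytes]) -> Iterator[str]:
    """Yield text from byte chunks without breaking multi-byte characters."""
    # The incremental decoder buffers an incomplete trailing byte sequence
    # until the next chunk completes it
    decoder = codecs.getincrementaldecoder("utf-8")()
    for chunk in byte_chunks:
        text = decoder.decode(chunk)
        if text:
            yield text
    # Flush the decoder; raises if the stream ended mid-character
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail
```

Decoding each chunk independently would raise an error (or produce replacement characters) whenever a character is split across chunks, which is why the decoder state is kept across the whole stream.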
💡 Lessons Learned
Building a RAG system entirely from scratch showed me that operational simplicity often beats an over-engineered framework. By relying on a lightweight local embedding model, async database pools, and direct streaming APIs, I ended up with a fast chatbot that serves as both a portfolio showcase and a demonstration of practical AI engineering.