1. Introduction: When Static Chat Isn't Enough
Building a real-time chat product is one thing. Building a chat product that can remember, understand context, and adapt to each user is a different challenge altogether.
While building a conversational platform designed to help autistic individuals improve communication skills, we quickly realized that basic turn-based Q&A wasn't enough. Users needed interactions that felt natural and continuous, as if the AI truly remembered them. That's when we turned to LangChain and Pinecone.
2. Why LangChain?
LangChain is a framework that makes it easier to build complex, multi-step LLM workflows. It helps developers abstract away a lot of boilerplate when integrating memory, retrieval-augmented generation (RAG), and tools.
We picked LangChain because:
- It supported structured memory modules.
- It integrated well with vector stores like Pinecone.
- It offered flexibility via ConversationalRetrievalChain, ChatMessageHistory, and custom prompts.
This allowed us to focus on application logic and user experience rather than LLM plumbing.
3. Why Pinecone?
Traditional databases such as Firestore or SQL stores are great for storing structured chat data. But when we wanted to retrieve semantically relevant past messages, they fell short.
Pinecone solved this with:
- Fast vector similarity search
- Horizontal scalability
- Easy integration via API and LangChain wrappers
Each user's messages were converted to embeddings with OpenAI and stored in Pinecone. We used userId as the namespace to preserve privacy, and tagged each vector with sessionId and contactId for fast retrieval.
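To make "vector similarity search" concrete, here is a minimal brute-force sketch in TypeScript. It is only an illustration: Pinecone runs this kind of search at scale with its own index structures, and the names `EmbeddedMessage`, `cosineSimilarity`, and `mostSimilarIds` are ours for the example, not part of any library.

```typescript
// Illustrative only: a brute-force version of what a vector store does for us.
interface EmbeddedMessage {
  id: string;
  values: number[];
}

// Cosine similarity between two equal-length vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error(`Length mismatch: ${a.length} vs ${b.length}`);
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // Treat a zero vector as "not similar to anything".
  if (normA === 0 || normB === 0) {
    return 0;
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Return the ids of the k stored vectors most similar to the query.
function mostSimilarIds(
  query: number[],
  stored: EmbeddedMessage[],
  k: number
): string[] {
  return stored
    .map((v) => ({ id: v.id, score: cosineSimilarity(query, v.values) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((v) => v.id);
}
```

In our stack the query vector comes from the same embedding model used at write time, and the search itself happens inside Pinecone rather than in application code.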
4. System Architecture Overview
Here's a high-level view of how things fit together:
Frontend: React + Next.js
- Push-to-talk UI
- Real-time chat with visual feedback
API Gateway: Kong Gateway
- Secures and routes API traffic between frontend and backend services
- Enables rate-limiting, authentication, and service discovery
Backend: NestJS + LangChain
- Message received via Kong
- OpenAI for response generation
- Store in Pinecone
- Retrieve relevant messages based on similarity
- LangChain orchestrates context + message into LLM prompt
Database:
- Firestore for chat structure; Pinecone for vector context
Deployment:
- Deployed each service as a Docker container on Google Cloud Run to reduce initial costs
- Used GitHub Actions to upload Docker images to Google Docker Repo
- Deployed using Google Cloud commands in GitHub Actions
LangChain served as our chat orchestration layer. It is not a full system orchestrator like Kubernetes; it is the part of the stack responsible for message understanding, memory management, and LLM routing.
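The "context + message into LLM prompt" step can be sketched as a plain function. This is a simplified stand-in for what LangChain's chains and prompt templates do for us, not our production code; `RetrievedMessage`, `buildPrompt`, and the prompt wording are all hypothetical.

```typescript
// Hypothetical shape of a past message returned by the retrieval step.
interface RetrievedMessage {
  role: "user" | "assistant";
  text: string;
}

// Combine a system instruction, retrieved history, and the new message
// into one prompt string for the LLM.
function buildPrompt(
  systemInstruction: string,
  context: RetrievedMessage[],
  userMessage: string
): string {
  const contextBlock =
    context.length > 0
      ? context.map((m) => `${m.role}: ${m.text}`).join("\n")
      : "(no relevant history found)";

  return [
    systemInstruction,
    "",
    "Relevant past messages:",
    contextBlock,
    "",
    `Current message: ${userMessage}`,
  ].join("\n");
}
```

Keeping this step as a pure function makes it easy to unit test prompt changes without calling the model.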
5. Structuring Vector Data
We experimented with both message-level and session-level embedding granularity. Message-level gave us better results for context injection.
Each vector stored in Pinecone included metadata like:
{
  "memberId": "abc123",
  "sessionId": "sess001",
  "timestamp": 1714600000
}
This helped us retrieve the most relevant chunks across all sessions.
All messages were also stored under each user's ID as the Pinecone namespace to preserve privacy across users. In the future, we plan to use the member ID as well to keep discussions with individual contacts private from one another.
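As a rough sketch of how a message becomes a namespaced record, the helper below pairs an embedding with the metadata fields shown above. The `id` / `values` / `metadata` shape follows the general form of a Pinecone vector record, but `buildUpsert` and the type names are hypothetical, the timestamp is assumed to be Unix seconds, and the actual client call is left out.

```typescript
// Metadata fields from the example above (timestamp assumed to be Unix seconds).
interface MessageMetadata {
  memberId: string;
  sessionId: string;
  timestamp: number;
}

interface MessageRecord {
  id: string;
  values: number[];
  metadata: MessageMetadata;
}

// Hypothetical wrapper: one namespace per user, plus the records to write.
interface NamespacedUpsert {
  namespace: string;
  records: MessageRecord[];
}

// Build the payload for storing one message under the user's namespace.
function buildUpsert(
  userId: string,
  messageId: string,
  embedding: number[],
  metadata: MessageMetadata
): NamespacedUpsert {
  // Refuse to write without a namespace, so data never lands in a shared space.
  if (userId.trim() === "") {
    throw new Error("userId is required to pick a namespace");
  }
  return {
    namespace: userId,
    records: [{ id: messageId, values: embedding, metadata }],
  };
}
```

Failing fast on an empty userId is a deliberate choice here: a silent fallback to a default namespace would undo the privacy boundary.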
6. Lessons from Production
Dimensional Mismatch
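We initially hit a snag: our Pinecone index was created with 1024 dimensions, but the OpenAI embeddings we generated had 1536. Pinecone only accepts vectors whose length matches the index dimension, so uploads failed. We fixed it by switching to an embedding model whose output dimension matched the index.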
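A cheap guard would have caught this before any network call. The sketch below is one way to do it; `checkEmbeddingDimension` is a hypothetical helper, and the hard-coded constant stands in for a value you would normally read from your index configuration.

```typescript
// Dimension of our index as described above; in practice, read this from config.
const INDEX_DIMENSION = 1024;

// Throw a descriptive error if the embedding cannot be stored in the index.
function checkEmbeddingDimension(
  embedding: number[],
  indexDimension: number = INDEX_DIMENSION
): void {
  if (embedding.length !== indexDimension) {
    throw new Error(
      `Embedding has ${embedding.length} dimensions, but the index expects ${indexDimension}`
    );
  }
}
```

Calling this right after the embedding step turns a confusing upload failure into an error message that names both numbers.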
Latency vs Relevance
Fetching 5 similar messages worked best. More than that increased token cost and response delay. We tuned similarity thresholds to filter noise.
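The filtering step can be sketched as follows. The default of 5 results comes from our experience above; the 0.75 threshold is a placeholder, not the value we shipped, and `ScoredMatch` and `selectMatches` are hypothetical names.

```typescript
// Minimal shape of a similarity search result.
interface ScoredMatch {
  id: string;
  score: number;
}

// Drop low-similarity matches, then keep at most `limit` of the best ones.
function selectMatches(
  matches: ScoredMatch[],
  limit: number = 5,
  minScore: number = 0.75 // placeholder threshold; tune against your own data
): ScoredMatch[] {
  return matches
    .filter((m) => m.score >= minScore) // filter() copies, so the input is not mutated
    .sort((a, b) => b.score - a.score)
    .slice(0, limit);
}
```

Applying the threshold before the limit means a quiet history returns fewer than 5 messages instead of padding the prompt with noise.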
Prompt Engineering
Too much context confused the LLM. We curated message history using recency and semantic closeness.
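One way to combine recency and semantic closeness is a weighted blend, sketched below. The 0.7 weight and one-week half-life are illustrative assumptions rather than our tuned values, timestamps are assumed to be Unix seconds, and all names are hypothetical.

```typescript
interface CandidateMessage {
  id: string;
  similarity: number; // score from the vector search, assumed to be in 0..1
  timestamp: number; // assumed Unix seconds
}

// Blend similarity with an exponential recency decay.
function blendedScore(
  m: CandidateMessage,
  nowSeconds: number,
  similarityWeight: number,
  halfLifeSeconds: number
): number {
  const ageSeconds = Math.max(0, nowSeconds - m.timestamp);
  // Recency is 1 for a brand-new message and halves every half-life.
  const recency = Math.pow(0.5, ageSeconds / halfLifeSeconds);
  return similarityWeight * m.similarity + (1 - similarityWeight) * recency;
}

// Order candidates from most to least useful for the prompt.
function curateHistory(
  candidates: CandidateMessage[],
  nowSeconds: number,
  similarityWeight: number = 0.7, // illustrative
  halfLifeSeconds: number = 7 * 24 * 60 * 60 // illustrative: one week
): CandidateMessage[] {
  return candidates
    .slice()
    .sort(
      (a, b) =>
        blendedScore(b, nowSeconds, similarityWeight, halfLifeSeconds) -
        blendedScore(a, nowSeconds, similarityWeight, halfLifeSeconds)
    );
}
```

Passing the current time in as a parameter keeps the ranking deterministic in tests.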
Token Budget Management
To fit within model limits, we tokenized and trimmed messages dynamically based on Pinecone scores.
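A minimal sketch of score-based trimming is shown below. It uses a rough "4 characters per token" estimate as a stand-in for a real tokenizer, which is an assumption for the example only; `ScoredText`, `estimateTokens`, and `fitToTokenBudget` are hypothetical names.

```typescript
interface ScoredText {
  id: string;
  text: string;
  score: number; // similarity score returned by the vector search
}

// Rough stand-in for a real tokenizer: about 4 characters per token.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Keep the highest-scoring messages that fit within the token budget.
function fitToTokenBudget(
  messages: ScoredText[],
  maxTokens: number
): ScoredText[] {
  const byScore = messages.slice().sort((a, b) => b.score - a.score);
  const kept: ScoredText[] = [];
  let used = 0;

  byScore.forEach((m) => {
    const cost = estimateTokens(m.text);
    // Skip anything that would overflow; a smaller, lower-scored
    // message later in the list may still fit.
    if (used + cost > maxTokens) {
      return;
    }
    kept.push(m);
    used += cost;
  });

  return kept;
}
```

Swapping `estimateTokens` for the tokenizer that matches your model gives exact counts without changing the selection logic.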
7. Impact and Results
- 30% improvement in the relevance of AI responses (measured via manual reviews)
- 60% faster retrieval vs Firestore query-based lookups
- Reduced user frustration: AI began picking up past interactions, making conversations smoother
8. What We Would Do Differently
- Use a hybrid memory approach: Redis for short-term memory, Pinecone for long-term
- Add summarization to reduce token load
- Explore local embedding models to reduce OpenAI dependency
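To illustrate the hybrid memory idea from the first bullet, here is a small sketch. An in-process buffer stands in for Redis, and long-term matches are passed in as plain data where Pinecone results would go. We have not built this yet, so every name here is hypothetical.

```typescript
interface MemoryItem {
  id: string;
  text: string;
}

// In-process stand-in for a short-term store such as Redis:
// remembers only the most recent turns.
class RecentTurns {
  private items: MemoryItem[] = [];
  private maxTurns: number;

  constructor(maxTurns: number) {
    this.maxTurns = maxTurns;
  }

  add(item: MemoryItem): void {
    this.items.push(item);
    // Evict the oldest turn once the window is full.
    if (this.items.length > this.maxTurns) {
      this.items.shift();
    }
  }

  list(): MemoryItem[] {
    return this.items.slice();
  }
}

// Combine long-term matches with recent turns, dropping duplicates by id.
function mergeMemory(
  longTerm: MemoryItem[],
  recent: MemoryItem[]
): MemoryItem[] {
  const seen: { [id: string]: boolean } = {};
  const merged: MemoryItem[] = [];

  longTerm.concat(recent).forEach((item) => {
    if (seen[item.id]) {
      return;
    }
    seen[item.id] = true;
    merged.push(item);
  });

  return merged;
}
```

De-duplicating by id matters because a message from a few turns ago can show up in both the recent window and the similarity results.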
9. Bonus: Firestore vs Pinecone
| Feature | Firestore | Pinecone |
|---|---|---|
| Fast semantic search | No | Yes |
| Low cost | Yes | No (usage-based) |
| Context-relevant retrieval | No | Yes |
| Setup effort | Yes | Yes |
We continue to use both: Firestore for storing structured chat threads, and Pinecone for semantic recall.
10. Final Thoughts
LangChain and Pinecone gave us the tools to elevate our chatbot from a reactive assistant to a remembering, context-aware partner.
If you're building any chat product where user memory matters, this stack is more than worth considering.
About the Author: I'm a full-stack builder working at the intersection of AI and accessibility.