1. Introduction: When Static Chat Isn't Enough

Building a real-time chat product is one thing. Building a chat product that can remember, understand context, and adapt to each user is a different challenge altogether.

While building a conversational platform designed to help autistic individuals improve their communication skills, we quickly realized that basic turn-based Q&A wasn't enough. Users needed interactions that felt natural and continuous, as if the AI truly remembered them. That's when we turned to LangChain and Pinecone.

2. Why LangChain?

LangChain is a framework that makes it easier to build complex, multi-step LLM workflows. It helps developers abstract away a lot of boilerplate when integrating memory, retrieval-augmented generation (RAG), and tools.

We picked LangChain because:

This allowed us to focus on application logic and user experience rather than LLM plumbing.

3. Why Pinecone?

General-purpose databases such as Firestore or a SQL database are great for storing structured chat data. But when we wanted to retrieve semantically relevant past messages, they fell short.

Pinecone solved this with:

Each user's messages were converted to embeddings using OpenAI and stored in Pinecone. The userId served as the namespace to preserve privacy, and each record was also indexed by sessionId and contactId for fast retrieval.
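To make the namespace-per-user idea concrete, here is a minimal sketch of how a single message could be packaged before it is written to the vector store. The `buildUpsert` helper, the `sessionId:messageId` id format, and the choice to carry sessionId and contactId as metadata are assumptions made for illustration, not a description of our exact code.

```typescript
// Shape of one vector record: an id, the embedding values, and metadata.
interface MessageRecord {
  id: string;
  values: number[];
  metadata: { sessionId: string; contactId: string };
}

// Hypothetical wrapper that pairs records with the namespace they belong to.
interface NamespacedUpsert {
  namespace: string;
  records: MessageRecord[];
}

function buildUpsert(
  userId: string,
  sessionId: string,
  contactId: string,
  messageId: string,
  embedding: number[],
): NamespacedUpsert {
  if (!userId) {
    // Never fall back to a shared namespace; that would mix users' data.
    throw new Error("userId is required to select a namespace");
  }
  return {
    namespace: userId, // one namespace per user keeps queries isolated
    records: [
      {
        id: `${sessionId}:${messageId}`, // assumed id scheme, unique within a user
        values: embedding,
        metadata: { sessionId, contactId },
      },
    ],
  };
}
```

The returned payload would then be handed to the vector store client's upsert call for that namespace. Because every query is also scoped to a namespace, one user's messages cannot surface in another user's results.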

4. System Architecture Overview

Here's a high-level view of how things fit together:

Frontend: React + Next.js

API Gateway: Kong Gateway

Backend: NestJS + LangChain

Database:

Deployment:

LangChain served as our chat orchestration layer. It is not a full system orchestrator like Kubernetes. It is the part of the stack responsible for message understanding, memory management, and LLM routing.
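To show what "chat orchestration" means at the level of a single turn, here is a framework-free sketch. It deliberately does not use LangChain's API. The `RetrieveMemories` and `CallModel` types, `buildPrompt`, and `handleTurn` are hypothetical names that stand in for the retrieval and model calls a framework would wire up.

```typescript
// Hypothetical dependency signatures. In a real service these would be backed
// by a vector store query and an LLM client respectively.
type RetrieveMemories = (userId: string, query: string) => Promise<string[]>;
type CallModel = (prompt: string) => Promise<string>;

// Combine recalled messages with the new user message into one prompt.
function buildPrompt(memories: string[], userMessage: string): string {
  return [
    "Relevant past messages:",
    ...memories.map((memory) => `- ${memory}`),
    "",
    `User: ${userMessage}`,
  ].join("\n");
}

async function handleTurn(
  userId: string,
  userMessage: string,
  retrieve: RetrieveMemories,
  callModel: CallModel,
): Promise<string> {
  // 1. Memory management: pull past messages relevant to this turn.
  const memories = await retrieve(userId, userMessage);
  // 2. Message understanding: give the model recalled context plus the new message.
  const prompt = buildPrompt(memories, userMessage);
  // 3. LLM routing: a single model here; a router could choose one per request.
  return callModel(prompt);
}
```

Passing the retriever and the model in as functions keeps the turn logic testable without network access.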

5. Structuring Vector Data

We experimented with both message-level and session-level embedding granularity. Message-level gave us better results for context injection.

Each vector stored in Pinecone included metadata like:

{
  "memberId": "abc123",
  "sessionId": "sess001",
  "timestamp": 1714600000
}

This helped us retrieve the most relevant chunks across all sessions.
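As an illustration of how this metadata can drive retrieval, the sketch below builds a metadata filter for a query. Pinecone filters use MongoDB-style operators such as `$eq` and `$gte`. The `buildFilter` helper and its optional `sinceTimestamp` argument are assumptions for the example.

```typescript
// Mirrors the metadata fields shown above.
interface MessageMetadata {
  memberId: string;
  sessionId: string;
  timestamp: number; // Unix seconds, as in the example above
}

type MetadataFilter = Record<string, Record<string, string | number>>;

// Build a filter that matches one member's messages, optionally only recent ones.
// sessionId is left out on purpose so matches can come from any session.
function buildFilter(memberId: string, sinceTimestamp?: number): MetadataFilter {
  const filter: MetadataFilter = {
    memberId: { $eq: memberId },
  };
  if (sinceTimestamp !== undefined) {
    filter["timestamp"] = { $gte: sinceTimestamp };
  }
  return filter;
}

// Example metadata object with the same shape as the JSON above.
const exampleMetadata: MessageMetadata = {
  memberId: "abc123",
  sessionId: "sess001",
  timestamp: 1714600000,
};
```

Adding a `sessionId` clause to the same filter would narrow retrieval to a single session when that is what a feature needs.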

All messages were also stored under each user's ID as the namespace, which keeps one user's data isolated from another's. In the future, we also plan to use the member ID to maintain privacy in discussions between individual contacts.

6. Lessons from Production

Dimensional Mismatch

We initially hit a snag: our Pinecone index was 1024-dimensional, but the OpenAI embeddings we generated were 1536-dimensional. This caused upload errors. We fixed it by switching to an embedding model whose output dimension matched our index.
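A cheap guard against this class of error is to validate vector length before uploading. The sketch below assumes the 1024-dimensional index mentioned above. `assertDimension` is a hypothetical helper, not part of any SDK.

```typescript
// Dimension of the index described above.
const INDEX_DIMENSION = 1024;

// Throw early, with a readable message, instead of failing during upload.
function assertDimension(embedding: number[], expected: number = INDEX_DIMENSION): void {
  if (embedding.length !== expected) {
    throw new Error(
      `Embedding has ${embedding.length} dimensions, but the index expects ${expected}`,
    );
  }
}
```

Calling this right after the embedding request turns a confusing upload failure into an immediate, descriptive error. Some newer OpenAI embedding models also accept a `dimensions` request parameter that shortens the output vector. That is another possible route, but check the API reference for the model in use.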

Latency vs Relevance

Fetching the 5 most similar messages worked best for us. More than that increased token cost and response delay. We also tuned similarity thresholds to filter out noise.
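The combination of a fixed result count and a similarity threshold can be sketched as a small post-processing step on query results. The `ScoredMatch` shape and the default `minScore` of 0.75 are assumptions for illustration. The right threshold depends on the embedding model and the data.

```typescript
// Minimal shape of a retrieved match: an id, a similarity score, and the message text.
interface ScoredMatch {
  id: string;
  score: number;
  text: string;
}

// Drop weak matches first, then keep at most topK of the strongest.
function selectMatches(
  matches: ScoredMatch[],
  topK: number = 5,
  minScore: number = 0.75, // assumed value, tune per model and dataset
): ScoredMatch[] {
  return matches
    .filter((match) => match.score >= minScore)
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}
```

Filtering before slicing means a query with only two strong matches returns two messages and does not pad the prompt with three noisy ones.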

Prompt Engineering

Too much context confused the LLM. We curated message history using recency and semantic closeness.
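One way to combine recency and semantic closeness is a weighted blend of the similarity score and an exponential recency decay. This is a sketch of that idea only. The one-week half-life and the 0.3 recency weight are assumed values, and `rankByRecencyAndSimilarity` is a hypothetical helper.

```typescript
interface Candidate {
  text: string;
  score: number; // similarity score from the vector store, assumed to be in [0, 1]
  timestamp: number; // Unix seconds
}

const ONE_WEEK_SECONDS = 7 * 24 * 3600;

function rankByRecencyAndSimilarity(
  candidates: Candidate[],
  nowSeconds: number,
  halfLifeSeconds: number = ONE_WEEK_SECONDS, // assumed: recency halves every week
  recencyWeight: number = 0.3, // assumed: similarity still dominates
): Candidate[] {
  const combined = (candidate: Candidate): number => {
    const age = Math.max(0, nowSeconds - candidate.timestamp);
    const recency = Math.pow(0.5, age / halfLifeSeconds); // 1 when new, decays toward 0
    return (1 - recencyWeight) * candidate.score + recencyWeight * recency;
  };
  // Copy before sorting so the caller's array is left untouched.
  return [...candidates].sort((a, b) => combined(b) - combined(a));
}
```

With these defaults, a fresh message with a similarity of 0.7 outranks a ten-week-old message with a similarity of 0.8. Whether that trade-off is right depends on the product.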

Token Budget Management

To fit within model limits, we tokenized and trimmed messages dynamically based on Pinecone scores.
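Score-based trimming can be sketched as a greedy selection: keep the highest-scoring messages that still fit the budget, then restore their original order. The `estimateTokens` function below is a rough stand-in (about four characters per token) for a real tokenizer, and `fitToBudget` is a hypothetical helper.

```typescript
interface ScoredMessage {
  text: string;
  score: number; // similarity score returned by the vector store
}

// Rough stand-in for a real tokenizer: about 4 characters per token.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

function fitToBudget(messages: ScoredMessage[], maxTokens: number): ScoredMessage[] {
  // Remember each message's original position, then visit best scores first.
  const byScore = messages
    .map((message, index) => ({ message, index }))
    .sort((a, b) => b.message.score - a.message.score);

  const kept: { message: ScoredMessage; index: number }[] = [];
  let usedTokens = 0;
  for (const entry of byScore) {
    const cost = estimateTokens(entry.message.text);
    if (usedTokens + cost > maxTokens) continue; // too big, a smaller one may still fit
    kept.push(entry);
    usedTokens += cost;
  }
  // Restore the input order so the prompt reads in the order messages were supplied.
  return kept.sort((a, b) => a.index - b.index).map((entry) => entry.message);
}
```

In production the character heuristic should be replaced with the tokenizer that matches the target model, since counts can differ noticeably for code or non-English text.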

7. Impact and Results

8. What We Would Do Differently

9. Bonus: Firestore vs Pinecone

| Feature | Firestore | Pinecone |
|---|---|---|
| Speed (semantic) | No | Yes |
| Cost | Yes (low) | No (usage-based) |
| Context relevance | No | Yes |
| Setup effort | Yes | Yes |

We continue to use both: Firestore for storing structured chat threads, and Pinecone for semantic recall.

10. Final Thoughts

LangChain and Pinecone gave us the tools to elevate our chatbot from a reactive assistant to a remembering, context-aware partner.

If you're building any chat product where user memory matters, this stack is more than worth considering.

About the Author: I'm a full-stack builder working at the intersection of AI and accessibility.

Continue the Discussion

If you are building a RAG-powered chat product and want help with memory design, retrieval quality, and production guardrails, book a CTO consultation.

You can also connect with me on LinkedIn to compare implementation approaches.