Building Memory That Doesn't Lapse
Breaking Free from OpenAI and switching to local LLM infrastructure
The Crisis
It was a typical Thursday morning when I started noticing some anomalies in my OpenClaw agent’s memory system. My ReBAC memory persistence system, which had been working perfectly, suddenly started throwing timeouts when ingesting memories at the end of each turn:
[plugins] openclaw-memory-graphiti: deferred SpiceDB write failed for memory_store:
Error: Failed to resolve episode UUID for "memory_1770911378846" in group "main"
after 90s — episode not yet visible in get_episodes (Graphiti LLM processing may
still be running)
The culprit? An OpenAI API outage. Both the ingestion portion of my memory system (i.e., the part of the pipeline that uses embeddings to create new memories) and recall (which uses embeddings to build the search query) had been rendered inoperable. Clearly, I needed a backup (or perhaps a complete replacement) for times like this.
And you can see me talk about my experience… and the final result… at SpiceDB Community Day 2026! Register here.
The Problem: Multiple External Dependencies
The Graphiti MCP server, upon which the ReBAC memory system relies, was using external models (by default OpenAI) for several independent tasks, including:
Entity Extraction - Using gpt-4o-mini to analyze conversations and extract structured knowledge (entities, relationships, facts)
Embeddings - Using text-embedding-3-small to generate 1536-dimensional vectors for semantic search
Search - Using text-embedding-3-small to create the search query and (potentially) gpt-4.1-nano inside the OpenAIRerankerClient to perform reranking.
Using the default Graphiti configuration, when OpenAI models go down, one or more of these systems fail, depending upon the extent of the outage. Beyond the single point of failure created by relying solely upon OpenAI, I was also paying for API calls that could easily run locally on hardware I already owned.
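To make the dependency surface concrete, here is a rough sketch of the request shapes behind those tasks. The payload builders are mine, not Graphiti’s actual code; the model names are the defaults mentioned above.

```python
# Illustrative payload builders for the OpenAI-compatible endpoints Graphiti
# hits during a turn; not Graphiti's real code.

def extraction_request(conversation: str) -> dict:
    # Entity extraction: a chat completion asking the LLM for structured facts
    return {
        "model": "gpt-4o-mini",
        "messages": [
            {
                "role": "system",
                "content": "Extract entities, relationships, and facts as JSON.",
            },
            {"role": "user", "content": conversation},
        ],
        "temperature": 0.0,
    }

def embedding_request(text: str) -> dict:
    # Embeddings: called both at ingestion time (new memories) and at search
    # time (the query itself), so an outage breaks ingest and recall at once
    return {"model": "text-embedding-3-small", "input": text}

# One turn touches both endpoints, all served by the same external provider
calls = [
    ("/v1/chat/completions", extraction_request("User: my cat is named Miso.")),
    ("/v1/embeddings", embedding_request("my cat is named Miso")),
    ("/v1/embeddings", embedding_request("what is my cat's name?")),
]
for path, body in calls:
    print(path, body["model"])
```

Every one of these calls goes to the same external provider, which is exactly the single point of failure at issue.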
The Solution: Go Local with Ollama
I decided to replace all OpenAI dependencies with local models, either GPU-accelerated or running directly on the CPU:
Entity Extraction: nvidia/nemotron-3-nano:30b via Ollama (GPU-accelerated)
Embeddings: nomic-ai/nomic-embed-text-v1.5 via Ollama (CPU-only)
Reranker: BAAI/bge-reranker-v2-m3 via the sentence-transformers library (CPU-only)
Benefits:
✅ No more external dependencies
✅ No API costs
✅ Complete data privacy
✅ Works even if internet connectivity is down
Architecture Overview
┌─────────────────────────────────────────────────────────────────────────────────────┐
│ Graphiti Core │
│ ┌─────────────────────────┬──────────────────────────┬─────────────────────────┐ │
│ │ LLM Client │ Embedder Client │ Reranker Client │ │
│ │ (Entity Extraction) │ (Vector Generation) │ (Search) │ │
│ └─────────────┬───────────┴──────────────┬───────────┴────────────┬────────────┘ │
└────────────────┼──────────────────────────┼────────────────────────┼────────────────┘
│ │ │
▼ ▼ ▼
┌───────────────────────┐ ┌───────────────────────┐ ┌───────────────────────┐
│ Ollama Server #1 │ │ Ollama Server #2 │ │ sentence-transformers │
│ :11434/v1 │ │ :11434/v1 │ │ │
│ (DGX-Spark) │ │ (OpenClaw machine) │ │ (OpenClaw machine) │
│ │ │ │ │ │
│ nemotron-3-nano:30b │ │ nomic-embed-text │ │ bge-reranker-v2-m3 │
│ (GPU-accelerated) │ │ (CPU-only) │ │ (CPU-only) │
└───────────────────────┘ └───────────────────────┘ └───────────────────────┘
Implementation Journey
Looking at the Graphiti codebase, it initially seemed like all the pieces were there; I just needed to swap out the OpenAI dependencies for their Ollama (and BGE) alternatives. But this plan quickly ran into a snag: not every configuration option available in Graphiti core is directly exposed for external use.
Challenge #1: The Factory Pattern
Graphiti uses a factory pattern to create LLM clients. The existing code had:
OpenAIClient - Uses OpenAI’s proprietary responses.parse() API (structured outputs)
OpenAIGenericClient - Uses the standard OpenAI-compatible chat.completions.create() API
The factory was creating OpenAIClient but wasn’t hooked up to OpenAIGenericClient. I needed to add a new provider case.
Challenge #2: Embedding Dimension Mismatch
OpenAI’s text-embedding-3-small generates 1536-dimensional vectors, but nomic-embed-text-v1.5 generates 768-dimensional vectors.
Normally this would require a database migration. Lucky break: my FalkorDB instance was empty (zero episodes in all groups due to a previous misconfiguration), so I could just switch dimensions without any data migration. This is something to bear in mind, however… if you change your embedding model, it will likely invalidate your embeddings database.
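Why the mismatch matters: cosine similarity, the usual semantic-search scorer, is undefined across different dimensions, so 1536-d vectors written by the old model can never be compared against 768-d query vectors from the new one. A minimal illustration:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    # Semantic search scores stored vectors against the query vector;
    # the dot product is only defined when both have the same dimension
    if len(a) != len(b):
        raise ValueError(f"dimension mismatch: {len(a)} vs {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

stored = [0.1] * 1536  # vector written by text-embedding-3-small
query = [0.1] * 768    # query vector from nomic-embed-text-v1.5

try:
    cosine(stored, query)
except ValueError as err:
    print(err)  # dimension mismatch: 1536 vs 768
```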
The Code Changes
1. Added openai_generic Provider to Factory
File: graphiti/mcp_server/src/services/factories.py
After the case 'openai': block, I added:
case 'openai_generic':
    if not config.providers.openai:
        raise ValueError('OpenAI provider configuration not found')
    api_key = config.providers.openai.api_key or 'not-needed'
    api_url = config.providers.openai.api_url
    logger.info(f'Creating OpenAI Generic client (base_url: {api_url})')

    from graphiti_core.llm_client.openai_generic_client import OpenAIGenericClient
    from graphiti_core.llm_client.config import LLMConfig as CoreLLMConfig

    llm_config = CoreLLMConfig(
        api_key=api_key,
        base_url=api_url,
        model=config.model,
        temperature=config.temperature,
        max_tokens=config.max_tokens,
    )
    return OpenAIGenericClient(config=llm_config, max_tokens=config.max_tokens)
Key insight: The openai_generic provider reuses the config.providers.openai section for credentials/URLs. This keeps the config schema clean.
2. Update Config for Local Endpoints
File: graphiti/mcp_server/config/config.yaml
LLM section (entity extraction):
llm:
  provider: "openai"  # Ollama is OpenAI-compatible
  model: "nemotron-3-nano:30b"
  max_tokens: 32768
  temperature: 0.0
  providers:
    openai:
      api_key: "not-needed"
      api_url: ${LLM_API_URL}
Embedder section (vector generation):
embedder:
  provider: "openai"  # Ollama is OpenAI-compatible
  model: "nomic-embed-text:v1.5"
  dimensions: 768  # Changed from 1536
  providers:
    openai:
      api_key: "not-needed"
      api_url: ${EMBEDDER_API_URL}
Why provider: "openai"? Ollama implements the OpenAI-compatible API, so the existing EmbedderFactory works seamlessly with any OpenAI-compatible endpoint. No code changes needed!
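The ${LLM_API_URL} and ${EMBEDDER_API_URL} placeholders are resolved from the environment when the config is loaded. A minimal sketch of that kind of substitution (the actual loader in Graphiti may differ in details):

```python
import os
import re

def expand_env(value: str) -> str:
    # Replace ${VAR} with the value of environment variable VAR (empty if
    # unset), the way many YAML config loaders interpolate placeholders
    return re.sub(r"\$\{(\w+)\}", lambda m: os.environ.get(m.group(1), ""), value)

os.environ["EMBEDDER_API_URL"] = "http://localhost:11434/v1"
print(expand_env("api_url: ${EMBEDDER_API_URL}"))  # api_url: http://localhost:11434/v1
```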
3. Environment Variables
File: openclaw-memory-graphiti/.env
# Local LLM configuration (Ollama)
LLM_PROVIDER=openai_generic
LLM_MODEL=nemotron-3-nano:30b
LLM_API_URL=http://ollama-server:11434/v1
LLM_MAX_TOKENS=32768
# Local embedder configuration (Ollama)
EMBEDDER_MODEL=nomic-embed-text:v1.5
EMBEDDER_API_URL=http://localhost:11434/v1
EMBEDDER_DIMENSIONS=768
EMBEDDER_API_KEY=not-needed
# Local reranker
RERANKER_PROVIDER=bge

Infrastructure Setup
Ollama Server #1: Entity Extraction (GPU-Accelerated)
# Run Ollama with GPU support
docker run -d \
--name ollama-llm \
--gpus all \
-p 11434:11434 \
-v ollama-llm-data:/root/.ollama \
ollama/ollama:latest
# Pull the Nemotron model for entity extraction (~17GB)
docker exec ollama-llm ollama pull nemotron-3-nano:30b
Why Nemotron-3-nano:30b?
Excellent instruction following for knowledge extraction
30B parameters strike a balance between quality and speed
GPU-accelerated for fast inference
OpenAI-compatible API endpoint
Ollama Server #2: Embeddings (CPU-Only)
# Run Ollama in Docker (CPU-only, no GPU needed)
docker run -d \
--name ollama-embeddings \
-p 11434:11434 \
-v ollama-embeddings-data:/root/.ollama \
ollama/ollama:latest
# Pull the embedding model (~500MB)
docker exec ollama-embeddings ollama pull nomic-embed-text:v1.5

Why separate Ollama instances?
Dedicated resources: Entity extraction gets GPU, embeddings run on CPU
Independent scaling: Can run on different machines
Isolation: Model loading/updates don’t affect each other
Why Ollama for (almost) everything?
Straightforward setup (one command per model)
OpenAI-compatible API (/v1/chat/completions, /v1/embeddings)
Automatic GPU detection and utilization
Built-in model management (ollama pull, ollama list)
Tiny memory footprint (~2GB RAM for embeddings)
Final validation (or so I thought)
I did some final testing and submitted a PR to the Graphiti repo, receiving some feedback in fairly short order. I addressed the feedback, merged in some changes from main that had occurred in the interim and then… the wheels came off.
Suddenly, the automatic ingestion occurring at the end of each turn was consistently timing out again. This time I knew that it couldn’t be a model outage, since everything was running locally. So what had happened?
Checking through the changes I had pulled in from main, two PRs caught my eye:
feat: simplify extraction pipeline and add batch entity summarization (#1224)
feat: driver operations architecture redesign (#1232)
The net result? Ingesting a single episode went from 60-90 seconds (not great, but ingestion doesn’t need to be instantaneous) to more than 15 minutes. In particular, while the number of LLM calls necessary for entity extraction had roughly doubled (from ~15 to ~30), the number of embedding calls per episode had gone from ~40 to ~300. Clearly something had changed, and not for the better; while a moderate performance hit when running local models is not unexpected, this was in another league entirely. Perhaps local models were no longer viable for use with the newly rearchitected Graphiti?
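A back-of-envelope check makes the regression tangible. The call counts come from my observations above; the per-call latencies are purely my guesses for local models.

```python
# Observed call counts per episode, before and after the merge from main
before = {"llm_calls": 15, "embed_calls": 40}
after = {"llm_calls": 30, "embed_calls": 300}

# Assumed per-call latencies (my guesses): local GPU LLM and CPU embeddings,
# processed serially
llm_s, embed_s = 2.0, 1.2

def total_seconds(counts: dict) -> float:
    return counts["llm_calls"] * llm_s + counts["embed_calls"] * embed_s

print(f"before: ~{total_seconds(before) / 60:.1f} min")  # ~1.3 min
print(f"after:  ~{total_seconds(after) / 60:.1f} min")   # ~7.0 min
```

Under these assumptions the call-count growth alone predicts roughly a 5x slowdown, still short of the observed 15+ minutes, which suggests per-call cost grew as well.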
This sent me back to the drawing board, leading me to another memory framework that had readily available examples making use of local models. More on that experience in my next post.
In the meantime, this experience was not without its learnings…
Lessons Learned
OpenAI-Compatibility is Great: The OpenAI API specs have become the de facto standard whether dealing with chat completions or embeddings.
Separate Your Concerns: Graphiti’s clean separation of LLM and embedder configs made it easy to swap them independently. I could have switched just one if needed.
Switching embedding models is non-trivial: The embedding dimension change would have been painful with existing data.
CPU Embeddings are Fine: Embeddings don’t need GPU. A cheap mini PC running Ollama is perfect… until the number of embedding calls increases drastically.
Local models show promise: Nemotron 3 Nano is more than adequate for entity extraction.
Conclusion
Breaking free from OpenAI wasn’t just about avoiding outages—it was about taking control of my infrastructure. Ollama made this transition refreshingly smooth with its simple Docker-friendly setup and OpenAI-compatible API.
What’s Next?
While I’m still keen on SpiceDB for its ReBAC functionality, the rearchitecting of Graphiti appears to have rendered it unusable as a backend memory store when local models handle entity extraction and embeddings. Looks like it’s time to explore alternative memory architectures…
Resources
Graphiti - Knowledge graph memory for LLMs
Ollama - Run LLMs locally with ease
Nemotron 3 Nano - Fully capable for entity extraction
Nomic Embed - Efficient open embedding model, can run on CPU
Questions?
If you try this approach, I’d love to hear about it! What models are you running locally? Were you able to get past the performance barrier that I ran into?

