Building Memory That Doesn't Lapse
Breaking Free from OpenAI and switching to local LLM infrastructure
The Crisis
It was a typical Thursday morning when I started noticing some anomalies in my OpenClaw agent’s memory system. My ReBAC memory persistence system, which had been working perfectly, suddenly started throwing timeouts when ingesting memories at the end of each turn:
[plugins] openclaw-memory-graphiti: deferred SpiceDB write failed for memory_store:
Error: Failed to resolve episode UUID for "memory_1770911378846" in group "main"
after 90s — episode not yet visible in get_episodes (Graphiti LLM processing may
still be running)
The culprit? An OpenAI API outage. Both the ingestion portion of my memory system (i.e., the part of the pipeline that uses embeddings to create new memories) and recall (which uses embeddings to build the search query) had been rendered inoperable. Clearly, I needed a backup (or perhaps a complete replacement) for times like this.
And you can see me talk about my experience… and the final result… at SpiceDB Community Day 2026! Register here.
The Problem: Multiple External Dependencies
The Graphiti MCP server, upon which the ReBAC memory system relies, was using external models (by default OpenAI) for several independent tasks, including:
Entity Extraction - Using gpt-4o-mini to analyze conversations and extract structured knowledge (entities, relationships, facts)
Embeddings - Using text-embedding-3-small to generate 1536-dimensional vectors for semantic search
Search - Using text-embedding-3-small to create the search query and (potentially) gpt-4.1-nano inside the OpenAIRerankerClient to perform reranking.
Using the default Graphiti configuration, when OpenAI models go down, one or more of these systems fail, depending upon the extent of the outage. Beyond the single point of failure created by relying solely upon OpenAI, I was also paying for API calls that could easily run locally on hardware I already owned.
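To make the dependency surface concrete, here is a rough sketch of the request shapes behind those tasks. The payload builders are mine, not Graphiti’s actual code; the model names are the defaults mentioned above.

```python
# Illustrative payload builders for the OpenAI-compatible endpoints Graphiti
# hits during a turn; not Graphiti's real code.

def extraction_request(conversation: str) -> dict:
    # Entity extraction: a chat completion asking the LLM for structured facts
    return {
        "model": "gpt-4o-mini",
        "messages": [
            {
                "role": "system",
                "content": "Extract entities, relationships, and facts as JSON.",
            },
            {"role": "user", "content": conversation},
        ],
        "temperature": 0.0,
    }

def embedding_request(text: str) -> dict:
    # Embeddings: called both at ingestion time (new memories) and at search
    # time (the query itself), so an outage breaks ingest and recall at once
    return {"model": "text-embedding-3-small", "input": text}

# One turn touches both endpoints, all served by the same external provider
calls = [
    ("/v1/chat/completions", extraction_request("User: my cat is named Miso.")),
    ("/v1/embeddings", embedding_request("my cat is named Miso")),
    ("/v1/embeddings", embedding_request("what is my cat's name?")),
]
for path, body in calls:
    print(path, body["model"])
```

Every one of these calls goes to the same external provider, which is exactly the single point of failure at issue.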
The Solution: Go Local with Ollama
I decided to replace all OpenAI dependencies with local models, either GPU-accelerated or running directly on the CPU:
Entity Extraction: nvidia/nemotron-3-nano:30b via Ollama (GPU-accelerated)
Embeddings: nomic-ai/nomic-embed-text-v1.5 via Ollama (CPU-only)
Reranker: BAAI/bge-reranker-v2-m3 via the sentence-transformers library (CPU-only)
Benefits:
✅ No more external dependencies
✅ No API costs
✅ Complete data privacy
✅ Works even if internet connectivity is down
Architecture Overview
┌─────────────────────────────────────────────────────────────────────────────────────┐
│ Graphiti Core │
│ ┌─────────────────────────┬──────────────────────────┬─────────────────────────┐ │
│ │ LLM Client │ Embedder Client │ Reranker Client │ │
│ │ (Entity Extraction) │ (Vector Generation) │ (Search) │ │
│ └─────────────┬───────────┴──────────────┬───────────┴────────────┬────────────┘ │
└────────────────┼──────────────────────────┼────────────────────────┼────────────────┘
│ │ │
▼ ▼ ▼
┌───────────────────────┐ ┌───────────────────────┐ ┌───────────────────────┐
│ Ollama Server #1 │ │ Ollama Server #2 │ │ sentence-transformers │
│ :11434/v1 │ │ :11434/v1 │ │ │
│ (DGX-Spark) │ │ (OpenClaw machine) │ │ (OpenClaw machine) │
│ │ │ │ │ │
│ nemotron-3-nano:30b │ │ nomic-embed-text │ │ bge-reranker-v2-m3 │
│ (GPU-accelerated) │ │ (CPU-only) │ │ (CPU-only) │
└───────────────────────┘ └───────────────────────┘ └───────────────────────┘
Implementation Journey
Looking at the Graphiti codebase, it initially seemed like all the pieces were there; I just needed to swap out the OpenAI dependencies for their Ollama (and BGE) alternatives. But this plan quickly ran into a snag: not every configuration option available in Graphiti core is directly exposed for external use.
Challenge #1: The Factory Pattern
Graphiti uses a factory pattern to create LLM clients. The existing code had:
OpenAIClient - Uses OpenAI’s proprietary responses.parse() API (structured outputs)
OpenAIGenericClient - Uses the standard OpenAI-compatible chat.completions.create() API
The factory was creating OpenAIClient but wasn’t hooked up to OpenAIGenericClient. I needed to add a new provider case.
Challenge #2: Embedding Dimension Mismatch
OpenAI’s text-embedding-3-small generates 1536-dimensional vectors, but nomic-embed-text-v1.5 generates 768-dimensional vectors.
Normally this would require a database migration. Lucky break: my FalkorDB instance was empty (zero episodes in all groups due to a previous misconfiguration), so I could just switch dimensions without any data migration. This is something to bear in mind, however… if you change your embedding model, it will likely invalidate your embeddings database.
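Why the mismatch matters: cosine similarity, the usual semantic-search scorer, is undefined across different dimensions, so 1536-d vectors written by the old model can never be compared against 768-d query vectors from the new one. A minimal illustration:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    # Semantic search scores stored vectors against the query vector;
    # the dot product is only defined when both have the same dimension
    if len(a) != len(b):
        raise ValueError(f"dimension mismatch: {len(a)} vs {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

stored = [0.1] * 1536  # vector written by text-embedding-3-small
query = [0.1] * 768    # query vector from nomic-embed-text-v1.5

try:
    cosine(stored, query)
except ValueError as err:
    print(err)  # dimension mismatch: 1536 vs 768
```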
The Code Changes
1. Added openai_generic Provider to Factory
File: graphiti/mcp_server/src/services/factories.py
After the case 'openai': block, I added:
case 'openai_generic':
    if not config.providers.openai:
        raise ValueError('OpenAI provider configuration not found')
    api_key = config.providers.openai.api_key or 'not-needed'
    api_url = config.providers.openai.api_url
    logger.info(f'Creating OpenAI Generic client (base_url: {api_url})')

    from graphiti_core.llm_client.openai_generic_client import OpenAIGenericClient
    from graphiti_core.llm_client.config import LLMConfig as CoreLLMConfig

    llm_config = CoreLLMConfig(
        api_key=api_key,
        base_url=api_url,
        model=config.model,
        temperature=config.temperature,
        max_tokens=config.max_tokens,
    )
    return OpenAIGenericClient(config=llm_config, max_tokens=config.max_tokens)
Key insight: The openai_generic provider reuses the config.providers.openai section for credentials/URLs. This keeps the config schema clean.
2. Update Config for Local Endpoints
File: graphiti/mcp_server/config/config.yaml
LLM section (entity extraction):
llm:
  provider: "openai"  # Ollama is OpenAI-compatible
  model: "nemotron-3-nano:30b"
  max_tokens: 32768
  temperature: 0.0
  providers:
    openai:
      api_key: "not-needed"
      api_url: ${LLM_API_URL}
Embedder section (vector generation):
embedder:
  provider: "openai"  # Ollama is OpenAI-compatible
  model: "nomic-embed-text:v1.5"
  dimensions: 768  # Changed from 1536
  providers:
    openai:
      api_key: "not-needed"
      api_url: ${EMBEDDER_API_URL}
Why provider: "openai"? Ollama implements the OpenAI-compatible API, so the existing EmbedderFactory works seamlessly with any OpenAI-compatible endpoint. No code changes needed!
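The ${LLM_API_URL} and ${EMBEDDER_API_URL} placeholders are resolved from the environment when the config is loaded. A minimal sketch of that kind of substitution (the actual loader in Graphiti may differ in details):

```python
import os
import re

def expand_env(value: str) -> str:
    # Replace ${VAR} with the value of environment variable VAR (empty if
    # unset), the way many YAML config loaders interpolate placeholders
    return re.sub(r"\$\{(\w+)\}", lambda m: os.environ.get(m.group(1), ""), value)

os.environ["EMBEDDER_API_URL"] = "http://localhost:11434/v1"
print(expand_env("api_url: ${EMBEDDER_API_URL}"))  # api_url: http://localhost:11434/v1
```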
3. Environment Variables
File: openclaw-memory-graphiti/.env
# Local LLM configuration (Ollama)
LLM_PROVIDER=openai_generic
LLM_MODEL=nemotron-3-nano:30b
LLM_API_URL=http://ollama-server:11434/v1
LLM_MAX_TOKENS=32768
# Local embedder configuration (Ollama)
EMBEDDER_MODEL=nomic-embed-text:v1.5
EMBEDDER_API_URL=http://localhost:11434/v1
EMBEDDER_DIMENSIONS=768
EMBEDDER_API_KEY=not-needed
# Local reranker
RERANKER_PROVIDER=bge

Infrastructure Setup
Ollama Server #1: Entity Extraction (GPU-Accelerated)
# Run Ollama with GPU support
docker run -d \
--name ollama-llm \
--gpus all \
-p 11434:11434 \
-v ollama-llm-data:/root/.ollama \
ollama/ollama:latest
# Pull the Nemotron model for entity extraction (~17GB)
docker exec ollama-llm ollama pull nemotron-3-nano:30b
Why Nemotron-3-nano:30b?
Excellent instruction following for knowledge extraction
30B parameters strike a balance between quality and speed
GPU-accelerated for fast inference
OpenAI-compatible API endpoint
Ollama Server #2: Embeddings (CPU-Only)
# Run Ollama in Docker (CPU-only, no GPU needed)
docker run -d \
--name ollama-embeddings \
-p 11434:11434 \
-v ollama-embeddings-data:/root/.ollama \
ollama/ollama:latest
# Pull the embedding model (~500MB)
docker exec ollama-embeddings ollama pull nomic-embed-text:v1.5

Why separate Ollama instances?
Dedicated resources: Entity extraction gets GPU, embeddings run on CPU
Independent scaling: Can run on different machines
Isolation: Model loading/updates don’t affect each other
Why Ollama for (almost) everything?
Straightforward setup (one command per model)
OpenAI-compatible API (/v1/chat/completions, /v1/embeddings)
Automatic GPU detection and utilization
Built-in model management (ollama pull, ollama list)
Tiny memory footprint (~2GB RAM for embeddings)
Final validation (or so I thought)
I did some final testing and submitted a PR to the Graphiti repo, receiving some feedback in fairly short order. I addressed the feedback, merged in some changes from main that had occurred in the interim and then… the wheels came off.
Suddenly, the automatic ingestion occurring at the end of each turn was consistently timing out again. This time I knew that it couldn’t be a model outage, since everything was running locally. So what had happened?
Checking through the changes I had pulled in from main, two PRs caught my eye:
feat: simplify extraction pipeline and add batch entity summarization (#1224)
feat: driver operations architecture redesign (#1232)
The net result? Ingesting a single episode went from 60-90 seconds (not great, but ingestion doesn’t need to be instantaneous) to more than 15 minutes. In particular, while the number of LLM calls necessary for entity extraction had roughly doubled (from ~15 to ~30), the number of embedding calls per episode had gone from ~40 to ~300. Clearly something had changed, and not for the better; while a moderate performance hit when running local models is not unexpected, this was in another league entirely. Perhaps local models were no longer viable for use with the newly rearchitected Graphiti?
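A back-of-envelope check makes the regression tangible. The call counts come from my observations above; the per-call latencies are purely my guesses for local models.

```python
# Observed call counts per episode, before and after the merge from main
before = {"llm_calls": 15, "embed_calls": 40}
after = {"llm_calls": 30, "embed_calls": 300}

# Assumed per-call latencies (my guesses): local GPU LLM and CPU embeddings,
# processed serially
llm_s, embed_s = 2.0, 1.2

def total_seconds(counts: dict) -> float:
    return counts["llm_calls"] * llm_s + counts["embed_calls"] * embed_s

print(f"before: ~{total_seconds(before) / 60:.1f} min")  # ~1.3 min
print(f"after:  ~{total_seconds(after) / 60:.1f} min")   # ~7.0 min
```

Under these assumptions the call-count growth alone predicts roughly a 5x slowdown, still short of the observed 15+ minutes, which suggests per-call cost grew as well.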
This sent me back to the drawing board, leading me to another memory framework that had readily available examples making use of local models. More on that experience in my next post.
In the meantime, this experience was not without its learnings…
Lessons Learned
OpenAI-Compatibility is Great: The OpenAI API specs have become the de facto standard whether dealing with chat completions or embeddings.
Separate Your Concerns: Graphiti’s clean separation of LLM and embedder configs made it easy to swap them independently. I could have switched just one if needed.
Switching embedding models is non-trivial: The embedding dimension change would have been painful with existing data.
CPU Embeddings are Fine: Embeddings don’t need GPU. A cheap mini PC running Ollama is perfect… until the number of embedding calls increases drastically.
Local models show promise: Nemotron 3 Nano is more than adequate for entity extraction.
Conclusion
Breaking free from OpenAI wasn’t just about avoiding outages—it was about taking control of my infrastructure. Ollama made this transition refreshingly smooth with its simple Docker-friendly setup and OpenAI-compatible API.
What’s Next?
While I’m still keen on SpiceDB for its ReBAC functionality, the rearchitecting of Graphiti appears to have rendered it unusable as a backend memory store when local models handle entity extraction and embeddings. Looks like it’s time to explore alternative memory architectures…
Resources
Graphiti - Knowledge graph memory for LLMs
Ollama - Run LLMs locally with ease
Nemotron 3 Nano - Fully capable for entity extraction
Nomic Embed - Efficient open embedding model, can run on CPU
Questions?
If you try this approach, I’d love to hear about it! What models are you running locally? Were you able to get past the performance barrier that I ran into?

