Addendum: The Hosted Model Detour

After the local Ollama adventure, I tried something different: what if the database stays local but the models don't?

The hypothesis was simple. Keep Neo4j, SpiceDB, and Graphiti on-premises — data sovereignty intact — but offload the LLM and embedding work to hosted APIs. Graphiti's entity extraction is the bottleneck, not the storage. And not everyone has a spare GPU sitting around to run embeddings on, so I owed it to readers to look for cost-effective alternatives.

Enter Groq. Fast inference, OpenAI-compatible API, free tier. What could go wrong?

Turns out: model naming conventions, for starters. Graphiti's codebase assumes OpenAI, so we're routing through an OpenAI-compatible endpoint. But the LiteLLM-style groq/model-name prefix that works in some frameworks causes 404s when you hit Groq's API directly. The answer: just llama-3.3-70b-versatile, no prefix. An hour of debugging for a six-character fix.
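For the record, the fix boils down to stripping the provider prefix before the name reaches the endpoint. A minimal sketch — `normalize_model_name` is my own helper, not part of Graphiti, LiteLLM, or Groq's SDK:

```python
def normalize_model_name(name: str) -> str:
    """Strip a LiteLLM-style provider prefix before hitting Groq's
    OpenAI-compatible endpoint (https://api.groq.com/openai/v1)
    directly: "groq/llama-3.3-70b-versatile" 404s there, while the
    bare "llama-3.3-70b-versatile" works."""
    return name.removeprefix("groq/")

print(normalize_model_name("groq/llama-3.3-70b-versatile"))
# llama-3.3-70b-versatile
```

Pass the result as the `model` argument on every request; already-bare names pass through unchanged.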

Then the structured output saga. Groq's API requires the word "json" somewhere in your messages when you request response_format: json_object. Graphiti's internal prompts don't always include it. The 8B model returned lists where dicts were expected. The 9B model got decommissioned mid-debugging. A 17B reasoning model burned all its tokens thinking before producing output. The eventual winner — llama-3.3-70b-versatile — needed a thin wrapper class injecting "Respond in JSON format" into system prompts that forgot to mention it.
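The prompt-patching part of that wrapper is small enough to show. This is a sketch under my own naming — Graphiti's actual prompt plumbing differs — but the shape of the fix is this: before sending, check whether any message mentions "json", and if not, append a reminder to the system prompt:

```python
def ensure_json_hint(messages):
    """Hypothetical helper (names mine, not Graphiti's): Groq rejects
    response_format={"type": "json_object"} unless the word "json"
    appears somewhere in the messages, so inject a reminder into the
    system prompt when the framework's prompt forgot to mention it."""
    if any("json" in (m.get("content") or "").lower() for m in messages):
        return messages  # already mentions json; leave untouched
    hint = "Respond in JSON format."
    patched = [dict(m) for m in messages]  # shallow-copy, don't mutate caller's dicts
    for m in patched:
        if m.get("role") == "system":
            m["content"] = (m["content"] or "").rstrip() + " " + hint
            return patched
    # No system message at all: prepend one carrying the hint.
    return [{"role": "system", "content": hint}] + patched
```

Graphiti's own prompts stay untouched; the hint is only added on the wire, and only when it's missing.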

The lesson: "OpenAI-compatible" is a spectrum, not a binary. Every provider has quirks that surface exactly where your framework's assumptions meet their implementation. But the hybrid architecture works — local data, hosted inference, and the five monkey-patches from the previous chapter reduced to one clean subclass.
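To give a feel for what "one clean subclass" means in practice, here's an illustrative shape — not Graphiti's actual class or method names, just the two quirks folded into one place around any OpenAI-compatible SDK client:

```python
class GroqClient:
    """Illustrative wrapper (names are mine): one class absorbing both
    Groq quirks — strip the LiteLLM-style "groq/" prefix, and make
    sure the word "json" reaches the prompt whenever JSON-mode
    output is requested."""

    def __init__(self, client, model="groq/llama-3.3-70b-versatile"):
        self.client = client                      # any OpenAI-compatible SDK client
        self.model = model.removeprefix("groq/")  # quirk #1: bare model name

    def _prepare(self, messages, response_format):
        # quirk #2: Groq's JSON mode requires "json" in the messages
        if response_format == {"type": "json_object"} and not any(
            "json" in (m.get("content") or "").lower() for m in messages
        ):
            messages = [{"role": "system", "content": "Respond in JSON format."}, *messages]
        return messages

    def complete(self, messages, response_format=None):
        messages = self._prepare(messages, response_format)
        return self.client.chat.completions.create(
            model=self.model, messages=messages, response_format=response_format
        )
```

Every other code path stays stock: the framework thinks it's talking to OpenAI, and the subclass quietly translates at the boundary.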
