Addendum: The Hosted Model Detour

After the local Ollama adventure, I tried something different: what if the database stays local but the models don't?

The hypothesis was simple. Keep Neo4j, SpiceDB, and Graphiti on-premises — data sovereignty intact — but offload the LLM and embedding work to hosted APIs. Graphiti's entity extraction is the bottleneck, not the storage. And not everyone has a spare GPU sitting around to run embeddings on, so I owed it to readers to look for cost-effective alternatives.

Enter Groq. Fast inference, OpenAI-compatible API, free tier. What could go wrong?

Turns out: model naming conventions, for starters. Graphiti's codebase assumes OpenAI, so we're routing through an OpenAI-compatible endpoint. But the LiteLLM-style groq/model-name prefix that works in some frameworks causes 404s when you hit Groq's API directly. The answer: just llama-3.3-70b-versatile, no prefix. An hour of debugging for a six-character fix.
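For the record, the fix boils down to stripping the provider prefix before the name reaches the endpoint. A minimal sketch — `normalize_model_name` is my own helper, not part of Graphiti, LiteLLM, or Groq's SDK:

```python
def normalize_model_name(name: str) -> str:
    """Strip a LiteLLM-style provider prefix before hitting Groq's
    OpenAI-compatible endpoint (https://api.groq.com/openai/v1)
    directly: "groq/llama-3.3-70b-versatile" 404s there, while the
    bare "llama-3.3-70b-versatile" works."""
    return name.removeprefix("groq/")

print(normalize_model_name("groq/llama-3.3-70b-versatile"))
# llama-3.3-70b-versatile
```

Pass the result as the `model` argument on every request; already-bare names pass through unchanged.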

Then the structured output saga. Groq's API requires the word "json" somewhere in your messages when you request response_format: json_object. Graphiti's internal prompts don't always include it. The 8B model returned lists where dicts were expected. The 9B model got decommissioned mid-debugging. A 17B reasoning model burned all its tokens thinking before producing output. The eventual winner — llama-3.3-70b-versatile — needed a thin wrapper class injecting "Respond in JSON format" into system prompts that forgot to mention it.
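The prompt-patching part of that wrapper is small enough to show. This is a sketch under my own naming — Graphiti's actual prompt plumbing differs — but the shape of the fix is this: before sending, check whether any message mentions "json", and if not, append a reminder to the system prompt:

```python
def ensure_json_hint(messages):
    """Hypothetical helper (names mine, not Graphiti's): Groq rejects
    response_format={"type": "json_object"} unless the word "json"
    appears somewhere in the messages, so inject a reminder into the
    system prompt when the framework's prompt forgot to mention it."""
    if any("json" in (m.get("content") or "").lower() for m in messages):
        return messages  # already mentions json; leave untouched
    hint = "Respond in JSON format."
    patched = [dict(m) for m in messages]  # shallow-copy, don't mutate caller's dicts
    for m in patched:
        if m.get("role") == "system":
            m["content"] = (m["content"] or "").rstrip() + " " + hint
            return patched
    # No system message at all: prepend one carrying the hint.
    return [{"role": "system", "content": hint}] + patched
```

Graphiti's own prompts stay untouched; the hint is only added on the wire, and only when it's missing.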

The lesson: "OpenAI-compatible" is a spectrum, not a binary. Every provider has quirks that surface exactly where your framework's assumptions meet their implementation. But the hybrid architecture works — local data, hosted inference, and the five monkey-patches from the previous chapter reduced to one clean subclass.
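To give a feel for what "one clean subclass" means in practice, here's an illustrative shape — not Graphiti's actual class or method names, just the two quirks folded into one place around any OpenAI-compatible SDK client:

```python
class GroqClient:
    """Illustrative wrapper (names are mine): one class absorbing both
    Groq quirks — strip the LiteLLM-style "groq/" prefix, and make
    sure the word "json" reaches the prompt whenever JSON-mode
    output is requested."""

    def __init__(self, client, model="groq/llama-3.3-70b-versatile"):
        self.client = client                      # any OpenAI-compatible SDK client
        self.model = model.removeprefix("groq/")  # quirk #1: bare model name

    def _prepare(self, messages, response_format):
        # quirk #2: Groq's JSON mode requires "json" in the messages
        if response_format == {"type": "json_object"} and not any(
            "json" in (m.get("content") or "").lower() for m in messages
        ):
            messages = [{"role": "system", "content": "Respond in JSON format."}, *messages]
        return messages

    def complete(self, messages, response_format=None):
        messages = self._prepare(messages, response_format)
        return self.client.chat.completions.create(
            model=self.model, messages=messages, response_format=response_format
        )
```

Every other code path stays stock: the framework thinks it's talking to OpenAI, and the subclass quietly translates at the boundary.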
