Subsystem deep-dive

Local AI

Models that run on your own hardware, with no cloud and no token bill. They are the only LLMs allowed to read raw private data. Claude orchestrates; the local models do the work.

Local models

MCP servers shared

Callable tools

Tokens billed

How it's wired

One tool layer, two front-ends

The same MCP tool servers are shared by Claude Code (the orchestrator) and Open WebUI (your local chat surface). Open WebUI reaches them through the mcpo bridge, and everything talks to Ollama on your machine.

🧰 MCP tool servers · shared by both clients

filesystem · applescript · git · crawl4ai · ollama-bridge · n8n

↓ the same tools, two ways in ↓

🤖 Claude Code · orchestrator

talks to MCP directly (stdio / SSE)

💬 ConnorGPT · Open WebUI · Docker :3000

via mcpo bridge (MCP → OpenAPI)

↓

🧠 Ollama · host :11434 · on-device inference

8 local models reachable privately over Tailscale
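To see what sits at the bottom of that diagram, you can ask Ollama which models it has pulled. This is a minimal sketch against Ollama's `/api/tags` endpoint; the `localhost` host is an assumption (the diagram only gives the port), and the helper names are hypothetical.

```python
import json
from urllib.request import urlopen

# Port 11434 comes from the diagram; "localhost" is an assumption.
OLLAMA_URL = "http://localhost:11434"


def parse_model_names(tags_payload: dict) -> list:
    """Extract sorted model names from an Ollama /api/tags response body."""
    return sorted(m["name"] for m in tags_payload.get("models", []))


def list_local_models(base_url: str = OLLAMA_URL, timeout: float = 5.0) -> list:
    """Ask the Ollama server which models are available locally."""
    with urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return parse_model_names(json.load(resp))
```

Calling `list_local_models()` on the host should return the roster names; over Tailscale you would pass the machine's tailnet address as `base_url` instead.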

The roster

Right tool per job

gemma4:26b

The resident driver. A mixture-of-experts model (4B active parameters) that decodes at ~22 t/s. It is the default on both chat surfaces and, since a July 2026 head-to-head bake-off, the base that every consolidated agent lane runs on.

17 GB · driver

gemma4:12b

The fast local fallback for Command Center Chat and the local rung of the task-box draft ladder. Schema-constrained generation for everyday structured jobs.

7.6 GB · fallback

qwen3-coder:30b

Fast structured-output worker. Strong on code and single-shot tool calls.

18 GB · worker

llama3.1:8b

A general-purpose mid-size worker for everyday local jobs.

4.9 GB · general

gemma3:4b

The speed play (~42 t/s) and the on-device OCR engine for the save-evaluator pipeline. Still selectable in the local chat UI.

3.3 GB · fast

llava

Vision (image → text), fully on-device, for describe-style jobs. Benched out of OCR duty when a two-model agreement gate beat it on accuracy.

4.7 GB · vision
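The two-model agreement gate mentioned in the llava card can be pictured roughly like this. It is a sketch only: the page doesn't say which two models vote or how agreement is measured, so the similarity measure (`difflib.SequenceMatcher`), the 0.9 threshold, and the function names are all assumptions.

```python
from difflib import SequenceMatcher
from typing import Optional


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences don't count."""
    return " ".join(text.lower().split())


def agreement_gate(text_a: str, text_b: str, threshold: float = 0.9) -> Optional[str]:
    """Accept an OCR result only when two independent transcriptions agree.

    Returns the first transcription (stripped) if the normalized texts are
    at least `threshold` similar, otherwise None so the caller can retry
    or flag the image for review.
    """
    ratio = SequenceMatcher(None, normalize(text_a), normalize(text_b)).ratio()
    return text_a.strip() if ratio >= threshold else None
```

The point of a gate like this is that a single model's confident misread gets caught whenever the second model reads the image differently.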

The framing

Workers, not agents

The honest scope: local models started as reliable workers Claude delegates to. The autonomous multi-tool agent was deliberately pared back because multi-step tool loops weren't reliable, but successive tool-use bake-offs (May and July 2026) show that's changed, which has reopened the question.

Pattern A: Claude delegates

Claude orchestrates and hands a local model a single-shot job through the ollama-bridge MCP. Bulk work runs locally; Claude keeps the reasoning.
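A single-shot job of this kind might look like the sketch below. It calls Ollama's `/api/generate` endpoint directly with a JSON-schema `format`, rather than going through the ollama-bridge MCP, whose tool interface isn't described here. The host, the tagging schema, and the helper names are assumptions for illustration.

```python
import json
from urllib.request import Request, urlopen

# Port from the wiring diagram; "localhost" is an assumption.
OLLAMA_URL = "http://localhost:11434"

# Hypothetical output schema for a note-tagging job.
TAG_SCHEMA = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "tags": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["title", "tags"],
}


def build_job(model: str, prompt: str, schema: dict) -> dict:
    """Build a non-streaming /api/generate request constrained to a schema."""
    return {"model": model, "prompt": prompt, "stream": False, "format": schema}


def run_job(job: dict, base_url: str = OLLAMA_URL, timeout: float = 120.0) -> dict:
    """Send one job to Ollama and parse the model's JSON answer."""
    req = Request(
        f"{base_url}/api/generate",
        data=json.dumps(job).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urlopen(req, timeout=timeout) as resp:
        body = json.load(resp)
    # With stream=False the generated text arrives in the "response" field.
    return json.loads(body["response"])
```

One request in, one structured answer out: there is no loop for the local model to get lost in, which is why this pattern was reliable from the start.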

Pattern B: code orchestrates private work

A deterministic script drives a private-file job; the local model is the only LLM that ever sees raw personal content, and only a sanitized derivative crosses to Brain.
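In outline, Pattern B could be sketched as follows. The redaction rules here are deliberately simplistic placeholders, not the real sanitization policy, and `local_model` stands in for whatever call reaches the on-device model; both are assumptions.

```python
import re
from pathlib import Path
from typing import Callable, Iterable

# Illustrative patterns only; a real policy would need to be far stricter.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def sanitize(text: str) -> str:
    """Redact obvious identifiers from model output before it leaves the machine."""
    return PHONE.sub("[phone]", EMAIL.sub("[email]", text))


def process_private_files(
    paths: Iterable[Path],
    local_model: Callable[[str], str],
) -> list:
    """Deterministic loop: raw text goes only to local_model.

    Only the sanitized summaries are returned for anything downstream.
    """
    derivatives = []
    for path in sorted(paths):
        raw = path.read_text(encoding="utf-8")
        summary = local_model(raw)  # the only LLM call that sees raw content
        derivatives.append({"file": path.name, "summary": sanitize(summary)})
    return derivatives
```

The script, not a model, decides which files are read and in what order, so no orchestrating LLM ever needs to see the raw text.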

The multi-step bar just got cleared, and what that reopens

Single-shot tool calls always worked. Verified end-to-end (local model → mcpo → filesystem MCP → a real directory listing), with 56 tools wired across both clients.

Reliable multi-step tool loops were the weak spot, which is why the autonomous-agent ambition got scoped down. But the 2026-05-24 tool-use bake-off changed the verdict: a local model hit 96% tool-call correctness driving a 5-tool multi-step agent loop end-to-end (the Mac-control toolkit: app, messages, calendar, notes, screen-read) behind dry-run/confirm gates. The driver seat has since consolidated onto gemma4:26b after a July re-bake-off.
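A dry-run/confirm gate of the sort mentioned above might be shaped like this. It is a sketch under assumptions: the `ToolCall` type, the handler registry, and the status strings are hypothetical, not the toolkit's actual interface.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass
class ToolCall:
    tool: str  # e.g. "messages" or "calendar"
    args: Dict[str, Any]


def gated_execute(
    call: ToolCall,
    handlers: Dict[str, Callable[..., Any]],
    confirm: Callable[[ToolCall], bool],
    dry_run: bool = True,
) -> Dict[str, Any]:
    """Run a model-proposed tool call only if it is known, live, and confirmed."""
    if call.tool not in handlers:
        return {"status": "rejected", "reason": f"unknown tool: {call.tool}"}
    if dry_run:
        # Report what would happen without touching the machine.
        return {"status": "dry_run", "would_call": call.tool, "args": call.args}
    if not confirm(call):
        return {"status": "declined", "tool": call.tool}
    return {"status": "ok", "result": handlers[call.tool](**call.args)}
```

Defaulting `dry_run` to `True` means a wrong tool call from the model costs nothing unless someone has deliberately switched the gate to live and approved the call.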

The hard privacy constraint is unchanged, and it's now the interesting part: an orchestrator must see what it orchestrates, so Claude still can't loop over Claude-denied private files, but a capable local agent can. So the open question is no longer "can local models do this" but "should a local agent take over the private-file orchestration that deterministic scripts do today." That's a scope call now actively reopened.

Why run models locally at all

Three things only on-device buys you

Privacy

Raw personal data (the Journal, health, finances) can be processed by an LLM without a single byte leaving the Mac.

Free + offline

Local inference costs zero tokens and works with no network. Bulk jobs don't run up a bill.

Private remote access

ConnorGPT is reachable from your phone over Tailscale, never exposed on the public internet.