Cost Optimization Playbook
Practical rollout guide to reduce API/LLM spend while preserving memory quality.
Goals
- Keep recall/capture quality high for daily use.
- Move expensive features to opt-in.
- Use cheap-first routing with safe fallbacks.
- Enforce ongoing cost checks.
Recommended baseline profiles
Profile A — Local-first (lowest cost)
Use when you want durable memory with near-zero variable model spend.
{
  "mode": "local",
  "retrieval": { "strategies": ["fts5"] }
}
What you keep: auto-capture + auto-recall (FTS), WAL, durable SQLite memory.
What you avoid: external embeddings and chat-model features.
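As a rough illustration of why this profile carries no per-query model cost, the sketch below runs a full-text recall with SQLite's FTS5 extension from Python. The table name, column, and sample facts are hypothetical (the plugin's actual schema is not shown in this guide), and it assumes the local SQLite build includes FTS5.

```python
import sqlite3

# Hypothetical schema for illustration only; not the plugin's real tables.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE facts USING fts5(text)")
conn.executemany(
    "INSERT INTO facts (text) VALUES (?)",
    [
        ("User prefers dark mode in the editor",),
        ("Deploys happen every Friday afternoon",),
        ("The staging database runs PostgreSQL",),
    ],
)


def recall(query: str, limit: int = 3) -> list:
    """Return matching facts, best match first (FTS5 'rank' uses bm25 by default)."""
    rows = conn.execute(
        "SELECT text FROM facts WHERE facts MATCH ? ORDER BY rank LIMIT ?",
        (query, limit),
    )
    return [row[0] for row in rows]


print(recall("deploys"))
```

Every step here runs locally, so recall cost does not grow with an external embedding or chat-model bill.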
Profile B — Minimal (best value default)
Use for most production agents that still need semantic recall and low-cost maintenance.
{
  "mode": "minimal",
  "distill": {
    "modelTier": "maintenance",
    "extractionModelTier": "nano"
  },
  "autoRecall": {
    "interactiveEnrichment": "fast",
    "maxTokens": 800
  }
}
Profile C — Enhanced with guardrails
Use only for workflows that truly need reflection and richer behavior.
{
  "mode": "enhanced",
  "distill": {
    "modelTier": "maintenance",
    "extractionModelTier": "nano"
  },
  "store": { "classifyBeforeWrite": false }
}
Enable expensive options (reranking, contextual variants, self-correction depth) only with clear business value.
10 implementation controls
- Mode by use-case
  Keep default agents on local or minimal; reserve enhanced/complete for specific teams or agents.
- Cheap-first tier routing
  Put the lowest-cost acceptable model first in llm.nano and llm.maintenance. Keep fallback chains, but avoid expensive first-choice entries.
- Distill extraction tier discipline
  Keep distill.extractionModelTier at nano (or maintenance when needed). Avoid default/heavy for routine extraction.
- Lean scheduled jobs
  Start with nightly distill + weekly reflection only. Disable optional cron chains until they show clear value.
- Local pre-filter before cloud distill
  Enable local pre-filtering for distill (extraction.preFilter.enabled: true) so low-signal sessions are skipped before cloud calls.
- Token budget tightening
  Tune down context budgets first:
  - autoRecall.maxTokens (start ~600–800)
  - retrieval.ambientBudgetTokens
  - retrieval.explicitBudgetTokens
  Prefer interactiveEnrichment: "fast" for stable low-cost turns.
- Avoid per-write classification by default
  Keep store.classifyBeforeWrite: false unless immediate write-time conflict resolution is required.
- Tiering discipline (HOT small, COLD deeper)
  Keep HOT compact (memoryTiering.hotMaxTokens, hotMaxFacts) and move low-value old content to COLD sooner.
- Local embeddings where feasible
  For high-volume installs, prefer embedding.provider: "ollama" or "onnx" and use cloud embedding as fallback via embedding.preferredProviders.
- Continuous cost guardrails
  Run weekly checks and keep cost tracking enabled:
  - openclaw hybrid-mem stats --efficiency
  - openclaw hybrid-mem context-audit
  - openclaw hybrid-mem verify
  Define spend/latency thresholds and downgrade feature sets when exceeded.
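The cheap-first routing control can be pictured with a small sketch: order the candidates by cost, try the cheapest first, and fall back only on failure. The model names, prices, and the `call` function are all hypothetical stand-ins; this is not the plugin's routing code, only one way the idea could work.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class ModelOption:
    name: str                   # hypothetical model identifier
    cost_per_1k_tokens: float   # illustrative price, not a real quote


def order_cheap_first(options: Sequence[ModelOption]) -> List[ModelOption]:
    """Return the options sorted so the lowest-cost model is tried first."""
    return sorted(options, key=lambda o: o.cost_per_1k_tokens)


def call_with_fallback(
    options: Sequence[ModelOption],
    call: Callable[[str], str],
) -> Tuple[str, str]:
    """Try each model in cheap-first order, falling back when a call raises.

    Returns (model_name, result). Raises RuntimeError if every model fails.
    """
    errors = []
    for option in order_cheap_first(options):
        try:
            return option.name, call(option.name)
        except Exception as exc:  # a real client would catch narrower errors
            errors.append(f"{option.name}: {exc}")
    raise RuntimeError("all models failed: " + "; ".join(errors))
```

The expensive model stays in the chain as a safety net, but it is only billed when the cheaper entry fails.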
Weekly operator checklist
- Review top token sources with context-audit.
- Review tier/source growth with stats --efficiency.
- Confirm model routing and warnings with verify.
- Trim or disable low-value scheduled jobs.
- Re-check after changes within one week.
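The rule "define spend/latency thresholds and downgrade feature sets when exceeded" could be encoded along these lines. The threshold fields and the cheapest-to-costliest ordering of modes are assumptions made for this sketch; the measurements are supplied by the caller, and how they are gathered is outside its scope.

```python
from dataclasses import dataclass

# Modes named in this playbook, in an assumed cheapest-to-costliest order.
MODES = ["local", "minimal", "enhanced", "complete"]


@dataclass(frozen=True)
class Thresholds:
    max_weekly_spend_usd: float  # hypothetical weekly budget
    max_p95_latency_ms: float    # hypothetical latency ceiling


def next_mode(current: str, weekly_spend_usd: float,
              p95_latency_ms: float, limits: Thresholds) -> str:
    """Step down one mode when either threshold is exceeded; otherwise keep it."""
    index = MODES.index(current)  # raises ValueError for an unknown mode
    over_budget = (
        weekly_spend_usd > limits.max_weekly_spend_usd
        or p95_latency_ms > limits.max_p95_latency_ms
    )
    if over_budget and index > 0:
        return MODES[index - 1]
    return current
```

Stepping down one level per review keeps the change small enough to re-check the following week.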
Rollout order
- Move all non-critical agents to minimal (or local).
- Enforce cheap-first llm.nano and llm.maintenance.
- Lock distill extraction to nano.
- Enable/confirm distill local pre-filter.
- Tighten recall budgets.
- Disable per-write classification where not needed.
- Tighten tiering limits (HOT smaller, COLD faster).
- Migrate heavy-traffic environments to local embeddings.
- Keep only high-value cron jobs.
- Establish weekly guardrail review and rollback rules.
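Several of these rollout steps can be checked mechanically against a config file. The sketch below is a hypothetical audit helper, not part of the openclaw CLI. It uses the key names mentioned in this guide and skips keys that are absent, since their defaults are not stated here. The 800-token ceiling is taken from the suggested starting range and is adjustable.

```python
from typing import Any, Dict, List


def get_path(config: Dict[str, Any], dotted: str, default: Any = None) -> Any:
    """Read a nested value such as 'distill.extractionModelTier'."""
    node: Any = config
    for key in dotted.split("."):
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node


def audit(config: Dict[str, Any], max_recall_tokens: int = 800) -> List[str]:
    """Return findings for explicit settings that conflict with the rollout steps."""
    findings = []
    mode = get_path(config, "mode")
    if mode is not None and mode not in ("local", "minimal"):
        findings.append(f"mode={mode}: consider minimal or local")
    tier = get_path(config, "distill.extractionModelTier")
    if tier is not None and tier != "nano":
        findings.append(f"distill.extractionModelTier={tier}: expected nano")
    if get_path(config, "extraction.preFilter.enabled") is False:
        findings.append("extraction.preFilter.enabled=false: enable the local pre-filter")
    max_tokens = get_path(config, "autoRecall.maxTokens")
    if isinstance(max_tokens, int) and max_tokens > max_recall_tokens:
        findings.append(f"autoRecall.maxTokens={max_tokens}: above {max_recall_tokens}")
    if get_path(config, "store.classifyBeforeWrite") is True:
        findings.append("store.classifyBeforeWrite=true: disable unless required")
    return findings
```

An empty list means none of the checked settings conflict with the steps above; it does not cover the tiering, embedding, or cron steps, which still need manual review.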