Cost Optimization Playbook

Practical rollout guide to reduce API/LLM spend while preserving memory quality.

Goals

Keep recall/capture quality high for daily use.
Move expensive features to opt-in.
Use cheap-first routing with safe fallbacks.
Enforce ongoing cost checks.

Recommended baseline profiles

Profile A — Local-first (lowest cost)

Use when you want durable memory with near-zero variable model spend.

{
  "mode": "local",
  "retrieval": { "strategies": ["fts5"] }
}

What you keep: auto-capture + auto-recall (FTS), WAL, durable SQLite memory.
What you avoid: external embeddings and chat-model features.

Profile B — Minimal (best value default)

Use for most production agents that still need semantic recall and low-cost maintenance.

{
  "mode": "minimal",
  "distill": {
    "modelTier": "maintenance",
    "extractionModelTier": "nano"
  },
  "autoRecall": {
    "interactiveEnrichment": "fast",
    "maxTokens": 800
  }
}

Profile C — Enhanced with guardrails

Use only for workflows that truly need reflection and richer behavior.

{
  "mode": "enhanced",
  "distill": {
    "modelTier": "maintenance",
    "extractionModelTier": "nano"
  },
  "store": { "classifyBeforeWrite": false }
}

Enable expensive options (reranking, contextual variants, self-correction depth) only where they deliver clear business value.

10 implementation controls

Mode by use-case
Keep default agents on local or minimal; reserve enhanced/complete for specific teams or agents.
Cheap-first tier routing
Put the lowest-cost acceptable model first in llm.nano and llm.maintenance. Keep fallback chains, but avoid expensive first-choice entries.
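As an illustration of cheap-first routing with a fallback chain, here is a minimal Python sketch. It is not the plugin's implementation: `call_cheap_first`, `AllModelsFailed`, the `call_model` callback, and the model names are all hypothetical, and the only assumption taken from this playbook is that a tier such as llm.nano is an ordered list tried from first to last.

```python
from typing import Callable, List, Sequence, Tuple


class AllModelsFailed(Exception):
    """Raised when every model in the chain has failed."""


def call_cheap_first(
    chain: Sequence[str],
    call_model: Callable[[str, str], str],
    prompt: str,
) -> Tuple[str, str]:
    """Try models in list order (cheapest first) and return (model, reply)."""
    errors: List[str] = []
    for model in chain:
        try:
            return model, call_model(model, prompt)
        except Exception as exc:
            # Record the failure and fall through to the next, pricier entry.
            errors.append(f"{model}: {exc}")
    raise AllModelsFailed("; ".join(errors))


# Hypothetical chain: placeholder names, not real model identifiers.
NANO_CHAIN = ["cheap-model", "pricier-model"]
```

Because the chain is walked in order, the expensive entry is only billed when the cheap one fails, which is why the first position matters most for cost.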
Distill extraction tier discipline
Keep distill.extractionModelTier at nano (or maintenance when needed). Avoid the default and heavy tiers for routine extraction.
Lean scheduled jobs
Start with nightly distill + weekly reflection only. Disable optional cron chains until they show clear value.
Local pre-filter before cloud distill
Enable local pre-filtering for distill (extraction.preFilter.enabled: true) so low-signal sessions are skipped before cloud calls.
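The idea of skipping low-signal sessions before any cloud call can be sketched as below. The heuristic (minimum user turns and minimum user text length) is an assumption made for illustration. This playbook does not specify what the real pre-filter checks, and `is_low_signal`, `sessions_to_distill`, and both thresholds are hypothetical.

```python
from typing import Dict, List

Message = Dict[str, str]  # assumed shape: {"role": ..., "content": ...}


def is_low_signal(
    messages: List[Message],
    min_user_turns: int = 2,   # hypothetical threshold
    min_user_chars: int = 200,  # hypothetical threshold
) -> bool:
    """Return True when a session has too little user text to be worth distilling."""
    user_texts = [m["content"] for m in messages if m.get("role") == "user"]
    total_chars = sum(len(text.strip()) for text in user_texts)
    return len(user_texts) < min_user_turns or total_chars < min_user_chars


def sessions_to_distill(sessions: Dict[str, List[Message]]) -> List[str]:
    """Keep only the session ids that pass the local filter; the rest cost nothing."""
    return [sid for sid, msgs in sessions.items() if not is_low_signal(msgs)]
```

The filter runs locally and cheaply, so every session it rejects is a cloud distill call that is never made.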
Token budget tightening
Tune down context budgets first:
- autoRecall.maxTokens (start ~600–800)
- retrieval.ambientBudgetTokens
- retrieval.explicitBudgetTokens
Prefer interactiveEnrichment: "fast" for stable low-cost turns.
Avoid per-write classification by default
Keep store.classifyBeforeWrite: false unless immediate write-time conflict resolution is required.
Tiering discipline (HOT small, COLD deeper)
Keep HOT compact (memoryTiering.hotMaxTokens, hotMaxFacts) and move low-value old content to COLD sooner.
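To show how the two HOT limits interact, here is a small Python sketch that keeps the highest-value facts in HOT until either cap is reached and sends the rest to COLD. It only illustrates the budgeting idea. The `Fact` type, its `value` score, and `split_hot_cold` are hypothetical, and the plugin's actual demotion logic may differ.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Fact:
    text: str
    tokens: int   # token count, assumed to be precomputed
    value: float  # hypothetical usefulness score; higher is kept in HOT first


def split_hot_cold(
    facts: List[Fact], hot_max_tokens: int, hot_max_facts: int
) -> Tuple[List[Fact], List[Fact]]:
    """Greedily fill HOT by descending value; anything that does not fit goes to COLD."""
    hot: List[Fact] = []
    cold: List[Fact] = []
    used_tokens = 0
    for fact in sorted(facts, key=lambda f: f.value, reverse=True):
        fits = (
            len(hot) < hot_max_facts
            and used_tokens + fact.tokens <= hot_max_tokens
        )
        if fits:
            hot.append(fact)
            used_tokens += fact.tokens
        else:
            cold.append(fact)
    return hot, cold
```

Lowering either cap shrinks what is injected from HOT on every turn, which is where the recurring token savings come from.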
Local embeddings where feasible
For high-volume installs, prefer embedding.provider: "ollama" or "onnx" and use cloud embedding as fallback via embedding.preferredProviders.
Continuous cost guardrails
Run weekly checks and keep cost tracking enabled:
- openclaw hybrid-mem stats --efficiency
- openclaw hybrid-mem context-audit
- openclaw hybrid-mem verify
Define spend/latency thresholds and downgrade feature sets when exceeded.
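A downgrade rule for spend and latency thresholds might look like the following Python sketch. It assumes the modes are ordered local, minimal, enhanced, complete from cheapest to richest, which is inferred from how this playbook groups them. The `Thresholds` type, `next_mode`, and the metric names are hypothetical, and the numbers would come from your own cost tracking rather than from any specific CLI output.

```python
from dataclasses import dataclass

# Assumed ordering, cheapest to richest (inferred from this playbook).
MODES = ["local", "minimal", "enhanced", "complete"]


@dataclass
class Thresholds:
    max_weekly_spend_usd: float
    max_p95_latency_ms: float


def next_mode(
    current: str,
    weekly_spend_usd: float,
    p95_latency_ms: float,
    limits: Thresholds,
) -> str:
    """Step down one mode when either threshold is exceeded; otherwise stay put."""
    over_budget = (
        weekly_spend_usd > limits.max_weekly_spend_usd
        or p95_latency_ms > limits.max_p95_latency_ms
    )
    index = MODES.index(current)
    if over_budget and index > 0:
        return MODES[index - 1]
    return current
```

Stepping down one mode per review keeps each change small enough to evaluate at the next weekly check.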

Weekly operator checklist

Review top token sources with context-audit.
Review tier/source growth with stats --efficiency.
Confirm model routing and warnings with verify.
Trim or disable low-value scheduled jobs.
Re-check within one week of any change.

Rollout order

Move all non-critical agents to minimal (or local).
Enforce cheap-first llm.nano and llm.maintenance.
Lock distill extraction to nano.
Enable/confirm distill local pre-filter.
Tighten recall budgets.
Disable per-write classification where not needed.
Tighten tiering limits (smaller HOT, earlier demotion to COLD).
Migrate heavy-traffic environments to local embeddings.
Keep only high-value cron jobs.
Establish weekly guardrail review and rollback rules.