Procedural Memory — Auto-Generated Skills from Learned Patterns
Issue: #23
Procedural memory extends the hybrid memory system with “what have I learned to do”: it extracts successful (and failed) multi-step tool-call patterns from session logs and turns them into reusable procedures. Once a procedure has been validated often enough, it can also become an auto-generated skill that any session or sub-agent can discover.
Overview
| Layer | What it does |
|---|---|
| 1. Procedure tagging | During session processing, multi-step tool sequences are extracted from JSONL logs; successful runs → positive procedures, failures → negative procedures. Stored in the procedures table and optionally as procedure-tagged facts. |
| 2. Procedure-aware recall | memory_recall_procedures(taskDescription) and auto-recall inject “Last time this worked” and “Known issue: avoid …” so the agent reuses proven flows and avoids known failures. |
| 3. Skill generation | After a procedure is validated N times (default 3), the plugin can auto-generate skills/auto/{slug}/SKILL.md and recipe.json, discoverable by the standard skill system. |
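
As a rough illustration of layer 1, the sketch below collects the tool calls from a JSONL session log into a candidate procedure and tags it positive or negative. The record shape (type, tool, args, is_error) and the function name extract_procedure are assumptions made for this example; the real session log schema and extractor are not described in this document.

```python
import json

MIN_STEPS = 2  # mirrors the documented minSteps default


def extract_procedure(jsonl_text, min_steps=MIN_STEPS):
    """Return a candidate procedure dict, or None if the run is too short.

    Assumed (hypothetical) record shape, one JSON object per line:
      {"type": "tool_call", "tool": "...", "args": {...}, "is_error": bool}
    """
    steps = []
    failed = False
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if record.get("type") != "tool_call":
            continue  # skip non-tool records such as chat messages
        steps.append({"tool": record["tool"], "args": record.get("args", {})})
        failed = failed or bool(record.get("is_error"))
    if len(steps) < min_steps:
        return None
    return {
        "procedure_type": "negative" if failed else "positive",
        "recipe": steps,
    }


# Hypothetical two-step session where the second call fails.
log = "\n".join([
    json.dumps({"type": "message", "text": "check Moltbook"}),
    json.dumps({"type": "tool_call", "tool": "http_get", "args": {"path": "/a"}}),
    json.dumps({"type": "tool_call", "tool": "http_get", "args": {"path": "/b"}, "is_error": True}),
])
candidate = extract_procedure(log)
```

Treating any failed step as a negative procedure is a simplification for the sketch; a real extractor may weigh the final outcome of the session instead.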
Configuration
All under plugins.entries["openclaw-hybrid-memory"].config.procedures:
| Option | Default | Description |
|---|---|---|
| enabled | true | Enable procedure extraction, recall injection, and skill generation. |
| sessionsDir | ~/.openclaw/agents/main/sessions | Directory containing session .jsonl files. |
| minSteps | 2 | Minimum tool-call steps to consider a sequence a procedure. |
| validationThreshold | 3 | Success count required before auto-generating a skill. |
| skillTTLDays | 30 | Auto skill generation only considers positive procedures whose latest activity (last_validated, else updated_at, else created_at) is within this many days. |
| skillsAutoPath | skills/auto | Path (relative to workspace or absolute) for auto-generated skills. |
| requireApprovalForPromote | true | When true, a human must move skills out of auto/ to promote them to permanent. |
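
To make the interplay of validationThreshold and skillTTLDays concrete, here is a minimal sketch of the eligibility check they describe. The function names are hypothetical, and the sketch assumes the procedure timestamps are epoch seconds (the document only states this for the facts table).

```python
SECONDS_PER_DAY = 86_400

# Documented defaults from the configuration table.
DEFAULTS = {"validationThreshold": 3, "skillTTLDays": 30}


def latest_activity(proc):
    """Pick last_validated, else updated_at, else created_at."""
    for key in ("last_validated", "updated_at", "created_at"):
        value = proc.get(key)
        if value is not None:
            return value
    return None


def is_skill_candidate(proc, now, cfg=DEFAULTS):
    """True when a positive procedure is validated enough and recent enough."""
    if proc.get("procedure_type") != "positive":
        return False
    if proc.get("success_count", 0) < cfg["validationThreshold"]:
        return False
    ts = latest_activity(proc)
    if ts is None:
        return False
    cutoff = now - cfg["skillTTLDays"] * SECONDS_PER_DAY
    return ts >= cutoff
```

Note that the fallback is ordered, not a maximum: a procedure with an old last_validated is treated as stale even if updated_at is recent.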
CLI Commands
Extract procedures from session logs
# Default: use config sessionsDir, all files
openclaw hybrid-mem extract-procedures
# Only sessions modified in last 7 days
openclaw hybrid-mem extract-procedures --days 7
# Custom directory
openclaw hybrid-mem extract-procedures --dir /path/to/sessions
# Preview without writing
openclaw hybrid-mem extract-procedures --dry-run
Use this in your nightly pipeline together with (or after) session distillation: the same session JSONL files feed both fact extraction and procedure extraction.
Generate auto-skills
# Generate SKILL.md + recipe.json for procedures that reached validationThreshold
openclaw hybrid-mem generate-auto-skills
# Preview only
openclaw hybrid-mem generate-auto-skills --dry-run
Generated skills live under skills/auto/ (or your procedures.skillsAutoPath). To promote one to a permanent skill, move the folder out of auto/ (e.g. to skills/ or a custom path).
Size and quality gates (issues #1537–#1548):
- SKILL.md is capped at 256 KB (OpenClaw loader default); the generator targets 200 KB with shrink + optional references/workflow.md offload.
- recipe.json is summarized and capped at 64 KB (no raw marathon traces).
- Frontmatter uses Skill Creator layout: name, description, and metadata.{category,provenance,generated_at}.
- Deterministic evals are written to evals/results.json; promotion defers on failed trigger/functional/actionability gates.
- Legacy oversized or transcript-style skills: openclaw hybrid-mem skills audit [--json] [--quarantine].
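
The byte budgets above can be pictured as a simple size check. This is only a sketch: the finding names are invented for the example, and it assumes 1 KB = 1024 bytes measured on the UTF-8 encoding, which the document does not specify.

```python
KB = 1024  # assumption: binary kilobytes

SKILL_MD_CAP = 256 * KB     # loader cap
SKILL_MD_TARGET = 200 * KB  # generator target before shrink/offload
RECIPE_CAP = 64 * KB


def check_sizes(skill_md: str, recipe_json: str):
    """Return a list of size findings; an empty list means both files fit."""
    findings = []
    skill_bytes = len(skill_md.encode("utf-8"))
    recipe_bytes = len(recipe_json.encode("utf-8"))
    if skill_bytes > SKILL_MD_CAP:
        findings.append("skill_md_over_cap")  # hypothetical finding name
    elif skill_bytes > SKILL_MD_TARGET:
        findings.append("skill_md_over_target")  # hypothetical finding name
    if recipe_bytes > RECIPE_CAP:
        findings.append("recipe_over_cap")  # hypothetical finding name
    return findings
```

A draft that lands between the target and the cap would be the natural point to shrink the body or offload detail to references/workflow.md.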
Generated skill telemetry
openclaw hybrid-mem skills telemetry
openclaw hybrid-mem skills telemetry moltbook-check
openclaw hybrid-mem skills demote moltbook-check --reason "over-triggering"
openclaw hybrid-mem skills reset moltbook-check --reason "agent prompt updated; false positives were stale"
openclaw hybrid-mem skills reject moltbook-check --reason "superseded by skill-xyz"
openclaw hybrid-mem skills doctor # scan for skills missing on disk
openclaw hybrid-mem skills doctor --fix # mark missing skills as uninstalled
openclaw hybrid-mem skills audit --json # scan skills/auto for oversized or suspicious drafts
openclaw hybrid-mem skills audit --quarantine # move unsafe/oversized auto-skills aside (recoverable)
Generated skills start in the experimental lifecycle state. Each activation or near-miss can be recorded with openclaw hybrid-mem skills record <skill-name> ..., and a specific activation can later be marked as a false-positive with openclaw hybrid-mem skills correct <activation-id> --reason "...".
Telemetry reports surface activations per week, near-misses, false-positive/false-negative signals, success/failure/partial rates, repeated corrections, and archive/revision candidates. Each row also includes a heuristic riskLevel (low | medium | high) derived from task pattern + recipe content. The lifecycle policy:
- Auto-promotes experimental skills to trusted after repeated successful uses without correction.
- Auto-demotes when false-positive rate crosses a risk-adjusted threshold (high-risk demotes sooner, low-risk uses a slightly higher FP bar).
- Auto-archives skills after the configured idle window has passed since the last selected activation (or since skill generation when there have been no selections).
- Auto-unblocks demoted skills back to experimental after enough clean uses (configurable via unblockAfterCleanUses).
When a skill is reset from demoted back to experimental (manually or automatically), the evaluation window resets so pre-demotion signals don’t block the recovery.
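
The lifecycle policy can be read as a small state machine. The sketch below is one possible reading, not the plugin's implementation: every numeric threshold is a placeholder, the state name archived is assumed, and SkillStats is a hypothetical container for the telemetry counts.

```python
from dataclasses import dataclass

# All numeric thresholds below are placeholders chosen for the example.
PROMOTE_AFTER_CLEAN_USES = 5
UNBLOCK_AFTER_CLEAN_USES = 3  # stands in for unblockAfterCleanUses
IDLE_DAYS_BEFORE_ARCHIVE = 60
FP_RATE_LIMIT = {"high": 0.10, "medium": 0.20, "low": 0.30}


@dataclass
class SkillStats:
    state: str            # experimental | trusted | demoted | archived
    risk_level: str       # low | medium | high
    activations: int      # selected activations in the current window
    false_positives: int
    clean_uses: int       # successful uses without correction
    idle_days: int        # days since last selected activation (or generation)


def next_state(s: SkillStats) -> str:
    """Apply archive, unblock, demote, and promote rules in that order."""
    if s.state == "archived" or s.idle_days >= IDLE_DAYS_BEFORE_ARCHIVE:
        return "archived"
    if s.state == "demoted":
        if s.clean_uses >= UNBLOCK_AFTER_CLEAN_USES:
            return "experimental"
        return "demoted"
    fp_rate = s.false_positives / s.activations if s.activations else 0.0
    if fp_rate > FP_RATE_LIMIT[s.risk_level]:
        return "demoted"  # high-risk skills hit this bar sooner
    if (s.state == "experimental"
            and s.clean_uses >= PROMOTE_AFTER_CLEAN_USES
            and s.false_positives == 0):
        return "trusted"
    return s.state
```

With these placeholder limits, a 20% false-positive rate demotes a high-risk skill but leaves a low-risk one in its current state.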
Procedure candidate score, user signal, and risk (#1414)
When ranking promotion candidates and generating verification telemetry:
- User signal uses a 0 raw baseline, clamps to [-1,1], then remaps to [0,1] for additive scoring.
- Rules/preferences contribution is capped so repeated rules cannot dominate the signal.
- Risk is applied as a multiplicative score factor (high ≈ 0.35x, medium ≈ 0.65x, low 1x) rather than a small additive nudge.
- Generated-skill demotion uses the same risk tier via effectiveDemoteThresholdsForRisk.
- Concreteness and reusability (distinct sessions) are additive score terms; deferrals include procedure_too_obvious (single obvious read/git status-class steps) and low_concreteness (thin task/recipe).
- auto-safe additionally requires ≥1 manual workflow request or ≥3 distinct source sessions (insufficient_auto_safe_evidence).
- Near-duplicate procedures are clustered (task-token Jaccard ≥ 0.6); non-representatives defer with cluster_merged_into and land in verification.json as relatedProcedures.
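
The scoring rules in this list can be sketched in a few lines. The function names are hypothetical, the remap from [-1,1] to [0,1] is assumed to be linear, and whitespace tokenization for the Jaccard measure is a simplification; only the clamp range, the risk factors, the 0.6 threshold, and the auto-safe evidence rule come from the text.

```python
RISK_FACTOR = {"high": 0.35, "medium": 0.65, "low": 1.0}  # approximate documented factors
JACCARD_THRESHOLD = 0.6


def remap_user_signal(raw: float) -> float:
    """Clamp to [-1, 1], then map linearly to [0, 1] (0 raw -> 0.5)."""
    clamped = max(-1.0, min(1.0, raw))
    return (clamped + 1.0) / 2.0


def candidate_score(additive_terms, raw_user_signal, risk_level):
    """Sum the additive terms plus user signal, then scale by the risk factor."""
    base = sum(additive_terms) + remap_user_signal(raw_user_signal)
    return base * RISK_FACTOR[risk_level]


def task_jaccard(a: str, b: str) -> float:
    """Jaccard similarity over lowercase whitespace tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def is_near_duplicate(a: str, b: str) -> bool:
    return task_jaccard(a, b) >= JACCARD_THRESHOLD


def has_auto_safe_evidence(manual_requests: int, distinct_sessions: int) -> bool:
    """At least 1 manual workflow request or at least 3 distinct source sessions."""
    return manual_requests >= 1 or distinct_sessions >= 3
```

Because risk multiplies the whole score, a high-risk candidate needs roughly three times the additive evidence of a low-risk one to reach the same rank.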
Skill Creator alignment (v2)
Generated SKILL.md bodies are tightened (~6 sections): Do Not Use When, Workflow (risk-tiered freedom + checklist + plan→validate→execute when needed), Verification, Examples (concrete input/output), optional Anti-patterns. Triggering lives in a pushy, multi-paraphrase description (≤1024 chars); name uses gerund form (≤64 chars, no anthropic/claude).
Sidecars:
| Artifact | Purpose |
|---|---|
| evals/trigger-eval.json | 8 should-trigger + 8 should-not-trigger queries (Skill Creator schema) |
| evals/results.json | Deterministic eval + replay baselineComparison vs historical prompts |
| references/telemetry.md | Operator telemetry / rollback (not in SKILL.md) |
| references/workflow.md | Progressive disclosure when over byte budget |
| scripts/replay.sh | Deterministic exec replay when recipe has repeatable commands |
See SKILL-PIPELINES.md for the full pipeline architecture and operator playbooks.
Tools
memory_recall_procedures(taskDescription, limit?)
Searches stored procedures by task description (FTS on task_pattern). Returns:
- Last time this worked: positive procedures with recipe steps.
- Known issues (avoid): negative procedures (e.g. dead endpoints, failing flows).
Example: when the user says “check Moltbook”, the agent can call memory_recall_procedures("check Moltbook") and get back working steps and warnings like “don’t use /api/v1/agents/notifications (returns HTML 404)”.
Auto-recall injection
When auto-recall is enabled and procedures are enabled, each turn the plugin:
- Searches procedures matching the current prompt.
- If any match, prepends a <relevant-procedures> block to the injected context with:
  - Short “Last time this worked” lines (task + steps).
  - “Known issue (avoid)” lines for negative procedures.
So the model sees procedure hints without having to call the tool first.
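
A minimal sketch of how such a block might be assembled from matched procedures follows. The dict shape (task_pattern, procedure_type, steps) and the exact line wording are assumptions for illustration; only the tag name and the two line prefixes come from the description above.

```python
def build_procedure_block(procedures):
    """Render matched procedures as a <relevant-procedures> context block.

    Each procedure is a dict with task_pattern, procedure_type, and steps;
    this shape is an assumption for the sketch.
    """
    lines = []
    for proc in procedures:
        if proc["procedure_type"] == "positive":
            steps = " -> ".join(proc["steps"])
            lines.append(f"Last time this worked: {proc['task_pattern']} ({steps})")
        else:
            lines.append(f"Known issue (avoid): {proc['task_pattern']}")
    if not lines:
        return ""  # nothing matched, so nothing is injected
    return "<relevant-procedures>\n" + "\n".join(lines) + "\n</relevant-procedures>"
```

Returning an empty string when nothing matches keeps the injected context unchanged for prompts that have no stored procedure.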
Schema (SQLite)
Facts table (additions)
- procedure_type: 'positive' | 'negative' | NULL
- success_count: integer, default 0
- last_validated: epoch seconds or NULL
- source_sessions: JSON array of session IDs (text)
Procedures table
- id, task_pattern, recipe_json, procedure_type (positive | negative)
- success_count, failure_count, last_validated, last_failed
- confidence, ttl_days, promoted_to_skill, skill_path
- skill_state, skill_state_reason, skill_version, skill_generated_at
- created_at, updated_at
Generated skill telemetry table
- procedure_id, skill_name, skill_version
- request_hash, request_summary, decision, confidence, reason
- task_outcome, user_correction, correction_reason
- false_negative_signal, caused_rework, saved_tool_calls, saved_time_ms
- scope, scope_target, agent_id, session_id, created_at
Full-text search: procedures_fts on task_pattern for searchProcedures and getNegativeProceduresMatching.
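
To show how a full-text index over task_pattern can back the recall tool, here is a self-contained sketch using Python's sqlite3 module. It assumes SQLite's FTS5 extension (the document only says "FTS"), uses a reduced subset of the columns listed above, and the helper names add_procedure and search_procedures are illustrative rather than the plugin's actual functions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE procedures (
    id TEXT PRIMARY KEY,
    task_pattern TEXT NOT NULL,
    recipe_json TEXT NOT NULL,
    procedure_type TEXT NOT NULL CHECK (procedure_type IN ('positive', 'negative')),
    success_count INTEGER NOT NULL DEFAULT 0
);
-- Standalone FTS5 index; id is stored but not tokenized.
CREATE VIRTUAL TABLE procedures_fts USING fts5(id UNINDEXED, task_pattern);
""")


def add_procedure(proc_id, task_pattern, recipe_json, procedure_type):
    """Insert a procedure and mirror its task_pattern into the FTS index."""
    conn.execute(
        "INSERT INTO procedures (id, task_pattern, recipe_json, procedure_type) "
        "VALUES (?, ?, ?, ?)",
        (proc_id, task_pattern, recipe_json, procedure_type),
    )
    conn.execute(
        "INSERT INTO procedures_fts (id, task_pattern) VALUES (?, ?)",
        (proc_id, task_pattern),
    )


def search_procedures(query, limit=5):
    """Return (id, task_pattern, procedure_type) rows, best match first."""
    return conn.execute(
        """
        SELECT p.id, p.task_pattern, p.procedure_type
        FROM procedures_fts
        JOIN procedures AS p ON p.id = procedures_fts.id
        WHERE procedures_fts MATCH ?
        ORDER BY rank
        LIMIT ?
        """,
        (query, limit),
    ).fetchall()


add_procedure("p1", "Check Moltbook feed", "[]", "positive")
add_procedure("n1", "check moltbook notifications endpoint", "[]", "negative")
matches = search_procedures("check Moltbook")
```

The query string is passed to FTS5 as-is, so free-form user text containing FTS operators or punctuation would need escaping before it is used this way.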
Security and safety
- Secrets: Procedure recipes never store API keys, passwords, or tokens; the extractor redacts known secret keys from step args.
- Sandbox: Auto-generated skills are written only under skills/auto/ (or your configured path), separate from human-authored skills.
- Rate limiting: Skill generation is capped per run (default 10) to avoid runaway self-modification.
- Audit: Each generated skill file includes the source procedure id and metadata (confidence, last validated).
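
The secret-redaction rule can be illustrated with a small recursive filter over step arguments. This is a sketch under assumptions: the list of key hints and the placeholder string are invented for the example, and the extractor's real matching rules are not given in this document.

```python
REDACTED = "[REDACTED]"  # hypothetical placeholder value

# Hypothetical key fragments; the extractor's real list is not given here.
SECRET_KEY_HINTS = ("api_key", "apikey", "password", "token", "secret", "authorization")


def _looks_secret(key) -> bool:
    lowered = str(key).lower()
    return any(hint in lowered for hint in SECRET_KEY_HINTS)


def redact_args(value):
    """Recursively replace values stored under secret-looking keys."""
    if isinstance(value, dict):
        return {
            key: REDACTED if _looks_secret(key) else redact_args(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact_args(item) for item in value]
    return value
```

Key-based matching of this kind does not catch secrets embedded in free-text values such as URLs or command strings, which is presumably why the promotion gates described later also scan for high-entropy leakage.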
Example end-to-end
- Day 1: User asks to “check Moltbook”. Agent calls /api/v1/agents/notifications, gets HTML 404. Session ends in failure.
- Nightly: openclaw hybrid-mem extract-procedures --days 1 runs. Parser sees tool sequence + error content → stores a negative procedure: “Check Moltbook …” with recipe and procedure_type: negative.
- Day 2: User asks again to “check Moltbook”. Auto-recall injects: “Known issue (avoid): … /notifications …”. Agent uses a different endpoint and succeeds.
- Nightly: Extract-procedures runs again; this time the session is successful → positive procedure stored or existing one’s success_count incremented.
- Day 7: After several successful runs, success_count reaches 3. You run openclaw hybrid-mem generate-auto-skills → skills/auto/moltbook-check/SKILL.md and recipe.json are created.
- Later: Any session or sub-agent that loads skills can use skills/auto/moltbook-check until you move it out of auto/ to promote it.
Related docs
- MEMORY-TO-SKILLS.md: Cluster procedures and synthesize skill drafts (nightly or skills-suggest).
- SESSION-DISTILLATION.md: Fact extraction from session logs (same JSONL source).
- CLI-REFERENCE.md: All hybrid-mem commands.
- CONFIGURATION.md: Full plugin config reference.
Procedure-to-skill promotion autopilot (#1328)
Procedure promotion uses the shared pending-autopilot foundation from #1334. The procedure adapter emits shared PendingDecision envelopes with queue procedures, input hash, policy version, reason/capability classes, redacted evidence, and parent/child equivalence tests. The parent digest-autopilot route (#1326) must consume these same adapter decisions; cron (#1330) only invokes/observes the parent.
Commands:
openclaw hybrid-mem procedures triage --not-promoted --policy draft-only --json
openclaw hybrid-mem generate-auto-skills --dry-run --max 50 --policy auto-safe --json
openclaw hybrid-mem generate-auto-skills --apply --max 50 --policy auto-safe --json
Policies are conservative:
- draft-only/manual: classify and report; mutation paths require human review.
- auto-safe: writes only draft/quarantined generated skill artifacts when all eligibility, safety, quality, duplicate, trigger, and usefulness gates pass.
Dry-run is non-mutating: it writes no skill files, does not mark procedures promoted, and does not update durable pending-autopilot state. Apply writes SKILL.md, recipe.json, verification.json, and evals/evals.json only for candidates that pass every gate. Generated skills are marked enabled: false; static validation alone never enables a skill.
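
The dry-run versus apply contract can be sketched as a function that always produces the same report but touches the filesystem only when asked to. Everything here is illustrative: the candidate shape, the gate names, the decision labels, and the placeholder file contents are assumptions, and placing enabled: false in verification.json is just one plausible location for that marker.

```python
import json
import tempfile
from pathlib import Path

# Artifact names from the paragraph above; contents are placeholders.
PLACEHOLDERS = {
    "SKILL.md": "# draft skill\n",
    "recipe.json": "{}",
    "verification.json": json.dumps({"enabled": False}),
    "evals/evals.json": "{}",
}


def promote(candidates, out_dir, apply=False):
    """Return one report row per candidate; write artifacts only when apply=True."""
    report = []
    for cand in candidates:
        failed = [name for name, ok in cand["gates"].items() if not ok]
        decision = "defer" if failed else "write_draft"
        report.append({"slug": cand["slug"], "decision": decision, "failed_gates": failed})
        if apply and not failed:
            skill_dir = Path(out_dir) / cand["slug"]
            for rel, content in PLACEHOLDERS.items():
                path = skill_dir / rel
                path.parent.mkdir(parents=True, exist_ok=True)
                path.write_text(content, encoding="utf-8")
    return report


# Hypothetical candidates: one passes every gate, one fails the safety gate.
candidates = [
    {"slug": "moltbook-check", "gates": {"safety": True, "trigger": True}},
    {"slug": "noisy-procedure", "gates": {"safety": False, "trigger": True}},
]
workdir = tempfile.mkdtemp()
dry_report = promote(candidates, workdir, apply=False)
```

Keeping the report identical between the two modes means a dry-run result can be reviewed and then applied without the decision changing underneath the reviewer, provided the inputs have not changed.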
Promotion gates reject or defer procedures with insufficient success evidence, insufficient distinct sessions/contexts, recent failures, low confidence/success rate, malformed/noisy/non-deterministic recipes, vague or context-specific triggers, missing validation checks, duplicate/overlapping existing skills, destructive/service/package/SSH/remote operations, credential/private-data/high-entropy leakage, external sends/posts/writes, or approval-bypass/prompt-injection content. Recipe JSON and generated SKILL.md are both scanned and redacted before durable output.
Generated skill drafts follow the skill-creator quality contract: trigger and near-miss examples, scope/non-scope, prerequisites, ordered workflow, safe tool usage, validation, failure handling, rollback/disable guidance, realistic examples, and provenance metadata. Verification metadata records source procedure ids, success/failure/session counts, input hash, policy version, static/safety/trigger/functional eval status, baseline comparison, rejection/defer reasons, and enabled: false.