Deep Dive — Storage, Search, Tags, Links, and Context

How facts are stored, searched, linked, and injected — the internals explained.


The two backends

Every fact is stored in two places simultaneously:

1. SQLite + FTS5 (structured storage)

File: ~/.openclaw/memory/facts.db

SQLite stores the full fact with all metadata:

facts table
├── id              (UUID)
├── text            ("User prefers dark mode")
├── category        (preference)
├── importance      (0.5)
├── entity          ("user")
├── key             ("preference")
├── value           ("dark mode")
├── source          ("conversation")
├── created_at      (epoch seconds)
├── source_date     (epoch seconds, when fact originated)
├── decay_class     (stable)
├── expires_at      (epoch seconds or null)
├── confidence      (1.0, decays over time)
├── summary         (short summary for long facts)
├── tags            ("ui,preference")
├── normalized_hash (SHA-256 of normalized text, for dedup)
├── recall_count    (how many times recalled)
├── last_accessed   (epoch seconds)
├── access_count    (INTEGER, times recalled — added by migration #237)
├── last_accessed_at (ISO 8601 timestamp of last recall — added by migration #237)
├── valid_from      (bi-temporal: when fact became true)
├── valid_until     (bi-temporal: when fact stopped being true)
├── superseded_at   (epoch seconds, when a newer fact replaced this)
├── superseded_by   (ID of the replacing fact)
└── supersedes_id   (ID of the fact this one replaced)

An FTS5 virtual table (facts_fts) mirrors text, category, entity, key, and value for full-text search. It uses Porter stemming and Unicode tokenization — so searching for “preferred” matches “prefer”, and non-ASCII characters work correctly. Triggers keep FTS in sync on every insert/update/delete.
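The external-content FTS5 setup described above can be sketched with Python's built-in sqlite3 module. The table and column names (facts, facts_fts) come from the schema above; the exact trigger bodies in the plugin are internal, so treat this as a minimal illustration of the sync-on-insert pattern (only the insert trigger is shown):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE facts (
  id TEXT PRIMARY KEY, text TEXT, category TEXT,
  entity TEXT, key TEXT, value TEXT
);
-- Mirror of the searchable columns; Porter stemming + Unicode tokenization
CREATE VIRTUAL TABLE facts_fts USING fts5(
  text, category, entity, key, value,
  content='facts', content_rowid='rowid',
  tokenize='porter unicode61'
);
-- Keeps FTS in sync on insert (update/delete triggers are analogous)
CREATE TRIGGER facts_ai AFTER INSERT ON facts BEGIN
  INSERT INTO facts_fts(rowid, text, category, entity, key, value)
  VALUES (new.rowid, new.text, new.category, new.entity, new.key, new.value);
END;
""")
conn.execute(
    "INSERT INTO facts VALUES ('1', 'User prefers dark mode', "
    "'preference', 'user', 'preference', 'dark mode')")

# Porter stemming: searching "preferred" matches "prefers"
rows = conn.execute(
    "SELECT text FROM facts_fts WHERE facts_fts MATCH 'preferred'"
).fetchall()
```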

Indexes on: category, entity, created_at, expires_at, decay_class, tags, source_date, last_accessed, superseded_at, valid_from/valid_until, normalized_hash, access_count, last_accessed_at (partial, non-null only).

2. LanceDB (vector storage)

Directory: ~/.openclaw/memory/lancedb/

LanceDB stores the embedding vector alongside minimal metadata:

memories table (LanceDB)
├── id              (UUID)
├── text            (fact text)
├── vector          (float array, 1536 dims for text-embedding-3-small)
├── importance      (0.5)
├── category        ("preference")
└── createdAt       (epoch seconds)

The vector is generated by sending the fact text to the configured embedding provider (default: OpenAI text-embedding-3-small, 1536 dimensions). Alternative providers — Ollama (local), ONNX (local), or Google (Gemini API) — are also supported; see LLM-AND-PROVIDERS.md. LanceDB uses this vector for approximate nearest neighbor search — finding facts that are semantically similar to a query even when the exact words don’t match.

Why two backends?

| | SQLite + FTS5 | LanceDB |
|---|---|---|
| Query type | Exact match, keyword, entity/key lookup | “What was that thing about…” fuzzy semantic |
| Cost | Free (local) | ~$0.00002 per embedding (OpenAI); free with Ollama or ONNX local providers |
| Speed | Instant (local disk) | Fast (local disk + ANN index) |
| Structured data | Full metadata (entity, key, value, tags, decay) | Minimal (text + vector) |
| When it excels | “What’s User’s email?” — exact entity/key | “That discussion about database choices” — semantic |

By searching both and merging results, the system gets the best of both worlds.


How search works

FTS5 search (SQLite)

When you search for “database performance”:

  1. Query preparation — words are quoted and joined with OR: "database" OR "performance"
  2. FTS5 MATCH — Porter stemming means “databases” matches “database”, “performing” matches “performance”
  3. Scoring — combines three weighted factors, plus a dynamic salience adjustment:
    • BM25 rank (60%) — text relevance from FTS5
    • Freshness (25%) — how far from expiry (1.0 = not expiring, 0.0 = expired)
    • Confidence (15%) — decays over time; refreshed on access
    • Dynamic salience — access boost (frequently recalled facts score higher) and time decay (older unused memories fade). See DYNAMIC-SALIENCE.md.
  4. Filtering — excludes expired facts, superseded facts, and optionally filters by tag
  5. Sorting — by composite score, then by effective date (newer first) on ties
  6. Access tracking — bumps access_count, last_accessed_at (and recall_count) for returned facts; extends TTL for stable/active/durable/normal facts; drives salience scoring
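Steps 1 and 3 can be sketched in a few lines. The weights (60/25/15) come from the list above; how the plugin normalizes FTS5's raw bm25() rank (which is negative) into a 0..1 value is an internal detail, so the function below assumes an already-normalized input:

```python
def prepare_match_query(query: str) -> str:
    # Step 1: quote each word and join with OR
    return " OR ".join(f'"{w}"' for w in query.split())

def composite_score(bm25_norm: float, freshness: float, confidence: float) -> float:
    # Step 3: weighted blend of relevance, freshness, and confidence
    # (salience adjustments would be applied on top of this)
    return 0.60 * bm25_norm + 0.25 * freshness + 0.15 * confidence

query = prepare_match_query("database performance")
# '"database" OR "performance"'
```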

Vector search (LanceDB)

When you search for “database performance”:

  1. Embed the query — sends the text to the configured embedding provider (default: OpenAI), gets a 1536-dim vector (~$0.00002 with OpenAI; free with local providers)
  2. ANN search — LanceDB finds the N nearest vectors by distance
  3. Score conversion — distance → score: score = 1 / (1 + distance) (higher = more similar)
  4. Min score filter — drops results below minScore (default 0.3)
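Steps 3 and 4 are simple arithmetic. A minimal sketch (the tuple shape of the ANN results is illustrative, not the plugin's actual type):

```python
def distance_to_score(distance: float) -> float:
    # Identical vectors (distance 0) score 1.0; larger distances approach 0
    return 1.0 / (1.0 + distance)

def filter_by_min_score(results, min_score: float = 0.3):
    # results: (fact_id, distance) pairs from the ANN search
    scored = [(fid, distance_to_score(d)) for fid, d in results]
    return [(fid, s) for fid, s in scored if s >= min_score]
```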

Merge and deduplicate

Results from both backends are merged:

  1. SQLite results first — added to the merged list (have full metadata)
  2. LanceDB results — added if:
    • Not a duplicate (by ID or by text match with an existing result)
    • Not a superseded fact (checked against superseded texts cache)
  3. Sort by score — highest score first; ties broken by newer effective date (source_date or created_at)
  4. Limit — trim to the requested number of results

LanceDB results that match a SQLite fact by text are enriched — the full metadata from SQLite replaces the minimal LanceDB metadata. This means vector search results get proper entity/key/value, tags, decay info, etc.
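The merge logic above can be sketched as follows. The dict shape (id/text/score/effective_date keys) is hypothetical; the enrichment step is simplified to "the SQLite copy wins" since it already carries full metadata:

```python
def merge_results(sqlite_results, lance_results, superseded_texts, limit):
    # SQLite results first: they carry full metadata
    merged = list(sqlite_results)
    seen_ids = {r["id"] for r in merged}
    seen_texts = {r["text"] for r in merged}
    for r in lance_results:
        if r["id"] in seen_ids or r["text"] in seen_texts:
            continue  # duplicate: keep the metadata-rich SQLite copy
        if r["text"] in superseded_texts:
            continue  # don't resurface superseded facts
        merged.append(r)
    # Highest score first; ties broken by newer effective date
    merged.sort(key=lambda r: (-r["score"], -r.get("effective_date", 0)))
    return merged[:limit]

sqlite_hits = [{"id": "1", "text": "Use pnpm", "score": 0.8, "effective_date": 200}]
lance_hits = [{"id": "2", "text": "DB is Postgres", "score": 0.9},
              {"id": "3", "text": "Use npm", "score": 0.95}]
res = merge_results(sqlite_hits, lance_hits, superseded_texts={"Use npm"}, limit=10)
# "Use npm" is dropped (superseded); "DB is Postgres" outranks "Use pnpm"
```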


How lookup works

Lookup is SQLite-only — no vector search, no embedding cost.

lookup("user", "preference") runs:

SELECT * FROM facts
WHERE lower(entity) = lower('user')
  AND lower(key) = lower('preference')
  AND (expires_at IS NULL OR expires_at > now)
  AND superseded_at IS NULL
ORDER BY confidence DESC, COALESCE(source_date, created_at) DESC

Returns: all matching facts, ordered by confidence (highest first), then by effective date (newest first).

With tag filter: lookup("user", "preference", "ui") adds:

AND (',' || COALESCE(tags,'') || ',') LIKE '%,ui,%'
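The lookup query above can be exercised end-to-end with sqlite3. The schema is pared down to the columns the query touches; note how the superseded fact and the fact without the "ui" tag are both filtered out:

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE facts (
  id TEXT, entity TEXT, key TEXT, text TEXT, tags TEXT,
  confidence REAL, created_at INTEGER, source_date INTEGER,
  expires_at INTEGER, superseded_at INTEGER)""")
now = int(time.time())
conn.executemany("INSERT INTO facts VALUES (?,?,?,?,?,?,?,?,?,?)", [
    ("1", "user", "preference", "User prefers dark mode", "ui,preference",
     1.0, now, None, None, None),
    ("2", "user", "preference", "User prefers tabs", "editor",
     0.9, now, None, None, None),
    ("3", "user", "preference", "Old preference", "ui",
     1.0, now - 100, None, None, now),   # superseded
])

rows = conn.execute("""
  SELECT text FROM facts
  WHERE lower(entity) = lower(?) AND lower(key) = lower(?)
    AND (expires_at IS NULL OR expires_at > ?)
    AND superseded_at IS NULL
    AND (',' || COALESCE(tags,'') || ',') LIKE '%,' || ? || ',%'
  ORDER BY confidence DESC, COALESCE(source_date, created_at) DESC
""", ("user", "preference", now, "ui")).fetchall()
# Only fact 1 survives: fact 2 lacks the "ui" tag, fact 3 is superseded
```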

How auto-recall injects context

Each turn, before the agent sees your message:

  1. Search both backends with your prompt as the query
  2. Merge results (as described above)
  3. Optional entity lookup — if your prompt mentions a known entity (e.g. “user”), lookup facts for that entity are merged in. Names come from entityLookup.entities when that list is non-empty; otherwise, with entityLookup.autoFromFacts true (default), from distinct entity values on active facts (capped by maxAutoEntities). See CONFIGURATION-MODES.md and CONFIGURATION.md.
  4. Optional graph traversal — if enabled, follow typed links from seed facts to discover related facts (zero LLM cost)
  5. Score adjustments:
    • preferLongTerm — multiply score by 1.2 for permanent facts, 1.1 for stable
    • useImportanceRecency — factor in importance and recency alongside relevance
  6. Token budget — accumulate facts until maxTokens (default 800) is reached
  7. Summary injection — if a fact is longer than summaryThreshold (default 300 chars) and has a stored summary, inject the summary instead
  8. Format — each fact is formatted as:
    • full: [sqlite/preference] User prefers dark mode
    • short: preference: User prefers dark mode
    • minimal: User prefers dark mode
  9. Inject — the formatted block is prepended to the agent’s context as <memory-context>...</memory-context>
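Steps 8 and 9 can be sketched as a small formatter. The dict keys and the helper name are illustrative; the three output styles and the <memory-context> wrapper match the description above:

```python
def format_fact(fact: dict, style: str = "full") -> str:
    # "backend" is where the result came from (e.g. sqlite), per the full style
    if style == "full":
        return f"[{fact['backend']}/{fact['category']}] {fact['text']}"
    if style == "short":
        return f"{fact['category']}: {fact['text']}"
    return fact["text"]  # minimal

fact = {"backend": "sqlite", "category": "preference",
        "text": "User prefers dark mode"}
block = format_fact(fact, "full")
injected = f"<memory-context>\n{block}\n</memory-context>"
```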

Tags

What tags are

Tags are comma-separated topic labels stored in the tags column. They enable filtered queries — “show me only Zigbee-related facts” — without relying on full-text or semantic matching.

How tags are assigned

Auto-tagging at write time: when tags is omitted from memory_store or CLI store, the plugin runs extractTags(text, entity):

"NIBE F1245 uses Modbus TCP on port 502"
→ tags: ["nibe"]

"Home Assistant Zigbee coordinator on /dev/ttyUSB0"
→ tags: ["homeassistant", "zigbee"]

"OAuth token for the API endpoint"
→ tags: ["auth", "api"]

Built-in patterns: nibe, zigbee, z-wave, auth, homeassistant, openclaw, postgres, sqlite, lancedb, api, docker, kubernetes, ha.

Entity-based tags: if the entity name matches a known tag pattern, it’s added automatically.
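A sketch of what extractTags does, using a subset of the built-in patterns listed above. The plugin's actual regexes are internal, so these patterns are illustrative; this version returns tags in alphabetical order:

```python
import re

# Illustrative subset of the built-in tag patterns
TAG_PATTERNS = {
    "nibe": r"\bnibe\b",
    "zigbee": r"\bzigbee\b",
    "homeassistant": r"\bhome\s*assistant\b",
    "auth": r"\b(oauth|auth|token)\b",
    "api": r"\bapi\b",
}

def extract_tags(text: str, entity: str = "") -> list:
    # Entity-based tags: the entity name is matched against the same patterns
    haystack = f"{text} {entity}".lower()
    return sorted(tag for tag, pat in TAG_PATTERNS.items()
                  if re.search(pat, haystack))

extract_tags("Home Assistant Zigbee coordinator on /dev/ttyUSB0")
# ["homeassistant", "zigbee"]
```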

Manual tags: pass tags explicitly to override auto-tagging:

openclaw hybrid-mem store --text "..." --tags "nibe,modbus,hvac"

How tags are stored

Stored as a comma-separated string: "nibe,modbus,hvac". Indexed with a partial index on non-empty values.

How tag filtering works

Search and lookup accept an optional tag parameter. The filter uses substring matching on the comma-delimited list:

-- "Does the tags string contain ',nibe,' ?"
AND (',' || COALESCE(tags,'') || ',') LIKE '%,nibe,%'

When memory_recall is called with a tag filter, it skips LanceDB and uses only SQLite (FTS search + lookup) with the tag filter applied. This is faster and free.
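The comma-wrapping in the SQL above can be mirrored in plain Python, which makes the point of the delimiters clear: wrapping both sides in commas prevents a tag like "ui" from matching inside an unrelated tag like "guide":

```python
def tag_matches(tags, tag: str) -> bool:
    # Mirror of: (',' || COALESCE(tags,'') || ',') LIKE '%,<tag>,%'
    return f",{tag}," in f",{tags or ''},"
```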


Links

Links are typed, directed relationships between facts, stored in the memory_links table:

memory_links table
├── id               (UUID)
├── source_fact_id   (the "from" fact)
├── target_fact_id   (the "to" fact)
├── link_type        (SUPERSEDES, CAUSED_BY, PART_OF, RELATED_TO, DEPENDS_ON)
├── strength         (0.0 – 1.0, default 1.0)
└── created_at       (epoch seconds)
| Type | Meaning | Example |
|---|---|---|
| SUPERSEDES | New fact replaces old | “Use pnpm” supersedes “Use npm” |
| CAUSED_BY | A caused B | “Build failure” caused by “Dependency update” |
| PART_OF | A is part of B | “Login page” is part of “Auth system” |
| RELATED_TO | General association | “Database schema” related to “API endpoints” |
| DEPENDS_ON | A requires B | “Frontend deploy” depends on “API deploy” |

How links are created

Explicit: The agent calls memory_link(sourceId, targetId, linkType, strength).

Auto-linking (when graph.autoLink is enabled): After storing a new fact, the plugin finds similar existing facts via embedding search. If the similarity score exceeds graph.autoLinkMinScore (default 0.7), a RELATED_TO link is created automatically.

Supersession: When classify-before-write determines a fact should UPDATE an existing one, a SUPERSEDES link is implicit in the supersedes_id / superseded_by columns.

How graph traversal works

When recall is enabled with graph.useInRecall:

  1. Seed set — initial search results (from FTS5 + LanceDB)
  2. BFS traversal — from each seed fact, follow links in both directions up to graph.maxTraversalDepth hops (default 2)
  3. Collect connected facts — all discovered fact IDs
  4. Fetch and merge — load the connected facts from SQLite and add them to the result set

This is zero LLM cost — graph traversal uses only SQLite queries. It finds causally or structurally related facts that vector search would miss.
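The traversal steps above amount to a plain breadth-first search over the link table. A sketch (the pair-based link representation is simplified; the real table also carries link_type and strength):

```python
from collections import deque

def traverse(links, seed_ids, max_depth=2):
    """BFS over memory links; links is a list of
    (source_fact_id, target_fact_id) pairs."""
    neighbors = {}
    for src, dst in links:
        neighbors.setdefault(src, set()).add(dst)
        neighbors.setdefault(dst, set()).add(src)   # follow links both ways
    seen = set(seed_ids)
    queue = deque((fid, 0) for fid in seed_ids)
    while queue:
        fid, depth = queue.popleft()
        if depth == max_depth:
            continue   # don't expand beyond maxTraversalDepth hops
        for nxt in neighbors.get(fid, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return seen - set(seed_ids)   # connected facts discovered beyond the seeds

traverse([("a", "b"), ("b", "c"), ("c", "d")], ["a"], max_depth=2)
# {"b", "c"}: "d" is 3 hops away, beyond the default depth
```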


Supersession (contradiction resolution)

When a new fact contradicts or updates an old one:

→ Full guide: CONFLICTING-MEMORIES.md

Automatic (classify-before-write)

If store.classifyBeforeWrite is enabled:

  1. Find similar facts — embed the new text, search LanceDB + SQLite for similar existing facts
  2. LLM classification — send the new fact + similar existing facts to a cheap LLM. It decides:
    • ADD — new information, store alongside existing
    • UPDATE — new version of an existing fact; supersede the old one
    • DELETE — retraction of an existing fact; mark it as superseded
    • NOOP — already known; don’t store
  3. On UPDATE: the old fact gets superseded_at, superseded_by, and valid_until set. The new fact gets supersedes_id and valid_from. Search filters exclude superseded facts by default.
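The two writes performed on UPDATE (step 3) can be shown with sqlite3. The helper name and the pared-down schema are illustrative; the column updates match the description above:

```python
import sqlite3
import time
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE facts (
  id TEXT PRIMARY KEY, text TEXT,
  valid_from INTEGER, valid_until INTEGER,
  superseded_at INTEGER, superseded_by TEXT, supersedes_id TEXT)""")

def supersede(conn, old_id: str, new_text: str) -> str:
    # Close out the old fact and link the new fact to it
    now = int(time.time())
    new_id = str(uuid.uuid4())
    conn.execute(
        "UPDATE facts SET superseded_at=?, superseded_by=?, valid_until=? WHERE id=?",
        (now, new_id, now, old_id))
    conn.execute(
        "INSERT INTO facts (id, text, valid_from, supersedes_id) VALUES (?,?,?,?)",
        (new_id, new_text, now, old_id))
    return new_id

conn.execute("INSERT INTO facts (id, text, valid_from) VALUES ('old', 'Use npm', 0)")
new_id = supersede(conn, "old", "Use pnpm")
# Default search filter (superseded_at IS NULL) now sees only the new fact
```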

Manual

The memory_store tool accepts a supersedes parameter (fact ID). When provided, the specified fact is marked as superseded and the new fact is linked. The CLI supports the same: hybrid-mem store --text "..." --supersedes <fact-id>.

Bi-temporal queries

Every fact has valid_from and valid_until (epoch seconds). This enables point-in-time queries:

# What did we know as of January 15?
openclaw hybrid-mem search "database" --as-of 2026-01-15

The search adds: AND valid_from <= @asOf AND (valid_until IS NULL OR valid_until > @asOf).
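The as-of predicate translates directly into a small filter. Combined with the supersession example above, this is what lets an old fact like “Use npm” remain queryable for dates before it was replaced:

```python
def visible_as_of(fact: dict, as_of: int) -> bool:
    # Mirror of: valid_from <= @asOf AND (valid_until IS NULL OR valid_until > @asOf)
    return fact["valid_from"] <= as_of and (
        fact["valid_until"] is None or fact["valid_until"] > as_of)

history = [
    {"text": "Use npm",  "valid_from": 100, "valid_until": 200},
    {"text": "Use pnpm", "valid_from": 200, "valid_until": None},
]
then = [f["text"] for f in history if visible_as_of(f, 150)]   # ["Use npm"]
now = [f["text"] for f in history if visible_as_of(f, 250)]    # ["Use pnpm"]
```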


File-based memory (memorySearch)

Separate from the plugin, but part of the hybrid system:

What it is

memorySearch is an OpenClaw built-in feature that indexes all memory/**/*.md files under your workspace. It provides semantic search over your file-based knowledge.

How it works

  1. Indexing — on session start (and on file watch), memorySearch reads all .md files under memory/, chunks them (500 tokens, 50 overlap), and stores chunks in its own SQLite + vector index.
  2. Search — hybrid BM25 + vector search over the chunks. Results include file path, section, and matching text.
  3. Automatic — happens transparently when the agent needs information. No explicit action required.

How it differs from the plugin

| | memory-hybrid plugin | memorySearch |
|---|---|---|
| Data | Individual facts (1 sentence – 1 paragraph) | Whole markdown files (any size) |
| Storage | Plugin’s SQLite + LanceDB | OpenClaw’s built-in index |
| Write | memory_store, auto-capture, CLI | Manual file editing, agent file writes |
| Search | Auto-recall, memory_recall, lookup | Automatic on session start, on search |
| Best for | Isolated facts, preferences, decisions | Structured docs, project state, reference data |
| Loaded | Auto-injected each turn (top N by relevance) | On-demand (when query matches a chunk) |

When to use which

  • Small, isolated fact (“User’s timezone is CET”) → memory_store (plugin)
  • Structured reference doc (API endpoints, device list) → memory/technical/api.md (file)
  • Project status with roadmap → memory/projects/project.md (file)
  • Decision with rationale → memory_store (auto-captured) + memory/decisions/2026-02.md (file)

Both systems work together: the agent gets auto-recalled facts in context plus can search files when it needs deeper information.


Deduplication

Four levels prevent duplicate facts:

1. Exact text match

Before storing, checks: SELECT id FROM facts WHERE text = ? LIMIT 1. If found, the store is skipped.

2. Fuzzy dedup (optional)

When store.fuzzyDedupe is enabled, text is normalized (trim, collapse whitespace, lowercase) and SHA-256 hashed. If an existing fact has the same hash, the store is skipped. Catches near-identical rephrasing.
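The normalize-then-hash step maps cleanly onto the stdlib; this matches the normalized_hash column described in the schema (trim, collapse whitespace, lowercase, SHA-256):

```python
import hashlib
import re

def normalized_hash(text: str) -> str:
    # Trim, collapse runs of whitespace, lowercase, then SHA-256
    normalized = re.sub(r"\s+", " ", text.strip()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

# Near-identical rephrasings collapse to the same hash
normalized_hash("User prefers  dark mode ") == normalized_hash("user prefers dark mode")
```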

3. Vector dedup

Before adding to LanceDB, checks if a very similar vector already exists: hasDuplicate(vector, threshold=0.95). If found, the LanceDB write is skipped (SQLite still stores the fact for structured queries).

4. Classify-before-write (optional)

The most sophisticated level: asks an LLM whether the new fact is truly new (ADD), updates an existing fact (UPDATE), retracts one (DELETE), or is already known (NOOP). See the Supersession section above.


