Search Improvements — RRF, ingest-files, query expansion
GitHub issue #33 adds three improvements to hybrid memory search:
- Reciprocal Rank Fusion (RRF): better merging of SQLite FTS5 and LanceDB vector results.
- ingest-files: index workspace markdown (skills, TOOLS.md, etc.) as searchable facts.
- Query expansion: optional LLM-generated hypothetical answer or expanded query before embedding (opt-in; config: queryExpansion.enabled). The legacy config search.hydeEnabled is deprecated.
Reciprocal Rank Fusion (RRF)
RRF is the core search improvement in this release. It replaces a naive score-based merge with a rank-based algorithm that correctly combines BM25 keyword scores and cosine similarity scores — two metrics that are otherwise not comparable.
Search pipeline
                     ┌─────────────────────────┐
                     │          query          │
                     └────────────┬────────────┘
                                  │
            ┌─────────────────────┴─────────────────────┐
            ▼                                           ▼
┌───────────────────────┐                   ┌───────────────────────┐
│  SQLite FTS5 (BM25)   │                   │ LanceDB (cosine sim)  │
│    keyword search     │                   │     vector search     │
└───────────┬───────────┘                   └───────────┬───────────┘
            │                                           │
            └─────────────────────┬─────────────────────┘
                                  │
                     ┌────────────▼────────────┐
                     │     RRF score merge     │
                     │ score = Σ 1/(k + rank)  │
                     └────────────┬────────────┘
                                  │
                     ┌────────────▼────────────┐
                     │    Dedup (ID + text)    │
                     │    SQLite wins ties     │
                     └────────────┬────────────┘
                                  │
                     ┌────────────▼────────────┐
                     │    Superseded filter    │
                     │ (via SupersededProvider)│
                     └────────────┬────────────┘
                                  │
                     ┌────────────▼────────────┐
                     │      Scope filter       │
                     │  (user/agent/session)   │
                     └────────────┬────────────┘
                                  │
                     ┌────────────▼────────────┐
                     │         results         │
                     └─────────────────────────┘
Problem
The previous merge logic combined SQLite BM25 scores and LanceDB cosine similarity by sorting on raw score. Those metrics are on incompatible scales (e.g. BM25 ~8.5 vs cosine ~0.82), so mixing them produced poor ranking.
Solution
Reciprocal Rank Fusion (Cormack et al., 2009) uses ranks instead of raw scores:
- Rank SQLite results by BM25 score descending (best = rank 1).
- Rank LanceDB results by cosine similarity descending.
- For each unique fact: rrf_score = Σ 1/(k + rank) over all lists it appears in.
- Sort by rrf_score descending.
Facts that rank well in both keyword and semantic search get higher RRF scores and rise to the top. No score normalization needed.
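The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the plugin's code: the function name rrf_merge and the sample fact IDs are hypothetical, and each input list is assumed to be sorted best-first already.

```python
from collections import defaultdict


def rrf_merge(ranked_lists, k=60):
    """Fuse best-first lists of fact IDs with Reciprocal Rank Fusion."""
    scores = defaultdict(float)
    for ranked in ranked_lists:
        # Ranks start at 1, so the best hit in a list contributes 1 / (k + 1).
        for rank, fact_id in enumerate(ranked, start=1):
            scores[fact_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID to keep output stable.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


bm25_ranked = ["f1", "f2", "f3"]    # hypothetical SQLite FTS5 order
vector_ranked = ["f3", "f1", "f4"]  # hypothetical LanceDB order
fused = rrf_merge([bm25_ranked, vector_ranked])
print([fact_id for fact_id, _ in fused])  # ['f1', 'f3', 'f2', 'f4']
```

Here f1 and f3 appear in both lists, so they outrank f2 and f4, which each appear in only one.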
Default behaviour
RRF is always on. No config change required. The constant k=60 is standard in IR literature.
Deduplication strategy
Before RRF scores are computed, results from both backends go through a superseded-text filter followed by two dedup passes:
- Superseded text filtering: before dedup, results whose text appears in SupersededProvider.getSupersededTexts() are discarded. This prevents old, replaced facts from surfacing even when they still match semantically. The SupersededProvider is an optional interface backed by FactsDB; if it is not provided, no superseded filtering is applied.
- ID-based dedup: if the same fact ID appears in both SQLite and LanceDB results, only the first occurrence is kept. SQLite results are processed first, so SQLite wins when both backends return the same fact.
- Text-based dedup (case-insensitive): if two results have identical text (ignoring case), only the first occurrence is kept. Again, SQLite is processed first, so it wins ties.
This means SQLite (BM25) results have priority in dedup, but both backends still contribute to the RRF score calculation. A fact that only appears in LanceDB results will have its LanceDB rank contribute to its RRF score normally.
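A rough sketch of the filter and the two dedup passes follows. The dict keys "id" and "text", the function names, and the exact-match comparison against superseded texts are all assumptions made for illustration. The sketch also leaves out how each backend's ranks are carried forward into the RRF calculation.

```python
def dedup_first(results, key):
    """Keep only the first result for each distinct key(result)."""
    seen, kept = set(), []
    for result in results:
        marker = key(result)
        if marker not in seen:
            seen.add(marker)
            kept.append(result)
    return kept


def filter_and_dedup(sqlite_results, lance_results, superseded_texts=None):
    """Drop superseded texts, then dedup by ID and by case-insensitive text."""
    superseded = set(superseded_texts or ())  # no provider -> nothing filtered
    # SQLite results come first, so they win whenever both backends collide.
    combined = [
        r for r in [*sqlite_results, *lance_results]
        if r["text"] not in superseded
    ]
    by_id = dedup_first(combined, key=lambda r: r["id"])
    return dedup_first(by_id, key=lambda r: r["text"].lower())
```

Running the two passes separately, rather than in one combined loop, matters: a result dropped in the ID pass should not be able to reappear because a later text check happened to skip its sibling.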
RRF k-parameter tuning
The k constant controls how sensitive RRF is to rank differences between results.
What k controls: Each result’s RRF contribution is 1 / (k + rank). The constant k sets a floor that dampens the advantage of top-ranked results. With k=60, the difference between rank 1 and rank 2 is small (1/61 vs 1/62). With k=1, rank 1 dominates (1/2 vs 1/3).
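To make the dampening concrete, the snippet below computes how much the contribution drops between rank 1 and rank 2 for the two values of k mentioned above. The helper names are hypothetical; only the 1 / (k + rank) formula comes from this document.

```python
def rrf_contribution(rank, k):
    """Contribution of a single list to a result's RRF score."""
    return 1.0 / (k + rank)


def top_gap(k):
    """Relative drop in contribution when moving from rank 1 to rank 2."""
    first, second = rrf_contribution(1, k), rrf_contribution(2, k)
    return (first - second) / first


for k in (1, 60):
    print(f"k={k}: rank 1 -> 2 drops {top_gap(k):.1%}")
# k=1: rank 1 -> 2 drops 33.3%
# k=60: rank 1 -> 2 drops 1.6%
```

The relative drop simplifies to 1 / (k + 2), which is why larger k values flatten the ranking.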
Symptoms of k too high (e.g. k > 120):
- All results score very similarly — ranking feels flat.
- Top keyword hits and top vector hits are not clearly distinguished from mid-list results.
- You see marginally relevant facts competing equally with highly relevant ones.
Symptoms of k too low (e.g. k < 10):
- Top-ranked results dominate aggressively.
- Long-tail results (ranked 5+) are effectively buried, even when they appear in both lists.
- A result that ranks #1 in only one backend can outscore a result that appears in both backends. This happens once the shared rank is deeper than k + 2 (rank 4 or deeper for k=1, rank 8 or deeper for k=5).
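The last symptom can be checked with exact arithmetic. The sketch below finds the shallowest shared rank at which a result present in both lists loses to a result that is #1 in only one list. The function name is hypothetical, and k is assumed to be a non-negative integer.

```python
from fractions import Fraction


def first_rank_beaten(k, max_rank=1000):
    """Smallest shared rank that a single-list #1 hit outscores, or None."""
    single = Fraction(1, k + 1)  # appears once, at rank 1
    for rank in range(1, max_rank + 1):
        both = Fraction(2, k + rank)  # appears in both lists at `rank`
        if single > both:
            return rank
    return None


for k in (1, 5, 60):
    print(k, first_rank_beaten(k))
# 1 4
# 5 8
# 60 63
```

With the default k=60, a result would need to sit at rank 63 or deeper in both lists before a single-list #1 hit overtakes it, so agreement between backends dominates in practice.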
When to change k: For most users: never. The default k=60 is well-established in the information retrieval literature and works well across a wide range of query types and corpus sizes. Consider tuning only if you have a large corpus (>10 000 facts) and have measured ranking quality systematically. The k parameter is exposed via the internal MergeOptions API; it is not surfaced in user config.
Impact
Internal benchmarks showed a significant improvement in recall when moving from naive score merge to RRF. The benchmark compared raw recall@10 on a fixed test set of queries against known-relevant facts; exact methodology and corpus details are not published. Treat the improvement as qualitative — the ranking quality difference is clearly observable in practice.
Note: The “+32% recall” figure previously cited here has been removed. We don’t have sufficient details about the original benchmark methodology to cite it with confidence.
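For readers who want to measure this on their own data, recall@10 can be computed as below. This is the standard definition of the metric, not the unpublished benchmark harness; the function names and input shapes are assumptions.

```python
def recall_at_k(retrieved_ids, relevant_ids, k=10):
    """Fraction of known-relevant facts found in the top-k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = relevant.intersection(retrieved_ids[:k])
    return len(hits) / len(relevant)


def mean_recall_at_k(runs, k=10):
    """Average recall@k over (retrieved_ids, relevant_ids) pairs."""
    return sum(recall_at_k(got, want, k) for got, want in runs) / len(runs)
```

Comparing mean_recall_at_k for the old merge and the RRF merge on the same query set gives a like-for-like number.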
Ingest workspace files
Overview
openclaw hybrid-mem ingest-files indexes workspace markdown files (skills, TOOLS.md, AGENTS.md, etc.) as facts. An LLM extracts key capabilities and technical knowledge from each file; facts are stored with category: technical, decayClass: stable, and tag ingest.
CLI
openclaw hybrid-mem ingest-files # Use config or defaults
openclaw hybrid-mem ingest-files --dry-run # Preview without storing
openclaw hybrid-mem ingest-files --workspace /path/to/project
openclaw hybrid-mem ingest-files --paths "skills/**/*.md,TOOLS.md,docs/api.md"
| Option | Default | Description |
|---|---|---|
| --dry-run | — | Show what would be processed without storing |
| --workspace <path> | OPENCLAW_WORKSPACE or cwd | Workspace root for glob resolution |
| --paths <globs> | Config or skills/**/*.md,TOOLS.md,AGENTS.md | Comma-separated glob patterns |
Config
{
  "ingest": {
    "paths": ["skills/**/*.md", "TOOLS.md", "AGENTS.md"],
    "chunkSize": 800,
    "overlap": 100
  }
}
| Key | Default | Description |
|---|---|---|
| paths | — | Glob patterns relative to workspace; required to enable ingest |
| chunkSize | 800 | Characters per chunk for LLM extraction |
| overlap | 100 | Overlap between chunks |
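To show how chunkSize and overlap interact, here is a minimal character-based chunker. It is a sketch under the assumption that chunks are plain fixed-size character windows; the plugin's real chunker may split on headings or sentence boundaries instead.

```python
def chunk_text(text, chunk_size=800, overlap=100):
    """Split text into chunks that each repeat the previous chunk's tail."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks
```

With the defaults, a 2,000-character file yields three chunks starting at offsets 0, 700 and 1400, so each LLM call sees 100 characters of context from the previous chunk.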
Flow
- Resolve workspace root (CLI --workspace or config/env).
- Expand globs (e.g. skills/**/*.md) into file paths.
- Read each file, chunk with overlap.
- Send chunks to LLM with ingest-files prompt; parse JSONL facts.
- Dedupe against existing memory (text + embedding similarity).
- Store new facts with source: "ingest", decayClass: "stable", tags include ingest.
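The parse-and-tag part of this flow might look like the sketch below. The JSONL record shape is not specified here, so the "text" and "tags" field names are assumptions, as is the choice to skip malformed lines silently. The stored values (source, decayClass, category, ingest tag) are the ones stated in this section.

```python
import json


def parse_jsonl_facts(llm_output):
    """Parse one JSON object per line, skipping blank or malformed lines."""
    facts = []
    for line in llm_output.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate stray prose around the JSONL payload
        if isinstance(record, dict) and record.get("text"):
            facts.append({
                **record,
                "source": "ingest",
                "category": "technical",
                "decayClass": "stable",
                # Keep any tags the model produced and always add "ingest".
                "tags": sorted(set(record.get("tags", [])) | {"ingest"}),
            })
    return facts
```

Dedup against existing memory would run on the returned list before anything is stored.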
When to run
- After adding or updating skills/, TOOLS.md, or other docs.
- Periodically (e.g. weekly) to refresh indexed content.
- Re-runs are safe: duplicates are skipped by existing dedup.
Impact
Indexing a skills/ folder can improve recall by roughly 10% in benchmarks by making capability docs searchable during memory recall.
Query expansion (formerly HyDE)
Overview
Query expansion (config: queryExpansion.enabled) generates a short hypothetical answer or expanded query before embedding. The embedding of that text is used for vector search instead of the raw query. This can improve recall because the expanded text is often closer in embedding space to actual stored facts.
Deprecated: search.hydeEnabled and search.hydeModel are deprecated. Use queryExpansion.enabled and queryExpansion.model instead. If you still have search.hydeEnabled: true, the plugin auto-enables query expansion and logs a deprecation warning.
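The fallback described above could be modelled as follows. This is an illustrative sketch, not the plugin's loader: the function name is hypothetical, and it assumes that an explicit queryExpansion.enabled value takes precedence over the legacy key.

```python
import warnings


def resolve_query_expansion(config):
    """Return the effective queryExpansion settings for a parsed config dict."""
    search = config.get("search", {})
    resolved = dict(config.get("queryExpansion", {}))
    # Assumption: the legacy flag only applies when the new key is absent.
    if search.get("hydeEnabled") is True and "enabled" not in resolved:
        warnings.warn(
            "search.hydeEnabled is deprecated; use queryExpansion.enabled",
            DeprecationWarning,
            stacklevel=2,
        )
        resolved["enabled"] = True
    resolved.setdefault("enabled", False)  # query expansion is opt-in
    return resolved
```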
Config
{
  "queryExpansion": {
    "enabled": true,
    "model": "openai/gpt-4.1-nano",
    "timeoutMs": 15000
  }
}
| Key | Default | Description |
|---|---|---|
| enabled | false | Enable query expansion for vector search |
| model | (nano tier) | Model for expansion; when omitted, uses llm.nano[0] or legacy default |
| timeoutMs | 15000 (25s when migrating from HyDE) | Timeout for expansion call in ms. Raised to 15s in #339 to accommodate thinking models (e.g. Gemini 2.5 Flash). A minimum floor of 10000ms is enforced (#384); values explicitly set below that are silently raised. Set to 0 or a negative value to remove the timeout entirely. |
See CONFIGURATION.md for the full queryExpansion block.
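The timeoutMs rules in the table can be summarised in a small function. This is a sketch of the documented behaviour only; the names are hypothetical, and it does not model the 25s value used when migrating from HyDE.

```python
DEFAULT_TIMEOUT_MS = 15_000
MIN_TIMEOUT_MS = 10_000


def effective_timeout_ms(configured=None):
    """Return the timeout to apply in ms, or None when it is disabled."""
    if configured is None:
        return DEFAULT_TIMEOUT_MS
    if configured <= 0:
        return None  # 0 or a negative value removes the timeout entirely
    # Positive values below the floor are raised to it without an error.
    return max(configured, MIN_TIMEOUT_MS)
```

For example, a configured value of 5000 results in a 10000 ms timeout, while 30000 is used unchanged.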
Where it applies
- memory_recall tool
- Auto-recall (injection at session start)
CLI hybrid-mem search does not use query expansion (no LLM available in that context).
Trade-offs
- Pros: Can improve recall (~+5–8% in some benchmarks).
- Cons: Extra LLM call per search; adds latency and API cost.
Query expansion is off by default. Enable when the recall gain justifies the extra cost.
References
- Cormack, G. V., Clarke, C. L. A., and Büttcher, S. (2009): Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods. SIGIR '09.
- r/openclaw: “How I built a memory system that actually works”