Write-Ahead Log (WAL) for Crash Resilience
Overview
The Write-Ahead Log (WAL) feature provides crash resilience for the OpenClaw Hybrid Memory system. It ensures that memory operations (decisions, user preferences, facts) are not lost if the agent crashes, times out, or the session is killed during generation.
Problem
Previously, memory updates happened asynchronously or alongside the response generation. If the agent crashed during generation, critical context could be lost, including:
- User decisions and preferences
- Important facts discovered during conversation
- Entity relationships and structured data
- Auto-captured memories from long-running tasks
Solution
The WAL implementation follows a pre-flight commit pattern:
- Before Storage: Write pending memory operations to a durable WAL file
- Commit: Store the memory to SQLite and LanceDB
- Cleanup: Remove the WAL entry after successful commit
- Recovery: On startup, replay any uncommitted operations from the WAL
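The write path above can be sketched as follows. This is a minimal illustration, not the plugin's code: `PendingOp`, `InMemoryWal`, and `storeWithWal` are hypothetical names, and the `commit` callback stands in for the SQLite and LanceDB writes.

```typescript
// Hypothetical types; the plugin's real WAL and storage APIs are not shown in this document.
interface PendingOp {
  id: string;
  text: string;
}

class InMemoryWal {
  private entries = new Map<string, PendingOp>();

  append(op: PendingOp): void {
    this.entries.set(op.id, op);
  }

  remove(id: string): void {
    this.entries.delete(id);
  }

  pending(): PendingOp[] {
    return Array.from(this.entries.values());
  }
}

// commit stands in for the SQLite + LanceDB writes.
function storeWithWal(
  wal: InMemoryWal,
  commit: (op: PendingOp) => void,
  op: PendingOp,
): void {
  wal.append(op); // 1. record the operation before touching storage
  commit(op); // 2. commit; if this throws, the WAL entry stays behind
  wal.remove(op.id); // 3. clean up only after a successful commit
}

const wal = new InMemoryWal();
const stored: string[] = [];

// Successful store: nothing is left pending afterwards.
storeWithWal(wal, (op) => stored.push(op.text), { id: "a", text: "prefers dark mode" });

// Simulated crash during commit: the entry for "b" remains in the WAL.
try {
  storeWithWal(
    wal,
    () => {
      throw new Error("simulated crash");
    },
    { id: "b", text: "uses pnpm" },
  );
} catch {
  // On a real crash the process would exit here; recovery runs on the next startup.
}

console.log(wal.pending().map((op) => op.id));
```

The ordering is the point of the pattern: because cleanup happens last, any interruption between steps 1 and 3 leaves an entry that recovery can find.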
Architecture
WAL Entry Structure
type WALEntry = {
id: string; // Unique identifier for this operation
timestamp: number; // Unix timestamp (ms) when ready to commit (after embedding)
operation: "store" | "delete" | "update"; // delete operations are skipped during recovery
data: {
text: string;
category?: string;
importance?: number;
entity?: string | null;
key?: string | null;
value?: string | null;
source?: string;
decayClass?: DecayClass;
summary?: string | null;
tags?: string[];
vector?: number[]; // Computed embedding (included if generation succeeded)
};
};
File Format
The WAL is stored as a JSON array in ~/.openclaw/memory/memory.wal (or custom path via config). Each entry represents a pending operation that has not yet been confirmed as committed.
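Reading a file in this format could look like the sketch below. `readWal` and `WalRecord` are hypothetical names, and returning an empty list for a missing or malformed file is an assumption made for the example, not documented plugin behavior.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Trimmed-down entry shape; see the WALEntry type above for the full set of fields.
interface WalRecord {
  id: string;
  timestamp: number;
  operation: "store" | "delete" | "update";
  data: { text: string };
}

// Default location as given in this document.
const DEFAULT_WAL_PATH = path.join(os.homedir(), ".openclaw", "memory", "memory.wal");

// Hypothetical helper: returns [] for a missing, empty, or malformed file.
function readWal(walPath: string = DEFAULT_WAL_PATH): WalRecord[] {
  let raw: string;
  try {
    raw = fs.readFileSync(walPath, "utf8");
  } catch {
    return []; // no WAL file means nothing is pending
  }
  try {
    const parsed: unknown = JSON.parse(raw);
    return Array.isArray(parsed) ? (parsed as WalRecord[]) : [];
  } catch {
    return []; // assumption: treat a corrupted file as empty rather than throwing
  }
}

console.log(`pending operations: ${readWal().length}`);
```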
Recovery Process
On plugin startup, the system:
- Reads all entries from the WAL file
- Filters out stale entries (older than maxAge, default 5 minutes)
- For each valid entry:
  - Skips delete operations (cannot be reliably replayed)
  - Checks if the memory already exists (idempotency)
  - If not, commits it to SQLite and LanceDB
  - Removes the entry from the WAL
- Logs recovery statistics
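The recovery steps above can be sketched as a single pass over the entries. This is an illustration under stated assumptions: `recoverWal`, `RecoveryStats`, and the `exists` and `commit` callbacks are hypothetical stand-ins for the plugin's deduplication check and its SQLite and LanceDB writes.

```typescript
type Operation = "store" | "delete" | "update";

interface WalRecord {
  id: string;
  timestamp: number; // Unix ms
  operation: Operation;
  data: { text: string };
}

interface RecoveryStats {
  pruned: number; // older than maxAge
  skipped: number; // delete operations and already-stored memories
  recovered: number;
  failed: number;
}

// exists and commit are hypothetical callbacks standing in for the
// deduplication check and the SQLite + LanceDB writes.
function recoverWal(
  entries: WalRecord[],
  now: number,
  maxAgeMs: number,
  exists: (entry: WalRecord) => boolean,
  commit: (entry: WalRecord) => void,
): RecoveryStats {
  const stats: RecoveryStats = { pruned: 0, skipped: 0, recovered: 0, failed: 0 };
  for (const entry of entries) {
    if (now - entry.timestamp > maxAgeMs) {
      stats.pruned++; // stale: discard without replaying
      continue;
    }
    if (entry.operation === "delete" || exists(entry)) {
      stats.skipped++; // not replayable, or already committed before the crash
      continue;
    }
    try {
      commit(entry);
      stats.recovered++;
    } catch {
      stats.failed++; // logged and skipped in the real system
    }
  }
  return stats;
}
```

Passing `now` in as a parameter, rather than calling `Date.now()` inside the loop, keeps the staleness check deterministic and easy to test.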
Configuration
Enable/Disable WAL
WAL is enabled by default for crash resilience. To disable:
{
"plugins": {
"entries": {
"openclaw-hybrid-memory": {
"config": {
"wal": {
"enabled": false
}
}
}
}
}
}
Custom WAL Path
{
"wal": {
"enabled": true,
"walPath": "/custom/path/to/memory.wal"
}
}
Maximum Age
Control how long WAL entries are considered valid (default: 5 minutes):
{
"wal": {
"enabled": true,
"maxAge": 300000 // milliseconds (5 minutes)
}
}
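Taken together, the three options and their documented defaults could be resolved as in the sketch below. `WalConfig` and `resolveWalConfig` are hypothetical names; only the option names (`enabled`, `walPath`, `maxAge`) and the default values come from this document.

```typescript
import * as os from "node:os";
import * as path from "node:path";

// Option names as documented above; all are optional in user config.
interface WalConfig {
  enabled?: boolean;
  walPath?: string;
  maxAge?: number; // milliseconds
}

const DEFAULT_MAX_AGE_MS = 5 * 60 * 1000; // 5 minutes

// Hypothetical helper that fills in the documented defaults.
function resolveWalConfig(cfg: WalConfig = {}): Required<WalConfig> {
  return {
    enabled: cfg.enabled ?? true,
    walPath: cfg.walPath ?? path.join(os.homedir(), ".openclaw", "memory", "memory.wal"),
    maxAge: cfg.maxAge ?? DEFAULT_MAX_AGE_MS,
  };
}

console.log(resolveWalConfig({ maxAge: 60000 }));
```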
Behavior
Normal Operation
- User/agent triggers memory storage (via memory_store tool or auto-capture)
- System generates embedding vector (async, may fail or be slow)
- WAL write (synchronous, durable) - timestamp captured here
- Commit to SQLite (synchronous)
- Commit to LanceDB (async)
- WAL cleanup (remove committed entry)
Crash Scenario
- User/agent triggers memory storage
- System generates embedding vector (async, may fail or be slow)
- WAL write (synchronous, durable) ✓ - timestamp captured after embedding completes
- Commit to SQLite starts…
- CRASH (agent timeout, kill signal, system failure)
- On next startup:
  - WAL recovery detects uncommitted entry
  - Replays the operation (vector already computed)
  - Memory is restored
Idempotency
The recovery process is idempotent:
- Before replaying a WAL entry, the system checks if the memory already exists (via fuzzy deduplication)
- If the memory was partially committed before the crash, it won’t be duplicated
- This handles cases where the crash happened after SQLite commit but before WAL cleanup
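This document does not specify how the fuzzy deduplication works. As one possible illustration only, the sketch below treats two memories as the same when their text matches after normalizing case, punctuation, and whitespace. `normalize` and `alreadyStored` are hypothetical names, and the plugin's real check may use a different technique.

```typescript
// Assumption: "fuzzy" here means equality after normalization.
// The plugin's actual deduplication logic is not shown in this document.
function normalize(text: string): string {
  return text
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, "") // drop punctuation
    .replace(/\s+/g, " ") // collapse runs of whitespace
    .trim();
}

function alreadyStored(candidate: string, existingTexts: string[]): boolean {
  const key = normalize(candidate);
  return existingTexts.some((text) => normalize(text) === key);
}

const existing = ["user prefers dark mode"];
console.log(alreadyStored("User prefers  dark mode.", existing));
console.log(alreadyStored("User prefers light mode", existing));
```

With a check of this kind in front of the commit, replaying an entry whose SQLite write already succeeded becomes a no-op instead of a duplicate.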
Performance Impact
Write Path
- Synchronous WAL write: ~1-5ms per operation (local file I/O with atomic rename)
- Minimal overhead: JSON array rewrite per operation (atomic temp file + rename for crash safety)
- No network calls: WAL is purely local
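The atomic temp file plus rename technique mentioned above can be sketched with Node's synchronous file APIs. `writeWalAtomic` and the temp-file naming are hypothetical; the sketch shows the general technique rather than the plugin's implementation.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Rewrites the whole JSON array: write to a temp file, flush, then rename over the target.
function writeWalAtomic(walPath: string, entries: unknown[]): void {
  fs.mkdirSync(path.dirname(walPath), { recursive: true });
  const tmpPath = `${walPath}.${process.pid}.tmp`; // hypothetical temp-file naming
  const fd = fs.openSync(tmpPath, "w");
  try {
    fs.writeSync(fd, JSON.stringify(entries));
    fs.fsyncSync(fd); // flush to disk before the rename makes the data visible
  } finally {
    fs.closeSync(fd);
  }
  // Readers see either the old file or the new one, never a partial write.
  // The temp file lives in the same directory so the rename stays on one file system.
  fs.renameSync(tmpPath, walPath);
}

// Demo against a scratch directory rather than the real WAL.
const demoDir = fs.mkdtempSync(path.join(os.tmpdir(), "wal-demo-"));
const demoPath = path.join(demoDir, "memory.wal");
writeWalAtomic(demoPath, [{ id: "1", operation: "store", data: { text: "example" } }]);
console.log(fs.readFileSync(demoPath, "utf8"));
```

A crash before the rename leaves the previous WAL intact, which is why rewriting the full array per operation remains crash-safe.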
Startup
- Recovery check: ~10-50ms for typical WAL sizes (<100 entries)
- Replay: Only uncommitted operations are replayed (with embedding regeneration if needed)
- Pruning: Stale entries are automatically removed
Storage
- File size: ~1-2KB per entry (including embedding vector)
- Typical size: <100KB for normal operation
- Auto-cleanup: Entries are removed after successful commit
Logging
The WAL system logs the following events:
Startup
memory-hybrid: WAL enabled (/home/user/.openclaw/memory/memory.wal)
memory-hybrid: WAL recovery starting — found 3 pending operation(s)
memory-hybrid: WAL recovery completed — recovered 3 operation(s), 0 failed
memory-hybrid: WAL pruned 2 stale entries
Runtime
memory-hybrid: WAL write failed: <error>
memory-hybrid: WAL cleanup failed: <error>
memory-hybrid: auto-capture WAL write failed: <error>
Errors
WAL failures are logged as warnings and do not block memory operations. The system degrades gracefully:
- If WAL write fails, the memory is still stored (but not crash-protected)
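- If WAL cleanup fails, the leftover entry is handled on next startup: it is pruned if stale, or recognized as already committed by the idempotency check and removed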
- If recovery fails for an entry, it’s logged and skipped
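The degradation rules above amount to wrapping the WAL steps in their own error handling so that only a failed commit can fail the operation. In this sketch, `storeMemory` and its callback parameters are hypothetical; the warning text mirrors the runtime log lines listed earlier.

```typescript
// Hypothetical wrapper: WAL failures produce warnings, commit failures still propagate.
function storeMemory(
  walWrite: () => void,
  commit: () => void,
  walCleanup: () => void,
  warn: (message: string) => void,
): void {
  let walWritten = true;
  try {
    walWrite();
  } catch (err) {
    walWritten = false; // continue without crash protection
    warn(`memory-hybrid: WAL write failed: ${String(err)}`);
  }

  commit(); // not wrapped: a failed commit is a real error

  if (walWritten) {
    try {
      walCleanup();
    } catch (err) {
      warn(`memory-hybrid: WAL cleanup failed: ${String(err)}`);
    }
  }
}

const warnings: string[] = [];
let commitCount = 0;
storeMemory(
  () => {
    throw new Error("disk full");
  },
  () => {
    commitCount++;
  },
  () => {},
  (message) => warnings.push(message),
);
console.log(commitCount, warnings);
```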
Testing
Simulated Crash Test
To test WAL recovery:
- Enable WAL in config
- Store a memory via memory_store tool
- Kill the OpenClaw process immediately (before it completes)
- Restart OpenClaw
- Check logs for “WAL recovery” messages
- Verify the memory was recovered via memory_recall
Manual WAL Inspection
jq . ~/.openclaw/memory/memory.wal
This shows all pending operations in the WAL.
Force Recovery
To test recovery without a crash:
- Manually create a WAL entry in memory.wal
- Restart OpenClaw
- Check logs for recovery messages
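A small script can create the entry for the first step. The sketch below is hedged: `seedWalEntry` is a hypothetical helper, the `category` and `source` values are made up for illustration, and it assumes any unique string is accepted as an `id`. The entry omits `vector`, relying on the embedding regeneration described under Performance Impact.

```typescript
import { randomUUID } from "node:crypto";
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Appends one "store" entry to the JSON array at walPath and returns its id.
function seedWalEntry(walPath: string, text: string): string {
  const entry = {
    id: randomUUID(), // assumption: any unique string works as an id
    timestamp: Date.now(), // must still be within maxAge at restart, or the entry is pruned
    operation: "store" as const,
    data: { text, category: "fact", source: "manual-test" }, // illustrative values
  };

  let existing: unknown[] = [];
  if (fs.existsSync(walPath)) {
    const parsed: unknown = JSON.parse(fs.readFileSync(walPath, "utf8"));
    if (Array.isArray(parsed)) existing = parsed;
  }

  fs.mkdirSync(path.dirname(walPath), { recursive: true });
  fs.writeFileSync(walPath, JSON.stringify([...existing, entry], null, 2));
  return entry.id;
}

// Try it on a scratch file first; point it at the real WAL only while OpenClaw is stopped.
const scratchPath = path.join(fs.mkdtempSync(path.join(os.tmpdir(), "wal-seed-")), "memory.wal");
const seededId = seedWalEntry(scratchPath, "WAL recovery smoke test");
console.log(`seeded ${seededId} in ${scratchPath}`);
```

Because stale entries are pruned, restart OpenClaw within the `maxAge` window (5 minutes by default) after seeding.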
Comparison to SQLite WAL Mode
This is not the same as SQLite’s built-in WAL mode (though they serve similar purposes):
- SQLite WAL: Database-level crash recovery for transactions
- Memory WAL: Application-level crash recovery for the entire memory pipeline (SQLite + LanceDB + embeddings)
The Memory WAL protects against:
- Crashes during storage operations (WAL is written after embedding generation but before storage commits)
- Crashes during LanceDB write (after SQLite write)
- Crashes during multi-step operations (though credentials vault is currently not WAL-protected)
Note on timing: The WAL timestamp is captured AFTER embedding generation completes. This prevents slow embedding operations from causing WAL entries to age out and be pruned before they can be committed. If embedding fails, no WAL entry is created (since there’s nothing to commit).
Limitations
- Not a distributed log: WAL is local to the machine running OpenClaw
- Not a transaction log: Each operation is independent (no multi-operation transactions)
- Best-effort recovery: If the WAL file is corrupted, recovery may fail
- Stale entries: Entries older than maxAge are discarded (not recoverable)
Future Enhancements
Potential improvements for future versions:
- Batched writes: Group multiple operations into a single WAL write
- Compression: Reduce WAL file size for large embeddings
- Rotation: Archive old WAL files instead of pruning
- Distributed WAL: Sync WAL across multiple instances
- Transaction support: Multi-operation atomic commits
Related Features
- Auto-capture: Automatically protected by WAL
- Credentials vault: Currently not WAL-protected; credentials are stored via a separate mechanism that returns early in the memory_store path
- Session distillation: Bulk imports can use WAL for resilience
Troubleshooting
WAL file growing too large
- Check for failed operations that aren’t being cleaned up
- Verify maxAge is set appropriately
- Manually clear the WAL:
rm ~/.openclaw/memory/memory.wal
Recovery not working
- Check WAL file exists and is readable
- Verify wal.enabled is true in config
- Check logs for recovery errors
- Ensure entries are within maxAge window
Performance issues
- If WAL writes are slow, check disk I/O
- Consider moving WAL to faster storage (SSD)
- Reduce maxAge to keep WAL smaller
References
- Elite Longterm Memory - Inspiration for WAL pattern
- SQLite WAL Mode - Database-level WAL
- Write-Ahead Logging - General concept
Related docs
- README — Project overview and all docs
- ARCHITECTURE.md — System architecture overview
- TROUBLESHOOTING.md — Common issues and fixes