Issue #1551: enhance diagnostics: capture native RSS / fd evidence on memory pressure
Status: WIP scaffold PR only — implementation pending.
Source Issue
- Issue: #1551
- URL: https://github.com/markus-lassfolk/openclaw-hybrid-memory/issues/1551
- State: OPEN
- Priority label: priority:high
- Labels: enhancement, priority:high, observability, issue/stage/enriching
Acceptance Criteria / Issue Body
Summary
When gateway memory pressure becomes critical, diagnostics currently report RSS/heap pressure but do not capture the evidence needed to distinguish JS heap, native addon memory, SQLite handles, LanceDB/Arrow buffers, file mappings, or stale plugin generations.
Evidence observed on Maeve
Repeated diagnostics showed critical RSS pressure while snapshots were disabled:
[diagnostics/memory] memory pressure: level=critical reason=rss_threshold rssBytes=~6.6-8.1GB heapUsedBytes=~0.55-0.65GB thresholdBytes=3221225472 memoryPressureSnapshot=disabled
[diagnostics/memory] critical memory pressure snapshot disabled: diagnostics.memoryPressureSnapshot=false
Manual live inspection showed the missing context was important:
Node RSS: ~8.1 GB
JS heap: ~0.6 GB
RssAnon: ~7.5 GB
Duplicate SQLite handles: 10-11 per major DB
LanceDB directory size: ~3.9 GB
Active-task injection: 100 active tasks, 87 stale
Why this matters
The system correctly detects memory pressure, but the current log line is insufficient for root cause analysis. By the time an operator investigates manually, the process may have restarted or the interesting state may be gone.
Acceptance criteria
When memory pressure crosses the critical threshold, optionally emit a compact diagnostic bundle containing at least:
- process.memoryUsage() including rss, heapTotal, heapUsed, external, and arrayBuffers.
- Active handle/request count where available.
- /proc/self/status memory fields on Linux.
- Top file descriptor targets grouped by path, especially SQLite DB/WAL/SHM and LanceDB files.
- Plugin runtime generation / registration count.
- Number of registered tools/hooks if available from OpenClaw API.
- Hybrid-memory state summary:
  - DB paths
  - LanceDB path
  - whether LanceDB initialized
  - active readers / optimize state if accessible
  - active-task injected count and stale count
  - recall-in-flight count
- Rate-limit snapshots to avoid log spam.
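As a rough illustration of the process-level part of this bundle, the sketch below gathers process.memoryUsage(), the Vm*/Rss* fields of /proc/self/status, and a grouped count of file descriptor targets. Every name in it (MemoryPressureSnapshot, parseProcStatus, groupFdTargets, readFdTargets, collectSnapshot) is hypothetical and not an existing OpenClaw or hybrid-memory API. Plugin generation, tool/hook counts, and hybrid-memory state are left out because the issue does not say where they would come from.

```typescript
import * as fs from "node:fs";

// Hypothetical snapshot shape; field names are illustrative only.
interface MemoryPressureSnapshot {
  memoryUsage: NodeJS.MemoryUsage;
  procStatus: Record<string, number>; // kB values from /proc/self/status
  topFdTargets: Array<{ target: string; count: number }>;
}

// Keep only the Vm*/Rss* lines of /proc/self/status, which are reported in kB.
function parseProcStatus(text: string): Record<string, number> {
  const fields: Record<string, number> = {};
  for (const line of text.split("\n")) {
    const idx = line.indexOf(":");
    if (idx <= 0) continue;
    const key = line.slice(0, idx);
    const rest = line.slice(idx + 1).trim();
    if (!/^(Vm|Rss)/.test(key) || !rest.endsWith("kB")) continue;
    const value = parseInt(rest, 10);
    if (!Number.isNaN(value)) fields[key] = value;
  }
  return fields;
}

// Count how many descriptors point at each target and keep the most frequent.
function groupFdTargets(
  targets: string[],
  limit = 10,
): Array<{ target: string; count: number }> {
  const counts = new Map<string, number>();
  for (const target of targets) {
    counts.set(target, (counts.get(target) ?? 0) + 1);
  }
  return Array.from(counts, ([target, count]) => ({ target, count }))
    .sort((a, b) => b.count - a.count || a.target.localeCompare(b.target))
    .slice(0, limit);
}

// Resolve each entry of /proc/self/fd; returns [] where /proc is unavailable.
function readFdTargets(fdDir = "/proc/self/fd"): string[] {
  const targets: string[] = [];
  let entries: string[];
  try {
    entries = fs.readdirSync(fdDir);
  } catch {
    return targets; // not Linux, or /proc not mounted
  }
  for (const entry of entries) {
    try {
      targets.push(fs.readlinkSync(`${fdDir}/${entry}`));
    } catch {
      // The descriptor was closed between readdir and readlink; skip it.
    }
  }
  return targets;
}

function collectSnapshot(): MemoryPressureSnapshot {
  let statusText = "";
  try {
    statusText = fs.readFileSync("/proc/self/status", "utf8");
  } catch {
    // Non-Linux platform: procStatus stays empty.
  }
  return {
    memoryUsage: process.memoryUsage(),
    procStatus: parseProcStatus(statusText),
    topFdTargets: groupFdTargets(readFdTargets()),
  };
}
```

Grouping by link target means duplicate SQLite handles, such as the 10-11 per major DB observed above, would show up as a single entry with a high count. The synchronous fs calls are used here for brevity; whether they are acceptable inside the gateway process is a decision for the real implementation.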
Suggested implementation
- Add a lightweight memory-pressure-snapshot helper that is safe to run from the gateway process.
- Keep the default compact text/JSON log under a max size.
- Optionally write a JSON artifact under ~/.openclaw/diagnostics/memory-pressure/ for later analysis.
- Make diagnostics.memoryPressureSnapshot=true safe enough to enable in production.
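The size cap and the optional artifact could look roughly like the following sketch. The 4096-byte cap, the file naming scheme, the 0o600 file mode, and the function names toCompactLogLine and writeArtifact are all assumptions made for illustration. Only the ~/.openclaw/diagnostics/memory-pressure/ directory comes from the issue.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Assumed cap for the inline log line; the issue only asks for "a max size".
const MAX_LOG_BYTES = 4096;

// Serialize the snapshot for the log. If it exceeds the cap, emit a small
// marker object instead of a cut-off (and therefore unparseable) JSON string.
function toCompactLogLine(
  snapshot: Record<string, unknown>,
  maxBytes = MAX_LOG_BYTES,
): string {
  const json = JSON.stringify(snapshot);
  const bytes = Buffer.byteLength(json, "utf8");
  if (bytes <= maxBytes) return json;
  return JSON.stringify({ truncated: true, originalBytes: bytes });
}

// Write the full snapshot as a JSON artifact and return the file path.
// The file name pattern (timestamp plus pid) is hypothetical.
function writeArtifact(
  snapshot: Record<string, unknown>,
  dir = path.join(os.homedir(), ".openclaw", "diagnostics", "memory-pressure"),
): string {
  fs.mkdirSync(dir, { recursive: true });
  const stamp = new Date().toISOString().replace(/[:.]/g, "-");
  const file = path.join(dir, `snapshot-${stamp}-${process.pid}.json`);
  // Restrict permissions: the snapshot may contain local file paths.
  fs.writeFileSync(file, JSON.stringify(snapshot, null, 2), { mode: 0o600 });
  return file;
}
```

Keeping the log line bounded while the artifact holds the full detail lets the two outputs serve different purposes: the log stays cheap to emit, and the artifact survives a restart for later analysis.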
Verification
- Force/trigger a low threshold in tests and verify the snapshot includes fd grouping and memory breakdown.
- Ensure repeated critical checks are deduped/cooldown-limited.
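A cooldown gate with an injectable clock is one way to make both verification points testable without waiting for real time to pass. This is a minimal sketch under that assumption: SnapshotCooldown and onPressureCheck are hypothetical names, and the treatment of rss >= threshold as "critical" is assumed rather than taken from the existing diagnostics code.

```typescript
// Hypothetical gate: lets at most one snapshot through per cooldown window
// and counts how many checks were suppressed in between.
class SnapshotCooldown {
  private readonly cooldownMs: number;
  private readonly now: () => number;
  private lastEmitMs: number | null = null;
  private suppressed = 0;

  constructor(cooldownMs: number, now: () => number = () => Date.now()) {
    this.cooldownMs = cooldownMs;
    this.now = now;
  }

  tryAcquire(): { allowed: boolean; suppressedSinceLast: number } {
    const t = this.now();
    if (this.lastEmitMs !== null && t - this.lastEmitMs < this.cooldownMs) {
      this.suppressed += 1;
      return { allowed: false, suppressedSinceLast: this.suppressed };
    }
    const suppressedSinceLast = this.suppressed;
    this.lastEmitMs = t;
    this.suppressed = 0;
    return { allowed: true, suppressedSinceLast };
  }
}

// Emit a snapshot only when RSS is at or above the threshold and the gate
// allows it. Returns true if a snapshot was emitted.
function onPressureCheck(
  rssBytes: number,
  thresholdBytes: number,
  gate: SnapshotCooldown,
  emit: () => void,
): boolean {
  if (rssBytes < thresholdBytes) return false;
  if (!gate.tryAcquire().allowed) return false;
  emit();
  return true;
}

// Usage: a 1-byte threshold forces the critical path, and a fake clock
// drives the cooldown deterministically.
let fakeNowMs = 0;
let emitted = 0;
const gate = new SnapshotCooldown(60000, () => fakeNowMs);
const emit = (): void => {
  emitted += 1;
};
onPressureCheck(process.memoryUsage().rss, 1, gate, emit); // emits
fakeNowMs = 1000;
onPressureCheck(process.memoryUsage().rss, 1, gate, emit); // suppressed
fakeNowMs = 60000;
onPressureCheck(process.memoryUsage().rss, 1, gate, emit); // emits again
```

Reporting suppressedSinceLast alongside the next emitted snapshot would show how often critical checks fired during the cooldown without adding log lines for each one.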
Stewardship Notes
This placeholder document exists to create a draft PR branch for Issue Stewardship tracking. The implementation work remains pending and should replace or extend this scaffold with the actual fix and verification evidence.