Issue #1551: enhance diagnostics: capture native RSS / fd evidence on memory pressure

Status: WIP scaffold PR only — implementation pending.

Source Issue

  • Issue: #1551
  • URL: https://github.com/markus-lassfolk/openclaw-hybrid-memory/issues/1551
  • State: OPEN
  • Priority label: priority:high
  • Labels: enhancement, priority:high, observability, issue/stage/enriching

Acceptance Criteria / Issue Body

Summary

When gateway memory pressure becomes critical, diagnostics currently report RSS/heap pressure but do not capture the evidence needed to distinguish JS heap, native addon memory, SQLite handles, LanceDB/Arrow buffers, file mappings, or stale plugin generations.

Evidence observed on Maeve

Repeated diagnostics showed critical RSS pressure while snapshots were disabled:

[diagnostics/memory] memory pressure: level=critical reason=rss_threshold rssBytes=~6.6-8.1GB heapUsedBytes=~0.55-0.65GB thresholdBytes=3221225472 memoryPressureSnapshot=disabled
[diagnostics/memory] critical memory pressure snapshot disabled: diagnostics.memoryPressureSnapshot=false

Manual live inspection showed the missing context was important:

Node RSS: ~8.1 GB
JS heap: ~0.6 GB
RssAnon: ~7.5 GB
Duplicate SQLite handles: 10-11 per major DB
LanceDB directory size: ~3.9 GB
Active-task injection: 100 active tasks, 87 stale

Why this matters

The system correctly detects memory pressure, but the current log line is insufficient for root cause analysis. By the time an operator investigates manually, the process may have restarted or the interesting state may be gone.

Acceptance criteria

When memory pressure crosses the critical threshold, optionally emit a compact diagnostic bundle containing at least:

  • process.memoryUsage() including rss, heapTotal, heapUsed, external, and arrayBuffers.
  • Active handle/request count where available.
  • /proc/self/status memory fields on Linux.
  • Top file descriptor targets grouped by path, especially SQLite DB/WAL/SHM and LanceDB files.
  • Plugin runtime generation / registration count.
  • Number of registered tools/hooks if available from OpenClaw API.
  • Hybrid-memory state summary:
    • DB paths
    • LanceDB path
    • whether LanceDB initialized
    • active readers / optimize state if accessible
    • active-task injected count and stale count
    • recall-in-flight count
  • Rate-limit snapshots to avoid log spam.
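
The first few items above (process.memoryUsage(), the /proc/self/status memory fields, and fd targets grouped by path) could be collected roughly as follows. This is a minimal sketch, not the pending implementation: every helper name (parseProcStatus, groupFdTargets, readFdTargets, collectSnapshot) is hypothetical, and the choice of /proc fields is an assumption. Only Node's fs module and process.memoryUsage() are used, and the procfs reads fall back to empty results on non-Linux hosts.

```typescript
import { readFileSync, readdirSync, readlinkSync } from "node:fs";

// All helper names below are hypothetical; none is an existing OpenClaw API.

// Assumed selection of /proc/self/status fields worth logging.
const STATUS_FIELDS = new Set([
  "VmSize", "VmRSS", "VmHWM", "VmSwap", "RssAnon", "RssFile", "RssShmem",
]);

/** Parse "Key:   <n> kB" lines from /proc/self/status text into byte counts. */
export function parseProcStatus(text: string): Record<string, number> {
  const fields: Record<string, number> = {};
  for (const line of text.split("\n")) {
    const match = /^(\w+):\s+(\d+)\s+kB$/.exec(line.trim());
    if (match && STATUS_FIELDS.has(match[1])) {
      fields[match[1]] = Number(match[2]) * 1024;
    }
  }
  return fields;
}

export interface FdGroup {
  target: string;
  count: number;
}

/** Count identical fd targets and keep the `top` most frequent (ties sorted by name). */
export function groupFdTargets(targets: string[], top: number): FdGroup[] {
  const counts = new Map<string, number>();
  for (const target of targets) {
    counts.set(target, (counts.get(target) ?? 0) + 1);
  }
  return Array.from(counts, ([target, count]) => ({ target, count }))
    .sort((a, b) =>
      b.count - a.count || (a.target < b.target ? -1 : a.target > b.target ? 1 : 0))
    .slice(0, top);
}

/** Resolve each symlink in /proc/self/fd; returns [] where procfs is unavailable. */
export function readFdTargets(): string[] {
  try {
    const targets: string[] = [];
    for (const fd of readdirSync("/proc/self/fd")) {
      try {
        targets.push(readlinkSync(`/proc/self/fd/${fd}`));
      } catch {
        // The descriptor was closed between readdir and readlink; skip it.
      }
    }
    return targets;
  } catch {
    return [];
  }
}

/** Assemble the memory and fd portion of a snapshot. */
export function collectSnapshot(topFds = 10) {
  let procStatus: Record<string, number> = {};
  try {
    procStatus = parseProcStatus(readFileSync("/proc/self/status", "utf8"));
  } catch {
    // Not Linux, or procfs is not mounted: leave the section empty.
  }
  return {
    memoryUsage: process.memoryUsage(),
    procStatus,
    topFdTargets: groupFdTargets(readFdTargets(), topFds),
  };
}
```

Grouping by resolved target is what would make duplicate SQLite handles visible: ten descriptors pointing at the same DB file collapse into one entry with count 10. The plugin generation, tool/hook counts, and hybrid-memory state are left out here because they depend on APIs this document does not describe.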

Suggested implementation

  • Add a lightweight memory-pressure-snapshot helper that is safe to run from the gateway process.
  • Keep the default compact text/JSON log under a max size.
  • Optionally write a JSON artifact under ~/.openclaw/diagnostics/memory-pressure/ for later analysis.
  • Make diagnostics.memoryPressureSnapshot=true safe enough to enable in production.
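
One way to keep the log line under a size cap and to write the optional artifact is sketched below. It is an illustration under stated assumptions: toCompactLog and writeArtifact are hypothetical names, dropping the largest top-level fields first is one possible truncation policy among several, and the artifact file-name pattern is invented for the example.

```typescript
import { Buffer } from "node:buffer";
import { mkdirSync, writeFileSync } from "node:fs";
import { join } from "node:path";

// Hypothetical helpers; the truncation policy and file naming are assumptions.

/**
 * Serialize a snapshot to one JSON line. While the line exceeds `maxBytes`,
 * drop the largest remaining top-level field and record its name under
 * `omitted`. A very long `omitted` list could itself exceed the cap; this
 * sketch does not handle that edge case.
 */
export function toCompactLog(snapshot: Record<string, unknown>, maxBytes: number): string {
  const kept: Record<string, unknown> = { ...snapshot };
  const omitted: string[] = [];
  const render = (): string =>
    JSON.stringify(omitted.length > 0 ? { ...kept, omitted } : kept);

  let text = render();
  while (Buffer.byteLength(text, "utf8") > maxBytes && Object.keys(kept).length > 0) {
    let largestKey = "";
    let largestSize = -1;
    for (const [key, value] of Object.entries(kept)) {
      // JSON.stringify yields undefined for values such as functions.
      const size = (JSON.stringify(value) as string | undefined)?.length ?? 0;
      if (size > largestSize) {
        largestKey = key;
        largestSize = size;
      }
    }
    delete kept[largestKey];
    omitted.push(largestKey);
    text = render();
  }
  return text;
}

/** Write the full snapshot as pretty-printed JSON and return the file path. */
export function writeArtifact(dir: string, snapshot: unknown, at: Date = new Date()): string {
  mkdirSync(dir, { recursive: true });
  // Colons and dots are replaced so the timestamp is a portable file name.
  const stamp = at.toISOString().replace(/[:.]/g, "-");
  const file = join(dir, `snapshot-${stamp}.json`);
  // Owner-only permissions, since fd targets can reveal local paths.
  writeFileSync(file, JSON.stringify(snapshot, null, 2), { mode: 0o600 });
  return file;
}
```

With this split, the log line stays bounded while the artifact keeps the full detail. A caller would pass the directory explicitly, for example join(homedir(), ".openclaw", "diagnostics", "memory-pressure") with homedir from node:os.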

Verification

  • In tests, force a low threshold and verify that the snapshot includes the fd grouping and memory breakdown.
  • Ensure repeated critical checks are deduplicated or limited by a cooldown.
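
The two checks above could be exercised along these lines. The monitor shape (createPressureMonitor and its injected readRssBytes, now, and collect functions) is hypothetical and stands in for whatever the real diagnostics module exposes. The point of the sketch is that an injected clock and a forced-low threshold make the cooldown behaviour testable without waiting or allocating memory.

```typescript
import { strictEqual } from "node:assert";

// Hypothetical monitor shape; the real diagnostics API may differ.
export interface MonitorDeps {
  thresholdBytes: number;
  cooldownMs: number;
  readRssBytes: () => number; // injected so a test can force pressure
  now: () => number;          // injected clock, in milliseconds
  collect: () => unknown;     // snapshot collector
}

export interface CheckResult {
  level: "ok" | "critical";
  emitted: boolean;
  suppressed: number; // critical checks skipped since the last emitted snapshot
  snapshot?: unknown;
}

export function createPressureMonitor(deps: MonitorDeps): () => CheckResult {
  let lastEmitMs = Number.NEGATIVE_INFINITY;
  let suppressed = 0;
  return () => {
    if (deps.readRssBytes() < deps.thresholdBytes) {
      return { level: "ok", emitted: false, suppressed: 0 };
    }
    const t = deps.now();
    if (t - lastEmitMs < deps.cooldownMs) {
      // Still inside the cooldown window: count the check, skip the snapshot.
      suppressed += 1;
      return { level: "critical", emitted: false, suppressed };
    }
    const result: CheckResult = {
      level: "critical",
      emitted: true,
      suppressed,
      snapshot: deps.collect(),
    };
    lastEmitMs = t;
    suppressed = 0;
    return result;
  };
}

// Test-style usage: a 1-byte threshold makes every check critical.
let clock = 0;
let collected = 0;
const check = createPressureMonitor({
  thresholdBytes: 1,
  cooldownMs: 60_000,
  readRssBytes: () => process.memoryUsage().rss,
  now: () => clock,
  collect: () => {
    collected += 1;
    return { memoryUsage: process.memoryUsage() };
  },
});

strictEqual(check().emitted, true);  // first critical check emits
clock = 30_000;
strictEqual(check().emitted, false); // inside the cooldown: suppressed
clock = 60_000;
const third = check();
strictEqual(third.emitted, true);    // cooldown elapsed: emits again
strictEqual(third.suppressed, 1);    // and reports the one skipped check
strictEqual(collected, 2);           // the collector ran only twice
```

Reporting the suppressed count with the next emitted snapshot keeps evidence that pressure persisted during the cooldown without adding log lines.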

Stewardship Notes

This placeholder document exists to create a draft PR branch for Issue Stewardship tracking. The implementation work remains pending and should replace or extend this scaffold with the actual fix and verification evidence.

