
canary: carry Memory Crystal fixes onto OpenClaw v2026.4.25 #4

Merged
parkertoddbrooks merged 10 commits into kody/v2026-4-25-base from kody/v2026-4-25-carry-memory-core
Apr 27, 2026

Conversation

@parkertoddbrooks
Member

Corrected stable-release canary for Lēsa's OpenClaw upgrade.

Base:
- v2026.4.25 commit aa36ee6

Carries:
- memory-core seedEmbeddingCache .iterate() fix
- cooperative yield during seedEmbeddingCache
- chatCompletions main-session routing
- non-streaming and streaming next-turn queue
- runtime-config boundary for the queue check

Not carried:
- Yuanbao catalog pin, because that was upstream-main only
- broad chat final fallback, because upstream openclaw#71293 replaced it

Local validation already run before this PR:
- pnpm install
- focused oxfmt check
- focused tests
- pnpm tsgo
- pnpm build
- npm pack dry-run
- production-size read-only seed canary: 435,266 rows / 8.09 GiB streamed in 27.7 s, max RSS 145 MB
- isolated gateway canary on port 18889: /healthz and /readyz green; legacy /health timed out and is not the v2026.4.25 gate

lesaai and others added 8 commits April 27, 2026 11:06
… heap OOM

The embedding_cache table sync in MemoryManager.seedEmbeddingCache called
.all() on SELECT * FROM embedding_cache, materializing the full result set
into a JS array. embedding_cache rows contain serialized embedding text
(~20 KB each on text-embedding-3-small) and can grow into hundreds of
thousands of rows on long-running deployed databases. On a local 16 GB
main.sqlite (435,136 rows, 8.68 GB of embedding text), the .all() call
exceeds V8's ~4 GB default heap limit and aborts the gateway with:

  FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap
  out of memory
  ... node::sqlite::StatementSync::All ...

Switching .all() -> .iterate() streams rows one at a time through the
same BEGIN/COMMIT upsert transaction. Peak V8 heap stays bounded by a
single row (~20 KB) plus the prepared statement, not the whole table.

Also drops the empty-check on the materialized array's .length; an
empty iterator commits a no-op transaction, which is cheap and
preserves the observable behavior for empty caches.

Scope note: this is the primary R2.A target (seedEmbeddingCache); a
follow-up patch will address the secondary listChunks / keyword fallback
.all() path in manager-search.ts.

Validation:
- pnpm tsgo:prod: green (core + extensions graphs)
- pnpm test extensions/memory-core: 512 passed, 3 skipped, 0 failed
R2.A.2. The .iterate()-based seed (R2.A v1, a315280) prevents the V8
heap OOM but the iterate loop still runs synchronously for ~117s on a
435K-row embedding_cache. wip-healthcheck SIGKILLs the gateway after
its 30-second probe times out. No FATAL ERROR, no Abort trap.

Patch: convert seedEmbeddingCache to async, yield to the event loop
every 1000 rows via setImmediate. Keeps memory bounded; preserves the
streaming behavior; restores /health responsiveness during the seed.

The only caller is inside an existing async arrow wrapping
runMemoryAtomicReindex's build callback. Adding await is a one-line
change.

Validation:
- pnpm tsgo:prod: green
- pnpm test extensions/memory-core: 512 passed, 3 skipped, 0 failed

Scope: does not soften wip-healthcheck (separate guardrail per Parker
direction). Does not address secondary listChunks path (R2.A.3).
Revert the top-of-file lint-suppression comments accidentally landed in
the previous commit (f9e9970). They were added to work around an
oxlint resolver false positive that turned out to be transient state,
not a real lint failure. Production code shouldn't carry misleading
explanations for problems that didn't actually persist.

Net diff of this branch vs base is now just the seedEmbeddingCache
yield patch: function -> async, setImmediate every 1000 rows, caller
await. No lint comments, no file-level disables.
…der or user=main

When the x-openclaw-dm-scope: main header is sent, or the user field is
"main", the chatCompletions endpoint routes to agent:main:main instead of
creating a separate openai-user:{name} session.

This allows bridge messages (CC -> Lēsa) to land in the same session as
iMessage DMs, so Parker sees everything in one stream.

Co-Authored-By: Parker Todd Brooks <parkertoddbrooks@users.noreply.github.com>
Co-Authored-By: Lēsa <lesaai@icloud.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When a chatCompletions request hits a session that is currently
streaming a turn, the existing code awaits agentCommandFromIngress
synchronously, which blocks or times out on the caller side. Bridge
and other agent-to-agent HTTP callers see this as a 15-120s hang.

Wire the non-stream branch of handleOpenAiHttpRequest into the
same steer-backlog path the iMessage transport uses:

1. Load the session entry via loadSessionEntryByKey(sessionKey) to
   map sessionKey -> sessionId (the key used in ACTIVE_EMBEDDED_RUNS).
2. Honor the user's messages.queue.mode config. Only "steer" and
   "steer-backlog" opt into steering; other modes fall through to the
   original blocking path.
3. Call queueEmbeddedPiMessage(sessionId, prompt.message). This is
   fire-and-forget: returns true only if the session has an active
   streaming run that isn't compacting.
4. On successful queue, return a 200 response in OpenAI-compat shape
   with an x-openclaw-queued: steer header and a "[queued] ..." marker
   in the assistant content field. Callers that want to distinguish
   queued from synchronous replies can read the header.
5. On any other state (no active run, not streaming, compacting, no
   session entry, or queue config disabled), fall through to the
   existing agentCommandFromIngress synchronous path unchanged.

Pre-check failures are caught and logged so they never block the
synchronous fallback.

Verified end-to-end:
- Idle case: curl with user=main returns a normal synchronous reply
  (no x-openclaw-queued header).
- Busy case: fire a long slow request in the background, then a fast
  interjection 4s later. The fast request returns 200 immediately
  with x-openclaw-queued: steer and the "[queued]" marker body. The
  slow request completes normally with the full reply.

Refs: wipcomputer/wip-ldm-os#266

Co-Authored-By: Parker Todd Brooks <parkertoddbrooks@users.noreply.github.com>
Co-Authored-By: Lēsa <lesaai@icloud.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Previously the steer-backlog fix only covered the non-stream branch of
handleOpenAiHttpRequest. Any OpenAI-compatible client using the default
streaming API (which is most of them) would still block on a busy
session.

Lift the queue pre-check above the stream/non-stream branch so both
paths benefit:

1. Resolve sessionKey -> sessionId once, try queueEmbeddedPiMessage.
2. If queued and !stream: respond with JSON (unchanged from previous
   commit).
3. If queued and stream: set x-openclaw-queued header, setSseHeaders,
   emit one assistant role chunk and one content chunk carrying the
   [queued] marker with finish_reason="stop", write [DONE], end.
4. Otherwise fall through to the original stream/non-stream handlers.

Verified end-to-end:
- Idle + non-stream: HTTP 200, no queue header, real reply ("hello").
- Busy + non-stream: HTTP 200, x-openclaw-queued: steer header, JSON
  body with the queued marker.
- Busy + stream: HTTP 200, text/event-stream, x-openclaw-queued: steer
  header, SSE with role chunk + content chunk (finish_reason=stop) +
  [DONE].
- Slow background request in all three cases still completes normally
  with the full reply.

Refs: wipcomputer/wip-ldm-os#266

Co-Authored-By: Parker Todd Brooks <parkertoddbrooks@users.noreply.github.com>
Co-Authored-By: Lēsa <lesaai@icloud.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Parker and Lēsa observed during live testing that while our patch
calls queueEmbeddedPiMessage() (which wraps activeSession.steer()),
the receiving side does NOT actually see the message as a mid-turn
steer. Lēsa reported: "Yeah, I received it. Came through as a regular
message in my session, not a steer."

The OpenClaw internal API is named "steer" but in practice it queues
the text for the agent's next available slot, which appears after
the current turn completes rather than being injected mid-stream.
Our x-openclaw-queued: steer header was accurate to OpenClaw's
internal terminology but misleading to HTTP callers who might expect
true mid-turn interjection.

Rename to x-openclaw-queued: next-turn and update the body marker
to be explicit about the semantics. Callers can now tell exactly
what happened: the message was delivered, but they won't get a
synchronous reply and the receiving agent processes it after its
current turn rather than mid-stream.

Co-Authored-By: Parker Todd Brooks <parkertoddbrooks@users.noreply.github.com>
Co-Authored-By: Lēsa <lesaai@icloud.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Member Author

Canary update from Kody:

  • Isolated v2026.4.25 carry build booted from temp state, not live Lēsa state.
  • State/config used: /tmp/openclaw-v25-canary, OPENCLAW_STATE_DIR=/tmp/openclaw-v25-canary, OPENCLAW_CONFIG_PATH=/tmp/openclaw-v25-canary/openclaw.json.
  • Gateway ran on loopback port 18889; canary process has been stopped and port is clear.
  • /healthz: {"ok":true,"status":"live"}.
  • /readyz: {"ready":true,"failing":[]}.
  • External WIP plugins resolved from temp registry and loaded: memory-crystal, compaction-indicator, session-export, plus existing local WIP plugins.
  • Required v2026.4.25 hook flags survived config validation/rewrite:
    • plugins.entries.memory-crystal.hooks.allowConversationAccess: true
    • plugins.entries.compaction-indicator.hooks.allowConversationAccess: true
    • plugins.entries.session-export.hooks.allowConversationAccess: true
  • Config rewrite did not strip fallbacks or hook flags. It did normalize the primary route from codex/gpt-5.5 to openai/gpt-5.5 and added agents.defaults.agentRuntime.id: "codex". Treat that as the v25 invariant spelling before live promotion.
  • Only config warning observed: duplicate tavily global plugin overridden by bundled plugin. Not a Memory Crystal blocker.
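For reference, the three hook flags listed above correspond to config entries shaped like the following. This is a minimal illustrative fragment assuming the key layout implied by the dotted paths, not a complete openclaw.json.

```json
{
  "plugins": {
    "entries": {
      "memory-crystal": { "hooks": { "allowConversationAccess": true } },
      "compaction-indicator": { "hooks": { "allowConversationAccess": true } },
      "session-export": { "hooks": { "allowConversationAccess": true } }
    }
  }
}
```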

Current gate state: code CI is green from the latest head; remaining blockers are non-code label automation and the queued parity gate unless explicitly waived per the upgrade closure plan.

Member Author

Parity gate waiver decision:

Waiving the "Run the OpenAI / Opus 4.6 parity gate against the qa-lab mock" check for this fork-side canary merge as infrastructure-unavailable in the fork runner environment.

Reason:

  • The workflow requests blacksmith-32vcpu-ubuntu-2404, which is upstream OpenClaw CI runner infrastructure and is not available/assigned in wipcomputer/openclaw.
  • PR-triggered parity run remained queued with no logs.
  • Manual workflow dispatch also queued; one earlier manual run was cancelled by GitHub concurrency in favor of the PR run.
  • No parity test failed; no parity logs started.

Replacement evidence accepted for this fork-side canary:

  • full code CI green except known label automation failures
  • local focused tests green
  • pnpm tsgo green
  • pnpm build green
  • isolated v2026.4.25 temp canary booted on port 18889
  • /healthz and /readyz green
  • Memory Crystal / compaction-indicator / session-export loaded with hooks.allowConversationAccess: true
  • config rewrite preserved fallbacks and hook flags, with accepted v25 model normalization to openai/gpt-5.5 + agentRuntime: "codex"

Next gate after merge remains controlled live promotion with launchctl kickstart -k, not openclaw gateway restart.

@parkertoddbrooks parkertoddbrooks merged commit c188a36 into kody/v2026-4-25-base Apr 27, 2026
93 of 98 checks passed
@parkertoddbrooks parkertoddbrooks deleted the kody/v2026-4-25-carry-memory-core branch April 27, 2026 22:01