- docs/v1-vs-v2/: full v1→v2 regression analysis (SUMMARY + 21 per-module docs + ACTION-ITEMS rollup with decisions + timezone recreation spec). - container/agent-runner/scripts/sdk-signal-probe.ts: empirical harness used to characterise Claude Agent SDK event/hook/stderr timing for the stuck-detection design in item 9. - src/channels/chat-sdk-bridge.ts: document the conversations Map staleness in a code comment; fix deferred to when dynamic group registration lands (ACTION-ITEMS item 17). No runtime behavior change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
20 KiB
IPC: v1 vs v2
Scope
v1
- Host side:
/Users/gavriel/nanoclaw4/src/v1/ipc.ts(127 lines) — file-system watcher, task authorization, message routing - Auth/handshake tests:
/Users/gavriel/nanoclaw4/src/v1/ipc-auth.test.ts(614 lines) — authorization gates, schedule types, cron validation - Container side:
/Users/gavriel/nanoclaw4/container/agent-runner/src/v1/ipc-mcp-stdio.ts(509 lines) — MCP server over stdio, file-based message writes - Total v1 codebase: ~1,250 lines (v1/ subtree)
v2 counterparts
This is not a file-for-file mapping. The entire IPC abstraction layer has been replaced with SQLite DBs:
- Host delivery/routing:
/Users/gavriel/nanoclaw4/src/delivery.ts(912 lines) — polls outbound.db, delivers, handles system actions - Host sweep/recurrence:
/Users/gavriel/nanoclaw4/src/host-sweep.ts(174 lines) — 60s maintenance, stale detection via heartbeat, processing_ack sync - Session setup/DB:
/Users/gavriel/nanoclaw4/src/session-manager.ts(361 lines) — DB paths, folder init, destinations + routing writes - Container poll loop:
/Users/gavriel/nanoclaw4/container/agent-runner/src/poll-loop.ts(200+ lines) — fetches messages_in, marks status in processing_ack - Container destinations:
/Users/gavriel/nanoclaw4/container/agent-runner/src/destinations.ts(118 lines) — reads inbound.db's destinations table live - DB layer (host):
src/db/session-db.ts— insertMessage, getDueOutboundMessages, markDelivered, syncProcessingAcks, etc. - DB layer (container):
container/agent-runner/src/db/{messages-in,messages-out,session-state,connection}.ts - Schema:
/Users/gavriel/nanoclaw4/docs/db-session.md(184 lines) — definitive per-session DB layout
Paradigm shift
v1: IPC as explicit message files + stdio tunnel + MCP server
In v1, the host spawned an MCP server inside each container's stdio. The container's ipc-mcp-stdio.ts exposed tools (send_message, schedule_task, register_group, etc.) by writing JSON files to the host's data/ipc/{groupFolder}/{messages|tasks}/ directory. The host's ipc.ts file-watcher scanned these directories every IPC_POLL_INTERVAL (~1s), parsed the JSON, applied authorization gates (isMain? folder-match?), executed side effects (DB writes, group registration), and deleted the files. Ordering, atomicity, and backpressure were implicit in the filesystem.
v2: Everything is a message in two persistent DBs
The IPC abstraction has been entirely removed. All host↔container communication now flows through two SQLite files per session:
- inbound.db (host writes, container reads):
messages_infor inbound chat/tasks,destinationsfor the routing map,session_routingfor default reply channel - outbound.db (container writes, host reads):
messages_outfor agent responses,processing_ackfor status acks,session_statefor continuation storage
There is no MCP server inside the container that exposes system tools. Instead:
- Container side calls
writeMessageOut()directly, writing a JSONcontentblob withaction="schedule_task"(or similar) into themessages_outtable. - Host side polls
getDueOutboundMessages()from outbound.db, deserializes thecontent, and inhandleSystemAction()interprets the action, validates it, and applies it directly to inbound.db (no IPC file write).
The single-writer-per-file invariant (host writes inbound.db, container writes outbound.db) replaces the file-system locking and atomic rename semantics.
Key ownership shift:
- v1: Container owned the "request to do something" (file write). Host decided whether to act (authorization on read).
- v2: Host owns the "task is pending" state (messages_in row). Container marks its progress (processing_ack). Host syncs status, detects stale containers, and triggers recurrence.
Capability map
| v1 IPC Behavior | v2 Equivalent | Status | Notes |
|---|---|---|---|
| Handshake / auth | Database schema + envelope ID | ✓ Functional but different | v1: read isMain env var at startup, gate each IPC op. v2: host resolves session once, container reads destinations table on every query. No per-message auth envelope. |
| Message framing | JSON in files (atomic rename) | ✓ Replaced with DB schema | v1: writeIpcFile() temp-then-rename. v2: better-sqlite3 with journal_mode=DELETE + open-close-per-op for cross-mount visibility. |
| Transport (pipes/sockets) | SQLite on FUSE mount | ✓ Completely different | v1: filesystem watching (no network). v2: cross-mount DB access (requires journal_mode=DELETE pragma, see session-manager.ts:9–11). |
| Message types | kind field in messages_in/out |
✓ Expanded | v1: message/task files. v2: `kind=chat |
| Auth / authorization gates | Host-side permission checks in delivery.ts | ◐ Simplified but different | v1: checked per file (isMain flag, folder-match). v2: admin perms checked at container startup (adminUserIds set in poll-loop.ts:22–33), destination ACL in agent_destinations table, delivery.ts enforces on send. No per-message envelope. |
| Handshake semantics | None (session exists at startup) | ✗ Removed | v1: env vars set identity at container boot. v2: session_id/agent_group_id is stable DB fixture; container learns routing from session_routing table. No negotiation. |
| Backpressure / flow control | Implicit (filesystem poll interval) | ◐ Different model | v1: host polls files at 1s intervals; if processing is slow, files pile up. v2: messages_in rows sit with status='pending' until container marks processing_ack='processing', then host polls and syncs status. Host can enforce delivery retry budget (MAX_DELIVERY_ATTEMPTS=3 in delivery.ts:58). |
| Keepalives / timeouts | No explicit mechanism | ✓ Heartbeat file replaces | v1: IPC files served as implicit liveness. v2: container touches .heartbeat file (mtime tracked by host). Host uses heartbeat staleness (10min threshold in host-sweep.ts:32) to detect crash and reset stuck messages. |
| Ordering / seq parity | Implicit filename order (timestamp+random) | ✓ Enforced | v1: files had timestamps but no formal ordering. v2: seq is monotonic per session, even←host / odd←container (see db-session.md §3). Parity disambiguates edit/reaction targeting. |
| Reconnect semantics | Container restart picks up where it left off (env vars) | ✓ Improved | v1: continuation not persisted across restarts. v2: provider continuation (Claude JSON transcript, etc.) stored in session_state.session_id on every SDK result. Survives crash. |
| Error handling / retries | File left in errors/ dir on parse failure |
✓ Better visibility | v1: failed IPC files moved to data/ipc/errors/ for manual inspection. v2: status='failed' in messages_in; delivery.ts retries with exponential backoff (3 attempts), marks failed on max. Persisted in DB for audit. |
| Task scheduling (schedule_task) | IPC file write → host parses → DB insert | ✓ Same end result, different path | v1: container writes task JSON, host reads/validates cron. v2: container writes system message with action="schedule_task" to messages_out, host reads + inserts into messages_in as new kind='task' row. Validation still in host (cron parsing at delivery time). |
| Admin commands (/clear, /setup) | Not in v1 IPC | ✓ Implemented | v2 has /clear command in poll-loop.ts, checked against adminUserIds set. Clears session_state.session_id. No MCP server expose. |
| Tool-call plumbing | MCP server in container exposes send_message, schedule_task, etc. | ✗ Removed entirely | v1 tools are now plain SDK result processors. send_message writes messages_out. schedule_task writes messages_out with action="schedule_task". |
| Message delivery tracking | None (fire-and-forget) | ✓ Added | v1: host sends message, doesn't track if it reached the user. v2: delivered table in inbound.db (platform_message_id + status). delivery.ts marks as delivered/failed. Enables message edits, reactions, and retry logic. |
| Stale container detection | None | ✓ Added | v1: no heartbeat. v2: host-sweep.ts checks .heartbeat mtime. If >10min old and processing_ack has 'processing' entries, resets with backoff. |
| Recurrence / cron re-firing | Not in v1 | ✓ Added | v1: tasks were one-shot. v2: recurrence field in messages_in + series_id. host-sweep.ts fires next occurrence when completed message has recurrence. CronExpressionParser used at sync time. |
Missing from v2
1. Auth handshake envelope
v1 had explicit authorization gates for every IPC operation:
- Read
isMainandgroupFolderfrom env vars at startup (ipc-mcp-stdio.ts:19–21) - For
schedule_task: gate thetargetJid— non-main groups can only schedule forchatJid(line 187–188) - For
register_group: only isMain=true can call (line 471–481) - For
send_message: isMain || (target group's folder == sender's folder) (ipc.ts:78)
v2 equivalent: Authorization is now split:
- Container time: adminUserIds set passed at boot (poll-loop.ts:22–33), used to gate
/clearcommand only - Delivery time: host checks destination ACL via agent_destinations table, permission to send to a messaging group (delivery.ts:535–561)
- No per-message auth envelope; the session fixture itself represents authorization
What's lost: Per-request explicit authorization metadata. The agent can't prove it's "main" anymore; instead the host verifies at delivery time using the central DB. This is arguably better security (no token in container to leak), but if the agent needs to know why a request failed, it no longer gets an explicit auth reject response.
2. Backpressure / request queuing
v1 file-based IPC was implicitly backpressured:
- Container calls
send_message()MCP tool, which callswriteIpcFile()and returns immediately (fire-and-forget) - If the host is slow or overloaded, files pile up in
data/ipc/messages/ - Container is blocked only if the filesystem fills
v2 equivalent: No queueing or explicit backpressure:
- Container calls
writeMessageOut(), which executes a synchronous SQLite INSERT into outbound.db - Host polls outbound.db at 1s (active) or 60s (sweep)
- If delivery fails, messages sit in outbound.db with
status='pending'until 3 retries exhausted
What's lost: Queue depth visibility. In v1, you could see ls data/ipc/messages/ | wc -l to get backlog. In v2, you have to query the outbound DB. The container has no way to ask "how many pending messages are waiting for me?" — it just writes and hopes the host picks them up.
3. Explicit keepalive / ping
v1 had implicit keepalives via file timestamps:
- Each IPC file wrote a
timestampfield (ipc-mcp-stdio.ts:61, 202) - Host could reason about "last IPC activity"
v2 equivalent: Heartbeat file mtime:
- Container touches
.heartbeatfile (connection.tstouchHeartbeat()) - Host checks mtime every 60s in host-sweep.ts
- Detects stale if >10min old and processing_ack has 'processing' entries
What's lost: Sub-heartbeat timeouts. If the container is hung but the heartbeat is fresh (just stuck in a long computation), the host won't detect it. v1 had no explicit timeout either, so this is not a regression, but there's no keepalive mechanism (no ping/pong protocol).
4. Payload size limits / chunking
v1 wrote task files with a single JSON blob:
- ipc-mcp-stdio.ts:31:
fs.writeFileSync(tempPath, JSON.stringify(data, null, 2)) - Filesystem might have limits on inode size, but generally no explicit cap
v2: No explicit chunking or size limits in the DB layer:
- messages_in.content and messages_out.content are TEXT
- SQLite TEXT default is ~1GB per cell
- No mention in the codebase of max payload size
What's lost: Explicit awareness. In v1, if a task prompt was 10MB, it would be a 10MB JSON file. In v2, it's a 10MB DB cell. The system doesn't actively prevent this, and there's no mention of a sanitizer.
Behavioral discrepancies
1. Task scheduling authorization
v1 (ipc-auth.test.ts:71–127):
// Main group can schedule for another group
await processTaskIpc({ type: 'schedule_task', targetJid: 'other@g.us' }, 'whatsapp_main', true, deps);
// Non-main group can ONLY schedule for itself
await processTaskIpc({ type: 'schedule_task', targetJid: 'main@g.us' }, 'other-group', false, deps);
// ↑ blocked by authorization gate (ipc.ts:170)
v2 (delivery.ts:645–712):
The container writes a system message with action="schedule_task" directly into messages_out. The host reads it and calls insertTask(inDb, {...}) with no authorization gate. The targetJid is derived from the system message platformId and channelType, not from an explicitly routed targetJid parameter.
Discrepancy: v1 prevented non-main groups from scheduling cross-group tasks at the request stage. v2 has no equivalent gate — the container can write any task to any group (in theory) because it's the host that does the actual DB insert. In practice, the container only has one session and only sees messages for that session, so it can't reach another group's messages_in. But the authorization model is implicitly structural, not explicit.
2. Message send authorization
v1 (ipc-auth.test.ts:339–373):
// Main can send to any chat
expect(isMessageAuthorized('whatsapp_main', true, 'other@g.us', groups)).toBe(true);
// Non-main can send to its own chat
expect(isMessageAuthorized('other-group', false, 'other@g.us', groups)).toBe(true);
// Non-main cannot send to another group's chat
expect(isMessageAuthorized('other-group', false, 'main@g.us', groups)).toBe(false);
v2 (delivery.ts:550–561):
const isOriginChat = session.messaging_group_id === mg.id;
if (!isOriginChat && !hasDestination(session.agent_group_id, 'channel', mg.id)) {
throw new Error(`unauthorized channel destination: ...`);
}
The container's session has a fixed messaging_group_id + thread_id. The agent can only reply to that origin or to a destination in the agent_destinations table. There is no isMain flag.
Discrepancy: v1 was group-centric (folder-based identity). v2 is session-centric (agent is wired to one or more messaging groups via central DB, projected into inbound.db). If an agent is wired to multiple chats with session_mode='agent-shared', it has one session and can see all of them as destinations. This is more flexible than v1's binary main/non-main gate.
3. Task update semantics
v1 (ipc-auth.test.ts:264–309): Container passes type='update_task', host reads the task, re-computes next_run if schedule changed, updates DB.
v2 (delivery.ts:695–712): Container writes system message with action="update_task", host applies the update directly. The host does not recompute next_run if the schedule changes — it only updates the fields the container specified. Recurrence is re-fired by the host in host-sweep.ts (line 160–165), not at update time.
Discrepancy: v1 eagerly recomputed next_run on update. v2 lazily computes it during the 60s sweep. If an agent updates a task's cron expression, it won't take effect until the next sweep cycle. This is a ~60s latency increase.
4. Error handling
v1 (ipc.ts:85–91): Files that fail to parse are moved to data/ipc/errors/ for manual inspection.
v2 (delivery.ts:422–459): Messages that fail delivery get up to 3 retries with exponential backoff. If they still fail, they're marked status='failed' in the DB. There's no "errors" directory; the audit trail is in the DB + logs.
Discrepancy: v1's error handling was "fire-and-forget" (parse, move on). v2's is "retry + persistent state." This is better observability, but v1's "move to errors/" was a gentler way to pause processing without losing the file.
5. Reconnect / session resumption
v1: No persistence. If the container crashed, the next restart had no knowledge of prior messages or state.
v2 (poll-loop.ts:51–55): Reads session_state.session_id at startup and passes it to the provider as continuation. The provider (Claude) deserializes a .jsonl transcript and resumes. Survives container crash.
Discrepancy: v2 has explicit continuation support. v1 did not. This is a strict improvement.
Worth preserving?
1. Per-request authorization envelope
Recommendation: No, v2's structural approach is better. In v1, a malicious container could spoof an isMain flag to bypass gates (though env vars are hard to spoof). v2's model — the host resolves identity once and checks permissions against the central DB — is more robust and easier to audit.
2. Message send ACL at request time
Recommendation: Partially — v2 should validate agent_destinations rows exist before the agent attempts a send, so it fails fast instead of silently dropping at delivery time. Currently, if an agent tries <message to="nonexistent">...</message>, it writes to messages_out and the host later rejects it. A pre-send validation in the container (via destinations.ts) would be better UX.
3. Backpressure / delivery acknowledgment
Recommendation: Maybe. If an agent rapidly fires 100 send_message() calls, they all block on SQLite INSERT (fast) and return immediately. The host drains them at 1s per poll. If the channel adapter is slow, messages pile up in messages_out. There's no way for the agent to ask "is there backlog?" or "wait until sent." This is probably fine for most use cases (agents don't spam), but if latency-sensitive, a send_message() that returns {delivered_at} would help.
4. Heartbeat / stale detection
Recommendation: Yes, and it's been preserved (.heartbeat file replaces file-based timestamps). But the 10min threshold is conservative. Consider shorter thresholds for interactive agents (containers should be responsive, stale is a sign of crash, not slow work).
File references
v1 (historical, in src/v1/ and container/agent-runner/src/v1/)
- ipc.ts:30–127 — startIpcWatcher loop, per-group folder scan, message/task file dispatch
- ipc.ts:129–356 — processTaskIpc with authorization gates (lines 169, 228, 241, 254, 271, 313, 326)
- ipc-auth.test.ts:71–127 — schedule_task authorization tests
- ipc-auth.test.ts:339–373 — message send authorization tests
- ipc-mcp-stdio.ts:37–68 — send_message MCP tool, writeIpcFile
- ipc-mcp-stdio.ts:70–216 — schedule_task tool with validation, target_group_jid param
- ipc-mcp-stdio.ts:445–504 — register_group tool, isMain gate
v2 (active, in src/ and container/agent-runner/src/)
- db-session.md:1–50 — inbound.db schema (messages_in, delivered, destinations, session_routing)
- db-session.md:120–174 — outbound.db schema (messages_out, processing_ack, session_state)
- db-session.md:104–117 — seq parity invariant
- delivery.ts:383–394 — drainSession loop (active poll 1s, sweep 60s)
- delivery.ts:467–638 — deliverMessage, handles all message kinds, permission checks, delivery retry
- delivery.ts:645–906 — handleSystemAction, interprets action="schedule_task" etc.
- host-sweep.ts:48–109 — sweepSession, syncProcessingAcks, stale detection via heartbeat, recurrence handling
- session-manager.ts:1–12 — cross-mount invariant doc (journal_mode=DELETE, close-per-op)
- session-manager.ts:122–130 — initSessionFolder, schema creation
- session-manager.ts:152–222 — writeSessionRouting, writeDestinations (replaces static env vars with live table)
- session-manager.ts:231–267 — writeSessionMessage (host writes to messages_in)
- poll-loop.ts:22–33 — PollLoopConfig with adminUserIds set
- poll-loop.ts:46–77 — runPollLoop entry, getPendingMessages, markProcessing
- destinations.ts:44–52 — getAllDestinations, findByName (reads from inbound.db live)
- db/messages-in.ts — getPendingMessages, markProcessing, markCompleted
- db/messages-out.ts — writeMessageOut (container writes system actions here)
- db/session-state.ts — getStoredSessionId, setStoredSessionId (continuation persistence)
- db/connection.ts — touchHeartbeat, journal_mode=DELETE pragma, cross-mount setup
Generated from deep-dive analysis of v1 IPC → v2 DB paradigm shift.