Files
nanoclaw/docs/v1-vs-v2/group-queue.md
T
gavrielc 47950671fa docs: add v1→v2 action-items analysis + SDK signal probe tool
- docs/v1-vs-v2/: full v1→v2 regression analysis (SUMMARY + 21 per-module
  docs + ACTION-ITEMS rollup with decisions + timezone recreation spec).
- container/agent-runner/scripts/sdk-signal-probe.ts: empirical harness
  used to characterise Claude Agent SDK event/hook/stderr timing for the
  stuck-detection design in item 9.
- src/channels/chat-sdk-bridge.ts: document the conversations Map staleness
  in a code comment; fix deferred to when dynamic group registration lands
  (ACTION-ITEMS item 17).

No runtime behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 01:00:04 +03:00

4.4 KiB

group-queue: v1 vs v2

Scope

  • v1: src/v1/group-queue.ts (325 LOC), group-queue.test.ts (457 LOC) — in-memory per-group state machine, IPC-file dispatch
  • v2: no equivalent class. Serialization is now DB-based and distributed across src/session-manager.ts, src/host-sweep.ts, src/container-runner.ts, src/delivery.ts

Capability map

v1 behavior v2 location Status Notes
Per-group message queue inbound.db.messages_in + status='pending' replaced Atomic status transitions serialize work per-session
Per-group task queue inbound.db.messages_in with kind='task' replaced Same table; kind discriminates
MAX_CONCURRENT_CONTAINERS global cap container-runner.ts:42-52 activeContainers Map + wakeContainer dedup kept Enforced at spawn
One container per group invariant One container per session redefined Session is identity unit now
Task-before-message priority (drainGroup) host-sweep.ts recurrence + delivery.ts active poll partially lost No priority; polled by process_after timestamp ordering
Exponential retry backoff host-sweep.ts:145-147 BACKOFF_BASE_MS * 2^tries kept Max 5 tries, same shape
Idle preemption (notifyIdle/closeStdin) heartbeat file mtime removed No interrupt signal — container polls continuously
Message dispatch to active container (sendMessage) Write to messages_in table replaced Host writes; container polls
Cascading drain on task arrival delivery.ts (~1s) + host-sweep.ts (~60s) polls async-ized Work discovery on next tick, not synchronous
Shutdown without kill containers continue under --rm similar Host shutdown does not stop containers
Task dedup (pendingTasks.some(t => t.id === id)) PK on messages_in.id partial Unique ID prevents DB duplicates; does not prevent two distinct rows with same series_id
drainWaiting (waiting-group fairness) Implicit: any session can wake if slot free async No explicit fairness

Serialization model diff

v1 (push-based): GroupState in memory per group: active, pendingMessages, pendingTasks, idleWaiting, runningTaskId. drainGroup() synchronously dispatches. IPC file write signals container readiness. State lost on restart.

v2 (pull-based via DB): messages_in.status is the queue (pendingprocessingcompleted/failed). Host writes rows + calls wakeContainer(); container polls + atomic UPDATE to take work. One writer per DB file (host→inbound, container→outbound) eliminates cross-mount contention. Heartbeat file mtime replaces IPC for liveness. State persisted; survives crashes.

Missing from v2

  1. Idle-state preemption — v1 could interrupt an idle container on task arrival via closeStdin. v2 has no interrupt; container finishes current work and polls again
  2. Synchronous drain cascade — v1's drainGroup immediately ran the next item; v2 discovers it on the next poll tick (~1s active, ~60s sweep)
  3. In-memory task dedup — v1 checked pending-task list before enqueue. v2 can have two task rows with the same series_id coexisting (both pending) — relies on atomic status update for single-execution, best-effort
  4. Priority ordering — v1 tasks preempted messages; v2 is timestamp-ordered only

Behavioral discrepancies

Aspect v1 v2
Wake trigger on enqueue (sync) on wakeContainer() call, or poll finding due message
Idle timeout implicit via IPC explicit heartbeat mtime (10 min)
Task ordering FIFO within group, tasks preempt messages process_after timestamp; ties by insert seq
Retry host scheduleRetry() host sweep detects stale, increments tries, sets backoff
Concurrency cap same same (enforced in spawnContainer dedup)

Worth preserving?

  1. Explicit task dedup — add (kind, series_id, session_id) unique index on messages_in, or dedup in host-sweep.ts before inserting retry rows. Currently best-effort via atomic status update
  2. Priority ordering — add a priority column or document the ~1s task-wake latency as the SLA
  3. Idle preemption — not critical; 1s polling is acceptable for most workflows
  4. Fairness — v1's drainWaiting ensured no group starved. v2 is fair by timestamp but untested under concurrent load. Monitor in production