Files
nanoclaw/REFACTOR.md
T
gavrielc 5ed5b72f10 docs: consolidate refactor follow-ups into a single REFACTOR.md
Single forward-looking reference that replaces the two untracked planning
docs (REFACTOR_PLAN.md + REFACTOR_EXECUTION.md) which had become a mix of
historical PR timeline and still-relevant decisions.

Keeps only what's actionable going forward:
- Module tiers, the four registries, and the module distribution model
  (architecture summary).
- Remaining work: Phase 5 (v2 → main) and the modules-branch decision.
- Operational patterns worth preserving (standing checks, TDZ rule,
  branch-sync file-presence diff procedure, prettier drift pattern).
- 17 curated open questions across design, distribution, core slotting,
  and documentation.

Canonical references (docs/module-contract.md, docs/architecture.md, etc.)
are linked but not duplicated. This doc is transient — retire when the
refactor is fully behind us.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 22:14:09 +03:00

13 KiB
Raw Blame History

NanoClaw Refactor — Forward-Looking Reference

Consolidates what's still relevant from REFACTOR_PLAN.md and REFACTOR_EXECUTION.md: open decisions, remaining work, operational patterns worth keeping. Historical PR timeline and phase framing have been dropped — the work is in the commit history.


Architecture (still authoritative)

Module tiers

Three categories, distinguished by shipping model and dependency direction:

Tier Where it lives Loaded by default? Removal cost
Core src/** (outside src/modules/, src/channels/, src/providers/) always N/A — can't remove
Default modules src/modules/<name>/ on main yes — imported by src/modules/index.ts edit core imports (intentional friction)
Optional modules src/modules/<name>/ on main (for now — see open q #7) yes, via barrel import delete files + barrel line + revert MODULE-HOOK edits
Channel adapters src/channels/<name>.ts on channels branch no — cherry-pick via /add-<name> delete files + barrel line
Providers on providers branch no — cherry-pick via /add-<provider> delete files + barrel line

Default modules today: typing, mount-security, approvals, cli. Optional modules: interactive, scheduling, permissions, agent-to-agent, self-mod.

Dependency rule: core ← default modules ← optional modules. Optional modules must not depend on each other. Known transitional violation (flagged): src/db/messaging-groups.ts auto-wires agent_destinations when agent-to-agent is installed.

The four registries

Full contract in docs/module-contract.md. Summary:

  1. Delivery action handlersdelivery.ts; modules call registerDeliveryAction(name, fn).
  2. Router inbound gaterouter.ts; single setter (setSenderResolver + setAccessGate). Default: allow-all.
  3. Response dispatcherresponse-registry.ts; modules call registerResponseHandler(fn). First to return true claims.
  4. Container MCP tool self-registrationcontainer/agent-runner/src/mcp-tools/server.ts; modules call registerTools([...]) at import.

Anything else single-consumer uses either a sqlite_master-guarded inline read or a MODULE-HOOK:<name>:start/end skill edit.

Module distribution (pending)

  • main — core + default modules + default channel (cli). Ships clean.
  • channels — fully loaded runnable branch with all channel adapters; skills cherry-pick from it.
  • providers — same pattern for agent providers (OpenCode).
  • modules branch — proposed but NOT created yet. See "Remaining work" below.

Remaining work

Phase 5: merge v2main

Cut-over the refactor. Pre-reqs (already met): green build, green tests, green service boot, clean channels / providers syncs.

Open logistics:

  • Release versioning: bump to 1.3.0 at merge time or cut a v2-rc tag first for internal testing? Non-blocking — decide at merge.
  • Coordinate with anyone still running the old main (v1.2.53) — breaking change for them.
  • Announce the new layout + the one shell command that changed (pnpm run chat is new default).

modules branch — create, skip, or defer?

The original plan (PR #10) was to fork a modules branch and populate it with the 5 optional modules, so future /add-<module> skills pull via git show origin/modules:path. Three paths:

  • (a) Create it now. Matches the channels/providers pattern for consistency. Extra surface to maintain: every core change must be merged into modules at phase boundaries (same cadence as channels/providers). Pays off if we ever want to make a module truly optional (not shipped on main).
  • (b) Skip it. Leave all 5 optional modules shipped on main. No modules branch, no install skills, no cherry-picking. Simpler but loses the "opt-in" property for users who want a leaner install.
  • (c) Defer. Ship main without the modules branch; create it later if someone actually wants to slim their install. No-cost option for now.

Recommendation leans toward (c) — we've already paid the architectural cost (tier boundary, dependency rule, registries) without needing the branch today.

Per-module follow-ups (tracked as open questions below)

Each has a specific landing zone when we get to it:

  • #1113 (admin mechanism, providers registry, container-runner audit) — scope a focused cleanup pass.
  • #14 (CLAUDE.md review) — single dedicated PR touching every module.
  • #15 (A2A / destinations rethink) — requires design, not just cleanup.
  • #1718 (self-mod rethink, per-group source) — requires design.
  • #19 (system vs user CLAUDE.md) — requires install-skill tooling.

Operational patterns (keep using these)

Standing checks for every PR

Non-negotiable; a unit test suite alone doesn't catch circular-import TDZ bugs:

  1. pnpm run build clean.
  2. pnpm test + bun test (in container/agent-runner/) all green.
  3. Service actually starts. gtimeout 5 node dist/index.js (or launchctl kickstart) must reach NanoClaw running. Unit tests import individual files; only main() exercises the module-init order.
  4. Expected boot log lines present (at least: Central DB ready, Delivery polls started, Host sweep started, NanoClaw running, plus any module lifecycle line like OneCLI approval handler started or CLI channel listening).

Module architecture rule (TDZ bug, PR #3)

Any registry state a module writes to at import time must live in a file with no back-edge to src/index.ts — transitively. src/index.ts imports src/modules/index.js for side effects; if a module calls registerX() at top level and X lives in src/index.ts, the ES module loader hits a TDZ reference on the const declaration. Fix: registry state lives in its own dependency-free file (e.g. src/response-registry.ts). Any new registry follows the same pattern.

Branch sync procedure

After every v2 (or future main) sync into channels / providers / future modules:

  1. File-presence diff. Enumerate files that existed pre-sync but are missing post-sync:

    git ls-tree -r <pre-sync>  | awk '{print $4}' | sort > /tmp/pre.txt
    git ls-tree -r <post-sync> | awk '{print $4}' | sort > /tmp/post.txt
    comm -23 /tmp/pre.txt /tmp/post.txt
    

    Classify each missing file:

    • Intentional (core deleted it) → leave deleted.
    • Branch-owned (channels branch still needs it) → restore from pre-sync HEAD.

    This has caught real losses on both channels (17 adapter files plus 3 setup scripts after PR #2's channel move) and providers (opencode files after PR #2).

  2. Cross-file consistency. When restoring a file, check whether something else that also changed references it (e.g. setup/index.ts's STEPS map).

  3. Run the standing checks against the synced branch (not just v2).

Prettier drift pattern

The format:fix pre-commit hook sometimes reformats peer files after the commit completes, leaving cosmetic-only diffs in the working tree. Discard with git checkout -- <files>. Do not re-commit the drift — it's trivial whitespace and noise.


Open questions (curated)

Design / architecture

  1. NANOCLAW_ADMIN_USER_IDS as the admin mechanism. Host queries user_roles at container wake, collapses into env var, container compares sender IDs. Conflates identity-at-send with privilege-at-wake and forces the container to care about namespaced user IDs. Revisit during a container-runner audit.

  2. Host-side src/providers/ registry. One real consumer (OpenCode). A registry is probably overkill — the install skill could just edit container-runner.ts via MODULE-HOOK. Fold into the container-runner audit.

  3. Container-runner audit. src/container-runner.ts has accreted wake/spawn/kill, mount assembly, OneCLI credential application, admin-ID env var, idle timers, image rebuild. Some pieces should pull apart or move into modules. Not blocking. Related to #1 and #2.

  4. Revisit destinations + A2A capability holistically. The destination projection invariant, dual-purpose routing+ACL table, channel vs agent destination shapes, createMessagingGroupAgent auto-wire coupling — more machinery than the feature warrants. Phase 3 moved it out of core intact; a redesign is warranted but scoped post-refactor.

  5. Self-mod approach rethink. Three separate MCP tools + three delivery actions + three approval handlers for what's essentially "mutate container.config.json and rebuild." Also: post-rebuild latency (host sweep waits up to 60s), and agents sometimes send redundant add_mcp_server + request_rebuild pairs. Consider collapsing into a single "apply this container-config diff" approval primitive.

  6. Per-agent-group source / per-group base image. Self-mod today layers packages/MCP on a shared base. As groups diverge (different base images, provider configs, runtime toolchains), the shared-base assumption won't scale. Scope post-refactor.

Distribution / operational

  1. Providers on a consolidated modules branch? Staying separate for now. Revisit if a second optional provider appears.

  2. Per-group module enablement. Modules are currently project-wide. If one agent group wants approvals and another doesn't, we'd need per-group feature flags. Flag if asked.

  3. Module removal UX. We do not drop tables on uninstall. Is that the right default? (Alternative: /remove-<module> optionally runs a down migration. YAGNI until requested.)

  4. Cross-module ordering for the response dispatcher. Registration order determines who claims a given questionId. IDs are disjoint in practice (q-… vs appr-…), so first-match-wins is safe. If a third response-consuming module arrives, we may need keyed dispatch.

  5. Versioned module migrations. Reinstalls are idempotent (migrator skips anything already in schema_version). If a module ships a new migration in a later version, the install skill must append the new file + barrel entry without touching prior ones. Simplest rule: install skills are additive; content changes to an already-applied migration are a hard error.

  6. Telegram pairing imports from permissions (channels branch). src/channels/telegram.ts reaches into src/modules/permissions/db/* for grantRole/hasAnyOwner/upsertUser in the pairing-bootstrap branch. Cross-branch tier violation. Fix: extract those writes into a pairing helper (e.g. src/channels/telegram-pairing-accept.ts or setup/pair-telegram.ts). Non-blocking.

Core slotting (files not explicitly discussed)

  1. state-sqlite.ts, webhook-server.ts, timezone.ts. state-sqlite is likely core (host tracker). Webhook-server likely core (channel infra). Timezone likely core utility. Confirm if any of them prove to be module-shaped during future audits.

  2. Chat SDK bridge location. src/channels/chat-sdk-bridge.ts is channel infra that bridges adapters on the channels branch. Stays in src/channels/ for now.

  3. OneCLI credential injection. Lives in container-runner.ts. Every agent call uses it, no clean optional boundary. Stays core. Related: onecli-approvals.ts is bundled inside the approvals default module on the assumption OneCLI stays in core. If OneCLI later moves to its own module, onecli-approvals follows.

Documentation

  1. CLAUDE.md content per module. Every module ships with project.md + agent.md. Need a dedicated review pass: (a) write the missing agent-to-agent snippets, (b) audit other modules for accuracy/tone, (c) confirm agent.md files are actually tailored for the agent vs. copy-pastes of project.md.

  2. Split system CLAUDE.md from user CLAUDE.md. Project CLAUDE.md and groups/global/CLAUDE.md mix system-authored content (module contracts, install-skill appends) with user customizations. Updates currently risk clobbering user intent. Look at a system-owned region (or separate file) that skills rewrite freely plus a user-owned one that's never touched. Related to #16.


Where the canonical references live

This doc (REFACTOR.md) is transient — prune when open questions close; retire entirely once the refactor is fully behind us and the operational patterns have been absorbed into CLAUDE.md or docs/.