Files
nanoclaw/REFACTOR.md
T
gavrielc 5ed5b72f10 docs: consolidate refactor follow-ups into a single REFACTOR.md
Single forward-looking reference that replaces the two untracked planning
docs (REFACTOR_PLAN.md + REFACTOR_EXECUTION.md) which had become a mix of
historical PR timeline and still-relevant decisions.

Keeps only what's actionable going forward:
- Module tiers, the four registries, and the module distribution model
  (architecture summary).
- Remaining work: Phase 5 (v2 → main) and the modules-branch decision.
- Operational patterns worth preserving (standing checks, TDZ rule,
  branch-sync file-presence diff procedure, prettier drift pattern).
- 17 curated open questions across design, distribution, core slotting,
  and documentation.

Canonical references (docs/module-contract.md, docs/architecture.md, etc.)
are linked but not duplicated. This doc is transient — retire when the
refactor is fully behind us.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 22:14:09 +03:00

176 lines
13 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# NanoClaw Refactor — Forward-Looking Reference
Consolidates what's still relevant from `REFACTOR_PLAN.md` and `REFACTOR_EXECUTION.md`: open decisions, remaining work, operational patterns worth keeping. Historical PR timeline and phase framing have been dropped — the work is in the commit history.
---
## Architecture (still authoritative)
### Module tiers
Three categories, distinguished by shipping model and dependency direction:
| Tier | Where it lives | Loaded by default? | Removal cost |
|------|----------------|--------------------|--------------|
| **Core** | `src/**` (outside `src/modules/`, `src/channels/`, `src/providers/`) | always | N/A — can't remove |
| **Default modules** | `src/modules/<name>/` on main | yes — imported by `src/modules/index.ts` | edit core imports (intentional friction) |
| **Optional modules** | `src/modules/<name>/` on main (for now — see open q #7) | yes, via barrel import | delete files + barrel line + revert `MODULE-HOOK` edits |
| **Channel adapters** | `src/channels/<name>.ts` on `channels` branch | no — cherry-pick via `/add-<name>` | delete files + barrel line |
| **Providers** | on `providers` branch | no — cherry-pick via `/add-<provider>` | delete files + barrel line |
Default modules today: `typing`, `mount-security`, `approvals`, `cli`.
Optional modules: `interactive`, `scheduling`, `permissions`, `agent-to-agent`, `self-mod`.
Dependency rule: **core ← default modules ← optional modules**. Optional modules must not depend on each other. Known transitional violation (flagged): `src/db/messaging-groups.ts` auto-wires `agent_destinations` when agent-to-agent is installed.
### The four registries
Full contract in [`docs/module-contract.md`](docs/module-contract.md). Summary:
1. **Delivery action handlers**`delivery.ts`; modules call `registerDeliveryAction(name, fn)`.
2. **Router inbound gate**`router.ts`; single setter (`setSenderResolver` + `setAccessGate`). Default: allow-all.
3. **Response dispatcher**`response-registry.ts`; modules call `registerResponseHandler(fn)`. First to return `true` claims.
4. **Container MCP tool self-registration**`container/agent-runner/src/mcp-tools/server.ts`; modules call `registerTools([...])` at import.
Anything else single-consumer uses either a `sqlite_master`-guarded inline read or a `MODULE-HOOK:<name>:start/end` skill edit.
### Module distribution (pending)
- **`main`** — core + default modules + default channel (`cli`). Ships clean.
- **`channels`** — fully loaded runnable branch with all channel adapters; skills cherry-pick from it.
- **`providers`** — same pattern for agent providers (OpenCode).
- **`modules` branch** — proposed but NOT created yet. See "Remaining work" below.
---
## Remaining work
### Phase 5: merge `v2` → `main`
Cut-over the refactor. Pre-reqs (already met): green build, green tests, green service boot, clean `channels` / `providers` syncs.
Open logistics:
- Release versioning: bump to `1.3.0` at merge time or cut a `v2-rc` tag first for internal testing? Non-blocking — decide at merge.
- Coordinate with anyone still running the old `main` (v1.2.53) — breaking change for them.
- Announce the new layout + the one shell command that changed (`pnpm run chat` is new default).
### `modules` branch — create, skip, or defer?
The original plan (PR #10) was to fork a `modules` branch and populate it with the 5 optional modules, so future `/add-<module>` skills pull via `git show origin/modules:path`. Three paths:
- **(a) Create it now.** Matches the `channels`/`providers` pattern for consistency. Extra surface to maintain: every core change must be merged into `modules` at phase boundaries (same cadence as channels/providers). Pays off if we ever want to make a module *truly* optional (not shipped on main).
- **(b) Skip it.** Leave all 5 optional modules shipped on main. No `modules` branch, no install skills, no cherry-picking. Simpler but loses the "opt-in" property for users who want a leaner install.
- **(c) Defer.** Ship main without the modules branch; create it later if someone actually wants to slim their install. No-cost option for now.
Recommendation leans toward (c) — we've already paid the architectural cost (tier boundary, dependency rule, registries) without needing the branch today.
### Per-module follow-ups (tracked as open questions below)
Each has a specific landing zone when we get to it:
- #1113 (admin mechanism, providers registry, container-runner audit) — scope a focused cleanup pass.
- #14 (CLAUDE.md review) — single dedicated PR touching every module.
- #15 (A2A / destinations rethink) — requires design, not just cleanup.
- #1718 (self-mod rethink, per-group source) — requires design.
- #19 (system vs user CLAUDE.md) — requires install-skill tooling.
---
## Operational patterns (keep using these)
### Standing checks for every PR
Non-negotiable; a unit test suite alone doesn't catch circular-import TDZ bugs:
1. `pnpm run build` clean.
2. `pnpm test` + `bun test` (in `container/agent-runner/`) all green.
3. **Service actually starts.** `gtimeout 5 node dist/index.js` (or `launchctl kickstart`) must reach `NanoClaw running`. Unit tests import individual files; only `main()` exercises the module-init order.
4. Expected boot log lines present (at least: `Central DB ready`, `Delivery polls started`, `Host sweep started`, `NanoClaw running`, plus any module lifecycle line like `OneCLI approval handler started` or `CLI channel listening`).
### Module architecture rule (TDZ bug, PR #3)
Any registry state a module writes to at import time must live in a file with **no back-edge to `src/index.ts`** — transitively. `src/index.ts` imports `src/modules/index.js` for side effects; if a module calls `registerX()` at top level and `X` lives in `src/index.ts`, the ES module loader hits a TDZ reference on the const declaration. Fix: registry state lives in its own dependency-free file (e.g. `src/response-registry.ts`). Any new registry follows the same pattern.
### Branch sync procedure
After every `v2` (or future `main`) sync into `channels` / `providers` / future `modules`:
1. **File-presence diff.** Enumerate files that existed pre-sync but are missing post-sync:
```
git ls-tree -r <pre-sync> | awk '{print $4}' | sort > /tmp/pre.txt
git ls-tree -r <post-sync> | awk '{print $4}' | sort > /tmp/post.txt
comm -23 /tmp/pre.txt /tmp/post.txt
```
Classify each missing file:
- **Intentional** (core deleted it) → leave deleted.
- **Branch-owned** (channels branch still needs it) → restore from pre-sync HEAD.
This has caught real losses on both `channels` (17 adapter files plus 3 setup scripts after PR #2's channel move) and `providers` (opencode files after PR #2).
2. **Cross-file consistency.** When restoring a file, check whether something *else* that also changed references it (e.g. `setup/index.ts`'s `STEPS` map).
3. **Run the standing checks** against the synced branch (not just v2).
### Prettier drift pattern
The `format:fix` pre-commit hook sometimes reformats peer files *after* the commit completes, leaving cosmetic-only diffs in the working tree. Discard with `git checkout -- <files>`. Do not re-commit the drift — it's trivial whitespace and noise.
---
## Open questions (curated)
### Design / architecture
1. **`NANOCLAW_ADMIN_USER_IDS` as the admin mechanism.** Host queries `user_roles` at container wake, collapses into env var, container compares sender IDs. Conflates identity-at-send with privilege-at-wake and forces the container to care about namespaced user IDs. Revisit during a container-runner audit.
2. **Host-side `src/providers/` registry.** One real consumer (OpenCode). A registry is probably overkill — the install skill could just edit `container-runner.ts` via `MODULE-HOOK`. Fold into the container-runner audit.
3. **Container-runner audit.** `src/container-runner.ts` has accreted wake/spawn/kill, mount assembly, OneCLI credential application, admin-ID env var, idle timers, image rebuild. Some pieces should pull apart or move into modules. Not blocking. Related to #1 and #2.
4. **Revisit destinations + A2A capability holistically.** The destination projection invariant, dual-purpose routing+ACL table, channel vs agent destination shapes, `createMessagingGroupAgent` auto-wire coupling — more machinery than the feature warrants. Phase 3 moved it out of core intact; a redesign is warranted but scoped post-refactor.
5. **Self-mod approach rethink.** Three separate MCP tools + three delivery actions + three approval handlers for what's essentially "mutate container.config.json and rebuild." Also: post-rebuild latency (host sweep waits up to 60s), and agents sometimes send redundant `add_mcp_server` + `request_rebuild` pairs. Consider collapsing into a single "apply this container-config diff" approval primitive.
6. **Per-agent-group source / per-group base image.** Self-mod today layers packages/MCP on a shared base. As groups diverge (different base images, provider configs, runtime toolchains), the shared-base assumption won't scale. Scope post-refactor.
### Distribution / operational
7. **Providers on a consolidated `modules` branch?** Staying separate for now. Revisit if a second optional provider appears.
8. **Per-group module enablement.** Modules are currently project-wide. If one agent group wants approvals and another doesn't, we'd need per-group feature flags. Flag if asked.
9. **Module removal UX.** We do not drop tables on uninstall. Is that the right default? (Alternative: `/remove-<module>` optionally runs a down migration. YAGNI until requested.)
10. **Cross-module ordering for the response dispatcher.** Registration order determines who claims a given `questionId`. IDs are disjoint in practice (`q-…` vs `appr-…`), so first-match-wins is safe. If a third response-consuming module arrives, we may need keyed dispatch.
11. **Versioned module migrations.** Reinstalls are idempotent (migrator skips anything already in `schema_version`). If a module ships a *new* migration in a later version, the install skill must append the new file + barrel entry without touching prior ones. Simplest rule: install skills are additive; content changes to an already-applied migration are a hard error.
12. **Telegram pairing imports from permissions (channels branch).** `src/channels/telegram.ts` reaches into `src/modules/permissions/db/*` for `grantRole`/`hasAnyOwner`/`upsertUser` in the pairing-bootstrap branch. Cross-branch tier violation. Fix: extract those writes into a pairing helper (e.g. `src/channels/telegram-pairing-accept.ts` or `setup/pair-telegram.ts`). Non-blocking.
### Core slotting (files not explicitly discussed)
13. **`state-sqlite.ts`, `webhook-server.ts`, `timezone.ts`.** state-sqlite is likely core (host tracker). Webhook-server likely core (channel infra). Timezone likely core utility. Confirm if any of them prove to be module-shaped during future audits.
14. **Chat SDK bridge location.** `src/channels/chat-sdk-bridge.ts` is channel infra that bridges adapters on the `channels` branch. Stays in `src/channels/` for now.
15. **OneCLI credential injection.** Lives in `container-runner.ts`. Every agent call uses it, no clean optional boundary. Stays core. Related: `onecli-approvals.ts` is bundled inside the `approvals` default module on the assumption OneCLI stays in core. If OneCLI later moves to its own module, `onecli-approvals` follows.
### Documentation
16. **CLAUDE.md content per module.** Every module ships with project.md + agent.md. Need a dedicated review pass: (a) write the missing agent-to-agent snippets, (b) audit other modules for accuracy/tone, (c) confirm `agent.md` files are actually tailored for the agent vs. copy-pastes of `project.md`.
17. **Split system CLAUDE.md from user CLAUDE.md.** Project `CLAUDE.md` and `groups/global/CLAUDE.md` mix system-authored content (module contracts, install-skill appends) with user customizations. Updates currently risk clobbering user intent. Look at a system-owned region (or separate file) that skills rewrite freely plus a user-owned one that's never touched. Related to #16.
---
## Where the canonical references live
- **Module contract** — [`docs/module-contract.md`](docs/module-contract.md)
- **Architecture overview** — [`docs/architecture.md`](docs/architecture.md)
- **DB layout** — [`docs/db.md`](docs/db.md), [`docs/db-central.md`](docs/db-central.md), [`docs/db-session.md`](docs/db-session.md)
- **Agent-runner internals** — [`docs/agent-runner-details.md`](docs/agent-runner-details.md)
- **Channel isolation model** — [`docs/isolation-model.md`](docs/isolation-model.md)
- **Build + runtime split** — [`docs/build-and-runtime.md`](docs/build-and-runtime.md)
- **Top-level** — [`CLAUDE.md`](CLAUDE.md)
This doc (`REFACTOR.md`) is transient — prune when open questions close; retire entirely once the refactor is fully behind us and the operational patterns have been absorbed into `CLAUDE.md` or `docs/`.