diff --git a/.claude/skills/setup/SKILL.md b/.claude/skills/setup/SKILL.md
index 9182968fa..25cbfa379 100644
--- a/.claude/skills/setup/SKILL.md
+++ b/.claude/skills/setup/SKILL.md
@@ -89,7 +89,23 @@ Run `pnpm exec tsx setup/index.ts --step timezone` and parse the status block.
 - macOS: install via `brew install --cask docker`, then `open -a Docker` and wait for it to start. If brew not available, direct to Docker Desktop download at https://docker.com/products/docker-desktop
 - Linux: install with `curl -fsSL https://get.docker.com | sh && sudo usermod -aG docker $USER`. Note: user may need to log out/in for group membership.
 
-### 3b. Build and test
+### 3b. CJK fonts
+
+Agent containers skip CJK fonts by default (~200MB saved). Without them, Chromium-rendered screenshots and PDFs show tofu for Chinese/Japanese/Korean.
+
+- **User writing to you in Chinese, Japanese, or Korean** → enable without asking. Mention it briefly.
+- **Resolved timezone from step 2a is a CJK region** (`Asia/Tokyo`, `Asia/Shanghai`, `Asia/Hong_Kong`, `Asia/Taipei`, `Asia/Seoul`) or other signal short of active CJK use → ask: "Enable CJK fonts? Adds ~200MB, lets the agent render CJK in screenshots and PDFs."
+- **Otherwise** → skip.
+
+To enable, write `INSTALL_CJK_FONTS=true` to `.env`:
+
+```bash
+grep -q '^INSTALL_CJK_FONTS=' .env && sed -i.bak 's/^INSTALL_CJK_FONTS=.*/INSTALL_CJK_FONTS=true/' .env && rm -f .env.bak || echo 'INSTALL_CJK_FONTS=true' >> .env
+```
+
+The next step's build picks it up automatically.
+
+### 3c. Build and test
 
 Run `pnpm exec tsx setup/index.ts --step container -- --runtime docker` and parse the status block.
 
diff --git a/CLAUDE.md b/CLAUDE.md
index 06331365a..55c7b0556 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -102,13 +102,18 @@ Before creating a PR, adding a skill, or preparing any contribution, you MUST re
 Run commands directly — don't tell the user to run them.
 
 ```bash
+# Host (Node + pnpm)
 pnpm run dev          # Host with hot reload
 pnpm run build        # Compile host TypeScript (src/)
 ./container/build.sh  # Rebuild agent container image (nanoclaw-agent:latest)
-pnpm test             # Host tests
+pnpm test             # Host tests (vitest)
+
+# Agent-runner (Bun — separate package tree under container/agent-runner/)
+cd container/agent-runner && bun install   # After editing agent-runner deps
+cd container/agent-runner && bun test      # Container tests (bun:test)
 ```
 
-Container typecheck is a separate tsconfig — if you edit `container/agent-runner/src/`, run `pnpm exec tsc -p container/agent-runner/tsconfig.json --noEmit` to check it.
+Container typecheck is a separate tsconfig — if you edit `container/agent-runner/src/`, run `pnpm exec tsc -p container/agent-runner/tsconfig.json --noEmit` from root (or `bun run typecheck` from `container/agent-runner/`).
 
 Service management:
 ```bash
@@ -146,7 +151,38 @@ This project uses pnpm with `minimumReleaseAge: 4320` (3 days) in `pnpm-workspac
 | [docs/v2-setup-wiring.md](docs/v2-setup-wiring.md) | What's wired, what's open in the setup flow |
 | [docs/v2-checklist.md](docs/v2-checklist.md) | Rolling status checklist across all subsystems |
 | [docs/v2-architecture-diagram.md](docs/v2-architecture-diagram.md) | Diagram version of the architecture |
+| [docs/v2-build-and-runtime.md](docs/v2-build-and-runtime.md) | Runtime split (Node host + Bun container), lockfiles, image build surface, CI, key invariants |
 
 ## Container Build Cache
 
 The container buildkit caches the build context aggressively. `--no-cache` alone does NOT invalidate COPY steps — the builder's volume retains stale files. To force a truly clean rebuild, prune the builder then re-run `./container/build.sh`.
+
+## Container Runtime (Bun)
+
+The agent container runs on **Bun**; the host runs on **Node** (pnpm). They communicate only via session DBs — no shared modules. Details and rationale: [docs/v2-build-and-runtime.md](docs/v2-build-and-runtime.md).
+
+**Gotchas — trigger + action:**
+
+- **Adding or bumping a runtime dep in `container/agent-runner/`** → edit `package.json`, then `cd container/agent-runner && bun install` and commit the updated `bun.lock`. Do not run `pnpm install` there — agent-runner is not a pnpm workspace.
+- **Bumping `@anthropic-ai/claude-agent-sdk`, `@modelcontextprotocol/sdk`, or any agent-runner runtime dep** → no `minimumReleaseAge` policy applies to this tree. Check the release date on npm, pin deliberately, never `bun update` blindly.
+- **Writing a new named-param SQL insert/update in the container** → use `$name` in both SQL and JS keys: `.run({ $id: msg.id })`. `bun:sqlite` does not auto-strip the prefix the way `better-sqlite3` does on the host. Positional `?` params work normally.
+- **Adding a test in `container/agent-runner/src/`** → import from `bun:test`, not `vitest`. Vitest runs on Node and can't load `bun:sqlite`. `vitest.config.ts` excludes this tree.
+- **Adding a Node CLI the agent invokes at runtime** (like `agent-browser`, `claude-code`, `vercel`) → put it in the Dockerfile's pnpm global-install block, pinned to an exact version via a new `ARG`. Don't use `bun install -g` — that bypasses the pnpm supply-chain policy.
+- **Changing the Dockerfile entrypoint or the dynamic-spawn command** (`src/container-runner.ts` line ~301) → keep `exec bun ...` so signals forward cleanly. The image has no `/app/dist`; don't reintroduce a tsc build step.
+- **Changing session-DB pragmas** (`container/agent-runner/src/db/connection.ts`) → `journal_mode=DELETE` is load-bearing for cross-mount visibility. Read the comment block at the top of the file first.
+
+## CJK font support
+
+Agent containers ship without CJK fonts by default (~200MB saved).
If you notice signals the user works with Chinese/Japanese/Korean content — conversing in CJK, CJK timezone (e.g., `Asia/Tokyo`, `Asia/Shanghai`, `Asia/Seoul`, `Asia/Taipei`, `Asia/Hong_Kong`), system locale hint, or mentions of needing to render CJK in screenshots/PDFs/scraped pages — offer to enable it:
+
+```bash
+# Ensure .env has INSTALL_CJK_FONTS=true (overwrite or append)
+grep -q '^INSTALL_CJK_FONTS=' .env && sed -i.bak 's/^INSTALL_CJK_FONTS=.*/INSTALL_CJK_FONTS=true/' .env && rm -f .env.bak || echo 'INSTALL_CJK_FONTS=true' >> .env
+
+# Rebuild and restart so new sessions pick up the new image
+./container/build.sh
+launchctl kickstart -k gui/$(id -u)/com.nanoclaw  # macOS
+# systemctl --user restart nanoclaw               # Linux
+```
+
+`container/build.sh` reads `INSTALL_CJK_FONTS` from `.env` and passes it through as a Docker build-arg. Without CJK fonts, Chromium-rendered screenshots and PDFs containing CJK text show tofu (empty rectangles) instead of characters.
diff --git a/docs/v2-build-and-runtime.md b/docs/v2-build-and-runtime.md
new file mode 100644
index 000000000..7b7d85883
--- /dev/null
+++ b/docs/v2-build-and-runtime.md
@@ -0,0 +1,80 @@
+# Build & Runtime
+
+NanoClaw runs a split stack: the host is Node + pnpm, the agent container is Bun. They communicate exclusively through two SQLite files per session — there are no shared modules between them, which is what lets them use different runtimes cleanly.
+
+## Why the split
+
+- **Host stays on Node** because Baileys (WhatsApp) depends on `libsignal-node` native bindings and a long-tested WebSocket/HTTP stack. Bun's Node-API compat has improved, but this isn't where we want risk.
+- **Container runs Bun** because `bun:sqlite` is built-in (no native compile of `better-sqlite3` per image rebuild), source runs directly (no tsc build step at image build or session wake), and `bun install` is ~5-10× faster than `npm install`.
+
+Host and container each have their own package tree:
+
+```
+/                          pnpm + Node 22
+  pnpm-lock.yaml           host deps (channels, Chat SDK, Baileys, better-sqlite3, etc.)
+  pnpm-workspace.yaml      minimumReleaseAge + onlyBuiltDependencies policy
+
+/container/agent-runner/   Bun 1.3+
+  bun.lock                 agent-runner runtime deps (Claude Agent SDK, MCP SDK, zod, etc.)
+  package.json             @types/bun, typescript devDeps for type-checking
+```
+
+The container image also has pnpm + Node inside for global CLIs (`@anthropic-ai/claude-code`, `agent-browser`, `vercel`). Those are Node binaries the agent invokes at runtime, not library deps. Keeping them on pnpm preserves the supply-chain policy for CLI versions.
+
+## Lockfiles
+
+| Tree | Lockfile | Manager | Regenerate after dep change |
+|------|----------|---------|----------------------------|
+| Host | `pnpm-lock.yaml` | pnpm 10 | `pnpm install` |
+| Agent-runner | `container/agent-runner/bun.lock` | Bun 1.3+ | `cd container/agent-runner && bun install` |
+
+Both are committed. CI and the Dockerfile run `--frozen-lockfile` variants — any drift between `package.json` and lockfile fails the build.
+
+## Supply chain
+
+- **Host + global CLIs** (pnpm): `minimumReleaseAge: 4320` (3-day hold on new versions), `onlyBuiltDependencies` allowlist for postinstall scripts. See `pnpm-workspace.yaml` and `docs/SECURITY.md`.
+- **Agent-runner** (Bun): no release-age policy — Bun doesn't have an equivalent today. The defenses are `bun.lock` pinning plus version-pinned CLIs/Bun itself via Dockerfile ARGs. When bumping `@anthropic-ai/claude-agent-sdk` or any runtime dep, review the release date on npm and bump deliberately, not via `bun update`.
+
+## Image build surface
+
+`container/Dockerfile` is a single-stage build on `node:22-slim`:
+
+- **Pinned ARGs** — `BUN_VERSION`, `CLAUDE_CODE_VERSION`, `AGENT_BROWSER_VERSION`, `VERCEL_VERSION`. Bump deliberately in PRs.
+- **CJK fonts** — `ARG INSTALL_CJK_FONTS=false`.
`container/build.sh` reads `INSTALL_CJK_FONTS` from `.env` and passes it through. Default build saves ~200MB; opt in when the user works with Chinese/Japanese/Korean content.
+- **BuildKit cache mounts** — `/var/cache/apt`, `/var/lib/apt`, `/root/.bun/install/cache`, `/root/.cache/pnpm`. Rebuilds where `package.json`/`bun.lock` haven't changed are fast. Requires BuildKit (default on Docker 23+, Apple Container-compat).
+- **`tini` as init** — reaps Chromium zombies, forwards signals so in-flight `outbound.db` writes finalize on SIGTERM.
+- **`entrypoint.sh`** (extracted) — `exec bun run /app/src/index.ts` under tini. Readable and diffable.
+- **No compiled `/app/dist`** — Bun runs TS directly. The host also mounts fresh source over `/app/src` at session start, so host edits take effect without rebuilding the image.
+
+## Session wake (two paths)
+
+1. **Base image ENTRYPOINT** — used for stdin-piped test invocations like the sample in `container/build.sh`: `tini --> entrypoint.sh` captures stdin to `/tmp/input.json`, then `exec bun run src/index.ts`.
+2. **Host-spawned session** — `src/container-runner.ts` at line ~301 uses `--entrypoint bash` with `-c 'exec bun run /app/src/index.ts'`. Bypasses tini (Docker's default PID 1 handling applies). Stdin is unused; all IO flows through the mounted session DBs.
+
+Both paths end with Bun running the same source file from `/app/src/index.ts`.
+
+## CI shape
+
+`.github/workflows/ci.yml` installs both Node (with pnpm cache) and Bun, then runs in order:
+
+1. `pnpm install --frozen-lockfile` (host)
+2. `bun install --frozen-lockfile` in `container/agent-runner/` (container)
+3. `pnpm run format:check`
+4. `pnpm exec tsc --noEmit` (host typecheck)
+5. `pnpm exec tsc -p container/agent-runner/tsconfig.json --noEmit` (container typecheck)
+6. `pnpm exec vitest run` (host tests)
+7. `bun test` in `container/agent-runner/` (container tests)
+
+Any failure fails the PR.
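The seven CI steps listed above could be reproduced locally before pushing. The following is a minimal sketch under stated assumptions, not something this change adds: `step`, `ci_local`, and the `DRY_RUN` variable are hypothetical names, and the script assumes it is run from the repository root with `pnpm` and `bun` already on `PATH`.

```shell
#!/usr/bin/env bash
# Sketch only: mirrors the CI order described in docs/v2-build-and-runtime.md.
# `step`, `ci_local`, and DRY_RUN are hypothetical names, not part of the repo.
set -eu

# Print the step, then run it from the given directory unless DRY_RUN=1.
step() {
  local dir="$1"
  shift
  echo "[$dir] $*"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    return 0
  fi
  (cd "$dir" && "$@")
}

# Run the seven checks in workflow order; the && chain stops at the first failure.
ci_local() {
  step . pnpm install --frozen-lockfile &&
  step container/agent-runner bun install --frozen-lockfile &&
  step . pnpm run format:check &&
  step . pnpm exec tsc --noEmit &&
  step . pnpm exec tsc -p container/agent-runner/tsconfig.json --noEmit &&
  step . pnpm exec vitest run &&
  step container/agent-runner bun test
}
```

Sourcing the file and calling `DRY_RUN=1 ci_local` only prints the plan; calling `ci_local` without `DRY_RUN` runs each command in its own subshell so the working directory of the caller is left unchanged.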
+
+## Key invariants
+
+- **Session DBs must use `journal_mode=DELETE`.** WAL's `-shm` memory-map doesn't cross VirtioFS between host and guest. See the doc comment at the top of `container/agent-runner/src/db/connection.ts` and `src/session-manager.ts`.
+- **Named SQL parameters in the container require the prefix in JS object keys.** `bun:sqlite` does not auto-strip `@`/`$`/`:` the way `better-sqlite3` does on the host. Use `$name` in both SQL and keys: `.run({ $id: msg.id })`. Positional `?` params work normally.
+- **Agent-runner tests run under `bun:test`, not vitest.** `vitest.config.ts` excludes the `container/agent-runner/` tree because vitest runs on Node and can't load `bun:sqlite`.
+- **No tsc build step in the container image.** Re-adding one would reintroduce the ~200-500ms per-session-wake cost we removed.
+- **Global container CLIs stay on pnpm, not Bun.** `agent-browser`, `@anthropic-ai/claude-code`, `vercel` and any future Node CLIs the agent invokes should be pinned versions under the Dockerfile's pnpm global-install block. `bun install -g` would bypass the pnpm supply-chain policy.
+
+## Migration history
+
+This structure replaced a uniform npm-on-Node stack across both host and container. The pnpm migration landed first (PR #1771) to bring the host under supply-chain policy, then the container moved to Bun to eliminate native-module compilation and the per-wake tsc step. The split was chosen over going full-Bun because Baileys' native deps are the main risk surface on the host — the container has no such deps, so it benefits from Bun without taking the risk.
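The `bun:test` invariant listed under Key invariants lends itself to a mechanical guard. The following is a rough sketch of one way to check it, not something this change includes: `check_no_vitest_imports` is a hypothetical helper, and the default path simply reuses the `container/agent-runner/src` tree named in the doc.

```shell
#!/usr/bin/env bash
# Sketch only: fails if any file under the agent-runner source tree imports vitest.
# `check_no_vitest_imports` is a hypothetical helper, not part of the repo.
set -eu

check_no_vitest_imports() {
  local root="${1:-container/agent-runner/src}"
  if [ ! -d "$root" ]; then
    echo "error: directory not found: $root" >&2
    return 2
  fi
  # -r recurses, -n prints line numbers, -E matches from 'vitest' or from "vitest".
  if grep -rnE "from ['\"]vitest['\"]" "$root"; then
    echo "error: agent-runner tests must import from bun:test, not vitest" >&2
    return 1
  fi
  return 0
}
```

Called with no argument from the repository root, it scans the default tree and prints each offending file and line before returning non-zero.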