A turn that ends in a non-retryable provider error (e.g. an Anthropic
403 billing_error) comes back from the streaming SDK as a result with
is_error=true and no <message> envelope. dispatchResultText treated it
as scratchpad and dropped it, then the poll-loop pushed a re-wrap nudge
-> new turn -> same error, re-hammering the gateway until idle-kill. The
user saw silence.
- providers/claude.ts: surface is_error on the result event, and fall
back to errors[] for the message text (error subtypes carry no result).
- poll-loop.ts: when a result has no <message> blocks and is_error, deliver
the notice verbatim to the originating channel and skip the nudge.
Verified live (real agent image + SDK, 403 mock): the notice is delivered
to the channel and the retry loop is gone.
Refs #2751
The "read imported-agent-memory.md, treat it as binding" doctrine sat in the
memory definition that every group loads, but it only matters when an import
actually happened. Move it into the /migrate-memory skill — the step that
writes the imported file and its index pointer (which the agent inlines into
its prompt each turn) — and drop the always-on block from definition.md.
Addresses review feedback on #2756.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The agent's global Node CLIs (claude-code, agent-browser, vercel) were each
a hardcoded ARG + RUN layer in the Dockerfile, so adding or bumping one meant
editing the Dockerfile — a code reach-in every tool-installing skill had to make.
Move the tool list into container/cli-tools.json. A skill now adds a CLI by
appending a {name, version} entry (a json-merge) — the safest change shape:
deterministic, idempotent, removable. install-cli-tools.sh parses the manifest
with node (no new jq dep), writes the per-tool only-built-dependencies opt-ins,
and runs one pinned `pnpm install -g`, so the pnpm supply-chain path is unchanged.
Behavior is byte-for-byte: same opt-ins, same pinned installs. agent-browser is
now pinned (0.27.1, what `latest` last resolved to) instead of floating.
container/cli-tools.test.ts guards the seam: red if a baseline tool is dropped,
a version unpins, or the Dockerfile wiring / pnpm path is removed.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Make the agent provider a first-class, operator-chosen property instead of a
Claude-only assumption. Trunk gains the seams; the actual non-default payloads
(Codex first) install from the `providers` branch.
Setup
- A provider registry feeds a hard-wired setup picker (Claude | Codex). Picking
a non-default provider installs its payload (setup/add-codex.sh, channel-style),
runs a vault-only auth walkthrough (--step provider-auth), and records the pick
on the first agent before its first spawn.
- Picking Claude changes nothing — default installs are byte-for-byte unaffected.
Provider as a DB property
- Provider lives on container_configs.provider (materialized to container.json,
read by resolveProviderName). Creation stays provider-agnostic; the picked
provider is applied via the picked-provider seam. The deprecated
agent_groups.agent_provider path is not used.
Switching + memory
- Switch a live group with `ncl groups config update --provider` + restart.
- Memory never migrates at runtime — each provider keeps its own store. The
/migrate-memory skill carries a group's memory across a switch in either
direction (flat CLAUDE.local.md <-> memory/ scaffold). group-init seeds an
imported-agent-memory note for non-default providers; the runner's memory
definition reads it first turn. See docs/provider-migration.md.
No install-wide default, no runtime provider guard — switching is operator-by-
convention, consistent with the no-install-gating posture.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Inverts conversation archiving into an optional onExchangeComplete provider
hook: the runner never archives on a provider's behalf, and the markdown
writer ships with the provider that needs it. Dormant for the default
provider.
Slash commands now interrupt an in-flight turn — a runner-handled command
(/clear, /compact, /cost, …) arriving mid-turn aborts the active stream and
runs immediately instead of waiting out the turn.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Adds a provider capability (usesMemoryScaffold) and a container-side boot
scaffold that materializes a persistent memory/ tree for providers that opt
in. Dormant for the default provider — the scaffold is only built when a
provider declares the capability, so existing installs are byte-identical
(asserted by a boot-gate wiring test).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
create_agent writes central-DB state (agent_groups, container_configs,
agent_destinations) and scaffolds host filesystem state, but the only
gate lived inside the untrusted container and is bypassed by writing the
outbound system row directly (the "host re-checks permission" comment was
false). Authorize host-side by CLI scope: trusted owner agent groups
(global scope) create sub-agents directly; confined groups require admin
approval via requestApproval. Adds regression tests for the branch.
Alternative to #2383 (which denies confined groups outright); co-authored
from that work.
Co-Authored-By: hinotoi-agent <paperlantern.agent@gmail.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
A bare * in the pre-filled secret_url path doesn't survive (the gateway
URL-encodes everything, so an unencoded * collapses to just /, which only
exact-matches the path /). Leave the path blank instead so the created
secret matches all of huggingface.co, not a single endpoint.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Drop "(host pattern pre-filled)" and "— no restart needed" from the HF
setup instructions.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The not-signed-in message hardcoded both a local and a hosted OneCLI
dashboard URL because the container can't tell which gateway it's behind.
But the gateway already tells us: a credential-less proxied request comes
back with the right URL in its error body —
- credential_not_found → secret_url (pre-filled "new secret" form)
- access_restricted → manage_url (grant this agent access)
- app_not_connected → connect_url
Capture whoami's body + status (drop -f so the JSON survives the 401),
extract that URL, and present it. It's always the correct gateway, local
or hosted, with zero extra wiring. The secret_url's pre-filled `path`
defaults to the failing request path (/api/whoami-v2), so broaden it to
/* — otherwise the created secret wouldn't cover the upload endpoints.
Falls back to generic text when there's no gateway JSON to read.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The default OneCLI secret mode for auto-created agents is `all`, not
`selective` — a fresh agent created via ensureAgent({name, identifier})
comes back with secretMode "all", so matching vault secrets inject
automatically. Drop the now-unnecessary per-agent assignment step.
- upload-trace.ts: remove step 3 (set-secret-mode) from the not-authed
message; creating the token and adding it to the vault is enough
- CLAUDE.md: trim the secret-mode gotcha to reflect `all` as the default
- init-onecli skill: replace stale `onecli start` (gone in 1.4.x) and the
`ps aux | grep onecli` check with the real Docker Compose start path
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds a runner-handled /upload-trace slash command (admin-gated, like /clear)
that uploads the current session's Claude Code transcript to the user's own
private {hf_user}/nanoclaw-traces dataset, browsable in the HF Agent Trace
Viewer. The transcript is already in the format the viewer auto-detects, so
the command just locates the newest one and pushes it via the Hub commit API.
Auth is handled by the OneCLI gateway: curl goes out through the injected
HTTPS_PROXY, which adds the user's HF token — no credential ever touches
agent code. A missing/unassigned token yields a clear setup message.
- container/agent-runner/src/upload-trace.ts: isUploadTraceCommand() + uploadTrace()
- poll-loop.ts: recognize and handle /upload-trace in the runner
- command-gate.ts: admin-gate /upload-trace on the host
- upload-trace.test.ts: unit + integration coverage for the command
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
claude-code CLI 2.1.128 -> 2.1.154 (Dockerfile build-arg). agent-runner SDK 0.2.128 -> 0.3.154: the 0.3 major moved @anthropic-ai/sdk and @modelcontextprotocol/sdk from regular deps to peer deps, so add @anthropic-ai/sdk ^0.100.0 as a direct dep and raise @modelcontextprotocol/sdk to ^1.29.0. Regenerate bun.lock. Typecheck + agent-runner tests pass.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The agent-runner runs the Agent SDK with settingSources: ['project', 'user'], which omits 'local'. Per the SDK docs the 'local' source is what loads CLAUDE.local.md (the 'project' source loads CLAUDE.md). So every group's CLAUDE.local.md is silently never read, even though container/CLAUDE.md tells each agent to use it as per-group memory.
Closes#2185.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The follow-up poll catches and logs SQLite errors but never recovers
from them. On Docker Desktop macOS, the kernel page cache for the
inbound.db bind mount can latch a torn snapshot mid-host-write (a known
virtiofs / gRPC-FUSE coherency issue), after which every fresh
openInboundDb() in the same process sees the same broken view and
emits 'database disk image is malformed' at the poll rate (2/sec).
Reopening the DB handle inside the container does not recover — only
a fresh container mount does. The fix: after CORRUPTION_STREAK_EXIT
consecutive corruption errors (~5s), log a clear message and
process.exit(75) so host-sweep respawns the container with a fresh
mount. Transient single torn reads are still tolerated.
- Add isCorruptionError() helper covering the three SQLite read-side
corruption symptoms (disk image malformed, SQLITE_CORRUPT, file is
not a database).
- Add streak counter scoped to processQuery's pollHandle so it resets
on any successful or non-corruption error.
- Add unit tests for the matcher.
Refs the cross-mount invariants documented in db/connection.ts:11-18.
fe2e881b (#2556) removed the <messages> wrapper from formatChatMessages
so the Claude Agent SDK calls the API instead of emitting a synthetic
stub, but poll-loop.test.ts still asserted the wrapper. The test has
failed on every PR built against main since. Assert the current shape:
no envelope, one self-contained <message> block per message.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CLAUDE_TRANSCRIPT_ROTATE_AGE_DAYS=0 (or negative) is documented to
disable age-based rotation, but transcriptRotateAgeMs() routed it
into the same branch as an unset var and returned the 14-day default.
Sessions intentionally configured to stay long-lived were still
rotated at 14 days, causing unexpected resets and context loss.
Distinguish unset/non-numeric (default 14d) from an explicit
non-positive override (Infinity = disabled; size alone governs).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A long-lived hub session never rotates its continuation, so the on-disk
.jsonl grows without bound — days of history plus base64 image blocks the
agent Read (screenshots from QA lanes, etc.). The SDK reloads the whole
transcript on every --resume, and past a threshold the first turn alone
exceeds the host's 30-min idle ceiling: the container is SIGKILLed before
it can reply, then the next message repeats the cycle forever. Symptom:
a hub that was responsive for days suddenly goes silent on a heavy turn.
Before resuming, the Claude provider now checks the transcript backing the
stored continuation; if it exceeds a size cap (default 12MB) or age cap
(default 14 days, from the first entry's timestamp) it archives a markdown
summary to conversations/ and starts a fresh session. Both caps are
operator-overridable via CLAUDE_TRANSCRIPT_ROTATE_BYTES /
CLAUDE_TRANSCRIPT_ROTATE_AGE_DAYS. The PreCompact archiver is refactored
into a shared archiveTranscriptFile() reused by the rotation path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When 2+ pending messages were bundled into <messages>...</messages> at
container/agent-runner/src/formatter.ts:162-167, the Claude Agent SDK
responded with a synthetic stub (model="<synthetic>", stop_reason=
"stop_sequence", content="No response requested.") instead of calling
the real API. The poll loop never yielded a `result` event, so the
inbound message was never marked completed; the container exited; the
next sweep tick respawned it with the same batch; same synthetic; the
transcript file ballooned with each retry until tries=5 → failed.
Single-message turns (which skipped the wrapper) worked normally — the
SDK's heuristic appears to treat the wrapped envelope as a context dump
rather than a real user turn. Each `<message id=... from=...>...</message>`
block is already self-contained, so dropping the outer wrapper lets the
N>1 case work the same way the N=1 case always has.
Fix:
function formatChatMessages(messages: MessageInRow[]): string {
return messages.map(formatSingleChat).join('\n');
}
Updates one existing test that asserted on the envelope, and adds two
regression tests: one negative (no `<messages>` wrapper), one positive
(each inbound row produces a `<message>` block in order).
Confirmed working in a real install: two stuck lanes recovered after
reducing their pending queue to 1 message, and both produced normal
replies from claude after the wipe + this fix were both applied (the
wipe alone wasn't enough — a fresh session given the same batch shape
hit the same synthetic loop).
Refs nanocoai/nanoclaw#2555 for full repro + transcript evidence.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Teaches agents WhatsApp's mention syntax (@<phone-digits>, never display
names) and where to find the sender's phone JID in inbound metadata
(content.sender). Without this, agents default to @<displayName>, which
WhatsApp can't tag — it just renders as plain text with no notification.
Two files:
- SKILL.md — frontmatter + description so the Claude Agent SDK can
discover it via skill metadata for ad-hoc lookups.
- instructions.md — always-on guidance. claude-md-compose.ts inlines
any skill that ships an instructions.md into every group's CLAUDE.md
on container spawn, so the rule is in the agent's context for every
reply (not just when the agent decides to invoke the Skill tool).
Mirrors the existing container/skills/slack-formatting/ layout for the
analogous Slack mrkdwn rules. Pairs with the adapter-side fix on the
`channels` branch that wires `mentions` through to Baileys' contextInfo
— both layers are needed for tags to render end-to-end.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Follow-up to #2467. The trailing "anything outside these tags is also
treated as scratchpad" clause contradicted the rest of the system prompt,
which requires bare text to be wrapped in `<message>` blocks. Removing it
keeps the description focused on what `<internal>` actually does.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The welcome skill told the agent to send the greeting via `send_message`,
but the destinations system prompt also requires the final response to
be wrapped in `<message to="…">` blocks (since 1d4d920). The agent
followed both, sending the greeting once via the MCP tool and once via
the wrapped final output.
- welcome/SKILL.md: drop the mechanism — "send a short, warm greeting"
lets the system prompt steer how it's delivered.
- destinations.ts: reframe `<message>` blocks and `send_message` as the
same delivery surface, with the explicit note that each call/block
lands as its own message — so they compose into a sequence rather than
reading as additive duplicates of the same content.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The parenthetical "(single-destination: just write)" was stale after
9db39b2 removed the bare-text routing fallback. Agents following this
hint had their responses silently dropped to scratchpad.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When the agent outputs bare text without <message to="..."> blocks,
nothing gets delivered — silent failure. Now the poll-loop pushes a
one-shot correction back into the active query telling the agent to
re-send with proper wrapping. Capped at once per user turn to avoid
loops; resets when a new follow-up message arrives.
Also updates destination instructions to require explicit <internal>
wrapping for scratchpad instead of treating bare text as implicit
scratchpad.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tell the compactor to include the <message to="name"> wrapping reminder
verbatim at the END of the summary so it's the last thing the agent sees
after compaction. Previously the instruction just asked to "preserve"
routing info, which the compactor could place anywhere or summarize away.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The compacted event handler injected a system-tagged reminder into the
live query after SDK auto-compaction, which caused the agent to send
an unintended message. Reverts the four changes from #2327:
- Remove `compacted` variant from ProviderEvent union
- Restore `result` yield for compact_boundary in ClaudeProvider
- Remove compacted event handler and getAllDestinations import in poll-loop
- Remove compaction integration tests and CompactingProvider helper
Closes#2325 differently — the reminder approach is not viable.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The container opens inbound.db read-only, so it can't ALTER TABLE.
If the host hasn't run migrateMessagesInTable yet (e.g., container
rebuilt before host restart), the on_wake column won't exist and
the query crashes, causing a restart loop.
Detect the column via PRAGMA table_info and conditionally include
the filter clause.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sweep of outbound strings, doc URLs, comments, and clone instructions
that were missed in the original org rename. One both-match case in
setup/lib/channels-remote.sh (URL detection) accepts either name so
existing forks with a `qwibitai` remote continue to resolve cleanly;
everywhere else is a straight rename.
Historical mentions left intact:
- CHANGELOG.md (v2.0.0 entry, frozen history)
- .claude/skills/add-gmail-tool/SKILL.md (pre-v2 qwibitai skill — historical)
- repo-tokens/badge.svg (auto-regenerated by update-tokens.yml)
12 patch versions ahead. The 2.1.120 binary baseline introduced a
number of plugin and skill behaviors that have since landed in the
public Claude Code docs: ${CLAUDE_EFFORT} substitution, settled
`arguments` field in skill frontmatter, plugin `channels` field.
No breaking changes for nanoclaw's runtime contract. Verified by
running container/skills/{agent-browser,vercel-cli,slack-formatting}
under the bumped image; all three load and execute as expected.
SDK at ^0.2.116 (caret) remains compatible with claude-code 2.1.128.
Bumping CLAUDE_CODE_VERSION invalidates the pnpm install layer in
container/Dockerfile and triggers a full rebuild of the agent image.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Allow individual agent groups to opt into different models or effort levels
without changing host-wide defaults. Useful when one group is high-stakes
(opus, high effort) but most are routine (sonnet/haiku, low effort).
container.json gains two optional fields:
- model: alias ("sonnet" | "opus" | "haiku") or full model ID
- effort: "low" | "medium" | "high" | "xhigh" | "max"
Both omitted = SDK default (current behavior). The host plumbs them as
NANOCLAW_MODEL / NANOCLAW_EFFORT env vars at container spawn time; the
agent-runner reads them in providers/index.ts and threads through to the
provider via ProviderOptions. The Claude provider passes them straight to
sdkQuery options.
`effort` is currently typed as `any` because the @anthropic-ai/claude-
agent-sdk type doesn't surface it yet — passing it through still works at
runtime via the SDK's loose option handling. Drop the cast once the SDK
adds an `effort` field to its options type.
One-liner in cli.instructions.md pointing to `ncl groups config help`.
Each config operation's description now says whether restart or rebuild
is needed — agent discovers it via progressive disclosure.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Decouple container restart from config updates — config CLI ops now only
write to the DB; restart is a separate `ncl groups restart` command with
--rebuild and --message flags. Add on_wake column to messages_in so wake
messages are only picked up by a fresh container's first poll, preventing
dying containers from stealing them during the SIGTERM grace window.
killContainer accepts an onExit callback for race-free respawn. Agent-
called restart auto-scopes to the calling session.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
`ncl groups config get` now works alongside `ncl groups config-get`.
Parser joins all positionals with dashes; dispatcher falls back by
trimming the last segment as a target ID (`ncl groups get abc123`).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- genericList now accepts column filters (--flag value) and LIMIT (default 200)
- Remove early inDb.close() in container pollResponse to avoid double-close
- Document filtering and --limit in cli.instructions.md
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Rename the CLI binary, socket path, container wrapper, error prefixes,
and all references from `nc` to `ncl`. Add ~/.local/bin symlink during
setup and pnpm script alias.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>