Chapter 08: Production: Daemon, Skills & Plugins#
Prerequisites: Chapter 07 (Routing, Compaction & Guardrails)
In this chapter:
- How a persistent agent survives restarts without losing work (append-only log + checkpoint pattern)
- How a skill library lets your swarm get smarter over time without retraining
- How the plugin pattern lets non-engineers extend an agent system at runtime
- A concrete walkthrough of Claude Code's plugin format: build one, install it, invoke it end-to-end
1. Motivation#
Your Chapter 07 swarm has a hook bus, approval gates, injection defence, and cost tracking. It is safe. It is also ephemeral. You start it, it runs, it exits. If the host restarts at hour five of a six-hour job, you begin again from zero.
A ship-ready agent needs three things the Chapter 07 swarm does not have.
It needs a durable daemon. Long-running work implies crashes. A power cut, a deploy, an OOM kill: any of these can happen mid-task. The daemon pattern is four decades old; it is what lets a process stop and resume without corrupting its own state.
It needs a skill library. Your swarm solved a config-file parsing problem on Tuesday. On Thursday a different worker reinvents the same twelve lines. Every reinvention costs tokens, latency, and variance. A skill library captures the solution once and hands it to the next worker as a ready-made snippet.
It needs a plugin system. The agent you build is not the one shipped to users. Users want to add their own tools, their own prompts, their own integrations, all without editing your source tree. The plugin pattern draws a line between the runtime you own and the domain extensions other people install. Non-engineers can drop a folder into place and their extension loads. They can share it with a colleague. They can version it. They can remove it without breaking anything.
This chapter builds all three. Section 2 covers the daemon pattern: what ticks, what gets logged, what is fail-closed. Section 3 handles crash recovery: the append-only invariant that makes replay correct. Section 4 builds the skill library, with a worked tutorial distilling a skill from a real trace. Section 5 is new: the plugin pattern in the abstract, then Claude Code's concrete format as the worked example. None of the three is optional for a production deployment. A swarm missing any one of them will either lose work on the next restart, burn tokens reinventing last week's solution, or force you into a code freeze every time a user asks for a new integration.
2. The daemon pattern#
A script runs once. A background daemon (we call it KAIROS) ticks on a schedule, reads its own append-only log, and decides whether to act. The decision point is the key design move: KAIROS is not a cron job. Every tick is a call into a model that inspects state and chooses what to do, constrained by what has been previously authorised.
The word "daemon" is older than Unix. Ken Thompson and Dennis Ritchie's first Unix used background processes for printers and terminals. But the engineering discipline of daemon lifecycle management came later. Erlang/OTP's supervision trees (Armstrong, Virding, Williams, 1986; formalised in OTP around 1998)1 gave us the "let it crash" philosophy: don't try to prevent worker failures; contain them and restart automatically. systemd (Poettering, 2010)2 formalised the daemon contract on Linux: PID file, sd_notify(READY=1), clean SIGTERM handling. KAIROS inherits from both traditions.
The tick loop#
The tick loop has six responsibilities (keep alive, audit queues, inspect health, retry failures, observe metrics, schedule tasks), which is why the mnemonic is KAIROS. In code, the structure is small:
async def _tick(self) -> None:
    pending = self._pending_tasks
    events = self._new_events[-10:]
    log_tail = self._read_log_tail(n=50)
    prompt = self._build_prompt(pending, events, log_tail)
    decision = await call_agent(..., prompt=prompt, system=DAEMON_SYSTEM)
    # decision is JSON: {"action": "none" | "execute", ...}
    self._append_log(f"DECISION {decision}")
    if decision["action"] == "execute" and not decision["requires_human"]:
        asyncio.create_task(self._dispatch(decision["task"]))
[full: swarm/daemon/kairos.py]
Three design choices matter.
Decision before action. The LLM call returns a structured decision (action, task, reason, requires_human). Nothing executes until the decision is appended to the log. The daemon never acts without a logged justification.
Fail-closed on missing authorisation. The system prompt (DAEMON_SYSTEM in kairos.py) instructs: "Irreversible actions require prior explicit user authorisation. Inferred permission is not sufficient." If the daemon is unsure, the correct output is {"action": "none"}. Doing nothing is always safe; doing the wrong thing may not be reversible.
Tick interval tuned to resolution. If your finest-grained scheduled task runs every five minutes, a 60-second tick is fine. Finer wastes cycles and tokens; coarser adds jitter. The daemon in kairos.py defaults to 300 seconds.
Task errors inside the tick are caught and logged, not re-raised. The daemon survives its workers crashing. This is Erlang's supervision discipline at the task level. A task that always fails will be logged on every tick with a task_error event; a monitoring pipeline can alert on the N-th consecutive failure and circuit-break the task out of rotation.
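The circuit-break idea can be sketched in a few lines. This is illustrative, not part of the kairos.py API; the class and method names are assumptions:

```python
from collections import defaultdict

class TaskCircuitBreaker:
    """Trip a task out of rotation after N consecutive failures.

    Sketch of the monitoring rule above: any success resets the
    streak, and an "open" breaker means stop dispatching the task.
    """

    def __init__(self, max_consecutive: int = 3) -> None:
        self.max_consecutive = max_consecutive
        self._failures: dict[str, int] = defaultdict(int)

    def record(self, task_id: str, ok: bool) -> None:
        # A success resets the streak; a failure extends it.
        self._failures[task_id] = 0 if ok else self._failures[task_id] + 1

    def is_open(self, task_id: str) -> bool:
        # "Open" in circuit-breaker jargon: do not dispatch this task.
        return self._failures[task_id] >= self.max_consecutive
```

The tick loop would consult `is_open()` before dispatching and call `record()` when a `task_complete` or `task_error` event lands.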
Scheduled versus event-driven ticks#
There are two dispatch styles inside a tick. Scheduled tasks fire on a cron-like expression: "every 300 seconds", "every day at 09:00 UTC", "on the hour". ScheduledTask.cron_expr = "every_300s" combined with task.last_run tracking is enough to implement a simple scheduler without pulling in a full cron parser. Event-driven tasks fire because something new arrived in the inbox: a webhook, a queue message, an inbound event appended via daemon.add_event(...). The tick processes both kinds.
On first startup last_run = 0.0, so scheduled tasks fire immediately on the first tick. A freshly started daemon should not wait one full interval to do its first heartbeat; otherwise a restart looks indistinguishable from a dead process for the interval's duration.
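The due-check described above fits in a few lines. The field names follow the text; the `cron_expr` parsing (only the `every_Ns` form) is illustrative, not a full cron parser:

```python
from dataclasses import dataclass

@dataclass
class ScheduledTask:
    """Minimal interval scheduler sketch from the text above."""
    name: str
    cron_expr: str          # only "every_Ns" is handled here
    last_run: float = 0.0   # 0.0 → fires on the very first tick

    def is_due(self, now: float) -> bool:
        interval = float(self.cron_expr.removeprefix("every_").removesuffix("s"))
        return now - self.last_run >= interval

    def mark_run(self, now: float) -> None:
        self.last_run = now
```

Because `last_run` defaults to `0.0`, `is_due()` is true on the first tick after any restart, which is exactly the immediate-heartbeat behaviour the daemon needs.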
The lifecycle diagram#
stateDiagram-v2
[*] --> STOPPED
STOPPED --> STARTING : start()
STARTING --> RUNNING : log("daemon_start")
RUNNING --> TICKING : interval elapsed
TICKING --> RUNNING : tick complete
TICKING --> STOPPING : SIGTERM mid-tick
RUNNING --> STOPPING : SIGTERM between ticks
STOPPING --> STOPPED : log("daemon_stop"), flush
STOPPED --> [*]
note right of TICKING
Task errors caught here.
Failed tasks are logged,
not re-raised. Daemon
continues to next task.
end note
SIGTERM handling#
A signal with no handler kills the process immediately (mid-task, mid-write). The correct handler sets a flag and lets the current tick finish:
def handle_sigterm(sig, frame):
    daemon._running = False
    daemon.log.append("daemon_shutdown", {"signal": "SIGTERM"})

signal.signal(signal.SIGTERM, handle_sigterm)
This gives active workers a chance to write their checkpoints before exit. Kubernetes defaults to a 30-second grace period between SIGTERM and SIGKILL; systemd's TimeoutStopSec defaults to 90 seconds. Either is plenty of time for a tick to drain, provided the handler does not hang.
Why append-only logs beat mutable state#
A file write is not atomic on most operating systems. If your process crashes mid-write to a JSON file, you get a truncated or corrupted file. JSONL (JSON Lines) sidesteps this: each write is one line ending in \n. A partial last line is skipped on recovery; every prior line is intact.
# UNSAFE: mutable state file
Path(path).write_text(json.dumps(state))  # crash → corrupt file

# SAFE: append-only log
with open(path, "a") as f:
    f.write(json.dumps(record) + "\n")  # crash → prior lines intact
Kleppmann's Designing Data-Intensive Applications (2017) documents how every major database (PostgreSQL, Kafka, LevelDB) uses append-only writes internally.3 Jay Kreps' 2013 essay "The Log"4 argues the same pattern is the central abstraction of distributed systems. AppendOnlyLog is a single-machine Kafka for your daemon.
There is also a transparency argument. Every action the daemon takes (task_start, task_complete, task_error, daemon_start, daemon_stop) is logged with timestamp and event type. A human operator can open logs/daemon.jsonl and reconstruct exactly what the daemon did, when, and with what outcomes. There are no hidden state transitions. Anthropic builds this into Claude's own systems; Claude Code's background agent writes event logs for the same reason. You cannot trust a system you cannot observe.
Observing the daemon in operation#
The daemon log is your primary debugging tool. After a run, standard Unix utilities go a long way:
# Which tasks errored in the last day?
grep '"task_error"' logs/kairos.jsonl | tail -20
# How many decisions resulted in action=none vs action=execute?
grep '"action"' logs/kairos.jsonl | awk -F'"action":' '{print $2}' | sort | uniq -c
# Checkpoint age, newest first
ls -lt checkpoints/ | head -5
Because every record is one line of JSON with a timestamp, analysis is a jq away. jq 'select(.event == "task_complete") | .data.id' logs/kairos.jsonl | sort | uniq -c gives you a per-task run count. No bespoke dashboard needed.
Deployment readiness checklist, before you hand the daemon to a production orchestrator:
| Signal | What to check | Tool |
|---|---|---|
| Liveness | daemon_start event within last N minutes | tail of log |
| Task health | task_error rate per hour | grep + count |
| Checkpoint age | newest checkpoint file timestamp | ls -lt |
| Skill library size | len(library._load_all()) | Python REPL |
| Memory | daemon RSS trend over time | ps or cadvisor |
Hook any of these into your alerting stack. The convention on systemd hosts is to run the daemon under Type=notify so sd_notify(READY=1) signals startup completion and a missing WATCHDOG=1 heartbeat triggers a restart. Under Kubernetes, a livenessProbe that reads the log tail works: if the newest task_start or decision record is older than three tick intervals, something is wrong.
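The log-tail liveness probe can be sketched directly against the JSONL file. This assumes each record carries a `ts` epoch timestamp; adjust the field name to whatever your log actually writes:

```python
import json
import time
from pathlib import Path

def daemon_is_live(log_path: str, tick_interval: float, slack: int = 3) -> bool:
    """Liveness probe sketch: live if the newest parseable record is
    younger than `slack` tick intervals. A torn last line from a
    mid-append crash is skipped, matching the JSONL recovery rule."""
    lines = Path(log_path).read_text().splitlines()
    for line in reversed(lines):          # newest record first
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue                      # truncated line: try the one before
        return time.time() - record["ts"] < slack * tick_interval
    return False                          # empty or unreadable log → not live
```

Wrap this in a tiny script and point a Kubernetes `livenessProbe` exec check at it; a non-zero exit triggers the restart.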
3. Crash recovery#
The daemon survives its own restart. The trick is the ordering: every state transition is appended to the log before execution. On restart, you replay the log and reconstruct state. Because appends are crash-consistent, replay is correct even if the crash happened mid-append.
The invariant#
Write the log entry, then do the thing. Never the reverse. Combined with a checkpoint written at phase boundaries:
# CORRECT: checkpoint before transition
self.checkpoint(phase="planning")
await self.run_working_phase()
# WRONG: checkpoint after transition
await self.run_working_phase()
self.checkpoint(phase="working") # crash inside working → wrong state
This is write-ahead: the same rule that runs PostgreSQL's WAL. The database writes the log record before applying the change to the data page. RecoveryManager does the same: a crash during a phase leaves the checkpoint pointing to the previous completed phase, which is the correct restart point.
The recovery flow#
sequenceDiagram
participant D as KairosDaemon
participant RM as RecoveryManager
participant W as WorkerTask
participant L as AppendOnlyLog
D->>RM: list_incomplete
RM-->>D: run_abc done u1 u2
D->>L: append resume_run
D->>W: run u3 u4 u5
W->>L: append task_start u3
W->>RM: save_checkpoint done u1 u2 u3
W->>L: append task_complete u3
Note over D,L: Crash here - u4 u5 not started
D->>RM: list_incomplete
RM-->>D: run_abc done u1 u2 u3
D->>W: run u4 u5 skip u1 u2 u3
The three pieces#
AppendOnlyLog: JSONL event log with a single guarantee. Each append() is one line plus \n. read_all() wraps json.loads(line) in a try/except so a truncated last line from a crash is skipped; every prior line parses cleanly.
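A minimal version of that contract, for reference. The real class in the chapter's codebase carries more metadata; this sketch keeps only the timestamp, the one-line-per-append rule, and the torn-line skip:

```python
import json
import time
from pathlib import Path

class AppendOnlyLog:
    """JSONL log sketch: one record per append, torn last line
    skipped on read, every prior line intact."""

    def __init__(self, path: str) -> None:
        self.path = Path(path)

    def append(self, event: str, data: dict) -> None:
        record = {"ts": time.time(), "event": event, "data": data}
        with open(self.path, "a") as f:
            f.write(json.dumps(record) + "\n")

    def read_all(self) -> list[dict]:
        if not self.path.exists():
            return []
        records = []
        for line in self.path.read_text().splitlines():
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError:
                continue  # truncated line from a mid-append crash
        return records
```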
RecoveryManager: one checkpoint file per run ID, rewritten atomically at each phase boundary. list_incomplete() scans the checkpoint directory on daemon startup and returns any checkpoint whose phase != "complete". That is your restart queue.
Idempotent resume. If the daemon finds an incomplete run and crashes again before completing the resume, the checkpoint still points to the last completed phase. The next restart resumes from the same point. Test this by inserting a deliberate crash in your resume logic; it must be safe to crash anywhere.
The full implementation is short:
class RecoveryManager:
    def save_checkpoint(self, state: CheckpointState) -> None:
        path = self.dir / f"{state.run_id}.json"
        path.write_text(json.dumps({...}))

    def list_incomplete(self) -> list[CheckpointState]:
        return [
            CheckpointState(**json.loads(f.read_text()))
            for f in self.dir.glob("*.json")
            if json.loads(f.read_text()).get("phase") != "complete"
        ]
[full: modules/11_production_daemon/code/production.py]
A subtle failure mode: permanently failed runs never reach phase="complete" and appear in list_incomplete() on every restart. Add a max_retries field and skip runs that exceed it. Stale checkpoints are the recovery system's equivalent of a zombie process.
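The retry-cap filter might look like this. It assumes each checkpoint carries a `retries` counter that the resume path increments; that field is an addition to the minimal checkpoint shape, not part of it:

```python
import json
from pathlib import Path

def restart_queue(checkpoint_dir: str, max_retries: int = 5) -> list[dict]:
    """Stale-checkpoint fix sketch: a run that has exceeded max_retries
    stops re-entering the restart queue and is left for a human."""
    queue = []
    for f in Path(checkpoint_dir).glob("*.json"):
        state = json.loads(f.read_text())
        if state.get("phase") == "complete":
            continue
        if state.get("retries", 0) >= max_retries:
            continue  # zombie run: skip, but keep the file for inspection
        queue.append(state)
    return queue
```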
4. The skill library#
A worker solves a problem. It reads a config, retries a flaky HTTP call, parses a date in three formats. The solution is twelve lines of correct code. Today it works. Tomorrow a different worker is handed the same problem and reinvents the same twelve lines (differently, slightly wrong, at full model cost).
A skill library (the Voyager-style pattern, Wang et al. 2023)5 captures the twelve lines once and makes them searchable. Before the next worker implements, it searches. If a matching skill exists, the worker injects it into its own prompt and uses it instead of reinventing.
When to distill, when to throw away#
A skill is worth keeping when three conditions hold.
- Reusable. The snippet works beyond the specific task that produced it. "Read a text file line by line" is reusable. "Parse our company's 2024 invoice format" is too narrow.
- Small. Ten to forty lines of code. Larger and it is a module, not a skill; it belongs in a package.
- Unambiguous. A clear name, a one-line description, tags. A worker searching for "retry HTTP" must be able to recognise it.
Throw a skill away when it collides with an existing one (merge them), when it has not been retrieved in a month (archive), or when it contains a task-specific detail that leaked through distillation.
The retrieval loop#
Search is intentionally simple: keyword match across name, description, tags, and code. The implementation:
async def search(self, query: str, *, top_k: int = 5) -> list[Skill]:
    skills = self._load_all()
    query_lower = query.lower()
    scored = []
    for skill in skills:
        haystack = " ".join([
            skill.name, skill.description, skill.code, *skill.tags
        ]).lower()
        score = sum(1 for w in query_lower.split() if w in haystack)
        if score > 0:
            scored.append((score, skill))
    scored.sort(key=lambda x: x[0], reverse=True)
    return [s for _, s in scored[:top_k]]
[full: swarm/skills/library.py]
Keyword search works up to roughly 100 skills. Beyond that, precision drops: "data" starts matching thirty skills and top_k=5 picks five that all look plausible but miss the actual best match. The upgrade path is a sentence-transformers embedding plus cosine similarity; a 20-line change.
format_for_prompt(skills) renders retrieved skills as a ## Available Skills section injected ahead of the task description. The worker sees them as part of its system prompt, not as tool output, and treats them as prior knowledge.
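The rendering itself is straightforward. This is a sketch of the shape described above, assuming Skill objects with `.name`, `.description`, and `.code` attributes; the real `format_for_prompt` in library.py may differ in layout:

```python
def format_for_prompt(skills) -> str:
    """Render retrieved skills as a markdown section the worker reads
    as prior knowledge, not as tool output."""
    if not skills:
        return ""
    lines = ["## Available Skills", ""]
    for s in skills:
        lines += [f"### {s.name}", s.description, "```python", s.code, "```", ""]
    return "\n".join(lines)
```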
Library growth and plateau#
A healthy library's size curve has three phases. Week one: around ten skills, mostly generic utilities (file I/O, retry helpers, JSON parsing). Week four: thirty to eighty skills, with domain-specific patterns emerging and some near-duplicates appearing (three variants of "read config file" is a normal symptom). Week eight and beyond: growth slows as new tasks increasingly match existing skills. This plateau is the signal that your library is well-saturated for the workload.
Track search precision. What fraction of top_k results are actually useful for the worker's query? A rough proxy: how often does the worker go ahead and invoke the returned skill versus reinvent anyway? Instrument search() to log the query and the returned IDs; join against the worker's subsequent tool calls. When adoption rate drops below sixty percent, you have outgrown keyword search.
Skill decay is as important as skill accumulation. Add a last_used timestamp and prune skills that have gone 30+ days without a retrieval. A dead skill is not free: it still matches keywords and competes for top_k slots. The library is a cache, not an archive. If you need to preserve history, move pruned skills to a cold-storage JSONL file, not the active index.
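The pruning pass is a simple partition. This sketch assumes each skill record carries a `last_used` epoch timestamp, which is an addition to the base Skill shape described earlier:

```python
import time

def prune_stale(skills: list[dict], *, max_idle_days: int = 30) -> tuple[list[dict], list[dict]]:
    """Split the index into (keep, cold) by last_used age. The cold
    half goes to the archive JSONL, not back into the active index."""
    cutoff = time.time() - max_idle_days * 86_400
    keep = [s for s in skills if s.get("last_used", 0) >= cutoff]
    cold = [s for s in skills if s.get("last_used", 0) < cutoff]
    return keep, cold
```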
Sidebar: tutorial, building a skill from a trace#
The SkillLibrary API has five methods: add_skill, search, get, distill_from_trace, format_for_prompt. Everything rolls up through these. Here is a full walkthrough against the real library in swarm/skills/library.py.
Step 1 (the trace). Assume a worker just finished a task and produced a trace that looks roughly like:
TASK: Read the first 10 non-empty lines from /var/log/app.log.
TOOL CALL: read_file path=/var/log/app.log
TOOL RESULT: "...\n...\n..."
TOOL CALL: python: [l for l in text.splitlines() if l.strip()][:10]
TOOL RESULT: [...]
FINAL: "done"
Step 2 (distill). Feed the trace to distill_from_trace. In production, bus is the live HookBus; in tests, leave it None. The method calls the LLM with a short prompt ("extract one reusable skill, JSON only, or NO_SKILL") and returns either a Skill or None.
library = SkillLibrary("./skills")
skill = await library.distill_from_trace(trace, model="claude-sonnet-4-5")
# skill.name = "Read non-empty lines from a text file"
# skill.code = "def read_nonempty(path: str, n: int) -> list[str]: ..."
# skill.tags = ["file", "io", "text"]
Step 3 (test in isolation). Before trusting the skill, execute it yourself. This is a health check on the distillation:
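A minimal version of such a check, executing the distilled code in a fresh namespace. The helper name and signature are illustrative, not part of the SkillLibrary API:

```python
def smoke_test_skill(code: str, entrypoint: str, *args):
    """Exec the distilled snippet in an isolated namespace and call its
    entry point. A NameError or missing function here means the
    distillation hallucinated a dependency: throw the skill away."""
    namespace: dict = {}
    exec(code, namespace)                 # define the skill's function(s)
    fn = namespace.get(entrypoint)
    if not callable(fn):
        raise ValueError(f"skill does not define {entrypoint}()")
    return fn(*args)
```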
If the skill executes and returns the expected shape, keep it. If it references a function that does not exist in the current environment, throw it away; distillation hallucinated a dependency.
Step 4 (register). distill_from_trace already called add_skill internally, so the skill is on disk at ./skills/index.jsonl. To register a hand-written skill bypassing distillation:
skill = await library.add_skill(
    name="Retry HTTP with exponential backoff",
    description="Wrap requests.get() with tenacity retry",
    code="@retry(stop=stop_after_attempt(5), wait=wait_exponential(...))\ndef get(url): ...",
    tags=["http", "retry", "network"],
)
Step 5 (invoke from a new worker). Search before implementing:
hits = await library.search("http retry", top_k=3)
prompt_section = library.format_for_prompt(hits)
# Inject prompt_section before the task description when calling the worker.
The new worker now has the skill in its system prompt. It will use it rather than reinventing. The injection pattern (retrieved skills rendered as ## Available Skills ahead of the task) is important: skills are available context, not tool calls. The worker is not forced to use them, but the cost of checking "does a relevant skill already exist?" is one search() call against local JSONL, so the model almost always references them when they fit.
Step 6 (iterate). After running a few dozen tasks with the library in place, check success_count on each skill. Skills that never increment are dead weight; prune them. Skills with high counts but inconsistent outputs may have a bug in the snippet; hand-edit them in ./skills/index.jsonl (the library is intentionally a plain JSONL file so direct edits work).
One note on distillation economics. distill_from_trace truncates the trace at 3,000 characters because a full 50 K-token trace would cost more to distill than the skill saves. Distillation is summarisation; summarisation is cheap only when bounded.
Worked example: customer support agent#
To see the pieces fit together, consider a customer support agent. The domain is well-bounded: users have orders, orders have statuses, support tickets have priorities. Every Chapter 08 primitive earns its keep.
The KairosDaemon runs poll_new_tickets() every 60 seconds: it pulls unassigned tickets from the queue, dispatches a worker per ticket, logs the dispatch. The AppendOnlyLog captures every tool call (lookup_order, lookup_customer, create_ticket, apply_credit) with timestamps and results. After one week you have a complete audit trail: every credit applied, every escalation, every customer interaction.
The skill library accumulates patterns as the agent runs. After the first few successful resolutions of "premium customer + delayed shipment", the library holds a skill titled something like "Delayed premium shipment response, apply $10 credit, priority ticket, next-step tracking". A new agent handling a variant of the scenario retrieves the skill in one search call, injects it into its prompt, and skips the policy-reasoning step entirely. Resolution latency drops; cost per ticket drops; variance across agents drops.
After fourteen days of operation the library contains fifteen to twenty domain-specific skills on top of the generic utilities. Keyword search works fine at this scale: "shipment" returns two to four skills and all are on-topic. Crash recovery matters because a queue poller that dies mid-dispatch must not lose tickets; RecoveryManager checkpoints after each ticket's tool-call sequence completes, so a restart re-queues only unfinished work. The daemon, the log, the skills, the checkpoint file: each solves one specific failure mode, and together they turn a prototype into something you could actually deploy behind a Zendesk webhook.
5. Plugins#
Part 1: plugins as a pattern#
A plugin is a self-contained bundle of capabilities (skills, hooks, tools, configuration) that a host system loads at runtime without recompiling. The pattern is not new. Photoshop had plugins in 1990. Emacs has had them since 1976. VS Code's extension model is their direct descendant. What is new is applying the pattern to an agent runtime.
Why you want this. The agent runtime you build is general-purpose. The work the user wants done is domain-specific. If every user has to fork your runtime to add a tool, you have lost. Plugins draw a clean line between the runtime (your code) and the capabilities (their code). A user who knows Python and YAML but nothing about your internals can ship an extension.
A plugin system has five primitives, in order of importance.
Lifecycle hooks. on_load, on_unload, on_event. The host tells the plugin when it is starting, when it is stopping, and when something interesting happens. The plugin registers its behaviour at on_load and cleans up at on_unload.
Capability registration. The plugin declares what it contributes: tool schemas (in Anthropic format, same as Chapter 04), skill definitions (same shape as Section 4), event subscribers, commands, agents. The host catalogues everything at load time and exposes it through its normal interfaces.
Isolation. A bad plugin should not take down the host. Process isolation is strongest (and most expensive); module-level isolation with try/except around every plugin boundary is cheaper and usually sufficient. Either way, an exception in plugin code must not cascade into a daemon restart.
Versioning. Semantic versioning on the plugin, with a minimum host-version requirement. If the host upgrades, old plugins keep working unless the manifest says otherwise. If the plugin upgrades, the user knows before installing.
Hot-reload. Drop a new version in place; the host reloads without a restart. Not strictly required, but users expect it.
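The first three primitives (lifecycle, registration, isolation) fit in a short host sketch. None of this is a real framework API; the class and method names are assumptions for illustration:

```python
class PluginHost:
    """Minimal plugin host: plugins expose on_load(host), optionally
    on_event(name, payload). Loading is idempotent; plugin exceptions
    are contained at every boundary, never re-raised into the host."""

    def __init__(self) -> None:
        self.plugins: dict[str, object] = {}
        self.tools: dict[str, object] = {}     # capability registry

    def register_tool(self, name: str, schema: object) -> None:
        self.tools[name] = schema              # called by plugins at on_load

    def load(self, name: str, plugin) -> None:
        if name in self.plugins:
            return                             # idempotent: loading twice is a no-op
        try:
            plugin.on_load(self)
        except Exception as exc:               # isolation: a bad plugin does not
            print(f"plugin {name} failed to load: {exc!r}")  # take down the host
            return
        self.plugins[name] = plugin

    def emit(self, event: str, payload: dict) -> None:
        for name, plugin in self.plugins.items():
            handler = getattr(plugin, "on_event", None)
            if handler is None:
                continue
            try:
                handler(event, payload)
            except Exception as exc:
                print(f"plugin {name} raised on {event}: {exc!r}")
```

`on_unload` would be the inverse of `load`: remove the plugin's entries from every registry it touched.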
Compare plugins to adjacent patterns.
- Traditional Python imports are tightly coupled: the extension is compiled into the host, version-locked, and requires a redeploy to change. Good for internal code, wrong for user extensions.
- REST APIs are too remote: network hops, auth, serialisation overhead, and you cannot share process state. Appropriate for cross-company integrations, overkill for a user's personal skill.
- MCP (Model Context Protocol, Chapter 04) is a wire protocol for tool servers. Good for language-agnostic tools across process boundaries, but too heavy for a drop-in skill or prompt. Plugins can contain MCP configs (as we do in the tutorial below) without forcing every extension to be an MCP server.
- In-process callback hooks (the HookBus from Chapter 07) are tightly coupled too: the host must import the hook module, and the hook cannot carry its own dependencies.
Plugins sit between: locally loaded (cheap invocation, shared process state) but sandboxed (failures are contained, updates are independent of host).
This pattern is framework-agnostic. If you build your own agent system, put a plugin layer on it. Users who cannot touch your source can still extend it.
Plugin anatomy#
flowchart LR
host[Host runtime] -->|on_load| manifest[plugin.json<br/>name, version, author]
manifest --> skills[skills/*<br/>definitions + triggers]
manifest --> hooks[hooks.json<br/>event subscribers]
manifest --> tools[tools / agents<br/>capability definitions]
manifest --> mcp[.mcp.json<br/>external server configs]
skills -.->|registered| host
hooks -.->|installed| host
tools -.->|catalogued| host
mcp -.->|spawned| host
host -->|on_event| hooks
host -->|on_unload| manifest
At load time the host reads the manifest, then walks each sub-directory, registering whatever it finds. Registration is idempotent: loading the same plugin twice is a no-op. Unregistration at on_unload is the inverse. Every registration from on_load must have a paired teardown, otherwise hot-reload leaks state.
Part 2: Claude Code plugins (build-along tutorial)#
Claude Code (Anthropic's CLI) ships with a plugin format. A plugin is a directory containing a manifest and any of: skills, hooks, commands, agents, MCP server configs. The user installs it with /plugin install …; it becomes part of their Claude Code session from that point on.
The exact file format may evolve; the authoritative reference is the Claude Code documentation, and you should treat this section as the April 2026 snapshot.6 The layout and semantics below are what ships in the plugins on my machine, verified against Anthropic's official marketplace and the superpowers plugin by Jesse Vincent.
Canonical layout.
my-first-plugin/
├── .claude-plugin/
│ └── plugin.json # manifest: name, description, version, author
├── skills/
│ └── summarize-urls/
│ └── SKILL.md # skill definition with YAML frontmatter
├── hooks/
│ └── hooks.json # hook registrations (PreToolUse, SessionStart, etc.)
├── commands/ # slash commands (one .md per command)
├── agents/ # subagent definitions
└── .mcp.json # MCP server configs (keyed by server name)
Every directory is optional except .claude-plugin/plugin.json. A plugin that only ships one skill has exactly two files. The real-world example-plugin in Anthropic's official marketplace at .claude/plugins/marketplaces/claude-plugins-official/plugins/example-plugin/ is a good minimal reference: it demonstrates all five capability types in under ten files.
The superpowers plugin by Jesse Vincent (in .claude/plugins/cache/claude-plugins-official/superpowers/) is the richest public example. It bundles fifteen skills (brainstorming, systematic-debugging, test-driven-development, and more), a SessionStart hook that runs a shell script, and a handful of slash commands. Its plugin.json manifest is twenty lines. Read through its skills/ directory to see the range of SKILL.md conventions.
Step 1 (the manifest).
{
  "name": "my-first-plugin",
  "description": "Tutorial plugin: URL summariser skill, tool logger hook, time MCP",
  "version": "0.1.0",
  "author": {"name": "Your Name", "email": "you@example.com"}
}
Save this as my-first-plugin/.claude-plugin/plugin.json. The $schema field is optional. The name must match the directory name.
Step 2 (the skill). Skills in Claude Code are Markdown files with YAML frontmatter. The frontmatter's description is what the model sees when deciding whether to invoke the skill; write it as a trigger description ("Use this when …").
---
name: summarize-urls
description: "Fetch a URL and return a one-line summary. Use when the user
asks to summarise a webpage, article, or link."
---
# Summarise URLs
Given a URL, fetch the page, extract the main content, and return a
single-sentence summary of what the page is about.
Steps:
1. Fetch the URL with WebFetch.
2. Identify the main heading and first paragraph.
3. Return one sentence (max 25 words) capturing the page's purpose.
Save as my-first-plugin/skills/summarize-urls/SKILL.md. The directory name must match the name in the frontmatter.
Step 3 (the hook). Hooks intercept events in the Claude Code lifecycle. A PreToolUse hook fires before any tool call and can log or veto. Create my-first-plugin/hooks/hooks.json:
{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "*",
        "hooks": [
          {
            "type": "command",
            "command": "echo \"[tool] $CLAUDE_TOOL_NAME\" >> ~/.claude/tool.log",
            "async": true
          }
        ]
      }
    ]
  }
}
Every tool call now writes its name to ~/.claude/tool.log. The matcher: "*" means "all tools"; you can scope to specific tools by name. The async: true flag means the hook does not block tool execution. The real superpowers plugin uses the same hooks.json structure for its SessionStart hook.
Step 4 (the MCP server config). Point at the get_time MCP server you built in Chapter 04 Exercise 02. Create my-first-plugin/.mcp.json:
{
  "mcpServers": {
    "get-time": {
      "type": "stdio",
      "command": "python",
      "args": ["${CLAUDE_PLUGIN_ROOT}/../modules/04_tools_sandbox/solutions/mcp_server.py"]
    }
  }
}
${CLAUDE_PLUGIN_ROOT} resolves to the plugin's install directory; Claude Code substitutes it at runtime. Three MCP transport types exist: stdio (spawn a subprocess), http (connect to a URL), sse (server-sent events). Stdio is the most portable and what Chapter 04's server uses.
Step 5 (install). Claude Code installs plugins from a marketplace, which is a repository containing a marketplace.json. For local testing you can install straight from a directory:
/plugin install ./my-first-plugin
Files land in ~/.claude/plugins/cache/<marketplace>/<plugin-name>/<version>/ and the plugin is registered in ~/.claude/plugins/installed_plugins.json. Claude Code picks up skills, hooks, and MCP servers on the next session start.
Step 6 (invoke). Start a Claude Code session and issue a prompt that triggers the skill, for example: "Summarise https://example.com and tell me the current time."
What happens, in order:
1. Session starts. Claude Code loads my-first-plugin: the summarize-urls skill is registered, the PreToolUse hook is installed, the get-time MCP server is spawned as a subprocess.
2. The model reads the prompt, sees summarize-urls in its skill catalogue, decides to invoke it for the URL part.
3. The skill instructs the model to call WebFetch. The PreToolUse hook fires and appends [tool] WebFetch to ~/.claude/tool.log. The fetch runs.
4. The model produces a one-sentence summary.
5. For the time part, the model calls the get_time tool. The hook fires again ([tool] get-time:get_time). The MCP server returns the current ISO timestamp.
6. Final response combines the two.
You can tail ~/.claude/tool.log in another terminal and watch each tool call land in real time. That single chain (skill -> hook -> tool -> MCP call -> hook again) is the full plugin contract demonstrated in one user prompt.
Install mechanics in detail#
When the user runs /plugin install ./my-first-plugin, Claude Code does four things:
1. Validate the manifest. Reject the install if plugin.json is missing, if required fields are absent, or if the name conflicts with an already-installed plugin.
2. Copy into cache. Files land in ~/.claude/plugins/cache/<source>/<name>/<version>/ with the source being a marketplace name or local for a directory install.
3. Register in the index. ~/.claude/plugins/installed_plugins.json gets an entry recording the install path, version, and timestamp. This file is the source of truth for "what is installed".
4. Load on next session. Claude Code scans installed_plugins.json at session start, loads each entry's manifest, walks the sub-directories, and registers capabilities. Skills become discoverable, hooks are armed, MCP servers are spawned.
Uninstall is the reverse: remove from the index, reverse any hook installations, kill subprocess-based MCP servers, and optionally delete from cache. A hot-reload (same plugin, new version) is an uninstall-install pair with the cache overwrite in between. Idempotency matters: running install twice in a row should be a no-op, not a double-registration of hooks.
Two conventions help keep installs clean. First, pin the Claude Code minimum version in your plugin's manifest if you use features that were added after v1.0; the loader can refuse older hosts rather than fail at runtime. Second, keep any writable state your plugin needs under ${CLAUDE_PLUGIN_ROOT}/state/ rather than the user's home directory. That way uninstall actually removes everything the plugin wrote, which is what users expect.
Sharing via a marketplace#
A marketplace is the unit of distribution. Structurally, it is a Git repository with .claude-plugin/marketplace.json at its root, listing plugins either inline (for plugins stored in the same repo) or via remote URLs. Anthropic's official marketplace, claude-plugins-official, lists several hundred plugins; users add it with /plugin marketplace add anthropics/claude-plugins-public.
For your own marketplace you have two options. Inline plugins (source: "./plugins/my-first-plugin") bundle the plugin directories into the marketplace repo itself. Simplest to maintain, but every plugin update requires a marketplace repo commit. External plugins (source: {source: "url", url: "https://github.com/you/my-first-plugin.git"}) point at independent repos; each plugin has its own release cadence.
A minimal marketplace for sharing my-first-plugin with a colleague:
my-marketplace/
├── .claude-plugin/
│ └── marketplace.json # lists my-first-plugin
└── plugins/
└── my-first-plugin/ # the directory from Step 1
Your colleague runs /plugin marketplace add <your-repo> then /plugin install my-first-plugin. The indirection lets them pick up updates via git pull on the marketplace, without you re-sharing a zip every time you ship a fix.
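The tree above implies a marketplace.json along these lines. Treat it as a minimal sketch: only the source formats are taken from the text; the surrounding field names are assumptions, so check the plugin documentation for the exact schema.

```json
{
  "name": "my-marketplace",
  "plugins": [
    {
      "name": "my-first-plugin",
      "source": "./plugins/my-first-plugin",
      "description": "URL summaries, tool logging, and a get-time MCP server"
    }
  ]
}
```

Switching this plugin to an external repo later means replacing the inline source with the {source: "url", url: ...} form; nothing else in the marketplace changes.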
Step 7 (test and ship). Three checks before sharing.
Verify the plugin loads cleanly: no errors on startup, and the skill appears in /skill list.
Trigger each capability with a synthetic prompt and confirm the log is what you expect. Then package for distribution by publishing the directory to a Git repository whose root contains a .claude-plugin/marketplace.json listing your plugin, or simply share the folder and have your teammate run /plugin install against it.
The official marketplace at anthropics/claude-plugins-public on GitHub follows the same structure; the superpowers plugin's manifest lives at .claude-plugin/plugin.json and is 20 lines long. Your first plugin can be just as small.
One thing to watch: MCP servers run as subprocesses, and a buggy server takes down the whole MCP connection, not just one tool. Handle errors explicitly in your server before shipping. Claude Code isolates plugins from each other but does not isolate you from your own code.
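The fail-soft pattern for an MCP server is worth spelling out. MCP speaks JSON-RPC over stdio, so the sketch below wraps every tool handler so that a bug becomes an error response rather than a dead subprocess. This is a generic sketch of the pattern, not the real MCP SDK; the handler and error code are illustrative.

```python
import sys
import traceback


def handle_get_time(params: dict) -> dict:
    """The actual tool logic; any exception here must not kill the server."""
    from datetime import datetime, timezone
    return {"time": datetime.now(timezone.utc).isoformat()}


def safe_dispatch(handler, request: dict) -> dict:
    """Wrap a tool handler so a bug returns a JSON-RPC error object
    instead of taking down the whole MCP connection."""
    try:
        result = handler(request.get("params", {}))
        return {"jsonrpc": "2.0", "id": request.get("id"), "result": result}
    except Exception as exc:
        traceback.print_exc(file=sys.stderr)  # stack trace to stderr, never stdout
        return {"jsonrpc": "2.0", "id": request.get("id"),
                "error": {"code": -32000, "message": str(exc)}}
```

The stderr detail matters in practice: a stdio MCP server owns stdout for protocol frames, so anything a handler prints there corrupts the stream. Route all diagnostics to stderr.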
Debugging a plugin that will not load#
Plugin failures are almost always one of four things. The common symptom: /plugin install appears to succeed, but the plugin is not visible in the next session.
Manifest parse error. plugin.json is malformed JSON. Run python -m json.tool .claude-plugin/plugin.json before installing. If the parser is silent, the manifest is valid.
Directory name does not match manifest name. The directory under cache/<source>/ must match the name in plugin.json. A rename in one place without the other silently breaks loading.
Hook or MCP config references a missing file. ${CLAUDE_PLUGIN_ROOT}/hooks/run-hook.cmd must exist, and under stdio MCP servers the command must be on PATH. Relative paths are relative to the plugin root, not the user's cwd.
Skill frontmatter missing description. A SKILL.md with only a name: in its frontmatter loads, but the model never sees a trigger description and so never invokes the skill. The user observes the skill as "installed but dead". Always include a description that starts with "Use this when…".
~/.claude/plugins/ is safe to inspect directly. installed_plugins.json is the index; cache/<source>/<name>/<version>/ holds the unpacked plugin. If the cache directory is present but the entry is missing from installed_plugins.json, reinstall.
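Three of the four failure modes can be caught before install with a preflight script. A sketch, assuming the directory layout from the tutorial (.claude-plugin/plugin.json, skills/*/SKILL.md); the missing-file check for hook and MCP configs is omitted because it depends on your own config contents.

```python
import json
from pathlib import Path


def preflight(plugin_dir: str) -> list[str]:
    """Check the common plugin load failures before running /plugin install."""
    root = Path(plugin_dir)
    problems = []

    # Failure 1: manifest parse error (malformed or missing plugin.json)
    manifest_path = root / ".claude-plugin" / "plugin.json"
    try:
        manifest = json.loads(manifest_path.read_text())
    except (OSError, json.JSONDecodeError) as exc:
        return [f"manifest unreadable or malformed: {exc}"]

    # Failure 2: directory name must match the manifest name
    if manifest.get("name") != root.name:
        problems.append(
            f"directory {root.name!r} != manifest name {manifest.get('name')!r}")

    # Failure 4: every SKILL.md needs a description in its frontmatter,
    # or the skill installs but is never invoked
    for skill_md in root.glob("skills/*/SKILL.md"):
        text = skill_md.read_text()
        if text.startswith("---"):
            frontmatter = text.split("---", 2)[1]
            if "description:" not in frontmatter:
                problems.append(f"{skill_md}: no description, skill will never trigger")
        else:
            problems.append(f"{skill_md}: missing frontmatter")
    return problems
```

An empty return means the plugin clears these checks; anything else is a reason to fix before installing, since the loader's own failure mode is silence.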
6. What Goes Wrong & Onward#
The common failure modes at this layer:
- Daemon restarts during a resume. Always possible; the invariant from Section 3 makes it idempotent, so restart-during-resume just replays cleanly.
- Skill library keyword-search cliff. At around 100 skills, precision degrades; the fix is sentence-transformers embeddings plus cosine similarity, a twenty-line upgrade.
- Stale checkpoints. Permanently failed runs never reach phase="complete" and pile up in list_incomplete(); add max_retries to the checkpoint schema.
- Plugin crashes the host. Isolate at plugin boundaries; wrap every on_event and every MCP tool invocation in try/except and log to the audit channel.
- Plugin version drift. A plugin built against host v1.2 breaks on v1.4; pin the minimum host version in the manifest and gate loading on compatibility.
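The embeddings upgrade for the skill library is small enough to sketch. The version below substitutes a bag-of-words stand-in for the embedding so it runs without dependencies; in production you would replace embed() with a real model, e.g. sentence-transformers' SentenceTransformer(...).encode(). The SkillIndex class name is illustrative.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Stand-in embedding: bag-of-words counts. Swap in a real sentence
    embedding (sentence-transformers) for production-quality retrieval."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


class SkillIndex:
    """Replaces keyword lookup: rank skills by similarity to the task text."""

    def __init__(self):
        self.skills = []  # (name, description, vector)

    def add(self, name: str, description: str):
        self.skills.append((name, description, embed(description)))

    def top_k(self, query: str, k: int = 3):
        qv = embed(query)
        ranked = sorted(self.skills, key=lambda s: cosine(qv, s[2]), reverse=True)
        return [(name, round(cosine(qv, vec), 3)) for name, _, vec in ranked[:k]]
```

The shape of the fix is the point: the library's add/lookup interface is unchanged, only the ranking function moves from exact keyword match to similarity over dense vectors, which is why it stays a twenty-line upgrade.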
The primitives are now in place: durable daemon, resumable recovery, reusable skills, extensible plugins. Chapter 09 is the capstone. It integrates everything built in Chapters 01 through 08 and runs the result against real benchmarks (SWE-bench Verified, TAU-Bench, and GAIA). That is where the engineering you have done meets the evaluation numbers that decide whether the system is ready for users, and whether the shape of the thing you have built is production-grade or still a prototype.
- Armstrong, J., Virding, R., Williams, M. (1993). Concurrent Programming in Erlang. Prentice-Hall. Supervision trees are documented in the OTP Design Principles guide. ↩
- Poettering, L. (2010). Rethinking PID 1. http://0pointer.de/blog/projects/systemd.html ↩
- Kleppmann, M. (2017). Designing Data-Intensive Applications. O'Reilly Media. Chapters 3 and 5. ↩
- Kreps, J. (2013). The Log: What Every Software Engineer Should Know About Real-Time Data's Unifying Abstraction. ↩
- Wang, G. et al. (2023). Voyager: An Open-Ended Embodied Agent with Large Language Models. arXiv:2305.16291. Section 3.3 describes the skill distillation loop. ↩
- Claude Code plugin documentation: https://docs.claude.com/en/docs/claude-code/plugins. The schema URL referenced by plugin manifests is https://anthropic.com/claude-code/plugin.schema.json. ↩