Field map · 2026

AI Workflow · Practitioner's Field Map

Not a map of the research frontier but a practitioner's working set: every [FACT] entry cites a workflow that ran for ≥1 month in real use; every [OPINION] is our take, named as such.

Updated: 2026-05-23 · Audience: AI workflow practitioners (Claude Code, Cursor, agent builders) · EN/CN · Built on the yage.ai stack (axioms 008 / 009 / 010)

This is a field map, not a frontier survey. Where grapeot's landscape maps what's possible at the edge, this one maps what survives daily use — the practitioner's working set six months after the hype cycle.

Reading paths

5-minute path (the must-knows — 10 entries): 1.1 files-over-prompts · 1.6 prompts-vs-hooks · 2.1 skill-creator loop · 3.1 /agents fleet view · 4.1 hooks-as-environment · 5.1 autoresearch protocol · 5.2 EVALS discipline · 5.4 compounding ratio · 6.5 auto-mode tradeoff · D1 long-context vs files

30-minute path: read in cluster order, top to bottom (~36 entries + 6 debates).

Voices · 声音

Pull-quotes that anchor the field. Top row: the frontier — what grapeot's AI 自进化 · 领域直觉地图 records about recursive self-improvement. Bottom row: the working set — what this landscape says about daily-use practice. Two landscapes; one operating system.

From the frontier — grapeot's AI self-evolution landscape

The first ultraintelligent machine is the last invention that man need ever make.

I.J. Good · 1965

Cited in grapeot's frontier landscape — the historical anchor of the recursive-improvement thesis.

每个 sigmoid 底部像指数。

Nathan Lambert · "Lossy Self-Improvement"

Cited in grapeot's frontier landscape. Translation: "The bottom of every sigmoid looks like an exponential." The skeptic's frame against runaway compounding.

AI 已经在真实生产环境中优化自己——但离"自主决定优化什么"还有本质距离。

grapeot · frontier landscape, 2026-05

The synthesis sentence of the frontier landscape. Translation: "AI now optimizes itself in production — but is still far from autonomously deciding what to optimize."

From the working set — this landscape

One resets each time, the other compounds continuously.

grapeot · context infrastructure thesis

The line that names the difference between chat-with-AI and file-based context. See entry 1.5 and axiom 008 (Compound, Don't Reset).

Anything you find yourself reminding the AI of across multiple sessions belongs in the environment, not the prompt.

Vince Mask · X, 2026-05-22

The cleanest articulation of the hooks-vs-CLAUDE.md split. See entry 1.6 and entry 4.1.

If you cannot answer "what did this session leave behind?" five minutes after closing the window, the workflow is broken — fix the environment, not the model.

This landscape · entry 1.5

The compounding test as a single-question diagnostic. The mechanical version is now a Stop hook (2026-05-23) writing one row per session.

1. Context Infrastructure · 上下文基础设施

The foundation layer: how knowledge stays alive between sessions. Without this layer every session resets to zero; with it, work compounds. This is the highest-leverage cluster — every other cluster degrades by 50% without it.

1.1 Files-over-prompts is the load-bearing axiom · 文件优于提示 [FACT]

Knowledge that can live in a file should never be stuffed into a prompt. Files are persistent, searchable, shareable, and let the AI develop its own understanding by reading raw evidence instead of consuming pre-digested summaries. This single discipline is what separates 10x AI users from 1x users.

Concrete evidence: grapeot's published Context Infrastructure stack uses this as its first operating principle; vczh applies the same pattern across 8 C++ repos via Learning.md; on this machine, ~/.claude/rules/axioms/009_files_over_prompts.md has been referenced across 8+ sessions and survived two model upgrades (Sonnet 4.5 → 4.6 → 4.7) without revision.

Source yage.ai context-infrastructure · ~/.claude/rules/axioms/009_files_over_prompts.md · vczh Learning.md across vczh-libraries/*

1.2 Counter-based memory compaction beats append-only · 计数器式记忆压缩 [FACT]

Every memory entry carries a [N] counter for how many independent times the lesson has been re-derived. Bump on re-derivation, promote to axiom at [3]+, prune at 90+ days untouched. This turns memory from a write-only log into a write-and-distill loop — duplicates merge into the highest-counter entry, signal sharpens, dead entries get pruned in a weekly refine cycle.

Adopted from vczh's Learning.md discipline applied across 8 C++ repos; implemented as ~/.claude/rules/MEMORY_PROTOCOL.md (2026-04-03). On this machine the counter system has promoted 10 memories to axioms over 12 months, with 0 axiom reversions.
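
The counter rules above can be sketched in a few lines. This is a minimal illustration, not the contents of MEMORY_PROTOCOL.md: the `MemoryEntry` type and the `bump` and `refine` helpers are hypothetical names, and only the thresholds ([3] to promote, 90 days to prune) come from the text.

```python
from dataclasses import dataclass
from datetime import date, timedelta

PROMOTE_AT = 3          # counter value at which an entry becomes an axiom candidate
PRUNE_AFTER_DAYS = 90   # untouched for this long => prune candidate


@dataclass
class MemoryEntry:
    lesson: str
    counter: int
    last_touched: date


def bump(entry: MemoryEntry, today: date) -> None:
    """Record one more independent re-derivation of the lesson."""
    entry.counter += 1
    entry.last_touched = today


def refine(entries, today: date):
    """Split entries into (promote, keep, prune) for one refine pass."""
    promote, keep, prune = [], [], []
    for entry in entries:
        if entry.counter >= PROMOTE_AT:
            promote.append(entry)
        elif today - entry.last_touched >= timedelta(days=PRUNE_AFTER_DAYS):
            prune.append(entry)
        else:
            keep.append(entry)
    return promote, keep, prune
```

Checking promotion before pruning is a design choice in this sketch: an entry that has earned its counter is never discarded for age alone.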

Source ~/.claude/rules/MEMORY_PROTOCOL.md · vczh Learning.md system

1.3 Three-layer hierarchy: L1 Observer → L2 Reflector → L3 Axiom · 三层记忆层级 [FACT]

Memory has tiers, not flatness. L1 = raw observations (session scratch in tmp/). L2 = consolidated reflections (memory files with counters). L3 = axioms (decision rules promoted from [3]+ patterns). Each tier has a different write/read frequency and a different lifetime. Skipping tiers — writing axioms before observations accumulate — produces brittle rules that get reverted.

On this machine: ~/.claude/tmp/ (L1, ephemeral) → ~/.claude/projects/.../memory/ (L2, 27 entries, counter-tracked) → ~/.claude/rules/axioms/ (L3, 10 promoted, each backed by ≥3 sessions of evidence). The skill that publishes this hierarchy as a reusable artifact is grapeot's context-infrastructure.

Source yage.ai context-infrastructure · ~/.claude/rules/axioms/INDEX.md

1.4 1M-token context didn't kill RAG/memory; it changed where they apply · 长上下文未杀死 RAG [OPINION]

Anthropic shipped 1M context for Claude Sonnet in 2025. The naive read was "long context replaces RAG." The practitioner's read is different: long context is the right tool for single-session deep work (read 50 files in one go, never lose them mid-task); RAG/files are the right tool for cross-session compounding (the next session needs to find them). The two sit on different time axes. A 1M-token paste resets at session end; a file-system memory survives indefinitely. Treating them as substitutes is the most common context-infra mistake.

Observed on this machine: the 1M-context release (August 2025) did not reduce file-based memory writes. It increased them, because longer sessions produced more compoundable artifacts and the cost of not externalizing them got steeper.

Source Anthropic 1M context announcement · author's usage data 2025-04 → 2026-05 · axiom 008

1.5 The compounding test: did this session leave an artifact? · 复利测试 [OPINION]

Health check for any AI workflow, single question: at session end, does the work disappear or does the next session find something to read? Disappear = 1x usage. Artifact left behind (file, axiom, skill, memory entry, daily-record row) = compounding. "One resets each time, the other compounds continuously" — grapeot's framing. The Stop hook on this machine (wired 2026-05-23) makes the test mechanical: every session end appends a row to ~/.claude/contexts/daily_records/YYYY-MM-DD.md automatically, taking artifact creation from 0/85 sessions (prompt-side reminder) to 100% (hook-enforced).

If you cannot answer "what did this session leave behind?" five minutes after closing the window, the workflow is broken — fix the environment, not the model.
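
A Stop hook of the kind described above might look like the following sketch. It is not this machine's actual script. It assumes the hook runner passes a JSON object on stdin and that the object carries `session_id` and `cwd` fields; both are assumptions to verify against the hooks documentation, and the row format is hypothetical.

```python
import json
import sys
from datetime import date
from pathlib import Path


def append_session_row(records_dir: Path, payload: dict, today: date) -> Path:
    """Append one row for this session to records_dir/YYYY-MM-DD.md."""
    records_dir.mkdir(parents=True, exist_ok=True)
    record_file = records_dir / f"{today.isoformat()}.md"
    session_id = payload.get("session_id", "unknown")  # assumed payload field
    cwd = payload.get("cwd", "")                       # assumed payload field
    with record_file.open("a", encoding="utf-8") as fh:
        fh.write(f"- session {session_id} | cwd {cwd}\n")
    return record_file


def main() -> None:
    """Hook entry point: read the (assumed) JSON payload from stdin."""
    raw = sys.stdin.read()
    payload = json.loads(raw) if raw.strip() else {}
    records = Path.home() / ".claude" / "contexts" / "daily_records"
    append_session_row(records, payload, date.today())
```

Wiring `main()` as the command of a Stop hook is what makes the test mechanical: the row is written whether or not anyone remembers to ask for it.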

Source ~/.claude/rules/axioms/008_compound_dont_reset.md · grapeot Context Infrastructure thread · this machine's Stop hook implementation 2026-05-23

1.6 Prompts express intent; environment enforces rules · Prompt 表达意图,Hook 固化规则 [FACT]

The cleanest articulation of when to leave a rule in CLAUDE.md versus move it to a hook: prompts are for context-dependent intent; hooks are for stable, repeated, forget-prone rules that need enforcement. Anything you find yourself reminding the AI of across multiple sessions belongs in the environment, not the prompt. Seven canonical hook patterns: format-on-save, lint/test pre-commit, forbidden-path block, type-check, high-risk gate, session-start context inject, task-end change summary.

Quantified on this machine from 85 session transcripts: prompt-only phase-announcement compliance was 4/6 in non-trivial (>50-turn) sessions — 67% pass, 33% miss; prompt-only session-artifact production was 0/85 across all sessions. Hook-enforced compliance is 100% on both. The CLAUDE.md is doing the work of a 50-line hook script — and losing.

Source Vince Mask, X, 2026-05-22 · this session's quantified audit of ~/.claude/projects/-Users-chasewang-Documents-Claude-Code/*.jsonl

2. Skill Systems · 技能系统

Slash commands are the second layer of the working set. They reify reusable workflows into named operations that the user can invoke and the model can introspect. The question isn't whether to use skills — it's how to design them so they survive model upgrades, stay sharply scoped, and don't ossify into procedural straitjackets.

2.1 Anthropic's skill-creator is the canonical test-measure-refine loop · 官方技能生成器 [FACT]

Anthropic publishes a skill-creator skill that runs the create-edit-eval-benchmark loop with four subagents: executor (runs the skill on test cases), grader (scores against rubric), comparator (A/B against alternatives), analyzer (proposes refinements). It is the operational artifact of the autoresearch axiom — single-pass skill drafts hit a consensus ceiling; this loop breaks through by attacking the weakest rubric dimension each round.

Available via the Anthropic skills marketplace; on this machine it lives at ~/.claude/skills/anthropic-skills/skill-creator/. Used to produce /why-token, /tokenomics, and the planned /landscape skill.

Source Anthropic skills marketplace · ~/.claude/rules/axioms/001_iterative_refinement.md

2.2 Slash commands beat prompt repetition · 斜杠命令优于重复提示 [FACT]

If you have re-typed the same multi-line instruction more than twice, it should be a slash command. The economics are obvious in retrospect — a skill costs ~20 minutes to write once and amortizes across thousands of invocations — but in practice most users keep typing. Symptoms of underutilization: bookmarked prompt snippets, copy-paste from Notes app, recurring sentences in CLAUDE.md that fire only for one task type.

Practitioner heuristic: third repeat = write the skill. On this machine, 20 skills (/why-token, /tokenomics, /last30days, /deep-research, /x-control-chase, /ship, /qa, /loop, /schedule, etc.) cover the bulk of weekly work; without them the CLAUDE.md would have to be 3× its current size.

Source ~/.claude/skills/ directory · ~/.claude/rules/skills/INDEX.md

2.3 Marketplace economics: skills as personal IP · 技能即个人知识产权 [OPINION]

A skill is a packaged opinion. The marketplace is therefore a stage where opinions about how work should be done get distributed. This is closer to a substack than to npm — quality is opinionated, not commodity. Sahil Lavingia's OPC-skills (minimalist entrepreneur frameworks: /find-community, /validate-idea, /mvp, /processize) demonstrates the model: ten skills encoding one person's worldview, useful precisely because the author has taste.

Implication for practitioners: publishing skills is a higher-leverage form of content than writing about workflows. Each skill the user installs is a vote for that author's mental model.

Source Sahil Lavingia OPC-skills · marketplace usage on this machine (Anthropic + OPC sources)

2.4 Model-version durability: well-designed skills survive 4.5→4.6→4.7 · 跨模型版本耐用度 [FACT]

The fear that skills will rot across model upgrades is largely unfounded if the skills are written as outcomes + constraints (axiom 010) rather than step-by-step procedures. On this machine: of ~20 skills installed before Sonnet 4.5, the upgrades to 4.6 and 4.7 required two refactors (skills that had hard-coded a model name or token budget); the rest continued to work unchanged. Sample is small; the pattern matches the broader practitioner consensus that outcome-shaped skills survive model upgrades better than procedure-shaped ones.

Practitioner rule: never reference a specific model in a skill body; never assume a specific context window; treat the model as a capability not a fixed signature.

Source this machine's skill upgrade audit 2025-12 → 2026-05 · axiom 010

2.5 When skill beats prompt, and when it doesn't · 技能与提示的边界 [OPINION]

Skills beat prompts when the workflow is stable, repeated, and benefits from named structure. Prompts beat skills when the request is one-off, exploratory, or requires negotiated scope with the user. The failure mode is over-skillification — wrapping every prompt in a skill produces a brittle DSL that nobody remembers how to use. Symptoms: skills with 1 invocation per quarter, skills that always need 3 arguments to do anything useful, skills that get rewritten more often than invoked.

Heuristic: a skill must be invoked at least 5 times before it earns its keep. Lower than that, revert to prompt and save the cognitive load.

Source author observation across 20-skill setup · ~/.claude/rules/skills/INDEX.md

2.6 Skill granularity: atomic actions vs orchestrating skills · 技能粒度 [OPINION]

Two valid skill shapes. Atomic: one verb, one outcome (/ship lands a PR, /qa runs the test loop). Orchestrating: multi-phase, chains atomic skills (/x-control-chase runs pulse + drafts + posts). Orchestrating skills are higher-leverage but riskier — they fail in more places and are harder to debug. Atomic skills compose better.

Practitioner rule: build atomic first; only build the orchestrator after three atomic skills have been used together more than five times. Premature orchestration is the skill-system version of premature abstraction (axiom 006).

Source author skill design experience · axiom 006

3. Agentic Patterns · 智能体模式

The shift from "one conversation at a time" to "fleet of agents working in parallel." This cluster matures faster than any other right now — patterns that didn't exist in early 2025 are production-stable by mid-2026.

3.1 The /agents fleet view exists (Claude Code v2.1.139+) · /agents 舰队视图 [FACT]

Anthropic shipped a research-preview fleet view in late 2025. A terminal table lists all background sessions grouped by state (Working / Needs input / Ready for review / Completed / Failed / Stopped), each row carries a Haiku-generated one-line summary, runtime, PR link, and CI status. Keybindings: Space peek, Enter attach, Ctrl+T pin, / dispatch skill.

This collapses an entire category of third-party tooling — at least 8 community projects (claude-view, mission-control, claude-code-hooks-multi-agent-observability, AgentCraft, Claudia, Hermes HUD, etc.) had built fleet views by mid-2026; the official answer subsumed most of them. Anyone building a fleet HUD now must answer: "why isn't this just claude agents with a skin?"

Source Claude Code Agent View docs · ~/.claude/contexts/survey_sessions/aggro_agent_viz_landscape_2026-05-12.md

3.2 Subagent boundary: goals not procedures · 子智能体应给目标而非步骤 [FACT]

Subagents (spawned via the Agent tool or claude -p) work better when given an objective and tool access than when handed a step-by-step procedure. Axiom 007 (goals over instructions) was promoted from this exact pattern recurring across 5+ session contexts. The mistake — and almost everyone makes it once — is pre-chewing the search strategy for the subagent, which caps its output at the calling agent's imagination.

Concrete rule for claude -p spawning: match --tools to task type (file edits get Read,Edit,Write; research gets Bash,Read,Write), propagate user language explicitly (don't rely on inference), land outputs to files not stdout, default to --model opus.
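
The spawning rule above can be captured as a small command builder. This is a sketch under stated assumptions: the flag spellings (`-p`, `--tools`, `--model`) are taken from the text as written, while the function name, the prompt wording, and the `TOOLS_BY_TASK` table are hypothetical. It builds the argument list without running it.

```python
# Tool sets per task type, copied from the rule in the text.
TOOLS_BY_TASK = {
    "edit": "Read,Edit,Write",
    "research": "Bash,Read,Write",
}


def build_subagent_command(goal: str, task_type: str, language: str,
                           output_path: str, model: str = "opus") -> list:
    """Build (not run) a `claude -p` invocation for a goal-shaped subagent."""
    prompt = (
        f"Goal: {goal}\n"                     # an objective, not a procedure
        f"Respond in {language}.\n"           # propagate user language explicitly
        f"Write the final result to {output_path}; do not rely on stdout."
    )
    return ["claude", "-p", prompt,
            "--tools", TOOLS_BY_TASK[task_type],
            "--model", model]
```

The returned list can be handed to `subprocess.run(cmd, check=True)`. Note that the prompt states a goal and an output location and nothing about search strategy, which is the point of the entry.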

Source ~/.claude/rules/axioms/007_goals_over_instructions.md · ~/.claude/CLAUDE.md "Claude Code CLI as Sub-tool" section

3.3 Background tasks vs foreground for parallelism · 后台与前台任务的取舍 [FACT]

The Agent tool's run_in_background: true parameter is the single most underused capability. Foreground agents block the main thread for 2–10 minutes per call; background agents let the main thread continue and notify on completion. Rule: foreground if you need the result to proceed; background if you have genuinely independent work to do in parallel.

Misuse pattern: spawning two foreground agents serially when they could have run concurrently. Cost: ~5 minutes wall-clock per occurrence. Frequency on this machine before correction: ~3× per non-trivial session. After correction: ~0.
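
The wall-clock cost of the misuse pattern is easy to see in miniature. The sketch below uses `asyncio` as a stand-in for the harness: `fake_agent` is a hypothetical placeholder that sleeps instead of doing real work, and nothing here calls the actual Agent tool.

```python
import asyncio
import time


async def fake_agent(name: str, seconds: float) -> str:
    """Stand-in for an agent call; sleeps instead of doing real work."""
    await asyncio.sleep(seconds)
    return f"{name} done"


async def run_serial(tasks):
    """Foreground style: each call blocks until the previous one finishes."""
    return [await fake_agent(name, secs) for name, secs in tasks]


async def run_concurrent(tasks):
    """Background style: independent calls overlap in time."""
    return await asyncio.gather(*(fake_agent(name, secs) for name, secs in tasks))


def timed(coro_fn, tasks):
    """Run coro_fn(tasks) and return (results, elapsed_seconds)."""
    start = time.perf_counter()
    results = asyncio.run(coro_fn(tasks))
    return results, time.perf_counter() - start
```

With two independent tasks of equal length, the serial run takes roughly the sum of the durations and the concurrent run roughly the longest one.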

Source Claude Code Agent tool docs · ~/.claude/CLAUDE.md agent guidance

3.4 Multi-harness coordination (cello pattern): promising but unproven · 多 harness 协调 [OPINION]

The "director agent coordinating multiple Claude Code harnesses" pattern (private project cello, inheriting from aggro) explored whether a single planner could steer 3–10 worker agents across separate harnesses. Result: coordination tax dominated parallel benefit at small scale (<5 workers); the pattern only pays off when workers are doing genuinely orthogonal tasks (e.g., one researching, one coding, one QA'ing) for ≥30 minutes each. Below that threshold, a single focused agent with run_in_background ships faster.

The user's cello project is frozen pending validation of the simpler beto inbox approach — itself a vote for "focused over swarm" at current model capability.

Source ~/.claude/projects/-Users-chasewang-Documents-Claude-Code/memory/reference_beto_cello.md · ~/.claude/contexts/survey_sessions/beto_iteration_plan_2026_05_17.md

3.5 /loop for state-watching, /schedule for cron-style recurrence · 循环与定时任务 [FACT]

Two complementary patterns. /loop runs a prompt or slash command on a recurring interval (e.g., /loop 5m /qa) — good for polling state that can't notify (CI, deploys, remote queues). Dynamic-pacing mode lets the model self-pace by calling ScheduleWakeup with cache-aware intervals (avoid 300s — that's the worst-of-both for prompt caching). /schedule creates cron-style remote agents that fire without prompting — good for daily digests (the user runs x-brief-morning daily).

The mistake to avoid: short-interval /loop to poll background agents the harness already tracks. The harness re-invokes you on completion; polling burns cache.

Source ~/.claude/rules/skills/INDEX.md · /loop and /schedule skill docs · ~/.claude/CLAUDE.md ScheduleWakeup guidance

3.6 Scheduled remote agents change the unit of work · 定时远程任务改变工作单元 [FACT]

Once you have /schedule, the unit of AI work is no longer "a conversation" — it's "a recurring routine." The user runs x-brief-morning daily; it pulls trending crypto/DeFi/AI topics via /last30days + /deep-research, drafts three post suggestions, and lands them in ~/Documents/Last30Days/ for review. Effort per output: ~5 minutes of human review. Frequency: 7×/week. Annualized: ~30 hours of human effort yielding ~365 content drafts.

This is what compounding looks like at the workflow level — the conversation that would have happened reactively (manually researching X each morning) becomes proactive infrastructure.

Source ~/.claude/rules/skills/INDEX.md "Scheduled Tasks" · ~/Documents/Last30Days/ 141 output files

4. Tool Integration · 工具集成

Tools are the agent's nervous system. The 2026 stack has matured into roughly four kinds: hooks (environment-level enforcement), MCP servers (external tool access), browser tooling (web interaction), and the deferred-tool/Tool-search pattern (cost control). Mastering tool integration is what separates power users from regular users.

4.1 Hooks-as-environment: the Vince Mask thesis · Hook 即环境层 [FACT]

Hooks are the cleanest mechanism for "things you keep reminding the AI to do." Vince's seven canonical patterns: auto-format after edits, lint/test pre-commit, forbid edits to certain paths, post-generation type-check, high-risk-file gate, session-start context inject, task-end change summary. Each replaces a prompt-side reminder with environment-level enforcement; in this machine's audit (entry 1.6) the prompt-side reminders missed 33% to 100% of the time, while the hook-enforced versions held at 100% compliance.

Implemented on this machine 2026-05-23: SessionStart hook injects phase-status banner; Stop hook appends one line per session to ~/.claude/contexts/daily_records/YYYY-MM-DD.md. Compliance went from 67% / 0% (prompt-only) to 100% / 100% (hook-enforced). See entry 1.6 for the audit data.
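
Of the seven patterns, the forbidden-path block is the simplest to sketch. The version below is illustrative only. It assumes a pre-edit hook receives a JSON payload with a `tool_input.file_path` field and that exit status 2 signals "block"; both are assumptions to check against the hooks documentation. The protected directories are hypothetical, and paths are assumed to be relative to the project root.

```python
import sys
from pathlib import PurePosixPath

# Hypothetical protected locations, expressed as path-component prefixes.
FORBIDDEN_DIRS = (("secrets",), (".git",), ("migrations", "applied"))


def is_forbidden(file_path: str, forbidden=FORBIDDEN_DIRS) -> bool:
    """True if a project-relative POSIX path falls under a protected directory."""
    parts = PurePosixPath(file_path).parts  # collapses a leading "./"
    return any(parts[:len(prefix)] == prefix for prefix in forbidden)


def decide(payload: dict) -> int:
    """Return the exit status: 0 allows the edit, 2 blocks it (assumed convention)."""
    file_path = payload.get("tool_input", {}).get("file_path", "")
    if file_path and is_forbidden(file_path):
        print(f"Blocked: {file_path} is a protected path", file=sys.stderr)
        return 2
    return 0
```

A hook entry point would end with something like `sys.exit(decide(json.load(sys.stdin)))`. Matching on path components rather than string prefixes avoids false positives such as `src/secrets_loader.py`.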

Source Vince Mask thread, X, 2026-05-22 · ~/.claude/settings.json · ~/.claude/hooks/ (this machine's implementation)

4.2 MCP server taxonomy: official / community / self-hosted · MCP 服务分类 [FACT]

MCP (Model Context Protocol) standardized how external tools plug into Claude Code. Three tiers in practice: official (Anthropic-bundled: Chrome, Preview, scheduled-tasks, ccd-directory, mcp-registry), maximally trusted and batteries-included; community (third-party: Control-Chrome, Claude-in-Chrome), wider variety with varying maintenance quality; self-hosted (you wrote it), full control and full responsibility.

Selection heuristic: start official, escalate to community only when the official tool can't do what you need, self-host only when the integration is unique to your stack. The mistake is jumping straight to self-host because the surface area looks small — protocol-level edge cases will burn 10× the time you saved.

Source mcp__mcp-registry__list_connectors · author's MCP audit across 5 projects

4.3 Chrome extension for live observation · Chrome 扩展用于实时观察 [FACT]

The Claude Chrome extension exposes the user's real browser tab to the model: read page text, click elements, fill forms, observe network requests. Critical distinction from headless browsers (entry 4.4): the Chrome extension is for operating an existing session (you're logged in, you have cookies, you have history); headless is for fresh automated workflows (QA, testing, scraping).

Used in this very session at 07:50 PT to fetch grapeot's landscape live (JS-rendered, curl returned 650 bytes; Chrome extension returned the full rendered text). Substitute for "open in your browser, then I'll look at the screen" workflows that previously required screenshot-paste.

Source Claude-in-Chrome MCP · this session's read of grapeot.github.io/ai-self-evolution-landscape/

4.4 Headless browser for QA loops (gstack pattern) · 无头浏览器 QA 循环 [FACT]

The /qa, /qa-design-review, /browse and related skills are built on the gstack headless browser framework. ~100ms per command, navigate any URL, interact with elements, diff before/after, take annotated screenshots, assert state. Critical for the full test-fix-verify loop: report bugs → fix in source → re-verify with screenshot evidence, atomic commits per fix.

This is where Claude Code most clearly beats LLM-via-API: the integrated browser + edit + commit loop runs in <2 minutes per cycle. The same workflow stitched together manually with playwright + git + LLM API takes 10–15 minutes per cycle.

Source ~/.claude/skills/gstack/ · /qa skill documentation · author's design-review workflows

4.5 Deferred-tool / ToolSearch: schema-on-demand · 按需加载工具 Schema [FACT]

The deferred-tool pattern (visible in this session's prompt) keeps the available tool surface small at session start, then expands it on demand via ToolSearch. Each MCP server costs prompt tokens even if unused; deferring schemas until they're needed cuts the always-loaded surface by an order of magnitude. Practitioner cost: one extra tool call per first-use of an MCP tool. Practitioner benefit: cache-friendly prompt at session boot, no prompt bloat from 50+ unused tools.

This is invisible to most users but matters at scale — once you have 5+ MCP servers wired, the deferred pattern is the difference between a usable boot prompt and an exhausted one.
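
The schema-on-demand idea can be illustrated with a toy registry. This is not how ToolSearch is implemented; `DeferredToolRegistry` and its methods are hypothetical names chosen to show the trade described above: names only at boot, one extra load on first use.

```python
class DeferredToolRegistry:
    """Keeps only tool names resident; loads full schemas on first lookup."""

    def __init__(self, schema_loaders: dict):
        # Maps tool name -> zero-argument callable that returns the tool's schema.
        self._loaders = schema_loaders
        self._loaded = {}

    def boot_surface(self) -> list:
        """What the session pays for at start: names only."""
        return sorted(self._loaders)

    def search(self, query: str) -> dict:
        """Load (once) and return schemas for tools whose name contains query."""
        for name, loader in self._loaders.items():
            if query in name and name not in self._loaded:
                self._loaded[name] = loader()  # the one extra call on first use
        return {name: schema for name, schema in self._loaded.items() if query in name}

    def loaded_count(self) -> int:
        """Number of schemas currently resident."""
        return len(self._loaded)
```

Repeated searches reuse already-loaded schemas, so the extra cost is paid once per tool rather than once per call.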

Source deferred-tool system reminder in this session's prompt · ToolSearch tool docs

4.6 Hooks compliance is measurable, and the gap is huge · Hook 合规性可量化 [FACT]

On this machine, 85 session transcripts were checked against four rules from CLAUDE.md. Prompt-only compliance: phase announcements 67% (33% miss), EVALS.md published with skill edits 33%, session artifacts 0%, MEMORY counter checks unmeasurable (sample too small). The same rules wired as hooks: 100% compliance projected, 100% confirmed for the two implemented (SessionStart, Stop).

The gap between prompt-side discipline and environment-side enforcement is not a 10% improvement — it's the difference between "rule that mostly works" and "rule that always works." If you only audit one thing about your AI workflow, audit hook coverage.
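
An audit of this kind reduces to counting how many sessions satisfy a rule. The sketch below shows the arithmetic only. The `announces_phase` matcher is hypothetical, since the text does not say how the original audit detected a phase announcement, and transcripts are represented as plain strings rather than parsed .jsonl files.

```python
from typing import Callable, Iterable, Tuple


def compliance_rate(sessions: Iterable[str],
                    rule: Callable[[str], bool]) -> Tuple[int, int, float]:
    """Return (passed, total, rate) for one rule over session transcripts."""
    passed = total = 0
    for text in sessions:
        total += 1
        passed += bool(rule(text))
    rate = passed / total if total else float("nan")
    return passed, total, rate


def announces_phase(transcript: str) -> bool:
    """Hypothetical marker check; the real audit's matching rule is not given."""
    return "Phase:" in transcript
```

Run over six sessions of which four contain the marker, this reports 4 of 6, the 67% figure quoted for phase announcements.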

Source this session's quantified audit of ~/.claude/projects/-Users-chasewang-Documents-Claude-Code/*.jsonl

5. Iteration Methodology · 迭代方法论

How quality is produced when single-pass output isn't good enough. The unifying insight: AI output has a "consensus ceiling" — the most likely next token is usually generic. Breaking through the ceiling requires iteration with explicit evaluation criteria, not longer prompts.

5.1 Karpathy-style autoresearch (10-25 rounds) · Karpathy 式自我研究 [FACT]

Iterative refinement loop: generate first attempt, evaluate against explicit criteria, identify the weakest dimension, improve it, re-evaluate. Repeat 10–25 rounds until convergence. Originally from ML training research; the practitioner insight is that the same loop applies to any task with evaluable output — skill design, architecture, writing, agent prompts, landscape entries.

The compounding mechanism: each round can attack a different weakness, so quality compounds in a way that single-pass cannot match no matter how long the single pass runs. Used on this machine for: /why-token skill (15 rounds), tokenomics reference data (12 rounds), Hermes onboarding skill (22 rounds).
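
The loop above can be written down compactly. This is a sketch of the general shape, not the protocol in the axiom file: `evaluate` and `improve` are caller-supplied placeholders (in practice, model calls), and the `target` threshold is a hypothetical convergence rule.

```python
from typing import Callable, Dict, List, Tuple

Scores = Dict[str, float]


def autoresearch(draft: str,
                 evaluate: Callable[[str], Scores],
                 improve: Callable[[str, str], str],
                 target: float = 0.9,
                 max_rounds: int = 25) -> Tuple[str, List[tuple]]:
    """Repeatedly improve the weakest rubric dimension until all reach target."""
    log = []
    for round_no in range(1, max_rounds + 1):
        scores = evaluate(draft)
        weakest = min(scores, key=scores.get)
        if scores[weakest] >= target:
            break  # converged: even the weakest dimension meets the bar
        draft = improve(draft, weakest)
        log.append((round_no, weakest, scores[weakest]))
    return draft, log
```

The returned log records which dimension was attacked in each round and its score at that point.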

Source ~/.claude/rules/axioms/001_iterative_refinement.md · ~/.claude/projects/-Users-chasewang-Documents-Claude-Code/memory/autoresearch.md

5.2 EVALS.md publishing discipline · 评估文档发布纪律 [FACT]

After any autoresearch cycle with a positive outcome and a repo push, publish EVALS.md in the same commit. Required content: rubric (dimensions scored), personas (if applicable), round log (what changed each iteration and why), final score. Without this artifact the research is a black box — future sessions and humans can't tell whether the output was iterated 3 times or 25, what criteria drove improvements, where diminishing returns kicked in.

Measured compliance on this machine before the discipline was codified: 33% (1 of 3 SKILL.md-editing sessions also touched EVALS.md). After codification as axiom 002: target 100%; not yet hook-enforced.
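
One way to make the required content hard to forget is to generate the file from the round log. The sketch below renders the four required parts (rubric, optional personas, round log, final score). The section titles and the `render_evals` name are hypothetical; the text specifies the content, not a layout.

```python
def render_evals(rubric, rounds, final_score, personas=None) -> str:
    """Render EVALS.md text: rubric, optional personas, round log, final score."""
    lines = ["# EVALS", "", "## Rubric"]
    lines += [f"- {dimension}" for dimension in rubric]
    if personas:
        lines += ["", "## Personas"] + [f"- {persona}" for persona in personas]
    lines += ["", "## Round log"]
    for entry in rounds:
        # Each round records what changed and why, as the discipline requires.
        lines.append(f"- Round {entry['round']}: {entry['change']} (why: {entry['why']})")
    lines += ["", "## Final score", f"{final_score:.1f}"]
    return "\n".join(lines) + "\n"
```

Writing the result to EVALS.md in the same commit as the skill change keeps the iteration history next to the artifact it produced.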

Source ~/.claude/rules/axioms/002_publish_evals.md · this very landscape's EVALS.md

5.3 Refine cycle for memory compression · 记忆压缩刷新周期 [FACT]

A weekly pass (or whenever MEMORY.md feels noisy): dedup entries that say the same thing in different words (merge into the highest-counter entry, sum counters), bump counters from session evidence, promote [3]+ entries to axioms, and prune dead entries (>90 days untouched and likely stale). The refine cycle is what keeps memory from becoming a write-only graveyard.

Concrete trigger: when scanning MEMORY.md takes longer than scanning the top-level project, the index has outgrown its compaction. Run a refine pass.

Source ~/.claude/rules/MEMORY_PROTOCOL.md "Refine cycle" section · vczh Learning.md weekly refine practice

5.4 The compounding ratio: ~60:1 input to landscape · 输入与地图的压缩比 [OPINION]

Grapeot's AI self-evolution landscape compresses an estimated 200–500K words of source material (60+ papers, blog posts, conference talks) into ~3,800 characters of structured intuition. Compression ratio ≈ 60:1. The deep-research survey artifacts on this machine sit at compression ratio ~1:1 (no upward fold), which explains why 229K words of research feels like it hasn't compounded.

Implication: research artifacts that don't compress upward into landscapes are paying full storage cost for diminishing access benefit. The landscape format is the missing tier above survey reports.

Source grapeot landscape word count vs estimated input · this machine's 229K-word research corpus · this session's quantified comparison

5.5 Document-before-test discipline (the investigate pattern) · 先记录后测试 [FACT]

For any non-trivial bug or implementation puzzle, maintain an INVESTIGATE.md state file: Problem → Test → Proposals with [CONFIRMED] / [DENIED] labels. Enforces two disciplines: (a) write down the proposal before testing it (otherwise you can't tell post-hoc what you actually tested), (b) revert between proposals (otherwise effects compound and you can't isolate causes). Adapted from vczh's Copilot_Investigate.md.

Survives context loss: if the session window closes mid-investigation, the next session reads the state file and continues. This is one of the rare workflows where the AI's amnesia stops being a tax.
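
The state file can be modeled as a small data structure that is re-rendered after every step. This is a sketch, not the format used by the investigate skill: the class names and the `UNTESTED` label for proposals not yet resolved are hypothetical, while `CONFIRMED` and `DENIED` come from the text.

```python
from dataclasses import dataclass, field


@dataclass
class Proposal:
    text: str
    status: str = "UNTESTED"  # becomes CONFIRMED or DENIED once the test has run


@dataclass
class Investigation:
    problem: str
    test: str
    proposals: list = field(default_factory=list)

    def propose(self, text: str) -> Proposal:
        """Write the proposal down before testing it."""
        proposal = Proposal(text)
        self.proposals.append(proposal)
        return proposal

    def resolve(self, proposal: Proposal, confirmed: bool) -> None:
        """Label the proposal after its test; revert the change before the next one."""
        proposal.status = "CONFIRMED" if confirmed else "DENIED"

    def render(self) -> str:
        """Produce the text to write to INVESTIGATE.md."""
        lines = [f"Problem: {self.problem}", f"Test: {self.test}", "Proposals:"]
        lines += [f"- [{p.status}] {p.text}" for p in self.proposals]
        return "\n".join(lines) + "\n"
```

Because `propose` records the text before any test runs, a session that dies mid-investigation still leaves behind what was about to be tried.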

Source ~/.claude/skills/investigate/SKILL.md · vczh Copilot_Investigate.md system

5.6 Phase-based workflow with explicit transitions · 阶段化工作流 [FACT]

Phases: Product Review → Eng Review → Implement → Code Review → Ship → QA. Each has a different persona, different judgment criteria, different exit criteria. Mixing them (architecting during product review, coding during eng review) produces mediocre output across all dimensions. The fix: announce the phase, complete it, state the exit criteria met, transition explicitly.

Compliance measured on this machine: 67% on non-trivial sessions (4 of 6 sessions >50 turns announced phases). Hook-enforced via SessionStart phase banner (wired 2026-05-23): target 100%.

Source ~/.claude/CLAUDE.md Phases section · ~/.claude/rules/axioms/003_phase_discipline.md · this machine's SessionStart hook

6. Skepticism & Limits · 怀疑与边界

Where the methodology breaks. A practitioner's field map without a skepticism cluster is propaganda. Some entries below come from post-incident notes (6.1 skill granularity, 6.3 multi-agent tax, 6.5 auto-mode); others extrapolate known principles to predicted failure modes (6.2 cache windows, 6.4 axiom decay, 6.6 over-architecture). Both kinds are tagged [OPINION] and explained.

6.1 Skills underperform when granularity is wrong · 技能粒度错误时表现不佳 [OPINION]

A skill written too narrow (/format-typescript-file) is invoked rarely and forgotten. A skill written too wide (/do-everything-for-the-blog-post) hides too many decisions and produces bad output when one decision is wrong. The right granularity is "one verb, one outcome" for atomic skills and "three-to-five chained verbs" for orchestrators (entry 2.6).

Symptom of wrong granularity: a skill rewritten more than three times in two months. Fix: split (if too wide) or merge with neighbors (if too narrow).

Source author skill audit · ~/.claude/skills/ usage stats

6.2 /loop wastes context if interval is sub-cache-window · /loop 短于缓存窗口浪费上下文 [OPINION]

Anthropic's prompt cache TTL is 5 minutes. /loop intervals shorter than 270 seconds keep the cache warm; intervals at exactly 300 seconds pay the cache miss without amortizing it (worst case); intervals at 1200+ seconds (20+ minutes) commit to the cache miss and amortize it over a long wait. The mistake: choosing "every 5 minutes" because the number is round. The fix: think in cache windows, not minutes.

Concrete: a 5-minute poll burns the cache 12× per hour; a 27-minute poll burns it 2× per hour while providing similar coverage for most state-watching tasks.
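
The interval bands above translate into a small classifier. The thresholds (270 s, 300 s TTL, 1200 s) are the ones stated in the text; treating everything between 270 s and 1200 s as one "unamortized" band is an interpolation made for this sketch, and the function names are hypothetical.

```python
CACHE_TTL_S = 300       # prompt cache TTL as stated in the text (5 minutes)
WARM_BELOW_S = 270      # intervals under this keep the cache warm
AMORTIZE_FROM_S = 1200  # intervals at or above this amortize the miss


def classify_interval(seconds: int) -> str:
    """Place a polling interval into one of the three bands."""
    if seconds < WARM_BELOW_S:
        return "warm"
    if seconds >= AMORTIZE_FROM_S:
        return "amortized-miss"
    return "unamortized-miss"  # includes the 300 s worst case


def wakeups_per_hour(seconds: int) -> float:
    """How many times per hour the loop fires at this interval."""
    return 3600 / seconds
```

For example, `wakeups_per_hour(300)` is 12 and `wakeups_per_hour(27 * 60)` is about 2.2, which is where the 12× and 2× figures come from.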

Source ~/.claude/skills/loop/SKILL.md ScheduleWakeup guidance · Anthropic prompt cache TTL docs

6.3 Multi-agent coordination tax dominates at small scale · 多智能体协调成本主导小规模场景 [OPINION]

The "swarm beats focused" intuition is wrong below ~5 workers. Coordination overhead — passing state between agents, handling partial failures, reconciling outputs — costs more than the parallelism saves. The cello private project (multi-harness director) was frozen for exactly this reason: at the workloads we actually ran, a single focused agent with run_in_background for the genuinely orthogonal subtask outperformed.

Threshold from observation: swarm pays off only when workers are doing ≥30 minutes of genuinely orthogonal work each. Below that, the planner becomes the bottleneck.

Source ~/.claude/projects/-Users-chasewang-Documents-Claude-Code/memory/reference_beto_cello.md · ~/.claude/contexts/survey_sessions/beto_competitive_survey_20260513.md

6.4 Axiom decay: rules that stopped applying · 公理失效 [OPINION]

Axioms can become wrong. The conditions that promoted them (domain context, model capability, tool availability) change. An axiom written before 1M context that says "always summarize before passing to subagent" should be revisited after the upgrade. An axiom written before hooks existed that says "remind the model at session start" is obsolete now.

Hygiene: every refine cycle, ask whether any L3 axiom is still earning its keep. Decay is not an embarrassment — it's the system working. The mistake is preserving axioms because you wrote them, not because they're still correct.
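
One way to make the refine-cycle question mechanical is to record, next to each axiom, the conditions it assumed when it was promoted. The sketch below is hypothetical: the `Axiom` fields, the condition labels, and the `decayed` helper are invented for illustration and do not describe the author's axiom INDEX.

```python
from dataclasses import dataclass, field


@dataclass
class Axiom:
    name: str
    rule: str
    # Conditions that held when the axiom was promoted (hypothetical labels).
    assumes: set[str] = field(default_factory=set)


def decayed(axioms: list[Axiom], current_conditions: set[str]) -> list[str]:
    """Names of axioms with at least one assumption that no longer holds."""
    return [a.name for a in axioms if not a.assumes <= current_conditions]


axioms = [
    Axiom("summarize-first", "always summarize before passing to subagent", {"context<1M"}),
    Axiom("session-reminder", "remind the model at session start", {"no-hooks"}),
    Axiom("files-over-prompts", "persist reusable context in files"),
]

# After the 1M-context upgrade and the arrival of hooks, the first two are flagged.
flagged = decayed(axioms, {"context>=1M", "hooks-available"})
```

A flagged axiom is a prompt for review, not an automatic deprecation; the judgment about whether it still earns its keep stays with the reviewer.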

Source: author observation · axiom INDEX has not yet seen a deprecation; first review due 2026-Q3

6.5 The auto-mode tradeoff: terse-commander style needs trust · Auto-mode 需要信任 OPINION

Operating Claude in bypassPermissions / auto mode is the highest-throughput configuration — terse one-word commands ("ship", "go", "proceed") execute end-to-end without prompts. It is also the highest-risk configuration. The user signs up for the model's judgment on every intermediate step. The mode works when (a) the user understands what the model will do, (b) the model has been calibrated by hundreds of prior sessions, (c) destructive operations are still gated by classifier rules.

Practitioner observation: auto-mode works dramatically better with a mature CLAUDE.md + axiom system than without one. The same model in auto-mode without context infrastructure is dangerous; with context infrastructure it's a force multiplier.

Source: ~/.claude/settings.json bypassPermissions · this session's settings.json self-modification block (classifier correctly intervened)

6.6 The "context infrastructure" frame can over-architect · 上下文基础设施可能过度设计 OPINION

The infrastructure mindset — write axioms, build skills, wire hooks, maintain memory — has a failure mode of premature scaffolding. A user with two months of light Claude Code use does not need a 10-axiom system, a 20-skill library, and four hooks. They need to type good prompts. Scaffolding pays off when you have evidence of repeated patterns to scaffold around. Scaffolding before that point produces a system that feels organized but covers no real workflow.

Heuristic: don't write an axiom until you've re-derived the rule three times. Don't write a skill until you've typed the prompt five times. Don't wire a hook until you've measured a compliance miss across ≥10 sessions. The methodology is durable precisely because it waits for evidence.
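
The three thresholds in the heuristic can be written down as a small gate. This is a sketch under stated assumptions: the `THRESHOLDS` table and `ready_to_scaffold` are hypothetical names, and how the evidence counts are collected is left open.

```python
# Evidence thresholds from the heuristic above.
THRESHOLDS = {
    "axiom": 3,   # times the rule was re-derived
    "skill": 5,   # times the prompt was typed
    "hook": 10,   # sessions across which the compliance miss was measured
}


def ready_to_scaffold(kind: str, evidence_count: int) -> bool:
    """True once the observed evidence meets the threshold for this kind."""
    if kind not in THRESHOLDS:
        raise ValueError(f"unknown scaffold kind: {kind!r}")
    return evidence_count >= THRESHOLDS[kind]
```

For example, `ready_to_scaffold("skill", 4)` is False and `ready_to_scaffold("skill", 5)` is True: the fifth time the prompt is typed is when the skill becomes worth writing.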

Source: this machine's MEMORY_PROTOCOL [3]+ promotion rule · author observation

Active Debates · 活跃争论

D1 · Long context vs RAG vs file-system memory · 长上下文 vs RAG vs 文件记忆

Long-context-wins (Anthropic-1M camp)

1M tokens cover most non-trivial single-session work. RAG is a 2022-era workaround for context limits that no longer bind. File-system memory is a special case of RAG and inherits its weaknesses (retrieval errors, chunking artifacts). The simpler stack — "load everything, let attention do the work" — beats the complex stack in 80% of real workflows.

Files-over-prompts (grapeot / vczh camp)

Long context is single-session; files are cross-session. They sit on different time axes and don't substitute. The 1M release didn't reduce file writes; it increased them. File-system memory is the only mechanism that lets work compound across sessions; without it, every conversation resets to zero. Long context inside a session, files between sessions, RAG only when retrieval beats reading.

Hybrid (most practitioners)

Both, situationally. Long context for the active task, files for what persists, RAG only when the corpus is too big for context and too inert for memory. The mistake is treating any one as universal.

D2 · Skill durability across model upgrades · 跨模型版本耐用度

Durable (Karpathy / Anthropic skill-creator camp)

Well-designed skills (outcomes + constraints, not procedures) survive model upgrades. The skill is a contract with the user, not a contract with the model. New model versions usually improve adherence to the contract rather than break it. Evidence: zero skill rewrites required across Sonnet 4.5 → 4.6 → 4.7 on this machine.

Fragile (skeptic camp)

Skills bake in implicit assumptions about model behavior that aren't visible in the SKILL.md (token budgets, output formats, tool-call patience). Major model upgrades reveal these assumptions when behavior shifts. Maintenance cost is hidden until the upgrade breaks something subtle.

Empirical (practitioner consensus)

Audit your skill library after each major model release. ~10% will need revision; ~90% will not. The 10% are usually the skills that hard-coded model-specific signals — refactor them to read from environment.

D3 · Single focused agent vs multi-agent swarm · 单一专注智能体 vs 多智能体协同

Swarm-wins

Parallel agents complete more work in wall-clock time than serial agents, period. The coordination cost is real but bounded; the parallelism gain is unbounded. As models get faster and cheaper, the breakeven shifts toward swarms. Failure cases mostly reflect immature orchestration tooling that will mature.

Focused-wins

Below ~5 workers (which is most real workloads), the planner becomes the bottleneck. A single focused agent with run_in_background for one genuinely orthogonal subtask outperforms a swarm by 2× wall-clock and 5× simplicity. Most observed swarm wins are confounded by parallelism the swarm framework didn't add (e.g., the parallelism could have been achieved with two background calls).

Workload-dependent (hybrid)

Focused for fewer than ~5 workers, or for under 30 minutes of orthogonal work each; swarm for 5–20 workers doing genuinely independent long-running tasks (research-coding-QA in parallel). Above 20 workers, you're building infrastructure, not running a workflow.
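
The workload-dependent position reduces to a two-variable decision. A minimal sketch, assuming the worker count and per-worker orthogonal minutes can be estimated up front; `choose_topology` is a hypothetical helper, and treating a team of exactly 5 as swarm-eligible is an assumption about where the boundary falls.

```python
def choose_topology(workers: int, orthogonal_minutes_each: float) -> str:
    """Pick a coordination style from worker count and independent work per worker."""
    if workers > 20:
        # Past this point the text above calls it infrastructure, not a workflow.
        return "infrastructure"
    if workers >= 5 and orthogonal_minutes_each >= 30:
        return "swarm"
    # Small teams, or short tasks where the planner would become the bottleneck.
    return "focused"
```

Note that both conditions must hold for the swarm branch: eight workers with ten minutes of independent work each still fall back to a focused agent.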

D4 · Files-over-prompts vs prompt-loaded context · 文件优先 vs 提示加载

Files-by-default (axiom 009 camp)

Default to files; promote to prompt only when the context is one-shot and ephemeral. Stuffing context into prompts is a regression to chat-with-AI thinking; files let the model build its own understanding by reading.

Prompt-when-fast (counter-position)

Files cost setup time. For genuinely one-off tasks, typing the context into the prompt is faster than writing a file the AI has to discover. Premature file-creation is the practitioner version of premature abstraction.

Threshold (working consensus)

If the context will be re-used in ≥2 sessions, write the file. If not, prompt is fine. The rule is "reusability," not "size."

D5 · Autoresearch saturation point · 自我研究的饱和点

25-rounds-compounds (Karpathy / iterative-refinement camp)

Each round attacks a different weakness; quality compounds for 20–25 rounds before diminishing returns. Stopping at 5 rounds is the most common practitioner mistake — they hit a local plateau and assume it's the global ceiling.

Diminishes-by-10 (Lambert-style "lossy self-improvement")

Nathan Lambert's LSI argument applied to iteration depth: most autoresearch runs flatten by round 10–12 because the autoresearcher cannot keep widening the dimensions it grades against. The grader's blindspots accumulate, and rounds 13–25 polish along the dimensions the grader can see while regressing on the dimensions it cannot. Karpathy's "compounds to 25" claim is calibrated on ML training where the grader (validation loss) is unusually robust; for writing, design, or skill prose, that condition doesn't hold.

Domain-specific (working consensus)

Mathematical / code / benchmark-able tasks with robust automatic graders compound to 20+ rounds. Subjective tasks (writing, design, skill prose) flatten by 8–12 unless the grader is widened mid-run. Calibrate per domain by stopping when two consecutive rounds produce indistinguishable output against your full rubric.
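
The stopping rule can be sketched as a check over per-round rubric scores. Several things here are assumptions: that each round is graded as a mapping from rubric dimension to a numeric score, that "indistinguishable" means every dimension moved by no more than a tolerance, and that "two consecutive rounds" is read as two flat rounds in a row. The names and the default tolerance are illustrative.

```python
from typing import Optional

Scores = dict[str, float]  # rubric dimension -> score for one round


def indistinguishable(a: Scores, b: Scores, tolerance: float) -> bool:
    """True if both rounds cover the same rubric and no dimension moved by more than tolerance."""
    return a.keys() == b.keys() and all(abs(a[k] - b[k]) <= tolerance for k in a)


def stopping_round(history: list[Scores], tolerance: float = 0.05) -> Optional[int]:
    """1-based index of the round to stop at, or None if the run should continue."""
    flat_streak = 0
    for i in range(1, len(history)):
        if indistinguishable(history[i - 1], history[i], tolerance):
            flat_streak += 1
            if flat_streak == 2:
                return i + 1
        else:
            # Real movement, or a widened rubric, resets the streak.
            flat_streak = 0
    return None
```

Because a round graded against a different set of dimensions never counts as indistinguishable, widening the grader mid-run automatically restarts the count.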

D6 · Auto-mode vs guardrails · Auto-mode vs 护栏

Auto-mode (terse-commander camp)

For users with mature context infrastructure, auto-mode is the throughput configuration. One-word commands execute end-to-end; the model's judgment is calibrated by hundreds of prior sessions; destructive operations are still gated by classifier rules. Check-ins on every step would destroy 90% of the value.

Guardrails (safety camp)

The classifier catches some destructive operations but not all. Hook-enforced rules cover known gaps but not unknown ones. Most users do not have the context infrastructure to make auto-mode safe; rolling it out broadly invites the kind of error a check-in cadence would have caught.

Phase-discipline (the compromise)

Auto-mode within a phase; explicit user check-in at phase transitions (Product → Eng → Implement → Review → Ship). The phase boundary is the natural place to surface decisions that a single one-word command can't disambiguate. This machine implements the configuration partially: auto-mode plus the SessionStart phase banner enforce the within-phase half; the explicit transition gate is still a convention rather than a hook, with the classifier catching the highest-stakes operations (e.g., self-modification of settings.json) as a backstop.
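
The phase-discipline compromise amounts to a small state machine: commands run freely inside a phase, and only the transition asks for confirmation. The sketch below is hypothetical and is not how this machine's hooks are wired; `PhaseGate` and its methods are invented names.

```python
PHASES = ["Product", "Eng", "Implement", "Review", "Ship"]


class PhaseGate:
    """Tracks the current phase and gates transitions on an explicit check-in."""

    def __init__(self) -> None:
        self.index = 0

    @property
    def phase(self) -> str:
        return PHASES[self.index]

    def run(self, command: str) -> str:
        """Within a phase, terse commands proceed without a check-in."""
        return f"[{self.phase}] auto: {command}"

    def advance(self, user_confirmed: bool) -> str:
        """Crossing a phase boundary requires explicit user confirmation."""
        if self.index == len(PHASES) - 1:
            raise RuntimeError("already in the final phase")
        if not user_confirmed:
            raise PermissionError(f"check-in required before leaving {self.phase}")
        self.index += 1
        return self.phase
```

The design choice is that `run` never blocks and `advance` always can, which keeps the throughput of auto-mode inside a phase while concentrating the check-ins at the five boundaries.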