vault backup: 2026-04-13 18:04:51

2026-06-04 10:15:15 +08:00 · 2026-04-13 18:04:53 +08:00
parent 267e9f05e4
commit b4d35f5139
20 changed files with 1189 additions and 3 deletions
@@ -0,0 +1,160 @@
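+# /kb: LLM Knowledge Base Management Tool
+
+Based on Karpathy's LLM Knowledge Base pattern: `raw/` holds the source material, an LLM compiles it into `wiki/`, and index files take the place of RAG.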
+
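+## Quick Start
+
+### 1. Initialize the knowledge base
+
+```
+/kb init
+```
+
+Creates the knowledge base directory structure in the current directory:
+- `raw/`: source material (read-only)
+- `wiki/concepts/`: core concepts
+- `wiki/sources/`: source summaries
+- `wiki/comparisons/`: comparative analyses
+- `output/analysis/`: analysis reports
+- `output/slides/`: slides
+- `index/`: index files
+
+### 2. Import files
+
+Put PDF, Excel, image, or Word files into `raw/`, then run:
+
+```
+/kb ingest
+```
+
+Text is extracted automatically and each file is registered in the index.
+
+### 3. Compile into the wiki
+
+```
+/kb compile
+```
+
+The LLM reads the source material and generates structured wiki articles.
+
+### 4. Query the knowledge base
+
+```
+/kb query "your question"
+```
+
+Generates a structured report with analysis, conclusions, and backfill suggestions.
+
+### 5. Backfill valuable results
+
+```
+/kb file
+```
+
+Merges the valuable parts of a query report into the wiki.
+
+### 6. Health check
+
+```
+/kb lint
+```
+
+Six checks: broken links, orphans, provenance, consistency, coverage, and gap discovery.
+
+### 7. View status
+
+```
+/kb status
+```
+
+A dashboard showing overall health and statistics.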
+
+---
+
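+## Subcommand Reference
+
+| Command | Purpose | Trigger phrases |
+|------|------|--------|
+| `kb init [dir]` | Initialize the knowledge base | "初始化" (initialize), "创建知识库" (create knowledge base) |
+| `kb ingest` | Preprocess files in raw/ | "导入" (import), "处理新文件" (process new files) |
+| `kb compile [file]` | Compile into the wiki | "编译" (compile), "更新 wiki" (update wiki) |
+| `kb query "<question>"` | Query the knowledge base | "查知识库" (search the knowledge base), "问知识库" (ask the knowledge base) |
+| `kb file [report]` | Backfill into the wiki | "回填" (backfill), "归档" (archive) |
+| `kb lint` | Health check | "检查" (check), "lint" |
+| `kb status` | Status dashboard | "状态" (status), "看看知识库" (look at the knowledge base) |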
+
+---
+
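+## Supported File Formats
+
+| Format | Extensions | Notes |
+|------|------|------|
+| PDF | .pdf | Extracts text and images |
+| Excel | .xlsx, .xls, .csv | Extracts table contents |
+| Image | .png, .jpg, .jpeg | OCR text recognition |
+| Word | .docx | Extracts paragraphs and tables |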
+
+---
+
+## Workflow
+
+```
+Feed sources     LLM compile      Query and use
+    │                │                │
+    ▼                ▼                ▼
+ raw/ ──────► wiki/ ──────► query and analysis ──────► backfill
+    │                │                │
+ source files    structured articles    knowledge growth
+```
+
+---
+
+## Directory Structure
+
+```
+{kb root}/
+├── raw/                    # source material (read-only)
+│   └── .extracted/         # extracted text (auto-generated)
+├── wiki/
+│   ├── concepts/           # core concepts
+│   ├── sources/            # source summaries
+│   └── comparisons/        # comparative analyses
+├── output/
+│   ├── analysis/           # query reports
+│   └── slides/             # slides
+├── index/
+│   ├── MASTER-INDEX.md     # global index
+│   ├── TOPIC-MAP.md        # topic grouping
+│   ├── RAW-REGISTRY.md     # source file registry
+│   ├── LINT-REPORT.md      # health check report
+│   └── ONTOLOGY.md         # ontology definition
+└── scripts/
+    ├── ingest.py           # preprocessing script
+    └── extractors/         # file extractors
+```
+
+---
+
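+## Python Dependencies
+
+Install the dependencies before first use:
+
+```bash
+pip install -r .claude/skills/kb/scripts/requirements.txt
+```
+
+Dependencies:
+- PyMuPDF: PDF extraction
+- openpyxl: Excel reading
+- pandas: data processing
+- pytesseract: image OCR (also needs the Tesseract binary with `chi_sim` and `eng` language data, since the image extractor calls it with `lang="chi_sim+eng"`)
+- python-docx: Word reading
+- Pillow: image handling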
+
+---
+
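+## SessionStart Hook (optional)
+
+Once configured, every time Claude Code starts it checks `raw/` for new files and reminds you to process them.
+
+Answer "yes" when prompted during initialization to enable it.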
@@ -0,0 +1,327 @@
+---
+name: kb
+description: |
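+  LLM-driven knowledge base management toolbox. Triggered when the user says "kb", "知识库", "查知识库", "初始化知识库", "导入文件", "编译", "回填", or similar phrases.
+  Builds a knowledge base over the vault or an external directory: preprocess files, compile the wiki, query and analyze, run health checks.
+  Based on Karpathy's LLM Knowledge Base pattern: raw/ holds the source material, an LLM compiles it into wiki/, and index files take the place of RAG.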
+user-invocable: true
+---
+
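+# /kb: LLM Knowledge Base Management
+
+A single entry point with 7 subcommands.
+
+## Subcommand Reference
+
+| Command | Purpose | Trigger phrases |
+|------|------|--------|
+| `kb init [dir]` | Initialize the knowledge base | "初始化" (initialize), "创建知识库" (create knowledge base) |
+| `kb ingest` | Preprocess files in raw/ | "导入" (import), "处理新文件" (process new files) |
+| `kb compile [file]` | Compile into the wiki | "编译" (compile), "更新 wiki" (update wiki) |
+| `kb query "<question>"` | Query the knowledge base | "查知识库" (search the knowledge base), "问知识库" (ask the knowledge base) |
+| `kb file [report]` | Backfill into the wiki | "回填" (backfill), "归档" (archive) |
+| `kb lint` | Health check | "检查" (check), "lint" |
+| `kb status` | Status dashboard | "状态" (status), "看看知识库" (look at the knowledge base) |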
+
+---
+
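+## kb init [target directory]
+
+Initializes the knowledge base directory structure, indexes, and ontology definition.
+
+**Argument**: optional target directory; defaults to the current directory (the vault), or pass an external directory.
+
+### Steps
+
+1. **Check for an existing knowledge base**: look for `{target}/index/MASTER-INDEX.md`; if it exists, warn and wait for confirmation
+
+2. **Create the directory structure**:
+   ```
+   {target}/raw/              # source material (read-only)
+   {target}/wiki/concepts/    # core concepts
+   {target}/wiki/sources/     # source summaries
+   {target}/wiki/comparisons/ # comparative analyses
+   {target}/output/analysis/  # analysis reports
+   {target}/output/slides/    # slides
+   {target}/index/            # index files
+   {target}/scripts/          # preprocessing scripts
+   ```
+
+3. **Copy template files**: from this Skill's `templates/` directory into `{target}/index/`:
+   - ONTOLOGY.md: entity types and relationship definitions
+   - MASTER-INDEX.md: global index
+   - TOPIC-MAP.md: topic grouping
+   - RAW-REGISTRY.md: source file registry
+
+4. **Copy scripts**: from this Skill's `scripts/` directory into `{target}/scripts/`
+
+5. **Check Python dependencies**:
+   ```bash
+   pip show pymupdf openpyxl pandas pytesseract python-docx Pillow 2>&1
+   ```
+   Report any missing packages and ask whether to install them
+
+6. **Configure the SessionStart Hook (optional)**: ask whether to set it up; when enabled, it reminds the user whenever new files are detected in raw/
+
+7. **Print an initialization summary**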
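Steps 1 and 2 above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the skill itself is carried out by the LLM, and `init_layout` and `KB_DIRS` are hypothetical names that do not appear in this commit.

```python
from pathlib import Path

# Directory skeleton listed in step 2 (relative to the target directory).
KB_DIRS = [
    "raw",
    "wiki/concepts",
    "wiki/sources",
    "wiki/comparisons",
    "output/analysis",
    "output/slides",
    "index",
    "scripts",
]


def init_layout(target: str) -> list[str]:
    """Create the skeleton and return the directories that were newly created.

    Hypothetical helper: refuses to run when index/MASTER-INDEX.md already
    exists, mirroring the "warn and wait for confirmation" check in step 1.
    """
    root = Path(target)
    if (root / "index" / "MASTER-INDEX.md").exists():
        raise FileExistsError(f"{root} already contains a knowledge base")
    created = []
    for rel in KB_DIRS:
        path = root / rel
        if not path.exists():
            path.mkdir(parents=True)
            created.append(rel)
    return created
```

Calling `init_layout(".")` twice is harmless: the second call creates nothing and returns an empty list, as long as no `MASTER-INDEX.md` has been copied in yet.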
+
+---
+
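+## kb ingest
+
+Preprocesses new files in raw/ and registers them in the index.
+
+**Precondition**: the knowledge base is initialized (index/RAW-REGISTRY.md exists)
+
+### Supported formats
+- PDF (.pdf)
+- Excel (.xlsx, .xls, .csv)
+- Images (.png, .jpg, .jpeg), extracted via OCR
+- Word (.docx)
+
+### Steps
+
+1. **Locate the knowledge base**: search upward for `index/RAW-REGISTRY.md`
+
+2. **Run the preprocessing script**:
+   ```bash
+   python3 {skill_dir}/scripts/ingest.py {kb_root}
+   ```
+   The script scans for new files, extracts text by file type, and prints a summary
+
+3. **Register in RAW-REGISTRY.md**: add an entry for each new file:
+   - File path, type, one-sentence summary
+   - Status: `pending` (awaiting compilation)
+
+4. **Print a summary**: report how many files were imported and point to the next step, `/kb compile`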
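A rough sketch of steps 1 and 3, assuming the five-column registry layout from the `RAW-REGISTRY.md` template (file, type, summary, status, compiled output). `find_kb_root` and `registry_row` are hypothetical helper names used here for illustration only.

```python
from pathlib import Path
from typing import Optional


def find_kb_root(start: str, marker: str = "index/RAW-REGISTRY.md") -> Optional[Path]:
    """Walk upward from `start` until a directory containing `marker` is found."""
    current = Path(start).resolve()
    for candidate in [current, *current.parents]:
        if (candidate / marker).is_file():
            return candidate
    return None


def registry_row(rel_path: str, file_type: str, summary: str) -> str:
    """Format one registry entry; new files start out as `pending`."""
    return f"| {rel_path} | {file_type} | {summary} | pending | |"
```

The row begins with `| raw/`, which is the prefix `scan_raw` in `ingest.py` looks for when deciding whether a file is already registered.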
+
+---
+
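+## kb compile [file]
+
+Compiles files in raw/ that have been imported but not yet compiled into wiki articles.
+
+**Argument**: optional file; by default, processes every entry with `status=pending`
+
+### Core principles
+- Wiki articles are generated by the LLM and follow the definitions in ONTOLOGY.md
+- Every article must have complete YAML frontmatter
+- Use `[[wikilinks]]` to connect articles
+- Compilation is incremental: only `pending` entries are processed
+
+### Steps
+
+1. **Check for pending entries**: read `index/RAW-REGISTRY.md` and find entries with `status=pending`
+   - If there are none, tell the user and stop
+
+2. **Load context**: read ONTOLOGY.md, MASTER-INDEX.md, and TOPIC-MAP.md
+
+3. **Compile one at a time**:
+   - Read the source file or its extracted text under `raw/.extracted/`
+   - Decide the operation: create new / update existing / synthesize
+   - Generate the wiki article from the template
+   - Update the frontmatter (type, id, compiled_from, related, last_compiled)
+   - Link related articles with `[[wikilinks]]`
+
+4. **Update the indexes**:
+   - MASTER-INDEX.md: add or update the entry
+   - TOPIC-MAP.md: file the article under a topic
+   - RAW-REGISTRY.md: change the status to `done` and fill in the compiled output path
+
+5. **Print a compilation summary**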
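Step 1 amounts to filtering the registry table by its status column. A minimal sketch, assuming the status is the fourth column as in the `RAW-REGISTRY.md` template; `pending_entries` is a hypothetical name, and in the skill this step is performed by the LLM reading the file.

```python
def pending_entries(registry_text: str) -> list[str]:
    """Return the file paths of registry rows whose status cell is `pending`."""
    pending = []
    for line in registry_text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # headings, blank lines, comments
        # Drop the empty strings produced by the leading and trailing pipes.
        cells = [cell.strip() for cell in line.split("|")[1:-1]]
        if len(cells) >= 4 and cells[3] == "pending":
            pending.append(cells[0])
    return pending
```

The header row and the `|------|` separator fall through naturally, because their fourth cell is never the literal string `pending`.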
+
+---
+
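+## kb query "<question>"
+
+Asks the knowledge base a question and generates a structured report.
+
+**Argument**: required; the user's question
+
+### Steps
+
+1. **Locate the knowledge base**: look for `index/MASTER-INDEX.md`
+
+2. **Retrieve relevant articles**:
+   - Read MASTER-INDEX.md to find relevant files
+   - Read TOPIC-MAP.md as needed to narrow the search
+   - Read the full content of every relevant wiki article
+
+3. **Research and analyze**:
+   - Analyze the question in depth based on the wiki content
+   - Cross-compare multiple articles
+   - Conclusions must rest on actual content and cite their sources
+
+4. **Generate the report**: save it to `output/analysis/YYYY-MM-DD-{topic-slug}.md`:
+   ```markdown
+   # {report title}
+
+   - **Date**: YYYY-MM-DD
+   - **Query**: {user question}
+   - **Sources**: {wiki articles cited}
+
+   ---
+
+   ## Analysis
+   {detailed analysis; cite specific articles with [[wikilinks]]}
+
+   ## Conclusions
+   {key findings}
+
+   ## Backfill suggestions
+   - [ ] {specific suggestion}
+   ```
+
+5. **Show the result**: display a summary and mention that `/kb file` can backfill it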
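The report filename in step 4 combines a date with a topic slug. The skill does not say how the slug is derived, so the rule below (lowercase, runs of non-word characters collapsed to a hyphen) is an assumption, and `report_path` is a hypothetical helper.

```python
import re
from datetime import date


def report_path(topic: str, day: date) -> str:
    """Build output/analysis/YYYY-MM-DD-{topic-slug}.md for a query topic."""
    # \w is Unicode-aware in Python 3, so CJK characters survive in the slug.
    slug = re.sub(r"[^\w]+", "-", topic.lower()).strip("-")
    return f"output/analysis/{day.isoformat()}-{slug}.md"
```

For example, `report_path("RAG vs. Index?", date(2026, 4, 13))` yields `output/analysis/2026-04-13-rag-vs-index.md`.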
+
+---
+
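+## kb file [report path]
+
+Backfills query output into the wiki.
+
+**Argument**: optional report file under output/; by default, scans `output/analysis/`
+
+### Steps
+
+1. **Locate the knowledge base and the content awaiting backfill**
+
+2. **Show the backfill suggestions**: list every suggestion, numbered and explained
+
+3. **User confirmation**: Y/N per item, or batch operations
+
+4. **Apply the backfill**:
+   - **Update existing articles**: weave the new content in naturally
+   - **Create new articles**: use the ONTOLOGY.md template
+
+5. **Update the indexes**: MASTER-INDEX.md and TOPIC-MAP.md
+
+6. **Print a summary**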
+
+---
+
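+## kb lint
+
+Runs six health checks on the knowledge base.
+
+### Checks
+
+| Check | Description |
+|------|------|
+| Broken links | `[[link]]` points to a file that does not exist |
+| Orphans | Articles that no other article links to |
+| Provenance | frontmatter `compiled_from` points to a deleted file |
+| Consistency | Contradictory descriptions of the same concept across articles |
+| Coverage | Proportion of files not yet compiled |
+| Gap discovery | Concepts that are mentioned but have no article of their own |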
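
The first two checks are mechanical enough to sketch in code. The version below is an illustration under assumptions: articles are passed in as a mapping from article id to Markdown body, a link target is the text before any `|` alias or `#` anchor, and `lint_links` is a hypothetical name. The remaining checks (consistency, gap discovery) need the LLM's judgment and are not covered.

```python
import re

# Capture the target of [[target]], [[target|alias]] or [[target#heading]].
LINK_RE = re.compile(r"\[\[([^\]|#]+)")


def lint_links(articles: dict[str, str]) -> tuple[dict[str, list[str]], list[str]]:
    """Return (broken links per article, orphaned article ids)."""
    broken: dict[str, list[str]] = {}
    linked: set[str] = set()
    for article_id, body in articles.items():
        targets = [target.strip() for target in LINK_RE.findall(body)]
        missing = [target for target in targets if target not in articles]
        if missing:
            broken[article_id] = missing
        # A self-link does not rescue an article from being an orphan.
        linked.update(target for target in targets if target != article_id)
    orphans = sorted(article_id for article_id in articles if article_id not in linked)
    return broken, orphans
```

Given three articles where `rag` links to `index` and a missing `embedding`, `index` links back to `rag`, and `notes` has no inbound links, the function reports `embedding` as broken and `notes` as an orphan.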
+
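+### Steps
+
+1. **Locate the knowledge base**
+
+2. **Run the six checks**
+
+3. **Print the lint report** (sorted by severity)
+
+4. **Offer fixes**: for issues that can be fixed automatically, ask whether to apply the fix
+
+5. **Save the report to `index/LINT-REPORT.md`**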
+
+---
+
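+## kb status
+
+Shows the overall status dashboard for the knowledge base.
+
+### Steps
+
+1. **Locate the knowledge base**
+
+2. **Collect statistics**:
+   - Number of files in raw/
+   - Number of wiki/ articles and their word count
+   - Compile rate
+   - Number of reports awaiting backfill
+   - Result of the last lint run
+
+3. **Show the dashboard**:
+   ```
+   Knowledge Base Status
+   ═══════════════════════════════════
+   Source files:   N
+   Wiki articles:  M (~X words total)
+   Compile rate:   XX%
+   To backfill:    Y reports
+   Last lint:      date, issue summary
+   ═══════════════════════════════════
+
+   Recently compiled articles:
+     - wiki/concepts/xxx.md (date)
+
+   Pending:
+     - N files to compile → /kb compile
+     - M reports to backfill → /kb file
+   ```
+
+4. **Suggest next steps**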
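The compile rate is not defined precisely in this skill. One plausible reading, used in the sketch below as an assumption, is the share of registry entries whose status is `done`, rounded to a whole percent; `compile_rate` is a hypothetical name.

```python
def compile_rate(statuses: list[str]) -> int:
    """Percentage of registry entries marked `done` (0 for an empty registry)."""
    if not statuses:
        return 0
    done = sum(1 for status in statuses if status == "done")
    return round(100 * done / len(statuses))
```

With three `done` entries and one `pending` entry the dashboard would show a compile rate of 75%.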
+
+---
+
+## Directory Structure Conventions
+
+```
+{kb root}/
+├── raw/                    # source material (read-only)
+│   └── .extracted/         # extracted text (auto-generated)
+├── wiki/
+│   ├── concepts/           # core concepts
+│   ├── sources/            # source summaries
+│   └── comparisons/        # comparative analyses
+├── output/
+│   ├── analysis/           # query reports
+│   └── slides/             # slides
+├── index/
+│   ├── MASTER-INDEX.md     # global index
+│   ├── TOPIC-MAP.md        # topic grouping
+│   ├── RAW-REGISTRY.md     # source file registry
+│   ├── LINT-REPORT.md      # health check report
+│   └── ONTOLOGY.md         # ontology definition
+└── scripts/
+    ├── ingest.py           # preprocessing script
+    ├── requirements.txt    # Python dependencies
+    └── extractors/         # per-format file extractors
+```
+
+## Entity Types (ONTOLOGY.md)
+
+| Type | Directory | Naming rule |
+|------|------|----------|
+| concept | wiki/concepts/ | {slug}.md |
+| source | wiki/sources/ | {slug}.md |
+| comparison | wiki/comparisons/ | {a}-vs-{b}.md |
+
+## Wiki Article Frontmatter Template
+
+```yaml
+---
+type: concept
+id: {slug}
+aliases: []
+compiled_from:
+  - raw/{source_file}
+related:
+  - "[[other-article]]"
+last_compiled: YYYY-MM-DD
+---
+```
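To show how the template's placeholders get filled, here is a small sketch that renders the frontmatter as plain text without a YAML library. `render_frontmatter` is a hypothetical helper, and emitting `[]` for empty lists is an assumption made so the result stays valid YAML.

```python
from datetime import date


def render_frontmatter(entity_type: str, slug: str, sources: list[str],
                       related: list[str], compiled_on: date) -> str:
    """Render the frontmatter block for one wiki article."""

    def yaml_list(key: str, items: list[str]) -> list[str]:
        # An empty list is written inline so the key never has a null value.
        if not items:
            return [f"{key}: []"]
        return [f"{key}:"] + [f"  - {item}" for item in items]

    lines = ["---", f"type: {entity_type}", f"id: {slug}", "aliases: []"]
    lines += yaml_list("compiled_from", sources)
    # Wikilinks are quoted because a bare [[...]] would parse as a nested YAML list.
    lines += yaml_list("related", [f'"[[{name}]]"' for name in related])
    lines.append(f"last_compiled: {compiled_on.isoformat()}")
    lines.append("---")
    return "\n".join(lines)
```

A call such as `render_frontmatter("concept", "rag", ["raw/paper.pdf"], ["index"], date(2026, 4, 13))` reproduces the template above with the placeholders replaced.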
+
+---
+
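+## Troubleshooting
+
+| Problem | Solution |
+|------|----------|
+| Knowledge base not found | Run `/kb init` first |
+| Script fails with a missing-dependency error | Run `pip install -r scripts/requirements.txt` |
+| Low compile rate | Run `/kb ingest` to import new files, then `/kb compile` |
+| Too many broken links | Run `/kb lint` for details, then fix or remove the broken links by hand |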
@@ -0,0 +1,381 @@
+<!DOCTYPE html>
+<html lang="zh-CN">
+<head>
+  <meta charset="UTF-8">
+  <meta name="viewport" content="width=device-width, initial-scale=1.0">
+  <title>/kb: LLM Knowledge Base Management Tool</title>
+  <style>
+    :root {
+      --bg: #0d1117;
+      --surface: #161b22;
+      --border: #30363d;
+      --text: #e6edf3;
+      --text-muted: #8b949e;
+      --accent: #58a6ff;
+      --accent-bg: #1f6feb1a;
+      --success: #3fb950;
+      --warning: #d29922;
+    }
+
+    * {
+      box-sizing: border-box;
+      margin: 0;
+      padding: 0;
+    }
+
+    body {
+      font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Helvetica, Arial, sans-serif;
+      background: var(--bg);
+      color: var(--text);
+      line-height: 1.6;
+      min-height: 100vh;
+      padding: 2rem;
+    }
+
+    .container {
+      max-width: 800px;
+      margin: 0 auto;
+    }
+
+    h1 {
+      font-size: 2rem;
+      margin-bottom: 0.5rem;
+      display: flex;
+      align-items: center;
+      gap: 0.5rem;
+    }
+
+    h1::before {
+      content: '📚';
+    }
+
+    .subtitle {
+      color: var(--text-muted);
+      margin-bottom: 2rem;
+    }
+
+    h2 {
+      font-size: 1.4rem;
+      margin: 2rem 0 1rem;
+      padding-bottom: 0.5rem;
+      border-bottom: 1px solid var(--border);
+      display: flex;
+      align-items: center;
+      gap: 0.5rem;
+    }
+
+    h3 {
+      font-size: 1.1rem;
+      margin: 1.5rem 0 0.75rem;
+      color: var(--accent);
+    }
+
+    code {
+      background: var(--surface);
+      padding: 0.2rem 0.4rem;
+      border-radius: 4px;
+      font-family: 'SF Mono', Consolas, monospace;
+      font-size: 0.9em;
+      color: var(--success);
+    }
+
+    pre {
+      background: var(--surface);
+      border: 1px solid var(--border);
+      border-radius: 8px;
+      padding: 1rem;
+      overflow-x: auto;
+      margin: 1rem 0;
+    }
+
+    pre code {
+      background: none;
+      padding: 0;
+      color: var(--text);
+    }
+
+    .command {
+      background: linear-gradient(135deg, var(--accent-bg), transparent);
+      border-left: 3px solid var(--accent);
+      padding: 0.75rem 1rem;
+      margin: 0.5rem 0;
+      border-radius: 0 8px 8px 0;
+      font-family: 'SF Mono', Consolas, monospace;
+    }
+
+    .card {
+      background: var(--surface);
+      border: 1px solid var(--border);
+      border-radius: 8px;
+      padding: 1rem;
+      margin: 1rem 0;
+    }
+
+    table {
+      width: 100%;
+      border-collapse: collapse;
+      margin: 1rem 0;
+    }
+
+    th, td {
+      text-align: left;
+      padding: 0.75rem;
+      border-bottom: 1px solid var(--border);
+    }
+
+    th {
+      color: var(--accent);
+      font-weight: 600;
+    }
+
+    tr:hover {
+      background: var(--surface);
+    }
+
+    .flow {
+      display: flex;
+      align-items: center;
+      justify-content: center;
+      gap: 0.5rem;
+      margin: 1.5rem 0;
+      flex-wrap: wrap;
+    }
+
+    .flow-step {
+      background: var(--surface);
+      border: 1px solid var(--border);
+      border-radius: 8px;
+      padding: 0.75rem 1rem;
+      text-align: center;
+    }
+
+    .flow-arrow {
+      color: var(--text-muted);
+    }
+
+    .tag {
+      display: inline-block;
+      background: var(--accent-bg);
+      color: var(--accent);
+      padding: 0.2rem 0.6rem;
+      border-radius: 20px;
+      font-size: 0.85em;
+      margin-right: 0.5rem;
+    }
+
+    .dir-tree {
+      font-family: 'SF Mono', Consolas, monospace;
+      font-size: 0.9rem;
+      line-height: 1.8;
+    }
+
+    .dir-comment {
+      color: var(--text-muted);
+    }
+
+    footer {
+      margin-top: 3rem;
+      padding-top: 1rem;
+      border-top: 1px solid var(--border);
+      color: var(--text-muted);
+      text-align: center;
+    }
+  </style>
+</head>
+<body>
+  <div class="container">
+    <h1>/kb: LLM Knowledge Base Management Tool</h1>
+    <p class="subtitle">Based on Karpathy's LLM Knowledge Base pattern: raw/ holds the source material, an LLM compiles it into wiki/, and index files take the place of RAG.</p>
+
+    <h2>🚀 Quick Start</h2>
+
+    <h3>1. Initialize the knowledge base</h3>
+    <div class="command">/kb init</div>
+    <p style="margin-top: 0.5rem;">Creates the knowledge base directory structure in the current directory:</p>
+
+    <div class="dir-tree">
+      <pre>
+├── raw/                    <span class="dir-comment"># source material (read-only)</span>
+├── wiki/
+│   ├── concepts/           <span class="dir-comment"># core concepts</span>
+│   ├── sources/            <span class="dir-comment"># source summaries</span>
+│   └── comparisons/        <span class="dir-comment"># comparative analyses</span>
+├── output/
+│   ├── analysis/           <span class="dir-comment"># analysis reports</span>
+│   └── slides/             <span class="dir-comment"># slides</span>
+└── index/                  <span class="dir-comment"># index files</span>
+      </pre>
+    </div>
+
+    <h3>2. Import files</h3>
+    <p>Put PDF, Excel, image, or Word files into <code>raw/</code>, then run:</p>
+    <div class="command">/kb ingest</div>
+    <p style="margin-top: 0.5rem;">Text is extracted automatically and each file is registered in the index.</p>
+
+    <h3>3. Compile into the wiki</h3>
+    <div class="command">/kb compile</div>
+    <p style="margin-top: 0.5rem;">The LLM reads the source material and generates structured wiki articles.</p>
+
+    <h3>4. Query the knowledge base</h3>
+    <div class="command">/kb query "your question"</div>
+    <p style="margin-top: 0.5rem;">Generates a structured report with analysis, conclusions, and backfill suggestions.</p>
+
+    <h3>5. Backfill valuable results</h3>
+    <div class="command">/kb file</div>
+    <p style="margin-top: 0.5rem;">Merges the valuable parts of a query report into the wiki.</p>
+
+    <h3>6. Health check</h3>
+    <div class="command">/kb lint</div>
+    <p style="margin-top: 0.5rem;">Six checks: broken links, orphans, provenance, consistency, coverage, and gap discovery.</p>
+
+    <h3>7. View status</h3>
+    <div class="command">/kb status</div>
+    <p style="margin-top: 0.5rem;">A dashboard showing overall health and statistics.</p>
+
+    <h2>📋 Subcommand Reference</h2>
+
+    <table>
+      <thead>
+        <tr>
+          <th>Command</th>
+          <th>Purpose</th>
+          <th>Trigger phrases</th>
+        </tr>
+      </thead>
+      <tbody>
+        <tr>
+          <td><code>kb init [dir]</code></td>
+          <td>Initialize the knowledge base</td>
+          <td>初始化 (initialize), 创建知识库 (create knowledge base)</td>
+        </tr>
+        <tr>
+          <td><code>kb ingest</code></td>
+          <td>Preprocess files in raw/</td>
+          <td>导入 (import), 处理新文件 (process new files)</td>
+        </tr>
+        <tr>
+          <td><code>kb compile [file]</code></td>
+          <td>Compile into the wiki</td>
+          <td>编译 (compile), 更新 wiki (update wiki)</td>
+        </tr>
+        <tr>
+          <td><code>kb query "&lt;question&gt;"</code></td>
+          <td>Query the knowledge base</td>
+          <td>查知识库 (search the knowledge base), 问知识库 (ask the knowledge base)</td>
+        </tr>
+        <tr>
+          <td><code>kb file [report]</code></td>
+          <td>Backfill into the wiki</td>
+          <td>回填 (backfill), 归档 (archive)</td>
+        </tr>
+        <tr>
+          <td><code>kb lint</code></td>
+          <td>Health check</td>
+          <td>检查 (check), lint</td>
+        </tr>
+        <tr>
+          <td><code>kb status</code></td>
+          <td>Status dashboard</td>
+          <td>状态 (status), 看看知识库 (look at the knowledge base)</td>
+        </tr>
+      </tbody>
+    </table>
+
+    <h2>📦 Supported File Formats</h2>
+
+    <table>
+      <thead>
+        <tr>
+          <th>Format</th>
+          <th>Extensions</th>
+          <th>Notes</th>
+        </tr>
+      </thead>
+      <tbody>
+        <tr>
+          <td>PDF</td>
+          <td>.pdf</td>
+          <td>Extracts text and images</td>
+        </tr>
+        <tr>
+          <td>Excel</td>
+          <td>.xlsx, .xls, .csv</td>
+          <td>Extracts table contents</td>
+        </tr>
+        <tr>
+          <td>Image</td>
+          <td>.png, .jpg, .jpeg</td>
+          <td>OCR text recognition</td>
+        </tr>
+        <tr>
+          <td>Word</td>
+          <td>.docx</td>
+          <td>Extracts paragraphs and tables</td>
+        </tr>
+      </tbody>
+    </table>
+
+    <h2>🔄 Workflow</h2>
+
+    <div class="flow">
+      <div class="flow-step">Feed sources<br><small>raw/</small></div>
+      <span class="flow-arrow">→</span>
+      <div class="flow-step">LLM compile<br><small>wiki/</small></div>
+      <span class="flow-arrow">→</span>
+      <div class="flow-step">Query and use<br><small>/kb query</small></div>
+      <span class="flow-arrow">→</span>
+      <div class="flow-step">Knowledge growth<br><small>/kb file</small></div>
+    </div>
+
+    <h2>📁 Full Directory Structure</h2>
+
+    <pre class="dir-tree">
+{kb root}/
+├── raw/                    <span class="dir-comment"># source material (read-only)</span>
+│   └── .extracted/         <span class="dir-comment"># extracted text (auto-generated)</span>
+├── wiki/
+│   ├── concepts/           <span class="dir-comment"># core concepts</span>
+│   ├── sources/            <span class="dir-comment"># source summaries</span>
+│   └── comparisons/        <span class="dir-comment"># comparative analyses</span>
+├── output/
+│   ├── analysis/           <span class="dir-comment"># query reports</span>
+│   └── slides/             <span class="dir-comment"># slides</span>
+├── index/
+│   ├── MASTER-INDEX.md     <span class="dir-comment"># global index</span>
+│   ├── TOPIC-MAP.md        <span class="dir-comment"># topic grouping</span>
+│   ├── RAW-REGISTRY.md     <span class="dir-comment"># source file registry</span>
+│   ├── LINT-REPORT.md      <span class="dir-comment"># health check report</span>
+│   └── ONTOLOGY.md         <span class="dir-comment"># ontology definition</span>
+└── scripts/
+    ├── ingest.py           <span class="dir-comment"># preprocessing script</span>
+    └── extractors/         <span class="dir-comment"># file extractors</span>
+    </pre>
+
+    <h2>🐍 Python Dependencies</h2>
+
+    <p>Install the dependencies before first use:</p>
+    <div class="command">pip install -r .claude/skills/kb/scripts/requirements.txt</div>
+
+    <div class="card">
+      <strong>Dependencies:</strong>
+      <ul style="margin-top: 0.5rem; padding-left: 1.5rem;">
+        <li>PyMuPDF: PDF extraction</li>
+        <li>openpyxl: Excel reading</li>
+        <li>pandas: data processing</li>
+        <li>pytesseract: image OCR</li>
+        <li>python-docx: Word reading</li>
+        <li>Pillow: image handling</li>
+      </ul>
+    </div>
+
+    <h2>⚙️ SessionStart Hook (optional)</h2>
+
+    <p>Once configured, every time Claude Code starts it checks <code>raw/</code> for new files and reminds you to process them.</p>
+    <p style="margin-top: 0.5rem;">Answer "yes" when prompted during initialization to enable it.</p>
+
+    <footer>
+      <p>/kb, consolidated from <a href="https://github.com/ChuYinan2023/kb-skills" style="color: var(--accent);">kb-skills</a></p>
+    </footer>
+  </div>
+</body>
+</html>
@@ -0,0 +1,28 @@
+"""Extract text from Word documents."""
+from docx import Document
+import os
+
+
+def extract(docx_path: str, output_dir: str) -> str:
+    """Extract all paragraphs and tables from docx."""
+    basename = os.path.splitext(os.path.basename(docx_path))[0]
+    txt_path = os.path.join(output_dir, f"{basename}.txt")
+
+    doc = Document(docx_path)
+    parts = []
+
+    for para in doc.paragraphs:
+        if para.text.strip():
+            parts.append(para.text)
+
+    for i, table in enumerate(doc.tables):
+        parts.append(f"\n--- Table {i+1} ---")
+        for row in table.rows:
+            cells = [cell.text.strip() for cell in row.cells]
+            parts.append(" | ".join(cells))
+
+    with open(txt_path, "w", encoding="utf-8") as f:
+        f.write("\n".join(parts))
+
+    print(f"  Word: {len(doc.paragraphs)} paragraphs, {len(doc.tables)} tables extracted")
+    return txt_path
@@ -0,0 +1,34 @@
+"""Extract text summary from Excel files."""
+import pandas as pd
+import os
+
+
+def extract(excel_path: str, output_dir: str) -> str:
+    """Summarize every sheet (first 50 rows each) as text; .xls needs xlrd, which requirements.txt omits."""
+    basename = os.path.splitext(os.path.basename(excel_path))[0]
+    txt_path = os.path.join(output_dir, f"{basename}.txt")
+
+    ext = os.path.splitext(excel_path)[1].lower()
+    if ext == ".csv":
+        df = pd.read_csv(excel_path)
+        parts = [f"--- CSV ({len(df)} rows x {len(df.columns)} cols) ---"]
+        parts.append(f"Columns: {', '.join(df.columns.astype(str))}")
+        parts.append(df.head(50).to_string(index=False))
+        if len(df) > 50:
+            parts.append(f"... ({len(df) - 50} more rows)")
+    else:
+        xls = pd.ExcelFile(excel_path)
+        parts = []
+        for sheet in xls.sheet_names:
+            df = pd.read_excel(xls, sheet_name=sheet)
+            parts.append(f"--- Sheet: {sheet} ({len(df)} rows x {len(df.columns)} cols) ---")
+            parts.append(f"Columns: {', '.join(df.columns.astype(str))}")
+            parts.append(df.head(50).to_string(index=False))
+            if len(df) > 50:
+                parts.append(f"... ({len(df) - 50} more rows)")
+
+    with open(txt_path, "w", encoding="utf-8") as f:
+        f.write("\n\n".join(parts))
+
+    print(f"  Excel: extracted to {basename}.txt")
+    return txt_path
@@ -0,0 +1,20 @@
+"""OCR text from images using pytesseract."""
+import pytesseract
+from PIL import Image
+import os
+
+
+def extract(image_path: str, output_dir: str) -> str:
+    """OCR image, return text file path."""
+    basename = os.path.splitext(os.path.basename(image_path))[0]
+    txt_path = os.path.join(output_dir, f"{basename}.txt")
+
+    with Image.open(image_path) as img:  # close the file handle once OCR is done
+        text = pytesseract.image_to_string(img, lang="chi_sim+eng")  # needs chi_sim and eng language data
+
+    with open(txt_path, "w", encoding="utf-8") as f:
+        f.write(text)
+
+    chars = len(text.strip())
+    print(f"  Image OCR: {chars} characters extracted")
+    return txt_path
@@ -0,0 +1,34 @@
+"""Extract text and images from PDF files using PyMuPDF."""
+import fitz  # PyMuPDF
+import os
+
+
+def extract(pdf_path: str, output_dir: str) -> str:
+    """Extract text from PDF, save images, return text file path."""
+    basename = os.path.splitext(os.path.basename(pdf_path))[0]
+    doc = fitz.open(pdf_path)
+    text_parts = []
+    img_count = 0
+
+    for page_num, page in enumerate(doc):
+        text_parts.append(f"--- Page {page_num + 1} ---")
+        text_parts.append(page.get_text())
+
+        for img_idx, img in enumerate(page.get_images(full=True)):
+            xref = img[0]
+            pix = fitz.Pixmap(doc, xref)
+            if pix.n - pix.alpha > 3:  # CMYK (with or without alpha): convert to RGB before saving as PNG
+                pix = fitz.Pixmap(fitz.csRGB, pix)
+            img_path = os.path.join(output_dir, f"{basename}_page{page_num+1}_img{img_idx+1}.png")  # PDF name prefix avoids overwrites across PDFs
+            pix.save(img_path)
+            img_count += 1
+            pix = None
+
+    doc.close()
+
+    txt_path = os.path.join(output_dir, f"{basename}.txt")
+    with open(txt_path, "w", encoding="utf-8") as f:
+        f.write("\n".join(text_parts))
+
+    print(f"  PDF: {len(text_parts)//2} pages, {img_count} images extracted")
+    return txt_path
@@ -0,0 +1,102 @@
+#!/usr/bin/env python3
+"""Scan raw/ for new files, extract text, print summary for LLM to parse."""
+import importlib
+import os
+import sys
+
+# Add scripts dir to path so extractors can be imported
+sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
+
+EXTRACTORS = {
+    ".pdf": "extractors.pdf_extractor",
+    ".xlsx": "extractors.excel_extractor",
+    ".xls": "extractors.excel_extractor",
+    ".csv": "extractors.excel_extractor",
+    ".png": "extractors.image_extractor",
+    ".jpg": "extractors.image_extractor",
+    ".jpeg": "extractors.image_extractor",
+    ".docx": "extractors.docx_extractor",
+}
+SKIP_EXT = {".md", ".txt"}
+SKIP_DIRS = {".extracted"}
+
+
+def scan_raw(raw_dir, registry_path):
+    """Find files in raw/ not yet in RAW-REGISTRY.md."""
+    registered = set()
+    if os.path.exists(registry_path):
+        with open(registry_path, "r", encoding="utf-8") as f:
+            for line in f:
+                if line.startswith("| raw/") or line.startswith("| ./raw/"):
+                    # Normalize so "./raw/a.pdf" matches the relative path computed below
+                    registered.add(os.path.normpath(line.split("|")[1].strip()))
+
+    new_files = []
+    for root, dirs, files in os.walk(raw_dir):
+        dirs[:] = [d for d in dirs if d not in SKIP_DIRS]
+        for fname in sorted(files):
+            fpath = os.path.join(root, fname)
+            rel = os.path.relpath(fpath, os.path.dirname(raw_dir))
+            if rel not in registered:
+                new_files.append(fpath)
+    return new_files
+
+
+def process_file(fpath):
+    """Extract text from a single file. Returns (txt_path, file_type) or (None, file_type)."""
+    ext = os.path.splitext(fpath)[1].lower()
+    extracted_dir = os.path.join(os.path.dirname(fpath), ".extracted")
+    os.makedirs(extracted_dir, exist_ok=True)
+
+    if ext in SKIP_EXT:
+        return None, ext
+
+    mod_name = EXTRACTORS.get(ext)
+    if not mod_name:
+        print(f"  SKIP (unsupported): {os.path.basename(fpath)}")
+        return None, ext
+
+    try:
+        extractor = importlib.import_module(mod_name)
+        txt_path = extractor.extract(fpath, extracted_dir)
+        return txt_path, ext
+    except ImportError as e:
+        print(f"  ERROR (missing dependency): {e}")
+        return None, ext
+    except Exception as e:
+        print(f"  ERROR: {e}")
+        return None, ext
+
+
+def main():
+    kb_root = sys.argv[1] if len(sys.argv) > 1 else os.getcwd()
+    raw_dir = os.path.join(kb_root, "raw")
+    registry = os.path.join(kb_root, "index", "RAW-REGISTRY.md")
+
+    if not os.path.isdir(raw_dir):
+        print(f"ERROR: {raw_dir} not found")
+        sys.exit(1)
+
+    new_files = scan_raw(raw_dir, registry)
+    if not new_files:
+        print("No new files in raw/")
+        return
+
+    print(f"Found {len(new_files)} new file(s) in raw/:\n")
+    results = []
+    for fpath in sorted(new_files):
+        print(f"Processing: {os.path.basename(fpath)}")
+        txt_path, ext = process_file(fpath)
+        results.append((fpath, txt_path, ext))
+
+    # Output summary for LLM to parse and update RAW-REGISTRY.md
+    print(f"\n--- INGEST SUMMARY ---")
+    print(f"Processed: {len(results)} files")
+    for fpath, txt_path, ext in results:
+        rel = os.path.relpath(fpath, kb_root)
+        status = "extracted" if txt_path else ("ready" if ext in SKIP_EXT else "skipped")  # skipped = unsupported or failed
+        print(f"  {rel} [{ext}] -> {status}")
+
+
+if __name__ == "__main__":
+    main()
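The script's docstring says the summary is printed "for LLM to parse". If a program were to consume it instead, the `--- INGEST SUMMARY ---` block could be read back roughly as follows. This is a sketch that assumes the exact line shape printed by `main()` above (`  <path> [<ext>] -> <status>`); `parse_ingest_summary` is a hypothetical name.

```python
import re

# Matches indented lines such as "  raw/a.pdf [.pdf] -> extracted".
SUMMARY_LINE = re.compile(r"^\s+(?P<path>.+) \[(?P<ext>[^\]]*)\] -> (?P<status>\w+)$")


def parse_ingest_summary(output: str) -> list[dict[str, str]]:
    """Return one {path, ext, status} dict per file listed after the summary marker."""
    entries = []
    in_summary = False
    for line in output.splitlines():
        if line.strip() == "--- INGEST SUMMARY ---":
            in_summary = True
            continue
        if in_summary:
            match = SUMMARY_LINE.match(line)
            if match:
                entries.append(match.groupdict())
    return entries
```

Only indented lines after the marker are considered, so per-file progress output printed earlier (for example the extractor's own `  PDF: ...` line) is ignored.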
@@ -0,0 +1,6 @@
+PyMuPDF>=1.24
+openpyxl>=3.1
+pandas>=2.0
+pytesseract>=0.3
+python-docx>=1.1
+Pillow>=10.0
@@ -0,0 +1,4 @@
+# Master Index
+
+| Path | Type | Summary |
+|------|------|------|
@@ -0,0 +1,50 @@
+# Ontology
+
+## Entity Types
+
+| Type | Directory | Naming rule | Description |
+|------|------|---------|------|
+| concept | wiki/concepts/ | {slug}.md | Core concept |
+| source | wiki/sources/ | {slug}.md | Source summary |
+| comparison | wiki/comparisons/ | {a}-vs-{b}.md | Comparative analysis |
+
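+## Relationships
+
+- `[[wikilinks]]` express references between articles
+- The frontmatter field `compiled_from` expresses provenance
+- The frontmatter field `related` expresses association
+
+## Wiki Article Template
+
+Every wiki article uses the following structure: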
+
+```yaml
+---
+type: {entity_type}
+id: {slug}
+aliases: []
+compiled_from:
+  - raw/{source_file}
+related:
+  - "[[other-article]]"
+last_compiled: {date}
+---
+```
+
+### Body structure
+
+```markdown
+# {title}
+
+## Overview
+A one-paragraph definition...
+
+## Key points
+- ...
+
+## Related
+- [[related concept]]
+
+## Sources
+- Compiled from raw/xxx.pdf
+```
@@ -0,0 +1,4 @@
+# Raw Registry
+
+| File | Type | Summary | Status | Compiled output |
+|------|------|------|------|---------|
@@ -0,0 +1,3 @@
+# Topic Map
+
+<!-- Maintained automatically by the LLM; groups wiki articles by topic -->
@@ -0,0 +1,23 @@
+{
+  "id": "conv-1776074446367-y9l6jom6z",
+  "providerId": "claude",
+  "title": "Start conversation",
+  "titleGenerationStatus": "success",
+  "createdAt": 1776074446367,
+  "updatedAt": 1776074479191,
+  "lastResponseAt": 1776074479191,
+  "sessionId": "bc8edeba-7a77-4523-8e79-95b84004035b",
+  "providerState": {
+    "providerSessionId": "bc8edeba-7a77-4523-8e79-95b84004035b"
+  },
+  "usage": {
+    "model": "haiku",
+    "inputTokens": 163,
+    "cacheCreationInputTokens": 21877,
+    "cacheReadInputTokens": 0,
+    "contextWindow": 200000,
+    "contextTokens": 22040,
+    "percentage": 11,
+    "contextWindowIsAuthoritative": true
+  }
+}
@@ -1,3 +1,4 @@
 [
-  "obsidian-git"
+  "obsidian-git",
+  "claudian"
 ]
@@ -0,0 +1,11 @@
+{
+  "tabManagerState": {
+    "openTabs": [
+      {
+        "tabId": "tab-1776074442549-1h1fbx3",
+        "conversationId": "conv-1776074446367-y9l6jom6z"
+      }
+    ],
+    "activeTabId": "tab-1776074442549-1h1fbx3"
+  }
+}
@@ -1 +0,0 @@
-
@@ -1 +0,0 @@
-{}