vault backup: 2026-04-13 18:04:51

2026-06-04 10:15:15 +08:00 · 2026-04-13 18:04:53 +08:00
parent 267e9f05e4
commit b4d35f5139
20 changed files with 1189 additions and 3 deletions
@@ -0,0 +1,160 @@
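 # /kb: LLM knowledge base management tool
 Built on Karpathy's LLM Knowledge Base pattern: raw/ holds the source material, an LLM compiles it into wiki/, and index files take the place of RAG.
 ## Quick start
 ### 1. Initialize the knowledge base
 ```
 /kb init
 ```
 Creates the knowledge base directory structure in the current directory:
 - `raw/`: source material (read-only)
 - `wiki/concepts/`: core concepts
 - `wiki/sources/`: source summaries
 - `wiki/comparisons/`: comparative analyses
 - `output/analysis/`: analysis reports
 - `output/slides/`: slide decks
 - `index/`: index files
 ### 2. Import files
 Put PDF, Excel, image, and Word files into `raw/`, then run:
 ```
 /kb ingest
 ```
 This extracts the text from each new file and registers the file in the index.
 ### 3. Compile into the wiki
 ```
 /kb compile
 ```
 The LLM reads the source material and generates structured wiki articles.
 ### 4. Query the knowledge base
 ```
 /kb query "your question"
 ```
 Generates a structured report containing the analysis, conclusions, and backfill suggestions.
 ### 5. Backfill valuable results
 ```
 /kb file
 ```
 Merges the valuable parts of a query report into the wiki.
 ### 6. Health check
 ```
 /kb lint
 ```
 Runs six checks: broken links, orphans, provenance, consistency, coverage, and gap discovery.
 ### 7. View status
 ```
 /kb status
 ```
 Shows a dashboard with overall health and statistics.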
 ---
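 ## Subcommand quick reference
 | Command | Purpose | Trigger phrases |
 |------|------|--------|
 | `kb init [dir]` | Initialize the knowledge base | "初始化" (initialize), "创建知识库" (create knowledge base) |
 | `kb ingest` | Preprocess files in raw/ | "导入" (import), "处理新文件" (process new files) |
 | `kb compile [file]` | Compile into the wiki | "编译" (compile), "更新 wiki" (update wiki) |
 | `kb query "<question>"` | Query the knowledge base | "查知识库" (look up the knowledge base), "问知识库" (ask the knowledge base) |
 | `kb file [report]` | Backfill into the wiki | "回填" (backfill), "归档" (archive) |
 | `kb lint` | Health check | "检查" (check), "lint" |
 | `kb status` | Status dashboard | "状态" (status), "看看知识库" (take a look at the knowledge base) |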
 ---
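 ## Supported file formats
 | Format | Extensions | Notes |
 |------|------|------|
 | PDF | .pdf | Extracts text and images |
 | Excel | .xlsx, .xls, .csv | Extracts table contents |
 | Image | .png, .jpg, .jpeg | OCR text recognition |
 | Word | .docx | Extracts paragraphs and tables |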
 ---
 ## Workflow
 ```
 Feed sources     LLM compile      Query and use
    │                │                │
    ▼                ▼                ▼
 raw/ ──────► wiki/ ──────► query analysis ──────► backfill
    │                │                │
 Source files  Structured articles  Knowledge growth
 ```
 ---
 ## Directory structure
 ```
 {knowledge base root}/
 ├── raw/                    # source material (read-only)
 │   └── .extracted/        # extracted text (auto-generated)
 ├── wiki/
 │   ├── concepts/          # core concepts
 │   ├── sources/           # source summaries
 │   └── comparisons/       # comparative analyses
 ├── output/
 │   ├── analysis/          # query reports
 │   └── slides/           # slide decks
 ├── index/
 │   ├── MASTER-INDEX.md   # global index
 │   ├── TOPIC-MAP.md      # topic grouping
 │   ├── RAW-REGISTRY.md   # source file registry
 │   ├── LINT-REPORT.md    # health check report
 │   └── ONTOLOGY.md       # ontology definition
 └── scripts/
    ├── ingest.py          # preprocessing script
    └── extractors/        # file extractors
 ```
 ---
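 ## Python dependencies
 Install the dependencies before first use:
 ```bash
 pip install -r .claude/skills/kb/scripts/requirements.txt
 ```
 Dependencies:
 - PyMuPDF: PDF extraction
 - openpyxl: Excel reading
 - pandas: data processing
 - pytesseract: image OCR
 - python-docx: Word reading
 - Pillow: image handling
 ---
 ## SessionStart hook (optional)
 Once configured, every time Claude Code opens it checks `raw/` for new files and reminds you to process them.
 Choose "yes" during initialization to enable it.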
@@ -0,0 +1,327 @@
 ---
 name: kb
 description: |
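  LLM-driven knowledge base management toolbox. Triggered when the user says "kb", "知识库", "查知识库", "初始化知识库", "导入文件", "编译", "回填", or similar phrases.
  Builds a knowledge base over the vault or an external directory: preprocess files, compile the wiki, run query analysis, and perform health checks.
  Built on Karpathy's LLM Knowledge Base pattern: raw/ holds the source material, an LLM compiles it into wiki/, and index files take the place of RAG.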
 user-invocable: true
 ---
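 # /kb: LLM knowledge base management
 A single entry point with 7 subcommands.
 ## Subcommand quick reference
 | Command | Purpose | Trigger phrases |
 |------|------|--------|
 | `kb init [dir]` | Initialize the knowledge base | "初始化" (initialize), "创建知识库" (create knowledge base) |
 | `kb ingest` | Preprocess files in raw/ | "导入" (import), "处理新文件" (process new files) |
 | `kb compile [file]` | Compile into the wiki | "编译" (compile), "更新 wiki" (update wiki) |
 | `kb query "<question>"` | Query the knowledge base | "查知识库" (look up the knowledge base), "问知识库" (ask the knowledge base) |
 | `kb file [report]` | Backfill into the wiki | "回填" (backfill), "归档" (archive) |
 | `kb lint` | Health check | "检查" (check), "lint" |
 | `kb status` | Status dashboard | "状态" (status), "看看知识库" (take a look at the knowledge base) |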
 ---
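 ## kb init [target directory]
 Initializes the knowledge base directory structure, index files, and ontology definition.
 **Argument**: optional target directory. Defaults to the current directory (the vault); pass a path to use an external directory instead.
 ### Steps
 1. **Check for an existing knowledge base**: look for `{target}/index/MASTER-INDEX.md`; if it exists, warn and wait for confirmation
 2. **Create the directory structure**:
   ```
   {target}/raw/              # source material (read-only)
   {target}/wiki/concepts/    # core concepts
   {target}/wiki/sources/     # source summaries
   {target}/wiki/comparisons/ # comparative analyses
   {target}/output/analysis/  # analysis reports
   {target}/output/slides/    # slide decks
   {target}/index/            # index files
   {target}/scripts/          # preprocessing scripts
   ```
 3. **Copy the template files**: from this Skill's `templates/` directory into `{target}/index/`:
   - ONTOLOGY.md: entity type and relation definitions
   - MASTER-INDEX.md: global index
   - TOPIC-MAP.md: topic grouping
   - RAW-REGISTRY.md: source file registry
 4. **Copy the scripts**: from this Skill's `scripts/` directory into `{target}/scripts/`
 5. **Check the Python dependencies**:
   ```bash
   pip show pymupdf openpyxl pandas pytesseract python-docx Pillow 2>&1
   ```
   Report any missing packages and ask whether to install them
 6. **Configure the SessionStart hook (optional)**: ask whether to set it up; when enabled, it reminds the user whenever new files are detected in raw/
 7. **Print the initialization summary**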
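
The first two steps above (guard against an existing knowledge base, then create the directories) can be sketched in Python as follows. This is a minimal illustration, not the Skill's actual implementation: `init_kb`, `KB_DIRS`, and the `force` flag are hypothetical names, and raising `FileExistsError` stands in for the "warn and wait for confirmation" step.

```python
from pathlib import Path

# Subdirectories listed in the "create the directory structure" step.
KB_DIRS = [
    "raw",
    "wiki/concepts",
    "wiki/sources",
    "wiki/comparisons",
    "output/analysis",
    "output/slides",
    "index",
    "scripts",
]


def init_kb(target: str, force: bool = False) -> list[str]:
    """Create the knowledge base skeleton under target (hypothetical helper).

    Raises FileExistsError when index/MASTER-INDEX.md already exists and
    force is False; a caller would ask the user before retrying with force.
    """
    root = Path(target)
    if (root / "index" / "MASTER-INDEX.md").exists() and not force:
        raise FileExistsError(f"knowledge base already initialized at {root}")
    created = []
    for rel in KB_DIRS:
        path = root / rel
        if not path.exists():
            path.mkdir(parents=True)
            created.append(rel)
    return created  # only the directories that were actually created
```

Returning the list of newly created directories gives the final summary step something concrete to report, and makes a second run on the same target a no-op.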
 ---
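 ## kb ingest
 Preprocesses new files in raw/ and registers them in the index.
 **Prerequisite**: the knowledge base has been initialized (index/RAW-REGISTRY.md exists)
 ### Supported formats
 - PDF (.pdf)
 - Excel (.xlsx, .xls, .csv)
 - Images (.png, .jpg, .jpeg), extracted with OCR
 - Word (.docx)
 ### Steps
 1. **Locate the knowledge base**: search upward for `index/RAW-REGISTRY.md`
 2. **Run the preprocessing script**:
   ```bash
   python3 {skill_dir}/scripts/ingest.py {kb_root}
   ```
   The script scans for new files, extracts text according to file type, and prints a summary
 3. **Register in RAW-REGISTRY.md**: add an entry for each new file:
   - File path, type, and a one-sentence summary
   - Status: `pending` (awaiting compilation)
 4. **Print the summary**: report how many files were imported and point to the next step, `/kb compile`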
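
As a rough illustration of step 3, the sketch below turns the `--- INGEST SUMMARY ---` section printed by `ingest.py` into registry rows. The line format (`  raw/a.pdf [.pdf] -> extracted`) follows that script's output; the function name `registry_rows` and the `summaries` mapping are hypothetical, and in practice the one-sentence summaries would come from the LLM rather than a dictionary.

```python
import re

# Matches the indented result lines that ingest.py prints after the summary marker.
SUMMARY_LINE = re.compile(r"^\s+(?P<path>.+) \[(?P<ext>[^\]]*)\] -> (?P<status>\w+)$")


def registry_rows(ingest_output: str, summaries: dict[str, str]) -> list[str]:
    """Build RAW-REGISTRY.md table rows from ingest.py output (hypothetical helper)."""
    rows = []
    in_summary = False
    for line in ingest_output.splitlines():
        if line.strip() == "--- INGEST SUMMARY ---":
            in_summary = True
            continue
        match = SUMMARY_LINE.match(line) if in_summary else None
        if match:
            path = match["path"]
            ftype = match["ext"].lstrip(".")
            summary = summaries.get(path, "")
            # Columns follow the registry template: file, type, summary,
            # status, compiled output. New entries always start as pending.
            rows.append(f"| {path} | {ftype} | {summary} | pending | |")
    return rows
```

Each row begins with `| raw/`, which is the prefix `scan_raw` in `ingest.py` looks for when deciding whether a file is already registered.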
 ---
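 ## kb compile [file]
 Compiles files in raw/ that have been imported but not yet compiled into wiki articles.
 **Argument**: optional file to compile; by default, all entries with `status=pending` are processed
 ### Core principles
 - Wiki articles are generated by the LLM and follow the definitions in ONTOLOGY.md
 - Every article must have complete YAML frontmatter
 - Use `[[wikilinks]]` to connect related articles
 - Compilation is incremental
 ### Steps
 1. **Check for pending entries**: read `index/RAW-REGISTRY.md` and find entries with `status=pending`
   - If there are none, tell the user and stop
 2. **Load context**: read ONTOLOGY.md, MASTER-INDEX.md, and TOPIC-MAP.md
 3. **Compile one entry at a time**:
   - Read the source file, or its extracted text under `raw/.extracted/`
   - Decide on the operation: create a new article, update an existing one, or write a synthesis
   - Generate the wiki article from the template
   - Update the frontmatter (type, id, compiled_from, related, last_compiled)
   - Link related articles with `[[wikilinks]]`
 4. **Update the index**:
   - Add or update the entry in MASTER-INDEX.md
   - Assign the article to a topic in TOPIC-MAP.md
   - Set the status in RAW-REGISTRY.md to `done` and fill in the path of the compiled article
 5. **Print the compilation summary**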
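
Step 1 amounts to filtering the registry table by its status column. A minimal sketch, assuming the five-column layout of the RAW-REGISTRY.md template (file, type, summary, status, compiled output) and assuming summaries contain no `|` characters; `pending_entries` is a hypothetical name:

```python
def pending_entries(registry_text: str) -> list[str]:
    """Return the file paths whose status column is `pending` (hypothetical helper)."""
    pending = []
    for line in registry_text.splitlines():
        line = line.strip()
        if not line.startswith("| raw/"):
            continue  # skip the title, header row, and separator row
        # Drop the outer pipes, then split the row into trimmed cells.
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        # Column order assumed: file, type, summary, status, compiled output.
        if len(cells) >= 4 and cells[3] == "pending":
            pending.append(cells[0])
    return pending
```

An empty result corresponds to the "tell the user and stop" branch.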
 ---
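 ## kb query "<question>"
 Asks the knowledge base a question and generates a structured report.
 **Argument**: required; the user's question
 ### Steps
 1. **Locate the knowledge base**: look for `index/MASTER-INDEX.md`
 2. **Retrieve the relevant articles**:
   - Read MASTER-INDEX.md to find relevant files
   - Read TOPIC-MAP.md as needed to narrow the search
   - Read the full content of every relevant wiki article
 3. **Research and analysis**:
   - Analyze the question in depth based on the wiki content
   - Cross-compare multiple articles
   - Conclusions must be grounded in the actual content, with sources cited
 4. **Generate the report**: save it to `output/analysis/YYYY-MM-DD-{topic-slug}.md`:
   ```markdown
   # {Report title}
   - **Date**: YYYY-MM-DD
   - **Query**: {user's question}
   - **Sources**: {wiki articles cited}
   ---
   ## Analysis
   {Detailed analysis; cite specific articles with [[wikilinks]]}
   ## Conclusions
   {Key findings}
   ## Backfill suggestions
   - [ ] {specific suggestion}
   ```
 5. **Present the result**: show a summary and mention that `/kb file` can backfill it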
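
The report filename in step 4 combines a date with a topic slug. The source does not say how the slug is derived, so the rule below (lowercase, with runs of punctuation and whitespace collapsed to single hyphens) is an assumption, and `report_path` is a hypothetical helper:

```python
import re
from datetime import date


def report_path(topic: str, today: date) -> str:
    """Build output/analysis/YYYY-MM-DD-{topic-slug}.md (hypothetical helper)."""
    # Assumed slug rule: lowercase, and every run of characters that are not
    # letters or digits becomes one hyphen. \W is Unicode-aware in Python 3,
    # so CJK characters are kept as-is.
    slug = re.sub(r"[\W_]+", "-", topic.lower()).strip("-")
    return f"output/analysis/{today.isoformat()}-{slug or 'untitled'}.md"
```

Passing the date in explicitly, rather than calling `date.today()` inside, keeps the function deterministic and easy to test.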
 ---
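 ## kb file [report path]
 Backfills query output into the wiki knowledge base.
 **Argument**: optional report file under output/; by default, `output/analysis/` is scanned
 ### Steps
 1. **Locate the knowledge base and the content to backfill**
 2. **Show the backfill suggestions**: list every suggestion with a number and a short explanation
 3. **User confirmation**: Y/N per item, or a batch operation
 4. **Perform the backfill**:
   - **Update existing articles**: weave the new content into the article naturally
   - **Create new articles**: use the template in ONTOLOGY.md
 5. **Update the index**: MASTER-INDEX.md and TOPIC-MAP.md
 6. **Print the summary**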
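
Steps 2 and 3 need the suggestions as a numbered list. Since the report template records them as Markdown checkboxes (`- [ ] ...`), one possible way to collect them is sketched below; `backfill_suggestions` is a hypothetical name, and treating a checked box as "already handled" is an assumption rather than something the Skill specifies.

```python
import re

# A Markdown task-list item: "- [ ] text" (open) or "- [x] text" (checked).
CHECKBOX = re.compile(r"^\s*- \[( |x|X)\] (.+)$")


def backfill_suggestions(report_text: str) -> list[tuple[int, str, bool]]:
    """Return (number, suggestion, already_done) per checkbox item (hypothetical helper)."""
    items = []
    for line in report_text.splitlines():
        match = CHECKBOX.match(line)
        if match:
            done = match.group(1) != " "
            items.append((len(items) + 1, match.group(2).strip(), done))
    return items
```

The numbers give the user something to refer to when confirming items one by one or in a batch.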
 ---
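 ## kb lint
 Runs six health checks on the knowledge base.
 ### Checks
 | Check | Description |
 |------|------|
 | Broken links | A `[[link]]` points to a file that does not exist |
 | Orphans | Articles that no other article links to |
 | Provenance | The frontmatter `compiled_from` points to a deleted file |
 | Consistency | Contradictory descriptions of the same concept across articles |
 | Coverage | Proportion of files not yet compiled |
 | Gap discovery | Concepts that are mentioned but have no article of their own |
 ### Steps
 1. **Locate the knowledge base**
 2. **Run the six checks**
 3. **Print the lint report** (sorted by severity)
 4. **Offer fixes**: for issues that can be fixed automatically, ask whether to apply the fix
 5. **Save the report to `index/LINT-REPORT.md`**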
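
The first two checks (broken links and orphans) are mechanical enough to sketch in code. The version below assumes that a `[[link]]` resolves by file stem anywhere under `wiki/`, Obsidian-style, and that stems are unique across subfolders; `link_report` is a hypothetical name, and the remaining four checks are not covered.

```python
import re
from pathlib import Path

# Captures the link target, stopping before an alias (|) or heading (#) part.
WIKILINK = re.compile(r"\[\[([^\]|#]+)")


def link_report(wiki_dir: str) -> dict[str, list[str]]:
    """Find broken [[links]] and orphan articles under wiki_dir (hypothetical helper)."""
    root = Path(wiki_dir)
    # Map article name (file stem) to its text; assumes stems are unique.
    articles = {p.stem: p.read_text(encoding="utf-8") for p in root.rglob("*.md")}
    broken, linked = set(), set()
    for name, text in articles.items():
        for target in WIKILINK.findall(text):
            target = target.strip()
            if target not in articles:
                broken.add(f"{name} -> {target}")
            elif target != name:
                linked.add(target)  # self-links do not rescue an orphan
    orphans = sorted(set(articles) - linked)
    return {"broken": sorted(broken), "orphans": orphans}
```

Because the whole file is scanned, links listed in the frontmatter `related` field count as links too.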
 ---
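 ## kb status
 Shows a dashboard of the knowledge base's overall state.
 ### Steps
 1. **Locate the knowledge base**
 2. **Collect statistics**:
   - Number of files in raw/
   - Number of wiki/ articles and their total length
   - Compile rate
   - Number of reports awaiting backfill
   - Result of the last lint run
 3. **Show the dashboard**:
   ```
   Knowledge base status
   ═══════════════════════════════════
   Raw files:        N
   Wiki articles:    M (~X characters total)
   Compile rate:     XX%
   Pending backfill: Y reports
   Last lint:        date, issue summary
   ═══════════════════════════════════
   Recently compiled articles:
     - wiki/concepts/xxx.md (date)
   To do:
     - N files awaiting compilation → /kb compile
     - M reports awaiting backfill → /kb file
   ```
 4. **Suggest the next action**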
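
Part of step 2 is plain file counting, which might look like the sketch below. It is illustrative only: `kb_stats` is a hypothetical name, article length is measured in characters, and every report in `output/analysis/` is counted because the source does not define how a report is marked as already backfilled.

```python
from pathlib import Path


def kb_stats(kb_root: str) -> dict[str, int]:
    """Count raw files, wiki articles, and reports (hypothetical helper)."""
    root = Path(kb_root)
    # Source files only: anything under a .extracted/ directory is generated output.
    raw_files = [
        p for p in (root / "raw").rglob("*")
        if p.is_file() and ".extracted" not in p.relative_to(root).parts
    ]
    articles = list((root / "wiki").rglob("*.md"))
    chars = sum(len(p.read_text(encoding="utf-8")) for p in articles)
    reports = list((root / "output" / "analysis").glob("*.md"))
    return {
        "raw_files": len(raw_files),
        "wiki_articles": len(articles),
        "wiki_chars": chars,
        "reports": len(reports),
    }
```

The compile rate and the last lint result would come from `index/RAW-REGISTRY.md` and `index/LINT-REPORT.md` respectively, and are left out here.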
 ---
 ## Directory structure conventions
 ```
 {knowledge base root}/
 ├── raw/                    # source material (read-only)
 │   └── .extracted/         # extracted text (auto-generated)
 ├── wiki/
 │   ├── concepts/           # core concepts
 │   ├── sources/            # source summaries
 │   └── comparisons/        # comparative analyses
 ├── output/
 │   ├── analysis/           # query reports
 │   └── slides/             # slide decks
 ├── index/
 │   ├── MASTER-INDEX.md     # global index
 │   ├── TOPIC-MAP.md        # topic grouping
 │   ├── RAW-REGISTRY.md     # source file registry
 │   ├── LINT-REPORT.md      # health check report
 │   └── ONTOLOGY.md         # ontology definition
 └── scripts/
    ├── ingest.py           # preprocessing script
    ├── requirements.txt    # Python dependencies
    └── extractors/         # per-format file extractors
 ```
 ## Entity types (ONTOLOGY.md)
 | Type | Directory | Naming rule |
 |------|------|----------|
 | concept | wiki/concepts/ | {slug}.md |
 | source | wiki/sources/ | {slug}.md |
 | comparison | wiki/comparisons/ | {a}-vs-{b}.md |
 ## Wiki article frontmatter template
 ```yaml
 ---
 type: concept
 id: {slug}
 aliases: []
 compiled_from:
  - raw/{source_file}
 related:
  - "[[other-article]]"
 last_compiled: YYYY-MM-DD
 ---
 ```
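
For reference, the template above can be filled in programmatically. The sketch below uses only the standard library and writes the YAML by hand, so it assumes slugs and file names contain no characters that need YAML quoting; `render_frontmatter` and `_yaml_list` are hypothetical names.

```python
from datetime import date


def _yaml_list(key: str, items: list[str]) -> list[str]:
    """Render a YAML list field, using [] when it is empty."""
    if not items:
        return [f"{key}: []"]
    return [f"{key}:"] + [f"  - {item}" for item in items]


def render_frontmatter(entity_type: str, slug: str, sources: list[str],
                       related: list[str], compiled_on: date) -> str:
    """Fill in the wiki article frontmatter template (hypothetical helper)."""
    lines = ["---", f"type: {entity_type}", f"id: {slug}", "aliases: []"]
    # compiled_from records provenance as paths under raw/.
    lines += _yaml_list("compiled_from", [f"raw/{name}" for name in sources])
    # related entries are quoted because a bare [[...]] would parse as a nested YAML list.
    lines += _yaml_list("related", [f'"[[{name}]]"' for name in related])
    lines.append(f"last_compiled: {compiled_on.isoformat()}")
    lines.append("---")
    return "\n".join(lines) + "\n"
```

For example, `render_frontmatter("concept", "rag", ["paper.pdf"], ["index-files"], date(2026, 4, 13))` yields a block whose `compiled_from` entry is `raw/paper.pdf` and whose `related` entry is `"[[index-files]]"`.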
 ---
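 ## Troubleshooting
 | Problem | Solution |
 |------|----------|
 | Knowledge base not found | Run `/kb init` first to initialize it |
 | Script errors | Run `pip install -r scripts/requirements.txt` |
 | Low compile rate | Run `/kb ingest` to import new files, then `/kb compile` |
 | Too many broken links | Run `/kb lint` for details, then fix or delete the broken links by hand |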
@@ -0,0 +1,381 @@
 <!DOCTYPE html>
 <html lang="en">
 <head>
  <meta charset="UTF-8">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
  <title>/kb: LLM knowledge base management tool</title>
  <style>
    :root {
      --bg: #0d1117;
      --surface: #161b22;
      --border: #30363d;
      --text: #e6edf3;
      --text-muted: #8b949e;
      --accent: #58a6ff;
      --accent-bg: #1f6feb1a;
      --success: #3fb950;
      --warning: #d29922;
    }
    * {
      box-sizing: border-box;
      margin: 0;
      padding: 0;
    }
    body {
      font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Helvetica, Arial, sans-serif;
      background: var(--bg);
      color: var(--text);
      line-height: 1.6;
      min-height: 100vh;
      padding: 2rem;
    }
    .container {
      max-width: 800px;
      margin: 0 auto;
    }
    h1 {
      font-size: 2rem;
      margin-bottom: 0.5rem;
      display: flex;
      align-items: center;
      gap: 0.5rem;
    }
    h1::before {
      content: '📚';
    }
    .subtitle {
      color: var(--text-muted);
      margin-bottom: 2rem;
    }
    h2 {
      font-size: 1.4rem;
      margin: 2rem 0 1rem;
      padding-bottom: 0.5rem;
      border-bottom: 1px solid var(--border);
      display: flex;
      align-items: center;
      gap: 0.5rem;
    }
    h3 {
      font-size: 1.1rem;
      margin: 1.5rem 0 0.75rem;
      color: var(--accent);
    }
    code {
      background: var(--surface);
      padding: 0.2rem 0.4rem;
      border-radius: 4px;
      font-family: 'SF Mono', Consolas, monospace;
      font-size: 0.9em;
      color: var(--success);
    }
    pre {
      background: var(--surface);
      border: 1px solid var(--border);
      border-radius: 8px;
      padding: 1rem;
      overflow-x: auto;
      margin: 1rem 0;
    }
    pre code {
      background: none;
      padding: 0;
      color: var(--text);
    }
    .command {
      background: linear-gradient(135deg, var(--accent-bg), transparent);
      border-left: 3px solid var(--accent);
      padding: 0.75rem 1rem;
      margin: 0.5rem 0;
      border-radius: 0 8px 8px 0;
      font-family: 'SF Mono', Consolas, monospace;
    }
    .card {
      background: var(--surface);
      border: 1px solid var(--border);
      border-radius: 8px;
      padding: 1rem;
      margin: 1rem 0;
    }
    table {
      width: 100%;
      border-collapse: collapse;
      margin: 1rem 0;
    }
    th, td {
      text-align: left;
      padding: 0.75rem;
      border-bottom: 1px solid var(--border);
    }
    th {
      color: var(--accent);
      font-weight: 600;
    }
    tr:hover {
      background: var(--surface);
    }
    .flow {
      display: flex;
      align-items: center;
      justify-content: center;
      gap: 0.5rem;
      margin: 1.5rem 0;
      flex-wrap: wrap;
    }
    .flow-step {
      background: var(--surface);
      border: 1px solid var(--border);
      border-radius: 8px;
      padding: 0.75rem 1rem;
      text-align: center;
    }
    .flow-arrow {
      color: var(--text-muted);
    }
    .tag {
      display: inline-block;
      background: var(--accent-bg);
      color: var(--accent);
      padding: 0.2rem 0.6rem;
      border-radius: 20px;
      font-size: 0.85em;
      margin-right: 0.5rem;
    }
    .dir-tree {
      font-family: 'SF Mono', Consolas, monospace;
      font-size: 0.9rem;
      line-height: 1.8;
    }
    .dir-comment {
      color: var(--text-muted);
    }
    footer {
      margin-top: 3rem;
      padding-top: 1rem;
      border-top: 1px solid var(--border);
      color: var(--text-muted);
      text-align: center;
    }
  </style>
 </head>
 <body>
  <div class="container">
    <h1>/kb: LLM knowledge base management tool</h1>
    <p class="subtitle">Built on Karpathy's LLM Knowledge Base pattern: raw/ holds the source material, an LLM compiles it into wiki/, and index files take the place of RAG.</p>
    <h2>🚀 Quick start</h2>
    <h3>1. Initialize the knowledge base</h3>
    <div class="command">/kb init</div>
    <p style="margin-top: 0.5rem;">Creates the knowledge base directory structure in the current directory:</p>
    <div class="dir-tree">
      <pre>
 ├── raw/                    <span class="dir-comment"># source material (read-only)</span>
 ├── wiki/
 │   ├── concepts/          <span class="dir-comment"># core concepts</span>
 │   ├── sources/           <span class="dir-comment"># source summaries</span>
 │   └── comparisons/       <span class="dir-comment"># comparative analyses</span>
 ├── output/
 │   ├── analysis/          <span class="dir-comment"># analysis reports</span>
 │   └── slides/           <span class="dir-comment"># slide decks</span>
 └── index/                <span class="dir-comment"># index files</span>
      </pre>
    </div>
    <h3>2. Import files</h3>
    <p>Put PDF, Excel, image, and Word files into <code>raw/</code>, then run:</p>
    <div class="command">/kb ingest</div>
    <p style="margin-top: 0.5rem;">This extracts the text from each new file and registers the file in the index.</p>
    <h3>3. Compile into the wiki</h3>
    <div class="command">/kb compile</div>
    <p style="margin-top: 0.5rem;">The LLM reads the source material and generates structured wiki articles.</p>
    <h3>4. Query the knowledge base</h3>
    <div class="command">/kb query "your question"</div>
    <p style="margin-top: 0.5rem;">Generates a structured report containing the analysis, conclusions, and backfill suggestions.</p>
    <h3>5. Backfill valuable results</h3>
    <div class="command">/kb file</div>
    <p style="margin-top: 0.5rem;">Merges the valuable parts of a query report into the wiki.</p>
    <h3>6. Health check</h3>
    <div class="command">/kb lint</div>
    <p style="margin-top: 0.5rem;">Runs six checks: broken links, orphans, provenance, consistency, coverage, and gap discovery.</p>
    <h3>7. View status</h3>
    <div class="command">/kb status</div>
    <p style="margin-top: 0.5rem;">Shows a dashboard with overall health and statistics.</p>
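    <h2>📋 Subcommand quick reference</h2>
    <table>
      <thead>
        <tr>
          <th>Command</th>
          <th>Purpose</th>
          <th>Trigger phrases</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <td><code>kb init [dir]</code></td>
          <td>Initialize the knowledge base</td>
          <td>初始化 (initialize), 创建知识库 (create knowledge base)</td>
        </tr>
        <tr>
          <td><code>kb ingest</code></td>
          <td>Preprocess files in raw/</td>
          <td>导入 (import), 处理新文件 (process new files)</td>
        </tr>
        <tr>
          <td><code>kb compile [file]</code></td>
          <td>Compile into the wiki</td>
          <td>编译 (compile), 更新 wiki (update wiki)</td>
        </tr>
        <tr>
          <td><code>kb query "&lt;question&gt;"</code></td>
          <td>Query the knowledge base</td>
          <td>查知识库 (look up the knowledge base), 问知识库 (ask the knowledge base)</td>
        </tr>
        <tr>
          <td><code>kb file [report]</code></td>
          <td>Backfill into the wiki</td>
          <td>回填 (backfill), 归档 (archive)</td>
        </tr>
        <tr>
          <td><code>kb lint</code></td>
          <td>Health check</td>
          <td>检查 (check), lint</td>
        </tr>
        <tr>
          <td><code>kb status</code></td>
          <td>Status dashboard</td>
          <td>状态 (status), 看看知识库 (take a look at the knowledge base)</td>
        </tr>
      </tbody>
    </table>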
    <h2>📦 Supported file formats</h2>
    <table>
      <thead>
        <tr>
          <th>Format</th>
          <th>Extensions</th>
          <th>Notes</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <td>PDF</td>
          <td>.pdf</td>
          <td>Extracts text and images</td>
        </tr>
        <tr>
          <td>Excel</td>
          <td>.xlsx, .xls, .csv</td>
          <td>Extracts table contents</td>
        </tr>
        <tr>
          <td>Image</td>
          <td>.png, .jpg, .jpeg</td>
          <td>OCR text recognition</td>
        </tr>
        <tr>
          <td>Word</td>
          <td>.docx</td>
          <td>Extracts paragraphs and tables</td>
        </tr>
      </tbody>
    </table>
    <h2>🔄 Workflow</h2>
    <div class="flow">
      <div class="flow-step">Feed sources<br><small>raw/</small></div>
      <span class="flow-arrow">→</span>
      <div class="flow-step">LLM compile<br><small>wiki/</small></div>
      <span class="flow-arrow">→</span>
      <div class="flow-step">Query and use<br><small>/kb query</small></div>
      <span class="flow-arrow">→</span>
      <div class="flow-step">Knowledge growth<br><small>/kb file</small></div>
    </div>
    <h2>📁 Full directory structure</h2>
    <pre class="dir-tree">
 {knowledge base root}/
 ├── raw/                    <span class="dir-comment"># source material (read-only)</span>
 │   └── .extracted/        <span class="dir-comment"># extracted text (auto-generated)</span>
 ├── wiki/
 │   ├── concepts/          <span class="dir-comment"># core concepts</span>
 │   ├── sources/           <span class="dir-comment"># source summaries</span>
 │   └── comparisons/       <span class="dir-comment"># comparative analyses</span>
 ├── output/
 │   ├── analysis/          <span class="dir-comment"># query reports</span>
 │   └── slides/           <span class="dir-comment"># slide decks</span>
 ├── index/
 │   ├── MASTER-INDEX.md   <span class="dir-comment"># global index</span>
 │   ├── TOPIC-MAP.md      <span class="dir-comment"># topic grouping</span>
 │   ├── RAW-REGISTRY.md   <span class="dir-comment"># source file registry</span>
 │   ├── LINT-REPORT.md    <span class="dir-comment"># health check report</span>
 │   └── ONTOLOGY.md       <span class="dir-comment"># ontology definition</span>
 └── scripts/
    ├── ingest.py          <span class="dir-comment"># preprocessing script</span>
    └── extractors/        <span class="dir-comment"># file extractors</span>
    </pre>
    <h2>🐍 Python dependencies</h2>
    <p>Install the dependencies before first use:</p>
    <div class="command">pip install -r .claude/skills/kb/scripts/requirements.txt</div>
    <div class="card">
      <strong>Dependencies:</strong>
      <ul style="margin-top: 0.5rem; padding-left: 1.5rem;">
        <li>PyMuPDF: PDF extraction</li>
        <li>openpyxl: Excel reading</li>
        <li>pandas: data processing</li>
        <li>pytesseract: image OCR</li>
        <li>python-docx: Word reading</li>
        <li>Pillow: image handling</li>
      </ul>
    </div>
    <h2>⚙️ SessionStart hook (optional)</h2>
    <p>Once configured, every time Claude Code opens it checks <code>raw/</code> for new files and reminds you to process them.</p>
    <p style="margin-top: 0.5rem;">Choose "yes" during initialization to enable it.</p>
    <footer>
      <p>/kb, adapted from <a href="https://github.com/ChuYinan2023/kb-skills" style="color: var(--accent);">kb-skills</a></p>
    </footer>
  </div>
 </body>
 </html>
@@ -0,0 +1,28 @@
 """Extract text from Word documents."""
 from docx import Document
 import os
 def extract(docx_path: str, output_dir: str) -> str:
    """Extract all paragraphs and tables from docx."""
    basename = os.path.splitext(os.path.basename(docx_path))[0]
    txt_path = os.path.join(output_dir, f"{basename}.txt")
    doc = Document(docx_path)
    parts = []
    for para in doc.paragraphs:
        if para.text.strip():
            parts.append(para.text)
    for i, table in enumerate(doc.tables):
        parts.append(f"\n--- Table {i+1} ---")
        for row in table.rows:
            cells = [cell.text.strip() for cell in row.cells]
            parts.append(" | ".join(cells))
    with open(txt_path, "w", encoding="utf-8") as f:
        f.write("\n".join(parts))
    print(f"  Word: {len(doc.paragraphs)} paragraphs, {len(doc.tables)} tables extracted")
    return txt_path
@@ -0,0 +1,34 @@
 """Extract text summary from Excel files."""
 import pandas as pd
 import os
 def extract(excel_path: str, output_dir: str) -> str:
    """Read all sheets, output text summary."""
    basename = os.path.splitext(os.path.basename(excel_path))[0]
    txt_path = os.path.join(output_dir, f"{basename}.txt")
    ext = os.path.splitext(excel_path)[1].lower()
    if ext == ".csv":
        df = pd.read_csv(excel_path)
        parts = [f"--- CSV ({len(df)} rows x {len(df.columns)} cols) ---"]
        parts.append(f"Columns: {', '.join(df.columns.astype(str))}")
        parts.append(df.head(50).to_string(index=False))
        if len(df) > 50:
            parts.append(f"... ({len(df) - 50} more rows)")
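    else:
        # Note: pandas reads .xlsx through openpyxl, but legacy .xls files need
        # the xlrd package, which requirements.txt does not list.
        xls = pd.ExcelFile(excel_path)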
        parts = []
        for sheet in xls.sheet_names:
            df = pd.read_excel(xls, sheet_name=sheet)
            parts.append(f"--- Sheet: {sheet} ({len(df)} rows x {len(df.columns)} cols) ---")
            parts.append(f"Columns: {', '.join(df.columns.astype(str))}")
            parts.append(df.head(50).to_string(index=False))
            if len(df) > 50:
                parts.append(f"... ({len(df) - 50} more rows)")
    with open(txt_path, "w", encoding="utf-8") as f:
        f.write("\n\n".join(parts))
    print(f"  Excel: extracted to {basename}.txt")
    return txt_path
@@ -0,0 +1,20 @@
 """OCR text from images using pytesseract."""
 import pytesseract
 from PIL import Image
 import os
 def extract(image_path: str, output_dir: str) -> str:
    """OCR image, return text file path."""
    basename = os.path.splitext(os.path.basename(image_path))[0]
    txt_path = os.path.join(output_dir, f"{basename}.txt")
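    # pytesseract wraps the Tesseract binary, which must be installed separately
    # together with the chi_sim and eng language data.
    with Image.open(image_path) as img:
        text = pytesseract.image_to_string(img, lang="chi_sim+eng")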
    with open(txt_path, "w", encoding="utf-8") as f:
        f.write(text)
    chars = len(text.strip())
    print(f"  Image OCR: {chars} characters extracted")
    return txt_path
@@ -0,0 +1,34 @@
 """Extract text and images from PDF files using PyMuPDF."""
 import fitz  # PyMuPDF
 import os
 def extract(pdf_path: str, output_dir: str) -> str:
    """Extract text from PDF, save images, return text file path."""
    basename = os.path.splitext(os.path.basename(pdf_path))[0]
    doc = fitz.open(pdf_path)
    text_parts = []
    img_count = 0
    for page_num, page in enumerate(doc):
        text_parts.append(f"--- Page {page_num + 1} ---")
        text_parts.append(page.get_text())
        for img_idx, img in enumerate(page.get_images(full=True)):
            xref = img[0]
            pix = fitz.Pixmap(doc, xref)
            # PNG cannot store CMYK, so convert to RGB first. pix.n counts the
            # alpha channel, so subtract it before testing for 4 color channels.
            if pix.n - pix.alpha >= 4:
                pix = fitz.Pixmap(fitz.csRGB, pix)
            # Prefix with the PDF name so images from different PDFs sharing
            # one output directory do not overwrite each other.
            img_name = f"{basename}_page{page_num+1}_img{img_idx+1}.png"
            pix.save(os.path.join(output_dir, img_name))
            img_count += 1
            pix = None  # release the pixmap's memory
    doc.close()
    txt_path = os.path.join(output_dir, f"{basename}.txt")
    with open(txt_path, "w", encoding="utf-8") as f:
        f.write("\n".join(text_parts))
    print(f"  PDF: {len(text_parts)//2} pages, {img_count} images extracted")
    return txt_path
@@ -0,0 +1,102 @@
 #!/usr/bin/env python3
 """Scan raw/ for new files, extract text, print summary for LLM to parse."""
 import importlib
 import os
 import sys
 # Add scripts dir to path so extractors can be imported
 sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
 EXTRACTORS = {
    ".pdf": "extractors.pdf_extractor",
    ".xlsx": "extractors.excel_extractor",
    ".xls": "extractors.excel_extractor",
    ".csv": "extractors.excel_extractor",
    ".png": "extractors.image_extractor",
    ".jpg": "extractors.image_extractor",
    ".jpeg": "extractors.image_extractor",
    ".docx": "extractors.docx_extractor",
 }
 SKIP_EXT = {".md", ".txt"}
 SKIP_DIRS = {".extracted"}
 def scan_raw(raw_dir, registry_path):
    """Find files in raw/ not yet in RAW-REGISTRY.md."""
    registered = set()
    if os.path.exists(registry_path):
        with open(registry_path, "r", encoding="utf-8") as f:
            for line in f:
                if line.startswith("| raw/") or line.startswith("| ./raw/"):
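                    path = line.split("|")[1].strip()
                    # Normalize so a "./raw/a.pdf" entry matches the "raw/a.pdf"
                    # form that os.path.relpath produces below.
                    registered.add(os.path.normpath(path))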
    new_files = []
    for root, dirs, files in os.walk(raw_dir):
        dirs[:] = [d for d in dirs if d not in SKIP_DIRS]
        for fname in sorted(files):
            fpath = os.path.join(root, fname)
            rel = os.path.relpath(fpath, os.path.dirname(raw_dir))
            if rel not in registered:
                new_files.append(fpath)
    return new_files
 def process_file(fpath):
    """Extract text from a single file. Returns (txt_path, file_type) or (None, file_type)."""
    ext = os.path.splitext(fpath)[1].lower()
    extracted_dir = os.path.join(os.path.dirname(fpath), ".extracted")
    os.makedirs(extracted_dir, exist_ok=True)
    if ext in SKIP_EXT:
        return None, ext
    mod_name = EXTRACTORS.get(ext)
    if not mod_name:
        print(f"  SKIP (unsupported): {os.path.basename(fpath)}")
        return None, ext
    try:
        extractor = importlib.import_module(mod_name)
        txt_path = extractor.extract(fpath, extracted_dir)
        return txt_path, ext
    except ImportError as e:
        print(f"  ERROR (missing dependency): {e}")
        return None, ext
    except Exception as e:
        print(f"  ERROR: {e}")
        return None, ext
 def main():
    kb_root = sys.argv[1] if len(sys.argv) > 1 else os.getcwd()
    raw_dir = os.path.join(kb_root, "raw")
    registry = os.path.join(kb_root, "index", "RAW-REGISTRY.md")
    if not os.path.isdir(raw_dir):
        print(f"ERROR: {raw_dir} not found")
        sys.exit(1)
    new_files = scan_raw(raw_dir, registry)
    if not new_files:
        print("No new files in raw/")
        return
    print(f"Found {len(new_files)} new file(s) in raw/:\n")
    results = []
    for fpath in sorted(new_files):
        print(f"Processing: {os.path.basename(fpath)}")
        txt_path, ext = process_file(fpath)
        results.append((fpath, txt_path, ext))
    # Output summary for LLM to parse and update RAW-REGISTRY.md
    print(f"\n--- INGEST SUMMARY ---")
    print(f"Processed: {len(results)} files")
    for fpath, txt_path, ext in results:
        rel = os.path.relpath(fpath, kb_root)
        status = "extracted" if txt_path else "ready"
        print(f"  {rel} [{ext}] -> {status}")
 if __name__ == "__main__":
    main()
@@ -0,0 +1,6 @@
 PyMuPDF>=1.24
 openpyxl>=3.1
 pandas>=2.0
 pytesseract>=0.3
 python-docx>=1.1
 Pillow>=10.0
@@ -0,0 +1,4 @@
 # Master Index
 | Path | Type | Summary |
 |------|------|------|
@@ -0,0 +1,50 @@
 # Ontology
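 ## Entity types
 | Type | Directory | Naming rule | Description |
 |------|------|---------|------|
 | concept | wiki/concepts/ | {slug}.md | Core concept |
 | source | wiki/sources/ | {slug}.md | Source summary |
 | comparison | wiki/comparisons/ | {a}-vs-{b}.md | Comparative analysis |
 ## Relations
 - Use `[[wikilinks]]` to express references
 - The frontmatter field `compiled_from` expresses provenance
 - The frontmatter field `related` expresses association
 ## Wiki article template
 Every wiki article uses the following structure: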
 ```yaml
 ---
 type: {entity_type}
 id: {slug}
 aliases: []
 compiled_from:
  - raw/{source_file}
 related:
  - "[[other-article]]"
 last_compiled: {date}
 ---
 ```
 ### Body structure
 ```markdown
 # {Title}
 ## Overview
 A one-paragraph definition...
 ## Key points
 - ...
 ## Related
 - [[related-concept]]
 ## Sources
 - Compiled from raw/xxx.pdf
 ```
@@ -0,0 +1,4 @@
 # Raw Registry
 | File | Type | Summary | Status | Compiled output |
 |------|------|------|------|---------|
@@ -0,0 +1,3 @@
 # Topic Map
 <!-- Maintained automatically by the LLM; groups wiki articles by topic -->
@@ -0,0 +1,23 @@
 {
  "id": "conv-1776074446367-y9l6jom6z",
  "providerId": "claude",
  "title": "Start conversation",
  "titleGenerationStatus": "success",
  "createdAt": 1776074446367,
  "updatedAt": 1776074479191,
  "lastResponseAt": 1776074479191,
  "sessionId": "bc8edeba-7a77-4523-8e79-95b84004035b",
  "providerState": {
    "providerSessionId": "bc8edeba-7a77-4523-8e79-95b84004035b"
  },
  "usage": {
    "model": "haiku",
    "inputTokens": 163,
    "cacheCreationInputTokens": 21877,
    "cacheReadInputTokens": 0,
    "contextWindow": 200000,
    "contextTokens": 22040,
    "percentage": 11,
    "contextWindowIsAuthoritative": true
  }
 }
@@ -1,3 +1,4 @@
 [
-  "obsidian-git"
+  "obsidian-git",
  "claudian"
 ]
@@ -0,0 +1,11 @@
 {
  "tabManagerState": {
    "openTabs": [
      {
        "tabId": "tab-1776074442549-1h1fbx3",
        "conversationId": "conv-1776074446367-y9l6jom6z"
      }
    ],
    "activeTabId": "tab-1776074442549-1h1fbx3"
  }
 }
@@ -1 +0,0 @@
@@ -1 +0,0 @@
 {}