GraphRAG Studio — initial commit: multimodal RAG system with KG visualization

Full-stack application for document-to-knowledge-graph pipeline:
- Backend: FastAPI + LangGraph ReAct agent + DeepSeek + MinerU parsing
- Frontend: React 19 + Vite + D3.js + shadcn/ui
- Pipeline: MinerU parsing → LangExtract entity extraction → KG building

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-07 17:30:04 +08:00
commit b02d3378fc
127 changed files with 37218 additions and 0 deletions
--- a/docs/mineru_specification.md
+++ b/docs/mineru_specification.md
@@ -0,0 +1,680 @@
+# MinerU Document Parsing Specification
+
+> Based on the official [opendatalab/MinerU](https://github.com/opendatalab/MinerU) documentation and a survey of the cloud API
+> Version baseline: 2026-03-04
+
+---
+
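+## Table of Contents
+
+- [1. Supported Input File Formats](#1-supported-input-file-formats)
+  - [1.1 Supported Formats](#11-supported-formats)
+  - [1.2 Input Limits](#12-input-limits)
+  - [1.3 OCR Language Support](#13-ocr-language-support)
+- [2. Cloud API Output Format](#2-cloud-api-output-format)
+  - [2.1 Output Files Overview](#21-output-files-overview)
+  - [2.2 content_list.json Field Reference](#22-content_listjson-field-reference)
+  - [2.3 middle.json Field Reference](#23-middlejson-field-reference)
+  - [2.4 Markdown Output](#24-markdown-output)
+  - [2.5 Debugging and Visualization Files](#25-debugging-and-visualization-files)
+- [3. Layout Information](#3-layout-information)
+  - [3.1 Coordinate Systems](#31-coordinate-systems)
+  - [3.2 Layout Categories (Pipeline Backend)](#32-layout-categories-pipeline-backend)
+  - [3.3 Layout Categories (VLM Backend)](#33-layout-categories-vlm-backend)
+  - [3.4 Content Hierarchy and Heading Levels](#34-content-hierarchy-and-heading-levels)
+  - [3.5 Layout-Aware Extraction Guide](#35-layout-aware-extraction-guide)
+- [4. Cloud API MVP Required Fields](#4-cloud-api-mvp-required-fields)
+  - [4.1 Authentication](#41-authentication)
+  - [4.2 Create a Parse Task: Request Specification](#42-create-a-parse-task-request-specification)
+  - [4.3 Query Task Result: Response Specification](#43-query-task-result-response-specification)
+  - [4.4 Batch Task Endpoints](#44-batch-task-endpoints)
+  - [4.5 Minimal MVP Request Example](#45-minimal-mvp-request-example)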
+
+---
+
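+## 1. Supported Input File Formats
+
+### 1.1 Supported Formats
+
+| Format | Extensions | Notes |
+|--------|------------|-------|
+| **PDF** | `.pdf` | Core capability: text-based, scanned, and mixed PDFs are all supported |
+| **Word** | `.doc`, `.docx` | Legacy and current Word documents |
+| **PowerPoint** | `.ppt`, `.pptx` | Legacy and current presentations |
+| **Image** | `.png`, `.jpg`, `.jpeg` | Single-page image documents; EXIF orientation is corrected automatically |
+| **HTML** | `.html` | Requires the `MinerU-HTML` model version |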
+
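+### 1.2 Input Limits
+
+| Constraint | Limit |
+|------------|-------|
+| Maximum file size | **200 MB** |
+| Maximum pages per file | **600 pages** |
+| Cloud API daily free quota | **2,000 pages** at highest priority; pages beyond the quota are processed at lower priority |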
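+
+The limits above can be checked locally before a file is submitted. The following is a minimal sketch, not part of MinerU: `check_input` and its constants are hypothetical names, "200 MB" is assumed to mean 200 × 1024 × 1024 bytes, and the page-count limit is not checked because that would require a PDF library.
+
+```python
+from pathlib import Path
+from typing import List, Optional
+
+# Values copied from the tables in sections 1.1 and 1.2.
+# Assumption: 200 MB is interpreted as 200 * 1024 * 1024 bytes.
+MAX_BYTES = 200 * 1024 * 1024
+SUPPORTED_EXTENSIONS = {
+    ".pdf", ".doc", ".docx", ".ppt", ".pptx", ".png", ".jpg", ".jpeg", ".html",
+}
+
+
+def check_input(path: str, size_bytes: Optional[int] = None) -> List[str]:
+    """Return a list of problems; an empty list means the local checks passed."""
+    p = Path(path)
+    problems = []
+    if p.suffix.lower() not in SUPPORTED_EXTENSIONS:
+        problems.append(f"unsupported extension: {p.suffix or '(none)'}")
+    # Read the size from disk unless the caller already knows it.
+    size = p.stat().st_size if size_bytes is None else size_bytes
+    if size > MAX_BYTES:
+        problems.append(f"file is {size} bytes, limit is {MAX_BYTES}")
+    return problems
+```
+
+Passing `size_bytes` explicitly is useful when the file lives at a remote URL and only its `Content-Length` is known.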
+
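+### 1.3 OCR Language Support
+
+MinerU's built-in OCR engine supports **109 languages**. Use the `language` parameter to set the document's primary language (default: `zh`, Chinese). Common language codes:
+
+| Code | Language | Code | Language |
+|------|----------|------|----------|
+| `zh` | Chinese | `en` | English |
+| `ja` | Japanese | `ko` | Korean |
+| `fr` | French | `de` | German |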
+
+---
+
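+## 2. Cloud API Output Format
+
+### 2.1 Output Files Overview
+
+When a cloud API task finishes, it returns a ZIP archive (downloaded via `full_zip_url`). After extraction it contains the following files:
+
+```
+output/
+├── auto/
+│   ├── auto.md                 # Multimodal Markdown (with image references)
+│   └── images/                 # Extracted image assets
+│       ├── img_0_0.png
+│       ├── table_0_1.png
+│       └── ...
+├── auto_nlp/
+│   └── auto_nlp.md             # Plain-text NLP Markdown (no images)
+├── middle.json                 # Rich-metadata intermediate format (full hierarchy)
+├── content_list.json           # Flat list of content blocks (in reading order)
+├── layout.pdf                  # Layout analysis visualization (for debugging)
+├── span.pdf                    # Span-level annotations (Pipeline backend, for debugging)
+└── model.json                  # Raw model inference results (for debugging)
+```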
+
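+| File | Purpose | Recommended use |
+|------|---------|-----------------|
+| `content_list.json` | Flat content blocks in reading order | **Recommended for feeding downstream NLP/KG pipelines** |
+| `middle.json` | Full hierarchy with rich metadata | When precise layout information or further processing is needed |
+| `auto/auto.md` | Multimodal Markdown | Human reading, direct LLM consumption |
+| `auto_nlp/auto_nlp.md` | Plain-text Markdown | Text-only NLP processing |
+| `layout.pdf` | Layout visualization | Debugging and verifying parse quality |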
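+
+Once the archive is extracted, `content_list.json` is usually the first file a pipeline reads. A minimal sketch of locating and loading it follows. `load_content_list` is a hypothetical helper, and the glob pattern assumes real archives may prefix the file name or nest it in a subdirectory.
+
+```python
+import json
+from pathlib import Path
+
+
+def load_content_list(output_dir: str) -> list:
+    """Find and load the first file whose name ends with content_list.json."""
+    # Assumption: search recursively and tolerate a file-name prefix.
+    matches = sorted(Path(output_dir).rglob("*content_list.json"))
+    if not matches:
+        raise FileNotFoundError(f"no content_list.json under {output_dir}")
+    with matches[0].open(encoding="utf-8") as f:
+        return json.load(f)
+```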
+
+---
+
+### 2.2 content_list.json Field Reference
+
+`content_list.json` is a **JSON array**. Each element is one content block, and blocks appear in document reading order.
+
+#### 2.2.1 Common Fields (All Types)
+
+| Field | Type | Description |
+|-------|------|-------------|
+| `type` | `string` | Content type: `text` / `image` / `table` / `equation` / `code` / `list` |
+| `page_idx` | `int` | Page index (**0-indexed**) |
+| `bbox` | `[x0, y0, x1, y1]` | Bounding box, normalized to the **0–1000** range |
+
+#### 2.2.2 Text Block (type: "text")
+
+```json
+{
+  "type": "text",
+  "text": "Paragraph body text...",
+  "text_level": 0,
+  "page_idx": 0,
+  "bbox": [72, 120, 540, 145]
+}
+```
+
+| Field | Type | Description |
+|-------|------|-------------|
+| `text` | `string` | Text content |
+| `text_level` | `int \| null` | Heading level: `null` or `0` = body text, `1` = level-1 heading, `2` = level-2 heading, and so on |
+
+#### 2.2.3 Image Block (type: "image")
+
+```json
+{
+  "type": "image",
+  "img_path": "images/img_0_0.png",
+  "image_caption": ["Figure 1: System architecture"],
+  "image_footnote": ["Source: internal report"],
+  "page_idx": 1,
+  "bbox": [100, 200, 500, 600]
+}
+```
+
+| Field | Type | Description |
+|-------|------|-------------|
+| `img_path` | `string` | Relative path to the image file |
+| `image_caption` | `string[]` | List of image captions |
+| `image_footnote` | `string[]` | List of image footnotes |
+
+#### 2.2.4 Table Block (type: "table")
+
+```json
+{
+  "type": "table",
+  "img_path": "images/table_0_1.png",
+  "table_body": "<html><body><table><tr><td>...</td></tr></table></body></html>",
+  "table_caption": ["Table 1: Performance comparison"],
+  "table_footnote": ["* p < 0.05"],
+  "page_idx": 2,
+  "bbox": [50, 300, 950, 700]
+}
+```
+
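+| Field | Type | Description |
+|-------|------|-------------|
+| `img_path` | `string` | Relative path to the table screenshot |
+| `table_body` | `string` | HTML representation of the table (`<table>` markup) |
+| `table_caption` | `string[]` | List of table captions |
+| `table_footnote` | `string[]` | List of table footnotes |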
+
+#### 2.2.5 Equation Block (type: "equation")
+
+```json
+{
+  "type": "equation",
+  "text": "E = mc^2",
+  "text_format": "latex",
+  "img_path": "images/eq_0_0.png",
+  "page_idx": 3,
+  "bbox": [200, 400, 800, 450]
+}
+```
+
+| Field | Type | Description |
+|-------|------|-------------|
+| `text` | `string` | LaTeX representation of the equation |
+| `text_format` | `string` | Always `"latex"` |
+| `img_path` | `string` | Relative path to the equation screenshot |
+
+#### 2.2.6 Code Block (type: "code"), VLM Backend
+
+```json
+{
+  "type": "code",
+  "sub_type": "code",
+  "code_body": "def hello():\n    print('hello')",
+  "code_caption": ["Listing 1: Example function"],
+  "page_idx": 4,
+  "bbox": [80, 100, 920, 300]
+}
+```
+
+| Field | Type | Description |
+|-------|------|-------------|
+| `sub_type` | `string` | `"code"` or `"algorithm"` |
+| `code_body` | `string` | Code text |
+| `code_caption` | `string[]` | Code block captions (optional) |
+
+#### 2.2.7 List Block (type: "list"), VLM Backend
+
+```json
+{
+  "type": "list",
+  "sub_type": "text",
+  "list_items": ["First item", "Second item", "Third item"],
+  "page_idx": 5,
+  "bbox": [72, 200, 540, 350]
+}
+```
+
+| Field | Type | Description |
+|-------|------|-------------|
+| `sub_type` | `string` | `"text"` or `"ref_text"` (reference list) |
+| `list_items` | `string[]` | List item contents |
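+
+Because each block type stores its payload under different keys, a downstream consumer typically needs a small dispatcher. The sketch below flattens one block to plain text using only the fields documented in 2.2.2 through 2.2.7. `block_to_text` is a hypothetical name, and reducing images and tables to their captions is an illustrative choice rather than MinerU behavior.
+
+```python
+def block_to_text(block: dict) -> str:
+    """Render one content_list.json block as plain text."""
+    kind = block.get("type")
+    if kind == "text":
+        return block.get("text", "")
+    if kind == "equation":
+        # Equation text is LaTeX; wrap it in display-math delimiters.
+        return f"$${block.get('text', '')}$$"
+    if kind == "code":
+        return block.get("code_body", "")
+    if kind == "list":
+        return "\n".join(f"- {item}" for item in block.get("list_items", []))
+    if kind == "image":
+        # Illustrative choice: represent an image by its captions only.
+        return " ".join(block.get("image_caption", []))
+    if kind == "table":
+        # Illustrative choice: represent a table by its captions only.
+        return " ".join(block.get("table_caption", []))
+    return ""
+```
+
+Joining `block_to_text(b)` over the whole array, in order, yields a reading-order text dump of the document.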
+
+---
+
+### 2.3 middle.json Field Reference
+
+`middle.json` is MinerU's rich-metadata intermediate format. It preserves the full document hierarchy.
+
+#### 2.3.1 Top-Level Structure
+
+```json
+{
+  "_backend": "pipeline | vlm | hybrid",
+  "_version_name": "2.7.4",
+  "pdf_info": [ ... ]
+}
+```
+
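+| Field | Type | Description |
+|-------|------|-------------|
+| `_backend` | `string` | Parsing backend that was used |
+| `_version_name` | `string` | MinerU version identifier |
+| `pdf_info` | `array` | Parse results, one element per page |
+
+#### 2.3.2 Page-Level Structure (Elements of pdf_info)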
+
+```json
+{
+  "page_idx": 0,
+  "page_size": [595.0, 842.0],
+  "preproc_blocks": [ ... ],
+  "para_blocks": [ ... ],
+  "images": [ ... ],
+  "tables": [ ... ],
+  "interline_equations": [ ... ],
+  "discarded_blocks": [ ... ]
+}
+```
+
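+| Field | Type | Description |
+|-------|------|-------------|
+| `page_idx` | `int` | Page index (0-indexed) |
+| `page_size` | `[float, float]` | Page size `[width, height]` (original PDF coordinate system, in pt) |
+| `preproc_blocks` | `array` | Preprocessed blocks before paragraph segmentation |
+| `para_blocks` | `array` | **Paragraph-segmented content blocks** (the main output) |
+| `images` | `array` | Extracted image blocks |
+| `tables` | `array` | Extracted table blocks |
+| `interline_equations` | `array` | Display (interline) equation blocks |
+| `discarded_blocks` | `array` | Filtered-out content (headers, footers, page numbers, etc.) |
+
+#### 2.3.3 Content Block Hierarchy
+
+Content blocks are nested as **Block → Line → Span**. Blocks themselves come in two levels: level-1 container blocks wrap level-2 semantic blocks.
+
+**Level-1 block (container block):**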
+
+```json
+{
+  "type": "table",
+  "bbox": [x0, y0, x1, y1],
+  "blocks": [ ... ]
+}
+```
+
+| Field | Type | Description |
+|-------|------|-------------|
+| `type` | `string` | `"table"` or `"image"` |
+| `bbox` | `[x0, y0, x1, y1]` | Bounding box (original PDF coordinate system) |
+| `blocks` | `array` | The level-2 blocks it contains |
+
+**Level-2 block (semantic block):**
+
+```json
+{
+  "type": "text",
+  "bbox": [x0, y0, x1, y1],
+  "lines": [ ... ]
+}
+```
+
+| `type` value | Description |
+|--------------|-------------|
+| `text` | Body paragraph |
+| `title` | Heading |
+| `image_body` | Image body |
+| `image_caption` | Image caption |
+| `image_footnote` | Image footnote |
+| `table_body` | Table body |
+| `table_caption` | Table caption |
+| `table_footnote` | Table footnote |
+| `interline_equation` | Display (interline) equation |
+| `index` | Table-of-contents entry |
+| `list` | List item |
+
+**Line structure:**
+
+```json
+{
+  "bbox": [x0, y0, x1, y1],
+  "spans": [ ... ]
+}
+```
+
+**Span structure (finest granularity):**
+
+```json
+{
+  "bbox": [x0, y0, x1, y1],
+  "type": "text",
+  "content": "Actual text content",
+  "score": 0.95
+}
+```
+
+| Field | Type | Description |
+|-------|------|-------------|
+| `bbox` | `[x0, y0, x1, y1]` | Bounding box |
+| `type` | `string` | `text` / `image` / `table` / `inline_equation` / `interline_equation` |
+| `content` | `string` | Text content (for `text` spans) |
+| `img_path` | `string` | Image path (for `image` and `table` spans) |
+| `score` | `float` | Model confidence (0.0 to 1.0) |
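+
+The nesting described above can be walked with a short generator. This is a sketch based only on the structure documented in this section: `iter_spans` and `page_text` are hypothetical helpers, and the code assumes that a block with a `blocks` key is a level-1 container while any other block carries `lines` directly.
+
+```python
+def iter_spans(middle: dict):
+    """Yield (page_idx, span) for every span under para_blocks."""
+    for page in middle.get("pdf_info", []):
+        for block in page.get("para_blocks", []):
+            # Container blocks nest level-2 blocks under "blocks";
+            # a level-2 block is treated as its own single child.
+            for inner in block.get("blocks", [block]):
+                for line in inner.get("lines", []):
+                    for span in line.get("spans", []):
+                        yield page.get("page_idx"), span
+
+
+def page_text(middle: dict, page_idx: int) -> str:
+    """Concatenate the content of all text spans on one page."""
+    return "".join(
+        span.get("content", "")
+        for idx, span in iter_spans(middle)
+        if idx == page_idx and span.get("type") == "text"
+    )
+```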
+
+---
+
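+### 2.4 Markdown Output
+
+| File | Characteristics |
+|------|-----------------|
+| `auto/auto.md` | Images are referenced as `![](images/img_x_x.png)`; tables are kept as Markdown tables or HTML; equations use `$...$` and `$$...$$` delimiters |
+| `auto_nlp/auto_nlp.md` | Plain text; images and tables are replaced with placeholder descriptions; suitable for feeding directly into an NLP pipeline |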
+
+---
+
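+### 2.5 Debugging and Visualization Files
+
+| File | Format | Description |
+|------|--------|-------------|
+| `layout.pdf` | PDF | Overlays numbered detection boxes on every page, color-coded by content type; use it to verify layout analysis and reading order |
+| `span.pdf` | PDF | Outlines each span in a color that indicates its type (Pipeline backend only); use it to track down missing text and equation recognition problems |
+| `model.json` | JSON | Raw model inference results, including `category_id`, `poly` (quadrilateral coordinates), and `score` (confidence) |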
+
+---
+
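+## 3. Layout Information
+
+### 3.1 Coordinate Systems
+
+MinerU uses three coordinate systems, depending on the output file:
+
+| Coordinate system | Applies to | Range | Origin | Notes |
+|-------------------|------------|-------|--------|-------|
+| **Normalized coordinates** | `content_list.json` | `0 – 1000` | Top-left | Page width and height are both mapped to 0~1000 |
+| **Original PDF coordinates** | `middle.json` | Actual pt values | Top-left | Matches the PDF page size (for example, A4 = 595×842) |
+| **Normalized ratio coordinates** | `model.json` (VLM) | `0.0 – 1.0` | Top-left | Width and height are both mapped to 0~1 |
+
+**All bboxes use the format `[x0, y0, x1, y1]`:**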
+
+```
+(x0, y0) ─────────────────── (x1, y0)
+    │                            │
+    │       content area         │
+    │                            │
+(x0, y1) ─────────────────── (x1, y1)
+```
+
+- `x0, y0`: top-left corner
+- `x1, y1`: bottom-right corner
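+
+To overlay a `content_list.json` bbox on the original page, its normalized coordinates need to be scaled by the page size recorded in `middle.json`. The following is a minimal sketch of that conversion. `to_pdf_points` is a hypothetical helper, and it assumes both systems share the top-left origin stated in the table above.
+
+```python
+def to_pdf_points(bbox, page_size, scale=1000):
+    """Map a bbox normalized to 0..scale onto page coordinates in pt."""
+    width, height = page_size
+    x0, y0, x1, y1 = bbox
+    return [
+        x0 * width / scale,
+        y0 * height / scale,
+        x1 * width / scale,
+        y1 * height / scale,
+    ]
+
+
+# Bottom half of an A4 page, using page_size from middle.json.
+print(to_pdf_points([0, 500, 1000, 1000], [595.0, 842.0]))
+```
+
+Passing `scale=1` handles the 0.0 to 1.0 ratio coordinates listed for the VLM `model.json`.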
+
+### 3.2 Layout Categories (Pipeline Backend)
+
+`category_id` values in `model.json`:
+
+| category_id | Type | Description |
+|-------------|------|-------------|
+| 0 | `title` | Heading |
+| 1 | `plain_text` | Body text |
+| 2 | `abandon` | Discarded regions (headers, footers, page numbers, etc.) |
+| 3 | `figure` | Image |
+| 4 | `figure_caption` | Image caption |
+| 5 | `table` | Table |
+| 6 | `table_caption` | Table caption |
+| 7 | `table_footnote` | Table footnote |
+| 8 | `isolate_formula` | Standalone display equation |
+| 9 | `formula_caption` | Equation caption |
+| 13 | `embedding` | Embedded content |
+| 14 | `isolated` | Isolated content |
+| 15 | `OCR_text` | OCR-recognized text |
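+
+When reading `model.json` from the Pipeline backend, a lookup table keeps the numeric IDs readable. The mapping below is transcribed from the table above. `category_name` is a hypothetical helper, and the `unknown_<id>` fallback for IDs absent from the table (10 to 12, for example) is an assumption.
+
+```python
+# Transcribed from the category_id table; IDs 10-12 are not listed there.
+PIPELINE_CATEGORIES = {
+    0: "title",
+    1: "plain_text",
+    2: "abandon",
+    3: "figure",
+    4: "figure_caption",
+    5: "table",
+    6: "table_caption",
+    7: "table_footnote",
+    8: "isolate_formula",
+    9: "formula_caption",
+    13: "embedding",
+    14: "isolated",
+    15: "OCR_text",
+}
+
+
+def category_name(category_id: int) -> str:
+    """Return the type name for a category_id, or a placeholder if unlisted."""
+    return PIPELINE_CATEGORIES.get(category_id, f"unknown_{category_id}")
+```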
+
+### 3.3 Layout Categories (VLM Backend)
+
+The VLM backend identifies types with strings and uses a finer-grained classification:
+
+| `type` value | Description |
+|--------------|-------------|
+| `text` | Body text |
+| `title` | Heading |
+| `equation` | Equation |
+| `image` | Image |
+| `image_caption` | Image caption |
+| `image_footnote` | Image footnote |
+| `table` | Table |
+| `table_caption` | Table caption |
+| `table_footnote` | Table footnote |
+| `code` | Code block |
+| `code_caption` | Code caption |
+| `list` | List |
+| `header` | Page header (discarded) |
+| `footer` | Page footer (discarded) |
+| `page_number` | Page number (discarded) |
+| `aside_text` | Sidebar text (discarded) |
+| `page_footnote` | Page footnote (discarded) |
+| `ref_text` | Reference entry |
+| `algorithm` | Algorithm pseudocode |
+| `phonetic` | Phonetic annotation |
+
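+### 3.4 Content Hierarchy and Heading Levels
+
+The `text_level` field in `content_list.json` marks where a block sits in the document structure:
+
+| text_level | Meaning | Markdown equivalent |
+|------------|---------|---------------------|
+| `null` or `0` | Body text | No marker |
+| `1` | Level-1 heading | `# Heading` |
+| `2` | Level-2 heading | `## Heading` |
+| `3` | Level-3 heading | `### Heading` |
+| `4` | Level-4 heading | `#### Heading` |
+| `5+` | Deeper headings | `##### Heading` and deeper (Markdown supports at most six `#`) |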
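+
+The mapping in the table above can be applied mechanically when rebuilding Markdown from `content_list.json`. A minimal sketch follows. `to_markdown_line` is a hypothetical name, and clamping levels above 6 to `######` is an assumption made because Markdown has no deeper heading level.
+
+```python
+def to_markdown_line(block: dict) -> str:
+    """Render a text block as a Markdown heading or a plain line."""
+    level = block.get("text_level") or 0  # null and 0 both mean body text
+    text = block.get("text", "")
+    if level < 1:
+        return text
+    # Assumption: clamp to 6 because Markdown defines no deeper level.
+    return "#" * min(level, 6) + " " + text
+```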
+
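+### 3.5 Layout-Aware Extraction Guide
+
+Recommendations for extracting each kind of data precisely:
+
+#### Text Extraction
+
+```python
+import json
+
+# Load the parsed content_list.json array
+with open("content_list.json", encoding="utf-8") as f:
+    content_list = json.load(f)
+
+# Extract body text only: heading blocks carry text_level >= 1 and are skipped
+texts = [
+    block for block in content_list
+    if block["type"] == "text" and not block.get("text_level")
+]
+# Filter by page
+page_0_texts = [b for b in texts if b["page_idx"] == 0]
+```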
+
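+#### Heading Hierarchy Extraction
+
+```python
+# Extract the document outline (`content_list` is the parsed content_list.json array)
+headings = [
+    {"level": block["text_level"], "text": block["text"], "page": block["page_idx"]}
+    for block in content_list
+    # text_level may be missing or null, so treat both as 0
+    if block["type"] == "text" and (block.get("text_level") or 0) >= 1
+]
+```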
+
+#### Table Value Extraction
+
+```python
+# Tables are stored as HTML in table_body and can be parsed with BeautifulSoup
+from bs4 import BeautifulSoup
+
+tables = [b for b in content_list if b["type"] == "table"]
+parsed_tables = []  # one list of rows per table
+for table in tables:
+    soup = BeautifulSoup(table["table_body"], "html.parser")
+    rows = []
+    for tr in soup.find_all("tr"):
+        cells = [td.get_text(strip=True) for td in tr.find_all(["td", "th"])]
+        rows.append(cells)
+    parsed_tables.append(rows)
+```
+
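+#### Spatial Positioning
+
+```python
+# Use bbox to determine where content sits on the page
+def get_position(bbox, threshold=500):
+    """Return whether the content is in the upper or lower half of the page (normalized 0-1000 coordinates)."""
+    y_center = (bbox[1] + bbox[3]) / 2
+    return "upper" if y_center < threshold else "lower"
+
+# Check whether two blocks are horizontally adjacent (on the same row)
+def is_same_row(block_a, block_b, tolerance=20):
+    """Compare the top edges (y0) of the two blocks within a tolerance."""
+    return abs(block_a["bbox"][1] - block_b["bbox"][1]) < tolerance
+```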
+
+---
+
+## 4. Cloud API MVP Required Fields
+
+### 4.1 Authentication
+
+| Setting | Value | How to obtain |
+|---------|-------|---------------|
+| Token | Bearer token string | Register at [mineru.net/apiManage/token](https://mineru.net/apiManage/token) |
+
+**Request headers (common to all endpoints):**
+
+```
+Authorization: Bearer {your_token}
+Content-Type: application/json
+```
+
+---
+
+### 4.2 Create a Parse Task: Request Specification
+
+**Endpoint:** `POST https://mineru.net/api/v4/extract/task`
+
+#### Request Body Fields
+
+| Field | Type | Required | Default | Description |
+|-------|------|----------|---------|-------------|
+| `url` | `string` | **Yes** | - | Publicly accessible URL of the file to parse |
+| `is_ocr` | `bool` | No | `false` | Force OCR (recommended for scanned documents) |
+| `enable_formula` | `bool` | No | `true` | Enable equation recognition |
+| `enable_table` | `bool` | No | `true` | Enable table recognition |
+| `language` | `string` | No | `"zh"` | Primary document language code |
+| `model` | `string` | No | Auto-selected | Model version: `pipeline` / `vlm` / `MinerU-HTML` |
+| `data_id` | `string` | No | - | Custom business identifier (for correlation and tracking) |
+| `callback_url` | `string` | No | - | URL notified when the task finishes |
+
+#### Minimal Required Fields for the MVP
+
+```json
+{
+  "url": "https://example.com/document.pdf"
+}
+```
+
+> Only `url` is required; every other parameter has a sensible default.
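+
+A small builder can guard against misspelled option names before the request is sent. This is a sketch only: `build_task_payload` is a hypothetical helper, and the set of allowed keys is taken from the request body table above rather than from any client library.
+
+```python
+# Optional fields listed in the request body table.
+OPTIONAL_FIELDS = {
+    "is_ocr", "enable_formula", "enable_table",
+    "language", "model", "data_id", "callback_url",
+}
+
+
+def build_task_payload(url: str, **options) -> dict:
+    """Build the JSON body for a parse task, rejecting unknown option names."""
+    if not url:
+        raise ValueError("url is required")
+    unknown = set(options) - OPTIONAL_FIELDS
+    if unknown:
+        raise ValueError(f"unknown fields: {sorted(unknown)}")
+    # Omitted options are left out so the server applies its defaults.
+    return {"url": url, **options}
+```
+
+For example, `build_task_payload("https://example.com/scan.pdf", is_ocr=True, language="en")` produces a body that forces OCR on an English scan.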
+
+---
+
+### 4.3 Query Task Result: Response Specification
+
+**Endpoint:** `GET https://mineru.net/api/v4/extract/task/{task_id}`
+
+#### Response Body Fields
+
+| Field | Type | Description |
+|-------|------|-------------|
+| `task_id` | `string` | Unique task identifier |
+| `state` | `string` | Task state (see the enumeration below) |
+| `err_msg` | `string \| null` | Error message (on failure) |
+| `full_zip_url` | `string \| null` | Download URL of the full output ZIP (on success) |
+| `file_name` | `string` | Original file name |
+| `batch_id` | `string \| null` | Batch task ID (if any) |
+
+#### Task State Enumeration
+
+| state | Description |
+|-------|-------------|
+| `pending` | Queued, waiting to run |
+| `processing` | Parsing in progress |
+| `done` | Parsing finished |
+| `failed` | Parsing failed (see `err_msg`) |
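+
+Since `done` and `failed` are the only terminal states listed above, a polling loop can stop on either and should also give up after a deadline. The helper below is a sketch, not part of any MinerU client: `wait_for_task` is a hypothetical name, and `fetch_state` stands for any callable that returns the response body as a dict, which keeps the loop testable without network access.
+
+```python
+import time
+
+TERMINAL_STATES = {"done", "failed"}
+
+
+def wait_for_task(fetch_state, interval=5.0, timeout=600.0,
+                  sleep=time.sleep, clock=time.monotonic):
+    """Call fetch_state() until the task reaches a terminal state or times out."""
+    deadline = clock() + timeout
+    while True:
+        result = fetch_state()
+        if result.get("state") in TERMINAL_STATES:
+            return result
+        if clock() >= deadline:
+            raise TimeoutError(f"task still {result.get('state')!r} after {timeout}s")
+        sleep(interval)
+```
+
+The `sleep` and `clock` parameters exist so that tests can inject fakes. Production callers can leave them at their defaults.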
+
+---
+
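+### 4.4 Batch Task Endpoints
+
+#### 4.4.1 Get Upload URLs in Batch
+
+**Endpoint:** `POST https://mineru.net/api/v4/file-urls/batch`
+
+Returns pre-signed URLs for uploading files (suited to uploading local files).
+
+#### 4.4.2 Create Tasks in Batch
+
+**Endpoint:** `POST https://mineru.net/api/v4/extract/task/batch`
+
+The `files` array in the request body holds the parse parameters for each file.
+
+#### 4.4.3 Query Batch Results
+
+**Endpoint:** `GET https://mineru.net/api/v4/extract-results/batch/{batch_id}`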
+
+---
+
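+### 4.5 Minimal MVP Request Example
+
+#### Python Implementation
+
+```python
+import io
+import os
+import time
+import zipfile
+
+import requests
+
+MINERU_API_TOKEN = os.getenv("MINERU_API_TOKEN")
+BASE_URL = "https://mineru.net/api/v4"
+HEADERS = {
+    "Authorization": f"Bearer {MINERU_API_TOKEN}",
+    "Content-Type": "application/json",
+}
+
+
+def unwrap(payload):
+    # Assumption: if the API wraps the fields in a top-level "data" object,
+    # use it; otherwise use the payload as documented in section 4.3.
+    data = payload.get("data")
+    return data if isinstance(data, dict) else payload
+
+
+# ① Create a parse task (url is the only required field)
+resp = requests.post(
+    f"{BASE_URL}/extract/task",
+    headers=HEADERS,
+    json={
+        "url": "https://example.com/sample.pdf",   # required: public file URL
+        # "is_ocr": False,                          # optional: defaults to false
+        # "enable_formula": True,                   # optional: defaults to true
+        # "enable_table": True,                     # optional: defaults to true
+        # "language": "zh",                         # optional: defaults to Chinese
+    },
+    timeout=30,
+)
+resp.raise_for_status()
+task_id = unwrap(resp.json())["task_id"]
+print(f"Task created: {task_id}")
+
+# ② Poll for the result
+zip_url = None
+while True:
+    poll = requests.get(
+        f"{BASE_URL}/extract/task/{task_id}",
+        headers=HEADERS,
+        timeout=30,
+    )
+    poll.raise_for_status()
+    result = unwrap(poll.json())
+
+    state = result["state"]
+    print(f"State: {state}")
+
+    if state == "done":
+        zip_url = result["full_zip_url"]
+        print(f"Download: {zip_url}")
+        break
+    elif state == "failed":
+        print(f"Error: {result['err_msg']}")
+        break
+
+    time.sleep(5)
+
+# ③ Download and extract the result (skipped if the task failed)
+if zip_url:
+    zip_resp = requests.get(zip_url, timeout=120)
+    zip_resp.raise_for_status()
+    with zipfile.ZipFile(io.BytesIO(zip_resp.content)) as zf:
+        zf.extractall("./mineru_output/")
+        print("Files:", zf.namelist())
+```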
+
+#### cURL Implementation
+
+```bash
+# Create a task
+curl -X POST https://mineru.net/api/v4/extract/task \
+  -H "Authorization: Bearer YOUR_TOKEN" \
+  -H "Content-Type: application/json" \
+  -d '{"url": "https://example.com/sample.pdf"}'
+
+# Query the result (replace {task_id} with the ID returned above)
+curl https://mineru.net/api/v4/extract/task/{task_id} \
+  -H "Authorization: Bearer YOUR_TOKEN"
+```
+
+#### MVP Checklist
+
+- [ ] Registered an account at [mineru.net](https://mineru.net/)
+- [ ] Obtained an API token from the [token management page](https://mineru.net/apiManage/token)
+- [ ] Stored the token in the `.env` file as `MINERU_API_TOKEN=xxx` and made it available as an environment variable (the Python example reads it with `os.getenv`)
+- [ ] Prepared a publicly accessible test file URL (PDF/DOCX/PPT/image)
+- [ ] Installed the `requests` library: `pip install requests`