
plf b02d3378fc GraphRAG Studio — initial commit: multimodal RAG system with KG visualization

Full-stack application for document-to-knowledge-graph pipeline:
- Backend: FastAPI + LangGraph ReAct agent + DeepSeek + MinerU parsing
- Frontend: React 19 + Vite + D3.js + shadcn/ui
- Pipeline: MinerU parsing → LangExtract entity extraction → KG building

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-06-07 17:30:04 +08:00


MinerU Document Parsing Specification

Based on the official opendatalab/MinerU documentation and a survey of the cloud API. Version baseline: 2026-03-04

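1. Supported Input File Formats

1.1 Supported formats

Format	Extension	Notes
PDF	`.pdf`	Core capability: text-based, scanned, and mixed PDFs are all supported
Word	`.doc`, `.docx`	Legacy and current Word documents
PowerPoint	`.ppt`, `.pptx`	Legacy and current presentations
Image	`.png`, `.jpg`, `.jpeg`	Single-page image documents; EXIF orientation is corrected automatically
HTML	`.html`	Requires the `MinerU-HTML` model version

1.2 Input limits

Constraint	Limit
Maximum size per file	200 MB
Maximum pages per file	600 pages
Cloud API free daily quota	2,000 pages at the highest priority; pages beyond that are processed at lower priority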
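
The limits above can be checked locally before a file is submitted. The following is a minimal sketch, not part of MinerU: the function name `check_input` is hypothetical, and reading "200 MB" as 200 × 1024 × 1024 bytes is an assumption.

```python
from __future__ import annotations

from pathlib import Path

# Values copied from the tables above. Treating 200 MB as 200 * 1024 * 1024
# bytes is an assumption; the spec does not say which definition applies.
MAX_FILE_BYTES = 200 * 1024 * 1024
MAX_PAGES = 600
SUPPORTED_EXTENSIONS = {
    ".pdf", ".doc", ".docx", ".ppt", ".pptx",
    ".png", ".jpg", ".jpeg", ".html",
}


def check_input(name: str, size_bytes: int, page_count: int | None = None) -> list[str]:
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    ext = Path(name).suffix.lower()
    if ext not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported extension: {ext or '(none)'}")
    if size_bytes > MAX_FILE_BYTES:
        problems.append(f"file too large: {size_bytes} bytes")
    # The page count is optional because it is not always known up front.
    if page_count is not None and page_count > MAX_PAGES:
        problems.append(f"too many pages: {page_count}")
    return problems
```

Returning a list of problems rather than raising on the first one lets a caller report every violation at once.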

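1.3 OCR language support

MinerU's built-in OCR engine supports 109 languages. Use the `language` parameter to set the document's primary language (default: `zh`, Chinese). Common language codes:

Code	Language	Code	Language
`zh`	Chinese	`en`	English
`ja`	Japanese	`ko`	Korean
`fr`	French	`de`	German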

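2. Cloud API Output Format

2.1 Output file overview

When a cloud API task completes, it returns a ZIP archive (available through `full_zip_url`). Once extracted, the archive contains the following files:

output/
├── auto/
│   ├── auto.md                 # Multimodal Markdown (with image references)
│   └── images/                 # Extracted image assets
│       ├── img_0_0.png
│       ├── table_0_1.png
│       └── ...
├── auto_nlp/
│   └── auto_nlp.md             # Plain-text NLP Markdown (no images)
├── middle.json                 # Rich-metadata intermediate format (full hierarchy)
├── content_list.json           # Flat list of content blocks (in reading order)
├── layout.pdf                  # Layout analysis visualization (for debugging)
├── span.pdf                    # Span-level annotations (Pipeline backend, for debugging)
└── model.json                  # Raw model inference output (for debugging)

File	Purpose	Recommended use
`content_list.json`	Flat content blocks in reading order	Recommended for feeding downstream NLP/KG pipelines
`middle.json`	Full hierarchy with rich metadata	When precise layout information or further development is needed
`auto/auto.md`	Multimodal Markdown	Human reading, direct LLM consumption
`auto_nlp/auto_nlp.md`	Plain-text Markdown	Text-only NLP processing
`layout.pdf`	Layout visualization	Debugging and verifying parse quality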
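
Since `content_list.json` is the recommended entry point for downstream pipelines, a small loader is a natural first step. This is a sketch that assumes the archive has been extracted and that `content_list.json` sits at the top of the output directory as in the tree above; the helper names are hypothetical.

```python
import json
from collections import Counter
from pathlib import Path


def load_content_list(output_dir):
    """Load content_list.json from an extracted output directory."""
    path = Path(output_dir) / "content_list.json"
    with path.open(encoding="utf-8") as f:
        blocks = json.load(f)
    # The file is expected to hold a JSON array of block objects.
    if not isinstance(blocks, list):
        raise ValueError(f"{path} does not contain a JSON array")
    return blocks


def count_block_types(blocks):
    """Count blocks per `type` as a quick sanity check of the parse."""
    return Counter(block.get("type", "unknown") for block in blocks)
```

Printing `count_block_types(load_content_list("./mineru_output"))` gives a quick view of how many text, table, and image blocks a document produced.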

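2.2 content_list.json field reference

content_list.json is a JSON array. Each element is one content block, and the blocks are ordered by the document's reading order.

2.2.1 Common fields (present on every block type)

Field	Type	Description
`type`	`string`	Content type: `text` / `image` / `table` / `equation` / `code` / `list`
`page_idx`	`int`	Page the block appears on (0-indexed)
`bbox`	`[x0, y0, x1, y1]`	Bounding box, normalized to the 0–1000 range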

2.2.2 Text block (type: "text")

{
  "type": "text",
  "text": "Paragraph body text...",
  "text_level": 0,
  "page_idx": 0,
  "bbox": [72, 120, 540, 145]
}

Field	Type	Description
`text`	`string`	Text content
`text_level`	`int \| null`	Heading level: `null` or `0` = body text, `1` = level-1 heading, `2` = level-2 heading, and so on

2.2.3 Image block (type: "image")

{
  "type": "image",
  "img_path": "images/img_0_0.png",
  "image_caption": ["Figure 1: System architecture"],
  "image_footnote": ["Source: internal report"],
  "page_idx": 1,
  "bbox": [100, 200, 500, 600]
}

Field	Type	Description
`img_path`	`string`	Relative path to the image file
`image_caption`	`string[]`	List of image captions
`image_footnote`	`string[]`	List of image footnotes

2.2.4 Table block (type: "table")

{
  "type": "table",
  "img_path": "images/table_0_1.png",
  "table_body": "<html><body><table><tr><td>...</td></tr></table></body></html>",
  "table_caption": ["Table 1: Performance comparison"],
  "table_footnote": ["* p < 0.05"],
  "page_idx": 2,
  "bbox": [50, 300, 950, 700]
}

Field	Type	Description
`img_path`	`string`	Relative path to the table screenshot
`table_body`	`string`	HTML representation of the table (`<table>` markup)
`table_caption`	`string[]`	List of table captions
`table_footnote`	`string[]`	List of table footnotes

2.2.5 Equation block (type: "equation")

{
  "type": "equation",
  "text": "E = mc^2",
  "text_format": "latex",
  "img_path": "images/eq_0_0.png",
  "page_idx": 3,
  "bbox": [200, 400, 800, 450]
}

Field	Type	Description
`text`	`string`	LaTeX representation of the equation
`text_format`	`string`	Always `"latex"`
`img_path`	`string`	Relative path to the equation screenshot

2.2.6 Code block (type: "code"), VLM backend

{
  "type": "code",
  "sub_type": "code",
  "code_body": "def hello():\n    print('hello')",
  "code_caption": ["Listing 1: Example function"],
  "page_idx": 4,
  "bbox": [80, 100, 920, 300]
}

Field	Type	Description
`sub_type`	`string`	`"code"` or `"algorithm"`
`code_body`	`string`	Code text
`code_caption`	`string[]`	Code block captions (optional)

2.2.7 List block (type: "list"), VLM backend

{
  "type": "list",
  "sub_type": "text",
  "list_items": ["First item", "Second item", "Third item"],
  "page_idx": 5,
  "bbox": [72, 200, 540, 350]
}

Field	Type	Description
`sub_type`	`string`	`"text"` or `"ref_text"` (reference list)
`list_items`	`string[]`	List item contents
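
Because each block type keeps its text in a different field, a downstream NLP/KG pipeline usually needs one function that flattens any block to plain text. The sketch below uses only the fields documented in sections 2.2.2 to 2.2.7; the function name `block_to_text` and the choice of separators are assumptions made for illustration.

```python
def block_to_text(block):
    """Return a plain-text rendering of one content_list.json block."""
    kind = block.get("type")
    if kind in ("text", "equation"):
        # Both types keep their content in `text` (LaTeX for equations).
        return block.get("text", "")
    if kind == "image":
        # Images carry no text of their own; use captions and footnotes.
        return " ".join(block.get("image_caption", []) + block.get("image_footnote", []))
    if kind == "table":
        # The HTML body is passed through unchanged, surrounded by
        # the captions and footnotes.
        parts = (
            block.get("table_caption", [])
            + [block.get("table_body", "")]
            + block.get("table_footnote", [])
        )
        return "\n".join(p for p in parts if p)
    if kind == "code":
        parts = block.get("code_caption", []) + [block.get("code_body", "")]
        return "\n".join(p for p in parts if p)
    if kind == "list":
        return "\n".join(block.get("list_items", []))
    return ""  # unknown types contribute nothing
```

Using `.get` with defaults keeps the function tolerant of blocks where an optional field such as `code_caption` is absent.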

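2.3 middle.json field reference

middle.json is MinerU's rich-metadata intermediate format. It preserves the complete document hierarchy.

2.3.1 Top-level structure

{
  "_backend": "pipeline | vlm | hybrid",
  "_version_name": "2.7.4",
  "pdf_info": [ ... ]
}

Field	Type	Description
`_backend`	`string`	Parsing backend that produced the file
`_version_name`	`string`	MinerU version identifier
`pdf_info`	`array`	Parse results, one element per page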

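2.3.2 Page-level structure (elements of the pdf_info array)

{
  "page_idx": 0,
  "page_size": [595.0, 842.0],
  "preproc_blocks": [ ... ],
  "para_blocks": [ ... ],
  "images": [ ... ],
  "tables": [ ... ],
  "interline_equations": [ ... ],
  "discarded_blocks": [ ... ]
}

Field	Type	Description
`page_idx`	`int`	Page number (0-indexed)
`page_size`	`[float, float]`	Page size `[width, height]` (native PDF coordinates, in pt)
`preproc_blocks`	`array`	Preprocessed blocks, before paragraph segmentation
`para_blocks`	`array`	Content blocks after paragraph segmentation (the main output)
`images`	`array`	Extracted image blocks
`tables`	`array`	Extracted table blocks
`interline_equations`	`array`	Display (interline) equation blocks
`discarded_blocks`	`array`	Filtered-out content (headers, footers, page numbers, etc.)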

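2.3.3 Content block hierarchy

Content blocks follow a three-level hierarchy: Block → Line → Span. Blocks come in two kinds: level-1 container blocks, and the level-2 semantic blocks that hold the lines.

Level-1 block (container block):

{
  "type": "table",
  "bbox": [x0, y0, x1, y1],
  "blocks": [ ... ]
}

Field	Type	Description
`type`	`string`	`"table"` or `"image"`
`bbox`	`[x0, y0, x1, y1]`	Bounding box (native PDF coordinates)
`blocks`	`array`	The level-2 blocks it contains

Level-2 block (semantic block):

{
  "type": "text",
  "bbox": [x0, y0, x1, y1],
  "lines": [ ... ]
}

`type` value	Description
`text`	Body paragraph
`title`	Heading
`image_body`	Image body
`image_caption`	Image caption
`image_footnote`	Image footnote
`table_body`	Table body
`table_caption`	Table caption
`table_footnote`	Table footnote
`interline_equation`	Display (interline) equation
`index`	Table-of-contents entry
`list`	List item

Line structure:

{
  "bbox": [x0, y0, x1, y1],
  "spans": [ ... ]
}

Span structure (finest granularity):

{
  "bbox": [x0, y0, x1, y1],
  "type": "text",
  "content": "Actual text content",
  "score": 0.95
}

Field	Type	Description
`bbox`	`[x0, y0, x1, y1]`	Bounding box
`type`	`string`	`text` / `image` / `table` / `inline_equation` / `interline_equation`
`content`	`string`	Text content (for `text` spans)
`img_path`	`string`	Image path (for `image` and `table` spans)
`score`	`float`	Model confidence (0.0 to 1.0)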
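
The Block → Line → Span hierarchy can be walked with a short recursive generator. The sketch below assumes exactly the nesting described above (container blocks hold `blocks`, semantic blocks hold `lines`, lines hold `spans`); the function names, the `joiner` parameter, and treating a missing `score` as 1.0 are illustrative choices, not MinerU behavior.

```python
def iter_spans(block):
    """Yield every span under a block, descending through nested `blocks`."""
    for child in block.get("blocks", []):   # level-1 container -> level-2 blocks
        yield from iter_spans(child)
    for line in block.get("lines", []):     # level-2 block -> lines -> spans
        yield from line.get("spans", [])


def page_text(page, min_score=0.0, joiner=" "):
    """Collect the text spans of one `pdf_info` page, one string per block."""
    paragraphs = []
    for block in page.get("para_blocks", []):
        pieces = [
            span.get("content", "")
            for span in iter_spans(block)
            # Spans without a score are kept (treated as fully confident).
            if span.get("type") == "text" and span.get("score", 1.0) >= min_score
        ]
        if pieces:
            paragraphs.append(joiner.join(pieces))
    return paragraphs
```

For Chinese or Japanese documents, passing `joiner=""` avoids inserting spaces between spans.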

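2.4 Markdown output

File	Characteristics
`auto/auto.md`	Images are referenced as `![](images/img_x_x.png)`; tables are kept as Markdown tables or HTML; equations use the `$...$` and `$$...$$` delimiters
`auto_nlp/auto_nlp.md`	Plain text; images and tables are replaced by placeholder text descriptions; suitable for feeding directly into an NLP pipeline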
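
As an illustration of the image reference convention in `auto/auto.md`, the following sketch lists the image paths a Markdown string refers to, for example to check that every referenced file exists under `images/`. The regular expression covers the plain `![alt](path)` form only, and the helper name is hypothetical.

```python
import re

# Matches Markdown image syntax ![alt](path) and captures the path.
IMAGE_REF = re.compile(r"!\[[^\]]*\]\(([^)\s]+)\)")


def image_paths(markdown):
    """Return image paths referenced in a Markdown string, in order, without duplicates."""
    seen = []
    for path in IMAGE_REF.findall(markdown):
        if path not in seen:
            seen.append(path)
    return seen
```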

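2.5 Debugging and visualization files

File	Format	Description
`layout.pdf`	PDF	Overlays numbered detection boxes on every page, color-coded by content type; used to verify layout analysis accuracy and reading order
`span.pdf`	PDF	Outlines page content in different colors by span type (Pipeline backend only); used to track down missing text and equation recognition problems
`model.json`	JSON	Raw model inference output, including `category_id`, `poly` (quadrilateral coordinates), and `score` (confidence)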

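3. Layout Information

3.1 Coordinate systems

MinerU uses three coordinate systems, depending on the output file:

Coordinate system	Applies to	Range	Origin	Notes
Normalized coordinates	`content_list.json`	`0 – 1000`	Top-left	Page width and height are both mapped to 0–1000
Native PDF coordinates	`middle.json`	Actual pt values	Top-left	Matches the PDF page size (e.g. A4 = 595×842)
Normalized ratio coordinates	`model.json` (VLM)	`0.0 – 1.0`	Top-left	Width and height are both mapped to 0–1

The bbox format is always [x0, y0, x1, y1]:

(x0, y0) ─────────────────── (x1, y0)
    │                            │
    │       content area         │
    │                            │
(x0, y1) ─────────────────── (x1, y1)

x0, y0: top-left corner
x1, y1: bottom-right corner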
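
Converting between these coordinate systems is a per-axis linear rescale. The sketch below maps a normalized bbox onto native PDF points using a page's `page_size` from middle.json; the function name and the `scale` parameter are illustrative.

```python
def normalized_to_points(bbox, page_size, scale=1000):
    """Map a 0..scale bbox onto a page of `page_size` = [width, height] in pt."""
    width, height = page_size
    x0, y0, x1, y1 = bbox
    # x values scale with the page width, y values with the page height.
    return [
        x0 * width / scale,
        y0 * height / scale,
        x1 * width / scale,
        y1 * height / scale,
    ]
```

With the default `scale=1000` this converts `content_list.json` boxes; passing `scale=1` handles the 0.0–1.0 ratio coordinates. Because width and height are normalized independently, aspect ratios are only meaningful after conversion back to points.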

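3.2 Layout categories (Pipeline backend)

Values of `category_id` in model.json:

category_id	Type	Description
0	`title`	Heading
1	`plain_text`	Body text
2	`abandon`	Discarded region (header, footer, page number, etc.)
3	`figure`	Image
4	`figure_caption`	Image caption
5	`table`	Table
6	`table_caption`	Table caption
7	`table_footnote`	Table footnote
8	`isolate_formula`	Standalone display equation
9	`formula_caption`	Equation caption
13	`embedding`	Embedded content
14	`isolated`	Isolated content
15	`OCR_text`	OCR-recognized text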
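
When inspecting model.json, a lookup table makes the numeric IDs readable. The mapping below is copied from the table above. The filter assumes detections are available as a flat list of dicts with `category_id` and `score` keys, which is a simplification: the exact nesting inside model.json is not specified here.

```python
# category_id -> type name, copied from the table above.
CATEGORY_NAMES = {
    0: "title", 1: "plain_text", 2: "abandon", 3: "figure",
    4: "figure_caption", 5: "table", 6: "table_caption",
    7: "table_footnote", 8: "isolate_formula", 9: "formula_caption",
    13: "embedding", 14: "isolated", 15: "OCR_text",
}
ABANDON_ID = 2


def category_name(category_id):
    """Translate a numeric category_id into its type name."""
    return CATEGORY_NAMES.get(category_id, f"unknown_{category_id}")


def keep_detections(detections, min_score=0.5):
    """Drop `abandon` regions and detections below `min_score`."""
    return [
        d for d in detections
        if d.get("category_id") != ABANDON_ID and d.get("score", 0.0) >= min_score
    ]
```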

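3.3 Layout categories (VLM backend)

The VLM backend identifies types by string and uses a finer-grained classification:

`type` value	Description
`text`	Body text
`title`	Heading
`equation`	Equation
`image`	Image
`image_caption`	Image caption
`image_footnote`	Image footnote
`table`	Table
`table_caption`	Table caption
`table_footnote`	Table footnote
`code`	Code block
`code_caption`	Code caption
`list`	List
`header`	Page header (discarded)
`footer`	Page footer (discarded)
`page_number`	Page number (discarded)
`aside_text`	Marginal text (discarded)
`page_footnote`	Page footnote (discarded)
`ref_text`	Reference entry
`algorithm`	Algorithm pseudocode
`phonetic`	Phonetic annotation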

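3.4 Content hierarchy and heading levels

The `text_level` field in content_list.json marks a block's level in the document structure:

text_level	Meaning	Markdown equivalent
`null` or `0`	Body text	No marker
`1`	Level-1 heading	`# Heading`
`2`	Level-2 heading	`## Heading`
`3`	Level-3 heading	`### Heading`
`4`	Level-4 heading	`#### Heading`
`5+`	Deeper headings	`#####+ Heading`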
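
The mapping in this table can be applied mechanically to rebuild Markdown from text blocks. A minimal sketch, with hypothetical function names, that treats a missing `text_level` the same as `null`:

```python
def text_block_to_markdown(block):
    """Render one text block as a Markdown line according to its text_level."""
    level = block.get("text_level") or 0   # null and 0 both mean body text
    text = block.get("text", "")
    return f"{'#' * level} {text}" if level >= 1 else text


def render_text_blocks(blocks):
    """Render all text blocks, in order, separated by blank lines."""
    lines = [text_block_to_markdown(b) for b in blocks if b.get("type") == "text"]
    return "\n\n".join(lines)
```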

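3.5 Guide to precise layout-aware extraction

Recommendations for extracting each kind of data accurately. The snippets assume `content_list` already holds the parsed contents of content_list.json.

Text extraction

# Collect all text blocks from content_list.json (body text and headings)
texts = [
    block for block in content_list
    if block["type"] == "text"
]
# Filter by page
page_0_texts = [b for b in texts if b["page_idx"] == 0]

Heading hierarchy extraction

# Extract the document outline; text_level may be null, so default it to 0
headings = [
    {"level": block["text_level"], "text": block["text"], "page": block["page_idx"]}
    for block in content_list
    if block["type"] == "text" and (block.get("text_level") or 0) >= 1
]

Table value extraction

# Tables are stored as HTML in table_body and can be parsed with BeautifulSoup
from bs4 import BeautifulSoup

tables = [b for b in content_list if b["type"] == "table"]
for table in tables:
    soup = BeautifulSoup(table["table_body"], "html.parser")
    rows = []  # one list of cell strings per <tr>
    for tr in soup.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all(["td", "th"])]
        rows.append(cells)

Spatial positioning

# Use bbox to determine where content sits on the page
def get_position(bbox, threshold=500):
    """Return whether the block's vertical center is in the upper or lower half of the page (normalized coordinates, 0-1000)."""
    y_center = (bbox[1] + bbox[3]) / 2
    return "upper" if y_center < threshold else "lower"

# Check whether two blocks sit on the same row, i.e. their top edges (y0) are within `tolerance`
def is_same_row(block_a, block_b, tolerance=20):
    return abs(block_a["bbox"][1] - block_b["bbox"][1]) < tolerance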

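4. Cloud API: Required Fields for an MVP

4.1 Authentication

Setting	Value	How to obtain
Token	Bearer token string	Issued at mineru.net/apiManage/token after registration

Request header format (common to all endpoints):

Authorization: Bearer {your_token}
Content-Type: application/json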

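4.2 Creating a parsing task: request

Endpoint: POST https://mineru.net/api/v4/extract/task

Request body fields

Field	Type	Required	Default	Description
`url`	`string`	Yes	(none)	Publicly accessible URL of the file to parse
`is_ocr`	`bool`	No	`false`	Force OCR (recommended for scanned documents)
`enable_formula`	`bool`	No	`true`	Enable equation recognition
`enable_table`	`bool`	No	`true`	Enable table recognition
`language`	`string`	No	`"zh"`	Primary language code of the document
`model`	`string`	No	Selected automatically	Model version: `pipeline` / `vlm` / `MinerU-HTML`
`data_id`	`string`	No	(none)	Custom business identifier (for correlation and tracking)
`callback_url`	`string`	No	(none)	Callback URL notified when the task completes

Minimum required fields for an MVP

{
  "url": "https://example.com/document.pdf"
}

Only `url` is required; every other parameter has a sensible default.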
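
A small builder can guard against misspelled option names before the request is sent. This is a sketch based only on the field table above; the function name `build_task_payload` and the choice to omit options set to `None` are assumptions, not part of the API.

```python
# Optional request fields, copied from the table above.
OPTIONAL_FIELDS = {
    "is_ocr", "enable_formula", "enable_table",
    "language", "model", "data_id", "callback_url",
}


def build_task_payload(url, **options):
    """Build the JSON body for a task-creation request."""
    if not url.startswith(("http://", "https://")):
        raise ValueError("url must be an http(s) URL")
    unknown = set(options) - OPTIONAL_FIELDS
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    payload = {"url": url}
    # Leave out unset options so the documented defaults apply.
    payload.update({k: v for k, v in options.items() if v is not None})
    return payload
```

For example, `build_task_payload("https://example.com/scan.pdf", is_ocr=True, language="en")` yields a body with exactly those three keys.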

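4.3 Querying the task result: response

Endpoint: GET https://mineru.net/api/v4/extract/task/{task_id}

Response body fields

Field	Type	Description
`task_id`	`string`	Unique task identifier
`state`	`string`	Task state (see the values below)
`err_msg`	`string \| null`	Error message (on failure)
`full_zip_url`	`string \| null`	Download URL of the full output ZIP (on success)
`file_name`	`string`	Original file name
`batch_id`	`string \| null`	Batch task ID (if any)

Task states

state	Description
`pending`	Waiting in the queue
`processing`	Being parsed
`done`	Parsing finished
`failed`	Parsing failed (see `err_msg`)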
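
Since `done` and `failed` are the only terminal states, polling reduces to repeating the query until one of them appears or a deadline passes. The sketch below keeps the HTTP call out of the loop: `fetch` is any caller-supplied function that returns the parsed response as a dict. The function name, default interval, and default timeout are illustrative assumptions.

```python
import time

TERMINAL_STATES = {"done", "failed"}


def wait_for_task(fetch, interval=5.0, timeout=600.0,
                  sleep=time.sleep, clock=time.monotonic):
    """Call `fetch()` until the task reaches a terminal state or `timeout` elapses.

    `sleep` and `clock` are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + timeout
    while True:
        result = fetch()
        if result.get("state") in TERMINAL_STATES:
            return result
        if clock() >= deadline:
            raise TimeoutError(f"task still {result.get('state')!r} after {timeout} s")
        sleep(interval)
```

The caller inspects the returned dict: `full_zip_url` when `state` is `done`, `err_msg` when it is `failed`.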

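4.4 Batch task endpoints

4.4.1 Requesting upload URLs in batch

Endpoint: POST https://mineru.net/api/v4/file-urls/batch

Returns pre-signed URLs for uploading files (intended for uploading local files).

4.4.2 Creating tasks in batch

Endpoint: POST https://mineru.net/api/v4/extract/task/batch

The `files` array in the request body holds the parsing parameters for each file.

4.4.3 Querying batch results

Endpoint: GET https://mineru.net/api/v4/extract-results/batch/{batch_id}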

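4.5 Minimal working request example for an MVP

Python implementation

import io
import os
import time
import zipfile

import requests

MINERU_API_TOKEN = os.getenv("MINERU_API_TOKEN")
if not MINERU_API_TOKEN:
    raise SystemExit("MINERU_API_TOKEN is not set")

BASE_URL = "https://mineru.net/api/v4"
HEADERS = {
    "Authorization": f"Bearer {MINERU_API_TOKEN}",
    "Content-Type": "application/json",
}

# ① Create the parsing task (`url` is the only required field)
resp = requests.post(
    f"{BASE_URL}/extract/task",
    headers=HEADERS,
    json={
        "url": "https://example.com/sample.pdf",   # required: public file URL
        # "is_ocr": False,                          # optional: defaults to false
        # "enable_formula": True,                   # optional: defaults to true
        # "enable_table": True,                     # optional: defaults to true
        # "language": "zh",                         # optional: defaults to Chinese
    },
    timeout=30,
)
resp.raise_for_status()
body = resp.json()
# Assumption: if the service wraps the payload in a "data" envelope, unwrap it;
# otherwise read the fields from the top level as listed in section 4.3.
task_id = body.get("data", body)["task_id"]
print(f"Task created: {task_id}")

# ② Poll for the result
zip_url = None
while True:
    poll = requests.get(
        f"{BASE_URL}/extract/task/{task_id}",
        headers=HEADERS,
        timeout=30,
    )
    poll.raise_for_status()
    result = poll.json()
    result = result.get("data", result)  # same envelope assumption as above

    state = result["state"]
    print(f"State: {state}")

    if state == "done":
        zip_url = result["full_zip_url"]
        print(f"Download: {zip_url}")
        break
    elif state == "failed":
        print(f"Error: {result['err_msg']}")
        break

    time.sleep(5)

# ③ Download and extract the result (only if the task succeeded)
if zip_url:
    zip_resp = requests.get(zip_url, timeout=120)
    zip_resp.raise_for_status()
    with zipfile.ZipFile(io.BytesIO(zip_resp.content)) as zf:
        zf.extractall("./mineru_output/")
        print("Files:", zf.namelist())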

cURL implementation

# Create the task
curl -X POST https://mineru.net/api/v4/extract/task \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/sample.pdf"}'

# Query the result (replace {task_id} with the ID returned above)
curl https://mineru.net/api/v4/extract/task/{task_id} \
  -H "Authorization: Bearer YOUR_TOKEN"

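MVP checklist

Registered an account at mineru.net
Obtained an API token from the token management page
Added the token to the .env file (MINERU_API_TOKEN=xxx) and loaded it into the environment, since os.getenv does not read .env files on its own
Prepared a publicly accessible URL for a test file (PDF/DOCX/PPT/image)
Installed the requests library: pip install requests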
