GraphRAG Studio — initial commit: multimodal RAG system with KG visualization
Full-stack application for document-to-knowledge-graph pipeline:
- Backend: FastAPI + LangGraph ReAct agent + DeepSeek + MinerU parsing
- Frontend: React 19 + Vite + D3.js + shadcn/ui
- Pipeline: MinerU parsing → LangExtract entity extraction → KG building

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
779
docs/agentic_rag_specification-v1.0.md
Normal file
@@ -0,0 +1,779 @@
# Agentic-RAG Specification v1.0

> Core flow of the GraphRAG question-answering stage: Knowledge Graph → LangChain Agent → QA
>
> Data source: Bridge Pipeline output (`kg_nodes.json` + `kg_edges.json`)
> Test verification date: 2026-03-05
> End-to-end run time: ~40s (4 test queries)

---

## Contents

- [1. Execution Flow and Script Locations](#1-execution-flow-and-script-locations)
- [2. LangChain Agent Input and Output Specification](#2-langchain-agent-input-and-output-specification)
- [3. MinerU ↔ Agentic-RAG Integration and Core Architecture](#3-mineru--agentic-rag-integration-and-core-architecture)
- [4. Final Response Data Format for the QA Flow](#4-final-response-data-format-for-the-qa-flow)
- [5. Virtual Environment and Dependencies](#5-virtual-environment-and-dependencies)

---

## 1. Execution Flow and Script Locations

### 1.1 Architectural Position

Agentic-RAG is the **question-answering stage** of the GraphRAG system. It runs after the Bridge Pipeline and turns the knowledge graph into an interactive question-answering capability.

```
[Completed stages]                     [This stage: Agentic-RAG]
────────────────────                   ──────────────────────────
PDF
 ↓ MinerU Cloud API
content_list.json
 ↓ Bridge Pipeline
kg_nodes.json (40 nodes)   ──────────→ NetworkX Graph (in memory)
kg_edges.json (780 edges)                    ↓
                                       4 LangChain @tool functions
                                             ↓
                                       LangChain v1 create_agent
                                       (DeepSeek deepseek-chat)
                                             ↓
                                       ReAct reasoning loop
                                             ↓
                                       Natural-language answer
```

### 1.2 Five-Step Execution Flow

| Step | Module | Description |
|------|--------|-------------|
| Step 0 | Environment + config | Load `.env` (DEEPSEEK_API_KEY) and initialize `ChatOpenAI` (setup, not counted among the five steps) |
| Step 1 | KG loading | Read `kg_nodes.json` + `kg_edges.json` and build an undirected NetworkX graph |
| Step 2 | Tool registration | Register 4 KG retrieval tools with the `@tool` decorator |
| Step 3 | Agent construction | `create_agent(model, tools, system_prompt)` compiles the LangGraph graph |
| Step 4 | QA invocation | `agent.invoke({"messages": [("human", question)]})` |
| Step 5 | Result extraction | `result["messages"][-1].content` yields the final answer |

### 1.3 Test Script Location

```
F:\GraphRAGAgent\graphrag_pipeline\
├── agentic_rag_mvp.py      ← main test script (the file this specification describes)
├── .env                    ← DEEPSEEK_API_KEY configuration
└── output/
    ├── kg_nodes.json       ← generated by the Bridge Pipeline (40 nodes)
    └── kg_edges.json       ← generated by the Bridge Pipeline (780 edges)
```

### 1.4 Run Command

```bash
# MVP connectivity test (4 preset test queries)
F:/GraphRAGAgent/langextract_src/.venv/Scripts/python.exe \
  F:/GraphRAGAgent/graphrag_pipeline/agentic_rag_mvp.py
```

### 1.5 The ReAct Reasoning Loop

The agent follows the **ReAct (Reasoning + Acting)** pattern. Each question is processed as follows:

```
User input (question: str)
        │
        ▼
┌─────────────────────────────────────────────────┐
│  LLM Reasoning (DeepSeek deepseek-chat)         │
│  Decide: which tool to call, with what args?    │
└─────────────────────────────────────────────────┘
        │  tool_call
        ▼
┌─────────────────────────────────────────────────┐
│  Tool Execution (local NetworkX, no API calls)  │
│  search_entities / get_neighbors /              │
│  get_entities_by_type / describe_graph          │
└─────────────────────────────────────────────────┘
        │  ToolMessage (text returned by the tool)
        ▼
┌─────────────────────────────────────────────────┐
│  LLM Observation (inspect the tool result)      │
│  Decide: is this enough, or call more tools?    │
└─────────────────────────────────────────────────┘
        │  another tool_call, or emit the final answer
        ▼
AIMessage (final natural-language answer)
```

**Observed tool-call patterns (4 test queries):**

| Query type | Tool-call sequence | Notes |
|------------|--------------------|-------|
| Graph overview | `describe_graph` | Single tool call |
| Type enumeration | `get_entities_by_type` | Single tool call |
| Multi-hop relationship reasoning | `search_entities` → `get_neighbors` | Two sequential calls |
| Exact concept lookup | `search_entities` → `get_neighbors` | Two sequential calls |

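
The control flow of the loop above can be sketched without any LLM or LangChain dependency. This is only an illustration under stated assumptions: `scripted_llm`, `react_loop`, and the lambda tools are hypothetical stand-ins, and the real system delegates this loop to `create_agent`.

```python
from typing import Callable

# Hypothetical stand-in tools; the real ones query the in-memory NetworkX graph.
TOOLS: dict[str, Callable[..., str]] = {
    "search_entities": lambda query: f'[TECHNOLOGY] "{query}" (id=node_0)',
    "get_neighbors": lambda entity_name, hops=1: f"Hop 1 neighbors of {entity_name}: ...",
}


def scripted_llm(messages: list[dict]) -> dict:
    """Fake model: search first, then expand neighbors, then answer."""
    seen_results = sum(1 for m in messages if m["role"] == "tool")
    if seen_results == 0:
        call = {"name": "search_entities", "args": {"query": "GraphRAG"}}
        return {"role": "ai", "content": "", "tool_calls": [call]}
    if seen_results == 1:
        call = {"name": "get_neighbors", "args": {"entity_name": "GraphRAG", "hops": 1}}
        return {"role": "ai", "content": "", "tool_calls": [call]}
    return {"role": "ai", "content": "GraphRAG is linked to ...", "tool_calls": []}


def react_loop(question: str, max_steps: int = 5) -> list[dict]:
    """Reason -> act -> observe until the model stops requesting tools."""
    messages = [{"role": "human", "content": question}]
    for _ in range(max_steps):
        reply = scripted_llm(messages)
        messages.append(reply)
        if not reply["tool_calls"]:  # no tool requested: this is the final answer
            break
        for call in reply["tool_calls"]:
            result = TOOLS[call["name"]](**call["args"])
            messages.append({"role": "tool", "name": call["name"], "content": result})
    return messages


history = react_loop("What is related to GraphRAG?")
print([m["role"] for m in history])
```

With this scripted model the history has the same shape as the two-call pattern in the table: one human message, two tool-call/tool-result rounds, and a final answer. The `max_steps` cap is an assumption added here so a misbehaving model cannot loop forever.
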
---

## 2. LangChain Agent Input and Output Specification

### 2.1 LLM Adapter

#### 2.1.1 DeepSeek → LangChain Standard Component

In LangChain v1, `ChatOpenAI` can talk to any OpenAI-compatible API by overriding `base_url`:

```python
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="deepseek-chat",                  # DeepSeek model name
    api_key=DEEPSEEK_API_KEY,               # from graphrag_pipeline/.env
    base_url="https://api.deepseek.com",    # OpenAI-compatible endpoint
    temperature=0,                          # deterministic output for QA
)
```

| Parameter | Value | Description |
|-----------|-------|-------------|
| `model` | `"deepseek-chat"` | The actual DeepSeek model identifier |
| `api_key` | `${DEEPSEEK_API_KEY}` | Read from `.env`; shared with the Bridge Pipeline |
| `base_url` | `"https://api.deepseek.com"` | DeepSeek also accepts `https://api.deepseek.com/v1`; the `/v1` suffix is optional |
| `temperature` | `0` | Set to 0 for QA so that output is as reproducible as possible |

#### 2.1.2 Differences from DeepSeek in LangExtract

| Aspect | DeepSeek in LangExtract | DeepSeek in Agentic-RAG |
|--------|-------------------------|-------------------------|
| Integration | Instantiate `OpenAILanguageModel` directly | LangChain `ChatOpenAI` standard component |
| API key environment variable | `OPENAI_API_KEY` | `DEEPSEEK_API_KEY` |
| Invocation | `lx.extract(model=model)` | `agent.invoke({"messages": ...})` |
| Output format | JSON (entity extraction) | Natural language (QA) |
| Tool calling | Not supported (single-turn inference) | Supported (multi-turn ReAct) |

### 2.2 Agent Construction

#### 2.2.1 LangChain v1 create_agent

```python
from langchain.agents import create_agent

agent = create_agent(
    model=llm,                       # ChatOpenAI instance
    tools=_tools,                    # List[BaseTool], 4 tools
    system_prompt=SYSTEM_PROMPT,     # system prompt string
)
```

**Version notes:**

| API | Status | Notes |
|-----|--------|-------|
| `langchain.agents.create_agent` | ✅ Recommended in LangChain v1 | Used by this project |
| `langgraph.prebuilt.create_react_agent` | ⚠️ Deprecated in LangGraph v1.0 | Deprecated; do not use |
| `langchain.agents.create_react_agent` (legacy) | ❌ Legacy | Removed |

#### 2.2.2 System Prompt

```
You are a Knowledge Graph QA assistant. You have access to a knowledge graph
extracted from academic documents about GraphRAG and related technologies.

The graph contains:
- {node_count} deduplicated entities ({type_list} types)
- {edge_count} CO_OCCURS_IN edges representing same-page co-occurrence

Available tools:
1. search_entities — find entities by keyword substring
2. get_neighbors — explore entity relationships (N-hop BFS)
3. get_entities_by_type — list all entities of a type
4. describe_graph — get graph statistics overview

Reasoning strategy:
- Always use at least one tool before answering a factual question
- For relationship questions, use get_neighbors after identifying the entity with search_entities
- For enumeration questions, use get_entities_by_type
- Synthesize tool results into a clear, concise answer
- Cite the entity names and types in your final answer
```

### 2.3 Agent Input

#### 2.3.1 invoke Input Format

```python
result = agent.invoke({
    "messages": [
        ("human", question)   # user question (natural-language string)
    ]
})
```

**Input fields:**

| Field | Type | Description |
|-------|------|-------------|
| `messages` | `list[tuple[str, str]]` | Message list; each item is `(role, content)` |
| `role` | `"human"` \| `"ai"` \| `"system"` | Message role |
| `content` | `str` | Message content |

**Multi-turn input (with conversation history):**

```python
result = agent.invoke({
    "messages": [
        ("human", "What is GraphRAG?"),
        ("ai", "GraphRAG is a knowledge graph-enhanced RAG system..."),
        ("human", "How does it relate to LLMs?"),   # current question
    ]
})
```

### 2.4 Agent Output

#### 2.4.1 Raw invoke Return Value

```python
{
    "messages": [
        HumanMessage(content="What is GraphRAG?"),
        AIMessage(content="", tool_calls=[...]),          # tool call
        ToolMessage(content="...", tool_call_id="..."),   # tool result
        AIMessage(content="GraphRAG is an advanced...")   # final answer
    ]
}
```

#### 2.4.2 Message Types

| Message type | Role | Description |
|--------------|------|-------------|
| `HumanMessage` | `human` | User input |
| `AIMessage` (non-empty `tool_calls`) | `ai` | The LLM decides to call a tool |
| `ToolMessage` | `tool` | Tool execution result |
| `AIMessage` (empty `tool_calls`) | `ai` | Final natural-language answer |

#### 2.4.3 Extracting the Final Answer

```python
final_msg = result["messages"][-1]
answer = final_msg.content   # str: final natural-language answer
```

### 2.5 Input and Output of the Four Tools

#### Tool 1: `search_entities`

| Item | Specification |
|------|---------------|
| Input | `query: str`: keyword (case-insensitive substring match) |
| Matching logic | `query.lower() in entity_name.lower()` |
| Return format | Multi-line text, one entity per line: `[{type}] "{name}" (confidence={c}, page={p}, id={id})` |
| No match | Returns a hint plus the first 8 entity names as samples |
| Maximum results | 15 |

**Actual call example:**

```
Input: query="GraphRAG"
Output:
Found 3 entity(ies) matching 'GraphRAG':
[TECHNOLOGY] "GraphRAG" (confidence=match_exact, page=0, id=node_0)
[CONCEPT] "GraphRAG pipeline" (confidence=match_exact, page=0, id=node_12)
[CONCEPT] "GraphRAG (Global)" (confidence=match_exact, page=0, id=node_15)
```

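
The matching rules in the table can be sketched in plain Python over a list of node dictionaries. This is a minimal sketch, not the project's implementation: the real tool reads nodes from the NetworkX graph and is registered with `@tool`, and the wording of the no-match hint here is an assumption.

```python
# A few nodes in the kg_nodes.json shape (fields trimmed for brevity).
NODES = [
    {"id": "node_0", "name": "GraphRAG", "type": "TECHNOLOGY",
     "confidence": "match_exact", "page": 0},
    {"id": "node_3", "name": "knowledge graphs", "type": "CONCEPT",
     "confidence": "match_exact", "page": 0},
    {"id": "node_12", "name": "GraphRAG pipeline", "type": "CONCEPT",
     "confidence": "match_exact", "page": 0},
]


def search_entities(query: str, nodes=NODES, limit: int = 15, samples: int = 8) -> str:
    """Case-insensitive substring search; at most `limit` result lines."""
    needle = query.lower()
    hits = [n for n in nodes if needle in n["name"].lower()]
    if not hits:
        # No match: return a hint plus the first few entity names as samples.
        names = ", ".join(n["name"] for n in nodes[:samples])
        return f"No entities matching '{query}'. Sample entities: {names}"
    lines = [f"Found {len(hits)} entity(ies) matching '{query}':"]
    for n in hits[:limit]:
        lines.append(
            f'[{n["type"]}] "{n["name"]}" '
            f'(confidence={n["confidence"]}, page={n["page"]}, id={n["id"]})'
        )
    return "\n".join(lines)


print(search_entities("graphrag"))
print(search_entities("quantum"))
```

Returning text rather than raising on a miss keeps the agent loop going: the model can read the sample names and retry with a better keyword.
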
#### Tool 2: `get_neighbors`

| Item | Specification |
|------|---------------|
| Input | `entity_name: str`, `hops: int = 1` (range 1-3) |
| Matching logic | Substring match to find the start node; takes `candidates[0]` |
| Traversal algorithm | `nx.single_source_shortest_path_length(G, node_id, cutoff=hops)` |
| Return format | Grouped by hop; each entry is `[{type}] {name}`, at most 20 entries per group |
| Not found | Returns a hint suggesting `search_entities` first |

**Actual call example:**

```
Input: entity_name="GraphRAG", hops=1
Output:
Neighbors of 'GraphRAG' [TECHNOLOGY] within 1 hop(s):

Hop 1 — 39 related entities:
[CONCEPT] Knowledge Graph Enhanced RAG System
[CONCEPT] retrieval-augmented generation
...
Total related entities: 39
```

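
To make the traversal concrete, here is a minimal breadth-first sketch that groups neighbors by hop distance, using a plain adjacency dictionary instead of NetworkX. The function name `neighbors_by_hop` and the clamping of `hops` to 1-3 are assumptions for illustration; the specification itself relies on `nx.single_source_shortest_path_length` with a `cutoff`.

```python
from collections import deque


def neighbors_by_hop(adj: dict[str, set[str]], start: str, hops: int = 1) -> dict[int, list[str]]:
    """Return {hop distance: sorted node names} for nodes within `hops` of `start`."""
    hops = max(1, min(hops, 3))  # assumed handling of the documented 1-3 range
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if dist[node] == hops:
            continue  # do not expand beyond the cutoff
        for nb in adj.get(node, ()):
            if nb not in dist:  # first visit gives the shortest distance
                dist[nb] = dist[node] + 1
                queue.append(nb)
    grouped: dict[int, list[str]] = {}
    for node, d in dist.items():
        if d > 0:  # skip the start node itself
            grouped.setdefault(d, []).append(node)
    return {d: sorted(names) for d, names in sorted(grouped.items())}


# Small path-like graph: a-b, a-c, b-d, d-e
adj = {"a": {"b", "c"}, "b": {"a", "d"}, "c": {"a"}, "d": {"b", "e"}, "e": {"d"}}
print(neighbors_by_hop(adj, "a", hops=2))
```

On the 40-node test graph every node is one hop from every other, which is why the example output above lists 39 entities at hop 1 and larger `hops` values add nothing.
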
#### Tool 3: `get_entities_by_type`

| Item | Specification |
|------|---------------|
| Input | `entity_type: str` (normalized with `.upper()`) |
| Valid types | `TECHNOLOGY`, `CONCEPT`, `PERSON`, `ORGANIZATION`, `LOCATION` |
| Return format | Sorted alphabetically by `name`; one line per entity: `• {name} (confidence={c}, page={p})` |
| Invalid type | Returns an error plus the list of types actually present in the graph |

**Actual call example:**

```
Input: entity_type="TECHNOLOGY"
Output:
TECHNOLOGY entities (4 total):
• GraphRAG (confidence=match_exact, page=0)
• LLMs (confidence=match_exact, page=0)
• LangExtract (confidence=match_exact, page=0)
• MinerU (confidence=match_exact, page=0)
```

#### Tool 4: `describe_graph`

| Item | Specification |
|------|---------------|
| Input | No parameters |
| Computed metrics | Node count, edge count, relation type, graph density (`nx.density`), degree centrality (`nx.degree_centrality`) |
| Return format | Structured text: overview + type distribution + top-5 central nodes |

**Actual call example (measured output):**

```
=== Knowledge Graph Overview ===
Nodes (entities): 40
Edges (relations): 780
Relation type: CO_OCCURS_IN (same-page co-occurrence)
Graph density: 1.0000

Entity type distribution:
CONCEPT : 36
TECHNOLOGY : 4

Top-5 most connected entities (by degree centrality):
[TECHNOLOGY] GraphRAG (centrality=1.000)
[CONCEPT] Knowledge Graph Enhanced RAG System (centrality=1.000)
[CONCEPT] retrieval-augmented generation (centrality=1.000)
[CONCEPT] knowledge graphs (centrality=1.000)
[CONCEPT] large language models (centrality=1.000)
```

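
The density and centrality figures in this output follow directly from the standard formulas for an undirected graph: density is 2m / (n(n-1)) and degree centrality is degree / (n-1). The sketch below recomputes them by hand for a 40-node complete graph; the helper names are illustrative, and the single-node case is simplified compared with NetworkX.

```python
def density(n_nodes: int, n_edges: int) -> float:
    """Undirected graph density: 2m / (n * (n - 1))."""
    if n_nodes < 2:
        return 0.0
    return 2 * n_edges / (n_nodes * (n_nodes - 1))


def degree_centrality(adj: dict[str, set[str]]) -> dict[str, float]:
    """Degree divided by the maximum possible degree (n - 1)."""
    n = len(adj)
    if n < 2:
        return {node: 0.0 for node in adj}  # simplification for tiny graphs
    return {node: len(nbrs) / (n - 1) for node, nbrs in adj.items()}


# Complete graph on 40 nodes, like the single-page test document.
ids = [f"node_{i}" for i in range(40)]
adj = {a: {b for b in ids if b != a} for a in ids}
n_edges = sum(len(nbrs) for nbrs in adj.values()) // 2  # each edge is counted twice

print(n_edges, density(len(ids), n_edges), degree_centrality(adj)["node_0"])
```

Because every node ties at centrality 1.0 in a complete graph, the "Top-5" list above carries no ranking information for this test document; it only becomes meaningful once multi-page documents break the complete-graph structure.
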
---

## 3. MinerU ↔ Agentic-RAG Integration and Core Architecture

### 3.1 End-to-End Technical Architecture

```
┌─────────────────────────────────────────────────────────────────────┐
│  Stage 1: Document parsing (MinerU Cloud API)                       │
│                                                                     │
│  PDF file                                                           │
│    │ POST /file-urls/batch (enable_table=True, language="en")       │
│    ├─ PUT {presigned_url} (raw upload, no Content-Type header)      │
│    └─ GET /extract-results/batch/{batch_id} (poll until done)       │
│    ↓                                                                │
│  full_zip_url → unzip → {uuid}_content_list.json                    │
│                                                                     │
│  Key fields: type, text, text_level, table_body, page_idx, bbox     │
└─────────────────────────────────────────────────────────────────────┘
                                  ↓
┌─────────────────────────────────────────────────────────────────────┐
│  Stage 2: Knowledge graph construction (Bridge Pipeline)            │
│                                                                     │
│  content_list.json                                                  │
│    │ text_assembler.py                                              │
│    ├─ text blocks → joined after .rstrip()                          │
│    ├─ table blocks → BeautifulSoup HTML → pipe-delimited text       │
│    └─ PageText(page_idx, text, block_spans)                         │
│    ↓                                                                │
│  entity_extractor.py (LangExtract + DeepSeek)                       │
│    ↓                                                                │
│  kg_builder.py (deduplication + CO_OCCURS_IN edges)                 │
│    ↓                                                                │
│  kg_nodes.json (40 nodes) + kg_edges.json (780 edges)               │
└─────────────────────────────────────────────────────────────────────┘
                                  ↓
┌─────────────────────────────────────────────────────────────────────┐
│  Stage 3: Agentic-RAG question answering (LangChain + LangGraph)    │
│                                                                     │
│  kg_nodes.json → NetworkX.G.add_node(**node)                        │
│  kg_edges.json → NetworkX.G.add_edge(source, target, **edge)        │
│                                                                     │
│  @tool search_entities      ← substring match                       │
│  @tool get_neighbors        ← BFS N-hop traversal                   │
│  @tool get_entities_by_type ← type filter                           │
│  @tool describe_graph       ← graph statistics                      │
│    ↓                                                                │
│  create_agent(ChatOpenAI("deepseek-chat"), tools, system_prompt)    │
│    ↓                                                                │
│  ReAct reasoning loop (think → tool_call → observe → repeat)        │
│    ↓                                                                │
│  Natural-language answer (AIMessage.content)                        │
└─────────────────────────────────────────────────────────────────────┘
```

### 3.2 MinerU → KG Field Mapping

| MinerU output field | Bridge Pipeline handling | Use in Agentic-RAG |
|---------------------|--------------------------|--------------------|
| `block["type"]` | Distinguishes `text`/`table`/`image` | Not used directly (already converted by the Bridge) |
| `block["text"]` | Added to PageText after `.rstrip()` | Absorbed into `node["name"]` |
| `block["table_body"]` | BeautifulSoup → pipe-delimited text | Absorbed into entity descriptions |
| `block["page_idx"]` | Grouping key; recorded in BlockSpan | `node["page"]` field |
| `block["bbox"]` | Records the character-offset position | `node["char_start"]` / `node["char_end"]` |
| `{uuid}_content_list.json` file name | UUID used as `source_doc_id` | `node["source_doc"]` / `edge["doc_id"]` |

### 3.3 Building the NetworkX Graph

```python
import networkx as nx

G = nx.Graph()  # undirected graph (CO_OCCURS_IN has no direction)

# Nodes: from kg_nodes.json
for node in kg_nodes:
    G.add_node(
        node["id"],   # primary key: node_0, node_1, ...
        **node        # all fields become node attributes
    )

# Edges: from kg_edges.json
for edge in kg_edges:
    G.add_edge(
        edge["source"],              # node_0
        edge["target"],              # node_1
        relation=edge["relation"],   # "CO_OCCURS_IN"
        doc_id=edge["doc_id"],       # UUID
        page=edge["page"],           # 0-indexed
    )
```

**Graph properties:**

| Property | Measured value | Description |
|----------|----------------|-------------|
| `G.number_of_nodes()` | `40` | Number of deduplicated entities |
| `G.number_of_edges()` | `780` | Number of CO_OCCURS_IN edges |
| `nx.density(G)` | `1.0` | Complete graph (single-page document, so every pair of nodes is connected) |
| `G.nodes[nid]` | `dict` | Node attribute dictionary (id, name, type, page, confidence, ...) |

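### 3.4 Key MinerU API Parameters (Those Relevant to Agentic-RAG)

| Parameter | Recommended value | Why it matters for Agentic-RAG |
|-----------|-------------------|--------------------------------|
| `enable_table` | `True` | Tables are parsed into HTML `<table>`; the Bridge converts them to text for entity extraction, which affects KG node quality |
| `enable_formula` | `True` (default) | Formulas are written inline as LaTeX, which reduces text cleanliness and can produce noisy entities |
| `language` | `"en"` / `"ch"` | Affects OCR accuracy, which directly affects text quality and the entity alignment rate |
| `model_version` | `"pipeline"` | Outputs `{uuid}_content_list.json`; the Bridge matches it with the glob `*_content_list.json` |
| `page_ranges` | As needed | Multi-page documents can be processed in batches to reduce the number of entities and edges per batch |
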
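### 3.5 Agent Extension Point

When the KG data changes (a new document is ingested), Agentic-RAG only needs to **reload the JSON files**. The agent does not need to be rebuilt, provided the graph object the tools hold is updated in place rather than replaced:

```python
# Reload the KG in place (after a new document has been processed)
G.clear()
G.update(_load_kg())  # re-read kg_nodes.json + kg_edges.json into the same object
# The agent instance is not rebuilt; the tools still reference the same G object
```
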
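
Why in-place mutation matters can be shown with a small stand-in class. This is a sketch under stated assumptions: `KG`, `load_kg`, and `make_count_tool` are hypothetical, the real shared object is a `networkx.Graph`, and whether rebinding is visible to a tool depends on how that tool holds its reference (a closure or argument keeps the old object; a module-global lookup in the same module would see the new one).

```python
class KG:
    """Tiny stand-in for the shared graph (the real object is a networkx.Graph)."""

    def __init__(self, nodes=None):
        self.nodes = dict(nodes or {})

    def clear(self):
        self.nodes.clear()

    def update(self, other):
        self.nodes.update(other.nodes)


def load_kg() -> KG:
    """Hypothetical loader returning a freshly built graph with two nodes."""
    return KG({"node_0": {"name": "GraphRAG"}, "node_1": {"name": "MinerU"}})


def make_count_tool(graph: KG):
    """A tool that captured the graph object when it was created."""
    def count_nodes() -> int:
        return len(graph.nodes)
    return count_nodes


G = KG({"node_0": {"name": "GraphRAG"}})
count_nodes = make_count_tool(G)

# Rebinding the name creates a new object; the tool still holds the old one.
original = G
G = load_kg()
stale_count = count_nodes()

# Mutating the original object in place is visible through the tool.
G = original
G.clear()
G.update(load_kg())
fresh_count = count_nodes()

print(stale_count, fresh_count)
```

The first count reflects the old single-node graph, the second the reloaded two-node graph, which is the behavior the reload snippet is after.
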
---

## 4. Final Response Data Format for the QA Flow

### 4.1 Full invoke Return Structure

```python
result = agent.invoke({"messages": [("human", question)]})
# type(result): dict
# result.keys(): ["messages"]
```

`result["messages"]` is an ordered list containing the full conversation history:

```python
[
    HumanMessage,    # user input
    AIMessage,       # tool-call decision (may repeat)
    ToolMessage,     # tool result (may repeat)
    ...              # possibly more AIMessage + ToolMessage rounds
    AIMessage,       # final answer (tool_calls=[])
]
```

### 4.2 HumanMessage Format

```python
HumanMessage(
    content="What technology entities are in the knowledge graph?",
    additional_kwargs={},
    response_metadata={},
    id="uuid-string",   # generated automatically
)
```

### 4.3 AIMessage (Tool Call) Format

```python
AIMessage(
    content="",   # empty: the LLM decided to call a tool
    additional_kwargs={
        "tool_calls": [
            {
                "id": "call_abc123",
                "type": "function",
                "function": {
                    "name": "get_entities_by_type",
                    "arguments": "{\"entity_type\": \"TECHNOLOGY\"}"
                }
            }
        ]
    },
    tool_calls=[
        {
            "name": "get_entities_by_type",
            "args": {"entity_type": "TECHNOLOGY"},
            "id": "call_abc123",
            "type": "tool_call",
        }
    ],
    response_metadata={
        "model_name": "deepseek-chat",
        "finish_reason": "tool_calls",
        "token_usage": {   # key name as reported by langchain-openai
            "prompt_tokens": 580,
            "completion_tokens": 18,
            "total_tokens": 598,
        }
    },
)
```

### 4.4 ToolMessage Format

```python
ToolMessage(
    content="TECHNOLOGY entities (4 total):\n • GraphRAG ...\n • LLMs ...",
    tool_call_id="call_abc123",     # matches AIMessage.tool_calls[i].id
    name="get_entities_by_type",    # tool name
    additional_kwargs={},
    response_metadata={},
)
```

### 4.5 AIMessage (Final Answer) Format

```python
AIMessage(
    content="## Technology Entities in the Knowledge Graph\n\n1. **GraphRAG** ...",
    additional_kwargs={
        "tool_calls": []   # empty list: no further tool calls
    },
    tool_calls=[],
    response_metadata={
        "model_name": "deepseek-chat",
        "finish_reason": "stop",
        "token_usage": {   # key name as reported by langchain-openai
            "prompt_tokens": 820,
            "completion_tokens": 350,
            "total_tokens": 1170,
        }
    },
    id="msg-uuid-string",
)
```

### 4.6 Extracting the Final Answer

```python
# Standard extraction
final_msg = result["messages"][-1]   # on a normal run, the last message is the final AIMessage
answer: str = final_msg.content      # natural-language answer

# Defensive extraction
answer = (
    final_msg.content
    if hasattr(final_msg, "content")
    else str(final_msg)
)
```

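
A stricter variant is to search backward for the last AI message that carries no tool calls, instead of assuming the last list element is the answer. The sketch below uses a small stand-in dataclass so it runs without LangChain; `Msg` and `extract_final_answer` are hypothetical names, and the check relies only on the `type`, `content`, and `tool_calls` attributes that LangChain messages expose.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Msg:
    """Stand-in for LangChain message classes (which also expose `.type`)."""
    type: str                      # "human" | "ai" | "tool"
    content: str = ""
    tool_calls: list = field(default_factory=list)


def extract_final_answer(messages: list) -> Optional[str]:
    """Return the content of the last AI message without tool calls, if any."""
    for msg in reversed(messages):
        is_ai = getattr(msg, "type", None) == "ai"
        if is_ai and not getattr(msg, "tool_calls", None):
            content = msg.content
            return content if isinstance(content, str) else str(content)
    return None  # e.g. the run stopped while a tool call was still pending


finished = [
    Msg("human", "What is GraphRAG?"),
    Msg("ai", "", [{"name": "search_entities", "args": {"query": "GraphRAG"}}]),
    Msg("tool", "Found 3 entity(ies) ..."),
    Msg("ai", "GraphRAG is an advanced..."),
]
interrupted = finished[:2]

print(extract_final_answer(finished))
print(extract_final_answer(interrupted))
```

Returning `None` lets the caller tell "no answer was produced" apart from an empty string, which matters if the run is cut short by a step limit or an error.
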
### 4.7 Recommended Wrapper Data Format

For business-layer callers, wrapping the result in the following structure makes it easier to consume downstream:

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class AgenticRAGResponse:
    question: str                 # original user question
    answer: str                   # final answer (Markdown)
    tool_calls: list[dict]        # record of the tool-call chain
    total_messages: int           # message count (human/ai/tool, all included)
    token_usage: dict[str, int]   # token usage statistics
    kg_stats: dict[str, Any]      # KG size information
```

**Example of filling it in:**

```python
def run_query_with_metadata(question: str) -> AgenticRAGResponse:
    # Relies on the module-level `agent`, `G`, and `import networkx as nx`.
    result = agent.invoke({"messages": [("human", question)]})
    messages = result["messages"]

    # Collect the tool-call chain
    tool_calls = []
    for msg in messages:
        if hasattr(msg, "tool_calls") and msg.tool_calls:
            for tc in msg.tool_calls:
                tool_calls.append({
                    "tool": tc["name"],
                    "args": tc["args"],
                    "call_id": tc["id"],
                })

    # Token statistics (from the last AIMessage only; earlier model calls are not summed)
    last_ai = messages[-1]
    usage = last_ai.response_metadata.get("token_usage", {})

    return AgenticRAGResponse(
        question=question,
        answer=messages[-1].content,
        tool_calls=tool_calls,
        total_messages=len(messages),
        token_usage={
            "prompt_tokens": usage.get("prompt_tokens", 0),
            "completion_tokens": usage.get("completion_tokens", 0),
            "total_tokens": usage.get("total_tokens", 0),
        },
        kg_stats={
            "nodes": G.number_of_nodes(),
            "edges": G.number_of_edges(),
            "density": nx.density(G),
        },
    )
```

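
Since a ReAct run makes one model call per AI message, the usage on the last message covers only the final call. If a per-question total is wanted, the counts can be summed over every AI message. The sketch below works on plain dictionaries shaped like the messages in sections 4.3 and 4.5; `total_token_usage` is a hypothetical helper, and it accepts either a `token_usage` or a `usage` key as an assumption about where the provider metadata ends up.

```python
def total_token_usage(messages: list[dict]) -> dict[str, int]:
    """Sum prompt/completion/total tokens over all AI messages in a run."""
    totals = {"prompt_tokens": 0, "completion_tokens": 0, "total_tokens": 0}
    for msg in messages:
        if msg.get("type") != "ai":
            continue  # human and tool messages carry no model usage
        meta = msg.get("response_metadata", {})
        usage = meta.get("token_usage") or meta.get("usage") or {}
        for key in totals:
            totals[key] += int(usage.get(key, 0) or 0)
    return totals


# Dict stand-ins using the figures from the AIMessage examples in 4.3 and 4.5.
run = [
    {"type": "human", "content": "What technology entities are in the knowledge graph?"},
    {"type": "ai", "response_metadata": {"token_usage": {
        "prompt_tokens": 580, "completion_tokens": 18, "total_tokens": 598}}},
    {"type": "tool", "content": "TECHNOLOGY entities (4 total): ..."},
    {"type": "ai", "response_metadata": {"token_usage": {
        "prompt_tokens": 820, "completion_tokens": 350, "total_tokens": 1170}}},
]

print(total_token_usage(run))
```

With real LangChain messages the same loop would read `msg.type` and `msg.response_metadata` as attributes instead of dictionary keys.
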
### 4.8 Measured QA Response Samples

#### Sample 1: T1-Overview (graph overview question)

```
Question: Give me an overview of the knowledge graph.
What types of entities does it contain and which entities are most central?

Tool-call chain:
[1] describe_graph()

Final answer (excerpt):
## Knowledge Graph Overview
**Structure:**
- 40 entities (nodes) connected by 780 edges
- All edges represent CO_OCCURS_IN relationships
- Graph density: 1.000 — fully connected graph

**Entity Types:**
1. TECHNOLOGY (4): GraphRAG, LLMs, LangExtract, MinerU
2. CONCEPT (36): retrieval-augmented generation, knowledge graphs, ...

**Most Central Entities (centrality=1.000):**
1. [TECHNOLOGY] GraphRAG
2. [CONCEPT] Knowledge Graph Enhanced RAG System
...

Messages: 4 (human + ai_tool_call + tool_result + ai_final)
Token usage: about 900 tokens
```

#### Sample 2: T3-MultiHop (multi-hop reasoning question)

```
Question: What concepts and technologies are most closely related to GraphRAG?
Explore the graph neighborhood and explain the connections.

Tool-call chain:
[1] search_entities(query="GraphRAG")
[2] get_neighbors(entity_name="GraphRAG", hops=1)

Final answer (excerpt):
## Concepts and Technologies Most Closely Related to GraphRAG

### Core Technologies (Directly Connected):
1. LLMs — The foundation models that GraphRAG enhances
2. LangExtract — Used for language extraction in the pipeline
3. MinerU — Part of the data processing ecosystem

### Key Concepts:
- Knowledge Graph Enhanced RAG System (overarching architecture)
- retrieval-augmented generation (core paradigm)
- multi-hop reasoning (key capability)
...

Messages: 6 (human + 2×ai_tool_call + 2×tool_result + ai_final)
Token usage: about 1,200 tokens
```

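### 4.9 Errors and Edge Cases

| Situation | Agent behavior | Result / mitigation |
|-----------|----------------|---------------------|
| Entity not found | Tool returns a hint plus sample entity names | Agent rewrites the query or states its uncertainty |
| Invalid type | Tool returns the list of valid types | Agent corrects the type and retries |
| Question outside KG scope | No tool result supports an answer | Agent states plainly that the information is not in the current KG |
| Token limit exceeded | The request fails with a context-length error (there is no automatic truncation by default) | Reduce `hops` or shorten the question |
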
---

## 5. Virtual Environment and Dependencies

### 5.1 Runtime Environment

| Item | Value |
|------|-------|
| Virtual environment | `F:\GraphRAGAgent\langextract_src\.venv\` (reuses the Bridge Pipeline venv) |
| Python version | 3.12 |
| Installer | uv |

### 5.2 Dependencies Added for Agentic-RAG

| Package | Version (tested) | Purpose |
|---------|------------------|---------|
| `langchain` | 1.2.10 | `@tool` decorator, `create_agent` |
| `langchain-openai` | latest | `ChatOpenAI` (DeepSeek adapter) |
| `langgraph` | latest | Runtime underneath `create_agent` |
| `networkx` | latest | KG graph construction, BFS traversal, centrality computation |

### 5.3 Installing All Dependencies

```bash
uv pip install langchain langchain-openai langgraph networkx \
  --python F:/GraphRAGAgent/langextract_src/.venv/Scripts/python.exe
```

### 5.4 Environment Variables

`F:\GraphRAGAgent\graphrag_pipeline\.env`:

```env
DEEPSEEK_API_KEY=sk-xxxxxxxxxxxxxxxx
DEEPSEEK_BASE_URL=https://api.deepseek.com
```

---

## Appendix: File Dependencies by Stage

| Stage | Input | Output | Key script |
|-------|-------|--------|------------|
| MinerU parsing | `*.pdf` | `{uuid}_content_list.json` | `mineru_mvp/pipeline.py` |
| Bridge Pipeline | `*_content_list.json` | `kg_nodes.json` + `kg_edges.json` | `graphrag_pipeline/bridge.py` |
| Agentic-RAG | `kg_nodes.json` + `kg_edges.json` | Natural-language answer | `graphrag_pipeline/agentic_rag_mvp.py` |

| Specification | Scope |
|---------------|-------|
| `docs/mineru_specification-v1.0.md` | Input/output of the MinerU parsing stage |
| `docs/langextract_specification-v1.0.md` | LangExtract entity-extraction parameters |
| `docs/bridge_pipeline_specification-v1.0.md` | Bridge Pipeline integration and KG output format |
| `docs/agentic_rag_specification-v1.0.md` | **This file**: Agentic-RAG question-answering stage |

1757
docs/backend_service_specification-v1.0.md
Normal file
File diff suppressed because it is too large
481
docs/bridge_pipeline_specification-v1.0.md
Normal file
@@ -0,0 +1,481 @@
# Bridge Pipeline Specification v1.0

> Core flow of the GraphRAG indexing stage: MinerU → LangExtract → Knowledge Graph

---

## 1. Pipeline Execution Flow

### 1.1 Overall Architecture

The Bridge Pipeline is the core flow of the GraphRAG indexing stage. It feeds the structured PDF content parsed by MinerU into LangExtract for entity extraction and produces the nodes and edges of the knowledge graph.

```
MinerU output                    Bridge Pipeline                                     KG output
─────────────                    ───────────────                                     ─────────
{uuid}_content_list.json    →    text_assembler.py
  ├─ text blocks                   ├─ join plain text per page
  └─ table blocks (HTML)           ├─ HTML table → plain text
                                   └─ record each block's char offsets
                            →    entity_extractor.py
                                   ├─ call lx.extract() page by page
                                   └─ DeepSeek via OpenAI Provider
                            →    kg_builder.py
                                   ├─ filter low-quality alignments                →  kg_nodes.json
                                   ├─ deduplicate nodes (name.lower(), type)
                                   └─ same-page entity pairs → CO_OCCURS_IN edges  →  kg_edges.json
```

### 1.2 Five-Step Execution Flow

| Step | Module | Description |
|------|--------|-------------|
| Step 1 | `bridge.py` | Load the MinerU output `content_list.json`; resolve the input path and source_doc_id |
| Step 2 | `text_assembler.py` | Group by `page_idx`, join the plain text, and record each block's character offsets |
| Step 3 | `entity_extractor.py` | Call LangExtract + DeepSeek page by page to extract entities |
| Step 4 | `kg_builder.py` | Filter low-quality alignments → deduplicate nodes → pair same-page entities into CO_OCCURS_IN edges |
| Step 5 | `bridge.py` | Save `kg_nodes.json` + `kg_edges.json` to the output directory |

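
Step 4's filter-then-deduplicate logic can be sketched in a few lines. This is an illustration, not the contents of `kg_builder.py`: the function name `dedupe_nodes` and the choice to keep only `match_exact` alignments are assumptions, while the `(name.lower(), type)` key and the `node_N` id scheme come from this specification.

```python
def dedupe_nodes(extractions: list[dict], keep=("match_exact",)) -> list[dict]:
    """Drop low-quality alignments, then keep the first entity per (name.lower(), type)."""
    seen: dict[tuple[str, str], dict] = {}
    for ex in extractions:
        if ex["confidence"] not in keep:
            continue  # filtered out before deduplication
        key = (ex["name"].lower(), ex["type"])
        if key not in seen:
            seen[key] = {"id": f"node_{len(seen)}", **ex}
    return list(seen.values())


raw = [
    {"name": "GraphRAG", "type": "TECHNOLOGY", "confidence": "match_exact", "page": 0},
    {"name": "graphrag", "type": "TECHNOLOGY", "confidence": "match_exact", "page": 0},
    {"name": "knowledge graphs", "type": "CONCEPT", "confidence": "match_exact", "page": 0},
    {"name": "LLM-based QA", "type": "CONCEPT", "confidence": "match_fuzzy", "page": 0},
]

for node in dedupe_nodes(raw):
    print(node["id"], node["name"], node["type"])
```

Keying on the type as well as the lowercased name means the same surface string can exist once per entity type, which matches the `(name.lower(), type)` rule shown in the architecture diagram.
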
### 1.3 File Locations

```
F:\GraphRAGAgent\graphrag_pipeline\
├── .env                    # DeepSeek API configuration
├── CLAUDE.md               # component development guidelines
├── bridge.py               # main entry point (runs the full pipeline)
├── text_assembler.py       # MinerU JSON → per-page plain text + offset map
├── entity_extractor.py     # LangExtract + DeepSeek wrapper
├── kg_builder.py           # KG node deduplication + edge generation
└── output/
    ├── kg_nodes.json       # knowledge graph nodes (9,851 bytes)
    └── kg_edges.json       # knowledge graph edges (129,093 bytes)
```

### 1.4 Run Commands

```bash
# Use the default test input
F:/GraphRAGAgent/langextract_src/.venv/Scripts/python.exe F:/GraphRAGAgent/graphrag_pipeline/bridge.py

# Specify an input file
F:/GraphRAGAgent/langextract_src/.venv/Scripts/python.exe F:/GraphRAGAgent/graphrag_pipeline/bridge.py path/to/content_list.json

# Specify an input directory (finds *_content_list.json automatically)
F:/GraphRAGAgent/langextract_src/.venv/Scripts/python.exe F:/GraphRAGAgent/graphrag_pipeline/bridge.py path/to/output_dir/
```

---

## 2. Actual Local Output

### 2.1 Test Run Results

- **Input file**: `F:\GraphRAGAgent\mineru_mvp\output\test_sample\8a719db4-2b50-405b-826d-7bb27b224fa0_content_list.json`
- **Input size**: 10 blocks (9 text + 1 table), 1 page, 2102 characters
- **Extraction result**: 45 raw extractions → 40 deduplicated nodes, 780 CO_OCCURS_IN edges
- **Alignment quality**: all 40 nodes are `match_exact` (1 `match_fuzzy` extraction was filtered out)
- **Run time**: ~22s (DeepSeek API calls)

### 2.2 kg_nodes.json: Actual Output

**File size**: 9,851 bytes | **Node count**: 40

**Node type distribution**:

| Type | Count | Examples |
|------|-------|----------|
| TECHNOLOGY | 4 | GraphRAG, MinerU, LLMs, LangExtract |
| CONCEPT | 36 | knowledge graphs, retrieval-augmented generation, multi-hop reasoning |

**Node format (actual sample)**:

```json
{
  "id": "node_0",
  "name": "GraphRAG",
  "type": "TECHNOLOGY",
  "source_doc": "8a719db4-2b50-405b-826d-7bb27b224fa0",
  "char_start": 0,
  "char_end": 8,
  "confidence": "match_exact",
  "page": 0
}
```

**Node list (first 10 of 40)**:

| id | name | type | confidence |
|----|------|------|-----------|
| node_0 | GraphRAG | TECHNOLOGY | match_exact |
| node_1 | Knowledge Graph Enhanced RAG System | CONCEPT | match_exact |
| node_2 | retrieval-augmented generation | CONCEPT | match_exact |
| node_3 | knowledge graphs | CONCEPT | match_exact |
| node_4 | large language models | CONCEPT | match_exact |
| node_5 | question answering | CONCEPT | match_exact |
| node_6 | document collections | CONCEPT | match_exact |
| node_7 | RAG systems | CONCEPT | match_exact |
| node_8 | vector similarity search | CONCEPT | match_exact |
| node_9 | hierarchical knowledge graph | CONCEPT | match_exact |

### 2.3 kg_edges.json: Actual Output

**File size**: 129,093 bytes | **Edge count**: 780

**Math check**: all 40 nodes are on the same page → C(40,2) = 40×39/2 = 780 edges ✓

**Edge format (actual sample)**:

```json
{
  "source": "node_0",
  "target": "node_1",
  "relation": "CO_OCCURS_IN",
  "doc_id": "8a719db4-2b50-405b-826d-7bb27b224fa0",
  "page": 0
}
```

**Integrity check results**:
- Self-loops: 0 ✓
- Duplicate edges: 0 ✓
- Relation types: all `CO_OCCURS_IN` ✓

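
The edge count and the integrity checks above can be reproduced with a short sketch that pairs same-page nodes. It is an illustration under stated assumptions: `build_cooccurrence_edges` is a hypothetical name and `"doc-1"` is a placeholder document id, while the edge fields follow the sample above.

```python
from collections import defaultdict
from itertools import combinations
from math import comb


def build_cooccurrence_edges(nodes: list[dict], doc_id: str) -> list[dict]:
    """Create one CO_OCCURS_IN edge per unordered pair of nodes on the same page."""
    by_page: dict[int, list[str]] = defaultdict(list)
    for node in nodes:
        by_page[node["page"]].append(node["id"])
    edges = []
    for page, ids in sorted(by_page.items()):
        for source, target in combinations(ids, 2):  # unordered pairs, no self-pairs
            edges.append({"source": source, "target": target,
                          "relation": "CO_OCCURS_IN", "doc_id": doc_id, "page": page})
    return edges


# 40 nodes on a single page, as in the test run.
nodes = [{"id": f"node_{i}", "page": 0} for i in range(40)]
edges = build_cooccurrence_edges(nodes, doc_id="doc-1")

self_loops = sum(1 for e in edges if e["source"] == e["target"])
unique_pairs = {frozenset((e["source"], e["target"])) for e in edges}
duplicates = len(edges) - len(unique_pairs)

print(len(edges), comb(40, 2), self_loops, duplicates)
```

The pair count grows quadratically with the number of entities per page, so a page with twice as many entities yields roughly four times as many edges; this is the scaling concern behind processing long documents in page batches.
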
---

## 3. Key MinerU Pipeline Parameters

### 3.1 Input Format: content_list.json

The `{uuid}_content_list.json` that MinerU outputs after parsing a PDF is a JSON array in which each element represents one content block.

**text block structure**:

```json
{
  "type": "text",
  "text": "GraphRAG: Knowledge Graph Enhanced RAG System...",
  "text_level": null,
  "page_idx": 0,
  "bbox": [72, 43, 523, 57]
}
```

| Field | Type | Description |
|-------|------|-------------|
| `type` | string | Block type: `"text"` \| `"table"` \| `"image"` |
| `text` | string | Text content (may have trailing spaces) |
| `text_level` | int \| null | `null` = body text, `1` = level-1 heading |
| `page_idx` | int | Page index (starting at 0) |
| `bbox` | list[int] | Bounding box `[x0, y0, x1, y1]` (normalized to 0-1000) |

**table block structure**:

```json
{
  "type": "table",
  "table_body": "<table><tr><th>Method</th><th>Score</th></tr>...</table>",
  "table_caption": [],
  "page_idx": 0,
  "bbox": [72, 400, 523, 500]
}
```

| Field | Type | Description |
|-------|------|-------------|
| `table_body` | string | Full HTML content of the `<table>` element |
| `table_caption` | list | Table caption (usually an empty array) |

### 3.2 Key Constraints

- File naming: `{uuid}_content_list.json`; the UUID is used as source_doc_id
- Blocks are ordered in PDF reading order
- The `text` field may have trailing spaces and needs `.rstrip()`
- `image` blocks contain no extractable text; the Bridge skips them

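
These constraints can be sketched as a small assembler that joins text per page, flattens tables to pipe-delimited text, and records each block's character span. This is a sketch, not `text_assembler.py`: it swaps in the standard library's `html.parser` where the pipeline uses BeautifulSoup, and the newline separator, the span tuple layout, and all function names are assumptions.

```python
from html.parser import HTMLParser


class _TableText(HTMLParser):
    """Collect table cells row by row and join each row with ' | '."""

    def __init__(self):
        super().__init__()
        self.rows: list[str] = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(" | ".join(self._row))
            self._row = None


def table_to_text(html: str) -> str:
    parser = _TableText()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.rows)


def assemble_pages(blocks: list[dict]) -> dict[int, dict]:
    """Group blocks by page_idx; return {page: {"text": str, "spans": [(start, end, type)]}}."""
    pages: dict[int, dict] = {}
    for block in blocks:
        if block["type"] == "text":
            piece = block["text"].rstrip()       # drop trailing spaces
        elif block["type"] == "table":
            piece = table_to_text(block["table_body"])
        else:
            continue                             # image blocks carry no text
        if not piece:
            continue
        page = pages.setdefault(block["page_idx"], {"text": "", "spans": []})
        if page["text"]:
            page["text"] += "\n"                 # assumed separator between blocks
        start = len(page["text"])
        page["text"] += piece
        page["spans"].append((start, start + len(piece), block["type"]))
    return pages


blocks = [
    {"type": "text", "text": "GraphRAG overview  ", "page_idx": 0},
    {"type": "image", "page_idx": 0},
    {"type": "table", "page_idx": 0,
     "table_body": "<table><tr><th>Method</th><th>Score</th></tr>"
                   "<tr><td>A</td><td>1</td></tr></table>"},
]
pages = assemble_pages(blocks)
print(pages[0]["text"])
print(pages[0]["spans"])
```

Keeping the spans alongside the text is what lets a character interval reported by the extractor be traced back to the block it came from.
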
|
||||
---
|
||||
|
||||
## 4. LangExtract Pipeline 关键参数规范
|
||||
|
||||
### 4.1 模型配置
|
||||
|
||||
```python
|
||||
from langextract.providers.openai import OpenAILanguageModel
|
||||
|
||||
model = OpenAILanguageModel(
|
||||
model_id="deepseek-chat",
|
||||
api_key=DEEPSEEK_API_KEY,
|
||||
base_url="https://api.deepseek.com",
|
||||
)
|
||||
```
|
||||
|
||||
**重要**: 必须直接实例化 `OpenAILanguageModel`,不能使用 `model_id` 路由。LangExtract 的 `model_id` 同时用于内部路由和 API 请求参数,DeepSeek 不识别 GPT 模型名称。
|
||||
|
||||
### 4.2 抽取调用
|
||||
|
||||
```python
import langextract as lx

result = lx.extract(
    text_or_documents=page_text,   # plain-text string for one page
    prompt_description=PROMPT,     # entity type description
    examples=EXAMPLES,             # few-shot examples
    model=model,                   # pre-built model instance, bypasses routing
    show_progress=True,
)
```
|
||||
|
||||
### 4.3 Prompt 配置
|
||||
|
||||
```
|
||||
Extract named entities from the text in order of appearance.
|
||||
Entity types:
|
||||
TECHNOLOGY — software, algorithms, models, tools
|
||||
ORGANIZATION — companies, research groups, institutions
|
||||
PERSON — individual people
|
||||
LOCATION — places, geographic entities
|
||||
CONCEPT — technical concepts, methodologies, frameworks
|
||||
```
|
||||
|
||||
### 4.4 Few-shot 示例
|
||||
|
||||
验证可用的示例(MVP 测试 94.1% match_exact):
|
||||
|
||||
```python
|
||||
lx.data.ExampleData(
|
||||
text="LangChain is a framework created by Harrison Chase for building "
|
||||
"LLM applications. It integrates with OpenAI models and Pinecone "
|
||||
"vector database for semantic search.",
|
||||
extractions=[
|
||||
lx.data.Extraction(extraction_class="TECHNOLOGY", extraction_text="LangChain"),
|
||||
lx.data.Extraction(extraction_class="PERSON", extraction_text="Harrison Chase"),
|
||||
lx.data.Extraction(extraction_class="CONCEPT", extraction_text="LLM applications"),
|
||||
lx.data.Extraction(extraction_class="TECHNOLOGY", extraction_text="OpenAI models"),
|
||||
lx.data.Extraction(extraction_class="TECHNOLOGY", extraction_text="Pinecone"),
|
||||
lx.data.Extraction(extraction_class="CONCEPT", extraction_text="semantic search"),
|
||||
],
|
||||
)
|
||||
```
|
||||
|
||||
### 4.5 输出格式:AnnotatedDocument
|
||||
|
||||
每页抽取返回一个 `AnnotatedDocument`,其 `extractions` 列表中每个元素包含:
|
||||
|
||||
| 字段 | 类型 | 说明 |
|
||||
|------|------|------|
|
||||
| `extraction_text` | string | 实体名称(必须为输入文本的精确子串) |
|
||||
| `extraction_class` | string | 实体类型(TECHNOLOGY/ORGANIZATION/PERSON/LOCATION/CONCEPT) |
|
||||
| `char_interval.start_pos` | int | 在输入文本中的起始字符位置 |
|
||||
| `char_interval.end_pos` | int | 在输入文本中的结束字符位置 |
|
||||
| `alignment_status` | enum | 对齐质量:`match_exact` \| `match_greater` \| `match_lesser` \| `match_fuzzy` \| `None` |
|
||||
| `extraction_index` | int | 抽取序号(从 1 开始) |
|
||||
| `group_index` | int | 组序号(从 0 开始) |
|
||||
|
||||
### 4.6 对齐质量过滤规则
|
||||
|
||||
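| alignment_status | Meaning | Bridge handling |
|-----------------|------|------------|
| `match_exact` | LLM output matches the source text exactly | ✅ accepted |
| `match_greater` | LLM output is shorter than the matched source span | ✅ accepted |
| `match_lesser` | LLM output is longer than the matched source span (partial exact match) | ✅ accepted |
| `match_fuzzy` | Fuzzy match, offsets unreliable | ❌ filtered |
| `None` | Could not be aligned | ❌ filtered |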
|
||||
|
||||
---
|
||||
|
||||
## 5. MinerU ↔ LangExtract 接口对接规范
|
||||
|
||||
### 5.1 核心挑战
|
||||
|
||||
MinerU 输出结构化 JSON 块(含 HTML 表格),而 LangExtract 仅接受纯文本 `str`。Bridge 的 `text_assembler` 模块负责转换和偏移映射。
|
||||
|
||||
### 5.2 对接转换规则
|
||||
|
||||
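| Interface point | MinerU | LangExtract | Bridge handling |
|--------|------------|-----------------|------------|
| Input format | `content_list.json` (JSON array) | Accepts plain-text `str` only | `text_assembler` concatenates and converts |
| Text block | `block["text"]`, may carry trailing whitespace | `extraction_text` must be an exact substring of the source | `.rstrip()` strips trailing whitespace |
| Table block | `table_body` is `<table>` HTML | Does not accept HTML | BeautifulSoup converts to pipe-delimited plain text |
| Heading detection | `text_level` `null` or absent = body text, set = heading | No heading/body distinction | Headings and body text are concatenated together |
| Coordinates | bbox normalized to 0-1000 | `char_interval` is relative to the input characters | `BlockSpan` records the offset mapping |
| Pagination | `page_idx` separates pages | One call processes one piece of text | `lx.extract()` is called once per page |
| File name | `{uuid}_content_list.json` | n/a | glob `*_content_list.json` |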
|
||||
|
||||
### 5.3 文本拼接算法
|
||||
|
||||
```
Input:  content_list (grouped by page_idx)
Output: list of PageText

for each page:
    cursor = 0
    buffer = []
    for each block (original order preserved):
        if type == "text":
            block_text = block["text"].rstrip()
        elif type == "table":
            block_text = html_table_to_text(block["table_body"])
        else:
            continue  # skip image / equation blocks, no BlockSpan recorded

        record BlockSpan(char_start=cursor, char_end=cursor + len(block_text))
        buffer.append(block_text + "\n")
        cursor += len(block_text) + 1  # +1 for the newline separator

    PageText.text = "".join(buffer).rstrip("\n")
```
|
||||
|
||||
### 5.4 偏移映射数据结构
|
||||
|
||||
```python
|
||||
@dataclasses.dataclass
|
||||
class BlockSpan:
|
||||
block_index: int # content_list 数组下标
|
||||
block_type: str # "text" | "table"
|
||||
page_idx: int # 页码
|
||||
char_start: int # 在拼接文本中的起始位置
|
||||
char_end: int # 在拼接文本中的结束位置(不含)
|
||||
bbox: list[int] # MinerU 原始 bbox
|
||||
|
||||
@dataclasses.dataclass
|
||||
class PageText:
|
||||
page_idx: int # 页码
|
||||
text: str # 拼接后的纯文本
|
||||
block_spans: list[BlockSpan] # 每个 block 在 text 中的位置
|
||||
```
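
To make the offset mapping concrete, here is a minimal, self-contained sketch of the assembly step together with a reverse lookup from a character position to its block. The two dataclasses are repeated from above so the snippet runs on its own. `assemble_pages`, `locate_block`, and the `table_to_text` parameter are illustrative names (assumptions), not necessarily those used in the Bridge's `text_assembler`.

```python
from __future__ import annotations

import dataclasses
from typing import Callable


@dataclasses.dataclass
class BlockSpan:
    block_index: int
    block_type: str
    page_idx: int
    char_start: int
    char_end: int  # exclusive
    bbox: list[int]


@dataclasses.dataclass
class PageText:
    page_idx: int
    text: str
    block_spans: list[BlockSpan]


def assemble_pages(
    content_list: list[dict],
    table_to_text: Callable[[str], str],
) -> list[PageText]:
    """Concatenate text/table blocks per page and record each block's char range."""
    pages: dict[int, PageText] = {}
    buffers: dict[int, list[str]] = {}
    cursors: dict[int, int] = {}
    for index, block in enumerate(content_list):
        if block.get("type") == "text":
            block_text = block.get("text", "").rstrip()
        elif block.get("type") == "table":
            block_text = table_to_text(block.get("table_body", ""))
        else:
            continue  # image and other block types carry no extractable text
        page_idx = block["page_idx"]
        page = pages.setdefault(page_idx, PageText(page_idx, "", []))
        cursor = cursors.get(page_idx, 0)
        page.block_spans.append(
            BlockSpan(index, block["type"], page_idx, cursor,
                      cursor + len(block_text), block.get("bbox", []))
        )
        buffers.setdefault(page_idx, []).append(block_text + "\n")
        cursors[page_idx] = cursor + len(block_text) + 1  # +1 for the newline
    for page_idx, page in pages.items():
        page.text = "".join(buffers[page_idx]).rstrip("\n")
    return [pages[k] for k in sorted(pages)]


def locate_block(page: PageText, start_pos: int) -> BlockSpan | None:
    """Return the block whose [char_start, char_end) range contains start_pos."""
    for span in page.block_spans:
        if span.char_start <= start_pos < span.char_end:
            return span
    return None
```

Given an extraction's `char_interval.start_pos`, `locate_block` returns the originating block and therefore its MinerU `bbox`. A position that falls on a separator newline belongs to no block and yields `None`.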
|
||||
|
||||
### 5.5 HTML 表格转换
|
||||
|
||||
```python
|
||||
def html_table_to_text(table_body: str) -> str:
|
||||
"""Convert <table> HTML → pipe-delimited plain text"""
|
||||
soup = BeautifulSoup(table_body, "html.parser")
|
||||
rows = []
|
||||
for tr in soup.find_all("tr"):
|
||||
cells = [td.get_text(strip=True) for td in tr.find_all(["td", "th"])]
|
||||
rows.append(" | ".join(cells))
|
||||
return "\n".join(rows)
|
||||
```
|
||||
|
||||
转换示例:
|
||||
|
||||
```html
|
||||
<table><tr><th>Method</th><th>Score</th></tr><tr><td>GraphRAG</td><td>0.85</td></tr></table>
|
||||
```
|
||||
|
||||
→
|
||||
|
||||
```
|
||||
Method | Score
|
||||
GraphRAG | 0.85
|
||||
```
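
Where BeautifulSoup is not available, roughly the same conversion can be approximated with the standard library's `html.parser`. This is an alternative sketch, not the Bridge's implementation: the name `html_table_to_text_stdlib` is hypothetical, and whitespace handling inside nested tags may differ slightly from `get_text(strip=True)`.

```python
from __future__ import annotations

from html.parser import HTMLParser


class _TableTextParser(HTMLParser):
    """Collect cell text row by row from <tr>/<td>/<th> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: list[list[str]] = []
        self._cell: list[str] | None = None  # None while outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            if not self.rows:  # tolerate a cell that appears without <tr>
                self.rows.append([])
            self.rows[-1].append("".join(self._cell).strip())
            self._cell = None


def html_table_to_text_stdlib(table_body: str) -> str:
    """Convert <table> HTML to pipe-delimited plain text, one line per row."""
    parser = _TableTextParser()
    parser.feed(table_body)
    parser.close()
    return "\n".join(" | ".join(row) for row in parser.rows)
```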
|
||||
|
||||
---
|
||||
|
||||
## 6. Bridge Pipeline 最终输出关键参数规范
|
||||
|
||||
### 6.1 kg_nodes.json
|
||||
|
||||
**文件路径**: `graphrag_pipeline/output/kg_nodes.json`
|
||||
|
||||
**结构**: JSON 数组,每个元素为一个去重后的实体节点。
|
||||
|
||||
| 字段 | 类型 | 说明 | 示例 |
|
||||
|------|------|------|------|
|
||||
| `id` | string | 节点唯一标识,格式 `node_{index}` | `"node_0"` |
|
||||
| `name` | string | 实体名称(原文子串) | `"GraphRAG"` |
|
||||
| `type` | string | 实体类型 | `"TECHNOLOGY"` |
|
||||
| `source_doc` | string | 来源文档 UUID | `"8a719db4-2b50-405b-826d-7bb27b224fa0"` |
|
||||
| `char_start` | int | 在拼接文本中的起始字符位置 | `0` |
|
||||
| `char_end` | int | 在拼接文本中的结束字符位置 | `8` |
|
||||
| `confidence` | string | 对齐质量(仅 `match_exact`/`match_greater`/`match_lesser`) | `"match_exact"` |
|
||||
| `page` | int | 来源页码(从 0 开始) | `0` |
|
||||
|
||||
**去重规则**: key = `(name.lower(), type)`,保留首次出现的实体。
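
A minimal sketch of the filter-then-dedupe step, assuming entities arrive as dicts that already carry the fields in the table above. The function name `build_nodes` is hypothetical, and assigning `node_{index}` ids after deduplication is an assumption.

```python
from __future__ import annotations

ACCEPTED_STATUSES = {"match_exact", "match_greater", "match_lesser"}


def build_nodes(entities: list[dict]) -> list[dict]:
    """Filter by alignment status, then dedupe on (name.lower(), type), keeping the first hit."""
    seen: set[tuple[str, str]] = set()
    nodes: list[dict] = []
    for entity in entities:
        if entity.get("confidence") not in ACCEPTED_STATUSES:
            continue  # drops match_fuzzy and unaligned (None) extractions
        key = (entity["name"].lower(), entity["type"])
        if key in seen:
            continue  # a later duplicate never replaces the first occurrence
        seen.add(key)
        nodes.append({"id": f"node_{len(nodes)}", **entity})
    return nodes
```

Because the type is part of the key, the same surface string can appear as two nodes when it is extracted under two different types.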
|
||||
|
||||
**实体类型枚举**:
|
||||
|
||||
| 类型 | 说明 |
|
||||
|------|------|
|
||||
| `TECHNOLOGY` | 软件、算法、模型、工具 |
|
||||
| `ORGANIZATION` | 公司、研究机构 |
|
||||
| `PERSON` | 个人 |
|
||||
| `LOCATION` | 地理位置 |
|
||||
| `CONCEPT` | 技术概念、方法论、框架 |
|
||||
|
||||
### 6.2 kg_edges.json
|
||||
|
||||
**文件路径**: `graphrag_pipeline/output/kg_edges.json`
|
||||
|
||||
**结构**: JSON 数组,每个元素为一条同页共现关系边。
|
||||
|
||||
| 字段 | 类型 | 说明 | 示例 |
|
||||
|------|------|------|------|
|
||||
| `source` | string | 源节点 ID | `"node_0"` |
|
||||
| `target` | string | 目标节点 ID | `"node_1"` |
|
||||
| `relation` | string | 关系类型(固定 `"CO_OCCURS_IN"`) | `"CO_OCCURS_IN"` |
|
||||
| `doc_id` | string | 来源文档 UUID | `"8a719db4-..."` |
|
||||
| `page` | int | 共现页码 | `0` |
|
||||
|
||||
**边生成规则**:
|
||||
1. 按页分组所有去重后的节点 ID
|
||||
2. 同页节点两两配对 → 生成 `CO_OCCURS_IN` 边
|
||||
3. 边方向规范化: `source < target`(字典序)
|
||||
4. 去重 key: `(source, target, doc_id, page)`
|
||||
5. 无自环(source ≠ target)
|
||||
|
||||
**边数公式**: 若某页有 N 个节点,则该页产生 C(N,2) = N×(N-1)/2 条边。
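
The edge rules and the count formula can be checked with a short sketch. `build_edges` is an illustrative name (an assumption), and the input is assumed to be the deduplicated node list with `id` and `page` fields. For N = 40 nodes on one page it yields 40 × 39 / 2 = 780 edges, which would match the 780 edges reported for 40 nodes if all of them fall on a single page.

```python
from __future__ import annotations

from collections import defaultdict
from itertools import combinations


def build_edges(nodes: list[dict], doc_id: str) -> list[dict]:
    """Generate CO_OCCURS_IN edges for every pair of nodes that share a page."""
    by_page: dict[int, list[str]] = defaultdict(list)
    for node in nodes:
        by_page[node["page"]].append(node["id"])

    seen: set[tuple[str, str, str, int]] = set()
    edges: list[dict] = []
    for page, ids in sorted(by_page.items()):
        # combinations over a sorted, unique list gives a < b (lexicographic),
        # so the direction is normalized and self-loops cannot occur
        for a, b in combinations(sorted(set(ids)), 2):
            key = (a, b, doc_id, page)
            if key in seen:  # redundant within one call, mirrors rule 4
                continue
            seen.add(key)
            edges.append({"source": a, "target": b, "relation": "CO_OCCURS_IN",
                          "doc_id": doc_id, "page": page})
    return edges
```

Note that the ordering is lexicographic, so `"node_10"` sorts before `"node_2"`. That is harmless here because only consistency of direction matters.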
|
||||
|
||||
### 6.3 输出完整性约束
|
||||
|
||||
| 约束 | 说明 |
|
||||
|------|------|
|
||||
| 节点 ID 唯一 | 每个节点的 `id` 字段全局唯一 |
|
||||
| 边引用合法 | 每条边的 `source` 和 `target` 必须对应存在的节点 `id` |
|
||||
| 无自环 | 不存在 `source == target` 的边 |
|
||||
| 无重复边 | 同一 `(source, target, doc_id, page)` 组合仅出现一次 |
|
||||
| 对齐质量保证 | 所有节点的 `confidence` 仅为 accepted 值(非 fuzzy/null) |
|
||||
| char 偏移有效 | `char_start < char_end`,且可定位到拼接文本中的实体子串 |
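
Most of these constraints are mechanical and could be asserted after every run. The following is a hedged sketch with a hypothetical `validate_kg` helper. It covers the structural checks only and does not verify that the offsets actually locate the entity substring, since that requires the concatenated page text.

```python
from __future__ import annotations

ACCEPTED = {"match_exact", "match_greater", "match_lesser"}


def validate_kg(nodes: list[dict], edges: list[dict]) -> list[str]:
    """Return human-readable violations; an empty list means the output passes."""
    errors: list[str] = []
    ids = [n["id"] for n in nodes]
    if len(ids) != len(set(ids)):
        errors.append("duplicate node id")
    id_set = set(ids)
    for n in nodes:
        if n.get("confidence") not in ACCEPTED:
            errors.append(f"{n['id']}: confidence {n.get('confidence')!r} not accepted")
        if not n["char_start"] < n["char_end"]:
            errors.append(f"{n['id']}: empty or inverted char range")
    seen = set()
    for e in edges:
        if e["source"] not in id_set or e["target"] not in id_set:
            errors.append(f"edge {e['source']}->{e['target']}: unknown node")
        if e["source"] == e["target"]:
            errors.append(f"edge {e['source']}: self-loop")
        key = (e["source"], e["target"], e["doc_id"], e["page"])
        if key in seen:
            errors.append(f"edge {key}: duplicate")
        seen.add(key)
    return errors
```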
|
||||
|
||||
---
|
||||
|
||||
## 7. 虚拟环境规范
|
||||
|
||||
Bridge Pipeline **复用 LangExtract 的虚拟环境**,不单独创建 venv。
|
||||
|
||||
| 项目 | 值 |
|
||||
|------|------|
|
||||
| 虚拟环境路径 | `F:\GraphRAGAgent\langextract_src\.venv\` |
|
||||
| Python 版本 | 3.12 |
|
||||
| 核心依赖 | `langextract[all]`、`beautifulsoup4`、`python-dotenv` |
|
||||
| 安装新依赖 | `uv pip install <pkg> --python F:/GraphRAGAgent/langextract_src/.venv/Scripts/python.exe` |
|
||||
|
||||
**所有 Python 命令必须使用该虚拟环境运行,禁止使用全局 Python 或其他组件的 venv。**
|
||||
|
||||
---
|
||||
|
||||
## 8. 环境配置
|
||||
|
||||
### 8.1 .env 文件
|
||||
|
||||
位置: `F:\GraphRAGAgent\graphrag_pipeline\.env`
|
||||
|
||||
```env
|
||||
DEEPSEEK_API_KEY=<your-api-key>
|
||||
DEEPSEEK_BASE_URL=https://api.deepseek.com
|
||||
```
|
||||
|
||||
### 8.2 依赖安装
|
||||
|
||||
```bash
|
||||
uv pip install beautifulsoup4 python-dotenv --python F:/GraphRAGAgent/langextract_src/.venv/Scripts/python.exe
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 9. 测试验证清单
|
||||
|
||||
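- [x] text_assembler reads content_list.json correctly (10 blocks: 9 text + 1 table)
- [x] Table HTML is converted to pipe-delimited plain text with no HTML tags left over
- [x] Per-page concatenated text length is reasonable (2102 characters per page)
- [x] LangExtract calls DeepSeek successfully and returns an AnnotatedDocument
- [x] 45 entities extracted, match_exact share > 95%
- [x] kg_nodes.json nodes are deduplicated (40 nodes), each with all fields present
- [x] kg_edges.json edges are CO_OCCURS_IN relations (780 edges), no self-loops, no duplicates
- [x] Entities aligned as match_fuzzy are filtered out (1 entity)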
|
||||
1232
docs/frontend_design_specification-v1.0.md
Normal file
File diff suppressed because it is too large
604
docs/langextract_specification-v1.0.md
Normal file
@@ -0,0 +1,604 @@
|
||||
# LangExtract Pipeline 规范文档 v1.0
|
||||
|
||||
> 基于 [google/langextract](https://github.com/google/langextract) 源码分析 + MVP 实测验证
|
||||
> 版本基线:2026-03-04 main 分支
|
||||
> 本地源码路径:`F:\GraphRAGAgent\langextract_src\`
|
||||
> 测试脚本路径:`F:\GraphRAGAgent\langextract_src\mvp_test_deepseek.py`
|
||||
|
||||
---
|
||||
|
||||
## 目录
|
||||
|
||||
- [〇、虚拟环境](#〇虚拟环境)
|
||||
- [一、Pipeline 执行流程](#一pipeline-执行流程)
|
||||
- [1.1 完整执行链路](#11-完整执行链路)
|
||||
- [1.2 MVP 测试脚本](#12-mvp-测试脚本)
|
||||
- [1.3 输入规范](#13-输入规范)
|
||||
- [1.4 不支持的输入格式](#14-不支持的输入格式)
|
||||
- [二、模型接入规范](#二模型接入规范)
|
||||
- [2.1 模型路由机制](#21-模型路由机制)
|
||||
- [2.2 DeepSeek 接入(实测验证)](#22-deepseek-接入实测验证)
|
||||
- [2.3 路由陷阱与规避方案](#23-路由陷阱与规避方案)
|
||||
- [2.4 OpenAI Provider 构造参数](#24-openai-provider-构造参数)
|
||||
- [三、关键参数规范](#三关键参数规范)
|
||||
- [3.1 extract() 核心参数](#31-extract-核心参数)
|
||||
- [3.2 ExampleData 示例数据格式](#32-exampledata-示例数据格式)
|
||||
- [3.3 Extraction 示例条目格式](#33-extraction-示例条目格式)
|
||||
- [3.4 分块参数](#34-分块参数)
|
||||
- [3.5 Resolver 对齐参数](#35-resolver-对齐参数)
|
||||
- [四、输出数据格式规范](#四输出数据格式规范)
|
||||
- [4.1 JSONL 输出文件(实际生成)](#41-jsonl-输出文件实际生成)
|
||||
- [4.2 AnnotatedDocument 顶层结构](#42-annotateddocument-顶层结构)
|
||||
- [4.3 Extraction 字段规范(实测对比)](#43-extraction-字段规范实测对比)
|
||||
- [4.4 CharInterval 字符锚点](#44-charinterval-字符锚点)
|
||||
- [4.5 AlignmentStatus 对齐状态枚举](#45-alignmentstatus-对齐状态枚举)
|
||||
- [4.6 extraction_summary.json(自定义摘要)](#46-extraction_summaryjson自定义摘要)
|
||||
- [五、本地生成文件清单](#五本地生成文件清单)
|
||||
- [附录:环境变量与常量速查](#附录环境变量与常量速查)
|
||||
|
||||
---
|
||||
|
||||
## 〇、虚拟环境
|
||||
|
||||
本组件使用独立的 Python 虚拟环境,与项目其他组件(MinerU MVP、GraphRAG Pipeline 等)完全隔离。
|
||||
|
||||
**所有 Python 命令必须在子虚拟环境中运行,禁止使用全局 Python 或其他组件的 venv。**
|
||||
|
||||
### 环境信息
|
||||
|
||||
- 虚拟环境路径:`F:\GraphRAGAgent\langextract_src\.venv\`
|
||||
- Python 版本:3.12
|
||||
- 创建工具:uv
|
||||
- 安装方式:`uv pip install -e ".[all]"` (含 openai、google-genai 等 60 个包)
|
||||
|
||||
### 运行方式
|
||||
|
||||
**方式一:直接使用 venv 内的 Python 解释器(推荐)**
|
||||
|
||||
```bash
|
||||
F:/GraphRAGAgent/langextract_src/.venv/Scripts/python.exe mvp_test_deepseek.py
|
||||
```
|
||||
|
||||
**方式二:先激活环境再运行**
|
||||
|
||||
```bash
|
||||
cd F:/GraphRAGAgent/langextract_src
|
||||
source .venv/Scripts/activate
|
||||
python mvp_test_deepseek.py
|
||||
```
|
||||
|
||||
### 安装新依赖
|
||||
|
||||
```bash
|
||||
uv pip install <package> --python F:/GraphRAGAgent/langextract_src/.venv/Scripts/python.exe
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 一、Pipeline 执行流程
|
||||
|
||||
### 1.1 完整执行链路
|
||||
|
||||
基于 MVP 实测验证的完整 Pipeline 分为 5 个阶段:
|
||||
|
||||
```
|
||||
Step 0: 激活虚拟环境
|
||||
└── F:/GraphRAGAgent/langextract_src/.venv/Scripts/python.exe
|
||||
|
||||
Step 1: 准备输入
|
||||
├── 构造纯文本字符串(str)
|
||||
├── 或构造 Document 对象列表
|
||||
└── LangExtract 仅接受纯文本,PDF/DOCX 等需前置解析
|
||||
|
||||
Step 2: 构造 Few-shot 示例
|
||||
├── 创建 ExampleData 对象列表
|
||||
├── 每个 ExampleData 包含:text(示例文本) + extractions(标注实体列表)
|
||||
└── extraction_text 必须是 text 的精确子串
|
||||
|
||||
Step 3: 配置模型并调用 extract()
|
||||
├── 直接实例化 OpenAILanguageModel(DeepSeek 场景)
|
||||
├── 传入 model_id="deepseek-chat", base_url, api_key
|
||||
└── 调用 lx.extract(text_or_documents=..., examples=..., model=model)
|
||||
|
||||
Step 4: LangExtract 内部处理
|
||||
├── 文本分块(基于句子边界,max_char_buffer=1000)
|
||||
├── 构造 Prompt(含 prompt_description + examples)
|
||||
├── 调用 LLM 推理(JSON 格式输出)
|
||||
├── 解析 LLM JSON 响应为 Extraction 对象
|
||||
└── 字符级对齐(char_interval + alignment_status)
|
||||
|
||||
Step 5: 保存输出
|
||||
├── lx.io.save_annotated_documents() → JSONL 文件
|
||||
└── 自定义 JSON 摘要(可选)
|
||||
```
|
||||
|
||||
### 1.2 MVP 测试脚本
|
||||
|
||||
**文件路径:** `F:\GraphRAGAgent\langextract_src\mvp_test_deepseek.py`
|
||||
|
||||
**执行命令:**
|
||||
|
||||
```bash
|
||||
F:/GraphRAGAgent/langextract_src/.venv/Scripts/python.exe mvp_test_deepseek.py
|
||||
```
|
||||
|
||||
**脚本核心流程:**
|
||||
|
||||
```python
|
||||
from langextract.providers.openai import OpenAILanguageModel
|
||||
|
||||
# Step 1: 直接实例化 OpenAI Provider(指向 DeepSeek)
|
||||
model = OpenAILanguageModel(
|
||||
model_id="deepseek-chat",
|
||||
api_key="sk-...",
|
||||
base_url="https://api.deepseek.com",
|
||||
)
|
||||
|
||||
# Step 2: 构造示例数据
|
||||
examples = [
|
||||
lx.data.ExampleData(
|
||||
text="LangChain is a framework created by Harrison Chase...",
|
||||
extractions=[
|
||||
lx.data.Extraction(extraction_class="TECHNOLOGY", extraction_text="LangChain"),
|
||||
lx.data.Extraction(extraction_class="ORGANIZATION", extraction_text="Harrison Chase"),
|
||||
...
|
||||
],
|
||||
)
|
||||
]
|
||||
|
||||
# Step 3: 调用抽取
|
||||
result = lx.extract(
|
||||
text_or_documents=input_text,
|
||||
prompt_description="Extract named entities...",
|
||||
examples=examples,
|
||||
model=model,
|
||||
show_progress=True,
|
||||
)
|
||||
|
||||
# Step 4: 保存结果
|
||||
lx.io.save_annotated_documents([result], output_name="graphrag_entities.jsonl", output_dir="mvp_output")
|
||||
```
|
||||
|
||||
**实测结果:**
|
||||
|
||||
| 指标 | 值 |
|
||||
|------|-----|
|
||||
| 输入文本长度 | 520 字符 |
|
||||
| 模型 | deepseek-chat |
|
||||
| 耗时 | 21.6 秒 |
|
||||
| 提取实体数 | 17 |
|
||||
| 实体类型分布 | TECHNOLOGY: 9, CONCEPT: 7, ORGANIZATION: 1 |
|
||||
| 精确匹配率 | 16/17 (94.1%) — 仅 1 个 match_fuzzy |
|
||||
| 输出文件 | 2 个(JSONL + JSON 摘要) |
|
||||
|
||||
### 1.3 输入规范
|
||||
|
||||
LangExtract **仅接受纯文本**作为输入,支持以下 4 种传入方式:
|
||||
|
||||
| 输入方式 | 示例 | 说明 |
|
||||
|---------|------|------|
|
||||
| **纯文本字符串** | `extract("这是一段文本...")` | 直接传入文本内容(MVP 实测使用此方式) |
|
||||
| **URL** | `extract("https://example.com/article.txt")` | 自动下载 URL 文本内容(`fetch_urls=True`) |
|
||||
| **Document 对象** | `extract([Document(text="...", document_id="doc1")])` | 传入 Document 可迭代集合 |
|
||||
| **CSV 文件** | 通过 `Dataset` 类加载后传入 | 指定 text 列和 id 列 |
|
||||
|
||||
### 1.4 不支持的输入格式
|
||||
|
||||
以下格式 **不被支持**,需要在 LangExtract 之前通过外部工具预处理为纯文本:
|
||||
|
||||
| 格式 | 状态 | 预处理方案 |
|
||||
|------|------|-----------|
|
||||
| PDF | ❌ 不支持 | 使用 MinerU / PyMuPDF 先转文本 |
|
||||
| DOCX | ❌ 不支持 | 使用 python-docx 先转文本 |
|
||||
| HTML | ❌ 不支持 | 使用 BeautifulSoup 先提取文本 |
|
||||
| 图片 | ❌ 不支持 | 使用 OCR 工具先识别文本 |
|
||||
| Markdown(含媒体) | ❌ 不支持 | 需提取纯文本部分 |
|
||||
| Excel / JSON | ❌ 不支持 | 需序列化为纯文本 |
|
||||
|
||||
---
|
||||
|
||||
## 二、模型接入规范
|
||||
|
||||
### 2.1 模型路由机制
|
||||
|
||||
文件路径:`langextract/providers/patterns.py`
|
||||
|
||||
LangExtract 通过 **正则匹配 `model_id`** 自动路由到对应的 Provider:
|
||||
|
||||
| Provider | 匹配模式 | 优先级 | 示例模型 |
|
||||
|----------|---------|--------|---------|
|
||||
| **Gemini** | `^gemini` | 10 | `gemini-2.5-flash`, `gemini-1.5-pro` |
|
||||
| **OpenAI** | `^gpt-4`, `^gpt4.`, `^gpt-5`, `^gpt5.` | 10 | `gpt-4o`, `gpt-4o-mini` |
|
||||
| **Ollama** | `gemma`, `llama`, `mistral`, `phi`, `qwen`, `deepseek` 等 | 10 | `gemma2:2b`, `llama3.2:1b` |
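
The routing idea can be illustrated with a simplified sketch. The pattern lists below are transcribed from the table above and are not the library's actual source, and `ROUTES` and `route` are hypothetical names. If the patterns are as listed, a name such as `deepseek-chat` would be claimed by the Ollama provider, not the OpenAI one.

```python
from __future__ import annotations

import re

# Simplified from the routing table; NOT the library's actual pattern list.
ROUTES = [
    ("gemini", [r"^gemini"]),
    ("openai", [r"^gpt-4", r"^gpt4\.", r"^gpt-5", r"^gpt5\."]),
    ("ollama", [r"gemma", r"llama", r"mistral", r"phi", r"qwen", r"deepseek"]),
]


def route(model_id: str) -> str | None:
    """Return the first provider whose pattern matches model_id, else None."""
    for provider, patterns in ROUTES:
        if any(re.search(p, model_id) for p in patterns):
            return provider
    return None
```

Under these assumed patterns, `route("gpt-4o-mini")` selects the OpenAI provider while `route("deepseek-chat")` does not, so pattern routing alone cannot send a DeepSeek model name through the OpenAI-compatible client.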
|
||||
|
||||
### 2.2 DeepSeek 接入(实测验证)
|
||||
|
||||
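> **Key finding:** The approach described in spec v0, `model_id="gpt-4o-mini"` plus `language_model_params={"base_url": ...}`, **does not work in practice**: `model_id` is used both for routing and as the model name in the API call, and DeepSeek does not recognize the `gpt-4o-mini` model name.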
|
||||
|
||||
**正确方式 — 直接实例化 OpenAI Provider:**
|
||||
|
||||
```python
import langextract as lx
from langextract.providers.openai import OpenAILanguageModel

model = OpenAILanguageModel(
    model_id="deepseek-chat",             # DeepSeek's actual model name
    api_key="sk-your-deepseek-key",
    base_url="https://api.deepseek.com",
)

result = lx.extract(
    text_or_documents="...",
    examples=[...],
    model=model,                          # passed via `model`, bypassing routing
    show_progress=True,
)
```
|
||||
|
||||
**实测验证状态:** DeepSeek `deepseek-chat` 模型通过此方式成功完成实体抽取,JSON 格式输出正常。
|
||||
|
||||
### 2.3 路由陷阱与规避方案
|
||||
|
||||
| Approach | Works? | Reason |
|------|---------|------|
| `model_id="gpt-4o-mini"` + `language_model_params={"base_url": "https://api.deepseek.com"}` | **No** | `model_id` is also sent as the API call's `model` parameter, and DeepSeek returns `400 Model Not Exist` |
| `config=ModelConfig(model_id="deepseek-chat", provider="openai")` | **No** | `_create_model_with_schema()` uses `provider` without first calling `load_builtins_once()`, which raises `No provider found` (a LangExtract internal bug observed in this test) |
| `model=OpenAILanguageModel(model_id="deepseek-chat", ...)` | **Yes** | Direct instantiation bypasses routing, and `model_id` is passed to the DeepSeek API unchanged |
|
||||
|
||||
### 2.4 OpenAI Provider 构造参数
|
||||
|
||||
文件路径:`langextract/providers/openai.py`
|
||||
|
||||
```python
|
||||
class OpenAILanguageModel(BaseLanguageModel):
|
||||
def __init__(
|
||||
self,
|
||||
model_id: str = 'gpt-4o-mini',
|
||||
api_key: str | None = None,
|
||||
base_url: str | None = None,
|
||||
organization: str | None = None,
|
||||
format_type: FormatType = FormatType.JSON,
|
||||
temperature: float | None = None,
|
||||
max_workers: int = 10,
|
||||
**kwargs,
|
||||
)
|
||||
```
|
||||
|
||||
| 参数 | 默认值 | 说明 |
|
||||
|------|--------|------|
|
||||
| `model_id` | `gpt-4o-mini` | 模型标识(同时作为 API 调用的 model 参数) |
|
||||
| `api_key` | `None` | 环境变量:`OPENAI_API_KEY` 或 `LANGEXTRACT_API_KEY` |
|
||||
| `base_url` | `None` | 自定义 API 端点(DeepSeek 使用 `https://api.deepseek.com`) |
|
||||
| `temperature` | `None` | 采样温度 |
|
||||
| `format_type` | `JSON` | 输出格式(JSON Mode) |
|
||||
|
||||
---
|
||||
|
||||
## 三、关键参数规范
|
||||
|
||||
### 3.1 extract() 核心参数
|
||||
|
||||
文件路径:`langextract/extraction.py`
|
||||
|
||||
```python
|
||||
def extract(
|
||||
text_or_documents: typing.Any, # 必填:纯文本或 Document 列表
|
||||
prompt_description: str | None = None, # 抽取提示词
|
||||
examples: typing.Sequence[Any] | None = None, # 必填:Few-shot 示例
|
||||
model_id: str = "gemini-2.5-flash", # 模型标识(用于路由)
|
||||
api_key: str | None = None, # API Key
|
||||
model: typing.Any = None, # 预配置的模型实例(最高优先级)
|
||||
max_char_buffer: int = 1000, # 分块最大字符数
|
||||
temperature: float | None = None, # 采样温度
|
||||
batch_length: int = 10, # 每批分块数
|
||||
max_workers: int = 10, # 最大并行线程
|
||||
additional_context: str | None = None, # 附加上下文
|
||||
resolver_params: dict | None = None, # 对齐参数
|
||||
language_model_params: dict | None = None, # Provider 构造参数
|
||||
extraction_passes: int = 1, # 抽取轮次
|
||||
context_window_chars: int | None = None, # 上下文窗口
|
||||
config: typing.Any = None, # ModelConfig 实例
|
||||
model_url: str | None = None, # 自托管端点
|
||||
show_progress: bool = True, # 显示进度条
|
||||
...
|
||||
) -> list[AnnotatedDocument] | AnnotatedDocument
|
||||
```
|
||||
|
||||
**MVP 实测使用的参数组合:**
|
||||
|
||||
| 参数 | 实测值 | 说明 |
|
||||
|------|--------|------|
|
||||
| `text_or_documents` | 520 字符纯文本 | GraphRAG 领域相关文本 |
|
||||
| `prompt_description` | `"Extract named entities..."` | 指定 TECHNOLOGY/ORGANIZATION/CONCEPT 三类 |
|
||||
| `examples` | 1 个 ExampleData(含 6 个 Extraction) | Few-shot 示例 |
|
||||
| `model` | `OpenAILanguageModel` 实例 | 直接实例化,指向 DeepSeek |
|
||||
| `show_progress` | `True` | 显示进度 |
|
||||
| `max_char_buffer` | 1000(默认) | 文本未超过阈值,未触发分块 |
|
||||
|
||||
### 3.2 ExampleData 示例数据格式
|
||||
|
||||
文件路径:`langextract/core/data.py`
|
||||
|
||||
```python
|
||||
@dataclasses.dataclass
|
||||
class ExampleData:
|
||||
text: str # 示例文本(必填)
|
||||
extractions: list[Extraction] # 标注的实体列表(必填)
|
||||
```
|
||||
|
||||
**MVP 实测示例:**
|
||||
|
||||
```python
|
||||
lx.data.ExampleData(
|
||||
text="LangChain is a framework created by Harrison Chase for building "
|
||||
"LLM applications. It integrates with OpenAI models and Pinecone "
|
||||
"vector database for semantic search.",
|
||||
extractions=[
|
||||
lx.data.Extraction(extraction_class="TECHNOLOGY", extraction_text="LangChain"),
|
||||
lx.data.Extraction(extraction_class="ORGANIZATION", extraction_text="Harrison Chase"),
|
||||
lx.data.Extraction(extraction_class="CONCEPT", extraction_text="LLM applications"),
|
||||
lx.data.Extraction(extraction_class="TECHNOLOGY", extraction_text="OpenAI models"),
|
||||
lx.data.Extraction(extraction_class="TECHNOLOGY", extraction_text="Pinecone"),
|
||||
lx.data.Extraction(extraction_class="CONCEPT", extraction_text="semantic search"),
|
||||
],
|
||||
)
|
||||
```
|
||||
|
||||
**约束条件:**
|
||||
- `extraction_text` **必须是** `text` 的精确子串(否则对齐失败)
|
||||
- `extraction_class` 为自定义字符串,无预定义枚举
|
||||
- `examples` 列表不能为空(否则抛出 `ValueError`)
|
||||
- 每个 ExampleData 可包含多个不同 `extraction_class` 的条目
|
||||
|
||||
### 3.3 Extraction 示例条目格式
|
||||
|
||||
```python
|
||||
@dataclasses.dataclass(init=False)
|
||||
class Extraction:
|
||||
extraction_class: str # 必填:实体类型
|
||||
extraction_text: str # 必填:实体文本(须为原文子串)
|
||||
attributes: dict[str, str | list[str]] | None = None # 可选:附加属性
|
||||
description: str | None = None # 可选:实体描述
|
||||
```
|
||||
|
||||
在 examples 中创建时只需要 `extraction_class` 和 `extraction_text`,其余字段由 LangExtract 在推理后自动填充。
|
||||
|
||||
### 3.4 分块参数
|
||||
|
||||
文件路径:`langextract/chunking.py`
|
||||
|
||||
LangExtract 使用基于 **句子边界** 的确定性分块策略:
|
||||
|
||||
| 参数 | 默认值 | 说明 |
|
||||
|------|--------|------|
|
||||
| `max_char_buffer` | 1000 | 每个分块最大字符数 |
|
||||
| `context_window_chars` | `None` | 前一分块的上下文窗口(用于指代消解) |
|
||||
| `batch_length` | 10 | 每批处理的分块数 |
|
||||
|
||||
**分块策略:**
|
||||
1. 如果单个句子超过 `max_char_buffer`,按换行符拆分
|
||||
2. 如果单个 token 超过 `max_char_buffer`,该 token 独占一个分块
|
||||
3. 如果多个句子可以放入 `max_char_buffer`,合并为一个分块
|
||||
|
||||
> **MVP 实测:** 输入文本 520 字符 < `max_char_buffer`(1000),整段文本作为单一分块处理,未触发分块逻辑。
|
||||
|
||||
### 3.5 Resolver 对齐参数
|
||||
|
||||
通过 `extract()` 的 `resolver_params` 字典传入:
|
||||
|
||||
| 参数 | 类型 | 默认值 | 说明 |
|
||||
|------|------|--------|------|
|
||||
| `enable_fuzzy_alignment` | `bool` | `True` | 精确匹配失败后是否尝试模糊匹配 |
|
||||
| `fuzzy_alignment_threshold` | `float` | `0.75` | 模糊匹配最低 token 重叠比率 |
|
||||
| `accept_match_lesser` | `bool` | `True` | 是否接受部分精确匹配 |
|
||||
| `suppress_parse_errors` | `bool` | `False` | JSON 解析失败时是否继续 |
|
||||
|
||||
> **MVP 实测:** 未传入 `resolver_params`,使用全部默认值。17 个抽取中 16 个 `match_exact`,1 个 `match_fuzzy`("Microsoft Research")。
|
||||
|
||||
---
|
||||
|
||||
## 四、输出数据格式规范
|
||||
|
||||
### 4.1 JSONL 输出文件(实际生成)
|
||||
|
||||
**文件路径:** `mvp_output/graphrag_entities.jsonl`
|
||||
**文件大小:** 4,650 bytes
|
||||
**格式:** JSONL(JSON Lines),每行一个完整的 JSON 对象
|
||||
|
||||
保存 API:
|
||||
|
||||
```python
|
||||
lx.io.save_annotated_documents(
|
||||
[result],
|
||||
output_name="graphrag_entities.jsonl",
|
||||
output_dir="mvp_output"
|
||||
)
|
||||
```
|
||||
|
||||
### 4.2 AnnotatedDocument 顶层结构
|
||||
|
||||
**实际 JSONL 输出的顶层字段(基于本地生成文件):**
|
||||
|
||||
| 字段 | 类型 | 实测值 | 说明 |
|
||||
|------|------|--------|------|
|
||||
| `text` | `string` | 520 字符 | 原始输入文本(完整保留) |
|
||||
| `document_id` | `string` | `"doc_8498f2b6"` | 自动生成,格式 `doc_{uuid_hex[:8]}` |
|
||||
| `extractions` | `array[Extraction]` | 17 个元素 | 抽取的实体列表 |
|
||||
|
||||
> **注意:** JSONL 中字段顺序为 `extractions` → `text` → `document_id`(与 dataclass 定义顺序不同,以实际输出为准)。
|
||||
|
||||
### 4.3 Extraction 字段规范(实测对比)
|
||||
|
||||
**实际输出的单条 Extraction 完整结构(摘自本地 JSONL 文件):**
|
||||
|
||||
```json
|
||||
{
|
||||
"extraction_class": "TECHNOLOGY",
|
||||
"extraction_text": "GraphRAG",
|
||||
"char_interval": {
|
||||
"start_pos": 0,
|
||||
"end_pos": 8
|
||||
},
|
||||
"alignment_status": "match_exact",
|
||||
"extraction_index": 1,
|
||||
"group_index": 0,
|
||||
"description": null,
|
||||
"attributes": {}
|
||||
}
|
||||
```
|
||||
|
||||
**实测字段对比(官方 Schema vs 实际输出):**
|
||||
|
||||
| 字段 | 官方 Schema | 实际输出 | 差异说明 |
|
||||
|------|------------|---------|---------|
|
||||
| `extraction_class` | `string` | `string` | 一致 |
|
||||
| `extraction_text` | `string` | `string` | 一致 |
|
||||
| `char_interval` | `object \| null` | `object`(始终存在) | 实测 17 个全部有值 |
|
||||
| `alignment_status` | `string \| null` | `string`(始终存在) | 实测 17 个全部有值 |
|
||||
| `extraction_index` | `int \| null` | `int`(从 1 开始) | **实测从 1 开始,非 0** |
|
||||
| `group_index` | `int \| null` | `int`(从 0 开始) | 实测从 0 开始递增 |
|
||||
| `description` | `string \| null` | `null` | 未使用 description 提示时为 null |
|
||||
| `attributes` | `dict \| null` | `{}`(空对象) | **实测为空对象 `{}`,非 `null`** |
|
||||
| `token_interval` | `object \| null` | **不存在** | **实际 JSONL 输出中无此字段** |
|
||||
|
||||
**关键差异总结:**
|
||||
|
||||
1. `extraction_index` 从 **1** 开始(非 0)
|
||||
2. `attributes` 未使用时输出空对象 `{}`(非 `null`)
|
||||
3. `token_interval` 字段 **不在 JSONL 输出中**(仅存在于内存对象)
|
||||
|
||||
### 4.4 CharInterval 字符锚点
|
||||
|
||||
```json
|
||||
{
|
||||
"start_pos": 0,
|
||||
"end_pos": 8
|
||||
}
|
||||
```
|
||||
|
||||
- `start_pos`:起始位置(包含),0-indexed
|
||||
- `end_pos`:结束位置(不包含)
|
||||
- 语义:`source_text[start_pos:end_pos]` 即为实体在原文中的精确位置
|
||||
|
||||
**实测验证(以 "GraphRAG" 为例):**
|
||||
|
||||
```python
|
||||
text = "GraphRAG is an advanced..."
|
||||
text[0:8] # → "GraphRAG" ✓ 匹配
|
||||
```
|
||||
|
||||
### 4.5 AlignmentStatus 对齐状态枚举
|
||||
|
||||
| Status | Serialized value | Meaning | Reliability | Count in MVP test |
|--------|---------|------|--------|-------------|
| `MATCH_EXACT` | `"match_exact"` | LLM output matches the source text exactly | Highest | **16** |
| `MATCH_GREATER` | `"match_greater"` | LLM output is shorter than the matched source span | High | 0 |
| `MATCH_LESSER` | `"match_lesser"` | LLM output is longer than the matched source span | Medium | 0 |
| `MATCH_FUZZY` | `"match_fuzzy"` | Fuzzy match | Low | **1** |
| `None` | `null` | No alignment found | Unreliable | 0 |
|
||||
|
||||
> **实测精确匹配率:** 16/17 = 94.1%。唯一的 `match_fuzzy` 是 "Microsoft Research"。
|
||||
|
||||
### 4.6 extraction_summary.json(自定义摘要)
|
||||
|
||||
**文件路径:** `mvp_output/extraction_summary.json`
|
||||
**文件大小:** 2,863 bytes
|
||||
|
||||
此文件由 MVP 测试脚本自行生成(非 LangExtract 原生输出),结构如下:
|
||||
|
||||
```json
|
||||
{
|
||||
"total_extractions": 17,
|
||||
"extraction_classes": {
|
||||
"TECHNOLOGY": 9,
|
||||
"ORGANIZATION": 1,
|
||||
"CONCEPT": 7
|
||||
},
|
||||
"extractions": [
|
||||
{
|
||||
"class": "TECHNOLOGY",
|
||||
"text": "GraphRAG",
|
||||
"char_start": 0,
|
||||
"char_end": 8,
|
||||
"alignment": "match_exact"
|
||||
}
|
||||
]
|
||||
}
|
||||
```
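
Since this summary is produced by the MVP script and not by LangExtract, the following sketch shows one way such a file could be derived from a JSONL record. It is a reconstruction under assumptions: the function name `summarize` is hypothetical, and only the field names are taken from the structures documented in this section.

```python
import json
from collections import Counter


def summarize(jsonl_line: str) -> dict:
    """Flatten one AnnotatedDocument JSONL record into the summary layout above."""
    doc = json.loads(jsonl_line)
    rows = []
    for ext in doc.get("extractions", []):
        interval = ext.get("char_interval") or {}  # tolerate a null interval
        rows.append({
            "class": ext["extraction_class"],
            "text": ext["extraction_text"],
            "char_start": interval.get("start_pos"),
            "char_end": interval.get("end_pos"),
            "alignment": ext.get("alignment_status"),
        })
    return {
        "total_extractions": len(rows),
        "extraction_classes": dict(Counter(r["class"] for r in rows)),
        "extractions": rows,
    }
```

Applying it to each line of `graphrag_entities.jsonl` and writing the result with `json.dump` would give a file of the same shape.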
|
||||
|
||||
---
|
||||
|
||||
## 五、本地生成文件清单
|
||||
|
||||
MVP 测试后本地实际生成的文件(共 2 个输出文件):
|
||||
|
||||
```
|
||||
langextract_src/
|
||||
├── .env # DeepSeek API Key 配置
|
||||
├── .venv/ # 独立虚拟环境(Python 3.12)
|
||||
├── mvp_test_deepseek.py # MVP 测试脚本
|
||||
└── mvp_output/ # 输出目录
|
||||
├── graphrag_entities.jsonl # LangExtract 原生 JSONL 输出(4,650 bytes)
|
||||
└── extraction_summary.json # 自定义 JSON 摘要(2,863 bytes)
|
||||
```
|
||||
|
||||
| 文件 | 大小 | 来源 | 说明 |
|
||||
|------|------|------|------|
|
||||
| `graphrag_entities.jsonl` | 4,650 bytes | `lx.io.save_annotated_documents()` | LangExtract 原生输出,1 行 JSONL,含 17 个 Extraction |
|
||||
| `extraction_summary.json` | 2,863 bytes | MVP 脚本自定义 | 扁平化摘要,含类型分布统计 |
|
||||
|
||||
---
|
||||
|
||||
## 附录:环境变量与常量速查
|
||||
|
||||
### 环境变量
|
||||
|
||||
| 变量名 | 适用 Provider | 说明 |
|
||||
|--------|--------------|------|
|
||||
| `LANGEXTRACT_API_KEY` | 所有 | 通用 API Key 后备 |
|
||||
| `GEMINI_API_KEY` | Gemini | Gemini API Key |
|
||||
| `OPENAI_API_KEY` | OpenAI | OpenAI / DeepSeek API Key |
|
||||
| `OLLAMA_BASE_URL` | Ollama | Ollama 服务地址(默认 `http://localhost:11434`) |
|
||||
|
||||
### .env 配置(MVP 实测)
|
||||
|
||||
```env
|
||||
OPENAI_API_KEY=<your-deepseek-api-key>
|
||||
```
|
||||
|
||||
### 模型优先级
|
||||
|
||||
```
|
||||
model(预配置的模型实例) > config(ModelConfig 实例) > model_id + api_key
|
||||
```
|
||||
|
||||
> **MVP 实测使用 `model` 参数**(最高优先级),直接传入 `OpenAILanguageModel` 实例。
|
||||
|
||||
### 结构化输出支持
|
||||
|
||||
| Provider | Schema 类型 | 结构化输出模式 |
|
||||
|----------|------------|---------------|
|
||||
| Gemini | `GeminiSchema` | 严格结构化输出 |
|
||||
| OpenAI | JSON Mode | 通过 `response_format` 约束 |
|
||||
| Ollama | `FormatModeSchema` | JSON 模式(非严格) |
|
||||
|
||||
### 17 个实测抽取实体完整列表
|
||||
|
||||
| # | extraction_class | extraction_text | char_interval | alignment_status |
|
||||
|---|-----------------|-----------------|---------------|-----------------|
|
||||
| 1 | TECHNOLOGY | GraphRAG | [0, 8] | match_exact |
|
||||
| 2 | ORGANIZATION | Microsoft Research | [75, 93] | match_fuzzy |
|
||||
| 3 | CONCEPT | retrieval-augmented generation | [24, 54] | match_exact |
|
||||
| 4 | CONCEPT | knowledge graphs | [107, 123] | match_exact |
|
||||
| 5 | TECHNOLOGY | GPT-4 | [156, 161] | match_exact |
|
||||
| 6 | CONCEPT | multi-hop reasoning | [172, 191] | match_exact |
|
||||
| 7 | CONCEPT | community detection algorithms | [209, 239] | match_exact |
|
||||
| 8 | TECHNOLOGY | Leiden clustering | [248, 265] | match_exact |
|
||||
| 9 | TECHNOLOGY | MinerU | [315, 321] | match_exact |
|
||||
| 10 | TECHNOLOGY | LangExtract | [344, 355] | match_exact |
|
||||
| 11 | TECHNOLOGY | Neo4j | [383, 388] | match_exact |
|
||||
| 12 | CONCEPT | graph database | [396, 410] | match_exact |
|
||||
| 13 | CONCEPT | pipeline | [424, 432] | match_exact |
|
||||
| 14 | TECHNOLOGY | PDF documents | [443, 456] | match_exact |
|
||||
| 15 | TECHNOLOGY | OCR | [465, 468] | match_exact |
|
||||
| 16 | TECHNOLOGY | NLP | [473, 476] | match_exact |
|
||||
| 17 | CONCEPT | knowledge graph | [504, 519] | match_exact |
|
||||
672
docs/langextract_specification.md
Normal file
@@ -0,0 +1,672 @@
|
||||
# LangExtract Pipeline 规范文档
|
||||
|
||||
> 基于 [google/langextract](https://github.com/google/langextract) 源码分析
|
||||
> 版本基线:2026-03-04 main 分支
|
||||
|
||||
---
|
||||
|
||||
## 目录
|
||||
|
||||
- [一、输入规范](#一输入规范)
|
||||
- [1.1 核心入口函数签名](#11-核心入口函数签名)
|
||||
- [1.2 支持的输入类型](#12-支持的输入类型)
|
||||
- [1.3 Document 数据结构](#13-document-数据结构)
|
||||
- [1.4 CSV Dataset 输入](#14-csv-dataset-输入)
|
||||
- [1.5 URL 文本下载](#15-url-文本下载)
|
||||
- [1.6 分块参数配置](#16-分块参数配置)
|
||||
- [1.7 不支持的输入格式](#17-不支持的输入格式)
|
||||
- [二、模型接入规范](#二模型接入规范)
|
||||
- [2.1 模型路由机制](#21-模型路由机制)
|
||||
- [2.2 Gemini Provider](#22-gemini-provider)
|
||||
- [2.3 OpenAI Provider](#23-openai-provider)
|
||||
- [2.4 Ollama Provider](#24-ollama-provider)
|
||||
- [2.5 OpenAI 兼容接口适配(DeepSeek 等)](#25-openai-兼容接口适配deepseek-等)
|
||||
- [2.6 模型优先级与配置覆盖关系](#26-模型优先级与配置覆盖关系)
|
||||
- [2.7 关于 Embedding 模型](#27-关于-embedding-模型)
|
||||
- [三、输出数据格式规范](#三输出数据格式规范)
|
||||
- [3.1 AnnotatedDocument 结构](#31-annotateddocument-结构)
|
||||
- [3.2 Extraction 结构](#32-extraction-结构)
|
||||
- [3.3 CharInterval 字符锚点](#33-charinterval-字符锚点)
|
||||
- [3.4 AlignmentStatus 对齐状态枚举](#34-alignmentstatus-对齐状态枚举)
|
||||
- [3.5 Resolver 对齐参数](#35-resolver-对齐参数)
|
||||
- [3.6 JSONL 输出文件格式](#36-jsonl-输出文件格式)
|
||||
- [3.7 完整输出 JSON Schema 示例](#37-完整输出-json-schema-示例)
|
||||
- [3.8 HTML 可视化输出](#38-html-可视化输出)
|
||||
- [附录:环境变量与常量速查](#附录环境变量与常量速查)
|
||||
|
||||
---
|
||||
|
||||
## 一、输入规范
|
||||
|
||||
### 1.1 核心入口函数签名
|
||||
|
||||
File path: `langextract/extraction.py`

```python
def extract(
    text_or_documents: typing.Any,
    prompt_description: str | None = None,
    examples: typing.Sequence[typing.Any] | None = None,
    model_id: str = "gemini-2.5-flash",
    api_key: str | None = None,
    language_model_type: typing.Type[typing.Any] | None = None,  # deprecated
    format_type: typing.Any = None,
    max_char_buffer: int = 1000,
    temperature: float | None = None,
    fence_output: bool | None = None,
    use_schema_constraints: bool = True,
    batch_length: int = 10,
    max_workers: int = 10,
    additional_context: str | None = None,
    resolver_params: dict | None = None,
    language_model_params: dict | None = None,
    debug: bool = False,
    model_url: str | None = None,
    extraction_passes: int = 1,
    context_window_chars: int | None = None,
    config: typing.Any = None,
    model: typing.Any = None,
    *,
    fetch_urls: bool = True,
    prompt_validation_level: PromptValidationLevel = PromptValidationLevel.WARNING,
    prompt_validation_strict: bool = False,
    show_progress: bool = True,
    tokenizer: Tokenizer | None = None,
) -> list[AnnotatedDocument] | AnnotatedDocument
```
|
||||
|
||||
**Key parameters:**

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `text_or_documents` | `Any` | **required** | A plain-text string, a URL, or an iterable of `Document` objects |
| `prompt_description` | `str \| None` | `None` | Extraction prompt describing which entities to extract |
| `examples` | `Sequence[Any] \| None` | `None` | **Required in practice**: list of few-shot examples (an empty value raises `ValueError`) |
| `model_id` | `str` | `"gemini-2.5-flash"` | Model identifier, used to route automatically to the matching Provider |
| `api_key` | `str \| None` | `None` | LLM API key (can also be set through an environment variable) |
| `max_char_buffer` | `int` | `1000` | Maximum number of characters per text chunk |
| `temperature` | `float \| None` | `None` | Sampling temperature (`None` uses the model default) |
| `use_schema_constraints` | `bool` | `True` | Whether to enable structured-output constraints |
| `batch_length` | `int` | `10` | Number of text chunks processed per batch |
| `max_workers` | `int` | `10` | Maximum number of parallel worker threads |
| `additional_context` | `str \| None` | `None` | Context appended to the inference prompt |
| `resolver_params` | `dict \| None` | `None` | Alignment resolver parameters (see [Section 3.5](#35-resolver-对齐参数)) |
| `extraction_passes` | `int` | `1` | Number of extraction passes (when >1, results of repeated passes are merged, keeping non-overlapping extractions) |
| `context_window_chars` | `int \| None` | `None` | Number of characters of the previous chunk kept as context (for coreference resolution) |
| `model_url` | `str \| None` | `None` | API endpoint URL of a self-hosted model |
| `fetch_urls` | `bool` | `True` | Whether to download the content of http(s) URLs automatically |
|
||||
|
||||
---
|
||||
|
||||
### 1.2 支持的输入类型
|
||||
|
||||
LangExtract accepts **plain text only** as input, passed in one of the following 4 ways:

| Input form | Example | Notes |
|------------|---------|-------|
| **Plain-text string** | `extract("This is some text...")` | Pass the text content directly |
| **URL** | `extract("https://example.com/article.txt")` | The text at the URL is downloaded automatically (`fetch_urls=True`) |
| **Document objects** | `extract([Document(text="...", document_id="doc1")])` | Pass an iterable of `Document` objects |
| **CSV file** | Loaded through the `Dataset` class, then passed in | Specify the text column and the id column |
|
||||
|
||||
---
|
||||
|
||||
### 1.3 Document 数据结构
|
||||
|
||||
File path: `langextract/core/data.py`

```python
@dataclasses.dataclass
class Document:
    text: str                              # required: raw text content
    additional_context: str | None = None  # optional: additional context
    document_id: str                       # auto-generated, format "doc_{uuid_hex[:8]}"
    tokenized_text: TokenizedText          # lazily computed tokenized text
```

**Field notes:**

- `text`: **required**; the raw text content, of type `str`
- `additional_context`: optional; appended to the inference prompt
- `document_id`: accessed through a property; when unset, a unique ID of the form `doc_{uuid_hex[:8]}` is generated automatically
- `tokenized_text`: computed lazily through a property, using the configured Tokenizer
|
||||
|
||||
---
|
||||
|
||||
### 1.4 CSV Dataset 输入
|
||||
|
||||
File path: `langextract/io.py`

```python
@dataclasses.dataclass(frozen=True)
class Dataset:
    input_path: pathlib.Path  # path to the CSV file
    id_key: str               # name of the column holding the document ID
    text_key: str             # name of the column holding the text content

    def load(self, delimiter: str = ',') -> Iterator[Document]:
        """Only .csv files are supported; other formats raise NotImplementedError."""
```

**CSV file requirements:**
- The file extension must be `.csv`
- The file must contain the text column named by `text_key` and the ID column named by `id_key`
- The default delimiter is a comma (`,`); it can be changed through the `delimiter` parameter
- Any other file format raises `NotImplementedError` immediately
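
The requirements above can be illustrated with a small standard-library sketch. It is not langextract's `Dataset` implementation: `SimpleDocument` and `load_csv_documents` are hypothetical names used only to show the behaviour described here (suffix check, column lookup, configurable delimiter).

```python
import csv
import pathlib
from dataclasses import dataclass
from typing import Iterator


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for langextract's Document (text and id only)."""
    text: str
    document_id: str


def load_csv_documents(
    input_path: pathlib.Path, id_key: str, text_key: str, delimiter: str = ","
) -> Iterator[SimpleDocument]:
    """Yield one document per CSV row; reject anything that is not a .csv file."""
    if input_path.suffix.lower() != ".csv":
        raise NotImplementedError(f"Unsupported file type: {input_path.suffix}")
    with input_path.open(newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter=delimiter):
            # A missing column surfaces as KeyError, which mirrors the
            # requirement that both columns must exist.
            yield SimpleDocument(text=row[text_key], document_id=row[id_key])
```

Because the function is a generator, the suffix check only runs once iteration starts.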
|
||||
|
||||
---
|
||||
|
||||
### 1.5 URL 文本下载
|
||||
|
||||
File path: `langextract/io.py`

```python
def download_text_from_url(
    url: str,
    timeout: int = 30,  # default timeout: 30 seconds
    show_progress: bool = True,
    chunk_size: int = 8192,
) -> str
```

**URL requirements:**
- Must start with `http://` or `https://`
- Only the text content is downloaded (`response.text`); HTML, PDF, and similar formats are not parsed
- Requires `fetch_urls=True` (enabled by default)
|
||||
|
||||
---
|
||||
|
||||
### 1.6 分块参数配置
|
||||
|
||||
File path: `langextract/chunking.py`

LangExtract uses a deterministic chunking strategy based on **sentence boundaries** (not semantic chunking). The core class is `ChunkIterator`:

```python
class ChunkIterator:
    def __init__(
        self,
        text: str | TokenizedText | None,
        max_char_buffer: int,       # maximum characters per chunk
        tokenizer_impl: Tokenizer,  # tokenizer instance
        document: Document | None = None,
    )
```

**Chunking strategy:**

1. If a single sentence exceeds `max_char_buffer`, it is split at newlines while respecting token boundaries
2. If a single token exceeds `max_char_buffer`, that token gets a chunk of its own
3. If several sentences fit within `max_char_buffer`, they are merged into one chunk
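
The three rules can be approximated with a short character-level sketch. This is a simplification and not `ChunkIterator` itself: it splits sentences with a naive regex instead of a tokenizer, treats a newline-delimited piece as the smallest unit (standing in for a token), and `chunk_by_sentence` is a hypothetical name.

```python
import re


def chunk_by_sentence(text: str, max_char_buffer: int) -> list:
    """Greedily pack sentences into chunks of at most max_char_buffer characters."""
    # Naive sentence split: break on whitespace that follows ., ! or ?.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s]
    chunks = []
    current = ""
    for sentence in sentences:
        # Rule 1: an oversized sentence is first broken at newlines.
        pieces = sentence.split("\n") if len(sentence) > max_char_buffer else [sentence]
        for piece in pieces:
            if not piece:
                continue
            candidate = f"{current} {piece}" if current else piece
            if len(candidate) <= max_char_buffer:
                # Rule 3: keep merging while the buffer has room.
                current = candidate
            else:
                if current:
                    chunks.append(current)
                # Rule 2 (approximation): a piece that is still too long
                # ends up in a chunk of its own.
                current = piece
    if current:
        chunks.append(current)
    return chunks
```

With `max_char_buffer=25`, the text `"Alpha beta. Gamma delta. Epsilon."` yields two chunks: the first two sentences merged, then the third.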
|
||||
|
||||
**TextChunk output structure:**

```python
@dataclasses.dataclass
class TextChunk:
    token_interval: TokenInterval     # token interval within the source document
    document: Document | None = None  # reference to the source document

    # Properties
    chunk_text: str               # reconstructed text content
    sanitized_chunk_text: str     # text with normalized whitespace
    char_interval: CharInterval   # character interval within the source document
    document_id: str | None       # source document ID
```
|
||||
|
||||
---
|
||||
|
||||
### 1.7 不支持的输入格式
|
||||
|
||||
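The following formats are **not supported** and must be converted to plain text by an external tool before they reach LangExtract:

| Format | Status | Preprocessing approach |
|--------|--------|------------------------|
| PDF | ❌ Not supported | Convert to text first with MinerU or PyMuPDF |
| DOCX | ❌ Not supported | Convert to text first with python-docx |
| HTML | ❌ Not supported | Extract the text first with BeautifulSoup |
| Images | ❌ Not supported | Recognize the text first with an OCR tool |
| Markdown (with media) | ❌ Not supported | Extract the plain-text portion |
| Excel / JSON | ❌ Not supported | Serialize to plain text |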
|
||||
|
||||
---
|
||||
|
||||
## 二、模型接入规范
|
||||
|
||||
### 2.1 模型路由机制
|
||||
|
||||
File path: `langextract/providers/patterns.py`

LangExtract routes to a Provider automatically by **matching `model_id` against regular expressions**:

| Provider | Match patterns | Priority | Example models |
|----------|----------------|----------|----------------|
| **Gemini** | `^gemini` | 10 | `gemini-2.5-flash`, `gemini-1.5-pro` |
| **OpenAI** | `^gpt-4`, `^gpt4.`, `^gpt-5`, `^gpt5.` | 10 | `gpt-4o`, `gpt-4o-mini` |
| **Ollama** | `gemma`, `llama`, `mistral`, `phi`, `qwen`, `deepseek`, and others | 10 | `gemma2:2b`, `llama3.2:1b` |

Ollama additionally supports HuggingFace-style model names such as `meta-llama/Llama*`, `google/gemma*`, `mistralai/*`, and `microsoft/phi*`.
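
As a rough illustration of pattern-based routing, the sketch below transcribes the patterns from the table into a lookup. It is not langextract's router: `PROVIDER_PATTERNS` and `route_provider` are hypothetical names, priorities are ignored, and treating the Ollama keywords as unanchored substrings is an assumption.

```python
import re
from typing import Optional

# Patterns copied from the table above. How the Ollama keywords are anchored
# in the real library is not stated here, so substring matching is assumed.
PROVIDER_PATTERNS = [
    ("gemini", [r"^gemini"]),
    ("openai", [r"^gpt-4", r"^gpt4\.", r"^gpt-5", r"^gpt5\."]),
    ("ollama", [r"gemma", r"llama", r"mistral", r"phi", r"qwen", r"deepseek"]),
]


def route_provider(model_id: str) -> Optional[str]:
    """Return the first provider whose pattern matches model_id, else None."""
    for provider, patterns in PROVIDER_PATTERNS:
        if any(re.search(pattern, model_id) for pattern in patterns):
            return provider
    return None
```

Under these assumptions a name such as `deepseek-chat` resolves to the Ollama provider rather than the OpenAI one, which is why a hosted OpenAI-compatible endpoint needs either a `gpt-*` model ID or an explicitly specified provider.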
|
||||
|
||||
---
|
||||
|
||||
### 2.2 Gemini Provider
|
||||
|
||||
File path: `langextract/providers/gemini.py`

```python
class GeminiLanguageModel(BaseLanguageModel):
    def __init__(
        self,
        model_id: str = 'gemini-2.5-flash',
        api_key: str | None = None,
        vertexai: bool = False,
        credentials: Any | None = None,
        project: str | None = None,
        location: str | None = None,
        http_options: Any | None = None,
        gemini_schema: GeminiSchema | None = None,
        format_type: FormatType = FormatType.JSON,
        temperature: float = 0.0,
        max_workers: int = 10,
        fence_output: bool = False,
        **kwargs,
    )
```

| Parameter | Default | Description |
|-----------|---------|-------------|
| `model_id` | `gemini-2.5-flash` | Gemini model identifier |
| `api_key` | `None` | Environment variable: `GEMINI_API_KEY` or `LANGEXTRACT_API_KEY` |
| `vertexai` | `False` | Whether to use Vertex AI enterprise authentication |
| `temperature` | `0.0` | Sampling temperature (deterministic output) |
| `format_type` | `JSON` | Output format |

**Runtime-configurable parameters:** `temperature`, `max_output_tokens`, `top_p`, `top_k`

**Allowlisted extra parameters:** `response_schema`, `response_mime_type`, `safety_settings`, `system_instruction`, `tools`, `stop_sequences`, `candidate_count`
|
||||
|
||||
---
|
||||
|
||||
### 2.3 OpenAI Provider
|
||||
|
||||
File path: `langextract/providers/openai.py`

```python
class OpenAILanguageModel(BaseLanguageModel):
    def __init__(
        self,
        model_id: str = 'gpt-4o-mini',
        api_key: str | None = None,
        base_url: str | None = None,
        organization: str | None = None,
        format_type: FormatType = FormatType.JSON,
        temperature: float | None = None,
        max_workers: int = 10,
        **kwargs,
    )
```

| Parameter | Default | Description |
|-----------|---------|-------------|
| `model_id` | `gpt-4o-mini` | OpenAI model identifier |
| `api_key` | `None` | Environment variable: `OPENAI_API_KEY` or `LANGEXTRACT_API_KEY` |
| `base_url` | `None` | Custom API endpoint (for OpenAI-compatible services) |
| `organization` | `None` | OpenAI organization ID |
| `temperature` | `None` | Sampling temperature |

**Runtime-configurable parameters:** `temperature`, `max_output_tokens`, `top_p`, `frequency_penalty`, `presence_penalty`, `seed`, `stop`, `logprobs`, `top_logprobs`, `reasoning_effort`, `reasoning`, `response_format`
|
||||
|
||||
---
|
||||
|
||||
### 2.4 Ollama Provider
|
||||
|
||||
File path: `langextract/providers/ollama.py`

```python
class OllamaLanguageModel(BaseLanguageModel):
    def __init__(
        self,
        model_id: str,  # required
        model_url: str = 'http://localhost:11434',
        base_url: str | None = None,
        format_type: FormatType | None = None,
        constraint: Constraint = Constraint(),
        timeout: int | None = None,
        **kwargs,
    )
```

| Parameter | Default | Description |
|-----------|---------|-------------|
| `model_id` | **required** | Ollama model name (for example `gemma2:2b`) |
| `model_url` | `http://localhost:11434` | Ollama server address |
| `timeout` | `120` | Request timeout in seconds (the signature default is `None`; `120` is the effective value, cf. `_DEFAULT_TIMEOUT`) |
| `format_type` | `JSON` | Output format (the signature default is `None`; `JSON` is the effective value) |

**Internal default constants:**

| Constant | Value | Description |
|----------|-------|-------------|
| `_DEFAULT_TEMPERATURE` | `0.1` | Default temperature |
| `_DEFAULT_TIMEOUT` | `120` | Default timeout (seconds) |
| `_DEFAULT_KEEP_ALIVE` | `300` | Model keep-alive time (seconds) |
| `_DEFAULT_NUM_CTX` | `2048` | Default context window size |

**Authentication support:** `api_key`, `auth_scheme` (default `Bearer`), and `auth_header` (default `Authorization`) can be configured for Ollama instances behind a proxy.
|
||||
|
||||
---
|
||||
|
||||
### 2.5 OpenAI 兼容接口适配(DeepSeek 等)
|
||||
|
||||
LangExtract's OpenAI Provider accepts a `base_url` parameter, so it can connect to any OpenAI-compatible API:

```python
# Example: connecting to DeepSeek
result = lx.extract(
    text_or_documents="...",
    model_id="gpt-4o-mini",  # triggers routing to the OpenAI Provider
    api_key="sk-your-deepseek-key",
    examples=[...],
    language_model_params={
        "base_url": "https://api.deepseek.com",
    },
)
```

> **Note:** Because routing is based on regex matching of `model_id`, an OpenAI-compatible service such as DeepSeek still needs a `model_id` with a `gpt-*` prefix to hit the OpenAI Provider; alternatively, specify the Provider explicitly through the `config` parameter.
|
||||
|
||||
---
|
||||
|
||||
### 2.6 模型优先级与配置覆盖关系
|
||||
|
||||
Model configuration sources, from highest to lowest priority:

```
model (pre-configured model instance) > config (ModelConfig instance) > model_id + api_key
```

**ModelConfig structure** (`langextract/factory.py`):

```python
@dataclasses.dataclass(slots=True, frozen=True)
class ModelConfig:
    model_id: str | None = None  # model identifier
    provider: str | None = None  # explicit Provider name
    provider_kwargs: dict[str, Any] = field(default_factory=dict)  # Provider constructor arguments
```
|
||||
|
||||
---
|
||||
|
||||
### 2.7 关于 Embedding 模型
|
||||
|
||||
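**LangExtract neither uses nor depends on any embedding model.**

- Text chunking uses a deterministic, sentence-boundary-based splitting algorithm and involves no semantic similarity computation
- There is no vector index and no vector retrieval
- The code base contains no embedding-related calls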
|
||||
|
||||
---
|
||||
|
||||
## 三、输出数据格式规范
|
||||
|
||||
### 3.1 AnnotatedDocument 结构
|
||||
|
||||
File path: `langextract/core/data.py`

```python
@dataclasses.dataclass
class AnnotatedDocument:
    extractions: list[Extraction] | None = None  # list of extraction results
    text: str | None = None                      # original text
    document_id: str                             # unique document ID (auto-generated)
    tokenized_text: TokenizedText                # tokenized text (lazily computed)
```

**Top-level fields of the serialized JSON:**

| Field | Type | Description |
|-------|------|-------------|
| `document_id` | `string` | Unique document ID, format `doc_{uuid_hex[:8]}` |
| `text` | `string \| null` | Original input text |
| `extractions` | `array[Extraction] \| null` | List of extracted entities |
|
||||
|
||||
---
|
||||
|
||||
### 3.2 Extraction 结构
|
||||
|
||||
File path: `langextract/core/data.py`

```python
@dataclasses.dataclass(init=False)
class Extraction:
    extraction_class: str                                  # entity type
    extraction_text: str                                   # entity text
    char_interval: CharInterval | None = None              # character position anchor
    alignment_status: AlignmentStatus | None = None        # alignment status
    extraction_index: int | None = None                    # extraction order index
    group_index: int | None = None                         # group index
    description: str | None = None                         # entity description
    attributes: dict[str, str | list[str]] | None = None   # additional attributes
    token_interval: TokenInterval | None = None            # token position anchor
```

**Field details:**

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `extraction_class` | `str` | Yes | Entity type or category name (for example `PERSON`, `ORGANIZATION`) |
| `extraction_text` | `str` | Yes | Extracted text (should be a substring of the source text) |
| `char_interval` | `CharInterval \| null` | No | Character offsets within the source text |
| `alignment_status` | `string \| null` | No | Text alignment quality (see [Section 3.4](#34-alignmentstatus-对齐状态枚举)) |
| `extraction_index` | `int \| null` | No | Position within the result list |
| `group_index` | `int \| null` | No | Group membership (used to link related extractions) |
| `description` | `string \| null` | No | Supplementary description of the entity |
| `attributes` | `dict \| null` | No | Additional attributes as key-value pairs |
| `token_interval` | `TokenInterval \| null` | No | Token offsets within the source text |
|
||||
|
||||
---
|
||||
|
||||
### 3.3 CharInterval 字符锚点
|
||||
|
||||
File path: `langextract/core/data.py`

```python
@dataclasses.dataclass
class CharInterval:
    start_pos: int | None = None  # start position (inclusive), 0-indexed
    end_pos: int | None = None    # end position (exclusive)
```

**Semantics:** `source_text[start_pos:end_pos]` is the exact location of the extracted text within the source text.
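
The half-open interval semantics can be checked with a few lines of plain Python. The `CharInterval` below is a local stand-in that copies the two fields shown above, and `slice_interval` is a hypothetical helper, not a langextract function.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CharInterval:
    start_pos: Optional[int] = None  # inclusive, 0-indexed
    end_pos: Optional[int] = None    # exclusive


def slice_interval(source_text: str, interval: CharInterval) -> Optional[str]:
    """Return the anchored substring, or None when the interval is unset."""
    if interval.start_pos is None or interval.end_pos is None:
        return None
    return source_text[interval.start_pos:interval.end_pos]


text = "GraphRAG is a technique developed by Microsoft Research."
start = text.index("Microsoft Research")
interval = CharInterval(start_pos=start, end_pos=start + len("Microsoft Research"))
```

Because `end_pos` is exclusive, `end_pos - start_pos` equals the length of the extracted text, and the slice can be compared directly with `extraction_text` as a sanity check.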
|
||||
|
||||
---
|
||||
|
||||
### 3.4 AlignmentStatus 对齐状态枚举
|
||||
|
||||
File path: `langextract/core/data.py`

```python
class AlignmentStatus(enum.Enum):
    MATCH_EXACT = "match_exact"
    MATCH_GREATER = "match_greater"
    MATCH_LESSER = "match_lesser"
    MATCH_FUZZY = "match_fuzzy"
```

| Status | Serialized value | Meaning | Trust level |
|--------|------------------|---------|-------------|
| `MATCH_EXACT` | `"match_exact"` | The LLM output matches the source token sequence exactly | Highest |
| `MATCH_GREATER` | `"match_greater"` | The LLM output's token sequence is shorter than the matched source span (best overlap found) | High |
| `MATCH_LESSER` | `"match_lesser"` | The LLM output is longer than the matched source span (partial exact match) | Medium |
| `MATCH_FUZZY` | `"match_fuzzy"` | Fuzzy match whose overlap ratio reaches the threshold (default ≥0.75) | Low |
| `None` | `null` | No alignment found | Untrusted |

**Alignment flow:**

```
1. Try an exact token-level match (difflib)
   ├── success, equal length → MATCH_EXACT
   ├── success, but the LLM output is longer → MATCH_LESSER
   └── success, but the matched region is larger → MATCH_GREATER
2. Exact match fails and enable_fuzzy_alignment=True
   ├── best overlap window ≥ fuzzy_alignment_threshold → MATCH_FUZZY
   └── below the threshold → None
3. Exact match fails and enable_fuzzy_alignment=False → None
```
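
One way a downstream consumer might use the trust ordering from the table is to drop extractions below a chosen status. The sketch works on the serialized (dict) form of an extraction; `_TRUST_RANK` and `filter_by_alignment` are hypothetical names, and the numeric ranks are only an encoding of the table's ordering, not values from langextract.

```python
from typing import Optional

# Higher rank = more trustworthy, following the order of the table above.
_TRUST_RANK = {
    "match_exact": 4,
    "match_greater": 3,
    "match_lesser": 2,
    "match_fuzzy": 1,
    None: 0,  # no alignment found
}


def filter_by_alignment(extractions: list, min_status: Optional[str] = "match_lesser") -> list:
    """Keep extractions whose alignment_status is at least min_status."""
    threshold = _TRUST_RANK[min_status]
    return [
        e for e in extractions
        if _TRUST_RANK.get(e.get("alignment_status"), 0) >= threshold
    ]
```

The default threshold keeps exact and partial matches while discarding fuzzy and unaligned results; passing `None` keeps everything.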
|
||||
|
||||
---
|
||||
|
||||
### 3.5 Resolver 对齐参数
|
||||
|
||||
File path: `langextract/resolver.py`

Passed in through the `resolver_params` dictionary of `extract()`:

```python
result = lx.extract(
    ...,
    resolver_params={
        "enable_fuzzy_alignment": True,     # enable fuzzy alignment (default True)
        "fuzzy_alignment_threshold": 0.75,  # minimum overlap ratio for a fuzzy match (default 0.75)
        "accept_match_lesser": True,        # accept MATCH_LESSER results (default True)
        "suppress_parse_errors": False,     # ignore JSON parse errors (default False)
    },
)
```

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `enable_fuzzy_alignment` | `bool` | `True` | Whether to try a fuzzy match after the exact match fails |
| `fuzzy_alignment_threshold` | `float` | `0.75` | Minimum token overlap ratio for a fuzzy match (0.0 to 1.0) |
| `accept_match_lesser` | `bool` | `True` | Whether to accept partial exact matches |
| `suppress_parse_errors` | `bool` | `False` | Whether to continue instead of raising when JSON parsing fails |
|
||||
|
||||
---
|
||||
|
||||
### 3.6 JSONL 输出文件格式
|
||||
|
||||
File path: `langextract/io.py`

```python
def save_annotated_documents(
    annotated_documents: Iterator[AnnotatedDocument],
    output_dir: pathlib.Path | str | None = None,
    output_name: str = 'data.jsonl',
    show_progress: bool = True,
) -> None
```

**Output rules:**
- File format: **JSONL** (JSON Lines), one complete JSON object per line
- Default file name: `data.jsonl`
- Serialization rules:
  - Enum values become strings (for example `AlignmentStatus.MATCH_EXACT` → `"match_exact"`)
  - NumPy and other integral numeric types become `int`
  - Private fields whose names start with `_` are excluded
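
The serialization rules can be sketched with the standard library alone. This is an illustration of the three rules, not the code behind `save_annotated_documents`; `to_jsonable` and `to_jsonl_line` are hypothetical helpers.

```python
import dataclasses
import enum
import json
import numbers
from typing import Any


def to_jsonable(value: Any) -> Any:
    """Recursively convert a value following the three rules listed above."""
    if isinstance(value, enum.Enum):
        return value.value  # Enum -> its underlying value
    if dataclasses.is_dataclass(value) and not isinstance(value, type):
        return {
            f.name: to_jsonable(getattr(value, f.name))
            for f in dataclasses.fields(value)
            if not f.name.startswith("_")  # drop private fields
        }
    if isinstance(value, dict):
        return {k: to_jsonable(v) for k, v in value.items() if not str(k).startswith("_")}
    if isinstance(value, (list, tuple)):
        return [to_jsonable(v) for v in value]
    if isinstance(value, numbers.Integral) and not isinstance(value, bool):
        return int(value)  # integral types -> plain int
    return value


def to_jsonl_line(record: Any) -> str:
    """Serialize one record as a single JSONL line (no trailing newline)."""
    return json.dumps(to_jsonable(record), ensure_ascii=False)
```

Writing a file then amounts to emitting `to_jsonl_line(doc) + "\n"` for each document, and reading it back is one `json.loads` call per line.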
|
||||
|
||||
---
|
||||
|
||||
### 3.7 完整输出 JSON Schema 示例
|
||||
|
||||
Complete structure of a single JSONL record (pretty-printed here for readability; in the file it occupies one line):

```json
{
  "document_id": "doc_a1b2c3d4",
  "text": "GraphRAG is a technique developed by Microsoft Research that combines knowledge graphs with retrieval-augmented generation.",
  "extractions": [
    {
      "extraction_class": "TECHNOLOGY",
      "extraction_text": "GraphRAG",
      "char_interval": {
        "start_pos": 0,
        "end_pos": 8
      },
      "alignment_status": "match_exact",
      "extraction_index": 0,
      "group_index": null,
      "description": "A technique combining knowledge graphs with RAG",
      "attributes": {
        "category": "AI/ML",
        "developer": "Microsoft Research"
      },
      "token_interval": {
        "start_index": 0,
        "end_index": 1
      }
    },
    {
      "extraction_class": "ORGANIZATION",
      "extraction_text": "Microsoft Research",
      "char_interval": {
        "start_pos": 37,
        "end_pos": 55
      },
      "alignment_status": "match_exact",
      "extraction_index": 1,
      "group_index": null,
      "description": null,
      "attributes": null,
      "token_interval": {
        "start_index": 6,
        "end_index": 8
      }
    }
  ]
}
```
|
||||
|
||||
---
|
||||
|
||||
### 3.8 HTML 可视化输出
|
||||
|
||||
File path: `langextract/visualization.py`

```python
def visualize(doc: AnnotatedDocument) -> HTML
```

**Features:**
- Color-coded highlighting by `extraction_class` (10-color palette)
- Interactive tooltips showing the entity type and attributes
- Animated navigation controls for browsing multiple entities
- Progress slider
- Responsive embedded HTML/CSS/JavaScript
- Direct rendering in Jupyter / IPython environments
|
||||
|
||||
---
|
||||
|
||||
## 附录:环境变量与常量速查
|
||||
|
||||
### 环境变量
|
||||
|
||||
| Variable | Applies to | Description |
|----------|------------|-------------|
| `LANGEXTRACT_API_KEY` | All | Generic fallback API key |
| `GEMINI_API_KEY` | Gemini | Gemini API key |
| `OPENAI_API_KEY` | OpenAI | OpenAI API key |
| `OLLAMA_BASE_URL` | Ollama | Ollama server address (default `http://localhost:11434`) |
|
||||
|
||||
### FormatType 枚举
|
||||
|
||||
```python
|
||||
class FormatType(enum.Enum):
|
||||
YAML = 'yaml'
|
||||
JSON = 'json'
|
||||
```
|
||||
|
||||
### 结构化输出支持
|
||||
|
||||
| Provider | Schema type | Structured-output mode |
|----------|-------------|------------------------|
| Gemini | `GeminiSchema` | Strict structured output |
| OpenAI | JSON Mode | Constrained through `response_format` |
| Ollama | `FormatModeSchema` | JSON mode (non-strict) |
|
||||
|
||||
### Fence Output 逻辑
|
||||
|
||||
| Provider | Default | Notes |
|----------|---------|-------|
| Gemini | `False` | No fence needed when a schema is present |
| OpenAI | `False` | JSON Mode returns raw JSON |
| Ollama | `False` | Returns raw JSON |
|
||||
879
docs/mineru_specification-v1.0.md
Normal file
@@ -0,0 +1,879 @@
|
||||
# MinerU 文档解析规范文档 v1.0
|
||||
|
||||
> Based on the official [opendatalab/MinerU](https://github.com/opendatalab/MinerU) API documentation plus validation against a local MVP run
> Backend version tested: `pipeline` / `_version_name: 2.6.4`
> Last updated: 2026-03-04
|
||||
|
||||
---
|
||||
|
||||
## 目录
|
||||
|
||||
- [一、Pipeline 执行流程与测试脚本](#一pipeline-执行流程与测试脚本)
|
||||
- [1.1 虚拟环境配置(环境隔离)](#11-虚拟环境配置环境隔离)
|
||||
- [1.2 完整执行流程(本地文件 → 云端解析 → 本地存储)](#12-完整执行流程本地文件--云端解析--本地存储)
|
||||
- [1.3 测试脚本存放位置](#13-测试脚本存放位置)
|
||||
- [1.4 Pipeline 各步骤详解](#14-pipeline-各步骤详解)
|
||||
- [二、输入格式规范](#二输入格式规范)
|
||||
- [2.1 支持的文件格式](#21-支持的文件格式)
|
||||
- [2.2 输入限制](#22-输入限制)
|
||||
- [2.3 OCR 语言支持](#23-ocr-语言支持)
|
||||
- [三、输出格式规范(实测验证)](#三输出格式规范实测验证)
|
||||
- [3.1 实际输出文件清单(实测 vs 官方文档对比)](#31-实际输出文件清单实测-vs-官方文档对比)
|
||||
- [3.2 content_list.json 字段规范(实测验证)](#32-content_listjson-字段规范实测验证)
|
||||
- [3.3 layout.json 字段规范(实测验证)](#33-layoutjson-字段规范实测验证)
|
||||
- [3.4 full.md Markdown 输出规范(实测验证)](#34-fullmd-markdown-输出规范实测验证)
|
||||
- [四、布局信息规范](#四布局信息规范)
|
||||
- [4.1 坐标系定义(实测验证)](#41-坐标系定义实测验证)
|
||||
- [4.2 布局分类体系](#42-布局分类体系)
|
||||
- [4.3 内容层级与标题级别](#43-内容层级与标题级别)
|
||||
- [4.4 布局精度提取指南](#44-布局精度提取指南)
|
||||
- [五、云端 API 关键参数规范](#五云端-api-关键参数规范)
|
||||
- [5.1 认证配置](#51-认证配置)
|
||||
- [5.2 本地文件上传流程 — file-urls/batch](#52-本地文件上传流程--file-urlsbatch)
|
||||
- [5.3 URL 直传解析 — extract/task](#53-url-直传解析--extracttask)
|
||||
- [5.4 批量 URL 解析 — extract/task/batch](#54-批量-url-解析--extracttaskbatch)
|
||||
- [5.5 查询结果接口](#55-查询结果接口)
|
||||
- [5.6 通用响应包装结构](#56-通用响应包装结构)
|
||||
- [5.7 任务状态枚举(实测验证)](#57-任务状态枚举实测验证)
|
||||
- [5.8 错误码速查](#58-错误码速查)
|
||||
|
||||
---
|
||||
|
||||
## 一、Pipeline 执行流程与测试脚本
|
||||
|
||||
### 1.1 虚拟环境配置(环境隔离)
|
||||
|
||||
The MinerU MVP component uses a **dedicated Python virtual environment**, fully isolated from the project's other components (LangExtract, the GraphRAG Pipeline, and so on) to avoid dependency pollution.

| Item | Value |
|------|-------|
| Virtual environment path | `F:\GraphRAGAgent\mineru_mvp\.venv\` |
| Python version | 3.12 |
| Created with | uv |
| Python interpreter | `F:/GraphRAGAgent/mineru_mvp/.venv/Scripts/python.exe` |

**Switch to this virtual environment before starting the pipeline:**

```bash
# Option 1: call the interpreter by path (recommended, no manual activation needed)
F:/GraphRAGAgent/mineru_mvp/.venv/Scripts/python.exe pipeline.py

# Option 2: activate the environment, then run
cd F:/GraphRAGAgent/mineru_mvp
source .venv/Scripts/activate
python pipeline.py
```

**Installing a new dependency:**

```bash
uv pip install <package> --python F:/GraphRAGAgent/mineru_mvp/.venv/Scripts/python.exe
```

**Installed dependencies:**

| Package | Purpose |
|---------|---------|
| `requests` | HTTP client (API calls, file upload and download) |
| `python-dotenv` | Loading the `.env` configuration file |
| `reportlab` | Generating the test PDF |
|
||||
|
||||
---
|
||||
|
||||
### 1.2 完整执行流程(本地文件 → 云端解析 → 本地存储)
|
||||
|
||||
```
┌─────────────────────────────────────────────────────────────────┐
│ Step 0: Activate the virtual environment                        │
│   source .venv/Scripts/activate, or use the .venv python        │
├─────────────────────────────────────────────────────────────────┤
│ Step 1: Request a pre-signed upload URL                         │
│   POST /file-urls/batch → returns batch_id + file_urls[]        │
├─────────────────────────────────────────────────────────────────┤
│ Step 2: Upload the local file                                   │
│   PUT {file_urls[0]} ← raw file bytes (no Content-Type header)  │
├─────────────────────────────────────────────────────────────────┤
│ Step 3: Poll for the parsing result                             │
│   GET /extract-results/batch/{batch_id}                         │
│   States: waiting-file → pending → running → done/failed        │
├─────────────────────────────────────────────────────────────────┤
│ Step 4: Download the result ZIP                                 │
│   GET {full_zip_url} → extract into the local output/ directory │
├─────────────────────────────────────────────────────────────────┤
│ Step 5: Analyze the parsed artifacts                            │
│   Read *content_list.json → count block types/pages, summarize  │
└─────────────────────────────────────────────────────────────────┘
```

> **Key finding (measured):** The upload request **must not** carry a `Content-Type` header; otherwise the OSS pre-signed URL check fails with 403 `SignatureDoesNotMatch`. Use a bare `PUT` request.
|
||||
|
||||
### 1.3 测试脚本存放位置
|
||||
|
||||
```
F:\GraphRAGAgent\mineru_mvp\
├── .env                    # API token configuration
├── .venv/                  # dedicated virtual environment (Python 3.12, created with uv)
├── CLAUDE.md               # Claude Code component guidelines
├── create_test_pdf.py      # test PDF generator (reportlab)
├── pipeline.py             # full pipeline script (5 steps)
├── test_sample.pdf         # generated test PDF (1 page with heading, paragraphs, table)
└── output/
    └── test_sample/        # parsing output
        ├── full.md
        ├── {uuid}_content_list.json
        ├── layout.json
        ├── {uuid}_origin.pdf
        └── images/
            └── {hash}.jpg
```
|
||||
|
||||
### 1.4 Pipeline 各步骤详解
|
||||
|
||||
#### Step 1 — 获取预签名上传 URL
|
||||
|
||||
```python
import requests

# API_BASE and token are assumed to be loaded from the .env configuration.
resp = requests.post(
    f"{API_BASE}/file-urls/batch",
    headers={"Authorization": f"Bearer {token}", "Content-Type": "application/json"},
    json={
        "files": [{"name": "test_sample.pdf", "data_id": "mvp_test"}],
        "enable_formula": True,
        "enable_table": True,
        "language": "en",
    },
)
resp.raise_for_status()
data = resp.json()["data"]
batch_id = data["batch_id"]
upload_url = data["file_urls"][0]
```
|
||||
|
||||
#### Step 2 — 上传文件(裸 PUT,不带 Content-Type)
|
||||
|
||||
```python
with open("test_sample.pdf", "rb") as f:
    # Pass no headers: an explicit Content-Type breaks the pre-signed URL signature.
    upload_resp = requests.put(upload_url, data=f)
upload_resp.raise_for_status()
```
|
||||
|
||||
#### Step 3 — 轮询结果
|
||||
|
||||
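```python
import time

headers = {"Authorization": f"Bearer {token}"}
deadline = time.monotonic() + 600  # arbitrary 10-minute cap on polling

while True:
    result = requests.get(
        f"{API_BASE}/extract-results/batch/{batch_id}",
        headers=headers,
    ).json()
    item = result["data"]["extract_result"][0]
    if item["state"] == "done":
        zip_url = item["full_zip_url"]
        break
    if item["state"] == "failed":
        # Without this check a failed task would be polled forever.
        raise RuntimeError("MinerU reported state 'failed' for this batch")
    if time.monotonic() > deadline:
        raise TimeoutError("polling timed out before the task finished")
    time.sleep(5)
```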
|
||||
|
||||
#### Step 4 — 下载解压
|
||||
|
||||
```python
import io
import zipfile

zip_resp = requests.get(zip_url)
zip_resp.raise_for_status()
with zipfile.ZipFile(io.BytesIO(zip_resp.content)) as zf:
    zf.extractall("output/test_sample/")
```
|
||||
|
||||
#### Step 5 — 分析产物
|
||||
|
||||
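```python
import glob, json

path = glob.glob("output/test_sample/*content_list.json")[0]  # open() does not expand wildcards
content_list = json.load(open(path, encoding="utf-8"))
# Count blocks by type, group by page_idx, extract the heading hierarchy, etc.
```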
|
||||
|
||||
---
|
||||
|
||||
## 二、输入格式规范
|
||||
|
||||
### 2.1 支持的文件格式
|
||||
|
||||
| Format | Extensions | Notes |
|--------|------------|-------|
| **PDF** | `.pdf` | Core capability: text-based, scanned, and mixed PDFs are all supported |
| **Word** | `.doc`, `.docx` | Legacy and current Word documents |
| **PowerPoint** | `.ppt`, `.pptx` | Legacy and current presentations |
| **Images** | `.png`, `.jpg`, `.jpeg` | Single-page image documents, with automatic EXIF orientation correction |
| **HTML** | `.html` | Requires `model_version: "MinerU-HTML"` |
|
||||
|
||||
### 2.2 输入限制
|
||||
|
||||
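| Constraint | Limit |
|------------|-------|
| Maximum size per file | **200 MB** |
| Maximum pages per file | **600 pages** |
| Maximum files per batch request | **200** |
| Validity of the pre-signed upload URL | **24 hours** |
| Daily top-priority quota of the cloud API | **2,000 pages**; pages beyond the quota run at lower priority |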
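
A pre-upload check against these limits might look like the sketch below. The dict shape, the `check_batch` name, and the reading of "200 MB" as 200 × 1024 × 1024 bytes are assumptions made for illustration; the API itself is not called.

```python
MAX_FILE_BYTES = 200 * 1024 * 1024  # 200 MB per file (binary megabytes assumed)
MAX_PAGES = 600                     # pages per file
MAX_BATCH_FILES = 200               # files per batch request


def check_batch(files: list) -> list:
    """Return human-readable violations; an empty list means the batch passes.

    Each entry is a hypothetical dict: {"name": str, "size_bytes": int, "pages": int or None}.
    """
    problems = []
    if len(files) > MAX_BATCH_FILES:
        problems.append(f"batch has {len(files)} files (limit {MAX_BATCH_FILES})")
    for f in files:
        if f["size_bytes"] > MAX_FILE_BYTES:
            problems.append(f"{f['name']}: exceeds 200 MB")
        # The page count may be unknown before parsing, so it is optional.
        if f.get("pages") is not None and f["pages"] > MAX_PAGES:
            problems.append(f"{f['name']}: exceeds 600 pages")
    return problems
```

Running such a check locally avoids spending a pre-signed URL and a polling cycle on a file that the service would reject anyway.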
|
||||
|
||||
### 2.3 OCR 语言支持
|
||||
|
||||
MinerU's built-in OCR engine supports **109 languages** (based on PaddleOCR v3); the document's primary language can be set through the `language` parameter.

> **Note (official documentation):** The default value of `language` is `"ch"` (not `"zh"`), following PaddleOCR's language code convention.

| Code | Language | Code | Language |
|------|----------|------|----------|
| `ch` | Chinese | `en` | English |
| `japan` | Japanese | `korean` | Korean |
| `french` | French | `german` | German |
|
||||
|
||||
---
|
||||
|
||||
## 三、输出格式规范(实测验证)
|
||||
|
||||
### 3.1 实际输出文件清单(实测 vs 官方文档对比)
|
||||
|
||||
**Measured output (after unzipping, 5 files in total):**

```
output/test_sample/
├── full.md                        # Markdown output (single file)
├── {uuid}_content_list.json       # flattened list of content blocks
├── layout.json                    # rich-metadata intermediate format
├── {uuid}_origin.pdf              # copy of the original PDF
└── images/
    └── {sha256_hash}.jpg          # table / figure crops
```

**Differences from the official documentation:**

| Item | Official documentation | Measured result | Difference |
|------|------------------------|-----------------|------------|
| Markdown file | `auto/auto.md` + `auto_nlp/auto_nlp.md` (two subdirectories) | **`full.md`** (single file in the root) | The cloud API returns a merged `full.md` with no subdirectory split |
| Intermediate format | `middle.json` | **`layout.json`** | Different file name, same structure |
| content_list | `content_list.json` | **`{uuid}_content_list.json`** | File name carries a UUID prefix |
| Copy of the original file | Not mentioned | **`{uuid}_origin.pdf`** | The cloud API additionally returns a copy of the original file |
| Debug files | `layout.pdf` + `span.pdf` + `model.json` | **None** | The cloud API returns neither the debug PDFs nor `model.json` |
| Image naming | `img_0_0.png` / `table_0_1.png` | **`{sha256}.jpg`** | Named by content hash, in JPG format |

> **Key conclusion:** Trust the measured behaviour. When integrating with downstream systems, match files with glob patterns (such as `*content_list.json`) rather than fixed file names.
|
||||
|
||||
### 3.2 content_list.json 字段规范(实测验证)
|
||||
|
||||
The file is a **JSON array**; each element is one content block, in document reading order.
|
||||
|
||||
#### 3.2.1 公共字段
|
||||
|
||||
| Field | Type | Description | Measured |
|-------|------|-------------|----------|
| `type` | `string` | Content type | Observed: `text`, `table` |
| `page_idx` | `int` | Page index (0-indexed) | Observed value: `0` |
| `bbox` | `[int, int, int, int]` | Bounding box `[x0, y0, x1, y1]` | Observed range: `0–1000` (normalized) |
|
||||
|
||||
#### 3.2.2 文本块(type: "text")
|
||||
|
||||
**Measured full structure:**

```json
{
  "type": "text",
  "text": "GraphRAG: Knowledge Graph Enhanced RAG System ",
  "text_level": 1,
  "bbox": [141, 93, 860, 151],
  "page_idx": 0
}
```

| Field | Type | Always present | Description |
|-------|------|----------------|-------------|
| `text` | `string` | Yes | Text content (may end with a trailing space) |
| `text_level` | `int \| absent` | No | Heading level: `1` = top-level heading; **for body text the field is absent rather than `0` or `null`** |

> **Measured finding:** In body paragraphs the `text_level` field **does not exist at all** (it is neither `null` nor `0`); only heading blocks carry it. Detect headings with `block.get("text_level")` rather than `block["text_level"] >= 1`.
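
The finding about the absent `text_level` field translates into a small, defensive helper. The function names are hypothetical and the input is assumed to be the parsed `content_list.json` array.

```python
def heading_level(block: dict) -> int:
    """Return the heading level, or 0 for body text and non-text blocks."""
    if block.get("type") != "text":
        return 0
    # .get() avoids the KeyError that block["text_level"] raises on body text.
    return block.get("text_level") or 0


def split_headings(content_list: list) -> tuple:
    """Split text blocks into (headings, body paragraphs), preserving order."""
    headings = [b for b in content_list if heading_level(b) >= 1]
    body = [b for b in content_list if b.get("type") == "text" and heading_level(b) == 0]
    return headings, body
```

Tables and other block types end up in neither list, so they can be handled by their own code path.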
|
||||
|
||||
#### 3.2.3 表格块(type: "table")
|
||||
|
||||
**Measured full structure:**

```json
{
  "type": "table",
  "img_path": "images/e382eaafdf341d361c2567b20d9ce56456c17a7dd10ae5dadbcc3961256169c9.jpg",
  "table_caption": [],
  "table_footnote": [],
  "table_body": "<table><tr><td rowspan=1 colspan=2>Method Comprehensiveness</td>...</table>",
  "bbox": [115, 563, 882, 708],
  "page_idx": 0
}
```

| Field | Type | Always present | Description |
|-------|------|----------------|-------------|
| `img_path` | `string` | Yes | Path of the table crop (`images/{sha256}.jpg`) |
| `table_body` | `string` | Yes | HTML table (a `<table>` element with no `<html>/<body>` wrapper) |
| `table_caption` | `string[]` | Yes | Table caption (may be an empty array `[]`) |
| `table_footnote` | `string[]` | Yes | Table footnotes (may be an empty array `[]`) |

> **Measured finding:** The HTML in `table_body` starts directly with `<table>` and does **not** include an `<html><body>` wrapper (the official documentation's example has one; trust the measured behaviour).
|
||||
|
||||
#### 3.2.4 图片块(type: "image")— 官方文档
|
||||
|
||||
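The test PDF contains no standalone images; the structure below comes from the official documentation and still needs to be verified by measurement: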
|
||||
|
||||
```json
|
||||
{
|
||||
"type": "image",
|
||||
"img_path": "images/{hash}.jpg",
|
||||
"image_caption": ["Figure 1: ..."],
|
||||
"image_footnote": [],
|
||||
"bbox": [x0, y0, x1, y1],
|
||||
"page_idx": 0
|
||||
}
|
||||
```
|
||||
|
||||
#### 3.2.5 公式块(type: "equation")— 官方文档
|
||||
|
||||
```json
|
||||
{
|
||||
"type": "equation",
|
||||
"text": "E = mc^2",
|
||||
"text_format": "latex",
|
||||
"img_path": "images/{hash}.jpg",
|
||||
"bbox": [x0, y0, x1, y1],
|
||||
"page_idx": 0
|
||||
}
|
||||
```
|
||||
|
||||
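> **Measured finding:** The percentages in the conclusion paragraph of the test PDF were parsed as inline LaTeX formulas (`$7 2 . 0 \%$`) embedded in `text` blocks rather than emitted as standalone `equation` blocks. This suggests that the pipeline backend inlines simple formulas into text blocks.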
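
If downstream extraction needs plain percentages, the spaced-out inline formulas can be collapsed with a regex pass. This is a post-processing sketch under the assumption that only the `$d d . d \%$` shape seen in the test output needs handling; `normalize_inline_percent` is a hypothetical helper, not part of MinerU.

```python
import re

_INLINE_MATH = re.compile(r"\$([^$]+)\$")
_PERCENT = re.compile(r"^\d+(?:\.\d+)?\\%$")


def normalize_inline_percent(text: str) -> str:
    """Rewrite inline formulas such as `$7 2 . 0 \\%$` as plain `72.0%`."""
    def repl(match):
        compact = match.group(1).replace(" ", "")
        if _PERCENT.match(compact):
            return compact.replace("\\%", "%")
        return match.group(0)  # leave every other formula untouched
    return _INLINE_MATH.sub(repl, text)
```

Formulas that are not a bare percentage, such as `$E = mc^2$`, pass through unchanged.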
|
||||
|
||||
---
|
||||
|
||||
### 3.3 layout.json 字段规范(实测验证)
|
||||
|
||||
`layout.json` corresponds to `middle.json` in the official documentation and is the rich-metadata intermediate format.
|
||||
|
||||
#### 3.3.1 顶层结构(实测)
|
||||
|
||||
```json
|
||||
{
|
||||
"_backend": "pipeline",
|
||||
"_version_name": "2.6.4",
|
||||
"pdf_info": [ ... ]
|
||||
}
|
||||
```
|
||||
|
||||
| Field | Type | Measured value | Description |
|-------|------|----------------|-------------|
| `_backend` | `string` | `"pipeline"` | Parsing backend used |
| `_version_name` | `string` | `"2.6.4"` | MinerU version identifier |
| `pdf_info` | `array` | 1 element | Parsing results organized by page |
|
||||
|
||||
#### 3.3.2 页级结构(实测)
|
||||
|
||||
```json
|
||||
{
|
||||
"page_idx": 0,
|
||||
"page_size": [595, 841],
|
||||
"preproc_blocks": [ ... ],
|
||||
"para_blocks": [ ... ],
|
||||
"discarded_blocks": []
|
||||
}
|
||||
```
|
||||
|
||||
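| Field | Type | Measured value | Description |
|-------|------|----------------|-------------|
| `page_idx` | `int` | `0` | Page index (0-indexed) |
| `page_size` | `[int, int]` | `[595, 841]` | Page size `[width, height]` (in PDF points; A4 ≈ 595×841) |
| `preproc_blocks` | `array` | 10 blocks | Content blocks from the preprocessing stage |
| `para_blocks` | `array` | 10 blocks | Content blocks after paragraph segmentation |
| `discarded_blocks` | `array` | `[]` | Filtered-out content (headers, footers, and so on) |

> **Difference from the official documentation:** The measured page-level structure contains **only 3 arrays** (`preproc_blocks`, `para_blocks`, `discarded_blocks`) and **lacks** the separate `images`, `tables`, and `interline_equations` arrays mentioned in the official documentation. Tables and images are embedded directly in `preproc_blocks` / `para_blocks`.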
|
||||
|
||||
#### 3.3.3 内容块层级结构(Block → Line → Span,实测验证)
|
||||
|
||||
**文本/标题块(实测):**
|
||||
|
||||
```json
{
  "type": "title",
  "bbox": [84, 79, 512, 127],
  "lines": [
    {
      "bbox": [80, 77, 515, 106],
      "spans": [
        {
          "bbox": [80, 77, 515, 106],
          "score": 1.0,
          "content": "GraphRAG: Knowledge Graph Enhanced",
          "type": "text"
        }
      ],
      "index": 0
    }
  ],
  "index": 0.5
}
```
|
||||
|
||||
**Block 字段(实测):**
|
||||
|
||||
| 字段 | 类型 | 说明 |
|
||||
|------|------|------|
|
||||
| `type` | `string` | 块类型:实测出现 `title`, `text`, `table` |
|
||||
| `bbox` | `[int, int, int, int]` | 边界框(原始 PDF pt 坐标) |
|
||||
| `lines` | `array` | 行数组(文本/标题块) |
|
||||
| `blocks` | `array` | 子块数组(仅 `table` 类型容器块) |
|
||||
| `index` | `int \| float` | 排序索引(可为小数,如 `0.5`) |
|
||||
|
||||
**Line 字段(实测):**
|
||||
|
||||
| 字段 | 类型 | 说明 |
|
||||
|------|------|------|
|
||||
| `bbox` | `[int, int, int, int]` | 行边界框 |
|
||||
| `spans` | `array` | Span 数组 |
|
||||
| `index` | `int` | 行内排序索引 |
|
||||
|
||||
**Span 字段(实测):**
|
||||
|
||||
| 字段 | 类型 | 说明 |
|
||||
|------|------|------|
|
||||
| `bbox` | `[int, int, int, int]` | Span 边界框 |
|
||||
| `type` | `string` | 实测出现:`text`, `table` |
|
||||
| `content` | `string` | 文本内容(`type=text` 时) |
|
||||
| `score` | `float` | 置信度(实测多为 `1.0`) |
|
||||
|
||||
**表格容器块(实测):**
|
||||
|
||||
```json
|
||||
{
|
||||
"type": "table",
|
||||
"bbox": [69, 474, 525, 596],
|
||||
"blocks": [
|
||||
{
|
||||
"type": "table_body",
|
||||
"bbox": [69, 474, 525, 596],
|
||||
"group_id": 0,
|
||||
"lines": [ ... ],
|
||||
"index": 0,
|
||||
"virtual_lines": [ ... ]
|
||||
}
|
||||
],
|
||||
"index": 7
|
||||
}
|
||||
```
|
||||
|
||||
表格容器块内的子块额外包含:
|
||||
|
||||
| 字段 | 类型 | 说明 |
|
||||
|------|------|------|
|
||||
| `group_id` | `int` | 分组 ID |
|
||||
| `virtual_lines` | `array` | 虚拟行结构(表格布局专用) |
|
||||
|
||||
**`para_blocks` 额外字段(实测):**
|
||||
|
||||
Some text blocks in `para_blocks` carry an extra `bbox_fs` field. Its meaning is unconfirmed; it may be a bounding box related to font size. For example:
|
||||
|
||||
```json
|
||||
{
|
||||
"type": "text",
|
||||
"bbox": [77, 198, 518, 259],
|
||||
"lines": [...],
|
||||
"index": 2,
|
||||
"bbox_fs": [77, 198, 518, 259]
|
||||
}
|
||||
```
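
The Block → Line → Span hierarchy described above can be walked with a small recursive helper. The sketch below assumes only the fields observed in this section (`blocks`, `lines`, `spans`, `type`, `content`, `index`). The function names and the sample page are hypothetical, and the sample is hand-built rather than real MinerU output.

```python
from typing import Iterator, List


def iter_span_text(block: dict) -> Iterator[str]:
    """Yield the content of every text span under a block, depth-first."""
    # Container blocks (observed: type "table") nest child blocks under "blocks".
    for child in block.get("blocks", []):
        yield from iter_span_text(child)
    # Leaf blocks hold lines; each line holds spans.
    for line in block.get("lines", []):
        for span in line.get("spans", []):
            if span.get("type") == "text" and "content" in span:
                yield span["content"]


def page_texts(page: dict) -> List[str]:
    """Return one joined string per para_block, ordered by the block index."""
    blocks = sorted(page.get("para_blocks", []), key=lambda b: b.get("index", 0))
    return [" ".join(iter_span_text(b)) for b in blocks]


# Hand-built sample shaped like the measured blocks above (not real output).
sample_page = {
    "page_idx": 0,
    "para_blocks": [
        {"type": "table", "index": 7, "blocks": [
            {"type": "table_body", "index": 0, "lines": [
                {"spans": [{"type": "table"}]},
            ]},
        ]},
        {"type": "title", "index": 0.5, "lines": [
            {"index": 0, "spans": [
                {"type": "text", "score": 1.0,
                 "content": "GraphRAG: Knowledge Graph Enhanced"},
            ]},
        ]},
    ],
}
print(page_texts(sample_page))
```

Sorting on `index` with a default of `0` tolerates the fractional values seen in practice (such as `0.5`). Table containers yield an empty string here because their spans are of type `table` and carry no `content`.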
|
||||
|
||||
---
|
||||
|
||||
### 3.4 full.md Markdown 输出规范(实测验证)
|
||||
|
||||
**Measured output:** a single `full.md` file, not the two-directory layout (`auto/auto.md` + `auto_nlp/auto_nlp.md`) described in the official documentation.
|
||||
|
||||
**实测特征:**
|
||||
|
||||
| 特征 | 实测行为 |
|
||||
|------|---------|
|
||||
| 标题 | 使用 `# ` 前缀,所有标题均为一级(`# `) |
|
||||
| 段落 | 纯文本,段落间以空行分隔 |
|
||||
| 表格 | 直接嵌入 HTML `<table>` 标签 |
|
||||
| 公式 | 内联使用 `$...$` 定界符(如 `$7 2 . 0 \%$`) |
|
||||
| 图片引用 | 本次未出现独立图片引用 |
|
||||
|
||||
**实测输出示例(节选):**
|
||||
|
||||
```markdown
|
||||
# GraphRAG: Knowledge Graph Enhanced RAG System
|
||||
|
||||
# 1. Introduction
|
||||
|
||||
GraphRAG is an advanced retrieval-augmented generation technique developed by...
|
||||
|
||||
# 3. Performance Comparison
|
||||
|
||||
The following table compares GraphRAG with traditional RAG approaches...
|
||||
|
||||
<table><tr><td rowspan=1 colspan=2>Method Comprehensiveness</td>...</table>
|
||||
|
||||
# 4. Conclusion
|
||||
|
||||
...comprehensiveness $7 2 . 0 \%$ vs $3 2 . 4 \%$...
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 四、布局信息规范
|
||||
|
||||
### 4.1 坐标系定义(实测验证)
|
||||
|
||||
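| Coordinate system | Applies to | Measured range | Origin | Description |
|--------|---------|---------|------|------|
| **Normalized integer coordinates** | `*content_list.json` | `0 – 1000` | Top-left | Page width and height are each mapped to 0~1000 |
| **Raw PDF coordinates** | `layout.json` | Measured page size `[595, 841]` (A4, pt) | Top-left | Same scale as the PDF page size |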
|
||||
|
||||
**bbox 格式统一为 `[x0, y0, x1, y1]`:**
|
||||
|
||||
```
|
||||
(x0, y0) ─────────────────── (x1, y0)
|
||||
│ │
|
||||
│ 内容区域 │
|
||||
│ │
|
||||
(x0, y1) ─────────────────── (x1, y1)
|
||||
```
|
||||
|
||||
**Measured comparison (heading block "1. Introduction"):**

| File | bbox | Coordinate system |
|------|------|--------|
| `content_list.json` | `[131, 200, 317, 222]` | Normalized 0-1000 |
| `layout.json` | `[78, 169, 189, 187]` | PDF pt (page 595×841) |
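
The two bounding boxes above are related by a simple scaling. The sketch below converts between the two systems. Truncating with `int()` instead of rounding is an assumption inferred from this single measured sample, where truncation reproduces the `content_list.json` values and rounding does not. Both function names are illustrative.

```python
from typing import List, Sequence


def pt_to_normalized(bbox: Sequence[float], page_size: Sequence[float]) -> List[int]:
    """Map a layout.json bbox (PDF pt) onto the 0-1000 grid of content_list.json."""
    width, height = page_size
    x0, y0, x1, y1 = bbox
    # Truncation is an assumption that matches the one measured sample.
    return [
        int(x0 * 1000 / width),
        int(y0 * 1000 / height),
        int(x1 * 1000 / width),
        int(y1 * 1000 / height),
    ]


def normalized_to_pt(bbox: Sequence[float], page_size: Sequence[float]) -> List[float]:
    """Approximate inverse: map a 0-1000 bbox back to PDF pt (lossy)."""
    width, height = page_size
    x0, y0, x1, y1 = bbox
    return [x0 * width / 1000, y0 * height / 1000,
            x1 * width / 1000, y1 * height / 1000]


page_size = [595, 841]
print(pt_to_normalized([78, 169, 189, 187], page_size))
print(normalized_to_pt([131, 200, 317, 222], page_size))
```

Because the normalized values are integers, the inverse recovers the original coordinates only to within about one point.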
|
||||
|
||||
### 4.2 布局分类体系
|
||||
|
||||
#### Pipeline 后端(实测 + 官方文档合并)
|
||||
|
||||
**layout.json 中的 `type` 值(实测出现标记 ✅):**
|
||||
|
||||
| type 值 | 说明 | 实测出现 |
|
||||
|---------|------|---------|
|
||||
| `title` | 标题 | ✅ |
|
||||
| `text` | 正文段落 | ✅ |
|
||||
| `table` | 表格容器 | ✅ |
|
||||
| `table_body` | 表格主体(子块) | ✅ |
|
||||
| `table_caption` | 表格标题 | — |
|
||||
| `table_footnote` | 表格脚注 | — |
|
||||
| `image_body` | 图片主体 | — |
|
||||
| `image_caption` | 图片标题 | — |
|
||||
| `image_footnote` | 图片脚注 | — |
|
||||
| `interline_equation` | 行间公式 | — |
|
||||
| `index` | 目录项 | — |
|
||||
| `list` | 列表项 | — |
|
||||
|
||||
#### VLM 后端(官方文档,未实测)
|
||||
|
||||
VLM 后端额外支持:`code`, `code_caption`, `list`, `header`, `footer`, `page_number`, `aside_text`, `page_footnote`, `ref_text`, `algorithm`, `phonetic`。
|
||||
|
||||
### 4.3 内容层级与标题级别
|
||||
|
||||
`content_list.json` 中 `text_level` 字段标识文档结构层级:
|
||||
|
||||
| text_level | 含义 | Markdown | 实测验证 |
|
||||
|------------|------|----------|---------|
|
||||
| **字段缺失** | 正文 | 无标记 | ✅ 实测正文块不含 `text_level` 字段 |
|
||||
| `1` | 一级标题 | `# Heading` | ✅ 实测验证 |
|
||||
| `2` | 二级标题 | `## Heading` | — |
|
||||
| `3` | 三级标题 | `### Heading` | — |
|
||||
| `4+` | 更深层标题 | `####+ Heading` | — |
|
||||
|
||||
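> **Important correction:** The official documentation describes body text as `text_level: null` or `0`, but in the measured output **the field is absent altogether** from body-text blocks. The reliable check is: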
|
||||
|
||||
```python
# Correct: .get() returns None when the field is absent
is_heading = block.get("text_level") is not None

# Wrong: raises KeyError on body-text blocks, which have no text_level
is_heading = block["text_level"] >= 1
```
|
||||
|
||||
### 4.4 布局精度提取指南
|
||||
|
||||
#### 提取文档大纲
|
||||
|
||||
```python
headings = [
    {"level": b["text_level"], "text": b["text"].strip(), "page": b["page_idx"]}
    for b in content_list
    if b["type"] == "text" and b.get("text_level") is not None
]
```
|
||||
|
||||
#### 提取正文段落
|
||||
|
||||
```python
paragraphs = [
    b["text"].strip()
    for b in content_list
    if b["type"] == "text" and b.get("text_level") is None
]
```
|
||||
|
||||
#### 解析表格数值
|
||||
|
||||
```python
from bs4 import BeautifulSoup

for b in content_list:
    if b["type"] != "table":
        continue
    soup = BeautifulSoup(b["table_body"], "html.parser")
    rows = []
    for tr in soup.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all(["td", "th"])]
        rows.append(cells)
    # rows now holds the table as a 2-D list of cell strings
```
|
||||
|
||||
#### 按页面位置过滤
|
||||
|
||||
```python
def is_upper_half(block):
    """Return True if the block's vertical center is in the upper half of the page (normalized 0-1000 coordinates)."""
    y_center = (block["bbox"][1] + block["bbox"][3]) / 2
    return y_center < 500
```
|
||||
|
||||
---
|
||||
|
||||
## 五、云端 API 关键参数规范
|
||||
|
||||
### 5.1 认证配置
|
||||
|
||||
| 项目 | 值 |
|
||||
|------|-----|
|
||||
| 请求头 | `Authorization: Bearer {token}` |
|
||||
| Token 获取 | [mineru.net/apiManage/token](https://mineru.net/apiManage/token) |
|
||||
| .env 配置 | `MINERU_API_TOKEN=xxx` |
|
||||
|
||||
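Every endpoint requires the `Authorization` header together with `Content-Type: application/json`. The only exception is the file-upload PUT request, which is sent to the pre-signed URL.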
|
||||
|
||||
---
|
||||
|
||||
### 5.2 本地文件上传流程 — file-urls/batch
|
||||
|
||||
**用途:** 本地文件场景 — 获取预签名 URL → PUT 上传 → 自动触发解析
|
||||
|
||||
**接口:** `POST https://mineru.net/api/v4/file-urls/batch`
|
||||
|
||||
#### 请求体
|
||||
|
||||
| 字段 | 类型 | 必填 | 默认值 | 说明 |
|
||||
|------|------|------|--------|------|
|
||||
| `files` | `array[object]` | **是** | — | 文件列表(最多 200 个) |
|
||||
| `files[].name` | `string` | **是** | — | 文件名(须含正确扩展名) |
|
||||
| `files[].data_id` | `string` | 否 | — | 业务标识(最长 128 字符,支持字母数字 `_` `-` `.`) |
|
||||
| `files[].is_ocr` | `bool` | 否 | `false` | 是否强制 OCR |
|
||||
| `files[].page_ranges` | `string` | 否 | — | 页码范围(如 `"2,4-6"` 或 `"2--2"` 表示到倒数第二页) |
|
||||
| `model_version` | `string` | 否 | `"pipeline"` | 模型版本:`pipeline` / `vlm` / `MinerU-HTML` |
|
||||
| `enable_formula` | `bool` | 否 | `true` | 是否启用公式识别 |
|
||||
| `enable_table` | `bool` | 否 | `true` | 是否启用表格识别 |
|
||||
| `language` | `string` | 否 | `"ch"` | OCR 语言(PaddleOCR v3 语言代码) |
|
||||
| `callback` | `string` | 否 | — | 回调通知 URL(HTTP/HTTPS POST) |
|
||||
| `seed` | `string` | 否 | — | 回调签名种子(与 callback 配合,最长 64 字符) |
|
||||
| `extra_formats` | `string[]` | 否 | — | 额外输出格式:`"docx"`, `"html"`, `"latex"` |
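
As an illustration of the request body described in the table above, the sketch below builds the JSON payload for `file-urls/batch` and enforces the documented limits (at most 200 files, `data_id` of at most 128 characters drawn from letters, digits, `_`, `-` and `.`). The helper name, its parameters and the `data_id` numbering scheme are hypothetical. It only builds the dictionary and performs no network call.

```python
import re
from pathlib import Path
from typing import Iterable, Optional

# data_id rule taken from the table: up to 128 chars of letters, digits, "_", "-", ".".
_DATA_ID = re.compile(r"[A-Za-z0-9_.\-]{1,128}")


def build_batch_upload_body(
    paths: Iterable[str],
    *,
    model_version: str = "pipeline",
    language: str = "ch",
    data_id_prefix: Optional[str] = None,
) -> dict:
    """Build the JSON body for POST /api/v4/file-urls/batch (illustrative helper)."""
    files = []
    for i, path in enumerate(paths):
        name = Path(path).name
        if not Path(name).suffix:
            raise ValueError(f"file name needs an extension: {name!r}")
        entry = {"name": name}
        if data_id_prefix is not None:
            data_id = f"{data_id_prefix}_{i}"  # hypothetical numbering scheme
            if not _DATA_ID.fullmatch(data_id):
                raise ValueError(f"invalid data_id: {data_id!r}")
            entry["data_id"] = data_id
        files.append(entry)
    if not 1 <= len(files) <= 200:
        raise ValueError("files must contain between 1 and 200 entries")
    return {"files": files, "model_version": model_version, "language": language}


body = build_batch_upload_body(["docs/test_sample.pdf"], data_id_prefix="mvp")
print(body)
```

Optional fields such as `enable_formula` are left out on purpose so that the server-side defaults listed in the table apply.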
|
||||
|
||||
#### 响应体(实测验证)
|
||||
|
||||
```json
|
||||
{
|
||||
"code": 0,
|
||||
"msg": "ok",
|
||||
"trace_id": "9ef836ce2a65f46c5f54389e55a14039",
|
||||
"data": {
|
||||
"batch_id": "6ce0e838-b324-4f1d-8b06-01ddc07e4cd4",
|
||||
"file_urls": [
|
||||
"https://mineru.oss-cn-shanghai.aliyuncs.com/api-upload/extract/2026-03-04/{batch_id}/{file_uuid}.pdf?Expires=...&OSSAccessKeyId=...&Signature=..."
|
||||
]
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
| 响应字段 | 类型 | 说明 |
|
||||
|---------|------|------|
|
||||
| `code` | `int` | `0` 表示成功 |
|
||||
| `msg` | `string` | 状态信息 |
|
||||
| `trace_id` | `string` | 请求追踪 ID |
|
||||
| `data.batch_id` | `string` | 批次 ID(后续查询结果使用) |
|
||||
| `data.file_urls` | `string[]` | 预签名上传 URL 列表(与 `files` 一一对应) |
|
||||
|
||||
#### 文件上传
|
||||
|
||||
```
|
||||
PUT {file_urls[i]}
|
||||
Body: 文件二进制流
|
||||
```
|
||||
|
||||
> **Do not set any request headers yourself** (including `Content-Type`); otherwise the OSS signature check fails.
|
||||
|
||||
---
|
||||
|
||||
### 5.3 URL 直传解析 — extract/task
|
||||
|
||||
**用途:** 文件已有公网 URL 时直接提交解析
|
||||
|
||||
**接口:** `POST https://mineru.net/api/v4/extract/task`
|
||||
|
||||
#### 请求体
|
||||
|
||||
| 字段 | 类型 | 必填 | 默认值 | 说明 |
|
||||
|------|------|------|--------|------|
|
||||
| `url` | `string` | **是** | — | 文件公网 URL |
|
||||
| `model_version` | `string` | 否 | `"pipeline"` | 模型版本 |
|
||||
| `is_ocr` | `bool` | 否 | `false` | 是否强制 OCR |
|
||||
| `enable_formula` | `bool` | 否 | `true` | 是否启用公式识别 |
|
||||
| `enable_table` | `bool` | 否 | `true` | 是否启用表格识别 |
|
||||
| `language` | `string` | 否 | `"ch"` | OCR 语言 |
|
||||
| `data_id` | `string` | 否 | — | 业务标识 |
|
||||
| `callback` | `string` | 否 | — | 回调 URL |
|
||||
| `seed` | `string` | 否 | — | 回调种子 |
|
||||
| `extra_formats` | `string[]` | 否 | — | 额外输出格式 |
|
||||
| `page_ranges` | `string` | 否 | — | 页码范围 |
|
||||
| `no_cache` | `bool` | 否 | `false` | 跳过 URL 缓存 |
|
||||
| `cache_tolerance` | `int` | 否 | `900` | 缓存容忍时间(秒) |
|
||||
|
||||
#### 响应体
|
||||
|
||||
```json
|
||||
{
|
||||
"code": 0,
|
||||
"msg": "ok",
|
||||
"trace_id": "string",
|
||||
"data": { "task_id": "string" }
|
||||
}
|
||||
```
|
||||
|
||||
#### 查询结果
|
||||
|
||||
`GET https://mineru.net/api/v4/extract/task/{task_id}`
|
||||
|
||||
```json
|
||||
{
|
||||
"code": 0,
|
||||
"data": {
|
||||
"task_id": "string",
|
||||
"data_id": "string",
|
||||
"state": "done",
|
||||
"full_zip_url": "https://cdn-mineru.openxlab.org.cn/...",
|
||||
"err_msg": null,
|
||||
"extract_progress": {
|
||||
"extracted_pages": 1,
|
||||
"total_pages": 1,
|
||||
"start_time": "2026-03-04 12:00:00"
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### 5.4 批量 URL 解析 — extract/task/batch
|
||||
|
||||
**接口:** `POST https://mineru.net/api/v4/extract/task/batch`
|
||||
|
||||
#### 请求体
|
||||
|
||||
```json
|
||||
{
|
||||
"files": [
|
||||
{"url": "https://...", "data_id": "doc1", "is_ocr": false, "page_ranges": "1-5"}
|
||||
],
|
||||
"model_version": "pipeline",
|
||||
"enable_formula": true,
|
||||
"enable_table": true,
|
||||
"language": "ch",
|
||||
"extra_formats": ["docx"],
|
||||
"no_cache": false,
|
||||
"cache_tolerance": 900
|
||||
}
|
||||
```
|
||||
|
||||
#### 响应体
|
||||
|
||||
```json
|
||||
{
|
||||
"code": 0,
|
||||
"data": { "batch_id": "string" }
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### 5.5 查询结果接口
|
||||
|
||||
#### 单任务查询
|
||||
|
||||
`GET https://mineru.net/api/v4/extract/task/{task_id}`
|
||||
|
||||
#### 批量查询(实测验证)
|
||||
|
||||
`GET https://mineru.net/api/v4/extract-results/batch/{batch_id}`
|
||||
|
||||
**响应体(实测验证):**
|
||||
|
||||
```json
{
  "code": 0,
  "msg": "ok",
  "trace_id": "string",
  "data": {
    "batch_id": "3b1729e9-c833-44b4-b9c2-201164001ab0",
    "extract_result": [
      {
        "file_name": "test_sample.pdf",
        "state": "done",
        "full_zip_url": "https://cdn-mineru.openxlab.org.cn/pdf/2026-03-04/...",
        "err_msg": null,
        "data_id": "mvp_test",
        "extract_progress": {
          "extracted_pages": 1,
          "total_pages": 1,
          "start_time": "2026-03-04 ..."
        }
      }
    ]
  }
}
```
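
A batch response mixes files in different states, so callers typically split it before downloading anything. The following sketch assumes the response shape shown above and sorts the entries into finished, failed and still-running groups. The function name and the sample dictionary are illustrative.

```python
from typing import Dict, List, Optional, Tuple


def split_batch_results(
    response: dict,
) -> Tuple[Dict[str, str], Dict[str, Optional[str]], List[str]]:
    """Split a batch-query response into (done, failed, in_progress)."""
    if response.get("code") != 0:
        raise RuntimeError(f"API error {response.get('code')}: {response.get('msg')}")
    done: Dict[str, str] = {}
    failed: Dict[str, Optional[str]] = {}
    in_progress: List[str] = []
    for item in response["data"]["extract_result"]:
        name = item["file_name"]
        if item["state"] == "done":
            done[name] = item["full_zip_url"]
        elif item["state"] == "failed":
            failed[name] = item.get("err_msg")
        else:
            in_progress.append(name)
    return done, failed, in_progress


# Sample shaped like the measured response; the URL is a placeholder.
sample_response = {
    "code": 0,
    "msg": "ok",
    "data": {
        "batch_id": "3b1729e9-c833-44b4-b9c2-201164001ab0",
        "extract_result": [
            {"file_name": "test_sample.pdf", "state": "done",
             "full_zip_url": "https://example.invalid/result.zip", "err_msg": None},
            {"file_name": "second.pdf", "state": "pending"},
        ],
    },
}
done, failed, in_progress = split_batch_results(sample_response)
print(done, failed, in_progress)
```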
|
||||
|
||||
---
|
||||
|
||||
### 5.6 通用响应包装结构
|
||||
|
||||
所有 API 响应均遵循统一包装格式:
|
||||
|
||||
```json
|
||||
{
|
||||
"code": 0, // 0 = 成功,非 0 = 失败
|
||||
"msg": "ok", // 状态描述
|
||||
"trace_id": "...", // 请求追踪 ID
|
||||
"data": { ... } // 业务数据
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### 5.7 任务状态枚举(实测验证)
|
||||
|
||||
| state | Description | Observed |
|-------|------|---------|
| `waiting-file` | Waiting for the file upload to finish | ✅ |
| `pending` | Queued for parsing | ✅ |
| `running` | Parsing in progress | — |
| `converting` | Format conversion in progress | — |
| `done` | Parsing finished | ✅ |
| `failed` | Parsing failed | — |
|
||||
|
||||
> **Observed state transitions:** `waiting-file` → `pending` → `done` (with the small test file, `running` was never observed between polls)
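
Because intermediate states may never be observed, a polling loop should key only on the terminal states. The sketch below shows one way to do that, assuming the state names from the table in this section. `fetch_state` is a caller-supplied function (hypothetical) that returns the `data` object of a query response, which keeps the loop independent of any HTTP library and easy to test.

```python
import time
from typing import Callable

KNOWN_STATES = {"waiting-file", "pending", "running", "converting", "done", "failed"}


def wait_for_task(
    fetch_state: Callable[[], dict],
    *,
    interval: float = 5.0,
    timeout: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until the task is done and return the final payload."""
    deadline = time.monotonic() + timeout
    while True:
        payload = fetch_state()
        state = payload.get("state")
        if state not in KNOWN_STATES:
            raise ValueError(f"unexpected state: {state!r}")
        if state == "failed":
            raise RuntimeError(payload.get("err_msg") or "extraction failed")
        if state == "done":
            return payload
        if time.monotonic() >= deadline:
            raise TimeoutError(f"task still {state!r} after {timeout} seconds")
        sleep(interval)


# Simulated sequence matching the observed flow; no network involved.
_states = iter([
    {"state": "waiting-file"},
    {"state": "pending"},
    {"state": "done", "full_zip_url": "https://example.invalid/result.zip"},
])
final = wait_for_task(lambda: next(_states), sleep=lambda seconds: None)
print(final["state"])
```

The default interval and timeout are arbitrary choices for illustration and are not values recommended by the API.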
|
||||
|
||||
---
|
||||
|
||||
### 5.8 错误码速查
|
||||
|
||||
| Error code | Meaning |
|--------|------|
| `A0202` | Invalid token |
| `A0211` | Token expired |
| `-60005` | File exceeds 200 MB |
| `-60006` | More than 600 pages |
| `-60018` | Daily parsing quota exhausted |
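
The error codes above combine naturally with the common response wrapper (`code`, `msg`, `trace_id`, `data`). The following sketch unwraps a response and raises a readable error for the listed codes. The exception class and helper name are hypothetical, and codes may arrive as strings or integers, so both are handled.

```python
ERROR_HINTS = {
    "A0202": "invalid token",
    "A0211": "token expired",
    "-60005": "file exceeds 200 MB",
    "-60006": "more than 600 pages",
    "-60018": "daily parsing quota exhausted",
}


class MinerUApiError(RuntimeError):
    """Raised when the response wrapper carries a non-zero code (illustrative)."""


def unwrap(response: dict) -> dict:
    """Return response["data"] on success, otherwise raise MinerUApiError."""
    code = response.get("code")
    if code == 0:
        return response.get("data", {})
    # Fall back to the server-provided msg for codes not listed in the table.
    hint = ERROR_HINTS.get(str(code), response.get("msg", "unknown error"))
    trace_id = response.get("trace_id", "n/a")
    raise MinerUApiError(f"[{code}] {hint} (trace_id={trace_id})")


print(unwrap({"code": 0, "msg": "ok", "data": {"task_id": "abc"}}))
```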
|
||||
680
docs/mineru_specification.md
Normal file
@@ -0,0 +1,680 @@
|
||||
# MinerU 文档解析规范文档
|
||||
|
||||
> 基于 [opendatalab/MinerU](https://github.com/opendatalab/MinerU) 官方文档及云端 API 调研
|
||||
> 版本基线:2026-03-04
|
||||
|
||||
---
|
||||
|
||||
## 目录
|
||||
|
||||
- [一、支持的原始输入文件格式](#一支持的原始输入文件格式)
|
||||
- [1.1 支持格式清单](#11-支持格式清单)
|
||||
- [1.2 输入限制](#12-输入限制)
|
||||
- [1.3 OCR 语言支持](#13-ocr-语言支持)
|
||||
- [二、云端 API 输出格式规范](#二云端-api-输出格式规范)
|
||||
- [2.1 输出文件总览](#21-输出文件总览)
|
||||
- [2.2 content_list.json 字段规范](#22-content_listjson-字段规范)
|
||||
- [2.3 middle.json 字段规范](#23-middlejson-字段规范)
|
||||
- [2.4 Markdown 输出规范](#24-markdown-输出规范)
|
||||
- [2.5 调试与可视化文件](#25-调试与可视化文件)
|
||||
- [三、布局信息规范](#三布局信息规范)
|
||||
- [3.1 坐标系定义](#31-坐标系定义)
|
||||
- [3.2 布局分类体系(Pipeline 后端)](#32-布局分类体系pipeline-后端)
|
||||
- [3.3 布局分类体系(VLM 后端)](#33-布局分类体系vlm-后端)
|
||||
- [3.4 内容层级与标题级别](#34-内容层级与标题级别)
|
||||
- [3.5 布局精度提取指南](#35-布局精度提取指南)
|
||||
- [四、云端 API MVP 必要字段](#四云端-api-mvp-必要字段)
|
||||
- [4.1 认证配置](#41-认证配置)
|
||||
- [4.2 创建解析任务 — 请求规范](#42-创建解析任务--请求规范)
|
||||
- [4.3 查询任务结果 — 响应规范](#43-查询任务结果--响应规范)
|
||||
- [4.4 批量任务接口](#44-批量任务接口)
|
||||
- [4.5 MVP 最小可用请求示例](#45-mvp-最小可用请求示例)
|
||||
|
||||
---
|
||||
|
||||
## 一、支持的原始输入文件格式
|
||||
|
||||
### 1.1 支持格式清单
|
||||
|
||||
| 格式 | 扩展名 | 说明 |
|
||||
|------|--------|------|
|
||||
| **PDF** | `.pdf` | 核心能力 — 文本型 / 扫描型 / 混合型均支持 |
|
||||
| **Word** | `.doc`, `.docx` | 旧版和新版 Word 文档 |
|
||||
| **PowerPoint** | `.ppt`, `.pptx` | 旧版和新版演示文稿 |
|
||||
| **图片** | `.png`, `.jpg`, `.jpeg` | 单页图片文档,支持 EXIF 方向自动校正 |
|
||||
| **HTML** | `.html` | 需指定 `MinerU-HTML` 模型版本 |
|
||||
|
||||
### 1.2 输入限制
|
||||
|
||||
| 约束项 | 限制值 |
|
||||
|--------|--------|
|
||||
| 单文件最大体积 | **200 MB** |
|
||||
| 单文件最大页数 | **600 页** |
|
||||
| 云端 API 每日免费额度 | **2,000 页**(最高优先级),超出部分降低优先级 |
|
||||
|
||||
### 1.3 OCR 语言支持
|
||||
|
||||
MinerU 内置 OCR 引擎支持 **109 种语言**,可通过 `language` 参数指定文档主语言(默认 `zh` 中文)。常用语言代码:
|
||||
|
||||
| 代码 | 语言 | 代码 | 语言 |
|
||||
|------|------|------|------|
|
||||
| `zh` | 中文 | `en` | 英文 |
|
||||
| `ja` | 日文 | `ko` | 韩文 |
|
||||
| `fr` | 法文 | `de` | 德文 |
|
||||
|
||||
---
|
||||
|
||||
## 二、云端 API 输出格式规范
|
||||
|
||||
### 2.1 输出文件总览
|
||||
|
||||
云端 API 任务完成后,返回一个 ZIP 压缩包(通过 `full_zip_url` 获取),解压后包含以下文件:
|
||||
|
||||
```
|
||||
output/
|
||||
├── auto/
|
||||
│ ├── auto.md # 多模态 Markdown(含图片引用)
|
||||
│ └── images/ # 提取的图片资源
|
||||
│ ├── img_0_0.png
|
||||
│ ├── table_0_1.png
|
||||
│ └── ...
|
||||
├── auto_nlp/
|
||||
│ └── auto_nlp.md # 纯文本 NLP Markdown(无图片)
|
||||
├── middle.json # 富元数据中间格式(完整层级结构)
|
||||
├── content_list.json # 扁平化内容块列表(按阅读顺序)
|
||||
├── layout.pdf # 布局分析可视化(调试用)
|
||||
├── span.pdf # Span 级别标注(Pipeline 后端,调试用)
|
||||
└── model.json # 原始模型推理结果(调试用)
|
||||
```
|
||||
|
||||
| 文件 | 用途 | 推荐场景 |
|
||||
|------|------|---------|
|
||||
| `content_list.json` | 扁平化内容块,按阅读顺序 | **推荐用于下游 NLP/KG 管道对接** |
|
||||
| `middle.json` | 完整层级结构,含丰富元数据 | 需要精确布局信息或二次开发 |
|
||||
| `auto/auto.md` | 多模态 Markdown | 人工阅读、LLM 直接消费 |
|
||||
| `auto_nlp/auto_nlp.md` | 纯文本 Markdown | 纯文本 NLP 处理 |
|
||||
| `layout.pdf` | 布局可视化 | 调试、验证解析质量 |
|
||||
|
||||
---
|
||||
|
||||
### 2.2 content_list.json 字段规范
|
||||
|
||||
`content_list.json` 是一个 **JSON 数组**,每个元素是一个内容块,按文档阅读顺序排列。
|
||||
|
||||
#### 2.2.1 公共字段(所有类型共有)
|
||||
|
||||
| 字段 | 类型 | 说明 |
|
||||
|------|------|------|
|
||||
| `type` | `string` | 内容类型:`text` / `image` / `table` / `equation` / `code` / `list` |
|
||||
| `page_idx` | `int` | 所在页码(**0-indexed**) |
|
||||
| `bbox` | `[x0, y0, x1, y1]` | 边界框坐标,归一化到 **0–1000** 范围 |
|
||||
|
||||
#### 2.2.2 文本块(type: "text")
|
||||
|
||||
```json
|
||||
{
|
||||
"type": "text",
|
||||
"text": "段落正文内容...",
|
||||
"text_level": 0,
|
||||
"page_idx": 0,
|
||||
"bbox": [72, 120, 540, 145]
|
||||
}
|
||||
```
|
||||
|
||||
| 字段 | 类型 | 说明 |
|
||||
|------|------|------|
|
||||
| `text` | `string` | 文本内容 |
|
||||
| `text_level` | `int \| null` | 标题级别:`null` 或 `0` = 正文,`1` = 一级标题,`2` = 二级标题,依此类推 |
|
||||
|
||||
#### 2.2.3 图片块(type: "image")
|
||||
|
||||
```json
|
||||
{
|
||||
"type": "image",
|
||||
"img_path": "images/img_0_0.png",
|
||||
"image_caption": ["Figure 1: System architecture"],
|
||||
"image_footnote": ["Source: internal report"],
|
||||
"page_idx": 1,
|
||||
"bbox": [100, 200, 500, 600]
|
||||
}
|
||||
```
|
||||
|
||||
| 字段 | 类型 | 说明 |
|
||||
|------|------|------|
|
||||
| `img_path` | `string` | 图片文件相对路径 |
|
||||
| `image_caption` | `string[]` | 图片标题列表 |
|
||||
| `image_footnote` | `string[]` | 图片脚注列表 |
|
||||
|
||||
#### 2.2.4 表格块(type: "table")
|
||||
|
||||
```json
|
||||
{
|
||||
"type": "table",
|
||||
"img_path": "images/table_0_1.png",
|
||||
"table_body": "<html><body><table><tr><td>...</td></tr></table></body></html>",
|
||||
"table_caption": ["Table 1: Performance comparison"],
|
||||
"table_footnote": ["* p < 0.05"],
|
||||
"page_idx": 2,
|
||||
"bbox": [50, 300, 950, 700]
|
||||
}
|
||||
```
|
||||
|
||||
| 字段 | 类型 | 说明 |
|
||||
|------|------|------|
|
||||
| `img_path` | `string` | 表格截图相对路径 |
|
||||
| `table_body` | `string` | 表格 HTML 表示(`<table>` 标签) |
|
||||
| `table_caption` | `string[]` | 表格标题列表 |
|
||||
| `table_footnote` | `string[]` | 表格脚注列表 |
|
||||
|
||||
#### 2.2.5 公式块(type: "equation")
|
||||
|
||||
```json
|
||||
{
|
||||
"type": "equation",
|
||||
"text": "E = mc^2",
|
||||
"text_format": "latex",
|
||||
"img_path": "images/eq_0_0.png",
|
||||
"page_idx": 3,
|
||||
"bbox": [200, 400, 800, 450]
|
||||
}
|
||||
```
|
||||
|
||||
| 字段 | 类型 | 说明 |
|
||||
|------|------|------|
|
||||
| `text` | `string` | 公式的 LaTeX 表示 |
|
||||
| `text_format` | `string` | 固定值 `"latex"` |
|
||||
| `img_path` | `string` | 公式截图相对路径 |
|
||||
|
||||
#### 2.2.6 代码块(type: "code")— VLM 后端
|
||||
|
||||
```json
|
||||
{
|
||||
"type": "code",
|
||||
"sub_type": "code",
|
||||
"code_body": "def hello():\n print('hello')",
|
||||
"code_caption": ["Listing 1: Example function"],
|
||||
"page_idx": 4,
|
||||
"bbox": [80, 100, 920, 300]
|
||||
}
|
||||
```
|
||||
|
||||
| 字段 | 类型 | 说明 |
|
||||
|------|------|------|
|
||||
| `sub_type` | `string` | `"code"` 或 `"algorithm"` |
|
||||
| `code_body` | `string` | 代码文本内容 |
|
||||
| `code_caption` | `string[]` | 代码块标题(可选) |
|
||||
|
||||
#### 2.2.7 列表块(type: "list")— VLM 后端
|
||||
|
||||
```json
|
||||
{
|
||||
"type": "list",
|
||||
"sub_type": "text",
|
||||
"list_items": ["第一项", "第二项", "第三项"],
|
||||
"page_idx": 5,
|
||||
"bbox": [72, 200, 540, 350]
|
||||
}
|
||||
```
|
||||
|
||||
| 字段 | 类型 | 说明 |
|
||||
|------|------|------|
|
||||
| `sub_type` | `string` | `"text"` 或 `"ref_text"`(参考文献列表) |
|
||||
| `list_items` | `string[]` | 列表项内容 |
|
||||
|
||||
---
|
||||
|
||||
### 2.3 middle.json 字段规范
|
||||
|
||||
`middle.json` 是 MinerU 的富元数据中间格式,保留完整的文档层级结构。
|
||||
|
||||
#### 2.3.1 顶层结构
|
||||
|
||||
```json
|
||||
{
|
||||
"_backend": "pipeline | vlm | hybrid",
|
||||
"_version_name": "2.7.4",
|
||||
"pdf_info": [ ... ]
|
||||
}
|
||||
```
|
||||
|
||||
| 字段 | 类型 | 说明 |
|
||||
|------|------|------|
|
||||
| `_backend` | `string` | 使用的解析后端 |
|
||||
| `_version_name` | `string` | MinerU 版本标识 |
|
||||
| `pdf_info` | `array` | 按页组织的解析结果数组 |
|
||||
|
||||
#### 2.3.2 页级结构(pdf_info 数组元素)
|
||||
|
||||
```json
|
||||
{
|
||||
"page_idx": 0,
|
||||
"page_size": [595.0, 842.0],
|
||||
"preproc_blocks": [ ... ],
|
||||
"para_blocks": [ ... ],
|
||||
"images": [ ... ],
|
||||
"tables": [ ... ],
|
||||
"interline_equations": [ ... ],
|
||||
"discarded_blocks": [ ... ]
|
||||
}
|
||||
```
|
||||
|
||||
| 字段 | 类型 | 说明 |
|
||||
|------|------|------|
|
||||
| `page_idx` | `int` | 页码(0-indexed) |
|
||||
| `page_size` | `[float, float]` | 页面尺寸 `[宽, 高]`(原始 PDF 坐标系,单位 pt) |
|
||||
| `preproc_blocks` | `array` | 未分段的预处理块 |
|
||||
| `para_blocks` | `array` | **已分段的内容块**(主输出) |
|
||||
| `images` | `array` | 提取的图片块 |
|
||||
| `tables` | `array` | 提取的表格块 |
|
||||
| `interline_equations` | `array` | 行间公式块 |
|
||||
| `discarded_blocks` | `array` | 被过滤的内容(页眉、页脚、页码等) |
|
||||
|
||||
#### 2.3.3 内容块层级结构
|
||||
|
||||
内容块采用三级层级:**Block → Line → Span**
|
||||
|
||||
**一级块(Level 1)— 容器块:**
|
||||
|
||||
```json
|
||||
{
|
||||
"type": "table",
|
||||
"bbox": [x0, y0, x1, y1],
|
||||
"blocks": [ ... ]
|
||||
}
|
||||
```
|
||||
|
||||
| 字段 | 类型 | 说明 |
|
||||
|------|------|------|
|
||||
| `type` | `string` | `"table"` 或 `"image"` |
|
||||
| `bbox` | `[x0, y0, x1, y1]` | 边界框坐标(原始 PDF 坐标系) |
|
||||
| `blocks` | `array` | 包含的二级块 |
|
||||
|
||||
**二级块(Level 2)— 语义块:**
|
||||
|
||||
```json
|
||||
{
|
||||
"type": "text",
|
||||
"bbox": [x0, y0, x1, y1],
|
||||
"lines": [ ... ]
|
||||
}
|
||||
```
|
||||
|
||||
| `type` 值 | 说明 |
|
||||
|-----------|------|
|
||||
| `text` | 正文段落 |
|
||||
| `title` | 标题 |
|
||||
| `image_body` | 图片主体 |
|
||||
| `image_caption` | 图片标题 |
|
||||
| `image_footnote` | 图片脚注 |
|
||||
| `table_body` | 表格主体 |
|
||||
| `table_caption` | 表格标题 |
|
||||
| `table_footnote` | 表格脚注 |
|
||||
| `interline_equation` | 行间公式 |
|
||||
| `index` | 目录项 |
|
||||
| `list` | 列表项 |
|
||||
|
||||
**行结构(Line):**
|
||||
|
||||
```json
|
||||
{
|
||||
"bbox": [x0, y0, x1, y1],
|
||||
"spans": [ ... ]
|
||||
}
|
||||
```
|
||||
|
||||
**Span 结构(最小粒度):**
|
||||
|
||||
```json
|
||||
{
|
||||
"bbox": [x0, y0, x1, y1],
|
||||
"type": "text",
|
||||
"content": "具体文本内容",
|
||||
"score": 0.95
|
||||
}
|
||||
```
|
||||
|
||||
| 字段 | 类型 | 说明 |
|
||||
|------|------|------|
|
||||
| `bbox` | `[x0, y0, x1, y1]` | 边界框坐标 |
|
||||
| `type` | `string` | `text` / `image` / `table` / `inline_equation` / `interline_equation` |
|
||||
| `content` | `string` | 文本内容(text 类型)|
|
||||
| `img_path` | `string` | 图片路径(image/table 类型)|
|
||||
| `score` | `float` | 模型置信度(0.0~1.0) |
|
||||
|
||||
---
|
||||
|
||||
### 2.4 Markdown 输出规范
|
||||
|
||||
| 文件 | 特点 |
|
||||
|------|------|
|
||||
| `auto/auto.md` | 图片以 `` 引用;表格保留为 Markdown 表格或 HTML;公式使用 `$...$` 和 `$$...$$` 定界符 |
|
||||
| `auto_nlp/auto_nlp.md` | 纯文本,图片/表格替换为占位文本描述;适合直接送入 NLP 管道 |
|
||||
|
||||
---
|
||||
|
||||
### 2.5 调试与可视化文件
|
||||
|
||||
| 文件 | 格式 | 说明 |
|
||||
|------|------|------|
|
||||
| `layout.pdf` | PDF | 每页叠加带编号的检测框,不同颜色区分内容类型,验证布局分析准确性和阅读顺序 |
|
||||
| `span.pdf` | PDF | 用不同颜色线框标注页面内容的 span 类型(仅 Pipeline 后端),排查文本丢失和公式识别问题 |
|
||||
| `model.json` | JSON | 原始模型推理结果,包含 `category_id`、`poly`(四边形坐标)、`score`(置信度) |
|
||||
|
||||
---
|
||||
|
||||
## 三、布局信息规范
|
||||
|
||||
### 3.1 坐标系定义
|
||||
|
||||
MinerU 使用两套坐标系,取决于输出文件:
|
||||
|
||||
| 坐标系 | 适用文件 | 范围 | 原点 | 说明 |
|
||||
|--------|---------|------|------|------|
|
||||
| **归一化坐标** | `content_list.json` | `0 – 1000` | 左上角 | 页面宽高均映射到 0~1000 |
|
||||
| **原始 PDF 坐标** | `middle.json` | 实际 pt 值 | 左上角 | 与 PDF 页面尺寸一致(如 A4 = 595×842) |
|
||||
| **归一化比例坐标** | `model.json`(VLM) | `0.0 – 1.0` | 左上角 | 宽高均映射到 0~1 |
|
||||
|
||||
**bbox 格式统一为:`[x0, y0, x1, y1]`**
|
||||
|
||||
```
|
||||
(x0, y0) ─────────────────── (x1, y0)
|
||||
│ │
|
||||
│ 内容区域 │
|
||||
│ │
|
||||
(x0, y1) ─────────────────── (x1, y1)
|
||||
```
|
||||
|
||||
- `x0, y0`:左上角坐标
|
||||
- `x1, y1`:右下角坐标
|
||||
|
||||
### 3.2 布局分类体系(Pipeline 后端)
|
||||
|
||||
`model.json` 中的 `category_id` 枚举:
|
||||
|
||||
| category_id | 类型 | 说明 |
|
||||
|-------------|------|------|
|
||||
| 0 | `title` | 标题 |
|
||||
| 1 | `plain_text` | 正文文本 |
|
||||
| 2 | `abandon` | 丢弃区域(页眉/页脚/页码等) |
|
||||
| 3 | `figure` | 图片 |
|
||||
| 4 | `figure_caption` | 图片标题 |
|
||||
| 5 | `table` | 表格 |
|
||||
| 6 | `table_caption` | 表格标题 |
|
||||
| 7 | `table_footnote` | 表格脚注 |
|
||||
| 8 | `isolate_formula` | 独立行间公式 |
|
||||
| 9 | `formula_caption` | 公式标题 |
|
||||
| 13 | `embedding` | 嵌入内容 |
|
||||
| 14 | `isolated` | 隔离内容 |
|
||||
| 15 | `OCR_text` | OCR 识别文本 |
|
||||
|
||||
### 3.3 布局分类体系(VLM 后端)
|
||||
|
||||
VLM 后端使用字符串类型标识,分类更细:
|
||||
|
||||
| type 值 | 说明 |
|
||||
|---------|------|
|
||||
| `text` | 正文 |
|
||||
| `title` | 标题 |
|
||||
| `equation` | 公式 |
|
||||
| `image` | 图片 |
|
||||
| `image_caption` | 图片标题 |
|
||||
| `image_footnote` | 图片脚注 |
|
||||
| `table` | 表格 |
|
||||
| `table_caption` | 表格标题 |
|
||||
| `table_footnote` | 表格脚注 |
|
||||
| `code` | 代码块 |
|
||||
| `code_caption` | 代码标题 |
|
||||
| `list` | 列表 |
|
||||
| `header` | 页眉(discarded) |
|
||||
| `footer` | 页脚(discarded) |
|
||||
| `page_number` | 页码(discarded) |
|
||||
| `aside_text` | 边栏文字(discarded) |
|
||||
| `page_footnote` | 页面脚注(discarded) |
|
||||
| `ref_text` | 参考文献 |
|
||||
| `algorithm` | 算法伪代码 |
|
||||
| `phonetic` | 注音 |
|
||||
|
||||
### 3.4 内容层级与标题级别
|
||||
|
||||
`content_list.json` 中的 `text_level` 字段标识文档结构层级:
|
||||
|
||||
| text_level | 含义 | 对应 Markdown |
|
||||
|------------|------|--------------|
|
||||
| `null` 或 `0` | 正文 | 无标记 |
|
||||
| `1` | 一级标题 | `# Heading` |
|
||||
| `2` | 二级标题 | `## Heading` |
|
||||
| `3` | 三级标题 | `### Heading` |
|
||||
| `4` | 四级标题 | `#### Heading` |
|
||||
| `5+` | 更深层标题 | `#####+ Heading` |
|
||||
|
||||
### 3.5 布局精度提取指南
|
||||
|
||||
针对不同数据类型的精确提取建议:
|
||||
|
||||
#### 文本提取
|
||||
|
||||
```python
# Collect every text block from content_list.json (headings included)
texts = [
    block for block in content_list
    if block["type"] == "text"
]
# Filter by page
page_0_texts = [b for b in texts if b["page_idx"] == 0]
```
|
||||
|
||||
#### 标题层级提取
|
||||
|
||||
```python
# Extract the document outline
headings = [
    {"level": block["text_level"], "text": block["text"], "page": block["page_idx"]}
    for block in content_list
    if block["type"] == "text" and block.get("text_level") and block["text_level"] >= 1
]
```
|
||||
|
||||
#### 表格数值提取
|
||||
|
||||
```python
# Tables are stored as HTML in table_body and can be parsed with BeautifulSoup
from bs4 import BeautifulSoup

tables = [b for b in content_list if b["type"] == "table"]
for table in tables:
    soup = BeautifulSoup(table["table_body"], "html.parser")
    rows = []
    for tr in soup.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all(["td", "th"])]
        rows.append(cells)
```
|
||||
|
||||
#### 空间位置定位
|
||||
|
||||
```python
# Use bbox to locate content on the page
def get_position(bbox, threshold=500):
    """Report whether the content is in the upper or lower half of the page (normalized 0-1000 coordinates)."""
    y_center = (bbox[1] + bbox[3]) / 2
    return "upper" if y_center < threshold else "lower"

# Check whether two blocks are horizontally adjacent (same row)
def is_same_row(block_a, block_b, tolerance=20):
    return abs(block_a["bbox"][1] - block_b["bbox"][1]) < tolerance
```
|
||||
|
||||
---
|
||||
|
||||
## 四、云端 API MVP 必要字段
|
||||
|
||||
### 4.1 认证配置
|
||||
|
||||
| 配置项 | 值 | 获取方式 |
|
||||
|--------|-----|---------|
|
||||
| Token | Bearer Token 字符串 | [mineru.net/apiManage/token](https://mineru.net/apiManage/token) 注册后获取 |
|
||||
|
||||
**请求头格式(所有接口通用):**
|
||||
|
||||
```
|
||||
Authorization: Bearer {your_token}
|
||||
Content-Type: application/json
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### 4.2 创建解析任务 — 请求规范
|
||||
|
||||
**接口:** `POST https://mineru.net/api/v4/extract/task`
|
||||
|
||||
#### 请求体字段
|
||||
|
||||
| 字段 | 类型 | 必填 | 默认值 | 说明 |
|
||||
|------|------|------|--------|------|
|
||||
| `url` | `string` | **是** | — | 待解析文件的公网可访问 URL |
|
||||
| `is_ocr` | `bool` | 否 | `false` | 是否强制启用 OCR(扫描件建议开启) |
|
||||
| `enable_formula` | `bool` | 否 | `true` | 是否启用公式识别 |
|
||||
| `enable_table` | `bool` | 否 | `true` | 是否启用表格识别 |
|
||||
| `language` | `string` | 否 | `"zh"` | 文档主语言代码 |
|
||||
| `model` | `string` | 否 | 自动选择 | 模型版本:`pipeline` / `vlm` / `MinerU-HTML` |
|
||||
| `data_id` | `string` | 否 | — | 自定义业务标识(用于关联追踪) |
|
||||
| `callback_url` | `string` | 否 | — | 任务完成后的回调通知 URL |
|
||||
|
||||
#### MVP 最小必填字段
|
||||
|
||||
```json
|
||||
{
|
||||
"url": "https://example.com/document.pdf"
|
||||
}
|
||||
```
|
||||
|
||||
> 仅 `url` 为必填,其余参数均有合理默认值。
|
||||
|
||||
---
|
||||
|
||||
### 4.3 查询任务结果 — 响应规范
|
||||
|
||||
**接口:** `GET https://mineru.net/api/v4/extract/task/{task_id}`
|
||||
|
||||
#### 响应体字段
|
||||
|
||||
| 字段 | 类型 | 说明 |
|
||||
|------|------|------|
|
||||
| `task_id` | `string` | 任务唯一标识 |
|
||||
| `state` | `string` | 任务状态(见下方枚举) |
|
||||
| `err_msg` | `string \| null` | 错误信息(失败时) |
|
||||
| `full_zip_url` | `string \| null` | 完整输出 ZIP 下载地址(成功时) |
|
||||
| `file_name` | `string` | 原始文件名 |
|
||||
| `batch_id` | `string \| null` | 批量任务 ID(如有) |
|
||||
|
||||
#### 任务状态枚举
|
||||
|
||||
| state | Description |
|-------|------|
| `pending` | Queued, waiting to be parsed |
| `running` | Parsing in progress |
| `done` | Parsing finished |
| `failed` | Parsing failed (see `err_msg`) |
|
||||
|
||||
---
|
||||
|
||||
### 4.4 批量任务接口
|
||||
|
||||
#### 4.4.1 批量获取上传 URL
|
||||
|
||||
**接口:** `POST https://mineru.net/api/v4/file-urls/batch`
|
||||
|
||||
用于获取文件上传的预签名 URL(适合本地文件上传场景)。
|
||||
|
||||
#### 4.4.2 批量创建任务
|
||||
|
||||
**接口:** `POST https://mineru.net/api/v4/extract/task/batch`
|
||||
|
||||
请求体中 `files` 数组包含多个文件的解析参数。
|
||||
|
||||
#### 4.4.3 批量查询结果
|
||||
|
||||
**接口:** `GET https://mineru.net/api/v4/extract-results/batch/{batch_id}`
|
||||
|
||||
---
|
||||
|
||||
### 4.5 MVP 最小可用请求示例
|
||||
|
||||
#### Python 实现
|
||||
|
||||
```python
import io
import os
import time
import zipfile

import requests

MINERU_API_TOKEN = os.getenv("MINERU_API_TOKEN")
BASE_URL = "https://mineru.net/api/v4"
HEADERS = {
    "Authorization": f"Bearer {MINERU_API_TOKEN}",
    "Content-Type": "application/json",
}

# ① Create the parsing task (url is the only required field)
resp = requests.post(
    f"{BASE_URL}/extract/task",
    headers=HEADERS,
    json={
        "url": "https://example.com/sample.pdf",  # required: public file URL
        # "is_ocr": False,          # optional, default false
        # "enable_formula": True,   # optional, default true
        # "enable_table": True,     # optional, default true
        # "language": "zh",         # optional, default Chinese
    },
    timeout=30,
)
resp.raise_for_status()
# Responses are wrapped as {"code", "msg", "trace_id", "data"}
task_id = resp.json()["data"]["task_id"]
print(f"Task created: {task_id}")

# ② Poll for the result
while True:
    result = requests.get(
        f"{BASE_URL}/extract/task/{task_id}",
        headers=HEADERS,
        timeout=30,
    ).json()["data"]

    state = result["state"]
    print(f"State: {state}")

    if state == "done":
        zip_url = result["full_zip_url"]
        print(f"Download: {zip_url}")
        break
    elif state == "failed":
        # Stop here so step ③ never runs without a download URL
        raise RuntimeError(f"Extraction failed: {result['err_msg']}")

    time.sleep(5)

# ③ Download and unpack the result
zip_data = requests.get(zip_url, timeout=60).content
with zipfile.ZipFile(io.BytesIO(zip_data)) as zf:
    zf.extractall("./mineru_output/")
    print("Files:", zf.namelist())
```
|
||||
|
||||
#### cURL 实现
|
||||
|
||||
```bash
# Create a task
curl -X POST https://mineru.net/api/v4/extract/task \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/sample.pdf"}'

# Query the result (replace {task_id} with the ID returned by the call above)
curl https://mineru.net/api/v4/extract/task/{task_id} \
  -H "Authorization: Bearer YOUR_TOKEN"
```
|
||||
|
||||
#### MVP 检查清单
|
||||
|
||||
- [ ] 已在 [mineru.net](https://mineru.net/) 注册账号
|
||||
- [ ] 已在 [Token 管理页](https://mineru.net/apiManage/token) 获取 API Token
|
||||
- [ ] 已将 Token 配置到 `.env` 文件:`MINERU_API_TOKEN=xxx`
|
||||
- [ ] 准备了公网可访问的测试文件 URL(PDF/DOCX/PPT/图片)
|
||||
- [ ] 安装了 `requests` 库:`pip install requests`
|
||||
1442
docs/product_requirements_document-v1.0.md
Normal file
File diff suppressed because it is too large