Full-stack application for document-to-knowledge-graph pipeline: - Backend: FastAPI + LangGraph ReAct agent + DeepSeek + MinerU parsing - Frontend: React 19 + Vite + D3.js + shadcn/ui - Pipeline: MinerU parsing → LangExtract entity extraction → KG building Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
482 lines
17 KiB
Markdown
482 lines
17 KiB
Markdown
# Bridge Pipeline Specification v1.0
|
||
|
||
> GraphRAG 索引阶段核心流程:MinerU → LangExtract → Knowledge Graph
|
||
|
||
---
|
||
|
||
## 1. Pipeline 执行思路
|
||
|
||
### 1.1 整体架构
|
||
|
||
Bridge Pipeline 是 GraphRAG 索引阶段的核心流程,负责将 MinerU 解析后的结构化 PDF 内容送入 LangExtract 完成实体抽取,最终生成知识图谱的节点(Nodes)和边(Edges)。
|
||
|
||
```
|
||
MinerU output Bridge Pipeline KG output
|
||
───────────── ─────────────── ─────────
|
||
{uuid}_content_list.json → text_assembler.py
|
||
├─ text blocks ├─ 按页拼接纯文本
|
||
└─ table blocks (HTML) ├─ HTML表格→纯文本
|
||
└─ 记录每个block的char偏移
|
||
→ entity_extractor.py
|
||
├─ 逐页调用 lx.extract()
|
||
└─ DeepSeek via OpenAI Provider
|
||
→ kg_builder.py
|
||
├─ 过滤低质量对齐 → kg_nodes.json
|
||
├─ 节点去重 (name.lower(), type)
|
||
└─ 同页实体对→CO_OCCURS_IN边 → kg_edges.json
|
||
```
|
||
|
||
### 1.2 五步执行流程
|
||
|
||
| 步骤 | 模块 | 说明 |
|
||
|------|------|------|
|
||
| Step 1 | `bridge.py` | 加载 MinerU 输出 `content_list.json`,解析输入路径和 source_doc_id |
|
||
| Step 2 | `text_assembler.py` | 按 `page_idx` 分组,拼接纯文本,记录每个 block 的字符偏移 |
|
||
| Step 3 | `entity_extractor.py` | 逐页调用 LangExtract + DeepSeek 完成实体抽取 |
|
||
| Step 4 | `kg_builder.py` | 过滤低质量对齐 → 节点去重 → 同页配对生成 CO_OCCURS_IN 边 |
|
||
| Step 5 | `bridge.py` | 保存 `kg_nodes.json` + `kg_edges.json` 到 output 目录 |
|
||
|
||
### 1.3 文件存放位置
|
||
|
||
```
|
||
F:\GraphRAGAgent\graphrag_pipeline\
|
||
├── .env # DeepSeek API 配置
|
||
├── CLAUDE.md # 组件开发规范
|
||
├── bridge.py # 主入口(串联完整 Pipeline)
|
||
├── text_assembler.py # MinerU JSON → 按页纯文本 + 偏移映射
|
||
├── entity_extractor.py # LangExtract + DeepSeek 封装
|
||
├── kg_builder.py # KG 节点去重 + 边生成
|
||
└── output/
|
||
├── kg_nodes.json # 知识图谱节点(9,851 bytes)
|
||
└── kg_edges.json # 知识图谱边(129,093 bytes)
|
||
```
|
||
|
||
### 1.4 运行命令
|
||
|
||
```bash
|
||
# 使用默认测试输入
|
||
F:/GraphRAGAgent/langextract_src/.venv/Scripts/python.exe F:/GraphRAGAgent/graphrag_pipeline/bridge.py
|
||
|
||
# 指定输入文件
|
||
F:/GraphRAGAgent/langextract_src/.venv/Scripts/python.exe F:/GraphRAGAgent/graphrag_pipeline/bridge.py path/to/content_list.json
|
||
|
||
# 指定输入目录(自动查找 *_content_list.json)
|
||
F:/GraphRAGAgent/langextract_src/.venv/Scripts/python.exe F:/GraphRAGAgent/graphrag_pipeline/bridge.py path/to/output_dir/
|
||
```
|
||
|
||
---
|
||
|
||
## 2. 实际本地输出文档规范
|
||
|
||
### 2.1 测试运行结果
|
||
|
||
- **输入文件**: `F:\GraphRAGAgent\mineru_mvp\output\test_sample\8a719db4-2b50-405b-826d-7bb27b224fa0_content_list.json`
|
||
- **输入规模**: 10 blocks(9 text + 1 table),1 页,2102 字符
|
||
- **抽取结果**: 45 raw extractions → 40 去重节点,780 CO_OCCURS_IN 边
|
||
- **对齐质量**: 全部 40 节点均为 `match_exact`(1 个 `match_fuzzy` 已被过滤)
|
||
- **执行时间**: ~22s(DeepSeek API 调用)
|
||
|
||
### 2.2 kg_nodes.json — 实际输出
|
||
|
||
**文件大小**: 9,851 bytes | **节点数**: 40
|
||
|
||
**节点类型分布**:
|
||
|
||
| 类型 | 数量 | 示例 |
|
||
|------|------|------|
|
||
| TECHNOLOGY | 4 | GraphRAG, MinerU, LLMs, LangExtract |
|
||
| CONCEPT | 36 | knowledge graphs, retrieval-augmented generation, multi-hop reasoning |
|
||
|
||
**节点格式(实际样例)**:
|
||
|
||
```json
|
||
{
|
||
"id": "node_0",
|
||
"name": "GraphRAG",
|
||
"type": "TECHNOLOGY",
|
||
"source_doc": "8a719db4-2b50-405b-826d-7bb27b224fa0",
|
||
"char_start": 0,
|
||
"char_end": 8,
|
||
"confidence": "match_exact",
|
||
"page": 0
|
||
}
|
||
```
|
||
|
||
**完整节点列表(前 10 个)**:
|
||
|
||
| id | name | type | confidence |
|
||
|----|------|------|-----------|
|
||
| node_0 | GraphRAG | TECHNOLOGY | match_exact |
|
||
| node_1 | Knowledge Graph Enhanced RAG System | CONCEPT | match_exact |
|
||
| node_2 | retrieval-augmented generation | CONCEPT | match_exact |
|
||
| node_3 | knowledge graphs | CONCEPT | match_exact |
|
||
| node_4 | large language models | CONCEPT | match_exact |
|
||
| node_5 | question answering | CONCEPT | match_exact |
|
||
| node_6 | document collections | CONCEPT | match_exact |
|
||
| node_7 | RAG systems | CONCEPT | match_exact |
|
||
| node_8 | vector similarity search | CONCEPT | match_exact |
|
||
| node_9 | hierarchical knowledge graph | CONCEPT | match_exact |
|
||
|
||
### 2.3 kg_edges.json — 实际输出
|
||
|
||
**文件大小**: 129,093 bytes | **边数**: 780
|
||
|
||
**数学验证**: 40 个节点全部在同一页 → C(40,2) = 40×39/2 = 780 条边 ✓
|
||
|
||
**边格式(实际样例)**:
|
||
|
||
```json
|
||
{
|
||
"source": "node_0",
|
||
"target": "node_1",
|
||
"relation": "CO_OCCURS_IN",
|
||
"doc_id": "8a719db4-2b50-405b-826d-7bb27b224fa0",
|
||
"page": 0
|
||
}
|
||
```
|
||
|
||
**完整性校验结果**:
|
||
- 自环数: 0 ✓
|
||
- 重复边数: 0 ✓
|
||
- 关系类型: 全部为 `CO_OCCURS_IN` ✓
|
||
|
||
---
|
||
|
||
## 3. MinerU Pipeline 关键参数规范
|
||
|
||
### 3.1 输入格式:content_list.json
|
||
|
||
MinerU 解析 PDF 后输出的 `{uuid}_content_list.json` 是一个 JSON 数组,每个元素代表一个内容块。
|
||
|
||
**text block 结构**:
|
||
|
||
```json
|
||
{
|
||
"type": "text",
|
||
"text": "GraphRAG: Knowledge Graph Enhanced RAG System...",
|
||
"text_level": null,
|
||
"page_idx": 0,
|
||
"bbox": [72, 43, 523, 57]
|
||
}
|
||
```
|
||
|
||
| 字段 | 类型 | 说明 |
|
||
|------|------|------|
|
||
| `type` | string | 块类型:`"text"` \| `"table"` \| `"image"` |
|
||
| `text` | string | 文本内容(末尾可能有空格) |
|
||
| `text_level` | int \| null | `null`=正文,`1`=一级标题 |
|
||
| `page_idx` | int | 页码(从 0 开始) |
|
||
| `bbox` | list[int] | 边界框坐标 `[x0, y0, x1, y1]`(归一化 0-1000) |
|
||
|
||
**table block 结构**:
|
||
|
||
```json
|
||
{
|
||
"type": "table",
|
||
"table_body": "<table><tr><th>Method</th><th>Score</th></tr>...</table>",
|
||
"table_caption": [],
|
||
"page_idx": 0,
|
||
"bbox": [72, 400, 523, 500]
|
||
}
|
||
```
|
||
|
||
| 字段 | 类型 | 说明 |
|
||
|------|------|------|
|
||
| `table_body` | string | HTML `<table>` 标签完整内容 |
|
||
| `table_caption` | list | 表格标题(通常为空数组) |
|
||
|
||
### 3.2 关键约束
|
||
|
||
- 文件命名: `{uuid}_content_list.json`,UUID 用作 source_doc_id
|
||
- block 排列顺序与 PDF 阅读顺序一致
|
||
- `text` 字段末尾可能有多余空格,需 `.rstrip()` 处理
|
||
- `image` 类型块不含可提取文本,Bridge 跳过处理
|
||
|
||
---
|
||
|
||
## 4. LangExtract Pipeline 关键参数规范
|
||
|
||
### 4.1 模型配置
|
||
|
||
```python
|
||
from langextract.providers.openai import OpenAILanguageModel
|
||
|
||
model = OpenAILanguageModel(
|
||
model_id="deepseek-chat",
|
||
api_key=DEEPSEEK_API_KEY,
|
||
base_url="https://api.deepseek.com",
|
||
)
|
||
```
|
||
|
||
**重要**: 必须直接实例化 `OpenAILanguageModel`,不能使用 `model_id` 路由。LangExtract 的 `model_id` 同时用于内部路由和 API 请求参数,DeepSeek 不识别 GPT 模型名称。
|
||
|
||
### 4.2 抽取调用
|
||
|
||
```python
|
||
result = lx.extract(
|
||
text_or_documents=page_text, # 纯文本字符串
|
||
prompt_description=PROMPT, # 实体类型描述
|
||
examples=EXAMPLES, # Few-shot 示例
|
||
model=model, # 直接传入模型实例
|
||
show_progress=True,
|
||
)
|
||
```
|
||
|
||
### 4.3 Prompt 配置
|
||
|
||
```
|
||
Extract named entities from the text in order of appearance.
|
||
Entity types:
|
||
TECHNOLOGY — software, algorithms, models, tools
|
||
ORGANIZATION — companies, research groups, institutions
|
||
PERSON — individual people
|
||
LOCATION — places, geographic entities
|
||
CONCEPT — technical concepts, methodologies, frameworks
|
||
```
|
||
|
||
### 4.4 Few-shot 示例
|
||
|
||
验证可用的示例(MVP 测试 94.1% match_exact):
|
||
|
||
```python
|
||
lx.data.ExampleData(
|
||
text="LangChain is a framework created by Harrison Chase for building "
|
||
"LLM applications. It integrates with OpenAI models and Pinecone "
|
||
"vector database for semantic search.",
|
||
extractions=[
|
||
lx.data.Extraction(extraction_class="TECHNOLOGY", extraction_text="LangChain"),
|
||
lx.data.Extraction(extraction_class="PERSON", extraction_text="Harrison Chase"),
|
||
lx.data.Extraction(extraction_class="CONCEPT", extraction_text="LLM applications"),
|
||
lx.data.Extraction(extraction_class="TECHNOLOGY", extraction_text="OpenAI models"),
|
||
lx.data.Extraction(extraction_class="TECHNOLOGY", extraction_text="Pinecone"),
|
||
lx.data.Extraction(extraction_class="CONCEPT", extraction_text="semantic search"),
|
||
],
|
||
)
|
||
```
|
||
|
||
### 4.5 输出格式:AnnotatedDocument
|
||
|
||
每页抽取返回一个 `AnnotatedDocument`,其 `extractions` 列表中每个元素包含:
|
||
|
||
| 字段 | 类型 | 说明 |
|
||
|------|------|------|
|
||
| `extraction_text` | string | 实体名称(必须为输入文本的精确子串) |
|
||
| `extraction_class` | string | 实体类型(TECHNOLOGY/ORGANIZATION/PERSON/LOCATION/CONCEPT) |
|
||
| `char_interval.start_pos` | int | 在输入文本中的起始字符位置 |
|
||
| `char_interval.end_pos` | int | 在输入文本中的结束字符位置 |
|
||
| `alignment_status` | enum | 对齐质量:`match_exact` \| `match_greater` \| `match_lesser` \| `match_fuzzy` \| `None` |
|
||
| `extraction_index` | int | 抽取序号(从 1 开始) |
|
||
| `group_index` | int | 组序号(从 0 开始) |
|
||
|
||
### 4.6 对齐质量过滤规则
|
||
|
||
| alignment_status | 含义 | Bridge 处理 |
|
||
|-----------------|------|------------|
|
||
| `match_exact` | LLM 输出与原文完全匹配 | ✅ 接受 |
|
||
| `match_greater` | LLM 输出是原文子串的超集 | ✅ 接受 |
|
||
| `match_lesser` | LLM 输出是原文子串的子集 | ✅ 接受 |
|
||
| `match_fuzzy` | 模糊匹配,偏移不可靠 | ❌ 过滤 |
|
||
| `None` | 无法对齐 | ❌ 过滤 |
|
||
|
||
---
|
||
|
||
## 5. MinerU ↔ LangExtract 接口对接规范
|
||
|
||
### 5.1 核心挑战
|
||
|
||
MinerU 输出结构化 JSON 块(含 HTML 表格),而 LangExtract 仅接受纯文本 `str`。Bridge 的 `text_assembler` 模块负责转换和偏移映射。
|
||
|
||
### 5.2 对接转换规则
|
||
|
||
| 对接点 | MinerU 规范 | LangExtract 规范 | Bridge 处理 |
|
||
|--------|------------|-----------------|------------|
|
||
| 输入格式 | `content_list.json`(JSON 数组) | 仅接受纯文本 `str` | `text_assembler` 拼接转换 |
|
||
| 文本块 | `block["text"]`,末尾可能有空格 | `extraction_text` 须为原文精确子串 | `.rstrip()` 去尾部空格 |
|
||
| 表格块 | `table_body` 是 `<table>` HTML | 不接受 HTML | BeautifulSoup 转 pipe 分隔纯文本 |
|
||
| 标题判断 | `text_level` 缺失=正文,存在=标题 | 不区分标题/正文 | 标题和正文一起拼入文本 |
|
||
| 坐标系 | bbox 归一化 0-1000 | char_interval 基于输入字符 | BlockSpan 记录偏移映射 |
|
||
| 分页 | `page_idx` 区分不同页 | 单次调用处理一段文本 | 逐页分别调用 `lx.extract()` |
|
||
| 文件名 | `{uuid}_content_list.json` | — | glob `*_content_list.json` 匹配 |
|
||
|
||
### 5.3 文本拼接算法
|
||
|
||
```
|
||
输入: content_list (按 page_idx 分组)
|
||
输出: PageText 列表
|
||
|
||
对每页:
|
||
cursor = 0
|
||
对每个 block (保持原顺序):
|
||
if type == "text":
|
||
block_text = block["text"].rstrip()
|
||
elif type == "table":
|
||
block_text = html_table_to_text(block["table_body"])
|
||
else:
|
||
跳过 (image / equation 等)
|
||
|
||
记录 BlockSpan(char_start=cursor, char_end=cursor+len(block_text))
|
||
buffer.append(block_text + "\n")
|
||
cursor += len(block_text) + 1
|
||
|
||
PageText.text = "".join(buffer).rstrip("\n")
|
||
```
|
||
|
||
### 5.4 偏移映射数据结构
|
||
|
||
```python
|
||
@dataclasses.dataclass
|
||
class BlockSpan:
|
||
block_index: int # content_list 数组下标
|
||
block_type: str # "text" | "table"
|
||
page_idx: int # 页码
|
||
char_start: int # 在拼接文本中的起始位置
|
||
char_end: int # 在拼接文本中的结束位置(不含)
|
||
bbox: list[int] # MinerU 原始 bbox
|
||
|
||
@dataclasses.dataclass
|
||
class PageText:
|
||
page_idx: int # 页码
|
||
text: str # 拼接后的纯文本
|
||
block_spans: list[BlockSpan] # 每个 block 在 text 中的位置
|
||
```
|
||
|
||
### 5.5 HTML 表格转换
|
||
|
||
```python
|
||
def html_table_to_text(table_body: str) -> str:
|
||
"""Convert <table> HTML → pipe-delimited plain text"""
|
||
soup = BeautifulSoup(table_body, "html.parser")
|
||
rows = []
|
||
for tr in soup.find_all("tr"):
|
||
cells = [td.get_text(strip=True) for td in tr.find_all(["td", "th"])]
|
||
rows.append(" | ".join(cells))
|
||
return "\n".join(rows)
|
||
```
|
||
|
||
转换示例:
|
||
|
||
```html
|
||
<table><tr><th>Method</th><th>Score</th></tr><tr><td>GraphRAG</td><td>0.85</td></tr></table>
|
||
```
|
||
|
||
→
|
||
|
||
```
|
||
Method | Score
|
||
GraphRAG | 0.85
|
||
```
|
||
|
||
---
|
||
|
||
## 6. Bridge Pipeline 最终输出关键参数规范
|
||
|
||
### 6.1 kg_nodes.json
|
||
|
||
**文件路径**: `graphrag_pipeline/output/kg_nodes.json`
|
||
|
||
**结构**: JSON 数组,每个元素为一个去重后的实体节点。
|
||
|
||
| 字段 | 类型 | 说明 | 示例 |
|
||
|------|------|------|------|
|
||
| `id` | string | 节点唯一标识,格式 `node_{index}` | `"node_0"` |
|
||
| `name` | string | 实体名称(原文子串) | `"GraphRAG"` |
|
||
| `type` | string | 实体类型 | `"TECHNOLOGY"` |
|
||
| `source_doc` | string | 来源文档 UUID | `"8a719db4-2b50-405b-826d-7bb27b224fa0"` |
|
||
| `char_start` | int | 在拼接文本中的起始字符位置 | `0` |
|
||
| `char_end` | int | 在拼接文本中的结束字符位置 | `8` |
|
||
| `confidence` | string | 对齐质量(仅 `match_exact`/`match_greater`/`match_lesser`) | `"match_exact"` |
|
||
| `page` | int | 来源页码(从 0 开始) | `0` |
|
||
|
||
**去重规则**: key = `(name.lower(), type)`,保留首次出现的实体。
|
||
|
||
**实体类型枚举**:
|
||
|
||
| 类型 | 说明 |
|
||
|------|------|
|
||
| `TECHNOLOGY` | 软件、算法、模型、工具 |
|
||
| `ORGANIZATION` | 公司、研究机构 |
|
||
| `PERSON` | 个人 |
|
||
| `LOCATION` | 地理位置 |
|
||
| `CONCEPT` | 技术概念、方法论、框架 |
|
||
|
||
### 6.2 kg_edges.json
|
||
|
||
**文件路径**: `graphrag_pipeline/output/kg_edges.json`
|
||
|
||
**结构**: JSON 数组,每个元素为一条同页共现关系边。
|
||
|
||
| 字段 | 类型 | 说明 | 示例 |
|
||
|------|------|------|------|
|
||
| `source` | string | 源节点 ID | `"node_0"` |
|
||
| `target` | string | 目标节点 ID | `"node_1"` |
|
||
| `relation` | string | 关系类型(固定 `"CO_OCCURS_IN"`) | `"CO_OCCURS_IN"` |
|
||
| `doc_id` | string | 来源文档 UUID | `"8a719db4-..."` |
|
||
| `page` | int | 共现页码 | `0` |
|
||
|
||
**边生成规则**:
|
||
1. 按页分组所有去重后的节点 ID
|
||
2. 同页节点两两配对 → 生成 `CO_OCCURS_IN` 边
|
||
3. 边方向规范化: `source < target`(字典序)
|
||
4. 去重 key: `(source, target, doc_id, page)`
|
||
5. 无自环(source ≠ target)
|
||
|
||
**边数公式**: 若某页有 N 个节点,则该页产生 C(N,2) = N×(N-1)/2 条边。
|
||
|
||
### 6.3 输出完整性约束
|
||
|
||
| 约束 | 说明 |
|
||
|------|------|
|
||
| 节点 ID 唯一 | 每个节点的 `id` 字段全局唯一 |
|
||
| 边引用合法 | 每条边的 `source` 和 `target` 必须对应存在的节点 `id` |
|
||
| 无自环 | 不存在 `source == target` 的边 |
|
||
| 无重复边 | 同一 `(source, target, doc_id, page)` 组合仅出现一次 |
|
||
| 对齐质量保证 | 所有节点的 `confidence` 仅为 accepted 值(非 fuzzy/null) |
|
||
| char 偏移有效 | `char_start < char_end`,且可定位到拼接文本中的实体子串 |
|
||
|
||
---
|
||
|
||
## 7. 虚拟环境规范
|
||
|
||
Bridge Pipeline **复用 LangExtract 的虚拟环境**,不单独创建 venv。
|
||
|
||
| 项目 | 值 |
|
||
|------|------|
|
||
| 虚拟环境路径 | `F:\GraphRAGAgent\langextract_src\.venv\` |
|
||
| Python 版本 | 3.12 |
|
||
| 核心依赖 | `langextract[all]`、`beautifulsoup4`、`python-dotenv` |
|
||
| 安装新依赖 | `uv pip install <pkg> --python F:/GraphRAGAgent/langextract_src/.venv/Scripts/python.exe` |
|
||
|
||
**所有 Python 命令必须使用该虚拟环境运行,禁止使用全局 Python 或其他组件的 venv。**
|
||
|
||
---
|
||
|
||
## 8. 环境配置
|
||
|
||
### 8.1 .env 文件
|
||
|
||
位置: `F:\GraphRAGAgent\graphrag_pipeline\.env`
|
||
|
||
```env
|
||
DEEPSEEK_API_KEY=<your-api-key>
|
||
DEEPSEEK_BASE_URL=https://api.deepseek.com
|
||
```
|
||
|
||
### 8.2 依赖安装
|
||
|
||
```bash
|
||
uv pip install beautifulsoup4 python-dotenv --python F:/GraphRAGAgent/langextract_src/.venv/Scripts/python.exe
|
||
```
|
||
|
||
---
|
||
|
||
## 9. 测试验证清单
|
||
|
||
- [x] text_assembler 正确读取 content_list.json(10 blocks: 9 text + 1 table)
|
||
- [x] 表格 HTML 转为 pipe 分隔纯文本,无 HTML 标签残留
|
||
- [x] 按页拼接文本长度合理(2102 字符/页)
|
||
- [x] LangExtract 成功调用 DeepSeek 返回 AnnotatedDocument
|
||
- [x] 抽取实体数 45,match_exact 占比 > 95%
|
||
- [x] kg_nodes.json 节点已去重(40 个),每个节点有完整字段
|
||
- [x] kg_edges.json 边为 CO_OCCURS_IN 关系(780 条),无自环,无重复
|
||
- [x] match_fuzzy 对齐的实体已被过滤(1 个)
|