plf b02d3378fc GraphRAG Studio — initial commit: multimodal RAG system with KG visualization

Full-stack application for document-to-knowledge-graph pipeline:
- Backend: FastAPI + LangGraph ReAct agent + DeepSeek + MinerU parsing
- Frontend: React 19 + Vite + D3.js + shadcn/ui
- Pipeline: MinerU parsing → LangExtract entity extraction → KG building

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-06-07 17:30:04 +08:00

17 KiB


Bridge Pipeline Specification v1.0

Core flow of the GraphRAG indexing stage: MinerU → LangExtract → Knowledge Graph

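1. Pipeline Execution Overview

1.1 Overall Architecture

Bridge Pipeline is the core flow of the GraphRAG indexing stage. It feeds the structured PDF content parsed by MinerU into LangExtract for entity extraction, and finally produces the nodes and edges of the knowledge graph.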

MinerU output                    Bridge Pipeline                              KG output
─────────────                    ───────────────                              ─────────
{uuid}_content_list.json    →    text_assembler.py
  ├─ text blocks                   ├─ concatenate plain text per page
  └─ table blocks (HTML)           ├─ HTML tables → plain text
                                   └─ record the char offsets of each block
                              →    entity_extractor.py
                                   ├─ call lx.extract() page by page
                                   └─ DeepSeek via OpenAI Provider
                              →    kg_builder.py
                                   ├─ filter low-quality alignments          →  kg_nodes.json
                                   ├─ dedupe nodes (name.lower(), type)
                                   └─ same-page pairs → CO_OCCURS_IN edges   →  kg_edges.json

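1.2 Five-Step Execution Flow

Step	Module	Description
Step 1	`bridge.py`	Load the MinerU output `content_list.json`; resolve the input path and source_doc_id
Step 2	`text_assembler.py`	Group blocks by `page_idx`, concatenate plain text, and record the character offsets of each block
Step 3	`entity_extractor.py`	Call LangExtract + DeepSeek page by page to extract entities
Step 4	`kg_builder.py`	Filter low-quality alignments → dedupe nodes → pair same-page nodes into CO_OCCURS_IN edges
Step 5	`bridge.py`	Save `kg_nodes.json` + `kg_edges.json` to the output directory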

1.3 File Layout

F:\GraphRAGAgent\graphrag_pipeline\
├── .env                     # DeepSeek API configuration
├── CLAUDE.md                # Component development guidelines
├── bridge.py                # Main entry point (runs the full pipeline)
├── text_assembler.py        # MinerU JSON → per-page plain text + offset mapping
├── entity_extractor.py      # LangExtract + DeepSeek wrapper
├── kg_builder.py            # KG node deduplication + edge generation
└── output/
    ├── kg_nodes.json        # Knowledge graph nodes (9,851 bytes)
    └── kg_edges.json        # Knowledge graph edges (129,093 bytes)

1.4 Run Commands

# Use the default test input
F:/GraphRAGAgent/langextract_src/.venv/Scripts/python.exe F:/GraphRAGAgent/graphrag_pipeline/bridge.py

# Specify an input file
F:/GraphRAGAgent/langextract_src/.venv/Scripts/python.exe F:/GraphRAGAgent/graphrag_pipeline/bridge.py path/to/content_list.json

# Specify an input directory (automatically looks for *_content_list.json)
F:/GraphRAGAgent/langextract_src/.venv/Scripts/python.exe F:/GraphRAGAgent/graphrag_pipeline/bridge.py path/to/output_dir/

2. Actual Local Output Specification

2.1 Test Run Results

Input file: F:\GraphRAGAgent\mineru_mvp\output\test_sample\8a719db4-2b50-405b-826d-7bb27b224fa0_content_list.json
Input size: 10 blocks (9 text + 1 table), 1 page, 2102 characters
Extraction results: 45 raw extractions → 40 deduplicated nodes, 780 CO_OCCURS_IN edges
Alignment quality: all 40 nodes are match_exact (1 match_fuzzy extraction was filtered out)
Execution time: ~22s (DeepSeek API calls)

2.2 kg_nodes.json: Actual Output

File size: 9,851 bytes | Nodes: 40

Node type distribution:

Type	Count	Examples
TECHNOLOGY	4	GraphRAG, MinerU, LLMs, LangExtract
CONCEPT	36	knowledge graphs, retrieval-augmented generation, multi-hop reasoning

Node format (actual sample):

{
  "id": "node_0",
  "name": "GraphRAG",
  "type": "TECHNOLOGY",
  "source_doc": "8a719db4-2b50-405b-826d-7bb27b224fa0",
  "char_start": 0,
  "char_end": 8,
  "confidence": "match_exact",
  "page": 0
}

Node list (first 10 of 40):

id	name	type	confidence
node_0	GraphRAG	TECHNOLOGY	match_exact
node_1	Knowledge Graph Enhanced RAG System	CONCEPT	match_exact
node_2	retrieval-augmented generation	CONCEPT	match_exact
node_3	knowledge graphs	CONCEPT	match_exact
node_4	large language models	CONCEPT	match_exact
node_5	question answering	CONCEPT	match_exact
node_6	document collections	CONCEPT	match_exact
node_7	RAG systems	CONCEPT	match_exact
node_8	vector similarity search	CONCEPT	match_exact
node_9	hierarchical knowledge graph	CONCEPT	match_exact

2.3 kg_edges.json: Actual Output

File size: 129,093 bytes | Edges: 780

Sanity check: all 40 nodes are on the same page → C(40,2) = 40×39/2 = 780 edges ✓

Edge format (actual sample):

{
  "source": "node_0",
  "target": "node_1",
  "relation": "CO_OCCURS_IN",
  "doc_id": "8a719db4-2b50-405b-826d-7bb27b224fa0",
  "page": 0
}

Integrity check results:

Self-loops: 0 ✓
Duplicate edges: 0 ✓
Relation types: all CO_OCCURS_IN ✓

3. MinerU Pipeline Key Parameters

3.1 Input Format: content_list.json

After parsing a PDF, MinerU outputs {uuid}_content_list.json, a JSON array in which each element represents one content block.

text block structure:

{
  "type": "text",
  "text": "GraphRAG: Knowledge Graph Enhanced RAG System...",
  "text_level": null,
  "page_idx": 0,
  "bbox": [72, 43, 523, 57]
}

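Field	Type	Description
`type`	string	Block type: `"text"` \| `"table"` \| `"image"`
`text`	string	Text content (may have trailing whitespace)
`text_level`	int \| null	`null` = body text, `1` = level-1 heading
`page_idx`	int	Page index (0-based)
`bbox`	list[int]	Bounding box coordinates `[x0, y0, x1, y1]` (normalized to 0-1000)

table block structure: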

{
  "type": "table",
  "table_body": "<table><tr><th>Method</th><th>Score</th></tr>...</table>",
  "table_caption": [],
  "page_idx": 0,
  "bbox": [72, 400, 523, 500]
}

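Field	Type	Description
`table_body`	string	Complete HTML `<table>` markup
`table_caption`	list	Table caption (usually an empty array)

3.2 Key Constraints

File naming: {uuid}_content_list.json; the UUID is used as source_doc_id
Block order matches the PDF reading order
The text field may end with extra whitespace and needs .rstrip()
image blocks contain no extractable text; Bridge skips them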
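
The naming constraint above can be sketched as a small input resolver. This is a minimal sketch, not the actual `bridge.py` code: the function name `resolve_input` and the choice to take the first match in sorted order when a directory holds several files are assumptions made for illustration.

```python
from pathlib import Path

SUFFIX = "_content_list.json"


def resolve_input(path_arg: str) -> tuple[Path, str]:
    """Return (content_list path, source_doc_id) for a file or directory argument."""
    path = Path(path_arg)
    if path.is_dir():
        # Directory input: look for *_content_list.json inside it.
        matches = sorted(path.glob(f"*{SUFFIX}"))
        if not matches:
            raise FileNotFoundError(f"no *{SUFFIX} file under {path}")
        path = matches[0]  # assumption: first match in sorted order
    if not path.name.endswith(SUFFIX):
        raise ValueError(f"unexpected file name: {path.name}")
    # The part of the file name before the suffix is the UUID / source_doc_id.
    source_doc_id = path.name[: -len(SUFFIX)]
    return path, source_doc_id
```

Calling `resolve_input` on the test file from section 2.1 would yield `8a719db4-2b50-405b-826d-7bb27b224fa0` as the `source_doc_id`.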

4. LangExtract Pipeline Key Parameters

4.1 Model Configuration

from langextract.providers.openai import OpenAILanguageModel

model = OpenAILanguageModel(
    model_id="deepseek-chat",
    api_key=DEEPSEEK_API_KEY,
    base_url="https://api.deepseek.com",
)

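Important: instantiate OpenAILanguageModel directly rather than relying on model_id routing. LangExtract uses model_id both for internal provider routing and as the model parameter of the API request, and DeepSeek does not recognize GPT model names.

4.2 Extraction Call

import langextract as lx

result = lx.extract(
    text_or_documents=page_text,       # plain-text string for one page
    prompt_description=PROMPT,          # entity type description (section 4.3)
    examples=EXAMPLES,                  # few-shot examples (section 4.4)
    model=model,                        # pass the model instance directly
    show_progress=True,
)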

4.3 Prompt Configuration

Extract named entities from the text in order of appearance.
Entity types:
  TECHNOLOGY — software, algorithms, models, tools
  ORGANIZATION — companies, research groups, institutions
  PERSON — individual people
  LOCATION — places, geographic entities
  CONCEPT — technical concepts, methodologies, frameworks

4.4 Few-shot Example

A validated example (94.1% match_exact in MVP testing):

lx.data.ExampleData(
    text="LangChain is a framework created by Harrison Chase for building "
         "LLM applications. It integrates with OpenAI models and Pinecone "
         "vector database for semantic search.",
    extractions=[
        lx.data.Extraction(extraction_class="TECHNOLOGY", extraction_text="LangChain"),
        lx.data.Extraction(extraction_class="PERSON", extraction_text="Harrison Chase"),
        lx.data.Extraction(extraction_class="CONCEPT", extraction_text="LLM applications"),
        lx.data.Extraction(extraction_class="TECHNOLOGY", extraction_text="OpenAI models"),
        lx.data.Extraction(extraction_class="TECHNOLOGY", extraction_text="Pinecone"),
        lx.data.Extraction(extraction_class="CONCEPT", extraction_text="semantic search"),
    ],
)

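4.5 Output Format: AnnotatedDocument

Each page's extraction returns an AnnotatedDocument. Each element of its extractions list contains:

Field	Type	Description
`extraction_text`	string	Entity name (must be an exact substring of the input text)
`extraction_class`	string	Entity type (TECHNOLOGY/ORGANIZATION/PERSON/LOCATION/CONCEPT)
`char_interval.start_pos`	int	Start character position in the input text
`char_interval.end_pos`	int	End character position in the input text
`alignment_status`	enum	Alignment quality: `match_exact` \| `match_greater` \| `match_lesser` \| `match_fuzzy` \| `None`
`extraction_index`	int	Extraction index (1-based)
`group_index`	int	Group index (0-based)

4.6 Alignment Quality Filter Rules

alignment_status	Meaning	Bridge handling
`match_exact`	LLM output matches the source text exactly	✅ Accept
`match_greater`	LLM output is a superset of the source substring	✅ Accept
`match_lesser`	LLM output is a subset of the source substring	✅ Accept
`match_fuzzy`	Fuzzy match; offsets are unreliable	❌ Filter out
`None`	Could not be aligned	❌ Filter out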
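
The filter rules in the table above amount to a membership test. The sketch below is illustrative only: `RawExtraction` is a hypothetical stand-in for a LangExtract extraction, and it assumes the alignment status has already been reduced to one of the strings listed in the table (or `None`).

```python
from dataclasses import dataclass
from typing import Optional

# Statuses that Bridge accepts, per the table in section 4.6.
ACCEPTED_STATUSES = {"match_exact", "match_greater", "match_lesser"}


@dataclass
class RawExtraction:
    """Hypothetical stand-in for one LangExtract extraction."""
    text: str
    entity_type: str
    alignment_status: Optional[str]  # assumed to be a plain string or None


def keep_aligned(extractions: list[RawExtraction]) -> list[RawExtraction]:
    """Drop extractions whose offsets are unreliable (match_fuzzy) or missing (None)."""
    return [e for e in extractions if e.alignment_status in ACCEPTED_STATUSES]
```

Because `None` is not a member of `ACCEPTED_STATUSES`, unaligned extractions are dropped by the same test as fuzzy ones.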

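5. MinerU ↔ LangExtract Interface Specification

5.1 Core Challenge

MinerU outputs structured JSON blocks (including HTML tables), whereas LangExtract accepts only plain-text str. Bridge's text_assembler module handles the conversion and the offset mapping.

5.2 Conversion Rules

Interface point	MinerU spec	LangExtract spec	Bridge handling
Input format	`content_list.json` (JSON array)	Accepts plain-text `str` only	`text_assembler` concatenates and converts
Text blocks	`block["text"]`, may have trailing whitespace	`extraction_text` must be an exact substring of the source text	`.rstrip()` strips trailing whitespace
Table blocks	`table_body` is `<table>` HTML	Does not accept HTML	BeautifulSoup converts to pipe-delimited plain text
Heading detection	`text_level` absent or `null` = body text, set = heading	No heading/body distinction	Headings and body text are concatenated into the same text
Coordinate system	bbox normalized to 0-1000	char_interval based on input characters	BlockSpan records the offset mapping
Pagination	`page_idx` distinguishes pages	One call processes one piece of text	`lx.extract()` is called separately for each page
File name	`{uuid}_content_list.json`	n/a	glob `*_content_list.json`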

5.3 Text Concatenation Algorithm

Input: content_list (grouped by page_idx)
Output: list of PageText

for each page:
  cursor = 0
  buffer = []
  for each block (original order preserved):
    if type == "text":
      block_text = block["text"].rstrip()
    elif type == "table":
      block_text = html_table_to_text(block["table_body"])
    else:
      skip (image / equation, etc.)

    record BlockSpan(char_start=cursor, char_end=cursor+len(block_text))
    buffer.append(block_text + "\n")
    cursor += len(block_text) + 1

  PageText.text = "".join(buffer).rstrip("\n")
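
The algorithm above can be sketched in runnable form as follows. This is a minimal sketch rather than the actual `text_assembler.py`: the function name `assemble_pages` is an assumption, the table converter is passed in as a parameter so the block stays self-contained, and the two dataclasses are repeated here only so the example runs on its own.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class BlockSpan:
    block_index: int
    block_type: str
    page_idx: int
    char_start: int
    char_end: int
    bbox: list


@dataclass
class PageText:
    page_idx: int
    text: str = ""
    block_spans: list = field(default_factory=list)


def assemble_pages(content_list: list[dict],
                   table_to_text: Callable[[str], str]) -> list[PageText]:
    """Concatenate text/table blocks per page and record each block's offsets."""
    pages: dict[int, PageText] = {}
    buffers: dict[int, list[str]] = {}
    cursors: dict[int, int] = {}
    for index, block in enumerate(content_list):
        block_type = block.get("type")
        if block_type == "text":
            block_text = block.get("text", "").rstrip()
        elif block_type == "table":
            block_text = table_to_text(block.get("table_body", ""))
        else:
            continue  # image / equation blocks carry no extractable text
        page_idx = block["page_idx"]
        page = pages.setdefault(page_idx, PageText(page_idx=page_idx))
        cursor = cursors.get(page_idx, 0)
        page.block_spans.append(BlockSpan(
            block_index=index, block_type=block_type, page_idx=page_idx,
            char_start=cursor, char_end=cursor + len(block_text),
            bbox=block.get("bbox", []),
        ))
        buffers.setdefault(page_idx, []).append(block_text + "\n")
        cursors[page_idx] = cursor + len(block_text) + 1  # +1 for the "\n"
    for page_idx, page in pages.items():
        page.text = "".join(buffers[page_idx]).rstrip("\n")
    return [pages[i] for i in sorted(pages)]
```

With this layout, `page.text[span.char_start:span.char_end]` recovers the text of the block that produced `span`, which is what makes the later offset mapping possible.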

5.4 Offset Mapping Data Structures

import dataclasses


@dataclasses.dataclass
class BlockSpan:
    block_index: int    # index into the content_list array
    block_type: str     # "text" | "table"
    page_idx: int       # page index
    char_start: int     # start position in the concatenated text
    char_end: int       # end position in the concatenated text (exclusive)
    bbox: list[int]     # original MinerU bbox

@dataclasses.dataclass
class PageText:
    page_idx: int                   # page index
    text: str                       # concatenated plain text
    block_spans: list[BlockSpan]    # position of each block within text
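
One use of these structures is mapping an extraction's character offset back to the MinerU block (and hence the bbox) it came from. The lookup below is a sketch under assumptions: `find_block` is a hypothetical helper, `Span` is a trimmed-down copy of `BlockSpan`, and spans are assumed to be sorted by `char_start`, as they would be when recorded in reading order.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    """Trimmed-down BlockSpan with only the fields the lookup needs."""
    block_index: int
    char_start: int
    char_end: int  # exclusive


def find_block(spans: list[Span], offset: int) -> Optional[Span]:
    """Return the span containing `offset`, or None if it falls on a separator."""
    starts = [s.char_start for s in spans]
    i = bisect_right(starts, offset) - 1  # last span starting at or before offset
    if i >= 0 and spans[i].char_start <= offset < spans[i].char_end:
        return spans[i]
    return None
```

An offset that lands on the `"\n"` separator between two blocks belongs to neither span, so the helper returns `None` for it.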

5.5 HTML Table Conversion

from bs4 import BeautifulSoup


def html_table_to_text(table_body: str) -> str:
    """Convert <table> HTML → pipe-delimited plain text"""
    soup = BeautifulSoup(table_body, "html.parser")
    rows = []
    for tr in soup.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all(["td", "th"])]
        rows.append(" | ".join(cells))
    return "\n".join(rows)

Conversion example:

<table><tr><th>Method</th><th>Score</th></tr><tr><td>GraphRAG</td><td>0.85</td></tr></table>

→

Method | Score
GraphRAG | 0.85
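
For environments where BeautifulSoup is unavailable, the same row and cell flattening can be approximated with the standard library. This is a swapped-in alternative, not the pipeline's implementation: `html_table_to_text_stdlib` is a hypothetical name, nested tables are not handled, and output may differ from the BeautifulSoup version for cells that contain nested markup.

```python
from html.parser import HTMLParser


class _TableTextParser(HTMLParser):
    """Collect the text of each <td>/<th> cell, one output row per <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cells = None  # cells of the row currently being parsed
        self._cell = None   # text fragments of the cell currently being parsed

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._cells = []
        elif tag in ("td", "th") and self._cells is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._cells.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._cells is not None:
            self.rows.append(" | ".join(self._cells))
            self._cells = None


def html_table_to_text_stdlib(table_body: str) -> str:
    """Convert <table> HTML to pipe-delimited plain text using only the stdlib."""
    parser = _TableTextParser()
    parser.feed(table_body)
    parser.close()
    return "\n".join(parser.rows)
```

Applied to the HTML in the conversion example above, this produces the same two pipe-delimited rows.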

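6. Bridge Pipeline Final Output Key Parameters

6.1 kg_nodes.json

File path: graphrag_pipeline/output/kg_nodes.json

Structure: a JSON array; each element is one deduplicated entity node.

Field	Type	Description	Example
`id`	string	Unique node identifier, format `node_{index}`	`"node_0"`
`name`	string	Entity name (substring of the source text)	`"GraphRAG"`
`type`	string	Entity type	`"TECHNOLOGY"`
`source_doc`	string	Source document UUID	`"8a719db4-2b50-405b-826d-7bb27b224fa0"`
`char_start`	int	Start character position in the concatenated text	`0`
`char_end`	int	End character position in the concatenated text	`8`
`confidence`	string	Alignment quality (only `match_exact`/`match_greater`/`match_lesser`)	`"match_exact"`
`page`	int	Source page index (0-based)	`0`

Deduplication rule: key = (name.lower(), type); the first occurrence of each entity is kept.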
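
The deduplication rule can be sketched as a single pass over the filtered entities. This is an illustration rather than the `kg_builder.py` source: `dedupe_nodes` is a hypothetical name, and assigning `node_{index}` ids in order of first appearance is an assumption that matches the sample output but is not stated by the spec.

```python
def dedupe_nodes(entities: list[dict]) -> list[dict]:
    """Keep the first entity per (name.lower(), type) key and assign node ids."""
    seen: dict[tuple[str, str], dict] = {}
    for entity in entities:
        key = (entity["name"].lower(), entity["type"])
        if key not in seen:
            # Assumption: ids are numbered in order of first appearance.
            seen[key] = {"id": f"node_{len(seen)}", **entity}
    return list(seen.values())
```

Note that the key includes the type, so the same surface form tagged as both TECHNOLOGY and CONCEPT yields two separate nodes.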

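Entity type enum:

Type	Description
`TECHNOLOGY`	Software, algorithms, models, tools
`ORGANIZATION`	Companies, research institutions
`PERSON`	Individual people
`LOCATION`	Geographic locations
`CONCEPT`	Technical concepts, methodologies, frameworks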

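6.2 kg_edges.json

File path: graphrag_pipeline/output/kg_edges.json

Structure: a JSON array; each element is one same-page co-occurrence edge.

Field	Type	Description	Example
`source`	string	Source node ID	`"node_0"`
`target`	string	Target node ID	`"node_1"`
`relation`	string	Relation type (always `"CO_OCCURS_IN"`)	`"CO_OCCURS_IN"`
`doc_id`	string	Source document UUID	`"8a719db4-..."`
`page`	int	Page on which the nodes co-occur	`0`

Edge generation rules:

Group all deduplicated node IDs by page
Pair the nodes on each page two by two → generate CO_OCCURS_IN edges
Direction normalization: source < target (lexicographic order)
Deduplication key: (source, target, doc_id, page)
No self-loops (source ≠ target)

Edge count formula: if a page has N nodes, that page produces C(N,2) = N×(N-1)/2 edges.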
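
The edge rules above map naturally onto `itertools.combinations`. The following is a minimal sketch, assuming nodes are dicts carrying at least `id` and `page`; `build_edges` is a hypothetical name, not necessarily the one used in `kg_builder.py`.

```python
from collections import defaultdict
from itertools import combinations


def build_edges(nodes: list[dict], doc_id: str) -> list[dict]:
    """Generate one CO_OCCURS_IN edge per unordered pair of same-page nodes."""
    ids_by_page: dict[int, set[str]] = defaultdict(set)
    for node in nodes:
        ids_by_page[node["page"]].add(node["id"])

    edges = []
    for page in sorted(ids_by_page):
        # combinations() over a sorted list of unique ids yields each pair once,
        # with source < target in lexicographic order and no self-loops.
        for source, target in combinations(sorted(ids_by_page[page]), 2):
            edges.append({
                "source": source,
                "target": target,
                "relation": "CO_OCCURS_IN",
                "doc_id": doc_id,
                "page": page,
            })
    return edges
```

Since each page contributes C(N,2) pairs, edge count grows quadratically with the number of nodes per page, which is why the 40-node test page already produces 780 edges.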

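6.3 Output Integrity Constraints

Constraint	Description
Unique node IDs	Each node's `id` field is globally unique
Valid edge references	Each edge's `source` and `target` must refer to an existing node `id`
No self-loops	No edge has `source == target`
No duplicate edges	Each `(source, target, doc_id, page)` combination appears only once
Alignment quality guarantee	Every node's `confidence` is one of the accepted values (never fuzzy/null)
Valid char offsets	`char_start < char_end`, and the offsets locate the entity substring in the concatenated text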
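
Most of these constraints can be checked mechanically after a run. The validator below is a hedged sketch: `validate_graph` is a hypothetical helper, it works on the already-loaded JSON arrays, and it does not cover the last part of the offset constraint (locating the substring), which would also require the concatenated page text.

```python
ACCEPTED_CONFIDENCE = {"match_exact", "match_greater", "match_lesser"}


def validate_graph(nodes: list[dict], edges: list[dict]) -> list[str]:
    """Return a list of human-readable constraint violations (empty if none)."""
    errors = []
    ids = [n["id"] for n in nodes]
    if len(ids) != len(set(ids)):
        errors.append("duplicate node id")
    id_set = set(ids)

    for n in nodes:
        if n["confidence"] not in ACCEPTED_CONFIDENCE:
            errors.append(f"{n['id']}: confidence {n['confidence']!r} not accepted")
        if not n["char_start"] < n["char_end"]:
            errors.append(f"{n['id']}: char_start must be < char_end")

    seen = set()
    for e in edges:
        if e["source"] not in id_set or e["target"] not in id_set:
            errors.append(f"dangling edge {e['source']} -> {e['target']}")
        if e["source"] == e["target"]:
            errors.append(f"self-loop on {e['source']}")
        key = (e["source"], e["target"], e["doc_id"], e["page"])
        if key in seen:
            errors.append(f"duplicate edge {key}")
        seen.add(key)
    return errors
```

Returning a list of messages rather than raising on the first failure makes it easier to see every violation in a single pass.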

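7. Virtual Environment

Bridge Pipeline reuses the LangExtract virtual environment and does not create a separate venv.

Item	Value
Virtual environment path	`F:\GraphRAGAgent\langextract_src\.venv\`
Python version	3.12
Core dependencies	`langextract[all]`, `beautifulsoup4`, `python-dotenv`
Installing new dependencies	`uv pip install <pkg> --python F:/GraphRAGAgent/langextract_src/.venv/Scripts/python.exe`

All Python commands must run in this virtual environment. Do not use the global Python or another component's venv.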

8. Environment Configuration

8.1 .env File

Location: F:\GraphRAGAgent\graphrag_pipeline\.env

DEEPSEEK_API_KEY=<your-api-key>
DEEPSEEK_BASE_URL=https://api.deepseek.com

8.2 Dependency Installation

uv pip install beautifulsoup4 python-dotenv --python F:/GraphRAGAgent/langextract_src/.venv/Scripts/python.exe

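9. Test Verification Checklist

text_assembler reads content_list.json correctly (10 blocks: 9 text + 1 table)
Table HTML is converted to pipe-delimited plain text with no leftover HTML tags
Per-page concatenated text length is reasonable (2102 characters per page)
LangExtract calls DeepSeek successfully and returns an AnnotatedDocument
45 entities extracted, with match_exact share > 95%
kg_nodes.json nodes are deduplicated (40 nodes) and every node has all fields
kg_edges.json edges are CO_OCCURS_IN relations (780 edges), with no self-loops and no duplicates
Entities aligned as match_fuzzy have been filtered out (1 entity)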
