# Multimodal RAG Backend Service API Specification v1.0
> Based on verified test results of the MinerU + LangExtract Bridge Pipeline + Agentic-RAG MVP
> Web framework: FastAPI (Python 3.12 async)
> Storage: plain file system (JSON)
> Last updated: 2026-03-05
---
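## Table of Contents
- [1. System Architecture Overview](#1-system-architecture-overview)
  - [1.1 Four-Layer Architecture](#11-four-layer-architecture)
  - [1.2 Dual-venv Coordination](#12-dual-venv-coordination)
  - [1.3 End-to-End Data Flow](#13-end-to-end-data-flow)
  - [1.4 Job State Machine](#14-job-state-machine)
  - [1.5 FastAPI Project Layout](#15-fastapi-project-layout)
  - [1.6 File System Storage Layout](#16-file-system-storage-layout)
- [2. Unified Response Envelope](#2-unified-response-envelope)
  - [2.1 Common Response Structure](#21-common-response-structure)
  - [2.2 Error Codes](#22-error-codes)
- [3. Core Data Object Schemas](#3-core-data-object-schemas)
  - [3.1 DocumentInfo](#31-documentinfo)
  - [3.2 IndexingJobStatus](#32-indexingjobstatus)
  - [3.3 KGNode](#33-kgnode)
  - [3.4 KGEdge](#34-kgedge)
  - [3.5 ExtractionRecord](#35-extractionrecord)
  - [3.6 QAResult](#36-qaresult)
- [4. Group A: Document Management (4 endpoints)](#4-group-a-document-management-4-endpoints)
- [5. Group B: Indexing Pipeline (4 endpoints)](#5-group-b-indexing-pipeline-4-endpoints)
- [6. Group C: Knowledge Graph (6 endpoints)](#6-group-c-knowledge-graph-6-endpoints)
- [7. Group D: QA (4 endpoints)](#7-group-d-qa-4-endpoints)
- [8. Group E: Search (3 endpoints)](#8-group-e-search-3-endpoints)
- [9. Group F: System (4 endpoints)](#9-group-f-system-4-endpoints)
- [10. File Format Support Matrix](#10-file-format-support-matrix)
- [11. Dependencies and Running](#11-dependencies-and-running)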
---
## 1. System Architecture Overview
### 1.1 Four-Layer Architecture
```
┌─────────────────────────────────────────────────────────────────────┐
│                            Client Layer                             │
│           Browser / API clients / visualization frontend            │
└──────────────────────────────┬──────────────────────────────────────┘
                               │ HTTP/HTTPS
┌──────────────────────────────▼──────────────────────────────────────┐
│                          API Gateway Layer                          │
│      Nginx reverse proxy | rate limiting (per-IP / per-key)       │
│                 request logging | TLS termination                   │
└──────────────────────────────┬──────────────────────────────────────┘
┌──────────────────────────────▼──────────────────────────────────────┐
│                 Service Layer: FastAPI Application                  │
│                     Python 3.12 async / uvicorn                     │
│                                                                     │
│  ┌────────────────┐  ┌────────────────┐  ┌───────────────────────┐  │
│  │ DocumentService│  │ IndexingService│  │ KGService             │  │
│  │ Upload / manage│  │ Pipeline sched.│  │ NetworkX graph ops    │  │
│  └────────────────┘  └────────────────┘  └───────────────────────┘  │
│  ┌────────────────┐  ┌────────────────┐  ┌───────────────────────┐  │
│  │ QAService      │  │ SearchService  │  │ SystemService         │  │
│  │ Agentic-RAG    │  │Entity/KG search│  │ Health check / stats  │  │
│  └────────────────┘  └────────────────┘  └───────────────────────┘  │
└──────────────────────────────┬──────────────────────────────────────┘
┌──────────────────────────────▼──────────────────────────────────────┐
│                      Pipeline Execution Layer                       │
│                                                                     │
│  ┌───────────────────────────────────────────────────────────────┐  │
│  │ MinerU Pipeline (subprocess → mineru_mvp/.venv)               │  │
│  │ Input: file path    Output: *content_list.json + layout.json  │  │
│  └───────────────────────────────────────────────────────────────┘  │
│  ┌───────────────────────────────────────────────────────────────┐  │
│  │ Bridge Pipeline (direct import → langextract_src/.venv)       │  │
│  │ text_assembler → entity_extractor → kg_builder                │  │
│  │ Output: kg_nodes.json + kg_edges.json                         │  │
│  └───────────────────────────────────────────────────────────────┘  │
│  ┌───────────────────────────────────────────────────────────────┐  │
│  │ Agentic-RAG (LangChain create_agent → langextract_src/.venv)  │  │
│  │ Tools: search_entities / get_neighbors / get_entities_by_type │  │
│  │        describe_graph                                         │  │
│  │ LLM: DeepSeek deepseek-chat via ChatOpenAI                    │  │
│  └───────────────────────────────────────────────────────────────┘  │
└──────────────────────────────┬──────────────────────────────────────┘
┌──────────────────────────────▼──────────────────────────────────────┐
│                  Storage Layer (plain file system)                  │
│  uploads/        ← original uploaded files                          │
│  jobs/{job_id}/  ← per-job intermediate artifacts and result JSON   │
│  kg/             ← merged global KG (kg_nodes.json + kg_edges.json) │
└─────────────────────────────────────────────────────────────────────┘
```
### 1.2 Dual-venv Coordination
The project uses two isolated Python virtual environments. The FastAPI service coordinates them as follows:
| Component | Virtual environment | Invocation |
|------|---------|---------|
| **FastAPI service itself** | `langextract_src/.venv` | Runs directly |
| **Bridge Pipeline** | `langextract_src/.venv` | Direct import: `from text_assembler import ...` |
| **Agentic-RAG** | `langextract_src/.venv` | Direct import: `from agentic_rag_mvp import ...` |
| **MinerU Pipeline** | `mineru_mvp/.venv` | `subprocess.run([MINERU_PYTHON, MINERU_PIPELINE, pdf_path])` |
```python
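import subprocess
from pathlib import Path

# Core of the dual-venv coordination
MINERU_DIR = Path("F:/GraphRAGAgent/mineru_mvp")  # assumed: parent of the two paths below
MINERU_PYTHON = Path("F:/GraphRAGAgent/mineru_mvp/.venv/Scripts/python.exe")
MINERU_PIPELINE = Path("F:/GraphRAGAgent/mineru_mvp/pipeline.py")

# Stage 1: MinerU runs in its own venv through an isolated subprocess.
# pdf_path is the path of the uploaded file, supplied by the caller.
result = subprocess.run(
    [str(MINERU_PYTHON), str(MINERU_PIPELINE), str(pdf_path)],
    cwd=str(MINERU_DIR), capture_output=True, text=True, timeout=600
)

# Stages 2-4: Bridge + RAG live in the same venv, so they are imported directly
from text_assembler import load_content_list, assemble_pages
from entity_extractor import create_model, extract_entities
from kg_builder import build_kg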
```
### 1.3 End-to-End Data Flow
```
Upload file (PDF/DOCX/PPT/PNG/JPG/HTML)
        ▼  POST /api/v1/documents/upload
DocumentService: save to uploads/{doc_id}_{filename}
        ▼  POST /api/v1/index/start
IndexingService: start a background threading.Thread
        ├─ Stage: parsing
        │    MinerU subprocess → mineru_mvp/output/{stem}/*_content_list.json
        ├─ Stage: extracting
        │    text_assembler.assemble_pages() → PageText[]
        │    entity_extractor.extract_entities() → AnnotatedDocument[]
        │    → ExtractionRecord[] saved to jobs/{job_id}/extractions.json
        ├─ Stage: indexing
        │    kg_builder.build_kg() → KGNode[] + KGEdge[]
        │    → saved to jobs/{job_id}/kg_nodes.json + kg_edges.json
        │    → merged into the global kg/kg_nodes.json + kg/kg_edges.json
        └─ Status: done
             GET /api/v1/index/result/{job_id} → full result
User query (natural-language question)
        ▼  POST /api/v1/query
QAService: load the global KG → NetworkX Graph
        ├─ LangChain create_agent (DeepSeek)
        │    ReAct loop: think → tool_call → observe → repeat
        │    Tool call chain: search_entities / get_neighbors / ...
        └─ QAResult: answer + tool_calls + cited_nodes
```
### 1.4 Job State Machine
```
              ┌───────────┐
              │ submitted │
              └─────┬─────┘
                    │ background thread starts
              ┌─────▼─────┐
              │  queued   │  (waits for the thread pool; the current implementation moves straight to parsing)
              └─────┬─────┘
                    │ MinerU subprocess starts
              ┌─────▼─────┐
              │  parsing  │  MinerU cloud API parsing
              └─────┬─────┘
                    │ content_list.json ready
              ┌─────▼──────┐
              │ extracting │  LangExtract + DeepSeek entity extraction
              └─────┬──────┘
                    │ extractions.json ready
              ┌─────▼──────┐
              │  indexing  │  kg_builder builds the knowledge graph
              └─────┬──────┘
                    │ kg_nodes/edges ready
         ┌──────────┴──────────┐
   ┌─────▼─────┐        ┌──────▼──────┐
   │   done    │        │   failed    │
   └───────────┘        └─────────────┘
```
**Progress fields (the `progress` object):**
| Stage | `parsed_pages` | `total_pages` | `extracted_entities` |
|------|----------------|---------------|----------------------|
| parsing | Updated live (MinerU progress) | Total page count reported by MinerU | 0 |
| extracting | total_pages | total_pages | Accumulated live |
| indexing | total_pages | total_pages | Final value |
| done | total_pages | total_pages | Final value |
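The state diagram in this section can also be written as a transition table. The following is a minimal sketch rather than the service's actual code: `ALLOWED_TRANSITIONS` and `advance` are hypothetical names, and letting every non-terminal state move to `failed` is an assumption that the diagram only implies.

```python
# Hypothetical transition table derived from the job state diagram.
# Assumption: any non-terminal state may move to "failed".
ALLOWED_TRANSITIONS = {
    "submitted": {"queued", "failed"},
    "queued": {"parsing", "failed"},
    "parsing": {"extracting", "failed"},
    "extracting": {"indexing", "failed"},
    "indexing": {"done", "failed"},
    "done": set(),
    "failed": set(),
}


def advance(current: str, new: str) -> str:
    """Return the new status if the transition is allowed, else raise ValueError."""
    if new not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"Illegal job transition: {current} -> {new}")
    return new
```

Guarding every status write with a check like this keeps a late-finishing background thread from moving a job out of a terminal state.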
### 1.5 FastAPI Project Layout
```
F:\GraphRAGAgent\graphrag_pipeline\
├── api_server.py               # FastAPI entry point (app instance, router registration, startup config)
├── routers/
│   ├── __init__.py
│   ├── documents.py            # Group A: document management (4 endpoints)
│   ├── indexing.py             # Group B: Indexing Pipeline (4 endpoints)
│   ├── kg.py                   # Group C: knowledge graph (6 endpoints)
│   ├── query.py                # Group D: QA (4 endpoints)
│   ├── search.py               # Group E: search (3 endpoints)
│   └── system.py               # Group F: system (4 endpoints)
├── services/
│   ├── __init__.py
│   ├── document_service.py     # File saving, metadata read/write
│   ├── indexing_service.py     # Pipeline scheduling (MinerU subprocess + Bridge import)
│   ├── kg_service.py           # NetworkX graph loading, BFS, centrality
│   ├── qa_service.py           # create_agent wrapper, ReAct invocation, result parsing
│   └── search_service.py       # Entity search, path search, subgraph search
├── models/
│   ├── __init__.py
│   └── schemas.py              # Pydantic v2 models (schemas for all data objects)
├── storage/
│   ├── __init__.py
│   └── file_store.py           # Unified file I/O (JSON serialization/deserialization, directory management)
├── .env                        # DEEPSEEK_API_KEY + DEEPSEEK_BASE_URL + MINERU_API_TOKEN
│                               # Existing files below (not modified)
├── bridge.py
├── text_assembler.py
├── entity_extractor.py
├── kg_builder.py
├── agentic_rag_mvp.py
├── web_server.py               # Old Flask prototype (kept, not deleted)
└── output/
    ├── kg_nodes.json           # Backward-compatible global KG (kept in sync with kg/)
    └── kg_edges.json
```
### 1.6 File System Storage Layout
```
F:\GraphRAGAgent\graphrag_pipeline\
├── uploads/
│   └── {doc_id}_{filename}          # Original uploaded file (e.g. abc12345_paper.pdf)
├── jobs/
│   └── {job_id}/
│       ├── meta.json                # Job metadata
│       │     {
│       │       "job_id": "job_xyz789",
│       │       "doc_id": "abc12345",
│       │       "status": "done",
│       │       "stage": "Complete",
│       │       "progress": {...},
│       │       "created_at": "ISO8601",
│       │       "elapsed_seconds": 42.1,
│       │       "error": null,
│       │       "pdf_name": "paper.pdf",
│       │       "pdf_path": "uploads/abc12345_paper.pdf"
│       │     }
│       ├── mineru_output/           # MinerU parsing artifacts (kept as-is)
│       │   ├── {uuid}_content_list.json
│       │   ├── layout.json
│       │   ├── full.md
│       │   ├── {uuid}_origin.pdf
│       │   └── images/
│       │       └── {sha256}.jpg
│       ├── extractions.json         # All LangExtract extraction records (ExtractionRecord[])
│       ├── kg_nodes.json            # KG nodes produced by this job (KGNode[])
│       └── kg_edges.json            # KG edges produced by this job (KGEdge[])
└── kg/
    ├── kg_nodes.json                # Globally merged KG nodes (all jobs merged and deduplicated)
    └── kg_edges.json                # Globally merged KG edges (all jobs merged and deduplicated)
```
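The "merged and deduplicated" step for the global `kg/` files could look like the sketch below. It is an illustration under stated assumptions, not the project's implementation: `merge_unique` and `merge_job_into_global` are hypothetical helpers, nodes are assumed to be unique by `id`, and edges by the tuple `(source, target, relation, doc_id, page)`.

```python
import json
from pathlib import Path


def merge_unique(existing: list[dict], incoming: list[dict],
                 key_fields: tuple[str, ...]) -> list[dict]:
    """Append incoming records whose key is not present yet, preserving order."""
    seen = {tuple(rec.get(f) for f in key_fields) for rec in existing}
    merged = list(existing)
    for rec in incoming:
        key = tuple(rec.get(f) for f in key_fields)
        if key not in seen:
            seen.add(key)
            merged.append(rec)
    return merged


def merge_job_into_global(job_dir: Path, kg_dir: Path) -> None:
    """Merge jobs/{job_id}/kg_*.json into kg/kg_*.json (hypothetical helper)."""
    kg_dir.mkdir(parents=True, exist_ok=True)
    specs = {
        "kg_nodes.json": ("id",),  # assumed node key
        "kg_edges.json": ("source", "target", "relation", "doc_id", "page"),  # assumed edge key
    }
    for name, key_fields in specs.items():
        target = kg_dir / name
        existing = json.loads(target.read_text(encoding="utf-8")) if target.exists() else []
        incoming = json.loads((job_dir / name).read_text(encoding="utf-8"))
        merged = merge_unique(existing, incoming, key_fields)
        target.write_text(json.dumps(merged, ensure_ascii=False, indent=2), encoding="utf-8")
```

Because the merge is idempotent, re-running a job for the same document does not duplicate records in the global files.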
---
## 2. Unified Response Envelope
### 2.1 Common Response Structure
All API endpoints wrap their payload in the following envelope:
```json
{
"code": 0,
"msg": "success",
"request_id": "f47ac10b-58cc-4372-a567-0e02b2c3d479",
"data": { ... }
}
```
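| Field | Type | Description |
|------|------|------|
| `code` | `int` | `0` = success; non-zero = failure (see the error code table) |
| `msg` | `string` | Status description (`"success"` on success, the error message on failure) |
| `request_id` | `string` | UUID v4, used for log tracing |
| `data` | `object \| null` | Business payload (`null` on failure) |
**HTTP status code mapping:**
| HTTP status | When it applies |
|------------|---------|
| `200 OK` | Synchronous request succeeded |
| `202 Accepted` | Asynchronous task accepted (job started) |
| `400 Bad Request` | Parameter validation failed or the request is invalid in the current state (code 1001-1004, 2002-2004, 3002) |
| `404 Not Found` | Resource not found (code 2001/3001) |
| `500 Internal Server Error` | Internal server error (code 4001/5000) |
**FastAPI Pydantic response model:**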
```python
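import uuid
from typing import Generic, Optional, TypeVar

from pydantic import BaseModel, Field

T = TypeVar("T")


class APIResponse(BaseModel, Generic[T]):
    code: int = 0
    msg: str = "success"
    # default_factory generates a fresh UUID for each response; a plain default
    # would be evaluated once at class definition and shared by every response.
    request_id: str = Field(default_factory=lambda: str(uuid.uuid4()))
    data: Optional[T] = None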
```
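### 2.2 Error Codes
| code | HTTP status | Meaning | Notes |
|------|------------|------|------|
| `0` | 200 | Success | |
| `1001` | 400 | Parameter validation failed | A required field is missing or has the wrong type |
| `1002` | 400 | Unsupported file format | Only pdf/docx/doc/pptx/ppt/png/jpg/jpeg/html are supported |
| `1003` | 400 | File exceeds the size limit | At most 200 MB per file (MinerU limit) |
| `1004` | 400 | File exceeds the page limit | At most 600 pages per file (MinerU limit) |
| `2001` | 404 | Document not found | No document matches `doc_id` |
| `2002` | 400 | Job not found | No job matches `job_id` |
| `2003` | 400 | Job still running | The result was requested before the job finished |
| `2004` | 400 | Job cannot be cancelled | Only submitted/queued jobs can be cancelled |
| `3001` | 404 | KG node not found | No node matches `node_id` |
| `3002` | 400 | KG is empty | No indexing run has completed yet, so there is no graph data |
| `4001` | 500 | QA service error | The LangChain Agent or DeepSeek API call failed |
| `5000` | 500 | Internal server error | Unexpected system exception |
**Error response example:**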
```json
{
"code": 1002,
"msg": "Unsupported file format: .xlsx. Supported formats: pdf, docx, doc, pptx, ppt, png, jpg, jpeg, html",
"request_id": "f47ac10b-58cc-4372-a567-0e02b2c3d479",
"data": null
}
```
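One way to keep business codes and HTTP statuses consistent is to derive both from a single lookup. The sketch below is illustrative only: `HTTP_STATUS_BY_CODE` copies the error code table of section 2.2, while `make_response` is a hypothetical helper name. It does not cover the `202 Accepted` case, which depends on the endpoint rather than on the code.

```python
import uuid

# HTTP status per business code, copied from the error code table (section 2.2).
HTTP_STATUS_BY_CODE = {
    0: 200,
    1001: 400, 1002: 400, 1003: 400, 1004: 400,
    2001: 404, 2002: 400, 2003: 400, 2004: 400,
    3001: 404, 3002: 400,
    4001: 500, 5000: 500,
}


def make_response(code: int = 0, msg: str = "success", data=None) -> tuple[int, dict]:
    """Return (http_status, envelope). Hypothetical helper, not part of the spec."""
    body = {
        "code": code,
        "msg": msg,
        "request_id": str(uuid.uuid4()),      # fresh ID for every response
        "data": data if code == 0 else None,  # data is null on failure
    }
    # Unknown codes fall back to 500 (an assumption of this sketch).
    return HTTP_STATUS_BY_CODE.get(code, 500), body
```

For example, `make_response(1002, "Unsupported file format: .xlsx")` yields status 400 and an envelope whose `data` is `None`.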
---
## 3. Core Data Object Schemas
### 3.1 DocumentInfo
Document metadata object, created by `POST /api/v1/documents/upload` and persisted to `meta.json` under `jobs/`.
```json
{
"doc_id": "abc12345",
"filename": "graphrag_overview.pdf",
"format": "pdf",
"size_bytes": 1048576,
"pages": 4,
"uploaded_at": "2026-03-05T10:00:00Z",
"status": "indexed",
"language": "en",
"enable_formula": true,
"enable_table": true
}
```
| Field | Type | Description |
|------|------|------|
| `doc_id` | `string` | Unique document ID (first 8 hex characters of a UUID, e.g. `"abc12345"`) |
| `filename` | `string` | Original filename |
| `format` | `string` | File format (lowercase extension, without the dot) |
| `size_bytes` | `int` | File size in bytes |
| `pages` | `int \| null` | Total page count (filled in after MinerU parsing; `null` at upload time) |
| `uploaded_at` | `string` | Upload time in ISO 8601 |
| `status` | `string` | `"uploaded"` / `"indexed"` / `"failed"` |
| `language` | `string` | OCR language code (PaddleOCR), default `"ch"` |
| `enable_formula` | `bool` | Whether formula recognition is enabled |
| `enable_table` | `bool` | Whether table recognition is enabled |
### 3.2 IndexingJobStatus
Job status object for the Indexing Pipeline.
```json
{
"job_id": "job_xyz789",
"doc_id": "abc12345",
"status": "extracting",
"stage": "Extracting entities (LangExtract + DeepSeek)...",
"progress": {
"parsed_pages": 4,
"total_pages": 4,
"extracted_entities": 23
},
"created_at": "2026-03-05T10:00:05Z",
"elapsed_seconds": 18.3,
"error": null
}
```
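| Field | Type | Description |
|------|------|------|
| `job_id` | `string` | Unique job ID (`"job_"` + first 8 hex characters of a UUID) |
| `doc_id` | `string` | ID of the associated document |
| `status` | `string` | Status enum (see the state machine in 1.4) |
| `stage` | `string` | Human-readable description of the current stage |
| `progress.parsed_pages` | `int` | Number of pages parsed so far |
| `progress.total_pages` | `int` | Total page count (0 = unknown) |
| `progress.extracted_entities` | `int` | Number of entities extracted so far |
| `created_at` | `string` | Job creation time in ISO 8601 |
| `elapsed_seconds` | `float` | Elapsed time in seconds |
| `error` | `string \| null` | Error message (non-null on failure) |
### 3.3 KGNode
Knowledge graph node. It mirrors the `kg_nodes.json` format, with an added `degree` field.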
```json
{
"id": "tech_graphrag_0",
"name": "GraphRAG",
"type": "TECHNOLOGY",
"source_doc": "abc12345",
"char_start": 0,
"char_end": 8,
"confidence": "match_exact",
"page": 0,
"degree": 39
}
```
| Field | Type | Description |
|------|------|------|
| `id` | `string` | Unique node ID (from kg_nodes.json) |
| `name` | `string` | Entity name |
| `type` | `string` | Entity type: `TECHNOLOGY` / `CONCEPT` / `PERSON` / `ORGANIZATION` / `LOCATION` |
| `source_doc` | `string` | Source document ID (doc_id) |
| `char_start` | `int` | Start character offset of the entity in the source text (LangExtract `char_interval.start_pos`) |
| `char_end` | `int` | End character offset of the entity in the source text (exclusive, `char_interval.end_pos`) |
| `confidence` | `string` | LangExtract alignment status: `match_exact` / `match_greater` / `match_lesser` / `match_fuzzy` |
| `page` | `int` | Page number (0-indexed, from `page_idx` in MinerU's content_list.json) |
| `degree` | `int` | Node degree, i.e. number of incident edges (computed by NetworkX, filled in only in API responses) |
### 3.4 KGEdge
Knowledge graph edge. It mirrors the `kg_edges.json` format.
```json
{
"source": "tech_graphrag_0",
"target": "concept_knowledgegraph_1",
"relation": "CO_OCCURS_IN",
"doc_id": "abc12345",
"page": 0
}
```
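| Field | Type | Description |
|------|------|------|
| `source` | `string` | Source node ID |
| `target` | `string` | Target node ID |
| `relation` | `string` | Relation type (currently always `"CO_OCCURS_IN"`, meaning co-occurrence on the same page) |
| `doc_id` | `string` | ID of the document the edge comes from |
| `page` | `int` | Page on which the co-occurrence happens (0-indexed) |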
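Same-page co-occurrence can be sketched as pairing every two nodes that share a document and page. This is an assumption about how `CO_OCCURS_IN` edges might be derived, not a description of `kg_builder.build_kg`; `build_cooccurrence_edges` is a hypothetical name.

```python
from collections import defaultdict
from itertools import combinations


def build_cooccurrence_edges(nodes: list[dict]) -> list[dict]:
    """Link every pair of nodes that share the same (source_doc, page)."""
    by_page = defaultdict(list)
    for node in nodes:
        by_page[(node["source_doc"], node["page"])].append(node)

    edges = []
    for (doc_id, page), group in by_page.items():
        # combinations() yields each unordered pair exactly once.
        for a, b in combinations(group, 2):
            edges.append({
                "source": a["id"],
                "target": b["id"],
                "relation": "CO_OCCURS_IN",
                "doc_id": doc_id,
                "page": page,
            })
    return edges
```

A group of n nodes produces n(n-1)/2 edges, so 40 co-occurring nodes give 780 edges, which is consistent with the sample figures used throughout this document.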
### 3.5 ExtractionRecord
A single LangExtract entity extraction record: the flattened form of `AnnotatedDocument.extractions[]`.
```json
{
"text": "GraphRAG",
"type": "TECHNOLOGY",
"char_start": 0,
"char_end": 8,
"alignment": "match_exact",
"page": 0,
"doc_id": "abc12345"
}
```
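| Field | Type | Description |
|------|------|------|
| `text` | `string` | Entity text (`extraction_text`, a substring of the source text) |
| `type` | `string` | Entity type (`extraction_class`) |
| `char_start` | `int \| null` | Start character offset (`char_interval.start_pos`) |
| `char_end` | `int \| null` | End character offset (`char_interval.end_pos`, exclusive) |
| `alignment` | `string \| null` | Alignment status (`alignment_status.value`; `null` means not aligned) |
| `page` | `int` | Page number (0-indexed) |
| `doc_id` | `string` | Source document ID |
> **Filtering rule:** KG construction drops records with `alignment = null` (not aligned). Whether `match_fuzzy` records are dropped as well depends on project configuration. In current tests, `match_exact` accounts for 94%+ of records.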
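The filtering rule can be sketched as a small function. `filter_extractions` and its `drop_fuzzy` flag are hypothetical names standing in for the project configuration mentioned in the rule.

```python
def filter_extractions(records: list[dict], drop_fuzzy: bool = False) -> list[dict]:
    """Keep aligned records; optionally drop match_fuzzy ones as well."""
    kept = []
    for rec in records:
        alignment = rec.get("alignment")
        if alignment is None:
            continue  # not aligned to the source text: always dropped
        if drop_fuzzy and alignment == "match_fuzzy":
            continue  # configurable: fuzzy matches may be excluded
        kept.append(rec)
    return kept
```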
### 3.6 QAResult
Result object returned by Agentic-RAG question answering. It contains the answer plus the full reasoning provenance chain.
```json
{
"query_id": "q_20260305_001",
"question": "What is GraphRAG and how does it relate to knowledge graphs?",
"answer": "GraphRAG is a knowledge graph-enhanced retrieval-augmented generation system...",
"tool_calls": [
{
"tool": "search_entities",
"input": {"query": "GraphRAG"},
"output": "Found 1 entity(ies) matching 'GraphRAG':\n [TECHNOLOGY] \"GraphRAG\" (confidence=match_exact, page=0, id=tech_graphrag_0)"
},
{
"tool": "get_neighbors",
"input": {"entity_name": "GraphRAG", "hops": 1},
"output": "Neighbors of 'GraphRAG' [TECHNOLOGY] within 1 hop(s):\n Hop 1 — 39 related entities:\n [CONCEPT] knowledge graphs\n ..."
}
],
"cited_nodes": ["tech_graphrag_0", "concept_knowledgegraph_1"],
"elapsed_seconds": 8.4,
"created_at": "2026-03-05T10:30:00Z"
}
```
| Field | Type | Description |
|------|------|------|
| `query_id` | `string` | Unique query ID |
| `question` | `string` | The user's original question |
| `answer` | `string` | Final natural-language answer generated by the Agent (`result["messages"][-1].content`) |
| `tool_calls` | `array` | Tool calls made during the ReAct loop, in order |
| `tool_calls[].tool` | `string` | Tool name (one of the 4 KG tools) |
| `tool_calls[].input` | `object` | Tool call arguments |
| `tool_calls[].output` | `string` | Text returned by the tool (`ToolMessage.content`) |
| `cited_nodes` | `string[]` | IDs of the nodes cited in the answer (parsed from `tool_calls`) |
| `elapsed_seconds` | `float` | Total QA time (including all LLM calls) |
| `created_at` | `string` | Query time in ISO 8601 |
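Since `cited_nodes` is parsed from `tool_calls`, one possible approach is to scan tool outputs for `id=<node_id>` tokens, as they appear in the `search_entities` sample output. This is a sketch under that assumption; the actual parsing rule is not specified here, and `extract_cited_nodes` is a hypothetical name. Tool outputs that do not print IDs contribute nothing under this rule.

```python
import re

# Assumption: tool outputs mention nodes as "id=<node_id>".
NODE_ID_PATTERN = re.compile(r"\bid=([A-Za-z0-9_]+)")


def extract_cited_nodes(tool_calls: list[dict]) -> list[str]:
    """Collect node IDs from tool outputs, de-duplicated in first-seen order."""
    cited: list[str] = []
    for call in tool_calls:
        for node_id in NODE_ID_PATTERN.findall(call.get("output", "")):
            if node_id not in cited:
                cited.append(node_id)
    return cited
```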
---
## 4. Group A: Document Management (4 endpoints)
### A1. Upload a File
```
POST /api/v1/documents/upload
Content-Type: multipart/form-data
```
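**Request (Form Data)**
| Field | Type | Required | Default | Description |
|------|------|------|--------|------|
| `file` | `binary` | **Yes** | - | Binary file content |
| `language` | `string` | No | `"ch"` | OCR language (PaddleOCR language code) |
| `enable_formula` | `bool` | No | `true` | Whether to enable formula recognition |
| `enable_table` | `bool` | No | `true` | Whether to enable table recognition |
**Validation rules:**
- The file extension must be in the supported list (see Section 10)
- The file size must not exceed 200MB
- The filename must not contain path separators (to prevent directory traversal)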
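The three validation rules can be sketched as a single function that returns a business code. Several details are assumptions of this sketch: the names `validate_upload`, `SUPPORTED_EXTENSIONS` and `MAX_SIZE_BYTES`, reading 200MB as 200 × 1024 × 1024 bytes, and reporting a filename with path separators as code `1001`.

```python
SUPPORTED_EXTENSIONS = {"pdf", "docx", "doc", "pptx", "ppt", "png", "jpg", "jpeg", "html"}
MAX_SIZE_BYTES = 200 * 1024 * 1024  # assumption: 200MB interpreted as binary megabytes


def validate_upload(filename: str, size_bytes: int) -> tuple[int, str]:
    """Return (code, msg); code 0 means the upload passes all three rules."""
    # Rule 3: reject path separators to prevent directory traversal.
    if not filename or "/" in filename or "\\" in filename or filename in {".", ".."}:
        return 1001, "Invalid filename"
    # Rule 1: the lowercase extension must be supported.
    ext = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    if ext not in SUPPORTED_EXTENSIONS:
        return 1002, f"Unsupported file format: .{ext}"
    # Rule 2: enforce the size limit.
    if size_bytes > MAX_SIZE_BYTES:
        size_mb = size_bytes / 1024 / 1024
        return 1003, f"File size {size_mb:.0f}MB exceeds 200MB limit"
    return 0, "success"
```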
**Response 200**
```json
{
"code": 0,
"msg": "success",
"request_id": "f47ac10b-...",
"data": {
"doc_id": "abc12345",
"filename": "graphrag_overview.pdf",
"format": "pdf",
"size_bytes": 1048576,
"pages": null,
"uploaded_at": "2026-03-05T10:00:00Z",
"status": "uploaded",
"language": "en",
"enable_formula": true,
"enable_table": true
}
}
```
**Error responses:**
```json
// 1002: unsupported format
{ "code": 1002, "msg": "Unsupported file format: .xlsx", "data": null }
// 1003: size limit exceeded
{ "code": 1003, "msg": "File size 256MB exceeds 200MB limit", "data": null }
```
---
### A2. Get Document Info
```
GET /api/v1/documents/{doc_id}
```
**Path Params**
| Parameter | Type | Description |
|------|------|------|
| `doc_id` | `string` | Document ID |
**Response 200**
```json
{
"code": 0,
"msg": "success",
"request_id": "...",
"data": {
"doc_id": "abc12345",
"filename": "graphrag_overview.pdf",
"format": "pdf",
"size_bytes": 1048576,
"pages": 4,
"uploaded_at": "2026-03-05T10:00:00Z",
"status": "indexed",
"language": "en",
"enable_formula": true,
"enable_table": true
}
}
```
**Errors:** `2001` (doc_id not found)
---
### A3. List All Documents
```
GET /api/v1/documents
```
**Query Params**
| Parameter | Type | Default | Description |
|------|------|--------|------|
| `page` | `int` | `1` | Page number (starting at 1) |
| `page_size` | `int` | `20` | Items per page (max 100) |
| `status` | `string` | - | Filter by status: `uploaded` / `indexed` / `failed` |
| `format` | `string` | - | Filter by format, e.g. `pdf` |
**Response 200**
```json
{
"code": 0,
"msg": "success",
"request_id": "...",
"data": {
"total": 5,
"page": 1,
"page_size": 20,
"items": [
{
"doc_id": "abc12345",
"filename": "graphrag_overview.pdf",
"format": "pdf",
"size_bytes": 1048576,
"pages": 4,
"uploaded_at": "2026-03-05T10:00:00Z",
"status": "indexed",
"language": "en",
"enable_formula": true,
"enable_table": true
}
]
}
}
```
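The `total` / `page` / `page_size` / `items` shape recurs in every list endpoint, so it can come from one helper. The sketch below is hypothetical (`paginate` is not named in this specification), and clamping out-of-range values instead of rejecting them with code `1001` is an assumption.

```python
def paginate(items: list, page: int = 1, page_size: int = 20,
             max_page_size: int = 100) -> dict:
    """Slice items into the paged response shape used by the list endpoints."""
    page = max(1, page)                                 # pages start at 1
    page_size = max(1, min(page_size, max_page_size))   # clamp to the allowed range
    start = (page - 1) * page_size
    return {
        "total": len(items),
        "page": page,
        "page_size": page_size,
        "items": items[start:start + page_size],
    }
```

Endpoints with different limits (for example max 200 for KG nodes or 500 for KG edges) would pass their own `max_page_size`.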
---
### A4. Delete a Document
```
DELETE /api/v1/documents/{doc_id}
```
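**Description:** Deletes the document and its associated job artifacts (the matching entries under `uploads/` and `jobs/`), and removes the nodes and edges contributed by this document from the global KG.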
**Response 200**
```json
{
"code": 0,
"msg": "success",
"request_id": "...",
"data": {
"deleted": true,
"doc_id": "abc12345",
"removed_nodes": 40,
"removed_edges": 780
}
}
```
**Errors:** `2001` (doc_id not found)
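Removing one document's contribution from the global KG can be sketched as two filters. This assumes nodes are attributed through `source_doc` and edges through `doc_id`, as in the KGNode and KGEdge schemas; `remove_document_from_kg` is a hypothetical name, and dropping edges that would otherwise dangle is a choice made by this sketch.

```python
def remove_document_from_kg(nodes: list[dict], edges: list[dict],
                            doc_id: str) -> tuple[list[dict], list[dict], int, int]:
    """Return (kept_nodes, kept_edges, removed_nodes, removed_edges)."""
    kept_nodes = [n for n in nodes if n.get("source_doc") != doc_id]
    kept_ids = {n["id"] for n in kept_nodes}
    kept_edges = [
        e for e in edges
        if e.get("doc_id") != doc_id
        # Also drop edges whose endpoints no longer exist.
        and e["source"] in kept_ids and e["target"] in kept_ids
    ]
    return (kept_nodes, kept_edges,
            len(nodes) - len(kept_nodes), len(edges) - len(kept_edges))
```

The two counts map directly onto `removed_nodes` and `removed_edges` in the response.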
---
## 5. Group B: Indexing Pipeline (4 endpoints)
### B1. Start an Indexing Job
```
POST /api/v1/index/start
Content-Type: application/json
```
**Request Body**
```json
{
"doc_id": "abc12345"
}
```
| Field | Type | Required | Description |
|------|------|------|------|
| `doc_id` | `string` | **Yes** | ID of an uploaded document (its status must be `uploaded`) |
**Response 202**
```json
{
"code": 0,
"msg": "success",
"request_id": "...",
"data": {
"job_id": "job_xyz789",
"doc_id": "abc12345",
"status": "submitted",
"stage": "Job submitted",
"created_at": "2026-03-05T10:00:05Z"
}
}
```
**Implementation notes:**
```python
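import threading
import uuid

# IndexingService internals.
# JOBS_DIR, save_meta and run_pipeline are defined elsewhere in the service.
def start_indexing(doc_id: str) -> IndexingJobStatus:
    job_id = f"job_{uuid.uuid4().hex[:8]}"
    job_dir = JOBS_DIR / job_id
    job_dir.mkdir(parents=True)
    # Remaining meta fields are elided here, as in the original listing.
    meta = {"job_id": job_id, "doc_id": doc_id, "status": "submitted"}
    save_meta(job_dir / "meta.json", meta)
    # A daemon thread lets the request return immediately with 202.
    thread = threading.Thread(target=run_pipeline, args=(job_id,), daemon=True)
    thread.start()
    return meta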
```
**Pipeline execution order (background thread):**
1. `status = "parsing"` → `subprocess.run([MINERU_PYTHON, MINERU_PIPELINE, pdf_path])`
2. `status = "extracting"` → `load_content_list()` → `assemble_pages()` → `extract_entities()` per page
3. `status = "indexing"` → `build_kg()` → save `jobs/{job_id}/kg_nodes.json` → merge into `kg/`
4. `status = "done"`
---
### B2. Query Job Status (with live progress)
```
GET /api/v1/index/status/{job_id}
```
**Recommended polling interval:** 3 seconds
**Response 200**
```json
{
"code": 0,
"msg": "success",
"request_id": "...",
"data": {
"job_id": "job_xyz789",
"doc_id": "abc12345",
"status": "extracting",
"stage": "Extracting entities page 2/4 (LangExtract + DeepSeek)...",
"progress": {
"parsed_pages": 4,
"total_pages": 4,
"extracted_entities": 23
},
"created_at": "2026-03-05T10:00:05Z",
"elapsed_seconds": 18.3,
"error": null
}
}
```
**Typical `stage` values per status:**
| status | stage |
|--------|-------|
| `submitted` | `"Job submitted"` |
| `queued` | `"Waiting for worker..."` |
| `parsing` | `"MinerU PDF parsing (cloud API)..."` |
| `extracting` | `"Extracting entities page 2/4 (LangExtract + DeepSeek)..."` |
| `indexing` | `"Building knowledge graph..."` |
| `done` | `"Complete"` |
| `failed` | `"Error: {error message}"` |
**Errors:** `2002` (job_id not found)
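A client-side polling loop for this endpoint might look like the sketch below. It is deliberately transport-agnostic: `fetch_status` is a hypothetical callable that performs `GET /api/v1/index/status/{job_id}` and returns the `data` object, and the 900-second timeout is an arbitrary assumption. The 3-second interval follows the recommendation above.

```python
import time
from typing import Callable

# "cancelled" is included because B4 can mark a job as cancelled.
TERMINAL_STATUSES = {"done", "failed", "cancelled"}


def wait_for_job(fetch_status: Callable[[str], dict], job_id: str,
                 interval: float = 3.0, timeout: float = 900.0,
                 sleep: Callable[[float], None] = time.sleep) -> dict:
    """Poll until the job reaches a terminal status or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status(job_id)
        if status["status"] in TERMINAL_STATUSES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Job {job_id} still {status['status']} after {timeout}s")
        sleep(interval)
```

Passing `sleep` as a parameter keeps the loop testable without real waiting.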
---
### B3. Get Indexing Result (full data)
```
GET /api/v1/index/result/{job_id}
```
**Response 200 (status = done)**
```json
{
"code": 0,
"msg": "success",
"request_id": "...",
"data": {
"job_id": "job_xyz789",
"doc_id": "abc12345",
"status": "done",
"stats": {
"blocks": 32,
"block_types": {"text": 31, "table": 1},
"pages": 4,
"raw_extractions": 45,
"nodes": 40,
"edges": 780,
"type_counts": {"TECHNOLOGY": 4, "CONCEPT": 36},
"alignment_counts": {"match_exact": 40, "match_fuzzy": 5},
"elapsed_seconds": 42.1
},
"extractions": [
{
"text": "GraphRAG",
"type": "TECHNOLOGY",
"char_start": 0,
"char_end": 8,
"alignment": "match_exact",
"page": 0,
"doc_id": "abc12345"
}
],
"nodes": [
{
"id": "tech_graphrag_0",
"name": "GraphRAG",
"type": "TECHNOLOGY",
"source_doc": "abc12345",
"char_start": 0,
"char_end": 8,
"confidence": "match_exact",
"page": 0,
"degree": 39
}
],
"edges": [
{
"source": "tech_graphrag_0",
"target": "concept_knowledgegraph_1",
"relation": "CO_OCCURS_IN",
"doc_id": "abc12345",
"page": 0
}
]
}
}
```
**Response 200 (status ≠ done):** returns `IndexingJobStatus` (without stats/extractions/nodes/edges)
**Errors:** `2002` (job_id not found)
---
### B4. Cancel a Job
```
DELETE /api/v1/index/jobs/{job_id}
```
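**Limitation:** Only jobs in the `submitted` or `queued` state can be cancelled. In the `parsing`/`extracting`/`indexing` states the background thread cannot be interrupted, so the job is only marked as `cancelled`.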
**Response 200**
```json
{
"code": 0,
"msg": "success",
"request_id": "...",
"data": {
"cancelled": true,
"job_id": "job_xyz789",
"previous_status": "submitted"
}
}
```
**Errors:** `2002` (not found), `2004` (state cannot be cancelled)
---
## 6. Group C: Knowledge Graph (6 endpoints)
### C1. List All Nodes (pagination + filters)
```
GET /api/v1/kg/nodes
```
**Query Params**
| Parameter | Type | Default | Description |
|------|------|--------|------|
| `type` | `string` | - | Filter by entity type (case-insensitive) |
| `doc_id` | `string` | - | Filter by source document |
| `confidence` | `string` | - | Filter by alignment status (e.g. `match_exact`) |
| `page` | `int` | `1` | Page number |
| `page_size` | `int` | `50` | Items per page (max 200) |
**Response 200**
```json
{
"code": 0,
"msg": "success",
"request_id": "...",
"data": {
"total": 40,
"page": 1,
"page_size": 50,
"items": [
{
"id": "tech_graphrag_0",
"name": "GraphRAG",
"type": "TECHNOLOGY",
"source_doc": "abc12345",
"char_start": 0,
"char_end": 8,
"confidence": "match_exact",
"page": 0,
"degree": 39
}
]
}
}
```
**Errors:** `3002` (KG is empty)
---
### C2. List All Edges (paginated)
```
GET /api/v1/kg/edges
```
**Query Params**
| Parameter | Type | Default | Description |
|------|------|--------|------|
| `doc_id` | `string` | - | Filter by source document |
| `relation` | `string` | - | Filter by relation type (e.g. `CO_OCCURS_IN`) |
| `page` | `int` | `1` | Page number |
| `page_size` | `int` | `100` | Items per page (max 500) |
**Response 200**
```json
{
"code": 0,
"msg": "success",
"request_id": "...",
"data": {
"total": 780,
"page": 1,
"page_size": 100,
"items": [
{
"source": "tech_graphrag_0",
"target": "concept_knowledgegraph_1",
"relation": "CO_OCCURS_IN",
"doc_id": "abc12345",
"page": 0
}
]
}
}
```
---
### C3. Get Single Node Details
```
GET /api/v1/kg/nodes/{node_id}
```
**Response 200**
```json
{
"code": 0,
"msg": "success",
"request_id": "...",
"data": {
"id": "tech_graphrag_0",
"name": "GraphRAG",
"type": "TECHNOLOGY",
"source_doc": "abc12345",
"char_start": 0,
"char_end": 8,
"confidence": "match_exact",
"page": 0,
"degree": 39,
"degree_centrality": 1.000,
"neighbor_count": 39
}
}
```
**Extra fields (single-node details only):**
| Field | Description |
|------|------|
| `degree_centrality` | NetworkX `degree_centrality(G)[node_id]` (range 0-1) |
| `neighbor_count` | Number of direct neighbors (equal to `degree`) |
**Errors:** `3001` (node not found)
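For intuition, degree centrality is a node's degree divided by the number of other nodes, n - 1. The dependency-free sketch below illustrates the formula; it is not the service code, which uses NetworkX. `degree_centrality_from_edges` is a hypothetical name, and returning 0.0 for graphs with fewer than two nodes is a choice made by this sketch.

```python
def degree_centrality_from_edges(node_ids: list[str], edges: list[dict]) -> dict[str, float]:
    """Compute degree / (n - 1) per node for an undirected simple graph."""
    neighbors: dict[str, set] = {nid: set() for nid in node_ids}
    for e in edges:
        s, t = e["source"], e["target"]
        if s in neighbors and t in neighbors and s != t:
            neighbors[s].add(t)  # sets collapse parallel edges
            neighbors[t].add(s)
    n = len(node_ids)
    if n < 2:
        return {nid: 0.0 for nid in node_ids}
    return {nid: len(adj) / (n - 1) for nid, adj in neighbors.items()}
```

In a fully connected graph of 40 nodes every node has degree 39, so each centrality is 39 / 39 = 1.0, which matches the sample response.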
---
### C4. Get Node Neighbors (N-hop BFS)
```
GET /api/v1/kg/nodes/{node_id}/neighbors
```
**Query Params**
| Parameter | Type | Default | Description |
|------|------|--------|------|
| `hops` | `int` | `1` | Number of hops (1-3) |
**Response 200**
```json
{
"code": 0,
"msg": "success",
"request_id": "...",
"data": {
"center": {
"id": "tech_graphrag_0",
"name": "GraphRAG",
"type": "TECHNOLOGY",
"page": 0
},
"hops": 1,
"neighbors_by_hop": {
"1": [
{ "id": "concept_knowledgegraph_1", "name": "knowledge graphs", "type": "CONCEPT", "page": 0 }
]
},
"total_neighbors": 39
}
}
```
**Reference implementation (from `agentic_rag_mvp.py`):**
```python
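import networkx as nx

# G is the loaded KG graph; node_id and hops come from the request.
reachable = nx.single_source_shortest_path_length(G, node_id, cutoff=hops)
by_hop = {dist: [] for dist in range(1, hops + 1)}
for nid, dist in reachable.items():
    if dist > 0:  # distance 0 is the center node itself
        by_hop[dist].append(G.nodes[nid])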
```
**Errors:** `3001` (node not found)
---
### C5. Knowledge Graph Statistics
```
GET /api/v1/kg/stats
```
**Response 200**
```json
{
"code": 0,
"msg": "success",
"request_id": "...",
"data": {
"total_nodes": 40,
"total_edges": 780,
"density": 1.0000,
"type_distribution": {
"TECHNOLOGY": 4,
"CONCEPT": 36
},
"relation_types": {
"CO_OCCURS_IN": 780
},
"top5_central_nodes": [
{ "node_id": "tech_graphrag_0", "name": "GraphRAG", "type": "TECHNOLOGY", "centrality": 1.000 },
{ "node_id": "concept_kgrag_1", "name": "Knowledge Graph Enhanced RAG System", "type": "CONCEPT", "centrality": 1.000 },
{ "node_id": "concept_rag_2", "name": "retrieval-augmented generation", "type": "CONCEPT", "centrality": 1.000 },
{ "node_id": "concept_kg_3", "name": "knowledge graphs", "type": "CONCEPT", "centrality": 1.000 },
{ "node_id": "concept_llm_4", "name": "large language models", "type": "CONCEPT", "centrality": 1.000 }
],
"source_documents": ["abc12345", "def67890"]
}
}
```
---
### C6. Export the Full KG
```
GET /api/v1/kg/export
```
**Query Params**
| Parameter | Type | Default | Description |
|------|------|--------|------|
| `format` | `string` | `"json"` | Export format (only `json` is supported at present) |
| `doc_id` | `string` | - | Optional; export only the KG of the given document |
**Response 200**
```json
{
"code": 0,
"msg": "success",
"request_id": "...",
"data": {
"format": "json",
"doc_id": null,
"total_nodes": 40,
"total_edges": 780,
"exported_at": "2026-03-05T12:00:00Z",
"nodes": [ ...KGNode[] ],
"edges": [ ...KGEdge[] ]
}
}
```
---
## 7. Group D: QA (4 endpoints)
### D1. Submit a QA Query (synchronous)
```
POST /api/v1/query
Content-Type: application/json
```
**Request Body**
```json
{
"question": "What is GraphRAG and how does it relate to knowledge graphs?",
"history": [
{ "role": "human", "content": "Previous question..." },
{ "role": "ai", "content": "Previous answer..." }
]
}
```
| Field | Type | Required | Description |
|------|------|------|------|
| `question` | `string` | **Yes** | The user's natural-language question |
| `history` | `array` | No | Multi-turn conversation history (at most 10 turns, i.e. 20 messages) |
| `history[].role` | `"human"` \| `"ai"` | - | Message role |
| `history[].content` | `string` | - | Message content |
**Response 200**
```json
{
"code": 0,
"msg": "success",
"request_id": "...",
"data": {
"query_id": "q_20260305_a1b2c3",
"question": "What is GraphRAG and how does it relate to knowledge graphs?",
"answer": "Based on the knowledge graph, GraphRAG [TECHNOLOGY] is a knowledge graph-enhanced retrieval-augmented generation system that...",
"tool_calls": [
{
"tool": "search_entities",
"input": { "query": "GraphRAG" },
"output": "Found 1 entity(ies) matching 'GraphRAG':\n [TECHNOLOGY] \"GraphRAG\" (confidence=match_exact, page=0, id=tech_graphrag_0)"
},
{
"tool": "get_neighbors",
"input": { "entity_name": "GraphRAG", "hops": 1 },
"output": "Neighbors of 'GraphRAG' [TECHNOLOGY] within 1 hop(s):\n Hop 1 — 39 related entities:\n [CONCEPT] knowledge graphs\n ..."
}
],
"cited_nodes": ["tech_graphrag_0", "concept_knowledgegraph_1"],
"elapsed_seconds": 8.4,
"created_at": "2026-03-05T10:30:00Z"
}
}
```
**Implementation notes (QAService core logic):**
```python
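# Convert history into LangChain message tuples
messages = []
for h in request.history:
    messages.append((h["role"], h["content"]))
messages.append(("human", request.question))

# Invoke the agent built with LangChain create_agent
result = agent.invoke({"messages": messages})

# Extract the tool call chain by walking result["messages"]
tool_calls = []
calls_by_id = {}
for msg in result["messages"]:
    if getattr(msg, "tool_calls", None):  # AI message that requests tools
        for tc in msg.tool_calls:
            record = {"tool": tc["name"], "input": tc["args"], "output": ""}
            tool_calls.append(record)
            calls_by_id[tc["id"]] = record
    elif hasattr(msg, "tool_call_id"):  # ToolMessage
        # Match each output to its call by ID, so several tool calls issued
        # in one AI message do not overwrite each other's output.
        record = calls_by_id.get(msg.tool_call_id)
        if record is not None:
            record["output"] = msg.content

# Final answer
answer = result["messages"][-1].content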
```
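**Errors:** `3002` (KG is empty), `4001` (Agent/LLM call failed)
**Note:** This endpoint is synchronous and typically takes 5-30 seconds, depending on DeepSeek API latency and the number of tool calls.
---
### D2. Batch Query (asynchronous)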
```
POST /api/v1/query/batch
Content-Type: application/json
```
**Request Body**
```json
{
"questions": [
"What is GraphRAG?",
"List all TECHNOLOGY entities in the knowledge graph.",
"How does MinerU relate to LangExtract?"
]
}
```
| Field | Type | Required | Constraint | Description |
|------|------|------|------|------|
| `questions` | `string[]` | **Yes** | At most 20 | List of questions |
**Response 202**
```json
{
"code": 0,
"msg": "success",
"request_id": "...",
"data": {
"batch_id": "batch_20260305_x1y2",
"total": 3,
"status": "submitted",
"created_at": "2026-03-05T10:30:00Z"
}
}
```
---
### D3. Get Batch Query Status and Results
```
GET /api/v1/query/batch/{batch_id}
```
**Response 200**
```json
{
"code": 0,
"msg": "success",
"request_id": "...",
"data": {
"batch_id": "batch_20260305_x1y2",
"total": 3,
"completed": 2,
"failed": 0,
"status": "running",
"results": [
{ ...QAResult },
{ ...QAResult }
]
}
}
```
**Errors:** `2002` (batch_id not found)
---
### D4. Query History
```
GET /api/v1/query/history
```
**Query Params**
| Parameter | Type | Default | Description |
|------|------|--------|------|
| `page` | `int` | `1` | Page number |
| `page_size` | `int` | `20` | Items per page (max 50) |
**Response 200**
```json
{
"code": 0,
"msg": "success",
"request_id": "...",
"data": {
"total": 50,
"page": 1,
"page_size": 20,
"items": [ ...QAResult[] ]
}
}
```
**Storage note:** History is persisted in JSONL format at `jobs/query_history.jsonl`, one `QAResult` per line.
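A minimal sketch of the JSONL persistence follows. `append_history` and `read_history` are hypothetical helpers, and returning the newest records first is an assumption, since the ordering of the history endpoint is not specified here.

```python
import json
from pathlib import Path


def append_history(path: Path, qa_result: dict) -> None:
    """Append one QAResult as a single JSON line."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(qa_result, ensure_ascii=False) + "\n")


def read_history(path: Path, page: int = 1, page_size: int = 20) -> dict:
    """Read the JSONL file and return one page, newest first (assumed order)."""
    records = []
    if path.exists():
        with path.open(encoding="utf-8") as fh:
            records = [json.loads(line) for line in fh if line.strip()]
    records.reverse()
    start = (max(1, page) - 1) * page_size
    return {
        "total": len(records),
        "page": max(1, page),
        "page_size": page_size,
        "items": records[start:start + page_size],
    }
```

Appending one line per query avoids rewriting the whole file on each request, which is the main reason to prefer JSONL over a single JSON array here.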
---
## 8. Group E: Search (3 endpoints)
### E1. Entity Keyword Search
```
GET /api/v1/search/entities
```
**Query Params**
| Parameter | Type | Required | Description |
|------|------|------|------|
| `q` | `string` | **Yes** | Keyword (case-insensitive substring match, corresponding to `agentic_rag_mvp.py: search_entities`) |
| `type` | `string` | No | Type filter (e.g. `TECHNOLOGY`) |
| `limit` | `int` | No | Maximum number of results (default 15, max 100) |
**Response 200**
```json
{
"code": 0,
"msg": "success",
"request_id": "...",
"data": {
"query": "GraphRAG",
"total": 1,
"items": [
{
"id": "tech_graphrag_0",
"name": "GraphRAG",
"type": "TECHNOLOGY",
"source_doc": "abc12345",
"char_start": 0,
"char_end": 8,
"confidence": "match_exact",
"page": 0,
"degree": 39
}
]
}
}
```
**Implementation (see `agentic_rag_mvp.py: search_entities`):**
```python
q = query.lower()
matches = [data for _, data in G.nodes(data=True) if q in data.get("name", "").lower()]
```
---
### E2. Graph Path Search (paths between two nodes)
```
GET /api/v1/search/path
```
**Query Params**
| Parameter | Type | Required | Description |
|------|------|------|------|
| `from` | `string` | **Yes** | Source node ID |
| `to` | `string` | **Yes** | Target node ID |
| `max_hops` | `int` | No | Maximum path length (default 3, max 5) |
**Response 200**
```json
{
"code": 0,
"msg": "success",
"request_id": "...",
"data": {
"from": { "id": "tech_graphrag_0", "name": "GraphRAG", "type": "TECHNOLOGY" },
"to": { "id": "tech_mineru_3", "name": "MinerU", "type": "TECHNOLOGY" },
"max_hops": 3,
"paths": [
{
"length": 1,
"nodes": [
{ "id": "tech_graphrag_0", "name": "GraphRAG", "type": "TECHNOLOGY" },
{ "id": "tech_mineru_3", "name": "MinerU", "type": "TECHNOLOGY" }
],
"edges": [
{ "source": "tech_graphrag_0", "target": "tech_mineru_3", "relation": "CO_OCCURS_IN" }
]
}
],
"total_paths": 1
}
}
```
**Implementation (NetworkX):**
```python
paths = list(nx.all_simple_paths(G, from_id, to_id, cutoff=max_hops))
```
**Errors:** `3001` (node not found)
---
### E3. Full-Graph Keyword Search (with subgraph)
```
GET /api/v1/search/graph
```
**Query Params**
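| Parameter | Type | Required | Description |
|------|------|------|------|
| `q` | `string` | **Yes** | Keyword (case-insensitive substring match) |
| `include_neighbors` | `bool` | No | Whether to return the edges to direct neighbors of matched nodes (default `false`) |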
**Response 200**
```json
{
"code": 0,
"msg": "success",
"request_id": "...",
"data": {
"query": "retrieval",
"matched_nodes": [
{ "id": "concept_rag_2", "name": "retrieval-augmented generation", "type": "CONCEPT", "page": 0 }
],
"subgraph_edges": [
{ "source": "concept_rag_2", "target": "tech_graphrag_0", "relation": "CO_OCCURS_IN" }
]
}
}
```
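One way to read `include_neighbors`: when it is `true`, `subgraph_edges` holds every edge with at least one endpoint among the matched nodes, and when it is `false` the list is empty. A sketch under that assumption, using plain lists of node and edge dicts (the function name `search_graph` is hypothetical):

```python
def search_graph(nodes, edges, q, include_neighbors=False):
    """Keyword search that optionally returns edges around the matches."""
    needle = q.lower()
    matched = [n for n in nodes if needle in n.get("name", "").lower()]
    matched_ids = {n["id"] for n in matched}
    subgraph_edges = []
    if include_neighbors:
        # Assumption: keep edges that touch at least one matched node.
        subgraph_edges = [
            e for e in edges
            if e["source"] in matched_ids or e["target"] in matched_ids
        ]
    return {"query": q, "matched_nodes": matched, "subgraph_edges": subgraph_edges}


nodes = [
    {"id": "concept_rag_2", "name": "retrieval-augmented generation", "type": "CONCEPT"},
    {"id": "tech_graphrag_0", "name": "GraphRAG", "type": "TECHNOLOGY"},
    {"id": "tech_mineru_3", "name": "MinerU", "type": "TECHNOLOGY"},
]
edges = [
    {"source": "concept_rag_2", "target": "tech_graphrag_0", "relation": "CO_OCCURS_IN"},
    {"source": "tech_graphrag_0", "target": "tech_mineru_3", "relation": "CO_OCCURS_IN"},
]
found = search_graph(nodes, edges, "retrieval", include_neighbors=True)
```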
---
## 9. Group F: System (4 endpoints)
### F1. Health check
```
GET /api/v1/health
```
**Response 200**
```json
{
"code": 0,
"msg": "success",
"request_id": "...",
"data": {
"status": "healthy",
"version": "1.0.0",
"uptime_seconds": 3600,
"components": {
"mineru_venv": {
"status": "ok",
"path": "F:/GraphRAGAgent/mineru_mvp/.venv/Scripts/python.exe",
"exists": true
},
"langextract_venv": {
"status": "ok",
"path": "F:/GraphRAGAgent/langextract_src/.venv/Scripts/python.exe",
"exists": true
},
"deepseek_api": {
"status": "ok",
"base_url": "https://api.deepseek.com",
"key_configured": true
},
"storage": {
"status": "ok",
"kg_nodes_exists": true,
"kg_edges_exists": true,
"uploads_dir_exists": true
}
}
}
}
```
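A sketch of how the `components` map could be assembled from purely local checks, with no network calls. The helper name `check_components`, the environment variable `DEEPSEEK_API_KEY`, and the `"error"` status value are assumptions for illustration; the spec does not name them.

```python
import os
from pathlib import Path


def check_components(mineru_python, langextract_python, env=os.environ):
    """Build the `components` map using only local checks (hypothetical helper)."""
    def venv_status(path):
        exists = Path(path).exists()
        return {"status": "ok" if exists else "error", "path": str(path), "exists": exists}

    # Assumption: the key is read from DEEPSEEK_API_KEY; only its presence is tested.
    key_configured = bool(env.get("DEEPSEEK_API_KEY"))
    return {
        "mineru_venv": venv_status(mineru_python),
        "langextract_venv": venv_status(langextract_python),
        "deepseek_api": {
            "status": "ok" if key_configured else "error",
            "key_configured": key_configured,
        },
    }


components = check_components("/nonexistent/python", "/nonexistent/python", env={})
```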
**Note:** This endpoint only checks configuration and file existence. It makes no actual API calls (to avoid consuming DeepSeek tokens).

---
### F2. System statistics
```
GET /api/v1/system/stats
```
**Response 200**
```json
{
"code": 0,
"msg": "success",
"request_id": "...",
"data": {
"total_documents": 5,
"indexed_documents": 4,
"failed_documents": 1,
"total_nodes": 200,
"total_edges": 3900,
"type_distribution": { "TECHNOLOGY": 20, "CONCEPT": 180 },
"total_queries": 50,
"active_jobs": 1,
"storage_used_mb": 12.4
}
}
```
---
### F3. Supported file formats
```
GET /api/v1/system/formats
```
**Response 200**
```json
{
"code": 0,
"msg": "success",
"request_id": "...",
"data": {
"formats": [
{ "ext": "pdf", "description": "PDF document (text-based / scanned / hybrid)", "max_size_mb": 200, "max_pages": 600, "requires_ocr": false },
{ "ext": "docx", "description": "Microsoft Word (new format)", "max_size_mb": 200, "max_pages": 600, "requires_ocr": false },
{ "ext": "doc", "description": "Microsoft Word (legacy format)", "max_size_mb": 200, "max_pages": 600, "requires_ocr": false },
{ "ext": "pptx", "description": "PowerPoint (new format)", "max_size_mb": 200, "max_pages": 600, "requires_ocr": false },
{ "ext": "ppt", "description": "PowerPoint (legacy format)", "max_size_mb": 200, "max_pages": 600, "requires_ocr": false },
{ "ext": "png", "description": "PNG image (single page)", "max_size_mb": 200, "max_pages": 1, "requires_ocr": true },
{ "ext": "jpg", "description": "JPEG image (single page)", "max_size_mb": 200, "max_pages": 1, "requires_ocr": true },
{ "ext": "jpeg", "description": "JPEG image (single page)", "max_size_mb": 200, "max_pages": 1, "requires_ocr": true },
{ "ext": "html", "description": "HTML file (requires model_version=MinerU-HTML)", "max_size_mb": 200, "max_pages": 600, "requires_ocr": false }
],
"ocr_languages": [
{ "code": "ch", "name": "Chinese (default)" },
{ "code": "en", "name": "English" },
{ "code": "japan", "name": "Japanese" },
{ "code": "korean", "name": "Korean" },
{ "code": "french", "name": "French" },
{ "code": "german", "name": "German" }
],
"notes": [
"The language parameter defaults to 'ch' (not 'zh'), following the PaddleOCR v3 language code convention",
"Uploads do not need to carry Content-Type: application/pdf or similar; the server detects the type automatically",
"PNG/JPG/JPEG uploads are processed as at most 1 page each (an image file is treated as a single-page document)"
]
}
}
```
---
### F4. Demo data (quick preview)
```
GET /api/v1/system/demo
```
**Note:** Returns the existing `output/kg_nodes.json` and `output/kg_edges.json` data, so the KG visualization can be previewed without uploading a PDF. Compatible with the legacy `GET /api/demo` (Flask `web_server.py`).

**Response 200:**
```json
{
"code": 0,
"msg": "success",
"request_id": "...",
"data": {
"nodes": [ ...KGNode[] ],
"edges": [ ...KGEdge[] ],
"stats": {
"nodes": 40,
"edges": 780,
"type_counts": { "TECHNOLOGY": 4, "CONCEPT": 36 },
"density": 1.0000
}
}
}
```
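The sample `stats` are internally consistent if the graph is treated as undirected: with 40 nodes, a complete graph has 40 × 39 / 2 = 780 edges, so the density is 1.0. A small sketch of that calculation follows. For an undirected graph it uses the same formula as `nx.density`; the helper name `graph_stats` is hypothetical.

```python
from collections import Counter


def graph_stats(nodes, edge_count):
    """Compute the `stats` block from node dicts and an undirected edge count."""
    n = len(nodes)
    # Undirected density: edges present / edges possible; 0.0 when there are < 2 nodes.
    density = 2 * edge_count / (n * (n - 1)) if n > 1 else 0.0
    return {
        "nodes": n,
        "edges": edge_count,
        "type_counts": dict(Counter(node["type"] for node in nodes)),
        "density": round(density, 4),
    }


nodes = [{"type": "TECHNOLOGY"}] * 4 + [{"type": "CONCEPT"}] * 36
stats = graph_stats(nodes, 780)
```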
**Errors:** `3002` (demo data files not found; run `bridge.py` first to generate them)

---
## 10. File format support matrix
| Format | Extension | Max size | Max pages | OCR | MinerU model_version | Notes |
|------|--------|---------|---------|-----|----------------------|------|
| PDF | `.pdf` | 200MB | 600 pages | Optional | `pipeline` (default) | Core capability; text-based, scanned, and hybrid PDFs are all supported |
| Word | `.docx` | 200MB | 600 pages | Optional | `pipeline` | |
| Word | `.doc` | 200MB | 600 pages | Optional | `pipeline` | |
| PPT | `.pptx` | 200MB | 600 pages | Optional | `pipeline` | |
| PPT | `.ppt` | 200MB | 600 pages | Optional | `pipeline` | |
| PNG image | `.png` | 200MB | 1 page | Required | `pipeline` | EXIF orientation is corrected automatically |
| JPEG image | `.jpg` | 200MB | 1 page | Required | `pipeline` | EXIF orientation is corrected automatically |
| JPEG image | `.jpeg` | 200MB | 1 page | Required | `pipeline` | Same as `.jpg` |
| HTML | `.html` | 200MB | 600 pages | No | `MinerU-HTML` | A specific model_version must be set |

**MinerU cloud API limits (from mineru_specification-v1.0.md):**
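| Constraint | Limit |
|--------|--------|
| Max size per file | 200 MB |
| Max pages per file | 600 pages |
| Max files per batch request | 200 |
| Pre-signed upload URL validity | 24 hours |
| Cloud API daily top-priority quota | 2,000 pages (priority is lowered beyond this) |

**Server-side validation code (FastAPI + Pydantic):**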
```python
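from pathlib import Path

from fastapi import File, HTTPException, UploadFile

ALLOWED_EXTENSIONS = {"pdf", "docx", "doc", "pptx", "ppt", "png", "jpg", "jpeg", "html"}
MAX_FILE_SIZE_MB = 200


async def upload_document(file: UploadFile = File(...)):  # other parameters omitted
    # Normalize the extension (".PDF" -> "pdf"); a missing filename yields "".
    ext = Path(file.filename or "").suffix.lower().lstrip(".")
    if ext not in ALLOWED_EXTENSIONS:
        raise HTTPException(400, detail=f"Unsupported format: .{ext}")
    # Note: this reads the whole upload into memory before checking its size.
    content = await file.read()
    size_mb = len(content) / (1024 * 1024)
    if size_mb > MAX_FILE_SIZE_MB:
        raise HTTPException(400, detail=f"File size {size_mb:.1f}MB exceeds {MAX_FILE_SIZE_MB}MB limit")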
```
---
## 11. Dependencies and running

### Install dependencies
```bash
# FastAPI + uvicorn + multipart file upload
uv pip install fastapi "uvicorn[standard]" python-multipart \
  --python F:/GraphRAGAgent/langextract_src/.venv/Scripts/python.exe

# Existing dependencies (no need to reinstall)
# langextract[all], langchain, langchain-openai, networkx, python-dotenv, flask, requests
```
### Start the service
```bash
# Development mode (--reload enables hot reload)
F:/GraphRAGAgent/langextract_src/.venv/Scripts/python.exe -m uvicorn \
  graphrag_pipeline.api_server:app \
  --host 0.0.0.0 --port 8000 --reload

# Or run the main entry point directly
F:/GraphRAGAgent/langextract_src/.venv/Scripts/python.exe \
  F:/GraphRAGAgent/graphrag_pipeline/api_server.py
```
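### Accessing the API docs

FastAPI generates the OpenAPI documentation automatically. Once the service is running, the following URLs are available:

| URL | Description |
|------|------|
| `http://localhost:8000/api/v1/health` | Health check (confirms the service has started) |
| `http://localhost:8000/docs` | Swagger UI (interactive API docs) |
| `http://localhost:8000/redoc` | ReDoc (read-only API docs) |
| `http://localhost:8000/openapi.json` | OpenAPI JSON Schema |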
### Ports

| Service | Port | Description |
|------|------|------|
| **FastAPI** | `8000` | Production-grade API described by this specification |
| Flask web_server.py | `5000` | Prototype, kept for comparison |