GraphRAGAgent/docs/backend_service_specification-v1.0.md


Multimodal RAG Backend Service API Specification v1.0

Based on verified test results from the MinerU + LangExtract Bridge Pipeline + Agentic-RAG MVP
Web framework: FastAPI (Python 3.12 async)
Storage: plain filesystem (JSON)
Last updated: 2026-03-05



1. System Architecture Overview

1.1 Four-Layer Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                          Client layer                               │
│            Browser / API callers / visualization frontend           │
└──────────────────────────────┬──────────────────────────────────────┘
                               │ HTTP/HTTPS
┌──────────────────────────────▼──────────────────────────────────────┐
│                          API gateway layer                          │
│   Nginx reverse proxy | rate limiting (per-IP/per-key)              │
│   request logging | TLS termination                                 │
└──────────────────────────────┬──────────────────────────────────────┘
                               │
┌──────────────────────────────▼──────────────────────────────────────┐
│                  Service layer: FastAPI Application                 │
│                   Python 3.12 async / uvicorn                        │
│                                                                      │
│  ┌────────────────┐  ┌────────────────┐  ┌───────────────────────┐ │
│  │ DocumentService│  │ IndexingService│  │    KGService           │ │
│  │ upload, manage │  │ job scheduling │  │  NetworkX graph ops    │ │
│  └────────────────┘  └────────────────┘  └───────────────────────┘ │
│  ┌────────────────┐  ┌────────────────┐  ┌───────────────────────┐ │
│  │   QAService    │  │  SearchService │  │    SystemService       │ │
│  │  Agentic-RAG   │  │entity/KG search│  │  health check / stats  │ │
│  └────────────────┘  └────────────────┘  └───────────────────────┘ │
└──────────────────────────────┬──────────────────────────────────────┘
                               │
┌──────────────────────────────▼──────────────────────────────────────┐
│                      Pipeline execution layer                       │
│                                                                      │
│  ┌──────────────────────────────────────────────────────────────┐  │
│  │  MinerU Pipeline (subprocess → mineru_mvp/.venv)             │  │
│  │  Input: file path  Output: *content_list.json + layout.json  │  │
│  └──────────────────────────────────────────────────────────────┘  │
│  ┌──────────────────────────────────────────────────────────────┐  │
│  │  Bridge Pipeline (direct import → langextract_src/.venv)     │  │
│  │  text_assembler → entity_extractor → kg_builder              │  │
│  │  Output: kg_nodes.json + kg_edges.json                       │  │
│  └──────────────────────────────────────────────────────────────┘  │
│  ┌──────────────────────────────────────────────────────────────┐  │
│  │  Agentic-RAG (LangChain create_agent → langextract_src/.venv)│  │
│  │  Tools: search_entities / get_neighbors /                    │  │
│  │         get_entities_by_type / describe_graph                │  │
│  │  LLM: DeepSeek deepseek-chat via ChatOpenAI                  │  │
│  └──────────────────────────────────────────────────────────────┘  │
└──────────────────────────────┬──────────────────────────────────────┘
                               │
┌──────────────────────────────▼──────────────────────────────────────┐
│                      Storage layer (plain filesystem)               │
│  uploads/        ← original uploaded files                          │
│  jobs/{job_id}/  ← per-job intermediate artifacts and result JSON   │
│  kg/             ← global merged KG (kg_nodes.json + kg_edges.json) │
└─────────────────────────────────────────────────────────────────────┘

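1.2 Dual-venv Coordination

The project contains two isolated Python virtual environments. The FastAPI service coordinates them as follows:

| Component | Virtual environment | Invocation |
| --- | --- | --- |
| FastAPI service itself | langextract_src/.venv | Runs directly |
| Bridge Pipeline | langextract_src/.venv | Direct import: from text_assembler import ... |
| Agentic-RAG | langextract_src/.venv | Direct import: from agentic_rag_mvp import ... |
| MinerU Pipeline | mineru_mvp/.venv | subprocess.run([MINERU_PYTHON, MINERU_PIPELINE, pdf_path]) |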
# Core dual-venv coordination code
import subprocess
from pathlib import Path

# MINERU_DIR was used below but never defined; this value is inferred from the two paths
MINERU_DIR = Path("F:/GraphRAGAgent/mineru_mvp")
MINERU_PYTHON = MINERU_DIR / ".venv/Scripts/python.exe"
MINERU_PIPELINE = MINERU_DIR / "pipeline.py"

# Stage 1: MinerU, called in an isolated subprocess
# (pdf_path is the path of the uploaded file, supplied by the caller)
result = subprocess.run(
    [str(MINERU_PYTHON), str(MINERU_PIPELINE), str(pdf_path)],
    cwd=str(MINERU_DIR), capture_output=True, text=True, timeout=600
)

# Stage 2-4: Bridge + RAG, imported directly (same venv)
from text_assembler import load_content_list, assemble_pages
from entity_extractor import create_model, extract_entities
from kg_builder import build_kg

1.3 End-to-End Data Flow

Upload file (PDF/DOCX/PPT/PNG/JPG/HTML)
    │
    ▼ POST /api/v1/documents/upload
DocumentService: saves to uploads/{doc_id}_{filename}
    │
    ▼ POST /api/v1/index/start
IndexingService: starts a background threading.Thread
    │
    ├─ Stage: parsing
    │    MinerU subprocess → mineru_mvp/output/{stem}/*_content_list.json
    │
    ├─ Stage: extracting
    │    text_assembler.assemble_pages() → PageText[]
    │    entity_extractor.extract_entities() → AnnotatedDocument[]
    │    → ExtractionRecord[] saved to jobs/{job_id}/extractions.json
    │
    ├─ Stage: indexing
    │    kg_builder.build_kg() → KGNode[] + KGEdge[]
    │    → saved to jobs/{job_id}/kg_nodes.json + kg_edges.json
    │    → merged into the global kg/kg_nodes.json + kg/kg_edges.json
    │
    └─ Status: done
         GET /api/v1/index/result/{job_id} → full result

User query (natural-language question)
    │
    ▼ POST /api/v1/query
QAService: loads the global KG → NetworkX Graph
    │
    ├─ LangChain create_agent (DeepSeek)
    │    ReAct loop: think → tool_call → observe → repeat
    │    Tool call chain: search_entities / get_neighbors / ...
    │
    └─ QAResult: answer + tool_calls + cited_nodes

1.4 Job State Machine

                          ┌─────────┐
                          │submitted│
                          └────┬────┘
                               │ background thread starts
                          ┌────▼────┐
                          │ queued  │  (waits for the thread pool; the current implementation moves to parsing immediately)
                          └────┬────┘
                               │ MinerU subprocess starts
                          ┌────▼────┐
                          │ parsing │  MinerU cloud API parsing
                          └────┬────┘
                               │ content_list.json ready
                         ┌─────▼──────┐
                         │ extracting │  LangExtract + DeepSeek entity extraction
                         └─────┬──────┘
                               │ extractions.json ready
                         ┌─────▼──────┐
                         │  indexing  │  kg_builder builds the knowledge graph
                         └─────┬──────┘
                               │ kg_nodes/edges ready
                    ┌──────────▼──────────┐
              ┌─────▼─────┐        ┌──────▼──────┐
              │   done    │        │   failed    │
              └───────────┘        └─────────────┘

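Progress fields (the progress object):

| Stage | parsed_pages | total_pages | extracted_entities |
| --- | --- | --- | --- |
| parsing | Updated live (MinerU progress) | Total page count returned by MinerU | 0 |
| extracting | total_pages | total_pages | Accumulated live |
| indexing | total_pages | total_pages | Final value |
| done | total_pages | total_pages | Final value |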
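
The state machine in 1.4 can be expressed as a small transition check. This is a minimal sketch, not the service's actual code: the `JobStatus` enum and `can_transition` helper are hypothetical names, and it assumes that any non-terminal state may move to `failed`, which the diagram only draws at the final step.

```python
from enum import Enum


class JobStatus(str, Enum):
    SUBMITTED = "submitted"
    QUEUED = "queued"
    PARSING = "parsing"
    EXTRACTING = "extracting"
    INDEXING = "indexing"
    DONE = "done"
    FAILED = "failed"


# Forward edges taken from the diagram in 1.4.
_NEXT = {
    JobStatus.SUBMITTED: JobStatus.QUEUED,
    JobStatus.QUEUED: JobStatus.PARSING,
    JobStatus.PARSING: JobStatus.EXTRACTING,
    JobStatus.EXTRACTING: JobStatus.INDEXING,
    JobStatus.INDEXING: JobStatus.DONE,
}


def can_transition(current: JobStatus, target: JobStatus) -> bool:
    """Return True if the state machine allows current -> target."""
    if current in (JobStatus.DONE, JobStatus.FAILED):
        return False  # terminal states never change
    # Assumption: every non-terminal state may fail.
    return target is JobStatus.FAILED or _NEXT[current] is target
```

Because `JobStatus` subclasses `str`, its members serialize to the same lowercase strings that appear in `meta.json`.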

1.5 FastAPI Project Layout

F:\GraphRAGAgent\graphrag_pipeline\
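├── api_server.py              # FastAPI entry point (app instance, router registration, startup config)
├── routers/
│   ├── __init__.py
│   ├── documents.py           # Group A: document management (4 endpoints)
│   ├── indexing.py            # Group B: Indexing Pipeline (4 endpoints)
│   ├── kg.py                  # Group C: knowledge graph (6 endpoints)
│   ├── query.py               # Group D: QA (4 endpoints)
│   ├── search.py              # Group E: search (3 endpoints)
│   └── system.py              # Group F: system (4 endpoints)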
├── services/
│   ├── __init__.py
│   ├── document_service.py    # File saving, metadata read/write
│   ├── indexing_service.py    # Pipeline scheduling (MinerU subprocess + Bridge import)
│   ├── kg_service.py          # NetworkX graph loading, BFS, centrality computation
│   ├── qa_service.py          # create_agent wrapper, ReAct invocation, result parsing
│   └── search_service.py      # Entity search, path search, subgraph search
├── models/
│   ├── __init__.py
│   └── schemas.py             # Pydantic v2 models (schemas for all data objects)
├── storage/
│   ├── __init__.py
│   └── file_store.py          # Unified file I/O (JSON serialization/deserialization, directory management)
├── .env                       # DEEPSEEK_API_KEY + DEEPSEEK_BASE_URL + MINERU_API_TOKEN
│
│ # Existing files (not modified)
├── bridge.py
├── text_assembler.py
├── entity_extractor.py
├── kg_builder.py
├── agentic_rag_mvp.py
├── web_server.py              # Old Flask prototype (kept, not deleted)
└── output/
    ├── kg_nodes.json          # Backward-compatible global KG (kept in sync with the kg/ directory)
    └── kg_edges.json

1.6 Filesystem Storage Layout

F:\GraphRAGAgent\graphrag_pipeline\
│
├── uploads/
│   └── {doc_id}_{filename}              # Original uploaded file (e.g. abc12345_paper.pdf)
│
├── jobs/
│   └── {job_id}/
│       ├── meta.json                    # Job metadata
│       │   {
│       │     "job_id": "job_xyz789",
│       │     "doc_id": "abc12345",
│       │     "status": "done",
│       │     "stage": "Complete",
│       │     "progress": {...},
│       │     "created_at": "ISO8601",
│       │     "elapsed_seconds": 42.1,
│       │     "error": null,
│       │     "pdf_name": "paper.pdf",
│       │     "pdf_path": "uploads/abc12345_paper.pdf"
│       │   }
│       ├── mineru_output/               # MinerU parsing artifacts (kept as-is)
│       │   ├── {uuid}_content_list.json
│       │   ├── layout.json
│       │   ├── full.md
│       │   ├── {uuid}_origin.pdf
│       │   └── images/
│       │       └── {sha256}.jpg
│       ├── extractions.json             # All LangExtract extraction records (ExtractionRecord[])
│       ├── kg_nodes.json                # KG nodes produced by this job (KGNode[])
│       └── kg_edges.json                # KG edges produced by this job (KGEdge[])
│
└── kg/
    ├── kg_nodes.json                    # Globally merged KG nodes (merged and deduplicated across all jobs)
    └── kg_edges.json                    # Globally merged KG edges (merged and deduplicated across all jobs)
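
The layout above says the global kg/ files are merged and deduplicated across all jobs but does not define the dedup keys. The following is a minimal sketch under two explicit assumptions: a node is identified by its `id`, and an edge by all five KGEdge fields from section 3.4. The `merge_kg` function name is hypothetical.

```python
from typing import Any

Node = dict[str, Any]
Edge = dict[str, Any]


def merge_kg(
    global_nodes: list[Node],
    global_edges: list[Edge],
    job_nodes: list[Node],
    job_edges: list[Edge],
) -> tuple[list[Node], list[Edge]]:
    """Merge one job's KG into the global KG, dropping duplicates."""
    # Assumption: a node is identified by its "id" field; first copy wins.
    nodes_by_id = {n["id"]: n for n in global_nodes}
    for node in job_nodes:
        nodes_by_id.setdefault(node["id"], node)

    # Assumption: an edge is identified by all five KGEdge fields.
    def edge_key(e: Edge) -> tuple:
        return (e["source"], e["target"], e["relation"], e["doc_id"], e["page"])

    edges_by_key = {edge_key(e): e for e in global_edges}
    for edge in job_edges:
        edges_by_key.setdefault(edge_key(edge), edge)

    return list(nodes_by_id.values()), list(edges_by_key.values())
```

Keeping the first copy of a duplicate means re-running a job does not overwrite existing global entries; a real implementation might prefer the newest copy instead.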

2. Unified Response Envelope

2.1 Common Response Structure

All API endpoints use the following unified envelope:

{
  "code": 0,
  "msg": "success",
  "request_id": "f47ac10b-58cc-4372-a567-0e02b2c3d479",
  "data": { ... }
}
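| Field | Type | Description |
| --- | --- | --- |
| code | int | 0 = success; non-zero = failure (see the error code table) |
| msg | string | Status description ("success" on success, the error message on failure) |
| request_id | string | UUID v4, used for log tracing |
| data | object \| null | Business payload (null on failure) |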

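HTTP status code mapping:

| HTTP status code | When used |
| --- | --- |
| 200 OK | Synchronous request succeeded |
| 202 Accepted | Asynchronous task accepted (job started) |
| 400 Bad Request | Parameter validation failed (code 1001/1002/1003) |
| 404 Not Found | Resource not found (code 2001/3001) |
| 500 Internal Server Error | Internal server error (code 5000) |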

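FastAPI Pydantic response model:

from typing import Generic, Optional, TypeVar
import uuid

from pydantic import BaseModel, Field

T = TypeVar("T")

class APIResponse(BaseModel, Generic[T]):
    code: int = 0
    msg: str = "success"
    # default_factory creates a new UUID for every response. A plain
    # str(uuid.uuid4()) default is evaluated once, when the class is defined,
    # so every response would share the same request_id.
    request_id: str = Field(default_factory=lambda: str(uuid.uuid4()))
    data: Optional[T] = None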

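2.2 Error Codes

| code | HTTP status | Meaning | Description |
| --- | --- | --- | --- |
| 0 | 200 | Success | |
| 1001 | 400 | Parameter validation failed | Required field missing or wrong type |
| 1002 | 400 | Unsupported file format | Only pdf/docx/doc/pptx/ppt/png/jpg/jpeg/html are supported |
| 1003 | 400 | File exceeds size limit | Max 200 MB per file (MinerU limit) |
| 1004 | 400 | File exceeds page limit | Max 600 pages per file (MinerU limit) |
| 2001 | 404 | Document not found | No document matches the doc_id |
| 2002 | 400 | Job not found | No job matches the job_id |
| 2003 | 400 | Job still running | The result was requested before the job finished |
| 2004 | 400 | Job cannot be cancelled | Only submitted/queued jobs can be cancelled |
| 3001 | 404 | KG node not found | No node matches the node_id |
| 3002 | 400 | KG is empty | No indexing has completed yet, so there is no graph data |
| 4001 | 500 | QA service error | LangChain Agent or DeepSeek API call failed |
| 5000 | 500 | Internal server error | Unexpected system exception |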

Error response example:

{
  "code": 1002,
  "msg": "Unsupported file format: .xlsx. Supported formats: pdf, docx, doc, pptx, ppt, png, jpg, jpeg, html",
  "request_id": "f47ac10b-58cc-4372-a567-0e02b2c3d479",
  "data": null
}
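
The mapping from error code to HTTP status in 2.2 can be kept in one lookup table so that routers never hard-code status numbers. This is a sketch using only the standard library; `HTTP_STATUS` and `error_envelope` are hypothetical names, and the fallback to 500 for unknown codes is an assumption.

```python
import uuid

# HTTP status per error code, copied from the error code table in 2.2.
HTTP_STATUS = {
    0: 200,
    1001: 400, 1002: 400, 1003: 400, 1004: 400,
    2001: 404, 2002: 400, 2003: 400, 2004: 400,
    3001: 404, 3002: 400,
    4001: 500, 5000: 500,
}


def error_envelope(code: int, msg: str) -> tuple[int, dict]:
    """Return (http_status, response_body) for a failed request."""
    # Assumption: codes missing from the table are treated as 500.
    status = HTTP_STATUS.get(code, 500)
    body = {
        "code": code,
        "msg": msg,
        "request_id": str(uuid.uuid4()),  # fresh ID per response
        "data": None,
    }
    return status, body
```

For example, `error_envelope(1002, "Unsupported file format: .xlsx")` yields status 400 and a body shaped like the error response example above.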

3. Core Data Object Schemas

3.1 DocumentInfo

Document metadata object. It is created by POST /api/v1/documents/upload and persisted to meta.json under jobs/.

{
  "doc_id": "abc12345",
  "filename": "graphrag_overview.pdf",
  "format": "pdf",
  "size_bytes": 1048576,
  "pages": 4,
  "uploaded_at": "2026-03-05T10:00:00Z",
  "status": "indexed",
  "language": "en",
  "enable_formula": true,
  "enable_table": true
}
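| Field | Type | Description |
| --- | --- | --- |
| doc_id | string | Unique document ID (first 8 characters of a UUID hex string, e.g. "abc12345") |
| filename | string | Original filename |
| format | string | File format (lowercase extension, without the dot) |
| size_bytes | int | File size in bytes |
| pages | int \| null | Total page count (filled in after MinerU parsing; null at upload time) |
| uploaded_at | string | Upload time, ISO 8601 |
| status | string | "uploaded" / "indexed" / "failed" |
| language | string | OCR language code (PaddleOCR), default "ch" |
| enable_formula | bool | Whether formula recognition is enabled |
| enable_table | bool | Whether table recognition is enabled |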

3.2 IndexingJobStatus

Job status object for the Indexing Pipeline.

{
  "job_id": "job_xyz789",
  "doc_id": "abc12345",
  "status": "extracting",
  "stage": "Extracting entities (LangExtract + DeepSeek)...",
  "progress": {
    "parsed_pages": 4,
    "total_pages": 4,
    "extracted_entities": 23
  },
  "created_at": "2026-03-05T10:00:05Z",
  "elapsed_seconds": 18.3,
  "error": null
}
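| Field | Type | Description |
| --- | --- | --- |
| job_id | string | Unique job ID ("job_" + first 8 characters of a UUID hex string) |
| doc_id | string | Associated document ID |
| status | string | Status enum (see the state machine in 1.4) |
| stage | string | Human-readable description of the current stage |
| progress.parsed_pages | int | Pages parsed so far |
| progress.total_pages | int | Total pages (0 = unknown) |
| progress.extracted_entities | int | Entities extracted so far |
| created_at | string | Job creation time, ISO 8601 |
| elapsed_seconds | float | Elapsed time in seconds |
| error | string \| null | Error message (non-null on failure) |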

3.3 KGNode

Knowledge graph node. It maps directly to the kg_nodes.json format, with an added degree field.

{
  "id": "tech_graphrag_0",
  "name": "GraphRAG",
  "type": "TECHNOLOGY",
  "source_doc": "abc12345",
  "char_start": 0,
  "char_end": 8,
  "confidence": "match_exact",
  "page": 0,
  "degree": 39
}
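| Field | Type | Description |
| --- | --- | --- |
| id | string | Unique node ID (from kg_nodes.json) |
| name | string | Entity name |
| type | string | Entity type: TECHNOLOGY / CONCEPT / PERSON / ORGANIZATION / LOCATION |
| source_doc | string | Source document ID (doc_id) |
| char_start | int | Start character offset of the entity in the source text (LangExtract char_interval.start_pos) |
| char_end | int | End character offset of the entity in the source text (exclusive, char_interval.end_pos) |
| confidence | string | LangExtract alignment status: match_exact / match_greater / match_lesser / match_fuzzy |
| page | int | Page number (0-indexed, from page_idx in MinerU content_list.json) |
| degree | int | Node degree (number of connected edges); computed by NetworkX and populated only in API responses |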

3.4 KGEdge

Knowledge graph edge. It maps directly to the kg_edges.json format.

{
  "source": "tech_graphrag_0",
  "target": "concept_knowledgegraph_1",
  "relation": "CO_OCCURS_IN",
  "doc_id": "abc12345",
  "page": 0
}
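| Field | Type | Description |
| --- | --- | --- |
| source | string | Source node ID |
| target | string | Target node ID |
| relation | string | Relation type (currently fixed to "CO_OCCURS_IN", meaning co-occurrence on the same page) |
| doc_id | string | ID of the document the edge comes from |
| page | int | Page on which the co-occurrence happens (0-indexed) |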

3.5 ExtractionRecord

A single LangExtract entity extraction record: a flattened form of AnnotatedDocument.extractions[].

{
  "text": "GraphRAG",
  "type": "TECHNOLOGY",
  "char_start": 0,
  "char_end": 8,
  "alignment": "match_exact",
  "page": 0,
  "doc_id": "abc12345"
}
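| Field | Type | Description |
| --- | --- | --- |
| text | string | Entity text (extraction_text, a substring of the source) |
| type | string | Entity type (extraction_class) |
| char_start | int \| null | Start character offset (char_interval.start_pos) |
| char_end | int \| null | End character offset (char_interval.end_pos, exclusive) |
| alignment | string \| null | Alignment status (alignment_status.value; null means unaligned) |
| page | int | Page number (0-indexed) |
| doc_id | string | Source document ID |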

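Filtering rule: KG construction drops records with alignment = null (unaligned). Whether match_fuzzy records are also dropped is configurable per project. In current measurements, match_exact accounts for 94%+ of records.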
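
The filtering rule above can be sketched as a small function over ExtractionRecord dictionaries. The function name and the `keep_fuzzy` flag are hypothetical; the spec only says that dropping match_fuzzy is configurable, not how the option is named.

```python
def filter_extractions(records: list[dict], keep_fuzzy: bool = True) -> list[dict]:
    """Drop unaligned records and, optionally, match_fuzzy ones."""
    kept = []
    for rec in records:
        alignment = rec.get("alignment")
        if alignment is None:
            continue  # unaligned records are always dropped
        if alignment == "match_fuzzy" and not keep_fuzzy:
            continue  # dropped only when the project is configured to do so
        kept.append(rec)
    return kept
```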

3.6 QAResult

Agentic-RAG answer object, containing the answer plus the full reasoning trace.

{
  "query_id": "q_20260305_001",
  "question": "What is GraphRAG and how does it relate to knowledge graphs?",
  "answer": "GraphRAG is a knowledge graph-enhanced retrieval-augmented generation system...",
  "tool_calls": [
    {
      "tool": "search_entities",
      "input": {"query": "GraphRAG"},
      "output": "Found 1 entity(ies) matching 'GraphRAG':\n  [TECHNOLOGY] \"GraphRAG\" (confidence=match_exact, page=0, id=tech_graphrag_0)"
    },
    {
      "tool": "get_neighbors",
      "input": {"entity_name": "GraphRAG", "hops": 1},
      "output": "Neighbors of 'GraphRAG' [TECHNOLOGY] within 1 hop(s):\n  Hop 1 — 39 related entities:\n    [CONCEPT] knowledge graphs\n    ..."
    }
  ],
  "cited_nodes": ["tech_graphrag_0", "concept_knowledgegraph_1"],
  "elapsed_seconds": 8.4,
  "created_at": "2026-03-05T10:30:00Z"
}
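| Field | Type | Description |
| --- | --- | --- |
| query_id | string | Unique query ID |
| question | string | The user's original question |
| answer | string | Final natural-language answer generated by the agent (result["messages"][-1].content) |
| tool_calls | array | Tool call records from the ReAct loop, in order |
| tool_calls[].tool | string | Tool name (one of the 4 KG tools) |
| tool_calls[].input | object | Tool call arguments |
| tool_calls[].output | string | Text result returned by the tool (ToolMessage.content) |
| cited_nodes | string[] | IDs of the nodes cited in the answer (parsed from tool_calls) |
| elapsed_seconds | float | Total QA time (including all LLM calls) |
| created_at | string | Query time, ISO 8601 |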
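
The spec states that cited_nodes is parsed from tool_calls but does not say how. One possible approach, shown here as a sketch, is to scan tool outputs for the `id=<node_id>` pattern that appears in the search_entities example output. That pattern is an assumption: tools whose output omits IDs (such as the get_neighbors example) would contribute nothing under this scheme. `extract_cited_nodes` is a hypothetical name.

```python
import re

# Assumption: tool outputs mention nodes as "id=<node_id>", as in the
# search_entities example output. "\b" keeps "doc_id=..." from matching.
_NODE_ID = re.compile(r"\bid=([A-Za-z0-9_]+)")


def extract_cited_nodes(tool_calls: list[dict]) -> list[str]:
    """Collect node IDs from tool outputs in first-seen order, without duplicates."""
    seen: dict[str, None] = {}
    for call in tool_calls:
        for node_id in _NODE_ID.findall(call.get("output", "")):
            seen.setdefault(node_id, None)
    return list(seen)
```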

4. Group A: Document Management (4 endpoints)

A1. Upload a File

POST /api/v1/documents/upload
Content-Type: multipart/form-data

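Request (Form Data)

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| file | binary | Yes | | Binary file content |
| language | string | No | "ch" | OCR language (PaddleOCR language code) |
| enable_formula | bool | No | true | Whether to enable formula recognition |
| enable_table | bool | No | true | Whether to enable table recognition |

Validation rules:

  • The file extension must be in the supported list (see Section 10)
  • The file size must not exceed 200 MB
  • The filename must not contain path separators (prevents directory traversal)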
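
The three validation rules can be sketched as one function that returns the matching error code from section 2.2. Two points are assumptions rather than part of the spec: a filename containing a path component is reported as 1001 (the spec does not assign it a code), and 200 MB is taken as 200 × 1024 × 1024 bytes. `validate_upload` is a hypothetical name.

```python
from pathlib import PurePosixPath, PureWindowsPath

SUPPORTED_EXTS = {"pdf", "docx", "doc", "pptx", "ppt", "png", "jpg", "jpeg", "html"}
MAX_SIZE_BYTES = 200 * 1024 * 1024  # assumption: binary megabytes


def validate_upload(filename: str, size_bytes: int) -> int:
    """Return 0 if the upload passes validation, else an error code."""
    # Reject names with a directory or drive component on either platform.
    if (PurePosixPath(filename).name != filename
            or PureWindowsPath(filename).name != filename):
        return 1001  # assumption: traversal attempts count as parameter errors
    ext = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    if ext not in SUPPORTED_EXTS:
        return 1002
    if size_bytes > MAX_SIZE_BYTES:
        return 1003
    return 0
```

Checking against both POSIX and Windows path rules matters here because the service paths in this spec are Windows paths while clients may send either separator.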

Response 200

{
  "code": 0,
  "msg": "success",
  "request_id": "f47ac10b-...",
  "data": {
    "doc_id": "abc12345",
    "filename": "graphrag_overview.pdf",
    "format": "pdf",
    "size_bytes": 1048576,
    "pages": null,
    "uploaded_at": "2026-03-05T10:00:00Z",
    "status": "uploaded",
    "language": "en",
    "enable_formula": true,
    "enable_table": true
  }
}

Error responses:

// 1002: unsupported format
{ "code": 1002, "msg": "Unsupported file format: .xlsx", "data": null }

// 1003: size limit exceeded
{ "code": 1003, "msg": "File size 256MB exceeds 200MB limit", "data": null }

A2. Get Document Info

GET /api/v1/documents/{doc_id}

Path Params

| Parameter | Type | Description |
| --- | --- | --- |
| doc_id | string | Document ID |

Response 200

{
  "code": 0,
  "msg": "success",
  "request_id": "...",
  "data": {
    "doc_id": "abc12345",
    "filename": "graphrag_overview.pdf",
    "format": "pdf",
    "size_bytes": 1048576,
    "pages": 4,
    "uploaded_at": "2026-03-05T10:00:00Z",
    "status": "indexed",
    "language": "en",
    "enable_formula": true,
    "enable_table": true
  }
}

Errors: 2001 (doc_id not found)


A3. List All Documents

GET /api/v1/documents

Query Params

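| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| page | int | 1 | Page number (starting from 1) |
| page_size | int | 20 | Items per page (max 100) |
| status | string | | Filter by status: uploaded / indexed / failed |
| format | string | | Filter by format, e.g. pdf |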

Response 200

{
  "code": 0,
  "msg": "success",
  "request_id": "...",
  "data": {
    "total": 5,
    "page": 1,
    "page_size": 20,
    "items": [
      {
        "doc_id": "abc12345",
        "filename": "graphrag_overview.pdf",
        "format": "pdf",
        "size_bytes": 1048576,
        "pages": 4,
        "uploaded_at": "2026-03-05T10:00:00Z",
        "status": "indexed",
        "language": "en",
        "enable_formula": true,
        "enable_table": true
      }
    ]
  }
}
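
The total / page / page_size / items shape is shared by every list endpoint in this spec, so a single helper can produce it. This is a sketch; `paginate` is a hypothetical name, and clamping out-of-range values (instead of rejecting them with code 1001) is an assumption.

```python
def paginate(items: list, page: int = 1, page_size: int = 20,
             max_page_size: int = 100) -> dict:
    """Return one page of items in the list-endpoint response shape."""
    # Assumption: out-of-range values are clamped rather than rejected.
    page = max(page, 1)
    page_size = min(max(page_size, 1), max_page_size)
    start = (page - 1) * page_size
    return {
        "total": len(items),
        "page": page,
        "page_size": page_size,
        "items": items[start:start + page_size],
    }
```

Endpoints with different limits (for example 200 for KG nodes or 500 for KG edges) would pass their own `max_page_size`.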

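A4. Delete a Document

DELETE /api/v1/documents/{doc_id}

Note: Deletes the document and its associated job artifacts (the matching entries under uploads/ and jobs/), and removes the nodes and edges contributed by this document from the global KG.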

Response 200

{
  "code": 0,
  "msg": "success",
  "request_id": "...",
  "data": {
    "deleted": true,
    "doc_id": "abc12345",
    "removed_nodes": 40,
    "removed_edges": 780
  }
}

Errors: 2001 (doc_id not found)


5. Group B: Indexing Pipeline (4 endpoints)

B1. Start an Indexing Job

POST /api/v1/index/start
Content-Type: application/json

Request Body

{
  "doc_id": "abc12345"
}
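| Field | Type | Required | Description |
| --- | --- | --- | --- |
| doc_id | string | Yes | ID of an uploaded document (its status must be uploaded) |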

Response 202

{
  "code": 0,
  "msg": "success",
  "request_id": "...",
  "data": {
    "job_id": "job_xyz789",
    "doc_id": "abc12345",
    "status": "submitted",
    "stage": "Job submitted",
    "created_at": "2026-03-05T10:00:05Z"
  }
}

Implementation notes:

# Inside IndexingService
def start_indexing(doc_id: str) -> IndexingJobStatus:
    job_id = f"job_{uuid.uuid4().hex[:8]}"
    job_dir = JOBS_DIR / job_id
    job_dir.mkdir(parents=True)

    meta = { "job_id": job_id, "doc_id": doc_id, "status": "submitted", ... }
    save_meta(job_dir / "meta.json", meta)

    thread = threading.Thread(target=run_pipeline, args=(job_id,), daemon=True)
    thread.start()
    return meta

Pipeline execution order (background thread):

  1. status = "parsing" → subprocess.run([MINERU_PYTHON, MINERU_PIPELINE, pdf_path])
  2. status = "extracting" → load_content_list() → assemble_pages() → extract_entities() per page
  3. status = "indexing" → build_kg() → save jobs/{job_id}/kg_nodes.json → merge into kg/
  4. status = "done"

B2. Query Job Status (with Live Progress)

GET /api/v1/index/status/{job_id}

Recommended polling interval: 3 seconds

Response 200

{
  "code": 0,
  "msg": "success",
  "request_id": "...",
  "data": {
    "job_id": "job_xyz789",
    "doc_id": "abc12345",
    "status": "extracting",
    "stage": "Extracting entities page 2/4 (LangExtract + DeepSeek)...",
    "progress": {
      "parsed_pages": 4,
      "total_pages": 4,
      "extracted_entities": 23
    },
    "created_at": "2026-03-05T10:00:05Z",
    "elapsed_seconds": 18.3,
    "error": null
  }
}

Typical stage values per status:

| status | stage |
| --- | --- |
| submitted | "Job submitted" |
| queued | "Waiting for worker..." |
| parsing | "MinerU PDF parsing (cloud API)..." |
| extracting | "Extracting entities page 2/4 (LangExtract + DeepSeek)..." |
| indexing | "Building knowledge graph..." |
| done | "Complete" |
| failed | "Error: {error message}" |

Errors: 2002 (job_id not found)
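
A client can follow the recommended 3-second polling interval with a loop like the one below. This is a sketch: `wait_for_job` is a hypothetical helper, `fetch_status` stands for whatever function performs GET /api/v1/index/status/{job_id} and returns the response's data object, and the 600-second overall timeout is an assumption.

```python
import time
from typing import Callable

TERMINAL = {"done", "failed"}


def wait_for_job(
    fetch_status: Callable[[], dict],
    interval: float = 3.0,
    timeout: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until the job reaches a terminal status or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()  # the "data" object of the status response
        if status["status"] in TERMINAL:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still {status['status']} after {timeout}s")
        sleep(interval)
```

Passing `sleep` as a parameter keeps the loop testable without real delays.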


B3. Get Indexing Result (Full Data)

GET /api/v1/index/result/{job_id}

Response 200 (status = done)

{
  "code": 0,
  "msg": "success",
  "request_id": "...",
  "data": {
    "job_id": "job_xyz789",
    "doc_id": "abc12345",
    "status": "done",
    "stats": {
      "blocks": 32,
      "block_types": {"text": 31, "table": 1},
      "pages": 4,
      "raw_extractions": 45,
      "nodes": 40,
      "edges": 780,
      "type_counts": {"TECHNOLOGY": 4, "CONCEPT": 36},
      "alignment_counts": {"match_exact": 40, "match_fuzzy": 5},
      "elapsed_seconds": 42.1
    },
    "extractions": [
      {
        "text": "GraphRAG",
        "type": "TECHNOLOGY",
        "char_start": 0,
        "char_end": 8,
        "alignment": "match_exact",
        "page": 0,
        "doc_id": "abc12345"
      }
    ],
    "nodes": [
      {
        "id": "tech_graphrag_0",
        "name": "GraphRAG",
        "type": "TECHNOLOGY",
        "source_doc": "abc12345",
        "char_start": 0,
        "char_end": 8,
        "confidence": "match_exact",
        "page": 0,
        "degree": 39
      }
    ],
    "edges": [
      {
        "source": "tech_graphrag_0",
        "target": "concept_knowledgegraph_1",
        "relation": "CO_OCCURS_IN",
        "doc_id": "abc12345",
        "page": 0
      }
    ]
  }
}

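Response 200 (status ≠ done): returns IndexingJobStatus (without stats/extractions/nodes/edges)

Errors: 2002 (job_id not found)


B4. Cancel a Job

DELETE /api/v1/index/jobs/{job_id}

Limitation: jobs in the submitted or queued state can be cancelled. For jobs in parsing/extracting/indexing, the background thread cannot be interrupted; the job is only marked with status cancelled.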

Response 200

{
  "code": 0,
  "msg": "success",
  "request_id": "...",
  "data": {
    "cancelled": true,
    "job_id": "job_xyz789",
    "previous_status": "submitted"
  }
}

Errors: 2002 (not found), 2004 (state cannot be cancelled)


6. Group C: Knowledge Graph (6 endpoints)

C1. List All Nodes (Pagination + Filters)

GET /api/v1/kg/nodes

Query Params

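| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| type | string | | Entity type filter (case-insensitive) |
| doc_id | string | | Filter by source document |
| confidence | string | | Alignment status filter (e.g. match_exact) |
| page | int | 1 | Page number |
| page_size | int | 50 | Items per page (max 200) |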

Response 200

{
  "code": 0,
  "msg": "success",
  "request_id": "...",
  "data": {
    "total": 40,
    "page": 1,
    "page_size": 50,
    "items": [
      {
        "id": "tech_graphrag_0",
        "name": "GraphRAG",
        "type": "TECHNOLOGY",
        "source_doc": "abc12345",
        "char_start": 0,
        "char_end": 8,
        "confidence": "match_exact",
        "page": 0,
        "degree": 39
      }
    ]
  }
}

Errors: 3002 (KG is empty)


C2. List All Edges (Paginated)

GET /api/v1/kg/edges

Query Params

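| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| doc_id | string | | Filter by source document |
| relation | string | | Relation type filter (e.g. CO_OCCURS_IN) |
| page | int | 1 | Page number |
| page_size | int | 100 | Items per page (max 500) |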

Response 200

{
  "code": 0,
  "msg": "success",
  "request_id": "...",
  "data": {
    "total": 780,
    "page": 1,
    "page_size": 100,
    "items": [
      {
        "source": "tech_graphrag_0",
        "target": "concept_knowledgegraph_1",
        "relation": "CO_OCCURS_IN",
        "doc_id": "abc12345",
        "page": 0
      }
    ]
  }
}

C3. Get Single Node Details

GET /api/v1/kg/nodes/{node_id}

Response 200

{
  "code": 0,
  "msg": "success",
  "request_id": "...",
  "data": {
    "id": "tech_graphrag_0",
    "name": "GraphRAG",
    "type": "TECHNOLOGY",
    "source_doc": "abc12345",
    "char_start": 0,
    "char_end": 8,
    "confidence": "match_exact",
    "page": 0,
    "degree": 39,
    "degree_centrality": 1.000,
    "neighbor_count": 39
  }
}

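Extra fields (single-node details only):

| Field | Description |
| --- | --- |
| degree_centrality | NetworkX degree_centrality(G)[node_id] (range 0-1) |
| neighbor_count | Number of direct neighbors (equals degree) |

Errors: 3001 (node not found)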
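
Degree centrality is the node's degree divided by the largest possible degree, n - 1. The dependency-free sketch below mirrors that formula on a plain adjacency dictionary (a hypothetical representation; the service itself uses NetworkX). For the example node, degree 39 in a 40-node graph gives 39 / 39 = 1.0, matching the response above.

```python
def degree_centrality(adjacency: dict[str, set[str]], node_id: str) -> float:
    """Degree of node_id divided by the maximum possible degree (n - 1)."""
    n = len(adjacency)
    if n <= 1:
        return 1.0  # convention for graphs with at most one node
    return len(adjacency[node_id]) / (n - 1)


# A complete graph on 40 nodes, like the example KG (40 nodes, 780 edges).
ids = [f"n{i}" for i in range(40)]
complete = {a: {b for b in ids if b != a} for a in ids}
print(degree_centrality(complete, "n0"))  # 1.0
```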


C4. Get Node Neighbors (N-hop BFS)

GET /api/v1/kg/nodes/{node_id}/neighbors

Query Params

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| hops | int | 1 | Number of hops (1-3) |

Response 200

{
  "code": 0,
  "msg": "success",
  "request_id": "...",
  "data": {
    "center": {
      "id": "tech_graphrag_0",
      "name": "GraphRAG",
      "type": "TECHNOLOGY",
      "page": 0
    },
    "hops": 1,
    "neighbors_by_hop": {
      "1": [
        { "id": "concept_knowledgegraph_1", "name": "knowledge graphs", "type": "CONCEPT", "page": 0 }
      ]
    },
    "total_neighbors": 39
  }
}

Implementation reference (from agentic_rag_mvp.py):

reachable = nx.single_source_shortest_path_length(G, node_id, cutoff=hops)
by_hop = {dist: [] for dist in range(1, hops+1)}
for nid, dist in reachable.items():
    if dist > 0:
        by_hop[dist].append(G.nodes[nid])

Errors: 3001 (node not found)


C5. Knowledge Graph Statistics

GET /api/v1/kg/stats

Response 200

{
  "code": 0,
  "msg": "success",
  "request_id": "...",
  "data": {
    "total_nodes": 40,
    "total_edges": 780,
    "density": 1.0000,
    "type_distribution": {
      "TECHNOLOGY": 4,
      "CONCEPT": 36
    },
    "relation_types": {
      "CO_OCCURS_IN": 780
    },
    "top5_central_nodes": [
      { "node_id": "tech_graphrag_0", "name": "GraphRAG", "type": "TECHNOLOGY", "centrality": 1.000 },
      { "node_id": "concept_kgrag_1", "name": "Knowledge Graph Enhanced RAG System", "type": "CONCEPT", "centrality": 1.000 },
      { "node_id": "concept_rag_2", "name": "retrieval-augmented generation", "type": "CONCEPT", "centrality": 1.000 },
      { "node_id": "concept_kg_3", "name": "knowledge graphs", "type": "CONCEPT", "centrality": 1.000 },
      { "node_id": "concept_llm_4", "name": "large language models", "type": "CONCEPT", "centrality": 1.000 }
    ],
    "source_documents": ["abc12345", "def67890"]
  }
}
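
The density value in this response follows from the node and edge counts. For an undirected simple graph, density is 2E / (N(N - 1)); with 40 nodes and 780 edges that is 1560 / 1560 = 1.0, which is expected because same-page co-occurrence links every pair of entities on a page. The sketch below assumes the KG is treated as undirected; `graph_density` is a hypothetical name.

```python
def graph_density(total_nodes: int, total_edges: int) -> float:
    """Density of an undirected simple graph: 2E / (N * (N - 1))."""
    if total_nodes < 2:
        return 0.0  # no node pairs exist, so density is defined as 0
    return 2 * total_edges / (total_nodes * (total_nodes - 1))


print(graph_density(40, 780))  # 1.0
```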

C6. Export the Full KG

GET /api/v1/kg/export

Query Params

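| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| format | string | "json" | Export format (currently only json is supported) |
| doc_id | string | | Optional; export only the KG of the given document |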

Response 200

{
  "code": 0,
  "msg": "success",
  "request_id": "...",
  "data": {
    "format": "json",
    "doc_id": null,
    "total_nodes": 40,
    "total_edges": 780,
    "exported_at": "2026-03-05T12:00:00Z",
    "nodes": [ ...KGNode[] ],
    "edges": [ ...KGEdge[] ]
  }
}

7. Group D: QA (4 endpoints)

D1. Submit a QA Query (Synchronous)

POST /api/v1/query
Content-Type: application/json

Request Body

{
  "question": "What is GraphRAG and how does it relate to knowledge graphs?",
  "history": [
    { "role": "human", "content": "Previous question..." },
    { "role": "ai", "content": "Previous answer..." }
  ]
}
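| Field | Type | Required | Description |
| --- | --- | --- | --- |
| question | string | Yes | The user's natural-language question |
| history | array | No | Multi-turn conversation history (up to 10 turns, i.e. 20 messages) |
| history[].role | "human" \| "ai" | Yes | Message role |
| history[].content | string | Yes | Message content |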

Response 200

{
  "code": 0,
  "msg": "success",
  "request_id": "...",
  "data": {
    "query_id": "q_20260305_a1b2c3",
    "question": "What is GraphRAG and how does it relate to knowledge graphs?",
    "answer": "Based on the knowledge graph, GraphRAG [TECHNOLOGY] is a knowledge graph-enhanced retrieval-augmented generation system that...",
    "tool_calls": [
      {
        "tool": "search_entities",
        "input": { "query": "GraphRAG" },
        "output": "Found 1 entity(ies) matching 'GraphRAG':\n  [TECHNOLOGY] \"GraphRAG\" (confidence=match_exact, page=0, id=tech_graphrag_0)"
      },
      {
        "tool": "get_neighbors",
        "input": { "entity_name": "GraphRAG", "hops": 1 },
        "output": "Neighbors of 'GraphRAG' [TECHNOLOGY] within 1 hop(s):\n  Hop 1 — 39 related entities:\n    [CONCEPT] knowledge graphs\n    ..."
      }
    ],
    "cited_nodes": ["tech_graphrag_0", "concept_knowledgegraph_1"],
    "elapsed_seconds": 8.4,
    "created_at": "2026-03-05T10:30:00Z"
  }
}

Implementation notes (QAService core logic):

# Convert history into LangChain message tuples (history is optional)
messages = []
for h in request.history or []:
    messages.append((h["role"], h["content"]))
messages.append(("human", request.question))

# Invoke the agent built with LangChain create_agent
result = agent.invoke({"messages": messages})

# Extract the tool call chain by walking result["messages"]
tool_calls = []
calls_by_id = {}  # tool call id -> record, so parallel calls get the right output
for msg in result["messages"]:
    if getattr(msg, "tool_calls", None):  # AIMessage that requests tools
        for tc in msg.tool_calls:
            record = {"tool": tc["name"], "input": tc["args"], "output": ""}
            tool_calls.append(record)
            calls_by_id[tc["id"]] = record
    elif hasattr(msg, "tool_call_id"):  # ToolMessage
        record = calls_by_id.get(msg.tool_call_id)
        if record is not None:
            record["output"] = msg.content

# Final answer
answer = result["messages"][-1].content

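Errors: 3002 (KG is empty), 4001 (Agent/LLM call failed)

Note: This endpoint is synchronous and typically takes 5-30 seconds (depending on DeepSeek API latency and the number of tool calls).


D2. Batch Query (Asynchronous)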

POST /api/v1/query/batch
Content-Type: application/json

Request Body

{
  "questions": [
    "What is GraphRAG?",
    "List all TECHNOLOGY entities in the knowledge graph.",
    "How does MinerU relate to LangExtract?"
  ]
}
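| Field | Type | Required | Constraint | Description |
| --- | --- | --- | --- | --- |
| questions | string[] | Yes | At most 20 | List of questions |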

Response 202

{
  "code": 0,
  "msg": "success",
  "request_id": "...",
  "data": {
    "batch_id": "batch_20260305_x1y2",
    "total": 3,
    "status": "submitted",
    "created_at": "2026-03-05T10:30:00Z"
  }
}

D3. Get Batch Query Status and Results

GET /api/v1/query/batch/{batch_id}

Response 200

{
  "code": 0,
  "msg": "success",
  "request_id": "...",
  "data": {
    "batch_id": "batch_20260305_x1y2",
    "total": 3,
    "completed": 2,
    "failed": 0,
    "status": "running",
    "results": [
      { ...QAResult },
      { ...QAResult }
    ]
  }
}

Errors: 2002 (batch_id not found)


D4. Query History

GET /api/v1/query/history

Query Params

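| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| page | int | 1 | Page number |
| page_size | int | 20 | Items per page (max 50) |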

Response 200

{
  "code": 0,
  "msg": "success",
  "request_id": "...",
  "data": {
    "total": 50,
    "page": 1,
    "page_size": 20,
    "items": [ ...QAResult[] ]
  }
}

Storage: History is persisted in JSONL format to jobs/query_history.jsonl, one QAResult per line.
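
A JSONL history file can be appended to and paged with the standard library alone. This is a sketch: `append_history` and `read_history` are hypothetical names, and returning the newest records first is an assumption (the spec does not define the ordering).

```python
import json
from pathlib import Path


def append_history(path: Path, qa_result: dict) -> None:
    """Append one QAResult as a single JSON line."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(qa_result, ensure_ascii=False) + "\n")


def read_history(path: Path, page: int = 1, page_size: int = 20) -> dict:
    """Return one page of history. Assumption: newest records come first."""
    if path.exists():
        with path.open(encoding="utf-8") as f:
            records = [json.loads(line) for line in f if line.strip()]
    else:
        records = []
    records.reverse()
    start = (page - 1) * page_size
    return {
        "total": len(records),
        "page": page,
        "page_size": page_size,
        "items": records[start:start + page_size],
    }
```

Reading the whole file on every request is fine for small histories; a long-running deployment would need an index or a rotation policy.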


8. Group E: Search (3 endpoints)

E1. Entity Keyword Search

GET /api/v1/search/entities

Query Params

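| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| q | string | Yes | Keyword (case-insensitive substring match; corresponds to agentic_rag_mvp.py: search_entities) |
| type | string | No | Type filter (e.g. TECHNOLOGY) |
| limit | int | No | Maximum number of results (default 15, max 100) |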

Response 200

{
  "code": 0,
  "msg": "success",
  "request_id": "...",
  "data": {
    "query": "GraphRAG",
    "total": 1,
    "items": [
      {
        "id": "tech_graphrag_0",
        "name": "GraphRAG",
        "type": "TECHNOLOGY",
        "source_doc": "abc12345",
        "char_start": 0,
        "char_end": 8,
        "confidence": "match_exact",
        "page": 0,
        "degree": 39
      }
    ]
  }
}

Implementation (see agentic_rag_mvp.py: search_entities):

q = query.lower()
matches = [data for _, data in G.nodes(data=True) if q in data.get("name", "").lower()]

E2. Graph Path Search (Paths Between Two Nodes)

GET /api/v1/search/path

Query Params

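| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| from | string | Yes | Start node ID |
| to | string | Yes | Target node ID |
| max_hops | int | No | Maximum path length (default 3, max 5) |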

Response 200

{
  "code": 0,
  "msg": "success",
  "request_id": "...",
  "data": {
    "from": { "id": "tech_graphrag_0", "name": "GraphRAG", "type": "TECHNOLOGY" },
    "to": { "id": "tech_mineru_3", "name": "MinerU", "type": "TECHNOLOGY" },
    "max_hops": 3,
    "paths": [
      {
        "length": 1,
        "nodes": [
          { "id": "tech_graphrag_0", "name": "GraphRAG", "type": "TECHNOLOGY" },
          { "id": "tech_mineru_3", "name": "MinerU", "type": "TECHNOLOGY" }
        ],
        "edges": [
          { "source": "tech_graphrag_0", "target": "tech_mineru_3", "relation": "CO_OCCURS_IN" }
        ]
      }
    ],
    "total_paths": 1
  }
}

Implementation (NetworkX):

paths = list(nx.all_simple_paths(G, from_id, to_id, cutoff=max_hops))

Errors: 3001 (node not found)
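
On a dense co-occurrence graph, enumerating every simple path grows quickly. In a complete graph on 40 nodes (the example KG), two nodes are joined by 1 + 38 + 38 × 37 = 1,445 simple paths of at most 3 edges, and by more than 1.8 million at max_hops = 5. A cap on the number of returned paths is therefore worth considering. The sketch below is a dependency-free bounded depth-first search, not the service's implementation; `bounded_simple_paths`, `max_paths`, and the adjacency-dictionary input are all hypothetical.

```python
from typing import Iterator


def bounded_simple_paths(
    adjacency: dict[str, set[str]],
    source: str,
    target: str,
    max_hops: int = 3,
    max_paths: int = 100,
) -> list[list[str]]:
    """Return up to max_paths simple paths with at most max_hops edges."""

    def walk(path: list[str]) -> Iterator[list[str]]:
        last = path[-1]
        if last == target:
            yield list(path)
            return
        if len(path) - 1 >= max_hops:
            return  # hop budget spent
        for nxt in sorted(adjacency.get(last, ())):
            if nxt not in path:  # never revisit a node: keeps the path simple
                path.append(nxt)
                yield from walk(path)
                path.pop()

    paths: list[list[str]] = []
    for p in walk([source]):
        paths.append(p)
        if len(paths) >= max_paths:
            break  # stop early instead of enumerating everything
    return paths
```

With NetworkX itself, a similar cap can be applied by wrapping the generator returned by `nx.all_simple_paths` in `itertools.islice` rather than calling `list()` on it.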


E3. Whole-Graph Keyword Search (with Subgraph)

GET /api/v1/search/graph

Query Params

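| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| q | string | Yes | Keyword (case-insensitive substring match) |
| include_neighbors | bool | No | Whether to return the edges to the matched nodes' direct neighbors (default false) |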

Response 200

{
  "code": 0,
  "msg": "success",
  "request_id": "...",
  "data": {
    "query": "retrieval",
    "matched_nodes": [
      { "id": "concept_rag_2", "name": "retrieval-augmented generation", "type": "CONCEPT", "page": 0 }
    ],
    "subgraph_edges": [
      { "source": "concept_rag_2", "target": "tech_graphrag_0", "relation": "CO_OCCURS_IN" }
    ]
  }
}

9. Group F: System (4 endpoints)

F1. Health Check

GET /api/v1/health

Response 200

{
  "code": 0,
  "msg": "success",
  "request_id": "...",
  "data": {
    "status": "healthy",
    "version": "1.0.0",
    "uptime_seconds": 3600,
    "components": {
      "mineru_venv": {
        "status": "ok",
        "path": "F:/GraphRAGAgent/mineru_mvp/.venv/Scripts/python.exe",
        "exists": true
      },
      "langextract_venv": {
        "status": "ok",
        "path": "F:/GraphRAGAgent/langextract_src/.venv/Scripts/python.exe",
        "exists": true
      },
      "deepseek_api": {
        "status": "ok",
        "base_url": "https://api.deepseek.com",
        "key_configured": true
      },
      "storage": {
        "status": "ok",
        "kg_nodes_exists": true,
        "kg_edges_exists": true,
        "uploads_dir_exists": true
      }
    }
  }
}

Note: This endpoint only checks configuration and file existence. It makes no real API calls, so it consumes no DeepSeek tokens.
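
A check of this kind needs only the filesystem and the process environment. The sketch below is illustrative: the function name `check_components`, the environment variable name `DEEPSEEK_API_KEY`, and the `"error"` and `"degraded"` status values are assumptions (the sample response only shows `"ok"` and `"healthy"`).

```python
import os
from pathlib import Path


def check_components(paths: dict[str, Path], api_key_env: str = "DEEPSEEK_API_KEY") -> dict:
    """Report component health from local state only; no network calls are made."""
    components = {}
    for name, path in paths.items():
        exists = path.exists()
        components[name] = {
            "status": "ok" if exists else "error",
            "path": path.as_posix(),
            "exists": exists,
        }

    # Only verify that a key is configured; never send it anywhere.
    key_configured = bool(os.environ.get(api_key_env))
    components["deepseek_api"] = {
        "status": "ok" if key_configured else "error",
        "key_configured": key_configured,
    }

    healthy = all(c["status"] == "ok" for c in components.values())
    return {"status": "healthy" if healthy else "degraded", "components": components}
```

A caller would pass the interpreter paths and storage locations from the service configuration, for example `check_components({"mineru_venv": Path("F:/GraphRAGAgent/mineru_mvp/.venv/Scripts/python.exe")})`.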


F2. System statistics

GET /api/v1/system/stats

Response 200

{
  "code": 0,
  "msg": "success",
  "request_id": "...",
  "data": {
    "total_documents": 5,
    "indexed_documents": 4,
    "failed_documents": 1,
    "total_nodes": 200,
    "total_edges": 3900,
    "type_distribution": { "TECHNOLOGY": 20, "CONCEPT": 180 },
    "total_queries": 50,
    "active_jobs": 1,
    "storage_used_mb": 12.4
  }
}

F3. Supported file formats

GET /api/v1/system/formats

Response 200

{
  "code": 0,
  "msg": "success",
  "request_id": "...",
  "data": {
    "formats": [
      { "ext": "pdf",  "description": "PDF document (text-based / scanned / mixed)", "max_size_mb": 200, "max_pages": 600, "requires_ocr": false },
      { "ext": "docx", "description": "Microsoft Word (modern format)", "max_size_mb": 200, "max_pages": 600, "requires_ocr": false },
      { "ext": "doc",  "description": "Microsoft Word (legacy format)", "max_size_mb": 200, "max_pages": 600, "requires_ocr": false },
      { "ext": "pptx", "description": "PowerPoint (modern format)", "max_size_mb": 200, "max_pages": 600, "requires_ocr": false },
      { "ext": "ppt",  "description": "PowerPoint (legacy format)", "max_size_mb": 200, "max_pages": 600, "requires_ocr": false },
      { "ext": "png",  "description": "PNG image (single page)", "max_size_mb": 200, "max_pages": 1, "requires_ocr": true },
      { "ext": "jpg",  "description": "JPEG image (single page)", "max_size_mb": 200, "max_pages": 1, "requires_ocr": true },
      { "ext": "jpeg", "description": "JPEG image (single page)", "max_size_mb": 200, "max_pages": 1, "requires_ocr": true },
      { "ext": "html", "description": "HTML file (requires model_version=MinerU-HTML)", "max_size_mb": 200, "max_pages": 600, "requires_ocr": false }
    ],
    "ocr_languages": [
      { "code": "ch", "name": "Chinese (default)" },
      { "code": "en", "name": "English" },
      { "code": "japan", "name": "Japanese" },
      { "code": "korean", "name": "Korean" },
      { "code": "french", "name": "French" },
      { "code": "german", "name": "German" }
    ],
    "notes": [
      "The language parameter defaults to 'ch' (not 'zh'), following the PaddleOCR v3 language codes",
      "Uploads do not need a Content-Type such as application/pdf; the server detects the format automatically",
      "PNG/JPG/JPEG files are processed as at most 1 page per request (an image file is treated as a single-page document)"
    ]
  }
}

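F4. Demo data (quick preview)

GET /api/v1/system/demo

Note: Returns the existing data in output/kg_nodes.json and output/kg_edges.json, so the KG visualization can be previewed without uploading a PDF. Compatible with the legacy GET /api/demo endpoint (Flask web_server.py).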

Response 200

{
  "code": 0,
  "msg": "success",
  "request_id": "...",
  "data": {
    "nodes": [ ...KGNode[] ],
    "edges": [ ...KGEdge[] ],
    "stats": {
      "nodes": 40,
      "edges": 780,
      "type_counts": { "TECHNOLOGY": 4, "CONCEPT": 36 },
      "density": 1.0000
    }
  }
}

Errors: 3002 (demo data files not found; run bridge.py first to generate them)
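
The `stats` values in the sample response are consistent with a complete undirected graph: 40 nodes allow at most 40 × 39 / 2 = 780 edges, so 780 edges give a density of 1.0. The sketch below recomputes the block under the assumption that density follows the standard undirected definition 2E / (N(N − 1)); the specification does not spell out the formula, and the function name `graph_stats` is hypothetical.

```python
from collections import Counter
from itertools import combinations


def graph_stats(nodes: list[dict], edges: list[dict]) -> dict:
    """Compute the `stats` block from node and edge lists."""
    n, m = len(nodes), len(edges)
    # Undirected simple graph: at most n * (n - 1) / 2 edges.
    density = 2 * m / (n * (n - 1)) if n > 1 else 0.0
    return {
        "nodes": n,
        "edges": m,
        "type_counts": dict(Counter(node["type"] for node in nodes)),
        "density": round(density, 4),
    }


# Rebuild the sample numbers: 4 TECHNOLOGY + 36 CONCEPT nodes, every pair linked.
nodes = [{"id": f"n{i}", "type": "TECHNOLOGY" if i < 4 else "CONCEPT"} for i in range(40)]
edges = [{"source": a["id"], "target": b["id"]} for a, b in combinations(nodes, 2)]
stats = graph_stats(nodes, edges)
print(stats)
```

A density of exactly 1.0 means every entity is linked to every other one, which is what a pure co-occurrence relation produces when all entities come from the same document.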


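10. File format support matrix

| Format | Extension | Max size | Max pages | OCR | MinerU model_version | Notes |
|--------|-----------|----------|-----------|-----|----------------------|-------|
| PDF | .pdf | 200 MB | 600 | Optional | pipeline (default) | Core capability; text-based, scanned, and mixed PDFs are all supported |
| Word | .docx | 200 MB | 600 | Optional | pipeline | |
| Word | .doc | 200 MB | 600 | Optional | pipeline | |
| PPT | .pptx | 200 MB | 600 | Optional | pipeline | |
| PPT | .ppt | 200 MB | 600 | Optional | pipeline | |
| PNG image | .png | 200 MB | 1 | Required | pipeline | EXIF orientation is corrected automatically |
| JPEG image | .jpg | 200 MB | 1 | Required | pipeline | EXIF orientation is corrected automatically |
| JPEG image | .jpeg | 200 MB | 1 | Required | pipeline | Same as .jpg |
| HTML | .html | 200 MB | 600 | | MinerU-HTML | model_version must be set explicitly |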

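MinerU cloud API limits (from mineru_specification-v1.0.md):

| Constraint | Limit |
|------------|-------|
| Maximum size per file | 200 MB |
| Maximum pages per file | 600 pages |
| Maximum files per batch request | 200 |
| Pre-signed upload URL validity | 24 hours |
| Daily top-priority quota (cloud API) | 2,000 pages (pages beyond the quota run at lower priority) |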

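Server-side validation code (FastAPI + Pydantic):

from pathlib import Path

from fastapi import File, HTTPException, UploadFile

ALLOWED_EXTENSIONS = {"pdf", "docx", "doc", "pptx", "ppt", "png", "jpg", "jpeg", "html"}
MAX_FILE_SIZE_MB = 200

async def upload_document(file: UploadFile = File(...)):  # other parameters omitted
    # Extension without the leading dot, lower-cased ("Report.PDF" -> "pdf").
    ext = Path(file.filename or "").suffix.lower().lstrip(".")
    if ext not in ALLOWED_EXTENSIONS:
        raise HTTPException(400, detail=f"Unsupported format: .{ext}")

    # Note: this reads the whole upload into memory before checking its size.
    content = await file.read()
    size_mb = len(content) / (1024 * 1024)
    if size_mb > MAX_FILE_SIZE_MB:
        raise HTTPException(400, detail=f"File size {size_mb:.1f}MB exceeds {MAX_FILE_SIZE_MB}MB limit")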
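
The size check above loads the entire upload into memory before rejecting it. One possible alternative, shown here as a sketch rather than as part of the specification, is to count bytes in chunks and stop as soon as the limit is crossed. The helper name `exceeds_limit` and the 1 MiB chunk size are assumptions; in a FastAPI handler the binary stream would be the `file.file` attribute of the `UploadFile`.

```python
import io
from typing import BinaryIO

MAX_FILE_SIZE_MB = 200
CHUNK_SIZE = 1024 * 1024  # read 1 MiB at a time (arbitrary choice)


def exceeds_limit(stream: BinaryIO, max_bytes: int = MAX_FILE_SIZE_MB * 1024 * 1024) -> bool:
    """Return True as soon as the stream is known to be larger than max_bytes."""
    total = 0
    while chunk := stream.read(CHUNK_SIZE):
        total += len(chunk)
        if total > max_bytes:
            return True  # stop early; the rest of the upload is never read
    stream.seek(0)  # rewind so the accepted file can be processed afterwards
    return False


# Small in-memory demonstration with a 10-byte limit.
print(exceeds_limit(io.BytesIO(b"x" * 11), max_bytes=10))
print(exceeds_limit(io.BytesIO(b"x" * 10), max_bytes=10))
```

With this approach an oversized file costs at most one chunk of memory beyond the limit check, at the price of a second pass over accepted files.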

11. Dependencies and running

Install dependencies

# FastAPI + uvicorn + multipart file upload
uv pip install fastapi "uvicorn[standard]" python-multipart \
    --python F:/GraphRAGAgent/langextract_src/.venv/Scripts/python.exe

# Already installed (no need to reinstall)
# langextract[all], langchain, langchain-openai, networkx, python-dotenv, flask, requests

Start the service

# Development mode (--reload enables hot reload)
F:/GraphRAGAgent/langextract_src/.venv/Scripts/python.exe -m uvicorn \
    graphrag_pipeline.api_server:app \
    --host 0.0.0.0 --port 8000 --reload

# Or run the main entry point directly
F:/GraphRAGAgent/langextract_src/.venv/Scripts/python.exe \
    F:/GraphRAGAgent/graphrag_pipeline/api_server.py

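Accessing the API docs

FastAPI generates the OpenAPI documentation automatically. Once the service is running, it is available at:

| URL | Description |
|-----|-------------|
| http://localhost:8000/api/v1/health | Health check (confirms the service has started) |
| http://localhost:8000/docs | Swagger UI (interactive API docs) |
| http://localhost:8000/redoc | ReDoc (read-only API docs) |
| http://localhost:8000/openapi.json | OpenAPI JSON Schema |

Ports

| Service | Port | Description |
|---------|------|-------------|
| FastAPI | 8000 | Production-grade API described in this specification |
| Flask web_server.py | 5000 | Prototype, kept for comparison |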