# 多模态 RAG 后端服务接口规范 v1.0 > 基于 MinerU + LangExtract Bridge Pipeline + Agentic-RAG MVP 实测验证结果 > Web 框架:FastAPI (Python 3.12 async) > 存储方案:纯文件系统(JSON) > 更新日期:2026-03-05 --- ## 目录 - [一、系统架构总览](#一系统架构总览) - [1.1 四层架构](#11-四层架构) - [1.2 双 venv 协调方案](#12-双-venv-协调方案) - [1.3 完整数据流](#13-完整数据流) - [1.4 Job 状态机](#14-job-状态机) - [1.5 FastAPI 项目目录结构](#15-fastapi-项目目录结构) - [1.6 文件系统存储结构](#16-文件系统存储结构) - [二、统一响应封装格式](#二统一响应封装格式) - [2.1 通用响应结构](#21-通用响应结构) - [2.2 错误码体系](#22-错误码体系) - [三、核心数据对象 Schema](#三核心数据对象-schema) - [3.1 DocumentInfo](#31-documentinfo) - [3.2 IndexingJobStatus](#32-indexingjobstatus) - [3.3 KGNode](#33-kgnode) - [3.4 KGEdge](#34-kgedge) - [3.5 ExtractionRecord](#35-extractionrecord) - [3.6 QAResult](#36-qaresult) - [四、A 组:文档管理(4 个端点)](#四a-组文档管理4-个端点) - [五、B 组:Indexing Pipeline(4 个端点)](#五b-组indexing-pipeline4-个端点) - [六、C 组:知识图谱(6 个端点)](#六c-组知识图谱6-个端点) - [七、D 组:QA 问答(4 个端点)](#七d-组qa-问答4-个端点) - [八、E 组:搜索(3 个端点)](#八e-组搜索3-个端点) - [九、F 组:系统(4 个端点)](#九f-组系统4-个端点) - [十、文件格式支持矩阵](#十文件格式支持矩阵) - [十一、依赖与运行](#十一依赖与运行) --- ## 一、系统架构总览 ### 1.1 四层架构 ``` ┌─────────────────────────────────────────────────────────────────────┐ │ 客户端层 │ │ 浏览器 / API 调用方 / 可视化前端 │ └──────────────────────────────┬──────────────────────────────────────┘ │ HTTP/HTTPS ┌──────────────────────────────▼──────────────────────────────────────┐ │ API 网关层 │ │ Nginx 反向代理 | 限流(per-IP/per-key) | 请求日志 | TLS 终止 │ └──────────────────────────────┬──────────────────────────────────────┘ │ ┌──────────────────────────────▼──────────────────────────────────────┐ │ 服务层 — FastAPI Application │ │ Python 3.12 async / uvicorn │ │ │ │ ┌────────────────┐ ┌────────────────┐ ┌───────────────────────┐ │ │ │ DocumentService│ │ IndexingService│ │ KGService │ │ │ │ 文件上传/管理 │ │ Pipeline 调度 │ │ NetworkX 图操作 │ │ │ └────────────────┘ └────────────────┘ └───────────────────────┘ │ │ ┌────────────────┐ ┌────────────────┐ ┌───────────────────────┐ │ │ │ QAService │ │ SearchService │ │ SystemService │ │ │ │ Agentic-RAG │ │ 实体/图谱搜索 │ │ 健康检查 / 统计 │ │ │ └────────────────┘ └────────────────┘ └───────────────────────┘ │ └──────────────────────────────┬──────────────────────────────────────┘ │ ┌──────────────────────────────▼──────────────────────────────────────┐ │ Pipeline 执行层 │ │ │ │ ┌──────────────────────────────────────────────────────────────┐ │ │ │ MinerU Pipeline(subprocess → mineru_mvp/.venv) │ │ │ │ 输入: 文件路径 输出: *content_list.json + layout.json │ │ │ └──────────────────────────────────────────────────────────────┘ │ │ ┌──────────────────────────────────────────────────────────────┐ │ │ │ Bridge Pipeline(直接 import → langextract_src/.venv) │ │ │ │ text_assembler → entity_extractor → kg_builder │ │ │ │ 输出: kg_nodes.json + kg_edges.json │ │ │ └──────────────────────────────────────────────────────────────┘ │ │ ┌──────────────────────────────────────────────────────────────┐ │ │ │ Agentic-RAG(LangChain create_agent → langextract_src/.venv)│ │ │ │ 工具: search_entities / get_neighbors / get_entities_by_type │ │ │ │ describe_graph │ │ │ │ LLM: DeepSeek deepseek-chat via ChatOpenAI │ │ │ └──────────────────────────────────────────────────────────────┘ │ └──────────────────────────────┬──────────────────────────────────────┘ │ ┌──────────────────────────────▼──────────────────────────────────────┐ │ 存储层(纯文件系统) │ │ uploads/ ← 原始上传文件 │ │ jobs/{job_id}/ ← 每个 job 的中间产物和结果 JSON │ │ kg/ ← 全局合并的 KG(kg_nodes.json + kg_edges.json) │ └─────────────────────────────────────────────────────────────────────┘ ``` ### 1.2 双 venv 协调方案 项目中存在两个隔离的 Python 虚拟环境,FastAPI 服务通过以下方式协调: | 组件 | 虚拟环境 | 调用方式 | |------|---------|---------| | **FastAPI 服务本体** | `langextract_src/.venv` | 直接运行 | | **Bridge Pipeline** | `langextract_src/.venv` | `from text_assembler import ...` 直接 import | | **Agentic-RAG** | `langextract_src/.venv` | `from agentic_rag_mvp import ...` 直接 import | | **MinerU Pipeline** | `mineru_mvp/.venv` | `subprocess.run([MINERU_PYTHON, MINERU_PIPELINE, pdf_path])` | ```python # 双 venv 协调核心代码 MINERU_PYTHON = Path("F:/GraphRAGAgent/mineru_mvp/.venv/Scripts/python.exe") MINERU_PIPELINE = Path("F:/GraphRAGAgent/mineru_mvp/pipeline.py") # Stage 1: MinerU — subprocess 隔离调用 result = subprocess.run( [str(MINERU_PYTHON), str(MINERU_PIPELINE), str(pdf_path)], cwd=str(MINERU_DIR), capture_output=True, text=True, timeout=600 ) # Stage 2-4: Bridge + RAG — 直接 import(同 venv) from text_assembler import load_content_list, assemble_pages from entity_extractor import create_model, extract_entities from kg_builder import build_kg ``` ### 1.3 完整数据流 ``` 上传文件(PDF/DOCX/PPT/PNG/JPG/HTML) │ ▼ POST /api/v1/documents/upload DocumentService: 保存到 uploads/{doc_id}_{filename} │ ▼ POST /api/v1/index/start IndexingService: 启动后台 threading.Thread │ ├─ Stage: parsing │ MinerU subprocess → mineru_mvp/output/{stem}/*_content_list.json │ ├─ Stage: extracting │ text_assembler.assemble_pages() → PageText[] │ entity_extractor.extract_entities() → AnnotatedDocument[] │ → ExtractionRecord[] 保存到 jobs/{job_id}/extractions.json │ ├─ Stage: indexing │ kg_builder.build_kg() → KGNode[] + KGEdge[] │ → 保存到 jobs/{job_id}/kg_nodes.json + kg_edges.json │ → 合并到全局 kg/kg_nodes.json + kg/kg_edges.json │ └─ Status: done GET /api/v1/index/result/{job_id} → 完整结果 用户查询(自然语言问题) │ ▼ POST /api/v1/query QAService: 加载全局 KG → NetworkX Graph │ ├─ LangChain create_agent(DeepSeek) │ ReAct 循环: think → tool_call → observe → repeat │ 工具调用链: search_entities / get_neighbors / ... │ └─ QAResult: answer + tool_calls + cited_nodes ``` ### 1.4 Job 状态机 ``` ┌─────────┐ │submitted│ └────┬────┘ │ 后台线程启动 ┌────▼────┐ │ queued │ (等待线程池,当前实现立即转 parsing) └────┬────┘ │ MinerU subprocess 开始 ┌────▼────┐ │ parsing │ MinerU 云端 API 解析 └────┬────┘ │ content_list.json 就绪 ┌─────▼──────┐ │ extracting │ LangExtract + DeepSeek 实体抽取 └─────┬──────┘ │ extractions.json 就绪 ┌─────▼──────┐ │ indexing │ kg_builder 构建知识图谱 └─────┬──────┘ │ kg_nodes/edges 就绪 ┌──────────▼──────────┐ ┌─────▼─────┐ ┌──────▼──────┐ │ done │ │ failed │ └───────────┘ └─────────────┘ ``` **进度字段说明(`progress` 对象):** | 阶段 | `parsed_pages` | `total_pages` | `extracted_entities` | |------|----------------|---------------|----------------------| | parsing | 实时更新(MinerU 进度) | MinerU 返回总页数 | 0 | | extracting | total_pages | total_pages | 实时累加 | | indexing | total_pages | total_pages | 最终值 | | done | total_pages | total_pages | 最终值 | ### 1.5 FastAPI 项目目录结构 ``` F:\GraphRAGAgent\graphrag_pipeline\ ├── api_server.py # FastAPI 主入口(app 实例、路由注册、启动配置) ├── routers/ │ ├── __init__.py │ ├── documents.py # A 组:文档管理(4 个端点) │ ├── indexing.py # B 组:Indexing Pipeline(4 个端点) │ ├── kg.py # C 组:知识图谱(6 个端点) │ ├── query.py # D 组:QA 问答(4 个端点) │ ├── search.py # E 组:搜索(3 个端点) │ └── system.py # F 组:系统(4 个端点) ├── services/ │ ├── __init__.py │ ├── document_service.py # 文件保存、元数据读写 │ ├── indexing_service.py # Pipeline 调度(MinerU subprocess + Bridge import) │ ├── kg_service.py # NetworkX 图加载、BFS、中心性计算 │ ├── qa_service.py # create_agent 封装、ReAct 调用、结果解析 │ └── search_service.py # 实体搜索、路径搜索、子图搜索 ├── models/ │ ├── __init__.py │ └── schemas.py # Pydantic v2 models(所有数据对象 Schema) ├── storage/ │ ├── __init__.py │ └── file_store.py # 统一文件读写(JSON 序列化/反序列化、目录管理) ├── .env # DEEPSEEK_API_KEY + DEEPSEEK_BASE_URL + MINERU_API_TOKEN │ │ # 现有文件(不修改) ├── bridge.py ├── text_assembler.py ├── entity_extractor.py ├── kg_builder.py ├── agentic_rag_mvp.py ├── web_server.py # 旧 Flask 原型(保留,不删除) └── output/ ├── kg_nodes.json # 向后兼容的全局 KG(与 kg/ 目录同步) └── kg_edges.json ``` ### 1.6 文件系统存储结构 ``` F:\GraphRAGAgent\graphrag_pipeline\ │ ├── uploads/ │ └── {doc_id}_{filename} # 上传的原始文件(如 abc12345_paper.pdf) │ ├── jobs/ │ └── {job_id}/ │ ├── meta.json # job 元数据 │ │ { │ │ "job_id": "job_xyz789", │ │ "doc_id": "abc12345", │ │ "status": "done", │ │ "stage": "Complete", │ │ "progress": {...}, │ │ "created_at": "ISO8601", │ │ "elapsed_seconds": 42.1, │ │ "error": null, │ │ "pdf_name": "paper.pdf", │ │ "pdf_path": "uploads/abc12345_paper.pdf" │ │ } │ ├── mineru_output/ # MinerU 解析产物(原样保留) │ │ ├── {uuid}_content_list.json │ │ ├── layout.json │ │ ├── full.md │ │ ├── {uuid}_origin.pdf │ │ └── images/ │ │ └── {sha256}.jpg │ ├── extractions.json # LangExtract 全部抽取记录(ExtractionRecord[]) │ ├── kg_nodes.json # 本 job 生成的 KG 节点(KGNode[]) │ └── kg_edges.json # 本 job 生成的 KG 边(KGEdge[]) │ └── kg/ ├── kg_nodes.json # 全局合并的 KG 节点(所有 job 合并去重) └── kg_edges.json # 全局合并的 KG 边(所有 job 合并去重) ``` --- ## 二、统一响应封装格式 ### 2.1 通用响应结构 所有 API 端点均使用以下统一包装格式: ```json { "code": 0, "msg": "success", "request_id": "f47ac10b-58cc-4372-a567-0e02b2c3d479", "data": { ... } } ``` | 字段 | 类型 | 说明 | |------|------|------| | `code` | `int` | `0` = 成功;非 `0` = 失败(见错误码表) | | `msg` | `string` | 状态描述(成功为 `"success"`,失败为错误信息) | | `request_id` | `string` | UUID v4,用于日志追踪 | | `data` | `object \| null` | 业务数据(失败时为 `null`) | **HTTP 状态码映射:** | HTTP 状态码 | 适用场景 | |------------|---------| | `200 OK` | 同步请求成功 | | `202 Accepted` | 异步任务已接受(Job 启动) | | `400 Bad Request` | 参数校验失败(code 1001/1002/1003) | | `404 Not Found` | 资源不存在(code 2001/3001) | | `500 Internal Server Error` | 服务器内部错误(code 5000) | **FastAPI Pydantic 响应模型:** ```python from pydantic import BaseModel from typing import Generic, TypeVar, Optional import uuid T = TypeVar("T") class APIResponse(BaseModel, Generic[T]): code: int = 0 msg: str = "success" request_id: str = str(uuid.uuid4()) data: Optional[T] = None ``` ### 2.2 错误码体系 | code | HTTP 状态码 | 含义 | 说明 | |------|------------|------|------| | `0` | 200 | 成功 | | | `1001` | 400 | 参数校验失败 | 缺少必填字段或类型错误 | | `1002` | 400 | 文件格式不支持 | 仅支持 pdf/docx/doc/pptx/ppt/png/jpg/jpeg/html | | `1003` | 400 | 文件超出大小限制 | 单文件最大 200MB(MinerU 限制) | | `1004` | 400 | 文件页数超限 | 单文件最大 600 页(MinerU 限制) | | `2001` | 404 | 文档不存在 | `doc_id` 对应的文档未找到 | | `2002` | 400 | Job 不存在 | `job_id` 对应的任务未找到 | | `2003` | 400 | Job 仍在执行 | 请求结果时任务尚未完成 | | `2004` | 400 | Job 状态不可取消 | 仅 submitted/queued 可取消 | | `3001` | 404 | KG 节点不存在 | `node_id` 对应节点未找到 | | `3002` | 400 | KG 为空 | 尚未完成任何 Indexing,无图谱数据 | | `4001` | 500 | QA 服务异常 | LangChain Agent 或 DeepSeek API 调用失败 | | `5000` | 500 | 服务器内部错误 | 未预期的系统异常 | **错误响应示例:** ```json { "code": 1002, "msg": "Unsupported file format: .xlsx. Supported formats: pdf, docx, doc, pptx, ppt, png, jpg, jpeg, html", "request_id": "f47ac10b-58cc-4372-a567-0e02b2c3d479", "data": null } ``` --- ## 三、核心数据对象 Schema ### 3.1 DocumentInfo 文档元数据对象,由 `POST /api/v1/documents/upload` 创建,持久化到 `jobs/` 下的 `meta.json`。 ```json { "doc_id": "abc12345", "filename": "graphrag_overview.pdf", "format": "pdf", "size_bytes": 1048576, "pages": 4, "uploaded_at": "2026-03-05T10:00:00Z", "status": "indexed", "language": "en", "enable_formula": true, "enable_table": true } ``` | 字段 | 类型 | 说明 | |------|------|------| | `doc_id` | `string` | 文档唯一 ID(UUID hex 前 8 位,如 `"abc12345"`) | | `filename` | `string` | 原始文件名 | | `format` | `string` | 文件格式(小写扩展名,不含点) | | `size_bytes` | `int` | 文件大小(字节) | | `pages` | `int \| null` | 总页数(MinerU 解析后填充;上传时为 `null`) | | `uploaded_at` | `string` | ISO 8601 上传时间 | | `status` | `string` | `"uploaded"` / `"indexed"` / `"failed"` | | `language` | `string` | OCR 语言码(PaddleOCR,默认 `"ch"`) | | `enable_formula` | `bool` | 是否启用公式识别 | | `enable_table` | `bool` | 是否启用表格识别 | ### 3.2 IndexingJobStatus Indexing Pipeline 的任务状态对象。 ```json { "job_id": "job_xyz789", "doc_id": "abc12345", "status": "extracting", "stage": "Extracting entities (LangExtract + DeepSeek)...", "progress": { "parsed_pages": 4, "total_pages": 4, "extracted_entities": 23 }, "created_at": "2026-03-05T10:00:05Z", "elapsed_seconds": 18.3, "error": null } ``` | 字段 | 类型 | 说明 | |------|------|------| | `job_id` | `string` | 任务唯一 ID(`"job_"` + UUID hex 前 8 位) | | `doc_id` | `string` | 关联文档 ID | | `status` | `string` | 状态枚举(见 1.4 状态机) | | `stage` | `string` | 当前阶段人类可读描述 | | `progress.parsed_pages` | `int` | 已解析页数 | | `progress.total_pages` | `int` | 总页数(0 = 未知) | | `progress.extracted_entities` | `int` | 已抽取实体数 | | `created_at` | `string` | ISO 8601 任务创建时间 | | `elapsed_seconds` | `float` | 已耗时(秒) | | `error` | `string \| null` | 错误信息(失败时非 null) | ### 3.3 KGNode 知识图谱节点,直接对应 `kg_nodes.json` 格式,新增 `degree` 字段。 ```json { "id": "tech_graphrag_0", "name": "GraphRAG", "type": "TECHNOLOGY", "source_doc": "abc12345", "char_start": 0, "char_end": 8, "confidence": "match_exact", "page": 0, "degree": 39 } ``` | 字段 | 类型 | 说明 | |------|------|------| | `id` | `string` | 节点唯一 ID(来自 kg_nodes.json) | | `name` | `string` | 实体名称 | | `type` | `string` | 实体类型:`TECHNOLOGY` / `CONCEPT` / `PERSON` / `ORGANIZATION` / `LOCATION` | | `source_doc` | `string` | 来源文档 ID(doc_id) | | `char_start` | `int` | 实体在原文中的起始字符位置(LangExtract `char_interval.start_pos`) | | `char_end` | `int` | 实体在原文中的结束字符位置(不含,`char_interval.end_pos`) | | `confidence` | `string` | LangExtract 对齐状态:`match_exact` / `match_greater` / `match_lesser` / `match_fuzzy` | | `page` | `int` | 所在页码(0-indexed,来自 MinerU content_list.json `page_idx`) | | `degree` | `int` | 节点度数(连接边数,NetworkX 计算,仅 API 返回时填充) | ### 3.4 KGEdge 知识图谱边,直接对应 `kg_edges.json` 格式。 ```json { "source": "tech_graphrag_0", "target": "concept_knowledgegraph_1", "relation": "CO_OCCURS_IN", "doc_id": "abc12345", "page": 0 } ``` | 字段 | 类型 | 说明 | |------|------|------| | `source` | `string` | 起始节点 ID | | `target` | `string` | 目标节点 ID | | `relation` | `string` | 关系类型(当前固定为 `"CO_OCCURS_IN"`,表示同页共现) | | `doc_id` | `string` | 边来源文档 ID | | `page` | `int` | 共现所在页码(0-indexed) | ### 3.5 ExtractionRecord LangExtract 单条实体抽取记录,对应 `AnnotatedDocument.extractions[]` 的扁平化结构。 ```json { "text": "GraphRAG", "type": "TECHNOLOGY", "char_start": 0, "char_end": 8, "alignment": "match_exact", "page": 0, "doc_id": "abc12345" } ``` | 字段 | 类型 | 说明 | |------|------|------| | `text` | `string` | 实体文本(`extraction_text`,原文子串) | | `type` | `string` | 实体类型(`extraction_class`) | | `char_start` | `int \| null` | 字符起始位置(`char_interval.start_pos`) | | `char_end` | `int \| null` | 字符结束位置(`char_interval.end_pos`,不含) | | `alignment` | `string \| null` | 对齐状态(`alignment_status.value`,`null` 表示未对齐) | | `page` | `int` | 所在页码(0-indexed) | | `doc_id` | `string` | 来源文档 ID | > **过滤规则**:KG 构建时过滤掉 `alignment = null`(未对齐),`match_fuzzy` 根据项目配置可选是否过滤。当前实测:`match_exact` 占 94%+。 ### 3.6 QAResult Agentic-RAG 问答返回对象,包含答案 + 完整推理溯源链。 ```json { "query_id": "q_20260305_001", "question": "What is GraphRAG and how does it relate to knowledge graphs?", "answer": "GraphRAG is a knowledge graph-enhanced retrieval-augmented generation system...", "tool_calls": [ { "tool": "search_entities", "input": {"query": "GraphRAG"}, "output": "Found 1 entity(ies) matching 'GraphRAG':\n [TECHNOLOGY] \"GraphRAG\" (confidence=match_exact, page=0, id=tech_graphrag_0)" }, { "tool": "get_neighbors", "input": {"entity_name": "GraphRAG", "hops": 1}, "output": "Neighbors of 'GraphRAG' [TECHNOLOGY] within 1 hop(s):\n Hop 1 — 39 related entities:\n [CONCEPT] knowledge graphs\n ..." } ], "cited_nodes": ["tech_graphrag_0", "concept_knowledgegraph_1"], "elapsed_seconds": 8.4, "created_at": "2026-03-05T10:30:00Z" } ``` | 字段 | 类型 | 说明 | |------|------|------| | `query_id` | `string` | 查询唯一 ID | | `question` | `string` | 用户原始问题 | | `answer` | `string` | Agent 生成的最终自然语言答案(`result["messages"][-1].content`) | | `tool_calls` | `array` | ReAct 循环中的工具调用记录(顺序) | | `tool_calls[].tool` | `string` | 工具名(4 个 KG 工具之一) | | `tool_calls[].input` | `object` | 工具调用参数 | | `tool_calls[].output` | `string` | 工具返回的文本结果(ToolMessage.content) | | `cited_nodes` | `string[]` | 答案中引用的节点 ID 列表(从 tool_calls 解析) | | `elapsed_seconds` | `float` | 问答总耗时(包括所有 LLM 调用) | | `created_at` | `string` | ISO 8601 查询时间 | --- ## 四、A 组:文档管理(4 个端点) ### A1. 上传文件 ``` POST /api/v1/documents/upload Content-Type: multipart/form-data ``` **Request(Form Data):** | 字段 | 类型 | 必填 | 默认值 | 说明 | |------|------|------|--------|------| | `file` | `binary` | **是** | — | 文件二进制内容 | | `language` | `string` | 否 | `"ch"` | OCR 语言(PaddleOCR 语言码) | | `enable_formula` | `bool` | 否 | `true` | 是否启用公式识别 | | `enable_table` | `bool` | 否 | `true` | 是否启用表格识别 | **验证规则:** - 文件扩展名必须在支持列表中(见第十章) - 文件大小不得超过 200MB - 文件名不得包含路径分隔符(防目录穿越) **Response 200:** ```json { "code": 0, "msg": "success", "request_id": "f47ac10b-...", "data": { "doc_id": "abc12345", "filename": "graphrag_overview.pdf", "format": "pdf", "size_bytes": 1048576, "pages": null, "uploaded_at": "2026-03-05T10:00:00Z", "status": "uploaded", "language": "en", "enable_formula": true, "enable_table": true } } ``` **错误响应:** ```json // 1002: 格式不支持 { "code": 1002, "msg": "Unsupported file format: .xlsx", "data": null } // 1003: 超过大小限制 { "code": 1003, "msg": "File size 256MB exceeds 200MB limit", "data": null } ``` --- ### A2. 获取文档信息 ``` GET /api/v1/documents/{doc_id} ``` **Path Params:** | 参数 | 类型 | 说明 | |------|------|------| | `doc_id` | `string` | 文档 ID | **Response 200:** ```json { "code": 0, "msg": "success", "request_id": "...", "data": { "doc_id": "abc12345", "filename": "graphrag_overview.pdf", "format": "pdf", "size_bytes": 1048576, "pages": 4, "uploaded_at": "2026-03-05T10:00:00Z", "status": "indexed", "language": "en", "enable_formula": true, "enable_table": true } } ``` **错误:** `2001` (doc_id 不存在) --- ### A3. 列出所有文档 ``` GET /api/v1/documents ``` **Query Params:** | 参数 | 类型 | 默认值 | 说明 | |------|------|--------|------| | `page` | `int` | `1` | 页码(从 1 开始) | | `page_size` | `int` | `20` | 每页数量(最大 100) | | `status` | `string` | — | 按状态筛选:`uploaded` / `indexed` / `failed` | | `format` | `string` | — | 按格式筛选:如 `pdf` | **Response 200:** ```json { "code": 0, "msg": "success", "request_id": "...", "data": { "total": 5, "page": 1, "page_size": 20, "items": [ { "doc_id": "abc12345", "filename": "graphrag_overview.pdf", "format": "pdf", "size_bytes": 1048576, "pages": 4, "uploaded_at": "2026-03-05T10:00:00Z", "status": "indexed", "language": "en", "enable_formula": true, "enable_table": true } ] } } ``` --- ### A4. 删除文档 ``` DELETE /api/v1/documents/{doc_id} ``` **说明:** 删除文档及其关联的 job 产物文件(`uploads/`、`jobs/` 下的对应目录),并从全局 KG 中移除该文档贡献的节点和边。 **Response 200:** ```json { "code": 0, "msg": "success", "request_id": "...", "data": { "deleted": true, "doc_id": "abc12345", "removed_nodes": 40, "removed_edges": 780 } } ``` **错误:** `2001` (doc_id 不存在) --- ## 五、B 组:Indexing Pipeline(4 个端点) ### B1. 启动索引任务 ``` POST /api/v1/index/start Content-Type: application/json ``` **Request Body:** ```json { "doc_id": "abc12345" } ``` | 字段 | 类型 | 必填 | 说明 | |------|------|------|------| | `doc_id` | `string` | **是** | 已上传文档的 ID(状态须为 `uploaded`) | **Response 202:** ```json { "code": 0, "msg": "success", "request_id": "...", "data": { "job_id": "job_xyz789", "doc_id": "abc12345", "status": "submitted", "stage": "Job submitted", "created_at": "2026-03-05T10:00:05Z" } } ``` **实现说明:** ```python # IndexingService 内部实现 def start_indexing(doc_id: str) -> IndexingJobStatus: job_id = f"job_{uuid.uuid4().hex[:8]}" job_dir = JOBS_DIR / job_id job_dir.mkdir(parents=True) meta = { "job_id": job_id, "doc_id": doc_id, "status": "submitted", ... } save_meta(job_dir / "meta.json", meta) thread = threading.Thread(target=run_pipeline, args=(job_id,), daemon=True) thread.start() return meta ``` **Pipeline 执行顺序(后台线程):** 1. `status = "parsing"` → `subprocess.run([MINERU_PYTHON, MINERU_PIPELINE, pdf_path])` 2. `status = "extracting"` → `load_content_list()` → `assemble_pages()` → `extract_entities()` per page 3. `status = "indexing"` → `build_kg()` → 保存 `jobs/{job_id}/kg_nodes.json` → 合并到 `kg/` 4. `status = "done"` --- ### B2. 查询任务状态(含实时进度) ``` GET /api/v1/index/status/{job_id} ``` **推荐轮询间隔:** 3 秒 **Response 200:** ```json { "code": 0, "msg": "success", "request_id": "...", "data": { "job_id": "job_xyz789", "doc_id": "abc12345", "status": "extracting", "stage": "Extracting entities page 2/4 (LangExtract + DeepSeek)...", "progress": { "parsed_pages": 4, "total_pages": 4, "extracted_entities": 23 }, "created_at": "2026-03-05T10:00:05Z", "elapsed_seconds": 18.3, "error": null } } ``` **各状态 `stage` 典型值:** | status | stage | |--------|-------| | `submitted` | `"Job submitted"` | | `queued` | `"Waiting for worker..."` | | `parsing` | `"MinerU PDF parsing (cloud API)..."` | | `extracting` | `"Extracting entities page 2/4 (LangExtract + DeepSeek)..."` | | `indexing` | `"Building knowledge graph..."` | | `done` | `"Complete"` | | `failed` | `"Error: {error message}"` | **错误:** `2002` (job_id 不存在) --- ### B3. 获取索引结果(完整数据) ``` GET /api/v1/index/result/{job_id} ``` **Response 200(status = done):** ```json { "code": 0, "msg": "success", "request_id": "...", "data": { "job_id": "job_xyz789", "doc_id": "abc12345", "status": "done", "stats": { "blocks": 32, "block_types": {"text": 31, "table": 1}, "pages": 4, "raw_extractions": 45, "nodes": 40, "edges": 780, "type_counts": {"TECHNOLOGY": 4, "CONCEPT": 36}, "alignment_counts": {"match_exact": 40, "match_fuzzy": 5}, "elapsed_seconds": 42.1 }, "extractions": [ { "text": "GraphRAG", "type": "TECHNOLOGY", "char_start": 0, "char_end": 8, "alignment": "match_exact", "page": 0, "doc_id": "abc12345" } ], "nodes": [ { "id": "tech_graphrag_0", "name": "GraphRAG", "type": "TECHNOLOGY", "source_doc": "abc12345", "char_start": 0, "char_end": 8, "confidence": "match_exact", "page": 0, "degree": 39 } ], "edges": [ { "source": "tech_graphrag_0", "target": "concept_knowledgegraph_1", "relation": "CO_OCCURS_IN", "doc_id": "abc12345", "page": 0 } ] } } ``` **Response 200(status ≠ done):** 返回 `IndexingJobStatus`(不含 stats/extractions/nodes/edges) **错误:** `2002` (job_id 不存在) --- ### B4. 取消任务 ``` DELETE /api/v1/index/jobs/{job_id} ``` **限制:** 仅 `submitted` 或 `queued` 状态可取消;`parsing`/`extracting`/`indexing` 状态无法中断后台线程,仅标记状态为 `cancelled`。 **Response 200:** ```json { "code": 0, "msg": "success", "request_id": "...", "data": { "cancelled": true, "job_id": "job_xyz789", "previous_status": "submitted" } } ``` **错误:** `2002` (不存在), `2004` (状态不可取消) --- ## 六、C 组:知识图谱(6 个端点) ### C1. 获取所有节点(分页 + 筛选) ``` GET /api/v1/kg/nodes ``` **Query Params:** | 参数 | 类型 | 默认值 | 说明 | |------|------|--------|------| | `type` | `string` | — | 实体类型筛选(大小写不敏感) | | `doc_id` | `string` | — | 按来源文档筛选 | | `confidence` | `string` | — | 对齐状态筛选(如 `match_exact`) | | `page` | `int` | `1` | 页码 | | `page_size` | `int` | `50` | 每页数量(最大 200) | **Response 200:** ```json { "code": 0, "msg": "success", "request_id": "...", "data": { "total": 40, "page": 1, "page_size": 50, "items": [ { "id": "tech_graphrag_0", "name": "GraphRAG", "type": "TECHNOLOGY", "source_doc": "abc12345", "char_start": 0, "char_end": 8, "confidence": "match_exact", "page": 0, "degree": 39 } ] } } ``` **错误:** `3002` (KG 为空) --- ### C2. 获取所有边(分页) ``` GET /api/v1/kg/edges ``` **Query Params:** | 参数 | 类型 | 默认值 | 说明 | |------|------|--------|------| | `doc_id` | `string` | — | 按来源文档筛选 | | `relation` | `string` | — | 关系类型筛选(如 `CO_OCCURS_IN`) | | `page` | `int` | `1` | 页码 | | `page_size` | `int` | `100` | 每页数量(最大 500) | **Response 200:** ```json { "code": 0, "msg": "success", "request_id": "...", "data": { "total": 780, "page": 1, "page_size": 100, "items": [ { "source": "tech_graphrag_0", "target": "concept_knowledgegraph_1", "relation": "CO_OCCURS_IN", "doc_id": "abc12345", "page": 0 } ] } } ``` --- ### C3. 获取单个节点详情 ``` GET /api/v1/kg/nodes/{node_id} ``` **Response 200:** ```json { "code": 0, "msg": "success", "request_id": "...", "data": { "id": "tech_graphrag_0", "name": "GraphRAG", "type": "TECHNOLOGY", "source_doc": "abc12345", "char_start": 0, "char_end": 8, "confidence": "match_exact", "page": 0, "degree": 39, "degree_centrality": 1.000, "neighbor_count": 39 } } ``` **额外字段(仅单节点详情):** | 字段 | 说明 | |------|------| | `degree_centrality` | NetworkX `degree_centrality(G)[node_id]`(0-1 范围) | | `neighbor_count` | 直接邻居数量(等于 `degree`) | **错误:** `3001` (节点不存在) --- ### C4. 获取节点邻居(N-hop BFS) ``` GET /api/v1/kg/nodes/{node_id}/neighbors ``` **Query Params:** | 参数 | 类型 | 默认值 | 说明 | |------|------|--------|------| | `hops` | `int` | `1` | 跳数(1-3) | **Response 200:** ```json { "code": 0, "msg": "success", "request_id": "...", "data": { "center": { "id": "tech_graphrag_0", "name": "GraphRAG", "type": "TECHNOLOGY", "page": 0 }, "hops": 1, "neighbors_by_hop": { "1": [ { "id": "concept_knowledgegraph_1", "name": "knowledge graphs", "type": "CONCEPT", "page": 0 } ] }, "total_neighbors": 39 } } ``` **实现参考(来自 `agentic_rag_mvp.py`):** ```python reachable = nx.single_source_shortest_path_length(G, node_id, cutoff=hops) by_hop = {dist: [] for dist in range(1, hops+1)} for nid, dist in reachable.items(): if dist > 0: by_hop[dist].append(G.nodes[nid]) ``` **错误:** `3001` (节点不存在) --- ### C5. 知识图谱统计 ``` GET /api/v1/kg/stats ``` **Response 200:** ```json { "code": 0, "msg": "success", "request_id": "...", "data": { "total_nodes": 40, "total_edges": 780, "density": 1.0000, "type_distribution": { "TECHNOLOGY": 4, "CONCEPT": 36 }, "relation_types": { "CO_OCCURS_IN": 780 }, "top5_central_nodes": [ { "node_id": "tech_graphrag_0", "name": "GraphRAG", "type": "TECHNOLOGY", "centrality": 1.000 }, { "node_id": "concept_kgrag_1", "name": "Knowledge Graph Enhanced RAG System", "type": "CONCEPT", "centrality": 1.000 }, { "node_id": "concept_rag_2", "name": "retrieval-augmented generation", "type": "CONCEPT", "centrality": 1.000 }, { "node_id": "concept_kg_3", "name": "knowledge graphs", "type": "CONCEPT", "centrality": 1.000 }, { "node_id": "concept_llm_4", "name": "large language models", "type": "CONCEPT", "centrality": 1.000 } ], "source_documents": ["abc12345", "def67890"] } } ``` --- ### C6. 导出完整 KG ``` GET /api/v1/kg/export ``` **Query Params:** | 参数 | 类型 | 默认值 | 说明 | |------|------|--------|------| | `format` | `string` | `"json"` | 导出格式(当前仅支持 `json`) | | `doc_id` | `string` | — | 可选,仅导出指定文档的 KG | **Response 200:** ```json { "code": 0, "msg": "success", "request_id": "...", "data": { "format": "json", "doc_id": null, "total_nodes": 40, "total_edges": 780, "exported_at": "2026-03-05T12:00:00Z", "nodes": [ ...KGNode[] ], "edges": [ ...KGEdge[] ] } } ``` --- ## 七、D 组:QA 问答(4 个端点) ### D1. 提交 QA 查询(同步) ``` POST /api/v1/query Content-Type: application/json ``` **Request Body:** ```json { "question": "What is GraphRAG and how does it relate to knowledge graphs?", "history": [ { "role": "human", "content": "Previous question..." }, { "role": "ai", "content": "Previous answer..." } ] } ``` | 字段 | 类型 | 必填 | 说明 | |------|------|------|------| | `question` | `string` | **是** | 用户自然语言问题 | | `history` | `array` | 否 | 多轮对话历史(最多 10 轮,即 20 条消息) | | `history[].role` | `"human"` \| `"ai"` | — | 消息角色 | | `history[].content` | `string` | — | 消息内容 | **Response 200:** ```json { "code": 0, "msg": "success", "request_id": "...", "data": { "query_id": "q_20260305_a1b2c3", "question": "What is GraphRAG and how does it relate to knowledge graphs?", "answer": "Based on the knowledge graph, GraphRAG [TECHNOLOGY] is a knowledge graph-enhanced retrieval-augmented generation system that...", "tool_calls": [ { "tool": "search_entities", "input": { "query": "GraphRAG" }, "output": "Found 1 entity(ies) matching 'GraphRAG':\n [TECHNOLOGY] \"GraphRAG\" (confidence=match_exact, page=0, id=tech_graphrag_0)" }, { "tool": "get_neighbors", "input": { "entity_name": "GraphRAG", "hops": 1 }, "output": "Neighbors of 'GraphRAG' [TECHNOLOGY] within 1 hop(s):\n Hop 1 — 39 related entities:\n [CONCEPT] knowledge graphs\n ..." } ], "cited_nodes": ["tech_graphrag_0", "concept_knowledgegraph_1"], "elapsed_seconds": 8.4, "created_at": "2026-03-05T10:30:00Z" } } ``` **实现说明(QAService 核心逻辑):** ```python # 将 history 拼接为 LangChain messages 格式 messages = [] for h in request.history: messages.append((h["role"], h["content"])) messages.append(("human", request.question)) # 调用 LangChain create_agent result = agent.invoke({"messages": messages}) # 提取工具调用链(遍历 result["messages"]) tool_calls = [] for msg in result["messages"]: if hasattr(msg, "tool_calls") and msg.tool_calls: for tc in msg.tool_calls: tool_calls.append({"tool": tc["name"], "input": tc["args"], "output": ""}) elif hasattr(msg, "tool_call_id"): # ToolMessage if tool_calls: tool_calls[-1]["output"] = msg.content # 最终答案 answer = result["messages"][-1].content ``` **错误:** `3002` (KG 为空), `4001` (Agent/LLM 调用失败) **注意:** 此接口为同步调用,通常耗时 5-30 秒(取决于 DeepSeek API 响应速度和工具调用次数)。 --- ### D2. 批量查询(异步) ``` POST /api/v1/query/batch Content-Type: application/json ``` **Request Body:** ```json { "questions": [ "What is GraphRAG?", "List all TECHNOLOGY entities in the knowledge graph.", "How does MinerU relate to LangExtract?" ] } ``` | 字段 | 类型 | 必填 | 约束 | 说明 | |------|------|------|------|------| | `questions` | `string[]` | **是** | 最多 20 个 | 问题列表 | **Response 202:** ```json { "code": 0, "msg": "success", "request_id": "...", "data": { "batch_id": "batch_20260305_x1y2", "total": 3, "status": "submitted", "created_at": "2026-03-05T10:30:00Z" } } ``` --- ### D3. 获取批量查询状态与结果 ``` GET /api/v1/query/batch/{batch_id} ``` **Response 200:** ```json { "code": 0, "msg": "success", "request_id": "...", "data": { "batch_id": "batch_20260305_x1y2", "total": 3, "completed": 2, "failed": 0, "status": "running", "results": [ { ...QAResult }, { ...QAResult } ] } } ``` **错误:** `2002` (batch_id 不存在) --- ### D4. 查询历史 ``` GET /api/v1/query/history ``` **Query Params:** | 参数 | 类型 | 默认值 | 说明 | |------|------|--------|------| | `page` | `int` | `1` | 页码 | | `page_size` | `int` | `20` | 每页数量(最大 50) | **Response 200:** ```json { "code": 0, "msg": "success", "request_id": "...", "data": { "total": 50, "page": 1, "page_size": 20, "items": [ ...QAResult[] ] } } ``` **存储说明:** 历史记录以 JSONL 格式持久化到 `jobs/query_history.jsonl`,每行一条 `QAResult`。 --- ## 八、E 组:搜索(3 个端点) ### E1. 实体关键词搜索 ``` GET /api/v1/search/entities ``` **Query Params:** | 参数 | 类型 | 必填 | 说明 | |------|------|------|------| | `q` | `string` | **是** | 关键词(大小写不敏感子串匹配,对应 `agentic_rag_mvp.py: search_entities`) | | `type` | `string` | 否 | 类型过滤(如 `TECHNOLOGY`) | | `limit` | `int` | 否 | 最多返回数量(默认 15,最大 100) | **Response 200:** ```json { "code": 0, "msg": "success", "request_id": "...", "data": { "query": "GraphRAG", "total": 1, "items": [ { "id": "tech_graphrag_0", "name": "GraphRAG", "type": "TECHNOLOGY", "source_doc": "abc12345", "char_start": 0, "char_end": 8, "confidence": "match_exact", "page": 0, "degree": 39 } ] } } ``` **实现(参考 `agentic_rag_mvp.py: search_entities`):** ```python q = query.lower() matches = [data for _, data in G.nodes(data=True) if q in data.get("name", "").lower()] ``` --- ### E2. 图谱路径搜索(两节点间路径) ``` GET /api/v1/search/path ``` **Query Params:** | 参数 | 类型 | 必填 | 说明 | |------|------|------|------| | `from` | `string` | **是** | 起始节点 ID | | `to` | `string` | **是** | 目标节点 ID | | `max_hops` | `int` | 否 | 最大路径长度(默认 3,最大 5) | **Response 200:** ```json { "code": 0, "msg": "success", "request_id": "...", "data": { "from": { "id": "tech_graphrag_0", "name": "GraphRAG", "type": "TECHNOLOGY" }, "to": { "id": "tech_mineru_3", "name": "MinerU", "type": "TECHNOLOGY" }, "max_hops": 3, "paths": [ { "length": 1, "nodes": [ { "id": "tech_graphrag_0", "name": "GraphRAG", "type": "TECHNOLOGY" }, { "id": "tech_mineru_3", "name": "MinerU", "type": "TECHNOLOGY" } ], "edges": [ { "source": "tech_graphrag_0", "target": "tech_mineru_3", "relation": "CO_OCCURS_IN" } ] } ], "total_paths": 1 } } ``` **实现(NetworkX):** ```python paths = list(nx.all_simple_paths(G, from_id, to_id, cutoff=max_hops)) ``` **错误:** `3001` (节点不存在) --- ### E3. 全图关键词搜索(含子图) ``` GET /api/v1/search/graph ``` **Query Params:** | 参数 | 类型 | 必填 | 说明 | |------|------|------|------| | `q` | `string` | **是** | 关键词(大小写不敏感子串匹配) | | `include_neighbors` | `bool` | 否 | 是否返回匹配节点的直接邻居边(默认 `false`) | **Response 200:** ```json { "code": 0, "msg": "success", "request_id": "...", "data": { "query": "retrieval", "matched_nodes": [ { "id": "concept_rag_2", "name": "retrieval-augmented generation", "type": "CONCEPT", "page": 0 } ], "subgraph_edges": [ { "source": "concept_rag_2", "target": "tech_graphrag_0", "relation": "CO_OCCURS_IN" } ] } } ``` --- ## 九、F 组:系统(4 个端点) ### F1. 健康检查 ``` GET /api/v1/health ``` **Response 200:** ```json { "code": 0, "msg": "success", "request_id": "...", "data": { "status": "healthy", "version": "1.0.0", "uptime_seconds": 3600, "components": { "mineru_venv": { "status": "ok", "path": "F:/GraphRAGAgent/mineru_mvp/.venv/Scripts/python.exe", "exists": true }, "langextract_venv": { "status": "ok", "path": "F:/GraphRAGAgent/langextract_src/.venv/Scripts/python.exe", "exists": true }, "deepseek_api": { "status": "ok", "base_url": "https://api.deepseek.com", "key_configured": true }, "storage": { "status": "ok", "kg_nodes_exists": true, "kg_edges_exists": true, "uploads_dir_exists": true } } } } ``` **说明:** 此端点仅检查配置和文件存在性,不发起实际 API 调用(避免消耗 DeepSeek token)。 --- ### F2. 系统统计 ``` GET /api/v1/system/stats ``` **Response 200:** ```json { "code": 0, "msg": "success", "request_id": "...", "data": { "total_documents": 5, "indexed_documents": 4, "failed_documents": 1, "total_nodes": 200, "total_edges": 3900, "type_distribution": { "TECHNOLOGY": 20, "CONCEPT": 180 }, "total_queries": 50, "active_jobs": 1, "storage_used_mb": 12.4 } } ``` --- ### F3. 支持的文件格式列表 ``` GET /api/v1/system/formats ``` **Response 200:** ```json { "code": 0, "msg": "success", "request_id": "...", "data": { "formats": [ { "ext": "pdf", "description": "PDF 文档(文本型/扫描型/混合型)", "max_size_mb": 200, "max_pages": 600, "requires_ocr": false }, { "ext": "docx", "description": "Microsoft Word(新版)", "max_size_mb": 200, "max_pages": 600, "requires_ocr": false }, { "ext": "doc", "description": "Microsoft Word(旧版)", "max_size_mb": 200, "max_pages": 600, "requires_ocr": false }, { "ext": "pptx", "description": "PowerPoint(新版)", "max_size_mb": 200, "max_pages": 600, "requires_ocr": false }, { "ext": "ppt", "description": "PowerPoint(旧版)", "max_size_mb": 200, "max_pages": 600, "requires_ocr": false }, { "ext": "png", "description": "PNG 图片(单页)", "max_size_mb": 200, "max_pages": 1, "requires_ocr": true }, { "ext": "jpg", "description": "JPEG 图片(单页)", "max_size_mb": 200, "max_pages": 1, "requires_ocr": true }, { "ext": "jpeg", "description": "JPEG 图片(单页)", "max_size_mb": 200, "max_pages": 1, "requires_ocr": true }, { "ext": "html", "description": "HTML 文件(需指定 model_version=MinerU-HTML)", "max_size_mb": 200, "max_pages": 600, "requires_ocr": false } ], "ocr_languages": [ { "code": "ch", "name": "中文(默认)" }, { "code": "en", "name": "英文" }, { "code": "japan", "name": "日文" }, { "code": "korean", "name": "韩文" }, { "code": "french", "name": "法文" }, { "code": "german", "name": "德文" } ], "notes": [ "language 参数默认值为 'ch'(非 'zh'),遵循 PaddleOCR v3 语言代码规范", "上传时不需要携带 Content-Type: application/pdf 等,服务端自动识别", "PNG/JPG/JPEG 单次最多处理 1 页(图片文件视为单页文档)" ] } } ``` --- ### F4. Demo 数据(快速预览) ``` GET /api/v1/system/demo ``` **说明:** 返回现有 `output/kg_nodes.json` + `output/kg_edges.json` 数据,无需上传 PDF 即可预览 KG 可视化效果。与旧版 `GET /api/demo`(Flask web_server.py)兼容。 **Response 200:** ```json { "code": 0, "msg": "success", "request_id": "...", "data": { "nodes": [ ...KGNode[] ], "edges": [ ...KGEdge[] ], "stats": { "nodes": 40, "edges": 780, "type_counts": { "TECHNOLOGY": 4, "CONCEPT": 36 }, "density": 1.0000 } } } ``` **错误:** `3002` (demo 数据文件不存在,需先运行 bridge.py 生成) --- ## 十、文件格式支持矩阵 | 格式 | 扩展名 | 最大体积 | 最大页数 | OCR | MinerU model_version | 说明 | |------|--------|---------|---------|-----|----------------------|------| | PDF | `.pdf` | 200MB | 600 页 | 可选 | `pipeline`(默认) | 核心能力,文本型/扫描型/混合型均支持 | | Word(新) | `.docx` | 200MB | 600 页 | 可选 | `pipeline` | | | Word(旧) | `.doc` | 200MB | 600 页 | 可选 | `pipeline` | | | PPT(新) | `.pptx` | 200MB | 600 页 | 可选 | `pipeline` | | | PPT(旧) | `.ppt` | 200MB | 600 页 | 可选 | `pipeline` | | | PNG 图片 | `.png` | 200MB | 1 页 | 必须 | `pipeline` | EXIF 方向自动校正 | | JPEG 图片 | `.jpg` | 200MB | 1 页 | 必须 | `pipeline` | EXIF 方向自动校正 | | JPEG 图片 | `.jpeg` | 200MB | 1 页 | 必须 | `pipeline` | 同 `.jpg` | | HTML | `.html` | 200MB | 600 页 | 否 | `MinerU-HTML` | 必须指定特定 model_version | **MinerU 云端 API 限制(来自 mineru_specification-v1.0.md):** | 约束项 | 限制值 | |--------|--------| | 单文件最大体积 | 200 MB | | 单文件最大页数 | 600 页 | | 批量请求最大文件数 | 200 个 | | 预签名上传 URL 有效期 | 24 小时 | | 云端 API 每日最高优先级额度 | 2,000 页(超出降低优先级) | **服务端验证代码(FastAPI + Pydantic):** ```python ALLOWED_EXTENSIONS = {"pdf", "docx", "doc", "pptx", "ppt", "png", "jpg", "jpeg", "html"} MAX_FILE_SIZE_MB = 200 async def upload_document(file: UploadFile = File(...), ...): ext = Path(file.filename).suffix.lower().lstrip(".") if ext not in ALLOWED_EXTENSIONS: raise HTTPException(400, detail=f"Unsupported format: .{ext}") content = await file.read() size_mb = len(content) / (1024 * 1024) if size_mb > MAX_FILE_SIZE_MB: raise HTTPException(400, detail=f"File size {size_mb:.1f}MB exceeds 200MB limit") ``` --- ## 十一、依赖与运行 ### 安装依赖 ```bash # FastAPI + uvicorn + multipart 文件上传 uv pip install fastapi uvicorn[standard] python-multipart \ --python F:/GraphRAGAgent/langextract_src/.venv/Scripts/python.exe # 已有依赖(无需重复安装) # langextract[all]、langchain、langchain-openai、networkx、python-dotenv、flask、requests ``` ### 启动服务 ```bash # 开发模式(--reload 热重载) F:/GraphRAGAgent/langextract_src/.venv/Scripts/python.exe -m uvicorn \ graphrag_pipeline.api_server:app \ --host 0.0.0.0 --port 8000 --reload # 或直接运行主入口 F:/GraphRAGAgent/langextract_src/.venv/Scripts/python.exe \ F:/GraphRAGAgent/graphrag_pipeline/api_server.py ``` ### API 文档访问 FastAPI 自动生成 OpenAPI 文档,启动后可访问: | 地址 | 说明 | |------|------| | `http://localhost:8000/api/v1/health` | 健康检查(验证服务启动) | | `http://localhost:8000/docs` | Swagger UI(交互式 API 文档) | | `http://localhost:8000/redoc` | ReDoc(只读 API 文档) | | `http://localhost:8000/openapi.json` | OpenAPI JSON Schema | ### 端口说明 | 服务 | 端口 | 说明 | |------|------|------| | **FastAPI(新)** | `8000` | 本规范描述的生产级 API | | Flask web_server.py(旧) | `5000` | 原型,保留用于对比 |