feat: add mineru_mvp document parsing component and adapt paths for Linux
mineru_mvp/CLAUDE.md
@@ -0,0 +1,64 @@
# MinerU MVP: Document Parsing Component

## Path

```
GraphRAGAgent/mineru_mvp/
```

## Functionality

Parses PDF, DOCX, and similar documents into structured JSON (`content_list.json`) through the MinerU Cloud API, for consumption by the backend indexing pipeline.

## Installation

```bash
cd mineru_mvp
uv venv --python 3.12
source .venv/bin/activate      # Linux / macOS
# .venv\Scripts\activate       # Windows
uv pip install -r requirements.txt
```

## Configuration

Copy `.env.example` to `.env` and fill in your MinerU API token:

```env
MINERU_API_TOKEN=your_token_here
```

Get a token at: https://mineru.net/apiManage/token
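
How `pipeline.py` actually reads the token is not shown here. As a minimal sketch, assuming the token may come either from the process environment or from a `.env` file in the working directory, a loader could look like the following. The function name `load_api_token` is hypothetical.

```python
import os
from pathlib import Path


def load_api_token(env_path: str = ".env") -> str:
    """Return MINERU_API_TOKEN from the environment, falling back to a .env file.

    Hypothetical helper; not necessarily how pipeline.py does it.
    """
    token = os.environ.get("MINERU_API_TOKEN", "")
    path = Path(env_path)
    if not token and path.is_file():
        for raw in path.read_text(encoding="utf-8").splitlines():
            line = raw.strip()
            # Skip blank lines, comments, and lines without an assignment.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            if key.strip() == "MINERU_API_TOKEN":
                token = value.strip().strip('"').strip("'")
                break
    # Reject a missing token and the placeholder copied from .env.example.
    if not token or token == "your_token_here":
        raise RuntimeError(
            "MINERU_API_TOKEN is not set; copy .env.example to .env and fill it in"
        )
    return token
```

Failing early on the placeholder value avoids sending an obviously invalid token to the API.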

## Usage

```bash
# After activating the venv (or point directly at the interpreter):
python pipeline.py /path/to/document.pdf

# Or invoked by the backend via subprocess:
/path/to/mineru_mvp/.venv/bin/python /path/to/mineru_mvp/pipeline.py /path/to/document.pdf
```
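
The subprocess invocation above could be wrapped on the backend side roughly as follows. This is a sketch under assumptions: the helper names `build_command` and `run_pipeline` are hypothetical, the working directory is set to `mineru_mvp/` on the guess that `output/` and `.env` are resolved relative to it, and `check=True` assumes `pipeline.py` exits non-zero on failure.

```python
import subprocess
from pathlib import Path


def build_command(mineru_dir: Path, document: Path) -> list[str]:
    """Build the argv for the component's own venv interpreter (Linux/macOS layout)."""
    return [
        str(mineru_dir / ".venv" / "bin" / "python"),
        str(mineru_dir / "pipeline.py"),
        str(document),
    ]


def run_pipeline(mineru_dir: Path, document: Path, timeout_s: float = 600.0) -> str:
    """Run pipeline.py on one document and return its stdout summary."""
    result = subprocess.run(
        build_command(mineru_dir, document),
        cwd=mineru_dir,        # assumption: relative paths resolve from here
        capture_output=True,
        text=True,
        timeout=timeout_s,     # assumption: 10 minutes is a generous upper bound
        check=True,            # raises CalledProcessError on a non-zero exit
    )
    return result.stdout
```

Calling the venv interpreter by absolute path means the backend does not need to activate the environment or share its dependencies.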

## Output

Parsed results are written to the `output/{filename}/` directory, named after the input file's stem:

```
output/
└── {pdf_stem}/
    ├── {uuid}_content_list.json   ← core artifact, read by the backend
    ├── full.md
    ├── {uuid}_origin.pdf
    ├── layout.json
    └── images/
        └── {hash}.jpg
```
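
Because the core artifact carries a `{uuid}` prefix that the caller does not know in advance, the backend has to locate it by pattern. A minimal sketch, assuming exactly the layout above; the function name `load_content_list` is hypothetical, and the JSON is returned as parsed since its schema is not described here:

```python
import json
from pathlib import Path
from typing import Any


def load_content_list(output_root: Path, pdf_stem: str) -> Any:
    """Find and parse {uuid}_content_list.json under output/{pdf_stem}/."""
    doc_dir = output_root / pdf_stem
    matches = list(doc_dir.glob("*_content_list.json"))
    if not matches:
        raise FileNotFoundError(f"no *_content_list.json found in {doc_dir}")
    # If earlier runs left several files behind, prefer the most recent one.
    newest = max(matches, key=lambda p: p.stat().st_mtime)
    with newest.open(encoding="utf-8") as fh:
        return json.load(fh)
```

Usage would be along the lines of `load_content_list(Path("output"), "document")` for an input named `document.pdf`.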

## Pipeline Steps

1. POST `/file-urls/batch`: obtain a presigned upload URL
2. PUT the file to the presigned URL (without a `Content-Type` header)
3. Poll GET `/extract-results/batch/{batch_id}`
4. Download the ZIP and extract it to `output/`
5. Print a summary to stdout
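
Steps 3 and 4 can be sketched without tying the code to a particular HTTP client. The example below is illustrative only: `poll_until_done` and `extract_zip` are hypothetical names, the state strings `"done"` and `"failed"` are assumptions about what the polling endpoint reports, and the caller is expected to supply `fetch_state` (a function that performs the GET and returns the current state) and the downloaded ZIP bytes.

```python
import io
import time
import zipfile
from pathlib import Path
from typing import Callable


def poll_until_done(
    fetch_state: Callable[[], str],
    interval_s: float = 5.0,
    max_attempts: int = 120,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Call fetch_state repeatedly until it reports completion or failure."""
    for _ in range(max_attempts):
        state = fetch_state()
        if state == "done":        # assumed terminal success state
            return
        if state == "failed":      # assumed terminal failure state
            raise RuntimeError("extraction failed")
        sleep(interval_s)
    raise TimeoutError(f"extraction not finished after {max_attempts} polls")


def extract_zip(zip_bytes: bytes, dest: Path) -> list[str]:
    """Unpack the downloaded result archive into dest and list its members."""
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        zf.extractall(dest)
        return zf.namelist()
```

Bounding the number of polls keeps a stuck batch from blocking the caller indefinitely, and injecting `sleep` makes the loop easy to exercise in tests.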