# MinerU MVP: Document Parsing Component

## Path

```
GraphRAGAgent/mineru_mvp/
```

## Features

Parses PDF, DOCX, and similar documents into structured JSON (`content_list.json`) via the MinerU Cloud API, for consumption by the backend indexing pipeline.

## Installation

```bash
cd mineru_mvp
uv venv --python 3.12
source .venv/bin/activate  # Linux / macOS
# .venv\Scripts\activate   # Windows
uv pip install -r requirements.txt
```

## Configuration

Copy `.env.example` to `.env` and fill in your MinerU API token:

```env
MINERU_API_TOKEN=your_token_here
```

Get a token at: https://mineru.net/apiManage/token

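As a minimal sketch of how the token might be read at startup, the snippet below checks the process environment first and then falls back to parsing `.env`. The helper names `load_env_file` and `get_token` are hypothetical, and how `pipeline.py` actually loads its configuration is not documented here.

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env") -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw in env_path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_token(path: str = ".env") -> str:
    """Return MINERU_API_TOKEN from the environment, else from the .env file."""
    token = os.environ.get("MINERU_API_TOKEN") or load_env_file(path).get(
        "MINERU_API_TOKEN", ""
    )
    # Treat the placeholder from .env.example as "not configured".
    if not token or token == "your_token_here":
        raise RuntimeError("MINERU_API_TOKEN is not set; see .env.example")
    return token
```

Failing early on the placeholder value avoids sending a request that the API would reject anyway.
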
## Usage

```bash
# After activating the venv (or specify the interpreter path directly):
python pipeline.py /path/to/document.pdf

# Or invoked by the backend via subprocess:
/path/to/mineru_mvp/.venv/bin/python /path/to/mineru_mvp/pipeline.py /path/to/document.pdf
```

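The subprocess call from the backend could look roughly like the sketch below. The function names, the timeout, and the choice to run with `cwd` set to the component directory (so that a relative `output/` and `.env` resolve there) are assumptions, not documented behavior of the backend.

```python
import subprocess
from pathlib import Path


def build_command(mineru_dir: str, document: str) -> list[str]:
    """Build the argv for the component's own venv interpreter (POSIX layout)."""
    root = Path(mineru_dir)
    return [
        str(root / ".venv" / "bin" / "python"),
        str(root / "pipeline.py"),
        document,
    ]


def run_pipeline(mineru_dir: str, document: str, timeout: float = 600.0) -> str:
    """Run pipeline.py and return its stdout; raises on a non-zero exit code."""
    result = subprocess.run(
        build_command(mineru_dir, document),
        cwd=mineru_dir,        # assumption: relative paths resolve from here
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,
    )
    return result.stdout
```

On Windows the interpreter lives at `.venv\Scripts\python.exe`, so `build_command` would need adjusting there.
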
## Output

Parsed results are written to the `output/{filename}/` directory:

```
output/
└── {pdf_stem}/
    ├── {uuid}_content_list.json   ← core artifact, read by the backend
    ├── full.md
    ├── {uuid}_origin.pdf
    ├── layout.json
    └── images/
        └── {hash}.jpg
```

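Because the `{uuid}` prefix is not known in advance, a consumer probably has to locate the file by its suffix. The sketch below does that with a glob; the helper names are hypothetical, and it assumes only that the file is valid JSON, not any particular schema.

```python
import json
from pathlib import Path


def find_content_list(output_root: str, pdf_stem: str) -> Path:
    """Return the first *_content_list.json under output_root/pdf_stem."""
    doc_dir = Path(output_root) / pdf_stem
    matches = sorted(doc_dir.glob("*_content_list.json"))
    if not matches:
        raise FileNotFoundError(f"no *_content_list.json in {doc_dir}")
    return matches[0]


def load_content_list(output_root: str, pdf_stem: str):
    """Load and return the parsed JSON from the content list file."""
    with find_content_list(output_root, pdf_stem).open(encoding="utf-8") as f:
        return json.load(f)
```

For example, `load_content_list("output", "document")` would read the result produced for `document.pdf`.
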
## Pipeline Steps

1. POST `/file-urls/batch`: obtain a pre-signed upload URL
2. PUT the file to the pre-signed URL (without a `Content-Type` header)
3. Poll GET `/extract-results/batch/{batch_id}`
4. Download the ZIP and extract it to `output/`
5. Print a summary to stdout
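
Steps 3 and 4 can be sketched without committing to a particular HTTP client. In the snippet below, `fetch_status` is a hypothetical callable that performs the GET and returns the decoded status for one file; the `state` values `"done"` and `"failed"` are assumptions about the response shape rather than something this README specifies.

```python
import io
import time
import zipfile
from pathlib import Path
from typing import Callable


def poll_until_done(
    fetch_status: Callable[[], dict],
    interval: float = 5.0,
    max_attempts: int = 120,
) -> dict:
    """Call fetch_status until it reports a terminal state."""
    for _ in range(max_attempts):
        status = fetch_status()
        state = status.get("state")  # assumed field name
        if state == "done":
            return status
        if state == "failed":
            raise RuntimeError(f"extraction failed: {status}")
        time.sleep(interval)
    raise TimeoutError("extraction did not finish in time")


def extract_zip(zip_bytes: bytes, dest: str) -> list[str]:
    """Unpack the downloaded ZIP into dest and return the member names."""
    target = Path(dest)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        archive.extractall(target)
        return archive.namelist()
```

Injecting `fetch_status` keeps the polling loop testable offline and leaves the choice of HTTP library, headers, and retry policy to the caller.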