# MinerU MVP: Document Parsing Component
## Path
```
GraphRAGAgent/mineru_mvp/
```
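## Functionality
Parses documents such as PDF and DOCX into structured JSON (`content_list.json`) via the MinerU Cloud API, for consumption by the backend indexing pipeline.
## Installation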
```bash
cd mineru_mvp
uv venv --python 3.12
source .venv/bin/activate # Linux / macOS
# .venv\Scripts\activate # Windows
uv pip install -r requirements.txt
```
## Configuration
Copy `.env.example` to `.env` and fill in your MinerU API token:
```env
MINERU_API_TOKEN=your_token_here
```
Get a token at: https://mineru.net/apiManage/token
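
How `pipeline.py` actually reads the token is not documented here. As a minimal sketch, assuming the token may come either from the process environment or from a plain `KEY=VALUE` `.env` file, a loader could look like the following. The function name `load_token` is hypothetical, not part of this component.

```python
import os
from pathlib import Path


def load_token(env_path: str = ".env", key: str = "MINERU_API_TOKEN") -> str:
    """Return the token from the environment, falling back to a .env file.

    Hypothetical helper: the real pipeline.py may load its config differently.
    """
    value = os.environ.get(key, "").strip()
    if value:
        return value

    path = Path(env_path)
    if path.is_file():
        for line in path.read_text(encoding="utf-8").splitlines():
            line = line.strip()
            # Skip blank lines, comments, and anything that is not KEY=VALUE.
            if not line or line.startswith("#") or "=" not in line:
                continue
            name, _, raw = line.partition("=")
            if name.strip() == key:
                value = raw.strip().strip('"').strip("'")
                if value:
                    return value

    raise RuntimeError(f"{key} is not set; copy .env.example to .env and fill it in")
```

Failing early with a clear message keeps a missing token from surfacing later as an opaque HTTP authorization error.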
## Usage
```bash
# After activating the venv (or point directly at the interpreter path):
python pipeline.py /path/to/document.pdf
# Or invoked by the backend via subprocess:
/path/to/mineru_mvp/.venv/bin/python /path/to/mineru_mvp/pipeline.py /path/to/document.pdf
```
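
The subprocess invocation above could be wrapped on the backend side roughly as follows. This is a sketch under assumptions: `build_command` and `run_pipeline` are hypothetical names, the venv layout is the POSIX one from the install steps, and the pipeline is assumed to exit non-zero on failure and to resolve `output/` and `.env` relative to its working directory.

```python
import subprocess
from pathlib import Path


def build_command(document: str, mvp_dir: str) -> list[str]:
    """Build the argv that runs pipeline.py with the component's own venv."""
    mvp = Path(mvp_dir)
    # POSIX venv layout; on Windows this would be .venv\Scripts\python.exe.
    python = mvp / ".venv" / "bin" / "python"
    # Use an absolute document path because the child runs with cwd=mvp_dir.
    return [str(python), str(mvp / "pipeline.py"), str(Path(document).absolute())]


def run_pipeline(document: str, mvp_dir: str, timeout: float = 1800.0) -> str:
    """Run the pipeline and return its stdout summary (assumed behaviour)."""
    result = subprocess.run(
        build_command(document, mvp_dir),
        cwd=mvp_dir,          # assumption: output/ and .env are cwd-relative
        capture_output=True,
        text=True,
        timeout=timeout,
        check=False,
    )
    if result.returncode != 0:
        # Assumption: a non-zero exit code signals a failed parse.
        raise RuntimeError(f"pipeline failed ({result.returncode}): {result.stderr.strip()}")
    return result.stdout
```

Calling the venv's interpreter directly avoids having to activate the environment inside the backend process.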
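## Output
Parsed results are written to the `output/{filename}/` directory, where `{filename}` is the input file name without its extension (`{pdf_stem}` below):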
```
output/
└── {pdf_stem}/
    ├── {uuid}_content_list.json   ← core artifact, read by the backend
    ├── full.md
    ├── {uuid}_origin.pdf
    ├── layout.json
    └── images/
        └── {hash}.jpg
```
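
Because the `{uuid}` prefix is not known in advance, a consumer has to locate the content list by pattern. The sketch below shows one way to do that; `find_content_list` and `load_content_list` are hypothetical helper names, and picking the most recently modified file when several match is an assumption rather than documented behaviour.

```python
import json
from pathlib import Path


def find_content_list(output_root: str, pdf_stem: str) -> Path:
    """Locate {uuid}_content_list.json under output/{pdf_stem}/."""
    doc_dir = Path(output_root) / pdf_stem
    matches = list(doc_dir.glob("*_content_list.json"))
    if not matches:
        raise FileNotFoundError(f"no *_content_list.json in {doc_dir}")
    # Assumption: if a document was parsed more than once, the newest file wins.
    return max(matches, key=lambda p: p.stat().st_mtime)


def load_content_list(output_root: str, pdf_stem: str):
    """Parse the content list; the JSON schema itself is not described here."""
    path = find_content_list(output_root, pdf_stem)
    return json.loads(path.read_text(encoding="utf-8"))
```

For an input `report.pdf`, this would be called as `load_content_list("output", "report")`.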
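## Pipeline Steps
1. POST `/file-urls/batch`: obtain a pre-signed upload URL
2. PUT the file to the pre-signed URL (without a Content-Type header)
3. Poll GET `/extract-results/batch/{batch_id}`
4. Download the ZIP and extract it into `output/`
5. Print a summary to stdout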
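
The polling in step 3 can be sketched as a small loop that is independent of the HTTP client. This is an illustration, not the code in `pipeline.py`: `poll_batch` is a hypothetical name, and the response shape (`data.extract_result[].state` with terminal values `"done"` and `"failed"`) is an assumption that should be checked against the MinerU API documentation.

```python
import time
from typing import Callable


def poll_batch(
    fetch: Callable[[], dict],
    interval: float = 5.0,
    timeout: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> list:
    """Call fetch() until every file in the batch reaches a terminal state.

    fetch is expected to perform GET /extract-results/batch/{batch_id}
    and return the decoded JSON body.
    """
    terminal = {"done", "failed"}  # assumed terminal state names
    deadline = clock() + timeout
    while True:
        payload = fetch()
        # Assumed response shape: {"data": {"extract_result": [{"state": ...}]}}
        results = (payload.get("data") or {}).get("extract_result") or []
        if results and all(r.get("state") in terminal for r in results):
            return results
        if clock() >= deadline:
            raise TimeoutError("batch did not finish before the timeout")
        sleep(interval)
```

Passing `fetch`, `sleep`, and `clock` as parameters keeps the loop testable without network access or real waiting.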