GraphRAG Studio — initial commit: multimodal RAG system with KG visualization

Full-stack application for a document-to-knowledge-graph pipeline:
- Backend: FastAPI + LangGraph ReAct agent + DeepSeek + MinerU parsing
- Frontend: React 19 + Vite + D3.js + shadcn/ui
- Pipeline: MinerU parsing → LangExtract entity extraction → KG building

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-07 17:30:04 +08:00
commit b02d3378fc
127 changed files with 37218 additions and 0 deletions
--- a/backend/output/8456b615_sample_graphrag_overview/99c9be1f-bba4-4a58-824b-7331d50db9bb_content_list.json
+++ b/backend/output/8456b615_sample_graphrag_overview/99c9be1f-bba4-4a58-824b-7331d50db9bb_content_list.json
@@ -0,0 +1,367 @@
+[
+    {
+        "type": "text",
+        "text": "GraphRAG System ",
+        "text_level": 1,
+        "bbox": [
+            344,
+            175,
+            655,
+            204
+        ],
+        "page_idx": 0
+    },
+    {
+        "type": "text",
+        "text": "Technical Architecture Overview ",
+        "bbox": [
+            289,
+            234,
+            710,
+            254
+        ],
+        "page_idx": 0
+    },
+    {
+        "type": "text",
+        "text": "Version 1.0 | March 2026 ",
+        "bbox": [
+            364,
+            272,
+            633,
+            290
+        ],
+        "page_idx": 0
+    },
+    {
+        "type": "text",
+        "text": "1. Abstract ",
+        "text_level": 1,
+        "bbox": [
+            52,
+            42,
+            200,
+            61
+        ],
+        "page_idx": 1
+    },
+    {
+        "type": "text",
+        "text": "This document presents the technical architecture of a Multimodal GraphRAG System designed for intelligent document parsing and knowledge graph construction. The system integrates MinerU for document parsing, LangExtract for structured entity extraction, and a graph database for knowledge storage and retrieval. ",
+        "bbox": [
+            48,
+            83,
+            951,
+            171
+        ],
+        "page_idx": 1
+    },
+    {
+        "type": "text",
+        "text": "The pipeline supports multiple document formats including PDF, DOCX, PPTX, and image files. Extracted entities and relations are stored as graph nodes and edges, enabling semantic search and question answering over large document collections. ",
+        "bbox": [
+            48,
+            200,
+            949,
+            265
+        ],
+        "page_idx": 1
+    },
+    {
+        "type": "text",
+        "text": "2. System Components ",
+        "text_level": 1,
+        "bbox": [
+            50,
+            299,
+            321,
+            318
+        ],
+        "page_idx": 1
+    },
+    {
+        "type": "text",
+        "text": "2.1 Document Parsing Module ",
+        "text_level": 1,
+        "bbox": [
+            50,
+            343,
+            349,
+            361
+        ],
+        "page_idx": 1
+    },
+    {
+        "type": "text",
+        "text": "MinerU Cloud API (v4) serves as the document parsing backend. It accepts PDF, DOCX, PPTX, PNG, JPG, and HTML files. Output includes Markdown text, structured content_list.json, and extracted images. ",
+        "bbox": [
+            48,
+            373,
+            951,
+            436
+        ],
+        "page_idx": 1
+    },
+    {
+        "type": "text",
+        "text": "2.2 Entity Extraction Module ",
+        "text_level": 1,
+        "bbox": [
+            50,
+            461,
+            357,
+            479
+        ],
+        "page_idx": 1
+    },
+    {
+        "type": "text",
+        "text": "LangExtract (v1.1.1) performs structured information extraction from plain text using few-shot prompting with LLM backends (Gemini, OpenAI, or local Ollama). Each extraction includes character-level position anchoring. ",
+        "bbox": [
+            48,
+            492,
+            949,
+            555
+        ],
+        "page_idx": 1
+    },
+    {
+        "type": "text",
+        "text": "2.3 Knowledge Graph Module ",
+        "text_level": 1,
+        "bbox": [
+            50,
+            580,
+            337,
+            596
+        ],
+        "page_idx": 1
+    },
+    {
+        "type": "text",
+        "text": "Extracted entities and relationships are stored in a graph database. Node types include: Person, Organization, Location, Event, Concept. Edge types include: RELATED_TO, BELONGS_TO, CAUSED_BY, LOCATED_IN. ",
+        "bbox": [
+            48,
+            608,
+            949,
+            674
+        ],
+        "page_idx": 1
+    },
+    {
+        "type": "text",
+        "text": "2.4 Retrieval Module ",
+        "text_level": 1,
+        "bbox": [
+            50,
+            697,
+            272,
+            715
+        ],
+        "page_idx": 1
+    },
+    {
+        "type": "text",
+        "text": "The retrieval layer supports hybrid search combining vector similarity and graph traversal.   \nQuery results are ranked by relevance score and returned with source document references. ",
+        "bbox": [
+            48,
+            727,
+            944,
+            766
+        ],
+        "page_idx": 1
+    },
+    {
+        "type": "text",
+        "text": "3. Data Pipeline ",
+        "text_level": 1,
+        "bbox": [
+            50,
+            42,
+            268,
+            61
+        ],
+        "page_idx": 2
+    },
+    {
+        "type": "text",
+        "text": "The end-to-end data pipeline consists of the following stages: ",
+        "bbox": [
+            50,
+            83,
+            623,
+            99
+        ],
+        "page_idx": 2
+    },
+    {
+        "type": "text",
+        "text": "Stage 1: Document Ingestion ",
+        "bbox": [
+            68,
+            130,
+            322,
+            146
+        ],
+        "page_idx": 2
+    },
+    {
+        "type": "text",
+        "text": "- Accept raw documents (PDF, DOCX, images, HTML)\n- Submit to MinerU API for parsing\n- Poll task status until state == done",
+        "bbox": [
+            85,
+            153,
+            531,
+            217
+        ],
+        "page_idx": 2
+    },
+    {
+        "type": "text",
+        "text": "Stage 2: Content Extraction ",
+        "bbox": [
+            68,
+            249,
+            322,
+            263
+        ],
+        "page_idx": 2
+    },
+    {
+        "type": "text",
+        "text": "- Download and decompress full_zip_url\n- Parse content_list.json into Document objects\n- Separate text blocks, tables, images, equations",
+        "bbox": [
+            85,
+            272,
+            542,
+            335
+        ],
+        "page_idx": 2
+    },
+    {
+        "type": "text",
+        "text": "Stage 3: Entity & Relation Extraction ",
+        "bbox": [
+            67,
+            367,
+            415,
+            381
+        ],
+        "page_idx": 2
+    },
+    {
+        "type": "text",
+        "text": "- Feed text blocks to LangExtract\n- Extract entities with char_interval positions\n- Extract relationships between entities",
+        "bbox": [
+            85,
+            390,
+            526,
+            454
+        ],
+        "page_idx": 2
+    },
+    {
+        "type": "text",
+        "text": "Stage 4: Graph Construction ",
+        "bbox": [
+            68,
+            485,
+            322,
+            500
+        ],
+        "page_idx": 2
+    },
+    {
+        "type": "text",
+        "text": "- Map extractions to graph nodes and edges\n- Store with source provenance (page_idx, bbox)\n- Build vector embeddings for semantic search",
+        "bbox": [
+            85,
+            508,
+            522,
+            571
+        ],
+        "page_idx": 2
+    },
+    {
+        "type": "text",
+        "text": "4. Supported File Formats ",
+        "text_level": 1,
+        "bbox": [
+            50,
+            604,
+            326,
+            620
+        ],
+        "page_idx": 2
+    },
+    {
+        "type": "table",
+        "img_path": "images/1ed7aacecd20fecef8dc27ee2fe76dc1ae7fa93c44f7d10878d17a41f21a6bef.jpg",
+        "table_caption": [],
+        "table_footnote": [],
+        "table_body": "<table><tr><td rowspan=1 colspan=1>Format</td><td rowspan=1 colspan=1>Extension</td><td rowspan=1 colspan=1>OCR Required</td><td rowspan=1 colspan=1>Model</td></tr><tr><td rowspan=1 colspan=1>PDF (text)</td><td rowspan=1 colspan=1>.pdf</td><td rowspan=1 colspan=1>No</td><td rowspan=1 colspan=1>pipeline / vlm</td></tr><tr><td rowspan=1 colspan=1>PDF (scan)</td><td rowspan=1 colspan=1>.pdf</td><td rowspan=1 colspan=1>Yes</td><td rowspan=1 colspan=1>vlm</td></tr><tr><td rowspan=1 colspan=1>Word</td><td rowspan=1 colspan=1>.docx</td><td rowspan=1 colspan=1>No</td><td rowspan=1 colspan=1>pipeline</td></tr><tr><td rowspan=1 colspan=1>PowerPoint</td><td rowspan=1 colspan=1>.pptx</td><td rowspan=1 colspan=1>No</td><td rowspan=1 colspan=1>pipeline</td></tr><tr><td rowspan=1 colspan=1>Image</td><td rowspan=1 colspan=1>.png / .jpg</td><td rowspan=1 colspan=1>Auto</td><td rowspan=1 colspan=1>vlm</td></tr><tr><td rowspan=1 colspan=1>HTML</td><td rowspan=1 colspan=1>.html</td><td rowspan=1 colspan=1>No</td><td rowspan=1 colspan=1>MinerU-HTML</td></tr></table>",
+        "bbox": [
+            45,
+            634,
+            882,
+            806
+        ],
+        "page_idx": 2
+    },
+    {
+        "type": "text",
+        "text": "5. API Configuration Reference ",
+        "text_level": 1,
+        "bbox": [
+            48,
+            42,
+            457,
+            63
+        ],
+        "page_idx": 3
+    },
+    {
+        "type": "text",
+        "text": "The following environment variables must be configured before running the MinerU parsing service: ",
+        "bbox": [
+            48,
+            83,
+            952,
+            123
+        ],
+        "page_idx": 3
+    },
+    {
+        "type": "text",
+        "text": "MINERU_API_TOKEN : Bearer token for API authentication   \nMINERU_USER_UID : User UUID for quota management   \nMINERU_BASE_URL : https://mineru.net/api/v4   \nMINERU_MODEL_VERSION : pipeline (default) | vlm | MinerU-HTML   \nMINERU_LANGUAGE : ch (Chinese) | en (English)   \nMINERU_IS_OCR : false (text PDF) | true (scanned PDF)   \nMINERU_ENABLE_FORMULA: true | false   \nMINERU_ENABLE_TABLE : true | false ",
+        "bbox": [
+            65,
+            152,
+            636,
+            337
+        ],
+        "page_idx": 3
+    },
+    {
+        "type": "text",
+        "text": "Rate Limits: ",
+        "bbox": [
+            48,
+            367,
+            161,
+            381
+        ],
+        "page_idx": 3
+    },
+    {
+        "type": "text",
+        "text": "- Max file size : 200 MB per file\n- Max pages : 600 pages per file\n- Daily quota : 2000 pages (high priority)\n- Batch limit : 200 files per request",
+        "bbox": [
+            65,
+            388,
+            504,
+            478
+        ],
+        "page_idx": 3
+    }
+]
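
The content_list.json above is a flat array of blocks, each carrying a `type`, a `page_idx`, and a `bbox`. As a rough illustration of how such a file could be read into typed objects while keeping that provenance, here is a minimal sketch. The `Block` class and the `parse_content_list` and `group_by_type` helpers are hypothetical names chosen for this example; they are not taken from the repository, and the real pipeline may structure this differently.

```python
import json
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Block:
    """One content_list.json entry plus its provenance (hypothetical type)."""
    type: str
    page_idx: int
    bbox: Tuple[int, ...]
    text: str = ""                    # present on text blocks
    table_body: str = ""              # present on table blocks (HTML string)
    text_level: Optional[int] = None  # present on heading blocks only


def parse_content_list(raw: str) -> List[Block]:
    """Turn the JSON array into Block objects, keeping page_idx and bbox."""
    blocks = []
    for item in json.loads(raw):
        blocks.append(
            Block(
                type=item["type"],
                page_idx=item["page_idx"],
                bbox=tuple(item["bbox"]),
                # Text values in the sample end with a space; strip it here.
                text=item.get("text", "").strip(),
                table_body=item.get("table_body", ""),
                text_level=item.get("text_level"),
            )
        )
    return blocks


def group_by_type(blocks: List[Block]) -> Dict[str, List[Block]]:
    """Separate blocks by their type field, preserving document order."""
    grouped: Dict[str, List[Block]] = {}
    for block in blocks:
        grouped.setdefault(block.type, []).append(block)
    return grouped


# Three entries shaped like the ones in the file above (table body shortened).
SAMPLE = """
[
  {"type": "text", "text": "GraphRAG System ", "text_level": 1,
   "bbox": [344, 175, 655, 204], "page_idx": 0},
  {"type": "text", "text": "2.4 Retrieval Module ", "text_level": 1,
   "bbox": [50, 697, 272, 715], "page_idx": 1},
  {"type": "table", "table_body": "<table><tr><td>Format</td></tr></table>",
   "bbox": [45, 634, 882, 806], "page_idx": 2}
]
"""

if __name__ == "__main__":
    for kind, items in group_by_type(parse_content_list(SAMPLE)).items():
        print(kind, [(b.page_idx, b.bbox) for b in items])
```

Keeping `page_idx` and `bbox` on every object is what would later allow graph nodes to point back to their location in the source document.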
--- a/backend/output/8456b615_sample_graphrag_overview/99c9be1f-bba4-4a58-824b-7331d50db9bb_origin.pdf
+++ b/backend/output/8456b615_sample_graphrag_overview/99c9be1f-bba4-4a58-824b-7331d50db9bb_origin.pdf
--- a/backend/output/8456b615_sample_graphrag_overview/full.md
+++ b/backend/output/8456b615_sample_graphrag_overview/full.md
@@ -0,0 +1,71 @@
+# GraphRAG System
+
+Technical Architecture Overview
+
+Version 1.0 | March 2026
+
+# 1. Abstract
+
+This document presents the technical architecture of a Multimodal GraphRAG System designed for intelligent document parsing and knowledge graph construction. The system integrates MinerU for document parsing, LangExtract for structured entity extraction, and a graph database for knowledge storage and retrieval.
+
+The pipeline supports multiple document formats including PDF, DOCX, PPTX, and image files. Extracted entities and relations are stored as graph nodes and edges, enabling semantic search and question answering over large document collections.
+
+# 2. System Components
+
+# 2.1 Document Parsing Module
+
+MinerU Cloud API (v4) serves as the document parsing backend. It accepts PDF, DOCX, PPTX, PNG, JPG, and HTML files. Output includes Markdown text, structured content_list.json, and extracted images.
+
+# 2.2 Entity Extraction Module
+
+LangExtract (v1.1.1) performs structured information extraction from plain text using few-shot prompting with LLM backends (Gemini, OpenAI, or local Ollama). Each extraction includes character-level position anchoring.
+
+# 2.3 Knowledge Graph Module
+
+Extracted entities and relationships are stored in a graph database. Node types include: Person, Organization, Location, Event, Concept. Edge types include: RELATED_TO, BELONGS_TO, CAUSED_BY, LOCATED_IN.
+
+# 2.4 Retrieval Module
+
+The retrieval layer supports hybrid search combining vector similarity and graph traversal.   
+Query results are ranked by relevance score and returned with source document references.
+
+# 3. Data Pipeline
+
+The end-to-end data pipeline consists of the following stages:
+
+Stage 1: Document Ingestion
+
+- Accept raw documents (PDF, DOCX, images, HTML)
+- Submit to MinerU API for parsing
+- Poll task status until state == done
+
+Stage 2: Content Extraction
+
+- Download and decompress full_zip_url
+- Parse content_list.json into Document objects
+- Separate text blocks, tables, images, equations
+
+Stage 3: Entity & Relation Extraction
+
+- Feed text blocks to LangExtract
+- Extract entities with char_interval positions
+- Extract relationships between entities
+
+Stage 4: Graph Construction
+
+- Map extractions to graph nodes and edges
+- Store with source provenance (page_idx, bbox)
+- Build vector embeddings for semantic search
+
+# 4. Supported File Formats
+
+<table><tr><td rowspan=1 colspan=1>Format</td><td rowspan=1 colspan=1>Extension</td><td rowspan=1 colspan=1>OCR Required</td><td rowspan=1 colspan=1>Model</td></tr><tr><td rowspan=1 colspan=1>PDF (text)</td><td rowspan=1 colspan=1>.pdf</td><td rowspan=1 colspan=1>No</td><td rowspan=1 colspan=1>pipeline / vlm</td></tr><tr><td rowspan=1 colspan=1>PDF (scan)</td><td rowspan=1 colspan=1>.pdf</td><td rowspan=1 colspan=1>Yes</td><td rowspan=1 colspan=1>vlm</td></tr><tr><td rowspan=1 colspan=1>Word</td><td rowspan=1 colspan=1>.docx</td><td rowspan=1 colspan=1>No</td><td rowspan=1 colspan=1>pipeline</td></tr><tr><td rowspan=1 colspan=1>PowerPoint</td><td rowspan=1 colspan=1>.pptx</td><td rowspan=1 colspan=1>No</td><td rowspan=1 colspan=1>pipeline</td></tr><tr><td rowspan=1 colspan=1>Image</td><td rowspan=1 colspan=1>.png / .jpg</td><td rowspan=1 colspan=1>Auto</td><td rowspan=1 colspan=1>vlm</td></tr><tr><td rowspan=1 colspan=1>HTML</td><td rowspan=1 colspan=1>.html</td><td rowspan=1 colspan=1>No</td><td rowspan=1 colspan=1>MinerU-HTML</td></tr></table>
+
+# 5. API Configuration Reference
+
+The following environment variables must be configured before running the MinerU parsing service:
+
+MINERU_API_TOKEN : Bearer token for API authentication   
+MINERU_USER_UID : User UUID for quota management   
+MINERU_BASE_URL : https://mineru.net/api/v4   
+MINERU_MODEL_VERSION : pipeline (default) | vlm | MinerU-HTML   
+MINERU_LANGUAGE : ch (Chinese) | en (English)   
+MINERU_IS_OCR : false (text PDF) | true (scanned PDF)   
+MINERU_ENABLE_FORMULA: true | false   
+MINERU_ENABLE_TABLE : true | false
+
+Rate Limits:
+
+- Max file size : 200 MB per file
+- Max pages : 600 pages per file
+- Daily quota : 2000 pages (high priority)
+- Batch limit : 200 files per request
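
The environment variables listed in section 5 of full.md could be validated at startup along the following lines. This is only a sketch: the `MinerUSettings` class and its helpers are hypothetical names, not code from this commit. The allowed values come from the reference above; falling back to the documented base URL and to `pipeline` when those two variables are unset is an assumption of this example.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional, Sequence

# Allowed values as listed in the configuration reference.
ALLOWED_MODEL_VERSIONS = ("pipeline", "vlm", "MinerU-HTML")
ALLOWED_LANGUAGES = ("ch", "en")
DEFAULT_BASE_URL = "https://mineru.net/api/v4"


def _require(env: Mapping[str, str], name: str) -> str:
    """Return a non-empty variable or fail with its name."""
    value = env.get(name, "").strip()
    if not value:
        raise ValueError(f"{name} is not set")
    return value


def _flag(env: Mapping[str, str], name: str) -> bool:
    """Parse a 'true' / 'false' variable."""
    value = _require(env, name).lower()
    if value not in ("true", "false"):
        raise ValueError(f"{name} must be 'true' or 'false', got {value!r}")
    return value == "true"


def _choice(name: str, value: str, allowed: Sequence[str]) -> str:
    """Check a variable against its documented set of values."""
    if value not in allowed:
        raise ValueError(f"{name} must be one of {list(allowed)}, got {value!r}")
    return value


@dataclass(frozen=True)
class MinerUSettings:
    """Hypothetical settings holder for the MinerU parsing service."""
    api_token: str
    user_uid: str
    base_url: str
    model_version: str
    language: str
    is_ocr: bool
    enable_formula: bool
    enable_table: bool

    @classmethod
    def from_env(cls, env: Optional[Mapping[str, str]] = None) -> "MinerUSettings":
        env = os.environ if env is None else env
        return cls(
            api_token=_require(env, "MINERU_API_TOKEN"),
            user_uid=_require(env, "MINERU_USER_UID"),
            # Assumption: fall back to the documented URL when unset.
            base_url=env.get("MINERU_BASE_URL", DEFAULT_BASE_URL).rstrip("/"),
            model_version=_choice(
                "MINERU_MODEL_VERSION",
                env.get("MINERU_MODEL_VERSION", "pipeline"),
                ALLOWED_MODEL_VERSIONS,
            ),
            language=_choice(
                "MINERU_LANGUAGE", _require(env, "MINERU_LANGUAGE"), ALLOWED_LANGUAGES
            ),
            is_ocr=_flag(env, "MINERU_IS_OCR"),
            enable_formula=_flag(env, "MINERU_ENABLE_FORMULA"),
            enable_table=_flag(env, "MINERU_ENABLE_TABLE"),
        )
```

Passing a plain dictionary instead of `os.environ` keeps the loader easy to exercise in tests, for example `MinerUSettings.from_env({"MINERU_API_TOKEN": "...", ...})`.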
--- a/backend/output/8456b615_sample_graphrag_overview/images/1ed7aacecd20fecef8dc27ee2fe76dc1ae7fa93c44f7d10878d17a41f21a6bef.jpg
+++ b/backend/output/8456b615_sample_graphrag_overview/images/1ed7aacecd20fecef8dc27ee2fe76dc1ae7fa93c44f7d10878d17a41f21a6bef.jpg
--- a/backend/output/8456b615_sample_graphrag_overview/layout.json
+++ b/backend/output/8456b615_sample_graphrag_overview/layout.json
--- a/backend/output/8456b615_sample_graphrag_overview/parse_summary.json
+++ b/backend/output/8456b615_sample_graphrag_overview/parse_summary.json
@@ -0,0 +1,10 @@
+{
+  "total_blocks": 32,
+  "type_distribution": {
+    "text": 31,
+    "table": 1
+  },
+  "total_pages": 4,
+  "text_block_count": 31,
+  "table_block_count": 1
+}
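
The figures in parse_summary.json are consistent with the content_list.json in this commit: 32 blocks, of which 31 are text and 1 is a table, spread over pages 0 to 3. A summary of this shape could be derived roughly as follows. The `summarize` function is a hypothetical helper, and counting pages as the number of distinct `page_idx` values is an assumption; the code that actually produced the file is not shown in this part of the diff.

```python
import json
from collections import Counter
from typing import Any, Dict, List


def summarize(blocks: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Build a parse_summary-style dict from content_list.json entries."""
    types = Counter(block["type"] for block in blocks)
    # Assumption: a page counts if at least one block was found on it.
    pages = {block["page_idx"] for block in blocks}
    return {
        "total_blocks": len(blocks),
        "type_distribution": dict(types),
        "total_pages": len(pages),
        "text_block_count": types.get("text", 0),
        "table_block_count": types.get("table", 0),
    }


if __name__ == "__main__":
    # A reduced stand-in for the real content list.
    sample = [
        {"type": "text", "page_idx": 0},
        {"type": "text", "page_idx": 1},
        {"type": "table", "page_idx": 2},
    ]
    print(json.dumps(summarize(sample), indent=2))
```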