1 | | -# ParseFlow |
2 | | - |
3 | | -<div align="center"> |
4 | | - |
5 | | -**🚀 High-Performance PDF Parsing MCP Server** |
6 | | - |
7 | | -[](https://github.com/Libres-coder/ParseFlow/actions/workflows/ci.yml) |
8 | | -[](LICENSE) |
9 | | -[](https://nodejs.org) |
10 | | -[](https://modelcontextprotocol.io) |
11 | | -[](CHANGELOG.md) |
12 | | - |
13 | | -[中文](README.md) | **English** |
14 | | - |
15 | | -</div> |
16 | | - |
17 | | ---- |
18 | | - |
19 | | -## ⚡ Quick Overview |
20 | | - |
21 | | -> **3 Key Features** |
22 | | - |
23 | | -✅ **Automatic Recognition** - No manual tool selection, AI automatically calls PDF parsing functions |
24 | | -✅ **Dynamic Path Passing** - No hardcoding, specify different PDF files each time |
25 | | -✅ **Local Deployment** - Deploy locally via configuration files, full data control |
26 | | - |
27 | | -**Usage Example**: |
28 | | - |
29 | | -``` |
30 | | -In Windsurf, simply say: |
31 | | -"Analyze D:\report.pdf" |
32 | | -"How many pages does this PDF have?" |
33 | | -"Search for 'breach of contract' in the contract" |
34 | | -``` |
35 | | - |
36 | | ---- |
37 | | - |
38 | | -## 📖 Overview |
39 | | - |
40 | | -ParseFlow is a high-performance **MCP (Model Context Protocol) Server** for PDF document parsing and analysis, designed specifically for AI assistants in **Windsurf** and **Cursor** IDEs. |
41 | | - |
42 | | -### Core Features |
43 | | - |
44 | | -- 📄 **Text Extraction**: Extract PDF text content with pagination and range support ✅ |
45 | | -- 📊 **Metadata Reading**: Get title, author, page count, creation date, etc. ✅ |
46 | | -- 🔍 **Keyword Search**: Search for specific content in PDFs ✅ |
47 | | -- 🖼️ **Image Extraction**: Export images from PDFs (requires poppler-utils) ✅ |
48 | | -- 📑 **Table of Contents**: Extract PDF bookmarks and TOC structure (requires pdftk/pdfinfo) ✅ |
49 | | - |
50 | | -### Technical Features |
51 | | - |
52 | | -- ✅ **MCP Protocol Support**: Standard MCP Tools implementation |
53 | | -- ✅ **TypeScript Development**: Type-safe, maintainable |
54 | | -- ✅ **Monorepo Architecture**: Separate core library and server |
55 | | -- ✅ **Local Deployment**: Data stays local, secure and controllable |
56 | | - |
57 | | ---- |
58 | | - |
59 | | -## 🏗️ Architecture |
60 | | - |
61 | | -``` |
62 | | -┌─────────────────────────────────────┐ |
63 | | -│ Windsurf IDE │ |
64 | | -│ (MCP Client / Cascade) │ |
65 | | -└──────────────┬──────────────────────┘ |
66 | | - │ MCP Protocol (stdio) |
67 | | -┌──────────────▼──────────────────────┐ |
68 | | -│ ParseFlow MCP Server │ |
69 | | -│ ┌─────────────────────────────┐ │ |
70 | | -│ │ MCP Tools │ │ |
71 | | -│ │ • extract_text ✅ │ │ |
72 | | -│ │ • search_pdf ✅ │ │ |
73 | | -│ │ • get_metadata ✅ │ │ |
74 | | -│ │ • extract_images ✅ │ │ |
75 | | -│ │ • get_toc ✅ │ │ |
76 | | -│ └─────────────────────────────┘ │ |
77 | | -└──────────────┬──────────────────────┘ |
78 | | - │ |
79 | | -┌──────────────▼──────────────────────┐ |
80 | | -│ PDF Parser Core Library │ |
81 | | -│ • pdf-parse (text/metadata) │ |
82 | | -│ • pdf-lib (PDF operations) │ |
83 | | -│ • Keyword search engine │ |
84 | | -│ • External tools (optional) │ |
85 | | -│ - pdfimages (image extraction) │ |
86 | | -│ - pdftk/pdfinfo (TOC extraction) │ |
87 | | -└─────────────────────────────────────┘ |
88 | | -``` |
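
The `MCP Protocol (stdio)` link in the diagram carries JSON-RPC 2.0 messages as newline-delimited JSON over stdin/stdout. The following is a minimal sketch of that framing, independent of any SDK; the helper names `frame` and `unframe` are hypothetical and are not taken from the ParseFlow source.

```typescript
// Sketch of newline-delimited JSON-RPC framing as used by the MCP stdio transport.
// `frame` and `unframe` are illustrative names, not ParseFlow APIs.
type JsonRpcMessage = { jsonrpc: "2.0"; [key: string]: unknown };

// Serialize one message as exactly one line.
function frame(message: JsonRpcMessage): string {
  // JSON.stringify without an indent argument never emits raw newlines.
  return JSON.stringify(message) + "\n";
}

// Split a receive buffer into complete messages plus the unfinished tail.
function unframe(buffer: string): { messages: JsonRpcMessage[]; rest: string } {
  const lines = buffer.split("\n");
  // The last element is either "" (buffer ended on a newline) or a partial line.
  const rest = lines.pop() ?? "";
  const messages = lines
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line) as JsonRpcMessage);
  return { messages, rest };
}

const wireData = frame({ jsonrpc: "2.0", id: 1, method: "ping" });
console.log(unframe(wireData).messages);
```

Keeping the unfinished tail in `rest` lets a reader append the next chunk from the stream and call `unframe` again without losing a message that was split across reads.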
89 | | - |
90 | | ---- |
91 | | - |
92 | | -## 🚀 Quick Start |
93 | | - |
94 | | -### Prerequisites |
95 | | - |
96 | | -- Node.js >= 18.0.0 |
97 | | -- pnpm >= 8.0.0 |
98 | | -- Windsurf or Cursor IDE |
99 | | - |
100 | | -### Optional Tools (for Image and TOC Extraction) |
101 | | - |
102 | | -If you need image extraction and table of contents extraction features, please install: |
103 | | - |
104 | | -**Windows**: |
105 | | -- [Poppler](https://github.com/oschwartz10612/poppler-windows/releases) - For image and TOC extraction |
106 | | -- Download and add to system PATH (e.g., `D:\poppler\Library\bin`) |
107 | | - |
108 | | -**Ubuntu/Debian**: |
109 | | -```bash |
110 | | -sudo apt-get install poppler-utils pdftk |
111 | | -``` |
112 | | - |
113 | | -**macOS**: |
114 | | -```bash |
115 | | -brew install poppler pdftk-java |
116 | | -``` |
117 | | - |
118 | | -> 💡 You can still use text extraction, metadata, and search features without installing external tools. See [External Tools Guide](docs/guides/external-tools.md) |
119 | | - |
120 | | -### Installation |
121 | | - |
122 | | -```bash |
123 | | -# 1. Clone repository |
124 | | -git clone https://github.com/Libres-coder/ParseFlow.git |
125 | | -cd ParseFlow |
126 | | - |
127 | | -# 2. Install dependencies |
128 | | -pnpm install |
129 | | - |
130 | | -# 3. Build project |
131 | | -pnpm build |
132 | | -``` |
133 | | - |
134 | | -### Configuration |
135 | | - |
136 | | -Choose your IDE: |
137 | | - |
138 | | -#### Option 1: Windsurf (Recommended) |
139 | | - |
140 | | -```powershell |
141 | | -# Run auto-configuration script |
142 | | -.\scripts\setup-windsurf.ps1 |
143 | | -``` |
144 | | - |
145 | | -**Or configure manually**: |
146 | | - |
147 | | -1. Open Windsurf settings |
148 | | -2. Find MCP Server configuration |
149 | | -3. Add ParseFlow configuration (see [Windsurf Setup Guide](docs/en/setup/windsurf.md)) |
150 | | - |
151 | | -#### Option 2: Cursor |
152 | | - |
153 | | -```powershell |
154 | | -# Run auto-configuration script |
155 | | -.\scripts\setup-cursor.ps1 |
156 | | -``` |
157 | | - |
158 | | -**Or configure manually**: |
159 | | - |
160 | | -1. Edit `C:\Users\<username>\.cursor\mcp.json` |
161 | | -2. Add ParseFlow configuration (see [Cursor Setup Guide](docs/en/setup/cursor.md)) |
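
For orientation, here is a sketch of what a ParseFlow entry in `mcp.json` might look like, generated from TypeScript. The `mcpServers` layout is the shape commonly used by Cursor and Windsurf for stdio servers. The server key `parseflow`, the helper `buildMcpConfig`, and the install location are assumptions, and the entry path is inferred from the `packages/mcp-server/dist/` build output in the project structure. The linked setup guides remain the authoritative source.

```typescript
// Hypothetical helper that builds an MCP client config entry for ParseFlow.
// The "parseflow" key and the dist/index.js entry path are assumptions.
interface StdioServerEntry {
  command: string;
  args: string[];
}

interface McpConfig {
  mcpServers: Record<string, StdioServerEntry>;
}

function buildMcpConfig(repoRoot: string): McpConfig {
  // Use forward slashes so the path needs no escaping inside JSON,
  // and drop any trailing slash before appending the entry path.
  const root = repoRoot.replace(/\\/g, "/").replace(/\/+$/, "");
  return {
    mcpServers: {
      parseflow: {
        command: "node",
        args: [`${root}/packages/mcp-server/dist/index.js`],
      },
    },
  };
}

// Print JSON that could be pasted into the config file after checking the path.
const exampleConfig = buildMcpConfig("D:\\ParseFlow");
console.log(JSON.stringify(exampleConfig, null, 2));
```

Generating the file rather than typing it by hand avoids the most common mistake on Windows, which is an unescaped backslash in a JSON string.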
162 | | - |
163 | | -### Testing |
164 | | - |
165 | | -Restart your IDE and try in the chat: |
166 | | - |
167 | | -**In Windsurf**: |
168 | | - |
169 | | -``` |
170 | | -Extract text from D:\document.pdf |
171 | | -``` |
172 | | - |
173 | | -**In Cursor** (Agent mode): |
174 | | - |
175 | | -``` |
176 | | -Use parseflow tool to extract text from D:\document.pdf |
177 | | -``` |
178 | | - |
179 | | ---- |
180 | | - |
181 | | -## 📚 Documentation |
182 | | - |
183 | | -### 📖 User Guides |
184 | | - |
185 | | -- [Quick Start](docs/en/guides/quick-start.md) - Get started in 5 minutes |
186 | | -- [FAQ](docs/en/guides/faq.md) - Frequently asked questions |
187 | | -- [Examples](docs/en/guides/examples.md) - Code examples and best practices |
188 | | - |
189 | | -### ⚙️ Setup Guides |
190 | | - |
191 | | -- [Windsurf Setup](docs/en/setup/windsurf.md) - Windsurf IDE configuration (Recommended) |
192 | | -- [Cursor Setup](docs/en/setup/cursor.md) - Cursor IDE configuration |
193 | | - |
194 | | -### 🛠️ Development Documentation |
195 | | - |
196 | | -- [API Reference](docs/en/development/api.md) - Complete API documentation |
197 | | -- [Architecture](docs/en/development/architecture.md) - System architecture |
198 | | -- [Development Guide](docs/en/development/development.md) - How to contribute |
199 | | -- [Naming Conventions](docs/en/development/naming-conventions.md) - Code standards |
200 | | - |
201 | | -### 📋 Project Planning |
202 | | - |
203 | | -- [TODO](docs/en/planning/todo.md) - Feature roadmap |
204 | | -- [Distribution Analysis](docs/en/planning/distribution-analysis.md) - Release plans |
205 | | - |
206 | | -### 📂 Documentation Index |
207 | | - |
208 | | -- [Complete Documentation](docs/en/README.md) - Full documentation index |
209 | | - |
210 | | ---- |
211 | | - |
212 | | -## 🎯 Usage Examples |
213 | | - |
214 | | -### Text Extraction |
215 | | - |
216 | | -``` |
217 | | -Q: Extract text from D:\report.pdf |
218 | | -A: [Parsed text content...] |
219 | | -``` |
220 | | - |
221 | | -### Keyword Search |
222 | | - |
223 | | -``` |
224 | | -Q: Search for "contract" in D:\document.pdf |
225 | | -A: Found 3 results: |
226 | | - Page 1: ...contract terms... |
227 | | - Page 3: ...contract signed... |
228 | | - Page 5: ...contract expires... |
229 | | -``` |
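
The following sketch shows one way a search like this could work. It assumes the PDF text is already available as one string per page, and its option names mirror the `caseSensitive?` and `maxResults?` parameters of `search_pdf`. The function `searchPages`, the `SearchHit` type, the default limit, and the snippet width are hypothetical; ParseFlow's actual search engine may differ.

```typescript
// Illustrative page-wise keyword search; not the ParseFlow implementation.
interface SearchHit {
  page: number;    // 1-based page number
  snippet: string; // matched text plus a little surrounding context
}

interface SearchOptions {
  caseSensitive?: boolean;
  maxResults?: number;
}

function searchPages(pages: string[], query: string, options: SearchOptions = {}): SearchHit[] {
  const { caseSensitive = false, maxResults = 10 } = options;
  const hits: SearchHit[] = [];
  if (query.length === 0) return hits;

  // Assumes lowercasing preserves string length, which holds for ASCII text.
  const needle = caseSensitive ? query : query.toLowerCase();
  const context = 20; // characters of context kept on each side of a match

  for (let i = 0; i < pages.length && hits.length < maxResults; i++) {
    const text = pages[i] ?? "";
    const haystack = caseSensitive ? text : text.toLowerCase();
    let from = 0;
    while (hits.length < maxResults) {
      const at = haystack.indexOf(needle, from);
      if (at === -1) break;
      const start = Math.max(0, at - context);
      const end = Math.min(text.length, at + needle.length + context);
      hits.push({ page: i + 1, snippet: text.slice(start, end) });
      from = at + needle.length; // continue after this match
    }
  }
  return hits;
}

// Case-insensitive by default, so this reports hits on pages 1 and 3.
const demoPages = ["The contract terms apply.", "No match here.", "Contract signed in 2025."];
console.log(searchPages(demoPages, "contract"));
```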
230 | | - |
231 | | -### Metadata Retrieval |
232 | | - |
233 | | -``` |
234 | | -Q: What's the author of D:\document.pdf? |
235 | | -A: Author: Unknown, Created: 2025-01-15 |
236 | | -``` |
237 | | - |
238 | | ---- |
239 | | - |
240 | | -## 🛠️ Project Structure |
241 | | - |
242 | | -``` |
243 | | -ParseFlow/ |
244 | | -├── packages/ |
245 | | -│ ├── mcp-server/ # MCP Server |
246 | | -│ │ ├── src/ |
247 | | -│ │ │ ├── index.ts # Entry point |
248 | | -│ │ │ ├── server.ts # MCP Server core |
249 | | -│ │ │ ├── tools/ # MCP tools |
250 | | -│ │ │ ├── resources/ # MCP resources |
251 | | -│ │ │ └── utils/ # Utilities |
252 | | -│ │ └── dist/ # Build output |
253 | | -│ └── pdf-parser-core/ # PDF parsing core |
254 | | -│ ├── src/ |
255 | | -│ │ ├── parser.ts # Main parser |
256 | | -│ │ ├── extractors/ # Text extractors |
257 | | -│ │ ├── search/ # Search functionality |
258 | | -│ │ └── types/ # Type definitions |
259 | | -│ └── dist/ # Build output |
260 | | -├── docs/ # Documentation |
261 | | -│ ├── zh/ # Chinese docs |
262 | | -│ └── en/ # English docs |
263 | | -├── examples/ # Usage examples |
264 | | -├── tests/ # Test files |
265 | | -└── scripts/ # Utility scripts |
266 | | -``` |
267 | | - |
268 | | ---- |
269 | | - |
270 | | -## 🔧 MCP Tools |
271 | | - |
272 | | -ParseFlow provides the following MCP tools: |
273 | | - |
274 | | -| Tool | Description | Parameters | Status | |
275 | | -| ----------------- | ------------------------- | ------------------------------------------------ | ------ | |
276 | | -| `extract_text` | Extract text from PDF | `path`, `page?`, `range?`, `strategy?` | ✅ | |
277 | | -| `get_metadata` | Get PDF metadata | `path` | ✅ | |
278 | | -| `search_pdf` | Search keywords in PDF | `path`, `query`, `caseSensitive?`, `maxResults?` | ✅ | |
279 | | -| `extract_images` | Extract images from PDF | `path`, `outputDir`, `format?` | ✅ | |
280 | | -| `get_toc` | Get table of contents | `path` | ✅ | |
281 | | - |
282 | | -For detailed API documentation, see [API Reference](docs/en/development/api.md) |
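
For context, an MCP client invokes a tool by sending a JSON-RPC 2.0 `tools/call` request whose `params` carry the tool `name` and its `arguments`. The sketch below builds such a request for `extract_text`. The argument names come from the table above, while the helper `buildToolCall`, the request `id`, and the example values are illustrative. The value type of `page` is an assumption; the API reference defines the real schema.

```typescript
// Sketch of an MCP `tools/call` request; `buildToolCall` is a hypothetical helper.
interface ToolCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: {
    name: string;
    arguments: Record<string, unknown>;
  };
}

function buildToolCall(
  id: number,
  toolName: string,
  toolArgs: Record<string, unknown>,
): ToolCallRequest {
  return {
    jsonrpc: "2.0",
    id,
    method: "tools/call",
    params: { name: toolName, arguments: toolArgs },
  };
}

// Example: ask for the text of a single page (a numeric `page` is an assumption).
const extractTextCall = buildToolCall(1, "extract_text", {
  path: "D:\\report.pdf",
  page: 1,
});
console.log(JSON.stringify(extractTextCall));
```

In normal use the IDE's AI assistant builds this request itself from the chat prompt, so the sketch is mainly useful when debugging the server by hand.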
283 | | - |
284 | | ---- |
285 | | - |
286 | | -## 🚀 Future Plans |
287 | | - |
288 | | -### ✅ Completed Features (v1.0.0) |
289 | | - |
290 | | -- ✅ Text extraction |
291 | | -- ✅ Metadata extraction |
292 | | -- ✅ Keyword search |
293 | | -- ✅ Image extraction (external tool integration) |
294 | | -- ✅ Table of contents extraction (external tool integration) |
295 | | - |
296 | | -### High Priority |
297 | | - |
298 | | -#### ⭐ npm Package Release |
299 | | - |
300 | | -Simplify installation and usage! |
301 | | - |
302 | | -**Plans**: |
303 | | - |
304 | | -- ✅ Core functionality complete |
305 | | -- ✅ Documentation complete |
306 | | -- ✅ Testing complete |
307 | | -- 📦 Ready to publish to npm |
308 | | -- 🎯 Submit to official MCP Registry |
309 | | - |
310 | | -**Priority**: ⭐⭐⭐⭐⭐ |
311 | | - |
312 | | -#### ⭐ GitHub Release |
313 | | - |
314 | | -Complete project release |
315 | | - |
316 | | -**Plans**: |
317 | | - |
318 | | -- 📋 Create release notes |
319 | | -- 📦 Package distribution |
320 | | -- 🎉 v1.0.0 release |
321 | | - |
322 | | -**Priority**: ⭐⭐⭐⭐⭐ |
323 | | - |
324 | | -### Medium Priority |
325 | | - |
326 | | -- 🔄 Performance optimization (large file handling) |
327 | | -- 📊 Advanced search features (fuzzy search, regex) |
328 | | -- 🎨 Better error messages and user feedback |
329 | | - |
330 | | -### Future Considerations |
331 | | - |
332 | | -- 📸 OCR support (for scanned documents) |
333 | | -- 🤖 AI-powered document analysis |
334 | | -- 🔄 PDF merge/split functionality |
335 | | -- 🔐 PDF encryption/decryption |
336 | | -- 🌐 More IDE integrations |
337 | | - |
338 | | -**Detailed Roadmap**: [docs/en/planning/todo.md](docs/en/planning/todo.md) |
339 | | - |
340 | | ---- |
341 | | - |
342 | | -## 🤝 Contributing |
343 | | - |
344 | | -Contributions welcome! Please see [CONTRIBUTING.md](CONTRIBUTING.md) |
345 | | - |
346 | | -### Contribution Process |
347 | | - |
348 | | -1. Fork the repository |
349 | | -2. Create your feature branch (`git checkout -b feature/AmazingFeature`) |
350 | | -3. Commit your changes (`git commit -m 'Add AmazingFeature'`) |
351 | | -4. Push to the branch (`git push origin feature/AmazingFeature`) |
352 | | -5. Open a Pull Request |
353 | | - |
354 | | ---- |
355 | | - |
356 | | -## 🐛 Issue Reporting |
357 | | - |
358 | | -If you encounter problems: |
359 | | - |
360 | | -1. Check [docs/en/guides/faq.md](docs/en/guides/faq.md) for common issues |
361 | | -2. Check [logs/parseflow.log](logs/) log file |
362 | | -3. Submit an [Issue](https://github.com/Libres-coder/ParseFlow/issues) |
363 | | - |
364 | | ---- |
365 | | - |
366 | | -## 📄 License |
367 | | - |
368 | | -This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details |
369 | | - |
370 | | ---- |
371 | | - |
372 | | -## 🙏 Acknowledgments |
373 | | - |
374 | | -- [Model Context Protocol](https://modelcontextprotocol.io) - MCP protocol standard |
375 | | -- [pdf-parse](https://www.npmjs.com/package/pdf-parse) - PDF text extraction library |
376 | | -- [pdf-lib](https://www.npmjs.com/package/pdf-lib) - PDF manipulation library |
377 | | -- [Poppler](https://poppler.freedesktop.org/) - PDF rendering library |
378 | | -- Windsurf Community - Testing and feedback |
379 | | - |
380 | | ---- |
381 | | - |
382 | | -## 📮 Resources |
383 | | - |
384 | | -- [MCP Protocol Documentation](https://modelcontextprotocol.io) |
385 | | -- [Windsurf IDE](https://codeium.com/windsurf) |
386 | | -- [Project Documentation](docs/en/) |
387 | | - |
388 | | ---- |
389 | | - |
390 | | -<div align="center"> |
391 | | - |
392 | | -**Made with ❤️ by ParseFlow Team** |
393 | | - |
394 | | -</div> |