Commit 012e0af

feat: add PowerPoint (pptx) support with extract and search tools
- Add PowerPointParser class using node-pptx-parser library
- Add extract_powerpoint and search_powerpoint MCP tools (11 tools total)
- Update README_EN.md to v1.1.0
- Export PowerPointParser types from parseflow-core
1 parent ff51dcc commit 012e0af

7 files changed: 453 additions and 431 deletions

README_EN.md

Lines changed: 0 additions & 394 deletions
@@ -1,394 +0,0 @@
1-
# ParseFlow
2-
3-
<div align="center">
4-
5-
**🚀 High-Performance PDF Parsing MCP Server**
6-
7-
[![CI](https://github.com/Libres-coder/ParseFlow/actions/workflows/ci.yml/badge.svg)](https://github.com/Libres-coder/ParseFlow/actions/workflows/ci.yml)
8-
[![License](https://img.shields.io/badge/license-MIT-blue.svg)](LICENSE)
9-
[![Node](https://img.shields.io/badge/node-%3E%3D18.0.0-brightgreen.svg)](https://nodejs.org)
10-
[![MCP](https://img.shields.io/badge/MCP-1.0-purple.svg)](https://modelcontextprotocol.io)
11-
[![Version](https://img.shields.io/badge/version-1.0.0-orange.svg)](CHANGELOG.md)
12-
13-
[中文](README.md) | **English**
14-
15-
</div>
16-
17-
---
18-
19-
## ⚡ Quick Overview
20-
21-
> **3 Key Features**
22-
23-
- **Automatic Recognition** - No manual tool selection; the AI automatically calls the PDF parsing functions
24-
- **Dynamic Path Passing** - No hardcoding; specify a different PDF file each time
25-
- **Local Deployment** - Deploy locally via configuration files, with full data control
26-
27-
**Usage Example**:
28-
29-
```
30-
In Windsurf, simply say:
31-
"Analyze D:\report.pdf"
32-
"How many pages does this PDF have?"
33-
"Search for 'breach of contract' in the contract"
34-
```
35-
36-
---
37-
38-
## 📖 Overview
39-
40-
ParseFlow is a high-performance **MCP (Model Context Protocol) Server** for PDF document parsing and analysis, designed specifically for AI assistants in **Windsurf** and **Cursor** IDEs.
41-
42-
### Core Features
43-
44-
- 📄 **Text Extraction**: Extract PDF text content with pagination and range support ✅
45-
- 📊 **Metadata Reading**: Get title, author, page count, creation date, etc. ✅
46-
- 🔍 **Keyword Search**: Search for specific content in PDFs ✅
47-
- 🖼️ **Image Extraction**: Export images from PDFs (requires poppler-utils) ✅
48-
- 📑 **Table of Contents**: Extract PDF bookmarks and TOC structure (requires pdftk/pdfinfo) ✅
49-
50-
### Technical Features
51-
52-
- **MCP Protocol Support**: Standard MCP Tools implementation
53-
- **TypeScript Development**: Type-safe, maintainable
54-
- **Monorepo Architecture**: Separate core library and server
55-
- **Local Deployment**: Data stays local, secure and controllable
56-
57-
---
58-
59-
## 🏗️ Architecture
60-
61-
```
62-
┌─────────────────────────────────────┐
63-
│ Windsurf IDE │
64-
│ (MCP Client / Cascade) │
65-
└──────────────┬──────────────────────┘
66-
│ MCP Protocol (stdio)
67-
┌──────────────▼──────────────────────┐
68-
│ ParseFlow MCP Server │
69-
│ ┌─────────────────────────────┐ │
70-
│ │ MCP Tools │ │
71-
│ │ • extract_text ✅ │ │
72-
│ │ • search_pdf ✅ │ │
73-
│ │ • get_metadata ✅ │ │
74-
│ │ • extract_images ✅ │ │
75-
│ │ • get_toc ✅ │ │
76-
│ └─────────────────────────────┘ │
77-
└──────────────┬──────────────────────┘
78-
79-
┌──────────────▼──────────────────────┐
80-
│ PDF Parser Core Library │
81-
│ • pdf-parse (text/metadata) │
82-
│ • pdf-lib (PDF operations) │
83-
│ • Keyword search engine │
84-
│ • External tools (optional) │
85-
│ - pdfimages (image extraction) │
86-
│ - pdftk/pdfinfo (TOC extraction) │
87-
└─────────────────────────────────────┘
88-
```
89-
90-
---
91-
92-
## 🚀 Quick Start
93-
94-
### Prerequisites
95-
96-
- Node.js >= 18.0.0
97-
- pnpm >= 8.0.0
98-
- Windsurf or Cursor IDE
99-
100-
### Optional Tools (for Image and TOC Extraction)
101-
102-
If you need image extraction and table of contents extraction features, please install:
103-
104-
**Windows**:
105-
- [Poppler](https://github.com/oschwartz10612/poppler-windows/releases) - For image and TOC extraction
106-
- Download and add to system PATH (e.g., `D:\poppler\Library\bin`)
107-
108-
**Ubuntu/Debian**:
109-
```bash
110-
sudo apt-get install poppler-utils pdftk
111-
```
112-
113-
**macOS**:
114-
```bash
115-
brew install poppler pdftk-java
116-
```
117-
118-
> 💡 You can still use text extraction, metadata, and search features without installing external tools. See [External Tools Guide](docs/guides/external-tools.md)
119-
120-
### Installation
121-
122-
```bash
123-
# 1. Clone repository
124-
git clone https://github.com/Libres-coder/ParseFlow.git
125-
cd ParseFlow
126-
127-
# 2. Install dependencies
128-
pnpm install
129-
130-
# 3. Build project
131-
pnpm build
132-
```
133-
134-
### Configuration
135-
136-
Choose your IDE:
137-
138-
#### Option 1: Windsurf (Recommended)
139-
140-
```powershell
141-
# Run auto-configuration script
142-
.\scripts\setup-windsurf.ps1
143-
```
144-
145-
**Or configure manually**:
146-
147-
1. Open Windsurf settings
148-
2. Find MCP Server configuration
149-
3. Add ParseFlow configuration (see [Windsurf Setup Guide](docs/en/setup/windsurf.md))
150-
151-
#### Option 2: Cursor
152-
153-
```powershell
154-
# Run auto-configuration script
155-
.\scripts\setup-cursor.ps1
156-
```
157-
158-
**Or configure manually**:
159-
160-
1. Edit `C:\Users\<username>\.cursor\mcp.json`
161-
2. Add ParseFlow configuration (see [Cursor Setup Guide](docs/en/setup/cursor.md))
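
For manual setup, the entry added to `mcp.json` typically maps a server name to a launch command. The sketch below builds such an entry in TypeScript. The `mcpServers` layout follows the format Cursor commonly uses; the server name `parseflow` and the `packages/mcp-server/dist/index.js` entry path are assumptions inferred from the project structure, not confirmed values, so the linked setup guide remains the authoritative reference.

```typescript
// Shape of one MCP server entry: the command to launch and its arguments.
interface McpServerEntry {
  command: string;
  args: string[];
}

// Top-level layout of mcp.json (assumed: a map keyed by server name).
interface McpConfig {
  mcpServers: Record<string, McpServerEntry>;
}

// Build a config that launches the built server with Node.
// ASSUMPTION: the build output entry point is packages/mcp-server/dist/index.js.
function buildParseFlowConfig(repoRoot: string): McpConfig {
  // Drop trailing slashes or backslashes so the joined path has a single separator.
  const root = repoRoot.replace(/[\\/]+$/, "");
  const entry = `${root}/packages/mcp-server/dist/index.js`;
  return {
    mcpServers: {
      // "parseflow" is a hypothetical server name chosen for illustration.
      parseflow: { command: "node", args: [entry] },
    },
  };
}

// Print the JSON that would be pasted into mcp.json.
console.log(JSON.stringify(buildParseFlowConfig("D:/ParseFlow"), null, 2));
```

Generating the file from the repository root avoids hand-editing an absolute path, which is the most likely source of typos in this step.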
162-
163-
### Testing
164-
165-
Restart your IDE and try in the chat:
166-
167-
**In Windsurf**:
168-
169-
```
170-
Extract text from D:\document.pdf
171-
```
172-
173-
**In Cursor** (Agent mode):
174-
175-
```
176-
Use parseflow tool to extract text from D:\document.pdf
177-
```
178-
179-
---
180-
181-
## 📚 Documentation
182-
183-
### 📖 User Guides
184-
185-
- [Quick Start](docs/en/guides/quick-start.md) - Get started in 5 minutes
186-
- [FAQ](docs/en/guides/faq.md) - Frequently asked questions
187-
- [Examples](docs/en/guides/examples.md) - Code examples and best practices
188-
189-
### ⚙️ Setup Guides
190-
191-
- [Windsurf Setup](docs/en/setup/windsurf.md) - Windsurf IDE configuration (Recommended)
192-
- [Cursor Setup](docs/en/setup/cursor.md) - Cursor IDE configuration
193-
194-
### 🛠️ Development Documentation
195-
196-
- [API Reference](docs/en/development/api.md) - Complete API documentation
197-
- [Architecture](docs/en/development/architecture.md) - System architecture
198-
- [Development Guide](docs/en/development/development.md) - How to contribute
199-
- [Naming Conventions](docs/en/development/naming-conventions.md) - Code standards
200-
201-
### 📋 Project Planning
202-
203-
- [TODO](docs/en/planning/todo.md) - Feature roadmap
204-
- [Distribution Analysis](docs/en/planning/distribution-analysis.md) - Release plans
205-
206-
### 📂 Documentation Index
207-
208-
- [Complete Documentation](docs/en/README.md) - Full documentation index
209-
210-
---
211-
212-
## 🎯 Usage Examples
213-
214-
### Text Extraction
215-
216-
```
217-
Q: Extract text from D:\report.pdf
218-
A: [Parsed text content...]
219-
```
220-
221-
### Keyword Search
222-
223-
```
224-
Q: Search for "contract" in D:\document.pdf
225-
A: Found 3 results:
226-
Page 1: ...contract terms...
227-
Page 3: ...contract signed...
228-
Page 5: ...contract expires...
229-
```
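
The page-by-page result above can be illustrated with a minimal sketch of keyword search over already-extracted page text. This is not ParseFlow's actual implementation: the `searchPages` function, the `SearchHit` shape, and the 10-character snippet window are all hypothetical; only the `caseSensitive` and `maxResults` option names mirror the `search_pdf` parameters.

```typescript
// One match: the 1-based page number and a short excerpt around the match.
interface SearchHit {
  page: number;
  snippet: string;
}

interface SearchOptions {
  caseSensitive?: boolean;
  maxResults?: number;
}

// Scan each page's text for the query and collect hits in page order.
// ASSUMPTION: pages[i] holds the extracted text of page i + 1.
function searchPages(pages: string[], query: string, options: SearchOptions = {}): SearchHit[] {
  const { caseSensitive = false, maxResults = Infinity } = options;
  const hits: SearchHit[] = [];
  if (query.length === 0) return hits;

  // Note: toLowerCase can change string length for a few Unicode characters,
  // so snippet offsets are only exact for text where case folding preserves length.
  const needle = caseSensitive ? query : query.toLowerCase();

  for (let i = 0; i < pages.length && hits.length < maxResults; i++) {
    const text = pages[i];
    const haystack = caseSensitive ? text : text.toLowerCase();
    let from = 0;
    while (hits.length < maxResults) {
      const at = haystack.indexOf(needle, from);
      if (at === -1) break;
      // Keep up to 10 characters of context on each side of the match.
      const start = Math.max(0, at - 10);
      const end = Math.min(text.length, at + query.length + 10);
      hits.push({ page: i + 1, snippet: text.slice(start, end) });
      from = at + needle.length;
    }
  }
  return hits;
}
```

Capping the loop with `maxResults` in both the page loop and the inner scan keeps the search from reading further pages once the limit is reached.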
230-
231-
### Metadata Retrieval
232-
233-
```
234-
Q: Who is the author of D:\document.pdf?
235-
A: Author: Unknown, Created: 2025-01-15
236-
```
237-
238-
---
239-
240-
## 🛠️ Project Structure
241-
242-
```
243-
ParseFlow/
244-
├── packages/
245-
│ ├── mcp-server/ # MCP Server
246-
│ │ ├── src/
247-
│ │ │ ├── index.ts # Entry point
248-
│ │ │ ├── server.ts # MCP Server core
249-
│ │ │ ├── tools/ # MCP tools
250-
│ │ │ ├── resources/ # MCP resources
251-
│ │ │ └── utils/ # Utilities
252-
│ │ └── dist/ # Build output
253-
│ └── pdf-parser-core/ # PDF parsing core
254-
│ ├── src/
255-
│ │ ├── parser.ts # Main parser
256-
│ │ ├── extractors/ # Text extractors
257-
│ │ ├── search/ # Search functionality
258-
│ │ └── types/ # Type definitions
259-
│ └── dist/ # Build output
260-
├── docs/ # Documentation
261-
│ ├── zh/ # Chinese docs
262-
│ └── en/ # English docs
263-
├── examples/ # Usage examples
264-
├── tests/ # Test files
265-
└── scripts/ # Utility scripts
266-
```
267-
268-
---
269-
270-
## 🔧 MCP Tools
271-
272-
ParseFlow provides the following MCP tools:
273-
274-
| Tool | Description | Parameters | Status |
275-
| ----------------- | ------------------------- | ------------------------------------------------ | ------ |
276-
| `extract_text` | Extract text from PDF | `path`, `page?`, `range?`, `strategy?` | ✅ |
277-
| `get_metadata` | Get PDF metadata | `path` | ✅ |
278-
| `search_pdf` | Search keywords in PDF | `path`, `query`, `caseSensitive?`, `maxResults?` | ✅ |
279-
| `extract_images` | Extract images from PDF | `path`, `outputDir`, `format?` | ✅ |
280-
| `get_toc` | Get table of contents | `path` | ✅ |
281-
282-
For detailed API documentation, see [API Reference](docs/en/development/api.md)
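
As a rough illustration of how a client addresses these tools, the sketch below types each tool's arguments and wraps them in an MCP `tools/call` JSON-RPC request. The parameter names come from the table above; the value types (for example, `range` as a string) and the `buildToolCall` helper are assumptions for illustration, so the API Reference should be consulted for the real schemas.

```typescript
// Argument shapes per tool. Names follow the tool table; types are assumed.
interface ToolArgs {
  extract_text: { path: string; page?: number; range?: string; strategy?: string };
  get_metadata: { path: string };
  search_pdf: { path: string; query: string; caseSensitive?: boolean; maxResults?: number };
  extract_images: { path: string; outputDir: string; format?: string };
  get_toc: { path: string };
}

// JSON-RPC 2.0 envelope used by MCP for tool invocation.
interface ToolCallRequest<K extends keyof ToolArgs> {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: K; arguments: ToolArgs[K] };
}

// Hypothetical helper: builds a type-checked request for one of the tools.
function buildToolCall<K extends keyof ToolArgs>(
  id: number,
  name: K,
  args: ToolArgs[K],
): ToolCallRequest<K> {
  return { jsonrpc: "2.0", id, method: "tools/call", params: { name, arguments: args } };
}

// Example: search a document, limiting the number of results.
const request = buildToolCall(1, "search_pdf", {
  path: "D:\\document.pdf",
  query: "contract",
  maxResults: 5,
});
console.log(JSON.stringify(request));
```

In practice the AI assistant in the IDE issues these calls itself; typing the arguments mainly helps when testing the server directly.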
283-
284-
---
285-
286-
## 🚀 Future Plans
287-
288-
### ✅ Completed Features (v1.0.0)
289-
290-
- ✅ Text extraction
291-
- ✅ Metadata extraction
292-
- ✅ Keyword search
293-
- ✅ Image extraction (external tool integration)
294-
- ✅ Table of contents extraction (external tool integration)
295-
296-
### High Priority
297-
298-
#### ⭐ npm Package Release
299-
300-
Simplify installation and usage!
301-
302-
**Plans**:
303-
304-
- ✅ Core functionality complete
305-
- ✅ Documentation complete
306-
- ✅ Testing complete
307-
- 📦 Ready to publish to npm
308-
- 🎯 Submit to official MCP Registry
309-
310-
**Priority**: ⭐⭐⭐⭐⭐
311-
312-
#### ⭐ GitHub Release
313-
314-
Complete project release
315-
316-
**Plans**:
317-
318-
- 📋 Create release notes
319-
- 📦 Package distribution
320-
- 🎉 v1.0.0 release
321-
322-
**Priority**: ⭐⭐⭐⭐⭐
323-
324-
### Medium Priority
325-
326-
- 🔄 Performance optimization (large file handling)
327-
- 📊 Advanced search features (fuzzy search, regex)
328-
- 🎨 Better error messages and user feedback
329-
330-
### Future Considerations
331-
332-
- 📸 OCR support (for scanned documents)
333-
- 🤖 AI-powered document analysis
334-
- 🔄 PDF merge/split functionality
335-
- 🔐 PDF encryption/decryption
336-
- 🌐 More IDE integrations
337-
338-
**Detailed Roadmap**: [docs/en/planning/todo.md](docs/en/planning/todo.md)
339-
340-
---
341-
342-
## 🤝 Contributing
343-
344-
Contributions welcome! Please see [CONTRIBUTING.md](CONTRIBUTING.md)
345-
346-
### Contribution Process
347-
348-
1. Fork the repository
349-
2. Create your feature branch (`git checkout -b feature/AmazingFeature`)
350-
3. Commit your changes (`git commit -m 'Add AmazingFeature'`)
351-
4. Push to the branch (`git push origin feature/AmazingFeature`)
352-
5. Open a Pull Request
353-
354-
---
355-
356-
## 🐛 Issue Reporting
357-
358-
If you encounter problems:
359-
360-
1. Check [docs/en/guides/faq.md](docs/en/guides/faq.md) for common issues
361-
2. Check the [logs/parseflow.log](logs/) log file
362-
3. Submit an [Issue](https://github.com/Libres-coder/ParseFlow/issues)
363-
364-
---
365-
366-
## 📄 License
367-
368-
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details
369-
370-
---
371-
372-
## 🙏 Acknowledgments
373-
374-
- [Model Context Protocol](https://modelcontextprotocol.io) - MCP protocol standard
375-
- [pdf-parse](https://www.npmjs.com/package/pdf-parse) - PDF text extraction library
376-
- [pdf-lib](https://www.npmjs.com/package/pdf-lib) - PDF manipulation library
377-
- [Poppler](https://poppler.freedesktop.org/) - PDF rendering library
378-
- Windsurf Community - Testing and feedback
379-
380-
---
381-
382-
## 📮 Resources
383-
384-
- [MCP Protocol Documentation](https://modelcontextprotocol.io)
385-
- [Windsurf IDE](https://codeium.com/windsurf)
386-
- [Project Documentation](docs/en/)
387-
388-
---
389-
390-
<div align="center">
391-
392-
**Made with ❤️ by ParseFlow Team**
393-
394-
</div>
