1 | | -# 📄 ParseFlow |
2 | | - |
3 | | -**Universal document parsing library for PDF, Word, and Excel files** |
4 | | - |
5 | | -[parseflow-core on npm](https://www.npmjs.com/package/parseflow-core)
6 | | -[parseflow-mcp-server on npm](https://www.npmjs.com/package/parseflow-mcp-server)
7 | | -[License: MIT](https://opensource.org/licenses/MIT)
8 | | - |
9 | | -ParseFlow is a comprehensive document parsing solution that supports **PDF**, **Word (docx)**, and **Excel (xlsx/xls)** files. It provides both a standalone library and an MCP (Model Context Protocol) server for AI assistants. |
10 | | - |
11 | | -[English](./README_EN.md) | [Examples](./OFFICE_EXAMPLES.md) | [GitHub](https://github.com/Libres-coder/ParseFlow) |
12 | | - |
13 | | ---- |
14 | | - |
15 | | -## ✨ Features |
16 | | - |
17 | | -### 📄 PDF Support |
18 | | -- ✅ Text extraction with multiple strategies (raw, formatted, clean) |
19 | | -- ✅ Page-specific and range-based extraction |
20 | | -- ✅ Metadata retrieval (title, author, dates, page count) |
21 | | -- ✅ Full-text search with context |
22 | | -- ✅ Image extraction (placeholder) |
23 | | -- ✅ Table of contents (TOC) extraction (placeholder) |
24 | | - |
25 | | -### 📝 Word (docx) Support |
26 | | -- ✅ Text extraction |
27 | | -- ✅ HTML conversion |
28 | | -- ✅ Metadata retrieval |
29 | | -- ✅ Text search with context |
30 | | - |
31 | | -### 📊 Excel (xlsx/xls) Support |
32 | | -- ✅ Multi-sheet data extraction |
33 | | -- ✅ Multiple output formats (JSON, CSV, Text) |
34 | | -- ✅ Sheet-specific extraction |
35 | | -- ✅ Cell-based search |
36 | | -- ✅ Range extraction |
37 | | -- ✅ Workbook metadata |
38 | | - |
39 | | -### 🤖 MCP Server |
40 | | -- ✅ 9 tools for AI assistants (5 PDF + 2 Word + 2 Excel) |
41 | | -- ✅ Works with Claude Desktop and other MCP clients |
42 | | -- ✅ Path security with allowlist support |
43 | | - |
44 | | ---- |
45 | | - |
46 | | -## 📦 Installation |
47 | | - |
48 | | -### Core Library |
49 | | - |
50 | | -```bash |
51 | | -npm install parseflow-core |
52 | | -``` |
53 | | - |
54 | | -### MCP Server (Global) |
55 | | - |
56 | | -```bash |
57 | | -npm install -g parseflow-mcp-server |
58 | | -``` |
59 | | - |
60 | | -Or run it without a global install via npx:
61 | | - |
62 | | -```bash |
63 | | -npx parseflow-mcp-server |
64 | | -``` |
65 | | - |
66 | | ---- |
67 | | - |
68 | | -## 🚀 Quick Start |
69 | | - |
70 | | -### PDF Parsing |
71 | | - |
72 | | -```typescript |
73 | | -import { PDFParser } from 'parseflow-core'; |
74 | | - |
75 | | -const parser = new PDFParser(); |
76 | | - |
77 | | -// Extract all text |
78 | | -const text = await parser.extractText('document.pdf'); |
79 | | - |
80 | | -// Extract specific page |
81 | | -const page5 = await parser.extractPage('document.pdf', 5); |
82 | | - |
83 | | -// Search |
84 | | -const results = await parser.search('document.pdf', 'keyword'); |
85 | | - |
86 | | -// Get metadata |
87 | | -const metadata = await parser.getMetadata('document.pdf'); |
88 | | -``` |
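The snippet above does not show what `search` returns. To illustrate what "search with context" usually means, here is a minimal standalone sketch in plain TypeScript. The names `searchWithContext` and `ContextMatch` are hypothetical and are not part of `parseflow-core`; the library's actual result shape may differ.

```typescript
// Hypothetical result shape; parseflow-core's real return type may differ.
interface ContextMatch {
  index: number;   // character offset of the match within the text
  context: string; // the match plus up to `contextChars` on each side
}

// Case-sensitive substring search that returns each hit with surrounding text.
function searchWithContext(
  text: string,
  keyword: string,
  contextChars = 20
): ContextMatch[] {
  const matches: ContextMatch[] = [];
  if (keyword.length === 0) return matches; // avoid an infinite loop on ""

  let from = 0;
  while (true) {
    const index = text.indexOf(keyword, from);
    if (index === -1) break;

    // Clamp the context window to the bounds of the text.
    const start = Math.max(0, index - contextChars);
    const end = Math.min(text.length, index + keyword.length + contextChars);
    matches.push({ index, context: text.slice(start, end) });

    from = index + keyword.length; // continue after this match
  }
  return matches;
}
```

In this sketch, the text would be whatever string an extraction call produced, for example `searchWithContext(extractedText, 'keyword', 40)`.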
89 | | - |
90 | | -### Word Parsing |
91 | | - |
92 | | -```typescript |
93 | | -import { WordParser } from 'parseflow-core'; |
94 | | - |
95 | | -const parser = new WordParser(); |
96 | | - |
97 | | -// Extract text |
98 | | -const result = await parser.extractText('report.docx'); |
99 | | -console.log(result.text); |
100 | | - |
101 | | -// Convert to HTML |
102 | | -const html = await parser.extractHTML('report.docx'); |
103 | | - |
104 | | -// Search |
105 | | -const matches = await parser.searchText('report.docx', 'budget'); |
106 | | -``` |
107 | | - |
108 | | -### Excel Parsing |
109 | | - |
110 | | -```typescript |
111 | | -import { ExcelParser } from 'parseflow-core'; |
112 | | - |
113 | | -const parser = new ExcelParser(); |
114 | | - |
115 | | -// Extract all sheets (JSON format) |
116 | | -const data = await parser.extractData('spreadsheet.xlsx'); |
117 | | - |
118 | | -// Extract specific sheet |
119 | | -const sales = await parser.extractData('data.xlsx', { |
120 | | - sheetName: 'Q4 Sales', |
121 | | - format: 'json' |
122 | | -}); |
123 | | - |
124 | | -// Search in cells |
125 | | -const results = await parser.searchText('data.xlsx', 'revenue'); |
126 | | -``` |
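The feature list mentions CSV as one of the output formats, but the example above only requests JSON. As a rough illustration of what turning sheet rows into CSV involves, the sketch below quotes fields the way RFC 4180 describes. It is a standalone example under stated assumptions: `Cell`, `escapeCsvField`, and `rowsToCsv` are hypothetical names, and this is not how `ExcelParser` is known to produce its CSV output.

```typescript
// Hypothetical cell type for illustration; real sheet data may carry more types.
type Cell = string | number | boolean | null;

// Quote a field only when it contains a comma, a double quote, or a line break.
// Embedded double quotes are escaped by doubling them.
function escapeCsvField(value: Cell): string {
  const s = value === null ? "" : String(value);
  return /[",\r\n]/.test(s) ? `"${s.replace(/"/g, '""')}"` : s;
}

// Join cells with commas and rows with CRLF.
function rowsToCsv(rows: Cell[][]): string {
  return rows.map((row) => row.map(escapeCsvField).join(",")).join("\r\n");
}

const csv = rowsToCsv([
  ["Product", "Revenue"],
  ["Widget, large", 1200],
  ['6" pipe', null],
]);
console.log(csv);
```

Empty cells are written as empty fields here; whether a real export should emit them that way depends on the consumer.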
127 | | - |
128 | | ---- |
129 | | - |
130 | | -## 🛠️ MCP Server Usage |
131 | | - |
132 | | -### Configuration for Claude Desktop |
133 | | - |
134 | | -Add the following entry to `claude_desktop_config.json`:
135 | | - |
136 | | -```json |
137 | | -{ |
138 | | - "mcpServers": { |
139 | | - "parseflow": { |
140 | | - "command": "npx", |
141 | | - "args": ["-y", "parseflow-mcp-server"], |
142 | | - "env": { |
143 | | - "PARSEFLOW_ALLOWED_PATHS": "C:\\Documents;D:\\Projects" |
144 | | - } |
145 | | - } |
146 | | - } |
147 | | -} |
148 | | -``` |
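The README does not describe how the server enforces `PARSEFLOW_ALLOWED_PATHS`. The following is only a sketch of how a semicolon-separated allowlist like the one above could be checked, assuming entries are absolute directory paths. `parseAllowedPaths` and `isPathAllowed` are hypothetical helpers, not functions exported by `parseflow-mcp-server`.

```typescript
import * as path from "node:path";

// The config example uses Windows paths, so this sketch uses path.win32.
// Swap in plain `path` to follow the host platform's rules instead.
const p = path.win32;

// Hypothetical helper: split the env value on ";" as in the example above
// and normalize each entry.
function parseAllowedPaths(raw: string | undefined): string[] {
  return (raw ?? "")
    .split(";")
    .map((entry) => entry.trim())
    .filter((entry) => entry.length > 0)
    .map((entry) => p.resolve(entry));
}

// Hypothetical helper: true if `target` is an allowed root or lies beneath one.
function isPathAllowed(target: string, allowedRoots: string[]): boolean {
  const resolved = p.resolve(target); // collapses "." and ".." segments
  return allowedRoots.some((root) => {
    const rel = p.relative(root, resolved);
    const escapes = rel === ".." || rel.startsWith(".." + p.sep);
    // An absolute `rel` means the target is on a different drive.
    return !escapes && !p.isAbsolute(rel);
  });
}

const roots = parseAllowedPaths("C:\\Documents;D:\\Projects");
console.log(isPathAllowed("C:\\Documents\\reports\\q4.pdf", roots));
console.log(isPathAllowed("C:\\Documents\\..\\Windows\\system.ini", roots));
```

Comparing with `relative` instead of a plain string prefix keeps a sibling directory such as `C:\Documents2` from passing as if it were inside `C:\Documents`. This sketch does not resolve symbolic links; a stricter check could first canonicalize paths with `fs.realpath`.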
149 | | - |
150 | | -### Available Tools |
151 | | - |
152 | | -#### PDF Tools |
153 | | -- `extract_text` - Extract text from PDF files |
154 | | -- `search_pdf` - Search for keywords in PDF |
155 | | -- `get_metadata` - Get PDF metadata |
156 | | -- `extract_images` - Extract images from PDF |
157 | | -- `get_toc` - Get table of contents |
158 | | - |
159 | | -#### Word Tools |
160 | | -- `extract_word` - Extract text/HTML from Word documents |
161 | | -- `search_word` - Search in Word documents |
162 | | - |
163 | | -#### Excel Tools |
164 | | -- `extract_excel` - Extract data from Excel spreadsheets |
165 | | -- `search_excel` - Search in Excel cells |
166 | | - |
167 | | -### Example Usage in Claude |
168 | | - |
169 | | -``` |
170 | | -"Please read the contents of report.docx"
171 | | -→ Uses extract_word tool
172 | | -
173 | | -"Find 'Product A' in sales.xlsx"
174 | | -→ Uses search_excel tool
175 | | -
176 | | -"Extract the metadata of document.pdf"
177 | | -→ Uses get_metadata tool
178 | | -``` |
179 | | - |
180 | | ---- |
181 | | - |
182 | | -## 📚 Documentation |
183 | | - |
184 | | -- **[Office Examples](./OFFICE_EXAMPLES.md)** - Word and Excel usage examples |
185 | | -- **[Release Guide](./RELEASE_GUIDE.md)** - How to publish new versions |
186 | | -- **[Contributing](./CONTRIBUTING.md)** - Contribution guidelines |
187 | | -- **[Security Policy](./SECURITY.md)** - Security vulnerability reporting |
188 | | -- **[Code of Conduct](./CODE_OF_CONDUCT.md)** - Community guidelines |
189 | | - |
190 | | ---- |
191 | | - |
192 | | -## 🏗️ Project Structure |
193 | | - |
194 | | -``` |
195 | | -ParseFlow/ |
196 | | -├── packages/ |
197 | | -│ ├── pdf-parser-core/ # Core library (parseflow-core) |
198 | | -│ │ ├── src/ |
199 | | -│ │ │ ├── parser.ts # PDF parser |
200 | | -│ │ │ ├── WordParser.ts # Word parser |
201 | | -│ │ │ └── ExcelParser.ts # Excel parser |
202 | | -│ │ └── package.json |
203 | | -│ └── mcp-server/ # MCP server (parseflow-mcp-server) |
204 | | -│ ├── src/ |
205 | | -│ │ ├── index.ts # Server entry |
206 | | -│ │ └── tools/ # MCP tools |
207 | | -│ └── package.json |
208 | | -├── docs/ # Documentation |
209 | | -├── examples/ # Usage examples |
210 | | -├── tests/ # Test files |
211 | | -└── scripts/ # Build scripts |
212 | | -``` |
213 | | - |
214 | | ---- |
215 | | - |
216 | | -## 🧪 Testing |
217 | | - |
218 | | -```bash |
219 | | -# Run all tests |
220 | | -pnpm test |
221 | | - |
222 | | -# Test coverage |
223 | | -pnpm test:coverage |
224 | | - |
225 | | -# Run specific test |
226 | | -pnpm test parser.test.ts |
227 | | -``` |
228 | | - |
229 | | -### Test Files |
230 | | -- **Word测试文件.docx** - Word test document |
231 | | -- **Excel测试文件.xlsx** - Excel test workbook (3 sheets) |
232 | | -- **PDF测试文档.pdf** - PDF test document |
233 | | - |
234 | | ---- |
235 | | - |
236 | | -## 🔧 Development |
237 | | - |
238 | | -```bash |
239 | | -# Install dependencies |
240 | | -pnpm install |
241 | | - |
242 | | -# Build all packages |
243 | | -pnpm build |
244 | | - |
245 | | -# Watch mode |
246 | | -pnpm dev |
247 | | - |
248 | | -# Lint |
249 | | -pnpm lint |
250 | | - |
251 | | -# Type check |
252 | | -pnpm type-check |
253 | | -``` |
254 | | - |
255 | | ---- |
256 | | - |
257 | | -## 📈 Roadmap |
258 | | - |
259 | | -### v1.1.0 (Current) |
260 | | -- ✅ Word (docx) support |
261 | | -- ✅ Excel (xlsx/xls) support |
262 | | -- ✅ 9 MCP tools |
263 | | - |
264 | | -### v1.2.0 (Planned) |
265 | | -- [ ] Encrypted PDF support |
266 | | -- [ ] OCR text recognition |
267 | | -- [ ] PowerPoint (pptx) support |
268 | | -- [ ] Batch processing optimization |
269 | | - |
270 | | -### v2.0.0 (Future) |
271 | | -- [ ] Plugin system |
272 | | -- [ ] More document formats (CSV, TXT, RTF) |
273 | | -- [ ] Advanced table extraction |
274 | | -- [ ] Document conversion |
275 | | - |
276 | | ---- |
277 | | - |
278 | | -## 🤝 Contributing |
279 | | - |
280 | | -We welcome contributions! Please see [CONTRIBUTING.md](./CONTRIBUTING.md) for details. |
281 | | - |
282 | | -### Ways to Contribute |
283 | | -- 🐛 Report bugs |
284 | | -- 💡 Suggest features |
285 | | -- 📝 Improve documentation |
286 | | -- 🔧 Submit pull requests |
287 | | - |
288 | | ---- |
289 | | - |
290 | | -## 📦 Packages |
291 | | - |
292 | | -| Package | Version | Description | |
293 | | -|---------|---------|-------------| |
294 | | -| [parseflow-core](https://www.npmjs.com/package/parseflow-core) | 1.0.1 | Core parsing library | |
295 | | -| [parseflow-mcp-server](https://www.npmjs.com/package/parseflow-mcp-server) | 1.0.2 | MCP server for AI | |
296 | | - |
297 | | ---- |
298 | | - |
299 | | -## 🔗 Links |
300 | | - |
301 | | -- **npm Core**: https://www.npmjs.com/package/parseflow-core |
302 | | -- **npm MCP**: https://www.npmjs.com/package/parseflow-mcp-server |
303 | | -- **GitHub**: https://github.com/Libres-coder/ParseFlow |
304 | | -- **Issues**: https://github.com/Libres-coder/ParseFlow/issues |
305 | | -- **MCP Registry**: https://registry.modelcontextprotocol.io/ |
306 | | - |
307 | | ---- |
308 | | - |
309 | | -## 📄 License |
310 | | - |
311 | | -MIT License - see [LICENSE](./LICENSE) file for details. |
312 | | - |
313 | | ---- |
314 | | - |
315 | | -## 🙏 Acknowledgments |
316 | | - |
317 | | -- **pdf-parse** - PDF parsing |
318 | | -- **pdf-lib** - PDF manipulation |
319 | | -- **mammoth** - Word document parsing |
320 | | -- **xlsx** - Excel spreadsheet parsing |
321 | | -- **MCP SDK** - Model Context Protocol |
322 | | - |
323 | | ---- |
324 | | - |
325 | | -## 📊 Stats |
326 | | - |
327 | | -- **Test Coverage**: 83%+ |
328 | | -- **Supported Formats**: 3 (PDF, Word, Excel) |
329 | | -- **MCP Tools**: 9 |
330 | | -- **Dependencies**: Minimal and well-maintained |
331 | | - |
332 | | ---- |
333 | | - |
334 | | -## 💬 Community |
335 | | - |
336 | | -- **Issues**: [GitHub Issues](https://github.com/Libres-coder/ParseFlow/issues) |
337 | | -- **Discussions**: [GitHub Discussions](https://github.com/Libres-coder/ParseFlow/discussions) |
338 | | - |
339 | | ---- |
340 | | - |
341 | | -**Made with ❤️ by Libres-coder** |
342 | | - |
343 | | -**Status**: 🎉 Production Ready (v1.1.0) |