Skip to content

Latest commit

 

History

History
48 lines (33 loc) · 1.97 KB

README.md

File metadata and controls

48 lines (33 loc) · 1.97 KB

🦜 ParseDF

License Build Status Issues

✨ PDF labeling tool to enhance the performance of RAG.

📋 Table of Contents

📖 Project Overview

This project is a PDF labeling tool designed to enhance RAG performance. The tool allows users to draw bounding boxes to label elements such as tables and images. It also supports labeling all pages for items located in fixed positions and provides options for auto-labeling using APIs or custom-trained models. The labeled data is then used to generate Markdown (MD) files optimized for RAG workflows. The main features include precise manual labeling, automated labeling capabilities, and seamless integration into RAG pipelines, making it a versatile solution for various use cases.

✨ Features

  • 1️⃣ bbox labeling: Easily define label types and draw precise bounding boxes for tables, images, and other elements in PDFs.

  • 2️⃣ Apply Labels Across All Pages: Apply Labels Across All Pages

  • 3️⃣ Auto Labeling with API or Custom Models: Leverage APIs or your trained models to perform automated and efficient labeling.

🛠️ Installation

  1. Clone the repository:
    git clone -b dev https://github.com/Tagite/ParseDF.git
  2. Install UV(Python package and project manager)
    # On macOS and Linux.
    curl -LsSf https://astral.sh/uv/install.sh | sh 
    
    # On Windows.
    powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"
    
    # (Cross platform)With pip.
    pip install uv
    

🚀 Usage

  1. Run with pdf_path:
    uv run main.py {PDF_PATH}