pdf to image

HomunMage · Nov 16, 2024 · d2caaf6
1 parent c69caeb
Showing 9 changed files with 511 additions and 2 deletions.
diff --git a/Productivity/Converter/PDF2Text/OCR/README.md b/Productivity/Converter/PDF2Text/OCR/README.md
@@ -0,0 +1,120 @@
+# OCR
+
+
+## easy ocr
+
+Check the supported language codes at https://www.jaided.ai/easyocr/
+
+<div class="load_as_code_session" data-url="easy_ocr.py">
+  Loading content...
+</div>
+
+
+
+## other OCR with GPU
+
+The following open-source OCR options can handle Traditional Chinese (zh-TW), and most of them can use an NVIDIA GPU:
+
+---
+
+### 1. **Tesseract OCR (no native GPU support)**
+   - **Description**: Tesseract is a well-established open-source OCR engine that supports many languages, including Traditional Chinese. It has no native GPU acceleration, but the surrounding pipeline can be sped up by doing image pre-processing on the GPU (for example with a CUDA-enabled OpenCV build).
+   - **Key Features**:
+     - High customization and language support (including Traditional Chinese).
+     - Works well for clean, printed text.
+   - **Limitations**:
+     - Relatively slow compared to modern AI-based OCR solutions.
+   - **Setup**:
+     - Install `tesseract-ocr` and the Traditional Chinese language data package (`chi_tra`).
+     - Can be used with Python via the `pytesseract` library.
+   - **GPU Option**:
+     - Pre-process images using GPU-accelerated libraries like OpenCV with CUDA.
+
+---
+
+### 2. **EasyOCR**
+   - **Description**: EasyOCR is a modern, AI-powered OCR library written in PyTorch. It supports GPU acceleration out of the box and handles Traditional Chinese well.
+   - **Key Features**:
+     - Multilingual support, including zh-TW.
+     - Lightweight and easy to set up.
+     - Can leverage NVIDIA GPUs for faster processing.
+   - **Setup**:
+     1. Install via pip: `pip install easyocr`.
+     2. Run the code:
+        ```python
+        import easyocr
+        reader = easyocr.Reader(['ch_tra', 'en'], gpu=True)  # EasyOCR's code for Traditional Chinese is 'ch_tra'
+        result = reader.readtext('path_to_image')
+        ```
+   - **Limitations**:
+     - Struggles with very complex or heavily distorted handwriting.
+
+---
+
+### 3. **PaddleOCR**
+   - **Description**: PaddleOCR is a powerful OCR tool developed by Baidu. It supports GPU acceleration using NVIDIA GPUs and provides excellent accuracy, especially for Chinese text.
+   - **Key Features**:
+     - Optimized for Chinese languages.
+     - High accuracy for both printed and handwritten text.
+     - Built-in tools for image pre-processing and text detection.
+   - **Setup**:
+     1. Install the PaddleOCR package:
+        ```bash
+        pip install paddleocr
+        pip install paddlepaddle-gpu  # Ensure GPU support
+        ```
+     2. Use the library:
+        ```python
+        from paddleocr import PaddleOCR
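+        # Assumes the PaddleOCR 2.x API; 'chinese_cht' selects Traditional Chinese ('ch' is Simplified)
+        ocr = PaddleOCR(use_angle_cls=True, use_gpu=True, lang='chinese_cht')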
+        result = ocr.ocr('path_to_image', cls=True)
+        ```
+   - **Limitations**:
+     - Requires installing PaddlePaddle, which can have specific system requirements.
+
+---
+
+### 4. **OCR with OpenCV and Deep Learning Models**
+   - **Description**: OpenCV can be combined with custom deep learning OCR models such as CRNN (Convolutional Recurrent Neural Network) or SAR (Show, Attend and Read). These models can be trained or fine-tuned on Traditional Chinese datasets.
+   - **Key Features**:
+     - Customizable for your specific needs.
+     - Full GPU acceleration using NVIDIA CUDA.
+   - **Setup**:
+     - Use OpenCV with CUDA for pre-processing (e.g., noise removal, binarization).
+     - Combine with a deep learning framework (e.g., PyTorch or TensorFlow) for OCR.
+
+---
+
+### 5. **TrOCR by Microsoft**
+   - **Description**: TrOCR is a transformer-based OCR model from Microsoft that runs efficiently on a GPU. The checkpoints Microsoft publishes (such as the one used below) are trained on English text, so recognizing Chinese requires a checkpoint fine-tuned on Chinese data.
+   - **Key Features**:
+     - State-of-the-art accuracy.
+     - Uses transformers for improved contextual understanding.
+   - **Setup**:
+     1. Install the `transformers` library:
+        ```bash
+        pip install transformers
+        ```
+     2. Use the model:
+        ```python
+        from transformers import TrOCRProcessor, VisionEncoderDecoderModel
+        from PIL import Image
+        import torch
+
+        # English handwriting checkpoint; replace it with one fine-tuned on Traditional Chinese
+        processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
+        model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten").cuda()
+
+        image = Image.open('path_to_image').convert("RGB")
+        pixel_values = processor(images=image, return_tensors="pt").pixel_values.cuda()
+        generated_ids = model.generate(pixel_values)
+        text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
+        print(text)
+        ```
+   - **Limitations**:
+     - Requires fine-tuning for best performance on Traditional Chinese.
+
+---
+
+
+
+<script src="https://posetmage.com/assets/js/LoadAsCodeSession.js"></script>
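Note on the OCR README above: each engine uses a different code for Traditional Chinese, which is easy to mix up with the generic `zh-TW` tag. The lookup below is a minimal sketch, not part of this commit. `chi_tra` (Tesseract) and `ch_tra` (EasyOCR) come from the files in this commit; `chinese_cht` for PaddleOCR is an assumption based on PaddleOCR's language list, and the function name is hypothetical.

```python
# Language codes each engine expects for Traditional Chinese.
# "chi_tra" and "ch_tra" appear in this commit; "chinese_cht" is an
# assumption based on PaddleOCR's language list.
TRADITIONAL_CHINESE_CODES = {
    "tesseract": "chi_tra",
    "easyocr": "ch_tra",
    "paddleocr": "chinese_cht",
}


def traditional_chinese_code(engine: str) -> str:
    """Return the Traditional Chinese language code for an OCR engine."""
    key = engine.strip().lower()
    try:
        return TRADITIONAL_CHINESE_CODES[key]
    except KeyError:
        # Reject anything that is not one of the known engine names.
        raise ValueError(f"unknown OCR engine: {engine!r}") from None


if __name__ == "__main__":
    for name in TRADITIONAL_CHINESE_CODES:
        print(name, "->", traditional_chinese_code(name))
```

Keeping the codes in one place avoids passing a tag such as `zh-tw` to an engine that does not recognize it.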
diff --git a/Productivity/Converter/PDF2Text/OCR/easy_ocr.py b/Productivity/Converter/PDF2Text/OCR/easy_ocr.py
@@ -0,0 +1,38 @@
+import easyocr
+import torch
+
+# Function to check if GPU is available
+def check_gpu():
+    if torch.cuda.is_available():
+        print("GPU is available and will be used.")
+    else:
+        print("GPU is not available. Using CPU.")
+
+# Check GPU availability
+check_gpu()
+
+# Initialize EasyOCR reader for Traditional Chinese (ch_tra) plus English
+reader = easyocr.Reader(['ch_tra', 'en'], gpu=torch.cuda.is_available())  # Use the GPU only when CUDA is available
+
+# Loop through image files from 001 to 274
+for i in range(1, 275):  # Loop from 1 to 274
+    # Format the image file name
+    image_file = f'output-{i:03}.png'  # This formats numbers with leading zeros (e.g., 001, 002, ..., 274)
+
+    try:
+        # Perform OCR on the image
+        result = reader.readtext(image_file)
+
+        # Create corresponding .txt file name
+        output_file = image_file.replace('.png', '.txt')  # Replace .png with .txt
+
+        # Save the recognized text to a .txt file
+        with open(output_file, 'w', encoding='utf-8') as f:
+            for detection in result:
+                text = detection[1]  # The recognized text
+                f.write(text + '\n')  # Write text to file, each on a new line
+
+        print(f'Text from {image_file} saved to {output_file}')
+
+    except Exception as e:
+        print(f"Error processing {image_file}: {e}")
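Note on `easy_ocr.py` above: the loop hardcodes pages 001 to 274, so it needs editing for every PDF. One possible alternative, sketched here with hypothetical helper names, is to discover the page images on disk and to keep text joining separate from the OCR call. It assumes the `output-<number>.png` naming used in the script and the `(bbox, text, confidence)` tuples that `reader.readtext` returns by default; the `min_confidence` parameter is an illustration, not something the commit uses.

```python
import re
from pathlib import Path


def page_images(directory, prefix="output"):
    """Return `<prefix>-<number>.png` files in `directory`, sorted by page number."""
    pattern = re.compile(rf"^{re.escape(prefix)}-(\d+)\.png$")
    pages = []
    for path in Path(directory).iterdir():
        match = pattern.match(path.name)
        if match:
            pages.append((int(match.group(1)), path))
    # Sort numerically so page 10 comes after page 2 regardless of padding.
    return [path for _, path in sorted(pages, key=lambda item: item[0])]


def detections_to_text(detections, min_confidence=0.0):
    """Join (bbox, text, confidence) detections into newline-separated text."""
    lines = [text for _bbox, text, conf in detections if conf >= min_confidence]
    return "\n".join(lines)
```

With these helpers the main loop could become `for image in page_images("."):` followed by `detections_to_text(reader.readtext(str(image)))`, with `reader` created as in the script.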
diff --git a/Productivity/Converter/PDF2Text/PDF2Image/README.md b/Productivity/Converter/PDF2Text/PDF2Image/README.md
@@ -0,0 +1,77 @@
+# PDF2Image 
+
+## PDF2Image by python
+
+<div class="load_as_code_session" data-url="pdf2img.py">
+  Loading content...
+</div>
+
+
+## PDF2Image by CLI
+
+To convert each page of a PDF into separate image files using a CLI (Command Line Interface) tool, you can use **`pdftoppm`**, part of the `poppler-utils` package, or **`ImageMagick`**. Here are solutions using both:
+
+---
+
+### **Option 1: Using `pdftoppm`**
+1. **Install `poppler-utils`** (if not installed):
+   - On Debian/Ubuntu:  
+     ```bash
+     sudo apt update
+     sudo apt install poppler-utils
+     ```
+   - On macOS (via Homebrew):  
+     ```bash
+     brew install poppler
+     ```
+
+2. **Convert PDF to Images**:
+   ```bash
+   pdftoppm -png input.pdf output
+   ```
+   - `-png`: Sets the output format to PNG (use `-jpeg` for JPEG).
+   - `input.pdf`: The input PDF file.
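+   - `output`: The prefix for output image files. Page numbers are zero-padded to the width of the PDF's page count (e.g., `output-1.png` for a short PDF, `output-001.png` for one with 100 to 999 pages).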
+
+---
+
+### **Option 2: Using ImageMagick**
+1. **Install ImageMagick**:
+   - On Debian/Ubuntu:  
+     ```bash
+     sudo apt update
+     sudo apt install imagemagick
+     ```
+   - On macOS (via Homebrew):  
+     ```bash
+     brew install imagemagick
+     ```
+
+2. **Convert PDF to Images**:
+   ```bash
+   convert -density 300 input.pdf page-%03d.png
+   ```
+   - `-density 300`: Sets resolution to 300 DPI (higher values produce better quality images).
+   - `input.pdf`: The input PDF file.
+   - `page-%03d.png`: Output filenames with a three-digit, zero-padded index. ImageMagick starts counting at 0 (`page-000.png`, `page-001.png`, ...); add `-scene 1` before the output name to start at 1.
+
+---
+
+### **Advanced Options**
+- To extract specific pages with `pdftoppm`, use the `-f` (first page) and `-l` (last page) flags:
+  ```bash
+  pdftoppm -png -f 2 -l 5 input.pdf output
+  ```
+  This converts pages 2 to 5 only.
+
+- To customize image size or quality in `ImageMagick`:
+  ```bash
+  convert -density 300 -quality 90 input.pdf page-%03d.png
+  ```
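+  - `-quality 90`: For JPEG output this is the lossy quality level (1 to 100). For PNG, which is lossless, the value instead selects the zlib compression level (tens digit) and filter type (ones digit), so it affects file size but not image fidelity.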
+
+Both tools are efficient and widely available on Linux, macOS, and Windows (via WSL or binaries).
+
+
+
+<script src="https://posetmage.com/assets/js/LoadAsCodeSession.js"></script>
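Note on the PDF2Image README above: the `pdftoppm` calls can also be driven from Python, which keeps the page range and resolution in one place. This is a minimal sketch with hypothetical function names, not the `pdf2img.py` from this commit. It uses the `-png`, `-f`, and `-l` flags shown in the README plus `-r`, which sets the resolution in DPI.

```python
import shutil
import subprocess


def pdftoppm_command(pdf_path, prefix, first=None, last=None, dpi=None):
    """Build the argument list for a `pdftoppm -png` call."""
    command = ["pdftoppm", "-png"]
    if dpi is not None:
        command += ["-r", str(dpi)]      # resolution in DPI
    if first is not None:
        command += ["-f", str(first)]    # first page to convert
    if last is not None:
        command += ["-l", str(last)]     # last page to convert
    command += [str(pdf_path), str(prefix)]
    return command


def convert_pdf(pdf_path, prefix, **options):
    """Run pdftoppm, raising if it is missing or exits with an error."""
    if shutil.which("pdftoppm") is None:
        raise FileNotFoundError("pdftoppm not found; install poppler-utils")
    subprocess.run(pdftoppm_command(pdf_path, prefix, **options), check=True)
```

For example, `convert_pdf("input.pdf", "output", first=2, last=5, dpi=300)` would run the page-range command from the Advanced Options section at 300 DPI.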
diff --git a/Productivity/Converter/pdf2img.py → ...y/Converter/PDF2Text/PDF2Image/pdf2img.py b/Productivity/Converter/pdf2img.py → ...y/Converter/PDF2Text/PDF2Image/pdf2img.py
diff --git a/Productivity/Converter/PDF2Text/README.md b/Productivity/Converter/PDF2Text/README.md
@@ -0,0 +1,25 @@
+# PDF2Text 
+
+
+## PDF to Images
+
+see [PDF2Image](./PDF2Image/)
+
+
+## Images to Text
+
+see [OCR](./OCR/)
+
+
+## View in browser
+
+After converting the images to text, use this file to view each page image on the left with its recognized text on the right.
+
+
+<div class="load_as_code_session" data-url="browse.html">
+  Loading content...
+</div>
+
+
+
+<script src="https://posetmage.com/assets/js/LoadAsCodeSession.js"></script>
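Note on the viewer described above: the contents of `browse.html` are not shown in this diff, so the following is only a sketch of one way to get the same side-by-side layout, generating a static page from the `.png` and `.txt` pairs. The function names and the table layout are assumptions, not the author's implementation.

```python
import html
from pathlib import Path


def side_by_side_page(pairs, title="OCR review"):
    """Render (image_path, text) pairs as a two-column HTML page."""
    rows = []
    for image_path, text in pairs:
        rows.append(
            "<tr>"
            f'<td><img src="{html.escape(str(image_path), quote=True)}" width="600"></td>'
            f"<td><pre>{html.escape(text)}</pre></td>"
            "</tr>"
        )
    return (
        "<!DOCTYPE html>\n"
        f'<html><head><meta charset="utf-8"><title>{html.escape(title)}</title></head>\n'
        "<body><table>\n" + "\n".join(rows) + "\n</table></body></html>\n"
    )


def collect_pairs(directory):
    """Pair each .png in `directory` with the same-named .txt file, if present."""
    pairs = []
    for image in sorted(Path(directory).glob("*.png")):
        text_file = image.with_suffix(".txt")
        text = text_file.read_text(encoding="utf-8") if text_file.exists() else ""
        pairs.append((image.name, text))
    return pairs
```

Writing `side_by_side_page(collect_pairs("."))` to a file in the same directory as the images gives a page that opens directly in a browser. The recognized text is HTML-escaped so that stray `<` or `&` characters from OCR do not break the markup.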