pdf to image
HomunMage committed Nov 16, 2024
1 parent c69caeb commit d2caaf6
Showing 9 changed files with 511 additions and 2 deletions.
120 changes: 120 additions & 0 deletions Productivity/Converter/PDF2Text/OCR/README.md
@@ -0,0 +1,120 @@
# OCR


## EasyOCR

Check the supported language codes at https://www.jaided.ai/easyocr/

<div class="load_as_code_session" data-url="easy_ocr.py">
Loading content...
</div>



## other OCR with GPU

The following open-source OCR options can process Traditional Chinese (zh-TW). Most of them can use an NVIDIA GPU, either for recognition itself or for pre-processing:

---

### 1. **Tesseract OCR (with GPU Pre-processing)**
- **Description**: Tesseract is a well-established open-source OCR engine that supports many languages, including Traditional Chinese. It does not natively support GPU acceleration, but you can pair it with GPU-accelerated pre-processing (for example, OpenCV built with CUDA) to speed up the overall pipeline.
- **Key Features**:
- High customization and language support (including Traditional Chinese).
- Works well for clean, printed text.
- **Limitations**:
- Relatively slow compared to modern AI-based OCR solutions.
- **Setup**:
- Install `tesseract-ocr` and the Traditional Chinese language data package (`chi_tra`).
- Can be used with Python via the `pytesseract` library.
- **GPU Option**:
- Pre-process images using GPU-accelerated libraries like OpenCV with CUDA.
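
A minimal sketch of the `pytesseract` route, assuming the `tesseract` binary plus the `chi_tra` and `eng` language data are installed. The helper names `tesseract_lang` and `ocr_image` are illustrative only and are not part of any library; `pytesseract` and Pillow are imported lazily inside `ocr_image`.

```python
def tesseract_lang(codes):
    """Join Tesseract language codes with '+', the form the engine expects."""
    return "+".join(codes)


def ocr_image(path, codes=("chi_tra", "eng")):
    """Run Tesseract on one image and return the recognized text.

    Hypothetical helper: requires the tesseract binary and language data.
    """
    # Imported here so tesseract_lang() works without Tesseract installed.
    import pytesseract
    from PIL import Image

    return pytesseract.image_to_string(Image.open(path), lang=tesseract_lang(codes))
```

Combining `chi_tra` with `eng` tends to help on pages that mix Chinese and English text.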

---

### 2. **EasyOCR**
- **Description**: EasyOCR is a modern, AI-powered OCR library written in PyTorch. It supports GPU acceleration out of the box and handles Traditional Chinese well.
- **Key Features**:
- Multilingual support, including Traditional Chinese (language code `ch_tra`).
- Lightweight and easy to set up.
- Can leverage NVIDIA GPUs for faster processing.
- **Setup**:
1. Install via pip: `pip install easyocr`.
2. Run the code:
```python
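import easyocr
# EasyOCR's language code for Traditional Chinese is 'ch_tra' (not 'zh-tw')
reader = easyocr.Reader(['ch_tra', 'en'], gpu=True)
# Returns a list of (bounding_box, text, confidence) tuples
result = reader.readtext('path_to_image')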
```
- **Limitations**:
- Struggles with very complex or heavily distorted handwriting.
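
`readtext` returns one `(bounding_box, text, confidence)` tuple per detected region. As a rough sketch, low-confidence detections can be dropped before saving the text. The `keep_confident` and `ocr_page` helpers and the 0.5 threshold are assumptions for illustration, not EasyOCR defaults.

```python
def keep_confident(detections, min_conf=0.5):
    """Return the text of detections whose confidence is at least min_conf."""
    return [text for _bbox, text, conf in detections if conf >= min_conf]


def ocr_page(image_path, min_conf=0.5):
    """Hypothetical wrapper: OCR one image and keep only confident lines."""
    # Imported lazily so keep_confident() can be used without EasyOCR installed.
    import easyocr

    reader = easyocr.Reader(['ch_tra', 'en'], gpu=True)
    return keep_confident(reader.readtext(image_path), min_conf)
```

A suitable threshold depends on scan quality, so it is worth checking a few pages by eye before filtering a whole batch.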

---

### 3. **PaddleOCR**
- **Description**: PaddleOCR is a powerful OCR tool developed by Baidu. It supports GPU acceleration using NVIDIA GPUs and provides excellent accuracy, especially for Chinese text.
- **Key Features**:
- Optimized for Chinese languages.
- High accuracy for both printed and handwritten text.
- Built-in tools for image pre-processing and text detection.
- **Setup**:
1. Install the PaddleOCR package:
```bash
pip install paddleocr
pip install paddlepaddle-gpu # Ensure GPU support
```
2. Use the library:
```python
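from paddleocr import PaddleOCR
# 'chinese_cht' selects the Traditional Chinese model; 'ch' is Simplified Chinese + English.
# use_angle_cls=True loads the angle classifier that cls=True relies on (PaddleOCR 2.x API).
ocr = PaddleOCR(use_angle_cls=True, use_gpu=True, lang='chinese_cht')
result = ocr.ocr('path_to_image', cls=True)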
```
- **Limitations**:
- Requires installing PaddlePaddle, which can have specific system requirements.
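
The value returned by `ocr.ocr(...)` is nested. The sketch below assumes the PaddleOCR 2.x layout: one entry per input image, each a list of `[box, (text, score)]` items, or `None` when nothing was detected. Other versions may differ. `paddle_lines` is a hypothetical helper name.

```python
def paddle_lines(result):
    """Flatten a PaddleOCR 2.x style result into (text, score) pairs."""
    lines = []
    for page in result:
        # A page can be None when no text was detected on that image.
        for _box, (text, score) in page or []:
            lines.append((text, score))
    return lines
```

It is worth printing the raw `result` once on your installed version to confirm it matches this layout before relying on the helper.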

---

### 4. **OCR with OpenCV and Deep Learning Models**
- **Description**: OpenCV can be combined with custom deep learning OCR models such as CRNN (Convolutional Recurrent Neural Network) or SAR (Show, Attend and Read, an attention-based recognizer). These models can be trained or fine-tuned on Traditional Chinese datasets.
- **Key Features**:
- Customizable for your specific needs.
- Full GPU acceleration using NVIDIA CUDA.
- **Setup**:
- Use OpenCV with CUDA for pre-processing (e.g., noise removal, binarization).
- Combine with a deep learning framework (e.g., PyTorch or TensorFlow) for OCR.
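
To make the binarization step concrete, here is a plain-Python illustration of Otsu's method, which picks the gray level that best separates dark text from a light background. This is a teaching sketch with hypothetical function names, not the GPU path: in practice the same result comes from OpenCV's `cv2.threshold` with the `cv2.THRESH_OTSU` flag.

```python
def otsu_threshold(pixels):
    """Return the 0-255 threshold that maximizes between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))
    sum_bg = 0
    weight_bg = 0
    best_t, best_var = 0, -1.0

    for t in range(256):
        weight_bg += hist[t]
        if weight_bg == 0:
            continue  # no pixels at or below t yet
        weight_fg = total - weight_bg
        if weight_fg == 0:
            break  # all pixels are already in the background class
        sum_bg += t * hist[t]
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        between_var = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if between_var > best_var:
            best_var, best_t = between_var, t
    return best_t


def binarize(pixels, threshold):
    """Map each gray value to 0 (at or below threshold) or 255 (above)."""
    return [255 if p > threshold else 0 for p in pixels]
```

Both functions take a flat list of 8-bit gray values, so a 2-D image would need to be flattened first.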

---

### 5. **TrOCR by Microsoft**
- **Description**: TrOCR is a transformer-based OCR model from Microsoft that runs efficiently with GPU acceleration. The published `microsoft/trocr-*` checkpoints are trained on English text, so Chinese recognition requires a fine-tuned checkpoint.
- **Key Features**:
- Strong reported accuracy on printed and handwritten English benchmarks.
- Uses transformers for improved contextual understanding.
- **Setup**:
1. Install the `transformers` library:
```bash
pip install transformers torch pillow
```
2. Use the model:
```python
from transformers import TrOCRProcessor, VisionEncoderDecoderModel
from PIL import Image
import torch

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten").cuda()

image = Image.open('path_to_image').convert("RGB")
pixel_values = processor(images=image, return_tensors="pt").pixel_values.cuda()
generated_ids = model.generate(pixel_values)
text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(text)
```
- **Limitations**:
- Requires fine-tuning for best performance on Traditional Chinese.

---



<script src="https://posetmage.com/assets/js/LoadAsCodeSession.js"></script>
38 changes: 38 additions & 0 deletions Productivity/Converter/PDF2Text/OCR/easy_ocr.py
@@ -0,0 +1,38 @@
import easyocr
import torch

# Function to check if GPU is available
def check_gpu():
    if torch.cuda.is_available():
        print("GPU is available and will be used.")
    else:
        print("GPU is not available. Using CPU.")

# Check GPU availability
check_gpu()

# Initialize EasyOCR reader for Traditional Chinese (zh-tw)
reader = easyocr.Reader(['ch_tra', 'en'], gpu=True)  # Set gpu=True to ensure it uses GPU

# Loop through image files from 001 to 274
for i in range(1, 275):  # Loop from 1 to 274
    # Format the image file name
    image_file = f'output-{i:03}.png'  # This formats numbers with leading zeros (e.g., 001, 002, ..., 274)

    try:
        # Perform OCR on the image
        result = reader.readtext(image_file)

        # Create corresponding .txt file name
        output_file = image_file.replace('.png', '.txt')  # Replace .png with .txt

        # Save the recognized text to a .txt file
        with open(output_file, 'w', encoding='utf-8') as f:
            for detection in result:
                text = detection[1]  # The recognized text
                f.write(text + '\n')  # Write text to file, each on a new line

        print(f'Text from {image_file} saved to {output_file}')

    except Exception as e:
        print(f"Error processing {image_file}: {e}")
77 changes: 77 additions & 0 deletions Productivity/Converter/PDF2Text/PDF2Image/README.md
@@ -0,0 +1,77 @@
# PDF2Image

## PDF2Image by python

<div class="load_as_code_session" data-url="pdf2img.py">
Loading content...
</div>


## PDF2Image by CLI

To convert each page of a PDF into separate image files using a CLI (Command Line Interface) tool, you can use **`pdftoppm`**, part of the `poppler-utils` package, or **`ImageMagick`**. Here are solutions using both:

---

### **Option 1: Using `pdftoppm`**
1. **Install `poppler-utils`** (if not installed):
- On Debian/Ubuntu:
```bash
sudo apt update
sudo apt install poppler-utils
```
- On macOS (via Homebrew):
```bash
brew install poppler
```

2. **Convert PDF to Images**:
```bash
pdftoppm -png input.pdf output
```
- `-png`: Sets the output format to PNG (use `-jpeg` for JPEG).
- `input.pdf`: The input PDF file.
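- `output`: The prefix for output image files. Page numbers are zero-padded to the number of digits in the PDF's page count (e.g., `output-1.png` for a PDF with fewer than 10 pages, `output-001.png` for one with 100 to 999 pages).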
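
When scripting around `pdftoppm`, it helps to predict the output file names. The sketch below assumes the page number is zero-padded to the number of digits in the PDF's page count, so a 274-page PDF yields `output-001.png`, the pattern that `easy_ocr.py` in this commit expects. `pdftoppm_name` is a hypothetical helper, and it is worth confirming the padding against your installed poppler version.

```python
def pdftoppm_name(prefix, page, total_pages, ext="png"):
    """Predict the file name pdftoppm writes for a given 1-based page."""
    width = len(str(total_pages))  # digits needed for the last page number
    return f"{prefix}-{page:0{width}d}.{ext}"
```

For example, `pdftoppm_name("output", 1, 274)` gives `output-001.png`.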

---

### **Option 2: Using ImageMagick**
1. **Install ImageMagick**:
- On Debian/Ubuntu:
```bash
sudo apt update
sudo apt install imagemagick
```
- On macOS (via Homebrew):
```bash
brew install imagemagick
```

2. **Convert PDF to Images**:
```bash
convert -density 300 input.pdf page-%03d.png
```
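- `-density 300`: Sets the rendering resolution to 300 DPI (higher values produce sharper but larger images).
- `input.pdf`: The input PDF file.
- `page-%03d.png`: Output filenames with a three-digit page index. ImageMagick starts counting at 0, so the first page is `page-000.png`, the second `page-001.png`, and so on.
- On ImageMagick 7, `convert` is deprecated in favor of the `magick` command, which accepts the same arguments.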

---

### **Advanced Options**
- To extract specific pages with `pdftoppm`, use the `-f` (from) and `-l` (last) flags:
```bash
pdftoppm -png -f 2 -l 5 input.pdf output
```
This converts pages 2 to 5 only.

- To customize image size or quality in `ImageMagick`:
```bash
convert -density 300 -quality 90 input.pdf page-%03d.png
```
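- `-quality 90`: For JPEG output, sets the compression quality (1-100). PNG is lossless, so for PNG output the value instead selects the zlib compression level and filter, which affects file size and speed rather than image quality.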
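
The `pdftoppm` page-range flags above can also be driven from Python. This is a small sketch using only the standard library; `pdftoppm_cmd` and `convert_pdf` are hypothetical names, and the 300 DPI default (passed through `pdftoppm`'s `-r` flag) is an assumption chosen to match the ImageMagick example.

```python
import subprocess


def pdftoppm_cmd(pdf, prefix, first=None, last=None, dpi=300):
    """Build the pdftoppm argument list for PNG output."""
    cmd = ["pdftoppm", "-png", "-r", str(dpi)]
    if first is not None:
        cmd += ["-f", str(first)]  # first page to convert
    if last is not None:
        cmd += ["-l", str(last)]  # last page to convert
    return cmd + [pdf, prefix]


def convert_pdf(pdf, prefix, **options):
    """Run pdftoppm; raises CalledProcessError if the tool exits with an error."""
    subprocess.run(pdftoppm_cmd(pdf, prefix, **options), check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.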

Both tools are efficient and widely available on Linux, macOS, and Windows (via WSL or native binaries).



<script src="https://posetmage.com/assets/js/LoadAsCodeSession.js"></script>
File renamed without changes.
25 changes: 25 additions & 0 deletions Productivity/Converter/PDF2Text/README.md
@@ -0,0 +1,25 @@
# PDF2Text


## PDF to Images

see [PDF2Image](./PDF2Image/)


## Images to Text

see [OCR](./OCR/)


## View in browser

After converting the images to text, you can use this file to view each image on the left with its recognized text on the right.


<div class="load_as_code_session" data-url="browse.html">
Loading content...
</div>



<script src="https://posetmage.com/assets/js/LoadAsCodeSession.js"></script>