10 changes: 9 additions & 1 deletion .gitignore
@@ -1,3 +1,11 @@
RenAIssance_Transformer_OCR_Utsav_Rai/weights
RenAIssance_Transformer_OCR_Utsav_Rai/models
RenAIssance_Transformer_OCR_Utsav_Rai/quantized_model
RenAIssance_Transformer_OCR_Utsav_Rai/quantized_model
RenAIssance_SelfSupervisedLearning_OCR_YukinoriYamamoto/models/*.pt
RenAIssance_SelfSupervisedLearning_OCR_YukinoriYamamoto/models/*.pth
RenAIssance_SelfSupervisedLearning_OCR_YukinoriYamamoto/data/ssl/word_images/*
!RenAIssance_SelfSupervisedLearning_OCR_YukinoriYamamoto/data/ssl/word_images/.gitkeep
RenAIssance_SelfSupervisedLearning_OCR_YukinoriYamamoto/data/finetuning/*/word_images/*
!RenAIssance_SelfSupervisedLearning_OCR_YukinoriYamamoto/data/finetuning/perfecto/word_images/.gitkeep
!RenAIssance_SelfSupervisedLearning_OCR_YukinoriYamamoto/data/finetuning/ezcaray/word_images/.gitkeep
!RenAIssance_SelfSupervisedLearning_OCR_YukinoriYamamoto/data/finetuning/virtuosa/word_images/.gitkeep
76 changes: 61 additions & 15 deletions RenAIssance_SelfSupervisedLearning_OCR_YukinoriYamamoto/README.md
@@ -1,24 +1,70 @@
# Spanish Historical OCR using Self-Supervised Learning

## Overview
This repository implements a word-level OCR model for Renaissance Spanish documents using Self-Supervised Learning. The model was developed with reference to SeqCLR ([Aberdam A., et al., 2021](https://arxiv.org/abs/2012.10873)). According to the paper, SeqCLR employs a Contrastive Learning method, wherein its encoder learns to become robust against certain image transformations. The architecture includes a combination of ResNet50(or ViT tiny) and a 2-layer BiLSTM as the Encoder, and an Attention LSTM Decoder. At this point, the model achieves approximately 4% CER. This model can be tested in `test_model.ipynb`. For further information, please refer to my [blog](https://medium.com/@yamanko1234/historical-ocr-with-self-supervised-learning-c4f00da6637f).
This repository implements a word-level OCR model for Renaissance Spanish documents using self-supervised learning. The model was developed with reference to SeqCLR ([Aberdam A., et al., 2021](https://arxiv.org/abs/2012.10873)). According to the paper, SeqCLR uses contrastive learning so its encoder becomes robust to image transformations. The architecture combines a ResNet50 (or ViT tiny) and a 2-layer BiLSTM encoder with an attention LSTM decoder.

At this point, the model achieves approximately 4% CER. This model can be tested in `test_model.ipynb`. For more background, see the [project blog post](https://medium.com/@yamanko1234/historical-ocr-with-self-supervised-learning-c4f00da6637f).
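
The contrastive objective behind SeqCLR can be sketched with a minimal NT-Xent-style loss over paired frame embeddings. This is an illustrative NumPy sketch of the general technique, not the repository's actual training code; the array shapes, temperature, and pairing scheme are assumptions:

```python
import numpy as np

def ntxent_loss(z1, z2, temperature=0.5):
    """NT-Xent contrastive loss: row i of z1 and row i of z2 are a positive pair."""
    z = np.concatenate([z1, z2], axis=0)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # cosine-similarity space
    sim = z @ z.T / temperature
    n = z1.shape[0]
    sim[np.eye(2 * n, dtype=bool)] = -np.inf           # exclude self-similarity
    # index of each embedding's positive partner in the concatenated batch
    targets = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    logsumexp = np.log(np.exp(sim).sum(axis=1))
    return (logsumexp - sim[np.arange(2 * n), targets]).mean()

rng = np.random.default_rng(0)
view1 = rng.normal(size=(8, 16))
view2 = view1 + 0.01 * rng.normal(size=(8, 16))  # two "augmented views" of the same frames
print(ntxent_loss(view1, view2) < ntxent_loss(view1, rng.normal(size=(8, 16))))  # -> True
```

Minimizing this loss pulls embeddings of two augmentations of the same image together and pushes all other pairs apart, which is what makes the encoder robust to the image transformations mentioned above.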

## Portable Configuration
The default `config.json` now uses paths relative to this folder instead of machine-specific absolute paths. That makes the project easier to clone and configure on another machine.

Populate the directories below with your local datasets and checkpoints, or update `config.json` to match your own layout:

```text
RenAIssance_SelfSupervisedLearning_OCR_YukinoriYamamoto/
├── config.json
├── data/
│ ├── ssl/
│ │ └── word_images/
│ └── finetuning/
│ ├── perfecto/
│ │ ├── word_images/
│ │ └── word_images.csv
│ ├── ezcaray/
│ │ ├── word_images/
│ │ └── word_images.csv
│ └── virtuosa/
│ ├── word_images/
│ └── word_images.csv
├── models/
└── test_images/
```

The bundled `test_images/` folder is used as the default `test dataset` path so contributors can validate notebook setup without first changing that entry.

Before running the notebooks, you can verify the configured paths:

```bash
python check_config_paths.py
```

## File/Folder Descriptions
- **Tokenizer**: A folder containing Tokenizer pickle files for the Decoder training.
- **test_image**: A folder containing images used for testing.
- **Decoder.py**: Implementation of the SeqCLR’s Decoder.
- **ResNet.py**: Implementation of ResNet, a component of the Encoder.
- **config.json**: A JSON file that sets the configuration for training.
- **custom_dataset.py**: Implementation of a custom dataset used in training.
- **decoder_training.ipynb**: A notebook to train the Decoder.
- **encoder.py**: Implementation of the SeqCLR’s Encoder.
- **ViT_encoder.py** Implementation of ViT version Encoder.
- **encoder_training.ipynb**: A notebook to train the Encoder.
- **test_model.ipynb**: A notebook to test a saved model.
- **Tokenizer**: Pickle files used for decoder training and decoding.
- **data**: Local SSL and fine-tuning datasets referenced by `config.json`.
- **models**: Saved encoder and decoder checkpoints.
- **test_images**: Sample images used for testing.
- **Decoder.py**: SeqCLR decoder implementation.
- **ResNet.py**: ResNet implementation used by the encoder.
- **config.json**: Training and inference configuration.
- **check_config_paths.py**: Helper script that verifies configured dataset and model paths exist.
- **custom_dataset.py**: Custom dataset implementations used in training.
- **decoder_training.ipynb**: Notebook for decoder training and evaluation.
- **encoder.py**: SeqCLR encoder implementation.
- **ViT_encoder.py**: ViT-based encoder implementation; an optional alternative encoder selected via `config.json`.
- **encoder_training.ipynb**: Notebook for encoder training.
- **test_model.ipynb**: Notebook for testing a saved model.
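
The `Tokenizer` pickles map characters to integer tokens for the decoder. A hedged round-trip sketch of how such a mapping might be stored and used; the file name, dictionary format, and special tokens are assumptions for illustration, not verified against the repository:

```python
import pickle
from pathlib import Path

# Hypothetical char->token map; the repo's actual pickles may use a
# different vocabulary and special-token convention.
char_to_token = {"<sos>": 0, "<eos>": 1, "a": 2, "b": 3, "c": 4}
token_to_char = {v: k for k, v in char_to_token.items()}

path = Path("char_to_token.pkl")
path.write_bytes(pickle.dumps(char_to_token))
loaded = pickle.loads(path.read_bytes())

def encode(word):
    # Wrap a word with start/end tokens, as attention decoders typically expect
    return [loaded["<sos>"]] + [loaded[ch] for ch in word] + [loaded["<eos>"]]

print(encode("cab"))  # -> [0, 4, 2, 3, 1]
path.unlink()
```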

## Testing the Model
First, you need to install the dependencies:
```
Install the dependencies:

```bash
pip install -r requirements.txt
```
Then, you can test the saved model by executing the cells in `test_model.ipynb` one by one.

Confirm `config.json` points to valid paths for your environment:

```bash
python check_config_paths.py
```

Then run the cells in `test_model.ipynb`.
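
The reported ~4% CER is the character error rate: edit distance between the prediction and the ground truth, normalized by reference length. A minimal reference implementation of the standard metric (the notebook may compute it with a library instead):

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein distance / len(reference)."""
    m, n = len(reference), len(hypothesis)
    # Classic dynamic-programming edit distance, one row at a time
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n] / max(m, 1)

print(cer("quixote", "quixote"))  # -> 0.0
print(cer("quixote", "quizote"))  # -> 0.1428... (1 substitution over 7 chars)
```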
@@ -0,0 +1,85 @@
from __future__ import annotations

import json
from pathlib import Path


PROJECT_ROOT = Path(__file__).resolve().parent
CONFIG_PATH = PROJECT_ROOT / "config.json"


def resolve_path(raw_path: str | None) -> str:
    if raw_path is None:
        return "<not set>"
    return str((PROJECT_ROOT / raw_path).resolve())


def path_exists(raw_path: str | None) -> bool | None:
    if raw_path is None:
        return None
    return (PROJECT_ROOT / raw_path).exists()


def iter_config_paths(config: dict) -> list[tuple[str, str | None, bool]]:
    return [
        ("SSL.dataset 1", config["SSL"].get("dataset 1"), True),
        ("SSL.dataset 2", config["SSL"].get("dataset 2"), False),
        ("SSL.dataset 3", config["SSL"].get("dataset 3"), False),
        ("SSL.saved Encoder path", config["SSL"].get("saved Encoder path"), False),
        ("fine-tuning.dataset 1", config["fine-tuning"].get("dataset 1"), True),
        ("fine-tuning.dataset 1 csv", config["fine-tuning"].get("dataset 1 csv"), True),
        ("fine-tuning.dataset 2", config["fine-tuning"].get("dataset 2"), False),
        ("fine-tuning.dataset 2 csv", config["fine-tuning"].get("dataset 2 csv"), False),
        ("fine-tuning.dataset 3", config["fine-tuning"].get("dataset 3"), False),
        ("fine-tuning.dataset 3 csv", config["fine-tuning"].get("dataset 3 csv"), False),
        ("fine-tuning.test dataset", config["fine-tuning"].get("test dataset"), True),
        (
            "fine-tuning.Encoder path for fine-tuning",
            config["fine-tuning"].get("Encoder path for fine-tuning"),
            False,
        ),
        (
            "fine-tuning.Decoder path for fine-tuning",
            config["fine-tuning"].get("Decoder path for fine-tuning"),
            False,
        ),
        ("fine-tuning.char to token", config["fine-tuning"].get("char to token"), True),
        ("fine-tuning.token to char", config["fine-tuning"].get("token to char"), True),
        ("fine-tuning.saved Encoder path", config["fine-tuning"].get("saved Encoder path"), False),
        ("fine-tuning.saved Decoder path", config["fine-tuning"].get("saved Decoder path"), False),
    ]


def main() -> int:
    with CONFIG_PATH.open("r", encoding="utf-8") as config_file:
        config = json.load(config_file)

    print(f"Checking paths in {CONFIG_PATH}")
    print()

    missing_required = False
    for label, raw_path, must_exist in iter_config_paths(config):
        exists = path_exists(raw_path)
        absolute_path = resolve_path(raw_path)
        if exists is None:
            status = "OPTIONAL"
        elif exists:
            status = "OK"
        elif not must_exist:
            status = "OPTIONAL"
        else:
            status = "MISSING"
            missing_required = True
        print(f"[{status:<8}] {label}: {absolute_path}")

    print()
    if missing_required:
        print("Some configured paths are missing. Update config.json or place your data/models in the expected folders.")
        return 1

    print("All configured paths exist.")
    return 0


if __name__ == "__main__":
    raise SystemExit(main())
20 changes: 10 additions & 10 deletions RenAIssance_SelfSupervisedLearning_OCR_YukinoriYamamoto/config.json
@@ -3,23 +3,23 @@
        "ViT": false
    },
    "SSL": {
        "dataset 1": "/home/yukinori/Desktop/CRAFT-pytorch/self_supervised_data/word_images",
        "dataset 1": "data/ssl/word_images",
        "dataset 2": null,
        "dataset 3": null,
        "epoch size": 1,
        "Batch size": 32,
        "start lr": 0.001,
        "lr scheduler step size": 2,
        "saved Encoder path": "ViT_encoder.pth"
        "saved Encoder path": "models/ViT_encoder.pth"
    },
    "fine-tuning": {
        "dataset 1": "/home/yukinori/Desktop/CRAFT-pytorch/Perfecto/Perfecto/word_images",
        "dataset 1 csv": "/home/yukinori/Desktop/CRAFT-pytorch/Perfecto/Perfecto/word_images.csv",
        "dataset 2": "/home/yukinori/Desktop/CRAFT-pytorch/Ezcaray/word_images",
        "dataset 2 csv": "/home/yukinori/Desktop/CRAFT-pytorch/Ezcaray/word_images.csv",
        "dataset 3": "/home/yukinori/Desktop/CRAFT-pytorch/Virtuosa/word_images",
        "dataset 3 csv": "/home/yukinori/Desktop/CRAFT-pytorch/Virtuosa/word_images.csv",
        "test dataset": "/home/yukinori/Desktop/CRAFT-pytorch/self_supervised_data/word_images",
        "dataset 1": "data/finetuning/perfecto/word_images",
        "dataset 1 csv": "data/finetuning/perfecto/word_images.csv",
        "dataset 2": "data/finetuning/ezcaray/word_images",
        "dataset 2 csv": "data/finetuning/ezcaray/word_images.csv",
        "dataset 3": "data/finetuning/virtuosa/word_images",
        "dataset 3 csv": "data/finetuning/virtuosa/word_images.csv",
        "test dataset": "test_images",
        "fine-tune on other dataset": true,
        "Encoder path for fine-tuning": "models/trdg_Encoder_9_13.pt",
        "Decoder path for fine-tuning": "models/trdg_Decoder_9_13.pt",
@@ -33,4 +33,4 @@
        "saved Encoder path": "models/trdg_fine_tuned_Encoder_withoutSSL_9_13.pt",
        "saved Decoder path": "models/trdg_fine_tuned_Decoder_withoutSSL_9_13.pt"
    }
}
}
@@ -0,0 +1 @@

@@ -0,0 +1,13 @@
Place local training data under this directory.

Expected layout:

- `data/ssl/word_images/`
- `data/finetuning/perfecto/word_images/`
- `data/finetuning/perfecto/word_images.csv`
- `data/finetuning/ezcaray/word_images/`
- `data/finetuning/ezcaray/word_images.csv`
- `data/finetuning/virtuosa/word_images/`
- `data/finetuning/virtuosa/word_images.csv`

These paths match the defaults in `config.json`.
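
Each `word_images.csv` pairs a label with an image filename under a `label,image` header (matching the stub CSVs committed alongside this README). A sketch of how such a file could be read; the sample rows are invented for illustration, and the actual loader in `custom_dataset.py` may differ:

```python
import csv
import io

# Minimal stand-in for a word_images.csv file; only the header
# ("label,image") matches the committed stubs, the rows are made up.
sample = "label,image\nseñor,word_0001.png\nvirtud,word_0002.png\n"

rows = list(csv.DictReader(io.StringIO(sample)))
labels = [row["label"] for row in rows]
images = [row["image"] for row in rows]

print(labels)  # -> ['señor', 'virtud']
print(images)  # -> ['word_0001.png', 'word_0002.png']
```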
@@ -0,0 +1 @@

@@ -0,0 +1 @@

@@ -0,0 +1 @@
label,image
@@ -0,0 +1 @@

@@ -0,0 +1 @@

@@ -0,0 +1 @@
label,image
@@ -0,0 +1 @@

@@ -0,0 +1 @@

@@ -0,0 +1 @@
label,image
@@ -0,0 +1 @@

@@ -0,0 +1 @@

@@ -0,0 +1 @@

@@ -0,0 +1 @@

Binary file not shown.
54 changes: 43 additions & 11 deletions RenAIssance_Transformer_OCR_Utsav_Rai/code/app/app_streamlit.py
@@ -1,9 +1,11 @@
import sys
import os
# Add CRAFT directory to sys.path for craft imports
CRAFT_DIR = os.path.abspath(os.path.join(os.path.dirname(__file__), '..', 'CRAFT'))
if CRAFT_DIR not in sys.path:
    sys.path.insert(0, CRAFT_DIR)

APP_DIR = os.path.dirname(os.path.abspath(__file__))
CRAFT_DIR = os.path.abspath(os.path.join(APP_DIR, "..", "CRAFT"))
for path in (APP_DIR, CRAFT_DIR):
    if os.path.isdir(path) and path not in sys.path:
        sys.path.insert(0, path)
import torch
import torch.backends.cudnn as cudnn
from collections import OrderedDict
@@ -17,14 +19,28 @@
from PIL import Image, ImageEnhance
import cv2
import numpy as np
import os
import math
from transformers import TrOCRProcessor, VisionEncoderDecoderModel
import streamlit as st
from deskew import determine_skew

st.set_page_config(layout="wide")


def resolve_existing_path(env_var, *candidates):
    override = os.getenv(env_var)
    if override:
        return override

    for candidate in candidates:
        if os.path.exists(candidate):
            return candidate

    raise FileNotFoundError(
        f"Could not resolve a path for {env_var or 'required asset'}. "
        f"Tried: {', '.join(candidates)}"
    )

def copyStateDict(state_dict):
    if list(state_dict.keys())[0].startswith("module"):
        start_idx = 1
@@ -39,7 +55,11 @@ def copyStateDict(state_dict):
@st.cache_resource
def load_craft_model():
    # Define the path to the pre-trained CRAFT model weights
    trained_model_path = '../../weights/craft_mlt_25k.pth'
    trained_model_path = resolve_existing_path(
        "RENAISSANCE_CRAFT_MODEL_PATH",
        os.path.join(APP_DIR, "weights", "craft_mlt_25k.pth"),
        os.path.abspath(os.path.join(APP_DIR, "..", "..", "weights", "craft_mlt_25k.pth")),
    )

    # Initialize the CRAFT model
    net = CRAFT() # initialize
@@ -57,7 +77,11 @@ def load_craft_model():
    refine = True # Set to True if using refine_net
    if refine:
        from refinenet import RefineNet
        refiner_model_path = '../../weights/craft_refiner_CTW1500.pth' # Update the path
        refiner_model_path = resolve_existing_path(
            "RENAISSANCE_CRAFT_REFINER_PATH",
            os.path.join(APP_DIR, "weights", "craft_refiner_CTW1500.pth"),
            os.path.abspath(os.path.join(APP_DIR, "..", "..", "weights", "craft_refiner_CTW1500.pth")),
        )
        refine_net = RefineNet()
        refine_net.load_state_dict(copyStateDict(torch.load(refiner_model_path, map_location=device)))
        refine_net.to(device)
@@ -109,9 +133,17 @@ def test_net(net, image, text_threshold, link_threshold, low_text, *, cuda, poly
@st.cache_resource
def load_ocr_model():
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    # Update path to point to the correct location of the OCR weights
    model_path = "../../models"
    processor_path = "../../models"
    model_path = resolve_existing_path(
        "RENAISSANCE_OCR_MODEL_DIR",
        os.path.join(APP_DIR, "models"),
        os.path.abspath(os.path.join(APP_DIR, "..", "..", "models")),
    )
    processor_path = resolve_existing_path(
        "RENAISSANCE_OCR_PROCESSOR_DIR",
        model_path,
        os.path.join(APP_DIR, "models"),
        os.path.abspath(os.path.join(APP_DIR, "..", "..", "models")),
    )
    processor = TrOCRProcessor.from_pretrained(processor_path)
    model = VisionEncoderDecoderModel.from_pretrained(model_path).to(device)
    return processor, model, device
@@ -771,4 +803,4 @@ def get_virtual_page(pdf_document, virtual_index, dpi, **kwargs):
        st.write("No image to display.")

else:
    st.info("Please upload a PDF file from the left panel.")
    st.info("Please upload a PDF file from the left panel.")