Questions and minor concerns about `utils.py` design choices #70

This issue is about `github.com/humanai-foundation/RenAIssance/blob/main/RenAIssance_CRNN_OCR_Shashank_Shekhar_Singh/utility/utils.py`.

#### 1. Suggestions / potential issues

- Add error checking for `cv2.imread(...)` (which returns `None` instead of raising when a file is missing or unreadable) and for `pdf_to_images(pdf_path, output_folder)` to handle missing/invalid files.  
- `remove_punctuation()` (using `string.punctuation` and `str.translate`) seems unused/commented out; should it be removed or wired into the pipeline?  
- In `create_csv_from_folder`, the check `if file_name.lower() == ".png": continue` looks unnecessary / possibly incorrect: it only matches a file named exactly `.png`, rather than testing the extension (e.g. with `endswith(".png")`).  
- Several magic numbers (e.g. `last_textfile_number = 7`, `vertical_distance_between_lines = 10`, `count_semicolon - 2`) aren’t documented; are these dataset-specific or heuristics that should be configurable?
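
To make the first and third points concrete, here is a minimal sketch of what I have in mind. The helper names `is_png` and `read_image_checked`, and the `reader` parameter, are hypothetical and not part of the current `utils.py`; whether `.png` files should be skipped or kept is for the maintainers to decide.

```python
from pathlib import Path


def is_png(file_name: str) -> bool:
    """Return True when the name ends with a .png extension (case-insensitive)."""
    return file_name.lower().endswith(".png")


def read_image_checked(path, reader=None):
    """Read an image and raise instead of silently passing None along.

    `reader` is a hypothetical injection point used only in this sketch.
    When omitted, cv2.imread is used, which returns None on failure.
    """
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"Image not found: {path}")
    if reader is None:
        import cv2  # imported lazily so is_png works without OpenCV installed
        reader = cv2.imread
    image = reader(str(path))
    if image is None:
        raise ValueError(f"Could not decode image: {path}")
    return image
```

A similar existence check on `pdf_path` before calling `pdf_to_images` would cover the missing-file case there as well.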

#### 2. Clarification questions

- In `split_and_save_image`, there’s no handling for the “in‑between” case `350 <= width <= 450`; are such images guaranteed not to exist, or should they be treated as single/dual pages?  
- The bounding box coordinate unpacking repeats variables (`x_min, y_min, x_max, y_min, x_max, y_max, x_min, y_max`); is the input format really repeated points, or is this an oversight?  
- The augmentation helpers (`rotation_aug`, `gaussian_noise_aug`) write outputs into the same folder without checks; is the expected workflow to run them only once on a clean folder, or would it make sense to separate input/output directories?
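
For the first and third questions, a rough sketch of one possible direction, under stated assumptions: I am assuming widths below 350 mean a single page and widths above 450 mean a double page (the issue only knows the two thresholds, not which side is which), and the names `classify_page_width` and `augmented_output_path` are hypothetical.

```python
from pathlib import Path

# Thresholds taken from the existing checks; which side means "single"
# versus "dual" is an assumption of this sketch.
SINGLE_PAGE_MAX_WIDTH = 350
DUAL_PAGE_MIN_WIDTH = 450


def classify_page_width(width: int) -> str:
    """Label a width as single, dual, or ambiguous (the in-between range)."""
    if width < SINGLE_PAGE_MAX_WIDTH:
        return "single"
    if width > DUAL_PAGE_MIN_WIDTH:
        return "dual"
    return "ambiguous"  # 350 <= width <= 450: surfaced explicitly, not dropped


def augmented_output_path(src: Path, output_dir: Path, tag: str) -> Path:
    """Build a destination path in a separate folder and refuse to overwrite."""
    output_dir.mkdir(parents=True, exist_ok=True)
    dst = output_dir / f"{src.stem}_{tag}{src.suffix}"
    if dst.exists():
        raise FileExistsError(f"Refusing to overwrite: {dst}")
    return dst
```

With something like this, the augmentation helpers could be re-run safely, and ambiguous widths could be logged or routed to whichever branch the maintainers prefer.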

I’m happy to open follow-up PRs to adjust these if you confirm which parts are intentional vs accidental.

Thank you
