This issue is about github.com/humanai-foundation/RenAIssance/blob/main/RenAIssance_CRNN_OCR_Shashank_Shekhar_Singh/utility/utils.py.
1. Suggestions / potential issues
- Add error checking for `cv2.imread(...)` and `pdf_to_images(pdf_path, output_folder)` to handle missing/invalid files.
- `remove_punctuation()` (using `string.punctuation` and `str.translate`) seems unused/commented out; should it be removed or wired into the pipeline?
- In `create_csv_from_folder`, the check `if file_name.lower() == ".png": continue` looks unnecessary / possibly incorrect: it only matches a file whose entire name is `.png`, rather than testing whether the name contains or ends with `.png`.
- Several magic numbers (e.g. `last_textfile_number = 7`, `vertical_distance_between_lines = 10`, `count_semicolon - 2`) aren’t documented; are these dataset-specific or heuristics that should be configurable?
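
To make the first and third points concrete, here is a minimal sketch of the kind of checks I have in mind. The helper names `read_image_checked` and `is_png` are hypothetical and do not exist in `utils.py`; the only library behaviour relied on is that `cv2.imread` returns `None` instead of raising when it cannot read a file.

```python
from pathlib import Path


def read_image_checked(path, reader=None):
    """Read an image and fail loudly instead of silently returning None.

    `reader` is a hypothetical injection point so the check can be tried
    without OpenCV installed; by default it falls back to cv2.imread.
    """
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"Image not found: {path}")
    if reader is None:
        import cv2  # imported lazily; only needed for the default reader
        reader = cv2.imread
    image = reader(str(path))
    if image is None:  # cv2.imread signals an unreadable file with None
        raise ValueError(f"Could not decode image: {path}")
    return image


def is_png(file_name):
    """Test the extension, not the whole file name."""
    return Path(file_name).suffix.lower() == ".png"
```

The same pattern (check the path first, then check the return value) would presumably apply to `pdf_to_images`, but I have not looked at what its underlying PDF library does on failure.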
2. Clarification questions
- In `split_and_save_image`, there’s no handling for the “in‑between” case `350 <= width <= 450`; are such images guaranteed not to exist, or should they be treated as single/dual pages?
- The bounding box coordinate unpacking repeats variables (`x_min, y_min, x_max, y_min, x_max, y_max, x_min, y_max`); is the input format really repeated points, or is this an oversight?
- The augmentation helpers (`rotation_aug`, `gaussian_noise_aug`) write outputs into the same folder without checks; is the expected workflow to run them only once on a clean folder, or would it make sense to separate input/output directories?
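
On the bounding box question: if the input really is four corner points flattened into eight numbers (that is my assumption, not something I have confirmed from the data), the unpacking could avoid the repeated names entirely. A rough sketch, with `corners_to_bbox` as a hypothetical helper name:

```python
def corners_to_bbox(coords):
    """Collapse four (x, y) corner points into (x_min, y_min, x_max, y_max).

    Assumes `coords` is a flat sequence x1, y1, x2, y2, x3, y3, x4, y4.
    """
    if len(coords) != 8:
        raise ValueError(f"Expected 8 coordinates, got {len(coords)}")
    xs = coords[0::2]  # every even index is an x value
    ys = coords[1::2]  # every odd index is a y value
    return min(xs), min(ys), max(xs), max(ys)


# Example: an axis-aligned box listed corner by corner.
print(corners_to_bbox([10, 20, 110, 20, 110, 60, 10, 60]))
```

This also works if the corners are slightly skewed, since it takes the extremes rather than trusting any one point.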
I’m happy to open follow-up PRs to adjust these if you confirm which parts are intentional vs accidental.
Thank you