Questions and minor concerns about utils.py design choices #70

@VibezCoder

This issue is about github.com/humanai-foundation/RenAIssance/blob/main/RenAIssance_CRNN_OCR_Shashank_Shekhar_Singh/utility/utils.py.

1. Suggestions / potential issues

  • Add error checking for cv2.imread(...) and pdf_to_images(pdf_path, output_folder) to handle missing/invalid files.
  • remove_punctuation() (using string.punctuation and str.translate) seems unused/commented out; should it be removed or wired into the pipeline?
  • In create_csv_from_folder, the check if file_name.lower() == ".png": continue looks unnecessary or possibly incorrect: it compares the whole file name to ".png" rather than testing whether the name ends with (or contains) that extension.
  • Several magic numbers (e.g. last_textfile_number = 7, vertical_distance_between_lines = 10, count_semicolon - 2) aren’t documented; are these dataset-specific values, or heuristics that should be configurable?
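
To make the first and third points concrete, here is a minimal sketch of what I have in mind. The helper names is_png and read_image_checked are hypothetical (they are not in utils.py), and I am assuming the intent of the .png check is to filter by extension. The only library behaviour relied on is that cv2.imread returns None, rather than raising, when a file cannot be read.

```python
import os


def is_png(file_name: str) -> bool:
    """Return True when the file name has a .png extension (case-insensitive).

    Hypothetical helper: unlike `file_name.lower() == ".png"`, this looks at
    the extension only, so "page_1.PNG" matches.
    """
    return os.path.splitext(file_name)[1].lower() == ".png"


def read_image_checked(path: str):
    """Load an image and fail loudly on missing or unreadable files.

    Hypothetical wrapper around cv2.imread, not the current utils.py code.
    """
    if not os.path.isfile(path):
        raise FileNotFoundError(f"Image not found: {path}")

    # Imported here so the path check above works even without OpenCV installed.
    import cv2

    image = cv2.imread(path)
    if image is None:  # cv2.imread signals failure by returning None
        raise ValueError(f"Could not decode image: {path}")
    return image
```

The same pattern (check the path first, then validate the return value) could probably be applied to pdf_to_images(pdf_path, output_folder) as well, but I have not checked how that function reports failures.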

2. Clarification questions

  • In split_and_save_image, there’s no handling for the “in‑between” case 350 <= width <= 450; are such images guaranteed not to exist, or should they be treated as single or dual pages?
  • The bounding box coordinate unpacking repeats variables (x_min, y_min, x_max, y_min, x_max, y_max, x_min, y_max); is the input format really repeated points, or is this an oversight?
  • The augmentation helpers (rotation_aug, gaussian_noise_aug) write outputs into the same folder without checks; is the expected workflow to run them only once on a clean folder, or would it make sense to separate input/output directories?
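
To illustrate the first two questions, here is a rough sketch of the behaviour I would expect if these are oversights. Everything in it is an assumption on my side: the function names are hypothetical, I am guessing that widths below 350 mean a single page and widths above 450 mean a dual page, and I am guessing that the eight coordinates describe the four corners of an axis-aligned box.

```python
def classify_page_width(width: int, single_max: int = 350, double_min: int = 450) -> str:
    """Classify an image by width, making the in-between case explicit.

    Assumption: narrow images are single pages and wide images are dual pages.
    The thresholds mirror the 350 / 450 values mentioned above.
    """
    if width < single_max:
        return "single"
    if width > double_min:
        return "double"
    # 350 <= width <= 450: surface the case instead of silently skipping it.
    return "ambiguous"


def bbox_from_quad(coords):
    """Reduce eight corner coordinates (x1, y1, ..., x4, y4) to one box.

    Returns (x_min, y_min, x_max, y_max), independent of corner order.
    """
    if len(coords) != 8:
        raise ValueError(f"Expected 8 values, got {len(coords)}")
    xs = coords[0::2]  # every even index is an x coordinate
    ys = coords[1::2]  # every odd index is a y coordinate
    return min(xs), min(ys), max(xs), max(ys)
```

If the input really is four corner points, taking the min/max like this avoids relying on repeated variable names during unpacking, and an explicit "ambiguous" result would let the caller decide whether to log, skip, or fall back to one of the two page types.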

I’m happy to open follow-up PRs to adjust these if you confirm which parts are intentional vs accidental.

Thank you
