[Question] Imbalanced data for classifiers in classification tasks #79

@omihub777

Thank you for all your hard work.

I've noticed that in your implementation of ClassificationEvaluator, classifiers such as LogisticRegression and kNN are trained on the entire training set, even for extremely imbalanced data such as the amazon_counterfactual dataset, where 90% of the labels are 0 (stats-ja). The original MTEB addresses this by undersampling the training set to a balanced label distribution before fitting LogisticRegression.

Could you elaborate on your design choice for training on the entire dataset? Are there specific reasons for this approach? If I am missing something, feel free to correct me. Thank you!
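
For concreteness, here is a minimal sketch of the kind of undersampling I mean. It is not MTEB's actual code: the helper name `balanced_undersample_indices` and the toy labels are made up for illustration, and it uses only the Python standard library.

```python
import random
from collections import defaultdict


def balanced_undersample_indices(labels, seed=0):
    """Hypothetical helper: return indices that keep the same number of
    examples for every label (the size of the rarest label)."""
    rng = random.Random(seed)

    # Group example indices by their label.
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    # Every label is cut down to the size of the smallest one.
    n_per_label = min(len(idxs) for idxs in by_label.values())

    keep = []
    for idxs in by_label.values():
        keep.extend(rng.sample(idxs, n_per_label))  # sample without replacement
    rng.shuffle(keep)  # avoid leaving the examples grouped by label
    return keep


# Toy labels with a 90/10 split, standing in for an imbalanced training set.
y_train = [0] * 900 + [1] * 100
keep = balanced_undersample_indices(y_train)
y_balanced = [y_train[i] for i in keep]
```

The returned indices would then be used to subset both the embeddings and the labels before fitting the classifier, so that each label contributes the same number of training examples.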
