Refactoring training for readability #296
base: main
Conversation
…dling, tokenization, batch allocation and training distribution
…side the train() function
Could we remove the old code? Keeping it around as comments makes the file harder to read.
Thank you for the PR, by the way.
@musab-mk Np. Check my new commit; I removed the commented-out code.
@@ -248,6 +265,55 @@ def train():
        print_rank0("***** HERE ARE SOME EXAMPLES FROM EVALUATION ***")
        training_utils.print_some_examples(eval_dataset, tokenizer)

    # Dynamic batch size based on max tokens per batch
While a dynamic batch size might seem like a good idea for dealing with memory issues, it would cause instabilities in training due to the difference in gradient updates per conversation. I would prefer stability over memory efficiency: stable updates at a higher GPU cost are preferable to cheaper and faster training.
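For context, the kind of token-budget batching being debated looks roughly like the sketch below (a minimal PyTorch sampler; the class name and the `max_tokens_per_batch` parameter are illustrative, not the PR's actual API). Because batch sizes vary with sequence lengths, each optimizer step averages gradients over a different number of conversations, which is the instability concern raised above:

```python
from torch.utils.data import Sampler

class TokenBudgetBatchSampler(Sampler):
    """Groups example indices so each batch stays under a token budget.
    Batch sizes therefore vary from batch to batch."""

    def __init__(self, lengths, max_tokens_per_batch):
        self.lengths = lengths          # token count per example
        self.max_tokens = max_tokens_per_batch

    def __iter__(self):
        batch, budget = [], 0
        for idx, n_tokens in enumerate(self.lengths):
            # Start a new batch once the budget would be exceeded.
            # (A single over-long example still gets its own batch.)
            if batch and budget + n_tokens > self.max_tokens:
                yield batch
                batch, budget = [], 0
            batch.append(idx)
            budget += n_tokens
        if batch:
            yield batch

    def __len__(self):
        # Upper bound; the exact count depends on how examples pack.
        return len(self.lengths)
```

Such a sampler would be passed to a `DataLoader` via its `batch_sampler` argument.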
@@ -139,6 +134,23 @@ def trainer_save_model_safe(trainer: transformers.Trainer):
        trainer.save_model()

"""
Below is the updated train() function from LEVENT OZBEK.
Authorship tracking is a responsibility of git. We should remove all authorship info from the code.
Most of the changes are identical to those in train_lora.py. I simply applied the changes to the utility code in training_utils.py.
I commented out the original train() function.

- training_utils.tokenize_and_cache() is used for both training and evaluation datasets to avoid repetition.
- dynamic_batch_size() auto-adjusts batch sizes based on token counts. I did not implement this in train_lora.py, since LoRAs are trained on smaller data, so it didn't seem necessary there.
- DataLoaders are constructed using BatchSampler to dynamically adjust the batch size per epoch.
- a distributed DataLoader is used if local_rank != -1.
- updated to use the optimized preprocess_logits_for_metrics and compute_metrics from training_utils.py.

Advantages of these changes:
- handles datasets with varying sequence lengths dynamically
- supports both single-GPU and distributed setups.
"""
Updates like this would better go into the PR description, not the code.
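As an aside for readers, the single-GPU vs. distributed DataLoader split that docstring describes typically looks something like this; the helper name and arguments below are illustrative, not the PR's actual code:

```python
from torch.utils.data import BatchSampler, DataLoader, SequentialSampler
from torch.utils.data.distributed import DistributedSampler

def build_dataloader(dataset, batch_size, local_rank, collate_fn):
    # In a distributed run (local_rank != -1) each rank gets a disjoint
    # shard of the data; otherwise fall back to a plain sequential pass.
    if local_rank != -1:
        sampler = DistributedSampler(dataset)
    else:
        sampler = SequentialSampler(dataset)
    batch_sampler = BatchSampler(sampler, batch_size=batch_size, drop_last=False)
    return DataLoader(dataset, batch_sampler=batch_sampler, collate_fn=collate_fn)
```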
"""Preprocesses the logits during evaluation by computing the greedy token predictions for | ||
accuracy calculation and loss values for perplexity calculation. Both pred_ids and loss are | ||
of shape (batch_size x seq_len)""" | ||
|
We should keep the docstring. If you think it's obsolete in terms of information, it's better to update it than to remove it completely.
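For reference, a function matching that docstring could look roughly like the following. This is a sketch assuming the Hugging Face Trainer's `preprocess_logits_for_metrics` hook, not functionary's actual implementation (which may, for example, also shift labels):

```python
import torch

def preprocess_logits_for_metrics(logits, labels):
    """Preprocesses the logits during evaluation by computing the greedy token
    predictions for accuracy calculation and loss values for perplexity
    calculation. Both pred_ids and loss are of shape (batch_size x seq_len)."""
    pred_ids = logits.argmax(dim=-1)  # greedy prediction at each position
    # Unreduced cross-entropy so later code can average over labeled
    # positions only; -100 marks ignored (unlabeled) positions.
    loss_fn = torch.nn.CrossEntropyLoss(reduction="none", ignore_index=-100)
    loss = loss_fn(logits.transpose(1, 2), labels)  # (batch_size, seq_len)
    return pred_ids, loss
```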
def compute_metrics(eval_preds, id2token, tokenizer):
    """Computes next-token accuracy and perplexity metrics for evaluation"""
Let's have a docstring here too.
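If a fuller docstring is wanted, something along these lines could work; the parameter descriptions are assumptions inferred from the names, not taken from the PR:

```python
def compute_metrics(eval_preds, id2token, tokenizer):
    """Computes next-token accuracy and perplexity metrics for evaluation.

    Args:
        eval_preds: (predictions, labels) pair, where predictions are the
            (pred_ids, loss) tensors from preprocess_logits_for_metrics.
        id2token: mapping from tracked token id to its string form, used
            for the per-token accuracy breakdown.
        tokenizer: tokenizer used to decode ids for first-token statistics.

    Returns:
        A dict mapping metric names (accuracy, perplexity, first-token and
        per-token breakdowns) to their values.
    """
```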
     metrics = {
         "accuracy": acc_count / total_num,
         "perplexity": perplexity,
-        "accuracy_first_token": first_token_correct_count / first_token_total_count,
-        "total_number_first_token": first_token_total_count,
-        "first_token_param_values": first_token_param_value_acc
-        / first_token_param_value_total,
-        "first_token_param_values_total": first_token_param_value_total,
+        "accuracy_first_token": first_token_correct / max(first_token_total, 1),
     }

-    for token_id, stat in sorted(
-        first_token_label_dic.items(), key=lambda x: -x[1]["total"]
-    )[:5]:
-        token = tokenizer.decode([token_id])
-        metrics[f"accuracy_first_token_{token}"] = stat["correct"] / stat["total"]
-        metrics[f"accuracy_first_token_{token}_total"] = stat["total"]

-    for token_id in dic:
+    # Token-specific accuracies
+    for token_id, stat in token_stats.items():
         token = id2token[token_id]
-        total_num = dic[token_id]["total"]
-        acc = -1
-        if total_num > 0:
-            acc = dic[token_id]["acc"] / total_num
-        metrics[f"accuracy_{token}"] = acc
-        metrics[f"accuracy_total_num_{token}"] = total_num
I see some metrics have been removed. Could you kindly explain why these changes have been made?
- Removed redundant imports
- Added optimization functions for data handling, tokenization, batch allocation and distribution of training