Description
You're welcome! I'm glad to assist you with this question!
- The error message is as follows:
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using tokenizers before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
(the same warning is printed three more times)
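The warning itself is harmless here; as it says, it can be silenced by disabling tokenizer parallelism before the dataloader workers are forked. A minimal sketch (placing this at the top of the training script is my assumption, not something the log prescribes):

import os

# Must run before any Hugging Face tokenizer is used; otherwise the forked
# dataloader workers will still trigger the warning above.
os.environ["TOKENIZERS_PARALLELISM"] = "false"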
{'loss': 0.0185, 'learning_rate': 7.973102785782901e-06, 'epoch': 5.04}
{'loss': 0.0185, 'learning_rate': 6.9724623759205894e-06, 'epoch': 5.16}
{'loss': 0.0188, 'learning_rate': 5.971821966058277e-06, 'epoch': 5.28}
{'loss': 0.0178, 'learning_rate': 4.971181556195966e-06, 'epoch': 5.4}
wandb: Network error (ReadTimeout), entering retry loop.
{'loss': 0.0183, 'learning_rate': 3.970541146333654e-06, 'epoch': 5.52}
{'loss': 0.018, 'learning_rate': 2.9699007364713415e-06, 'epoch': 5.64}
{'loss': 0.0179, 'learning_rate': 1.96926032660903e-06, 'epoch': 5.76}
{'loss': 0.0174, 'learning_rate': 9.68619916746718e-07, 'epoch': 5.88}
[INFO|trainer.py:1901] 2023-04-19 17:43:28,689 >>
Training completed. Do not forget to share your model on huggingface.co/models =)
{'train_runtime': 1749.401, 'train_samples_per_second': 142.815, 'train_steps_per_second': 14.281, 'train_loss': 0.023130248267032992, 'epoch': 6.0}
[INFO|trainer.py:2709] 2023-04-19 17:43:28,693 >> Saving model checkpoint to classifier_models/e2e-tgt-tree_e=6_b=10_m=bert-base-uncased_wikitext-103-raw-v1_101_wp_None
[INFO|configuration_utils.py:453] 2023-04-19 17:43:28,694 >> Configuration saved in classifier_models/e2e-tgt-tree_e=6_b=10_m=bert-base-uncased_wikitext-103-raw-v1_101_wp_None/config.json
[INFO|modeling_utils.py:1704] 2023-04-19 17:43:29,841 >> Model weights saved in classifier_models/e2e-tgt-tree_e=6_b=10_m=bert-base-uncased_wikitext-103-raw-v1_101_wp_None/pytorch_model.bin
***** train metrics *****
epoch = 6.0
train_loss = 0.0231
train_runtime = 0:29:09.40
train_samples = 41640
train_samples_per_second = 142.815
train_steps_per_second = 14.281
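(As a sanity check, these numbers are internally consistent: 41640 train samples × 6 epochs ÷ batch size 10, the b=10 tag in the checkpoint name, gives 24984 optimizer steps, and 24984 steps ÷ 1749.4 s ≈ 14.28 steps per second, matching train_steps_per_second above.)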
04/19/2023 17:43:29 - INFO - main - *** Evaluate ***
[INFO|trainer.py:710] 2023-04-19 17:43:29,848 >> The following columns in the evaluation set don't have a corresponding argument in Classifier_Tree.forward and have been ignored: chart_lst. If chart_lst are not expected by Classifier_Tree.forward, you can safely ignore this message.
[INFO|trainer.py:2964] 2023-04-19 17:43:29,850 >> ***** Running Evaluation *****
[INFO|trainer.py:2966] 2023-04-19 17:43:29,850 >> Num examples = 421
[INFO|trainer.py:2969] 2023-04-19 17:43:29,851 >> Batch size = 10
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using tokenizers before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
(the same warning is printed three more times)
{'eval_runtime': 1.4868, 'eval_samples_per_second': 283.16, 'eval_steps_per_second': 28.921, 'epoch': 6.0}
Traceback (most recent call last):
File "/home/name/diffusion-LM/transformers/examples/pytorch/language-modeling/run_clm.py", line 1704, in
main()
File "/home/name/diffusion-LM/transformers/examples/pytorch/language-modeling/run_clm.py", line 1675, in main
perplexity = math.exp(metrics["eval_loss"])
KeyError: 'eval_loss'
wandb: Waiting for W&B process to finish... (failed 1). Press Control-C to abort syncing.
Exception ignored in atexit callback: <function _Manager._atexit_setup.<locals>.<lambda> at 0x7f2f280f1fc0>
Traceback (most recent call last):
File "/home/name/anaconda3/lib/python3.10/site-packages/wandb/sdk/wandb_manager.py", line 166, in
self._atexit_lambda = lambda: self._atexit_teardown()
File "/home/name/anaconda3/lib/python3.10/site-packages/wandb/sdk/wandb_manager.py", line 175, in _atexit_teardown
self._teardown(exit_code)
File "/home/name/anaconda3/lib/python3.10/site-packages/wandb/sdk/wandb_manager.py", line 186, in _teardown
result = self._service.join()
File "/home/name/anaconda3/lib/python3.10/site-packages/wandb/sdk/service/service.py", line 216, in join
ret = self._internal_proc.wait()
File "/home/name/anaconda3/lib/python3.10/subprocess.py", line 1204, in wait
return self._wait(timeout=timeout)
File "/home/name/anaconda3/lib/python3.10/subprocess.py", line 1938, in _wait
(pid, sts) = self._try_wait(0)
File "/home/name/anaconda3/lib/python3.10/subprocess.py", line 1896, in _try_wait
(pid, sts) = os.waitpid(self.pid, wait_flags)
KeyboardInterrupt:
(diffusion-LM) name@taizun-SYS-4029GP-TRT:/diffusion-LM$ wandb: 0.010 MB of 0.010 MB uploaded (0.000 MB deduped)
-
The relevant code that caused the error is as follows:
Training
if training_args.do_train:
    checkpoint = None
    if training_args.resume_from_checkpoint is not None:
        checkpoint = training_args.resume_from_checkpoint
    elif last_checkpoint is not None:
        checkpoint = last_checkpoint
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
    trainer.save_model()  # Saves the tokenizer too for easy upload

    metrics = train_result.metrics
    max_train_samples = (
        data_args.max_train_samples if data_args.max_train_samples is not None else len(train_dataset)
    )
    metrics["train_samples"] = min(max_train_samples, len(train_dataset))

    trainer.log_metrics("train", metrics)
    trainer.save_metrics("train", metrics)
    trainer.save_state()
Evaluation
if training_args.do_eval:
    logger.info("*** Evaluate ***")

    metrics = trainer.evaluate()

    max_eval_samples = data_args.max_eval_samples if data_args.max_eval_samples is not None else len(eval_dataset)
    metrics["eval_samples"] = min(max_eval_samples, len(eval_dataset))
    try:
        perplexity = math.exp(metrics["eval_loss"])
    except OverflowError:
        perplexity = float("inf")
    metrics["perplexity"] = perplexity

    trainer.log_metrics("eval", metrics)
    trainer.save_metrics("eval", metrics)

kwargs = {"finetuned_from": model_args.model_name_or_path, "tasks": "text-generation"}
if data_args.dataset_name is not None:
    kwargs["dataset_tags"] = data_args.dataset_name
    if data_args.dataset_config_name is not None:
        kwargs["dataset_args"] = data_args.dataset_config_name
        kwargs["dataset"] = f"{data_args.dataset_name} {data_args.dataset_config_name}"
    else:
        kwargs["dataset"] = data_args.dataset_name

if training_args.push_to_hub:
    trainer.push_to_hub(**kwargs)
else:
    trainer.create_model_card(**kwargs)
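For context on the KeyError: in the stock Trainer, eval_loss is only added to the evaluation metrics when the model's forward pass actually returns a loss on the eval batches, so math.exp(metrics["eval_loss"]) fails exactly as in the traceback above whenever that key is absent. A minimal sketch of the failure mode plus a guarded variant (the metric values are copied from the eval log above; the guard is illustrative and not part of run_clm.py):

import math

# Evaluation metrics as they come back when no loss was computed
# (note: no "eval_loss" key), mirroring the eval log above.
metrics = {"eval_runtime": 1.4868, "eval_samples_per_second": 283.16}

# metrics["eval_loss"] would raise KeyError here, as in the traceback.
# A guarded version only computes perplexity when the loss is present:
if "eval_loss" in metrics:
    try:
        perplexity = math.exp(metrics["eval_loss"])
    except OverflowError:
        perplexity = float("inf")
    metrics["perplexity"] = perplexity
else:
    print("eval_loss missing from eval metrics; available keys:", sorted(metrics))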