Colab runtime crashes while loading wmt14 dataset #10967

Open
bharanikommoju opened this issue Jan 1, 2025 · 1 comment

/!\ PLEASE INCLUDE THE FULL STACKTRACE AND CODE SNIPPET

Short description
The Docker-based local Colab runtime crashes when running the Python code below.

Environment information

  • Operating System: Windows 11 Home Single Language Build: 26100.2605

  • Python version: 3.10.12 (main, Mar 22 2024, 16:50:05) [GCC 11.4.0]

  • tensorflow-datasets/tfds-nightly version: tensorflow-datasets 4.9.6

  • tensorflow/tf-nightly version: tensorflow 2.15.0

  • Does the issue still exist with the latest tfds-nightly package (pip install --upgrade tfds-nightly)?

Reproduction instructions
A RAM-constrained machine is recommended for reproducing the crash.

Connect to a local Google Colab runtime and install tensorflow_datasets:

  1. Start a local Google Colab runtime: docker run --gpus=all -p 127.0.0.1:9000:8080 us-docker.pkg.dev/colab-images/public/runtime
  2. Connect to the local runtime from Colab
  3. Run !pip install tensorflow_datasets
  4. Run the code below in Colab:
```python
import tensorflow_datasets as tfds

class Wmt14TranslateFrEn(tfds.translate.wmt14.Wmt14Translate):
  BUILDER_CONFIGS = [
      tfds.translate.wmt.WmtConfig(  # pylint:disable=g-complex-comprehension
          description="WMT 2014 %s-%s translation task dataset." % ("fr", "en"),
          url=tfds.translate.wmt14._URL,
          citation=tfds.translate.wmt14._CITATION,
          language_pair=("fr", "en"),
          version=tfds.core.Version("1.0.0"),
      )
  ]

  @property
  def _subsets(self):
    return {
        tfds.Split.TRAIN: [
            "gigafren",
        ]
    }

wmt14_fr_en_translate = Wmt14TranslateFrEn()
wmt14_fr_en_translate.download_and_prepare()
```

**Link to logs**
Nothing useful in the logs.

**Expected behavior**
The code runs to completion without crashing the runtime.

**Additional context**
After investigating, I believe https://github.com/tensorflow/datasets/blob/1a8fed713ed3a58bd459e1a8cccd31eb641d9b58/tensorflow_datasets/translate/wmt.py#L966 tries to load the entire uncompressed gzip file into memory, which causes an out-of-memory (OOM) crash on my machine.
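
To illustrate the difference the hypothesis above is about, here is a minimal standard-library sketch contrasting an eager read of a gzip file with a line-by-line stream. It is not taken from wmt.py and does not confirm what that line actually does; the function names and the tab-separated sample format are hypothetical, chosen only for the demonstration.

```python
import gzip


def count_lines_eager(path):
    """Decompress the whole file into memory, then split it into lines."""
    with gzip.open(path, "rb") as f:
        data = f.read()  # the entire uncompressed payload is held at once
    return len(data.splitlines())


def count_lines_streaming(path):
    """Iterate over the gzip stream so only a small buffer is resident."""
    count = 0
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for _ in f:
            count += 1
    return count


def make_sample(path, n):
    """Write n tab-separated sentence pairs to a gzip file (hypothetical format)."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for i in range(n):
            f.write(f"sentence {i}\tphrase {i}\n")
```

Both functions return the same count, but the eager version needs memory proportional to the uncompressed size, while the streaming version's footprint stays roughly constant regardless of file size.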

bharanikommoju added the bug label Jan 1, 2025
fineguy self-assigned this Jan 21, 2025

fineguy (Collaborator) commented Jan 21, 2025

@bharanikommoju thanks for bringing this up!

I think the files are reasonably sized, so it's fine loading them into memory. How much RAM does it consume on your machine?
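
One way to answer the RAM question is to record the process's peak resident set size around the call. The sketch below assumes a Linux runtime (such as the Colab Docker image), where `ru_maxrss` is reported in KiB; the helper names are hypothetical.

```python
import resource


def peak_rss_mib():
    """Peak resident set size of this process so far, in MiB (Linux reports KiB)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0


def run_and_report(fn):
    """Call fn() and return (result, peak RSS before, peak RSS after) in MiB."""
    before = peak_rss_mib()
    result = fn()
    after = peak_rss_mib()
    return result, before, after
```

For example, `run_and_report(wmt14_fr_en_translate.download_and_prepare)` would return the peak before and after preparation. If the runtime is killed before the call returns, nothing is reported, so watching `docker stats` from the host while the cell runs is an alternative.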
