From 3ed916f3b7b75844ef201b7e95c173c14b7b012a Mon Sep 17 00:00:00 2001
From: Tanisha Samant
Date: Wed, 1 Oct 2025 00:02:23 +0530
Subject: [PATCH] Docs: fix typo, improve readability, add code comments

---
 README.md | 19 ++++++++++++-------
 1 file changed, 12 insertions(+), 7 deletions(-)

diff --git a/README.md b/README.md
index d4162b9e761..07365de179d 100644
--- a/README.md
+++ b/README.md
@@ -20,8 +20,12 @@
 
 🤗 Datasets is a lightweight library providing **two** main features:
 
-- **one-line dataloaders for many public datasets**: one-liners to download and pre-process any of the ![number of datasets](https://img.shields.io/endpoint?url=https://huggingface.co/api/shields/datasets&color=brightgreen) major public datasets (image datasets, audio datasets, text datasets in 467 languages and dialects, etc.) provided on the [HuggingFace Datasets Hub](https://huggingface.co/datasets). With a simple command like `squad_dataset = load_dataset("rajpurkar/squad")`, get any of these datasets ready to use in a dataloader for training/evaluating a ML model (Numpy/Pandas/PyTorch/TensorFlow/JAX),
-- **efficient data pre-processing**: simple, fast and reproducible data pre-processing for the public datasets as well as your own local datasets in CSV, JSON, text, PNG, JPEG, WAV, MP3, Parquet, HDF5, etc. With simple commands like `processed_dataset = dataset.map(process_example)`, efficiently prepare the dataset for inspection and ML model evaluation and training.
+- **one-line dataloaders for many public datasets**: Provides simple one-liners to download and pre-process any of the ![number of datasets](https://img.shields.io/endpoint?url=https://huggingface.co/api/shields/datasets&color=brightgreen) major public datasets. This includes image datasets, audio datasets, text datasets in 467 languages and dialects, and more, all available on the [HuggingFace Datasets Hub](https://huggingface.co/datasets). For example, you can run:
+`squad_dataset = load_dataset("rajpurkar/squad")`
+to get a dataset ready for training or evaluating an ML model in Numpy, Pandas, PyTorch, TensorFlow, or JAX.
+- **efficient data pre-processing**: Provides simple, fast, and reproducible data pre-processing for public datasets as well as your own local datasets in CSV, JSON, text, PNG, JPEG, WAV, MP3, Parquet, HDF5, etc. For example, you can run:
+`processed_dataset = dataset.map(process_example)`
+to efficiently prepare the dataset for inspection and ML model evaluation or training.
 
 [🎓 **Documentation**](https://huggingface.co/docs/datasets/) [🔎 **Find a dataset in the Hub**](https://huggingface.co/datasets) [🌟 **Share a dataset on the Hub**](https://huggingface.co/docs/datasets/share)
 
@@ -64,7 +68,7 @@ Follow the installation pages of TensorFlow and PyTorch to see how to install th
 
 For more details on installation, check the installation page in the documentation: https://huggingface.co/docs/datasets/installation
 
-## Installation to use with Machine Learning & Data frameworks frameworks
+## Installation to use with Machine Learning & Data frameworks
 
 If you plan to use 🤗 Datasets with PyTorch (2.0+), TensorFlow (2.6+) or JAX (3.14+) you should also install PyTorch, TensorFlow or JAX.
 🤗 Datasets is also well integrated with data frameworks like PyArrow, Pandas, Polars and Spark, which should be installed separately.
@@ -80,20 +84,21 @@ This library can be used for text/image/audio/etc. datasets. Here is an example
 Here is a quick example:
 
 ```python
+# Example: Load datasets and print first example
 from datasets import load_dataset
 
-# Print all the available datasets
+# Example: List all available datasets
 from huggingface_hub import list_datasets
 print([dataset.id for dataset in list_datasets()])
 
-# Load a dataset and print the first example in the training set
+# Example: Load a dataset and print the first example in the training set
 squad_dataset = load_dataset('rajpurkar/squad')
 print(squad_dataset['train'][0])
 
-# Process the dataset - add a column with the length of the context texts
+# Example: Process the dataset - add a column with the length of the context texts
 dataset_with_length = squad_dataset.map(lambda x: {"length": len(x["context"])})
 
-# Process the dataset - tokenize the context texts (using a tokenizer from the 🤗 Transformers library)
+# Example: Process the dataset - tokenize the context texts (using a tokenizer from the 🤗 Transformers library)
 from transformers import AutoTokenizer
 tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
 
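
Note on the second bullet's local-file claim: the sketch below shows one way to exercise `load_dataset` and `dataset.map(process_example)` on a local CSV. It is a minimal illustration, not part of the patch; `my_data.csv`, its `text` column, and the body of `process_example` are hypothetical placeholders.

```python
# Minimal sketch of the local-dataset path from the second bullet.
# "my_data.csv" is a hypothetical file assumed to contain a "text" column.
from datasets import load_dataset

def process_example(example):
    # Placeholder processor: record the length of each text field
    return {"length": len(example["text"])}

local_dataset = load_dataset("csv", data_files="my_data.csv")
processed_dataset = local_dataset.map(process_example)
```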
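For the "Installation to use with Machine Learning & Data frameworks" section, a quick way to confirm which of the separately installed frameworks are actually present in an environment (again a sketch, not part of the patch):

```python
# Sketch: report which optional framework integrations are importable.
# Each package is installed separately (e.g. with pip), as the section notes.
import importlib.util

for pkg in ("torch", "tensorflow", "jax", "pyarrow", "pandas", "polars", "pyspark"):
    found = importlib.util.find_spec(pkg) is not None
    print(f"{pkg}: {'installed' if found else 'missing'}")
```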
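Finally, the quick-example hunk creates a tokenizer but stops before applying it. A minimal end-to-end sketch of the same workflow, assuming `datasets` and `transformers` are installed; the final `map` call and the printed slice are illustrative additions, not lines from the README:

```python
# Sketch: the README example carried through to an actual tokenization pass.
from datasets import load_dataset
from transformers import AutoTokenizer

squad_dataset = load_dataset('rajpurkar/squad')
print(squad_dataset['train'][0])  # first training example

# Add a column with the length of each context text
dataset_with_length = squad_dataset.map(lambda x: {"length": len(x["context"])})

# Apply the tokenizer; batched=True feeds lists of examples to the
# tokenizer at once, which is usually much faster
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
tokenized_dataset = squad_dataset.map(lambda x: tokenizer(x['context']), batched=True)
print(tokenized_dataset['train'][0]['input_ids'][:10])
```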