Advanced Data Processing

Our library also supports a powerful data preprocessing backend that users can use to perform custom data preprocessing, including:

  1. Support for multiple datasets.
  2. Creating custom data processing pipelines for the datasets.
  3. Combining multiple datasets into one, even if they have different formats.
  4. Mixing datasets as required and sampling each dataset with different weights.

These features are supported via what we call a data_config, which can be passed as an argument to the SFT trainer.

Data Config

Data config is a configuration file which sft_trainer.py accepts via the --data_config flag. In this configuration users can describe multiple datasets, how to load them, and how to process them. Users can currently pass either YAML or JSON based configuration files as data configs.

What is the data config schema

The data config schema is designed to define datasets and their processing strategies in a structured way.

It consists of the following top-level keys:

  • dataprocessor: Defines global data processing parameters, such as the type (default), the sampling stopping strategy (all_exhausted or first_exhausted), and the sampling seed for reproducibility.
  • datasets: A list of dataset configurations, each describing the dataset name, paths, an optional builder, sampling ratio, and data handlers.

At the top level, the data config schema looks like this:

definitions:
  data_config:
    type: object
    additionalProperties: false
    properties:
      dataprocessor:
        $ref: '#/definitions/Dataprocessor'
      datasets:
        type: array
        items:
          $ref: '#/definitions/Dataset'
    required:
      - dataprocessor
      - datasets
    title: data_config
  Dataprocessor:
    type: object
    additionalProperties: false
    properties:
      type:
        type: string
      sampling_stopping_strategy:
        type: string
      seed:
        type: integer
    required:
      - type
    title: Dataprocessor
  Dataset:
    type: object
    additionalProperties: false
    properties:
      name:
        type: string
      sampling:
        type: float
      builder:
        type: string
      data_paths:
        type: array
        items:
          type: string
      data_handlers:
        type: array
        items:
          $ref: '#/definitions/DataHandler'
    required:
      - data_paths
      - name
    title: Dataset
  DataHandler:
    type: object
    additionalProperties: false
    properties:
      name:
        type: string
      arguments:
        $ref: '#/definitions/DataHandlerArguments'
    required:
      - arguments
      - name
    title: DataHandler
  DataHandlerArguments:
    type: object
    additionalProperties: false
    properties:
      remove_columns:
        type: string
      batched:
        type: boolean
      fn_kwargs:
        $ref: '#/definitions/DataHandlerFnKwargs'
    required:
      - fn_kwargs
      - remove_columns
    title: DataHandlerArguments
  DataHandlerFnKwargs:
    type: object
    properties:
      str:
        type: str
    title: DataHandlerFnKwargs

How users can write data configs

Users can create a data config file in either YAML or JSON format (we provide YAML examples for ease of use). The file should follow the schema outlined above, with the following parameters:

dataprocessor:

  • type (optional, str): Type of data preprocessor; default is currently the only supported type.
  • sampling_stopping_strategy (optional, str): Dataset interleave stopping strategy, used when mixing multiple datasets by weight; supported values are all_exhausted and first_exhausted, defaults to all_exhausted.
  • seed (optional, int): Sampling seed used when interleaving datasets; choose the same value across runs for reproducibility, defaults to 42.
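
For example, a dataprocessor block with these fields (field names follow the schema above; the values shown are illustrative) might look like:

```yaml
dataprocessor:
  type: default
  sampling_stopping_strategy: all_exhausted
  seed: 42
```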

datasets (list):

  • name (required, str): A unique identifier for the dataset.
  • data_paths (required, list): A list of file paths, folder paths, or Hugging Face dataset IDs containing the dataset.
  • builder (optional, str): Specifies a Hugging Face dataset builder, if applicable.
  • sampling (optional, float): The sampling ratio (0.0 to 1.0) with which to sample the dataset when interleaving.
  • data_handlers (optional, list): A list of data handler configurations which preprocess the dataset.

Data handlers are customizable components within the data config that allow users to preprocess or manipulate individual datasets. We use the Hugging Face Map API to apply these routines. They can process the dataset in any way users require, and the data handlers specified for each dataset are applied in order. Each data handler has:

  • name: The handler's unique identifier.
  • arguments: A dictionary of parameters specific to the handler.
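
Putting these pieces together, a complete data config might look like the following sketch. The dataset name, file path, and fn_kwargs values are illustrative; the handler name is one of the preexisting handlers described later in this document, and the exact fn_kwargs it expects should be checked against its definition:

```yaml
dataprocessor:
  type: default
datasets:
  - name: my_dataset                       # illustrative name
    data_paths:
      - "/path/to/my_dataset.jsonl"        # illustrative path
    data_handlers:
      - name: tokenize_and_apply_input_masking
        arguments:
          remove_columns: input            # hypothetical column to drop after mapping
          batched: false
          fn_kwargs:
            # hypothetical kwargs; check the handler definition for the exact names
            input_field_name: input
            output_field_name: output
```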

We provide some sample data configs here: predefined_data_configs.

How users can pass the datasets

Users can provide single or multiple file paths, folder paths, or Hugging Face dataset IDs through the data_paths argument. These datasets can be in various supported formats such as JSON, JSONL, Parquet, and Arrow; for an up-to-date list of supported formats see README.md. Additionally, users can pass glob patterns to specify files or folder paths matching specific patterns.

When passing multiple datasets with differing column structures, users should ensure appropriate handlers are specified to process the datasets correctly.

The builder argument can also be optionally included to provide additional information for dataset loading; it is passed directly to the HF load_dataset API as the first argument.

Users can pass builder in the dataset config to specify the loader for the given file/folder/pattern. We support the following:

  • Passing file paths that include a file extension in the filename, or specifying a builder if the file extension is not provided in the filename.
  • Passing folder paths, with or without a builder.

Not Supported:

  • Passing file paths that do not include a file extension in filename and do not specify a builder.
  • Passing a folder as a wildcard globbing pattern.

Currently there is no support for sampling across the multiple data paths defined inside a single dataset definition: all data paths specified within one dataset are concatenated after loading, while across datasets users can specify mixing via sampling, as described in the Data Mixing section below. Both behaviours are illustrated in the sketch below.
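
For example, a datasets section along these lines (names and paths are illustrative) loads a folder of Parquet files by specifying the parquet builder, and concatenates two JSONL files within a single dataset definition:

```yaml
datasets:
  - name: parquet_folder_dataset           # illustrative name
    builder: parquet                       # builder passed to HF load_dataset
    data_paths:
      - "/data/parquet_folder/"            # illustrative folder path
  - name: concatenated_jsonl_dataset       # illustrative name
    data_paths:                            # both files are concatenated after loading
      - "/data/part_1.jsonl"
      - "/data/part_2.jsonl"
```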

How users can specify data handlers

Data handlers, as explained above, are routines which process the dataset using the HF map framework. All data handler routines are registered with our data preprocessor as k:func pairs, where k (str) is the name of the data handler and func (callable) is the function that is called.

In the data config, users can request a data handler by specifying the name under which it was registered along with the appropriate arguments. Each data handler accepts two types of arguments via DataHandlerArguments (as defined in the schema above and shown again below).

  DataHandler:
    type: object
    additionalProperties: false
    properties:
      name:
        type: string
      arguments:
        $ref: '#/definitions/DataHandlerArguments'
    required:
      - arguments
      - name
    title: DataHandler
  DataHandlerArguments:
    type: object
    additionalProperties: false
    properties:
      remove_columns:
        type: string
      batched:
        type: boolean
      fn_kwargs:
        $ref: '#/definitions/DataHandlerFnKwargs'
    required:
      - fn_kwargs
      - remove_columns
    title: DataHandlerArguments
  DataHandlerFnKwargs:
    type: object
    properties:
      str:
        type: str
    title: DataHandlerFnKwargs

Arguments to the data handlers are of two types:

Each data handler is a routine passed to the underlying HF Map API, so the kwargs supported by that API can be passed via the arguments section of the data handler config.

For example, users can pass remove_columns to remove columns from the dataset when executing the particular handler, or batched to enable batched processing by the handler.

Users can also pass any number of keyword arguments required by the data handling routine itself as fn_kwargs inside arguments.
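
For illustration, a handler entry combining both kinds of arguments might look like the sketch below. The handler name is taken from the preexisting handlers listed in the next section; the fn_kwargs names and values are hypothetical and must match the handler's actual function signature:

```yaml
data_handlers:
  - name: apply_custom_data_formatting_template
    arguments:
      remove_columns: instruction          # Map-level kwarg: hypothetical column to drop
      batched: false                       # Map-level kwarg: disable batched processing
      fn_kwargs:
        # handler-level kwargs; hypothetical names and template placeholders
        dataset_text_field: formatted_text
        template: "### Input: {{instruction}} ### Response: {{response}}"
```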

Preexisting data handlers

This library currently supports the following preexisting data handlers:

  • tokenize_and_apply_input_masking: Tokenizes input text and applies masking to the labels for causal language modeling tasks, good for input/output datasets.
  • apply_dataset_formatting: Formats a dataset by appending an EOS token to a specified field.
  • apply_custom_data_formatting_template: Applies a custom template (e.g., Alpaca style) to format dataset elements.
  • apply_tokenizer_chat_template: Uses a tokenizer's chat template to preprocess dataset elements, good for single/multi turn chat templates.

These handlers can be requested by name, and users can look up their function arguments here.

Extra data handlers

Users can also pass custom data handlers to the sft_trainer.py::train() API via the additional_data_handlers argument.

The argument expects a map of the same form as the existing data handlers, k (str): func (callable), which will be registered with the data preprocessor via its register_data_handlers API.

Data Mixing

Dataset mixing allows users to mix multiple datasets, often with different sampling ratios, to ensure the model is trained on a mix of datasets in specific proportions.

If users want to train a model on a straightforward concatenation of the datasets, they need not enable data mixing. Users can simply specify the different datasets via data_paths as shown above, and all the datasets will be concatenated via concatenate_datasets.

If users want to enable data mixing, they need to enable sampling of the datasets by specifying a sampling ratio for each dataset as described above. The library will then collect the sampling ratios from each dataset definition in the data_config and create a new interleaved dataset from the sampled datasets via the interleave_datasets() API.

Note that the sampling ratio of a dataset is a float, and all the sampling ratios must sum to 1.

We also allow users to pass a seed to randomize the interleaving of datasets and a sampling_stopping_strategy to describe when to stop sampling; both values should remain the same across runs for reproducibility. These values are common to all datasets and should be supplied at the top level in the dataprocessor block, as shown above. For the list of supported values of these arguments, see the corresponding HF API.
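
As an illustration, a data config that interleaves two datasets (names and paths are illustrative) with a 30/70 split, using the field names from the schema above, might look like:

```yaml
dataprocessor:
  type: default
  sampling_stopping_strategy: all_exhausted
  seed: 42
datasets:
  - name: dataset_a                        # illustrative name
    sampling: 0.3                          # sampling ratios must sum to 1
    data_paths:
      - "/data/dataset_a.jsonl"
  - name: dataset_b                        # illustrative name
    sampling: 0.7
    data_paths:
      - "/data/dataset_b.jsonl"
```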

Note: If users specify data sampling, they can expect the datasets to be mixed, with individual samples in the datasets left unbroken, unless the max_seq_len argument is smaller than the length of individual samples in the dataset.

Example data configs

We provide some example data configs here.