Our library also supports a powerful data processing backend which users can leverage to perform custom data preprocessing, including:
- Support for multiple datasets.
- Creating custom data processing pipelines for the datasets.
- Combining multiple datasets into one, even if they have different formats.
- Mixing datasets as required and sampling each dataset with different weights.

These features are supported via what we call a `data_config`, which can be passed as an argument to the SFT trainer. A data config is a configuration file which `sft_trainer.py` accepts via the `--data_config` flag. In this configuration users can describe multiple datasets, how to load them, and how to process them. Users can currently pass either YAML or JSON based configuration files as data configs.
The data config schema is designed to define datasets and their processing strategies in a structured way.
It consists of the following top-level keys:
- `dataprocessor`: Defines global data processing parameters, such as the type (`default`), sampling stopping strategy (`all_exhausted` or `first_exhausted`), and sampling seed for reproducibility.
- `datasets`: A list of dataset configurations, each describing the dataset name, paths, optional builders, sampling ratios, and data handlers.
At the top level, the data config schema looks like this:
```yaml
definitions:
  data_config:
    type: object
    additionalProperties: false
    properties:
      dataprocessor:
        $ref: '#/definitions/Dataprocessor'
      datasets:
        type: array
        items:
          $ref: '#/definitions/Dataset'
    required:
      - dataprocessor
      - datasets
    title: data_config
  Dataprocessor:
    type: object
    additionalProperties: false
    properties:
      type:
        type: string
      sampling_stopping_strategy:
        type: string
      seed:
        type: integer
    required:
      - type
    title: Dataprocessor
  Dataset:
    type: object
    additionalProperties: false
    properties:
      name:
        type: string
      sampling:
        type: float
      builder:
        type: string
      data_paths:
        type: array
        items:
          type: string
      data_handlers:
        type: array
        items:
          $ref: '#/definitions/DataHandler'
    required:
      - data_paths
      - name
    title: Dataset
  DataHandler:
    type: object
    additionalProperties: false
    properties:
      name:
        type: string
      arguments:
        $ref: '#/definitions/DataHandlerArguments'
    required:
      - arguments
      - name
    title: DataHandler
  DataHandlerArguments:
    type: object
    additionalProperties: false
    properties:
      remove_columns:
        type: string
      batched:
        type: boolean
      fn_kwargs:
        $ref: '#/definitions/DataHandlerFnKwargs'
    required:
      - fn_kwargs
      - remove_columns
    title: DataHandlerArguments
  DataHandlerFnKwargs:
    type: object
    properties:
      str:
        type: str
    title: DataHandlerFnKwargs
```
Users can create a data config file in either YAML or JSON format (we provide examples in YAML for ease of use). The file should follow the schema outlined above, with the following parameters:

`dataprocessor`:
- `type` (optional, str): Type of data preprocessor; `default` is currently the only supported type.
- `sampling_stopping_strategy` (optional, str): Dataset interleave stopping strategy in case of choosing to mix multiple datasets by weight; supported values are `all_exhausted` or `first_exhausted`, defaults to `all_exhausted`.
- `seed` (optional, int): Sampling seed to use for interleaving datasets; choose the same value for reproducibility, defaults to 42.

`datasets` (list):
- `name` (optional, str): A unique identifier for the dataset.
- `data_paths` (optional, list): A list of file paths or directories containing the dataset.
- `builder` (optional, str): Specifies a Hugging Face dataset builder, if applicable.
- `sampling` (optional, float): The sampling ratio (0.0 to 1.0) with which to sample the dataset in case of interleaving.
- `data_handlers` (optional, list): A list of data handler configurations which preprocess the dataset.
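For example, a minimal data config with a single dataset and one data handler might look like the sketch below. The dataset name, file path, and the handler's `fn_kwargs` field names are illustrative assumptions; check the handler definitions for the exact arguments each handler accepts.

```yaml
dataprocessor:
  type: default
datasets:
  - name: my_dataset                       # illustrative name
    data_paths:
      - /path/to/train.jsonl               # illustrative path
    data_handlers:
      - name: apply_dataset_formatting     # one of the preexisting handlers (see below)
        arguments:
          remove_columns: extra_metadata   # illustrative column to drop after the handler runs
          batched: false
          fn_kwargs:
            dataset_text_field: "output"   # illustrative kwarg name for the handler routine
```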
Data handlers are customizable components within the data config that allow users to preprocess or manipulate individual datasets. We use the Hugging Face Map API to apply these routines. These functions can process the dataset in any way users require, and the list of data handlers specified for each dataset is applied in order.

Each data handler has:
- `name`: The handler's unique identifier.
- `arguments`: A dictionary of parameters specific to the handler.

We provide some sample data configs here: predefined_data_configs.
Users can provide single or multiple file paths, folder paths, or Hugging Face dataset IDs through the `data_paths` argument. These datasets can be in various supported formats such as JSON, JSONL, Parquet, and Arrow; for an up-to-date list of supported formats, see README.md. Additionally, users can pass globbing patterns to specify files or folder paths matching specific patterns.
When passing multiple datasets with differing column structures, users should ensure appropriate handlers are specified to process the datasets correctly.
The `builder` argument can also optionally be included to provide additional information for dataset loading; it is passed directly to the HF `load_dataset` API as the first argument. Users can pass `builder` in `DataSetConfig` to specify the loader for the given file/folder/pattern.
We support the following:
- Passing file paths that include a file extension in the filename, or specifying a `builder` if the file extension is not provided in the filename.
- Passing folder paths, with or without a `builder`.

Not supported:
- Passing file paths that do not include a file extension in the filename and do not specify a `builder`.
- Passing a folder as a wildcard globbing pattern.
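As an illustration, the dataset entries below sketch these options; the names and paths are placeholders, not real files.

```yaml
datasets:
  # Files with extensions: no builder needed.
  - name: json_dataset                # placeholder name
    data_paths:
      - /data/train.jsonl             # placeholder path
      - /data/extra/*.jsonl           # globbing pattern; all paths in one dataset are concatenated
  # Files without extensions or plain folders: specify a builder.
  - name: parquet_dataset             # placeholder name
    builder: parquet                  # passed to load_dataset as the first argument
    data_paths:
      - /data/parquet_shards/         # placeholder folder path
```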
Currently there is no support for sampling across multiple data paths defined inside a single dataset definition. All data paths specified inside one dataset are concatenated after loading, while across datasets users can specify mixing via sampling, as described below.
Data handlers, as explained above, are routines which process the dataset using the HF Map framework. All data handler routines are registered with our data preprocessor as a `k:func` object, where `k` is the name (`str`) of the data handler and `func` (`callable`) is the function which is called.

In the data config, users can request which data handler to apply by specifying the `name` with which the data handler was registered and the appropriate `arguments`. Each data handler accepts two types of arguments via `DataHandlerArguments` (as defined in the schema above), shown below.
```yaml
DataHandler:
  type: object
  additionalProperties: false
  properties:
    name:
      type: string
    arguments:
      $ref: '#/definitions/DataHandlerArguments'
  required:
    - arguments
    - name
  title: DataHandler
DataHandlerArguments:
  type: object
  additionalProperties: false
  properties:
    remove_columns:
      type: string
    batched:
      type: boolean
    fn_kwargs:
      $ref: '#/definitions/DataHandlerFnKwargs'
  required:
    - fn_kwargs
    - remove_columns
  title: DataHandlerArguments
DataHandlerFnKwargs:
  type: object
  properties:
    str:
      type: str
  title: DataHandlerFnKwargs
```
Arguments to the data handlers are of two types:
1. Each data handler is a routine passed to the underlying HF Map API, so the kwargs supported by that API can be passed via the `arguments` section of the data handler config. For example, users can pass `remove_columns` to remove any columns from the dataset when executing the particular handler, or use `batched` to ensure batched processing by the data handler.
2. Users can also pass any number of keyword arguments required by each data handling routine as `fn_kwargs` inside `arguments`, as in the sketch below.
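For instance, a handler entry combining both kinds of arguments might look like the following sketch; the handler name and the `fn_kwargs` field names are hypothetical placeholders, not registered handlers.

```yaml
data_handlers:
  - name: my_custom_handler              # hypothetical handler name
    arguments:
      remove_columns: extra_metadata     # HF Map API kwarg: drop this column after the handler runs
      batched: false                     # HF Map API kwarg: process elements one at a time
      fn_kwargs:                         # kwargs forwarded to the handler routine itself
        input_field: "question"          # hypothetical routine arguments
        output_field: "answer"
```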
This library currently supports the following preexisting data handlers:
- `tokenize_and_apply_input_masking`: Tokenizes input text and applies masking to the labels for causal language modeling tasks; good for input/output datasets.
- `apply_dataset_formatting`: Formats a dataset by appending an EOS token to a specified field.
- `apply_custom_data_formatting_template`: Applies a custom template (e.g., Alpaca style) to format dataset elements.
- `apply_tokenizer_chat_template`: Uses a tokenizer's chat template to preprocess dataset elements; good for single/multi-turn chat templates.

These handlers can be requested by the same names, and users can look up their function arguments here.
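As an illustration, requesting one of these handlers in a dataset definition might look like the sketch below; the `fn_kwargs` names and values shown are assumptions, so consult the handler's function arguments for the exact keys it expects.

```yaml
data_handlers:
  - name: apply_custom_data_formatting_template
    arguments:
      remove_columns: all                  # assumed value: drop the original columns after formatting
      batched: false
      fn_kwargs:
        dataset_text_field: "formatted_text"                            # assumed kwarg name
        template: "### Input: {{input}}\n\n### Response: {{output}}"    # assumed kwarg name and placeholder syntax
```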
Users are also allowed to pass custom data handlers to the `sft_trainer.py::train()` API call via the `additional_data_handlers` argument. The argument expects users to pass a map, similar to the existing data handlers, of `k (str) : func (callable)` entries, which will be registered with the data preprocessor via its `register_data_handlers` API.
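A minimal sketch of supplying such a handler is shown below. The handler signature (an element dict plus keyword arguments forwarded from `fn_kwargs`), the `tuning.sft_trainer` import path, and the placeholder argument objects are assumptions for illustration; only the `additional_data_handlers` map of name to callable comes from the description above.

```python
# Sketch only: import path, handler signature, and the placeholder *_args objects are assumptions.
from tuning import sft_trainer

def lowercase_field(element, field_name="text", **kwargs):
    # Custom routine applied to each dataset element via the HF Map API.
    element[field_name] = element[field_name].lower()
    return element

sft_trainer.train(
    model_args,       # placeholders: build these as for any other training run
    data_args,
    training_args,
    additional_data_handlers={"lowercase_field": lowercase_field},
)
```

Once registered, the handler can be referenced by its name (`lowercase_field` here) in the `data_handlers` section of the data config.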
Dataset mixing allows users to mix multiple datasets, often with different sampling ratios, to ensure the model is trained on a mix of datasets in specific proportions.

If users want to train a model on just a straightforward concatenation of the datasets, they need not enable data mixing; they can specify the different datasets via `data_paths` as shown above, and all the datasets will be concatenated via `concatenate_datasets`.

If users want to enable data mixing, they need to enable sampling of the datasets by specifying a `sampling` ratio for each dataset as described above. The library will then collect the sampling ratios from each dataset definition in the `data_config` and create a new interleaved dataset, which is a combination of all the sampled datasets, via the `interleave_datasets()` API. Note that the sampling ratio of a dataset is a float and all the sampling ratios must sum to 1.
We also allow users to pass a `seed` to randomize the interleaving of datasets and a `sampling_stopping_strategy` to describe when to stop sampling. Both values should remain the same for experiment reproducibility. These values are common to all datasets and should be supplied at the top level in the `dataprocessor` section as shown above. For a list of the supported values of these arguments, see the corresponding HF API.
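Putting this together, a data config that interleaves two datasets with weighted sampling might be sketched as follows; the dataset names and paths are placeholders.

```yaml
dataprocessor:
  type: default
  sampling_stopping_strategy: all_exhausted   # or first_exhausted
  seed: 42
datasets:
  - name: dataset_a              # placeholder name
    sampling: 0.3                # ratios across datasets must sum to 1.0
    data_paths:
      - /data/dataset_a.jsonl    # placeholder path
  - name: dataset_b              # placeholder name
    sampling: 0.7
    data_paths:
      - /data/dataset_b.jsonl    # placeholder path
```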
Note: If a user specifies data sampling, they can expect the datasets to be mixed, with individual samples in the datasets kept whole unless the `max_seq_len` argument is smaller than the length of individual samples in the dataset.
We provide some example data configs here.