This method provides train/test indices stratified according to multiple target columns. It is suitable for applications involving multiclass-multioutput classification [1], in which one has access to multiple targets and each target has a cardinality strictly greater than 2.
The implementation uses a genetic algorithm and is inspired by the EvoSplit algorithm [2]. Furthermore, we have used various resources [3, 4, 5, 6] for the Cython implementation of the underlying genetic algorithm, and the code is heavily inspired by the scikit-learn API design [7].
It is worth emphasizing that the original EvoSplit algorithm was devised for multilabel classification tasks, where there are multiple target columns but each target admits only binary values (0 or 1). Our algorithm is thus an adaptation of EvoSplit to multiclass-multioutput classification tasks, where each target can take multiple categorical values. To generalize the method, we have adapted the fitness function to accommodate the different cardinalities of the targets.
Let the multioutput-multiclass dataset $\mathcal{D}$ be defined as
For a multioutput target sample
Therefore, the multioutput target matrix
can be written as
Our aim is to split $\mathcal{D}$ into train and test subsets using the Label Distribution measure [2] adapted to the multiclass-multioutput setting:
where we have used the following notations:
- $\Lambda^{[k]}$ is the vector of the counts corresponding to the unique values of the $k^{th}$ target output associated to the samples belonging to the original dataset $\mathcal{D}$
- $\Lambda_{subset}^{[k]}$ is the vector of the counts corresponding to the unique values of the $k^{th}$ target output associated to the samples belonging to $\mathcal{D}_{s}$, where $subset \in \lbrace \text{train, test} \rbrace$
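To make the notation concrete, the count vectors can be computed one target column at a time. The following is a minimal sketch in plain Python, assuming the targets are stored as a list of rows. The helper names `count_vector` and `proportion_gap` are hypothetical, and `proportion_gap` is only an illustrative discrepancy between class proportions, not the adapted Label Distribution measure itself.

```python
from collections import Counter


def count_vector(y, indices, k, values):
    """Counts of each unique value of the k-th output among the given rows."""
    counts = Counter(y[i][k] for i in indices)
    return [counts.get(v, 0) for v in values]


def proportion_gap(y, subset_indices, k):
    """Illustrative (hypothetical) score: mean absolute difference between the
    class proportions of output k in the subset and in the full dataset."""
    values = sorted({row[k] for row in y})
    full = count_vector(y, range(len(y)), k, values)
    sub = count_vector(y, subset_indices, k, values)
    gaps = [abs(s / sum(sub) - f / sum(full)) for s, f in zip(sub, full)]
    return sum(gaps) / len(gaps)


# Toy target matrix with two outputs (3 and 2 unique values respectively).
y = [(0, 0), (1, 1), (2, 0), (0, 1), (1, 0), (2, 1)]
test_idx = [0, 1, 2]

lambda_full = count_vector(y, range(len(y)), 0, [0, 1, 2])  # whole dataset
lambda_test = count_vector(y, test_idx, 0, [0, 1, 2])       # test subset only
```

In this toy example the first output is perfectly balanced in the test subset, while the second output is not, so its gap is strictly positive.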
- n_iterations [int]: number of learning stages of the genetic algorithm
- test_size [float]: the proportion of the dataset representing the test subset
- population_size [int]: number of individuals in the genetic algorithm's population
- mutation_rate [float | int]: the percentage/number of indices to be swapped between the train & test partitions of each individual
- crossover_rate [float | int]: the percentage/number of indices to be swapped between two individuals. As in the original EvoSplit algorithm, the train/test split percentages are preserved by a correction process that randomly reassigns extra train indices to the test subset and vice versa.
- n_individuals_by_mutation [int]: number of new individuals generated at each iteration by mutation
- n_individuals_by_crossover [int]: number of new individuals generated at each iteration by crossover
- sample_with_replacement [bool]: specifies if the parents considered in the mutation and crossover processes are drawn with replacement or not
- verbose [bool]: enables the printing of the fitness values at each iteration
- random_state [None | int | np.random.RandomState]: seed or random state used to make the experiments reproducible
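The mutation and crossover operators described by these parameters can be sketched as follows. This is a simplified illustration in plain Python, not the Cython implementation: an individual is assumed to be a list of 0/1 labels (0 = train, 1 = test), and the function names `mutate` and `crossover` as well as their arguments are hypothetical.

```python
import random


def mutate(assignment, n_swaps, rng):
    """Swap n_swaps train samples with n_swaps test samples (sizes preserved)."""
    child = list(assignment)
    train = [i for i, a in enumerate(child) if a == 0]
    test = [i for i, a in enumerate(child) if a == 1]
    n_swaps = min(n_swaps, len(train), len(test))
    for i in rng.sample(train, n_swaps):
        child[i] = 1
    for i in rng.sample(test, n_swaps):
        child[i] = 0
    return child


def crossover(parent_a, parent_b, n_positions, rng):
    """Copy n_positions genes from parent_b into parent_a, then repair sizes."""
    child = list(parent_a)
    for i in rng.sample(range(len(child)), n_positions):
        child[i] = parent_b[i]
    # Correction: randomly reassign surplus samples until the test size matches.
    surplus = sum(child) - sum(parent_a)
    if surplus > 0:
        pool = [i for i, a in enumerate(child) if a == 1]
        new_label = 0
    else:
        pool = [i for i, a in enumerate(child) if a == 0]
        new_label = 1
    for i in rng.sample(pool, abs(surplus)):
        child[i] = new_label
    return child


rng = random.Random(0)                          # fixed seed for reproducibility
parent_a = [0] * 8 + [1] * 2                    # test_size = 0.2
parent_b = rng.sample(parent_a, len(parent_a))  # shuffled copy, same sizes
mutant = mutate(parent_a, n_swaps=1, rng=rng)
child = crossover(parent_a, parent_b, n_positions=4, rng=rng)
```

Both operators leave the number of test samples unchanged, which is the role of the correction step mentioned for crossover_rate.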
The first thing to do is to install the Python packages listed in requirements.txt and then compile the Cython code for the genetic algorithm by running
python setup.py build_ext --inplace
inside the src folder.
Similar to scikit-learn, after defining the MultiOutputStratifiedSplitter (defined in src/splitter.py)
stratified_splitter = MultiOutputStratifiedSplitter(
    n_iterations=n_iterations,
    test_size=test_size,
    population_size=population_size,
    mutation_rate=mutation_rate,
    crossover_rate=crossover_rate,
    n_individuals_by_mutation=n_individuals_by_mutation,
    n_individuals_by_crossover=n_individuals_by_crossover,
    sample_with_replacement=sample_with_replacement,
    verbose=verbose,
    random_state=random_state,
)
one can train it by a simple fit call:
X_train, X_test, y_train, y_test = stratified_splitter.fit(X, y)
If one just wants to try the stratifier, we made a synthetic dataset generator where each target output
data = CustomDataset(
    n_samples=n_samples,
    n_outputs=n_outputs,
    random_state=random_state,
)
X, y = data.generate_data()
Here, the CustomDataset is a particular case of the Dataset class from src/dataset.py, which can generate the multioutput target
data = Dataset(
    n_samples=n_samples,
    n_features=n_features,
    n_classes=n_classes,
    random_state=random_state,
)
X, y = data.generate_data(p_list)
More precisely, the n_classes parameter is a list containing the number of unique values for each target column.
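For readers who want to experiment without the project's Dataset class, a multioutput target of this kind can be imitated with the standard library. This is a minimal sketch: the function `generate_targets` is hypothetical, and the interpretation of `p_list` as optional per-column class weights is an assumption, since its exact meaning is not spelled out here.

```python
import random


def generate_targets(n_samples, n_classes, p_list=None, seed=None):
    """Draw one categorical column per entry of n_classes.

    p_list (assumed meaning): optional per-column class weights.
    """
    rng = random.Random(seed)
    columns = []
    for k, n_values in enumerate(n_classes):
        weights = p_list[k] if p_list is not None else None
        columns.append(rng.choices(range(n_values), weights=weights, k=n_samples))
    # Transpose columns into rows so that y[i] is the i-th multioutput sample.
    return [tuple(row) for row in zip(*columns)]


# Two target columns: 3 unique values (imbalanced) and 5 unique values (uniform).
y = generate_targets(n_samples=100, n_classes=[3, 5],
                     p_list=[[0.6, 0.3, 0.1], None], seed=0)
```

Imbalanced columns such as the first one are the cases where a stratified split is expected to matter most.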
After training the MultiOutputStratifiedSplitter, one can plot the fitness values at each iteration and the standard deviation of the fitnesses from each iteration, respectively:
plot_losses(
    losses=stratified_splitter.losses,
    stds=stratified_splitter.stds,
    save_path=save_path,
)
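The two curves can be read as per-iteration summaries of the population's fitness. The sketch below uses made-up fitness values and assumes that the plotted fitness is the best (lowest) value in the population at each iteration; that choice and the name `fitness_history` are assumptions for illustration only.

```python
from statistics import pstdev

# Hypothetical fitness values of a 4-individual population over 3 iterations
# (lower is assumed to be better).
fitness_history = [
    [0.9, 0.7, 0.8, 0.6],
    [0.6, 0.5, 0.7, 0.4],
    [0.4, 0.4, 0.4, 0.4],
]

losses = [min(fitnesses) for fitnesses in fitness_history]   # best per iteration
stds = [pstdev(fitnesses) for fitnesses in fitness_history]  # spread per iteration
```

A standard deviation that shrinks towards zero, as in the last iteration here, indicates that the population has converged to individuals of similar quality.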
If one wants to plot the heatmap containing the percentages of the counts for each unique value with respect to the train/test subsets, one can call
plot_counts(
    y=y,
    data_indices=[stratified_splitter.train_idx, stratified_splitter.test_idx],
    random_state=stratified_splitter.random_state,
    title='MultiOutputStratifiedSplitter',
    save_path='../results/',
)
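The quantity behind such a heatmap can be illustrated on a single target column: for each unique value, the share of its occurrences that lands in the train and in the test subset. This is a simplified sketch (one column instead of pairs of columns), and the helper `subset_shares` is hypothetical.

```python
from collections import Counter


def subset_shares(y, train_idx, test_idx, k):
    """For output k, map each unique value to its (train %, test %) share."""
    train_counts = Counter(y[i][k] for i in train_idx)
    test_counts = Counter(y[i][k] for i in test_idx)
    shares = {}
    for value in sorted(set(train_counts) | set(test_counts)):
        total = train_counts[value] + test_counts[value]
        shares[value] = (100.0 * train_counts[value] / total,
                         100.0 * test_counts[value] / total)
    return shares


y = [(0, 1), (0, 0), (0, 1), (0, 0), (1, 1), (1, 0), (1, 1), (1, 0)]
train_idx, test_idx = [0, 1, 2, 4, 5, 6], [3, 7]
shares = subset_shares(y, train_idx, test_idx, k=0)
```

For a well-stratified split one would expect the test share of every value to stay close to test_size (here 25%).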
Alternatively, for plotting the distribution of the counts corresponding to the target outputs, one can use
plot_distributions(
    y=y,
    data_indices=[stratified_splitter.train_idx, stratified_splitter.test_idx],
    losses=stratified_splitter.losses,
    random_state=stratified_splitter.random_state,
    title='MultiOutputStratifiedSplitter',
    save_path='../results/',
)
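As noted further below, when the number of outputs exceeds 4 the count plots are restricted to at most 6 randomized pairs of target columns. One way such a selection could work is sketched here; the function `pick_column_pairs` and its arguments are hypothetical and not taken from the project's plotting code.

```python
import random
from itertools import combinations


def pick_column_pairs(n_outputs, max_pairs=6, seed=None):
    """Return all column pairs, or a random sample of max_pairs of them."""
    pairs = list(combinations(range(n_outputs), 2))
    if len(pairs) <= max_pairs:
        return pairs
    return sorted(random.Random(seed).sample(pairs, max_pairs))


pairs_small = pick_column_pairs(4)            # 4 outputs give exactly 6 pairs
pairs_large = pick_column_pairs(10, seed=0)   # 45 possible pairs, 6 are kept
```

Four outputs yield exactly six distinct pairs, so under this sketch sampling only becomes necessary beyond that point.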
To make a GIF animation of the distribution of the counts, a separate function is provided:
make_gif_distributions(
    y=y,
    data_indices=[stratified_splitter.train_indices, stratified_splitter.test_indices],
    losses=stratified_splitter.losses,
    random_state=stratified_splitter.random_state,
    title=method_name,
    gif_fps=n_iterations // 20,
    animation_interval=200,
    save_path=save_path,
)
Finally, if one wants to infer the best parameters for the stratified splitting, then one can use the grid_search_param function defined in src/splitter.py, i.e.
stratified_splitter, best_params = grid_search_param(
    X=X,
    y=y,
    random_state=random_state,
    test_size=test_size,
    n_iterations=n_iterations,
    population_size=population_size,
    mutation_rate=mutation_rate,
    crossover_rate=crossover_rate,
    n_individuals_by_mutation=n_individuals_by_mutation,
    n_individuals_by_crossover=n_individuals_by_crossover,
    sample_with_replacement=sample_with_replacement,
)
It is worth mentioning the following:
- In order to simplify the visualization of the plots involving the counts and their percentages, if the dimensionality of the multioutput target, i.e. $k$ (defined as n_outputs), exceeds $4$, then at most $6$ subplots are generated, involving randomized pairs of target columns.
- The percentages depicted in the counts heatmap are computed using the distributions of the percentages of each unique value corresponding to different pairs of target columns. Furthermore, the heatmap's title reports the medians and the standard deviations of the percentages corresponding to each pair of target columns.
- The best parameters for grid_search_param are found by choosing the parameters for which the mean of the sum of the medians and the standard deviations (mentioned previously) has the lowest value.
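The selection rule in the last point can be written down in a few lines. The sketch below uses made-up (median, standard deviation) summaries for two hypothetical parameter combinations; the names `selection_score` and `candidates` are illustrative and do not reflect the internals of grid_search_param.

```python
from statistics import mean


def selection_score(pair_stats):
    """Mean over column pairs of (median + standard deviation)."""
    return mean(median + std for median, std in pair_stats)


# Hypothetical candidates: (parameters, [(median, std) per pair of columns]).
candidates = [
    ({"mutation_rate": 0.1, "crossover_rate": 0.2}, [(1.0, 0.5), (2.0, 1.0)]),
    ({"mutation_rate": 0.3, "crossover_rate": 0.2}, [(0.5, 0.25), (1.0, 0.25)]),
]

# Keep the parameter combination with the lowest score.
best_params, _ = min(candidates, key=lambda item: selection_score(item[1]))
```

Adding the standard deviation to the median penalizes parameter combinations whose quality varies strongly from one pair of target columns to another.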
For other technical details regarding the usage of the MultiOutputStratifiedSplitter, check the examples from src/main.py. Furthermore, check the folder results for the results that we have obtained for some experiments involving different combinations of data parameters & genetic algorithm hyper-parameters.
1. MultiOutput-MultiClass classification in scikit-learn
2. Florez-Revuelta, Francisco. "Evosplit: An evolutionary approach to split a multi-label data set into disjoint subsets." Applied Sciences 11.6 (2021): 2823.
3. Cython implementation of DecisionTreeClassifier in scikit-learn
4. Cython implementation of HistGradientBoostingClassifier in scikit-learn
5. Smith, Kurt W. "Cython: A Guide for Python Programmers". O'Reilly Media, Inc., 2015.
6. Cython's online documentation
7. Buitinck, Lars, et al. "API design for machine learning software: experiences from the scikit-learn project." arXiv preprint arXiv:1309.0238 (2013).
- Genetic algorithms:
- C (C++) & Cython:
  - Sorting
  - Numpy C functions
  - Conversions
  - Numpy types
  - C containers
  - Shape of Cython memoryviews
  - Indexing
  - Cython exceptions
  - Printing at the C level
  - Fused types & Numpy arrays
  - Creating Numpy arrays from existing data at the C-level
- scikit-learn:
  - Shuffling method - sklearn.utils
  - Resampling method - sklearn.utils
  - Random state checking - sklearn.validation
  - Validation of parameters - sklearn._param_validation
  - Interval class - sklearn._param_validation
  - Representing real numbers that are not instances of int - sklearn._param_validation
  - Computing train & test size - sklearn.model_selection.split
- misc: