This method provides train/test indices according to multiple target columns, and it is suitable for applications involving multiclass-multioutput classification tasks.

CDAlecsa/MultiOutput-Stratified-Splitting

MultiOutputStratified Splitting

This method provides train/test indices according to multiple target columns. It is suitable for applications involving multiclass-multioutput classification[1], in which one has access to multiple targets and each target takes more than 2 distinct values.
The implementation uses a genetic algorithm and is inspired by the EvoSplit algorithm[2]. Furthermore, we have used various resources[3, 4, 5, 6] for the Cython implementation of the underlying genetic algorithm, and the code is heavily inspired by the scikit-learn API design[7].
It is worth emphasizing that the original EvoSplit algorithm was devised for multilabel classification tasks, where there are multiple target columns but each target admits only binary values (0 or 1). Our algorithm is thus an adaptation of EvoSplit to multiclass-multioutput classification tasks, where each target can take multiple categorical values. To generalize the aforementioned method, we have adapted the fitness function to accommodate the cardinality of each target.

Theoretical framework

Let $d$ be the number of input features, $p$ the number of targets, and $N$ the number of samples. A multioutput-multiclass dataset is then defined as $\mathcal{D} = \lbrace (x_i, y_i) \rbrace_{i = \overline{1, N}}$, where the $i^{th}$ sample consists of $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}^p$. With this notation, the input matrix is defined as

$$X = \begin{pmatrix} x_1^T \\\ \ldots \\\ x_N^T \end{pmatrix} \in \mathbb{R}^{N \times d}$$

For a multioutput target sample $y_i$, we define the one-dimensional target components as $y_i^{[k]} \in \mathbb{R}$ for $k \in \lbrace 1, \ldots, p \rbrace$. For every target column index $k \in \lbrace 1, \ldots, p \rbrace$, we denote

$$\textbf{y}^{[k]} = \begin{pmatrix} y_1^{[k]} \\\ \ldots \\\ y_N^{[k]} \end{pmatrix} \in \mathbb{R}^N$$

Therefore, the multioutput target matrix can be written as

$$Y = \begin{pmatrix} \textbf{y}^{[1]}, \ldots, \textbf{y}^{[p]} \end{pmatrix} \in \mathbb{R}^{N \times p}$$

Our aim is to split $\mathcal{D}$ into 2 disjoint subsets $\mathcal{D}_{train}$ and $\mathcal{D}_{test}$, of desired lengths $N_{train}$ and $N_{test}$, respectively. As a loss function, we consider a modified version of the Label Distribution measure[2] adapted to the multiclass-multioutput setting:

$$\mathcal{L} = \dfrac{1}{2p} \sum\limits_{k = 1}^{p} \left( \left\| \dfrac{\Lambda_{train}^{[k]}}{N_{train} - \Lambda_{train}^{[k]}} - \dfrac{\Lambda^{[k]}}{N - \Lambda^{[k]}} \right\|_{\mathcal{L}_1} + \left\| \dfrac{\Lambda_{test}^{[k]}}{N_{test} - \Lambda_{test}^{[k]}} - \dfrac{\Lambda^{[k]}}{N - \Lambda^{[k]}} \right\|_{\mathcal{L}_1} \right)$$

where we have used the following notations:

  • $\Lambda^{[k]}$ is the vector of counts of the unique values of the $k^{th}$ target output, computed over the samples of the original dataset $\mathcal{D}$
  • $\Lambda_{subset}^{[k]}$ is the vector of counts of the unique values of the $k^{th}$ target output, computed over the samples of $\mathcal{D}_{subset}$, where $subset \in \lbrace \text{train, test} \rbrace$
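
As an illustration of the loss $\mathcal{L}$, the following is a minimal NumPy sketch and not the Cython code from this repository. It assumes that the quotients in the formula are taken elementwise over the unique values of each target, and that every value appears in only part of each subset, so that no denominator is zero. The function name `label_distribution_loss` is hypothetical.

```python
import numpy as np

def label_distribution_loss(Y, train_idx, test_idx):
    """Sketch of the loss L for a target matrix Y of shape (N, p)."""
    Y = np.asarray(Y)
    N, p = Y.shape
    total = 0.0
    for k in range(p):
        # Lambda^{[k]}: counts of the unique values of the k-th target
        values, counts = np.unique(Y[:, k], return_counts=True)
        full_ratio = counts / (N - counts)
        for idx in (np.asarray(train_idx), np.asarray(test_idx)):
            subset = Y[idx, k]
            # Lambda_subset^{[k]}, aligned with the same unique values
            sub_counts = np.array([(subset == v).sum() for v in values])
            sub_ratio = sub_counts / (len(idx) - sub_counts)
            # L1 norm of the difference between the two ratio vectors
            total += np.abs(sub_ratio - full_ratio).sum()
    return total / (2 * p)

# A split that preserves the proportions of every target has zero loss.
Y = np.array([[0, 0], [0, 1], [1, 0], [1, 1]] * 2)
print(label_distribution_loss(Y, [0, 1, 2, 3], [4, 5, 6, 7]))
```

A lower value means that the per-value ratios in the train and test subsets are closer to those of the full dataset.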

Parameters of the MultiOutputStratifiedSplitter

  • n_iterations [int]: number of learning stages of the genetic algorithm
  • test_size [float]: the proportion of the dataset representing the test subset
  • population_size [int]: number of genetic algorithm's individuals
  • mutation_rate [float | int]: the percentage/number of the indices which need to be swapped between the train & test partitions of each individual.
  • crossover_rate [float | int]: the percentage/number of indices which need to be swapped between two individuals. As stated in the original EvoSplit algorithm, in order to preserve the train/test split percentages, we use a correction process by randomly reassigning extra train indices to the test subset and vice versa.
  • n_individuals_by_mutation [int]: number of new individuals generated at each iteration by mutation
  • n_individuals_by_crossover [int]: number of new individuals generated at each iteration by crossover
  • sample_with_replacement [bool]: specifies if the parents considered in the mutation and crossover processes are drawn with replacement or not
  • verbose [bool]: enables the printing of the fitness values at each iteration
  • random_state [None | int | np.random.RandomState]: parameter which can lead to reproducible experiments
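
To make the roles of mutation_rate and crossover_rate more concrete, here is a hedged sketch of the two operators. It assumes that an individual is encoded as a boolean mask over the samples (True for test, False for train) and that the rates have already been converted to integer counts. The encoding and the names `mutate` and `crossover` are assumptions made for illustration and may differ from the Cython implementation.

```python
import numpy as np

def mutate(is_test, n_swaps, rng):
    """Swap n_swaps samples between the train and test partitions."""
    child = is_test.copy()
    to_test = rng.choice(np.flatnonzero(~child), size=n_swaps, replace=False)
    to_train = rng.choice(np.flatnonzero(child), size=n_swaps, replace=False)
    child[to_test] = True
    child[to_train] = False
    return child

def crossover(parent_a, parent_b, n_genes, n_test, rng):
    """Copy n_genes assignments from parent_b, then restore the test size."""
    child = parent_a.copy()
    pos = rng.choice(child.size, size=n_genes, replace=False)
    child[pos] = parent_b[pos]
    # Correction step: randomly reassign the surplus so that exactly
    # n_test samples remain in the test subset.
    surplus = int(child.sum()) - n_test
    if surplus > 0:
        flip = rng.choice(np.flatnonzero(child), size=surplus, replace=False)
        child[flip] = False
    elif surplus < 0:
        flip = rng.choice(np.flatnonzero(~child), size=-surplus, replace=False)
        child[flip] = True
    return child

rng = np.random.default_rng(0)
parent = np.zeros(20, dtype=bool)
parent[:5] = True                      # 5 test samples out of 20
child = mutate(parent, 3, rng)         # still 5 test samples
```

Both operators keep the number of test samples fixed, which is what preserves the requested test_size across generations.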

Installation

First, install the Python packages listed in requirements.txt, then compile the Cython code for the genetic algorithm by running

python setup.py build_ext --inplace

inside the src folder.

Running the code

Similar to scikit-learn, after instantiating the MultiOutputStratifiedSplitter (defined in src/splitter.py)

stratified_splitter = \
MultiOutputStratifiedSplitter(n_iterations = n_iterations,
                              test_size = test_size,
                              population_size = population_size,
                              mutation_rate = mutation_rate, 
                              crossover_rate = crossover_rate,
                              n_individuals_by_mutation = n_individuals_by_mutation,
                              n_individuals_by_crossover = n_individuals_by_crossover,
                              sample_with_replacement = sample_with_replacement,
                              verbose = verbose,
                              random_state = random_state
                            )

one can train it by a simple fit call:

X_train, X_test, y_train, y_test = stratified_splitter.fit(X, y)

If one just wants to try the stratifier, we made a synthetic dataset generator where each target output $\textbf{y}^{[k]}$ is generated according to chosen probabilities. For this one can use

data = CustomDataset(n_samples = n_samples, 
                     n_outputs = n_outputs,
                     random_state = random_state)
X, y = data.generate_data()

Here, the CustomDataset is a particular case of the Dataset class from src/dataset.py, which can generate the multioutput target $Y$ as follows:

data = Dataset(n_samples = n_samples, 
               n_features = n_features,
               n_classes = n_classes,
               random_state = random_state)
X, y = data.generate_data(p_list)

More precisely, the n_classes parameter is a list containing the number of unique values for each target column $\textbf{y}^{[k]}$. At the same time, p_list is a list of dictionaries, where each dictionary corresponds to a target column $\textbf{y}^{[k]}$. For an index $k$, the keys of the $k^{th}$ dictionary are the unique values of that target column, and the dictionary values are the probabilities corresponding to those keys. The synthetic dataset is generated by multiplying, for each sample, the probabilities of the target columns, and then drawing the $p$ outputs of each sample according to the resulting joint probability.
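
The sampling scheme described above can be sketched as follows. This is an illustrative NumPy version and not the code from src/dataset.py. The helper name `sample_targets` and the example probabilities are hypothetical.

```python
import itertools
import numpy as np

def sample_targets(p_list, n_samples, seed=0):
    """Draw an (n_samples, p) target matrix from per-column probabilities."""
    rng = np.random.default_rng(seed)
    # Every combination of unique values across the p target columns
    combos = list(itertools.product(*[list(d.keys()) for d in p_list]))
    # Joint probability of a combination = product of column probabilities
    joint = np.array([
        np.prod([d[v] for d, v in zip(p_list, combo)]) for combo in combos
    ])
    joint = joint / joint.sum()  # guard against rounding in the inputs
    chosen = rng.choice(len(combos), size=n_samples, p=joint)
    return np.array(combos)[chosen]

# Two target columns with 3 unique values each (n_classes would be [3, 3])
p_list = [{0: 0.5, 1: 0.3, 2: 0.2}, {0: 0.1, 1: 0.6, 2: 0.3}]
Y = sample_targets(p_list, n_samples=200)
```

Because the joint probability is a plain product, the target columns generated this way are statistically independent of each other.
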
After training the MultiOutputStratifiedSplitter, one can plot the fitness values at each iteration and the standard deviation of the fitnesses from each iteration, respectively:

plot_losses(losses = stratified_splitter.losses, 
            stds = stratified_splitter.stds,
            save_path = save_path
            )

[Figure: Fitness function]

If one wants to plot the heatmap containing the percentages of the counts for each unique value with respect to the train/test subsets, one can call

plot_counts(y = y, 
            data_indices = [stratified_splitter.train_idx,      
                            stratified_splitter.test_idx], 
            random_state = stratified_splitter.random_state,
            title = 'MultiOutputStratifiedSplitter',
            save_path = '../results/'
            )

[Figure: Counts heatmap]

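
The quality of a split can also be checked numerically, without plotting. The sketch below computes, for every unique value of every target column, the percentage of its samples that ended up in the train subset. It is an independent illustration and not necessarily the computation performed by plot_counts. The helper name `train_share_per_class` is hypothetical.

```python
import numpy as np

def train_share_per_class(y, train_idx):
    """Map (column index, value) to the percentage of samples in train."""
    y = np.asarray(y)
    in_train = np.zeros(len(y), dtype=bool)
    in_train[np.asarray(train_idx)] = True
    shares = {}
    for k in range(y.shape[1]):
        for value in np.unique(y[:, k]):
            mask = y[:, k] == value
            # Fraction of the samples with this value that are in train
            shares[(k, value.item())] = 100.0 * in_train[mask].mean()
    return shares

y = np.array([[0, 0], [0, 1], [1, 0], [1, 1]] * 2)
shares = train_share_per_class(y, train_idx=[0, 1, 2, 3])
```

For a well-stratified split, all percentages should be close to 100 * (1 - test_size).
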
Alternatively, for plotting the distribution of the counts corresponding to the target outputs, one can use

plot_distributions(y = y, 
                   data_indices = [stratified_splitter.train_idx,
                                   stratified_splitter.test_idx], 
                   losses = stratified_splitter.losses,
                   random_state = stratified_splitter.random_state,
                   title = 'MultiOutputStratifiedSplitter',
                   save_path = '../results/'
                   )

[Figure: Counts distribution]

To make a GIF animation of the distribution of the counts, one can use a separate function:

make_gif_distributions(y = y, 
                       data_indices = [stratified_splitter.train_indices,
                                       stratified_splitter.test_indices], 
                       losses = stratified_splitter.losses,
                       random_state = stratified_splitter.random_state,
                       title = method_name,
                       gif_fps = n_iterations // 20,
                       animation_interval = 200,
                       save_path = save_path
                       )

[Figure: Counts distribution animation]

Finally, to infer the best parameters for the stratified splitting, one can use the grid_search_param function defined in src/splitter.py, i.e.

(stratified_splitter, 
best_params) = grid_search_param(X = X,
                                 y = y,
                                 random_state = random_state,
                                 test_size = test_size,
                                 n_iterations = n_iterations,
                                 population_size = population_size,
                                 mutation_rate = mutation_rate,
                                 crossover_rate = crossover_rate,
                                 n_individuals_by_mutation =  n_individuals_by_mutation,
                                 n_individuals_by_crossover = n_individuals_by_crossover,
                                 sample_with_replacement = sample_with_replacement
                                )

It is worth mentioning the following:

  • To simplify the plots involving the counts and their percentages, if the number of target columns $p$ (defined as n_outputs) exceeds $4$, then at most $6$ subplots are generated, each involving a randomly chosen pair of target columns.
  • The percentages depicted in the counts heatmap are computed using the distributions of the percentages of each unique value corresponding to different pairs of target columns. Furthermore, the heatmap's title reports the medians and the standard deviations of the percentages corresponding to each pair of target columns.
  • The best parameters for the grid_search_param are found by choosing the parameters for which the mean of the sum of the medians and the standard deviation (mentioned previously) has the lowest value.

For other technical details regarding the usage of the MultiOutputStratifiedSplitter, check the examples from src/main.py. Furthermore, check the folder results for the results that we have obtained for some experiments involving different combinations of data parameters & genetic algorithm hyper-parameters.

References

  1. MultiOutput-MultiClass classification in scikit-learn
  2. Florez-Revuelta, Francisco. "Evosplit: An evolutionary approach to split a multi-label data set into disjoint subsets." Applied Sciences 11.6 (2021): 2823.
  3. Cython implementation of DecisionTreeClassifier in scikit-learn
  4. Cython implementation of HistGradientBoostingClassifier in scikit-learn
  5. Smith, Kurt W. "Cython: A Guide for Python Programmers". O'Reilly Media, Inc., 2015.
  6. Cython's online documentation
  7. Buitinck, Lars, et al. "API design for machine learning software: experiences from the scikit-learn project." arXiv preprint arXiv:1309.0238 (2013).
