MultiOutputStratified Splitting

This method provides train/test indices stratified according to multiple target columns. It is suited to multiclass-multioutput classification^[1], where several targets are available and each target has cardinality strictly greater than 2.
The implementation uses a genetic algorithm inspired by the EvoSplit algorithm^[2]. The Cython implementation of the underlying genetic algorithm draws on several resources^{[3, 4, 5, 6]}, and the code is heavily inspired by the scikit-learn API design^[7].
It is worth emphasizing that the original EvoSplit algorithm was devised for multilabel classification tasks, where there are multiple target columns but each target admits only binary values (0 or 1). Our algorithm is therefore an adaptation of EvoSplit to multiclass-multioutput classification tasks, where each target can take multiple categorical values. To generalize the method, we adapted the fitness function to accommodate the arbitrary cardinality of each target.

Theoretical framework

Let $d$ be the number of input features, $p$ the number of targets, and $N$ the number of samples. Then a multioutput-multiclass dataset is defined as $\mathcal{D} = \lbrace (x_i, y_i) \rbrace_{i = 1}^{N}$, where the $i^{th}$ observation consists of $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}^p$. With this notation, the input matrix is defined as

$$X = \begin{pmatrix} x_1^T \\\ \ldots \\\ x_N^T \end{pmatrix} \in \mathbb{R}^{N \times d}$$

For a multioutput target sample $y_i$, we define the one-dimensional target components as $y_i^{[k]} \in \mathbb{R}$ for $k \in \lbrace 1, \ldots, p \rbrace$. For every target column index $k \in \lbrace 1, \ldots, p \rbrace$, we denote

$$\textbf{y}^{[k]} = \begin{pmatrix} y_1^{[k]} \\\ \ldots \\\ y_N^{[k]} \end{pmatrix} \in \mathbb{R}^N$$

Therefore, the multioutput target matrix can be written as

$$Y = \begin{pmatrix} \textbf{y}^{[1]}, \ldots, \textbf{y}^{[p]} \end{pmatrix} \in \mathbb{R}^{N \times p}$$

Our aim is to split $\mathcal{D}$ into two disjoint subsets $\mathcal{D}_{train}$ and $\mathcal{D}_{test}$ of desired sizes $N_{train}$ and $N_{test}$, respectively. As a loss function, we consider a modified version of the Label Distribution measure^[2], adapted to the multiclass-multioutput setting:

$$\mathcal{L} = \dfrac{1}{2p} \sum\limits_{k = 1}^{p} \left( \left\| \dfrac{\Lambda_{train}^{[k]}}{N_{train} - \Lambda_{train}^{[k]}} - \dfrac{\Lambda^{[k]}}{N - \Lambda^{[k]}} \right\|_{\mathcal{L}_1} + \left\| \dfrac{\Lambda_{test}^{[k]}}{N_{test} - \Lambda_{test}^{[k]}} - \dfrac{\Lambda^{[k]}}{N - \Lambda^{[k]}} \right\|_{\mathcal{L}_1} \right)$$

where we have used the following notations:

$\Lambda^{[k]}$ is the vector of counts of the unique values of the $k^{th}$ target output over the samples of the original dataset $\mathcal{D}$
$\Lambda_{subset}^{[k]}$ is the vector of counts of the unique values of the $k^{th}$ target output over the samples of $\mathcal{D}_{subset}$, where $subset \in \lbrace \text{train, test} \rbrace$
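
As a rough illustration of the loss defined above, the following NumPy sketch evaluates it for a given pair of index arrays. The names class_odds and label_distribution_loss are hypothetical and are not taken from the repository. The sketch assumes that the quotients of count vectors are taken elementwise, that the counts of each subset are aligned on the unique values found in the full dataset, and that no single class fills an entire subset (otherwise a denominator would be zero).

```python
import numpy as np


def class_odds(column, classes, n):
    """Return Lambda / (n - Lambda), with counts aligned on `classes`."""
    counts = np.array([(column == c).sum() for c in classes], dtype=float)
    return counts / (n - counts)


def label_distribution_loss(Y, train_idx, test_idx):
    """Sketch of the multiclass-multioutput Label Distribution loss."""
    Y = np.asarray(Y)
    n, p = Y.shape
    total = 0.0
    for k in range(p):
        # Unique values are taken from the full dataset so that the
        # train/test count vectors have the same length and ordering.
        classes = np.unique(Y[:, k])
        full = class_odds(Y[:, k], classes, n)
        train = class_odds(Y[train_idx, k], classes, len(train_idx))
        test = class_odds(Y[test_idx, k], classes, len(test_idx))
        # L1 norm of the difference, for the train and the test subset.
        total += np.abs(train - full).sum() + np.abs(test - full).sum()
    return float(total / (2 * p))


# Toy targets: two columns, three classes each, two samples per class.
Y = np.array([[0, 0], [0, 0], [1, 1], [1, 1], [2, 2], [2, 2]])
balanced = label_distribution_loss(Y, np.array([0, 2, 4]), np.array([1, 3, 5]))
skewed = label_distribution_loss(Y, np.array([0, 1, 2]), np.array([3, 4, 5]))
print(balanced, skewed)  # 0.0 2.0
```

A split that reproduces the class proportions of the full dataset in both subsets gives a loss of zero, while the skewed split is penalized.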

Parameters of the MultiOutputStratifiedSplitter

n_iterations [int]: number of learning stages of the genetic algorithm
test_size [float]: the proportion of the dataset representing the test subset
population_size [int]: number of genetic algorithm's individuals
mutation_rate [float | int]: the percentage/number of the indices which need to be swapped between the train & test partitions of each individual.
crossover_rate [float | int]: the percentage/number of indices which need to be swapped between two individuals. As stated in the original EvoSplit algorithm, in order to preserve the train/test split percentages, we use a correction process by randomly reassigning extra train indices to the test subset and vice versa.
n_individuals_by_mutation [int]: number of new individuals generated at each iteration by mutation
n_individuals_by_crossover [int]: number of new individuals generated at each iteration by crossover
sample_with_replacement [bool]: specifies if the parents considered in the mutation and crossover processes are drawn with replacement or not
verbose [bool]: enables the printing of the fitness values at each iteration
random_state [None | int | np.random.RandomState]: seed or random number generator used to make experiments reproducible
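
To make the roles of mutation_rate and crossover_rate more concrete, here is a minimal NumPy sketch of the two operators. It is one plausible reading of the descriptions above, not the repository's Cython implementation. The boolean-mask representation of an individual and the helper names resolve_count, mutate and crossover are assumptions made for illustration.

```python
import numpy as np


def resolve_count(rate, n_samples):
    """Read a float rate as a fraction of n_samples, an int rate as a count."""
    if isinstance(rate, float):
        return int(round(rate * n_samples))
    return int(rate)


def mutate(test_mask, mutation_rate, rng):
    """Swap equally many train and test samples, preserving both sizes."""
    child = test_mask.copy()
    train_idx = np.flatnonzero(~child)
    test_idx = np.flatnonzero(child)
    n_swaps = min(resolve_count(mutation_rate, child.size),
                  train_idx.size, test_idx.size)
    child[rng.choice(train_idx, size=n_swaps, replace=False)] = True
    child[rng.choice(test_idx, size=n_swaps, replace=False)] = False
    return child


def crossover(mask_a, mask_b, crossover_rate, rng):
    """Copy some assignments from mask_b into mask_a, then repair the sizes."""
    n_test = int(mask_a.sum())
    child = mask_a.copy()
    genes = rng.choice(child.size,
                       size=resolve_count(crossover_rate, child.size),
                       replace=False)
    child[genes] = mask_b[genes]
    # Correction step: randomly reassign surplus samples to the other subset.
    surplus = int(child.sum()) - n_test
    if surplus > 0:
        child[rng.choice(np.flatnonzero(child), size=surplus, replace=False)] = False
    elif surplus < 0:
        child[rng.choice(np.flatnonzero(~child), size=-surplus, replace=False)] = True
    return child


rng = np.random.default_rng(0)
parent_a = np.zeros(20, dtype=bool)
parent_a[:5] = True                    # 5 test samples out of 20
parent_b = rng.permutation(parent_a)   # another split with the same test size

mutant = mutate(parent_a, mutation_rate=2, rng=rng)
child = crossover(parent_a, parent_b, crossover_rate=0.5, rng=rng)
print(mutant.sum(), child.sum())  # 5 5
```

In this sketch both operators leave the test size unchanged, which mirrors the correction process described for crossover_rate.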

Installation

First install the Python packages listed in requirements.txt, then compile the Cython code for the genetic algorithm by running

python setup.py build_ext --inplace

inside the src folder.

Running the code

As in scikit-learn, after instantiating the MultiOutputStratifiedSplitter (defined in src/splitter.py)

stratified_splitter = MultiOutputStratifiedSplitter(
    n_iterations=n_iterations,
    test_size=test_size,
    population_size=population_size,
    mutation_rate=mutation_rate,
    crossover_rate=crossover_rate,
    n_individuals_by_mutation=n_individuals_by_mutation,
    n_individuals_by_crossover=n_individuals_by_crossover,
    sample_with_replacement=sample_with_replacement,
    verbose=verbose,
    random_state=random_state,
)

one can train it by a simple fit call:

X_train, X_test, y_train, y_test = stratified_splitter.fit(X, y)

To simply try out the stratifier, a synthetic dataset generator is provided, in which each target output $\textbf{y}^{[k]}$ is generated according to chosen probabilities. For this, one can use

data = CustomDataset(
    n_samples=n_samples,
    n_outputs=n_outputs,
    random_state=random_state,
)
X, y = data.generate_data()

Here, the CustomDataset is a particular case of the Dataset class from src/dataset.py, which can generate the multioutput target $Y$ as follows:

data = Dataset(
    n_samples=n_samples,
    n_features=n_features,
    n_classes=n_classes,
    random_state=random_state,
)
X, y = data.generate_data(p_list)

More precisely, the n_classes parameter is a list containing the number of unique values for each target column $\textbf{y}^{[k]}$, while p_list is a list of dictionaries, one per target column. For an index $k$, the keys of the $k^{th}$ dictionary are the unique values of the $k^{th}$ target column, and the dictionary values are the probabilities corresponding to those keys. The synthetic dataset is generated by multiplying, for each sample, the probabilities of the target columns, and then drawing the $p$ outputs of each sample with the resulting joint probability.
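
The generation scheme described above can be sketched as follows. This is an illustration under the assumption that the joint probability of a combination of target values is the product of the per-column probabilities. The function name sample_targets and the example p_list are hypothetical and are not the Dataset class from src/dataset.py.

```python
import itertools

import numpy as np


def sample_targets(p_list, n_samples, rng):
    """Draw an (n_samples, p) target matrix from per-column probability dicts."""
    value_sets = [list(probs.keys()) for probs in p_list]
    # Every combination of one value per target column.
    combos = list(itertools.product(*value_sets))
    # Joint probability of a combination: product of its per-column probabilities.
    joint = np.array([
        np.prod([probs[value] for probs, value in zip(p_list, combo)])
        for combo in combos
    ])
    joint = joint / joint.sum()  # guard against rounding in the inputs
    rows = rng.choice(len(combos), size=n_samples, p=joint)
    return np.array([combos[r] for r in rows])


# Two target columns with three unique values each (illustrative values).
p_list = [
    {0: 0.5, 1: 0.3, 2: 0.2},
    {0: 0.25, 1: 0.25, 2: 0.5},
]
rng = np.random.default_rng(0)
y = sample_targets(p_list, n_samples=1000, rng=rng)
print(y.shape)  # (1000, 2)
```

Since the joint probability is a plain product here, the target columns are independent of each other in this sketch.
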
After fitting the MultiOutputStratifiedSplitter, one can plot the fitness value and the standard deviation of the fitness values at each iteration:

plot_losses(
    losses=stratified_splitter.losses,
    stds=stratified_splitter.stds,
    save_path=save_path,
)

If one wants to plot the heatmap containing the percentages of the counts for each unique value with respect to the train/test subsets, one can call

plot_counts(
    y=y,
    data_indices=[stratified_splitter.train_idx,
                  stratified_splitter.test_idx],
    random_state=stratified_splitter.random_state,
    title='MultiOutputStratifiedSplitter',
    save_path='../results/',
)

Alternatively, for plotting the distribution of the counts corresponding to the target outputs, one can use

plot_distributions(
    y=y,
    data_indices=[stratified_splitter.train_idx,
                  stratified_splitter.test_idx],
    losses=stratified_splitter.losses,
    random_state=stratified_splitter.random_state,
    title='MultiOutputStratifiedSplitter',
    save_path='../results/',
)

A separate function creates a gif animation of the distribution of the counts:

make_gif_distributions(
    y=y,
    data_indices=[stratified_splitter.train_indices,
                  stratified_splitter.test_indices],
    losses=stratified_splitter.losses,
    random_state=stratified_splitter.random_state,
    title=method_name,
    gif_fps=n_iterations // 20,
    animation_interval=200,
    save_path=save_path,
)

Finally, to infer the best parameters for the stratified splitting, one can use the grid_search_param function defined in src/splitter.py, i.e.

stratified_splitter, best_params = grid_search_param(
    X=X,
    y=y,
    random_state=random_state,
    test_size=test_size,
    n_iterations=n_iterations,
    population_size=population_size,
    mutation_rate=mutation_rate,
    crossover_rate=crossover_rate,
    n_individuals_by_mutation=n_individuals_by_mutation,
    n_individuals_by_crossover=n_individuals_by_crossover,
    sample_with_replacement=sample_with_replacement,
)

It is worth mentioning the following:

To simplify the visualization of the plots involving the counts and their percentages, if the dimensionality of the multioutput target, i.e. $p$ (defined as n_outputs), exceeds $4$, then at most $6$ subplots are generated, each involving a randomly chosen pair of target columns.
The percentages depicted in the counts heatmap are computed from the distributions of the percentages of each unique value for the different pairs of target columns. Furthermore, the heatmap's title reports the medians and the standard deviations of the percentages corresponding to each pair of target columns.
In grid_search_param, the best parameters are those for which the mean of the sum of the medians and the standard deviations (mentioned previously) is lowest.
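
The selection rule in the last remark can be illustrated with a short sketch. It assumes that each candidate parameter setting comes with one median and one standard deviation per pair of target columns, and that the score is the mean over pairs of their sum. The function selection_score, the candidate labels and the numbers are all hypothetical; they are not the internals of grid_search_param.

```python
import numpy as np


def selection_score(medians, stds):
    """Mean over column pairs of (median + standard deviation)."""
    medians = np.asarray(medians, dtype=float)
    stds = np.asarray(stds, dtype=float)
    return float(np.mean(medians + stds))


# Hypothetical per-pair statistics for two candidate parameter settings.
candidates = {
    "population_size=50": {"medians": [1.0, 2.0], "stds": [0.5, 0.5]},
    "population_size=100": {"medians": [0.5, 1.0], "stds": [0.25, 0.25]},
}
scores = {
    name: selection_score(stats["medians"], stats["stds"])
    for name, stats in candidates.items()
}
best = min(scores, key=scores.get)  # the lowest score wins
print(best, scores[best])  # population_size=100 1.0
```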

For other technical details regarding the usage of the MultiOutputStratifiedSplitter, see the examples in src/main.py. The results folder contains the results we obtained for experiments with different combinations of data parameters and genetic algorithm hyperparameters.

References

[1] MultiOutput-MultiClass classification in scikit-learn
[2] Florez-Revuelta, Francisco. "Evosplit: An evolutionary approach to split a multi-label data set into disjoint subsets." Applied Sciences 11.6 (2021): 2823.
[3] Cython implementation of DecisionTreeClassifier in scikit-learn
[4] Cython implementation of HistGradientBoostingClassifier in scikit-learn
[5] Smith, Kurt W. "Cython: A Guide for Python Programmers". O'Reilly Media, Inc., 2015.
[6] Cython's online documentation
[7] Buitinck, Lars, et al. "API design for machine learning software: experiences from the scikit-learn project." arXiv preprint arXiv:1309.0238 (2013).

Additional resources

Genetic algorithms:
C (C++) & Cython:
- Sorting:
- Numpy C functions:
- Conversions:
  - Utility function for the conversion of np.ndarray to C++ std::valarray
- Numpy types:
- C containers:
- Shape of Cython memoryviews:
  - Cython returns an 8D list instead of a 2D tuple
- Indexing:
  - Indexing memoryviews with multiple indices
- Cython exceptions:
  - Exception handling when returning a memoryview
- Printing at the C level:
  - Print size_t variables portably in C
  - Print ssize_t variables in C
- Fused types & Numpy arrays:
  - Mapping Numpy types to C-level types using a given fused type
  - Numpy types & C types
- Creating Numpy arrays from existing data at the C-level
  - Exposing C-computed arrays in Python without data copies
scikit-learn:
misc:
