Command
SOL provides a set of useful command line tools related to data processing and model training/testing.
Data formats supported by this software package are "svmlight" (commonly used in LIBSVM and LIBLINEAR), "csv", and a binary format of our own. Labels and features in data files must all be numeric.
- The Binary Format
The binary format is for fast loading and processing. It is used to cache datasets, like in cross-validation procedures. Each sample in binary format is comprised of the following items in sequence:
- **label**: *sizeof(label_t)*, *label_t* is *int32_t* by default;
- **feature number**: *sizeof(size_t)*;
- **length of compressed index**: *sizeof(size_t)*;
- **compressed index**: *sizeof(char)* * **length of compressed index**;
- **features**: *sizeof(float)* * **feature number**;
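For illustration, the following minimal Python sketch reads one sample in this layout. The byte order, the width of *size_t* (assumed to be 8 bytes here), and the meaning of the compressed index bytes are assumptions, not part of the specification above, so treat this as a sketch rather than a reference implementation.

```python
import struct

def read_sample(f):
    # Assumptions: little-endian byte order, int32 label,
    # 8-byte size_t, float32 features.
    header = f.read(4)
    if not header:
        return None  # end of file
    label, = struct.unpack('<i', header)
    feat_num, = struct.unpack('<Q', f.read(8))
    idx_len, = struct.unpack('<Q', f.read(8))
    # The index compression scheme is internal to the library,
    # so the bytes are kept opaque here.
    compressed_index = f.read(idx_len)
    features = struct.unpack('<{}f'.format(feat_num), f.read(4 * feat_num))
    return label, compressed_index, list(features)
```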
- Data Preprocessing Tools
The library provides some tools to help users preprocess datasets, such as analyzing, splitting, and shuffling. Note that these tools support the data formats mentioned above.
- **analyze**: report statistics of a dataset: the number of samples, number of features, feature dimension, number of nonzero features, number of classes, and feature sparsity.
- **concat**: concatenate several data files into one.
- **converter**: convert data from one format to another.
- **shuffle**: shuffle the order of data samples.
- **split**: split one data file into several parts.
The detailed input parameters of each tool can be obtained by running it without options or with the "-h" or "--help" option.
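For example, assuming the tools are invoked by the names listed above:

$ converter -h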
The training tool is sol_train. Running sol_train without any arguments or with "--help/-h" will produce a message which briefly explains each argument.
The command to call sol_train is:
$ sol_train [options] ... train_file [model_file]
Options include "General Options", "IO Options" and "Model Options".
- General Options
-h arg : show all help information;
-s arg : show list of information specified by arg.
The "-s" option is to help users know what the library can do (what kind of algorithms are implemented, what kind of data formats are supported, what kind of loss functions are implemented, etc.) without checking the code. The available arguments include "reader", "writer", "model", and "loss". For example, running with "model" will show users the available algorithms as follows:
$ sol_train -s model
ada-fobos-l1: Adaptive Subgradient FOBOS with l1 regularization
ada-fobos: Adaptive Subgradient FOBOS
ada-rda-l1: Adaptive Subgradient RDA with l1 regularization
ada-rda: Adaptive Subgradient RDA
alma2: Approximate Large Margin Algorithm with norm 2
arow: Adaptive Regularization of Weight Vectors
cw: confidence weighted online learning
eccw: exact convex confidence weighted online learning
erda-l1: mixed l1-l2^2 enhanced regularized dual averaging
fobos-l1: Forward Backward Splitting l1 regularization
ogd: Online Gradient Descent
pa1: Online Passive Aggressive-1
pa2: Online Passive Aggressive-2
pa: Online Passive Aggressive
perceptron: perceptron algorithm
rda-l1: mixed l1-l2^2 regularized dual averaging
rda: l2^2 regularized dual averaging
sop: second order perceptron
stg: Sparse Online Learning via Truncated Gradient
Running with "reader" will show users the data readers for all supported data formats:
$ sol_train -s reader
bin: binary format data reader
csv: csv format data reader
svm: libsvm format data reader
- IO Options
-f arg : dataset format ('svm'[default], 'bin', or 'csv')
-c arg : number of classes (default=2)
-p arg : number of passes to go through the data (default=1).
-d arg : dimension of the data.
Note that the IO options are not required. If the "-d" option is not specified, the tool will learn the dimension by itself, at the cost of a little more memory.
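For example, to train on a csv file with 5 passes over the data (the file name here is just a placeholder):

$ sol_train -f csv -p 5 train.csv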
- Model Options
-a arg : learning algorithm to use (see "General Options")
-m arg : path to pre-trained model for finetuning
--params arg: parameters for algorithms in the format "param=val;param2=val2;..."
Some useful parameters include:
- loss=[string]
  Type of the loss function. The supported values are:
  - bool : 1 if the prediction is wrong, 0 if correct
  - hinge : hinge loss
  - maxscore-bool : multi-class max-score bool loss
  - maxscore-hinge : multi-class max-score hinge loss
  - uniform-bool : multi-class uniform bool loss
  - uniform-hinge : multi-class uniform hinge loss
- lambda=[float]
  Regularization parameter for sparse online learning algorithms.
- norm=[string]
  Normalize the data samples; the supported normalization methods include:
  - l1 : divide each feature by the L1 norm
  - l2 : divide each feature by the L2 norm
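As a usage sketch combining these parameters (the lambda value here is an arbitrary placeholder, not a recommended setting), sparse online learning with STG on L2-normalized samples can be invoked as:

$ sol_train -a stg --params "lambda=0.01;norm=l2" data/a1a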
- eta=[float]
  Learning rate for online algorithms. For the OGD algorithm, the related parameters are:
  - eta : initial learning rate
  - power_t : power of the decaying learning rate
  - t : initial iteration number
So the options can be:
$ sol_train -a ogd --params "eta=0.1;power_t=1;t=100" data_path
The following table shows the algorithms and their corresponding parameters.

Algorithm | Parameters | Meaning |
---|---|---|
Ada-FOBOS | eta | learning rate |
 | delta | parameter to ensure the positive-definite property of the adaptive weighting matrix |
Ada-RDA | eta | learning rate |
 | delta | parameter to ensure the positive-definite property of the adaptive weighting matrix |
ALMA | alpha | final margin parameter (1 - alpha) * gamma |
 | C | typically set to sqrt(2) |
AROW | r | trade-off parameter of the passive-aggressive update |
CW | a | initial confidence |
 | phi | threshold of the inverse normal distribution of the threshold probability |
ECCW | a | initial confidence |
 | phi | threshold of the inverse normal distribution of the threshold probability |
OGD | eta | learning rate |
 | power_t | power of the decaying learning rate |
PA1 | C | passive-aggressive trade-off parameter |
PA2 | C | passive-aggressive trade-off parameter |
RDA | sigma | coefficient of the proximal function |
ERDA | sigma | coefficient of the proximal function |
 | rou | parameter for the l1 penalty in the proximal function |
SOP | a | parameter of the positive-definite normalization matrix |
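For instance, combining the table with the "--params" syntax above, Ada-FOBOS can be configured as follows (the values here are arbitrary placeholders, not recommended settings):

$ sol_train -a ada-fobos --params "eta=0.5;delta=10" data/a1a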
We provide an example to show how to use sol_train and explain the details of how it works. The dataset we use is "a1a", provided in the "data" folder.
The command for training with the default algorithm is:
$ sol_train data/a1a
Output of the above command will be:
--------------------------------------------------
Model Information:
{
"clf_num" : 1,
"cls_num" : 2,
"loss" : "hinge",
"model" : "ogd",
"norm" : 0,
"online" : {
"bias_eta" : 0,
"dim" : 1,
"eta" : 1,
"lazy_update" : "false",
"power_t" : 0.5,
"t" : 0
}
}
Training Process....
Iterate No. Error Rate Update No.
2 0.500000 1
4 0.250000 1
8 0.125000 1
16 0.187500 3
32 0.125000 8
64 0.218750 19
128 0.179688 41
256 0.218750 95
512 0.210938 179
1024 0.205078 381
1605 0.187539 567
training accuracy: 0.8125
training time: 0.031 seconds
model sparsity: 15.1260%
Explanations of the output:
- Model Information: the class number, the classifier number, the specified algorithm, and the detailed parameters set by the "--params" option above.
- Training Process: the iteration information; the first column is the number of processed data samples, the second column is the training error rate, and the third column is the number of updates to the classifiers.
- Summary: the final training accuracy, time cost, and model sparsity.
By default, SOL uses the "OGD" algorithm to learn a model. If users want to try another algorithm ("AROW", for example) and save the learned model to a file ("arow.model"):
$ sol_train -a arow data/a1a arow.model
Each algorithm may have its own parameters as illustrated in "Model Options". The following command changes the default value of parameter "r" to "2.0":
$ sol_train -a arow --params r=2.0 data/a1a arow.model
In some cases we want to finetune from a pre-trained model; the command is:
$ sol_train -m arow.model data/a1a arow2.model
The test tool is sol_test. The test command is:
$ sol_test model_file data_file [predict_file]
- model_file : model trained by sol_train
- data_file : path to the test data
- predict_file : path to save the prediction results (optional)
For example, we can test with the learned model:
$ sol_test arow.model data/a1a.t predict.txt
test accuracy: 0.8437
test time: 0.016 seconds
The library provides Python wrappers for users. The command line tools are the same ("sol_train" and "sol_test"). The usage is almost the same, except that "sol_train.py" provides a cross-validation function. For example, if users want to do a 5-fold grid-search cross validation over the range [2^-5, 2^-4, ..., 2^4, 2^5] for the parameter "r" of "AROW", the command and output will be:
$ sol_train -a arow --cv r=0.03125:2:32 -f 5 data/a1a arow.model
cross validation parameters: [('r', 2.0)]
Advanced users can "from pysol import SOL" in their own Python scripts. The SOL class provides interfaces similar to scikit-learn classifiers.
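As a rough sketch of what that can look like, assuming a scikit-learn-style fit/predict interface (the constructor arguments and method names below are assumptions and should be checked against the pysol source):

```python
import numpy as np
from pysol import SOL

# Hypothetical usage: the constructor signature and methods are assumed
# from "interfaces similar to scikit-learn classifiers", not documented here.
clf = SOL('arow', 2)          # assumed: algorithm name and number of classes
X = np.random.rand(100, 10)   # toy dense features
y = np.where(np.random.rand(100) > 0.5, 1, -1)  # toy binary labels
clf.fit(X, y)                 # scikit-learn-style training (assumed)
y_pred = clf.predict(X)       # scikit-learn-style prediction (assumed)
```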