Command
SOL provides a set of useful command line tools related to data processing and model training/testing.
Data formats supported by this software package are "svmlight" (commonly used in LIBSVM and LIBLINEAR), "csv", and a binary format of our own. Labels and features in data files must all be numeric.
- The Binary Format
The binary format is for fast loading and processing. It is used to cache datasets, like in cross-validation procedures. Each sample in binary format is comprised of the following items in sequence:
- **label**: *sizeof(label_t)*, *label_t* is *int32_t* by default;
- **feature number**: *sizeof(size_t)*;
- **length of compressed index**: *sizeof(size_t)*;
- **compressed index**: *sizeof(char)* * **length of compressed index**;
- **features**: *sizeof(float)* * **feature number**;
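For illustration, the following minimal Python sketch reads one sample in this layout. The byte order, the width of *size_t* (assumed to be 8 bytes here), and the meaning of the compressed index bytes are assumptions, not part of the specification above, so treat this as a sketch rather than a reference implementation.

```python
import struct

def read_sample(f):
    # Assumptions: little-endian byte order, int32 label,
    # 8-byte size_t, float32 features.
    header = f.read(4)
    if not header:
        return None  # end of file
    label, = struct.unpack('<i', header)
    feat_num, = struct.unpack('<Q', f.read(8))
    idx_len, = struct.unpack('<Q', f.read(8))
    # The index compression scheme is internal to the library,
    # so the bytes are kept opaque here.
    compressed_index = f.read(idx_len)
    features = struct.unpack('<{}f'.format(feat_num), f.read(4 * feat_num))
    return label, compressed_index, list(features)
```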
- Data Preprocessing Tools
The library provides some tools to help users preprocess datasets, such as analyzing, splitting, and shuffling. Note that these tools support the data formats mentioned above.
- **analyze**: report statistics of a dataset: the number of samples, number of features, feature dimension, number of nonzero features, number of classes, and feature sparsity.
- **concat**: concatenate several data files into one.
- **converter**: convert data from one format to another.
- **shuffle**: shuffle the order of data samples.
- **split**: split one data file into several parts.
The detailed input parameters of each tool can be obtained by running it without options or with the "-h" or "--help" option.
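For example, assuming the tools are invoked by the names listed above:

$ converter -h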
The training tool is sol_train. Running sol_train without any arguments or with "--help/-h" will produce a message which briefly explains each argument.
The command to call sol_train is:
$ sol_train [options] ... train_file [model_file]
Options include "General Options", "IO Options" and "Model Options".
- General Options
-h arg : show all help information;
-s arg : show list of information specified by arg.
The "-s" option is to help users know what the library can do (what kind of algorithms are implemented, what kind of data formats are supported, what kind of loss functions are implemented, etc.) without checking the code. The available arguments include "reader", "writer", "model", and "loss". For example, running with "model" will show users the available algorithms as follows:
$ sol_train -s model
ada-fobos-l1: Adaptive Subgradient FOBOS with l1 regularization
ada-fobos: Adaptive Subgradient FOBOS
ada-rda-l1: Adaptive Subgradient RDA with l1 regularization
ada-rda: Adaptive Subgradient RDA
alma2: Approximate Large Margin Algorithm with norm 2
arow: Adaptive Regularization of Weight Vectors
cw: confidence weighted online learning
eccw: exact convex confidence weighted online learning
erda-l1: mixed l1-l2^2 enhanced regularized dual averaging
fobos-l1: Forward Backward Splitting l1 regularization
ogd: Online Gradient Descent
pa1: Online Passive Aggressive-1
pa2: Online Passive Aggressive-2
pa: Online Passive Aggressive
perceptron: perceptron algorithm
rda-l1: mixed l1-l2^2 regularized dual averaging
rda: l2^2 regularized dual averaging
sop: second order perceptron
stg: Sparse Online Learning via Truncated Gradient
Running with "reader" will show users the data readers for all supported data formats:
$ sol_train -s reader
bin: binary format data reader
csv: csv format data reader
svm: libsvm format data reader
- IO Options
-f arg : dataset format ('svm'[default], 'bin', or 'csv')
-c arg : number of classes (default=2)
-p arg : number of passes to go through the data (default=1).
-d arg : dimension of the data.
Note that the IO options are not required. If the "-d" option is not specified, the tool will learn the dimension by itself, at the cost of a little more memory.
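For example, to train on a csv file with 5 passes over the data (the file name here is just a placeholder):

$ sol_train -f csv -p 5 train.csv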
- Model Options
-a arg : learning algorithm to use (see "General Options")
-m arg : path to pre-trained model for finetuning
--params arg: parameters for algorithms in the format "param=val;param2=val2;..."
Some useful parameters include:
- loss=[string]
  Type of the loss function. The supported values are:
  - bool : 1 if the prediction is wrong, 0 if correct
  - hinge : hinge loss
  - maxscore-bool : multi-class max-score bool loss
  - maxscore-hinge : multi-class max-score hinge loss
  - uniform-bool : multi-class uniform bool loss
  - uniform-hinge : multi-class uniform hinge loss
- lambda=[float]
  Regularization parameter for sparse online learning algorithms.
- norm=[string]
  Normalize the data samples; the supported normalization methods include:
  - l1 : divide each feature by the L1 norm
  - l2 : divide each feature by the L2 norm
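As a usage sketch combining these parameters (the lambda value here is an arbitrary placeholder, not a recommended setting), sparse online learning with STG on L2-normalized samples can be invoked as:

$ sol_train -a stg --params "lambda=0.01;norm=l2" data/a1a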
- eta=[float]
  Learning rate for online algorithms. For the OGD algorithm, the related parameters are:
  - eta : initial learning rate
  - power_t : power of the decaying learning rate
  - t : initial iteration number
So the options can be:
$ sol_train -a ogd --params "eta=0.1;power_t=1;t=100" data_path
The following table shows the algorithms and their corresponding parameters.

Algorithm | Parameters | Meaning |
---|---|---|
Ada-FOBOS | eta | learning rate |
 | delta | parameter to ensure the positive-definite property of the adaptive weighting matrix |
Ada-RDA | eta | learning rate |
 | delta | parameter to ensure the positive-definite property of the adaptive weighting matrix |
ALMA | alpha | final margin parameter (1 - alpha) * gamma |
 | C | typically set to sqrt(2) |
AROW | r | trade-off parameter of the passive-aggressive update |
CW | a | initial confidence |
 | phi | threshold of the inverse normal distribution of the threshold probability |
ECCW | a | initial confidence |
 | phi | threshold of the inverse normal distribution of the threshold probability |
OGD | eta | learning rate |
 | power_t | power of the decaying learning rate |
PA1 | C | passive-aggressive trade-off parameter |
PA2 | C | passive-aggressive trade-off parameter |
RDA | sigma | coefficient of the proximal function |
ERDA | sigma | coefficient of the proximal function |
 | rou | parameter for the l1 penalty in the proximal function |
SOP | a | parameter of the positive-definite normalization matrix |
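For instance, combining the table with the "--params" syntax above, Ada-FOBOS can be configured as follows (the values here are arbitrary placeholders, not recommended settings):

$ sol_train -a ada-fobos --params "eta=0.5;delta=10" data/a1a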
We provide an example to show how to use sol_train and explain the details of how it works. The dataset we use is "a1a", provided in the "data" folder.
The command for training with the default algorithm is:
$ sol_train data/a1a
Output of the above command will be:
--------------------------------------------------
Model Information:
{
"clf_num" : 1,
"cls_num" : 2,
"loss" : "hinge",
"model" : "ogd",
"norm" : 0,
"online" : {
"bias_eta" : 0,
"dim" : 1,
"eta" : 1,
"lazy_update" : "false",
"power_t" : 0.5,
"t" : 0
}
}
Training Process....
Iterate No. Error Rate Update No.
2 0.500000 1
4 0.250000 1
8 0.125000 1
16 0.187500 3
32 0.125000 8
64 0.218750 19
128 0.179688 41
256 0.218750 95
512 0.210938 179
1024 0.205078 381
1605 0.187539 567
training accuracy: 0.8125
training time: 0.031 seconds
model sparsity: 15.1260%
Explanations of the output:
- Model Information: the class number, the classifier number, the specified algorithm, and the detailed parameters set by the "--params" option above.
- Training Process: the iteration information; the first column is the number of processed data samples, the second column is the training error rate, and the third column is the number of updates to the classifiers.
- Summary: the final training accuracy, time cost, and model sparsity.
By default, SOL uses the "OGD" algorithm to learn a model. If users want to try another algorithm ("AROW", for example) and save the learned model to a file ("arow.model"):
$ sol_train -a arow data/a1a arow.model
Each algorithm may have its own parameters as illustrated in "Model Options". The following command changes the default value of parameter "r" to "2.0":
$ sol_train -a arow --params r=2.0 data/a1a arow.model
In some cases we want to finetune from a pre-trained model; the command is:
$ sol_train -m arow.model data/a1a arow2.model
The test tool is sol_test. The test command is:
$ sol_test model_file data_file [predict_file]
- model_file : model trained by sol_train
- data_file : path to the test data
- predict_file : path to save the prediction results (optional)
For example, we can test with the learned model:
$ sol_test arow.model data/a1a.t predict.txt
test accuracy: 0.8437
test time: 0.016 seconds
The library provides Python wrappers for users. The command line tools are the same ("sol_train" and "sol_test"). The usage is almost the same, except that "sol_train.py" provides a cross-validation function. For example, if users want to do a 5-fold grid-search cross validation over the range [2^-5, 2^-4, ..., 2^4, 2^5] for the parameter "r" of "AROW", the command and output will be:
$ sol_train -a arow --cv r=0.03125:2:32 -f 5 data/a1a arow.model
cross validation parameters: [('r', 2.0)]
Advanced users can "from pysol import SOL" in their own Python scripts. The SOL class provides interfaces similar to scikit-learn classifiers.
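As a rough sketch of what that can look like, assuming a scikit-learn-style fit/predict interface (the constructor arguments and method names below are assumptions and should be checked against the pysol source):

```python
import numpy as np
from pysol import SOL

# Hypothetical usage: the constructor signature and methods are assumed
# from "interfaces similar to scikit-learn classifiers", not documented here.
clf = SOL('arow', 2)          # assumed: algorithm name and number of classes
X = np.random.rand(100, 10)   # toy dense features
y = np.where(np.random.rand(100) > 0.5, 1, -1)  # toy binary labels
clf.fit(X, y)                 # scikit-learn-style training (assumed)
y_pred = clf.predict(X)       # scikit-learn-style prediction (assumed)
```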