This repository accompanies the paper "Stabilizing Linear Passive‑Aggressive Online Learning with Weighted Reservoir Sampling" accepted to NeurIPS 2024.
A note on data: in the `data` folder, we include the features (stored as sparse arrays) and labels (stored as standard NumPy arrays) for the Newsgroups (Binary, CS) and SST-2 datasets that we preprocessed specifically for this project. The other 14 datasets, presented in Table 1 of our paper, are easily downloadable via the provided citations and links. To reproduce results on all 16 datasets, please convert each dataset's features into sparse arrays and its labels into standard NumPy arrays, following the examples we provide with Newsgroups (Binary, CS) and SST-2. For the EMBER dataset, please see the `ember` repository for additional details.
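For reference, below is a minimal sketch of the conversion we have in mind, using SciPy's sparse `.npz` format; the file names are illustrative placeholders, not the repository's actual paths:

```python
# Minimal sketch of the expected on-disk format: sparse features plus a 1-D
# NumPy label array. File names below are placeholders, not real repo paths.
import numpy as np
import scipy.sparse as sp

X_dense = np.random.rand(100, 50)         # stand-in for your dataset's features
y = np.random.choice([-1, 1], size=100)   # stand-in for binary labels

# Convert the features to a sparse array and save both pieces.
X_sparse = sp.csr_matrix(X_dense)
sp.save_npz("my_dataset_features.npz", X_sparse)
np.save("my_dataset_labels.npy", y)

# Load them back in the same shape the provided datasets use.
X = sp.load_npz("my_dataset_features.npz")
y = np.load("my_dataset_labels.npy")
```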
A note on dependencies: the only packages that must be installed to run all code in this repository are the following, with the versions we used in parentheses: `numpy` (1.25.0), `numba` (0.60.0), `matplotlib` (3.7.1), `scipy` (1.10.1), and `tqdm` (4.65.0).
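For example, one way to install these pinned versions is `pip install numpy==1.25.0 numba==0.60.0 matplotlib==3.7.1 scipy==1.10.1 tqdm==4.65.0`.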
A note on compute: all experiments for this paper were run on a Linux cluster with 32 nodes managed by a SLURM scheduler. However, no multiprocessing is necessary, and jobs can easily be run on single nodes. Larger datasets like Criteo may require 32 GB of memory to run safely. For transparency, we include the `(...)_runscript_driver.sh` files we used to manage jobs with the SLURM system; please adapt these files to your own systems and clusters.
Helper files: the files `metric_extractors.py` and `utils_v2.py` contain helper functions for loading datasets, computing various metrics, and working with weighted reservoir sampling weights.
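For intuition, weighted reservoir sampling in the Efraimidis–Spirakis style assigns each item $i$ with weight $w_i$ a random key $u_i^{1/w_i}$ (with $u_i$ drawn uniformly from $(0,1)$) and keeps the $K$ items with the largest keys. A minimal sketch of that key computation, not the repository's exact implementation:

```python
# Minimal sketch of Efraimidis-Spirakis weighted reservoir sampling keys;
# not the repository's exact implementation.
import numpy as np

def wrs_keys(weights, rng):
    """Assign each item i a key u_i ** (1 / w_i); larger keys win."""
    u = rng.uniform(size=len(weights))
    return u ** (1.0 / np.asarray(weights))

rng = np.random.default_rng(0)
weights = np.array([0.5, 2.0, 1.0, 4.0])
keys = wrs_keys(weights, rng)
top_k = np.argsort(keys)[-2:]   # indices of the K=2 sampled items
```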
The `hparam_tuning` directory contains our code for finding the best values of our hyperparameters:

- `folders.py` creates the relevant folders for our experimental results to prevent race conditions. It can be called with just `python3 folders.py`.
- `FSOL_main.py` is the main function for running one hyperparameter variant of FSOL on a given dataset for one pass through the training data. All FSOL variants can be run by calling `FSOL_main_runscript_driver.sh`.
- `PAC_main.py` and `PAC_main_runscript_driver.sh` are the associated files for running hyperparameter variants of PAC (a generic sketch of the passive-aggressive update follows this list).
- `analyzer.ipynb` generates metrics for all PAC and FSOL variants and picks the ones that we use for further experiments down the line.
- For transparency, the `logs` folder contains the log files generated by `analyzer.ipynb`.
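For background, PAC here refers to the passive-aggressive classifier, whose PA-I update has a closed form governed by the aggressiveness hyperparameter $C$. A generic sketch of that textbook update (Crammer et al., 2006), not the repository's implementation:

```python
# Generic sketch of the PA-I passive-aggressive update; not the repository's
# exact (numba-accelerated, sparse) implementation.
import numpy as np

def pa1_update(w, x, y, C):
    """One PA-I step on example (x, y), with label y in {-1, +1}."""
    loss = max(0.0, 1.0 - y * np.dot(w, x))   # hinge loss
    if loss > 0.0:
        tau = min(C, loss / np.dot(x, x))     # clipped step size
        w = w + tau * y * x
    return w
```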
The `WRS` directory contains all code for running PAC-WRS and FSOL-WRS experiments:

- The `base_variants` folder contains the selected hyperparameter values for PAC and FSOL found above.
- `folders.py` creates the relevant folders for our experimental results to prevent race conditions. It can be called with just `python3 folders.py`.
- `FSOL_WRS_main.py` runs one variant of FSOL with a particular set of WRS-specific settings on one dataset for one pass through the training data. All FSOL-WRS variants can be run by calling `FSOL_WRS_main_runscript_driver.sh`.
- `PAC_WRS_main.py` and `PAC_WRS_main_runscript_driver.sh` are the associated files for running hyperparameter variants of PAC-WRS, with the same structure as `FSOL_WRS_main.py` and `FSOL_WRS_main_runscript_driver.sh`.
- `analyzer+visualizer.ipynb` generates summaries of model performance and produces figures of test accuracy and sparsity over time for each dataset.
- `error_bars.ipynb` generates figures with error bars of ROP, final test accuracy, and final sparsity for our PAC-WRS and FSOL-WRS variants.
- `hypothesis_testing.ipynb` performs Wilcoxon signed-rank tests for statistical significance of our ROP and final test accuracy metrics (an illustrative SciPy call follows this list).
- For transparency, we also include the `logs` folder containing various log files generated by our analysis notebooks.
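As an illustration of the statistics computed in `hypothesis_testing.ipynb`, a paired Wilcoxon signed-rank test can be run with SciPy as follows; the numbers are made-up placeholders, not results from the paper:

```python
# Hedged example: paired Wilcoxon signed-rank test over per-dataset metrics.
# The numbers below are placeholders, not results from the paper.
from scipy.stats import wilcoxon

acc_base = [0.91, 0.85, 0.78, 0.88, 0.93]   # e.g., final test accuracy, base method
acc_wrs  = [0.92, 0.87, 0.80, 0.88, 0.94]   # e.g., the corresponding WRS variant

stat, p_value = wilcoxon(acc_base, acc_wrs)
print(f"Wilcoxon statistic={stat:.3f}, p={p_value:.4f}")
```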
The `baselines` directory contains all code for running top-K experiments on PAC and FSOL:

- The `base_variants` folder contains the selected hyperparameter values for PAC and FSOL found above.
- `folders.py` creates the relevant folders for our experimental results to prevent race conditions. It can be called with just `python3 folders.py`.
- `FSOL_TOPK_main.py` runs one variant of top-K + FSOL with a particular set of top-K-specific settings on one dataset for one pass through the training data. All FSOL + top-K variants can be run by calling `FSOL_TOPK_main_runscript_driver.sh` (a sketch of the top-K truncation follows this list).
- `PAC_TOPK_main.py` runs one variant of top-K + PAC with a particular set of top-K-specific settings on one dataset for one pass through the training data. All PAC + top-K variants can be run by calling `PAC_TOPK_main_runscript_driver.sh`.
- `analyzer.ipynb` contains our analysis scripts for creating summary metrics of all top-K variants, computing the number of locations where each model outperformed the base methods, and performing Wilcoxon signed-rank tests on ROP and final test accuracy.
- For transparency, we also include the `logs` folder containing the `topk_master.csv` file generated by our analysis notebook.
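For reference, top-K truncation keeps only the $K$ largest-magnitude coordinates of the weight vector and zeroes out the rest. A minimal sketch, not the repository's exact code:

```python
# Minimal sketch of top-K truncation: zero out all but the K largest-magnitude
# weights. Not the repository's exact implementation.
import numpy as np

def top_k_truncate(w, k):
    out = np.zeros_like(w)
    idx = np.argsort(np.abs(w))[-k:]   # indices of the K largest |w_i|
    out[idx] = w[idx]
    return out

w = np.array([0.1, -3.0, 0.02, 2.5, -0.4])
print(top_k_truncate(w, 2))            # keeps only -3.0 and 2.5
```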
The `moving_average` directory contains all code for running moving-average experiments on PAC and FSOL:

- The `base_variants` folder contains the selected hyperparameter values for PAC and FSOL as discussed earlier.
- `folders.py` creates the relevant folders for our experimental results to prevent race conditions. It can be called with just `python3 folders.py`.
- `FSOL_MA_main.py` runs one variant of FSOL with a moving average (K=64) on one dataset for one pass through the training data. All FSOL + moving-average variants can be run by calling `FSOL_MA_main_runscript_driver.sh` (a sketch of the averaging follows this list).
- `PAC_MA_main.py` runs one variant of PAC with a moving average (K=64) on one dataset for one pass through the training data. All PAC + moving-average variants can be run by calling `PAC_MA_main_runscript_driver.sh`.
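Conceptually, the moving-average baseline predicts with the mean of the most recent $K$ weight snapshots. A minimal sketch with $K = 64$ as above, not the repository's exact code:

```python
# Minimal sketch of a K-step moving average of weight vectors; not the
# repository's exact implementation.
from collections import deque
import numpy as np

K = 64
window = deque(maxlen=K)   # the oldest snapshot is dropped automatically

def update_window(window, w):
    window.append(w.copy())
    return np.mean(window, axis=0)   # averaged weights used for prediction

w_avg = update_window(window, np.zeros(10))
```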
The `exponential_average` directory contains all code for running exponential-average experiments on PAC and FSOL:

- The `base_variants` folder contains the selected hyperparameter values for PAC and FSOL as discussed earlier.
- `folders.py` creates the relevant folders for our experimental results to prevent race conditions. It can be called with just `python3 folders.py`.
- `FSOL_EA_main.py` runs one variant of FSOL with an exponential average ($\gamma = 0.9$) on one dataset for one pass through the training data. All FSOL + exponential-average variants can be run by calling `FSOL_EA_main_runscript_driver.sh` (a sketch of the averaging follows this list).
- `PAC_EA_main.py` runs one variant of PAC with an exponential average ($\gamma = 0.9$) on one dataset for one pass through the training data. All PAC + exponential-average variants can be run by calling `PAC_EA_main_runscript_driver.sh`.
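Conceptually, the exponential average maintains a decayed running mean of the weights. A minimal sketch using one common convention, with $\gamma = 0.9$ as above (the repository's update may differ in convention):

```python
# Minimal sketch of an exponential average of weight vectors with gamma = 0.9;
# not the repository's exact implementation.
import numpy as np

gamma = 0.9

def ema_update(w_avg, w, gamma=gamma):
    """Decayed running mean: recent iterates count more as gamma shrinks."""
    return gamma * w_avg + (1.0 - gamma) * w

w_avg = np.zeros(10)
w_new = np.ones(10)
w_avg = ema_update(w_avg, w_new)   # -> 0.1 everywhere
```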
The `sgd_variants` directory contains all code for running our modified WRS-Augmented Training procedure on the non-passive-aggressive online algorithms ADAGRAD, SGD+Momentum, and TGD:

- `folders.py` creates the relevant folders for our experimental results to prevent race conditions. It can be called with just `python3 folders.py`.
- `ADAGRAD_WRS_main.py` runs one variant of ADAGRAD + modified WRS-Augmented Training on one dataset for one pass through the training data. All ADAGRAD + modified WRS-Augmented Training variants can be run by calling `ADAGRAD_WRS_main_runscript_driver.sh` (a generic ADAGRAD sketch follows this list).
- `SGDM_WRS_main.py` runs one variant of SGD+Momentum + modified WRS-Augmented Training on one dataset for one pass through the training data. All SGD+Momentum + modified WRS-Augmented Training variants can be run by calling `SGDM_WRS_main_runscript_driver.sh`.
- `TGD_WRS_main.py` runs one variant of TGD + modified WRS-Augmented Training on one dataset for one pass through the training data. All TGD + modified WRS-Augmented Training variants can be run by calling `TGD_WRS_main_runscript_driver.sh`.
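For background, ADAGRAD scales each coordinate's step size by the accumulated squared gradients. A generic sketch of a diagonal ADAGRAD step, not the repository's exact code:

```python
# Generic sketch of a diagonal ADAGRAD step; not the repository's exact code.
import numpy as np

def adagrad_step(w, grad, g2_sum, lr=0.1, eps=1e-8):
    """Per-coordinate learning rates shrink as squared gradients accumulate."""
    g2_sum += grad ** 2
    w = w - lr * grad / (np.sqrt(g2_sum) + eps)
    return w, g2_sum

w = np.zeros(5)
g2 = np.zeros(5)
w, g2 = adagrad_step(w, np.array([1.0, -2.0, 0.0, 0.5, 3.0]), g2)
```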
Finally, the `visualizing_avging+non-passive-agg.ipynb` notebook generates figures showing test accuracy over time for Moving Average, Exponential Average, and WRS-Augmented Training, and calculates each method's average compute time per iteration per dataset. The notebook also generates figures showing test accuracy over time for SGD+Momentum, ADAGRAD, and TGD with and without modified WRS-Augmented Training.
License: all assets in our paper and accompanying repository are under CC BY-NC 4.0.