
FutureComputing4AI/Weighted-Reservoir-Sampling-Augmented-Training


Stabilizing Linear Passive‑Aggressive Online Learning with Weighted Reservoir Sampling

This repository accompanies the paper "Stabilizing Linear Passive‑Aggressive Online Learning with Weighted Reservoir Sampling" accepted to NeurIPS 2024.

A note on data: in the data folder, we include the features (stored as sparse arrays) and labels (stored as standard NumPy arrays) for the Newsgroups (Binary, CS) and SST‑2 datasets that we preprocessed specifically for this project. The other 14 datasets, presented in Table 1 of our paper, are easily downloadable via the provided citations and links. To reproduce results on all 16 datasets, please convert each dataset's features into sparse arrays and its labels into standard NumPy arrays, following the examples we provide with Newsgroups (Binary, CS) and SST‑2. For the EMBER dataset, please see the ember repository for additional details.
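As a rough sketch of the expected format, the following shows one way to store features as a SciPy sparse array and labels as a standard NumPy array. The file names here are illustrative, not the repository's actual paths:

```python
import numpy as np
from scipy import sparse

# Hypothetical conversion of a small dense dataset into the sparse-features /
# NumPy-labels format described above. File names are placeholders.
X_dense = np.array([[0.0, 1.5, 0.0],
                    [2.0, 0.0, 0.0]])
y = np.array([1, -1])

X_sparse = sparse.csr_matrix(X_dense)            # features as a sparse array
sparse.save_npz("example_features.npz", X_sparse)
np.save("example_labels.npy", y)                 # labels as a standard NumPy array

# Loading back:
X_loaded = sparse.load_npz("example_features.npz")
y_loaded = np.load("example_labels.npy")
```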

A note on dependencies: the only packages required to run all code in this repository are the following, with the versions we used in parentheses: numpy (1.25.0), numba (0.60.0), matplotlib (3.7.1), scipy (1.10.1), and tqdm (4.65.0).

A note on compute: all experiments for this paper were run on a 32-node Linux cluster with a SLURM scheduler. However, no multiprocessing is necessary, and jobs can easily be run on single nodes. Larger datasets like Criteo may require 32 GB of memory to run safely. For transparency, we include the (...)_runscript_driver.sh files we used to manage jobs with the SLURM system, but please adapt these files to your own systems and clusters.

Helper files: the files metric_extractors.py and utils_v2.py contain helper functions for loading datasets, computing various metrics, and working with weighted reservoir sampling weights.
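For readers unfamiliar with weighted reservoir sampling, the following is a generic sketch of the standard Efraimidis–Spirakis (A-Res) scheme, which keeps the k stream items with the largest keys u^(1/w) so that items are sampled in proportion to their weights. This is an illustration of the technique, not the repository's exact implementation:

```python
import heapq
import random

def weighted_reservoir_sample(stream, k, seed=0):
    """Efraimidis-Spirakis weighted reservoir sampling (A-Res).

    `stream` yields (item, weight) pairs with weight > 0. Each item gets the
    key u**(1/w) for a uniform u in (0, 1); the k largest keys are kept.
    """
    rng = random.Random(seed)
    heap = []  # min-heap of (key, item); smallest key is evicted first
    for item, weight in stream:
        key = rng.random() ** (1.0 / weight)
        if len(heap) < k:
            heapq.heappush(heap, (key, item))
        elif key > heap[0][0]:
            heapq.heapreplace(heap, (key, item))
    return [item for _, item in heap]

sample = weighted_reservoir_sample([("a", 1.0), ("b", 10.0), ("c", 0.1)], k=2)
```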

The hparam_tuning directory contains our code for finding the best values of our hyperparameters $\log_{10}C_{err}$ for PAC, and $\eta$ and $\lambda$ for FSOL:

  • folders.py is for creating the relevant folders for our experimental results to prevent race conditions. It can be run with just python3 folders.py.
  • FSOL_main.py is the main function for running one hyperparameter variant of FSOL on a given dataset for one pass through the training data. All FSOL variants can be run by calling FSOL_main_runscript_driver.sh.
  • PAC_main.py and PAC_main_runscript_driver.sh are the associated files for running hyperparameter variants of PAC.
  • analyzer.ipynb generates metrics for all PAC and FSOL variants, and picks the ones that we will use for further experiments down the line.
  • For transparency, the logs folder contains the log files generated by analyzer.ipynb.
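For context on what $C_{err}$ controls, the following is the standard passive-aggressive (PA-I) update of Crammer et al., in which $C_{err}$ caps the step size; this is a generic sketch of the update rule rather than the exact variant in PAC_main.py:

```python
import numpy as np

def pa_update(w, x, y, C_err):
    """One standard passive-aggressive (PA-I) update on example (x, y).

    The step size is the hinge loss divided by ||x||^2, clipped at C_err,
    which is why the tuning grid sweeps log10(C_err).
    """
    loss = max(0.0, 1.0 - y * np.dot(w, x))    # hinge loss
    if loss > 0.0:
        tau = min(C_err, loss / np.dot(x, x))  # clipped step size
        w = w + tau * y * x
    return w

w = pa_update(np.zeros(3), np.array([1.0, 0.0, 2.0]), 1, C_err=10.0)
```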

The WRS directory contains all code for running PAC‑WRS and FSOL‑WRS experiments:

  • The base_variants folder contains the selected hyperparameter values for PAC and FSOL found above.
  • folders.py is for creating the relevant folders for our experimental results to prevent race conditions. It can be run with just python3 folders.py.
  • FSOL_WRS_main.py runs one variant of FSOL with a particular set of WRS‑specific settings on one dataset for one pass through the training data. All FSOL‑WRS variants can be run by calling FSOL_WRS_main_runscript_driver.sh.
  • PAC_WRS_main.py and PAC_WRS_main_runscript_driver.sh are the associated files for running hyperparameter variants of PAC‑WRS, with the same structure as FSOL_WRS_main.py and FSOL_WRS_main_runscript_driver.sh.
  • analyzer+visualizer.ipynb generates summaries of model performance and produces figures of test accuracy and sparsity over time for each dataset.
  • error_bars.ipynb generates figures with error bars for ROP, final test accuracy, and final sparsity for our PAC‑WRS and FSOL‑WRS variants.
  • hypothesis_testing.ipynb performs Wilcoxon Signed‑Rank Tests for statistical significance of our ROP and final test accuracy metrics.
  • For transparency, we also include the logs folder containing various log‑files generated by our analysis notebooks.

The baselines directory contains all code for running top‑K experiments on PAC and FSOL:

  • The base_variants folder contains the selected hyperparameter values for PAC and FSOL found above.
  • folders.py is for creating the relevant folders for our experimental results to prevent race conditions. It can be run with just python3 folders.py.
  • FSOL_TOPK_main.py runs one variant of top‑K + FSOL with a particular set of top-K‑specific settings on one dataset for one pass through the training data. All FSOL + top‑K variants can be run by calling FSOL_TOPK_main_runscript_driver.sh.
  • PAC_TOPK_main.py runs one variant of top‑K + PAC with a particular set of top-K‑specific settings on one dataset for one pass through the training data. All PAC + top‑K variants can be run by calling PAC_TOPK_main_runscript_driver.sh.
  • analyzer.ipynb contains our analysis scripts for creating summary metrics of all top‑K variants, computing the number of locations where each model outperformed the base methods, and performing Wilcoxon Signed‑Rank Tests on ROP and final test accuracy.
  • For transparency, we also include the logs folder containing the topk_master.csv file generated by our analysis notebook.
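The core operation in the top‑K baseline is keeping only the K largest-magnitude coordinates of the weight vector. The following is a generic sketch of that truncation, not the repository's exact implementation:

```python
import numpy as np

def topk_truncate(w, k):
    """Keep only the k largest-magnitude coordinates of w, zeroing the rest."""
    if k >= w.size:
        return w.copy()
    out = np.zeros_like(w)
    idx = np.argpartition(np.abs(w), -k)[-k:]  # indices of the k largest |w_i|
    out[idx] = w[idx]
    return out

w = topk_truncate(np.array([0.1, -3.0, 0.5, 2.0]), k=2)
```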

The moving_average directory contains all code for running moving-average experiments on PAC and FSOL:

  • The base_variants folder contains the selected hyperparameter values for PAC and FSOL as discussed earlier.
  • folders.py is for creating the relevant folders for our experimental results to prevent race conditions. It can be run with just python3 folders.py.
  • FSOL_MA_main.py runs one variant of moving average (K=64) on one dataset for one pass through the training data. All FSOL + moving-average variants can be run by calling FSOL_MA_main_runscript_driver.sh.
  • PAC_MA_main.py runs one variant of moving average (K=64) on one dataset for one pass through the training data. All PAC + moving-average variants can be run by calling PAC_MA_main_runscript_driver.sh.
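The moving-average baseline maintains an average of the last K weight vectors (K=64 in our experiments). A minimal sketch, not the repository's exact implementation:

```python
from collections import deque
import numpy as np

class MovingAverage:
    """Moving average over the last k weight vectors (the paper uses k=64)."""
    def __init__(self, k):
        self.buffer = deque(maxlen=k)  # oldest iterate is dropped automatically

    def update(self, w):
        self.buffer.append(w.copy())
        return np.mean(list(self.buffer), axis=0)

ma = MovingAverage(k=3)
ma.update(np.array([0.0, 0.0]))
ma.update(np.array([2.0, 4.0]))
avg = ma.update(np.array([4.0, 2.0]))  # mean of the last 3 iterates
```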

The exponential_average directory contains all code for running exponential-average experiments on PAC and FSOL:

  • The base_variants folder contains the selected hyperparameter values for PAC and FSOL as discussed earlier.
  • folders.py is for creating the relevant folders for our experimental results to prevent race conditions. It can be run with just python3 folders.py.
  • FSOL_EA_main.py runs one variant of exponential average ($\gamma=0.9$) on one dataset for one pass through the training data. All FSOL + exponential-average variants can be run by calling FSOL_EA_main_runscript_driver.sh.
  • PAC_EA_main.py runs one variant of exponential average ($\gamma=0.9$) on one dataset for one pass through the training data. All PAC + exponential-average variants can be run by calling PAC_EA_main_runscript_driver.sh.
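The exponential-average baseline maintains $\bar{w} \leftarrow \gamma \bar{w} + (1-\gamma) w$ after each step, with $\gamma = 0.9$ in our experiments. A minimal sketch of the update:

```python
import numpy as np

def exp_average(w_avg, w, gamma=0.9):
    """One exponentially weighted averaging step (the paper uses gamma=0.9)."""
    return gamma * w_avg + (1.0 - gamma) * w

w_avg = np.zeros(2)
for w in [np.array([1.0, 1.0]), np.array([1.0, 1.0])]:
    w_avg = exp_average(w_avg, w, gamma=0.9)
```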

The sgd_variants directory contains all code for running our modified WRS-Augmented Training procedure on the non-passive-aggressive online algorithms ADAGRAD, SGD+Momentum, and TGD:

  • folders.py is for creating the relevant folders for our experimental results to prevent race conditions. It can be run with just python3 folders.py.
  • ADAGRAD_WRS_main.py runs one variant of ADAGRAD + modified WRS-Augmented Training on one dataset for one pass through the training data. All ADAGRAD + modified WRS-Augmented Training variants can be run by calling ADAGRAD_WRS_main_runscript_driver.sh.
  • SGDM_WRS_main.py runs one variant of SGD+Momentum + modified WRS-Augmented Training on one dataset for one pass through the training data. All SGD+Momentum + modified WRS-Augmented Training variants can be run by calling SGDM_WRS_main_runscript_driver.sh.
  • TGD_WRS_main.py runs one variant of TGD + modified WRS-Augmented Training on one dataset for one pass through the training data. All TGD + modified WRS-Augmented Training variants can be run by calling TGD_WRS_main_runscript_driver.sh.
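As a reminder of the base learners being augmented here, the following is a generic sketch of one ADAGRAD step with per-coordinate adaptive learning rates; the repository wraps updates like this with its modified WRS-Augmented Training procedure:

```python
import numpy as np

def adagrad_step(w, G, grad, lr=0.1, eps=1e-8):
    """One ADAGRAD step: G accumulates squared gradients per coordinate,
    and each coordinate's step is scaled by 1/sqrt(G)."""
    G = G + grad ** 2
    w = w - lr * grad / (np.sqrt(G) + eps)
    return w, G

w, G = adagrad_step(np.zeros(2), np.zeros(2), np.array([1.0, -2.0]))
```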

Finally, the visualizing_avging+non-passive-agg.ipynb notebook generates figures showing test accuracy over time of Moving Average, Exponential Average, and WRS-Augmented Training, as well as calculates each method's average compute time per iteration per dataset. This notebook also generates figures showing test accuracy over time of SGD+Momentum, ADAGRAD, and TGD with/without modified WRS-Augmented Training.

License: all assets in our paper and accompanying repository are under CC BY‑NC 4.0.
