
Bandits for Algorithm Selection

If you’re interested in the full write-up, see Master Thesis.pdf.

Project Overview

  • Title: Bayesian Bandits for Algorithm Selection: Latent-State Modeling and Spatial Reward Structures
  • Supervisors: Prof. Christian Brownlees, Prof. David Rossell
  • Institution: Universitat Pompeu Fabra & Barcelona School of Economics
  • Objective: Develop and evaluate bandit algorithms that dynamically select forecasting models under non-stationary and spatially structured rewards.
  • Team Members: Marvin Ernst, Oriol Gelabert, Melisa Vadenja

Motivation

Real-world decision environments are frequently non-stationary and spatially correlated. We design bandit methods that adapt to hidden latent-state dynamics and exploit spatial smoothness among arms.


Problem Formulation

  • Setting: At each time $t$, pick one arm (algorithm) out of $K$ and observe a reward.
  • Goal: Minimize cumulative regret $R_T$ relative to the best dynamic/spatial policy (a standard formulation is written out after this list).
  • Extensions:
    • Dynamic Bandits (hidden regimes / HMMs)
    • Spatial Bandits (arms in a continuous space with smooth reward structure)
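
For the non-stationary setting, a standard way to write this regret (the thesis may use a slightly different comparator) is

$$
R_T = \sum_{t=1}^{T} \left( \mu_t^{*} - \mu_{a_t, t} \right), \qquad \mu_t^{*} = \max_{k \in \{1,\dots,K\}} \mu_{k,t},
$$

where $\mu_{k,t}$ is the expected reward of arm $k$ at time $t$ (time-varying under regime switches), $a_t$ is the arm pulled at round $t$, and the summand $\mu_t^{*} - \mu_{a_t, t}$ is the instantaneous regret.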

Contributions

  1. Dynamic Bandits

    • Two Bayesian latent-state models:
      • M1: Arm-specific HMMs
      • M2: Globally shared latent HMM
    • Baselines: AR bandits, classical UCB/TS
    • Result: M1-TS adapts best to regime switches and dominates across settings (a simplified sketch of the M1-TS idea follows this list).
  2. Spatial Bandits

    • Benchmark GP-UCB / GP-TS against Zoom-In (tree-based region refinement for Lipschitz bandits) and classical UCB.
    • Show kernel/length-scale sensitivity for the GP methods; quantify the robustness of Zoom-In.
  3. Hybrid Strategies (Future)

    • Combine GP exploration with Zoom-In style refinement to balance accuracy and scalability.
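
As a rough illustration of the M1 idea from item 1 above, the sketch below runs Thompson Sampling on top of a per-arm two-state Gaussian HMM, tracking each arm's hidden regime with exact forward filtering. It is a deliberately simplified stand-in for the thesis's JAGS-based Bayesian inference: the HMM parameters (transition matrix, state-dependent means, noise level) are treated as known rather than inferred, and all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

K, T = 5, 2000                               # arms, horizon
P = np.array([[0.98, 0.02],                  # assumed-known 2-state transition matrix
              [0.02, 0.98]])
mu = rng.normal(0.0, 1.0, size=(K, 2))       # state-dependent mean reward per arm
sigma = 0.5                                  # observation noise (known)

true_state = rng.integers(0, 2, size=K)      # hidden regime per arm (simulator only)
belief = np.full((K, 2), 0.5)                # filtered P(state | history), per arm

def gauss_lik(x, m, s):
    return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2.0 * np.pi))

for t in range(T):
    # Thompson step: sample a regime for each arm from its filtered belief,
    # score the arm by that regime's mean, and pull the highest-scoring arm.
    sampled = np.array([rng.choice(2, p=belief[k]) for k in range(K)])
    a = int(np.argmax(mu[np.arange(K), sampled]))

    # Environment: hidden regimes evolve, the pulled arm emits a noisy reward.
    for k in range(K):
        true_state[k] = rng.choice(2, p=P[true_state[k]])
    r = rng.normal(mu[a, true_state[a]], sigma)

    # Forward filtering: propagate every belief through the transition matrix,
    # then condition the pulled arm's belief on the observed reward.
    belief = belief @ P
    belief[a] *= gauss_lik(r, mu[a], sigma)
    belief[a] /= belief[a].sum()
```

In the thesis, the model parameters are themselves inferred via JAGS rather than assumed known; the sketch above only illustrates the latent-regime filtering part of the problem.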

Evaluation Metrics

  • Cumulative Regret
  • Instantaneous Regret
  • Euclidean Distance to Best Arm (spatial)

Key Algorithms Implemented

  • Bayesian inference (JAGS) for HMM-style models (M1/M2)
  • Autoregressive bandits
  • GP-UCB, GP-TS (a minimal GP-UCB sketch follows this list)
  • Zoom-In (UCB0/UCB1 and TS scoring)
  • Classical UCB and Thompson Sampling
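
For context on the GP methods in the list above, here is a minimal GP-UCB loop over a fixed finite arm set on $[0, 1]$, assuming an RBF kernel with known length scale and noise level. It illustrates the general acquisition rule only; the kernel, $\beta$, and the toy reward surface are assumptions made for this example, not the thesis configuration.

```python
import numpy as np

rng = np.random.default_rng(1)

def rbf(A, B, length_scale=0.05):
    """Squared-exponential kernel between two sets of 1-D arm locations."""
    d = A[:, None] - B[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

K, T, noise, beta = 200, 300, 0.1, 2.0
X = np.linspace(0.0, 1.0, K)                      # arm locations
f = np.sin(6 * np.pi * X) * np.exp(-2.0 * X)      # hidden mean-reward surface (toy)

X_obs, y_obs = [], []
for t in range(T):
    if not X_obs:
        a = int(rng.integers(K))                  # first pull at random
    else:
        Xo, yo = np.array(X_obs), np.array(y_obs)
        Koo = rbf(Xo, Xo) + noise**2 * np.eye(len(Xo))
        Kxo = rbf(X, Xo)
        mean = Kxo @ np.linalg.solve(Koo, yo)     # GP posterior mean at every arm
        var = 1.0 - np.einsum("ij,ji->i", Kxo, np.linalg.solve(Koo, Kxo.T))
        ucb = mean + beta * np.sqrt(np.clip(var, 0.0, None))
        a = int(np.argmax(ucb))                   # GP-UCB: optimism under uncertainty
    X_obs.append(X[a])
    y_obs.append(f[a] + noise * rng.standard_normal())

print("last arm pulled:", X[a], "| true optimum:", X[int(np.argmax(f))])
```

GP-TS differs only in the acquisition step: each arm is scored by a draw from the joint GP posterior instead of the UCB score. The per-round cost grows with both the number of arms $K$ (kernel evaluations) and the number of observations $t$ (an $O(t^3)$ solve), which is why GP methods scale poorly with $K$ and $T$, as noted under Final Remarks.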

Figures from the Presentation

All figures below are pulled directly from the imgs/ folder used in the presentation.

Plate Diagrams (Dynamic Models)

Arm-dependent latent-state plate diagram (M1)    Global latent-state plate diagram (M2)


Dynamic Bandits - Results (Global vs. Local HMMs)

Global latent-state (M2)

TS variants - global latent-state cumulative regret

UCB variants - global latent-state cumulative regret

Arm-specific latent-state (M1)

TS variants - local latent-state cumulative regret

UCB variants - local latent-state cumulative regret


Baseline (Static) Comparison

Baseline static setting: cumulative reward over time


Spatial Bandits - Lipschitz Setting

Arm space and setup

Lipschitz bandits - arm space

Zoom-In vs. Standard (UCB / TS)

Lipschitz - Zoom-In vs standard cumulative regret

Zoom-In vs. GP methods (K = 1000)

Lipschitz - Zoom-In vs GP cumulative regret

Lipschitz - Zoom-In vs GP instantaneous regret


Spatial Bandits - Gaussian Process Setting

Arm space (ℓ = 0.05)

GP bandits - arm space (ℓ = 0.05)

Zoom-In vs. GP-UCB (ℓ = 0.05)

GP bandits - cumulative regret

GP bandits - instantaneous regret

GP bandits - distance to best arm

Varying length scales (ℓ = 1.0 vs. 0.05)

Arm space: ℓ = 1.0 vs. ℓ = 0.05

Cumulative regret: ℓ = 1.0 vs. ℓ = 0.05

Distance to best arm: ℓ = 1.0 vs. ℓ = 0.05


GP Bandits - Misspecification Studies

GP misspecification - cumulative regret

GP misspecification - instantaneous regret

GP misspecification - distance to best arm


GP-TS vs. GP-UCB+

GP-TS vs GP-UCB+ cumulative regret

GP-TS vs GP-UCB+ instantaneous regret

GP-TS vs GP-UCB+ distance to optimum


Literature

  • Desautels et al. (2014) - Gaussian Process Bandits with Exploration-Exploitation
  • Chowdhury & Gopalan (2017) - Bandit Optimization with Gaussian Processes
  • Salgia et al. (2021) - Lipschitz Bandits without the Lipschitz Constant
  • Kandasamy et al. (2018) - Parallelised Bayesian Optimisation via TS
  • Kleinberg et al. (2008) - Regret Bounds for Restless/Lipschitz Bandits
  • Lattimore & Szepesvári (2020) - Bandit Algorithms
  • Hamilton (1994) - Time Series Analysis

Final Remarks

We provide empirical evidence for structured bandits in non-stationary and spatial settings, highlighting:

  • Bayesian latent-state models (M1/M2): high fidelity, but computationally heavy
  • GP methods: strong performance, but scale poorly with $K$ and $T$
  • Zoom-In: competitive and robust in Lipschitz settings

See the full thesis for details: Master Thesis.pdf

About

Bayesian and spatial bandit algorithms for adaptive model selection under non-stationary reward dynamics, featuring latent-state inference and Gaussian Process exploration.
