Workshop paper

Contextual Bandits for Large-Scale Structured Discrete Constrained Optimization Problems

Abstract

We study contextual bandits in high-dimensional combinatorial action spaces arising in structured constrained optimization problems, such as IT resource allocation and retail assortment pricing. Key quantities in the objective or constraints must be estimated from data during the trial sequence of actions. In [ 12], we propose a novel, practical, and transparent approach based on general-purpose regression oracles with Inverse Gap Weighting (IGW) for seamless integration within an optimization framework. IGW sampling is efficiently managed by: (a) a column-generation reformulation of the underlying Mixed Integer Programming (MIP) model which allows for flexible lower-level predictors, causal coherence, and efficient representation of large action spaces; (b) a diverse solution pool generation to balance the exploration-exploitation trade-off in large-action spaces. To address non-smooth rewards induced by constraints, we introduce a risk-averse phased learning strategy. Experiments on an IT auto-scaling task demonstrate substantial reductions in cumulative regret, with added gains from risk-averse methods that effectively manage constraint violations. This submission summarizes [ 12] and sketches our extensions underway as we seek a full theoretical regret analysis.