Saurabh Jha, Rohan Arora, et al.
ICML 2025
We study contextual bandits in high-dimensional combinatorial action spaces arising in structured constrained optimization problems, such as IT resource allocation and retail assortment pricing. Key quantities in the objective or constraints must be estimated from data during the trial sequence of actions. In [ 12], we propose a novel, practical, and transparent approach based on general-purpose regression oracles with Inverse Gap Weighting (IGW) for seamless integration within an optimization framework. IGW sampling is efficiently managed by: (a) a column-generation reformulation of the underlying Mixed Integer Programming (MIP) model which allows for flexible lower-level predictors, causal coherence, and efficient representation of large action spaces; (b) a diverse solution pool generation to balance the exploration-exploitation trade-off in large-action spaces. To address non-smooth rewards induced by constraints, we introduce a risk-averse phased learning strategy. Experiments on an IT auto-scaling task demonstrate substantial reductions in cumulative regret, with added gains from risk-averse methods that effectively manage constraint violations. This submission summarizes [ 12] and sketches our extensions underway as we seek a full theoretical regret analysis.
Saurabh Jha, Rohan Arora, et al.
ICML 2025
Sarath Swaminathan, Nathaniel Park, et al.
NeurIPS 2025
Giovanni De Felice, Arianna Casanova Flores, et al.
NeurIPS 2025
Brianna Richardson, Kush R. Varshney
INFORMS 2021