ConstraintBench: A Benchmark and RL Environment for Constrained Optimization from Formally Verified Rewards
By The Haladir Team | February 23rd, 2026 | 10 min read
View on GitHub | View on ArXiv
Introduction
Every business runs on constrained optimization. A logistics company deciding which warehouses to open is solving a facility location problem. A hospital scheduling nurses across shifts is solving a set cover problem with fairness constraints. A portfolio manager allocating capital across assets is solving a quadratic program under risk limits. These are not abstract mathematical exercises; they are the daily operational decisions that determine whether organizations are profitable, efficient, and compliant.
We introduce ConstraintBench, a dual-purpose system: (1) an LLM benchmark for constrained optimization spanning 10 Operations Research domains with all ground-truth solutions verified by the Gurobi Optimizer, and (2) an RL training environment capable of generating tens of thousands of unique problem instances across the same domains, with deterministic, solver-verified reward signals requiring no human annotation. We benchmark six frontier models and find that no model surpasses 31% on tasks requiring both feasibility and optimality.
Background: Constrained Optimization, Solvers, and the Formulation-vs-Solution Distinction
Operations Research (OR) is the discipline of applying mathematical methods to decision-making problems. OR problems typically share a common structure: given a set of decision variables, an objective function, and a set of constraints, find the assignment of variables that optimizes the objective while satisfying every constraint.
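The structure above can be made concrete with a toy instance. The following sketch solves a tiny production-mix problem by brute force; all quantities, names, and coefficients are illustrative inventions, not drawn from ConstraintBench itself:

```python
from itertools import product

# Toy production-mix instance (illustrative numbers, not from ConstraintBench):
# choose integer quantities of two products to maximize profit subject to
# machine-hour and labor-hour capacity constraints.
PROFIT        = {"widget": 3, "gadget": 5}
MACHINE_HOURS = {"widget": 1, "gadget": 2}   # hours per unit
LABOR_HOURS   = {"widget": 3, "gadget": 2}   # hours per unit
MACHINE_CAP, LABOR_CAP = 14, 18

def feasible(x: int, y: int) -> bool:
    """A solution must satisfy every constraint; violating one makes it infeasible."""
    return (MACHINE_HOURS["widget"] * x + MACHINE_HOURS["gadget"] * y <= MACHINE_CAP
            and LABOR_HOURS["widget"] * x + LABOR_HOURS["gadget"] * y <= LABOR_CAP)

def solve():
    """Exhaustively search the (small) feasible region for the optimal assignment."""
    best, best_val = None, float("-inf")
    for x, y in product(range(MACHINE_CAP + 1), repeat=2):
        if feasible(x, y):
            val = PROFIT["widget"] * x + PROFIT["gadget"] * y
            if val > best_val:
                best, best_val = (x, y), val
    return best, best_val

print(solve())  # ((2, 6), 36): 2 widgets, 6 gadgets, profit 36
```

Brute force works only at toy scale; real OR instances require a solver such as Gurobi, which is exactly the role it plays in ConstraintBench's verification pipeline.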
Formulation vs. Direct Solution
A growing body of work has begun to benchmark LLMs on optimization tasks, but the focus has been almost exclusively on formulation. ConstraintBench targets a complementary question: can a model directly produce correct decisions for a fully specified optimization problem, without access to a solver at inference time?
ConstraintBench
Task Structure
Each ConstraintBench problem is presented as a natural-language prompt describing a scenario with entities, constraints, and an optimization objective. The model must return a structured solution conforming to a domain-specific schema.
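As a sketch of what "conforming to a domain-specific schema" might look like, here is a minimal structural check for a hypothetical Bin Packing response format. The JSON shape, field names, and helper below are assumptions for illustration; the actual ConstraintBench schemas may differ:

```python
import json

def validate_bin_packing(raw: str, items: set, n_bins: int) -> dict:
    """Check structural conformance of a hypothetical bin-packing response:
    every item is assigned exactly once, to a valid bin index.
    (Constraint checks such as bin capacity would happen in a later stage.)"""
    sol = json.loads(raw)
    assignments = sol["assignments"]  # e.g. {"item_a": 0, "item_b": 1, ...}
    assert set(assignments) == items, "every item must be assigned exactly once"
    assert all(isinstance(b, int) and 0 <= b < n_bins
               for b in assignments.values()), "bin index out of range"
    return assignments

example = '{"assignments": {"item_a": 0, "item_b": 0, "item_c": 1}}'
print(validate_bin_packing(example, {"item_a", "item_b", "item_c"}, n_bins=2))
```

Separating schema validation from constraint checking lets a grader distinguish malformed output from well-formed but infeasible solutions.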
Domain Coverage
ConstraintBench spans 10 OR domains: Order Fulfillment, Production Mix, Shift Scheduling, Crew Assignment, Job-Shop Scheduling, Project Planning, Vehicle Routing, Bin Packing, Portfolio Optimization, and Facility Location.
Task Generation
Tasks are generated from a combinatorial product of industry contexts, company scales, urgency levels, geographic regions, and domain specializations, yielding approximately 28,000 unique seeds per domain.
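Because the axes are independent, their sizes multiply. The dimension names and counts below are illustrative assumptions (chosen so the product lands near the stated ~28,000), not the actual ConstraintBench axes:

```python
from itertools import product

# Illustrative generator axes (names and counts are assumptions, not the
# real ConstraintBench dimensions). A seed is one combination of axis values.
INDUSTRIES      = [f"industry_{i}" for i in range(20)]   # 20 industry contexts
SCALES          = ["startup", "smb", "mid_market", "enterprise"]  # 4 scales
URGENCIES       = ["routine", "elevated", "critical"]             # 3 levels
REGIONS         = [f"region_{i}" for i in range(10)]     # 10 regions
SPECIALIZATIONS = [f"spec_{i}" for i in range(12)]       # 12 specializations

seeds = list(product(INDUSTRIES, SCALES, URGENCIES, REGIONS, SPECIALIZATIONS))
print(len(seeds))  # 20 * 4 * 3 * 10 * 12 = 28800 combinations per domain
```

Each seed then parameterizes a concrete instance, whose ground-truth solution is computed by the solver.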
Benchmark Results
The results demonstrate that feasibility, not optimality, is the primary bottleneck. The best-performing model (gpt-5.2-pro) achieves only 65.3% feasibility, meaning over a third of its solutions violate at least one constraint. Yet among feasible solutions, optimality rates are consistently high (95.2% for gpt-5.2-pro), suggesting that when models do find the feasible region, they tend to find good solutions within it.
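The two-stage metric implied above can be sketched as follows. The record format and the sample numbers are illustrative (loosely echoing the reported rates), not actual benchmark data; the key relationship is that the overall pass rate is the feasibility rate times the conditional optimality rate:

```python
# Sketch of a feasibility-then-optimality scoring scheme (field names and
# sample data are illustrative, not real benchmark records).
def score(records):
    """Return (feasibility rate, optimality rate among feasible, overall rate)."""
    feasible = [r for r in records if r["feasible"]]
    feas_rate = len(feasible) / len(records)
    opt_rate = (sum(r["optimal"] for r in feasible) / len(feasible)
                if feasible else 0.0)
    return feas_rate, opt_rate, feas_rate * opt_rate

# 100 hypothetical attempts: 65 feasible, of which 62 are also optimal.
records = ([{"feasible": True,  "optimal": True}]  * 62
         + [{"feasible": True,  "optimal": False}] * 3
         + [{"feasible": False, "optimal": False}] * 35)

feas, opt, overall = score(records)
print(f"feasibility={feas:.1%}, optimality|feasible={opt:.1%}, overall={overall:.1%}")
```

The decomposition makes the bottleneck visible: a high conditional optimality rate multiplied by a low feasibility rate still yields a low overall score.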
ConstraintBench as an RL Training Environment
Beyond benchmarking, the same infrastructure that generates and verifies ConstraintBench instances can serve as a scalable RL training environment. The key property is that Gurobi provides deterministic, formally grounded reward signals without human annotation.
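One natural shape for such a reward, sketched below under stated assumptions (this is an assumed design for illustration, not the published ConstraintBench reward function): infeasible solutions earn zero, and feasible ones earn the ratio of achieved to optimal objective value, with the optimum supplied by the solver:

```python
def reward(is_feasible: bool, objective: float, optimal: float) -> float:
    """Deterministic, solver-verified reward for a maximization problem
    (an illustrative design, not the published ConstraintBench reward).
    `optimal` would come from Gurobi solving the same instance to optimality."""
    if not is_feasible:
        return 0.0                      # any constraint violation zeroes the reward
    if optimal == 0:
        return 1.0 if objective == 0 else 0.0
    return max(0.0, min(1.0, objective / optimal))  # clamp to [0, 1]

print(reward(False, 90.0, 100.0))  # constraint violated -> 0.0
print(reward(True,  90.0, 100.0))  # feasible at 90% of optimal -> 0.9
print(reward(True, 100.0, 100.0))  # optimal -> 1.0
```

Because the solver's optimum is deterministic for a given instance, the same rollout always receives the same reward, with no human annotation or learned reward model in the loop.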
Next Steps
We are extending ConstraintBench from 10 to 100 OR domains. We are also developing a perturbation engine that leverages Gurobi's sensitivity analysis, and pursuing several further directions: satisfiability evaluation, sensitivity-analysis-based reward shaping, scenario framing bias analysis, self-report discrepancy analysis, and multi-turn agentic evaluation.