RLFR: Reinforcement Learning from Formally-Defined Rewards for Code Generation

By The Haladir Team | January 6th, 2026 | 8 min read

Introduction

Over the past few years, Reinforcement Learning from Verifiable Rewards (RLVR), also known as "execution-based RL," has emerged as a promising learning mechanism for general code generation, using unit test execution as the primary reward signal in both single-turn and multi-turn RL. Despite the method being largely successful in defining and rewarding generally "correct" code, RLVR's use of unit test-driven rewards sample only finite points from infinite input spaces, enabling specification-gaming (also known as reward-hacking), where models produce degenerate implementations that satisfy tests without exhibiting necessarily "correct" behavior. We introduce Reinforcement Learning from Formally-Defined Rewards for Code Generation (RLFR), a reinforcement learning paradigm for code generation which replaces unit-test execution with provably correct contract/invariant adherence and formal verification as the primary reward signal in single-turn RL.

RLFR leverages specification languages and code/specification pairs, dynamically revealing or hiding contract satisfaction requirements to the model during post-training. We demonstrate RLFR across the specification-language Dafny rather than other specification languages such as ACSL or the Viper family of languages to prove independently that RLFR may generalize past the formal verification domain.

Formal Specification Languages - Dafny

Dafny is a verification-aware programming language developed by Microsoft Research that embeds first-order logic specifications directly into its syntax, where the Dafny verifier translates annotated source code into verification conditions (VCs) discharged by the Boogie intermediate verification language and Z3 SMT solver.

Methodology

We introduce hidden contract training (HCT), where models generate implementations without seeing specifications, followed by contract re-injection for verification. We also introduce visible contract training (VCT), where models generate implementations given the specifications as context for each sample. We fine-tune over Qwen2.5-Coder-7B-Instruct and on Qwen3-8B using GRPO.

We derive our training data from the ReForm-Python2Dafny dataset, modifying the dataset to a 17,521 verified Dafny program corpus containing 56,051 methods, 7,960 functions, 8,256 predicates, and 20,496 classes with 287,048 contract clauses total.

Evaluation

Results

Both HCT/VCT models fine-tuned over Qwen2.5-Coder-7B-Instruct perform measurably better than the base model on LiveCodeBench, with the HCT model outperforming the base model by +3.7 percentage points. The HCT model also performs better on HumanEval, HumanEval+, MBPP, and MBPP+. Notably, the HCT model performs better on all benchmarks compared to DeepSeek-Coder-33B-Instruct, a significantly larger model.

These results are surprising, considering the entire post-training process was conducted purely on Dafny programs and all benchmarks listed above are in Python, suggesting generalization past the formal verification domain.

Next Steps

The true potential of RLFR lies in multi-turn RL environments that leverage rich SMT counterexample feedback for iterative refinement, scaling through synthetic formalization pipelines, and extending to annotation-based verification languages for real-world codebases.