Back to blog
ICML 2026 Spotlight Autoregressive Boltzmann Generators
Danyal Rehman1,2,3,4, Charlie B. Tan1,5, Yoshua Bengio1,4,6, Joey Bose1,7, Alexander Tong3
1Mila – Québec AI Institute  ·  2Broad Institute of MIT & Harvard  ·  3Aithyra  ·  4Université de Montréal  ·  5University of Oxford  ·  6CIFAR Senior Fellow  ·  7Imperial College London
June 25, 2026  ·  ~8 min read

Generate proteins like LLMs generate text — autoregressively, atom by atom.

Video 1. A coordinate-by-coordinate roll-out of an octapeptide (IFGWVYTG) using Robin, our 132 million parameter transferable Boltzmann Generator.
Training loss vs. compute
Robin accuracy vs. energy evaluations
Figure 1. Left: training loss vs. compute (FLOPs) for ArBG models from 8M to 126M parameters. Right: Robin’s resampled Wasserstein metrics on unseen 8-residue peptides. MD needs 2–3 orders of magnitude more energy evaluations than Robin to match on energy-W₂, nearly 4 on torus-W₂, and does not reach Robin’s TICA-W₂ within the same compute budget. Hover for values; toggle the metric; tap CSV to download the underlying data.

Primer

Boltzmann Generators (BGs) combine an exact-likelihood generative model $p_\theta(x)$ with an energy function $\mathcal{E}(x)$ to align observables with the target Boltzmann distribution:

$$\mu_{\text{target}}(x) \;\propto\; e^{-\mathcal{E}(x)},$$

where $x \in \mathbb{R}^{n \times 3}$ is a conformation of $n$ atoms and $\mathcal{E}$ is the (dimensionless) potential energy that maps $\mathbb{R}^{n \times 3} \to \mathbb{R}$. The correction step — self-normalized importance sampling (SNIS) — assigns each sample $x^i \sim p_\theta$ a weight $w(x^i) = e^{-\mathcal{E}(x^i)}/p_\theta(x^i)$, which serves to de-bias samples in accordance with the Boltzmann density. For this to be consistent, the model must provide an exact likelihood $p_\theta(x)$ and a proposal that lies within the support of the target — two requirements that drive architecture choice.

Proposal $p(x)$ Target $\pi(x) \propto e^{-\mathcal{E}(x)}$ Reweighted density ESS:
Figure 2. Importance sampling, visualized. 2,500 samples are drawn from the blue proposal $p(x) = \mathcal{N}(\mu_p, \sigma_p^2)$ and re-weighted by $w_i = \pi(x_i)/p(x_i)$ to approximate the purple bimodal target $\pi(x) \propto e^{-\mathcal{E}(x)}$. The thin purple bars are a weighted histogram of those samples — the empirical distribution after re-weighting. Slide the proposal away from a target mode or shrink its width to watch the effective sample size (ESS) collapse and the reweighted histogram fail to recover both modes — a finite-sample preview of why proposal quality determines whether SNIS-corrected estimates are usable in practice.

Normalizing flows

Discrete-time (NFs)

Discrete-time NFs compose a stack of analytically invertible layers $f_1, \ldots, f_K$ that transport a prior sample $z_0 \sim p_0$ to $x = f_K \circ \cdots \circ f_1(z_0)$. The resulting transport map is $\mathbb{R}^d \to \mathbb{R}^d$, and conserves probability mass through the network via a discrete change of variables. Consequently, the likelihood follows directly from the change-of-variables formula:

$$\log p_\theta(x) \;=\; \log p_0(z_0) \;-\; \sum_{k=1}^{K} \log\!\left|\det \frac{\partial f_k}{\partial z_{k-1}}(z_{k-1})\right|.$$

Using this formulation, likelihoods are efficient, requiring one forward pass; however, every layer must admit a tractable Jacobian determinant, which sharply restricts the family of admissible architectures. A deeper, more structural problem is topological: a flow with a Gaussian prior is a homeomorphism, while the Boltzmann distribution of a peptide is multi-modal with approximately disconnected support. Mapping a single connected prior onto several well-separated basins forces the model to deform space across thin bridges, which can produce poorly-conditioned Jacobians, unstable training, and a practical ceiling on expressivity.

Continuous-time (CNFs)

CNFs replace the layer-by-layer composition with a learned vector field $v_\theta(x_t, t)$ that transports a prior sample at $t = 0$ to a model sample at $t = 1$. The likelihood is given by the instantaneous change-of-variables — an integral of the trace of the Jacobian:

$$\log p_\theta(x_1) \;=\; \log p_0(x_0) \;-\; \int_0^1 \mathrm{tr}\!\left( \frac{\partial v_\theta}{\partial x_t}(x_t, t) \right) dt.$$

This relaxes the architectural constraints, but the trace term is what makes likelihood evaluation expensive: computing it exactly costs $O(d)$ network evaluations per integration step. Stochastic estimators (e.g. Hutchinson’s trace estimator) bring the cost down but inject variance into the SNIS weights. Either way, the per-sample likelihood becomes the inference bottleneck — and since SNIS variance shrinks with the number of samples, this directly caps how well the corrected estimator can perform.

Autoregressive Boltzmann Generators

Both problems disappear if we drop the flow and factor the joint density over coordinates instead. Flattening $x \in \mathbb{R}^{n \times 3}$ to a $d$-dimensional vector $(x_1, \ldots, x_d)$ and applying the chain rule yields:

$$\log p_\theta(x) \;=\; \log \prod_{j=1}^{d} p_\theta\bigl(x_j \,\vert\, x_{\lt j}\bigr) \;=\; \sum_{j=1}^{d} \log p_\theta\bigl(x_j \,\vert\, x_{\lt j}\bigr).$$

The joint log-likelihood is a sum of $d$ scalar log-probabilities computed in a single forward pass — no Jacobian determinant, no ODE integration, and no architectural constraint on invertibility.

There is, however, an immediate mismatch to resolve. Atomic coordinates live in continuous space ($x_j \in \mathbb{R}$), while autoregressive models — transformers in particular — are built to emit categorical distributions over a finite vocabulary of tokens. How do you parameterize a continuous density with an intrinsically discrete model?

Bin parameterization

Each conditional $p_\theta(x_j \vert x_{\lt j})$ is parameterized as a categorical distribution over a uniform partition of the coordinate’s support $[C_{\min}, C_{\max}]$ into $L$ bins. At sampling time, a continuous coordinate is recovered by drawing uniformly within the selected bin. Training uses maximum likelihood estimation (MLE) with teacher forcing: at every position $j$ the model sees the ground-truth prefix $x_{\lt j}$ and minimizes the cross-entropy of its predicted bin distribution against the bin index of the ground-truth $x_j$. Because each conditional is a categorical, the resulting density on $x$ is piecewise-constant; with sufficiently many bins it converges to the underlying continuous distribution and the SNIS weights remain exact in the continuous limit.

Autoregressive coordinate-by-coordinate generation: a causal transformer predicts a categorical over bins for each next atom and samples one at a time Autoregressive  ·  coordinate by coordinate  ·  atom by atom causal-attention transformer p(xj | x<j) categorical over bins sample x1y1z1x2y2z2 Exact likelihood  ·  one forward pass
Figure 3. ArBG factors a peptide’s joint density coordinate by coordinate: a causal-attention transformer predicts a categorical over bins for each next atom and samples one at a time. The distribution snaps to a new shape at every step — one forward pass, exact likelihood.

Architecture

We use a decoder-only transformer with causal self-attention — the same family of architectures used to power modern large language models. Per-coordinate embeddings carry atom type, residue type, and positional information. For Robin, our 132 million parameter transferable Boltzmann Generator, the conditioning is augmented with residue position and overall sequence length, which turns the model into a sequence-to-structure generator: given a peptide sequence, it samples an equilibrium ensemble.

Conditioning A R P L Non-causal Global Self-attention K, V from conditioning Inputs RMSNorm Cross Attention + Transformer Block (×N) RMSNorm Causal Self-Attention + RMSNorm Feed-forward Network SwiGLU + Bins Outputs
Figure 4. ArBG architecture. Continuous coordinates are cross-attended against a global conditioning encoder (atom, residue, position, sequence length), passed through $N$ pre-norm causal-attention blocks, and projected to a categorical over $K$ value bins — sampled to produce the next coordinate. Hover any block for details.

Inference-time techniques

Because the model is autoregressive, decoding tricks developed for large language models apply almost without modification. Below, $\tau$ denotes the sampling temperature (distinct from the physical temperature, which we have already absorbed into $\mathcal{E}$):

Robin

Single-system results

We start by training a separate model per system on five peptides, ranging from the 2-residue alanine dipeptide to the 10-residue Chignolin. Following prior work, the ground-truth reference is taken from long unbiased molecular dynamics (MD) trajectories. The first thing to check is whether the proposal $p_\theta$ overlaps the target well enough that the SNIS-corrected distribution recovers the Boltzmann statistics. Figure 5 overlays the proposal, the SNIS-resampled distribution, and the MD reference for each system.

The quantitative picture across four systems is summarized in Table 1; the energy and structural distributions behind these numbers follow.

Algorithm Tri-alanine Tetrapeptide Hexa-alanine Chignolin
$\mathcal{E}\text{-}\mathcal{W}_2$ $\mathbb{T}\text{-}\mathcal{W}_2$ $\mathcal{E}\text{-}\mathcal{W}_2$ $\mathbb{T}\text{-}\mathcal{W}_2$ $\mathcal{E}\text{-}\mathcal{W}_2$ $\mathbb{T}\text{-}\mathcal{W}_2$ $\mathcal{E}\text{-}\mathcal{W}_2$ $\mathbb{T}\text{-}\mathcal{W}_2$
ECNF++ 2.206 ±0.813 0.962 ±0.253 5.638 ±0.483 1.002 ±0.061 10.668 ±0.285 1.902 ±0.055
RegFlow 0.853 ±0.105 1.577 ±0.140 3.277 ±0.546 2.342 ±0.102
SBG 0.598 ±0.084 0.503 ±0.029 1.007 ±0.382 1.039 ±0.069 1.189 ±0.357 1.444 ±0.140 10.819 ±7.206 3.778 ±0.440
FALCON-A 1.385 ±0.182 0.343 ±0.004 2.929 ±0.068 1.094 ±0.034 1.211 ±0.105 1.163 ±0.112
FALCON 0.544 ±0.013 0.452 ±0.011 0.686 ±0.047 0.858 ±0.077 0.892 ±0.311 1.256 ±0.132
GIVT 1.354 ±0.058 0.343 ±0.008 1.033 ±0.449 1.113 ±0.100 1.206 ±0.056 1.527 ±0.048 45.646 ±20.989 3.031 ±0.098
MoL-PixelCNN++ 0.506 ±0.082 1.024 ±0.686 1.643 ±0.504 1.415 ±0.110 1.429 ±0.186 1.264 ±0.205 140.717 ±49.113 3.391 ±0.093
GMM-PixelCNN++ 0.249 ±0.025 0.364 ±0.016 1.434 ±0.783 0.806 ±0.056 1.164 ±0.037 1.285 ±0.058 23.339 ±6.485 3.007 ±0.086
ArBG (ours) 0.202 ±0.010 0.312 ±0.003 0.449 ±0.030 0.592 ±0.010 0.328 ±0.122 1.094 ±0.052 1.723 ±0.075 2.632 ±0.044
Table 1. Single-system results at 2×105 energy evaluations (lower is better). Best in bold, second-best underlined; all methods use SNIS except SBG (SMC). Shaded rows (GIVT, MoL-PixelCNN++, GMM-PixelCNN++) are autoregressive baselines we train with the same ArBG recipe — identical de-quantization and transformer blocks — isolating the effect of the proposal parameterization. Click any column header to sort.
Figure 5. Energy histograms across five peptide systems: the autoregressive proposal, the SNIS-resampled distribution, and the molecular dynamics reference. The resampled distributions track the reference closely across all systems considered, including Chignolin.

Agreement on the energy is a local consistency check; a more demanding question is whether the sampler actually covers the relevant regions of conformational space. Chignolin is the standard benchmark for this: its $\beta$-hairpin secondary structure produces several well-separated torsional modes, and prior flow-based methods tend to miss some of them. Figure 6 places the per-residue Ramachandran plots from the MD reference and from our model side by side.

Figure 6. Per-residue Ramachandran plots on Chignolin (top: molecular dynamics reference; bottom: Robin). The model recovers the populated modes in the test set, including the $\beta$-hairpin region more accurately than state-of-the-art generative approaches.

Transferable generation

The next question is whether a single network can learn to sample many peptides at once. We train Robin, a 132 million parameter conditional causal transformer, on ManyPeptidesMD — a corpus of biased MD trajectories spanning many distinct peptide sequences — and evaluate it on held-out unbiased 8-residue peptides that never appear during training. As an aggregate metric we report the squared 2-Wasserstein distance between predicted and reference energy distributions, denoted $\mathcal{E}\text{-}\mathcal{W}_2$. Averaged over 30 unseen octapeptides, Robin reduces $\mathcal{E}\text{-}\mathcal{W}_2$ by roughly 60% relative to Prose, the prior flow-based state-of-the-art. Table 2 reports the aggregate zero-shot numbers; Figure 7 then zooms in on which conformational modes the model is actually capturing.

Model 4-residue (30 systems) 8-residue (30 systems)
$\mathcal{E}\text{-}\mathcal{W}_2$ $\mathbb{T}\text{-}\mathcal{W}_2$ $\mathrm{TICA}\text{-}\mathcal{W}_2$ $\mathcal{E}\text{-}\mathcal{W}_2$ $\mathbb{T}\text{-}\mathcal{W}_2$ $\mathrm{TICA}\text{-}\mathcal{W}_2$
TimeWarp 7.2372.2040.993
BioEmu 90.0792.0371.479 193.8734.6381.601
UniSim >10⁴2.7661.733 >10³6.1561.495
ECNF++ 10.0321.1210.572
TarFlow 1.2600.9240.492 11.2982.7331.087
Prose 0.9320.7520.367 10.0382.4560.988
Robin (ours) 1.1680.8860.471 4.2512.3250.943
Robin (ours) + SMC 1.0790.8740.463 4.2632.3150.977
Table 2. Transferable (zero-shot) results, averaged over 30 unseen 4- and 8-residue peptides (lower is better). Toggle the energy-evaluation budget; click any column header to sort. Best in bold, second-best underlined.
Figure 7. Time-lagged independent component analysis (TICA) projections for the molecular dynamics reference (top) and Robin (bottom) on a representative selection of unseen octapeptides. The model recovers the slow dynamical modes — the metastable basins relevant to long-time behaviour — on peptides it has not encountered during training.

References

  1. Frank Noé, Simon Olsson, Jonas Köhler, and Hao Wu (2019). Boltzmann Generators: Sampling equilibrium states of many-body systems with deep learning. Science.
  2. Danilo Jimenez Rezende and Shakir Mohamed (2015). Variational Inference with Normalizing Flows. ICML.
  3. Ricky T. Q. Chen, Yulia Rubanova, Jesse Bettencourt, and David K. Duvenaud (2018). Neural Ordinary Differential Equations. NeurIPS.
  4. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin (2017). Attention Is All You Need. NeurIPS.
  5. Danyal Rehman, Charlie B. Tan, Yoshua Bengio, Joey Bose, and Alexander Tong (2026). Autoregressive Boltzmann Generators. ICML.

Cite

If you found this useful, the BibTeX is below:

BibTeX
@inproceedings{rehman2026arbg,
  title        = {{A}utoregressive {B}oltzmann {G}enerators},
  author       = {Rehman, Danyal and Tan, Charlie B and Bengio, Yoshua and Bose, Joey and Tong, Alexander},
  booktitle    = {International Conference on Machine Learning (ICML)},
  year         = {2026},
  url          = {https://arxiv.org/abs/2606.27361},
  doi          = {10.48550/arXiv.2606.27361},
  eprint       = {2606.27361},
  archivePrefix= {arXiv},
  primaryClass = {cs.LG},
  note         = {Spotlight at ICML 2026}
}