Generate proteins like LLMs generate text — autoregressively, atom by atom.
Primer
Boltzmann Generators (BGs) combine an exact-likelihood generative model $p_\theta(x)$ with an energy function $\mathcal{E}(x)$ to align observables with the target Boltzmann distribution:
$$\mu_{\text{target}}(x) \;\propto\; e^{-\mathcal{E}(x)},$$where $x \in \mathbb{R}^{n \times 3}$ is a conformation of $n$ atoms and $\mathcal{E}$ is the (dimensionless) potential energy that maps $\mathbb{R}^{n \times 3} \to \mathbb{R}$. The correction step — self-normalized importance sampling (SNIS) — assigns each sample $x^i \sim p_\theta$ a weight $w(x^i) = e^{-\mathcal{E}(x^i)}/p_\theta(x^i)$, which serves to de-bias samples in accordance with the Boltzmann density. For this to be consistent, the model must provide an exact likelihood $p_\theta(x)$ and a proposal that lies within the support of the target — two requirements that drive architecture choice.
Normalizing flows
Discrete-time (NFs)
Discrete-time NFs compose a stack of analytically invertible layers $f_1, \ldots, f_K$ that transport a prior sample $z_0 \sim p_0$ to $x = f_K \circ \cdots \circ f_1(z_0)$. The resulting transport map is $\mathbb{R}^d \to \mathbb{R}^d$, and conserves probability mass through the network via a discrete change of variables. Consequently, the likelihood follows directly from the change-of-variables formula:
$$\log p_\theta(x) \;=\; \log p_0(z_0) \;-\; \sum_{k=1}^{K} \log\!\left|\det \frac{\partial f_k}{\partial z_{k-1}}(z_{k-1})\right|.$$Using this formulation, likelihoods are efficient, requiring one forward pass; however, every layer must admit a tractable Jacobian determinant, which sharply restricts the family of admissible architectures. A deeper, more structural problem is topological: a flow with a Gaussian prior is a homeomorphism, while the Boltzmann distribution of a peptide is multi-modal with approximately disconnected support. Mapping a single connected prior onto several well-separated basins forces the model to deform space across thin bridges, which can produce poorly-conditioned Jacobians, unstable training, and a practical ceiling on expressivity.
Continuous-time (CNFs)
CNFs replace the layer-by-layer composition with a learned vector field $v_\theta(x_t, t)$ that transports a prior sample at $t = 0$ to a model sample at $t = 1$. The likelihood is given by the instantaneous change-of-variables — an integral of the trace of the Jacobian:
$$\log p_\theta(x_1) \;=\; \log p_0(x_0) \;-\; \int_0^1 \mathrm{tr}\!\left( \frac{\partial v_\theta}{\partial x_t}(x_t, t) \right) dt.$$This relaxes the architectural constraints, but the trace term is what makes likelihood evaluation expensive: computing it exactly costs $O(d)$ network evaluations per integration step. Stochastic estimators (e.g. Hutchinson’s trace estimator) bring the cost down but inject variance into the SNIS weights. Either way, the per-sample likelihood becomes the inference bottleneck — and since SNIS variance shrinks with the number of samples, this directly caps how well the corrected estimator can perform.
Autoregressive Boltzmann Generators
Both problems disappear if we drop the flow and factor the joint density over coordinates instead. Flattening $x \in \mathbb{R}^{n \times 3}$ to a $d$-dimensional vector $(x_1, \ldots, x_d)$ and applying the chain rule yields:
$$\log p_\theta(x) \;=\; \log \prod_{j=1}^{d} p_\theta\bigl(x_j \,\vert\, x_{\lt j}\bigr) \;=\; \sum_{j=1}^{d} \log p_\theta\bigl(x_j \,\vert\, x_{\lt j}\bigr).$$The joint log-likelihood is a sum of $d$ scalar log-probabilities computed in a single forward pass — no Jacobian determinant, no ODE integration, and no architectural constraint on invertibility.
There is, however, an immediate mismatch to resolve. Atomic coordinates live in continuous space ($x_j \in \mathbb{R}$), while autoregressive models — transformers in particular — are built to emit categorical distributions over a finite vocabulary of tokens. How do you parameterize a continuous density with an intrinsically discrete model?
Bin parameterization
Each conditional $p_\theta(x_j \vert x_{\lt j})$ is parameterized as a categorical distribution over a uniform partition of the coordinate’s support $[C_{\min}, C_{\max}]$ into $L$ bins. At sampling time, a continuous coordinate is recovered by drawing uniformly within the selected bin. Training uses maximum likelihood estimation (MLE) with teacher forcing: at every position $j$ the model sees the ground-truth prefix $x_{\lt j}$ and minimizes the cross-entropy of its predicted bin distribution against the bin index of the ground-truth $x_j$. Because each conditional is a categorical, the resulting density on $x$ is piecewise-constant; with sufficiently many bins it converges to the underlying continuous distribution and the SNIS weights remain exact in the continuous limit.
Architecture
We use a decoder-only transformer with causal self-attention — the same family of architectures used to power modern large language models. Per-coordinate embeddings carry atom type, residue type, and positional information. For Robin, our 132 million parameter transferable Boltzmann Generator, the conditioning is augmented with residue position and overall sequence length, which turns the model into a sequence-to-structure generator: given a peptide sequence, it samples an equilibrium ensemble.
Inference-time techniques
Because the model is autoregressive, decoding tricks developed for large language models apply almost without modification. Below, $\tau$ denotes the sampling temperature (distinct from the physical temperature, which we have already absorbed into $\mathcal{E}$):
- Temperature scaling. Logits are divided by $\tau > 0$ before the softmax, which controls sample diversity. For transferable generation, we tune $\tau$ to optimize performance, and use $\tau = 0.95$ for all studies conducted.
- KV-caching. The attention keys and values computed for the prefix are stored and reused, so generating each new coordinate is a single-token append rather than a full re-encode.
- Twisted SMC. SMC maintains a population of partial generations and resamples at residue boundaries using twist functions: scalar approximations of the future likelihood under the target, computed from the partial energy of the residues generated so far. Implausible substructures (e.g., steric clashes) are filtered out before they propagate. Autoregressive sampling is well-suited to this, since intermediate prefixes can be scored before the rest of the conformation is generated.
Robin
Single-system results
We start by training a separate model per system on five peptides, ranging from the 2-residue alanine dipeptide to the 10-residue Chignolin. Following prior work, the ground-truth reference is taken from long unbiased molecular dynamics (MD) trajectories. The first thing to check is whether the proposal $p_\theta$ overlaps the target well enough that the SNIS-corrected distribution recovers the Boltzmann statistics. Figure 5 overlays the proposal, the SNIS-resampled distribution, and the MD reference for each system.
The quantitative picture across four systems is summarized in Table 1; the energy and structural distributions behind these numbers follow.
| Algorithm ↕ | Tri-alanine | Tetrapeptide | Hexa-alanine | Chignolin | ||||
|---|---|---|---|---|---|---|---|---|
| $\mathcal{E}\text{-}\mathcal{W}_2$ ↕ | $\mathbb{T}\text{-}\mathcal{W}_2$ ↕ | $\mathcal{E}\text{-}\mathcal{W}_2$ ↕ | $\mathbb{T}\text{-}\mathcal{W}_2$ ↕ | $\mathcal{E}\text{-}\mathcal{W}_2$ ↕ | $\mathbb{T}\text{-}\mathcal{W}_2$ ↕ | $\mathcal{E}\text{-}\mathcal{W}_2$ ↕ | $\mathbb{T}\text{-}\mathcal{W}_2$ ↕ | |
| ECNF++ | 2.206 ±0.813 | 0.962 ±0.253 | 5.638 ±0.483 | 1.002 ±0.061 | 10.668 ±0.285 | 1.902 ±0.055 | — | — |
| RegFlow | 0.853 ±0.105 | 1.577 ±0.140 | 3.277 ±0.546 | 2.342 ±0.102 | — | — | — | — |
| SBG | 0.598 ±0.084 | 0.503 ±0.029 | 1.007 ±0.382 | 1.039 ±0.069 | 1.189 ±0.357 | 1.444 ±0.140 | 10.819 ±7.206 | 3.778 ±0.440 |
| FALCON-A | 1.385 ±0.182 | 0.343 ±0.004 | 2.929 ±0.068 | 1.094 ±0.034 | 1.211 ±0.105 | 1.163 ±0.112 | — | — |
| FALCON | 0.544 ±0.013 | 0.452 ±0.011 | 0.686 ±0.047 | 0.858 ±0.077 | 0.892 ±0.311 | 1.256 ±0.132 | — | — |
| GIVT | 1.354 ±0.058 | 0.343 ±0.008 | 1.033 ±0.449 | 1.113 ±0.100 | 1.206 ±0.056 | 1.527 ±0.048 | 45.646 ±20.989 | 3.031 ±0.098 |
| MoL-PixelCNN++ | 0.506 ±0.082 | 1.024 ±0.686 | 1.643 ±0.504 | 1.415 ±0.110 | 1.429 ±0.186 | 1.264 ±0.205 | 140.717 ±49.113 | 3.391 ±0.093 |
| GMM-PixelCNN++ | 0.249 ±0.025 | 0.364 ±0.016 | 1.434 ±0.783 | 0.806 ±0.056 | 1.164 ±0.037 | 1.285 ±0.058 | 23.339 ±6.485 | 3.007 ±0.086 |
| ArBG (ours) | 0.202 ±0.010 | 0.312 ±0.003 | 0.449 ±0.030 | 0.592 ±0.010 | 0.328 ±0.122 | 1.094 ±0.052 | 1.723 ±0.075 | 2.632 ±0.044 |
Agreement on the energy is a local consistency check; a more demanding question is whether the sampler actually covers the relevant regions of conformational space. Chignolin is the standard benchmark for this: its $\beta$-hairpin secondary structure produces several well-separated torsional modes, and prior flow-based methods tend to miss some of them. Figure 6 places the per-residue Ramachandran plots from the MD reference and from our model side by side.
Transferable generation
The next question is whether a single network can learn to sample many peptides at once. We train Robin, a 132 million parameter conditional causal transformer, on ManyPeptidesMD — a corpus of biased MD trajectories spanning many distinct peptide sequences — and evaluate it on held-out unbiased 8-residue peptides that never appear during training. As an aggregate metric we report the squared 2-Wasserstein distance between predicted and reference energy distributions, denoted $\mathcal{E}\text{-}\mathcal{W}_2$. Averaged over 30 unseen octapeptides, Robin reduces $\mathcal{E}\text{-}\mathcal{W}_2$ by roughly 60% relative to Prose, the prior flow-based state-of-the-art. Table 2 reports the aggregate zero-shot numbers; Figure 7 then zooms in on which conformational modes the model is actually capturing.
| Model ↕ | 4-residue (30 systems) | 8-residue (30 systems) | ||||
|---|---|---|---|---|---|---|
| $\mathcal{E}\text{-}\mathcal{W}_2$ ↕ | $\mathbb{T}\text{-}\mathcal{W}_2$ ↕ | $\mathrm{TICA}\text{-}\mathcal{W}_2$ ↕ | $\mathcal{E}\text{-}\mathcal{W}_2$ ↕ | $\mathbb{T}\text{-}\mathcal{W}_2$ ↕ | $\mathrm{TICA}\text{-}\mathcal{W}_2$ ↕ | |
| TimeWarp | 7.237 | 2.204 | 0.993 | — | — | — |
| BioEmu | 90.079 | 2.037 | 1.479 | 193.873 | 4.638 | 1.601 |
| UniSim | >10⁴ | 2.766 | 1.733 | >10³ | 6.156 | 1.495 |
| ECNF++ | 10.032 | 1.121 | 0.572 | — | — | — |
| TarFlow | 1.260 | 0.924 | 0.492 | 11.298 | 2.733 | 1.087 |
| Prose | 0.932 | 0.752 | 0.367 | 10.038 | 2.456 | 0.988 |
| Robin (ours) | 1.168 | 0.886 | 0.471 | 4.251 | 2.325 | 0.943 |
| Robin (ours) + SMC | 1.079 | 0.874 | 0.463 | 4.263 | 2.315 | 0.977 |
| TarFlow | 0.929 | 0.776 | 0.498 | 10.826 | 2.320 | 1.057 |
| Prose | 0.646 | 0.607 | 0.349 | 9.360 | 2.019 | 0.960 |
| Robin (ours) | 0.531 | 0.649 | 0.379 | 3.615 | 1.902 | 0.882 |
References
- Frank Noé, Simon Olsson, Jonas Köhler, and Hao Wu (2019). Boltzmann Generators: Sampling equilibrium states of many-body systems with deep learning. Science.
- Danilo Jimenez Rezende and Shakir Mohamed (2015). Variational Inference with Normalizing Flows. ICML.
- Ricky T. Q. Chen, Yulia Rubanova, Jesse Bettencourt, and David K. Duvenaud (2018). Neural Ordinary Differential Equations. NeurIPS.
- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin (2017). Attention Is All You Need. NeurIPS.
- Danyal Rehman, Charlie B. Tan, Yoshua Bengio, Joey Bose, and Alexander Tong (2026). Autoregressive Boltzmann Generators. ICML.
Cite
If you found this useful, the BibTeX is below:
@inproceedings{rehman2026arbg, title = {{A}utoregressive {B}oltzmann {G}enerators}, author = {Rehman, Danyal and Tan, Charlie B and Bengio, Yoshua and Bose, Joey and Tong, Alexander}, booktitle = {International Conference on Machine Learning (ICML)}, year = {2026}, url = {https://arxiv.org/abs/2606.27361}, doi = {10.48550/arXiv.2606.27361}, eprint = {2606.27361}, archivePrefix= {arXiv}, primaryClass = {cs.LG}, note = {Spotlight at ICML 2026} }
@inproceedings{rehman2026arbg,
title = {{A}utoregressive {B}oltzmann {G}enerators},
author = {Rehman, Danyal and Tan, Charlie B and Bengio, Yoshua and Bose, Joey and Tong, Alexander},
booktitle = {International Conference on Machine Learning (ICML)},
year = {2026},
url = {https://arxiv.org/abs/2606.27361},
doi = {10.48550/arXiv.2606.27361},
eprint = {2606.27361},
archivePrefix= {arXiv},
primaryClass = {cs.LG},
note = {Spotlight at ICML 2026}
}