# Etalumis: Scientific simulators as probabilistic programs

## Team

• Atılım Güneş Baydin (University of Oxford)
• Lukas Heinrich (CERN)
• Wahid Bhimji (Lawrence Berkeley National Lab)
• Lei Shao (Intel Corp)
• Saeid Naderiparizi (University of British Columbia)
• Andreas Munk (University of British Columbia)
• Jialin Liu (Lawrence Berkeley National Lab)
• Bradley Gram-Hansen (University of Oxford)
• Gilles Louppe (University of Liege)
• Philip Torr (University of Oxford)
• Victor Lee (Intel Corp)
• Prabhat (Lawrence Berkeley National Lab)
• Kyle Cranmer (New York University)
• Frank Wood (University of British Columbia)

## Abstract

We present a novel probabilistic programming framework that couples directly to existing large-scale simulators through a cross-platform probabilistic execution protocol, which allows general-purpose inference engines to record and control random number draws within simulators in a language-agnostic way. The execution of existing simulators as probabilistic programs enables highly interpretable posterior inference in the structured model defined by the simulator code base. We demonstrate the technique in particle physics, on a scientifically accurate simulation of the tau lepton decay, which is a key ingredient in establishing the properties of the Higgs boson. Inference efficiency is achieved via inference compilation where a deep recurrent neural network is trained to parameterize proposal distributions and control the stochastic simulator in a sequential importance sampling scheme, at a fraction of the computational cost of a Markov chain Monte Carlo baseline.
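The idea of an inference engine recording and controlling a simulator's random draws can be illustrated with a minimal, self-contained sketch. It is not the framework's actual protocol or API: the names `Trace`, `sample`, `observe`, `simulator`, and `posterior_mean` are hypothetical, the toy simulator is a one-variable Gaussian model rather than a physics code, and the proposal is a fixed hand-picked distribution standing in for what a trained neural network would supply.

```python
import math
import random


def normal_logpdf(x, mu, sigma):
    """Log density of a normal distribution N(mu, sigma^2) at x."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)


class Trace:
    """Hypothetical stand-in for the inference-engine side of the protocol.

    Every random draw the simulator requests is routed through `sample`,
    so the engine can record it and, optionally, replace the prior with
    a proposal distribution while tracking the importance weight.
    """

    def __init__(self, proposals=None, rng=None):
        self.proposals = proposals or {}  # address -> (mu, sigma) of proposal
        self.rng = rng or random
        self.values = {}                  # recorded draws, keyed by address
        self.log_weight = 0.0

    def sample(self, address, mu, sigma):
        # Draw from the proposal if one exists for this address, else the prior.
        q_mu, q_sigma = self.proposals.get(address, (mu, sigma))
        x = self.rng.gauss(q_mu, q_sigma)
        # Importance correction: log prior density minus log proposal density.
        self.log_weight += normal_logpdf(x, mu, sigma) - normal_logpdf(x, q_mu, q_sigma)
        self.values[address] = x
        return x

    def observe(self, address, value, mu, sigma):
        # Condition on observed data by adding its log likelihood.
        self.log_weight += normal_logpdf(value, mu, sigma)


def simulator(trace, observed):
    """Toy 'existing simulator': latent mu ~ N(0, 1), observation y ~ N(mu, 1)."""
    mu = trace.sample("mu", 0.0, 1.0)
    trace.observe("y", observed, mu, 1.0)
    return mu


def posterior_mean(observed, proposals, num_traces, seed=0):
    """Self-normalized importance sampling estimate of E[mu | y = observed]."""
    rng = random.Random(seed)
    traces = []
    for _ in range(num_traces):
        trace = Trace(proposals, rng)
        simulator(trace, observed)
        traces.append(trace)
    # Subtract the maximum log weight before exponentiating, for stability.
    max_lw = max(t.log_weight for t in traces)
    weights = [math.exp(t.log_weight - max_lw) for t in traces]
    total = sum(weights)
    return sum(w * t.values["mu"] for w, t in zip(weights, traces)) / total


# Proposal placed near the analytic posterior N(1, 1/2) for y = 2.
estimate = posterior_mean(2.0, {"mu": (1.0, 0.8)}, num_traces=5000)
```

For this conjugate toy model the exact posterior mean is 1, so `estimate` should land close to that value. Passing an empty `proposals` dictionary falls back to sampling from the prior, which also converges here but wastes more traces in low-probability regions; that gap is what learned proposals are meant to close in much larger simulators.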

## Publications

1. Naderiparizi, Saeid, Adam Ścibior, Andreas Munk, Mehrdad Ghadiri, Atılım Güneş Baydin, Bradley Gram-Hansen, Christian Schroeder de Witt, Robert Zinkov, Philip H.S. Torr, Tom Rainforth, Yee Whye Teh, and Frank Wood. 2022. “Amortized Rejection Sampling in Universal Probabilistic Programming.” In Proceedings of the 25th International Conference on Artificial Intelligence and Statistics (AISTATS).

Naive approaches to amortized inference in probabilistic programs with unbounded loops can produce estimators with infinite variance. This is particularly true of importance sampling inference in programs that explicitly include rejection sampling as part of the user-programmed generative procedure. In this paper we develop a new and efficient amortized importance sampling estimator. We prove finite variance of our estimator, empirically demonstrate our method’s correctness and efficiency compared to existing alternatives on generative programs containing rejection sampling loops, and discuss how to implement our method in a generic probabilistic programming framework.

@inproceedings{naderiparizi-2022-amortized,
title = {Amortized Rejection Sampling in Universal Probabilistic Programming},
author = {Naderiparizi, Saeid and {\'S}cibior, Adam and Munk, Andreas and Ghadiri, Mehrdad and Baydin, At{\i}l{\i}m G{\"u}ne{\c{s}} and Gram-Hansen, Bradley and de Witt, Christian Schroeder and Zinkov, Robert and Torr, Philip H.S. and Rainforth, Tom and Teh, Yee Whye and Wood, Frank},
booktitle = {Proceedings of the 25th International Conference on Artificial Intelligence and Statistics (AISTATS)},
year = {2022}
}

2. Baydin, Atılım Güneş, Lukas Heinrich, Wahid Bhimji, Lei Shao, Saeid Naderiparizi, Andreas Munk, Jialin Liu, Bradley Gram-Hansen, Gilles Louppe, Lawrence Meadows, Philip Torr, Victor Lee, Prabhat, Kyle Cranmer, and Frank Wood. 2019. “Efficient Probabilistic Inference in the Quest for Physics Beyond the Standard Model.” In Advances in Neural Information Processing Systems 32 (NeurIPS).

We present a novel probabilistic programming framework that couples directly to existing large-scale simulators through a cross-platform probabilistic execution protocol, which allows general-purpose inference engines to record and control random number draws within simulators in a language-agnostic way. The execution of existing simulators as probabilistic programs enables highly interpretable posterior inference in the structured model defined by the simulator code base. We demonstrate the technique in particle physics, on a scientifically accurate simulation of the tau lepton decay, which is a key ingredient in establishing the properties of the Higgs boson. Inference efficiency is achieved via inference compilation where a deep recurrent neural network is trained to parameterize proposal distributions and control the stochastic simulator in a sequential importance sampling scheme, at a fraction of the computational cost of a Markov chain Monte Carlo baseline.

@inproceedings{baydin-2019-quest-for-physics,
title = {Efficient Probabilistic Inference in the Quest for Physics Beyond the Standard Model},
author = {Baydin, Atılım Güneş and Heinrich, Lukas and Bhimji, Wahid and Shao, Lei and Naderiparizi, Saeid and Munk, Andreas and Liu, Jialin and Gram-Hansen, Bradley and Louppe, Gilles and Meadows, Lawrence and Torr, Philip and Lee, Victor and Prabhat and Cranmer, Kyle and Wood, Frank},
booktitle = {Advances in Neural Information Processing Systems 32 (NeurIPS)},
year = {2019}
}

3. Baydin, Atılım Güneş, Lei Shao, Wahid Bhimji, Lukas Heinrich, Lawrence F. Meadows, Jialin Liu, Andreas Munk, Saeid Naderiparizi, Bradley Gram-Hansen, Gilles Louppe, Mingfei Ma, Xiaohui Zhao, Philip Torr, Victor Lee, Kyle Cranmer, Prabhat, and Frank Wood. 2019. “Etalumis: Bringing Probabilistic Programming to Scientific Simulators at Scale.” In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. SC ’19. New York, NY, USA: Association for Computing Machinery. doi:10.1145/3295500.3356180.

Probabilistic programming languages (PPLs) are receiving widespread attention for performing Bayesian inference in complex generative models. However, applications to science remain limited because of the impracticability of rewriting complex scientific simulators in a PPL, the computational cost of inference, and the lack of scalable implementations. To address these, we present a novel PPL framework that couples directly to existing scientific simulators through a cross-platform probabilistic execution protocol and provides Markov chain Monte Carlo (MCMC) and deep-learning-based inference compilation (IC) engines for tractable inference. To guide IC inference, we perform distributed training of a dynamic 3DCNN–LSTM architecture with a PyTorch-MPI-based framework on 1,024 32-core CPU nodes of the Cori supercomputer with a global minibatch size of 128k, achieving a performance of 450 Tflop/s through enhancements to PyTorch. We demonstrate a Large Hadron Collider (LHC) use-case with the C++ Sherpa simulator and achieve the largest-scale posterior inference in a Turing-complete PPL.

@inproceedings{baydin-2019-etalumis,
author = {Baydin, Atılım Güneş and Shao, Lei and Bhimji, Wahid and Heinrich, Lukas and Meadows, Lawrence F. and Liu, Jialin and Munk, Andreas and Naderiparizi, Saeid and Gram-Hansen, Bradley and Louppe, Gilles and Ma, Mingfei and Zhao, Xiaohui and Torr, Philip and Lee, Victor and Cranmer, Kyle and Prabhat and Wood, Frank},
title = {Etalumis: Bringing Probabilistic Programming to Scientific Simulators at Scale},
year = {2019},
isbn = {9781450362290},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3295500.3356180},
doi = {10.1145/3295500.3356180},
booktitle = {Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis},
articleno = {Article 29},
numpages = {24},
keywords = {inference, probabilistic programming, deep learning, simulation},
series = {SC ’19}
}

1. Schroeder de Witt, Christian, Bradley Gram-Hansen, Nantas Nardelli, Andrew Gambardella, Rob Zinkov, Puneet Dokania, N. Siddharth, Ana Belen Espinosa-Gonzalez, Ara Darzi, Philip Torr, and Atılım Güneş Baydin. 2020. “Simulation-Based Inference for Global Health Decisions.” In ICML Workshop on Machine Learning for Global Health, Thirty-Seventh International Conference on Machine Learning (ICML 2020).

The COVID-19 pandemic has highlighted the importance of in-silico epidemiological modelling in predicting the dynamics of infectious diseases to inform health policy and decision makers about suitable prevention and containment strategies. Work in this setting involves solving challenging inference and control problems in individual-based models of ever increasing complexity. Here we discuss recent breakthroughs in machine learning, specifically in simulation-based inference, and explore its potential as a novel venue for model calibration to support the design and evaluation of public health interventions. To further stimulate research, we are developing software interfaces that turn two cornerstone COVID-19 and malaria epidemiology models (CovidSim and OpenMalaria) into probabilistic programs, enabling efficient interpretable Bayesian inference within those simulators.

@inproceedings{schroederdewitt-2020-simulation,
title = {Simulation-Based Inference for Global Health Decisions},
author = {{Schroeder de Witt}, Christian and Gram-Hansen, Bradley and Nardelli, Nantas and Gambardella, Andrew and Zinkov, Rob and Dokania, Puneet and Siddharth, N. and Espinosa-Gonzalez, Ana Belen and Darzi, Ara and Torr, Philip and Baydin, Atılım Güneş},
booktitle = {ICML Workshop on Machine Learning for Global Health, Thirty-seventh International Conference on Machine Learning (ICML 2020)},
year = {2020}
}

2. Naderiparizi, Saeid, Adam Ścibior, Andreas Munk, Mehrdad Ghadiri, Atılım Güneş Baydin, Bradley Gram-Hansen, Christian Schroeder de Witt, Robert Zinkov, Philip H.S. Torr, Tom Rainforth, Yee Whye Teh, and Frank Wood. 2020. “Amortized Rejection Sampling in Universal Probabilistic Programming.” In International Conference on Probabilistic Programming (PROBPROG 2020), Cambridge, MA, United States. https://probprog.cc/.

@inproceedings{naderiparizi-2020-amortized,
title = {Amortized Rejection Sampling in Universal Probabilistic Programming},
author = {Naderiparizi, Saeid and {\'S}cibior, Adam and Munk, Andreas and Ghadiri, Mehrdad and Baydin, At{\i}l{\i}m G{\"u}ne{\c{s}} and Gram-Hansen, Bradley and de Witt, Christian Schroeder and Zinkov, Robert and Torr, Philip H.S. and Rainforth, Tom and Teh, Yee Whye and Wood, Frank},
booktitle = {International Conference on Probabilistic Programming (PROBPROG 2020), Cambridge, MA, United States},
year = {2020},
url = {https://probprog.cc/}
}

3. Gram-Hansen, Bradley, Christian Schroeder de Witt, Robert Zinkov, Saeid Naderiparizi, Adam Ścibior, Andreas Munk, Frank Wood, Mehrdad Ghadiri, Philip Torr, Yee Whye Teh, Atılım Güneş Baydin, and Tom Rainforth. 2019. “Efficient Bayesian Inference for Nested Simulators.” In Second Symposium on Advances in Approximate Bayesian Inference (AABI), Vancouver, Canada, 8 December 2019.

We introduce two approaches for conducting efficient Bayesian inference in stochastic simulators containing nested stochastic sub-procedures, i.e., internal procedures such as rejection sampling loops for which the density cannot be calculated directly. Such simulators are standard throughout the sciences and can be interpreted as probabilistic generative models. However, drawing inferences from them poses a substantial challenge due to the inability to evaluate even their unnormalised density. To address this, we introduce inference algorithms based on a two-step procedure where one first tackles the sub-procedures as amortised inference problems, then uses the learned artefacts to construct an approximation of the original unnormalised density that can be used as a target for Markov chain Monte Carlo methods. Because the sub-procedures can be dealt with separately and are lower-dimensional than the overall problem, this two-step process allows them to be isolated and thus tractably dealt with, without placing restrictions on the overall dimensionality of the problem. We demonstrate the utility of our methods on a simple, artificially constructed simulator.

@inproceedings{gramhansen-2019-efficient,
author = {Gram-Hansen, Bradley and de Witt, Christian Schroeder and Zinkov, Robert and Naderiparizi, Saeid and Scibior, Adam and Munk, Andreas and Wood, Frank and Ghadiri, Mehrdad and Torr, Philip and Teh, Yee Whye and Baydin, Atılım Güneş and Rainforth, Tom},
booktitle = {Second Symposium on Advances in Approximate Bayesian Inference (AABI), Vancouver, Canada, 8 December 2019},
title = {Efficient Bayesian Inference for Nested Simulators},
year = {2019}
}


## Acknowledgments

We thank the anonymous reviewers for their constructive comments that helped us improve this paper significantly. This research used resources of the National Energy Research Scientific Computing Center (NERSC), a U.S. Department of Energy Office of Science User Facility operated under Contract No. DE-AC02-05CH11231. This work was partially supported by the NERSC Big Data Center; we acknowledge Intel for their funding support. KC, LH, and GL were supported by the National Science Foundation under award ACI-1450310. Additionally, KC was supported by the National Science Foundation award OAC-1836650. BGH is supported by the EPSRC Autonomous Intelligent Machines and Systems grant. AGB and PT are supported by EPSRC/MURI grant EP/N019474/1 and AGB is also supported by Lawrence Berkeley National Lab. FW is supported by DARPA D3M, under Cooperative Agreement FA8750-17-2-0093, Intel under its LBNL NERSC Big Data Center, and an NSERC Discovery grant.