Inferring molecular complexity and looking for life using machine learning

Team

  • Timothy D. Gebhard (ETH Zurich)
  • Aaron C. Bell (Insight Edge)
  • Jian Gong (MIT)
  • Jaden J. A. Hastings (XO.LABS)
  • Matthew Fricke (University of New Mexico)
  • Nathalie Cabrol (SETI Institute)
  • Scott Sandford (NASA Ames Research Center)
  • Michael Phillips (Johns Hopkins University Applied Physics Laboratory)
  • Kimberley Warren-Rhodes (SETI Institute)
  • Atılım Güneş Baydin (University of Oxford)

Abstract

Molecular complexity has been proposed as a potential agnostic biosignature — in other words: a way to search for signs of life beyond Earth without relying on “life as we know it.” More than one way to compute molecular complexity has been proposed, so comparing their performance in evaluating experimental data collected in situ, such as on board a probe or rover exploring another planet, is imperative. Here, we report the results of an attempt to deploy multiple machine learning (ML) techniques to predict molecular complexity scores directly from mass spectrometry data. Our initial results are encouraging and may provide fruitful guidance toward determining which complexity measures are best suited for use with experimental data. Beyond the search for signs of life, this approach is likewise valuable for studying the chemical composition of samples to assist decisions made by the rover or probe, and may thus contribute toward supporting the need for greater autonomy.
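
As a rough illustration of what "predicting a complexity score directly from a mass spectrum" can look like in code, the sketch below bins a peak list into a fixed-length vector and assigns a score by nearest-neighbor lookup with cosine similarity. This is not the method used in the publications listed on this page. The function names, the peak lists, and the scores are hypothetical placeholders chosen only to make the example self-contained.

```python
import math


def bin_spectrum(peaks, max_mz=200, bin_width=1.0):
    """Convert a list of (m/z, intensity) peaks into a fixed-length vector."""
    n_bins = int(max_mz / bin_width)
    vec = [0.0] * n_bins
    for mz, intensity in peaks:
        idx = int(mz / bin_width)
        if 0 <= idx < n_bins:
            vec[idx] += intensity
    top = max(vec)
    # Scale so the base peak is 1.0; an empty spectrum stays all zeros.
    return [v / top for v in vec] if top > 0 else vec


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def predict_complexity(query, library, k=1):
    """Average the scores of the k library spectra most similar to the query."""
    if not library:
        raise ValueError("library must contain at least one labeled spectrum")
    ranked = sorted(library, key=lambda item: cosine(query, item[0]), reverse=True)
    nearest = ranked[:k]
    return sum(score for _, score in nearest) / len(nearest)


# Hypothetical toy library of (binned spectrum, complexity score) pairs.
# Both the peaks and the scores are made up for illustration.
library = [
    (bin_spectrum([(15, 100), (16, 80)]), 1.0),
    (bin_spectrum([(77, 100), (78, 60), (51, 30)]), 5.0),
]
query = bin_spectrum([(77, 90), (78, 50)])
predicted = predict_complexity(query, library, k=1)
```

In practice the lookup step would be replaced by a trained regression model, and the labeled library would come from a curated dataset with complexity scores computed by one of the proposed measures.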

Publications

  1. Gebhard, Timothy D., Aaron Bell, Jian Gong, Jaden J.A. Hastings, George M. Fricke, Nathalie Cabrol, Scott Sandford, Michael Phillips, Kimberley Warren-Rhodes, and Atılım Güneş Baydin. 2022. “Inferring Molecular Complexity from Mass Spectrometry Data Using Machine Learning.” In Machine Learning and the Physical Sciences Workshop, NeurIPS 2022.

    Molecular complexity has been proposed as a potential agnostic biosignature – in other words: a way to search for signs of life beyond Earth without relying on “life as we know it.” More than one way to compute molecular complexity has been proposed, so comparing their performance in evaluating experimental data collected in situ, such as on board a probe or rover exploring another planet, is imperative. Here, we report the results of an attempt to deploy multiple machine learning (ML) techniques to predict molecular complexity scores directly from mass spectrometry data. Our initial results are encouraging and may provide fruitful guidance toward determining which complexity measures are best suited for use with experimental data. Beyond the search for signs of life, this approach is likewise valuable for studying the chemical composition of samples to assist decisions made by the rover or probe, and may thus contribute toward supporting the need for greater autonomy.

    @inproceedings{gebhard-2022-molecular,
      title = {Inferring molecular complexity from mass spectrometry data using machine learning},
      author = {Gebhard, Timothy D. and Bell, Aaron and Gong, Jian and Hastings, Jaden J.A. and Fricke, George M. and Cabrol, Nathalie and Sandford, Scott and Phillips, Michael and Warren-Rhodes, Kimberley and Baydin, {Atılım Güneş}},
      booktitle = {Machine Learning and the Physical Sciences Workshop, NeurIPS 2022},
      year = {2022}
    }
    
  2. Gong, Jian, Aaron C. Bell, Timothy Gebhard, Jaden J.A. Hastings, Atılım Güneş Baydin, Kimberley Warren-Rhodes, Michael Phillips, Matthew Fricke, Nathalie A. Cabrol, Scott A. Sandford, and Massimo Mascaro. 2022. “Molecular Complexity to Biosignatures: A Machine Learning Pipeline That Connects Mass Spectrometry to Molecular Synthesis and Reaction Networks.” In American Geophysical Union (AGU) Fall Meeting, December 12–16, 2022. https://agu.confex.com/agu/fm22/meetingapp.cgi/Paper/1186669.

    The search for life beyond Earth is complicated by the lack of a consensus on what life is – especially when considering potential forms of life not resembling anything known on Earth. Agnostic means of assessing samples for evidence of life are needed to address this challenge. Information encoded within the atoms and bonds of a molecule can be used to generate agnostic metrics of complexity. The distributions of complexity metrics for chemical mixtures involving biological processes have been hypothesized to be different from those produced by abiotic or prebiotic chemical reactions (Marshall et al. 2021). Complexity metrics rooted in Shannon entropy (Bertz 1981; Böttcher 2016) and Assembly Theory (Marshall et al. 2017, 2021) rely on knowledge of the precise structures of molecules and on time-consuming analysis by human experts, decoupled from real-time instrumental measurements onboard robotic missions. In addition, leveraging these metrics requires intensive – often intractable – computations that are infeasible for real-time, on-probe investigations. We propose lightweight, flexible neural network models, trainable from publicly available datasets, that can be employed to predict molecular structures and their complexity metrics from mass spectra. We show that, with careful selection of datasets, the ML-based approach can learn characteristics of both experimental data and digital representations of molecules. This enables rapid, accurate prediction of molecular complexity from mass spectra. Such data pipelines may open new doors for critical robotic missions where autonomous decision-making is required, empowering rapid biosignature screening tasks and in situ fingerprinting of prebiotic molecular reaction networks.

    @inproceedings{gong-2022-molecular,
      title = {Molecular Complexity to Biosignatures: A Machine Learning Pipeline that Connects Mass Spectrometry to Molecular Synthesis and Reaction Networks},
      author = {Gong, Jian and Bell, Aaron C. and Gebhard, Timothy and Hastings, Jaden J.A. and Baydin, Atılım Güneş and Warren-Rhodes, Kimberley and Phillips, Michael and Fricke, Matthew and Cabrol, Nathalie A. and Sandford, Scott A. and Mascaro, Massimo},
      booktitle = {American Geophysical Union (AGU) Fall Meeting, December 12--16, 2022},
      year = {2022},
      url = {https://agu.confex.com/agu/fm22/meetingapp.cgi/Paper/1186669}
    }
    
  3. Hastings, Jaden J.A., Aaron C. Bell, Timothy Gebhard, Jian Gong, Atılım Güneş Baydin, Matthew Fricke, Massimo Mascaro, Michael Phillips, Kimberley Warren-Rhodes, and Nathalie A. Cabrol. 2022. “Modeling Molecular Complexity: Building a Novel Multidisciplinary Machine Learning Framework to Understand Molecular Synthesis and Signatures.” In American Geophysical Union (AGU) Fall Meeting, December 12–16, 2022. https://agu.confex.com/agu/fm22/meetingapp.cgi/Paper/1200601.

    Analyzing and comparing the structure of every known molecule, let alone molecules not yet encountered, and predicting all possible synthesis pathways for building ever more complex molecules at the atomic scale is a bottleneck spanning multiple disciplines. These disciplines range across the fundamental and applied sciences – from organic synthesis of novel pharmaceuticals to detecting biosignatures on distant planets. Fundamental to this effort is the identification and standardization of key features of complexity and the generation of datasets optimized for machine learning methods. Connecting molecules and their complexity measures within vast chemical synthesis and reaction networks is similarly promising. The Molecular Complexity Consortium (MCC) – a working group of subject matter experts across academic, government, and commercial sectors – advances both applied and theoretical research in molecular complexity. We argue for key shared objectives for unlocking the vast potential of ML-driven modeling of molecular complexity: the requisite standardization of features, generation of well-curated training datasets, and optimization of computation by ML method selection. Here we offer an overview of the field of molecular complexity, from methods of mathematical modeling to forming a notion of molecular signatures, and pose a call to action as we seek out new avenues for collaboration in this exciting emergent field.

    @inproceedings{hastings-2022-molecular,
      title = {Modeling Molecular Complexity: Building a Novel Multidisciplinary Machine Learning Framework to Understand Molecular Synthesis and Signatures},
      author = {Hastings, Jaden J.A. and Bell, Aaron C. and Gebhard, Timothy and Gong, Jian and Baydin, Atılım Güneş and Fricke, Matthew and Mascaro, Massimo and Phillips, Michael and Warren-Rhodes, Kimberley and Cabrol, Nathalie A.},
      booktitle = {American Geophysical Union (AGU) Fall Meeting, December 12--16, 2022},
      year = {2022},
      url = {https://agu.confex.com/agu/fm22/meetingapp.cgi/Paper/1200601}
    }
    

Acknowledgments

This work was enabled by and carried out during an eight-week research sprint as part of the Frontier Development Lab (FDL), a public-private partnership between NASA, the U.S. Department of Energy, the SETI Institute, Trillium Technologies, and leaders in commercial AI, space exploration, and Earth sciences, formed with the purpose of advancing the application of machine learning, data science, and high-performance computing to problems of material concern to humankind. We thank Google Cloud and the University of New Mexico Center for Advanced Research Computing for providing the compute resources critical to completing this work. GMF was funded by NASA Astrobiology NfoLD grant #80NSSC18K1140. We also thank the Cronin Group at the University of Glasgow for their collaboration, and for providing us with the code for computing molecular assembly (MA) values.