Preprint · bioRxiv (Cold Spring Harbor Laboratory) · Jul 2, 2024 · Green OA

Simulating 500 million years of evolution with a language model

Institut de Biologia Evolutiva · Southern California Institute of Architecture · +2 more institutions

Indexed in Crossref

Abstract

More than three billion years of evolution have produced an image of biology encoded into the space of natural proteins. Here we show that language models trained on tokens generated by evolution can act as evolutionary simulators to generate functional proteins that are far away from known proteins. We present ESM3, a frontier multimodal generative language model that reasons over the sequence, structure, and function of proteins. ESM3 can follow complex prompts combining its modalities and is highly responsive to biological alignment. We have prompted ESM3 to generate fluorescent proteins with a chain of thought. Among the generations that we synthesized, we found a bright fluorescent protein at far…

Citation impact

Total citations: 194
References: 86

Authors

25 authors

Topics & keywords

Keywords
  • Generative model
  • Generative grammar
  • Evolutionary biology
  • Modalities
  • Frontier
  • Function (biology)
  • Fluorescence
  • Computer science
UN Sustainable Development Goals
  • Quality Education