This repository contains a PyTorch implementation of, and pre-trained weights for, the Transformer protein language models in "Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences" (Rives et al., 2019) from Facebook AI Research:
@article{rives2019biological,
author={Rives, Alexander and Meier, Joshua and Sercu, Tom and Goyal, Siddharth and Lin, Zeming and Guo, Demi and Ott, Myle and Zitnick, C. Lawrence and Ma, Jerry and Fergus, Rob},
title={Biological Structure and Function Emerge from Scaling Unsupervised Learning to 250 Million Protein Sequences},
year={2019},
doi={10.1101/622803},
url={https://www.biorxiv.org/content/10.1101/622803v3},
journal={bioRxiv}
}
As a prerequisite, you must have PyTorch 1.5 or later installed to use this repository. A CUDA device is optional and will be auto-detected.
You can either work in the root of this repository, or use this one-liner for installation:
$ pip install git+https://github.com/facebookresearch/esm.git
Then, you can load and use a pretrained model as follows:
import torch
import esm
# Load the 34-layer model
model, alphabet = esm.pretrained.esm1_t34_670M_UR50S()
batch_converter = alphabet.get_batch_converter()
# Prepare data (two protein sequences)
data = [("protein1", "MYLYQKIKN"), ("protein2", "MNAKYD")]
batch_labels, batch_strs, batch_tokens = batch_converter(data)
# Extract per-residue embeddings (on CPU)
with torch.no_grad():
    results = model(batch_tokens, repr_layers=[34])
token_embeddings = results["representations"][34]
# Generate per-sequence embeddings via averaging
# NOTE: token 0 is always a beginning-of-sequence token, so the first residue is token 1.
sequence_embeddings = []
for i, (_, seq) in enumerate(data):
    sequence_embeddings.append(token_embeddings[i, 1:len(seq) + 1].mean(0))
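If a CUDA device is available, the same forward pass can be run on the GPU by moving the model and the token batch there first. This is a minimal, optional sketch using the standard PyTorch idiom; nothing in it is specific to ESM:

# Optional: run the forward pass on a GPU when one is available.
if torch.cuda.is_available():
    model = model.cuda()
    batch_tokens = batch_tokens.cuda()
    with torch.no_grad():
        results = model(batch_tokens, repr_layers=[34])
    token_embeddings = results["representations"][34].cpu()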
We also support PyTorch Hub, which removes the need to clone and/or install this repository yourself:
import torch
model, alphabet = torch.hub.load("facebookresearch/esm", "esm1_t34_670M_UR50S")
For your convenience, we have provided a script that efficiently extracts embeddings in bulk from a FASTA file:
# Extract embeddings from layers 0, 32, and 34 (final) for a FASTA file using the 34-layer model
$ python extract.py esm1_t34_670M_UR50S examples/some_proteins.fasta my_reprs/ \
--repr_layers 0 32 34 --include mean per_tok
# my_reprs/ now contains one ".pt" file per FASTA sequence; use torch.load() to load them
# extract.py has flags that determine what's included in the ".pt" file:
# --repr_layers (default: final layer only) selects which layers to include embeddings from.
# --include specifies what embeddings to save. You can use the following:
# * per_tok includes the full sequence, with an embedding per amino acid (seq_len x hidden_dim).
# * mean includes the embeddings averaged over the full sequence, per layer.
# * bos includes the embeddings from the beginning-of-sequence token.
# (NOTE: Don't use with the pre-trained models - we trained without bos-token supervision)
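For example, one of the saved ".pt" files can be inspected as follows. This is a minimal sketch: the file name is hypothetical, and the exact keys depend on which --include flags you passed, so the loop below simply prints whatever the file contains:

import torch

# Hypothetical output file; extract.py names files after the FASTA sequence headers.
obj = torch.load("my_reprs/protein1.pt")

# The file stores a dictionary; print each entry and, for tensors, their shapes.
for key, value in obj.items():
    if torch.is_tensor(value):
        print(key, tuple(value.shape))
    elif isinstance(value, dict):
        print(key, {k: tuple(v.shape) if torch.is_tensor(v) else v for k, v in value.items()})
    else:
        print(key, value)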
To help you get started, we provide a Jupyter notebook tutorial demonstrating how to train a variant predictor using embeddings from ESM. You can adopt a similar protocol to train a model for any downstream task, even with limited data.
First, obtain the embeddings for examples/P62593.fasta, either by downloading the precomputed embeddings as instructed in the notebook or by running the following:
# Obtain the embeddings
$ python extract.py esm1_t34_670M_UR50S examples/P62593.fasta examples/P62593_reprs/ \
--repr_layers 34 --include mean
Then, follow the remaining instructions in the tutorial. You can also run the tutorial in a Colab notebook.
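As a rough illustration of the kind of pipeline the notebook walks through, the sketch below stacks the extracted mean embeddings into a feature matrix for a downstream model. The directory layout follows the command above, but the key names ("mean_representations" and the per-layer index) are assumptions about extract.py's output format; check the contents of your own ".pt" files (e.g., with the inspection snippet earlier) and adjust accordingly:

import pathlib
import torch

# Assumed layout: one ".pt" file per sequence, with mean embeddings stored per layer
# under a "mean_representations" dictionary. Verify this against your own files.
EMB_DIR = pathlib.Path("examples/P62593_reprs")
LAYER = 34

labels, features = [], []
for path in sorted(EMB_DIR.glob("*.pt")):
    obj = torch.load(path)
    labels.append(path.stem)
    features.append(obj["mean_representations"][LAYER])

X = torch.stack(features)  # shape: (num_sequences, embedding_dim)
# X can now be paired with per-sequence targets (e.g., mutational effect scores)
# and fed to any downstream regressor, as demonstrated in the notebook.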
The following table lists the pretrained models available for use. See also Table 1 in the paper.
| Shorthand | Full Name | #layers | #params | Dataset | Embedding Dim | Perplexity/ECE | Model URL |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ESM1-main | esm1_t34_670M_UR50S | 34 | 670M | UR50/S | 1280 | 8.54 | https://dl.fbaipublicfiles.com/fair-esm/models/esm1_t34_670M_UR50S.pt |
| | esm1_t34_670M_UR50D | 34 | 670M | UR50/D | 1280 | 8.46 | https://dl.fbaipublicfiles.com/fair-esm/models/esm1_t34_670M_UR50D.pt |
| | esm1_t34_670M_UR100 | 34 | 670M | UR100 | 1280 | 10.32 | https://dl.fbaipublicfiles.com/fair-esm/models/esm1_t34_670M_UR100.pt |
| | esm1_t12_85M_UR50S | 12 | 85M | UR50/S | 768 | 10.45 | https://dl.fbaipublicfiles.com/fair-esm/models/esm1_t12_85M_UR50S.pt |
| | esm1_t6_43M_UR50S | 6 | 43M | UR50/S | 768 | 11.79 | https://dl.fbaipublicfiles.com/fair-esm/models/esm1_t6_43M_UR50S.pt |
The following table compares ESM to related pre-training methods and corresponds to Table 8 in the paper. The last three columns report the main benchmark results:
- RH: Remote Homology at the fold level, using Hit-10 metric on SCOP.
- SSP: Secondary structure Q8 accuracy on CB513.
- Contact: Top-L long range contact precision on RaptorX test set from Wang et al. (2017).
| Model | Pre-training | Params | RH | SSP | Contact |
| --- | --- | --- | --- | --- | --- |
| UniRep | | 18M | .527 | 58.4 | 21.9 |
| SeqVec | | 93M | .545 | 62.1 | 29.0 |
| TAPE | | 38M | .581 | 58.0 | 23.2 |
| LSTM biLM (S) | UR50/S | 28M | .558 | 60.4 | 24.1 |
| LSTM biLM (L) | UR50/S | 113M | .574 | 62.4 | 27.8 |
| Transformer-6 | UR50/S | 43M | .653 | 62.0 | 30.2 |
| Transformer-12 | UR50/S | 85M | .639 | 65.4 | 37.7 |
| Transformer-34 | UR100 | 670M | .599 | 64.3 | 32.7 |
| Transformer-34 | UR50/S | 670M | .639 | 69.2 | 50.2 |
We evaluated our best-performing model on the TAPE benchmark (Rao et al., 2019), finding that our neural embeddings perform similarly to or better than alignment-based methods.
| Model | SS3 | SS8 | Remote homology | Fluorescence | Stability | Contact |
| --- | --- | --- | --- | --- | --- | --- |
| ESM (best neural) | 0.82 | 0.67 | 0.33 | 0.68 | 0.71 | (0.61)* |
| TAPE (best neural) | 0.75 | 0.59 | 0.26 | 0.68 | 0.73 | 0.4 |
| TAPE (alignment) | 0.8 | 0.63 | 0.09 | N/A | N/A | 0.64 |
* Not comparable: ESM (best neural) uses a linear projection on the features (the contact head available in the PyTorch version of TAPE), whereas the results in the TAPE paper use a ResNet head. See the previous table for a rigorous comparison of ESM and TAPE in a fair benchmarking setup.
If you find the model useful in your research, we ask that you cite the following paper:
@article{rives2019biological,
author={Rives, Alexander and Meier, Joshua and Sercu, Tom and Goyal, Siddharth and Lin, Zeming and Guo, Demi and Ott, Myle and Zitnick, C. Lawrence and Ma, Jerry and Fergus, Rob},
title={Biological Structure and Function Emerge from Scaling Unsupervised Learning to 250 Million Protein Sequences},
year={2019},
doi={10.1101/622803},
url={https://www.biorxiv.org/content/10.1101/622803v3},
journal={bioRxiv}
}
Additionally, much of this code hails from the excellent fairseq sequence modeling framework; we have released this standalone model to facilitate more lightweight and flexible usage. We encourage those who wish to pretrain protein language models from scratch to use fairseq.
This source code is licensed under the MIT license found in the LICENSE file in the root directory of this source tree.