Hi,
I am a bioinformatician and I am slowly getting into the field of LLMs. I am using them to embed amino acid sequences and cluster the embeddings with UMAP or similar; in my examples one token is always one amino acid. Now I was wondering whether I could analyse amino-acid-to-amino-acid relationships using token classification. Attached is an example from a paper (https://www.sciencedirect.com/science/article/pii/S2666389922001052#da0010) which analysed the attention for one head in one layer. However, I was thinking it would be more explainable to use the attribution between two amino acids (two tokens) across all layers to explain their relative importance.
The benefit would be that one could, for example, relate such attributions to binding affinity (how strongly one sequence or protein binds to another molecule, for instance a toxin), which could be very important for pharmaceutical research.
I have tried to start with some code, attached below. However, I am not sure about the feasibility of this, or whether I have misunderstood something. Furthermore, I am struggling to use the word_attributions to build a sequence-vs-sequence heatmap, since the dictionary keys must be unique while my tokens repeat.
I wanted to ask for help here to get a better understanding of the TokenClassificationExplainer given my issue.
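To aggregate attention across all layers rather than inspecting one head in one layer, one common option is attention rollout (Abnar & Zuidema, 2020): average the heads within each layer, mix in an identity term for the residual connection, renormalize the rows, and multiply the per-layer matrices together. Below is a minimal sketch using randomly generated dummy tensors standing in for the `outputs.attentions` tuple that a Hugging Face model returns when called with `output_attentions=True` — the shapes are realistic, the values are not:

```python
import numpy as np

def attention_rollout(attentions):
    """attentions: list of (num_heads, seq_len, seq_len) arrays, one per layer.
    Returns a (seq_len, seq_len) matrix of aggregated token-to-token weights."""
    seq_len = attentions[0].shape[-1]
    rollout = np.eye(seq_len)
    for layer_att in attentions:
        avg = layer_att.mean(axis=0)                 # average over heads
        avg = 0.5 * avg + 0.5 * np.eye(seq_len)      # account for residual connection
        avg = avg / avg.sum(axis=-1, keepdims=True)  # renormalize rows
        rollout = avg @ rollout                      # compose with earlier layers
    return rollout

# Dummy attentions: 4 layers, 8 heads, 18 tokens (16 residues + [CLS]/[SEP]).
rng = np.random.default_rng(0)
dummy = [rng.random((8, 18, 18)) for _ in range(4)]
dummy = [a / a.sum(axis=-1, keepdims=True) for a in dummy]  # rows sum to 1
rolled = attention_rollout(dummy)  # (18, 18), row-stochastic
```

The resulting matrix can be plotted directly as a residue-vs-residue heatmap. This is attention-based, not gradient-based attribution, but it gives the "all layers at once" view in a few lines.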
```python
import re

import numpy as np
import pandas as pd
import torch
import matplotlib.pyplot as plt
import seaborn as sns
from transformers import AutoModelForTokenClassification, AutoTokenizer
from transformers_interpret import TokenClassificationExplainer

# Load the model and tokenizer
model_name = "Rostlab/prot_bert"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Tokenize and encode the sequence: ProtBERT expects space-separated residues,
# with rare/ambiguous amino acids mapped to X
peptide_sequence = "CAKGGTRYYYYGMDVW"
seq = [peptide_sequence]
assert isinstance(seq, list), "Input must be a list of strings"
sequences = [" ".join(re.sub(r"[UZOB*_]", "X", sequence)) for sequence in seq]

# inputs = tokenizer(sequences, return_tensors="pt", add_special_tokens=True)
# input_ids = inputs.input_ids  # numerical ids for the sequence
# assert input_ids.shape[1] == 21
inputs = tokenizer(sequences, return_tensors="pt")
output = model(**inputs)

# Per-token attributions for the token-classification head
ner_explainer = TokenClassificationExplainer(
    model,
    tokenizer,
)
word_attributions = ner_explainer(sequences[0])
```
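On the repeated-token problem: rather than keying anything by the token string (which collapses the four Ys into a single dictionary entry), it may be easier to work purely by sequence position and suffix each axis label with its position. A minimal sketch with invented attribution scores — the per-position values and their layout here are an assumption for illustration, not the actual transformers-interpret output format:

```python
import numpy as np

# The peptide from the snippet above; positions disambiguate repeated residues.
peptide = list("CAKGGTRYYYYGMDVW")
L = len(peptide)

# Invented per-position scores: row i holds, for target position i, one
# (token, score) pair per source position j. Tokens repeat; positions never do.
rng = np.random.default_rng(0)
attributions = [[(tok, float(s)) for tok, s in zip(peptide, rng.normal(size=L))]
                for _ in range(L)]

# Build an L x L matrix indexed by position, not by token string.
matrix = np.array([[score for _, score in row] for row in attributions])

# Unique axis labels like "C1", "A2", ..., "Y8", "Y9": no key collisions.
labels = [f"{tok}{i + 1}" for i, tok in enumerate(peptide)]
# sns.heatmap(matrix, xticklabels=labels, yticklabels=labels)  # sequence vs sequence
```

This sidesteps dictionary keys entirely; if the explainer only hands back a token-keyed structure, the same idea applies by re-tokenizing the input and walking the attributions in sequence order.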