Hi,
I am a bioinformatician and I am slowly getting into the field of LLMs. I am using them to embed amino acid sequences and cluster the embeddings with UMAP or similar; in my examples one token is always one amino acid. Now I was wondering whether I could analyse amino-acid-to-amino-acid relationships using token classification. Attached is an example from a paper (https://www.sciencedirect.com/science/article/pii/S2666389922001052#da0010) which analysed the attention for one head in one layer. However, I was thinking it would be more explainable to use the attribution between two amino acids (two tokens) across all layers to explain their relative importance.
The benefit would be that one could, for example, relate such attributions to binding affinity (how strongly one sequence or protein binds to another molecule, for instance a toxin), which could be very important for pharmaceutical research.
I have tried to start with some code, attached below. However, I am not sure about the feasibility of this, or whether I have misunderstood something. Furthermore, I am struggling to use the word_attributions to build a sequence-vs-sequence heatmap, since the dictionary keys must be unique while my tokens repeat.
I wanted to ask for help here to get a better understanding of the TokenClassificationExplainer given my issue.
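To aggregate attention across all layers rather than inspecting one head in one layer, one common option is attention rollout (Abnar & Zuidema, 2020): average the heads within each layer, mix in an identity term for the residual connection, renormalize the rows, and multiply the per-layer matrices together. Below is a minimal sketch using randomly generated dummy tensors standing in for the `outputs.attentions` tuple that a Hugging Face model returns when called with `output_attentions=True` — the shapes are realistic, the values are not:

```python
import numpy as np

def attention_rollout(attentions):
    """attentions: list of (num_heads, seq_len, seq_len) arrays, one per layer.
    Returns a (seq_len, seq_len) matrix of aggregated token-to-token weights."""
    seq_len = attentions[0].shape[-1]
    rollout = np.eye(seq_len)
    for layer_att in attentions:
        avg = layer_att.mean(axis=0)                 # average over heads
        avg = 0.5 * avg + 0.5 * np.eye(seq_len)      # account for residual connection
        avg = avg / avg.sum(axis=-1, keepdims=True)  # renormalize rows
        rollout = avg @ rollout                      # compose with earlier layers
    return rollout

# Dummy attentions: 4 layers, 8 heads, 18 tokens (16 residues + [CLS]/[SEP]).
rng = np.random.default_rng(0)
dummy = [rng.random((8, 18, 18)) for _ in range(4)]
dummy = [a / a.sum(axis=-1, keepdims=True) for a in dummy]  # rows sum to 1
rolled = attention_rollout(dummy)  # (18, 18), row-stochastic
```

The resulting matrix can be plotted directly as a residue-vs-residue heatmap. This is attention-based, not gradient-based attribution, but it gives the "all layers at once" view in a few lines.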
```python
import re

import numpy as np
import pandas as pd
import torch
import matplotlib.pyplot as plt
import seaborn as sns
from transformers import AutoModelForTokenClassification, AutoTokenizer
from transformers_interpret import TokenClassificationExplainer

# Load the model and tokenizer
model_name = "Rostlab/prot_bert"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Tokenize and encode the sequence: ProtBERT expects space-separated residues,
# with rare/ambiguous amino acids mapped to X
peptide_sequence = "CAKGGTRYYYYGMDVW"
seq = [peptide_sequence]
assert isinstance(seq, list), "Input must be a list of strings"
sequences = [" ".join(re.sub(r"[UZOB*_]", "X", sequence)) for sequence in seq]

# inputs = tokenizer(sequences, return_tensors="pt", add_special_tokens=True)
# input_ids = inputs.input_ids  # numerical ids for the sequence
# assert input_ids.shape[1] == 21
inputs = tokenizer(sequences, return_tensors="pt")
output = model(**inputs)

# Per-token attributions for the token-classification head
ner_explainer = TokenClassificationExplainer(
    model,
    tokenizer,
)
word_attributions = ner_explainer(sequences[0])
```
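On the repeated-token problem: rather than keying anything by the token string (which collapses the four Ys into a single dictionary entry), it may be easier to work purely by sequence position and suffix each axis label with its position. A minimal sketch with invented attribution scores — the per-position values and their layout here are an assumption for illustration, not the actual transformers-interpret output format:

```python
import numpy as np

# The peptide from the snippet above; positions disambiguate repeated residues.
peptide = list("CAKGGTRYYYYGMDVW")
L = len(peptide)

# Invented per-position scores: row i holds, for target position i, one
# (token, score) pair per source position j. Tokens repeat; positions never do.
rng = np.random.default_rng(0)
attributions = [[(tok, float(s)) for tok, s in zip(peptide, rng.normal(size=L))]
                for _ in range(L)]

# Build an L x L matrix indexed by position, not by token string.
matrix = np.array([[score for _, score in row] for row in attributions])

# Unique axis labels like "C1", "A2", ..., "Y8", "Y9": no key collisions.
labels = [f"{tok}{i + 1}" for i, tok in enumerate(peptide)]
# sns.heatmap(matrix, xticklabels=labels, yticklabels=labels)  # sequence vs sequence
```

This sidesteps dictionary keys entirely; if the explainer only hands back a token-keyed structure, the same idea applies by re-tokenizing the input and walking the attributions in sequence order.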