Possible Issue with RNA Chains in Paired MSA Construction #238

Closed
wtni-gidle opened this issue Dec 28, 2024 · 7 comments
Labels: bug (Something isn't working), question (Further information is requested)

Comments

wtni-gidle commented Dec 28, 2024

While investigating how AlphaFold3 processes MSA pairings, I encountered a problem related to the handling of RNA chains. I used a complex consisting of one RNA chain, two different protein chains, and a magnesium ion (MG) as an example.

In the following code snippet, AF3 assigns the "protein" molecule type to all chain types, including RNA, when constructing the paired_msa. This leads to RNA paired MSAs (which should only contain the query sequence) being converted into amino acid IDs.

paired_msa = msa_module.Msa.from_a3m(
    query_sequence=sequence,
    chain_poly_type=mmcif_names.PROTEIN_CHAIN,
    a3m=paired_a3m,
    deduplicate=False,
)

To confirm this behavior, I used print to inspect the paired MSA for the RNA chain. The output demonstrated that the RNA sequence had been incorrectly converted into amino acid IDs (by _PROTEIN_TO_ID).

[ 7  7  4  7  0  4  0  4  4  4  7  4  0  0  4  4  4  4  4  7  7  0  4  4
   7  0  4  0  4  4  4  4  4  7  4  4  0  7  7  0  4  0  7  0  7  7  4  4
   7  4  4]

And unpaired_msa is constructed correctly.

[23 23 24 23 22 24 22 25 25 25 23 25 22 22 25 25 24 24 25 23 23 22 24 24
  23 22 25 22 24 25 25 24 24 23 25 24 22 23 23 22 24 22 23 22 23 23 25 25
  23 24 24]
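
The difference between the two printouts can be reproduced with a minimal sketch. The lookup tables below are assumptions reconstructed from the arrays printed in this issue (G=7, C=4, A=0, U=4 under the protein table; A=22, G=23, C=24, U=25 under the RNA table); they are not copied from the AlphaFold 3 source, and the `encode` helper and the chain type strings are hypothetical.

```python
import numpy as np

# Illustrative tables inferred from the printed arrays in this issue.
# They are assumptions, not the actual AlphaFold 3 lookup tables.
PROTEIN_ORDER = "ARNDCQEGHILKMFPSTWYV"
PROTEIN_TO_ID = {aa: i for i, aa in enumerate(PROTEIN_ORDER)}
PROTEIN_TO_ID["U"] = PROTEIN_TO_ID["C"]  # observed above: U is encoded as 4
RNA_TO_ID = {"A": 22, "G": 23, "C": 24, "U": 25}

# Hypothetical chain type keys, standing in for the real chain type constants.
TABLES = {"protein": PROTEIN_TO_ID, "rna": RNA_TO_ID}


def encode(sequence, chain_poly_type):
    """Maps each residue letter to an integer ID using the chain's table."""
    table = TABLES[chain_poly_type]
    return np.array([table[c] for c in sequence], dtype=np.int32)


rna_prefix = "GGCGACAU"  # first 8 residues of chain A

# Encoding the RNA chain as if it were a protein reproduces the wrong IDs.
as_protein = encode(rna_prefix, "protein")
# Encoding it with its own chain type gives the expected nucleotide IDs.
as_rna = encode(rna_prefix, "rna")

print(as_protein)
print(as_rna)
```

Under these assumed tables, the first call yields the same leading values as the paired MSA printout and the second matches the unpaired MSA printout.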

Moreover, since the paired MSA is always placed above the unpaired MSA, this error propagates and remains in the final MSA. I used print to inspect the first row of the final MSA:

    # Count MSA size before padding.
    num_alignments = np_example['msa'].shape[0]
    print(np_example["msa"][0])

The result is:

[ 7  7  4  7  0  4  0  4  4  4  7  4  0  0  4  4  4  4  4  7  7  0  4  4
  7  0  4  0  4  4  4  4  4  7  4  4  0  7  7  0  4  0  7  0  7  7  4  4
  7  4  4 12 15  1  9  9 16  0 14  8  9  7  9  6 11 10 15  0  9 15 10  6
  6 10 15  4  7 10 14  3  1 18  0 10 14 14  3  7  8 14 19  6 14  8 10  6
  1 10 18 14 16  0  5 15 11  1 15 10 17  3 13  0 15 14  7 18 16 13  8  7
 10  8  1  0  5  3 18  1  1  6 10  3 16 10  5 15 10 10 16 16 15  5 15 15
  6 10  5  0  0  0  0 10 10 11  4  5  5  3  3  3  1 10 10  5  9  9 10  2
 10 10  8 11 19 12  2  9 16 10 16 11  1  5  5  6 13 10 10 10  2  7 17 10
  5 10  5  4  7  8  0  6  1  0  4  9 10 10  3  0 10 10 16 10  2 14  6  8
 10  0  7  1  1  4  1 10 19  0 10 10  2  2  2  5  7  6  1  0  6 11  6  0
  5 17 10  9 15  8  3 14 10  5  0  7  2 17 10  4 10 15  1  0  5  5 10  2
  7  3 10  3 11  0  1  8  0 18  5  8 18 10  6 10 11  3  8  2  6 15 14 21]

However, this error does not seem to significantly affect the prediction results. Could this be because AF3 no longer emphasizes the first row of the MSA (making it order-independent) and relies less on the MSA overall?

Please let me know if I’m wrong. Thanks!

@wtni-gidle
Author

I also printed a brief summary of these chains:

Chain A:
  Sequence: GGCGACAUUUGUAAUUCCUGGACCGAUACUUCCGUCAGGACAGAGGUUGCC
  Unpaired MSA: ['GGCGACAUUUGUAAUUCCUGGACCGAUACUUCCGUCAGGACAGAGGUUGCC']
  Paired MSA: ['GGCGACAUUUGUAAUUCCUGGACCGAUACUUCCGUCAGGACAGAGGUUGCC']
Chain B:
  Sequence: MSRIITAPHIGIEKLSAISLEELSCGLPDRYALPPDGHPVEPHLERLYPTAQSKRSLWDFASPGYTFHGLHRAQDYRRELDTLQSLLTTSQSSELQAAAALLKCQQDDDRLLQIILNLLHKV
  Unpaired MSA: ['MSRIITAPHIGIEKLSAISLEELSCGLPDRYALPPDGHPVEPHLERLYPTAQSKRSLWDFASPGYTFHGLHRAQDYRRELDTLQSLLTTSQSSELQAAAALLKCQQDDDRLLQIILNLLHKV', '-------------------------------ALPPDGHPVEPHLERLYPTAQSKRSLWDFASPGYTFHGLHRAQDYRRELDTLQSLLTTSQSSELQAAAALLKCQQDDDRLLQIILNLLHKV']
  Paired MSA: ['MSRIITAPHIGIEKLSAISLEELSCGLPDRYALPPDGHPVEPHLERLYPTAQSKRSLWDFASPGYTFHGLHRAQDYRRELDTLQSLLTTSQSSELQAAAALLKCQQDDDRLLQIILNLLHKV', 'MSRIITAPHIGIEKLSAISLEELSCGLPDRYALPPDGHPVEPHLERLYPTAQSKRSLWDFASPGYTFHGLHRAQDYRRELDTLQSLLTTSQSSELQAAAALLKCQQDDDRLLQIILNLLHKV']
Chain C:
  Sequence: MNITLTKRQQEFLLLNGWLQLQCGHAERACILLDALLTLNPEHLAGRRCRLVALLNNNQGERAEKEAQWLISHDPLQAGNWLCLSRAQQLNGDLDKARHAYQHYLELKDHNESP
  Unpaired MSA: ['MNITLTKRQQEFLLLNGWLQLQCGHAERACILLDALLTLNPEHLAGRRCRLVALLNNNQGERAEKEAQWLISHDPLQAGNWLCLSRAQQLNGDLDKARHAYQHYLELKDHNESP', '--MTLTERQQAFLLLNGWLQLQYGQAERACILLDALLHLSPDHLAARRCRLVALLKSGQGVRAQQEATWLVLNDDPQPGSWLCLSRAHQLSGELELARHAYQRYLELEEQYES-']
  Paired MSA: ['MNITLTKRQQEFLLLNGWLQLQCGHAERACILLDALLTLNPEHLAGRRCRLVALLNNNQGERAEKEAQWLISHDPLQAGNWLCLSRAQQLNGDLDKARHAYQHYLELKDHNESP', 'MNITLTKRQQEFLLLNGWLQLQCGHAERACILLDALLTLNPEHLAGRRCRLVALLNNNQGERAEKEAQWLISHDPLQAGNWLCLSRAQQLNGDLDKARHAYQHYLELKDHNESP']
Chain D:
  Sequence: X
  Unpaired MSA: ['-']
  Paired MSA: ['-']

@wtni-gidle
Author

Another related issue: If the prediction is for a monomeric protein, the MSA features will contain the same query sequence in the first two rows. This happens because the unpaired MSA is deduplicated only when need_msa_pairing is True.

if need_msa_pairing:
  np_chains_list = list(map(dict, np_chains_list))
  np_chains_list = msa_pairing.create_paired_features(
      np_chains_list,
      max_paired_sequences=max_paired_sequences,
      nonempty_chain_ids=nonempty_chain_ids,
      max_hits_per_species=max_paired_sequence_per_species,
  )
  np_chains_list = msa_pairing.deduplicate_unpaired_sequences(
      np_chains_list
  )
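
To illustrate the effect, here is a minimal sketch of stacking a paired MSA above an unpaired MSA with and without deduplication. The `stack_msa` helper and the token IDs are hypothetical and only mimic the behavior described above; this is not the AlphaFold 3 implementation.

```python
import numpy as np


def stack_msa(paired, unpaired, deduplicate):
    """Stacks paired rows above unpaired rows (hypothetical helper).

    When `deduplicate` is True, unpaired rows that already appear in the
    paired block are dropped before stacking.
    """
    if deduplicate:
        seen = {row.tobytes() for row in paired}
        keep = [row for row in unpaired if row.tobytes() not in seen]
        unpaired = np.array(keep, dtype=paired.dtype).reshape(-1, paired.shape[1])
    return np.concatenate([paired, unpaired], axis=0)


# Arbitrary token IDs standing in for an encoded query and one MSA hit.
query = np.array([12, 15, 1, 9], dtype=np.int32)
hit = np.array([12, 15, 1, 10], dtype=np.int32)

paired = query[None, :]            # paired block: query only
unpaired = np.stack([query, hit])  # unpaired block: query plus one hit

without_dedup = stack_msa(paired, unpaired, deduplicate=False)
with_dedup = stack_msa(paired, unpaired, deduplicate=True)

print(without_dedup.shape)  # query appears in rows 0 and 1
print(with_dedup.shape)     # query appears once
```

Without deduplication the query occupies the first two rows, which corresponds to the monomer case reported here.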

@Augustin-Zidek added the question label on Jan 7, 2025
@Augustin-Zidek
Collaborator

Thank you for the detailed report! We are investigating on our side now and will report back once we know more.

@joshabramson
Collaborator

Hi Wantao - I just want to check something before we go too far - are you running with a custom MSA? Can you share your full input JSON?

@joshabramson
Collaborator

> I used print to inspect the paired MSA for the RNA chain. [...] And unpaired_msa is constructed correctly.

Can you share the full Msa object, rather than just parts of it?

@joshabramson
Collaborator

Sorry, you can ignore those two requests - I think I agree this is a bug. We will confirm and make a fix.

This has not had much effect because the model's sensitivity to MSA inputs for RNA is low. Note also that MSA rows are shuffled, so the first row has no particular importance in the input.

@joshabramson added the bug label on Jan 8, 2025
@Augustin-Zidek
Collaborator

Thanks again for reporting! This has been fixed in ea04034.

> Another related issue: If the prediction is for a monomeric protein, the MSA features will contain the same query sequence in the first two rows. This happens because the unpaired MSA is deduplicated only when need_msa_pairing is True.

Yes, this is because the sequence is included once for the unpaired MSA and once for the paired MSA. This is not an issue; the model will deal with it.
