Rethink isotype inference #116

Open
willdumm opened this issue May 30, 2023 · 0 comments

Currently, gctree optionally uses 'isotype parsimony', which is the number of isotype switching events required along a tree, given observed leaf isotypes, to help rank MP trees (alongside likelihood and mutability parsimony).

Isotypes are added to trees in the DAG by assigning each internal (unobserved) node the earliest isotype observed on any of the leaves below it. For a fixed topology, this is guaranteed to yield a labeling that minimizes the number of isotype switching events on the tree and uses only allowed isotype transitions. However, since isotype doesn't influence the tree topology, this method sometimes results in many isotype switching events that could be avoided by adding a few extra nodes. A scenario where this occurs is below. Although this behavior could be okay for ranking trees according to isotype parsimony, it's not great for understanding where isotype switching happens in the tree, because it seems likely that the real tree looks a bit different:

image
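
To make the current labeling rule concrete, here is a minimal sketch (not gctree's actual code). The three-isotype order, the dict-based tree, and the function names `label_isotypes` and `count_switches` are all assumptions made for illustration. It shows how a multifurcation with one IgM child and three IgG children is charged three switching events, even though a single extra node above the IgG children would need only one.

```python
# Illustrative (assumed) isotype order: switching may only move rightward.
ORDER = ["IgM", "IgG", "IgA"]
RANK = {iso: i for i, iso in enumerate(ORDER)}


def label_isotypes(children, leaf_isotype, root):
    """Give each internal node the earliest isotype among its descendant leaves."""
    labels = {}

    def visit(node):
        kids = children.get(node, [])
        if not kids:
            labels[node] = leaf_isotype[node]
        else:
            labels[node] = min((visit(k) for k in kids), key=RANK.__getitem__)
        return labels[node]

    visit(root)
    return labels


def count_switches(children, labels):
    """Count edges whose parent and child isotypes differ."""
    return sum(labels[p] != labels[c] for p, kids in children.items() for c in kids)


# Hypothetical tree: one multifurcating root with four observed leaves.
children = {"root": ["a", "b", "c", "d"]}
leaf_isotype = {"a": "IgM", "b": "IgG", "c": "IgG", "d": "IgG"}

labels = label_isotypes(children, leaf_isotype, "root")
n_switches = count_switches(children, labels)
```

Here `labels["root"]` is `"IgM"` and `n_switches` is 3, one per IgG child.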

In order to understand where class switching really happens in the tree, we propose doing something slightly different, where when provided with isotype data, we:

  • pre-process MP trees from dnapars by splitting pendant branches whose leaf nodes represent multiple observed isotypes
  • add inferred ancestral isotypes to internal nodes
  • partially resolve multifurcations when at least one child of the multifurcating node has the same sequence as its parent but a different isotype (this will be done independently on different subtrees using hDAG infrastructure), while also keeping the original multifurcating structure as an alternative
  • collapse with respect to both isotype and sequence, and fit branching process parameters on this notion of collapsed tree (so the trees we get when doing inference with isotype data are fundamentally different structures than those we get from gctree when not providing isotype data)
  • rank trees w.r.t. likelihood (and possibly mutability parsimony), but no longer use isotype parsimony in ranking.
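
The first step in the list could look roughly like the sketch below. The issue does not specify how a pendant branch is split, so this assumes one plausible reading: a leaf observed with several isotypes becomes an unobserved node with one child per isotype, each carrying that isotype's abundance. The function name, the `leaf|isotype` naming scheme, and the data layout are hypothetical.

```python
def split_multi_isotype_leaves(children, leaf_isotypes):
    """Split leaves observed with more than one isotype.

    children:      {node: [child, ...]}
    leaf_isotypes: {leaf: {isotype: abundance}}
    Returns (new_children, isotype_of, abundance_of).
    """
    new_children = {parent: list(kids) for parent, kids in children.items()}
    isotype_of, abundance_of = {}, {}
    for leaf, observed in leaf_isotypes.items():
        if len(observed) == 1:
            # Single isotype: the leaf is kept as-is.
            ((iso, count),) = observed.items()
            isotype_of[leaf], abundance_of[leaf] = iso, count
            continue
        # Assumed behavior: the leaf becomes an unobserved node whose
        # isotype is inferred later, with one child per observed isotype.
        new_children[leaf] = []
        abundance_of[leaf] = 0
        for iso, count in sorted(observed.items()):
            child = f"{leaf}|{iso}"
            new_children[leaf].append(child)
            isotype_of[child], abundance_of[child] = iso, count
    return new_children, isotype_of, abundance_of


# Hypothetical input: sequence s1 was observed as both IgM and IgG.
children = {"root": ["s1", "s2"]}
leaf_isotypes = {"s1": {"IgM": 3, "IgG": 2}, "s2": {"IgG": 1}}

new_children, isotype_of, abundance_of = split_multi_isotype_leaves(
    children, leaf_isotypes
)
```

After the call, `s1` has the children `s1|IgG` and `s1|IgM`, and `s2` is unchanged.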

From the example above, we would consider both the tree on the right and the following one, which benefits from a partial resolution of the multifurcation using isotypes. If isotypes weren't provided, we would consider only the tree on the left in the above image for ranking:

image

This tree seems more plausible, because it places the high-abundance node above more children in the [isotype, sequence]-collapsed tree, and because it has only one isotype switching event.
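
As a rough illustration of what an [isotype, sequence]-collapsed tree means, the sketch below merges a child into its parent only when both the sequence and the isotype match. It is a simplified stand-in under assumed data structures, not the hDAG-based collapsing gctree would actually use, and the `collapse` function and example values are hypothetical.

```python
def collapse(children, sequence, isotype, abundance, root):
    """Merge each child into its parent when sequence AND isotype are identical.

    Returns (collapsed_children, collapsed_abundance), keyed by surviving nodes.
    """
    new_children, new_abundance = {}, {}

    def visit(node, rep):
        # `rep` is the surviving node that `node` has been merged into.
        new_abundance[rep] = new_abundance.get(rep, 0) + abundance.get(node, 0)
        for child in children.get(node, []):
            same = (sequence[child], isotype[child]) == (sequence[node], isotype[node])
            if same:
                visit(child, rep)
            else:
                new_children.setdefault(rep, []).append(child)
                visit(child, child)

    visit(root, root)
    return new_children, new_abundance


# Hypothetical tree: a and b share the root's sequence, but b has switched isotype.
children = {"root": ["a", "b", "c"]}
sequence = {"root": "AAA", "a": "AAA", "b": "AAA", "c": "AAT"}
isotype = {"root": "IgM", "a": "IgM", "b": "IgG", "c": "IgG"}
abundance = {"root": 0, "a": 5, "b": 2, "c": 1}

collapsed_children, collapsed_abundance = collapse(
    children, sequence, isotype, abundance, "root"
)
```

Node `a` is absorbed into the root, which ends up with abundance 5, while `b` stays a separate node because its isotype differs. Collapsing on sequence alone would have absorbed `b` as well.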

Here's a more detailed example (with a different starting MP tree) showing the steps in the list above. Here, edges without mutations are marked with a slash, and inferred ancestral isotypes are enclosed in parentheses.

image

Implementation Details:

At some point (#91), I spent a while making gctree work with ambiguous input sequences. This was quite difficult, because different MP trees have different resolved tip sequences, resulting in different collapsed abundances between trees. To make it possible, there's a somewhat complicated scheme that includes abundances as part of hDAG node labels and requires involved pre-processing of MP trees from dnapars.

To make these proposed isotyping changes easier, I'd like to revert those changes, making gctree only work with fully-resolved sequences again. Anyone who needs ambiguous-sequence support could still use one of the versions that provides it, but it doesn't seem to be a feature that core gctree users need, and removing it would make the code much cleaner and easier to modify in the future, including for the proposed changes in this issue.

Whether inference is provided with isotype data will affect at least the following:

  • How MP trees are pre-processed
  • hDAG node label data
  • the extraction of CollapsedTrees from the hDAG CollapsedForest object
  • rendering of CollapsedTrees (renderings should indicate isotype of nodes)