allpairs output not congruent with sphere clustering output? #13
Comments
Hi Claudius, well, this is not necessarily true. Spheres clustering works as follows:
So now imagine that we have the following network after all pairs search:
i.e. The point that you raised is interesting because
Precomputing this is not straightforward given the current implementation of Note, however, that even if the centroid choice is optimal the Thank you for contributing! Eduard
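To make the point about sphere clustering concrete, here is a minimal Python sketch of a greedy "sphere" procedure. It is an illustration only and not starcode's actual code: it assumes that sequences are visited from most to least abundant, that each unclaimed sequence becomes a centroid and claims its still-unclaimed tau-matches, and that a claimed sequence can neither become a centroid nor claim others. The function name `sphere_clusters`, the toy counts and the edge list are all hypothetical.

```python
def sphere_clusters(counts, edges):
    """Greedy sphere clustering sketch (assumed behaviour, not starcode's code).

    counts: dict mapping sequence -> read count
    edges:  iterable of (u, v) pairs that are within tau of each other
    Returns a dict mapping each centroid to the sorted list of sequences it claimed.
    """
    # Build the undirected neighbour sets from the all-pairs edge list.
    neighbors = {seq: set() for seq in counts}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    claimed = set()
    clusters = {}
    # Visit sequences from most to least abundant; ties are broken alphabetically.
    for seq in sorted(counts, key=lambda s: (-counts[s], s)):
        if seq in claimed:
            continue  # already absorbed by an earlier, more abundant centroid
        claimed.add(seq)
        members = sorted(neighbors[seq] - claimed)
        claimed.update(members)
        clusters[seq] = members
    return clusters


# Toy network: A matches B, B matches C, but A does not match C.
counts = {"A": 10, "B": 5, "C": 1}
edges = [("A", "B"), ("B", "C")]

clusters = sphere_clusters(counts, edges)

# Sequences that appear in no edge at all (true "no tau-match" singletons).
matched = {seq for edge in edges for seq in edge}
unmatched = sorted(set(counts) - matched)

# Clusters that contain only their centroid.
singleton_clusters = sorted(c for c, members in clusters.items() if not members)

print(clusters)
print(unmatched, singleton_clusters)
```

In this toy case A claims B first, so C ends up alone in its own cluster even though it has a tau-match (B) in the all-pairs output. Under these assumptions, the number of single-sequence clusters can exceed the number of sequences that have no match at all.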
Hi Eduard, thank you for the clarification! I know this is not the intended use case for cheers claudius
Yes, I can make a raw version that prints all the edges found for each node. In the A,B,C example it would be something like: Eduard
Hi Eduard, that's not quite what I mean. I am thinking of a "clustering" output with a canonical sequence, cluster size and list of child nodes (and grandchild nodes, great-grandchild nodes, and so on), but built from the raw clusters of the all-pairs search. For instance, if there were a path from every input sequence to every other input sequence, the output would be just one big cluster. So, in principle, the cluster radius could be much bigger than tau if there were intermediate nodes between the canonical sequence and a sequence that is more than tau edits away from it. claudius
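The output described here amounts to the connected components of the all-pairs graph. A minimal Python sketch with a union-find structure follows. The function name `connected_components` and the toy node and edge lists are hypothetical, and the sketch deliberately does not pick a canonical sequence, since that choice is not specified at this point in the thread.

```python
from collections import defaultdict


def connected_components(nodes, edges):
    """Group nodes that are linked by any chain of tau-matches (illustrative sketch)."""
    parent = {n: n for n in nodes}

    def find(x):
        # Follow parent links to the root, halving the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge the two components joined by each edge.
    for u, v in edges:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v

    # Collect the members of each component under its root.
    groups = defaultdict(list)
    for n in nodes:
        groups[find(n)].append(n)
    return sorted(sorted(g) for g in groups.values())


# Toy input: A-B and B-C are tau-matches, D has no match.
nodes = ["A", "B", "C", "D"]
edges = [("A", "B"), ("B", "C")]

components = connected_components(nodes, edges)
print(components)
```

Here A and C land in the same cluster through B although they are not direct tau-matches, which is the "radius bigger than tau" effect mentioned above, and D remains a singleton.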
I see, this feature should be easy to add. I am busy analyzing experimental data right now, but I will get back to this asap. Eduard
…s clustering centroid selection (not yet optimal). Changed Eduard's e-mail address in src files. Added article citation to README files.
Please pull the update in master branch. Eduard
Hi Eduard, thank you for adding the I am just wondering how the canonical sequence is selected for this output, i.e. random or sequence with highest coverage or even some kind of majority consensus? claudius
Sorry Claudius, I missed this one. Yes, the canonical sequence with Eduard
Thank you Eduard!
Hi Eduard,
I have run the latest starcode version from the allpairs branch on an input file of distinct sequences (same as referenced in "cluster size" bug report) with the command line:
I count 96,071 distinct sequences (nodes) in the output file. There are 134,492 distinct sequences in the input file. So I infer from that that 38,421 (134,492 - 96,071) sequences are singletons, i.e. have no tau-match with any other sequence in the input.
To check that, I looked into the output from a run with the newest (i.e. cluster size corrected) version of starcode from the master branch created with the following command:
I assume that with sphere clustering, any sequence that does not have a tau-match will be put in a cluster on its own. Counting the number of "clusters" with only one sequence I get 41,908. Shouldn't this rather be 38,421?
many thanks for your help,
claudius