Large Count File Dataset -- Memory Issues #44
Thanks for reporting the issue. Starcode should terminate gracefully when it runs out of memory, but maybe we made a mistake somewhere. Are you running Starcode in a Docker container or on a cluster? We have sometimes observed strange behavior in those contexts. If you are interested, I would like to try running your sample on our machine to understand better which statement fails.

As for the required memory, it depends on the parameters you use for the run. Allowing more errors means keeping more branches of the prefix tree in memory, so setting tau to a small value is not only faster, it also requires less memory. Depending on your goals, you could try lowering tau and see whether that works for you.

Also, if the sequences consist of a constant region and a barcode, you could try to isolate the barcodes and cluster those only. Reducing the sequence size can yield spectacular performance improvements because it significantly prunes the prefix tree.

Let me know if any of this helps.
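(Concretely, the two suggestions might look like the sketch below. File names, the thread count, and the barcode coordinates are placeholders, and the flags should be double-checked against `starcode --help`; the second step assumes a constant-region + barcode layout that may not apply to every dataset.)

```sh
# Cluster with the smallest edit distance the application allows;
# -d 1 keeps far fewer prefix-tree branches alive than -d 2 or more.
starcode -d 1 -t 8 -i counts.txt -o clustered.txt

# If each sequence were constant region + barcode, the barcode could be
# isolated first. Hypothetical layout: barcode in characters 1-20 of a
# tab-separated count file (sequence<TAB>count). Trimmed sequences that
# become identical should then be merged by the clustering itself, since
# they sit at distance 0.
awk 'BEGIN{FS=OFS="\t"} {print substr($1, 1, 20), $2}' counts.txt > barcodes.txt
starcode -d 1 -t 8 -i barcodes.txt -o barcode_clusters.txt
```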
Hello,

Thank you for your reply. I am not running this in a Docker container or on a cluster; we are doing it all in WSL2.

I would be interested to see whether running these samples on your machine results in a successful run.

Unfortunately, the sequences are already trimmed down as much as they can be and consist of just the variable 60bp region. For my application I also only need --distance of 1, so I don't think we can cut down memory use any further with that parameter.

Let me know if there are any other suggestions!

Thank you very much,
Andrew
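(A possibly relevant detail for this setup: by default WSL2 caps the Linux VM's memory well below the host total, often around half of installed RAM, so a 64GB machine may expose only ~32GB inside WSL2. The cap and the swap size can be raised on the Windows side in `%UserProfile%\.wslconfig`; a minimal sketch, with the sizes below as illustrative assumptions:)

```ini
# %UserProfile%\.wslconfig -- takes effect after `wsl --shutdown` in PowerShell
[wsl2]
memory=56GB   # raise the VM memory cap (default is a fraction of host RAM)
swap=64GB     # extra swap lets a run slow down instead of being killed
```

With more swap available, an over-budget run degrades gradually rather than being killed outright.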
Thanks for clarifying. I'd be happy to look at the data on my machine. Can you contact me by email so that we can set up a way to transfer the data? My address is easy to find on the Internet (like here, for instance).
I am attempting to process a relatively large count file. It is around 10GB and contains ~170 million unique sequences, each 60bp in length.
The computer I am using has 64GB of RAM and an Intel(R) Xeon(R) Silver 4214R CPU @ 2.40GHz. When I try to run starcode, the process is killed without any error message. I am guessing that the program exceeds the available memory, because when I subset my count file to 25 million sequences, starcode is able to process it but uses almost 100% of the computer's memory.
I am wondering if there is any solution to this, or how much RAM you believe would be needed to process a dataset of this size?
Thank you!
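(For anyone reproducing this, a minimal sketch of how the subset and the memory measurement might be done. File names and the exact line count are placeholders; the count file is assumed to be tab-separated, one `sequence<TAB>count` record per line.)

```sh
# Take the first 25 million records of the count file as a test subset.
head -n 25000000 counts.txt > subset.txt

# GNU time reports "Maximum resident set size", i.e. the run's peak RAM.
/usr/bin/time -v starcode -d 1 -i subset.txt -o subset_clustered.txt

# A bare "Killed" with no error message usually means the kernel OOM
# killer fired; the kernel log confirms it.
dmesg | grep -i -E "killed process|out of memory"
```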