-
Notifications
You must be signed in to change notification settings - Fork 81
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
has human check #3284
has human check #3284
Conversation
Leaving some notes here of the evolution of has_human:
In [XX]: %timeit hh = [1 for s, v in st.get_category('env_package').items() if s in asamples and v.startswith('human-')]
2.77 s ± 11 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [XX]: %timeit hh = [1 for s in asamples if st.get(s).get('env_package').startswith('human-')]
1min 19s ± 319 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [XX]: %timeit artifact.has_human
2.81 s ± 8.13 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [XX]: %timeit Artifact(43738).has_human
26.7 ms ± 1.05 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) |
@@ -145,7 +145,7 @@ def create(cls, owner, name, description, from_default=False, | |||
If the duplicated sample ids in the selected studies should be | |||
merged or prepended with the artifact ids. False (default) prepends | |||
the artifact id | |||
categories : set of str, optional | |||
categories : list of str, optional |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While working on something else, I realized that this documentation was wrong.
Finally, these is the new behavior:
$ wget -O qiita_43738.zip "https://qiita-rc.ucsd.edu/public_artifact_download/?artifact_id=43738"
--2023-05-11 07:10:42-- https://qiita-rc.ucsd.edu/public_artifact_download/?artifact_id=43738
Resolving qiita-rc.ucsd.edu (qiita-rc.ucsd.edu)... 132.249.211.60
Connecting to qiita-rc.ucsd.edu (qiita-rc.ucsd.edu)|132.249.211.60|:443... connected.
HTTP request sent, awaiting response... 404 Artifact has possible human sequences. If this is a mistake contact: [email protected]
2023-05-11 07:10:42 ERROR 404: Artifact has possible human sequences. If this is a mistake contact: [email protected]. |
has_human = False | ||
if self.artifact_type == 'per_sample_FASTQ': | ||
st = self.study.sample_template | ||
if 'env_package' in st.categories: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I worry about whether this is stringent enough. Users who upload data, or data we import from EBI, is not assured to have the env_package
variable. If the study lacks this variable then we may accidentally be exposing human associated data. Are there any human associated shotgun samples in Qiita right now that this doesn't catch?
One thing we could do is test multiple variables, such as also including host_taxid
, host_scientific_name
, and host_common_name
. However, I think it would be helpful to take a data driven perspective here and make sure what we choose is able to capture all human shotgun data we currently index
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question! As background, around 4 years ago we introduced the idea of restrictions to change artifacts from sandbox to private/public so it means that all studies that have changed level since that date should have env_package. Now that doesn't mean that this is the case for older studies; however, all Metagenomic data has been added in more recent years.
Anyway, to give do this more data oriented, we currently have: Counter({'public': 695, 'private': 153, 'sandbox': 3254}) and checking Study status and the existence of env_package & host_taxid, we have:
Counter({'public host_taxid': 38,
'public env_package': 638,
'private host_taxid': 3,
'sandbox host_taxid': 150,
'sandbox env_package': 889,
'private env_package': 130})
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are any of the public studies which lack env_package ones we should be concerned about? Are there other variables those studies have which could be used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMOO yes but I think we should add that column to the studies vs. trying to use another column. I'll send an email to the qiita.admin's with the studies.
@@ -102,7 +102,7 @@ | |||
scripts=glob('scripts/*'), | |||
# making sure that numpy is installed before biom | |||
setup_requires=['numpy', 'cython'], | |||
install_requires=['psycopg2', 'click', 'bcrypt', 'pandas', | |||
install_requires=['psycopg2', 'click', 'bcrypt', 'pandas<2.0', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why the pin?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because pandas>2.0 brakes the tests. I saw that this was failing: https://github.com/qiita-spots/qiita/actions/runs/4940536571/jobs/8832562245 so installed the latests pandas version in my local computer and got a lot of errors; then installed 'pandas<2.0' and everything worked fine again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, makes sense to punt. Some of the types of changes needed are outlined here scikit-bio/scikit-bio#1851
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added #3289 so we don't forget.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@wasade, thank you for the review!
Co-authored-by: Daniel McDonald <[email protected]>
No description provided.