Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

has human check #3284

Merged
merged 7 commits into from
May 13, 2023
Merged

Conversation

antgonza
Copy link
Member

No description provided.

@antgonza antgonza changed the title has human check [WIP] has human check May 10, 2023
@antgonza
Copy link
Member Author

Leaving some notes here of the evolution of has_human:

  • The original implementation simply looped over the samples/values but it felt a little slow for the AGP
In [XX]: %timeit hh = [1 for s, v in st.get_category('env_package').items() if s in asamples and v.startswith('human-')]
2.77 s ± 11 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
  • Then decided to try something else simple, loop over only the samples that belonged to that artifact, which made things slower:
In [XX]: %timeit hh = [1 for s in asamples if st.get(s).get('env_package').startswith('human-')]
1min 19s ± 319 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
  • Then decided to create it's own method (so we can test) using the fast version of the 2 above
In [XX]: %timeit artifact.has_human
2.81 s ± 8.13 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
  • Finally decided to write something "smart" combining python/SQL (current version)
In [XX]: %timeit Artifact(43738).has_human
26.7 ms ± 1.05 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

@antgonza antgonza changed the title [WIP] has human check has human check May 11, 2023
@@ -145,7 +145,7 @@ def create(cls, owner, name, description, from_default=False,
If the duplicated sample ids in the selected studies should be
merged or prepended with the artifact ids. False (default) prepends
the artifact id
categories : set of str, optional
categories : list of str, optional
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While working on something else, I realized that this documentation was wrong.

@antgonza
Copy link
Member Author

Finally, these is the new behavior:

  1. If a user tried to access a public artifact that has possible human reads
before-giving-permissions
  1. The same user after they have been invited to the study
after-giving-permissions
  1. If any user tries to download the artifact via CLI
$ wget -O qiita_43738.zip "https://qiita-rc.ucsd.edu/public_artifact_download/?artifact_id=43738"
--2023-05-11 07:10:42--  https://qiita-rc.ucsd.edu/public_artifact_download/?artifact_id=43738
Resolving qiita-rc.ucsd.edu (qiita-rc.ucsd.edu)... 132.249.211.60
Connecting to qiita-rc.ucsd.edu (qiita-rc.ucsd.edu)|132.249.211.60|:443... connected.
HTTP request sent, awaiting response... 404 Artifact has possible human sequences. If this is a mistake contact: [email protected]
2023-05-11 07:10:42 ERROR 404: Artifact has possible human sequences. If this is a mistake contact: [email protected].

@antgonza antgonza mentioned this pull request May 11, 2023
has_human = False
if self.artifact_type == 'per_sample_FASTQ':
st = self.study.sample_template
if 'env_package' in st.categories:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I worry about whether this is stringent enough. Users who upload data, or data we import from EBI, is not assured to have the env_package variable. If the study lacks this variable then we may accidentally be exposing human associated data. Are there any human associated shotgun samples in Qiita right now that this doesn't catch?

One thing we could do is test multiple variables, such as also including host_taxid, host_scientific_name, and host_common_name. However, I think it would be helpful to take a data driven perspective here and make sure what we choose is able to capture all human shotgun data we currently index

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question! As background, around 4 years ago we introduced the idea of restrictions to change artifacts from sandbox to private/public so it means that all studies that have changed level since that date should have env_package. Now that doesn't mean that this is the case for older studies; however, all Metagenomic data has been added in more recent years.

Anyway, to give do this more data oriented, we currently have: Counter({'public': 695, 'private': 153, 'sandbox': 3254}) and checking Study status and the existence of env_package & host_taxid, we have:
Counter({'public host_taxid': 38,
'public env_package': 638,
'private host_taxid': 3,
'sandbox host_taxid': 150,
'sandbox env_package': 889,
'private env_package': 130})

What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are any of the public studies which lack env_package ones we should be concerned about? Are there other variables those studies have which could be used?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMOO yes but I think we should add that column to the studies vs. trying to use another column. I'll send an email to the qiita.admin's with the studies.

qiita_pet/handlers/download.py Outdated Show resolved Hide resolved
@@ -102,7 +102,7 @@
scripts=glob('scripts/*'),
# making sure that numpy is installed before biom
setup_requires=['numpy', 'cython'],
install_requires=['psycopg2', 'click', 'bcrypt', 'pandas',
install_requires=['psycopg2', 'click', 'bcrypt', 'pandas<2.0',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the pin?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because pandas>2.0 brakes the tests. I saw that this was failing: https://github.com/qiita-spots/qiita/actions/runs/4940536571/jobs/8832562245 so installed the latests pandas version in my local computer and got a lot of errors; then installed 'pandas<2.0' and everything worked fine again.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, makes sense to punt. Some of the types of changes needed are outlined here scikit-bio/scikit-bio#1851

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added #3289 so we don't forget.

Copy link
Member Author

@antgonza antgonza left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wasade, thank you for the review!

@antgonza antgonza merged commit 9ec2ce5 into qiita-spots:master May 13, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants