has human check #3284

antgonza · 2023-05-10T19:15:53Z

No description provided.

antgonza · 2023-05-11T12:24:11Z

Leaving some notes here of the evolution of has_human:

The original implementation simply looped over the samples/values but it felt a little slow for the AGP

In [XX]: %timeit hh = [1 for s, v in st.get_category('env_package').items() if s in asamples and v.startswith('human-')]
2.77 s ± 11 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Then decided to try something else simple, loop over only the samples that belonged to that artifact, which made things slower:

In [XX]: %timeit hh = [1 for s in asamples if st.get(s).get('env_package').startswith('human-')]
1min 19s ± 319 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Then decided to create it's own method (so we can test) using the fast version of the 2 above

In [XX]: %timeit artifact.has_human
2.81 s ± 8.13 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Finally decided to write something "smart" combining python/SQL (current version)

In [XX]: %timeit Artifact(43738).has_human
26.7 ms ± 1.05 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

antgonza · 2023-05-11T13:08:08Z

qiita_db/analysis.py

@@ -145,7 +145,7 @@ def create(cls, owner, name, description, from_default=False,
            If the duplicated sample ids in the selected studies should be
            merged or prepended with the artifact ids. False (default) prepends
            the artifact id
-        categories : set of str, optional
+        categories : list of str, optional


While working on something else, I realized that this documentation was wrong.

antgonza · 2023-05-11T13:13:33Z

Finally, these is the new behavior:

If a user tried to access a public artifact that has possible human reads

The same user after they have been invited to the study

If any user tries to download the artifact via CLI

$ wget -O qiita_43738.zip "https://qiita-rc.ucsd.edu/public_artifact_download/?artifact_id=43738"
--2023-05-11 07:10:42--  https://qiita-rc.ucsd.edu/public_artifact_download/?artifact_id=43738
Resolving qiita-rc.ucsd.edu (qiita-rc.ucsd.edu)... 132.249.211.60
Connecting to qiita-rc.ucsd.edu (qiita-rc.ucsd.edu)|132.249.211.60|:443... connected.
HTTP request sent, awaiting response... 404 Artifact has possible human sequences. If this is a mistake contact: [email protected]
2023-05-11 07:10:42 ERROR 404: Artifact has possible human sequences. If this is a mistake contact: [email protected].

wasade · 2023-05-11T16:21:05Z

qiita_db/artifact.py

+        has_human = False
+        if self.artifact_type == 'per_sample_FASTQ':
+            st = self.study.sample_template
+            if 'env_package' in st.categories:


I worry about whether this is stringent enough. Users who upload data, or data we import from EBI, is not assured to have the env_package variable. If the study lacks this variable then we may accidentally be exposing human associated data. Are there any human associated shotgun samples in Qiita right now that this doesn't catch?

One thing we could do is test multiple variables, such as also including host_taxid, host_scientific_name, and host_common_name. However, I think it would be helpful to take a data driven perspective here and make sure what we choose is able to capture all human shotgun data we currently index

Good question! As background, around 4 years ago we introduced the idea of restrictions to change artifacts from sandbox to private/public so it means that all studies that have changed level since that date should have env_package. Now that doesn't mean that this is the case for older studies; however, all Metagenomic data has been added in more recent years.

Anyway, to give do this more data oriented, we currently have: Counter({'public': 695, 'private': 153, 'sandbox': 3254}) and checking Study status and the existence of env_package & host_taxid, we have:
Counter({'public host_taxid': 38,
'public env_package': 638,
'private host_taxid': 3,
'sandbox host_taxid': 150,
'sandbox env_package': 889,
'private env_package': 130})

What do you think?

Are any of the public studies which lack env_package ones we should be concerned about? Are there other variables those studies have which could be used?

IMOO yes but I think we should add that column to the studies vs. trying to use another column. I'll send an email to the qiita.admin's with the studies.

qiita_pet/handlers/download.py

wasade · 2023-05-11T16:27:42Z

setup.py

@@ -102,7 +102,7 @@
      scripts=glob('scripts/*'),
      # making sure that numpy is installed before biom
      setup_requires=['numpy', 'cython'],
-      install_requires=['psycopg2', 'click', 'bcrypt', 'pandas',
+      install_requires=['psycopg2', 'click', 'bcrypt', 'pandas<2.0',


Why the pin?

Because pandas>2.0 brakes the tests. I saw that this was failing: https://github.com/qiita-spots/qiita/actions/runs/4940536571/jobs/8832562245 so installed the latests pandas version in my local computer and got a lot of errors; then installed 'pandas<2.0' and everything worked fine again.

Okay, makes sense to punt. Some of the types of changes needed are outlined here scikit-bio/scikit-bio#1851

Added #3289 so we don't forget.

antgonza

@wasade, thank you for the review!

Co-authored-by: Daniel McDonald <[email protected]>

has human check

d5f846c

antgonza changed the title ~~has human check~~ [WIP] has human check May 10, 2023

antgonza added 4 commits May 10, 2023 13:47

pandas<2.0

8d2cf5e

Artifact.has_human

920d171

adding test and using SQL

03429ca

bla -> oral

a8b92bd

decorate download.py

fefb57b

antgonza changed the title ~~[WIP] has human check~~ has human check May 11, 2023

antgonza commented May 11, 2023

View reviewed changes

antgonza mentioned this pull request May 11, 2023

2023.05 #3285

Merged

wasade reviewed May 11, 2023

View reviewed changes

antgonza commented May 11, 2023

View reviewed changes

Apply suggestions from code review

ad459a3

Co-authored-by: Daniel McDonald <[email protected]>

antgonza mentioned this pull request May 13, 2023

Use to newest pandas. #3289

Open

antgonza merged commit 9ec2ce5 into qiita-spots:master May 13, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

has human check #3284

has human check #3284

antgonza commented May 10, 2023

antgonza commented May 11, 2023

antgonza May 11, 2023

antgonza commented May 11, 2023

wasade May 11, 2023

antgonza May 11, 2023

wasade May 11, 2023

antgonza May 11, 2023

wasade May 11, 2023

antgonza May 11, 2023

wasade May 11, 2023

antgonza May 13, 2023

antgonza left a comment

has human check #3284

has human check #3284

Conversation

antgonza commented May 10, 2023

antgonza commented May 11, 2023

Choose a reason for hiding this comment

antgonza commented May 11, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

antgonza left a comment

Choose a reason for hiding this comment