has human check #3284

Merged: 7 commits, May 13, 2023
2 changes: 1 addition & 1 deletion qiita_db/analysis.py
@@ -145,7 +145,7 @@ def create(cls, owner, name, description, from_default=False,
If the duplicated sample ids in the selected studies should be
merged or prepended with the artifact ids. False (default) prepends
the artifact id
- categories : set of str, optional
+ categories : list of str, optional
Member Author:

While working on something else, I realized that this documentation was wrong.
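For illustration, a minimal usage sketch of the corrected signature; the owner, name, and categories below are hypothetical values, not taken from this PR:

```python
# Hypothetical usage sketch; assumes the Analysis.create signature shown in
# this diff. The owner, name, and category values are made-up examples.
import qiita_db as qdb

analysis = qdb.analysis.Analysis.create(
    qdb.user.User('demo@example.com'),         # owner (hypothetical account)
    'example meta-analysis',                   # name
    'illustrates the corrected docstring',     # description
    categories=['env_package', 'host_taxid'])  # a list of str, not a set
```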

If not None, use _only_ these categories for the metaanalysis

Returns
22 changes: 22 additions & 0 deletions qiita_db/artifact.py
@@ -43,6 +43,7 @@ class Artifact(qdb.base.QiitaObject):
prep_template
ebi_run_accession
study
has_human

Methods
-------
@@ -1550,6 +1551,27 @@ def being_deleted_by(self):
res = qdb.sql_connection.TRN.execute_fetchindex()
return qdb.processing_job.ProcessingJob(res[0][0]) if res else None

@property
def has_human(self):
    has_human = False
    if self.artifact_type == 'per_sample_FASTQ':
        st = self.study.sample_template
        if 'env_package' in st.categories:
Contributor:

I worry about whether this is stringent enough. Data that users upload, or that we import from EBI, is not assured to have the env_package variable. If the study lacks this variable then we may accidentally be exposing human-associated data. Are there any human-associated shotgun samples in Qiita right now that this doesn't catch?

One thing we could do is test multiple variables, such as also including host_taxid, host_scientific_name, and host_common_name. However, I think it would be helpful to take a data-driven perspective here and make sure that whatever we choose captures all the human shotgun data we currently index.
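A minimal sketch of that multi-column idea (the constant and helper here are hypothetical, not something this PR adds):

```python
# Hypothetical sketch of the multi-column check suggested above; not the
# PR's implementation. Assumes `st` is a SampleTemplate whose `categories`
# lists its metadata columns, as used elsewhere in this diff.
HOST_COLUMNS = ('env_package', 'host_taxid',
                'host_scientific_name', 'host_common_name')

def host_columns_present(st):
    """Return the host-related columns this sample template defines."""
    return [c for c in HOST_COLUMNS if c in st.categories]
```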

Member Author:

Good question! As background: around 4 years ago we introduced restrictions on changing artifacts from sandbox to private/public, which means that every study that has changed level since then should have env_package. That doesn't guarantee it for older studies; however, all Metagenomic data has been added in more recent years.

Anyway, to make this more data-driven: we currently have Counter({'public': 695, 'private': 153, 'sandbox': 3254}), and checking study status against the existence of env_package & host_taxid we have:
Counter({'public host_taxid': 38,
'public env_package': 638,
'private host_taxid': 3,
'sandbox host_taxid': 150,
'sandbox env_package': 889,
'private env_package': 130})

What do you think?
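For reference, a rough sketch of how tallies like these could be produced; it assumes Study.iter(), study.status, and sample_template behave as used elsewhere in qiita_db:

```python
# Hypothetical audit sketch reconstructing the tallies above; assumes
# qdb.study.Study.iter() yields every study and that `status` and
# `sample_template` behave as used elsewhere in qiita_db.
from collections import Counter

import qiita_db as qdb

counts = Counter()
for study in qdb.study.Study.iter():
    st = study.sample_template
    if st is None:
        continue
    for column in ('env_package', 'host_taxid'):
        if column in st.categories:
            counts[f'{study.status} {column}'] += 1
print(counts)
```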

Contributor:

Are any of the public studies which lack env_package ones we should be concerned about? Are there other variables those studies have which could be used?

Member Author:

IMO yes, but I think we should add that column to those studies rather than trying to use another column. I'll send an email to the qiita.admins with the list of studies.

            sql = f"""SELECT DISTINCT sample_values->>'env_package'
                FROM qiita.sample_{st.id} WHERE sample_id in (
                    SELECT sample_id from qiita.preparation_artifact
                    LEFT JOIN qiita.prep_template_sample USING (
                        prep_template_id)
                    WHERE artifact_id = {self.id})"""
            with qdb.sql_connection.TRN:
                qdb.sql_connection.TRN.add(sql)
                for v in qdb.sql_connection.TRN.execute_fetchflatten():
                    if v.startswith('human-'):
                        has_human = True
                        break

    return has_human

def jobs(self, cmd=None, status=None, show_hidden=False):
"""Jobs that used this artifact as input

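In plain terms, the SQL in has_human pulls the distinct env_package values for exactly the samples attached to this artifact's prep and flags any value starting with 'human-'. A rough Python rendering for readers (the PR itself uses the SQL; that get_category returns a {sample_id: value} dict and that iterating a prep template yields its sample ids are assumptions):

```python
# Rough, hypothetical Python rendering of the logic in has_human; the PR
# itself queries the database directly. Assumes get_category returns a
# {sample_id: value} dict and that iterating a prep template yields its
# sample ids.
def has_human_py(artifact):
    if artifact.artifact_type != 'per_sample_FASTQ':
        return False
    st = artifact.study.sample_template
    if 'env_package' not in st.categories:
        return False
    env_packages = st.get_category('env_package')
    prep_samples = {s for pt in artifact.prep_templates for s in pt}
    return any(str(env_packages.get(s, '')).startswith('human-')
               for s in prep_samples)
```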
42 changes: 42 additions & 0 deletions qiita_db/test/test_artifact.py
@@ -677,6 +677,28 @@ def setUp(self):

self._clean_up_files = [self.fp1, self.fp2, self.fp3, self.fp4]

# per_sample_FASTQ Metagenomic example

self.prep_template_per_sample_fastq = \
    qdb.metadata_template.prep_template.PrepTemplate.create(
        metadata, qdb.study.Study(1), "Metagenomic")
fd, self.fwd = mkstemp(prefix='SKB8.640193', suffix='_R1.fastq')
close(fd)
with open(self.fwd, 'w') as f:
    f.write("@HWI-ST753:189:D1385ACXX:1:1101:1214:1906 1:N:0:\n"
            "NACGTAGGGTGCAAGCGTTGTCCGGAATNA\n"
            "+\n"
            "#1=DDFFFHHHHHJJJJJJJJJJJJGII#0\n")
fd, self.rev = mkstemp(prefix='SKB8.640193', suffix='_R2.fastq')
close(fd)
with open(self.rev, 'w') as f:
    f.write("@HWI-ST753:189:D1385ACXX:1:1101:1214:1906 1:N:0:\n"
            "NACGTAGGGTGCAAGCGTTGTCCGGAATNA\n"
            "+\n"
            "#1=DDFFFHHHHHJJJJJJJJJJJJGII#0\n")

self._clean_up_files.extend([self.fwd, self.rev])

def tearDown(self):
for f in self._clean_up_files:
if exists(f):
@@ -1364,6 +1386,26 @@ def test_descendants_with_jobs_one_element(self):
exp = [('artifact', artifact)]
self.assertCountEqual(obs, exp)

def test_has_human(self):
    # testing a FASTQ artifact (1), should be False
    self.assertFalse(qdb.artifact.Artifact(1).has_human)

    # create a per_sample_FASTQ
    artifact = qdb.artifact.Artifact.create(
        [(self.fwd, 1), (self.rev, 2)], "per_sample_FASTQ",
        prep_template=self.prep_template_per_sample_fastq)

    # this should be False as there are no human samples
    self.assertFalse(artifact.has_human)

    # let's make it True by making the sample human-*
    df = pd.DataFrame.from_dict(
        {'1.SKB8.640193': {'env_package': 'human-oral'}},
        orient='index', dtype=str)
    artifact.study.sample_template.update(df)

    self.assertTrue(artifact.has_human)


@qiita_test_checker()
class ArtifactArchiveTests(TestCase):
11 changes: 9 additions & 2 deletions qiita_pet/handlers/artifact_handlers/base_handlers.py
@@ -43,8 +43,15 @@ def check_artifact_access(user, artifact):
"""
if user.level in ('admin', 'wet-lab admin'):
return
-    if artifact.visibility != 'public':
-        study = artifact.study
+    study = artifact.study
+    if artifact.visibility == 'public':
+        # if it's public we need to confirm that this artifact has no possible
+        # human sequences
+        if artifact.has_human and not study.has_access(user, True):
+            raise QiitaHTTPError(403, "Access denied to artifact %s"
+                                 % artifact.id)
+    else:
analysis = artifact.analysis
if study:
if not study.has_access(user):
10 changes: 7 additions & 3 deletions qiita_pet/handlers/download.py
@@ -235,7 +235,7 @@ def get(self, study_id):
(self.current_user in study.shared_with)))

for a in study.artifacts(artifact_type='BIOM'):
- if full_access or a.visibility == 'public':
+ if full_access or (a.visibility == 'public' and not a.has_human):
to_download.extend(self._list_artifact_files_nginx(a))

self._write_nginx_file_list(to_download)
@@ -289,7 +289,7 @@ def get(self, study_id):
to_download = []
for a in study.artifacts():
if not a.parents:
- if not is_owner and a.visibility != 'public':
+ if not is_owner and (a.visibility != 'public' or a.has_human):
continue
to_download.extend(self._list_artifact_files_nginx(a))

@@ -460,7 +460,7 @@ def get(self):
artifacts = study.artifacts(
dtype=data_type, artifact_type='BIOM')
for a in artifacts:
- if a.visibility != 'public':
+ if a.visibility != 'public' or a.has_human:
continue
to_download.extend(self._list_artifact_files_nginx(a))

@@ -498,6 +498,10 @@ def get(self):
raise HTTPError(404, reason='Artifact is not public. If '
'this is a mistake contact: '
'[email protected]')
+ elif artifact.has_human:
+     raise HTTPError(404, reason='Artifact has possible human '
+                     'sequences. If this is a mistake contact: '
+                     '[email protected]')
else:
to_download = self._list_artifact_files_nginx(artifact)
if not to_download:
2 changes: 1 addition & 1 deletion setup.py
@@ -102,7 +102,7 @@
scripts=glob('scripts/*'),
# making sure that numpy is installed before biom
setup_requires=['numpy', 'cython'],
- install_requires=['psycopg2', 'click', 'bcrypt', 'pandas',
+ install_requires=['psycopg2', 'click', 'bcrypt', 'pandas<2.0',
Contributor:

Why the pin?

Member Author:

Because pandas>=2.0 breaks the tests. I saw that this run was failing: https://github.com/qiita-spots/qiita/actions/runs/4940536571/jobs/8832562245, so I installed the latest pandas version on my local computer and got a lot of errors; then I installed 'pandas<2.0' and everything worked fine again.
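For context, pandas 2.0 removed several long-deprecated APIs; a common breakage pattern looks like the sketch below (whether this exact call is what fails in Qiita is an assumption):

```python
import pandas as pd

df = pd.DataFrame({'a': [1]})
row = pd.DataFrame({'a': [2]})

# DataFrame.append was deprecated in pandas 1.4 and removed in 2.0:
# df = df.append(row)  # raises AttributeError under pandas >= 2.0
df = pd.concat([df, row], ignore_index=True)  # works before and after 2.0
```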

Contributor:

Okay, makes sense to punt. Some of the types of changes needed are outlined in scikit-bio/scikit-bio#1851.

Member Author:

Added #3289 so we don't forget.

'biom-format', 'tornado<6.0', 'toredis', 'redis',
'scp', 'pyparsing', 'h5py', 'natsort', 'nose', 'pep8',
'networkx', 'humanize', 'wtforms<3.0.0', 'nltk',