Obtain empirical evidence for permissible values to be included in the missing data code enumeration #7
I will include some counts of NCBI Attribute contents that seem to match INSDC missing value indicators.

First, how large is my latest DuckDB database (from https://portal.nersc.gov/project/m3408/biosamples_duckdb/)?

Number of rows in the attributes table:

select
count(1)
from
ncbi_biosamples.main.attributes a ;

585,141,773

Number of distinct Biosample ids:

select
count(distinct id)
from
ncbi_biosamples.main.attributes a ;

What is the highest numerical Biosample id?

select
max(id)
from
ncbi_biosamples.main.attributes a ;
Example semi-fuzzy search:

select
count(1)
from
ncbi_biosamples.main.attributes a
where
lower("content") like '%control sample%';
Comprehensive query:

SELECT
"content",
count(1)
FROM
ncbi_biosamples.main.attributes a
WHERE
lower("content") ~ '.*(control sample|data agreement established pre-2023|endangered species|human-identifiable|lab stock|missing|not applicable|not collected|not provided|restricted access|sample group|synthetic construct|third party data).*'
GROUP BY
"content"
HAVING
count(1) > 1
ORDER BY
count(1) DESC;

Partial results

Full results attached
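For anyone who wants to try the filter-and-count logic on a small extract without DuckDB, here is a minimal Python sketch of the comprehensive query. It assumes the attribute values are already available as a list of strings. The function name `count_matching_contents` and the sample values are hypothetical and invented for illustration; only the list of terms comes from the query. As in the SQL, matching is done on the lowercased value while counts are grouped by the original, case-sensitive value.

```python
import re
from collections import Counter

# Terms copied from the regular expression in the comprehensive query.
TERMS = [
    "control sample",
    "data agreement established pre-2023",
    "endangered species",
    "human-identifiable",
    "lab stock",
    "missing",
    "not applicable",
    "not collected",
    "not provided",
    "restricted access",
    "sample group",
    "synthetic construct",
    "third party data",
]

# One alternation searched anywhere in the value, roughly like '.*(a|b|...).*'.
PATTERN = re.compile("|".join(re.escape(term) for term in TERMS))


def count_matching_contents(contents, min_count=2):
    """Count values containing any term; keep those seen at least min_count times.

    Mirrors GROUP BY "content" HAVING count(1) > 1 ORDER BY count(1) DESC
    when min_count is left at 2 (hypothetical helper, not part of the issue).
    """
    counts = Counter(c for c in contents if PATTERN.search(c.lower()))
    kept = [(c, n) for c, n in counts.items() if n >= min_count]
    # Most frequent first; ties broken alphabetically for a stable order.
    return sorted(kept, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical attribute values, not taken from the real database.
sample = [
    "missing", "missing", "missing",
    "Not collected", "Not collected",
    "not applicable",
    "soil",
    "Missing: control sample",
]
result = count_matching_contents(sample)
print(result)
```

Because the match is a substring search, a value such as "Missing: control sample" is counted under its own spelling rather than being folded into "missing", which is the same behavior the SQL query has.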
Partial results don't include these other common values:
see also