Obtain empirical evidence for permissible values to be included in the missing data code enumeration #7

Open
turbomam opened this issue Sep 9, 2024 · 4 comments

turbomam commented Sep 9, 2024

see also

turbomam commented Sep 9, 2024

I will include some counts of NCBI Attribute contents that seem to match INSDC missing value indicators.

First, how many attribute rows, and how many distinct Biosamples, are in my latest DuckDB database? (from https://portal.nersc.gov/project/m3408/biosamples_duckdb/)

select
	count(1)
from
	ncbi_biosamples.main.attributes a ;

585,141,773

select
	count(distinct id)
from
	ncbi_biosamples.main.attributes a ;

40,462,422

What is the highest numerical Biosample id in there?

select
	max(id)
from
	ncbi_biosamples.main.attributes a ;

43,280,194
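The three counts above can be related with a little arithmetic. This is only a sketch using the numbers reported in this comment; the variable names are made up for illustration, and it assumes Biosample ids are positive integers assigned roughly sequentially, which the queries themselves do not establish.

```python
# Counts copied from the query results in this comment.
attribute_rows = 585_141_773       # count(1) over the attributes table
distinct_biosamples = 40_462_422   # count(distinct id)
max_biosample_id = 43_280_194      # max(id)

# Ids up to the maximum that have no row in the attributes table
# (assumption: ids start near 1 and are assigned sequentially).
ids_without_attributes = max_biosample_id - distinct_biosamples

# Average number of attribute rows per Biosample present in the table.
attributes_per_biosample = attribute_rows / distinct_biosamples

print(f"ids without attribute rows: {ids_without_attributes:,}")
print(f"attribute rows per Biosample: {attributes_per_biosample:.2f}")
```

Under that assumption, roughly 2.8 million ids below the maximum have no attribute rows, and each Biosample in the table carries about 14 attributes on average.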

turbomam commented Sep 9, 2024

Example of a semi-fuzzy search

select
	count(1)
from
	ncbi_biosamples.main.attributes a
where
	lower("content") like '%control sample%';

turbomam commented Sep 9, 2024

Comprehensive query

SELECT
	"content",
	count(1)
FROM
	ncbi_biosamples.main.attributes a
WHERE
	lower("content") 
	~ '.*(control sample|data agreement established pre-2023|endangered species|human-identifiable|lab stock|missing|not applicable|not collected|not provided|restricted access|sample group|synthetic construct|third party data).*'
GROUP BY
	"content"
HAVING
	count(1) > 1
ORDER BY
	count(1) DESC;

Partial results

Full results attached

| content | count(1) |
| --- | --- |
| not provided | 25533568 |
| missing | 14115083 |
| not applicable | 8235881 |
| Not provided | 5052088 |
| not collected | 4019535 |
| Missing | 744604 |
| restricted access | 669718 |
| Not applicable | 455082 |
| Not Provided | 189753 |
| Missing: Not provided | 180559 |
| Not Applicable | 151370 |
| Restricted Access | 139848 |
| Not collected | 133586 |
| NOT PROVIDED | 122423 |
| Not Collected | 117148 |
| NOT COLLECTED | 85093 |
| missing: data agreement established pre-2023 | 75688 |
| NOT APPLICABLE | 66434 |
| synthetic construct | 38754 |
| control sample | 36337 |
| MIssing: Not provided | 22765 |
| missing: sample group | 18911 |
| not applicable' | 10872 |
| Missing: Not collected | 8731 |
| Missing: Restricted access | 8054 |
| Missing: Not recorded | 6850 |
| missing: lab stock | 6620 |
| Missing: Not reported | 6533 |
| Not Collected [GENEPIO:0001620] | 6440 |
| MISSING | 6335 |

export_202409091419.csv
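Because the query groups on the raw "content" value, case and punctuation variants of the same indicator are counted separately. A minimal sketch of how those variants could be collapsed, using a subset of the rows from the partial results: the `normalize` helper and its rules (trim, drop stray quotes, lowercase) are assumptions for illustration, not something applied in the query.

```python
from collections import Counter

# A subset of (content, count) rows copied from the partial results.
rows = [
    ("not provided", 25533568),
    ("Not provided", 5052088),
    ("Not Provided", 189753),
    ("NOT PROVIDED", 122423),
    ("missing", 14115083),
    ("Missing", 744604),
    ("MISSING", 6335),
    ("not applicable", 8235881),
    ("not applicable'", 10872),
    ("Missing: Not provided", 180559),
    ("MIssing: Not provided", 22765),
]


def normalize(content: str) -> str:
    """Hypothetical normalization: trim whitespace and stray quotes, lowercase."""
    return content.strip().strip("'\"").lower()


# Sum the counts of all variants that normalize to the same string.
totals = Counter()
for content, count in rows:
    totals[normalize(content)] += count

for value, count in totals.most_common():
    print(f"{value}\t{count:,}")
```

Even on this subset, the four spellings of "not provided" merge into a single value with about 30.9 million occurrences, which may be a more useful basis for choosing permissible values than the raw variants.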

turbomam changed the title from "choose the permissible values for the missing data code enumeration?" to "Obtain empirical evidence for permissible values to be included in the missing data code enumeration" on Sep 9, 2024

turbomam commented Sep 9, 2024

Partial results don't include these other common values:

  • NULL: 1,049,472
  • unknown: 3,502,884
  • na: 2,607,680
  • n/a: 381,861
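These values are absent because none of them contains any of the terms in the comprehensive query's pattern. The sketch below reproduces that pattern in Python to check individual strings. It assumes that DuckDB's `~` operator requires the whole string to match (which is why the query wraps the alternatives in `.*`), so `re.fullmatch` is used as the closest equivalent; `matches_query` is a hypothetical helper name.

```python
import re

# Alternatives copied from the pattern in the comprehensive query.
TERMS = [
    "control sample",
    "data agreement established pre-2023",
    "endangered species",
    "human-identifiable",
    "lab stock",
    "missing",
    "not applicable",
    "not collected",
    "not provided",
    "restricted access",
    "sample group",
    "synthetic construct",
    "third party data",
]

PATTERN = re.compile(".*(" + "|".join(re.escape(t) for t in TERMS) + ").*")


def matches_query(content: str) -> bool:
    """Mimic lower("content") ~ '.*(...).*' with a whole-string match."""
    return PATTERN.fullmatch(content.lower()) is not None


for value in ["Not Collected [GENEPIO:0001620]", "NULL", "unknown", "na", "n/a"]:
    print(f"{value!r}: {matches_query(value)}")
```

Capturing "NULL", "unknown", "na" and "n/a" would need either extra alternatives in the pattern or a separate exact-match query, since short terms such as "na" would match far too many unrelated strings as substrings.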
