Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Filesets to swap #56

Closed
wants to merge 61 commits into from
Closed

Conversation

will-moore
Copy link
Member

@will-moore will-moore commented Jun 13, 2023

Starting with the list of projects/screens from https://docs.google.com/spreadsheets/d/11Eg8JzY7dqRUMGVAjGIfTcGDs0BitJQJQ20BVoaiZOg/edit#gid=1516505897
and the script get_filesets_to_swap.py produced a list of 2062 Filesets (see idr_filesets.csv).

Following workflow at IDR/omero-mkngff#2, we harvest the Fileset names and UUIDs from BioStudies, and run python parse_bia_uuids.py idr0051 script from this PR to add the Fileset IDs based on their name from the idr_filesets.csv table created above.

The output idr00XX.csv tables are included in this PR.

They are used as the input for omero mkngff sql which generated sql outputs. They are also included in this PR.

We could move the idr00XX.csv files and the sql scripts to a different repo. Don't know if/where we want to document the usage instructions/workflow above?

@will-moore
Copy link
Member Author

Validating .sql files to check:

  • We aren't missing .zarray files (due to temp mkngff bug).
$ for i in $(ls ./); do echo "$i $(grep -c 'zarray' $i)"; done;
  • There is just 1 "UPDATE" string in each
$ for i in $(ls ./); do echo "$i $(grep -c 'UPDATE' $i)"; done;
for i in $(ls ./); do grep -v "Directory" $i > temp.txt && mv temp.txt $i ; done;

@will-moore
Copy link
Member Author

will-moore commented Oct 16, 2023

To test clientpath changes, need an original (not edited) Fileset.
Test on idr0138-pilot...

Checked-out this PR and updated idr0090/4053140.sql with SECRET.
Updated and run setup.sql with IDR/omero-mkngff#12
pip install 'omero-mkngff @ git+https://github.com/will-moore/omero-mkngff@clientpath'
Then ran 4053140.sql:

Screenshot 2023-10-16 at 22 10 37

Also used that omero-mkngff PR to generate sql and run that

(venv3) (base) [wmoore@pilot-idr0138-omeroreadwrite ~]$ omero mkngff sql 4053141 --clientpath=https://uk1s3.embassy.ebi.ac.uk/bia-integrator-data/S-BIAD852/f12bdada-57eb-4fab-90ef-9655e4106497/f12bdada-57eb-4fab-90ef-9655e4106497.zarr --secret=$SECRET /bia-integrator-data/S-BIAD852/f12bdada-57eb-4fab-90ef-9655e4106497/f12bdada-57eb-4fab-90ef-9655e4106497.zarr > 4053141.sql

and ran that sql as above...

Screenshot 2023-10-16 at 22 33 04

@sbesson
Copy link
Member

sbesson commented Oct 17, 2023

@will-moore will-moore requested review from joshmoore and sbesson 31 minutes ago

How is this meant to be reviewed? Should that involve a large round of testing and thus the entire IDR team?

At the code-level, primary concern is that this PR is adding thousands of files (the GitHub files UI) cannot even cope. Are we certain this is not going to bloat this repository and make it unusable in the future? An alternative would be to migrate the generation script and all SQL files to a standalone one-off upgrade repository.

@will-moore
Copy link
Member Author

To review, I'll list some things that need checking, and some ways to check them. I don't think it makes sense to get others to run the sql themselves, but rather check the result (as applied to idr-testing)...

  • Check that we've not missed any .pattern data (or other data that won't be read by mainline BioFormats). All studies in gdoc spreadsheet (see description above) are included in idr_filesets.py or have a idr00XX.csv file in this PR.
  • All the other scripts build on each other, resulting in new NGFF Filesets on idr-testing. So it makes sense to test that images there are viewable, and also check the Fileset files in webclient - the "Imported from" links should be valid URLs (not 404).
  • Good to discuss where to put the csv and sql files. I also wonder if we're going to follow the mkngff workflow for future NGFF submissions to IDR, so we might have more of these in the future??

@sbesson
Copy link
Member

sbesson commented Oct 17, 2023

Starting from a clean checkout

(base) sbesson@Sebastiens-MacBook-Pro-3 idr-utils % du -csh .
1.1M	.
1.1M	total
(base) sbesson@Sebastiens-MacBook-Pro-3 idr-utils % git fetch origin pull/56/merge
remote: Enumerating objects: 4333, done.
remote: Counting objects: 100% (41/41), done.
remote: Compressing objects: 100% (13/13), done.
remote: Total 4333 (delta 39), reused 28 (delta 28), pack-reused 4292
Receiving objects: 100% (4333/4333), 102.44 MiB | 4.50 MiB/s, done.
Resolving deltas: 100% (3002/3002), done.
From https://github.com/IDR/idr-utils
 * branch            refs/pull/56/merge -> FETCH_HEAD
(base) sbesson@Sebastiens-MacBook-Pro-3 idr-utils % du -csh .                     
113M	.
113M	total

Even if the size is still relatively small, a 100x size increase to commit SQL scripts expected to be used once defies a bit the aim of this utility repository from my perspective. My vote would be to either use a separate repository or split each file into the appropriate study repository.

@will-moore
Copy link
Member Author

OK, I think I'll move each idr00XX.csv and the sql scripts to the different study repos once we're done using them (and they're not changing). But while they're still being used and potentially edited, I'll keep working with this PR for convenience.
Also much easier to check them out into one place in one go for running them on idr-next etc while they're still in one branch.

@will-moore
Copy link
Member Author

@francesw - could you create a repo named mkngff_upgrade_scripts where I can push everything from this PR?
I think everything here is very much a one-off usage and doesn't need to be added to idr-utils for any future use.
Thanks.

@francesw
Copy link
Member

francesw commented Dec 7, 2023

@will-moore
Copy link
Member Author

moved all contents from this PR to https://github.com/IDR/mkngff_upgrade_scripts . Closing...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants