Refactoring #37
Comments
Configuration:
Better distinguish Analysis parameters:
Agreed - I would suggest having two routines

Getters, Setters, Loaders, Savers:
Agreed, see above!

Validation:
Agreed here too, though there is still probably more discussion to be had around parameter validation to ensure we aren't accidentally boxing our DMQC operators into corners. At least a set of warnings should be considered, as we have previously discussed. Perhaps we could even give operators a "normal range" that values would be expected to fall in.

Data Mapping:
Reduce the size of salinity mapping:
Again, completely agree. There is a ton of legwork done here that we could move out into separate functions to make the routine more accessible and friendly to engineers. I have already refactored some code compared to the MATLAB version, but there is more to be done.

Xarray and Parallelisation:
The only concern I have here is that there are some shared resources that are appended to at the end of the loop, such as constructing the selected historical data array. I don't know how xarray would handle a race condition like this, but it would be interesting to find out!

Data Fetching:
I'll comment on this section as a whole. The only place real data fetching takes place is in the
Clearly this could be done FAR better by either
A lot of the profiles will use largely the same historical data (since each sequential profile is near the previous profile), so we could do something with that. However, that might make parallelisation more difficult (?)

Very much on board for a web API fetcher, both for float data and bathymetry.

Code design:
Refactor:
There are a number of places where data manipulation happens outside a function that REALLY should happen inside. We can make a list of all the places this occurs and slowly work on refactoring them.

Documentation:
I have sat on this issue for quite a long time, frustratingly. I have waited to see what the BODC software team do, but it seems that their documentation is a little all over the place - different depending on the project and the person running it. There was a time frame where they thought they might move everything to ReadTheDocs, but ended up being unable to do so. I'm happy using anything, especially if it makes the code more approachable and it's convenient. It would also be nice to have somewhere to point people to who are new to the code. We'd probably need a lot of input from our DMQC experts here. There is an instruction book that would probably be worth having a look over. I'm sure @kamwal can help us with that.
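On the shared-resource concern under Xarray and Parallelisation above, one common way to sidestep the race is to have each worker return its own result and do the combining in a single place afterwards. The following is only a minimal sketch of that pattern: `select_historical_for_profile` and the profile dictionaries are hypothetical stand-ins, not functions or structures from this code base.

```python
from concurrent.futures import ThreadPoolExecutor


def select_historical_for_profile(profile):
    """Hypothetical stand-in for the per-profile work; returns its own result."""
    # Each call builds and returns a private list, so no worker touches shared state.
    return [(profile["id"], level * 2.0) for level in profile["levels"]]


def map_profiles(profiles, max_workers=4):
    """Run the per-profile work concurrently, then combine the results in one place."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, whatever order workers finish in.
        per_profile = list(pool.map(select_historical_for_profile, profiles))
    # The only "append" happens here, in the calling thread, after all workers are done.
    selected = []
    for rows in per_profile:
        selected.extend(rows)
    return selected


# Made-up example input, purely for illustration.
profiles = [
    {"id": 1, "levels": [10.0, 20.0]},
    {"id": 2, "levels": [30.0]},
]
selected = map_profiles(profiles)
```

Threads are used here only to keep the sketch short. For CPU-bound work, `ProcessPoolExecutor` offers the same `map` interface, and the return-then-combine structure stays the same.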
Comments on refactoring ideas, with a particular focus on what can be achieved in the near term.

Configuration:
Better distinguish Analysis parameters:
I am in favour of separating analysis and local configuration parameters. From a discussion in BODC today, local configuration parameters could be extended to include preferred plot output format for saving, and no doubt other features (we have talked about introducing perceptually neutral colour schemes, for instance). As we have a working piece of software, I don't think this area is a major priority right now, but equally it is not a major undertaking to implement some incremental improvements.

Getters, Setters, Loaders, Savers:
As above.

Validation:
Agree with Ed around not boxing DMQC operators in - I think this might be an area for future development as the code is adopted in routine use and we can develop a better shared understanding of what should and should not be done. Implementing anything now would risk non-adoption of the code if it inadvertently imposes limitations on operators.

Data Mapping:
Reduce the size of salinity mapping:
Whilst I agree in principle, at this point I think refactoring this element of the code brings too much risk to delivering a complete and reliable set of code which can be brought into routine use.

Xarray and Parallelisation:
I think this has one of the most important benefits in terms of performance and providing improved capabilities to users. I am not clear on how achievable this is in the short term, but from past conversations it seems this could be achieved in stages and be de-risked in the process.

Data Fetching:
Improvements to data fetching would be good, although I am hesitant about always fetching the latest data - sometimes you may want to be working with a static set from iteration to iteration of a DMQC run. At this stage, as we are limited by what is publicly available for reference data, I consider this to be a lower priority and not a current focus.

Enabling use of Jupyter at least demonstrates the potential though, and I think is something to take back to ADMT as future growth.

Code design:
Refactor:
Whilst I don't disagree with what has been highlighted, I suspect the more critical element at this juncture is to assess options for improving performance of the code. One such example is pre-allocation of variables that currently grow with each cycle and might be the cause of slower performance than MATLAB in some cases. Improved performance was one of the goals of this conversion.

Documentation:
Agreed on implementing something, and quickly. I have no great preferences - so long as it has a minimum of maintenance and a maximum of accessibility. I think it would be good for 'developer' documentation and 'user' documentation to sit in the same place.
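On the validation point discussed in this thread, a warning-only check is one option that would flag unusual values without stopping an operator from using them. This is a minimal sketch under stated assumptions: the parameter names and "normal ranges" below are invented placeholders, not real configuration values, and `check_parameters` is a hypothetical helper.

```python
import warnings

# Hypothetical "normal ranges"; names and bounds are placeholders, not real settings.
NORMAL_RANGES = {
    "example_scale": (1.0, 10.0),
    "example_depth": (100.0, 500.0),
}


def check_parameters(config, normal_ranges=NORMAL_RANGES):
    """Warn about unusual values; never raises and never modifies the config."""
    unusual = []
    for name, (low, high) in normal_ranges.items():
        if name not in config:
            continue  # parameters that are absent are simply not checked
        value = config[name]
        if not low <= value <= high:
            warnings.warn(
                f"{name}={value} is outside the usual range [{low}, {high}]",
                UserWarning,
                stacklevel=2,
            )
            unusual.append(name)
    return unusual
```

Because the function only warns and returns the names it flagged, the operator's chosen values are always used as given.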
To summarise my previous comment, I think the current focus should be on:
Improvements recently made to:
Fixing the CI pipelines
Data Mapping
Documentation
Let's talk about refactoring ideas
Configuration:
This is the most important feature to properly control what's being done, so it requires a nice and flexible UI
Data Mapping:
This is the most time consuming step, so it requires optimization to improve perf
Reduce the size of the update_salinity_mapping function (around 500 lines!) by identifying recurrent patterns and assigning inner loop work to specific functions.
update_salinity_mapping has 2 main loop levels: over profiles and over vertical levels. If inner loop work is delegated to functions and the data structure is clarified (with dictionaries or, even better, xarray.Dataset), making these loops run in parallel will be much easier and a game changer in terms of performance.
Data fetching:
This is a key component of the software, fetching float but more importantly reference data.
Code design:
In update_salinity_mapping, the wrapping of longitude values between -180/180 for get_topo_grid is done outside of get_topo_grid, adding 4 lines and 1 variable to the code. It is this function's own responsibility to check longitude values; this must not be done outside.
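The longitude point above could look something like the following. This is a sketch only: the real signature and return value of get_topo_grid are not shown in this issue, so `get_topo_grid_sketch` and its arguments are hypothetical, and the choice of the half-open interval [-180, 180) is an assumption.

```python
def wrap_longitude(lon):
    """Map a longitude in degrees onto the half-open interval [-180, 180)."""
    return (lon + 180.0) % 360.0 - 180.0


def get_topo_grid_sketch(min_lon, max_lon, min_lat, max_lat):
    """Hypothetical stand-in for get_topo_grid that wraps longitudes itself."""
    # Wrapping happens inside the function, so callers can pass 0..360 or -180..180.
    min_lon = wrap_longitude(min_lon)
    max_lon = wrap_longitude(max_lon)
    # Placeholder return value; the real function would go on to read the topography.
    return {"lon": (min_lon, max_lon), "lat": (min_lat, max_lat)}
```

With the wrapping inside the function, the caller in update_salinity_mapping would no longer need its extra lines or the intermediate variable.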