-
Notifications
You must be signed in to change notification settings - Fork 29
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SeaExplorer delayed mode time series data loss #128
Comments
We have had similar issues with our delayed mode datasets in the past. To get around it, we use NAV_LATITUDE as the timebase in all of our datasets, as this has a non-nan value for every line. Here's an example yml for one of our gliders with a GPCTD. https://github.com/voto-ocean-knowledge/deployment-yaml/blob/master/mission_yaml/SEA70_M15.yml Would this solution work for you? This also raises the broader point that perhaps we should have a test with delayed mode data, as currently all tests are run with nrt data |
We use NAV_LATITUDE as timebase in conjunction with keep_variables that account for each of the sensors. In this way, all data pass the |
I'm not sure if this is related, but I had previously noticed that the timestamps recorded for sensors that are known to be outputting at 1Hz (according to the sensor) are never exactly 1Hz, likely due to the time that they arrive at the payload computer -- in other words, timing information from the sensor is not recorded by the PLD but it assigns it's own. The result is that two different sensors sampling at the same rate (e.g. a CTD and an optical O2 sensor) end up with times that are different by microseconds from one sample to the next, and even though we expect them to be "simultaneous", in terms of the recorded time stamps, they aren't. |
For our data, we have the situation where indeed "Nav" is written every heartbeat; certainly though there is no new data in Nav. Some heartbeats have no data at all, some have CTD data (all variables), some have Optical, some have O2. Sometimes these line up, and currently we only save when the other sensors line up with the CTD sensors. I think what needs to happen is that for each instrument we need a timeseries, and then we need to align it with the timebase. For our SeaExplorers, that time base is most naturally the CTD. The other samples may be offset from that a it, but the CTD is sampling at 2 Hz, and I don't think we care about the other sensors slight phase errors at less than 2 Hz. I haven't explored how to do this with polars, but in xarray you would just do an time, ctd_temp = decode('ctd_temp', drop_na=True)
ds['time'] = ('time', time)
ds['temperature'] = ctd_temp
# etc other ctd variables
time_O2, O2_sat = decode('O2_sat', drop_na=True)
ds['O2'] = np.interp(time, time_O2, O2_sat)
...
# etc other O2 variables I don't think this step would slow things down very much, and I think linear interpolation should be pretty OK from a data point of view. A second option would be just to save the three instruments as separate polars arrays as raw output and then merge as a second step. That would allow double checking the raw data. However, I think the raw SeaExplorer data is simple enough that it's pretty usable as-is for any debugging. |
I'll ping about this. Is there a consensus about how to proceed? |
I agree with this approach, with the clarification (not really important for the decision or discussion) that I'm 99% sure the GP-CTD samples internally at a max of 1Hz. That's not true for a legato, which can be programmed to sample much faster (up to 16Hz), though the SeaExplorer PLD can only be configured to sample at 1Hz or "as fast as possible" (which IIRC is something like 20Hz) |
@callumrollo any objections to this? I can take a crack at doing it the next few days. @hvdosser, do we have a link to some data where this is problematic? |
The delayed L0-timeseries (dfo-bb046-20200908_delayed.nc) for this mission is a nice example of the problem. |
Is there a time, or ideally a set of raw files where this problem occurred? I couldn't replicate, though I couldn't get it to work at all with the setup in that directory. |
Sorry for the delay on this, just got back from vacation. The idea of aligning timebases sounds good to me. Should we linearly interpolate, or do a nearest neighbour? I'm always a little cautious about interpolating data. Especially as it would lead to some strange results with integer data like raw counts for chlorophyll etc. If we want to do it at the raw to raw nc step I can look at implementing it in polars tomorrow |
OK, I ran the whole data set, and can see the problem now. @callumrollo if you wanted to do this, happy to let you. I'm less nervous about linear interpolation - nearest neighbour leads to phase errors; However, if there is a difference for other data types, maybe we need an entry in the yml metadata that says which interpolation is used? |
The tests are currently failing on the main branch. Looks like this way caused by something in #129. That PR was only for slocum data, so I'll work from the last commit where all the tests cleared for this PR. I've downloaded the dataset that @hvdosser indicated and will use that as a test case for timestamp alignment |
@callumrollo #129 included sea explorer tests as well, so better to figure out why the tests are failing? |
@callumrollo I think the tests should be OK on main now (I hope) I think we need to do some work on the test infrastructure. In particular, we should force rewriting of the files we compare with. Right now the way the processing works is it does incremental. |
Thanks for working to get the tests passing again @jklymak. I'll start on resolving this Issue now. I agree on forced reprocessing in the tests. would using the |
@hvdosser I think the keep_vars functionality present in pyglider already could solve this problem to first order. Is it something you're using? I've put a demo together of the difference between using CTD as a timebase and using NAV_LATITUDE as a timebase then cutting down to the rows where at least one of the sensors has a sample. It's not perfect, but has been working pretty well for us so far. https://github.com/callumrollo/keep_vars_experiment/blob/main/timebase_keep_experiment.ipynb Sorry, it's a bit of a rushed Friday afternoon job! |
OK, I forgot about this, and I'm not sure it's documented. What does this do exactly? |
I haven't had a chance to look in detail yet, but the first thing I'd check is whether this works for a delayed-mode dataset. We didn't see much of an issue with the realtime data. |
looking at it quickly it seems to keep the data if any of the listed sensors are present in a line. I guess this is a philosophical thing - do we want all the raw data in a time series, which means any given sensor is riddled with NaN, or do we want time series where the sensors are time-aligned to one sensor's time base. I guess I'll argue for the latter. If someone needs the time series with raw O2, for instance, they can rerun the processing with just that variable, and that variable as the timebase. Or they can load the raw parquet files. I think by the time we make the first time series netcdf, it should be time-aligned, and not full of NaN's. This seems particularly apropos for the O2 and the optics on the SeaExplorer, which so far as I can tell are often ludicrously oversampled? But happy to discuss further. |
II agree we need a way to reduce the size of the final timeseries while keeping as much science data as possible. I've been white-boarding this morning to represent the way pyglider currently does this for seaexplorer data, and what potential improvements could look like. Here's some ASCII art: Key X: original data Input pld1 file
Current output optionstimebase: ctd
This illustrates @hvdosser's problem where oxygen is on a slightly offset timestamp to ctd so it is all dropped timebase: nav, keep_vars: [ctd, oxy]
With keep_vars, as @jklymak identified, any line with a non-nan value for one of the sensors in keep_vars is kept. Resultant files are big, particularly if one of the keep_vars has a high sampling rate. This is the method we currently implement at VOTO, as the scientists we supply data to don't want any interpolation of data. Some of our delayed mode datasets are almost 10 GB though! Not ideal. Potential improvementstimebase:ctd linear interpolation
This solves the problem nicely for oversampling sensors like oxygen. However, this would be a severe distortion of e.g. a methane sensor that records a sample every 10 minutes and now has an apparent sample every second timebase:ctd nearst neighbour
This avoids over-interpolation of slow sampling sensors, but the downsampling of more fast sampling sensors may be less preferable. May also be more complex to implement. Whatever we decide to do going forward, I recommend that it is either controllable from the yaml, like the keep_vars solution, or operates on the l0 timeseries as an extra step, so that an end user can get the final l0 timeseries without any interpolation if they want. We should also explain these various options in the documentation. Perhaps using diagrams like the ones in this comment. |
Folks who don't want any interpolation or alignment of sensors have two options (that I agree should be documented)
For interpolation options, totally fine with both linear and nearest. I'd never use nearest, but... The original reason for using the CTD for our data sets was that indeed oxygen had much more data, but it was all repeated and was really being sampled at 1 Hz or 2 Hz or something, and seemed a silly error of Alseamars to be sampling it at 48 Hz or whatever they were doing. Of course if someone has a real data set that needs sampling at higher frequency than the CTD, they should be using that as the timebase. |
This sounds like a good way to implement the interpolation. I've started working on a PR. |
In testing the new code for this pull request, I found an issue with processing the delayed mode SeaExplorer time series data for missions for which certain sensors (oxygen in this case) are severely oversampled. These missions end up with delayed mode data files that contain fewer actual (non-nan) data points than the realtime files. In other words, we are losing data during the processing.
Currently, the
dropna
function is used to remove the oversampled oxygen data when converting the raw data. Thedropna
function is working correctly, however note that the resulting data has many nan values in it, for both the CTD and optics. These nan values will often not co-occur.I think the problem in the processing is caused by using the
GPCTD_TEMPERATURE
as the default time base inseaexplorer.py
. This variable contains nan values that are not all co-located with the nan values in the oxygen and optical variables. It's desirable to use the CTD as the time base, but we may need to do some interpolation to avoid losing data when the other variables are mapped onto this base.The text was updated successfully, but these errors were encountered: