Missing values based on interval #65

Open
brian-may opened this issue Apr 3, 2013 · 3 comments

@brian-may

It would be convenient to be able to configure multigraph to detect when data is missing based on the interval between values rather than providing a special value.

I think there might be a general solution to this problem. It would involve truncating each date to a given field and interval size, then checking that the current and previous values fall in consecutive intervals. It would not matter where a value occurs within its interval, only that one exists. An additional parameter could require that values fall exactly on the interval start to count as not missing. In that mode, the check would be that the value equals its truncated value and that the current and previous values are consecutive.
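As a rough illustration of the truncate-and-compare idea, here is a minimal Python sketch. It is not Multigraph code: the names `bucket`, `bucket_start`, and `has_gap` are hypothetical, timestamps are assumed to be naive datetimes, and fixed-length fields are counted from an arbitrary 1970-01-01 origin.

```python
from datetime import datetime, timedelta

# Arbitrary origin used only to number fixed-length buckets (an assumption).
EPOCH = datetime(1970, 1, 1)
FIXED = {
    "second": timedelta(seconds=1),
    "minute": timedelta(minutes=1),
    "hour": timedelta(hours=1),
    "day": timedelta(days=1),
}

def bucket(ts, interval, field):
    """Index of the interval-sized bucket that contains ts."""
    if field == "month":
        return (ts.year * 12 + ts.month - 1) // interval
    if field == "year":
        return ts.year // interval
    return (ts - EPOCH) // (interval * FIXED[field])

def bucket_start(ts, interval, field):
    """ts truncated to the start of its bucket."""
    index = bucket(ts, interval, field)
    if field == "month":
        months = index * interval
        return datetime(months // 12, months % 12 + 1, 1)
    if field == "year":
        return datetime(index * interval, 1, 1)
    return EPOCH + index * interval * FIXED[field]

def has_gap(last, current, interval, field, strict=False):
    """True if a bucket was skipped between last and current.

    With strict=True, a value that is not exactly on its bucket start
    is also treated as missing.
    """
    if strict and current != bucket_start(current, interval, field):
        return True
    return bucket(current, interval, field) - bucket(last, interval, field) > 1
```

With a 15-minute interval, values at 10:02 and 10:29 fall in consecutive buckets, so no gap is reported unless `strict` is set, in which case 10:29 fails because it is not on the 10:15 boundary.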

@brian-may
Author

I am not really sure about the attribute names, but three values are required: the interval, the field to apply the interval to, and whether to require that the value lands on the interval start. This could probably accommodate number-based axes by leaving the field off. I am guessing that these attributes would be applied to the variables in the data section.

interval would be a number from 1 to field max
field would be second, minute, hour, day, month, year
strict would enforce that value must be at interval start when true

<data>
  ...
  <variables>
    <variable interval="15" intervalField="minute" intervalStrict="true">
    </variable>
  </variables>
</data>

@embeepea
Member

embeepea commented Apr 5, 2013

I think that the attributes should be on the <variables> tag, rather than the <variable> tag, because the notion of "missing" applies to the entire n-tuple of values, not just a single column. There is of course just one column that the repeating interval applies to, but that will always be column 0, because Multigraph assumes that column 0 contains the horizontal axis variable. (Or, more to the point, the variable in column 0 is the only one that is assumed to be sorted in increasing order.)

Also, I think we could combine the interval and intervalField attributes into one, since there are other attributes elsewhere in the MUGL spec that indicate a length of time or a distance along an axis. These values are called data measures in the docs, and are indicated by a number followed by a letter denoting a unit. For example, "1M" means one month, "1Y" one year, etc.

So, your example above would become:

<variables interval="15m" intervalStrict="true">
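To make the combined syntax concrete, here is a small Python sketch of splitting a data measure such as "15m" into a count and a unit. It is only an illustration: `parse_measure` is a hypothetical name, and apart from "M", "Y", and the "m" used above for minutes, the unit letters in the table are assumptions that should be checked against the MUGL docs.

```python
import re

# "M" (month), "Y" (year) and "m" (minutes, as in "15m") come from this thread.
# "D", "H" and "s" are assumed here for illustration only.
UNITS = {
    "Y": "year",
    "M": "month",
    "D": "day",
    "H": "hour",
    "m": "minute",
    "s": "second",
}

def parse_measure(text):
    """Split a measure like '15m' into (15, 'minute')."""
    match = re.fullmatch(r"(\d+)([A-Za-z])", text.strip())
    if match is None or match.group(2) not in UNITS:
        raise ValueError(f"not a recognized data measure: {text!r}")
    return int(match.group(1)), UNITS[match.group(2)]
```

Because the unit letter is case-sensitive in this sketch, "1M" and "1m" parse to different units, which is what lets a single attribute replace the separate interval and intervalField attributes.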

I wonder if we also need a few more attributes, though. In particular:

  • something to indicate where the expected values should fall within each interval. For example, with an interval
    of "1M" (one month), should the values fall on the first day of the month, the 15th of the month, or somewhere
    else? This is needed because in order to actually plot anything (for example, a renderer that draws a grey bar
    over missing values --- such a renderer doesn't actually exist yet but we have talked about writing it),
    Multigraph needs to know a specific point on the axis corresponding to the value.

    I propose we add an optional attribute called align that indicates the data value alignment. The value of this
    attribute would be a data value. So, for example, in the case of a datetime data type with interval="1M",
    the setting align="2010-01-13" would indicate that data values are expected to occur on the 13th of every
    month. The default value for align could be "2000-01-01" (or any other January 1 date).

  • something to indicate the limits of the period of record for the data. I'm not sure if this is needed, but it's
    something that Rich and I have talked about. There could be min and max attributes that indicate the
    minimum and maximum values for which data is expected, so Multigraph would not count any values < min
    or > max as missing. Do you think this is needed?

So, to summarize, assuming we add all these attributes, here's an example of how it would all look:

<data>
  ...
  <variables interval="1M" intervalStrict="true" align="2010-01-13" min="1995-01-13" max="2013-03-15">
    <variable.../>
    <variable.../>
    ...
  </variables>
</data>

This would indicate that this data set should contain regular monthly values, on the 13th of each month, from 1995-01-13 to 2013-03-15; any month during this period that does not contain a value on the 13th would cause Multigraph to insert a missing value.
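A minimal Python sketch of that behavior, for monthly intervals only and in strict mode, might look like the following. The function names are hypothetical, the interval is passed as a plain number of months rather than a parsed "1M", and the align day is assumed to be 28 or lower so that it exists in every month.

```python
from datetime import date

def add_months(d, n):
    """Shift d by n months, keeping the day of month (assumes d.day <= 28)."""
    months = d.year * 12 + d.month - 1 + n
    return date(months // 12, months % 12 + 1, d.day)

def expected_points(align, interval_months, lo, hi):
    """Yield every date of the form align + k * interval_months within [lo, hi]."""
    months_between = (lo.year - align.year) * 12 + (lo.month - align.month)
    k = months_between // interval_months  # floor division also handles lo < align
    point = add_months(align, k * interval_months)
    if point < lo:
        point = add_months(point, interval_months)
    while point <= hi:
        yield point
        point = add_months(point, interval_months)

def missing_points(observed, align, interval_months, lo, hi):
    """Expected dates in [lo, hi] that have no exactly matching observation."""
    seen = set(observed)
    return [p for p in expected_points(align, interval_months, lo, hi) if p not in seen]
```

For example, with align 2010-01-13, a one-month interval, and observations on the 13th of November, December, February, and March between 2012-11-13 and 2013-03-15, `missing_points` reports only 2013-01-13.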

How does this sound?

@brian-may
Author

I like the shortened interval syntax a lot. Anything that increases brevity without sacrificing clarity is good in my book. Along those lines, could we change align to offset and use the same syntax?

<data>
  ...
  <variables interval="1M" intervalStrict="true" offset="13d" min="1995-01-13" max="2013-03-15">
    <variable.../>
    <variable.../>
    ...
  </variables>
</data>

Wouldn't all of the values outside of the min and max range automatically render as missing because there is simply no data there?
