
rounding-off comparison with B-spline's bounds #116

Open
wants to merge 1 commit into master

Conversation

arbazkhan002

This one's a minor fix that makes the bounds check use an approximate comparison for floats.

@njsmith
Member

njsmith commented Jan 10, 2018

Using isclose like this makes me nervous – its definition of "close" is pretty arbitrary, which is fine in a test suite, but dangerous when embedded inside a library like this.

Can you give some more details about how you ran into this and why you need it?

@arbazkhan002
Author

It's the classic float-comparison error, hit while using the dmatrices API, that prompted me to push this change. The error said:

y, X = dmatrices(formula, data, return_type='dataframe')
  File "/usr/local/lib/python3.5/site-packages/patsy/highlevel.py", line 310, in dmatrices
    NA_action, return_type)
  File "/usr/local/lib/python3.5/site-packages/patsy/highlevel.py", line 165, in _do_highlevel_design
    NA_action)
  File "/usr/local/lib/python3.5/site-packages/patsy/highlevel.py", line 70, in _try_incr_builders
    NA_action)
  File "/usr/local/lib/python3.5/site-packages/patsy/build.py", line 689, in design_matrix_builders
    factor_states = _factors_memorize(all_factors, data_iter_maker, eval_env)
  File "/usr/local/lib/python3.5/site-packages/patsy/build.py", line 370, in _factors_memorize
    factor.memorize_finish(factor_states[factor], which_pass)
  File "/usr/local/lib/python3.5/site-packages/patsy/eval.py", line 561, in memorize_finish
    state["transforms"][obj_name].memorize_finish()
  File "/usr/local/lib/python3.5/site-packages/patsy/splines.py", line 223, in memorize_finish
    lower_bound))
ValueError: some knot values ([ 60866.115]) fall below lower bound (60866.114999999998)

Now clearly 60866.115 > 60866.114999999998, yet you see that error anyway, just because of floats.

I see you make a valid point that isclose is arbitrary. The other option I considered was
np.any(inner_knots - lower_bound < -1 * SMALL_POSITIVE_CONSTANT), but that again requires defining this SMALL_POSITIVE_CONSTANT, or letting the user define it as an argument. If we go by the latter route, isclose can take rtol and atol arguments to the same effect. What do you think?
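
For reference, a minimal sketch of the two options being weighed here. The tolerance values are placeholders and the variable names inner_knots / lower_bound are just reused from the discussion; this is not the actual diff:

import numpy as np

lower_bound = 60866.114999999998
inner_knots = np.array([np.nextafter(lower_bound, 0.0)])  # one ulp below the bound

# Option 1 (this PR's idea): only flag knots that are below the bound AND
# not "close" to it; rtol/atol could be exposed to the user as arguments.
bad_isclose = (inner_knots < lower_bound) & ~np.isclose(inner_knots, lower_bound)

# Option 2: an explicit absolute tolerance instead of np.isclose.
SMALL_POSITIVE_CONSTANT = 1e-9   # placeholder; would have to be chosen or user-supplied
bad_tol = inner_knots - lower_bound < -SMALL_POSITIVE_CONSTANT

print(bad_isclose, bad_tol)      # neither flags this one-ulp discrepancy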

@njsmith
Member

njsmith commented Jan 10, 2018

Are you using the default knots and bounds? Are you able to share the data that's triggering this?

@arbazkhan002
Author

arbazkhan002 commented Jan 11, 2018

Yes, the knots and bounds used are both the defaults; however, df, degree and include_intercept are passed on to BS. Unfortunately I can't share the data. I can confirm that with either of the comparison options above the error went away, which means the inner_knots were just marginally greater than the lower bound.

@njsmith
Member

njsmith commented Jan 11, 2018

Looking at the code, it appears that what we're doing is comparing np.min(data) to some percentiles of data. It's extremely weird that np.percentile is returning a value that is smaller than np.min(data), especially since we're using non-zero percentiles.

Can you try adding

print(inner_knots)
print(knot_quantiles)

to BS.memorize_finish just before the error, and paste what you get?

@arbazkhan002
Author

It's actually not returning a smaller value; you can see that the knot reported in the output above is just greater than the bound. Nonetheless, I will paste these values shortly if I can reproduce it.

@njsmith
Member

njsmith commented Jan 11, 2018

Ah, no, I missed that, but it's just an issue with the numpy array printing rounding off the output. Floats do follow rules; you don't just randomly get a < b returning the wrong result because a is only a little bit bigger than b :-).

However, this means that my suggestion of printing the value isn't going to produce any useful output either, because numpy will round them off for display again :-(. So actually, can you try:

np.set_printoptions(precision=20)
print(inner_knots)
print(knot_quantiles)

?
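
A constructed example of this display effect (made-up values, not the reporter's data): two floats one ulp apart can print identically at numpy's default precision, while the comparison between them is still exact.

import numpy as np

lower_bound = 60866.114999999998
knot = np.nextafter(lower_bound, 0.0)        # one ulp below the bound
print(np.array([knot]))                       # rounded off in display, e.g. [60866.115]
np.set_printoptions(precision=20)
print(np.array([knot]), repr(lower_bound))    # now the ulp-sized gap is visible
print(knot < lower_bound)                     # True: the comparison itself is exact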

@arbazkhan002
Author

arbazkhan002 commented Jan 11, 2018

Sorry, I can't seem to get the data back again. Floats won't always return wrong results for you; I think it depends on what values you are dealing with (the binary conversion has to kick in).

In [29]: float(60866.114999999999) > float(60866.114999999998)
Out[29]: False

In [30]: float(1.222222222223)>float(1.222222222222)
Out[30]: True

I guess I will wait till I can reproduce the issue again and get back

@njsmith
Member

njsmith commented Jan 11, 2018

In [29]: float(60866.114999999999) > float(60866.114999999998)
Out[29]: False

That's because those two literals got rounded to the same float, so they compare as ==. But this error won't happen if np.percentile(data, x) == np.min(data); we have to actually have np.percentile(data, x) < np.min(data).

@njsmith
Member

njsmith commented Jan 12, 2018

numpy/numpy#10373 gives an example of a case where this could happen.

I'm thinking the right solution here is to clip the auto-generated knots to fall inside the lower/upper bounds.
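
A rough sketch of that clipping approach, reusing names from the discussion rather than the actual patsy code:

import numpy as np

# Made-up values standing in for np.percentile(data, knot_quantiles) output
# that has landed a hair outside the bounds.
inner_knots = np.array([60866.11499999999, 60870.0, 60875.2])
lower_bound, upper_bound = 60866.114999999998, 60880.0

# Clip once, right after the knots are generated; the existing strict checks
# against lower_bound/upper_bound then can't fire for auto-generated knots.
inner_knots = np.clip(inner_knots, lower_bound, upper_bound)
print(inner_knots)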

@arbazkhan002
Author

I doubt that clipping or any absolute comparison would help (otherwise we wouldn't get an incorrect percentile in the first place). I think at the end of the day it boils down to having an approximate comparison somewhere.

@njsmith
Member

njsmith commented Jan 15, 2018

We probably get the incorrect percentile because of some rounding error when the percentile code tries to make a weighted average between two entries in your data. E.g. if you only have 50 data points, then the smallest is the 0th percentile, and the second-smallest is roughly the 2nd percentile, and if we do np.percentile(data, 1) to ask for the 1st percentile then... there is no entry in the data that corresponds to that, so it has to approximate it by averaging together the two closest points. That rounding error can happen even when those two values are actually the same, as in the numpy bug report linked above.

You seem to be under the impression that floating point math is all about vague clouds of points and impossible-to-predict results. This is a common simplification, but it's not true :-). Each floating point value represents a specific real number. When you perform some operation like multiplication or addition (like when computing a weighted average), there is always an exact mathematical answer, which is also a real number... it's just that it might not be one of the real numbers that can be written down as a floating point number, so the operation instead picks some nearby number that is representable. But that result is then represented exactly. Comparisons are performed exactly on the given values. Etc.

If we used something like np.clip to nudge the knots into the range of the lower/upper bound, then they'll stay there :-)
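
To make the interpolation point concrete, a small made-up example (nothing to do with the reporter's data): with 50 points, the sorted values sit at percentiles 100*i/49, so the 1st percentile has to be interpolated between the two smallest points. When those two points happen to be equal, the rounding in that weighted average is the situation the numpy report above describes.

import numpy as np

data = np.linspace(1.0, 2.0, 50)          # 50 made-up, already-sorted points
# data[0] is the 0th percentile, data[1] is roughly the 2.04th percentile,
# so the 1st percentile is a weighted average of data[0] and data[1].
print(data[0], np.percentile(data, 1), data[1])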

@josef-pkt

josef-pkt commented Oct 31, 2018

Just to mention that I have seen the same exception a few times. My guess was that there are a few nulps of difference in the floating point computations. AFAIR it doesn't happen with "nice" numbers, but it happened with some simulated sin/cos data.
Unfortunately I can't find my statsmodels comments about this right now; see statsmodels/statsmodels#5296 (which has too many comments, too many issues, and too many moving parts).

What I did as a workaround was to set the boundary points so that they are 1e-15 (or something like that) outside of the data interval defined by min and max.
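
A sketch of that kind of caller-side workaround; the padding scale here is a judgment call (as noted above), and the widened interval is passed through bs's lower_bound/upper_bound arguments:

import numpy as np
from patsy import dmatrix

x = np.linspace(60866.115, 60999.3, 100)   # made-up data
eps = 1e-12 * np.abs(x).max()              # tiny, scale-aware padding
lb, ub = x.min() - eps, x.max() + eps      # bounds slightly outside the data
# Passing the widened bounds explicitly means a one-ulp rounding in the
# percentile-based knots can no longer push a knot outside them:
dm = dmatrix("bs(x, df=4, lower_bound=lb, upper_bound=ub)", {"x": x})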

@matthewwardrop force-pushed the master branch 2 times, most recently from b07ba3f to 48fd2e4 on September 5, 2021
@matthewwardrop
Collaborator

matthewwardrop commented Oct 20, 2021

fyi: This is not an issue in formulaic, which allows extrapolation of the basis splines beyond the boundary knots when so configured. It's likely that this patch would work fine, but I would have to look more closely at the code to see. I also note that patsy's implementation has some weird behaviour on the boundaries anyway (I haven't looked into whether it is worth fixing).

Let me know if you are still interested in getting this merged into patsy!
