
rounding-off comparison with B-spline's bounds #116

Open
wants to merge 1 commit into master

Conversation

arbazkhan002

This one's a minor fix that makes the bounds check use an approximate comparison for floats.

@njsmith
Member

njsmith commented Jan 10, 2018

Using isclose like this makes me nervous – its definition of "close" is pretty arbitrary, which is fine in a test suite, but dangerous when embedded inside a library like this.

Can you give some more details about how you ran into this and why you need it?

@arbazkhan002
Author

It's the classic float-comparison error, hit while using the dmatrices API, that prompted me to push this change. The error said:

y, X = dmatrices(formula, data, return_type='dataframe')
  File "/usr/local/lib/python3.5/site-packages/patsy/highlevel.py", line 310, in dmatrices
    NA_action, return_type)
  File "/usr/local/lib/python3.5/site-packages/patsy/highlevel.py", line 165, in _do_highlevel_design
    NA_action)
  File "/usr/local/lib/python3.5/site-packages/patsy/highlevel.py", line 70, in _try_incr_builders
    NA_action)
  File "/usr/local/lib/python3.5/site-packages/patsy/build.py", line 689, in design_matrix_builders
    factor_states = _factors_memorize(all_factors, data_iter_maker, eval_env)
  File "/usr/local/lib/python3.5/site-packages/patsy/build.py", line 370, in _factors_memorize
    factor.memorize_finish(factor_states[factor], which_pass)
  File "/usr/local/lib/python3.5/site-packages/patsy/eval.py", line 561, in memorize_finish
    state["transforms"][obj_name].memorize_finish()
  File "/usr/local/lib/python3.5/site-packages/patsy/splines.py", line 223, in memorize_finish
    lower_bound))
ValueError: some knot values ([ 60866.115]) fall below lower bound (60866.114999999998)

Now clearly 60866.115 > 60866.114999999998, yet you see that error anyway, just because of floats.

I see you make a valid point that isclose is arbitrary. The other option I considered was
np.any(inner_knots - lower_bound < -1 * SMALL_POSITIVE_CONSTANT), but that again requires defining this SMALL_POSITIVE_CONSTANT, or letting the user define it as an argument. If we go by the latter route, isclose can take rtol and atol arguments to the same effect. What do you think?
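
For reference, a minimal sketch of the two options being weighed here. The tolerance values are placeholders and the variable names inner_knots / lower_bound are just reused from the discussion; this is not the actual diff:

import numpy as np

lower_bound = 60866.114999999998
inner_knots = np.array([np.nextafter(lower_bound, 0.0)])  # one ulp below the bound

# Option 1 (this PR's idea): only flag knots that are below the bound AND
# not "close" to it; rtol/atol could be exposed to the user as arguments.
bad_isclose = (inner_knots < lower_bound) & ~np.isclose(inner_knots, lower_bound)

# Option 2: an explicit absolute tolerance instead of np.isclose.
SMALL_POSITIVE_CONSTANT = 1e-9   # placeholder; would have to be chosen or user-supplied
bad_tol = inner_knots - lower_bound < -SMALL_POSITIVE_CONSTANT

print(bad_isclose, bad_tol)      # neither flags this one-ulp discrepancy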

@njsmith
Member

njsmith commented Jan 10, 2018

Are you using the default knots and bounds? Are you able to share the data that's triggering this?

@arbazkhan002
Author

arbazkhan002 commented Jan 11, 2018

Yes, the knots and bounds used are both the defaults; however, df, degree and include_intercept are passed on to BS. Unfortunately I can't share the data. I can confirm that with either of the comparison options above the error went away, which means the inner_knots were just marginally greater than the lower bound.

@njsmith
Member

njsmith commented Jan 11, 2018

Looking at the code, it appears that what we're doing is comparing np.min(data) to some percentiles of data. It's extremely weird that np.percentile is returning a value that is smaller than np.min(data), especially since we're using non-zero percentiles.

Can you try adding

print(inner_knots)
print(knot_quantiles)

to BS.memorize_finish just before the error, and paste what you get?

@arbazkhan002
Author

It's actually not returning a smaller value; you can see that the knot reported in the output above is just greater than the bound. Nonetheless, I will paste these values shortly if I can reproduce it.

@njsmith
Member

njsmith commented Jan 11, 2018

Ah, no, I missed that, but it's just an issue with the numpy array printing rounding off the output. Floats do follow rules; you don't just randomly get a < b returning the wrong result because a is only a little bit bigger than b :-).

However, this means that my suggestion of printing the value isn't going to produce any useful output either, because numpy will round them off for display again :-(. So actually, can you try:

np.set_printoptions(precision=20)
print(inner_knots)
print(knot_quantiles)

?
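
A constructed example of this display effect (made-up values, not the reporter's data): two floats one ulp apart can print identically at numpy's default precision, while the comparison between them is still exact.

import numpy as np

lower_bound = 60866.114999999998
knot = np.nextafter(lower_bound, 0.0)        # one ulp below the bound
print(np.array([knot]))                       # rounded off in display, e.g. [60866.115]
np.set_printoptions(precision=20)
print(np.array([knot]), repr(lower_bound))    # now the ulp-sized gap is visible
print(knot < lower_bound)                     # True: the comparison itself is exact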

@arbazkhan002
Author

arbazkhan002 commented Jan 11, 2018

Sorry, I can't seem to get the data back again. Floats won't always return wrong results for you; I think it depends on what values you are dealing with (the binary conversion has to kick in).

In [29]: float(60866.114999999999) > float(60866.114999999998)
Out[29]: False

In [30]: float(1.222222222223)>float(1.222222222222)
Out[30]: True

I guess I will wait till I can reproduce the issue again and get back

@njsmith
Member

njsmith commented Jan 11, 2018

In [29]: float(60866.114999999999) > float(60866.114999999998)
Out[29]: False

That's because those two literals got rounded to the same float, so they compare as ==. But this error won't happen if np.percentile(data, x) == np.min(data); we have to actually have np.percentile(data, x) < np.min(data).

@njsmith
Member

njsmith commented Jan 12, 2018

numpy/numpy#10373 gives an example of a case where this could happen.

I'm thinking the right solution here is to clip the auto-generated knots to fall inside the lower/upper bounds.
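
A rough sketch of that clipping approach, reusing names from the discussion rather than the actual patsy code:

import numpy as np

# Made-up values standing in for np.percentile(data, knot_quantiles) output
# that has landed a hair outside the bounds.
inner_knots = np.array([60866.11499999999, 60870.0, 60875.2])
lower_bound, upper_bound = 60866.114999999998, 60880.0

# Clip once, right after the knots are generated; the existing strict checks
# against lower_bound/upper_bound then can't fire for auto-generated knots.
inner_knots = np.clip(inner_knots, lower_bound, upper_bound)
print(inner_knots)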

@arbazkhan002
Author

I doubt that clipping or any absolute comparison would help (otherwise we wouldn't get an incorrect percentile in the first place). I think at the end of the day it boils down to having an approximate comparison somewhere.

@njsmith
Member

njsmith commented Jan 15, 2018

We probably get the incorrect percentile because of some rounding error when the percentile code tries to make a weighted average between two entries in your data. E.g. if you only have 50 data points, then the smallest is the 0th percentile, and the second-smallest is roughly the 2nd percentile, and if we do np.percentile(data, 1) to ask for the 1st percentile then... there is no entry in the data that corresponds to that, so it has to approximate it by averaging together the two closest points. That rounding error can happen even when those two values are actually the same, as in the numpy bug report linked above.

You seem to be under the impression that floating point math is all about vague clouds of points and impossible-to-predict results. This is a common simplification, but it's not true :-). Each floating point value represents a specific real number. When you perform some operation like multiplication or addition (like when computing a weighted average), there is always an exact mathematical answer, which is also a real number... it's just that it might not be one of the real numbers that can be written down as a floating point number, so the operation instead picks some nearby number that is representable. But that result is then represented exactly. Comparisons are performed exactly on the given values. Etc.

If we used something like np.clip to nudge the knots into the range of the lower/upper bound, then they'll stay there :-)
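
To make the interpolation point concrete, a small made-up example (nothing to do with the reporter's data): with 50 points, the sorted values sit at percentiles 100*i/49, so the 1st percentile has to be interpolated between the two smallest points. When those two points happen to be equal, the rounding in that weighted average is the situation the numpy report above describes.

import numpy as np

data = np.linspace(1.0, 2.0, 50)          # 50 made-up, already-sorted points
# data[0] is the 0th percentile, data[1] is roughly the 2.04th percentile,
# so the 1st percentile is a weighted average of data[0] and data[1].
print(data[0], np.percentile(data, 1), data[1])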

@josef-pkt

josef-pkt commented Oct 31, 2018

Just to mention that I have seen the same exception a few times. My guess was that there are a few nulps of difference in the floating point computations. AFAIR it doesn't happen with "nice" numbers, but it happened with some simulated sin/cos data.
Unfortunately I can't find my statsmodels comments about this right now; see statsmodels/statsmodels#5296 (which has too many comments, too many issues, and too many moving parts).

What I did as a workaround was to set the boundary points so that they are 1e-15 (or something like that) outside of the data interval defined by min and max.
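
A sketch of that kind of caller-side workaround; the padding scale here is a judgment call (as noted above), and the widened interval is passed through bs's lower_bound/upper_bound arguments:

import numpy as np
from patsy import dmatrix

x = np.linspace(60866.115, 60999.3, 100)   # made-up data
eps = 1e-12 * np.abs(x).max()              # tiny, scale-aware padding
lb, ub = x.min() - eps, x.max() + eps      # bounds slightly outside the data
# Passing the widened bounds explicitly means a one-ulp rounding in the
# percentile-based knots can no longer push a knot outside them:
dm = dmatrix("bs(x, df=4, lower_bound=lb, upper_bound=ub)", {"x": x})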

@matthewwardrop force-pushed the master branch 2 times, most recently from b07ba3f to 48fd2e4 on September 5, 2021
@matthewwardrop
Collaborator

matthewwardrop commented Oct 20, 2021

fyi: This is not an issue in formulaic, which allows extrapolation of the basis splines beyond the boundary knots when so configured. It's likely that this patch would work fine, but I would have to look more closely at the code to see. I also note that patsy's implementation has some weird behaviour on the boundaries anyway (I haven't looked into whether it is worth fixing).

Let me know if you are still interested in getting this merged into patsy!
