-
Notifications
You must be signed in to change notification settings - Fork 48
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BUG] ResidualDouble regressor does not handle transforms correctly. #492
Comments
Working on a fix here: https://github.com/meh2135/skpro |
After thinking about it a bit, I'm even more convinced that there really isn't a good way of reasonably estimating a scale parameter by allowing the user to specify an arbitrary transform of the residual, and modeling that. I'm going to remove that option in my PR. Re the adjustment for absolute for student's t, you can see the issue here: from skpro.distributions import TDistribution
from scipy import stats
from matplotlib import pyplot as plt
import numpy as np
n_samples = 40_000
dof_vec = np.linspace(2.1, 7.0, 20)
ratio_vec = []
for dof in dof_vec:
# Generate data from a t-distribution
gen_dist = stats.t(df=dof, loc=0.0, scale=2.0)
gen_samp = gen_dist.rvs(n_samples)
# Estimate sigma from the data using the naive approach used in ResidualDouble, ie sigma = mean(abs(y))
naive_sigma_est = np.abs(gen_samp).mean()
# Plug that in to a skpro t-distribution'
naive_skpro_dist = TDistribution(df=dof, mu=0.0, sigma=naive_sigma_est)
# Sample from that skpro distribution
naive_skpro_samples = naive_skpro_dist.sample(n_samples)
# Calculate the variance of the data and the variance of the skpro samples
gen_var = gen_samp.var()
naive_skpro_var = naive_skpro_samples[0].var()
# print(f"Naive sigma estimate: {naive_sigma_est:.4f}")
# print(f"Naive skpro var: {naive_skpro_var:.4f}")
# print(f"Ratio of variances: {naive_skpro_var / gen_var:.4f}")
ratio_vec.append(naive_skpro_var / gen_var)
plt.figure()
plt.plot(dof_vec, ratio_vec)
plt.figure()
plt.plot(dof_vec, np.sqrt(ratio_vec)) I don't see the correction being made anywhere in the ResidualDouble code. That said, the quantiles that come out of it look decent, so maybe I am missing something. FYI: @fkiraly |
@meh2135, thanks a lot for your report and work on this! I agree that the inverse transform is missing in However, I would split work on adding back the missing inverse transform from modifications of the algorithm. There is a simple software engineering reason for that, we should not change anything in the package that may break user code - at least not without giving warnings for a few months and then changing it. Removing an option would be such a change that could break code of users that use the option. On the content side, that is, regarding the algorithm itself, I get what you are saying about violated statistical assumptions, we should keep in mind though that the algorithm is a simple heuristic, and also a common "rule of thumb" estimator. Of course there are better ways to estimate the residual distribution, under statistical assumptions. For this, I would recommend discussing more systematically the target state first. Perhaps writing down a mock docstring of what you think the "best end state" should be. Because if you implement first, we might end up writing unnecessary code if in the end we agree on a different docstring. |
Regarding this, I think this is simply a model mismatch issue. Your data were not t-distributed residuals to start with, so it may or may not be that a deviation from an incorrect assumption (t-distribution with certain df) leads to a better fit! That is also related to what I was saying about this algorithm being a heuristic. In case of model mismatch - and in real life nothing ever is exactly normally or t-distributed (for single samples, i.e., large sample statistics approximately do follow limit distributions etc). So, if you have model mismatch anyway, one or the other version of a heuristic with model mistmatch may be better. The "machine learner" approach would be to take algorithms as they are and not be too concerned with "wrong statistical assumptions", but do stringent statistics for empirical performance evaluation. |
@meh2135, are you still working on this? Would be really nice to have! Feel free to make a Pr with an incomplete version, and someone else can complete it, in case you do not have time to finish. |
@fkiraly hey sorry I've been so slow! I'm onsite at my job these past few weeks, but I'll be home next week and should have bandwidth to finish it then. |
No problem, just checking that it is not just a forgotten press of the pull request button or similar. No rush, and thanks for contributing! |
Draft PR ready for review |
Describe the bug
In skpro.regression.residual.ResidualDouble, for the argument residual_trafo="squared", we should be taking the sqrt of the scale estimate before feeding it into the scale parameters, possibly followed by a further, distribution specific correction.
For instance, if we're using an error dist X~T(mu=0, dof=v) distribution, then for abs E[|X|] = ComplicatedFunc(dof) * sigma, and E[X^2]=Var(X)=sigma^2 * v / (v-2)
derivation from stack exchange
For arbitrary transformations, this quickly gets tricky, even for those limited distributions. Further, arbitrary transformations can't easily be checked for monotonicity or positivity. I would advise removing that capability.
To Reproduce
Running the following should yield uniform quantiles, but that totally breaks down for the squared transform.
Confusion
I don't understand why it does yield uniform quantiles for the absolute transform for the t distribution. I would expect them to deviate from uniform because there is no degrees of freedom correction. Can anyone explain why that works?
Tacking the following code onto the end of the block above, we see that "raw" is the most uniform. That makes no sense to me.
The text was updated successfully, but these errors were encountered: