MLJ API for Missing Imputation ? #950
@sylvaticus Thanks for raising your use case. This kind of issue (the first one) has come up in a few cases and, to be honest, while there are possible paths within the API, I've never been completely happy with any of them. For now, let me just record what is possible at the moment and try to add more later. I'm sorry, I haven't had a chance to look at your POC yet.
Your first question, if I understand correctly, is what to do if you have a "one-shot" transformer which has byproducts of "training" that you want to inspect. You call it "complementary data"; it's the report in MLJ lingo. Here are options within the current MLJ API (the same options mentioned in the clustering thread).
(Note that presently,
From the point of view of model composition,
As an aside, there have been suggestions to have a `fit_transform(model, X) = transform(machine(model, X) |> fit!, X)` but this is still going to return both (
Regarding your second question, I don't quite understand what is meant by multiple imputation. Maybe you can point me to an example of this somewhere. |
Thanks @ablaom, I'll look into it. |
Thinking about this some more today, I think
The multiple imputations apparatus looks interesting. I don't think this is impossible, but realistically, it's out of scope for MLJ integration at present. |
Okay, I have some further thoughts on how we can improve the API for models that don't generalize to new data, such as some imputers and some clustering algorithms, and where there are byproducts of the computation you want accessible. As above, I suggest these be implemented as
Currently, only the
In implementation:
```julia
function MLJModelInterface.transform(my_imputer::MyImputer, ::Nothing, X)
    ...
    return (Ximputed, report)
end

# new trait to flag the fact that `transform` is returning extra "report" data:
MLJModelInterface.reporting_operations(::Type{<:MyImputer}) = (:transform,)
```
User workflow:
```julia
mach = machine(MyImputer(...)) # No need to `fit!` here
X = ... # some data to impute
Ximputed = transform(mach, X)
report(mach) # returns extra stats about the imputation
```
If you don't care for the report, you can just do
```julia
Ximputed = transform(machine(MyImputer()), X)
```
or, if we add the overloading
```julia
Ximputed = transform(MyImputer, X)
```
but this last assumes
As proposed, this is non-breaking, but requires the addition of a trait. I'm working on a POC but it would be good to get any feedback before I get too far along. In pipelines, the report would be accessible in the usual way (something like |
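To make the proposed return convention concrete, here is a minimal, package-free sketch in plain Julia. It only mirrors the shape of the proposal (no fitresult, and `transform` returning an `(output, report)` pair). `MeanImputer` and this local `transform` are hypothetical stand-ins, not MLJModelInterface or BetaML definitions, and nothing here registers a `reporting_operations` trait.

```julia
using Statistics  # stdlib; provides `mean`

# Hypothetical stand-in for a one-shot imputer; not a BetaML or MLJ type.
struct MeanImputer end

# Same shape as the proposed `transform(model, ::Nothing, X)`: there is no
# fitresult, and the return value is an (output, report) pair.
# Assumes every column has at least one observed value.
function transform(::MeanImputer, ::Nothing, X::AbstractMatrix)
    # column means computed over the non-missing entries only
    col_means = [mean(skipmissing(col)) for col in eachcol(X)]
    # replace each missing entry with the mean of its column
    Ximputed = [ismissing(X[i, j]) ? col_means[j] : X[i, j]
                for i in axes(X, 1), j in axes(X, 2)]
    # byproducts of the computation that a user may want to inspect
    report = (n_imputed = count(ismissing, X), col_means = col_means)
    return (Ximputed, report)
end

X = [1.0 missing; 3.0 4.0; missing 6.0]
Ximputed, report = transform(MeanImputer(), nothing, X)
```

In MLJ itself the machine would presumably split this pair, returning the first element from `transform(mach, X)` and exposing the second through `report(mach)`, as in the user workflow described above.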
Okay, the proposal referenced above has now been implemented. You will need to raise your lower bound to MLJModelInterface = "1.6" to make use of it in BetaML. Let me know if further guidance is needed. |
Poster's question has been addressed. |
[Possibly related to the API discussion on Clustering Models]
I am in the process of implementing several missing imputers in a new BetaML `Imputation` submodule, based on GMM (as the current `MissingImputator`, which I will deprecate once `Imputation` is ready), random forests and simple means.
My tentative BetaML API is currently
This however doesn't fit well with the MLJ interface currently used for `MissingImputator`
This approach seems a bit forced to me. In missing imputation problems we don't really have the concept of generalising a model to new data. What we would need instead is a sort of `fit_transform` function, with the possibility to extract not only the imputed values (whether in a dense or sparse way) but also some information related to the fitting.
What do you think? Should I just implement my "low level" `fit!` and `predict` in MLJ `fit` and return imputed values and fitting info from it?
How to deal with multiple imputations (an option of the random forest imputer)? Currently `BetaML.Imputation.predict(mod::RFImputer,X)` returns a vector of imputed values instead of a scalar if `mod.multipleImputations` (a parameter of the model) is higher than 1.
EDIT: This tentative interface implements the fit/predict in the MLJ `fit` function. Does it look good to you? https://github.com/sylvaticus/BetaML.jl/blob/b631d82a2a86b13877fafa139c7db97625f36700/src/Imputation/Imputation_MLJ.jl
(I still don't know how to return the imputations when multiple ones are possible depending on the parameter; that will be the case with BetaML `RFImputer`. Should I always return a vector, even if most users will need just one? Or should I ignore multiple imputations in the MLJ interface?)
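One way to frame the "always return a vector" option is the plain-Julia sketch below. It is not BetaML code: `impute_once` and `impute_many` are hypothetical names, and random draws from the observed values of a column stand in for the random forest imputer. The point is only that the return type stays the same whatever the number of imputations requested.

```julia
using Random  # stdlib; used only for reproducible sampling

# Hypothetical helper, not part of BetaML: fill each missing entry with a
# random draw from the observed values of the same column.
# Assumes every column has at least one observed value.
function impute_once(rng::AbstractRNG, X::AbstractMatrix)
    observed = [collect(skipmissing(col)) for col in eachcol(X)]
    return [ismissing(X[i, j]) ? rand(rng, observed[j]) : X[i, j]
            for i in axes(X, 1), j in axes(X, 2)]
end

# Always return a vector of imputed matrices, of length 1 in the common case,
# so the return type does not depend on the value of `n`.
impute_many(rng::AbstractRNG, X::AbstractMatrix, n::Integer = 1) =
    [impute_once(rng, X) for _ in 1:n]

X = [1.0 missing; 3.0 4.0; missing 6.0]
single = impute_many(MersenneTwister(1), X)      # 1-element vector
multi  = impute_many(MersenneTwister(1), X, 5)   # 5-element vector
Ximputed = first(single)                         # unwrap when one is enough
```

Users who need a single imputation pay only the cost of a `first` call, while the alternative (a scalar or a vector depending on a hyperparameter) makes the output type depend on a model setting.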