Add Clustering.DBSCAN to interface #11
Conversation
@juliohm Many thanks for this contribution. I think I have misled you and we should rethink this slightly. Unlike KMeans, DBSCAN does not predict on new data. You only get cluster labels for the data on which you train, right? So, we should probably conceive of this as …
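For context, a minimal illustration of the distinction being drawn here, using Clustering.jl directly (a sketch; result types vary somewhat across Clustering.jl versions):

```julia
using Clustering

X = rand(2, 100)                     # 2 × 100 matrix; one column per point

km = kmeans(X, 3)                    # k-means learns cluster centres ...
labels = assignments(km)             # ... and labels for the training points;
                                     # new points can be labelled by their
                                     # nearest centre, so the model generalizes

# Classic DBSCAN only yields labels for the points it was run on; there is
# no fitted object from which to label unseen data without an extra
# transfer step (the subject of this PR).
db = dbscan(X, 0.3, min_neighbors=5)
```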
If the 1.0 fail is a pain, feel free to bump to 1.3, as MLJBase is there anyhow.
Hi @ablaom, I am confused by the API. Shouldn't we aim for a consistent API across different clustering models? I am writing code downstream that is becoming really hard to maintain with very different functions to call. Can you please clarify how I should modify this code to make it work with any clustering model from Clustering.jl? https://github.com/JuliaEarth/GeoClustering.jl/blob/2dec18d6f73ef6db425be8ae2a25f4e74e35f80e/src/clustering.jl#L62-L68

We can always transfer clustering labels using the nearest neighbor approach I implemented in this PR for DBSCAN, even when the original method doesn't provide a natural "transfer" option to a new Xnew matrix of features.
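The downstream pattern at stake is roughly the following hypothetical helper (not the linked GeoClustering.jl code): a single generic function can only serve every clustering model if they all share the same verbs.

```julia
using MLJ

# Hypothetical generic helper: the same three verbs should work for any
# clustering model; a model that only reports labels on its training data
# breaks this pattern, which is the inconsistency being described.
function cluster_labels(model, X)
    mach = machine(model, X)
    fit!(mach, verbosity=0)
    return predict(mach, X)
end
```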
I can finish this PR tomorrow, but it would be really nice if we could easily switch between different clustering models in downstream applications. Maybe the interface is missing an explicit trait for clustering? Can we just stick to the original KMeans/KMedoids interface for now? I am always stumbling upon JuliaAI/MLJModelInterface.jl#120, and my opinion is that we should really consider a trait-based approach in the near future. It can potentially decouple MLJ dependencies and facilitate fixes in models defined in 3rd-party packages.
This PR is now ready with basic tests included. I'd appreciate it if you could reconsider the Static Transformer issue in a future refactoring of the code base. I plan to add more clustering models to the interface now that I am working on it.
This PR is not API compliant. For any … I like the idea of a … However, I acknowledge the following problems with the …

I therefore suggest we implement both a …

This interface can be formalized with the introduction of a new trait …

@juliohm How does that sound?
I don't know if that is correct. In this PR the result of predict is a function of the new unseen Xnew using the nearest neighbor strategy. Can you please double check? The code fits a KDTree to the seen X and then uses the nearest neighbor to assign a label to an unseen sample. After you double check this part, I can try to address the other comments, which rely on this misinterpretation of the PR.
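For clarity, the label-transfer strategy described above amounts to something like the following sketch (an illustration built on NearestNeighbors.jl, not the PR's exact code):

```julia
using NearestNeighbors

# Give each unseen point the cluster label of its nearest seen point.
# X and Xnew are d × n matrices of column points; labels[i] is the label
# assigned to X[:, i] by the clusterer.
function transfer_labels(X, labels, Xnew)
    tree = KDTree(X)
    idxs, _ = nn(tree, Xnew)     # index of the nearest seen neighbour
    return labels[idxs]
end
```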
@ablaom my main concern is the fact that there isn't a consistent set of function names to call different clustering algorithms programmatically. After years of working in the GeoStats.jl stack I realized a few important facts: …

Given that this is a long-term issue, I think we could just follow the same API of KMeans and KMedoids for now, and use more time to think deeply about how to replace the black-box "unsupervised" trait by various more useful traits. How does that sound? I am happy to continue adding more clustering models to the MLJClusteringInterface.jl if this plan is acceptable.
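To make the trait suggestion concrete, here is one purely hypothetical shape it could take (none of these names exist in MLJModelInterface; they are only illustrative):

```julia
# Hypothetical trait separating clusterers that can label unseen data from
# those that only label the data they were fitted on.
abstract type ClusteringStyle end
struct GeneralizingClusterer <: ClusteringStyle end   # e.g. KMeans, KMedoids
struct StaticClusterer <: ClusteringStyle end         # e.g. classic DBSCAN

clustering_style(::Type) = StaticClusterer()          # conservative fallback
# clustering_style(::Type{<:KMeans}) = GeneralizingClusterer()
```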
Codecov Report
@@ Coverage Diff @@
## master JuliaAI/MLJClusteringInterface.jl#11 +/- ##
==========================================
+ Coverage 94.73% 95.52% +0.78%
==========================================
Files 1 1
Lines 38 67 +29
==========================================
+ Hits 36 64 +28
- Misses 2 3 +1
Ah, I see this implementation of DBSCAN indeed generalises to new data. For a quicker and smoother review, I suggest that in the future you read reviewer comments more carefully:

DBSCAN as described in the original paper and on Wikipedia is pure clustering (no generalization), as is the scikit-learn implementation. Hence my query here. Given we are viewing DBSCAN as a clusterer that generalizes (i.e. as an unsupervised classifier), its interface ought to be the same as KMeans and KMedoids. This means the …
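The KMeans/KMedoids-style usage being referred to is roughly as follows (a sketch; the exact return values may differ from the released interface):

```julia
using MLJ
KMeans = @load KMeans pkg=Clustering

X, _ = make_blobs(100, 2)        # toy table of features
mach = machine(KMeans(k=3), X)
fit!(mach)

predict(mach, X)     # categorical cluster labels, also defined for unseen data
transform(mach, X)   # a table of distances to each cluster centre
report(mach)         # training-time output, e.g. the raw assignments
```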
Please address the 1.0 fail or bump the compat and CI appropriately. I'd be happy with 1.3 but no higher, thanks.
Thank you @ablaom, I will try to work on the adjustments tomorrow. Can you please clarify the definition of …
Sorry @juliohm, please ignore my earlier comment ("This means the …").
Unfortunately not, which I can see has made your task a little …
I think this is addressed in the issue, except for …

For your reference, the general model API spec is here. As far as the current PR is concerned: …
Otherwise, this looks good to go.
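For reference, the general shape of that contract, as a hedged skeleton (the hyperparameter names and defaults below are illustrative, not necessarily those of this PR):

```julia
import MLJModelInterface
const MMI = MLJModelInterface

# Hedged skeleton of an MLJ unsupervised model; field names are illustrative.
MMI.@mlj_model mutable struct DBSCAN <: MMI.Unsupervised
    radius::Float64 = 1.0::(_ > 0)
    min_neighbors::Int = 5::(_ > 0)
end

function MMI.fit(model::DBSCAN, verbosity::Int, X)
    # ... run Clustering.dbscan on the matrix form of X here ...
    fitresult = nothing          # whatever predict/transform will need later
    cache = nothing
    report = NamedTuple()        # e.g. assignments and point types
    return fitresult, cache, report
end
```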
Hi @ablaom, thank you for the new documentation. I will read it carefully with more time in the following weeks and will come back to this PR as soon as possible.
@juliohm I would write a wrapper model for this and not bundle it with a specific clusterer. What do you think, @ablaom?
Yes, a wrapper would be better.
What do you mean by wrapper here? I don't understand the proposal.
@ablaom can you please elaborate on this wrapper proposal? I was planning to come back to this PR in the following weeks but now it is closed, so I want to understand what the plan is here.
Sorry, closing must have been accidental. I will get back to you shortly re the wrapper proposal.
@juliohm I've opened an issue on the wrapping suggestion: JuliaAI/MLJBase.jl#768. However, my understanding from our discussion above is that Clustering.jl already has the classification wrapper hard-wired (using KNN). In that case we could view an MLJ wrapper as orthogonal to this PR. I mean, if you wanted to use a different classifier with DBSCAN, say, you could still apply the wrapper to the model implemented in this PR, right? All the wrapper needs from the clusterer is labels on the training data. That being the case, I suggest you finish off this PR (see my three bullet points above). I should probably be the one to implement the wrapper.
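A hedged sketch of the wrapper idea (all names below are hypothetical): given the labels a clusterer assigned to its training data, any classifier can supply the generalization step.

```julia
using MLJ

# Train a classifier to reproduce the clusterer's training labels, then use
# it to label unseen data (assumes a probabilistic classifier).
function generalize(classifier, X, cluster_labels, Xnew)
    y = coerce(cluster_labels, Multiclass)   # labels as a categorical target
    mach = machine(classifier, X, y)
    fit!(mach, verbosity=0)
    return predict_mode(mach, Xnew)          # hard labels for the new points
end
```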
Thank you @ablaom, makes total sense now. The idea is to work on a DBSCAN model that only works with a single data set and let a wrapper model perform the predictions on unseen data. I will try to work on it over the weekend. It is really busy around here.
I got really busy when we first started discussing this addition, and then after some delay I couldn't get back to it. I will close the PR so that others can work on it with more time.
This PR fixes JuliaAI/MLJ.jl#845
I have a few questions:
- The transform function should return anything that we think is useful, right? I am returning the assignments and point types.
- Is fit following the API correctly in terms of result, cache, report?

I will work on a simple test with a toy data set and after that the PR should be ready.