
Add Clustering.DBSCAN to interface #11

Closed · wants to merge 2 commits

Conversation

juliohm (Contributor) commented Oct 4, 2021

This PR fixes JuliaAI/MLJ.jl#845

I have a few questions:

  1. Why is the output of KMeans/KMedoids Continuous? I copied the pattern to DBSCAN without understanding it.
  2. The transform function should return anything that we think is useful, right? I am returning the assignments and point types.
  3. Can you please double check that the implementation of fit follows the API correctly in terms of result, cache, and report?

I will work on a simple test with a toy data set and after that the PR should be ready.

ablaom (Member) commented Oct 4, 2021

@juliohm Many thanks for this contribution.

I think I have misled you and we should rethink this slightly. Unlike KMeans, DBSCAN does not predict on new data. You only get cluster labels for the data on which you train, right? So we should probably conceive of this as a Static transformer. This means that there is no fit method to implement, only a transform. The awkward point here is that everything you want to extract must be returned by transform, as there is no fit to generate a report with extras. So, as I think you are suggesting for the original implementation, transform returns a tuple (assignments, point_types). Then you grab the assignments by adding first to the end of your pipeline, for example.

Why is the output of KMeans/KMedoids Continuous? I copied the pattern to DBSCAN without understanding it.

The output_scitype refers to the scitype of what transform returns. If we return the tuple above, this would be Tuple{AbstractVector{<:Multiclass}, AbstractVector{<:Multiclass}}, assuming you encode the point types as a CategoricalVector.
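
For concreteness, here is a minimal sketch of that Static-transformer idea. It assumes a recent Clustering.jl in which dbscan returns a result carrying assignments and clusters; the hyperparameter names radius and min_neighbors are illustrative, not the PR's actual fields:

    import MLJModelInterface
    import Clustering
    using CategoricalArrays: categorical
    const MMI = MLJModelInterface

    mutable struct StaticDBSCAN <: MMI.Static
        radius::Float64
        min_neighbors::Int
    end

    # no fit to implement; transform clusters whatever data it is handed
    function MMI.transform(model::StaticDBSCAN, _, X)
        Xmat = MMI.matrix(X, transpose=true)  # Clustering.jl wants points as columns
        result = Clustering.dbscan(Xmat, model.radius;
                                   min_neighbors=model.min_neighbors)
        # recover per-point types (core/boundary/noise) from the clusters
        point_types = fill("noise", size(Xmat, 2))
        for c in result.clusters
            point_types[c.core_indices] .= "core"
            point_types[c.boundary_indices] .= "boundary"
        end
        return categorical(result.assignments), categorical(point_types)
    end

In a pipeline, the labels would then be extracted with first, as described above.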

ablaom (Member) commented Oct 4, 2021

If the 1.0 failure is a pain, feel free to bump to 1.3, as MLJBase is there anyhow.

juliohm (Author) commented Oct 4, 2021

Hi @ablaom, I am confused by the API. Shouldn't we aim for a consistent API across different clustering models? I am writing downstream code that is becoming really hard to maintain, with very different functions to call. Can you please clarify how I should modify this code to make it work with any clustering model from Clustering.jl? https://github.com/JuliaEarth/GeoClustering.jl/blob/2dec18d6f73ef6db425be8ae2a25f4e74e35f80e/src/clustering.jl#L62-L68

We can always transfer clustering labels using the nearest neighbor approach I implemented in this PR for DBSCAN, even when the original method doesn't provide a natural "transfer" option to a new Xnew matrix of features.

juliohm (Author) commented Oct 5, 2021

I can finish this PR tomorrow, but it would be really nice if we could easily switch between different clustering models in downstream applications. Maybe the interface is missing an explicit trait for clustering? Can we just stick to the original KMeans/KMedoids interface for now?

I am always stumbling upon JuliaAI/MLJModelInterface.jl#120, and my opinion is that we should really consider a trait-based approach in the near future. It can potentially decouple MLJ dependencies and facilitate fixes in models defined in 3rd-party packages.

juliohm changed the title from "[WIP] Add Clustering.DBSCAN to interface" to "Add Clustering.DBSCAN to interface" on Oct 5, 2021
juliohm (Author) commented Oct 5, 2021

This PR is now ready, with basic tests included. I would appreciate it if you could reconsider the Static transformer issue in a future refactoring of the code base. I plan to add more clustering models to the interface now that I am working on it.

ablaom (Member) commented Oct 6, 2021

This PR is not API compliant. For any Supervised or Unsupervised model, predict(model, fitresult, X) is expected to output a prediction generalized to the previously unseen input X. In this PR, X is ignored and training labels are returned.

I like the idea of a Static transformer here because it is conceptually sound and more composable (at least as composition is conceived in MLJ, with careful separation of the roles of training data and new production data).

However, I acknowledge the following problems with the Static transformer suggestion:

  • It is awkward and unnatural to bundle point types along with the labels in the transform return value (see my earlier comment).

  • This does not resolve the reasonable desire to have a convenient way to access the "training" labels common to all clustering algorithms.

I therefore suggest we implement both a StaticDBSCAN (returning labels only) and a "plain" DBSCAN, with the following simple changes to the current PR:

  • For the plain DBSCAN already added, neither predict nor transform is implemented.

  • Instead, labels are returned in the report component of the output of fit; something like report = (training_labels=..., point_types=...), as in the sketch after this list (report is always a named tuple or other property-accessible object).

  • We add training_labels=... to the reports of the other clustering algorithms, KMeans and KMedoids.
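
A rough sketch of the second bullet, with illustrative hyperparameter names and a hypothetical point_types helper standing in for whatever extraction is settled on:

    function MMI.fit(model::DBSCAN, verbosity::Int, X)
        Xmat = MMI.matrix(X, transpose=true)  # points as columns
        result = Clustering.dbscan(Xmat, model.radius;
                                   min_neighbors=model.min_neighbors)
        fitresult = result
        cache = nothing  # no update method implemented
        report = (training_labels = result.assignments,
                  point_types = point_types(result))  # hypothetical helper
        return fitresult, cache, report
    end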

This interface can be formalized with the introduction of a new trait, is_clusterer or reports_training_annotations or whatever (outlier detectors also want to report training scores). I can open an issue for discussion. This could be followed up with PRs to the other providers of clustering algorithms, such as GMMClusterer.

@juliohm How does that sound?

juliohm (Author) commented Oct 6, 2021

This PR is not API compliant. For any Supervised or Unsupervised model, predict(model, fitresult, X) is expected to output a prediction generalized to the previously unseen input X. In this PR, X is ignored and training labels are returned.

I don't know if that is correct. In this PR the result of predict is a function of the new unseen Xnew using the nearest neighbor strategy. Can you please double check? The code fits a KDTree to the seen X and then uses the nearest neighbor to assign a label to an unseen sample.
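
For reference, the transfer strategy described here amounts to something like the following standalone sketch using NearestNeighbors.jl (the data and names are illustrative, not the PR code):

    using NearestNeighbors

    Xtrain = rand(2, 100)    # seen features, points as columns
    labels = rand(0:3, 100)  # cluster labels from dbscan on Xtrain
    tree = KDTree(Xtrain)    # fitted once on the seen X

    # each unseen sample inherits the label of its nearest seen sample
    Xnew = rand(2, 10)
    idxs, _ = knn(tree, Xnew, 1)
    newlabels = [labels[only(i)] for i in idxs]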

After you double check this part, I can try to address the other comments, which rely on this misinterpretation of the PR.

juliohm (Author) commented Oct 7, 2021

@ablaom my main concern is the fact that there isn't a consistent set of function names to call different clustering algorithms programmatically. After years of working in the GeoStats.jl stack I realized a few important facts:

  1. The "Unsupervised" trait isn't useful in machine learning in general, and just to clarify, this issue is not specific to MLJ. The ML community decided to call any model that is not well-defined in statistical learning theory as "unsupervised". This includes projections, transformations, clustering algorithms, outlier detection algorithms, ... basically any function. The "Supervised" trait on the other hand is very useful as it defines inputs and outputs very precisely. We can easily define an API because the learning problem is well-defined.
  2. It would be much more useful to adopt specific traits for specific learning tasks. For example, clustering models, outlier detection models, projection models, ... That way people can advance specific areas and compare different algorithms within the same category.

Given that this is a long-term issue, I think we could just follow the same API as KMeans and KMedoids for now, and take more time to think deeply about how to replace the black-box "unsupervised" trait with various more useful traits. How does that sound? I am happy to continue adding more clustering models to MLJClusteringInterface.jl if this plan is acceptable.

codecov-commenter commented Oct 10, 2021

Codecov Report

Merging #11 (bfb5056) into master (e46a5a3) will increase coverage by 0.78%.
The diff coverage is 96.55%.


@@            Coverage Diff             @@
##           master      #11      +/-   ##
==========================================
+ Coverage   94.73%   95.52%   +0.78%     
==========================================
  Files           1        1              
  Lines          38       67      +29     
==========================================
+ Hits           36       64      +28     
- Misses          2        3       +1     
Impacted Files Coverage Δ
src/MLJClusteringInterface.jl 95.52% <96.55%> (+0.78%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

ablaom (Member) commented Oct 10, 2021

I don't know if that is correct. In this PR the result of predict is a function of the new unseen Xnew using the nearest neighbor strategy.

Ah, I see this implementation of DBSCAN indeed generalises to new data.

For a quicker and smoother review, I suggest that in the future you read reviewer comments more carefully:

Unlike KMeans, DBSCAN does not predict on new data. You only get cluster labels for the data on which you train, right?

DBSCAN, as described in the original paper and on Wikipedia, is pure clustering (no generalization), as is the scikit-learn implementation. Hence my query here.

Given we are viewing DBSCAN as a clusterer that generalizes (i.e., as an unsupervised classifier), its interface ought to be the same as that of KMeans and KMedoids. This means the transform method should do dimension reduction. You can instead report the point types for the training data as part of the report, along with the training labels. I see that the other two classifiers return the training labels in the report as assignments=..., so please mimic that.

ablaom (Member) commented Oct 10, 2021

Please address the 1.0 failure or bump the compat and CI appropriately. I'd be happy with 1.3 but no higher, thanks.

juliohm (Author) commented Oct 10, 2021

Thank you @ablaom, I will try to work on the adjustments tomorrow. Can you please clarify the definitions of MMI.fit, MMI.transform and MMI.predict? Are they documented somewhere for clustering models? What are the expected contents of result, cache, report, for example? When I think of KMeans and KMedoids, I think of them as clustering models, but it seems that you have a different view? All ML frameworks I used in the past consider KMeans and KMedoids clustering models, hence my confusion.

ablaom (Member) commented Oct 12, 2021

Sorry @juliohm, please ignore my earlier comment, "This means the transform method should do dimension reduction". I must have been very tired when I wrote it. Since DBSCAN has no "centers", there's no dimension reduction as we have for KMeans and KMedoids.

Can you please clarify the definitions of MMI.fit, MMI.transform and MMI.predict? Are they documented somewhere for clustering models?

Unfortunately not, which I can see has made your task a little difficult. To address this I've opened this issue, which I hope will help.

What are the expected contents of result, cache, report

I think this is addressed in the issue, except for cache; this is algorithm "state" passed to the optional update method, useful for implementing "warm restarts" for iterative models; see here. If not implementing update, you just put cache=nothing.

For your reference, the general model API spec is here.
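
To make the role of cache concrete, here is a schematic of how an iterative model might use it; SomeIterativeModel and train_more! are hypothetical names, not MLJ API:

    function MMI.update(model::SomeIterativeModel, verbosity::Int,
                        old_fitresult, old_cache, X, y)
        state = old_cache  # whatever fit stashed away
        # hypothetical helper: continue training from the saved state
        fitresult, state = train_more!(old_fitresult, state, model, X, y)
        report = (niterations = model.n,)  # illustrative
        return fitresult, state, report   # state becomes the next call's cache
    end

DBSCAN is not iterative, so cache = nothing is all this PR needs.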


As far as the current PR is concerned:

  • Since there is no dimension reduction, there is no transform to implement. The current transform(model, fitresult, Xnew) is not generalizing to the new observations Xnew, and so is not a valid operation in the MLJ sense. (I was wrong about predict but believe I am correct in this case.) As I suggested earlier, one could include the point types in the report, where I suggest you also put the training labels, as in report = (assignments=..., point_types=...); use whatever keys are consistent with the Clustering.jl documentation.

  • The output of predict, a categorical vector, should have all the cluster labels in its pool, not just those generated for the new data; see the comment after "Important" at the issue mentioned above, and the sketch after this list. You can just mimic the KMeans code for arranging this.

  • In tests, please replace X with MLJBase.table(X) (which I think works), since the declared input_scitype is Table(Continuous). Since matrices also work, we could declare input_scitype to be Union{AbstractMatrix{Continuous}, Table(Continuous)}?
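
A small sketch of the last two bullets (illustrative values, assuming three clusters were found during training):

    using CategoricalArrays
    import MLJBase

    # predictions carry the full label pool, not just labels seen in Xnew
    new_labels = [1, 1, 3]                      # labels assigned to Xnew
    yhat = categorical(new_labels, levels=1:3)  # pool retains the unseen 2

    # in tests, wrap the matrix as a table to match Table(Continuous)
    X = MLJBase.table(rand(100, 3))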

Otherwise, this looks good to go.

juliohm (Author) commented Oct 12, 2021

Hi @ablaom, thank you for the new documentation. I will read it carefully in the following weeks and come back to this PR as soon as possible.

davnn commented Nov 15, 2021

We can always transfer clustering labels using the nearest neighbor approach I implemented in this PR for DBSCAN, even when the original method doesn't provide a natural "transfer" option to a new Xnew matrix of features.

@juliohm I would write a wrapper model for this and not bundle it with a specific clusterer. What do you think @ablaom?

ablaom (Member) commented Nov 16, 2021

Yes, a wrapper would be better.

ablaom closed this on Nov 16, 2021
juliohm (Author) commented Nov 16, 2021

What do you mean by wrapper here? I don't understand the proposal.

juliohm (Author) commented Nov 16, 2021

@ablaom, can you please elaborate on this wrapper proposal? I was planning to come back to this PR in the following weeks, but now it is closed, so I want to understand what the plan is here.

ablaom reopened this on Nov 16, 2021
ablaom (Member) commented Nov 16, 2021

Sorry, closing must have been accidental.

I will get back to you shortly re the wrapper proposal.

ablaom (Member) commented Nov 18, 2021

@juliohm I've opened an issue on the wrapping suggestion: JuliaAI/MLJBase.jl#768.

However, my understanding from our discussion above is that the DBSCAN implementation in this PR already has the classification wrapper hard-wired (using KNN). In that case we could view an MLJ wrapper as orthogonal to this PR. I mean, if you wanted to use a different classifier with DBSCAN, say, you could still apply the wrapper to the model implemented in this PR, right? All the wrapper needs from the clusterer is labels on the training data.
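
One way to picture the proposal (a hypothetical composite type, not an existing MLJ API):

    # the clusterer supplies labels on the training data; any classifier
    # then generalizes those labels to new data
    struct ClassifierWrappedClusterer{C,M}
        clusterer::C   # e.g. DBSCAN from this PR
        classifier::M  # e.g. a KNN classifier trained on (X, labels)
    end

The KNN transfer hard-wired into this PR would then be just one instance of this pattern.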

That being the case, I suggest you finish off this PR (see my three bullet points above).

I should probably be the one to implement the wrapper.

juliohm (Author) commented Nov 18, 2021

Thank you @ablaom, it makes total sense now. The idea is to have a DBSCAN model that only works with a single data set and let a wrapper model perform the predictions on unseen data. I will try to work on it over the weekend. Things are really busy around here.

juliohm (Author) commented May 10, 2022

I got really busy when we first started discussing this addition, and then after some delay I couldn't get back to it. I will close the PR so that others can work on it with more time.

juliohm closed this on May 10, 2022
ablaom mentioned this pull request on May 16, 2022
ablaom mentioned this pull request on Aug 24, 2022
Development

Successfully merging this pull request may close these issues: "DBSCAN from Clustering.jl not registered".