
Add Clustering.DBSCAN to interface #11

Closed · wants to merge 2 commits

Conversation

juliohm (Contributor) commented Oct 4, 2021

This PR fixes JuliaAI/MLJ.jl#845

I have a few questions:

  1. Why is the output of KMeans/KMedoids Continuous? I copied the pattern to DBSCAN without understanding it.
  2. The transform function should return anything that we think is useful, right? I am returning the assignments and point types.
  3. Can you please double check that the implementation of fit follows the API correctly in terms of result, cache, and report?

I will work on a simple test with a toy data set and after that the PR should be ready.

ablaom (Member) commented Oct 4, 2021

@juliohm Many thanks for this contribution.

I think I have misled you and we should rethink this slightly. Unlike KMeans, DBSCAN does not predict on new data. You only get cluster labels for the data on which you train, right? So we should probably conceive of this as a Static transformer. This means that there is no fit method to implement, only a transform. The awkward point here is that everything you want to extract must be returned by transform, as there is no fit to generate a report with extras. So, as I think you are suggesting for the original implementation, transform returns a tuple (assignments, point_types). Then you grab the assignments by adding first to the end of your pipeline, for example.

Why is the output of KMeans/KMedoids Continuous? I copied the pattern to DBSCAN without understanding it.

The output_scitype refers to the scitype of what transform returns. If we return the tuple above, this would be Tuple{AbstractVector{<:Multiclass}, AbstractVector{<:Multiclass}}, assuming you encode the point types as a CategoricalVector.
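
For concreteness, here is a minimal sketch of that Static-transformer idea. It assumes a recent Clustering.jl in which dbscan returns a result carrying assignments and clusters; the hyperparameter names radius and min_neighbors are illustrative, not the PR's actual fields:

    import MLJModelInterface
    import Clustering
    using CategoricalArrays: categorical
    const MMI = MLJModelInterface

    mutable struct StaticDBSCAN <: MMI.Static
        radius::Float64
        min_neighbors::Int
    end

    # no fit to implement; transform clusters whatever data it is handed
    function MMI.transform(model::StaticDBSCAN, _, X)
        Xmat = MMI.matrix(X, transpose=true)  # Clustering.jl wants points as columns
        result = Clustering.dbscan(Xmat, model.radius;
                                   min_neighbors=model.min_neighbors)
        # recover per-point types (core/boundary/noise) from the clusters
        point_types = fill("noise", size(Xmat, 2))
        for c in result.clusters
            point_types[c.core_indices] .= "core"
            point_types[c.boundary_indices] .= "boundary"
        end
        return categorical(result.assignments), categorical(point_types)
    end

In a pipeline, the labels would then be extracted with first, as described above.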

ablaom (Member) commented Oct 4, 2021

If the 1.0 failure is a pain, feel free to bump to 1.3, as MLJBase is there anyhow.

juliohm (Author) commented Oct 4, 2021

Hi @ablaom, I am confused by the API. Shouldn't we aim for a consistent API across different clustering models? I am writing downstream code that is becoming really hard to maintain, with very different functions to call. Can you please clarify how I should modify this code to make it work with any clustering model from Clustering.jl? https://github.com/JuliaEarth/GeoClustering.jl/blob/2dec18d6f73ef6db425be8ae2a25f4e74e35f80e/src/clustering.jl#L62-L68

We can always transfer clustering labels using the nearest neighbor approach I implemented in this PR for DBSCAN, even when the original method doesn't provide a natural "transfer" option to a new Xnew matrix of features.

juliohm (Author) commented Oct 5, 2021

I can finish this PR tomorrow, but it would be really nice if we could easily switch between different clustering models in downstream applications. Maybe the interface is missing an explicit trait for clustering? Can we just stick to the original KMeans/KMedoids interface for now?

I am always stumbling upon JuliaAI/MLJModelInterface.jl#120, and my opinion is that we should really consider a trait-based approach in the near future. It can potentially decouple MLJ dependencies and facilitate fixes in models defined in 3rd-party packages.

juliohm changed the title from "[WIP] Add Clustering.DBSCAN to interface" to "Add Clustering.DBSCAN to interface" on Oct 5, 2021
juliohm (Author) commented Oct 5, 2021

This PR is now ready, with basic tests included. I would appreciate it if you could reconsider the Static transformer issue in a future refactoring of the code base. I plan to add more clustering models to the interface now that I am working on it.

ablaom (Member) commented Oct 6, 2021

This PR is not API compliant. For any Supervised or Unsupervised model, predict(model, fitresult, X) is expected to output a prediction generalized to the previously unseen input X. In this PR, X is ignored and training labels are returned.

I like the idea of a Static transformer here because it is conceptually sound and more composable (at least as composition is conceived in MLJ, with careful separation of the roles of training data and new production data).

However, I acknowledge the following problems with the Static transformer suggestion:

  • It is awkward and unnatural to bundle point types along with the labels in the transform return value (see my earlier comment).

  • This does not resolve the reasonable desire to have a convenient way to access the "training" labels common to all clustering algorithms.

I therefore suggest we implement both a StaticDBSCAN (returning labels only) and a "plain" DBSCAN, with the following simple changes to the current PR:

  • For the plain DBSCAN already added, neither predict nor transform is implemented.

  • Instead, labels are returned in the report component of the output of fit; something like report = (training_labels=..., point_types=...), as in the sketch after this list (report is always a named tuple or other property-accessible object).

  • We add training_labels=... to the reports of the other clustering algorithms, KMeans and KMedoids.
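
A rough sketch of the second bullet, with illustrative hyperparameter names and a hypothetical point_types helper standing in for whatever extraction is settled on:

    function MMI.fit(model::DBSCAN, verbosity::Int, X)
        Xmat = MMI.matrix(X, transpose=true)  # points as columns
        result = Clustering.dbscan(Xmat, model.radius;
                                   min_neighbors=model.min_neighbors)
        fitresult = result
        cache = nothing  # no update method implemented
        report = (training_labels = result.assignments,
                  point_types = point_types(result))  # hypothetical helper
        return fitresult, cache, report
    end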

This interface can be formalized with the introduction of a new trait, is_clusterer or reports_training_annotations or whatever (outlier detectors also want to report training scores). I can open an issue for discussion. This could be followed up with PRs to the other providers of clustering algorithms, such as GMMClusterer.

@juliohm How does that sound?

juliohm (Author) commented Oct 6, 2021

This PR is not API compliant. For any Supervised or Unsupervised model, predict(model, fitresult, X) is expected to output a prediction generalized to the previously unseen input X. In this PR, X is ignored and training labels are returned.

I don't know if that is correct. In this PR the result of predict is a function of the new unseen Xnew using the nearest neighbor strategy. Can you please double check? The code fits a KDTree to the seen X and then uses the nearest neighbor to assign a label to an unseen sample.
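
For reference, the transfer strategy described here amounts to something like the following standalone sketch using NearestNeighbors.jl (the data and names are illustrative, not the PR code):

    using NearestNeighbors

    Xtrain = rand(2, 100)    # seen features, points as columns
    labels = rand(0:3, 100)  # cluster labels from dbscan on Xtrain
    tree = KDTree(Xtrain)    # fitted once on the seen X

    # each unseen sample inherits the label of its nearest seen sample
    Xnew = rand(2, 10)
    idxs, _ = knn(tree, Xnew, 1)
    newlabels = [labels[only(i)] for i in idxs]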

After you double check this part, I can try to address the other comments, which rely on this misinterpretation of the PR.

juliohm (Author) commented Oct 7, 2021

@ablaom my main concern is the fact that there isn't a consistent set of function names to call different clustering algorithms programmatically. After years of working in the GeoStats.jl stack I realized a few important facts:

  1. The "Unsupervised" trait isn't useful in machine learning in general, and just to clarify, this issue is not specific to MLJ. The ML community decided to call any model that is not well-defined in statistical learning theory as "unsupervised". This includes projections, transformations, clustering algorithms, outlier detection algorithms, ... basically any function. The "Supervised" trait on the other hand is very useful as it defines inputs and outputs very precisely. We can easily define an API because the learning problem is well-defined.
  2. It would be much more useful to adopt specific traits for specific learning tasks. For example, clustering models, outlier detection models, projection models, ... That way people can advance specific areas and compare different algorithms within the same category.

Given that this is a long-term issue, I think we could just follow the same API as KMeans and KMedoids for now, and take more time to think deeply about how to replace the black-box "unsupervised" trait with various more useful traits. How does that sound? I am happy to continue adding more clustering models to MLJClusteringInterface.jl if this plan is acceptable.

codecov-commenter commented Oct 10, 2021

Codecov Report

Merging #11 (bfb5056) into master (e46a5a3) will increase coverage by 0.78%.
The diff coverage is 96.55%.


@@            Coverage Diff             @@
##           master      #11      +/-   ##
==========================================
+ Coverage   94.73%   95.52%   +0.78%     
==========================================
  Files           1        1              
  Lines          38       67      +29     
==========================================
+ Hits           36       64      +28     
- Misses          2        3       +1     
Impacted Files Coverage Δ
src/MLJClusteringInterface.jl 95.52% <96.55%> (+0.78%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

ablaom (Member) commented Oct 10, 2021

I don't know if that is correct. In this PR the result of predict is a function of the new unseen Xnew using the nearest neighbor strategy.

Ah, I see this implementation of DBSCAN indeed generalises to new data.

For a quicker and smoother review, I suggest that in the future you read reviewer comments more carefully:

Unlike KMeans, DBSCAN does not predict on new data. You only get cluster labels for the data on which you train, right?

DBSCAN, as described in the original paper and on Wikipedia, is pure clustering (no generalization), as is the scikit-learn implementation. Hence my query here.

Given we are viewing DBSCAN as a clusterer that generalizes (i.e., as an unsupervised classifier), its interface ought to be the same as that of KMeans and KMedoids. This means the transform method should do dimension reduction. You can instead report the point types for the training data as part of the report, along with the training labels. I see that the other two classifiers return the training labels in the report as assignments=..., so please mimic that.

ablaom (Member) commented Oct 10, 2021

Please address the 1.0 failure or bump the compat and CI appropriately. I'd be happy with 1.3 but no higher, thanks.

juliohm (Author) commented Oct 10, 2021

Thank you @ablaom, I will try to work on the adjustments tomorrow. Can you please clarify the definitions of MMI.fit, MMI.transform and MMI.predict? Are they documented somewhere for clustering models? What are the expected contents of result, cache, report, for example? When I think of KMeans and KMedoids, I think of them as clustering models, but it seems that you have a different view? All ML frameworks I used in the past consider KMeans and KMedoids clustering models, hence my confusion.

ablaom (Member) commented Oct 12, 2021

Sorry @juliohm, please ignore my earlier comment, "This means the transform method should do dimension reduction". I must have been very tired when I wrote it. Since DBSCAN has no "centers", there's no dimension reduction as we have for KMeans and KMedoids.

Can you please clarify the definitions of MMI.fit, MMI.transform and MMI.predict? Are they documented somewhere for clustering models?

Unfortunately not, which I can see has made your task a little difficult. To address this I've opened this issue, which I hope will help.

What are the expected contents of result, cache, report

I think this is addressed in the issue, except for cache; this is algorithm "state" passed to the optional update method, useful for implementing "warm restarts" for iterative models; see here. If not implementing update, you just put cache=nothing.

For your reference, the general model API spec is here.
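
To make the role of cache concrete, here is a schematic of how an iterative model might use it; SomeIterativeModel and train_more! are hypothetical names, not MLJ API:

    function MMI.update(model::SomeIterativeModel, verbosity::Int,
                        old_fitresult, old_cache, X, y)
        state = old_cache  # whatever fit stashed away
        # hypothetical helper: continue training from the saved state
        fitresult, state = train_more!(old_fitresult, state, model, X, y)
        report = (niterations = model.n,)  # illustrative
        return fitresult, state, report   # state becomes the next call's cache
    end

DBSCAN is not iterative, so cache = nothing is all this PR needs.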


As far as the current PR is concerned:

  • Since there is no dimension reduction, there is no transform to implement. The current transform(model, fitresult, Xnew) is not generalizing to the new observations Xnew, and so is not a valid operation in the MLJ sense. (I was wrong about predict but believe I am correct in this case.) As I suggested earlier, one could include the point types in the report, where I suggest you also put the training labels, as in report = (assignments=..., point_types=...); use whatever keys are consistent with the Clustering.jl documentation.

  • The output of predict, a categorical vector, should have all the cluster labels in its pool, not just those generated for the new data; see the comment after "Important" at the issue mentioned above, and the sketch after this list. You can just mimic the KMeans code for arranging this.

  • In tests, please replace X with MLJBase.table(X) (which I think works), since the declared input_scitype is Table(Continuous). Since matrices also work, we could declare input_scitype to be Union{AbstractMatrix{Continuous}, Table(Continuous)}?
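
A small sketch of the last two bullets (illustrative values, assuming three clusters were found during training):

    using CategoricalArrays
    import MLJBase

    # predictions carry the full label pool, not just labels seen in Xnew
    new_labels = [1, 1, 3]                      # labels assigned to Xnew
    yhat = categorical(new_labels, levels=1:3)  # pool retains the unseen 2

    # in tests, wrap the matrix as a table to match Table(Continuous)
    X = MLJBase.table(rand(100, 3))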

Otherwise, this looks good to go.

juliohm (Author) commented Oct 12, 2021

Hi @ablaom, thank you for the new documentation. I will read it carefully in the following weeks and come back to this PR as soon as possible.

davnn commented Nov 15, 2021

We can always transfer clustering labels using the nearest neighbor approach I implemented in this PR for DBSCAN, even when the original method doesn't provide a natural "transfer" option to a new Xnew matrix of features.

@juliohm I would write a wrapper model for this and not bundle it with a specific clusterer. What do you think @ablaom?

ablaom (Member) commented Nov 16, 2021

Yes, a wrapper would be better.

ablaom closed this on Nov 16, 2021
juliohm (Author) commented Nov 16, 2021

What do you mean by wrapper here? I don't understand the proposal.

juliohm (Author) commented Nov 16, 2021

@ablaom, can you please elaborate on this wrapper proposal? I was planning to come back to this PR in the following weeks, but now it is closed, so I want to understand what the plan is here.

ablaom reopened this on Nov 16, 2021
ablaom (Member) commented Nov 16, 2021

Sorry, closing must have been accidental.

I will get back to you shortly re the wrapper proposal.

ablaom (Member) commented Nov 18, 2021

@juliohm I've opened an issue on the wrapping suggestion: JuliaAI/MLJBase.jl#768.

However, my understanding from our discussion above is that the DBSCAN implementation in this PR already has the classification wrapper hard-wired (using KNN). In that case we could view an MLJ wrapper as orthogonal to this PR. I mean, if you wanted to use a different classifier with DBSCAN, say, you could still apply the wrapper to the model implemented in this PR, right? All the wrapper needs from the clusterer is labels on the training data.
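
One way to picture the proposal (a hypothetical composite type, not an existing MLJ API):

    # the clusterer supplies labels on the training data; any classifier
    # then generalizes those labels to new data
    struct ClassifierWrappedClusterer{C,M}
        clusterer::C   # e.g. DBSCAN from this PR
        classifier::M  # e.g. a KNN classifier trained on (X, labels)
    end

The KNN transfer hard-wired into this PR would then be just one instance of this pattern.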

That being the case, I suggest you finish off this PR (see my three bullet points above).

I should probably be the one to implement the wrapper.

juliohm (Author) commented Nov 18, 2021

Thank you @ablaom, it makes total sense now. The idea is to have a DBSCAN model that only works with a single data set and let a wrapper model perform the predictions on unseen data. I will try to work on it over the weekend. Things are really busy around here.

juliohm (Author) commented May 10, 2022

I got really busy when we first started discussing this addition, and then after some delay I couldn't get back to it. I will close the PR so that others can work on it with more time.

juliohm closed this on May 10, 2022
ablaom mentioned this pull request on May 16, 2022
ablaom mentioned this pull request on Aug 24, 2022
Development

Successfully merging this pull request may close these issues: "DBSCAN from Clustering.jl not registered".