-
Notifications
You must be signed in to change notification settings - Fork 35
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for "cyclic" categorical variables, such as month of year. #287
Comments
I think that having this in the ecosystem would be super nice. However, my first reaction is that it would be more natural to have a separate mini-package providing this functionality. |
I think I mustn't understand your statement properly. It seems to me that cyclic categorical arrays would - apart from what I outline above - share all the behaviour of the existing arrays (indexing - a pool of all possible values, and so forth). So, how would it be more natural to have a separate mini package to do this? And how would we reproduce all this behaviour without either making some changes to CategoricalArrays itself, or replicating most of it elsewhere? |
You asked for something like:
to work which is something that
Also - if I understand the use-case correctly - most of the time you will know the pool of values you want to cycle over when creating the container (like week day names, month names etc.) and keep them fixed (which is opposite to what CategoricalArrays.jl does - as it is flexible here - which makes sense for non-cyclic factors - but I feel that for cyclic factors this pool should be kept fixed). What could be added that we do not have now, is to define |
As an additional comment - such a functionality could be a type that is just a thin wrapper over CategoricalArrays.jl as indeed maybe a lot of machinery can be reused internally. |
Thanks for that. I was hoping to avoid a wrapper.
I completely agree, and I need this "different nature". In fact, by the same argument, we should have different types I still think having the inconsistency of immutable cyclic pools - versus introducing a wrapper - is the lesser of two evils. I imagined the main argument against what I am proposing is that it would be difficult to implement without breaking the existing API. We could simply add a new Side note. From the point-of-view of supervised machine learning (as opposed to data wrangling) I have often lamented the mutability of the pools. Perhaps we can have an ImmutableCategoricalArrays package with a type parameter for cyclic/ordered/plain and a type parameter for the number of levels. If there were 48 hours in a day.... |
I feel that I would have just written the same. However, as you probably guessed my earlier comment was from the perspective of package design and how expensive it would be to add what you ask for to the current design / how hard it would be to maintain it. And my feeling was that it would be hard. However, let us wait for @nalimilan to comment as he might have some better ideas on the implementation side.
While I agree with this my only question is - if we agree to have immutable pools how would you want to handle a situation in which on production a new level pops-up for prediction? (this can be handled easily in dev/validation/test as you know the whole pool you got for model building but in production it is often the case you can get new levels - especially for nominal categorical variables) |
Interesting. I had never encountered cyclic categorical variables before. Do you have more references/examples, apart from hour/month? How common are they? One issue I see is that hours/months are not just categorical variables: it's also numeric in the sense that you know the difference between subsequent categories is always 1. Actually you could even treat them as continuous (10.5h is perfectly well defined). Do you also need a type (in another package) to handle continuous cyclic variables? I think we need to distinguish three issues:
|
Of other examples I have been building models for wind direction prediction where it is natural to have an angle that is in range [0, 2pi] and is cyclic. Of other things - I think these are great comments I will gladly help here, but I guess @nalimilan should have a lead in decisions (as he has best knowledge of the internals 😄). |
In scientific applications these are potentially all over the place - everywhere you discretise a cyclic variable - so everywhere you have a periodic function of time or an angle of some kind.
Actually, no, I don't need it. I mention it is because it is very natural and you don't have this for ordered factors (without "overflowing" in a non-meaningful way). For our immediate use-cases, it is probably sufficient just to have a cyclic categorical value distinguishable from the others (through type, but trait will do). Our main use-case is transformers that take a cyclic variable and convert it into two or more continuous features using trigonometric functions (see the issue prompting our request. ) This is a pretty standard operation in time-series analysis (eg temperature) where you expect cyclic variations over a single day (say), but only discrete hours are in the record. We want to allow the user to do things like "trigonometricize" all cyclic features in a table.
We probably do want an appropriate re-implementation of
I don't think I have anything new to add there, thanks. |
If new levels appear in an input feature to a supervised learning problem in production then one strategy is to (i) reflag them as "missing", and (ii) apply a pre-learned imputation strategy (replace with the training mode, in the simplest case), and (iii) include both steps as the first part of your overall pipeline (composite model). A few supervised models are specifically designed to deal with missing values, in which case (ii) is unnecessary. But you make a good point here: Under the assumption of immutability, new categorical vectors will need to be created in step (ii) because you must delete the newly appearing levels from the pool (not just replace their occurrences with Thankfully, thanks to CategoricalValues, we don't have to worry about new levels appearing in a test/validation set, since the train sets know all the levels! |
Yes, we also want this. My initial suggestion is to use complex numbers, but they would have to be wrapped to enforce the constraint that they have unit modulus (lie on the unit circle). This would be very natural because exponentiation would immediately deliver the trigonmetric representation we are after. This might already exist somewhere, I haven't really looked. |
I agree :).
I do not know the MLJ pipelines in detail, but oftentimes test data set is taken from a different table than train and validation one. |
OK, feel free to make a PR if you want to add support for marking a |
Context: JuliaAI/MLJ.jl#620
I wonder if it would be feasible to add to the "ordered" and "unordered" cases a third "cyclic" category. Instead of "<" being defined, we would have new behaviour which allows one to add or subtract an integer from any
CategoricalValue
. So, roughly, something like this:Perhaps one would need to worry about two pools being regarded as equivalent if the levels of one is a cyclic permutation of the the other, and so forth.
How hard would this be?
The text was updated successfully, but these errors were encountered: