-
Notifications
You must be signed in to change notification settings - Fork 125
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow expressions in group_by
#494
Comments
Something to consider, is dplyr also permits per operation grouping using the by(selection)_ argument which is applied and scoped just for the operation. Ref: Grouping radically affects the computation of the dplyr verb you use it with, and one of the goals of .by is to allow you to place that grouping specification alongside the code that actually uses it. As an added benefit, with .by you no longer need to remember to ungroup() after summarise(), and summarise() won't ever message you about how it's handling the groups! I did run into this problem of locality of the grouping to the intent/assumptions of the operation I am performing. You may run into cases where you need to push and pop the active group so the callers assumptions on the dataframe returned are not violated. Elixir favours explicitness so I see an operation local grouping being aligned to this ideal whereas group_by /ungroup is much easier to get wrong. I would possibly go as far to say that the dataframe level group_by may be a bad idea and will lead to surprises and coding patterns of saving groups, ungrouping, applying new groups, ungrouping and restoring groups aka pushing and popping group specs. In terms of group expressions is there any issue using any series however computed (including lead or lagged series) to be applied a group_by()/operation by() selection providing it has the same row count as the dataframe? |
That's a nice API but unfortunately we would have ambiguity with mutate:
So we would need to come up with a solution to that.
I don't think so, you can either compute it with |
Yes good point, I believe that why dplyr uses the dot ".by" to distinguish the parameter from columns. I don't think that approach is all that clean as it would put restrictions/reservations on column names. I thought something like apply_grouping_for(df, grouping, fn) might work but its a bit ugly having to use a function for the operation it's applied to. Then I considered something like push_grouping()/pop_grouping() which is nicer from a pipe perspective but is not adding a whole lot of value over the existing group_by/ungroup and it is still a bit clumsy as the pop can easily be forgotten. So there is nothing really compelling in any of those options unfortunately. |
One option I considered was: mutate(
foo: this |> that |> by(:user_id)
) That would apply per column but we could optimize it a bit and do the grouping once in case several columns have the same |
I think that could work, intersting approach! Would it be possible to also apply to a keyword list piping into the group by? mutate([a: 7, b: a * 2]
|> this()
|> that()
|> group_by(:user_id)
) That would ameliorate having to collect the group by clauses for efficiency and running into potential issues in changing the order of dependencies between the column values. |
@josevalim I'm not sure how I overlooked this for most of a year (sorry!), but I don't have opposition to this but also don't see too much reason to do it. I think it's cool and reduces verbosity, seems clear enough to me. I also don't think I'd personally end up using it much. Maybe just because I'm old and crotchety and set in my ways. |
@cigrainger, instead of:
what do you think about supporting this:
They will be basically equivalent.
Thoughts?
The text was updated successfully, but these errors were encountered: