We should change how duplicated works #240
Labels
api-suggestion
Early API idea and discussion, it is NOT ready for implementation
Comments
OrestZborowski-SIG added the api-suggestion label Nov 4, 2021
OrestZborowski-SIG pushed a commit that referenced this issue Aug 23, 2022
Improve test run scripts (#314)
Improve the run scripts so they will propagate exit codes. This allows them to break the build if tests fail.
Disable the tooling integration tests since they actually fail, and will now fail the builds. This will need to be remedied (issue #313).
Expose operator methods on FastArray Fixes issues in #240
Ignore test dirs for autodoc
Fix is so that filtered elements still show the count of non-filtered elements for their group
Add Replace methods to FAString
Fix median kwargs
Fixes issues in categorical comparisons for example those in #256
Fix str mean crashes
Adds a Sydney Time Zone addressing #265.
Check count filter length Fixes #291
Adds a statx method for fastarrays.
Fixes ufunc2 custom out test due to np122 changes
Support numpy<1.23
Allows multiple arguments in col_filter
Replace eq = __eq__ with def eq(self,other): return self.__eq__(other)
Fix check for fixed size binary conversion
Update Dataset docstrings: head, tail, sample, describe
Add YAML and shell script specializations
Fixed formatting errors (except duplicate obj descs)
Update riptable GH CI for numpy122
Tighten dependencies
Update dependency constraints
Fixed formatting errors in rt_timezone and rt_utils
Fixed formatting errors in rt_struct
Fixed formatting errors in rt_str
Fixed formatting errors in rt_pgroupby and rt_sds
Fixed formatting errors in rt_pdataset
Fixed formatting errors in rt_multiset
Fixed formatting errors in rt_misc
Fixed formatting errors in rt_meta
Fixed formatting errors in rt_merge
Fixed formatting errors in rt_groupbykeys
Fixed formatting errors in rt_itemcontainer
Fix docstring for regex_replace
Update setup.py dependencies
Support the 'Australia/Sydney' timezone.
Fix formatting errors and update math methods
Fixed Sphinx formatting errors in rt_appdirs
Fixed short underline in benchmarking.rst
Add set_valid method; identical to filter
Fix Categorical Where
Merged
OrestZborowski-SIG added a commit that referenced this issue Aug 23, 2022
* Latest changes
* Enable magical preserve_egg_dir
Co-authored-by: rtosholdings-bot <[email protected]>
I would like to propose a specific change to the duplicated method in rt_fastarray. It has a kwarg, high_unique, that I think was clearly intended to be passed into Grouping (as the kwarg lex). For arrays with many unique values, that speeds the process up a lot by doing a sort to get nCountGroup or whatever. So I would propose something like this:
Obviously feel free to change anything, including the comments. Overall, we definitely could have a more efficient algorithm here: you don't have to essentially create a categorical just to find duplicated elements, and even if you do create one, I think we could probably do the searching and hashing in parallel, which would significantly speed up a lot of the operations we do. But I think we should at a minimum allow people to input a high_unique kwarg. Thanks
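For readers without the code in front of them, the idea of a high_unique switch can be sketched in plain Python. This is only an illustration, not riptable's implementation and not the code proposed in this issue: the function names, the pandas-style keep semantics ("first", "last", False), and the pure-Python group-id helpers are all assumptions made for the example. The hash path stands in for the default grouping, and the sort path stands in for what passing lex to Grouping is described as doing.

```python
from typing import List, Sequence, Tuple, Union


def _group_ids_hash(values: Sequence) -> Tuple[List[int], int]:
    """Assign a group id to each element using a dict (hash table)."""
    ids: dict = {}
    groups = [ids.setdefault(v, len(ids)) for v in values]
    return groups, len(ids)


def _group_ids_sort(values: Sequence) -> Tuple[List[int], int]:
    """Assign a group id to each element by sorting, so equal values are adjacent."""
    order = sorted(range(len(values)), key=values.__getitem__)
    groups = [0] * len(values)
    current = 0
    for pos, idx in enumerate(order):
        # A new group starts whenever the sorted value changes.
        if pos > 0 and values[idx] != values[order[pos - 1]]:
            current += 1
        groups[idx] = current
    return groups, (current + 1 if len(values) else 0)


def duplicated(
    values: Sequence,
    keep: Union[str, bool] = "first",
    high_unique: bool = False,
) -> List[bool]:
    """Return a boolean mask that is True where an element is a duplicate.

    high_unique (hypothetical here) only selects the grouping strategy;
    the resulting mask is the same either way.
    """
    make_groups = _group_ids_sort if high_unique else _group_ids_hash
    groups, ngroups = make_groups(values)

    if keep is False:
        # Mark every member of any group that has more than one element.
        counts = [0] * ngroups
        for g in groups:
            counts[g] += 1
        return [counts[g] > 1 for g in groups]
    if keep not in ("first", "last"):
        raise ValueError("keep must be 'first', 'last', or False")

    # For keep="last", walk the array backwards and flip the mask at the end.
    walk = groups[::-1] if keep == "last" else groups
    seen = [False] * ngroups
    mask = []
    for g in walk:
        mask.append(seen[g])
        seen[g] = True
    return mask[::-1] if keep == "last" else mask
```

Both paths produce the same mask, so the flag is purely a performance hint; the sort path additionally requires the values to be orderable.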