We should change how duplicated works #240

OrestZborowski-SIG · 2021-11-04T13:40:35Z

I would like to propose a specific change to the duplicated method in rt_fastarray. It has a kwarg, high_unique, that I think was clearly intended to be passed into Grouping (as the kwarg lex). Passing it through speeds things up considerably for arrays with many unique values, because Grouping then does a sort to get ncountgroup or whatever it needs. So I would propose something like this:

    def duplicated(self, keep='first', high_unique=False):
        '''
        See pandas.Series.duplicated

        Duplicated values are indicated as True values in the resulting
        FastArray. Either all duplicates, all except the first or all except the
        last occurrence of duplicates can be indicated.

        Parameters
        ----------
        keep : {'first', 'last', False}, default 'first'
            - 'first' : Mark duplicates as True except for the first occurrence.
            - 'last' : Mark duplicates as True except for the last occurrence.
            - False : Mark all duplicates as True; only values that occur exactly once are False.

        high_unique : bool, default False
            Set to True if the array has many unique values. Passed to Grouping as lex.

        '''
        arr = self

        if keep == 'last':
            arr = arr[::-1].copy()

        elif keep is not False and keep != 'first':
            raise ValueError('keep must be either "first", "last" or False')

        # create a return array with every element set to True
        result = ones(len(arr), dtype=np.bool_)

        g = Grouping(arr._fa if hasattr(arr, '_fa') else arr, lex=high_unique)

        if keep is False:
            # search for groups with a count of 1
            result[g.ifirstkey[g.ncountgroup[1:] == 1]] = False
        else:
            result[g.ifirstkey] = False

            if keep == 'last':
                result = result[::-1].copy()
        return result

Obviously feel free to change anything, including the comments. Overall, we could definitely have a more efficient algorithm here: you don't have to essentially create a categorical just to find duplicated elements, and even if you do create one, I think we could probably do the searching and hashing in parallel, which would significantly speed up a lot of the operations we do. But at a minimum I think we should allow people to pass a high_unique kwarg. Thanks
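To make the "no categorical needed" point concrete, here is a minimal pure-Python sketch of finding duplicates with a single hash pass and no grouping object. It is not riptable code: `duplicated_sketch` is a hypothetical name, it works on plain sequences rather than FastArray, and it makes no claim about performance. It only pins down the result each `keep` mode is expected to produce, so it could serve as a reference when testing the change proposed above.

```python
from typing import Hashable, List, Sequence, Union


def duplicated_sketch(values: Sequence[Hashable],
                      keep: Union[str, bool] = 'first') -> List[bool]:
    """Hypothetical reference for the keep semantics (not riptable's API)."""
    if keep is not False and keep not in ('first', 'last'):
        raise ValueError('keep must be either "first", "last" or False')

    n = len(values)

    if keep is False:
        # Count occurrences, then flag every value seen more than once.
        counts = {}
        for v in values:
            counts[v] = counts.get(v, 0) + 1
        return [counts[v] > 1 for v in values]

    # Start with everything flagged, as in the proposal above.
    result = [True] * n

    # Walk forward for 'first' and backward for 'last'. The first time a
    # value is met in that direction is the occurrence to keep.
    order = range(n) if keep == 'first' else range(n - 1, -1, -1)
    seen = set()
    for i in order:
        if values[i] not in seen:
            seen.add(values[i])
            result[i] = False
    return result
```

Walking backward for 'last' avoids the two reversed copies that the Grouping-based version makes, at the cost of a Python-level loop. A real implementation would do this in native code.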


* Improve test run scripts (#314): Improve the run scripts so they will propagate exit codes. This allows them to break the build if tests fail. Disable the tooling integration tests since they actually fail, and will now fail the builds. This will need to be remedied (issue #313).
* Expose operator methods on FastArray
* Fixes issues in #240
* Ignore test dirs for autodoc
* Fix is so that filtered elements still show the count of non-filtered elements for their group
* Add Replace methods to FAString
* Fix median kwargs
* Fixes issues in categorical comparisons for example those in #256
* Fix str mean crashes
* Adds a Sydney Time Zone addressing #265.
* Check count filter length
* Fixes #291
* Adds a statx method for fastarrays.
* Fixes ufunc2 custom out test due to np122 changes
* Support numpy<1.23
* Allows multiple arguments in col_filter
* Replace eq = __eq__ with def eq(self,other): return self.__eq__(other)
* Fix check for fixed size binary conversion
* Update Dataset docstrings: head, tail, sample, describe
* Add YAML and shell script specializations
* Fixed formatting errors (except duplicate obj descs)
* Update riptable GH CI for numpy122
* Tighten dependencies
* Update dependency constraints
* Fixed formatting errors in rt_timezone and rt_utils
* Fixed formatting errors in rt_struct
* Fixed formatting errors in rt_str
* Fixed formatting errors in rt_pgroupby and rt_sds
* Fixed formatting errors in rt_pdataset
* Fixed formatting errors in rt_multiset
* Fixed formatting errors in rt_misc
* Fixed formatting errors in rt_meta
* Fixed formatting errors in rt_merge
* Fixed formatting errors in rt_groupbykeys
* Fixed formatting errors in rt_itemcontainer
* Fix docstring for regex_replace
* Update setup.py dependencies
* Support the 'Australia/Sydney' timezone.
* Fix formatting errors and update math methods
* Fixed Sphinx formatting errors in rt_appdirs
* Fixed short underline in benchmarking.rst
* Add set_valid method; identical to filter
* Fix Categorical Where


OrestZborowski-SIG added the api-suggestion label (Early API idea and discussion, it is NOT ready for implementation) on Nov 4, 2021

OrestZborowski-SIG mentioned this issue Aug 23, 2022

Latest changes #318

Merged
