We should change how duplicated works #240

Open
OrestZborowski-SIG opened this issue Nov 4, 2021 · 0 comments
Labels
api-suggestion Early API idea and discussion, it is NOT ready for implementation

Comments

@OrestZborowski-SIG
Contributor

I would like to propose a specific change to the duplicated method in rt_fastarray. It has a kwarg, high_unique, that I think was clearly intended to be passed into Grouping (as the kwarg lex). Passing it through makes the method much faster for arrays with many unique values, because the groups (ncountgroup and so on) are then found by sorting. So I would propose something like this:

    def duplicated(self, keep='first', high_unique=False):
        '''
        See pandas.Series.duplicated

        Duplicated values are indicated as True values in the resulting
        FastArray. Either all duplicates, all except the first or all except the
        last occurrence of duplicates can be indicated.

        Parameters
        ----------
        keep : {'first', 'last', False}, default 'first'
            - 'first' : Mark duplicates as True except for the first occurrence.
            - 'last' : Mark duplicates as True except for the last occurrence.
            - False : Mark all duplicates as True; only values that occur exactly once are False.

        high_unique : bool, default False
            Set to True if the array has many unique values. Passed to
            Grouping as ``lex``, so the groups are found by sorting.

        '''
        arr = self

        if keep == 'last':
            arr = arr[::-1].copy()

        elif keep is not False and keep != 'first':
            raise ValueError('keep must be either "first", "last" or False')

        # Start with every element marked True (duplicate)
        result = ones(len(arr), dtype=np.bool_)

        # Pass high_unique through to Grouping as lex
        g = Grouping(arr._fa if hasattr(arr, '_fa') else arr, lex=high_unique)

        if keep is False:
            # Unmark the values whose group has a count of 1
            result[g.ifirstkey[g.ncountgroup[1:] == 1]] = False
        else:
            result[g.ifirstkey] = False

            if keep == 'last':
                result = result[::-1].copy()
        return result

Obviously feel free to change anything, including the comments. Overall, we could definitely have a more efficient algorithm here: you don't need to essentially create a categorical just to find duplicated elements, and even if you do create one, I think we could probably do the searching and hashing in parallel, which would significantly speed up many of our operations. At a minimum, though, I think we should let people pass a high_unique kwarg. Thanks
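
To illustrate the point about not needing a categorical, here is a minimal sketch of a sort-based duplicate mask written against plain NumPy rather than riptable. The function name duplicated_sorted is hypothetical, and this is not how Grouping or lex is implemented; it only suggests that one stable argsort plus a neighbour comparison is enough to produce the True/False mask for each keep option.

```python
import numpy as np


def duplicated_sorted(values, keep="first"):
    """Sort-based duplicate mask using plain NumPy.

    Illustrative only: the name is hypothetical and this is not
    riptable's implementation.
    """
    if keep is not False and keep not in ("first", "last"):
        raise ValueError('keep must be either "first", "last" or False')

    values = np.asarray(values)
    n = len(values)
    if n == 0:
        return np.zeros(0, dtype=np.bool_)

    # A stable sort keeps equal values in their original relative order,
    # so the first element of each run is the first occurrence.
    order = np.argsort(values, kind="stable")
    sorted_vals = values[order]

    # same_as_prev[i] is True when sorted element i equals element i - 1.
    same_as_prev = np.zeros(n, dtype=np.bool_)
    same_as_prev[1:] = sorted_vals[1:] == sorted_vals[:-1]

    # same_as_next[i] is True when sorted element i equals element i + 1.
    same_as_next = np.zeros(n, dtype=np.bool_)
    same_as_next[:-1] = same_as_prev[1:]

    if keep == "first":
        dup_sorted = same_as_prev
    elif keep == "last":
        dup_sorted = same_as_next
    else:
        # keep=False: flag every member of a run longer than one.
        dup_sorted = same_as_prev | same_as_next

    # Scatter the flags back to the original positions.
    result = np.empty(n, dtype=np.bool_)
    result[order] = dup_sorted
    return result
```

For the input [3, 1, 3, 2, 1, 3], keep='first' gives [False, False, True, False, True, True] and keep=False gives [True, True, True, False, True, True]. NaN values never compare equal in this sketch, so they are not flagged as duplicates.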

@OrestZborowski-SIG OrestZborowski-SIG added the api-suggestion Early API idea and discussion, it is NOT ready for implementation label Nov 4, 2021
OrestZborowski-SIG pushed a commit that referenced this issue Aug 23, 2022
Improve test run scripts (#314)

Improve the run scripts so they will propagate exit codes. This allows them to break the build if tests fail.
Disable the tooling integration tests since they actually fail, and will now fail the builds. This will need to be remedied (issue #313).
Expose operator methods on FastArray

Fixes issues in #240

Ignore test dirs for autodoc
Fix is so that filtered elements still show the count of non-filtered elements for their group

Add Replace methods to FAString

Fix median kwargs

Fixes issues in categorical comparisons for example those in #256

Fix str mean crashes

Adds a Sydney Time Zone addressing #265.

Check count filter length
Fixes #291

Adds a statx method for fastarrays.

Fixes ufunc2 custom out test due to np122 changes

Support numpy<1.23

Allows multiple arguments in col_filter

Replace eq = __eq__ with def eq(self,other): return self.__eq__(other)

Fix check for fixed size binary conversion

Update Dataset docstrings: head, tail, sample, describe

Add YAML and shell script specializations

Fixed formatting errors (except duplicate obj descs)

Update riptable GH CI for numpy122

Tighten dependencies

Update dependency constraints

Fixed formatting errors in rt_timezone and rt_utils

Fixed formatting errors in rt_struct

Fixed formatting errors in rt_str

Fixed formatting errors in rt_pgroupby and rt_sds

Fixed formatting errors in rt_pdataset

Fixed formatting errors in rt_multiset

Fixed formatting errors in rt_misc

Fixed formatting errors in rt_meta

Fixed formatting errors in rt_merge

Fixed formatting errors in rt_groupbykeys

Fixed formatting errors in rt_itemcontainer

Fix docstring for regex_replace

Update setup.py dependencies

Support the 'Australia/Sydney' timezone.

Fix formatting errors and update math methods

Fixed Sphinx formatting errors in rt_appdirs

Fixed short underline in benchmarking.rst

Add set_valid method; identical to filter

Fix Categorical Where
OrestZborowski-SIG added a commit that referenced this issue Aug 23, 2022
* Latest changes

* Enable magical preserve_egg_dir

Co-authored-by: rtosholdings-bot <[email protected]>