Added toolkit compare #719

Open
wants to merge 7 commits into
base: main

ilongin
Contributor

@ilongin ilongin commented Dec 17, 2024

Added a compare function to the toolkit. It uses datachain.lib.diff.compare() to compare DataChains and returns a dictionary with one chain for each of the added, deleted, modified and unchanged statuses. Each chain consists only of the changes related to its status (added has only added fields, deleted only deleted fields, etc.). The dictionary keys are the shortcuts for each status: A, D, M and U.
Status column is removed from each resulting chain as it's not needed.

Also added a CompareStatus enum and replaced hardcoded status letters (A, D ...) across the codebase.
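
As a rough illustration of the enum change, a minimal sketch follows. The letter values come from the description above; the member names DELETED and MODIFIED and the str mix-in are assumptions, since only CompareStatus.ADDED and CompareStatus.UNCHANGED appear in the diff excerpts in this thread.

```python
from enum import Enum


class CompareStatus(str, Enum):
    # Letter values are taken from the PR description; DELETED and MODIFIED
    # are assumed member names, not confirmed by the diff shown here.
    ADDED = "A"
    DELETED = "D"
    MODIFIED = "M"
    UNCHANGED = "U"


# Before: a hardcoded status letter.
old_cond = ("added_cond", "A")
# After: the same pair expressed with the enum member.
new_cond = ("added_cond", CompareStatus.ADDED)

# With the str mix-in, the enum member still compares equal to the old letter.
assert new_cond[1] == old_cond[1]
```

The str mix-in is one way to keep existing comparisons against plain letters working while call sites migrate; whether the PR does this is not shown here.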

codecov bot commented Dec 17, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 87.36%. Comparing base (eb22c85) to head (24834f0).

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #719      +/-   ##
==========================================
+ Coverage   87.33%   87.36%   +0.03%     
==========================================
  Files         116      117       +1     
  Lines       11130    11160      +30     
  Branches     1528     1532       +4     
==========================================
+ Hits         9720     9750      +30     
  Misses       1031     1031              
  Partials      379      379              
Flag Coverage Δ
datachain 87.30% <100.00%> (+0.03%) ⬆️

cloudflare-workers-and-pages bot commented Dec 17, 2024

Deploying datachain-documentation with Cloudflare Pages

Latest commit: 24834f0
Status: ✅  Deploy successful!
Preview URL: https://951e5ab4.datachain-documentation.pages.dev
Branch Preview URL: https://ilongin-715-toolkit-diff.datachain-documentation.pages.dev

@ilongin ilongin marked this pull request as ready for review December 18, 2024 01:49
@shcheklein
Member

@skshetry @dreadatour can we please review this folks?

@shcheklein
Member

@iterative/datachain a reminder, please take a look team.

Contributor

@dreadatour dreadatour left a comment

Looks good to me!
Not related to this PR, but it would also be nice to have an example of real-world usage 👀

@@ -112,25 +125,27 @@ def _to_list(obj: Union[str, Sequence[str]]) -> list[str]:
for c in [f"{_rprefix(c, rc)}{rc}" for c, rc in zip(on, right_on)]
]
)
- diff_cond.append((added_cond, "A"))
+ diff_cond.append((added_cond, CompareStatus.ADDED))

This comment was marked as off-topic.

src/datachain/toolkit/diff.py (outdated)
@shcheklein
Member

@dmpetrov can you also take a look please? you have more context on this.

Comment on lines 22 to 25
added: bool = True,
deleted: bool = True,
modified: bool = True,
unchanged: bool = False,
Member

Do we need these arguments? Can we leave this up to the user to filter?

dc.compare(...).filter(C("col") == "added")

Contributor Author

@ilongin ilongin Dec 31, 2024

DataChain.compare() returns a new chain with that column, which can be filtered. This new toolkit method does exactly what you described, so it saves the user that one filter step. Now, we can discuss whether the toolkit method is even needed in the first place.
So with this PR we have:

  1. compare() in src.datachain.diff -> accepts 2 chains and returns new "diff" chain with status column
  2. DataChain.compare() -> simple wrapper around 1) where left chain is self
  3. compare() in src.datachain.toolkit -> wrapper around 1) but instead of returning one "diff" chain with status column, it splits that chain into multiple chains where each chain represents only one status which is basically what you did in your comment.
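
The relationship between 1) and 3) can be sketched in plain Python, with lists of dicts standing in for chains. Nothing here is the datachain API: split_by_status, the row layout, and the DELETED/MODIFIED member names are hypothetical; only the flag names and their defaults come from the signature under review.

```python
from enum import Enum


class CompareStatus(str, Enum):
    ADDED = "A"
    DELETED = "D"    # assumed member name
    MODIFIED = "M"   # assumed member name
    UNCHANGED = "U"


def split_by_status(diff_rows, status_col, added=True, deleted=True,
                    modified=True, unchanged=False):
    """Split one "diff" row set into one row list per requested status.

    Hypothetical stand-in for the toolkit wrapper: the status column is
    stripped from every returned row, and results are keyed by status letter.
    """
    wanted = {
        CompareStatus.ADDED: added,
        CompareStatus.DELETED: deleted,
        CompareStatus.MODIFIED: modified,
        CompareStatus.UNCHANGED: unchanged,
    }
    result = {}
    for status, enabled in wanted.items():
        if not enabled:
            continue
        result[status.value] = [
            {k: v for k, v in row.items() if k != status_col}
            for row in diff_rows
            if row[status_col] == status
        ]
    return result


# Rows as they might look after step 1), with a status column named "diff".
diff_rows = [
    {"id": 1, "name": "a.txt", "diff": "A"},
    {"id": 2, "name": "b.txt", "diff": "D"},
    {"id": 3, "name": "c.txt", "diff": "M"},
    {"id": 4, "name": "d.txt", "diff": "U"},
]
chains = split_by_status(diff_rows, status_col="diff")
```

With the default flags, chains has the keys A, D and M, and no row carries the status column. The alternative raised in the review is to return the single diff chain and let the user apply one filter per status.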

Contributor Author

I think with this new toolkit we maybe have too many functions, although 2) was meant to be "private"

Member

It looks like the 3rd should have a different name and should follow a pattern similar to compare(): it should be in dc.py if there is not much code, or otherwise in lib.diff with a wrapper in dc.py.

Member

Anyway, lib.diff is a better place for the code than a new toolkit.

PS: I might be the person who proposed the toolkit file but it does not seem a good idea in this case 😅

Contributor Author

I think 3) should not be in dc.py as it returns multiple instances of DataChain so it should be in util file.
Also, the question is should we use src.datachain.diff.py or src.datachain.lib.diff.py for public util functions?

Member

Could you please explain the difference between a toolkit and a lib?

Member

The idea was to move "heavy" code from dc.py to somewhere else and keep only simple wrapper function in dc.py.

If it's implemented in lib/diff.py - it's enough and we don't need toolkit.

Member

Why don't we organize into individual top-level modules, e.g. datachain.diff, instead of cramming everything into a nested module in the datachain.toolkit or datachain.lib modules?

Namespaces are one honking great idea -- let's do more of those!
Flat is better than nested.

Member

Yes, that looks like the best option.

@ilongin what do you think?

Contributor Author

I'm also for top-level modules. I will move this to datachain.diff

@@ -16,6 +17,21 @@
C = Column


def get_status_col_name() -> str:
Member

@skshetry skshetry Dec 24, 2024

Does the column name need to be random? Can we have a default column name that can be changed by users?

Eg:

def compare(col="status"):
    pass


dc.compare(col=...)

Contributor Author

Note that this column name will not be in the results. It's only needed for our internal implementation of the diff. User will have separate chains for each status and status column is not needed in that case.

Member

@dmpetrov dmpetrov left a comment

Some comments are inline. Please take a look.

I'm approving so as not to block.

)
# we still need status column for internal implementation even if not
# needed in the output
status_col = status_col or get_status_col_name()
Member

Please make sure you drop this random column if it was None.

Contributor Author

Status column is dropped with select_except(...) before returning to the user.
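
A minimal sketch of that pattern, again in plain Python rather than datachain calls: an internal status column gets a collision-resistant name, is used while computing the diff, and is stripped before results are returned. The body of get_status_col_name and the helpers drop_column and compare_rows are assumptions for illustration; the actual code uses select_except(...) as stated above.

```python
import random
import string


def get_status_col_name() -> str:
    # Assumed implementation: a fixed prefix plus a random suffix, so the
    # internal column is unlikely to clash with a user column.
    suffix = "".join(random.choices(string.ascii_lowercase + string.digits, k=10))
    return f"diff_{suffix}"


def drop_column(rows, col):
    # Hypothetical stand-in for select_except(col) on a chain.
    return [{k: v for k, v in row.items() if k != col} for row in rows]


def compare_rows(left, right, key, status_col=None):
    # If the caller did not ask for a status column, use an internal one
    # and remove it from the output before returning.
    internal = status_col is None
    col = status_col or get_status_col_name()
    right_keys = {row[key] for row in right}
    # Simplified on purpose: rows of `left` are marked "U" when their key
    # also exists in `right`, and "A" otherwise.
    rows = [
        {**row, col: "U" if row[key] in right_keys else "A"}
        for row in left
    ]
    return drop_column(rows, col) if internal else rows
```

Calling compare_rows without status_col returns rows with no trace of the random column, while passing status_col="status" keeps the column under the name the caller chose.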

)
```
"""
status_col = get_status_col_name()
Member

User can define it, can't they?

Contributor Author

In this toolkit method the status column is not returned to the user, so it's created only for our internal implementation and removed before returning to the user. The user can define the status column in the core DataChain.compare(), which returns one chain with all statuses written in that status column.

CompareStatus.UNCHANGED
)

return chains
Member

I don't see how the status column is cleaned up, or am I missing something?

Contributor Author

It's cleaned in filter_by_status() method with select_except()

5 participants