Added toolkit `compare` #719

ilongin · 2024-12-17T01:54:02Z

Added a `compare` function to the toolkit. It uses `datachain.lib.diff.compare()` to compare DataChains and returns a dictionary with a chain for each of the added, removed, modified and unchanged statuses. Each chain contains only the changes related to its status (the added chain has only added fields, the deleted chain only deleted fields, etc.). Dictionary keys are shortcuts for each status: `A`, `D`, `M` and `U`.
The status column is removed from each resulting chain as it's not needed.

Also added a `CompareStatus` enum and replaced hardcoded status letters (`A`, `D`, ...) across the codebase.
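
For illustration, a minimal sketch of what such an enum could look like and how it replaces a hardcoded letter. Only `CompareStatus.ADDED`, `CompareStatus.UNCHANGED` and the letters A, D, M, U appear in this PR; the member names `DELETED` and `MODIFIED`, the `str` mixin, and the `describe` helper are assumptions made for the example.

```python
from enum import Enum


class CompareStatus(str, Enum):
    # Hypothetical sketch: values mirror the one-letter statuses used in the PR.
    ADDED = "A"
    DELETED = "D"  # assumed member name
    MODIFIED = "M"  # assumed member name
    UNCHANGED = "U"


# Before: diff_cond.append((added_cond, "A"))
# After:  diff_cond.append((added_cond, CompareStatus.ADDED))
def describe(status: CompareStatus) -> str:
    """Return a readable label for a status (illustration only)."""
    return f"{status.name.lower()} ({status.value})"
```

With a `str` mixin, existing comparisons against the raw letters keep working, which makes replacing the hardcoded values a low-risk change.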

codecov · 2024-12-17T02:03:51Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 87.36%. Comparing base (eb22c85) to head (24834f0).

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #719      +/-   ##
==========================================
+ Coverage   87.33%   87.36%   +0.03%     
==========================================
  Files         116      117       +1     
  Lines       11130    11160      +30     
  Branches     1528     1532       +4     
==========================================
+ Hits         9720     9750      +30     
  Misses       1031     1031              
  Partials      379      379

Flag	Coverage Δ
datachain	`87.30% <100.00%> (+0.03%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

cloudflare-workers-and-pages · 2024-12-17T02:24:47Z

Deploying datachain-documentation with Cloudflare Pages

Latest commit:	`24834f0`
Status:	✅ Deploy successful!
Preview URL:	https://951e5ab4.datachain-documentation.pages.dev
Branch Preview URL:	https://ilongin-715-toolkit-diff.datachain-documentation.pages.dev

shcheklein · 2024-12-21T01:17:06Z

@skshetry @dreadatour can we please review this folks?

shcheklein · 2024-12-22T18:48:43Z

@iterative/datachain a reminder, please take a look team.

dreadatour

Looks good to me!
Not related to this PR, but it would also be nice to have an example of real-world usage 👀

src/datachain/lib/diff.py

@@ -112,25 +125,27 @@ def _to_list(obj: Union[str, Sequence[str]]) -> list[str]:
                for c in [f"{_rprefix(c, rc)}{rc}" for c, rc in zip(on, right_on)]
            ]
        )
-        diff_cond.append((added_cond, "A"))
+        diff_cond.append((added_cond, CompareStatus.ADDED))


src/datachain/toolkit/diff.py

shcheklein · 2024-12-23T16:34:21Z

@dmpetrov can you also take a look please? you have more context on this.

skshetry · 2024-12-24T04:20:20Z

src/datachain/toolkit/diff.py

+    added: bool = True,
+    deleted: bool = True,
+    modified: bool = True,
+    unchanged: bool = False,


Do we need these arguments? Can we leave this up to the user to filter?

dc.compare(...).filter(C("col") == "added")

DataChain.compare() returns a new chain with that column, which can be filtered, and this new toolkit method does exactly what you described, so it saves the user that one filter step. Now, we can discuss whether the toolkit method is even needed in the first place.
So with this PR we have:

1) compare() in src.datachain.diff -> accepts 2 chains and returns new "diff" chain with status column

2) DataChain.compare() -> simple wrapper around 1) where left chain is self

3) compare() in src.datachain.toolkit -> wrapper around 1) but instead of returning one "diff" chain with status column, it splits that chain into multiple chains where each chain represents only one status which is basically what you did in your comment.

I think with this new toolkit we maybe have too many functions, although 2) was meant to be "private"

It looks like the 3rd should have a different name and should follow similar pattern as the compare() - it should be in dc.py if there is not much code or in lib.diff otherwise with a wrapper in dc.py

Anyway, lib.diff is a better place for the code than a new toolkit.

PS: I might be the person who proposed the toolkit file but it does not seem a good idea in this case 😅

I think 3) should not be in dc.py as it returns multiple instances of DataChain so it should be in util file.
Also, the question is should we use src.datachain.diff.py or src.datachain.lib.diff.py for public util functions?
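
To make the layering discussed in this thread concrete, here is a rough sketch that uses plain lists of dicts in place of DataChain instances. The function names `compare_rows` and `split_by_status`, the `"status"` column name, and the row format are all hypothetical; the real code operates on chains and, per the replies in this PR, drops the status column with `select_except(...)`.

```python
from typing import Any

Row = dict[str, Any]


def compare_rows(left: list[Row], right: list[Row], on: str,
                 status_col: str = "status") -> list[Row]:
    """Single "diff" result with a status column (stand-in for the core compare)."""
    left_by_key = {r[on]: r for r in left}
    right_by_key = {r[on]: r for r in right}
    diff: list[Row] = []
    for key, row in left_by_key.items():
        if key not in right_by_key:
            status = "A"  # assumed semantics: present only on the left side
        elif row != right_by_key[key]:
            status = "M"
        else:
            status = "U"
        diff.append({**row, status_col: status})
    for key, row in right_by_key.items():
        if key not in left_by_key:
            diff.append({**row, status_col: "D"})  # present only on the right
    return diff


def split_by_status(diff: list[Row], status_col: str = "status",
                    added: bool = True, deleted: bool = True,
                    modified: bool = True,
                    unchanged: bool = False) -> dict[str, list[Row]]:
    """One result per requested status, with the status column dropped."""
    wanted = {"A": added, "D": deleted, "M": modified, "U": unchanged}
    return {
        letter: [
            {k: v for k, v in row.items() if k != status_col}
            for row in diff
            if row[status_col] == letter
        ]
        for letter, enabled in wanted.items()
        if enabled
    }


old = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}, {"id": 4, "name": "d"}]
new = [{"id": 1, "name": "a"}, {"id": 2, "name": "B"}, {"id": 3, "name": "c"}]
chains = split_by_status(compare_rows(new, old, on="id"))
```

Here `chains` has the keys `A`, `D` and `M`; the `unchanged` flag is off by default, matching the defaults shown in the diff above. The alternative raised in the review, filtering the single diff by its status column, would yield the same rows at the cost of one extra step per status.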

skshetry · 2024-12-24T04:23:32Z

src/datachain/toolkit/diff.py

Could you please explain the difference between a toolkit and a lib?

The idea was to move "heavy" code from dc.py to somewhere else and keep only simple wrapper function in dc.py.

If it's implemented in lib/diff.py - it's enough and we don't need toolkit.

Why don't we organize into individual top-level modules, eg: datachain.diff, instead of cramming everything in a nested module in datachain.toolkit or datachain.lib modules?

Namespaces are one honking great idea -- let's do more of those!
Flat is better than nested.

Yes, that looks like the best option.

@ilongin what do you think?

I'm also for top-level modules. I will move this to datachain.diff

skshetry · 2024-12-24T04:25:00Z

src/datachain/lib/diff.py

@@ -16,6 +17,21 @@
 C = Column


+def get_status_col_name() -> str:


Does the column name need to be random? Can we have a default column name that can be changed by users?

Eg:

def compare(col="status"): pass
dc.compare(col=...)

Note that this column name will not be in the results. It's only needed for our internal implementation of the diff. User will have separate chains for each status and status column is not needed in that case.
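
A minimal sketch of the idea, assuming a random suffix is used to avoid clashing with user columns. Only `get_status_col_name()` and the `status_col or get_status_col_name()` fallback appear in this PR; the `diff_` prefix, the suffix length, and the helper `resolve_status_col` are hypothetical.

```python
import random
import string
from typing import Optional


def get_status_col_name() -> str:
    """Return a random name for the internal status column (naming scheme assumed)."""
    suffix = "".join(random.choice(string.ascii_lowercase) for _ in range(10))
    return f"diff_{suffix}"


def resolve_status_col(status_col: Optional[str]) -> tuple[str, bool]:
    """Return the column name to use and whether to drop it before returning."""
    if status_col:
        return status_col, False  # user-defined: keep it in the output
    return get_status_col_name(), True  # internal only: drop before returning
```

A caller would compute the diff using the returned name and, when the second value is `True`, remove that column from the result so the user never sees it.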

dmpetrov

Some comments are inline. Please take a look.

I'm approving so as not to block.

dmpetrov · 2024-12-26T01:32:29Z

src/datachain/lib/diff.py

-    )
+    # we still need status column for internal implementation even if not
+    # needed in the output
+    status_col = status_col or get_status_col_name()


Please make sure you drop this random column if it was None.

Status column is dropped with select_except(...) before returning to the user.

dmpetrov · 2024-12-26T01:35:08Z

src/datachain/toolkit/diff.py

+        )
+        ```
+    """
+    status_col = get_status_col_name()


User can define it, can't they?

In this toolkit method status column is not returned to the user so it's created only for our internal implementation and removed before returning to the user. User can define status column in core DataChain.compare() which returns one chain with all statuses written in that status column

dmpetrov · 2024-12-26T01:36:10Z

src/datachain/toolkit/diff.py

+            CompareStatus.UNCHANGED
+        )
+
+    return chains


I don't see how the status column is cleaned up, or am I missing something?

It's cleaned in filter_by_status() method with select_except()

dmpetrov · 2024-12-26T01:39:46Z

src/datachain/toolkit/diff.py

The idea was to move "heavy" code from dc.py to somewhere else and keep only simple wrapper function in dc.py.

If it's implemented in lib/diff.py - it's enough and we don't need toolkit.

added compare to toolkit

87694b2

ilongin temporarily deployed to internal December 17, 2024 01:54 — with GitHub Actions Inactive

ilongin linked an issue Dec 17, 2024 that may be closed by this pull request

Create util function to return multiple DataChain instances, each for added, deleted and modified value of chain diff #715

Open

ilongin marked this pull request as draft December 17, 2024 01:54

added docs

1d40f26

ilongin temporarily deployed to internal December 17, 2024 02:24 — with GitHub Actions Inactive

added CompareStatus

29f744e

ilongin temporarily deployed to internal December 18, 2024 01:49 — with GitHub Actions Inactive

ilongin marked this pull request as ready for review December 18, 2024 01:49

ilongin requested review from dreadatour, dmpetrov, skshetry and mattseddon December 18, 2024 01:50

dreadatour approved these changes Dec 23, 2024

skshetry reviewed Dec 24, 2024

dmpetrov approved these changes Dec 26, 2024

ilongin added 2 commits December 31, 2024 09:54

Merge branch 'main' into ilongin/715-toolkit-diff

6f0365f

resolving conflicts

c08e002

ilongin temporarily deployed to internal December 31, 2024 10:21 — with GitHub Actions Inactive

ilongin requested a review from skshetry January 1, 2025 23:51

removed not needed docs

c48803d

ilongin temporarily deployed to internal January 2, 2025 00:03 — with GitHub Actions Inactive

skshetry approved these changes Jan 2, 2025

Merge branch 'main' into ilongin/715-toolkit-diff

24834f0

ilongin temporarily deployed to internal January 3, 2025 09:11 — with GitHub Actions Inactive
