Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a Starlark set data type #290

Merged
merged 10 commits into from
Dec 17, 2024
Merged

Add a Starlark set data type #290

merged 10 commits into from
Dec 17, 2024

Conversation

tetromino
Copy link
Collaborator

@tetromino tetromino commented Nov 9, 2024

Modeled on the Python 3 set type and the existing implementations in Go, Rust, and the proposed one for Java, with the following differences:

  • set literals and set comprehensions not supported (unlike python3)
    • Rationale: readability; {} can be confusing (empty set or empty dict?)
  • copy() method not supported (unlike python3)
    • Rationale: we do not have it on lists or dictionaries; if we add it, we ought to add it to all containers for symmetry.
  • comparison operators not supported (unlike starlark-go and python3)
    • Rationale: readability; python3-style comparison operators on sets provide only a partial order, resulting in unexpected behavior if one, for example, attempts to sort a list of sets.
  • update() method supported (unlike starlark-go)
    • Rationale: follow python's example, useful for in-place mutation when rhs is not a set
  • isdisjoint(), intersection_update(), difference_update(), symmetric_difference_update() method supported (unlike starlark-go and starlark-rust)
    • Rationale: follow python's example for a tiny efficiency gain - e.g. if we didn't have s.intersection_update(rhs), and rhs was a non-set sequence, we'd need to instead do s &= set(rhs), which would mean allocating an unnecessary temporary set for rhs's elements.
  • multiple-argument form of union(), intersection(), difference() and corresponding _update methods supported (unlike starlark-go and starlark-rust)
    • Rationale: follow python's example for a tiny efficiency gain for the non-_update methods - allows avoiding temporary intermediate sets when unioning (or intersecting, etc.) a collection of sequences/sets into a set. (For the _update methods, the multi-argument form is syntactic sugar and doesn't provide an efficiency gain, but seems useful for api symmetry.)
  • |, &, -, and ^ operators (and their augmented forms) require both sides to be sets if lhs is a set (unlike starlark-go)
    • Rationale: preserve syntactic compatibility with python3; starlark's | operator already requires both sides to be dicts if lhs is a dict.

Fixes #264

@adonovan @stepancheg FYI

Modeled on the Python 3 set type and the existing implementations in Go,
Rust, and the proposed one for Java, with the following differences:

* set literals and set comprehensions *not* supported (unlike python3)
* `copy()` method *not* supported (unlike python3) because we do not
  have it on lists or dictionaries.
* comparison operators *not* supported (unlike starlark-go and python3)
* `update()` method supported (unlike starlark-go)
* `isdisjoint()`,`intersection_update()`, `difference_update()`,
  `symmetric_difference_update()` method supported (unlike starlark-go
  and starlark-rust)
* multiple-argument form of `union()`, `intersection()`, `difference()`
  and corresponding _update methods supported (unlike starlark-go and
  starlark-rust).

Fixes bazelbuild#264
@tetromino tetromino requested a review from brandjon November 9, 2024 20:37
spec.md Outdated Show resolved Hide resolved
@comius
Copy link

comius commented Nov 26, 2024

Can you add a rationale for the differences you're listing?

@stepancheg
Copy link
Contributor

multiple-argument form of union(), intersection(), difference() and corresponding _update methods supported (unlike starlark-go and starlark-rust).

Do we need multiple-arguments? They are used like once in a year, and maybe not worth extra complexity.

But this is not a big deal.

@tetromino
Copy link
Collaborator Author

tetromino commented Nov 27, 2024

Can you add a rationale for the differences you're listing?

@comius - rationales added.

Do we need multiple-arguments? They are used like once in a year, and maybe not worth extra complexity.

@stepancheg - for union, intersection, and difference, the multi-argument form provides a tiny efficiency gain by allowing you to avoid temporaries. In other words, s.union(t, u, v) allocates one new set object, while s | t | u | v or s.union(t).union(u).union(v) would allocate 3 set objects, 2 of which are unnecessary temporaries.

spec.md Outdated Show resolved Hide resolved
@comius
Copy link

comius commented Nov 27, 2024

rationales added.

Thanks! Very nice.

Copy link
Contributor

@adonovan adonovan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice. Most of my comments are based on comparison with the Starlark-Go spec.

spec.md Outdated Show resolved Hide resolved
spec.md Outdated Show resolved Hide resolved
spec.md Outdated Show resolved Hide resolved
spec.md Outdated Show resolved Hide resolved
spec.md Outdated Show resolved Hide resolved
spec.md Show resolved Hide resolved
spec.md Outdated Show resolved Hide resolved
spec.md Outdated Show resolved Hide resolved
spec.md Outdated Show resolved Hide resolved
spec.md Outdated Show resolved Hide resolved
composed from hashable values. Most mutable values, such as lists
and dictionaries, are not hashable, unless they are frozen.
composed from hashable values. Most mutable values, such as lists,
dictionaries, and sets, are not hashable, unless they are frozen.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pre-existing -- and don't change it in this PR -- but I think we changed this behavior so that unhashable types remain unhashable even after freezing. That's currently the behavior in the Java interpreter for Dict, for example. Something we can look at later.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Acknowledged.

spec.md Outdated Show resolved Hide resolved
spec.md Show resolved Hide resolved
spec.md Outdated Show resolved Hide resolved
spec.md Show resolved Hide resolved
spec.md Outdated Show resolved Hide resolved
spec.md Show resolved Hide resolved
spec.md Outdated Show resolved Hide resolved
@brandjon
Copy link
Member

All the justifications of differences from prior implementations sound good to me.

set literals and set comprehensions not supported (unlike python3)

  • Rationale: readability; {} can be confusing (empty set or empty dict?)

I feel like "{} is a dict, not a set" is just one of those things you learn when you're dabbling in a Python dialect. But I suppose you could also go the other way, and say that even a non-empty set literal makes it harder to distinguish between sets and dicts on first sight. In either case, even if we wanted a literal syntax, it would be a separate extension.

@tetromino
Copy link
Collaborator Author

@adonovan @brandjon - ptal at current state of the text?

Copy link
Member

@brandjon brandjon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Following python3 behavior, as explained by brandjon@
@tetromino
Copy link
Collaborator Author

@brandjon - following the behavior python3 clearly aspires towards (but doesn't quite attain), I've added a requirement for the argument to set methods to not contain unhashable values.

spec.md Outdated
@@ -4070,7 +4076,7 @@ not have any values in common, and `False` otherwise.

This is equivalent to `not S.intersection(x)`.

`isdisjoint` fails if `x` is not an iterable sequence.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you drop the "not an iterable sequence" wording because it's implied/obvious from context?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes - the first line already states that x must be an iterable sequence.

spec.md Outdated
element, use [`discard`](#set·discard) instead. If you need to remove multiple
elements from a set, see [`difference_update`](#set·difference_update) or the
[`-=`](#sets) augmented assignment operation.
`remove` fails if the set does not contain `x` (in particular, if `x` is
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Do you mean "in particular, even if x is unhashable"? Without the "even" I'm not sure what "in particular" is emphasizing.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I put the "in particular" to indicate that "x is unhashable" is a potentially non-obvious subcase of "the set doesn't contain x"

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reworded to be less confusing.

@brandjon
Copy link
Member

Discussed offline but FTR: Apparently the behavior in Python, that my_set.discard(set([...])) succeeds, is intentional and is there to support the case where my_set has a frozenset element matching the operand [1]. So it's a convenience feature that makes the API more inconsistent. In that case, we're willingly choosing to break Python's precedent in the name of self-consistency.

[1] Source: reference doc quoted in first post of this thread.

@tetromino tetromino merged commit b130d3b into bazelbuild:master Dec 17, 2024
1 of 2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

proposal: add a set data type
5 participants