Shifting asset metadata to consumer-node: Our options #15847

thejcannon · 2022-06-16T14:48:41Z

thejcannon
Jun 16, 2022
Maintainer

There's a proposal to unify file and resource into asset, which is missing a key piece of Pants functionality we need to decide on.

The TL;DR of all this is "We (both the user and us devs) can't always know how a source file will be consumed (resource v file), so forcing that metadata at the node-site is folly". Instead we need to have one asset type, and shift that metadata to consumption-site. This eventually turns into "typed dependencies", which is a much larger discussion.

This problem isn't unique to assets, though the change would be widespread. Examples of current modeling:

- archive having both files and packages (and no dependencies)
- python_test having both dependencies and runtime_package_dependencies
- jvm_war having both dependencies and content fields

There are also some cases we don't yet implement which could benefit from the result of this discussion:

- Stop including type stubs in pex_binary targets (#15454) could be implemented with a typecheck_dependencies field (or equivalent). This would also unblock #15384 (Treat imports under "if TYPE_CHECKING" as weak), because we could infer the dep into a different dependencies field.
- Right now, if a dependency doesn't fit into our expected "schema", we (usually) silently ignore it. As an example, a user tried to declare a dependency on a go_package in a python_source, expecting to be able to run the package as a subprocess because it would exist in the test sandbox.
  - There are a few ways to solve this that overlap with this discussion (e.g. I've seen Bazel rules error if a dependency is of a type the rule doesn't expect).

Some options we can consider (plus more if you have your thinking cap on):

# Using this as a backdrop for the examples...
# Assuming we've coalesced `file` and `resource` into `asset`
asset(
    name="config",
    source="config.json"
)

1. Ease towards explicit field(s)

E.g. files and/or resources field added to relevant targets. (Even though we can get away with one field, and leave the other "type" to be implicit in dependencies, I'd argue we'd want both to be explicit and to call a spade a spade)

# E.g.
java_source(
    dependencies = [...],
    # This would ensure the config exists in "runtime" environments on the filesystem
    # for dependees (test sandbox, maybe an extracted `archive` or `docker_image`)
    files = [":config"],
    # This would ensure the config gets packaged as a resource in a dependee `deploy_jar`.
    resources = [":config"],
)

- Admittedly this represents a paradigm shift towards typed dependencies fields.
- Combined with a proposal to implement more "generates a file" generators, this could actually make Pants more powerful. E.g. any BuiltPackage could be converted to a file, so we'd actually know how to consume it from, say, a python_test (especially if the built package address was in a field called files; there's no ambiguity).
- This could also lead to a laundry list of "dependency" field types if we aren't careful. Consider perhaps the fields (source_deps, files, resources, py_distribution_deps, system_deps).

2. (possibly) Multiple "dependencies" fields prefixed with the environment they are consumed in.

E.g. runtime_dependencies would contain all dependencies needed to be present at runtime. This would bleed into test sandboxes and possibly archives and docker_images. dependencies would contain dependencies in every environment (either on the FS or in a packaged thing). We could also support one-offs like the typecheck_dependencies case above.
It's more generic, which is both a pro and a con. Targets like python_test get runtime_package_dependencies "promoted" to just runtime_dependencies.
For assets, dependencies -> resource, runtime_dependencies -> file.

# E.g.
java_source(
    # This would ensure the config gets packaged as a resource in a dependee `deploy_jar`.
    dependencies = [":config"],
    # This would ensure the config exists in "runtime" environments on the filesystem
    # for dependees (test sandbox, maybe an extracted `archive` or `docker_image`)
    runtime_dependencies = [":config"],
    # and maybe even (for `python_source`)
    typecheck_dependencies = [...],
)

3. Typed dependencies field.

E.g. (and bikeshed on DSL, specifically) dependencies = {"build": [...], "runtime": [...]} or something in this vein of keeping one field, but having it become more "rich".

# E.g.
java_source(
    dependencies = {
        # This would ensure the config gets packaged as a resource in a dependee `deploy_jar`.
        "build": [":config"],
        # This would ensure the config exists in "runtime" environments on the filesystem
        # for dependees (test sandbox, maybe an extracted `archive` or `docker_image`)
        "runtime": [...],
    },
)

The engine is flexible enough we can get any of these working, so it's a matter of what direction we want to head with the lens of our users in mind.

Eric-Arellano · 2022-06-16T18:19:23Z

Eric-Arellano
Jun 16, 2022
Collaborator

I'm generally +1 to this proposal.

If the dependencies field gets split into multiple ~typed fields, how will users know what gets inferred? Maybe the dependencies goal still smashes every dep-like field together, and then peek gives the granular per-field breakdown?
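To illustrate the split between a flattened view and a per-field view, here is a minimal sketch. The `target_fields` data and the `flatten_dependencies` helper are hypothetical and are not how the real `dependencies` or `peek` goals are implemented; the field names are borrowed from option 1 above.

```python
from typing import Dict, List

# Hypothetical per-field view of one target, roughly what a granular
# `peek`-style breakdown could report.
target_fields: Dict[str, List[str]] = {
    "dependencies": ["//src/lib:lib"],
    "files": [":config"],
    "resources": [":config", ":schema"],
}


def flatten_dependencies(fields: Dict[str, List[str]]) -> List[str]:
    """Merge every dep-like field into one de-duplicated, sorted list."""
    return sorted({address for deps in fields.values() for address in deps})


# The flattened view loses the field each address came from.
all_deps = flatten_dependencies(target_fields)
```

Note that `:config` appears once in the flattened list even though two fields reference it, which is exactly the information the per-field view would need to preserve.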

2 replies

thejcannon Jun 16, 2022
Maintainer Author

Do you have a +1 on which of the options you think should be our ideal future state?

I'll say I'm kinda -1 on option 3 because it does have a lot of quirks and is quite invasive, so "idk" to your question 😛

Eric-Arellano Jun 16, 2022
Collaborator

I lean the most towards option 1. I like that it's more intuitive and declarative. Although, I don't love the name dependencies because it's kind of like an "other" field. And I worry if we'd have too much of a proliferation of fields.

So I'm between 1 and 2. My concern with 2 is if it is still too general, that we're trying to force things.

kaos · 2022-06-16T21:39:19Z

kaos
Jun 16, 2022
Collaborator

I think there's a fourth option (which is basically a variant of option 3, I think), and that is to keep the dependencies field as a sequence of strings, but with optional extra metadata for the edges.

Perhaps something like this, to give an idea of what it could look like:

dependencies = [
  ":config @resource",
  "//:lib @test",
  "//:util @runtime",
  ...
]

Related discussions on previous issues:
#12794 (comment)
(I thought I'd seen more, but can't find them now.. at least the comment above hints at at least one more.. but oh well)

I'd like to keep the dependencies syntax as strings, if possible, to make the change less impactful (and backwards compatible), so you only add edge data if you need it.
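As a rough sketch of how tagged strings like these could stay backwards compatible, the snippet below splits an optional trailing tag off each entry. The `DepEdge` tuple, the `parse_dependency` helper, and the exact `address @tag` grammar are assumptions made for illustration, not existing Pants behavior.

```python
from typing import NamedTuple, Optional


class DepEdge(NamedTuple):
    """Hypothetical parsed form of one `dependencies` entry."""

    address: str
    edge_type: Optional[str]  # None means no edge metadata was given


def parse_dependency(raw: str) -> DepEdge:
    """Split "address @tag" into its parts; plain addresses stay untagged."""
    address, sep, tag = raw.partition(" @")
    if not sep:
        return DepEdge(raw.strip(), None)
    return DepEdge(address.strip(), tag.strip())


deps = [":config @resource", "//:lib @test", "//:util"]
parsed = [parse_dependency(d) for d in deps]
```

An entry without a tag parses to an edge type of `None`, so existing BUILD files would keep their current meaning.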

Edit: so this idea is inspired by a comment made by @jsirois, only I can't find that particular comment now..

8 replies

cognifloyd Jun 18, 2022
Collaborator

If we want to enhance dependencies, then I think I would go for something like this:

dependencies = [
  as_resource(":config"),
  as_test("//:lib"),
  as_runtime("//:util"),
  ...
]

Where each of these as_* functions returns a str subclass that "annotates" the dependency with additional type information.
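A minimal sketch of that idea, assuming a hypothetical `TypedDep` class and a `kind` attribute (neither exists in Pants today):

```python
class TypedDep(str):
    """Hypothetical str subclass carrying a dependency "kind"."""

    kind: str

    def __new__(cls, address: str, kind: str = "default") -> "TypedDep":
        obj = super().__new__(cls, address)
        obj.kind = kind
        return obj


def as_resource(address: str) -> TypedDep:
    return TypedDep(address, "resource")


def as_test(address: str) -> TypedDep:
    return TypedDep(address, "test")


def as_runtime(address: str) -> TypedDep:
    return TypedDep(address, "runtime")


dependencies = [as_resource(":config"), as_test("//:lib"), "//:util"]

# Code that only understands plain strings keeps working, while newer code
# can read the annotation and fall back to a default when it is absent.
kinds = [getattr(d, "kind", "default") for d in dependencies]
```

Because each annotated value is still a `str`, anything that compares or hashes addresses as strings would not need to change.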

jsirois Jun 18, 2022

@cognifloyd from 2012:
https://github.com/twitter-archive/commons/blob/1be86cf64f240486c1e3fc15ee3538b264b54c04/3rdparty/BUILD#L375-L381

There the jar(...) dependency and pants(...) dependency are along your lines of thought. At the time pants(...) indicated an address pointer to the real dependency, which was de-referenced and jar(...) was a concrete jvm 3rdparty dependency type.

cognifloyd Jun 18, 2022
Collaborator

Oh. So that was a thing in v1? Huh.

I think I still lean toward option 1, multiple dependency fields, but this would be ok too.

I used as_ in place of a symbol to distinguish from targets. I guess the question is which feature should be most clear?

Targets vs dependency types
Number of fields on a target vs enriching dependencies with usage info
the word "dependency" vs the dependency's purpose

cognifloyd Jun 18, 2022
Collaborator

Or here's another idea. Instead of a dependency type function, each dependency type can be modeled as an instance of a special class with one of the operators overridden.

Similar to how pathlib overrides / to simplify building paths, we could override an operator that is not normally used with strings.

Possibilities:

^
//
**
|
@ (the matmul matrix-multiplication operator, new in Python 3.5)
<<
>>

dependencies = [
  Resource | ":config",
  Test | "//:lib",
  Runtime | "//:util",
  ...
]
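For the curious, a sketch of how the `|` spelling could work in plain Python. The `DepKind` class and the `(kind, address)` tuple it returns are hypothetical; the names `Resource`, `Test`, and `Runtime` are taken from the example above.

```python
from typing import Tuple


class DepKind:
    """Hypothetical marker whose `|` operator tags an address string."""

    def __init__(self, name: str) -> None:
        self.name = name

    def __or__(self, address: str) -> Tuple[str, str]:
        # Called for `marker | "address"`, since the marker is on the left.
        return (self.name, address)


Resource = DepKind("resource")
Test = DepKind("test")
Runtime = DepKind("runtime")

dependencies = [
    Resource | ":config",
    Test | "//:lib",
    Runtime | "//:util",
]
```

Only the marker-on-the-left form works here; supporting `":config" | Resource` as well would require also defining `__ror__`.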

jsirois Jun 18, 2022

Yeah, there are a zillion ways to spell idea 3. I'm not advocating for this, just pointing out we've spun around and around on this over the years.

jsirois · 2022-06-17T01:02:06Z

jsirois
Jun 17, 2022

I harp on this point a lot, but "Shifting asset metadata to consumer-node" is another way of saying Pants metadata configures the verbs; i.e., it should always be thought of fundamentally as use-dependent. That is, targets never made much sense, but configuring goals does make sense, and even more so to an end user. A user fundamentally wants to do things, not declare things. It goes: "I want to vroom - oh, that needs gas." Not: "I want to state I have gas here. Ah, maybe I should vroom?"

So, a doc goal may have a different idea of dependencies from a javac goal, for example; so 1 above makes sense. A doc goal or its rules might need a field set that includes css dependencies whereas a javac goal or its rules need a field set with just dependencies that can be turned into classfiles (metaprogramming and annotation processors aside - since those could use resources to drive compilation). These different ideas of dependencies, or even just the more general different field sets needed by different goals, point to 1 as far as I can tell. Some verbs will simply not fit "runtime", "provided", etc. scopes at all.

1 reply

cognifloyd Jun 17, 2022
Collaborator

Agreed. I like option 1 best.

Option 2 doesn't fit my mental model very well because in python something loaded from the file system or via pkg_resources is all "runtime".

Option 3 looks like a backwards compatibility and UX nightmare. If the structure of dependencies is more rich/complicated, then overrides will get even more complicated.

With the status quo, I get the distinction between file and resource, but I have no idea how to create a target for it without inspecting how it gets used. So, option 1 would have vastly simplified my initial bootstrapping of BUILD files. I'm still not sure I have all the files/resources targets set up just so.

This proposal also makes dual purpose files easier to reason about. For example: If you have a file, let's say a yaml file to make the example more concrete, that gets loaded via pkg_resources within an app, but should be just another file that gets embedded in docs by sphinx, then that asset is a resource in one case and a file in the other.

sureshjoshi · 2022-06-17T23:09:10Z

sureshjoshi
Jun 17, 2022
Collaborator

In general, huge +1 to this. I like consolidating files/resources into a single type (asset or whatever it will be named).

If I were to rank these, as is, it would be 1, 2, 4 (the kaos proposal) as all being very close, with a slightly distant 3.

I really hate file as having a real, tangible meaning here - because in the context of build tools, file is kinda over-generalized and meaningless. I like the general idea of "compile/build time" vs "run time" dependencies, though - still on board there. Just the file name that gives me pause.

In the current dependencies field, we already have some level of custom syntax, but I view that as being more about where the dependency comes from, not how it's consumed. But having said that, specifying up to 3 fields of content would guarantee I'm forced to re-read the docs over and over. I review files vs resources every single time I use them.

Here's a question: Is the dependencies field assumed to hold compile-time or runtime deps? And based on that, couldn't we use that for sources and assets, and then have a single, extra field which represents runtime assets? I guess this might be vibing off Option 2?

my_sources(
  name="lib", 
  ...
)

assets(
    name="build-stuff",
    sources=["config.json", "another-config.json"]
)

assets(
    name="runtime-stuff",
    sources=["fancy-pants.exe"]
)

some_target(
  name="t",
  dependencies=[":lib", ":build-stuff"],
  runtime=[":runtime-stuff"]
)

Having typed all this, I started thinking about python_distribution and realized that dependencies are pushed through to the final package there, not just used as part of the build... I guess what I'm wondering is how reasonable it would be to add only a build-only or a runtime-only field.

2 replies

jsirois Jun 17, 2022

> Here's a question: Is the dependencies field assumed to hold compile-time or runtime deps?

I contend that question assumes too much. What does that even mean for a doc goal with source code dependencies (reads doc from doc strings or comments in those) and CSS dependencies (for styling in the doc comments)? You could say that shoe-horns into compile-time I guess since generating docs is sort-of like a compile step? I'm not very creative though and I'm sure other examples could stress this further introducing many more phases or scopes than the handful we tend to discuss.

sureshjoshi Jun 18, 2022
Collaborator

I agree, I have these same existential code questions myself when working on APIs.

From my perspective, I basically view a target dep as a "thing" or bunch of "things" that can be consumed as needed, in whatever way is required.

I think that's what makes this files vs resources even a bit more confusing to me, because intuitively, I would have assumed that the rules governing the goal + target combo would account for this. However, sometimes it's nice to just be able to straight copy to an endpoint.

Things get a bit whacky when I take into account stuff like ConfigFilesRequest which can just kinda... "pull" from the filesystem, in spite of a file not being explicitly mentioned in the target dependencies.

benjyw · 2022-06-21T21:00:35Z

benjyw
Jun 21, 2022
Maintainer Sponsor

Leaning in to @jsirois's insight, to truly represent this we would need one set of metadata per goal. Or at least one dependencies field per goal. That is of course very verbose, so we would also need some way to shorthand that for the 95% of cases where those are all the same in practice. We'd also need to figure out how this all interacts with dep inference.

4 replies

benjyw Jun 21, 2022
Maintainer Sponsor

One could go further and argue (as @jsirois has IIRC) that target types themselves are goal-dependent.

sureshjoshi Jun 21, 2022
Collaborator

Would that be considered intentional? Or a side-effect?

With some exceptions, I'd considered targets to basically be goal metadata - and I'd assumed that was the intention. I think the docs even say it (https://www.pantsbuild.org/docs/targets):

> Most goals require metadata about your code
> Targets are an addressable set of metadata describing your code.

Or is this to say that the relationship between goals and targets is more 1:1 rather than n:1 ?

stuhood Jun 22, 2022
Maintainer Sponsor

> Leaning in to @jsirois's insight, to truly represent this we would need one set of metadata per goal. Or at least one dependencies field per goal. That is of course very verbose, so we would also need some way to shorthand that for the 95% of cases where those are all the same in practice. We'd also need to figure out how this all interacts with dep inference.

From a practical perspective, it just doesn't make sense to have "metadata per goal". There are diminishing returns on the precision of "the doc goal doesn't need absolutely everything that the compile goal does", and so certainly when it comes to explicitly specified dependencies, you wouldn't have two lists for this case.

So the answer is almost certainly neither "exactly one dependencies list per target" nor "exactly one dependencies list per goal"... it's a spectrum, and the answer is somewhere in between. Which honestly makes the point not that compelling to me.

kaos Jun 22, 2022
Collaborator

I think field sets are the bridge between the target and the goal.

Perhaps a similar way of classifying dependencies could be used to bridge "a bag of dependencies" on the target to "a set of typed dependencies" for a field set.
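Something like the following sketch, where the mapping lives with the consumer rather than on the asset. `KIND_BY_TARGET_TYPE` and `classify` are hypothetical names, and a real field set would presumably match on fields rather than on target type names.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical rule owned by ONE consuming field set: how it interprets each
# target type. A different consumer could map `asset` to "file" instead.
KIND_BY_TARGET_TYPE: Dict[str, str] = {
    "python_source": "source",
    "asset": "resource",
    "pex_binary": "runtime_package",
}


def classify(bag: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group (address, target_type) pairs by the kind this consumer assigns."""
    typed: Dict[str, List[str]] = defaultdict(list)
    for address, target_type in bag:
        kind = KIND_BY_TARGET_TYPE.get(target_type, "unclassified")
        typed[kind].append(address)
    return dict(typed)


typed_deps = classify(
    [("//:lib", "python_source"), (":config", "asset"), ("//:x", "go_package")]
)
```

Anything the consumer does not recognize lands in an explicit `unclassified` bucket instead of being dropped silently.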

stuhood · 2022-06-22T03:46:17Z

stuhood
Jun 22, 2022
Maintainer Sponsor

I think that from a grokkability perspective, option 2 makes the most sense to me (although possibly with a shorter name, like deps and *_deps). But it is also the case that not all dependencies are multi-valued, and that argues for option 1 (because you're not going to want to force a single-valued field to append _dep to its name).

But really: given that we are already mostly in the world of option 1, sticking with option 1, but:

making SpecialCasedDependencies less special cased, and more typesafe
doing a better job of writing up conventions around when and how to use multiple dependencies fields

... will go a long way. For example: if SCD or a replacement allowed for declaring type-safe dependencies (maybe from a "matching FieldSet" perspective), and then we discouraged putting multiple types of dependencies in the default dependencies field via conventions, we might naturally begin separating out more typed deps.
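As a rough illustration of what "type-safe" could mean here, the sketch below rejects a dependency whose type a field does not accept, instead of silently ignoring it. `Target`, `ALLOWED_TYPES`, and `validate_field` are stand-ins invented for this example; the real mechanism would more likely be expressed through FieldSet matching than through target type names.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List


@dataclass(frozen=True)
class Target:
    """Hypothetical stand-in for a resolved target."""

    address: str
    target_type: str


# Hypothetical schema: which target types each dependency-like field accepts.
ALLOWED_TYPES: Dict[str, FrozenSet[str]] = {
    "dependencies": frozenset({"python_source"}),
    "files": frozenset({"asset"}),
}


def validate_field(field_name: str, deps: List[Target]) -> None:
    """Raise instead of silently ignoring a dependency of an unexpected type."""
    allowed = ALLOWED_TYPES[field_name]
    bad = [t.address for t in deps if t.target_type not in allowed]
    if bad:
        raise ValueError(
            f"`{field_name}` only accepts {sorted(allowed)}, but got: {bad}"
        )


validate_field("files", [Target(":config", "asset")])  # accepted

try:
    validate_field("dependencies", [Target("//cmd:bin", "go_package")])
except ValueError as exc:
    error = str(exc)
```

With a check like this, the go_package-in-python_source case mentioned at the top of the thread would surface as an error rather than a silent no-op.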

0 replies

siemato · 2022-06-22T21:33:02Z

siemato
Jun 22, 2022

Since we have to decide what to do with assets at the consumer, should we leave a door open to describe what an asset is (not how it's supposed to be used), for example with an optional field? If we focus on handling it in a 'this is an asset, it represents something' way, we would have to rely on the toolchains to figure out how to use it. There might be cases where that's not immediately possible. Beyond that, I could see a use for attaching meaning to a file in order to deprecate assets (so that a warning appears).

I think, adding metadata to an asset is not wrong per se. It's just the wrong approach to let it define in advance, how it's going to be used.

About the syntax, I am mighty undecided. I probably still need to take a few showers to think it through. All in all, it sounds reasonable to have a more general asset definition.

Is it possible to describe groups of assets with this proposal?

0 replies

kaos · 2022-06-23T12:56:26Z

kaos
Jun 23, 2022
Collaborator

Another aspect I think is worth considering in this context is how to treat the various syntaxes used for assets.

What I mean by that is, say we have linters for yaml, json, toml, html, css etc... you get the picture. Now if we include such files with an assets target, how do we match the proper sources to the various linters, if any.

The brute force method would be that every "asset file" linter iterates over all the files, and picks out the ones it recognizes.

Maybe this is good enough (or even best). Just wanted to raise the question in case there is something clever to be done here, while things are still in the mold.
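For reference, the brute-force approach could be as small as the sketch below. The linter names and the `LINTER_SUFFIXES` table are made up for illustration, and matching purely on file suffix is an assumption.

```python
from pathlib import PurePath
from typing import Dict, FrozenSet, List

# Hypothetical mapping from linter name to the file suffixes it recognizes.
LINTER_SUFFIXES: Dict[str, FrozenSet[str]] = {
    "yaml-lint": frozenset({".yaml", ".yml"}),
    "json-lint": frozenset({".json"}),
    "toml-lint": frozenset({".toml"}),
}


def partition_assets(paths: List[str]) -> Dict[str, List[str]]:
    """Each linter scans every asset path and keeps the ones it recognizes."""
    return {
        linter: [p for p in paths if PurePath(p).suffix in suffixes]
        for linter, suffixes in LINTER_SUFFIXES.items()
    }


matches = partition_assets(
    ["config.json", "ci.yml", "pyproject.toml", "logo.png"]
)
```

Files that no linter claims (like `logo.png` here) are simply left alone, and a file whose content is not identifiable by suffix would need some other signal.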

11 replies

siemato Jun 26, 2022

this plays into my post here. I was thinking about having edge cases where the content of a file is not distinguishable by the file ending (toml/css/...) and hence having an optional extra field to define it would be nice.

benjyw Jun 26, 2022
Maintainer Sponsor

That is more of an implementation detail, but from the user's perspective, when a target adds no information, it shouldn't be needed. And if you need it to add information, then of course you can do so.

There are two kinds of information a target can add: 1) Extra metadata, 2) Knowing what to select when using :: in a cmd-line spec.

siemato Jun 26, 2022

thanks for the clarification 👍

benjyw Jun 26, 2022
Maintainer Sponsor

I think we want to strive towards doing the thing the user intended, when it's obvious what that thing is, without requiring boilerplate...

siemato Jun 26, 2022

yeah, it's rough to balance it from a practicability standpoint. I'd rather avoid having to parse a file for a toolchain to figure out if it's responsible for handling it. It is probably an edge case, but still worth keeping in mind. That's why I suggested a field to describe what it is, not how it's supposed to be consumed. It just feels weird to do that in a target, instead of the asset definition. Again, it's not supposed to work as a descriptor of how to use a file. That would make sense in the target description.
