Add more flexible contribution filtering explainer (#109)
Details a proposal for allowing more flexible querying. See #92 for earlier discussion.
1 parent 55019b5, commit 66f4bc3. Showing 1 changed file with 367 additions and 0 deletions.
@@ -0,0 +1,367 @@
# More flexible contribution filtering for Aggregation Service queries

_Note: This document proposes a new backwards-compatible change in the Private
Aggregation API, Attribution Reporting API and Aggregation Service. While this
new functionality is being developed, we still highly encourage testing the
existing API functionalities to support core utility and compatibility needs._

#### Table of Contents

- [Introduction](#introduction)
- [Motivating use cases](#motivating-use-cases)
  - [Processing contributions at different cadences](#processing-contributions-at-different-cadences)
  - [Processing contributions by campaign ID](#processing-contributions-by-campaign-id)
- [Non-goals](#non-goals)
- [Proposal: filtering ID in the encrypted payload](#proposal-filtering-id-in-the-encrypted-payload)
  - [Use case examples](#use-case-examples)
    - [Processing contributions at different cadences](#processing-contributions-at-different-cadences-1)
    - [Processing contributions by campaign ID](#processing-contributions-by-campaign-id-1)
  - [Details](#details)
    - [Small ID space by default, but configurable](#small-id-space-by-default-but-configurable)
    - [Backwards compatibility](#backwards-compatibility)
    - [One ID per contribution](#one-id-per-contribution)
- [Possible future extension: batching ID in the shared_info](#possible-future-extension-batching-id-in-the-shared_info)
  - [Use case examples](#use-case-examples-1)
    - [Processing contributions by campaign ID](#processing-contributions-by-campaign-id-2)
    - [Processing contributions at different cadences](#processing-contributions-at-different-cadences-2)
  - [Details](#details-1)
    - [Requires deterministic reports and specifying batching ID from a single-site context](#requires-deterministic-reports-and-specifying-batching-id-from-a-single-site-context)
    - [Backwards compatibility](#backwards-compatibility-1)
    - [One ID per report](#one-id-per-report)
    - [Use with filtering ID](#use-with-filtering-id)
- [Limits on number of IDs used](#limits-on-number-of-ids-used)
- [Application to Attribution Reporting API](#application-to-attribution-reporting-api)
- [Privacy considerations](#privacy-considerations)
|
||
## Introduction

Currently, the Aggregation Service only allows each '[shared
ID](https://github.com/WICG/attribution-reporting-api/blob/main/AGGREGATION_SERVICE_TEE.md#disjoint-batches)'
to be present in one query. A set of reports with the same shared ID cannot be
split for separate queries, even if the resulting batches are disjoint. However,
there have been requests to introduce additional flexibility to this query model
(see GitHub issues for [Private
Aggregation](https://github.com/patcg-individual-drafts/private-aggregation-api/issues/92)
and [Attribution Reporting](https://github.com/WICG/attribution-reporting-api/issues/732)).

Here, we propose introducing a new _filtering ID_ that is set when a
contribution is made and embedded in the encrypted payload. This allows for
these queries to be split further, with the aggregation service filtering
contributions based on the provided IDs.

We also propose a possible future extension where a _batching ID_ is set from a
first-party context and embedded in the `shared_info`. This would allow for the
ad tech to filter the reports directly, improving the ergonomics for some use
cases.

## Motivating use cases

#### Processing contributions at different cadences

For some measurements, it may be desirable to query the Aggregation Service less
frequently; this would allow for more contributions to be aggregated before
noise is added, improving the signal-to-noise ratio. However, for other
measurements, it may be more valuable to receive a result faster. (Support for
this use case has been requested for Attribution Reporting
[here](https://github.com/WICG/attribution-reporting-api/issues/732).) Filtering
IDs could be used to separate these measurements into different queries.

#### Processing contributions by campaign ID

An ad tech might want to process measurements (for example, reach measurements)
separately for each advertising campaign. To allow for this, it might want to
use a different filtering ID or batching ID for each campaign. Note that,
without this new functionality, the advertiser is not part of the shared ID and
so it's not currently possible to process these separately.

## Non-goals

While we aim to increase the flexibility of report batching strategies, we don't
intend to allow every report or every contribution to be queried separately.
Further, we don't intend to allow for arbitrary groupings decided after
reporting is complete. This is to ensure that the scale of aggregatable report
accounting remains feasible (see the [discussion
below](#limits-on-number-of-ids-used)).
|
||
## Proposal: filtering ID in the encrypted payload

We plan to introduce additional IDs in the payload called _filtering IDs_. By
embedding these IDs within the encrypted payload, their values could be set
within a worklet/script runner (e.g. for a Protected Audience bidder) and
could even be chosen based on cross-site data. For example:

```js
privateAggregation.contributeToHistogram({bucket: 1234n, value: 56, filteringId: 3});
```

If no filtering ID is provided, a default ID of 0 will be used. (See also
[Backwards compatibility](#backwards-compatibility) below.)

As the reporting endpoint cannot determine the IDs within a given report, the
aggregation service will provide new functionality for filtering contributions
based on their IDs. In particular, each aggregation service query's parameters
should provide a list of allowed filtering IDs and all contributions with other
IDs will be filtered out. For example:

```jsonc
// ...
"job_parameters": {
  "output_domain_blob_prefix": "domain/domain.avro",
  "output_domain_bucket_name": "<data_bucket>",
  "filtering_ids": [1, 3] // IDs to keep in the query
},
```
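As a rough illustration of the filtering semantics described above, the sketch below models what a query with `filtering_ids: [1, 3]` would keep. It assumes the contributions have already been decrypted inside the aggregation service and uses a hypothetical `aggregateWithFilter` helper; noise addition and output domain handling are omitted.

```javascript
// Hypothetical model of a decrypted contribution:
// {bucket: BigInt, value: number, filteringId: number}.
function aggregateWithFilter(contributions, allowedFilteringIds) {
  const allowed = new Set(allowedFilteringIds);
  const histogram = new Map(); // bucket -> summed value
  for (const {bucket, value, filteringId} of contributions) {
    if (!allowed.has(filteringId)) continue; // filtered out of this query
    histogram.set(bucket, (histogram.get(bucket) ?? 0) + value);
  }
  return histogram;
}

const contributions = [
  {bucket: 1234n, value: 56, filteringId: 3},
  {bucket: 1234n, value: 10, filteringId: 1},
  {bucket: 1234n, value: 7, filteringId: 2},
  {bucket: 5678n, value: 4, filteringId: 0},
];

// Mirrors "filtering_ids": [1, 3] from the job parameters example.
const result = aggregateWithFilter(contributions, [1, 3]);
```

Only the contributions with IDs 1 and 3 are summed for bucket `1234n`; the contributions with IDs 2 and 0 are left out of this query.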
|
||
Note that this API is not final, e.g. it might make more sense to specify the
IDs via an avro file.

The aggregation service would include a filtering ID in the computation of each
'[shared ID](https://github.com/WICG/attribution-reporting-api/blob/main/AGGREGATION_SERVICE_TEE.md#disjoint-batches)'
hash. For aggregatable report accounting, the aggregation service would assume
that each filtering ID listed in the job parameters is present in every report.
This avoids leaking any information about which IDs were actually present in
each report.

### Use case examples

#### Processing contributions at different cadences

As discussed [above](#processing-contributions-at-different-cadences), a
reporting site may want to query the Aggregation Service at different cadences
for different kinds of measurements.

Filtering IDs could be used to separate these measurements into different
queries. For example, you could specify a filtering ID of 1 for measurements
that should be queried daily and an ID of 2 for measurements that should be
queried monthly. Each day, the reporting site would then send a day's reports to
the aggregation service and specify that only contributions with a filtering ID
of 1 should be processed. Each month, the reporting site would send an entire
month's reports (which have already been sent in the daily queries), but
specify that only contributions with a filtering ID of 2 should be processed.

Note that, in this flow, every report needs to be sent to the aggregation
service multiple times. However, as the filtering IDs are different, no
_contribution_ is being included in an aggregation twice and so there are no
issues with aggregatable report accounting.
|
||
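The daily and monthly queries in this example could be configured along the following lines. This is only a sketch: the `FILTERING_ID_BY_CADENCE` mapping and both helper functions are hypothetical, and the `job_parameters` shape simply mirrors the non-final example earlier in this document.

```javascript
// Hypothetical assignment of filtering IDs to cadences, matching the example above.
const FILTERING_ID_BY_CADENCE = new Map([
  ['daily', 1],
  ['monthly', 2],
]);

// Builds the job parameters for one cadence's query.
function jobParametersFor(cadence) {
  const id = FILTERING_ID_BY_CADENCE.get(cadence);
  if (id === undefined) throw new RangeError(`Unknown cadence: ${cadence}`);
  return {
    output_domain_blob_prefix: 'domain/domain.avro',
    output_domain_bucket_name: '<data_bucket>',
    filtering_ids: [id],
  };
}

// Returns true if two queries list any filtering ID in common. Queries over the
// same reports should not, or the same contribution would be aggregated twice.
function shareFilteringId(paramsA, paramsB) {
  const idsA = new Set(paramsA.filtering_ids);
  return paramsB.filtering_ids.some((id) => idsA.has(id));
}

const dailyJob = jobParametersFor('daily');
const monthlyJob = jobParametersFor('monthly');
const overlapping = shareFilteringId(dailyJob, monthlyJob); // false
```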
#### Processing contributions by campaign ID

For certain use cases, the filtering ID may be a deterministic function of the
context. For example, if an ad tech wants to process measurements separately for
each campaign, it could use a different filtering ID for each campaign. As the
campaign would be known outside the Shared Storage worklet, the ad tech could
externally maintain a mapping from the [context
ID](https://patcg-individual-drafts.github.io/private-aggregation-api/#aggregatable-report-context-id)
to the filtering ID.

When batching reports for the aggregation service, the ad tech could use this
mapping to separate the reports by filtering ID, even though it cannot decrypt
the payload. By avoiding reprocessing every report for each campaign ID, the
number of IDs used can be much larger while keeping processing costs reasonable.
|
||
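A minimal sketch of that batching step follows. It assumes the ad tech recorded a context ID to filtering ID mapping when it triggered each operation, and that each received report exposes its context ID as a top-level `context_id` field. The `batchByFilteringId` helper and the sample IDs are hypothetical.

```javascript
// Hypothetical mapping kept by the ad tech: context ID -> filtering ID
// (here, one filtering ID per campaign).
const filteringIdByContextId = new Map([
  ['ctx-a', 17],
  ['ctx-b', 42],
  ['ctx-c', 17],
]);

// Groups still-encrypted reports by the filtering ID their contributions used,
// without needing to read the payload.
function batchByFilteringId(reports, mapping) {
  const batches = new Map(); // filtering ID -> array of reports
  for (const report of reports) {
    const filteringId = mapping.get(report.context_id);
    if (filteringId === undefined) {
      throw new Error(`No filtering ID recorded for context ID ${report.context_id}`);
    }
    if (!batches.has(filteringId)) batches.set(filteringId, []);
    batches.get(filteringId).push(report);
  }
  return batches;
}

// Other report fields (shared_info, encrypted payloads) are omitted for brevity.
const receivedReports = [
  {context_id: 'ctx-a'},
  {context_id: 'ctx-b'},
  {context_id: 'ctx-c'},
];
const batches = batchByFilteringId(receivedReports, filteringIdByContextId);
```

Each resulting batch could then be sent in its own query whose `filtering_ids` list contains only that batch's ID.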
### Details

#### Small ID space by default, but configurable

The filtering ID would be an unsigned integer limited to a small number of bits
(e.g. 8) by default. If no filtering ID is provided, a value of 0 will be used.
We limit the size of the ID space to prevent unnecessarily increasing the
payload size and thus storage and processing costs. As filtering IDs are not
readable by the reporting endpoint, processing multiple filtering IDs separately
would typically require reprocessing the same reports for each query (see [the
first example use case](#processing-contributions-at-different-cadences-1)
above). Given this performance constraint, it is unlikely that a larger ID space
will be practical with this flow.

However, other flows could make use of a larger ID space (see [the second
example use case](#processing-contributions-by-campaign-id-1) above), so we plan
to allow for the ID space to be configurable per-report, e.g. to 32 bits. To
avoid amplifying a counting attack due to the different payload size, we plan to
make the number of reports emitted with a custom filtering ID bit size
deterministic. This will result in a null report being sent if no contributions
are made. Note that this means the filtering ID _space_ for Private Aggregation
reports must also be specified outside Shared Storage worklets/Protected
Audience script runners.

For Shared Storage and Protected Audience sellers, we propose reusing the
`privateAggregationConfig` implemented/proposed for report verification, adding
a new field, e.g.

```js
sharedStorage.run('example-operation', {
  privateAggregationConfig: {
    contextId: 'example-id',
    filteringIdBitSize: 32
  }
});
```
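To make the size limits concrete, the sketch below checks whether an ID fits in a given bit size. The `fitsInBitSize` helper is hypothetical, and the 8-bit default is only the example value mentioned above, not a finalized limit.

```javascript
// Returns true if `filteringId` is an unsigned integer representable in
// `bitSize` bits. BigInt is used so that large bit sizes do not lose precision.
function fitsInBitSize(filteringId, bitSize = 8) {
  const id = BigInt(filteringId); // throws for non-integer numbers
  return id >= 0n && id < (1n << BigInt(bitSize));
}

const fitsDefault = fitsInBitSize(255);        // largest 8-bit value
const tooLargeForDefault = fitsInBitSize(256); // needs a larger ID space
const fitsIn32Bits = fitsInBitSize(256, 32);   // allowed with filteringIdBitSize: 32
```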
|
||
We do not currently plan to allow the filtering ID bit size to be configured for
Protected Audience bidders as these flows require context IDs to make the scale
practical; we do not currently plan to expose context IDs to bidders (see the
[explainer](https://github.com/patcg-individual-drafts/private-aggregation-api/blob/main/report_verification.md#specifying-a-contextual-id-and-each-possible-ig-owner)
for more discussion).

#### Backwards compatibility

For backwards compatibility, if no list of `filtering_ids` is provided in an
aggregation query, the list containing only the default ID will be used (i.e.
`[0]`). This means that any contributions that don't specify a filtering ID
would be included in that aggregation, along with any contributions that
explicitly specify the default ID. Additionally, the aggregation service will
process reports using older format versions (i.e. before filtering IDs were
supported) as if every contribution uses the default filtering ID.

This should ensure that no changes need to be made to existing pipelines if
filtering IDs are not needed.
|
||
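The defaulting rules above can be sketched as follows. The helper names are hypothetical, and the contribution objects are a simplified model rather than the actual payload format.

```javascript
const DEFAULT_FILTERING_ID = 0;

// A query without `filtering_ids` behaves as if it listed only the default ID.
function effectiveFilteringIds(jobParameters) {
  return jobParameters.filtering_ids ?? [DEFAULT_FILTERING_ID];
}

// A contribution without an ID (including any contribution from an older
// report format) is treated as using the default ID.
function effectiveFilteringId(contribution) {
  return contribution.filteringId ?? DEFAULT_FILTERING_ID;
}

const legacyQueryIds = effectiveFilteringIds({
  output_domain_blob_prefix: 'domain/domain.avro',
});
const legacyContributionId = effectiveFilteringId({bucket: 1234n, value: 56});

// An existing pipeline that never mentions filtering IDs keeps its contributions.
const legacyContributionKept = legacyQueryIds.includes(legacyContributionId);
```

How an explicitly empty `filtering_ids` list would be treated is not specified in this proposal; this sketch simply passes it through unchanged.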
#### One ID per contribution

We plan to allow for a filtering ID to be set individually for each contribution
in a report's payload. To reduce the impact on payload size, we could consider
instead limiting the number of distinct filtering IDs per report to a smaller
number. However, this may pose ergonomic difficulties.

## Possible future extension: batching ID in the shared\_info

Later, to improve ergonomics (see [example
below](#processing-contributions-by-campaign-id-2)), we could consider
introducing a new, optional field to an aggregatable
[report](https://github.com/patcg-individual-drafts/private-aggregation-api#reports)'s
shared\_info called a _batching ID_. For example:

```jsonc
"shared_info": "{\"api\":\"shared-storage\",\"batching_id\":1234,\"report_id\":\"[UUID]\",\"reporting_origin\":\"https://reporter.example\",\"scheduled_report_time\":\"[timestamp in seconds]\",\"version\":\"[api version]\"}",
```

This ID would be an unsigned 32-bit integer. The aggregation service would
include the batching ID in computation of the '[shared
ID](https://github.com/WICG/attribution-reporting-api/blob/main/AGGREGATION_SERVICE_TEE.md#disjoint-batches)'
hash, allowing reports with differing batching IDs to be batched and queried
separately.

### Use case examples

#### Processing contributions by campaign ID

As discussed [above](#processing-contributions-by-campaign-id-1), an ad tech may
want to process measurements separately for each campaign. In that example, the
filtering ID used is a deterministic function of the context. Instead of setting
a filtering ID, a batching ID could be specified.

As the batching ID would be readable by the ad tech, it would then be able to
use this batching ID to identify what campaign the report is for and to batch
and query the reports for each campaign separately. It would no longer have to
maintain a context ID to filtering ID mapping, which would improve ergonomics
and might reduce the risk of bugs arising from that mapping.
|
||
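For comparison with the filtering ID flow, the sketch below batches reports using only the plaintext `shared_info`. It assumes `shared_info` is delivered as a JSON string containing the proposed `batching_id` field, as in the example above; the `batchByBatchingId` and `makeReport` helpers are hypothetical.

```javascript
// Groups reports by the batching ID found in their plaintext shared_info.
// Reports without a batching ID are grouped under the key `null`.
function batchByBatchingId(reports) {
  const batches = new Map(); // batching ID (or null) -> array of reports
  for (const report of reports) {
    const sharedInfo = JSON.parse(report.shared_info);
    const batchingId = sharedInfo.batching_id ?? null;
    if (!batches.has(batchingId)) batches.set(batchingId, []);
    batches.get(batchingId).push(report);
  }
  return batches;
}

// Builds a simplified sample report; real reports carry more fields.
function makeReport(batchingId) {
  const sharedInfo = {
    api: 'shared-storage',
    reporting_origin: 'https://reporter.example',
  };
  if (batchingId !== undefined) sharedInfo.batching_id = batchingId;
  return {shared_info: JSON.stringify(sharedInfo)};
}

const batchedReports = batchByBatchingId([
  makeReport(1234),
  makeReport(5678),
  makeReport(1234),
  makeReport(), // no batching ID specified
]);
```

No externally maintained mapping is needed here, since the ID is read directly from each report.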
#### Processing contributions at different cadences

While a reporting site could potentially use a batching ID for processing
contributions at different cadences, it has a few downsides relative to a
filtering ID. As only one batching ID can be set per report, multiple reports
would need to be triggered, e.g. through multiple Shared Storage operations.
Further, as the batching ID [requires deterministic
reports](#requires-deterministic-reports-and-specifying-batching-id-from-a-single-site-context),
this would result in a report being sent for each ID, even if there are no
contributions for that cadence. These additional reports would negate the
benefit of being able to split reports into separate batches at the reporting
endpoint.

### Details

#### Requires deterministic reports and specifying batching ID from a single-site context

As this option embeds highly specific information about the context that
triggered a particular report (in plaintext), we need to make the number of
reports emitted with the batching ID deterministic. (See the [report
verification explainer](https://github.com/patcg-individual-drafts/private-aggregation-api/blob/main/report_verification.md#deterministic-number-of-reports)
for a similar discussion with respect to context IDs.) This will result in a
null report being sent if no contributions are made. Note that this means the
batching ID for Private Aggregation reports must also be specified outside
Shared Storage worklets/Protected Audience script runners.

For Shared Storage and Protected Audience sellers, we propose reusing the
`privateAggregationConfig` implemented/proposed for report verification, adding
a new field, e.g.

```js
sharedStorage.run('example-operation', {
  privateAggregationConfig: {
    contextId: 'example-id',
    batchingId: 1234
  }
});
```

We do not currently plan to use a context ID for Protected Audience bidders due
to the potential for a large number of null reports (see the
[explainer](https://github.com/patcg-individual-drafts/private-aggregation-api/blob/main/report_verification.md#specifying-a-contextual-id-and-each-possible-ig-owner)
for more discussion). Identical considerations would apply to this batching ID
in the `shared_info`, so we would not allow a batching ID to be set for bidders.
Note that Protected Audience auction winners could still report using Shared
Storage in the rendering (fenced) frame.

#### Backwards compatibility

If no batching ID is specified, the field will not be present in the
`shared_info`. This should ensure the change is backwards compatible.

#### One ID per report

Each report can have at most one batching ID (unlike filtering IDs which are
per-contribution). This aligns with the behavior for context IDs, given they are
both readable by the reporting endpoint.

#### Use with filtering ID

Both a batching ID and a filtering ID could be used at the same time.

## Limits on number of IDs used

This proposal increases the number of '[shared
IDs](https://github.com/WICG/attribution-reporting-api/blob/main/AGGREGATION_SERVICE_TEE.md#disjoint-batches)'
that the Aggregatable Report Accounting service will need to keep track of. So,
we will need to ensure there are limits to this increase to prevent scale
issues. (Note that it is not practical for each report to have its own entry
recorded in the accounting service.)

We plan to impose a limit on the number of shared IDs for any particular
aggregation. That is, if too many are used by a query, an error would occur. The
effect of this limit on the number of filtering IDs or batching IDs (or both)
that can be provided will depend on other details of the batching strategy.

Straw proposal: a limit of 1000 shared IDs per aggregation.
|
||
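To see how this limit interacts with filtering IDs, the sketch below counts the shared IDs a query would touch. It follows the accounting assumption stated earlier that every listed filtering ID is treated as present in every report. Shared IDs are modelled as plain strings rather than real hashes, the `reportKeys` values are hypothetical stand-ins for the existing shared ID inputs, and the limit of 1000 is only the straw proposal above.

```javascript
const MAX_SHARED_IDS_PER_AGGREGATION = 1000; // straw proposal, not final

// Returns the set of modelled shared IDs for a query: one per combination of
// distinct report key and listed filtering ID.
function sharedIdsForQuery(reportKeys, filteringIds) {
  const sharedIds = new Set();
  for (const reportKey of reportKeys) {
    for (const filteringId of filteringIds) {
      sharedIds.add(`${reportKey}#${filteringId}`);
    }
  }
  return sharedIds;
}

// Throws if the query would exceed the limit; otherwise returns the count.
function checkSharedIdLimit(reportKeys, filteringIds) {
  const count = sharedIdsForQuery(reportKeys, filteringIds).size;
  if (count > MAX_SHARED_IDS_PER_AGGREGATION) {
    throw new RangeError(
      `Query uses ${count} shared IDs; limit is ${MAX_SHARED_IDS_PER_AGGREGATION}`);
  }
  return count;
}

// Hypothetical batch: 24 distinct report keys queried with two filtering IDs.
const reportKeys = Array.from({length: 24}, (_, i) => `reporter.example/key-${i}`);
const sharedIdCount = checkSharedIdLimit(reportKeys, [1, 3]); // 24 * 2 = 48
```

Because the count is a product, listing more filtering IDs in one query reduces how many distinct report keys that query can cover before hitting the limit.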
## Application to Attribution Reporting API

The filtering ID approach should be extendable to the Attribution Reporting API
and, in principle, we could allow the filtering ID to be set based on either
source or trigger-side information.

The batching ID approach may not be viable for all Attribution Reporting API
callers as a null report would need to be sent for every unattributed trigger.
This could increase report volume substantially (e.g. 4 to 20 times); however,
some callers may be able to tolerate this increase (see the discussion in [ARA
issue #974](https://github.com/WICG/attribution-reporting-api/issues/974) about
introducing a trigger ID). If making reports deterministic is acceptable for
some callers, we could support setting a batching ID for a trigger with a
similar mechanism to the already proposed trigger ID.

The details of these approaches will be explored in a separate GitHub issue.

## Privacy considerations

While this change does allow for reprocessing the same report in different
aggregations, each query will only aggregate distinct contributions from that
report. In other words, each contribution is still guaranteed to only be
aggregated once, maintaining our current [privacy protection
model](https://github.com/patcg-individual-drafts/private-aggregation-api#contribution-bounding-and-budgeting).

One other potential concern is that introducing new (plaintext) metadata to the
report might amplify counting attacks (see related discussion for context IDs
[here](https://github.com/patcg-individual-drafts/private-aggregation-api/blob/main/report_verification.md#privacy-considerations)).
However, we ensure that any new metadata (including a batching ID and any
non-default payload size) is paired with making the sending of that report
deterministic. This avoids any risk of the report count leaking information.