Support for combining reports with different IDs #7
I want to make sure that I understand this clearly. Consider the scenario where the aggregation system is being made more flexible and there is no change to the browser computation.

Combining bins within a single histogram makes sense, since the user contributes a fixed amount to that particular histogram. Supporting the merging of sub-bins to get larger bin populations before adding noise to the sum would enable better analysis on periods with small numbers of users. This seems very similar to supporting dynamic time windows for data aggregation.

Combining multiple conversions is a more complicated question. Since we consider conversions independently, there is the possibility that publishers and advertisers could collude to link individual advertisements with individual conversions across sites, creating a way to pull more cross-site information into multiple conversions than would be supported with a single aggregate. This is mitigated by the noise being inserted per conversion histogram and by maintaining a per-site privacy budget in the browser. Enabling the combination of these conversion histograms would require re-budgeting the central noise added on any combined bins, since they may contain more signal contribution from a single user than a single conversion histogram would.

Are you suggesting something more than new functionality in the protocol for aggregation? Is there a better analysis for this than I've done (since this is my first pass at this question)?
Yes, I am suggesting combining separate histograms, not just combining bins within a single histogram. The privacy question is a good one; let me try to answer.

First, for simplicity, consider the case where every user is only allowed to engage with a single ad per time window, and each ad/conversion marks itself as contributing to N separate conversions. Now the ad tech wants to combine all of these N histograms together. In the aggregate of the N histograms, you know any given ad contributed at most N times, so it is safe to add noise proportional to O(N). In a way, you can view this as a "combining bins" operation within the "virtual histogram" which is just all N histograms concatenated with each other, and you've bounded the sensitivity of any ad to this "virtual histogram" to N.

This trick works in the more complicated case where you allow multiple ad events per some partition (e.g. <site, time>). For instance, if you have a per-site/day bound of k ad events, then you know your "virtual histogram" in this boundary will have effective sensitivity k * N, and you would need to add O(k * N) noise.

The trick even works if the sensitivities of the histograms are different (and you still want to combine them for some reason), by artificially scaling the histograms with smaller sensitivities to match the histograms with larger sensitivities, and then applying the larger noise. LMK if that makes sense (or if I'm missing something 😄 )
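The "virtual histogram" idea above can be sketched in a few lines. This is a minimal illustration, not the PAM implementation: it assumes Laplace noise, an L1 sensitivity bound, and pure epsilon-DP, and all names are illustrative.

```python
import numpy as np

def combine_histograms(histograms, per_event_contribution, k, epsilon, rng=None):
    """Concatenate N histograms into one 'virtual histogram' and add
    Laplace noise once, scaled to the combined sensitivity.

    histograms: list of N numpy arrays (raw, un-noised sums).
    per_event_contribution: max contribution of one ad event to one histogram.
    k: per-partition bound on ad events (e.g. per <site, day>).
    """
    rng = rng if rng is not None else np.random.default_rng()
    n = len(histograms)
    virtual = np.concatenate(histograms)
    # One ad event may appear in all N histograms, and up to k events
    # share a partition, so the virtual histogram has L1 sensitivity k * N.
    sensitivity = k * n * per_event_contribution
    noise_scale = sensitivity / epsilon
    return virtual + rng.laplace(0.0, noise_scale, size=virtual.shape)
```

The point is that a single noise draw at scale O(k * N) covers the whole concatenation, which is exactly the "combining bins within the virtual histogram" framing.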
The simple combination that I'm imagining here seems fine to me. @csharrison, you didn't specify how the histograms were combined, but I'm assuming a simple mapping. That is, either you just glue the histograms together so that you get a histogram of size

The simple concatenation might allow you to use L2 sensitivity, but the full mapping might force you to use L1.

Does this have any potential to combine information from multiple ad impressions that occur on different sites? I couldn't see how, but I'm not particularly smart right now. I think that that is only a risk if the combination method is more complicated than the ones I described.
I agree with @martinthomson that this simple mapping described by @csharrison will keep the privacy promises made by the browser, so I'm not opposed to allowing this type of combination in the back end. However, I'm not sure how much utility gain is gotten from combining multiple histograms this way. I haven't thought through this completely, but naively if I have

Let's call the intermediate sums

The released
Good point about utility. In your example, given the mappings involved, the sensitivity of
@martinthomson regarding L1 vs. L2 sensitivity, I was assuming an L1 sensitivity bound in my previous comments.

Regarding utility, with an L1 bound,

If we are trying to bound using the L2 sensitivity, I think the best we can say is that
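The L1 vs. L2 distinction being debated above can be made concrete with a small sketch. Assuming a record contributes at most once to each of N concatenated histograms, the L1 sensitivity grows like N while the L2 sensitivity grows like sqrt(N); the mechanisms, parameters, and the (non-tight) Gaussian bound below are standard textbook choices, not anything specified by PAM.

```python
import math

def noise_scales(n_histograms, per_histogram_contribution=1.0,
                 epsilon=1.0, delta=1e-6):
    """Illustrative noise scales for a concatenation of N histograms
    when one record contributes at most once to each histogram.

    L1 sensitivity grows as N        -> Laplace mechanism scale.
    L2 sensitivity grows as sqrt(N)  -> Gaussian mechanism sigma.
    """
    l1 = n_histograms * per_histogram_contribution
    l2 = math.sqrt(n_histograms) * per_histogram_contribution
    laplace_scale = l1 / epsilon
    # Classic analytic bound for the Gaussian mechanism (not tight).
    gaussian_sigma = l2 * math.sqrt(2 * math.log(1.25 / delta)) / epsilon
    return laplace_scale, gaussian_sigma
```

This is why the concatenation may be cheaper under an L2 analysis: quadrupling the number of histograms only doubles the Gaussian sigma, while it quadruples the Laplace scale.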
@csharrison — can you help me understand this? I think I'm missing something fundamental. You say
And you previously said
I am having trouble wrapping my head around |
On mobile right now so will be brief (can elaborate more later if needed). If we imagine |
@csharrison — I'm not sure I'm following. Why would
E.g. in the current PAM design, |
There is no guarantee that |
Sorry, you're right; I was making a simplifying assumption (the "simple case" of #7 (comment)) where all events are contributing to the same N histograms. But consider the case where

This argument extends beyond the impression-level epsilon, as I mentioned in #7 (comment).
I'm wondering why this still needs to be complicated. If we start from the assumption that

I get that maybe there are composition methods that might allow us to apply less noise than double, but I'm struggling with why you wouldn't just leave that to the caller of the API to untangle with some post-processing.
This is not true: the impression-level sensitivity will not change. Imagine every user only interacts with a single ad event, and each event contributes to either
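The claim above — that sensitivity does not grow when each event lands in only one of the combined histograms — can be checked with a toy computation. This is a hypothetical two-histogram setup purely for illustration; the function and bin layout are not from any spec.

```python
import numpy as np

def event_contribution_to_concat(event_goes_to, n_bins=4):
    """Measure the L1 change one event causes to the concatenation of
    two histograms A and B, when the event contributes a count of 1 to
    exactly one of them. Adding or removing that event changes the
    concatenated histogram by at most 1, the same as for a single
    histogram, so the impression-level sensitivity is unchanged."""
    a = np.zeros(n_bins)
    b = np.zeros(n_bins)
    target = a if event_goes_to == "A" else b
    target[0] += 1  # the single event's contribution
    concat = np.concatenate([a, b])
    return np.abs(concat).sum()  # L1 norm of the change
```

Either way the event goes, the L1 change to the concatenation is 1, not 2, which is the crux of the disagreement in the surrounding comments.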
But that is not how this system works: every histogram that is produced has a certain sensitivity. I guess that you can maybe say that

If you do want to be able to ask for an aggregate histogram where only one of the histograms being aggregated has a contribution, then yes, there is some value in doing this. I just didn't see how you could guarantee that.
I think I may have misunderstood something, so maybe it would help if you could write out a counterexample where the technique I am proposing won't work? The goal of this issue was to allow combining histograms like

I believe if
Here is my thinking: the contract between the browser and the aggregator is that if I contribute to a particular histogram, an amount of noise deemed appropriate will be added to that histogram to satisfy the privacy requirements of the browser. The aggregator does not know who has populated any histogram; in particular, it does not have any guarantee that the users contributing to any two histograms are disjoint sets. So, when two histograms are combined, the amount of signal any user may have contributed to the resulting concatenated histogram is the sum of the amounts that may have been contributed to each individual histogram.

In your original example
I would argue that the 'add to cart' and 'conversion' histograms are most likely not disjoint sets (in fact, it's closer to one being a proper subset of the other). So many users will have contributed to both of these histograms, and combining the bins would produce a double contribution for these users that needs to be accounted for.

Consider defining a single histogram (with twice as many bins) for both 'add to cart' and 'conversion' events. When a user adds an item to the cart, the website can create a first report histogram mapping ad events to the first set of bins, and when the user completes a purchase the website can create a second report histogram mapping ad events to the second set of bins. If a website only wants to measure one or the other, the reports can be given twice the privacy budget from the browser, since the design will guarantee only a single report.

Designing a system that somehow guarantees that the user populations in different histograms are disjoint would allow combinations without this worst-case type of analysis, but that would require more coordination than is already involved in this proposal.
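The "single histogram with twice as many bins" design suggested above amounts to a bin-offsetting scheme. Here is a minimal sketch of that mapping; the event-type names and bin count are made up for illustration.

```python
def report_bin(event_type, ad_bucket, bins_per_type=8):
    """Map an event into one doubled histogram: 'add_to_cart' events
    land in bins [0, B) and 'purchase' events in [B, 2B). Both signals
    then live in a single histogram governed by a single sensitivity
    bound, instead of two histograms that might be combined later."""
    offset = {"add_to_cart": 0, "purchase": bins_per_type}[event_type]
    assert 0 <= ad_bucket < bins_per_type
    return offset + ad_bucket
```

With this layout, a site that only cares about one event type can still read its half of the histogram, and the browser's budget accounting sees one report rather than two.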
I can't think of a scenario in which |
Thank you both for responding and for your patience.

First, I want to clarify that any kind of "disjointness" is not required for the scheme to work; I was only explaining that "disjointness" does not break the scheme, as @martinthomson suggested in #7 (comment). We don't need the system to enforce this kind of constraint (and I agree with @winstrom that for many use-cases these won't be disjoint).

Second, I want to clarify the measure of sensitivity I am primarily talking about, which is impression-level sensitivity, i.e. each histogram is marked with the maximum number of times its impressions can contribute across all histograms [1]. As far as I understand, this is PAM's primary measure of sensitivity, and each query provides an impression-level epsilon DP guarantee, because each histogram is tagged with its impression-level sensitivity. Any user-level bounds will be captured by something like capping the # of impressions, so your user-level epsilon is

The concern you have @martinthomson in #7 (comment) is because you are considering a user-level sensitivity analysis vs. just impression level. If we consider impression-level DP + composition, I think it will alleviate your concerns (LMK if it doesn't). However, the approach should work in general even with user-level sensitivity; it just requires the system to understand more about the user-level sensitivity (which AFAIU PAM doesn't have). i.e. if you imagine that

If this still isn't making sense, one question that might help get to the crux is: in your example in #7 (comment), how would you add noise to

If it is helpful, I can write something out that is more detailed in a document if this is still confusing, and again thanks for your patience.

Footnotes
|
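The comment above says user-level bounds come from capping the number of impressions per user, with the user-level epsilon following by composition. A sketch of that accounting, assuming basic sequential composition (tighter composition theorems would give a smaller bound):

```python
def user_level_epsilon(impression_epsilon, max_impressions_per_user):
    """Basic (sequential) composition: if each impression's
    contribution is protected with an impression-level epsilon, and a
    user is capped at k impressions, the user-level guarantee degrades
    to at most k * epsilon."""
    return max_impressions_per_user * impression_epsilon
```

So an impression-level epsilon of 0.5 with a cap of 4 impressions per user yields a user-level epsilon of at most 2.0 under this simple analysis.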
I think that we're coming at this from different angles and trying to guard against different things. This proposal is not primarily interested in providing an impression-level epsilon DP guarantee. We want to provide ways to account for the release of information across multiple privacy budgets (e.g. total user level, site level, impression level). They may cross boundaries (e.g. impressions may be used on multiple sites) and have different sizes (e.g. the user may have a larger budget than a single impression does). We try to ensure the information release respects those budgets.

The model of privacy accounting means that a single histogram release (

There is nothing stopping a bad actor from registering shadow ads and conversions that are not known to the browser or the aggregation system to be linked, but are known to the bad actor to be identical. Combining
@csharrison — I've been thinking about this quite a bit. I believe that your mental model of having an on-device budget that limits individual contributions does support that, within any combination of histograms, a particular user's contributions are bounded by a particular

I do want to get more opinions on this, since I can imagine that allowing combinations across time windows, in combination with other proposed changes, may create a vulnerability (and I'd like the more paranoid security folks to weigh in). The more complexity gets added, the more difficult the reasoning and explainability of any proposal becomes, and the more difficult the implementation of a backend aggregation system.

This particular question seems to me to be focused on the aggregation system rather than on the interface between the device and the aggregation backend, and so I think it is orthogonal to the implementation of how the cross-site information is transmitted.
Looking at this again, I think one of my comments got eaten; I thought I had responded to your post two weeks ago, Luke, but I don't see the comment now.

Thank you for continuing to think about it, and I'm glad we're coming to alignment. I agree we will need to think carefully about aggregating across boundaries / privacy units etc., but we should be able to be rigorous about the boundaries we want to defend and prove that we do indeed defend them. If a somewhat more complicated privacy proof will yield us much better utility, I will tend to take that bargain, but it's certainly something we can debate on a case-by-case basis :)
I now follow the user- vs. impression-level distinction, so I'm happy with that analysis. I even follow the claims about why

The question of combinations over time seems like it has a workable solution. If we consider a budget that periodically refreshes, then we'll need to track multiple budgets over time. Any attempt to use an impression will spend budget from the corresponding time period.
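The refreshing-budget idea above can be sketched as a small ledger keyed by time period. This is purely illustrative; the class, period keys, and refusal behavior are assumptions, and real accounting would live in the browser.

```python
from collections import defaultdict

class PeriodBudget:
    """Track a refreshing privacy budget per time period: spending for
    an impression draws from the budget of the period that impression
    belongs to, so old periods cannot be drained by new queries."""

    def __init__(self, per_period_epsilon):
        self.per_period_epsilon = per_period_epsilon
        self.spent = defaultdict(float)

    def try_spend(self, period, epsilon):
        if self.spent[period] + epsilon > self.per_period_epsilon:
            return False  # refuse the query rather than exceed the budget
        self.spent[period] += epsilon
        return True
```

A query touching impressions from several periods would call try_spend once per period and only proceed if every period has budget remaining.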
Ideally, it should be possible to make aggregate queries as flexible as possible within the system. To that end, it might be useful to consider support for grouping together multiple histograms that otherwise would need to be queried separately.
Here is an example: imagine I am tracking two types of conversions, purchases and “add to cart” events. Usually I query these independently to get breakdowns per type, so I give them different Conversion IDs. However, during a particular week I may have a reduced volume of reports, so I’d prefer to get a single histogram combining the purchase and add-to-cart reports, which lets me get aggregate data with less overall noise (one noise addition vs. two).
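The "one noise addition vs. two" claim can be quantified. Assuming Laplace noise and the same sensitivity for both query shapes (the thread discusses cases where the combined query needs a larger sensitivity, which would change the comparison), the variance math looks like this:

```python
def total_noise_variance(epsilon, sensitivity=1.0, separate=True):
    """Compare total Laplace noise variance on a combined total when
    purchases and add-to-cart are queried separately (two independent
    noise draws, variances add) vs. queried as one combined histogram
    (a single draw). Var(Laplace(b)) = 2 * b**2 with b = sens/eps."""
    b = sensitivity / epsilon
    per_query = 2 * b ** 2
    return 2 * per_query if separate else per_query
```

Under these assumptions the combined query halves the noise variance on the total, which is the utility motivation for this issue.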
Note that this is the analog to ARA/issues/470 on the ARA side to allow more dynamic decision making.