IPA Sharding with Sensitivity Bound #35
Comments
Note that each helper will receive a different ciphertext. Thus, the sharding here will need to be based on having one helper pick and the others follow, or it might involve some communication in order to agree on the shard allocation for each event. If one helper picks and it is malicious, then it can send events with the same ciphertext to multiple shards, which might be used to bust any user-level capping. This only works if there are sufficient events for a user to be able to get a significant contribution on multiple shards, but that is a condition that a malicious site can engineer. This suggests that we would need something more sophisticated for the allocation process.
cc @marianapr who has been looking at maliciously secure protocols for dummy insertion. One thing we need to be careful of here is malicious helpers amplifying/duplicating unknown match keys from within the protocol, which could violate the sensitivity bounds.
Is there a reason this is better than just encrypting a dummy record (null match key) on the client when a threshold M of match keys has been extracted? This avoids the need for a dedupe stage. I have some concerns that caching the ciphertext constrains the amount we could allow match keys to change, as rotating a match key would have to invalidate the cache (and therefore leak that the match key changed).
@martinthomson, or we could just send all ciphertexts to all parties for this step. They can then use all the values consistently to shard and that hopefully helps avoid depending on one to do it. Shouldn't reveal anything as each can only decrypt their own.
@csharrison, that's great; the mechanics of adding dummy records are still something we need to flesh out. I feel the best place to do it is in the shuffle that happens before computing a PRF to get the shard id in Stage 2 sharding.
Yes, I think I see your point. You could try to duplicate an unknown matchkey (to increase M above) while also inserting a known matchkey to have a large N and check whether they collide or not in sharding. Maybe when inserting the dummy records there needs to be a way to prove you've added them from some space of dummy records disjoint from the matchkey space (may only need to check a couple of bits in the MPC if we go with matchkeys like #34).
It could be better in terms of utility if we would ever process null matchkeys instead of real ones. I'm not sure it would avoid needing a dedupe stage either, as even if you just call
That is a good point I hadn't thought about. At first it doesn't seem too bad. If we limit rotating matchkeys to once per epoch anyway (as needed to maintain the DP capping against malicious MKPs) and update to the new one at the start of each epoch if the MKP requested before that time, then we could also just reset all the caching at the start of every epoch. Since the epoch id is in the associated data encrypted with the matchkey, it should give a way for the Helper Parties to have visibility into how the sensitivity bound of a matchkey has changed and to use a different sensitivity bound in a cross-epoch query.
Does it contradict the statement above?
I was hoping with 1B events we could actually fit a few hundred GB into a single host for de-duplication purposes, so multicasting ciphertexts to helper parties may be a plausible mitigation.
@akoshelev, sorry, I wasn't very clear. I'm suggesting each of the three Helper Parties could receive all the ciphertexts -- "all" as in all reports and, for each report, the encryptions of all 6 replicated shares. If all of those do not fit into a single host run by each Helper for de-duplication, then each Helper would need to do Stage 1 sharding for them.
Yup, I missed that. We can't remove the dedupe stage :)
Yes, I agree that with the current design this is benign so it is probably fine, but it makes me a bit nervous for match key epochs to be load bearing in two different ways. If we came up with a new and improved server-side protocol which made it OK for match keys to change arbitrarily, we might not want to continue being burdened by this constraint. That being said, it's a minor issue.
I believe we can address this, though it will require an extra step when a Match Key Provider is involved in the query (causing a slightly higher cost). An honest Match Key Provider should be able to provide a consistent encryption for each match key (I am assuming their first party cookie / storage should persist just as long as a match key which is set by them). We could then require that Match Key Providers provide consistent encryption for each match key they provide, and then have the helper parties, as a new step, calculate both the number of distinct encryptions and (via MPC) the cardinality of distinct match keys. If these two values differ, the helper parties would abort. Given these match keys come from an individual website/app, we could potentially use an optimized approach to calculate this unique cardinality, such as an OPRF.
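A rough sketch of what that check could look like, written in the clear purely for illustration (in the real protocol the matchkey cardinality would come out of the MPC, e.g. via distinct OPRF evaluations); the report fields `ciphertext` and `matchkey_prf` are hypothetical:

```python
def passes_consistency_check(reports) -> bool:
    """Abort the query if the number of distinct encryptions does not match
    the number of distinct underlying matchkeys."""
    distinct_encryptions = len({r.ciphertext for r in reports})
    # In the real protocol this count would be computed obliviously inside MPC.
    distinct_matchkeys = len({r.matchkey_prf for r in reports})
    return distinct_encryptions == distinct_matchkeys  # helpers abort on False
```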
@eriktaubeneck, I think I follow where you're going, but I'm not quite sure what you mean by "consistent encryptions for each matchkey".
I think the issue here is that we can't necessarily know which matchkeys were encrypted by a MKP and which by a UA. I was discussing this with @benjaminsavage and tried to include it in the section on "An old approach revisited" that
I believe what you're suggesting can help with the case that a MKP is a source site, letting us check that the set of source events they submit doesn't have duplicate matchkeys. But I don't think it would prevent a malicious MKP from submitting duplicate matchkeys as if they were trigger reports, because for real trigger reports we can't expect uniqueness of matchkeys to hold.
The other thing I was thinking about in terms of deduplicating encrypted reports is that it could mess up a real use case where a website simply had multiple different events (source or trigger) in an epoch for a user and wanted to include all of them in a query. Since the encryption of the matchkey is cached, they wouldn't be able to get results from all of them if they were removed by deduplication. I see a couple possible ways around this:
Hello, correct me if I am wrong, but my understanding here was that the only harmful way in which the DP sensitivity bound can be violated is if one party manages to duplicate the contributions of a certain client, which would be reflected in a noticeable way in one of the released aggregates because the DP noise will not be sufficient. One thing that I am still hazy about is why this will be a problem only in the PRF based approaches and not in the sorting circuit based MPC (in the latter case the DP bound violation will be reflected in the released counts, right?)

Going back to solving the issue, it seems that as long as we prevent the malicious actor from being able to duplicate records, this should be resolved. This can be guaranteed with respect to parties that collect encrypted data and then run the helper network to obtain the aggregate output. The remaining challenge for mechanisms that rely on record duplication in the DP privacy mechanism is to prove that the duplication was done according to an honestly sampled DP noise distribution, though this is more about proving that dummy records were generated correctly, never contributing anything but 0 value.

I saw above some discussion about multiple match keys encrypting the same value. I think you can do something about this by providing a PRF value of the encrypted key and proving that the match keys encrypt the same values as the ones on which you evaluated the PRF, and then making sure that different match keys do not encrypt the same value. Attacks where the client is tricked into submitting more contributions with the same match key than honest executions should be stopped at the client, I guess.
Hello @marianapr
If by "released aggregates" you mean the final outputs of the protocol, I actually don't think duplicating will be an issue there because we have the per matchkey sensitivity capping that happens in the MPC. What we are mainly concerned about here is leaking information from sharding (shard sizes) if we can't maintain a sensitiving for unknow matchkeys. |
I'll just elaborate on this a bit because I think this is different from the ARA proposal and might be unfamiliar to folks reading this comment thread. In the IPA protocol there are a few stages. After the "attribution" stage, where trigger events are attributed to source events, there is a "user contribution capping" stage. This is one of our more expensive stages. In MPC we compute the total contribution of each user to the measurement result (across all histogram buckets). If this contribution exceeds the per-user cap, we drop their contribution down to the cap. In our current implementation this requires doing a bit-decomposition of their "total contribution" value, then performing a secure comparison protocol on those bits to get a sharing of 0 or 1 indicating if they've exceeded the cap.
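For illustration, here is a plaintext sketch of the quantity this stage computes; in the protocol this runs on secret shares via bit decomposition and a secure comparison, never in the clear:

```python
def user_exceeds_cap(contributions: list[int], cap: int) -> bool:
    """True if this user's total contribution (summed across all histogram
    buckets, i.e. the L1 norm) exceeds the per-user cap."""
    return sum(contributions) > cap

def capped_total(contributions: list[int], cap: int) -> int:
    """The total contribution this user is allowed to add to the result."""
    return min(sum(contributions), cap)
```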
OK, I see, maybe I missed the contribution capping (you are bounding the L1 norm of user contributions). Does the current prototype already do the capping?
That's right, our capping is essentially bounding the L1 norm of a user's contribution. Yes, we have it implemented in both MP-SPDZ and Rust. As of the most recent PATCG Rust numbers, capping was the second most expensive stage in terms of network usage (~30%).
This approach (caching ciphertexts) will result in some amount of information leakage to the report collector and helper nodes. In the case of a source site that delegates measurement to a different entity that acts as the report collector, it would reveal something akin to their total unique user count and their ad load per user. While this is not cross-site data, it is information a business might want to keep confidential and could be concerned about the report collector (or potentially even the helper parties) being exposed to.

An alternative design to caching ciphertexts would be to cache the same values for the secret shares, but have the browser generate a different encryption of these shares each time. In this fashion, no information would be leaked to a delegated report collector, but the helper parties could still enforce a sensitivity bound by looking for identical share values (after decryption). This would only work in a large field where the chance of two shares (in a replicated secret sharing) having identical values for two different people is very small. Assuming a 32-bit prime, the odds of two different people randomly having share 1 and share 2 the same would be 1 out of 2^64, which seems sufficiently small. This would entail identical storage requirements in the browser, but a different value would be cached: shares instead of ciphertexts.

This approach does not address the issue of information leakage to the helper parties. I don't have any good idea of how to solve that problem.
Isn't this better modeled by the birthday paradox with N users, looking at the expected # of collisions?
Yeah, the birthday paradox seems the right direction to analyze. If we're doing on-device generated matchkeys, we probably need those to be 64 bits or more anyway though (for birthday paradox implications with the actual matching), so with replicated shares of that value we would be looking for collisions in a space of size 2^128.
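A quick back-of-the-envelope check of that collision argument, assuming share values are effectively uniform over a 2^128 space:

```python
def expected_collisions(num_users: int, space_bits: int = 128) -> float:
    """Birthday-paradox approximation: expected number of colliding pairs
    among num_users values drawn uniformly from a space of size 2^space_bits."""
    pairs = num_users * (num_users - 1) // 2
    return pairs / 2 ** space_bits

print(expected_collisions(10 ** 9))  # roughly 1.5e-21 for a billion users
```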
Was chatting with Daniel and he pointed out one other detail on the DP analysis for sharding: the total amount of noise (across all shards) will be known. It doesn't seem like it is directly an issue, but we should figure out how to factor it into the formal analysis of the privacy parameters.
IPA Sharding with Sensitivity Bound
Thus far, sharding IPA has proven difficult because of the inability to bound the sensitivity of a matchkey's contribution to shard size. An attacker could request a large number N of encryptions of a matchkey on two websites and then (colluding with a Helper Party) see whether one shard increased by 2N (meaning it was the same user across the two websites). Several ideas for sharding were presented at the last PATCG and referred to this paper for differentially private histograms, but for these to be applied, the matchkeys would first need to have bounded multiplicity.
As a step towards bounding the sensitivity of a matchkey, it was suggested (@benjaminsavage mentioned discussing this with several folks at the last PATCG) that the device can cache the encryption of the matchkey per domain/app (or report collector) that is requesting it. Then, when a website requests N encryptions of a matchkey from a user, it will get back the exact same authenticated ciphertext N times. The Helper Party Network processing these reports in a query will be able to filter out the duplicates or reject the query as malicious.
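A minimal sketch of that filtering step, assuming each report carries the matchkey ciphertext bytes (the field name `matchkey_ciphertext` is hypothetical); a Helper could equally count duplicates and reject the whole query instead of dropping them:

```python
def dedupe_reports(reports):
    """Keep the first report for each distinct matchkey ciphertext."""
    seen = set()
    kept = []
    for report in reports:
        if report.matchkey_ciphertext not in seen:
            seen.add(report.matchkey_ciphertext)
            kept.append(report)
    return kept
```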
If a Match Key Provider is the one colluding with a Helper Party to launch this attack, they are able to bypass the sensitivity capping for a matchkey by directly producing as many distinct encryptions of it as they like. This seems to allow the same style of attack to continue.
However, we claim that the technique for caching encryptions on the device to get a bound on the sensitivity of unknown matchkeys is sufficient to provide protection against cardinality attacks in sharding, even if the Matchkey Provider is able to add arbitrarily many encryptions for known matchkeys into the sharding process. The reason is that even though a MKP can create an arbitrary number N of encryptions of a known matchkey, the attack only works by observing the sum of this number and the number M of matchkeys that come from a user's actual devices. If we can bound M with some reasonable sensitivity (say 100 = (# different devices) * (max # domains visited per device)), then we can add non-negative DP noise such that the observed value N + M + noise is differentially private with respect to the user's contribution M.
(This may have been clear to some discussing this at the PATCG, but overall I've found it helpful to write out the approach more fully.)
Amount of Noise Needed
For DP parameters that are a couple orders of magnitude less than the overall DP budget, we can get by with adding around 4k dummy records per shard. This is a relatively small amount compared to the overall query sizes and does not depend on the size of the total query or the number of shards. The number of dummy records that need to be added only depends on the sensitivity bound for unknown matchkeys and the epsilon and delta DP parameters.
More specifically, if the sensitivity of a matchkey is 100, then we can make the value N + M visible to any attacker differentially private with epsilon = 0.01 and delta = 10^{-8} by adding noise from a Truncated Geometric distribution with the expected size of the noise being 1827 (see appendix section for details). To let this be the amount of noise from the perspective of all parties, we will probably need at least two of the Helper Parties to add this many (in expectation) dummy records. So we can expect to add fewer than 4k dummy rows per shard.
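A minimal sketch of how a Helper Party might draw the per-shard dummy-record count from the truncated double geometric distribution defined in the appendix, with n = 1827 and epsilon = 0.01 taken from the figures above; the sampling method and RNG here are illustrative only (a real implementation would use a cryptographically secure RNG):

```python
import math
import random

def sample_double_geometric(n: int, eps: float) -> int:
    """Sample z in {0, ..., 2n} with Pr(z) proportional to e^{-eps * |n - z|}."""
    support = range(2 * n + 1)
    weights = [math.exp(-eps * abs(n - x)) for x in support]
    return random.choices(support, weights=weights, k=1)[0]

# Each of (at least) two Helper Parties would add this many dummy rows to a shard.
num_dummy_rows = sample_double_geometric(1827, 0.01)
```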
Two stage sharding
To enforce the removal of identical encryptions of matchkeys, we will need to use an extra sharding stage, since we don't assume the full input set could fit on any one shard for deduplication. To be maliciously secure, we most likely need each Helper Party to enforce the deduplication on its own set of encrypted shares.
Stage 1: Ciphertext Sharding: First we will shard based on the value of the ciphertext itself, specifically the ciphertexts that encrypt the matchkey. This cannot leak any additional information since the ciphertext values are public to all parties. In the case where the matchkey is encrypted as replicated shares (6 ciphertexts in all), the Helpers could agree on a public sharding function, such as hashing all the ciphertexts in a certain order, interpreting those bits as an integer y, and then computing the shard as i = y modulo the number of shards.
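A sketch of one such public sharding function, assuming the 6 matchkey-share ciphertexts are available as bytes in an agreed order (SHA-256 is just an example; any agreed-upon hash works since the inputs are public):

```python
import hashlib

def stage1_shard_id(matchkey_ciphertexts: list[bytes], num_shards: int) -> int:
    """Hash the matchkey ciphertexts in the agreed order, interpret the digest
    as an integer y, and return y modulo the number of shards."""
    h = hashlib.sha256()
    for ct in matchkey_ciphertexts:
        h.update(ct)
    y = int.from_bytes(h.digest(), "big")
    return y % num_shards
```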
Stage 2: PRF Sharding: Then each shard will decrypt and compute a PRF of the matchkey using MPC. Here each of these shards for a particular Helper Party will need to communicate with the paired shards of the other Helpers.
Optimized encryption for shard id computation
When the device encrypts the matchkey towards the helper parties, it may be a useful tradeoff to let the matchkey be encrypted in two ways:
Sharding messes up the presort on timestamps
One optimization we've been hoping to use is to have the report collector presort on timestamps in the clear before submitting a query. However, in the computation of the shard id with the PRF we must shuffle all the rows, which breaks the previous sort on timestamps. It is an open problem whether this presorting optimization can be maintained in some form.
An old approach revisited
An earlier approach that we considered for IPA was to perform the join by applying an OPRF to all of the matchkeys, instead of using sorting. This approach is vulnerable to known cardinality attacks on the matchkeys and for this reason we moved away from it. It is worth noting here though that bounding the cardinality of unknown matchkeys is not sufficient to let us go back to this approach.
First consider the direction we would like to go in to make that OPRF approach have differentially private leakage. If the sensitivity bound on a user's matchkeys coming off real devices is M = 100, there will be some number of matchkeys with cardinality 1, some with cardinality 2, ..., and some with cardinality 100. For each of these histogram buckets we will add non-negative DP noise by adding sets of dummy reports with the corresponding number of identical matchkeys. We could add DP noise to each of the buckets with sensitivity 1. Here the bound on the user's number of matchkeys gives us the number of buckets in the histogram.
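For illustration, a plaintext sketch of the bucketing this direction would require; in the real protocol multiplicities would only be visible as OPRF outputs, and `sample_nonneg_noise` and `fresh_dummy_key` stand in for a non-negative DP noise mechanism and a generator of dummy keys disjoint from real matchkeys:

```python
from collections import Counter

def cardinality_histogram(matchkeys: list[int], M: int) -> dict[int, int]:
    """hist[c] = number of distinct matchkeys appearing exactly c times, for c = 1..M."""
    hist = {c: 0 for c in range(1, M + 1)}
    for count in Counter(matchkeys).values():
        if count <= M:
            hist[count] += 1
    return hist

def add_dummy_sets(matchkeys: list[int], M: int, sample_nonneg_noise, fresh_dummy_key):
    """For each cardinality c in 1..M, append sample_nonneg_noise() sets of c identical dummies."""
    for c in range(1, M + 1):
        for _ in range(sample_nonneg_noise()):
            dummy = fresh_dummy_key()
            matchkeys.extend([dummy] * c)
    return matchkeys
```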
The flaw here is that we cannot actually bound the number of times a user’s matchkey is submitted. A malicious MKP can submit a targeted user’s matchkey many times (e.g. in a source-fan-out query there could be many copies of a matchkey with source events or split across multiple sets of trigger reports from fake trigger sites).
As a result, the histogram has an unbounded number of buckets that we would need to add noise to. To use the notation above, even though M is bounded, N is unbounded, but to make this approach work there would have to be a way to bound N + M.
Truncated Geometric/Discrete Laplace with large sensitivity
We can use the techniques for non-negative DP from our other paper. In the paper we had focused on the case where the sensitivity is $\Delta = 1$. We now go back and show, first for the truncated geometric / discrete Laplace, how to determine what the parameters of the noise distribution should be for large values of $\Delta$. Other distributions could also be used. This extends Section 3.2 of the paper.
In that section we consider a DP mechanism using the following distribution:
$M(X) = f(X) + z; \quad z \sim p_{DoubleGeometric}(n)$
$\textrm{Pr}_{DoubleGeometric}(x|n) = Ae^{-\epsilon|n-x|}; \qquad x\in \{0,\ldots,2n\}$

for some normalizing constant $0 < A < 1$ and some $n \in \mathbb{N}$.
As a probability distribution this must sum to 1, which lets us solve for $A$. Letting $r=e^{-\epsilon}$, we get (see equation 10 in the paper)
$A=(1-r)/(1+r-2r^{n+1})$
If we have a sensitivity $\Delta$ and desired $\epsilon$ and $\delta$, then we need to cover the tail as

$\delta \geq A \sum_{k=n-\Delta+1}^{n} e^{-k \epsilon}.$

For $\Delta$ larger than 1, we give a simple script to compute the smallest $n$ such that the above equation holds for a given $\Delta, \epsilon, \delta$.
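A minimal sketch of such a script (not necessarily the exact one referenced), searching for the smallest $n$ satisfying the tail condition above:

```python
import math

def smallest_n(Delta: int, eps: float, delta: float) -> int:
    """Smallest n such that delta >= A * sum_{k=n-Delta+1}^{n} e^{-k*eps},
    with A = (1 - r) / (1 + r - 2 * r^(n+1)) and r = e^{-eps}."""
    r = math.exp(-eps)
    n = Delta  # need n >= Delta for the tail indices to stay positive
    while True:
        A = (1 - r) / (1 + r - 2 * r ** (n + 1))
        tail = sum(math.exp(-k * eps) for k in range(n - Delta + 1, n + 1))
        if A * tail <= delta:
            return n
        n += 1

# Parameters from the text: sensitivity 100, epsilon = 0.01, delta = 10^-8.
print(smallest_n(100, 0.01, 1e-8))  # should match the expected noise of 1827 quoted above
```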
Thanks @benjaminsavage, @eriktaubeneck, @akoshelev, and @danielmasney for lots of discussions on this.