Abstract
Despite excellent theoretical support, Differential Privacy (DP) can still be a challenge to implement in practice. In part, this challenge is due to the concerns associated with converting arbitrary or infinite-precision theoretical mechanisms to the reality of floating point or fixed-precision. Beginning with the troubling result of Mironov demonstrating the security issues of using floating point for implementing the Laplace mechanism, there have been many reasonable concerns raised on the vulnerabilities of real-world implementations of DP.
In this work, we examine the practical implementation of the exponential mechanism of McSherry and Talwar. (The exponential mechanism can be used to express any DP mechanism, although this is not always the most computationally efficient approach.) We demonstrate that naive or malicious implementations can result in catastrophic privacy failures. To address these problems, we show that the mechanism can be implemented exactly for a rich set of values of the privacy parameter
How do we achieve this result? We employ a simple trick of switching from base
In this notebook, we demonstrate the base 2 exponential mechanism and vulnerabilities of naive implementations of the exponential mechanism. We also compare running times and mean-squared error of an implementation of the Laplace mechanism using the base 2 exponential mechanism implementation.
Differential privacy requires that the outputs of mechanisms operating on private data are stable. That is, they cannot change ``too much'' if a single entry in the input database is changed. A key strength of DP is that it doesn't apply to a single mechanism: any randomized mechanism satisfying the stability requirement is differentially private. We give the formal definition below.
Pure Differential Privacy:
A randomized mechanism
One of the most popular DP mechanisms is the Laplace mechanism, which adds noise drawn from the Laplace distribution scaled to the sensitivity of a function, i.e., how much it can change if a single individual's value changes.
The Laplace Mechanism:
Given a function
Unfortunately, sampling from the Laplace distribution using floating point arithmetic leads to "gaps" in the possible noise values, which, in the right circumstances, can allow an adversary to exactly guess the original value of
The exponential mechanism, proposed by McSherry and Talwar, is a general purpose DP mechanism for releasing arbitrary strings, values or other outputs. For example, suppose a company is trying to set a price for a new piece of software. To set the price, they ask a set of potential buyers the price they are willing to pay, and they promise to keep the prices reported by each potential buyer private and to only release a DP estimate of the optimal price. The company now has a problem: computing the optimal price and then adding noise may result in a price that \textit{zero} buyers are willing to pay. For example, if all buyers are willing to pay at most $$10$, setting a price of $$10.50$ will result in zero sales. The exponential mechanism handles this problem in an elegant way by assigning a utility score (potential profit) based on the database (the maximum prices) to each potential outcome (each price in a pre-determined range) and then sampling a single outcome based on the utility scores. Using this mechanism, the price of $$10$ would have very high utility and thus a high probability of selection, whereas any price over $$10$ would have low utility and probability of selection. In fact, any DP mechanism can be expressed using the exponential mechanism with an appropriately chosen utility function.
The Exponential Mechanism:
Given a privacy parameter
An initial examination of the exponential mechanism might lead one to imagine that it is immune to the floating point issues of the Laplace mechanism, as no floating point calculation is released directly and the computations of
We rely on GNU Multiple Precision Arithmetic Library (GMP and MPFR via the gmpy2
Python interface for high precision arithmetic and exact arithmetic monitoring in our implementation, as well as some standard libraries (numpy
, etc).
- Install any libraries needed (gmpy2, etc)
- Standard imports
!apt-get install libgmp-dev
!apt-get install libmpfr-dev
!apt-get install libmpc-dev
!pip install gmpy2
import numpy as np
import gmpy2
from gmpy2 import mpfr, mpz
import matplotlib.pyplot as plt
from timeit import default_timer as timer
import scipy.stats as sk
Although the problems with floating-point implementations of mechanisms like the Laplace mechanism are well documented (see Mironov and related works), the exponential mechanism is vulnerable to a different class of attacks which takes advantage of rounding and truncation properties of floating-point arithmetic. Below, we demonstrate two such attacks using a naive implementation (see naive.py
) of the exponential mechanism which uses numpy
for arithmetic operations.
- Attack 1 takes advantage of rounding behavior and chooses a utility function which rounds all weights but one to 0 if a condition is met.
- Attack 2 takes advantage of truncation in addition, and chooses a utility function that always produces positive weights, but the additions are truncated such that some weights are treated as if they were zero.
Note: to demonstrate these attacks, we construct utility functions which take only the outcome as an argument, as the dependence on the database is simple, and this makes the code easier to read. The utility functions u1
refer to the case when our target, Alice, is in the database, and u2
refer to the case when she is not.
from naive import * # Import naive implementation of the exponential mechanism.
Suppose the attacker wishes to know if Alice is in the database. The attacker designs an outcome space
- If Alice is in the database
$u(o_1)=x$ and the utility for all other outcomes is$x+1$ . - If Alice is not in the database
$u(o_1)=x$ and the utility for all other outcomes is also$x$ .
(By convention
Notice that the sensitivity of
# Choose constants
k = 100
eps = 0.5
# Find x as described in the attack
x = 0
for i in range(0,4000):
if np.exp(-(eps/2)*i) == 0:
x = i-1
break
print("x =",x)
# Construct the output space O and two utility functions representing the cases
# for each database
O = [i for i in range(1,k+1)]
u1 = lambda o: x if o == 1 else x+1
u2 = lambda o: x
# Run 100 trials to test the average output in each case
numtrials = 100
u1_outcomes = []
u2_outcomes = []
for i in range(0, numtrials):
u1_outcomes.append( naive_exp_mech(eps,u1,O))
u2_outcomes.append(naive_exp_mech(eps,u2,O))
print("u1 average: ", np.average(u1_outcomes))
print("u2 average: ", np.average(u2_outcomes))
x = 2980
u1 average: 1.0
u2 average: 47.57
plt.hist(u1_outcomes)
plt.title("Outcomes of u1")
plt.show()
plt.hist(u2_outcomes)
plt.title("Outcomes of u2")
plt.show()
Notice that the average output when Alice is not in the database (u2
) is significantly different from the average value when Alice is in the database (u1
), allowing the attacker to easily determine whether or not Alice's data is included. In particular, if Alice is in the database, no value other than 1 is ever produced, as illustrated in the output histograms.
To mitigate Attack 1, one might suggest that we add an assertion that each weight is positive. However, this fix can be subverted by taking advantage of truncated addition. That is, if a very small floating point number is added to a very large floating point number, the addition may result in a no-op if there aren't sufficient bits to express the low order bits of the addition result.
For example:
a = 2**(-30) # a small positive number
b = 2**(30) # a large positive number
c = a + b
print(a>0, c==b) # Despite adding a positive number, the addition is a no-op
True True
Knowing that small, but positive, weights are truncated in addition, the adversary constructs two utilities
eps = 30 # Note: we choose a large epsilon to limit the amount of time searching the space.
y = -0.5
print("y:",y)
ey = np.exp(-(eps/2)*y)
#print(ey)
for x in range(0,5000):
if (np.exp(-(eps/2)*x) + ey) == ey and np.exp(-(eps/2)*x) > 0:
break
print("x:", x)#, np.exp(-(eps/2)*x) )
#print(np.exp(-(eps/2)*x) + ey, ey)
u1 = lambda o: y if o == 1 else x
u2 = lambda o: y+1 if o==1 else x-1
s = np.exp(-(eps/2)*(y+1))
t = np.exp(-(eps/2)*(x-1))
total = s
for k in range(1000,10000):
ratio = t*k/s
if ratio > 1/10:
print("k:",k)
break
O = [i for i in range(0,k)]
# Run several trials to see difference in outcomes.
numtrials = 100
u1_outcomes = []
u2_outcomes = []
for i in range(0, numtrials):
u1_outcomes.append(naive_exp_mech(eps,u1,O))
u2_outcomes.append(naive_exp_mech(eps,u2,O))
print("u1 average: ", np.average(u1_outcomes))
print("u2 average: ", np.average(u2_outcomes))
y: -0.5
x: 2
k: 1000
u1 average: 1.0
u2 average: 208.76
plt.hist(u1_outcomes)
plt.title("Attack 2: Outcomes of u1")
plt.show()
plt.hist(u2_outcomes)
plt.title("Attack 2: Outcomes of u2")
plt.show()
Notice that once again, if Alice is in the database, that the only possible output is 1, whereas if Alice is not in the database, a variety of outcomes in
The troubling property of Attack 2 is that all weights are positive, and one would need to monitor the outcome of each addition in the mechanism to ensure that values are not excessively truncated to try to catch such subversion. However, such monitoring is non-trivial as the additions are not expected to be exactly correct as np.exp
)?
The key problem with declaring a certain range of utilities to be safe is the difficulty in characterizing the privacy loss due to inexact arithmetic. For example, suppose that a range was chosen which has a maximum error on the order of
To mitigate these problems, we propose thinking about DP in base
With this change of perspective, we can rewrite the probability distribution of exponential mechanism in base 2 as
The benefit of this change is that we now have a path to implementing the exponential mechanism exactly rather than allowing floating point errors to creep in. Exact arithmetic also allows for more straightforward monitoring of correctness and thus privacy.
Although the logic of the implementation is nearly identical to the naive version, showing that the base-2 exponential mechanism can be implemented exactly is not quite as trivial as swapping np.exp
for 2**x
.
There are several components of the implementation and mechanism design:
- Allowing expressive privacy parameters: we show that taking
$\eta \in {-z\log(\frac{x}{2^y})}$ for integer$x$ ,$y$ and$z$ such that$\frac{x}{2^y}<1$ results in an expressive set of privacy parameters such that$2^{-\eta}$ can be computed exactly with limited precision. - Handling non-integer utilities via randomized rounding: we show that randomized rounding, i.e. rounding up with probability
$p = x - \lfloor{x}\rfloor$ and down with probability$1-p$ does not hurt privacy. This allows us to run the mechanism with integer utilities, which makes computing$2^{-\eta u(d,o)}$ simple. - Sampling without division: we give a simple method for sampling with normalized probabilities (i.e.
$\frac{1}{\sum_{o \in O} 2^{-\eta u(d,o)}}$ ) that doesn't require division. Please see the draft paper for complete descriptions of the mechanisms and analysis.
The base-2 exponential mechanism uses the gmpy2
interface to the GNU Multiple Precision Arithmetic Library (GMP) for high-precision binary arithmetic. See expmech.py
for code and additional documentation.
# Example usage
from expmech import *
# Initialize outcomes and utilities
u = lambda x : abs(0 - x) # Laplace distribution utility centered at 0
# privacy parameters
eta_y = 1
eta_x = 1
eta = -np.log2((eta_x/(2**eta_y)))
eps = 2*np.log(2)*(-np.log2((eta_x/(2**eta_y))))
print("eta = ", eta, "epsilon = ",eps)
# outcome space
gamma = 2**(-4)
k = 160
O = [i*gamma for i in range(-k,k+1)]
# random bit generator
rng = lambda : np.random.randint(0,2) # np.random is a placeholder for demonstration purposes only
# Initialize the mechanism
e = ExpMech(rng,eta_x=eta_x,eta_y=eta_y,eta_z=1,u_max=(abs(O[0]*2)),u_min=0,max_O=len(O))
# set the utility function
e.set_utility(u)
# run the mechanism
result = e.exact_exp_mech(O)
print("result:", result)
eta = 1.0 epsilon = 1.3862943611198906
result: 0.0625
The base-2 exponential mechanism can also be used to simulate drawing from the Laplace distribution or implementing the discretized Laplace mechanism. We implement the Laplace mechanism as a child class of the Exponential mechanism, which simplifies the semantics significantly.
b_min = -10 # the range of outputs [-10,10] at granularity 2^-4
b_max = 10
gamma = 2**(-4)
eta_z = 1
s = 1 # the sensitivity of the function
t = np.random.randint(b_min,b_max) # select a random "target" value in the range
l = LaplaceMech(rng, t, s, eta_x, eta_y, eta_z, b_min,b_max,gamma)
n = 100
samples = []
for i in range(0,n):
samples.append(l.run_mechanism())
plt.hist(samples,bins=int(n/10),alpha=0.5)
plt.scatter(t,0,marker='o',color='r',label='target')
plt.legend()
plt.title("Samples from the Laplace Mechanism targeting "+str(t))
plt.show()
The key question we must address other than privacy is the performance of our mechanism. To demonstrate the difference (in fact, lack of difference) in running time and output distribution of the naive exponential mechanism versus the base-2 exponential mechanism, we use both to simulate the Laplace distribution centered at
Note: the base-2 mechanism expects the input
First, we show a comparison between the distributions output by the naive implementation of the Laplace mechanism and the base-2 mechanism, which includes randomized rounding.
# Get n samples from each of the mechanisms
n = 1000
util = lambda x : abs(x - t) # Laplace distribution utility
O = l.Outcomes
b2_results = []
naive_results = []
for i in range(0,n):
b2_results.append(l.run_mechanism())
naive_results.append(naive_exp_mech(eps,util,O))
# Plot the samples in a histogram
bin_set = np.linspace(-10,10,50)
plt.hist(naive_results,bins=bin_set,color='r', alpha = 0.5, label='naive')
plt.hist(b2_results,bins=bin_set, color='b', alpha = 0.5, label='base2')
plt.legend()
plt.title("base2 vs naive Exp Mechanism")
plt.show()
Another way of comparing the distributions is the KS statistic, which tests if two distributions are different. In this case, the KS statistic is small, and the p-value large, so we cannot reject the null hypothesis that the distributions are the same.
sk.ks_2samp(b2_results,naive_results)
Ks_2sampResult(statistic=0.023, pvalue=0.9542189106778983)
The difference between the probability distribution of the base-2 mechanism, which uses randomized rounding to guarantee integer utilities, and the un-rounded naive mechanism can also be compared by upper and lower-bounding the probability assigned to each outcome by randomized rounding. Below, we plot a conservative theoretical estimate of the maximum point-wise error on the probability assigned to any given element by the randomized rounding base-2 mechanism compared with the naive mechanism. The maximum error decreases as the granularity increases (smaller accuracy_comparison.py
for more details).
Please see timing_tests.py
and the paper for benchmarks.
Of course, if we wish to use the base-2 exponential mechanism to implement the Laplace mechanism, it is also relevant to test against the (vulnerable) implementation of the Laplace distribution provided in numpy
. In these cases, the running time of the base-2 mechanism or the naive exponential mechanism is significantly slower than the numpy
version. However, the base-2 mechanism still closely approximates the target distribution.
# Get n samples from the numpy laplace implementation
f = lambda : t
numpy_results = [naive_laplace(f,1.0,eps) for i in range(0,n)]
plt.hist(numpy_results,bins=bin_set,color='g', alpha = 0.5, label='numpy laplace')
plt.hist(b2_results,bins=bin_set, color='b', alpha = 0.5, label='base2')
plt.legend()
plt.title("base2 vs numpy Laplace")
plt.show()