Methods
for Reducing Disclosure Risks When Sharing Data:
Overview of
Computer Science Methods
Like
statisticians, computer scientists conduct research on privacy and
confidentiality. A first area is secure distributed computing. In
this scenario, multiple parties (e.g., companies, government
agencies) own different data sources. The parties seek to allow one
another, and possibly others, to analyze the combined data without
actually revealing their own data. Computer scientists have developed
algorithms and protocols to facilitate this form of data sharing. A
second active area is designing systems that answer queries without
revealing sensitive data. Typically, the protection in these systems
derives from withholding information to selected queries or from
perturbing outputs of queries. A third area overlaps largely with
research in statistical disclosure limitation: developing measures of
disclosure risk and data usefulness, and creating methods for
altering individual data values to protect confidentiality. Research
in these three topics is often called privacy preserving data mining
in the computer science literature.
Below are introductions
to these three topics. Computer scientists also work extensively on
information security, for example encrypting data to prevent hackers
from learning sensitive information stored on computing systems. The
literature on information security is vast, and we do not cover it
here.
1. Secure Distributed
Computing
When two or more parties seek
to compute specific functions of their combined data, they can use
protocols from secure multi-party computation. Although there are
many protocols, the secure summation protocol illustrates the
approach. Suppose that K groups own one data value each, and the
groups want to compute the sum of the K elements without revealing
the individual data values. The first party selects a very large
random number, say R,
and adds their data element x
to it. The first party then sends S
= R+x
to the second party, which adds its value y
to S
and passes this sum to the third party. The process continues until
party K adds its value and returns the sum to the first party, which
can subtract R
to obtain the sum. It then shares the sum with the other parties.
A
summary of secure multi-party computation, including some key
references, can be found at Latanya
Sweeney's data privacy lab website.
Secure summation protocols can be used to estimate parameters of a
variety of statistical models; see Karr, et
al. (2007,
Technometrics,
49, 335 - 345) for a review.
2. Secure query systems
In a secure query system, users
submit requests for functions of the data, but they are not allowed
to see or download the individual records in the database. Additional
protections are needed because some queries can result in
disclosures; an obvious example is a direct query for a particular
record's sensitive data. To protect confidentiality, designers of
query systems take one, or both, of the following approaches. First,
they do not give answers to all queries. Specifying the unanswerable
set is a challenging problem because of the potential for interaction
among queries, i.e., the results of one query when combined with
another may lead to disclosures. Second, they perturb the output of
the query--not the data themselves--by aggregating or adding noise.
Computer scientists have developed theory on the types of
perturbations that best ensure protection. An excellent website on
this theory is maintained by Microsoft's
database privacy group.
3. Some risk metrics and
perturbation approaches
k-anonymity:
any released record cannot be distinguished from at least k-1
records in the database. Latanya
Sweeney's k-anonymity research
describes some of the algorithms used to achieve k-anonymity.
-
l-diversity:
all records with the same non-sensitive attributes have sufficient
variation of sensitive values. Computer
scientists at Cornell University
describe an algorithm for implementing l-diversity.
-
differential privacy:
complicated mathematically, but intuitively says that the risks of
disclosures of any individual's data do not meaningfully depend on
whether or not that individual is in the released database. Adam Smith has tutorial presentations on privacy in statistical databases on his research page.