D: Computer Science Approaches

Methods for Reducing Disclosure Risks When Sharing Data:
Overview of Computer Science Methods

Like statisticians, computer scientists conduct research on privacy and confidentiality. A first area is secure distributed computing. In this scenario, multiple parties (e.g., companies, government agencies) own different data sources. The parties seek to allow one another, and possibly others, to analyze the combined data without actually revealing their own data. Computer scientists have developed algorithms and protocols to facilitate this form of data sharing. A second active area is designing systems that answer queries without revealing sensitive data. Typically, the protection in these systems derives from withholding information to selected queries or from perturbing outputs of queries. A third area overlaps largely with research in statistical disclosure limitation: developing measures of disclosure risk and data usefulness, and creating methods for altering individual data values to protect confidentiality. Research in these three topics is often called privacy preserving data mining in the computer science literature.

Below are introductions to these three topics. Computer scientists also work extensively on information security, for example encrypting data to prevent hackers from learning sensitive information stored on computing systems. The literature on information security is vast, and we do not cover it here.

1. Secure Distributed Computing

When two or more parties seek to compute specific functions of their combined data, they can use protocols from secure multi-party computation. Although there are many protocols, the secure summation protocol illustrates the approach. Suppose that K groups own one data value each, and the groups want to compute the sum of the K elements without revealing the individual data values. The first party selects a very large random number, say R, and adds their data element x to it. The first party then sends S = R+x to the second party, which adds its value y to S and passes this sum to the third party. The process continues until party K adds its value and returns the sum to the first party, which can subtract R to obtain the sum. It then shares the sum with the other parties.

A summary of secure multi-party computation, including some key references, can be found at Latanya Sweeney's data privacy lab website. Secure summation protocols can be used to estimate parameters of a variety of statistical models; see Karr, et al. (2007, Technometrics, 49, 335 - 345) for a review.

2. Secure query systems

In a secure query system, users submit requests for functions of the data, but they are not allowed to see or download the individual records in the database. Additional protections are needed because some queries can result in disclosures; an obvious example is a direct query for a particular record's sensitive data. To protect confidentiality, designers of query systems take one, or both, of the following approaches. First, they do not give answers to all queries. Specifying the unanswerable set is a challenging problem because of the potential for interaction among queries, i.e., the results of one query when combined with another may lead to disclosures. Second, they perturb the output of the query--not the data themselves--by aggregating or adding noise. Computer scientists have developed theory on the types of perturbations that best ensure protection. An excellent website on this theory is maintained by Microsoft's database privacy group.

3. Some risk metrics and perturbation approaches

k-anonymity: any released record cannot be distinguished from at least k-1 records in the database. Latanya Sweeney's k-anonymity research describes some of the algorithms used to achieve k-anonymity.

l-diversity: all records with the same non-sensitive attributes have sufficient variation of sensitive values. Computer scientists at Cornell University describe an algorithm for implementing l-diversity.
differential privacy: complicated mathematically, but intuitively says that the risks of disclosures of any individual's data do not meaningfully depend on whether or not that individual is in the released database. Adam Smith has tutorial presentations on privacy in statistical databases on his research page.