Protecting Business and Tax Data: Special Issues and Applications
Business and tax
data are among the most sensitive data collected by government agencies and
researchers. These data often contain highly skewed variables, which puts the
largest units at particular risk of disclosure. For example, given the actual total payroll of
each manufacturer in the aircraft industry, it may be relatively easy to identify
Boeing; it is the record with the largest payroll. Furthermore, businesses and
individuals understandably want to guard the privacy of this information. For
example, private companies do not want their competition to know the amounts
they spend on marketing, research and development, payroll, and so on, as this might
compromise their business practices. Likewise, individuals may not want others
to learn their salaries or total incomes.
If data collectors disseminated
business and tax data in ways that resulted in harm to businesses and
individuals, data subjects might not be willing to provide their data. This
would damage the government's ability to make economic policy and reduce
researchers' opportunities to analyze economic data. Thus, most business and tax
data, if released at all (in fact, there are no public use business microdata
available in the U.S.), are altered before release.
Nearly all the
typical alteration strategies are applied to business and tax data; see the Methods tab at the top of this page for an explanation of the methods. Below are links to
illustrative applications of confidentiality protections on business and tax
data. This list is by no means exhaustive, but it does illustrate the techniques
typically used to protect these data.
Aggregation in the County Business Patterns (CBP)
Business and tax microdata are frequently
aggregated for public use. This link to the CBP, released by the Census Bureau,
illustrates how establishments' payroll and employee size are aggregated to
create public use tables.
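To make the idea of aggregation concrete, here is a minimal sketch in Python. The records, field layout, and the `aggregate` function are hypothetical and chosen only for illustration; this is not the Census Bureau's procedure for the CBP.

```python
from collections import defaultdict

# Hypothetical establishment records (made-up values):
# (county, industry, employees, annual_payroll)
establishments = [
    ("A", "manufacturing", 120, 6_000_000),
    ("A", "manufacturing", 45, 1_800_000),
    ("A", "retail", 12, 300_000),
    ("B", "manufacturing", 900, 54_000_000),
    ("B", "retail", 8, 200_000),
    ("B", "retail", 15, 410_000),
]


def aggregate(records):
    """Sum establishment counts, employees, and payroll per county-by-industry cell."""
    cells = defaultdict(lambda: {"establishments": 0, "employees": 0, "payroll": 0})
    for county, industry, employees, payroll in records:
        cell = cells[(county, industry)]
        cell["establishments"] += 1
        cell["employees"] += employees
        cell["payroll"] += payroll
    return dict(cells)


table = aggregate(establishments)
for key in sorted(table):
    print(key, table[key])
```

In this toy table the cell for county B manufacturing contains a single establishment, so its published total would equal that establishment's own values. This suggests why aggregation is usually combined with other protections rather than relied on alone.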
Noise addition in the Commodity Flow Survey (CFS)
This paper illustrates how noise can be
added to underlying economic microdata when the released data are tabular. The
CFS is released by the Census Bureau.
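As a rough illustration of adding noise to microdata before tabulation, the sketch below multiplies each record by a random factor that is kept away from 1 and then sums the perturbed values. The noise range (5% to 15%), the function names, and the shipment values are assumptions made for this example; the actual CFS parameters and procedure are not described here.

```python
import random


def noise_factor(rng, low=0.05, high=0.15):
    """Draw a multiplier that differs from 1 by between `low` and `high`, up or down."""
    magnitude = rng.uniform(low, high)
    direction = rng.choice((-1, 1))
    return 1 + direction * magnitude


def noisy_total(values, rng):
    """Perturb each record with its own factor, then sum the perturbed records."""
    return sum(v * noise_factor(rng) for v in values)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
shipments = [1200.0, 300.0, 45000.0]  # hypothetical shipment values for one table cell
true_total = sum(shipments)
released = noisy_total(shipments, rng)
print(true_total, released)
```

Because every record is perturbed by at least the lower bound, a cell dominated by one large respondent is moved noticeably, while noise in cells with many respondents tends to partly cancel.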
Noise addition and synthetic data in the Longitudinal Employer-Household Dynamics (LEHD) Program
This presentation
provides an example of adding noise and using synthetic data in establishment-level data. The LEHD
program is run by the Census Bureau.
Microaggregation in the Individual Tax Model Public Use File (ITMPUF)
This link is to a paper in the 2002
proceedings of the Joint Statistical Meetings that describes the
microaggregation strategy used for the ITMPUF, which is released by the
Statistics of Income division of the Internal Revenue Service.
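For readers unfamiliar with microaggregation, the following sketch shows a generic single-variable version: sort the records, form groups of at least k neighbors, and replace each value with its group mean. The function name, the choice of k, and the income values are hypothetical; the strategy actually used for the ITMPUF is described in the linked paper and may differ.

```python
def microaggregate(values, k=3):
    """Replace each value with the mean of its group of at least k sorted neighbors."""
    if k < 1:
        raise ValueError("k must be at least 1")
    # Indices of the records, ordered from smallest to largest value.
    order = sorted(range(len(values)), key=lambda i: values[i])
    groups = [order[i:i + k] for i in range(0, len(order), k)]
    # Fold a short final group into the previous one so no group has fewer than k records.
    if len(groups) > 1 and len(groups[-1]) < k:
        tail = groups.pop()
        groups[-1].extend(tail)
    result = list(values)
    for group in groups:
        mean = sum(values[i] for i in group) / len(group)
        for i in group:
            result[i] = mean
    return result


incomes = [52_000, 48_000, 1_250_000, 61_000, 50_000, 980_000, 75_000]  # made-up values
protected = microaggregate(incomes, k=3)
print(protected)
```

The overall total is unchanged, but no released value belongs to fewer than k records, so the largest income can no longer be read off directly.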
Synthetic data in the Survey of Consumer Finances
The Federal Reserve Board protects
sensitive monetary values by replacing them with multiple imputations. This is
the first published instance of what is now known as partially synthetic data.
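A heavily simplified sketch of the idea follows. It replaces only the flagged values with random draws from a lognormal distribution fitted to the observed values and produces several copies (implicates). This is an assumption-laden toy: real multiple imputation would condition on other variables and reflect uncertainty in the model parameters, and the function name, threshold, and asset values below are hypothetical rather than the Federal Reserve Board's procedure.

```python
import math
import random
import statistics


def synthesize(values, is_sensitive, m=3, seed=0):
    """Return m copies of `values` in which flagged entries are replaced by model draws."""
    rng = random.Random(seed)
    # Fit a lognormal by taking the mean and standard deviation of the log values.
    logs = [math.log(v) for v in values]
    mu = statistics.mean(logs)
    sigma = statistics.stdev(logs)
    implicates = []
    for _ in range(m):
        copy = [
            math.exp(rng.gauss(mu, sigma)) if flag else v
            for v, flag in zip(values, is_sensitive)
        ]
        implicates.append(copy)
    return implicates


assets = [12_000.0, 30_000.0, 45_000.0, 2_500_000.0, 80_000.0]  # made-up values
flags = [v > 1_000_000 for v in assets]  # treat only very large values as sensitive
implicates = synthesize(assets, flags)
for copy in implicates:
    print(copy)
```

Only the flagged entry changes from copy to copy; the remaining values are released as observed, which is what makes the result partially rather than fully synthetic.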
Synthetic data in the Longitudinal Business Database (LBD)
The
U.S. Bureau of the Census is developing a partially synthetic public use data
set for the LBD. This working paper summarizes some of the initial
development.