ASA Connect

Practical Utility of Differentially Private Discrete Tabular Data in Statistical Data Publications

  • 1.  Practical Utility of Differentially Private Discrete Tabular Data in Statistical Data Publications

    Posted 04-22-2020 11:25

    Hi All

    Since 2006, Differential Privacy (DP) has been promoted to the scientific community as an effective tool for protecting individual privacy in discrete tabular data (contingency tables) without sacrificing data utility or quality. Recently, the U.S. Census Bureau proposed using the DP technique for Census2020. As part of that project, the Census Bureau conducted an end2end study and made its output available to the entire scientific community for review and comment. I believe this output does not give a clear picture of what differential privacy is. The end2end output was sanitized by post-processing the DP output with a procedure referred to as the "top-down procedure", and the Census Bureau attributed most of the quality degradation in the end2end study to that post-processing. [On April 21, 2020, NPR reported that the Census Bureau will be sending an email about COVID-19 to conduct a "Household Pulse Survey to try to measure how the COVID-19 is upending life for households in the U. S."]

    To determine the practical utility (data quality) of the DP technique on discrete data, as proposed for Census2020 and potentially to be used by many other statistical data publication systems in the future, I conducted the following simple experiment/simulation:

    Step 1: Create a true count of one

    Step 2: Adjust the count using the DP method (noise drawn from a Laplace distribution)

    Step 3: Compute the cumulative values of the true count and of the DP count

    Step 4: Increase the true count by one and repeat from Step 2 until the count reaches 500,000,000
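The four steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the program used for the attached output: the function names, the `seed` and `report_every` parameters, and the choice to leave the DP counts unrounded are all hypothetical. The Laplace draw uses the fact that the difference of two independent exponential variates with mean B follows a Laplace distribution with mean zero and scale B.

```python
import random


def laplace_noise(scale, rng):
    """Draw one Laplace(0, scale) variate as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def simulate(max_count, scale=3.0, seed=0, report_every=1):
    """Run Steps 1-4 and return rows of
    (cumulative true count, cumulative DP count, difference,
     percent difference, DP adjustment to the nth count).
    """
    rng = random.Random(seed)      # assumption: a fixed seed for reproducibility
    cum_true = 0
    cum_dp = 0.0
    rows = []
    for count in range(1, max_count + 1):      # Steps 1 and 4: count = 1, 2, 3, ...
        noise = laplace_noise(scale, rng)      # Step 2: DP adjustment
        cum_true += count                      # Step 3: cumulative true count
        cum_dp += count + noise                # Step 3: cumulative DP count (unrounded)
        if count % report_every == 0 or count == max_count:
            diff = cum_dp - cum_true
            rows.append((cum_true, cum_dp, diff, 100.0 * diff / cum_true, noise))
    return rows


if __name__ == "__main__":
    # A small run; the experiment described above goes up to 500,000,000.
    for row in simulate(1000, scale=3.0, seed=1, report_every=100):
        print("%12d %16.2f %10.2f %10.5f %8.2f" % row)
```

Running with a larger `max_count` only changes the loop bound; a larger `report_every` keeps the number of stored rows manageable.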

    The output of the experiment/simulation is provided as an attachment (Simulation Output Laplace constant of three.pdf). The first column is the cumulative true count. The second column is the differentially private cumulative count. The third and fourth columns show the actual difference and the percent difference, respectively, between the cumulative true count and the cumulative DP count. The last column shows the DP adjustment applied to the nth count. A Laplace distribution with a mean of zero and a scale factor B of 3.0 was used, which I believe is slightly lower than the value used in the Census Bureau's end2end study. My understanding is that the B values used by the Census Bureau range from 4 to 10. Such high values degrade data quality further.

    It would be interesting to know how the statistical community would use these cumulative DP counts if federal statistical publications released them in place of true counts to protect individual privacy. Any comments from potential data users? I am aware that I am not the first person to raise concerns about the applicability of DP. However, my experiment/simulation covers a wide range of individual counts, from 1 to over 200 million, of the kind that will be encountered in Census2020 and in many other federal statistical data publications.

    In a separate PDF file (Summary ten simulations for 10 Laplace Constant Values.pdf), I have also provided summary statistics for ten different experiments/simulations conducted with ten different Laplace B values ranging from 1 to 10. The computer-generated random numbers used in these ten experiments/simulations give different outcomes for the same true count and B value, which is expected.
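A comparison across B values of this kind could be sketched as follows. The statistics computed here (mean and maximum absolute DP adjustment, and the final cumulative difference) are assumptions chosen for illustration; the attached summary file may report different ones. The `summarize` function, its parameters, and the use of a different seed per B value are likewise hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """Draw one Laplace(0, scale) variate as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def summarize(scale, n_counts=20000, seed=0):
    """Repeat the count-by-count simulation for one B value and summarize it."""
    rng = random.Random(seed)
    cum_true = 0
    cum_dp = 0.0
    abs_noise = []
    for count in range(1, n_counts + 1):
        noise = laplace_noise(scale, rng)
        cum_true += count
        cum_dp += count + noise
        abs_noise.append(abs(noise))
    return {
        "scale": scale,
        "mean_abs_adjustment": sum(abs_noise) / len(abs_noise),
        "max_abs_adjustment": max(abs_noise),
        "cumulative_difference": cum_dp - cum_true,
    }


# One run per B value from 1 to 10; a different seed gives a different outcome
# for the same true counts and the same B value.
summaries = [summarize(b, seed=b) for b in range(1, 11)]

if __name__ == "__main__":
    for s in summaries:
        print("B=%2d  mean|adj|=%6.2f  max|adj|=%7.2f  cum.diff=%10.2f" % (
            s["scale"], s["mean_abs_adjustment"],
            s["max_abs_adjustment"], s["cumulative_difference"]))
```

For a Laplace distribution with scale B, the expected absolute adjustment equals B, so the mean absolute adjustment in each row should land close to the B value used.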

    Ramesh A Dandekar

    Retired



    ------------------------------
    Ramesh Dandekar
    Retired

    ------------------------------


  • 2.  RE: Practical Utility of Differentially Private Discrete Tabular Data in Statistical Data Publications

    Posted 04-24-2020 09:03
    Hi All
    Someone suggested that I apply the DP technique to a simple two-dimensional hypothetical table with six race categories and five districts. The attached file contains the output from twenty separate DP simulations using a Laplace B value of 4. The test output shows that the numbers in individual categories not only jump around a lot but also often come out as negative counts. My understanding is that the Census Bureau used the top-down procedure to eliminate negative counts.
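A minimal sketch of this kind of table experiment is given below. The true counts in `TRUE_TABLE` are invented purely for illustration and are not the values in the attached file; the function names and the seed are also hypothetical. Noise is added independently to each cell and left unrounded, with no post-processing, which is why negative values can appear.

```python
import random


def laplace_noise(scale, rng):
    """Draw one Laplace(0, scale) variate as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


# Hypothetical true counts: six race categories (rows) by five districts (columns).
TRUE_TABLE = [
    [1200, 950, 40, 310, 75],
    [300, 20, 5, 0, 12],
    [45, 3, 0, 8, 1],
    [10, 0, 2, 1, 0],
    [5, 1, 0, 0, 3],
    [60, 15, 7, 2, 0],
]


def privatize(table, scale, rng):
    """Add independent Laplace(0, scale) noise to every cell of the table."""
    return [[cell + laplace_noise(scale, rng) for cell in row] for row in table]


def run_simulations(table, scale=4.0, n_runs=20, seed=0):
    """Produce n_runs independently noised copies of the table."""
    rng = random.Random(seed)
    return [privatize(table, scale, rng) for _ in range(n_runs)]


runs = run_simulations(TRUE_TABLE)
# Number of cells that came out negative in each of the twenty runs.
negatives_per_run = [sum(cell < 0 for row in run for cell in row) for run in runs]

if __name__ == "__main__":
    print("negative cells per run:", negatives_per_run)
```

Cells whose true count is zero or close to zero are the ones most likely to turn negative, since the noise is symmetric around zero.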
    Ramesh

    ------------------------------
    Ramesh Dandekar
    Retired
    Retired --- for private use
    ------------------------------

    Attachment(s)

    pdf
    RaceTable.pdf   122 KB 1 version