  • 1.  Principal Components with binary data

    Posted 07-03-2014 11:53
    Has anyone done or have any recommendations re: using PCA on binary data?
    We have a 30 question survey on n=550 subjects.  Each question is binary no/yes.
    We did PCA and got some meaningful interpretations on the first 3 or 4 PCs, even though the cumulative variance
    for the first 4 is only 36%.
    I feel that it can be used as an exploratory tool, recognizing that the required multivariate normal assumptions do not hold.
    Suggestions welcome.
    Thanks.

    -------------------------------------------
    Martin Lesser
    -------------------------------------------

    Martin L Lesser, PhD, EMT-CC
    Director and Investigator,
       Biostatistics Unit,
       Feinstein Institute for Medical Research
    Professor, Dep't of Molecular Medicine &
         Dep't of Population Health,
       Hofstra North Shore-LIJ School of
          Medicine
    Chair, IRB Committee "B"

    Mailing Address:
    Biostatistics Unit
    Feinstein Institute for Medical Research
    North Shore - LIJ Health System
    350 Community Drive
    Manhasset, NY  11030
    Phone: 516-562-0300
    FAX: 516-562-0344




  • 2.  RE: Principal Components with binary data

    Posted 07-03-2014 12:30
    I sometimes use principal components on binary etc. data, just to "see what happens".

    The methodology works best when the variables are continuous.
    Fortunately, a Google search turns up others who have extended the methodology to categorical and similar data,
    such as:
    http://www.stat.columbia.edu/~gelman/stuff_for_blog/csda.pdf
    http://repository.tamu.edu/handle/1969.1/ETD-TAMU-2009-05-602
    -------------------------------------------
    Chris Barker, Ph.D.
    Consultant and
    Adjunct Associate Professor of Biostatistics
    www.barkerstats.com

    ---
    "In composition you have all the time you want to decide what to say in 15 seconds, in improvisation you have 15 seconds."
    -Steve Lacy
    -------------------------------------------




  • 3.  RE: Principal Components with binary data

    Posted 07-03-2014 12:32
    I am absolutely sure that this is available in Mplus, which you have to purchase.

    -------------------------------------------
    Paul Thompson
    Director, Methodology and Data Analysis Center
    Sanford Research/USD
    -------------------------------------------




  • 4.  RE: Principal Components with binary data

    Posted 07-04-2014 11:26
    > Has anyone done or have any recommendations re: using PCA on binary data?
    > We have a 30 question survey on n=550 subjects.  Each question is binary no/yes.
    > We did PCA and got some meaningful interpretations on the first 3 or 4 PCs, even though the cume variance
    > for the first 4 is only 36%.
    > I feel that it can be used as an exploratory tool, recognizing that the required multivariate normal assumptions do not hold.

    It can be an excellent exploratory tool.  As far as I know, there are no "required" multivariate normality assumptions: it is a general tool to characterize the second moments of the data.  Indeed, one of the chief uses of PCA is to detect strong violations of multivariate unimodality (that is, certain kinds of clustering).
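
    To make the "second moments" point concrete, here is a minimal base-R sketch.  It does not use the appended code: the binary data come from thresholding a hypothetical one-factor latent normal model (my assumption, chosen only so the example needs no extra packages), and the loading 0.6 is arbitrary.  It checks that the variances reported by `prcomp()` are just the eigenvalues of the sample covariance matrix, with no distributional assumption entering anywhere.

```r
# Base-R sketch: PCA only needs the covariance matrix (second moments).
set.seed(1)                                      # arbitrary seed, for reproducibility
n <- 550; d <- 30                                # sizes taken from the original question
z <- rnorm(n)                                    # hypothetical shared latent factor
latent <- 0.6 * z + matrix(rnorm(n * d), n, d)   # z recycles down columns: row i gets z[i]
x <- (latent > 0) * 1                            # threshold at 0: binary items near 50% "yes"

ev  <- eigen(cov(x), symmetric = TRUE)$values    # eigenvalues of the sample covariance matrix
fit <- prcomp(x)                                 # centers columns; divisor n - 1, like cov()
max_diff <- max(abs(fit$sdev^2 - ev))            # should be zero up to rounding error
cum_prop <- cumsum(ev) / sum(ev)                 # cumulative proportion of variance

print(max_diff)
print(round(cum_prop[1:4], 3))
```

    Note that `princomp()` uses divisor n rather than n - 1, so its variances differ from these by the factor (n - 1)/n; the proportions of variance are the same.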

    In medium to high dimensions, binary data often tend to look normal from almost all geometric points of view anyway.  (The Central Limit Theorem tells us independent random binary data will look close to normal in a particular direction with direction vector proportional to (1,1, ..., 1), but the same result applies to any direction vectors whose components are roughly balanced in size.)
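
    A quick way to see the Central Limit Theorem effect described above is to compare the excess kurtosis of a single binary variable with that of projections onto balanced directions.  This is only a sketch under simple assumptions: the variables are independent with p = 1/2, the second direction uses random signs as one example of "roughly balanced" components, and `excess_kurtosis()` is a helper I define here, not a library function.

```r
# Base-R sketch: projections of independent binary data onto balanced directions.
set.seed(2)                                      # arbitrary seed, for reproducibility
n <- 550; d <- 30
x  <- matrix(rbinom(n * d, 1, 0.5), n, d)        # independent fair binary variables
xc <- scale(x, center = TRUE, scale = FALSE)     # center each column

u1 <- rep(1, d) / sqrt(d)                        # the (1, 1, ..., 1) direction, unit length
u2 <- sample(c(-1, 1), d, replace = TRUE) / sqrt(d)  # another balanced direction (random signs)

# Hypothetical helper: sample excess kurtosis (0 for a normal distribution).
excess_kurtosis <- function(s) {
  s <- s - mean(s)
  mean(s^4) / mean(s^2)^2 - 3
}

k <- c(balanced = excess_kurtosis(drop(xc %*% u1)),
       signs    = excess_kurtosis(drop(xc %*% u2)),
       single   = excess_kurtosis(xc[, 1]))      # one raw coordinate, for contrast
print(round(k, 2))
```

    The single coordinate should show strongly negative excess kurtosis (a two-point distribution), while the two projections should be close to zero.  Replacing `print()` with `qqnorm()` on the projections gives the visual version.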

    To get some appreciation for this, generate correlated binary data in appropriate dimensions and perform the PCA.  The appended `R` code simulates data like those described: 550 cases, 30 binary variables (each averaging close to 50%), with three or four recognizable principal components accounting for 36% of the variance.  (It takes only a second or two to execute.)  The scatterplot matrix of the rotated data--that is, the data when recorded using a principal component basis--looks completely multivariate normal to the eye.  Histograms of the scores appear slightly non-normal but not to any great extent.

    Varying these parameters will reveal what's really going on.  In smaller dimensions (say, 6 instead of 30) the scatterplot matrix is visibly discrete.  When the marginals differ much from 50%, the original binary variables have to be situated in peculiar geometric patterns (crosses and vees) in order to produce the desired strong correlations and this will show up in the PCA plots.
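
    The effect of dimension can also be checked numerically rather than by eye.  The sketch below counts the distinct points in the plane of the first two principal components for d = 6 and d = 30.  It again uses a thresholded latent-normal generator of my own (`sim_scores()` and `n_distinct()` are hypothetical helpers, and the loading 0.6 is arbitrary), not `rmvbin()`, so the numbers will not match the appended code exactly.

```r
# Base-R sketch: how discrete do the PC scores look in low vs. medium dimension?
set.seed(3)                                      # arbitrary seed, for reproducibility
n <- 550

# Hypothetical helper: simulate n correlated binary rows in d dimensions
# and return their scores on the first two principal components.
sim_scores <- function(d) {
  z <- rnorm(n)                                  # shared latent factor
  x <- ((0.6 * z + matrix(rnorm(n * d), n, d)) > 0) * 1
  prcomp(x)$x[, 1:2]
}

# Hypothetical helper: number of distinct score pairs (rounded to absorb float noise).
n_distinct <- function(s) nrow(unique(round(s, 8)))

low  <- n_distinct(sim_scores(6))                # at most 2^6 = 64 possible rows
high <- n_distinct(sim_scores(30))               # 2^30 possible rows
print(c(d6 = low, d30 = high))
```

    With 6 binary variables there are only 64 possible response patterns, so the 550 cases pile up on at most 64 points; with 30 variables nearly every case lands on its own point, which is why the scatterplots stop looking discrete.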

    -------------------------------------------
    William Huber
    Quantitative Decisions
    -------------------------------------------

    #
    # Explore PCA of medium-dimensional binary data.
    #
    library(bindata)        # Exports rmvbin() to generate correlated binary data
    par(mfrow=c(2,2))       # Puts plots on the same page
    set.seed(17)            # Allows reproduction of the results

    n <- 550                # Number of cases
    d <- 30                 # Dimension (try, e.g., 6 and 32)
    p <- 1/2                # Marginal probabilities (try, e.g., 1/3 and 1/2)
    m <- 4                  # Number of major PCs (try, e.g., 1 and 2)
    rho <- 1/3              # Variance of each of the remaining d-m components (try, e.g., 1/10)
    #
    # Generate correlated binary data.
    #
    v <- svd(matrix(rnorm(d^2), d))$v                   # Random orthonormal matrix
    sigma <- t(v) %*% diag(c(m:1, rep(rho, d-m))) %*% v # Normal covariance matrix
    margprob <- rep(p, d)                               # Marginal probabilities
    x <- rmvbin(n, margprob, sigma=sigma)               # Correlated binary variables
    #
    # PCA.
    #
    fit <- princomp(x)
    summary(fit)
    plot(fit)
    biplot(fit)
    #
    # Examine the PCA more closely.
    #
    x.0 <- apply(x, 2, function(z) z - mean(z))
    s <- svd(x.0)
    hist(x.0 %*% s$v[,1])   # Scores on the first PC (the columns of v are the PC directions)
    hist(x.0 %*% s$v[,d])   # Scores on the last PC
    #
    # Look at 2D sections of the data along principal components.
    #
    pairs(s$u[, 1:min(8,d)], col="#a0a0a0", cex=0.6)
    #---- end of example ----#








  • 5.  RE: Principal Components with binary data

    Posted 07-07-2014 21:21
    To the best of my understanding, the technique called Correspondence Analysis consists basically of doing PCA on contingency-table data.  So if you're doing PCA on multivariate binary data, you may already be doing a version of Correspondence Analysis in disguise, and if you looked into the literature on Correspondence Analysis, you might be pleasantly surprised. 
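
    For readers who want to see the connection, here is a minimal base-R sketch of simple Correspondence Analysis done "by hand" with an SVD, which is the same matrix decomposition that underlies PCA.  The 3 x 4 table of counts is made up purely for illustration, and the variable names are my own; dedicated CA routines offer more output than this.

```r
# Base-R sketch: simple correspondence analysis via the SVD.
# Hypothetical 3 x 4 table of counts (made-up numbers, for illustration only).
N <- matrix(c(40, 25, 15, 30,
              20, 35, 30, 25,
              15, 20, 45, 50), nrow = 3, byrow = TRUE,
            dimnames = list(group = c("A", "B", "C"),
                            answer = c("w", "x", "y", "z")))

P      <- N / sum(N)                             # correspondence matrix (proportions)
r_mass <- rowSums(P)                             # row masses
c_mass <- colSums(P)                             # column masses

# Standardized residuals from independence: (P - r c') scaled by 1/sqrt(r) and 1/sqrt(c).
S  <- diag(1 / sqrt(r_mass)) %*% (P - outer(r_mass, c_mass)) %*% diag(1 / sqrt(c_mass))
sv <- svd(S)                                     # the PCA-like step

inertia    <- sum(sv$d^2)                        # total inertia
row_coords <- diag(1 / sqrt(r_mass)) %*% sv$u[, 1:2] %*% diag(sv$d[1:2])  # principal row coordinates

chi2 <- unname(chisq.test(N)$statistic)          # Pearson chi-square for the same table
print(c(inertia = inertia, chi2_over_n = chi2 / sum(N)))
```

    The total inertia equals the Pearson chi-square statistic divided by the table total, so CA can be read as a PCA-style decomposition of the departure from independence.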

    -------------------------------------------
    Eric Siegel
    Biostatistician
    Univ of Arkansas for Medical Sciences of Biostatistics
    -------------------------------------------




  • 6.  RE: Principal Components with binary data

    Posted 07-12-2014 10:15
    Correspondence Analysis is available in SPSS.

    Also see CATPCA (categorical PCA) in SPSS.

    -------------------------------------------
    Arthur Kendall
    Social Research Consultants
    -------------------------------------------