Principal Components with binary data

1. Principal Components with binary data

Martin Lesser
Posted 07-03-2014 11:53
Has anyone done or have any recommendations re: using PCA on binary data?
We have a 30 question survey on n=550 subjects. Each question is binary no/yes.
We did PCA and got some meaningful interpretations on the first 3 or 4 PCs, even though the cume variance
for the first 4 is only 36%.
I feel that it can be used as an exploratory tool, recognizing that the required multivariate normal assumptions do not hold.
Suggestions welcome.
Thanks.

-------------------------------------------
Martin Lesser
-------------------------------------------

Martin L Lesser, PhD, EMT-CC
Director and Investigator,
   Biostatistics Unit,
   Feinstein Institute for Medical Research
Professor, Dep't of Molecular Medicine &
     Dep't of Population Health,
   Hofstra North Shore-LIJ School of
      Medicine
Chair, IRB Committee "B"
Mailing Address:
Biostatistics Unit
Feinstein Institute for Medical Research
North Shore - LIJ Health System
350 Community Drive
Manhasset, NY 11030
Phone: 516-562-0300
FAX: 516-562-0344
2. RE: Principal Components with binary data

Chris Barker
Posted 07-03-2014 12:30
I sometimes use principal components on binary etc. data, just to "see what happens".

The methodology works best when the variables are continuous.
Fortunately, a Google search turns up others who have extended the methodology to categorical etc. data, such as:
http://www.stat.columbia.edu/~gelman/stuff_for_blog/csda.pdf
http://repository.tamu.edu/handle/1969.1/ETD-TAMU-2009-05-602
-------------------------------------------
Chris Barker, Ph.D.
Consultant and
Adjunct Associate Professor of Biostatistics
www.barkerstats.com

---
"In composition you have all the time you want to decide what to say in 15 seconds, in improvisation you have 15 seconds."
-Steve Lacy
-------------------------------------------
3. RE: Principal Components with binary data

Paul Thompson
Posted 07-03-2014 12:32
I am absolutely sure that this is available in Mplus, which you have to purchase.

-------------------------------------------
Paul Thompson
Director, Methodology and Data Analysis Center
Sanford Research/USD
-------------------------------------------

4. RE: Principal Components with binary data

William Huber
Posted 07-04-2014 11:26
> Has anyone done or have any recommendations re: using PCA on binary data?
> We have a 30 question survey on n=550 subjects. Each question is binary no/yes.
> We did PCA and got some meaningful interpretations on the first 3 or 4 PCs, even though the cume variance
> for the first 4 is only 36%.
> I feel that it can be used as an exploratory tool, recognizing that the required multivariate normal assumptions do not hold.

It can be an excellent exploratory tool. As far as I know, there are no required multivariate normality assumptions: PCA is a general tool to characterize the second moments of the data. Indeed, one of the chief uses of PCA is to detect strong violations of multivariate unimodality (that is, certain kinds of clustering).

In medium to high dimensions, binary data often tend to look normal from almost all geometric points of view anyway. (The Central Limit Theorem tells us independent random binary data will look close to normal in a particular direction with direction vector proportional to (1,1, ..., 1), but the same result applies to any direction vectors whose components are roughly balanced in size.)
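The Central Limit Theorem remark above can be checked with a rough base-R sketch. It assumes independent Bernoulli(1/2) entries, which is simpler than the correlated survey data under discussion, and the names `u.ones`, `u.mixed`, and `shape` are made up for this illustration. It projects the data onto two unit vectors with components of comparable size and reports sample skewness and excess kurtosis, both of which are near zero for normal data.

```r
# Sketch: projections of independent Bernoulli(1/2) data onto balanced directions.
set.seed(1)                      # arbitrary seed, only for reproducibility
n <- 550                         # cases, matching the survey described above
d <- 30                          # binary variables
x <- matrix(rbinom(n * d, 1, 0.5), n, d)   # independent 0/1 entries (an assumption)

# Two unit-length directions whose components are of comparable size.
u.ones <- rep(1, d) / sqrt(d)                         # proportional to (1, 1, ..., 1)
w <- sample(c(-1, 1), d, replace = TRUE) * runif(d, 0.8, 1.2)
u.mixed <- w / sqrt(sum(w^2))                         # balanced sizes, mixed signs

# Standardized sample skewness and excess kurtosis.
shape <- function(z) {
  z <- (z - mean(z)) / sd(z)
  c(skewness = mean(z^3), excess.kurtosis = mean(z^4) - 3)
}
proj.ones  <- drop(x %*% u.ones)
proj.mixed <- drop(x %*% u.mixed)
print(rbind(ones = shape(proj.ones), mixed = shape(proj.mixed)))

# Normal quantile plots of the two projections.
par(mfrow = c(1, 2))
qqnorm(proj.ones,  main = "Direction (1,...,1)"); qqline(proj.ones)
qqnorm(proj.mixed, main = "Balanced, mixed signs"); qqline(proj.mixed)
```

Replacing `u.mixed` with a direction dominated by one or two components (for example, a coordinate axis) should make the projection visibly discrete again, which is the contrast the parenthetical remark is drawing.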

To get some appreciation for this, generate correlated binary data in appropriate dimensions and perform the PCA. The appended `R` code simulates data like those described: 550 cases, 30 binary variables (each averaging close to 50%), with three or four recognizable principal components accounting for 36% of the variance. (It takes only a second or two to execute.) The scatterplot matrix of the rotated data--that is, the data when recorded using a principal component basis--looks completely multivariate normal to the eye. Histograms of the scores appear slightly non-normal but not to any great extent.

Varying these parameters will reveal what's really going on. In smaller dimensions (say, 6 instead of 30) the scatterplot matrix is visibly discrete. When the marginals differ much from 50%, the original binary variables have to be situated in peculiar geometric patterns (crosses and vees) in order to produce the desired strong correlations and this will show up in the PCA plots.
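One way to see the low-dimensional discreteness described above, without plotting, is to count how many distinct values the first principal component score takes. This is only a sketch: it uses independent binary columns rather than correlated ones, and the helper name `distinct.scores` is invented here. With d = 6 there are at most 2^6 = 64 distinct rows, so the scores can take at most 64 values; with d = 30 almost every case is unique.

```r
# Sketch: number of distinct first-PC scores in low versus medium dimension.
set.seed(2)                      # arbitrary seed, only for reproducibility
n <- 550                         # number of cases

distinct.scores <- function(d) {
  x <- matrix(rbinom(n * d, 1, 0.5), n, d)   # independent binary data (an assumption)
  scores <- prcomp(x)$x[, 1]                 # scores on the first principal component
  length(unique(round(scores, 8)))           # rounding guards against floating-point noise
}

k6  <- distinct.scores(6)    # bounded by the 64 possible rows
k30 <- distinct.scores(30)   # 2^30 possible rows, so repeated rows are rare
print(c(d6 = k6, d30 = k30))
```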

-------------------------------------------
William Huber
Quantitative Decisions
-------------------------------------------

#
# Explore PCA of medium-dimensional binary data.
#
require(bindata)        # Exports rmvbin() to generate correlated binary data
par(mfrow=c(2,2))       # Puts plots on the same page
set.seed(17)            # Allows reproduction of the results

n <- 550                # Number of cases
d <- 30                 # Dimension (try, e.g., 6 and 32)
p <- 1/2                # Marginal probabilities (try, e.g., 1/3 and 1/2)
m <- 4                  # Number of major PCs (try, e.g., 1 and 2)
rho <- 1/3              # Variance of each of the d-m minor components (try, e.g., 1/10)
#
# Generate correlated binary data.
#
v <- svd(matrix(rnorm(d^2), d))$v                   # Random orthonormal matrix
sigma <- t(v) %*% diag(c(m:1, rep(rho, d-m))) %*% v # Normal covariance matrix
margprob <- rep(p, d)                               # Marginal probabilities
x <- rmvbin(n, margprob, sigma=sigma)               # Correlated binary variables
#
# PCA.
#
fit <- princomp(x)
summary(fit)
plot(fit)
biplot(fit)
#
# Examine the PCA more closely.
#
x.0 <- apply(x, 2, function(z) z - mean(z))
s <- svd(x.0)
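hist(x.0 %*% s$v[, 1])  # Scores on the first principal direction (columns of v)
hist(x.0 %*% s$v[, d])  # Scores on the last principal direction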
#
# Look at 2D sections of the data along principal components.
#
pairs(s$u[, 1:min(8,d)], col="#a0a0a0", cex=0.6)
#---- end of example ----#
5. RE: Principal Components with binary data

Eric Siegel
Posted 07-07-2014 21:21
To the best of my understanding, the technique called Correspondence Analysis consists basically of doing PCA on contingency-table data. So if you're doing PCA on multivariate binary data, you may already be doing a version of Correspondence Analysis in disguise, and if you looked into the literature on Correspondence Analysis, you might be pleasantly surprised.
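The link between Correspondence Analysis and PCA mentioned above can be sketched in base R through the usual SVD formulation: decompose the standardized residuals of a contingency table from the independence model. The 3 x 4 table of counts below is invented purely for illustration, and the variable names are arbitrary. For multivariate binary survey data the analogous method is multiple correspondence analysis applied to the indicator matrix, which this small sketch does not attempt.

```r
# Sketch: simple correspondence analysis of a two-way table via the SVD.
# Hypothetical counts, invented for illustration only.
N <- matrix(c(20, 10,  5, 15,
              10, 25, 10,  5,
               5, 10, 30, 10), nrow = 3, byrow = TRUE)
n.total <- sum(N)
P  <- N / n.total                # correspondence matrix (cell proportions)
r  <- rowSums(P)                 # row masses
cm <- colSums(P)                 # column masses

# Standardized residuals from the independence model r %o% cm.
S <- diag(1 / sqrt(r)) %*% (P - r %o% cm) %*% diag(1 / sqrt(cm))
dec <- svd(S)

inertia <- dec$d^2                                         # principal inertias
row.coords <- diag(1 / sqrt(r)) %*% dec$u %*% diag(dec$d)  # principal row coordinates

# Total inertia equals the Pearson chi-square statistic divided by n.
chi2 <- unname(chisq.test(N, correct = FALSE)$statistic)
print(c(total.inertia = sum(inertia), chi2.over.n = chi2 / n.total))
```

The last singular value is zero up to rounding, because the residual matrix of a 3 x 4 table has rank at most 2.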

-------------------------------------------
Eric Siegel
Biostatistician
Univ of Arkansas for Medical Sciences of Biostatistics
-------------------------------------------
6. RE: Principal Components with binary data

Arthur Kendall
Posted 07-12-2014 10:15
Correspondence Analysis is available in SPSS.

Also see CATPCA (categorical PCA) in SPSS.

-------------------------------------------
Arthur Kendall
Social Research Consultants
-------------------------------------------
