> Has anyone done or have any recommendations re: using PCA on binary data?
> We have a 30 question survey on n=550 subjects. Each question is binary no/yes.
> We did PCA and got some meaningful interpretations on the first 3 or 4 PCs, even though the cume variance
> for the first 4 is only 36%.
> I feel that it can be used as an exploratory tool, recognizing that the required multivariate normal assumptions do not hold.
It can be an excellent exploratory tool. As far as I know, PCA requires no multivariate normality assumption: it is a general tool for characterizing the second moments of the data. Indeed, one of the chief uses of PCA is to detect strong violations of multivariate unimodality (that is, certain kinds of clustering).
In medium to high dimensions, binary data often look normal from almost all geometric points of view anyway. (The Central Limit Theorem tells us that independent random binary data will look close to normal when projected onto the direction proportional to (1,1, ..., 1), but the same result applies to any direction vector whose components are roughly balanced in size.)
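As a rough illustration of the parenthetical remark, here is a minimal base-`R` sketch. It assumes independent Bernoulli(1/2) variables, which is simpler than the correlated survey data, and the names `unit`, `w.ones`, `w.bal`, and `w.axis` are made up for this example. It projects the data onto the (1,1, ..., 1) direction, onto a random direction with components of comparable size, and onto a single coordinate axis for contrast.

```r
# Sketch: projections of independent binary data onto three directions.
set.seed(1)                                  # Arbitrary seed (assumption)
n <- 550; d <- 30                            # Sizes borrowed from the question
x <- matrix(rbinom(n * d, 1, 1/2), n, d)     # Independent Bernoulli(1/2) columns
unit <- function(w) w / sqrt(sum(w^2))       # Rescale a vector to unit length
w.ones <- unit(rep(1, d))                    # The (1,1,...,1) direction
w.bal  <- unit(runif(d, 1/2, 1) *            # Components of comparable size,
               sample(c(-1, 1), d, replace = TRUE))  # with random signs
w.axis <- c(1, rep(0, d - 1))                # Unbalanced: one coordinate only
dirs <- list(ones = w.ones, balanced = w.bal, axis = w.axis)
par(mfrow = c(1, 3))                         # Three histograms side by side
for (nm in names(dirs)) hist(x %*% dirs[[nm]], main = nm, xlab = "Projection")
```

The first two histograms should look roughly bell-shaped, while the projection onto a single axis can only take the two values 0 and 1.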
To get some appreciation for this, generate correlated binary data in appropriate dimensions and perform the PCA. The appended `R` code simulates data like those described: 550 cases, 30 binary variables (each averaging close to 50%), with three or four recognizable principal components accounting for 36% of the variance. (It takes only a second or two to execute.) The scatterplot matrix of the rotated data--that is, the data when recorded using a principal component basis--looks completely multivariate normal to the eye. Histograms of the scores appear slightly non-normal but not to any great extent.
Varying these parameters will reveal what's really going on. In smaller dimensions (say, 6 instead of 30) the scatterplot matrix is visibly discrete. When the marginals differ much from 50%, the original binary variables have to be situated in peculiar geometric patterns (crosses and vees) in order to produce the desired strong correlations and this will show up in the PCA plots.
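One way to see why low dimensions look discrete is to count distinct rows: with d binary variables there are at most 2^d distinct cases, so for d = 6 the 550 cases fall on at most 64 points in any plot of PC scores. The base-`R` sketch below illustrates this. It is only a stand-in for the appended script: it creates correlation by thresholding normal variables that share a common factor, and the helper name `distinct.rows` is hypothetical.

```r
# Sketch: how many distinct cases can 550 binary rows occupy?
set.seed(17)                                 # Arbitrary seed (assumption)
n <- 550                                     # Number of cases
distinct.rows <- function(d) {
  # Adding one shared N(0,1) value per row correlates the columns
  z <- matrix(rnorm(n * d), n, d) + rnorm(n)
  x <- (z > 0) * 1                           # Threshold at 0: marginals near 1/2
  nrow(unique(x))                            # Number of distinct binary patterns
}
counts <- sapply(c(6, 30), distinct.rows)    # Compare d = 6 with d = 30
counts
```

The count for d = 6 cannot exceed 64, so many cases must coincide exactly; for d = 30 far more of the rows are distinct, which is why the scatterplots no longer look like a lattice.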
-------------------------------------------
William Huber
Quantitative Decisions
-------------------------------------------
#
# Explore PCA of medium-dimensional binary data.
#
require(bindata) # Exports rmvbin() to generate correlated binary data
par(mfrow=c(2,2)) # Puts plots on the same page
set.seed(17) # Allows reproduction of the results
n <- 550 # Number of cases
d <- 30 # Dimension (try, e.g., 6 and 32)
p <- 1/2 # Marginal probabilities (try, e.g., 1/3 and 1/2)
m <- 4 # Number of major PCs (try, e.g., 1 and 2)
rho <- 1/3 # Variance of each of the remaining d-m components (try, e.g., 1/10)
#
# Generate correlated binary data.
#
v <- svd(matrix(rnorm(d^2), d))$v # Random orthonormal matrix
sigma <- t(v) %*% diag(c(m:1, rep(rho, d-m))) %*% v # Normal covariance matrix
margprob <- rep(p, d) # Marginal probabilities
x <- rmvbin(n, margprob, sigma=sigma) # Correlated binary variables
#
# PCA.
#
fit <- princomp(x)
summary(fit)
plot(fit)
biplot(fit)
#
# Examine the PCA more closely.
#
x.0 <- apply(x, 2, function(z) z - mean(z))
s <- svd(x.0)
hist(x.0 %*% s$v[, 1]) # Scores on the first PC (directions are the columns of v)
hist(x.0 %*% s$v[, d]) # Scores on the last PC
#
# Look at 2D sections of the data along principal components.
#
pairs(s$u[, 1:min(8,d)], col="#a0a0a0", cex=0.6)
#---- end of example ----#