Hi Georgette,
No, absorption is not regularization. Fitting a regression by absorbing a categorical covariate is a computational trick (a "projection") that gives the same fit as the model that includes dummy variables for all the categories; nothing is penalized or shrunk. It is faster but does not return coefficients for the dummy variables, which are usually not of interest.
Chris
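library(estimatr)

# Simulate 10,000 observations with a 300-level categorical covariate
set.seed(20220704)
nn <- 1e4
kk <- 300
cc <- factor(sample(kk, nn, replace = TRUE))
xx <- rnorm(nn)
yy <- 1 + 2 * xx + rnorm(nn)

# Ordinary fit: dummy variables for the categories of cc are estimated explicitly
system.time(mod <- lm(yy ~ xx + cc))
head(coef(mod))
head(coef(summary(mod)))

# Absorbed fit: cc is projected out, so only the xx coefficient is returned
system.time(amod <- estimatr::lm_robust(yy ~ xx, fixed_effects = ~ cc, se_type = "classical"))
coef(amod)
summary(amod)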
####
> head(coef(summary(mod)))
              Estimate Std. Error    t value     Pr(>|t|)
(Intercept)  1.2406509 0.17261745   7.187286 7.096729e-13
xx           2.0226823 0.01012584 199.754575 0.000000e+00
cc2         -0.4540576 0.23477621  -1.934002 5.314198e-02
cc3         -0.4131068 0.25946896  -1.592124 1.113894e-01
cc4         -0.3522683 0.24073263  -1.463317 1.434129e-01
cc5         -0.2662022 0.23477514  -1.133860 2.568812e-01
####
Call:
estimatr::lm_robust(formula = yy ~ xx, fixed_effects = ~cc, se_type = "classical")
Standard error type: classical
Coefficients:
   Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper   DF
xx    2.023    0.01013   199.8        0    2.003    2.043 9699
Multiple R-squared: 0.8108 , Adjusted R-squared: 0.805
Multiple R-squared (proj. model): 0.8045 , Adjusted R-squared (proj. model): 0.7984
F-statistic (proj. model): 3.99e+04 on 1 and 9699 DF, p-value: < 2.2e-16
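
For a single categorical covariate, the projection amounts to subtracting each group's mean from the outcome and from the other covariates, then regressing the demeaned variables on each other. The following is a minimal sketch of that idea in base R, not what estimatr does internally; the simulated data and the names g, x, y, x_dm, y_dm are made up for illustration.

```r
# Hand-rolled absorption of one categorical covariate (illustrative sketch).
set.seed(1)
n <- 2000
k <- 50
g <- factor(sample(k, n, replace = TRUE))  # hypothetical grouping variable
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)

# Reference: slope from the model with explicit dummy variables
b_dummy <- coef(lm(y ~ x + g))[["x"]]

# ave() returns each observation's group mean (its default FUN is mean)
x_dm <- x - ave(x, g)
y_dm <- y - ave(y, g)

# Regress demeaned y on demeaned x; no intercept is needed after demeaning
b_absorb <- coef(lm(y_dm ~ 0 + x_dm))[["x_dm"]]

# The two slopes should agree up to numerical tolerance
all.equal(b_dummy, b_absorb)
```

The slope matches, but the standard error reported by the second lm() call would not, because that call does not know that the group means used up degrees of freedom. A function such as lm_robust() with fixed_effects accounts for this, which is why its DF above is 9699 rather than 9999.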
------------------------------
Chris Andrews
Statistician Expert
University of Michigan
------------------------------
Original Message:
Sent: 07-01-2022 09:35
From: Georgette Asherman
Subject: Regression with a 50,000 plus parameters
Economists often use different words than statisticians for the same thing. Is absorption the same as regularization, e.g. LASSO?
Georgette Asherman
------------------------------
Georgette Asherman
Original Message:
Sent: 06-30-2022 07:38
From: Christopher Andrews
Subject: Regression with a 50,000 plus parameters
Econometricians often use absorption to control for a variable with very many (e.g., 50K) categories. Examples: xtreg and areg in Stata, the estimatr package in R, or the ABSORB statement of PROC GLM in SAS. Not sure if that is your use case.
------------------------------
Chris Andrews
Statistician Expert
University of Michigan
Original Message:
Sent: 06-29-2022 12:07
From: Terry Meyer
Subject: Regression with a 50,000 plus parameters
Does anyone have any ideas on how to do a regression with so many parameters?
------------------------------
Terry Meyer
------------------------------