# ASA Connect


## Regression with a 50,000 plus parameters

• #### 1.  Regression with a 50,000 plus parameters

Posted 06-29-2022 12:08
Anyone have any ideas on how to do a regression with so many parameters?

------------------------------
Terry Meyer
------------------------------

• #### 2.  RE: Regression with a 50,000 plus parameters

Posted 06-30-2022 07:09
Hey Terry,
Just to get the ball rolling....

 Is it that (i) the data have more than 50k predictive features, or (ii) you already have a model and it has 50k+ parameters?

 If it's (i), do you already have a model in mind? Ex. linear regression, or some type of regularized regression, or anything under the sun that can create y=f(x)?

- What is x? In particular, what are its dimensions, and how do the features relate to each other and to the outcome variable?
- What is y (the outcome)? What is its dimensionality?

 What do you need from the model? Prediction? Explanation?
Do you need to use all of the predictive variables in the model?
Do you need to understand joint/marginal relationships between the features and the outcome?

That should help get things started.

------------------------------
Glen Wright Colopy
DPhil Oxon
The Data & Science Podcast / LifeBell AI
------------------------------

• #### 3.  RE: Regression with a 50,000 plus parameters

Posted 06-30-2022 07:38
Econometricians often use absorption to control for a variable with very many (e.g., 50K) categories.  Examples: xtreg and areg in Stata; the estimatr package in R; or the ABSORB statement of PROC GLM in SAS.  Not sure if that is your use case.

------------------------------
Chris Andrews
Statistician Expert
University of Michigan
------------------------------

• #### 4.  RE: Regression with a 50,000 plus parameters

Posted 07-01-2022 09:35
Economists often use different words than statisticians for the same thing.  Is absorption the same as regularization, e.g. LASSO?

Georgette Asherman

------------------------------
Georgette Asherman
------------------------------

• #### 5.  RE: Regression with a 50,000 plus parameters

Posted 07-04-2022 11:04
Hi Georgette,

Fitting a regression by absorbing a categorical covariate is a computational approach/trick ("projection") that results in a fit equivalent to the model that includes dummy variables for all the categories. It is faster but does not return coefficients for the dummy variables, which are not of interest.
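One way to see why the two fits agree is the Frisch-Waugh-Lovell result. The notation here ($X$, $D$, $M_D$, $g(i)$) is introduced for illustration and does not come from the thread; treat it as a sketch for a single absorbed categorical covariate.

$$
y = X\beta + D\gamma + \varepsilon, \qquad M_D = I - D\,(D^\top D)^{-1} D^\top
$$

$$
\hat\beta = \left(X^\top M_D X\right)^{-1} X^\top M_D\, y, \qquad (M_D\, y)_i = y_i - \bar{y}_{g(i)}
$$

Here $D$ holds one indicator column per category (so the intercept is absorbed too), and $g(i)$ is the category of observation $i$. Because applying $M_D$ amounts to subtracting category means, $\hat\beta$ can be computed without estimating the entries of $\gamma$.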

Chris

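```r
library(estimatr)

# Simulate 10,000 observations with one continuous predictor (xx) and a
# categorical covariate (cc) with kk = 300 categories that does not affect yy.
set.seed(20220704)
nn <- 1e4
kk <- 300
cc <- factor(sample(kk, nn, replace = TRUE))
xx <- rnorm(nn)
yy <- 1 + 2 * xx + rnorm(nn)

# Standard fit: one dummy variable per category of cc.
system.time(mod <- lm(yy ~ xx + cc))

# Absorbed fit: cc is projected out rather than estimated.
system.time(amod <- estimatr::lm_robust(yy ~ xx, fixed_effects = ~ cc, se_type = "classical"))
coef(amod)
summary(amod)
```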

First rows of the coefficient table from the `lm` fit with dummy variables:

```
              Estimate Std. Error    t value     Pr(>|t|)
(Intercept)  1.2406509 0.17261745   7.187286 7.096729e-13
xx           2.0226823 0.01012584 199.754575 0.000000e+00
cc2         -0.4540576 0.23477621  -1.934002 5.314198e-02
cc3         -0.4131068 0.25946896  -1.592124 1.113894e-01
cc4         -0.3522683 0.24073263  -1.463317 1.434129e-01
cc5         -0.2662022 0.23477514  -1.133860 2.568812e-01
```

Summary of the absorbed fit:

```
Call:
estimatr::lm_robust(formula = yy ~ xx, fixed_effects = ~cc, se_type = "classical")

Standard error type: classical

Coefficients:
   Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper   DF
xx    2.023    0.01013   199.8        0    2.003    2.043 9699

Multiple R-squared: 0.8108 , Adjusted R-squared: 0.805
Multiple R-squared (proj. model): 0.8045 , Adjusted R-squared (proj. model): 0.7984
F-statistic (proj. model): 3.99e+04 on 1 and 9699 DF, p-value: < 2.2e-16
```
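The equivalence can also be checked by hand. The following is a minimal base-R sketch, assuming the same kind of simulated data as in the post; the names `fit_full`, `fit_within`, `yy_within`, and `xx_within` are made up for this illustration. It uses `ave()` to subtract category means, which is one simple way to do the projection for a single categorical covariate and is not necessarily how `estimatr` implements it.

```r
# Simulated data, same setup as in the post.
set.seed(20220704)
nn <- 1e4
kk <- 300
cc <- factor(sample(kk, nn, replace = TRUE))
xx <- rnorm(nn)
yy <- 1 + 2 * xx + rnorm(nn)

# Full model: one dummy variable per category of cc.
fit_full <- lm(yy ~ xx + cc)

# Absorbed model: subtract the within-category mean from yy and xx,
# then regress the demeaned outcome on the demeaned predictor (no intercept).
yy_within <- yy - ave(yy, cc)
xx_within <- xx - ave(xx, cc)
fit_within <- lm(yy_within ~ xx_within - 1)

slope_full <- unname(coef(fit_full)["xx"])
slope_within <- unname(coef(fit_within)["xx_within"])

# The residual degrees of freedom must still count the absorbed category
# means: nn minus the number of categories minus the one slope.
df_resid <- nn - nlevels(cc) - 1
se_within <- sqrt(sum(resid(fit_within)^2) / df_resid / sum(xx_within^2))
se_full <- coef(summary(fit_full))["xx", "Std. Error"]

# Both comparisons should report TRUE up to numerical tolerance.
all.equal(slope_full, slope_within)
all.equal(se_full, se_within)
```

Note that the standard error reported directly by `summary(fit_within)` would be slightly too small, because `lm` does not know that the category means were already estimated; that is why the sketch recomputes it with `df_resid`.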


------------------------------
Chris Andrews
Statistician Expert
University of Michigan
------------------------------