ASA Connect

  • 1.  Regression with 50,000-plus parameters

    Posted 06-29-2022 12:08
    Anyone have any ideas on how to do a regression with so many parameters?

    ------------------------------
    Terry Meyer
    ------------------------------


  • 2.  RE: Regression with 50,000-plus parameters

    Posted 06-30-2022 07:09
    Hey Terry,
    Just to get the ball rolling....

    [0] Does the data have (i) >50k predictive features, or (ii) do you already have a model that has 50k+ parameters?

    [1] If it's (i), do you already have a model in mind? E.g., linear regression, some type of regularized regression, or anything under the sun that can create y = f(x)?

    [2] Please describe your data a little more, in particular:
    - What is x? In particular, what are its dimensions, and how do the features relate to each other and to the outcome variable?
    - What is y (the outcome)? What is its dimensionality?

    [3] What do you need from the model? Prediction? Explanation?
    Do you need to use all of the predictive variables in the model?
    Do you need to understand joint/marginal relationships between the features and the outcome?

    That should help get things started.

    ------------------------------
    Glen Wright Colopy
    DPhil Oxon
    The Data & Science Podcast / LifeBell AI
    ------------------------------



  • 3.  RE: Regression with 50,000-plus parameters

    Posted 06-30-2022 07:38
    Econometricians often use absorption to control for a variable with very many (e.g., 50K) categories, e.g., xtreg or areg in Stata, the estimatr package in R, or the ABSORB statement of PROC GLM in SAS.  Not sure if that is your use case.
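
    Loosely speaking, absorbing one categorical covariate is the classic "within" (demeaning) transformation. A minimal sketch in R of that idea, with illustrative variable names (not tied to any of the packages above):

    set.seed(1)
    g <- factor(sample(50, 1000, replace = TRUE))  # covariate with 50 categories
    x <- rnorm(1000)
    y <- 1 + 2 * x + rnorm(50)[g] + rnorm(1000)    # group effects plus noise

    # Demean y and x within each level of g, then regress without an intercept:
    y_w <- y - ave(y, g)
    x_w <- x - ave(x, g)
    coef(lm(y_w ~ x_w - 1))  # same slope as lm(y ~ x + g), by Frisch-Waugh-Lovell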

    ------------------------------
    Chris Andrews
    Statistician Expert
    University of Michigan
    ------------------------------



  • 4.  RE: Regression with 50,000-plus parameters

    Posted 07-01-2022 09:35
    Economists often use different words than statisticians for the same thing.  Is absorption the same as regularization, e.g. LASSO?

    ------------------------------
    Georgette Asherman
    ------------------------------



  • 5.  RE: Regression with 50,000-plus parameters

    Posted 07-04-2022 11:04
    Hi Georgette,

    Fitting a regression by absorbing a categorical covariate is a computational approach/trick ("projection") that results in a fit equivalent to the model that includes dummy variables for all the categories. It is faster but does not return coefficients for the dummy variables, which are typically not of interest. So it is not the same as regularization: nothing is shrunk or selected, and the fit is identical to ordinary least squares with all the dummies included.

    Chris


    library(estimatr)

    set.seed(20220704)
    nn <- 1e4                                    # observations
    kk <- 300                                    # categories to absorb
    cc <- factor(sample(kk, nn, replace = TRUE))
    xx <- rnorm(nn)
    yy <- 1 + 2 * xx + rnorm(nn)                 # cc has no true effect here

    # Full model: lm() estimates one dummy coefficient per level of cc.
    system.time(mod <- lm(yy ~ xx + cc))
    head(coef(mod))
    head(coef(summary(mod)))

    # Absorbed model: cc is projected out; only the xx coefficient is returned.
    system.time(amod <- estimatr::lm_robust(yy ~ xx, fixed_effects = ~ cc, se_type = "classical"))
    coef(amod)
    summary(amod)

    ####

    > head(coef(summary(mod)))
                  Estimate Std. Error    t value     Pr(>|t|)
    (Intercept)  1.2406509 0.17261745   7.187286 7.096729e-13
    xx           2.0226823 0.01012584 199.754575 0.000000e+00
    cc2         -0.4540576 0.23477621  -1.934002 5.314198e-02
    cc3         -0.4131068 0.25946896  -1.592124 1.113894e-01
    cc4         -0.3522683 0.24073263  -1.463317 1.434129e-01
    cc5         -0.2662022 0.23477514  -1.133860 2.568812e-01
    ####

    Call:
    estimatr::lm_robust(formula = yy ~ xx, fixed_effects = ~cc, se_type = "classical")

    Standard error type: classical

    Coefficients:
       Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper   DF
    xx    2.023    0.01013   199.8        0    2.003    2.043 9699

    Multiple R-squared: 0.8108 , Adjusted R-squared: 0.805
    Multiple R-squared (proj. model): 0.8045 , Adjusted R-squared (proj. model): 0.7984
    F-statistic (proj. model): 3.99e+04 on 1 and 9699 DF, p-value: < 2.2e-16
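
    For the original question's scale, assuming the 50K+ "parameters" are the levels of a single categorical covariate, the same call should stay feasible where lm() would not be. A rough sketch along the same lines (sizes illustrative, not benchmarked here):

    nn <- 5e5
    kk <- 5e4                                    # ~50K categories to absorb
    cc <- factor(sample(kk, nn, replace = TRUE))
    xx <- rnorm(nn)
    yy <- 1 + 2 * xx + rnorm(nn)

    # lm(yy ~ xx + cc) would have to build a 5e5-by-5e4 dummy design matrix;
    # absorbing cc avoids materializing it.
    amod <- estimatr::lm_robust(yy ~ xx, fixed_effects = ~ cc, se_type = "classical")
    coef(amod)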



    ------------------------------
    Chris Andrews
    Statistician Expert
    University of Michigan
    ------------------------------