ASA Connect


Minimizing p-values

  • 1.  Minimizing p-values

    Posted 07-30-2020 11:15
    I have a methodology question. Suppose I have a dependent variable, Y, and N+1 independent variables: Z and X1 through XN. I want to choose a subset of the Xs that minimizes the p-value of Z in a regression of Y on Z and that subset. Is there an analytical means of choosing the subset? Is there a practical means other than an 'all subsets' regression? Thank you for any help.

    ------------------------------
    Terry Meyer
    ------------------------------


  • 2.  RE: Minimizing p-values

    Posted 07-31-2020 06:54
    I am not sure how practical this would be, but finding a subset that is most nearly orthogonal to the response vector should mean that Z explains most of the remaining variability, which in turn would imply a small p-value. Possibly an adaptation of the LASSO method could be used.

    ------------------------------
    Steven Denham
    Sr. Biostatistical Scientist
    Charles River Laboratories
    ------------------------------



  • 3.  RE: Minimizing p-values

    Posted 07-31-2020 08:35
    Hi Terry,

    I suggest focusing on the best subset of predictors based on information criteria (AIC and BIC), rather than trying to minimize the p-value.  SAS has an excellent procedure for selecting the best subset of predictors, Proc GLMSelect, which supports LASSO and other state-of-the-art methods.

    Link to SAS documentation for Proc GLMSelect
    https://documentation.sas.com/?cdcId=pgmsascdc&cdcVersion=9.4_3.4&docsetId=statug&docsetTarget=statug_glmselect_toc.htm&locale=en
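
    For a concrete sense of what such a call looks like, here is a minimal sketch; the data set and variable names (work.mydata, y, z, x1-x20) are placeholders for your own:

        proc glmselect data=work.mydata;
           /* LASSO selection over the candidate effects, with the final
              model chosen by the SBC (BIC) criterion */
           model y = z x1-x20 / selection=lasso(choose=sbc);
        run;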

    At the last Michigan SAS Users conference in June 2019, we had some great presentations on variable selection methods:
    http://www.misug.org/uploads/8/1/9/1/8191072/bgillespie__machine_learning.pdf

    http://www.misug.org/uploads/8/1/9/1/8191072/candrews_machine_learning.pdf

    Hope this helps to get you started.  In 2020, the annual Michigan SAS Users Group conference was canceled due to COVID.

    ------------------------------
    Brandy Sinco, BS, MA, MS
    Statistician Senior
    Michigan Medicine
    ------------------------------



  • 4.  RE: Minimizing p-values

    Posted 08-02-2020 08:08
    Variable selection methods, no matter what the stopping rule, have serious problems and should almost always be avoided.  Some of the problems are cataloged here: https://www.stata.com/support/faqs/statistics/stepwise-regression-problems/

    ------------------------------
    Frank Harrell
    Department of Biostatistics
    Vanderbilt University School of Medicine
    ------------------------------



  • 5.  RE: Minimizing p-values

    Posted 07-31-2020 14:12
    I don't know the answer to your question, but I would like to point out that this is not the same as a standard variable selection problem. In fact, there can be situations where choosing a "bad" set of covariates leads to the smallest p-value.
    I am curious what the motivation behind this problem is.

    ------------------------------
    Pratyaydipta Rudra
    Assistant Professor
    Oklahoma State University
    ------------------------------



  • 6.  RE: Minimizing p-values

    Posted 07-31-2020 14:28
    Fit the regression of the dependent variable on the first principal component of the independent variables.
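
    A minimal SAS sketch of that idea (the data set and variable names, work.mydata, y, z, and x1-x20, are placeholders):

        /* First principal component of the independent variables (Z and the Xs) */
        proc princomp data=work.mydata out=pcscores n=1;
           var z x1-x20;
        run;

        /* Regress the dependent variable on the component scores (Prin1) */
        proc reg data=pcscores;
           model y = Prin1;
        run;
        quit;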

    ------------------------------
    S.R.S. Rao Poduri
    Professor of Statistics
    University of Rochester
    Rochester, NY 14627
    ------------------------------



  • 7.  RE: Minimizing p-values

    Posted 08-03-2020 10:21
    The problem is that this produces a set of coefficient estimates which have not been selected for interpretability or causality.

    ------------------------------
    Barry DeCicco
    Sr. SAS Programmer
    Stratacent (onsite/remote for Volkswagen Credit, Inc.)
    ------------------------------



  • 8.  RE: Minimizing p-values

    Posted 07-31-2020 15:21
    Hi Terry,

    I would be wary of relying solely on p-values for an analytical technique like the one you suggest.  This is partly because of concerns about p-values in general, and it holds even if you consider supporting them by looking at effect sizes and so on.

    If you start by throwing all the possible X's into an initial regression, there could be interactions among the X's, proxy effects, and so on that bias the displayed p-values, not just overall but for the X's you've included.   For multiple regressions like this, it's often better to use an iterative and interactive process.   Before you eliminate a variable that looks 'non-significant', can you draw on your subject-matter knowledge to ask and check whether interactions may be at work?   Also, in practice, not all variables are equally easy to obtain accurate values for when it's time to actually apply your regression formula.  A rote technique might leave you with variables that don't themselves have directly observable values, which adds to the uncertainty of using the model.


    ------------------------------
    William (Bill) Goodman
    Professor (Retired) and Adjunct Professor, Faculty of Business and Information Technology
    Ontario Tech University
    ------------------------------



  • 9.  RE: Minimizing p-values

    Posted 07-31-2020 15:22
    Why do you want to do this?  It sounds like cooking the books to me.  Covariates should be chosen based on their relevance to the outcome.
    And given that we tend not to be into p-values these days, it is even weirder.
    The only plausible reason to do this would be that you are reporting to someone who wants to see p-values, and you want to convince them that even the most flattering model does not give a significant p-value for Z.

    ------------------------------
    Ellen Hertzmark
    ------------------------------



  • 10.  RE: Minimizing p-values

    Posted 07-31-2020 16:04
    Can you clarify what you have in mind when you refer to an "analytical means of choosing the subset"?  Does this mean without knowledge of the data?  All the variable selection techniques generally taught in elementary linear regression seem to have this objective in mind, so I am thinking you must mean something else by "analytical means".

    ------------------------------
    Raoul Burchette
    Biostatistician
    ------------------------------



  • 11.  RE: Minimizing p-values

    Posted 08-03-2020 07:41

    I would point out, as others have done, that you have multiplicity and overfitting problems. The p-value obtained should be adjusted for all the combinations that you looked at. With any sufficiently large number of variables, even in a completely random sample, we would expect to find some combinations that appear "significant" by chance alone. Repeating the exercise would yield a different combination with the lowest observed p-value each time.


    Minimizing involves an extremum, and extrema are particularly variable and vulnerable to overfitting. It is thus particularly likely that the "optimal" model found using one sample will not be reproducible and will not be the optimal model found using another.


    One way of addressing reproducibility in general is to divide your overall sample into training and validation samples and, after building the model on the training sample, use the validation sample to check whether the selected model is reproducible. Even if the repetition results in a low p-value, you may find it does not result in the lowest.
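
    For example, a simple 70/30 split can be drawn along these lines in SAS (the data set name work.mydata, the split fraction, and the seed are placeholders):

        /* Flag roughly 70% of the records as training (Selected = 1) */
        proc surveyselect data=work.mydata out=split samprate=0.7
                          outall seed=20200803;
        run;

        data training validation;
           set split;
           if Selected then output training;
           else output validation;
        run;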


    In general, models of this nature are hypothesis-generating, while p-values convey an impression of confirmation, of a quantum of evidence establishing a proposition. For this reason, it might be best not to use p-values for model building, as you may convey a false impression that your model has a level of reliability and reproducibility that it simply doesn't. If you are going to use p-values, at least adjust them for multiplicity.
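
    To give a rough, conservative sense of scale (a Bonferroni-style bound, assuming the search ranges over all possible subsets): with N candidate covariates there are 2^N subsets, so for N = 10 the smallest p-value found by the search would be multiplied by 2^10 = 1024, and an observed p = 0.01 would carry an adjusted bound of min(1, 1024 x 0.01) = 1.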



    ------------------------------
    Jonathan Siegel
    Director Clinical Statistics
    ------------------------------



  • 12.  RE: Minimizing p-values

    Posted 08-03-2020 12:45
    Terry, if you are serious about thinking this through, perhaps you could start with the prediction of y based on Z and one other variable x.  What about x, in relation to Z, would make Z a better predictor?  Clearly, if x is independent of Z, it is not going to help anything, so there will need to be some sort of dependence between x and Z.  Under what circumstances would this help Z?  Whether you use the p-value, reduction of variance, or some other measure, you will certainly want to take into account the number of covariates involved.  Next, if you look at a single covariate with Z, then at two covariates with the possibility that only one or neither would be chosen, what would be necessary to determine whether either or both of the two could help Z predict y better?  I think by the time you work through these two cases, you will be able to see what it would take for any finite number of possibilities.  If you found an algorithm for predicting which members of a supplementary set of covariates would maximize (in some sense) the predictive ability of Z, you would be able to move directly to the prediction solution without having to do any sort of sequential variable selection process.  It would surprise me, however, if this has not already been done; it is a fairly common problem.  Maybe the reference is hidden in some related topic, like instrumental variables in econometrics.  Of course, there can be other problems with purely data-driven variable selection methods, but when approaching something de novo (that is, lacking subject matter expertise), it seems a rather reasonable start.

    I also liked the principal components/factor analysis suggestion.  Mathematically, the idea is to project y onto the (non-orthogonal) space spanned by Z and the candidate covariates (x1, ..., xn) and find a representation of y that maximizes the contribution of Z.  One possible approach might be to take Z as the first basis vector and add orthogonalized combinations of the other variables to create an orthogonal (or even orthonormal) basis, then take the projection of y onto this coordinate system.  The tricky part is finding that projection; this is where all the statistical problems of estimation show up, because we only have realizations of y and not y (the function of Z and other things) itself.
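
    In symbols, a sketch of that construction (taking Z as the first basis vector as described above, and writing u_i for the orthogonalized basis vectors) is the Gram-Schmidt recursion

        u_1 = Z, \qquad u_{j+1} = x_j - \sum_{i=1}^{j} \frac{\langle x_j, u_i \rangle}{\langle u_i, u_i \rangle}\, u_i, \quad j = 1, \dots, n,

    after which the projection of y onto the span of (Z, x_1, \dots, x_n) is

        \hat{y} = \sum_{i=1}^{n+1} \frac{\langle y, u_i \rangle}{\langle u_i, u_i \rangle}\, u_i,

    and the first term, \frac{\langle y, Z \rangle}{\langle Z, Z \rangle}\, Z, is the part of that projection carried by Z under this particular ordering.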

    ------------------------------
    Raoul Burchette
    Biostatistician
    ------------------------------



  • 13.  RE: Minimizing p-values

    Posted 08-04-2020 13:24
    I have found that the best way to deal with multiple predictors is to go through the regression process step by step. At each step:
    1. Look to see if the new predictor adds value (in that it increases the predictive value),
    2. Check to see if it "makes sense" by confirming with a SME (Subject Matter Expert) that the coefficient is in the correct direction,
    3. Make sure it doesn't reverse the signs of coefficients on variables already in the equation.
    If all of these conditions are satisfied, include that variable and go on to the next one; if not, remove it.

    Stop when the predictive value stabilizes or decreases.

    This may not be foolproof and needs to be done using common sense and SME consultation, but it has served me well for 20 years in the credit and insurance risk industry.

    ------------------------------
    Michael Mout
    MIKS
    ------------------------------



  • 14.  RE: Minimizing p-values

    Posted 08-04-2020 13:37
    That strategy, which incorporates supervised learning, will ruin several aspects of later statistical inference.

    ------------------------------
    Frank Harrell
    Department of Biostatistics
    Vanderbilt University School of Medicine
    ------------------------------



  • 15.  RE: Minimizing p-values

    Posted 08-04-2020 14:53
    And I forgot one crucial last step.

    Validate using a holdout sample to insure against overfitting, and use an independent out-of-time sample to make sure the model works.

    ------------------------------
    Michael Mout
    MIKS
    ------------------------------



  • 16.  RE: Minimizing p-values

    Posted 08-04-2020 16:30
    Compared to resampling model validation, split sample validation can be quite unstable unless the sample size is huge, and if you use out-of-time samples you may miss a secular trend that could have been easily modeled with the entire sample.

    ------------------------------
    Frank Harrell
    Department of Biostatistics
    Vanderbilt University School of Medicine
    ------------------------------



  • 17.  RE: Minimizing p-values

    Posted 08-05-2020 11:24
    All of these recommendations are, of course, dependent on the environment in which one is modeling. I worked in the credit and insurance risk and marketing industries, where we had large samples (10,000 was a small sample) and the ability to continually monitor and validate performance to ensure stability across geography and time.

    Small samples and time variant environments, of course, are much more difficult challenges, but still require common sense in how the model is developed and implemented.

    ------------------------------
    Michael Mout
    MIKS
    ------------------------------