ASA Connect

  • 1.  in search of the number of clusters

    Posted 08-12-2021 12:55
    It is possible to estimate the number of clusters before undertaking the clustering process. See attachment.

    ------------------------------
    [Ulderico] [Santarelli]
    [Las Vegas][Nevada]
    ------------------------------


  • 2.  RE: in search of the number of clusters

    Posted 08-13-2021 14:22
    Check out the Gap Statistic:  https://stat.ethz.ch/R-manual/R-devel/library/cluster/html/clusGap.html

    Tibshirani, R., Walther, G. and Hastie, T. (2001). Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society B, 63, 411–423.

    ------------------------------
    Albyn Jones
    Professor of Statistics, Emeritus
    Reed College
    ------------------------------



  • 3.  RE: in search of the number of clusters

    Posted 08-13-2021 19:10
    Thank you, Albyn.
    Tibshirani, Walther and Hastie underline in their paper the difficulty of the problem. Their approach is quite different from my own. Basically, it stays within the deviance-minimization framework: it searches for an elbow in the decrease of the deviance as k, the number of clusters, grows. It resembles an Akaike criterion in regression diagnostics, where the trade-off is between the improvement in the likelihood ratio and the number of regressors.
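    For readers who want to see the mechanics, here is a minimal Python sketch of the gap-statistic idea, not the clusGap implementation. It compares log(W_k), the log within-cluster sum of squares from k-means, with its average over reference data drawn uniformly from the bounding box of the sample, and picks the smallest k with Gap(k) >= Gap(k+1) - s_{k+1}. The function names, the farthest-point initialization, the toy data and all parameter values are assumptions made for illustration.

```python
import math
import random


def init_centers(points, k, rng):
    """Farthest-point initialization (an illustrative choice, not part of the gap statistic)."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        )
    return centers


def kmeans_wss(points, k, rng, n_init=3, n_iter=50):
    """Smallest within-cluster sum of squares found over a few Lloyd runs."""
    best = float("inf")
    for _ in range(n_init):
        centers = init_centers(points, k, rng)
        for _ in range(n_iter):
            # Assignment step: each point joins its nearest center.
            groups = [[] for _ in range(k)]
            for p in points:
                nearest = min(range(k), key=lambda j: math.dist(p, centers[j]))
                groups[nearest].append(p)
            # Update step: each center moves to the mean of its group.
            updated = [
                tuple(sum(col) / len(g) for col in zip(*g)) if g else centers[j]
                for j, g in enumerate(groups)
            ]
            if updated == centers:
                break
            centers = updated
        wss = sum(min(math.dist(p, c) for c in centers) ** 2 for p in points)
        best = min(best, wss)
    return best


def gap_statistic(points, k_max=5, n_ref=10, seed=0):
    """Return (chosen k, list of Gap(k) for k = 1..k_max)."""
    rng = random.Random(seed)
    lows = [min(col) for col in zip(*points)]
    highs = [max(col) for col in zip(*points)]
    gaps, errs = [], []
    for k in range(1, k_max + 1):
        log_w = math.log(kmeans_wss(points, k, rng))
        ref_logs = []
        for _ in range(n_ref):
            # Reference data: uniform over the bounding box of the sample.
            ref = [
                tuple(rng.uniform(lo, hi) for lo, hi in zip(lows, highs))
                for _ in points
            ]
            ref_logs.append(math.log(kmeans_wss(ref, k, rng)))
        mean_ref = sum(ref_logs) / n_ref
        sd = math.sqrt(sum((r - mean_ref) ** 2 for r in ref_logs) / n_ref)
        gaps.append(mean_ref - log_w)
        errs.append(sd * math.sqrt(1 + 1 / n_ref))
    # Smallest k such that Gap(k) >= Gap(k+1) - s_{k+1}.
    for k in range(1, k_max):
        if gaps[k - 1] >= gaps[k] - errs[k]:
            return k, gaps
    return k_max, gaps


# Toy data (hypothetical): three well-separated Gaussian blobs in the plane.
data_rng = random.Random(1)
data = [
    (data_rng.gauss(cx, 0.5), data_rng.gauss(cy, 0.5))
    for cx, cy in [(0.0, 0.0), (10.0, 0.0), (5.0, 9.0)]
    for _ in range(30)
]
best_k, gaps = gap_statistic(data)
print("chosen k:", best_k)
print("gap values:", [round(g, 2) for g in gaps])
```

    Note that k is still explored sequentially here (k = 1, 2, ...), which is the feature of this family of methods discussed in this post.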
    My approach is quite different. It uses a change of variables, which I will call the Newton Transformation, that endows each sample point with an index: the "nullity" of the coordinate-wise forces. The points where the index is at its minimum are "central", because the forces from the surrounding points compensate one another. In addition, the method is not sequential but global and parallel.
    In problems that demand a "Substantive Clustering" solution, such as gene expression data, I think the following two issues should be of high concern:
    1. sample independence of the solution (replicability)
    2. shortcomings of a sequential approach.
    Because points differ from sample to sample, going sequential makes the sample independence questionable. By contrast, "central points" with almost null coordinate-wise forces remain stable from sample to sample because they are averages of the surrounding points. Although the points differ, their means are stable.
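    The last claim, that averages are more stable than individual points, can be checked with a small simulation. The sketch below does not implement the Newton Transformation. It only draws repeated samples from a standard normal distribution (an assumed toy setting) and compares how much a single point and the sample mean vary from sample to sample.

```python
import random
import statistics

rng = random.Random(0)
n_samples, n_points = 200, 25  # hypothetical sizes chosen for illustration

first_points, sample_means = [], []
for _ in range(n_samples):
    sample = [rng.gauss(0.0, 1.0) for _ in range(n_points)]
    first_points.append(sample[0])          # one individual point per sample
    sample_means.append(statistics.fmean(sample))  # the average of that sample

# Sample-to-sample spread of a single point versus the spread of the mean.
spread_points = statistics.pstdev(first_points)
spread_means = statistics.pstdev(sample_means)
print("spread of individual points:", round(spread_points, 3))
print("spread of sample means:", round(spread_means, 3))
```

    With 25 points per sample, the spread of the means should be about one fifth of the spread of the individual points, since the standard deviation of a mean shrinks as 1/sqrt(n).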

    ------------------------------
    [Ulderico] [Santarelli]
    [Las Vegas][Nevada]
    ------------------------------