ASA Connect

  • 1.  FINDING THE NUMBER OF CLUSTERS BEFORE CLUSTERING

    Posted 07-12-2022 17:55
    In the wake of my eightieth birthday, I pushed forward a little the solution of an old obsession of mine: substantive clustering, that is, finding clusters only where they naturally exist in the data body. The problem started boiling in my mind when I was very young (see attachment 1), because all Euclidean-space-based algorithms are amenable to a mechanical interpretation: the moment of inertia of the data body. The moment of inertia leads you to a "mechanics-style" framing of the problem. For many years I tried to find substantive clusters by searching for potential wells in a gravitational field dictated by the data themselves, as if the points were asteroids. I have already communicated something about this to this community. However, I did not fully succeed in my search because of "black holes": points that are too close together exert an unbounded gravitational force on each other. The current state of my research shows that, by using a weighted convolution of the sample's density with the Laplace distribution, one can obtain a kind of gravitational field without singularities, the infamous black holes. You will find the details in (2).
    I hope somebody will continue my work, because I have no hope of finding a better solution; for me, the problem is closed. I am now addressing a different problem. Following Lancaster, I want to provide users with an improved Conjoint Analysis model in which Price is NOT a conjoint factor. My experience in some hundreds of cases so far shows that products appear inelastic when price is treated as a conjoint factor.
    I am going to publish this result within the SAS Community and within some LinkedIn groups as well.
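    As an illustration of the singularity-free field idea (a sketch only: the exact weighting and kernel are those of paper (2), while `laplace_potential`, the scale `b`, and the uniform weights below are illustrative assumptions), convolving the empirical density with a Laplace kernel yields a potential that stays bounded even when the evaluation point coincides with a sample point:

    ```python
    import numpy as np

    def laplace_potential(x, points, b=1.0, weights=None):
        """Bounded 'gravitational' potential at x: the empirical density of
        `points` convolved with a Laplace kernel of scale b.  Unlike the
        Newtonian -1/r potential, it stays finite as x approaches a sample
        point, so there are no 'black holes'."""
        if weights is None:
            weights = np.ones(len(points))
        d = np.linalg.norm(points - x, axis=1)    # distances to all sample points
        return -np.sum(weights * np.exp(-d / b))  # finite even where d == 0

    rng = np.random.default_rng(0)
    pts = rng.normal(size=(100, 2))
    print(laplace_potential(pts[0], pts, b=0.5))  # finite even at a sample point
    ```

    Note that evaluating the potential exactly at a sample point contributes only the bounded term -w_i from that point, however close its neighbours are.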

    ------------------------------
    [Ulderico] [Santarelli]
    [Las Vegas][Nevada]
    ------------------------------


  • 2.  RE: FINDING THE NUMBER OF CLUSTERS BEFORE CLUSTERING

    Posted 07-17-2022 11:14
    Congratulations on your birthday, and many more!

    The issue of the number of clusters is an old and lively discussion, and I would like your feedback on the following process. Use hierarchical agglomerative clustering (HAC); I prefer Ward's method. Examining the dendrogram, start at the root of the tree and compute the value of R^2 as each cluster is divided, one cluster at a time, recording the amount of R^2 added at each step. As long as this added R^2 remains significant (p-value less than a significance threshold chosen in advance), accept the new cluster. Stop when the next division no longer adds a significant R^2. This approach is similar in motivation to the pseudo-F statistic of Calinski and Harabasz.
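    A minimal sketch of this stopping rule, assuming SciPy's Ward linkage; the fixed `min_gain` threshold below is a stand-in for the significance test, whose exact form is left open above:

    ```python
    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    def r_squared(X, labels):
        """Between-cluster sum of squares over total sum of squares."""
        grand = X.mean(axis=0)
        total = ((X - grand) ** 2).sum()
        between = sum(
            np.sum(labels == k) * ((X[labels == k].mean(axis=0) - grand) ** 2).sum()
            for k in np.unique(labels)
        )
        return between / total

    def choose_k(X, max_k=10, min_gain=0.05):
        """Walk down a Ward dendrogram from the root, adding one cluster at a
        time; stop when the added R^2 drops below `min_gain` (a crude proxy
        for the significance test described above)."""
        Z = linkage(X, method="ward")
        r2_prev = 0.0
        for k in range(2, max_k + 1):
            labels = fcluster(Z, t=k, criterion="maxclust")
            r2 = r_squared(X, labels)
            if r2 - r2_prev < min_gain:
                return k - 1
            r2_prev = r2
        return max_k

    # three well-separated blobs -> the rule should stop at 3
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(c, 0.3, size=(50, 2)) for c in (0.0, 5.0, 10.0)])
    print(choose_k(X))  # -> 3
    ```

    Replacing `min_gain` with a proper F-test on the incremental R^2 recovers the pseudo-F flavour of Calinski and Harabasz.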

    ------------------------------
    Hal Switkay
    United States
    ------------------------------



  • 3.  RE: FINDING THE NUMBER OF CLUSTERS BEFORE CLUSTERING

    Posted 07-17-2022 14:20
    Thank you for your wishes and greetings. Eighty years are a gift of God. I find your method very interesting, though it falls among a number of other methods that share the problem of "sample dependence". Because each sample differs from every other, any sequential method you apply will give different results. The size of this difference is driven by the sample variance, which enters through the inter-point distances that steer the sequential aggregation; that variance is too large to protect the sample invariance of the solution.

    Therefore, one should look for something that digests randomness. For instance, in the estimation of the mean, all points in the sample contribute simultaneously to the estimate, and you get a remarkable reduction in randomness: the sample mean gains a factor of sqrt(n) in its standard error. The variance follows the same path: all points contribute simultaneously to its estimate. Sequential methods, by contrast, trigger a decision tree whose solution accumulates the effects of step-by-step random choices. The choices are random because the sample points are random, so whether any criterion you choose is satisfied is also random. All clustering methods, including non-hierarchical ones, suffer from this inconvenience.

    In my paper, I follow a different approach, which searches for stable points (points at the bottom of potential wells) whose estimated locations reflect the contribution of all the other points. All points concur in locating the potential wells of a force field generated by a modified Newton law. If one changes the sample, almost all the points will be different, yet the locations of the potential wells will be stable enough that you can take them as something substantive: like the sample mean, a function of the sample indeed, but with much less variance than the sample points themselves.
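    The sqrt(n) gain invoked above is easy to check empirically (a small simulation for illustration, not part of the poster's method):

    ```python
    import numpy as np

    # Empirical check of the sqrt(n) gain: the sample mean, to which all
    # points contribute simultaneously, fluctuates far less across samples
    # than the individual points do.
    rng = np.random.default_rng(42)
    sigma, n, reps = 1.0, 400, 2000

    means = rng.normal(0.0, sigma, size=(reps, n)).mean(axis=1)
    se_observed = means.std()          # spread of the mean across samples
    se_theory = sigma / np.sqrt(n)     # = 0.05 here

    print(se_observed, se_theory)
    ```

    The observed spread of the mean matches sigma/sqrt(n) closely, while each individual point still fluctuates with standard deviation sigma.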


  • 4.  RE: FINDING THE NUMBER OF CLUSTERS BEFORE CLUSTERING

    Posted 07-17-2022 14:46
    Thank you for your reply. It seems that your approach may be more computationally intensive. Does it run in polynomial time?

    ------------------------------
    Hal Switkay
    United States
    ------------------------------



  • 5.  RE: FINDING THE NUMBER OF CLUSTERS BEFORE CLUSTERING

    Posted 07-17-2022 16:41
    Dear Hal, computing the force field is a quadratic operation, because you need the pairwise inter-distances between points. Searching for potential wells is equivalent to searching for the minima of abs(T(w)). This search can be done coordinate-wise: if there are p coordinates and the sample size is N, the problem is of order Np. Aggregating the points that fall into the same potential well is straightforward. The only arbitrariness lies in the aggregation radius: taking it too small will split some potential wells, but you can then aggregate again, and the pseudo chi-square test helps you distinguish random aggregations from substantive ones. The straight aggregation is also straightforward, and it avoids any center drift that would bring back a sequentiality problem (drift comes from aggregating sequentially, so you would return to the decision-tree detours under sample substitution).
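    A toy sketch of this pipeline (hypothetical throughout: T(w) is not specified in the thread, so a bounded Laplace-smoothed potential stands in for it, and `b`, `step`, and `radius` are illustrative choices, not the paper's). Each potential evaluation touches all N points, so a coordinate-wise descent costs O(Np) per sweep:

    ```python
    import numpy as np

    def potential(x, pts, b=0.7):
        """Bounded Laplace-smoothed potential; one evaluation is O(N p)."""
        return -np.exp(-np.linalg.norm(pts - x, axis=1) / b).sum()

    def find_well(x0, pts, b=0.7, step=0.1, sweeps=100):
        """Coordinate-wise descent: adjust one coordinate at a time, keep
        any move that lowers the potential, halve the step when stuck."""
        x = x0.astype(float).copy()
        for _ in range(sweeps):
            improved = False
            for j in range(len(x)):            # p coordinates, one at a time
                for d in (+step, -step):
                    trial = x.copy()
                    trial[j] += d
                    if potential(trial, pts, b) < potential(x, pts, b):
                        x, improved = trial, True
            if not improved:
                step *= 0.5
                if step < 1e-4:
                    break
        return x

    def cluster_by_wells(pts, b=0.7, radius=1.0):
        """Send every point downhill to its well, then aggregate wells that
        fall within `radius` of each other (the one arbitrary choice)."""
        wells = np.array([find_well(p, pts, b) for p in pts])
        labels = -np.ones(len(pts), dtype=int)
        centers = []
        for i, w in enumerate(wells):
            for k, c in enumerate(centers):
                if np.linalg.norm(w - c) < radius:
                    labels[i] = k
                    break
            else:
                centers.append(w)
                labels[i] = len(centers) - 1
        return labels

    rng = np.random.default_rng(3)
    X = np.vstack([rng.normal(c, 0.3, size=(40, 2)) for c in (0.0, 6.0)])
    labels = cluster_by_wells(X)
    print(len(set(labels)))  # two wells for two well-separated groups
    ```

    Because every point's descent uses the field generated by all N points simultaneously, there is no sequential decision tree: re-running on a fresh sample moves the individual points but leaves the well locations nearly unchanged.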