  • 1.  Types of variables to use in cluster analysis

    Posted 10-03-2012 09:39
    Dear All,

    Can anyone please suggest a good reference on the dos and don'ts of including different types of variables (categorical, continuous) when running a cluster analysis on a dataset?

    Also, is there a preference for continuous variables over categorical ones in cases where both are possible, for instance age vs. age groups or income vs. income groups?

    I would really appreciate any comments and suggestions.

    Thank you 
    Tasneem

    -------------------------------------------
    Tasneem Zaihra
    Post Doctoral Fellow
    McGill University
    -------------------------------------------


  • 2.  RE:Types of variables to use in cluster analysis

    Posted 10-03-2012 10:49
    I do not have a citation.  However, I have been using cluster analysis and related techniques since 1972.

    A lot depends on what you are trying to do.
    What are the cases (i.e., objects, entities) that you want to find clusters of?
    Are you trying to see if there is a tree structure behind the data? A network?
    Are you trying to find a single grouping (a new nominal level variable) that you want to use in further analysis?

    In general, I would lean toward NOT starting out with categorical data on the first few efforts unless the constructs the variables are supposed to measure are intrinsically categorical. I would not coarsen the measurement at the beginning. [One of my soapboxes is that one should always gather data at the highest level of measurement and the finest granularity that is practical under the circumstances. It can always be coarsened or aggregated later.]
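
    As a small illustration of coarsening later rather than at collection time, here is a minimal R sketch. The data frame, its values, and the cut points are all made up for the example and are not recommended bands.

```r
# Hypothetical patient data: ages and incomes are invented for illustration.
patients <- data.frame(
  age    = c(23, 37, 45, 52, 68, 71),
  income = c(18000, 42000, 55000, 61000, 30000, 27000)
)

# Keep the continuous column and derive a grouped version only when needed.
# The break points are arbitrary assumptions for this sketch.
patients$age_group <- cut(patients$age,
                          breaks = c(0, 40, 65, Inf),
                          labels = c("under 40", "40-64", "65+"),
                          right  = FALSE)  # intervals are [0,40), [40,65), [65,Inf)

table(patients$age_group)
```

    The original age column stays in the data, so different groupings can be tried later. Going the other way, from groups back to exact ages, is not possible.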

    Cluster analysis is a family of heuristic exploratory techniques that use different algorithms for combining cases (or partitioning cases) using different measures of similarity (distance) between cases. Since 1974 I have made it a practice to use agreement among several algorithm-measure combinations to find a usable set of clusters.
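
    One possible way to check agreement among several algorithm-measure combinations is sketched below in base R. This is only an illustration on simulated data, not necessarily the procedure described above. The data, the choice of k = 2, and the helper rand_index() are assumptions introduced for the example.

```r
set.seed(1)
# Simulated data (an assumption for illustration): two well-separated groups
# of 20 cases each, measured on two variables.
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 6), ncol = 2))

# Three algorithm/measure combinations, each giving k = 2 clusters.
k <- 2
sol_ward    <- cutree(hclust(dist(x, method = "euclidean"), method = "ward.D2"), k)
sol_average <- cutree(hclust(dist(x, method = "manhattan"), method = "average"), k)
sol_kmeans  <- kmeans(x, centers = k, nstart = 20)$cluster

# Hypothetical helper: Rand index, the share of case pairs on which two
# partitions agree (both together or both apart). 1 means identical groupings,
# regardless of how the clusters happen to be labelled.
rand_index <- function(a, b) {
  same_a <- outer(a, a, "==")
  same_b <- outer(b, b, "==")
  pairs  <- upper.tri(same_a)
  mean(same_a[pairs] == same_b[pairs])
}

rand_index(sol_ward, sol_average)
rand_index(sol_ward, sol_kmeans)
table(ward = sol_ward, kmeans = sol_kmeans)  # cross-tabulation shows any relabelling
```

    High agreement across combinations suggests the grouping is not an artifact of one particular algorithm or distance. Low agreement is a signal to look more closely before using the clusters in further analysis.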

    HTH

    If you give more detail, list members may be able to shed more light on the topic.


    -------------------------------------------
    Arthur Kendall
    Social Research Consultants
    -------------------------------------------


  • 3.  RE:Types of variables to use in cluster analysis

    Posted 10-03-2012 10:55
    I suggest you take a look at this book:

    L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons, 1990.



    -------------------------------------------
    Daniel Jeske
    Professor and Chair
    Department of Statistics
    University of California - Riverside
    -------------------------------------------


  • 4.  RE:Types of variables to use in cluster analysis

    Posted 10-03-2012 11:00
    Hi Arthur and Daniel

    Thanks for your responses to my post; they are very informative. I really appreciate it.

    I am trying to group patients with a chronic disease into clusters based on their demographics (such as gender, socioeconomic status, age, etc.) as well as other factors such as the presence of comorbid conditions.

    Thanks
    Tasneem


    -------------------------------------------
    Tasneem Zaihra
    Post Doctoral Fellow
    McGill University
    -------------------------------------------


  • 5.  RE:Types of variables to use in cluster analysis

    Posted 10-03-2012 11:37
    Hi Tasneem:

    I've conducted cluster analyses with both categorical and continuous data. The Kaufman and Rousseeuw book is a classic. The book by Everitt, Landau and Leese: Cluster Analysis is a very good introductory book that you may also wish to consider. Seber's Multivariate Observations also contains substantive chapters on cluster analysis with sufficient mathematical detail to help you understand the process.

    I also find the suite of tools in R to be quite good, and the graphics produced are far better than most other software packages that I have used. Here is a list of some of the packages available for R:

    library(cluster)  # Algorithms from Kaufman and Rousseeuw
    library(e1071)    # Latent class models
    library(gclus)    # Auxiliary tools for graphing and ordering hierarchical cluster solutions
    library(mclust)   # Model-based clustering algorithms
    library(mva)      # Additional hierarchical clustering algorithms (merged into the base stats package as of R 1.9.0)
    library(multiv)   # More hierarchical clustering algorithms
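
    For data that mixes continuous and categorical variables, one option in the cluster package is a Gower dissimilarity followed by partitioning around medoids. The sketch below is a minimal example under stated assumptions: the patient records, the variable names, and k = 2 are all invented for illustration.

```r
library(cluster)  # recommended package, included with standard R distributions

# Hypothetical patient records; every value is made up for illustration.
patients <- data.frame(
  age         = c(34, 38, 41, 67, 72, 75),
  sex         = factor(c("F", "M", "F", "M", "F", "M")),
  ses         = factor(c("low", "low", "mid", "high", "high", "mid"),
                       levels = c("low", "mid", "high"), ordered = TRUE),
  comorbidity = factor(c("no", "no", "no", "yes", "yes", "yes"))
)

# Gower dissimilarity combines numeric, nominal, and ordered columns.
d <- daisy(patients, metric = "gower")

# Partitioning around medoids on the dissimilarities; k = 2 is arbitrary here.
fit <- pam(d, k = 2, diss = TRUE)

fit$clustering         # cluster membership for each patient
fit$silinfo$avg.width  # average silhouette width, a rough check of separation
```

    With real data it would be worth trying several values of k and comparing the result against other methods before settling on a grouping.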

    Cheers,

    -------------------------------------------
    John Cornell
    Professor
    University of Texas Health Science Center
    -------------------------------------------


  • 6.  RE:Types of variables to use in cluster analysis

    Posted 10-03-2012 11:49
    Thanks a lot, John. I will go through these references and look at the R libraries. Your information is very helpful and much appreciated.
    Best Regards,
    Tasneem

    -------------------------------------------
    Tasneem Zaihra
    Post Doctoral Fellow
    McGill University
    -------------------------------------------