ASA Connect

  • 1.  Tool for stem-and-leaf plots: stemgraphic

    Posted 03-05-2018 16:38
    I'm pretty sure most here are familiar with stem-and-leaf plots (if not, check out the stemgraphic companion brochure).

    What you might not know is that there is an open source EDA toolkit (first released in 2016) with full stem-and-leaf support: stemgraphic (http://stemgraphic.org). It includes a command line tool that can be used to analyze the distribution of data. It is also a very easy to use Python package for making graphical stem-and-leaf plots, usable in other programs or in a Jupyter notebook environment.

    [Image: distribution plot]

    It scales numerical stem-and-leaf plots, with support for large computing clusters (tackling billions of data points without problem; see the PyData 2016 video). Beyond the original Tukey stem-and-leaf plots with numerical values, as of version 0.5.0 (current is 0.5.3) it can handle categorical data or even text (for NLP, language analysis, etc.).

    Additionally, the stemgraphic package includes support for stem-and-leaf heatmaps, for comparing multiple heatmaps, for radar plots (Levenshtein distance), for stem-and-leaf and word counts as bar or donut charts, and for 2D and 3D scatter plots to compare multiple text sources.

    Documentation is available online and as a pdf.

    Source code is on GitHub (feel free to star the project if you find it interesting), along with example notebooks.

    I'd love to get feedback, comments, suggestions, requests for enhancements etc.

    Thank you kindly,
    Francois

    ------------------------------
    Francois Dion
    Chief Data Scientist
    Dion Research LLC
    ------------------------------


  • 2.  RE: Tool for stem-and-leaf plots: stemgraphic

    Posted 03-06-2018 11:46

    Nice work.  The software package looks interesting and useful, although I am wondering why you chose stem and leaf plots as the lead in?
        Although it still seems to appear in many elementary textbooks, I no longer teach stem and leaf plots in such courses, on the grounds that Tukey's original intention was just to provide a simple way to construct a histogram when doing pencil and paper data analysis.  As an aside, note that in his elegant book on EDA (Exploratory Data Analysis) he also had some clever ideas for applying smoothing methods using just paper and pencil.
        But since computational methods long ago completely replaced pencil and paper analyses, I don't view the stem and leaf plot as worth class time, as histograms are so easy to obtain using anybody's software.  I wonder if folks who knew Tukey better than I have comments on how he would view all this?
        Back to the main point: Am I missing something as to why it is the lead in for this software package?  Is there something about NLP (Natural Language Processing) applications, which makes the stem and leaf idea worth keeping around (instead of just showing a simple histogram)?
    Best,
    Steve
     






  • 3.  RE: Tool for stem-and-leaf plots: stemgraphic

    Posted 03-06-2018 12:56
    I agree that density plots or histograms seem more current than the stem-and-leaf plot, but I still think stem-and-leaf plots are interesting, partly because computers do break.

    Incidentally, if you use R, there's a function you can use in base:
    > data(iris)
    > stem(iris$Petal.Length)

    The decimal point is at the |

    1 | 012233333334444444444444
    1 | 55555555555556666666777799
    2 |
    2 |
    3 | 033
    3 | 55678999
    4 | 000001112222334444
    4 | 5555555566677777888899999
    5 | 000011111111223344
    5 | 55566666677788899
    6 | 0011134
    6 | 6779

    >


    ------------------------------
    Edward Cashin
    Research Scientist II
    ------------------------------



  • 4.  RE: Tool for stem-and-leaf plots: stemgraphic

    Posted 03-09-2018 11:17
    Edited by Francois Dion 03-09-2018 11:21

    Incidentally, if you use R, there's a function you can use in base:
    Edward Cashin,  03-06-2018 12:56
    Edward,

    Thanks for the comment. I have used stem() some. The issue with it is that once you pass about 300 or so values (depending on the distribution), you start to get truncated rows. Something similar to:

    1 | 000000000000000000000000000000000000000000000000000000000000000000000000+10238

    That was why I wrote stem_graphic, so I could look at both the overall distribution and the detail of very large data sets (micro/macro design in Tufte speak, coarse/fine in Tukey speak, tree/leaves in Bertin speak).
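
    To make the truncation concrete, here is a minimal pure-Python sketch of a text stem-and-leaf display. It is neither stemgraphic's implementation nor R's stem() algorithm; the function name stem_and_leaf and the max_leaves parameter are made up for illustration, and it assumes non-negative integers with the last digit as the leaf. Rows longer than max_leaves are cut and the number of hidden leaves is appended, similar to the +10238 above.

```python
from collections import defaultdict


def stem_and_leaf(values, max_leaves=None):
    """Return text rows 'stem | leaves' for non-negative integers.

    Hypothetical helper for illustration only: the last digit of each
    value is its leaf, the remaining digits are its stem.
    """
    rows = defaultdict(list)
    for value in values:
        stem, leaf = divmod(int(value), 10)
        rows[stem].append(leaf)
    if not rows:
        return []

    lines = []
    # Walk every stem in range so that empty stems still get a row.
    for stem in range(min(rows), max(rows) + 1):
        leaves = "".join(str(d) for d in sorted(rows.get(stem, [])))
        if max_leaves is not None and len(leaves) > max_leaves:
            # Keep the first max_leaves leaves and report how many are hidden.
            hidden = len(leaves) - max_leaves
            leaves = f"{leaves[:max_leaves]}+{hidden}"
        lines.append(f"{stem} | {leaves}".rstrip())
    return lines


if __name__ == "__main__":
    for row in stem_and_leaf([12, 15, 21, 34, 38]):
        print(row)
```

    For example, stem_and_leaf([10] * 5, max_leaves=3) gives a single row, 1 | 000+2. The detail in the hidden leaves is lost in a text display like this, which is the limitation described above.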

    I currently don't have any customers running R only, they usually have Python also. And I have a pretty good backlog already, but I will at some point have a version for R.

    Having said that, you can use it from RStudio through the reticulate package (on GitHub). After installing reticulate, and with a valid Python 3 installation (I recommend anaconda.org) and stemgraphic installed with pip install stemgraphic, you can do something like this from the R prompt:

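    library(reticulate)
    library(png)

    # Load the data in R and pick the column to plot
    somedata <- read.csv('file.csv')
    var_of_interest <- somedata$column_name

    # Start an interactive Python session; R objects are reachable as r.<name>
    py_repl()
    from stemgraphic.num import stem_graphic
    fig, ax = stem_graphic(r.var_of_interest)
    fig.savefig('plot.png')
    exit

    # Back in R: read the saved image and draw it
    grid::grid.raster(readPNG('plot.png'))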

    Let me know if you try it out and how it works out for you.

    Thank you

    ------------------------------
    Francois Dion
    Chief Data Scientist
    Dion Research LLC
    ------------------------------



  • 5.  RE: Tool for stem-and-leaf plots: stemgraphic

    Posted 03-06-2018 15:05
    Edited by Francois Dion 03-06-2018 15:08
    Good question, why use a stem-and-leaf plot when we have automated histograms?

    A bit of background. In 2015, I decided to read (not just skim) EDA (Tukey, 1977). One thing I found in his book is that I had a complete understanding of the data he was using. I could see the data. Not surprising, because Tukey was a fan of seeing the "coarse and the fine" (or, as Jacques Bertin used to say, seeing the tree and the leaves of the tree). That was the beauty of the stem-and-leaf plot (even more than the ability to do it by hand).

    Sure, Tukey's data sets were all tiny. But perhaps I could make this work with large data. At the time, I worked with a lot of data that was at a minimum in the hundreds of millions of rows, even when reduced to weekly sets. No way to do anything by hand, obviously. But histograms were often uninformative on data sets at that scale. So I had to try something else.

    And that is why I started with the classic stem-and-leaf. Everything else in the package, from heatmaps to visualizers for NLP, also relies on the very flexible approach of splitting numbers or words into stems and leaves (except the plain word frequency plot, obviously).
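
    As a rough illustration of splitting words into stems and leaves, here is a small pure-Python sketch. It is an assumption about the general idea, not how stemgraphic does it: the function name word_stem_and_leaf is hypothetical, the first letter is taken as the stem and the second letter as the leaf, and words shorter than two letters are skipped.

```python
from collections import defaultdict


def word_stem_and_leaf(words):
    """Return text rows 'stem | leaves' for a list of words.

    Hypothetical helper for illustration only: stem = first letter,
    leaf = second letter, both lowercased.
    """
    rows = defaultdict(list)
    for word in words:
        w = word.lower()
        # Skip anything that does not start with two letters.
        if len(w) >= 2 and w[:2].isalpha():
            rows[w[0]].append(w[1])
    return [f"{stem} | {''.join(sorted(rows[stem]))}" for stem in sorted(rows)]


if __name__ == "__main__":
    sample = ["stem", "leaf", "leaves", "plot", "Tukey", "tree"]
    for row in word_stem_and_leaf(sample):
        print(row)
```

    With the sample list above, the row for t shows the leaves r and u, so you can see at a glance both how many words start with t and which second letters follow it.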

    ------------------------------
    Francois Dion
    Chief Data Scientist
    Dion Research LLC
    ------------------------------