Nice work. The software package looks interesting and useful, although I am wondering why you chose stem and leaf plots as the lead-in.
Although they still seem to appear in many elementary textbooks, I no longer teach stem and leaf plots in such courses, on the grounds that Tukey's original intention was just to provide a simple way to construct a histogram when doing pencil and paper data analysis. As an aside, note that in his elegant book on EDA (Exploratory Data Analysis) he also had some clever ideas for applying smoothing methods using just paper and pencil.
But since computational methods long ago completely replaced pencil and paper analyses, I don't view the stem and leaf plot as worth class time, as histograms are so easy to obtain using anybody's software. I wonder if folks who knew Tukey better than I have comments on how he would view all this?
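To make the pencil and paper point concrete, here is a minimal sketch in plain Python of what the display does. It is only an illustration under my own assumptions: the function names and the sample data are made up, it handles non-negative integers with a fixed stem width of 10, and it is not how stemgraphic itself works.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Group non-negative integers by tens digit (stem), keeping the units digit (leaf)."""
    rows = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)  # e.g. 23 -> stem 2, leaf 3
        rows[stem].append(leaf)
    return dict(rows)


def render(rows):
    """Draw one text line per stem, including empty stems, so row lengths act as bars."""
    lines = []
    for stem in range(min(rows), max(rows) + 1):
        leaves = "".join(str(d) for d in rows.get(stem, []))
        lines.append(f"{stem:>2} | {leaves}")
    return "\n".join(lines)


# Illustrative data only.
data = [12, 15, 21, 23, 23, 28, 31, 34, 47]
rows = stem_and_leaf(data)
print(render(rows))

# Dropping the leaves leaves exactly the bin counts of a histogram with bin width 10.
counts = {stem: len(leaves) for stem, leaves in rows.items()}
print(counts)
```

The length of each row is the histogram bar for that bin; the only extra information the stem and leaf display carries is the individual trailing digits.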
Back to the main point: am I missing something as to why it is the lead-in for this software package? Is there something about NLP (Natural Language Processing) applications that makes the stem and leaf idea worth keeping around (instead of just showing a simple histogram)?
Best,
Steve
------Original Message------
I'm pretty sure most here are familiar with stem-and-leaf plots (if not, check out the stemgraphic companion brochure).
What you might not know is that there is an open source EDA toolkit, first released in 2016, which has full stem-and-leaf support. It is stemgraphic (http://stemgraphic.org). It includes a command line tool that can be used to analyze the distribution of data. It is also a very easy to use Python package for making graphical stem-and-leaf plots, usable in other programs or in a Jupyter notebook environment.

It scales numerical stem-and-leaf plots with support for large computing clusters (tackling billions of data points without problem; see the PyData 2016 video). Beyond the original Tukey stem-and-leaf plots with numerical values, as of version 0.5.0 (current is 0.5.3) it is able to handle categorical data or even text (for NLP, language analysis, etc.).
Additionally, the stemgraphic package includes support for stem-and-leaf heatmaps, for comparing multiple heatmaps, for radar plots (Levenshtein distance), for stem-and-leaf and word counts as bar or donut charts, and for 2D and 3D scatter plots to compare multiple text sources.
Documentation is available online and as a PDF.
Source code is on GitHub (feel free to star the project if you find it interesting) along with example notebooks.
I'd love to get feedback, comments, suggestions, requests for enhancements, etc.
Thank you kindly,
Francois
------------------------------
Francois Dion
Chief Data Scientist
Dion Research LLC
------------------------------