Gold has been the standard precious commodity in financial futures markets for decades, but diamonds may soon give it a run for its money. Martin Rapaport, a diamond exchange executive, is hoping for a late 2016 to early 2017 launch of a futures market in diamonds (see http://www.marketwatch.com/story/forget-gold-diamonds-may-be-the-next-big-thing-in-the-futures-market-2015-08-04?siteid=bigcharts&dist=bigcharts).

Gold pricing is straightforward. Although there are slight variations in purity (some coins are only 91.7% pure, or 22 karat, while bars and other bullion coins are 99.99% pure, or 24 karat), all prices are based on a single, constantly changing theoretical price called the spot price.
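
The purity figures follow from the karat scale, on which 24 karat is nominally pure gold. Here is a minimal sketch of that arithmetic in R. The function names `karat_to_purity` and `gold_value` and the spot price of 1000 are made up for illustration and are not a market quote.

```r
# Purity as a fraction of pure gold: karat / 24 (24 karat is nominally pure).
karat_to_purity <- function(karat) karat / 24

# Value of a gold item: weight times purity times spot price per unit weight.
# spot_price is an illustrative number, not a real quote.
gold_value <- function(weight, karat, spot_price) {
  weight * karat_to_purity(karat) * spot_price
}

round(100 * karat_to_purity(22), 1)                  # 91.7 (percent)
gold_value(10, 24, 1000) / gold_value(1, 24, 1000)   # 10: value is proportional to weight
```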

But diamonds are different. With gold, if you buy a bar 10 times as heavy as another, you’ll pay about 10 times as much. With diamonds, it’s more complicated. Size (carat weight) is important, but as anyone who has ever thought about buying a diamond knows, it is only one of the four C’s: Carat, Cut, Color and Clarity.

The relationship of Price with each is fairly obvious:

  1. Big diamonds are more expensive (but not necessarily linearly)

  2. Colorless diamonds are more expensive (D, E and F are colorless, followed by G through K)

  3. Clarity matters too (Internally Flawless down to Included)

  4. Cut matters as well (Ideal, Very Good, Good, Fair, Poor)

But which matters most? What are the tradeoffs?


This Study

We have data on 2,690 diamonds that were “scraped” off the web by Lou Valente of JMP. We want to explore the relationship between each of the four C’s and the price of a diamond. We’ll also build a model to see how well we can predict the price of a diamond from the four C’s.

Diamonds <- read.delim("http://sites.williams.edu/rdeveaux/files/2014/09/Diamonds.txt")
Diamonds = Diamonds[,c(8,1,2,3,6)] #Remove all but Price and the 4 C's
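
Selecting columns by position depends on the column order of the file. A minimal sketch of the same selection by name follows, assuming the five columns are named Price, Carat.Size, Color, Clarity and Cut. The toy data frame `raw` and its `Depth` column are hypothetical stand-ins for the downloaded file, used only so that the snippet runs on its own.

```r
# Toy stand-in for the raw file; values and the Depth column are made up.
raw <- data.frame(
  Carat.Size = c(0.5, 1.0),
  Color      = c("E", "H"),
  Clarity    = c("VS1", "SI2"),
  Depth      = c(61.5, 62.0),   # hypothetical extra column to be dropped
  Cut        = c("Ideal", "Good"),
  Price      = c(1500, 4200)
)

# Columns to keep, in the order wanted for the analysis.
keep <- c("Price", "Carat.Size", "Color", "Clarity", "Cut")

# Fail early with a clear message if an expected column is absent.
missing_cols <- setdiff(keep, names(raw))
if (length(missing_cols) > 0) {
  stop("missing columns: ", paste(missing_cols, collapse = ", "))
}

small <- raw[, keep]
names(small)
```

Selecting by name keeps working if the file gains or reorders columns, and it fails loudly if one of the five is renamed.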

Some of our goals for this study include building and reinforcing skills for

* Examining the Distribution of a Variable

* Comparing groups via boxplots and summary statistics

* Summarizing the relationship between variables using multiple regression including interaction effects

* Selecting a model -- which variables to include in the final model 

Exploration

We start by exploring all the variables.

options(width=100)
summary(Diamonds)
##      Price         Carat.Size         Color     Clarity           Cut      
##  Min.   : 1000   Min.   :0.3000   E      :504   IF  :144   Excellent:1276  
##  1st Qu.: 1801   1st Qu.:0.6000   F      :431   SI1 :624   Good     : 165  
##  Median : 3604   Median :0.9000   G      :396   SI2 :530   Ideal    : 185  
##  Mean   : 3971   Mean   :0.8701   H      :394   VS1 :392   Very Good:1064  
##  3rd Qu.: 5544   3rd Qu.:1.0600   I      :316   VS2 :460                   
##  Max.   :10000   Max.   :2.0200   D      :277   VVS1:269                   
##                                   (Other):372   VVS2:271

Order the categorical variables:

Diamonds$Color=ordered(Diamonds$Color,c("D","E","F","G","H","I","J","K"))
Diamonds$Clarity=ordered(Diamonds$Clarity,c("IF","VVS1","VVS2","VS1","VS2","SI1","SI2"))
Diamonds$Cut=ordered(Diamonds$Cut,c("Ideal","Excellent","Very Good","Good"))
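
To see what ordering buys us, here is a small sketch on a made-up vector of color grades (illustrative values, not taken from the data set). A plain factor sorts its levels alphabetically, while an ordered factor follows the scale we give it, so plots and comparisons respect the grading.

```r
# Illustrative grades only, not taken from the data set.
color <- c("G", "D", "K", "E")

# Ordered factor: levels follow the grading scale, least colored (D) first.
graded <- ordered(color, levels = c("D", "E", "F", "G", "H", "I", "J", "K"))

levels(graded)   # the full scale, including grades absent from this sample
graded < "G"     # FALSE TRUE FALSE TRUE: which grades come before G
sort(graded)     # D E G K: sorted by grade
```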

Univariate summaries:

with(Diamonds,barplot(summary(Color),col="light green"))

with(Diamonds,barplot(summary(Cut),col="light green"))

with(Diamonds,barplot(summary(Clarity),col="light green"))

with(Diamonds,hist(Price,col="lightblue",bty="n",xlim=c(0,10000)))

Price and Color

From the literature we know that the least colored diamonds (D, E and F) are the rarest and most sought after. Let’s examine the relationship of Price with Color via boxplots.

with(Diamonds,boxplot(Price~Color,col=c(rep("White",3),rep("Light Yellow",3),rep("Yellow",2)),xlab="Color",ylab="Price"))

Wait – this looks backward. What happened? Can you think of a reason why the least desirable diamonds in terms of color are the most expensive?


More than one variable

The problem with looking at the simple relationship of one variable with another is that the world is more complex. Why are the least colored diamonds the least expensive when they should be the most expensive? The simplest answer is that they also tend to be the smallest!

with(Diamonds,boxplot(Carat.Size~Color,col=c(rep("White",3),rep("Light Yellow",3),rep("Yellow",2)),xlab="Color",ylab="Size (carats)"))
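
The boxplots can be backed up with numbers by computing a summary statistic within each color grade. Here is a minimal sketch, assuming a data frame with the column names used in this study. The helper `median_by` and the data frame `toy` are hypothetical, and the toy values are made up purely so the snippet runs on its own; they are not results from the diamond data.

```r
# Median of a numeric column within each level of a grouping column.
median_by <- function(data, value, group) {
  tapply(data[[value]], data[[group]], median)
}

# Toy data, made up purely to illustrate the call.
toy <- data.frame(
  Color      = ordered(c("D", "D", "J", "J"), levels = c("D", "J")),
  Carat.Size = c(0.4, 0.6, 1.2, 1.4),
  Price      = c(1500, 2500, 5000, 6000)
)

median_by(toy, "Carat.Size", "Color")   # D: 0.5, J: 1.3
median_by(toy, "Price", "Color")        # D: 2000, J: 5500
```

With the real data, the same calls would be `median_by(Diamonds, "Carat.Size", "Color")` and `median_by(Diamonds, "Price", "Color")`, which put the two boxplots side by side as tables.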