Gold has been the standard precious commodity for financial futures markets for decades, but diamonds may soon give it a run for its money. Martin Rapaport, a diamond exchange executive, is hoping for a late 2016 to early 2017 launch of a futures market in diamonds (see http://www.marketwatch.com/story/forget-gold-diamonds-may-be-the-next-big-thing-in-the-futures-market-2015-08-04?siteid=bigcharts&dist=bigcharts).

Gold pricing is straightforward. Although purity varies slightly (some coins are only 91.7% pure, or 22 karat, while bars and other bullion coins are 99.99% pure, or 24 karat), prices are based on a single (constantly changing) theoretical price called the spot price.

But diamonds are different. With gold, if you buy a bar 10 times as heavy as another, you’ll pay about 10 times as much. With diamonds, it’s more complicated. Not only is the size (carat weight) important, but anyone who’s ever thought about buying a diamond knows about the four C’s: Carat, Cut, Color and Clarity.

The relationship of Price with each is fairly obvious:

  1. Big diamonds are more expensive (but not necessarily linearly)

  2. Colorless diamonds are more expensive (D, E and F are colorless; then G through K)

  3. Clarity matters too (Internally Flawless down to Included)

  4. Cut (Ideal, Very Good, Good, Fair, Poor)

But which matters most? What are the tradeoffs?


This Study

We have data on some 2690 Diamonds that were “scraped” off the web by Lou Valente of JMP. We want to explore the relationship between each of the four C’s and the price of a diamond. We’ll also build a model to see how well we can predict the price of a diamond knowing the four C’s.

Diamonds <- read.delim("http://sites.williams.edu/rdeveaux/files/2014/09/Diamonds.txt")
Diamonds = Diamonds[,c(8,1,2,3,6)] #Remove all but Price and the 4 C's

Some of our goals for this study include building and reinforcing skills for

* Examining the Distribution of a Variable

* Comparing groups via boxplots and summary statistics

* Summarizing the relationship between variables using multiple regression including interaction effects

* Selecting a model -- which variables to include in the final model 

Exploration

We start by exploring all the variables.

options(width=100)
summary(Diamonds)
##      Price         Carat.Size         Color     Clarity           Cut      
##  Min.   : 1000   Min.   :0.3000   E      :504   IF  :144   Excellent:1276  
##  1st Qu.: 1801   1st Qu.:0.6000   F      :431   SI1 :624   Good     : 165  
##  Median : 3604   Median :0.9000   G      :396   SI2 :530   Ideal    : 185  
##  Mean   : 3971   Mean   :0.8701   H      :394   VS1 :392   Very Good:1064  
##  3rd Qu.: 5544   3rd Qu.:1.0600   I      :316   VS2 :460                   
##  Max.   :10000   Max.   :2.0200   D      :277   VVS1:269                   
##                                   (Other):372   VVS2:271

Order the categorical variables:

Diamonds$Color=ordered(Diamonds$Color,c("D","E","F","G","H","I","J","K"))
Diamonds$Clarity=ordered(Diamonds$Clarity,c("IF","VVS1","VVS2","VS1","VS2","SI1","SI2"))
Diamonds$Cut=ordered(Diamonds$Cut,c("Ideal","Excellent","Very Good","Good"))
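One detail worth knowing: R gives ordered factors polynomial contrasts by default, which is why the regression calls later in this study explicitly request `contr.treatment`. A quick illustration on a toy factor (made up for this aside, not the Diamonds data):

```r
# Ordered factors default to polynomial contrasts (".L", ".Q", ...),
# which would make the coefficients hard to read as simple adjustments
# to a baseline level. contr.treatment gives plain 0/1 indicators instead.
f <- ordered(c("D", "E", "F"), c("D", "E", "F"))
colnames(contrasts(f))   # ".L" ".Q" -- linear and quadratic contrasts
contr.treatment(3)       # the 0/1 indicator coding used in the models below
```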

Univariate summaries:

with(Diamonds,barplot(summary(Color),col="light green"))

with(Diamonds,barplot(summary(Cut),col="light green"))

with(Diamonds,barplot(summary(Clarity),col="light green"))

with(Diamonds,hist(Price,col="lightblue",bty="n",xlim=c(0,10000)))

Price and Color

From the literature we know that the least colored diamonds (D,E and F ) are the most sought after and rare. Let’s examine the relationship of Price with Color via boxplots.

with(Diamonds,boxplot(Price~Color,col=c(rep("White",3),rep("Light Yellow",3),rep("Yellow",3)),xlab="Color",ylab="Price"))

Wait – this looks backward. What happened? Can you think of a reason why the least desirable diamonds in terms of color are the most expensive?


More than one variable

The problem with looking at the simple relationship of one variable with another is that the world is more complex. Why are the least colored diamonds the least expensive? They should be the most expensive. The simplest answer is that they are the smallest (!)

with(Diamonds,boxplot(Carat.Size~Color,col=c(rep("White",3),rep("Light Yellow",3),rep("Yellow",3)),xlab="Color",ylab="Size (carats)"))

Another way to visualize this is by repeating the boxplots of Price against Color for different ranges of Size:

carat=cut(Diamonds$Carat.Size,breaks=c(0,1,1.5,2.05)) # First split the Diamonds into 3 size groups
library(lattice) # bwplot() comes from the lattice package
with(Diamonds, bwplot(Price~Color | carat,xlab="Color",ylab="Price",main="Price vs. Color by Carat Size"))

Now we can see that for small diamonds (less than 1 carat), there’s not much of a relationship (or there are other factors involved), but for diamonds larger than 1 carat, the prices go more or less as we expected.

A model, such as a multiple linear regression model, can adjust for several factors at once and provide estimates of the price that will take into account all of the characteristics of the diamond in combination.

When the effect of one predictor variable (call it \(x_1\)) on \(y\) changes depending on the level of another predictor (\(x_2\)) we say that there is an interaction between variables \(x_1\) and \(x_2\). We will need to incorporate that into the model. Here there is a Size - Color interaction. The relationship between Price and Color depends on the Size of the diamond.
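As a small illustration of the idea (on made-up data, not the diamonds), here is a regression where the slope of \(y\) on \(x\) genuinely differs by group, so it is the interaction term that captures the difference:

```r
# Hypothetical data: slope of y on x is 1 in the "small" group and 3 in
# the "large" group. The x:glarge coefficient estimates the slope
# difference (about 2); without it, one common slope would be forced.
set.seed(4)
x <- runif(200)
g <- factor(rep(c("small", "large"), 100), levels = c("small", "large"))
y <- ifelse(g == "large", 1 + 3 * x, 1 + 1 * x) + rnorm(200, 0, 0.2)
coef(lm(y ~ x * g))
```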


Examining Two Predictor Variables at a Time

Before building a model for Price, let’s explore the relationships among the four C’s a bit more.

with(Diamonds,mosaicplot(Clarity~Color ,col=c(rep("White",3),rep("Yellow",3),rep("Orange",2)),main="Clarity by Color"))

There seem to be greater percentages of desirable clarities (IF and the VV’s) among the colorless diamonds.

with(Diamonds,mosaicplot(Cut~Color ,col=c(rep("White",3),rep("Yellow",3),rep("Orange",2)),main="Cut by Color"))

Cut seems to be fairly constant across the Colors.

with(Diamonds,mosaicplot(Clarity~Cut ,col=rainbow(7),main="Clarity by Cut"))

There is an association as well between desirable Clarity and desirable Cut, with a much larger percentage of the top three clarities among the Ideal and Excellent cuts.


Building a Model

Since we know that a diamond’s size is an important factor, let’s start with a scatterplot of Price by Size:

with(Diamonds,plot(Price~Carat.Size,bty="n",pch=19,xlab="Size (carats)",ylab="Price ($)",main=""))

The relationship is strong and positive, but it’s clearly not straight. We’ll need to straighten it out before we can fit a linear model.

Taking logs is often a useful strategy, and a plot of Log(Price) against Size shows that it straightens the plot considerably. (It may actually be slightly too strong a transformation; the curve is starting to bend slightly the other way. In fact, the Box-Cox algorithm suggests raising Price to the power 0.2, but log seems more intuitive and the difference is not large.)

with(Diamonds,plot(log10(Price)~Carat.Size,xlab="Size (carats)",ylab="Log10(Price)",main="",bty="n",pch=19,cex=.7))
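For the curious, here is a sketch of how a Box-Cox check like the one mentioned above can be run. It uses simulated data with multiplicative errors (hypothetical, not the actual Diamonds file); the MASS package ships with R:

```r
# Box-Cox searches over powers lambda for the transformation of the
# response that best straightens the relationship; lambda near 0
# corresponds to the log transformation.
library(MASS)
set.seed(1)
carat <- runif(500, 0.3, 2)
price <- 1000 * 6.2^carat * exp(rnorm(500, 0, 0.3))  # multiplicative errors
bc <- boxcox(price ~ carat, lambda = seq(-0.5, 1, 0.01), plotit = FALSE)
bc$x[which.max(bc$y)]   # estimated power; a value near 0 supports the log
```

On the real Diamonds data the same call is what points to a power near 0.2.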

We’ll go ahead and fit a regression to this:

lm.d1=lm(log10(Price)~Carat.Size,Diamonds)
summary(lm.d1)
## 
## Call:
## lm(formula = log10(Price) ~ Carat.Size, data = Diamonds)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.46628 -0.08523 -0.01440  0.08265  0.49795 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2.820780   0.007238   389.7   <2e-16 ***
## Carat.Size  0.793084   0.007801   101.7   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1303 on 2688 degrees of freedom
## Multiple R-squared:  0.7936, Adjusted R-squared:  0.7935 
## F-statistic: 1.033e+04 on 1 and 2688 DF,  p-value: < 2.2e-16

So, according to this model, each additional carat is associated, on average, with multiplying the price by 10^0.793, or about 6.21. In other words, a diamond 1 carat bigger will cost, on average, about 6.21 times as much as the smaller one.

As a check on the model, let’s look at diamonds between .35 and .45 carats and then diamonds between 1.35 and 1.45 carats:

with(subset(Diamonds,.35<Carat.Size & .45>Carat.Size),mean(Price))
## [1] 1201.562
with(subset(Diamonds,1.35<Carat.Size & 1.45>Carat.Size),mean(Price))
## [1] 6785.381

And the ratio is 6785.381/1201.562, or about 5.65. Not perfect, but in the ballpark.


But what about Color?

When we add a categorical variable to a linear regression, it has the possibility of adjusting both the intercept (for each level of the categorical variable) and the slope (by using an interaction term). We’ll start simply – giving each color a different intercept, but leaving the slopes the same, so that we get 8 parallel lines.

To do it, we create an indicator (or dummy) variable for each Color (but one – more on that later). So if we create

\[x_{2i} = \begin{cases} 1 &\mbox{if} ~~~Color == "E" \\ 0 & \mbox{otherwise}\end{cases}\]

and

\[x_{3i} = \begin{cases} 1 &\mbox{if} ~~~Color == "F" \\ 0 & \mbox{otherwise}\end{cases}\]

etc. down to:

\[x_{8i} = \begin{cases} 1 &\mbox{if} ~~~Color == "K" \\ 0 & \mbox{otherwise}\end{cases}\]

we have the following model:

\[ \widehat{log(Price)} = b_0 + b_1 Carat Size + b_2 x_2 + ... + b_8 x_8\]

Notice that the diamonds with Color D have value \(0\) for all variables \(x_2,...x_8\) and so their model is the “original” model:

\[ \widehat{log(Price)} = b_0 + b_1 Carat Size\]

For each of the other color levels (say E), there is an additional term (\(b_2\) for E) that adds to the intercept. For E diamonds:

\[ \widehat{log(Price)} = b_0 + b_1 Carat Size + b_2\] \[ = (b_0 + b_2) + b_1 Carat Size \]

essentially “adjusting” the intercept by \(b_2\) for those diamonds. The result is 8 different parallel lines:
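To see the indicator coding R actually constructs, we can look at the model matrix for a tiny hand-made factor (toy data, not the Diamonds file). Note that one level, here "D", is the baseline and gets no column of its own:

```r
# Each row is an observation; each Color level beyond the baseline
# gets its own 0/1 column, exactly the x_2, x_3, ... of the model above.
Color <- factor(c("D", "E", "F", "E"), levels = c("D", "E", "F"))
model.matrix(~ Color, contrasts.arg = list(Color = "contr.treatment"))
```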

lm.d2=lm(log10(Price)~Carat.Size+Color,Diamonds,contrasts=list(Color="contr.treatment"))
summary(lm.d2)
## 
## Call:
## lm(formula = log10(Price) ~ Carat.Size + Color, data = Diamonds, 
##     contrasts = list(Color = "contr.treatment"))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.41550 -0.07547 -0.00788  0.07027  0.41063 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.809895   0.008298 338.603  < 2e-16 ***
## Carat.Size   0.911400   0.007609 119.783  < 2e-16 ***
## ColorE      -0.032436   0.008164  -3.973 7.29e-05 ***
## ColorF      -0.045008   0.008432  -5.338 1.02e-07 ***
## ColorG      -0.057093   0.008724  -6.545 7.12e-11 ***
## ColorH      -0.105930   0.008858 -11.958  < 2e-16 ***
## ColorI      -0.166000   0.009468 -17.533  < 2e-16 ***
## ColorJ      -0.236116   0.010036 -23.526  < 2e-16 ***
## ColorK      -0.297170   0.012524 -23.729  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1091 on 2681 degrees of freedom
## Multiple R-squared:  0.8556, Adjusted R-squared:  0.8552 
## F-statistic:  1986 on 8 and 2681 DF,  p-value: < 2.2e-16
anova(lm.d2)
## Analysis of Variance Table
## 
## Response: log10(Price)
##              Df Sum Sq Mean Sq  F value    Pr(>F)    
## Carat.Size    1 175.59 175.590 14738.96 < 2.2e-16 ***
## Color         7  13.73   1.961   164.64 < 2.2e-16 ***
## Residuals  2681  31.94   0.012                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ANOVA table tests (all at once) whether the intercept adjusters are necessary (formally, the null hypothesis is that \(\beta_2 = ... = \beta_8 = 0\)). Here, with a P-value less than \(10^{-16}\), there is clear evidence that the intercepts are not all the same.
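Equivalently, the same partial F test can be run by handing anova() the two nested fits directly. Here is that idea on a small simulated (hypothetical) example, comparing a model without group intercepts to one with them:

```r
# Simulated data where the groups really do have different intercepts.
set.seed(2)
x <- runif(200)
g <- factor(sample(c("A", "B", "C"), 200, replace = TRUE))
y <- 1 + 2 * x + c(0, 0.5, 1)[as.integer(g)] + rnorm(200, 0, 0.3)
m0 <- lm(y ~ x)        # no group intercepts
m1 <- lm(y ~ x + g)    # one intercept adjuster per extra group level
anova(m0, m1)          # F test that all the adjusters are zero
```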

Here are the 8 regression models on the scatterplot:

plot(log10(Price)~Carat.Size,data=Diamonds,bty="n",pch=19,cex=.6)
abline(coef(lm.d2)[1],coef(lm.d2)[2],col="red",lwd=3)
abline(coef(lm.d2)[1]+coef(lm.d2)[3],coef(lm.d2)[2],col="blue",lwd=3)
abline(coef(lm.d2)[1]+coef(lm.d2)[4],coef(lm.d2)[2],col="green",lwd=3)
abline(coef(lm.d2)[1]+coef(lm.d2)[5],coef(lm.d2)[2],col="yellow",lwd=3)
abline(coef(lm.d2)[1]+coef(lm.d2)[6],coef(lm.d2)[2],col="orange",lwd=3)
abline(coef(lm.d2)[1]+coef(lm.d2)[7],coef(lm.d2)[2],col="pink",lwd=3)
abline(coef(lm.d2)[1]+coef(lm.d2)[8],coef(lm.d2)[2],col="grey",lwd=3)
abline(coef(lm.d2)[1]+coef(lm.d2)[9],coef(lm.d2)[2],col="purple",lwd=3)

We can also adjust the slopes, by creating slope adjusters in much the way we just adjusted the intercepts. Now we add a term \(b_j\, x_j \times Carat.Size\) for each color E through K, where \(x_j\) is that color's indicator variable. This gives 8 different intercepts and 8 different slopes:

lm.d3=lm(log10(Price)~Carat.Size+Color+Carat.Size*Color,Diamonds,contrasts=list(Color="contr.treatment"))
summary(lm.d3)
## 
## Call:
## lm(formula = log10(Price) ~ Carat.Size + Color + Carat.Size * 
##     Color, data = Diamonds, contrasts = list(Color = "contr.treatment"))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.35334 -0.06863 -0.00336  0.06774  0.39073 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        2.7276231  0.0180705 150.943  < 2e-16 ***
## Carat.Size         1.0345056  0.0253369  40.830  < 2e-16 ***
## ColorE            -0.0262055  0.0224359  -1.168  0.24290    
## ColorF            -0.0266150  0.0232569  -1.144  0.25256    
## ColorG            -0.0007409  0.0247696  -0.030  0.97614    
## ColorH             0.0009724  0.0271573   0.036  0.97144    
## ColorI             0.0560934  0.0287253   1.953  0.05095 .  
## ColorJ            -0.0354357  0.0308309  -1.149  0.25051    
## ColorK             0.0820540  0.0413955   1.982  0.04756 *  
## Carat.Size:ColorE -0.0113356  0.0312522  -0.363  0.71685    
## Carat.Size:ColorF -0.0385720  0.0311848  -1.237  0.21624    
## Carat.Size:ColorG -0.0941868  0.0310566  -3.033  0.00245 ** 
## Carat.Size:ColorH -0.1485335  0.0324056  -4.584 4.78e-06 ***
## Carat.Size:ColorI -0.2548668  0.0324612  -7.851 5.90e-15 ***
## Carat.Size:ColorJ -0.2310598  0.0335320  -6.891 6.90e-12 ***
## Carat.Size:ColorK -0.3801792  0.0401497  -9.469  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.105 on 2674 degrees of freedom
## Multiple R-squared:  0.8667, Adjusted R-squared:  0.8659 
## F-statistic:  1159 on 15 and 2674 DF,  p-value: < 2.2e-16
anova(lm.d3)
## Analysis of Variance Table
## 
## Response: log10(Price)
##                    Df Sum Sq Mean Sq   F value    Pr(>F)    
## Carat.Size          1 175.59 175.590 15916.378 < 2.2e-16 ***
## Color               7  13.73   1.961   177.791 < 2.2e-16 ***
## Carat.Size:Color    7   2.44   0.349    31.596 < 2.2e-16 ***
## Residuals        2674  29.50   0.011                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(log10(Price)~Carat.Size,data=Diamonds,pch=19,cex=.6,bty="n")
abline(lm.d3,col="red",lwd=3)
## Warning in abline(lm.d3, col = "red", lwd = 3): only using the first two of 16 regression
## coefficients
for (i in 1:7) abline(c(coef(lm.d3)[1]+coef(lm.d3)[2+i],coef(lm.d3)[2]+coef(lm.d3)[9+i]),col=i+1,lwd=3)


Adding Complexity

We can also add intercept and slope adjusters for each level of Cut and Clarity. A complex model with all these terms would be:

lm.d4=lm(log10(Price)~Carat.Size+Color+Carat.Size*Color+Clarity+Carat.Size*Clarity+Cut+Carat.Size*Cut,Diamonds,contrasts=list(Color="contr.treatment",Cut="contr.treatment",Clarity="contr.treatment"))
anova(lm.d4)
## Analysis of Variance Table
## 
## Response: log10(Price)
##                      Df  Sum Sq Mean Sq    F value    Pr(>F)    
## Carat.Size            1 175.590 175.590 35989.2530 < 2.2e-16 ***
## Color                 7  13.730   1.961   402.0118 < 2.2e-16 ***
## Clarity               6  12.193   2.032   416.5325 < 2.2e-16 ***
## Cut                   3   0.231   0.077    15.7963  3.51e-10 ***
## Carat.Size:Color      7   5.182   0.740   151.7243 < 2.2e-16 ***
## Carat.Size:Clarity    6   1.360   0.227    46.4639 < 2.2e-16 ***
## Carat.Size:Cut        3   0.014   0.005     0.9889    0.3969    
## Residuals          2656  12.958   0.005                         
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We see that the Cut levels require an intercept adjustment, but not a slope adjustment. Our final model might be:

lm.d5=lm(log10(Price)~Carat.Size+Color+Carat.Size*Color+Clarity+Carat.Size*Clarity+Cut,Diamonds,contrasts=list(Color="contr.treatment",Cut="contr.treatment",Clarity="contr.treatment"))
anova(lm.d5)
## Analysis of Variance Table
## 
## Response: log10(Price)
##                      Df  Sum Sq Mean Sq   F value    Pr(>F)    
## Carat.Size            1 175.590 175.590 35989.702 < 2.2e-16 ***
## Color                 7  13.730   1.961   402.017 < 2.2e-16 ***
## Clarity               6  12.193   2.032   416.538 < 2.2e-16 ***
## Cut                   3   0.231   0.077    15.796 3.508e-10 ***
## Carat.Size:Color      7   5.182   0.740   151.726 < 2.2e-16 ***
## Carat.Size:Clarity    6   1.360   0.227    46.465 < 2.2e-16 ***
## Residuals          2659  12.973   0.005                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

All terms are significant. The \(R^2\) value is \(94.14\%\). However, the residual plot still shows curvature; perhaps that log transformation really was too strong…

plot(residuals(lm.d5)~predict(lm.d5),bty="n",pch=19,cex=.6)


Going Further?

In fact, a better model can be made by transforming both Price and Carat Size by log(). The corresponding \(R^2\) for that model is \(96.79\%\) and the residual plot is quite well behaved:

lm.d7=lm(log10(Price)~log10(Carat.Size)+Color+log10(Carat.Size)*Color+Clarity+log10(Carat.Size)*Clarity+Cut,Diamonds,contrasts=list(Color="contr.treatment",Cut="contr.treatment",Clarity="contr.treatment"))
plot(residuals(lm.d7)~predict(lm.d7),bty="n",pch=19,cex=.6)

This model predicts the price of a \(0.9\) carat diamond with Color "D", Clarity "IF" and Cut "Ideal" at $7413, with a 95% prediction range of $5857 to $9397.
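For readers wondering where such a range comes from: predict() with interval = "prediction" produces it on the log scale, and raising 10 to the result converts back to dollars. A sketch on simulated (hypothetical) data rather than the real Diamonds file:

```r
# A one-predictor stand-in for the log-price model: fit on the log10
# scale, get the prediction interval there, then back-transform.
set.seed(3)
carat <- runif(300, 0.3, 2)
log.price <- 2.8 + 0.8 * carat + rnorm(300, 0, 0.13)
fit <- lm(log.price ~ carat)
pred <- predict(fit, data.frame(carat = 0.9), interval = "prediction", level = 0.95)
10^pred   # back from the log10 scale to dollars: fit, lwr, upr
```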

These data were collected in 2010. Since then, diamonds have increased in value by about 21%, so in 2015 this range would be $7087 to $11,370.

A look at BlueNile.com on Oct. 7, 2015 found 10 diamonds with these characteristics, priced from $8905 to $11,230.


What have we Learned?

We’ve seen that looking at only two variables at a time can lead to incorrect and often silly conclusions.

Even though complex models can never prove causality, they can offer explanations that are more plausible than those from looking at one variable at a time.

We’ve built the model through a series of steps: