Gold has been the standard precious commodity for financial futures markets for decades, but diamonds may soon give them a run for their money. Martin Rapaport, a diamond exchange executive, is hoping for a late 2016 to early 2017 launch of futures market in diamonds (see http://www.marketwatch.com/story/forget-gold-diamonds-may-be-the-next-big-thing-in-the-futures-market-2015-08-04?siteid=bigcharts&dist=bigcharts)
Gold pricing is straightforward. Although there are slight variations in purity (some coins are only 91.7% – 22 karat gold, while bars and other buillion coins are 99.99% pure – 24 karat), the prices are based on one (constantly changing) theoretical price called the spot price.
But diamonds are different. With gold, if you buy a bar 10 times as heavy as another, you’ll pay about 10 times as much. With diamonds, it’s more complicated. Not only is the size (carat weight) important, but anyone who’s ever thought about buying a diamond knows about the four C’s: Carat, Cut, Color and Clarity.
The relationship of Price with each is fairly obvious:
Big diamonds are more expensive (but not necessarily linearly)
Colorless diamonds are more expensive (D,E and F are colorless, then G - K)
Clarity matters too (Internally Flawless down to Included)
Cut (Ideal, Very Good, Good, Fair, Poor)
But which matters most? What are the tradeoffs?
We have data on some 2690 Diamonds that were “scraped” off the web by Lou Valente of JMP. We want to explore the relationship between each of the four C’s and the price of a diamond. We’ll also build a model to see how well we can predict the price of a diamond knowing the four C’s.
Diamonds <- read.delim("http://sites.williams.edu/rdeveaux/files/2014/09/Diamonds.txt")
Diamonds = Diamonds[,c(8,1,2,3,6)] #Remove all but Price and the 4 C's
Some of our goals for this study include building and reinforcing skills for
* Examining the Distribution of a Variable
* Comparing groups via boxplots and summary statistics
* Summarizing the relationship between variables using multiple regression including interaction effects
* Selecting a model -- which variables to include in the final model
We start by exploring all the variables.
options(width=100)
summary(Diamonds)
## Price Carat.Size Color Clarity Cut
## Min. : 1000 Min. :0.3000 E :504 IF :144 Excellent:1276
## 1st Qu.: 1801 1st Qu.:0.6000 F :431 SI1 :624 Good : 165
## Median : 3604 Median :0.9000 G :396 SI2 :530 Ideal : 185
## Mean : 3971 Mean :0.8701 H :394 VS1 :392 Very Good:1064
## 3rd Qu.: 5544 3rd Qu.:1.0600 I :316 VS2 :460
## Max. :10000 Max. :2.0200 D :277 VVS1:269
## (Other):372 VVS2:271
Order the categorical variables:
Diamonds$Color=ordered(Diamonds$Color,c("D","E","F","G","H","I","J","K"))
Diamonds$Clarity=ordered(Diamonds$Clarity,c("IF","VVS1","VVS2","VS1","VS2","SI1","SI2"))
Diamonds$Cut=ordered(Diamonds$Cut,c("Ideal","Excellent","Very Good","Good"))
Univariate summaries:
with(Diamonds,barplot(summary(Color),col="light green"))
with(Diamonds,barplot(summary(Cut),col="light green"))
with(Diamonds,barplot(summary(Clarity),col="light green"))
with(Diamonds,hist(Price,col="lightblue",bty="n",xlim=c(0,10000)))
From the literature we know that the least colored diamonds (D,E and F ) are the most sought after and rare. Let’s examine the relationship of Price with Color via boxplots.
with(Diamonds,boxplot(Price~Color,col=c(rep("White",3),rep("Light Yellow",3),rep("Yellow",3)),xlab="Color",ylab="Price"))
Wait – this looks backward. What happened? Can you think of a reason why the least desirable diamonds in terms of color are the most expensive?
The problem with looking at the simple relationship of one variable with another is that the world is more complex. Why are the least colored diamonds the least expensive? They should be the most expensive. The simplest answer is that they are the smallest (!)
with(Diamonds,boxplot(Carat.Size~Color,col=c(rep("White",3),rep("Light Yellow",3),rep("Yellow",3)),xlab="Color",ylab="Size (carats)"))
Another way to visualize this is by repeating the boxplots of Price against Color for different ranges of Size:
carat=cut(Diamonds$Carat.Size,breaks=c(0,1,1.5,2.05)) # First split the Diamonds into 3 size groups
with(Diamonds, bwplot(Price~Color | carat,xlab="Color",ylab="Price",main="Price vs. Color by Carat Size"))
Now we can see that for small diamonds (less than 1 carat), there’s not much of a relationship (or there are other factors involved), but for diamonds larger than 1 carat, the prices go more or less as we expected.
A model, such as a multiple linear regression model, can adjust for several factors at once and provide estimates of the price that will take into account all of the characteristics of the diamond in combination.
When the effect of one predictor variable (call it \(x_1\)) on \(y\) changes depending on the level of another predictor (\(x_2\)) we say that there is an interaction between variables \(x_1\) and \(x_2\). We will need to incorporate that into the model. Here there is a Size - Color interaction. The relationship between Price and Color depends on the Size of the diamond.
Before building a model for Price, let’s explore the relationships among the four C’s a bit more.
with(Diamonds,mosaicplot(Clarity~Color ,col=c(rep("White",3),rep("Yellow",3),rep("Orange",2)),main="Clarity by Color"))
There seem to be greater percentages of desirable clarities (IF and the VV’s) among the colorless diamonds.
with(Diamonds,mosaicplot(Cut~Color ,col=c(rep("White",3),rep("Yellow",3),rep("Orange",2)),main="Cut by Color"))
Cut seems to be fairly constant across the Colors.
with(Diamonds,mosaicplot(Clarity~Cut ,col=rainbow(7),main="Clarity by Cut"))
There is an association as well between desirable Clairity and desirable Cut with a much larger percentage of the top three Clarities among the Ideal and Excellent cuts.
Since we know that a diamond’s size is an important factor, let’s start with a scatterplot of Price by Size:
with(Diamonds,plot(Price~Carat.Size,bty="n",pch=19,xlab="Size (carats)",ylab="Price ($)",main=""))
The relationship is strong and positive, but it’s clearly not straight. We’ll need to straighten it out before we can fit a linear model.
Taking logs is often a useful strategy and a plot of Log(Price) against Size shows that it has straightened the plot considerably (it actually may be slightly too strong a transformation. We can see that the curve is starting to go slightly the other way. In fact, the Box-Cox algorithm suggests raising Price to the power 0.2, but log seems more intuitive and the difference is not large)
with(Diamonds,plot(log10(Price)~Carat.Size,xlab="Price",main="",bty="n",pch=19,cex=.7))
We’ll go ahead and fit a regression to this:
lm.d1=lm(log10(Price)~Carat.Size,Diamonds)
summary(lm.d1)
##
## Call:
## lm(formula = log10(Price) ~ Carat.Size, data = Diamonds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.46628 -0.08523 -0.01440 0.08265 0.49795
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.820780 0.007238 389.7 <2e-16 ***
## Carat.Size 0.793084 0.007801 101.7 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1303 on 2688 degrees of freedom
## Multiple R-squared: 0.7936, Adjusted R-squared: 0.7935
## F-statistic: 1.033e+04 on 1 and 2688 DF, p-value: < 2.2e-16
So, according to this model, every increase of 1 carat is associated, on average, with an increase of 10^.793 or 6.21. In other words, a diamond 1 carat bigger will cost, on average, about 6.21 times the smaller one.
As a check on the model, let’s look at diamonds between .35 and .45 carats and then diamonds between 1.35 and 1.45 carats:
with(subset(Diamonds,.35<Carat.Size & .45>Carat.Size),mean(Price))
## [1] 1201.562
with(subset(Diamonds,1.35<Carat.Size & 1.45>Carat.Size),mean(Price))
## [1] 6785.381
And the ratio is 5.6471324. Not perfect, but in the ball park.
When we add a categorical variable to a linear regression, it has the possibility of adjusting both the intercept (for each level of the categorical variable) and the slope (by using an interaction term). We’ll start simply – giving each color a different intercept, but leaving the slopes the same, so that we get 8 parallel lines.
To do it, we create an indicator (or dummy) variable for each Color (but one – more on that later). So if we create
\[x_{2i} = \begin{cases} 1 &\mbox{if} ~~~Color == "E" \\ 0 & \mbox{otherwise}\end{cases}\]
and
\[x_{3i} = \begin{cases} 1 &\mbox{if} ~~~Color == "F" \\ 0 & \mbox{otherwise}\end{cases}\]
etc. down to:
\[x_{8i} = \begin{cases} 1 &\mbox{if} ~~~Color == "K" \\ 0 & \mbox{otherwise}\end{cases}\]
we have the following model:
\[ \widehat{log(Price)} = b_0 + b_1 Carat Size + b_2 x_2 + ... + b_8 x_8\]
Notice that the diamonds with Color D have value \(0\) for all variables \(x_2,...x_8\) and so their model is the “original” model:
\[ \widehat{log(Price)} = b_0 + b_1 Carat Size\]
For each of the other color levels (say E), there is an additional term (\(b_2\) for E) that adds to the intercept. For for E diamonds:
\[ \widehat{log(Price)} = b_0 + b_1 Carat Size + b_2\] \[ = (b_0 + b_2) + b_1 Carat Size \]
essentially “adjusting” the intercept by \(b_2\) for those diamonds. The result is 8 different parallel lines:
lm.d2=lm(log10(Price)~Carat.Size+Color,Diamonds,contrasts=list(Color="contr.treatment"))
summary(lm.d2)
##
## Call:
## lm(formula = log10(Price) ~ Carat.Size + Color, data = Diamonds,
## contrasts = list(Color = "contr.treatment"))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.41550 -0.07547 -0.00788 0.07027 0.41063
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.809895 0.008298 338.603 < 2e-16 ***
## Carat.Size 0.911400 0.007609 119.783 < 2e-16 ***
## ColorE -0.032436 0.008164 -3.973 7.29e-05 ***
## ColorF -0.045008 0.008432 -5.338 1.02e-07 ***
## ColorG -0.057093 0.008724 -6.545 7.12e-11 ***
## ColorH -0.105930 0.008858 -11.958 < 2e-16 ***
## ColorI -0.166000 0.009468 -17.533 < 2e-16 ***
## ColorJ -0.236116 0.010036 -23.526 < 2e-16 ***
## ColorK -0.297170 0.012524 -23.729 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1091 on 2681 degrees of freedom
## Multiple R-squared: 0.8556, Adjusted R-squared: 0.8552
## F-statistic: 1986 on 8 and 2681 DF, p-value: < 2.2e-16
anova(lm.d2)
## Analysis of Variance Table
##
## Response: log10(Price)
## Df Sum Sq Mean Sq F value Pr(>F)
## Carat.Size 1 175.59 175.590 14738.96 < 2.2e-16 ***
## Color 7 13.73 1.961 164.64 < 2.2e-16 ***
## Residuals 2681 31.94 0.012
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The “Anova” table tests (all at once) whether the intercept adjusters are necessary (formally the null hypothesis is that \(\beta_2 = ... = \beta_8 = 0\)). Here, with a P-value less than \(10^16\), there is clear evidence that the intercepts are not the same.
Here are the 8 regression models on the scatterplot:
plot(log10(Price)~Carat.Size,data=Diamonds,bty="n",pch=19,cex=.6)
abline(coef(lm.d2)[1],coef(lm.d2)[2],col="red",lwd=3)
abline(coef(lm.d2)[1]+coef(lm.d2)[3],coef(lm.d2)[2],col="blue",lwd=3)
abline(coef(lm.d2)[1]+coef(lm.d2)[4],coef(lm.d2)[2],col="green",lwd=3)
abline(coef(lm.d2)[1]+coef(lm.d2)[5],coef(lm.d2)[2],col="yellow",lwd=3)
abline(coef(lm.d2)[1]+coef(lm.d2)[6],coef(lm.d2)[2],col="orange",lwd=3)
abline(coef(lm.d2)[1]+coef(lm.d2)[7],coef(lm.d2)[2],col="pink",lwd=3)
abline(coef(lm.d2)[1]+coef(lm.d2)[8],coef(lm.d2)[2],col="grey",lwd=3)
abline(coef(lm.d2)[1]+coef(lm.d2)[9],coef(lm.d2)[2],col="purple",lwd=3)
We can also adjust the slopes, by creating slope adjusters in much the way we just adjusted the intercepts. Now we add a term \[b_i \times Carat.Size\] for each color E through K. This will give 8 different intercepts and 8 different slopes:
lm.d3=lm(log10(Price)~Carat.Size+Color+Carat.Size*Color,Diamonds,contrasts=list(Color="contr.treatment"))
summary(lm.d3)
##
## Call:
## lm(formula = log10(Price) ~ Carat.Size + Color + Carat.Size *
## Color, data = Diamonds, contrasts = list(Color = "contr.treatment"))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.35334 -0.06863 -0.00336 0.06774 0.39073
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.7276231 0.0180705 150.943 < 2e-16 ***
## Carat.Size 1.0345056 0.0253369 40.830 < 2e-16 ***
## ColorE -0.0262055 0.0224359 -1.168 0.24290
## ColorF -0.0266150 0.0232569 -1.144 0.25256
## ColorG -0.0007409 0.0247696 -0.030 0.97614
## ColorH 0.0009724 0.0271573 0.036 0.97144
## ColorI 0.0560934 0.0287253 1.953 0.05095 .
## ColorJ -0.0354357 0.0308309 -1.149 0.25051
## ColorK 0.0820540 0.0413955 1.982 0.04756 *
## Carat.Size:ColorE -0.0113356 0.0312522 -0.363 0.71685
## Carat.Size:ColorF -0.0385720 0.0311848 -1.237 0.21624
## Carat.Size:ColorG -0.0941868 0.0310566 -3.033 0.00245 **
## Carat.Size:ColorH -0.1485335 0.0324056 -4.584 4.78e-06 ***
## Carat.Size:ColorI -0.2548668 0.0324612 -7.851 5.90e-15 ***
## Carat.Size:ColorJ -0.2310598 0.0335320 -6.891 6.90e-12 ***
## Carat.Size:ColorK -0.3801792 0.0401497 -9.469 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.105 on 2674 degrees of freedom
## Multiple R-squared: 0.8667, Adjusted R-squared: 0.8659
## F-statistic: 1159 on 15 and 2674 DF, p-value: < 2.2e-16
anova(lm.d3)
## Analysis of Variance Table
##
## Response: log10(Price)
## Df Sum Sq Mean Sq F value Pr(>F)
## Carat.Size 1 175.59 175.590 15916.378 < 2.2e-16 ***
## Color 7 13.73 1.961 177.791 < 2.2e-16 ***
## Carat.Size:Color 7 2.44 0.349 31.596 < 2.2e-16 ***
## Residuals 2674 29.50 0.011
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(log10(Price)~Carat.Size,data=Diamonds,pch=19,cex=.6,bty="n")
abline(lm.d3,col="red",lwd=3)
## Warning in abline(lm.d3, col = "red", lwd = 3): only using the first two of 16 regression
## coefficients
for (i in 1:7) abline(c(coef(lm.d3)[1]+coef(lm.d3)[2+i],coef(lm.d3)[2]+coef(lm.d3)[9+i]),col=i+1,lwd=3)
We can also add intercept and slope adjusters for each level of Cut and Clarity. A complex model with all these terms woul d be:
lm.d4=lm(log10(Price)~Carat.Size+Color+Carat.Size*Color+Clarity+Carat.Size*Clarity+Cut+Carat.Size*Cut,Diamonds,contrasts=list(Color="contr.treatment",Cut="contr.treatment",Clarity="contr.treatment"))
anova(lm.d4)
## Analysis of Variance Table
##
## Response: log10(Price)
## Df Sum Sq Mean Sq F value Pr(>F)
## Carat.Size 1 175.590 175.590 35989.2530 < 2.2e-16 ***
## Color 7 13.730 1.961 402.0118 < 2.2e-16 ***
## Clarity 6 12.193 2.032 416.5325 < 2.2e-16 ***
## Cut 3 0.231 0.077 15.7963 3.51e-10 ***
## Carat.Size:Color 7 5.182 0.740 151.7243 < 2.2e-16 ***
## Carat.Size:Clarity 6 1.360 0.227 46.4639 < 2.2e-16 ***
## Carat.Size:Cut 3 0.014 0.005 0.9889 0.3969
## Residuals 2656 12.958 0.005
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We see that the Cut levels require intercept, but not slope adjustment. Our final model may be:
lm.d5=lm(log10(Price)~Carat.Size+Color+Carat.Size*Color+Clarity+Carat.Size*Clarity+Cut,Diamonds,contrasts=list(Color="contr.treatment",Cut="contr.treatment",Clarity="contr.treatment"))
anova(lm.d5)
## Analysis of Variance Table
##
## Response: log10(Price)
## Df Sum Sq Mean Sq F value Pr(>F)
## Carat.Size 1 175.590 175.590 35989.702 < 2.2e-16 ***
## Color 7 13.730 1.961 402.017 < 2.2e-16 ***
## Clarity 6 12.193 2.032 416.538 < 2.2e-16 ***
## Cut 3 0.231 0.077 15.796 3.508e-10 ***
## Carat.Size:Color 7 5.182 0.740 151.726 < 2.2e-16 ***
## Carat.Size:Clarity 6 1.360 0.227 46.465 < 2.2e-16 ***
## Residuals 2659 12.973 0.005
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
All terms are significant. The \(R^2\) value is \(94.14%\). However, the residual plot still shows curvature – perhaps that log transformation really was too strong…
plot(residuals(lm.d5)~predict(lm.d5),bty="n",pch=19,cex=.6)
In fact, a better model can be made by transforming both Price and Carat Size by log(). The corresponding \(R^2\) for that model is \(96.79%\) and the residual plot is quite well behaved:
lm.d7=lm(log10(Price)~log10(Carat.Size)+Color+log10(Carat.Size)*Color+Clarity+log10(Carat.Size)*Clarity+Cut,Diamonds,contrasts=list(Color="contr.treatment",Cut="contr.treatment",Clarity="contr.treatment"))
plot(residuals(lm.d7)~predict(lm.d7),bty="n",pch=19,cex=.6)
This model predicts the price of \(0.9\) carat, with Color “D”, Clarity “IF” and Cut “Ideal” at $7413 with a 95% prediction range of $5857 to $9397.
These data were taken in 2010. Since then diamonds have increased in value about 21%. So, in 2015, this range would be: $7087 to $11,370.
A look at Blue Nile.com finds 10 diamonds on Oct. 7, 2015 with these characteristics ranging from $8905 to $11,230.
We’ve seen that looking at only two variables at a time can lead to incorrect and often silly conclusions.
Even though complex models can never prove causality, they can offer explanations that are more plausible that relying on looking at one variable at a time.
We’ve built the model through a series of steps:
Examining the Distribution of a Variable
Summarizing the relationship between two quantitative variables using a linear regression
Adding complexity by using indicator variables to adjust the intercept of the regression for each level of a categorical variable
Using interaction terms (multiplying an indicator variable by a quantitative variable) to create a more complex and realistic model that offers different intercepts and slopes for each level of the categorical variables
Model selection is as much art as science. There are many models that may be useful both for prediction as well as explanation, but no single model is ever “correct”