Gold has been the standard precious commodity in financial futures markets for decades, but diamonds may soon give it a run for its money. Martin Rapaport, a diamond exchange executive, is hoping to launch a futures market in diamonds in late 2016 or early 2017 (see http://www.marketwatch.com/story/forget-gold-diamonds-may-be-the-next-big-thing-in-the-futures-market-2015-08-04?siteid=bigcharts&dist=bigcharts).

Gold pricing is straightforward. Although there are slight variations in purity (some coins are only 91.7% pure, or 22 karat, while bars and other bullion coins are 99.99% pure, or 24 karat), all prices are based on one (constantly changing) theoretical price called the spot price.

But diamonds are different. With gold, if you buy a bar 10 times as heavy as another, you’ll pay about 10 times as much. With diamonds, it’s more complicated. Not only is the size (carat weight) important, but anyone who’s ever thought about buying a diamond knows about the four C’s: Carat, Cut, Color and Clarity.

The relationship of Price with each is fairly obvious:

Big diamonds are more expensive (but not necessarily linearly)

Colorless diamonds are more expensive (D, E and F are colorless, followed by G through K)

Clarity matters too (Internally Flawless down to Included)

Cut (Ideal, Very Good, Good, Fair, Poor)

But which matters most? What are the tradeoffs?

We have data on some 2,690 diamonds that were “scraped” off the web by Lou Valente of JMP. We want to explore the relationship between each of the four C’s and the price of a diamond. We’ll also build a model to see how well we can predict the price of a diamond knowing the four C’s.

```
Diamonds <- read.delim("http://sites.williams.edu/rdeveaux/files/2014/09/Diamonds.txt")
Diamonds <- Diamonds[, c(8, 1, 2, 3, 6)]  # Keep only Price and the 4 C's
```
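Selecting columns by position works only as long as the file layout never changes. As a minimal sketch, the same subset can be written by name. The toy data frame below is made up for illustration; the `Extra*` columns are hypothetical stand-ins for the columns that get dropped, and only the five kept names come from the data set.

```r
# Toy stand-in for the raw file (values and Extra* columns are hypothetical)
raw <- data.frame(
  Carat.Size = c(0.5, 1.0),
  Color      = c("E", "H"),
  Clarity    = c("VS1", "SI2"),
  Extra4     = NA,
  Extra5     = NA,
  Cut        = c("Ideal", "Good"),
  Extra7     = NA,
  Price      = c(1500, 4200)
)

# Selection by position, as in the code above
by_index <- raw[, c(8, 1, 2, 3, 6)]

# Selection by name: same result, but robust to reordered columns
by_name <- raw[, c("Price", "Carat.Size", "Color", "Clarity", "Cut")]

print(names(by_name))
```

Selecting by name also documents which variables are kept without needing the trailing comment.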

Some of our goals for this study include building and reinforcing skills for:

* Examining the distribution of a variable
* Comparing groups via boxplots and summary statistics
* Summarizing the relationship between variables using multiple regression, including interaction effects
* Selecting a model, that is, deciding which variables to include in the final model

We start by exploring all the variables.

```
options(width=100)
summary(Diamonds)
```

```
##      Price         Carat.Size        Color        Clarity           Cut
##  Min.   : 1000   Min.   :0.3000   E      :504   IF  :144   Excellent:1276
##  1st Qu.: 1801   1st Qu.:0.6000   F      :431   SI1 :624   Good     : 165
##  Median : 3604   Median :0.9000   G      :396   SI2 :530   Ideal    : 185
##  Mean   : 3971   Mean   :0.8701   H      :394   VS1 :392   Very Good:1064
##  3rd Qu.: 5544   3rd Qu.:1.0600   I      :316   VS2 :460
##  Max.   :10000   Max.   :2.0200   D      :277   VVS1:269
##                                   (Other):372   VVS2:271
```

Order the categorical variables:

```
# Levels are listed from most to least desirable
Diamonds$Color   <- ordered(Diamonds$Color, levels = c("D", "E", "F", "G", "H", "I", "J", "K"))
Diamonds$Clarity <- ordered(Diamonds$Clarity, levels = c("IF", "VVS1", "VVS2", "VS1", "VS2", "SI1", "SI2"))
Diamonds$Cut     <- ordered(Diamonds$Cut, levels = c("Ideal", "Excellent", "Very Good", "Good"))
```
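To see what ordering buys us, here is a small sketch on a made-up vector of color grades (hypothetical values, not the scraped data). Without explicit levels, R sorts factor levels alphabetically; with `ordered()`, tables and plots follow the grading scale and grades can be compared directly.

```r
# Toy vector of color grades (hypothetical values)
grades <- c("G", "D", "K", "E", "G")

# ordered() records that D < E < ... < K, from best to worst
color <- ordered(grades, levels = c("D", "E", "F", "G", "H", "I", "J", "K"))

print(levels(color))   # levels in grading order
print(color < "G")     # TRUE for grades better (less colored) than G
print(table(color))    # counts in level order, including empty levels
```

The same idea applies to Clarity and Cut, whose alphabetical order has nothing to do with quality.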

Univariate summaries:

`with(Diamonds,barplot(summary(Color),col="light green"))`

`with(Diamonds,barplot(summary(Cut),col="light green"))`

`with(Diamonds,barplot(summary(Clarity),col="light green"))`

`with(Diamonds,hist(Price,col="lightblue",bty="n",xlim=c(0,10000)))`

From the literature we know that the least colored diamonds (D, E and F) are the rarest and most sought after. Let’s examine the relationship of Price with Color via boxplots.

`with(Diamonds,boxplot(Price~Color,col=c(rep("White",3),rep("Light Yellow",3),rep("Yellow",2)),xlab="Color",ylab="Price"))`

Wait – this looks backward. What happened? Can you think of a reason why the least desirable diamonds in terms of color are the most expensive?

The problem with looking at the simple relationship of one variable with another is that the world is more complex. Why are the least colored diamonds the least expensive when they should be the most expensive? The simplest answer is that they are also the smallest (!).

`with(Diamonds,boxplot(Carat.Size~Color,col=c(rep("White",3),rep("Light Yellow",3),rep("Yellow",2)),xlab="Color",ylab="Size (carats)"))`
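The reversal can be reproduced on a tiny made-up example. The sketch below is not the scraped data: the eight stones and the pricing rule (price rises with carat weight, and J stones sell at a 500 discount) are assumptions chosen only to show how size can mask the effect of color. Group medians play the role of the boxplots, and a regression on both variables is one possible way to compare colors at equal size.

```r
# Toy data (made up): colorless "D" stones are small, tinted "J" stones are large
toy <- data.frame(
  Color = factor(rep(c("D", "J"), each = 4)),
  Carat = c(0.3, 0.4, 0.5, 0.6, 0.9, 1.0, 1.1, 1.2)
)

# Assumed pricing rule: price grows with size, and J sells at a 500 discount
toy$Price <- 5000 * toy$Carat - ifelse(toy$Color == "J", 500, 0)

# Group summaries: J has the higher median price, but also the higher median size
med_price <- with(toy, tapply(Price, Color, median))
med_carat <- with(toy, tapply(Carat, Color, median))
print(med_price)
print(med_carat)

# Adjusting for size: the ColorJ coefficient is the price gap at equal carat weight
fit <- lm(Price ~ Carat + Color, data = toy)
print(coef(fit))
```

Looked at alone, J appears more expensive than D; once carat weight is in the model, the J coefficient is negative, which matches the discount built into the toy prices.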