Zillow

The website Zillow estimates the home prices for over 100,000,000 homes around the United States. (Well, actually they call them Zestimates.) In their own words,“We use proprietary automated valuation models that apply advanced algorithms to analyze our data to identify relationships within a specific geographic area, between this home-related data and actual sales prices. Home characteristics, such as square footage, location or the number of bathrooms, are given different weights according to their influence on home sale prices in each specific geography over a specific period of time, resulting in a set of valuation rules, or models that are applied to generate each home’s Zestimate. Specifically, some of the data we use in this algorithm include:

Physical attributes: Location, lot size, square footage, number of bedrooms and bathrooms and many other details.

Tax assessments: Property tax information, actual property taxes paid, exceptions to tax assessments and other information provided in the tax assessors’ records.

Prior and current transactions: Actual sale prices over time of the home itself and comparable recent sales of nearby homes

Currently, we have data on 110 million homes and Zestimates and Rent Zestimates on approximately 100 million U.S. homes. (Source: Zillow Internal, March 2013) " ———–

This Study

Everyone’s heard the real estate adage that the three most important things in real estate are: location, location and location. And notice that it’s first on Zillow’s list of variables. We’re not going to try to compete with Zillow here. Instead, we’ll make things simpler by keeping the location restricted to Saratoga County, New York. We’ll explore some data on home prices to see if we can come up with our own model. In particular, we’ll study how much more a house can fetch if it has a fireplace in it. We’ll use the prices of 1000 homes collected from public records about 8 years ago in Saratoga County, New York by a student studying at Williams College.

Some of our goals for this study include building and reinforcing skills for


Exploration

We start by exploring all the variables. Of course, we first need to bring in the data. We import the data from a webset using the simple command read.delim(“http://sites.williams.edu/rdeveaux/files/2014/09/Saratoga.txt”).

We can then look at a summary – and we notice that some of the categorical variables (Fuel Type, Waterfront etc) are treated as numerical.

real=read.delim("http://sites.williams.edu/rdeveaux/files/2014/09/Saratoga.txt")  # read in data from the web 
options(width=100)
summary(real)
##      Price           Lot.Size         Waterfront            Age           Land.Value    
##  Min.   :  5000   Min.   : 0.0000   Min.   :0.000000   Min.   :  0.00   Min.   :   200  
##  1st Qu.:145000   1st Qu.: 0.1700   1st Qu.:0.000000   1st Qu.: 13.00   1st Qu.: 15100  
##  Median :189900   Median : 0.3700   Median :0.000000   Median : 19.00   Median : 25000  
##  Mean   :211967   Mean   : 0.5002   Mean   :0.008681   Mean   : 27.92   Mean   : 34557  
##  3rd Qu.:259000   3rd Qu.: 0.5400   3rd Qu.:0.000000   3rd Qu.: 34.00   3rd Qu.: 40200  
##  Max.   :775000   Max.   :12.2000   Max.   :1.000000   Max.   :225.00   Max.   :412600  
##  New.Construct      Central.Air       Fuel.Type       Heat.Type       Sewer.Type     Living.Area  
##  Min.   :0.00000   Min.   :0.0000   Min.   :2.000   Min.   :2.000   Min.   :1.000   Min.   : 616  
##  1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:2.000   1st Qu.:2.000   1st Qu.:2.000   1st Qu.:1300  
##  Median :0.00000   Median :0.0000   Median :2.000   Median :2.000   Median :3.000   Median :1634  
##  Mean   :0.04688   Mean   :0.3675   Mean   :2.432   Mean   :2.528   Mean   :2.695   Mean   :1755  
##  3rd Qu.:0.00000   3rd Qu.:1.0000   3rd Qu.:3.000   3rd Qu.:3.000   3rd Qu.:3.000   3rd Qu.:2138  
##  Max.   :1.00000   Max.   :1.0000   Max.   :4.000   Max.   :4.000   Max.   :3.000   Max.   :5228  
##   Pct.College       Bedrooms       Fireplaces       Bathrooms       Rooms       
##  Min.   :20.00   Min.   :1.000   Min.   :0.0000   Min.   :0.0   Min.   : 2.000  
##  1st Qu.:52.00   1st Qu.:3.000   1st Qu.:0.0000   1st Qu.:1.5   1st Qu.: 5.000  
##  Median :57.00   Median :3.000   Median :1.0000   Median :2.0   Median : 7.000  
##  Mean   :55.57   Mean   :3.155   Mean   :0.6019   Mean   :1.9   Mean   : 7.042  
##  3rd Qu.:64.00   3rd Qu.:4.000   3rd Qu.:1.0000   3rd Qu.:2.5   3rd Qu.: 8.250  
##  Max.   :82.00   Max.   :7.000   Max.   :4.0000   Max.   :4.5   Max.   :12.000

So we need to do some data processing – telling the software that the numbers indicating Fuel Type (2-4), Heat Type (2-4), Waterfront (0 or 1) etc. are categorical. In R, categorical variables are called factors. We can also give the categories labels inside the factor() function.By consulting with the Director of Data Processing of Saratoga County, I was able to assign the actual definitions of the numeric codes to the various levels.

I also created some new variables based on the original variables. The original variable Fireplaces is the actual number of fireplace in the house. I’ve created two variables from it: Fireplace is a \(0/1\) or Yes/No variable indicating presence of a fireplace and the new coding for Fireplaces uses just three levels: “None”,“1” and “2 or more”. Similarly, Beds is an ordered categorical variable created from Bedrooms.

real$Fuel.Type=factor(real$Fuel.Type,labels=c("Gas","Electric","Oil")) # Turn numbers into categories
real$Sewer.Type=factor(real$Sewer.Type,labels=c("None","Private","Public"))
real$Heat.Type=factor(real$Heat.Type,labels=c("Hot Air","Hot Water","Electric"))
real$Waterfront=factor(real$Waterfront,labels=c("No","Yes"))
real$Central.Air=factor(real$Central.Air,labels=c("No","Yes"))
real$New.Construct=factor(real$New.Construct,labels=c("No","Yes"))
real$Fireplace=factor(real$Fireplaces==0,labels=c("No","Yes"))
real$Fireplaces=factor(real$Fireplaces)
real$Fireplace=factor(real$Fireplaces==0,labels=c("Yes","No"))
levels(real$Fireplaces)=c("None","1","2 or more","2 or more","2 or more")
real$Beds=as.factor(real$Bedrooms)
levels(real$Beds)=c("2 or fewer","2 or fewer","3","4 or more","4 or more","4 or more","4 or more") 

Running the summary again, we see that it looks much better:

summary(real)
##      Price           Lot.Size       Waterfront      Age           Land.Value     New.Construct
##  Min.   :  5000   Min.   : 0.0000   No :1713   Min.   :  0.00   Min.   :   200   No :1647     
##  1st Qu.:145000   1st Qu.: 0.1700   Yes:  15   1st Qu.: 13.00   1st Qu.: 15100   Yes:  81     
##  Median :189900   Median : 0.3700              Median : 19.00   Median : 25000                
##  Mean   :211967   Mean   : 0.5002              Mean   : 27.92   Mean   : 34557                
##  3rd Qu.:259000   3rd Qu.: 0.5400              3rd Qu.: 34.00   3rd Qu.: 40200                
##  Max.   :775000   Max.   :12.2000              Max.   :225.00   Max.   :412600                
##  Central.Air    Fuel.Type        Heat.Type      Sewer.Type    Living.Area    Pct.College   
##  No :1093    Gas     :1197   Hot Air  :1121   None   :  12   Min.   : 616   Min.   :20.00  
##  Yes: 635    Electric: 315   Hot Water: 302   Private: 503   1st Qu.:1300   1st Qu.:52.00  
##              Oil     : 216   Electric : 305   Public :1213   Median :1634   Median :57.00  
##                                                              Mean   :1755   Mean   :55.57  
##                                                              3rd Qu.:2138   3rd Qu.:64.00  
##                                                              Max.   :5228   Max.   :82.00  
##     Bedrooms         Fireplaces    Bathrooms       Rooms        Fireplace         Beds    
##  Min.   :1.000   None     :740   Min.   :0.0   Min.   : 2.000   Yes:988   2 or fewer:355  
##  1st Qu.:3.000   1        :942   1st Qu.:1.5   1st Qu.: 5.000   No :740   3         :822  
##  Median :3.000   2 or more: 46   Median :2.0   Median : 7.000             4 or more :551  
##  Mean   :3.155                   Mean   :1.9   Mean   : 7.042                             
##  3rd Qu.:4.000                   3rd Qu.:2.5   3rd Qu.: 8.250                             
##  Max.   :7.000                   Max.   :4.5   Max.   :12.000

Getting Started

A summary provides a good way to see how the variables are treated (as quantitative or categorical) and whether they are indundated with missing values. However, don’t spend too much time examining variables this way – graphical displays, such as histograms and bar charts are a much better way to go.

Our response variable – the variable we’re trying to predict – is house price. Here’s a histogram.

options(scipen=1000)
with(real,hist(Price,col="light blue",nclass=30,main="",xlab="Price ($)"))

From the summary we saw that the median house price is $189,000 the mean is $211,967, and the max is $775,000. Now we see that the prices are skewed to the right with relatively few houses priced above $400K. That skewness explains why the mean is higher than the median. The few houses priced between $400K and $800K have pulled it away from the typical price.


House Size

Clearly, the size of the house is an important factor when considering the price. Here’s a histogram of the sizes (in square feet). How would you describe the distribution of house sizes?

with(real,hist(Living.Area,col="light blue",nclass=30,main="",xlab="Living Area (sq.ft)"))

Histogram of Living Area


Fireplaces

Are houses with fireplaces generally more expensive than houses without? Let’s try a simple comparison of the prices for the two groups – those without and those with fireplaces. A boxplot is a great way to display the differences:

with(real,boxplot(Price~Fireplace,col=c("light pink","blue")))

mean(Price~Fireplace,data=real)
##      Yes       No 
## 239914.0 174653.4
compareMean(Price~Fireplace,data=real)
## Warning: 'compareMean' is deprecated.
## Use 'diffmean' instead.
## See help("Deprecated")
## [1] -65260.61

The houses with fireplaces are more expensive. In fact, they are $65,261 more expensive, on average.

So, is a fireplace worth about $65,000 ? If you’re thinking of selling a house that doesn’t have a fireplace should you invest $50,000 to put one in?


What about other Variables?

Unless we have data from a designed experiment it is impossible to conclude that a change in one variable is the cause for the changes in another. Looking at differences in averages between groups is one of the most common ways to be led to make such false conclusions. While we can never assign cause, we can do better by adjusting for other variables that we know also contribute to the changes. This is the basic idea behind epidemiological, or case/control studies.

For the real estate date, we know that larger houses are genearlly more expensive. Let’s examine the relationship between Price and Living Area with a scatterplot:

with(real,plot(Price~Living.Area,pch=20,col="#0000FF",xlab="Living Area",ylab="Price"))
options(scipen=1, digits=2)
mod0=lm(Price~Living.Area,data=real)
mod0
## 
## Call:
## lm(formula = Price ~ Living.Area, data = real)
## 
## Coefficients:
## (Intercept)  Living.Area  
##       13439          113
abline(mod0)

If we model the relationship between Price on Living Area with a linear regression, the model shows a slope of 113.10 which means that each additional square foot is associated with a Price increase of about $113.10:

Are the houses with fireplaces evenly distributed in this scatter plot? Let’s color the ones with fireplaces red:

colors=rep("pink",length(real$Fireplace))
colors[real$Fireplace=="No"]="#0000FF"
colors[real$Fireplace=="Yes"]="#FF0000"
with(real,plot(Price~Living.Area,pch=20,col=colors,xlab="Living Area",ylab="Price"))
abline(mod0)

Aha. So the price goes up with Living Area and the houses with fireplaces tend to be bigger.

Maybe that’s some (or most) of the increase in Price that we’re seeing.

Let’s fit a model that adds a term for Fireplace to the simple regression:

\[ y_i = \beta_0 + \beta_1 x_{1i} + \epsilon_i\]

(Here \(x_1\) is Living Area)

The easiest approach might be to split the data into two groups and fit a linear regression to each, but using indicator variables has some important advantages. An indicator variable is simply a two-valued variable that labels houses with fireplaces with a \(1\) and those without with a \(0\). Let’s call it \(x_2\). So:

\[x_{2i} = \begin{cases} 1 &\mbox{if house i has a Fireplace} \\ 0 & \mbox{if not}\end{cases}\]

So now the model looks like this:

\[y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \epsilon_i\]

Or, after we fit it:

\[\hat{y}_i = b_0 + b_1 x_{1i} + b_2 x_{2i}\]

In fact, we actually have two models:

\[y_i = \begin{cases} ~\beta_0 + ~~~~~~~~~~~ \beta_1 x_{1i} + \epsilon_i &\mbox{if no Fireplace}\\ (\beta_0 + \beta_2) + \beta_1 x_{1i} + \epsilon_i & \mbox{if house i does have a Fireplace} \end{cases}\]

Or in other words: \[ \widehat{Price} = b_0 + b_1 Living Area + b_2 (Fireplace == "Yes") \]

which, for houses without fireplaces is the same as the original model (since the variable Fireplace == “Yes” has value 0): \[ \widehat{Price} = b_0 + b_1 Living Area \]

For houses with fireplaces there’s a “different” model:

\[ \widehat{Price} = (b_0 + b_2) + b_1 Living Area \]

Because the slope on Living Area is the same, we have two parallel lines. The coefficient \(b_2\) isn’t a “slope” at all, but the adjustment to the intercept for the houses with fireplaces. It’s the constant difference between the two parallel lines, each with slope \(b_1\).


mod1=lm(Price~Living.Area+(Fireplace=="Yes"),data=real)
options(scipen=1,digits=2)
mod1
## 
## Call:
## lm(formula = Price ~ Living.Area + (Fireplace == "Yes"), data = real)
## 
## Coefficients:
##            (Intercept)             Living.Area  Fireplace == "Yes"TRUE  
##                  13599                     111                    5567
colors=rep("pink",length(real$Fireplace))
colors[real$Fireplace=="No"]="#0000FF"
colors[real$Fireplace=="Yes"]="#FF0000"
with(real,plot(Price~Living.Area,pch=20,col=colors,xlab="Living Area",ylab="Price"))
abline(mod1,col="blue")
## Warning in abline(mod1, col = "blue"): only using the first two of 3 regression coefficients
abline(c(19167+5567,111),col="red")

So, it seems that a fireplace might adds about $5567, on average to the Price, not $65,000 if we adjust for the size of the house. We can see this as the difference between the two slopes, one for houses with fireplaces, and one for those without.


Do we need the intercept adjustor?

We can test to see if we need \(\beta_2\) just like any coefficient by looking at

\[t_{n-2} = \frac{b_2-0}{se(b_2)}\]

just as we did for the slope and intecept in simple regression.

summary(mod1)
## 
## Call:
## lm(formula = Price ~ Living.Area + (Fireplace == "Yes"), data = real)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -271421  -39935   -7887   28215  554651 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            13599.16    4991.70    2.72   0.0065 ** 
## Living.Area              111.22       2.97   37.48   <2e-16 ***
## Fireplace == "Yes"TRUE  5567.38    3716.95    1.50   0.1344    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 69100 on 1725 degrees of freedom
## Multiple R-squared:  0.508,  Adjusted R-squared:  0.508 
## F-statistic:  891 on 2 and 1725 DF,  p-value: <2e-16

Actually, it appears that perhaps we didn’t need it at all (with a P-value of 0.13) – but wait.


Are the slopes the same?

Of course, in this model we’ve assumed that the difference in price between houses with and without fireplaces is constant, no matter what their size. Let’s let this assumption go by adding an interaction term and adding it to the model. We do that by multiplying the indicator variable by the quantitative term:

\[ \widehat{Price} = b_0 + b_1~ Living Area + b_2 (Fireplace == "Yes") + b_3~ Living_Area *(Fireplace=="Yes") \]

Again, for the houses without fireplaces, both the \(x_2\) and \(x_3\) terms will be zero, so, for them, we’ll still have: \[ \widehat{Price} = b_0 + b_1 Living Area \]

But for houses with fireplaces Fireplace ==“Yes” is \(1\), and so: \[ \widehat{Price} = (b_0 + b_2) + (b_1 + b_3) ~ Living Area \]

The term \(b_2\) acts (as before) as an intercept adjuster for houses with fireplaces, and the new (interaction) term \(b_3\) acts as a slope adjuster:

mod2=lm(Price~Living.Area+Fireplace+Living.Area*Fireplace,data=real)
options(scipen=1,digits=2)
summary(mod2)
## 
## Call:
## lm(formula = Price ~ Living.Area + Fireplace + Living.Area * 
##     Fireplace, data = real)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -241710  -39588   -7821   28480  542055 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              3290.88    7330.60    0.45  0.65354    
## Living.Area               119.22       3.53   33.82  < 2e-16 ***
## FireplaceNo             37610.41   11024.85    3.41  0.00066 ***
## Living.Area:FireplaceNo   -26.85       6.46   -4.16 0.000034 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 68800 on 1724 degrees of freedom
## Multiple R-squared:  0.513,  Adjusted R-squared:  0.512 
## F-statistic:  605 on 3 and 1724 DF,  p-value: <2e-16

All the terms are significant with very small P-values, so it seems that this added complexity is justifed by the data.

This is an important lesson. It’s often true that a more complex model will reveal insights and patterns that a simpler model fails to show. By adding a term allowing the slopes to vary we see what happened to the constrained model.

What does this model look like?

colors=rep("pink",length(real$Fireplace))
colors[real$Fireplace=="No"]="#0000FF"
colors[real$Fireplace=="Yes"]="#FF0000"
with(real,plot(Price~Living.Area,pch=20,col=colors,xlab="Living Area",ylab="Price"))
coefs=coefficients(mod2)
abline(c(coefs[1],coefs[2]),col="red")
abline(c(coefs[1]+coefs[3],coefs[2]+coefs[4]),col="blue")

So, how much is a fireplace worth? Well, we still can’t assign causality, but it certainly seems that the answer is at least “it depends”. Fireplaces are associated with generally higher prices for large houses, and the bigger the house, the more a fireplace seems to increase the price, but for smaller houses it seems that the prices are about the same with fireplaces and without. (The lines actually appear to cross, but it would be unwise to infer that, in fact, a fireplace in a small house is associated with a decrease in price. It’s probably not a statistically signficant difference.)

(After thought – if we have a categorical variable with more than two levels we can extend this, but in a non-obvious way. For example if we have three fuel types (Gas, Oil and Electric), we could create two indicators \(x_2\) and \(x_3\) to separate the houses with Oil from those with Gas and another to separate those with Electric from Gas. As with the no fireplace houses, one level (here we’ve chosen Gas) becomes the “default, and all other category levels are compared to it. If that’s undesirable there are even other coding schemes that can make all levels compare to the mean, but that’s a story for another day.)


What have we Learned?

We’ve seen that basing our analysis on simple models and simple averages can lead to incorrect conclusions.

Even though complex models can never prove causality, they can offer explanations that are more plausible that relying on looking at one variable at a time.

We’ve built the model through a series of steps: