Internet search engines, such as Google, Bing, and Ask, keep copies of Web pages so that when you make a query, they can quickly search their stored pages and return their findings to you. This collection of saved pages is called a Web cache. Of course, if a page has changed since it was last stored, then the search engine serves a stale page. In order to keep the cache fresh, Web pages need to be visited regularly and the cache updated with any changed pages. How often do Web pages change? How often should the sites be visited to keep the cache fresh? In this article, we consider these questions. This case study is based on the statistical analyses described in [1-3].

This Study

To help answer these questions, we study the behavior of 1,000 Web pages. Each of these pages was visited every hour for 30 days. At each visit, the page was compared to the version from the previous visit, and if it had changed, the cache was updated and the time of the visit was recorded.

For example, consider the hypothetical set of 8 hourly visits to a Web page shown in the figure below. The top horizontal line shows when the page actually changed and the bottom line shows when a change was detected by a visit to the site. We see that the page changed 7 times, as denoted by the 7 dots that appear along the top line at the times when the page changed. Notice that the page changed twice between hours 3 and 4, but when we visit the page, we observe only that the page has changed between these times, not that it has changed twice. Likewise, the page changed 3 times between hours 4 and 5, and all we can ascertain is that the page changed one or more times between these two visits. This means that in our 8 visits, we observed the page had changed on 4 visits. The data recorded for this page would be the visits on which it changed, i.e., 1, 2, 4, 5.
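
To make the example concrete, here is a small R illustration; the change times below are hypothetical values chosen to match the figure:

changeTimes = c(0.4, 1.6, 3.1, 3.8, 4.2, 4.5, 4.9)  # 7 hypothetical change times (hours)
# An hourly visit detects at most one change, so round each change time up to
# the next visit and keep only the distinct visit numbers.
unique(ceiling(changeTimes))
## [1] 1 2 4 5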

This simple example demonstrates three important features of our data:

  • The data recorded for a page are the visits at which a change was observed, and since different pages change different numbers of times, these records have different lengths, i.e., the data are non-rectangular.
  • We can detect only whether a page changed between two visits, not how many times it changed, i.e., the data are censored.
  • Because multiple changes between visits are recorded as a single observed change, counting the visits with observed changes understates the true number of changes.

These features of the data impact our analysis. The notion of non-rectangular data simply impacts how the data are represented in the computer and how we write code to analyze the data. The notion of censoring affects our analysis: we find that if we ignore the censoring, then our estimates systematically underestimate the rate of change. We return to the question of how to estimate a Web page’s rate of change after we have had the opportunity to explore the data from the study.

Exploration

We load the data into our R session as follows:

load(url("http://www.stat.berkeley.edu/users/nolan/data/stat101/pages.rda"))
names(pages)
## [1] "domain"         "numVisits"      "visitsWChanges"

We have loaded the object called pages, and it contains domain, numVisits, and visitsWChanges. The domain variable contains the last part of the domain name for the Web page, e.g., .com, .edu, .gov, which identifies the type of organization that owns the page or the country where the domain is registered. We examine the first few values in domain:

head(pages$domain)
## [1] "net" "jp"  "com" "net" "ca"  "de"

We see that the first domain is .net, which is short for network; the next is .jp, which means the domain is registered in Japan; similarly, .ca and .de are for domains registered in Canada and Germany, respectively.

Next, we examine summary statistics for numVisits. This variable contains the number of visits made to the Web page. Each page was supposed to be visited hourly for 30 days, which in this case means that there were \(24 \times 30\), or 720 visits. Since the first visit has no previous version of the page for comparison, we have only 719 subsequent visits (or revisits) where we may observe a change.

summary(pages$numVisits)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     5.0   368.8   592.0   521.7   719.0   719.0

We see from the summary that at least one page was visited only 5 times and that half of the pages were visited fewer than about 600 times. A page falls short of 719 visits when some of its scheduled visits fail, i.e., when the page is not successfully crawled.

Let’s next examine visitsWChanges. This is a slightly unusual data structure because one Web page may have changed a different number of times than another. For example, let’s look at the values for visitsWChanges for the first two elements in pages:

pages$visitsWChanges[1:2]
## [[1]]
##  [1]  35 134 155 157 177 204 314 315 319 350 366 369 371
## 
## [[2]]
## [1] 552 604 672

And to help us interpret these values, we also take note of the first two values of numVisits:

pages$numVisits[1:2]
## [1] 378 707

We see that a change was observed on 13 visits to the first page; the first change was observed on the 35th visit, the second on the 134th visit, and the last on the 371st visit. Note that this page was only visited 378 times, according to numVisits. As for the second page, only 3 changes were observed, at visits 552, 604, and 672, and from numVisits we see that the page was visited 707 times.

This special data structure for visitsWChanges is an example of a non-rectangular data structure. The values for a page are the times of the visits when a change is observed, and since there is potentially a different number of values for each page, we store these data in R as a list of arrays of differing lengths. Of course, we could arrange the data in a different way where, for example, we have one record for every observed change in a Web page. Then a Web page corresponds to several records in the data set, so we would also need to keep track of which records belong to each page. Both data structures are reasonable, but we find that the “ragged array” format is simpler for our analysis.
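
For comparison, here is a sketch of the alternative arrangement mentioned above, with one record per observed change (the names longFormat and pageID are ours):

# One row per observed change; pageID records which page each change belongs to.
longFormat = data.frame(
  pageID = rep(seq_along(pages$visitsWChanges),
               sapply(pages$visitsWChanges, length)),
  visit  = unlist(pages$visitsWChanges))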

There are special functions available in R to work with lists. For example, with the sapply() function we can apply a function to each element of a list. If, say, we want to know how many times a change was observed for each Web page, then we can apply the length() function to each of these arrays. We do this as follows and save the return value as numChanges:

numChanges = sapply(pages$visitsWChanges, length)
head(numChanges)
## [1]  13   3 385  27 185  17

We find that we get the expected result for the first two pages, i.e., the first page had 13 changes and the second had 3 changes. The following numerical summary and histogram of numChanges shows us that some Web pages never changed and others changed on every visit.

summary(numChanges)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0    19.0    99.0   182.8   296.5   719.0
hist(numChanges, breaks = 25, main = "", 
     xlab = "Number of visits where the Web page changed")

We see from the histogram of the number of changes that this distribution is highly skewed. A scatter plot of the number of changes against the number of visits to a page has some interesting features:

plot(numChanges ~ pages$numVisits, xlab ="Number of visits to a Web page",
     ylab = "Number of visits where a change was observed")

Let’s narrow our focus to pages that were visited at least 700 times, and examine the distribution of the number of changes. The histogram has roughly the same shape as the histogram of all 1,000 pages, although the small peak at the end of the distribution is more pronounced.

hist(numChanges[pages$numVisits > 700], breaks = 25, main ="",
     xlab = "Number of visits where the Web page changed")

Estimating How Often a Web Page Changes

To figure out how often we need to revisit a page to keep the Web cache fresh, we need to estimate the typical rate at which the page changes. What is a reasonable approach to estimating this quantity? Let’s examine a few pages, say the 2nd, 5th, and 98th pages in our data set, to get a sense of what we might do. (We chose these particular pages because they offer a range of behaviors.) For each page, let’s look at both the number of visits made and the number of visits where a change was observed. These are:

pages$numVisits[ c(2, 5, 98) ]
## [1] 707 719 719
numChanges[ c(2, 5, 98) ]
## [1]   3 185 719

We see that all three pages were visited all 719 times or nearly so (707 visits for page 2). However, for the 2nd page a change was observed on only 3 visits, for the 5th on 185 visits, and for the 98th on every visit.

Let’s consider page 98 first. Since the page changed between every visit, it seems plausible that it changed more than once between visits. We can say that this page changed at least once an hour, but we can’t be more precise because our data are not granular enough. We never observed a time when the page had not changed. Can you imagine what kinds of sites change that frequently? Possibly a news site. In these cases, we want to visit the page more often to keep the cache as fresh as possible. We cannot determine how often to visit, given our data; we need to sample this page more frequently than once an hour in order to make an accurate estimate of the frequency of change for that page.

On the other hand, we have page 2, where a change was observed on only 3 visits in the 30 days: visits 552, 604, and 672. It seems reasonable to estimate that this page changed once every 10 days, i.e., 3 changes in 720 hours is equivalent to 1 change in 240 hours, or 1 change per 10 days. Here, we have used the ratio of the number of visits where a change was observed (3) to the length of observation (707 hours, which we round to 720 to simplify the calculation) to estimate the rate of change to be 1 change per 240 hours, or reciprocally, an average of 240 hours between changes. We can apply the same approach to the 5th page to estimate its rate of change as 185 observed changes / 719 hours, or about 1 change per 4 hours.
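
This simple ratio estimate is easy to compute for every page at once; a short sketch (rateNaive is our own name for it):

# Observed changes per hour of observation, for each page.
rateNaive = numChanges / pages$numVisits
# For the three example pages this gives roughly 0.0042, 0.26, and 1 change per
# hour, i.e., about one change every 10 days, every 4 hours, and every hour.
rateNaive[ c(2, 5, 98) ]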

The diagram of changes to an example Web page in the first section shows the problem with this estimator. In that simple example, we estimate 4 changes in 8 hours, or 1 change in 2 hours. However, there were 7 changes in that time period and 7/8 changes per hour is clearly a better estimate. Unfortunately, we can detect at most one change in a visit. When a page changes 2, 3, or more times between visits, we only record that it has changed since the previous visit. We are systematically underestimating the change rate. This means that our estimator is biased. If the bias is small, then this won’t matter much. How can we tell the size of the bias? If we have a model for when a change occurs, then we can explore this question. We do this in the next section.

A Model for Web Page Changes

The Poisson arrival process is often used to model a sequence of events that happen randomly and independently with a fixed rate over time. For instance, the occurrences of auto accidents along a particular stretch of road or the arrivals of customers at a service counter are often modeled by a Poisson arrival process. We can try modeling the changes in a Web page with this type of random process. We can generate random page changes using this model and examine the effect on our estimator of observing changes at regular intervals, rather than when they actually happen. That is, we can carry out a simulation study of the bias in our naive estimator.

How do we simulate data from the Poisson arrival process? We need to specify the length of time for which we want to observe the process and the rate at which events occur. For example, in our case, our interval of observation is 719 hours. We don’t know the rate, so we can study several different rates, e.g., 1 event per hour, 1 event per 4 hours, or 4 events per hour. With these two pieces of information, the length of time and the rate, we can generate a set of events, e.g., times when the Web page changes. Then from these events, we can simulate the kind of data that we collect, i.e., the number of times a change is detected when visiting the page at regular one hour intervals. Since we know the true rate of change in our simulation, we can compare it to our estimate of the rate of change from the simulated data, i.e., the ratio of the number of visits where a change is detected to the total number of visits. In a simulation study, we generate thousands of these random Poisson arrival processes and examine the distribution of the corresponding estimates.

We have written a function called simPois() for generating data from the Poisson arrival process. This function has two arguments, T the length of time the process is observed, and rate the rate at which events occur per unit of time (e.g., per hour).
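
The definition of simPois() is not shown here; a minimal sketch of how it might be written, using the fact that the times between arrivals in a Poisson process are independent exponential random variables with the given rate, is:

# Generate the arrival times of a Poisson arrival process with the given rate on [0, T].
simPois = function(T, rate) {
  # Draw a batch of exponential inter-arrival times that is very likely long
  # enough, extending it if the last arrival has not yet passed T.
  n = ceiling(T * rate + 4 * sqrt(T * rate) + 10)
  arrivals = cumsum(rexp(n, rate = rate))
  while (arrivals[length(arrivals)] < T) {
    arrivals = c(arrivals, arrivals[length(arrivals)] + cumsum(rexp(n, rate = rate)))
  }
  arrivals[arrivals <= T]
}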

Let’s simulate the process with T = 720 and rate = 4. We can run 2000 simulations with a call to replicate(), as follows:

set.seed(1234)
Time = 720
poisSamps = replicate(2000, simPois(T = Time, rate = 4), simplify = FALSE)

Before we modify the data to resemble our actual data collection process, let’s see how well we can estimate the rate using the Poisson arrival data. For each simulated process, we divide the number of observed events by Time. This gives us 2000 estimates of the rate (which we know to be 4). We can then compare the average of these 2000 estimates to the truth and we can examine the distribution of these estimates.

estimates = sapply(poisSamps, length) / Time
mean(estimates)
## [1] 3.998133
sd(estimates)
## [1] 0.07285877
hist(estimates, breaks = 25, 
     main="Simulation of 2000 Poisson arrivals\n for time = 720 and rate = 4",
     xlab = "Estimates")

When we observe the actual events, the average of our estimates is very close to the true rate of 4. The distribution looks approximately normal and the spread is about 0.07.

What happens when we only observe whether one or more events happened in an hour? We can convert the original samples to the kind of data that we actually collect in our study by rounding each value up to the nearest hour and dropping duplicates. We do this with

obsSamps = lapply(poisSamps, function(samp) unique(ceiling(samp)))

Now, we can compute our estimates for the rate and examine their behavior.

estimatesInterval =  sapply(obsSamps, length) / Time
mean(estimatesInterval)
## [1] 0.9815882
sd(estimatesInterval)
## [1] 0.004934177
hist(estimatesInterval, breaks = 30, 
     main="Simulation of 2000 Poisson arrivals
     Observed hourly (time = 720 and rate = 4)", 
     xlab = "Estimates")

We see that the average of the 2000 estimates is about 1, which is far from 4. This makes sense: we can only observe whether a change occurred in each hour, and with a rate of 4, it is highly likely that at least one event happens in nearly every hour. Likewise, the SD of the estimates is only 0.005 because most of the 2000 simulations lead to nearly the same estimate, a rate close to 1.

Let’s consider the case when the rate of change is slower than the frequency of observation, say 1 change per 4 hours. We can rerun our simulation with this new value for the rate:

set.seed(1234)
poisSamps = replicate(2000, simPois(T = Time, rate = 0.25), simplify = FALSE)
obsSamps = lapply(poisSamps, function(samp) unique(ceiling(samp)))

estimates =  sapply(poisSamps, length) / Time
mean(estimates)
## [1] 0.2502681
sd(estimates)
## [1] 0.01839696
estimatesInterval =  sapply(obsSamps, length) / Time
mean(estimatesInterval)
## [1] 0.2211549
sd(estimatesInterval)
## [1] 0.01525581
hist(estimatesInterval, breaks = 30, xlab = "Estimates",
     main="Simulation of 2000 Poisson arrivals 
     Observed hourly (time = 720 and rate = 1/4)")

Now we see that the bias is quite a bit smaller: the average estimate for the censored data is 0.22 and the SD is 0.015. In comparison, when we observe the full data, the average of the estimates is 0.250 and the SD is 0.018.

Mean Square Error

When we study the behavior of an estimator, we typically examine the following three properties:

  • the expected value of the estimator, which we compare to the true value to assess the bias;
  • the standard deviation (SD) of the estimator;
  • the sampling distribution of the estimator.

With a simulation study, we can approximate these three quantities by repeating the process of generating the data and calculating the estimate thousands of times. Then, we use the average of these estimates to approximate the expected value, the SD of these estimates to approximate the SD of the estimator, and the histogram of sample estimates to approximate the sampling distribution of the estimator. If we repeat the simulation study again, we are likely to get very similar results. For example, we compare the following results from another 2000 simulations to our earlier results.

set.seed(567890)
poisSamps = replicate(2000, simPois(T = Time, rate = 0.25), simplify = FALSE)
estimates =  sapply(poisSamps, length) / Time
mean(estimates)
## [1] 0.2504014
sd(estimates)
## [1] 0.01845685
obsSamps = lapply(poisSamps, function(samp) unique(ceiling(samp)))
estimatesInterval =  sapply(obsSamps, length) / Time
mean(estimatesInterval)
## [1] 0.2215535
sd(estimatesInterval)
## [1] 0.01524434

Bias and Variance trade-off

Our simulation studies have shown us that while our estimator can be quite biased, its SD tends to be smaller than the SD of the estimator based on the actual times of the events. We sometimes combine the variance of the estimator and the bias into one quantity, called the mean square error, to give a single value that balances the trade-off between bias and variance. The mean square error is, as its name suggests, the expected value of the squared error between the estimator and the true value, i.e., \[ MSE(estimator) = E(estimator - true~value)^2. \] The mean square error, or MSE for short, can be written as the sum of the variance of the estimator and the squared bias. That is, \[ E(estimator - true~value)^2 = E(estimator - E(estimator))^2 + (E(estimator) - true~value)^2,\] so \[ MSE(estimator) = Variance(estimator) + Bias(estimator)^2.\]
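
With simulated estimates in hand, we can also approximate the MSE directly and check that it agrees with the bias-variance decomposition; a small sketch (the name mse is ours):

# Approximate the MSE as the average squared error of the simulated estimates.
# (This averages over n simulations while var() divides by n - 1; with 2000
# simulations the difference is negligible.)
mse = function(estimates, trueValue) mean((estimates - trueValue)^2)

mse(estimatesInterval, 1/4)   # estimator based on the censored data
mse(estimates, 1/4)           # estimator based on the fully observed data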

When the expected value of the estimator matches what we are estimating, i.e., the true value, then the bias is 0 and we say the estimator is unbiased. In this case, the MSE reduces to the variance of the estimator. In our situation, our estimator is biased, and the bias can contribute a significant portion of the MSE. We compute the MSE for our estimator for a rate of \(1/4\).

(mean(estimatesInterval) - 1/4)^2 + var(estimatesInterval)
## [1] 0.001041593

And we compare it to the MSE of the estimator if we had observed the actual times when the page changed.

(mean(estimates) - 1/4)^2 + var(estimates)
## [1] 0.0003408165

The MSE of our estimator is about three times the MSE of the estimator for the uncensored data; the excess comes entirely from the bias, since the variance of the censored estimator is actually the smaller of the two. We can further study the bias, variance, and MSE of our estimator. For example, we might want to know how the bias behaves as a function of the rate, and how it behaves as we increase the time over which we observe the process. From our initial observations about the data collection process, we make the following conjectures:

  • If the rate is small compared to the frequency of observation then the bias probably doesn’t matter much.
  • As the length of observations increases, the bias does not diminish because we are still observing the process at regular intervals and not when changes/events occur.

We can confirm these conjectures with simulation studies. We leave this to the reader.

Is there a way to fix this problem? Working with the Poisson arrival process, we consider an alternative estimate of the rate in the next section.

An Alternative Estimate for the Rate of Change

Our current estimate for the rate is simply the number of visits when a change is observed divided by the total time the page is visited, but can we do better? With a little more knowledge of the Poisson arrival process, we can improve our estimator to take into account the censoring of the data. We can examine how the rate enters into the chance of observing a change (one or more changes to be more precise) on a visit.

Let’s rethink the problem by casting it in a slightly different way. Our data consist of the visits when a change was observed. Consider again the simple hypothetical example of 7 changes in 8 hours shown earlier, which we re-display below. After the initial visit, we revisited the Web page 8 times at times 1, 2, …, 8 and observed changes at times 1, 2, 4, and 5. We can think of these observations as a set of 8 values that are 0 when no change is observed and 1 when a change is observed. In this simple case, the sequence is: 1, 1, 0, 1, 1, 0, 0, 0. Under the Poisson arrival process, these observations are independent because they correspond to disjoint intervals in time. They are like independent Bernoulli trials with the same chance of success on each trial (here a success is at least one change in an hour).

Typically, we estimate the chance of success by the proportion of successes in our observations. In our simple case, that is again 4/8. However, our model gives us a connection between the chance of success and the rate of a page change, which we can use to improve our estimate. We first need a little more information about the Poisson arrival process.

Background on the Poisson Arrival Process

With a Poisson arrival process, the number of events occurring in an interval follows the Poisson distribution. For example, if \(\lambda\) stands for the rate, e.g., the expected number of changes per hour, then in a one hour interval, the number of events follows a Poisson(\(\lambda\)) distribution. That is, the chance there are no events in the interval is \[ P(0) = \frac{\lambda^0}{0!} e^{-\lambda} = e^{-\lambda}. \] The chance there is one event is \[ P(1) = \frac{\lambda^1} {1!} e^{-\lambda} = \lambda e^{-\lambda}, \] and, in general, the chance there are \(k\) events, for \(k \ge 0\), is \[ P(k) = \frac{\lambda^k} {k!} e^{-\lambda}.\] Furthermore, if we count the number of events in an interval of length \(t\) units of time, then the count follows a Poisson(\(t\lambda\)) distribution and the chance there are \(k\) events, for \(k \ge 0\), is \[ P(k) = \frac{(t\lambda)^k} {k!} e^{-t\lambda}.\]
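
These probabilities are available in R through dpois(); as a quick check of the formulas (our own snippet), for a rate of \(\lambda = 1/4\) changes per hour:

lambda = 1/4
# P(0), P(1), and P(2) changes in a one hour interval, from dpois() ...
dpois(0:2, lambda = lambda)
# ... and the same probabilities computed directly from the formula above.
exp(-lambda) * lambda^(0:2) / factorial(0:2)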

We can use these probabilities to estimate the chance of observing a change on a visit to a Web page. This is the chance of 1 or more changes between successive visits, which is the complement of no changes between visits. That is, the chance of observing a change on a visit is \[P(\ge 1 ~change) = 1 - P(0 ~changes) = 1 - e^{-\lambda}.\] We have been estimating this chance by the number of changes divided by the length of time observed, i.e., the observed proportion of visits with a change, \[ Observed~proportion \approx 1 - e^{-\lambda}.\] We can use this relationship to solve for an estimate of \(\lambda\), which we call \(\hat{\lambda}\), \[ \hat{\lambda} = - \log(1 - Observed~proportion).\] There is a slight problem with this estimator: if we observe a change on every visit, then the observed proportion is 1 and our estimate is \(-\log (0)\), which is infinite. We can adjust the estimator slightly to avoid this problem as follows: \[ \hat{\lambda} = - \log(1 - \frac{n}{n + 0.5}Observed~proportion),\] where \(n\) is the total number of visits. This is a standard approach for handling problems with estimators at the boundary of the possible values.
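
As a quick sanity check, we can code this adjusted estimator as a small function and apply it to the simple example from the first section (the name estRate is ours):

# Adjusted rate estimate: k = number of visits with an observed change,
# n = total number of (hourly) visits.
estRate = function(k, n) -log(1 - (n / (n + 0.5)) * (k / n))

# For the hypothetical example of 4 observed changes in 8 hourly visits, this
# gives about 0.64 changes per hour, compared with the naive estimate of
# 4/8 = 0.5; recall that the true rate in that example was 7/8 changes per hour.
estRate(4, 8)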

Let’s try our new estimator. We can return to our simulation study and see how well this new estimator does in comparison to the old estimator and to the estimator when we observe all changes. Recall that poisSamps contains the 2000 samples, and obsSamps contains the censored data like that which we observe. Also, estimates holds the 2000 estimates of \(\lambda\) using the fully observed data. Now, we can use obsSamps a little differently to create an estimate of the rate, \(\lambda\),

estimatesMom = 
  sapply(obsSamps, function(samp) -log((Time + .5 - length(samp))/(Time + .5)))
mean(estimatesMom)
## [1] 0.2504492
sd(estimatesMom)
## [1] 0.01960023
hist(estimatesMom, breaks = 30, 
     main="Simulation of 2000 Poisson arrivals 
     Observed hourly (time = 720 and rate = 1/4)",
     xlab = "Estimates based on Method of Moments")

The average of these estimates is very close to the true value, and the standard deviation is very similar to the SD when we observe the exact times of the changes. Moreover, the MSE of our new estimator nearly matches the MSE of the estimator based on the complete data.

(mean(estimatesMom) - 1/4)^2 + var(estimatesMom)
## [1] 0.000384371
(mean(estimates) - 1/4)^2 + var(estimates)
## [1] 0.0003408165

We were able to find an alternative estimator for the rate of change that appears to have little bias by using a model for how the data are generated, i.e., according to a Poisson arrival process. The technique that we used here is called the method of moments. For this method of estimation: we find the expected value of an observation, i.e., of the random Bernoulli trial (this expected value is a function of the unknown parameter \(\lambda\)); we set this function equal to the sample average; and then solve for the unknown parameter. It is called the method of moments because the expected value is also known as the first moment of the probability distribution and the sample average is also known as the first sample moment. By the way, if you are familiar with the maximum likelihood estimation method, this estimator is also the MLE under the model of a Poisson arrival process.

You might wonder whether it is reasonable to assume that the changes to a Web page follow the Poisson arrival process. It’s always a good idea to check our assumptions. This is the topic of the next section.

Does the Poisson Process Model Fit the Data?

The Poisson arrival process has been a helpful model for improving our estimate for the rate of change of a Web page. However, the model comes with a few assumptions. For example, we have assumed that the rate of change is constant in time. We can check this assumption by, e.g., examining page changes for different hours of the day and days of the week to see if there are systematic patterns. Do Web pages change with the same rate in the afternoon as in the middle of the night? Do pages change at the same rate on the weekend as in the middle of the week?

Our data are for Web pages from around the world. For example, domain names of .jp, .kr, and .cn are registered in Japan, Korea, and China, respectively, and domain names that end with .com, .gov, .org, and .net are from the U.S. Let’s restrict our analysis to domains in the U.S. to reduce the number of time zones involved (the times of the visits are all recorded in Pacific time, not the domain’s local time zone). Let’s also examine only pages that have changed at least 20 times. The figure below is a heat map, where high frequencies are bright yellow and low frequencies are red. The plot shows the distribution of the observed times between changes for these U.S. Web pages against the average time between changes (note that the average time between changes is the reciprocal of the average rate of change). The high-frequency bins are concentrated around the fastest changing pages in the lower left corner of the plot. This makes sense: pages that change faster have shorter times between changes and more changes overall, so the density is high in this region. However, there is also a small bright spot near the upper right corner: there appear to be a large number of pages with an average time between changes of about 24 hours and a high number of observed changes exactly 24 hours apart. This indicates more regular updating of these pages than we would expect from a random arrival process. However, it also seems to be a relatively small aberration that should not overly impact our analysis.

Next, we examine the time of day and the day of the week when a page changes to see whether there are patterns, such as a lower rate of change in the middle of the night and on the weekends. We again restrict ourselves to the pages that have changed at least 20 times and are on U.S. domains. For the hour-by-hour comparison, we determine the hour of the day at which each observed change occurred. For example, a visit at time 30 occurred at hour 6 of the day because it is 6 hours past 24. Likewise, a visit at time 70 occurred at hour 22 because it is 2 hours before 72 hours (\(3 \times 24\)). Then we plot the percent of the changes that occur in each hour. If the rate of change does not depend on the time of day, then we expect these percentages to all be about the same, i.e., about \(100\%/24\), or about 4.17%, of the changes to occur in each hour. From the plot we see that the percentages do fluctuate about 4.17%, but there appear to be more changes in the period from hour 5 to hour 12. Our analysis is somewhat limited because the visits are all based on Pacific time, even though the pages can come from four time zones. We also don’t know whether hour 1 is 1 a.m. or some other time such as 6 p.m. because we don’t know when the study began collecting data. Even so, it’s apparent that there are patterns in the times of the changes; it is somewhat more likely for a page to change in this particular 8-hour period (from hour 5 to hour 12) than at other times of the day.
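
The code below uses two indicator vectors, usDomains and manyChanges, for the restrictions just described; one way to construct them (a sketch, treating .com, .gov, .org, and .net as the U.S. domains, as above) is:

# Pages on U.S. domains and pages with at least 20 observed changes.
usDomains = pages$domain %in% c("com", "gov", "org", "net")
manyChanges = numChanges >= 20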

changeTimes = unlist(pages$visitsWChanges[usDomains & manyChanges])
changeHr = (changeTimes %% 24)
changeHr[ changeHr == 0] = 24 
totChanges = length(changeHr)
percentChangesHr = as.numeric(table(changeHr) / totChanges * 100)
hour = 1:24
plot(percentChangesHr ~ hour, type = "l", xlab = "Hour of day",
     ylab = "Percent Changes", xlim = c(0, 25), ylim = c(3.75, 4.5))
abline(h = 4.166667, col = "grey", lwd = 2)

We make a similar plot for the day of the week. For the day-to-day comparison, we determine the hour of the week at which each change occurred. There are \(24 \times 7 = 168\) hours in a week: hours 25 through 48 correspond to day 2, hours 49 through 72 to day 3, hours 169 through 192 to day 1 of the second week, and so on. We examine only the first 28 days of data so that each day of the week occurs 4 times in our calculation. Then, we plot the percentage of changes occurring at each hour of the week. If the rate of change depends on neither the time of day nor the day of the week, then we expect these percentages to all be about the same, i.e., about \(100\%/(24 \times 7) = 0.595\%\). These percentages are quite spiky, so we smooth them with a 5-hour window; we are only trying to get a general sense of the pattern.

fourWks = 24 * 7 * 4
sub4wkUSDomain = (pages$numVisits >= fourWks) & usDomains
changeTimes4wk = unlist(pages$visitsWChanges[ sub4wkUSDomain ])
changeTimes4wk = changeTimes4wk[changeTimes4wk <= fourWks ]
changeDay = 1 + (changeTimes4wk %% 168)
totChanges = length(changeDay)
percentChangesDay = as.numeric(table(changeDay) / totChanges * 100)
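
The plot described below can be produced along the following lines (our sketch, which smooths with a centered 5-hour moving average via stats::filter()):

# Smooth the hourly percentages with a circular 5-hour moving average and plot
# them across the week, with a reference line at the expected 100/168 percent.
smoothPctDay = as.numeric(stats::filter(percentChangesDay, rep(1/5, 5),
                                        circular = TRUE))
plot(seq_along(smoothPctDay), smoothPctDay, type = "l", xlab = "Hour of week",
     ylab = "Percent Changes (smoothed)")
abline(h = 100 / (24 * 7), col = "grey", lwd = 2)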

From the plot, we see that on two of the days there are many fewer changes; these are undoubtedly Saturday and Sunday. Also apparent is the daily cycle of changes that we observed in the previous plot.

From these two plots we see that there appears to be a pattern in the rates of change. We could try to account for these time of day and day of week fluctuations in our estimation problem. This would make a more complex model. Data typically do not fit a model perfectly, and decisions must be made as to whether the model is good enough to be useful or whether a more complex model is needed. For example, pages could be visited with different frequencies on the weekend and in the night than during the work week. We do not address these issues here. In the next section, we consider one additional approach to assumption checking for the Poisson arrival process.

Additional model checks

There are several other properties of the Poisson arrival process that we can use to develop model checks for our data:

  • The times between arrivals in the process follow an exponential distribution.
  • The number of events/arrivals in non-overlapping intervals of the same length behave as independent counts from a Poisson distribution.
  • The locations of events in a fixed interval are uniformly distributed.

We consider the first of these properties. It implies that the time between Web page changes follows an exponential distribution. Quantile-quantile plots can be useful in comparing the empirical distribution of our data against a particular probability distribution. The quantile-quantile plot, or qqplot for short, plots the quantiles of the data against the quantiles of the probability distribution that we conjecture has generated the data. That is, we plot pairs like: (lower quantile of the data, lower quantile of the probability distribution), (median of data, median of the probability distribution), (0.99 quantile of data, 0.99 quantile of the probability distribution), etc. If the probability distribution is correct, then these points should roughly fall on a line. Of course, there are likely to be some fluctuations, e.g., the sample quantile is not expected to fall exactly on the probability distribution’s quantile. However, these deviations should look like random fluctuations. When there is systematic departure from a line, such as strong curvature, then this indicates the data do not follow the assumed probability distribution.

To get a sense of how a quantile-quantile plot should look if the data are from an exponential distribution, we can simulate data from this probability distribution and make an exponential quantile plot for it. To do this, we can use one of our simulated Poisson arrival processes and calculate the times between successive events, e.g.,

timeBtwn = diff(c(0, poisSamps[[1]]))

We compute the 0.005, 0.010, 0.015, …, 0.995 quantiles of these data.

probs = seq(0.005, 0.995, by = 0.005)
quanTimeBtwn = quantile(timeBtwn, probs = probs )

We have an estimate for the rate of change in estimates[1], and we use it to find the quantiles of the exponential distribution with this rate:

quanExpTimeBtwn = qexp(p = probs, rate = estimates[1] )

Now we plot these two sets of quantiles (from the simulated data and the exponential distribution) against each other. The exponential-quantile plot appears in the figure below. We see that the points roughly follow a line with more variability in the right tail and a bit of curvature. We can repeat this process with other simulated exponential data to get a sense of the kind of variability we might expect. The exponential distribution has a long right tail and some curvature in the tail is expected in a quantile plot.
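
The plot just described can be made as follows (our code); we add a line through the origin with slope 1 to guide the eye:

plot(quanTimeBtwn ~ quanExpTimeBtwn,
     xlab = "Quantiles of the Exponential Distribution",
     ylab = "Sample Quantiles of Simulated Data")
abline(a = 0, b = 1, col = "grey")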

Recall that we do not in fact observe the times between events because of the censoring. Instead, we only see the nearest hour at which an event occurred. Moreover, if more than one change occurs between visits, we don’t know this, i.e., we know only that at least one change occurred. How might this censoring affect the quantile plot? We remake the above plot using the censored version of the simulated data. This appears in the figure below. We see that the points fall on steps, which is due to observations being made at regular intervals, but they still roughly follow a line. We also see that the right tail remains a bit skewed.

timeBtwnC = diff(c(0, obsSamps[[1]]))
quanExpAT = quantile(timeBtwnC, probs = probs)
plot(quanExpAT ~ quanExpTimeBtwn, 
     xlab = "Quantiles of the Exponential Distribution",
     ylab = "Sample Quantiles of Censored Data")

Now that we have some experience examining quantile plots for the exponential distribution, we are ready to explore similar plots for the actual Web pages. First, we find a page that has enough observations but not so many that all of the visits had changes.

sampPage = which(numChanges > 180 & numChanges < 250 & 
                  pages$numVisits == 719)[1]

Then we find the times between successive changes, the quantiles of the observed times between changes, and the quantiles of the exponential probability distribution with a rate that matches the estimated rate of change of our sample.

visits = pages$visitsWChanges[[ sampPage ]]
timeBtwnSP = diff(c(0, visits))

totVisits = pages$numVisits[sampPage]
estimateSP = -log((totVisits + .5 - length(visits))/(totVisits + .5))

quanTimeBtwnSP = quantile(timeBtwnSP, probs = probs)
quanExpTimeBtwnSP = qexp(p = probs, rate = estimateSP)

We make a quantile-quantile plot for these two sets of quantiles (see the figure below). The shape is very similar to the quantile plot from the simulated data: it has steps, reflecting that the data are observed at hourly intervals. However, it also seems to have somewhat more curvature than the plot for the simulated data. We may want to examine a few more of these quantile plots for other Web pages. If we do, we see that some are closer to what we expect and others are not. Overall, the exponential distribution looks like a reasonable, though imperfect, description of the times between changes.

plot(quanTimeBtwnSP ~ quanExpTimeBtwnSP,
     xlab = "Quantiles of the Exponential Distribution",
     ylab = "Sample Quantiles of Censored Data")

How Often Should a Page be Visited?

We have seen that the changes to a Web page approximately follow a Poisson arrival process and that we can estimate the rate of change reasonably well when the rate of change is slower than the frequency with which we visit the page. How can this help us determine how often we should visit a page in the future?

There are several possible ways to do this. We consider a simple approach related to the notion of limiting the chance of multiple changes between visits. We can estimate the chance that there is more than one change since the last visit, and try to keep this probability below some threshold.

A simulation study can help us here. We simulate a few thousand Web page changes according to a specified rate. Then we examine these changes at different frequencies of observation, e.g., every quarter hour, half hour, hour, 2 hours, etc., and find the proportion of visits where two or more changes occur. We can plot this proportion as a function of the visiting frequency. Further, we can carry out this simulation for several different rates of change and include them on the same plot. This gives us an idea of the relationship between the chance of multiple changes between visits and the frequency of visits for several different rates of change.

We use our simPois() function again, letting the time interval be very large so that we observe many changes. First we specify which rates we want to study,

rates = c(0.2, 0.4, 0.6, 0.8, 1, 1.5, 2, 3, 4)

For each of these rates, we observe the Poisson arrival process for a long time, say 10000 hours.

longTime = 10000
poisSampsRate = lapply(rates, simPois, T = longTime)

We also consider several frequencies of visits, ranging from 0.025 to 1, i.e., from visits every 0.025 hours (1.5 minutes) to every hour.

obsFreq = seq(0.025, 1, by = 0.025)
Fs = length(obsFreq)

For each visit frequency, we determine the times of the visits.

# Each element of obsFreq is passed to seq() as its by argument, so the visits
# for that frequency are spaced obsFreq[i] hours apart, from 0 up to longTime.
visits = lapply(obsFreq, seq, from = 0, to = longTime)
# Append longTime as a final visit when needed so the last interval is closed.
visits = lapply(visits, function(visit) {
  if (max(visit) < longTime) return(c(visit, longTime))
  return(visit)
})

Then for each sequence of arrivals, we find the proportion of visits that have more than one arrival since the previous visit.

multChanges = lapply(poisSampsRate, function(samp) {
  probs = vector(mode = "numeric", length = Fs)
  for (i in 1:Fs) {
    # hist() with plot = FALSE bins the arrival times at the visit times, so
    # cts holds the number of changes between each pair of successive visits
    cts = hist(samp, breaks = visits[[i]], plot = FALSE)$counts
    probs[i] = sum(cts > 1) / length(cts)
  }
  return(probs)
})
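
The plot described below can be drawn along these lines (our sketch, with one colored curve per rate):

# Proportion of visits with more than one change, as a function of the time
# between visits, with a red reference line at 5%.
cols = rainbow(length(rates))
plot(0, type = "n", xlim = range(obsFreq), ylim = c(0, 1),
     xlab = "Time between visits (hours)",
     ylab = "Proportion of visits with more than one change")
for (i in seq_along(rates))
  lines(obsFreq, multChanges[[i]], col = cols[i])
abline(h = 0.05, col = "red")
legend("topleft", legend = paste("rate =", rates), col = cols, lty = 1)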

For each rate, we plot the proportion of visits with multiple events as a function of the visit frequency. We see in this plot that as the time between visits gets longer, there is an increase in the proportion of visits when more than one change occurs. We can use the red horizontal reference line at 5% to help us determine how often to visit a page if we want to bound this chance by 5%. For example, for a rate of 1 change per hour, we should visit the page about every 0.35 hours or every 20 minutes.
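
We can also check this value directly: under the Poisson model, the chance of two or more changes in an interval of length \(t\) for a page that changes at rate \(\lambda\) is \[ 1 - e^{-\lambda t} - \lambda t \, e^{-\lambda t}, \] which follows from the probabilities given earlier. Setting this expression equal to 0.05 for \(\lambda = 1\) and solving numerically gives \(t \approx 0.35\) hours, in agreement with what we read from the plot.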

What We Learned

In this case study, we used data from hourly visits to 1,000 Web pages to study how often Web pages change. Because we observe only whether a page changed between visits, and not how many times, the data are censored, and the naive estimator, the proportion of visits on which a change was observed, systematically underestimates the rate of change. By modeling page changes as a Poisson arrival process, we used simulation to study the bias, variance, and mean square error of this estimator, and we derived a method of moments estimator that largely removes the bias. Diagnostic plots suggested that the Poisson model is a reasonable, though imperfect, description of the data; in particular, there are daily and weekly patterns in when pages change. Finally, we used simulation to relate the time between visits to the chance of missing changes, which indicates how often a page should be revisited to keep the cache fresh.

References

[1] Cho, J. and Garcia-Molina, H. (2003), “Estimating frequency of change,” ACM Transactions on Internet Technology, Vol. 3, No. 3, 256-290.

[2] Grimes, C. and O’Brien, S. (2008), “Microscale evolution of Web pages,” in WWW ’08: Proceedings of the 17th International World Wide Web Conference, 1149-1150.

[3] Grimes, C. and Ford, D. (2008), “Estimation of Web page change rates,” in Proceedings of the Joint Statistical Meetings, 2008.