In one of his talks at QCon, John Allspaw mentioned using Holt-Winters exponential smoothing on various monitoring instances. Wikipedia has a good entry on the subject, of course, but the basic idea is to take a noisy, spiky time series and smooth it out so that unexpected changes stand out even more. That's often initially done by taking a moving average: say, averaging the last 7 days of data and using that as the current day's value. More complicated schemes weight that average so that older data contributes less.
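As a rough illustration of those two starting points, here is a minimal sketch of a trailing moving average and a linearly weighted variant. The function names, the linear weighting and the sample numbers are assumptions made for this example; they are not part of the Holt-Winters code discussed in this post.

```php
<?php
// Hypothetical helper: trailing moving average. Each output value is the
// mean of the current point and up to $window - 1 points before it.
function moving_average(array $data, int $window): array {
    $data = array_values($data);
    $smoothed = array();
    foreach ($data as $i => $value) {
        $start = max(0, $i - $window + 1);
        $slice = array_slice($data, $start, $i - $start + 1);
        $smoothed[$i] = array_sum($slice) / count($slice);
    }
    return $smoothed;
}

// Hypothetical helper: linearly weighted variant. The oldest point in the
// window gets weight 1 and the newest gets the largest weight.
function weighted_moving_average(array $data, int $window): array {
    $data = array_values($data);
    $smoothed = array();
    foreach ($data as $i => $value) {
        $start = max(0, $i - $window + 1);
        $total = 0.0;
        $weights = 0;
        for ($j = $start; $j <= $i; $j++) {
            $weight = $j - $start + 1;
            $total += $weight * $data[$j];
            $weights += $weight;
        }
        $smoothed[$i] = $total / $weights;
    }
    return $smoothed;
}

// Made-up daily counts with one spike
$visits = array(10, 12, 11, 50, 12, 11, 10);
$plain = moving_average($visits, 3);
$weighted = weighted_moving_average($visits, 3);
```

The weighted version reacts to the spike faster and also forgets it faster, which is the behaviour exponential smoothing pushes further.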
Simple exponential smoothing effectively takes this weighted average further, with more recent values being exponentially more important than older ones. However, this has problems in the face of a long-term trend, so double exponential smoothing includes a factor for the general tendencies in the data (e.g. an increasing trend over time). Triple exponential smoothing, which we're using here, also includes a factor to consider seasonal changes, so I thought I'd give that one a go at implementing. Each of those three smoothing aspects has its own weighting factor (alpha, beta and gamma) that controls how much of an impact it has, and by setting individual factors to 0 we can have the same code do any one of the three algorithms. Below I've broken out the function into its component parts, but you can see the whole thing on GitHub.
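To make the simplest of the three concrete, here is a minimal sketch of simple exponential smoothing on its own, using the usual recurrence s[t] = alpha * x[t] + (1 - alpha) * s[t-1]. The function name and the choice to seed with the first data point are assumptions for this example, not something taken from the code on GitHub.

```php
<?php
// Hypothetical helper: simple (single) exponential smoothing.
// The first smoothed value is seeded with the first data point.
function simple_exponential_smoothing(array $data, float $alpha): array {
    $data = array_values($data);
    if (count($data) === 0) {
        return array();
    }
    $smoothed = array($data[0]);
    for ($t = 1; $t < count($data); $t++) {
        // New value is a blend of the latest reading and the previous estimate
        $smoothed[$t] = $alpha * $data[$t] + (1.0 - $alpha) * $smoothed[$t - 1];
    }
    return $smoothed;
}

$smoothed = simple_exponential_smoothing(array(10, 20, 30), 0.5);
```

An alpha close to 1 follows the raw data closely, while an alpha close to 0 gives a much flatter line.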
We'll give it a go on some web data that has an unexpected spike, and see how visible that is against the timeline. The algorithm is pretty simple, but we need to set up a bunch of variables first. We start off by calculating an initial trend value by looking at the difference in the average values over the first two 'seasons' (the length being a configurable parameter of the function).
<?php
// Calculate an initial trend level
// Average of the first season
$trend1 = 0;
for ($i = 0; $i < $season_length; $i++) {
    $trend1 += $data[$i];
}
$trend1 /= $season_length;
// Average of the second season
$trend2 = 0;
for ($i = $season_length; $i < 2 * $season_length; $i++) {
    $trend2 += $data[$i];
}
$trend2 /= $season_length;
// Per-step trend: change in the seasonal average, spread over one season
$initial_trend = ($trend2 - $trend1) / $season_length;
?>
Next we create an initial value for the 'level' part (the direct data smoothing component), build a seasonal index by dividing each data point by the initial level plus trend, and calculate the seasonal factors for the first period.
<?php
// Take the first value as the initial level
$initial_level = $data[0];
// Build index: each point relative to the initial level plus trend
$index = array();
foreach ($data as $key => $val) {
    $index[$key] = $val / ($initial_level + ($key + 1) * $initial_trend);
}
// Build season buffer: average the index over the first two seasons
$season = array_fill(0, count($data), 0);
for ($i = 0; $i < $season_length; $i++) {
    $season[$i] = ($index[$i] + $index[$i + $season_length]) / 2;
}
// Normalise season so the first season's factors average to 1
$season_factor = $season_length / array_sum($season);
foreach ($season as $key => $val) {
    $season[$key] *= $season_factor;
}
?>
Finally, we actually run the smoothing. This loops over the data, updates trend, level and season values for the three elements of the smoothing, and finally combines them to calculate the smoothed value, factoring in the weighting constants. By continuing beyond the end of the data, we can even use this to project into the future and make a forecast!
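<?php
$holt_winters = array();
$alpha_level = $initial_level;
$beta_trend = $initial_trend;
foreach ($data as $key => $value) {
    // Keep the previous level and trend for the update equations
    $temp_level = $alpha_level;
    $temp_trend = $beta_trend;
    // Level: deseasonalised value blended with the previous level plus trend
    $alpha_level = $alpha * $value / $season[$key]
        + (1.0 - $alpha) * ($temp_level + $temp_trend);
    // Trend: change in level blended with the previous trend
    $beta_trend = $beta * ($alpha_level - $temp_level)
        + (1.0 - $beta) * $temp_trend;
    // Season: factor for the same position one season ahead
    $season[$key + $season_length] = $gamma * $value / $alpha_level
        + (1.0 - $gamma) * $season[$key];
    // Smoothed value for this point
    $holt_winters[$key] = ($alpha_level + $beta_trend * ($key + 1)) * $season[$key];
}
?>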
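Projecting beyond the end of the data could look something like the sketch below. It is not the forecasting code from GitHub: it uses the textbook m-step-ahead form, (level + m * trend) * seasonal factor, which scales the trend by the number of steps ahead rather than by the position in the series. The function name and its parameters are hypothetical, and it assumes the final level, trend and `$season` array have been carried out of the loop above.

```php
<?php
// Hypothetical helper: project forward from the final level and trend.
// $season is expected to hold factors for indexes beyond the data, as the
// smoothing loop writes them one season ahead of the point being processed.
function holt_winters_forecast(float $level, float $trend, array $season, int $data_count, int $steps): array {
    $forecast = array();
    for ($m = 1; $m <= $steps; $m++) {
        $index = $data_count + $m - 1;
        if (!isset($season[$index])) {
            // No seasonal factor has been computed this far ahead
            break;
        }
        $forecast[$index] = ($level + $m * $trend) * $season[$index];
    }
    return $forecast;
}

// Made-up values: two data points, seasonal factors known two steps ahead
$example = holt_winters_forecast(100.0, 2.0, array(1.0, 1.0, 1.1, 0.9), 2, 3);
```

Because the loop only stores seasonal factors one season ahead, this sketch can project at most one season past the end of the data.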
This whole thing is wrapped in a function that sets the values of the smoothing constants, so we can just call $newdata = holt_winters($data, 30). Running this on the webstats data gives us a smoothed graph, as you can (hopefully) see from the Google chart below, assuming the JavaScript is behaving.
John used this kind of smoothing at Etsy in combination with error bars to look for unusual events, and trigger their monitoring systems. One thing I noticed from trying a quick implementation is that the length of time considered for the season can have a big effect on the smoothing, as can the values of the $alpha, $beta and $gamma constants, so some tweaking may be required if using a similar technique on your own data.
If we did want to make some sort of triggering based on the data, we'd need to create confidence intervals as well. We can do that with an extra array in the main Holt-Winters loop that is updated like this:
<?php
$deviations[$key] = $dev_gamma * abs($value - $holt_winters[$key])
    + (1 - $dev_gamma) * (isset($deviations[$key - $season_length]) ? $deviations[$key - $season_length] : 0);
?>
This is going to track how much our data is deviating from the smoothed value, and factor seasonality into that. We can add a multiple of these values to the smoothed value, and subtract the same multiple from it, to create confidence bars, and signal if our data goes outside them. We'll add and subtract three times the deviation score, which gives us error bars that look something like the below. Note that as the data gets more variable, the confidence bars open up to respect the general increased volatility, but when the data isn't changing much day to day the error bars are pretty tight.
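A check against those bars might be sketched as follows. The function name and its parameters are hypothetical. This sketch also compares each point against the deviation recorded one season earlier, so that a spike does not widen its own band before being tested; that is a choice made for this example rather than something the post specifies.

```php
<?php
// Hypothetical helper: return the keys of data points that fall outside
// smoothed value +/- $scale * deviation, using the deviation from the same
// position in the previous season.
function find_aberrations(array $data, array $smoothed, array $deviations, int $season_length, float $scale = 3.0): array {
    $flagged = array();
    foreach ($data as $key => $value) {
        $previous = $key - $season_length;
        if (!isset($smoothed[$key], $deviations[$previous])) {
            // Not enough history to build a band for this point
            continue;
        }
        $upper = $smoothed[$key] + $scale * $deviations[$previous];
        $lower = $smoothed[$key] - $scale * $deviations[$previous];
        if ($value > $upper || $value < $lower) {
            $flagged[] = $key;
        }
    }
    return $flagged;
}

// Made-up values: only the third point sits outside its band
$flagged = find_aberrations(array(10, 10, 30, 10), array(10, 10, 10, 10), array(1, 1, 1, 1), 2);
```

The first season is never flagged here, because there is no earlier deviation to build a band from.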
Michael Stillwell
March 10th, 2012 at 11:55
Yow, it doesn't handle that section between 60-75 all that well--either laggy or confused, or both. Manages the lag pretty well elsewhere though, especially 0-60.
Ian Barber
March 10th, 2012 at 12:15
Yeah, the big spike definitely triggers an aberration, but it overweights that for the following readings. The alpha, beta and gamma parameters and the season size do have a major effect on this - you can learn those on old data with a least-squares error type approach if you want, but in this case I just bunged in some applicable numbers.
Andrey Esaulov
March 12th, 2012 at 11:11
Just what I needed for my project! Thanks for putting it together.
Two things I miss though are:
1) $data array format you are using
2) the description how to hook it up with google charts
I guess I would have to figure it out myself. The goal of the article is not the step by step code walk through, but rather the illustration of the data smoothing algorithm. And this is done perfectly!
Volomike
October 31st, 2012 at 20:20
It was suggested that I consider Holt-Winters for calculating sales forecasts. How would I go about adjusting what you have here for that, projecting forward?
Also, sales often have a seasonal effect. For instance, if last year we saw a declining trend for more than one month in Q4, then this year we can usually (but not always) forecast that the same thing will happen. Does Holt-Winters take this into consideration?
My thoughts:
http://stats.stackexchange.com/q/41572/16404
P.S. You don't need the ending ?>. "If a file is pure PHP code, it is preferable to omit the PHP closing tag at the end of the file. This prevents accidental whitespace or new lines being added after the PHP closing tag, which may cause unwanted effects because PHP will start output buffering when there is no intention from the programmer to send any output at that point in the script." http://www.php.net/manual/en/language.basic-syntax.phptags.php
Ian Barber
November 16th, 2012 at 22:53
Thanks for the tip, Volomike. Season is taken into account in the code - there's a little bit in the code on GitHub that shows forecasting. Friends don't let friends extrapolate though, and all that.
Re the closing bracket - in this case it just makes the syntax highlighting a little tidier!
John
January 8th, 2013 at 12:07
Hey,
I was just wondering how you computed the constants alpha, beta and gamma to be used in this instance.
Thanks for your help!