I hated stats class in graduate school. Nothing made me feel dumber, faster. So, this post is dedicated to that full year of ridiculous, and I hope to now reduce the key points to a 5-minute blog post.
Regression. Sounds hard, right? Watch this.
Let’s say you’ve observed a series of events. Maybe it’s the temperature and the time of day. Maybe it’s the number of people and energy consumption. Doesn’t matter. Now, you take the numbers from those observations and you put them on a regular X- and Y-graph.
Those circles are all in a rough pattern, right? They’re going up and to the right pretty much together. When one thing is larger, the other thing is larger.
So, Statistics cool fact #1 (and we’re only about 1 minute in here, people): when two things “move together,” they’re correlated. Bang, you just learned correlation.
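If you want to see "moving together" as a number, here is a minimal sketch in Python with NumPy. The hours and temperatures are made up for illustration; only `np.corrcoef` is a real library call.

```python
import numpy as np

# Made-up observations (hypothetical numbers): hour of day and temperature.
hour = np.array([6, 8, 10, 12, 14, 16])
temperature = np.array([11.0, 13.5, 17.0, 19.5, 22.0, 23.0])

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is the
# correlation between the two series. It always lands between -1 and 1.
r = np.corrcoef(hour, temperature)[0, 1]
print(f"correlation: {r:.2f}")
```

A value near 1 means the two things rise together, a value near -1 means one rises while the other falls, and a value near 0 means there is no straight-line pattern to speak of.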
Ok, what’s up with that red line? That is called the “line of best fit.” Why? Just look at it. If you had to draw one line to go in between all of those circles, that’s about where it would go, right?
And that’s all regression is, really: finding that line. Seriously, that’s it. You don’t even have to know how it does it if you don’t want to.¹
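Finding that line really can be a one-liner. The sketch below assumes Python with NumPy and uses made-up numbers for people and energy consumption; `np.polyfit` with `deg=1` fits a straight line and returns the slope first, then the intercept.

```python
import numpy as np

# Made-up observations (hypothetical numbers): people and energy use.
people = np.array([10, 20, 30, 40, 50])
energy = np.array([31.0, 49.0, 72.0, 88.0, 110.0])

# Fit a degree-1 polynomial, i.e. a straight line y = slope * x + intercept.
slope, intercept = np.polyfit(people, energy, deg=1)
print(f"slope: {slope:.2f}, intercept: {intercept:.2f}")
```

The two numbers that come back are everything you need to draw the line of best fit through those points.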
So what do you do with that line? Well, if you remember high school geometry, you had to learn about slopes. y = mx + b and all that. Don’t worry. But do you remember that m was the slope, the “rise over the run”? Well, now you do.
That slope is what we call a coefficient. It just says that for every one-unit increase in X, Y changes by this much. Look at that red line. Every step along X moves it the same amount in Y (which, you know, is what makes it a straight line).
Here’s what that means in practice. Let’s say that we’re looking at people and energy consumption in a city. For every person you add, energy consumption in the city goes up some fixed amount. Or say this is eCommerce: for every page a visitor views, spending in the store goes up some fixed amount.
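To make "some fixed amount" concrete, here is a tiny Python sketch. The slope and intercept are hypothetical values picked for illustration, not results from any real city, and `predict` is just a helper name for y = mx + b.

```python
def predict(x, slope, intercept):
    """Height of the line at x: y = slope * x + intercept."""
    return slope * x + intercept

# Hypothetical coefficient and intercept for energy use per person.
slope, intercept = 1.97, 10.9

at_100_people = predict(100, slope, intercept)
at_101_people = predict(101, slope, intercept)

# Adding one person changes the prediction by exactly the slope,
# no matter where on the line you start.
extra = at_101_people - at_100_people
print(f"one more person adds about {extra:.2f} units of energy")
```

Swap 100 and 101 for any other pair of neighboring values and the difference stays the same. That constant difference is the coefficient.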
Imagine if the points were shaped (we say “distributed”) another way:
Ok, so the line is different now. And the coefficient for it will be negative. Why? For every unit of X, Y is actually going down now (that’s negative correlation if you want to get fancy).
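Here is the same idea with points that slope down instead of up. Again a sketch with made-up numbers, assuming Python and NumPy. Note that the slope and the correlation both come out negative.

```python
import numpy as np

# Made-up observations (hypothetical numbers) where Y falls as X rises.
x = np.array([1, 2, 3, 4, 5])
y = np.array([9.8, 8.1, 6.2, 3.9, 2.0])

slope, intercept = np.polyfit(x, y, deg=1)
r = np.corrcoef(x, y)[0, 1]

print(f"slope: {slope:.2f}")        # negative: Y drops as X grows
print(f"correlation: {r:.2f}")      # also negative
```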
These coefficients are pretty useful. Now realize that you can come up with one for just about anything if you have the data. What do you want to know?
The nifty thing about regression is that it can give you these slope numbers (coefficients) for a bunch of things at once. So, you can know something like “what’s the relationship of time on site to spending, controlling for gender?”
The results will tell you, for every unit of time on site, how much more the person spends. And they do so while taking the gender of the person out of the picture and putting that off to the side. In fact, gender also gets its own coefficient. So we also get to know who spends more, men or women, independent of how much time they spend on the site.
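A rough sketch of "several coefficients at once", assuming Python with NumPy. The data is invented and deliberately noise-free (spending is built as 5 + 2 × minutes + 10 × the gender flag) so you can see the regression hand back exactly those numbers. The 0/1 coding of gender and the variable names are arbitrary choices made for this example.

```python
import numpy as np

# Invented, noise-free data so the answer is easy to check by eye.
minutes = np.array([5, 10, 15, 20, 5, 10, 15, 20], dtype=float)
is_woman = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)  # arbitrary 0/1 coding
spending = 5 + 2 * minutes + 10 * is_woman

# One column per thing we want a coefficient for, plus a column of
# ones so the line is allowed an intercept.
X = np.column_stack([np.ones_like(minutes), minutes, is_woman])

# Least-squares solution: one coefficient per column of X.
coefs, *_ = np.linalg.lstsq(X, spending, rcond=None)
intercept, per_minute, gender_gap = coefs

print(f"per extra minute on site: {per_minute:.2f}")
print(f"gender difference at equal time on site: {gender_gap:.2f}")
```

The `per_minute` number is the time-on-site effect with gender held to the side, and `gender_gap` is the spending difference between the two groups at the same time on site. Real data has noise, so real coefficients will not come out this clean.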
The rest is all details, things you aren’t allowed to do (like putting tons and tons of variables in at once, or trusting a model with crummy results), and months of walking through boring proofs. But you’ve just learned 90% of the important part. And you can do this in Excel.
¹ Of course, lots of people care a lot about this. Here’s the short version: It draws the line where the distance from the line to the circles is smallest overall. It does this by measuring how far each circle sits above or below the line, squaring those distances, adding them all up, and picking the line with the smallest total. Because of the squaring, circles that are really far off pull the line harder towards them. Those are (Malcolm Gladwell loves this) outliers.
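For the curious, here is a sketch of that short version in Python with NumPy. It uses the standard closed-form least-squares formulas for a single X; the helper name `fit_line` and the numbers are made up for illustration. The second fit adds one far-off point so you can watch it drag the slope.

```python
import numpy as np

def fit_line(x, y):
    """Slope and intercept that minimize the sum of squared vertical distances."""
    x_mean, y_mean = x.mean(), y.mean()
    slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    intercept = y_mean - slope * x_mean
    return slope, intercept

# Made-up points that sit exactly on y = 2x.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
slope_clean, intercept_clean = fit_line(x, y)

# Same points plus one outlier far above the pattern.
x_with_outlier = np.append(x, 6.0)
y_with_outlier = np.append(y, 30.0)
slope_pulled, intercept_pulled = fit_line(x_with_outlier, y_with_outlier)

print(f"slope without outlier: {slope_clean:.2f}")
print(f"slope with outlier:    {slope_pulled:.2f}")
```

One stray point is enough to more than double the slope here, which is why people who do this for a living spend so much time staring at outliers.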