Stop Using Regressions (so much)
OK, you don’t need to stop using them completely, but they are not the go-to solution to every complex problem. In fact, in the hands of someone with a less nuanced understanding of its use, any statistical or data tool can be misused.
To begin, it helps to understand what a regression is and how it is intended to be used. A regression is a statistical tool used to estimate the relationship between an outcome variable and one or more predictor variables; in its simplest form, it relates just two variables. In many ways it is similar to a correlation. But where a correlation only measures the strength and direction of a relationship, a regression provides a fitted equation for the data, which can be drawn as a line through it.
The picture above shows a textbook example of a linear regression. The data (blue dots) shows each case and where it falls on two variables. Even without a red line, we can see that there appears to be a trend in the data. A regression fits the line that minimizes the overall error between the line and the data points (in the usual case, the sum of squared vertical distances). This line fits quite well. But not all data is so accommodating.
All of the data plots above have nearly the same regression equation and line. This famous example is called Anscombe's quartet, and it shows just how useless a regression can be depending on the situation: the same line describes the first plot well and badly misrepresents the others. Imagine using a wrench to hammer in a nail. Just because the wrench is bad at this task doesn’t mean the wrench is entirely useless. But it could mean you’re a bad carpenter!
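To check the quartet for yourself, here is a minimal sketch in plain Python. The numbers are the published Anscombe (1973) values; the helper name fit_line is just an illustrative choice, and the fit is ordinary least squares written out by hand rather than taken from any particular library.

```python
# Anscombe's quartet: four small datasets with nearly identical fitted lines.
x_shared = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x_shared, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x_shared, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x_shared, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


fits = {name: fit_line(xs, ys) for name, (xs, ys) in quartet.items()}
for name, (slope, intercept) in fits.items():
    print(f"Dataset {name}: y = {intercept:.2f} + {slope:.2f}x")
```

All four fits land very close to a slope of 0.5 and an intercept of 3, even though only the first dataset looks like a straight-line trend when plotted. The equation alone cannot tell you which situation you are in, so plot the data first.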
Here are a few cases when not to use regressions.
When the relationships make no logical sense. This is commonly summed up in the phrase, “correlation does not imply causation.” Ice cream sales correlate positively with murder rates. A direct link between the two doesn’t make logical sense, and as a result correlation and regression are not particularly useful for understanding this phenomenon.
There are more variables in play. In the ice cream example, there is an unmeasured factor that connects ice cream and homicide. No, it is not that frozen desserts lead to violence. It’s the weather: more murders occur in the summer, and more ice cream is sold in the summer. A simple regression shows the relationship between two variables, so a confounding factor like this one can sit outside the calculation entirely. Yes, there are many regressions that use more than two variables. But the point is that there may be yet more variables that play into the problem, especially in a complex system.
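The ice cream example can be sketched with made-up numbers. Everything below is synthetic and purely illustrative: the variable names, the coefficients, and the noise levels are assumptions, chosen so that temperature drives both ice cream sales and homicides while the two have no direct link.

```python
import random

random.seed(0)  # fixed seed so the illustration is repeatable


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def residuals(ys, xs):
    """What is left of ys after subtracting a least-squares line fitted on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


# Synthetic data: temperature is the hidden common cause of both series.
n = 2000
temperature = [random.uniform(0, 35) for _ in range(n)]
ice_cream = [2.0 * t + random.gauss(0, 5) for t in temperature]
homicides = [0.5 * t + random.gauss(0, 2) for t in temperature]

raw_r = pearson(ice_cream, homicides)
adjusted_r = pearson(residuals(ice_cream, temperature),
                     residuals(homicides, temperature))

print(f"Ice cream vs. homicides, ignoring temperature: r = {raw_r:.2f}")
print(f"Same pair after removing temperature:          r = {adjusted_r:.2f}")
```

In this simulation the raw correlation is strong, and it all but disappears once the temperature trend is taken out of both series. Real data is rarely this clean, and you can only adjust for the confounders you thought to measure.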
Not enough data. I would not use regressions to analyze small datasets. The smaller your sample, the less likely it is that your findings will hold for the problem in general: a handful of points can produce a very different line from one sample to the next. Gather more data if possible.
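One way to see the small-sample problem is to fit the same relationship many times on samples of different sizes and watch how much the fitted slope moves around. This is a rough simulation under assumed settings (a true slope of 1.0, noise with standard deviation 3, sample sizes of 5 and 200); none of these numbers come from real data.

```python
import random
import statistics

random.seed(1)  # fixed seed so the illustration is repeatable


def fit_slope(xs, ys):
    """Least-squares slope of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def simulate_slopes(sample_size, trials=500):
    """Draw many samples of one size from the same process and fit each one."""
    slopes = []
    for _ in range(trials):
        xs = [random.uniform(0, 10) for _ in range(sample_size)]
        ys = [1.0 * x + random.gauss(0, 3) for x in xs]  # true slope is 1.0
        slopes.append(fit_slope(xs, ys))
    return slopes


spread_small = statistics.stdev(simulate_slopes(5))
spread_large = statistics.stdev(simulate_slopes(200))

print(f"Spread of fitted slopes with   5 points: {spread_small:.2f}")
print(f"Spread of fitted slopes with 200 points: {spread_large:.2f}")
```

The underlying relationship is identical in every trial, yet the five-point fits scatter far more widely around the true slope than the 200-point fits do. With a tiny sample, the line you get says as much about which points you happened to collect as about the relationship itself.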
They don’t explain why. Regressions can tell you how two variables relate. However, they will never tell you why the two relate in such a way. There have been numerous studies in the US and elsewhere that show a significant relationship between education and income. This is a simple linear regression. But if I didn’t understand the basics of our economy and society, this relationship would be a mystery.
This still doesn’t imply causation. If you’re looking for causation, you might be disappointed. The fact remains that while a regression provides far more information on a relationship than a simple correlation, it still can’t guarantee that there is a causal relationship. Higher rates of education likely lead to better job prospects and therefore greater income potential over your lifetime. But what if the opposite is true? What if greater wealth leads to a greater desire for education? After all, if I don’t have to worry about feeding myself and paying the gas bill, maybe I can work on my education. The reverse theory is far less convincing here, but in other cases it is not always so implausible.
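The fit itself gives no hint about direction. In the sketch below, the data is synthetic and the units are hypothetical (years of education, income in thousands); income is generated from education, but the regression fits exactly as well when the two are swapped.

```python
import random

random.seed(2)  # fixed seed so the illustration is repeatable


def r_squared(predictor, outcome):
    """R^2 of the least-squares line that predicts outcome from predictor."""
    n = len(predictor)
    mean_p = sum(predictor) / n
    mean_o = sum(outcome) / n
    spp = sum((p - mean_p) ** 2 for p in predictor)
    spo = sum((p - mean_p) * (o - mean_o) for p, o in zip(predictor, outcome))
    slope = spo / spp
    intercept = mean_o - slope * mean_p
    # Compare the line's squared errors with the outcome's total variation.
    sse = sum((o - (intercept + slope * p)) ** 2 for p, o in zip(predictor, outcome))
    sst = sum((o - mean_o) ** 2 for o in outcome)
    return 1 - sse / sst


# Synthetic data in which education really does drive income.
education = [random.uniform(8, 20) for _ in range(500)]
income = [3.0 * e + random.gauss(0, 10) for e in education]

r2_income_from_education = r_squared(education, income)
r2_education_from_income = r_squared(income, education)

print(f"Income predicted from education: R^2 = {r2_income_from_education:.3f}")
print(f"Education predicted from income: R^2 = {r2_education_from_income:.3f}")
```

Both directions report the same R^2, because with a single predictor it is just the squared correlation, which does not care which variable comes first. Deciding which way the arrow points takes knowledge from outside the regression.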
No feedback loops. In all likelihood, education leads to greater income in the household, which results in more education for the person (through additional professional training) and for their children. The result is a feedback loop. Loops are almost impossible to express properly in a linear regression, which assumes the influence runs one way, from predictor to outcome. As a result, in issues and topics where feedback loops matter, regressions shouldn’t play a central role.
Again, this is not to say that regressions are useless. In fact, they can play a critical role in the discovery of knowledge and in solving complex problems. But too often a single research tool gets overused. Relying only on numeric data, for example, can reduce every problem to a single number. As much as I use and love data, this is not always the best bet. Keep this in mind.
Researchers and analysts are often called upon by their organization or clients to do a variety of tasks. In this way, they act like the local “handyman” from a small town in a bygone age. That person needed to be a carpenter on Monday, plumber on Tuesday, electrician on Wednesday, and car mechanic on Thursday. This requires them to have a wide variety of tools and to know when to use them.
It’s the same with regressions. Don’t rely on one tool too much, no matter how versatile it is.
Wondering if a regression is the best tool for your problem? Wondering what other analyses might be a good fit instead? Schedule a free 1 hr. consultation with me to discuss your project.