If I may take out some of my frustration on Mark Twain, the beloved American humorist: fuck it all. Nothing ruins my day more than when someone runs this quote up the flagpole and quickly salutes it in the middle of a debate.
Taking the quote literally implies statistics, in and of itself, is a farce and mirage of numbers. Statistics is the way we make sense of all the observations and results we tally. Without it, we would have big data sets that would be impossible for us to describe.
Using big, fancy statistical algorithms, Nate Silver at 538 correctly predicted the winner in 50 out of 50 states in the 2012 presidential election. Randomized controlled trials, a staple of medical research, were founded in statistics (Statistics and Science, page 6). People who take this quote as literal, sagacious wisdom are just plain wrong about statistics.
The spirit of this quote is actually what’s important. More than any other “science,” statistics has a lot of ways it can go wrong, and it is important to be skeptical whenever someone starts to rattle off a bunch of numbers. When interpreting data, there is some leeway in how to represent the results, and I want to focus on just one of those ways: averages.
In summary, there are three types of averages:
- Mean: add all the numbers in the data and divide by how many numbers you added.
- Median: the number in the middle once the data is sorted from lowest to highest.
- Mode: the number that appears most often.
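To make those three definitions concrete, here is a minimal sketch using Python's standard `statistics` module. The pay rates in it are made up purely for illustration; they are not the gym payroll discussed in this post.

```python
from statistics import mean, median, mode

# Hypothetical hourly pay rates, invented for this example only.
rates = [15.00, 15.00, 17.00, 18.00, 20.00]

rate_mean = mean(rates)      # sum of all values divided by how many there are
rate_median = median(rates)  # middle value once the list is sorted
rate_mode = mode(rates)      # value that appears most often

print(f"mean:   ${rate_mean:.2f}")    # mean:   $17.00
print(f"median: ${rate_median:.2f}")  # median: $17.00
print(f"mode:   ${rate_mode:.2f}")    # mode:   $15.00
```

Even on a tiny, tidy list like this one, the mode already tells a slightly different story than the other two.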
The data you are looking at and, more importantly, the question you are trying to answer with that data will determine which average should be used.
Let’s take a look at a fake payroll for personal trainers at a gym to see how these concepts can be used, appropriately or inappropriately:
The mean pay rate for this group is $24.16, the median is $17.50, and the mode is $15.00.
Without even having the original data, these results show you a lot: there is an outlier that is pushing the mean up and the median more appropriately represents what people earn at this gym. How could I tell?
The mean is the average most affected by outliers. The classic example: picture the mean income of you and your friends, then imagine it after Bill Gates joins you. Bill Gates, being a billionaire, would push the mean of the group sky high, and it would no longer truly represent the data; it would make you and your friends look like multi-millionaires. The same thing is happening with these personal trainers. In reality, employee 6 is the one making a real killing at this gym, while everyone else is earning pedestrian pay in comparison.
The median is less affected by outliers. Since it is the middle number, having data on either extreme will do little to affect it. Returning to the Bill Gates example, a median representation of the group would stay intact because Bill Gates's arrival would barely move the middle number in your group of friends. Returning to the personal trainers, if you pair the median with the mode, you clearly see that most people are earning on the lower end, not the higher.
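The Bill Gates effect is easy to sketch in a few lines of Python. The incomes below are hypothetical round numbers chosen only to show the pattern, and the billionaire's figure is a stand-in, not anyone's real income.

```python
from statistics import mean, median

# Hypothetical annual incomes for a group of five friends.
friends = [40_000, 45_000, 50_000, 55_000, 60_000]

# The same group after one billionaire-sized outlier joins (stand-in value).
with_outlier = friends + [1_000_000_000]

mean_before, median_before = mean(friends), median(friends)
mean_after, median_after = mean(with_outlier), median(with_outlier)

print(f"mean before:   {mean_before:,.0f}")    # 50,000
print(f"median before: {median_before:,.0f}")  # 50,000
print(f"mean after:    {mean_after:,.0f}")     # jumps into the hundreds of millions
print(f"median after:  {median_after:,.0f}")   # 52,500
```

One new person drags the mean up by several orders of magnitude, while the median only shifts to the midpoint between the two middle friends.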
The power within the three averages is their synergistic ability; you need all three to make an informed decision. Look at what I can do if I just project one number:
- If I wanted to recruit potential talent to my gym, I could say that the average rate of pay at my gym is $24.16 (the mean). Then, string people along and bait and switch them later on.
- If I wanted to fudge my budget a little bit for the upcoming year, I could try to state that my average rate is only $17.50 (the median) and forecast lower than normal. Then, get the hell out before shit hits the fan.
- If I wanted to show management how fiscally responsible I am, I could share with them that the average trainer only earns $15.00 (the mode). Then, hope like hell they don’t find out about employee #6.
As you can see, I can liberally use these definitions to support whatever image I want to project. These numbers are not inherently lying, but it is easy to pick and choose whichever one suits the circumstances.
The way to avoid being tricked by statistics is to ask which interpretation of the data would better answer the question that you have. Wanting to know the most frequent pay rate (the mode) and wanting to know a traditional average (the mean) of all employees are two different questions that need two different numbers to be answered appropriately.
I worked for a gym that was part of a larger association. Every year, for budgetary purposes, they would give us the mean pay rate for each department. We would multiply this rate by how many hours we had people working that particular position each month. It was a nightmare for reasons you should already realize: the mean is heavily affected by outliers and is not a true representation of what people actually earn.
Depending on how your data set (employee pay rates) looked, you could have either a windfall of extra cash or be in a deficit the entire year. Since managing expenses was a part of everyone’s review, you were wedded to these data points. False narratives were easily made; if your mean was really high, it looked like you were overpaying when onboarding new employees when it really could mean you had longer-tenured staff.
I tried to ask that we do a weighted average (where you take into account two variables to determine the mean, in this case people’s pay rate AND how many hours they worked), but it fell on deaf ears. This was not because statistics itself is a lying, sentient son-of-a-bitch; it was because people had chosen the wrong metrics to guide their decision making. This is how statistics can be worse than damn lies.
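For the curious, here is a rough sketch of the difference between budgeting with a plain mean and budgeting with an hours-weighted mean. The staff list is entirely hypothetical (it is not my old gym's payroll), and it assumes the simplest possible setup: one pay rate and one monthly hour count per person.

```python
# Hypothetical staff: (hourly pay rate, hours worked per month).
# One highly paid trainer works very few hours.
staff = [(15.00, 120), (15.00, 100), (17.50, 80), (60.00, 10)]

total_hours = sum(hours for _, hours in staff)
actual_cost = sum(rate * hours for rate, hours in staff)

# Plain mean: every pay rate counts equally, no matter the hours worked.
simple_mean = sum(rate for rate, _ in staff) / len(staff)

# Weighted mean: each pay rate counts in proportion to the hours worked at it.
weighted_mean = actual_cost / total_hours

budget_simple = simple_mean * total_hours
budget_weighted = weighted_mean * total_hours

print(f"actual monthly cost:  ${actual_cost:,.2f}")
print(f"budget from mean:     ${budget_simple:,.2f}")
print(f"budget from weighted: ${budget_weighted:,.2f}")
```

In this made-up case the plain mean over-budgets by a wide margin because the highest rate barely shows up on the schedule, while the weighted mean lands on the actual cost by construction. Flip the hours around and the plain mean would under-budget instead.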