In a recent post, I blogged about the limits of scientific research. I looked at various errors committed by researchers that undermined the accuracy or relevance of studies and experiments. While these errors do occur, even in studies reported in prestigious peer-reviewed journals, there is a basic set of protocols followed by researchers, reviewers, and editors. These protocols, when followed properly, allow researchers to have reasonable confidence in the predictive or explanatory value of their work. Unfortunately, one cannot say the same for the average professional in a typical knowledge worker environment. Many professionals without a background in scientific research lack a basic understanding of concepts such as statistical significance and appropriate sampling techniques.

Some professionals may wonder why this is even an issue. If they’re not conducting actual research, do they need to be fluent in statistics and sampling? There’s good reason to believe that they do. Many judgments and decisions we make in a professional workplace would be better informed by these principles. One key principle is the role of sample size in many common analytical scenarios.

In 1974, the researchers Tversky and Kahneman published a classic study that demonstrated people’s inherent tendency to undervalue the impact of sample size on the variability of outcomes. In the study, they asked subjects to consider an imaginary town served by two hospitals. The larger hospital has 45 births a day; the smaller has only 15. Over the course of a year, each hospital records the days on which more than 60% of the new babies were boys. The subjects’ challenge was to determine which hospital would record more of these unusual days. Only 21 out of 95 subjects correctly guessed that the smaller hospital would record more of them.
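The hospital result is easy to verify with exact binomial arithmetic. Here is a minimal sketch (function names are mine, not from the study); note that “more than 60% boys” means at least 10 boys of 15 in the small hospital, and at least 28 of 45 in the large one:

```python
from math import comb

def tail_prob(n: int, k_min: int) -> float:
    """P(X >= k_min) where X ~ Binomial(n, 0.5): the chance that at
    least k_min of n births are boys on a given day."""
    return sum(comb(n, k) for k in range(k_min, n + 1)) / 2 ** n

# A day is "unusual" when more than 60% of that day's births are boys.
small = tail_prob(15, 10)   # 10/15 ≈ 66.7% is the first count above 60%
large = tail_prob(45, 28)   # 28/45 ≈ 62.2% is the first count above 60%

print(f"small hospital: {small:.3f} per day, ~{small * 365:.0f} days/year")
print(f"large hospital: {large:.3f} per day, ~{large * 365:.0f} days/year")
```

The small hospital sees an “unusual” day roughly 15% of the time, the large one far less often, which is exactly the effect most subjects missed.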

In a similar (but more mind-bending) problem, consider an urn that you are told contains balls of two different colors. One-third of the balls are one color and the remaining two-thirds are the other. Now consider two different scenarios. In the first, you draw 5 balls and 4 of them are white. In the second, you draw 20 balls and 12 of them are white. In which situation should you be more confident that white is in fact the predominant color in the urn? Surprisingly, the larger but less lopsided sample is the stronger evidence: in the first scenario, the probability that white predominates is 8/9, about 89%; in the second, it is 16/17, about 94%.
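Those probabilities come from a straightforward Bayesian calculation with equal prior odds on the two hypotheses (white is the 2/3 color vs. the 1/3 color). A minimal sketch (the function name is mine):

```python
from math import comb

def posterior_majority_white(n: int, k: int) -> float:
    """Posterior probability that white is the 2/3 color, given k white
    balls in n draws, starting from equal prior odds."""
    like_majority = comb(n, k) * (2 / 3) ** k * (1 / 3) ** (n - k)
    like_minority = comb(n, k) * (1 / 3) ** k * (2 / 3) ** (n - k)
    return like_majority / (like_majority + like_minority)

print(posterior_majority_white(5, 4))    # 8/9  ≈ 0.889
print(posterior_majority_white(20, 12))  # 16/17 ≈ 0.941
```

The binomial coefficient cancels in the ratio, so all that matters is the excess of white draws over non-white: 3 in the first case, 4 in the second, giving posterior odds of 2³:1 and 2⁴:1 respectively.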

Both the hospital problem and the colored-ball problem are examples of a simple statistical phenomenon related to sample size. Sample size is simply the number of observed elements or trials in an experiment, study, or survey. All processes involving randomness are governed by laws related to sample size. As a rule, the smaller the sample size, the greater the variability of outcomes.

Let’s look at this rule through that most basic of random events, a coin toss. Start with a very small sample of 10 tosses. We would expect an average of 5 heads. Probability theory tells us there is about a 66% chance that the result will be within 20% of the expected norm, that is, either 4, 5, or 6 heads. However, if we were to toss the coin 100 times, we would be within 20% of normal (with normal being 50 heads) over 96% of the time. As the sample grew very large, we would expect the percentage of heads to be very close to 50%, with little likelihood of significant variance. The takeaway is a simple one: when dealing with small samples, expect greater volatility of results.
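Both coin-toss figures can be computed exactly rather than taken on faith. A minimal sketch (the function name is mine):

```python
from math import comb

def within_pct(n: int, pct: float = 0.20) -> float:
    """P that the head count in n fair tosses lands within pct of n/2."""
    lo = (n / 2) * (1 - pct)
    hi = (n / 2) * (1 + pct)
    hits = sum(comb(n, k) for k in range(n + 1) if lo <= k <= hi)
    return hits / 2 ** n

print(f"{within_pct(10):.3f}")   # 4-6 heads out of 10:  ≈ 0.656
print(f"{within_pct(100):.3f}")  # 40-60 heads out of 100: ≈ 0.965
```

Going from 10 tosses to 100 shrinks the relative spread by a factor of roughly √10, which is why the probability of staying within the same 20% band jumps from about two-thirds to over 96%.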

So, outside of some nifty trivia questions for happy hour, how does this apply to the corporate workplace? Let’s examine a few common situations where statistics are being examined and sample size is relevant. Imagine you are an executive at a technology company, responsible for a number of sales teams across the country. Because of the way regions and industry sectors are divided, the teams differ significantly in size. The largest team has 100 members and the smallest has 20. One would expect the smallest team to show the greatest variation in performance. This might lead that team to receive honors as “Team of the Month” or to be put on notice as the lowest-performing team in the company.

Okay, you respond, but maybe that team really is the best or worst in my division. What does this have to do with sample size? To understand why sample size is relevant, one needs to understand the randomness inherent in all processes. Every team can be affected by a number of unexpected, external events:

- The CIO and decision maker on a current proposal unexpectedly resigns
- A major client announces an M&A transaction and puts all technology purchases on hold
- A serious regulatory finding causes a client to make a large, unanticipated software purchase
- A hurricane destroys a client’s data center, leading to the replacement of large amounts of hardware

While any group could be impacted by one or more of these events, the large teams have a mitigating force: a larger pool of average performers. On a small team, a single aberrational event, positive or negative, can skew the average. The impact of small sample size is not limited to these large, unusual events; other, more common sources of variation will also disproportionately affect the small team.
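A quick Monte Carlo simulation makes the point concrete. In this sketch, every rep on both teams draws monthly sales from the *same* distribution (the mean of 100 and standard deviation of 30 are hypothetical numbers I chose for illustration), yet the 20-person team’s monthly average swings far more than the 100-person team’s:

```python
import random
import statistics

random.seed(42)  # reproducible illustration

def monthly_team_average(team_size: int) -> float:
    """One simulated month: each rep's sales drawn from an identical
    distribution (mean 100, sd 30 -- hypothetical numbers)."""
    return statistics.fmean(random.gauss(100, 30) for _ in range(team_size))

months = 1000
small = [monthly_team_average(20) for _ in range(months)]
large = [monthly_team_average(100) for _ in range(months)]

print(f"small-team spread (sd of monthly averages): {statistics.stdev(small):.2f}")
print(f"large-team spread (sd of monthly averages): {statistics.stdev(large):.2f}")
```

Theory predicts spreads of roughly 30/√20 ≈ 6.7 versus 30/√100 = 3.0, so the small team is more than twice as likely to post an extreme month, despite identical underlying talent.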

To better understand personal variation in performance, picture the following thought experiment. Imagine that one of your hobbies is going to the batting range and practicing your hitting against pitching machines. Over time, you develop a comfort level with a particular machine: it throws the exact same pitch, at the same speed, every time. You’ve kept track of your performance and noticed that, on average, you make solid contact with 50% of pitches. When examining small subsets of your performance, however, you notice significant variation. On several occasions you made contact 9 times out of 10; in one difficult stretch, you missed 8 times in a row.

Note that the batting range example involves a highly repeatable process with minimal variation in challenge. Nevertheless, there can be significant short-term variation in performance. The same statistical forces apply to the small sales team. Despite earnest effort, like all human beings, its members will show some normal variation in their outcomes. But unlike the larger teams, their performance variation will be magnified: the large teams get 1,000 swings of the bat, not 100.
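Streaks like “8 misses in a row” feel like evidence of a slump, but they fall out of pure chance at a constant 50% contact rate. A minimal simulation sketch (all names are mine):

```python
import random
from itertools import groupby

random.seed(7)  # reproducible illustration

def longest_miss_streak(swings: int, p_contact: float = 0.5) -> int:
    """Longest run of consecutive misses in one session of swings,
    with each swing an independent 50/50 contact/miss."""
    results = [random.random() < p_contact for _ in range(swings)]
    runs = (len(list(g)) for hit, g in groupby(results) if not hit)
    return max(runs, default=0)

# 500 sessions of 100 swings each, at an unchanging 50% contact rate.
streaks = [longest_miss_streak(100) for _ in range(500)]
print(f"sessions containing an 8+ miss streak: {sum(s >= 8 for s in streaks)}")
```

Even with skill held perfectly constant, a meaningful fraction of 100-swing sessions contain a miss streak of 8 or more, which is exactly the kind of noise a small sales team’s monthly numbers will carry.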

So, where does this leave us? Should we simply discount performance measurement, such as sales data for teams? The answer is no. If we are to run an organization with an evidence-based approach, data, metrics, and measurement are essential. However, understanding how sample size affects variability should make us cautious when dealing with small samples, whether because a team is small or because the measurement period is short. When looking to measure performance within an organization, be mindful of the volatility inherent in small data sets.

**Additional Thoughts on Visualizing Small Sample Size**

Another way to picture the effect of small sample sizes is to think of a pot of soup. Imagine that you’ve decided to kick the soup up a few notches with some Uncle Jack’s Super Hot Atomic Sauce. You randomly fire a few squirts across the pot. But just then your phone rings, and you forget to mix in the sauce. Now picture two different scenarios. In the first, you ladle out a bowl of soup. In the second, you dip in with a teaspoon to “test” the soup. In the first scenario, like the large sales team, you’d be more likely to get average performance, that is, soup with a tasty level of hot sauce mixed in. In the second, like the smaller team, you’d be more likely to get an extreme result: either a mouth-scorching dose of hot sauce or a bland, unseasoned taste of soup.