Data Sampling Techniques Part 1: Statistical Sampling

drjohnchun · 09-22-2016

The 2016 U.S. presidential election is less than two months away, and the campaigns are intensifying as Election Day draws near. According to recent reports, the candidates appear to be locked in a virtual dead heat. A recent news article reads: "Candidate A and Candidate B start the race to November 8 on essentially even ground, with Candidate A edging Candidate B by a scant two points among likely voters, and the contest sparking sharp divisions along demographic lines in a new poll. Candidate A tops Candidate B 45% to 43% in the new survey." Let's pause and ask: is Candidate A's 2-point "edge" meaningful, or is it a wash within the sampling error? Is the article correct in saying "Candidate A tops Candidate B"? What does all this mean, and how can we apply this knowledge to our work in ServiceNow?

Statistical sampling is a technique for estimating the big picture (called the "population") from a small subset of data (called the "sample"). It's commonly used in surveys, polls and audits where evaluating the entire population is impossible, impractical (for example, it may cost too much or take too long), or unnecessary. A census, by contrast, evaluates every member of the population. An example is the U.S. Census, which the U.S. Government conducts every ten years at tremendous cost and effort, counting every single resident in the U.S. In working with data and records, both sampling and census techniques provide insight into the data we have; which technique we choose largely depends on the situation and what we're trying to accomplish.

Let's take a look at a few key concepts in statistical sampling using this election poll as an example, without too much politics, math or statistical jargon. Once we understand how this works (and are better able to interpret the polls in the news), we'll look at some practical examples in ServiceNow that can help us with data quality, process improvement and compliance.

The news article goes on to describe how the poll was conducted: "The poll was conducted by telephone Sept. 1-4 among a random national sample of 1,001 adults. The survey includes results among 886 registered voters and 786 likely voters. For results among registered or likely voters, the margin of sampling error is plus or minus 3.5 percentage points." A simple way to translate this: allowing for the margin of error, the "true" results may be anywhere between 41.5% and 48.5% for Candidate A and between 39.5% and 46.5% for Candidate B. Because the 2-point difference is well within the 3.5-point margin of error, the results are a statistical tie; that is, one can't say "Candidate A tops Candidate B" with confidence.
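To make the arithmetic concrete, here is a minimal Python sketch that builds the two ranges from the reported numbers and checks whether they overlap. The function names are made up for illustration, and an overlap check is only a rough first test, not a formal significance test.

```python
def confidence_interval(share_pct, moe_pct):
    """Return (low, high) bounds, in percentage points, for a reported share."""
    return (share_pct - moe_pct, share_pct + moe_pct)


def intervals_overlap(a, b):
    """True when two (low, high) intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


MOE = 3.5  # reported margin of sampling error, in percentage points

candidate_a = confidence_interval(45.0, MOE)
candidate_b = confidence_interval(43.0, MOE)

print(candidate_a)  # (41.5, 48.5)
print(candidate_b)  # (39.5, 46.5)
print(intervals_overlap(candidate_a, candidate_b))  # True
```

Because the two ranges overlap across most of their width, the poll by itself can't tell us who is actually ahead.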

Let's look at this in a little more detail. According to the U.S. Census Bureau, there were 153 million registered voters, of whom 133 million voted in 2012, when the last presidential election was held. So let's assume that the population size of "likely voters" is 133 million. At a 95% confidence level (we'll see what this means in just a bit), the margin of error (MoE) is approximately [1]

Margin of Error (MoE) ≈ ± 1 / √n

where n is the sample size. Plugging in the poll's sample size of 786 likely voters gives a margin of error of about 3.5%, consistent with the article. Although the article doesn't state the confidence level used, 95% is a typical value. The beauty here is that we got the results from only 786 likely voters; it would've been nearly impossible to call all 133 million. We can see from the equation above that the margin of error shrinks as the sample size grows, which makes sense; as the sample size approaches the population size, the error should approach zero.
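As a quick check, the sketch below evaluates the 1/√n rule of thumb for a few sample sizes. For comparison it also evaluates the standard normal-approximation formula z·√(p(1−p)/n) with p = 0.5 and z = 1.96, which is where the rule of thumb comes from (1.96 × 0.5 ≈ 1). That second formula and the function names are my additions, not something stated in the news article.

```python
import math


def approx_moe(n):
    """Rule-of-thumb 95% margin of error: 1 / sqrt(n)."""
    return 1.0 / math.sqrt(n)


def normal_moe(n, p=0.5, z=1.96):
    """Normal-approximation margin of error: z * sqrt(p * (1 - p) / n).

    p = 0.5 is the worst case (widest margin); z = 1.96 corresponds to 95%.
    """
    return z * math.sqrt(p * (1.0 - p) / n)


# 786 is the poll's likely-voter sample; the other sizes are for comparison.
for n in (100, 786, 1001, 10000):
    print(f"n={n:>6}  1/sqrt(n)={approx_moe(n):.2%}  normal={normal_moe(n):.2%}")
```

Both versions land at roughly 3.5% for n = 786, and both show that cutting the margin of error in half takes about four times as many samples.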

Here are the steps for conducting a poll:

1. Define the objective. In this poll, we're trying to find out whom the likely voters in the U.S. would vote for if the presidential election were held today.
2. Define the population. Based on the objective, define what the population is. In this poll, the population is the "likely voters in the U.S.", which we assumed to be 133 million voters from the 2012 presidential election.
3. Define the confidence level. The confidence level of 95% is typically used in many cases; we'll see what this means below.
4. Define the margin of error and sample size. As we saw above, the two are related. Typically, a margin of error of 5% or smaller is used; from that, the sample size can be calculated.
5. Select samples. Based on the sample size to be used, select random samples from the population. The keyword here is "random"; there should be no bias in the selection process.
6. Evaluate the samples. Evaluate the samples based on the objective and pre-defined criteria.
7. Draw conclusions. Based on the outcome, draw conclusions from statistical analysis.
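The sample size and selection steps can be sketched in a few lines of Python. This assumes the 1/√n rule of thumb at a 95% confidence level, and it uses a hypothetical population of 100,000 record IDs as a stand-in, since we obviously can't enumerate 133 million voters here. The function names are illustrative only.

```python
import math
import random


def sample_size_for(moe_pct):
    """Invert MoE ~ 1 / sqrt(n): sample size needed for a target margin of error.

    moe_pct is in percentage points, e.g. 5 for plus or minus 5%.
    """
    return math.ceil((100.0 / moe_pct) ** 2)


def draw_sample(population_ids, n, seed=None):
    """Pick n distinct members uniformly at random, so no member is favored."""
    rng = random.Random(seed)
    return rng.sample(population_ids, n)


n = sample_size_for(5)               # target margin of error: 5 points
population_ids = range(1, 100_001)   # hypothetical stand-in population
sample = draw_sample(population_ids, n, seed=42)

print(n)            # 400
print(len(sample))  # 400
```

Fixing the seed makes the selection reproducible, which can be handy when a sample has to be re-created later; leave it unset for a fresh draw each time.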

Poll results are used in conjunction with the margin of error to give a confidence interval. Since the poll results only provide an estimate, we can only say that "somewhere between 41.5% and 48.5% (or 45 ± 3.5%) of the likely voters would vote for Candidate A." The 95% confidence level means this: if we were to repeat the same poll 100 times, each time with a different random sample, we'd expect about 95 of the resulting confidence intervals to contain the true value. Reporting a single number, as in "45% of voters are likely to vote for Candidate A," may seem convenient and is common in the media, but it's not accurate reporting and may be misleading, especially when 45% is said to "top" 43% as we saw above.
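One way to get a feel for the confidence level is to simulate it. The sketch below assumes a "true" share of 45% (a made-up value for the simulation; in a real poll we never know it), runs 1,000 simulated polls of 786 voters each, and counts how often the interval of estimate ± 1/√n contains the true share.

```python
import math
import random


def simulate_coverage(true_share=0.45, n=786, polls=1000, seed=1):
    """Fraction of simulated polls whose interval contains the true share."""
    rng = random.Random(seed)
    moe = 1.0 / math.sqrt(n)  # rule-of-thumb 95% margin of error
    hits = 0
    for _ in range(polls):
        # Each simulated respondent picks Candidate A with probability true_share.
        votes = sum(rng.random() < true_share for _ in range(n))
        estimate = votes / n
        if estimate - moe <= true_share <= estimate + moe:
            hits += 1
    return hits / polls


coverage = simulate_coverage()
print(f"{coverage:.1%} of the simulated intervals contain the true share")
```

The fraction should come out close to 95%, though the exact figure varies a little with the seed and the number of simulated polls.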

If you'd like to read more on election polls in general terms, "5 key things to know about the margin of error in election polls" provides good explanations. For a more mathematical approach, "The 'Margin of Error' for Differences in Polls" provides the details.

Now that we've seen how statistical sampling works in election polls, we'll see next time how this can apply to our work in ServiceNow.

Please feel free to connect, follow, post feedback / questions / comments, share, like, bookmark, endorse.

John Chun, PhD PMP

1. Charles H. Franklin, "The 'Margin of Error' for Differences in Polls" (2007) http://abcnews.go.com/images/PollingUnit/MOEFranklin.pdf
