Chapter 12: Statistics

Statistics gives us tools for collecting, organizing, describing, and interpreting data. In this chapter, we introduce basic statistical vocabulary, methods for displaying data, and numerical summaries for the center and spread of a data set.

Populations and Samples

Definition: Population

The population of a study is the group the collected data is intended to describe.

Definition: Parameter

A parameter is a value (average, percentage, etc.) calculated using all the data from a population.

Parameters are usually denoted with Greek letters.

Definition: Sample

A sample is a smaller subset of the entire population, ideally one that is fairly representative of the whole population.

Definition: Statistic

A statistic is a value (average, percentage, etc.) calculated using the data from a sample.

Categorizing Data

Definition: Data Set

A data set is a collection of values called data points or data values.

Definition: Variable

A variable is any characteristic that is measured from an object or individual.

Once we have gathered data, we might wish to classify it. Roughly speaking, data can be classified as categorical data or quantitative data.

Definition: Qualitative and Quantitative Variables

A qualitative (categorical) variable represents a characteristic. Qualitative variables are not inherently numbers, so they cannot be added, multiplied, or averaged, but they can be represented graphically with graphs such as bar graphs.

Examples: gender, hair color, race, nationality, religion, course grade, year in college, etc.

A quantitative (numerical) variable represents a measurable quantity. Quantitative variables are inherently numbers, so they can be added, multiplied, averaged, and displayed graphically.

Examples: height, weight, number of cats owned, score of a football game, etc.

Quantitative variables can be further subdivided into continuous and discrete variables.

A continuous variable can take on an uncountable number of values in a range. In other words, the variable can be any number in a range of values. Continuous variables are usually things that are measured.

Examples: height, weight, foot size, time to take a test, length, etc.

A discrete variable can take on only specific values in a range. Discrete variables are usually things that you count.

Examples: IQ, shoe size, family size, number of cats owned, score in a football game, etc.

Sampling Methods

Definition: Random Sample

A random sample is one in which each member of the population has an equal probability of being chosen. A simple random sample is one in which every member of the population, and every possible group of a given size, has an equal probability of being chosen.

Definition: Sampling Bias

A sampling method is biased if some members of the population are more likely than others to be included in the sample.

Definition: Stratified Sample

A stratified sample is obtained by dividing the population into meaningful groups, called strata, and then taking a random sample from each group.
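As a rough illustration of the two sampling definitions above, the sketch below draws a simple random sample and a stratified sample using Python's standard `random` module. The population of 60 labeled students, the strata names, and the choice of two members per stratum are all hypothetical and appear nowhere in the chapter.

```python
import random

def simple_random_sample(population, k, seed=None):
    """Draw k members without replacement; every group of size k is equally likely."""
    rng = random.Random(seed)
    return rng.sample(population, k)

def stratified_sample(strata, k_per_stratum, seed=None):
    """Take a simple random sample of size k_per_stratum from each stratum."""
    rng = random.Random(seed)
    return {name: rng.sample(members, k_per_stratum)
            for name, members in strata.items()}

# Hypothetical population: 60 students grouped by class year (assumed data).
strata = {
    "first-year": [f"F{i}" for i in range(30)],
    "second-year": [f"S{i}" for i in range(20)],
    "third-year": [f"T{i}" for i in range(10)],
}
population = [student for members in strata.values() for student in members]

srs = simple_random_sample(population, 6, seed=1)
stratified = stratified_sample(strata, 2, seed=1)
print(srs)
print(stratified)
```

Taking the same number of members from each stratum is only one possible design. Sampling each stratum in proportion to its size is another common choice.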

Presenting Data Graphically

Once we have collected data, we need to start analyzing it. One way to display and summarize data is to use statistical graphing techniques. The type of graph we use depends on the type of data collected. Qualitative data use graphs like bar graphs and pie graphs. Quantitative data use graphs such as histograms and frequency polygons.

In order to create graphs, we must first organize and summarize the individual data values in the form of a frequency distribution. A frequency distribution is a listing of the data values (or groups of data values) and how often those data values occur.

Definition: Frequency and Frequency Distributions

Frequency is the number of times a data value or group of data values (called a class) occurs in a data set.
A frequency distribution is a listing of each data value or class of data values along with its frequency.
Relative frequency is the frequency divided by $n$ , the size of the sample. This gives the proportion of the entire data set represented by each value or class. Relative frequencies are expressed as fractions, decimals, or percentages.
A relative frequency distribution is a listing of each data value or class of data values along with its relative frequency.
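The definitions above can be sketched in a few lines of Python. The letter grades below are a hypothetical sample of $n = 10$ values chosen only for illustration, and the function name is our own.

```python
from collections import Counter

def frequency_distribution(data):
    """Return {value: (frequency, relative frequency)} for each distinct value."""
    counts = Counter(data)
    n = len(data)
    return {value: (count, count / n) for value, count in sorted(counts.items())}

# Hypothetical sample of n = 10 letter grades (assumed data).
grades = ["A", "B", "B", "C", "A", "B", "D", "C", "B", "A"]
table = frequency_distribution(grades)
for value, (freq, rel) in table.items():
    print(f"{value}: frequency {freq}, relative frequency {rel:.2f}")
```

The frequencies add up to $n$ and the relative frequencies add up to $1$, which is a quick check on any frequency table.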

The method of creating a frequency distribution depends on whether we are working with qualitative data or quantitative data. We will now look at how to create each type of frequency distribution according to the type of data and the graphs that go with them.

Definition: Bar Graph

A bar graph displays a bar for each category. The length of each bar indicates the frequency of that category.

To construct a bar graph, we draw a vertical axis and a horizontal axis. The vertical direction has a scale and measures the frequency of each category. The horizontal axis has no numerical scale in this instance but lists the categories.

Definition: Histogram

A histogram is like a bar graph, but the horizontal axis is a number line and the bars represent numerical intervals.
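Before a histogram can be drawn, the data values must be counted into numerical intervals (classes). The sketch below does this for the golf scores used in the five-number summary example later in this chapter. The class boundaries (starting at 75, width 10) are our own assumption, and the printed rows of `#` symbols are only a rough text stand-in for the bars.

```python
def class_frequencies(data, start, width, num_classes):
    """Count how many data values fall in each class [lower, lower + width)."""
    counts = [0] * num_classes
    for x in data:
        index = int((x - start) // width)
        if 0 <= index < num_classes:   # values outside every class are ignored
            counts[index] += 1
    return [(start + i * width, start + (i + 1) * width, counts[i])
            for i in range(num_classes)]

golf_scores = [89, 90, 87, 95, 86, 81, 111, 108, 83, 88, 91, 79]
classes = class_frequencies(golf_scores, start=75, width=10, num_classes=4)
for lower, upper, count in classes:
    print(f"{lower:>3} to {upper:<3} | {'#' * count}")
```

Each class here includes its lower boundary and excludes its upper boundary, so a score of 95 is counted in the class from 95 to 105.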

Measures of Central Tendency

In addition to graphical and verbal descriptions of data, we can use numbers to summarize quantitative data distributions. We want to know what a typical, average, or representative value for a set of data is (the center of the data), and how spread out the values are in the data set. In this section we explore measures of central tendency, and in the next section we will explore measures of spread.

Mean

We need to be careful with using the word "average" as it means different things to different people in different contexts. One of the most common uses of the word "average" is what mathematicians and statisticians call the arithmetic mean, or just the mean for short. The mean is what most people think of when they use the word "average," but we should try to use statistical terms to be precise.

Definition: Mean

The mean of a set of data is found as the sum of the data values divided by the number of values. Symbolically, the formula for the sample mean is:

$\overline{x} = \frac{\sum x _{i}}{n} = \frac{x _{1} + x _{2} + x _{3} + x _{4} + \dots + x _{n}}{n},$

where each $x_{i}$ is the $i$ th data value and $n$ is the sample size. The expression $\sum x_{i}$ is a short way to write the data values added together.

We use the symbol $\overline{x}$ to represent the mean, while $x$ is the symbol for a single measurement. We say "x bar."

Example: Mean from a Frequency Table

A sample of 100 families in a particular neighborhood are asked their annual household income, to the nearest 5 thousand dollars. The results are summarized in the frequency table below. What is the mean annual household income for this neighborhood?

Income (thousand dollars)	15	20	25	30	35	40	45	50
Frequency	6	8	11	17	19	20	12	7

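Calculating the mean by hand could get tricky if we try to actually add 100 values one at a time. Instead, since each income in the table occurs as many times as its frequency, we multiply each income by its frequency, add the products, and divide by 100: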

$\overline{x} = \frac{15 \cdot 6 + 20 \cdot 8 + 25 \cdot 11 + 30 \cdot 17 + 35 \cdot 19 + 40 \cdot 20 + 45 \cdot 12 + 50 \cdot 7}{100} = \frac{3390}{100} = 33.9.$

The mean household income of our sample is $33.9$ thousand dollars, or $\$33{,}900$.
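The same calculation can be sketched in Python. The function name `mean_from_frequency_table` is our own, and the two lists are copied from the frequency table in this example.

```python
def mean_from_frequency_table(values, frequencies):
    """Mean of grouped data: sum of value * frequency divided by the total frequency."""
    total = sum(v * f for v, f in zip(values, frequencies))
    n = sum(frequencies)
    return total / n

incomes = [15, 20, 25, 30, 35, 40, 45, 50]     # thousand dollars
frequencies = [6, 8, 11, 17, 19, 20, 12, 7]    # number of families

mean_income = mean_from_frequency_table(incomes, frequencies)
print(mean_income)  # 33.9
```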

Example: Effect of an Outlier on the Mean

Continuing from the previous example, suppose a new family with a household income of 5 million dollars moves into the neighborhood. This is 5,000 thousand dollars. Including this in the sample, the mean is now:

$\overline{x} = \frac{15 \cdot 6 + 20 \cdot 8 + 25 \cdot 11 + 30 \cdot 17 + 35 \cdot 19 + 40 \cdot 20 + 45 \cdot 12 + 50 \cdot 7 + 5000 \cdot 1}{101} = \frac{8390}{101} \approx 83.069.$

While $83.1$ thousand dollars (about $\$83{,}100$) is the correct mean household income, it no longer represents a "typical" value.
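To see the size of the shift, the sketch below expands the frequency table into its 100 individual values, appends the single 5,000 thousand dollar income, and compares the two means using the standard `statistics` module. The variable names are our own.

```python
from statistics import mean

incomes = [15, 20, 25, 30, 35, 40, 45, 50]     # thousand dollars
frequencies = [6, 8, 11, 17, 19, 20, 12, 7]    # number of families

# Expand the frequency table into the 100 individual data values.
sample = [v for v, f in zip(incomes, frequencies) for _ in range(f)]
with_outlier = sample + [5000]

mean_before = mean(sample)
mean_after = mean(with_outlier)
print(round(mean_before, 3), round(mean_after, 3))  # 33.9 83.069
```

One added value out of 101 more than doubles the mean, even though the other 100 incomes are unchanged.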

Imagine the data values on a see-saw or balance scale. The mean is the value that keeps the data in balance. In the graph of the household income data, the $5$ million data value is so far out to the right that the mean has to adjust upward to keep things in balance.

For this reason, when working with data sets that have outliers, values far outside the primary grouping, it is common to use a different measure of center: the median.

Definition: Outlier

A data value that is much higher or lower than all of the other data values is called an outlier. Sometimes outliers are unusual data values that are interesting and should be studied further, and sometimes they are mistakes.

Median

Definition: Median

The median is the value found in the middle of an ordered data set.

There is no symbol or formula for the median. To find the median, order the data values from smallest to largest and then count from both ends inward toward the center one data value at a time until reaching the middle.

If there are an odd number of data values, then there is one middle data value and that is the median.

If there are an even number of data values, then there are two middle data values. The median is the mean of those two data values.

Example: Median with an Odd Number of Values

Find the median of these quiz scores:

$5, 10, 8, 6, 4, 8, 2, 5, 7, 7, 6.$

First, list the scores in order from smallest to largest: $2, 4, 5, 5, 6, 6, 7, 7, 8, 8, 10.$ It is helpful to mark or cross off the numbers in the original data set as you list them to make sure you do not miss any. Also, be sure to count the number of data values in the ordered list to make sure it matches the number of data values in the original list.

In this example there are $n = 11$ quiz scores. When the distribution contains an odd number of data values, there will be a single value in the middle and that value is the median. For small data sets, we can "walk" one value at a time from the ends of the ordered list toward the center to find the median:

$\underbrace{2,\ 4,\ 5,\ 5,\ 6}_{\text{lower half}},\ \underbrace{6}_{\text{median}},\ \underbrace{7,\ 7,\ 8,\ 8,\ 10}_{\text{upper half}}.$

The median quiz score is 6 points.

Example: Median with an Even Number of Values

Suppose another quiz score needs to be included in the set of quiz scores in the previous example. Someone in the class got a perfect score of 20 points on this very difficult quiz.

The ordered list of data is now:

$2, 4, 5, 5, 6, 6, 7, 7, 8, 8, 10, 20.$

There are now $n = 12$ quiz scores in our sample. When the distribution contains an even number of data values, there will be a pair of values in the middle rather than a single value. We find the mean of those middle two values.

$\underbrace{2,\ 4,\ 5,\ 5,\ 6}_{\text{lower half}},\ \underbrace{6,\ 7}_{\text{middle pair}},\ \underbrace{7,\ 8,\ 8,\ 10,\ 20}_{\text{upper half}}.$

The median quiz score is $\frac{6 + 7}{2} = 6.5$ points. It is important to notice that despite adding an outlier quiz score to the data set, the median is largely unaffected. The median quiz score for the new distribution is 6.5 points when it was 6 points before.
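The procedure for odd and even numbers of data values can be sketched as a short Python function and checked against the two quiz-score examples. The function below is our own minimal version; Python's standard `statistics.median` follows the same rule.

```python
def median(data):
    """Middle value of the ordered data; mean of the two middle values when n is even."""
    ordered = sorted(data)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

quiz_scores = [5, 10, 8, 6, 4, 8, 2, 5, 7, 7, 6]
print(median(quiz_scores))          # 6
print(median(quiz_scores + [20]))   # 6.5
```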

Mode

Definition: Mode

The mode is the data value that occurs most frequently in the data set.

A data set may have no mode, one mode (unimodal), or multiple modes (bimodal, multimodal).
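A minimal sketch of finding the mode or modes is shown below. It assumes the convention that a data set in which no value repeats has no mode; other references treat that case differently. The function name `modes` is our own.

```python
from collections import Counter

def modes(data):
    """Return every value tied for the highest frequency, or [] if no value repeats."""
    if not data:
        return []
    counts = Counter(data)
    highest = max(counts.values())
    if highest == 1:          # assumed convention: no repeated value means no mode
        return []
    return sorted(value for value, count in counts.items() if count == highest)

print(modes([4, 4, 4, 5, 5, 5, 5, 6, 6, 6]))        # [5]
print(modes([5, 10, 8, 6, 4, 8, 2, 5, 7, 7, 6]))    # [5, 6, 7, 8]
print(modes([1, 2, 3]))                             # []
```

The second data set is the list of quiz scores from the median examples. Four values are tied with two occurrences each, so it is multimodal.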

Measures of Spread and Position

Consider these three sets of student quiz scores on a 10-point quiz:

Class A: $5, 5, 5, 5, 5, 5, 5, 5, 5, 5$
Class B: $0, 0, 0, 0, 0, 10, 10, 10, 10, 10$
Class C: $4, 4, 4, 5, 5, 5, 5, 6, 6, 6$

All these data sets have mean $\overline{x} = 5$ and median $5$ , yet the three sets of scores are clearly quite different.

In Class A, everyone had the same score. In Class B, half the class got no points and the other half got a perfect score of 10 points. Scores in Class C were not as consistent as those in Class A but also not as widely varied as those in Class B.

This scenario shows that, in addition to the mean and median, which measure the "typical" value of a data set, we also need a way to measure how "spread out" or varied each data set is. There are several ways to measure the variation and locate positions in a data distribution. In this section we explore range, standard deviation, percentiles, quartiles, and the interquartile range (IQR). We also examine a graphical representation of spread using a box plot.

Range

The first and simplest way to measure spread is the range. Calculation of the range uses only two values from the data set: the largest value and the smallest value. The range is the distance between these two values.

Definition: Range

The range is the difference between the maximum value and the minimum value of the data set.

Example: Range

Refer to the three sets of student quiz scores from the introduction to this section.

For Class A, the range is 0 since both the maximum and minimum are the same: $5 - 5 = 0$ .
For Class B, the range is 10 since $10 - 0 = 10$ .
For Class C, the range is 2 since $6 - 4 = 2$ .

In this example, the range seems to reveal how spread out the data is. However, suppose we add a fourth set of quiz scores:

Class D: $0, 5, 5, 5, 5, 5, 5, 5, 5, 10$

Quiz scores from this class also have a mean and median of $5$ . The range is $10$ like Class B, yet this data set is quite different from Class B. To more accurately measure the difference in spreads between these two sets of data, we will have to turn to more sophisticated measures of variation.
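The ranges of the four classes can be checked with a short Python sketch. The function name `data_range` is our own.

```python
def data_range(data):
    """Range: maximum value minus minimum value."""
    return max(data) - min(data)

classes = {
    "A": [5, 5, 5, 5, 5, 5, 5, 5, 5, 5],
    "B": [0, 0, 0, 0, 0, 10, 10, 10, 10, 10],
    "C": [4, 4, 4, 5, 5, 5, 5, 6, 6, 6],
    "D": [0, 5, 5, 5, 5, 5, 5, 5, 5, 10],
}
ranges = {name: data_range(scores) for name, scores in classes.items()}
print(ranges)  # {'A': 0, 'B': 10, 'C': 2, 'D': 10}
```

Classes B and D get the same range even though eight of the ten scores in Class D sit exactly at the mean.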

Example: Comparing Ranges

Find the range for each data set.

Set A: $10, 20, 30, 40, 50$
Set B: $10, 35, 36, 37, 50$

For both sets of data, the range is $50 - 10 = 40$ . However, most of the data in Set B is closer together, except for the extremes. There seems to be less variability in the data in Set B than in the data in Set A. The range focuses only on the two extreme values and ignores all the data between the extremes. So, we need a better way to quantify the spread.

Standard Deviation

We saw that the range focuses on the difference between the maximum and minimum values. What if we focused on the differences between each of the data values and the center? The center we will use is the mean. The difference between a data value $x$ and the mean of the distribution $\overline{x}$ is called a deviation.

Definition: Deviation

The difference between a data value $x$ and the mean of the data distribution is called the deviation from the mean.

$\text{deviation from the mean} = x - \overline{x}.$

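To see how deviations work, consider the following temperature data set (in $^{\circ}\text{F}$), whose sample mean is approximately $\overline{x} = 62.7$ . The table lists each data value $x$ along with its deviation from the mean:

$\begin{array}{c|c}
x & x - \overline{x} \\ \hline
71 & 71 - 62.7 = 8.3 \\
59 & 59 - 62.7 = -3.7 \\
69 & 69 - 62.7 = 6.3 \\
68 & 68 - 62.7 = 5.3 \\
63 & 63 - 62.7 = 0.3 \\
57 & 57 - 62.7 = -5.7 \\
57 & 57 - 62.7 = -5.7 \\
57 & 57 - 62.7 = -5.7 \\
57 & 57 - 62.7 = -5.7 \\
65 & 65 - 62.7 = 2.3 \\
67 & 67 - 62.7 = 4.3 \\ \hline
\text{sum} & 0.3
\end{array}$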

Notice that some of the deviations are positive and some of them are negative. The sum of the deviations is around zero. If there had been no rounding of the mean, then the sum of the deviations would have been exactly $0$ .
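A short derivation, using the sample mean formula $\overline{x} = \frac{\sum x_{i}}{n}$ from earlier in this chapter, suggests why the deviations must sum to zero whenever the mean is not rounded:

$\sum (x_{i} - \overline{x}) = \sum x_{i} - n \overline{x} = \sum x_{i} - n \cdot \frac{\sum x_{i}}{n} = \sum x_{i} - \sum x_{i} = 0.$

For the temperature data, the exact mean is $\frac{690}{11} \approx 62.727$ . Using the rounded value $62.7$ makes each of the $11$ deviations too large by about $0.027$ , which accounts for the leftover sum of about $0.3$ .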

So what does that tell us? Does this imply that on average the data values are a distance of zero units from the mean? No. It just means that some of the data values are above the mean and some are below the mean. The negative deviations are for data values that are below the mean, and the positive deviations are for data values that are above the mean. The positive and negative deviations from the mean cancel each other out.

We need to eliminate the signs of the deviations so we can measure the distance from the mean. Squaring a number is a widely accepted way to make all of the numbers positive. We continue building the table by adding a third column that contains the squares of the deviations from the mean.

$\begin{array}{c|c|c}
x & x - \overline{x} & (x - \overline{x})^{2} \\ \hline
71 & 71 - 62.7 = 8.3 & (8.3)^{2} = 68.89 \\
59 & 59 - 62.7 = -3.7 & (-3.7)^{2} = 13.69 \\
69 & 69 - 62.7 = 6.3 & (6.3)^{2} = 39.69 \\
68 & 68 - 62.7 = 5.3 & (5.3)^{2} = 28.09 \\
63 & 63 - 62.7 = 0.3 & (0.3)^{2} = 0.09 \\
57 & 57 - 62.7 = -5.7 & (-5.7)^{2} = 32.49 \\
57 & 57 - 62.7 = -5.7 & (-5.7)^{2} = 32.49 \\
57 & 57 - 62.7 = -5.7 & (-5.7)^{2} = 32.49 \\
57 & 57 - 62.7 = -5.7 & (-5.7)^{2} = 32.49 \\
65 & 65 - 62.7 = 2.3 & (2.3)^{2} = 5.29 \\
67 & 67 - 62.7 = 4.3 & (4.3)^{2} = 18.49 \\ \hline
\text{sum} & 0.3 & 304.19
\end{array}$

Now that we have the sum of the squared deviations, we should find the mean of these values. However, since this is a sample, the normal way to find the mean (summing and dividing by $n$ ) does not estimate the true population spread correctly. It would underestimate the true value. So, to calculate a better estimate, we divide by a slightly smaller number, $n - 1$ . This adjusted average is known as the sample variance. The sample variance is the sum of the squared deviations from the mean divided by $n - 1$ . The symbol for sample variance is $s^{2}$ , and the formula for the sample variance is

$s^{2} = \frac{\sum (x - \overline{x})^{2}}{n - 1}.$

For this data set, the sample variance is

$s^{2} = \frac{304.19}{11 - 1} = \frac{304.19}{10} = 30.419.$

The variance measures the average squared distance from the mean. Since we want a measure of distance in the original units rather than squared units, we need to take the square root at this point. The result is the sample standard deviation. The sample standard deviation is the square root of the variance and measures, roughly, the typical distance of the data values from the mean. The symbol for sample standard deviation is $s$ , and the formula for the sample standard deviation is

$s = \sqrt{s^{2}} = \sqrt{\frac{\sum (x - \overline{x})^{2}}{n - 1}}.$

Thus, for this data set, the sample standard deviation is

$s = \sqrt{30.419} \approx 5.52^{\circ}\text{F}.$

The units are the same as the original data.

Definition: Sample Standard Deviation

The standard deviation is a measure of spread based on how far each data value deviates from the mean.

$s = \sqrt{\frac{\sum (x - \overline{x})^{2}}{n - 1}}.$

To compute the sample standard deviation by hand:

Find the deviation of each data value from the mean. In other words, subtract the mean from the data value.
Square each deviation.
Add the squared deviations.
Divide by one fewer than the number of data values, $n - 1$ . This value is the variance.
Take the square root of the result.
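The steps above can be sketched in Python and applied to the temperature data from this section. The function names are our own. Note that the code uses the unrounded mean $\frac{690}{11}$ , so its sum of squared deviations is about $304.18$ rather than the $304.19$ obtained with the rounded mean $62.7$ ; the results agree to two decimal places.

```python
import math

def sample_variance(data):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(data)
    mean = sum(data) / n
    squared_deviations = [(x - mean) ** 2 for x in data]   # deviate, then square
    return sum(squared_deviations) / (n - 1)

def sample_standard_deviation(data):
    """Square root of the sample variance."""
    return math.sqrt(sample_variance(data))

temperatures = [71, 59, 69, 68, 63, 57, 57, 57, 57, 65, 67]   # degrees F
variance = sample_variance(temperatures)
std_dev = sample_standard_deviation(temperatures)
print(round(variance, 2), round(std_dev, 2))  # 30.42 5.52
```

Python's standard library offers `statistics.variance` and `statistics.stdev`, which also divide by $n - 1$ .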

Percentiles

Definition: Percentiles

The $k$ th percentile is a value of the data set where $k\%$ of the data set is less than or equal to that data value.

For example, if a data value is at the $80$ th percentile, then $80\%$ of the data values fall at or below this value (and $20\%$ of the data values fall above this value).

We see percentiles in many places in our lives. If you take any standardized test, your score is usually given as a percentile. If you take your child to the doctor, their height and weight are given as percentiles so they can be compared to other children their age. If your child is tested for giftedness or behavior problems, the score is given as a percentile. If your child's score on a giftedness test is at the 92nd percentile, then 92% of all of the children who took the same test scored the same as or lower than your child. Of course, that also means that 8% scored higher than your child.

A percentile is a measure that helps you determine where a data value is located relative to the other data values. For example, a test grade reported as a percentile does not tell you whether you did well or poorly. It does not tell you whether you passed or failed. It only tells you how well you did relative to the rest of the students who took the same test. For this reason, we often refer to a percentile as a measure of position.
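Following the definition above, the percentile position of a value can be estimated by counting the data values at or below it. The sketch below uses the twelve quiz scores from the median example. The function name `percentile_rank` is our own, and other textbooks and software use slightly different percentile conventions.

```python
def percentile_rank(data, value):
    """Percentage of data values that are less than or equal to value."""
    at_or_below = sum(1 for x in data if x <= value)
    return 100 * at_or_below / len(data)

quiz_scores = [2, 4, 5, 5, 6, 6, 7, 7, 8, 8, 10, 20]
print(percentile_rank(quiz_scores, 6))             # 50.0
print(round(percentile_rank(quiz_scores, 8), 1))   # 83.3
```

A score of 6 sits at the 50th percentile because 6 of the 12 scores are at or below it. This says nothing about whether 6 points is a passing score.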

Five-Number Summary

Three very common percentiles are the first, second, and third quartiles. Quartiles are locations in the data set that split the data distribution into quarters, or sections that each contain $25\%$ of the data values.

Definition: Quartiles

Quartiles are values that divide the data in quarters:

The first quartile ( $Q_{1}$ ) is the value so that $25\%$ of the data values are at or below this value. This is also known as the 25th percentile.
The second quartile ( $Q_{2}$ ) is the value so that $50\%$ of the data values are at or below this value. This is also known as the 50th percentile, but more commonly called the median.
The third quartile ( $Q_{3}$ ) is the value so that $75\%$ of the data values are at or below this value. This is also known as the 75th percentile.

To find the quartiles:

Order the data from smallest to largest.
Find the median. This is the second quartile, $Q_{2}$ .
Find the median of the lower half of the data values (all values to the left of the median's location). This is the first quartile, $Q_{1}$ .
Find the median of the upper half of the data values (all values to the right of the median's location). This is the third quartile, $Q_{3}$ .

Like the standard deviation, the quartiles are used to measure how spread out the data are, but unlike the standard deviation the quartiles are not a single-number summary of spread. The three quartiles, together with the maximum and minimum values, create a measure of spread called the five-number summary.

Definition: Five-Number Summary & IQR

The five-number summary takes the form: Minimum, $Q_{1}$ , Median, $Q_{3}$ , Maximum.

These five values divide the data into quarters: $25\%$ of the data is between the minimum and $Q_{1}$ , $25\%$ is between $Q_{1}$ and the median, $25\%$ is between the median and $Q_{3}$ , and $25\%$ is between $Q_{3}$ and the maximum value.

Moreover, $50\%$ of the data lies between $Q_{1}$ and $Q_{3}$ . The distance between $Q_{1}$ and $Q_{3}$ is called the interquartile range.

The interquartile range (IQR) measures the spread in the middle $50\%$ of the data. Subtract $Q_{1}$ from $Q_{3}$ to find its value:

$\text{IQR} = Q_{3} - Q_{1}.$

Example: Five-Number Summary and IQR

The scores for a women's golf team in tournament play are listed below. Find the five-number summary and the IQR.

Data: $89, 90, 87, 95, 86, 81, 111, 108, 83, 88, 91, 79$ .

First, order the $n = 12$ data values from smallest to largest. The median will be the mean of the two middle values since there are an even number of data values.

$\underbrace{79,\ 81,\ 83,\ 86,\ 87,\ 88}_{\text{numbers below median}}\ \underbrace{\ \mid\ }_{\text{median}}\ \underbrace{89,\ 90,\ 91,\ 95,\ 108,\ 111}_{\text{numbers above median}}.$

The median is $\frac{88 + 89}{2} = 88.5$ .
There are 6 numbers below the median: $79, 81, 83, 86, 87, 88$ . The median of these six numbers is $\frac{83 + 86}{2} = 84.5$ .
There are 6 numbers above the median: $89, 90, 91, 95, 108, 111$ . The median of these six numbers is $\frac{91 + 95}{2} = 93$ .
The minimum is 79 and the maximum is 111.

Thus, the five-number summary is $\text{Min} = 79$ , $Q_{1} = 84.5$ , $\text{Med} = 88.5$ , $Q_{3} = 93$ , $\text{Max} = 111$ . The interquartile range is $\text{IQR} = Q_{3} - Q_{1} = 93 - 84.5 = 8.5$ .
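The quartile procedure from this section (median of the lower half, median of the upper half) can be sketched in Python and checked against the golf scores. The function names are our own. Statistical software often uses other quartile conventions, so library functions may return slightly different values for $Q_{1}$ and $Q_{3}$ .

```python
def median(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum) using the median-of-halves method."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]             # values left of the median's location
    upper = ordered[half + n % 2:]     # values right of the median's location
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])

golf_scores = [89, 90, 87, 95, 86, 81, 111, 108, 83, 88, 91, 79]
minimum, q1, med, q3, maximum = five_number_summary(golf_scores)
iqr = q3 - q1
print((minimum, q1, med, q3, maximum), iqr)  # (79, 84.5, 88.5, 93.0, 111) 8.5
```

When the number of data values is odd, the middle value is left out of both halves, which matches the wording "to the left (or right) of the median's location."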

Box-and-Whiskers Plots

Definition: Box Plot

A box plot is a graphical representation of the five-number summary.

A box plot is created by first setting a scale (number line) as a guideline for the box plot. Then, draw a rectangle that spans from $Q_{1}$ to $Q_{3}$ above the number line. Mark the median with a vertical line through the rectangle. Next, draw symbols (dots, small vertical lines, etc.) for the minimum and maximum points to the sides of the rectangle. Finally, draw horizontal lines from the sides of the rectangle out to the symbols. These horizontal lines are known as "whiskers."

Using the results of the golf scores tournament from the previous example, a box plot would show the minimum at $79$ , $Q_{1}$ at $84.5$ , the median at $88.5$ , $Q_{3}$ at $93$ , and the maximum at $111$ .

Mathematics Brush-up for Data Science