Statistical testing is not that difficult! Let Dr. Stephen P. Holden guide you through the murky territory of statistics, only to discover that it isn’t murky at all!
You start by calculating a statistic, which is simply a measure of how far the observed result is from what we might have expected. I think we tend to belabour the problem of statistical significance testing sometimes. The whole process is very intuitive.
Let me take you through a thought-experiment to show you this.
“I’m going to toss a coin 10 times. And to make it interesting, let me put this pineapple (Australian $50 note) on the table and make you this offer:
“If you guess the exact number of heads in the next ten coin tosses that I make, I will give you the $50. If you don’t guess it exactly, I get to keep my $50. Are you willing?”
So, assuming you see that participating in this gamble is a “no-brainer”, you agree.
So what number of coin tosses will be heads? Make your guess now.

H is for Heads
Most people guess somewhere around five (5/10). This is, of course, the “null hypothesis”, but more on that in a moment.
I now proceed to toss the coin: I catch it, I look, and I call out the result. Here’s a series of 10 coin tosses that I prepared beforehand:
In a real version of this ‘thought-experiment’ (that I conduct in presentations), people start to laugh at about seven or eight heads. This is very important because this reflects exactly the logic underlying statistical significance testing.
Okay, so let’s break it down.

H is for Human Intuition
1. There’s a statistic which measures the observed result. It is very simple in this case, it is the number of “Heads” that come up, so let’s just call it the H-statistic. (H is for heads).
2. The distribution of this H-statistic is known in advance: the expected result is 50% or thereabouts. So our expectation, the distribution of the H-Statistic under the “null hypothesis” is as pictured to the right.
3. The observed result was 9 out of 10 heads. Based on our understanding of the distribution at #2 above, we can all agree this result (9/10) is possible, but pretty improbable assuming the coin, the tosses and the calls are fair (which is our expectation under the null hypothesis). In fact, according to the distribution (see chart), the probability of exactly 9 heads is 0.01. The probability of 9 or higher (10 heads) is 0.01 + 0.001 = 0.011. The probability of an extreme value of the H-statistic, say nine or more or one or less (for a two-tailed test), is 0.011 + 0.011 = 0.022.
4. The key question here is whether we take an “improbable” result (p=0.022 or less, for instance) and interpret it as a “surprising” result given the expectation, or interpret it as “unlikely in this case”. The laughter at seven or eight heads attests that many think that getting 9 heads out of 10 exceeds the “significance level”. In other words, they are saying, “Sure, the result is possible, but I’m going to call ‘Bullsh!’” or, in a statistician’s language, “statistical significance.” We reject that the result arose by chance, and conclude that something else was going on.
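The arithmetic behind steps 1-4 can be sketched in a few lines of Python. (The function name h_pmf is mine, not a standard one; note that the exact two-tailed value is 0.0215, which rounds to 0.021, while the article’s 0.022 comes from rounding each tail to 0.011 first.)

```python
from math import comb

def h_pmf(k, n=10, p=0.5):
    """Probability of exactly k heads in n tosses of a fair coin (binomial pmf)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

print(round(h_pmf(9), 3))                    # exactly 9 heads: 0.01
print(round(h_pmf(9) + h_pmf(10), 3))        # 9 or more heads: 0.011
print(round(2 * (h_pmf(9) + h_pmf(10)), 3))  # two-tailed (9+ heads or 1- heads): 0.021
```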
So every one of us has a native statistician inside of us, as reflected in steps 1-4 above. If you want a (slightly) more technical version, you can keep going, but in some senses, my work here is done!

H is the Hard(er) Version!

Step 1: Every statistic is simply a measure of an observed result. So χ² (chi-square), t, F (which happens to be t²), r, etc. are simply measures of how far what we observed was from what was expected, where ‘expected’ for statisticians means no relationship between variables. In the example above, the observed result was 9; the expected result was about 5.

Step 2: The distribution of statistics is known. In the example, the distribution of the H-statistic is shown in the chart. Essentially, the most likely result is 50:50, with declining probabilities for values of H higher (or lower) than 5.
With the distribution in place, we can now look up the probability of any observed result, e.g., the 9 we observed. You can do that here for the H-statistic: the probability of success (i.e., “Heads”) on a single trial is 0.5, the number of trials was 10, and the observed result was 9. The probability of the result being 9 heads or more, or 1 head or lower, is 0.011 + 0.011 = 0.022.
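If you prefer to build the lookup table yourself, the whole distribution of the H-statistic under the null hypothesis is just the binomial probabilities for n = 10 and p = 0.5. A minimal sketch of the chart’s values as a table:

```python
from math import comb

n = 10
# P(H = k) for k = 0..10: comb(n, k) sequences contain exactly k heads,
# and with a fair coin every sequence has probability 0.5**n.
probs = [comb(n, k) * 0.5**n for k in range(n + 1)]

for k, prob in enumerate(probs):
    print(f"H = {k:2d}: P = {prob:.4f}")
```

The peak is at H = 5 (P ≈ 0.246), falling away symmetrically towards the improbable extremes at 0 and 10.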
Step 3: We set our significance (or alpha) level. This is our credulity level. If what we observe (the statistic) becomes sufficiently improbable (typically set as p<0.05), we call it: “statistical significance.”
As an aside, statistical packages like SPSS incorrectly label p-values as “sig-values”, which is guaranteed to contribute to confusion: p-values relate to Step 2, sig-values relate to Step 3. Put another way, p-values are calculated from the distribution of the statistic (at Step 2), while a sig-value is a researcher’s decision about where to place their ‘cutoff’ on accumulated p-values (Step 3).
In the H-statistic example, the laughter was a reflection that the p-value for the observed result was lower than the (intuitively expressed) significance level. It’s a human way of saying you just crossed my credulity threshold: “I expected something around 50:50, but this is sounding suspicious!”
Giving up the possibility that 9/10 is just a surprising result and concluding that rather, it is a ‘suspicious’ result is called “rejecting the null hypothesis”.
Now, what do you think explains the result? Most proffer explanations like “The coin has two heads,” or “You tossed the coin a particular way,” and some even claim “You’re lying!” These are all alternative hypotheses.
As it turns out, the statistician’s significance level is probably a little stricter than the human’s significance level (as based on the laughter measure). The probability of 9 or 10 heads or 0 or 1 heads (for the two-tailed test) is 0.022. The probability of 8 or more heads or 2 or less heads is 0.11 (see here using 8 as the observed result instead of 9). So the naive human statistician will call “Bullsh!” at 8/10, but a statistician wouldn’t do so until 9/10.
H is for Happy?

So you see, statistical testing is not that difficult. (1) We start by calculating a statistic, which is simply a measure of how far the observed is from what we might have expected (assuming no relationship). (2) Somewhere out there, some mathematical statisticians will have done the arcane calculations to produce a distribution of that statistic, showing probabilities for each value from the most probable (the expected value) to the more distant and improbable values. To get the probability of our observed result, we simply look at the area under the curve for the statistic value, ‘H’, or more extreme. (3) We set our cut-point, our significance level, generally set at p = 0.05, and (4) if p < the sig-level, then we dismiss our expectation as the explanation and start searching for some alternative explanation for the observed result.
Comments? Errors? (Yep, statistics is still sufficiently tangled that this is very possible!) Let me know your thoughts.