Why is Sample Size important?

Why calculate sample size?

A good statistical study is one that is well designed and leads to valid conclusions. This, however, is not always the case, even in published studies. In Cohen’s (1962) seminal power analysis of studies in the Journal of Abnormal and Social Psychology, he concluded that over half of the published studies were insufficiently powered to achieve statistical significance for the main hypothesis.

What is Statistical Power?

The power of a statistical test is the probability that a test will reject the null hypothesis when the null hypothesis is false. That is, power reflects the probability of not committing a type II error. The two major factors affecting the power of a study are the sample size and the effect size.
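The relationship between these two factors and power can be sketched numerically. The helper below is illustrative (its name and defaults are not from any particular package); it approximates the power of a two-sided, two-sample z-test from the standardized effect size (Cohen's d) and the per-group sample size, using only Python's standard library:

```python
from statistics import NormalDist

def approx_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test.

    effect_size: standardized mean difference (Cohen's d).
    n_per_group: sample size in each of the two groups.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)   # two-sided critical value
    se = (2 / n_per_group) ** 0.5       # SE of d with equal group sizes
    shift = effect_size / se            # expected test statistic
    # Probability the test statistic lands beyond either critical value
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)

# A medium effect (d = 0.5) with 64 subjects per group gives roughly
# 80% power, while 20 per group gives well under 50%:
print(round(approx_power(0.5, 64), 2))   # -> 0.81
print(round(approx_power(0.5, 20), 2))   # -> 0.35
```

Doubling the sample size or the effect size both push power toward 1, which is exactly the trade-off described above.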

Video: How To Calculate Sample Size For Clinical Trials in 5 Steps

We have a full guide on how to use a sample size calculator. However, you will get better results if you understand some key concepts first. The larger the sample size, the smaller the effect size that can be detected; the reverse is also true, in that small sample sizes can only detect large effect sizes. While researchers generally have a strong idea of the effect size in their planned study, it is determining an appropriate sample size that often proves difficult and leads to an underpowered study. This poses both scientific and ethical issues for researchers.

A study that has a sample size which is too small may produce inconclusive results and could also be considered unethical, because exposing human subjects or lab animals to the possible risks associated with research is only justifiable if there is a realistic chance that the study will yield useful information.

Similarly, a study that has a sample size which is too large will waste scarce resources and could expose more participants than necessary to any related risk. Thus an appropriate determination of the sample size used in a study is a crucial step in the design of a study.

More recent studies analysing the power of published papers have shown that, even now, large numbers of papers are being published with insufficient power. With the availability of sample size software such as nQuery Sample Size and Power Calculator for Successful Clinical Trials, which can calculate appropriate sample sizes for any given power, such issues should not arise so often today.
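Solving the same normal approximation in the other direction gives the per-group sample size needed to reach a target power, which is the basic calculation such software automates (a minimal sketch under a z-approximation, not nQuery's actual algorithm):

```python
import math
from statistics import NormalDist

def n_per_group(effect_size, power=0.80, alpha=0.05):
    """Per-group n for a two-sided, two-sample z-test (normal approximation).

    effect_size: standardized mean difference (Cohen's d).
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the test
    z_beta = z.inv_cdf(power)           # quantile for the target power
    return math.ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)

# Smaller effects demand much larger samples for the same 80% power:
print(n_per_group(0.8))   # large effect  -> 25 per group
print(n_per_group(0.5))   # medium effect -> 63 per group
print(n_per_group(0.2))   # small effect  -> 393 per group
```

The quadratic dependence on effect size is why underestimating the effect only slightly can leave a study badly underpowered.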

To summarize why sample size is important:

  • The two major factors affecting the power of a study are the sample size and the effect size
  • A study should only be undertaken once there is a realistic chance that the study will yield useful information
  • A study that has a sample size which is too small may produce inconclusive results and could also be considered unethical by exposing human subjects or lab animals to needless risk
  • A study that is too large will waste scarce resources and could expose more participants than necessary to any related risk
  • Thus an appropriate determination of the sample size used in a study is a crucial step in the design of a study

Recommended Reading: Answers To The Top Sample Size Questions

Now that you know why sample size is important, learn the 5 essential steps to determine sample size & power.

5 Essential Steps to Determine Sample Size & Power

Click the image above to view our guide to calculating sample size. With this knowledge you can then excel at using a sample size calculator like nQuery.

See what the industry leading software can do for you

Browse our webinars.

Guide to Sample Size

Everything to Know About Sample Size Determination

Designing Robust Group Sequential Trials | Free nQuery Training

Group Sequential Design Theory and Practice

Get started with nQuery today

Try for free and upgrade as your team grows

Statistics By Jim

Making statistics intuitive

Statistical Hypothesis Testing Overview

By Jim Frost

In this blog post, I explain why you need to use statistical hypothesis testing and help you navigate the essential terminology. Hypothesis testing is a crucial procedure to perform when you want to make inferences about a population using a random sample. These inferences include estimating population properties such as the mean, differences between means, proportions, and the relationships between variables.

This post provides an overview of statistical hypothesis testing. If you need to perform hypothesis tests, consider getting my book, Hypothesis Testing: An Intuitive Guide.

Why You Should Perform Statistical Hypothesis Testing

Graph that displays mean drug scores by group. Use hypothesis testing to determine whether the difference between the means is statistically significant.

Hypothesis testing is a form of inferential statistics that allows us to draw conclusions about an entire population based on a representative sample. You gain tremendous benefits by working with a sample. In most cases, it is simply impossible to observe the entire population to understand its properties. The only alternative is to collect a random sample and then use statistics to analyze it.

While samples are much more practical and less expensive to work with, there are trade-offs. When you estimate the properties of a population from a sample, the sample statistics are unlikely to equal the actual population value exactly. For instance, your sample mean is unlikely to equal the population mean. The difference between the sample statistic and the population value is the sampling error.

Differences that researchers observe in samples might be due to sampling error rather than representing a true effect at the population level. If sampling error causes the observed difference, the next time someone performs the same experiment the results might be different. Hypothesis testing incorporates estimates of the sampling error to help you make the correct decision. Learn more about Sampling Error.

For example, if you are studying the proportion of defects produced by two manufacturing methods, any difference you observe between the two sample proportions might be sample error rather than a true difference. If the difference does not exist at the population level, you won’t obtain the benefits that you expect based on the sample statistics. That can be a costly mistake!
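That manufacturing example is easy to simulate. The sketch below (the defect rate and sample sizes are made up for illustration) draws two samples from processes with the identical defect rate; the observed proportions still differ, purely from sampling error:

```python
import random

random.seed(42)  # reproducible illustration

def observed_difference(true_rate=0.10, n=50):
    """Sample two batches from the SAME process and return the
    difference in observed defect proportions."""
    a = sum(random.random() < true_rate for _ in range(n)) / n
    b = sum(random.random() < true_rate for _ in range(n)) / n
    return a - b

# Both "methods" have a true 10% defect rate, yet most samples show a
# nonzero difference between them, purely from sampling error:
diffs = [observed_difference() for _ in range(5)]
print([round(d, 2) for d in diffs])
```

A hypothesis test asks whether an observed difference like these is larger than sampling error alone can plausibly explain.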

Let’s cover some basic hypothesis testing terms that you need to know.

Background information: Difference between Descriptive and Inferential Statistics and Populations, Parameters, and Samples in Inferential Statistics

Hypothesis Testing

Hypothesis testing is a statistical analysis that uses sample data to assess two mutually exclusive theories about the properties of a population. Statisticians call these theories the null hypothesis and the alternative hypothesis. A hypothesis test assesses your sample statistic and factors in an estimate of the sample error to determine which hypothesis the data support.

When you can reject the null hypothesis, the results are statistically significant, and your data support the theory that an effect exists at the population level.

The effect is the difference between the population value and the null hypothesis value. The effect is also known as the population effect or the difference. For example, the mean difference between the health outcome for a treatment group and a control group is the effect.

Typically, you do not know the size of the actual effect. However, you can use a hypothesis test to help you determine whether an effect exists and to estimate its size. Hypothesis tests convert your sample effect into a test statistic, which it evaluates for statistical significance. Learn more about Test Statistics.

An effect can be statistically significant, but that doesn’t necessarily indicate that it is important in a real-world, practical sense. For more information, read my post about Statistical vs. Practical Significance .

Null Hypothesis

The null hypothesis is one of two mutually exclusive theories about the properties of the population in hypothesis testing. Typically, the null hypothesis states that there is no effect (i.e., the effect size equals zero). The null is often signified by H0.

In all hypothesis testing, the researchers are testing an effect of some sort. The effect can be the effectiveness of a new vaccination, the durability of a new product, the proportion of defects in a manufacturing process, and so on. There is some benefit or difference that the researchers hope to identify.

However, it’s possible that there is no effect or no difference between the experimental groups. In statistics, we call this lack of an effect the null hypothesis. Therefore, if you can reject the null, you can favor the alternative hypothesis, which states that the effect exists (doesn’t equal zero) at the population level.

You can think of the null as the default theory that requires sufficiently strong evidence against it before you can reject it.

For example, in a 2-sample t-test, the null often states that the difference between the two means equals zero.

When you can reject the null hypothesis, your results are statistically significant. Learn more about Statistical Significance: Definition & Meaning.

Related post: Understanding the Null Hypothesis in More Detail

Alternative Hypothesis

The alternative hypothesis is the other theory about the properties of the population in hypothesis testing. Typically, the alternative hypothesis states that a population parameter does not equal the null hypothesis value. In other words, there is a non-zero effect. If your sample contains sufficient evidence, you can reject the null and favor the alternative hypothesis. The alternative is often identified with H1 or HA.

For example, in a 2-sample t-test, the alternative often states that the difference between the two means does not equal zero.

You can specify either a one- or two-tailed alternative hypothesis:

If you perform a two-tailed hypothesis test, the alternative states that the population parameter does not equal the null value. For example, when the alternative hypothesis is HA: μ ≠ 0, the test can detect differences both greater than and less than the null value.

A one-tailed alternative has more power to detect an effect, but it can test for a difference in only one direction. For example, HA: μ > 0 can only test for differences that are greater than zero.
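The power difference comes from how each alternative converts the same test statistic into a p-value. A sketch with a hypothetical z statistic (a t-test works the same way, with the t distribution in place of the normal):

```python
from statistics import NormalDist

z = NormalDist()
stat = 1.75  # hypothetical observed z statistic

p_two_tailed = 2 * (1 - z.cdf(abs(stat)))   # HA: mu != 0 (both directions)
p_one_tailed = 1 - z.cdf(stat)              # HA: mu > 0  (one direction only)

print(round(p_two_tailed, 3))  # -> 0.08: not significant at alpha = 0.05
print(round(p_one_tailed, 3))  # -> 0.04: significant, but blind to mu < 0
```

The one-tailed p-value is half the two-tailed one for the same statistic, which is exactly where the extra power comes from, at the cost of never detecting an effect in the other direction.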

Related posts: Understanding T-tests and One-Tailed and Two-Tailed Hypothesis Tests Explained

P-values

P-values are the probability that you would obtain the effect observed in your sample, or larger, if the null hypothesis is correct. In simpler terms, p-values tell you how strongly your sample data contradict the null. Lower p-values represent stronger evidence against the null. You use P-values in conjunction with the significance level to determine whether your data favor the null or alternative hypothesis.
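As a small worked illustration of this definition, here is a one-sample, two-tailed z-test on made-up data, with the population standard deviation assumed known (all numbers are invented for the example):

```python
from statistics import NormalDist, mean

sample = [5.2, 4.9, 5.6, 5.1, 5.4, 5.3, 4.8, 5.5, 5.2, 5.0]
null_mean = 5.0   # hypothesized population mean under the null
sigma = 0.3       # assumed known population standard deviation

# Standardize the observed sample effect into a test statistic...
z_stat = (mean(sample) - null_mean) / (sigma / len(sample) ** 0.5)
# ...then find the probability of a result at least this extreme
# in either direction, assuming the null is true.
p_value = 2 * (1 - NormalDist().cdf(abs(z_stat)))

print(round(z_stat, 2), round(p_value, 3))  # z ≈ 2.11, p ≈ 0.035
```

A sample this far from the null mean would occur only about 3.5% of the time if the null were true, so the data contradict the null fairly strongly.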

Related post: Interpreting P-values Correctly

Significance Level (Alpha)

The significance level, also known as alpha or α, is an evidence threshold that you set before the study begins. It defines how strongly the sample evidence must contradict the null hypothesis before you can reject the null.

For instance, a significance level of 0.05 signifies a 5% risk of deciding that an effect exists when it does not exist.

Use p-values and significance levels together to help you determine which hypothesis the data support. If the p-value is less than your significance level, you can reject the null and conclude that the effect is statistically significant. In other words, the evidence in your sample is strong enough to be able to reject the null hypothesis at the population level.

Related posts: Graphical Approach to Significance Levels and P-values and Conceptual Approach to Understanding Significance Levels

Types of Errors in Hypothesis Testing

Statistical hypothesis tests are not 100% accurate because they use a random sample to draw conclusions about entire populations. There are two types of errors related to drawing an incorrect conclusion.

  • False positives: You reject a null that is true. Statisticians call this a Type I error. The Type I error rate equals your significance level or alpha (α).
  • False negatives: You fail to reject a null that is false. Statisticians call this a Type II error. Generally, you do not know the Type II error rate. However, it is a larger risk when you have a small sample size, noisy data, or a small effect size. The Type II error rate is also known as beta (β).
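The claim that the Type I error rate equals alpha can be checked directly by simulation: generate many datasets for which the null is true and count how often a two-tailed z-test rejects anyway (all settings here are illustrative):

```python
import random
from statistics import NormalDist, mean

random.seed(1)  # reproducible illustration
alpha = 0.05
z_crit = NormalDist().inv_cdf(1 - alpha / 2)

def one_null_experiment(n=30):
    """Sample from N(0, 1), so the null (mu = 0) is TRUE by construction.
    Returns True when the test (wrongly) rejects: a false positive."""
    data = [random.gauss(0, 1) for _ in range(n)]
    z_stat = mean(data) / (1 / n ** 0.5)  # known sigma = 1
    return abs(z_stat) > z_crit

rejections = sum(one_null_experiment() for _ in range(2000))
print(rejections / 2000)  # close to alpha = 0.05
```

Across many simulated studies with no real effect, roughly 5% still come out "significant": that is the Type I error rate you accepted when you chose alpha.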

Statistical power is the probability that a hypothesis test correctly infers that a sample effect exists in the population. In other words, the test correctly rejects a false null hypothesis. Consequently, power is inversely related to a Type II error. Power = 1 – β. Learn more about Power in Statistics.

Related posts: Types of Errors in Hypothesis Testing and Estimating a Good Sample Size for Your Study Using Power Analysis

Which Type of Hypothesis Test is Right for You?

There are many different types of procedures you can use. The correct choice depends on your research goals and the data you collect. Do you need to understand the mean or the differences between means? Or, perhaps you need to assess proportions. You can even use hypothesis testing to determine whether the relationships between variables are statistically significant.

To choose the proper statistical procedure, you’ll need to assess your study objectives and collect the correct type of data. This background research is necessary before you begin a study.

Related Post: Hypothesis Tests for Continuous, Binary, and Count Data

Statistical tests are crucial when you want to use sample data to make conclusions about a population because these tests account for sample error. Using significance levels and p-values to determine when to reject the null hypothesis improves the probability that you will draw the correct conclusion.

To see an alternative approach to these traditional hypothesis testing methods, learn about bootstrapping in statistics!

If you want to see examples of hypothesis testing in action, I recommend the following posts that I have written:

  • How Effective Are Flu Shots? This example shows how you can use statistics to test proportions.
  • Fatality Rates in Star Trek. This example shows how to use hypothesis testing with categorical data.
  • Busting Myths About the Battle of the Sexes. A fun example based on a Mythbusters episode that assesses continuous data using several different tests.
  • Are Yawns Contagious? Another fun example inspired by a Mythbusters episode.

Reader Interactions

January 14, 2024 at 8:43 am

Hello professor Jim, how are you doing! Pls. What are the properties of a population and their examples? Thanks for your time and understanding.

January 14, 2024 at 12:57 pm

Please read my post about Populations vs. Samples for more information and examples.

Also, please note there is a search bar in the upper-right margin of my website. Use that to search for topics.

July 5, 2023 at 7:05 am

Hello, I have a question as I read your post. You say in p-values section

“P-values are the probability that you would obtain the effect observed in your sample, or larger, if the null hypothesis is correct. In simpler terms, p-values tell you how strongly your sample data contradict the null. Lower p-values represent stronger evidence against the null.”

But according to your definition of effect, the null states that an effect does not exist, correct? So what I assume you want to say is that “P-values are the probability that you would obtain the effect observed in your sample, or larger, if the null hypothesis is **incorrect**.”

July 6, 2023 at 5:18 am

Hi Shrinivas,

The correct definition of p-value is that it is a probability that exists in the context of a true null hypothesis. So, the quotation is correct in stating “if the null hypothesis is correct.”

Essentially, the p-value tells you the likelihood of your observed results (or more extreme) if the null hypothesis is true. It gives you an idea of whether your results are surprising or unusual if there is no effect.

Hence, with sufficiently low p-values, you reject the null hypothesis because it’s telling you that your sample results were unlikely to have occurred if there was no effect in the population.

I hope that helps make it more clear. If not, let me know I’ll attempt to clarify!

May 8, 2023 at 12:47 am

Thanks a lot Ny best regards

May 7, 2023 at 11:15 pm

Hi Jim Can you tell me something about size effect? Thanks

May 8, 2023 at 12:29 am

Here’s a post that I’ve written about Effect Sizes that will hopefully tell you what you need to know. Please read that. Then, if you have any more specific questions about effect sizes, please post them there. Thanks!

January 7, 2023 at 4:19 pm

Hi Jim, I have only read two pages so far but I am really amazed because in few paragraphs you made me clearly understand the concepts of months of courses I received in biostatistics! Thanks so much for this work you have done it helps a lot!

January 10, 2023 at 3:25 pm

Thanks so much!

June 17, 2021 at 1:45 pm

Can you help in the following question: Rocinante36 is priced at ₹7 lakh and has been designed to deliver a mileage of 22 km/litre and a top speed of 140 km/hr. Formulate the null and alternative hypotheses for mileage and top speed to check whether the new models are performing as per the desired design specifications.

April 19, 2021 at 1:51 pm

Its indeed great to read your work statistics.

I have a doubt regarding the one sample t-test. So as per your book on hypothesis testing with reference to page no 45, you have mentioned the difference between “the sample mean and the hypothesised mean is statistically significant”. So as per my understanding it should be quoted like “the difference between the population mean and the hypothesised mean is statistically significant”. The catch here is the hypothesised mean represents the sample mean.

Please help me understand this.

Regards Rajat

April 19, 2021 at 3:46 pm

Thanks for buying my book. I’m so glad it’s been helpful!

The test is performed on the sample but the results apply to the population. Hence, if the difference between the sample mean (observed in your study) and the hypothesized mean is statistically significant, that suggests that the population mean does not equal the hypothesized mean.

For one sample tests, the hypothesized mean is not the sample mean. It is a mean that you want to use for the test value. It usually represents a value that is important to your research. In other words, it’s a value that you pick for some theoretical/practical reasons. You pick it because you want to determine whether the population mean is different from that particular value.

I hope that helps!

November 5, 2020 at 6:24 am

Jim, you are such a magnificent statistician/economist/econometrician/data scientist etc whatever profession. Your work inspires and simplifies the lives of so many researchers around the world. I truly admire you and your work. I will buy a copy of each book you have on statistics or econometrics. Keep doing the good work. Remain ever blessed

November 6, 2020 at 9:47 pm

Hi Renatus,

Thanks so much for you very kind comments. You made my day!! I’m so glad that my website has been helpful. And, thanks so much for supporting my books! 🙂

November 2, 2020 at 9:32 pm

Hi Jim, I hope you are aware of 2019 American Statistical Association’s official statement on Statistical Significance: https://www.tandfonline.com/doi/full/10.1080/00031305.2019.1583913 In case you do not bother reading the full article, may I quote you the core message here: “We conclude, based on our review of the articles in this special issue and the broader literature, that it is time to stop using the term “statistically significant” entirely. Nor should variants such as “significantly different,” “p < 0.05,” and “nonsignificant” survive, whether expressed in words, by asterisks in a table, or in some other way."

With best wishes,

November 3, 2020 at 2:09 am

I’m definitely aware of the debate surrounding how to use p-values most effectively. However, I need to correct you on one point. The link you provide is NOT a statement by the American Statistical Association. It is an editorial by several authors.

There is considerable debate over this issue. There are problems with p-values. However, as the authors state themselves, much of the problem is over people’s mindsets about how to use p-values and their incorrect interpretations about what statistical significance does and does not mean.

If you were to read my website more thoroughly, you’d be aware that I share many of their concerns and I address them in multiple posts. One of the authors’ key points is the need to be thoughtful and conduct thoughtful research and analysis. I emphasize this aspect in multiple posts on this topic. I’ll ask you to read the following three because they all address some of the authors’ concerns and suggestions. But you might run across others to read as well.

Five Tips for Using P-values to Avoid Being Misled
How to Interpret P-values Correctly
P-values and the Reproducibility of Experimental Results

September 24, 2020 at 11:52 pm

HI Jim, i just want you to know that you made explanation for Statistics so simple! I should say lesser and fewer words that reduce the complexity. All the best! 🙂

September 25, 2020 at 1:03 am

Thanks, Rene! Your kind words mean a lot to me! I’m so glad it has been helpful!

September 23, 2020 at 2:21 am

Honestly, I never understood stats during my entire M.Ed course and was another nightmare for me. But how easily you have explained each concept, I have understood stats way beyond my imagination. Thank you so much for helping ignorant research scholars like us. Looking forward to get hardcopy of your book. Kindly tell is it available through flipkart?

September 24, 2020 at 11:14 pm

I’m so happy to hear that my website has been helpful!

I checked on flipkart and it appears like my books are not available there. I’m never exactly sure where they’re available due to the vagaries of different distribution channels. They are available on Amazon in India.

Introduction to Statistics: An Intuitive Guide (Amazon IN)
Hypothesis Testing: An Intuitive Guide (Amazon IN)

July 26, 2020 at 11:57 am

Dear Jim I am a teacher from India . I don’t have any background in statistics, and still I should tell that in a single read I can follow your explanations . I take my entire biostatistics class for botany graduates with your explanations. Thanks a lot. May I know how I can avail your books in India

July 28, 2020 at 12:31 am

Right now my books are only available as ebooks from my website. However, soon I’ll have some exciting news about other ways to obtain it. Stay tuned! I’ll announce it on my email list. If you’re not already on it, you can sign up using the form that is in the right margin of my website.

June 22, 2020 at 2:02 pm

Also can you please let me if this book covers topics like EDA and principal component analysis?

June 22, 2020 at 2:07 pm

This book doesn’t cover principal components analysis. Although, I wouldn’t really classify that as a hypothesis test. In the future, I might write a multivariate analysis book that would cover this and others. But, that’s well down the road.

My Introduction to Statistics covers EDA. That’s the largely graphical look at your data that you often do prior to hypothesis testing. The Introduction book perfectly leads right into the Hypothesis Testing book.

June 22, 2020 at 1:45 pm

Thanks for the detailed explanation. It does clear my doubts. I saw that your book related to hypothesis testing has the topics that I am studying currently. I am looking forward to purchasing it.

Regards, Take Care

June 19, 2020 at 1:03 pm

For this particular article I did not understand a couple of statements and it would great if you could help: 1)”If sample error causes the observed difference, the next time someone performs the same experiment the results might be different.” 2)”If the difference does not exist at the population level, you won’t obtain the benefits that you expect based on the sample statistics.”

I discovered your articles by chance and now I keep coming back to read & understand statistical concepts. These articles are very informative & easy to digest. Thanks for the simplifying things.

June 20, 2020 at 9:53 pm

I’m so happy to hear that you’ve found my website to be helpful!

To answer your questions, keep in mind that a central tenet of inferential statistics is that the random sample a study drew was only one of an infinite number of possible samples it could’ve drawn. Each random sample produces different results. Most results will cluster around the population value assuming they used good methodology. However, random sampling error always exists and makes it so that population estimates from a sample almost never exactly equal the correct population value.

So, imagine that we’re studying a medication and comparing the treatment and control groups. Suppose that the medicine is truly not effective and that the population difference between the treatment and control group is zero (i.e., no difference). Despite the true difference being zero, most sample estimates will show some degree of either a positive or negative effect thanks to random sampling error. So, just because a study has an observed difference does not mean that a difference exists at the population level. So, on to your questions:

1. If the observed difference is just random error, then it makes sense that if you collected another random sample, the difference could change. It could change from negative to positive, positive to negative, more extreme, less extreme, etc. However, if the difference exists at the population level, most random samples drawn from the population will reflect that difference. If the medicine has an effect, most random samples will reflect that fact and not bounce around on both sides of zero as much.

2. This is closely related to the previous answer. Suppose there is no difference at the population level, but you approve the medicine because of the observed effects in a sample. Even though your random sample showed an effect (which was really random error), that effect doesn’t exist. So, when you start using it on a larger scale, people won’t benefit from the medicine. That’s why it’s important to separate out what is easily explained by random error versus what is not easily explained by it.

I think reading my post about how hypothesis tests work will help clarify this process. Also, in about 24 hours (as I write this), I’ll be releasing my new ebook about Hypothesis Testing!

May 29, 2020 at 5:23 am

Hi Jim, I really enjoy your blog. Can you please link me on your blog where you discuss about Subgroup analysis and how it is done? I need to use non parametric and parametric statistical methods for my work and also do subgroup analysis in order to identify potential groups of patients that may benefit more from using a treatment than other groups.

May 29, 2020 at 2:12 pm

Hi, I don’t have a specific article about subgroup analysis. However, subgroup analysis is just the dividing up of a larger sample into subgroups and then analyzing those subgroups separately. You can use the various analyses I write about on the subgroups.

Alternatively, you can include the subgroups in regression analysis as an indicator variable and include that variable as a main effect and an interaction effect to see how the relationships vary by subgroup without needing to subdivide your data. I write about that approach in my article about comparing regression lines. This approach is my preferred approach when possible.

April 19, 2020 at 7:58 am

sir is confidence interval is a part of estimation?

April 17, 2020 at 3:36 pm

Sir can u plz briefly explain alternatives of hypothesis testing? I m unable to find the answer

April 18, 2020 at 1:22 am

Assuming you want to draw conclusions about populations by using samples (i.e., inferential statistics), you can use confidence intervals and bootstrap methods as alternatives to the traditional hypothesis testing methods.

March 9, 2020 at 10:01 pm

Hi JIm, could you please help with activities that can best teach concepts of hypothesis testing through simulation, Also, do you have any question set that would enhance students intuition why learning hypothesis testing as a topic in introductory statistics. Thanks.

March 5, 2020 at 3:48 pm

Hi Jim, I’m studying multiple hypothesis testing & was wondering if you had any material that would be relevant. I’m more trying to understand how testing multiple samples simultaneously affects your results & more on the Bonferroni Correction

March 5, 2020 at 4:05 pm

I write about multiple comparisons (aka post hoc tests) in the ANOVA context. I don’t talk about Bonferroni Corrections specifically but I cover related types of corrections. I’m not sure if that exactly addresses what you want to know but is probably the closest I have already written. I hope it helps!

January 14, 2020 at 9:03 pm

Thank you! Have a great day/evening.

January 13, 2020 at 7:10 pm

Any help would be greatly appreciated. What is the difference between The Hypothesis Test and The Statistical Test of Hypothesis?

January 14, 2020 at 11:02 am

They sound like the same thing to me. Unless this is specialized terminology for a particular field or the author was intending something specific, I’d guess they’re one and the same.

April 1, 2019 at 10:00 am

so these are the only two forms of Hypothesis used in statistical testing?

April 1, 2019 at 10:02 am

Are you referring to the null and alternative hypothesis? If so, yes, that’s those are the standard hypotheses in a statistical hypothesis test.

April 1, 2019 at 9:57 am

year very insightful post, thanks for the write up

October 27, 2018 at 11:09 pm

hi there, am upcoming statistician, out of all blogs that i have read, i have found this one more useful as long as my problem is concerned. thanks so much

October 27, 2018 at 11:14 pm

Hi Stano, you’re very welcome! Thanks for your kind words. They mean a lot! I’m happy to hear that my posts were able to help you. I’m sure you will be a fantastic statistician. Best of luck with your studies!

October 26, 2018 at 11:39 am

Dear Jim, thank you very much for your explanations! I have a question. Can I use t-test to compare two samples in case each of them have right bias?

October 26, 2018 at 12:00 pm

Hi Tetyana,

You’re very welcome!

The term “right bias” is not a standard term. Do you by chance mean right skewed distributions? In other words, if you plot the distribution for each group on a histogram they have longer right tails? These are not the symmetrical bell-shape curves of the normal distribution.

If that's the case, yes, you can, as long as you exceed a specific sample size within each group. I include a table that contains these sample size requirements in my post about nonparametric vs. parametric analyses.

Bias in statistics refers to cases where an estimate of a value is systematically higher or lower than the true value. If this is the case, you might be able to use t-tests, but you’d need to be sure to understand the nature of the bias so you would understand what the results are really indicating.

I hope this helps!


April 2, 2018 at 7:28 am

Simple and to the point 👍 Thank you so much.

April 2, 2018 at 11:11 am

Hi Kalpana, thanks! And I’m glad it was helpful!


March 26, 2018 at 8:41 am

Am I correct if I say: Alpha – the probability of wrongly rejecting the null hypothesis; p-value – the probability of wrongly accepting the null hypothesis?

March 28, 2018 at 3:14 pm

You’re correct about alpha. Alpha is the probability of rejecting the null hypothesis when the null is true.

Unfortunately, your definition of the p-value is a bit off. The p-value has a fairly convoluted definition. It is the probability of obtaining the effect observed in a sample, or more extreme, if the null hypothesis is true. The p-value does NOT indicate the probability that either the null or alternative is true or false. Although, those are very common misinterpretations. To learn more, read my post about how to interpret p-values correctly .
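One way to see this definition in action is a small simulation: when the null hypothesis is true, p-values are uniformly distributed, so they fall below α = 0.05 about 5% of the time (exactly the Type I error rate). A minimal sketch in Python, using made-up data and assuming a known σ = 1 for simplicity:

```python
import math
import random

random.seed(1)  # reproducible illustration

def p_value_one_sample(sample, mu0):
    """Two-sided z-test p-value for H0: mean = mu0, assuming known sigma = 1."""
    n = len(sample)
    z = (sum(sample) / n - mu0) * math.sqrt(n)
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# Draw many samples with H0 true (the true mean really is 0) and count
# how often p < 0.05: it should happen about 5% of the time.
trials = 2000
hits = sum(
    p_value_one_sample([random.gauss(0, 1) for _ in range(30)], 0) < 0.05
    for _ in range(trials)
)
print(hits / trials)  # close to 0.05
```

The point of the simulation is that a small p-value is defined relative to a world where the null is true; it is not the probability that the null is true.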


March 2, 2018 at 6:10 pm

I recently started reading your blog, and it is very helpful for understanding each concept of statistical testing in an easy way, with good examples. I also recommend that other people go through all these posts of yours, especially people who do not have a statistical background and face many problems while studying statistical analysis.

Thank you for such good blog posts.

March 3, 2018 at 10:12 pm

Hi Amit, I'm so glad that my blog posts have been helpful for you! It means a lot to me that you took the time to write such a nice comment! Also, thanks for recommending my blog to others! I try really hard to write posts about statistics that are easy to understand.


January 17, 2018 at 7:03 am

I recently started reading your blog and I find it very interesting. I am learning statistics on my own, and I generally do many Google searches to understand the concepts. So this blog is quite helpful for me, as it has most of the content I am looking for.

January 17, 2018 at 3:56 pm

Hi Shashank, thank you! And, I’m very glad to hear that my blog is helpful!


January 2, 2018 at 2:28 pm

Thank you very much, sir.

January 2, 2018 at 2:36 pm

You’re very welcome, Hiral!


November 21, 2017 at 12:43 pm

Thank you so much, sir. Your posts always help me on my way to becoming a #statistician.

November 21, 2017 at 2:40 pm

Hi Sachin, you’re very welcome! I’m happy that you find my posts to be helpful!


November 19, 2017 at 8:22 pm

Great post as usual, but it would be nice to see an example.

November 19, 2017 at 8:27 pm

Thank you! At the end of this post, I have links to four other posts that show examples of hypothesis tests in action. You’ll find what you’re looking for in those posts!


Why is Sample Size Important? (Explanation & Examples)

Sample size refers to the total number of individuals involved in an experiment or study.

Sample size is important because it directly affects how precisely we can estimate population parameters.

To understand why this is the case, it helps to have a basic understanding of confidence intervals.

A Brief Explanation of Confidence Intervals

In statistics, we’re often interested in measuring population parameters – numbers that describe some characteristic of an entire population.

For example, we might be interested in measuring the mean height of all individuals in a certain city.

However, it’s often too costly and time-consuming to go around and collect data on every individual in a population so we typically take a random sample from the population instead and use data from the sample to estimate the population parameter.

For example, we might collect data on the height of 100 random individuals in the city. We can then calculate the mean height of the individuals in the sample. However, we can’t be certain that the sample mean exactly matches the population mean.

To account for this uncertainty, we can create a confidence interval . A confidence interval is a range of values that is likely to contain a population parameter with a certain level of confidence.

The formula to calculate a confidence interval for a population mean is:

Confidence Interval = x ± z*(s/√n)

  • x : sample mean
  • z:  the chosen z-value
  • s:  sample standard deviation
  • n:  sample size

The z-value that you will use depends on the confidence level that you choose. The following list shows the z-value that corresponds to popular confidence level choices:

  • Confidence level 0.90 → z = 1.645
  • Confidence level 0.95 → z = 1.96
  • Confidence level 0.99 → z = 2.58
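As a quick sanity check, the formula can be evaluated directly. The sketch below uses made-up numbers (mean 52, standard deviation 8, n = 64) rather than data from the article:

```python
import math

def mean_confidence_interval(x_bar, s, n, z):
    """Confidence interval for a population mean: x_bar +/- z*(s/sqrt(n))."""
    margin = z * (s / math.sqrt(n))
    return (x_bar - margin, x_bar + margin)

# 95% CI (z = 1.96) for a hypothetical sample: mean 52, s = 8, n = 64
low, high = mean_confidence_interval(52, 8, 64, 1.96)
print(round(low, 2), round(high, 2))  # -> 50.04 53.96
```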

The Relationship Between Sample Size & Confidence Intervals

Suppose we want to estimate the mean weight of a population of turtles. We collect a random sample of turtles with the following information:

  • Sample size  n = 25
  • Sample mean weight  x = 300
  • Sample standard deviation  s = 18.5

Here is how to calculate the 90% confidence interval for the true population mean weight:

90% Confidence Interval: 300 ± 1.645*(18.5/√25) = [293.91, 306.09]

We are 90% confident that the true mean weight of the turtles in the population is between 293.91 and 306.09 pounds.

Now suppose instead of 25 turtles, we actually collect data for 50 turtles. 

90% Confidence Interval: 300 ± 1.645*(18.5/√50) = [295.70, 304.30]

Notice that this confidence interval is narrower than the previous confidence interval. This means our estimate of the true population mean weight of turtles is more precise.

Now suppose we instead collected data for 100 turtles. 

90% Confidence Interval: 300 ± 1.645*(18.5/√100) = [296.96, 303.04]

Notice that this confidence interval is even narrower than the previous confidence interval.

The following table summarizes each of the confidence interval widths:

  • n = 25: 90% CI width ≈ 12.17
  • n = 50: 90% CI width ≈ 8.61
  • n = 100: 90% CI width ≈ 6.09

Here’s the takeaway: The larger the sample size, the more precisely we can estimate a population parameter .
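The narrowing can be reproduced in a few lines of Python; up to rounding, the interval width falls from about 12.17 to 8.61 to 6.09 as n grows:

```python
import math

z, x_bar, s = 1.645, 300, 18.5  # 90% confidence; turtle sample statistics

widths = {}
for n in (25, 50, 100):
    margin = z * s / math.sqrt(n)
    widths[n] = round(2 * margin, 2)
    print(f"n={n}: [{x_bar - margin:.2f}, {x_bar + margin:.2f}] width={widths[n]}")
```

Quadrupling the sample size (25 → 100) halves the width, because precision improves with √n, not with n itself.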

Additional Resources

The following tutorials provide other helpful explanations of confidence intervals and sample size.

  • An Introduction to Confidence Intervals
  • 4 Examples of Confidence Intervals in Real Life
  • Population vs. Sample: What's the Difference?



The Importance and Effect of Sample Size

When conducting research about your customers, patients or products it’s usually impossible, or at least impractical, to collect data from all of the people or items that you are interested in. Instead, we take a sample (or subset) of the population of interest and learn what we can from that sample about the population.

There are lots of things that can affect how well our sample reflects the population and therefore how valid and reliable our conclusions will be. In this blog, we introduce some of the key concepts that should be considered when conducting a survey, including confidence levels and margins of error , power and effect sizes . (See the glossary below for some handy definitions of these terms.) Crucially, we’ll see that all of these are affected by how large a sample you take, i.e., the sample size .

Confidence and Margin of Error

Let’s start by considering an example where we simply want to estimate a characteristic of our population, and see the effect that our sample size has on how precise our estimate is.

The size of our sample dictates the amount of information we have and therefore, in part, determines our precision or level of confidence that we have in our sample estimates. An estimate always has an associated level of uncertainty, which depends upon the underlying variability of the data as well as the sample size. The more variable the population, the greater the uncertainty in our estimate. Similarly, the larger the sample size the more information we have and so our uncertainty reduces.

Suppose that we want to estimate the proportion of adults who own a smartphone in the UK. We could take a sample of 100 people and ask them. Note: it’s important to consider how the sample is selected to make sure that it is unbiased and representative of the population – we’ll blog on this topic another time.

The larger the sample size the more information we have and so our uncertainty reduces.

If 59 out of the 100 people own a smartphone, we estimate that the proportion in the UK is 59/100=59%. We can also construct an interval around this point estimate to express our uncertainty in it, i.e., our margin of error . For example, a 95% confidence interval for our estimate based on our sample of size 100 ranges from 49.36% to 68.64% (which can be calculated using our free online calculator ). Alternatively, we can express this interval by saying that our estimate is 59% with a margin of error of ±9.64%. This is a 95% confidence interval, which means that if we were to repeat the sampling process many times, the intervals calculated in this way would contain the true proportion approximately 95 times out of 100.

What would happen if we were to increase our sample size by going out and asking more people?

Suppose we ask another 900 people and find that, overall, 590 out of the 1000 people own a smartphone. Our estimate of the prevalence in the whole population is again 590/1000=59%. However, our confidence interval for the estimate has now narrowed considerably to 55.95% to 62.05%, a margin of error of ±3.05% – see Figure 1 below. Because we have more data and therefore more information, our estimate is more precise.
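The ±9.64% and ±3.05% margins quoted above come from the normal approximation for a proportion, z·√(p̂(1−p̂)/n), and are easy to verify:

```python
import math

def proportion_margin(p_hat, n, z=1.96):
    """95% margin of error for a sample proportion (normal approximation)."""
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

print(round(100 * proportion_margin(0.59, 100), 2))   # -> 9.64
print(round(100 * proportion_margin(0.59, 1000), 2))  # -> 3.05
```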

[Figure 1: Precision versus sample size.]

As our sample size increases, the confidence in our estimate increases, our uncertainty decreases and we have greater precision. This is clearly demonstrated by the narrowing of the confidence intervals in the figure above. If we took this to the limit and sampled our whole population of interest then we would obtain the true value that we are trying to estimate – the actual proportion of adults who own a smartphone in the UK and we would have no uncertainty in our estimate.

Power and Effect Size

Increasing our sample size can also give us greater power to detect differences. Suppose in the example above that we were also interested in whether there is a difference in the proportion of men and women who own a smartphone.

We can estimate the sample proportions for men and women separately and then calculate the difference. When we sampled 100 people originally, suppose that these were made up of 50 men and 50 women, 25 and 34 of whom own a smartphone, respectively. So, the proportion of men and women owning smartphones in our sample is 25/50=50% and 34/50=68%, with fewer men than women owning a smartphone. The difference between these two proportions is known as the observed effect size. In this case, we observe that the gender effect is to reduce the proportion by 18 percentage points for men relative to women.

Is this observed effect significant, given such a small sample from the population, or might the proportions for men and women be the same and the observed effect due merely to chance?

We can use a statistical test to investigate this and, in this case, we use what’s known as the ‘Binomial test of equal proportions’ or ‘ two proportion z-test ‘. We find that there is insufficient evidence to establish a difference between men and women and the result is not considered statistically significant. The probability of observing a gender effect of 18% or more if there were truly no difference between men and women is greater than 5%, i.e., relatively likely and so the data provides no real evidence to suggest that the true proportions of men and women with smartphones are different. This cut-off of 5% is commonly used and is called the “ significance level ” of the test. It is chosen in advance of performing a test and is the probability of a type I error, i.e., of finding a statistically significant result, given that there is in fact no difference in the population.

What happens if we increase our sample size and include the additional 900 people in our sample?

Suppose that overall these were made up of 500 women and 500 men, 250 and 340 of whom own a smartphone, respectively. We now have estimates of 250/500=50% and 340/500=68% of men and women owning a smartphone. The effect size, i.e., the difference between the proportions, is the same as before (50% – 68% = ‑18%), but crucially we have more data to support this estimate of the difference. Using the statistical test of equal proportions again, we find that the result is statistically significant at the 5% significance level. Increasing our sample size has increased the power that we have to detect the difference in the proportion of men and women that own a smartphone in the UK.
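The two proportion z-test used here pools the two samples to estimate a common proportion under the null hypothesis. A minimal sketch of the standard test (an illustration, not the exact code behind the blog's figures):

```python
import math

def two_prop_ztest(x1, n1, x2, n2):
    """Two-sided two-proportion z-test with a pooled estimate (normal approx.)."""
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# n = 100 total: the 18-point difference is not significant at the 5% level
z, p = two_prop_ztest(25, 50, 34, 50)
print(round(p, 3))  # ~0.067, greater than 0.05

# n = 1000 total: the same observed effect is now highly significant
z, p = two_prop_ztest(250, 500, 340, 500)
print(p)  # far below 0.05
```

The observed effect size never changes; only the evidence supporting it does.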

Figure 2 provides a plot indicating the observed proportions of men and women, together with the associated 95% confidence intervals. We can clearly see that as our sample size increases the confidence intervals for our estimates for men and women narrow considerably. With a sample size of only 100, the confidence intervals overlap, offering little evidence to suggest that the proportions for men and women are truly any different. On the other hand, with the larger sample size of 1000 there is a clear gap between the two intervals and strong evidence to suggest that the proportions of men and women really are different.

The Binomial test above is essentially looking at how much these pairs of intervals overlap and if the overlap is small enough then we conclude that there really is a difference. (Note: The data in this blog are only for illustration; see this article for the results of a real survey on smartphone usage from earlier this year.)

[Figure 2: Difference versus sample size.]

If your effect size is small then you will need a large sample size in order to detect the difference otherwise the effect will be masked by the randomness in your samples. Essentially, any difference will be well within the associated confidence intervals and you won’t be able to detect it. The ability to detect a particular effect size is known as statistical power . More formally, statistical power is the probability of finding a statistically significant result, given that there really is a difference (or effect) in the population. See our recent blog post “ Depression in Men ‘Regularly Ignored ‘” for another example of the effect of sample size on the likelihood of finding a statistically significant result.

So, larger sample sizes give more reliable results with greater precision and power, but they also cost more time and money. That’s why you should always perform a sample size calculation before conducting a survey to ensure that you have a sufficiently large sample size to be able to draw meaningful conclusions, without wasting resources on sampling more than you really need. We’ve put together some free, online statistical calculators to help you carry out some statistical calculations of your own, including sample size calculations for estimating a proportion and comparing two proportions .
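Inverting the margin-of-error formula gives the sample size needed to estimate a proportion: n = z²·p(1−p)/e². A hedged sketch (the "worst case" p = 0.5 and the ±3-point target below are illustrative choices, not the calculators' exact implementation):

```python
import math

def sample_size_for_proportion(p, margin, z=1.96):
    """Minimum n to estimate a proportion to within +/- margin at 95% confidence."""
    return math.ceil(z**2 * p * (1 - p) / margin**2)

# Worst case p = 0.5, margin of +/-3 percentage points:
print(sample_size_for_proportion(0.5, 0.03))  # -> 1068
```

Note that p(1−p) is largest at p = 0.5, so using 0.5 gives a conservative (never too small) sample size when the true proportion is unknown.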

Margin of error – This is the level of precision you require. It is the range in which the value that you are trying to measure is estimated to be and is often expressed in percentage points (e.g., ±2%). A narrower margin of error requires a larger sample size.

Confidence level – This conveys the amount of uncertainty associated with an estimate. It is the chance that the confidence interval (margin of error around the estimate) will contain the true value that you are trying to estimate. A higher confidence level requires a larger sample size.

Power – This is the probability that we find statistically significant evidence of a difference between the groups, given that there is a difference in the population. A greater power requires a larger sample size.

Effect size – This is the estimated difference between the groups that we observe in our sample. To detect a difference with a specified power, a smaller effect size will require a larger sample size.



Power and Sample Size Determination


Issues in Estimating Sample Size for Hypothesis Testing

Ensuring that a test has high power.


In the module on hypothesis testing for means and proportions, we introduced techniques for means, proportions, differences in means, and differences in proportions. While each test involved details that were specific to the outcome of interest (e.g., continuous or dichotomous) and to the number of comparison groups (one, two, more than two), there were common elements to each test. For example, in each test of hypothesis, there are two errors that can be committed. The first is called a Type I error and refers to the situation where we incorrectly reject H 0 when in fact it is true. In the first step of any test of hypothesis, we select a level of significance, α, and α = P(Type I error) = P(Reject H 0 | H 0 is true). Because we purposely select a small value for α, we control the probability of committing a Type I error. The second type of error is called a Type II error, and it is defined as the probability that we do not reject H 0 when it is false. The probability of a Type II error is denoted β, and β = P(Type II error) = P(Do not Reject H 0 | H 0 is false). In hypothesis testing, we usually focus on power, which is defined as the probability that we reject H 0 when it is false, i.e., power = 1-β = P(Reject H 0 | H 0 is false). In other words, power is the probability that a test correctly rejects a false null hypothesis. A good test is one with a low probability of committing a Type I error (i.e., small α) and high power (i.e., small β).

Here we present formulas to determine the sample size required to ensure that a test has high power. The sample size computations depend on the level of significance, α; the desired power of the test, 1-β; the variability of the outcome; and the effect size. The effect size is the difference in the parameter of interest that represents a clinically meaningful difference. Similar to the margin of error in confidence interval applications, the effect size is determined based on clinical or practical criteria and not statistical criteria.

The concept of statistical power can be difficult to grasp. Before presenting the formulas to determine the sample sizes required to ensure high power in a test, we will first discuss power from a conceptual point of view.  

Suppose we want to test the following hypotheses at α=0.05: H 0 : μ = 90 versus H 1 : μ ≠ 90. To test the hypotheses, suppose we select a sample of size n=100. For this example, assume that the standard deviation of the outcome is σ=20. We compute the sample mean and then must decide whether the sample mean provides evidence to support the alternative hypothesis or not. This is done by computing a test statistic and comparing the test statistic to an appropriate critical value. If the null hypothesis is true (μ=90), then we are likely to select a sample whose mean is close in value to 90. However, it is also possible to select a sample whose mean is much larger or much smaller than 90. Recall from the Central Limit Theorem (see page 11 in the module on Probability) that for large n (here n=100 is sufficiently large), the distribution of the sample means is approximately normal, with a mean of μ = 90 and a standard deviation of σ/√n = 20/√100 = 2.

If the null hypothesis is true, it is possible to observe any sample mean shown in the figure below; all are possible under H 0 : μ = 90.  

[Figure: Normal distribution of the sample mean when μ = 90; a bell-shaped curve centered at 90.]

Rejection Region for Test H 0 : μ = 90 versus H 1 : μ ≠ 90 at α =0.05

[Figure: Distribution of the sample mean centered at 90, with rejection regions in the two tails; at α = 0.05, each tail has an area of 0.025.]

The areas in the two tails of the curve represent the probability of a Type I Error, α= 0.05. This concept was discussed in the module on Hypothesis Testing .  

Now, suppose that the alternative hypothesis, H 1 , is true (i.e., μ ≠ 90) and that the true mean is actually 94. The figure below shows the distributions of the sample mean under the null and alternative hypotheses. The values of the sample mean are shown along the horizontal axis.

[Figure: Two overlapping normal distributions, one depicting the null hypothesis with a mean of 90 and the other the alternative hypothesis with a mean of 94; a fuller explanation follows below.]

If the true mean is 94, then the alternative hypothesis is true. In our test, we selected α = 0.05 and reject H 0 if the observed sample mean exceeds 93.92 (focusing on the upper tail of the rejection region for now). The critical value (93.92) is indicated by the vertical line. The probability of a Type II error is denoted β, and β = P(Do not Reject H 0 | H 0 is false), i.e., the probability of not rejecting the null hypothesis when the null hypothesis is false. β is shown in the figure above as the area under the rightmost curve (H 1 ) to the left of the vertical line (where we do not reject H 0 ). Power is defined as 1-β = P(Reject H 0 | H 0 is false) and is shown in the figure as the area under the rightmost curve (H 1 ) to the right of the vertical line (where we reject H 0 ).

Note that β and power are related to α, the variability of the outcome, and the effect size. From the figure above we can see what happens to β and power if we increase α. Suppose, for example, we increase α to α=0.10. The upper critical value would be 93.29 instead of 93.92. The vertical line would shift to the left, increasing α, decreasing β and increasing power. While a better test is one with higher power, it is not advisable to increase α as a means to increase power. Nonetheless, there is a direct relationship between α and power (as α increases, so does power).

β and power are also related to the variability of the outcome and to the effect size. The effect size is the difference in the parameter of interest (e.g., μ) that represents a clinically meaningful difference. The figure above graphically displays α, β, and power when the difference in the mean under the null as compared to the alternative hypothesis is 4 units (i.e., 90 versus 94). The figure below shows the same components for the situation where the mean under the alternative hypothesis is 98.

[Figure: Two overlapping bell-shaped distributions, one with a mean of 90 and the other with a mean of 98.]

Notice that there is much higher power when there is a larger difference between the mean under H 0 as compared to H 1 (i.e., 90 versus 98). A statistical test is much more likely to reject the null hypothesis in favor of the alternative if the true mean is 98 than if the true mean is 94. Notice also in this case that there is little overlap in the distributions under the null and alternative hypotheses. If a sample mean of 97 or higher is observed it is very unlikely that it came from a distribution whose mean is 90. In the previous figure for H 0 : μ = 90 and H 1 : μ = 94, if we observed a sample mean of 93, for example, it would not be as clear as to whether it came from a distribution whose mean is 90 or one whose mean is 94.
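The figures above can be checked numerically. Under the module's assumptions (σ = 20, n = 100, two-sided α = 0.05, upper critical value 93.92), power is only about 0.52 when the true mean is 94 but about 0.98 when it is 98. A minimal sketch:

```python
import math

def phi(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def power_two_sided(mu0, mu1, sigma, n):
    """Power of a two-sided z-test of H0: mu = mu0 (alpha = 0.05) when the true mean is mu1."""
    se = sigma / math.sqrt(n)
    upper = mu0 + 1.96 * se  # upper critical value (93.92 in the example)
    lower = mu0 - 1.96 * se  # lower critical value
    # Reject H0 whenever the sample mean falls outside (lower, upper)
    return (1 - phi((upper - mu1) / se)) + phi((lower - mu1) / se)

print(round(power_two_sided(90, 94, 20, 100), 3))  # ~0.516
print(round(power_two_sided(90, 98, 20, 100), 3))  # ~0.979
```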

In designing studies most people consider power of 80% or 90% (just as we generally use 95% as the confidence level for confidence interval estimates). The inputs for the sample size formulas include the desired power, the level of significance and the effect size. The effect size is selected to represent a clinically meaningful or practically important difference in the parameter of interest, as we will illustrate.  

The formulas we present below produce the minimum sample size to ensure that the test of hypothesis will have a specified probability of rejecting the null hypothesis when it is false (i.e., a specified power). In planning studies, investigators again must account for attrition or loss to follow-up. The formulas shown below produce the number of participants needed with complete data, and we will illustrate how attrition is addressed in planning studies.
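As a preview of this kind of formula, the minimum sample size for a one-sample, two-sided test can be written n = ((z₁₋α/₂ + z₁₋β)·σ/ES)². The sketch below assumes α = 0.05 and 80% power (z values 1.96 and 0.84); it is a standard textbook formula, not necessarily identical to the ones presented on later pages:

```python
import math

def n_for_one_sample_mean(sigma, effect_size, z_alpha=1.96, z_beta=0.84):
    """Minimum n for a one-sample two-sided test (defaults: alpha=0.05, power=80%)."""
    return math.ceil(((z_alpha + z_beta) * sigma / effect_size) ** 2)

# Detect a 4-unit shift (e.g., 90 versus 94) when sigma = 20:
print(n_for_one_sample_mean(20, 4))  # -> 196
```

Consistent with the conceptual discussion above, n = 100 gives only about 50% power for this effect size; roughly twice as many participants are needed to reach 80%.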


Content ©2020. All Rights Reserved. Date last modified: March 13, 2020. Wayne W. LaMorte, MD, PhD, MPH



Dental Press J Orthod, v.19(4), Jul-Aug 2014


How sample size influences research outcomes

Jorge Faber

1 Adjunct professor, Department of Orthodontics, University of Brasília.

Lilian Martins Fonseca

2 Invited Professor, Department of Orthodontics, University of Brasília.

Sample size calculation is part of the early stages of conducting an epidemiological, clinical or lab study. In preparing a scientific paper, there are ethical and methodological indications for its use. Two investigations conducted with the same methodology and achieving equivalent results, but differing only in sample size, may point the researcher in different directions when it comes to making clinical decisions. Therefore, ideally, samples should not be small and, contrary to what one might think, should not be excessive. The aim of this paper is to discuss in clinical language the main implications of sample size when interpreting a study.


In recent years a growing concern has overwhelmed the scientific community in the healthcare area: sample size calculation. Although at first blush it may seem like an excessive concern with methodological issues, notably to clinicians, such concern is utterly justifiable. This issue is of paramount importance.

Samples should not be either too big or too small since both have limitations that can compromise the conclusions drawn from the studies. Too small a sample may prevent the findings from being extrapolated, whereas too large a sample may amplify the detection of differences, emphasizing statistical differences that are not clinically relevant. 1 We will discuss in this article the major impacts of sample size on orthodontic studies.

FACTORS THAT AFFECT SAMPLE SIZE

The purpose of estimating the appropriate sample size is to produce studies capable of detecting clinically relevant differences. Bearing this point in mind, there are different formulas to calculate sample size. 2 , 3 These formulas comprise several aspects which are listed below. Most sample size calculators available on the web have limited validity because they use a single formula - which is usually not divulged - to generate sample sizes for the studies.

The first aspect is the type of variable being studied. For example, it should be determined if the variable is categorical like the Angle classification (Class I, II or III), or continuous like the length of the dental arch (usually measured in millimeters).

It is then necessary to determine the relationship between the groups that will be evaluated and the statistical analysis that will be employed. Are we going to evaluate groups that are independent, i.e., the measurements of one group do not influence the other? Are they dependent groups like the measurements taken before and after treatment? Are we going to use a split-mouth design, whereby treatment is performed on one quadrant and a different therapy on another quadrant? Will we be using t-test or chi-square test? All these questions lead to different sample size calculation formulas.

Subsequently, we have to answer the question concerning which results we envisage if a standard treatment is performed. What is the mean value or the expected ratio? The answer to this question is usually obtained from the literature or by means of pilot studies.

It is also important to determine what is the smallest magnitude of the effect and the extent to which it is clinically relevant. For example, how many degrees of difference in the ANB angle can be considered relevant? It is vital that we address this issue. The smaller the difference that we wish to identify, the greater the number of cases in a study. If researchers wish to detect a difference as small as 0.1° in an ANB angle, they will probably need thousands of patients in their study. If this value rises to 1°, the number of cases required falls drastically.

Finally, it is essential that the researcher determine the level of significance (the probability of a type I error) and the type II error the study will accept as reasonable, that is, the probability of failing to reject the null hypothesis when the hypothesis is actually false.

With this information in hand, we will apply the appropriate formula according to the study design in question, and determine the sample size. Today, this calculation is typically carried out with the aid of a computer program. For example, Pocock's formula 2 for continuous variables is frequently used in our specialty. It is used in studies where one wishes to examine the difference between data means with normal distribution and equal-size, independent groups.
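As an illustration, the standard two-sample formula for comparing independent means with normal data and equal groups (the situation Pocock's formula addresses) can be sketched in a few lines of Python. The standard deviation and clinically relevant difference below are invented for illustration only; they are not taken from any real study.

```python
import math
from statistics import NormalDist

def n_per_group(sigma, delta, alpha=0.05, power=0.80):
    """Per-group sample size for comparing two independent means
    (normal data, equal groups): n = 2*sigma^2*(z_a + z_b)^2 / delta^2."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_b = NormalDist().inv_cdf(power)          # about 0.84 for 80% power
    return math.ceil(2 * sigma ** 2 * (z_a + z_b) ** 2 / delta ** 2)

# Hypothetical ANB-angle example: sd of 2 degrees, detect a 1-degree difference
print(n_per_group(sigma=2.0, delta=1.0))   # 63 per group
# Chasing a 0.1-degree difference instead inflates the requirement enormously
print(n_per_group(sigma=2.0, delta=0.1))   # 6280 per group
```

The second call illustrates the point made earlier: detecting a difference ten times smaller requires a sample roughly a hundred times larger.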

PROBLEMS WITH VERY SMALL SAMPLES

Try to envision the following scenario. A researcher conducts a study on patients treated with a new device which, although very uncomfortable, has the potential to improve treatment of Class II malocclusions. The researcher wishes to compare the new functional device with the Herbst appliance, with patients randomly assigned to each group. The researcher is not aware, but we are, that 60 subjects (30 patients in each group) are needed to ensure sufficient power to extrapolate the statistical analysis results to the overall population; in other words, to feel confident that these results can serve as a parameter on which to base the proposed treatment. Furthermore, we also know, although the researcher does not, that this new therapy is less effective than the traditional method.

However, the researcher used only 15 patients in each group. The results of the study showed that the new device is inferior to conventional treatment. What are the implications?

The first is that using a sample smaller than the ideal increases the chance of accepting a false premise as true: for all this underpowered study shows, the proposed device may actually have no disadvantage compared to traditional therapy. Furthermore, people were subjected to a study, and endured in vain all the additional discomfort associated with the therapy, given that the goals of the study could not be achieved. In addition, financial and time resources were squandered, since the study ultimately contributes nothing to improving clinical practice or quality of life. The situation becomes even worse if the research involves public funding: a total waste of taxpayer money.

PROBLEMS WITH VERY LARGE SAMPLES

There is a widespread belief that large samples are ideal for research or statistical analysis. However, this is not always true. Using the above example as a case study, a very large sample that exceeds the value estimated by the sample size calculation presents its own hurdles.

The first is ethical. Should a study be performed with more patients than necessary? This means that more people than needed are exposed to the new therapy. Potentially, this implies increased hassle and risk. Obviously the problem is compounded if the new protocol is inferior to the traditional method: More patients are involved in a new, uncomfortable therapy that yields inferior results.

The second obstacle is that the use of a larger number of cases can also involve more financial and human resources than necessary to obtain the desired response.

In addition to these factors, there is another noteworthy issue that has to do with statistics. Statistical tests were developed to handle samples, not populations. When a very large number of cases is included in the analysis, its power is substantially increased. This implies an exaggerated tendency to reject null hypotheses on the basis of clinically negligible differences: what is insignificant becomes statistically significant. Thus, a statistically significant difference of 0.1° in the ANB angle between the groups cited in the previous example would obviously produce no clinical difference in the effects of wearing an appliance.
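This effect can be sketched numerically. The snippet below is a simplified two-sample z-test with an assumed common standard deviation of 2° (a value invented for illustration); it shows how the same negligible 0.1° difference that is nowhere near significance in a modest study becomes "significant" once each group holds thousands of cases.

```python
import math
from statistics import NormalDist

def two_sample_z_p(diff, sigma, n):
    """Two-sided p-value for a difference in means between two groups of
    size n each, assuming a known common standard deviation (z-test)."""
    se = sigma * math.sqrt(2 / n)            # standard error of the difference
    z = abs(diff) / se
    return 2 * (1 - NormalDist().cdf(z))

# A clinically negligible 0.1-degree ANB difference (sd assumed to be 2 degrees)
print(two_sample_z_p(0.1, 2.0, n=100) < 0.05)     # False: not significant
print(two_sample_z_p(0.1, 2.0, n=10_000) < 0.05)  # True: "significant"
```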

When very large samples are available in a retrospective study, the researcher needs first to collect subsamples randomly, and only then perform the statistical test. If it is a prospective study, the researcher should collect only what is necessary, and include a few more individuals to compensate for subjects that leave the study.
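A minimal sketch of that subsampling step in Python, with a hypothetical registry size and target sample size:

```python
import random

random.seed(42)  # reproducible illustration

# Hypothetical retrospective registry of 5,000 record IDs, when the
# sample size calculation called for only 120 of them.
registry = list(range(1, 5001))
subsample = random.sample(registry, k=120)  # simple random draw, no repeats

print(len(subsample), len(set(subsample)))  # 120 distinct records
```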

CONCLUSIONS

In designing a study, sample size calculation is important for methodological and ethical reasons, as well as for reasons of human and financial resources. When reading an article, the reader should be on the alert to ascertain that the study they are reading was subjected to sample size calculation. In the absence of this calculation, the findings of the study should be interpreted with caution.

An appropriate sample renders the research more efficient: Data generated are reliable, resource investment is as limited as possible, while conforming to ethical principles. The use of sample size calculation directly influences research findings. Very small samples undermine the internal and external validity of a study. Very large samples tend to transform small differences into statistically significant differences - even when they are clinically insignificant. As a result, both researchers and clinicians are misguided, which may lead to failure in treatment decisions.

How to cite this article: Faber J, Fonseca LM. How sample size influences research outcomes. Dental Press J Orthod. 2014 July-Aug;19(4):27-9. DOI: http://dx.doi.org/10.1590/2176-9451.19.4.027-029.ebo


JAMA Guide to Statistics and Methods

Sample Size Calculation for a Hypothesis Test

Lynne Stokes




This JAMA Guide to Statistics and Methods explains the importance of considering sample size when interpreting study results, how the power analysis can help calculate the appropriate sample size, and the potential pitfalls of this approach.

Koegelenberg et al 1 reported the results of a randomized clinical trial (RCT) that investigated whether treatment with a nicotine patch in addition to varenicline produced higher rates of smoking abstinence than varenicline alone. The primary results were positive; that is, patients receiving the combination therapy were more likely to achieve continuous abstinence at 12 weeks than patients receiving varenicline alone. The absolute difference in the abstinence rate was estimated to be approximately 14%, which was statistically significant at level α = .05.

These findings differed from the results reported in 2 previous studies 2,3 of the same question, which detected no difference in treatments. What explains this difference? One explanation offered by the authors is that the previous studies “…may have been inadequately powered,” which means the sample size in those studies may have been too small to identify a difference between the treatments tested.

Why Is Power Analysis Used?

The sample size in a research investigation should be large enough that differences occurring by chance are rare but should not be larger than necessary, to avoid waste of resources and to prevent exposure of research participants to risk associated with the interventions. With any study, but especially one with a very small sample size, a difference in observed rates may have occurred by chance alone and therefore cannot automatically be considered statistically significant.

In developing the methods for a study, investigators conduct a power analysis to calculate sample size. The power of a hypothesis test is the probability of obtaining a statistically significant result when there is a true difference in treatments. For example, suppose, as Koegelenberg et al 1 did, that the smoking abstinence rate were 45% for varenicline alone and 14% larger, or 59%, for the combination regimen. Power is the probability that, under these conditions, the trial would detect a difference in rates large enough to be statistically significant at a certain level α (ie, α is the probability of a type I error, which occurs by rejecting a null hypothesis that is actually true).
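Under the normal approximation, this power can be sketched directly. The snippet below assumes the 45% vs. 59% abstinence rates mentioned above and shows how power grows with the per-group sample size (the sample sizes tried are arbitrary).

```python
import math
from statistics import NormalDist

def power_two_proportions(p1, p2, n, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test with n
    participants per group (normal approximation to the binomial)."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    se = math.sqrt(p1 * (1 - p1) / n + p2 * (1 - p2) / n)
    return NormalDist().cdf(abs(p2 - p1) / se - z_a)

# Abstinence rates of 45% (varenicline alone) vs 59% (combination therapy)
for n in (50, 100, 200):
    print(n, round(power_two_proportions(0.45, 0.59, n), 2))
```

With 50 patients per group such a trial would detect the 14% difference less than a third of the time; roughly 200 per group are needed to reach the conventional 80% power.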



Chapter 9: Data Analysis – Hypothesis Testing, Estimating Sample Size, and Modeling

This chapter provides the foundational concepts and tools for analyzing data commonly seen in the transportation profession. The concepts include hypothesis testing, assessing the adequacy of the sample sizes, and estimating the least square model fit for the data. These applications are useful in collecting and analyzing travel speed data, conducting before-after comparisons, and studying the association between variables, e.g., travel speed and congestion as measured by traffic density on the road.

Learning Objectives

At the end of the chapter, the reader should be able to do the following:

  • Estimate the required sample size for testing.
  • Use specific significance tests including, z-test, t-test (one and two samples), chi-squared test.
  • Compute corresponding p-value for the tests.
  • Compute and interpret simple linear regression between two variables.
  • Estimate a least-squares fit of data.
  • Find confidence intervals for parameter estimates.
  • Use of spreadsheet tools (e.g., MS Excel) and basic programming (e.g., R or SPSS) to calculate complex and repetitive mathematical problems similar to earthwork estimates (cut, fill, area, etc.), trip generation and distribution, and linear optimization.
  • Use of spreadsheet tools (e.g., MS Excel) and basic programming (e.g., R or SPSS) to create relevant graphs and charts from data points.
  • Identify topics in the introductory transportation engineering courses that build on the concepts discussed in this chapter.

Central Limit Theorem

In this section, you will learn about the central limit theorem by reading each description and watching the videos. Short problems to check your understanding are also included.

The Central Limit theorem for Sample Means

The sampling distribution is a theoretical distribution. It is created by taking many samples of size n from a population. Each sample mean is then treated like a single observation of this new distribution, the sampling distribution. The genius of thinking this way is that it recognizes that when we sample, we are creating an observation and that observation must come from some particular distribution. The Central Limit Theorem answers the question: from what distribution did a sample mean come? If this is discovered, then we can treat a sample mean just like any other observation and calculate probabilities about what values it might take on. We have effectively moved from the world of statistics, where we know only what we have from the sample, to the world of probability, where we know the distribution from which the sample mean came and the parameters of that distribution.

The reasons that one samples a population are obvious. The time and expense of checking every invoice to determine its validity or every shipment to see if it contains all the items may well exceed the cost of errors in billing or shipping. For some products, sampling would require destroying them, called destructive sampling. One such example is measuring the ability of a metal to withstand saltwater corrosion for parts on ocean going vessels.

Sampling thus raises an important question: just which sample was drawn? Even if the sample were randomly drawn, there are theoretically an almost infinite number of samples. With just 100 items, there are more than 75 million unique samples of size five that can be drawn. If six are in the sample, the number of possible samples increases to just more than one billion. Of the 75 million possible samples, then, which one did you get? If there is variation in the items to be sampled, there will be variation in the samples. One could draw an “unlucky” sample and make very wrong conclusions concerning the population. This recognition that any sample we draw is really only one from a distribution of samples provides us with what is probably the single most important theorem in statistics: the Central Limit Theorem. Without the Central Limit Theorem, it would be impossible to proceed to inferential statistics from simple probability theory. In its most basic form, the Central Limit Theorem states that regardless of the underlying probability density function of the population data, the theoretical distribution of the means of samples from the population will be normally distributed. In essence, this says that the mean of a sample should be treated like an observation drawn from a normal distribution. The Central Limit Theorem only holds if the sample size is “large enough,” which has been shown to be only 30 observations or more.
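A short simulation makes the theorem concrete. Here the population is exponential (strongly skewed, with mean 1 and standard deviation 1), yet the means of repeated samples of 30 observations center on the population mean with spread close to sigma/sqrt(n), roughly 0.18; the distribution and sample sizes are chosen purely for illustration.

```python
import random
import statistics

random.seed(1)  # reproducible illustration

# Population: exponential with mean 1 and sd 1 - strongly skewed, nothing
# like a normal curve. Draw 10,000 samples of n = 30 and keep each mean.
n, reps = 30, 10_000
sample_means = [statistics.fmean(random.expovariate(1.0) for _ in range(n))
                for _ in range(reps)]

# The sample means cluster around the population mean, with spread close
# to sigma / sqrt(n) = 1 / sqrt(30), about 0.18.
print(round(statistics.fmean(sample_means), 2))
print(round(statistics.stdev(sample_means), 2))
```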

Figure 1 graphically displays this very important proposition.

Graph of the population and normal sampling distribution

Notice that the horizontal axis in the top panel is labeled X. These are the individual observations of the population. This is the unknown distribution of the population values. The graph is purposefully drawn all squiggly to show that it does not matter just how oddball it really is. Remember, we will never know what this distribution looks like, or its mean or standard deviation for that matter.

The horizontal axis of the bottom panel is labeled \overline{X^{\prime} s}: these are the sample means, and their distribution is the sampling distribution.

The Central Limit Theorem goes even further and tells us the mean and standard deviation of this theoretical distribution.

Table 1
Mean of the sampling distribution: \mu_{\overline{X}}=\mu
Standard deviation of the sampling distribution: \sigma_{\overline{X}}=\frac{\sigma}{\sqrt{n}}

Sampling Distribution of the Sample Mean

Sampling Distribution of the Sample Mean (Part 2)

Sampling Distributions: Sampling Distribution of the Mean

Using the Central Limit Theorem

Law of Large Numbers

The Law of Large Numbers tells us that as the sample size n increases, the sample mean tends toward the population mean; equivalently, the standard deviation of the sampling distribution of \bar{X} \text { is } \frac{\sigma}{\sqrt{n}}, which shrinks as n grows.

Indeed, there are two critical issues that flow from the Central Limit Theorem and the application of the Law of Large numbers to it. These are listed below.

  • The probability density function of the sampling distribution of means is normally distributed regardless of the underlying distribution of the population observations and
  • Standard deviation of the sampling distribution decreases as the size of the samples that were used to calculate the means for the sampling distribution increases.

Taking these in order: it would seem counterintuitive that the population may have any distribution and the distribution of means coming from it would be normally distributed. With the use of computers, experiments can be simulated that show the process by which the sampling distribution changes as the sample size is increased. These simulations show visually the results of the mathematical proof of the Central Limit Theorem.


At non-extreme values of n, this relationship between the standard deviation of the sampling distribution and the sample size plays a very important part in our ability to estimate the parameters in which we are interested.

Figure 3 shows three sampling distributions. The only change made is the sample size used to get the sample means for each distribution. As the sample size increases from 10 to 30 to 50, the standard deviations of the respective sampling distributions decrease, because the sample size appears in the denominator of the standard deviation of each sampling distribution.

Normal distribution with variety of sample sizes.

The Central Limit Theorem for Proportions


In order to find the distribution from which sample proportions come we need to develop the sampling distribution of sample proportions just as we did for sample means. So again, imagine that we randomly sample say 50 people and ask them if they support the new school bond issue. From this we find a sample proportion, p’, and graph it on the axis of p’s. We do this again and again etc., etc. until we have the theoretical distribution of p’s. Some sample proportions will show high favorability toward the bond issue and others will show low favorability because random sampling will reflect the variation of views within the population. What we have done can be seen in Figure 5. The top panel is the population distributions of probabilities for each possible value of the random variable X. While we do not know what the specific distribution looks like because we do not know p, the population parameter, we do know that it must look something like this. In reality, we do not know either the mean or the standard deviation of this population distribution, the same difficulty we faced when analyzing the X’s previously.

Bar group of population and the corresponding normal sampling distribution

Importantly, in the case of the analysis of the distribution of sample means, the Central Limit Theorem told us the expected value of the mean of the sample means in the sampling distribution, and the standard deviation of the sampling distribution. Again, the Central Limit Theorem provides this information for the sampling distribution for proportions. The answers are:

\mu_{p^{\prime}}=p \quad \text { and } \quad \sigma_{p^{\prime}}=\sqrt{\frac{p(1-p)}{n}}

Both these conclusions are the same as we found for the sampling distribution for sample means. However, in this case, because the mean and standard deviation of the binomial distribution both rely upon p , the formula for the standard deviation of the sampling distribution requires algebraic manipulation to be useful. The standard deviation of the sampling distribution for proportions is thus:

\sigma_{p^{\prime}}=\sqrt{\frac{p(1-p)}{n}}

Table 2
Mean of the sampling distribution of p^{\prime}: \mu_{p^{\prime}}=p
Standard deviation of the sampling distribution of p^{\prime}: \sigma_{p^{\prime}}=\sqrt{\frac{p(1-p)}{n}}

Table 2 summarizes these results and shows the relationship between the population parameter (\mu \text { or } p), the sample, and the sampling distribution.

Find Confidence Intervals for Parameter Estimates

In this section, you will learn how to find and estimate confidence intervals by reading each description along with watching the videos included. Also, short problems to check your understanding are included.

Confidence Intervals & Estimation: Point Estimates Explained

Introduction to Confidence Intervals

Suppose you were trying to determine the mean rent of a two-bedroom apartment in your town. You might look in the classified section of the newspaper, write down several rents listed, and average them together. You would have obtained a point estimate of the true mean. If you are trying to determine the percentage of times you make a basket when shooting a basketball, you might count the number of shots you make and divide that by the number of shots you attempted. In this case, you would have obtained a point estimate for the true proportion the parameter p in the binomial probability density function.

We use sample data to make generalizations about an unknown population. This part of statistics is called inferential statistics . The sample data help us to make an estimate of a population parameter. We realize that the point estimate is most likely not the exact value of the population parameter, but close to it. After calculating point estimates, we construct interval estimates, called confidence intervals. What statistics provides us beyond a simple average , or point estimate, is an estimate to which we can attach a probability of accuracy, what we will call a confidence level. We make inferences with a known level of probability.


A confidence interval is another type of estimate but, instead of being just one number, it is an interval of numbers. The interval of numbers is a range of values calculated from a given set of sample data. The confidence interval is likely to include the unknown population parameter.

Suppose, for example, that a sample of 100 users has a mean of two songs downloaded from iTunes per month and that the population standard deviation is known to be \sigma=1. By the empirical rule, an interval of two standard errors on either side of the sample mean is 2 \pm 2\left(\frac{1}{\sqrt{100}}\right).

We say that we are 95% confident that the unknown population mean number of songs downloaded from iTunes per month is between 1.8 and 2.2. The 95% confidence interval is (1.8, 2.2). Please note that we talked in terms of 95% confidence using the empirical rule. The empirical rule for two standard deviations is only approximately 95% of the probability under the normal distribution. To be precise, two standard deviations under a normal distribution is actually 95.44% of the probability. To calculate the exact 95% confidence level, we would use 1.96 standard deviations.

Remember that a confidence interval is created for an unknown population parameter like the population mean, 𝜇 .

For the confidence interval for a mean the formula would be:

\mu=\bar{X} \pm Z_\alpha \sigma / \sqrt{n}

Or written another way as:

\bar{X}-Z_\alpha \sigma / \sqrt{n} \leq \mu \leq \bar{X}+Z_\alpha \sigma / \sqrt{n}
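A minimal sketch of this calculation, using the iTunes example above (sample mean 2, sigma = 1, and an assumed sample of n = 100, which reproduces the interval quoted earlier):

```python
import math
from statistics import NormalDist

def mean_ci(xbar, sigma, n, confidence=0.95):
    """Confidence interval for a population mean with sigma known:
    xbar +/- Z * sigma / sqrt(n)."""
    z = NormalDist().inv_cdf((1 + confidence) / 2)  # 1.96 for 95%
    ebm = z * sigma / math.sqrt(n)
    return xbar - ebm, xbar + ebm

# Sample mean of 2 downloads, sigma = 1, hypothetical sample of 100 users
lo, hi = mean_ci(2.0, 1.0, 100)
print(round(lo, 2), round(hi, 2))  # 1.8 2.2
```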

A Confidence Interval for a Population Mean, Standard Deviation Known or Large Sample Size

A confidence interval for a population mean, when the population standard deviation is known, is based on the conclusion of the Central Limit Theorem that the sampling distribution of the sample means follows an approximately normal distribution.

Calculating the Confidence Interval

Consider the standardizing formula for the sampling distribution developed in the discussion of the Central Limit Theorem:

Z=\frac{\bar{X}-\mu_{\bar{X}}}{\sigma_{\bar{X}}}=\frac{\bar{X}-\mu}{\sigma / \sqrt{n}}

Solving this expression for \mu yields the formula for a confidence interval for the mean of a population.

The value Z_\alpha is the number of standard deviations of the sampling distribution that corresponds to the desired confidence level; commonly used values are shown in Table 3.

Table 3
Confidence Level    Z_\alpha
0.80    1.28
0.90    1.645
0.95    1.96
0.99    2.58
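The z-values in Table 3 need not be memorized; they can be recovered from the inverse normal CDF, since a central area of CL leaves \alpha/2 in each tail, i.e., a cumulative probability of (1 + CL)/2:

```python
from statistics import NormalDist

# Recover the Table 3 z-values from the confidence level: the central
# area CL leaves alpha/2 in each tail, so Z corresponds to a cumulative
# probability of (1 + CL) / 2.
for cl in (0.80, 0.90, 0.95, 0.99):
    print(cl, round(NormalDist().inv_cdf((1 + cl) / 2), 3))
# The 0.99 level yields 2.576, which Table 3 rounds to 2.58.
```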

The confidence level of the interval is (1-\alpha).

Let us say we know that the actual population mean number of iTunes downloads is 2.1. The true population mean falls within the range of the 95% confidence interval. There is absolutely nothing to guarantee that this will happen . Further, if the true mean falls outside of the interval, we will never know it. We must always remember that we will never ever know the true mean. Statistics simply allows us, with a given level of probability (confidence), to say that the true mean is within the range calculated.

Changing the Confidence Level or Sample Size

Here again is the formula for a confidence interval for an unknown population mean assuming we know the population standard deviation:

\bar{X}-Z_\alpha(\sigma / \sqrt{n}) \leq \mu \leq \bar{X}+Z_\alpha(\sigma / \sqrt{n})

For a moment we should ask just what we desire in a confidence interval. Our goal was to estimate the population mean from a sample. We have forsaken the hope that we will ever find the true population mean, and population standard deviation for that matter, for any case except where we have an extremely small population and the cost of gathering the data of interest is very small. In all other cases we must rely on samples. With the Central Limit Theorem, we have the tools to provide a meaningful confidence interval with a given level of confidence, meaning a known probability of being wrong. By meaningful confidence interval we mean one that is useful. Imagine that you are asked for a confidence interval for the ages of your classmates. You have taken a sample and find a mean of 19.8 years. You wish to be very confident, so you report an interval between 9.8 years and 29.8 years. This interval would certainly contain the true population mean and have a very high confidence level. However, it hardly qualifies as meaningful. The very best confidence interval is narrow while having high confidence. There is a natural tension between these two goals. The higher the level of confidence, the wider the confidence interval, as in the case of the students’ ages above. We can see this tension in the equation for the confidence interval.

\mu=\bar{x} \pm Z_\alpha\left(\frac{\sigma}{\sqrt{n}}\right)

Calculating the Confidence Interval: An Alternative Approach

The confidence interval estimate will have the form:

(Point estimate – error bound, point estimate + error bound) or, in symbols,

(\bar{x}-E B M, \bar{x}+E B M) .

The mathematical formula for this confidence interval is:

\bar{x} \pm E B M, \quad \text { where } E B M=\left(Z_{\frac{\alpha}{2}}\right)\left(\frac{\sigma}{\sqrt{n}}\right)

The error bound for a population mean (EBM) depends on the confidence level (abbreviated CL ). The confidence level is often considered the probability that the calculated confidence interval estimate will contain the true population parameter. However, it is more accurate to state that the confidence level is the percent of confidence intervals that contain the true population parameter when repeated samples are taken. Most often, it is the choice of the person constructing the confidence interval to choose a confidence level of 90% or higher because that person wants to be reasonably certain of his or her conclusions.

The remaining probability, \alpha=1-C L, is the chance that the interval does not contain the unknown population parameter.

To capture the central 90%, we must go out 1.645 standard deviations on either side of the calculated sample mean. The value 1.645 is the z-score from a standard normal probability distribution that puts an area of 0.90 in the center, an area of 0.05 in the far-left tail, and an area of 0.05 in the far-right tail.

The quantity \frac{\sigma}{\sqrt{n}} is the standard deviation of the sampling distribution of the sample mean, often called the standard error.

Calculating the Confidence Interval Using EBM

To construct a confidence interval estimate for an unknown population mean, we need data from a random sample. The steps to construct and interpret the confidence interval are listed below.

  • Find the z-score from the standard normal table that corresponds to the confidence level desired.
  • Calculate the error bound EBM.
  • Construct the confidence interval.
  • Write a sentence that interprets the estimate in the context of the situation in the problem.

Finding the z-score for the Stated Confidence Level

The standardized statistic follows the standard normal distribution, Z \sim N(0,1); the z-score for the stated confidence level is read from its table.

Calculating the Error Bound (EBM)

E B M=\left(Z_{\frac{\alpha}{2}}\right)\left(\frac{\sigma}{\sqrt{n}}\right)

Constructing the Confidence Interval

(\bar{x}-E B M, \bar{x}+E B M)

The graph gives a picture of the entire situation.

C L+\frac{\alpha}{2}+\frac{\alpha}{2}=C L+\alpha=1

Confidence Interval for Mean: 1 Sample Z Test (Using Formula)

Check Your Understanding: Confidence Interval for Mean

A Confidence Interval for a Population Mean, Standard Deviation Unknown, Small Sample Case

Up until the mid-1970s, some statisticians used the normal distribution approximation for large sample sizes and used the Student’s t-distribution only for sample sizes of at most 30 observations.

t=\frac{\bar{x}-\mu}{\left(\frac{s}{\sqrt{n}}\right)}

Properties of the Student’s t-distribution

  • The graph for the Student’s t-distribution is similar to the standard normal curve and at infinite degrees of freedom it is the normal distribution. You can confirm this by reading the bottom line at infinite degrees of freedom for a familiar level of confidence, e.g., at column 0.05, 95% level of confidence, we find the t-value of 1.96 at infinite degrees of freedom.
  • The mean for the Student’s t-distribution is zero and the distribution is symmetric about zero, again like the standard normal distribution.
  • The Student’s t-distribution has more probability in its tails than the standard normal distribution because the spread of the t-distribution is greater than the spread of the standard normal. So, the graph of the Student’s t-distribution will be thicker in the tails and shorter in the center than the graph of the standard normal distribution.
  • The exact shape of the Student’s t-distribution depends on the degrees of freedom. As the degrees of freedom increases, the graph of Student’s t-distribution becomes more like the graph of the standard normal distribution.

A probability table for the Student’s t-distribution is used to calculate t-values at various commonly used levels of confidence. The table gives t-scores that correspond to the confidence level (column) and degrees of freedom (row). When using a t-table, note that some tables are formatted to show the confidence level in the column headings, while the column headings in some tables may show only corresponding area in one or both tails. Notice that at the bottom the table will show the t-value for infinite degrees of freedom. Mathematically, as the degrees of freedom increase, the t-distribution approaches the standard normal distribution. You can find familiar Z-values by looking in the relevant alpha column and reading the value in the last row.

A Student’s t-table gives t-scores given the degrees of freedom and the right-tailed probability.

The Student’s t-distribution has one of the most desirable properties of the normal: it is symmetrical. What the Student’s t-distribution does is spread out the horizontal axis, so it takes a larger number of standard deviations to capture the same amount of probability. In reality there are an infinite number of Student’s t-distributions, one for each adjustment to the sample size. As the sample size increases, the Student’s t-distribution becomes more and more like the normal distribution. When the sample size reaches 30 the normal distribution is usually substituted for the Student’s t because they are so much alike. This relationship between the Student’s t-distribution and the normal distribution is shown in Figure 8.

Graph of the relationship between the normal and t distribution
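Because the critical value now comes from the t-distribution, a t-table lookup replaces the z-score. The sketch below uses a hypothetical sample of 10 measurements and the tabled value t* = 2.262 for 95% confidence with 9 degrees of freedom (Python's standard library has no t quantile function, so the critical value is supplied by hand).

```python
import math
import statistics

def mean_ci_t(sample, t_star):
    """Confidence interval for a mean with sigma unknown:
    xbar +/- t* * s / sqrt(n), where t* comes from a Student's t-table
    for n - 1 degrees of freedom."""
    n = len(sample)
    xbar = statistics.fmean(sample)
    s = statistics.stdev(sample)  # sample standard deviation, df = n - 1
    ebm = t_star * s / math.sqrt(n)
    return xbar - ebm, xbar + ebm

# Hypothetical sample of 10 measurements; a t-table gives t* = 2.262
# for 95% confidence with 9 degrees of freedom.
data = [9.5, 10.1, 9.8, 10.4, 9.9, 10.0, 10.3, 9.7, 10.2, 9.6]
lo, hi = mean_ci_t(data, t_star=2.262)
print(round(lo, 2), round(hi, 2))
```

Note that 2.262 is larger than the 1.96 used with a known sigma, so the t-based interval is wider, reflecting the extra uncertainty from estimating the standard deviation.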

Confidence Intervals: Using the t Distribution

Check Your Understanding: Confidence Intervals

A Confidence Interval for a Population Proportion

During an election year, we see articles in the newspaper that state confidence intervals in terms of proportions or percentages. For example, a poll for a particular candidate running for president might show that the candidate has 40% of the vote within three percentage points (if the sample is large enough). Often, election polls are calculated with 95% confidence, so, the pollsters would be 95% confident that the true proportion of voters who favored the candidate would be between 0.37 and 0.43.

The procedure to find the confidence interval for a population proportion is similar to that for the population mean, but the formulas are a bit different although conceptually identical. While the formulas are different, they are based upon the same mathematical foundation given to us by the Central Limit Theorem. Because of this we will see the same basic format using the same three pieces of information: the sample value of the parameter in question, the standard deviation of the relevant sampling distribution, and the number of standard deviations we need to have the confidence in our estimate that we desire.

The number of successes X follows a binomial distribution, X \sim B(n, p), and the sample proportion is p^{\prime}=x / n, where:

x = the number of successes in the sample

n = the size of the sample

The formula for the confidence interval for a population proportion follows the same format as that for an estimate of a population mean. Remembering the sampling distribution for the proportion, the standard deviation was found to be:

\sigma_{p^{\prime}}=\sqrt{\frac{p(1-p)}{n}}

The confidence interval for a population proportion, therefore, becomes:

p=p^{\prime} \pm\left[Z_{\left(\frac{\alpha}{2}\right)} \sqrt{\frac{p^{\prime}\left(1-p^{\prime}\right)}{n}}\right]

The sample proportions p’  and q’  are estimates of the unknown population proportions p and q . The estimated proportions p’  and q’  are used because p  and q  are not known.

Remember that as p moves further from 0.5 the binomial distribution becomes less symmetrical. Because we are estimating the binomial with the symmetrical normal distribution the further away from symmetrical the binomial becomes the less confidence we have in the estimate.

This conclusion can be demonstrated through the following analysis. Proportions are based upon the binomial probability distribution. The possible outcomes are binary, either “success” or “failure.” This gives rise to a proportion, meaning the percentage of the outcomes that are “successes.” It was shown that the binomial distribution could be fully understood if we knew only the probability of a success in any one trial, called p. The mean and the standard deviation of the binomial were found to be:

\mu=n p \qquad \sigma=\sqrt{n p q}

It was also shown that the binomial could be estimated by the normal distribution if BOTH np AND nq were greater than 5. From the discussion above, it was found that the standardizing formula for the binomial distribution is:

Z=\frac{p^{\prime}-p}{\sqrt{\left(\frac{p q}{n}\right)}}

We can now manipulate this formula in just the same way we did for finding the confidence intervals for a mean, but to find the confidence interval for the binomial population parameter, p .

p^{\prime}-Z_{\frac{\alpha}{2}} \sqrt{\frac{p^{\prime} q^{\prime}}{n}} \leq p \leq p^{\prime}+Z_{\frac{\alpha}{2}} \sqrt{\frac{p^{\prime} q^{\prime}}{n}}

x = number of successes.

n = the number in the sample.

q^{\prime}=\left(1-p^{\prime}\right)

Unfortunately, there is no correction factor for cases where the sample size is small so np’  and nq’  must always be greater than 5 to develop an interval estimate for p .

Also written as:

p^{\prime}-Z_{\frac{\alpha}{2}} \sqrt{\frac{p^{\prime}\left(1-p^{\prime}\right)}{n}} \leq p \leq p^{\prime}+Z_{\frac{\alpha}{2}} \sqrt{\frac{p^{\prime}\left(1-p^{\prime}\right)}{n}}
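As a quick sketch, the interval can be computed directly in code. The counts below are hypothetical, and the z-values are hard-coded for a few common confidence levels rather than looked up from a statistics library:

```python
import math

def proportion_ci(x, n, confidence=0.95):
    """Confidence interval for a population proportion p.

    Hypothetical helper illustrating p' +/- z * sqrt(p'q'/n).
    """
    z = {0.90: 1.645, 0.95: 1.96, 0.99: 2.576}[confidence]
    p_prime = x / n                     # sample proportion
    q_prime = 1 - p_prime
    # The normal approximation requires np' and nq' both greater than 5
    assert n * p_prime > 5 and n * q_prime > 5, "sample too small"
    margin = z * math.sqrt(p_prime * q_prime / n)
    return p_prime - margin, p_prime + margin

# 130 successes in 500 trials, 95% confidence
low, high = proportion_ci(130, 500)
print(round(low, 3), round(high, 3))   # 0.222 0.298
```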

How to Construct a Confidence Interval for Population Proportion

Check Your Understanding: How to Construct a Confidence Interval for Population Proportion

Estimate the Required Sample Size for Testing

In this section, you will learn how to calculate sample size for continuous and binary random variables by reading each description and watching the included videos. Short problems to check your understanding are also included.

Calculating the Sample Size n: Continuous and Binary Random Variables

Continuous Random Variables

Usually, we have no control over the sample size of a data set. However, if we are able to set the sample size, as in cases where we are taking a survey, it is very helpful to know just how large it should be to provide the most information. Sampling can be very costly in both time and product. Simple telephone surveys will cost approximately $30.00 each, for example, and some sampling requires the destruction of the product.

The acceptable error, e, is the difference \left(\bar{X}-\mu\right) we are willing to tolerate between the sample mean and the population mean. Solving the standardizing formula for n gives the required sample size for estimating a mean:

n=\frac{Z_\alpha^2 \sigma^2}{e^2}

Binary Random Variables

What was done in cases when looking for the mean of a distribution can also be done when sampling to determine the population parameter p  for proportions. Manipulation of the standardizing formula for proportions gives:

n=\frac{Z_\alpha^2 p q}{e^2}

There is an interesting trade-off between the level of confidence and the sample size that shows up here when considering the cost of sampling. Table 4 shows the appropriate sample size at different levels of confidence and different levels of acceptable error, or tolerance.

Table 4
Required sample size (90%) Required sample size (95%) Tolerance level
1691 2401 2%
752 1067 3%
271 384 5%
68 96 10%

The table assumes the most conservative case, p=0.5 \text { and } q=0.5, which maximizes pq and therefore gives the largest required sample size.

The acceptable error, called tolerance in the table, is measured in plus or minus values from the actual proportion. For example, an acceptable error of 5% means that if the sample proportion was found to be 26 percent, the conclusion would be that the actual population proportion is between 21 and 31 percent with a 90 percent level of confidence if a sample of 271 had been taken. Likewise, if the acceptable error was set at 2%, then the population proportion would be between 24 and 28 percent with a 90 percent level of confidence but would require that the sample size be increased from 271 to 1,691. If we wished a higher level of confidence, we would require a larger sample size. Moving from a 90 percent level of confidence to a 95 percent level at a plus or minus 5% tolerance requires changing the sample size from 271 to 384. A very common sample size often seen reported in political surveys is 384. With the survey results it is frequently stated that the results are good to a plus or minus 5% level of “accuracy”.

Example: Suppose a mobile phone company wants to determine the current percentage of customers aged 50+ who use text messaging on their cell phones. How many customers aged 50+ should the company survey in order to be 90% confident that the estimated (sample) proportion is within three percentage points of the true population proportion of customers aged 50+ who use text messaging on their cell phones?

z_{\frac{\alpha}{2}}=z_{0.05}=1.645

n=\frac{z_{\frac{\alpha}{2}}^2 p q}{e^2}=\frac{(1.645)^2(0.5)(0.5)}{(0.03)^2}=751.67

Round the answer to the next higher value. The sample size should be 752 cell phone customers aged 50+ in order to be 90% confident that the estimated (sample) proportion is within three percentage points of the true population proportion of all customers aged 50+ who use text messaging on their cell phones.
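The calculation above takes only a few lines of code; `sample_size_for_proportion` is a made-up helper name:

```python
import math

def sample_size_for_proportion(z, e, p=0.5):
    """n = z^2 * p * q / e^2, rounded up to the next whole respondent.

    p = 0.5 is the most conservative (largest-n) choice when p is unknown.
    """
    q = 1 - p
    return math.ceil(z**2 * p * q / e**2)

# 90% confidence (z = 1.645), tolerance of three percentage points
print(sample_size_for_proportion(1.645, 0.03))   # 752
```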

Estimation and Confidence Intervals: Calculate Sample Size

Calculating Sample size to Predict a Population Proportion

Use Specific Significance Tests Including, Z-Test, T-Test (one and two samples), Chi-Squared Test

In this section, you will learn the fundamentals of hypothesis testing along with hypothesis testing with errors by reading each description along with watching the videos. Also, short problems to check your understanding are included.

Hypothesis Testing with One Sample

Statistical testing is part of a much larger process known as the scientific method. The scientific method, briefly, states that only by following a careful and specific process can some assertion be included in the accepted body of knowledge. This process begins with a set of assumptions upon which a theory, sometimes called a model, is built. This theory, if it has any validity, will lead to predictions; what we call hypotheses.

Statistics and statisticians are not necessarily in the business of developing theories, but in the business of testing others’ theories. Hypotheses come from these theories based upon an explicit set of assumptions and sound logic. The hypothesis comes first, before any data are gathered. Data do not create hypotheses; they are used to test them. If we bear this in mind as we study this section, the process of forming and testing hypotheses will make more sense.

One job of a statistician is to make statistical inferences about populations based on samples taken from the population. Confidence intervals are one way to estimate a population parameter. Another way to make a statistical inference is to make a decision about the value of a specific parameter. For instance, a car dealer advertises that its new small truck gets 35 miles per gallon, on average. A tutoring service claims that its method of tutoring helps 90% of its students get an A or a B. A company says that women managers in their company earn an average of $60,000 per year.

A statistician will make a decision about these claims. This process is called "hypothesis testing." A hypothesis test involves collecting data from a sample and evaluating the data. Then, the statistician makes a decision as to whether or not there is sufficient evidence, based upon analyses of the data, to reject the null hypothesis.

Hypothesis Testing: The Fundamentals

Null and Alternative Hypotheses

The actual test begins by considering two hypotheses . They are called the null hypothesis and the alternative hypothesis . These hypotheses contain opposing viewpoints.

H_0: The null hypothesis. It is a statement about the population, the status quo, that is assumed to be true unless it can be shown to be incorrect beyond a reasonable doubt.

H_a: The alternative hypothesis. It is a claim about the population that contradicts the null hypothesis and carries the burden of proof.

Since the null and alternative hypotheses are contradictory, you must examine evidence to decide if you have enough evidence to reject the null hypothesis or not. The evidence is in the form of sample data.

Table 5 presents the various hypotheses in the relevant pairs. For example, if the null hypothesis is equal to some value, the alternative has to be not equal to that value.

Table 5
H_0 H_a
Equal (=) Not equal (≠)
Greater than or equal to (≥) Less than (<)
Less than or equal to (≤) More than (>)


Example 2: We want to test whether the mean GPA of students in American colleges is different from 2.0 (out of 4.0). The null and alternative hypotheses are:

H_0: \mu=2.0 \qquad H_a: \mu \neq 2.0

Example 3: We want to test if college students take less than five years to graduate from college, on the average. The null and alternative hypotheses are:

H_0: \mu \geq 5 \qquad H_a: \mu<5

Hypothesis Testing: Setting up the Null and Alternative Hypothesis Statements

Outcomes and the Type I and Type II Errors

When you perform a hypothesis test, there are four possible outcomes depending on the actual truth (or falseness) of the null hypothesis H_0 and the decision to reject or not reject it. These outcomes are summarized in Table 6.

Table 6
Decision H_0 is true H_0 is false
Cannot reject H_0 Correct outcome Type II error
Cannot accept H_0 Type I error Correct outcome

The four possible outcomes in the table are:

  • The decision is cannot reject H_0 when H_0 is true (correct decision).
  • The decision is cannot accept H_0 when H_0 is true (Type I error).
  • The decision is cannot reject H_0 when H_0 is false (Type II error).
  • The decision is cannot accept H_0 when H_0 is false (correct decision).

Each error occurs with a particular probability: \alpha is the probability of a Type I error, and \beta is the probability of a Type II error.

The easiest way to see the relationship between the alpha error and the level of confidence is in Figure 9.

Overlapping normal distributions

By way of example, the American judicial system begins with the concept that a defendant is “presumed innocent”. This is the status quo and is the null hypothesis. The judge will tell the jury that they cannot find the defendant guilty unless the evidence indicates guilt beyond a “reasonable doubt” which is usually defined in criminal cases as 95% certainty of guilt. If the jury cannot accept the null, innocent, then action will be taken, jail time. The burden of proof always lies with the alternative hypothesis. (In civil cases, the jury needs only to be more than 50% certain of wrongdoing to find culpability, called “a preponderance of the evidence”).

The example above was for a test of a mean, but the same logic applies to tests of hypotheses for all statistical parameters one may wish to test.

Type I error: Frank thinks that his rock-climbing equipment may not be safe when, in fact, it really is safe.

Type II error: Frank thinks that his rock-climbing equipment may be safe when, in fact, it is not safe.

\boldsymbol{\alpha}= probability that Frank thinks his equipment may not be safe when it really is safe; \boldsymbol{\beta}= probability that Frank thinks his equipment may be safe when it is not.

Notice that, in this case, the error with the greater consequence is the Type II error. (If Frank thinks his rock-climbing equipment is safe, he will go ahead and use it.)

This is a situation described as “accepting a false null”.

Hypothesis Testing: Type I and Type II Errors

Check Your Understanding: Hypothesis Testing: Type I and Type II Errors

Distribution Needed for Hypothesis Testing

For a hypothesis test of a single population proportion, the normal distribution may be used provided that np^{\prime} and n\left(1-p^{\prime}\right) are both greater than five. For a mean, use the normal distribution when the population standard deviation is known and the Student's t-distribution when it is unknown.

Hypothesis Test for the Mean

Going back to the standardizing formula we can derive the test statistic for testing hypotheses concerning means.

Z_c=\frac{\bar{x}-\mu_0}{\sigma / \sqrt{n}}

This gives us the decision rule for testing a hypothesis for a two-tailed test:

Table 7
If \left|Z_c\right| \leq Z_{\frac{\alpha}{2}}: cannot reject the null hypothesis
If \left|Z_c\right|>Z_{\frac{\alpha}{2}}: cannot accept the null hypothesis

Hypothesis testing: Finding Critical Values

Normal Distribution: Finding Critical Values of Z

P-Value Approach

The p-value is compared directly with the level of significance, \alpha: if the p-value is less than \alpha, we cannot accept the null hypothesis; if the p-value is greater than \alpha, we cannot reject it.

Both decision rules will result in the same decision, and it is a matter of preference which one is used.

What is a “P-Value?”

One and Two-Tailed Tests

For example, to test a claim that a population mean differs from 100, the hypotheses are H_0: \mu=100 and H_a: \mu \neq 100, a two-tailed test, because a difference in either direction would support the claim.

The claim would be in the alternative hypothesis. The burden of proof in hypothesis testing is carried in the alternative. This is because rejecting the null, the status quo, must be done with 90 or 95 percent confidence that it cannot be maintained. Said another way, we want to have only a 5 or 10 percent probability of making a Type I error, rejecting a good null; overthrowing the status quo.

Figure 13 shows two possible cases and the form of the null and alternative hypothesis that give rise to them.

Two normal distributions one with the higher tail shaded and the other the lower tail.

Table 8: Test Statistics for Test of Means, Varying Sample Size, Population Standard Deviation Known or Unknown

If \sigma is known, use the normal distribution regardless of sample size: Z_c=\frac{\bar{x}-\mu_0}{\sigma / \sqrt{n}}. If \sigma is unknown, substitute the sample standard deviation s and use the Student's t-distribution: t_c=\frac{\bar{x}-\mu_0}{s / \sqrt{n}}.

Effects of Sample Size on Test Statistic

When \sigma is unknown, the test statistic follows a Student's t-distribution with df=(n-1) degrees of freedom; as the sample size grows, the t-distribution approaches the normal distribution.

Table 8 summarizes these rules.

A Systematic Approach for Testing a Hypothesis

A systematic approach to hypothesis testing follows the following steps and in this order. This template will work for all hypotheses that you will ever test.

  • Set up the null and alternative hypothesis. This is typically the hardest part of the process. Here the question being asked is reviewed. What parameter is being tested, a mean, a proportion, differences in means, etc. Is this a one-tailed test or two-tailed test?

  • Decide the level of significance and find the critical value or values, Z_\alpha, t_\alpha (or Z_{\alpha / 2}, t_{\alpha / 2} for a two-tailed test), that mark the tail(s) of the distribution.

  • Take a sample(s) and calculate the relevant parameters: sample mean, standard deviation, or proportion. Using the formula for the test statistic from above in step 2, now calculate the test statistic for this particular case using the parameters you have just calculated.
  • The test statistic is in the tail: Cannot Accept the null, the probability that this sample mean (proportion) came from the hypothesized distribution is too small to believe that it is the real home of these sample data.
  • The test statistic is not in the tail: Cannot Reject the null, the sample data are compatible with the hypothesized population parameter.
  • Reach a conclusion. It is best to articulate the conclusion two different ways. First, a formal statistical conclusion such as “With a 5% level of significance we cannot accept the null hypothesis that the population mean is equal to XX (units of measurement)”. The second statement of the conclusion is less formal and states the action, or lack of action, required. If the formal conclusion was that above, then the informal one might be, “The machine is broken, and we need to shut it down and call for repairs.”

All hypotheses tested will go through this same process. The only changes are the relevant formulas and those are determined by the hypothesis required to answer the original question.
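As an illustration of this template, here is a minimal sketch of a two-tailed, one-sample Z test with the population standard deviation assumed known; all numbers are hypothetical:

```python
import math

# Hypothetical example: H0: mu = 100 vs Ha: mu != 100 (two-tailed),
# with the population standard deviation assumed known.
mu_0, sigma = 100.0, 15.0
n, x_bar = 36, 108.0
z_crit = 1.96                    # critical value at alpha/2 = 0.025

# Standardizing formula for the test statistic
z_c = (x_bar - mu_0) / (sigma / math.sqrt(n))
print(round(z_c, 2))             # 3.2

if abs(z_c) > z_crit:
    print("Cannot accept H0: the sample mean is in the tail")
else:
    print("Cannot reject H0: the data are compatible with mu =", mu_0)
```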

Hypothesis Testing: One Sample Z Test of the Mean (Critical Value Approach)

Hypothesis Testing: t Test for the Mean (Critical Value Approach)

Hypothesis Testing: 1 Sample Z Test of the Mean (Confidence Interval Approach)

Hypothesis Testing: 1 Sample Z Test for Mean (P-Value Approach)

Hypothesis Test for Proportions

Just as there were confidence intervals for proportions, or more formally, the population parameter p  of the binomial distribution, there is the ability to test hypotheses concerning p .

p^{\prime}=x / n is the sample proportion, where x is the number of successes and n is the sample size.

Again, we begin with the standardizing formula modified because this is the distribution of a binomial.

Z=\frac{p^{\prime}-p}{\sqrt{\frac{p q}{n}}}

This is the test statistic for testing hypothesized values of p , where the null and alternative hypotheses take one of the following forms:

Table 9
H_0: p=p_0 \quad H_a: p \neq p_0
H_0: p \geq p_0 \quad H_a: p<p_0
H_0: p \leq p_0 \quad H_a: p>p_0
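The test statistic can be sketched in a few lines; the counts are hypothetical and `one_proportion_z` is a made-up helper name:

```python
import math

def one_proportion_z(x, n, p0):
    """Z statistic for H0: p = p0, using Z = (p' - p0) / sqrt(p0*q0/n).

    Requires n*p0 and n*(1 - p0) to each be at least five.
    """
    p_prime = x / n
    return (p_prime - p0) / math.sqrt(p0 * (1 - p0) / n)

# 53 successes in 100 trials against a hypothesized p0 = 0.50
print(round(one_proportion_z(53, 100, 0.50), 2))   # 0.6
```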

Hypothesis Testing: 1 Proportion using the Critical Value Approach

Hypothesis Testing with Two Samples

Studies often compare two groups. For example, researchers are interested in the effect aspirin has in preventing heart attacks. Over the last few years, newspapers and magazines have reported various aspirin studies involving two groups. Typically, one group is given aspirin and the other group is given a placebo. Then, the heart attack rate is studied over several years.

There are other situations that deal with the comparison of two groups. For example, studies compare various diet and exercise programs. Politicians compare the proportion of individuals from different income brackets who might vote for them. Students are interested in whether SAT or GRE preparatory courses really help raise their scores. Many business applications require comparing two groups. It may be the investment returns of two different investment strategies, or the differences in production efficiency of different management styles.

To compare two means or two proportions, you work with two groups. The groups are classified either as independent or matched pairs . Independent groups consist of two samples that are independent, that is, sample values selected from one population are not related in any way to sample values selected from the other population. Matched pairs consist of two samples that are dependent. The parameter tested using matched pairs is the population mean. The parameters tested using independent groups are either population means or population proportions of each group.

Comparing Two Independent Population Means

The comparison of two independent population means is very common and provides a way to test the hypothesis that the two groups differ from each other. Is the night shift less productive than the day shift, are the rates of return from fixed asset investments different from those from common stock investments, and so on? An observed difference between two sample means depends on both the means and the sample standard deviations. Very different means can occur by chance if there is great variation among the individual samples. The test statistic will have to account for this fact. The test comparing two independent population means with unknown and possibly unequal population standard deviations is called the Aspin-Welch t-test. The degrees of freedom formula we will see later was developed by Aspin-Welch.

When we developed the hypothesis test for the mean and proportions, we began with the Central Limit Theorem. We recognized that a sample mean came from a distribution of sample means, and sample proportions came from the sampling distribution of sample proportions. This made our sample parameters, the sample means and sample proportions, into random variables. It was important for us to know the distribution that these random variables came from. The Central Limit Theorem gave us the answer: the normal distribution. Our Z and t statistics came from this theorem. This provided us with the solution to our question of how to measure the probability that a sample mean came from a distribution with a particular hypothesized value of the mean or proportion. In both cases that was the question: what is the probability that the mean (or proportion) from our sample data came from a population distribution with the hypothesized value we are interested in?

Now we are interested in whether or not two samples have the same mean. Our question has not changed: Do these two samples come from the same population distribution? We recognize that we have two sample means, one from each set of data, and thus we have two random variables coming from two unknown distributions. To solve the problem, we create a new random variable, the difference between the sample means. This new random variable also has a distribution and, again, the Central Limit Theorem tells us that this new distribution is normally distributed, regardless of the underlying distributions of the original data. A graph may help to understand this concept.

Two population graphs forming into one sampling distribution.

The Central Limit Theorem, as before, provides us with the standard deviation of the sampling distribution, and further, that the expected value of the mean of the distribution of differences in sample means is equal to the differences in the population means. Mathematically this can be stated:

E\left(\bar{x}_1-\bar{x}_2\right)=\mu_1-\mu_2

The standard error is:

\sqrt{\frac{\left(s_1\right)^2}{n_1}+\frac{\left(s_2\right)^2}{n_2}}

We remember that substituting the sample variance for the population variance when we did not have the population variance was the technique we used when building the confidence interval and the test statistic for the test of hypothesis for a single mean back in Confidence Intervals and Hypothesis Testing with One Sample. The test statistic (t-score) is calculated as follows:

t_c=\frac{\left(\bar{x}_1-\bar{x}_2\right)-\delta_0}{\sqrt{\frac{\left(s_1\right)^2}{n_1}+\frac{\left(s_2\right)^2}{n_2}}}

The number of degrees of freedom (df) requires a somewhat complicated calculation. The df are not always a whole number. The test statistic above is approximated by the Student’s t-distribution with df as follows:

d f=\frac{\left(\frac{\left(s_1\right)^2}{n_1}+\frac{\left(s_2\right)^2}{n_2}\right)^2}{\left(\frac{1}{n_1-1}\right)\left(\frac{\left(s_1\right)^2}{n_1}\right)^2+\left(\frac{1}{n_2-1}\right)\left(\frac{\left(s_2\right)^2}{n_2}\right)^2}
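Both formulas translate directly into code. A sketch using hypothetical summary statistics (`welch_t` and `welch_df` are made-up helper names):

```python
import math

def welch_t(x1, s1, n1, x2, s2, n2, delta0=0.0):
    """t statistic for the difference in two independent sample means."""
    return (x1 - x2 - delta0) / math.sqrt(s1**2 / n1 + s2**2 / n2)

def welch_df(s1, n1, s2, n2):
    """Aspin-Welch degrees of freedom; usually not a whole number."""
    v1, v2 = s1**2 / n1, s2**2 / n2
    return (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))

# hypothetical summary statistics for two samples
print(round(welch_t(10.0, 2.0, 15, 9.0, 3.0, 20), 2))   # 1.18
print(round(welch_df(2.0, 15, 3.0, 20), 1))             # 32.6
```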

The format of the sampling distribution, differences in sample means, specifies that the format of the null and alternative hypothesis is:

H_0: \mu_1-\mu_2=\delta_0 \qquad H_a: \mu_1-\mu_2 \neq \delta_0

where \delta_0 is the hypothesized difference between the population means, often zero.

Hypothesis Testing – Two Population Means

Two Population Means, One Tail Test

Two Population Means, Two Tail Test

Check Your Understanding: Hypothesis Testing (Two Population Means)

Cohen’s Standards for Small, Medium, and Large Effect Sizes

Cohen's d is a measure of “effect size” based on the differences between two means. Cohen’s d, named for United States statistician Jacob Cohen, measures the relative strength of the differences between the means of two populations based on sample data. The calculated value of effect size is then compared to Cohen’s standards of small, medium, and large effect sizes.

Table 10
Size of effect d
Small 0.2
Medium 0.5
Large 0.8

Cohen’s d is the measure of the difference between two means divided by the pooled standard deviation:

d=\frac{\bar{x}_1-\bar{x}_2}{s_{\text {pooled }}} \text { where } s_{\text {pooled }}=\sqrt{\frac{\left(n_1-1\right) s_1^2+\left(n_2-1\right) s_2^2}{n_1+n_2-2}}

It is important to note that Cohen’s d does not provide a level of confidence as to the magnitude of the size of the effect comparable to the other tests of hypothesis we have studied. The sizes of the effects are simply indicative.
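A sketch of the calculation, using hypothetical group summaries:

```python
import math

def cohens_d(x1, s1, n1, x2, s2, n2):
    """Cohen's d using the pooled standard deviation from the formula above."""
    s_pooled = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2)
                         / (n1 + n2 - 2))
    return (x1 - x2) / s_pooled

# hypothetical group summaries
d = cohens_d(x1=4.0, s1=1.5, n1=30, x2=3.0, s2=1.5, n2=30)
print(round(d, 2))   # 0.67 -> between "medium" (0.5) and "large" (0.8)
```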

Effect Size for a Significant Difference of Two Sample Means

Test for Differences in Means: Assuming Equal Population Variances

Typically, we can never expect to know any of the population parameters: mean, proportion, or standard deviation. When testing hypotheses concerning differences in means we are faced with the difficulty of two unknown variances that play a critical role in the test statistic. We have been substituting the sample variances just as we did when testing hypotheses for a single mean, and as before, we used a Student's t to compensate for this lack of information on the population variance. There may be situations, however, when we do not know the population variances but can assume that the two populations have the same variance. If this is true, then the pooled sample variance, which combines the information in both samples, will be a more precise estimate than either individual sample variance. This gives more precise estimates and reduces the probability of discarding a good null. The null and alternative hypotheses remain the same, but the test statistic changes to:

t_c=\frac{\left(\bar{x}_1-\bar{x}_2\right)-\delta_0}{\sqrt{S_p^2\left(\frac{1}{n_1}+\frac{1}{n_2}\right)}} \text { where } S_p^2=\frac{\left(n_1-1\right) s_1^2+\left(n_2-1\right) s_2^2}{n_1+n_2-2}

Example: A drug trial is attempted using a real drug and a pill made of just sugar. 18 people are given the real drug in hopes of increasing the production of endorphins. The increase in endorphins is found to be on average 8 micrograms per person, with a sample standard deviation of 5.4 micrograms. 11 people are given the sugar pill, and their average endorphin increase is 4 micrograms with a standard deviation of 2.4 micrograms. From previous research on endorphins, it is determined that the variances of the two populations can be assumed to be equal. Test at the 5% level of significance to see if the real drug had a significantly greater impact on endorphin production than the sugar pill.

First, we begin by designating one of the two groups Group 1 and the other Group 2. This will be needed to keep track of the null and alternative hypotheses. Let us set Group 1 as those who received the actual new medicine being tested and therefore Group 2 is those who received the sugar pill. We can now set up the null and alternative hypothesis as:

H_0: \mu_1 \leq \mu_2 \qquad H_a: \mu_1>\mu_2

The test statistic is clearly in the tail: 2.31 is larger than the critical value of 1.703 (a Student's t with n_1+n_2-2=27 degrees of freedom at \alpha=0.05), and therefore we cannot accept the null hypothesis. Thus, we conclude that there is significant evidence at the 95% level of confidence that the new medicine produces the desired effect.
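The arithmetic behind the 2.31 can be checked directly from the example's numbers:

```python
import math

# Data from the drug-trial example above
n1, x1, s1 = 18, 8.0, 5.4    # real drug
n2, x2, s2 = 11, 4.0, 2.4    # sugar pill

# Pooled variance (equal population variances assumed)
sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
t_c = (x1 - x2) / math.sqrt(sp2 * (1 / n1 + 1 / n2))
df = n1 + n2 - 2

print(round(t_c, 2), df)   # 2.31 27
```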

Two Population Means with Known Standard Deviations

The standard deviation is:

\sqrt{\frac{\left(\sigma_1\right)^2}{n_1}+\frac{\left(\sigma_2\right)^2}{n_2}}

The test statistic (z-score) is:

Z_c=\frac{\left(\bar{x}_1-\bar{x}_2\right)-\delta_0}{\sqrt{\frac{\left(\sigma_1\right)^2}{n_1}+\frac{\left(\sigma_2\right)^2}{n_2}}}

Check Your Understanding: Two Population Means

Matched or Paired Samples

The random variable for this test is \bar{X}_d, the sample mean of the differences between the paired measurements.

When using a hypothesis test for matched or paired samples, the following characteristics may be present:

  • Simple random sampling is used.
  • Sample sizes are often small.
  • Two measurements (samples) are drawn from the same pair of individuals or objects.
  • Differences are calculated from the matched or paired samples.
  • The differences form the sample that is used for the hypothesis test.
  • Either the matched pairs have differences that come from a population that is normal or the number of differences is sufficiently large so that distribution of the sample mean of differences is approximately normal.

Here \mu_d is the population mean of the differences.

The null and alternative hypotheses for this test are:

H_0: \mu_d=0 \qquad H_a: \mu_d \neq 0

The test statistic is:

t_c=\frac{\bar{x}_d-\mu_d}{\left(\frac{s_d}{\sqrt{n}}\right)}
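A sketch of the matched-pairs calculation, with hypothetical before/after measurements for five subjects:

```python
import math

# Hypothetical before/after measurements for the same five subjects
before = [205, 210, 187, 198, 222]
after = [198, 207, 180, 196, 215]
diffs = [b - a for b, a in zip(before, after)]

n = len(diffs)
x_bar_d = sum(diffs) / n
s_d = math.sqrt(sum((d - x_bar_d) ** 2 for d in diffs) / (n - 1))

# H0: mu_d = 0; test statistic with n - 1 = 4 degrees of freedom
t_c = (x_bar_d - 0) / (s_d / math.sqrt(n))
print(round(t_c, 2))   # 4.67
```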

Two Population Means, One Tail Test, Matched Sample

Two Population Means, One Tail test, Matched Sample (Hypothesized Test Different from Zero)

Matched Sample, Two Tail Test

Check Your Understanding: Matched Sample, Two Tail Test

Comparing Two Independent Population Proportions

When conducting a hypothesis test that compares two independent population proportions, the following characteristics should be present:

  • The two samples are independent, simple random samples.
  • The number of successes is at least five, and the number of failures is at least five, for each of the samples.
  • A growing body of literature states that the population should be at least ten, or perhaps even 20, times the size of the sample. This keeps each population from being over-sampled and causing biased results.

Comparing two proportions, like comparing two means, is common. If two estimated proportions are different, it may be due to a difference in the populations or it may be due to chance in the sampling. A hypothesis test can help determine if a difference in the estimated proportions reflects a difference in the two population proportions.

The random variable for this test is the difference between the two sample proportions, \left(p_A^{\prime}-p_B^{\prime}\right).

Most common, however, is the test that the two proportions are the same. That is,

H_0: p_A=p_B \qquad H_a: p_A \neq p_B

The pooled proportion is calculated as follows:

p_c=\frac{x_A+x_B}{n_A+n_B}
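The pooled proportion feeds the standard two-proportion Z statistic, Z=\left(p_A^{\prime}-p_B^{\prime}\right) / \sqrt{p_c\left(1-p_c\right)\left(\frac{1}{n_A}+\frac{1}{n_B}\right)}. A sketch with hypothetical counts (`two_proportion_z` is a made-up helper name):

```python
import math

def two_proportion_z(xA, nA, xB, nB):
    """Z statistic for H0: pA = pB, using the pooled proportion p_c."""
    pA, pB = xA / nA, xB / nB
    pc = (xA + xB) / (nA + nB)                      # pooled proportion
    se = math.sqrt(pc * (1 - pc) * (1 / nA + 1 / nB))
    return (pA - pB) / se

# hypothetical counts: 60/200 successes in group A vs 40/200 in group B
print(round(two_proportion_z(60, 200, 40, 200), 2))   # 2.31
```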

Two Population Proportions, One Tail Test

Two Population Proportion, Two Tail Test

Check Your Understanding: Comparing Two Independent Population Proportions

The Chi-Square Distribution

Have you ever wondered if lottery winning numbers were evenly distributed or if some numbers occurred with a greater frequency? How about if the types of movies people preferred were different across different age groups? What about if a coffee machine was dispensing approximately the same amount of coffee each time? You could answer these questions by conducting a hypothesis test.

You will now study a new distribution, one that is used to determine the answers to such questions. This distribution is called the chi-square distribution.

In this section, you will learn the three major applications of the chi-square distribution:

  • The test of a single variance, which tests variability, such as in the coffee example
  • The goodness-of-fit test, which determines if data fit a particular distribution, such as in the lottery example
  • The test of independence, which determines if events are independent, such as in the movie example

Facts About the Chi-Square Distribution

The notation for the chi-square distribution is:

\chi^2 \sim \chi_{d f}^2

The random variable for a chi-square distribution with k  degrees of freedom is the sum of k  independent, squared standard normal variables.

\chi^2=\left(Z_1\right)^2+\left(Z_2\right)^2+\ldots+\left(Z_k\right)^2

  • The curve is nonsymmetrical and skewed to the right.
  • There is a different chi-square curve for each df.

The difference of distributions according to sample size

  • The test statistic for any test is always greater than or equal to zero.

  • When the degrees of freedom are large (df > 90), the chi-square curve approximates the normal distribution.

Test of a Single Variance

A test of a single variance assumes that the underlying distribution is normal. The null and alternative hypotheses are stated in terms of the population variance . The test statistic is:

\chi_c^2=\frac{(n-1) s^2}{\sigma_0^2}

Example: Math instructors are not only interested in how their students do on exams, on average, but how the exam scores vary. To many instructors, the variance (or standard deviation) may be more important than the average.

Suppose a math instructor believes that the standard deviation for his final exam is five points. One of his best students thinks otherwise. The student claims that the standard deviation is more than five points. If the student were to conduct a hypothesis test, what would the null and alternative hypotheses be?

Solution: Even though we are given the population standard deviation, we can set up the test using the population variance as follows:

H_0: \sigma^2 \leq 5^2 \qquad H_a: \sigma^2>5^2
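Continuing the example with hypothetical sample results (say the student collects n = 30 exam scores with a sample standard deviation of s = 7.2):

```python
# Test statistic for the exam-score example: H0: sigma^2 <= 25.
# The sample size and sample standard deviation below are hypothetical.
n, s, sigma0_sq = 30, 7.2, 5**2

chi_sq = (n - 1) * s**2 / sigma0_sq   # (n-1)s^2 / sigma_0^2
df = n - 1
print(round(chi_sq, 2), df)           # 60.13 29
```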

Single Population Variances, One-Tail Test

Check Your Understanding: Test of a Single Variance

Goodness-Of-Fit Test

In this type of hypothesis test, you determine whether the data “fit” a particular distribution or not. For example, you may suspect your unknown data fit a binomial distribution. You use a chi-square test (meaning the distribution for the hypothesis test is chi-square) to determine if there is a fit or not. The null and the alternative hypotheses for this test may be written in sentences or may be stated as equations or inequalities.

The test statistic for a goodness-of -fit test is:

\sum_k \frac{(O-E)^2}{E}

  • O = observed values (data)
  • E = expected values (from theory)
  • k = the number of different data cells or categories

There are k terms of the form \frac{(O-E)^2}{E} in the sum, one for each cell.

The goodness-of-fit test is almost always right-tailed. If the observed values and the corresponding expected values are not close to each other, then the test statistic can get very large and will be way out in the right tail of the chi-square curve.

Note: The number of expected values inside each cell needs to be at least five in order to use this test.
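A sketch of the statistic for a hypothetical fair-die check (60 rolls, so a uniform distribution expects 10 rolls per face):

```python
# Goodness-of-fit: is a six-sided die fair? Hypothetical counts from 60 rolls.
observed = [8, 12, 11, 9, 6, 14]
expected = [10] * 6                 # uniform: 60 rolls / 6 faces

chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1              # k - 1 degrees of freedom
print(round(chi_sq, 2), df)         # 4.2 5
```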

Chi-Square Statistic for Hypothesis Testing

Chi-Square Goodness-of-Fit Example

Check Your Understanding: Goodness-of-Fit Test

Test of Independence  

Tests of independence involve using a contingency table of observed (data) values.

The test statistic of a test of independence is similar to that of a goodness-of-fit test:

\sum_{i \bullet j} \frac{(O-E)^2}{E}

  • O = observed values
  • E = expected values
  • i = the number of rows in the table
  • j = the number of columns in the table

There are i \cdot j terms of the form \frac{(O-E)^2}{E} in the sum, one for each cell of the table.

A test of independence determines whether two factors are independent or not.

Note: The expected value inside each cell needs to be at least five in order for you to use this test.

The test of independence is always right tailed because of the calculation of the test statistic. If the expected and observed values are not close together, then the test statistic is very large and way out in the right tail of the chi-square curve, as it is in a goodness-of-fit.

The number of degrees of freedom for the test of independence is:

d f=(\text { number of columns }-1)(\text { number of rows }-1)

The following formula calculates the expected number (E):

E=\frac{(\text { row total })(\text { column total })}{\text { total number surveyed }}
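Both formulas can be sketched for a hypothetical 2 × 3 contingency table:

```python
# Expected counts and chi-square statistic for a hypothetical 2x3 table.
observed = [[20, 30, 50],
            [30, 20, 50]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand = sum(row_totals)

# E = (row total)(column total) / total number surveyed
expected = [[r * c / grand for c in col_totals] for r in row_totals]

chi_sq = sum((observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
             for i in range(len(observed))
             for j in range(len(observed[0])))
df = (len(col_totals) - 1) * (len(row_totals) - 1)
print(round(chi_sq, 2), df)   # 4.0 2
```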

Simple Explanation of Chi-Squared

Chi-Square Test for Association (independence)

Check Your Understanding: Test of Independence

Test for Homogeneity

The goodness-of-fit test can be used to decide whether a population fits a given distribution, but it will not suffice to decide whether two populations follow the same unknown distribution. A different test, called the test of homogeneity , can be used to draw a conclusion about whether two populations have the same distribution. To calculate the test statistic for a test for homogeneity, follow the same procedure as with the test of independence.

H_0: The distributions of the two populations are the same.

H_a: The distributions of the two populations are not the same.

Test Statistic

Use a \chi^2 test statistic. It is computed in the same way as the test of independence.

Degrees of Freedom (df)

d f=\text { number of columns }-1

Requirements

All values in the table must be greater than or equal to five.

Common uses

Comparing two populations. For example: men vs. women, before vs. after, east vs. west. The variable is categorical with more than two possible response values.

Introduction to the Chi-Square Test for Homogeneity

Check Your Understanding: Test for Homogeneity

Comparison of the Chi-Square Tests

Goodness-of-fit: Use the goodness-of-fit test to decide whether a population with an unknown distribution “fits” a known distribution. In this case there will be a single qualitative survey question or a single outcome of an experiment from a single population. Goodness-of-Fit is typically used to see if the population is uniform (all outcomes occur with equal frequency), the population is normal, or the population is the same as another population with a known distribution. The null and alternative hypotheses are:

H_0: \text{The population fits the given distribution.}

H_a: \text{The population does not fit the given distribution.}

Independence: Use the test for independence to decide whether two variables (factors) are independent or dependent. In this case there will be two qualitative survey questions or experiments and a contingency table will be constructed. The goal is to see if the two variables are unrelated (independent) or related (dependent). The null and alternative hypotheses are:

H_0: \text{The two variables (factors) are independent.}

H_a: \text{The two variables (factors) are dependent.}

Homogeneity: Use the test for homogeneity to decide if two populations with unknown distributions have the same distribution as each other. In this case there will be a single qualitative survey question or experiment given to two different populations. The null and alternative hypotheses are:

H_0: \text{The two populations follow the same distribution.}

H_a: \text{The two populations follow different distributions.}

F Distribution and One-Way ANOVA

Many statistical applications in psychology, social science, business administration, and the natural sciences involve several groups. For example, an environmentalist is interested in knowing if the average amount of pollution varies in several bodies of water. A sociologist is interested in knowing if the amount of income a person earns varies according to his or her upbringing. A consumer looking for a new car might compare the average gas mileage of several models.

For hypothesis tests comparing averages among more than two groups, statisticians have developed a method called “Analysis of Variance” (abbreviated ANOVA). In this chapter, you will study the simplest form of ANOVA called single factor or one-way ANOVA . You will also study the F  distribution, used for one-way ANOVA, and the test for differences between two variances. This is just a very brief overview of one-way ANOVA. One-Way ANOVA, as it is presented here, relies heavily on a calculator or computer.

Test of Two Variances

This chapter introduces a new probability density function, the F distribution. This distribution is used for many applications including ANOVA and for testing equality across multiple means. We begin with the F distribution and the test of hypothesis of differences in variances. It is often desirable to compare two variances rather than two averages. For instance, college administrators would like two college professors grading exams to have the same variation in their grading. In order for a lid to fit a container, the variation in the lid and the container should be approximately the same. A supermarket might be interested in the variability of check-out times for two checkers. In finance, the variance is a measure of risk and thus an interesting question would be to test the hypothesis that two different investment portfolios have the same variance, the volatility.

In order to perform an F test of two variances, it is important that the following are true:

  • The populations from which the two samples are drawn are approximately normally distributed.
  • The two populations are independent of each other.

Unlike most other hypothesis tests in this Module, the F test for equality of two variances is very sensitive to deviations from normality. If the two distributions are not normal, or close, the test can give a biased result for the test statistic.

Suppose we sample randomly from two normal populations with unknown population variances \sigma_1^2 \text { and } \sigma_2^2.

The various forms of the hypotheses tested are:

Table 11

| Two-tailed | Right-tailed | Left-tailed |
| H_0: \sigma_1^2=\sigma_2^2 | H_0: \sigma_1^2 \leq \sigma_2^2 | H_0: \sigma_1^2 \geq \sigma_2^2 |
| H_a: \sigma_1^2 \neq \sigma_2^2 | H_a: \sigma_1^2>\sigma_2^2 | H_a: \sigma_1^2<\sigma_2^2 |

A more general form of the null and alternative hypotheses for a two-tailed test would be:

H_0: \frac{\sigma_1^2}{\sigma_2^2}=\delta_0

H_a: \frac{\sigma_1^2}{\sigma_2^2} \neq \delta_0

Typically \delta_0=1; in that case the test statistic is the ratio of the sample variances, F=\frac{s_1^2}{s_2^2}.

Therefore, if F is close to one, the evidence favors the null hypothesis (the two population variances are equal). But if F is much larger than one, then the evidence is against the null hypothesis. In essence, we are asking if the calculated F statistic, test statistic, is significantly different from one.

The critical value for the upper tail of the distribution, at significance level \alpha, is:

F_{\alpha, d f 1, d f 2}

To find the critical value for the lower end of the distribution, reverse the degrees of freedom and divide the F-value from the table into one.

1 / F_{\alpha, d f 2, d f 1}

When the calculated value of F is between the critical values, not in the tail, we cannot reject the null hypothesis that the two variances came from a population with the same variance. If the calculated F-value is in either tail we cannot accept the null hypothesis just as we have been doing for all of the previous tests of hypothesis.

An alternative way of finding the critical values of the F distribution makes the use of the F-table easier. We note in the F-table that all the values of F are greater than one therefore the critical F value for the left-hand tail will always be less than one because to find the critical value on the left tail we divide an F value into the number one as shown above. We also note that if the sample variance in the numerator of the test statistic is larger than the sample variance in the denominator, the resulting F value will be greater than one. The shorthand method for this test is thus to be sure that the larger of the two sample variances is placed in the numerator to calculate the test statistic. This will mean that only the right-hand tail critical value will have to be found in the F-table.
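The shorthand method can be sketched in Python; the grading data below are hypothetical, and scipy is assumed:

```python
import numpy as np
from scipy.stats import f

# Hypothetical exam scores from two graders (both assumed roughly normal)
grader1 = np.array([3.0, 5.0, 7.0, 9.0, 11.0])
grader2 = np.array([6.0, 7.0, 8.0, 9.0, 10.0])

s1 = np.var(grader1, ddof=1)  # sample variance of grader 1
s2 = np.var(grader2, ddof=1)  # sample variance of grader 2

# Shorthand method: place the larger sample variance in the numerator,
# so only the right-hand tail critical value is needed
if s1 >= s2:
    F, dfn, dfd = s1 / s2, len(grader1) - 1, len(grader2) - 1
else:
    F, dfn, dfd = s2 / s1, len(grader2) - 1, len(grader1) - 1

# Two-tailed p-value: twice the right-tail area beyond the calculated F
p = 2 * f.sf(F, dfn, dfd)
print(F, dfn, dfd, p)
```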

Hypothesis Test Two Population Variances

Check Your Understanding: F Distribution

One-Way ANOVA

The purpose of a one-way ANOVA test is to determine the existence of a statistically significant difference among several group means. The test actually uses variances to help determine if the means are equal or not. In order to perform a one-way ANOVA test, there are five basic assumptions to be fulfilled:

  • Each population from which a sample is taken is assumed to be normal.
  • All samples are randomly selected and independent.
  • The populations are assumed to have equal standard deviations (or variances).
  • The factor is a categorical variable.
  • The response is a numerical variable.

The Null and Alternative Hypotheses

The null hypothesis is simply that all the group population means are the same. The alternative hypothesis is that at least one pair of means is different. For example, if there are k  groups:

H_0: \mu_1=\mu_2=\mu_3=\ldots=\mu_k

H_a: \text{At least two of the group means } \mu_1, \mu_2, \ldots, \mu_k \text{ are not equal.}

If the null hypothesis is false, then the variance of the combined data is larger which is caused by the different means as shown in the second graph (green box plots).

Box and whisker plot

The F Distribution and the F-Ratio

The distribution used for the hypothesis test is a new one. It is called the F distribution, invented by George Snedecor but named in honor of Sir Ronald Fisher, an English statistician. The F statistic is a ratio (a fraction). There are two sets of degrees of freedom: one for the numerator and one for the denominator.

For example, if the numerator has 4 degrees of freedom and the denominator has 10, we write F \sim F_{4,10}.

To calculate the F ratio, two estimates of the common population variance \sigma^2 are made: one based on the variation among the sample means (the variance between samples) and one based on the variation within each sample (the variance within samples).

To find a “sum of squares” means to add together squared quantities that, in some cases, may be weighted.

The mean squares, M S_{\text {between }} and M S_{\text {within }}, are found by dividing each sum of squares by its corresponding degrees of freedom.

Calculation of Sum of Squares and Mean Square

  • k = the number of different groups
  • n_j = the size of the jth group
  • s_j = the sum of the values in the jth group
  • n = the total number of all the values combined

Note: The null hypothesis says that all the group population means are equal. The hypothesis of equal means implies that the populations have the same normal distribution, because it is assumed that the populations are normal and that they have equal variances.

F-Ratio or F Statistic

F=\frac{M S_{\text {between }}}{M S_{\text {within }}}

The foregoing calculations were done with groups of different sizes. If the groups are the same size, the calculations simplify somewhat, and the F-ratio can be written as:

F-Ratio Formula when the groups are the same size

F=\frac{n \cdot s_{\bar{x}}^2}{s_{\text {pooled }}^2}

  • n = the size of each group
  • s_{\bar{x}}^2 = the variance of the sample means
  • s_{\text{pooled}}^2 = the mean of the sample variances (pooled variance)

d f_{\text {numerator }}=k-1

d f_{\text {denominator }}=n_{\text {total }}-k

Data are typically put into a table for easy viewing. One-Way ANOVA results are often displayed in this manner by computer software.

Table 12

| Source | Sum of Squares (SS) | Degrees of Freedom (df) | Mean Square (MS) | F |
| Factor (Between) | SS(Factor) | k - 1 | MS(Factor) = SS(Factor)/(k - 1) | F = MS(Factor)/MS(Error) |
| Error (Within) | SS(Error) | n - k | MS(Error) = SS(Error)/(n - k) | |
| Total | SS(Total) | n - 1 | | |

Example: Three different diet plans are to be tested for mean weight loss. The entries in the table are the weight losses for the different plans. The one-way ANOVA results are shown in Table 13.

Table 13

| Plan 1 | Plan 2 | Plan 3 |
| 5 | 3.5 | 8 |
| 4.5 | 7 | 4 |
| 4 | | 3.5 |
| 3 | 4.5 | |
s_1=16.5, s_2=15, s_3=15.5, where s_j is the sum of the values in the jth group (the column totals).

Following are the calculations needed to fill in the one-way ANOVA table. The table is used to conduct a hypothesis test.

S S_{\text {between }}=\sum\left[\frac{\left(s_j\right)^2}{n_j}\right]-\frac{\left(\sum s_j\right)^2}{n}

Table 14

| Source | SS | df | MS | F |
| Factor (Between) | SS(Factor) = SS(Between) = 2.2458 | k - 1 = 3 groups - 1 = 2 | MS(Factor) = SS(Factor)/(k - 1) = 2.2458/2 = 1.1229 | F = MS(Factor)/MS(Error) = 1.1229/2.9792 = 0.3769 |
| Error (Within) | SS(Error) = SS(Within) = 20.8542 | n - k = 10 total data - 3 groups = 7 | MS(Error) = SS(Error)/(n - k) = 20.8542/7 = 2.9792 | |
| Total | SS(Total) = 2.2458 + 20.8542 = 23.1 | n - 1 = 10 total data - 1 = 9 | | |

The distribution for the test is F \sim F_{d f(\text { num }), d f(\text { denom })}; for this example, F \sim F_{2,7} and the test statistic is F = 0.3769.
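The arithmetic in this example can be verified with scipy (assumed available), whose `f_oneway` function computes the same F = MS_between / MS_within:

```python
from scipy.stats import f_oneway

# Weight-loss data for the three diet plans
plan1 = [5, 4.5, 4, 3]
plan2 = [3.5, 7, 4.5]
plan3 = [8, 4, 3.5]

# One-way ANOVA with df = (k - 1, n - k) = (2, 7)
f_stat, p_value = f_oneway(plan1, plan2, plan3)
print(f_stat, p_value)  # large p-value, so do not reject H0
```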

Calculating SST (total sum of squares)

Calculating SSW and SSB (sum of squares within and between)

Hypothesis Test with F-Statistic

Check Your Understanding: The F Distribution and the F-Ratio

Facts About the F Distribution

  • The curve is not symmetrical but skewed to the right.
  • There is a different curve for each set of degrees of freedom.
  • The F statistic is greater than or equal to zero.
  • As the degrees of freedom for the numerator and for the denominator get larger, the curve approximates the normal as can be seen in Figure 18. Remember that the F cannot ever be less than zero, so the distribution does not have a tail that goes to infinity on the left as the normal distribution does.
  • Other uses for the F distribution include comparing two variances and two-way analysis of variance (ANOVA). Two-way ANOVA is beyond the scope of this section.

F Distribution graph with various sample sizes

Compute and Interpret Simple Linear Regression Between Two Variables

Linear Regression and Correlation

Professionals often want to know how two or more numeric variables are related. For example, is there a relationship between the grade on the second math exam a student takes and the grade on the final exam? If there is a relationship, what is the relationship and how strong is it?

This example may or may not be tied to a model, meaning that some theory suggested that a relationship exists. This link between a cause and an effect is the foundation of the scientific method and is the core of how we determine what we believe about how the world works. Beginning with a theory and developing a model of the theoretical relationship should result in a prediction, what we have called a hypothesis earlier. Now the hypothesis concerns a full set of relationships.

In its simplest form, such a model states that one variable is a function of another: y=f(x).

In this section we will begin with correlation, the investigation of relationships among variables that may or may not be founded on a cause-and-effect model. The variables simply move in the same, or opposite, direction. That is to say, they do not move randomly. Correlation provides a measure of the degree to which this is true. From there we develop a tool to measure cause and effect relationships, regression analysis. We will be able to formulate models and tests to determine if they are statistically sound. If they are found to be so, then we can use them to make predictions: if, as a matter of policy, we changed the value of this variable, what would happen to this other variable? If we imposed a gasoline tax of 50 cents per gallon, how would that affect carbon emissions, sales of Hummers/Hybrids, use of mass transit, etc.? The ability to provide answers to these types of questions is the value of regression as both a tool to help us understand our world and to make thoughtful policy decisions.

The Correlation Coefficient r

As we begin this section, we note that the type of data we will be working with has changed. Perhaps unnoticed, all the data we have been using is for a single variable. It may be from two samples, but it is still univariate data. The type of data described for any model of cause and effect is bivariate data — “bi” for two variables. In reality, statisticians use multivariate data, meaning many variables.

Data can be classified into three broad categories: time series data, cross-section data, and panel data. Time series data measures a single unit of observation, say a person, a company, or a country, as time passes. What is measured will be at least two characteristics, say the person’s income, the quantity of a particular good they buy, and the price they paid. This would be three pieces of information in one time period, say 1985. If we followed that person across time we would have those same pieces of information for 1985, 1986, 1987, etc. This would constitute a time series data set.

A second type of data set is for cross-section data. Here the variation is not across time for a single unit of observation, but across units of observation during one point in time. For a particular period of time, we would gather the price paid, amount purchased, and income of many individual people.

A third type of data set is panel data. Here a panel of units of observation is followed across time. If we take our example from above, we might follow 500 people, the unit of observation, through time, ten years, and observe their income, price paid and quantity of the good purchased. If we had 500 people and data for ten years for price, income and quantity purchased we would have 15,000 pieces of information. These types of data sets are very expensive to construct and maintain. They do, however, provide a tremendous amount of information that can be used to answer very important questions. As an example, what is the effect on the labor force participation rate of women as their family of origin, mother and father, age? Or are there differential effects on health outcomes depending upon the age at which a person started smoking? Only panel data can give answers to these and related questions because we must follow multiple people across time.

Beginning with a set of data with two independent variables we ask the question: are these related? One way to visually answer this question is to create a scatter plot of the data. We could not do that before when we were doing descriptive statistics because those data were univariate. Now we have bivariate data so we can plot in two dimensions. Three dimensions are possible on a flat piece of paper but become very hard to fully conceptualize. Of course, more than three dimensions cannot be graphed although the relationships can be measured mathematically.

To provide mathematical precision to the measurement of what we see we use the correlation coefficient. The correlation tells us something about the co-movement of two variables, but nothing about why this movement occurred. Formally, correlation analysis assumes that both variables being analyzed are independent variables. This means that neither one causes the movement in the other. Further, it means that neither variable is dependent on the other, or for that matter, on any other variable. Even with these limitations, correlation analysis can yield some interesting results.

The population correlation coefficient is denoted by the Greek letter \rho; the sample correlation coefficient, r, is its estimate.

In practice all correlation and regression analysis will be provided through computer software designed for these purposes. Anything more than perhaps one-half a dozen observations creates immense computational problems. It was because of this fact that correlation, and even more so, regression, were not widely used research tools until after the advent of “computing machines.” Now the computing power required to analyze data using regression packages is deemed almost trivial by comparison to just a decade ago.

If the data points lie exactly on a straight line, the correlation coefficient equals 1 \text { or }-1.

Remember, all the correlation coefficient tells us is whether or not the data are linearly related. In panel (d) the variables obviously have some type of very specific relationship to each other, but the correlation coefficient is zero, indicating no linear relationship exists.

If r is calculated for two variables X_1 \text { and } X_2 \text {, then } r measures the strength of the linear relationship between them.

What the VALUE of r tells us:

The value of r is always between -1 \text { and }+1:-1 \leq r \leq 1.

What the SIGN of r tells us

A positive r means that as X_1 \text { increases, } X_2 tends to increase; a negative r means that as X_1 increases, X_2 tends to decrease.
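These properties of r can be illustrated with a small hypothetical data set in Python (numpy assumed):

```python
import numpy as np

# Hypothetical bivariate data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal
# entry is the sample correlation coefficient r
r = np.corrcoef(x, y)[0, 1]
print(r)  # positive, so y tends to increase as x increases
```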

Bivariate Relationship Linearity, Strength, and Direction

Check Your Understanding: The Correlation Coefficient r

Calculating Correlation Coefficient r

Example: Correlation Coefficient Intuition

Linear Equations

Linear regression for two variables is based on a linear equation with one independent variable. The equation has the form:

y=a+b x

Where a  and b  are constant numbers.

The variable x is the independent variable, and y is the dependent variable . Another way to think about this equation is a statement of cause and effect. The X variable is the cause, and the Y variable is the hypothesized effect. Typically, you choose a value to substitute for the independent variable and then solve for the dependent variable.

Slope and Y-Intercept of a Linear Equation

The number a is the y-intercept: the graph of the line crosses the y-axis at the point (0, a). The number b is the slope: it gives the change in y for each one-unit increase in x.

The Regression Equation

Regression analysis is a statistical technique that can test the hypothesis that a variable is dependent upon one or more other variables. Further, regression analysis can provide an estimate of the magnitude of the impact of a change in one variable on another. This last feature, of course, is all important in predicting future values.

Regression analysis is based upon a functional relationship among variables and further, assumes that the relationship is linear. This linearity assumption is required because, for the most part, the theoretical statistical properties of non-linear estimation are not well worked out yet by the mathematicians and econometricians. This presents us with some difficulties in economic analysis because many of our theoretical models are nonlinear. The marginal cost curve, for example, is decidedly nonlinear as is the total cost function, if we are to believe in the effect of specialization of labor and the Law of Diminishing Marginal Product. There are techniques for overcoming some of these difficulties, exponential and logarithmic transformation of the data for example, but at the outset we must recognize that standard ordinary least squares (OLS) regression analysis will always use a linear function to estimate what might be a nonlinear relationship.

The general linear regression model can be stated by the equation:

y_i=\beta_0+\beta_1 X_{1 i}+\beta_2 X_{2 i}+\ldots+\beta_k X_{k i}+\varepsilon_i

As with our earlier work with probability distributions, this model works only if certain assumptions hold. These are that the Y is normally distributed, the errors are also normally distributed with a mean of zero and a constant standard deviation, and that the error terms are independent of the size of X and independent of each other.

Assumptions of the Ordinary Least Squares Regression Model

Each of these assumptions needs a bit more explanation. If one of these assumptions fails to be true, then it will have an effect on the quality of the estimates. Some of the failures of these assumptions can be fixed while others result in estimates that quite simply provide no insight into the questions the model is trying to answer or worse, give biased estimates.

  • The independent variable, x_i, is measured without error; it is treated as a fixed number, not a random variable.

  • The error term is a random variable with a mean of zero and a constant variance. The meaning of this is that the variance of the error term is independent of the value of the independent variable. Consider the relationship between personal income and the quantity of a good purchased as an example of a case where the variance is dependent upon the value of the independent variable, income. It is plausible that as income increases the variation around the amount purchased will also increase simply because of the flexibility provided with higher levels of income. The assumption of constant variance with respect to the magnitude of the independent variable is called homoscedasticity. If the assumption fails, then it is called heteroscedasticity. Figure 21 shows the case of homoscedasticity where all three distributions have the same variance around the predicted value of Y regardless of the magnitude of X.
  • Error terms should be normally distributed. This can be seen in Figure 21 by the shape of the distributions placed on the predicted line at the expected value of the relevant value of Y.
  • The independent variables are independent of Y but are also assumed to be independent of the other X variables. The model is designed to estimate the effects of independent variables on some dependent variable in accordance with a proposed theory. The case where some or more of the independent variables are correlated is not unusual. There may be no cause and effect relationship among the independent variables, but nevertheless they move together. Take the case of a simple supply curve where quantity supplied is theoretically related to the price of the product and the prices of inputs. There may be multiple inputs that may over time move together from general inflationary pressure. The input prices will therefore violate this assumption of regression analysis. This condition is called multicollinearity.
  • The error terms are uncorrelated with each other. This situation arises from an effect on one error term from another error term. While not exclusively a time series problem, it is here that we most often see this case. An X variable in time period one has an effect on the Y variable, but this effect then has an effect in the next time period. This effect gives rise to a relationship among the error terms. This case is called autocorrelation, “self-correlated.” The error terms are now not independent of each other, but rather have their own effect on subsequent error terms.

The estimated (fitted) regression equation is written:

\widehat{y}=a+b x

The equation y_i=\beta_0+\beta_1 X_{1 i}+\ldots+\beta_k X_{k i}+\varepsilon_i is the general form most often called the multiple regression model. So-called “simple” regression analysis has only one independent (right-hand) variable rather than many independent variables. Simple regression is just a special case of multiple regression. There is some value in beginning with simple regression: it is easy to graph in two dimensions, difficult to graph in three dimensions, and impossible to graph in more than three dimensions. Consequently, our graphs will be for the simple regression case. Figure 22 presents the regression problem in the form of a scatter plot graph of the data set where it is hypothesized that Y is dependent upon the single independent variable X.

The regression problem comes down to determining which straight line would best represent the data in Figure 23. Regression analysis is sometimes called “least squares” analysis because the method of determining which line best “fits” the data is to minimize the sum of the squared residuals of a line put through the data.

Least squares regression line

Consider the graph in Figure 24. The notation has returned to that for the more general model rather than the specific case of the Macroeconomic consumption function in our example.

Least squares regression line

If the observed data point lies above the line, the residual is positive, and the line underestimates the actual data value for y.

If the observed data point lies below the line, the residual is negative, and the line overestimates that actual data value for y.

The residual (error) for the observed point (x_0, y_0) is:

y_0-\hat{y}_0=e_0

The sum of the squared errors is called the Sum of Squared Errors (SSE).

The estimated coefficients b_0 \text { and } b_1 are chosen to minimize the SSE.

The slope b can also be written as:

b_1=r_{y, x}\left(\frac{s_y}{s_x}\right)

The variance of the errors is fundamental in testing hypotheses for a regression. It tells us just how “tight” the dispersion is about the line. As we will see shortly, the greater the dispersion about the line, meaning the larger the variance of the errors, the less probable that the hypothesized independent variable will be found to have a significant effect on the dependent variable. In short, the theory being tested will more likely fail if the variance of the error term is high. Upon reflection this should not be a surprise. As we tested hypotheses about a mean we observed that large variances reduced the calculated test statistic and thus it failed to reach the tail of the distribution. In those cases, the null hypotheses could not be rejected. If we cannot reject the null hypothesis in a regression problem, we must conclude that the hypothesized independent variable has no effect on the dependent variable.

A way to visualize this concept is to draw two scatter plots of x and y data along a predetermined line. The first will have little variance of the errors, meaning that all the data points will move close to the line. Now do the same except the data points will have a large estimate of the error variance, meaning that the data points are scattered widely along the line. Clearly the confidence about a relationship between x and y is affected by this difference between the estimate of the error variance.
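A minimal sketch in Python (numpy assumed, hypothetical data) of the least-squares fit, checking the slope identity b_1 = r(s_y / s_x):

```python
import numpy as np

# Hypothetical bivariate data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# np.polyfit minimizes the sum of squared residuals (SSE);
# degree 1 gives the least-squares line slope and intercept
b1, b0 = np.polyfit(x, y, 1)

# The slope can equivalently be computed as r * (s_y / s_x)
r = np.corrcoef(x, y)[0, 1]
b1_alt = r * (np.std(y, ddof=1) / np.std(x, ddof=1))

print(b0, b1)  # fitted line: y-hat = b0 + b1 * x
```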

Introduction to Residuals and Least-Squares Regression

Calculating Residual Example

Check Your Understanding: Linear Equations

Residual Plots

Check Your Understanding: Residual Plots

Calculating the Equation of a Regression Line

Check Your Understanding: Calculating the Equation of a Regression Line

Interpreting Slope of Regression Line

Interpreting y-intercept in Regression Model

Check Your Understanding: Interpreting Slope of Regression Line and Interpreting y-intercept in Regression Model

Using Least Squares Regression Output

Check Your Understanding: Using Least Squares Regression Output

How Good is the Equation?

The multiple correlation coefficient, also called the coefficient of multiple determination or the coefficient of determination , is given by the formula:

R^2=\frac{S S R}{S S T}
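Continuing with a hypothetical data set (numpy assumed), R^2 can be computed directly from the sums of squares:

```python
import numpy as np

# Hypothetical bivariate data and its least-squares line
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])
b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x

sst = np.sum((y - y.mean()) ** 2)  # total sum of squares (SST)
sse = np.sum((y - y_hat) ** 2)     # sum of squared errors (SSE)
ssr = sst - sse                    # regression sum of squares (SSR)
r_squared = ssr / sst              # R^2 = SSR / SST

print(r_squared)  # equals r**2 in simple linear regression
```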

R-Squared or Coefficient of Determination

Data Analysis Tools (Spreadsheets and Basic Programming)

Descriptive Statistics Using Microsoft Excel's "Descriptive Statistics" Tool

How to Run Descriptive Statistics in R

Descriptive Statistics in R Part 2

Regression Analysis

How to Use Microsoft Excel for Regression Analysis

Please read this text on how to use Microsoft Excel for regression analysis. 

Simple Linear Regression in Excel

Simple Linear Regression, fit and interpretations in R

Relevance to Transportation Engineering Coursework

This section explains the relevance of the regression models for trip generation, mode choice, traffic flow-speed-density relationship, traffic safety, and appropriate sample size for spot speed study to transportation engineering coursework.

Regression Models for Trip Generation

The trip generation step is the first of the four-step process for estimating travel demand for infrastructure planning. It involves estimating the number of trips made to and from each traffic analysis zone (TAZ). Trip generation models are estimated based on land use and trip-making data. They use either linear regression or cross-tabulation of household characteristics. Simple linear regression is described in the section above titled “ Compute and Interpret Simple Linear Regression Between Two Variables” , and the tools to conduct the linear regression are discussed in “ Data Analysis Tools (Spreadsheets and Basic Programming)”.

Mode Choice

Estimation of Mode Choice is also part of the four-step process for estimating travel demand. It entails estimating the trip makers’ transportation mode (drive alone, walk, take public transit, etc.) choices. The results of this step are the counts of trips categorized by mode. The most popular mode choice model is the discrete choice, multinomial logit model. Hypothesis tests are conducted for the estimated model parameters to assess whether they are “statistically significant.” The section titled “ Use Specific Significance Tests Including, Z-Test, T-Test (one and two samples), Chi-Squared Test”  of this chapter provides extensive information on hypothesis testing.

Traffic Flow-Speed-Density Relationship

Greenshields' model is used to represent the traffic flow-speed-density relationship. Traffic speed and traffic density (number of vehicles per unit mile) are collected to estimate a linear regression model for speed as a function of density. “ Compute and Interpret Simple Linear Regression Between Two Variables” above provides information on simple linear regression. “ Data Analysis Tools (Spreadsheets and Basic Programming)” provides guidance for implementing the linear regression technique using tools available in Microsoft Excel and the programming language R.

Traffic Safety


Given big enough sample size, a test will always show significant result unless the true effect size is exactly zero. Why?

I am curious about a claim made in Wikipedia's article on effect size . Specifically:

[...] a non-null statistical comparison will always show a statistically significant results unless the population effect size is exactly zero

I am not sure what this means/implies, let alone an argument to back it up. I guess, after all, an effect is a statistic, i.e., a value calculated from a sample , with its own distribution. Does this mean that effects are never due to just random variation (which is what I understand it means to not be significant)? Do we then just consider whether the effect is strong enough -- having high absolute value?

I am considering the effect I am most familiar with: the Pearson correlation coefficient r seems to contradict this. Why would any $r$ be statistically significant? If $r$ is small, then the slope of our regression line $$a = r\left(\frac {s_y}{s_x}\right) = \epsilon$$ is small, and the line is $$y = ax + b = \epsilon x + b.$$

For $\epsilon$ close to 0, an F-test will likely give a confidence interval for the slope that contains 0. Isn't this a counterexample?

  • hypothesis-testing


  • 12 $\begingroup$ Hint: the clause before the portion you quoted is essential. " Given a sufficiently large sample size , a non-null statistical comparison will always show a statistically significant results unless the population effect size is exactly zero…" $\endgroup$ –  Kodiologist Commented Jan 19, 2018 at 2:29
  • $\begingroup$ @Kodiologist: But, re my example, would this imply that if sample size were bigger, then r itself would also be bigger, or, at least the expression $r(s_y/s_x)$ would be larger if sample size were larger? I don't see it. $\endgroup$ –  gary Commented Jan 19, 2018 at 2:38
  • 7 $\begingroup$ If this wasn't true, it would be a flaw in the statistical method. If $\mu > \mu_0$, surely some sample size is large enough to detect the difference. $\endgroup$ –  John Coleman Commented Jan 19, 2018 at 12:01
  • $\begingroup$ curious, how does this relate to this question: stats.stackexchange.com/questions/35470/… $\endgroup$ –  Charlie Parker Commented Mar 3, 2023 at 22:08

4 Answers

As @Kodiologist points out, this is really about what happens for large sample sizes. For small sample sizes there's no reason why you can't have false positives or false negatives.

I think the $z$-test makes the asymptotic case clearest. Suppose we have $X_1, \dots, X_n \stackrel{\text{iid}}\sim \mathcal N(\mu, 1)$ and we want to test $H_0: \mu = 0$ vs $H_A: \mu \neq 0$. Our test statistic is $$ Z_n = \frac{\bar X_n - 0}{1 / \sqrt n} = \sqrt n\bar X_n. $$

$\bar X_n \sim \mathcal N(\mu, \frac 1n)$ so $Z_n = \sqrt n \bar X_n \sim \mathcal N(\mu \sqrt n, 1)$. We are interested in $P(|Z_n| \geq \alpha)$. $$ P(|Z_n| \geq \alpha) = P(Z_n \leq -\alpha)+ P(Z_n \geq \alpha) $$ $$ = 1 + \Phi(-\alpha - \mu\sqrt n) - \Phi(\alpha - \mu \sqrt n). $$ Let $Y \sim \mathcal N(0,1)$ be our reference variable. Under $H_0$ $\mu = 0$ so we have $P(|Z_n| \geq \alpha) = 1 - P(-\alpha \leq Y \leq \alpha)$ so we can choose $\alpha$ to control our type I error rate as desired. But under $H_A$ $\mu \sqrt n \neq 0$ so $$ P(|Z_n| \geq \alpha) \to 1 + \Phi(\pm\infty) - \Phi(\pm\infty) = 1 $$ so with probability 1 we will reject $H_0$ if $\mu \neq 0$ (the $\pm$ is in case of $\mu < 0$, but either way the infinities have the same sign).

The point of this is that if $\mu$ exactly equals $0$ then our test statistic has the reference distribution and we'll reject 5% (or whatever we choose) of the time. But if $\mu$ is not exactly $0$, then the probability that we'll reject heads to $1$ as $n$ increases. The idea here is the consistency of a test, which is that under $H_A$ the power (probability of rejecting) heads to $1$ as $n \to \infty$.

It's the exact same story with the test statistic for testing $H_0 : \rho = \rho_0$ versus $H_A: \rho \neq \rho_0$ with the Pearson correlation coefficient. If the null hypothesis is false, then our test statistic gets larger and larger in probability, so the probability that we'll reject approaches $1$.
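The consistency argument above is easy to check by simulation. Here is a minimal sketch (illustrative Python, not from the answer) using the same $\mathcal N(\mu, 1)$ model and the two-sided 5% critical value of 1.96; the effect size $\mu = 0.2$ is an arbitrary choice:

```python
import numpy as np

rng = np.random.default_rng(0)

def rejection_rate(mu, n, n_sims=10_000, crit=1.96):
    """Fraction of simulated two-sided z-tests of H0: mu = 0 that
    reject at the 5% level (critical value 1.96)."""
    # Each row is one simulated iid sample of size n from N(mu, 1).
    samples = rng.normal(mu, 1.0, size=(n_sims, n))
    z = np.sqrt(n) * samples.mean(axis=1)  # Z_n = sqrt(n) * X-bar
    return float(np.mean(np.abs(z) >= crit))

# Under H0 (mu = 0) the rejection rate stays near 0.05 at every n;
# under H_A (mu = 0.2) it climbs toward 1 as n grows.
for n in (10, 100, 1_000):
    print(n, rejection_rate(0.0, n), rejection_rate(0.2, n))
```

The type I error rate is flat in $n$ while the power under the alternative rises toward 1, which is exactly the asymptotic picture derived above.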


  • 1 $\begingroup$ Nitpick: if $μ < 0$, then $Z_n$ will diverge to $-\infty$ instead of $\infty$, right? $\endgroup$ –  Kodiologist Commented Jan 19, 2018 at 5:18
  • 1 $\begingroup$ Nice, but what happens in the $\mu=0$ case should depend on whether $\bar{X}\to_p 0$ “faster” than $\sqrt{n}\to \infty$, right? I’m not even sure how you would “compare” the rate of convergence for a sequence of random variables and a sequence of integers - probably Slutsky’s theorem or something like that should be applied. $\endgroup$ –  DeltaIV Commented Jan 19, 2018 at 7:37
  • 1 $\begingroup$ @DeltaIV, right, if the convergence rate were different, one would need a different scaling to get a nondegenerate null distribution. But for the present example, root-n is the right rate. $\endgroup$ –  Christoph Hanck Commented Jan 19, 2018 at 9:13
  • 1 $\begingroup$ $\sqrt n \bar X$ converges to a standard normal by the CLT, not to $0$. $\endgroup$ –  guy Commented Jan 19, 2018 at 14:27

As a simple example, suppose that I am estimating your height using some statistical mumbo jumbo.

You've always stated to others that you are 177 cm (about 5 ft 10 in).

If I were to test this hypothesis (that your height is equal to 177 cm, $h = 177$), and I could reduce the error in my measurement enough, then I could prove that you are not in fact 177 cm. Eventually, if I estimate your height to enough decimal places, you would almost surely deviate from the stated height of 177.00000000 cm. Perhaps you are 177.02 cm; I only have to reduce my error to less than .02 to find out that you are not 177 cm.

How do I reduce the error in statistics? Get a bigger sample. If you get a large enough sample, the error gets so small that you can detect the most minuscule deviations from the null hypothesis.
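A simulation sketch makes this concrete (illustrative Python; the true height of 177.02 cm and the 0.5 cm per-measurement error are hypothetical numbers matching the example). The standard error of the mean shrinks like $\sigma/\sqrt{n}$, so the 0.02 cm deviation only becomes detectable once $n$ is large:

```python
import math
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: your true height is 177.02 cm, your claim is
# 177 cm, and each measurement has 0.5 cm of random error.
true_height, claimed, sigma = 177.02, 177.0, 0.5

def two_sided_p(z):
    """Two-sided p-value for a standard-normal test statistic,
    using Phi(x) = 0.5 * (1 + erf(x / sqrt(2)))."""
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

for n in (10, 1_000, 100_000):
    sample = rng.normal(true_height, sigma, size=n)
    z = (sample.mean() - claimed) / (sigma / math.sqrt(n))
    print(f"n={n:>7}  z={z:+.2f}  p={two_sided_p(z):.4f}")
```

With a handful of measurements the 0.02 cm difference is invisible; with a hundred thousand, the standard error drops below it and the test rejects decisively.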


  • 3 $\begingroup$ This is a very clear and concise explanation. It probably is more helpful for understanding why this happens than the more mathematical answers. Well done. $\endgroup$ –  Nobody Commented Jan 19, 2018 at 18:15
  • 2 $\begingroup$ Nicely explained, but I think it's also important to consider that there are cases in which the stated value is truly exact. For example, setting aside weird things that happen in string theory etc., a measurement of the number of spatial dimensions of our universe (which can be done) is going to give 3, and no matter how precise you make that measurement, you will never consistently find statistically significant deviations from 3. Of course if you keep testing enough times you'll get some deviations simply due to variance, but that's a different issue. $\endgroup$ –  David Z Commented Jan 21, 2018 at 6:34
  • $\begingroup$ Probably a naive question but if I claim I am 177cm, doesn't the concept of significant digits mean I am only saying I am between 176.5 and 177.5? The answer seems to give a good theoretical concept, true, but is it not based on a false premise? What am I missing? $\endgroup$ –  JimLohse Commented Jan 31, 2018 at 15:10
  • $\begingroup$ In this case the stated height of 177 is analogous to the null hypothesis in statistics. In traditional hypothesis testing for equality, you make a statement of equality (e.g. $\mu = 177$). The point is that no matter what you state your height to be, I can disprove it by reducing the error unless the null hypothesis is EXACTLY true. I used height as an easy to understand example, but this concept is the same in other areas (substance x does not cause cancer, this coin is fair, etc.) $\endgroup$ –  Underminer Commented Jan 31, 2018 at 16:47
  • $\begingroup$ Mathematically, how is your claim "if you get a large enough sample, the error gets smaller..." supported? We compute the p-value from the $Z$ score, but I don't see how the Z score changes with sample size. $\endgroup$ –  roulette01 Commented Sep 10, 2020 at 15:07

Arguably what they said is wrong, if for no other reason than their use of "this always happens".

I don't know if this is the crux of the confusion you're having, but I'll post it because I think many do and will get confused by this:

"$X$ happens if $n$ is large enough" does NOT mean "If $n > n_0$, then $X$."

Rather, it means $\lim\limits_{n\to\infty} \Pr(X) = 1$.

What they are literally saying translates to the following:

For any sample size $n$ above some minimum size $n_0$, the result of any non-null test is guaranteed to be significant if the true effect size is not exactly zero.

What they were trying to say, though, is the following:

For any significance level, as the sample size is increased, the probability that a non-null test yields a significant result approaches 1 if the true effect size is not exactly zero.

There are crucial differences here:

There is no guarantee; a bigger sample only makes a significant result more likely. Now, they could dodge part of the blame here, because so far it's just a terminology issue. In a probabilistic context, the statement "if n is large enough then X" is understood to mean "X becomes more and more likely to be true as n grows large". However, that interpretation goes out the window as soon as they say this "always" happens. The proper terminology would have been to say this happens "with high probability" 1 .

This is secondary, but their wording is confusing—it seems to imply that you fix the sample size to be "large enough", and then the statement holds true for any significance level. However, regardless of what the precise mathematical statement is, that doesn't really make sense: you always first fix the significance level, and then you choose the sample size to be large enough. But the suggestion that it can somehow be the other way around unfortunately emphasizes the $n > n_0$ interpretation of "large enough", so that makes the above problem even worse.

But once you understand the literature, you get what they're trying to say.

(Side note: incidentally, this is exactly one of the constant problems many people have with Wikipedia. Frequently, it's only possible to understand what they're saying if you already know the material, so it's only good for a reference or as a reminder, not as self-teaching material.)

1 For the fellow pedants (hi!), yes, the term has a more specific meaning than the one I linked to. The loosest technical term we probably want here is "asymptotically almost surely". See here.
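The "high probability, no guarantee" distinction is easy to see by simulation. Here is an illustrative Python sketch (the effect size $\mu = 0.05$ and sample size $n = 5000$ are arbitrary choices): even with a nonzero true effect and a large sample, a nontrivial fraction of repeated experiments still fail to reject.

```python
import numpy as np

rng = np.random.default_rng(2)

# Nonzero true effect (mu = 0.05), large per-sample n: power is high,
# but some simulated two-sided z-tests still fail to reject H0: mu = 0.
mu, n, n_sims = 0.05, 5_000, 100_000

# The mean of n iid N(mu, 1) draws is N(mu, 1/n), so we can simulate
# the sample means directly instead of generating the full samples.
xbar = rng.normal(mu, 1.0 / np.sqrt(n), size=n_sims)
z = np.sqrt(n) * xbar
rejected = np.abs(z) >= 1.96

print(f"rejected in {rejected.mean():.1%} of {n_sims:,} simulations")
print(f"failed to reject in {(~rejected).sum():,} of them")
```

The rejection rate sits well above 90% but below 1: that is "with high probability", not "always", and only the limit $n \to \infty$ pushes it all the way to 1.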


  • $\begingroup$ "the probability that a non-null test yields a significant result approaches 0 if the true effect size is exactly zero" may not be quite right: if the test has significance level $\alpha$ then the probability of yielding a significant result may be $\alpha$ or thereabouts at all sample sizes $\endgroup$ –  Henry Commented Jan 19, 2018 at 10:41
  • $\begingroup$ @Henry: Oh shoot, you're right! I wrote it so fast I didn't stop to think. Thanks a ton! I've fixed it. :) $\endgroup$ –  user541686 Commented Jan 19, 2018 at 11:16

My favorite example is number of fingers by gender. The vast majority of people have 10 fingers. Some have lost fingers due to accidents. Some have extra fingers.

I don't know if men have more fingers than women (on average). All the easily available evidence suggests that men and women both have 10 fingers.

However, I am highly confident that if I did a census of all men and all women then I would learn that one gender has more fingers (on average) than the other.


