Analysis of Categorical Data
- Module 6 - Analysis of Categorical Data
- Learning Objectives
One Sample Test of Proportions
The pbkid data set.
Suppose we are interested in estimating the proportion of individuals in a population who have a certain trait. For instance, we might be interested in studying the proportion of children living near a lead smelter who have colic. The prevalence of colic in the general public is estimated to be as low as 7%. The "pbkid" data set contains information on a sample of children living near a lead smelter.
Rosner (Rosner, Fundamentals of Biostatistics, 1995) presents the data from an observational study which evaluated the effects of lead exposure on neurological and psychological function in children who lived near a lead smelter. Each child had his or her blood lead level measured twice, once in 1982 and again in 1983. These readings were used to quantify lead exposure. The control group (n=78) consisted of children whose blood lead levels were less than 40 ug/100mL in both 1982 and 1983, whereas the exposed group of children (n=46) had blood lead levels of at least 40 ug/100mL in either 1982 or 1983. We can use these data to make inferences on the general population of children living near a lead smelter.
Point estimate
We estimate the proportion, p, as:
\[\hat{p} = \frac{x}{n}\]
where x is the number in the sample who have the trait or outcome of interest, and n is the size of the sample.
Hypothesis Tests
- Null hypothesis \(H_0: p = p_0\)
- Alternative hypothesis \(H_1: p \neq p_0\)
This hypothesis test considers whether the population proportion is equal to some pre-specified value, \(p_0\). This value might be of historical interest or a result obtained in another study that we are trying to corroborate with our study data. The test relies on a normal approximation; a common rule of thumb is to use it only when both \(np_0\) and \(n(1-p_0)\) are greater than five.
To perform this test, we:
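- Calculate the following test statistic, which under the null hypothesis approximately follows a standard normal distribution (provided the rule of thumb stated above holds):
\[z = \frac{\hat{p} - p_0}{\sqrt{\dfrac{p_0(1-p_0)}{n}}}\]
where \(\hat{p}\) is the sample proportion, \(p_0\) is the hypothesized proportion, and n is the sample size.
Decision Rule:
Reject \(H_0\) if \(|z| > z_{\alpha/2}\), where \(z_{\alpha/2}\) is the \(100(1-\alpha/2)\)th percentile of the standard normal distribution.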
Confidence Intervals
Additionally, we can calculate a confidence interval for the population proportion, again relying on the rule of thumb stated above. The lower and upper limits of the \(100(1-\alpha)\%\) confidence interval are given by:
\[\hat{p} \pm z_{\alpha/2}\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}}\]
In the pbkid data set there were 124 children and 23 of them had colic.
We can first estimate the proportion of children with colic as:
\[\hat{p} = \frac{23}{124} \approx 0.19\]
Using the information from the pbkid data set we can test if the prevalence of colic among children who live near lead smelter differs from that in the general public, which is around 7%.
Hypothesis:
- \(H_0\): The proportion of colic among children living near lead smelters is 0.07 (p = 0.07)
- \(H_1\): The proportion of colic among children living near lead smelters is not 0.07 (p ≠ 0.07)
Significance Level: 0.05
Test Statistic:
\[z = \frac{0.19 - 0.07}{\sqrt{\dfrac{0.07(1-0.07)}{124}}} \approx 5.24\]
Decision Rule: Reject if |z| > 1.96.
Confidence interval:
\[0.185 \pm 1.96\sqrt{\dfrac{0.185(1-0.185)}{124}} \approx 0.185 \pm 0.068 = (0.12,\ 0.25)\]
Conclusion:
In our sample, the proportion of colic among children living near lead smelters was 0.19. We calculated a z-statistic of 5.24 (about 5.04 if the unrounded estimate 23/124 is used), which is greater than the critical value of 1.96 associated with a significance level α = 0.05. Thus we reject the null hypothesis and conclude that the prevalence of colic among children living near lead smelters is different from 0.07. The 95% confidence interval is 0.12 to 0.25.
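The pbkid calculation can be checked with a short script. This is a minimal sketch using only the Python standard library; the function name `one_sample_prop_test` is an illustrative choice, not something from the course materials (which use SAS `proc freq`). It uses the unrounded estimate 23/124, so it prints z ≈ 5.04 rather than the 5.24 obtained from the rounded value 0.19.

```python
from math import sqrt
from statistics import NormalDist


def one_sample_prop_test(x, n, p0, alpha=0.05):
    """Two-sided large-sample z-test and confidence interval for a proportion.

    Hypothetical helper for illustration only.
    """
    p_hat = x / n
    # Standard error under the null hypothesis (used for the test statistic).
    se_null = sqrt(p0 * (1 - p0) / n)
    z = (p_hat - p0) / se_null
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    # Estimated standard error (used for the confidence interval).
    se_hat = sqrt(p_hat * (1 - p_hat) / n)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    ci = (p_hat - z_crit * se_hat, p_hat + z_crit * se_hat)
    return p_hat, z, p_value, ci


# 23 of the 124 children had colic; the null value is 0.07.
p_hat, z, p_value, ci = one_sample_prop_test(x=23, n=124, p0=0.07)
print(f"p_hat = {p_hat:.3f}, z = {z:.2f}")
print(f"95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
print("reject H0" if abs(z) > 1.96 else "fail to reject H0")
```

Either version of the z-statistic is far above 1.96, so the conclusion is unchanged.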
STM1001 Topic 9: Hypothesis Testing for One and Two Sample Proportions
Chapter 1: One-Sample Test of Proportions
We will start by discussing the one-sample test of proportions. As an example, suppose it has been claimed that among social media users, 73% use Facebook more than once per day, and we wanted to test this claim. Consider the hypotheses
\[H_0 : p = 0.73 \text{ versus } H_1 : p \neq 0.73,\]
where
- \(p\) denotes the population proportion of social media users who use Facebook more than once per day
- \(H_0\) denotes the null hypothesis that the population proportion of social media users who use Facebook more than once per day is equal to 0.73 (or as a percentage, 73%)
- \(H_1\) denotes the alternative hypothesis that the population proportion of social media users who use Facebook more than once per day is different from 73%.
In more general terms, suppose we have a random sample of \(n\) observations, each of which has a certain characteristic with probability \(p\), and let \(x\) denote the number of observations in the sample that actually have that characteristic. Equivalently, suppose we conduct \(n\) independent trials each with probability of success \(p\) and let \(x\) denote the number of successes in these \(n\) trials. Consider the hypotheses
\[H_0 : p = p_0\text{ versus } H_1 : p \neq p_0\text{ (or }p<p_0\text{ or }p>p_0),\]
where
- \(p_0\) denotes the population proportion under the null hypothesis.
Then, provided \(n\) is not too small (this will be further discussed shortly), a commonly used statistical test for this type of hypothesis is the one-sample proportion test based on the estimate of \(p\), which we can denote as \(\hat{p} = x/n\).
Test your knowledge
In the example above where we are looking at the proportion of social media users who use Facebook more than once per day, what is the value of \(p_0\) ?
Returning to our example, a survey was carried out ( Raymond 2019 ) to study the social media habits of regular social media users from around the world. Supposing that of the \(n = 484\) respondents, \(x = 368\) said they used Facebook more than once per day, we then have that
\[\hat{p} = \frac{x}{n} = \frac{368}{484} \approx 0.76.\]
Once we know the values of \(x\) and \(n\) , we then have enough information to calculate \(\hat{p}\) and then carry out the hypothesis test. Alternatively, if we know the value of \(n\) and \(\hat{p}\) , we can use this information to calculate \(x\) . Before carrying out the test, it is a good idea to visualise the data and check the assumptions, which we will do in the next section.
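As a small illustration of the relationship between \(x\), \(n\) and \(\hat{p}\), the sketch below computes \(\hat{p}\) from the survey counts and then recovers \(x\) from \(n\) and \(\hat{p}\). It also prints the expected counts \(np_0\) and \(n(1-p_0)\), which is one common way to judge whether \(n\) is large enough; treating these counts as the check is an assumption here, since the assumptions are discussed in a later section of the source.

```python
# Survey counts from the Facebook example.
n = 484      # respondents
x = 368      # respondents using Facebook more than once per day
p0 = 0.73    # proportion claimed under the null hypothesis

# Sample proportion.
p_hat = x / n
print(f"p_hat = {p_hat:.2f}")

# Knowing n and p_hat is enough to recover x.
x_recovered = round(p_hat * n)
print(f"x recovered from n and p_hat: {x_recovered}")

# Expected numbers of successes and failures if H0 were true
# (assumed here as a rough sample-size check).
expected_successes = n * p0
expected_failures = n * (1 - p0)
print(f"n*p0 = {expected_successes:.2f}, n*(1-p0) = {expected_failures:.2f}")
```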
Hypothesis Test for a Proportion
This lesson explains how to conduct a hypothesis test of a proportion, when the following conditions are met:
- The sampling method is simple random sampling .
- Each sample point can result in just two possible outcomes. We call one of these outcomes a success and the other, a failure.
- The sample includes at least 10 successes and 10 failures.
- The population size is at least 20 times as big as the sample size.
This approach consists of four steps: (1) state the hypotheses, (2) formulate an analysis plan, (3) analyze sample data, and (4) interpret results.
State the Hypotheses
Every hypothesis test requires the analyst to state a null hypothesis and an alternative hypothesis . The hypotheses are stated in such a way that they are mutually exclusive. That is, if one is true, the other must be false; and vice versa.
Formulate an Analysis Plan
The analysis plan describes how to use sample data to decide whether to reject the null hypothesis. It should specify the following elements.
- Significance level. Often, researchers choose significance levels equal to 0.01, 0.05, or 0.10; but any value between 0 and 1 can be used.
- Test method. Use the one-sample z-test to determine whether the hypothesized population proportion differs significantly from the observed sample proportion.
Analyze Sample Data
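Using sample data, find the test statistic and its associated P-value.
- Standard deviation. Compute the standard deviation (σ) of the sampling distribution:
σ = sqrt[ P * ( 1 - P ) / n ]
where P is the hypothesized value of the population proportion in the null hypothesis, and n is the sample size.
- Test statistic. The test statistic is a z-score (z) defined by the following equation:
z = (p - P) / σ
where p is the sample proportion.
- P-value. The P-value is the probability of observing a sample statistic at least as extreme as the test statistic, assuming the null hypothesis is true. Since the test statistic is a z-score, use the Normal Distribution Calculator to assess the probability associated with the z-score. (See sample problems at the end of this lesson for examples of how this is done.)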
Interpret Results
If the sample findings are unlikely, given the null hypothesis, the researcher rejects the null hypothesis. Typically, this involves comparing the P-value to the significance level , and rejecting the null hypothesis when the P-value is less than the significance level.
Test Your Understanding
In this section, two hypothesis testing examples illustrate how to conduct a hypothesis test of a proportion. The first problem involves a two-tailed test; the second problem, a one-tailed test.
Problem 1: Two-Tailed Test
The CEO of a large electric utility claims that 80 percent of his 1,000,000 customers are very satisfied with the service they receive. To test this claim, the local newspaper surveyed 100 customers, using simple random sampling. Among the sampled customers, 73 percent say they are very satisfied. Based on these findings, can we reject the CEO's hypothesis that 80% of the customers are very satisfied? Use a 0.05 level of significance.
Solution: The solution to this problem takes four steps: (1) state the hypotheses, (2) formulate an analysis plan, (3) analyze sample data, and (4) interpret results. We work through those steps below:
- State the hypotheses. The first step is to state the null hypothesis and an alternative hypothesis.
Null hypothesis: P = 0.80
Alternative hypothesis: P ≠ 0.80
- Formulate an analysis plan. For this analysis, the significance level is 0.05. The test method, shown in the next step, is a one-sample z-test.
- Analyze sample data. Using sample data, we calculate the standard deviation (σ) and compute the z-score test statistic (z).
σ = sqrt[ P * ( 1 - P ) / n ] = sqrt[ (0.8 * 0.2) / 100 ]
σ = sqrt(0.0016) = 0.04
z = (p - P) / σ = (.73 - .80)/0.04 = -1.75
where P is the hypothesized value of population proportion in the null hypothesis, p is the sample proportion, and n is the sample size.
Since we have a two-tailed test , the P-value is the probability that the z-score is less than -1.75 or greater than 1.75. We use the Normal Distribution Calculator to find P(z < -1.75) = 0.04. Since the standard normal distribution is symmetric with a mean of zero, we know that P(z > 1.75) = 0.04. Thus, the P-value = 0.04 + 0.04 = 0.08.
- Interpret results . Since the P-value (0.08) is greater than the significance level (0.05), we cannot reject the null hypothesis.
Note: If you use this approach on an exam, you may also want to mention why this approach is appropriate. Specifically, the approach is appropriate because the sampling method was simple random sampling, the sample included at least 10 successes and 10 failures, and the population size was at least 20 times the sample size.
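The arithmetic in Problem 1 can be reproduced in a few lines. This is a sketch using Python's standard library in place of the Normal Distribution Calculator mentioned above; the variable names mirror the symbols in the solution.

```python
from math import sqrt
from statistics import NormalDist

P = 0.80   # hypothesized population proportion
p = 0.73   # observed sample proportion
n = 100    # sample size

# Standard deviation of the sampling distribution under the null hypothesis.
sigma = sqrt(P * (1 - P) / n)

# z-score test statistic.
z = (p - P) / sigma

# Two-tailed P-value: probability of a z-score below -|z| or above +|z|.
p_value = 2 * NormalDist().cdf(-abs(z))

print(f"sigma = {sigma:.2f}, z = {z:.2f}, P-value = {p_value:.2f}")
print("reject H0" if p_value < 0.05 else "cannot reject H0")
```

The script prints a P-value of 0.08, which is above 0.05, so the null hypothesis is not rejected.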
Problem 2: One-Tailed Test
Suppose the previous example is stated a little bit differently. Suppose the CEO claims that at least 80 percent of the company's 1,000,000 customers are very satisfied. Again, 100 customers are surveyed using simple random sampling. The result: 73 percent are very satisfied. Based on these results, should we reject the CEO's hypothesis? Assume a significance level of 0.05.
- State the hypotheses.
Null hypothesis: P >= 0.80
Alternative hypothesis: P < 0.80
- Analyze sample data.
σ = sqrt[ P * ( 1 - P ) / n ] = sqrt [(0.8 * 0.2) / 100] = 0.04
z = (p - P) / σ = (.73 - .80)/0.04 = -1.75
Since we have a one-tailed test, the P-value is the probability that the z-score is less than -1.75, which is 0.04 (the same lower-tail probability found in Problem 1).
- Interpret results. Since the P-value (0.04) is less than the significance level (0.05), we reject the null hypothesis.
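The only difference between the two problems is which tail of the normal distribution is counted. A minimal sketch of that choice follows; the function name `proportion_p_value` and the `alternative` labels are hypothetical and introduced only for illustration.

```python
from math import sqrt
from statistics import NormalDist


def proportion_p_value(p, P, n, alternative="two-sided"):
    """Return (z, P-value) for a one-sample z-test of a proportion.

    alternative: "two-sided" (P != P0), "less" (P < P0) or "greater" (P > P0).
    """
    sigma = sqrt(P * (1 - P) / n)
    z = (p - P) / sigma
    std_normal = NormalDist()
    if alternative == "less":
        p_value = std_normal.cdf(z)          # lower tail only
    elif alternative == "greater":
        p_value = 1 - std_normal.cdf(z)      # upper tail only
    else:
        p_value = 2 * std_normal.cdf(-abs(z))  # both tails
    return z, p_value


# Problem 2: alternative hypothesis P < 0.80.
z, p_value = proportion_p_value(p=0.73, P=0.80, n=100, alternative="less")
print(f"z = {z:.2f}, one-tailed P-value = {p_value:.2f}")
print("reject H0" if p_value < 0.05 else "cannot reject H0")
```

With the same data, the one-tailed P-value (0.04) is half the two-tailed value (0.08), which is why the conclusions of the two problems differ at the 0.05 level.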
One sample Z-test for proportion: Formula & Examples
The one proportion z-test, or one-sample Z-test for proportion, is one of the most popular statistical hypothesis tests dealing with a single sample proportion. It is used to determine whether an observed sample proportion differs significantly from a hypothesized population proportion, drawing conclusions from sample data. As a data scientist, it is important to be proficient in this type of Z-test and understand how it works. In this blog post, we will learn how the one proportion z-test works with the help of the formula and examples.
What is one sample Z-test for proportion?
A one proportion Z-test is a hypothesis testing technique for comparing an observed sample proportion with a hypothesized (theoretical) population proportion. The test is used to determine whether the difference between the observed sample proportion and the hypothesized population proportion is statistically significant. The one-proportion z-test can be used as a one-tailed or a two-tailed test. The p-value is calculated from the Z statistic. If the p-value is less than the significance level, we can reject the null hypothesis. Otherwise, we fail to reject it. A one proportion z-test can be used to answer the following questions:
- Is there a difference between the observed sample proportion and the hypothesized (theoretical) population proportion? In other words, is the sample consistent with having been drawn from a population with that proportion?
- What is the magnitude of the difference between the sample proportion and population proportion?
- Is this difference statistically significant? Accordingly, can the null hypothesis be rejected or otherwise?
In order to conduct this type of test, we need to know the following:
- The observed sample proportion (p)
- The hypothesized (theoretical) population proportion (P0)
- The standard error
- The sample size (n)
The formula of Z-statistics or Z-score for 1-proportion Z-test is:
Z = (p – P0) / SE
where
P0 = Hypothesized (theoretical) population proportion
p = Observed sample proportion
SE = Standard error
The standard error for the 1-sample proportion Z-test can be calculated using the following formula. It is the standard deviation of the sampling distribution of the sample proportion when the null hypothesis is true.
SE = SQRT[P0(1-P0)/n]
P0 = Population proportion
n = Sample size
Examples of one proportion Z-test
Let’s take an example to understand how the one proportion z-test works. Suppose we want to test whether the unemployment rate in a particular city differs from a well-established population proportion. The claim is that the city's unemployment rate is not the same as that theoretical proportion. The task is to test this claim and find out whether the difference is statistically significant or can be attributed to chance. Let’s say the population proportion for the unemployment rate is 10%. A sample of 50 persons is taken and the unemployment rate is found to be 14% (7 persons out of 50 were unemployed). Is the difference statistically significant at a significance level of 0.05?
P0 = The value of theoretical proportion is 0.1 (10%)
p = The sample proportion is 0.14 (14%)
n = The sample size is 50
Now, we will calculate the Z-statistic (Z-score) with the help of the following formula.
Z = (0.14 – 0.10) / SQRT[0.10 × (1 – 0.10) / 50] = 0.04 / 0.0424 ≈ 0.94
The Z-score comes out to about 0.94, and the two-tailed p-value is about 0.346 (a Z of 0.95 would correspond to 0.342). At a significance level of 0.05, the test result is not statistically significant, and thus we can’t reject the null hypothesis based on the given evidence (the sample selected). This means the observed difference between 14% and 10% is consistent with chance variation in sampling; it does not prove that the city's unemployment rate equals 10%.
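The unemployment example can be verified with the short sketch below, which uses only the Python standard library. The variable names follow the symbols used above (P0, p, n, SE). Note that n × P0 = 5 here, so the sample is small for a normal approximation and the p-value should be read as approximate.

```python
from math import sqrt
from statistics import NormalDist

P0 = 0.10        # hypothesized (theoretical) population proportion
unemployed = 7   # unemployed persons in the sample
n = 50           # sample size

p = unemployed / n               # observed sample proportion (0.14)
SE = sqrt(P0 * (1 - P0) / n)     # standard error under the null hypothesis
Z = (p - P0) / SE                # Z-statistic

# Two-tailed p-value from the standard normal distribution.
p_value = 2 * (1 - NormalDist().cdf(abs(Z)))

print(f"p = {p:.2f}, SE = {SE:.4f}, Z = {Z:.2f}, p-value = {p_value:.3f}")
print("significant" if p_value < 0.05 else "not significant at 0.05")
```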
In practice, the one proportion z-test is used to compare an observed sample proportion with a theoretical population proportion. For example, it can be used to test whether the unemployment rate of a particular city differs from the theoretical population proportion for the whole country. Another example is testing whether the proportion of smokers in a city or region differs from the theoretical population proportion. Yet another example is testing whether the proportion of people who voted for a particular party in a particular region differs from the theoretical population proportion.
The one proportion z-test is a statistical test used to determine whether an observed sample proportion differs from a theoretical population proportion. It answers questions such as: “Is there a difference between the population and sample proportions?” To conduct the test, we need the observed sample proportion, the theoretical proportion, and the standard error, which is calculated from the theoretical proportion and the sample size. The one proportion z-test can be used in different scenarios such as medical studies, surveys or product testing.
5.2 - Hypothesis Testing for One Sample Proportion
Recall our “test” about whether Penn State students like cold weather. We have to ask how the data we have (from our sample) relate to the hypothesized null value. In other words, is our observed sample proportion far enough away from 0.5 to suggest that there is evidence against the null?
We can use what we know about the sampling distribution of sample proportions to help find our evidence!
Hypothesis Testing for One Sample Proportion
Recall that under certain conditions, the sampling distribution of the sample proportion, \(\hat{p} \), is approximately normal with mean, \(p \), standard error \(\sqrt{\dfrac{p(1-p)}{n}}\), and estimated standard error \(\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}}\).
\(H_0\colon p=p_0\)
Conditions:
- \(np_0 \ge 5\) and \(n(1-p_0)\ge5\)
Test Statistic:
\(z^{*}=\dfrac{\hat{p}-p_{0}}{\sqrt{\dfrac{p_{0}\left(1-p_{0}\right)}{n}}}\)
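The condition and the test statistic above can be combined into one small function. This is a sketch only: the function name `z_star` is hypothetical, and the counts in the usage example (60 of 100 students) are invented for illustration, since the source does not give the survey data at this point.

```python
from math import sqrt


def z_star(x, n, p0, min_count=5):
    """Compute z* for H0: p = p0, after checking n*p0 >= 5 and n*(1-p0) >= 5."""
    expected_successes = n * p0
    expected_failures = n * (1 - p0)
    if expected_successes < min_count or expected_failures < min_count:
        raise ValueError(
            "Normal approximation conditions not met: "
            f"n*p0 = {expected_successes:.1f}, n*(1-p0) = {expected_failures:.1f}"
        )
    p_hat = x / n
    # The standard error uses p0, because the test assumes H0 is true.
    return (p_hat - p0) / sqrt(p0 * (1 - p0) / n)


# Hypothetical counts: 60 of 100 sampled students like cold weather.
z = z_star(x=60, n=100, p0=0.5)
print(f"z* = {z:.2f}")
```

Raising an error when the conditions fail is one possible design; a caller might instead fall back to an exact binomial test.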