# install.packages("MASS") - run this code if you do not have MASS package
library(MASS)
I warn you that several propositions described here are wrong and we will figure out why and which during the seminar.
I higly appreciate questions and suggestions during the seminar, but I also want all participants to take part in the discussion. So if you know all the answers, please, let others try.
“When current IQ tests are developed, the median raw score of the norming sample is defined as IQ 100 and scores each standard deviation (SD) up or down are defined as 15 IQ points greater or less” - wikipedia.
Suppose we know that average IQ in population is equal to 100 and IQ is normally distributed with standard deviation of 15. You have a agreement with a company that produces chocolate that you will study an impact of chocolate intake on people’s IQ. You want to study if persons who eat significant amount of chocolate every day are more or less clever than average. You made an ad on the university’s website and in mass media and asked for volunteers. Your secretary told you that you have claims from 25 volunteers. You feed them with chocolate for 1 month and their IQ measures after this period were:
chocolate_IQs <- c(88, 107, 106, 127, 99, 71, 95, 114, 91, 82, 80, 114, 94, 59, 122, 134, 97, 89, 79, 128, 93, 87, 100, 102, 111, 107, 95, 108, 97, 70)
summary(chocolate_IQs)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 59.00 88.25 97.00 98.20 107.80 134.00
Consideration: If a sample of n
individuals has been chosen at random
from the population, then the likely size of chance error of the sample mean (called the standard error) is SE = population_sigma / sqrt(n)
.
Q: Population sigma is equal to 15, calculate the standard error of the mean for IQs of your volunteers (chocolate_IQs
).
Since you know that IQs are normalised in a way that standard deviation is equal to 15 and you sampled your cohort from the population, you may assume that the standard deviation is equal to 15. Just to be sure you test if the variances are equal for Control and Treated group. You can perform it using F-test on variations equality (?var.test
for additional information) if you compare variances of two samples, but here we know true variance and it is equal to 15. Since we know true standard deviation from our population (sd=15
), we apply Chi-squared test directly. The statistic will be (number of degrees of freedom) * (sample variance / true variance)
. H0 will be “standard deviation is less or equal to populational standard deviation” and H1 will be “standard deviation is greater than populational standard deviation”. For hypothesis “variances are not equal” you should use alpha/2
level of significance from both sides.
(length(chocolate_IQs) - 1) * var(chocolate_IQs) / (15 ** 2)
qchisq(1 - 0.05, length(chocolate_IQs) - 1)
Q0: What did this test tell us about equality of variances between population and group of chocolate feeded people?
Q1: What test would you choose for this example? Why?
Null hypothesis: mean of the sample is equal to the hypothesized (population) mean.
Assumptions of Z-test:
How to test data for normality?
shapiro.test(chocolate_IQs)
##
## Shapiro-Wilk normality test
##
## data: chocolate_IQs
## W = 0.98876, p-value = 0.9833
Interpret the result of a test. How to test independence of sampling?
z.test = function(a, mu, std){
zeta = (mean(a) - mu) / (std / sqrt(length(a)))
return(zeta)
}
z.test(chocolate_IQs, 100, 15)
## [1] -0.6572671
qnorm(0.975)
## [1] 1.959964
Q2: This test returns Z-statistic that should be compared with quantile of standard normal distribution. What can you say about the result of test? What if you will change your hypothesis to “are people who are eating chocolate are more clever”? (Hint: you will need function qnorm(0.95)
).
Q3: The company pays you to test chocolate intake impact more extensively and you increased the number of samples that took part in the study. Now the vector “chocolate_IQs” contains results of 10000 measures. Interpret result of the test (you do not need to perform calculations by yourself).
z.test(chocolate_IQs_long, 100, 15)
## [1] 6.937202
Q4: The data for this task was simulated from the distributions with different parameters. Here we see true density of IQ of people who do not eat chocolate and estimated density of your 10000 volunteers. Interpret the results of test using the plot. (Plot will be shown during the seminar)
What is wrong with this plot? Why it can not look exactly like this?
Assumptions of one-sample t-test:
You performed a genetic modification of bacteria. You wanted to estimate how this modification influenced its life span and measured the lengths of life for 25 bacteria. But your colleague told you that he already did this modification and length of life was equal to 80 minutes. Is your colleague right? Can you trust him or is he just trying to ruin your research?
Copy following measures to the vector life_span
:
paste(as.character(life_span), sep="' '", collapse=", ")
## [1] "88.1182719073621, 94.8332872750113, 93.216098240428, 94.4837517244522, 89.0162824741825, 98.3549941533524, 90.9306505649222, 79.4738128933811, 95.7943525794675, 79.907550971115, 67.5608620077608, 84.1118666207429, 80.4405426387418, 93.8668348621264, 77.2212434214561, 68.1068633480655, 84.8906327374611, 75.2442966398898, 92.1865794241239, 92.1903595428325, 83.4280593426128, 81.6008659108239, 74.223954191764, 106.223060058316, 91.3518150269478"
You can call t.test
function to perform one-sample t-test, but before - check the assumption of normality and absence of outliers (hint: use shapiro.test()
and boxplot()
command).
t.test(life_span, mu=80)
Q5: Could you interpret the results?
You have group of samples who recieved new drug and group of sample that were cured according to the standard protocol. You were told that 20 patients with the disease were selected from the population and divided into 2 equal groups: 10 and 10. You measure hormone levels that are known to be distributed normally, and the lower level is the better.
Q6: Is your new drug better?
Assumptions of two-sample non-paired t-test:
treated_group <- c(9830.76249436958, 6161.37243638468, 13131.8353604025, 10284.6598357407, 9245.98726622183, 7175.32445354559, 11174.4236746503, 12437.8577787669, 7093.47847958366, 9925.07623800654)
control_group <- c(12253.9047358527, 11425.4583361033, 11659.6962460054, 11759.4030595056, 9671.33755357213, 13352.9689425818, 7974.08085172749, 11220.3927754223, 12885.6045226814, 13032.8169883379)
You need to check if your variances are equal if you want to apply original version of t-test. Now we will use R
to check it.
var.test(treated_group, control_group)
You also would like to examine the data visually:
x <- cbind(treated_group, control_group)
x <- as.data.frame(x)
names(x) <- c("Treated", "Control")
colors <- c("red","blue")
boxplot(x, border=colors)
stripchart(x, col=colors, vertical = T, add=T, method="stack",pch=20)
Do you see any outliers?
Call your vectors as treated_group
and control_group
, respectively.
t.test(treated_group, control_group,alternative="less")
Q7: Interpret result of the test.
“Dependent groups can be the same subjects used again (repeated measures), or they can be matched samples. Either way, the t-test is performed on the difference scores and amounts to little more than a single sample t-test. A normal distribution of difference scores is strongly encouraged, unless the sample is large enough that you can hide behind the central limit theorem and claim a normal sampling distribution of means of the differences, in which case the t-test is robust to violations of the normality assumption.” - William B. King, Coastal Carolina University.
Assumptions of two-sample test are approx. the same as for non-paired test, but now differences betweeen observations should be independent.
Now you got the email that, actually, your samples were the same people, but measured before treatment and after.
Q8: Re-do the test, using new information.
t.test(treated_group, control_group,paired=TRUE,alternative="less")
Another example (analysis of anorexia data: weights of woman before study Prewt
and after Postwt
, Cont
means Control Group, CBT
means Cognitive Behavioural treatment, FT
means family therapy).
data(anorexia, package="MASS")
attach(anorexia)
str(anorexia)
## 'data.frame': 72 obs. of 3 variables:
## $ Treat : Factor w/ 3 levels "CBT","Cont","FT": 2 2 2 2 2 2 2 2 2 2 ...
## $ Prewt : num 80.7 89.4 91.8 74 78.1 88.3 87.3 75.1 80.6 78.4 ...
## $ Postwt: num 80.2 80.1 86.4 86.3 76.1 78.1 75.1 86.7 73.5 84.6 ...
Examine the data. What type of hypothesis can you test on this data? Test variances for equality.
ft = subset(anorexia, subset=(Treat=="FT")) # we subsample only patients that followed family therapy
with(ft, var.test(Postwt, Prewt))
Can we apply t-test for this real data?
Read this part from ?t.test
:
If paired is TRUE then both x and y must be specified and they must be the same length. Missing values are silently removed (in pairs if paired is TRUE). If var.equal is TRUE then the pooled estimate of the variance is used. By default, if var.equal is FALSE then the variance is estimated separately for both groups and the Welch modification to the degrees of freedom is used.
Finally, apply t-test.
with(ft, t.test(Postwt-Prewt, mu=0, alternative="greater"))
with(ft, t.test(Postwt, Prewt, paired=T, alternative="greater"))
Why two tests (single sample and paired two samples) gave equal results?
Welch test is not a magic pill - it gaves also approximate solution. However, it is recommended to use usual t-test for equal population variances case, or if the samples are rather small and the population variances can be assumed to be approximately equal. Otherwise Welch test is preferred. If you want to learn more about the potential issues and ways how to deal with them, read wiki’s article on Behrens–Fisher problem.
You study one SNP allele in Spanish population. You found a paper that you want to use in your research and this paper states that frequency of this SNP allele is equal to 0.4 in German population. But you sequenced 50 Spanish and found that only 13 of samples had this allele (and you expect 20). You wander if it is different from German population (because you want to assume that popultations are similar and use the same procedures as were used in the paper).
Assumptions of proportions test:
At first, do it by hand:
(estimated_proportion - true_proportion) / sqrt((true_proportion * (1 - true_proportion)) / sample_size)
and compare it with standard normal quantile. Or you can use 2 * pnorm(statistic)
to obtain a p-value (two-tailed, for one-tailed it is not necessary to multiply by 2).
Then use R function
prop.test(13, 50, p=0.4, correct=F)
##
## 1-sample proportions test without continuity correction
##
## data: 13 out of 50, null probability 0.4
## X-squared = 4.0833, df = 1, p-value = 0.04331
## alternative hypothesis: true p is not equal to 0.4
## 95 percent confidence interval:
## 0.1587153 0.3955316
## sample estimates:
## p
## 0.26
(changed example from stattrek.com) Suppose the Acme Drug Company develops a new drug, designed to prevent colds. The company states that the drug is equally effective for men and women. To test this claim, they choose a random sample of 100 women and 200 men from a population of 100,000 volunteers.
At the end of the study, 38 of the women caught a cold; and 101 of the men caught a cold. Based on these findings, can we reject the company’s claim that the drug is equally effective for men and women? Use a 0.05 level of significance.
prop.test(c(38, 100), c(101,200))
##
## 2-sample test for equality of proportions with continuity
## correction
##
## data: c(38, 100) out of c(101, 200)
## X-squared = 3.6568, df = 1, p-value = 0.05584
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## -0.2483786071 0.0008538546
## sample estimates:
## prop 1 prop 2
## 0.3762376 0.5000000
Q9: R uses continuity correction by default. Test the same data with correction (remove correct=F
). Explain results.
Here is another example. You want to test if two bacteria are equally sensitive to one particular antibiotic. You created a mixed culture of bacterias of 2 types, then divided them into two Petri’s dishes and added antibiotic into one dish. Did the proportion of bacterias in different dishes changed?
bacteria_symbiosis <-
matrix(c(68, 32, 125, 95),
nrow = 2,
dimnames = list(
Colony = c("Without antibiotic", "With antibiotic"),
Bacteria = c("E.coli", "Listeria")))
print(bacteria_symbiosis)
## Bacteria
## Colony E.coli Listeria
## Without antibiotic 68 125
## With antibiotic 32 95
prop.test(bacteria_symbiosis , alternative = "greater")
##
## 2-sample test for equality of proportions with continuity
## correction
##
## data: bacteria_symbiosis
## X-squared = 3.1392, df = 1, p-value = 0.03822
## alternative hypothesis: greater
## 95 percent confidence interval:
## 0.008898995 1.000000000
## sample estimates:
## prop 1 prop 2
## 0.3523316 0.2519685
Fisher’s exact test:
Agresti (1990, p. 61f; 2002, p. 91) A British woman claimed to be able to distinguish whether milk or tea was added to the cup first. To test, she was given 8 cups of tea, in four of which milk was added first. The null hypothesis is that there is no association between the true order of pouring and the woman’s guess, the alternative that there is a positive association (that the odds ratio is greater than 1).
Assumptions of Fisher exact test on proportions:
Woman knows the amount of cups with different orders so the outcome table will look like:
matrix(c(4 - k, k, k, 4 - k))
TeaTasting <-
matrix(c(3, 1, 1, 3),
nrow = 2,
dimnames = list(Guess = c("Milk", "Tea"),
Truth = c("Milk", "Tea")))
fisher.test(TeaTasting, alternative = "greater")
##
## Fisher's Exact Test for Count Data
##
## data: TeaTasting
## p-value = 0.2429
## alternative hypothesis: true odds ratio is greater than 1
## 95 percent confidence interval:
## 0.3135693 Inf
## sample estimates:
## odds ratio
## 6.408309
Q10: let us assume that the woman guessed all cups correctly. Is it enough to say that she can distinguish between the orders with 0.05 significance?
Let us show the definition of confidence intervals on the simlation example. For example, we are trying to estimate the average male student height in some population. We assume that the average height is 175 and standard deviation is equal to 15, height is normally distributed. We calculate 95% confidence intervals using formula from the lectures.
But, evidently, one single person can not measure a lot of people so we distribute the research between different universities. It seems quite reasonable to estimate number of measures that could be done in one university as 100. But we involve more researchers, say, 100, so we can make measures in different universities, and compare confidence intervals that were obtained in different studies.
set.seed(2012)
m = 175
s = 15
number_of_elements_in_one_sample <- 100 # number of persons that can be measured in one place within one day
number_of_simulations <- 100 # number of different researchers that make measures
matrix_of_bounds_of_confidence_intervals <- matrix(rep(0, 2 * number_of_simulations), nrow=number_of_simulations, ncol=2)
plot(rep(m, number_of_simulations) ~ c(1:number_of_simulations), ylim=c(m - 2 * s, m + 2 * s), cex=0.3, pch=19, xlab="Different researchers' confidence intervals", ylab="Estimation of height with true = 175")
for (i in 1:number_of_simulations) {
sample_from_population <- rnorm(number_of_elements_in_one_sample, mean = m, sd = s)
mean_estimated = mean(sample_from_population)
sd_estimated = sd(sample_from_population)
error <- (qnorm(0.975) * sd_estimated) / sqrt(number_of_elements_in_one_sample)
left_end <- mean_estimated - error
right_end <- mean_estimated + error
matrix_of_bounds_of_confidence_intervals[i,1] <- left_end
matrix_of_bounds_of_confidence_intervals[i,2] <- right_end
if (right_end < m | left_end > m) {
segments(i, left_end, i, right_end, lwd=2, col="red")
} else {
segments(i, left_end, i, right_end, lwd=2)
}
}
Q11: How many 95% confidence intervals are “wrong”, i.e., do not contain true mean (175 cm)? How many wrong intervals are expected?
Q12: Try to run the code several more times (without set.seed
). Did the percentage of wrong confidence intervals change? Let us assume that you do not have time to measure height of 100 males in your study, but 20 males is OK for you. Change “number_of_elements_in_one_sample” to 20 and run the code again several times.
Q13: Let us assume that larger number of researchers, i.e., 10000, measured 200 males from the population each. What percentage of wrong confidence intervals do you expect? Check with the following code.
set.seed(900)
number_of_simulations <- 10000
number_of_elements_in_one_sample <- 100
number_of_confidence_intervals_that_do_not_cover_mean <- 0
for (i in 1:number_of_simulations) {
sample_from_population <- rnorm(number_of_elements_in_one_sample, mean = m, sd = s)
mean_estimated = mean(sample_from_population)
sd_estimated = 15
error <- (qnorm(0.975) * sd_estimated) / sqrt(number_of_elements_in_one_sample)
left_end <- mean_estimated - error
right_end <- mean_estimated + error
if (right_end < m | left_end > m) {
number_of_confidence_intervals_that_do_not_cover_mean = number_of_confidence_intervals_that_do_not_cover_mean + 1
}
}
Check your estimation.
print( 100 * (number_of_confidence_intervals_that_do_not_cover_mean) / number_of_simulations)
## [1] 5.26
p = number_of_confidence_intervals_that_do_not_cover_mean / number_of_simulations
ME = qnorm(1-0.05/2)*sqrt(p*(1-p)/1000)
print(c(p-ME,p+ME))
## [1] 0.03876409 0.06643591
Q14: Now we assume that each researcher has time to measure only 20 samples. Change variable “number_of_elements_in_one_sample” to 20 and estimate percentage of 95% confidence intervals that do not cover true mean again. Has something changed in the percentage of “wrong” confidence intervals? How to resolve this issue?
number_of_simulations <- 10000
number_of_elements_in_one_sample <- 20
matrix_of_bounds_of_confidence_intervals <- matrix(rep(0, 2 * number_of_simulations), nrow=number_of_simulations, ncol=2)
number_of_confidence_intervals_that_do_not_cover_mean <- 0
for (i in 1:number_of_simulations) {
sample_from_population <- rnorm(number_of_elements_in_one_sample, mean = m, sd = s)
mean_estimated = mean(sample_from_population)
sd_estimated = sd(sample_from_population)
error <- (qnorm(0.975) * sd_estimated) / sqrt(number_of_elements_in_one_sample)
left_end <- mean_estimated - error
right_end <- mean_estimated + error
matrix_of_bounds_of_confidence_intervals[i,1] <- left_end
matrix_of_bounds_of_confidence_intervals[i,2] <- right_end
if (right_end < m | left_end > m) {
number_of_confidence_intervals_that_do_not_cover_mean = number_of_confidence_intervals_that_do_not_cover_mean + 1
}
}
print( 100 * (number_of_confidence_intervals_that_do_not_cover_mean) / number_of_simulations)
## [1] 6.58
p = number_of_confidence_intervals_that_do_not_cover_mean / number_of_simulations
ME = qnorm(1-0.05/2)*sqrt(p*(1-p)/1000)
print(c(p-ME,p+ME))
## [1] 0.05043329 0.08116671
Change quantile from normal quantile to Student’s t 0.975 quantile with replacement qnorm(0.975)
with qt(0.975, number_of_elements_in_one_sample - 1)
.
Q15: (From http://www.stat.columbia.edu/) An outbreak of Salmonella-related illness was attributed to ice cream produced at a certain factory. Scientists measured the level of Salmonella in 9 randomly sampled batches of ice cream. The levels (in MPN/g) were: 0.593, 0.142, 0.329, 0.691, 0.231, 0.793, 0.519, 0.392, 0.418
. Is there evidence that the mean level of Salmonella in the ice cream is greater than 0.3 MPN/g?
Q16: Find an information about r
dataset sleep
. Think about possible designs of studies on this data. Perform your study with your design.