STAT 218 - Week 7, Lecture 4 Lab 5
Important
1) Random Sampling: the data can be regarded as coming from independently chosen random sample(s),
2) Independence of Observations: the observations should be independent within each sample, and
3) Normal Distribution: many of the methods depend on the data coming from a population that has a normal distribution.
If the only source of information is the data at hand, then normality can be roughly checked by making a histogram and normal quantile plot of the data.
In any case, a rudimentary check is better than none, and every data analysis should begin with inspection of a graph of the data, with special attention to any observations that lie very far from the center of the distribution.
We check assumptions before conducting any statistical analysis. To check the normality assumption, we first need to check the sample size.
\(1^{st}\) option - small samples: Check the \(p\)-value of the Shapiro–Wilk test. It is best used with a sample size less than 50 (Shapiro & Wilk, 1965; Uttley, 2019).
\(2^{nd}\) option - large samples: Check the visual plots (e.g., histogram, normal quantile plot) if your sample size is more than 50.
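The two options above can be sketched as a small R helper. This is only an illustration under stated assumptions: `check_normality()` and its `cutoff` argument are hypothetical names (not course functions), the data are simulated, and treating a sample of exactly 50 as "large" is an assumption.

```r
# Hypothetical helper, not part of the course materials.
# Picks the normality check based on the sample size.
check_normality <- function(x, cutoff = 50) {
  x <- x[!is.na(x)]                  # drop missing values before counting
  if (length(x) < cutoff) {
    shapiro.test(x)                  # small sample: numerical check
  } else {
    hist(x, main = "Histogram", xlab = "Value")
    qqnorm(x)                        # large sample: visual checks
    qqline(x)
    invisible(NULL)                  # nothing numerical to return
  }
}

set.seed(218)                        # simulated data, for illustration only
small_sample <- rnorm(21)
large_sample <- rnorm(200)

small_result <- check_normality(small_sample)  # Shapiro-Wilk result
small_result$p.value
large_result <- check_normality(large_sample)  # draws the two plots
```

For the small sample, the returned object holds the Shapiro–Wilk \(p\)-value; for the large sample, the plots are meant to be judged by eye.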
Download the assignment and dataset from Canvas.
Downloading a .csv file from Canvas can be tricky. Try opening it in a new tab; the .csv file should then download without a fight (hopefully).
Save both files to your STAT 218 folder (VERY IMPORTANT). Otherwise, you won’t be able to import the dataset.
Ensure your assignment and dataset are in the same folder, with file extensions .qmd and .csv.
Follow the instructions on this slideshow to complete the assignment.
Let’s use library() to load the packages, and then load the data set that we will use today.
IMPORTANT!: If you don’t see stream_data in your Environment Pane, this means that your dataset and Quarto file are not in the same folder!
You may have downloaded Example 8.3.4.csv multiple times, so its name could look like Example 8.3.4 (2)(1).csv.
Check the file name, correct it if necessary, and try again. Remember, R is very stubborn—it won’t read the data if the name doesn’t match exactly what you typed in your code.
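One way to check the file name from R itself is sketched below. It assumes the working directory is your STAT 218 folder and uses base R's `read.csv()` as a stand-in; the reader function used in the assignment file may differ.

```r
csv_file <- "Example 8.3.4.csv"      # must match the file name exactly

# List every .csv file in the current working directory.
csv_files_here <- list.files(pattern = "\\.csv$")
print(csv_files_here)

if (file.exists(csv_file)) {
  # Base R reader, used here as an assumption.
  stream_data <- read.csv(csv_file)
  str(stream_data)                   # quick look at the imported columns
} else {
  message("Could not find '", csv_file, "' in ", getwd())
}
```

If the name printed by `list.files()` differs from the one in your code, for example because of an extra "(1)", rename the file and run the chunk again.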
Example of a Case: Pollutants in a stream may accumulate or attenuate as water flows down the stream. In a study to monitor the accumulation and attenuation of fecal contamination in a stream running through cattle rangeland, monthly water specimens were collected at two locations along the stream over a period of 21 months.
The data set stream records the total coliform count (MPN/100 ml) for each water specimen.
Perform a paired samples \(t\)-test to assess whether the mean total coliform count is consistent across the two locations. Use the 5% significance level (\(\alpha = 0.05\)).
Check your sample size first!
Shapiro–Wilk Test is a statistical method that provides a numerical assessment of evidence for certain types of nonnormality in data.
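The output of the Shapiro–Wilk test includes a \(p\)-value. Interpretation: a small \(p\)-value (below \(\alpha = 0.05\)) is evidence that the data do not come from a normal population, while a larger \(p\)-value means there is no compelling evidence of nonnormality.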
Shapiro-Wilk normality test
data: stream$Difference
W = 0.9641, p-value = 0.6022
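A minimal sketch of how such a check could be run is shown here. It uses only the first seven upstream/downstream pairs that are fully visible in the data preview later in this lab, so its numbers will not match the output for all 21 months. The object names `upstream`, `downstream`, and `difference` are illustrative.

```r
# First seven monthly pairs as printed in the data preview (not all 21).
upstream   <- c(2909, 2382, 2046, 1483, 4611, 3448, 4106)
downstream <- c(2359, 3873, 1725,  771, 1529, 2909, 2014)

# A paired analysis works on the within-pair differences.
difference <- upstream - downstream
n <- length(difference)              # check the sample size first

sw <- shapiro.test(difference)       # base R (stats package)
print(sw)

alpha <- 0.05
if (sw$p.value < alpha) {
  message("Evidence of nonnormality in the differences.")
} else {
  message("No compelling evidence of nonnormality in the differences.")
}
```

Note that the test is applied to the differences, not to the upstream and downstream columns separately, because the paired \(t\)-test assumes normality of the differences.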
Last week, we used the t_test() function from the infer package to conduct an independent-samples t-test.
Today, we will use t.test(), which is available by default in R (some BIO courses use it too).
Please pay attention to the difference between these two functions.
To conduct the paired-samples t-test, I changed the structure of the data set slightly.
Rows: 21
Columns: 4
$ sample <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
$ date <chr> "01/08/2008 11:15", "02/04/2008 10:30", "03/03/2008 10:48",…
$ upstream <dbl> 2909, 2382, 2046, 1483, 4611, 3448, 4106, 2755, 3448, 2098,…
$ downstream <dbl> 2359.0, 3873.0, 1725.0, 771.0, 1529.0, 2909.0, 2014.0, 1872…
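With the data in this wide layout (one row per month, one column per location), a paired test with `t.test()` might look like the sketch below. The data frame `wide_example` is a hypothetical stand-in built from the first seven fully visible pairs in the preview, so its results will differ from those for the full data set.

```r
# Hypothetical stand-in for the wide data: first seven visible pairs only.
wide_example <- data.frame(
  sample     = 1:7,
  upstream   = c(2909, 2382, 2046, 1483, 4611, 3448, 4106),
  downstream = c(2359, 3873, 1725,  771, 1529, 2909, 2014)
)

# Paired-samples t-test: each upstream value is matched to the
# downstream value collected in the same month.
paired_result <- t.test(wide_example$upstream,
                        wide_example$downstream,
                        paired = TRUE)
print(paired_result)

# The same test written as a one-sample t-test on the differences.
diff_result <- t.test(wide_example$upstream - wide_example$downstream,
                      mu = 0)

# Both approaches give the same t statistic.
all.equal(unname(paired_result$statistic), unname(diff_result$statistic))
```

Seeing the paired test as a one-sample test on the differences explains why the degrees of freedom are the number of pairs minus one.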
Paired t-test
data: wide_data$upstream and wide_data$downstream
t = 4.6092, df = 20, p-value = 0.0001697
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
603.7724 1602.0562
sample estimates:
mean difference
1102.914
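The pieces of this output fit together, and that can be checked with a short calculation. The sketch below starts from the rounded values printed above (mean difference, t statistic, and df) and recovers the confidence interval and \(p\)-value; small discrepancies in the last digits come from rounding. The variable names are illustrative.

```r
# Values copied from the paired t-test output above.
mean_diff <- 1102.914
t_stat    <- 4.6092
df        <- 20

# t = mean difference / standard error, so the standard error is:
se <- mean_diff / t_stat

# 95% confidence interval: estimate +/- critical value * standard error.
t_crit <- qt(0.975, df)
ci <- mean_diff + c(-1, 1) * t_crit * se
round(ci, 2)

# Two-sided p-value from the t distribution with 20 df.
p_value <- 2 * pt(-abs(t_stat), df)
signif(p_value, 4)
```

The recomputed interval agrees with the printed one to within rounding, and it does not contain 0, which is consistent with a \(p\)-value below \(\alpha = 0.05\).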
Conclusion: Type your conclusion statement in your worksheet!
Confidence Interval: Type your confidence interval statement in your worksheet!