Paired Samples \(t\)-tests

STAT 218 - Week 7, Lecture 4 Lab 5

Let’s Remember Hypothesis Testing Steps

Revising the Steps of Hypothesis Testing

Important

  1. Construct the Hypotheses of \(H_0\) and \(H_A\)
  2. Determine/Identify your \(\alpha\) level
  3. Check the assumptions
  4. Compute test statistic and find the P-value (Interpreting R Output)
  5. Draw a conclusion.

Assumptions - Verification of Conditions

  • It is always important to check first whether the conditions are reasonable in a given case.
  • Here is the list of assumptions that we should be aware of for \(t\)-tests.

1) Random Sampling: the data can be regarded as coming from independently chosen random sample(s),

2) Independence of Observations: the observations should be independent within each sample, and

3) Normal Distribution: Many of the methods depend on the data being from a population that has a normal distribution.

  • REMEMBER! If sample size is large, then condition (3) is less important (Central Limit Theorem).
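The Central Limit Theorem remark can be illustrated with a small simulation. This is only a sketch under stated assumptions: the exponential population, the sample sizes 5 and 50, the seed, and the helper `skewness()` are all chosen for illustration and are not part of the lab.

```r
# Illustrative simulation (not lab data): sample means from a skewed
# population look more symmetric as the sample size grows.
set.seed(218)

# Simple moment-based skewness; 0 means a symmetric distribution.
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

# 5000 sample means from an exponential (right-skewed) population.
means_n5  <- replicate(5000, mean(rexp(5)))
means_n50 <- replicate(5000, mean(rexp(50)))

skewness(means_n5)   # noticeably right-skewed
skewness(means_n50)  # much closer to 0

# Histograms of the two sets of sample means.
hist(means_n5,  main = "Sample means, n = 5",  xlab = "Sample mean")
hist(means_n50, main = "Sample means, n = 50", xlab = "Sample mean")
```

With the larger sample size, the distribution of the sample mean is closer to normal even though the population itself is not, which is why condition (3) matters less for large samples.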

Normality Assumption - I

  • If the only source of information is the data at hand, then normality can be roughly checked by making a histogram and normal quantile plot of the data.

    • Unfortunately, for a small or moderate sample size, this check is fairly rudimentary.
    • If the sample is large, then the visual plots give us good information about the population shape;
      • However, if \(n\) is large, the requirement of normality is less important anyway due to the Central Limit Theorem.
  • In any case, a rudimentary check is better than none, and every data analysis should begin with inspection of a graph of the data, with special attention to any observations that lie very far from the center of the distribution.
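A rough visual check like the one described above might look as follows in base R. The vector `x` is simulated purely for illustration (it is not the lab data); in practice you would replace it with the variable you want to inspect.

```r
# Hypothetical data for illustration only: 30 simulated observations.
set.seed(218)
x <- rnorm(30, mean = 100, sd = 15)

# Histogram: look at the overall shape and for values far from the center.
hist(x, main = "Histogram of x", xlab = "x")

# Normal quantile plot: points close to the reference line suggest
# that a normal model is reasonable.
qqnorm(x)
qqline(x)
```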

Normality Assumption - II

We check assumptions before conducting any statistical analysis. To check the normality assumption, we first need to check the sample size.

  • \(1^{st}\) option - small samples: Check the \(p\)-value of the Shapiro-Wilk test. It is best used with a sample size less than 50 (Shapiro & Wilk, 1965; Uttley, 2019).

  • \(2^{nd}\) option - large samples: Check the visual plots (e.g., histogram, normal quantile plot) if your sample size is 50 or more.

Paired Samples \(t\)-test

To-Do List Before You Start

  • Download the assignment and dataset from Canvas.

    • Mac Users: Downloading a .csv file from Canvas can be tricky. Try opening it in a new tab to download.
    • Windows Users: Congratulations, you are privileged and blessed! Your .csv file should download without a fight (hopefully).
    • Important: Even if you manage to download it, please do not open it on your computer. Macs sometimes automatically change the file extension (yes, really!), which makes it harder to import with this lab’s code.
  • Save both files to your STAT 218 folder (VERY IMPORTANT). Otherwise, you won’t be able to import the dataset.

  • Ensure your assignment and dataset are in the same folder, with file extensions .qmd and .csv.

  • Follow the instructions on this slideshow to complete the assignment.

Introduction


Let’s load the tidyverse library and the data set that we will use today.

library(tidyverse)
stream_data <- read_csv("Example 8.3.4.csv")

IMPORTANT: If you don’t see stream_data in your Environment pane, your dataset and Quarto file are probably not in the same folder!

OR…

You may have downloaded Example 8.3.4.csv multiple times, so its name could look like Example 8.3.4 (2)(1).csv.


Check the file name, correct it if necessary, and try again. Remember, R is very stubborn—it won’t read the data if the name doesn’t match exactly what you typed in your code.

Example of a Case: Pollutants in a stream may accumulate or attenuate as water flows down the stream. In a study to monitor the accumulation and attenuation of fecal contamination in a stream running through cattle rangeland, monthly water specimens were collected at two locations along the stream over a period of 21 months.

The stream data set contains the total coliform count (MPN/100 ml) for each water specimen.

Perform a paired samples \(t\)-test to assess whether the mean total coliform count is consistent across the two locations. Use the 5% significance level (\(\alpha = 0.05\)).
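As a reference for steps 1 and 4, the hypotheses and test statistic for this case can be written as below. The notation is an assumption and does not come from the slides: \(d_i\) is the difference for month \(i\) (taken here as upstream minus downstream), \(\mu_d\) is the population mean difference, and \(\bar{d}\) and \(s_d\) are the sample mean and standard deviation of the \(n\) differences.

\[
H_0: \mu_d = 0 \qquad \text{versus} \qquad H_A: \mu_d \neq 0
\]

\[
t = \frac{\bar{d}}{s_d / \sqrt{n}}, \qquad df = n - 1
\]

The alternative is two-sided because the question asks only whether the mean count is consistent across the two locations, not which location is higher.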

Checking the Normality Assumption

  • Check your sample size first!

    • \(n = 21 < 50\), so we should use the Shapiro-Wilk test.
  • The Shapiro–Wilk test is a statistical method that provides a numerical assessment of evidence for certain types of nonnormality in data.

    • The procedure’s mechanics are complex, but statistical software packages simplify the testing process.
  • The output of the Shapiro–Wilk test includes a P-value. Interpretation:

    • P-value < 0.001: Very strong evidence for nonnormality.
    • P-value < 0.01: Strong evidence for nonnormality.
    • P-value < 0.05: Moderate evidence for nonnormality.
    • P-value < 0.10: Mild or weak evidence for nonnormality.

  • The output below comes from a Shapiro-Wilk test. We generally use the Shapiro-Wilk test for relatively small samples because visual plots can be misleading when the sample size is small. Please interpret the Shapiro-Wilk \(p\)-value.

    Shapiro-Wilk normality test

data:  stream$Difference
W = 0.9641, p-value = 0.6022
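Output of this kind can be produced with the base R function `shapiro.test()` applied to the paired differences. The sketch below is illustrative only. It uses the first eight upstream/downstream pairs that appear in this lab's data preview rather than all 21, so its numbers will not match the output above. The helper `describe_normality_evidence()` is a hypothetical convenience function, and its label for P-values of 0.10 or more is an assumption, since the scale above stops at 0.10.

```r
# First eight pairs from the lab's data preview (the full data set has 21).
upstream   <- c(2909, 2382, 2046, 1483, 4611, 3448, 4106, 2755)
downstream <- c(2359, 3873, 1725,  771, 1529, 2909, 2014, 1872)

# For paired data, normality is checked on the differences.
d <- upstream - downstream

sw <- shapiro.test(d)
sw          # prints W and the p-value
sw$p.value  # the p-value on its own

# Hypothetical helper: turn a p-value into the evidence scale used above.
describe_normality_evidence <- function(p_value) {
  if (p_value < 0.001) {
    "very strong evidence for nonnormality"
  } else if (p_value < 0.01) {
    "strong evidence for nonnormality"
  } else if (p_value < 0.05) {
    "moderate evidence for nonnormality"
  } else if (p_value < 0.10) {
    "mild or weak evidence for nonnormality"
  } else {
    # Assumption: this label is not part of the scale listed in the slides.
    "no compelling evidence for nonnormality"
  }
}

describe_normality_evidence(sw$p.value)
```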

t_test vs. t.test

  • Last week, we used the t_test() function from the infer package to conduct an independent samples t-test.

  • Today, we will use t.test(), which is available by default in R (some BIO courses use it too).

  • Please pay attention to the difference between these two functions.

Interpreting the Output

# Convert to wide format
wide_data <- stream_data %>%
  pivot_wider(names_from = location, values_from = coliform_count)

To conduct the paired samples t-test, I reshaped the data set into wide format so that each location has its own column.

Rows: 21
Columns: 4
$ sample     <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
$ date       <chr> "01/08/2008 11:15", "02/04/2008 10:30", "03/03/2008 10:48",…
$ upstream   <dbl> 2909, 2382, 2046, 1483, 4611, 3448, 4106, 2755, 3448, 2098,…
$ downstream <dbl> 2359.0, 3873.0, 1725.0, 771.0, 1529.0, 2909.0, 2014.0, 1872…

Interpreting the Output


    Paired t-test

data:  wide_data$upstream and wide_data$downstream
t = 4.6092, df = 20, p-value = 0.0001697
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
  603.7724 1602.0562
sample estimates:
mean difference 
       1102.914 
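A call of the following form produces this kind of output. The sketch again uses only the first eight pairs from the data preview (an illustrative subset, not the full 21 pairs), so the numbers differ from the output above. It also recomputes the test statistic and confidence interval by hand to show where they come from. The names `d`, `t_manual`, and `ci_manual` are introduced here for illustration.

```r
# First eight pairs from the lab's data preview (the full data set has 21).
upstream   <- c(2909, 2382, 2046, 1483, 4611, 3448, 4106, 2755)
downstream <- c(2359, 3873, 1725,  771, 1529, 2909, 2014, 1872)

# Paired samples t-test: differences are computed as upstream - downstream.
result <- t.test(upstream, downstream, paired = TRUE,
                 alternative = "two.sided", conf.level = 0.95)
result

# The same quantities computed by hand from the differences.
d  <- upstream - downstream
n  <- length(d)
se <- sd(d) / sqrt(n)                    # standard error of the mean difference

t_manual  <- mean(d) / se                # test statistic, df = n - 1
ci_manual <- mean(d) + c(-1, 1) * qt(0.975, df = n - 1) * se  # 95% CI

t_manual
ci_manual
```

With the full data set, the same call would use the two columns of wide_data in place of the short vectors.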

Conclusion: Type your conclusion statement in your worksheet!

Confidence Interval: Type your confidence interval statement in your worksheet!