---
title: "Simple Linear Regression"
subtitle: "University of Florida"
date: 2024-01-25
author: "Anthony Raborn, Psychometrician"
institute: "National Association of Boards of Pharmacy"
format:
  revealjs:
    theme: dark
    css: styles.css
editor_options:
  chunk_output_type: console
---
```{r}
#| label: setup
#| message: false
require(tidyverse)
# https://ufl.zoom.us/j/7362215223
```
## Introduction
::: columns
::: {.column width="40%"}
### Myself
- UF Graduate, 2019
- 2.5 years K-12, 2.5 years C&L
- Reproducible and automated data analysis
:::
::: {.column width="60%"}
### Approach to Instruction
- Reproducible slides and lectures
- Embedded practice within lectures
- Provide multiple solutions to problems (where possible)
  - focusing on reproducibility as much as reasonable!
:::
:::
::: notes
First, a little bit about myself and my approach to teaching.
I graduated from the University of Florida about 5 years ago, with a Ph.D. in Research and Evaluation Methodology and a minor in Statistics. Dr. Manley and Dr. Leite were my advisors.
I worked about two and a half years with Pasco County Schools as a Supervisor, Accountability, Research, and Measurement before my current two and a half years with the National Association of Boards of Pharmacy. NABP is in the certification and licensure industry, and I work on our licensing exams and related data for prospective pharmacists.
My professional focus is on reproducible and automated data analytic processes. If I'm asked to do the same task twice, I begin thinking about how to make that process automatic through programming, which also allows me to have a record to refer to for future requests. Any research I do follows a similar process, and I have a few R packages that came from work with colleagues on improving the flexibility of scripts.
My approach to teaching mirrors this focus. I aim to provide lectures that are easily editable and reproducible. This means that students would have direct access to my lecture slides and notes, and that I can make, track, and share changes to these notes in real time.
I do this by embedding my instruction within the software I use for analysis. For example, this lecture is created within R using Quarto and will be available for everyone to access on my GitHub account.
While I encourage students to follow a similar mindset with statistics and data analysis, I expect to provide support for multiple different approaches, including different software. My aim is to help everyone get to the point of creating reproducible work, if possible, and to support students who are not as technically oriented in producing good work regardless of its reproducibility.
For today, though, I will be working in R exclusively.
:::
## Building upon... {.smaller}
- Mathematical notation
  - bar notation, i notation, sum notation, hat notation
- Normal distribution
  - symmetric, bell-shaped
- Statistical error
  - random differences
- Correlation
  - strength of linear relationship
- Statistical parameters
  - general understanding
::: notes
This lecture assumes that you are familiar with the following topics, as they will be touched upon but not fully explained.
1. Mathematical notation---subscripted i, squares, square roots, sum notation, bar notation, hat notation
2. What the normal distribution is---most importantly, that it is a symmetrical and bell-shaped distribution, meaning that values drawn from this distribution are less likely to be observed the further they are (in either direction) from the mean.
3. What statistical error is---most importantly, that it means random differences between expected or estimated values and observed values, not systematic differences or differences due to things like biased measurements or mis-keyed data.
4. What correlation is---most importantly, that it measures the strength of the linear relationship between two variables.
5. What statistical parameters are---just a general understanding that they provide a mathematical way to describe a statistical model and that analyses try to estimate these values.
:::
## Lesson Objectives
1. Understand what simple linear regression is and when it should be used
2. Estimate and interpret regression coefficients
3. Utilize R for fitting regression models
Next time:
4. Summarize the four main assumptions of regression
5. Read diagnostic plots
6. Perform hypothesis testing on simple linear regression models
::: notes
At the end of this lesson, you should be able to:
- Understand what simple linear regression is and when it should be used (and, conversely, when it shouldn't be used).
- Estimate the model coefficients for regression algebraically, as well as interpret these coefficients.
- Utilize R for fitting regression models.
We won't have enough time to cover these topics, but the next lecture would also add these objectives:
- Summarize the four main assumptions of regression.
- Read and interpret some regression diagnostic plots, identifying obvious violations of the four assumptions.
- Perform hypothesis testing on the coefficients of a fitted regression model.
:::
## What is Simple Linear Regression?
- Statistical (NOT deterministic) model
  - explores how variation in one variable is related to (explained by) variation in a second variable
- Draws a "line of best fit" between variables
  - Geometric $y = mx + b$ vs statistical $y = \beta_0 + \beta_1 x$
- Uses values of one predictor (explanatory, independent) variable *x* to estimate values of one outcome (response, dependent) variable *y*
- For today, *x* and *y* will be continuous
  - (though *x* can be categorical!)
::: notes
Simple linear regression (generally referred to simply as "regression" from here on; other regression models will be named directly) is a statistical model that explains how the variation in one variable is related to the variation in a second variable. Sometimes, the "related to" part is stated as "explained by", which is fine, but keep in mind that "explained by" does NOT mean "caused by".
Statistical models are contrasted with deterministic models, which are perfect predictions of y given x. A good example of this is the relationship between Fahrenheit and Celsius: knowing one gives you an exact value for the other: $F = 32 + 1.8*C$
A regression draws a "line of best fit"---for a specific definition of "best"---between the two variables. This line is in the same form as the geometric line, $y = mx + b$, but in statistical terms we usually refer to this regression line as $y = \beta_0 + \beta_1*x$, with $\beta_0$ and $\beta_1$ referred to as regression parameters.
As you can see from the statistical form, the value of one variable---x---is used to predict the value of the second variable---y. The x variable is interchangeably referred to as predictor, explanatory, or independent, while the y variable is interchangeably referred to as the outcome, response, or dependent variable. The parameters provide the mathematical relationship between the two variables as a straight line.
For today, we will consider cases where both x and y are continuous variables. However, in a simple linear regression, x can also be categorical and the model will still work.
:::
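
------------------------------------------------------------------------

### Sketch: deterministic vs. statistical

To make the "statistical, NOT deterministic" point concrete, here is a small added sketch contrasting an exact conversion (Fahrenheit from Celsius, $F = 32 + 1.8C$) with the same linear trend plus random error. The object names below are made up for illustration.

```{r}
#| label: deterministic-vs-statistical-sketch
#| echo: true
# Deterministic: knowing Celsius gives Fahrenheit exactly
celsius          <- c(0, 10, 20, 30)
fahrenheit_exact <- 32 + 1.8 * celsius

# Statistical: the same linear trend plus random error,
# so knowing x only tells us about the *average* of y
set.seed(42)
fahrenheit_noisy <- 32 + 1.8 * celsius + rnorm(length(celsius), 0, 2)

tibble(celsius, fahrenheit_exact, fahrenheit_noisy)
```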
------------------------------------------------------------------------
### Questions it *can* answer
- statistical, linear relationship between two variables
- predictions of group averages
  - strength of differences between mathematical groups
- predictions of new, individual values
  - generally includes interpolation
::: notes
We can use regression to investigate the statistical, linear relationship between two variables. It gives us predictions of group averages---that is, the average value of the outcome variable for a mathematical (and possibly theoretical) group defined by the predictor variable, and how that average differs between groups. This can be done for observed predictor groups or---if we are careful and the variability of the predictor allows it---unobserved, interpolated predictor groups. Note that outliers or extreme values affect the quality of your interpolation, though we will not investigate that today.
:::
------------------------------------------------------------------------
### Questions it *cannot* answer
- nonlinear relationships
  - polynomial regression
- effects of multiple predictor variables
  - multiple regression
- effects of one variable on many outcome variables
  - multivariate regression
- causal relationships (by itself)
  - additional requirements
- extrapolation
::: notes
A simple linear regression can't answer questions about nonlinear relationships between variables; we can use polynomial regression in some cases, or transformations of the variables in other cases, but in the original units of x and y the relationship must be linear for simple linear regression to be appropriate.
We also can't answer questions with multiple predictor variables (yet; multiple regression can), or with multiple outcome variables (multivariate regression can).
Regression, by itself, CANNOT establish causal relationships (i.e., whether a change in x CAUSES a specific change in average y). There are other conditions that must be satisfied before we can make causal interpretations. For now, we treat all relationships as associations only.
Extrapolating outside the range of your x values is also generally inappropriate. We can use regression to get an idea of how average y values change for x values outside the range observed in the sample, but the further we get from the observed range, the more hazardous and inappropriate the extrapolation becomes. Extrapolate with caution!
:::
## Regression "Line of Best Fit"
### Example data
We have a record of a hypothetical sample of college-aged (18-22) males, with data for their heights (in inches) and weights (in pounds).
The data-generating process for our sample is defined below.
```{r}
#| label: create-height-data
#| echo: true
set.seed(1)
x <- rnorm(100, 70, 4)
y <- 50 + 1.75*x + rnorm(100, 0, 3)
height_data <-
  tibble(height = x
         , weight = y
         )
```
::: notes
We're going to return to this idea of a "line of best fit", but first let's generate some data that we know follows a simple regression model. This data will represent a sample of 100 college-aged (18 yrs old -- 22 yrs old) males, with data collected on their height (in inches) and weight (in pounds).
The R code for generating this data can be seen here.
Next is a scatterplot of the resulting data. What can we tell about the association between height and weight from this plot alone?
(positive correlation; positive/linear association/relationship; as height increases weight increases)
Let's consider how we can draw a line that tells us the most likely weight for someone represented by our sample, given their height.
:::
------------------------------------------------------------------------
```{r}
#| label: scatterplot-height-data
#| fig-cap: 'Male College Students Height (in inches; x-axis) and Weight (in pounds; y-axis)'
#| fig-subcap: 'Basic scatterplot'
height_data %>%
ggplot(
mapping = aes(x = height, y = weight)
) +
geom_point() +
xlab('Height (inches)') +
ylab('Weight (pounds)') +
theme_bw()
```
::: notes
In regression, the line of best fit is estimated with a process known as "least squares estimation". We take it as true, for now, that this is the best way to draw a prediction line for a linear association between two variables.
This process finds an intercept and slope that minimize the squared vertical distance between the line and the observations (this distance is also known as error) for ALL observations simultaneously. Our predictions based on this line are as mathematically close as possible to all the observations in our data. In a moment, we will look at an example to help us understand why SQUARED error is used rather than plain, unsquared error.
:::
------------------------------------------------------------------------
### Line of Best Fit
Least Squares Estimator:
- Minimizes squared distance between line and observations (error) for ALL observations
$$\text{L}_{min} = \min_{\beta_0,\ \beta_1}\left(\Sigma_{i=1}^n (y_i - \hat{y_i})^2\right)$$
::: notes
The equation here defines the problem mathematically. The important part to understand is that we are taking the difference between the observed values, $y_i$, and the values predicted by the regression line, $\hat{y_i}$, as our error. We then square these errors and add them up. We are looking for a line that produces values $\hat{y_i}$ resulting in the smallest possible sum of squared errors in our sample.
:::
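
------------------------------------------------------------------------

### Sketch: SS(e) for a candidate line

To make the least squares criterion concrete, here is a short added sketch that computes the sum of squared errors on `height_data` for a hand-picked candidate line. The intercept and slope values below are hypothetical guesses, not estimates.

```{r}
#| label: sse-candidate-line-sketch
#| echo: true
# Hypothetical candidate line: y-hat = b0 + b1 * x
candidate_b0 <- 40    # guessed intercept (illustration only)
candidate_b1 <- 1.9   # guessed slope (illustration only)

y_hat <- candidate_b0 + candidate_b1 * height_data$height
SSE   <- sum((height_data$weight - y_hat)^2)
SSE
```

Trying different values of `candidate_b0` and `candidate_b1` and watching SS(e) change is the search that least squares estimation solves analytically.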
------------------------------------------------------------------------
### Bad fit:
```{r}
#| label: sample-bad-fit-height-data
#| fig-cap: 'Male College Students Height (x) and Weight (y)'
#| fig-subcap: 'With a poorly-fitting line'
height_data <-
  height_data %>%
  mutate(
    # bad_prediction = 30 + 2*height
    bad_prediction = -.9276 + 2.471429*height
    , bad_error = weight - bad_prediction
    , prediction = lm(weight ~ height) |> predict()
    , error = weight - prediction
    , sum_error_bad = sum(bad_error)
    , sum_error_best = sum(error)
    , SSE_bad = sum(bad_error^2)
    , SSE_best = sum(error^2)
  )
height_data %>%
ggplot(
aes(x = height, y = weight)
) +
geom_point() +
geom_abline(intercept = -.9276, slope = 2.471429, color = 'red') +
geom_linerange(aes(ymax = weight, ymin = bad_prediction)) +
theme_bw() +
ylab("Weight (pounds)") +
xlab("Height (inches)") +
xlim(60, 80) + ylim(150, 200)
```
::: notes
This next graph shows what I am calling a "line of bad fit". The red line looks like it provides reasonable estimates of weight for height values near the center of the distribution, but for both shorter and taller individuals the observations fall noticeably further away from the prediction line. The black lines between the points and the red line are the errors. We could manually manipulate these errors (and I will show you how later), but for now I will record these errors and use them later.
:::
------------------------------------------------------------------------
### Best fit:
```{r}
#| label: sample-good-fit-height-data
#| fig-cap: 'Male College Students Height (x) and Weight (y)'
#| fig-subcap: 'With a mathematically optimal line of best fit'
height_data %>%
ggplot(
aes(x = height, y = weight)
) +
geom_point() +
geom_smooth(method = 'lm', se = FALSE, fullrange=TRUE) +
geom_linerange(aes(ymax = weight, ymin = prediction)) +
theme_bw() +
ylab("Weight (pounds)") +
xlab("Height (inches)") +
xlim(60, 80) + ylim(150, 200)
```
::: notes
It may be hard to compare right now, but this blue line is the line of best fit using least squares estimation. The errors should be noticeably smaller at the upper and lower values of height. Again, the black lines between the points and the blue line are the errors. Let's see how these two lines compare to one another.
:::
------------------------------------------------------------------------
### Both lines
```{r}
#| label: sample-both-fits-height-data
#| fig-cap: 'Male College Students Height (x) and Weight (y)'
#| fig-subcap: 'With Best Fit in Blue and "Bad" Fit in Red'
sum_error_text <-
  paste0(
    "\nS(e) for the red line: "
    , height_data$sum_error_bad[1] |> round(digits = 1)
    , "\nS(e) for the blue line: "
    , height_data$sum_error_best[1] |> round(digits = 1)
    , "\nSS(e) for the red line: "
    , height_data$SSE_bad[1] |> round(digits = 1)
    , "\nSS(e) for the blue line: "
    , height_data$SSE_best[1] |> round(digits = 1)
  )
height_data %>%
ggplot(
aes(x = height, y = weight)
) +
geom_point() +
geom_smooth(method = 'lm', se = FALSE) +
geom_abline(intercept = -.9276, slope = 2.471429, color = 'red') +
geom_text(
x = 75
, y = 160
, label = sum_error_text) +
theme_bw() +
ylab("Weight (pounds)") +
xlab("Height (inches)") +
xlim(60, 80) + ylim(150, 200)
```
::: notes
Again, the red line is the "bad fit" and the blue line is the "best fit"; these are the same lines as in the previous graphs. We also see four values superimposed on the graph: S(e), the sum of errors (notice that these are regular, NOT squared, errors) for both lines, and SS(e), the sum of squared errors for both lines.
Notice that the sum of unsquared errors is zero for both lines, but the sum of squared errors is lower for the blue least squares estimate line. Does everyone see why SS(e) is better than S(e) for regression lines?
(unsquared errors allows positive and negative errors to cancel; multiple possible solutions with unsquared errors)
Is there another method of minimizing errors that you think could also work?
(e.g., absolute errors)
:::
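
------------------------------------------------------------------------

### Sketch: comparing error criteria

One alternative to squared errors that often comes up is absolute error. This added sketch compares the plain, squared, and absolute error sums for the two lines, using the `bad_error` and `error` columns created earlier; the summary names below are new, made up for this comparison.

```{r}
#| label: compare-error-criteria-sketch
#| echo: true
# Plain errors cancel out for both lines; squared and absolute errors do not
height_data %>%
  summarise(
    S_e_bad   = sum(bad_error)         # sum of plain errors, "bad" line
    , S_e_best  = sum(error)           # sum of plain errors, least squares line
    , SS_e_bad  = sum(bad_error^2)     # sum of squared errors, "bad" line
    , SS_e_best = sum(error^2)         # sum of squared errors, least squares line
    , SAE_bad   = sum(abs(bad_error))  # sum of absolute errors, "bad" line
    , SAE_best  = sum(abs(error))      # sum of absolute errors, least squares line
  )
```

Minimizing absolute errors (least absolute deviations) is a legitimate alternative criterion; least squares is the one this lecture develops.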
## Deriving the Least Squares Estimator as the Line of Best Fit
<div>
- Population model:
  - $y = \beta_0 + \beta_1x$
  - $y_i = \beta_0 + \beta_1 x_i + \epsilon_i$
- Estimation: $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$
- Minimize the squared errors, defined as: $\text{L} = \Sigma_{i=1}^n \epsilon_i^2 = \Sigma_{i=1}^n (y_i - \hat{y_i})^2 = \Sigma_{i=1}^n(y_i - \hat{\beta_0} - \hat{\beta_1} x_i)^2$
</div>
::: notes
(make sure discussion of squared errors happened)
Again, here is the statistical equation for a regression line for our observations, with the addition of the error terms and the subscript for individual observations.
Remember that to show that we are moving from the population model to an estimation, we add "hats" to the values that are estimated. Now, our observed x values are used to estimate $\hat{y}$ values with our estimated regression parameters, $\hat{\beta}_0$ and $\hat{\beta}_1$ .
With a bit of algebra, we can redefine the Least Squares Estimator L as a function of the estimated regression line and the observed y values.
Any questions so far?
From here, we will see some of the derivation. If you are unfamiliar or out of practice with calculus, don't worry; we will only consider the final equations moving forward. If you are interested in fully understanding the math (for example, if you want to take classes with the Statistics department), you should work on understanding these steps.
:::
------------------------------------------------------------------------
### Derivatives:
$\beta_0$: $$\frac{\partial\text{L}}{\partial\beta_0} = -2\Sigma_{i=1}^n(y_i - \hat{\beta_0} - \hat{\beta_1} x_i) = 0$$ $$n\hat{\beta_0} + \hat{\beta_1}\Sigma_{i=1}^n x_i = \Sigma_{i=1}^n y_i$$
$\beta_1$: $$\frac{\partial\text{L}}{\partial\beta_1} = -2\Sigma_{i=1}^n(y_i - \hat{\beta_0} - \hat{\beta_1} x_i)x_i = 0$$ $$\hat{\beta_0}\Sigma_{i=1}^n x_i + \hat{\beta_1}\Sigma_{i=1}^n x^2_i = \Sigma_{i=1}^n y_i x_i$$
::: notes
This slide shows the results of taking the partial derivatives of the sum of squared errors, with respect to $\beta_0$ and $\beta_1$ respectively and setting the result to 0 (as we are looking for the minimum of the sum of squares). With some extra algebra, we get the following algebraic solution to the least squares estimator for simple linear regression.
:::
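
------------------------------------------------------------------------

### Sketch: checking the normal equations numerically

As an added, optional cross-check (not part of the original derivation), the two normal equations can be written in matrix form, $(X'X)\beta = X'y$, and solved directly with base R functions. The names `X_mat` and `y_vec` are made up here to avoid clashing with earlier objects.

```{r}
#| label: normal-equations-matrix-sketch
#| echo: true
# Design matrix: a column of 1s (for the intercept) and the heights
X_mat <- cbind(1, height_data$height)
y_vec <- height_data$weight

# Solve (X'X) beta = X'y; the result is (beta0-hat, beta1-hat)
solve(t(X_mat) %*% X_mat, t(X_mat) %*% y_vec)
```

These values should match the algebraic estimates on the next slide and the `lm()` output later in the lecture.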
------------------------------------------------------------------------
### Algebraic Estimates of Least Squares
$$\hat{\beta_0} = \bar{y} - \hat{\beta_1} \bar{x}$$
$$\hat{\beta_1} = \frac{\Sigma_{i=1}^n(x_i-\bar{x})(y_i-\bar{y})}{\Sigma_{i=1}^n(x_i-\bar{x})^2}$$
::: notes
Using these equations, we now have a way to estimate the slope and intercept of a regression equation for any data set with two continuous variables (and other situations we aren't considering at the moment).
Notice that the intercept, $\hat{\beta_0}$, depends on the slope, $\hat{\beta_1}$, so if we are using these equations we need to solve for $\hat{\beta_1}$ first, then plug that estimate into the equation for $\hat{\beta_0}$.
We will practice using these equations in a little bit.
:::
## Review of Regression Coefficients
::: incremental
- $\beta_0$: intercept
  - Estimated $\bar{y}$ when $x = 0$
:::
::: incremental
- $\beta_1$: regression slope
  - a one-point change in $x$ results in a $\beta_1$-point change in the average of $y$
    - note: "results in" does NOT mean "causes"!
  - a units-adjusted transformation of the (Pearson) correlation
:::
::: notes
(Knowledge Check)
With the formulas to algebraically estimate the regression coefficients completed, let's review what the parameters are and what they mean.
$\beta_0$ is the intercept. It is interpreted as the expected or mean value of the outcome variable y when the predictor variable x equals zero.
$\beta_1$ is the regression coefficient or regression slope. It is interpreted as the change in the expected or mean y value resulting from a one-unit change in x (as in, increasing x by 1 in whatever units it is measured in). Note that "results from" doesn't mean "caused by", as we are not making causal claims right now!
An interesting mathematical fact: the regression slope in a simple linear regression is a units-adjusted transformation of the Pearson correlation; that is, for a given data set, $\beta_1$ and the correlation are directly related!
:::
------------------------------------------------------------------------
### Relationship between $\beta_1$ and Correlation
```{r}
#| label: showing-height-data-again
#| fig-cap: 'Male College Students Height (in inches; x-axis) and Weight (in pounds; y-axis)'
#| fig-subcap: 'Basic scatterplot'
height_data %>%
ggplot(
mapping = aes(x = height, y = weight)
) +
geom_point() +
xlab('Height (inches)') +
ylab('Weight (pounds)') +
theme_bw()
```
::: notes
Let's briefly show how these are related.
First, take a look at the data we originally created. How would you characterize the correlation between height and weight? If you had to estimate the correlation coefficient, what value would you use?
(characterized: strong, positive; estimate: anything greater than .7 but not 1.0, definitely not negative)
:::
------------------------------------------------------------------------
Pearson Correlation: $$r_{(x,y)} = \frac{\Sigma^n_{i=1}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\Sigma_{i=1}^n(x_i-\bar{x})^2\Sigma_{i=1}^n(y_i-\bar{y})^2}} = \frac{\text{cov}(x,y)}{\text{sd}(x)\text{sd}(y)}$$
Regression slope: $$\hat{\beta_1} = \frac{\Sigma_{i=1}^n(x_i-\bar{x})(y_i-\bar{y})}{\Sigma_{i=1}^n(x_i-\bar{x})^2}=\frac{\text{cov}(x,y)}{\text{var}(x)}$$
::: notes
Let's revisit the equation for the Pearson correlation. Remember that correlation is unitless and can range from negative 1 to positive 1 but cannot go beyond this range.
Let's compare that equation to the algebraic equation for beta_1. It should be evident that beta_1 is in units of y. Why?
(when x increases by 1 unit, y increases by beta_1 units, so beta_1 is in y units over x units. Mathematically, cov(x,y) is in x\*y units, and var(x) is in x\^2 units, so we get y/x units. However, we generally interpret beta_1 in one-x-unit changes, so the x units in the denominator are not important for interpretation)
Both values have the covariance between x and y in the numerator; the denominators differ---the variance of x for the slope versus the product of the standard deviations of x and y for the correlation. Some algebra relates the two (but we'll skip over the specifics).
:::
------------------------------------------------------------------------
::: columns
::: {.column width="40%"}
Relating the two:

$r = \hat{\beta_1}*\frac{\sqrt{\Sigma_{i=1}^n(x_i-\bar{x})^2}}{\sqrt{\Sigma_{i=1}^n(y_i-\bar{y})^2}}$
:::

::: {.column width="60%"}
```{r}
#| label: showing-relationship-between-slope-and-correlation
#| echo: true
#| code-line-numbers: false
#| code-overflow: wrap
#| code-fold: show
pearson_correlation <-
  cor(height_data$height, height_data$weight) |>
  round(3)
regression_slope <-
  lm(weight ~ height, data = height_data)$coefficients[2] |>
  round(3)
numerator <-
  (height_data$height - mean(height_data$height))^2 |>
  sum() |>
  sqrt()
denominator <-
  (height_data$weight - mean(height_data$weight))^2 |>
  sum() |>
  sqrt()
units_adjustment <-
  (numerator / denominator) |> round(3)
```
```{r}
#| label: print-relationship-between-cor-and-reg
paste0(
  "r = ", pearson_correlation,
  "\nbeta_1 = ", regression_slope,
  "\nunits-adjustment = ", units_adjustment,
  "\nr = beta_1 * units-adjustment?: ",
  (regression_slope * units_adjustment) |>
    round(3)
) |>
  cat()
```
:::
:::
::: notes
On the left, we see the results from the algebraic manipulation I alluded to. By multiplying $\beta_1$ by the units-adjustment value, sqrt(SS(x)) / sqrt(SS(y)), the two values should be the same.
On the right is some R code that performs these calculations automatically. The `cor` function should look familiar. I took a shortcut and used the `lm` function instead of calculating $\beta_1$ myself, but we will see on the next slide that this result is functionally equivalent to the algebraic calculations on the same data. And, after calculating the units-adjustment value for the data and multiplying it with $\beta_1$, we see that in fact the correlation and regression slope are related.
:::
## Perform Simple Regression in R
### Algebraically
```{r}
#| label: regression-code-algebra-sample
#| echo: true
mean_x <-
  mean(height_data$height)
mean_y <-
  mean(height_data$weight)
SS_x <-
  sum((height_data$height - mean_x)^2)
S_xy <-
  sum((height_data$height - mean_x) * (height_data$weight - mean_y))
beta_1 <-
  S_xy / SS_x
beta_0 <-
  mean_y - beta_1*mean_x
print(paste0("Estimated Beta1: ", beta_1 |> round(digits = 2)))
print(paste0("Estimated Beta0: ", beta_0 |> round(digits = 2)))
```
::: notes
Here is an example of how to perform regression algebraically in R. This code snippet calculates the values we need for the least squares estimates of our regression coefficients using the data we simulated at the beginning of class. Note that the estimates for our coefficients are almost identical to the coefficients we used to simulate the data.
:::
------------------------------------------------------------------------
### With `lm`
```{r}
#| label: regression-code-lm-sample
#| echo: true
lm_height_weight <-
lm(weight ~ height, data = height_data)
summary(lm_height_weight)
```
::: notes
Here is another look at how we can get the same information using the `lm()` function built into R. By fitting the model, saving it to the `lm_height_weight` object, then using the `summary()` function on this object, we get a printout of the regression model, including a multitude of statistics. We will look further into these statistics in another lecture, so for now look for the Coefficients and their respective Estimate values. These values are basically equivalent to the values we found algebraically and to the values we used to simulate the data.
:::
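
------------------------------------------------------------------------

### Sketch: using the fitted model

As a small added usage sketch, here is how the fitted `lm_height_weight` object can be used to pull out the estimated coefficients and to predict average weights for new heights. `coef()` and `predict()` are standard functions for `lm` objects; the heights 65 and 72 are arbitrary example values chosen to stay inside the observed range.

```{r}
#| label: lm-coef-predict-sketch
#| echo: true
# Estimated intercept (beta0-hat) and slope (beta1-hat)
coef(lm_height_weight)

# Predicted average weight for two heights inside the observed range
predict(lm_height_weight, newdata = data.frame(height = c(65, 72)))
```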
## Practice
```{=html}
<iframe width='100%' height='80%' src='https://4kve1c-anthony0raborn.shinyapps.io/simple-linear-regression/'>
</iframe>
```
::: notes
Now let's practice what we've learned so far about regression.
This is a Shiny app, built in R, that generates random data, plots the data, and lets us pick a line that we think fits the data. It also provides information about the data (e.g., the means of the variables, the sums of squares for x and y individually, and the covariance between x and y). In case you don't remember how to use these values to algebraically solve for the regression coefficients, there is a hint button you can press that will show them to you. It also shows our sum of squared errors so we know how well our guess (or our math!) does at minimizing these errors.
Let's generate some data and see how well we do!
(spend about 5 minutes)
:::
## End
Questions?
Lecture available at:
<https://anthony-raborn.quarto.pub/introduction-to-statistics/>
::: notes
Any questions?
(can I find a picture to add here?)
:::