---
title: "Simple Linear Regression"
subtitle: "University of Florida"
date: 2024-01-25
author: "Anthony Raborn, Psychometrician"
institute: "National Association of Boards of Pharmacy"
format:
  revealjs:
    theme: dark
    css: styles.css
editor_options:
  chunk_output_type: console
---
```{r}
#| label: setup
#| message: false
require(tidyverse)
# https://ufl.zoom.us/j/7362215223
```
## Introduction
::: columns
::: {.column width="40%"}
### Myself
- UF Graduate, 2019
- 2.5 years K-12, 2.5 years C&L
- Reproducible and automated data analysis
:::
::: {.column width="60%"}
### Approach to Instruction
- Reproducible slides and lectures
- Embedded practice within lectures
- Provide multiple solutions to problems (where possible)
  - focusing on reproducibility as much as reasonable!
:::
:::
::: notes
First, a little bit about myself and my approach to teaching.
I graduated from the University of Florida about 5 years ago, with a Ph.D. in Research and Evaluation Methodology and a minor in Statistics. Dr. Manley and Dr. Leite were my advisors.
I worked about two and a half years with Pasco County Schools as a Supervisor, Accountability, Research, and Measurement before my current two and a half years with the National Association of Boards of Pharmacy. NABP is in the certification and licensure industry, and I work on our licensing exams and related data for prospective pharmacists.
My professional focus is on reproducible and automated data analytic processes. If I'm asked to do the same task twice, I begin thinking about how to make that process automatic through programming, which also allows me to have a record to refer to for future requests. Any research I do follows a similar process, and I have a few R packages that came from work with colleagues on improving the flexibility of scripts.
My approach to teaching mirrors this focus. I aim to provide lectures that are easily editable and reproducible. This means that students would have direct access to my lecture slides and notes, and that I can make, track, and share changes to these notes in real time.
I do this by embedding my instruction within the software I use for analysis. For example, this lecture is created within R using Quarto and will be available for everyone to access on my GitHub account.
While I encourage students to follow a similar mindset with statistics and data analysis, I expect to provide support for multiple different approaches, including different software. My aim is to help everyone get to the point of creating reproducible work, if possible, and to support students who are not as technically oriented in producing good work regardless of its reproducibility.
For today, though, I will be working in R exclusively.
:::
## Building upon... {.smaller}
- Mathematical notation
  - bar notation, i notation, sum notation, hat notation
- Normal distribution
  - symmetric, bell-shaped
- Statistical error
  - random differences
- Correlation
  - strength of linear relationship
- Statistical parameters
  - general understanding
::: notes
This lecture assumes that you are familiar with the following topics, as they will be touched upon but not fully explained.
1. Mathematical notation---subscripted i, squares, square roots, sum notation, bar notation, hat notation
2. What the normal distribution is---most importantly, that it is a symmetrical and bell-shaped distribution, meaning that values drawn from this distribution are less likely to be observed the further they are (in either direction) from the mean.
3. What statistical error is---most importantly, that it means random differences between expected or estimated values and observed values, not systematic differences or differences due to things like biased measurements or mis-keyed data.
4. What correlation is---most importantly, that it measures the strength of the linear relationship between two variables.
5. What statistical parameters are---just a general understanding that they provide a mathematical way to describe a statistical model and that analyses try to estimate these values.
:::
## Lesson Objectives
1. Understand what simple linear regression is and when it should be used
2. Estimate and interpret regression coefficients
3. Utilize R for fitting regression models
Next time:
4. Summarize the four main assumptions of regression
5. Read diagnostic plots
6. Perform hypothesis testing on simple linear regression models
::: notes
At the end of this lesson, you should be able to:
- Understand what simple linear regression is and when it should be used (and, conversely, when it shouldn't be used).
- Estimate the model coefficients for regression algebraically, as well as interpret these coefficients.
- Utilize R for fitting regression models.
We won't have enough time to cover these topics, but the next lecture would also add these objectives:
- Summarize the four main assumptions of regression.
- Read and interpret some regression diagnostic plots, identifying obvious violations of the four assumptions.
- Perform hypothesis testing on the coefficients of a fitted regression model.
:::
## What is Simple Linear Regression?
- Statistical (NOT deterministic) model
  - explores how variation in one variable is related to (explained by) variation in a second variable
- Draws a "line of best fit" between variables
  - Geometric $y = mx + b$ vs statistical $y = \beta_0 + \beta_1 x$
- Uses values of one predictor (explanatory, independent) variable *x* to estimate values of one outcome (response, dependent) variable *y*
- For today, *x* and *y* will be continuous
  - (though *x* can be categorical!)
::: notes
Simple linear regression (generally referred to simply as "regression" from here on; other regression models will be named directly) is a statistical model that explains how the variation in one variable is related to the variation in a second variable. Sometimes, the "related to" part is stated as "explained by", which is fine, but keep in mind that "explained by" does NOT mean "caused by".
Statistical models are contrasted with deterministic models, which are perfect predictions of y given x. A good example of this is the relationship between Fahrenheit and Celsius: knowing one gives you an exact value for the other: $F = 32 + 1.8*C$
A regression draws a "line of best fit"---for a specific definition of "best"---between the two variables. This line is in the same form as the geometric line, $y = mx + b$, but in statistical terms we usually refer to this regression line as $y = \beta_0 + \beta_1*x$, with $\beta_0$ and $\beta_1$ referred to as regression parameters.
As you can see from the statistical form, the value of one variable---x---is used to predict the value of the second variable---y. The x variable is interchangeably referred to as predictor, explanatory, or independent, while the y variable is interchangeably referred to as the outcome, response, or dependent variable. The parameters provide the mathematical relationship between the two variables as a straight line.
For today, we will consider cases where both x and y are continuous variables. However, in a simple linear regression, x can also be categorical and the model will still work.
:::
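
------------------------------------------------------------------------

### Sketch: deterministic vs. statistical

To make the "statistical, NOT deterministic" point concrete, here is a small added sketch contrasting an exact conversion (Fahrenheit from Celsius, $F = 32 + 1.8C$) with the same linear trend plus random error. The object names below are made up for illustration.

```{r}
#| label: deterministic-vs-statistical-sketch
#| echo: true
# Deterministic: knowing Celsius gives Fahrenheit exactly
celsius          <- c(0, 10, 20, 30)
fahrenheit_exact <- 32 + 1.8 * celsius

# Statistical: the same linear trend plus random error,
# so knowing x only tells us about the *average* of y
set.seed(42)
fahrenheit_noisy <- 32 + 1.8 * celsius + rnorm(length(celsius), 0, 2)

tibble(celsius, fahrenheit_exact, fahrenheit_noisy)
```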
------------------------------------------------------------------------
### Questions it *can* answer
- statistical, linear relationship between two variables
- predictions of group averages
  - strength of differences between mathematical groups
- predictions of new, individual values
  - generally includes interpolation
::: notes
We can use regression to investigate the statistical, linear relationship between two variables. It gives us predictions of group averages---that is, the average value of the outcome variable for a mathematical (and possibly theoretical) group defined by the predictor variable, and how that average differs between groups. This can be done for observed predictor groups or---if we are careful and the variability of the predictor allows it---unobserved, interpolated predictor groups. Note that outliers or extreme values affect the quality of your interpolation, though we will not investigate that today.
:::
------------------------------------------------------------------------
### Questions it *cannot* answer
- nonlinear relationships
  - polynomial regression
- effects of multiple predictor variables
  - multiple regression
- effects of one variable on many outcome variables
  - multivariate regression
- causal relationships (by itself)
  - additional requirements
- extrapolation
::: notes
A simple linear regression can't answer questions about nonlinear relationships between variables; we can use polynomial regression in some cases, or transformations of the variables in other cases, but in the original units of x and y the relationship must be linear for simple linear regression to be appropriate.
We also can't answer questions with multiple predictor variables (yet; multiple regression can), or with multiple outcome variables (multivariate regression can).
Regression, by itself, CANNOT establish causal relationships (i.e., whether a change in x CAUSES a specific change in average y). There are other conditions that must be satisfied before we can make causal interpretations. For now, we treat all relationships as associations only.
Extrapolating outside the range of your x values is also generally inappropriate. We can use regression to get an idea of how average y values change for x values outside the range observed in the sample, but the further we get from the observed range, the more hazardous and inappropriate the extrapolation becomes. Extrapolate with caution!
:::
## Regression "Line of Best Fit"
### Example data
We have a record of a hypothetical sample of college-aged (18-22) males, with data for their heights (in inches) and weights (in pounds).
The data-generating process for our sample is defined below.
```{r}
#| label: create-height-data
#| echo: true
set.seed(1)
x <- rnorm(100, 70, 4)
y <- 50 + 1.75*x + rnorm(100, 0, 3)
height_data <-
  tibble(height = x
         , weight = y
         )
```
::: notes
We're going to return to this idea of a "line of best fit", but first let's generate some data that we know follows a simple regression model. This data will represent a sample of 100 college-aged (18 yrs old -- 22 yrs old) males, with data collected on their height (in inches) and weight (in pounds).
The R code for generating this data can be seen here.
Next is a scatterplot of the resulting data. What can we tell about the association between height and weight from this plot alone?
(positive correlation; positive/linear association/relationship; as height increases weight increases)
Let's consider how we can draw a line that tells us the most likely weight for someone represented by our sample, given their height.
:::
------------------------------------------------------------------------
```{r}
#| label: scatterplot-height-data
#| fig-cap: 'Male College Students Height (in inches; x-axis) and Weight (in pounds; y-axis)'
#| fig-subcap: 'Basic scatterplot'
height_data %>%
ggplot(
mapping = aes(x = height, y = weight)
) +
geom_point() +
xlab('Height (inches)') +
ylab('Weight (pounds)') +
theme_bw()
```
::: notes
In regression, the line of best fit is estimated with a process known as "least squares estimation". We take it as true, for now, that this is the best way to draw a prediction line for a linear association between two variables.
This process finds an intercept and slope that minimize the squared vertical distance between the line and the observations (this distance is also known as error) for ALL observations simultaneously. Our predictions based on this line are as mathematically close as possible to all the observations in our data. In a moment, we will look at an example to help us understand why SQUARED error is used rather than plain, unsquared error.
:::
------------------------------------------------------------------------
### Line of Best Fit
Least Squares Estimator:
- Minimizes squared distance between line and observations (error) for ALL observations
$$\text{L}_{min} = \min_{\beta_0,\ \beta_1}\left(\Sigma_{i=1}^n (y_i - \hat{y_i})^2\right)$$
::: notes
The equation here defines the problem mathematically. The important part to understand is that we are taking the difference between the observed values, $y_i$, and the values predicted by the regression line, $\hat{y_i}$, as our error. We then square these errors and add them up. We are looking for a line that produces values $\hat{y_i}$ resulting in the smallest possible sum of squared errors in our sample.
:::
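
------------------------------------------------------------------------

### Sketch: SS(e) for a candidate line

To make the least squares criterion concrete, here is a short added sketch that computes the sum of squared errors on `height_data` for a hand-picked candidate line. The intercept and slope values below are hypothetical guesses, not estimates.

```{r}
#| label: sse-candidate-line-sketch
#| echo: true
# Hypothetical candidate line: y-hat = b0 + b1 * x
candidate_b0 <- 40    # guessed intercept (illustration only)
candidate_b1 <- 1.9   # guessed slope (illustration only)

y_hat <- candidate_b0 + candidate_b1 * height_data$height
SSE   <- sum((height_data$weight - y_hat)^2)
SSE
```

Trying different values of `candidate_b0` and `candidate_b1` and watching SS(e) change is the search that least squares estimation solves analytically.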
------------------------------------------------------------------------
### Bad fit:
```{r}
#| label: sample-bad-fit-height-data
#| fig-cap: 'Male College Students Height (x) and Weight (y)'
#| fig-subcap: 'With a poorly-fitting line'
height_data <-
  height_data %>%
  mutate(
    # bad_prediction = 30 + 2*height
    bad_prediction = -.9276 + 2.471429*height
    , bad_error = weight - bad_prediction
    , prediction = lm(weight ~ height) |> predict()
    , error = weight - prediction
    , sum_error_bad = sum(bad_error)
    , sum_error_best = sum(error)
    , SSE_bad = sum(bad_error^2)
    , SSE_best = sum(error^2)
  )
height_data %>%
ggplot(
aes(x = height, y = weight)
) +
geom_point() +
geom_abline(intercept = -.9276, slope = 2.471429, color = 'red') +
geom_linerange(aes(ymax = weight, ymin = bad_prediction)) +
theme_bw() +
ylab("Weight (pounds)") +
xlab("Height (inches)") +
xlim(60, 80) + ylim(150, 200)
```
::: notes
This next graph shows what I am calling a "line of bad fit". The red line looks like it provides reasonable estimates of weight for height values near the center of the distribution, but for both shorter and taller individuals the observations fall noticeably further away from the prediction line. The black lines between the points and the red line are the errors. We could manually manipulate these errors (and I will show you how later), but for now I will record these errors and use them later.
:::
------------------------------------------------------------------------
### Best fit:
```{r}
#| label: sample-good-fit-height-data
#| fig-cap: 'Male College Students Height (x) and Weight (y)'
#| fig-subcap: 'With a mathematically optimal line of best fit'
height_data %>%
ggplot(
aes(x = height, y = weight)
) +
geom_point() +
geom_smooth(method = 'lm', se = FALSE, fullrange=TRUE) +
geom_linerange(aes(ymax = weight, ymin = prediction)) +
theme_bw() +
ylab("Weight (pounds)") +
xlab("Height (inches)") +
xlim(60, 80) + ylim(150, 200)
```
::: notes
It may be hard to compare right now, but this blue line is the line of best fit using least squares estimation. The errors should be noticeably smaller at the upper and lower values of height. Again, the black lines between the points and the blue line are the errors. Let's see how these two lines compare to one another.
:::
------------------------------------------------------------------------
### Both lines
```{r}
#| label: sample-both-fits-height-data
#| fig-cap: 'Male College Students Height (x) and Weight (y)'
#| fig-subcap: 'With Best Fit in Blue and "Bad" Fit in Red'
sum_error_text <-
  paste0(
    "\nS(e) for the red line: "
    , height_data$sum_error_bad[1] |> round(digits = 1)
    , "\nS(e) for the blue line: "
    , height_data$sum_error_best[1] |> round(digits = 1)
    , "\nSS(e) for the red line: "
    , height_data$SSE_bad[1] |> round(digits = 1)
    , "\nSS(e) for the blue line: "
    , height_data$SSE_best[1] |> round(digits = 1)
  )
height_data %>%
ggplot(
aes(x = height, y = weight)
) +
geom_point() +
geom_smooth(method = 'lm', se = FALSE) +
geom_abline(intercept = -.9276, slope = 2.471429, color = 'red') +
geom_text(
x = 75
, y = 160
, label = sum_error_text) +
theme_bw() +
ylab("Weight (pounds)") +
xlab("Height (inches)") +
xlim(60, 80) + ylim(150, 200)
```
::: notes
Again, the red line is the "bad fit" and the blue line is the "best fit"; these are the same lines as in the previous graphs. We also see four values superimposed on the graph: S(e), the sum of errors (notice that these are regular, NOT squared, errors) for both lines, and SS(e), the sum of squared errors for both lines.
Notice that the sum of unsquared errors is zero for both lines, but the sum of squared errors is lower for the blue least squares estimate line. Does everyone see why SS(e) is better than S(e) for regression lines?
(unsquared errors allows positive and negative errors to cancel; multiple possible solutions with unsquared errors)
Is there another method of minimizing errors that you think could also work?
(e.g., absolute errors)
:::
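
------------------------------------------------------------------------

### Sketch: comparing error criteria

One alternative to squared errors that often comes up is absolute error. This added sketch compares the plain, squared, and absolute error sums for the two lines, using the `bad_error` and `error` columns created earlier; the summary names below are new, made up for this comparison.

```{r}
#| label: compare-error-criteria-sketch
#| echo: true
# Plain errors cancel out for both lines; squared and absolute errors do not
height_data %>%
  summarise(
    S_e_bad   = sum(bad_error)         # sum of plain errors, "bad" line
    , S_e_best  = sum(error)           # sum of plain errors, least squares line
    , SS_e_bad  = sum(bad_error^2)     # sum of squared errors, "bad" line
    , SS_e_best = sum(error^2)         # sum of squared errors, least squares line
    , SAE_bad   = sum(abs(bad_error))  # sum of absolute errors, "bad" line
    , SAE_best  = sum(abs(error))      # sum of absolute errors, least squares line
  )
```

Minimizing absolute errors (least absolute deviations) is a legitimate alternative criterion; least squares is the one this lecture develops.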
## Deriving the Least Squares Estimator as the Line of Best Fit
<div>
- Population model:
  - $y = \beta_0 + \beta_1x$
  - $y_i = \beta_0 + \beta_1 x_i + \epsilon_i$
- Estimation: $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$
- Minimize the squared errors, defined as: $\text{L} = \Sigma_{i=1}^n \epsilon_i^2 = \Sigma_{i=1}^n (y_i - \hat{y_i})^2 = \Sigma_{i=1}^n(y_i - \hat{\beta_0} - \hat{\beta_1} x_i)^2$
</div>
::: notes
(make sure discussion of squared errors happened)
Again, here is the statistical equation for a regression line for our observations, with the addition of the error terms and the subscript for individual observations.
Remember that to show that we are moving from the population model to an estimation, we add "hats" to the values that are estimated. Now, our observed x values are used to estimate $\hat{y}$ values with our estimated regression parameters, $\hat{\beta}_0$ and $\hat{\beta}_1$ .
With a bit of algebra, we can redefine the Least Squares Estimator L as a function of the estimated regression line and the observed y values.
Any questions so far?
From here, we will see some of the derivation. If you are unfamiliar or out of practice with calculus, don't worry; we will only consider the final equations moving forward. If you are interested in fully understanding the math (for example, if you want to take classes with the Statistics department), you should work on understanding these steps.
:::
------------------------------------------------------------------------
### Derivatives:
$\beta_0$: $$\frac{\partial\text{L}}{\partial\beta_0} = -2\Sigma_{i=1}^n(y_i - \hat{\beta_0} - \hat{\beta_1} x_i) = 0$$ $$n\hat{\beta_0} + \hat{\beta_1}\Sigma_{i=1}^n x_i = \Sigma_{i=1}^n y_i$$
$\beta_1$: $$\frac{\partial\text{L}}{\partial\beta_1} = -2\Sigma_{i=1}^n(y_i - \hat{\beta_0} - \hat{\beta_1} x_i)x_i = 0$$ $$\hat{\beta_0}\Sigma_{i=1}^n x_i + \hat{\beta_1}\Sigma_{i=1}^n x^2_i = \Sigma_{i=1}^n y_i x_i$$
::: notes
This slide shows the results of taking the partial derivatives of the sum of squared errors, with respect to $\beta_0$ and $\beta_1$ respectively and setting the result to 0 (as we are looking for the minimum of the sum of squares). With some extra algebra, we get the following algebraic solution to the least squares estimator for simple linear regression.
:::
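
------------------------------------------------------------------------

### Sketch: checking the normal equations numerically

As an added, optional cross-check (not part of the original derivation), the two normal equations can be written in matrix form, $(X'X)\beta = X'y$, and solved directly with base R functions. The names `X_mat` and `y_vec` are made up here to avoid clashing with earlier objects.

```{r}
#| label: normal-equations-matrix-sketch
#| echo: true
# Design matrix: a column of 1s (for the intercept) and the heights
X_mat <- cbind(1, height_data$height)
y_vec <- height_data$weight

# Solve (X'X) beta = X'y; the result is (beta0-hat, beta1-hat)
solve(t(X_mat) %*% X_mat, t(X_mat) %*% y_vec)
```

These values should match the algebraic estimates on the next slide and the `lm()` output later in the lecture.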
------------------------------------------------------------------------
### Algebraic Estimates of Least Squares
$$\hat{\beta_0} = \bar{y} - \hat{\beta_1} \bar{x}$$
$$\hat{\beta_1} = \frac{\Sigma_{i=1}^n(x_i-\bar{x})(y_i-\bar{y})}{\Sigma_{i=1}^n(x_i-\bar{x})^2}$$
::: notes
Using these equations, we now have a way to estimate the slope and intercept of a regression equation for any data set with two continuous variables (and other situations we aren't considering at the moment).
Notice that the intercept, $\hat{\beta_0}$, depends on the slope, $\hat{\beta_1}$, so if we are using these equations we need to solve for $\hat{\beta_1}$ first, then plug that estimate into the equation for $\hat{\beta_0}$.
We will practice using these equations in a little bit.
:::
## Review of Regression Coefficients
::: incremental
- $\beta_0$: intercept
  - Estimated $\bar{y}$ when $x = 0$
:::
::: incremental
- $\beta_1$: regression slope
  - a one-point change in $x$ results in a $\beta_1$-point change in the average of $y$
    - note: "results in" does NOT mean "causes"!
  - a units-adjusted transformation of the (Pearson) correlation
:::
::: notes
(Knowledge Check)
With the formulas to algebraically estimate the regression coefficients completed, let's review what the parameters are and what they mean.
$\beta_0$ is the intercept. It is interpreted as the expected or mean value of the outcome variable y when the predictor variable x equals zero.
$\beta_1$ is the regression coefficient or regression slope. It is interpreted as the change in the expected or mean y value resulting from a one-unit change in x (as in, increasing x by 1 in whatever units it is measured in). Note that "results from" doesn't mean "caused by", as we are not making causal claims right now!
An interesting mathematical fact: the regression slope in a simple linear regression is a units-adjusted transformation of the Pearson correlation; that is, for a given data set, $\beta_1$ and the correlation are directly related!
:::
------------------------------------------------------------------------
### Relationship between $\beta_1$ and Correlation
```{r}
#| label: showing-height-data-again
#| fig-cap: 'Male College Students Height (in inches; x-axis) and Weight (in pounds; y-axis)'
#| fig-subcap: 'Basic scatterplot'
height_data %>%
ggplot(
mapping = aes(x = height, y = weight)
) +
geom_point() +
xlab('Height (inches)') +
ylab('Weight (pounds)') +
theme_bw()
```
::: notes
Let's briefly show how these are related.
First, take a look at the data we originally created. How would you characterize the correlation between height and weight? If you had to estimate the correlation coefficient, what value would you use?
(characterized: strong, positive; estimate: anything greater than .7 but not 1.0, definitely not negative)
:::
------------------------------------------------------------------------
Pearson Correlation: $$r_{(x,y)} = \frac{\Sigma^n_{i=1}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\Sigma_{i=1}^n(x_i-\bar{x})^2\Sigma_{i=1}^n(y_i-\bar{y})^2}} = \frac{\text{cov}(x,y)}{\text{sd}(x)\text{sd}(y)}$$
Regression slope: $$\hat{\beta_1} = \frac{\Sigma_{i=1}^n(x_i-\bar{x})(y_i-\bar{y})}{\Sigma_{i=1}^n(x_i-\bar{x})^2}=\frac{\text{cov}(x,y)}{\text{var}(x)}$$
::: notes
Let's revisit the equation for the Pearson correlation. Remember that correlation is unitless and can range from negative 1 to positive 1 but cannot go beyond this range.
Let's compare that equation to the algebraic equation for beta_1. It should be evident that beta_1 is in units of y. Why?
(when x increases by 1 unit, y increases by beta_1 units, so beta_1 is in y units over x units. Mathematically, cov(x,y) is in x\*y units, and var(x) is in x\^2 units, so we get y/x units. However, we generally interpret beta_1 in one-x-unit changes, so the x units in the denominator are not important for interpretation)
Both values have the covariance between x and y in the numerator; the denominators differ---the variance of x for the slope versus the product of the standard deviations of x and y for the correlation. Some algebra relates the two (but we'll skip over the specifics).
:::
------------------------------------------------------------------------
::: columns
::: {.column width="40%"}
Relating the two:

$r = \hat{\beta_1}*\frac{\sqrt{\Sigma_{i=1}^n(x_i-\bar{x})^2}}{\sqrt{\Sigma_{i=1}^n(y_i-\bar{y})^2}}$
:::

::: {.column width="60%"}
```{r}
#| label: showing-relationship-between-slope-and-correlation
#| echo: true
#| code-line-numbers: false
#| code-overflow: wrap
#| code-fold: show
pearson_correlation <-
  cor(height_data$height, height_data$weight) |>
  round(3)
regression_slope <-
  lm(weight ~ height, data = height_data)$coefficients[2] |>
  round(3)
numerator <-
  (height_data$height - mean(height_data$height))^2 |>
  sum() |>
  sqrt()
denominator <-
  (height_data$weight - mean(height_data$weight))^2 |>
  sum() |>
  sqrt()
units_adjustment <-
  (numerator / denominator) |> round(3)
```
```{r}
#| label: print-relationship-between-cor-and-reg
paste0(
  "r = ", pearson_correlation,
  "\nbeta_1 = ", regression_slope,
  "\nunits-adjustment = ", units_adjustment,
  "\nr = beta_1 * units-adjustment?: ",
  (regression_slope * units_adjustment) |>
    round(3)
) |>
  cat()
```
:::
:::
::: notes
On the left, we see the results from the algebraic manipulation I alluded to. By multiplying $\beta_1$ by the units-adjustment value, sqrt(SS(x)) / sqrt(SS(y)), the two values should be the same.
On the right is some R code that performs these calculations automatically. The `cor` function should look familiar. I took a shortcut and used the `lm` function instead of calculating $\beta_1$ myself, but we will see on the next slide that this result is functionally equivalent to the algebraic calculations on the same data. And, after calculating the units-adjustment value for the data and multiplying it with $\beta_1$, we see that in fact the correlation and regression slope are related.
:::
## Perform Simple Regression in R
### Algebraically
```{r}
#| label: regression-code-algebra-sample
#| echo: true
mean_x <-
  mean(height_data$height)
mean_y <-
  mean(height_data$weight)
SS_x <-
  sum((height_data$height - mean_x)^2)
S_xy <-
  sum((height_data$height - mean_x) * (height_data$weight - mean_y))
beta_1 <-
  S_xy / SS_x
beta_0 <-
  mean_y - beta_1*mean_x
print(paste0("Estimated Beta1: ", beta_1 |> round(digits = 2)))
print(paste0("Estimated Beta0: ", beta_0 |> round(digits = 2)))
```
::: notes
Here is an example of how to perform regression algebraically in R. This code snippet calculates the values we need for the least squares estimates of our regression coefficients using the data we simulated at the beginning of class. Note that the estimates for our coefficients are almost identical to the coefficients we used to simulate the data.
:::
------------------------------------------------------------------------
### With `lm`
```{r}
#| label: regression-code-lm-sample
#| echo: true
lm_height_weight <-
lm(weight ~ height, data = height_data)
summary(lm_height_weight)
```
::: notes
Here is another look at how we can get the same information using the `lm()` function built into R. By fitting the model, saving it to the `lm_height_weight` object, then using the `summary()` function on this object, we get a printout of the regression model, including a multitude of statistics. We will look further into these statistics in another lecture, so for now look for the Coefficients and their respective Estimate values. These values are basically equivalent to the values we found algebraically and to the values we used to simulate the data.
:::
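
------------------------------------------------------------------------

### Sketch: using the fitted model

As a small added usage sketch, here is how the fitted `lm_height_weight` object can be used to pull out the estimated coefficients and to predict average weights for new heights. `coef()` and `predict()` are standard functions for `lm` objects; the heights 65 and 72 are arbitrary example values chosen to stay inside the observed range.

```{r}
#| label: lm-coef-predict-sketch
#| echo: true
# Estimated intercept (beta0-hat) and slope (beta1-hat)
coef(lm_height_weight)

# Predicted average weight for two heights inside the observed range
predict(lm_height_weight, newdata = data.frame(height = c(65, 72)))
```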
## Practice
```{=html}
<iframe width='100%' height='80%' src='https://4kve1c-anthony0raborn.shinyapps.io/simple-linear-regression/'>
</iframe>
```
::: notes
Now let's practice what we've learned so far about regression.
This is a Shiny app, built in R, that generates random data, plots the data, and lets us pick a line that we think fits the data. It also provides information about the data (e.g., the means of the variables, the sums of squares for x and y individually, and the covariance between x and y). In case you don't remember how to use these values to algebraically solve for the regression coefficients, there is a hint button you can press that will show them to you. It also shows our sum of squared errors so we know how well our guess (or our math!) does at minimizing these errors.
Let's generate some data and see how well we do!
(spend about 5 minutes)
:::
## End
Questions?
Lecture available at:
<https://anthony-raborn.quarto.pub/introduction-to-statistics/>
::: notes
Any questions?
(can I find a picture to add here?)
:::