<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Statistics Guide - Learning Hub</title>
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/themes/prism-tomorrow.min.css">
<link href="https://fonts.googleapis.com/css2?family=Inter:wght@400;600;700;800&family=Fira+Code&display=swap" rel="stylesheet">
<link rel="stylesheet" href="css/common.css">
<link rel="stylesheet" href="css/notes.css">
</head>
<body>
<div class="particles" id="particles"></div>
<nav class="navbar">
<div class="nav-container">
<a href="index.html" class="nav-brand">📚 Learning Hub</a>
<div class="nav-links">
<a href="index.html">Home</a>
<a href="python-notes.html">Python</a>
<a href="numpy-notes.html">NumPy</a>
<a href="pandas-notes.html">Pandas</a>
<a href="sql-notes.html">SQL</a>
<a href="about.html">About</a>
</div>
</div>
</nav>
<div class="container">
<aside class="sidebar">
<h3>📖 Contents</h3>
<ul>
<li><a href="#intro" class="active">Introduction</a></li>
<li><a href="#descriptive">Descriptive Statistics</a></li>
<li><a href="#central-tendency">Central Tendency</a></li>
<li><a href="#dispersion">Measures of Dispersion</a></li>
<li><a href="#distributions">Distributions</a></li>
<li><a href="#probability">Probability Basics</a></li>
<li><a href="#sampling">Sampling Methods</a></li>
<li><a href="#hypothesis">Hypothesis Testing</a></li>
<li><a href="#correlation">Correlation</a></li>
<li><a href="#regression">Regression Analysis</a></li>
<li><a href="#anova">ANOVA</a></li>
<li><a href="#chi-square">Chi-Square Tests</a></li>
<li><a href="#bayes">Bayesian Statistics</a></li>
<li><a href="#time-series">Time Series Basics</a></li>
<li><a href="#python-stats">Python for Statistics</a></li>
</ul>
</aside>
<main class="content">
<div class="hero">
<h1>📊 Statistics Fundamentals</h1>
<p>Master statistical concepts essential for data science, from descriptive statistics to hypothesis testing</p>
</div>
<section id="intro" class="card">
<h2>1. Introduction to Statistics</h2>
<p>Statistics is the science of collecting, analyzing, interpreting, and presenting data. It's fundamental to data science, enabling us to extract insights, make predictions, and support decision-making.</p>
<div class="highlight-box">
<p><strong>Why Statistics?</strong> Understand data patterns, make data-driven decisions, validate hypotheses, build predictive models, quantify uncertainty</p>
</div>
<h3>Types of Statistics</h3>
<ul>
<li><strong>Descriptive Statistics</strong> - Summarize and describe data (mean, median, standard deviation)</li>
<li><strong>Inferential Statistics</strong> - Make predictions and inferences about populations from samples</li>
</ul>
<h3>Types of Data</h3>
<ul>
<li><strong>Quantitative (Numerical)</strong>
<ul>
<li>Discrete: Countable values (number of students, items sold)</li>
<li>Continuous: Measurable values (height, weight, temperature)</li>
</ul>
</li>
<li><strong>Qualitative (Categorical)</strong>
<ul>
<li>Nominal: No order (colors, gender, country)</li>
<li>Ordinal: Has order (ratings, education level)</li>
</ul>
</li>
</ul>
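<p>The nominal vs ordinal distinction above maps neatly onto pandas categorical types. A minimal sketch (the color and grade values are hypothetical):</p>
<div class="code-container">
<div class="code-header">
<span>Python</span>
<button class="copy-btn">Copy</button>
</div>
<pre><code class="language-python">import pandas as pd
# Nominal: categories with no inherent order
colors = pd.Series(['red', 'blue', 'red', 'green'], dtype='category')
print("Ordered:", colors.cat.ordered)  # False
# Ordinal: categories with a defined order, so min/max are meaningful
grades = pd.Categorical(['B', 'A', 'C', 'B'], categories=['C', 'B', 'A'], ordered=True)
print("Lowest:", grades.min())   # C
print("Highest:", grades.max())  # A</code></pre>
</div>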
</section>
<section id="descriptive" class="card">
<h2>2. Descriptive Statistics</h2>
<p>Descriptive statistics summarize and describe the main features of a dataset.</p>
<h3>Key Concepts</h3>
<div class="code-container">
<div class="code-header">
<span>Python</span>
<button class="copy-btn">Copy</button>
</div>
<pre><code class="language-python">import numpy as np
import pandas as pd
# Sample data
data = [12, 15, 18, 20, 22, 25, 28, 30, 35, 40]
# Basic descriptive statistics
print("Count:", len(data))
print("Sum:", sum(data))
print("Min:", min(data))
print("Max:", max(data))
print("Range:", max(data) - min(data))
# Using NumPy
data_np = np.array(data)
print("\nNumPy Statistics:")
print("Mean:", np.mean(data_np))
print("Median:", np.median(data_np))
print("Std Dev:", np.std(data_np))
print("Variance:", np.var(data_np))
# Using Pandas
df = pd.DataFrame({'values': data})
print("\nPandas describe():")
print(df.describe())</code></pre>
</div>
<h3>Frequency Distributions</h3>
<div class="code-container">
<div class="code-header">
<span>Python</span>
<button class="copy-btn">Copy</button>
</div>
<pre><code class="language-python">import pandas as pd
# Categorical data
grades = ['A', 'B', 'A', 'C', 'B', 'A', 'D', 'B', 'C', 'A']
# Frequency distribution
freq = pd.Series(grades).value_counts()
print("Frequency:\n", freq)
# Relative frequency (proportions)
rel_freq = pd.Series(grades).value_counts(normalize=True)
print("\nRelative Frequency:\n", rel_freq)
# Cumulative frequency
cum_freq = freq.sort_index().cumsum()
print("\nCumulative Frequency:\n", cum_freq)</code></pre>
</div>
</section>
<section id="central-tendency" class="card">
<h2>3. Measures of Central Tendency</h2>
<p>Measures that describe the center or typical value of a dataset.</p>
<h3>Mean (Average)</h3>
<p>Sum of all values divided by the number of values. Sensitive to outliers.</p>
<div class="code-container">
<div class="code-header">
<span>Python</span>
<button class="copy-btn">Copy</button>
</div>
<pre><code class="language-python">import numpy as np
import statistics as stats
data = [10, 20, 30, 40, 50]
# Mean calculation
mean = sum(data) / len(data)
print("Manual Mean:", mean)
# Using built-in functions
print("NumPy Mean:", np.mean(data))
print("Statistics Mean:", stats.mean(data))
# Weighted mean
values = [80, 85, 90]
weights = [0.3, 0.3, 0.4]
weighted_mean = sum(v * w for v, w in zip(values, weights))
print("Weighted Mean:", weighted_mean)</code></pre>
</div>
<h3>Median</h3>
<p>Middle value when data is sorted. Robust to outliers.</p>
<div class="code-container">
<div class="code-header">
<span>Python</span>
<button class="copy-btn">Copy</button>
</div>
<pre><code class="language-python">data = [10, 20, 30, 40, 50]
# Median
print("Median:", np.median(data))
print("Median:", stats.median(data))
# With outlier
data_with_outlier = [10, 20, 30, 40, 1000]
print("\nWith outlier:")
print("Mean:", np.mean(data_with_outlier)) # 220 (affected)
print("Median:", np.median(data_with_outlier)) # 30 (not affected)</code></pre>
</div>
<h3>Mode</h3>
<p>Most frequently occurring value. Can have multiple modes or no mode.</p>
<div class="code-container">
<div class="code-header">
<span>Python</span>
<button class="copy-btn">Copy</button>
</div>
<pre><code class="language-python">from scipy import stats as sp_stats
data = [1, 2, 2, 3, 3, 3, 4, 4, 5]
# Mode using scipy
mode_result = sp_stats.mode(data, keepdims=True)
print("Mode:", mode_result.mode[0])
print("Count:", mode_result.count[0])
# Mode for categorical data
grades = ['A', 'B', 'A', 'C', 'B', 'A']
mode = pd.Series(grades).mode()[0]
print("Most common grade:", mode)</code></pre>
</div>
<h3>When to Use Each Measure</h3>
<div class="highlight-box">
<p><strong>Mean:</strong> Symmetric distribution, no outliers. <strong>Median:</strong> Skewed distribution or outliers present. <strong>Mode:</strong> Categorical data or finding most common value</p>
</div>
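<p>A quick demonstration of the guidance above on right-skewed data (hypothetical salary figures, in thousands), where a single large value pulls the mean well above the median:</p>
<div class="code-container">
<div class="code-header">
<span>Python</span>
<button class="copy-btn">Copy</button>
</div>
<pre><code class="language-python">import numpy as np
# Right-skewed data: one large value pulls the mean upward
salaries = [30, 32, 35, 38, 40, 42, 45, 200]
print("Mean:", np.mean(salaries))      # 57.75 (distorted by the outlier)
print("Median:", np.median(salaries))  # 39.0 (a more typical value)</code></pre>
</div>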
</section>
<section id="dispersion" class="card">
<h2>4. Measures of Dispersion (Spread)</h2>
<p>Measures that describe how spread out or varied the data is.</p>
<h3>Range</h3>
<div class="code-container">
<div class="code-header">
<span>Python</span>
<button class="copy-btn">Copy</button>
</div>
<pre><code class="language-python">data = [10, 20, 30, 40, 50]
# Range
data_range = max(data) - min(data)
print("Range:", data_range) # 40
# NumPy
print("Range:", np.ptp(data)) # Peak to peak</code></pre>
</div>
<h3>Variance</h3>
<p>Average of squared deviations from the mean.</p>
<div class="code-container">
<div class="code-header">
<span>Python</span>
<button class="copy-btn">Copy</button>
</div>
<pre><code class="language-python">data = [10, 20, 30, 40, 50]
# Population variance
pop_variance = np.var(data)
print("Population Variance:", pop_variance)
# Sample variance (n-1 denominator)
sample_variance = np.var(data, ddof=1)
print("Sample Variance:", sample_variance)
# Manual calculation
mean = np.mean(data)
variance = sum((x - mean)**2 for x in data) / len(data)
print("Manual Variance:", variance)</code></pre>
</div>
<h3>Standard Deviation</h3>
<p>Square root of variance. Same unit as original data.</p>
<div class="code-container">
<div class="code-header">
<span>Python</span>
<button class="copy-btn">Copy</button>
</div>
<pre><code class="language-python"># Standard deviation
pop_std = np.std(data)
sample_std = np.std(data, ddof=1)
print("Population Std Dev:", pop_std)
print("Sample Std Dev:", sample_std)
# Interpretation: ~68% of data within 1 std dev of mean
# ~95% within 2 std devs, ~99.7% within 3 std devs (for normal distribution)</code></pre>
</div>
<h3>Interquartile Range (IQR)</h3>
<p>Range of middle 50% of data. Robust to outliers.</p>
<div class="code-container">
<div class="code-header">
<span>Python</span>
<button class="copy-btn">Copy</button>
</div>
<pre><code class="language-python">data = [10, 15, 20, 25, 30, 35, 40, 45, 50]
# Quartiles
q1 = np.percentile(data, 25) # 1st quartile
q2 = np.percentile(data, 50) # 2nd quartile (median)
q3 = np.percentile(data, 75) # 3rd quartile
# IQR
iqr = q3 - q1
print("Q1:", q1)
print("Q2 (Median):", q2)
print("Q3:", q3)
print("IQR:", iqr)
# Outlier detection using IQR
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
print("\nOutlier bounds:", lower_bound, "to", upper_bound)
# Coefficient of Variation (CV)
cv = (np.std(data) / np.mean(data)) * 100
print("Coefficient of Variation:", cv, "%")</code></pre>
</div>
</section>
<section id="distributions" class="card">
<h2>5. Probability Distributions</h2>
<h3>Normal Distribution (Gaussian)</h3>
<p>Bell-shaped, symmetric distribution. Many natural phenomena follow this.</p>
<div class="code-container">
<div class="code-header">
<span>Python</span>
<button class="copy-btn">Copy</button>
</div>
<pre><code class="language-python">import numpy as np
from scipy import stats
# Generate normal distribution
mu, sigma = 0, 1 # mean and standard deviation
x = np.linspace(-4, 4, 100)
y = stats.norm.pdf(x, mu, sigma)
# Properties
print("Mean:", mu)
print("Std Dev:", sigma)
# Probability calculations
# P(X < 1)
prob = stats.norm.cdf(1, mu, sigma)
print("P(X < 1):", prob)
# P(X > 1)
prob = 1 - stats.norm.cdf(1, mu, sigma)
print("P(X > 1):", prob)
# P(-1 < X < 1)
prob = stats.norm.cdf(1, mu, sigma) - stats.norm.cdf(-1, mu, sigma)
print("P(-1 < X < 1):", prob) # ~0.68
# Z-score (standardization)
value = 85
mean = 75
std = 10
z_score = (value - mean) / std
print("\nZ-score for 85:", z_score)</code></pre>
</div>
<h3>Binomial Distribution</h3>
<p>Discrete distribution for number of successes in n independent trials.</p>
<div class="code-container">
<div class="code-header">
<span>Python</span>
<button class="copy-btn">Copy</button>
</div>
<pre><code class="language-python"># Binomial distribution
n = 10 # number of trials
p = 0.5 # probability of success
# Probability of exactly k successes
k = 6
prob = stats.binom.pmf(k, n, p)
print(f"P(X = {k}):", prob)
# Probability of at most k successes
prob_cumulative = stats.binom.cdf(k, n, p)
print(f"P(X ≤ {k}):", prob_cumulative)
# Mean and variance
mean = n * p
variance = n * p * (1 - p)
print("Mean:", mean)
print("Variance:", variance)</code></pre>
</div>
<h3>Poisson Distribution</h3>
<p>Models number of events in fixed interval (calls per hour, defects per batch).</p>
<div class="code-container">
<div class="code-header">
<span>Python</span>
<button class="copy-btn">Copy</button>
</div>
<pre><code class="language-python"># Poisson distribution
lambda_param = 3 # average rate
# Probability of k events
k = 5
prob = stats.poisson.pmf(k, lambda_param)
print(f"P(X = {k}):", prob)
# Mean and variance (both equal to λ)
print("Mean:", lambda_param)
print("Variance:", lambda_param)</code></pre>
</div>
<h3>Exponential Distribution</h3>
<p>Time between events in a Poisson process.</p>
<div class="code-container">
<div class="code-header">
<span>Python</span>
<button class="copy-btn">Copy</button>
</div>
<pre><code class="language-python"># Exponential distribution
lambda_param = 0.5 # rate parameter
# Probability density at x
x = 2
prob = stats.expon.pdf(x, scale=1/lambda_param)
print(f"PDF at x={x}:", prob)
# Probability X < x
prob_cdf = stats.expon.cdf(x, scale=1/lambda_param)
print(f"P(X < {x}):", prob_cdf)</code></pre>
</div>
</section>
<section id="probability" class="card">
<h2>6. Probability Basics</h2>
<h3>Fundamental Concepts</h3>
<div class="code-container">
<div class="code-header">
<span>Python</span>
<button class="copy-btn">Copy</button>
</div>
<pre><code class="language-python"># Probability rules
# P(A) = favorable outcomes / total outcomes
# Example: Rolling a die
total_outcomes = 6
favorable = 1 # rolling a 6
prob = favorable / total_outcomes
print("P(rolling 6):", prob) # 0.1667
# Complement rule: P(A') = 1 - P(A)
prob_not_6 = 1 - prob
print("P(not 6):", prob_not_6) # 0.8333</code></pre>
</div>
<h3>Conditional Probability</h3>
<p>P(A|B) = Probability of A given B has occurred</p>
<div class="code-container">
<div class="code-header">
<span>Python</span>
<button class="copy-btn">Copy</button>
</div>
<pre><code class="language-python"># Example: Cards
# P(King | Face card)
total_cards = 52
face_cards = 12
kings = 4
# P(King and Face card) = P(King) since all kings are face cards
p_king_and_face = kings / total_cards
# P(Face card)
p_face = face_cards / total_cards
# P(King | Face card)
p_king_given_face = p_king_and_face / p_face
print("P(King | Face card):", p_king_given_face) # 0.333</code></pre>
</div>
<h3>Independence</h3>
<p>Events A and B are independent if P(A and B) = P(A) × P(B)</p>
<div class="code-container">
<div class="code-header">
<span>Python</span>
<button class="copy-btn">Copy</button>
</div>
<pre><code class="language-python"># Example: Coin flips
# P(Heads on flip 1 AND Heads on flip 2)
p_heads = 0.5
p_both_heads = p_heads * p_heads
print("P(HH):", p_both_heads) # 0.25
# Three independent events
p_three_heads = p_heads ** 3
print("P(HHH):", p_three_heads) # 0.125</code></pre>
</div>
<h3>Addition Rule</h3>
<div class="code-container">
<div class="code-header">
<span>Python</span>
<button class="copy-btn">Copy</button>
</div>
<pre><code class="language-python"># P(A or B) = P(A) + P(B) - P(A and B)
# Example: Drawing a card
p_king = 4/52
p_heart = 13/52
p_king_of_hearts = 1/52
# P(King OR Heart)
p_king_or_heart = p_king + p_heart - p_king_of_hearts
print("P(King or Heart):", p_king_or_heart) # 0.308</code></pre>
</div>
</section>
<section id="sampling" class="card">
<h2>7. Sampling Methods</h2>
<h3>Population vs Sample</h3>
<ul>
<li><strong>Population:</strong> Entire group of interest</li>
<li><strong>Sample:</strong> Subset of population used for analysis</li>
<li><strong>Sampling:</strong> Process of selecting samples</li>
</ul>
<h3>Sampling Techniques</h3>
<div class="code-container">
<div class="code-header">
<span>Python</span>
<button class="copy-btn">Copy</button>
</div>
<pre><code class="language-python">import random
# Population
population = list(range(1, 101)) # 1 to 100
# 1. Simple Random Sampling
simple_sample = random.sample(population, 10)
print("Simple Random Sample:", simple_sample)
# 2. Systematic Sampling
k = 10 # every kth element
systematic_sample = population[::k]
print("Systematic Sample:", systematic_sample)
# 3. Stratified Sampling
# Divide into groups and sample from each
group1 = population[:50]
group2 = population[50:]
stratified_sample = random.sample(group1, 5) + random.sample(group2, 5)
print("Stratified Sample:", stratified_sample)
# 4. Using Pandas
df = pd.DataFrame({'value': population})
random_sample = df.sample(n=10) # 10 random rows
print("\nPandas Random Sample:")
print(random_sample)</code></pre>
</div>
<h3>Central Limit Theorem (CLT)</h3>
<p>As sample size increases, the sampling distribution of the mean approaches a normal distribution, regardless of the population's shape.</p>
<div class="code-container">
<div class="code-header">
<span>Python</span>
<button class="copy-btn">Copy</button>
</div>
<pre><code class="language-python"># Demonstrate CLT
np.random.seed(42)
# Non-normal population (uniform)
population = np.random.uniform(0, 100, 10000)
# Take many samples and calculate means
sample_means = []
for i in range(1000):
    sample = np.random.choice(population, size=30)
    sample_means.append(np.mean(sample))
# Sample means are approximately normal!
print("Mean of sample means:", np.mean(sample_means))
print("Std of sample means:", np.std(sample_means))
print("Population mean:", np.mean(population))
# Standard Error of Mean
sem = np.std(population) / np.sqrt(30)
print("Standard Error:", sem)</code></pre>
</div>
</section>
<section id="hypothesis" class="card">
<h2>8. Hypothesis Testing</h2>
<p>Statistical method to make decisions based on data.</p>
<h3>Key Concepts</h3>
<ul>
<li><strong>Null Hypothesis (H₀):</strong> No effect or difference exists</li>
<li><strong>Alternative Hypothesis (H₁):</strong> Effect or difference exists</li>
<li><strong>p-value:</strong> Probability of observing data if H₀ is true</li>
<li><strong>Significance level (α):</strong> Threshold for rejecting H₀ (commonly 0.05)</li>
<li><strong>Type I Error:</strong> Rejecting true H₀ (false positive)</li>
<li><strong>Type II Error:</strong> Failing to reject false H₀ (false negative)</li>
</ul>
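<p>The meaning of α as the Type I error rate can be shown by simulation: when H₀ is actually true, about 5% of tests still reject it purely by chance. A sketch (the population parameters are hypothetical):</p>
<div class="code-container">
<div class="code-header">
<span>Python</span>
<button class="copy-btn">Copy</button>
</div>
<pre><code class="language-python">import numpy as np
from scipy import stats
# Simulate many tests where H0 is true (mu really is 50)
rng = np.random.default_rng(0)
rejections = 0
n_tests = 2000
for _ in range(n_tests):
    sample = rng.normal(loc=50, scale=10, size=30)
    _, p = stats.ttest_1samp(sample, 50)
    if p < 0.05:
        rejections += 1
# Roughly alpha (0.05) of the true null hypotheses get rejected
print("False positive rate:", rejections / n_tests)</code></pre>
</div>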
<h3>One-Sample t-Test</h3>
<p>Test if sample mean differs from known population mean.</p>
<div class="code-container">
<div class="code-header">
<span>Python</span>
<button class="copy-btn">Copy</button>
</div>
<pre><code class="language-python">from scipy import stats
# Sample data
sample = [23, 25, 27, 29, 31, 33, 35, 37, 39, 41]
population_mean = 30
# H₀: μ = 30
# H₁: μ ≠ 30
# Perform t-test
t_statistic, p_value = stats.ttest_1samp(sample, population_mean)
print("t-statistic:", t_statistic)
print("p-value:", p_value)
# Decision
alpha = 0.05
if p_value < alpha:
    print("Reject H₀: Mean is significantly different from 30")
else:
    print("Fail to reject H₀: No significant difference from 30")</code></pre>
</div>
<h3>Two-Sample t-Test</h3>
<p>Compare means of two independent groups.</p>
<div class="code-container">
<div class="code-header">
<span>Python</span>
<button class="copy-btn">Copy</button>
</div>
<pre><code class="language-python"># Two groups
group1 = [23, 25, 27, 29, 31, 33, 35]
group2 = [30, 32, 34, 36, 38, 40, 42]
# H₀: μ₁ = μ₂
# H₁: μ₁ ≠ μ₂
# Independent samples t-test
t_stat, p_value = stats.ttest_ind(group1, group2)
print("t-statistic:", t_stat)
print("p-value:", p_value)
if p_value < 0.05:
    print("Significant difference between groups")
else:
    print("No significant difference")</code></pre>
</div>
<h3>Paired t-Test</h3>
<p>Compare means of same group at different times.</p>
<div class="code-container">
<div class="code-header">
<span>Python</span>
<button class="copy-btn">Copy</button>
</div>
<pre><code class="language-python"># Before and after measurements
before = [72, 75, 78, 80, 82, 85, 88]
after = [70, 73, 76, 78, 80, 83, 86]
# H₀: μ_diff = 0
# H₁: μ_diff ≠ 0
# Paired t-test
t_stat, p_value = stats.ttest_rel(before, after)
print("t-statistic:", t_stat)
print("p-value:", p_value)
if p_value < 0.05:
    print("Significant change from before to after")
else:
    print("No significant change")</code></pre>
</div>
<h3>Z-Test</h3>
<p>Use when the population standard deviation is known, or when the sample is large (n > 30) so the sample standard deviation is a reliable substitute.</p>
<div class="code-container">
<div class="code-header">
<span>Python</span>
<button class="copy-btn">Copy</button>
</div>
<pre><code class="language-python">from statsmodels.stats.weightstats import ztest
# Sample
sample = [23, 25, 27, 29, 31, 33, 35, 37, 39, 41]
population_mean = 30
# Z-test
z_stat, p_value = ztest(sample, value=population_mean)
print("z-statistic:", z_stat)
print("p-value:", p_value)</code></pre>
</div>
</section>
<section id="correlation" class="card">
<h2>9. Correlation Analysis</h2>
<p>Measures strength and direction of relationship between variables.</p>
<h3>Pearson Correlation</h3>
<p>Measures linear relationship. Range: -1 to +1.</p>
<div class="code-container">
<div class="code-header">
<span>Python</span>
<button class="copy-btn">Copy</button>
</div>
<pre><code class="language-python">import numpy as np
from scipy import stats
# Two variables
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2, 4, 5, 7, 8, 10, 11, 13, 14, 16]
# Pearson correlation
corr, p_value = stats.pearsonr(x, y)
print("Pearson r:", corr)
print("p-value:", p_value)
# Interpretation:
# r > 0.7: Strong positive correlation
# r = 0: No correlation
# r < -0.7: Strong negative correlation
# Using NumPy
corr_matrix = np.corrcoef(x, y)
print("\nCorrelation matrix:\n", corr_matrix)
# Using Pandas
df = pd.DataFrame({'x': x, 'y': y})
print("\nPandas correlation:")
print(df.corr())</code></pre>
</div>
<h3>Spearman Correlation</h3>
<p>Measures monotonic relationships, whether linear or not. Better suited to ordinal data and robust to outliers.</p>
<div class="code-container">
<div class="code-header">
<span>Python</span>
<button class="copy-btn">Copy</button>
</div>
<pre><code class="language-python"># Spearman correlation
corr, p_value = stats.spearmanr(x, y)
print("Spearman rho:", corr)
print("p-value:", p_value)</code></pre>
</div>
<h3>Correlation vs Causation</h3>
<div class="highlight-box">
<p><strong>Important:</strong> Correlation does NOT imply causation! Just because two variables are correlated doesn't mean one causes the other. There could be confounding variables or spurious correlations.</p>
</div>
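<p>One way spurious correlation arises is through a confounder: a third variable that drives both. A small simulation sketch (the variable names are purely illustrative):</p>
<div class="code-container">
<div class="code-header">
<span>Python</span>
<button class="copy-btn">Copy</button>
</div>
<pre><code class="language-python">import numpy as np
from scipy import stats
# A hidden confounder z drives both x and y
rng = np.random.default_rng(1)
z = rng.normal(size=500)                 # e.g. summer temperature
x = z + rng.normal(scale=0.5, size=500)  # e.g. ice cream sales
y = z + rng.normal(scale=0.5, size=500)  # e.g. drowning incidents
r, p = stats.pearsonr(x, y)
# Strong correlation, yet neither variable causes the other
print("Pearson r:", round(r, 2))</code></pre>
</div>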
</section>
<section id="regression" class="card">
<h2>10. Regression Analysis</h2>
<p>Model relationship between dependent and independent variables.</p>
<h3>Simple Linear Regression</h3>
<p>Y = β₀ + β₁X + ε</p>
<div class="code-container">
<div class="code-header">
<span>Python</span>
<button class="copy-btn">Copy</button>
</div>
<pre><code class="language-python">from scipy import stats
import numpy as np
# Data
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
y = np.array([2.1, 4.2, 5.8, 8.1, 10.3, 11.9, 14.2, 16.1, 17.9, 20.2])
# Linear regression
slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)
print("Slope (β₁):", slope)
print("Intercept (β₀):", intercept)
print("R-squared:", r_value**2)
print("p-value:", p_value)
# Make predictions
x_new = 11
y_pred = slope * x_new + intercept
print(f"\nPrediction for x={x_new}: {y_pred}")
# Predict for all x values
y_predicted = slope * x + intercept
print("Predicted values:", y_predicted)</code></pre>
</div>
<h3>Multiple Linear Regression</h3>
<p>Y = β₀ + β₁X₁ + β₂X₂ + ... + βₙXₙ + ε</p>
<div class="code-container">
<div class="code-header">
<span>Python</span>
<button class="copy-btn">Copy</button>
</div>
<pre><code class="language-python">from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error
# Multiple predictors
X = np.array([[1, 2], [2, 3], [3, 4], [4, 5], [5, 6]])
y = np.array([3, 5, 7, 9, 11])
# Fit model
model = LinearRegression()
model.fit(X, y)
# Coefficients
print("Intercept:", model.intercept_)
print("Coefficients:", model.coef_)
# Make predictions
y_pred = model.predict(X)
print("Predictions:", y_pred)
# Model evaluation
r2 = r2_score(y, y_pred)
mse = mean_squared_error(y, y_pred)
rmse = np.sqrt(mse)
print("\nR-squared:", r2)
print("MSE:", mse)
print("RMSE:", rmse)</code></pre>
</div>
<h3>Assumptions of Linear Regression</h3>
<ul>
<li><strong>Linearity:</strong> Relationship between X and Y is linear</li>
<li><strong>Independence:</strong> Observations are independent</li>
<li><strong>Homoscedasticity:</strong> Constant variance of errors</li>
<li><strong>Normality:</strong> Errors are normally distributed</li>
<li><strong>No multicollinearity:</strong> Predictors not highly correlated (multiple regression)</li>
</ul>
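<p>Two of these assumptions can be checked roughly from the residuals of a fit. A sketch using the data from the simple regression example above:</p>
<div class="code-container">
<div class="code-header">
<span>Python</span>
<button class="copy-btn">Copy</button>
</div>
<pre><code class="language-python">import numpy as np
from scipy import stats
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
y = np.array([2.1, 4.2, 5.8, 8.1, 10.3, 11.9, 14.2, 16.1, 17.9, 20.2])
slope, intercept, *_ = stats.linregress(x, y)
residuals = y - (slope * x + intercept)
# Normality of errors: Shapiro-Wilk test (H0: residuals are normal)
stat, p = stats.shapiro(residuals)
print("Shapiro-Wilk p-value:", p)  # large p: no evidence against normality
# Homoscedasticity: residual spread should not grow with x
print("Residuals:", np.round(residuals, 2))</code></pre>
</div>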
</section>
<section id="anova" class="card">
<h2>11. Analysis of Variance (ANOVA)</h2>
<p>Test if means of three or more groups are significantly different.</p>
<h3>One-Way ANOVA</h3>
<p>Compare means across one categorical variable.</p>
<div class="code-container">
<div class="code-header">
<span>Python</span>
<button class="copy-btn">Copy</button>
</div>
<pre><code class="language-python">from scipy import stats
# Three groups
group1 = [23, 25, 27, 29, 31]
group2 = [30, 32, 34, 36, 38]
group3 = [35, 37, 39, 41, 43]
# H₀: μ₁ = μ₂ = μ₃
# H₁: At least one mean is different
# Perform one-way ANOVA
f_statistic, p_value = stats.f_oneway(group1, group2, group3)
print("F-statistic:", f_statistic)
print("p-value:", p_value)
if p_value < 0.05:
    print("At least one group mean is significantly different")
else:
    print("No significant difference between group means")</code></pre>
</div>
<h3>Two-Way ANOVA</h3>
<p>Examine effect of two categorical variables and their interaction.</p>
<div class="code-container">
<div class="code-header">
<span>Python</span>
<button class="copy-btn">Copy</button>
</div>
<pre><code class="language-python">import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
# Sample data
df = pd.DataFrame({
'score': [23, 25, 27, 30, 32, 34, 35, 37, 39, 40, 42, 44],
'method': ['A', 'A', 'A', 'B', 'B', 'B', 'A', 'A', 'A', 'B', 'B', 'B'],
'gender': ['M', 'M', 'F', 'M', 'F', 'F', 'M', 'F', 'F', 'M', 'M', 'F']
})
# Fit model
model = ols('score ~ C(method) + C(gender) + C(method):C(gender)', data=df).fit()
# ANOVA table
anova_table = anova_lm(model, typ=2)
print(anova_table)</code></pre>
</div>
<h3>Post-hoc Tests</h3>
<p>After significant ANOVA, determine which groups differ.</p>
<div class="code-container">
<div class="code-header">
<span>Python</span>
<button class="copy-btn">Copy</button>
</div>
<pre><code class="language-python">from statsmodels.stats.multicomp import pairwise_tukeyhsd
# Data
data = [23, 25, 27, 29, 31, 30, 32, 34, 36, 38, 35, 37, 39, 41, 43]
groups = ['A']*5 + ['B']*5 + ['C']*5
# Tukey's HSD test
tukey = pairwise_tukeyhsd(endog=data, groups=groups, alpha=0.05)
print(tukey)</code></pre>
</div>
</section>
<section id="chi-square" class="card">
<h2>12. Chi-Square Tests</h2>
<p>Test relationships between categorical variables.</p>
<h3>Chi-Square Goodness of Fit</h3>
<p>Test if observed frequencies match expected distribution.</p>
<div class="code-container">
<div class="code-header">
<span>Python</span>
<button class="copy-btn">Copy</button>
</div>
<pre><code class="language-python">from scipy import stats
# Observed frequencies
observed = [30, 25, 20, 25]
# Expected frequencies (equal distribution)
expected = [25, 25, 25, 25]
# Chi-square test
chi2, p_value = stats.chisquare(observed, expected)
print("Chi-square statistic:", chi2)
print("p-value:", p_value)
if p_value < 0.05:
    print("Distribution significantly different from expected")
else:
    print("No significant difference from expected distribution")</code></pre>
</div>
<h3>Chi-Square Test of Independence</h3>
<p>Test if two categorical variables are independent.</p>
<div class="code-container">
<div class="code-header">
<span>Python</span>
<button class="copy-btn">Copy</button>
</div>
<pre><code class="language-python"># Contingency table
# Rows: Gender (M/F), Columns: Preference (A/B/C)
observed = np.array([[30, 25, 20],
                     [20, 30, 25]])
# Chi-square test of independence
chi2, p_value, dof, expected = stats.chi2_contingency(observed)
print("Chi-square statistic:", chi2)
print("p-value:", p_value)
print("Degrees of freedom:", dof)
print("\nExpected frequencies:\n", expected)
if p_value < 0.05:
    print("\nVariables are dependent (associated)")
else:
    print("\nVariables are independent (not associated)")</code></pre>
</div>
</section>
<section id="bayes" class="card">
<h2>13. Bayesian Statistics</h2>
<p>Update beliefs based on new evidence using Bayes' Theorem.</p>
<h3>Bayes' Theorem</h3>
<p>P(A|B) = P(B|A) × P(A) / P(B)</p>
<div class="code-container">
<div class="code-header">
<span>Python</span>
<button class="copy-btn">Copy</button>
</div>
<pre><code class="language-python"># Example: Medical test
# P(Disease) = 0.01 (1% of population has disease)
# P(Positive|Disease) = 0.95 (95% sensitivity)
# P(Positive|No Disease) = 0.05 (5% false positive rate)