Institution: Pontifical Catholic University of São Paulo (PUC-SP)
School: Faculty of Interdisciplinary Studies
Program: Humanistic AI and Data Science
Semester: 2nd Semester 2025
Professor: Prof. Dr. Daniel Rodrigues da Silva (Doctor in Mathematics)
Important
- Projects and deliverables may be made publicly available whenever possible.
- The course emphasizes practical, hands-on experience with real datasets to simulate professional consulting scenarios in the fields of Data Analysis and Data Mining for partner organizations and institutions affiliated with the university.
- All activities comply with the academic and ethical guidelines of PUC-SP.
- Any content not authorized for public disclosure will remain confidential and securely stored in private repositories.
Tip
This repository is a review of the Statistics course from the undergraduate program Humanities, AI and Data Science at PUC-SP.
Access Data Mining Main Repository
If you'd like to explore the full materials from the 1st year (not only the review), you can visit the complete repository.
Variation (more commonly called the range) is the difference between the maximum and minimum entries in a quantitative data set.
[ \text{Variation} = \text{Maximum Value} - \text{Minimum Value} ]
- The data must be quantitative (numerical).
Example: Calculating Variation
A company hired 10 graduates. Their starting salaries (in thousand dollars):
41, 38, 39, 45, 47, 41, 44, 41, 37, 42
Ordered data:
37, 38, 39, 41, 41, 41, 42, 44, 45, 47
- Minimum: 37
- Maximum: 47
[ \text{Variation} = 47 - 37 = 10 ]
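The same calculation can be sketched in Python. The variable names below are illustrative and not part of the original course material.

```python
# Starting salaries in thousands of dollars (data from the example above)
salaries = [41, 38, 39, 45, 47, 41, 44, 41, 37, 42]

# Variation (range) = maximum entry - minimum entry
variation = max(salaries) - min(salaries)

print(f"Minimum: {min(salaries)}, Maximum: {max(salaries)}")
print(f"Variation: {variation}")  # 47 - 37 = 10
```

Because only the two extreme entries are used, a single unusual salary would change this measure completely, which is one reason the variance and standard deviation are usually preferred.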
Deviation of a value $x$ from the mean:
- Population: $x - \mu$
- Sample: $x - \bar{x}$
Formulas:
[ \mu = \frac{\sum x}{N} ]
[ \sigma^2 = \frac{\sum (x - \mu)^2}{N} ]
[ \sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}} ]
where:
- $\mu$ is the population mean
- $N$ is the population size
- $\sigma^2$ is the population variance
- $\sigma$ is the population standard deviation
Salaries: 41, 38, 39, 45, 47, 41, 44, 41, 37, 42
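Sum: $\sum x = 415$, so $\mu = 415 / 10 = 41.5$
Salary ($x$) | Deviation ($x - \mu$) | Squared deviation ($(x - \mu)^2$) |
---|---|---|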
41 | -0.5 | 0.25 |
38 | -3.5 | 12.25 |
39 | -2.5 | 6.25 |
45 | 3.5 | 12.25 |
47 | 5.5 | 30.25 |
41 | -0.5 | 0.25 |
44 | 2.5 | 6.25 |
41 | -0.5 | 0.25 |
37 | -4.5 | 20.25 |
42 | 0.5 | 0.25 |
Total | 0 | 88.5 |
[ \sigma^2 = \frac{88.5}{10} = 8.85 ]
[ \sigma = \sqrt{8.85} \approx 3.0 ]
The population standard deviation is about $3.0$, that is, roughly 3,000 dollars.
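As a cross-check of the hand calculation, the population formulas can be evaluated with Python's standard `statistics` module. This is a minimal sketch that assumes the ten salaries are the entire population.

```python
import statistics

# Starting salaries in thousands of dollars, treated as a full population
salaries = [41, 38, 39, 45, 47, 41, 44, 41, 37, 42]

mu = statistics.mean(salaries)             # population mean
sigma_sq = statistics.pvariance(salaries)  # sum of squared deviations divided by N
sigma = statistics.pstdev(salaries)        # square root of the population variance

print(f"mu = {mu}")
print(f"sigma^2 = {sigma_sq:.2f}")
print(f"sigma = {sigma:.2f}")
```

The functions prefixed with `p` divide by $N$, matching the population formulas above.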
Formulas:
[ \bar{x} = \frac{\sum x}{n} ]
[ s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} ]
[ s = \sqrt{s^2} ]
where:
- $\bar{x}$ is the sample mean
- $n$ is the sample size
- $s^2$ is the sample variance
- $s$ is the sample standard deviation
Salary ($x$) | Deviation ($x - \bar{x}$) | Squared deviation ($(x - \bar{x})^2$) |
---|---|---|
41 | -0.5 | 0.25 |
38 | -3.5 | 12.25 |
39 | -2.5 | 6.25 |
45 | 3.5 | 12.25 |
47 | 5.5 | 30.25 |
41 | -0.5 | 0.25 |
44 | 2.5 | 6.25 |
41 | -0.5 | 0.25 |
37 | -4.5 | 20.25 |
42 | 0.5 | 0.25 |
Total | 0 | 88.5 |
[ s^2 = \frac{88.5}{10-1} = \frac{88.5}{9} \approx 9.83 ]
[ s = \sqrt{9.83} \approx 3.1 ]
The sample standard deviation is about $3.1$, that is, roughly 3,100 dollars.
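The effect of dividing by $n - 1$ instead of $N$ can be seen by running both versions on the same data. The sketch below uses the standard `statistics` module and assumes the ten salaries are a sample drawn from a larger group.

```python
import statistics

# Same ten salaries, now treated as a sample
salaries = [41, 38, 39, 45, 47, 41, 44, 41, 37, 42]

s_sq = statistics.variance(salaries)       # divides by n - 1
s = statistics.stdev(salaries)             # square root of the sample variance
sigma_sq = statistics.pvariance(salaries)  # divides by N, shown for comparison

print(f"Sample variance s^2 = {s_sq:.2f}")
print(f"Sample standard deviation s = {s:.2f}")
print(f"Population variance (same data) = {sigma_sq:.2f}")
```

With NumPy, the equivalent choice is made through the `ddof` argument: `np.var(salaries, ddof=1)` divides by $n - 1$, while the default `ddof=0` divides by $N$.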
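Formula for the sample standard deviation of a frequency distribution (using the values or class midpoints $x$ and their frequencies $f$):
[ s = \sqrt{\frac{\sum (x - \bar{x})^2 f}{n - 1}} ]
where $n = \sum f$ is the number of entries in the data set.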
Number of children in 50 households (frequency table):
Children ($x$) | Frequency ($f$) | $xf$ |
---|---|---|
0 | 10 | 0 |
1 | 19 | 19 |
2 | 7 | 14 |
3 | 7 | 21 |
4 | 2 | 8 |
5 | 1 | 5 |
6 | 4 | 24 |
Total: | 50 | 91 |
Sample mean:
[ \bar{x} = \frac{91}{50} = 1.82 ]
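Now calculate the squared deviation for each group, weight it by the frequency, and sum:
$x$ | $f$ | $x - \bar{x}$ | $(x - \bar{x})^2$ | $(x - \bar{x})^2 f$ |
---|---|---|---|---|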
0 | 10 | -1.82 | 3.31 | 33.10 |
1 | 19 | -0.82 | 0.67 | 12.73 |
2 | 7 | 0.18 | 0.03 | 0.21 |
3 | 7 | 1.18 | 1.39 | 9.73 |
4 | 2 | 2.18 | 4.75 | 9.50 |
5 | 1 | 3.18 | 10.11 | 10.11 |
6 | 4 | 4.18 | 17.47 | 69.88 |
Total | | | | 145.26 |
Compute sample standard deviation:
[ s = \sqrt{\frac{145.26}{49}} \approx \sqrt{2.964} \approx 1.72 ]
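One way to check the grouped-data result is to expand the frequency table back into its 50 individual observations and apply the ordinary sample formula. The sketch below is an illustration under that approach; the variable names are assumptions, not part of the original material.

```python
import statistics

children = [0, 1, 2, 3, 4, 5, 6]       # values x
households = [10, 19, 7, 7, 2, 1, 4]   # frequencies f

# Expand the frequency table: each value x is repeated f times
observations = [x for x, f in zip(children, households) for _ in range(f)]

n = len(observations)
mean = statistics.mean(observations)
s = statistics.stdev(observations)     # sample standard deviation (n - 1)

print(f"n = {n}, mean = {mean:.2f}, s = {s:.2f}")
```

Without intermediate rounding the weighted sum of squared deviations is 145.38 rather than the 145.26 shown in the table, which rounds each squared deviation to two decimals; the standard deviation still rounds to 1.72.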
- Quartile $Q_1$: 25% of the data fall below it
- Quartile $Q_2$ (median): 50% of the data fall below it
- Quartile $Q_3$: 75% of the data fall below it
Interquartile Range (IQR):
[ \text{IQR} = Q_3 - Q_1 ]
Test scores (ordered):
5, 7, 9, 10, 11, 13, 14, 15, 16, 17, 18, 18, 20, 21, 37
$Q_1 = 10$, $Q_2 = 15$, $Q_3 = 18$
[ \text{IQR} = 18 - 10 = 8 ]
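The quartiles above follow the median-of-halves method: $Q_2$ is the median of the whole data set, and $Q_1$ and $Q_3$ are the medians of the entries below and above it. A minimal sketch of that method follows; `quartiles_by_halves` is a hypothetical helper written for this illustration.

```python
import statistics

scores = [5, 7, 9, 10, 11, 13, 14, 15, 16, 17, 18, 18, 20, 21, 37]

def quartiles_by_halves(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]           # entries below the median position
    upper = ordered[half + n % 2:]   # entries above the median position
    return (statistics.median(lower),
            statistics.median(ordered),
            statistics.median(upper))

q1, q2, q3 = quartiles_by_halves(scores)
iqr = q3 - q1
print(f"Q1 = {q1}, Q2 = {q2}, Q3 = {q3}, IQR = {iqr}")
```

Library routines such as `np.quantile` interpolate between entries by default, so they can return slightly different quartiles for the same data.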
Code Example:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('banco.csv')
# Quartiles for the 'age' column
q1 = np.quantile(df['age'], 0.25)
q2 = np.quantile(df['age'], 0.5)
q3 = np.quantile(df['age'], 0.75)
print(f'Q1 = {q1}, Q2 (median) = {q2}, Q3 = {q3}')
# Boxplot
plt.boxplot(df['age'])
plt.title("Boxplot of Age")
plt.show()
- Percentiles ($P_k$): divide the data into 100 equal parts
- z-score: tells how many standard deviations a value lies from the mean
[ z = \frac{x - \mu}{\sigma} ]
Forest Whitaker won Best Actor at age 45; the ages of Best Actor winners have mean 43.7 and standard deviation 8.8.
Helen Mirren won Best Actress at age 61; the ages of Best Actress winners have mean 36 and standard deviation 11.5.
- Forest: $z = \frac{45 - 43.7}{8.8} \approx 0.15$
- Helen: $z = \frac{61 - 36}{11.5} \approx 2.17$
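The two z-scores can be reproduced with a small helper. The function name `z_score` is an assumption made for this sketch.

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / std_dev

z_forest = z_score(45, 43.7, 8.8)   # Best Actor winner, age 45
z_helen = z_score(61, 36, 11.5)     # Best Actress winner, age 61

print(f"Forest Whitaker: z = {z_forest:.2f}")
print(f"Helen Mirren:    z = {z_helen:.2f}")
```

Because both ages are expressed in standard deviations from their own group mean, they can be compared directly: Helen Mirren's age lies more than two standard deviations above the mean for her category, while Forest Whitaker's is close to the mean for his.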
Fractile | Divides the data | Symbols |
---|---|---|
Quartiles | Into 4 equal parts | $Q_1, Q_2, Q_3$ |
Deciles | Into 10 equal parts | $D_1, D_2, \ldots, D_9$ |
Percentiles | Into 100 equal parts | $P_1, P_2, \ldots, P_{99}$ |
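All three kinds of fractiles can be obtained from one function, `statistics.quantiles`, by changing the number of parts `n`. The sketch below reuses the test scores from the quartile example and relies on the function's default `method="exclusive"`, which is an assumption about which convention is wanted.

```python
import statistics

scores = [5, 7, 9, 10, 11, 13, 14, 15, 16, 17, 18, 18, 20, 21, 37]

quartiles = statistics.quantiles(scores, n=4)      # 3 cut points: Q1, Q2, Q3
deciles = statistics.quantiles(scores, n=10)       # 9 cut points: D1 ... D9
percentiles = statistics.quantiles(scores, n=100)  # 99 cut points: P1 ... P99

print(f"Quartiles: {quartiles}")
print(f"Number of deciles: {len(deciles)}")
print(f"Number of percentiles: {len(percentiles)}")
```

Dividing data into $k$ equal parts always takes $k - 1$ cut points, which is why there are 3 quartiles, 9 deciles, and 99 percentiles.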
- Variation and standard deviation measure the spread or scatter of data
- Variance uses squared deviations; standard deviation is the square root of variance
- Sample formulas use $n-1$ in the denominator so that the sample variance is an unbiased estimate
- Grouped data: use midpoints for classes and frequency-weighted calculations
- Interquartile range (IQR) is a robust measure of spread
- Boxplot visually displays data spread, median, quartiles, and outliers
- z-scores allow comparison between different data sets by standardizing values
# Cell 1: Define data (salaries, in thousands of dollars)
salaries = [41, 38, 39, 45, 47, 41, 44, 41, 37, 42]
# Cell 2: Calculate population mean
pop_mean = sum(salaries) / len(salaries)
pop_mean  # Expected output: 41.5
# Cell 3: Calculate squared deviations for the population
pop_sq_dev = [(x - pop_mean) ** 2 for x in salaries]
pop_sq_dev  # Output: list of squared deviations
# Cell 4: Calculate population variance and std deviation
pop_variance = sum(pop_sq_dev) / len(salaries)
pop_std_dev = pop_variance ** 0.5
print(f"Population variance: {pop_variance}")
print(f"Population standard deviation: {pop_std_dev}")
# Cell 5: Calculate sample mean (same as population mean here)
sample_mean = pop_mean
sample_mean  # Expected output: 41.5
# Cell 6: Calculate deviations squared for sample
sample_sq_dev = pop_sq_dev
# Cell 7: Calculate sample variance and std deviation
sample_variance = sum(sample_sq_dev) / (len(salaries) - 1)
sample_std_dev = sample_variance ** 0.5
print(f"Sample variance: {sample_variance}")
print(f"Sample standard deviation: {sample_std_dev}")
Suppose you have frequency data of children per household:
Children | Frequency |
---|---|
0 | 10 |
1 | 19 |
2 | 7 |
3 | 7 |
4 | 2 |
5 | 1 |
6 | 4 |
Calculate mean and standard deviation with:
[ \bar{x} = \frac{\sum (x \cdot f)}{N} ]
[ s = \sqrt{\frac{\sum f (x - \bar{x})^2}{N-1}} ]
# Cell 1: Define values and frequencies (from the table above)
x = [0, 1, 2, 3, 4, 5, 6]
f = [10, 19, 7, 7, 2, 1, 4]
# Cell 2: Calculate total frequency
N = sum(f)
N  # Expected output: 50
# Cell 3: Calculate mean using weighted sum
mean = sum([xi * fi for xi, fi in zip(x, f)]) / N
mean  # Expected output: about 1.82
# Cell 4: Calculate squared deviations (x - mean)^2
sq_dev = [(xi - mean) ** 2 for xi in x]
sq_dev  # Output: list of squared deviations
# Cell 5: Calculate weighted squared deviations
weighted_sq_dev = [fi * sd for fi, sd in zip(f, sq_dev)]
weighted_sq_dev  # Output: weighted squared deviations
# Cell 6: Calculate sample variance and std deviation
sample_variance = sum(weighted_sq_dev) / (N - 1)
sample_std_dev = sample_variance ** 0.5
print(f"Grouped data sample variance: {sample_variance}")
print(f"Grouped data sample standard deviation: {sample_std_dev}")
This completes the comprehensive guide on variation and standard deviation with formulas and stepwise Python code ready for running in Google Colab.
Quartiles divide an ordered data set into four approximately equal parts.
Quartile | Description |
---|---|
$Q_1$ | First quartile (25% of the data below) |
$Q_2$ | Second quartile, the median (50% of the data below) |
$Q_3$ | Third quartile (75% of the data below) |
Interquartile Range (IQR):
[ \text{IQR} = Q_3 - Q_1 ]
Data:
5, 7, 9, 10, 11, 13, 14, 15, 16, 17, 18, 18, 20, 21, 37
- $Q_1 = 10$
- $Q_2 = 15$
- $Q_3 = 18$
- IQR = $18 - 10 = 8$
- Percentiles divide the data into 100 equal parts.
- Example: the 72nd percentile is the value below which 72% of the data fall.
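Read in the other direction, the same idea gives the percentage of the data that falls below a chosen value. The sketch below uses a hypothetical helper, `percent_below`, on the test scores from the quartile example.

```python
scores = [5, 7, 9, 10, 11, 13, 14, 15, 16, 17, 18, 18, 20, 21, 37]

def percent_below(values, x):
    """Percentage of entries in values that are strictly less than x."""
    return 100 * sum(v < x for v in values) / len(values)

share = percent_below(scores, 20)
print(f"{share:.0f}% of the scores fall below 20")
```

Twelve of the fifteen scores are below 20, so a score of 20 sits at roughly the 80th percentile of this data set.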
A boxplot is a graphical display showing:
- Minimum
- First quartile ($Q_1$)
- Median ($Q_2$)
- Third quartile ($Q_3$)
- Maximum
Useful for identifying spread, center, and potential outliers.
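The numbers behind a boxplot can also be computed directly. The sketch below builds the five-number summary and flags potential outliers with the common 1.5 × IQR rule; that rule is an assumption here, since the text above does not say how outliers are identified, although it matches the default whisker length in Matplotlib's `boxplot`.

```python
import statistics

scores = [5, 7, 9, 10, 11, 13, 14, 15, 16, 17, 18, 18, 20, 21, 37]

q1, q2, q3 = statistics.quantiles(scores, n=4)  # default "exclusive" method
iqr = q3 - q1

# Fences: entries beyond these limits are flagged as potential outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in scores if v < lower_fence or v > upper_fence]

five_number_summary = {
    "minimum": min(scores),
    "Q1": q1,
    "median": q2,
    "Q3": q3,
    "maximum": max(scores),
}
print(five_number_summary)
print(f"Fences: [{lower_fence}, {upper_fence}], potential outliers: {outliers}")
```

For this data set the score of 37 lies above the upper fence of 30, so a boxplot drawn with the default settings would show it as an isolated point beyond the whisker.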
# Cell 1: Import necessary packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Cell 2: Create sample data (can replace with actual data)
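data = [5, 7, 9, 10, 11, 13, 14, 15, 16, 17, 18, 18, 20, 21, 37]
# Cell 3: Calculate quartiles
# Note: np.percentile interpolates linearly by default, so Q1 is 10.5 here rather than the 10 found by hand
Q1 = np.percentile(data, 25)
Q2 = np.percentile(data, 50)  # Median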
Q3 = np.percentile(data, 75)
print(f"Q1 (25th percentile): {Q1}")
print(f"Q2 (Median, 50th percentile): {Q2}")
print(f"Q3 (75th percentile): {Q3}")
# Cell 4: Calculate Interquartile Range (IQR)
IQR = Q3 - Q1
print(f"Interquartile Range (IQR): {IQR}")
# Cell 5: Calculate specific percentile (e.g. 72nd percentile)
p72 = np.percentile(data, 72)
print(f"72nd percentile: {p72}")
# Cell 6: Generate boxplot
plt.boxplot(data)
plt.title("Boxplot of Sample Data")
plt.ylabel("Values")
plt.show()
# Cell 1: Import packages and load CSV file
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.read_csv('your_file.csv')  # Replace 'your_file.csv' with your filename
# Cell 2: View numeric columns
numeric_cols = df.select_dtypes(include=np.number).columns
print("Numeric columns in dataset:", numeric_cols)
# Cell 3: Calculate quartiles for a numeric column, e.g. 'age'
col = 'age'  # Replace with your column name
Q1 = np.percentile(df[col], 25)
Q2 = np.percentile(df[col], 50)
Q3 = np.percentile(df[col], 75)
print(f"{col} Q1: {Q1}, Q2 (Median): {Q2}, Q3: {Q3}")
# Cell 4: Plot boxplot for the column
plt.boxplot(df[col])
plt.title(f"Boxplot of {col}")
plt.ylabel(col)
plt.show()
Copyright 2025 Quantum Software Development. Code released under the MIT License.