The Berlin Marathon is one of the world’s most prestigious long-distance running events, known for its fast course and record-breaking performances.
For this project, we positioned ourselves as a data analytics company hired by a sports magazine to analyze the Berlin Marathon’s history.
The goal is to provide insights that will support the magazine in writing a feature article about the marathon, focusing on performance trends, gender dynamics, country dominance, and female participation.
Throughout this project, we applied data wrangling, cleaning, and exploratory data analysis (EDA) techniques to answer key research questions and test hypotheses about the evolution of the Berlin Marathon.
We used two datasets, both of which were cleaned and transformed before analysis:
- Source: Kaggle – Berlin Marathons Data
- Shape after cleaning: ~884,000 rows, 5 columns (
year,gender,time,finish_time,finish_seconds). - Strengths: Covers all participants, allows analysis of participation and performance distributions.
- Weaknesses: Missing country data for many runners.
- Source: Wikipedia – Berlin Marathon
- Shape after cleaning: 100 rows, 5 columns (
year,winner,country,time,gender). - Strengths: Clean overview of winners per year, good for studying records and country dominance.
- Weaknesses: Only includes winners, does not represent the full participant population.
- How have winning times evolved in the Berlin Marathon from 1974 to 2024?
- Are world records being broken more frequently in Berlin than in other marathons?
of female and male winner (Kinga)**
- What is the time difference between male and female winners?
- Has the gender time scores difference decreased over the years?
- Which countries have produced the most winners in the Berlin Marathon?
- How has the country distribution of winners shifted across decades?
- How has the percentage of female runners changed since 1974?
- Is the finishing time distribution of female runners approaching that of male runners?
- Winning times in the Berlin Marathon have significantly improved over the years.
- The time score difference between men and women has decreased over time.
- The number of participants from Africa has increased over the years.
- Female participation has increased since 1974.
We applied multiple data cleaning and wrangling techniques to both datasets:
- Standardized column names to
snake_case. - Removed duplicates and irrelevant columns (
country,agein the raw runners dataset). - Handled missing values (nulos in
country,age). - Normalized categorical values (
gender: male/female/unknown). - Converted
timetotimedeltaand createdfinish_seconds. - Reshaped winners dataset (wide → long format) and added
gendercolumn. - Saved final cleaned datasets to
data/clean/.
For analysis, we used:
- Aggregation & filtering with
groupby,value_counts, and pivot tables. - Visualizations with Matplotlib and Seaborn.
- Exported figures to
figures/for presentation.
- Winning times in the Berlin Marathon have consistently improved.
- Men’s performances have stabilized just above 2h01.
- Women achieved a major breakthrough in 2023 with 2h11.
- Berlin is the leading marathon for world records, hosting nearly all men’s records since 2003 and the women’s record in 2023.
- This far surpasses other cities like London or Chicago.
- Time results changed over the years for both genders.
- The gender time scores difference has decreased.
- Kenya (25 wins) and Ethiopia (20 wins) dominate the overall history.
- In early decades, winners came mainly from Europe (West Germany, UK).
- Since the 1990s, East African countries have taken the lead.
- Female participation has increased explosively since 1974 (5.0%). There has been a average 40% increase in female participation.In 2024: there is a 45% increase.
- Time gap between finishing times for female runners and male runners is steadily decreasing. In 1974, the time gap was at 61.3 min. As of 20204, it is now at a 30.4 min difference. At this current rate, times will be equal in 49 years
- Hypothesis 1: Supported – Winning times have consistently improved; men are near the 2h01 barrier and women set a breakthrough with 2h11 in 2023. Berlin is confirmed as the fastest course, with nearly all world records since 2003.
- Hypothesis 2: Supported – The performance gap score between men and women has narrowed.
- Hypothesis 3: Supported – African participation and victories have grown strongly (Kenya and Ethiopia lead since the 1990s).
- Hypothesis 4: Supported – Female participation has risen steadily since 1974.
- How do Berlin results compare to other World Marathon Majors (London, Boston, Tokyo)?
- What external factors (weather, temperature, technology, shoes) influence performance trends?
- How do amateur vs. elite trends differ?
first_project/ ├── 📂 data │ ├── 📂 raw │ └── 📂 clean │ ├── 📂 figures ├── 📂 notebooks ├── 📂 slides ├── 📂 sql_scripts
├── 📄 config.yaml ├── 📄 main.py ├── 📄 pyproject.toml ├── 📄 README.md └── 📄 uv.lock
- Project managed with Trello: Berlin Marathon Project Trello
- GitHub collaboration with branches per member (Irma, Kinga, Rafael, Joma).
- Each member responsible for one research question.
- Cleaned datasets in
/data/clean/. - Analysis notebooks in
/notebooks/. - Figures in
/figures/. - Final presentation: Google Slides
- Rafael – Analysis of Evolution of performance
- Kinga – Analysis of Gender gap
- Irma – Analysis of Country dominance
- Joma (Project Manager) – Analysis of Female participation