Monte Carlo Simulation — Decision Tree vs. Random Forest

Academic R study comparing the predictive performance and training cost of regression trees (rpart) and random forests (randomForest) across 8 synthetic data scenarios, 1000 repetitions each. The random forest beats the single tree in every scenario, reducing mean test MSE by 34–68%.

🎯 Objective

The textbook answer says "random forests dominate single trees." This project asks the more pointed questions: always? By how much? And is the extra training cost worth it? Eight scenarios are generated by crossing three factors — sample size (n=500 vs 2500), predictor count (p=5 vs 15), and signal type (linear y = X·β vs quadratic y = X²·α + X·β). Each scenario is replicated 1000 times so that mean differences are reported with empirical standard deviations rather than single-shot results.
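The 2×2×2 design can be enumerated mechanically. A minimal Python sketch of the scenario grid — the three-letter codes follow the report's convention (G/P for n, E/P for p, L/Q for signal), while the helper names here are purely illustrative:

```python
from itertools import product

# Factor levels from the study design:
sizes = {"G": 2500, "P": 500}          # Grande / Petite n
predictors = {"E": 15, "P": 5}         # Élevé / Petit p
signals = {"L": "linear", "Q": "quadratic"}

# Crossing the three factors yields the 8 scenario codes.
scenarios = {
    a + b + c: (sizes[a], predictors[b], signals[c])
    for a, b, c in product(sizes, predictors, signals)
}

print(len(scenarios))      # 8
print(scenarios["GEQ"])    # (2500, 15, 'quadratic')
```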

📊 Results

MSE by scenario

Mean test MSE on a 20% hold-out, ±1 SD across 1000 repetitions per scenario. Lower is better. Numbers above each pair are the relative MSE reduction (Tree → Forest).
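The per-scenario reduction annotated on the chart is the standard relative improvement. A one-line helper — the numbers below are illustrative, not the report's:

```python
def relative_reduction(mse_tree, mse_forest):
    """Fraction of the tree's MSE removed by the forest (Tree -> Forest)."""
    return 1.0 - mse_forest / mse_tree

# Illustrative: a forest MSE of 0.4 against a tree MSE of 1.0
# is a 60% relative reduction.
print(round(100 * relative_reduction(1.0, 0.4)))  # 60
```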

Where forest's edge is decisive vs. arguable

| Scenarios | Forest > Tree | Error bars overlap? |
|---|---|---|
| GEL, GEQ (large n, many p, both relations) | Decisive: error bars don't touch | No |
| PEL, PEQ (small n, many p) | Strong, but slight overlap | Slight |
| GPL, GPQ, PPL, PPQ (few predictors) | Forest wins on average, but error bars overlap meaningfully | Yes |

The headline "forest is always better" is true on the mean. But for the low-predictor scenarios (*P*), the variance of the tree's MSE is large enough that any single replication might show no gap. The advantage is most reliable when the data is harder (more rows, more columns, non-linear signal).
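The "error bars overlap" criterion is simply interval intersection on mean ± 1 SD. A small sketch of that check (function name is illustrative):

```python
def bars_overlap(mean_a, sd_a, mean_b, sd_b):
    """True if the +/-1 SD intervals around the two means intersect."""
    return max(mean_a - sd_a, mean_b - sd_b) <= min(mean_a + sd_a, mean_b + sd_b)

print(bars_overlap(1.0, 0.1, 0.5, 0.1))  # False: [0.9, 1.1] vs [0.4, 0.6]
print(bars_overlap(1.0, 0.3, 0.5, 0.3))  # True: [0.7, 1.3] meets [0.2, 0.8]
```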

Cost of the win

Forest training time is roughly 8–23× that of the tree across scenarios:

| Scenario | Tree (s) | Forest (s) | Forest / Tree |
|---|---|---|---|
| PPL | 0.003 | 0.026 | ≈9× |
| PPQ | 0.0036 | 0.028 | ≈8× |
| GPL | 0.010 | 0.22 | 22× |
| GPQ | 0.011 | 0.24 | 22× |
| PEL | 0.008 | 0.073 | ≈9× |
| PEQ | 0.008 | 0.088 | 11× |
| GEL | 0.030 | 0.63 | 21× |
| GEQ | 0.035 | 0.79 | 23× |

Even at 0.79 s the absolute cost is small, but the ratio is large enough that for very low-stakes prediction problems with simple signal, the single tree is a defensible choice.
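The ratio column is plain elapsed-time division; a quick check against two of the rows above:

```python
timings = {  # (tree_s, forest_s) pairs taken from the timing table
    "GEL": (0.030, 0.63),
    "PEQ": (0.008, 0.088),
}
for code, (tree_s, forest_s) in timings.items():
    print(code, round(forest_s / tree_s))  # GEL 21, PEQ 11
```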

🏗️ Methodology

  1. Data generation — X ~ N(0, 1) matrix (n × p); β, α ~ N(0, 1) coefficient vectors. Two targets per replication: y_line = X·β and y_quad = X²·α + X·β.
  2. Split — 80% train / 20% test (random sample, fresh per replication; set.seed(i + 151)).
  3. Fit — rpart() and randomForest(ntree=100) with all other hyperparameters at defaults.
  4. Measure — test MSE and system.time() elapsed seconds.
  5. Replicate — 1000 times × 4 (n, p) combinations × 2 relations = 8000 fits per model, results dumped to resultats_simulation.csv.
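The actual implementation is the R script above. Purely as an illustration of the same loop, here is one replication sketched in Python, with scikit-learn's tree and forest standing in for rpart/randomForest — every name and parameter here is a stand-in, not the script's:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

def one_replication(n=500, p=5, quadratic=False, seed=151):
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))        # X ~ N(0, 1), n x p
    beta = rng.standard_normal(p)          # beta ~ N(0, 1)
    alpha = rng.standard_normal(p)         # alpha ~ N(0, 1)
    y = X @ beta + ((X**2) @ alpha if quadratic else 0.0)

    # 80/20 split, fresh per replication
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)

    tree = DecisionTreeRegressor(random_state=seed).fit(X_tr, y_tr)
    forest = RandomForestRegressor(n_estimators=100, random_state=seed).fit(X_tr, y_tr)

    return (mean_squared_error(y_te, tree.predict(X_te)),
            mean_squared_error(y_te, forest.predict(X_te)))

mse_tree, mse_forest = one_replication(quadratic=True)
print(mse_tree, mse_forest)  # the forest usually wins on this kind of signal
```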

🛠️ Tech Stack

  • Language: R
  • Modelling: rpart, randomForest
  • Data wrangling: dplyr, tidyr
  • Plots (R): ggplot2
  • Reporting: PDF (Rapport.pdf)
  • Chart for this README: Python + matplotlib (assets/generate_charts.py) — reproducible from the numbers in Table 2 of the report

📁 Repository Structure

```text
Monte-Carlo-Simulation/
├── Projet_simulation_final_V3.r   # Main R script: simulation + ggplot2 charts
├── Rapport.pdf                    # Full written report (French) with tables + plots
├── assets/
│   ├── mse_by_scenario.png        # Headline MSE comparison (used in this README)
│   └── generate_charts.py         # Reproducible chart from PDF-reported numbers
└── README.md
```

🚀 How to Run

```r
# Open Projet_simulation_final_V3.r in RStudio
# Required packages:
install.packages(c("randomForest", "rpart", "ggplot2", "dplyr", "tidyr"))

# Run the full script (writes resultats_simulation.csv and renders the four ggplot2 charts)
source("Projet_simulation_final_V3.r")
```

To regenerate just the README chart from the recorded numbers:

```shell
python assets/generate_charts.py
```

📝 Notes / Limitations

  • Academic project, not a benchmark. Designed as a Monte Carlo exercise — rpart and randomForest defaults only, no hyperparameter tuning. Real-world cross-validation might shrink or grow the gap.
  • Scenario coding is non-standard. Three-letter codes (e.g. GEQ) compress sample size · predictor count · signal type. Spelled out: G = large n (2500), P = small n (500); E = many predictors (15, "Élevé"), P = few (5, "Petit"); L = linear, Q = quadratic. Easy to misread PE* as "predictor + element" instead of "Petite n + Élevé p" — the chart's x-axis subtitle decodes it.
  • Synthetic data only. Targets are generated from the predictors with no noise term beyond the random β/α draws and the random train/test split — there is no irreducible error injected, so the reported MSEs are essentially "how well the model recovers a known function." Real datasets behave differently.
  • Forest's relative advantage is not monotone in p. Despite the textbook framing, the forest's reduction in scenarios with p=5 spans 44–68%, versus 53–56% at p=15. The forest's absolute MSE gap is largest in complex scenarios, but the ratio isn't strictly monotone in complexity. The README chart annotates the per-scenario reductions to make this visible.
  • Single random seed strategy. set.seed(i + 151) yields 1000 distinct seeds per scenario, but the same 1000 seeds are reused across scenarios. Fine for reproducibility; would not stand in for a serious bias / variance decomposition.
  • No statistical test on the differences. The report uses overlapping vs. non-overlapping ± 1 SD as the criterion for "decisive." A paired Wilcoxon or t-test across the 1000 replications would be a stricter test and is the natural next step.
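The paired test suggested above is straightforward once the 1000 per-replication MSE pairs are in hand. As a sketch of the t-test variant on fabricated data (the real pairs would come from resultats_simulation.csv; all values here are made up):

```python
import math
import random
import statistics

random.seed(0)
# Fabricated stand-in for 1000 per-replication (tree, forest) MSE pairs:
# the forest is simulated as roughly half the tree's error plus noise.
mse_tree = [random.gammavariate(4.0, 1.0) for _ in range(1000)]
mse_forest = [0.5 * t + random.gauss(0, 0.05) for t in mse_tree]

# Paired t-statistic on the per-replication differences d_i = tree_i - forest_i.
d = [t - f for t, f in zip(mse_tree, mse_forest)]
t_stat = statistics.mean(d) / (statistics.stdev(d) / math.sqrt(len(d)))
print(t_stat > 10)  # a gap this large is overwhelmingly significant
```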
