Academic R study comparing the predictive performance and training cost of regression trees (`rpart`) and random forests (`randomForest`) across 8 synthetic data scenarios, 1000 repetitions each. The random forest beats the single tree in every scenario, reducing test MSE by 34–68%.
The textbook answer says "random forests dominate single trees." This project asks the more honest questions: always? By how much? And is the extra training cost worth it? Eight scenarios are generated by crossing three factors — sample size (n = 500 vs 2500), predictor count (p = 5 vs 15), and signal type (linear y = X·β vs quadratic y = X²·α + X·β). Each scenario is replicated 1000 times so that mean differences are reported with empirical standard deviations rather than single-shot results.
Mean test MSE on a 20% hold-out, ±1 SD across 1000 repetitions per scenario. Lower is better. Numbers above each pair are the relative MSE reduction (Tree → Forest).
| Scenarios | Forest > Tree? | Error bars overlap? |
|---|---|---|
| GEL, GEQ (large n, many p, both signal types) | Decisive advantage | No (error bars don't touch) |
| PEL, PEQ (small n, many p) | Strong advantage | Slight overlap |
| GPL, GPQ, PPL, PPQ (few predictors) | Forest wins on average | Yes, meaningfully |
The headline "forest is always better" is true on the mean. But for the low-predictor scenarios (*P*), the variance of the tree's MSE is large enough that any single replication might show no gap. The advantage is most reliable when the data is harder (more rows, more columns, non-linear signal).
Forest training time is roughly 8–23× that of the single tree across scenarios:
| Scenario | Tree (s) | Forest (s) | Forest / Tree |
|---|---|---|---|
| PPL | 0.003 | 0.026 | 9× |
| PPQ | 0.0036 | 0.028 | 8× |
| GPL | 0.010 | 0.22 | 22× |
| GPQ | 0.011 | 0.24 | 22× |
| PEL | 0.008 | 0.073 | 9× |
| PEQ | 0.008 | 0.088 | 11× |
| GEL | 0.030 | 0.63 | 21× |
| GEQ | 0.035 | 0.79 | 23× |
Even at 0.79 s the absolute cost is small, but the ratio is large enough that for very low-stakes prediction problems with simple signal, the single tree is a defensible choice.
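The timing measurement itself is just `system.time()` around each fit. A minimal sketch for a single GEQ-like configuration (n = 2500, p = 15, quadratic signal); this is an illustration, not the project's script:

```r
library(rpart)
library(randomForest)

set.seed(151)
n <- 2500; p <- 15                      # GEQ-like: large n, many predictors
X <- matrix(rnorm(n * p), n, p)
beta <- rnorm(p); alpha <- rnorm(p)
y <- X^2 %*% alpha + X %*% beta         # quadratic signal
d <- data.frame(y = as.numeric(y), X)

t_tree   <- unname(system.time(rpart(y ~ ., data = d))["elapsed"])
t_forest <- unname(system.time(randomForest(y ~ ., data = d, ntree = 100))["elapsed"])
c(tree = t_tree, forest = t_forest, ratio = t_forest / t_tree)
```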
- Data generation — `X ~ N(0, 1)` matrix (n × p); `β, α ~ N(0, 1)` coefficient vectors. Two targets per replication: `y_line = X·β` and `y_quad = X²·α + X·β`.
- Split — 80% train / 20% test (random sample, fresh per replication; `set.seed(i + 151)`).
- Fit — `rpart()` and `randomForest(ntree = 100)` with all other hyperparameters at defaults.
- Measure — test MSE and `system.time()` elapsed seconds.
- Replicate — 1000 times × 4 (n, p) combinations × 2 relations = 8000 fits per model; results are written to `resultats_simulation.csv`. A single-replication sketch follows this list.
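A minimal sketch of one replication under this recipe (function and variable names are illustrative, not those of `Projet_simulation_final_V3.r`):

```r
library(rpart)
library(randomForest)

one_replication <- function(i, n = 500, p = 5, quadratic = FALSE) {
  set.seed(i + 151)                      # seed scheme described above
  X    <- matrix(rnorm(n * p), n, p)     # predictors ~ N(0, 1)
  beta <- rnorm(p); alpha <- rnorm(p)    # coefficient vectors ~ N(0, 1)
  y    <- if (quadratic) X^2 %*% alpha + X %*% beta else X %*% beta
  d    <- data.frame(y = as.numeric(y), X)

  train <- sample(n, size = 0.8 * n)     # fresh 80/20 split each replication
  fit_t <- rpart(y ~ ., data = d[train, ])
  fit_f <- randomForest(y ~ ., data = d[train, ], ntree = 100)

  test <- d[-train, ]
  c(mse_tree   = mean((predict(fit_t, test) - test$y)^2),
    mse_forest = mean((predict(fit_f, test) - test$y)^2))
}

one_replication(1, n = 500, p = 5, quadratic = TRUE)   # one PPQ-style draw
```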
- Language: R
- Modelling: `rpart`, `randomForest`
- Data wrangling: `dplyr`, `tidyr`
- Plots (R): `ggplot2`
- Reporting: PDF (`Rapport.pdf`)
- Chart for this README: Python + matplotlib (`assets/generate_charts.py`) — reproducible from the numbers in Tableau 2 of the report
Monte-Carlo-Simulation/
├── Projet_simulation_final_V3.r # Main R script: simulation + ggplot2 charts
├── Rapport.pdf # Full written report (French) with tables + plots
├── assets/
│ ├── mse_by_scenario.png # Headline MSE comparison (used in this README)
│ └── generate_charts.py # Reproducible chart from PDF-reported numbers
└── README.md
```r
# Open Projet_simulation_final_V3.r in RStudio

# Required packages:
install.packages(c("randomForest", "rpart", "ggplot2", "dplyr", "tidyr"))

# Run the full script (writes resultats_simulation.csv and renders the four ggplot2 charts)
source("Projet_simulation_final_V3.r")
```

To regenerate just the README chart from the recorded numbers:
```bash
python assets/generate_charts.py
```

- Academic project, not a benchmark. Designed as a Monte Carlo exercise — `rpart` and `randomForest` defaults only, no hyperparameter tuning. Real-world cross-validation might shrink or grow the gap.
- Scenario coding is non-standard. Three-letter codes (e.g. `GEQ`) compress sample size · predictor count · signal type. Spelled out: G = large n (2500), P = small n (500); E = many predictors (15, "Élevé"), P = few (5, "Petit"); L = linear, Q = quadratic. It is easy to misread `PE*` as "predictor + element" instead of "Petite n + Élevé p" — the chart's x-axis subtitle decodes it.
- Synthetic data only. Targets are generated from the predictors with no noise term beyond the random `β`/`α` draws and the random train/test split — there is no irreducible error injected, so the reported MSEs essentially measure how well each model recovers a known function. Real datasets behave differently.
- Forest's relative advantage can shrink at low p. Despite the textbook framing, the forest's MSE reduction at p = 5 spans 44–68%, versus 53–56% at p = 15: the absolute MSE gap is largest in the complex scenarios, but the relative reduction isn't strictly monotone in complexity. The README chart annotates the per-scenario reductions to make this visible.
- Single random seed strategy. `set.seed(i + 151)` yields 1000 distinct seeds per scenario, but the same 1000 seeds are reused across scenarios. Fine for reproducibility; it would not stand in for a serious bias/variance decomposition.
- No statistical test on the differences. The report uses overlapping vs. non-overlapping ±1 SD as the criterion for "decisive." A paired Wilcoxon or t-test across the 1000 replications would be stricter and is the natural next step (see the sketch below).
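A sketch of that next step, again assuming replication-level columns `scenario`, `mse_tree` and `mse_forest` in `resultats_simulation.csv` (not part of the current script):

```r
# Paired test per scenario: are the 1000 per-replication MSE differences
# (tree minus forest) systematically different from zero?
res <- read.csv("resultats_simulation.csv")   # column names assumed

by(res, res$scenario, function(s)
   wilcox.test(s$mse_tree, s$mse_forest, paired = TRUE)$p.value)
```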
