Academic R study comparing the predictive performance and training cost of regression trees (`rpart`) and random forests (`randomForest`) across 8 synthetic data scenarios, 1000 repetitions each. The random forest beats the single tree in every scenario, reducing test MSE by 34–68%.
The textbook answer says "random forests dominate single trees." This project asks the more honest questions: always? By how much? And is the extra training cost worth it? Eight scenarios are generated by crossing three factors — sample size (n = 500 vs 2500), predictor count (p = 5 vs 15), and signal type (linear y = X·β vs quadratic y = X²·α + X·β). Each scenario is replicated 1000 times so that mean differences are reported with empirical standard deviations rather than single-shot results.
Mean test MSE on a 20% hold-out, ±1 SD across 1000 repetitions per scenario. Lower is better. Numbers above each pair are the relative MSE reduction (Tree → Forest).
| Scenarios | Forest > Tree? | Error bars overlap? |
|---|---|---|
| GEL, GEQ (large n, many p, both signal types) | Decisive advantage | No (error bars don't touch) |
| PEL, PEQ (small n, many p) | Strong advantage | Slight overlap |
| GPL, GPQ, PPL, PPQ (few predictors) | Forest wins on average | Yes, meaningfully |
The headline "forest is always better" is true on the mean. But for the low-predictor scenarios (*P*), the variance of the tree's MSE is large enough that any single replication might show no gap. The advantage is most reliable when the data is harder (more rows, more columns, non-linear signal).
Forest training time is roughly 8–23× that of the single tree across scenarios:
| Scenario | Tree (s) | Forest (s) | Forest / Tree |
|---|---|---|---|
| PPL | 0.003 | 0.026 | 9× |
| PPQ | 0.0036 | 0.028 | 8× |
| GPL | 0.010 | 0.22 | 22× |
| GPQ | 0.011 | 0.24 | 22× |
| PEL | 0.008 | 0.073 | 9× |
| PEQ | 0.008 | 0.088 | 11× |
| GEL | 0.030 | 0.63 | 21× |
| GEQ | 0.035 | 0.79 | 23× |
Even at 0.79 s the absolute cost is small, but the ratio is large enough that for very low-stakes prediction problems with simple signal, the single tree is a defensible choice.
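The timing measurement itself is just `system.time()` around each fit. A minimal sketch for a single GEQ-like configuration (n = 2500, p = 15, quadratic signal); this is an illustration, not the project's script:

```r
library(rpart)
library(randomForest)

set.seed(151)
n <- 2500; p <- 15                      # GEQ-like: large n, many predictors
X <- matrix(rnorm(n * p), n, p)
beta <- rnorm(p); alpha <- rnorm(p)
y <- X^2 %*% alpha + X %*% beta         # quadratic signal
d <- data.frame(y = as.numeric(y), X)

t_tree   <- unname(system.time(rpart(y ~ ., data = d))["elapsed"])
t_forest <- unname(system.time(randomForest(y ~ ., data = d, ntree = 100))["elapsed"])
c(tree = t_tree, forest = t_forest, ratio = t_forest / t_tree)
```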
- Data generation — `X ~ N(0, 1)` matrix (n × p); `β, α ~ N(0, 1)` coefficient vectors. Two targets per replication: `y_line = X·β` and `y_quad = X²·α + X·β`.
- Split — 80% train / 20% test (random sample, fresh per replication; `set.seed(i + 151)`).
- Fit — `rpart()` and `randomForest(ntree = 100)` with all other hyperparameters at defaults.
- Measure — test MSE and `system.time()` elapsed seconds.
- Replicate — 1000 times × 4 (n, p) combinations × 2 relations = 8000 fits per model; results are written to `resultats_simulation.csv`. A single-replication sketch follows this list.
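A minimal sketch of one replication under this recipe (function and variable names are illustrative, not those of `Projet_simulation_final_V3.r`):

```r
library(rpart)
library(randomForest)

one_replication <- function(i, n = 500, p = 5, quadratic = FALSE) {
  set.seed(i + 151)                      # seed scheme described above
  X    <- matrix(rnorm(n * p), n, p)     # predictors ~ N(0, 1)
  beta <- rnorm(p); alpha <- rnorm(p)    # coefficient vectors ~ N(0, 1)
  y    <- if (quadratic) X^2 %*% alpha + X %*% beta else X %*% beta
  d    <- data.frame(y = as.numeric(y), X)

  train <- sample(n, size = 0.8 * n)     # fresh 80/20 split each replication
  fit_t <- rpart(y ~ ., data = d[train, ])
  fit_f <- randomForest(y ~ ., data = d[train, ], ntree = 100)

  test <- d[-train, ]
  c(mse_tree   = mean((predict(fit_t, test) - test$y)^2),
    mse_forest = mean((predict(fit_f, test) - test$y)^2))
}

one_replication(1, n = 500, p = 5, quadratic = TRUE)   # one PPQ-style draw
```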
- Language: R
- Modelling: `rpart`, `randomForest`
- Data wrangling: `dplyr`, `tidyr`
- Plots (R): `ggplot2`
- Reporting: PDF (`Rapport.pdf`)
- Chart for this README: Python + matplotlib (`assets/generate_charts.py`) — reproducible from the numbers in Tableau 2 of the report
Monte-Carlo-Simulation/
├── Projet_simulation_final_V3.r # Main R script: simulation + ggplot2 charts
├── Rapport.pdf # Full written report (French) with tables + plots
├── assets/
│ ├── mse_by_scenario.png # Headline MSE comparison (used in this README)
│ └── generate_charts.py # Reproducible chart from PDF-reported numbers
└── README.md
```r
# Open Projet_simulation_final_V3.r in RStudio

# Required packages:
install.packages(c("randomForest", "rpart", "ggplot2", "dplyr", "tidyr"))

# Run the full script (writes resultats_simulation.csv and renders the four ggplot2 charts)
source("Projet_simulation_final_V3.r")
```

To regenerate just the README chart from the recorded numbers:
```bash
python assets/generate_charts.py
```

- Academic project, not a benchmark. Designed as a Monte Carlo exercise — `rpart` and `randomForest` defaults only, no hyperparameter tuning. Real-world cross-validation might shrink or grow the gap.
- Scenario coding is non-standard. Three-letter codes (e.g. `GEQ`) compress sample size · predictor count · signal type. Spelled out: G = large n (2500), P = small n (500); E = many predictors (15, "Élevé"), P = few (5, "Petit"); L = linear, Q = quadratic. It is easy to misread `PE*` as "predictor + element" instead of "Petite n + Élevé p" — the chart's x-axis subtitle decodes it.
- Synthetic data only. Targets are generated from the predictors with no noise term beyond the random `β`/`α` draws and the random train/test split — there is no irreducible error injected, so the reported MSEs essentially measure how well each model recovers a known function. Real datasets behave differently.
- Forest's relative advantage can shrink at low p. Despite the textbook framing, the forest's MSE reduction at p = 5 spans 44–68%, versus 53–56% at p = 15: the absolute MSE gap is largest in the complex scenarios, but the relative reduction isn't strictly monotone in complexity. The README chart annotates the per-scenario reductions to make this visible.
- Single random seed strategy. `set.seed(i + 151)` yields 1000 distinct seeds per scenario, but the same 1000 seeds are reused across scenarios. Fine for reproducibility; it would not stand in for a serious bias/variance decomposition.
- No statistical test on the differences. The report uses overlapping vs. non-overlapping ±1 SD as the criterion for "decisive." A paired Wilcoxon or t-test across the 1000 replications would be stricter and is the natural next step (see the sketch below).
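A sketch of that next step, again assuming replication-level columns `scenario`, `mse_tree` and `mse_forest` in `resultats_simulation.csv` (not part of the current script):

```r
# Paired test per scenario: are the 1000 per-replication MSE differences
# (tree minus forest) systematically different from zero?
res <- read.csv("resultats_simulation.csv")   # column names assumed

by(res, res$scenario, function(s)
   wilcox.test(s$mse_tree, s$mse_forest, paired = TRUE)$p.value)
```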
