A statistical analysis of how cross-platform metrics predict streaming success on Spotify.
How do cross-platform engagement metrics (YouTube views, TikTok posts/likes, Apple/Deezer playlists, Shazam counts, airplay) and explicit content labels relate to Spotify streaming counts? Does the effect of playlist inclusion differ for explicit vs. clean tracks?
Understanding these relationships helps artists, labels, and streaming platforms make informed decisions about promotion strategies and resource allocation.
We analyzed 2,597 tracks from Kaggle's "Most Streamed Spotify Songs 2024" dataset (Nelgiriyewithana, 2024) after removing duplicate rows and rows with missing values. The dataset includes:
- Response variable: Spotify stream counts (ranging from ~1.42M to 4.28B)
- Predictors:
  - Spotify, Apple Music, and Deezer playlist counts
  - YouTube views
  - TikTok posts and likes
  - Shazam counts
  - Airplay spins
  - Spotify popularity score
  - Explicit content label (Clean/Explicit)
Summary statistics:
- Mean streams: 602 million
- Median streams: 434 million
- 82.3% clean tracks, 17.7% explicit tracks
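The cleaning and summary steps above can be sketched roughly as follows. The toy data frame and its column names (`track`, `streams`, `explicit`) are hypothetical, and base R is used only to keep the sketch self-contained; the project itself lists tidyverse for data manipulation.

```r
# Toy stand-in for the raw Kaggle file; column names are hypothetical.
raw <- data.frame(
  track    = c("A", "A", "B", "C", "D"),
  streams  = c(5e8, 5e8, 2e8, NA, 9e8),
  explicit = c("Clean", "Clean", "Explicit", "Clean", "Clean"),
  stringsAsFactors = FALSE
)

# Drop exact duplicate rows, then rows with any missing value.
deduped <- raw[!duplicated(raw), ]
clean   <- deduped[complete.cases(deduped), ]

# Summary statistics of the kind reported above.
mean_streams   <- mean(clean$streams)
median_streams <- median(clean$streams)
explicit_share <- prop.table(table(clean$explicit))
```

A right-skewed response shows up here as a mean well above the median, which matches the 602 million vs. 434 million gap reported above.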
We started with a linear model on raw stream counts but found serious assumption violations: the residuals showed a clear funnel pattern (heteroscedasticity), and the QQ plot showed a heavy right tail.
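As an illustration of how such violations can be spotted, here is a minimal sketch on simulated data (not the project's data). The variable names and the simulated relationship are assumptions made for the example.

```r
set.seed(302)

# Simulated right-skewed response whose spread grows with its mean.
n <- 500
x <- runif(n, 0, 10)
y <- exp(1 + 0.4 * x + rnorm(n, sd = 0.5))

raw_fit <- lm(y ~ x)

# Residuals vs. fitted values: a funnel shape suggests heteroscedasticity.
plot(fitted(raw_fit), resid(raw_fit),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Normal QQ plot: points rising above the line at the upper end
# indicate a heavy right tail.
qqnorm(rstandard(raw_fit))
qqline(rstandard(raw_fit))

# Rough numeric check: does residual spread grow with the fitted values?
spread_trend <- cor(abs(resid(raw_fit)), fitted(raw_fit))
```

A clearly positive `spread_trend` is only a quick heuristic; the plots remain the primary diagnostic.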
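We applied a Box-Cox transformation to the response to stabilize variance:
- Optimal λ ≈ 0.55 (95% CI: [0.52, 0.57])
- Close to a square-root transformation, although λ = 0.5 lies just below the interval
- Markedly improved residual behavior relative to the raw-scale model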
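The λ estimate and its interval can be obtained along the following lines with `MASS::boxcox()`. This is a sketch on simulated data, and it assumes the standard parameterization (y^λ − 1)/λ; the source does not state which form was used.

```r
library(MASS)  # provides boxcox()

set.seed(302)

# Simulated data in which sqrt(y) is linear in x, so the profile
# likelihood should peak near lambda = 0.5. Not the project's data.
n <- 500
x <- runif(n, 1, 10)
y <- (5 + 2 * x + rnorm(n, sd = 1))^2

raw_fit <- lm(y ~ x)

# Profile log-likelihood over a grid of lambda values.
bc <- boxcox(raw_fit, lambda = seq(-1, 2, by = 0.01), plotit = FALSE)
lambda_hat <- bc$x[which.max(bc$y)]

# Approximate 95% CI: lambdas whose log-likelihood lies within
# qchisq(0.95, 1) / 2 of the maximum.
in_ci     <- bc$y > max(bc$y) - qchisq(0.95, df = 1) / 2
lambda_ci <- range(bc$x[in_ci])

# Refit on the transformed response (valid here because lambda_hat != 0).
y_bc   <- (y^lambda_hat - 1) / lambda_hat
bc_fit <- lm(y_bc ~ x)
```

The grid spacing of 0.01 is an arbitrary choice; a finer grid gives a more precise estimate at slightly higher cost.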
We used theory-driven variable selection rather than automated methods:
- Interaction test: The playlist × explicit interaction did not improve the fit (partial F-test p ≈ 0.97), so we dropped it
- TikTok retention: Despite weak evidence (p ≈ 0.40), we kept the TikTok metrics to preserve the cross-platform scope of the research question
Diagnostics on the resulting model:
- Multicollinearity: All adjusted VIF values < 5 (range: 1.26–4.06)
- Influence: Max Cook's distance ≈ 1.48, which exceeds the common rule-of-thumb cutoff of 1, so we checked sensitivity directly
- Sensitivity: Dropping the 5 most influential tracks barely changed coefficients or fit statistics, so no single track dominates the results
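The checks listed above can be sketched as follows on simulated data. Variable names are hypothetical, and the VIF is computed by hand to keep the example free of extra packages; the project itself uses `car` for this.

```r
set.seed(302)

# Simulated stand-in data; not the project's data.
n <- 400
d <- data.frame(
  playlists = rpois(n, 50),
  youtube   = rnorm(n, 100, 20),
  explicit  = factor(sample(c("Clean", "Explicit"), n, replace = TRUE,
                            prob = c(0.8, 0.2)))
)
d$y <- 10 + 0.4 * d$playlists + 0.1 * d$youtube -
  2 * (d$explicit == "Explicit") + rnorm(n)

reduced <- lm(y ~ playlists + youtube + explicit, data = d)
full    <- lm(y ~ playlists * explicit + youtube, data = d)

# Partial F test: does the playlist x explicit interaction improve the fit?
interaction_p <- anova(reduced, full)[2, "Pr(>F)"]

# VIF for one predictor: 1 / (1 - R^2) from regressing it on the others.
aux <- lm(playlists ~ youtube + explicit, data = d)
vif_playlists <- 1 / (1 - summary(aux)$r.squared)

# Influence: Cook's distance for every observation.
cd <- cooks.distance(reduced)

# Sensitivity: refit without the 5 most influential rows and compare.
top5        <- order(cd, decreasing = TRUE)[1:5]
refit       <- lm(y ~ playlists + youtube + explicit, data = d[-top5, ])
coef_change <- coef(refit) - coef(reduced)
```

Small entries in `coef_change` relative to the coefficients themselves are what "barely changed" means in practice; the threshold for "small" is a judgment call.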
The final Box-Cox model (without interaction) achieved:
- R² = 0.821
- Adjusted R² = 0.821
- 104 standardized residuals exceed 2 in absolute value (about 4% of tracks, down from 123 under a log transformation)
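These fit statistics can be read off a fitted `lm` object as sketched below. The data are simulated and the variable names are hypothetical.

```r
set.seed(302)

# Simulated stand-in for the transformed-response model.
n  <- 300
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)

s      <- summary(fit)
r2     <- s$r.squared
adj_r2 <- s$adj.r.squared

# Adjusted R^2 penalizes R^2 for the number of predictors p.
p <- length(coef(fit)) - 1
adj_r2_manual <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Number of standardized residuals larger than 2 in absolute value.
n_large <- sum(abs(rstandard(fit)) > 2)
```

With roughly 2,600 tracks and around ten predictors, the penalty factor (n − 1)/(n − p − 1) is very close to 1, which is why R² and adjusted R² nearly coincide in the results above.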
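Strongest positive predictors (coefficients are on the Box-Cox scale, per one-unit increase in each predictor, so raw magnitudes are not comparable across predictors):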
- Spotify playlist count (β = 0.43)
- Apple Music playlist count (β = 136)
- Spotify popularity (β = 770)
- YouTube views (β = 1.14×10⁻⁵)
Negative associations:
- Explicit tracks (β = -5,730): after controlling for promotion metrics, explicit content is associated with fewer streams
- Airplay spins (β = -0.0103)
Practical interpretations (back-transformed to original scale at median covariate values):
- 1,000 additional Spotify playlists are associated with tens of millions more streams
- 10 million additional YouTube views are associated with millions more streams
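Back-transforming an effect to the original scale can be sketched as below, assuming the standard Box-Cox form (y^λ − 1)/λ with λ ≠ 0. The function names and the numeric inputs are hypothetical and chosen only to illustrate the calculation; they are not the project's fitted values.

```r
# Box-Cox transform and its inverse (lambda != 0).
bc_forward <- function(y, lambda) (y^lambda - 1) / lambda
bc_inverse <- function(z, lambda) (lambda * z + 1)^(1 / lambda)

# Change in the original-scale response when a predictor with coefficient
# `beta` (on the transformed scale) increases by `delta`, starting from a
# fitted value `baseline` on the original scale.
effect_on_original_scale <- function(baseline, beta, delta, lambda) {
  z0 <- bc_forward(baseline, lambda)
  bc_inverse(z0 + beta * delta, lambda) - baseline
}

# Hypothetical inputs: same coefficient and increment, two baselines.
gain_low  <- effect_on_original_scale(baseline = 1e6, beta = 0.5,
                                      delta = 100, lambda = 0.5)
gain_high <- effect_on_original_scale(baseline = 1e8, beta = 0.5,
                                      delta = 100, lambda = 0.5)
```

Because the transform is nonlinear, the same coefficient implies a larger change in streams at a higher baseline (`gain_high` exceeds `gain_low`), which is why the interpretations above are tied to a specific reference point.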
Playlist placement and YouTube visibility are the strongest predictors of streams. Spotify and Apple Music playlist inclusion show the strongest associations, followed by YouTube views and Spotify's own popularity metric. The association between playlist inclusion and streams does not appear to differ between explicit and clean tracks (interaction p ≈ 0.97). Explicit content carries a modest penalty even after accounting for promotion levels. TikTok metrics showed weak signals in this sample, possibly because viral TikTok success is highly unpredictable and concentrated in a few mega-hits.
Limitations:
- Sample bias: The dataset contains only hit songs, so results may not generalize to the full population of music releases
- Artist clustering: Multiple tracks per artist may violate independence assumption
- Scale interpretation: Box-Cox transformation improves model fit but makes coefficients less intuitive
- TikTok uncertainty: Weak evidence for TikTok effects (p ≈ 0.40); retained for theoretical completeness but contribution is uncertain
We chose manual, theory-based model selection over automated methods to avoid overfitting and maintain transparency. All data came from publicly available aggregate metrics (no individual listener information). The analysis focuses on industry-level insights for promotion strategy rather than making claims about artistic merit or prescribing what music should be created.
Repository structure:
├── data_raw/ # Original Kaggle dataset
├── data_clean/ # Cleaned and processed data
├── rmd/
│ ├── spotify_part2.Rmd # Part 1 preliminary analysis
│ └── spotify_part2_analysis.Rmd # Part 2 full analysis with transformations
├── prop and present/ # LaTeX poster and presentation files
├── spotify_part2.pdf # Final rendered analysis report
└── STA302 Poster.pdf # Conference-style research poster
Tools:
- R for statistical analysis
- tidyverse for data manipulation
- broom for model summaries
- car for diagnostics (VIF)
- MASS for Box-Cox transformation
- R Markdown for reproducible reports
- LaTeX/Beamer for poster and presentation
- Git/GitHub for version control
References:
- Aguiar, L., & Waldfogel, J. (2021). Platforms, promotion, and product discovery: Evidence from Spotify playlists. Journal of Industrial Economics, 69(3), 653–691.
- Interiano, M., et al. (2018). Musical trends and predictability of success in contemporary songs in and out of the top charts. Royal Society Open Science, 5(5), 171274.
- Kaimann, D., & Cox, J. (2021). Music characteristics, originality, and the success of contemporary songs. Empirical Studies of the Arts, 39(1), 96–119.
- Nelgiriyewithana, N. (2024). Most Streamed Spotify Songs 2024 [Dataset]. Kaggle.
