Predicting Spotify streams using cross-platform metrics (playlists, YouTube, TikTok). Box-Cox regression analysis with R²=0.821.

n33levo/spotify-cross-platform-analysis

Predicting Spotify Streaming Success Through Cross-Platform Engagement

A statistical analysis of how cross-platform metrics predict streaming success on Spotify.

Project Poster

Research Question

How do cross-platform engagement metrics (YouTube views, TikTok posts/likes, Apple/Deezer playlists, Shazam counts, airplay) and explicit content labels relate to Spotify streaming counts? Does the effect of playlist inclusion differ for explicit vs. clean tracks?

Understanding these relationships helps artists, labels, and streaming platforms make informed decisions about promotion strategies and resource allocation.

Dataset

We analyzed 2,597 tracks from Kaggle's "Most Streamed Spotify Songs 2024" dataset, the number remaining after cleaning (removing duplicates and missing values). The dataset includes:

  • Response variable: Spotify stream counts (ranging from ~1.42M to 4.28B)
  • Predictors:
    • Spotify, Apple Music, and Deezer playlist counts
    • YouTube views
    • TikTok posts and likes
    • Shazam counts
    • Airplay spins
    • Spotify popularity score
    • Explicit content label (Clean/Explicit)
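The cleaning step (dropping duplicates and rows with missing values) could look roughly like the sketch below. It is written in Python for illustration only; the project itself used R, and the column names `track`, `artist`, and `streams` are hypothetical placeholders, not the dataset's confirmed headers.

```python
def clean_rows(rows, key_fields, required_fields):
    """Drop rows with missing required fields, then drop duplicate keys.

    rows: list of dicts (e.g. from csv.DictReader)
    key_fields: fields that together identify a track (hypothetical choice)
    required_fields: fields that must be present and non-empty
    """
    seen = set()
    cleaned = []
    for row in rows:
        # Skip rows where any required value is missing or empty.
        if any(row.get(f) in (None, "") for f in required_fields):
            continue
        key = tuple(row.get(f) for f in key_fields)
        # Keep only the first occurrence of each key.
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(row)
    return cleaned


# Hypothetical toy rows, not taken from the real dataset.
toy_rows = [
    {"track": "A", "artist": "X", "streams": "100"},
    {"track": "A", "artist": "X", "streams": "100"},  # duplicate
    {"track": "B", "artist": "Y", "streams": ""},     # missing value
    {"track": "C", "artist": "Z", "streams": "300"},
]
cleaned = clean_rows(toy_rows, ["track", "artist"], ["track", "artist", "streams"])
```

Which fields define a duplicate is a judgment call; the source does not say which rule the authors used.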

Summary statistics:

  • Mean streams: 602 million
  • Median streams: 434 million
  • 82.3% clean tracks, 17.7% explicit tracks

Methods

Preliminary Analysis

We started with a linear model on raw stream counts, but it violated key assumptions: the residuals showed a clear funnel pattern (heteroscedasticity), and the QQ plot showed a heavy right tail.

Transformation Strategy

Applied Box-Cox transformation to stabilize variance:

  • Optimal λ ≈ 0.55 (95% CI: [0.52, 0.57])
  • Near a square-root transformation
  • Dramatically improved residual behavior
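For readers unfamiliar with the transformation, here is a minimal Python sketch of the Box-Cox transform and its inverse at the reported λ. It assumes the standard form (y^λ − 1)/λ; the project estimated λ in R with MASS, and this sketch is not the authors' code.

```python
import math


def box_cox(y, lam):
    """Box-Cox transform of a positive value y for a given lambda."""
    if y <= 0:
        raise ValueError("Box-Cox requires y > 0")
    if abs(lam) < 1e-12:
        return math.log(y)  # limiting case as lambda approaches 0
    return (y ** lam - 1.0) / lam


def inv_box_cox(z, lam):
    """Map a transformed value back to the original scale."""
    if abs(lam) < 1e-12:
        return math.exp(z)
    return (lam * z + 1.0) ** (1.0 / lam)


LAMBDA = 0.55            # point estimate reported above
median_streams = 434e6   # median from the summary statistics

z = box_cox(median_streams, LAMBDA)
round_trip = inv_box_cox(z, LAMBDA)  # should recover median_streams
```

At λ = 0.5 the transform reduces to 2(√y − 1), which is why a λ near 0.55 behaves much like a square root.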

Model Selection

Used theory-driven variable selection rather than automated methods:

  • Interaction test: Playlist × explicit interaction showed no improvement (partial F p ≈ 0.97), so we dropped it
  • TikTok retention: Despite weak evidence (p ≈ 0.40), we kept TikTok metrics to preserve the cross-platform scope of the research question
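The partial F test behind the interaction decision compares the residual sum of squares (RSS) of two nested models. The Python sketch below computes only the F statistic; the RSS values and degrees of freedom are hypothetical placeholders, not numbers from the analysis.

```python
def partial_f(rss_reduced, rss_full, extra_params, df_resid_full):
    """F statistic for comparing nested linear models.

    rss_reduced: RSS of the model without the extra terms
    rss_full: RSS of the model with the extra terms
    extra_params: number of coefficients added in the full model
    df_resid_full: residual degrees of freedom of the full model
    """
    if extra_params <= 0 or df_resid_full <= 0 or rss_full <= 0:
        raise ValueError("inputs must be positive")
    if rss_full > rss_reduced:
        raise ValueError("full model cannot have larger RSS than reduced model")
    numerator = (rss_reduced - rss_full) / extra_params
    denominator = rss_full / df_resid_full
    return numerator / denominator


# Hypothetical values chosen only to illustrate a negligible improvement.
f_stat = partial_f(rss_reduced=1000.0, rss_full=999.9,
                   extra_params=3, df_resid_full=2583)
```

An F statistic far below 1, as in this toy case, means the extra terms explain almost none of the remaining variance, which is the situation a p-value near 0.97 describes.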

Diagnostics

  • Multicollinearity: All adjusted VIF values < 5 (range: 1.26–4.06)
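  • Influence: Max Cook's distance ≈ 1.48, which exceeds the common rule-of-thumb threshold of 1, but the sensitivity check below indicates that no single track dominates the results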
  • Sensitivity: Dropping the 5 most influential tracks barely changed coefficients or fit statistics
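The two diagnostics above follow standard textbook formulas, sketched here in Python. This shows the plain VIF, not whatever adjustment the authors applied to get their "adjusted VIF" values, and all inputs are hypothetical.

```python
def vif(r_squared):
    """Variance inflation factor for one predictor.

    r_squared: R^2 from regressing that predictor on all other predictors.
    """
    if not 0.0 <= r_squared < 1.0:
        raise ValueError("r_squared must be in [0, 1)")
    return 1.0 / (1.0 - r_squared)


def cooks_distance(std_resid, leverage, n_params):
    """Cook's distance for one observation.

    std_resid: internally studentized residual
    leverage: diagonal element of the hat matrix, in [0, 1)
    n_params: number of fitted coefficients, including the intercept
    """
    if not 0.0 <= leverage < 1.0:
        raise ValueError("leverage must be in [0, 1)")
    if n_params <= 0:
        raise ValueError("n_params must be positive")
    return (std_resid ** 2 / n_params) * leverage / (1.0 - leverage)


# Hypothetical inputs: a predictor that is 75% explained by the others,
# and an observation with a large residual and high leverage.
example_vif = vif(0.75)
example_cook = cooks_distance(std_resid=2.0, leverage=0.5, n_params=4)
```

A VIF of 5 corresponds to 80% of a predictor's variance being explained by the other predictors, which is where the "< 5" cutoff comes from.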

Results

Model Fit

The final Box-Cox model (without interaction) achieved:

  • R² = 0.821
  • Adjusted R² = 0.821
  • 104 standardized residuals exceed |2| (down from 123 in the log model)
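R² and adjusted R² are nearly identical here because the sample is large relative to the number of predictors. The Python sketch below uses the usual adjustment formula and assumes 10 predictors, a count inferred from the predictor list and not stated in the source.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 for n observations and p predictors (intercept excluded)."""
    if n - p - 1 <= 0:
        raise ValueError("need n > p + 1")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


N_TRACKS = 2597     # sample size reported in the Dataset section
N_PREDICTORS = 10   # assumption: one coefficient per listed predictor
R2 = 0.821          # rounded value reported above

# Size of the penalty that the adjustment subtracts from R^2.
gap = R2 - adjusted_r2(R2, N_TRACKS, N_PREDICTORS)
```

Under these assumptions the penalty is below 0.001, which is consistent with the two reported values agreeing at three decimals.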

Key Findings

Strongest positive predictors:

  • Spotify playlist count (β = 0.43)
  • Apple Music playlist count (β = 136)
  • Spotify popularity (β = 770)
  • YouTube views (β = 1.14×10⁻⁵)

Negative associations:

  • Explicit tracks (β = -5,730): after controlling for promotion metrics, explicit content is associated with fewer streams
  • Airplay spins (β = -0.0103)

Practical interpretations (back-transformed to original scale at median covariate values):

  • Adding 1,000 Spotify playlists → tens of millions more streams
  • Adding 10 million YouTube views → millions more streams
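The back-transformation behind these interpretations can be written out as follows. This is a sketch that assumes the standard Box-Cox form for the response; the symbols $\hat{z}_0$ (the fitted transformed value at the median covariates) and $\Delta x_j$ (the change in predictor $j$) are notation introduced here, not taken from the report.

```latex
% Transformed response and fitted value after changing predictor j
z = \frac{y^{\lambda} - 1}{\lambda},
\qquad
\hat{z}_1 = \hat{z}_0 + \hat{\beta}_j \, \Delta x_j

% Back-transform to the stream scale
\hat{y} = \left( \lambda \hat{z} + 1 \right)^{1/\lambda}

% Implied change in streams
\Delta \hat{y}
  = \left( \lambda \left( \hat{z}_0 + \hat{\beta}_j \, \Delta x_j \right) + 1 \right)^{1/\lambda}
  - \left( \lambda \hat{z}_0 + 1 \right)^{1/\lambda}
```

Because the back-transform is nonlinear, the implied change in streams depends on the baseline $\hat{z}_0$, which is why the interpretations are stated at median covariate values.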

What Drives Spotify Success?

Playlist placement and YouTube visibility are the strongest predictors of streams in this model. Spotify and Apple Music playlist counts show the strongest associations, followed by YouTube views and Spotify's own popularity score. Explicit content carries a modest penalty even after accounting for promotion levels. TikTok metrics showed weak signals in this sample, likely because viral TikTok success is highly unpredictable and concentrated in a few mega-hits.

Limitations

  • Sample bias: Dataset contains only hit songs; results don't generalize to the full population of music releases
  • Artist clustering: Multiple tracks per artist may violate independence assumption
  • Scale interpretation: Box-Cox transformation improves model fit but makes coefficients less intuitive
  • TikTok uncertainty: Weak evidence for TikTok effects (p ≈ 0.40); retained for theoretical completeness but contribution is uncertain

Ethics

We chose manual, theory-based model selection over automated methods to avoid overfitting and maintain transparency. All data came from publicly available aggregate metrics (no individual listener information). The analysis focuses on industry-level insights for promotion strategy rather than making claims about artistic merit or prescribing what music should be created.

Repository Structure

├── data_raw/               # Original Kaggle dataset
├── data_clean/             # Cleaned and processed data
├── rmd/
│   ├── spotify_part2.Rmd              # Part 1 preliminary analysis
│   └── spotify_part2_analysis.Rmd     # Part 2 full analysis with transformations
├── prop and present/       # LaTeX poster and presentation files
├── spotify_part2.pdf       # Final rendered analysis report
└── STA302 Poster.pdf       # Conference-style research poster

Technologies Used

  • R for statistical analysis
    • tidyverse for data manipulation
    • broom for model summaries
    • car for diagnostics (VIF)
    • MASS for Box-Cox transformation
  • R Markdown for reproducible reports
  • LaTeX/Beamer for poster and presentation
  • Git/GitHub for version control

References

  • Aguiar, L., & Waldfogel, J. (2021). Platforms, promotion, and product discovery: Evidence from Spotify playlists. Journal of Industrial Economics, 69(3), 653–691.
  • Interiano, M., et al. (2018). Musical trends and predictability of success in contemporary songs in and out of the top charts. Royal Society Open Science, 5(5), 171274.
  • Kaimann, D., & Cox, J. (2021). Music characteristics, originality, and the success of contemporary songs. Empirical Studies of the Arts, 39(1), 96–119.
  • Nelgiriyewithana, N. (2024). Most Streamed Spotify Songs 2024 [Dataset]. Kaggle.
