This project investigates whether Dua Lipa's lyrics exhibit characteristics consistent with Zipf's Law, a fundamental linguistic phenomenon.
We performed a comprehensive Exploratory Data Analysis (EDA) on 246 of her songs, exploring word frequencies, token distributions, and lyrical structures.
- Analyze Dua Lipa's lyrical data for statistical patterns.
- Test Zipf's Law, which states that a word's frequency is inversely proportional to its rank.
- Explore lexical diversity, word distributions, and structural insights into her songwriting.
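
For reference, the relationship being tested can be written as a power law. The symbols below ($f(r)$ for the frequency of the word at rank $r$, a constant $C$, and an exponent $s$) are notation introduced here for illustration, not taken from the project notebook. Taking logarithms gives a straight line, which is why a log-log plot and a linear regression are a natural way to check the law:

```latex
f(r) = \frac{C}{r^{s}}
\quad\Longrightarrow\quad
\log f(r) = \log C - s \,\log r
```

Under the classic form of Zipf's Law, $s = 1$, so the fitted line on log-log axes is expected to have a slope of about $-1$.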
- Python 3
- Jupyter Notebook
- Libraries:
  - pandas: data preprocessing
  - numpy: numerical analysis
  - matplotlib/seaborn: visualization
  - nltk: natural language processing
- Handled missing values to ensure dataset integrity.
- Normalized text: converted to lowercase, removed punctuation.
- Tokenized lyrics into individual words.
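
The three preprocessing steps above could be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the pandas/nltk pipeline used in the project; the names `tokenize` and `clean_corpus`, the token pattern, and the decision to keep in-word apostrophes are all assumptions.

```python
import re

# Assumed token pattern: runs of ASCII letters, optionally followed by
# one in-word apostrophe part (so "don't" stays a single token).
TOKEN_PATTERN = re.compile(r"[a-z]+(?:'[a-z]+)?")


def tokenize(lyrics):
    """Lowercase the text and return its word tokens without punctuation."""
    if not isinstance(lyrics, str):
        # Missing values (None, NaN, ...) yield no tokens.
        return []
    return TOKEN_PATTERN.findall(lyrics.lower())


def clean_corpus(songs):
    """Tokenize every song and drop entries that end up empty."""
    token_lists = (tokenize(song) for song in songs)
    return [tokens for tokens in token_lists if tokens]


if __name__ == "__main__":
    # Hypothetical sample rows, including a missing value.
    sample = ["Don't start now!", None, "New rules, NEW rules."]
    for tokens in clean_corpus(sample):
        print(tokens)
```

Digits and non-ASCII letters are dropped by this pattern; a real pipeline would likely use a tokenizer from nltk instead.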
- The distribution of word frequencies was long-tailed.
- The 20 most frequent words were dominated by common terms such as "you" and "I".
- Log-log plot of rank vs frequency showed a linear relationship.
- Regression slope ≈ -1.57, steeper than the idealized Zipf slope of -1.
- R² = 0.96, indicating a strong linear fit on log-log axes, consistent with Zipf's Law.
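
A rank-frequency fit of this kind can be sketched as below. It is an illustrative version under stated assumptions, not the notebook's actual code: the function name `zipf_fit` is hypothetical, the regression is an ordinary least-squares fit written out by hand rather than a library call, and base-10 logarithms are an arbitrary choice (the slope does not depend on the base).

```python
import math
from collections import Counter


def zipf_fit(tokens):
    """Fit log10(frequency) against log10(rank) by ordinary least squares.

    Returns (slope, intercept, r_squared).
    """
    # Frequencies sorted from most to least common; rank 1 is the top word.
    freqs = sorted(Counter(tokens).values(), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two distinct words to fit a line")

    xs = [math.log10(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log10(freq) for freq in freqs]

    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # For a one-variable linear fit, R^2 equals the squared correlation.
    r_squared = (sxy * sxy) / (sxx * syy) if syy > 0 else 1.0
    return slope, intercept, r_squared


if __name__ == "__main__":
    # Toy corpus whose frequencies follow 60 / rank exactly.
    toy = []
    for rank, word in enumerate(["you", "i", "love", "me", "now"], start=1):
        toy.extend([word] * (60 // rank))
    print(zipf_fit(toy))
```

On a corpus whose frequencies follow $C / r$ exactly, the fitted slope is -1 and R² is 1 up to floating-point error; real lyrics deviate from this ideal.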
- Correlation heatmaps revealed relationships between word counts, unique words, and average word length.
- Scatterplots/pairplots showed structural patterns in songwriting.
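
The per-song features behind these plots could be computed along the following lines. This is a sketch under assumptions: the function names `song_features` and `pearson` are hypothetical, and the correlation is computed by hand here, whereas the project presumably relied on pandas and seaborn for the heatmaps.

```python
import math


def song_features(tokens):
    """Return word count, unique word count, and average word length."""
    if not tokens:
        raise ValueError("song has no tokens")
    return {
        "word_count": len(tokens),
        "unique_words": len(set(tokens)),
        "avg_word_length": sum(len(t) for t in tokens) / len(tokens),
    }


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


if __name__ == "__main__":
    # Hypothetical tokenized songs, for illustration only.
    songs = [
        ["new", "rules", "new", "rules"],
        ["one", "kiss", "is", "all", "it", "takes"],
        ["do", "not", "start", "now"],
    ]
    feats = [song_features(s) for s in songs]
    counts = [f["word_count"] for f in feats]
    uniques = [f["unique_words"] for f in feats]
    print(pearson(counts, uniques))
```

Computing `pearson` for each pair of feature columns yields the matrix that a correlation heatmap visualizes.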
- Zipf's Law Adherence
  - Dua Lipa's lyrics strongly align with Zipf's Law.
- Lexical Diversity
  - Rich vocabulary with a few high-frequency words and many rare words.
- Structural Insights
  - Complexity in songwriting highlighted through relationships between unique word counts, word length, etc.
Through this project, we gained experience in:
- Text preprocessing & NLP basics.
- Applying statistical linguistics (Zipf's Law).
- Data visualization & correlation analysis.
- Collaborative research and presentation of results.
- Extend analysis to other artists for comparison.
- Perform sentiment analysis on lyrics.
- Create an interactive dashboard (Streamlit/Dash).
- Aditya Singh: Data collection & cleaning, EDA framework.
- Pratik Kumar Pan: Multivariate analysis, heatmaps, visualizations.
- Shreyas Sarkar: Zipf's Law analysis & interpretation.
- Priyank Gaur: Documentation & presentation design.
This project is for educational purposes only.
The lyrics dataset is used under fair use for linguistic analysis.