TL;DR This work builds an unsupervised model that clusters Thai news into 10 categories, using TF-IDF and SimCSE-WangchanBERTa weighted by the number of named entities as vector representations, and k-means as the clustering model.
Create an unsupervised model that clusters Sanook news into 10 categories.
- The data was scraped from sanook.com, ordered by most popular views within each category.
- There are 10 categories: crime, politics, money, technology, sport, health, horoscope, car, game, and entertain.
- I scraped the pages using Selenium and BeautifulSoup.
- The source code can be found in sanook_web_scraping.ipynb, or you can download it from Google Drive.
I create a Bag-of-Words (TF-IDF) vector representation and use it as the baseline; a minimal sketch of this pipeline appears after the list below.
- Text cleaning: remove links, symbols, numbers, and special characters
- Word tokenization: newmm (dictionary-based, Maximum Matching + Thai Character Cluster)
- TF-IDF vectorization
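Below is a minimal sketch of this baseline, assuming PyThaiNLP for newmm tokenization and scikit-learn's TfidfVectorizer; the cleaning regex is an assumption, not taken verbatim from the notebook.

```python
# A minimal sketch of the TF-IDF baseline (cleaning regex is an assumption).
import re

from pythainlp.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

def clean_text(text: str) -> str:
    # Remove links, then anything that is not a Thai character, Latin letter, or space
    text = re.sub(r"https?://\S+", " ", text)
    text = re.sub(r"[^\u0E00-\u0E7Fa-zA-Z ]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def thai_tokenize(text: str):
    # newmm: dictionary-based Maximum Matching with Thai Character Clusters
    return word_tokenize(text, engine="newmm", keep_whitespace=False)

docs = ["..."]  # list of raw article texts
vectorizer = TfidfVectorizer(tokenizer=thai_tokenize, preprocessor=clean_text,
                             token_pattern=None)
X_tfidf = vectorizer.fit_transform(docs)  # shape: (n_documents, vocab_size)
```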
The second representation uses Transformer sentence embeddings:
- Text cleaning: remove links, symbols, numbers, and special characters
- Sentence tokenization: CRF
- Sentence embedding: The best model is WangchanBERTa with SimCSE.
- Weighting by the number of named entities: after the sentences are embedded by the Transformer model, each sentence vector is weighted by the number of named entities of particular types in that sentence, and the document vector is the weighted average

  $$\mathbf{d} = \frac{\sum_{s} w_s\, \mathbf{v}_s}{\sum_{s} w_s}$$

  where the weight $w_s$ increases with $n_s$, the number of named entities of particular types in sentence $s$ (e.g., $w_s = 1 + n_s$), and $\mathbf{v}_s$ is the embedding of sentence $s$. This weighting scheme is adopted from https://ieeexplore.ieee.org/document/9085059
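A minimal sketch of this weighted document vector, assuming $w_s = 1 + n_s$; `embed` and `count_entities` are hypothetical stand-ins for the SimCSE-WangchanBERTa encoder and the NER tagger used in the notebook.

```python
# A sketch of the named-entity-weighted document vector, assuming w_s = 1 + n_s.
# `embed` and `count_entities` are hypothetical stand-ins for the SimCSE encoder
# and the NER tagger; they are passed in so the sketch stays self-contained.
import numpy as np
from pythainlp.tokenize import sent_tokenize

def document_vector(text, embed, count_entities):
    # CRF-based Thai sentence segmentation ("crfcut" is PyThaiNLP's CRF engine)
    sentences = sent_tokenize(text, engine="crfcut")
    vectors = np.stack([embed(s) for s in sentences])      # (n_sentences, 768)
    n = np.array([count_entities(s) for s in sentences])   # n_s per sentence
    w = 1.0 + n                                            # more entities -> more weight
    return (w[:, None] * vectors).sum(axis=0) / w.sum()    # weighted average, 768-d
```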
After we get the vector representation, we use it as the clustering features. I used simple k-means clustering, following the code below.
```python
from sklearn.cluster import KMeans

k = 10  # one cluster per news category
km = KMeans(n_clusters=k, max_iter=100, n_init=55)
clusters = km.fit_predict(X)  # X: the document vectors from the step above
```
For web scraping (you can skip this; we have already downloaded the data for you):
- Install the libraries by running this command:
pip install -r requirements.txt
- Download chromedriver.exe and put it in the working directory.
- Then run the sanook_web_scraping.ipynb notebook in that environment; a minimal sketch of the scraping loop is shown below.
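For orientation, here is a minimal sketch of the scraping loop; the URL and the link selector are illustrative assumptions, not the exact ones in sanook_web_scraping.ipynb.

```python
# A minimal sketch of the scraping loop; the URL pattern and link filter
# are assumptions, not the exact ones used in sanook_web_scraping.ipynb.
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()  # requires chromedriver.exe on PATH or in this directory
driver.get("https://www.sanook.com/news/")  # example category page
soup = BeautifulSoup(driver.page_source, "html.parser")
# Collect article links; the "/news/" filter below is a placeholder
links = [a["href"] for a in soup.select("a[href]") if "/news/" in a["href"]]
driver.quit()
```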
For document clustering:
- Run the Document_clustering.ipynb notebook on Google Colab. It contains:
- Text preprocessing
- Text representation
- Bag-of-Words
- Transformer Embedding
- Clustering model
- Evaluation
- Error analysis
The class of each cluster is chosen as the most frequent true label within that cluster.
The predictions are then compared with the labels using accuracy as the evaluation metric; a sketch of this mapping is shown below.
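A sketch of this evaluation, assuming `y_true` (gold category ids) and `clusters` (k-means assignments) are aligned integer arrays:

```python
# Map each cluster to its most frequent true label, then score the induced
# predictions with accuracy. `y_true` and `clusters` (e.g., from km.fit_predict)
# are assumed to be aligned integer arrays.
import numpy as np
from sklearn.metrics import accuracy_score

def majority_label_accuracy(y_true: np.ndarray, clusters: np.ndarray) -> float:
    y_pred = np.empty_like(y_true)
    for c in np.unique(clusters):
        members = clusters == c
        labels, counts = np.unique(y_true[members], return_counts=True)
        y_pred[members] = labels[np.argmax(counts)]  # most frequent label in cluster c
    return accuracy_score(y_true, y_pred)
```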
Vector representation technique | Accuracy |
---|---|
TF-IDF | 0.8216 |
SimCSE-WangchanBERTa | 0.8330 |
SimCSE-WangchanBERTa weighted by number of named entities | 0.8445 |
SimCSE-WangchanBERTa fine-tuned, weighted by number of named entities | 0.7368 |
- I tried several Transformer models (BERT, RoBERTa, and WangchanBERTa) by adding a pooling layer to obtain embedding vectors of shape (number_of_samples, 768), but they did not perform well on this task; a mean-pooling sketch follows this list.
- SimCSE improves the model's performance.
- The SimCSE model weighted by the number of named entities is the best in my experiments.
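For reference, a sketch of mean pooling over token embeddings with the WangchanBERTa checkpoint from Hugging Face; mean pooling is one common choice, and the exact pooling layer in my experiments may differ.

```python
# A sketch of mean pooling over token embeddings to get one 768-d vector per
# sentence, using the public WangchanBERTa checkpoint. Mean pooling is one
# common choice; the exact pooling used in the experiments may differ.
import torch
from transformers import AutoModel, AutoTokenizer

name = "airesearch/wangchanberta-base-att-spm-uncased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

def embed(sentences: list[str]) -> torch.Tensor:
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state    # (batch, seq_len, 768)
    mask = batch["attention_mask"].unsqueeze(-1)     # ignore padding tokens
    return (hidden * mask).sum(1) / mask.sum(1)      # (batch, 768)
```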
- Try other clustering models (e.g., hierarchical clustering, DBSCAN)
- Try dimensionality reduction methods (e.g., PCA)
- Try other weighting schemes
- Try vector representations from the Doc2vec method
- Try soft clustering (topic modeling), e.g., LDA
- Pre-trained models from Hugging Face
- The weighting scheme is adopted from https://ieeexplore.ieee.org/document/9085059
- WangchanBERTa: Pretraining transformer-based Thai Language Models
- SimCSE: Simple Contrastive Learning of Sentence Embeddings