Cluster tweets using the Jaccard Distance metric and the K-means clustering algorithm. This approach groups similar tweets together, which is useful for applications like trend analysis and content organization on Twitter.
The data is accessible through the link below:
https://archive.ics.uci.edu/dataset/438/health+news+in+twitter
Steps:
Tokenize Tweets: Convert tweets into sets of words.
Define Jaccard Distance: Calculate the similarity between sets.
Initialize Centroids: Randomly pick k tweets as initial centroids.
Cluster Assignment: Assign each tweet to the nearest centroid based on Jaccard distance.
Update Centroids: Update each centroid to the tweet in the cluster that minimizes the sum of distances to other tweets in the same cluster.
Calculate SSE: Sum of Squared Errors for evaluation.