An arena for comparing topic model quality.
- Sensible overview:
  - Should be representative of the documents: The content of as many documents as possible should be covered. (Task 1)
  - Should be informative: You should gain as much information about the corpus as possible. (TODO find relevant task)
  - Should be corpus specific: You should gain information about this corpus, and preferably not just about general themes. (Task 3)
- Getting into the weeds - Filtering/Retrieval
  - Should retrieve relevant documents when filtering based on one topic (Task 2)
- Discourse analysis (dynamic models, optional)
The arena would be an HF Space in the form of a web app, similar to mteb-arena. This could be implemented in React, Gradio or Dash. We can discuss this based on how much flexibility we need and how Python-dev friendly it needs to be.
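As a rough illustration of the Gradio option (placeholder data and function names, not a final design), a single pairwise task screen could look something like this:

```python
# Minimal sketch of one pairwise-comparison screen as a Gradio app on an HF Space.
# EXAMPLE_TASKS and record_vote are placeholders, not a final design.
import random

import gradio as gr

EXAMPLE_TASKS = [
    {
        "document": "Example document text ...",
        "model_a": "Topic 3: genes, protein, dna, sequence, expression",
        "model_b": "Topic 7: cell, biology, study, research, results",
    }
]


def next_task():
    task = random.choice(EXAMPLE_TASKS)
    return task["document"], task["model_a"], task["model_b"]


def record_vote(choice: str):
    # Placeholder: persist the vote somewhere durable, then serve the next task.
    print(f"Participant preferred model {choice}")
    return next_task()


with gr.Blocks() as demo:
    doc = gr.Textbox(label="Document", interactive=False)
    out_a = gr.Textbox(label="Model A topics", interactive=False)
    out_b = gr.Textbox(label="Model B topics", interactive=False)
    with gr.Row():
        btn_a = gr.Button("Prefer A")
        btn_b = gr.Button("Prefer B")
    btn_a.click(lambda: record_vote("A"), outputs=[doc, out_a, out_b])
    btn_b.click(lambda: record_vote("B"), outputs=[doc, out_a, out_b])
    demo.load(next_task, outputs=[doc, out_a, out_b])

demo.launch()
```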
Participation would be as follows:
- Anyone can enter without registration. We should keep at least some identifying information to be able to account for personal baselines when modelling.
- A participant will complete a number of randomly assigned tasks.
- You can quit at any time, and shouldn't be excluded if you only complete a couple of tasks. We should record how many tasks each participant completes (a possible record format is sketched after this list).
- We should, however, encourage completing more by giving people a personalized leaderboard after they complete N tasks (N could be something like 20-40).
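What exactly we log per response is still open; something along these lines (all field names are assumptions, not a finalized schema) would cover personal baselines, completion counts and sampling information:

```python
# Hypothetical per-response record; field names are assumptions.
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class TaskResponse:
    session_id: str           # anonymous session identifier, no registration required
    task_type: str            # e.g. "task_1"
    model_a: str
    model_b: str
    preferred_model: str
    document_id: str
    sampling_stratum: str     # how the document was sampled (e.g. "typical"/"atypical")
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
```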
The leaderboard should contain a table of results and ranks, as well as a radar chart displaying performance of models on different aspects.
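For the radar chart, plotly's polar traces would do; the aspect names and scores below are made up purely for illustration:

```python
# Sketch of the leaderboard radar chart; aspects and scores are illustrative only.
import plotly.graph_objects as go

aspects = ["Representativeness", "Retrieval", "Specificity", "Granularity"]
scores = {
    "Model A": [0.71, 0.64, 0.58, 0.66],
    "Model B": [0.65, 0.70, 0.61, 0.59],
}

fig = go.Figure()
for model, values in scores.items():
    fig.add_trace(go.Scatterpolar(r=values, theta=aspects, fill="toself", name=model))
fig.update_layout(polar=dict(radialaxis=dict(range=[0, 1])), showlegend=True)
fig.show()
```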
You get a document annotated by two topic models. You are presented with the highest scoring topics and their topic descriptions, and choose which model's output you prefer. We should also account for atypical documents when doing this, so that we can see how representative the models are. This is probably a question of sampling technique and of recording how specific documents were sampled.
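One possible way of handling this (an assumption on my part, not a settled design) is to stratify documents by cosine similarity to the corpus centroid in embedding space, sample from both ends, and record the stratum:

```python
# Sketch: sample typical and atypical documents by similarity to the corpus centroid.
import numpy as np


def sample_task1_documents(embeddings: np.ndarray, n_per_stratum: int, seed: int = 0):
    """Return document indices for a 'typical' and an 'atypical' stratum."""
    rng = np.random.default_rng(seed)
    centroid = embeddings.mean(axis=0)
    sims = (embeddings @ centroid) / (
        np.linalg.norm(embeddings, axis=1) * np.linalg.norm(centroid)
    )
    order = np.argsort(sims)   # ascending: least similar (atypical) documents first
    k = len(order) // 4        # bottom/top quartile as the two strata
    return {
        "atypical": rng.choice(order[:k], n_per_stratum, replace=False),
        "typical": rng.choice(order[-k:], n_per_stratum, replace=False),
    }
```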
Take a single topic model and sample four documents that rank sufficiently differently on the topic. Rank the four documents in the order in which you think they are relevant to the topic. We can then score agreement with the model's ranking using some rank-correlation (reranking) metric.
We could also potentially use hard examples, to see how granular the models' knowledge is.
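If we go with Kendall's tau as the rank-agreement metric (just one option), scoring a Task 2 response could look like this:

```python
# Sketch of Task 2 scoring, assuming Kendall's tau as the rank-agreement metric.
from scipy.stats import kendalltau


def task2_agreement(model_ranking: list[str], participant_ranking: list[str]) -> float:
    """Agreement between the model's ordering of the four documents and the
    participant's ordering (1.0 = identical, -1.0 = fully reversed)."""
    position_in_model = {doc: rank for rank, doc in enumerate(model_ranking)}
    tau, _ = kendalltau(
        [position_in_model[doc] for doc in participant_ranking],
        range(len(participant_ranking)),
    )
    return tau


# Example: the participant swaps two adjacent documents -> tau ~ 0.67
print(task2_agreement(["d1", "d2", "d3", "d4"], ["d1", "d3", "d2", "d4"]))
```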
We need a hierarchically organized corpus. This could be something like Wikipedia, where we have a very clear category hierarchy.
In both tasks, we select a parent corpus and two adjacent subcorpora. One of the subcorpora is appointed the main corpus and the other the intruder corpus, from which intruder documents can be sampled.
In Task 3 we fit two different topic models to the main corpus, then ask participants to assign the intruder document to one of the models. The less specific model will likely have the intruder document assigned to it.
In Task 4 we fit the same model to the main and parent corpora, resulting in a parent and a child model. Participants are asked to assign the intruder document to one of the models. If the child model is specific enough, we can expect participants to consistently assign the intruder document to the parent model.
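Aggregating Tasks 3 and 4 could then be as simple as counting where participants put the intruder documents; a minimal sketch under an assumed response schema:

```python
# Sketch: per-model "specificity" as the fraction of intruder documents that
# participants did NOT assign to it (higher = more corpus-specific).
from collections import Counter


def specificity_scores(responses: list[dict]) -> dict[str, float]:
    assigned = Counter()
    shown = Counter()
    for r in responses:
        # Assumed schema: {"model_a": ..., "model_b": ..., "assigned_to": ...}
        for model in (r["model_a"], r["model_b"]):
            shown[model] += 1
        assigned[r["assigned_to"]] += 1
    return {m: 1 - assigned[m] / shown[m] for m in shown}
```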
We should use a representative sample of modern topic models. topic-benchmark already has sensible implementations of a decent chunk of these, but more can be accommodated if need be; a partial configuration sketch follows the list. (Please tell me if we want other models too)
- BERTopic
- Top2Vec
- SemanticSignalSeparation (possibly take negative sides as separate positive topics to make it comparable with other models)
- KeyNMF
- FASTopic
- Possibly ECRTM
- CTM (we should probably settle on one of its variants, preferably CombinedTM)
- LDA
- Vanilla NMF
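Here is the partial configuration sketch mentioned above. It only shows the models whose APIs I'm sure of (BERTopic plus the sklearn baselines), with shared vectorizer and embedding settings; the turftopic / topic-benchmark models would be registered the same way.

```python
# Partial sketch of the model suite with shared components; not a full registry.
from bertopic import BERTopic
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

N_TOPICS = 20
vectorizer = CountVectorizer(min_df=10)

models = {
    "BERTopic": BERTopic(
        embedding_model="all-MiniLM-L6-v2",
        vectorizer_model=vectorizer,
    ),
    # Classical baselines operate on the bag-of-words representation directly.
    "LDA": LatentDirichletAllocation(n_components=N_TOPICS),
    "NMF": NMF(n_components=N_TOPICS),
}
```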
Since models (especially contextual ones) are made up of components, we should also account for these.
The embedding models determine the quality of the representations used for analysis, and should be accounted for at least to a certain extent.
I think using two sentence transformers is very reasonable (a loading sketch follows the list below):
- MiniLM-L6-v2 (paraphrase- or all-)
- mpnet-base-v2 (paraphrase- or all-)
Plus one static model, to see a) whether models are able to use static embeddings effectively and b) whether models gain any additional information from context.
- static-retrieval-mrl-en-v1
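The loading sketch mentioned above; since the all- vs. paraphrase- choice is still open, treat the checkpoint names below as placeholders:

```python
# Loading the three encoder components; checkpoint names are placeholders.
from sentence_transformers import SentenceTransformer

encoders = {
    "minilm": SentenceTransformer("all-MiniLM-L6-v2"),
    "mpnet": SentenceTransformer("all-mpnet-base-v2"),
    # Static embeddings: no contextualization, useful as a lower bound.
    "static": SentenceTransformer("sentence-transformers/static-retrieval-mrl-en-v1"),
}

embeddings = encoders["minilm"].encode(["An example document."])
print(embeddings.shape)  # (1, 384) for MiniLM-L6-v2
```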
Each model will need a vectorizer to determine what the model's vocabulary will be, and to provide word counts where relevant.
Using CountVectorizer(min_df=10) is a reasonable option in my opinion.
Should we remove stop words from the models' vocabulary? And if so, should we test which models are resistant to the presence of stop words? I'm also open to trying out something like NounPhraseCountVectorizer or KeyphraseVectorizers.
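Both vectorizer variants for comparison; whether to strip stop words is exactly the open question above:

```python
# The two vectorizer settings under discussion.
from sklearn.feature_extraction.text import CountVectorizer

# Keep stop words in the vocabulary (tests robustness to their presence)
vectorizer_raw = CountVectorizer(min_df=10)

# Remove English stop words from the models' vocabulary
vectorizer_clean = CountVectorizer(min_df=10, stop_words="english")
```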
I would prefer not to use LLMs. This is mainly because I believe that a) LLM-based topics are usually heavily dependent on the keyword/keyphrase based topic descriptions, but also because b) our results would be dependent on the LLM of choice and its limitations.
There is a list of openly usable, open-license datasets implemented in topic-benchmark; we could use these as a starting point, but I'm open to ideas about what else could be used.
- General
  - BBC News
- Hierarchical
  - Wikipedia
- Known baseline:
  - 20 Newsgroups
- Expert corpora
  - StackExchange discussions
  - ArXiv ML paper abstracts
Additionally, Wikipedia could be used for the hierarchical tasks.
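Of these, 20 Newsgroups is the easiest to wire up as the known baseline; the other corpora would come from topic-benchmark or the HF Hub. A minimal loading sketch:

```python
# Loading the known-baseline corpus; the other datasets would be loaded via
# topic-benchmark or the HF Hub.
from sklearn.datasets import fetch_20newsgroups

newsgroups = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes"))
documents = newsgroups.data
print(len(documents))  # roughly 18k documents
```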