An arena for comparing topic model quality.
- Sensible overview:
  - Should be representative of the documents: The content of as many documents as possible should be covered. (Task 1)
  - Should be informative: You should gain as much information about the corpus as possible. (TODO find relevant task)
  - Should be corpus specific: You should gain information about this corpus, and preferably not just about general themes. (Task 3)
- Getting into the weeds - Filtering/Retrieval
  - Should retrieve relevant documents when filtering based on one topic (Task 2)
- Discourse analysis (dynamic models, optional)
The arena would be an HF Space in the form of a web app, similar to mteb-arena. This could be implemented in React, Gradio or Dash. We can discuss this based on how much flexibility we need and how Python-dev friendly it needs to be.
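As a rough illustration of the Gradio option (placeholder data and function names, not a final design), a single pairwise task screen could look something like this:

```python
# Minimal sketch of one pairwise-comparison screen as a Gradio app on an HF Space.
# EXAMPLE_TASKS and record_vote are placeholders, not a final design.
import random

import gradio as gr

EXAMPLE_TASKS = [
    {
        "document": "Example document text ...",
        "model_a": "Topic 3: genes, protein, dna, sequence, expression",
        "model_b": "Topic 7: cell, biology, study, research, results",
    }
]


def next_task():
    task = random.choice(EXAMPLE_TASKS)
    return task["document"], task["model_a"], task["model_b"]


def record_vote(choice: str):
    # Placeholder: persist the vote somewhere durable, then serve the next task.
    print(f"Participant preferred model {choice}")
    return next_task()


with gr.Blocks() as demo:
    doc = gr.Textbox(label="Document", interactive=False)
    out_a = gr.Textbox(label="Model A topics", interactive=False)
    out_b = gr.Textbox(label="Model B topics", interactive=False)
    with gr.Row():
        btn_a = gr.Button("Prefer A")
        btn_b = gr.Button("Prefer B")
    btn_a.click(lambda: record_vote("A"), outputs=[doc, out_a, out_b])
    btn_b.click(lambda: record_vote("B"), outputs=[doc, out_a, out_b])
    demo.load(next_task, outputs=[doc, out_a, out_b])

demo.launch()
```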
Participation would be as follows:
- Anyone can enter without registration. We should keep at least some identifying information to be able to account for personal baselines when modelling.
- A participant will complete a number of randomly assigned tasks.
- You can quit at any time, and shouldn't be excluded if you only complete a couple of tasks. We should record how many tasks each participant completes (a possible record format is sketched after this list).
- We should, however, encourage completing more by giving people a personalized leaderboard after they complete N tasks (N could be something like 20-40).
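What exactly we log per response is still open; something along these lines (all field names are assumptions, not a finalized schema) would cover personal baselines, completion counts and sampling information:

```python
# Hypothetical per-response record; field names are assumptions.
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class TaskResponse:
    session_id: str           # anonymous session identifier, no registration required
    task_type: str            # e.g. "task_1"
    model_a: str
    model_b: str
    preferred_model: str
    document_id: str
    sampling_stratum: str     # how the document was sampled (e.g. "typical"/"atypical")
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
```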
The leaderboard should contain a table of results and ranks, as well as a radar chart displaying performance of models on different aspects.
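For the radar chart, plotly's polar traces would do; the aspect names and scores below are made up purely for illustration:

```python
# Sketch of the leaderboard radar chart; aspects and scores are illustrative only.
import plotly.graph_objects as go

aspects = ["Representativeness", "Retrieval", "Specificity", "Granularity"]
scores = {
    "Model A": [0.71, 0.64, 0.58, 0.66],
    "Model B": [0.65, 0.70, 0.61, 0.59],
}

fig = go.Figure()
for model, values in scores.items():
    fig.add_trace(go.Scatterpolar(r=values, theta=aspects, fill="toself", name=model))
fig.update_layout(polar=dict(radialaxis=dict(range=[0, 1])), showlegend=True)
fig.show()
```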
You get a document annotated by two topic models. You are presented with the highest scoring topics and their topic descriptions, and choose which model's output you prefer. We should also account for atypical documents when doing this, so that we can see how representative the models are. This is probably a question of sampling technique and of recording how specific documents were sampled.
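One possible way of handling this (an assumption on my part, not a settled design) is to stratify documents by cosine similarity to the corpus centroid in embedding space, sample from both ends, and record the stratum:

```python
# Sketch: sample typical and atypical documents by similarity to the corpus centroid.
import numpy as np


def sample_task1_documents(embeddings: np.ndarray, n_per_stratum: int, seed: int = 0):
    """Return document indices for a 'typical' and an 'atypical' stratum."""
    rng = np.random.default_rng(seed)
    centroid = embeddings.mean(axis=0)
    sims = (embeddings @ centroid) / (
        np.linalg.norm(embeddings, axis=1) * np.linalg.norm(centroid)
    )
    order = np.argsort(sims)   # ascending: least similar (atypical) documents first
    k = len(order) // 4        # bottom/top quartile as the two strata
    return {
        "atypical": rng.choice(order[:k], n_per_stratum, replace=False),
        "typical": rng.choice(order[-k:], n_per_stratum, replace=False),
    }
```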
Take a single topic model and sample four documents that rank sufficiently differently on the topic. Rank the four documents in the order in which you think they are relevant to the topic. We can then score agreement with the model's ranking using some rank-correlation (reranking) metric.
We could also potentially use hard examples, to see how granular the models' knowledge is.
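If we go with Kendall's tau as the rank-agreement metric (just one option), scoring a Task 2 response could look like this:

```python
# Sketch of Task 2 scoring, assuming Kendall's tau as the rank-agreement metric.
from scipy.stats import kendalltau


def task2_agreement(model_ranking: list[str], participant_ranking: list[str]) -> float:
    """Agreement between the model's ordering of the four documents and the
    participant's ordering (1.0 = identical, -1.0 = fully reversed)."""
    position_in_model = {doc: rank for rank, doc in enumerate(model_ranking)}
    tau, _ = kendalltau(
        [position_in_model[doc] for doc in participant_ranking],
        range(len(participant_ranking)),
    )
    return tau


# Example: the participant swaps two adjacent documents -> tau ~ 0.67
print(task2_agreement(["d1", "d2", "d3", "d4"], ["d1", "d3", "d2", "d4"]))
```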
We need a hierarchically organized corpus. This could be something like Wikipedia, where we have a very clear category hierarchy.
In both tasks, we select a parent corpus and two adjacent subcorpora. One of the subcorpora is appointed the main corpus and the other the intruder corpus, from which intruder documents can be sampled.
In Task 3 we fit two different topic models to the main corpus, then ask participants to assign the intruder document to one of the models. The less specific model will likely have the intruder document assigned to it.
In Task 4 we fit the same model to the main and parent corpora, resulting in a parent and a child model. Participants are asked to assign the intruder document to one of the models. If the child model is specific enough, we can expect participants to consistently assign the intruder document to the parent model.
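Aggregating Tasks 3 and 4 could then be as simple as counting where participants put the intruder documents; a minimal sketch under an assumed response schema:

```python
# Sketch: per-model "specificity" as the fraction of intruder documents that
# participants did NOT assign to it (higher = more corpus-specific).
from collections import Counter


def specificity_scores(responses: list[dict]) -> dict[str, float]:
    assigned = Counter()
    shown = Counter()
    for r in responses:
        # Assumed schema: {"model_a": ..., "model_b": ..., "assigned_to": ...}
        for model in (r["model_a"], r["model_b"]):
            shown[model] += 1
        assigned[r["assigned_to"]] += 1
    return {m: 1 - assigned[m] / shown[m] for m in shown}
```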
We should use a representative sample of modern topic models. topic-benchmark already has sensible implementations of a decent chunk of these, but more can be accommodated if need be; a partial configuration sketch follows the list. (Please tell me if we want other models too)
- BERTopic
- Top2Vec
- SemanticSignalSeparation (possibly take negative sides as separate positive topics to make it comparable with other models)
- KeyNMF
- FASTopic
- Possibly ECRTM
- CTM (we should probably settle on one of its variants, preferably CombinedTM)
- LDA
- Vanilla NMF
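Here is the partial configuration sketch mentioned above. It only shows the models whose APIs I'm sure of (BERTopic plus the sklearn baselines), with shared vectorizer and embedding settings; the turftopic / topic-benchmark models would be registered the same way.

```python
# Partial sketch of the model suite with shared components; not a full registry.
from bertopic import BERTopic
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

N_TOPICS = 20
vectorizer = CountVectorizer(min_df=10)

models = {
    "BERTopic": BERTopic(
        embedding_model="all-MiniLM-L6-v2",
        vectorizer_model=vectorizer,
    ),
    # Classical baselines operate on the bag-of-words representation directly.
    "LDA": LatentDirichletAllocation(n_components=N_TOPICS),
    "NMF": NMF(n_components=N_TOPICS),
}
```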
Since models (especially contextual ones) are made up of components, we should also account for these.
The embedding models determine the quality of the representations used for analysis, and should be accounted for at least to a certain extent.
I think using two sentence transformers is very reasonable (a loading sketch follows the list below):
- MiniLM-L6-v2 (paraphrase- or all-)
- mpnet-base-v2 (paraphrase- or all-)
Plus one static model, to see a) whether models are able to use static embeddings effectively and b) whether models gain any additional information from context.
- static-retrieval-mrl-en-v1
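The loading sketch mentioned above; since the all- vs. paraphrase- choice is still open, treat the checkpoint names below as placeholders:

```python
# Loading the three encoder components; checkpoint names are placeholders.
from sentence_transformers import SentenceTransformer

encoders = {
    "minilm": SentenceTransformer("all-MiniLM-L6-v2"),
    "mpnet": SentenceTransformer("all-mpnet-base-v2"),
    # Static embeddings: no contextualization, useful as a lower bound.
    "static": SentenceTransformer("sentence-transformers/static-retrieval-mrl-en-v1"),
}

embeddings = encoders["minilm"].encode(["An example document."])
print(embeddings.shape)  # (1, 384) for MiniLM-L6-v2
```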
Each model will need a vectorizer to determine what the model's vocabulary will be, and to provide word counts where relevant.
Using CountVectorizer(min_df=10) is a reasonable option in my opinion.
Should we remove stop words from the models' vocabulary? And if so, should we test which models are resistant to the presence of stop words? I'm also open to trying out something like NounPhraseCountVectorizer or KeyphraseVectorizers.
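Both vectorizer variants for comparison; whether to strip stop words is exactly the open question above:

```python
# The two vectorizer settings under discussion.
from sklearn.feature_extraction.text import CountVectorizer

# Keep stop words in the vocabulary (tests robustness to their presence)
vectorizer_raw = CountVectorizer(min_df=10)

# Remove English stop words from the models' vocabulary
vectorizer_clean = CountVectorizer(min_df=10, stop_words="english")
```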
I would prefer not to use LLMs. This is mainly because I believe that a) LLM-based topics are usually heavily dependent on the keyword/keyphrase based topic descriptions, but also because b) our results would be dependent on the LLM of choice and its limitations.
There is a list of openly usable, open-license datasets implemented in topic-benchmark; we could use these as a starting point, but I'm open to ideas about what else could be used.
- General
  - BBC News
- Hierarchical
  - Wikipedia
- Known baseline:
  - 20 Newsgroups
- Expert corpora
  - StackExchange discussions
  - ArXiv ML paper abstracts
Additionally, Wikipedia could be used for the hierarchical tasks.
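Of these, 20 Newsgroups is the easiest to wire up as the known baseline; the other corpora would come from topic-benchmark or the HF Hub. A minimal loading sketch:

```python
# Loading the known-baseline corpus; the other datasets would be loaded via
# topic-benchmark or the HF Hub.
from sklearn.datasets import fetch_20newsgroups

newsgroups = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes"))
documents = newsgroups.data
print(len(documents))  # roughly 18k documents
```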