topic-arena

An arena for comparing topic model quality.

What do we want from a topic model:

  1. Sensible overview:
    • Should be representative of the documents: the content of as many documents as possible should be covered. (Task 1)
    • Should be informative: You should gain as much information about the corpus as possible. (TODO find relevant task)
    • Should be corpus specific: You should gain information about this corpus, and preferably not just about general themes. (Task 3)
  2. Getting into the weeds - Filtering/Retrieval
    • Should retrieve relevant documents when filtering based on one topic (Task 2)
  3. Discourse analysis (dynamic models, optional)

Arena

The arena would be an HF Space in the form of a web app, similar to mteb-arena. This could be implemented in React, Gradio or Dash. We can discuss this based on how much flexibility we need and how Python-dev friendly it needs to be.

[Figure: arena flowchart]
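
If we end up going with Gradio, the Task 1 voting screen could be as simple as the sketch below. This is only a minimal illustration; `sample_comparison` and `record_vote` are hypothetical placeholders for the real sampling and logging logic.

```python
import gradio as gr

def sample_comparison():
    # Placeholder: sample a document and the top topics from two anonymized models.
    return "...document text...", "Model A topics: ...", "Model B topics: ..."

def record_vote(choice):
    # Placeholder: log the vote together with session info and sampling metadata.
    return f"Recorded preference for model {choice}. Loading next task..."

with gr.Blocks() as demo:
    doc = gr.Textbox(label="Document", interactive=False)
    with gr.Row():
        topics_a = gr.Markdown()
        topics_b = gr.Markdown()
    with gr.Row():
        prefer_a = gr.Button("Prefer A")
        prefer_b = gr.Button("Prefer B")
    status = gr.Markdown()

    # Populate the comparison on page load and record whichever button is clicked.
    demo.load(sample_comparison, outputs=[doc, topics_a, topics_b])
    prefer_a.click(lambda: record_vote("A"), outputs=status)
    prefer_b.click(lambda: record_vote("B"), outputs=status)

demo.launch()
```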

Participation would be as follows:

  1. Anyone can enter without registration, but we should keep at least some identifying information so that we can account for personal baselines when modelling the results.
  2. A participant will complete a number of randomly assigned tasks.
  3. Participants can quit whenever they like and shouldn't be excluded if they only complete a couple of tasks. We should record how many tasks each participant completes.
  4. We should, however, encourage completing more by giving people a personalized leaderboard after they complete N tasks (N could be something like 20-40).

Public Leaderboard

The leaderboard should contain a table of results and ranks, as well as a radar chart displaying the performance of models on the different aspects.
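
One straightforward way to turn pairwise preferences into the ranks on that table is an Elo-style rating, as comparison arenas typically do (this is just one option; a Bradley-Terry fit would work equally well). Minimal sketch:

```python
def update_elo(rating_a: float, rating_b: float, a_wins: bool, k: float = 32.0):
    """Update the ratings of two models after a single pairwise vote."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    score_a = 1.0 if a_wins else 0.0
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# Example: model A (1500) is preferred over model B (1500) in one comparison.
print(update_elo(1500.0, 1500.0, a_wins=True))  # -> (1516.0, 1484.0)
```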

Tasks

Task 1: Representativeness

You get a document annotated by two topic models and are presented with each model's highest-scoring topics and their topic descriptions. You then choose which model's output you prefer. We should also try to account for atypical documents here, so that we can see how representative the models really are. This is mostly a question of sampling technique and of recording how specific documents were sampled.

[Figure: Task 1]
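
One option for the sampling (just an idea, not a settled design): oversample documents that sit far from the corpus centroid in embedding space, and store that distance with every vote so that atypicality can be controlled for later.

```python
import numpy as np

def sample_documents(embeddings: np.ndarray, n: int, seed=None):
    """Sample document indices, oversampling atypical (far-from-centroid) documents.

    Returns the sampled indices and their centroid distances, so the sampling
    conditions can be recorded alongside each vote.
    """
    rng = np.random.default_rng(seed)
    distances = np.linalg.norm(embeddings - embeddings.mean(axis=0), axis=1)
    weights = distances / distances.sum()  # more atypical -> more likely to be sampled
    idx = rng.choice(len(embeddings), size=n, replace=False, p=weights)
    return idx, distances[idx]
```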

Task 2: Relevance

We take a single topic model and sample four documents that score sufficiently differently on one of its topics. Participants rank the four documents in the order they find them relevant to the topic. We can then compare the participant's ranking to the model's with some rank-agreement metric.

We could also deliberately include hard examples to see how granular the models' knowledge is.

[Figure: Task 2]
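
For the metric, one option (again an assumption, not a final decision) is a rank correlation such as Kendall's tau between the participant's ordering and the model's topic scores for the four documents:

```python
from scipy.stats import kendalltau

# Hypothetical example: the model's topic scores for the four sampled documents,
# and the relevance ranks a participant gave them (1 = most relevant).
model_scores = [0.91, 0.40, 0.75, 0.05]
human_ranks = [1, 3, 2, 4]

# Negate the ranks so that both sequences increase with relevance.
tau, p_value = kendalltau(model_scores, [-r for r in human_ranks])
print(tau)  # 1.0 -> perfect agreement in this toy example
```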

Tasks 3 and 4: Specificity

We need a hierarchically organized corpus. This could be something like Wikipedia, where we have a very clear category hierarchy.

In both tasks, we select a parent corpus and two adjacent subcorpora. One of the subcorpora is designated the main corpus and the other the intruder corpus, from which intruder documents can be sampled.
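
A sketch of how this setup could be sampled, assuming the hierarchical corpus is available as a mapping from parent categories to their subcategories' documents (this data structure is just for illustration):

```python
import random

def select_intruder_setup(hierarchy, n_intruders, seed=None):
    """Pick a parent category, designate one subcategory as the main corpus and an
    adjacent one as the intruder corpus, then sample intruder documents.

    `hierarchy` is assumed to be {parent_category: {subcategory: [documents]}}.
    """
    rng = random.Random(seed)
    parent = rng.choice(sorted(hierarchy))
    main_cat, intruder_cat = rng.sample(sorted(hierarchy[parent]), k=2)
    main_corpus = hierarchy[parent][main_cat]
    intruders = rng.sample(hierarchy[parent][intruder_cat], k=n_intruders)
    return parent, main_cat, main_corpus, intruder_cat, intruders
```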

Task 3: Which model is more generic?

In Task 3 we fit two different topic models to the main corpus, then ask participants to assign the intruder document to one of the models. The less specific model will likely have the intruder document assigned to it.

[Figure: Task 3]

Task 4: Is the model specific enough?

In Task 4 we fit the same model to the main and parent corpora, resulting in a parent and a child model. Participants are asked to assign the intruder document to one of the models. If the child model is specific enough, we can expect participants to assign the intruder document to the parent model consistently.

[Figure: Task 4]
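
In both tasks the quantity we care about is simply how often the intruder document gets assigned to each model, so the scoring can stay very simple (a sketch, assuming every vote records which model received the intruder):

```python
from collections import Counter

def intruder_assignment_rate(votes):
    """Fraction of intruder documents assigned to each model.

    For Task 3, a high rate suggests that model is the less specific one; for
    Task 4, a consistently high rate for the parent model suggests the child
    model is specific enough.
    """
    counts = Counter(votes)
    total = len(votes)
    return {model: count / total for model, count in counts.items()}

print(intruder_assignment_rate(["parent", "parent", "child", "parent"]))
# -> {'parent': 0.75, 'child': 0.25}
```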

Models

We should use a representative sample of modern topic models. topic-benchmark already has sensible implementations of a decent chunk of these, but more can be accommodated if need be. (Please tell me if we want other models too)

  • BERTopic
  • Top2Vec
  • SemanticSignalSeparation (possibly take negative sides as separate positive topics to make it comparable with other models)
  • KeyNMF
  • FASTopic
  • Possibly ECRTM
  • CTM (we should probably settle on one of its variants, preferably CombinedTM)
  • LDA
  • Vanilla NMF
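
The classical baselines at the bottom of the list are the easiest to pin down; here is a minimal sklearn sketch for LDA and vanilla NMF (the contextual models above would come from their own packages / topic-benchmark; 20 topics and 20 Newsgroups are arbitrary examples):

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# 20 Newsgroups is one of the candidate corpora (see Corpora below).
corpus = fetch_20newsgroups(remove=("headers", "footers", "quotes")).data

# LDA on raw counts, vanilla NMF on tf-idf.
counts = CountVectorizer(min_df=10, stop_words="english").fit_transform(corpus)
lda = LatentDirichletAllocation(n_components=20, random_state=42).fit(counts)

tfidf = TfidfVectorizer(min_df=10, stop_words="english").fit_transform(corpus)
nmf = NMF(n_components=20, random_state=42).fit(tfidf)
```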

Since models (especially contextual ones) are made up of components, we should also account for these.

Encoder models

These determine the quality of embeddings that can be used for analysis, and should be accounted for at least to a certain extent.

I think using two sentence transformers is very reasonable:

  • MiniLM-L6-v2 (paraphrase- or all-)
  • mpnet-base-v2 (paraphrase- or all-)

Plus one static embedding model, to see whether a) models are able to use these effectively and b) models gain any additional information from context.

  • static-retrieval-mrl-en-v1
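
Loading these is standard sentence-transformers usage; the Hub IDs below are my best guess at the intended checkpoints and should be double-checked:

```python
from sentence_transformers import SentenceTransformer

ENCODERS = [
    "sentence-transformers/all-MiniLM-L6-v2",            # or paraphrase-MiniLM-L6-v2
    "sentence-transformers/all-mpnet-base-v2",           # or paraphrase-mpnet-base-v2
    "sentence-transformers/static-retrieval-mrl-en-v1",  # static baseline
]

docs = ["A toy document about topic models.", "Another short document."]
for name in ENCODERS:
    model = SentenceTransformer(name)
    embeddings = model.encode(docs)
    print(name, embeddings.shape)
```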

Vectorization/vocabulary

Each model will need a vectorizer to determine what the model's vocabulary will be, and to provide word counts where relevant.

Using CountVectorizer(min_df=10) is a reasonable option in my opinion.

Should we remove stop words from the models' vocabulary? And if so, should we test which models are resistant to the presence of stop words? I'm also open to trying out something like NounPhraseCountVectorizer or KeyphraseVectorizers.
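
For reference, the baseline vectorizer and the stop-word variant in scikit-learn (whether we actually strip stop words is still open, as discussed above):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Baseline: keep every term that appears in at least 10 documents.
vectorizer = CountVectorizer(min_df=10)

# Variant for testing stop-word resistance: same cutoff, English stop words removed.
vectorizer_no_stops = CountVectorizer(min_df=10, stop_words="english")

# doc_term = vectorizer.fit_transform(corpus)   # sparse document-term matrix
# vocab = vectorizer.get_feature_names_out()    # the model's vocabulary
```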

I would prefer not to use LLMs. This is mainly because I believe that a) LLM-based topics are usually heavily dependent on the keyword/keyphrase based topic descriptions, but also because b) our results would be dependent on the LLM of choice and its limitations.

Corpora

There is a list of openly usable, open-license datasets already implemented in topic-benchmark; we could use these as a starting point, but I'm open to ideas about what else could be used.

  • General
    • BBC News
  • Hierarchical
    • Wikipedia
  • Known baseline:
    • 20 Newsgroups
  • Expert corpora
    • StackExchange discussions
    • ArXiv ML paper abstracts

Wikipedia in particular could double as the corpus for the hierarchical tasks (Tasks 3 and 4).
