A weighted term co-association approach for producing more coherent topics, a ranking of those topics, and a visualization of the topical structure.
Pre-process the corpus of text:
python prep-text.py -o dataset --df 20 --tfidf --norm path/to/dataset
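As a rough illustration of what this kind of pre-processing involves, the sketch below builds a TF-IDF document-term matrix in plain Python. It assumes that `--df` sets a minimum document frequency, `--tfidf` applies TF-IDF weighting, and `--norm` scales each document vector to unit length; these readings are inferred from the flag names and are not confirmed by the script itself. The function name `tfidf_matrix` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_matrix(docs, min_df=1):
    """Build unit-length TF-IDF rows for a list of tokenized documents."""
    # Document frequency: number of documents that contain each term
    df = Counter(term for doc in docs for term in set(doc))
    # Keep only terms that appear in at least min_df documents (assumed meaning of --df)
    vocab = sorted(t for t, n in df.items() if n >= min_df)
    index = {t: i for i, t in enumerate(vocab)}
    n_docs = len(docs)
    rows = []
    for doc in docs:
        counts = Counter(t for t in doc if t in index)
        row = [0.0] * len(vocab)
        for term, tf in counts.items():
            # Smoothed inverse document frequency, always positive
            idf = math.log((1 + n_docs) / (1 + df[term])) + 1.0
            row[index[term]] = tf * idf
        # Scale the document vector to unit Euclidean length (assumed meaning of --norm)
        norm = math.sqrt(sum(w * w for w in row))
        if norm > 0:
            row = [w / norm for w in row]
        rows.append(row)
    return vocab, rows

# Toy corpus, invented for illustration only
docs = [
    ["stock", "market", "shares"],
    ["market", "shares", "profit"],
    ["football", "match", "goal"],
    ["match", "goal", "market"],
]
vocab, X = tfidf_matrix(docs, min_df=2)
```

Terms that occur in fewer than `min_df` documents are dropped before weighting, so rare words do not inflate the vocabulary.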
Apply NMF to the pre-processed corpus for a single number of topics or a range of values, set with --kmin and --kmax:
python topic-nmf.py dataset.pkl --init random --kmin 5 --kmax 5 -r 20 --seed 1000 --maxiters 100 -o models/dataset
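For readers unfamiliar with NMF, the following is a minimal sketch of the factorization using the classic multiplicative update rules, written in plain Python. It is not the solver used by `topic-nmf.py`, which may rely on a different algorithm or library; the helper names, the toy matrix, and the way the partition is derived from `W` are all assumptions made for illustration.

```python
import random

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    Bt = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def transpose(A):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*A)]

def nmf(X, k, max_iters=100, seed=1000, eps=1e-9):
    """Approximate X (documents x terms) as W (documents x k) times H (k x terms)."""
    rng = random.Random(seed)
    n, m = len(X), len(X[0])
    # Random non-negative initialization of both factors
    W = [[rng.random() for _ in range(k)] for _ in range(n)]
    H = [[rng.random() for _ in range(m)] for _ in range(k)]
    for _ in range(max_iters):
        # Update H elementwise: H * (W^T X) / (W^T W H)
        Wt = transpose(W)
        num, den = matmul(Wt, X), matmul(matmul(Wt, W), H)
        H = [[h * a / (b + eps) for h, a, b in zip(hr, nr, dr)]
             for hr, nr, dr in zip(H, num, den)]
        # Update W elementwise: W * (X H^T) / (W H H^T)
        Ht = transpose(H)
        num, den = matmul(X, Ht), matmul(W, matmul(H, Ht))
        W = [[w * a / (b + eps) for w, a, b in zip(wr, nr, dr)]
             for wr, nr, dr in zip(W, num, den)]
    return W, H

# Toy document-term matrix with two obvious blocks, invented for illustration
X = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]
W, H = nmf(X, k=2, max_iters=100, seed=1000)

# Assign each document to its highest-weighted topic (one way to derive a partition)
partition = [max(range(2), key=lambda j: row[j]) for row in W]
```

Because the factors start from a random initialization, different seeds can give different topics, which is one reason to run the factorization several times.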
To check the results:
python display-topics.py -t 10 data/bbc/nmf_k05/*rank*
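Combine the NMF results using the weighted term co-association ensemble, which also requires a word embedding model:
python ensemble-weighted-coassoc.py -k 5 -m wikipedia2016-w2v-cbow-d100.bin -t 10 data/bbc.pkl data/bbc/nmf_k05/*partition* data/bbc/nmf_k05/*rank* -o results/bbc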
Embeddings are available to download here
Evaluate the coherence of the topics using the word embedding model:
python evaluate-embedding.py -b -t 10 -m wikipedia2016-w2v-cbow-d100.bin -o results/bbc-coherence.csv data/bbc/nmf_k05/*rank*
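One common way to score a topic with word embeddings is to average the pairwise cosine similarity of its top terms. The sketch below shows that idea in plain Python; whether `evaluate-embedding.py` computes exactly this measure is an assumption, and the function names and the two-dimensional toy vectors are invented for illustration.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def topic_coherence(top_terms, embedding):
    """Mean pairwise cosine similarity among a topic's top terms."""
    # Terms missing from the embedding vocabulary are skipped
    vectors = [embedding[t] for t in top_terms if t in embedding]
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

# Toy 2-dimensional vectors, not taken from any real embedding model
toy_embedding = {
    "market": [0.9, 0.1], "shares": [0.8, 0.2], "profit": [0.85, 0.15],
    "goal": [0.1, 0.9], "match": [0.2, 0.8],
}
coherent = topic_coherence(["market", "shares", "profit"], toy_embedding)
mixed = topic_coherence(["market", "goal", "match"], toy_embedding)
```

In this toy setting the topic whose terms all point in a similar direction scores higher than the topic that mixes terms from two themes.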
Evaluate the accuracy of the document partitions:
python evaluate-accuracy.py -o results/bbc-accuracy.csv data/bbc.pkl data/bbc/nmf_k05/*partition*
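Partition accuracy is often measured by comparing the assigned topics against ground-truth labels with normalized mutual information (NMI). The sketch below implements NMI in plain Python as one possible measure; the script name only says "accuracy", so the choice of NMI, the availability of ground-truth labels in the dataset, and the function name `nmi` are all assumptions.

```python
import math
from collections import Counter

def nmi(labels_true, labels_pred):
    """Normalized mutual information, normalized by the mean of the two entropies."""
    n = len(labels_true)
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)
    joint_counts = Counter(zip(labels_true, labels_pred))
    # Mutual information between the two labelings
    mi = sum(
        (c / n) * math.log(c * n / (true_counts[a] * pred_counts[b]))
        for (a, b), c in joint_counts.items()
    )
    # Entropy of each labeling
    h_true = -sum((c / n) * math.log(c / n) for c in true_counts.values())
    h_pred = -sum((c / n) * math.log(c / n) for c in pred_counts.values())
    if h_true == 0 and h_pred == 0:
        # Both labelings put everything in one group, so they agree trivially
        return 1.0
    return mi / ((h_true + h_pred) / 2)

# Hypothetical ground-truth categories and topic assignments
truth = ["business", "business", "sport", "sport"]
perfect = nmi(truth, [1, 1, 0, 0])
unrelated = nmi(truth, [0, 1, 0, 1])
```

The score does not depend on how the topics are numbered, so a partition that matches the ground truth under any relabeling still receives the maximum value.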