Dear Michael Roeder,
Thanks a lot for sharing this useful work of yours!
As you point out in your instructions for Lucene index creation, the preprocessing steps applied to the indexed dataset must be the same as those applied to the topic-modeling dataset.
Which preprocessor did you use for lemmatization? Did you lower-case all words?
I'd be glad if you still have that information, so I know whether I need to compute another index.
Thanks again,
Evan Dufraisse
We used the Stanford CoreNLP library for preprocessing (including lemmatization). I am 99% sure that the words are lower-cased (the effect can be seen in issue #19 😉). I also think that we applied the lemmatizer first, before transforming the words into their lower-cased form (it simply makes more sense in this order 😉).
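For anyone wanting to reproduce this preprocessing, a minimal sketch of such a pipeline with Stanford CoreNLP might look like the following. This is an assumption about the setup based on the description above, not the exact code used to build the 2014 index; the class name and example sentence are made up for illustration:

```java
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

import java.util.Properties;

public class PreprocessSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        // tokenize and split sentences, POS-tag (required by the lemmatizer), then lemmatize
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        Annotation doc = new Annotation("The cats were running quickly.");
        pipeline.annotate(doc);

        for (CoreLabel token : doc.get(CoreAnnotations.TokensAnnotation.class)) {
            String lemma = token.get(CoreAnnotations.LemmaAnnotation.class);
            // lemmatize first, then lower-case, matching the order described above
            System.out.println(lemma.toLowerCase());
        }
    }
}
```

The key point is the order: lower-casing after lemmatization, so the lemmatizer still sees the original casing (which it can use, e.g., to distinguish proper nouns). The same token stream would then be fed into the Lucene indexer and, on the other side, applied to the topic-model vocabulary.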
You may also want to take into consideration that we created the index in 2014. Depending on the documents that you use to generate your topics, this might influence your results as well. In 2014, Donald Trump hadn't even started his presidential campaign and (obviously) COVID-19 did not exist. So if you process news articles, a newer reference corpus might be interesting 🤔