Precision on wikipedia index preprocessing #60

EvanDufraisse · 2021-10-01T12:55:48Z

Dear Michael Roeder,

Thanks a lot for sharing this useful work of yours!

As you point out in your instructions for creating the Lucene index, the indexed dataset must go through the same preprocessing steps as the dataset used for modeling.

Which preprocessor did you use for lemmatization? Did you lower-case all words?
I'd be glad if you still have that information, so I know whether I need to compute a new index.

Thanks again,

Evan Dufraisse

MichaelRoeder · 2021-10-08T13:14:53Z

Hi Evan Dufraisse,

Thank you for your interest in our project. 😃

We used the Stanford Core NLP library for preprocessing (including lemmatization). I am 99% sure that the words are lower-cased (the effect can be seen in issue #19 😉). I also think that we applied the lemmatizer first, before transforming the words into their lower-cased form (it simply makes more sense in this order 😉).
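To illustrate why the order can matter, here is a minimal Python sketch. It does not call Stanford Core NLP: `toy_lemmatize` is a hypothetical stand-in that simply treats capitalized tokens as proper nouns and strips a trailing plural "s" from everything else, and both function names are made up for this example. The real lemmatizer behaves differently in detail, so treat this only as a sketch of the ordering, not of the original pipeline.

```python
from typing import Callable, List


def toy_lemmatize(token: str) -> str:
    """Hypothetical stand-in for a real lemmatizer (NOT Stanford Core NLP).

    Assumption for illustration: capitalized tokens are proper nouns and
    stay unchanged; other tokens longer than 3 characters lose a trailing "s".
    """
    if token[:1].isupper():
        return token
    if token.endswith("s") and len(token) > 3:
        return token[:-1]
    return token


def lemmatize_then_lowercase(
    tokens: List[str], lemmatize: Callable[[str], str] = toy_lemmatize
) -> List[str]:
    """Lemmatize on the original casing, then lower-case the lemma."""
    return [lemmatize(t).lower() for t in tokens]


def lowercase_then_lemmatize(
    tokens: List[str], lemmatize: Callable[[str], str] = toy_lemmatize
) -> List[str]:
    """Reverse order: the lemmatizer no longer sees the original casing."""
    return [lemmatize(t.lower()) for t in tokens]


tokens = ["Tiger", "Woods", "walks", "in", "the", "woods"]
lemma_first = lemmatize_then_lowercase(tokens)
lower_first = lowercase_then_lemmatize(tokens)
print(lemma_first)  # ['tiger', 'woods', 'walk', 'in', 'the', 'wood']
print(lower_first)  # ['tiger', 'wood', 'walk', 'in', 'the', 'wood']
```

With this toy lemmatizer, the two orders disagree on the surname "Woods". Whichever order is chosen, the same one has to be applied to both the reference corpus behind the index and the documents used to generate the topics.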

You may also want to take into consideration that we created the index in 2014. Depending on the documents that you use to generate your topics, this might influence your results as well. In 2014, Donald Trump hadn't even started his presidential campaign and (obviously) COVID-19 did not exist. So if you process news articles, a newer reference corpus might be worth considering 🤔

Cheers,
Michael Röder

EvanDufraisse closed this as completed Oct 1, 2021

EvanDufraisse reopened this Oct 1, 2021

MichaelRoeder closed this as completed Oct 15, 2021
