Precision on wikipedia index preprocessing #60

Closed
EvanDufraisse opened this issue Oct 1, 2021 · 1 comment

@EvanDufraisse

EvanDufraisse commented Oct 1, 2021

Dear Michael Roeder,

Thanks a lot for sharing this useful work!

As you point out in your instructions for Lucene index creation, the preprocessing steps applied to the indexed dataset must be the same as those applied to the dataset used for topic modelling.

Which preprocessor did you use for lemmatization? Did you lower-case all words?
I'd be glad if you still have that information, so I know whether I need to compute a new index.

Thanks again,

Evan Dufraisse

@MichaelRoeder
Member

Hi Evan Dufraisse,

Thank you for your interest in our project. 😃

We used the Stanford CoreNLP library for preprocessing (including lemmatization). I am 99% sure that the words are lower-cased (the effect can be seen in issue #19 😉 ). I also think that we applied the lemmatizer first, before transforming the words into their lower-cased form (it simply makes more sense in this order 😉 ).
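To illustrate why the order can matter, here is a minimal sketch. It does not use Stanford CoreNLP; `toy_lemmatize` and `TOY_LEMMAS` are hypothetical stand-ins that only mimic one behaviour of a case-sensitive lemmatizer (capitalized tokens are treated as proper nouns and left unchanged). The actual preprocessing code used for the index may well differ.

```python
# Hypothetical toy lemma table; a real pipeline would use a full
# lemmatizer such as the one in Stanford CoreNLP.
TOY_LEMMAS = {"woods": "wood", "walked": "walk", "were": "be"}


def toy_lemmatize(token):
    """Hypothetical stand-in for a case-sensitive lemmatizer."""
    # Assumption: capitalized tokens are proper nouns and stay unchanged.
    if token[:1].isupper():
        return token
    # Lower-case tokens are looked up; unknown tokens are returned as-is.
    return TOY_LEMMAS.get(token, token)


def lemmatize_then_lowercase(tokens):
    # Order described in the comment above: lemmatize first, then lower-case.
    return [toy_lemmatize(t).lower() for t in tokens]


def lowercase_then_lemmatize(tokens):
    # Reversed order: case information is lost before lemmatization.
    return [toy_lemmatize(t.lower()) for t in tokens]


tokens = ["Tiger", "Woods", "walked", "in", "the", "woods"]
print(lemmatize_then_lowercase(tokens))
print(lowercase_then_lemmatize(tokens))
```

In this toy setup the surname "Woods" ends up as "woods" in the first order but as "wood" in the second, so a query preprocessed one way would not match an index built the other way.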

You may also want to take into consideration that we created the index in 2014. Depending on the documents that you use to generate your topics, this might influence your results as well. In 2014, Donald Trump hadn't even started his presidential campaign and (obviously) COVID-19 did not exist. So if you process news articles, a newer reference corpus might be interesting 🤔

Cheers,
Michael Röder
