-
Notifications
You must be signed in to change notification settings - Fork 239
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
tfidf(s): remove normalization, improve docstring #76
Comments
Working on this with @mk2510 🐐 |
Should the new tfidf function already return such a Representation Series? Working on it, we currently implemented it so that it calculates the Representation Series correctly (using some of your code from #43 ) and then use a function |
What's your opinion? We might also want to do both tasks 1 and 2 in #84 before releasing a new version. This is fine as well. In this case, we can already return the "new" Pandas Representation but this means that many tests will fail ... ... have a look at the unit tests to check for the "new" Pandas Representation under #43 (it took me a long while to implement that, it might spare you a bit of time too!) |
Docstring now includes formula/explaination. Normalization disabled. Representation Series is already being handled (although output is still like before). Function representation_series_to_flat_series added. Co-authored-by: Maximilian Krahn <[email protected]>
Docstring now includes formula/explaination. Normalization disabled (the option "normalization=None" was "hidden" in the sklearn code, so that turned out to be an easy fix). Representation Series is already being handled (although output is still like before, using representation_series_to_flat_series). Function representation_series_to_flat_series added. Unit tests are changed accordingly, also one with the explicit calculation using the formula. Co-authored-by: Maximilian Krahn <[email protected]>
We just opened a PR at #97 .
Yes, that is very important to discuss, and probably should be handled later on in #85 2.
I think it makes sense to still return a "old" output at the moment, so that's what we did currently. It probably makes sense to overhaul all the representation functions at the same time in #85 2. As the new tfidf already calculates with the representation series and just gives the old output, it will be easy to change. |
The
tfidf
function, under-the-hoods makes use of the sklearn.feature_extraction.text.TfidfVectorizer.By default, the
TfidfVectorizer
return a L2-normalized TF-IDF vector. In Texthero, we would like to avoid this hidden behavior, to let the user have more granular control over each action.This task consists of removing (and testing) the normalization, as well as make it more clear in the docstring which tfidf-formula is used. For extra information, the chapter text feature extraction of the scikit-learn documentation might be useful.
Task:
Implementation:
There are probably two ways of solve that:
Extra:
After having implemented #43, we will add two new functions, L1 and L2, to normalize any "Pandas Representation Series"
The text was updated successfully, but these errors were encountered: