[FEAT] Add LSA encoder #1121
Comments
Great!! We need to think about a name. I think LSA is a bit of a technical name that might not ring a bell for non-technical users. We brainstormed names a bit with @jeromedockes and @rcap107. The name StringEncoder came to mind. It would be close to TextEncoder (#1077), but we feel that the difference is somewhat understandable. That said, it might be an argument for renaming TextEncoder to SentenceEncoder, which would also (maybe) be a good name because it is more explicit (link to "SentenceTransformer").
Very interesting! One might wonder why we don't consider the GapEncoder as a string encoder, though. WDYT?
One might wonder why we don't consider the GapEncoder as a string encoder, though. WDYT?
Yes, this was raised, and it is true. I guess one difference I see is that the GapEncoder assumes more latent structure (aka dirty-category structure) than open-ended strings have.
One argument for naming it the "StringEncoder" is that if you really have no prior information on the data or the use of the encoding, it's probably a good default to encode a string. Of course, we'll need a good "see also" section and a good discussion in the docs.
Okay, this sounds easy to explain in the doc!
Any thoughts @GaelVaroquaux? I'm curious |
Scikit-learn mentions LSA in TruncatedSVD, and I wonder why it hasn't been implemented in scikit-learn in the first place
Probably because it's easy to implement with the tools in scikit-learn and scikit-learn being general (not focused on text or the like) it didn't feel like it should be there.
Problem Description
Latent Semantic Analysis (LSA) consists of a TfidfVectorizer followed by Singular Value Decomposition (SVD). Scikit-learn mentions it in TruncatedSVD, and I wonder why it hasn't been implemented in scikit-learn in the first place, @GaelVaroquaux?
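To make the description concrete, here is a minimal sketch of LSA built from existing scikit-learn pieces. The helper name `make_lsa_encoder`, the toy documents, and the choice of `n_components=2` are assumptions for illustration only; this is not the proposed skrub implementation, and the vectorizer is left at its defaults.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline


def make_lsa_encoder(n_components=2, random_state=0):
    """Hypothetical helper: TF-IDF followed by truncated SVD (i.e. LSA)."""
    return make_pipeline(
        # Turns each string into a sparse TF-IDF vector (default word tokens).
        TfidfVectorizer(),
        # Projects the sparse vectors onto a small number of dense components.
        TruncatedSVD(n_components=n_components, random_state=random_state),
    )


# Toy data, made up for this example.
documents = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
    "stock markets fell sharply",
    "investors sold stock today",
]

encoder = make_lsa_encoder(n_components=2)
embeddings = encoder.fit_transform(documents)
print(embeddings.shape)  # one row per document, one column per component
```

Once fitted, the same pipeline can encode unseen strings with `encoder.transform([...])`, which is roughly the behaviour an encoder class would wrap.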
Feature Description
Create the LSAEncoder, a simple pipeline chaining TfidfVectorizer and TruncatedSVD (or a PCA, both support sparse matrices).
Alternative Solutions
No response
Additional Context
No response