Text Feature Extraction

Sagen Soren edited this page Nov 30, 2020 · 2 revisions

Bag of Words

Bag of words is a feature extraction method that turns text into numeric features for training machine learning models. It is one of the fundamental methods for converting tokens into features.

  • Text preprocessing:
    • convert the entire text to lowercase
    • remove all punctuation and unnecessary symbols
  • Vocabulary creation:
    • from the text, create a set of unique words
  • Text vectorization:
    • create a feature matrix with one column per vocabulary word and one row per sentence
    • assign 1 if the word is present in the sentence and 0 if it is not
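
The three steps above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the function names (`preprocess`, `build_vocabulary`, `vectorize`) and the sample sentences are made up for this example, "unnecessary symbols" is assumed to mean ASCII punctuation only, and words are assumed to be separated by whitespace.

```python
import string


def preprocess(sentence):
    # Step 1: lowercase the text, then strip ASCII punctuation.
    # Assumption: only the characters in string.punctuation are removed.
    lowered = sentence.lower()
    return lowered.translate(str.maketrans("", "", string.punctuation))


def build_vocabulary(sentences):
    # Step 2: collect the set of unique words across all sentences.
    # The result is sorted so that the column order is deterministic.
    words = set()
    for sentence in sentences:
        words.update(preprocess(sentence).split())
    return sorted(words)


def vectorize(sentences, vocabulary):
    # Step 3: one row per sentence, one column per vocabulary word,
    # with 1 if the word is present in the sentence and 0 otherwise.
    rows = []
    for sentence in sentences:
        present = set(preprocess(sentence).split())
        rows.append([1 if word in present else 0 for word in vocabulary])
    return rows


# Hypothetical sample data for illustration.
sentences = ["The cat sat.", "The dog sat!", "A cat, a dog."]
vocabulary = build_vocabulary(sentences)
matrix = vectorize(sentences, vocabulary)

print(vocabulary)
for row in matrix:
    print(row)
```

Because each cell only records presence or absence, a word that occurs twice in a sentence (such as "a" in the third sample sentence) still gets a 1.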