Getting Started with Text Analysis in Python

If you'd like to follow along with this presentation from PyGotham 2018, please feel free to download this repo and follow these instructions.

Installation

To install TextBlob, simply open a shell and type:

$ pip install textblob
$ python -m textblob.download_corpora

If the second command above fails with certificate errors and you are on a Mac, you can see this article to learn more about ways to address it.

Also, if you want to be all professional about this, you probably want to learn about installing and using virtualenv, and probably virtualenvwrapper. But it's really up to you -- not required to just get going here.

Using TextBlob

Now you're ready to get started using TextBlob! Launch a Python shell to get going.

$ python

Let's import TextBlob:

from textblob import TextBlob

Now let's create our first TextBlob from the text of a random tweet, and we're going to call it tweet:

tweet = TextBlob("I am very excited about the person who will be taking the place of Don McGahn as White House Councel! I liked Don, but he was NOT responsible for me not firing Bob Mueller or Jeff Sessions. So much Fake Reporting and Fake News! ")

Congrats! You've made your first TextBlob. Here's some stuff you can do with it:

tweet.words
tweet.sentences
tweet.tags
tweet.noun_phrases
tweet.word_counts
tweet.sentiment

Naive Bayes with TextBlob

To start making a Naive Bayes classifier with TextBlob, we need to start by importing the NaiveBayesClassifier from the TextBlob package:

from textblob.classifiers import NaiveBayesClassifier

Great. Now you can make your own NaiveBayesClassifier and train it with some training data. Assuming you've downloaded the sample.json file in this repo and you opened your Python shell from inside that same directory, you can type these commands to open the sample file and use it to train a new classifier we're creating and calling cl:

with open('sample.json', 'r') as training:
    cl = NaiveBayesClassifier(training, format='json')
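The contents of sample.json aren't reproduced here. TextBlob's JSON training format is a list of objects with "text" and "label" keys, so a file of the same shape can be sketched with the standard library. The file name tiny_sample.json and the example sentences below are made up for illustration; only the 'Trump' and 'Staff' labels come from this demo.

```python
import json

# Hypothetical training examples; the real sample.json has its own text.
examples = [
    {"text": "So much Fake News!", "label": "Trump"},
    {"text": "Join us for the briefing at 2pm today.", "label": "Staff"},
]

# Write the list in the shape read with format='json':
# a JSON array of objects with "text" and "label" keys.
with open("tiny_sample.json", "w") as fp:
    json.dump(examples, fp, indent=2)

# Read it back to confirm the structure round-trips.
with open("tiny_sample.json") as fp:
    loaded = json.load(fp)
```

A file like this can be opened and passed to NaiveBayesClassifier the same way as sample.json, though two examples are far too few to train anything useful.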

That may take a second, but now you have a cl object that is a NaiveBayesClassifier, and it has been trained with our sample.json training data. You've basically built up a big probability table of which words or "features" are more closely associated with the labels in our training data. To see some information about these probabilities, type:

cl.show_informative_features()
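To make the "probability table" idea concrete, here is a minimal pure-Python sketch of the kind of bookkeeping a Naive Bayes classifier does. It is not TextBlob's implementation: the training sentences are made up, the whitespace tokenizer is a simplification, and the add-one smoothing is an assumption chosen to keep the arithmetic simple.

```python
from collections import Counter, defaultdict

# Made-up training data, for illustration only.
train = [
    ("fake news fake reporting", "Trump"),
    ("so much fake news", "Trump"),
    ("press briefing today", "Staff"),
    ("watch the briefing live", "Staff"),
]

# Number of training documents per label.
label_counts = Counter(label for _, label in train)

# word_doc_counts[label][word] = documents with that label that contain the word.
word_doc_counts = defaultdict(Counter)
for text, label in train:
    for word in set(text.split()):
        word_doc_counts[label][word] += 1

def p_contains(word, label):
    """Estimate P(document contains word | label) with add-one smoothing."""
    return (word_doc_counts[label][word] + 1) / (label_counts[label] + 2)

def likelihood_ratio(word, label_a, label_b):
    """How many times more likely the word is under label_a than label_b."""
    return p_contains(word, label_a) / p_contains(word, label_b)

for word in ("fake", "briefing", "news"):
    print(word, round(likelihood_ratio(word, "Trump", "Staff"), 2))
```

In this toy data, "fake" comes out three times more likely under 'Trump' than under 'Staff', which is the same kind of ratio that show_informative_features reports for the real training data.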

Now you're ready to classify some new text that the cl classifier object hasn't yet seen.

Let's try to classify the text of the tweet TextBlob we created earlier. The classifier expects a plain string, so we pass tweet.raw rather than the TextBlob itself:

cl.classify(tweet.raw)

That's great. This returns only the name of the label (from our sample.json training data) that the tweet text most resembles. But what if you want more information, such as whether it was a very strong match to one of our labels or closer to 50/50?

You can use the prob_classify method to get the probability the classifier assigns to each label. Call it this way, using the same labels that appear in the training data:

cl.prob_classify(tweet.raw).prob('Trump')
cl.prob_classify(tweet.raw).prob('Staff')
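To answer the "strong match or almost 50/50" question, one option is to compare the top two probabilities. The helper name summarize and its margin threshold are hypothetical, and the probabilities hard-coded below are made up rather than real classifier output; in practice each value would come from a .prob(label) call like the ones above.

```python
def summarize(probs, margin=0.2):
    """Return (best_label, best_prob, is_close) for a dict of label -> probability.

    is_close is True when the top two labels are within `margin` of each other.
    """
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    best_label, best_prob = ranked[0]
    runner_up_prob = ranked[1][1] if len(ranked) > 1 else 0.0
    return best_label, best_prob, (best_prob - runner_up_prob) < margin

# Made-up probabilities standing in for the two .prob(...) results.
strong = summarize({"Trump": 0.93, "Staff": 0.07})
toss_up = summarize({"Trump": 0.55, "Staff": 0.45})
print(strong)
print(toss_up)
```

The 0.2 margin is arbitrary; pick whatever gap counts as "too close to call" for your own data.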

Now if you want to run some real tests, you can also feed in some test data in a very similar way to how we trained our classifier. Assuming you have the test.json file in this repo, you can load it in and test how accurate your classifier is by running:

with open('test.json', 'r') as test:
    cl.accuracy(test, format='json')
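The number you get back is the fraction of test examples that the classifier labeled correctly. As a rough sketch of that calculation, the snippet below scores a stand-in predictor against a few made-up examples. The names accuracy, toy_predict, and test_set are hypothetical and are not part of TextBlob.

```python
def accuracy(predict, labeled_examples):
    """Fraction of (text, label) pairs for which predict(text) == label."""
    labeled_examples = list(labeled_examples)
    correct = sum(1 for text, label in labeled_examples if predict(text) == label)
    return correct / len(labeled_examples)

# Hypothetical stand-in for a trained classifier's classify method.
def toy_predict(text):
    return "Trump" if "fake" in text.lower() else "Staff"

# Made-up test data, for illustration only.
test_set = [
    ("So much Fake News!", "Trump"),
    ("Press briefing at noon", "Staff"),
    ("Great rally tonight", "Trump"),
    ("Fake schedule update", "Staff"),
]

score = accuracy(toy_predict, test_set)
print(score)
```

The toy predictor gets two of the four examples right, so it scores 0.5. A score near 0.5 on two evenly split labels is no better than guessing.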

And that's our demo! The TextBlob docs are really great and have a lot of the information here (along with much more).

Enjoy getting started with text analysis and classification in Python!