Today I will be leading a workshop on naive Bayes classifiers at Ibotta. The example I will use is classifying tweets as belonging to @kanyewest or @realDonaldTrump. The GitHub repo for the workshop contains tweets from both accounts since the beginning of 2016; although Kanye’s account has been around since 2013, Trump’s account is much more active, so restricting the window keeps the two samples comparable.
After obtaining the tweets, I removed a few stopwords that would be giveaways, along with URLs, hashtags, and @-replies. With the training/test split used in the example, the classifier achieves about 80 percent accuracy on held-out tweets.
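The core of such a classifier fits in a few lines: count words per account, then score a new tweet by log prior plus the sum of smoothed log word likelihoods. Here is a minimal sketch with invented toy tweets (the account labels are the only detail taken from the post; the real workshop data lives in the repo):

```python
import math
from collections import Counter

def train(docs_by_class):
    """Count words per class; docs_by_class maps label -> list of token lists."""
    counts = {c: Counter(w for doc in docs for w in doc)
              for c, docs in docs_by_class.items()}
    totals = {c: len(docs) for c, docs in docs_by_class.items()}
    n = sum(totals.values())
    priors = {c: k / n for c, k in totals.items()}
    vocab = {w for cnt in counts.values() for w in cnt}
    return counts, priors, vocab

def classify(tokens, counts, priors, vocab):
    """Pick the class maximizing log prior + sum of add-one-smoothed log likelihoods."""
    scores = {}
    for c, cnt in counts.items():
        denom = sum(cnt.values()) + len(vocab)  # Laplace (add-one) smoothing
        scores[c] = math.log(priors[c]) + sum(
            math.log((cnt[w] + 1) / denom) for w in tokens if w in vocab)
    return max(scores, key=scores.get)

# Hypothetical toy training data -- not the actual workshop tweets.
docs = {
    "trump": [["make", "america", "great"], ["great", "again"]],
    "kanye": [["so", "happy"], ["happy", "and", "blessed"]],
}
model = train(docs)
print(classify(["great", "again"], *model))  # -> trump
```

Unknown words are skipped entirely here; the add-one smoothing only protects against words seen in one class but not the other.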
Which terms are useful for distinguishing the two accounts? Tweets containing “great” are 11.4 times more likely to belong to @realDonaldTrump, while tweets with the word “happy” are 8.9 times more likely to be Kanye’s.
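Ratios like these can be computed as the quotient of each account’s smoothed word frequencies. The counts below are invented for illustration; the 11.4x and 8.9x figures in the post come from the full dataset:

```python
from collections import Counter

def likelihood_ratio(word, counts_a, counts_b):
    """Add-one-smoothed P(word | a) / P(word | b)."""
    vocab = set(counts_a) | set(counts_b)
    p_a = (counts_a[word] + 1) / (sum(counts_a.values()) + len(vocab))
    p_b = (counts_b[word] + 1) / (sum(counts_b.values()) + len(vocab))
    return p_a / p_b

# Invented toy counts -- not the real per-account word frequencies.
trump = Counter({"great": 120, "win": 40})
kanye = Counter({"happy": 60, "love": 30})
print(round(likelihood_ratio("great", trump, kanye), 1))  # -> 69.4
```

The smoothing matters: without it, a word that one account never uses would produce a division by zero (or an infinite ratio).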
These classifiers are “naive” because they do not account for correlation structure among the features: every word in a tweet is treated as independent of the others, even though this is rarely true in practice. More complex models, such as latent Dirichlet allocation, relax this assumption, but naive Bayes models are easy to understand and make a good introduction to textual analysis. Additional pre-processing, such as word stemming, could also improve the accuracy of the classifier.
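Stemming helps by collapsing inflected forms into a single feature, so their counts pool together. A real project would use a proper stemmer such as NLTK’s Porter stemmer; the crude suffix-stripper below is only a sketch of the idea:

```python
def crude_stem(word):
    """Very rough suffix stripping -- illustrative only, not a real Porter stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([crude_stem(w) for w in ["talking", "talks", "talk"]])  # -> ['talk', 'talk', 'talk']
```

After stemming, “talking”, “talks”, and “talk” all count toward one feature instead of splitting their evidence across three.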