Introduction

Let's make a naive Bayes classifier. We will run it on a collection of tweets and Reddit comments, each labeled with positive, neutral, or negative sentiment.

First, let's load the data into training and test corpora: we'll combine the two datasets, shuffle them, and split the result into training and test sets.

In [19]:
from random import shuffle

def getsamples(lines: list):
    n = len(lines)
    i = 1
    samples = []
    while i < n:
        line = lines[i]
        xy = line.split(',')
        x = xy[0].strip().split(' ')
        if len(xy) == 1: # multiline tweet
            line = lines[i][1:].rstrip() # remove double quote
            i += 1
            if len(line) == 0 and i < n:
                # beginning of the tweet, so don't insert a space
                line += lines[i].split(',')[0].rstrip()
                i += 1
            while i < n and len(lines[i].split(',')) == 1:
                line += ' ' + lines[i].rstrip()
                i += 1
            xy = lines[i].split(',')
            endsize = len(xy[0])
            if endsize > 1: # ignore double quote i.e. endsize == 1
                # remove double quote at the end
                line += ' ' + xy[0][:endsize - 1] 
            x = line.split(' ')
        y = int(xy[1].rstrip())
        samples.append([x, y])
        i += 1
    return samples

def getcorpus():
    corpus = []
    fnames = ['reddit',
              'twitter']
    ext = '.csv'
    for fname in fnames:
        with open(fname + ext, 'r', encoding='latin-1') as f:
            corpus.extend(getsamples(f.readlines()))
    return corpus

corpus = getcorpus()
shuffle(corpus) # shuffle so the test set isn't just twitter
train = corpus[:int(len(corpus) * 0.8)]
test = corpus[int(len(corpus) * 0.8):]
classes = [-1, 0, 1] # negative, neutral, positive
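
As an aside, if the files are standard quoted CSVs with a header row and text,label columns (an assumption about the exact format; the labels are taken to be integer strings as above), Python's csv module can replace the hand-rolled multiline handling; a minimal sketch:

In [ ]:
import csv

# sketch of an alternative loader: csv.reader handles commas and multiline
# quoted fields for us, assuming each file is a standard quoted CSV
def getsamples_csv(fname):
    samples = []
    with open(fname, 'r', encoding='latin-1', newline='') as f:
        reader = csv.reader(f)
        next(reader) # skip the header row
        for row in reader:
            if len(row) < 2 or not row[1].strip():
                continue # skip malformed rows
            samples.append([row[0].strip().split(' '), int(row[1].strip())])
    return samples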

Naive Bayes

Now we can define our naive Bayes classifier class. All we really need is a train function and a test function.

To train, we estimate the probability of each class, P(c), and the probability of each feature (in our case, individual words) given each class, P(w|c), by simple counting (maximum likelihood estimation).

The prior probability of a class, P(c), will be defined as the number of documents (tweets or comments) in class c divided by the total number of documents.

The likelihood of a feature given a class, P(w|c), will be defined as the number of occurrences of w in class-c documents divided by the total number of word occurrences in class-c documents.

For a test sample d, we compute P(c)·P(d|c) for each class, where P(d|c) is the product of the likelihoods of the words in d, and return the argmax class as our prediction.
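
Putting these definitions together (and working with log probabilities, as discussed below), the quantities we compute are

    P(c) = N_c / N_doc
    P(w|c) = count(w, c) / (total word count in class c)
    prediction = argmax over c of [ log P(c) + sum over w in d of log P(w|c) ]

where N_c is the number of training documents in class c, N_doc is the total number of training documents, and count(w, c) is the number of times w occurs in class-c documents.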

Implementation Details

We'll need to import a log function, since we would rather add up log probabilities than multiply lots of near-zero probabilities and risk underflowing to zero.
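
As a quick illustration (this toy cell isn't part of the classifier), multiplying a few hundred small probabilities underflows to zero, while the equivalent sum of log probabilities stays perfectly representable:

In [ ]:
from math import log2

# multiplying 400 probabilities of 1e-5 each underflows to 0.0 ...
print((1e-5) ** 400)
# ... while the equivalent sum of log probabilities does not
print(400 * log2(1e-5))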

We'll also need the Counter class to get frequency counts from our training corpus. Specifically, we're going to use a hyperparameter vocab_size to set our features as the vocab_size most common words in the corpus. These are the most common words in the entire corpus, not just for a class.
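
For example, on a toy corpus, Counter.most_common gives exactly the ranked (word, count) pairs we need:

In [ ]:
from collections import Counter

# toy example of how the vocabulary is chosen: the most common words overall
toy_counts = Counter('the cat sat on the mat and the cat slept'.split())
print(toy_counts.most_common(2)) # [('the', 3), ('cat', 2)]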

Notice the additional smoothing parameter. Since we are using individual words as features, there's a chance that a vocabulary word never shows up in the training documents of some class and gets assigned zero probability for that class. The computer will not like taking the log of zero, and we will not like the model. To avoid this, we simply add the value of smoothing to every word count, which gets rid of zero probabilities. Since we add the same amount to every count in the numerator and denominator, we maintain a valid probability distribution. With a value of 1 this is Laplace (add-one) smoothing; more generally it is add-α smoothing, and on smaller datasets a small value like 0.05 typically works better than 1 because we don't want to assign too much probability to words that never occurred in a class.
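
Concretely, with smoothing value α and vocabulary size |V|, the estimate used below is

    P(w|c) = (count(w, c) + α) / (total word count in class c + α·|V|)

so every vocabulary word keeps a little probability mass in every class. (In the code, |V| is vocab_size + 1, since the <unk> token described in the updates below counts as a vocabulary entry.)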

UPDATES:

Included an unknown token <unk> as part of the vocabulary, so we can assign a probability to out-of-vocabulary (OOV) words during testing. Its probability for a given class is (up to smoothing) the sum of the probabilities of all the OOV words in that class.

Added an option to use a list of stopwords from NLTK. It's disabled by default because, unless we somehow account for the effect of the removed stopwords in a sample, we lose information and thus performance.

In [ ]:
from math import log2
from collections import Counter
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

class NaiveBayesClassifier:

    UNK = '<unk>'

    def __init__(self, C:list, vocab_size:int=5000, smoothing:float=0.05,
                 ignore_stopwords:bool=False):
        self.C = C
        self.vocab_size = vocab_size
        self.smoothing = smoothing
        self.ignore_stopwords = ignore_stopwords
        # cache the stopword list as a set so membership checks are O(1)
        self.stopwords = set(stopwords.words('english')) if ignore_stopwords else set()

    # ignore stopwords when building vocab
    # returns the vocab list
    def stopword_free_vocab(self, counts:Counter):
        V = []
        common_counts = counts.most_common()
        size, i = 0, 0
        while size < self.vocab_size:
            # skip stopwords and empty strings
            while common_counts[i][0] in self.stopwords or len(common_counts[i][0]) == 0:
                i += 1
            V.append(common_counts[i][0])
            i += 1
            size += 1
        return V

    # see the most likely words in the vocab for a given class
    def common_class_words(self, class_index:int, num_words:int=10):
        vocab_likelihoods = [row[class_index] for row in self.loglikelihood]
        likely_word_indices = sorted(range(len(vocab_likelihoods)),
                                     key=lambda idx: vocab_likelihoods[idx],
                                     reverse=True)[:num_words]
        for idx in likely_word_indices:
            w = self.V[idx] if idx < self.vocab_size else self.UNK
            print(' (%s, %f)' % (w, vocab_likelihoods[idx]))

    def train(self, D:list):
        # get counts from a single document of all samples
        big = []
        for i in range(len(D)):
            big.extend(D[i][0])
        counts = Counter(big)
        
        # build vocabulary
        if self.ignore_stopwords:
            self.V = self.stopword_free_vocab(counts)
        else:
            self.V = [tup[0] for tup in counts.most_common(self.vocab_size)]
        # map each vocab word to its feature index for O(1) lookups
        self.vocab_index = {w: j for j, w in enumerate(self.V)}
        
        # iterate over the classes, compute P(c) and P(w|c)'s for each class
        N_doc = len(D)
        self.logpriors = []
        self.bigdoc = []
        self.loglikelihood = [[] for _ in range(self.vocab_size + 1)] # extra row for UNK
        for i in range(len(self.C)):
            c = self.C[i]
            print('Training %d' % c)

            # build class document and calculate P(c)
            N_c = 0
            self.bigdoc.append([])
            for sample in D:
                if sample[1] == c:
                    N_c += 1
                    self.bigdoc[i].extend(sample[0])
            self.logpriors.append(log2(N_c / N_doc))
            print(' Counted %d samples, log P(c) = %f. Counting word occurrences...' % 
                  (N_c, self.logpriors[i]))
            
            # get word counts from class doc
            counter = Counter(self.bigdoc[i])
            # smoothed occurrences, vocab_size + 1 for UNK
            occurrences = sum(counter.values()) + self.smoothing * (self.vocab_size + 1)

            print(' Counted %d occurrences. Computing log P(w|c)s...' % occurrences)
            # get smoothed counts for every word in V
            for j in range(self.vocab_size):
                w = self.V[j]
                count = counter[w]
                self.loglikelihood[j].append(log2((count + self.smoothing) / occurrences))

            # get the total count of OOV words for UNK
            unk_count = 0
            for w, count in counter.most_common():
                if w in self.vocab_index:
                    continue
                unk_count += count
            self.loglikelihood[self.vocab_size].append(log2((unk_count + self.smoothing) / occurrences))

            # print most common words
            # self.common_class_words(i, 5)

    # get probability of each class, choose argmax class as prediction
    def test(self, testdoc):
        max_prob, max_c = float('-inf'), None
        for i in range(len(self.C)):
            class_prob = self.logpriors[i]
            for word in testdoc:
                # don't give stopwords UNK probability if ignoring
                if self.ignore_stopwords and word in self.stopwords:
                    continue
                # OOV words map to the UNK row at index vocab_size
                index = self.vocab_index.get(word, self.vocab_size)
                class_prob += self.loglikelihood[index][i]
            if class_prob > max_prob:
                max_prob = class_prob
                max_c = self.C[i]
        return max_c

We need to be able to evaluate our predictions, so for now we will just compute the recall of each class with a simple function.

In [21]:
# update per-class counts of correct and incorrect predictions
# (e.g. tn counts negative samples predicted negative, fn counts negative samples
#  predicted otherwise, so tn / (tn + fn) below is the recall of the negative class)
def evaluate(y, y_pred, tp, fp, tn, fn, tneu, fneu):
    if y == -1:
        if y_pred == -1:
            tn += 1
        else:
            fn += 1
    elif y == 0:
        if y_pred == 0:
            tneu += 1
        else:
            fneu += 1
    else:
        if y_pred == 1:
            tp += 1
        else:
            fp += 1
    return (tp, fp, tn, fn, tneu, fneu)

Now we can train and test.

In [22]:
vocab_size = 5000
model = NaiveBayesClassifier(classes, vocab_size)
model.train(train)
print('Testing %d samples' % len(test))
tp, fp, tn, fn, tneu, fneu, count = 0, 0, 0, 0, 0, 0, 0
for sample in test:
    count += 1
    y = sample[1]
    y_pred = model.test(sample[0])
    tp, fp, tn, fn, tneu, fneu = evaluate(y, y_pred, tp, fp, tn, fn, tneu, fneu)
    if count % 5000 == 0:
        print(' %d samples, tpr %f tneur %f tnr %f' % (count, (tp / (tp + fp)),
                                                              (tneu / (tneu + fneu)),
                                                              (tn / (tn + fn))))

print('positive recall %%: %f' % (100 * tp / (tp + fp)))
print('neutral recall %%: %f' % (100 * tneu / (tneu + fneu)))
print('negative recall %%: %f' % (100 * tn / (tn + fn)))
Training -1
 Counted 34804 samples, log P(c) = -2.196412. Counting word occurrences...
 Counted 908207 occurrences. Computing log P(w|c)s...
Training 0
 Counted 54570 samples, log P(c) = -1.547557. Counting word occurrences...
 Counted 737328 occurrences. Computing log P(w|c)s...
Training 1
 Counted 70146 samples, log P(c) = -1.185305. Counting word occurrences...
 Counted 1851252 occurrences. Computing log P(w|c)s...
Testing 39880 samples
 5000 samples, tpr 0.849655 tneur 0.774688 tnr 0.735654
 10000 samples, tpr 0.845842 tneur 0.778969 tnr 0.722880
 15000 samples, tpr 0.847724 tneur 0.776567 tnr 0.729147
 20000 samples, tpr 0.849292 tneur 0.778674 tnr 0.728712
 25000 samples, tpr 0.850263 tneur 0.777230 tnr 0.727173
 30000 samples, tpr 0.850155 tneur 0.777898 tnr 0.725023
 35000 samples, tpr 0.851736 tneur 0.777918 tnr 0.725493
positive recall %: 85.145072
neutral recall %: 77.702203
negative recall %: 72.488934

Possible Improvements

We could have different unknown tokens for nouns, verbs, and other parts of speech but that would require part of speech tagging.

A naive Bayes classifier defined with individual words as features can be viewed as a set of unigram language models, one for each class. We could convert our classifier to use n-grams for any n we want instead of unigrams. If so, we have more options for smoothing:

  1. Use discounting and Katz backoff, where we "back off" from an n-gram to the (n-1)-gram whenever the n-gram has a zero count.
  2. Interpolate probabilities, i.e. compute a weighted sum of the n-gram, (n-1)-gram, etc. probabilities such that the weights sum to 1 (see the sketch after this list).
  3. Use Kneser-Ney smoothing, which bases a word's probability on how likely it is to appear as a continuation in a novel context rather than on its raw count. This favors words that are "friendlier": those that appear after many different words.
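
As a rough sketch of option 2 for a single class, suppose we already had a unigram Counter uni, a bigram Counter bi, and a total unigram count total for that class (hypothetical names), plus an interpolation weight lam:

In [ ]:
from math import log2

# sketch of simple linear interpolation for one class: mix the bigram MLE with
# the unigram MLE using weights that sum to 1 (lam and 1 - lam)
def interp_loglikelihood(w, prev, uni, bi, total, lam=0.7):
    p_uni = uni[w] / total
    p_bi = bi[(prev, w)] / uni[prev] if uni[prev] > 0 else 0.0
    p = lam * p_bi + (1 - lam) * p_uni
    return log2(p) if p > 0 else float('-inf')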

As mentioned earlier, just removing stopwords adversely affects performance. We can try to account for the lost meaning in each sample by transforming the remaining words somehow. For example, to handle the removal of stopwords that negate meaning, like not, nor, or don't, we can apply negation tagging: add a prefix indicating negation to each affected word. The sentence "I don't like pizza" becomes "I NOT_like NOT_pizza." The best way to implement this likely requires part of speech tagging.
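
A crude version of negation tagging, assuming a small hand-picked set of negation words and tagging every following token to the end of the sample (rather than stopping at a clause boundary), might look like this:

In [ ]:
# crude negation tagging: once a negation word appears, prefix the remaining
# tokens with NOT_ (a better version would stop at punctuation or use POS tags)
NEGATIONS = {'not', 'no', 'nor', "don't", 'never'}

def negation_tag(tokens):
    tagged, negating = [], False
    for tok in tokens:
        if tok.lower() in NEGATIONS:
            negating = True # drop the negation word itself
            continue
        tagged.append('NOT_' + tok if negating else tok)
    return tagged

print(negation_tag("I don't like pizza".split())) # ['I', 'NOT_like', 'NOT_pizza']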

We used individual words as features, but we can add any kind of features we want. We could have binary features that merely indicate the presence of a word rather than its count, categorical features like parts of speech, or numerical features like the number of words drawn from some predefined sentiment lexicon.
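
For instance, switching to binary presence/absence features only requires clipping each word to at most one count per document before training, a common "binarized" variant of naive Bayes; a sketch using the classes defined above:

In [ ]:
# binary features: keep each distinct word once per document, so the class
# counts become document frequencies instead of raw occurrence counts
binarized_train = [[list(set(sample[0])), sample[1]] for sample in train]
binary_model = NaiveBayesClassifier(classes, vocab_size)
binary_model.train(binarized_train)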

Computing precision and F1 score along with recall would give a better indication of overall performance.
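
A sketch of that, assuming we collected the (true label, predicted label) pairs into a list called results during testing (a hypothetical name):

In [ ]:
from collections import Counter

# per-class precision, recall, and F1 from a confusion matrix built out of
# (true, predicted) pairs
def precision_recall_f1(results, classes):
    confusion = Counter(results) # keys are (true, predicted) pairs
    for c in classes:
        tp = confusion[(c, c)]
        predicted_c = sum(confusion[(t, c)] for t in classes)
        actual_c = sum(confusion[(c, p)] for p in classes)
        precision = tp / predicted_c if predicted_c else 0.0
        recall = tp / actual_c if actual_c else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        print('class %2d: precision %.3f recall %.3f F1 %.3f' % (c, precision, recall, f1))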

Lastly, we could split the combined Twitter and Reddit data into training, tuning, and test sets so we have a chance to tune hyperparameters like vocab_size or smoothing; a sketch follows.
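
That could be as simple as an 80/10/10 split of the shuffled corpus, with the middle slice used as the tuning (dev) set:

In [ ]:
# 80/10/10 split: tune hyperparameters on dev, report final numbers on test
n = len(corpus)
train, dev, test = (corpus[:int(n * 0.8)],
                    corpus[int(n * 0.8):int(n * 0.9)],
                    corpus[int(n * 0.9):])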