Introduction
Let's make a naive Bayes classifier. We will run it on a collection of tweets and Reddit comments, each labeled with positive, neutral, or negative sentiment.
First, let's load the data. We'll combine the two datasets, shuffle them, and split the result into training and test sets.
from random import shuffle

def getsamples(lines: list):
    # each line is expected to look like 'text,label'; a multiline tweet is
    # wrapped in double quotes and spans several lines, so we stitch it back together
    n = len(lines)
    i = 1  # start at 1 to skip the header row
    samples = []
    while i < n:
        line = lines[i]
        xy = line.split(',')
        x = xy[0].strip().split(' ')
        if len(xy) == 1:  # multiline tweet
            line = lines[i][1:].rstrip()  # remove double quote
            i += 1
            if len(line) == 0 and i < n:
                # beginning of tweet so don't put space
                line += lines[i].split(',')[0].rstrip()
                i += 1
            while i < n and len(lines[i].split(',')) == 1:
                line += ' ' + lines[i].rstrip()
                i += 1
            xy = lines[i].split(',')
            endsize = len(xy[0])
            if endsize > 1:  # ignore double quote i.e. endsize == 1
                # remove double quote at the end
                line += ' ' + xy[0][:endsize - 1]
            x = line.split(' ')
        y = int(xy[1].rstrip())
        samples.append([x, y])
        i += 1
    return samples

def getcorpus():
    corpus = []
    fnames = ['reddit',
              'twitter']
    ext = '.csv'
    for fname in fnames:
        with open(fname + ext, 'r', encoding='latin-1') as f:
            corpus.extend(getsamples(f.readlines()))
    return corpus

corpus = getcorpus()
shuffle(corpus)  # shuffle so the test set isn't just twitter
train = corpus[:int(len(corpus) * 0.8)]
test = corpus[int(len(corpus) * 0.8):]
classes = [-1, 0, 1]  # negative, neutral, positive
Naive Bayes
Now we can define our naive Bayes classifier class. All we really need is a train function and a test function.
To train, we calculate the probabilities of each class, P(c), and of every feature (in our case individual words) given each class, P(w|c), using MLE and counting.
The prior probability of a class P(c) will be defined as the number of documents (tweets or comments) in class c out of all available documents.
The likelihood of a feature given a class P(w|c) will be defined as the number of occurrences of w in documents in class c out of the number of occurrences of all words in documents in class c.
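In symbols, with N_c the number of documents in class c, N_doc the total number of documents, V the vocabulary, and count(w, c) the number of times w occurs in documents of class c, the MLE estimates are:

\[
P(c) = \frac{N_c}{N_{doc}}, \qquad
P(w \mid c) = \frac{\mathrm{count}(w, c)}{\sum_{w' \in V} \mathrm{count}(w', c)}
\]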
For a test sample, we compute P(c)P(d|c) for each class, the prior times the probability of the sample document given the class, and return the argmax class as our prediction.
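To make the decision rule concrete before writing the real class, here's a tiny self-contained sketch on a made-up toy corpus (the documents, labels, and the add-one smoothing value are purely illustrative; smoothing itself is explained in the next section):

from math import log2
from collections import Counter

# toy corpus: (tokens, label) pairs; everything here is made up for illustration
toy = [(['good', 'movie'], 1), (['bad', 'movie'], -1), (['good', 'good'], 1)]
vocab = {w for x, _ in toy for w in x}

def toy_predict(tokens, classes=(-1, 1), smoothing=1.0):
    best_c, best_score = None, float('-inf')
    for c in classes:
        docs = [x for x, y in toy if y == c]
        counts = Counter(w for x in docs for w in x)
        total = sum(counts.values()) + smoothing * len(vocab)
        # log P(c) plus the sum of log P(w|c) over the words in the sample
        score = log2(len(docs) / len(toy))
        score += sum(log2((counts[w] + smoothing) / total) for w in tokens)
        if score > best_score:
            best_c, best_score = c, score
    return best_c

print(toy_predict(['good', 'movie']))  # 1 (positive)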
Implementation Details
We'll need to import a log function since we would rather add up log probabilities than multiply lots of near-zero probabilities and risk underflow.
We'll also need the Counter class to get frequency counts from our training corpus. Specifically, we're going to use a hyperparameter vocab_size to set our features as the vocab_size most common words in the corpus. These are the most common words in the entire corpus, not just for a class.
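For example (with made-up tokens), Counter.most_common gives exactly that ordering:

from collections import Counter

counts = Counter(['good', 'movie', 'good', 'bad', 'good'])
vocab = [w for w, _ in counts.most_common(2)]   # ['good', 'movie']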
Notice the additional smoothing parameter. Since we are using individual words as features, there's a chance during training that we find a word that doesn't show up in any documents for a class and gets assigned zero probability. The computer will not like taking log of zero, and we will not like the model. To avoid this, we simply add the value of smoothing to every word count, which gets rid of zero probabilities. Since the denominator grows by the same total amount we add across the numerators, we maintain a valid probability distribution. With a value of 1 this is Laplace (add-one) smoothing; a fractional value like 0.05 (often called add-k smoothing) typically works better on smaller datasets because we don't want to assign too much probability to words that never occurred in a class.
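As a small illustration with made-up counts for one class (the words and values are not from our data):

from collections import Counter

counts = Counter({'good': 3, 'movie': 1})    # word counts for one class
vocab_size = 5000
smoothing = 0.05

# denominator grows by smoothing * (vocab_size + 1); the extra slot is for
# the unknown token introduced below
total = sum(counts.values()) + smoothing * (vocab_size + 1)
p_good = (counts['good'] + smoothing) / total    # a seen word
p_awful = (counts['awful'] + smoothing) / total  # an unseen word still gets p > 0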
UPDATES:
Included an unknown token <unk> as part of the vocabulary so we can assign a probability to out-of-vocabulary (OOV) words during testing. Its probability for a given class is estimated from the summed counts of all the OOV words in that class.
Added an option to ignore a list of stopwords from NLTK. It's disabled by default because, unless we somehow account for the meaning those stopwords carried, removing them loses information and hurts performance.
from math import log2
from collections import Counter
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

class NaiveBayesClassifier:
    UNK = '<unk>'

    def __init__(self, C:list, vocab_size:int=5000, smoothing:float=0.05,
                 ignore_stopwords:bool=False):
        self.C = C
        self.vocab_size = vocab_size
        self.smoothing = smoothing
        self.ignore_stopwords = ignore_stopwords

    # ignore stopwords when building vocab
    # returns the vocab list
    def stopword_free_vocab(self, counts:Counter):
        V = []
        common_counts = counts.most_common()
        size, i = 0, 0
        while size < self.vocab_size:
            while common_counts[i][0] in stopwords.words('english') or len(common_counts[i][0]) == 0:
                i += 1
            V.append(common_counts[i][0])
            i += 1
            size += 1
        return V

    # see most likely words in vocab
    def common_class_words(self, class_index:int, num_words:int=10):
        vocab_likelihoods = [row[class_index] for row in self.loglikelihood]
        likely_word_indices = sorted(range(len(vocab_likelihoods)),
                                     key=lambda idx: vocab_likelihoods[idx],
                                     reverse=True)[:num_words]
        for j in range(num_words):
            w = self.V[likely_word_indices[j]] if likely_word_indices[j] < self.vocab_size else self.UNK
            print(' (%s, %f)' % (w, vocab_likelihoods[likely_word_indices[j]]))

    def train(self, D:list):
        # get counts from a single document of all samples
        big = []
        for i in range(len(D)):
            big.extend(D[i][0])
        counts = Counter(big)
        # build vocabulary
        if self.ignore_stopwords:
            self.V = self.stopword_free_vocab(counts)
        else:
            self.V = [tup[0] for tup in counts.most_common(self.vocab_size)]
        # iterate over the classes, compute P(c) and P(w|c)'s for each class
        N_doc = len(D)
        self.logpriors = []
        self.bigdoc = []
        self.loglikelihood = [[] for _ in range(self.vocab_size + 1)]  # extra row for UNK
        for i in range(len(self.C)):
            c = self.C[i]
            print('Training %d' % c)
            # build class document and calculate P(c)
            N_c = 0
            self.bigdoc.append([])
            for sample in D:
                if sample[1] == c:
                    N_c += 1
                    self.bigdoc[i].extend(sample[0])
            self.logpriors.append(log2(N_c / N_doc))
            print(' Counted %d samples, log P(c) = %f. Counting word occurrences...' %
                  (N_c, self.logpriors[i]))
            # get word counts from class doc
            counter = Counter(self.bigdoc[i])
            # smoothed occurrences, vocab_size + 1 for UNK
            occurrences = sum(counter.values()) + self.smoothing * (self.vocab_size + 1)
            print(' Counted %d occurrences. Computing log P(w|c)s...' % occurrences)
            # get counts for every word in V
            for j in range(self.vocab_size):
                w = self.V[j]
                w_count = counter[w]
                self.loglikelihood[j].append(log2((w_count + self.smoothing) / occurrences))
            # get sum of OOV word counts for UNK
            unk_c = 0
            for w, w_count in counter.most_common():
                if w in self.V:
                    continue
                unk_c += w_count
            self.loglikelihood[self.vocab_size].append(log2((unk_c + self.smoothing) / occurrences))
            # print most common words
            # self.common_class_words(i, 5)

    # get probability of each class, choose argmax class as prediction
    def test(self, testdoc):
        max_prob, max_c = float('-inf'), None
        for i in range(len(self.C)):
            class_prob = self.logpriors[i]
            for word in testdoc:
                # don't give stopwords UNK probability if ignoring
                if self.ignore_stopwords and word in stopwords.words('english'):
                    continue
                index = self.V.index(word) if word in self.V else self.vocab_size
                class_prob += self.loglikelihood[index][i]
            if class_prob > max_prob:
                max_prob = class_prob
                max_c = self.C[i]
        return max_c
We need to be able to evaluate our predictions, so for now we will just compute the recall of each class with a simple function.
# update the per-class counts of correct and incorrect predictions
# (note: fp here counts positive samples that were predicted as some other class,
# so tp / (tp + fp) is the recall of the positive class, and likewise for the others)
def evaluate(y, y_pred, tp, fp, tn, fn, tneu, fneu):
    if y == -1:
        if y_pred == -1:
            tn += 1
        else:
            fn += 1
    elif y == 0:
        if y_pred == 0:
            tneu += 1
        else:
            fneu += 1
    else:
        if y_pred == 1:
            tp += 1
        else:
            fp += 1
    return (tp, fp, tn, fn, tneu, fneu)
Now we can train and test.
vocab_size = 5000
model = NaiveBayesClassifier(classes, vocab_size)
model.train(train)
print('Testing %d samples' % len(test))
tp, fp, tn, fn, tneu, fneu, count = 0, 0, 0, 0, 0, 0, 0
for sample in test:
    count += 1
    y = sample[1]
    y_pred = model.test(sample[0])
    tp, fp, tn, fn, tneu, fneu = evaluate(y, y_pred, tp, fp, tn, fn, tneu, fneu)
    if count % 5000 == 0:
        print(' %d samples, tpr %f tneur %f tnr %f' % (count, (tp / (tp + fp)),
                                                       (tneu / (tneu + fneu)),
                                                       (tn / (tn + fn))))
print('positive recall %%: %f' % (100 * tp / (tp + fp)))
print('neutral recall %%: %f' % (100 * tneu / (tneu + fneu)))
print('negative recall %%: %f' % (100 * tn / (tn + fn)))
Training -1
 Counted 34804 samples, log P(c) = -2.196412. Counting word occurrences...
 Counted 908207 occurrences. Computing log P(w|c)s...
Training 0
 Counted 54570 samples, log P(c) = -1.547557. Counting word occurrences...
 Counted 737328 occurrences. Computing log P(w|c)s...
Training 1
 Counted 70146 samples, log P(c) = -1.185305. Counting word occurrences...
 Counted 1851252 occurrences. Computing log P(w|c)s...
Testing 39880 samples
 5000 samples, tpr 0.849655 tneur 0.774688 tnr 0.735654
 10000 samples, tpr 0.845842 tneur 0.778969 tnr 0.722880
 15000 samples, tpr 0.847724 tneur 0.776567 tnr 0.729147
 20000 samples, tpr 0.849292 tneur 0.778674 tnr 0.728712
 25000 samples, tpr 0.850263 tneur 0.777230 tnr 0.727173
 30000 samples, tpr 0.850155 tneur 0.777898 tnr 0.725023
 35000 samples, tpr 0.851736 tneur 0.777918 tnr 0.725493
positive recall %: 85.145072
neutral recall %: 77.702203
negative recall %: 72.488934
Possible Improvements
We could have different unknown tokens for nouns, verbs, and other parts of speech, but that would require part-of-speech tagging.
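A rough sketch of that idea, assuming NLTK's pos_tag and its default tagger model are available (the <unk_noun>/<unk_verb> token names are made up here, and tagging a single word out of context is crude, so a real version would tag the whole sentence):

import nltk
nltk.download('averaged_perceptron_tagger')

def pos_unk(word):
    # map an OOV word to a coarse, POS-specific unknown token
    tag = nltk.pos_tag([word])[0][1]
    if tag.startswith('NN'):
        return '<unk_noun>'
    if tag.startswith('VB'):
        return '<unk_verb>'
    return '<unk>'

print(pos_unk('pizza'))  # likely '<unk_noun>'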
A naive Bayes classifier defined with individual words as features can be viewed as a set of unigram language models, one for each class. We can convert our classifier to use n-grams for any n we want instead of unigrams. If so, we have more options for smoothing:
- Use discounting and Katz backoff, where we "back off" from an n-gram to (n-1)-gram if the probability of the n-gram is 0.
- Interpolate probabilities, i.e. compute a weighted sum of the n-gram, (n-1)-gram, etc. probabilities such that the weights sum to 1 (sketched after this list).
- Use Kneser-Ney smoothing which computes the probability of a word as the probability of it appearing as a continuation in some novel context instead of just the count. This favors words that are "friendlier": those that appear after many different words.
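Here's a minimal sketch of the interpolation option on a made-up class document (the tokens and the lambda weight are illustrative):

from collections import Counter

tokens = ['i', 'like', 'pizza', 'i', 'like', 'pasta']   # made-up class document
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

def interp_prob(prev, word, lam=0.7):
    # P(word | prev) as a weighted mix of bigram and unigram MLE estimates
    p_bigram = bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0
    p_unigram = unigrams[word] / len(tokens)
    return lam * p_bigram + (1 - lam) * p_unigram

print(interp_prob('like', 'pizza'))  # 0.7 * (1/2) + 0.3 * (1/6) ≈ 0.4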
As mentioned earlier, just removing stopwords adversely affects performance. We can try to account for the lost meaning in each sample by transforming the remaining words somehow. For example, to handle removing stopwords that negate meaning like not, nor, or don't we can apply negation tagging by adding a prefix indicating negation to each affected word. The sentence "I don't like pizza" becomes "I NOT_like NOT_pizza." The best way to implement this likely requires part of speech tagging.
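A simple heuristic version of negation tagging might look like the sketch below (the negator and punctuation sets are made up and far from complete):

NEGATORS = {'not', 'no', 'nor', "don't", "didn't", "can't", "won't"}
PUNCT = {'.', ',', '!', '?', ';'}

def negation_tag(tokens):
    # prefix words after a negator with NOT_ until punctuation (a rough heuristic)
    tagged, negating = [], False
    for tok in tokens:
        if tok.lower() in NEGATORS:
            negating = True
            continue                    # drop the negating stopword itself
        if tok in PUNCT:
            negating = False
        tagged.append('NOT_' + tok if negating else tok)
    return tagged

print(negation_tag(['I', "don't", 'like', 'pizza']))  # ['I', 'NOT_like', 'NOT_pizza']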
We used individual words as features, but we can add any kind of features we want. We could have binary features that merely indicate the presence of a word rather than its count, categorical features like parts of speech, or numerical features like the number of words used from some predefined sentiment lexicon.
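For instance, binarizing counts and counting sentiment-lexicon hits could look like this (the lexicon here is a made-up stand-in for a real one):

from collections import Counter

tokens = ['good', 'good', 'movie', 'not', 'bad']

count_features = Counter(tokens)                  # what we use now: raw counts
binary_features = {w: 1 for w in set(tokens)}     # presence/absence only

POSITIVE_LEXICON = {'good', 'great', 'love'}      # stand-in sentiment lexicon
num_positive = sum(1 for w in tokens if w in POSITIVE_LEXICON)  # 2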
Computing precision and F1 score along with recall would give a better indication of overall performance.
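Given per-class counts of true positives, false positives, and false negatives (note this means tracking which class wrong predictions went to, which the evaluate function above doesn't do), the computation is straightforward:

def precision_recall_f1(tp, fp, fn):
    # tp: samples of the class predicted correctly
    # fp: samples of other classes predicted as this class
    # fn: samples of the class predicted as some other class
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1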
Lastly, we could split the combined Twitter and Reddit data into training, tuning, and test sets so we have a chance to tune hyperparameters like vocab size or smoothing.
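One possible 80/10/10 version of the earlier split (the proportions are arbitrary):

shuffle(corpus)
n = len(corpus)
train = corpus[:int(n * 0.8)]
dev = corpus[int(n * 0.8):int(n * 0.9)]   # tuning set for vocab_size / smoothing
test = corpus[int(n * 0.9):]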