Amazon Review Classification using Natural Language Processing (Naive Bayes Edition)
In this article, we will tackle the task of determining whether Amazon reviews are positive or negative based on their content.
We will use Naive Bayes and BERT models, focusing on Naive Bayes this time.
Please refer to this article for the BERT version.
What is Naive Bayes?
Naive Bayes is a statistical method based on Bayes' Theorem. It comes in several variants; this time we will use the Bernoulli model rather than the Multinomial model.
As an example, let's consider whether a certain sentence d belongs to the "IT" category.
In Naive Bayes, we consider the probability of being in the IT category given a sentence d.
It calculates the probability for each category and classifies the sentence into the category with the highest probability.
This is called the posterior probability, and Bayes' Theorem expresses it as:

P(IT|d) = P(IT) P(d|IT) / P(d)

Since P(d) is the same for all categories, to compare posterior probabilities we only need the prior probability P(IT) and the likelihood P(d|IT).
P(IT) is simply the proportion of the IT category in the total number of sentences in the training data.
For example, if there are 1000 sentences in total and 100 of them are about IT, then P(IT) = 100 / 1000 = 0.1.
P(d|IT) requires a bit more thought.
First, we assume that a sentence is a set of words.
Since it's a set, we don't consider where the words appear in the sentence (this approach is called Bag-of-Words).
Using this concept, sentence d can be seen as a set of words w.
We also need to make one more assumption: all words appear independently.
Hold back your urge to shout, "Don't you know about co-occurring words! 'Machine' and 'Learning' are used together all the time!" and let's move on.
Assuming word occurrences are independent, the probability of sentence d (a set of words w_1, ..., w_n) appearing given category IT factorizes as:

P(d|IT) = P(w_1|IT) × P(w_2|IT) × ... × P(w_n|IT)
P(w_i|IT) is the probability of word w_i appearing in the IT category, which can be determined from the training data.
In practice, we take the logarithm to prevent underflow, but for now, this allows us to calculate the probability.
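To make the log trick concrete, here is a minimal sketch with made-up categories, priors, and word probabilities (all numbers are invented for illustration): the product of probabilities becomes a sum of logarithms, which does not underflow.

```python
import math

# Hypothetical values, as if estimated from training data
priors = {"IT": 0.1, "Cooking": 0.2}
word_probs = {
    "IT":      {"iphone": 0.03,  "app": 0.02,  "recipe": 0.001},
    "Cooking": {"iphone": 0.001, "app": 0.002, "recipe": 0.05},
}

def log_score(words, category):
    # log P(c) + sum of log P(w|c): the product becomes a sum, avoiding underflow
    score = math.log(priors[category])
    for w in words:
        score += math.log(word_probs[category][w])
    return score

sentence = ["iphone", "app"]
best = max(priors, key=lambda c: log_score(sentence, c))
print(best)  # "IT": 0.1 * 0.03 * 0.02 beats 0.2 * 0.001 * 0.002
```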
However, there's still a problem.
When predicting categories for unknown sentences, if even one word w' is included that wasn't in the training data, P(w'|IT) becomes 0, and consequently P(d'|IT) also becomes 0.
For example, imagine if the articles used for training were all written by Apple enthusiasts.
In this case, even if the target sentence contains words like iPhone, if "Android" is included, the probability for the IT category becomes 0 (this is called the zero-frequency problem).
To address this, we assign a certain value to prevent this issue (this method is called smoothing).
The mathematical expression is as follows:

P(w|IT) = (Count(w) + 1) / (N + V)

Count(w) is the frequency of word w in the IT category, N is the total number of words in the IT category, and V is the vocabulary size. (Adding 1 to every count like this is known as add-one, or Laplace, smoothing.)
This way, the probability doesn't become zero, and we can avoid the zero-frequency problem.
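A toy calculation of the smoothed probabilities (the four-word "corpus" is invented; the vocabulary is taken to be the three seen words plus the unseen word "android"):

```python
from collections import Counter

it_words = ["iphone", "apple", "iphone", "mac"]  # toy IT-category corpus
counts = Counter(it_words)
N = len(it_words)            # total number of words in the IT category: 4
V = len(set(it_words)) + 1   # vocabulary size, counting the unseen "android": 4

def smoothed_prob(word):
    # (Count(w) + 1) / (N + V): add-one smoothing
    return (counts[word] + 1) / (N + V)

print(smoothed_prob("iphone"))   # (2 + 1) / (4 + 4) = 0.375
print(smoothed_prob("android"))  # (0 + 1) / (4 + 4) = 0.125, no longer zero
```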
Preprocessing
Now, let's move on to the implementation.
My execution environment is as follows:
- Google Colaboratory (Pro +)
- Python 3.8.10
- janome 0.4.2
We will use the publicly available dataset Webis Cross-Lingual Sentiment Dataset 2010.
From the available files, we will use cls-acl10-unprocessed.
Although preprocessed datasets are available, we will perform the preprocessing ourselves for educational purposes.
I referred to this book for the preprocessing code.
First, let's check the contents of the dataset.
<item><category>Book</category><rating>4.0</rating>
(Omitted)
<text>
Review text
</text>
(Omitted)
</item>
Preprocessing steps:
- Decompose the file into individual items using the <item> tag.
- Extract the review text and rating using regular expressions.
- Tokenize the review text into individual tokens using a tokenizer.
Let's implement this.
import re
from janome.tokenizer import Tokenizer
from sklearn.feature_extraction.text import CountVectorizer

t = Tokenizer()
vectorizer = CountVectorizer(token_pattern=r'(?u)\b\w+\b')  # keep one-character tokens (support for Japanese)

def get_tokenized_sentences_and_labels(file_name: str):
    # Load the data
    with open(file_name) as f:
        data = f.read()
    data = data.replace('\n', '').replace('\r', '')
    reviews = re.findall(pattern=r'<item>(.+?)</item>', string=data)  # split into individual items
    tokenized_sentences = []
    labels = []
    for item in reviews:
        rating = re.findall(pattern=r'<rating>(.+?)</rating>', string=item)
        text = re.findall(pattern=r'<text>(.+?)</text>', string=item)
        if rating and text:
            # Adopt only reviews that have both a rating and text
            words = list(t.tokenize(text[0], wakati=True))  # decompose into tokens
            a_tokenized_sentence = ' '.join(words)  # join with space delimiters
            tokenized_sentences.append(a_tokenized_sentence)
            rating = int(float(rating[0]))  # 5-point rating
            label = 0 if rating >= 3 else 1  # positive -> 0, negative -> 1
            labels.append(label)
    return tokenized_sentences, labels

def make_sample_vectors(train_file_name: str, test_file_name: str):
    train_tokenized_sentences, train_y = get_tokenized_sentences_and_labels(train_file_name)
    train_X = vectorizer.fit_transform(train_tokenized_sentences)
    test_tokenized_sentences, test_y = get_tokenized_sentences_and_labels(test_file_name)
    test_X = vectorizer.transform(test_tokenized_sentences)
    return train_X, train_y, test_X, test_y

train_X, train_y, test_X, test_y = make_sample_vectors(config.input_path + 'train.review', config.input_path + 'test.review')
Training
We will proceed with training using sklearn's naive_bayes module.
In Naive Bayes, the smoothing parameter (alpha in scikit-learn, which defaults to 1.0, i.e. add-one smoothing) is the main hyperparameter.
If you're interested, you might be able to improve accuracy by performing hyperparameter tuning, so feel free to try it out.
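As a sketch of what such tuning could look like, GridSearchCV can search over BernoulliNB's alpha. Random binary features stand in for the review vectors here so the snippet runs on its own; substitute train_X/train_y from the preprocessing step for a real search.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import BernoulliNB

# Random binary features as a stand-in for the review vectors
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 50))
y = rng.integers(0, 2, size=200)

# 5-fold cross-validated search over the smoothing parameter alpha
grid = GridSearchCV(BernoulliNB(), param_grid={"alpha": [0.01, 0.1, 0.5, 1.0, 2.0]}, cv=5)
grid.fit(X, y)
print(grid.best_params_)
```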
from sklearn.naive_bayes import BernoulliNB

cl = BernoulliNB()
cl.fit(train_X, train_y)
cl.score(test_X, test_y)
Checking the score, it turned out to be 0.729.
The confusion matrix is as follows:

The vertical axis represents the actual labels, and the horizontal axis represents the predicted labels.
You can see that there are many cases where positive reviews are misidentified as negative.
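For reference, the score and confusion matrix can be produced along these lines. Random binary features stand in for the review vectors so the snippet is self-contained; substitute the train_X/test_X built during preprocessing to reproduce the actual numbers.

```python
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.naive_bayes import BernoulliNB

# Random stand-ins; replace with the vectors built during preprocessing
rng = np.random.default_rng(42)
train_X = rng.integers(0, 2, size=(300, 40))
train_y = rng.integers(0, 2, size=300)
test_X = rng.integers(0, 2, size=(100, 40))
test_y = rng.integers(0, 2, size=100)

cl = BernoulliNB()
cl.fit(train_X, train_y)
print(cl.score(test_X, test_y))  # mean accuracy on the test set

# Rows are actual labels, columns are predicted labels (0 = positive, 1 = negative)
cm = confusion_matrix(test_y, cl.predict(test_X))
print(cm)
```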
The code implemented this time can be found here. It is messy because it was not written for sharing purposes.
It should work if you update the PATH.
Please try playing around with it by changing parameters, etc.
Summary
Although Naive Bayes is a classical method, it was able to predict with a decent accuracy of over 70%.
The high probability of false negatives seems to leave room for further analysis.
I also wrote an article experimenting with the BERT version, so please check it out here.
This article is part of a relay article series from the "Online Community AcademiX for Learning AI and Machine Learning", to which I belong.
If you want to learn more deeply about AI, please join the community!