Sentiment Analysis with BERT and TensorFlow

The Bidirectional Encoder Representations from Transformers model (BERT for short) was introduced by Google in 2018. At that time, Recurrent Neural Networks were the primary tool for text processing. In an earlier article, we discussed in detail why Transformer models are much better suited for this purpose.

How does the BERT Model work?

The BERT model was one of the first examples of how Transformers were used for Natural Language Processing tasks such as sentiment analysis (is a review positive or negative?) or, more generally, text classification. The basic idea comes from the field of Transfer Learning. BERT models were pre-trained on a huge text corpus with the goal of predicting missing words in a text as well as possible from their context. Depending on which BERT model we use, the training data included, for example, the complete English Wikipedia.

Figure: Example of a BERT training task. A single word in the text is masked, and the model must learn to guess this word from the context.
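
The masking task described above can be sketched in a few lines of plain Python. This is only an illustration of how one training example might be built, not BERT's actual pre-training code: the helper `mask_one_token` and the example sentence are made up for this sketch, while `[MASK]` is the placeholder token BERT uses.

```python
import random

MASK_TOKEN = "[MASK]"  # placeholder token used by BERT for hidden words


def mask_one_token(tokens, rng):
    """Hide one randomly chosen token; return (masked_tokens, position, target)."""
    position = rng.randrange(len(tokens))
    masked_tokens = list(tokens)           # copy, so the input stays untouched
    target = masked_tokens[position]       # the word the model has to guess
    masked_tokens[position] = MASK_TOKEN
    return masked_tokens, position, target


rng = random.Random(0)  # fixed seed so the sketch is reproducible
tokens = "the movie was surprisingly good".split()
masked_tokens, position, target = mask_one_token(tokens, rng)
print(masked_tokens, "->", target)
```

During pre-training, the model sees only `masked_tokens` and is rewarded for predicting `target` at `position`.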

As it turned out, these pre-trained models can then be fine-tuned for a specific application with significantly less effort and still deliver very good results. During pre-training on the large dataset, the model learns a basic understanding of the language, such as grammar and vocabulary. During fine-tuning, it concentrates exclusively on the use case and thus delivers very good results even with comparatively little data.

How to use BERT and TensorFlow for Sentiment Analysis?

The IMDb dataset on Kaggle contains a total of 50,000 movie and series reviews, each with a label that describes whether the review is positive or negative.

import pandas as pd

movie_reviews = pd.read_csv("IMDB Dataset.csv")
print(f"Review: {movie_reviews.iloc[0]['review']}")
print()
print(f"Sentiment: {movie_reviews.iloc[0]['sentiment']}")

Out:
Review: One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the fact that it goes where other shows wouldn't dare. Forget pretty pictures painted for mainstream audiences, forget charm, forget romance...OZ doesn't mess around. The first episode I ever saw struck me as so nasty it was surreal, I couldn't say I was ready for it, but as I watched more, I developed a taste for Oz, and got accustomed to the high levels of graphic violence. Not just violence, but injustice (crooked guards who'll be sold out for a nickel, inmates who'll kill on order and get away with it, well mannered, middle class inmates being turned into prison bitches due to their lack of street skills or prison experience) Watching Oz, you may become comfortable with what is uncomfortable viewing....thats if you can get in touch with your darker side.

Sentiment: positive

We want to use this dataset, together with BERT's tokenization, to train a model that infers from a movie review whether the person liked the movie or not.

Data Preprocessing

First of all, we need to import all the modules that we will need for this task.

# Regular imports
import numpy as np
import pandas as pd
import tqdm # for progress bar
import math
import random
import re

# Tensorflow Import
import tensorflow as tf
import tensorflow_hub as hub
from tensorflow.keras import layers

# BERT import for the tokenizer (the bert module is provided by the bert-for-tf2 package)
import bert

To be able to use the text, we have to prepare it accordingly. In the first step, we create a function that removes, for example, the line breaks (<br/>) and other HTML leftovers from the text. In this step, we also filter out other text impurities using Regular Expressions.

TAG_RE = re.compile(r'<[^>]+>')

def remove_tags(text):
    return TAG_RE.sub('', text)

def preprocess_text(sen):
    # Removing html tags
    sentence = remove_tags(sen)

    # Remove punctuations and numbers
    sentence = re.sub('[^a-zA-Z]', ' ', sentence)

    # Single character removal
    sentence = re.sub(r"\s+[a-zA-Z]\s+", ' ', sentence)

    # Removing multiple spaces
    sentence = re.sub(r'\s+', ' ', sentence)

    return sentence

# Clean all Reviews in DataFrame
reviews = []
sentences = list(movie_reviews['review'])
for sen in sentences:
    reviews.append(preprocess_text(sen))
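
To see what the cleaning steps do to a single review, here is a small self-contained sketch. It mirrors the steps described above (HTML tags, non-letters, single characters, repeated whitespace); the function name `clean_review` and the sample text are made up for this illustration.

```python
import re

TAG_PATTERN = re.compile(r"<[^>]+>")


def clean_review(text):
    """Apply the four cleaning steps to one review."""
    text = TAG_PATTERN.sub("", text)                 # strip HTML tags such as <br />
    text = re.sub(r"[^a-zA-Z]", " ", text)           # replace punctuation and digits with spaces
    text = re.sub(r"\s+[a-zA-Z]\s+", " ", text)      # drop isolated single letters
    text = re.sub(r"\s+", " ", text)                 # collapse runs of whitespace
    return text


sample = "Great film!<br /><br />I loved it 100%."
cleaned = clean_review(sample)
print(repr(cleaned))
```

Note that the single-character step also removes the standalone word "I", and that a trailing space can remain at the end of the string.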

From now on, we will represent the sentiment “positive” and “negative” with the numbers 1 and 0, since the machine learning model can only deal with numbers anyway and not with text.

# Save sentiments as dependent variable y
y = movie_reviews['sentiment']

# Set 1 for positive reviews and 0 for negative reviews
y = np.array(list(map(lambda x: 1 if x=="positive" else 0, y)))

We now also have to translate the reviews into integers. For this we use the tokenizer of a BERT model. It splits a sentence into individual tokens, i.e. whole words or word pieces, and represents each of them by a number. The tokenizer comes with a fixed vocabulary file that assigns exactly one number to every token, so we can still map numbers back to words or word pieces after the model training.

# Load Tokenizer and Model
BertTokenizer = bert.bert_tokenization.FullTokenizer
bert_layer = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/1",
 trainable=False)
vocabulary_file = bert_layer.resolved_object.vocab_file.asset_path.numpy()
to_lower_case = bert_layer.resolved_object.do_lower_case.numpy()
tokenizer = BertTokenizer(vocabulary_file, to_lower_case)

# Try Tokenizer
tokenizer.tokenize("don't be so rude")
Out:
['don', "'", 't', 'be', 'so', 'rude']

tokenizer.convert_tokens_to_ids(tokenizer.tokenize("dont be so rude"))
Out:
[2123, 2102, 2022, 2061, 12726]
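
The way a word is split into word pieces can be sketched as a greedy longest-match search over the vocabulary. The following is a simplified illustration, not the tokenizer's real implementation: `toy_vocab` is a tiny made-up vocabulary with made-up IDs, and `wordpiece_tokenize` is a hypothetical helper. The `##` prefix marks a piece that continues a word.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into the longest pieces found in vocab, left to right."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece      # continuation of a word
            if piece in vocab:
                match = piece
                break
            end -= 1                      # try a shorter piece
        if match is None:
            return [unk_token]            # no piece fits: the whole word is unknown
        tokens.append(match)
        start = end
    return tokens


# Toy vocabulary with made-up IDs (assumption for this sketch only)
toy_vocab = {"[UNK]": 0, "be": 1, "so": 2, "rude": 3, "don": 4,
             "##t": 5, "un": 6, "##friend": 7, "##ly": 8}

pieces = wordpiece_tokenize("unfriendly", toy_vocab)
ids = [toy_vocab[piece] for piece in pieces]
print(pieces, ids)
```

With this toy vocabulary, "dont" is split into `don` and `##t`, which matches the five IDs returned for the four words above.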

We load a specific BERT model from TensorFlow Hub. Training Transformer and BERT models is usually very costly and resource intensive, especially when dealing with such large datasets. Through TensorFlow Hub, we can use the pre-trained models from Google and other companies for free.

We then tokenize all movie reviews in our dataset so that our data consists only of numbers and not text.

# Tokenize all reviews
def tokenize_reviews(text_reviews):
    return tokenizer.convert_tokens_to_ids(tokenizer.tokenize(text_reviews))
tokenized_reviews = [tokenize_reviews(review) for review in reviews]

# Create list of lists with review, sentiment, and length of review for each entry
reviews_with_len = [[review, y[i], len(review)]
                    for i, review in tqdm.tqdm(enumerate(tokenized_reviews))]

# Shuffle dataset
random.shuffle(reviews_with_len)

# Sort by length of review
reviews_with_len.sort(key=lambda x: x[2])

# Drop the length and keep (review, sentiment) pairs
sorted_reviews_labels = [(review_lab[0], review_lab[1]) for review_lab in tqdm.tqdm(reviews_with_len)]
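
Shuffling first and sorting by length afterwards is not a contradiction: Python's sort is stable, so entries of equal length keep the random order they received from the shuffle. A minimal sketch with made-up toy entries:

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# (tokens, label, length): hypothetical toy data
entries = [([1, 2], 1, 2), ([3, 4], 0, 2), ([5], 1, 1), ([6, 7], 0, 2), ([8], 0, 1)]

rng.shuffle(entries)
shuffled_order = list(entries)              # remember the order after shuffling

entries.sort(key=lambda entry: entry[2])    # stable sort by length

for entry in entries:
    print(entry)
```

The result is ordered by length, but positive and negative reviews of the same length remain randomly mixed.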

We will store the data in a TensorFlow dataset and pass it to the model. Among other things, this allows the model to be trained much more efficiently.

# Create TensorFlow dataset
processed_dataset = tf.data.Dataset.from_generator(lambda: sorted_reviews_labels, output_types=(tf.int32, tf.int32))

# Define batch size and cut datasets by batch size
BATCH_SIZE = 32
batched_dataset = processed_dataset.padded_batch(BATCH_SIZE, padded_shapes=((None, ), ()))
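
Padding each batch to the length of its longest review is also the reason why sorting by length pays off. The following plain-Python sketch imitates this per-batch padding (it does not call TensorFlow) and counts the padding tokens with and without sorting; `count_padding` and the toy lengths are made up for the illustration.

```python
def count_padding(sequences, batch_size):
    """Count padding tokens when every batch is padded to its longest sequence."""
    padding = 0
    for start in range(0, len(sequences), batch_size):
        batch = sequences[start:start + batch_size]
        longest = max(len(seq) for seq in batch)
        padding += sum(longest - len(seq) for seq in batch)
    return padding


lengths = [2, 9, 3, 8, 1, 10]                     # toy review lengths
sequences = [[1] * n for n in lengths]

unsorted_padding = count_padding(sequences, batch_size=2)
sorted_padding = count_padding(sorted(sequences, key=len), batch_size=2)
print(unsorted_padding, sorted_padding)
```

In this toy case the sorted order needs a third of the padding, so fewer wasted positions have to be processed per batch.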

As usual, we divide the information into a training set and a test set. This allows us to examine how well our model generalizes to data we have not yet seen. We will use ten percent of the data as a test data set.

TOTAL_BATCHES = math.ceil(len(sorted_reviews_labels) / BATCH_SIZE)
TEST_BATCHES = TOTAL_BATCHES // 10
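# shuffle() returns a new dataset, so the result has to be assigned;
# reshuffle_each_iteration=False keeps the train/test split fixed across epochs
batched_dataset = batched_dataset.shuffle(TOTAL_BATCHES, reshuffle_each_iteration=False)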
test_data = batched_dataset.take(TEST_BATCHES)
train_data = batched_dataset.skip(TEST_BATCHES)
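
As a quick sanity check, the batch counts for the split can be worked out by hand. The sketch below repeats the arithmetic for the 50,000 reviews and a batch size of 32; the variable names are chosen for this illustration only.

```python
import math

num_reviews = 50_000
batch_size = 32

total_batches = math.ceil(num_reviews / batch_size)   # last batch is only partly filled
test_batches = total_batches // 10                    # ten percent for testing
train_batches = total_batches - test_batches

print(total_batches, test_batches, train_batches)
```

These are the step counts Keras reports per epoch during training and evaluation.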

Create the Model

Now comes the most exciting part of this application: We define the model.

class TEXT_MODEL(tf.keras.Model):
    
    def __init__(self,
                 vocabulary_size,
                 embedding_dimensions=128,
                 cnn_filters=50,
                 dnn_units=512,
                 model_output_classes=2,
                 dropout_rate=0.1,
                 training=False,
                 name="text_model"):
        super(TEXT_MODEL, self).__init__(name=name)
        
        self.embedding = layers.Embedding(vocabulary_size,
                                          embedding_dimensions)
        self.cnn_layer1 = layers.Conv1D(filters=cnn_filters,
                                        kernel_size=2,
                                        padding="valid",
                                        activation="relu")
        self.cnn_layer2 = layers.Conv1D(filters=cnn_filters,
                                        kernel_size=3,
                                        padding="valid",
                                        activation="relu")
        self.cnn_layer3 = layers.Conv1D(filters=cnn_filters,
                                        kernel_size=4,
                                        padding="valid",
                                        activation="relu")
        self.pool = layers.GlobalMaxPool1D()
        
        self.dense_1 = layers.Dense(units=dnn_units, activation="relu")
        self.dropout = layers.Dropout(rate=dropout_rate)
        if model_output_classes == 2:
            self.last_dense = layers.Dense(units=1,
                                           activation="sigmoid")
        else:
            self.last_dense = layers.Dense(units=model_output_classes,
                                           activation="softmax")
    
    def call(self, inputs, training):
        l = self.embedding(inputs)
        l_1 = self.cnn_layer1(l) 
        l_1 = self.pool(l_1) 
        l_2 = self.cnn_layer2(l) 
        l_2 = self.pool(l_2)
        l_3 = self.cnn_layer3(l)
        l_3 = self.pool(l_3) 
        
        concatenated = tf.concat([l_1, l_2, l_3], axis=-1) # (batch_size, 3 * cnn_filters)
        concatenated = self.dense_1(concatenated)
        concatenated = self.dropout(concatenated, training)
        model_output = self.last_dense(concatenated)
        
        return model_output

The model itself is rather unspectacular. The first layer is a trainable Embedding layer that maps the token IDs produced by the BERT tokenizer to dense vectors. We then run these embeddings through three parallel convolutional layers with kernel sizes 2, 3 and 4, each followed by global max pooling, and concatenate the results. Finally, two Dense layers reduce the number of neurons until the output layer matches the number of sentiments. In our case, we have only two types of review, positive and negative, so a single sigmoid unit is enough. However, this class could also be used for more than two output labels.
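
To make the convolution and pooling step more tangible, here is a deliberately tiny plain-Python sketch. It uses one filter per kernel size and a one-dimensional "embedding" (a single number per token), which is a strong simplification of the Keras layers; all values and the helper `conv1d_valid_relu` are made up for this illustration.

```python
def conv1d_valid_relu(sequence, kernel):
    """1D convolution without padding ("valid"), followed by ReLU."""
    k = len(kernel)
    outputs = []
    for i in range(len(sequence) - k + 1):
        value = sum(sequence[i + j] * kernel[j] for j in range(k))
        outputs.append(max(0.0, value))   # ReLU
    return outputs


sequence = [1.0, -2.0, 3.0, 0.5, 2.0]      # toy values, one per token
kernels = [[1.0, 1.0],                     # kernel size 2
           [1.0, 0.0, 1.0],                # kernel size 3
           [0.5, 0.5, 0.5, 0.5]]           # kernel size 4

feature_maps = [conv1d_valid_relu(sequence, kernel) for kernel in kernels]
features = [max(feature_map) for feature_map in feature_maps]  # global max pooling

print([len(feature_map) for feature_map in feature_maps])
print(features)
```

Each kernel produces a feature map of a different length, but global max pooling reduces every map to a single number, so the concatenated vector has a fixed size regardless of the review length.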

Before we can train the model we still have to define the hyperparameters and compile the model.

# Hyperparameters
VOCAB_LENGTH = len(tokenizer.vocab)
EMB_DIM = 200
CNN_FILTERS = 100
DNN_UNITS = 256
OUTPUT_CLASSES = 2
DROPOUT_RATE = 0.2
NB_EPOCHS = 2

# Build Model
text_model = TEXT_MODEL(vocabulary_size=VOCAB_LENGTH,
                        embedding_dimensions=EMB_DIM,
                        cnn_filters=CNN_FILTERS,
                        dnn_units=DNN_UNITS,
                        model_output_classes=OUTPUT_CLASSES,
                        dropout_rate=DROPOUT_RATE)

# Compile Model
text_model.compile(loss="binary_crossentropy",
                   optimizer="adam",
                   metrics=["accuracy"])

Now we are ready to train the model for a total of two epochs.

text_model.fit(train_data, epochs=NB_EPOCHS)

Out:
Epoch 1/2
1407/1407 [==============================] - 463s 326ms/step - loss: 0.3029 - accuracy: 0.8667
Epoch 2/2
1407/1407 [==============================] - 452s 320ms/step - loss: 0.1293 - accuracy: 0.9531

On the training dataset, we already achieve an accuracy of 95%. To make sure that the model has not just memorized the training data, we also look at the accuracy on the test dataset to see whether it generalizes well.

text_model.evaluate(test_data)

Out:
156/156 [==============================] - 2s 14ms/step - loss: 0.3774 - accuracy: 0.8812

Here, too, we achieve a satisfactory accuracy of 88.12%.

# Test model on two individual reviews
test_reviews = ['This was an awesome movie. I watch it twice my time watching this beautiful movie if I have known it was this good',
                'One of the worst movies of all time. I cannot believe I wasted two hours of my life for this movie']
test_reviews_with_len = [[tokenize_reviews(test_reviews[0]), 
                          1, 
                          len(tokenize_reviews(test_reviews[0]))],
                         [tokenize_reviews(test_reviews[1]), 
                          0,  
                          len(tokenize_reviews(test_reviews[1]))]]
test_sorted_reviews_labels = [(review_lab[0], review_lab[1]) for review_lab in tqdm.tqdm(test_reviews_with_len)]

predict_input = tf.data.Dataset.from_generator(lambda: test_sorted_reviews_labels, output_types=(tf.int32, tf.int32))
BATCH_SIZE = 2
test_batched_dataset = predict_input.padded_batch(BATCH_SIZE, padded_shapes=((None, ), ()))

# Get model prediction
text_model.predict(test_batched_dataset)

Out: 
array([[0.9999305 ],[0.00232589]], dtype=float32)

Even for these two made-up reviews, the model classifies correctly: it recognizes that the first text is a positive review and the second a negative one.
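
The model returns a probability between 0 and 1 for each review. A small sketch of how these values could be turned into labels, assuming the common threshold of 0.5 (the helper `to_label` is made up for this illustration):

```python
# Probabilities taken from the prediction output above
probabilities = [0.9999305, 0.00232589]


def to_label(probability, threshold=0.5):
    """Map a sigmoid output to a sentiment label (threshold is an assumption)."""
    return "positive" if probability >= threshold else "negative"


labels = [to_label(p) for p in probabilities]
print(labels)
```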

This is what you should take with you

BERT models can be used for many different NLP applications, such as sentiment analysis.
This example has shown how to use BERT Embedding for text classification.
Even without the BERT encoding layer, pure embedding already provides very good results.

You can find a selection of larger and smaller pre-trained BERT models here.
We largely use Stackabuse’s code and supplement it with our own comments.

Niklas Lang
