Sentiment analysis is one of the hottest topics in deep learning and falls under the umbrella of natural language processing (NLP). Companies use sentiment analysis for product analytics, brand monitoring and many other applications. In this tutorial, you will discover how to develop a sentiment analysis model using deep learning, specifically a recurrent neural network (RNN).

Once again, we will use a dataset from the UCI Machine Learning Repository. It is comprised of user reviews from three different websites (Amazon, Yelp and IMDb), each labelled "Positive" or "Negative". The goal is to build a model that classifies user reviews into those two classes.
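
The loading step is not shown here in detail, so here is a minimal sketch. It assumes the UCI archive was extracted into a ./data folder with its usual tab-separated files (amazon_cells_labelled.txt, imdb_labelled.txt, yelp_labelled.txt); adjust the paths to your setup.

import pandas as pd

# Assumed paths to the three tab-separated review files from the UCI archive
files = ['./data/amazon_cells_labelled.txt',
         './data/imdb_labelled.txt',
         './data/yelp_labelled.txt']

# Column 0 holds the review text, column 1 the 0/1 sentiment label
df = pd.concat([pd.read_csv(f, sep='\t', header=None) for f in files],
               ignore_index=True)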

As always, the first step is data cleanup. We will strip HTML tags, keep only alphabetic characters, drop stray single characters and collapse repeated whitespace.

import re

def preprocess_text(sen):
    # Removing html tags
    sentence = remove_tags(sen)

    # Remove punctuation and numbers
    sentence = re.sub('[^a-zA-Z]', ' ', sentence)

    # Single character removal
    sentence = re.sub(r"\s+[a-zA-Z]\s+", ' ', sentence)

    # Removing multiple spaces
    sentence = re.sub(r'\s+', ' ', sentence)

    return sentence

TAG_RE = re.compile(r'<[^>]+>')

def remove_tags(text):
    return TAG_RE.sub('', text)

X = []
sentences = list(df[0])
y = df[1].to_numpy()
for sen in sentences:
    X.append(preprocess_text(sen))

Next, we split the dataset into training and test sets. For that we use scikit-learn, as TensorFlow doesn't provide an easy way to do it at the time of writing.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

Now, one of the most important steps in text classification is turning words into numbers that an embedding layer can consume. There are several ways to do that; here I use the Keras Tokenizer, which maps each word to an integer index and keeps the 5,000 most frequent words.

from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(X_train)
X_train = tokenizer.texts_to_sequences(X_train)
X_test = tokenizer.texts_to_sequences(X_test)

Now we pad every sequence to a fixed length of 300 tokens.

from tensorflow.keras.preprocessing.sequence import pad_sequences

# Adding 1 because of the reserved 0 index
vocab_size = len(tokenizer.word_index) + 1
MAX_LENGTH = 300

X_train = pad_sequences(X_train, padding='post', maxlen=MAX_LENGTH)
X_test = pad_sequences(X_test, padding='post', maxlen=MAX_LENGTH)
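
As a quick sanity check, we can confirm that every review is now a fixed-length array of word indices:

print(X_train.shape)    # (number of training reviews, 300)
print(X_test.shape)     # (number of test reviews, 300)
print(X_train[0][:10])  # first 10 word indices of the first training review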

Now we are ready to build the input functions for the estimator:

import tensorflow as tf

BATCH_SIZE = 32
REPEAT_SIZE = 100

def train_input_fn():
    return tf.data.Dataset.from_tensor_slices((X_train, y_train)).repeat(REPEAT_SIZE).batch(BATCH_SIZE)

def test_input_fn():
    return tf.data.Dataset.from_tensor_slices((X_test, y_test)).repeat(REPEAT_SIZE).batch(BATCH_SIZE)
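
One thing to note: the training pipeline above repeats and batches the data but never shuffles it. If you want shuffling, a minimal variant looks like this:

def shuffled_train_input_fn():
    return (tf.data.Dataset.from_tensor_slices((X_train, y_train))
            .shuffle(buffer_size=len(X_train))  # reshuffle the whole training set each epoch
            .repeat(REPEAT_SIZE)
            .batch(BATCH_SIZE))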

For better accuracy, we will use pre-trained GloVe word vectors (the 300-dimensional glove.6B set).

from numpy import asarray
from numpy import zeros

# Map each GloVe word to its 300-dimensional vector
embeddings_dictionary = dict()
with open('./glove/glove.6B.300d.txt', encoding="utf8") as glove_file:
    for line in glove_file:
        records = line.split()
        word = records[0]
        vector_dimensions = asarray(records[1:], dtype='float32')
        embeddings_dictionary[word] = vector_dimensions

For each word in the tokenizer's vocabulary, we look up the associated GloVe vector and store it in an embedding matrix; words without a GloVe entry keep a zero vector.

embedding_matrix = zeros((vocab_size, 300))
for word, index in tokenizer.word_index.items():
    embedding_vector = embeddings_dictionary.get(word)
    if embedding_vector is not None:
        embedding_matrix[index] = embedding_vector
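
As a quick check, we can count how many vocabulary words actually received a GloVe vector (rows left at zero correspond to out-of-vocabulary words):

import numpy as np

covered = np.count_nonzero(np.any(embedding_matrix != 0, axis=1))
print(f"{covered} of {vocab_size} words have a pre-trained vector")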

Time to define the model function. We will stack two bidirectional LSTM layers on top of the frozen GloVe embedding layer.

import tensorflow.compat.v1 as tf1
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM

tf1.disable_v2_behavior()

EMBEDDING_DIM = 300  # dimension of the GloVe vectors

def model_fn(features, labels, mode):

    # Frozen embedding layer initialised with the GloVe matrix
    layer = Embedding(vocab_size, EMBEDDING_DIM, weights=[embedding_matrix],
                      input_length=MAX_LENGTH, trainable=False)(features)
    layer = Bidirectional(LSTM(128, return_sequences=True))(layer)
    layer = Bidirectional(LSTM(128))(layer)
    logits = tf.keras.layers.Dense(units=2)(layer)
    
    predictions = {
      # Generate predictions (for PREDICT and EVAL mode)
      "classes": tf.argmax(input=logits, axis=1, name="classes"),
      "probabilities": tf.nn.softmax(logits, name="softmax_tensor")
    }
    
    if mode == tf.estimator.ModeKeys.PREDICT:
        return tf.estimator.EstimatorSpec(mode=mode, predictions=predictions)
    
    # Calculate loss
    loss = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits)    
    loss = tf.reduce_mean(loss)
    
    # Configure the Training Op (for TRAIN mode)
    if mode == tf.estimator.ModeKeys.TRAIN:
        optimizer = tf1.train.AdamOptimizer(learning_rate=0.001)
        train_op = optimizer.minimize(
            loss=loss,
            global_step=tf1.train.get_global_step())
        return tf.estimator.EstimatorSpec(mode=mode, loss=loss, train_op=train_op)
    
    # Add evaluation metrics Evaluation mode
    eval_metric_ops = {
        "accuracy": tf1.metrics.accuracy(
            labels=labels,
            predictions=predictions["classes"]
        )}
    
    return tf.estimator.EstimatorSpec(
        mode=mode, loss=loss, eval_metric_ops=eval_metric_ops)

Now we can build the estimator, train it and evaluate it.

# Create the Estimator
import tempfile
model_dir = tempfile.mkdtemp()
my_estimator = tf.estimator.Estimator(
    model_fn=model_fn, model_dir=model_dir)

# Set up training logging for predictions
tensors_to_log = {"probabilities": "softmax_tensor", "classes": "classes"}
logging_hook = tf.estimator.LoggingTensorHook(tensors=tensors_to_log, every_n_iter=100)

# Build specification 
train_spec = tf.estimator.TrainSpec(input_fn=train_input_fn, max_steps=500, hooks=[logging_hook])
eval_spec = tf.estimator.EvalSpec(input_fn=test_input_fn)

tf.estimator.train_and_evaluate(
    my_estimator,
    train_spec,
    eval_spec)

We observe an accuracy of 80%:

{'accuracy': 0.8, 'loss': 0.84808326, 'global_step': 500}
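
With the estimator trained, we can also run predictions on new reviews through the PREDICT branch of model_fn. Here is a minimal sketch, with made-up example sentences:

def predict_input_fn():
    samples = ["This product exceeded my expectations",
               "Terrible quality, it broke after one day"]
    # Reuse the same preprocessing pipeline as for training
    cleaned = [preprocess_text(s) for s in samples]
    sequences = tokenizer.texts_to_sequences(cleaned)
    padded = pad_sequences(sequences, padding='post', maxlen=MAX_LENGTH)
    return tf.data.Dataset.from_tensor_slices(padded).batch(BATCH_SIZE)

for pred in my_estimator.predict(input_fn=predict_input_fn):
    print(pred["classes"], pred["probabilities"])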

The Jupyter notebook is available on GitHub.