FastText sentiment analysis for tweets: A straightforward guide

The essentials of the fastText architecture, tweet cleaning, upsampling and sentiment analysis for tweets

In this post, we present the fastText library and how it achieves training speeds far higher than some deep neural networks while reaching similar accuracy for text classification. Next, we show how to train a sentiment analysis model on data generated with AWS Comprehend. In another article [4], we show how to use AWS Elastic Beanstalk to create a machine learning server to serve your model.

FastText — Shallow neural network architecture

FastText is an open-source NLP library developed by Facebook AI and initially released in 2016. Its goal is to provide efficient word embeddings and text classification. According to its authors, it is often on par with deep learning classifiers in terms of accuracy, and many orders of magnitude faster for training and evaluation [1].

This makes fastText an excellent tool to build NLP models and generate live predictions for production environments.

FastText architecture overview

The core of FastText relies on the Continuous Bag of Words (CBOW) model for word representation and a hierarchical classifier to speed up training.

Continuous Bag of Words (CBOW) is a shallow neural network that is trained to predict a word from its neighbors. FastText replaces the objective of predicting a word with predicting a category. These single-layer models train incredibly fast and can scale very well.

Also, fastText replaces the flat softmax over labels with a hierarchical softmax, in which each leaf of a tree represents a label. This reduces computation, as we no longer need to compute the probability of every label, and the model's limited number of parameters keeps training time low.

fastText hierarchical architecture for sentiment analysis.
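To make this concrete, here is a minimal, purely illustrative NumPy sketch of what such a classifier boils down to: the word vectors of a tweet are averaged into a single text vector, which a linear layer maps to label scores. The real fastText also uses n-gram features and a hierarchical softmax rather than the flat softmax shown here; all names and sizes below are made up.

import numpy as np

# Illustrative sizes only; these are not fastText's defaults
VOCAB_SIZE, EMBED_DIM, NUM_LABELS = 10000, 20, 3

rng = np.random.default_rng(0)
E = rng.normal(size=(VOCAB_SIZE, EMBED_DIM))   # word embedding lookup table
W = rng.normal(size=(EMBED_DIM, NUM_LABELS))   # linear classifier weights

def predict_label(word_ids):
    """Average the word vectors of a tweet and score each label."""
    text_vector = E[word_ids].mean(axis=0)          # CBOW-style averaging of word vectors
    scores = text_vector @ W                        # one score per label
    probs = np.exp(scores) / np.exp(scores).sum()   # flat softmax; fastText uses a hierarchical one
    return probs.argmax(), probs

label, probs = predict_label([12, 874, 4031])       # made-up token ids for a 3-word tweet
print(label, probs)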

Faster training but similar results

According to the initial paper [1], fastText achieves similar results to other algorithms while training a lot faster.

As you can see below, fastText training time is between 1 and 10 seconds versus minutes or hours for other models.

Bag of Tricks for Efficient Text Classification — Joulin 2016

Open dataset for sentiment analysis

Most open datasets for text classification are quite small, and we noticed that few, if any, are available for languages other than English. Therefore, in addition to providing a guide for sentiment analysis, we also want to share open datasets for sentiment analysis [2].

For these reasons we provide files with lists of tweets and their sentiments in:

  • English tweets dataset => 6.3 million tweets
  • Spanish tweets dataset => 1.2 million tweets
  • French tweets dataset => 250,000 tweets
  • Italian tweets dataset => 425,000 tweets
  • German tweets dataset => 210,000 tweets

These were generated with the AWS Comprehend API. For Spanish and French, tweets were first translated to English using Google Translate and then analyzed with AWS Comprehend. Sentiment is classified as either positive, negative, neutral, or mixed.
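As an illustration, labeling a single tweet this way can be done with boto3. Here is a minimal sketch, assuming AWS credentials are configured; this is not the exact script used to build the datasets.

import boto3

# Ask AWS Comprehend for the sentiment of one tweet (assumes configured AWS credentials)
comprehend = boto3.client("comprehend", region_name="us-east-1")
response = comprehend.detect_sentiment(Text="congratulations you played very well yesterday",
                                       LanguageCode="en")
print(response["Sentiment"])        # POSITIVE, NEGATIVE, NEUTRAL or MIXED
print(response["SentimentScore"])   # confidence score for each class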

For this article, we use the English tweets dataset.

Cleaning tweets for sentiment analysis

They say that cleaning is usually 80% of a data scientist’s time. Sadly there is no exception here. To obtain the best results, we have to make sure that the data is something close to proper English, and because we work on tweets, this is no easy task.

Example of (funny) misspelled tweets — source: thepoke.co.uk

Our goal is to clean tweets to make them easier for a machine to read. There are many techniques to clean text; the most famous are lemmatization, stemming and stop word removal.

  • The goal of both stemming and lemmatization is to reduce inflectional forms and derivationally related forms of a word to a common base form (e.g. am, are, is => be; dog, dogs, dog’s, dogs’ => dog). This reduces the corpus size and its complexity, allowing for simpler word embeddings (am, are and is then share the same word vector). A short example follows this list.
  • Stop word removal filters out common words that add noise or provide no value for a machine’s understanding of a text. Examples: a, and, the…
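As a quick illustration of the difference between stemming and lemmatization, here is a small sketch with NLTK (it assumes the wordnet corpus has been downloaded; nothing in it is specific to our pipeline):

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet")  # needed once for the lemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("dogs"))                 # dog
print(stemmer.stem("playing"))              # play
print(lemmatizer.lemmatize("dogs"))         # dog
print(lemmatizer.lemmatize("is", pos="v"))  # be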

While stemming and lemmatization help sentiment analysis, stop word filtering is not as straightforward. The goal of stop word removal is to drop unnecessary words, but if you look at the available stop word lists, for instance the one from the NLTK library, you will find words that convey negative sentiment, such as: not, don’t, hasn’t… For a sentiment analysis problem, however, we want to keep negative words: it is evident that “It is a good game” and “It is not a good game” carry opposite sentiments. Hence one either needs to edit the stop word list to exclude words that convey negative meaning, or not use stop words at all. We chose the latter; the sketch below shows what the first option could look like.
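A minimal sketch, assuming the NLTK stopwords corpus is available; the set of negations to keep is purely illustrative:

import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")  # needed once

# Illustrative set of negations we do NOT want to filter out
NEGATIONS = {"not", "no", "nor", "don't", "hasn't", "isn't", "wasn't", "shouldn't"}
custom_stop_words = [w for w in stopwords.words("english") if w not in NEGATIONS]

tweet = "it is not a good game"
filtered = " ".join(w for w in tweet.split() if w not in custom_stop_words)
print(filtered)  # "not good game" -- the negation survives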

Furthermore, tweets are short messages that contain loads of emojis, contractions, hashtags, misspelled words and slang. Most of these add noise for a machine and need to be cleaned or converted:

  • Contractions/slang cleaning. If we want to simplify our problem, we need to expand contractions and translate slang when there is an appropriate alternative. However, it is hard to find a library or database of words that does this, so we created our own list. Check my GitHub page [2] to see it.
# CONTRACTIONS is a dictionary mapping contractions and slang to their expanded form,
# e.g. { "you've": "you have", "luv": "love", ... }
tweet = tweet.replace("’", "'")  # normalize curly apostrophes
words = tweet.split()
reformed = [CONTRACTIONS[word] if word in CONTRACTIONS else word for word in words]
tweet = " ".join(reformed)
  • Fix misspelled words. Here we simply limit characters repeated more than twice to two occurrences (e.g. "goooood" => "good") using itertools. Dedicated libraries can truly detect and fix misspellings, but sadly they are quite slow, which is not acceptable in production when you have thousands of tweets to analyze every day.
import itertools
# Keep at most two consecutive occurrences of each character: "goooood" -> "good"
tweet = ''.join(''.join(s)[:2] for _, s in itertools.groupby(tweet))
  • Escaping HTML characters: The Twitter API sometimes returns HTML entities and encoded characters. When this happens we need to convert them back to plain text. For instance, %20 becomes a space and &amp; becomes &. To do this we use Beautiful Soup, a Python package for parsing HTML and XML documents.
from bs4 import BeautifulSoup
tweet = BeautifulSoup(tweet, "html.parser").get_text()
  • Removal of hashtags/accounts: Names referenced through Twitter hashtags (#) and account mentions (@) need to be removed. We wouldn’t want a football player’s name classified forever as “negative” by our model just because he has been associated with poor comments in our dataset.
import re
tweet = ' '.join(re.sub("(@[A-Za-z0-9]+)|(#[A-Za-z0-9]+)", " ", tweet).split())
  • Removal of web addresses:
tweet = ' '.join(re.sub(r"(\w+:\/\/\S+)", " ", tweet).split())
  • Removal of punctuation: Punctuation is of no use for “bag of words” techniques.
tweet = ' '.join(re.sub(r"[\.\,\!\?\:\;\-\=]", " ", tweet).split())
  • Lower case: Convert everything to lower case to avoid case-sensitivity issues:
#Lower case
tweet = tweet.lower()
  • Emojis/Smileys: In a tweet, emojis appear as Unicode characters and smileys as sequences of punctuation, so they are not tokenized correctly. To keep their meaning, we need to convert them to a simpler textual form. For emojis, the Python library “emoji” does exactly that by converting the emoji code to a label. For smileys like :-), you have to provide your own list, which we do on our GitHub page [2].
import emoji
# SMILEY is a dictionary mapping smileys to their meaning, e.g. {"<3": "love", ":-)": "smiley", ...}
words = tweet.split()
reformed = [SMILEY[word] if word in SMILEY else word for word in words]
tweet = " ".join(reformed)
# Emojis: convert each emoji to a text label, e.g. 🙂 -> :slightly_smiling_face:
tweet = emoji.demojize(tweet)
  • Strip accents: Of limited use for English but important for other languages, accents are often misplaced or forgotten, and the easiest way to deal with them is to get rid of them.
import unicodedata

def strip_accents(text):
    if 'ø' in text or 'Ø' in text:
        # Do nothing when finding ø
        return text
    # Decompose accented characters so only the accent is dropped ("café" -> "cafe")
    text = unicodedata.normalize('NFD', text)
    text = text.encode('ascii', 'ignore')
    text = text.decode("utf-8")
    return str(text)

To see everything tied together, please check the full code on my GitHub page [2].

Formatting the data

FastText needs labeled data to train the supervised classifier. Labels must start with the prefix __label__, which is how fastText distinguishes a label from a word. Below is an example of the required format for tweets with the labels POSITIVE and NEGATIVE.

__label__POSITIVE congratulations you played very well yesterday.
__label__NEGATIVE disappointing result today.
...

We use the code below to format the data.

import csv
import nltk

def transform_instance(row):
    cur_row = []
    # Prefix the label with __label__
    label = "__label__" + row[0]
    cur_row.append(label)
    # Clean the tweet and tokenize it
    cur_row.extend(nltk.word_tokenize(tweet_cleaning_for_sentiment_analysis(row[1].lower())))
    return cur_row

def preprocess(input_file, output_file, keep=1):
    with open(output_file, 'w') as csvoutfile:
        csv_writer = csv.writer(csvoutfile, delimiter=' ', lineterminator='\n')
        with open(input_file, 'r', newline='') as csvinfile:  # ,encoding='latin1'
            csv_reader = csv.reader(csvinfile, delimiter=',', quotechar='"')
            for row in csv_reader:
                if row[0].upper() in ['POSITIVE', 'NEGATIVE', 'NEUTRAL', 'MIXED']:
                    row_output = transform_instance(row)
                    csv_writer.writerow(row_output)

# Preparing the training dataset
preprocess('BetSentimentTweetAnalyzed.csv', 'tweets.train')

Upsampling to offset category imbalance

The category imbalance problem occurs when one label appears much more often than the others. In such a situation, classifiers tend to be overwhelmed by the large classes and ignore the small ones.

Applied to our dataset of English tweets [2], we notice an imbalance of neutral versus positive/negative classes. As a consequence, a primitive strategy of classifying everything as neutral would give an accuracy of 73% (see table below). For the same reason, our model might tend to favor neutral. If unmanaged, category imbalance would make our model simplistic and inaccurate.

Example of imbalance in labels

To deal with this, we use upsampling. Upsampling (or oversampling) consists of duplicating tweets from the minority classes, positive and negative here, until they reach the same number of tweets as the majority class, neutral. We provide simple code to do that below.

def upsampling(input_file, output_file, ratio_upsampling=1):
    # Create a file with an equal number of tweets for each label
    #   input_file: path to the input file
    #   output_file: path to the output file
    #   ratio_upsampling: ratio of each minority class vs the majority one.
    #                     1 means each class ends up with as many rows as the majority class.
    i = 0
    counts = {}
    dict_data_by_label = {}

    # GET LABEL LIST AND GET DATA PER LABEL
    with open(input_file, 'r', newline='') as csvinfile:
        csv_reader = csv.reader(csvinfile, delimiter=',', quotechar='"')
        for row in csv_reader:
            counts[row[0].split()[0]] = counts.get(row[0].split()[0], 0) + 1
            if not row[0].split()[0] in dict_data_by_label:
                dict_data_by_label[row[0].split()[0]] = [row[0]]
            else:
                dict_data_by_label[row[0].split()[0]].append(row[0])
            i = i + 1
            if i % 10000 == 0:
                print("read " + str(i))

    # FIND MAJORITY CLASS
    majority_class = ""
    count_majority_class = 0
    for item in dict_data_by_label:
        if len(dict_data_by_label[item]) > count_majority_class:
            majority_class = item
            count_majority_class = len(dict_data_by_label[item])

    # UPSAMPLE MINORITY CLASSES
    data_upsampled = []
    for item in dict_data_by_label:
        data_upsampled.extend(dict_data_by_label[item])
        if item != majority_class:
            items_added = 0
            items_to_add = count_majority_class - len(dict_data_by_label[item])
            while items_added < items_to_add:
                data_upsampled.extend(dict_data_by_label[item][:max(0, min(items_to_add - items_added, len(dict_data_by_label[item])))])
                items_added = items_added + max(0, min(items_to_add - items_added, len(dict_data_by_label[item])))

    # WRITE ALL
    i = 0
    with open(output_file, 'w') as txtoutfile:
        for row in data_upsampled:
            txtoutfile.write(row + '\n')
            i = i + 1
            if i % 10000 == 0:
                print("written " + str(i))

upsampling('tweets.train', 'uptweets.train')

With upsampling, you run the risk of overfitting by repeating the same tweets over and over. But if your dataset is big enough, this should not be an issue.

Training with fastText

Now the fun part. Time to train our machine for sentiments!

We use the fastText Python wrapper to train our model. You can find implementation examples and documentation on Facebook Research’s GitHub page [3]. Please make sure you install fastText from source with “git clone …” and not with “pip install fasttext”.

As we have already prepared our data, all we need to do now is call the function fastText.train_supervised. There are tons of options for this function, but for the sake of simplicity we focus on the following:

  • input: the path to our training data.
  • lr: Learning rate. We set it to 0.01.
  • epoch: Number of times we go through the entire dataset. We use 20.
  • wordNgrams: Maximum length of word n-grams, i.e. contiguous sequences of up to n words from the tweet. We set it to 2.
  • dim: Dimension of the word vectors. We use 20.

The following python code shows the training of our model.

hyper_params = {"lr": 0.01,
                "epoch": 20,
                "wordNgrams": 2,
                "dim": 20}

# Train the model.
model = fastText.train_supervised(input=training_data_path, **hyper_params)
print("Model trained with the hyperparameters \n {}".format(hyper_params))

Once trained, we need to assess how good our model is at sentiment analysis. For this, we can use the two measures precision and recall, which are the output of the fastText function model.test. However, due to the nature of our problem, precision and recall give similar figures, so we focus on precision only.

The code below runs model.test on the training and the validation data to compare the accuracy of our model. Note that for validation we use a different dataset, cleaned with the same process but not upsampled.
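For reference, assuming a separate CSV of labeled tweets was held out (the file names below are hypothetical), the validation file can be produced with the same preprocess function used earlier:

# Hypothetical held-out file, cleaned and formatted like the training data but NOT upsampled
preprocess('BetSentimentTweetValidation.csv', 'tweets.validation')

training_data_path = 'uptweets.train'      # here we assume training uses the upsampled file
validation_data_path = 'tweets.validation'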

# CHECK PERFORMANCE      
result = model.test(training_data_path)
validation = model.test(validation_data_path)

# DISPLAY ACCURACY OF TRAINED MODEL
text_line = str(hyper_params) + ",accuracy:" + str(result[1]) + ",validation:" + str(validation[1]) + '\n'
print(text_line)

Overall the model gives an accuracy of 97.5% on the training data, and 79.7% on the validation data.

Not so bad considering we did not tweak the hyperparameters. Furthermore, research estimates that people only agree around 60 to 80% of the time when judging the sentiment of a particular piece of text. So while we could try to reach 100% accuracy, we have to keep in mind that humans are fallible… and, most importantly, that we are working with tweets!
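Finally, once you are happy with the accuracy, getting a sentiment for a new tweet is a one-liner. A quick sketch, assuming the same cleaning function is applied first:

# Predict the sentiment of a new tweet (returns the top label and its probability)
tweet = tweet_cleaning_for_sentiment_analysis("Congratulations, you played very well yesterday!")
labels, probabilities = model.predict(tweet, k=1)
print(labels[0], probabilities[0])   # e.g. __label__POSITIVE and its probability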

Conclusion

We just showed how fastText works and how to train an English sentiment analysis model using sentiment data produced by AWS Comprehend. In another article [4], we explain how to serve your model with a robust cloud infrastructure, using AWS Elastic Beanstalk and a Python Flask application.

Should you want to reproduce the results, just go to my GitHub [2]. The full English dataset is too big for GitHub, so just ask me for it and I will be happy to share it.

References

[1] Bag of Tricks for Efficient Text Classification, Armand Joulin, Edouard Grave, Piotr Bojanowski, Tomas Mikolov, 2016

[2] https://github.com/charlesmalafosse. My GitHub page with full code for this article.

[3] Facebook GitHub with fastText python wrapper. https://github.com/facebookresearch/fastText/tree/master/python

[4] Deploy a machine learning model with AWS Elastic Beanstalk https://medium.com/@charlesmalafosse/deploy-a-machine-learning-model-with-aws-elasticbeanstalk-dfcc47b6043e