Modelling the Coronavirus discussion using BERT


Some Context

Unless you’ve been living under a rock that is lucky enough to be lying outside of the vast reach of COVID-19, you’ll be aware that the virus is taking the world by storm. At the time of writing, some 250,000 cases have been confirmed, with the death count surpassing 10,000 people. Mass congregations are discouraged, shops and restaurants are closing, countries are shutting their borders and working from home is mandatory, rather than a privilege. It’s a big deal.

It accounts for virtually all discussion in the media, enjoying priority over such topics as the 2020 US presidential election or the UK finally leaving the EU for good in less than 9 months. People are flooding social media with COVID information, which can only mean one thing: data. Fresh data waiting to be analysed. And analyse it we will.

The Why

Why analyse text data? What’s the imaginary business case here?

In the age of social media, when every individual has access to a platform on which to broadcast their views, it has never been easier to receive direct and instantaneous feedback from customers. Because people post their opinions online to be heard, organisations have not only a gift but also an obligation to utilise this and extract actionable insight from submissions posted by their customer base.

Social media data, however, is vast. Very vast. A medium-sized organisation would be hard pressed to keep tabs on, understand, summarise and present all their customers’ views, complaints and praises posted online, even if they hired an entire team to do so. And why would they, when they could use data science?

The What

We will attempt to uncover underlying topics in a snapshot of the Coronavirus discussion. To better understand how we’ll achieve this, let me take you on a journey of topic modelling history. Please save your applause for the end.

The Classics

Picture it. It’s the year 2003. Cristiano Ronaldo has just made his debut for Manchester United and Mike just proposed to Phoebe in Friends. Three computer scientists suggest using an algorithm that was previously pioneered in genetics, in the field of topic modelling. The algorithm in question is Latent Dirichlet Allocation, dubbed LDA because no one can pronounce the second word. The method uses a probabilistic approach to allocate documents into topics based on word co-occurrence. It’s a landslide success and the model is widely adopted by the fledgling NLP community.

Classical approaches to topic modelling, such as LDA or LSA, have been around for a while now. These are built on document-term matrix representations of text data and can work very effectively at relatively low cost. They do, however, lack the ability to capture any information about the position or order of words in the text, or about how similar words may be to each other.
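To make the document-term matrix idea concrete, here is a minimal illustration using only Python’s standard library (the two toy “documents” are made up for the example):

```python
from collections import Counter

# Two toy "documents" -- invented purely for illustration
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
]

# Build a shared vocabulary across the corpus
vocab = sorted({word for doc in docs for word in doc.split()})

# Each row is a document, each column a term count
dtm = [[Counter(doc.split())[term] for term in vocab] for doc in docs]

for doc, row in zip(docs, dtm):
    print(f"{doc!r} -> {row}")
```

Note that the matrix only records counts: “cat chased dog” and “dog chased cat” would produce identical rows, which is exactly the word-order information these classical methods discard.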

Let’s Get Vectoral

You blink — it’s now 2013, but you’re still watching TV. Leo DiCaprio raises his glass with a wry smile for the first time in The Great Gatsby, and Miley Cyrus comes in like a wrecking ball. Bruno Mars is there. Meanwhile, Tomas Mikolov is experimenting with improving the Google Search Engine using shallow neural networks, and finds that he is able to map words to an N-dimensional vector space that captures the meaning of words relative to each other in numerical positions. A light bulb pops into existence over his head, and he publishes the word2vec model, unleashing with it upon the world the plague of the word2word nomenclature (see doc2vec, node2vec, seq2seq, graph2vec…).

Word embedding approaches are a major step in numerical text representation. This family of techniques maps words in a corpus to vector space, most commonly using neural networks (e.g. word2vec or GloVe). One major benefit word embedding brings over the aforementioned probabilistic models is their ability to represent similarity between words, given context in the training data, as proximity in vector space; for example, cat and kitten will be much closer to each other than cat and carpentry. The approach has been widely used for years, but is still limited in that it can only map words to a single vector, unable to capture different meanings for the same word in various contexts.
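To make the “proximity in vector space” idea concrete, here is a toy sketch using made-up 3-dimensional vectors (real embeddings have hundreds of dimensions, and these numbers are invented purely for illustration):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: ~1 = similar direction, ~0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Invented toy embeddings -- NOT real word2vec output
cat = [0.9, 0.8, 0.1]
kitten = [0.85, 0.75, 0.2]
carpentry = [0.1, 0.2, 0.9]

print(cosine_similarity(cat, kitten))     # high: vectors point the same way
print(cosine_similarity(cat, carpentry))  # low: vectors point in different directions
```

Cosine similarity is the standard way to measure this kind of proximity, since it depends only on the direction of the vectors and not their magnitude.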

Transformers Assemble

The picture goes blurry again. When your vision clears, you notice that the date on your phone reads 11 October 2018. Corona means nothing more than a brand of beer to you. Google are still working tirelessly on maintaining their status as the Search Engine, and this time it’s Jacob Devlin to the rescue. He publishes a paper on using a Transformer-Encoder deep learning algorithm for text prediction tasks. He names it Bidirectional Encoder Representations from Transformers, which luckily abbreviates to BERT. It works well. Really well. No one is sure why.

BERT manages to one-up word embedding approaches by not only blowing them out of the park on many downstream NLP tasks, but, more importantly for us, by assigning different vector representations to the same word in different contexts. This is especially important for homonyms: the word address has an entirely different meaning when I give someone my address versus when I address someone, for example.

BERT has been covered on this site many times before, but if you haven’t come across it yet, I recommend taking a detour to read up about it. Jay Alammar’s blog provides a really good summary of it, and his dedication to illustrate the models using Sesame Street characters is formidable. Do come back though, because it’s about to get interesting.

We’ve now seen how far text representation has come (it has since gone further), so let’s put it to the test.

We will be using a sample of Tweets posted on the subject of the Coronavirus outbreak. We will then use BERT to represent these in vector space, using the average of their word embedding values. We can then postulate that, if words of similar meaning are closer to each other in vector space, we can group nearby Tweets together to find clusters of common topics. Our high level workflow will be:

  1. Collect data: search for Tweets on Coronavirus
  2. Pre-process data: carry out the usual text cleaning steps
  3. Embed documents: use BERT to find vector representations of each Tweet
  4. Reduce dimensionality: use PCA to decrease the size of our vectors while preserving variance
  5. Cluster embeddings: apply a clustering algorithm to find groups of Tweets with the same meaning
  6. Evaluate topics: try to make sense of what the topics are about

All with the ultimate aim of answering the question: What are people talking about on Twitter in relation to the Coronavirus?

The How

In this section I will run through my approach and share some of my code. You may skip this part if you’re only interested in the destination, not the journey.

Data collection and cleaning

I used the Twitter Search API to find Tweets on the 11th of March containing the words ‘COVID’ or ‘Coronavirus’, which netted me 17,998 Tweets in the English language from around the globe.

Let’s take a look at an example:

“@RepKinzinger I know you have a lot of incompetence to “overcome,” And you’ve been so “incredibly vigilant” in your Coronavirus monitoring. But would you tiki torch trumpists care to explain what agreements and acquiescence your party made to your “friends” in Moscow?

I ran the data through a series of pre-processing steps; I won’t go into detail, as this topic has been covered many times elsewhere. I turned everything to lower case; removed hyperlinks, mentions, non-alphanumeric characters and newlines; removed stopwords; and lemmatised the rest. The above text now looks like this:

“know lot incompetence overcome incredibly vigilant coronavirus monitoring would tiki torch trumpists care explain agreements acquiescence party make friends moscow”
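A minimal sketch of these cleaning steps using only the standard library. The stopword list here is a tiny stand-in (a real pipeline would use NLTK’s or spaCy’s list), and lemmatisation needs a dedicated library, so it is omitted:

```python
import re

# Tiny illustrative stopword list -- a real pipeline would use NLTK's or spaCy's
STOPWORDS = {"i", "you", "the", "a", "an", "and", "to", "of", "in", "have", "be"}

def clean_tweet(text):
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # strip hyperlinks
    text = re.sub(r"@\w+", " ", text)           # strip mentions
    text = re.sub(r"[^a-z0-9\s]", " ", text)    # strip non-alphanumerics
    text = re.sub(r"\s+", " ", text).strip()    # collapse whitespace and newlines
    return " ".join(w for w in text.split() if w not in STOPWORDS)

print(clean_tweet("@RepKinzinger I know you have a lot of incompetence!"))
# -> "know lot incompetence"
```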

Document Embedding

We will use the Flair Python library, a framework developed by Zalando Research built on PyTorch, to embed our Tweets using a combination of pre-trained word embedding models.

Note: I used Google Colab to embed the Tweets, which took roughly 30 minutes. Your mileage may vary, but if, like me, you don’t have a particularly powerful machine, I’d recommend making use of that free GPU access.

We’ll initialise the word embedding models:

!pip install flair  # install Flair on Google Colab
import torch
from flair.data import Sentence
from flair.embeddings import FlairEmbeddings, DocumentPoolEmbeddings, BertEmbeddings

# initialise embedding classes
flair_embedding_forward = FlairEmbeddings('news-forward')
flair_embedding_backward = FlairEmbeddings('news-backward')
bert_embedding = BertEmbeddings('bert-base-uncased')

# combine word embedding models
document_embeddings = DocumentPoolEmbeddings([bert_embedding, flair_embedding_backward, flair_embedding_forward])

This will give us a tensor of size (1, 7168) for each Tweet, so we’ll initialise an empty tensor of size (17998, 7168) and iteratively fill it with our document vectors:

from tqdm import tqdm

# set up empty tensor on the GPU
X = torch.empty(size=(len(df.index), 7168)).cuda()

# fill tensor with embeddings
for i, text in enumerate(tqdm(df['text_cl'])):
    sentence = Sentence(text)
    document_embeddings.embed(sentence)
    X[i] = sentence.get_embedding()

This will take some time, so grab a drink. Maybe do the dishes for once.

We now have a tensor with (17998, 7168) dimensions populated with embeddings for each Tweet. We are done with PyTorch at this point, so we’ll detach the tensor from the GPU and convert it to a NumPy array:

X = X.cpu().detach().numpy()

PCA and Clustering

We want to cluster these vectors into topics, and we’ll invoke Agglomerative Clustering with Ward linkage from scikit-learn to do so. Bottom-up hierarchical clustering algorithms have a memory complexity of O(n²), so we’ll use Principal Component Analysis to speed up this process. After all, we just finished watching a progress bar for 30 minutes.

As a side note, I did test a number of clustering algorithms (K-means, BIRCH, DBSCAN, Agglomerative with complete/average linkage), but Ward seemed to perform best in most cases. I attribute this to its ability to identify smaller fringe clusters: it does not seem hell-bent on splitting the data points into equal-sized groups, so it’s good at picking out underlying topics which do not necessarily correspond to the main discussion.

Let’s reduce the dimensionality of our vectors to length 768 — I picked this number somewhat arbitrarily, but BERT on its own produces vectors of this size, so it should be good enough for us, while also reducing the data size by roughly 90%.

from sklearn.decomposition import PCA
pca = PCA(n_components=768)
X_red = pca.fit_transform(X)
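To sanity-check how much information the compression keeps, we can inspect the explained_variance_ratio_ attribute that scikit-learn’s PCA exposes. The snippet below demonstrates this on random stand-in data, since the real embedding matrix isn’t reproducible here:

```python
import numpy as np
from sklearn.decomposition import PCA

# Random stand-in for the embedding matrix (the real X is 17998 x 7168)
rng = np.random.default_rng(42)
X_demo = rng.standard_normal((500, 64))

pca = PCA(n_components=16)
X_demo_red = pca.fit_transform(X_demo)

# Fraction of the total variance the reduced vectors retain
print(f"variance retained: {pca.explained_variance_ratio_.sum():.1%}")
```

On real, highly correlated embedding data the retained fraction is usually much higher than on random noise, which is what makes PCA a cheap win before clustering.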

We’ll initialise the algorithm with 10 clusters, fit our data and allocate the cluster labels to our main DataFrame:

from sklearn.cluster import AgglomerativeClustering

N_CLUSTERS = 10
ward = AgglomerativeClustering(n_clusters=N_CLUSTERS, linkage='ward')
pred_ward = ward.fit_predict(X_red)
df['topic'] = pred_ward

This yields the following topic distribution:

Distribution of Tweets by topic

We can see the benefits of our choice of clustering algorithm in action. Major topics, such as 0 and 3, were picked up, but we also managed to separate some fringe discussions like 5 and 8.

We can visualise the topic clusters in 2-D:

Topic clusters represented in 2-D

Top Terms

We have now allocated each of our Tweets to a topic, but how do we make sense of them? We will find the words and phrases (uni- and bi-grams) in each topic with the highest TF-IDF scores; that is, we will identify the terms which appear a lot in one topic but not in the others. To do this, we’ll use scikit-learn’s TfidfVectorizer() in a custom function. Because we are dealing with a few large documents (treating each topic as its own document), we’ll limit the maximum document frequency (max_df) to 50%, ensuring that extracted terms do not appear in more than half of the topic-documents. This step helps exclude very common words (like Coronavirus), which wouldn’t be very helpful in identifying the topics.

from sklearn.feature_extraction.text import TfidfVectorizer

def get_top_words(documents, top_n):
    """Get top TF-IDF words and phrases for each document."""
    vectoriser = TfidfVectorizer(ngram_range=(1, 2), max_df=0.5)
    tfidf_matrix = vectoriser.fit_transform(documents)
    feature_names = vectoriser.get_feature_names()
    df_tfidf = pd.DataFrame()
    for doc in range(len(documents)):
        words = []
        scores = []
        feature_index = tfidf_matrix[doc, :].nonzero()[1]
        tfidf_scores = zip(feature_index, [tfidf_matrix[doc, x] for x in feature_index])
        for w, s in [(feature_names[i], s) for (i, s) in tfidf_scores]:
            words.append(w)
            scores.append(s)
        df_temp = pd.DataFrame(data={'word': words, 'score': scores})
        df_temp = df_temp.sort_values('score', ascending=False).head(top_n)
        df_temp['topic'] = doc
        df_tfidf = df_tfidf.append(df_temp)
    return df_tfidf

We group our Tweets into their allocated topics to form long documents, then apply the above function to them to find the 10 most important terms in each topic:

topic_docs = []
# group text into topic-documents
for topic in range(N_CLUSTERS):
    topic_docs.append(' '.join(df[df['topic'] == topic]['text_cl'].values))
# apply function
df_tfidf = get_top_words(topic_docs, 10)

We’ll visualise the results: each chart represents a topic and its 10 most important terms. The longer the bar, the more representative the term:

Most representative terms in each topic

We can spot some topics around the economic stimulus package (topic 0), virus testing (topic 3) and sports (topic 9). We will do some more digging into the others later on.

Topic Compactness

How good are our topics? This is a non-trivial question, as we are using unsupervised techniques on live data (we don’t have any training sets). All we can do is compare the topics to each other. We’ll postulate that ‘good’ topics are more compact in vector space — i.e. their document vectors are closer to each other — than bad ones. To assess this, we will look at how close each Tweet vector is to the centroid of its respective topic.

We find the centroids of the vectors by averaging them across each topic:

import numpy as np

topic_centroids = []
for topic in tqdm(range(N_CLUSTERS)):
    X_topic = X_red[df.index[df['topic'] == topic]]
    X_mean = np.mean(X_topic, axis=0)
    topic_centroids.append(X_mean)

We then calculate the Euclidean distance of each Tweet vector to its respective topic centroid:

from scipy.spatial.distance import euclidean

topic_distances = []
for row in tqdm(df.index):
    topic_centroid = topic_centroids[df.iloc[row]['topic']]
    X_row = X_red[row]
    topic_distances.append(euclidean(topic_centroid, X_row))

df['topic_distance'] = topic_distances

We can visualise the distribution of distances to the topic centroid:

Distribution of document vectors to respective topic centroids

The closer the distribution sits to the left of the graph, the more compact the topic. Topics 3, 4, 6, 7 and 9 seem to be strong contenders; 8 is woefully spread out, indicating a lack of consistent content.

Topic Similarity

We looked at how similar Tweets are within each topic, but we can also look at how similar the topics are to each other. We will construct a Euclidean distance matrix between the 10 topic centroids to find the distance between the topic averages. The closer the averages, the more overlap we’d expect between the topics.

from scipy.spatial import distance_matrix

df_dist_matrix = pd.DataFrame(distance_matrix(topic_centroids, topic_centroids))
Distance matrix of topic centroids

The distance matrix shows the distance across all the topics. The darker the colour (and the lower the number) of a cell, the closer the topics corresponding to its row and column are. Topics 3 and 7, or 0 and 2, are quite close together; Topics 1 and 4 are very far from each other; topic 8, the black sheep of the family, is as far away from everyone as the others combined.

The Insight

Thanks to those who stuck it out with me during the previous section — it was an arduous task, but we have uncovered some useful information, so let’s recap.

What are the topics about?

The top terms provide some much needed context to the topics, allowing us to make very reasonable guesses as to what is (broadly) being discussed in each:

  • Topic 0 keywords: federal, president trump, tax, stimulus package, block, economic stimulus, pharmacies, federal health, agency classify, tell federal
    Most likely about: Trump’s Coronavirus stimulus package
  • Topic 1 keywords: hijack, hijack cells, coronavirus hijack, coronavirus classify, warmth hope, newmusic, warmth, covid19 covid2019, conference cancel, classify covid
    Most likely about: people sharing knowledge on how COVID ‘hijacks your cells’, and various events being cancelled
  • Topic 2 keywords: fauci coronavirus, worse trump, young unafraid, trump travel, unafraid coronavirus, stop kill, travel ban, good stop, people opinion, unafraid
    Most likely about: Trump’s travel bans
  • Topic 3 keywords: federal, testing, illness, tax, resources, energy, systems, labs, weaken, advocacy
    Most likely about: Coronavirus testing in the US
  • Topic 4 keywords: don know, american italy, lockdown horrific, haven, horrific, kinda, test set, set coronavirus, coronavirus isn, delay test
    Most likely about: Italy’s lockdown due to the outbreak
  • Topic 5 keywords: tweet covid, brand tweet, twitter suggest, suggest appropriate, appropriate ways, ways brand, 19 covid, 19 mattgsouthern, mattgsouthern, canadian officials
    Most likely about: Twitter, apparently, is giving advice to brands on how to post about Coronavirus
  • Topic 6 keywords: wanna, na, ya, nasty, bunch, aint, coronavirus ain, tp, warn house, ko
    Most likely about: this one isn’t obvious. I had to look at some examples to find out it’s non-news related general discourse on Coronavirus. (e.g. “If y’all nasty drunk girls would wash ya hands after crawling around the bathroom floor on a Saturday night we might not be in the predicament.” That’s a real Tweet.)
  • Topic 7 keywords: cybersecurity, pmp, pmp ppm, projectmanagement, agile, machinelearning, ppm projectmanagement, projectmanagement agile, agile cybersecurity, cybersecurity planning
    Most likely about: How to effectively manage a delivery team remotely in COVID lockdown!
  • Topic 8 keywords: home san, worth try, trump seriously, enjoy trump, begin enjoy, fuck begin, come fuck, face usa, portent possibly, scary portent
    Most likely about: …your guess is as good as mine. We have seen in the previous section that this is a weak topic.
  • Topic 9 keywords: men women, tournaments, ncaa men, daniele, daniele rugani, women, fan coronavirus, sign petition, play fan, premier league
    Most likely about: The effect of Coronavirus on world sports, e.g. the NCAA, the English PL or Daniele Rugani, the Italian footballer who reportedly tested positive for the virus.

How good are the topics?

  • The topics about testing (3), the Italian lockdown (4), the joke Tweets (6), the remote ways of working (7) and world sports (9) were the most compact, so we can assume they cover more coherent subjects than the others.
  • COVID testing (3) and remote ways of working (7) are closely related — my theory is that this is due to overlapping technical terms such as laboratory and cyber, respectively.
  • Trump’s COVID stimulus package (0) and his travel ban (2) are also closely related, for obvious reasons.
  • The biological workings of the virus (1) and the Italian lockdown (4) are furthest apart, semantically.

We have demonstrated the effectiveness of a less-travelled path to topic modelling using state-of-the-art language models by extracting some coherent topics from a large collection of real data. Our approach also allowed us to evaluate our topics’ relation to each other, which seems to coincide with our interpretation of them. I’d chalk this one up as a success.


Did I do something wrong? Could I have done something better? Did I do something well?

Please don’t hesitate to reach out to me on LinkedIn; I’m always happy to be challenged or just have a chat if you’re interested in my work.