Figure 1 illustrates tagged sentence samples of unsupervised NER performed using BERT (bert-large-cased) with no fine-tuning. The examples highlight just a few of the entity types tagged by this approach. Tagging 500 sentences yielded about 1000 unique entity types, of which a select few were mapped to the synthetic labels shown above. The bert-large-cased model is unable to distinguish between GENE and PROTEIN because descriptors for these entities fall within the same tail of the predicted distributions for masked terms (they are not distinguishable in the base vocabulary either). Distinguishing them may require MLM fine-tuning on a domain-specific corpus or, in some instances, pre-training a model from scratch using a custom vocabulary (examined in detail below).


In natural language processing, identifying entities of interest in a sentence, such as person, location, or organization (named entity recognition, NER), typically requires labeled data. We need sentences labeled with entities of interest, where the labeling of each sentence is done either manually or by some automated method (often using heuristics to create a noisy/weakly labeled data set). These labeled sentences are then used to train a model to recognize those entities as a supervised learning task.

This post describes an approach to unsupervised NER. Entities are recognized without any labeled sentences, using a BERT model that has only been trained unsupervised on a corpus with the masked language model objective.

How does this work?

If we are asked the entity type of a term (term refers to both words and phrases in this post) we have never seen before, we can guess it by just how the term sounds and/or from the sentence structure the term appears in. That is,

  • A term’s subword structure offers a clue to its entity type.
Nonenbury is a _____
This is a fabricated city name, but we could guess it may be a location given the suffix “bury”. Here the term suffix gives us a clue even though the sentence context offers no other clue about the entity type.
  • Sentence structure offers a clue to a term’s entity type.
He flew from _____ to Chester
Here the sentence context gives us a clue that the unknown term is a location. We could guess that any term in the blank position is likely to be a location even without having seen it before (e.g. Nonenbury).

BERT’s masked language model (MLM) head can predict the masked words above because of its training objective described earlier: it learns by predicting words that have been blanked out in a sentence. This learning is then used during inference to output a prediction for a masked term in a sentence, where the prediction is a probability distribution over BERT’s fixed vocabulary of words. This output distribution has a distinct but small tail (< ~0.1% of total mass) where words capturing the context sensitive entity type of a term reside. This tail is the context sensitive signature of a term. For instance, the context sensitive signature for the masked position in the sentence below is

Nonenbury is a _____
Predictions: village hamlet town settlement parish farm place river township location
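
As a rough illustration, reading such a signature off an MLM output amounts to a softmax over the vocabulary followed by a top-k selection. The sketch below uses made-up logits over a six-entry toy vocabulary rather than real bert-large-cased output; the function name, the vocabulary and the numbers are all hypothetical.

```python
import numpy as np

def context_sensitive_signature(logits, vocab, k=10):
    """Return the k most probable vocabulary terms for one masked position."""
    logits = np.asarray(logits, dtype=float)
    # Numerically stable softmax over the whole vocabulary.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # Highest-probability entries first.
    top = np.argsort(-probs)[:k]
    return [(vocab[i], float(probs[i])) for i in top]

# Toy stand-in for the MLM head's output at the masked position (assumed values).
toy_vocab = ["village", "hamlet", "town", "the", "##ing", "ran"]
toy_logits = [5.0, 4.5, 4.0, 0.5, -1.0, 0.0]

signature = context_sensitive_signature(toy_logits, toy_vocab, k=3)
for term, prob in signature:
    print(f"{term}\t{prob:.3f}")
```

With a real model, the logits would come from the MLM head at the masked position and the vocabulary would be BERT’s 28,996 entries.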

Nearly 45% of BERT’s fixed vocabulary of words (28,996 for bert-large-cased) serves as a universal set of descriptors (e.g. common nouns, pronouns etc.). Subsets (possibly overlapping) of these descriptors characterize the entity type of a term independent of its sentence context. These subsets are the context independent signatures of terms. The context independent subsets in BERT’s vocabulary that capture entity types close to the context sensitive signature above are

['villages', 'towns', 'village', 'settlements', 'villagers', 'communities', 'cities']
['city', 'town', 'City', 'cities', 'village']
['settlement', 'settlements', 'Settlement']
['Township', 'townships', 'township']
['parish', 'Parish', 'parishes']
['neighborhood', 'neighbourhood', 'neighborhoods']
['castle', 'castles', 'Castle', 'fortress', 'palace']
['forest', 'forests', 'Forest', 'woods', 'woodland', 'rainforest']

A closest match function in the embedding space of BERT’s vocabulary between the m terms {B1, B2, B3, …, Bm} constituting the context sensitive signature and the n sets of terms {{C11, C12, C13, …, C1k1}, {C21, C22, C23, …, C2k2}, …, {Cn1, Cn2, Cn3, …, Cnkn}} constituting the context independent signatures (set i holding ki terms) yields the NER label for a term (see Figure 4 below).

Because the number of context independent signatures we can automatically harvest from BERT’s vocabulary is in the thousands (~6000 for bert-large-cased), this approach allows us to perform unsupervised entity recognition for a large number of entity types at a fine-grained level without the need for labeled data.

The unsupervised NER approach described above works largely because, as examined in this article:

  • BERT’s raw word embeddings capture useful and separable information about a term (distinct histogram tails containing less than 0.1% of the vocabulary) using other words in BERT’s vocabulary.
  • This information can be harvested from both the raw embeddings and their transformed versions after they pass through BERT with a masked language model (MLM) head.

Steps for performing unsupervised NER

A one-time offline processing step creates a mapping from each of the context independent signature sets harvested from BERT’s vocabulary to a single descriptor/label. The subsequent steps are then performed to label terms in each input sentence.

Offline one time processing

Step 1. Filter BERT’s vocabulary to pick context sensitive signature terms

BERT’s cased vocabulary is a mixture of common nouns, proper nouns, subwords and symbols. Lower case terms are largely common nouns, pronouns etc. and could serve as descriptors characterizing an entity type. This subset contains about 13,000 terms — 45% of BERT’s vocabulary. These lower case terms are common across both cased and uncased models. However we will not be using uncased models because cased models take advantage of casing in the predictions, which helps boost entity prediction performance.
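
A minimal sketch of this filtering step, under the assumption that "lower case terms" means whole-word entries made only of ASCII lower case letters (the source does not spell out the exact rule). The function name and the toy vocabulary are hypothetical; a real run would read the entries of the bert-large-cased vocabulary file instead.

```python
import re

# Assumed rule: keep entries that consist solely of ASCII lower case letters.
# This drops subwords ("##bury"), special tokens ("[CLS]"), capitalized
# entries ("City"), digits and symbols.
_LOWER_WORD = re.compile(r"[a-z]+")

def descriptor_candidates(vocab):
    """Return the vocabulary entries that could serve as entity descriptors."""
    return [term for term in vocab if _LOWER_WORD.fullmatch(term)]

toy_vocab = ["[CLS]", "city", "City", "##bury", "town", "!", "2019", "he"]
descriptors = descriptor_candidates(toy_vocab)
print(descriptors)
```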

Step 2. Generate context independent signatures from BERT’s vocabulary

Iterate through all terms in BERT’s vocabulary and, for each term, pick the terms from its tail that lie above a similarity threshold, where the threshold is chosen based on the average number of elements in the tail. For the bert-large-cased model, on average about 0.1% of the vocabulary terms lie in a term’s tail above a cosine similarity threshold of 0.5. Treat the terms in the tail of a word as a complete graph where the edge strengths are cosine similarity values.

Consider all possible bipartite graphs drawn from this complete graph with a single node in one set (in the general case there can be more than one node in the first set) and the rest in the second, and pick the bipartite graph with the maximum strength.

Figure 2. Finding the pivot node(s) in a complete graph. In the complete graph above, the bipartite graph with “smoothly” in one set and the rest in the second, is the one with the maximum strength. So “smoothly” is the pivot node for this graph. The general case of this is a bipartite graph where there are multiple pivot nodes in set 1, as opposed to just one.

This bipartite graph serves as a context independent signature with the single node as pivot. Once a term is picked as part of a subgraph, it is not considered as a pivot candidate. However, it could be an element of multiple sets.

airport 0.6 0.1 Airport airport airports Airfield airfield
stroking 0.59 0.07 stroking stroked caressed rubbing brushing caress
Journalism 0.58 0.09 Journalism journalism Journalists Photography
smoothly 0.52 0.01 smoothly neatly efficiently swiftly calmly

In the example signatures above, the two numerical values are the mean and the standard deviation of the subgraph edge strengths. The first column term serves as the pivot term representative of that signature. These terms serve as the entity labels. They can be manually mapped (a one-time operation) to a synthetic label such as person, location etc., if required by our application. In many instances the pivot terms themselves could directly serve as descriptors without any need for manual mapping. Also, we do not have to map all 6000+ sets to synthetic labels. We only need to map those sets that are representative of the entity types relevant to our specific application. The rest can be mapped to the synthetic label “other/misc”.

This signature generation yields ~6000 sets, with an average cardinality of around 4 and a standard deviation of 7. About 5000 vocabulary terms (17% of the vocabulary) are singleton sets and are ignored. These values will change if the threshold to pick a set is changed. A threshold of 0.4 would increase the total tail mass to 0.2% and would also increase the cluster averages (but the sets start to become noisy).

Figure 3. Context independent signature sets stats for BERT (bert-large-cased). The mean cardinality is ~4 with a standard deviation of 7. About 17% of BERT vocabulary terms are singletons.
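
The tail selection and pivot search of Step 2 can be sketched roughly as follows. This is an illustration under stated assumptions, not the author’s implementation: the 2-D embeddings are invented, the strength of a single-pivot bipartite graph is taken to be the mean cosine similarity between the pivot and the other nodes, and the rule that excludes already-used terms from later pivot selection is left out. All function names are hypothetical.

```python
import numpy as np

def cosine_matrix(emb):
    """Pairwise cosine similarities between the rows of emb."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    return unit @ unit.T

def tail(sim, idx, threshold=0.5):
    """Indices of terms whose similarity to term idx is >= threshold (idx included)."""
    return np.flatnonzero(sim[idx] >= threshold)

def pivot_signature(sim, members):
    """Pick the node whose star graph to the other members is strongest.

    Returns (pivot index, mean edge strength, std of edge strengths),
    or None for a singleton set.
    """
    n = len(members)
    if n < 2:
        return None
    sub = sim[np.ix_(members, members)]
    # Mean similarity of each node to all other nodes (self-similarity removed).
    strengths = (sub.sum(axis=1) - np.diag(sub)) / (n - 1)
    p = int(np.argmax(strengths))
    edges = np.delete(sub[p], p)
    return int(members[p]), float(edges.mean()), float(edges.std())

# Invented 2-D embeddings; real ones would be BERT's raw word embeddings.
terms = ["airport", "Airport", "airports", "river"]
emb = np.array([[1.0, 0.0], [0.9, 0.3], [0.9, -0.3], [0.0, 1.0]])

sim = cosine_matrix(emb)
members = tail(sim, terms.index("airport"))
pivot, mean, std = pivot_signature(sim, members)
print(terms[pivot], round(mean, 2), round(std, 2), [terms[i] for i in members])
```

The printed line has the same layout as the example signatures above: pivot term, mean, standard deviation, then the members of the set.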

Entity prediction for each input sentence

Step 3. Minimally preprocess input sentence

Given an input sentence to tag, only minimal preprocessing is done. Casing is normalized: sentences in all caps (which typically occur in document titles) are transformed to lower case with the casing of the first letter of each word preserved. This helps improve the accuracy of detecting phrase spans in the next step.

He flew from New York to SFO
He flew from New York to Sfo
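
A minimal sketch of this normalization, assuming it is applied word by word: any word of more than one character written entirely in capitals keeps its first letter and has the rest lowercased. That reading reproduces the example above, but the per-word rule and the function name are assumptions.

```python
def normalize_casing(sentence):
    """Lowercase all-caps words except for their first letter."""
    normalized = []
    for word in sentence.split():
        # Single letters such as "I" or "A" are left untouched.
        if len(word) > 1 and word.isupper():
            word = word[0] + word[1:].lower()
        normalized.append(word)
    return " ".join(normalized)

print(normalize_casing("He flew from New York to SFO"))
print(normalize_casing("UNSUPERVISED NER USING BERT"))
```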

Step 4. Identify phrase spans in sentence

A POS tagger (ideally one trained to also handle sentences written entirely in lower case) is used to tag the input sentence. These tags are used to identify phrases as well as to capitalize the first letter of each noun phrase.

He flew from New York to Sfo

The terms tagged as noun forms above are New York and Sfo. BERT’s masked word prediction is very sensitive to capitalization, so using a good POS tagger that reliably tags sentences even when they are entirely in lower case is key to tagging performance. For instance, the masked prediction for the sentence below changes entity sense when the capitalization of just one letter in the sentence changes

Elon Musk is a ____
Predictions: politician musician writer son student businessman biologist lawyer painter member
Elon musk is a ____
Predictions: brand beer common popular beverage variant company bar red standard

As an aside, the masked predictions of BERT are only reliable for detecting entity types (person and standard in the examples above), not for factually accurate predictions, even though BERT may occasionally make factually accurate predictions. Also, the first prediction (person) is the stronger one, given that its maximal strength bipartite graph has a mean of 0.34 and a standard deviation of 0.09. In contrast, the second prediction (standard) has a mean of 0.18 and a standard deviation of 0.08, indicating it is a weak prediction.
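
The phrase-span part of Step 4 can be sketched as grouping consecutive noun-tagged tokens. The sketch assumes Penn Treebank style tags (noun tags start with "NN") and uses hand-assigned tags for the example sentence; in practice a POS tagger would supply them. The capitalization of noun phrases is left out, and the function name is hypothetical.

```python
def noun_phrase_spans(tagged):
    """Group runs of consecutive noun-tagged tokens into (start, end) spans.

    tagged is a list of (token, pos_tag) pairs; spans are half-open.
    """
    spans, start = [], None
    for i, (_, tag) in enumerate(tagged):
        is_noun = tag.startswith("NN")
        if is_noun and start is None:
            start = i                      # a noun run begins
        elif not is_noun and start is not None:
            spans.append((start, i))       # the run ended at the previous token
            start = None
    if start is not None:                  # run reaches the end of the sentence
        spans.append((start, len(tagged)))
    return spans

# Hand-assigned tags (an assumption, not tagger output).
tagged = [("He", "PRP"), ("flew", "VBD"), ("from", "IN"),
          ("New", "NNP"), ("York", "NNP"), ("to", "TO"), ("Sfo", "NNP")]

spans = noun_phrase_spans(tagged)
phrases = [" ".join(tok for tok, _ in tagged[s:e]) for s, e in spans]
print(spans, phrases)
```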

Step 5. Use BERT’s MLM head to predict each masked position

For each noun term in a sentence, generate a sentence with that term masked. Use BERT’s MLM head to predict context sensitive signatures for the masked position.

He flew from __ to Sfo
Predictions: ['town', 'home', 'city', 'earth', 'heaven', 'north', 'thence', 'east', 'airport', 'village']
He flew from New York to ___
Predictions: ['wherever', 'home', 'found', 'town', 'before', 'were', 'abroad', 'jail', 're', 'each']
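
Generating the masked sentences for this step can be sketched as below. Following the examples above, a multi-word phrase is replaced by a single mask; "[MASK]" is BERT’s mask token. The token list and spans are written out by hand here, and the function name is hypothetical. Each masked sentence would then be passed to the MLM head.

```python
def masked_variants(tokens, spans, mask_token="[MASK]"):
    """Return (phrase, masked sentence) pairs, one per phrase span.

    spans are half-open (start, end) token index pairs.
    """
    variants = []
    for start, end in spans:
        phrase = " ".join(tokens[start:end])
        # The whole phrase collapses into one mask token.
        masked = tokens[:start] + [mask_token] + tokens[end:]
        variants.append((phrase, " ".join(masked)))
    return variants

tokens = "He flew from New York to Sfo".split()
noun_spans = [(3, 5), (6, 7)]  # assumed output of the phrase-span step

variants = masked_variants(tokens, noun_spans)
for phrase, sentence in variants:
    print(f"{phrase!r}: {sentence}")
```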

Step 6. Find close match between context sensitive and context independent signatures

A simple close match function that yields reasonable results is to pick just one pivot node of the context sensitive signature, take its dot product with all the 6000+ pivots in the context independent signature set, and sort the results to get entity tag candidates. Instead of just the top pivot, we could take the top k pivots to improve the confidence of tagging/prediction.

Figure 4. Close match between the context sensitive signature and the context independent signatures. (a) The simplest implementation is a dot product between the pivot node of the context sensitive signature and the pivots of the sets in the context independent signatures. A better implementation is to choose how many nodes of the context sensitive signature to treat as pivots in the bipartite graph, based on the mean and standard deviation of its nodes. (b) shows the case when that count is 2. Using all the nodes in the context sensitive signature in the computation is unlikely to yield good results, given that the standard deviation is, on average, much higher among the context sensitive nodes. This is perhaps because the context sensitive signature, when evaluated in the embedding space, spreads over a larger region.
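
A minimal sketch of the dot product matching in Step 6. The 3-D vectors are invented stand-ins for BERT embeddings, and summing the scores over the top k signature terms is one possible way to combine them; the source does not specify how the k pivots are aggregated. The function name and labels are hypothetical.

```python
import numpy as np

def rank_entity_tags(signature_vecs, pivot_vecs, pivot_labels, k=1, top_n=3):
    """Rank context independent pivots against the top-k signature terms.

    signature_vecs: embeddings of the context sensitive signature terms,
                    ordered from strongest to weakest prediction.
    pivot_vecs:     embeddings of the context independent pivots.
    """
    query = np.asarray(signature_vecs, dtype=float)[:k]      # shape (k, d)
    pivots = np.asarray(pivot_vecs, dtype=float)             # shape (n, d)
    # Dot product of every pivot with each query term, summed over the k terms
    # (the summation is an assumption).
    scores = (pivots @ query.T).sum(axis=1)
    order = np.argsort(-scores)[:top_n]
    return [(pivot_labels[i], float(scores[i])) for i in order]

# Invented embeddings for illustration only.
pivot_labels = ["city", "person", "river"]
pivot_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
signature_vecs = [[0.9, 0.1, 0.2],   # e.g. "town"
                  [0.5, 0.4, 0.1]]   # e.g. "home"

ranked_k1 = rank_entity_tags(signature_vecs, pivot_vecs, pivot_labels, k=1)
ranked_k2 = rank_entity_tags(signature_vecs, pivot_vecs, pivot_labels, k=2)
print(ranked_k1)
print(ranked_k2)
```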

The tag predictions using just the top pivot are shown below. The raw tags are shown rather than the synthetic labels they are mapped to, which in this case is location for both predictions.

He flew from __ to Sfo
Pivot node: town
Tags: city, Town, community, township, country, are, Village, church, school, settlement
He flew from New York to __
Pivot node: home
Tags: home, house, homes, family, there, first, school, the, residence, work

Evaluation results

This section needs to be updated with a confusion matrix once a test dataset is created to evaluate the fine-grained and coarse-grained tagging capabilities.

Limitations and challenges of this approach

Corpus bias

While single entity predictions illustrate the model’s capability to interpret entity types from subword information, in practice they can only be used in conjunction with sentences that have more than one entity type. Single entity sentences without much context are sensitive to corpus biases, as illustrated below for the Google and Facebook predictions.

Facebook is a __
Predictions: joke monster killer friend story person company failure website fault
Microsoft is a __
Predictions: company website competitor people friend player winner person brand story
Google is a __
Predictions: friend website monster company killer person man story dog winner

Ambiguous entity predictions

A challenge for this approach is sentences that allow different entity types to fill a masked position. For instance, while predicting the entity type of New York in the sentence below

He felt New York has a chance to win this year's competition

the entity prediction for the masked word could be a word implying a person, which is a valid prediction as illustrated below

He felt __he____ has a chance to win this year's competition

The ambiguity arises by virtue of the masking, and in most cases it can be resolved by determining the entity type of the masked term itself, here New York.

New York is a _____
Predictions: city town place capital reality square country dream star model

However, in some instances even the term being masked is ambiguous, making determination of the entity type challenging. For instance, if the original sentence was

He felt Dolphins has a chance to win this year's competition

Dolphins could be a music group or a sports team.

These challenges could be addressed to a large degree by multiple approaches:

  • Fine-tuning a model on a domain-specific corpus can help reduce ambiguity in domain-specific entity types. For instance, BRAF (which is a gene) does not have the gene sense in its signature, whereas the gene sense is present in the fine-tuned model.
BRAF is a _____
Prediction: standard variant name version world company worldwide common variable year
In a model fine tuned on a biomedical corpus, 
BRAF is a _____
Prediction: protein gene kinase structural family reaction functional receptor molecule viral
  • In some instances, pre-training a model from scratch with a custom vocabulary may not only help resolve entity ambiguity but also boost performance. For instance, BERT’s default vocabulary is rich in full words and subwords for detecting entity types like person, location etc. However, it is deficient in capturing terms in the biomedical domain. For instance, the tokenization of drugs like imatinib, nilotinib, and dasatinib does not take the common subword “tinib” into account. Imatinib is tokenized into I ##mat ##ini ##b whereas dasatinib is tokenized into Das ##ati ##ni ##b. If we create our own vocabulary using sentencepiece on a biomedical corpus, we get im ##a ##tinib and d ##as ##a ##tinib, which capture the common suffix. Also, the custom vocabulary contains full words from the biomedical domain that capture its characteristics better.
Token:           imatinib             dasatinib
BERT (default):  i ##mat ##ini ##b    das ##ati ##nib
Custom:          im ##a ##tinib       d ##as ##a ##tinib
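
To make the effect of the vocabulary concrete, here is a small sketch of greedy longest-match-first subword tokenization, the scheme WordPiece uses at inference time. The two vocabularies are tiny hand-built sets containing only the pieces quoted above (lower-cased); they are assumptions for illustration, not the real BERT vocabulary or an actual sentencepiece model.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first tokenization of a single word."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate   # continuation pieces carry "##"
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:                      # nothing matched at this position
            return [unk]
        pieces.append(piece)
        start = end
    return pieces

# Toy vocabularies (assumed), built from the pieces shown in the table.
default_like = {"i", "##mat", "##ini", "##b", "das", "##ati", "##nib"}
custom_like = {"im", "d", "##a", "##as", "##tinib"}

for drug in ("imatinib", "dasatinib"):
    print(drug, wordpiece(drug, default_like), wordpiece(drug, custom_like))
```

With the custom-style vocabulary both drug names end in the same piece, ##tinib, so the shared suffix is visible to the model; with the default-style vocabulary their final pieces differ.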

Related work/References

This 2018 paper performs entity recognition using distant supervision. Fine-grained labels are crowdsourced to train the model.
This paper performs fine-grained entity typing for over 10,000 free-form types using a supervised multi-label classification model.

Examining BERT’s raw embeddings

This article was manually imported from Quora