Machine learning without labels using Snorkel
There is a certain irony that machine learning, a tool used for the automation of tasks and processes, often starts with the highly manual process of data labelling.
The task of creating labels to teach computers new tasks is quickly becoming the blue-collar job of the 21st century.
It has created complex supply chains which often end in lower income countries such as Kenya, India and the Philippines. Whilst this new industry is creating thousands of jobs, workers can be underpaid and exploited.
The market for data labelling passed $500 million in 2018 and is forecast to reach $1.2 billion by 2023. Data labelling is estimated to account for 80 percent of the time spent building A.I. technology.
The ability, therefore, to automate the process for creating data labels is highly desirable from a cost, time and even ethical standpoint.
In this post, we will explore how this can be achieved through a worked example in Python using the excellent Snorkel library.
Snorkel is built on a really innovative concept: create a series of ‘messy’ label functions and combine these in an intelligent way to build labels for a data set. These labels can then be used to train a machine learning model in exactly the same way as in a standard machine learning workflow. Whilst it is outside the scope of this post, it is worth noting that the library also helps with augmenting training sets and with monitoring key areas of a dataset to ensure a model is trained to handle these effectively.
Snorkel itself has been around since 2016 but has continued to evolve. It is now used by many of the big names in the industry (Google, IBM, Intel). Version 0.9 in 2019 brought with it a more sophisticated way of building a label model, as well as a suite of well documented tutorials covering all of the key areas of the software. Even if you have come across it before, with these updates it is worth a second look.
When to use Snorkel
Before we get started with a worked example, it is worth considering when you should use this library over a traditional (manual) approach of creating labels. If the answer to all of the below is ‘Yes’ then it would be worth considering Snorkel:
- You have a data set with no labels or an incomplete set of labels.
- It will take significant time & effort to label the data set manually.
- You have domain knowledge of the data (or can work closely with someone who has).
- You can think of one or more simplistic functions which could be used to split the data into different classes (for example, by using a key word search, or setting a particular threshold on a value).
How does Snorkel work?
The process for using Snorkel is a simple one:
- 🏅[Optional] Create a small subset of ‘golden’ labels for items within the dataset. This is helpful for reviewing the performance of the final model but is not essential; the alternative is ‘eyeballing’ the results to understand how the model performs.
- ⌨ Write a series of ‘Label Functions’ which define the different classes across the training data.
- 🏗 Build a label model and apply this to the dataset to create a set of labels.
- 📈 Use these labels in your normal machine learning pipeline (i.e. use the labels produced to train a model).
This process is iterative and you will find yourself evaluating the results and re-thinking and refining the label functions to improve the output.
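The steps above can be sketched in plain Python, without Snorkel, to show the shape of the workflow. Everything in this sketch is illustrative: the toy records, the two label functions and the simple majority vote are assumptions, and the majority vote is only a stand-in for Snorkel's label model, which is more sophisticated.

```python
from collections import Counter

ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1

# Hypothetical toy records; in practice these would be rows of your dataset.
records = [
    {"text": "framework agreement for office supplies"},
    {"text": "one-off purchase of office chairs"},
    {"text": "grounds maintenance services"},
]

# Label functions vote for a class or abstain.
def lf_mentions_framework(record):
    return POSITIVE if "framework" in record["text"] else ABSTAIN

def lf_mentions_one_off(record):
    return NEGATIVE if "one-off" in record["text"] else ABSTAIN

label_functions = [lf_mentions_framework, lf_mentions_one_off]

def majority_vote(votes):
    """Combine the votes for one record, ignoring abstains."""
    counts = Counter(v for v in votes if v != ABSTAIN)
    if not counts:
        return ABSTAIN
    (top, top_n), *rest = counts.most_common()
    # Abstain on ties rather than pick a class arbitrarily.
    if rest and rest[0][1] == top_n:
        return ABSTAIN
    return top

# One row per record, one column per label function.
label_matrix = [[lf(r) for lf in label_functions] for r in records]
labels = [majority_vote(row) for row in label_matrix]
print(labels)
```

The third record is not picked up by either function, so it stays at -1. Records like this are the ones that get filtered out before training.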
A worked example
Let's take a real-life problem to show how Snorkel can be used in a machine learning pipeline.
We will be trying to split out ‘frameworks’ from ‘contracts’ in an open source commercial dataset (from Contracts Finder, a UK transparency system which logs all Government contracts above £10k).
What is a framework?
A framework can be thought of as a ‘parent agreement’. It is a way of settling the ‘Ts and Cs’ with one or more suppliers, which then allows contracts to be agreed without having to go through the paperwork all over again.
The issue is, because there is a parent-child relationship between frameworks and contracts, this can lead to double counting when analysing the data. It is therefore important to be able to split out frameworks from contracts.
The data in this example consists of a contract title, description and value. The below shows an example of what we have to work with:
We have a placeholder column called ‘framework’ which we will be using to add our labels. The naming convention we will use is:
1 = Framework
0 = Not Framework
-1 = Abstain (i.e. not sure!)
Creating our first label function:
We will start by creating a series of label functions. These can essentially be any standard Python function and can be as simple (or as complex) as you need them to be.
We will start with a simple keyword search on the data set. This example searches for the phrase:
“use by uk public sector bodies”
as this is only likely to occur in the descriptions of frameworks. Snorkel makes this really simple: all you have to do is wrap a standard Python function with the decorator @labeling_function():
from snorkel.labeling import labeling_function
@labeling_function()
def ccs(x):
    return 1 if "use by uk public sector bodies" in x.desc.lower() else -1
Great, we have just created our first label function! We can now build up a number of other functions which will help separate frameworks from contracts.
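Other label functions follow the same pattern. As a sketch, here are two more, written as plain Python functions so they can be tried without Snorkel installed; decorating them with @labeling_function() would turn them into Snorkel label functions. The column names title and value, the keyword and the threshold are all hypothetical and chosen only for illustration, not taken from the analysis behind this post.

```python
from types import SimpleNamespace

FRAMEWORK, NOT_FRAMEWORK, ABSTAIN = 1, 0, -1

# Hypothetical threshold, chosen only for illustration.
LOW_VALUE_THRESHOLD = 25_000

def title_mentions_framework(x):
    # Keyword search on the title (assumed column name: title).
    return FRAMEWORK if "framework" in x.title.lower() else ABSTAIN

def low_value(x):
    # Threshold on the contract value (assumed column name: value).
    return NOT_FRAMEWORK if x.value < LOW_VALUE_THRESHOLD else ABSTAIN

# A stand-in for one row of the dataframe.
row = SimpleNamespace(title="Stationery Framework", desc="...", value=2_000_000)
print(title_mentions_framework(row), low_value(row))
```

Note that each function only votes when its rule fires and abstains otherwise, which leaves the other functions free to label the remaining rows.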
Tips on creating effective labelling functions
From experience of working with the library, the following guidelines are helpful when designing effective label functions:
- 🤔Always have the end outcome in mind when designing label functions. In particular, think about precision and recall. This will help when deciding on whether coverage or specificity is more important in the labels produced.
- 📑 Think through potential functions before coding them. Creating a list of these in plain English is helpful to allow you to prioritise the most effective label functions before coding them up.
- 🎯Less is more. It is often tempting to take a scatter-gun approach to building functions; however, a smaller number of well thought out functions is usually more effective than a larger number of less refined ones.
- ⚗️Always test any new function on the dataset by itself before adding it to your label functions. What results does it return? Are these what you were expecting?
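As a rough illustration of the last tip, a new function can be run on its own and its votes counted before it joins the others. This is a plain-Python sketch: the rows are hypothetical stand-ins for the real dataset, and ccs repeats the keyword rule from the first label function without the Snorkel decorator.

```python
from collections import Counter
from types import SimpleNamespace

ABSTAIN = -1

def ccs(x):
    # Same keyword rule as the first label function, without the decorator.
    return 1 if "use by uk public sector bodies" in x.desc.lower() else ABSTAIN

# Hypothetical rows standing in for the real dataset.
rows = [
    SimpleNamespace(desc="Framework for use by UK public sector bodies"),
    SimpleNamespace(desc="Supply of laboratory equipment"),
    SimpleNamespace(desc="Cleaning services for council offices"),
]

votes = [ccs(r) for r in rows]
print(Counter(votes))  # how often each label is returned

coverage = sum(v != ABSTAIN for v in votes) / len(votes)
print(f"coverage: {coverage:.2f}")

# Read a few of the matches to check they are what you expected.
for r, v in zip(rows, votes):
    if v != ABSTAIN:
        print(v, r.desc)
```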
Applying and evaluating our label functions
Once you have built one or more labelling functions, you need to apply these to create a set of data points. This can be achieved using the PandasLFApplier which allows you to build these data points directly from a Pandas dataframe.
from snorkel.labeling import PandasLFApplier
lfs = [ccs]  # ...plus your other label functions
applier = PandasLFApplier(lfs=lfs)
L_train = applier.apply(df=df_train)  # unlabelled training set
L_dev = applier.apply(df=df_test)  # small, manually labelled dev set
Note that in this example we have a train set and a ‘dev’ set. The dev set is a small set of manually labelled items (our ‘gold’ labels). It makes it easy to get a quick, rough sense of how the label functions are performing, rather than having to eyeball the results.
Once you have applied your label functions, Snorkel provides easy access to their performance through LFAnalysis, for example by calling LFAnalysis(L=L_dev, lfs=lfs).lf_summary(Y=Y) with Y set to the array of gold labels for the dev set.
As we have a dev set with labels, this provides us with the below information:
The meaning of each of these is below (straight from the Snorkel documentation):
- Polarity: The set of unique labels this LF outputs (excluding abstains)
- Coverage: The fraction of the dataset the LF labels
- Overlaps: The fraction of the dataset where this LF and at least one other LF label
- Conflicts: The fraction of the dataset where this LF and at least one other LF label and disagree
- Correct: The number of data points this LF labels correctly (if gold labels are provided)
- Incorrect: The number of data points this LF labels incorrectly (if gold labels are provided)
- Empirical Accuracy: The empirical accuracy of this LF (if gold labels are provided)
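To make these definitions concrete, the sketch below computes them by hand for a small, hypothetical label matrix. It is a plain-Python illustration of the definitions listed above, not Snorkel's implementation, and the matrix and gold labels are made up.

```python
ABSTAIN = -1

# Hypothetical label matrix: one row per data point, one column per label function.
L = [
    [1, 1, ABSTAIN],
    [1, 0, ABSTAIN],
    [ABSTAIN, 0, 0],
    [ABSTAIN, ABSTAIN, ABSTAIN],
]
gold = [1, 1, 0, 0]  # hypothetical gold labels for the same four data points

def lf_stats(L, gold, j):
    """Summary statistics for the label function in column j."""
    n = len(L)
    labelled = overlaps = conflicts = correct = 0
    for row, y in zip(L, gold):
        vote = row[j]
        if vote == ABSTAIN:
            continue
        labelled += 1
        # Non-abstain votes cast by the other label functions on this data point.
        others = [v for k, v in enumerate(row) if k != j and v != ABSTAIN]
        overlaps += bool(others)
        conflicts += any(v != vote for v in others)
        correct += vote == y
    return {
        "polarity": sorted({row[j] for row in L if row[j] != ABSTAIN}),
        "coverage": labelled / n,
        "overlaps": overlaps / n,
        "conflicts": conflicts / n,
        "correct": correct,
        "incorrect": labelled - correct,
        "emp_acc": correct / labelled if labelled else None,
    }

for j in range(len(L[0])):
    print(j, lf_stats(L, gold, j))
```

Here the second label function has the highest coverage but also the only incorrect vote, which is the kind of trade-off these statistics are meant to expose.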
Creating a label model
Once we are happy with our label functions, we can bring these together in a probabilistic model. This is Snorkel’s ‘magic sauce’, which combines the outputs of each of the functions and returns either the probability of a data point having a particular label, or the labels themselves.
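# Recent Snorkel releases expose LabelModel from snorkel.labeling.model
from snorkel.labeling.model import LabelModel
# cardinality=2 because there are two classes: framework and not framework
label_model = LabelModel(cardinality=2, verbose=True)
label_model.fit(L_train=L_train, n_epochs=500, lr=0.001, log_freq=100, seed=123)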
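To give an intuition for how a set of votes can be turned into a probability, the sketch below weights each label function by an accuracy and applies Bayes' rule, assuming the functions are independent and both classes are equally likely. This is a simplified stand-in and not Snorkel's algorithm: the accuracies here are made-up inputs (assumed to lie strictly between 0 and 1), whereas LabelModel estimates its own parameters from the label matrix.

```python
ABSTAIN = -1

def prob_positive(votes, accuracies):
    """P(label = 1) given binary votes, assuming independent label functions
    and a uniform prior over the two classes."""
    like_pos = like_neg = 1.0
    for vote, acc in zip(votes, accuracies):
        if vote == ABSTAIN:
            continue  # an abstain carries no evidence in this simplified model
        like_pos *= acc if vote == 1 else 1 - acc
        like_neg *= acc if vote == 0 else 1 - acc
    return like_pos / (like_pos + like_neg)

accuracies = [0.9, 0.6]  # hypothetical accuracies, for illustration only

print(prob_positive([1, 0], accuracies))              # the two functions disagree
print(prob_positive([ABSTAIN, ABSTAIN], accuracies))  # no evidence either way
```

When the two functions disagree, the more accurate one dominates, which is the key difference from a simple majority vote.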
Filter the dataset
Depending on the coverage of our label functions we will need to filter out some of our training data before using the labels to train a machine learning model. The reason is that some data points will not have been picked up by any of our label functions. We want to remove these items as they will add noise to the training data. Snorkel has an inbuilt function which makes this easy:
from snorkel.labeling import filter_unlabeled_dataframe
# For label probabilities:
probs_train = label_model.predict_proba(L=L_train)
# For actual labels (optional):
preds_train = label_model.predict(L=L_train, return_probs=False)
# Filter out the data points that no label function voted on:
df_train_filtered, probs_train_filtered = filter_unlabeled_dataframe(X=df_train, y=probs_train, L=L_train)
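The filtering step itself is simple. As a rough sketch of the idea in plain Python (my understanding of what the filter does, not Snorkel's code), a data point is kept only if at least one label function gave it a label. The data points, label matrix and probabilities below are hypothetical.

```python
ABSTAIN = -1

def filter_unlabelled(rows, y, L):
    """Keep only the data points where at least one label function voted."""
    keep = [any(v != ABSTAIN for v in votes) for votes in L]
    rows_kept = [r for r, k in zip(rows, keep) if k]
    y_kept = [p for p, k in zip(y, keep) if k]
    return rows_kept, y_kept

rows = ["contract A", "contract B", "contract C"]     # hypothetical data points
L = [[1, ABSTAIN], [ABSTAIN, ABSTAIN], [ABSTAIN, 0]]  # hypothetical label matrix
probs = [[0.1, 0.9], [0.5, 0.5], [0.8, 0.2]]          # hypothetical model output

rows_f, probs_f = filter_unlabelled(rows, probs, L)
print(rows_f)
```

The second data point is dropped because every label function abstained on it, so the model has no evidence about it either way.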
You’re all done!
You now have a training set of data with labels without having to perform any manual labelling on the dataset.
You can use these as a starting point for a supervised machine learning task.
As highlighted earlier, the documentation for Snorkel is excellent, and there are a number of in-depth tutorials which cover the features available within the library in more detail.