The Automatic Text Classification task consists of automatically assigning a document to one or more classes of membership.
It is a fundamental task in many scenarios.
For example, in Social Media Monitoring it is essential to classify tweets related to a certain “brand” as positive or negative opinions.
Or, in the case of Search Engines, it is possible to greatly improve their accuracy if the indexed documents are classified with respect to the topic they are about — so that users can more easily identify the texts of interest.
An Automatic Text Classification task can be implemented through a “rules system”, explicitly defined by a “domain expert”, or by Machine Learning systems.
Machine Learning (ML) is the ideal solution in the case where a sufficiently large set of previously classified texts is already available — a so-called “training corpus”: the corpus is supplied to the ML system, which “learns” autonomously what are the best strategies for classifying documents.
Rule systems, on the other hand, are generally more expensive to implement and maintain, because they require the intervention of “domain experts”, who must formalize rules that are generally unwritten, ambiguous, or fuzzy: rule systems are therefore usually employed when an adequate training corpus is not available.
In this article we will focus on Automatic Text Classification systems based on Machine Learning: we will compare some of them, and we will try to understand which is the best — or at least what the “best practice” might be when selecting a system.
I have been working with Machine Learning for a long time, and for years I have considered SVM (i.e. Support Vector Machines) one of the best-performing ML tools for the Automatic Text Classification task.
Like most ML practitioners, I have followed with great interest the progressive and pervasive affirmation of “Deep Learning” as the paradigm of choice for ML (for a brief but accurate summary of the rise of “Deep Learning”, have a look at this great article about the interview that Prof. Geoffrey Hinton — one of the “fathers” of neural networks — gave at Google I/O 2019).
So, will Deep Learning be more effective than SVM also for the Automatic Text Classification task?
There are obviously many other methods besides SVM but, in my experience and generally speaking, they show no substantial differences with respect to SVM in terms of overall accuracy. It is the impressive progress made in several fields using Deep Learning that led me to wonder whether the latter can be significantly better than SVM and the like.
This essay is therefore structured as a report of a (non-exhaustive) comparison between SVM and Deep Neural Networks (DNN) with respect to the Automatic Text Classification task.
1. Experiment setup
I decided to perform this comparison “experiment” using a corpus of texts in Italian. In my professional activity I deal almost exclusively with texts in Italian and, considering that it is often easier to access Natural Language Processing (NLP) tools for English, it is important for me to make this comparison on an Italian corpus, so I can immediately check the issues specific to the language. I hope this approach can be useful to all ML practitioners who deal with texts written in languages other than English.
1.1 Choice of the corpus, and task requirements
As far as I know, there is no truly “open” corpus of texts in Italian that can be used for training and evaluating an Automatic Text Classifier.
Therefore, I decided to create one.
The articles of the Italian edition of Wired magazine are published on the web with a Creative Commons license: so I decided to use these texts to create the corpus in question.
The wired.it articles are classified by “topic”: the subject of the article is shown before the title. For example, the article titled “Tutti i limiti dell’intelligenza artificiale” is classified in the Attualità > Tech section of the website (i.e. News > Tech).
Each article is assigned a single topic, and the topics are divided into two levels of hierarchy. It may happen that some articles are listed in “high level” topics — e.g. directly in Attualità; but since the hierarchy has only two levels, I preferred to define the task as a multi-class classification task (i.e. each document is assigned a single class, chosen from a given set) — and I built the corpus accordingly.
Beyond these hierarchical aspects, it is clear that cataloging each article in a single section (a reasonable choice from a publisher’s point of view), and considering the typology of the sections (which we will see shortly), leads to a “semantic overlap” between classes that penalizes any ML algorithm from the outset. Therefore, I do not assume that this corpus can allow the training of high-performance classifiers — regardless of the ML algorithm used. This limit is, anyway, irrelevant with respect to our purpose, which is mainly a comparison between different approaches.
Nonetheless, in order not to miss any possibility, I made similar evaluations also on other, non-public corpora (used during my professional NLP/ML activity at KIE, the company of which I am CTO), of which I will occasionally report just some summary information.
The evaluation will be done using precision/recall and F1 (generally “macro-averaged”), which are the most suitable metrics for this task.
Summing up:
- we will deal with a “multi-class” Automatic Text Classification task
- the corpus is in Italian
- the corpus consists of the articles of the Italian edition of wired.it, classified by section/topic
1.2 The wired.it corpus
The corpus was created by crawling wired.it using the Scrapy tool. The crawler code can be found in the GitHub wired-it-scraper project.
I produced two versions of the corpus:
- a single JSON file where, for each article, the related metadata (category, title, URL, copyright, text) are listed, and which can be downloaded from here
- a layout of the corpus on the file system, in which the articles were divided into training and test sets by means of folder names, and then divided into classes, again by means of folder names; this “layout” allows loading the corpus into the ML training systems in a simple way; this version of the corpus can be downloaded from here
Below is a summary table of the classes and the number of training and test documents for each of them:
2. SVM
2.1 Introduction to SVM
Support Vector Machines are an ML method that, starting from a set of training data represented in a given vector space, finds the best hyper-plane dividing the two categories — where “best” means the one that generalizes best, i.e. that separates the two groups of data with the widest possible margin.
Support Vector Machines can solve non-linear classification problems: data, using a “kernel” function, is mapped to a vector space with a greater number of dimensions so that the original problem can be redefined as a problem having characteristics of linear separability.
Therefore, in order to apply SVM to an automatic text classification task, it is required to define “features” that represent the documents; we will proceed as follows:
- NLP preprocessing of the corpus — needed to achieve the required dimensionality reduction and allow for a better generalization
- features extraction: we will use TfIdf
- export of the vector representation in the format supported by the chosen SVM tool, that is LibSVM
2.2 Corpus NLP preprocessing
The NLP preprocessing pipeline consists of:
- POS tagging
- filtering by POS tags
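As a minimal illustration of the second step (the actual pipeline runs through java-ml-text-utils, described below), here is a sketch in Python of filtering an already POS-tagged text so that only content words survive; the tag set and the choice of tags to keep are assumptions:

```python
# Illustrative sketch: filtering tokens by POS tag after tagging.
# The input is assumed to be already POS-tagged as (token, tag) pairs;
# the tag set used here (NOUN/PROPN/VERB/ADJ/...) is just an example.

CONTENT_TAGS = {"NOUN", "PROPN", "VERB", "ADJ"}

def filter_by_pos(tagged_tokens, keep=CONTENT_TAGS):
    """Keep only content words, dropping articles, prepositions, etc."""
    return [tok.lower() for tok, tag in tagged_tokens if tag in keep]

tagged = [("I", "DET"), ("limiti", "NOUN"), ("della", "ADP"),
          ("intelligenza", "NOUN"), ("artificiale", "ADJ")]
print(filter_by_pos(tagged))  # → ['limiti', 'intelligenza', 'artificiale']
```

Filtering by POS is one of the ways to obtain the dimensionality reduction mentioned above: function words carry little class-discriminating information.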
I published an Open Source project that implements the various steps mentioned above, as well as the subsequent feature extraction passes: the project is called java-ml-text-utils. This tool can also be used for languages other than Italian (although at the moment it requires some minimal customization).
Instructions can be found in the project “readme”:
- to customize the library for languages other than Italian
- to perform the various pre-processing steps and, as we will see later, to use other utilities to prepare the material for ML training
2.3 Features Extraction
I decided to use traditional TfIdf features; in the past, I have experimented with other types of features for SVM (e.g. Tf alone, topic models implemented with SVD, domain linguistic features, noun phrases), but TfIdf has always provided better results.
The features were extracted using the previously mentioned tool java-ml-text-utils.
In order to limit the computational complexity, the dimension of the vector space of the features has been limited to 10000 terms.
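As a rough illustration of TfIdf with a capped vocabulary (the actual extraction was done with java-ml-text-utils, which may use a different weighting or normalization variant), here is a toy sketch:

```python
import math
from collections import Counter

def tfidf_vectors(docs, max_terms=10000):
    """Toy TfIdf over preprocessed token lists (one list per document).
    The vocabulary is capped at the `max_terms` most frequent terms,
    mirroring the 10000-term limit used in this experiment."""
    counts = Counter(tok for doc in docs for tok in doc)
    vocab = {t: i for i, (t, _) in enumerate(counts.most_common(max_terms))}
    n_docs = len(docs)
    # document frequency of each kept term
    df = Counter(t for doc in docs for t in set(doc) if t in vocab)
    vectors = []
    for doc in docs:
        tf = Counter(t for t in doc if t in vocab)
        # sparse vector: feature index -> Tf * Idf
        vectors.append({vocab[t]: (c / len(doc)) * math.log(n_docs / df[t])
                        for t, c in tf.items()})
    return vocab, vectors
```

Note how a term occurring in every document gets Idf 0: it carries no discriminating information, which is exactly why TfIdf tends to work better than Tf alone.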
2.4 Export to LibSVM format
Again using java-ml-text-utils, the corpus was finally exported in LibSVM format.
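The LibSVM input format is plain text, one document per line: the class label followed by `index:value` pairs for the non-zero features, with indices starting at 1 and sorted in ascending order. A minimal serializer might look like this:

```python
def to_libsvm_line(label, sparse_vec):
    """Serialize one document as a LibSVM line: '<label> idx:value ...'.
    LibSVM expects 1-based feature indices in ascending order."""
    feats = " ".join(f"{i + 1}:{v:.6g}" for i, v in sorted(sparse_vec.items()))
    return f"{label} {feats}"

# e.g. class 3, with non-zero TfIdf weights on features 0 and 7
print(to_libsvm_line(3, {7: 0.25, 0: 0.5}))  # → "3 1:0.5 8:0.25"
```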
2.5 Training of an SVM classifier
The classifier was trained using the LibSVM tool.
I followed the approach indicated in the tutorial “A Practical Guide to Support Vector Classification”, which I personally consider very effective.
This approach involves the following steps:
- data scaling
- hyper-parameters grid search for the optimal “C” and “γ” parameters of an “RBF” kernel
- classifier training with the best parameters found
- evaluation of the classifier on the test set
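The grid search step can be sketched as follows; the exponentially spaced grids are the ones recommended by the practical guide, while `cv_accuracy` is a hypothetical stand-in for actually running k-fold cross-validation with svm-train on the scaled data:

```python
import itertools

# Exponentially spaced grids, as recommended by the LibSVM practical guide
C_GRID = [2.0 ** k for k in range(-5, 16, 2)]      # 2^-5, 2^-3, ..., 2^15
GAMMA_GRID = [2.0 ** k for k in range(-15, 4, 2)]  # 2^-15, 2^-13, ..., 2^3

def grid_search(cv_accuracy):
    """Pick the (C, gamma) pair maximizing cross-validation accuracy.
    `cv_accuracy(C, gamma)` is a placeholder for the real k-fold
    cross-validation run."""
    return max(itertools.product(C_GRID, GAMMA_GRID),
               key=lambda pair: cv_accuracy(*pair))
```

In practice the guide suggests a coarse pass first, then a finer grid around the best region; the sketch above corresponds to the coarse pass.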
For the wired.it corpus, following this procedure, I obtained a “macro-averaged” F1 of circa 0.59.
The result, as expected, does not shine: only 10% of the classes have an F1 greater than 0.8.
Below are the detailed results, where:
- TP means “True positives”
- FP means “False positives”
- FN means “False negatives”
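Given these counts, the “macro-averaged” F1 used throughout this comparison can be computed per class and then averaged with equal weight:

```python
def macro_f1(per_class):
    """per_class: list of (TP, FP, FN) triples, one per class."""
    f1s = []
    for tp, fp, fn in per_class:
        p = tp / (tp + fp) if tp + fp else 0.0  # precision
        r = tp / (tp + fn) if tp + fn else 0.0  # recall
        f1s.append(2 * p * r / (p + r) if p + r else 0.0)
    return sum(f1s) / len(f1s)  # "macro": unweighted mean over classes
```

Because each class weighs the same regardless of its size, macro-averaging penalizes a classifier that only does well on the most populous classes.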
3. Deep Learning
3.1 Keras, and Tensorflow
For neural network training, I chose to use Keras, with a Tensorflow backend. The Keras programming APIs are stable and well documented, the software ecosystem is flourishing, and there are lots of resources available — e.g. tutorials, but especially definitions of network architectures that can be easily reused.
3.2 A “flat” neural net
I decided to start with a substantially “flat” neural network, without “hidden” layers.
With Keras it is very easy to build a basic tokenizer; it did not seem equally simple to obtain adequate NLP preprocessing for Italian — but this limitation is mitigated by the ease with which the text can be represented via word embeddings: you can just specify it in the definition of the network architecture.
Our first network is thus constituted as follows:
- the input is the first N words of each text (with proper padding)
- the first layer creates word embeddings, using a vocabulary of a given size and a given embedding dimension
- afterward, “average pooling” is applied over the sequence of embeddings
- and finally, the output layer has a number of neurons equal to the number of classes of the problem, with a “softmax” activation function
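The original code is not reproduced here, but in Keras such a network can be sketched as follows (the parameter values are the ones that will turn out to work best in the next step; the number of classes is a placeholder to be set to the number of wired.it topics):

```python
from tensorflow.keras import layers, models

VOCAB_SIZE = 30000   # vocabulary size
MAX_LEN = 1000       # maximum length of the texts
EMBED_DIM = 200      # size of embeddings
NUM_CLASSES = 30     # placeholder: number of wired.it topics

def build_flat_net():
    model = models.Sequential([
        layers.Input(shape=(MAX_LEN,)),
        # word indices -> embeddings
        layers.Embedding(VOCAB_SIZE, EMBED_DIM),
        # average the embeddings over the whole text
        layers.GlobalAveragePooling1D(),
        # one output neuron per class
        layers.Dense(NUM_CLASSES, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

Note that there is no trainable layer between the pooled embeddings and the softmax output: this is what makes the network “flat”.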
We need to set some basic parameters:
- vocabulary size — i.e. the maximum number of terms used to represent a text: e.g. if we set the “vocabulary” size to 1000, only the thousand most frequent terms in the corpus will be considered (and the other terms will be ignored)
- the maximum length of the texts (which must all be the same length)
- size of embeddings: basically, the more dimensions we have the more precise the semantics will be, but beyond a certain threshold we will lose the ability of the embedding to define a coherent and general enough semantic area
- number of training epochs of the network
We make the first attempts trying to “manually” explore the space of these hyper-parameters:
We obtained an F1 macro of 0.62 using this configuration which, at the moment, seems the optimal one:
- vocabulary size: 30000
- maximum length of the texts: 1000
- size of embeddings: 200
- training epochs: 40
Increasing or decreasing these settings seems to worsen, or at least not improve, the F1 score.
With LibSVM we had obtained 0.59 F1 (macro): so with 0.62 the DNN improved that performance, even if the increase is modest; generally speaking, I would rather consider it a tie.
3.3 Deep Nets and Convolutions
We have achieved a good result, but the point of “Deep Learning” is exactly in the “deep”, i.e. in the idea that adding layers inside the network improves the ability to “represent” the “semantics” of a certain entity. In other words, each internal layer should — intuitively speaking — create a more “abstract” and “high level” representation than the previous layer.
So let’s try, in an admittedly naive attempt, to add a single internal layer, like this:
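A sketch of this variant, reusing the flat architecture and inserting one “dense” layer before the output (the size of 64 units is an assumption for illustration):

```python
from tensorflow.keras import layers, models

def build_one_hidden_net(vocab_size=30000, max_len=1000,
                         embed_dim=200, num_classes=30):
    # num_classes is a placeholder: set it to the number of wired.it topics
    return models.Sequential([
        layers.Input(shape=(max_len,)),
        layers.Embedding(vocab_size, embed_dim),
        layers.GlobalAveragePooling1D(),
        # the single added internal layer (64 units is an assumption)
        layers.Dense(64, activation="relu"),
        layers.Dense(num_classes, activation="softmax"),
    ])
```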
Taking the best parameters from the previous step, we get an F1 macro of 0.41: far worse than the 0.62 obtained with the “trivial” flat network!
But there is another aspect to take into account.
Starting from the 2014 article by Yoon Kim, “Convolutional Neural Networks for Sentence Classification”, it is widely believed that “convolutional” layers can significantly improve the performance of a DNN on Text Classification tasks. The theoretical reason is that convolutions are able to capture linguistic “patterns” over a “window” of consecutive words (or embeddings).
So let’s try adding a “convolutions” layer:
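A sketch of this architecture, in the spirit of Kim’s CNN (the number of filters and the window size are assumptions for illustration):

```python
from tensorflow.keras import layers, models

def build_conv_net(vocab_size=30000, max_len=1000,
                   embed_dim=200, num_classes=30):
    # num_classes is a placeholder; filters=128 and kernel_size=5
    # are assumptions for illustration
    return models.Sequential([
        layers.Input(shape=(max_len,)),
        layers.Embedding(vocab_size, embed_dim),
        # convolution over windows of 5 consecutive word embeddings
        layers.Conv1D(128, 5, activation="relu"),
        # keep, for each filter, its strongest activation over the text
        layers.GlobalMaxPooling1D(),
        layers.Dense(num_classes, activation="softmax"),
    ])
```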
We get an F1 macro of 0.52: better than the attempt with just the internal “dense” layer, but worse than the “flat” network.
However, considering that we can insert an arbitrary number of levels, each with a different number of neurons, that the layer with the convolutions has, in turn, some parameters to set (the number of output dimensions and the activation function), and that these architectural parameters are combined with the hyper-parameters explored in the previous step (and with others that we have not changed for now), the search space becomes very large and complex to explore by going through “manual attempts”.
Is there a more efficient and exhaustive way of trying to understand the architecture and the optimal combination of parameters?
3.4 Hyper-parameters optimization
As with SVM, there are also “best practices” and libraries for DNNs that allow you to systematically explore the space of hyper-parameters.
This is not a completely automatic procedure:
- it is necessary to define in advance which are the parameters that we want to “explore” and which are the possible values
- at the end of a “scan” cycle (which may be exhaustive or, to increase the efficiency of the operation, partial), the results must be analyzed, deciding which more restricted “area” of the search space to concentrate on, then proceeding iteratively until a satisfactory combination is found
Let us, therefore, define a “parameterized” network that can be passed to Talos, a hyper-parameter optimization library for Keras:
We define it as a function that takes as input the training and validation sets, plus a map of parameters: Talos will then generate the various networks by invoking this function.
The parameters that we are going to “explore” are therefore:
- embeddings: size
- hidden layers: shape, number, size, activation, dropout
- convolutions: filters number
- training: optimization, epochs, size of training batches
We only pass the training set to the scan; Talos will use it to create its own training and validation examples.
Let’s do a first “quick” scan on a very small area of the search space: 1% of the samples. This first step will allow us to begin to “orient ourselves”.
We set the possible parameter values as follows:
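The exact grid used is not reproduced here; an illustrative parameter dictionary in the shape Talos expects (all value ranges below are assumptions) might be:

```python
# Illustrative Talos parameter dictionary: one list of candidate values
# per hyper-parameter. The actual grid used in the experiment is not
# reproduced here; these ranges are assumptions.
p = {
    "embedding_dims": [50, 100, 200, 300],
    "hidden_layers": [0, 1, 2],
    "hidden_neurons": [32, 64, 128],
    "activation": ["relu", "elu"],
    "dropout": [0.0, 0.25, 0.5],
    "conv_filters": [64, 128, 256],
    "optimizer": ["adam", "rmsprop"],
    "epochs": [10, 20, 40],
    "batch_size": [32, 64, 128],
}
```

The parameterized model function reads each value from this map (e.g. `params["dropout"]`) when Talos invokes it, one combination at a time.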
The first execution gives extremely high validation accuracy values, but what interests us in this phase (as suggested by the Talos guide) is to understand which parameters most influence the task.
We therefore render — via Talos — a “visual” mapping of the correlations with respect to the “validation accuracy”:
Where colors define correlations according to this scale:
One would, therefore, say that it is better to focus on: embedding dimensions (confirming what has already been learned), number of convolution filters, number of neurons of the internal layers, batch size, and dropout. It would seem instead that it is not worth having more than one internal layer, and that it is more efficient to limit training to a few epochs (which contrasts with what occurred in the previous steps but, in order to obtain an indication quickly, we can nevertheless try).
We then set up a second execution with these parameters:
We increase the sampling rate and we finally obtain a new correlation matrix:
This new execution does not seem to give very significant information — perhaps even contrasting with what has been achieved so far: maybe the only clue we can grasp is that increasing the number of convolution filters only helps up to a certain threshold.
Let’s do the last scan by further increasing the sampling rate and going to check 115 possible parameter combinations.
The best result we get is as follows:
So let’s try to create a network with these parameters and architecture and verify their performance:
We get, after 40 training epochs, an F1 macro of 0.48: a very disappointing result.
For the sake of completeness, let’s simply try to remove the inner layer — which we previously observed to be the cause of deterioration in performance: we get an F1 macro of 0.50.
For completeness, let’s try with only 10 training epochs, which, after all, is the value suggested by the optimization done with Talos. We get an F1 of 0.53: the number of epochs “discovered” by Talos was actually the best suggestion. It is evident that this architecture is subject to over-fitting, and therefore it is better to limit the number of epochs.
As a final test, let’s try restoring the internal “dense” layer and perform the training with 10 epochs: we get F1 of 0.52 — better than the 0.48 obtained with 40 epochs, but slightly worse than 0.53 obtained without the inner layer.
In short, it seems that the search carried out with Talos confirms that a network more complex than a simple “flat” one is in no way able to approach the performance of the latter.
However, Talos, although a very efficient tool, is not easy to use: so I still hope there is a chance to do better.
4 Automated Machine Learning
SVM and a “flat” neural network with embeddings, therefore, seem to perform the same — with a slight edge for the DNN.
Is it possible that there are currently other solutions that give better performance for this task? An architecture that I have not considered, or even a learning method different from DNN?
Automated Machine Learning (or AutoML in short) is an approach to defining ML models that aims to build an “end-to-end” solution, taking full responsibility for all those decisions that we have seen are complex to take and for which there is no universal “recipe”.
The user only needs to produce the dataset (a duty that is not trivial in any case…): the system then takes on the task of identifying the best algorithm (not necessarily a DNN) and also finding the best hyperparameters.
With Google Cloud AutoML we proceed in this very simple way:
- the texts are loaded into Google Cloud Storage buckets
- using the java-ml-text-utils tool you can export the corpus in CSV format to be interpreted by Google Cloud AutoML
- training starts
The CSV indicates:
- the URI of the document (which must be among those previously loaded on the Storage)
- the label of the related class
Google Cloud AutoML automatically generates training and test sets (more precisely, it also needs to generate a “validation” set).
In order to have more comparable results, we also export in the CSV the breakdown between training and test sets of our corpus, adding the validation set as well (which I set at 10% of the training set).
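A sketch of how such a CSV can be produced in Python (the bucket path is hypothetical; AutoML accepts an optional first column holding the TRAIN/VALIDATION/TEST split):

```python
import csv, io, random

def automl_rows(docs, val_fraction=0.10, seed=0):
    """docs: list of (gcs_uri, label, split), split being 'TRAIN' or 'TEST'.
    Carves a validation set (10% here) out of the training documents,
    mirroring the corpus' own train/test breakdown."""
    rnd = random.Random(seed)
    rows = []
    for uri, label, split in docs:
        if split == "TRAIN" and rnd.random() < val_fraction:
            split = "VALIDATION"
        rows.append((split, uri, label))
    return rows

# hypothetical bucket path, for illustration only
docs = [("gs://my-bucket/wired/001.txt", "scienza", "TRAIN"),
        ("gs://my-bucket/wired/002.txt", "economia", "TEST")]
out = io.StringIO()
csv.writer(out).writerows(automl_rows(docs))
```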
The network training lasts about 3 hours and costs about 10 euros: very cheap!
Results are provided in terms of precision and recall. From what can be inferred, the results are micro-averaged.
We can immediately check the “confusion matrix” — really handy, even if only a subset of the classes is listed:
The confusion matrix clearly shows — as expected — how some pairs of classes are particularly ambiguous, for example:
- Attualità > Ambiente vs Scienza > Ecologia
- Economia > Business vs Economia > Finanza
- Internet > Regole vs Internet > Web
- Scienza vs two of its subclasses, Scienza > Ecologia and Scienza > Lab
- Lifestyle > Salute vs Scienza > Medicina
At evaluation time, it is necessary to set a “score threshold” that allows the system to turn a “ranking” of class attributions into an actual assignment (the “score threshold” is given a default value by the system).
Obviously, as the threshold varies, precision and recall vary accordingly: this feature allows you to understand how the model behaves depending on the most relevant metric for the task.
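Conceptually, sweeping the score threshold trades recall for precision; a minimal micro-averaged sketch for the single-label case:

```python
def precision_recall_at(scored, threshold):
    """scored: (score, predicted_label, true_label) per document.
    Micro-averaged: recall is computed over all documents, precision
    over the documents whose top score reaches the threshold."""
    assigned = [(p, t) for s, p, t in scored if s >= threshold]
    tp = sum(1 for p, t in assigned if p == t)
    precision = tp / len(assigned) if assigned else 0.0
    recall = tp / len(scored)
    return precision, recall
```

Raising the threshold leaves more documents unassigned: precision tends to rise (only confident predictions survive) while recall falls.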
In our case, we just set the score so that the precision is as similar as possible to that obtained in the other experiments: in this way, we obtain an F1 of 0.6 (micro averaged), whereas we obtained 0.62 with DNN and 0.58 with SVM.
The results are therefore entirely consistent with what has been achieved so far: the DNN gives results that are entirely comparable to SVM (perhaps slightly better), and Google Cloud AutoML confirms that on this corpus this result is probably the best that can be achieved.
I reached the same conclusion also with the other corpora on which I carried out similar experiments.
5. Is it possible to explain in a simple way why SVM and DNN are so similar?
Let’s start by recalling what is stated in the literature concerning the nature of the Automatic Text Classification task.
It is stated that, considering the high number of features/dimensions of the vector space, the problem is generally linearly separable (as also reported in the aforementioned LibSVM tutorial). In past experiments, we have found that an RBF kernel can give slightly better results than not using any kernel at all — but we can consider these differences a negligible optimization.
Wanting to find an intuitive interpretation of this statement, we could think of dimensions as “points of view” from which we can observe an object in an n-dimensional space. In the case of texts, these “points of view” are so many (on the order of thousands) that it is always possible (compatibly with the “quality” of the dataset) to find a way to “insert” a hyper-plane between the documents belonging to a class and those not belonging to it.
In the case of DNN, on the other hand, we must ask ourselves — mathematically, not “metaphorically” — what a certain layer of the network actually “computes”.
Of course, there is a lot of literature that accurately describes neural networks using mathematical models. This article, however, provides an interesting point of view, accessible in a more immediate way, and with a very clear graphic display: it is stated that the function of the various levels of a neural network can be understood as a series of transformations aimed at modifying the input so that the data will become linearly separable. Thus, the modus operandi is similar to SVM:
- try to transform the data so that they are linearly separable — in the case of DNN by adding internal levels, in the case of SVM by applying a kernel function (which, precisely, projects the data in a space with a greater number of dimensions)
- then find the best hyperplane that divides the data (“best” in the sense that it generalizes better, avoiding over-fitting on the training data as much as possible)
It follows that, assuming that an automatic text classification problem implies linear separability, adding internal layers to a DNN will not improve the results of the network.
Presumably, these are the reasons why SVM and DNN do not present substantial differences for the Automatic Text Classification task.
6. So what?
So if, for the Automatic Text Classification task, SVM and DNN seem to have completely comparable performance (in theory and in practice), how to choose between the two? Is there any advantage in using one or the other?
6.1 Training speed
In the setting used, the training of an SVM classifier — including the grid search of hyper-parameters (indispensable in my experience) — took a good 42 hours in multi-threading on a laptop with 6 cores (12 threads) Intel i7–8750H.
The neural network with similar performance can be trained in about 2.5 minutes (always on the same laptop, on which an Nvidia GeForce GTX 1060 card is installed).
It is, however, fair to mention that LibSVM offers some specifically optimized alternatives for the classification of texts: LibLinear, and Multi-core LibLinear. I have not had the chance to try them on the corpus in question, but I think they could hardly be trained faster than the DNN.
6.2 NLP Preprocessing complexity
The preprocessing of the corpus so that it is ready to be used for training through SVM is very complex: it includes the NLP part (segmentation, tokenization, POS tagging, stemming) and the features extraction part (TfIdf in our case).
In my experience, the NLP preprocessing part is in general necessary, but it may be worth checking case by case how much it contributes to the overall performance.
The feature extraction part, instead, must always be conducted, since it is obviously indispensable.
In the DNN case, on the other hand, these preparatory steps are not required:
- NLP preprocessing is not necessary: I ran a test in this regard, and the DNN trained on the NLP-preprocessed corpus gives no advantage over the one trained on the corpus without NLP preprocessing
- feature extraction is implicit in the first level of embeddings and is part of the network architecture itself
6.3 Google Cloud AutoML
The use of a DNN, with the same performance, therefore seems more “economical” and simpler than SVM.
On the other hand, the use of Google Cloud AutoML, always with the same performance, turns out to be even simpler than a DNN — especially since, after training a classifier, a REST endpoint is automatically created for the calculation of predictions.
7. Conclusions, and possible improvements
7.1 The Winner is…
In my opinion, the most powerful, simple and inexpensive way to implement an Automatic Text Classification task is currently Google Cloud AutoML.
If Google Cloud AutoML is not a viable option (perhaps for strategic reasons, or because really sensitive data are involved), the use of a “flat” DNN with embeddings (possibly implemented with Keras/Tensorflow) is the best alternative. In some cases, a DNN may even provide slightly better results than Google Cloud AutoML.
Even if using Google Cloud AutoML is not feasible for strategic reasons, I would still recommend using it to define a “baseline” for the current project.
Regarding the AutoML approach, I also recommend testing alternative platforms — such as H2O Driverless AI and MonkeyLearn.
7.2 Possible Improvements
A missing piece in this brief survey is the use of “transfer learning” practices for DNNs; for the task of interest, at present the most promising approach consists in using models based on BERT: pre-trained embedding models that are able to discriminate the semantics of words based on the context in which they are used. However:
- in theory, the smaller the dataset, the more useful they are: in our case the dataset was already sufficiently large, but I cannot exclude that a pre-trained model could still bring advantages
- the biggest obstacle, however, is the fact that there is (to my knowledge) no pre-trained model for Italian: I should, therefore, arrange to train one, for example following the steps described in this article, where it is shown how to train a BERT model for Russian. There are also “multi-lingual” BERT models but, since their purpose is to manage documents written in different languages within a single application, they are not very effective for applications targeting individual languages.
Other approaches that can be explored better are those related to other libraries of AutoML and hyper-parameters optimization, such as: