Using a K-Nearest Neighbors Classifier to Predict Democracy Erosion


Introduction

Machine learning models are incredibly useful tools that allow us both to understand the world around us and to make predictions about the future. At the same time, however, it can often be difficult to understand how their determinations are made and how to interpret their results, especially as models become more complex. Data visualization is a helpful way to make these models more interpretable. The following brief will employ data visualizations to explain the workings of a K-Nearest Neighbors classifier used to predict democratic backsliding.

K-Nearest Neighbors

Image By Author

K-Nearest Neighbors (KNN) is a supervised machine learning technique that can be applied to both regression and classification. The KNN classifier works by identifying the ‘K’ points in the training data that are closest to the observation of interest, and then assigning it the majority category among those nearest points. Figure 1 represents this concept neatly using simple simulated data with only two features. As can be seen in the figure, a (mostly correct) decision boundary is formed between the two classes based on the nearest 5 observations. Our actual dataset includes significantly more features, so this concept must be extrapolated to a multi-dimensional feature space; still, this simple visualization should help illustrate the idea.
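The idea behind Figure 1 can be sketched in a few lines of Scikit-learn. The two-feature data below is made up purely for illustration; it is not the simulated data used in the figure.

```python
# A minimal sketch of KNN classification on simulated two-feature data:
# a new point is assigned the majority class of its 5 nearest neighbors.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
# Two clusters of 50 points each in a 2-D feature space
class_a = rng.normal(loc=[0, 0], scale=1.0, size=(50, 2))
class_b = rng.normal(loc=[3, 3], scale=1.0, size=(50, 2))
X = np.vstack([class_a, class_b])
y = np.array([0] * 50 + [1] * 50)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X, y)

# New observations near each cluster center get that cluster's label
print(knn.predict([[0.0, 0.0], [3.0, 3.0]]))
```

With more than two features, the same distance-based vote simply happens in a higher-dimensional space.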

Background

Before we begin our analysis, it’s important to have a conceptual understanding of democratic backsliding. Much has been made recently of the fact that the global trend towards democracy seems to be reversing and that many countries are experiencing what is known as democratic backsliding. Democratic backsliding, defined as the “state-led debilitation or elimination of the political institutions sustaining an existing democracy,” is a catch-all term for the various processes by which states move towards more authoritarian forms of governance.¹ Political scientists have developed a relatively robust understanding of the various processes by which democracies break down, but the question remains: what are the causal factors of democratic backsliding, and what indicators can be used to predict whether or not a country will regress farther away from democratic governance? In the rest of the brief, we will use the KNN model discussed above to try to answer this question.

Data

Image By Author

The data used for this analysis was compiled from several different sources. The unit of analysis is the country-year, with observations beginning in 1960 and spanning 166 different countries. The dependent variable, the presence or absence of democratic backsliding, was created from the Polity Project’s Polity 5 data set, which measures various qualities of democratic and authoritarian governance.² This includes a “Polity Score” measured on a 21-point scale from -10 (hereditary monarchy) to 10 (consolidated democracy). Measures of democratic backsliding were created by subtracting the prior year’s polity score from that of the year of interest. The dependent variable is binary: a negative year-over-year difference in polity score was coded as a 1, representing an incidence of democratic backsliding, while a positive difference or no change was coded as a 0, indicating no evidence of backsliding. The data shows that democratic backsliding is a rare event: as shown in Figure 2, out of almost 2,500 country-years since 1960 for which Polity Scores were available, there were under 200 incidents of backsliding, just about 8%. As a result, there is a clear skew that will need to be addressed when running the model.
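The coding rule above can be sketched with pandas. The column names here ("country", "year", "polity") are hypothetical stand-ins for the actual dataset's fields, and the scores are invented for illustration.

```python
# Sketch: derive the binary backsliding indicator from Polity scores.
import pandas as pd

df = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B", "B"],
    "year":    [2000, 2001, 2002, 2000, 2001, 2002],
    "polity":  [8, 6, 6, -2, 0, 1],
})

df = df.sort_values(["country", "year"])
# Year-over-year change: this year's score minus the prior year's
df["polity_change"] = df.groupby("country")["polity"].diff()
# Backsliding = 1 when the score fell; 0 when it rose or was unchanged.
# The first year per country has no prior score (NaN diff) and codes as 0.
df["backsliding"] = (df["polity_change"] < 0).astype(int)
print(df[["country", "year", "polity", "backsliding"]])
```

Country A's drop from 8 to 6 in 2001 codes as backsliding; country B's improvements do not.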

The key independent variables constitute a set of social and institutional factors and were pulled from the Varieties of Democracy dataset. The V-Dem index is a “multidimensional and disaggregated data set that reflects the complexity of the concept of democracy as a system of rule.”³ Of the several hundred possible indicators in the full V-Dem data set, the 58 most relevant and generalizable for this analysis were chosen as independent variables.

Population data was sourced from the World Bank, and information on GDP and GDP per capita was sourced from Gapminder, which provides a yearly estimate in 2011 dollars, standardized across countries for purchasing power parity.⁴ Yearly GDP per capita growth, along with population, was used to control for different effects of the V-Dem variables across countries of varying sizes and levels of economic development. Additionally, a decades variable was added to address temporal effects. Finally, operating under the assumption that it takes time for conditions to take effect, all independent variables (with the exception of decades) were lagged by 5 years.
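The 5-year lag can be sketched as a grouped shift within each country's time series. Again, the column names are hypothetical and the values invented.

```python
# Sketch: lag an independent variable by 5 years within each country,
# assuming a country-year panel sorted by year within country.
import pandas as pd

panel = pd.DataFrame({
    "country": ["A"] * 8,
    "year": list(range(1960, 1968)),
    "freedom_expr": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8],
})

panel = panel.sort_values(["country", "year"])
# shift(5) pairs each country-year with conditions 5 years earlier;
# the first 5 years per country have no lagged value (NaN).
panel["freedom_expr_lag5"] = panel.groupby("country")["freedom_expr"].shift(5)
```

So the 1965 observation is predicted from 1960 conditions, matching the assumption that conditions take time to produce backsliding.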

Analysis

Image By Author

In order to run the machine learning model, I relied heavily on Python’s Scikit-learn package, which provides a suite of tools for predictive data analysis.⁵ In addition to running the KNN classifier, I used cross-validation to validate the model and to tune its hyperparameters. The cross-validation method used here is k-fold validation, which splits the observations into k groups and treats each group in turn as the test set, training on the remaining k-1 groups. The cross-validation estimate is then computed as the average across the k test groups. For the purposes of this analysis, k was set to 5. The model was tuned on the number of neighbors it uses for its calculation; an estimate using the 50 nearest neighbors produced the best result.
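The tuning step can be sketched with Scikit-learn's GridSearchCV, which runs the k-fold procedure for each candidate neighbor count. The synthetic data below stands in for the real country-year features, and the candidate grid is illustrative.

```python
# Sketch: 5-fold cross-validation over a grid of neighbor counts,
# scored on balanced accuracy.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Imbalanced synthetic stand-in for the country-year data
X, y = make_classification(n_samples=500, n_features=10,
                           weights=[0.9, 0.1], random_state=0)

grid = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [5, 10, 25, 50]},
    cv=5,                          # 5-fold cross-validation
    scoring="balanced_accuracy",
)
grid.fit(X, y)
print(grid.best_params_)  # the brief's model settled on 50 neighbors
```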

Image By Author

Because the dataset was imbalanced, as displayed in Figure 2, I used resampling techniques to artificially increase the size of the minority class. The technique used in this model is known as up-sampling or oversampling. Oversampling artificially increases the number of minority-class examples by generating new samples from the existing data to address the balancing issue. To implement oversampling, I used the RandomOverSampler class from imblearn, a Scikit-learn contributor package designed specifically to assist in handling imbalanced data sets.⁶

I used two metrics to evaluate the performance of the model. The primary metric was balanced accuracy, which is superior to regular accuracy when dealing with imbalanced data because it weights per-class accuracies by the inverse prevalence of each class, preventing inflated estimates from simply predicting the dominant class. I also tested the recall of the best model as determined by the pipeline. Recall is calculated as tp / (tp + fn), the ability of the classifier to find all the positive samples; in this case, that is equivalent to correctly identifying backsliding events.
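Both metrics are available in Scikit-learn. The labels below are a toy illustration, not the model's actual output.

```python
# Sketch: the two evaluation metrics on toy predictions.
from sklearn.metrics import balanced_accuracy_score, recall_score

y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 0]

# Balanced accuracy: the mean of per-class recall, robust to imbalance.
# Here each class has 3 of 4 correct, so (0.75 + 0.75) / 2 = 0.75.
print(balanced_accuracy_score(y_true, y_pred))  # 0.75
# Recall: tp / (tp + fn), the share of true positives recovered.
print(recall_score(y_true, y_pred))             # 0.75
```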

Finally, after running the models, I attempted to ascertain the importance of each variable in the model. To do so, I used Scikit-learn’s permutation importance function, which scrambles the data one variable at a time and then uses the model to re-predict on the scrambled data, repeating the process 5 times for each variable. The variables most important to the model are those whose scrambling produces the largest drop in predictive accuracy, determined again by the balanced accuracy score.
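This step can be sketched with `permutation_importance` from `sklearn.inspection`; the fitted model and data below are synthetic stand-ins.

```python
# Sketch: permutation importance with 5 repeats per feature,
# scored on balanced accuracy.
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
model = KNeighborsClassifier(n_neighbors=5).fit(X, y)

result = permutation_importance(model, X, y, n_repeats=5,
                                scoring="balanced_accuracy",
                                random_state=0)
# importances_mean: average score drop per feature across the 5 repeats;
# larger drops mean the model relied more heavily on that feature.
print(result.importances_mean)
```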

Results

Overall, the model was decently predictive, returning a balanced accuracy score of 0.7498: averaged across the two classes, it classified observations correctly roughly 75% of the time.

The model’s ability to correctly predict backsliding in the test data was even higher, with a recall score of 0.79, meaning it identified 79% of all instances of backsliding. The confusion matrix in Figure 3 demonstrates the model’s performance across the different possible category combinations, with the predicted values on the x-axis and the true values on the y-axis. As the figure shows, the model’s weakest link is its propensity to produce false positives: it predicted backsliding in 162 cases in which there was none. This is likely due, at least in part, to the lack of sufficient backsliding samples, and it implies that more work is needed to compensate for the skewed outcome distribution.

Based on the permutations run to assess variable importance, the most important variable in the full model was Freedom of Expression, with an average impact on the balanced accuracy score of 0.013. The top 10 most important variables are represented in Figure 4. The colored bars represent the average effect across all 5 trials, while the error lines running through them span the maximum and minimum effects across the permutations. Interestingly, the top 10 don’t share a common theme, implying that there isn’t any specific defining set of characteristics that is instrumental in predicting backsliding; rather, there are many dispersed causal factors.

Conclusion

As can be seen above, the K-Nearest Neighbors classifier provides an effective way to tackle classification problems that involve a variety of features. Additionally, the visualizations provide a “look under the hood” allowing for the model and its performance to be more easily interpreted. This is especially important in a policy context, as predictions without context aren’t nearly as useful. The ability to interpret the factors driving the model can provide policy makers with the necessary information to strengthen their democracies and avoid backsliding.

[1]: Nancy Bermeo, “On Democratic Backsliding,” Journal of Democracy 27, no. 1 (2016): 5–19. doi:10.1353/jod.2016.0012.

[2]: Monty G. Marshall and Ted Robert Gurr (2020), “Polity5: Political Regime Characteristics and Transitions, 1800–2018,” Dataset Users’ Manual, Center for Systemic Peace. http://www.systemicpeace.org/inscr/p5manualv2018.pdf

[3]: Michael Coppedge et al., “V-Dem [Country–Year/Country–Date] Dataset v10” (2020), Varieties of Democracy (V-Dem) Project. https://doi.org/10.23696/vdemds20

[4]: The World Bank, World Development Indicators. Population, total [Data file], (2012) Retrieved from https://data.worldbank.org/indicator/SP.POP.TOTL

[5]: Pedregosa, F., et al. (2011), Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12, 2825–2830.

[6]: Guillaume Lemaitre, Fernando Nogueira, & Christos K. Aridas (2017), Imbalanced-learn: A Python Toolbox to Tackle the Curse of Imbalanced Datasets in Machine Learning. Journal of Machine Learning Research, 18(17), 1–5.


Visualizing the Determinants of Democratic Backsliding was originally published in Towards Data Science on Medium.