
Understanding how historical data can lead to algorithmic bias, using a naive example of a compensation prediction model

To be human is to be Biased?

“Bias” is a tendency or inclination to favor or disfavor one set over another. All humans carry some degree of bias because we are inherently programmed to perceive anyone different as a threat. Due to this implicit bias, we tend to unconsciously ascribe traits and qualities to stigmatized groups in society, including groups based on sex, race, nationality, or geographic heritage. According to the Stanford Encyclopedia of Philosophy:

“Research on ‘implicit bias’ suggests that people can act on the basis of prejudice and stereotypes without intending to do so”

Now, if we know and understand this fact, we have a way to eliminate this implicit bias: being aware of it and recognizing how it influences our cognitive decision-making. This involves making conscious efforts to change stereotypes and adjust our perspective.


The Endless Quest for Accuracy

The goal of Machine Learning is to get the most accurate predictions possible. Data scientists have several metrics to quantify this goal by minimizing known prediction errors. For regression problems, these metrics include Root Mean Squared Error, Mean Absolute Error, and the R-squared value; for classification, they include AUC, Accuracy, Precision, Sensitivity, and Specificity. Several methods are then used to tune and fine-tune the parameters and hyperparameters.
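As a minimal sketch, the metrics named above can all be computed with scikit-learn; the numbers below are made up purely for illustration:

```python
import numpy as np
from sklearn.metrics import (
    mean_squared_error, mean_absolute_error, r2_score,
    roc_auc_score, accuracy_score,
)

# --- Regression metrics on toy predictions ---
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.8, 5.4, 2.9, 6.5])

rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # Root Mean Squared Error
mae = mean_absolute_error(y_true, y_pred)           # Mean Absolute Error
r2 = r2_score(y_true, y_pred)                       # R-squared

# --- Classification metrics on a toy binary problem ---
y_cls = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc_score(y_cls, y_prob)                  # Area Under the ROC Curve
acc = accuracy_score(y_cls, [p >= 0.5 for p in y_prob])
```

Tuning then means searching the parameter and hyperparameter space to improve these numbers on held-out data.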

Everything boils down to optimizing one particular trade-off, the bias-variance trade-off: making sure that we get accurate predictions for any part of the population data.

The Overlooked Trade-off

In the pursuit of higher accuracy, one fact is often forgotten: data collected by humans is riddled with human bias. The Guardian articulates this well:

Although neural networks might be said to write their own programs, they do so towards goals set by humans, using data collected for human purposes. If the data is skewed, even by accident, the computers will amplify injustice.

Human bias can creep in at any point in the data science pipeline and take the form of algorithmic bias. Of course, the data collection stage is the most susceptible point, and it is also connected to optimizing the bias-variance trade-off mentioned above. Either the data is collected by humans, or the process of data collection is set up by humans, which means the agent of data collection can unconsciously become the source of bias entering the pipeline. Moreover, the insights derived from the predictions can be influenced by the subjectivity of the human interpreting the results. The immediate effect of algorithmic bias can be discriminatory actions against, or in favor of, particular groups of people.

Surprisingly, this isn’t a new problem. The article “What Do We Do About the Biases in AI” (HBR, 2019) recounts how, back in 1988, a British medical school making admission decisions based on an algorithm was found guilty of bias against women and applicants with non-European names. Interestingly, the algorithm demonstrated 90–95% accuracy in matching human decision-making.

A more recent example of algorithmic bias was highlighted in a 2016 Washington Post article. It discusses COMPAS, software used nationwide in the United States to inform bail and sentencing decisions, which the organization ProPublica found to be biased against African-American defendants.

Let’s see an Example

Gender disparity in the workplace has long existed and is still prevalent. Thus, historical employee data is bound to carry a biased trend against females. To see algorithmic bias at play, let’s take the example of historical compensation data for senior officers of publicly listed companies. We build models with and without the gender of the senior officer as a feature to predict their direct compensation, then compare the results to see how including bias-sensitive features can lead to discriminatory predictions.

About the Data

The data has been obtained from BoardEx via Wharton Research Data Services. It covers four geographic categories: the US, the UK, the European Union (EU), and all other countries. Two data sets were used to get different features related to the compensation of senior officers at publicly listed companies:

1. Compensation data: includes individual data (taken from annual reports) on present and past stock-option holdings, direct compensation, time to retirement from each position served, time served at the company, information about the current board (if serving), the sector category of the firm, and information about past boards (if served).

2. Individual profile characteristics: includes data on gender, educational qualifications, age, and network or connections.

The detailed data dictionary can be accessed here.

Data Cleaning

There were duplicate values in the data because some companies have multiple tickers; these were removed. After deduplication, the number of unique BoardIDs (a unique identifier for company boards) was about 17K from the US, 5K each from the EU and the UK, and about 8K from other countries. The final merged data (all countries) had information on approximately 325K directors/senior officers.

However, each director has multiple entries taken from companies’ annual reports spanning different years. For example, if a director served a company from 2000–2010, there are 10 entries for that director-company pairing. For our analysis, we retained, for each director X in company Y, the entry with the highest total direct compensation.
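This deduplication step can be sketched with a pandas groupby; the column names below are hypothetical stand-ins, not the actual BoardEx field names:

```python
import pandas as pd

# Toy data standing in for the merged BoardEx extract.
# Column names are assumed for illustration only.
df = pd.DataFrame({
    "DirectorID":      [1, 1, 1, 2],
    "BoardID":         ["A", "A", "B", "A"],
    "TotalDirectComp": [100.0, 250.0, 80.0, 300.0],
})

# For each director-company pairing, keep the annual-report entry
# with the highest total direct compensation.
idx = df.groupby(["DirectorID", "BoardID"])["TotalDirectComp"].idxmax()
dedup = df.loc[idx].reset_index(drop=True)
```

After this step, each row is a unique director-company pairing.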

After choosing the variables most relevant to our analysis, i.e. those important in deciding a senior officer’s compensation, the final data set had 80,879 rows and 21 variables, with each row a unique director-company pairing.

Description of Variables in the Data set (By Author)

Note: We will use Total Direct Compensation as the dependent variable (Y) in our analysis.

Snapshot of the data (By Author)

Exploratory Analysis- Is the data really biased?


Looking at just the number of Females and Males in our data, we see a huge discrepancy. We have 9,908 Females compared to 70,971 Male senior officers!

This is also corroborated by the distribution of the Gender Ratio or the proportion of male senior officers.


Let’s look at the mean total direct compensation by gender. As seen below, the mean total direct compensation for males is almost double that for females.
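This comparison is a one-line groupby; the numbers and column names below are toy stand-ins, not the actual figures from the data set:

```python
import pandas as pd

# Toy stand-in for the cleaned data set; names and values are illustrative.
df = pd.DataFrame({
    "Gender":          ["M", "F", "M", "F", "M"],
    "TotalDirectComp": [400.0, 200.0, 500.0, 250.0, 600.0],
})

# Mean total direct compensation by gender.
mean_by_gender = df.groupby("Gender")["TotalDirectComp"].mean()
```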


Now, let’s look at two variables that can serve as proxies for access to opportunity.


Surprisingly, there isn’t much difference between the two genders in the distributions of time spent at companies, indicating that senior officers, irrespective of gender, give equal time to their companies but receive different amounts of compensation.


Comparing the distributions of the number of qualifications senior officers hold, by gender, we see that women tend to have more qualifications overall. This could indicate that, to reach parity with their male counterparts, women need more qualifications.

Modeling Approach


We first try Lasso regression to see whether the variable Gender is picked up by the model. Our dependent variable (Y) is the total direct compensation.

The model not only keeps Gender as a feature but also predicts total direct compensation for males approximately 135 units higher than for females. The R-squared for Lasso is 0.38; the model explains only 38% of the variance in total direct compensation.
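A sketch of this step on synthetic, deliberately "historically biased" data (all names and numbers are illustrative, not the actual BoardEx features): the coefficient on the gender indicator is what reveals whether Lasso kept the feature.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-ins: a 0/1 gender indicator and a continuous tenure feature.
gender_male = rng.integers(0, 2, n)
tenure = rng.normal(10, 3, n)

# Simulated historically biased compensation: males earn more on average
# even at equal tenure (the 135-unit gap is planted by construction).
comp = 50 * tenure + 135 * gender_male + rng.normal(0, 40, n)

X = np.column_stack([gender_male, tenure])
model = Lasso(alpha=1.0).fit(X, comp)

# A nonzero coefficient on the gender column means Lasso picked up the feature.
gender_coef = model.coef_[0]
```

Because the historical gap is real in the data, the L1 penalty does not zero out the gender coefficient: the model faithfully reproduces the bias.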

Further, to reach higher accuracy, gradient boosting was tried, giving an R-squared of 0.57 and hence capturing more of the variance in total direct compensation.

Contrary to our expectation, Gender did not emerge as an important variable according to the GBM. However, the mean predicted total direct compensation for males was still almost double that for female senior officers. To investigate further, a GBM model was run without Gender as a variable, and still the mean predicted total direct compensation for female senior officers was lower than for their male counterparts.
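This last result is the crux: dropping the sensitive column does not remove the bias if other features correlate with it. A sketch on synthetic data (names and numbers are illustrative; "network size" plays the role of a proxy feature):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(1)
n = 1000

gender_male = rng.integers(0, 2, n)
# A proxy feature historically correlated with gender (e.g. network size).
network = rng.normal(100, 10, n) + 30 * gender_male
comp = 5 * network + rng.normal(0, 20, n)

# Model trained WITHOUT the gender column: only the proxy feature is used.
X = network.reshape(-1, 1)
gbm = GradientBoostingRegressor(random_state=0).fit(X, comp)
pred = gbm.predict(X)

# Predicted means still differ by gender, transmitted via the proxy.
mean_male = pred[gender_male == 1].mean()
mean_female = pred[gender_male == 0].mean()
```

The gender gap survives in the predictions even though the model never saw the gender column.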

Limitations to our Example

While we have taken the example of predicting total direct compensation, it is important to note that no organization would use an algorithm that relies on variables like gender, race, or nationality for this purpose; the data set was chosen to illustrate how historical data can propagate social bias into predictions! Another point to remember is that for senior officers of most companies, compensation is not only direct: there are indirect components, including stock and option holdings and long-term incentive plans. Due to the sparsity of these features, they were not included in the analysis. Since the data covers organizations from various countries, it is also worth noting that compensation, tax benefits, and firm structures differ from country to country, which may have skewed the results.

With these caveats in mind, the key observation stands: even a simple predictive compensation analysis can produce socially biased results, even as the accuracy of the predictions improves.


So, what can we do about it? How can we make ‘ethical’ predictions using Machine Learning?

One solution is to use weights.

1. Add different weights to variables that carry an implicit bias for each class of the population that could have been impacted by social bias. For example, in the case of COMPAS, weighting or normalizing the number of prior arrests for African-American defendants and their White counterparts by their population proportions could be used.

2. Give more weight to variables that are independent of social bias, like the seriousness of the crime, than to prior arrests.
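One concrete way to implement the weighting idea is through per-sample weights at training time, upweighting the underrepresented group so that each group contributes equally to the loss. A minimal sketch with hypothetical group labels (most scikit-learn estimators accept a `sample_weight` argument in `fit`):

```python
import numpy as np

# Hypothetical group labels: 1 = majority group, 0 = minority group.
group = np.array([1, 1, 1, 1, 0])

# Weight each sample inversely to its group's size so that both groups
# contribute the same total weight to the training loss.
counts = np.bincount(group)
weights = len(group) / (len(counts) * counts[group])

# Usage with a scikit-learn estimator (X, y, model not defined here):
#   model.fit(X, y, sample_weight=weights)
```

With these weights, the single minority-group sample counts as much in aggregate as the four majority-group samples.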

Another solution could be to keep monitoring model results against real-time data, on the assumption that the social bias in the data will decrease over time.

Keeping this in mind, for fairness in predictions, a shift to ‘Explainable ML/AI’ will help propagate and produce more socially unbiased data. An important step in this direction is the EU’s General Data Protection Regulation, or GDPR [1]. Data usage is more regulated, with special protection under Article 9 for sensitive population characteristics like gender, race, ethnic origin, nationality, geographic location, religious beliefs, genetics, and trade union membership.

Another interesting point in the regulation is that even consumers’ online location data is treated as personal data under Article 4, and the use of any personal data is contingent on the consent of its owner under Article 6. Moreover, under Article 13, companies are mandated to inform consumers how their data is being used.

We still don’t know the extent to which human bias affects fairness in machine learning algorithms. But, as Sheryl Sandberg says, “We cannot change what we are not aware of, and once we are aware, we cannot help but change”!


[1] The EU General Data Protection Regulation (Questions and Answers), Human Rights Watch (2018)

The Human Bias-Accuracy Trade-off was originally published in Towards Data Science on Medium.