Bias in Your Datasets: COVID-19 Case Study

How can dataset bias cause deep learning models to overfit? Let's explore the question through model interpretability on medical images.

Chest X-ray on Shutterstock


In recent years, numerous experiments have demonstrated the value of artificial intelligence (AI) for medical imaging, and radiology in particular. This was made possible by the availability of large datasets, substantial advances in computational power and the advent of new deep learning algorithms. However, these technologies are not widely deployed today because the performance of AI algorithms degrades between the experimental phase and implementation in real-world conditions. One explanation for this phenomenon is overfitting: the algorithm does not generalize to new data.

In this project, we want to illustrate how bias can significantly impact the performance of a classification model on our Chest X-ray (CXR) images. Beyond a simple exploratory analysis of the dataset, we tried to demonstrate the existence of these biases and to characterize them.

The main purpose was to develop an algorithm that predicts, from a CXR image, whether a patient has viral pneumonia, COVID-19 or neither (normal). For this, we used the COVID-19 Radiography Database [1, 2]: 3886 radiographic images with a balanced distribution across the three classes.

Dataset with 3 classes: COVID, NORMAL, PNEUMO (Image by Author)

Being new to the field of deep learning, we chose to implement a simple CNN model (LeNet) with 2 convolutional layers. And guess what: we got a test accuracy of 97% 🎊!

Achieving 97% accuracy with (LeNet) model with 30 epochs (Image by Author)

Rather than submitting our results for the 2021 Nobel Prize in Medicine, we asked ourselves the following question:

Is it plausible to reach almost 97% accuracy with a simple 2-layer convolutional neural network when detecting chest diseases?

The answer is no. This is why it is important to look deeper into the biases of the dataset and try to correct them before training a Transfer Learning model.

What biases are present in our dataset?

The first step was to look for disparities and heterogeneity in the images. For this, the brightness distributions of the NORMAL and COVID images were compared.

Brightness distribution for Covid/Normal images (Image by Author)

A significant difference between the brightness distributions was observed. This global feature alone could therefore be enough to separate these two classes reasonably well, giving a first explanation for the high performance of LeNet noticed earlier.
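This kind of check is straightforward to script. The sketch below compares mean pixel intensity between two classes; the arrays are synthetic stand-ins (in the real project each array would be a grayscale CXR loaded from the dataset):

```python
import numpy as np

def mean_brightness(img: np.ndarray) -> float:
    """Average pixel intensity of a grayscale image (0-255)."""
    return float(img.mean())

# Synthetic stand-ins for two classes of CXR images; in the real project
# these would be loaded from the COVID-19 Radiography Database folders.
rng = np.random.default_rng(0)
normal_imgs = [rng.integers(100, 200, size=(28, 28)) for _ in range(50)]
covid_imgs = [rng.integers(60, 160, size=(28, 28)) for _ in range(50)]

normal_b = np.array([mean_brightness(im) for im in normal_imgs])
covid_b = np.array([mean_brightness(im) for im in covid_imgs])

# A large gap between the class means (relative to their spread) is a
# warning sign: a classifier can exploit brightness instead of pathology.
print(f"NORMAL: {normal_b.mean():.1f} ± {normal_b.std():.1f}")
print(f"COVID : {covid_b.mean():.1f} ± {covid_b.std():.1f}")
```

Plotting the two histograms side by side (as in the figure above) makes the shift visible at a glance.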

Let’s go further in the analysis …

To visualize the biases of the dataset, we started to look for local features that could make it possible to separate the 3 classes. The idea is to project the images into a lower-dimensional space to observe trends across the images of the 3 classes. To do this, the images were resized to 28x28:

CXR image resized to 28x28 pixels (Image by Author)

In theory, at this resolution it should be impossible to detect COVID-19 or pneumonia, for lack of detail.

The t-SNE algorithm [3] is a non-linear dimensionality-reduction method that represents a set of points from a high-dimensional space in a two- or three-dimensional space through successive iterations. The data can then be visualized as a point cloud:

Visualizing high-dimensional data with the t-SNE algorithm

The t-SNE was applied with 2 components to the pixel intensities, and the dataset was plotted on these two axes. Surprisingly, it is possible to distinguish the 3 classes on a 2D projection, with an unsupervised method and images without detail!

t-SNE visualization of the dataset followed by a SVM classification (Image by Author)

Indeed, by training an SVM on the two variables corresponding to the two t-SNE axes, an accuracy of 84% was obtained.
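The projection-then-classify experiment can be sketched with scikit-learn. The data below is synthetic (three classes of flattened 28x28 "images" whose mean intensity differs, mimicking the global bias discussed above), so the accuracy it prints is illustrative, not the 84% of the real dataset:

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in: 150 flattened 28x28 "images", 3 classes whose mean
# intensity differs (mimicking the global biases found in the dataset).
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(loc=mu, scale=20.0, size=(50, 784))
               for mu in (60, 120, 180)])
y = np.repeat([0, 1, 2], 50)

# Project the pixel intensities down to 2 t-SNE components ...
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

# ... then ask how well a simple SVM separates the classes in that plane.
acc = cross_val_score(SVC(kernel="rbf"), emb, y, cv=5).mean()
print(f"SVM accuracy on the 2-D t-SNE embedding: {acc:.2f}")
```

A high accuracy here means the classes are separable from global image statistics alone, before any deep model is involved.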

The t-SNE algorithm seems to extract, from 28x28 images devoid of detail, information important enough to allow a good classification. For ease of interpretation, let's do the same with a PCA.

PCA visualization of the dataset (Image by Author)

The projection is less clear-cut than with the t-SNE method, but we noticed that the first principal component (x-axis) can correctly separate the COVID images from the non-COVID images. Following this observation, we can take the vector pca.components_[0], keep only the absolute values of its coefficients, and view them as a 28x28 heatmap:

Heatmap of the first PCA component (Image by Author)

The projection above (first principal component) shows that the most important pixels are on the left and right edges of the image. The brightness of the edges is therefore enough to correctly separate the COVID images from the non-COVID images in our dataset, which constitutes a real bias in our study!
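The same inspection can be reproduced on synthetic data where the bias is injected deliberately: images whose edge columns are brighter for one "class". Reshaping the absolute values of pca.components_[0] back to 28x28 shows where the variance lives:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in: images whose *edge columns* vary between classes,
# mimicking the edge-brightness bias found in the real dataset.
rng = np.random.default_rng(1)
X = rng.normal(120, 5, size=(200, 28, 28))
edge_gain = rng.choice([0.0, 60.0], size=200)   # half the images "covid-like"
X[:, :, :3] += edge_gain[:, None, None]          # brighten left edge
X[:, :, -3:] += edge_gain[:, None, None]         # brighten right edge

pca = PCA(n_components=2).fit(X.reshape(200, -1))

# Reshape |first component| back to image space: large values mark the
# pixels that drive most of the variance.
heatmap = np.abs(pca.components_[0]).reshape(28, 28)

edge_weight = heatmap[:, :3].mean() + heatmap[:, -3:].mean()
center_weight = heatmap[:, 10:18].mean()
print(f"edge weight {edge_weight:.4f} vs centre weight {center_weight:.4f}")
```

When the edge pixels dominate the heatmap, the leading variance direction, and any model exploiting it, is driven by the image borders rather than the lungs.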

Let’s go even further …

The most frequent criticism of neural networks is the difficulty of extracting and explaining their decision process in a human-readable form. They are often seen as “black boxes”. But that is without counting the recent publications tackling the tough question:

How do convolutional neural networks decide?

Grad-CAM is a method published in 2017 [4] that helps interpret the results of a CNN. It produces an activation map for the class predicted by the network by backpropagating the gradients of that class score down to the last convolutional layer:

Grad-CAM architecture (Image by Author)
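The core of the computation is simple: global-average-pool the gradients to get one weight per feature map, take the weighted sum of the maps, and apply a ReLU. Here is a minimal NumPy sketch of that formula; the feature maps and gradients are random placeholders (in a real pipeline they would come from the framework's automatic differentiation):

```python
import numpy as np

def grad_cam(feature_maps: np.ndarray, gradients: np.ndarray) -> np.ndarray:
    """Grad-CAM heatmap from the last convolutional layer.

    feature_maps: (K, H, W) activations A^k of the last conv layer.
    gradients:    (K, H, W) gradients dy_c/dA^k of the class score y_c.
    """
    # alpha_k: global-average-pool the gradients over the spatial axes.
    alphas = gradients.mean(axis=(1, 2))                    # shape (K,)
    # Weighted combination of the feature maps, then ReLU.
    cam = np.maximum(np.tensordot(alphas, feature_maps, axes=1), 0.0)
    # Normalise to [0, 1] for display.
    return cam / cam.max() if cam.max() > 0 else cam

# Synthetic example: 8 feature maps of size 7x7 (a real use would pull
# these out of the trained CNN for a given input image and class).
rng = np.random.default_rng(0)
A = rng.random((8, 7, 7))
dydA = rng.random((8, 7, 7))
heat = grad_cam(A, dydA)
print(heat.shape)  # (7, 7)
```

Upsampled to the input resolution and overlaid on the CXR, this heatmap is what reveals whether the network looks at the lungs or at the image borders.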

The Grad-CAM algorithm was applied to the first prototype (LeNet) on 12 random COVID images, and the activation maps showed that the network mainly uses the edges of the image to make its decision, instead of looking inside the lungs for useful information:

The green/yellow areas are the regions on which the network bases its decision; the blue ones are regions of no interest. (Image by Author)

How to correct these biases?

Following the analysis of the dataset's biases, we observed that the classification algorithms rely mainly on the pixels at the edges of the image to make their decision. Our idea is therefore to remove these edges and focus only on the lungs, which carry the information needed for classification.

This required a U-Net neural network, pre-trained on CXR images and developed specifically for lung segmentation.

U-Net architecture (Image by Author)

The sequence below illustrates the transformation of images after all the pre-processing methods:

Segmentation pipeline (Image by Author)
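One of these pre-processing steps is histogram equalization, which homogenizes contrast across images. A minimal NumPy sketch for 8-bit grayscale images (a simplified version of what libraries such as OpenCV or scikit-image provide):

```python
import numpy as np

def equalize_histogram(img: np.ndarray) -> np.ndarray:
    """Histogram equalization of an 8-bit grayscale image."""
    hist = np.bincount(img.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]
    # Map each grey level through the normalised cumulative distribution.
    lut = np.round((cdf - cdf_min) / (cdf[-1] - cdf_min) * 255).astype(np.uint8)
    return lut[img]

# Toy example: a low-contrast image confined to grey levels 100-150.
rng = np.random.default_rng(0)
img = rng.integers(100, 151, size=(64, 64)).astype(np.uint8)
out = equalize_histogram(img)
# After equalization, the intensities span the full 0..255 range.
print(img.min(), img.max(), "->", out.min(), out.max())
```

Equalizing before training limits the brightness disparities between classes identified at the start of the analysis.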

After segmentation, the images were cropped around the lungs with a margin of 10 pixels.
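The cropping step can be sketched as follows, assuming the U-Net output has been thresholded into a binary lung mask (the mask here is a toy rectangle, not a real segmentation):

```python
import numpy as np

def crop_to_mask(image: np.ndarray, mask: np.ndarray, margin: int = 10) -> np.ndarray:
    """Crop `image` to the bounding box of a binary `mask`, plus a margin."""
    rows = np.any(mask, axis=1)
    cols = np.any(mask, axis=0)
    r0, r1 = np.where(rows)[0][[0, -1]]
    c0, c1 = np.where(cols)[0][[0, -1]]
    # Clamp the margin so the crop stays inside the image borders.
    r0, c0 = max(r0 - margin, 0), max(c0 - margin, 0)
    r1 = min(r1 + margin, image.shape[0] - 1)
    c1 = min(c1 + margin, image.shape[1] - 1)
    return image[r0:r1 + 1, c0:c1 + 1]

# Toy example: a 100x100 image with a "lung" mask in the middle.
img = np.arange(100 * 100, dtype=float).reshape(100, 100)
mask = np.zeros((100, 100), dtype=bool)
mask[30:70, 25:75] = True
cropped = crop_to_mask(img, mask, margin=10)
print(cropped.shape)  # (60, 70)
```

The 10-pixel margin keeps a little context around the lungs while discarding the biased image borders.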

The homogenized and cropped images were then used to build our new dataset for training and testing deep learning models.

Before and after preprocessing (Image by Author)

And finally, does it work?

By applying the t-SNE method to the cropped images, biases were still observed in the dataset, since the unsupervised algorithm still managed to separate the classes 😭:

t-SNE on the pre-processed dataset (Image by Author)

Transfer Learning on pre-processed images

Two Transfer Learning models were tested: the DenseNet121 (121 convolutional layers 😮) and VGG16 architectures, both pre-trained on the huge ImageNet dataset.

PS: A convolutional layer was inserted after the input to “color” the images, since CXR images are by definition grayscale, while the pre-trained networks expect 3-channel inputs.
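Functionally, this “coloring” layer is a 1x1 convolution mapping the single grayscale channel to three channels. A minimal NumPy sketch of that operation (the weights here are fixed for illustration; in the actual model they are trainable):

```python
import numpy as np

def gray_to_rgb_1x1(x: np.ndarray, w: np.ndarray, b: np.ndarray) -> np.ndarray:
    """1x1 convolution mapping (H, W, 1) grayscale to (H, W, 3).

    w: (1, 3) kernel, b: (3,) bias -- trainable in the real model.
    """
    return x @ w + b  # broadcasting applies the 1x1 conv per pixel

# Initialising w with ones reproduces the common "repeat the grayscale
# channel three times" trick as a special case.
x = np.random.default_rng(0).random((224, 224, 1))
w = np.ones((1, 3))
b = np.zeros(3)
rgb = gray_to_rgb_1x1(x, w, b)
print(rgb.shape)  # (224, 224, 3)
```

Letting the network learn w instead of fixing it gives it freedom to weight the input channel differently for each of the three pre-trained input channels.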

Training parameters and accuracy (Image by Author)

The accuracy of the two Transfer Learning models was lower than that of LeNet; however, the Grad-CAM activation maps showed that they mostly look inside the lungs to make their decision.

Grad-CAM activation map applied to VGG16 with pre-processing (Image by Author)

You can also test these 3 models with your own images and observe the Grad-CAM activation maps thanks to the Streamlit app available at the following address:


To summarize, this project involved:

  • Highlighting the dataset biases by dimensionality reduction with the t-SNE and PCA methods.
  • Development of a pipeline to equalize histograms and remove edges from images by segmenting lungs.
  • Transfer Learning with DenseNet121 and VGG16.
  • Using the recent Grad-CAM method to visualize neural network class activation maps.


Possible next steps:

  • Correct the biases still present despite the preprocessing by using new images with different origins for each class.
  • Continue to optimize network performance by training on more epochs and unfreezing some layers of the pre-trained network.
  • Work with radiologists to assess the relevance of the Grad-CAM maps on patient images.
  • Add patient information to the model in addition to images (symptoms, medical history, age, sex, location, date, etc.) in order to make the model more robust and faithful to an actual medical diagnosis.


There is no point in having excellent performance if it rests on biases: you have to limit the biases in your dataset!
In deep learning, it is mainly the data that determines what the algorithms are.


Baptiste Moreau: LinkedIn

Chadi Masri: LinkedIn

Karima Bennia: LinkedIn


[1] Chowdhury, M. E. H., Rahman, T., Khandakar, A., et al. Can AI help in screening viral and COVID-19 pneumonia? IEEE Access, 2020, vol. 8, p. 132665–132676.

[2] Rahman, T., Khandakar, A., Qiblawey, Y., et al. Exploring the effect of image enhancement techniques on COVID-19 detection using chest X-ray images. Computers in Biology and Medicine, 2021, vol. 132, p. 104319.

[3] van der Maaten, L. and Hinton, G. Visualizing data using t-SNE. Journal of Machine Learning Research, 2008, vol. 9, no. 11.

[4] Selvaraju, R. R., Cogswell, M., Das, A., et al. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In: Proceedings of the IEEE International Conference on Computer Vision, 2017, p. 618–626.

Bias in your datasets: COVID-19 case study was originally published in Towards Data Science on Medium.