Computer Vision Application - Detecting Car Exterior Damage

Recent advances in deep learning and computation infrastructure (cloud, GPUs etc.) have made computer vision applications leap forward: from unlocking an office door with our face to self-driving cars. Not many years ago, an image classification task such as handwritten digit recognition (the great MNIST dataset) or basic object (cat/dog) identification was considered a great success in the computer vision domain. However, convolutional neural networks (CNNs), the driver behind computer vision applications, are evolving fast, with advanced and innovative architectures to solve almost any problem under the sun related to the visual system.

Automated detection of car exterior damage and subsequent quantification (damage severity) would help used car dealers (marketplaces) to price cars accurately and quickly by eliminating the manual process of damage assessment. The concept is equally beneficial for property and casualty (P&C) insurers, in terms of faster claim settlement and hence greater customer satisfaction. In this article, I will describe, step by step, the concept of car scratch (the most frequent exterior damage) detection using CNN transfer learning with a TensorFlow backend.

Car damage detection - a typical application of Instance Segmentation

Before going into the details of the business problem and the implementation steps, I will discuss the technique used for this special application of object detection and the rationale behind it. As in most real-world computer vision problems, here too we will leverage transfer learning from a suitable pre-trained CNN to save the enormous time needed to retrain the entire weight matrix. For the object detection part we have a few techniques to choose from - R-CNN, Fast R-CNN, Faster R-CNN, SSD etc. To get an overview of these techniques, I encourage you to read this article. In brief, like every object detection task, here too we have the following three subtasks:

A) Extracting Regions of Interest (ROI): The image is passed to a ConvNet, which returns regions of interest based on methods like selective search (R-CNN) or a Region Proposal Network (RPN, for Faster R-CNN); an RoI pooling layer is then applied to the extracted ROIs to make sure all the regions are of the same size.

B) Classification task: Regions are passed on to a fully connected network, which classifies them into different image classes. In our case, these will be scratch (‘damage’) or background (car body without damage).

C) Regression task: Finally, bounding box (BB) regression is used to predict the bounding box for each identified region, tightening the boxes (getting the exact BB-defining relative coordinates).
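Bounding-box quality in these subtasks is typically judged by Intersection over Union (IoU), which also drives the proposal filtering discussed further down. A minimal plain-Python sketch, with boxes as (x1, y1, x2, y2) tuples (the function name is mine, for illustration only):

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    # Coordinates of the intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# A box compared with itself has IoU 1.0; disjoint boxes have IoU 0.0
print(iou((0, 0, 2, 2), (0, 0, 2, 2)))  # 1.0
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1/7, about 0.143
```

An IoU of 1.0 means a perfectly tight box; the regression step tries to push predicted boxes toward that value.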

However, in our case, arriving at square/rectangular BBs alone is not sufficient, as car scratches/damages are amorphous (without a clearly defined shape or form). We need to identify the exact pixels in the bounding box that correspond to the class (damage). Only the exact pixel locations of the scratch allow us to identify the location and quantify the damage accurately. So we need to add another step - semantic segmentation (pixel-wise shading of the class of interest) - to the pipeline, for which we will use the Masked Region-based CNN (Mask R-CNN) architecture.

Mask R-CNN:

Mask R-CNN is an instance segmentation model that identifies a pixel-wise delineation for each object class of interest. Mask R-CNN thus has two broad tasks - 1) BB-based object detection (also called the localization task) and 2) semantic segmentation, which segments individual objects at the pixel level within a scene, irrespective of their shapes. Putting these two tasks together, Mask R-CNN performs instance segmentation for a given image.

Although a detailed discussion of Mask R-CNN is beyond the scope of this article, let’s take a look at its basic components and get an overview of the different losses.

Mask R-CNN Components (Source)

So essentially Mask R-CNN has two components - 1) BB object detection and 2) the semantic segmentation task. For the object detection task it uses an architecture similar to Faster R-CNN. The only difference in Mask R-CNN is the ROI step - instead of ROI pooling it uses RoIAlign, which preserves pixel-to-pixel alignment of the ROIs and prevents information loss. For the semantic segmentation task, it uses a fully convolutional network (FCN). The FCN creates masks (in our case, binary masks) around the BB objects by performing a pixel-wise classification of each region (distinct object of interest). Overall, Mask R-CNN minimizes a total loss comprising the following losses from each phase of instance segmentation. Before jumping into the different loss definitions, let’s introduce some important notation.

1) rpn_class_loss: The RPN anchor classifier loss is calculated for each ROI, summed over all ROIs of a single image, and the network-level rpn_class_loss sums this over all images (train/validation). This is nothing but a cross-entropy loss.

2) rpn_bbox_loss: The network RPN BB regression loss is aggregated in the same way as rpn_class_loss. The bounding box loss values reflect the distance between the true box parameters - that is, the (x, y) coordinates of the box location, its width and its height - and the predicted ones. It is by nature a regression loss, and it penalizes larger absolute differences (approximately quadratically for small differences, and linearly for large differences).

Given an image, the RPN phase extracts many bottom-up region proposals of probable object locations and then suppresses region proposals using a ≥ 0.5 IoU (Intersection over Union) criterion; it calculates rpn_class_loss (a measure of the correctness of these refined regions) and how precise they are (rpn_bbox_loss). The exact loss computation requires a somewhat complex non-linear transformation between the centers (predicted vs. ground truth) and between the widths and heights (predicted vs. ground truth). Precisely, the network reduces the error between the predicted BB coordinates (tx, ty, th, tw) - the location of the proposed region - and the target (vx, vy, vh, vw) - the ground-truth labels for the region. After incorporating the smooth L1 loss function for class ‘u’ and predicted bounding box t, rpn_bbox_loss would be:
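The equation image did not survive here; as a plain-Python sketch (function names are mine, not from the repository), the smooth L1 loss from Fast R-CNN and its sum over the four box offsets behave as follows:

```python
def smooth_l1(x):
    """Smooth L1: quadratic for |x| < 1, linear beyond."""
    ax = abs(x)
    return 0.5 * ax ** 2 if ax < 1 else ax - 0.5

def rpn_bbox_loss(t, v):
    """Sum of smooth L1 over the (x, y, h, w) offsets of one proposal,
    t = predicted coordinates, v = ground-truth targets."""
    return sum(smooth_l1(ti - vi) for ti, vi in zip(t, v))

# Small residuals are penalized quadratically, large ones linearly
print(smooth_l1(0.5))   # 0.125
print(smooth_l1(3.0))   # 2.5
print(rpn_bbox_loss((0.1, 0.2, 0.0, 0.0), (0.0, 0.0, 0.0, 0.0)))  # about 0.025
```

The quadratic region keeps gradients small near the optimum, while the linear region makes the loss robust to outlier proposals.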

So in the RPN step, the total network loss is:
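The original figure carrying this equation is missing; in the notation of this article, the RPN-phase loss is simply the sum of the two components just described:

```latex
L_{\text{RPN}} \;=\; \underbrace{L_{\text{cls}}}_{\texttt{rpn\_class\_loss}} \;+\; \underbrace{L_{\text{bbox}}}_{\texttt{rpn\_bbox\_loss}}
```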

3) mrcnn_class_loss: The principle of computing this loss is the same as for rpn_class_loss; however, this is the classification loss at the Mask R-CNN head, incurred while classifying the refined regions into the final classes (damage vs. background) for the segmentation stage.

4) mrcnn_bbox_loss: The principle of computing this loss is the same as for rpn_bbox_loss; however, this is the BB regression loss at the Mask R-CNN head, incurred during Mask R-CNN bounding box refinement.

5) mrcnn_mask_loss: This is the binary cross-entropy loss for the mask head during masking of the exact object location (the amorphous exterior car damage locations). It penalizes wrong per-pixel binary classifications - foreground (damage pixels) vs. background (car body pixels) - with respect to the true class labels.
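As a sketch of what this loss measures, here is per-pixel binary cross-entropy in NumPy (the array values are made up; the real implementation works on per-instance mask crops):

```python
import numpy as np

def mask_bce(y_true, y_pred, eps=1e-7):
    """Mean per-pixel binary cross-entropy between a ground-truth binary
    mask and a predicted foreground-probability map."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return float(np.mean(-(y_true * np.log(y_pred)
                           + (1 - y_true) * np.log(1 - y_pred))))

gt = np.array([[1, 0], [0, 1]], dtype=float)   # damage vs. background pixels
good = np.array([[0.9, 0.1], [0.1, 0.9]])      # confident, mostly correct
bad = np.array([[0.1, 0.9], [0.9, 0.1]])       # confidently wrong
print(mask_bce(gt, good))  # about 0.105
print(mask_bce(gt, bad))   # about 2.303
```

Confidently wrong pixels are punished much harder than confident correct ones, which is exactly what pushes the mask toward the true scratch outline.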

While the first two losses are generated during the BB object detection step, the last three are generated during the segmentation stage. So during training the network minimizes an overall loss comprising all five components (for both training and validation).
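Put as a single objective, using the loss names above, the overall training loss is:

```latex
L_{\text{total}} = L_{\texttt{rpn\_class}} + L_{\texttt{rpn\_bbox}} + L_{\texttt{mrcnn\_class}} + L_{\texttt{mrcnn\_bbox}} + L_{\texttt{mrcnn\_mask}}
```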

Business Problem

In the used car industry (both marketplaces and brick-and-mortar dealers), apart from a car’s functionality, equipment availability and health - which can only be assessed by a test drive or manual inspection - external body damage (scratches, dents, repainting etc.) plays a vital role in deciding the accurate price of the vehicle. In most cases, this damage is detected and assessed manually from car images during the evaluation process. However, the latest computer vision frameworks can detect the damage location on the car body and help pricers quantify the damage without much manual intervention. This concept will also help car insurers assess damage automatically and process claims faster.

In the following section, I will briefly discuss data preparation and the implementation of this concept on real-life car images using Mask R-CNN. The detailed code, along with all inputs (images and annotations) and outputs, can be found at my GitHub repository.

Step1: Data Collection - Although we will leverage transfer learning from a suitable pre-trained CNN architecture (weights), we need to customize the network for our specific use to minimize the application-specific loss - the loss due to pixel-level mismatch of the damage location between ground truth and prediction. So we will train the network on 56 images of car damage collected from Google, of which 49 images are used for training and 7 for validation.

Step2: Data Annotation - As the concept falls into the supervised learning regime, we need to label the data. In the computer vision object detection or localization context, this labeling is called annotation. For our application, it means identifying the region of damage in an image and marking it accurately along the boundary of the scratch. For annotation I used the VGG Image Annotator (VIA) at this link. Using this tool, I uploaded all my images and drew polygon masks along the damage boundary of each image as follows.

After annotating all the images, I downloaded the annotations in .json format, separately for the training and validation images.
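For reference, here is a sketch of reading polygons back out of such a VIA export (the sample dictionary below is made up for illustration, mirroring the VIA region-data schema):

```python
import json

# A made-up VIA-style export for one image with two polygon regions
via_export = json.loads("""
{
  "car1.jpg12345": {
    "filename": "car1.jpg",
    "regions": [
      {"shape_attributes": {"name": "polygon",
                            "all_points_x": [10, 40, 40, 10],
                            "all_points_y": [10, 10, 30, 30]}},
      {"shape_attributes": {"name": "polygon",
                            "all_points_x": [60, 80, 70],
                            "all_points_y": [60, 60, 80]}}
    ]
  }
}
""")

def load_polygons(annotations):
    """Return {filename: [(xs, ys), ...]} for every annotated image."""
    out = {}
    for a in annotations.values():
        regions = a["regions"]
        # Older VIA versions export regions as a dict, newer ones as a list
        if isinstance(regions, dict):
            regions = list(regions.values())
        out[a["filename"]] = [(r["shape_attributes"]["all_points_x"],
                               r["shape_attributes"]["all_points_y"])
                              for r in regions]
    return out

polygons = load_polygons(via_export)
print(len(polygons["car1.jpg"]))  # 2 damage polygons
```

These (x, y) point lists are exactly what the dataset-loading step later converts into binary masks for training.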

Step3: Environment Set-up - This is one of the important steps before training the model on the collected images and annotations (labels), as I will use the ‘Matterport Mask R-CNN’ repository to leverage pre-trained CNN weight matrices built on standard datasets like the COCO dataset, ImageNet etc., as well as its custom functions for data processing and preparation, configuration setup, model training, creating a log file to save iteration-wise weight-matrix objects and network losses, object detection, masking of the detected localized areas and so on. To run the custom training function on the images and annotations, we first need to clone the repository and follow the exact file-folder structure described there. The Matterport Mask R-CNN implementation is built on top of TensorFlow and Keras. The following steps precede the training process.

a) Keep the training and validation images, with their respective annotation files, in separate sub-folders named ‘train’ and ‘val’ inside the data folder (I named it ‘custom’).

b) Based on the computational infrastructure, the desired detection precision and the training steps, we need to define the training configuration.

class CustomConfig(Config):
    """Configuration for training on the custom dataset.
    Derives from the base Config class and overrides some values.
    """
    # Give the configuration a recognizable name
    NAME = "scratch"

    # We use a GPU with 6GB memory, which can fit only one image.
    # Adjust down if you use a smaller GPU.
    IMAGES_PER_GPU = 1

    # Number of classes (including background)
    NUM_CLASSES = 1 + 1  # Car background + scratch

    # Number of training steps per epoch
    STEPS_PER_EPOCH = 100

    # Skip detections with < 90% confidence
    DETECTION_MIN_CONFIDENCE = 0.9

c) Lastly, we need to choose the starting point - the pre-trained weight matrix object for the training process. I chose mask_rcnn_coco.h5, which is pre-trained on the COCO dataset.

Step4: Loading the Datasets - Here we load the training and validation images and tag each image with its respective labeling or annotation. I customized the Mask R-CNN code for our application (class label, directory path, shape standardization etc.) so that it loads the images and annotations and adds them to a CustomDataset class. The code can be found at my GitHub repository.

class CustomDataset(utils.Dataset):

    def load_custom(self, dataset_dir, subset):
        """Load a subset of the dataset.
        dataset_dir: Root directory of the dataset.
        subset: Subset to load: train or val
        """
        # Add classes. We have only one class to add.
        self.add_class("scratch", 1, "scratch")

        # Train or validation dataset?
        assert subset in ["train", "val"]
        dataset_dir = os.path.join(dataset_dir, subset)

Step5: Network Training - Now we refine the base ‘mask_rcnn_coco.h5’ weights by training the model on the real images; after each iteration (epoch) the updated weight matrix is saved in the ‘logs’ folder. Iteration/epoch-wise loss statistics are also saved, so they can be monitored in TensorBoard.

def train(model):
    """Train the model."""
    # Training dataset.
    dataset_train = CustomDataset()
    dataset_train.load_custom(args.dataset, "train")
    dataset_train.prepare()

    # Validation dataset
    dataset_val = CustomDataset()
    dataset_val.load_custom(args.dataset, "val")
    dataset_val.prepare()

    # *** This training schedule is an example. Update to your needs ***
    # Since we're using a very small dataset, and starting from
    # COCO trained weights, we don't need to train too long. Also,
    # no need to train all layers, just the heads/last few layers should do it.
    print("Training network heads")
    model.train(dataset_train, dataset_val,
                learning_rate=config.LEARNING_RATE,
                epochs=30,
                layers='heads')

We need to run the training code (.py file) on the images with the following commands:

### Train the base model using pre-trained COCO weights (I ran with these; download 'mask_rcnn_coco.h5' before starting the training)
py train.py --dataset=C:/Users/Sourish/Mask_RCNN/custom --weights=coco
### Train the base model using pre-trained ImageNet weights (for this, download the ImageNet weights first)
py train.py --dataset=C:/Users/Sourish/Mask_RCNN/custom --weights=imagenet
### We can even resume from the latest saved callback (latest saved weights)
py train.py --dataset=C:/Users/Sourish/Mask_RCNN/custom --weights=last

and the magic starts.

Step6: Model Validation - After each iteration (epoch) the updated weight matrix is saved in the ‘logs’ folder, and iteration/epoch-wise loss statistics are saved so they can be monitored in TensorBoard.

Although most of the model training process is standardized and beyond our control, we can view the different training and validation loss components (as described in the earlier section) in TensorBoard. Also, from the saved callbacks (saved weight matrices), we can check the histograms of the weights and biases.

Step7: Model Prediction - After satisfactory loss monitoring - ideally, monotonically decaying training and validation losses - we can test the model object on randomly picked validation images to see the prediction (car damage masking) accuracy.

image_id = random.choice(dataset.image_ids)  # select a random image from the validation dataset
image, image_meta, gt_class_id, gt_bbox, gt_mask =\
    modellib.load_image_gt(dataset, config, image_id, use_mini_mask=False)  # image loading
# Run object detection
results = model.detect([image], verbose=1)
# Display results
ax = get_ax(1)
r = results[0]
visualize.display_instances(image, r['rois'], r['masks'], r['class_ids'],
                            dataset.class_names, r['scores'], ax=ax,
                            title="Predictions")
log("gt_class_id", gt_class_id)
log("gt_bbox", gt_bbox)
log("gt_mask", gt_mask)
# Showing damage polygons on the car body
print('The car has: {} damages'.format(len(dataset.image_info[image_id]['polygons'])))

and here is the prediction.

And the prediction looks decent.

Business Implementation and Road Ahead

Used car dealers/car insurers can install infrastructure with high-resolution cameras at suitable angles and locations to capture standardized (size) images of different car body sections (front, back, sides etc.) and detect all possible exterior damage. The concept can also be deployed as a mobile app or an API solution, which can ease the car evaluation process.

Further, after detection and masking of car damage, the process can help car evaluators/claim settlement personnel quantify damage severity, in terms of dimensions and the approximate relative area of the damage (w.r.t. the car’s surface area). Most importantly, since we are leveraging transfer learning, we do not have to collect many images and annotations, and as model training starts from trained weights (‘coco’), we do not need to train for long. This concept can also be extended to detect other types of visible car damage/faults.
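As an illustration of the severity idea, a predicted binary mask converts directly into a relative damaged-area figure; a NumPy sketch with made-up sizes:

```python
import numpy as np

def damage_area_fraction(mask):
    """Fraction of image pixels flagged as damage in a binary mask."""
    return float(mask.sum()) / mask.size

# A toy 100x100 mask with a 10x20 damaged patch
mask = np.zeros((100, 100), dtype=bool)
mask[40:50, 30:50] = True
print(damage_area_fraction(mask))  # 0.02
```

With a known camera setup (fixed distance and resolution), such pixel fractions can be calibrated to physical dimensions for pricing or claim rules.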

CNN Application - Detecting Car Exterior Damage (full implementable code) was originally published in Towards Data Science on Medium.