From an idea to a native application with YOLOv5, TorchServe & React Native

Photo by Kevin Ku on Unsplash

Whenever there is an article on an end-to-end deep learning project, it consists of training a deep learning model, deploying a Flask API, and then making sure it works, or it consists largely of creating a web demo using Streamlit or something similar. The problem with this approach is that it follows a straightforward, typical path that has been tried and tested. Replace a single piece of the puzzle with an equivalent, such as a sentiment analysis model with a classification model, and a new project can be created, but the wireframe remains mostly the same. The approach is quite valid from an educational perspective since it teaches you about the domain, but a typical deep learning project, from an engineering perspective, can differ a lot.

Engineering is more about designing and building. Knowing the “how-to” is more important than following a set pattern or a step-wise procedure to build a project. In 2021, when there are so many alternative frameworks to work with, multiple approaches to the same problem, and varying ways to deploy models and build demos for them, it is important to know which to choose, and more importantly how to choose one out of the available options.

Note: For the rest of this article, the terms “project” and “deep learning project” are used interchangeably.

When one starts learning about deep learning concepts, projects are made to learn a specific technology or sub-domain, such as NLP or CV. Today, if you go to resources like https://paperswithcode.com/, you’ll find research on problems that span several domains or have become exceedingly difficult to solve with traditional supervised learning. To tackle these problems, approaches such as self-supervised learning, semi-supervised learning, and causal inference are being experimented with. This leads to an important question: what is a good way to approach and work on a project? Should a domain be picked and problems within it be worked out, or should a problem be chosen first, followed by learning whatever solving it involves? Let's try to answer some of these questions in this article.

The rest of this article talks about the process of a specific project. Some of its components might not be as familiar as others, but the point is to walk you through the process, and specifically the thought process, of working on such a project, rather than to act as a tutorial for it. And in keeping with the idea of not having any fixed path or set pattern, this article is one example of how a project can be worked out, and it leaves out details that aren’t directly connected to the discussion.

Let’s take a look at the end product of this project and what it involves, before any further discussion.

This project consists of building an app which, when shown food ingredients, tells you which dishes these ingredients can be used in. The simple use case of the application is to capture an image of the ingredients; the app then detects the ingredients and presents you with a list of the detected ingredients along with the probable recipes in which they can be used.

Project Architecture

The project is broken down into three different parts. The first part consists of working with a deep learning model to detect the food ingredients, the second part consists of creating an endpoint to deploy it as a service, and the third part consists of actually using it in a native application.

At this point, you must be wondering how this is different from any other smooth-sailing project walkthrough or demo, but as you keep reading, the difference will become more and more apparent. The purpose of this article is not to spoon-feed a step-by-step procedure to create such a project, but rather to give you an intuition for what is involved in coming up with one, and what active decisions may be involved at various stages.

At this point, it would be a pleonasm to say that there is no set formula for approaching a problem or working on a project, but there is something that always helps.

“Choose something that excites you. Choose something that encourages you to work hard.”
- Abhishek Thakur

Knowing what you are working on is important, but it’s equally important to know why you are working on it. It can be extremely helpful when things aren’t working as you want or expect them to (it happens, a lot!). It keeps you encouraged and gives you a clear perspective on the project. But that’s it. That’s the only pseudo-set piece of the puzzle. The rest can be looked into as we go along with the project.

Let’s talk a little about the motivation for this project then. I am a huge fan of MasterChef (most of them), and for those who are living under a rock and haven’t ever watched it, just know that it’s about cooking. There is one segment on the show, called the Mystery Box Challenge, in which the contestants have to come up with a dish given some mystery ingredients. It made me think about how I would tackle this problem while remaining in my comfort zone of deep learning and software engineering. And the light bulb went off, and so was born the idea of the MBC App.

Taking a small detour at this point, it would be good to talk about how close this process is to a real-life project. More often than not, there is either an existing product or a new product in the works, and using some kind of AI is one of the alternatives. Today, everyone from startups to large tech giants puts AI as their first option for a solution, hence our thought process conforms to some degree with that of the industry.

Now that we have the motivation resolved, and we have decided not to follow any particular path to solving the problem, do we not think of a solution at all? Well, that’s too bold for even us deep learning aficionados. There needs to be a clear distinction between a process and a methodology. “Process” is a broad term that describes what needs to be done to solve the problem, whereas a “methodology” describes the specific steps taken to get a particular part done, and these can differ a lot. For example, knowing whether to approach a problem as regression or classification can be part of the process, whereas choosing a specific backbone from ResNet, EfficientNet, or VGG is part of the methodology.

So, let’s talk about the process that would be involved in solving the problem at hand. We need a way to get the ingredients. It could be as simple as asking the user to type in the names, but why wouldn’t they just google that instead of using the app we are so passionately building? So could it be a multiclass classification problem, where the user would simply either click or upload a picture of the ingredients and the app would recommend recipes based on that? It could be a good alternative, but it sounds like too safe an option to experiment with. So what are the alternatives to this? We could use object detection to detect different ingredients. Even that involves a choice: whether to approach it from a supervised perspective or use some self-supervised localization technique. The latter, however, is out of the scope of our discussion, so we’ll stick with supervised object detection.

Now that that’s settled, how do we want to create this application? Should it be a script, which users would run on their systems after installing all the dependencies, or should it be a web deployment? Both alternatives sound charmingly wrong for the problem that we are trying to solve. Creating a native application that can be used from the get-go sounds both user-friendly and right for the task.

Now let’s dive into the three parts that constitute this project and talk about how each of them works.

YOLOv5 Training Pipeline

We decided to use an object detection model to detect different ingredients as part of the process, but what we didn’t decide then is how this would be accomplished, and which specifications to choose from the various available options. There are various popular object detection models out there, such as Faster R-CNN, Mask R-CNN, SSD, and YOLO. You can find papers with a comprehensive comparison of these models, but we would rather keep it short and simple, and justify our choice rather than focus on the ones we didn’t choose.

Source: https://github.com/ultralytics/yolov5

YOLOv5 is one of the latest additions to the list of these object detection models. According to the benchmarks published in its repository, it compares favorably with other recent models in terms of both prediction speed on GPU and the COCO average precision (AP) metric, which ranges from 0 to 1. It has a well-written code base, and it provides model variants of several sizes to choose from, depending on the scale and requirements of the project. Along with that, it provides a good amount of portability. Although written in PyTorch, it can be converted to formats such as ONNX, CoreML, TFLite, and TorchScript (which we’ll be using). Its ability to convert into ONNX itself opens up a lot of possibilities. Hence, due to its well-written code base and good portability, we choose YOLOv5 as our object detection model, and specifically YOLOv5m, as a good trade-off between speed and AP.

Great! Now, all we need to do is train the object detection model and we are good to go, right? To burst the bubble, no. We talked plenty about the process, but we didn’t talk about the data itself, which is one of the main elements of any data science pipeline. So let’s put a pin in the model and process, and talk about data.

Let’s talk about another big distinction at this point, first. We always hear terms and phrases like, “Big Data”, “Data is the new gold”, “Data in abundance”, but the fact of the matter is that unstructured and unprocessed data is in abundance. And again, when we start learning about Deep Learning and Data Science in general, we come across datasets like Iris or MNIST, etc. These are well-processed data, and it takes effort to do that. This effort goes unrealized most of the time, but consider curating a dataset with 70,000 images of different digits, of different handwriting, or something even bigger and more relevant to the project, such as the COCO dataset. Collecting and processing such large-scale datasets is neither easy nor cheap. Most of the time in professional scenarios, the data relevant to the task is not available and rather needs to be collected, curated, and processed well before even considering starting the whole training-testing-experimentation cycle.

And such is our case. We need a dataset of food ingredients, with bounding box information of different labels across all these images. And spoiler alert, such a dataset is not readily available, and hence we need to put in the extra effort to prepare the dataset ourselves, which is more of a manual boring task than a fun coding challenge, but again, a very crucial part of the whole lifecycle.

This step in the project lifecycle involves collecting data, annotating it with bounding box information, and structuring it before it can be passed on to the model for training. For the sake of simplicity, we choose to work with 10–20 ingredients rather than a whole bunch of ingredients, sleepless nights, and a jug full of coffee. The ingredients are chosen according to their frequency in different dishes in the Indian Food 101 dataset. As of now, only 10 ingredients are chosen. This dataset will also be important in a later part, when we have to predict dishes from the recognized ingredients, but more on that later.

Curry leaves, Garam masala, Ghee, Ginger, Jaggery, Milk, Rice flour, Sugar, Tomato, Urad dal

Now that we have our list of the 10 ingredients that we want to work with, we need a way to collect this data, and in a well-structured way at that. There could be multiple ways of doing this. One can go around flaunting their photography skills and take pictures of these 10 ingredients, but that sounds like too much manual labor. Engineering is also about being crafty and lazy, which leads to smart alternatives to painstaking work. Hence, an easier alternative would be to download the data from different free sources and web searches. But an even smarter approach would be to write a web scraper to get the data automatically from different free sources.

We choose the third option. Writing a web scraper to do the task for us is better than clicking and collecting images manually ourselves. We collect the data so that each label (food ingredient) has a separate folder, and we can control the number of images we want to scrape.

food-ingredients/
|--curry-leaves/
|--garam-masala/
.
.
.
|--urad-dal/
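
The folder-per-label collection described above could be sketched roughly as follows. This is not the scraper used in the project; it is a minimal illustration that assumes you already have a list of image URLs for each ingredient. The helper names `label_to_dirname`, `fetch_bytes`, and `save_images` are hypothetical, and how the URLs are gathered (and whether a given source permits it) is deliberately left out.

```python
import re
from pathlib import Path
from typing import Callable, Iterable
from urllib.request import urlopen


def label_to_dirname(label: str) -> str:
    """Turn 'Curry leaves' into 'curry-leaves', matching the folder names above."""
    return re.sub(r"[^a-z0-9]+", "-", label.lower()).strip("-")


def fetch_bytes(url: str) -> bytes:
    """Download one file; swap this out for whatever HTTP client you prefer."""
    with urlopen(url, timeout=10) as response:
        return response.read()


def save_images(label: str, urls: Iterable[str], root: Path, limit: int,
                fetch: Callable[[str], bytes] = fetch_bytes) -> int:
    """Save at most `limit` images for `label` into root/<label-dir>/."""
    target = root / label_to_dirname(label)
    target.mkdir(parents=True, exist_ok=True)
    saved = 0
    for url in urls:
        if saved >= limit:
            break
        try:
            data = fetch(url)
        except OSError:
            continue  # skip unreachable or broken links
        # Assumption: every download is stored with a .jpg extension.
        (target / f"{saved:04d}.jpg").write_bytes(data)
        saved += 1
    return saved
```

Passing the download function in as `fetch` keeps the folder logic separate from the network code, which also makes it easy to try out without touching the internet.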

One task remains before getting back to the model itself, and it has fewer alternatives than the previous one: annotating each image with bounding boxes. We can either do this manually using tools like LabelImg or choose an entirely different approach, such as semi-supervised object localization, but let’s refrain from talking about the latter since it’s out of scope for this discussion. Let’s stay on this track and manually annotate the data. The YOLOv5 project expects a very particular directory structure for the data:

Folder Structure

And hence we convert the above data to replicate the above folder structure with the following piece of code:
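
A minimal sketch of such a conversion is given here, under a few assumptions of ours: the target layout is `images/train`, `images/val`, `labels/train`, and `labels/val` (YOLOv5 finds the label file for an image by swapping `images` for `labels` in its path), the split is 80/20 per ingredient, and the function name `split_dataset` is hypothetical. The project's own script may differ.

```python
import random
import shutil
from pathlib import Path
from typing import Dict


def split_dataset(source: Path, target: Path, val_fraction: float = 0.2,
                  seed: int = 0) -> Dict[str, int]:
    """Copy source/<label>/* into target/images/{train,val}/, split per label."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    counts = {"train": 0, "val": 0}
    for split in counts:
        (target / "images" / split).mkdir(parents=True, exist_ok=True)
        # Empty for now; the annotation step fills these with .txt files.
        (target / "labels" / split).mkdir(parents=True, exist_ok=True)
    for label_dir in sorted(p for p in source.iterdir() if p.is_dir()):
        images = sorted(p for p in label_dir.iterdir() if p.is_file())
        rng.shuffle(images)
        n_val = int(len(images) * val_fraction)
        for index, image in enumerate(images):
            split = "val" if index < n_val else "train"
            # Prefix with the label so names from different folders cannot collide.
            new_name = f"{label_dir.name}_{image.name}"
            shutil.copy(image, target / "images" / split / new_name)
            counts[split] += 1
    return counts
```

Splitting inside each ingredient folder, rather than over the whole pool, keeps every ingredient represented in both the train and val sets.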

After this, using the LabelImg tool, we generate the bounding boxes for each of the images in the train and val splits. And this is when we go back to our initial motivation and remind ourselves why we were working on this project in the first place.
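
For reference, a YOLO-format annotation is a text file with one line per box: the class index followed by the box center, width, and height, all normalized by the image dimensions. LabelImg can save in this format directly, so the conversion below is only there to make the format concrete. The function name `to_yolo_line` and the pixel-corner input are illustrative choices of ours.

```python
from typing import Tuple


def to_yolo_line(class_id: int, box: Tuple[float, float, float, float],
                 image_size: Tuple[int, int]) -> str:
    """Convert a pixel box (x_min, y_min, x_max, y_max) to one YOLO label line."""
    x_min, y_min, x_max, y_max = box
    width, height = image_size
    # Center and size, each divided by the image dimension so values lie in [0, 1].
    x_center = (x_min + x_max) / 2 / width
    y_center = (y_min + y_max) / 2 / height
    box_width = (x_max - x_min) / width
    box_height = (y_max - y_min) / height
    return f"{class_id} {x_center:.6f} {y_center:.6f} {box_width:.6f} {box_height:.6f}"


# A box from (100, 50) to (300, 250) in a 400x500 image, class index 8.
print(to_yolo_line(8, (100, 50, 300, 250), (400, 500)))
# prints "8 0.500000 0.300000 0.500000 0.400000"
```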

Great! Now that the labeling is done and the data is properly structured, we can go ahead and train the YOLOv5 model on our custom dataset. We’ll use YOLOv5m and write our configuration for the data, which would look something like this:
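
As a rough sketch, the data configuration can be generated with a few lines of Python. The keys `train`, `val`, `nc`, and `names` follow the format of the dataset YAML files bundled with YOLOv5; the dataset path, the output file name, and the helper `build_data_config` are assumptions made for this illustration. The order of `names` must match the class indices used during annotation.

```python
from pathlib import Path
from typing import List

# The ten ingredients chosen earlier, in the order of their class indices.
INGREDIENTS = [
    "curry leaves", "garam masala", "ghee", "ginger", "jaggery",
    "milk", "rice flour", "sugar", "tomato", "urad dal",
]


def build_data_config(root: str, names: List[str]) -> str:
    """Return the text of a YOLOv5-style data YAML for the given dataset root."""
    quoted = ", ".join(f"'{name}'" for name in names)
    lines = [
        f"train: {root}/images/train",  # folder with training images
        f"val: {root}/images/val",      # folder with validation images
        f"nc: {len(names)}",            # number of classes
        f"names: [{quoted}]",           # class names, indexed by class id
    ]
    return "\n".join(lines) + "\n"


def write_data_config(path: Path, root: str, names: List[str]) -> None:
    """Write the config to disk, e.g. write_data_config(Path('mbc.yaml'), ...)."""
    path.write_text(build_data_config(root, names))
```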

Since everything is well prepared now, all that remains is to train and export our model in the desired format, for later use. Since it involves running a few scripts either locally or over services like Google Colab, it can easily be skipped here, and the reference for the training and entire part 1 can be found here:

himanshu-dutta/mbc-object-detection-yolov5

Inference Endpoint

Source: https://github.com/pytorch/serve

There is a variety of options available when it comes to serving a machine learning or deep learning model. They range from more familiar ones like Streamlit and Flask, to the slightly less talked about “deployment on the edge”, to new options that keep emerging, like FastAPI (which I am yet to play with).

We need to choose the option that best suits our purpose. Since we decided to create a native application for the project, we could deploy the model natively, with options like ONNX.js or TensorFlow.js. The problem with this option is that ONNX.js doesn’t work properly with React Native (which the application will be built with), and even though TF.js has a compatible framework for React Native, it has some bugs that need to be worked out. This restricts our options to creating an API endpoint. For this purpose, we choose TorchServe, which can serve a TorchScript model; TorchScript models execute in PyTorch’s C++ runtime rather than in the Python interpreter. The reason for choosing TorchServe is the Pythonic style in which the endpoints can be written, along with its TorchScript deployment, which accounts for the speed.

The endpoint handler consists of the methods which would be part of your codebase (in our case, the YOLOv5 codebase) anyway. It consists of the following methods:

  • __init__: Used to initialize the endpoint, for example with image transforms or encoders for text.
  • preprocess: Processes the input batch before it is fed to the model for inference, for example by transforming an input image in the desired way.
  • inference: This method is where the actual inference happens. The preprocessed input is passed to the model, and the output is obtained from it. Other smaller changes to the output can also be performed here.
  • postprocess: After the inference is made, the output is passed to this method, where it can be packaged in the desired way before the response is sent, for example by converting class indices to label names.

Note: Although the method names are predefined, what is actually done inside these methods is completely up to us. The above are the suggested best practices.

Now that we have our endpoint handler in place, all that’s needed is the actual endpoint function, which simply does what we have described in our handler: it preprocesses the data, makes an inference on it, and postprocesses the output to return a response.
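
To make the flow concrete, here is a plain-Python skeleton that mirrors those methods and the entry point. It is a sketch and not the project's handler. A real TorchServe handler would typically subclass `BaseHandler` from `ts.torch_handler.base_handler` and load the TorchScript weights itself, whereas this version takes any callable as the "model" so it can be tried without TorchServe installed. The class name, the Base64 request body, and the `(class_index, confidence, box)` output shape are all assumptions.

```python
import base64
from typing import Callable, Dict, List, Sequence, Tuple

# Assumed label order; it must match the class indices used in training.
LABELS = ["curry leaves", "garam masala", "ghee", "ginger", "jaggery",
          "milk", "rice flour", "sugar", "tomato", "urad dal"]

RawDetection = Tuple[int, float, Sequence[float]]  # (class index, confidence, box)


class IngredientHandler:
    """Skeleton mirroring the handler methods described above."""

    def __init__(self, model: Callable[[bytes], List[RawDetection]],
                 labels: List[str] = LABELS):
        self.model = model    # stand-in for a loaded TorchScript model
        self.labels = labels

    def preprocess(self, requests: List[Dict]) -> List[bytes]:
        # Each request is assumed to carry a Base64 image under "data" or "body".
        return [base64.b64decode(row.get("data") or row.get("body"))
                for row in requests]

    def inference(self, batch: List[bytes]) -> List[List[RawDetection]]:
        return [self.model(image) for image in batch]

    def postprocess(self, outputs: List[List[RawDetection]]) -> List[List[Dict]]:
        # Replace class indices with label names and make the result JSON-friendly.
        return [[{"label": self.labels[class_index],
                  "confidence": round(float(confidence), 4),
                  "box": list(box)}
                 for class_index, confidence, box in detections]
                for detections in outputs]

    def handle(self, requests: List[Dict]) -> List[List[Dict]]:
        """Entry point: preprocess, run inference, then postprocess."""
        return self.postprocess(self.inference(self.preprocess(requests)))
```

The entry point returns one list of detections per request in the batch, so the response for a batch of one image is a list containing a single list.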

The endpoint can be deployed using a command similar to this:
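
A hedged sketch of that step, written as a small Python wrapper around the two command-line tools, `torch-model-archiver` and `torchserve`. The flags used are standard options of those tools, but the model name `mbc`, the file names, and the `model_store` directory are placeholders of ours, and the project's actual script may pass additional options (for example extra files needed by the handler).

```python
import shlex
import subprocess
from typing import List, Tuple


def build_commands(model_name: str, serialized_file: str, handler: str,
                   model_store: str, version: str = "1.0") -> Tuple[List[str], List[str]]:
    """Return the archive command and the serve command as argument lists."""
    archive = [
        "torch-model-archiver",
        "--model-name", model_name,
        "--version", version,
        "--serialized-file", serialized_file,  # exported TorchScript weights
        "--handler", handler,                  # Python file with the handler
        "--export-path", model_store,          # where the .mar file is written
    ]
    serve = [
        "torchserve", "--start",
        "--model-store", model_store,
        "--models", f"{model_name}={model_name}.mar",
    ]
    return archive, serve


def as_shell(command: List[str]) -> str:
    """Render a command the way it would be typed in a terminal."""
    return shlex.join(command)


def deploy(model_name: str, serialized_file: str, handler: str, model_store: str) -> None:
    """Run both commands in order; requires the two tools to be installed."""
    for command in build_commands(model_name, serialized_file, handler, model_store):
        subprocess.run(command, check=True)
```

Calling `as_shell` on either list gives the equivalent terminal command. The `model_store` directory needs to exist before archiving.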

The above script simply archives our saved model to a compatible “model archive” file packing all the necessary scripts, model graphs, and weights, which is then passed to the TorchServe command to start the endpoint.

Native Application

The third part is very specific to the project. There are multiple ways to demo a project or create a full-fledged frontend for it. Sometimes it is good to package your work as a library or framework; at other times, it can be a complete web or native application. And there are also options like Streamlit, which are genuinely a boon when you have to demo something quickly and easily. But coming from an engineering background, learning something new is part of the point.

For this specific project, writing a native application makes the most sense. But for the sake of discussion, let’s consider other options. A close alternative is deploying a simple web application. Since we are working in React Native, a lot of the code would be reusable if we decide to work in React later, so that option stays open. Other than that, web applications don’t feel as user-friendly as a native application would, hence we will go ahead with a native application.

Although we have already decided on the use case of the application, we are yet to figure out what needs to be done for the use case to materialize. There can be two good options for this:

  • We can let the user point the camera at the food ingredients, and the app will keep making async calls to our endpoint and, depending on the response, keep suggesting recipes. For this alternative to work, our endpoint must be highly responsive, which at times won’t be possible since the endpoint might be running on a CPU. Along with this, API calls would have to be made continuously, which is redundant for the use case.
  • Another way to do this is to let the user click an image of the food ingredients, and then make a single API call to get the inference, based on which the recipes can be suggested.

Since the latter option is more feasible, we proceed with it. After deciding on the preferred way of making an inference, we now need to decide on the UI and components needed for the application. We’ll definitely require a Camera component and an Image component (to display still images). Along with that, we need some buttons for the user to click images, flip the camera, and so on, and a section where the ingredients and the recipes can be listed. It would also be nice to have bounding boxes drawn around the detected ingredients. Most of these elements are already built into React Native, but we’ll have to build the “bounding box” component ourselves.

The “bounding box” component is a transparent rectangle box with a Text component on the top left corner, which renders over the Image component directly. Now that all our components are ready, we need to assemble them into a working application, which itself will be a single screen application that looks like the image we went over at the start. The code to make inference would look something like this:
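
The app itself makes this call from JavaScript. To keep the sketches in this article in one language, here is the same request logic expressed in Python. Treat it as an illustration of the flow only: the endpoint URL (TorchServe's default inference port with a placeholder model name `mbc`), the raw Base64 request body, the response shape, and the 0.5 threshold are all assumptions.

```python
import base64
import json
from typing import Dict, List
from urllib.request import Request, urlopen

# Assumed endpoint: TorchServe's default inference port, placeholder model name.
ENDPOINT = "http://localhost:8080/predictions/mbc"


def build_payload(image_bytes: bytes) -> bytes:
    """Encode the captured image as Base64 for the request body."""
    return base64.b64encode(image_bytes)


def filter_detections(detections: List[Dict], threshold: float = 0.5) -> List[Dict]:
    """Keep only detections whose confidence is at or above the threshold."""
    return [d for d in detections if d["confidence"] >= threshold]


def detect_ingredients(image_bytes: bytes, endpoint: str = ENDPOINT,
                       threshold: float = 0.5) -> List[Dict]:
    """Send one image to the endpoint and return the confident detections."""
    request = Request(endpoint, data=build_payload(image_bytes), method="POST")
    with urlopen(request, timeout=30) as response:
        # Assumed response: a JSON list of {"label", "confidence", "box"} objects.
        detections = json.loads(response.read())
    return filter_detections(detections, threshold)
```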

It simply sends an image as a Base64 string and structures the response into a suitable format to be passed on to different components for rendering and further processing. It also makes sure that the predicted bounding boxes and labels for the ingredients have a confidence level above the set threshold.

Now we need to suggest some recipes based on the detected ingredients, which could range anywhere from making an external API call to some existing service to having a record of recipes within the app itself. For the sake of simplicity, we’ll use the list of recipes we extracted from the “Indian Food 101” dataset, and use that to suggest recipes based on the detected ingredients.
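
One simple way to do this matching is to rank recipes by the fraction of their ingredients that were detected. The sketch below is ours, not the app's code (which would live in JavaScript), and the recipe table is a made-up placeholder, not entries taken from the “Indian Food 101” dataset.

```python
from typing import Dict, Iterable, List, Set, Tuple

# Placeholder table: names and ingredient sets are illustrative only.
RECIPES: Dict[str, Set[str]] = {
    "example dish A": {"milk", "sugar", "ghee"},
    "example dish B": {"rice flour", "jaggery", "ghee"},
    "example dish C": {"tomato", "ginger", "garam masala", "curry leaves"},
}


def suggest_recipes(detected: Iterable[str],
                    recipes: Dict[str, Set[str]] = RECIPES,
                    top_k: int = 3) -> List[Tuple[str, float]]:
    """Rank recipes by the share of their ingredients found among the detections."""
    detected_set = set(detected)
    scored = []
    for name, ingredients in recipes.items():
        matched = len(detected_set & ingredients)
        if matched:  # ignore recipes with no detected ingredient at all
            scored.append((name, matched / len(ingredients)))
    # Highest coverage first; ties are broken alphabetically by name.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:top_k]
```

With the placeholder table, detecting ghee and sugar ranks "example dish A" (two of three ingredients) ahead of "example dish B" (one of three).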

Now it is up to us how interactive and user-friendly we want this application to be. We can even add more use cases and features if we want to. But since the application we have worked on so far conforms with the initial idea that we had, we will stop here.

Application Screen

Hopefully, this article gives you good motivation to research and find your own way of working on projects, rather than following any set pattern. Each project has some learning to bestow, but what’s important is to use that learning in future endeavors, both technical and non-technical.

Project References

[1] https://github.com/himanshu-dutta/mbc-object-detection-yolov5

[2] https://github.com/himanshu-dutta/mystery-box-challenge-app

References

[1] YOLOv5

[2] TorchServe

[3] React Native


End to End Deep Learning: A Different Perspective was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.