How to automate data science code with Jenkins and Docker: MLOps = ML + DEV + OPS

Photo by Annamária Borsos

MLOPS = ML + DEV + OPS

How many of the AI models built in enterprises have actually been put into production? With investment in data science teams and technologies, the number of AI projects has increased significantly, and with it the number of missed opportunities to put them into production and assess their real business value. One solution is MLOPS, which brings data science and IT ops together to deploy, monitor and manage ML/DL models in production.

Continuous integration (CI) and continuous delivery (CD), together known as a CI/CD pipeline, embody a culture of agile operating principles and practices for DevOps teams. They allow software development teams to change code more frequently and reliably, and data scientists to continuously test their models for accuracy. CI/CD is a way to focus on business requirements such as improved model accuracy, automated deployment steps or code quality. Continuous integration is a set of practices that drives development teams to continuously implement small changes and check code in to version control repositories. Today, data scientists and IT ops have different platforms (on premises, private and public cloud, multi-cloud …) and tools at their disposal, and these need an automatic integration and validation mechanism for building, packaging and testing applications with agility. Continuous delivery picks up where continuous integration ends by automating the delivery of applications to selected platforms.

The objective of this article is to integrate machine learning models with DevOps using Jenkins and Docker. There are many advantages to using Jenkins and Docker for ML/DL. For example, when we train a machine learning model, we need to continuously test it for accuracy, and this task can be fully automated using Jenkins. When we work on a data science project, we usually spend some time increasing model accuracy and then, when we are satisfied, we deploy the application to production and serve it as an API. Let’s say our model accuracy is 85%. After a few days or weeks, we decide to tune some hyperparameters and add more data to improve the model accuracy. To deploy the new model in production, we then have to build, test and deploy it again, which can be a lot of work depending on the context and environments. This is where the open source automation server, Jenkins, comes in.

Jenkins provides a continuous integration and continuous delivery (CI/CD) system with hundreds of plugins to build, deploy and automate software projects. Using Jenkins has several advantages: it is easy to install and configure, it has a large community, and it can distribute work across different environments. Jenkins has one objective: spend less time on deployment and more on code quality. Jenkins lets us create Jobs, which are the nucleus of its build process. For example, we can create Jobs to test our data science project with different tasks. Jenkins also offers a suite of plugins, Jenkins Pipeline, that supports CI/CD. Pipelines can be written in either Declarative or Scripted syntax.

In this article, we will see how to integrate machine learning models (Linear Discriminant Analysis and a Multi-layer Perceptron Neural Network) trained on EEG data using Jenkins and Docker.

To learn these concepts, let’s consider the following files: Dockerfile, train-lda.py, train-nn.py, train-auto-nn.py, requirements.txt, train.csv, test.csv

train-lda.py and train-nn.py are Python scripts that ingest and normalize EEG data, train two models to classify the data, and test the models. The Dockerfile will be used to build our Docker image, and requirements.txt (joblib) lists the Python dependencies. train-auto-nn.py is a Python script that tweaks the neural network model with different parameters. train.csv contains the data used to train our models, and test.csv contains new EEG data that will be used with our inference models.
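As a rough illustration of what a script such as train-lda.py does, the sketch below normalizes the features, fits a scikit-learn Linear Discriminant Analysis classifier and reports the test accuracy. It is not the code from the repository: the function name train_and_score_lda and the synthetic data standing in for train.csv and test.csv are assumptions made for this example, and it requires numpy and scikit-learn to be installed.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.preprocessing import StandardScaler


def train_and_score_lda(X_train, y_train, X_test, y_test):
    """Normalize the features, fit LDA and return the test accuracy."""
    # Fit the scaler on the training data only, then reuse it for the test data.
    scaler = StandardScaler().fit(X_train)
    clf = LinearDiscriminantAnalysis()
    clf.fit(scaler.transform(X_train), y_train)
    return clf.score(scaler.transform(X_test), y_test)


# Synthetic stand-in for the EEG data (assumption): two well separated classes.
rng = np.random.default_rng(0)
class_0 = rng.normal(loc=0.0, scale=1.0, size=(100, 4))
class_1 = rng.normal(loc=3.0, scale=1.0, size=(100, 4))
X = np.vstack([class_0, class_1])
y = np.array([0] * 100 + [1] * 100)

# Simple split: even rows for training, odd rows for testing.
X_train, X_test = X[::2], X[1::2]
y_train, y_test = y[::2], y[1::2]

accuracy = train_and_score_lda(X_train, y_train, X_test, y_test)
print(f"LDA accuracy: {accuracy:.2f}")
```

Fitting the scaler on the training split only keeps information from the test set out of the normalization step.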

You can find all files on GitHub: https://github.com/xaviervasques/Jenkins

Jenkins Installation

With Jenkins, we will run a container image that has all the necessary requirements installed for training our models. For that, we will create a job chain using the Build Pipeline plugin in Jenkins.

First you need to install Jenkins. Here we install the Long Term Support release.

On Red Hat / CentOS, we can type the following commands:

sudo wget -O /etc/yum.repos.d/jenkins.repo \
https://pkg.jenkins.io/redhat-stable/jenkins.repo
sudo rpm --import https://pkg.jenkins.io/redhat-stable/jenkins.io.key
sudo yum upgrade
sudo yum install jenkins java-1.8.0-openjdk-devel

On Ubuntu, we can type the following commands to install Jenkins:

wget -q -O - https://pkg.jenkins.io/debian-stable/jenkins.io.key | sudo apt-key add -
sudo sh -c 'echo deb https://pkg.jenkins.io/debian-stable binary/ > \
/etc/apt/sources.list.d/jenkins.list'
sudo apt-get update
sudo apt-get install jenkins

Jenkins requires Java. We can install the Open Java Development Kit (OpenJDK). We can see all the needed information to install Jenkins here: https://www.jenkins.io/doc/book/installing/

Jenkins can be installed on many platforms (Linux, macOS, Windows, …) and deployed on a private or public cloud such as IBM Cloud. You can use the following commands to manage the Jenkins service:

Register the Jenkins service

sudo systemctl daemon-reload

Start the Jenkins service

sudo systemctl start jenkins

Check the status of the Jenkins service

sudo systemctl status jenkins

If everything has been set up correctly, the output should report the Jenkins service as active (running).

Jenkins uses port 8080. We can open it using ufw:

sudo ufw allow 8080

And check the status to confirm the new rules:

sudo ufw status

To launch Jenkins, get the IP address of your server by typing hostname -I, then open your browser and enter your IP and port: 192.168.1.XXX:8080

You should see the Unlock Jenkins page, which asks for the initial administrator password.

In your terminal, use the cat command to display the password:

sudo cat /var/lib/jenkins/secrets/initialAdminPassword

Copy the password, paste it in the Administrator password field and click Continue.

Then follow some simple steps to configure your environment.

Scenarios Implementation

Let’s say we need to train our model regularly. In that case, it is recommended to wrap the process in Jenkins to avoid manual work and make the code much easier to maintain and improve. In this article, we will show two scenarios:

- Scenario 1: We will clone a GitHub repository automatically when someone updates the machine learning code or provides additional data. Jenkins will then automatically start the training of a model, provide the classification accuracy and check whether the accuracy is less than 80%.

- Scenario 2: We will do the same as in scenario 1 and add some additional tasks. We will automatically start the training of a Multi-layer Perceptron Neural Network (NN) model, provide the classification accuracy score and check whether it is less than 80%. If it is, we will run train-auto-nn.py, which will look for the best hyperparameters of our model and print the new accuracy and the best hyperparameters.

Scenario 1

We will first create a container image using Dockerfile. You can see previous articles to do it: Quick Install and First Use of Docker, Build and Run a Docker Container for your Machine Learning Model and Machine Learning Prediction in Real Time using Docker and Python Rest APIs with Flask. Then, we will use build pipeline in Jenkins to create a Job chain. We will use a simple model, linear discriminant analysis, coded with scikit-learn, that we will train with EEG data (train.csv). In our first scenario, we want to design a Jenkins process where each Job will perform different tasks:

- Job #1: Pull the GitHub repository automatically when we update our code in GitHub

- Job #2: Automatically start the machine learning application, train the model and give the prediction accuracy. Check if the model accuracy is less than 80%.

Jenkins runs on our Linux machine as a user called jenkins. For the jenkins user to be able to use the sudo command, we can tell the OS not to ask for a password when executing commands. To do that, you can type:

sudo visudo

This will open the sudoers file in edit mode, where you can add the following line:

jenkins ALL=(ALL) NOPASSWD: ALL

An alternative, arguably safer, is to create a file inside the /etc/sudoers.d directory. All files in that directory are processed automatically, so we avoid modifying the sudoers file itself and prevent conflicts or errors during an upgrade. The only requirement is that the sudoers file contains this directive at the bottom (the leading # is part of the directive, not a comment):

#includedir /etc/sudoers.d

To create a new file in /etc/sudoers.d with the correct permissions, use the following command:

sudo visudo -f /etc/sudoers.d/filename

Now we just need to include the relevant line in our file:

jenkins ALL=(ALL) NOPASSWD: ALL

Open Jenkins and click on Freestyle project

Job #1: Pull the GitHub repository automatically when we modify our ML code in GitHub

Click on Create a job and name it download (or any other name you prefer).

In Source Code Management, select Git, insert your repository URL, and your credentials.

Go to Build Triggers and select Poll SCM

You can click on “?” to get some help. As an example, H/10 * * * * means that Jenkins polls the GitHub repository for changes every 10 minutes. This is not really useful for our example, so you can leave the Schedule box empty. If you leave it empty, the job will only run on SCM changes when triggered by a post-commit hook.
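A Poll SCM schedule uses five whitespace-separated fields, in the same order as cron. The short sketch below labels each field of a schedule, which also shows why the spaces between the fields matter. The helper name describe_schedule is an assumption made for this example and is not part of Jenkins.

```python
# Field order of a Jenkins schedule (the same five fields as cron).
FIELD_NAMES = ("minute", "hour", "day_of_month", "month", "day_of_week")


def describe_schedule(spec):
    """Map each of the five whitespace-separated fields to its name."""
    fields = spec.split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}: {spec!r}")
    return dict(zip(FIELD_NAMES, fields))


# "H/10" in the minute field asks for a run every ten minutes; the H lets
# Jenkins pick the offset so that jobs do not all start at the same time.
print(describe_schedule("H/10 * * * *"))
```

A schedule written without spaces would be read as a single field and rejected by this helper.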

Then, click on the “Add build step” drop-down and select “Execute shell”. Type the following command to copy the contents of the GitHub repository to a specific path you previously created:

sudo -S cp * /home/xavi/Public/Code/Kubernetes/Jenkins/code

And click save.

We can click on “Build Now”, and you should see your code in the directory you created. When we modify our files in our GitHub repository (git add, git commit, git push), the files will automatically be updated in that directory.

Job #2: Automatically start to train our machine learning model and give the prediction accuracy

Let’s start by building our docker image (we could put this step directly into Jenkins):

docker build -t my-docker -f Dockerfile .
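If this build step is moved into Jenkins, it can be useful to build the image only when it does not exist yet. The sketch below shows one possible way to do that from Python with the docker image inspect and docker build commands. The function name ensure_image and its run parameter (which makes it possible to replace subprocess.run in a dry run) are assumptions made for this example.

```python
import subprocess


def ensure_image(image, dockerfile="Dockerfile", context=".", run=subprocess.run):
    """Build `image` only if `docker image inspect` cannot find it.

    Returns the list of commands that were executed.
    """
    executed = []

    # `docker image inspect` exits with a non-zero code when the image is missing.
    inspect_cmd = ["docker", "image", "inspect", image]
    executed.append(inspect_cmd)
    found = run(inspect_cmd, capture_output=True).returncode == 0

    if not found:
        build_cmd = ["docker", "build", "-t", image, "-f", dockerfile, context]
        executed.append(build_cmd)
        # check=True raises CalledProcessError if the build fails,
        # which would also mark the Jenkins build step as failed.
        run(build_cmd, check=True)

    return executed
```

On a machine with Docker installed, calling ensure_image("my-docker") from the directory that contains the Dockerfile would build the image on the first call and skip the build afterwards.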

Following the same procedure, we will create a new Job. We need to go to Build Triggers, click on Build after other projects are built and type the name of Job #1 (download in our case):

Click also on Trigger only if build is stable.

Then, in Build, open an Execute shell step and type the commands that automatically start the machine learning application, train the model and write the prediction accuracy to a file (result.txt).

Here, we check whether my-docker-lda is already built, then run our container and save the accuracy of our LDA model in a result.txt file. The next step is to check whether the accuracy of the model is less than 80% and output “yes” if it is the case or “no” otherwise. We can, for example, send an email to provide the information: https://plugins.jenkins.io/email-ext/
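The accuracy check itself can be sketched in a few lines of Python. This is an illustration rather than the commands used in the job: the function name below_threshold is hypothetical, and the sketch assumes that result.txt contains the accuracy as a single number, written either as a fraction (0.85) or as a percentage (85).

```python
import re


def below_threshold(result_text, threshold=0.80):
    """Return "yes" if the accuracy in result_text is below threshold, else "no"."""
    match = re.search(r"\d+(?:\.\d+)?", result_text)
    if match is None:
        raise ValueError("no accuracy value found")
    accuracy = float(match.group())
    # Assumption: a value above 1 was written as a percentage.
    if accuracy > 1.0:
        accuracy /= 100.0
    return "yes" if accuracy < threshold else "no"


print(below_threshold("0.85"))  # no
print(below_threshold("0.72"))  # yes
```

In a job, the text would come from the file, for example below_threshold(open("result.txt").read()).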

To see the outputs of the job, simply go to Dashboard, Last Success column, and select the job, and go to Console Output.

Scenario 2

Let’s keep Job #1 and Job #2 and create a new job.

Job #3: Automatically start the neural network training, give the prediction accuracy, check if accuracy is less than 80% and if yes, run a docker container to perform autoML

You should see in the Console Output the selected parameters and new accuracy.
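The logic of Job #3 can be sketched as follows: if the accuracy is below the threshold, try several hyperparameter combinations and keep the best one. This is a generic exhaustive grid search and not necessarily the method used by train-auto-nn.py. The names tune_if_needed, fake_evaluate, hidden_units and alpha are assumptions made for this example, and the scoring function is a stand-in for a real training run.

```python
from itertools import product


def tune_if_needed(accuracy, evaluate, grid, threshold=0.80):
    """If accuracy is below threshold, try every parameter combination in grid.

    evaluate(params) must return an accuracy between 0 and 1.
    Returns (best_accuracy, best_params); best_params is None when no
    tuning ran or when no combination improved on the initial accuracy.
    """
    if accuracy >= threshold:
        return accuracy, None

    names = sorted(grid)
    best_accuracy, best_params = accuracy, None
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_accuracy:
            best_accuracy, best_params = score, params
    return best_accuracy, best_params


# Stand-in scoring function (assumption); a real job would train the network here.
def fake_evaluate(params):
    return 0.70 + 0.05 * (params["hidden_units"] == 64) + 0.08 * (params["alpha"] == 0.001)


grid = {"hidden_units": [32, 64], "alpha": [0.01, 0.001]}
best_accuracy, best_params = tune_if_needed(0.75, fake_evaluate, grid)
print(best_params, round(best_accuracy, 2))
```

An exhaustive search is easy to read but grows quickly with the number of parameters, so a real job may prefer a randomized or smarter search.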

What’s Next?

Jenkins is really about how to automate data science code.

The next step would be to think about using Ansible with Jenkins. Ansible could play an important role in a CI/CD pipeline: it would take care of deploying the application, so we would not need to worry about how to deploy it or whether the environment is properly set up.

Sources:

https://medium.com/@fmirikar5119/ci-cd-with-jenkins-and-machine-learning-477e927c430d

https://www.jenkins.io

https://cloud.ibm.com/catalog/content/jenkins

https://towardsdatascience.com/automating-data-science-projects-with-jenkins-8e843771aa02


From DevOps to MLOPS: Integrate Machine Learning Models using Jenkins and Docker was originally published in Towards Data Science on Medium.