Kubeflow = Kubernetes + Machine Learning + Flow

1 Overview

Kubeflow is a toolset for running machine learning tasks on a K8S cluster. It provides computing frameworks for TensorFlow, PyTorch, and other machine/deep learning tasks, and it integrates the Argo container workflow engine, called Pipeline. The latest version has many problems with local deployment; most issues on GitHub are deployment-related, so if you are not deploying on GCP, you are likely to run into a variety of problems.

The reason for this good GCP support is that Kubeflow is the open-source version of Google's internal machine learning workflow. Few core developers are invested in it, however, and only a handful of people handle version updates and bug fixes.

Before deploying, learn a few concepts about ksonnet (a command-level sketch follows this list).

  1. registry: ksonnet's template repository; it can be offline or online, as long as it can be reached
  2. env: a deployment target; registries are registered under an env, and switching envs switches the repository of deployment templates
  3. pkg: a package in a registry, containing prototypes and libraries
  4. prototype: referred to in this article as a template; different params can be configured for it
  5. library: contains the API information for k8s; different versions of k8s have different APIs
  6. param: the parameters used to fill templates
  7. component: a template filled with parameters; referred to in this article as a component
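
As a minimal sketch, here is how those concepts map onto ks commands; the jupyter package and the param name below are illustrative, not taken from the official script:

# registry: where the templates come from
ks registry list
# pkg: install prototypes and libraries from a registry
ks pkg install kubeflow/jupyter
# prototype: list the raw templates now available
ks prototype list
# prototype + params -> component
ks generate jupyter jupyter
# param: fill one hole in the template (the param name is illustrative)
ks param set jupyter name my-notebook
# render the component into yaml against an env
ks show default -c jupyter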

2 Deploy

Kubeflow's official documentation provides deployment solutions for various platforms.

https://www.kubeflow.org/docs/started/

In terms of deployment, Kubeflow takes advantage of Ksonnet, a tool that facilitates the management of K8S yaml.

https://ksonnet.io/

The deployment script provided by Kubeflow is essentially a series of ks commands, so when it runs into problems you have to read those commands; it is necessary to familiarize yourself with ksonnet first (I will go through examples later).

2.1 Local Deployment

First of all, we need to find out which components the one-click script deploys, and spend some time understanding each of them. Otherwise, when something goes wrong, we won't know where to start.

# ks component list
COMPONENT           TYPE    NOTES (my annotations, possibly inaccurate)
=========           ====    ===========================================
ambassador          jsonnet Kubeflow's unified gateway for authentication and routing
application         jsonnet there are too many components; this CRD ties them together
argo                jsonnet container task scheduling
centraldashboard    jsonnet the Kubeflow entry UI
jupyter             jsonnet Jupyter
jupyter-web-app     jsonnet JupyterHub web app
katib               jsonnet hyperparameter tuning for deep learning
metacontroller      jsonnet another internal CRD
notebook-controller jsonnet lets the hub spawn multiple notebooks; also a CRD
notebooks           jsonnet Jupyter notebooks
openvino            jsonnet
pipeline            jsonnet Pipeline integration
profiles            jsonnet user permissions and authentication
pytorch-operator    jsonnet CRD operator for PyTorch deep learning jobs
spartakus           jsonnet
tensorboard         jsonnet
tf-job-operator     jsonnet TensorFlow job CRD

Deploying so many components by hand is very cumbersome. The official project provides a script, but the script alone is not enough: if there is a problem, we must read the script's contents. Let me briefly go through its structure.

You have to make sure the correct version is downloaded, otherwise debugging will pile up a whole new set of problems.

There are three folders after downloading. Focus on the scripts folder: the key to the deployment lies in two scripts, kfctl.sh and util.sh.

Because the script is long, and all platforms (gcp/aws/minikube) are mixed together, the focus is still on the ks part, because the core of the deployment is ksonnet.

As for ksonnet, its classic diagram (not reproduced here) shows a template as a cylinder with some pieces missing, such as the image or metadata.name fields in a yaml; params are the abstraction of those missing pieces. ksonnet fills the cylinder up, combining everything into a complete yaml file, which can then be used with kubectl apply -f xxx.yaml or with the ksonnet command ks apply <env> -c <component>.
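
As a minimal sketch of that filling process, assuming a jupyter component has been generated and that image is one of its params (both names are illustrative):

# show which holes the template exposes
ks param list jupyter
# fill one of them in (the image value is illustrative)
ks param set jupyter image gcr.io/kubeflow-images-public/tensorflow-notebook-cpu:v0.5.0
# render the filled template to yaml, then apply it either way
ks show default -c jupyter > jupyter.yaml
kubectl apply -f jupyter.yaml    # or: ks apply default -c jupyter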

Grep the key ks-related commands.

# Run this command to find the ks-related commands
cat util.sh | grep "ks"
# All deployments should call this function to create a common ksonnet app.
# Create the ksonnet app
# Initialize the ks project directory; note that ${KS_INIT_EXTRA_ARGS} comes up again later (see the appendix)
eval ks init $(basename "${KUBEFLOW_KS_DIR}") --skip-default-registries ${KS_INIT_EXTRA_ARGS}
# Also very important: remove the default environment; a proper default env is added back later (see the appendix)
ks env rm default
# Add the kubeflow registry; registries were described in the concepts section above
ks registry add kubeflow "${KUBEFLOW_REPO}/kubeflow"
# Starting here, the script installs the various ks package templates; installing alone is not enough, you still have to generate
ks pkg install kubeflow/argo
ks pkg install kubeflow/pipeline
ks pkg install kubeflow/common
ks pkg install kubeflow/examples
ks pkg install kubeflow/jupyter
ks pkg install kubeflow/katib
ks pkg install kubeflow/mpi-job
ks pkg install kubeflow/pytorch-job
ks pkg install kubeflow/seldon
ks pkg install kubeflow/tf-serving
ks pkg install kubeflow/openvino
ks pkg install kubeflow/tensorboard
ks pkg install kubeflow/tf-training
ks pkg install kubeflow/metacontroller
ks pkg install kubeflow/profiles
ks pkg install kubeflow/application
ks pkg install kubeflow/modeldb
# The generate command fills the parameters into the templates to form the complete yaml described above.
# Note that these components do not correspond one-to-one with the packages above, because one package can contain several templates
ks generate pytorch-operator pytorch-operator
ks generate ambassador ambassador
ks generate openvino openvino
ks generate jupyter jupyter
ks generate notebook-controller notebook-controller
ks generate jupyter-web-app jupyter-web-app
ks generate centraldashboard centraldashboard
ks generate tf-job-operator tf-job-operator
ks generate tensorboard tensorboard
ks generate metacontroller metacontroller
ks generate profiles profiles
ks generate notebooks notebooks
ks generate argo argo
ks generate pipeline pipeline
ks generate katib katib
# cd ks_app
# ks component rm spartakus
# The generate command can also be parameterized
ks generate spartakus spartakus --usageId=${usageId} --reportUsage=true
ks generate application application

What actually creates the K8S resources from yaml is kfctl.sh. In that script, the ks-related commands are found the same way.

# Run this command
cat kfctl.sh | grep "ks"
# Here the components that the application should include are specified
# As mentioned above, application is a CRD: Kubeflow has so many
# components that there has to be something to manage them uniformly
ks param set application components '['$KUBEFLOW_COMPONENTS']'
#
#
#
# Here is the last key step in the script, please note!!
#
#
#
# ks show combines components to generate yaml files
ks show default -c metacontroller -c application > default.yaml
# Here you can see that even something as complex as Kubeflow is ultimately created with kubectl apply
# So if needed, be sure to look at the default.yaml file
# default.yaml is large; depending on the version it runs to somewhere between 5000 and 9000 lines
kubectl apply --validate=false -f default.yaml

P.S. The ks commands are not listed exhaustively here. If you need to debug, read the scripts carefully.
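
For that kind of debugging, a minimal sketch of rendering everything yourself (run inside the generated ksonnet app directory):

# render all components for the default env into one file
ks show default > default.yaml
# count the resource types that would be created
grep "^kind:" default.yaml | sort | uniq -c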

2.2 init process

Initialize through the script.

./kfctl.sh init myapp

After init, check the version.
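
A sketch of that check; the env.sh file is what the v0.5 kfctl.sh writes into the app directory, so the name may differ in other versions:

cd myapp
# the init step records the versions it used
grep VERSION env.sh
# the ksonnet CLI version matters too
ks version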

2.3 generate process

# Note the working directory
cd myapp
../kfctl.sh generate all

After generating, the version information can be checked in the same way.
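
One way to check it is through the env, since the generate step creates the ks_app directory (sample output is in the appendix):

cd ks_app
ks env describe default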

2.4 apply process

# Note the working directory
../kfctl.sh apply all

2.5 Successful deployment

Check the pods.

Check the services (svc).
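
These checks boil down to the following (assuming everything landed in the kubeflow namespace, as the script's default env specifies):

kubectl get pods -n kubeflow
kubectl get svc -n kubeflow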

Access the UI.

kubectl port-forward svc/ambassador 8003:80 -n kubeflow
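
With the port-forward in place, opening http://localhost:8003 in a browser should land on the central dashboard routed through ambassador.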

Check Pipeline.

Run a Pipeline DAG.

Check tf-job-dashboard.

Submit a tf-job. The official packages ship several example components, which can be installed in the following way.

# Note that you need the ks_app directory generated earlier
ks generate tf-job-simple-v1beta2 tf-job-simple-v1beta2
ks apply default -c tf-job-simple-v1beta2

Several tasks are submitted this way; in essence, ks generates the yaml, and ks apply is equivalent to kubectl apply.
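
A quick way to confirm the job object exists (tfjobs is the resource registered by the tf-job-operator CRD; assuming the job name follows the component name):

kubectl get tfjobs -n kubeflow
kubectl describe tfjob tf-job-simple-v1beta2 -n kubeflow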

2.6 Delete

# Note the working directory
../kfctl.sh delete all

3 Issues to be noted

  1. When downloading/installing Kubeflow, be sure to confirm the Kubeflow version, because versions differ greatly from one another!
  2. When generating templates, you need to pay attention to the version of K8S! It can be specified in the script; see the appendix and the sketch after this list.
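
If you need to pin the K8S API spec by hand, here is a minimal sketch of fetching the swagger.json matching your cluster version (v1.14.3 here, matching the env output in the appendix; adjust the tag to your cluster):

# fetch the API spec for your cluster's k8s version, then pass it via --api-spec (see the appendix)
curl -L -o /tmp/swagger.json \
  https://raw.githubusercontent.com/kubernetes/kubernetes/v1.14.3/api/openapi-spec/swagger.json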

If you don't plan to deploy the entire Kubeflow, you can deploy only Jupyter, tf-operator, and so on.

4 Reasons for deployment failure

  1. A full deployment creates many K8S resources and needs considerable machine resources; a local deployment may simply not have enough. GCP recommends 16 cores.
  2. Version issues, including the K8S version, the ksonnet version, the image versions, and so on.
  3. Offline issues: in principle, as long as the K8S cluster is deployed, the images are available locally, and the deployment scripts have been downloaded, no network access is needed to deploy (see the sketch after this list).
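
As a sketch of the offline preparation (assuming default.yaml has already been rendered as shown earlier, and that docker is the runtime on each node):

# collect every image referenced by the rendered yaml, then pre-pull them locally
grep "image:" default.yaml | awk '{print $NF}' | sort -u > images.txt
while read -r img; do docker pull "$img"; done < images.txt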

Common problems include GitHub being unreachable, the need to download K8S's swagger.json file, and so on.

The cost of fully deploying a Kubeflow stack is too high. First, the official documentation is not organized clearly enough and is not updated promptly. Second, it contains too many components, and if you are unfamiliar with some of them, tracking down problems is very difficult. If you do deploy, it is best to go through a cloud vendor: relatively speaking, Kubeflow's vendor-specific deployment scripts are maintained more actively than the local ones. And on GCP, of course, the experience should be the best.

Appendix

# ks needs to read the .kube/config file
# init needs to reach a ks registry, and an offline install requires k8s's swagger.json
eval ks init $(basename "${KUBEFLOW_KS_DIR}") --skip-default-registries --api-spec=file:/tmp/swagger.json
# 
# You can specify the server to determine the version of k8s
ks env add default --server=https://shmix1.k8s.so.db --api-spec=file:/tmp/swagger.json
#
# Note the information for each script run
++ ks env describe default
+ O='name: default
kubernetesversion: v1.14.3
path: default
destination:
  server: https://kubernetes.docker.internal:6443
  namespace: kubeflow
targets: []
libraries: {}'
#
#
# Full deployment script
#
#
export KUBEFLOW_VERSION=v0.5.0
curl https://raw.githubusercontent.com/kubeflow/kubeflow/${KUBEFLOW_VERSION}/scripts/download.sh | sh
cd scripts
./kfctl.sh init myapp
cd myapp
../kfctl.sh generate all
../kfctl.sh apply all