Kubernetes Monitoring and Logging — An Apache Spark Example

I have moved almost all my big data and machine learning projects to Kubernetes and Pure Storage, and I am very happy with this move so far. With this platform, my life as a data engineer / data scientist becomes much easier: it is simpler to deploy, scale and manage my Spark jobs, Presto queries, TensorFlow training jobs and so on. It is also easier to share the infrastructure resources among these projects.

However, when I first set up the environment, I found it a little difficult to understand what was actually going on there. I realised I needed an end-to-end monitoring setup for my Kubernetes cluster and its applications. In this blog, I will use Apache Spark on Kubernetes as an example to share the monitoring and logging stack I use. I want to give an overview here; I will explain the how-to in detail in another blog.

Kubernetes Cluster Monitoring and Alerting

The first thing I needed was a monitoring and alerting solution for my Kubernetes cluster. Prometheus is the most popular solution for this. I use the Prometheus Operator, which provides Kubernetes-native deployment and management of Prometheus and related monitoring components. It is straightforward to deploy the operator using the kube-prometheus repository, which is handy because it is pre-configured to collect metrics from all Kubernetes components. It also includes a default set of dashboards and alerting rules.

Kubernetes API Server monitoring dashboard

On top of that, the customisations I made include:

  • Persistent Volumes for the Prometheus database and Grafana storage.
  • Application-specific custom monitoring and alerting rules.
  • A Slack receiver for alerting.
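
For the custom rules, I add PrometheusRule resources that the operator picks up. The rule below is only an illustrative sketch: the alert name, the threshold and the executor pod-name pattern are my own placeholders, not defaults from kube-prometheus.

```yaml
# Hypothetical PrometheusRule: fire when Spark executor pods restart repeatedly.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: spark-custom-rules
  labels:
    prometheus: k8s
    role: alert-rules
spec:
  groups:
  - name: spark.rules
    rules:
    - alert: SparkExecutorRestarting
      # kube-state-metrics restart counter, filtered on a placeholder executor pod-name pattern
      expr: increase(kube_pod_container_status_restarts_total{pod=~".*-exec-.*"}[15m]) > 3
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "Spark executor pod {{ $labels.pod }} is restarting frequently"
```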

Here is an example alert received in Slack:

Slack notification received from Prometheus AlertManager

Prometheus data is stored in a time series database (TSDB). I want low-latency access to this TSDB, so its PV is backed by FlashArray (provisioned through Pure Storage PSO) in my setup.
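
In the Prometheus custom resource this looks roughly like the snippet below. Treat the names and size as placeholders; pure-block is the block storage class that a PSO installation typically provides, but the class name depends on how PSO is deployed.

```yaml
# Excerpt of a Prometheus custom resource requesting a FlashArray-backed PV.
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: k8s
spec:
  storage:
    volumeClaimTemplate:
      spec:
        storageClassName: pure-block   # PSO-provisioned FlashArray block storage
        resources:
          requests:
            storage: 100Gi             # placeholder size
```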

Besides the Prometheus stack, I also use Node Problem Detector to detect node problems such as Linux kernel and system issues. This is important because these issues may impact the Kubernetes cluster.
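
Once Node Problem Detector is running, extra conditions show up in the node status. The excerpt below is an illustrative sketch of what kubectl get node <name> -o yaml can report; the exact condition types depend on the problem daemons configured.

```yaml
# Illustrative node status excerpt with conditions added by Node Problem Detector.
status:
  conditions:
  - type: KernelDeadlock
    status: "False"
    reason: KernelHasNoDeadlock
    message: kernel has no deadlock
  - type: ReadonlyFilesystem
    status: "False"
    reason: FilesystemIsNotReadOnly
    message: Filesystem is not read-only
```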

Node conditions added by Node Problem Detector

Application Monitoring for Apache Spark

The Prometheus stack provides my Kubernetes environment with a powerful monitoring solution and built-in Kubernetes metrics/dashboards. To monitor applications with it, they need to expose application-specific metrics to Prometheus. Not every application supports this today, but luckily the Spark Operator does support exposing metrics to Prometheus.
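
With the Spark Operator, this is enabled in the SparkApplication spec via a JMX exporter sidecar configuration. The fragment below is a sketch; the exporter jar path and port follow the operator's documented defaults, but verify them against the Spark image you use.

```yaml
# Excerpt of a SparkApplication exposing driver and executor metrics to Prometheus.
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-pi
spec:
  monitoring:
    exposeDriverMetrics: true
    exposeExecutorMetrics: true
    prometheus:
      jmxExporterJar: /prometheus/jmx_prometheus_javaagent-0.11.0.jar  # path inside the Spark image
      port: 8090
```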

Monitoring Spark metrics in Prometheus

It is great to have Spark metrics in Prometheus, but I also want the Spark and History Server UIs that I have been using for a long time in non-Kubernetes environments. While a job is running, the Spark UI is available at driver-pod:4040, so it is easy to access it with Kubernetes port forwarding (e.g. kubectl port-forward <driver-pod> 4040:4040). For the Spark History Server, I configure Spark to send its event logs to a FlashBlade S3 bucket, then deploy a Spark History Server pointing to that bucket. With this, the Spark UI is available even after the job has completed.
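
Concretely, the event log settings can be passed through sparkConf in the SparkApplication. The bucket name, endpoint and credentials below are placeholders for my environment; the endpoint would be a FlashBlade data VIP.

```yaml
# sparkConf excerpt: write Spark event logs to a FlashBlade S3 bucket via the s3a connector.
spec:
  sparkConf:
    "spark.eventLog.enabled": "true"
    "spark.eventLog.dir": "s3a://spark-history/"            # placeholder bucket
    "spark.hadoop.fs.s3a.endpoint": "http://192.168.1.100"  # placeholder FlashBlade data VIP
    "spark.hadoop.fs.s3a.access.key": "REDACTED"
    "spark.hadoop.fs.s3a.secret.key": "REDACTED"
```

The History Server deployment then reads from the same bucket by setting spark.history.fs.logDirectory to the same s3a path.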

Spark History Server on Kubernetes with event logs in FlashBlade S3

Kubernetes Logging

Metrics are important for understanding the state of the system, while logs tell us what actually happened. It is critical to have all logs in one place where I can visualise and analyse them, as they help me monitor and troubleshoot my Kubernetes environment easily.

I use the popular Fluentd, ElasticSearch and Kibana logging stack to collect, index and visualise Kubernetes logs, including system and application logs.

Kubernetes log search in Kibana

This logging stack is also deployed on the Kubernetes cluster. Because high throughput is required at scale, I use FlashBlade NFS (again provisioned through PSO) to back the ElasticSearch data persistent volumes. I may switch to FlashBlade S3 in the future once ElasticSearch supports directly searchable snapshots in S3.
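
The data volume claim looks roughly like this. The claim name and size are placeholders; pure-file is the file storage class a PSO installation typically provides for FlashBlade NFS.

```yaml
# PVC for an ElasticSearch data node, backed by FlashBlade NFS via PSO.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: elasticsearch-data-0
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: pure-file   # PSO-provisioned FlashBlade NFS
  resources:
    requests:
      storage: 500Gi            # placeholder size
```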

Making Kubernetes logs searchable and visualised has helped me identify and solve several issues I had not noticed before. Below is an example log dashboard.

Example Kubernetes log dashboard

Summary and Future Work

By now, I have built a basic monitoring and logging setup for my Kubernetes cluster and the applications running on it. This setup collects cluster-wide and application-specific metrics, Kubernetes events and logs, and presents nice dashboards with a clear overview of my system health.

FlashArray, FlashBlade and PSO help simplify my setup as the persistent storage providers. In case these are not available (such as in the cloud), another attractive option is Portworx, which abstracts the Kubernetes nodes' local storage and presents it to the cluster as a persistent volume class (block storage) with enterprise storage features.

This monitoring and logging setup is easy to deploy and delivers most of the features I need, but it is not complete. Future work I would like to do includes:

  • Collect Kubernetes audit logs which have detailed descriptions of each call made to the kube-apiserver.
  • Unified dashboards with cluster, application and storage monitoring together using Pure Storage Prometheus exporter.
  • Correlate metrics and logs; context-aware logs.
  • Distributed tracing for Kubernetes micro-services.

Having an end-to-end monitoring and logging strategy is crucial to understand and troubleshoot my Kubernetes environment. I consider it to be a core requirement for any production Kubernetes cluster.

Kubernetes Monitoring and Logging — An Apache Spark Example was originally published in Towards Data Science on Medium.