Building AIOps Around Kubernetes


The premise of AIOps is that the IT infrastructure environment has become too complex for humans to manage alone; you need advanced automation to identify anomalies and even perform basic maintenance routines. If you’re already using Kubernetes, you already have a tool that can perform complex automated routines, including self-healing. How can you turn your existing K8s implementation into part of an AIOps workflow?

First, What Does the AIOps Workflow Look Like?


An AIOps workflow needs four components in order to function:

1. Collectors

Collectors acquire, format, and store data from your applications. These sensors routinely poll applications for metrics such as latency, throughput, and CPU load, then store the results in a data warehouse, where each measurement is appended to a time series.
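To make the collector’s job concrete, here is a minimal Python sketch of one polling pass. It assumes a hypothetical `probe` callable that returns a dictionary of metric readings for an application, and it uses an in-memory `MetricStore` class as a stand-in for the data warehouse; both names are illustrative, not part of any real product.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# A time series here is simply a list of (timestamp, value) pairs.
TimeSeries = List[Tuple[float, float]]


@dataclass
class MetricStore:
    """Hypothetical in-memory stand-in for the data warehouse."""
    series: Dict[str, TimeSeries] = field(default_factory=dict)

    def append(self, name: str, timestamp: float, value: float) -> None:
        # Each measurement is appended to the time series for its metric name.
        self.series.setdefault(name, []).append((timestamp, value))


def collect_once(
    app: str,
    probe: Callable[[], Dict[str, float]],
    store: MetricStore,
    clock: Callable[[], float] = time.time,
) -> None:
    """Poll one application (via the hypothetical probe) and store every metric it reports."""
    timestamp = clock()
    for metric, value in probe().items():
        # Format step: namespace each metric by application, e.g. "checkout.latency_ms".
        store.append(f"{app}.{metric}", timestamp, float(value))
```

In practice, a collector would call something like `collect_once` on a fixed schedule and write to a real time-series database rather than a Python dictionary.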

2. Analytics Platforms

Analytics platforms use ETL or ELT pipelines to ingest time series data from your data warehouse. Simpler forms of analytics involve setting thresholds and alerting when a metric exceeds one. More sophisticated forms involve machine learning: these systems learn the normal behavior of a time series and flag anomalies without human intervention.
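The two levels of analytics described above can be sketched in a few lines of Python. `threshold_alert` is the simple fixed-limit check. `zscore_anomaly` approximates “learning normal behavior” with a basic statistical baseline: how many standard deviations a new value sits from the recent mean. This z-score test is a deliberately simple substitute for a real machine learning model, and the function names and the default limit of 3.0 are assumptions chosen for illustration.

```python
from statistics import mean, stdev
from typing import List


def threshold_alert(value: float, limit: float) -> bool:
    """Simplest analytics: alert when a metric exceeds a fixed threshold."""
    return value > limit


def zscore_anomaly(history: List[float], value: float, z_limit: float = 3.0) -> bool:
    """Flag `value` if it lies more than `z_limit` standard deviations
    from the mean of recent history (a simple baseline of 'normal')."""
    if len(history) < 2:
        return False  # not enough data to estimate normal behavior
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return value != mu  # perfectly flat baseline: any change is unusual
    return abs(value - mu) / sigma > z_limit
```

The z-score approach needs no hand-tuned limit per metric, which is the main appeal of baseline-driven analytics over fixed thresholds; its weakness is that it ignores seasonality, which real anomaly detectors usually model.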

3. Rule Engines

Rule engines proceed down an automated decision tree based on the results of your analytics. A severe anomaly in a mission-critical application will probably trigger an alert sent directly to an engineer or administrator. Lower-tier problems, however, can be handled in several other ways.
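A rule engine can be as simple as a chain of conditions. The Python sketch below assumes a hypothetical `Finding` record produced by the analytics stage, with a three-level severity scale and a mission-critical flag. The action names it returns are placeholders for whatever paging, ticketing, or remediation systems you actually use.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    """Hypothetical output of the analytics stage."""
    app: str
    severity: str           # "low", "medium", or "high" (assumed scale)
    mission_critical: bool


def route(finding: Finding) -> str:
    """Walk a small decision tree and return the (placeholder) action to take."""
    if finding.severity == "high" and finding.mission_critical:
        return "page_engineer"      # severe and critical: a human handles it
    if finding.severity == "high":
        return "open_ticket"        # severe but non-critical: queue for review
    if finding.severity == "medium":
        return "auto_remediate"     # hand off to configuration management
    return "log_only"               # low severity: record for trend analysis
```

Ordering the checks from most to least severe keeps the tree readable; in a real deployment these rules would more likely live in configuration than in code.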

4. Configuration Management

Configuration management tools can automatically reset harmful configurations, reboot applications and subsystems, and carry out other automated error-handling procedures.
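One common configuration management task is detecting and reverting configuration drift. This Python sketch assumes configurations can be represented as flat string-to-string dictionaries and that a hypothetical `restart` callback reboots the affected application. Real tools track far richer state, but the compare, restore, and restart loop is the same basic idea.

```python
from typing import Callable, Dict, List, Optional, Tuple

Config = Dict[str, str]


def find_drift(desired: Config, actual: Config) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return {key: (desired_value, actual_value)} for every setting that differs."""
    keys = set(desired) | set(actual)
    return {
        key: (desired.get(key), actual.get(key))
        for key in sorted(keys)
        if desired.get(key) != actual.get(key)
    }


def remediate(desired: Config, actual: Config, restart: Callable[[], None]) -> List[str]:
    """Reset drifted settings in place, restart once if anything changed,
    and return the keys that were corrected."""
    drift = find_drift(desired, actual)
    for key, (want, _have) in drift.items():
        if want is None:
            del actual[key]      # setting should not exist at all
        else:
            actual[key] = want   # restore the known-good value
    if drift:
        restart()                # hypothetical callback that reboots the application
    return list(drift)
```

Restarting only when drift was found keeps the remediation idempotent: running it twice against a healthy configuration does nothing the second time.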

Collectively, these tools give administrators the power to monitor applications at scale, detect errors before they become major issues, and handle lower-tier issues automatically. This gives engineers more time to innovate and to work on major issues when they do occur.

Kubernetes is often mentioned in the same breath as AIOps, so let’s look at how it fits into this context.

AIOps Evolved to Deal with Kubernetes

As it turns out, K8s is a big part of AIOps’s raison d’être. Because K8s manages large numbers of containers, it also generates large amounts of data. The larger it scales, the more data it generates, far more than any team of human analysts could review and interpret by eye. When problems begin to occur, you may not be able to tell whether you’re facing a costly application outage or a minor flutter: with the addition of K8s, 40 percent of IT organizations are now experiencing over a million alerts per day.

Beyond a certain level of complexity, AIOps is the only practical way to manage K8s. Fortunately, there are analytics and automation tools that are either built into Kubernetes or designed specifically for the platform, and these can help users set up AIOps without much additional investment.

  • Kubernetes can act as its own collector: the kubelet exposes resource metrics for the containers it runs, and the audit webhook backend can forward records of API activity to an external service. Prometheus is a widely used monitoring add-on for K8s that scrapes metrics from the platform and stores them as time series. This data can either be visualized in a separate dashboard or fed into a separate analytics program.
  • In terms of analytics, metrics-server, a standard cluster add-on, collects CPU and memory usage for individual pods and exposes it through the Kubernetes Metrics API, where components such as the Horizontal Pod Autoscaler can react to abnormally high or low utilization. You can also use an add-on like Jaeger, a distributed tracing system, to trace issues in K8s back to their source; it can also integrate with Prometheus to create an end-to-end monitoring system.
  • In terms of configuration management, K8s has extensive built-in power for recovering from harmful scenarios. For example, it can automatically restart containers that fail their health checks and replace pods that die. Harder failures are more difficult to recover from automatically, but you can extend this functionality with tools such as Helm, which can roll a release back to a previous revision. Your options for automated error handling in K8s are vast enough to fill a separate article (stay tuned).
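As an illustration of feeding Prometheus data into a separate analytics program, the Python sketch below queries the Prometheus HTTP API (`/api/v1/query`) for per-pod CPU usage and flags pods above a limit. It assumes Prometheus is reachable at `http://localhost:9090` (for example, through `kubectl port-forward`) and that it scrapes the cAdvisor metric `container_cpu_usage_seconds_total` with a `pod` label; the URL, query, and label name are assumptions to adjust for your own cluster.

```python
import json
import urllib.parse
import urllib.request
from typing import Dict

# Assumption: Prometheus is reachable at this address.
PROMETHEUS_URL = "http://localhost:9090"

# Assumption: cAdvisor CPU metric with a "pod" label; per-pod cores used, averaged over 5 minutes.
CPU_QUERY = "sum by (pod) (rate(container_cpu_usage_seconds_total[5m]))"


def build_query_url(base: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def parse_vector(payload: dict) -> Dict[str, float]:
    """Turn an instant-vector response into {pod_name: value}."""
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error')}")
    usage: Dict[str, float] = {}
    for sample in payload["data"]["result"]:
        pod = sample["metric"].get("pod", "<unknown>")
        _timestamp, value = sample["value"]  # Prometheus returns [unix_time, "value_as_string"]
        usage[pod] = float(value)
    return usage


def cpu_by_pod(base: str = PROMETHEUS_URL) -> Dict[str, float]:
    """Run the CPU query against a live Prometheus server."""
    with urllib.request.urlopen(build_query_url(base, CPU_QUERY), timeout=10) as resp:
        return parse_vector(json.load(resp))


def over_threshold(usage: Dict[str, float], cores: float) -> Dict[str, float]:
    """Keep only the pods using more than `cores` CPU cores."""
    return {pod: value for pod, value in usage.items() if value > cores}
```

Calling `over_threshold(cpu_by_pod(), 1.0)` against a live server would return the pods averaging more than one CPU core over the last five minutes, which could then be handed to a rule engine.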

You Can’t Automate What You Can’t See

Implementing AIOps will make K8s that much easier to deal with, but there’s a catch: it’s hard to build an AIOps platform if you can’t map the terrain of your K8s implementation. Rapid changes to data formats and applications may mean that the monitoring tool you set up yesterday breaks tomorrow, and you might have no way of knowing what broke or why.

Enter Device42. Using simple APIs and integrations with your existing tools, we can create a single source of truth about your K8s implementation. With a precise list of applications, software versions, container clusters, and more, you’ll have the perfect roadmap when it comes to implementing an AIOps pipeline for easier Kubernetes management. Try a free demo today!