Leveraging Artificial Intelligence (AI) For Cloud Management
Businesses and business units of all sizes can benefit from cloud computing, but many want neither the cost, performance, and governance concerns of the public cloud nor the complexity and operational overhead of building their own private clouds. Today, some cloud vendors are using artificial intelligence (AI) to simplify private cloud deployment and management, making it possible for clouds to be self-driving (i.e., self-installing, self-healing, and self-managing). In this article, we’ll look at the requirements for a self-driving cloud.
Self-Driving Cloud Requirements
As with any other technology in this space, several systems need to work well together to handle self-monitoring, self-healing, and learning, and to build models for self-optimization. Here is a list of capabilities that need to be present in a self-driving cloud:
- ‘No Day 0’ — Automatic install and configuration: The first step is an install process that does not require much human intervention. The building blocks for a cloud are servers, storage, and networking. With hyper-converged systems, servers and storage are combined and one needs software-defined networking to minimize the reliance on physical network changes. So, the first requirement is a server + storage building block with all the software pre-installed and baked into the operating system image. You just need to image a few servers and power them on. Once that is done, the cloud should come up automatically without admins knowing anything about various services and their persistent stores. The image software should pool together servers, storage, and networking resources to create a highly resilient cloud.
- Integration with other clouds and internal systems: A cloud is not supposed to work in isolation, so one should be able to quickly connect it with existing virtualized infrastructure and other public clouds. Even better would be to add your existing storage systems and make them part of this cloud through open (e.g., representational state transfer or RESTful) APIs. This is an optional step, but it’s critical if you want to leverage your existing investments in storage and servers. Similarly, most customers want to integrate with AD/LDAP as well to have a single source of users and authentication.
- Deploy applications in a self-service manner: The goal for any cloud is to provide an IaaS and PaaS platform that various teams can consume in a self-service manner. For example, developers can use it for application development and continuous integration / continuous delivery (CI/CD); support teams can use it to bring up replicas of customer environments to troubleshoot support issues; sales can bring up quick PoCs for trials; and IT can bring up staging or production deployments of various applications. These steps need to be fully automated so that they can be repeated without much effort. Any cloud solution should provide a self-service interface with pre-built application templates for quick deployment.
- Real-time monitoring for events, stats, logging, and auditing: Since a cloud is a shared environment, one needs to be able to monitor events, stats, and dashboards in real time. This is required to know the state of applications and what actions other users have performed. IT should be able to get logs and audit the actions of all users. For example, if a service has been down since 10 p.m. last night, it is good to know whether a user or script mistakenly shut down a VM providing that service.
- Self-monitoring and self-healing: Any system as complex as a cloud needs to monitor all the critical services and help monitor the workloads. If any hardware component or software service fails, the system should detect and fix the situation. Then, it can alert the admin as to which component had failed. If this was a hardware component like a server, hard disk, SSD, or NIC, the admin could take corrective action to restore the capacity of the system. This is an absolute minimum requirement for a self-driving cloud.
- Machine learning for long-term decision making: Since the self-healing layer takes care of short-term decisions, we need another layer of automation that can observe the cloud and applications over a longer period to help optimize the cloud, improve efficiency, and plan for the future. A self-driving cloud platform collects telemetry, or operational data, which guides data scientists in developing machine learning algorithms that model the cloud's behavior. These algorithms then help customers make decisions.
This layer should observe usage to do predictive capacity modeling and order new servers. It should also determine what sort of servers to add in terms of their CPU, memory, and I/O ratio. For instance, if the applications are more CPU-intensive, one should order servers with more cores and less storage. Another area is optimizing the size of VMs based on utilization. Customers pay for peak capacity on public clouds, but the average utilization is less than 15% in most cases. At that point, you are paying five times the cost that you would pay in a private environment if you consolidated the workloads. All these savings can be passed to you instead of cloud vendors keeping them. A learning system can also help you detect anomalies in your environment.
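As a rough illustration of the capacity modeling described above, the sketch below fits a straight-line trend to daily utilization samples and estimates how many days remain before a threshold is crossed. The function name, the 80% threshold, and the sample data are hypothetical, and a real platform would likely use richer models than a linear fit.

```python
from typing import Optional, Sequence


def days_until_threshold(daily_util: Sequence[float],
                         threshold: float = 0.8) -> Optional[float]:
    """Estimate days until utilization reaches `threshold`.

    Fits a least-squares line to one sample per day. Returns 0.0 if the
    latest sample is already at or above the threshold, and None if the
    trend is flat or decreasing (no crossing is predicted).
    """
    n = len(daily_util)
    if n < 2:
        raise ValueError("need at least two samples")
    if daily_util[-1] >= threshold:
        return 0.0

    # Ordinary least squares with x = day index 0..n-1.
    mean_x = (n - 1) / 2
    mean_y = sum(daily_util) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(daily_util))
    var = sum((x - mean_x) ** 2 for x in range(n))
    slope = cov / var
    if slope <= 0:
        return None

    intercept = mean_y - slope * mean_x
    crossing_day = (threshold - intercept) / slope
    # Express the answer relative to the most recent sample (day n-1).
    return max(0.0, crossing_day - (n - 1))


if __name__ == "__main__":
    # Hypothetical cluster CPU utilization, growing two points per day.
    samples = [0.50, 0.52, 0.54, 0.56, 0.58]
    print(days_until_threshold(samples))
```

An estimate like this could feed an alert or a purchasing workflow well before the cluster actually runs out of headroom.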
For example, you might notice that one of your VMs is suddenly sending a lot of data to public IP addresses. This could be the result of the machine being compromised by a bot, and such security risks can be detected using a smart anomaly detection system. The list of learning-based algorithms can get long, but the key is to have a platform where these can be easily added over time.
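A minimal sketch of the kind of check described above: it flags samples of outbound traffic that sit far above the recent average, using a simple z-score over a trailing window. The function name, window size, and limit are assumptions chosen for illustration; production systems typically use more robust statistical or learned models.

```python
from statistics import mean, stdev
from typing import List, Sequence


def find_egress_anomalies(bytes_out: Sequence[float],
                          window: int = 24,
                          z_limit: float = 3.0) -> List[int]:
    """Return indices of samples that spike above the trailing window.

    A sample is flagged when it exceeds the mean of the previous `window`
    samples by more than `z_limit` standard deviations.
    """
    if window < 2:
        raise ValueError("window must be at least 2")

    anomalies = []
    for i in range(window, len(bytes_out)):
        history = bytes_out[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Perfectly flat history: treat any increase as anomalous.
            if bytes_out[i] > mu:
                anomalies.append(i)
            continue
        if (bytes_out[i] - mu) / sigma > z_limit:
            anomalies.append(i)
    return anomalies


if __name__ == "__main__":
    # Hypothetical per-hour egress (MB) for one VM, with one sudden burst.
    traffic = [100, 102, 98, 101, 99, 100, 103, 97, 5000, 100]
    print(find_egress_anomalies(traffic, window=8))
```

Note that a flagged spike stays in the trailing window for the next few samples and inflates the baseline, which is one reason real detectors often use medians or exclude known outliers.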
‘Hands Free’ Upgrades!
Upgrading a cloud is like changing the tires on a running car. Admins spend a lot of time dreading it and finally doing it. With a live cloud running a variety of workloads, it is critical that the upgrade process be completely handled by an intelligent software layer, and not by humans who are reading release notes from vendors to figure out the right path to upgrade for their environment.
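One common pattern such a software layer might follow is a rolling upgrade: upgrade one node at a time, verify its health, and stop and roll back as soon as something fails. The sketch below assumes hypothetical `upgrade`, `is_healthy`, and `rollback` callables supplied by the platform; it illustrates the pattern only and is not any vendor's actual procedure.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def rolling_upgrade(nodes: Iterable[str],
                    upgrade: Callable[[str], None],
                    is_healthy: Callable[[str], bool],
                    rollback: Callable[[str], None]
                    ) -> Tuple[List[str], Optional[str]]:
    """Upgrade nodes one at a time, halting at the first unhealthy node.

    Returns (upgraded_nodes, failed_node). `failed_node` is None when
    every node was upgraded and passed its health check.
    """
    upgraded: List[str] = []
    for node in nodes:
        upgrade(node)
        if not is_healthy(node):
            # Restore the failed node and leave the remaining nodes untouched.
            rollback(node)
            return upgraded, node
        upgraded.append(node)
    return upgraded, None


if __name__ == "__main__":
    # Hypothetical stand-ins for platform operations.
    done, failed = rolling_upgrade(
        ["node-1", "node-2", "node-3"],
        upgrade=lambda n: print(f"upgrading {n}"),
        is_healthy=lambda n: n != "node-2",
        rollback=lambda n: print(f"rolling back {n}"),
    )
    print(done, failed)
```

Because only one node is out of service at a time, workloads can keep running on the rest of the cluster while the upgrade proceeds.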