Technical Tips | How to build an efficient operation and maintenance management platform under the microservice architecture?


This article is based on a talk given by Li Ming, CTO of U-Wise Technology, at the "Cloud Operation and R&D Best Practice" event. It draws on the characteristics of microservice architecture to explain how to build an efficient operation and maintenance management platform.

Li Ming led the team that independently developed EasyOps, a full-stack DevOps operation and maintenance management platform, which is currently the industry's leading intelligent operation and maintenance management platform. As the former head of Tencent's operation and maintenance R&D, Li Ming led the development of multiple operation and maintenance systems, including public opinion monitoring, a big data monitoring platform, CMDB, a real-time log analysis platform, Zhiyun, and client experience monitoring.

This article has three main points:

1. The characteristics of microservice architecture and its differences from traditional monolithic architecture, as well as the challenges faced by traditional operation and maintenance tools;

2. Microservice-oriented operation and maintenance platform architecture;

3. The evolution of the operation and maintenance platform toward microservices.

1. Differences between Microservice Architecture and Monolithic Architecture

"Microservices" and "monolithic architecture" are not opposites, but solutions for different scenarios.

A monolithic architecture concentrates all the "brains" in one place. Typified by the client-server (CS) architecture, it puts all logic in a single application and then adds front-end UI components, services, an MVC layer, databases, and so on. Its technical architecture is not complicated, and it is easy to debug, deploy, and manage. It is a suitable solution for most systems.

However, in Internet application scenarios that demand "more, faster, better, and cheaper", the monolithic architecture faces many challenges.

More: There are a huge number of Internet users, reaching *** online users;

Faster: Service requests must be answered within one second, or even faster;

Better: Service quality must be stable;

Cheaper: Hardware costs must grow more slowly than the number of users.

The answer to all four problems is to enhance the flexibility of the entire platform.

Platform expansion capabilities

1. Parallel expansion: Stateless services can generally be expanded in parallel simply by adding servers;

2. Partitioning: For stateful services, platform flexibility can be enhanced through partitioning, for example by serving users in the north from cluster A and users in the south from cluster B.
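The two expansion styles above can be sketched in a few lines of Python. This is only an illustration: the server addresses, the cluster names, and the helper functions `pick_stateless_server` and `pick_stateful_cluster` are assumptions made for the example, not part of any system described in the talk.

```python
import itertools

# Hypothetical layout; names and addresses are illustrative only.
STATELESS_POOL = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
STATEFUL_CLUSTERS = {"north": "cluster-A", "south": "cluster-B"}

_round_robin = itertools.cycle(STATELESS_POOL)


def pick_stateless_server() -> str:
    """Parallel expansion: any server can take any request, so rotate through the pool."""
    return next(_round_robin)


def pick_stateful_cluster(user_region: str) -> str:
    """Partitioning: a user's state lives in exactly one cluster, chosen by region."""
    try:
        return STATEFUL_CLUSTERS[user_region]
    except KeyError:
        raise ValueError(f"no cluster is assigned to region {user_region!r}") from None
```

Adding capacity to the stateless tier means appending an address to the pool, while adding capacity to the stateful tier means introducing a new partition and deciding which users belong to it.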

The "monolith architecture" can adapt to platform expansion, but it is more difficult to adapt to functional expansion.

Functional expansion capability

On the functional dimension, how can we make the system easier to extend?

1. Flexible cost control: make local adjustments to individual modules and logic rather than modifying the entire system.

All modules of a monolithic architecture are bundled together. Because every instance carries all of the modules, the only way to add capacity is to replicate the whole application in parallel, which is expensive.

Under a microservice architecture, modules can be distributed across servers very flexibly, so expansion is cheap. This is why teams now split their server-side modules and carry out microservice transformation to strengthen the platform's support capabilities.

2. How to build an operation and maintenance management platform under the microservice architecture

The previous section described the differences between microservice architecture and monolithic architecture. Next, let's look at how to build an operation and maintenance management platform.

The most important object of operation and maintenance platform management is the application. From an application operations perspective, the website at the front end of the system, the logic services in the middle, and the storage and cache at the back end are each operated differently.

The operation and maintenance platform is divided into three specific components corresponding to the work.

What are the internal applications and internal dependencies of the operation and maintenance platform? Programs, configuration files, and computing resources.

What supports the operation and maintenance platform as an Internet application? Memory and CPU.

What resources does the operation and maintenance platform rely on? The system image.

This is what the CMDB IT resource management system is responsible for. When it comes to automated capacity expansion and environment deployment, only by understanding this data can the upper-level system know how to build this application. Many operation and maintenance teams only achieve "tooling" but fail to link it with "resource management configuration."

Once resources are managed effectively, the next step is to manage R&D and operations actions, such as version updates, service migrations, and building test environments.

With resources and actions both managed, automated operation and maintenance forms a closed loop: operations staff only need to keep the resource configuration data (CMDB) accurate in advance, and the system completes the remaining actions automatically. If resources and actions are mixed together, every application ends up needing its own custom release scripts and build scripts.
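A minimal sketch of keeping resources and actions separate might look like the following. The CMDB record layout, the application name `order-service`, and the function `plan_release` are all hypothetical; the point is only that the release action is generic and everything application-specific comes from the resource data.

```python
# Hypothetical CMDB records: resource data only, no procedure.
CMDB = {
    "order-service": {
        "package": "order-service",
        "version": "1.4.2",
        "config_file": "order.conf",
        "hosts": ["10.1.0.11", "10.1.0.12"],
    },
}


def plan_release(app_name: str, cmdb: dict) -> list[str]:
    """Generic release action: the steps are identical for every application,
    and every application-specific value is read from the CMDB record."""
    record = cmdb[app_name]
    steps = []
    for host in record["hosts"]:
        steps.append(f"{host}: install {record['package']}=={record['version']}")
        steps.append(f"{host}: render {record['config_file']}")
        steps.append(f"{host}: restart {record['package']}")
    return steps
```

With this split, releasing a new application requires a new CMDB record, not a new script.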

In addition to resource and action management, there is also state (monitoring) management. Every company has a monitoring system. What needs emphasis here is where to look: because automatic disaster recovery switching is designed into the upper layers and into application-layer monitoring, we do not need to watch the underlying layers closely. As long as the application layer raises no alarm, it does not matter whether an underlying server or even a data center is down.

When I first started working, the system often gave out alarms, and I had to get up in the middle of the night to restart the machine and delete files. Now, the operation and maintenance staff will only receive notifications that the server has crashed and confirm it, without having to deal with it in real time. Based on this logic, if there are no alarms from the business, our system is normal.

A complete operation and maintenance management platform can reasonably coordinate and manage resources, actions, and status.

This picture expands and subdivides the simple picture above.

The top layer is for operations and maintenance, and includes a service catalog for operations and maintenance and developers, as well as a unified operations and maintenance portal for daily task centers and status centers.

Below it is the scheduling and orchestration system. Orchestration requirements differ by industry and business characteristics, and the product extensions capture these differing requirements as fixed options.

In the middle is the core of the operation and maintenance platform, the execution-layer system. Leaving aside the traditional API module shown in gray, what we use in daily operation and maintenance is a three-dimensional monitoring system comprising the continuous delivery platform, the unified monitoring platform, and the ITOA operation analysis platform, through which we manage actions and status. It provides interfaces of differing precision and priority for needs at the infrastructure, platform, application, and service levels, and even higher.

The bottom layer is CMDB resource management. A traditional CMDB manages hardware assets. As cloud technology develops, that role will steadily weaken, and application operation and maintenance does not need to pay much attention to it. Here, the CMDB covers basic resources such as business information, application packages, configuration, scheduled tasks, processes, tools, permissions, and system configuration.

3. Microservice Evolution of Operation and Maintenance Platform

As the company's business develops, how can we optimize or plan the architecture of the systems currently in use?

1. Technology selection

First, what sets microservices apart is that, once split, the components communicate over the network. The communication standard must therefore be chosen carefully.

A microservice architecture is usually heterogeneous. Our platform, for example, uses languages such as Python, Java, and PHP, so we must choose protocols that work across all of them. When we chose protobuf earlier, for instance, we found that its Python library was not mature enough to be compatible with Linux. Across different scenarios, the technologies selected for microservices need strong compatibility.

The second consideration is the choice of language. Microservices emphasize the stability of interfaces. As long as the stability of the service is ensured, you are free to choose a language you are familiar with.

2. Microservice Planning

Single Responsibility Principle: Each service should be responsible for a single part of the functionality.

Clearly publish interfaces: Each service publishes a well-defined interface that remains unchanged. Consumers only care about the interface and have no runtime dependencies on the consumed service.

Independent deployment, upgrade, expansion, and replacement: Each service can be deployed and redeployed independently without affecting the entire system, which makes it easy to upgrade and expand the service.
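As an in-process analogy for the "clearly published interface" principle, the sketch below uses Python's `typing.Protocol`. In a real microservice the interface would be a network contract rather than a Python class, and the names `HostInventory`, `StaticInventory`, and `count_hosts` are invented for this illustration.

```python
from typing import Protocol


class HostInventory(Protocol):
    """Published interface: the only thing a consumer is allowed to depend on."""

    def hosts_for(self, app_name: str) -> list[str]: ...


class StaticInventory:
    """One implementation; it can be replaced without touching consumers."""

    def __init__(self, table: dict[str, list[str]]) -> None:
        self._table = table

    def hosts_for(self, app_name: str) -> list[str]:
        return list(self._table.get(app_name, []))


def count_hosts(inventory: HostInventory, app_name: str) -> int:
    """Consumer code written against the interface, not the implementation."""
    return len(inventory.hosts_for(app_name))
```

Because `count_hosts` only relies on `hosts_for`, the inventory behind it can be redeployed, upgraded, or swapped for another implementation independently.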

3. Platform construction

The platform architecture is explained through the following two modules.

1) How can the CMDB system be simply split to make it easier to maintain?

A CMDB is a database-backed management system that can be queried and modified and that holds a large amount of configuration data. It includes model management, configuration management, and automatic discovery.

A) Model Management

In CMDB, we will manage a large number of resources and different actions that change dynamically as the product technology evolves, so we need to separate the model management module to ensure that CMDB is dynamically adjustable.

B) Configuration Management

Due to the high sensitivity of CMDB information, many companies require that sensitive business information, especially information related to applications and IP, be stored in it.

C) Automatic discovery

If the CMDB lacks a sound automatic discovery mechanism, its data is very likely to be wrong. A traditional CMDB, for example, runs its configuration change process under a strict approval mechanism. Even so, an asset inventory is still needed every six months to correct the information so that the recorded configuration matches the production environment. For a system with massive business, a CMDB without "automatic discovery" capability is unqualified.

Through "automatic discovery", information such as server bandwidth, network card speed, memory, disk space, and processes is collected automatically and managed by the CMDB. Model management is relatively traditional, whereas "automatic discovery" is the core of the CMDB. When hundreds of thousands of servers are managed at the same time, the data can only be kept up to date through "automatic discovery".
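The reconciliation step at the heart of automatic discovery can be sketched as a comparison between what the CMDB records and what an Agent reports. The field names (`memory_gb`, `disk_gb`, `owner`) and the functions `reconcile` and `apply_discovery` are assumptions for this example, not the design of any particular CMDB.

```python
def reconcile(cmdb_record: dict, discovered: dict) -> dict:
    """Return {field: (recorded, discovered)} for every field where the Agent's
    report differs from what the CMDB currently stores."""
    drift = {}
    for field, actual in discovered.items():
        recorded = cmdb_record.get(field)
        if recorded != actual:
            drift[field] = (recorded, actual)
    return drift


def apply_discovery(cmdb_record: dict, discovered: dict) -> dict:
    """Treat the Agent report as the source of truth for the fields it covers
    and return an updated copy of the record."""
    updated = dict(cmdb_record)
    updated.update(discovered)
    return updated
```

Fields that the Agent cannot observe, such as an owner, are left untouched, so manually maintained data and discovered data can live in the same record.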

2) Continuous deployment system

The continuous deployment system is responsible for automated releases. The figure above divides the platform construction of the continuous deployment system into multiple sub-modules.

A) Build Management

A build is a deployment object that mainly consists of static images, business programs, configuration files, etc. According to the principles of DevOps, everything needs to be versioned. Therefore, a build library is needed to manage all resources released to the production environment.

Through a unified build library, everything released to the production environment is managed in a standardized way, so that the original system can be rebuilt quickly in another data center. The library also serves to share information. In the past, releases were hard to track once operation and maintenance had been outsourced. Now, R&D personnel only need to enter version information into the build library, and operations staff can export it from there.
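To make the idea of a versioned build library concrete, here is a small in-memory stand-in. The class `BuildLibrary` and its methods are hypothetical, and a real build library would store artifacts durably; the sketch only shows the rule that a published version is never overwritten.

```python
import hashlib


class BuildLibrary:
    """In-memory stand-in for a build library: every artifact is stored under
    an immutable (name, version) key together with its checksum."""

    def __init__(self) -> None:
        self._builds: dict[tuple[str, str], dict] = {}

    def publish(self, name: str, version: str, content: bytes) -> str:
        key = (name, version)
        if key in self._builds:
            raise ValueError(f"{name} {version} is already published; bump the version")
        digest = hashlib.sha256(content).hexdigest()
        self._builds[key] = {"content": content, "sha256": digest}
        return digest

    def fetch(self, name: str, version: str) -> bytes:
        return self._builds[(name, version)]["content"]

    def versions(self, name: str) -> list[str]:
        # Plain string sort; adequate for this sketch, not for real version ordering.
        return sorted(v for (n, v) in self._builds if n == name)
```

Because each version is immutable, the same `(name, version)` pair always yields the same bytes, which is what allows a system to be rebuilt elsewhere from the library alone.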

B) Task Management

The task library stores daily release tasks to support automated release. In the past, many R&D personnel changed systems directly in the production environment for convenience, and these ad hoc changes left the recorded information in disorder, which made release tasks hard to manage.

Because the recorded information is often wrong, we do not use a design that "automatically updates the system settings after the task is issued". When the upper-level management system cannot be trusted, information from the production environment must be scanned and reported in real time.

To ensure a successful release, the information reported by the Agent must be taken as the basis. Configuration information can be changed through many entry points, and as long as a single entry point cannot be guaranteed, the system must not be designed around assumptions.

Besides the build library, task library, and instance library, the command channel and the data channel are basic components of the upper-level system. Above all, the two channels need to be managed separately. Tencent once needed to send a 1 GB file to 2,000 servers once a week, and the transfer kept failing and retrying. After commands and data were separated, only a few dozen KB of command scripts were transmitted each time, and the servers were no longer blocked.
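One way to picture the separation is that the command message carries only a reference to the file plus a checksum, and the Agent fetches the file itself over the data channel. The sketch below assumes a JSON command format, and the field names and the functions `make_deploy_command` and `verify_download` are invented for illustration; it does not describe Tencent's actual protocol.

```python
import hashlib
import json


def make_deploy_command(package_url: str, payload: bytes, install_path: str) -> bytes:
    """Build the small message that travels on the command channel.

    The payload itself is NOT embedded; the command only says where to fetch
    it (over the data channel) and how to verify it afterwards."""
    command = {
        "action": "deploy",
        "package_url": package_url,
        "sha256": hashlib.sha256(payload).hexdigest(),
        "size": len(payload),
        "install_path": install_path,
    }
    return json.dumps(command).encode("utf-8")


def verify_download(command_bytes: bytes, downloaded: bytes) -> bool:
    """Agent-side check after pulling the file over the data channel."""
    command = json.loads(command_bytes)
    return (
        len(downloaded) == command["size"]
        and hashlib.sha256(downloaded).hexdigest() == command["sha256"]
    )
```

The command stays a few hundred bytes regardless of how large the package is, so a slow or failed transfer on the data channel never blocks command delivery.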

Some problems remain unsolved by open source solutions, such as today's heterogeneous networks. In a hybrid cloud scenario, a direct connection requires guaranteed network interoperability. One way around this is to write your own agent that connects to the central management server through a reverse channel.

The underlying basic services of the platform architecture under the microservice architecture

1. Name Service

A name service looks up the IP address and port of a service by matching the name given in a configuration file. You can choose a suitable open source solution, but if you develop your own, you can partition services flexibly. For example, if server A in Shenzhen accesses service B, which is deployed in both Shenzhen and Shanghai, the name service only needs to be connected to the CMDB so that the Shenzhen server is given the Shenzhen IP, achieving same-city access. This operation cannot be fully implemented in open source solutions.
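The same-city lookup described above might be sketched as follows. The registry contents, the IP-to-city table standing in for the CMDB, and the function `resolve` are all assumptions made for the example; a real name service would query both over the network.

```python
# Hypothetical registry and CMDB extract; in practice both would be services.
REGISTRY = {
    "service-b": ["10.2.0.5:8080", "10.3.0.7:8080"],
}
CMDB_CITY = {
    "10.1.0.9": "Shenzhen",  # caller (server A)
    "10.2.0.5": "Shenzhen",  # service B instance
    "10.3.0.7": "Shanghai",  # service B instance
}


def resolve(service_name: str, caller_ip: str) -> list[str]:
    """Return endpoints for a service, preferring those in the caller's city.

    Falls back to every endpoint when the caller's city is unknown or no
    same-city instance exists."""
    endpoints = list(REGISTRY.get(service_name, []))
    caller_city = CMDB_CITY.get(caller_ip)
    if caller_city is None:
        return endpoints
    same_city = [
        ep for ep in endpoints
        if CMDB_CITY.get(ep.split(":")[0]) == caller_city
    ]
    return same_city or endpoints
```

The fallback matters: if the Shenzhen instances disappear, a Shenzhen caller is still given the Shanghai endpoint rather than an empty list.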

2. Status Monitoring

Monitoring must reach the interface level, that is, data is collected through application-layer monitoring.

Three core indicators (access volume, success rate, and average latency) cover most needs at low cost. Taking success rate as an example, when the access failure rate rises and an alarm is issued, the linkage with the name service is triggered directly and the faulty node is removed automatically.
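A rough sketch of tracking the three indicators per node, and of deciding which nodes to hand to the name service for removal, is shown below. The class `NodeStats`, the function `nodes_to_eject`, and the thresholds (95% success rate, at least 20 samples) are illustrative assumptions rather than values from the talk.

```python
from dataclasses import dataclass


@dataclass
class NodeStats:
    requests: int = 0              # access volume
    successes: int = 0
    total_latency_ms: float = 0.0

    def record(self, ok: bool, latency_ms: float) -> None:
        self.requests += 1
        self.successes += int(ok)
        self.total_latency_ms += latency_ms

    @property
    def success_rate(self) -> float:
        return self.successes / self.requests if self.requests else 1.0

    @property
    def avg_latency_ms(self) -> float:
        return self.total_latency_ms / self.requests if self.requests else 0.0


def nodes_to_eject(stats: dict[str, NodeStats], min_success_rate: float = 0.95,
                   min_requests: int = 20) -> list[str]:
    """Nodes whose success rate fell below the threshold, ignoring nodes with
    too few samples to judge."""
    return sorted(
        node for node, s in stats.items()
        if s.requests >= min_requests and s.success_rate < min_success_rate
    )
```

The minimum sample count keeps a node that has served only a handful of requests from being removed on the strength of one or two failures.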

3. Load Balancing

When the system grows and the number of nodes increases dramatically, adding intermediate proxies increases the pressure inside the system.

If load balancing is implemented in the Agent instead, the Agent queries the name service for the IP list, merges in the status information, and spreads requests across the nodes, which achieves better load balancing.

The extreme case of load balancing is disaster recovery. Under normal circumstances, each node handles an amount of requests appropriate to its performance.
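Agent-side load balancing of this kind can be sketched as a weighted random choice over the nodes that are currently healthy. The node table layout and the function `choose_node` are hypothetical; in practice the addresses would come from the name service and the health flags from status monitoring.

```python
import random


def choose_node(nodes: dict[str, dict], rng: random.Random) -> str:
    """Pick one node, weighted by capacity, skipping nodes marked unhealthy.

    `nodes` maps address -> {"weight": int, "healthy": bool}."""
    candidates = {
        addr: info["weight"]
        for addr, info in nodes.items()
        if info["healthy"] and info["weight"] > 0
    }
    if not candidates:
        raise RuntimeError("no healthy node available")
    addresses = list(candidates)
    weights = [candidates[addr] for addr in addresses]
    return rng.choices(addresses, weights=weights, k=1)[0]
```

When nodes fail, the same function degrades naturally toward disaster recovery: traffic is redistributed over whatever healthy nodes remain, down to a single survivor.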

These three points are the core capabilities of an operation and maintenance platform or a business production system. Operation and maintenance platforms, including Tencent's, are built on the closed loop formed by these three services. Only by achieving all three can we handle system anomalies and keep the system running normally.

Iteration focus of microservice operation and maintenance platform

When we build the platform, and throughout its evolution, we have to set priorities and make trade-offs. In general, we should solve our bottlenecks first, and then consider parallel expansion, service reuse, and even adopting some open source solutions. But I have never believed that open source means anyone can assemble a pile of open source tools into a good operation and maintenance platform.

We should keep these open source capabilities and microservices under our own control. Take monitoring as an example: many open source systems focus on execution-layer tools, but we still need to build the core CMDB and core process control ourselves.

This article is reproduced from Leiphone.com. If you need to reprint it, please go to the Leiphone.com official website to apply for authorization.