Operations data is vast and complex, and many operations engineers are searching for new ways to handle it. As AI technology is put into practice across application fields, IT operations is entering a new era of intelligent operations (AIOps). Efficient algorithms increase the value of AIOps, and through continuous learning, intelligent operations will free engineers from floods of complex alerts and noise. So how does algorithm-driven IT operations differ from automated operations? Which operational pain points are suited to artificial intelligence at this stage? And how can adoption be accelerated?
On the afternoon of August 26, 51CTO held the 14th "Tech Neo" technical salon in Beijing to broaden the thinking of operations engineers and developers and to spur innovation. 51CTO invited Pei Dan, associate professor in the Department of Computer Science at Tsinghua University and an expert in intelligent operations algorithms; Huang Xin, head of SRE at Sogou; and Shen Jianlin, senior architect at JD Finance. Drawing on their practice and exploration in algorithm-driven IT operations, the three speakers discussed with attendees how to put AIOps into practice.
From Alerts to Early Warnings: How to Effectively Improve SLOs
The first speaker was Huang Xin, head of SRE at Sogou. He opened with a question: how do you establish SLOs so that operations work can be evaluated? He organized his talk into five parts: first, win the trust of the business lines; second, clarify stability requirements by understanding business needs; third, guard against force majeure events; fourth, choose a monitoring system that fits your needs; and fifth, put data first and do not fixate on individual wins and losses. Huang Xin also shared five methods for implementing an early warning system.
[Figure: Early warning system framework]
Mr. Huang Xin also discussed the entry threshold for operations work, automatic fault recovery, and future prospects with the operations engineers and developers present.
How to Implement Intelligent Operations
Next, Pei Dan, associate professor of computer science at Tsinghua University and an expert in intelligent operations algorithms, explained how to put intelligent operations into practice. He began with the background of operations work and the goal of making the key technologies of intelligent operations widely accessible, so that every company can use the latest techniques. In his view, making intelligent operations widely accessible depends on data, algorithms, computing power, and talent.
The second part of his talk broke intelligent operations down into its key technologies and used that decomposition to define research problems. According to Professor Pei Dan, a well-defined research problem meets five criteria: first, the input is clear and the data is available; second, the output is clear and the output goals are feasible; third, there is a high-level technology roadmap; fourth, there are references to draw on; and fifth, researchers outside the intelligent operations field can understand and solve the problem. Professor Pei Dan also pointed out that the Gartner report describes the problems of intelligent operations too broadly.
How can intelligent operations be done well? Professor Pei Dan noted that machine learning already offers many mature algorithms and systems, along with a large number of excellent open source tools. Applying machine learning successfully to operations requires support in three areas: data, labeled data, and application.
Data: Internet applications produce huge volumes of logs, so storage needs to be optimized. If there is not enough data, it has to be generated.
Labeled data: Daily operations work produces labeled data. For example, after an incident, the operations engineer records how it was handled; that record is fed back into the system, which in turn raises the level of operations.
Application: Operations engineers are the users of intelligent operations systems. The problems they discover in use provide positive feedback for optimizing those systems.
Professor Pei Dan then shared three cases of intelligent operations based on his cooperation with Baidu's operations and search departments. The first case is automated KPI anomaly detection based on machine learning. As the figure for this case shows, operations staff judge which parts of a KPI curve are anomalous and label them, and the system learns from the labeled feature data (typical supervised learning). This requires efficient annotation tools, with features such as dragging and zooming, to save operations staff time. Professor Pei Dan also described the practices, challenges, and solutions involved in building a KPI anomaly detection system.
Advancing from Manual Operations
The last speaker, Shen Jianlin of JD Finance, talked about moving beyond manual operations. He started with his own views on operations work, contrasting its ideals with its reality, and then moved to the main topic by way of the mission of service monitoring. He grouped the design principles of service monitoring into six: microkernel, optimistic strategy, zero intrusion, convention over configuration, dynamic routing, and centralized control. In the third part, on technical implementation, Mr. Shen Jianlin explained how to advance from manual operations, covering a comparison of log collection solutions, the challenges of distributed service tracing, the overall SGM technical architecture, the static and dynamic architecture of the SGM Agent, what the SGM Agent collects, and ways to extend SGM.
After the talks, attendees discussed with the speakers the new concepts, frameworks, and ideas in current operations technology, as well as problems from their own work. They raised questions and ideas about the talks and received guidance and suggestions from the speakers.
The 51CTO Tech Neo Technology Salon is an offline networking event for IT professionals that 51CTO has held regularly since 2016. It is currently limited to the Beijing area and takes place once a month. Each session focuses on one topic, drawn from technology fields such as big data, cloud computing, machine learning, and the Internet of Things.