How to achieve O&M automation and fault self-healing in the gaming industry

This article presents practical takeaways from the WOT2016 Internet Operation and Maintenance and Developer Conference. The next session, the WOT2016 Enterprise Security Technology Summit, will be held at the JW Marriott Hotel in Beijing Pearl River Delta on June 24 and 25, 2016.

This talk introduces the operation and maintenance (O&M) system of the game industry, focusing on how that system is built.

Part 1: 37 Games’ Operation and Maintenance System

Ma Chenlong observed that, from start-up to eventual listing, a company's operation and maintenance generally goes through four stages.

The first stage is standardization. Standardization means unifying host names, intranet layout, and configuration files. If these are not unified, none of the later stages can proceed: without a standardized environment, scripts cannot be written to work across servers.
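As a minimal sketch of what standardized host names make possible, the check below validates names against a naming convention and extracts their parts. The convention itself (`<project>-<role>-<idc><index>`) and the function names are assumptions made for illustration; the talk does not describe 37 Games' actual scheme.

```python
import re
from typing import Optional

# Hypothetical convention: <project>-<role>-<idc><3-digit index>,
# for example "game1-web-gz001". Not taken from the talk.
HOSTNAME_RE = re.compile(
    r"(?P<project>[a-z0-9]+)-(?P<role>web|db|cache)-(?P<idc>[a-z]{2})(?P<index>\d{3})"
)


def parse_hostname(name: str) -> Optional[dict]:
    """Return the named parts of a compliant host name, or None."""
    match = HOSTNAME_RE.fullmatch(name)
    return match.groupdict() if match else None


def non_compliant(hosts):
    """List the host names that do not follow the convention."""
    return [h for h in hosts if parse_hostname(h) is None]
```

Once every host name parses, a script can select "all web servers in one IDC" from the name alone, which is the kind of automation that an unstandardized environment blocks.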

The second stage is automation. For small and medium-sized enterprises, this stage is a transition from automation to platformization. Platformization means packaging the automated tasks, integrating their functions, aggregating the data, and then visualizing it on a platform.

The third stage is platformization. The trend is that scripts and functions must be exposed through the platform so that a new person can take over. There should be no need to run scripts on a server by hand and tell the next person where they are installed.

The last stage is service orientation. Service orientation refers to what cloud platforms currently provide. For example, when a Redis cluster is set up, users do not need to know how many servers are behind it: once the NoSQL service is switched on, they can use it directly.

37 Games' automation tools. All of 37 Games' key data is in the CMDB.

When the company was first established, publishing was a headache. Code was committed through SVN, but there was no automated release process.

The key task of operation and maintenance is monitoring. If monitoring is done well, the team has far more freedom, because otherwise most of its time is spent handling emergency text-message alerts.

Many companies see operation and maintenance only as support for development, and do not realize that operation and maintenance data can also drive business growth.

Security management. In a large company, a dedicated team handles security management, separate from operation and maintenance. In a medium-sized enterprise, security management sits in the same department as operation and maintenance, and the two are inseparable.

DB management. This covers the relational databases used at 37 Games, NoSQL stores, and cluster management.

Client classification. For server operations, SaltStack is used as the agent and Logstash collects logs. The data that the API needs is obtained from Zabbix monitoring.

Application launch. Application launch, grayscale release, application operation, and application offline are the four points that run through the operation and maintenance platform. Take a web service as an example: it is first installed at application launch, pulling its data from the CMDB; it then goes through grayscale release, goes online, and stays in use until it is taken offline, at which point the CMDB automatically deletes its configuration from the subsystems involved.
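The four-point lifecycle can be sketched as a small state machine. This is an illustration under stated assumptions, not 37 Games' implementation: the `Stage` and `Application` names are hypothetical, a plain dictionary stands in for the CMDB, and the transitions are kept strictly linear as in the description above.

```python
from enum import Enum


class Stage(Enum):
    LAUNCH = "launch"
    GRAYSCALE = "grayscale"
    ONLINE = "online"
    OFFLINE = "offline"


# Each stage may only move to the next one in the lifecycle.
NEXT_STAGE = {
    Stage.LAUNCH: Stage.GRAYSCALE,
    Stage.GRAYSCALE: Stage.ONLINE,
    Stage.ONLINE: Stage.OFFLINE,
}


class Application:
    def __init__(self, name, cmdb):
        self.name = name
        self.cmdb = cmdb              # dict: app name -> config (CMDB stand-in)
        self.config = cmdb[name]      # launch pulls its data from the CMDB
        self.stage = Stage.LAUNCH

    def advance(self, target):
        """Move to the target stage, rejecting skipped or backward steps."""
        if NEXT_STAGE.get(self.stage) is not target:
            raise ValueError(f"cannot go from {self.stage.value} to {target.value}")
        self.stage = target
        if target is Stage.OFFLINE:
            # Going offline removes the application's configuration.
            self.cmdb.pop(self.name, None)
```

Tying the cleanup to the offline transition means no one has to remember to delete stale configuration by hand.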

The CMDB is the most troublesome part of operation and maintenance. The most important work is to sort out the relationships between items and then link the information together. A few simple categories are the domain name library, software library, resource library, IP library, and configuration library. When establishing these models, you must think about maintainability. Without it, the data that accumulates later will be very messy, and automation is out of the question.

DB management, security management, and monitoring management. DB management is divided into DB deployment and DB monitoring, and covers data operations and some division of permissions. Security management is the management of security rules.
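One way to keep such linked libraries maintainable is to check the links regularly. The sketch below models three of the libraries as dictionaries and reports references that point nowhere. The class, its fields, and the record shapes are assumptions for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class CMDB:
    resources: dict = field(default_factory=dict)  # host name -> attributes
    ips: dict = field(default_factory=dict)        # IP address -> host name
    domains: dict = field(default_factory=dict)    # domain name -> IP address

    def dangling(self):
        """Return descriptions of links whose target record is missing."""
        problems = []
        for ip, host in self.ips.items():
            if host not in self.resources:
                problems.append(f"ip {ip} -> unknown host {host}")
        for domain, ip in self.domains.items():
            if ip not in self.ips:
                problems.append(f"domain {domain} -> unknown ip {ip}")
        return problems
```

An empty result means every domain resolves to a known IP and every IP belongs to a known host, so automation built on the data can trust it.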

Part 2: Monitoring and Intelligent Analysis of Log Data

Monitoring data. With so many platforms, how can the data be unified? 37 Games has monitoring data from Zabbix, Nagios, and Cacti, as well as management data for servers on various clouds, and obtains all of it through APIs. The Zabbix API supports create, delete, update, and query operations. Nagios itself does not provide an API, but the data can be extracted from its auxiliary files. Cacti data can be scraped from its pages, and there are established practices for each of these. Any reasonably reliable cloud service provider can be expected to offer an API.
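Unifying data from several monitoring systems usually comes down to one adapter per source that maps its records onto a common shape. The sketch below shows the idea. The input field names are invented for illustration and are not the real Zabbix or Nagios payload formats.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """Common record shape shared by all monitoring sources."""
    source: str
    host: str
    metric: str
    value: float
    timestamp: int


def from_zabbix(raw):
    # Assumed input shape; values arrive as strings and are converted.
    return Sample("zabbix", raw["host"], raw["key"], float(raw["value"]), int(raw["clock"]))


def from_nagios(raw):
    # Assumed input shape, for example parsed from a status file.
    return Sample("nagios", raw["host_name"], raw["service"], float(raw["perf"]), int(raw["last_check"]))


ADAPTERS = {"zabbix": from_zabbix, "nagios": from_nagios}


def normalize(source, records):
    """Convert raw records from one source into common Sample objects."""
    adapter = ADAPTERS[source]
    return [adapter(record) for record in records]
```

Adding another source such as Cacti then means writing one more adapter, while reports and alarm logic keep working on `Sample` only.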

How can the monitoring business be improved? Monitoring data is analyzed along several dimensions, and alarms are defined with multi-layer thresholds.
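A multi-layer threshold maps one metric value to one of several severity levels rather than a single alarm/no-alarm decision. A minimal sketch, with level names and numbers chosen only for illustration:

```python
# Hypothetical CPU usage levels in percent; real values must be tuned per business.
CPU_LEVELS = [("warning", 70), ("critical", 85), ("disaster", 95)]


def classify(value, levels):
    """Return the highest level whose lower bound the value reaches, else "ok"."""
    result = "ok"
    for name, bound in sorted(levels, key=lambda pair: pair[1]):
        if value >= bound:
            result = name
    return result
```

Each level can then be routed differently, for example a dashboard entry for a warning and a text message only for a disaster.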

Another goal is to reduce the false alarm rate. Operation and maintenance personnel may receive a hundred text messages a day, of which about ninety are false alarms.
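One common way to cut false alarms, shown here as an assumption rather than as the method used at 37 Games, is to alert only after a check has failed several times in a row, so that a single spike does not send a text message.

```python
from collections import defaultdict


class ConsecutiveBreachFilter:
    """Fire once when a check has breached `required` times in a row."""

    def __init__(self, required=3):
        self.required = required
        self.streak = defaultdict(int)  # check key -> current run of breaches

    def observe(self, key, breached):
        if not breached:
            self.streak[key] = 0        # any healthy sample resets the run
            return False
        self.streak[key] += 1
        # True exactly once, at the moment the run reaches the required length.
        return self.streak[key] == self.required
```

With `required=3`, the sequence breach, breach, recover, breach, breach, breach produces a single alarm on the last sample.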

Ma Chenlong gave an example of how monitoring data can be used to promote the business. The work of operation and maintenance is not just to handle alarms, but also to help the business progress. The environment in the game industry is quite complex, and 37 Games has its own IDC computer rooms that host its games. All data is obtained through interfaces and then summarized into historical monitoring data. Reports are generated by day, week, and month, and by other dimensions such as system, application, and network. It is worth noting that this analysis uses historical monitoring data rather than real-time data, because real-time analysis can only produce alarms. Whether a server has a problem is judged from averages over a period: a server is flagged only when its average usage rate stays below a certain threshold. That threshold needs to be agreed with the business operations team.
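The average-over-a-period rule can be sketched as follows. The function name, the data layout (host name mapped to a list of usage percentages), and the threshold are assumptions for illustration.

```python
from statistics import mean


def underused_hosts(history, threshold):
    """Return hosts whose average usage over the period is below the threshold.

    `history` maps a host name to its usage samples for the reporting period.
    Hosts with no samples are skipped rather than treated as idle.
    """
    return sorted(
        host
        for host, samples in history.items()
        if samples and mean(samples) < threshold
    )
```

Running this over a weekly or monthly window gives a list of candidates to review with the business side, instead of an alarm that fires on a single quiet hour.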

Application of web logs. The scenario is that abnormal logs need to be pushed to security personnel regularly so that they can perform XSS and injection analysis. The most common problem in the game industry is credential stuffing, where attackers try large numbers of account and password combinations every day. This problem is not unique to the game industry; other industries face it too.
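A simple signal for credential stuffing is one client failing to log in to many different accounts. The sketch below counts distinct accounts with failed logins per IP. The event shape and the cutoff are illustrative assumptions, not details from the talk.

```python
from collections import defaultdict


def suspicious_ips(events, min_accounts=3):
    """Return IPs with failed logins against at least `min_accounts` accounts.

    `events` is an iterable of (ip, account, success) tuples.
    """
    failed = defaultdict(set)
    for ip, account, success in events:
        if not success:
            failed[ip].add(account)
    return sorted(ip for ip, accounts in failed.items() if len(accounts) >= min_accounts)
```

Counting distinct accounts rather than raw failures avoids flagging a single player who mistypes a password several times.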

37 Games set the goal that the operation and maintenance department is responsible for the unified collection of the relevant business logs, and that a standard API query interface must be provided. All cooperation with other departments or other people now goes through APIs. Security personnel can define web filtering rules, write regular expressions, analyze abnormal logs, and then determine the type of attack and take corresponding measures. Each platform keeps a real-time summary of its logs. In a cluster, however, traffic is spread across servers, and each server only counts its own protection failures, PV, UV, and so on, which makes a global judgment difficult. The classic architecture places an LVS load balancer in front and distributes requests to the servers behind it, so the data must be merged and aggregated first, which takes considerable time.
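The rule-based analysis described above can be sketched as an ordered list of named regular expressions applied to each log line. The two patterns below are deliberately simple illustrations, not production-grade detection rules and not the rules used at 37 Games.

```python
import re

# Illustrative rules only; real rules need URL decoding and far broader coverage.
RULES = [
    ("xss", re.compile(r"<script\b|javascript:", re.IGNORECASE)),
    ("sql_injection", re.compile(r"\bunion\b.+\bselect\b|\bor\b\s+1=1", re.IGNORECASE)),
]


def classify_request(line):
    """Return the name of the first rule that matches the log line, or None."""
    for name, pattern in RULES:
        if pattern.search(line):
            return name
    return None
```

Keeping the rules as data lets security personnel add or adjust patterns without touching the code that reads the logs.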

The data can also be used to optimize server configuration and user experience by region. Games are highly regional, and users can be located accurately from their IP addresses.

37 Games uses the classic ELK architecture. Logstash collects the logs and pushes them into Redis, which serves as a queue; another Logstash instance then pulls them out of Redis and writes them to Elasticsearch, where they are stored. In the early stages, Kibana can be used directly for graphical views and analysis. The Redis in the middle acts only as a queue and does not change the architecture, so it is very easy to get started. If you have development capacity, you can query and analyze the logs directly from Elasticsearch. The web side applies its own filtering rules: if a problem is found, the offender is added to a blacklist, and the business enforces the logic at the front end.
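The role of the queue in this pipeline can be illustrated without any of the real components. In the sketch below, a `deque` stands in for the Redis list and a Python list stands in for an Elasticsearch index; the class and method names are hypothetical, and no Logstash, Redis, or Elasticsearch API is being used.

```python
import json
from collections import deque


class Pipeline:
    """Shipper -> queue -> indexer -> store, using in-memory stand-ins."""

    def __init__(self):
        self.queue = deque()  # stand-in for a Redis list used as a buffer
        self.store = []       # stand-in for an Elasticsearch index

    def ship(self, host, line):
        # Shipper side: wrap the raw line in a JSON event and enqueue it.
        self.queue.append(json.dumps({"host": host, "message": line}))

    def index_batch(self, batch_size=100):
        # Indexer side: move up to batch_size events from the queue to the store.
        moved = 0
        while self.queue and moved < batch_size:
            self.store.append(json.loads(self.queue.popleft()))
            moved += 1
        return moved
```

Because shippers only append and the indexer only drains, a slow or restarting indexer leaves events waiting in the queue instead of dropping them.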

Part 3: Fault Self-Healing

Most medium-sized enterprises now have their own APIs, monitoring data aggregation, and alarm convergence, and much of this is automated. None of it is difficult. The key is to mine the fault information and then formulate your own fault self-healing rules.

The problems self-healing faces include anomalies caused by logic at the system, network, and business layers, false alarms from monitoring, and reliability issues in the monitoring itself. Some businesses cannot be self-healed at all, because fault self-healing is not omnipotent. As with today's artificial intelligence, there are things machines cannot replace. In complex business scenarios, for example, a machine may not be able to make the judgment, and a person may need to inspect the situation to know how to handle it.

All alarm information at 37 Games is sent to an alarm information processing center. SMS alarms, for example, are all pushed from there in a unified way.
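A central alarm processing point makes alarm convergence straightforward. One simple form, sketched here under assumed names and an assumed five-minute window, is to suppress repeats of the same alarm within a time window before pushing.

```python
class AlarmCenter:
    """Push an alarm only if the same (host, item) was not pushed recently."""

    def __init__(self, window=300):
        self.window = window      # suppression window in seconds
        self.last_pushed = {}     # (host, item) -> timestamp of last push

    def should_push(self, host, item, timestamp):
        key = (host, item)
        last = self.last_pushed.get(key)
        if last is not None and timestamp - last < self.window:
            return False          # repeat inside the window: suppress
        self.last_pushed[key] = timestamp
        return True
```

A check that fails every minute then produces one message per window rather than one per minute.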

Here Ma Chenlong shared an example that can be applied in any scenario. Use your own Zabbix monitoring, SmokePing monitoring, or even third-party monitoring to obtain monitoring information. Push all of it to a callback queue, and then analyze the alarm information.
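The flow of queue, analysis, and self-healing rules can be sketched as a dispatcher that matches each alarm to a handler and escalates anything without a rule to a person. The alarm fields, rule names, and handlers below are hypothetical, and the handlers only describe the action they would take instead of executing anything.

```python
from collections import deque


def restart_service(alarm):
    # Placeholder action: a real handler would call the deployment tooling.
    return f"restart {alarm['service']} on {alarm['host']}"


def clean_temp_files(alarm):
    # Placeholder action: a real handler would free disk space on the host.
    return f"clean temp files on {alarm['host']}"


# Self-healing rules: alarm type -> handler. Types without a rule go to a human.
HEALING_RULES = {
    "process_down": restart_service,
    "disk_full": clean_temp_files,
}


def process(queue):
    """Drain the alarm queue, returning (healed actions, escalated alarms)."""
    healed, escalated = [], []
    while queue:
        alarm = queue.popleft()
        handler = HEALING_RULES.get(alarm["type"])
        if handler is None:
            escalated.append(alarm)
        else:
            healed.append(handler(alarm))
    return healed, escalated
```

The explicit escalation path reflects the point that self-healing is not omnipotent: alarms from complex business scenarios fall through to on-call staff.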

What can fault self-healing bring? First, operation and maintenance personnel can attend to their personal lives outside working hours, even though the first requirement of the job is to be on call 24 hours a day. Second, it reduces direct manual operations on production: when a fault occurs, operating directly on production can easily cause a secondary fault. Third, it leaves time to analyze the causes of faults and builds enthusiasm for the work, instead of reacting to noisy alarms every day. In the long run, it brings significant benefits to players, to the company, and to the staff's own value.

Finally, Ma Chenlong concluded that you must have your own solutions in your own field.

"There are many open source solutions that cannot be used as they are. They must be adapted to your own production environment, and you must have your own solutions. The design of operation and maintenance tools must be simple, because you have to consider how high the maintenance cost will be when the next person takes over your work. Anyone who has written code knows that rewriting is much easier than maintaining someone else's code. Everything must be centered on the business. Once it is separated from the business, the data you produce is useless. Finally, I want to say that a good architecture is evolved, not designed," Ma Chenlong said.

This article is compiled from the speech given by Ma Chenlong, Operation and Maintenance Architect at 37 Games, on the theme of "Operation and Maintenance Automation and Fault Self-Healing in the Game Industry" at the WOT2016 Internet Operation and Maintenance and Developer Conference hosted by 51CTO Media.
Lecture video: http://edu..com/lesson/id-100749.html
Lecturer Profile:

Ma Chenlong is responsible for the development of the 37 Games operation and maintenance platform. He currently focuses on operation and maintenance automation in the game industry and on fault self-healing for monitoring systems. He is skilled in Perl development, regular expressions, and precise log matching.
