This article is a hands-on write-up from the WOT2016 Internet Operation and Developer Conference. The next session, themed as the WOT2016 Enterprise Security Technology Summit, will be held at the JW Marriott Hotel in Beijing Pearl River Delta on June 24 and 25, 2016.

Shanda Games is a leading online game developer, operator and publisher in China, running tens of thousands of servers. With a fleet that size, how does Shanda automate the management of remote servers?

Servers are generally maintained at the application layer. How can they be managed at the hardware layer? Most companies install agents in the operating system for supervision and monitoring. Some projects, Shanda Games' among them, prefer not to install such agents, because it is unknown whether, and in what way, an agent would affect the application. The project team may still install agents of its own inside the project for supervision.

There is also real demand for automated operation and maintenance of server hardware. Hardware health, server status, memory faults, power supply faults: these are all hard to track from inside the operating system.

How are server problems usually handled? You call the operator or ask on-site technicians to reinstall the system or restart the machine. This depends on someone else being available, so it is slow and the server cannot be operated on quickly.

Remote KVM is another need. When the system is down, the connection is lost, and the server's state cannot be determined accurately, remote KVM is used. The same goes for automatic system deployment, a key part of fully automated IDC management: the machine has to be booted through PXE, and the system is then installed automatically through matching on the IDC side.

To address these problems, Shanda adopted out-of-band management.
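As a rough illustration of the remote operations described above (checking power state, forcing a PXE boot, restarting the machine), the sketch below builds command lines for the open-source `ipmitool` client. The choice of `ipmitool` is an assumption: the talk does not say which tooling Shanda uses. The host, user name, and password are hypothetical placeholders, and the function names are invented for this example. The commands are only printed, not executed.

```python
from typing import List


def ipmitool_cmd(host: str, user: str, password: str, *action: str) -> List[str]:
    """Build an ipmitool argument list that addresses a BMC over the LAN."""
    return [
        "ipmitool",
        "-I", "lanplus",  # IPMI v2.0 LAN interface
        "-H", host,       # address of the BMC, not of the operating system
        "-U", user,
        "-P", password,
        *action,
    ]


def pxe_reinstall_steps(host: str, user: str, password: str) -> List[List[str]]:
    """Commands for a PXE-based reinstall: check power, set boot device, restart."""
    return [
        ipmitool_cmd(host, user, password, "chassis", "power", "status"),
        ipmitool_cmd(host, user, password, "chassis", "bootdev", "pxe"),
        ipmitool_cmd(host, user, password, "chassis", "power", "cycle"),
    ]


if __name__ == "__main__":
    # Hypothetical BMC address and credentials, for illustration only.
    for cmd in pxe_reinstall_steps("10.0.0.10", "admin", "secret"):
        print(" ".join(cmd))
```

To actually run a step, each list could be passed to `subprocess.run(cmd, check=True)` on a machine that can reach the BMC. In practice the password would come from a secret store rather than the command line.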
Out-of-band management here means an intelligent management platform built on IPMI (Intelligent Platform Management Interface), used to manage and monitor servers. IPMI is a cross-platform, hardware-based management specification originally launched jointly by Intel, HP, Dell, and NEC. It went from version 1.0 in 1998, through version 1.5 in 2001, to version 2.0 in 2004, and has continued to receive updates since; the most recent revision dates from April 2015. Shanda cooperates closely with Intel and has been looking at how to make IPMI's functions more complete.

The functions IPMI currently provides are, broadly:

Power control, which can turn the machine on and off.
Hardware monitoring, such as the status of the motherboard and memory.
Alarms, which can be triggered against configurable thresholds.
Logs of changes at the hardware level.
SOL (Serial over LAN), which redirects the serial port over the network. Early servers lacked the extended functions of current models, so what we could see of the server's screen came entirely from SOL.

IPMI still has many shortcomings, such as no support for mounting virtual media.

Understanding IPMI's basic architecture helps when developing IPMI-based functions. IPMI is published as a standard specification, and although some tools are available, they may not fully meet developers' requirements in every case. Developers then need to build against the hardware or the interfaces themselves.

Hardware architecture diagram

Out-of-band management is, in effect, a microcontroller: a small microcontroller system, separate from the rest of the server hardware, integrated on the server motherboard. It has its own operating system. We call it the BMC, the Baseboard Management Controller.
The BMC is the core chip of out-of-band management. It is completely separate from the operating system, the server motherboard, and the rest of the server hardware. The board shares only power with the server, so it keeps working even when the server is not switched on.

The information the BMC produces can be reached through the external interfaces and through an internal interface, the System Interface. In other words, from the operating system's point of view the BMC is an add-on hardware board: once the corresponding driver is installed, every function the out-of-band system provides can be accessed from within the system.

The BMC has non-volatile memory that stores the SDR, SEL, and FRU information. The SDR (Sensor Data Record) repository holds the sensor records; you can use commands to read SDR information and check the condition and status of each sensor. The SEL (System Event Log) is the event log: hardware-level events are recorded there, and by reading the SEL you can find out what went wrong on the machine and when. The FRU (Field Replaceable Unit) data records information about each board in the server's hardware. By reading it you can identify the board, the server manufacturer, the serial number, the asset number, and so on.

The BMC exchanges data with components outside itself over the IPMB data bus. Servers from vendors such as HP, Dell, Inspur, and Lenovo all include an expansion card. The card carries its own chip on an added board, and this board acts as an extension of the BMC. The BMC itself is command-driven and has no graphical interface: operations can be performed and information obtained only through commands. To view information more conveniently from outside, you need the expansion card, which hangs off the IPMB bus.
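To make the event log described above more concrete, here is a minimal sketch that decodes one 16-byte SEL system event record, following the record layout in the IPMI specification. The class and function names are invented for this example, the sensor type table is only a small excerpt, and the sample record is made up rather than read from a real BMC.

```python
import struct
from typing import NamedTuple

# 16-byte system event record: little-endian, no padding.
SEL_FORMAT = "<HBIHBBBB3s"

# A few sensor type codes from the IPMI specification (not exhaustive).
SENSOR_TYPES = {
    0x01: "Temperature",
    0x02: "Voltage",
    0x04: "Fan",
    0x08: "Power Supply",
    0x0C: "Memory",
}


class SelRecord(NamedTuple):
    record_id: int
    record_type: int     # 0x02 marks a system event record
    timestamp: int       # seconds since 1970-01-01
    generator_id: int    # who logged the event
    evm_rev: int         # event message format revision
    sensor_type: int
    sensor_number: int
    event_dir_type: int  # bit 7: direction, bits 6..0: event type
    event_data: bytes    # three bytes of event-specific data

    @property
    def is_assertion(self) -> bool:
        """True when the event was asserted (bit 7 clear), False when deasserted."""
        return (self.event_dir_type & 0x80) == 0

    @property
    def event_type(self) -> int:
        return self.event_dir_type & 0x7F

    @property
    def sensor_type_name(self) -> str:
        return SENSOR_TYPES.get(self.sensor_type, "Unknown")


def parse_sel_record(raw: bytes) -> SelRecord:
    """Decode one raw SEL record; raises ValueError on a wrong length."""
    if len(raw) != struct.calcsize(SEL_FORMAT):
        raise ValueError("a SEL record is exactly 16 bytes")
    return SelRecord(*struct.unpack(SEL_FORMAT, raw))


if __name__ == "__main__":
    # Made-up memory event logged by the BMC (generator ID 0x0020).
    sample = struct.pack(SEL_FORMAT, 1, 0x02, 1466726400, 0x0020,
                         0x04, 0x0C, 0x30, 0x6F, b"\x00\xff\xff")
    record = parse_sel_record(sample)
    print(record.record_id, record.sensor_type_name, record.is_assertion)
```

The raw bytes would normally come from the BMC, for example through a Get SEL Entry request; how they are fetched is outside the scope of this sketch.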
Communication between servers goes over the ICMB bus. The BMC can also be reached through external interfaces such as the LAN interface, the serial port, and the serial/modem interface. The serial/modem interface is essentially old-style telephone dial-up or the PPP protocol. The LAN interface is the most widely used.

The upper part of the figure is called out-of-band and the lower part in-band. The upper layer sits outside the server and obtains the server's hardware information through the external interfaces. The lower layer obtains device information from inside, through the operating system.

Software stack

In the software stack, the upper layer is a set of in-band access functions, such as obtaining out-of-band information from within the system through protocols like RPC and SNMP. The information can also be obtained in-band through the DMI, CIM, and WMI interfaces. On Windows, a great deal of information is available through WMI.

At the network layer, messages are encapsulated in the RMCP protocol; version 2.0 uses RMCP+. This encapsulation is the outermost layer on the physical network card and uses UDP port 623. Inside it is the IPMI message itself, including its NetFN, LUN, seq#, CMD, and data.

The message format comes in two parts, request and response. rsAddr and rqAddr are the responder and requester addresses, each one byte long. The lowest bit is a flag: 0 means the byte holds a slave address and 1 means it holds a software ID. The upper seven bits carry the actual slave address or software ID. CMD is the command code, which can be looked up in the specification. The completion code is a one-byte code returned after CMD has been executed. Data is the request data or the response data. LUN is the logical unit number.
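The request fields described here, together with the NetFn/LUN packing and checksum covered in the following paragraph, can be sketched as below. This is a minimal illustration of the IPMB-style request layout from the IPMI specification, not the speaker's code, and the function names are invented for the example. The values used in the demo (BMC slave address 0x20, remote console software ID 0x81, NetFn 0x06 "App", command 0x01 "Get Device ID") are the standard ones from the specification.

```python
from typing import Tuple


def checksum(data: bytes) -> int:
    """Two's-complement checksum: (sum(data) + checksum) % 256 == 0."""
    return (-sum(data)) & 0xFF


def build_request(rs_addr: int, net_fn: int, rs_lun: int,
                  rq_addr: int, rq_seq: int, rq_lun: int,
                  cmd: int, data: bytes = b"") -> bytes:
    """Assemble a request: two checksummed parts, header then body."""
    if not 0 <= net_fn <= 0x3F or net_fn % 2:
        raise ValueError("a request NetFn is an even 6-bit value")
    if not 0 <= rq_seq <= 0x3F:
        raise ValueError("the sequence number is a 6-bit value")
    if not (0 <= rs_lun <= 3 and 0 <= rq_lun <= 3):
        raise ValueError("a LUN is a 2-bit value")
    # NetFn sits in the high 6 bits, the responder LUN in the low 2 bits.
    header = bytes([rs_addr, (net_fn << 2) | rs_lun])
    # Sequence number in the high 6 bits, requester LUN in the low 2 bits.
    body = bytes([rq_addr, (rq_seq << 2) | rq_lun, cmd]) + data
    return header + bytes([checksum(header)]) + body + bytes([checksum(body)])


def is_valid(message: bytes) -> bool:
    """Each part, including its checksum byte, must sum to 0 modulo 256."""
    if len(message) < 7:
        return False
    return sum(message[:3]) % 256 == 0 and sum(message[3:]) % 256 == 0


def describe_address(addr: int) -> Tuple[str, int]:
    """Split an address byte into its kind (bit 0) and 7-bit value (bits 7..1)."""
    kind = "software ID" if addr & 0x01 else "slave address"
    return kind, addr >> 1


if __name__ == "__main__":
    request = build_request(0x20, 0x06, 0, 0x81, 0, 0, 0x01)
    print(request.hex(), is_valid(request))
    print(describe_address(0x20), describe_address(0x81))
```

Over the LAN interface, a message like this would still have to be wrapped in the session and RMCP headers before being sent to UDP port 623; that wrapping is left out here.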
NetFn is the function category, and it is followed by the LUN in the same byte: NetFn occupies the high 6 bits and the LUN the low 2 bits. Even NetFn values are message requests and odd values are message responses. The sequence byte is split the same way. Its high 6 bits are the sequence number of the request, generated anew for each request, and its low 2 bits are the LUN of the party that receives the response.

The last field is the checksum, which is computed with a simple algorithm. The running sum starts at 0, every byte to be checked is added to it modulo 256, and the checksum is the value that brings the total back to 0. To verify, add the checked bytes and the checksum together modulo 256: a result of 0 means the message is correct.

These are some of the important points of the IPMI architecture, and they come up constantly in actual development.

This article is compiled from the speech "Server Out-of-Band Management and Its Applications" given by Yan Qiang, Deputy Manager of Shanda Games' Technical Service Department and Senior Researcher, at the WOT2016 Internet Operation and Developer Conference hosted by 51CTO Media.
Yan Qiang, a senior researcher at Shanda Games, is currently responsible for the company's data center planning, construction and resource management, and for IDC operation and maintenance. In ten years at Shanda, moving from IT to IDC, he has built up extensive experience in hardware, systems, and operations.