Source: Translated and compiled by Semiconductor Industry Observer from The Next Platform.

At the end of the Google I/O 2016 keynote, Google CEO Sundar Pichai mentioned a recent achievement in AI and machine learning: a processor called the Tensor Processing Unit (TPU). This month, the first-generation TPU was superseded. At Google I/O 2017, held early this morning, Google not only promoted Android 8.0 but, more importantly, focused on artificial intelligence, and so the second-generation TPU was born.

The TPU is a high-performance processor that Google developed for its AI computing services. The first-generation product, used in systems such as AlphaGo, focused on inference throughput; compared with it, the second generation extends the chip to deep learning training as well as inference. On performance, Google says the new TPU reaches 180 TFLOPS of floating-point throughput, 15 times that of traditional GPUs and 30 times that of CPUs. Google has also announced a computing array called the TPU pod, which can hold up to 64 second-generation TPUs, for an astonishing 11.5 PFLOPS. As the name suggests, the TPU takes its name from Google's open-source deep learning framework TensorFlow; for now it remains a chip used only inside Google.

The birth of the TPU

In 2011, Google realized it had a problem. It was starting to think seriously about deploying deep learning networks, which are computationally demanding and were straining its computing resources. Google calculated that if every user ran three minutes a day of voice search backed by deep-learning speech recognition models, the company would have to double the size of its existing data centers. It needed more powerful and more efficient processing chips.

What kind of chips? Central processing units (CPUs) handle a very wide range of computing tasks efficiently, but a CPU can only work on a relatively small number of tasks at a time. Graphics processing units (GPUs) are less efficient at any single task and cover a narrower range of tasks, but their strength is performing enormous numbers of tasks simultaneously. For example, if you need to multiply three floating-point numbers, a CPU beats a GPU; but if you need to do a million such multiplications, the GPU crushes the CPU. GPUs are ideal chips for deep learning, because complex deep learning networks require millions of calculations to be performed in parallel.

Google used Nvidia GPUs, but that was not enough; it wanted more speed and more efficient chips. A single GPU does not consume much energy, but when Google's millions of servers run around the clock, energy consumption becomes a serious problem. Google decided to build its own, more efficient chip.

In May 2016, Google announced the TPU for the first time at the I/O conference and said the chip had already been running in its data centers for a year. The TPU was also in use when Lee Sedol played AlphaGo; Google called it the "secret weapon" that helped AlphaGo defeat Lee Sedol.
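To make the latency-versus-throughput point concrete, here is a minimal Python sketch. It uses numpy's vectorized multiply as a stand-in for the data-parallel style a GPU exploits; the array sizes and timings are purely illustrative, not Google's workload:

```python
import time
import numpy as np

# A single multiply of a few floats: latency dominates, and any CPU wins.
a, b = 2.0, 3.0
print(a * b)

# A million independent multiplies: throughput dominates.
x = np.random.rand(1_000_000)
y = np.random.rand(1_000_000)

t0 = time.perf_counter()
out_loop = [xi * yi for xi, yi in zip(x, y)]   # one-at-a-time, CPU-style
t1 = time.perf_counter()
out_vec = x * y                                # bulk, data-parallel style
t2 = time.perf_counter()

assert np.allclose(out_loop, out_vec)
print(f"scalar loop: {t1 - t0:.4f} s, vectorized: {t2 - t1:.4f} s")
```

Even on one machine the vectorized path is typically orders of magnitude faster, which is the same gap, writ small, that pushed Google from CPUs toward GPUs and then ASICs.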
Internal architecture of the first-generation TPU

The first diagram shows the TPU's internal structure, together with the external DDR3 memory and the host interface on the left. Instructions are sent from the host into queues (there is no looping); control logic then runs the same instruction many times, as the instruction dictates. The TPU is not a complex piece of hardware; it looks more like a signal-processing engine for radar applications than a standard x86-derived architecture. Jouppi said that although it has many matrix multiplication units, it is closer in spirit to a floating-point coprocessor than to a GPU. Note also that the TPU has no stored program; it simply executes instructions sent from the host. The DRAM on the TPU is operated in parallel as a unit, because more and more weights must be fetched to feed the matrix multiplication unit (64K, i.e. 256 × 256 = 65,536, multiply-accumulates per cycle in total). Jouppi did not say exactly how data flows through the systolic array, but he did say that using a host software accelerator would have been a bottleneck.

Figure: the 256 × 256 systolic data-flow engine, with a nonlinear activation applied after the matrix multiply-accumulate

As the second picture shows, the TPU has two memory units, plus an external DDR3 DRAM for the model's parameters. As parameters arrive, they are loaded into the matrix multiplication unit from the top. At the same time, activations (the outputs from the "neurons") are loaded from the left. They move into the matrix unit in a systolic fashion to produce the matrix multiply, performing 64K accumulations per cycle.

Google has doubtless used some new tricks and technologies to raise the TPU's performance and efficiency, such as high-bandwidth memory or hybrid 3D memory. Its harder problem, however, is keeping the distributed hardware consistent.

Second-generation TPU: training as well as inference

The first-generation TPU could only run the inference stage of deep learning; models had to be trained elsewhere first. The new version can also train neural networks. Jeff Dean, who leads the Google Brain research team, said: "I expect we will use more of these TPUs for artificial intelligence training, making our experimental cycle faster."

"When we designed the first-generation TPU, we had already built a complete and excellent R&D team for chip design and development, and essentially the same people worked on the second-generation TPU. From an R&D perspective, the second generation improves single-chip performance from the standpoint of the overall system, which is much simpler than designing the first-generation chip from scratch. That let us spend more energy thinking about how to raise the chip's performance, how to integrate it better into the system, and how to make it play a larger role," Dean said in his talk.

In the future we will keep following Google's progress to understand this network architecture further. Before that, we should understand the new TPU's architecture, performance, and operation, and how it achieves ultra-high-performance computing. Google did not show chip samples of the new TPU or detailed technical specifications at this conference, but we can still make some educated guesses from what is known so far.
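As an idealized illustration of the data flow just described (not Google's actual implementation), the following Python sketch accumulates one rank-1 "wavefront" per step, the way weights loaded from the top meet activations streamed in from the left across the 256 × 256 array:

```python
import numpy as np

ARRAY_DIM = 256                        # the TPU v1 matrix unit is 256x256
print(ARRAY_DIM * ARRAY_DIM)           # 65,536 -- the "64K" MACs per cycle

def systolic_matmul(activations, weights):
    """Toy emulation: weights pre-loaded from the top, activations fed
    from the left, partial sums accumulated in place in the array."""
    n, k = activations.shape
    k2, m = weights.shape
    assert k == k2
    acc = np.zeros((n, m))             # accumulators inside the array
    for step in range(k):              # one idealized wavefront per cycle
        acc += np.outer(activations[:, step], weights[step, :])
    return acc

A = np.random.rand(4, 8)
W = np.random.rand(8, 5)
assert np.allclose(systolic_matmul(A, W), A @ W)   # matches a plain matmul
```

The design point this captures: each element of the array does one multiply-accumulate per cycle and passes data to its neighbor, so a 256 × 256 grid sustains 64K MACs per cycle without touching main memory between steps.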
Judging from the TPU pictures released this time, the second-generation TPU looks a bit like a Cray XT or XC blade. The pictures show several interconnected chips soldered to the board, which also carries the connections between the chips and the outside world. There are four TPU chips on the board, and as noted above, each board delivers 180 TFLOPS of floating-point performance (Google has not broken the figure down per chip). There are four external interfaces on the left and right sides of the board, plus two additional interfaces on the left, which makes the board look a little unusual. If each TPU chip could one day attach directly to memory, much as AMD's upcoming "Vega" processor attaches memory directly to the GPU, this layout would become very interesting: the two extra interfaces on the left could let the TPU connect directly to memory, or directly to an upstream high-speed network for more complex calculations. These are guesses from the pictures, unless Google reveals more about the chip.

Each TPU board has two interfaces for external devices, plus the two additional interfaces on the left for expansion, which would let developers design more functions and add more extensions on this basis, whether attaching local storage or a network link. Both are theoretically feasible; Google would only need a relatively loose, workable memory-sharing protocol across these interfaces.

The figure below shows one possible way multiple TPU boards could be connected. Google says this configuration can reach 11.5 petaflops of machine-learning compute. How do we get that number? The arrangement looks very much like an Open Compute rack, or something similar. Vertically, 8 TPU boards are stacked; horizontally, 4 boards sit side by side. We cannot yet tell whether each unit is a complete TPU board or half a board; we can only see six interfaces on one side and two on the other. Notably, there are four interfaces in the middle of the rack and two on each side, with no enclosure like the TPU board's on the left and right; a more reasonable reading is that the sides carry local memory interfaces rather than TPU chip interfaces. Even so, we can count at least 32 second-generation TPU boards running, i.e. 128 TPU chips. At 180 TFLOPS per board, a full pod of 64 such boards works out to roughly 11.5 petaflops (11,520 TFLOPS), matching Google's figure. To put that in commercial perspective, the 32 top-of-the-line GPUs Google currently uses for its large-scale translation work could in the future be replaced by 4 TPU boards, greatly cutting the time translation requires. It is also worth noting the claim that the TPU chip is aimed not only at this kind of floating-point work but at high-performance computing more broadly.

Training and learning on the TPU

Compared with the first generation, the second-generation TPU has not only improved raw compute but also added training capability. Under the first generation, a model had to be trained on GPUs first.
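A quick back-of-the-envelope check of the pod arithmetic above, using only the figures already quoted (the per-board throughput and 64-board pod size are Google's published numbers; the 32-board count is what the photo appears to show):

```python
# Pod math from the quoted figures.
tflops_per_board = 180            # one board = four second-generation chips
boards_per_pod = 64
pod_pflops = tflops_per_board * boards_per_pod / 1000
print(pod_pflops)                 # 11.52 -> Google's "11.5 petaflops"

chips_visible = 32 * 4            # the 32 boards countable in the photo
print(chips_visible)              # 128 chips, i.e. half of a full 64-board pod
```

If the photo indeed shows 32 boards, it would be half a pod, which is consistent with the speculation above that the pictured rack is one of a pair or that half-boards are involved.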
This split training mode slows the experimental cycle for developers such as Google and means reshaping the trained model, so it takes longer before the machine acquires useful inference capability: you first train on one relatively simple, uniform device, then carry the result to a more complex environment to obtain higher-level inference. Intel's forthcoming AI processors will adopt this iterative model, and the same is true of Nvidia's Volta GPU. Nvidia's Volta, with its "tensor cores", offers extremely fast machine-learning training and may reach 120 teraflops per device; its raw floating-point throughput is about 40 percent higher than last year's Pascal GPUs. Even if it is hard to feel the impact of super-fast compute like Google's TPU in daily life, the ever-faster GPUs are impressive and closer to us.

Dean said the architecture Nvidia uses in Volta is very interesting, making it possible to accelerate applications through a core matrix unit. To some extent, Google's first-generation TPU adopted a similar idea, and these techniques are still in use in machine learning. "Being able to speed up linear algebra is always very useful," Dean emphasized.

Hardware aside, there is still plenty to attract users. Unlike projects that have always been kept confidential, Google will bring TPU technology to the Google Cloud Platform. Jeff Dean, a senior researcher at Google, said the company does not want to limit competition by any means, and hopes to give the TPU more room and possibility, letting it compete with Volta GPUs and Skylake Xeons in the future. Dean believes the platform should also give developers more opportunity to build and run their own, unique models, rather than constraining their thinking. In the future, Google will make more than 1,000 TPUs available on its cloud platform to research teams working on open scientific projects, to keep advancing machine learning.

Dean said that inside Google today, machine training uses GPUs and CPUs at the same time, sometimes even on the same device, to keep things balanced. For the new generation of TPU chips, the power consumption during training cannot yet be estimated accurately, but it is safe to assume the part is positioned squarely against the Volta GPU. Since such systems must functionally cover both machine learning and 64-bit high-performance computing, the workload mix is extremely complex; Nvidia's GPUs run into the same problem, and solving it well will take continued engineering work. Dean also acknowledged as much: "Unlike the first-generation TPU chip, which computed with integers, the second-generation chip does floating-point arithmetic. So during training and learning there is no need to maintain a separate fixed-point model or change the algorithm: engineers can use the same floating-point representation throughout, which greatly reduces the workload."
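Dean's point about dropping the fixed-point conversion step can be sketched in a few lines. The int8 scheme below is a generic, hypothetical quantization recipe for illustration, not the first-generation TPU's actual one:

```python
import numpy as np

def quantize_int8(w):
    """First-generation style: after (floating-point) training, weights must
    be converted to 8-bit integers before the chip can run inference."""
    scale = np.max(np.abs(w)) / 127.0      # map the weight range onto int8
    q = np.round(w / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)   # model trained in float
q, s = quantize_int8(w)
print(np.max(np.abs(w - dequantize(q, s))))    # error introduced by the step

# Second-generation style (per Dean): the same floating-point model serves
# both training and inference -- the conversion step above disappears.
```

Besides the extra engineering work, the conversion introduces approximation error that has to be validated each time, which is exactly the workload Dean says the floating-point second generation removes.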
Beyond Nvidia and Intel, Google bringing its custom hardware to market is a good thing for enterprises, because the TPU is still a relatively marginal technology there. When the second-generation TPU reaches the Google Cloud Platform, Google will put training in front of a large number of users, which will do far more to advance the technology. For anyone wondering why Google does not sell the chips commercially, the above is probably the answer. As artificial intelligence and neural network technology keep developing, the TPU will be able to show its strength on Google Cloud and become a major force for technical progress.

What does the TPU mean to Google?

Google has developed a software engine built specifically for deep neural networks. Google says that, measured against the pace of Moore's Law, the TPU's compute today is equivalent to the level chips would otherwise reach only seven years from now. It delivers more machine-learning operations per watt, meaning fewer transistors are spent per operation and therefore more operations are done per second. And Google has bound it tightly to its deep learning platform TensorFlow, which gives it better support and a stronger ecosystem, including the more than 100 Google projects that use machine learning, such as Search, self-driving cars, and intelligent voice.

Are TPUs the future of deep learning?

Deploying chips for deep learning is not a zero-sum game. Real-world deep learning systems need their GPUs to communicate with other GPUs or with ASICs such as Google's TPU. GPUs are an ideal general workspace, with the flexibility deep learning requires; ASICs, however, are ideal when fully dedicated to one software library or platform, and Google's TPU clearly fits that description. The TPU's excellent performance makes it very likely that TensorFlow and the TPU will be upgraded in lockstep. Although Google has said repeatedly that it will not sell the TPU, third parties using Google Cloud Services for machine learning can still benefit from its performance.

The smart-chip market keeps shifting. The arrival of Google's TPU makes the trend toward chips specialized for neural networks and deep learning even clearer. High-end AI applications need powerful chips behind them; without chips, neither the software nor the hardware side of an intelligent ecosystem can develop. China's processor research and engineering are improving steadily, and we look forward to Chinese chips appearing on the world stage and competing with international peers as soon as possible.

Original English text: https://www.nextplatform.com/2017/05/17/first-depth-look-googles-new-second-generation-tpu/