Computing power on data platforms: Which GPUs are better suited for deep learning and databases?

Recently, Google disclosed the details of its TPU (Tensor Processing Unit) at ISCA 2017. Following the Haswell CPU and the Tesla K80 GPU, the TPU adds another high-performance weapon built specifically for machine learning and neural networks.

Data Analysis and GPUs

GPUs can not only implement many database functions but also, thanks to their raw computing power, deliver real-time analytics. MapD and Kinetica are two well-known companies in this field. MapD uses NVIDIA Tesla K40/K80 GPUs to build an SQL, column-store database that needs no indexes and handles arbitrary combinations of filters (WHERE), aggregations (GROUP BY), and so on, covering the BI functionality of traditional relational databases while letting users freely run multi-condition queries. The performance advantage, especially in response time, is obvious.

For example, MapD took the data of all flights arriving in and departing from the United States between 1987 and 2008, expanded it tenfold, and ran full-table-scan report queries of the form "SELECT ... GROUP BY ...". A single server with 8 Tesla K40 cards (8 cores / 384 GB RAM / SSD) was 50-100 times faster than an in-memory database cluster of 3 servers (32 cores / 244 GB RAM / SSD each).

Another major feature of GPU databases is visual rendering. OpenGL (and similar) buffers are mapped directly into CUDA's device memory, so query results are rendered in place without being copied between host memory and the GPU, which enables high-frame-rate animation. Results can also be rendered on the GPU into PNG images or a video stream and then sent to the client, greatly reducing the amount of data transmitted over the network. These advantages have attracted many developers.

Kinetica is a well-known company in real-time analytics. It started out analyzing 250 real-time data streams for US intelligence agencies. Today, with 10 nodes, it processes data from 200,000 sensors to support 15,000 parallel real-time analyses, logistics route calculations, and scheduling tasks for the US Postal Service (USPS).

More and more users in China are using GPUs for analytics and mining, and many others want to learn more. The fastest way to get started is to reproduce earlier experiments. The development environment and experiments in the University of Virginia paper Accelerating SQL Database Operations on a GPU with CUDA are worth learning from. The authors used an NVIDIA Tesla C1060 with 4 GB of video memory in a low-end server (Xeon X5550, 2.66 GHz / 4 cores, 5 GB RAM) and ran queries and aggregations over a 5-million-row table, with response times of 30-60 milliseconds.

The best configuration we have tested is an NVIDIA GTX 780, which costs a bit over 1,000 yuan and is good enough for experimenting with queries and aggregations. The approach: first use SQLite to parse the SQL into a series of opcode steps, then implement a virtual machine in CUDA that executes each step in turn while traversing the table row by row. Some of these steps can be parallelized, so CUDA can launch tens of thousands of threads, each processing one row.
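To make the "one thread per row" idea concrete, here is a minimal sketch in Python using Numba's CUDA support. It is not MapD's code nor the Virginia paper's virtual machine; the column names (dep_delay, carrier_id) and the predicate are hypothetical. Each thread evaluates a WHERE condition on its row and contributes to a GROUP BY count via an atomic add.

```python
import numpy as np
from numba import cuda

@cuda.jit
def filter_and_count(dep_delay, carrier_id, counts):
    i = cuda.grid(1)                      # global thread index = row index
    if i < dep_delay.size:
        if dep_delay[i] > 15:             # WHERE dep_delay > 15
            # GROUP BY carrier_id, COUNT(*)
            cuda.atomic.add(counts, carrier_id[i], 1)

n_rows, n_groups = 5_000_000, 32
dep_delay = np.random.randint(-10, 120, n_rows).astype(np.int32)
carrier_id = np.random.randint(0, n_groups, n_rows).astype(np.int32)

d_delay, d_carrier = cuda.to_device(dep_delay), cuda.to_device(carrier_id)
d_counts = cuda.to_device(np.zeros(n_groups, dtype=np.int32))

threads = 256
blocks = (n_rows + threads - 1) // threads
filter_and_count[blocks, threads](d_delay, d_carrier, d_counts)
print(d_counts.copy_to_host())
```

With 5 million rows and 256 threads per block, roughly 20,000 blocks are launched, which is exactly the massive-parallelism pattern described above.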

Deep Learning and GPUs

Deep learning demands massive computing power, so the choice of GPU greatly affects the user experience. Before GPUs, an experiment might take months to complete, or a whole day just to discover that the chosen parameters were bad. A good GPU lets you iterate on a deep learning network quickly, turning months of experiments into days, days into hours, and hours into minutes.

Fast GPUs help beginners quickly accumulate hands-on experience and apply deep learning to real-world problems. Learning is frustrating if you cannot get results, and learn from your mistakes, quickly. Tim Dettmers applied deep learning with GPUs in a series of Kaggle competitions and finished runner-up in the Partly Sunny with a Chance of Hashtags competition. He used two large two-layer deep neural networks with ReLU activations and Dropout for regularization, and the networks barely fit into 6 GB of GPU memory.
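For orientation only, the following PyTorch sketch shows the ingredients mentioned above (a deep fully connected network with ReLU and Dropout). It is not Tim's actual competition model; the input dimension, hidden sizes, and output size are made-up placeholders.

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(10_000, 4096),   # wide hidden layers are what eat GPU memory
    nn.ReLU(),
    nn.Dropout(p=0.5),         # Dropout for regularization
    nn.Linear(4096, 4096),
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(4096, 32),       # hypothetical multi-label output
).cuda()

x = torch.randn(128, 10_000, device="cuda")
loss = model(x).sum()
loss.backward()                # weights, activations, and gradients together drive memory use
```

Memory is consumed not only by the weights but also by the activations and gradients kept for backpropagation, which is why a large batch of wide layers can already strain a 6 GB card.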

Are multiple GPUs necessary?

Tim once built a small GPU cluster with 40 Gbit/s InfiniBand, but he found it hard to parallelize neural networks across multiple GPUs, and the speedup on dense neural networks was unimpressive. Small networks parallelize more effectively with data parallelism, but the large network used in the competition saw almost no speedup.

Later he developed an 8-bit compression method, which in theory should parallelize dense or fully connected layers more effectively than exchanging 32-bit values. The results were still not ideal: even with an optimized parallel algorithm and code written specifically for multi-GPU execution, the gains were not worth the effort. You need to understand exactly how deep learning algorithms interact with the hardware to know whether parallelism will actually pay off.
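The basic idea behind 8-bit gradient compression is easy to illustrate, even though the sketch below is a generic simplification and not Tim's algorithm: scale each gradient tensor down to int8 before exchanging it between GPUs, then rescale on the receiving side, trading a little precision for a 4x reduction in communication volume.

```python
import torch

def quantize_8bit(grad: torch.Tensor):
    # one scale per tensor; 127 is the int8 range limit
    scale = grad.abs().max().clamp(min=1e-8) / 127.0
    q = torch.round(grad / scale).to(torch.int8)   # 4x smaller than float32
    return q, scale

def dequantize_8bit(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) * scale

grad = torch.randn(4096, 4096)
q, scale = quantize_8bit(grad)
approx = dequantize_8bit(q, scale)
print((grad - approx).abs().max())   # worst-case quantization error
```

Whether the saved bandwidth outweighs the extra quantize/dequantize work and the precision loss depends on the network shape and the interconnect, which is exactly the hardware-algorithm interaction referred to above.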

Multi-GPU support in deep learning libraries is becoming more common, but it is still far from universal, and the results are not always good. So far only CNTK can efficiently run algorithms across multiple GPUs and multiple machines, thanks to Microsoft's special 1-bit quantization parallel algorithm (fairly efficient) and block-momentum algorithm (very efficient).

On a cluster of 96 GPUs, CNTK can reach a 90-95x speedup. The next library likely to parallelize efficiently across machines is PyTorch, but it is not there yet. For parallelism on a single machine, CNTK, Torch, and PyTorch all work, giving speedups of around 3.6-3.8x; these libraries include algorithms that run in parallel on a single machine with 4 GPUs. Other libraries that support parallelism are either slow or suffer from both problems.
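As a concrete single-machine example, here is a minimal data-parallel sketch in PyTorch using nn.DataParallel: the batch is split across the visible GPUs, each GPU runs a replica of the model, and gradients are combined on the default device. This is one common way to do it, not a claim about how the frameworks above achieve their quoted speedups; the model and data are placeholders.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 10))
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)   # replicate across all visible GPUs (e.g. 4)
model = model.cuda()

opt = torch.optim.SGD(model.parameters(), lr=0.01)
x = torch.randn(256, 1024, device="cuda")
y = torch.randint(0, 10, (256,), device="cuda")

loss = nn.functional.cross_entropy(model(x), y)
opt.zero_grad()
loss.backward()
opt.step()
```

For small models, the replication and gradient-gathering overhead can eat most of the gain, which is consistent with the modest 3.6-3.8x figure on 4 GPUs.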

Multiple GPUs, non-parallel

Another benefit of multiple GPUs is that you can run separate algorithms or experiments on each GPU, even without parallelizing any single one. No individual run gets faster, but you learn about the performance of several algorithms or parameter settings at once. This is very useful for researchers who want to accumulate deep learning experience as quickly as possible and try many variants of an algorithm.

This also helps the deep learning workflow itself. The faster tasks finish, the faster you get feedback, and the easier it is to piece those fragments into complete conclusions. Training two convolutional networks on small datasets on different GPUs helps you discover more quickly what works well, spot patterns in cross-validation error more smoothly and interpret them correctly, and identify which parameters or layers should be added, removed, or adjusted.
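In practice this needs no parallel algorithm at all: each experiment simply pins itself to a different device. The sketch below (hypothetical helper, toy model, made-up hyperparameters) shows the pattern; you would typically launch one process per GPU.

```python
import torch

def run_experiment(device: str, lr: float) -> float:
    # toy experiment: train a linear classifier on random data
    model = torch.nn.Linear(512, 10).to(device)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    x = torch.randn(64, 512, device=device)
    y = torch.randint(0, 10, (64,), device=device)
    for _ in range(100):
        loss = torch.nn.functional.cross_entropy(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return loss.item()

# one variant per GPU, e.g. launched from two shells:
#   CUDA_VISIBLE_DEVICES=0 python exp.py --lr 0.1
#   CUDA_VISIBLE_DEVICES=1 python exp.py --lr 0.01
print(run_experiment("cuda:0", lr=0.1))
```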

In general, a single GPU is sufficient for almost all tasks, but using multiple GPUs to accelerate deep learning models is becoming increasingly important, and several cheap GPUs also let you learn deep learning faster. For that reason, multiple small GPUs are often a better buy than one large one.

Which one to choose?

NVIDIA GPU, AMD GPU or Intel Xeon Phi

NVIDIA's standard libraries make it easy to build deep learning libraries on CUDA, whereas AMD's OpenCL standard libraries are not as strong. In addition, the CUDA / GPGPU community is large while the OpenCL community is smaller, so good open-source code and reliable programming advice are easier to find on the CUDA side.

Moreover, NVIDIA bet on deep learning early, and the bet has paid off handsomely. Other companies are now pouring money and effort into deep learning, but they started late and are lagging behind. Using any software-hardware stack other than NVIDIA plus CUDA for deep learning today means taking a detour.

Intel's Xeon Phi supposedly supports standard C code, which you can easily tweak to run accelerated on the Phi. That sounds appealing, but in practice only a small subset of C is supported, and even the supported parts run slowly. Tim has worked on a 500-node Xeon Phi cluster and hit one pitfall after another: the Xeon Phi MKL was incompatible with Python's NumPy, making unit testing impossible; the Intel compiler could not properly handle reductions in templated code such as switch statements, so large parts of the code had to be refactored; and because the compiler lacked some C++11 features, the program's C interfaces had to be changed. It was troublesome, time-consuming, and maddening.

Execution was also slow. When tensor sizes kept changing, it was unclear whether a bug or the thread scheduling was hurting performance; for example, with varying sizes of fully connected (FC) or Dropout layers, the Xeon Phi was slower than the CPU.

The fastest GPU on a budget

What determines the speed of a GPU for deep learning? The number of CUDA cores? The clock speed? The amount of RAM? None of these. The single most important factor for deep learning performance is memory bandwidth.

GPUs are optimized for memory bandwidth at the expense of access latency. CPUs are the opposite: they compute quickly on small amounts of memory, such as multiplying a few numbers (3*6*9), but slow down when a lot of memory is involved, as in matrix multiplication (A*B*C). With their high memory bandwidth, GPUs shine on problems that need large amounts of memory. There are of course subtler differences between GPUs and CPUs; see Tim's answer on Quora.

So when buying a fast GPU, look at the memory bandwidth first.
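A rough way to see memory bandwidth in action is to time a large element-wise copy on the GPU and compute the effective GB/s, as in the sketch below. This is an illustration under simplified assumptions, not a rigorous benchmark; the measured figure will sit below the spec-sheet bandwidth.

```python
import time
import torch

n = 256 * 1024 * 1024                      # 256M float32 values = 1 GiB per tensor
x = torch.empty(n, device="cuda")
y = torch.empty(n, device="cuda")

torch.cuda.synchronize()
t0 = time.time()
for _ in range(10):
    y.copy_(x)                             # reads 4 bytes and writes 4 bytes per element
torch.cuda.synchronize()
elapsed = time.time() - t0

bytes_moved = 10 * 2 * 4 * n
print(f"effective bandwidth: {bytes_moved / elapsed / 1e9:.1f} GB/s")
```

Running this on two cards of the same architecture gives a quick sanity check that their relative deep learning speed roughly tracks their relative bandwidth.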

Comparing CPU and GPU bandwidth development

When two cards share the same chip architecture, their bandwidth can be compared directly. For example, comparing the Pascal cards GTX 1080 and GTX 1070 only requires looking at memory bandwidth: the GTX 1080 (320 GB/s) is about 25% faster than the GTX 1070 (256 GB/s). Across architectures, direct comparison breaks down: Pascal and Maxwell (say, GTX 1080 versus Titan X) use the same bandwidth differently because of their different manufacturing processes. Even so, bandwidth still gives a rough indication of how fast a GPU is.

Also check whether the architecture is supported by cuDNN. Most deep learning libraries use cuDNN for convolutions, which requires a Kepler or newer GPU, i.e. the GTX 600 series or above. Kepler is generally slow, so from a performance standpoint the 900 or 1000 series should be considered. To compare how different cards perform on deep learning tasks, Tim made a chart: for example, a GTX 980 is about 0.35x as fast as a Titan X Pascal, or put another way, the Titan X Pascal is almost three times faster than a GTX 980.

These numbers do not come from running deep learning benchmarks on every card; they are derived from card specifications and compute benchmarks (some cryptocurrency mining workloads are computationally similar to deep learning), so they are only rough estimates. The real numbers will differ slightly, but not by much, and the ranking of the cards should hold. Also note that small networks which fail to saturate the GPU make big GPUs look worse than they are: an LSTM with 128 hidden units (batch size > 64) runs only slightly faster on a GTX 1080 Ti than on a GTX 1070. To see the real performance gap, use an LSTM with 1024 hidden units (batch size > 64).
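The comparison just described is easy to reproduce with a small timing harness like the PyTorch sketch below, which times one forward/backward pass of an LSTM with 128 versus 1024 hidden units. Sequence length, layer count, and run count are assumptions chosen for illustration.

```python
import time
import torch

def time_lstm(hidden_size, batch=128, seq_len=50, runs=20):
    lstm = torch.nn.LSTM(input_size=hidden_size, hidden_size=hidden_size).cuda()
    x = torch.randn(seq_len, batch, hidden_size, device="cuda")
    torch.cuda.synchronize()
    t0 = time.time()
    for _ in range(runs):
        out, _ = lstm(x)        # forward
        out.sum().backward()    # backward
    torch.cuda.synchronize()
    return (time.time() - t0) / runs

print("128 hidden units :", time_lstm(128))
print("1024 hidden units:", time_lstm(1024))
```

On a small hidden size the kernel launches and memory latency dominate, so a faster card barely helps; on the larger hidden size the workload becomes bandwidth- and compute-bound and the gap between cards shows up.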

Performance comparison of GPU running large deep learning networks

Generally speaking, Tim recommends a GTX 1080 Ti or GTX 1070. Both are good; if the budget allows, go with the GTX 1080 Ti. The GTX 1070 is cheaper and still faster than the old GTX Titan X (Maxwell). Both are better choices than the GTX 980 Ti because of their larger video memory: 11 GB and 8 GB instead of 6 GB.

8 GB may seem a bit small, but it is enough for many tasks, such as most image datasets in Kaggle competitions and natural language processing (NLP) tasks.

For getting started with deep learning, the GTX 1060 is the best choice, and it can occasionally be used for Kaggle competitions. The 3 GB version is too little; 6 GB is sometimes not enough either, but it handles many applications. The GTX 1060 is slower than the old Titan X, but its performance and its second-hand price are both comparable to the GTX 980.

In terms of price/performance, the 10 series is well designed: the GTX 1060, 1070, and 1080 Ti stand out. The GTX 1060 suits beginners, the GTX 1070 is a versatile choice for startups and for some research and industrial applications, and the GTX 1080 Ti is the all-round high-end option.

Tim does not recommend the NVIDIA Titan X (Pascal) because it is poor value for money. It does make sense for research on large computer vision datasets or on video data, where video memory size matters a great deal; the Titan X has 1 GB more than the GTX 1080 Ti, which gives it the edge there. Even then, buying a used GTX Titan X (Maxwell) on eBay is more cost-effective: it is a little slower, but its 12 GB of video memory is plenty.

The GTX 1080 Ti is enough for most researchers; the extra 1 GB of video memory adds little value for most research and applications.

For his own research, Tim chooses several GTX 1070s: he would rather run more experiments slightly slower than one experiment slightly faster. NLP does not need as much video memory as computer vision, so a GTX 1070 suffices. The tasks and methods he currently works on make the GTX 1070 the most suitable choice for him.

A similar approach works when choosing a GPU: think through the tasks and experimental methods you will run, then find a GPU that meets those requirements. GPU instances on AWS are currently expensive and slow. The GTX 970 is slow, expensive even second-hand, and the card has memory problems as well. It is worth spending a bit more on a GTX 1060, which is faster, has more video memory, and has no such memory issues. If the GTX 1060 is too expensive, a GTX 1050 Ti with 4 GB of video memory will do. 4 GB is a bit small but enough to get started with deep learning, and with some adjustments to your models you can get good performance. The GTX 1050 Ti is suitable for most Kaggle competitions, although it may limit your competitiveness in some of them.
