Someone described his mental model of CPUs on Twitter: "The CPU model in my head is still stuck in the 1980s: a box that does arithmetic, logic, shifts and bit operations, and can load and store values in memory. I'm vaguely aware of various newer developments, such as vector instructions (SIMD), and that new CPUs have virtualization support (though I have no idea what that means in practice). What cool developments have I missed? What can today's CPUs do that last year's couldn't? What about CPUs from two, five, or ten years ago? I'm mostly interested in features that programmers have to do something themselves to take advantage of (or have to redesign their programming environment for). I think that excludes hyperthreading/SMT, but I'm not sure. I'm also interested in things that CPUs can't do today but may be able to do in the future."

Unless otherwise specified, this article refers to x86 and the Linux environment. History repeats itself, and many things that are new on x86 are old news on supercomputers, mainframes and workstations.

The Status Quo

Miscellaneous Notes

Modern CPUs have wider registers and can address more memory. You may have used an 8-bit CPU in the 1980s, but you are certainly using a 64-bit CPU now. In addition to providing more address space, 64-bit mode provides more registers and more consistent floating point results (by avoiding the pseudo-random 80-bit precision that 32-bit and 64-bit operations get through x87 floating point). Other features that have been added to x86 since the early 1980s and that you are very likely already using include paging/virtual memory, pipelining, and floating point support.

This article will avoid discussing unusual low-level features that only matter when writing drivers, BIOS code, or doing security audits, such as APIC/x2APIC, SMM, or the NX bit.

Memory / Caches

Of all the topics here, the one most likely to affect your day-to-day programming work is memory access. My first computer was a 286, and on that machine a memory access might take just a few clock cycles. Years later, on a Pentium 4, a memory access took over 400 clock cycles. Processors have gotten faster much more quickly than memory has, and the answer to slow memory has been caches, which make frequently used data fast to access, and prefetching, which preloads data into the cache when the access pattern is predictable.

A few cycles versus 400+ sounds terrible: 100x slower. But if I write a loop that reads and operates on a large block of 64-bit (8-byte) values, and the CPU is smart enough to prefetch the right data before I need it, I can process at about 22 GB/s on a 3 GHz processor; that is only about an 8% loss, not 100x.

By using predictable memory access patterns and working on chunks of data smaller than the CPU cache, you can get most of the benefit of modern cache architectures. If you want to be as efficient as possible, this document is a good starting point; after digesting that 100-page PDF, you will want to familiarize yourself with your system's microarchitecture and memory subsystem, and learn to use tools like likwid to analyze and test your applications.

TLBs

There are small caches in the chip for all sorts of things, and unless you are going all out on micro-optimization, you don't need to know about the decoded-instruction cache and the other fun little caches. The biggest exception is the TLB, the cache for virtual memory lookups (which on x86 go through a 4-level page table structure).
The page tables are kept in the L1 data cache, and each lookup there takes 4 cycles, so walking all four levels for a full virtual address lookup takes about 16 cycles. That is unacceptable for every user-mode memory access, so there is a small, fast cache just for virtual address lookups. Because this first-level TLB must be fast, it is severely limited in size; if you use 4K pages, it determines how much memory you can touch without a TLB miss. x86 also supports 2MB and 1GB pages, and some applications benefit greatly from using these larger pages. If you have a long-running application that uses a lot of memory, it is worth looking into the details.

Out-of-Order Execution / Serialization

For the last two decades or so, x86 chips have been able to reorder the execution of instructions (to avoid stalling on a blocked resource). This can sometimes lead to very strange behavior. x86 requires that, on a single CPU, externally visible state such as registers and memory be updated as if everything had executed in order. These restrictions make things look as if they executed in order, and in most cases you can ignore the existence of out-of-order (OoO) execution unless you are trying to squeeze out performance. The main exception is when you want to make sure things not only look like they execute in order externally, but actually execute in order internally.

An example where you might care is timing a sequence of instructions with rdtsc. rdtsc reads a hidden internal counter and places the result in the externally visible registers edx and eax. Suppose we do this:
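Something along these lines (a sketch in the bracketed-operand notation used in the text, with foo, bar, and baz standing in for arbitrary instructions):

    foo
    rdtsc                 ; read the timestamp counter into edx:eax
    bar
    mov %eax, [%ebx]      ; store the low 32 bits of the counter to memory
    baz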
where foo, bar, and baz do not touch eax, edx, or [%ebx]. The mov following rdtsc writes the value of eax to a location in memory, and because eax is externally visible, the CPU will guarantee that the mov does not execute until after rdtsc has executed, so that everything appears to happen in order. However, since there is no explicit dependency between rdtsc and either foo or bar, rdtsc could execute before foo, between foo and bar, or after bar. It is even possible for baz to execute before rdtsc, as long as baz does not affect the mov in any way. That is fine in some cases, but not if rdtsc is being used to time foo. To pin down the order of rdtsc relative to other instructions, we need to serialize execution. How exactly? See this document from Intel.

Memory / Concurrency

In addition to the ordering restrictions above, which imply that loads and stores to the same location cannot be reordered with each other, x86 loads and stores have some other restrictions. In particular, for a single CPU, stores are never reordered with earlier loads, regardless of whether or not they are to the same location. However, a load can be reordered with an earlier store to a different location. For example, if you write:
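(a sketch with placeholder operands: a store to one location followed by a load from a different one)

    mov $1, [%esp]        ; store 1 to the location esp points to
    mov [%ebx], %eax      ; load from a different location into eax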
it can execute as if you had written:
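(the same illustrative instructions, reordered)

    mov [%ebx], %eax      ; the load is observed first...
    mov $1, [%esp]        ; ...before the earlier store becomes visible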
But the reverse is not true: if you write the second version, it can never execute as if you had written the first. You can force the first example to execute as written by inserting a serializing instruction, but that requires the CPU to serialize all instructions, which can be very slow, because it forces the CPU to wait until all previous instructions have completed before doing anything else. If you only care about load/store ordering, there is an mfence instruction that serializes only loads and stores. This article is not going to discuss the other memory fences, lfence and sfence, but you can read more about them here.

Loads and stores on a single core are mostly ordered, and the same restrictions apply across cores: if core0 observes core1, it will see that all of the single-core rules apply to core1's loads and stores. However, if core0 and core1 interact, there is no guarantee that their interaction is ordered. For example, suppose core0 and core1 start with eax and edx set to 0, and core0 executes:
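(an illustrative reconstruction; _foo and _bar are memory locations, both assumed to start at 0)

    mov $1, [_foo]        ; store 1 to _foo
    mov [_foo], %eax      ; load _foo into eax
    mov [_bar], %edx      ; load _bar into edx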
while core1 executes:
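(the mirror image of the core0 sketch)

    mov $1, [_bar]        ; store 1 to _bar
    mov [_bar], %eax      ; load _bar into eax
    mov [_foo], %edx      ; load _foo into edx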
For both cores, eax must be 1, because the first and second instructions depend on each other. However, it is possible for edx to be 0 on both cores, because the third line of core0 can execute before core0 has seen anything core1 has done, and vice versa. Memory barriers can be used to serialize memory accesses within a core if you need to rule this out. Linus has commented on why you usually want locking rather than hand-rolled memory barriers.
And it turns out that on modern x86 processors, using locking to implement concurrency is often cheaper than using memory barriers, so let's look at locks. If we set _foo to 0 and have two threads each execute incl (_foo) 10,000 times (a single-instruction increment of the same location, executed 20,000 times in total), the result is guaranteed not to exceed 20,000, but it could in theory be as low as 2. Working out why is a good exercise. We can test this with some simple code:
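A minimal sketch of such a test (illustrative, not necessarily the author's original program; it assumes GCC/Clang-style inline assembly on x86):

    #include <cstdio>
    #include <thread>

    int _foo = 0;   // shared counter, deliberately not std::atomic

    void incr_loop() {
        for (int i = 0; i < 10000; ++i) {
            // Single-instruction increment of the shared location.
            // Changing "incl %0" to "lock incl %0" makes the result always 20000.
            asm volatile("incl %0" : "+m"(_foo));
        }
    }

    int main() {
        std::thread t1(incr_loop), t2(incr_loop);
        t1.join();
        t2.join();
        std::printf("%d\n", _foo);
        return 0;
    }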
Compiling this with clang++ -std=c++11 -pthread and running it on my two machines gives a spread of results: not only do the results vary from run to run, the distribution of results also varies from machine to machine. We never hit the theoretical minimum of 2, or for that matter anything below 10,000, but it is possible to get a final result anywhere between 10,000 and 20,000. Even though incl is a single instruction, it is not guaranteed to be atomic. Internally, incl is a load, followed by an add, followed by a store. An increment on cpu0 can sneak in between the load and the store of an increment on cpu1, and vice versa.

Intel's solution to this is that a small number of instructions can be prefixed with lock to make them atomic. If we change the incl in the code above to lock incl, the output is always 20,000. To make a sequence of operations atomic, we can use xchg (which is always locked when it touches memory) or lock cmpxchg, the compare-and-exchange primitive. This article will not describe in detail how that works, but if you are curious you can read this article by David Dalrymple. Locked operations are globally ordered with respect to each other, and loads and stores are not reordered with respect to them. For the precise details of the x86 memory model, see the x86 TSO documentation.

In C or C++, however, the hardware guarantees are not the whole story:
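Consider a hand-rolled lock like the following sketch (illustrative; local_cpu_lock is the flag the next sentence refers to):

    local_cpu_lock = 1;
    // ... critical section ...
    local_cpu_lock = 0;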
The compiler has no idea that local_cpu_lock = 0 cannot be moved into the middle of the critical section. Compiler barriers are distinct from CPU memory barriers; because the x86 memory model is relatively strict, some compiler barriers are no-ops at the hardware level and do nothing except tell the compiler not to reorder. If you are working at a higher level of abstraction than microcode, assembly, C, or C++, your compiler most likely has no such annotations at all.

Memory / Porting

If you are porting code to another architecture, be aware that x86 has one of the strongest memory models of any architecture you are likely to encounter today. If you port code to an architecture with weaker guarantees (PPC, ARM, or Alpha) without thinking carefully, you will almost certainly get bugs. Consider Linus's comments on this example:
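The pattern being discussed looks roughly like this (an illustrative sketch, not the original snippet; mb() stands for a memory barrier):

    /* CPU1: writer */
    data = 1;
    mb();          /* make the write to data visible before the pointer */
    p = &data;

    /* CPU2: reader */
    q = p;
    d = *q;        /* on Alpha, this load can still observe the old value of *q
                      unless the reader also issues a (data-dependency) barrier
                      between the two loads */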
mb is a memory barrier. I won't go into detail here, but if you want to know why anyone would create a specification that allows such crazy behavior, consider that before rising production costs crushed DEC, its chips were fast enough to run x86 code in emulation faster than x86 chips could run it natively. For why most RISC-y architectures made the decisions they did, see the paper on the motivations behind the Alpha architecture. By the way, this is a major reason I am skeptical of the Mill architecture: regardless of whether it can deliver the performance claimed, being technically excellent is not a viable business model.

Memory / Non-Temporal Stores / Write-Combining Memory

The restrictions described in the previous section apply to cacheable ("write-back", or WB) memory. Before WB existed, there was only uncacheable (UC) memory. The interesting thing about UC memory is that every load and store is expected to actually go out to the bus, which makes perfect sense for processors with little or no onboard cache.

Memory / NUMA

Non-Uniform Memory Access, where memory access latency and bandwidth vary depending on which processor accesses which memory, is common enough that you should assume it by default. The cost here is that threads sharing memory should be on the same socket, and a thread doing heavy memory-mapped I/O should make sure it is on the socket closest to the I/O device it talks to.

Once upon a time, there was only RAM. Then CPUs got fast enough relative to RAM that people wanted to add caches. A cache that is inconsistent with its backing store (RAM) is bad news, so the cache must keep some information about what it holds, so that it knows when it needs to write things back. That is not too bad, but once you have two cores, each with its own cache, things get complicated. To keep the same programming model as the uncached case, the caches must be coherent with each other and with the backing store. Because the existing load/store instructions have nothing in their API that lets them say "sorry, this load failed because another CPU is using the address you want", the simplest approach is for each CPU to send a message on the bus every time it wants to load or store something. Both CPUs are already attached to that memory bus, so we just require the other CPU to respond if it holds a modified copy of the data (and give up its copy of the corresponding cache line).

In most cases, each CPU only touches data the other CPU does not care about, so there is some wasted bus traffic. It is not too bad, though, because once a CPU broadcasts "hello, I want to own this address and modify its data", it can assume it owns that address until another CPU asks for it, which usually will not happen. This still works on a 4-core CPU, although it wastes a bit more bandwidth, but it breaks down well beyond four cores, both because the bus becomes saturated and because the caches become saturated (the physical size/cost of a cache is O(n^2) in the number of simultaneous reads and writes it supports, and its speed is inversely correlated with its size). The "easy" fix is to have a single centralized directory keep track of all this information instead of doing N-way peer-to-peer broadcasts. Since we pack 2-16 cores on a chip these days anyway, it is natural for each chip (socket) to have a single directory tracking the cache state of its cores.
That only solves the problem within each chip; we still need some way for chips to talk to each other. Unfortunately, as these systems scaled up, bus speeds got so fast that it became genuinely hard, even for small systems, to drive a signal far enough to connect a bunch of chips and memory on a single bus. The simplest solution was to have each socket own a region of memory, so that a socket does not need a connection to every part of memory. This also avoids the complexity of a higher-level directory, because it is clear which directory owns any given region of memory. The downside is that if you are running on one socket and want memory owned by another socket, there is a significant performance penalty. Most "small" (<128 core) systems use a ring bus for simplicity, so the penalty is not just the direct latency/bandwidth cost of hopping through several stops to reach the memory; it also uses up a limited resource (the ring bus) and slows down other cross-socket accesses. In theory the OS handles all this transparently, but in practice it is often inefficient.

Context Switches / Syscalls

Here, syscall refers to a Linux system call, not the x86 SYSCALL or SYSENTER instruction. A side effect of all the caching on modern processors is that context switches are expensive, which makes system calls expensive. Livio Soares and Michael Stumm discuss this in detail in their paper, and I use some of their numbers below. Their measurement of instructions per clock (IPC) for a Core i7 running Xalan shows that 14,000 cycles after a system call, the code is still not running at full speed. The paper also tabulates the footprint of several different system calls, both in direct cost (instructions and cycles) and indirect cost (number of cache and TLB evictions). Some system calls cause more than 40 TLB evictions! For a chip with a 64-entry D-TLB, that nearly wipes out the TLB, and the evictions are not free.

The high cost of system calls is why people turn to batched system calls (e.g. epoll, or recvmmsg) for high-performance code, and why people who need very high-performance I/O often move to user-space I/O stacks. The cost of context switches is why high-performance code is usually one thread per core (or even a single thread pinned to a core), rather than one thread per logical task. This cost is also the driving force behind the vDSO, which turns some simple system calls that require no elevated privileges into plain user-space library calls.

SIMD

Essentially all modern x86 CPUs support SSE: 128-bit wide vector registers and instructions. Since it is common to perform the same operation many times, Intel added instructions that let you treat a 128-bit chunk of data as 2 64-bit values, 4 32-bit values, 8 16-bit values, and so on. ARM supports the same thing under a different name (NEON), and the supported instructions are very similar. It is common to get a 2x-4x speedup from SIMD instructions; if you have a compute-heavy workload, this is definitely worth looking at.
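For instance, summing an array of floats four at a time with SSE intrinsics might look like the following sketch (illustrative; the function name and the assumption that the length is a multiple of 4 are mine):

    #include <xmmintrin.h>   // SSE intrinsics

    // Sum n floats, processing four 32-bit values per instruction.
    // Assumes n is a multiple of 4 to keep the sketch short.
    float sum4(const float *a, int n) {
        __m128 acc = _mm_setzero_ps();
        for (int i = 0; i < n; i += 4) {
            acc = _mm_add_ps(acc, _mm_loadu_ps(a + i));  // 4 adds at once
        }
        float tmp[4];
        _mm_storeu_ps(tmp, acc);                  // spill the partial sums
        return tmp[0] + tmp[1] + tmp[2] + tmp[3];
    }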
However, if you do not write the assembly by hand, compilers will often generate poorly vectorized or unvectorized code, especially for SIMD-friendly loops, so if you care about getting the best possible performance you should look at the disassembly and check for compiler optimization bugs.

Power Management

Modern CPUs have a lot of fancy power management features that optimize power usage in different scenarios. The net result of these is "race to idle": getting the work done as quickly as possible and then putting the CPU back to sleep is usually the most energy-efficient approach. Many specific micro-optimizations have been shown to benefit power consumption, but applying them to real workloads often yields smaller gains than expected.

GPU / GPGPU

I am not really qualified to write about this, compared with the rest of the article. Luckily, Cliff Burdick volunteered to write this section:
Compared with CPUs, GPU hardware has a few major differences, summarized below.

Processor

At the top level, a GPU consists of one or more streaming multiprocessors (SMs). Each streaming multiprocessor on a modern GPU typically contains over 100 floating point units, commonly called cores in the GPU world. Each core typically runs at around 800MHz, although, as with CPUs, parts with higher clock speeds but fewer cores also exist. GPU processors lack many features of their CPU counterparts, including large caches and branch prediction. Communication gets slower at each level of the hierarchy: between cores, between SMs, and across the processor as a whole. For this reason, problems that perform well on GPUs are typically highly parallel, with data that only needs to be shared among a small number of threads. We will explain why in the memory section below.

Memory

Modern GPU memory is divided into three categories: global memory, shared memory, and registers. Global memory is the GDDR advertised on the GPU's box, usually around 2-12GB in size, with a throughput of 300-400GB/s. Global memory is accessible by all threads on all SMs of the processor, and it is also the slowest memory on the card. Shared memory, as the name implies, is shared among all threads within the same SM. It is usually at least twice as fast as global memory, but it is not accessible between threads on different SMs. Registers are much like CPU registers: they are the fastest way to access data on a GPU, but they are local to each thread, and the data is not visible to other threads. Both shared memory and global memory have strict rules about how they may be accessed, with severe performance penalties for violating them. To reach the throughput mentioned above, memory accesses must be fully coalesced among the threads of the same thread group. Similar to a CPU reading a single cache line, a GPU can serve all threads in a group with a single access to a properly aligned cache line. The worst case is when every thread in a group touches a different cache line and each thread requires a separate memory read; most of each cache line goes unused, and the effective memory throughput drops sharply. Similar rules apply to shared memory, with a few exceptions that we will not cover here.

Threading Model

GPU threads run in a single-instruction, multiple-thread (SIMT) fashion, and each thread runs in a group of a hardware-defined size (usually 32). The first part has many implications: every thread in the group must execute the same instruction at the same time. If any thread in a group needs to take a divergent code path from the others (for example, one side of an if statement), all threads not participating in that branch sit idle until the branch completes. As a simple example:
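The kind of code being described looks roughly like this CUDA-style sketch (illustrative; only threads 0-4 of each 32-thread group take the branch):

    __global__ void kernel(float *out, const float *in) {
        if (threadIdx.x % 32 < 5) {
            // Only 5 of the 32 threads in each group take this path;
            // the other 27 idle until the branch completes.
            out[threadIdx.x] = in[threadIdx.x] * 2.0f;
        }
    }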
In the code above, the branch causes 27 of our 32 threads to sit idle until the branch ends. As you can imagine, if many groups of threads run code like this, overall performance takes a big hit because most of the cores are idle. Only when an entire group of threads stalls is the hardware allowed to schedule another group onto those cores.

Interfaces

Modern GPUs still require a CPU to copy data between CPU and GPU memory and to launch code on the GPU. At maximum throughput, a PCIe 3.0 bus with 16 lanes can reach about 13-14GB/s. That may sound high, but relative to the memory speeds on the GPU itself it is an order of magnitude slower. In fact, graphics processors have become so powerful that the PCIe bus is increasingly the bottleneck. To see any performance advantage of the GPU over the CPU, the GPU must be loaded with enough work that the time the GPU spends running it greatly exceeds the time spent sending and receiving the data. Newer GPUs have features for launching work dynamically from GPU code without going back to the CPU, but their application is currently quite limited.

GPU Conclusion

Due to the major architectural differences between CPUs and GPUs, it is hard to imagine one completely replacing the other. In fact, a GPU complements a CPU's parallel work very well, allowing the CPU to work independently on other tasks while the GPU runs. AMD is trying to merge the two technologies with its Heterogeneous System Architecture (HSA), but deciding how to split existing CPU code between the CPU and GPU parts of the processor will be a big challenge, not just for the processor but also for compilers.

Virtualization

Unless you are writing very low-level code that deals directly with virtualization, Intel's built-in virtualization instructions are usually not something you need to think about. Working with that stuff is fairly messy, as you can see from the code here; even for the very simple example shown there, setting up Intel's VT instructions to launch a virtualized guest takes about 1,000 lines of low-level code.

Virtual Memory

If you look at Vish's VT code, you will notice a nice chunk of it is devoted to page tables / virtual memory. This is another "new" feature that you do not have to worry about unless you are writing an operating system or other low-level systems code. Using virtual memory is much simpler than using segmented memory, but we will leave it at that for now.

SMT / Hyper-threading

Hyper-threading is mostly transparent to the programmer. A typical speedup from enabling SMT on a single core is around 25%. That is good for overall throughput, but it means each thread may only get around 60% of its original performance. For applications where single-threaded performance matters a lot, you may be better off disabling SMT. This depends heavily on the workload, though, and as with any other change, you should run benchmarks on your specific workload to see what happens.

One side effect of all this complexity added to chips (and software) is that performance is much less predictable than it used to be, and the importance of benchmarking on your specific hardware has risen accordingly. People often cite the "Computer Language Benchmarks Game" as evidence that one language is faster than another. I have tried to reproduce the results myself, and on my mobile Haswell (versus
the server Kentsfield used in the official results), I have gotten relative speeds that differ by as much as 2x. Even running the same benchmark on the same machine, Nathan Kurz recently pointed me to an example where gcc -O3 was 25% slower than gcc -O2, and changing the link order of a C++ program can cause a 15% performance change. Picking benchmarks is hard.

Branches

Conventional wisdom says that branches are expensive and should be avoided wherever (mostly) possible. On Haswell, a branch misprediction costs about 14 clock cycles. The misprediction rate depends on the workload; using perf stat on a few different things (bzip2, top, mysqld, regenerating my blog), I get branch misprediction rates between 0.5% and 4%. If we assume a correctly predicted branch costs 1 cycle, the average cost is somewhere between .995 * 1 + .005 * 14 = 1.065 cycles and .96 * 1 + .04 * 14 = 1.52 cycles. That is not bad.

This also overstates the cost of branches since about 1995, when Intel added conditional move instructions (CMOV), which let you move data conditionally without a branch. The instruction was memorably criticized by Linus, which gave it a bad reputation, but it is quite common to get significant speedups from CMOV compared with branches. A real-world example of the cost of extra branches is enabling integer overflow checks: when using bzip2 to compress a particular file, it increases the instruction count by about 30% (all of the increase coming from extra branch instructions), which results in about a 1% performance loss.

Unpredictable branches are bad, but most branches are predictable. Ignoring the cost of branches until your profiler tells you there is a hotspot is perfectly reasonable these days. CPUs have gotten much better at executing poorly optimized code over the last decade, and compilers have gotten much better at optimizing code, which makes hand-optimizing branches a poor use of time unless you are trying to squeeze the absolute best performance out of some code. If that is what you need to do, profile-guided optimization is a better bet than doing it by hand. If you really must do it by hand, there are compiler directives that indicate whether a particular branch is likely to be taken. Modern CPUs ignore branch hint instructions, but the hints can help the compiler lay out the code better.

Alignment

The old rule of thumb is to pad out your structs and make sure your data is aligned. But on a Haswell chip, the penalty for misalignment is essentially zero for almost anything you can think of that does not span pages and is single-threaded. There are cases where alignment helps, but in general it is another insignificant optimization, because CPUs have gotten so much better at executing bad code, and it can hurt a bit by increasing the memory footprint for no good reason. Also, do not page-align or otherwise align things to large boundaries, or you will kill cache performance.

Self-Modifying Code

This is another optimization that no longer makes much sense. Using self-modifying code to reduce code size or increase performance used to be worthwhile, but since modern cores split their L1 caches into instruction and data caches, modifying running code requires expensive communication between a chip's L1 caches.

The Future

Here are some possible changes, from the most conservative guesses to the most speculative.

Transactional Memory and Hardware Lock Elision

IBM already has these features in its POWER chips.
Intel tried to add them to Haswell, but they were disabled because of a bug. Transactional memory support is exactly what it sounds like: hardware support for transactions, via three new instructions, xbegin, xend, and xabort. xbegin starts a new transaction; a conflict (or an xabort) rolls the architectural state of the processor (including memory) back to what it was before the xbegin. If you use transactional memory via library or language support, this should be transparent to you. If you are implementing the library support, you have to figure out how to map hardware support with a limited hardware buffer size onto abstract transactions.

Hardware lock elision is implemented with a mechanism very similar to the one used for transactional memory, and it is designed to speed up lock-based code. If you want to take advantage of HLE, check out this document.

Fast I/O

For storage and networking, I/O bandwidth keeps going up and I/O latency keeps going down. The problem is that I/O usually goes through system calls, and as we have seen, the relative cost of system calls keeps going up. For storage and networking, the answer is to move to user-mode I/O stacks.

Dark Silicon / System-on-Chip

An interesting side effect of transistor scaling is that we can pack an enormous number of transistors onto a chip, but they generate so much heat that the average transistor cannot switch most of the time without melting the chip. The result is that it makes more sense to include specialized hardware that is not used much of the time. On one hand, this means we get all sorts of specialized instructions like PCMP and ADX. On the other, it means we are integrating whole devices onto the chip that used to live off-chip, including things like the GPU and (for mobile devices) the radio.

Combined with the trend toward hardware acceleration, it also means that it makes more sense for companies to design their own chips, or at least parts of them. Apple has gone a long way with its acquisition of PA Semi, first adding a handful of custom accelerators to an otherwise standard ARM architecture, and then adding custom accelerators to its own custom architecture. Thanks to a combination of the right custom hardware and thoughtful benchmarking and system design, the iPhone 4 is slightly faster than my flagship Android phone, which is years newer and has a faster processor and more memory. Amazon picked up part of the original Calxeda team and hired a sizable hardware design team. Facebook has picked up ARM SoC experts and is working with Qualcomm on something. Linus is on record saying we will see more specialized hardware all over the place.

Conclusion

x86 chips have picked up a lot of new capabilities and very useful little features. For the most part, you do not need to know what they are to take advantage of them; the really low-level stuff is usually hidden behind libraries or drivers, and the compiler tries to take care of the rest. The exceptions are if you really need to write low-level code, in which case the world is a lot messier, or if you are trying to wring the absolute best performance out of your code, which gets a lot weirder. Some things seem certain to happen in the future, but past experience tells us that most predictions are wrong, so who knows?