When developing mobile clients, especially Android applications, we often come across the term "hardware acceleration". Because the operating system encapsulates the underlying hardware and software so thoroughly, upper-level developers often know little about how hardware acceleration actually works or why it matters. This leads to common misconceptions, such as the idea that hardware acceleration speeds up page rendering through some special algorithm, or that it does so by making the CPU/GPU compute faster through extra hardware. This article gives a brief introduction to hardware acceleration, from the underlying hardware principles up to the upper-level code, where the upper-level implementation is based on Android 6.0.

Understanding the significance of hardware acceleration for app development

For app developers, a basic understanding of how hardware acceleration works and how the upper-level APIs implement it makes it possible to take full advantage of it and improve page performance. Taking Android as an example, there are usually two ways to implement a rounded rectangular button: using a PNG image, or drawing it in code (XML/Java). A simple comparison of the two solutions is as follows (a code sketch is also given after the list below).

Background knowledge of page rendering

- When a page is rendered, the drawn elements must eventually be converted into a grid of pixels (i.e., a multi-dimensional array, similar to a Bitmap in Android) before they can be displayed on the screen.
- The page is composed of various basic elements, such as circles, rounded rectangles, line segments, text, vector graphics (usually composed of Bezier curves), Bitmap, etc.
- When drawing elements, especially during animation drawing, it often involves operations such as interpolation, scaling, rotation, transparency changes, animation transitions, frosted glass blur, and even 3D transformations, physical motion (such as parabolic motion commonly seen in games), and multimedia file decoding (mainly used in desktop computers, mobile devices generally do not use GPU for decoding).
- The drawing process therefore often involves floating-point operations that are logically simple but very large in volume.
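Returning to the rounded-button example above, the sketch below shows the "code" approach using Android's GradientDrawable (a real framework class); the helper class and method names are just illustrative, and the color and radius are arbitrary.

```java
import android.graphics.Color;
import android.graphics.drawable.GradientDrawable;
import android.widget.Button;

public class RoundedButtonHelper {
    // Build a rounded-rectangle background in code instead of shipping a PNG asset.
    // A simple shape like this becomes a small set of drawing commands, whereas
    // a PNG has to be decoded into a full Bitmap first.
    static void applyRoundedBackground(Button button) {
        GradientDrawable shape = new GradientDrawable();
        shape.setShape(GradientDrawable.RECTANGLE);
        shape.setCornerRadius(24f);                   // corner radius in pixels
        shape.setColor(Color.parseColor("#3F51B5"));  // fill color
        button.setBackground(shape);                  // setBackground() requires API 16+
    }
}
```

The same effect can also be declared in an XML shape drawable; either way, the framework ends up with a compact shape description rather than a bitmap.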
CPU and GPU structure comparison

The CPU (Central Processing Unit) is the core component of a computer and executes program code; software developers are very familiar with it. The GPU (Graphics Processing Unit) is mainly used for graphics computation and is the core component of what we usually call the "graphics card". The following is a comparison of the CPU and GPU structures.

- The yellow Control block is the controller, which coordinates and controls the operation of the entire CPU, including fetching instructions and controlling the other modules;
- The green ALU (Arithmetic Logic Unit) is the arithmetic logic unit, which is used to perform mathematical and logical operations;
- The orange Cache and DRAM are cache and RAM, respectively, which are used to store information.
- As can be seen from the structure diagram, the CPU controller is more complex, but the number of ALUs is relatively small. Therefore, the CPU is good at various complex logical operations, but not good at mathematics, especially floating-point operations.
- Taking the 8086 as an example, most of its hundred-plus assembly instructions are logic and control instructions; the main arithmetic operations are 16-bit addition, subtraction, multiplication, division and shifts. An integer or logic operation generally takes 1 to 3 machine cycles, while floating-point operations have to be emulated with integer operations and may consume hundreds of machine cycles.
- Simpler CPUs may even only have addition instructions, with subtraction implemented using two's complement addition, multiplication implemented using accumulation, and division implemented using a subtraction loop.
- Modern CPUs generally have a hardware floating-point unit (FPU), but it is mainly suitable for situations where the amount of data is not large.
- The CPU is essentially a serial structure. For example, to sum 100 numbers, a single CPU core can only add two numbers at a time, accumulating the result step by step (a simple code analogy is given after this list).
- Unlike the CPU, the GPU is designed for large amounts of numerical computation. As the structure diagram shows, the GPU's controller is relatively simple, but it contains a large number of ALUs; these ALUs are designed for parallel execution and include many floating-point units.
- The main principle of hardware acceleration is to convert graphics calculations that the CPU is not good at into GPU-specific instructions through the underlying software code, which are then completed by the GPU.
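As a rough CPU-side analogy for the serial-versus-parallel contrast above (this is not GPU code), the sketch below sums 100 numbers once with a plain loop, which adds one value at a time, and once with Java's parallel streams, which split the work across cores and combine partial sums:

```java
import java.util.stream.IntStream;

public class SumDemo {
    public static void main(String[] args) {
        int[] numbers = IntStream.rangeClosed(1, 100).toArray();

        // Serial: a single "adder" accumulates one value per step.
        int serialSum = 0;
        for (int n : numbers) {
            serialSum += n;
        }

        // Parallel: the runtime splits the array, sums the chunks concurrently,
        // and then combines the partial sums -- many ALUs working at once.
        int parallelSum = IntStream.of(numbers).parallel().sum();

        System.out.println(serialSum + " == " + parallelSum);  // both are 5050
    }
}
```

For only 100 numbers the parallel version is actually slower because of scheduling overhead; as with the GPU, the advantage only appears when the amount of data is large.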
Extension: the GPU in many computers has its own dedicated video memory; if there is no dedicated video memory, a region of main memory is set aside and shared as video memory. Video memory stores information such as GPU instructions.

Parallel Structure Example: Cascaded Adders

To make this easier to understand, let's take an example at the level of the underlying circuitry. The figure below shows an adder, corresponding to an actual digital circuit.

- A and B are the inputs and C is the output; all three are buses. Taking a 32-bit CPU as an example, each bus actually consists of 32 wires, and each wire uses a different voltage level to represent a binary 0 or 1.
- Clock is the clock signal line, which receives a voltage pulse once every fixed clock cycle. Whenever a clock signal arrives, the sum of A and B is output on C.
Now suppose we want to sum 8 integers. For a serial structure like the CPU, the code is trivial: a for loop adds the numbers one by one. The serial structure has only one adder, so 7 additions are required, and after each partial sum is computed it has to be fed back to the adder's input for the next addition; the whole process takes at least a dozen machine cycles.

For a parallel structure, a common design is the cascaded adder shown in the figure below, with all the clock lines tied together. Once the 8 inputs are ready at A1~B4, the sum is produced after three clock cycles. The larger the amount of data and the more cascade levels there are, the more pronounced the advantage of the parallel structure becomes. Because of circuit limitations, it is hard to speed up computation simply by raising the clock frequency (shortening the clock cycle); the parallel structure instead gains speed by enlarging the circuit and processing data in parallel. On the other hand, complex logic is hard to implement in a parallel structure, because the outputs of multiple branches have to be considered at the same time and kept synchronized (a bit like multi-threaded programming).

GPU parallel computing example

Suppose we have the following image processing task: add 1 to every pixel value. The GPU's parallel approach is simple and brute-force: if resources allow, a GPU thread can be opened for each pixel to perform the addition. The larger the amount of numerical work, the more obvious the performance advantage of this parallel approach.

Hardware Acceleration in Android

In Android, most application interfaces are built with ordinary Views (games, video, image and similar applications may use OpenGL ES directly instead). The following analyses and compares software rendering and hardware-accelerated rendering of Views, based on the Java-layer code of the stock Android 6.0 system.

DisplayList

A DisplayList is a basic drawing element that contains the original properties of the element to be drawn (position, size, angle, transparency, etc.) and corresponds to the drawXxx() methods of Canvas (as shown below). The information flows as: Canvas (Java API) -> OpenGL (C/C++ lib) -> driver -> GPU. In Android 4.1 and above, DisplayLists support properties: if only certain properties of a View change (such as scale, alpha or translation), it is enough to update those properties on the GPU side without generating a new DisplayList.

RenderNode

A RenderNode contains several DisplayLists. Usually a RenderNode corresponds to a View and holds all the DisplayLists of the View itself and of its sub-Views.

Android drawing process (Android 6.0)

The following is the complete drawing flow of an Android View, obtained mainly by reading and debugging the source code. Dotted arrows indicate recursive calls.

- The path from ViewRootImpl.performTraversals to PhoneWindow.DecorView.drawChild is the fixed part of every traversal of the View tree: first the flag bits are checked to decide whether a re-layout is needed and the layout is performed; then the Canvas is created and other preparations are made before drawing begins.
- If hardware acceleration is not supported or is turned off, software drawing is used and the Canvas created is a plain Canvas instance;
- If hardware acceleration is supported, a DisplayListCanvas instance is created instead;
- isHardwareAccelerated() returns false and true respectively in these two cases, and a View decides whether to use hardware acceleration based on this value (a small custom-View sketch is given at the end of this section).
- The recursive path draw(canvas, parent, drawingTime) -> draw(canvas) -> onDraw -> dispatchDraw -> drawChild in View (hereinafter the Draw path) calls the Canvas.drawXxx() methods; in software rendering it performs the actual drawing, while under hardware acceleration it builds the DisplayList.
- The recursive path updateDisplayListIfDirty -> dispatchGetDisplayList -> recreateChildDisplayList in View (hereinafter the DisplayList path) is taken only under hardware acceleration; while traversing the View tree it updates DisplayList properties and quickly skips Views whose DisplayLists do not need to be rebuilt.
In Android 6.0 the DisplayList-related APIs are still marked @hide and inaccessible, which suggests they are not yet considered stable and may be opened up in later versions.

- Under hardware acceleration, after the draw pass has completed and the DisplayList has been built, the GPU draws the DisplayList to the screen via ThreadedRenderer.nSyncAndDrawFrame().
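To make the two paths above more concrete, here is a minimal custom-View sketch (the class name is illustrative). Under hardware acceleration the Canvas passed to onDraw() is a DisplayListCanvas and drawXxx() calls only record operations into the DisplayList; with software drawing they rasterize pixels directly. isHardwareAccelerated() and setLayerType() are real framework APIs.

```java
import android.content.Context;
import android.graphics.Canvas;
import android.graphics.Color;
import android.graphics.Paint;
import android.util.Log;
import android.view.View;

public class CheckAcceleratedView extends View {
    private final Paint paint = new Paint(Paint.ANTI_ALIAS_FLAG);

    public CheckAcceleratedView(Context context) {
        super(context);
        paint.setColor(Color.RED);
        // Uncomment to force this View onto the software drawing path, e.g. when it
        // uses drawing operations that hardware acceleration does not support:
        // setLayerType(View.LAYER_TYPE_SOFTWARE, null);
    }

    @Override
    protected void onDraw(Canvas canvas) {
        // Under hardware acceleration this call is recorded into the DisplayList;
        // under software drawing it rasterizes directly into the window's bitmap.
        canvas.drawCircle(getWidth() / 2f, getHeight() / 2f, 100f, paint);
        Log.d("CheckAccel", "hardware accelerated: " + canvas.isHardwareAccelerated());
    }
}
```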
Pure software rendering VS hardware acceleration (Android 6.0)

The following analyses the rendering process and the acceleration effect in several concrete scenarios, with and without hardware acceleration.

- In scenario 1, the Draw path is taken while traversing the View tree regardless of whether acceleration is used. With hardware acceleration, however, the Draw path does no actual drawing and only builds the DisplayList; the heavy drawing computation is offloaded to the GPU, which gives a significant speed-up.
- In scenario 2, the size of the TextView is the same before and after the change, so no re-layout is triggered.
- With software drawing, the area occupied by the TextView becomes the dirty area. Because the TextView has transparent regions, most Views overlapping the dirty area have to be redrawn while the View tree is traversed, including the overlapping sibling nodes and their parent nodes (see the refresh-logic section below for details). Views that do not need to be drawn are detected in draw(canvas, parent, drawingTime) and return early.
- With hardware acceleration, the View tree is still traversed, but only the TextView and its ancestors at each level need to rebuild their DisplayLists via the Draw path; the other Views take the DisplayList path directly, and the rest of the work is handled by the GPU. The more complex the page, the larger the performance gap between the two.
- In scenario 3, software rendering has to do a large amount of drawing work for every frame, which easily causes animation jank. With hardware acceleration, the animation simply updates DisplayList properties through the DisplayList path, which greatly improves animation smoothness.
- In scenario 4, the gap is even more pronounced. Merely changing the transparency still costs software rendering a lot of work, whereas under hardware acceleration the RenderNode's properties are generally updated directly without triggering invalidate(), so the View tree is not traversed at all (except for the few Views that need to handle alpha in a special way and return true from onSetAlpha(), as shown in the following code; a short usage sketch follows the code).
```java
public class View {
    // ...
    public void setAlpha(@FloatRange(from = 0.0, to = 1.0) float alpha) {
        ensureTransformationInfo();
        if (mTransformationInfo.mAlpha != alpha) {
            mTransformationInfo.mAlpha = alpha;
            if (onSetAlpha((int) (alpha * 255))) {
                // ...
                invalidate(true);
            } else {
                // ...
                mRenderNode.setAlpha(getFinalAlpha());
                // ...
            }
        }
    }

    protected boolean onSetAlpha(int alpha) {
        return false;
    }
    // ...
}
```
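As a usage sketch for scenario 4 (the helper name is illustrative): for an ordinary View whose onSetAlpha() returns false, both a direct setAlpha() call and a property animation end up updating the RenderNode's alpha, so the View tree is not re-traversed.

```java
import android.view.View;

class AlphaDemo {
    // Hypothetical helper: fade a View without forcing a redraw of the View tree.
    static void fadeOut(View view) {
        // Updates the RenderNode's alpha property directly; no invalidate() is
        // triggered as long as the View does not return true from onSetAlpha().
        view.setAlpha(0.5f);

        // A property animation updates the same RenderNode property frame by frame.
        view.animate()
                .alpha(0f)
                .setDuration(300)
                .start();
    }
}
```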
Introduction to the software drawing refresh logic

By reading the source code and experimenting, we can derive the software drawing refresh logic in the normal case (a small code sketch follows this list):

- By default a View's clipChildren property is true, i.e. each child View's drawing area cannot exceed the bounds of its parent View. If clipChildren is set to false on a page's root layout, child Views may draw outside their parents' bounds.
- When a View triggers invalidate(), no animation is playing and no re-layout is triggered:
- If clipChildren is true, the dirty area is converted into a Rect in ViewRoot; during the refresh it is checked level by level, and a View is redrawn when it overlaps the dirty area. If a View extends beyond its parent and overlaps the dirty area while its parent does not, that child View will not be redrawn.
- If clipChildren is false, ViewGroup.invalidateChildInParent() expands the dirty area to the ViewGroup's entire area, so all Views overlapping that area are redrawn.
- For a fully opaque View, its own flag will be set to PFLAG_DIRTY, and its parent View will set the flag to PFLAG_DIRTY_OPAQUE. In the draw(canvas) method, only the View itself is redrawn.
- For Views that may have transparent areas, both the View itself and the parent View will have the flag PFLAG_DIRTY set.
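A minimal sketch of the knobs mentioned above (the helper name is illustrative); setClipChildren() and invalidate() are real framework APIs:

```java
import android.view.View;
import android.view.ViewGroup;

class RefreshDemo {
    static void example(ViewGroup rootLayout, View child) {
        // Allow children to draw outside the parent's bounds. With clipChildren set
        // to false, invalidating a child expands the dirty area to the parent's whole
        // area, so every View overlapping that area gets redrawn.
        rootLayout.setClipChildren(false);

        // Marks the child's area dirty and schedules a traversal; which Views are
        // actually redrawn then follows the clipping and opacity rules listed above.
        child.invalidate();
    }
}
```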
Summary

This concludes the introduction to hardware acceleration. A brief summary:

- The CPU is better at complex logic and control, while the GPU, thanks to its large number of ALUs and its parallel design, is better at numerical computation.
- The page is composed of various basic elements (DisplayList), and a large number of floating-point operations are required when rendering.
- Under hardware acceleration conditions, the CPU is used to control complex drawing logic and build or update DisplayList; the GPU is used to complete graphics calculations and render DisplayList.
- Under hardware acceleration conditions, when refreshing the interface, especially when playing animations, the CPU only rebuilds or updates the necessary DisplayList, further improving rendering efficiency.
- For the same visual effect, prefer simpler DisplayLists for better performance (e.g. a Shape instead of a Bitmap).