Code address: https://github.com/apatrascu/hunting-python-performance

Table of contents:
1. Environment Settings
2. Memory Analysis
3. CPU Analysis - Python Script
4. CPU Analysis - Python Interpreter

This article is the first part of the tutorial and covers the first two topics: environment settings and memory analysis.
1. Environment Settings

Before we dive into benchmarking and performance analysis, we first need a suitable environment. That means configuring the machine and the operating system for the task. The machine used throughout this article is built around an Intel Xeon E5-2699 v3 (2.30 GHz base frequency).
Our goal is to get reproducible results, so we want to make sure our data is not affected by other background processes, by operating system configuration, or by hardware performance-enhancement techniques. Let's start by configuring the machine for performance analysis.

Hardware features

First, disable all hardware performance features, which means turning off Intel Turbo Boost and Hyper-Threading in the BIOS/UEFI. As Intel's own pages describe them, Turbo Boost is a technology that lets processor cores run faster than the rated frequency, as long as they stay below the power, current, and temperature specification limits, and Hyper-Threading is a technology that uses processor resources more efficiently by letting each core run multiple threads. Both are good features and worth the money. So why disable them for performance analysis and benchmarking? Because they make it impossible to get reliable, reproducible results: they introduce run-to-run variation. Let's look at a small example, primes.py; the code is deliberately badly written. The full version is on GitHub at https://github.com/apatrascu/hunting-python-performance/blob/master/01.primes.py, and a sketch of it appears at the end of this subsection. It has one dependency, which you can install with:

pip install statistics

(On Python 3.4 and newer, the statistics module ships with the standard library, so this step is only needed on older interpreters.) Let's run it on a system with Turbo Boost and Hyper-Threading enabled:

python primes.py

Now disable Turbo Boost and Hyper-Threading on the same system and run it again:

python primes.py

Look at the standard deviation in the first case: 15%. That's a lot! If an optimization bought us only a 6% speedup, how could we separate run-to-run variation from the effect of our change? By contrast, in the second case the standard deviation drops to about 0.6%, and the effect of a new optimization becomes clearly visible.
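Only the first line of the original script (import time) survived in this extract, so here is a minimal sketch in its spirit rather than the exact repository code: a deliberately naive prime generator, timed over several runs, with the mean and standard deviation reported via the statistics module. The function names and the limit of 100000 are my own choices.

```python
# Hedged reconstruction of 01.primes.py; the real script is in the repository.
import time
import statistics

def primes(n):
    """Deliberately inefficient prime generation."""
    if n < 2:
        return []
    candidates = []
    for i in range(3, n + 1):        # wasteful: iterates over evens just to skip them
        if i % 2 != 0:
            candidates.append(i)
    result = [2]
    for value in candidates:         # naive trial division for each candidate
        for divisor in range(3, int(value ** 0.5) + 1, 2):
            if value % divisor == 0:
                break
        else:                        # no divisor found: value is prime
            result.append(value)
    return result

if __name__ == "__main__":
    timings = []
    for _ in range(10):              # repeat to expose run-to-run variance
        start = time.time()
        primes(100000)
        timings.append(time.time() - start)
    mean = statistics.mean(timings)
    stdev = statistics.stdev(timings)
    print("mean:  %.5f s" % mean)
    print("stdev: %.5f s (%.1f%% of mean)" % (stdev, 100 * stdev / mean))
```

The relative standard deviation printed on the last line is the figure that jumps around between runs when Turbo Boost and Hyper-Threading are enabled.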
CPU power saving

Disable all CPU power-saving settings and use a fixed CPU frequency. This is done by switching the Linux scaling driver from intel_pstate to acpi_cpufreq. The intel_pstate driver implements a scaling driver with its own internal governor for Intel Core (Sandy Bridge and newer) processors, while acpi_cpufreq uses the ACPI Processor Performance States. Let's check the current state first:

$ cpupower frequency-info

The governor here is set to powersave, and the CPU frequency ranges between 1.20 GHz and 3.60 GHz. That setting is fine for everyday use, but it will affect benchmark results. So which value should the governor have? Looking through the documentation, the available governors are conservative, ondemand, userspace, powersave, and performance. We need the performance governor, with the frequency set to the maximum the CPU supports. Checking again:

$ cpupower frequency-info

The frequency is now fixed at 2.30 GHz under the performance governor. This is the maximum value configurable without Turbo Boost on a Xeon E5-2699 v3. To complete the setup, run the following command with administrator privileges:

cpupower frequency-set -g performance

If you don't have cpupower, you can install it with:

sudo apt-get install linux-tools-common linux-tools-`uname -r` -y

The scaling governor has a big impact on how the CPU works: its default behavior is to adjust the frequency automatically to reduce power consumption. We don't want that, so we disable intel_pstate from GRUB. Either edit /boot/grub/grub.cfg directly (beware: this file is regenerated on kernel upgrades, so the change will disappear) or create a new kernel entry in /etc/grub.d/40_custom. The boot line must include the flag intel_pstate=disable, like this:

linux /boot/vmlinuz-4.4.0-78-generic.efi.signed root=UUID=86097ec1-3fa4-4d00-97c7-3bf91787be83 ro intel_pstate=disable quiet splash $vt_handoff

ASLR (Address Space Layout Randomization)

This setting is controversial; see Victor Stinner's blog: https://haypo.github.io/journey-to-stable-benchmark-average.html. When I first suggested disabling ASLR while benchmarking, it was to further improve support for the Profile Guided Optimizations that existed in CPython at the time. Why do I mention this? Because on the specific hardware described above, disabling ASLR reduces the standard deviation between runs to 0.4%. On the other hand, on my personal computer (Intel Core i7-4710MQ), disabling ASLR leads to the same issues Victor mentions, and testing on smaller CPUs (such as Intel Atom) produces an even larger standard deviation between runs. Since this does not seem to be a universal truth and depends heavily on the hardware/software configuration, for this setup I measured once with ASLR enabled and once with it disabled, and compared the results afterwards. On my machine I disabled ASLR by putting the following line in /etc/sysctl.conf and applying it with sudo sysctl -p:

kernel.randomize_va_space = 0

To disable it at runtime:

sudo bash -c 'echo 0 >| /proc/sys/kernel/randomize_va_space'

To re-enable it:

sudo bash -c 'echo 2 >| /proc/sys/kernel/randomize_va_space'

2. Memory Analysis

In this section I'll introduce some tools that help us tackle memory consumption in Python, and especially in PyPy. Why should we care about memory rather than just performance? The answer is complicated, but in short: PyPy is an alternative Python interpreter with some big advantages over CPython, namely speed (through its just-in-time compiler), compatibility (it is almost a drop-in replacement for CPython), and concurrency (using stackless mode and greenlets). One downside of PyPy is that, because of its JIT and garbage-collector implementation, it generally uses more memory than CPython, although in some cases it can use less. Let's look at how to measure how much memory your application is using.

Diagnosing memory usage

memory_profiler is a library that measures the interpreter's memory usage while running a workload. You can install it via pip, along with its psutil dependency:

pip install memory_profiler
pip install psutil

The advantage of this tool is that it shows memory consumption line by line in a Python script, which lets us find the places we can rewrite. The disadvantage is that profiled code runs 10 to 20 times slower than a normal script. To use it, just add the @profile decorator to the function you want to measure. Let's see how it works in practice. We'll reuse the primes script from before as a model, slightly modified to remove the statistics part. The code can also be viewed on GitHub at https://github.com/apatrascu/hunting-python-performance/blob/master/02.primes-v1.py, and a sketch follows below.
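Again, only the import line (from memory_profiler import profile) survived in this extract, so the following is a hedged reconstruction of 02.primes-v1.py that mirrors my earlier sketch rather than the exact repository code: the timing and statistics code is gone, and the function under test carries the @profile decorator. The precision=6 argument is an assumption, chosen to match the six-decimal MiB figures quoted below.

```python
# Hypothetical reconstruction of 02.primes-v1.py (see the repository for the original).
from memory_profiler import profile

@profile(precision=6)
def primes(n):
    if n < 2:
        return []
    candidates = []
    for i in range(3, n + 1):        # visits every even number just to skip it
        if i % 2 != 0:
            candidates.append(i)
    result = [2]
    for value in candidates:
        for divisor in range(3, int(value ** 0.5) + 1, 2):
            if value % divisor == 0:
                break
        else:                        # no divisor found: value is prime
            result.append(value)
    return result

if __name__ == "__main__":
    primes(100000)
```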
To start measuring, run the script through memory_profiler under PyPy:

pypy -m memory_profiler 02.primes-v1.py

Alternatively, since the script already imports profile from memory_profiler directly, you can simply run it as a plain script: pypy 02.primes-v1.py. Either way, PyPy prints a report with one row per source line and columns for Line #, Mem usage, Increment, and Line Contents. The full table is omitted here, but the bottom line is that this script uses 24.371094 MiB of RAM. Let's analyze that briefly. Most of the memory goes into constructing the list of candidate values: the loop excludes even values but still iterates over every one of them. We can improve this a bit by calling range with its step parameter, so that even numbers are never generated at all (first sketch below). Measured again, memory usage drops to 22.75 MiB. We can reduce it even further using a list comprehension (second sketch below): that version consumes only 22.421875 MiB, almost a 10% decrease compared to the first version.
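Here is a hedged sketch of the range-step variant described above (my reconstruction, not the exact repository file): only the candidate-building loop changes.

```python
# Second version (hypothetical reconstruction): range() with a step of 2,
# so even numbers are never generated in the first place.
from memory_profiler import profile

@profile(precision=6)
def primes(n):
    if n < 2:
        return []
    candidates = []
    for i in range(3, n + 1, 2):     # step of 2: odd numbers only
        candidates.append(i)
    result = [2]
    for value in candidates:
        for divisor in range(3, int(value ** 0.5) + 1, 2):
            if value % divisor == 0:
                break
        else:
            result.append(value)
    return result

if __name__ == "__main__":
    primes(100000)
```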
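And the list-comprehension variant (again a hedged reconstruction): the candidate list is built in a single comprehension instead of an explicit append loop, which is what brings the total down to 22.421875 MiB in the measurements above.

```python
# Third version (hypothetical reconstruction): candidates built with a
# list comprehension instead of an explicit append loop.
from memory_profiler import profile

@profile(precision=6)
def primes(n):
    if n < 2:
        return []
    candidates = [i for i in range(3, n + 1, 2)]   # odd numbers in one expression
    result = [2]
    for value in candidates:
        for divisor in range(3, int(value ** 0.5) + 1, 2):
            if value % divisor == 0:
                break
        else:
            result.append(value)
    return result

if __name__ == "__main__":
    primes(100000)
```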