From environment setup to memory analysis: A guide to Python code optimization

Code address: https://github.com/apatrascu/hunting-python-performance

Table of contents

1. Environment Settings

2. Memory Analysis

3. CPU Analysis - Python Script

4. CPU Analysis - Python Interpreter

This article is the first part of the tutorial; it walks through Python code optimization from two angles: environment setup and memory analysis.

1. Environment Settings

Setup

Before we dive into benchmarking and performance analysis, we first need a suitable environment. That means configuring the machine and the operating system for the task.

The specs of my machine are as follows:

  • Processor: Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz
  • Memory: 32GB
  • Operating system: Ubuntu 16.04 LTS
  • Kernel: 4.4.0-75-generic

Our goal is to get reproducible results, so we want to make sure our data is not affected by other background processes, operating system configuration, or any other hardware performance enhancement techniques.

Let's first start by configuring the machine for performance analysis.

Hardware Features

First, disable all hardware performance features, which means disabling Intel Turbo Boost and Hyper-Threading in the BIOS/UEFI.

As Intel's official pages put it, Turbo Boost is a technology that "automatically allows processor cores to run faster than the rated operating frequency if they're operating below power, current, and temperature specification limits," while Hyper-Threading "uses processor resources more efficiently, enabling multiple threads to run on each core."

These are great features, well worth paying for. So why disable them for profiling and benchmarking? Because they prevent us from getting reliable, reproducible results: they make runtimes vary from run to run. Let's look at a small example, primes.py, whose code is deliberately badly written:

import time
import statistics

def primes(n):
    if n == 2:
        return [2]
    elif n < 2:
        return []
    s = range(3, n + 1, 2)
    mroot = n ** 0.5
    half = (n + 1) / 2 - 1
    i = 0
    m = 3
    while m <= mroot:
        if s[i]:
            j = (m * m - 3) / 2
            s[j] = 0
            while j < half:
                s[j] = 0
                j += m
        i = i + 1
        m = 2 * i + 3
    return [2] + [x for x in s if x]

def benchmark():
    results = []
    gstart = time.time()
    for _ in xrange(5):
        start = time.time()
        count = len(primes(1000000))
        end = time.time()
        results.append(end - start)
    gend = time.time()
    mean = statistics.mean(results)
    stdev = statistics.stdev(results)
    perc = (stdev * 100) / mean
    print "Benchmark duration: %r seconds" % (gend - gstart)
    print "Mean duration: %r seconds" % mean
    print "Standard deviation: %r (%r %%)" % (stdev, perc)

benchmark()

This code is available on GitHub: https://github.com/apatrascu/hunting-python-performance/blob/master/01.primes.py. It targets Python 2 (note xrange and the print statements), so you also need the statistics backport, which only entered the standard library in Python 3.4:

pip install statistics

Let's run this on a system with Turbo Boost and Hyper Threading enabled:

python primes.py
Benchmark duration: 1.0644240379333496 seconds
Mean duration: 0.2128755569458008 seconds
Standard deviation: 0.032928838418120374 (15.468585914964498 %)

Now disable Turbo Boost and Hyper Threading on this system and run this code again:

python primes.py
Benchmark duration: 1.2374498844146729 seconds
Mean duration: 0.12374367713928222 seconds
Standard deviation: 0.000684464852339824 (0.553131172568 %)

Look at the standard deviation: 15% in the first case. That's a lot! If our optimization yielded only a 6% speedup, how could we separate run-to-run variation from the effect of our change? By contrast, in the second run the standard deviation drops to about 0.6%, and the effect of a new optimization would be clearly visible.
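To make this concrete, here is a quick back-of-the-envelope check in Python, using the means and standard deviations from the two runs above and the common rule of thumb that an effect is only trustworthy when it exceeds a few standard deviations of the noise:

# Means and standard deviations taken from the two benchmark runs above.
mean_noisy, stdev_noisy = 0.2128755569458008, 0.032928838418120374
mean_stable, stdev_stable = 0.12374367713928222, 0.000684464852339824

for label, mean, stdev in (("Turbo+HT on", mean_noisy, stdev_noisy),
                           ("Turbo+HT off", mean_stable, stdev_stable)):
    # Absolute time that a hypothetical 6% optimization would save.
    win = mean * 0.06
    verdict = "visible" if win > 3 * stdev else "lost in the noise"
    print("%s: a 6%% win saves %.4fs vs stdev %.4fs -> %s" % (label, win, stdev, verdict))

With the noisy configuration, the 6% improvement sits well inside the noise band; with the stable one, it stands out clearly.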

CPU power saving

Disable all CPU power-saving settings and run the CPU at a fixed frequency. One way to do this is to switch the Linux frequency scaling driver from intel_pstate to acpi_cpufreq.

The intel_pstate driver implements scaling with its own internal governors for Intel Core processors (Sandy Bridge and newer), while acpi_cpufreq relies on the ACPI Processor Performance States (P-states).

Let's check it out first:

$ cpupower frequency-info
analyzing CPU 0:
driver: intel_pstate
CPUs which run at the same hardware frequency: 0
CPUs which need to have their frequency coordinated by software: 0
maximum transition latency: 0.97 ms.
hardware limits: 1.20 GHz - 3.60 GHz
available cpufreq governors: performance, powersave
current policy: frequency should be within 1.20 GHz and 3.60 GHz.
The governor "powersave" may decide which speed to use
within this range.
current CPU frequency is 1.20 GHz.
boost state support:
Supported: yes
Active: yes

You can see that the governor in use is powersave and that the CPU frequency ranges between 1.20 GHz and 3.60 GHz. This setting is fine for everyday use, but it will skew benchmark results.
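If you prefer, the same information can be read straight from sysfs. Here is a minimal sketch, assuming the standard Linux cpufreq layout under /sys/devices/system/cpu:

import glob

# Report the scaling driver and governor for each CPU, similar to what
# cpupower frequency-info prints.
for path in sorted(glob.glob("/sys/devices/system/cpu/cpu[0-9]*/cpufreq")):
    cpu = path.split("/")[-2]
    with open(path + "/scaling_driver") as f:
        driver = f.read().strip()
    with open(path + "/scaling_governor") as f:
        governor = f.read().strip()
    print("%s: driver=%s, governor=%s" % (cpu, driver, governor))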

So which governor should we use? Looking through the documentation, we find the following options:

  • performance: run the CPU at the maximum frequency
  • powersave: run the CPU at the minimum frequency
  • userspace: run the CPU at a user-specified frequency
  • ondemand: scale the frequency dynamically with the current load, jumping to the highest frequency and backing off as the system idles
  • conservative: scale the frequency dynamically with the current load, but more gradually than ondemand

We want the performance governor, with the frequency pinned to the maximum the CPU supports. Properly configured, it looks like this:

$ cpupower frequency-info
analyzing CPU 0:
driver: acpi-cpufreq
CPUs which run at the same hardware frequency: 0
CPUs which need to have their frequency coordinated by software: 0
maximum transition latency: 10.0 us.
hardware limits: 1.20 GHz - 2.30 GHz
available frequency steps: 2.30 GHz, 2.20 GHz, 2.10 GHz, 2.00 GHz, 1.90 GHz, 1.80 GHz, 1.70 GHz, 1.60 GHz, 1.50 GHz, 1.40 GHz, 1.30 GHz, 1.20 GHz
available cpufreq governors: conservative, ondemand, userspace, powersave, performance
current policy: frequency should be within 2.30 GHz and 2.30 GHz.
The governor "performance" may decide which speed to use
within this range.
current CPU frequency is 2.30 GHz.
cpufreq stats: 2.30 GHz:100.00%, 2.20 GHz:0.00%, 2.10 GHz:0.00%, 2.00 GHz:0.00%, 1.90 GHz:0.00%, 1.80 GHz:0.00%, 1.70 GHz:0.00%, 1.60 GHz:0.00%, 1.50 GHz:0.00%, 1.40 GHz:0.00%, 1.30 GHz:0.00%, 1.20 GHz:0.00% (174)
boost state support:
Supported: no
Active: no

The frequency is now fixed at 2.3 GHz under the performance governor; that is the maximum the Xeon E5-2699 v3 can be configured to run at without Turbo Boost.

To apply this setup, run the following commands with administrator privileges:

cpupower frequency-set -g performance
cpupower frequency-set --min 2300000 --max 2300000

If you don't have cpupower, you can install it using the following command:

sudo apt-get install linux-tools-common linux-tools-`uname -r` -y
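Once everything is set, you can sanity-check from Python that the frequency really stays pinned. This sketch assumes the /proc/cpuinfo format used on x86 Linux:

import time

def cpu0_mhz():
    # /proc/cpuinfo reports a "cpu MHz" line per core; take the first one.
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("cpu MHz"):
                return float(line.split(":")[1])

# With a pinned frequency, every sample should read roughly 2300 MHz.
for _ in range(5):
    print("%.2f MHz" % cpu0_mhz())
    time.sleep(0.5)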

The power governor has a big impact on how the CPU works: by default it scales the frequency automatically to reduce power consumption. We don't want that, so we disable intel_pstate from GRUB. You can edit /boot/grub/grub.cfg directly (but beware: this file is regenerated on kernel upgrades) or create a new kernel entry in /etc/grub.d/40_custom. The boot line must include the flag intel_pstate=disable, like this:

linux /boot/vmlinuz-4.4.0-78-generic.efi.signed root=UUID=86097ec1-3fa4-4d00-97c7-3bf91787be83 ro intel_pstate=disable quiet splash $vt_handoff

ASLR (Address Space Layout Randomization)

This setting is controversial; see Victor Stinner's blog post: https://haypo.github.io/journey-to-stable-benchmark-average.html. When I first suggested disabling ASLR while benchmarking, it was to further improve support for the Profile Guided Optimizations that CPython shipped at the time.

Why do I mention this? Because on the specific hardware described above, disabling ASLR reduces the between-run standard deviation to 0.4%. On the other hand, on my personal machine (Intel Core i7-4710MQ), disabling ASLR triggers the same issues Victor describes, and on smaller CPUs (such as Intel Atom) it makes the between-run standard deviation even larger.

Since this does not appear to be universally true and depends heavily on the hardware/software configuration, for this setup I measured once with ASLR enabled and once with it disabled, and compared the results afterwards. On my machine I disabled ASLR by adding the following line to /etc/sysctl.conf and applying it with sudo sysctl -p:

kernel.randomize_va_space = 0

If you want to disable it at runtime:

sudo bash -c 'echo 0 >| /proc/sys/kernel/randomize_va_space'

If you want to re-enable it:

sudo bash -c 'echo 2 >| /proc/sys/kernel/randomize_va_space'
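To verify the setting took effect, one simple check (a sketch relying on Linux's /proc/self/maps) is to print the stack mapping of a fresh process several times. With ASLR disabled, the address range should be identical on every run; with it enabled, the range should change:

# Print this process's stack mapping; run the script several times and
# compare the addresses across runs.
with open("/proc/self/maps") as f:
    for line in f:
        if "[stack]" in line:
            print(line.strip())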

2. Memory Analysis

In this section, I'll look at a few tools that help us understand memory consumption in Python (particularly when using PyPy). Why should we care about memory, and not just speed? The answer is fairly involved, but I'll summarize it.

PyPy is an alternative Python interpreter with some big advantages over CPython: speed (via its just-in-time compiler), compatibility (it is almost a drop-in replacement for CPython), and concurrency (via stackless mode and greenlets).

One downside of PyPy is that, because of its JIT and garbage collector implementations, it generally uses more memory than CPython, although in some cases it can use less. Let's look at how to measure how much memory an application actually uses.
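Before reaching for a line-by-line profiler, the coarsest measurement is the process's peak resident set size, which the standard library already exposes; note that on Linux ru_maxrss is reported in kilobytes:

import resource

# Peak RSS of the current process so far, in MiB (Linux reports KB).
peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print("Peak RSS: %.1f MiB" % (peak_kb / 1024.0))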

Diagnosing memory usage

memory_profiler is a library that measures the interpreter's memory usage while it runs a workload. You can install it via pip:

pip install memory_profiler

Also install the psutil dependency package:

pip install psutil

The advantage of this tool is that it shows memory consumption line by line in a Python script, which lets us spot the places worth rewriting. The disadvantage: profiled code runs 10 to 20 times slower than normal.

How do you use it? Just add @profile() to the function you want to measure. Let's see it in practice! As a model we'll use the prime sieve script from before, slightly modified to drop the statistics part. The code is also on GitHub: https://github.com/apatrascu/hunting-python-performance/blob/master/02.primes-v1.py

from memory_profiler import profile


@profile(precision=6)
def primes(n):
    if n == 2:
        return [2]
    elif n < 2:
        return []
    s = []
    for i in range(3, n + 1):
        if i % 2 != 0:
            s.append(i)
    mroot = n ** 0.5
    half = (n + 1) / 2 - 1
    i = 0
    m = 3
    while m <= mroot:
        if s[i]:
            j = (m * m - 3) / 2
            s[j] = 0
            while j < half:
                s[j] = 0
                j += m
        i = i + 1
        m = 2 * i + 3
    l = [2]
    for x in s:
        if x:
            l.append(x)
    return l


len(primes(100000))

To start measuring, run the script under PyPy with the memory_profiler module:

pypy -m memory_profiler 02.primes-v1.py

Or, since the script already imports memory_profiler and decorates the function, you can simply run it directly:

pypy 02.primes-v1.py

Either way, PyPy produces the following result:

Line # Mem usage Increment Line Contents
================================================
54 35.312500 MiB 0.000000 MiB @profile(precision=6)
55 def primes(n):
56 35.351562 MiB 0.039062 MiB if n == 2:
57 return [2]
58 35.355469 MiB 0.003906 MiB elif n < 2:
59 return []
60 35.355469 MiB 0.000000 MiB s = []
61 59.515625 MiB 24.160156 MiB for i in range(3, n+1):
62 59.515625 MiB 0.000000 MiB if i % 2 != 0:
63 59.515625 MiB 0.000000 MiB s.append(i)
64 59.546875 MiB 0.031250 MiB mroot = n ** 0.5
65 59.550781 MiB 0.003906 MiB half = (n + 1) / 2 - 1
66 59.550781 MiB 0.000000 MiB i = 0
67 59.550781 MiB 0.000000 MiB m = 3
68 59.554688 MiB 0.003906 MiB while m <= mroot:
69 59.554688 MiB 0.000000 MiB if s[i]:
70 59.554688 MiB 0.000000 MiB j = (m * m - 3) / 2
71 59.554688 MiB 0.000000 MiB s[j] = 0
72 59.554688 MiB 0.000000 MiB while j < half:
73 59.554688 MiB 0.000000 MiB s[j] = 0
74 59.554688 MiB 0.000000 MiB j += m
75 59.554688 MiB 0.000000 MiB i = i + 1
76 59.554688 MiB 0.000000 MiB m = 2 * i + 3
77 59.554688 MiB 0.000000 MiB l = [2]
78 59.679688 MiB 0.125000 MiB for x in s:
79 59.679688 MiB 0.000000 MiB if x:
80 59.679688 MiB 0.000000 MiB l.append(x)
81 59.683594 MiB 0.003906 MiB return l

We can see that this script uses 24.371094 MiB of RAM. A brief analysis shows that most of it goes into building the array of candidate values: the loop keeps every odd number and excludes the even ones.

We can improve this a bit by using the step argument of the range function to generate the odd numbers directly. The script then looks like this:

from memory_profiler import profile


@profile(precision=6)
def primes(n):
    if n == 2:
        return [2]
    elif n < 2:
        return []
    s = range(3, n + 1, 2)
    mroot = n ** 0.5
    half = (n + 1) / 2 - 1
    i = 0
    m = 3
    while m <= mroot:
        if s[i]:
            j = (m * m - 3) / 2
            s[j] = 0
            while j < half:
                s[j] = 0
                j += m
        i = i + 1
        m = 2 * i + 3
    l = [2]
    for x in s:
        if x:
            l.append(x)
    return l


len(primes(100000))

If we measure again, we get the following results:

Line # Mem usage Increment Line Contents
================================================
27 35.343750 MiB 0.000000 MiB @profile(precision=6)
28 def primes(n):
29 35.382812 MiB 0.039062 MiB if n == 2:
30 return [2]
31 35.382812 MiB 0.000000 MiB elif n < 2:
32 return []
33 35.386719 MiB 0.003906 MiB s = range(3, n + 1, 2)
34 35.417969 MiB 0.031250 MiB mroot = n ** 0.5
35 35.417969 MiB 0.000000 MiB half = (n + 1) / 2 - 1
36 35.417969 MiB 0.000000 MiB i = 0
37 35.421875 MiB 0.003906 MiB m = 3
38 58.019531 MiB 22.597656 MiB while m <= mroot:
39 58.019531 MiB 0.000000 MiB if s[i]:
40 58.019531 MiB 0.000000 MiB j = (m * m - 3) / 2
41 58.019531 MiB 0.000000 MiB s[j] = 0
42 58.019531 MiB 0.000000 MiB while j < half:
43 58.019531 MiB 0.000000 MiB s[j] = 0
44 58.019531 MiB 0.000000 MiB j += m
45 58.019531 MiB 0.000000 MiB i = i + 1
46 58.019531 MiB 0.000000 MiB m = 2 * i + 3
47 58.019531 MiB 0.000000 MiB l = [2]
48 58.089844 MiB 0.070312 MiB for x in s:
49 58.089844 MiB 0.000000 MiB if x:
50 58.089844 MiB 0.000000 MiB l.append(x)
51 58.093750 MiB 0.003906 MiB return l

Great, memory usage is now down to 22.75 MiB. We can shave off a little more by using a list comprehension.

from memory_profiler import profile


@profile(precision=6)
def primes(n):
    if n == 2:
        return [2]
    elif n < 2:
        return []
    s = range(3, n + 1, 2)
    mroot = n ** 0.5
    half = (n + 1) / 2 - 1
    i = 0
    m = 3
    while m <= mroot:
        if s[i]:
            j = (m * m - 3) / 2
            s[j] = 0
            while j < half:
                s[j] = 0
                j += m
        i = i + 1
        m = 2 * i + 3
    return [2] + [x for x in s if x]


len(primes(100000))

Measure again:

Line # Mem usage Increment Line Contents
================================================
4 35.425781 MiB 0.000000 MiB @profile(precision=6)
5 def primes(n):
6 35.464844 MiB 0.039062 MiB if n == 2:
7 return [2]
8 35.464844 MiB 0.000000 MiB elif n < 2:
9 return []
10 35.464844 MiB 0.000000 MiB s = range(3, n + 1, 2)
11 35.500000 MiB 0.035156 MiB mroot = n ** 0.5
12 35.500000 MiB 0.000000 MiB half = (n + 1) / 2 - 1
13 35.500000 MiB 0.000000 MiB i = 0
14 35.500000 MiB 0.000000 MiB m = 3
15 57.683594 MiB 22.183594 MiB while m <= mroot:
16 57.683594 MiB 0.000000 MiB if s[i]:
17 57.683594 MiB 0.000000 MiB j = (m * m - 3) / 2
18 57.683594 MiB 0.000000 MiB s[j] = 0
19 57.683594 MiB 0.000000 MiB while j < half:
20 57.683594 MiB 0.000000 MiB s[j] = 0
21 57.683594 MiB 0.000000 MiB j += m
22 57.683594 MiB 0.000000 MiB i = i + 1
23 57.683594 MiB 0.000000 MiB m = 2 * i + 3
24 57.847656 MiB 0.164062 MiB return [2] + [x for x in s if x]

Our latest script consumes only 22.421875 MiB, about 8% less than the first version.
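As a closing note, memory_profiler also offers a memory_usage() helper that samples a function's memory while it runs, which is handy for comparing versions head-to-head without reading line-by-line tables. A minimal sketch, where primes_v1 and primes_v3 are hypothetical names for the first and last versions above:

from memory_profiler import memory_usage

def report_peak(fn, n):
    # memory_usage samples the process RSS while fn(n) runs and returns
    # the samples in MiB; their max is the peak consumption.
    samples = memory_usage((fn, (n,)), interval=0.01)
    print("%s: peak %.2f MiB" % (fn.__name__, max(samples)))

# report_peak(primes_v1, 100000)
# report_peak(primes_v3, 100000)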

Original address:

  • https://pythonfiles.wordpress.com/2017/05/15/hunting-python-performance-setup/
  • https://pythonfiles.wordpress.com/2017/05/18/hunting-python-performance-part-2/
Disclaimer: This article is reprinted from Machine Heart; the original text comes from the Python Files blog.
