[PHP Kernel Exploration] Hash Table in PHP

[PHP Kernel Exploration] Hash Table in PHP

In the PHP kernel, one of the most important data structures is HashTable. The arrays we often use are implemented in the kernel using HashTable. So, how is PHP's HashTable implemented? I was recently reading about the data structure of HashTable, but there is no specific implementation algorithm in the algorithm books. I happened to be reading the PHP source code recently, so I referred to the implementation of PHP's HashTable and implemented a simple version of HashTable. I summarized some of my experiences and would like to share them with you below.

The author has a simplified version of HashTable implementation on github: HashTable implementation

In addition, I have more detailed annotations on the PHP source code on GitHub. If you are interested, you can take a look and give a star. PHP5.4 source code annotations. You can view the added annotations through the commit record.

Introduction to HashTable

A hash table is an efficient data structure for implementing dictionary operations.

definition

Simply put, a HashTable is a data structure of key-value pairs. It supports operations such as insert, search, and delete. Under some reasonable assumptions, the time complexity of all operations in a hash table is O(1) (those interested in the relevant proof can check it out by themselves).

The key to implementing a hash table

In a hash table, instead of using keywords as subscripts, a hash function is used to calculate the hash value of the key as the subscript, and then the hash value of the key is calculated again when searching/deleting, so as to quickly locate the location where the element is stored.

In a hash table, different keywords may calculate the same hash value, which is called a "hash collision", which is to deal with the situation where two or more keys have the same hash value. There are many ways to solve hash collisions, such as open addressing, zipper method, etc.

Therefore, the key to implementing a good hash table is a good hash function and a method to handle hash conflicts.

Hash Functions

There are four definitions for judging the quality of a hash algorithm:

  • Consistency, equivalent keys must produce equal hash values;
  • High efficiency and easy calculation;
  • Uniformity, hash all keys evenly.

The hash function establishes a corresponding relationship between the key value and the hash value, that is, h = hash_func(key). The corresponding relationship is shown in the figure below:

Let experts design a perfect hash function. We can just use the existing mature hash functions. The hash function used by the PHP kernel is the time33 function, also known as DJBX33A, which is implemented as follows:

static inline ulong zend_inline_hash_func(const char *arKey, uint nKeyLength)
{
         register ulong hash = 5381;

        /* variant with the hash unrolled eight times */
        for (; nKeyLength >= 8; nKeyLength -= 8) {
            hash = ((hash << 5) + hash) + *arKey++;
            hash = ((hash << 5) + hash) + *arKey++;
            hash = ((hash << 5) + hash) + *arKey++;
            hash = ((hash << 5) + hash) + *arKey++;
            hash = ((hash << 5) + hash) + *arKey++;
            hash = ((hash << 5) + hash) + *arKey++;
            hash = ((hash << 5) + hash) + *arKey++;
            hash = ((hash << 5) + hash) + *arKey++;
    }

    switch (nKeyLength) {
        case 7: hash = ((hash << 5) + hash) + *arKey++; /* fallthrough... */
        case 6: hash = ((hash << 5) + hash) + *arKey++; /* fallthrough... */
        case 5: hash = ((hash << 5) + hash) + *arKey++; /* fallthrough... */
        case 4: hash = ((hash << 5) + hash) + *arKey++; /* fallthrough... */
        case 3: hash = ((hash << 5) + hash) + *arKey++; /* fallthrough... */
        case 2: hash = ((hash << 5) + hash) + *arKey++; /* fallthrough... */
        case 1: hash = ((hash << 5) + hash) + *arKey++; break;
        case 0: break;
        EMPTY_SWITCH_DEFAULT_CASE()
    }
    return hash;
}

Note: The function uses an 8-time loop + switch to optimize the for loop, reducing the number of loop runs, and then executing the remaining elements that have not been traversed in the switch.

Zipper method

The method of storing all elements with the same hash value in a linked list is called the zipper method. When searching, first calculate the hash value corresponding to the key, then find the corresponding linked list according to the hash value, and finally search for the corresponding value along the linked list sequence.
The specific structure diagram after saving is as follows:

PHP HashTable Structure

After briefly introducing the data structure of the hash table, let's look at how the hash table is implemented in PHP.

The definition of PHP kernel hashtable:

typedef struct _hashtable {
          uint nTableSize;
          uint nTableMask;
          uint nNumOfElements;
          ulong nNextFreeElement;
          Bucket *pInternalPointer;
          Bucket *pListHead;
          Bucket *pListTail; 
          Bucket **arBuckets;
          dtor_func_t pDestructor;
          zend_bool persistent;
          unsigned char nApplyCount;
          zend_bool bApplyProtection;
          #if ZEND_DEBUG
               int inconsistent;
          #endif
} HashTable;
  • nTableSize, the size of the HashTable, grows in multiples of 2
  • nTableMask, used to perform AND operation with the hash value to obtain the index value of the hash value. After arBuckets is initialized, it is always nTableSize-1
  • nNumOfElements, the number of elements currently owned by HashTable, the count function directly returns this value
  • nNextFreeElement, which indicates the position of the next numeric index in the numeric key value array
  • pInternalPointer, internal pointer, pointing to the current member, used to traverse elements
  • pListHead, points to the first element of the HashTable, which is also the first element of the array
  • pListTail points to the last element of the HashTable, which is also the last element of the array. Combined with the above pointer, it is very convenient when traversing the array, such as reset and endAPI
  • arBuckets, an array of doubly linked lists consisting of buckets, the index is generated by the AND operation of the key hash value and nTableMask
  • pDestructor, the destructor used to delete elements in the hash table
  • persistent, identifies the memory allocation function. If it is TRUE, the memory allocation function of the operating system itself is used, otherwise the memory allocation function of PHP is used
  • nApplyCount, saves the number of times the current bucket is recursively accessed to prevent multiple recursions
  • bApplyProtection, identifies whether the hash table should use recursive protection, the default is 1, to use

Here is an example of combining hash and mask:

For example, the real hash value of "foo" (using DJBX33A hash function) is 193491849. If we now have a hash table with a capacity of 64, we obviously cannot use it as an array index. Instead, we apply the mask of the hash table and then only take the low bits of the hash table.

hash | 193491849 | 0b1011100010000111001110001001
& mask | & 63 | & 0b0000000000000000000000111111
-------------------------------------------------------------------------------
= index | = 9 | = 0b00000000000000000000000001001

Therefore, in the hash table, foo is stored in the bucket vector with index 9 in arBuckets.

Definition of bucket structure

typedef struct bucket {
     ulong h;
     uint nKeyLength;
     void *pData;
     void *pDataPtr;
     struct bucket *pListNext;
     struct bucket *pListLast;
     struct bucket *pNext;
     struct bucket *pLast;
     const char *arKey;
} Bucket;
  • h, hash value (or numeric key value key
  • nKeyLength, the length of the key
  • pData, pointer to data
  • pDataPtr, pointer data
  • pListNext, points to the next element in the arBuckets linked list in the HashTable
  • pListLast, points to the previous element in the arBuckets linked list in the HashTable
  • pNext, points to the next element in the bucket list with the same hash value
  • pLast, points to the previous element in the bucket list with the same hash value
  • arKey, the name of the key

The HashTable in PHP is implemented by vector plus bidirectional linked list. The vector is stored in the arBuckets variable. The vector contains pointers to multiple buckets. Each pointer points to a bidirectional linked list composed of multiple buckets. New elements are added using the forward insertion method, that is, the new element is always at the first position of the bucket. As can be seen from the above, the hash table implementation of PHP is quite complicated. This is the price it has to pay for using the ultra-flexible array type.

An example diagram of a HashTable in PHP is shown below:

HashTable related API

  • zend_hash_init
  • zend_hash_add_or_update
  • zend_hash_find
  • zend_hash_del_key_or_index

zend_hash_init

Function execution steps

  • Set the hash table size
  • Set the initial values ​​of other member variables of the structure (including the destructor pDescructor used to release memory)

For detailed code annotations, click: zend_hash_init source code

Note:

1. pHashFunction is not used here. PHP's hash function uses the internal zend_inline_hash_func

2. After zend_hash_init is executed, it does not actually allocate memory for arBuckets and calculate the size of nTableMask. The actual allocation of memory and calculation of nTableMask are performed during the CHECK_INIT initialization when inserting elements.

zend_hash_add_or_update

Function execution steps

  • Check the length of the key
  • Check initialization
  • Calculate hash value and subscript
  • Traverse the bucket where the hash value is located. If the same key is found and the value needs to be updated, update the data. Otherwise, continue to point to the next element of the bucket until it points to the last position of the bucket.
  • Assign a bucket to the newly added element, set the attribute value of the new bucket, and then add it to the hash table
  • If the hash table is full, resize the hash table.

Function execution flow chart

CONNECT_TO_BUCKET_DLLIST adds new elements to the bucket list with the same hash value.

CONNECT_TO_GLOBAL_DLLIST is a doubly linked list that adds new elements to the HashTable.

For detailed code and annotations, please click: zend_hash_add_or_update code annotations.

zend_hash_find

Function execution steps

  • Calculate hash value and subscript
  • Traverse the bucket where the hash value is located. If the bucket where the key is located is found, the value is returned. Otherwise, point to the next bucket until it points to the last position in the bucket list.

For detailed code and annotations, please click: zend_hash_find code annotations.

zend_hash_del_key_or_index

Function execution steps

  • Calculate the hash value and subscript of the key
  • Traverse the bucket where the hash value is located. If the bucket where the key is located is found, proceed to step 3. Otherwise, point to the next bucket until it points to the last position in the bucket list.
  • If the first element is to be deleted, directly point arBucket[nIndex] to the second element; the rest of the operation is to execute the current next by the last next of the current pointer.
  • Adjust related pointers
  • Release data memory and bucket structure memory

For detailed code and annotations, please click: zend_hash_del_key_or_index code annotation.

Performance Analysis

Advantages of PHP's hash table: PHP's HashTable provides great convenience for array operations. Whether it is array creation, adding elements, or deleting elements, the hash table provides good performance. However, its shortcomings are more obvious when the amount of data is large. Let's look at its shortcomings from the perspective of time complexity and space complexity.

The shortcomings are as follows:

  • The zval structure that stores data needs to allocate memory separately, and this extra memory needs to be managed. Each zval occupies 16 bytes of memory;
  • When adding a new bucket, the bucket is also allocated additionally and also requires 16 bytes of memory;
  • In order to perform sequential traversal, a bidirectional linked list is used to connect the entire HashTable, which adds a lot of pointers, and each pointer also requires 16 bytes of memory;
  • When traversing, if the element is at the end of the bucket list, you also need to traverse the entire bucket list to find the value you are looking for.

The main disadvantage of PHP's HashTable is that its double-linked list has extra pointers and zval and bucket, which require additional memory allocation, resulting in a lot of memory space being occupied and a lot of extra time consumed during search.

Follow-up

The above-mentioned deficiencies have been well solved in PHP7. PHP7 has made a major transformation to the data structure in the kernel, making PHP much more efficient. Therefore, it is recommended that PHP developers develop and deploy version updates. Take a look at the following PHP code:

<?php
$size = pow(2, 16); 
 
$startTime = microtime(true);
$array = array();
for ($key = 0, $maxKey = ($size - 1) * $size; $key <= $maxKey; $key += $size) {
    $array[$key] = 0;
}
$endTime = microtime(true);
echo 'Insert', $size, 'Malicious elements required', $endTime - $startTime, 'seconds', "\n";
 
$startTime = microtime(true);
$array = array();
for ($key = 0, $maxKey = $size - 1; $key <= $maxKey; ++$key) {
    $array[$key] = 0;
}
$endTime = microtime(true);
echo 'Insert', $size, 'Normal elements required', $endTime - $startTime, 'Seconds', "\n";

The demo above compares the time consumption when there are multiple hash conflicts and when there are no conflicts. I ran this code under PHP5.4 and the results are as follows

Inserting 65536 malicious elements takes 43.72204709053 seconds

Inserting 65536 normal elements takes 0.009843111038208 seconds

And the result of running on PHP7:

Inserting 65536 malicious elements takes 4.4028408527374 seconds

Inserting 65536 normal elements takes 0.0018510818481445 seconds

It can be seen that PHP7's performance has improved a lot in both conflicting and non-conflicting array operations. Of course, the performance improvement in conflicting arrays is more obvious. As for why PHP7's performance has improved so much, it is worth further investigation.

Finally, I would like to share with you that there is a simplified HashTable implementation on my github: HashTable Implementation

In addition, I have more detailed annotations on the PHP source code on GitHub. If you are interested, you can take a look and give a star. PHP5.4 source code annotations. You can view the added annotations through the commit records.

This is an original article with limited writing skills and limited knowledge. If there is anything inaccurate in the article, please let me know.

If this article is helpful to you, please click to recommend it, thank you ^_^

Reference articles:

PHP array Hash collision example

Understanding PHP's internal array implementation (PHP's Source Code for PHP Developers - Part 4)

PHP's new hashtable implementation

<<:  How to Evolve Neural Networks with Automatic Machine Learning

>>:  Personal understanding of the stack in function calls

Recommend

6 basic elements of complete event promotion planning

Before sharing your content, please consider the ...

6 elements, how to trick consumers into placing orders step by step?

Our consumption behavior is 100% controlled. Why ...

What efforts has mankind made to approach the low temperature limit?

Every summer, people living in hot environments a...

What are the reasons for website promotion failure? What is website promotion?

For those who do website optimization, there are ...

10 Marketing Rules of Queen Estee Lauder

What I want to talk about today is not Estee Laud...

Why does Samsung make its phones curved?

Samsung's new Galaxy Note Edge has attracted ...

Why is kiwi fruit called kiwi? The truth is that it is a bird!

Do you know which fruit is named after both a mon...

Small Animals, Big Contributions: Scientific Understanding of Laboratory Mice

When people are testing drugs on themselves, they...

How to optimize the Meizu App Store? Meizu App Store optimization tips!

I think everyone is familiar with the Meizu brand...