In the PHP kernel, one of the most important data structures is HashTable. The arrays we often use are implemented in the kernel using HashTable. So, how is PHP's HashTable implemented? I was recently reading about the data structure of HashTable, but there is no specific implementation algorithm in the algorithm books. I happened to be reading the PHP source code recently, so I referred to the implementation of PHP's HashTable and implemented a simple version of HashTable. I summarized some of my experiences and would like to share them with you below. The author has a simplified version of HashTable implementation on github: HashTable implementation In addition, I have more detailed annotations on the PHP source code on GitHub. If you are interested, you can take a look and give a star. PHP5.4 source code annotations. You can view the added annotations through the commit record. Introduction to HashTableA hash table is an efficient data structure for implementing dictionary operations. definitionSimply put, a HashTable is a data structure of key-value pairs. It supports operations such as insert, search, and delete. Under some reasonable assumptions, the time complexity of all operations in a hash table is O(1) (those interested in the relevant proof can check it out by themselves). The key to implementing a hash tableIn a hash table, instead of using keywords as subscripts, a hash function is used to calculate the hash value of the key as the subscript, and then the hash value of the key is calculated again when searching/deleting, so as to quickly locate the location where the element is stored. In a hash table, different keywords may calculate the same hash value, which is called a "hash collision", which is to deal with the situation where two or more keys have the same hash value. There are many ways to solve hash collisions, such as open addressing, zipper method, etc. Therefore, the key to implementing a good hash table is a good hash function and a method to handle hash conflicts. Hash FunctionsThere are four definitions for judging the quality of a hash algorithm:
The hash function establishes a corresponding relationship between the key value and the hash value, that is, h = hash_func(key). The corresponding relationship is shown in the figure below: Let experts design a perfect hash function. We can just use the existing mature hash functions. The hash function used by the PHP kernel is the time33 function, also known as DJBX33A, which is implemented as follows: static inline ulong zend_inline_hash_func(const char *arKey, uint nKeyLength) { register ulong hash = 5381; /* variant with the hash unrolled eight times */ for (; nKeyLength >= 8; nKeyLength -= 8) { hash = ((hash << 5) + hash) + *arKey++; hash = ((hash << 5) + hash) + *arKey++; hash = ((hash << 5) + hash) + *arKey++; hash = ((hash << 5) + hash) + *arKey++; hash = ((hash << 5) + hash) + *arKey++; hash = ((hash << 5) + hash) + *arKey++; hash = ((hash << 5) + hash) + *arKey++; hash = ((hash << 5) + hash) + *arKey++; } switch (nKeyLength) { case 7: hash = ((hash << 5) + hash) + *arKey++; /* fallthrough... */ case 6: hash = ((hash << 5) + hash) + *arKey++; /* fallthrough... */ case 5: hash = ((hash << 5) + hash) + *arKey++; /* fallthrough... */ case 4: hash = ((hash << 5) + hash) + *arKey++; /* fallthrough... */ case 3: hash = ((hash << 5) + hash) + *arKey++; /* fallthrough... */ case 2: hash = ((hash << 5) + hash) + *arKey++; /* fallthrough... */ case 1: hash = ((hash << 5) + hash) + *arKey++; break; case 0: break; EMPTY_SWITCH_DEFAULT_CASE() } return hash; }
Zipper method The method of storing all elements with the same hash value in a linked list is called the zipper method. When searching, first calculate the hash value corresponding to the key, then find the corresponding linked list according to the hash value, and finally search for the corresponding value along the linked list sequence. PHP HashTable StructureAfter briefly introducing the data structure of the hash table, let's look at how the hash table is implemented in PHP. The definition of PHP kernel hashtable:typedef struct _hashtable { uint nTableSize; uint nTableMask; uint nNumOfElements; ulong nNextFreeElement; Bucket *pInternalPointer; Bucket *pListHead; Bucket *pListTail; Bucket **arBuckets; dtor_func_t pDestructor; zend_bool persistent; unsigned char nApplyCount; zend_bool bApplyProtection; #if ZEND_DEBUG int inconsistent; #endif } HashTable;
Here is an example of combining hash and mask: For example, the real hash value of "foo" (using DJBX33A hash function) is 193491849. If we now have a hash table with a capacity of 64, we obviously cannot use it as an array index. Instead, we apply the mask of the hash table and then only take the low bits of the hash table. hash | 193491849 | 0b1011100010000111001110001001 & mask | & 63 | & 0b0000000000000000000000111111 ------------------------------------------------------------------------------- = index | = 9 | = 0b00000000000000000000000001001 Therefore, in the hash table, foo is stored in the bucket vector with index 9 in arBuckets. Definition of bucket structuretypedef struct bucket { ulong h; uint nKeyLength; void *pData; void *pDataPtr; struct bucket *pListNext; struct bucket *pListLast; struct bucket *pNext; struct bucket *pLast; const char *arKey; } Bucket;
The HashTable in PHP is implemented by vector plus bidirectional linked list. The vector is stored in the arBuckets variable. The vector contains pointers to multiple buckets. Each pointer points to a bidirectional linked list composed of multiple buckets. New elements are added using the forward insertion method, that is, the new element is always at the first position of the bucket. As can be seen from the above, the hash table implementation of PHP is quite complicated. This is the price it has to pay for using the ultra-flexible array type. An example diagram of a HashTable in PHP is shown below: HashTable related API
zend_hash_initFunction execution steps
For detailed code annotations, click: zend_hash_init source code
zend_hash_add_or_updateFunction execution steps
Function execution flow chartCONNECT_TO_BUCKET_DLLIST adds new elements to the bucket list with the same hash value. CONNECT_TO_GLOBAL_DLLIST is a doubly linked list that adds new elements to the HashTable. For detailed code and annotations, please click: zend_hash_add_or_update code annotations. zend_hash_findFunction execution steps
For detailed code and annotations, please click: zend_hash_find code annotations. zend_hash_del_key_or_indexFunction execution steps
For detailed code and annotations, please click: zend_hash_del_key_or_index code annotation. Performance AnalysisAdvantages of PHP's hash table: PHP's HashTable provides great convenience for array operations. Whether it is array creation, adding elements, or deleting elements, the hash table provides good performance. However, its shortcomings are more obvious when the amount of data is large. Let's look at its shortcomings from the perspective of time complexity and space complexity. The shortcomings are as follows:
The main disadvantage of PHP's HashTable is that its double-linked list has extra pointers and zval and bucket, which require additional memory allocation, resulting in a lot of memory space being occupied and a lot of extra time consumed during search. Follow-upThe above-mentioned deficiencies have been well solved in PHP7. PHP7 has made a major transformation to the data structure in the kernel, making PHP much more efficient. Therefore, it is recommended that PHP developers develop and deploy version updates. Take a look at the following PHP code: <?php $size = pow(2, 16); $startTime = microtime(true); $array = array(); for ($key = 0, $maxKey = ($size - 1) * $size; $key <= $maxKey; $key += $size) { $array[$key] = 0; } $endTime = microtime(true); echo 'Insert', $size, 'Malicious elements required', $endTime - $startTime, 'seconds', "\n"; $startTime = microtime(true); $array = array(); for ($key = 0, $maxKey = $size - 1; $key <= $maxKey; ++$key) { $array[$key] = 0; } $endTime = microtime(true); echo 'Insert', $size, 'Normal elements required', $endTime - $startTime, 'Seconds', "\n"; The demo above compares the time consumption when there are multiple hash conflicts and when there are no conflicts. I ran this code under PHP5.4 and the results are as follows
And the result of running on PHP7:
It can be seen that PHP7's performance has improved a lot in both conflicting and non-conflicting array operations. Of course, the performance improvement in conflicting arrays is more obvious. As for why PHP7's performance has improved so much, it is worth further investigation. Finally, I would like to share with you that there is a simplified HashTable implementation on my github: HashTable Implementation In addition, I have more detailed annotations on the PHP source code on GitHub. If you are interested, you can take a look and give a star. PHP5.4 source code annotations. You can view the added annotations through the commit records. This is an original article with limited writing skills and limited knowledge. If there is anything inaccurate in the article, please let me know. If this article is helpful to you, please click to recommend it, thank you ^_^ Reference articles: PHP array Hash collision example Understanding PHP's internal array implementation (PHP's Source Code for PHP Developers - Part 4) PHP's new hashtable implementation |
<<: How to Evolve Neural Networks with Automatic Machine Learning
>>: Personal understanding of the stack in function calls
Before sharing your content, please consider the ...
Our consumption behavior is 100% controlled. Why ...
Tmall’s transaction volume on Double Eleven reach...
If you were asked to give a simple description of...
Audit expert: Qu Bo Chief Physician of General Su...
Recently, a Mr. Xia from Huangshi, Hubei, posted ...
On September 9, the WeChat special event "We...
Every summer, people living in hot environments a...
For those who do website optimization, there are ...
What I want to talk about today is not Estee Laud...
Samsung's new Galaxy Note Edge has attracted ...
Do you know which fruit is named after both a mon...
Apps "steal" and waste traffic, so be c...
When people are testing drugs on themselves, they...
I think everyone is familiar with the Meizu brand...