Ruan Yifeng: Github's object counting algorithm

Ruan Yifeng: Github's object counting algorithm

When using Github, have you ever seen the following prompt?

  1. $ git clone https://github.com/torvalds/linux
  2. Cloning into 'linux'...
  3. remote: Counting objects: 4350078, done.
  4. remote: Compressing objects: 100% (4677/4677), done.
  5. Receiving objects: 4% (191786/4350078), 78.19 MiB | 8.70 MiB/s

This prompt says that there are a total of 4350078 objects in the remote code repository that need to be cloned.

This is called "counting objects". Github needs to calculate the total number of objects that need to be cloned in real time.

This process is very slow. According to Github, it takes 8 minutes to count a huge library like the Linux kernel! That is to say, after issuing the git clone command, you will wait for eight minutes before the actual data transfer begins. This is of course unbearable. The Github team has been trying to solve this problem.

Later, they finally discovered a new algorithm, and now it only takes 3 milliseconds to count once!

In order to understand this algorithm, you must first know what Git objects are. Simply put, objects are files, and there are three most important objects.

  • Snapshot Object (Commit)
  • Directory object
  • File Object

Every time you submit code, a commit object is generated, which contains the name of the corresponding current "directory object". The "directory object" stores the subdirectories and file information contained in the code root directory. Each subdirectory is another "directory object", and each file is a "file object" with specific file content.

Therefore, "counting objects" means counting various commits, directories, files, etc. Both git clone and git fetch operations require counting objects because you need to know which object files to download.

The original algorithm for counting objects is as follows.

  1. List all local branches***a commit
  2. List all remote branches***a commit
  3. Compare the two, and if there is a difference, it means that the branch has changed.
  4. For each commit that changes, check the subdirectories and files that have changed.
  5. Trace back to the parent node of the current commit and repeat step 4 until the local and remote histories are consistent.
  6. Add up all objects that need to be changed

The above process shows that "counting objects" is a file traversal algorithm. Changed objects will be counted one by one, which means a large number of file read operations. For large code bases, this process is very slow.

The new algorithm that the Github team came up with is to create a Bitmap index, that is, to generate a binary value for each commit.

Open the .git/objects/pack/ directory of your local Github repository, and you will see an index file and a data file, which are Bitmaps. Simply put, these two files index all objects in the current code base, and then use a binary value to represent these objects. The binary value has as many bits as there are objects. Its nth bit represents the nth object in the data file.

Each commit will have a corresponding binary value, which represents all objects contained in the current snapshot. The binary bits corresponding to these objects are all 1, and the other binary bits are all 0.

The advantage of this is that you don't need to read the commit object, just read the binary value to know which nodes are included in the current commit. Even better, you only need to do an XOR operation on the two binary values ​​to know which bits (that is, which objects) have changed. Moreover, because new objects are always added after the existing binary bits, you only need to read the extra bits to know which objects are added to the current commit compared to the previous commit.

In this way, "counting objects" becomes a binary value comparison operation, so the speed is extremely fast. For further introduction, please refer to the official documents "Bitmap Explanation" and "Bitmap Format".

Currently, Github's production environment has deployed this algorithm, and users no longer have to wait for object counting. In addition, the Github team has merged it into Git, which means that all Git implementations can use the Bitmap function from now on, so there will definitely be more interesting uses in the future.

(over)

<<:  Game Framework Construction - Setting the Vision

>>:  7 Top HTML5 Canvas Animations

Recommend

Unexpectedly! This place on earth is the best place to see the stars

"Sitting on the ground, I can travel eighty ...

It’s 2020 now. Are foldable phones still “expensive vases”?

At the Samsung Unpacked conference that just ende...

Murphy's Law: 20 ways to avoid bad luck in life

Through the exaggerated expression of Murphy'...

How much does it cost to make an animation mini program in Dingxi?

The main factors affecting the price of mini prog...

How much does it cost to rent a server with 8 cores and 16G per year?

Server rental is divided into two categories: ind...

In Poyang Lake, rotifers create 133 trillion plastic particles every day

The total number of nanoplastic particles produce...

It's raining "microplastics" from the sky...|Environmental Trumpet

Hello everyone, this is the 2nd issue of the Envi...

Exynos 7420/A8/Snapdragon 810 actual test: Qualcomm can't raise its head!

Snapdragon 810 is Qualcomm's pain point. Alth...