Ruan Yifeng: Github's object counting algorithm

When using Github, have you ever seen the following prompt?

 $ git clone https://github.com/torvalds/linux
 Cloning into 'linux'...
 remote: Counting objects: 4350078, done.
 remote: Compressing objects: 100% (4677/4677), done.
 Receiving objects: 4% (191786/4350078), 78.19 MiB | 8.70 MiB/s

This prompt says that there are a total of 4350078 objects in the remote code repository that need to be cloned.

This is called "counting objects". Github needs to calculate the total number of objects that need to be cloned in real time.

This process is very slow. According to Github, it takes 8 minutes to count a huge library like the Linux kernel! That is to say, after issuing the git clone command, you will wait for eight minutes before the actual data transfer begins. This is of course unbearable. The Github team has been trying to solve this problem.

Later, they finally discovered a new algorithm, and now it only takes 3 milliseconds to count once!

In order to understand this algorithm, you must first know what Git objects are. Simply put, objects are files, and there are three most important objects.

Snapshot Object (Commit)
Directory object
File Object

Every time you submit code, a commit object is generated, which contains the name of the corresponding current "directory object". The "directory object" stores the subdirectories and file information contained in the code root directory. Each subdirectory is another "directory object", and each file is a "file object" with specific file content.

Therefore, "counting objects" means counting various commits, directories, files, etc. Both git clone and git fetch operations require counting objects because you need to know which object files to download.

The original algorithm for counting objects is as follows.

List all local branches***a commit
List all remote branches***a commit
Compare the two, and if there is a difference, it means that the branch has changed.
For each commit that changes, check the subdirectories and files that have changed.
Trace back to the parent node of the current commit and repeat step 4 until the local and remote histories are consistent.
Add up all objects that need to be changed

The above process shows that "counting objects" is a file traversal algorithm. Changed objects will be counted one by one, which means a large number of file read operations. For large code bases, this process is very slow.

The new algorithm that the Github team came up with is to create a Bitmap index, that is, to generate a binary value for each commit.

Open the .git/objects/pack/ directory of your local Github repository, and you will see an index file and a data file, which are Bitmaps. Simply put, these two files index all objects in the current code base, and then use a binary value to represent these objects. The binary value has as many bits as there are objects. Its nth bit represents the nth object in the data file.

Each commit will have a corresponding binary value, which represents all objects contained in the current snapshot. The binary bits corresponding to these objects are all 1, and the other binary bits are all 0.

The advantage of this is that you don't need to read the commit object, just read the binary value to know which nodes are included in the current commit. Even better, you only need to do an XOR operation on the two binary values to know which bits (that is, which objects) have changed. Moreover, because new objects are always added after the existing binary bits, you only need to read the extra bits to know which objects are added to the current commit compared to the previous commit.

In this way, "counting objects" becomes a binary value comparison operation, so the speed is extremely fast. For further introduction, please refer to the official documents "Bitmap Explanation" and "Bitmap Format".

Currently, Github's production environment has deployed this algorithm, and users no longer have to wait for object counting. In addition, the Github team has merged it into Git, which means that all Git implementations can use the Bitmap function from now on, so there will definitely be more interesting uses in the future.

(over)

<<: Game Framework Construction - Setting the Vision

>>: 7 Top HTML5 Canvas Animations

Strong winds break records, hail covers the ground! This year's most intense severe convective weather is coming, how to prevent it

Blog

"Ask the Heaven", embark on the road of asking the Heaven!

Ruan Yifeng: Github's object counting algorithm

Strong winds break records, hail covers the ground! This year's most intense severe convective weather is coming, how to prevent it

"Ask the Heaven", embark on the road of asking the Heaven!

Mouse: In order to make a living, what kind of environment have I not lived in?

What does a data relay satellite do? How did it develop?

Why can the Yingxian Wooden Pagoda in Shanxi, which looks like a building block, stand for a thousand years?

It understands everything you say, AliOS upgrades scene intelligent semantic understanding

Waymo partners with car rental giant Avis to provide self-driving cars as a service

How to become a great operator? I'll give you 2 methods

The app you exited always restarts suddenly? Beware of privacy leaks

Why can’t Baidu’s “Kuai Search” detect gutter oil?

Recommend

Python skills courses that professionals must learn

What are the pros and cons of wind power construction? Why is China stepping up its construction efforts?

Mars Era 2021 Film and Television Editing Course [HD Quality]

Mengniu’s marketing leveraging the college entrance examination is super hardcore!

Uncovering the secrets of children’s personalities: strong-willed people vs. weak-willed people, which one is your baby?

Jia Zhen's "Xiaohongshu Merchant Camp" Issue 6 Merchant Edition, let merchants lose paid traffic

Is it difficult to make a small program yourself? Is it difficult to make your own mini-program?

4000-word KOC expert operation strategy

What changes will happen to a girl's body if her menstruation stops?

[Smart Farmers] Subvert the impression! This kind of "working" bees can also reproduce offspring

If you don’t clean these 4 places on the rice cooker, be careful not to eat “cockroach steamed rice”!

These 7 points will help your landing page conversion rate soar!

The Taklimakan Desert changes into a sea of snow with one click! | Environmental Trumpet

iOS 18 hidden feature, supports T9 dialing!

Review of marketing cases in Q1 2022!