How does Google never go down?

According to Google's official data, the company's Google Apps suite was available 99.97% of the time in 2015.

Perhaps we take this for granted, but it is a remarkable fact, and the billions of Google users around the world rarely stop to think about how Google manages something so difficult with such apparent ease.

Replacing human labor with software

Google's answer comes down to three words: Site Reliability Engineering (SRE). The name may not sound particularly sexy, but it is the core concept Google has adhered to for 10 years.

This concept is difficult to explain in one or two sentences, but it can be summed up in one central idea: let coders, rather than IT professionals who specialize in operating network services, run those services. Under this approach, the coders build tools that carry out operations work without human intervention (operations here mainly means maintaining the stability and performance of the service).

"This is how we build a team of people who tire of doing things by hand and instead write software to do the work that was previously done manually," Google's Ben Treynor Sloss wrote in a post.

To many in Silicon Valley, this seems like common sense; from Amazon to Box.com, the approach has been adopted across the tech world. People call it the DevOps (Development plus Operations) model, which refers to the effort to connect software developers with system administrators. But DevOps, as represented by tools such as Chef and Puppet, was itself gradually derived from Google's SRE. Google has been silent on SRE for the past decade, just as it has been about most of how it runs large-scale, efficient network operations.

However, Google has entered a new phase, and is more willing to discuss SRE issues. This is mainly because Google wants to promote its cloud services so that outside companies can use its software services. In addition, Google has also written a book to discuss SRE issues.

The book is called Site Reliability Engineering. It has just been published by O'Reilly, and Sloss's post is its first chapter.

If you are interested in DevOps, this book is a must-read; even if you are not, the opening of the book (the preface, the introduction, and the first chapter) is enough to show what drives Google, the world's largest network empire.

For many tech companies, and indeed for many people outside the tech world, systems administration (or operations, whatever you want to call it) is an afterthought and one of the most tedious parts of computer science.

But Sloss, known internally as Google's vice president of "always on operations," turned the issue around, arguing that site reliability "is the most fundamental function of any product." After all, "if a system doesn't work, it's not very useful."

Hegel's Theory of the Unity of Opposites

SRE began with Sloss. He founded the project when Google recruited him in its early years to lead the company's operations. "SRE is what you get when you ask a software engineer to design an operations team," he said. "I designed and managed this team the way I would want it to run if I were an SRE myself."

Todd Underwood, currently an SRE director at Google, believes it was natural for Google to hire coders like Sloss. "When Google was still in its early stages of development, there were already software engineers who were very aware of where things could go wrong and how to fix them, but none of them were willing to deal with them themselves."

This is actually a troublesome thing. But Chef's CTO (Chief Technology Officer) Adam Jacob also believes that in order to grow into a large-scale company, this transformation is necessary. "It is a very natural thing to connect software development and actual operations. You can't naturally separate the two; especially when you look at this issue historically, you may be more aware of this."

This shift is interesting considering that development and operations have traditionally been two separate disciplines. Developers focus on writing new software, modifying it, and getting it out to the public as quickly as possible, while operations focuses on making sure it works, preferably with minimal changes. "These are opposing goals," Underwood said, "but the joke is that when you connect development and operations, you start to eliminate their competing goals."

Underwood called it "Hegel's theory of the unity of opposites"; but when he said that, no one bought it. "People don't read Hegel anymore," he said, jokingly. But the description was spot-on. Once the preparations were in place, Google accelerated the process of putting all its good ideas into this model.

The balance between development and operations

One idea is especially important: to reduce the conflict between development and operations, Google does not require 100% uptime. As Sloss writes in the book, it is not actually necessary for network services to be available 100% of the time. Users cannot really tell the difference between 100% and 99.999% (in fact, their laptops, WiFi, and batteries are offline for far more than 0.001% of the time). If you set a reasonable availability target below 100%, the gap between that target and 100% becomes the error budget: the amount of downtime you can afford to spend on making changes and debugging them.

“Error budgets remove the friction between development and SRE,” Sloss says. “An outage is no longer a bad thing. It’s part of the expected innovation process, and both development and SRE can address it without fear.”
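The arithmetic behind an error budget is simple enough to sketch. The following Python is a minimal illustration, not Google's tooling: the function names `error_budget` and `budget_remaining` are hypothetical, and it assumes availability is measured purely as uptime over a fixed window.

```python
from datetime import timedelta


def error_budget(availability_target: float, window: timedelta) -> timedelta:
    """Return the downtime allowed in `window` while still meeting the target.

    `availability_target` is a fraction, e.g. 0.9997 for 99.97%.
    (Hypothetical helper for illustration only.)
    """
    if not 0.0 <= availability_target <= 1.0:
        raise ValueError("availability_target must be between 0 and 1")
    return window * (1.0 - availability_target)


def budget_remaining(
    availability_target: float, window: timedelta, downtime_so_far: timedelta
) -> timedelta:
    """Return how much of the budget is left; negative means it is overspent."""
    return error_budget(availability_target, window) - downtime_so_far


if __name__ == "__main__":
    year = timedelta(days=365)
    for target in (0.9997, 0.999, 0.99999):
        budget = error_budget(target, year)
        print(f"{target:.3%} over a year allows {budget.total_seconds() / 60:.1f} minutes down")
```

Under these assumptions, the 99.97% figure quoted at the start of the article corresponds to roughly 2.6 hours of downtime per year, while a 99.999% target leaves only about 5 minutes. A team could compare `budget_remaining` against zero to decide whether there is still room to ship risky changes in the current window.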

At the same time, Google has introduced rules to ensure that SRE does not slide back into old-fashioned system administration. In principle, SREs are not allowed to spend more than 50% of their time on traditional operations work (as opposed to programming). If operations takes priority over development on an SRE team, Google will move some of its operations staff to ordinary software development work.

“A conscious balance between development and operations ensures that SREs have enough space to invest in creative, automated engineering,” Sloss said. “Of course, they also have to listen to operations.”

Jacob from Chef thinks the exact 50% ratio is not that important, but he likes the attitude. He said, "It's a business, and someone has to handle operations; and operations work is almost endless, so it's understandable that you insist on putting a cap on it."

Google also has strict standards for hiring SREs. Some 50% to 60% of them pass the same rigorous assessment as all other Google engineers, and the rest need to have 85% to 99% of the skills of Google engineers, plus some skills that are specific to SRE and that most software engineers lack, such as a thorough understanding of the UNIX operating system and of low-level network protocols.

All of this is meant to ensure a proper balance between development and operations.

SRE’s ambition

In many ways, this is a new philosophy. But when the Google team tries to describe it in the book, they reach for an older example. The spiritual forerunner of Google SRE was an MIT programmer named Margaret Hamilton, who wrote the moon landing software for the Apollo spacecraft in the 1960s. As Hamilton herself said, part of the culture that emerged from the Apollo program was to learn from everyone and everything, including from sources one would least expect.

Although Hamilton was a coder, she played a major role in operations. To illustrate this, the book tells a story about how she often brought her daughter Lauren into the computer lab. One day Lauren hit a button that loaded the Apollo pre-launch program into a computer that was running a "post-launch scenario" program.

This caused the entire system to freeze, and Hamilton proposed adding error-detection code to the system to prevent the same mistake during a real flight. Her superiors rejected the idea, arguing that astronauts would never make such a mistake; but during Apollo 8, the astronauts did exactly that. Fortunately, Hamilton had included a workaround in the system documentation. The error-detection code was added in subsequent work.

"If you come to me and say, 'It's going to crash,' that's no good; but if you say, 'It's going to crash, let me tell you how to fix it,' that's great," Underwood said. "Here, we have people who know that something is going to go wrong, know where it's going to crash, and can figure out how to prevent it from happening."

That's DevOps, or, as Google calls it, SRE. Those three words don't sound like much, but they're a very powerful idea. It was born at Google, but some of the more philosophical SREs, like Underwood, have bigger ambitions. In their vision, operations itself is moving faster than development.

"We hope that in the long run, no one will be operating it anymore," Underwood said.
