Unraveling the Google DevOps puzzle: What qualities make for a world-class, reliable system?

Randy Shoup, who helped lead engineering teams at eBay and Google, is one of the few people I've met who can clearly describe the leadership traits needed to build productive DevOps organizations and highly reliable systems. Two things of his stand out to me: his 2013 Flowcon talk and his work transforming eBay's architecture in the early 2000s.


Interview with Randy Shoup: Unraveling the principles of improving Google DevOps

This article was compiled by Gene Kim from an interview with Randy Shoup, and it explores in depth how Google practices and improves DevOps. Let's take a closer look.

Dr. Spear's model describes four capabilities:

Capability 1: See problems as soon as they occur;

Capability 2: Once a problem is found, swarm it, solve it, and record the solution as new knowledge;

Capability 3: Spread new knowledge throughout the company;

Capability 4: Lead by developing others.

This model formed the basis for the interview with Randy Shoup, which also surfaced some practices at Google and eBay that are not widely discussed.

(I have learned more from Randy Shoup than I can easily put into words. If you want to learn more and apply it in your own company, you can reach him through his LinkedIn profile; he currently works as a consultant.)
Capability 1: Identify problems as soon as they occur

Dr. Spear wrote:

High-velocity companies specify designs in detail to capture existing knowledge and build in tests to reveal problems.

Whether working individually or in teams, with or without equipment, high-velocity companies are unwilling to accept ambiguity. They specify the following in advance:
(a) The expected output; (b) Who is responsible for what work and in what order; (c) How products, services, and information are delivered from the person in charge of the previous step to the person in charge of the next step; and (d) The method for completing each part of the work.

GK (author): When it comes to DevOps, Google is surely one of the role models, especially in automated testing.

Eran Messeri from the Google SCM team spoke at the 2013 GOTOcon Aarhus conference in a session titled "What goes wrong when thousands of engineers share the same continuous build?" (notes from the talk can be found here)

He listed some noteworthy statistics (as of 2013) and described how Google built the fastest, most timely, and most cost-effective feedback loop for its programmers that it could:

15,000 programmers (across development and operations)

4,000 concurrent projects

All source code checked into a single shared repository (billions of files!)

5,500 code commits by those 15,000 programmers

Automated tests run 75 million times a day

0.5% of engineers dedicated to development tooling

Ashish Kumar's 2010 QConSF slides show even more impressive numbers achieved by Google's development teams.

Q: Google is probably the poster child for automated testing, and everyone wants to know more about your experience working there.

A: Indeed, Google does a tremendous amount of automated testing, more than anywhere else I've worked. Everything needs to be tested: not just getters and setters, but anything that could plausibly go wrong.

Deciding what to test is often the hardest part. Nobody spends time writing tests for the things they already know will work; you write the tests that are likely to fail, which are usually also the hardest ones to write.

In practice, this means teams have to do reliability testing. Usually they want to test a component in isolation, with the other parts mocked out, so they can exercise it in a semi-realistic world; more importantly, they can inject failures into those mocks.

By constantly testing this way, you find out where components break, including failure modes that might show up only one time in a million or ten million in real operation (for example, two server replicas down at once, a failure between the prepare and commit phases, or the whole server going down in the middle of the night).

All of this makes it necessary to build recovery tests into your daily work and run them all the time, which is a huge amount of work.
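To make this concrete, here is a minimal sketch, in Python, of what such a recovery test can look like. It is not Google's actual test framework; the names (FakeReplicaSet, RetryingClient, inject_failure) are hypothetical. A fake replicated store is told to fail between the prepare and commit phases, and the test asserts that the client recovers instead of losing the write.

```python
import unittest


class ReplicaUnavailable(Exception):
    """Raised by the fake backend when an injected failure fires."""


class FakeReplicaSet:
    """In-memory stand-in for a replicated store, with one-shot failure injection."""

    def __init__(self):
        self.committed = {}
        self._fail_at = None  # "prepare" or "commit"

    def inject_failure(self, phase):
        self._fail_at = phase

    def write(self, key, value):
        phase, self._fail_at = self._fail_at, None  # injected faults fire once
        if phase == "prepare":
            raise ReplicaUnavailable("replica lost during prepare")
        if phase == "commit":
            raise ReplicaUnavailable("replica lost between prepare and commit")
        self.committed[key] = value
        return "ok"


class RetryingClient:
    """Code under test: retries a failed write a bounded number of times."""

    def __init__(self, backend, max_attempts=3):
        self.backend = backend
        self.max_attempts = max_attempts

    def write(self, key, value):
        last_error = None
        for _ in range(self.max_attempts):
            try:
                return self.backend.write(key, value)
            except ReplicaUnavailable as err:
                last_error = err  # a real client would back off before retrying
        raise RuntimeError("write failed after retries") from last_error


class RecoveryTest(unittest.TestCase):
    def test_write_survives_failure_between_prepare_and_commit(self):
        backend = FakeReplicaSet()
        backend.inject_failure("commit")
        client = RetryingClient(backend)
        client.write("user:42", "hello")
        self.assertEqual(backend.committed["user:42"], "hello")


if __name__ == "__main__":
    unittest.main()
```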

Q: Where did Google’s current automated testing rules come from?

A: I don't know how the rules at Google evolved, but they were already there when I got there. It's really amazing that all the components in this large-scale distributed system are constantly tested using these complex methods.

As a newbie, I didn't want to write something crappy that wasn't adequately tested, and as a leader, I especially didn't want to set a bad example for the team.

Here's a concrete example that shows some of the advantages of working this way. As you may have read in the famous papers (Google File System, BigTable, Megastore, and so on), Google's common infrastructure services are each run by an independent team, usually a surprisingly small one.

Those teams not only write the code, they also operate it. As a component matures, the team provides not just the service itself but also a client library that makes the service easier to use. With those client libraries you can simulate the backend for client-side testing and inject all kinds of failure scenarios. For example, you can use the production BigTable client library together with a simulator that behaves just like the real platform. Want to inject failures in the write or ack phase? Just do it!
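A sketch of that pattern, with hypothetical names rather than the real BigTable client API: the service team ships a fake that implements the same interface as the production client, plus a hook for injecting failures at a chosen phase, so client teams can test their failure handling without touching production.

```python
class TableError(Exception):
    pass


class TableClient:
    """Interface the real client library exposes (details elided here)."""

    def write(self, row, value):
        raise NotImplementedError("the real client talks to the production backend")

    def read(self, row):
        raise NotImplementedError


class FakeTableClient(TableClient):
    """Drop-in fake shipped alongside the client library for testing.

    It behaves like the real service but lets callers schedule a failure
    at the 'write' or 'ack' phase of the next operation.
    """

    def __init__(self):
        self._rows = {}
        self._fail_next_phase = None

    def fail_next(self, phase):
        assert phase in ("write", "ack")
        self._fail_next_phase = phase

    def write(self, row, value):
        phase, self._fail_next_phase = self._fail_next_phase, None
        if phase == "write":
            raise TableError("injected failure before the write was applied")
        self._rows[row] = value
        if phase == "ack":
            # The write landed but the caller never hears about it: exactly
            # the ambiguous case client code has to handle.
            raise TableError("injected failure before the ack was returned")
        return "ok"

    def read(self, row):
        return self._rows.get(row)


# Usage in a client team's test:
table = FakeTableClient()
table.fail_next("ack")
try:
    table.write("row1", "v1")
except TableError:
    pass  # the code under test would retry or reconcile here
assert table.read("row1") == "v1"  # the write actually landed
```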

I suspect these principles and practices were honed the hard way, from those emergency situations where you keep asking “how do I avoid downtime?”

Over time, the rules were eventually refined and a solid structure was achieved.
Capability 2: Swarm problems as they are found, solve them, and record the solutions as new knowledge.

Dr. Spear wrote:

“High-velocity companies are good at containing problems in their systems as soon as they are found. They are also good at: (1) containing problems before they spread; and (2) finding and fixing the causes of problems so they do not recur. In doing so, they build a deeper knowledge of how to manage the systems that do the work, converting inevitable early gaps in understanding into knowledge.”

GK: The two most striking examples of swarming to solve problems in my research are:

Toyota's Andon cord, which is pulled to stop work whenever it deviates from the known pattern. It is recorded that a typical Toyota plant pulls the Andon cord an average of 3,500 times a day.

Alcoa's CEO, the respected Paul O'Neill, who established a policy to reduce workplace accidents: any workplace accident had to be reported to him within 24 hours, by the general manager of the business unit.

Q: Does Google's culture have anything similar to the mechanisms that support swarming behaviors, such as Toyota's Andon cord or Alcoa's requirement that the CEO be notified of workplace incidents?

A: Absolutely. Both of those resonate with me. Both eBay and Google have a culture of blame-free postmortems. (GK: John Allspaw also calls these blameless post-mortems.)

Postmortems are a very important discipline: we hold one whenever an outage affects a customer. As John Allspaw and others have described at length, the goal of a postmortem is not to assign blame, but to create learning opportunities and broad communication across the company.

I've found that a blameless postmortem culture creates an amazing dynamic: engineers almost compete to own up to the biggest mistakes. For example, "Hey, we discovered a backup-recovery procedure that we had never tested," or "And then we realized we didn't have active replication." This sentiment will be familiar to many engineers: "I wish we hadn't had the outage, but now we finally have a chance to fix that broken system we've been complaining about for months!"

This generates enormous organizational learning and, as Dr. Steven Spear describes, it lets us continually find and fix problems before they have catastrophic consequences.

I think it works because we are all engineers at heart and love building and improving systems, and an environment that brings problems to light makes for an exciting and satisfying work environment.

Q: What results from a postmortem? It can’t just be written up and thrown in the trash, right?

A: You may find it hard to believe, but I think the most important output is holding the postmortem meeting itself. As we all know, the most important part of DevOps is culture, and even a meeting that produces no other output improves the system.

It becomes a kata, part of our daily ritual, that demonstrates our values and how we prioritize our work.

Of course, a postmortem almost always produces a long list of what went well and what did not, along with action items that need to go into the work queue (the backlog of desired features, enhancements, documentation improvements, and so on).

When you discover that you need to make new improvements, you end up having to make changes somewhere. Sometimes it's documentation, processes, code, environment, or something else.

But even without all that, simply writing the postmortem down has tremendous value. As you would expect, everything at Google is searchable, and every Googler can read every postmortem.

When similar incidents occur in the future, the postmortem document will always be the first thing to be read.
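As an illustration of what such a write-up can contain, here is a minimal sketch of a postmortem record whose action items feed straight into a team backlog. The field names and the example incident are mine, not a Google template.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class ActionItem:
    description: str
    kind: str          # e.g. "code", "docs", "process", "environment"
    owner: str
    done: bool = False


@dataclass
class Postmortem:
    title: str
    incident_date: date
    impact: str                      # who was affected, and how badly
    root_causes: list[str] = field(default_factory=list)
    what_went_well: list[str] = field(default_factory=list)
    what_went_badly: list[str] = field(default_factory=list)
    action_items: list[ActionItem] = field(default_factory=list)


def file_into_backlog(pm: Postmortem, backlog: list[ActionItem]) -> None:
    """Append the postmortem's open action items to the team's work queue."""
    backlog.extend(item for item in pm.action_items if not item.done)


# Illustrative usage:
backlog: list[ActionItem] = []
pm = Postmortem(
    title="Elevated datastore latency",
    incident_date=date(2013, 5, 1),
    impact="Elevated errors for a subset of applications for 40 minutes",
    root_causes=["Untested backup-recovery procedure"],
    action_items=[
        ActionItem("Add a recovery drill to the weekly tests", "process", "oncall"),
        ActionItem("Document the failover runbook", "docs", "service-team"),
    ],
)
file_into_backlog(pm, backlog)
assert len(backlog) == 2
```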

Interestingly, postmortems serve another purpose. Google has a long tradition of requiring developers to run any new service themselves for at least six months. When a service team asks to "graduate" (that is, to have a dedicated SRE team or operations engineers maintain the service), they essentially negotiate with SRE to take over operational responsibility for the application.

(Gene: Check out the video of what Tom Limoncelli describes as the launch readiness and hand-off review process, where SREs review documentation, deployment mechanisms, monitoring configuration, and more. Great video!)
SREs often start by reviewing the postmortem document, which plays a large role in determining whether they can graduate an application.
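That graduation conversation is essentially a checklist negotiation. The sketch below encodes such a hand-off review; the specific checks and names are illustrative assumptions, not Google's actual criteria.

```python
# Illustrative hand-off readiness check: an SRE team walks a service through
# a list of criteria before agreeing to take over its operation.

READINESS_CHECKS = {
    "runbook_and_docs_reviewed": "Operational documentation exists and is current",
    "deployment_is_automated": "Deploys and rollbacks are one command",
    "monitoring_and_alerts_defined": "Alerts page a human only when users are affected",
    "postmortems_reviewed": "Recent postmortems read, action items closed",
    "oncall_load_acceptable": "Paging volume allows a sustainable rotation",
}


def review_for_graduation(service_name: str, results: dict[str, bool]) -> bool:
    """Return True if the service can 'graduate' to a dedicated SRE team."""
    failures = [desc for key, desc in READINESS_CHECKS.items()
                if not results.get(key, False)]
    if failures:
        print(f"{service_name}: stays with the development team. Open items:")
        for desc in failures:
            print(f"  - {desc}")
        return False
    print(f"{service_name}: ready for SRE hand-off.")
    return True


review_for_graduation("my-new-service", {
    "runbook_and_docs_reviewed": True,
    "deployment_is_automated": True,
    "monitoring_and_alerts_defined": True,
    "postmortems_reviewed": False,   # action items from the last outage still open
    "oncall_load_acceptable": True,
})
```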

Q: Have you seen requirements at Google similar to what Paul O'Neill and his team put in place at Alcoa? Are there examples of notification or escalation thresholds being lowered over time?

GK: Dr. Spear describes how Paul O'Neill and his team at Alcoa reduced injuries in aluminum plants (environments full of high heat, high pressure, and corrosive chemicals), cutting the accident rate from 2% per year to 0.07% and making the company the safest in its industry. Remarkably, once the accident rate had fallen that far, O'Neill asked employees to notify him when something merely might have gone wrong.

A: Yes. Of course, for us the equivalent of a workplace incident is an outage that affects our users. Trust me, when there is a major outage affecting customers, leadership gets notified. When an incident occurs, two things happen:

First, we mobilize everyone needed to restore service, and they keep working until the issue is resolved (this is standard procedure, of course).

Second, we hold weekly incident meetings with management (on my App Engine team that includes the heads of engineering, of whom I am one of two, our boss, our team lead, the support team, and the product manager). We review what we learned from the postmortem, go over the next steps, and confirm that we resolved the issue appropriately. If necessary, we decide whether to run a customer-facing postmortem or publish a blog post.

Sometimes there is not much to discuss. But once the situation is under control, the team wants fewer open questions at these reviews and is motivated to improve. For example, if a problem did not affect customers, we still treat it as a problem that affected the team.

Most of us have experienced “near misses” where we put six layers of protection in place, all designed to protect users from failure, and then all but one of them fails.

On my team (Google App Engine), we probably have one publicly known user-impacting outage per year, but of course for every one of those there are several “near misses.”

This is why we run disaster recovery testing (DiRT) exercises, which Kripa Krishnan discussed here.

While Google did a good job and we learned a lot (which is why we have three production replicas), Amazon did a much better job here and they were five years ahead of everyone else. (Jason McHugh, the architect of Amazon S3 who is now at Facebook, gave this talk at QCon 2009 on failure management at Amazon.)

Q: At Alcoa, workplace incidents need to be reported to the CEO within 24 hours. Is there a similar timeline for escalation to leadership at Google?

A: At Google App Engine we have a very small team (100 engineers worldwide) with only two levels: engineers who do the work, and managers. We used to wake everyone up in the middle of the night for incidents that affected customers. Roughly one in ten of those incidents was escalated to company leadership.
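One way to write such an escalation policy down explicitly is sketched below; the thresholds, roles, and names are illustrative assumptions, not Google's actual policy.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    summary: str
    user_impacting: bool
    minutes_of_impact: int


def escalation_targets(incident: Incident) -> list[str]:
    """Decide who hears about an incident, and how fast.

    Illustrative policy: any user-impacting incident pages the on-call
    rotation at once; long-running ones also go to engineering leadership
    and onto the weekly incident-review agenda.
    """
    targets = ["oncall-primary"]
    if incident.user_impacting:
        targets += ["oncall-secondary", "team-lead"]
    if incident.user_impacting and incident.minutes_of_impact >= 30:
        targets += ["engineering-director", "weekly-incident-review"]
    return targets


print(escalation_targets(Incident("elevated 5xx on frontend", True, 45)))
# ['oncall-primary', 'oncall-secondary', 'team-lead',
#  'engineering-director', 'weekly-incident-review']
```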

Q: How would you describe how swarming happens in practice?

A: As in a Toyota factory, not everyone can be there to solve every problem as it occurs. But culturally, we treat reliability and quality as priority-0 concerns.

This happens in many ways, some of which are less obvious and more subtle than downtime.

If your change breaks the tests, nothing else you are doing matters until you fix it, and you don't let further changes pile up on top of a broken build while more failing tests wait to be dealt with.
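One common way to encode that rule is a submit gate that refuses new changes while the shared continuous build is red, unless the change fixes or rolls back the breakage. The sketch below is illustrative, not Google's actual presubmit system.

```python
def can_submit(change_description: str, continuous_build_green: bool) -> bool:
    """Gate new changes on the health of the shared continuous build.

    Illustrative policy: while the build is red, only changes that fix or
    roll back the breakage may go in; everything else waits.
    """
    if continuous_build_green:
        return True
    is_fix = any(tag in change_description.lower()
                 for tag in ("fixes build", "rollback", "revert"))
    return is_fix


assert can_submit("Add new quota API", continuous_build_green=True)
assert not can_submit("Add new quota API", continuous_build_green=False)
assert can_submit("Revert bad change, fixes build", continuous_build_green=False)
```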

Similarly, if someone else has a problem like that and needs help, you are expected to drop everything and help. Why? Because that is how we prioritize: it's like the Golden Rule. We want to help everyone move forward, and that helps everyone.

Of course, they will do the same to you when you need help.

From a systems perspective, I think of it like a ratchet, the mechanism on a roller coaster's lift hill: it keeps us from sliding backwards.

This is not a formal rule written into a process, but everyone knows that if something is visibly misbehaving in a way that affects users, we raise an alert, send out some emails, and so on.

The message is usually "Hi, everyone, I need your help," and then we go help.

I think the reason it always works is that even without formally stated or listed rules, everyone knows that our job is not just to "write code" but to "run a service."

Even problems in global dependencies (a load balancer, a bad piece of global infrastructure configuration) can often be fixed within seconds, with the incident resolved in 5-10 minutes.
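One reason such incidents can be resolved that quickly is that the fix for a bad global change is usually "put back the last known-good version" rather than a forward fix. Here is a minimal sketch of that idea, with illustrative names.

```python
class ConfigPusher:
    """Keeps every pushed config version so a bad push can be undone in seconds."""

    def __init__(self):
        self._history = []   # list of (version, config) tuples
        self._version = 0

    def push(self, config: dict) -> int:
        self._version += 1
        self._history.append((self._version, dict(config)))
        return self._version

    def live_config(self) -> dict:
        return self._history[-1][1] if self._history else {}

    def rollback_to_last_known_good(self) -> dict:
        """Drop the latest (bad) version and serve the previous one again."""
        if len(self._history) >= 2:
            self._history.pop()
        return self.live_config()


pusher = ConfigPusher()
pusher.push({"lb_weight_us_east": 50, "lb_weight_us_west": 50})
pusher.push({"lb_weight_us_east": 0, "lb_weight_us_west": 0})  # bad push: no traffic served
assert pusher.rollback_to_last_known_good() == {
    "lb_weight_us_east": 50, "lb_weight_us_west": 50}
```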
Capability 3: Disseminate new knowledge throughout the company.

Dr. Spear wrote:

High-velocity companies multiply the power of new knowledge by spreading it throughout the company, not just leaving it with the discoverer. They share not only the conclusions but also the process of discovery: what was learned and how it was learned. While their competitors let problems and their solutions stay where they were found, in high-velocity companies leaders spread both the problem and the discovery across the organization. This means that when people start a piece of work, they bring with them the experience of everyone else in the company. We will see several examples of this multiplier effect.

Q: When problems arise, how does knowledge spread? How do local discoveries translate into global advances?

A: Some of it, though not the most important part, comes from the postmortem documents. By all indications, Google has incidents about as often as any other company. When there is a high-profile outage at Google, you can bet that almost everyone in the company reads the postmortem.

Perhaps the biggest mechanism for preventing future failures is the single codebase shared by all of Google. Because the entire codebase is searchable, it is easy to draw on other people's experience; no matter how formal and consistent the documentation is, it is even better to see what people actually do in practice: "go look at the code."

There is a downside, though. The first person to use a service might pick a more or less arbitrary configuration, and then it spreads wildly through the company. Suddenly, for no good reason, a random setting like "37" is everywhere.

As long as you make knowledge easy to spread and easy to find, it will spread, and the questionable patterns travel just as fast as the good ones.
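The "random 37" problem is the familiar copy-paste anti-pattern: the first configuration someone writes gets cloned without its reasoning. Here is a small illustration, with hypothetical flag names, of the difference between a value that spreads blindly and one that carries its own explanation.

```python
# Anti-pattern: a copied value with no recorded reason. Once this appears in
# one service's config, code search spreads it to every other service.
rpc_timeout_seconds = 37

# Better: name the value and record why it was chosen, so that what spreads
# through code search is the rationale, not just the number.
# Assumed 99th-percentile backend latency plus margin; revisit if the SLO changes.
BACKEND_P99_LATENCY_SECONDS = 30
RPC_TIMEOUT_SECONDS = BACKEND_P99_LATENCY_SECONDS + 7
```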

Q: Besides a single source code repository and blameless postmortems, are there other mechanisms to transform local learning into global improvements? What are other ways to spread knowledge?

A: One of the best things about the Google source code repository is that you can find everything. The best answer to any question is "look at the code."

Second, there is excellent documentation that is just a search away, and there are great internal groups. Just as with an external service, whoever writes a service "foo" sets up a mailing list called "foo-user", and you ask your question there. It's great to be able to reach the developers directly, but most of the time it is other users who answer, much like many successful open-source projects in this industry, by the way.
Capability 4: Lead by developing others.

Dr. Spear wrote:

Managers in high-velocity companies recognize that their regular work includes both delivering products and services and continually improving the way those products and services are delivered. They teach people how to continually improve their own part of the work and give them enough time and resources to do so. This lets the company improve its own reliability and adaptability, and it is the fundamental difference between these companies and their less successful competitors. Managers in high-velocity companies do not command, control, berate, threaten, or evaluate others through an arbitrary set of metrics. Instead, they make sure the company keeps getting better at self-diagnosis and self-improvement, at spotting problems, at solving them, and at multiplying the effect by spreading the solutions throughout the company.

GK: I also love David Marquet's point (he is the author of Turn the Ship Around!): the mark of a true leader is the number of leaders they develop. This former submarine commander developed more future leaders than any other submarine captain in history.

The gist of his argument is this: a heroic leader can fix problems while they are there, but once they leave, the problems come back, because they never built a system that keeps functioning without them.

Q: How did Google’s leadership develop?

A: Google has put in place almost everything you would find in a healthy company. We have two career paths: engineering and management. Anyone with the word "manager" in their job title is primarily responsible for enabling things and encouraging others to lead.

I see my role as creating small teams where everyone matters. Each team is an orchestra, the opposite of a factory - everyone can play solo, but more importantly, they can all work together. We've all had the bad experience of a team where everyone is yelling at each other or nobody is listening to each other.

At Google, I think the biggest influence is the cultural pride in doing important engineering work well. One of the big cultural norms is, "Everyone writes great tests; we don't want to be the team that writes crappy tests." Likewise, having a culture of "we only hire people who participate" was also emotionally important to me.

At Google, some of this is codified in the evaluation and promotion process, which can sound bad, as if people only do the work needed to get promoted. On the other hand, the evaluation process is highly regarded and almost universally seen as fair: people get promoted because they make meaningful contributions and are good at what they do. I have never heard of anyone getting promoted because they cozied up to the right people.

The main criterion for manager and director roles is leadership: whether the person has had a significant impact, and whether that impact reaches well beyond their own team and beyond someone "just doing their own thing."

Google App Engine was started seven years ago by an amazing group of engineers in the cluster-management group who thought, "Hey, we have these technologies for creating scalable systems. Can we write one so that other people can use it?"

The title of "App Engine Founder" is given to people who are well respected by internal employees, such as the founder of Facebook.

Q: How do new managers learn to work this way? If leaders are expected to develop other leaders, how do new or front-line managers come to understand what is expected of them?

A: At Google, you are promoted into the job you are already doing, which is different from most other companies, where you are promoted into the job they hope you can do.

That is, if you want to be a top engineer, then do the work of a top engineer. At Google, like many large companies, there are a lot of training resources.

But in most cases, the cultural norms around how work gets done are so strong that they are the main force keeping those norms going. It works like a self-selecting process that keeps reinforcing both the cultural norms and the technical practices.

Of course, this also reflects the style of the company's top leadership. Google was founded by two engineers, and under their influence this culture keeps reinforcing itself.

If you are in a command-and-control organization whose leaders treat people with contempt, that message spreads and reinforces itself in just the same way.
