"Everything will run in the cloud." Now more and more businesses are shifting from maintaining their own infrastructure to public (or private) clouds. The benefits brought by this are self-explanatory, greatly reducing the operation and maintenance costs of the IaaS layer. For the database layer, in the past, you needed a strong DBA background to handle advanced operations such as elastic expansion and high availability. Now most cloud services basically provide similar services to a greater or lesser extent.
Today's sharing will focus on the architecture behind the cloud database offerings of the more popular cloud providers, as well as some recent industry developments I have observed that are relevant to cloud databases.

Amazon RDS

When it comes to cloud databases on public clouds, the earliest is Amazon's RDS, first released in 2009. The architecture of Amazon RDS is essentially a middle layer built on top of the underlying database (architecturally, Alibaba Cloud RDS, UCloud RDS, and other cloud RDS services are basically the same, competing on feature breadth and implementation details). This middle layer routes the client's SQL requests to the actual database storage nodes. Because the business side's requests pass through this proxy layer, many operation and maintenance tasks can be carried out on the underlying database instances, such as backups or migration to physical machines with larger disks or more idle IO, and since these tasks are hidden behind the middle layer, the business layer is largely unaware of them. Moreover, the routing layer mostly just forwards requests, so the bottom layer can host many kinds of databases: RDS generally supports popular engines such as MySQL, SQL Server, MariaDB, and PostgreSQL with essentially no loss of compatibility, and when the proxy layer is well designed, the performance penalty is small.

The middle layer also isolates the underlying resource pool, which allows a lot to be done for resource utilization and scheduling. For example, less active RDS instances can be packed together onto one physical machine; online capacity expansion only requires creating a replica on a machine with a larger disk and redirecting requests at the proxy layer; and regular data backups can be placed on S3. All of this can be transparent to users.

However, the disadvantage of this architecture is also obvious: it is essentially a single-machine master-slave setup, and it is helpless for workloads that exceed the capacity, CPU, or IO of the largest available physical machine. With the growth of data volume and concurrency in many businesses, especially with the rise of the mobile Internet, scalability beyond a single machine has become a very important requirement. Of course, for the majority of databases with modest data volumes and no extreme concurrency on a single instance, RDS is still a very good fit.

Amazon DynamoDB

Some users are pained enough by the horizontal scaling problem just mentioned that they will even accept giving up the relational model and SQL; for example, some Internet applications have fairly simple business models but enormous concurrency and data volumes. To serve this situation, Amazon developed DynamoDB and released it as a cloud service in early 2012. The Dynamo paper was published at SOSP back in 2007; this historic paper directly triggered the NoSQL movement and showed everyone that databases could be built this way. I covered DynamoDB's model and some of its technical details in my other article, "The Current State of Open Source Databases," so I will not repeat them here. Dynamo's main features are horizontal scalability and high availability through multiple replicas (three copies), and its API supports both eventually consistent reads and strongly consistent reads.
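To make the two read modes concrete, here is a minimal sketch using the boto3 DynamoDB client. The table name and key are hypothetical; the point is only that the consistency level is chosen per request.

```python
import boto3

# Hypothetical table and key, for illustration only.
client = boto3.client("dynamodb", region_name="us-east-1")

# Eventually consistent read (the default): higher throughput and half the
# read-capacity cost, but it may return slightly stale data.
resp = client.get_item(
    TableName="user_profiles",
    Key={"user_id": {"S": "42"}},
)

# Strongly consistent read: reflects every write acknowledged before the read,
# at roughly twice the read-capacity cost.
resp_strong = client.get_item(
    TableName="user_profiles",
    Key={"user_id": {"S": "42"}},
    ConsistentRead=True,
)
```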
Eventually consistent reads can improve read throughput. Note, however, that although DynamoDB offers strongly consistent reads, the "strong consistency" here is not the C of traditional database ACID. Moreover, since there is no global time ordering (only vector clocks), conflict resolution can only be left to the client, and Dynamo does not support transactions. Still, for certain business scenarios, scalability and availability matter most, and not just storage capacity but cluster throughput as well.

Alibaba Cloud DRDS

Meanwhile, the data volume of RDS users keeps growing. Cloud providers cannot just watch these users leave to maintain their own database clusters once their data gets large, because not everyone can completely refactor their code onto NoSQL, and sharding by hand is genuinely painful for business developers; business opportunities often hide in that pain. As examples of scale-out solutions for RDS, I will introduce two typical ones. The first is Alibaba Cloud's DRDS (although it seems to have been removed from Alibaba Cloud's product list?). The idea behind DRDS is very simple: it goes one step further than RDS by adding user-configured routing policies to the RDS middle layer described earlier. For example, users can designate certain columns of a table as sharding keys and have requests routed to specific instances according to configurable rules; they can also configure vertical partitioning strategies. DRDS's predecessor is Taobao's TDDL, but the original TDDL lived in the JDBC layer, whereas the work is now done in the proxy layer (a bit like stuffing TDDL into Cobar). The benefit is that sharding is encapsulated away from the application layer, but it is still essentially a middleware solution: it can achieve a fair degree of SQL compatibility for simple workloads, while support for complex queries, multi-dimensional queries, and cross-shard transactions is limited, since the routing layer's understanding of SQL is shallow. Changing the sharding key, DDL, and backups are also very troublesome. Judging from the implementation and complexity of YouTube's open source middleware Vitess, such a system is not much simpler to build than a database, yet its compatibility is not as good as rewriting a database.
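To make the routing idea concrete, here is a minimal sketch of hash-based routing on a sharding key, the kind of decision a proxy layer makes for each request. The instance addresses, table, and modulo rule are hypothetical; real DRDS/TDDL rules are configured declaratively and are far richer.

```python
import hashlib

# Hypothetical pool of backing RDS instances for a sharded "orders" table.
SHARDS = [
    "mysql://rds-instance-0/orders_db",
    "mysql://rds-instance-1/orders_db",
    "mysql://rds-instance-2/orders_db",
    "mysql://rds-instance-3/orders_db",
]

def route(sharding_key: str) -> str:
    """Map a sharding-key value (e.g. a user_id) to one backing instance."""
    digest = hashlib.md5(sharding_key.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

# All rows with the same sharding key land on the same instance, so
# single-key queries stay on one shard; cross-shard queries, cross-shard
# transactions, and resharding are exactly where middleware gets painful.
print(route("user_42"))
```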
Amazon Aurora

Later, in 2015, Amazon took a different path and released Aurora. There is not much public information about Aurora. It provides five times the read throughput of a single MySQL 5.6 instance, but it can only be expanded to at most 15 replicas, and the more replicas there are, the greater the impact on write throughput, because only one primary instance serves writes. A single instance supports up to 64 TB of storage, with high availability and elastic scaling. Aurora's compatibility is worth highlighting: anyone who works with databases knows that compatibility is a very hard problem, and a small implementation difference can make a user's migration cost enormous, which is why middleware and sharding solutions feel so hostile to users. Most of us are chasing a smooth migration experience, and Aurora takes a different approach to get it.

Since there is not much public information, my guess is that Aurora keeps the MySQL front end and implements a distributed shared storage layer underneath InnoDB (https://www.percona.com/blog/2015/11/16/amazon-aurora-looking-deeper/). This is very friendly to scaling out read instances, spreading the workload evenly across the MySQL instances at the front, somewhat like the shared-everything architecture of Oracle RAC. Compared with the middleware approach, the advantage of this architecture is obvious: it is more compatible because it reuses MySQL's SQL parser and optimizer, so even complex queries from the business layer are fine, since what they talk to is MySQL. For the same reason, however, once there are more nodes and larger data volumes, queries cannot exploit the computing power of the whole cluster (for many complex queries the bottleneck is the CPU), and the SQL optimizer has always been MySQL's weak point. Furthermore, a SQL engine designed for large-scale queries looks completely different from a single-machine one: distributed query engines such as SparkSQL, Presto, and Impala are designed nothing like a single-machine SQL optimizer, and resemble distributed computing frameworks instead. So I see Aurora as a solution that optimizes read performance for simple queries when the data volume is not too large (there is a capacity ceiling), with compatibility far better than middleware solutions, but still relatively weak for very large data volumes and complex queries. In addition, Aurora does not do much for write performance (writes go through a single point); if writes become the bottleneck, the business layer still has to split horizontally or vertically.

Google Cloud BigTable

Google, the ancestor of big data, has missed many opportunities around the cloud. It missed virtualization, which was taken by VMware and Docker (Google had been working on container technology for ten years; the cgroups patches that containers depend on were contributed by Google). It missed cloud services, which were taken by Amazon (a pity for Google App Engine). It missed big data storage, which was taken by the open source Hadoop, now the de facto standard. I suspect the decision to make the Google Cloud Bigtable service compatible with the HBase API was heartbreaking for the engineers who had to implement those Hadoop APIs on top of BigTable. :) Still, prodded by Amazon, Docker, and Hadoop, Google finally recognized the power of the community and of cloud computing and began exporting its powerful internal infrastructure to Google Cloud; BigTable was officially launched on Google Cloud Platform in 2015.

I believe most distributed storage engineers are familiar with BigTable's architecture; the BigTable paper is a must-read classic alongside Amazon's Dynamo, so I will not go into detail. The API of the BigTable cloud service is HBase-compatible, so the model is still a key mapped to a two-dimensional table structure. Since serving is still single-master at the tablet-server level, reads and writes of a tablet by default go through that tablet's primary server, which makes BigTable a strongly consistent system. The strong consistency here refers to writes of a single key: once the server acknowledges a write, subsequent reads return the latest value. Since BigTable does not support ACID transactions, this strong consistency applies only to single-key operations.
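As a sketch of what that single-row model looks like through an HBase-style client, here is an example using the happybase library; the endpoint, table, and column family are hypothetical, and Cloud Bigtable itself is normally reached through the HBase client or Google's own client libraries.

```python
import happybase

# Hypothetical HBase-compatible endpoint and table, for illustration only.
connection = happybase.Connection("hbase-or-bigtable-proxy.example.com")
table = connection.table("user_events")

# All columns written under one row key are applied atomically: single-row
# (single-key) writes are strongly consistent, but there are no multi-row
# ACID transactions.
table.put(
    b"user#42",
    {
        b"profile:name": b"Alice",
        b"profile:last_login": b"2017-01-01T00:00:00Z",
    },
)

# A subsequent read of the same row sees the acknowledged write.
row = table.row(b"user#42")
print(row[b"profile:name"])
```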
As for horizontal scalability, BigTable has essentially no limit; the documentation boasts of "incredible scalability." What BigTable does not provide is cross-data-center (cross-zone) high availability or cross-zone access: a BigTable cluster lives within a single data center. This reflects BigTable's positioning inside Google as a high-performance, low-latency distributed storage service. If cross-zone high availability is needed, the business layer has to replicate the data itself and keep two zones in sync to maintain a mirrored BigTable cluster; many Google services ran exactly this way before Megastore and Spanner appeared. Cross-data-center high availability, strong consistency, and low latency all at once is simply not something BigTable offers, nor is it meant to.

One more thing worth grumbling about: the BigTable team published a blog post (https://cloudplatform.googleblog.com/2015/05/introducing-Google-Cloud-Bigtable.html) that criticizes HBase's latency rather harshly, citing a 99th-percentile response latency of 6 ms for BigTable versus 280 ms for HBase. In fact, the gap in average response latency is not that large... BigTable is written in C++, so its advantage is that latency is very stable; but as far as I know, the HBase community is also doing a lot of work to minimize the impact of GC, and once the off-heap optimization lands, HBase's latency profile will look better.

Google Cloud Datastore

In 2011, Google published the Megastore paper, the first description of a distributed storage system offering cross-data-center high availability, horizontal scalability, and ACID transaction semantics. Megastore is built on BigTable, with data centers kept in sync through Paxos. Data is sharded by Entity Group, and each Entity Group is itself replicated across data centers with Paxos; ACID transactions across Entity Groups require two-phase commit, on top of timestamp-based MVCC. However, because timestamp allocation has to go through Paxos, and 2PC communication between Entity Groups is done asynchronously through a queue, Megastore's actual 2PC latency is quite high; the paper mentions that the average latency of most write requests is around 100–400 ms. According to friends inside Google, Megastore is painfully slow in practice, and latencies of several seconds are common... Even so, as Google's first distributed database supporting ACID transactions and SQL, a great many applications still run on Megastore, mainly because writing programs with SQL and transactions is just so much easier.

Why spend so much time on Megastore? Because the backend of Google Cloud Datastore is Megastore... Cloud Datastore actually launched inside Google App Engine back in 2011, as what was then called the High Replication Datastore; it has since been renamed Cloud Datastore. Few people realized that behind it sits the famous Megastore. Its feature list looks powerful: high availability, ACID transactions, and SQL (well, Google's simplified dialect, GQL). But given how Megastore works, its latency is high. In addition, the interface Cloud Datastore provides is a set of ORM-like SDKs, which is still somewhat invasive to the application.
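To show what "ORM-like SDK" means in practice, here is a sketch in the style of the App Engine-era Python ndb library; the model and its fields are hypothetical. The point is that data access is expressed through SDK-specific model classes rather than plain SQL, which ties application code to the Datastore API.

```python
from google.appengine.ext import ndb  # App Engine-era SDK, shown for illustration

# A hypothetical entity model: schema and queries are expressed through the
# SDK's model classes, which couples application code to the Datastore.
class Account(ndb.Model):
    username = ndb.StringProperty(required=True)
    balance = ndb.IntegerProperty(default=0)
    created = ndb.DateTimeProperty(auto_now_add=True)

# Writes and queries go through the ORM-like interface rather than SQL.
account = Account(username="alice")
key = account.put()

recent = Account.query(Account.balance > 0).order(-Account.created).fetch(10)
```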
Google Spanner

Megastore may be slow, but it is too useful to give up: the Spanner paper mentions that in 2012 there were probably more than 300 applications running on Megastore. As more and more teams kept reinventing ACID transaction layers on top of BigTable, Google could no longer stand it and set out to build one big wheel for everyone: Spanner. The project is ambitious. Like Megastore, it aims for ACID transactions plus horizontal scalability plus SQL support; unlike Megastore, Spanner does not build a transaction layer on top of BigTable, but instead builds Paxos-replicated tablets directly on Colossus, Google's second-generation distributed file system. Also unlike Megastore, which determines transaction timestamps through Paxos among various coordinators, Spanner brings in hardware: the TrueTime API, backed by GPS receivers and atomic clocks. With TrueTime, transactions initiated in different data centers do not need to coordinate timestamps across data centers; timestamps are assigned directly through the local data center's TrueTime API, which greatly reduces latency.

Spanner is close to a universal distributed storage system and is complementary to BigTable inside Google: if you need cross-data-center high availability, strong consistency, and transactions, use Spanner, at the cost of some latency (though far less than Megastore); if you need raw performance and low latency, use BigTable. Spanner is not yet available on Google Cloud Platform, but its arrival looks certain, at least as the next generation of Cloud Datastore. On the other hand, Google still cannot open source Spanner, for the same reason as BigTable: the lower layers depend on Colossus and a pile of Google-internal components, and TrueTime is even harder to ship than BigTable because it is hardware... So after the Spanner paper was published at the end of 2012, the community produced open source implementations, the more mature ones being TiDB and CockroachDB, which I will come back to when discussing community cloud database implementations. Spanner's interface is a bit richer than BigTable's, supporting what it calls a semi-relational table structure, with DDL much like a relational database; although the primary key of each row still has to be specified, it is a big step up from a plain KV interface.
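Before moving on, here is a purely illustrative sketch of the commit-wait idea described in the Spanner paper. The interval type and the uncertainty bound are stand-ins; the real TrueTime is a Google-internal API backed by GPS and atomic clocks, not something you can import.

```python
import time
from dataclasses import dataclass

@dataclass
class TTInterval:
    earliest: float  # lower bound on the true physical time
    latest: float    # upper bound on the true physical time

def tt_now(uncertainty: float = 0.004) -> TTInterval:
    """Pretend TrueTime.now(): the current time with a bounded uncertainty (~4 ms)."""
    now = time.time()
    return TTInterval(now - uncertainty, now + uncertainty)

def commit_timestamp() -> float:
    # Pick the commit timestamp at the upper bound of the local interval...
    s = tt_now().latest
    # ...then "commit wait": block until s is guaranteed to be in the past on
    # every clock, so timestamp order matches real-time order without any
    # cross-data-center coordination.
    while tt_now().earliest <= s:
        time.sleep(0.001)
    return s  # the transaction's writes become visible at timestamp s
```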
Google F1

At the same time as the Spanner project started, Google launched another project, the distributed SQL engine F1, to be used together with Spanner. With a strongly consistent, high-performance Spanner at the bottom, Google could try to bridge OLTP and part of OLAP in the layer above. F1 is indeed a database, as the title of its paper says, but it does not store data itself: all data lives in Spanner, and F1 is a distributed query engine that relies on the transaction interface Spanner exposes underneath and translates users' SQL into distributed execution plans. F1 opens up a possibility never really realized in other databases: unifying OLTP and OLAP. That is because F1 was built for Google's advertising system. Advertising has very strict consistency requirements under heavy load, a typical OLTP scenario; at the same time there are many complex queries for evaluating advertising effectiveness, and the more real-time those queries are the better, which looks a lot like real-time OLAP.

The traditional approach is for the OLTP database to periodically sync a copy of the data into a data warehouse where offline computation happens; a slightly better approach is to use a streaming computation framework for real-time processing. The data warehouse route has poor real-time performance, and moving the data around is a hassle. The streaming route is inflexible: much of the query logic has to be written in advance, many ad-hoc analyses simply cannot be done, and because the two sides use heterogeneous storage, ETL is another headache. F1, by contrast, relies on Spanner's ACID transactions and MVCC to be a fully capable OLTP database, and as a distributed SQL engine it can use the cluster's compute resources for distributed OLAP queries. The benefit is that there is no need to stand up a separate data warehouse for analysis; real-time analysis runs directly against the same database. Moreover, thanks to the lock-free snapshot reads enabled by Spanner's MVCC and multiple replicas, such OLAP queries do not disturb normal OLTP traffic. OLTP is usually bottlenecked on IO, while OLAP is usually bottlenecked on CPU; combining them looks like a way to raise the resource utilization of the whole cluster. This is why I am optimistic about the F1 + Spanner combination: future databases may absorb the data warehouse and offer a more complete, more real-time experience. (Incidentally, the GFS shown at the bottom of the architecture diagram is not quite accurate; today it should be Colossus.)

Open source cloud-native databases

In 2016, a new term suddenly became popular in Silicon Valley: GIFEE, Google Infrastructure For Everyone Else. People realized that, with the flourishing of a new generation of open source infrastructure software, much of Google's internal stack now has high-quality open source counterparts: containers have Docker; for scheduling there is Kubernetes, the second generation of Borg that Google itself open sourced; and for BigTable and GFS the community has Hadoop, crude but usable (many large companies that find Hadoop too crude have simply reinvented those wheels themselves...). And Google has lately become an enthusiastic open sourcer in its own right: beyond Kubernetes, from the wildly popular TensorFlow to the less prominent but, in my view, significant Apache Beam (the basis of Google Cloud Dataflow), essentially anything that can be open sourced independently is actively embracing the community. As a result the gap between the community and Google keeps narrowing. For now, though, everything else is within reach except Spanner and F1, which are not so easy to rebuild: even setting aside the TrueTime hardware, implementing a stable Multi-Paxos is hard, a distributed SQL optimizer is a high technical barrier of its own, and even once built, the complexity of testing such a system is no less than the complexity of implementing it (see PingCAP's various talks on its distributed testing philosophy).
Globally, I think only two teams in the open source world currently have the technical capability and the vision to build an open source implementation of Spanner: PingCAP's TiDB and Cockroach Labs' CockroachDB. TiDB is currently at RC1 with many users already running it in production; it is slightly more mature than CockroachDB, and its architecture is closer to the orthodox F1-on-Spanner layering. CockroachDB is slightly behind in maturity; it chose the PostgreSQL wire protocol, while TiDB chose MySQL protocol compatibility. Moreover, TiDB's sub-project TiKV shows the outline of a new generation of distributed KV store: RocksDB plus Multi-Raft, providing horizontal scalability without relying on a third-party distributed file system (DFS), an approach that is becoming the standard architecture for distributed KV storage. I am also very happy to see an open source project of this caliber initiated and maintained by a Chinese team; even by Silicon Valley standards the design and implementation are first-rate, and judging from the GitHub activity, the tooling, and the way the community is run, you would hardly guess it is a domestic team.

Kubernetes + Operator

I just used the phrase cloud-native. There is no precise definition of the term, but my understanding is that application developers are insulated from the physical infrastructure: the business layer does not need to worry about storage capacity or performance, everything scales out transparently, and the cluster is highly automated, even self-healing. At large scale, human intervention simply does not work: in a distributed system with thousands of nodes, node failures of one kind or another happen almost daily, the network jitters, occasionally an entire data center goes down, and migrating and recovering data by hand is out of the question.

Many people are bullish on Docker for changing how software is deployed and operated, but I think Kubernetes matters more. The scheduler is the core of a cloud-native architecture; the container is just a vehicle and not that important. Kubernetes is, in effect, a distributed operating system whose physical layer is the entire data center, in other words a DCOS. This is why we are betting heavily on Kubernetes, and I believe large-scale distributed databases will be inseparable from DCOS in the future.

However, stateful services are a headache for Kubernetes orchestration. A typical distributed storage system not only keeps data on every node, it also has to scale out and in on demand, perform rolling upgrades without interrupting service, rebalance when the data distribution becomes skewed, keep multiple replicas of every node's data for high availability, and automatically restore the replica count when a node fails. All of this is very challenging to orchestrate on Kubernetes. Kubernetes introduced PetSet in version 1.3, since renamed StatefulSet; the core idea is to give each Pod a stable identity and to establish and maintain the binding between a Pod and its storage.
That way, wherever the Pod is scheduled, the corresponding Persistent Volume can be bound to it. However, this does not fully solve our problem: PetSet still relies on Persistent Volumes, and at the moment Kubernetes' Persistent Volumes only offer implementations based on shared storage, distributed file systems, or NFS, with no support yet for local storage. Besides, PetSet itself is still in alpha, so we are watching and waiting.

Beyond the official Kubernetes community, though, others are experimenting. We were delighted to see CoreOS recently propose a new way of extending Kubernetes: a new member called the Operator. An Operator is essentially an extension of a Controller; for space reasons I will not go into the implementation details, but in short it is a way to have Kubernetes manage stateful storage services. CoreOS provides an official Operator for etcd cluster backup and rolling upgrades, and we are building an Operator for TiDB; if you are interested, follow our GitHub and our WeChat official account for the latest progress.

About the Author

Huang Dongxu, co-founder and CTO of PingCAP, is a senior infrastructure engineer specializing in the design and implementation of distributed storage systems. An open source enthusiast, he is the author of the well-known open source distributed cache service Codis and has his own distinct views on open source culture and building technology communities.