We spend a lot of time thinking about stream processing problems, and, even better, we also spend a lot of time helping other people understand stream processing and how to apply it to solve data problems in their organizations.
The first thing we need to do is correct people's misconceptions about stream processing: as a rapidly evolving field, it has accumulated plenty of them. In this article we have selected six as examples, and since we are most familiar with Apache Flink, we will use Flink to illustrate them.

Myth 1: There is no such thing as streaming without batch processing (Lambda architecture)
Myth 2: Latency and throughput: choose one
Myth 3: Micro-batching means better throughput
Myth 4: Exactly once? It's impossible
Myth 5: Streaming can only be used in "real-time" scenarios
Myth 6: Streaming is still complicated no matter what

Myth 1: There is no such thing as streaming without batch processing (Lambda architecture)

The "Lambda architecture" was a useful design pattern in the early days of Apache Storm and other stream processing projects. It consists of a "fast stream layer" and a "batch layer". Two separate layers were used because stream processing in the Lambda architecture could only compute approximate results (that is, if a failure occurred midway, the results were unreliable) and could only handle a relatively small volume of events.

Although early versions of Storm had this problem, many open source stream processing frameworks today are fault-tolerant: they can produce accurate results even when failures occur, and they offer high throughput. So there is no need to maintain a multi-layer architecture just to get results that are both "fast" and "accurate". Today's stream processors (such as Flink) deliver both at once. Fortunately, people have largely stopped discussing the Lambda architecture, which shows that stream processing is maturing.

Myth 2: Latency and throughput: choose one

Early open source stream processing frameworks were either "high-throughput" or "low-latency"; "massive and fast" was never a description that fit any of them. But Flink (and possibly other frameworks) offers both high throughput and low latency. Here is an example of a benchmark result.

Let's dissect this example from the bottom up, starting at the hardware level, together with a stream processing pipeline that is network-bottlenecked (many pipelines built with Flink have this bottleneck). At the hardware level there should be no trade-off to make, so the network is the main factor determining both throughput and latency. A well-designed software system should make full use of what the network allows without introducing bottlenecks of its own. For Flink, too, there is always room for optimization that brings it closer to the performance the hardware can provide. Using a cluster of 10 nodes, Flink can now process *** events per second; expanded to 1,000 nodes, its latency can be reduced to tens of milliseconds. In our view, this is far beyond many existing solutions.

Myth 3: Micro-batching means better throughput

We can look at performance from another angle, but first let's clarify two easily confused concepts:

Micro-batching. Micro-batching is an execution or programming model built on top of traditional batch processing: a process or task treats a stream as a sequence of small batches or chunks of data.

Buffering. Buffering is a technique for optimizing access to networks, disks, and caches. Wikipedia defines a buffer as "a region of memory used to temporarily store data while it is being moved from one place to another."
The third myth is that data processing frameworks based on micro-batching achieve higher throughput than frameworks that process events one at a time, because micro-batches travel more efficiently over the network. This ignores the fact that streaming frameworks do not rely on batching at the programming-model level; they only use buffering at the physical level. Flink does buffer data: it sends groups of processed records over the network rather than sending them one at a time. Not buffering would be undesirable from a performance standpoint, because sending records across the network one by one brings no benefit. So we have to admit that, at the physical level, there is no such thing as one record at a time. However, buffering is purely a physical-level performance optimization: it is invisible in the programming model, and it leaves the record-at-a-time semantics that the user sees untouched.
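To make this concrete, the sketch below shows buffering as nothing more than a tuning knob in Flink. The host, port, and timeout value are illustrative assumptions, and API details can differ slightly between Flink versions; the point is that the program is written record-at-a-time regardless of how the network buffers are flushed.

```java
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class BufferingIsNotMicroBatching {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Physical-level knob only: flush network buffers at least every 5 ms.
        // Smaller values favor latency, larger values favor throughput; the
        // programming model below does not change either way.
        env.setBufferTimeout(5);

        env.socketTextStream("localhost", 9999) // illustrative source
           .map(new MapFunction<String, String>() {
               @Override
               public String map(String line) {
                   return line.toUpperCase(); // still written one record at a time
               }
           })
           .print();

        env.execute("buffering-is-not-micro-batching");
    }
}
```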
So Flink users write programs that process each record individually, and Flink hides the details of using buffers to improve performance. Micro-batching, by contrast, incurs extra task-scheduling overhead, and if the goal is to reduce latency by shrinking the batches, that overhead only grows. Stream processors know how to exploit buffering without paying a task-scheduling cost.

Myth 4: Exactly once? It's impossible

This myth has several facets, which we unpack below.
Let's take a step back: we don't mind the idea of "exactly once" itself. The term originally meant "exactly-once delivery", but it is now used so loosely in stream processing that it has become confusing and has lost its original meaning. The underlying concepts, however, are still very important, and we don't intend to skip them.

To be as precise as possible, we treat "exactly-once state" and "exactly-once delivery" as two different concepts. The way these terms have been used in the past has led to them being conflated: Apache Storm used "at least once" to describe delivery (Storm did not support state), while Apache Samza used "at least once" to describe application state.

Exactly-once state means that the application behaves as if the failure never occurred. For example, suppose we maintain a counter application; after a failure it must neither over-count nor under-count. The term "exactly once" fits because, as far as the application state is concerned, each message was processed exactly once.

Exactly-once delivery means that the receiving end (a system outside the application) sees the processed events after a failure as if the failure had never happened.

No stream processing framework can guarantee exactly-once delivery in every case, but exactly-once state is achievable. Flink achieves exactly-once state without a significant impact on performance, and it can also achieve exactly-once delivery with specific sources and sinks that participate in Flink's checkpointing.

Flink's checkpoints are snapshots of application state that Flink generates periodically and asynchronously. They are why Flink can still guarantee exactly-once state in the event of a failure: Flink regularly records (snapshots) the read positions of the input streams and the relevant state of each operator. If a failure occurs, Flink rolls back to the previous state and restarts the computation. Records are reprocessed, but the result looks as if every record was processed only once.

What about end-to-end exactly-once processing? It is possible, in an appropriate setup, to make checkpoints act as a transaction-coordination mechanism, in other words, to make the source and sink operations participate in the checkpoint. Internally the results are exactly-once, and from an end-to-end perspective they are exactly-once as well, or "close to exactly-once". For example, a pipeline that uses Kafka as the source and a rolling HDFS file sink as the destination is an end-to-end exactly-once pipeline from Kafka to HDFS. Similarly, with Kafka as the source and Cassandra as the sink, end-to-end exactly-once processing can be achieved if the updates to Cassandra are idempotent.

It is worth mentioning that with Flink's savepoints, checkpoints also serve as a state-versioning mechanism. Savepoints let you "move through time" while keeping state consistent, which makes code updates, maintenance, migration, debugging, and all kinds of simulation testing much easier.
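As a minimal sketch of what turning this on looks like (the checkpoint interval is an arbitrary example, and configuration details vary across Flink versions):

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ExactlyOnceState {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Snapshot input read positions and operator state every 10 seconds,
        // asynchronously. On failure, Flink rolls back to the latest snapshot
        // and replays the input, yielding exactly-once *state* semantics.
        env.enableCheckpointing(10_000, CheckpointingMode.EXACTLY_ONCE);

        env.fromElements(1, 2, 3).print(); // placeholder pipeline

        env.execute("exactly-once-state");
    }
}
```

Operationally, a savepoint can be triggered from the command line with `flink savepoint <jobId>` and a job can later be resumed from it with `flink run -s <savepointPath> ...`, which is what enables the "time travel" described above.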
Myth 5: Streaming can only be used in "real-time" scenarios

This myth comes in several flavors, which we address below. Now it's time to think about the relationship between the type of data set and the processing model. First, there are two kinds of data sets:

Unbounded: data that is produced continuously and has no defined end.
Bounded: data that is finite and complete.
Many real-world data sets are unbounded, whether they live in files, in directories on HDFS, or in systems like Kafka. A few examples: interactions of users with mobile or web applications, measurements from sensors, and logs from machines and servers.
In fact, it is difficult to find a truly bounded data set in the real world. The locations of all of a company's buildings are bounded, but even that changes as the business grows.

Second, there are two processing models:

Streaming: processing that runs continuously for as long as data is being produced.
Batch: processing that runs over a finite amount of data in a finite time and releases its resources when done.

As the sketch below shows, the same streaming API can consume either kind of data set.
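For instance, in Flink's DataStream API the same downstream logic can read a bounded file or an unbounded Kafka topic. The file path, topic name, and broker address are illustrative, and the Kafka connector shown (the older FlinkKafkaConsumer from flink-connector-kafka) varies across Flink versions:

```java
import java.util.Properties;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class BoundedVsUnbounded {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Bounded: a finite file; this part of the job ends when the file is exhausted.
        DataStream<String> bounded = env.readTextFile("hdfs:///data/events.log");

        // Unbounded: a Kafka topic; this part keeps running as new events arrive.
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092");
        DataStream<String> unbounded = env.addSource(
                new FlinkKafkaConsumer<>("events", new SimpleStringSchema(), props));

        // The same downstream logic applies to either stream.
        bounded.print();
        unbounded.print();

        env.execute("bounded-vs-unbounded");
    }
}
```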
Let's go a little deeper and distinguish two kinds of unbounded data sets: continuous streams and intermittent streams.

It is possible, though not always the best approach, to use either model for either kind of data set. For example, the batch processing model has long been applied to unbounded data sets, especially intermittent ones. The reality is that most "batch" jobs are launched by a scheduler and process only a small slice of an unbounded data set at a time, which means the unbounded nature of the stream becomes somebody's problem, usually whoever builds the ingestion pipeline.

Batch processing is stateless in the sense that the output depends only on the input. In reality, batch jobs do keep state internally (reducers commonly do), but that state is confined to the boundaries of a single batch; it does not flow between batches. The idea of "state within batch boundaries" becomes important as soon as you try to implement something like event-time windows, the common approach when dealing with unbounded data sets. A batch processor working on an unbounded data set will inevitably encounter late events (due to upstream delays), so the data in a batch may be incomplete. Note that this assumes windows are defined by event timestamps, because event time is the model that most accurately reflects reality. Late data is a problem for batch processing that even simple window shapes (tumbling or sliding windows) cannot fix, and session windows are harder still.

Because the data needed to complete a computation will rarely all sit within a single batch, it is difficult to guarantee correct results when using batches to process unbounded data sets. At the very least, it takes extra machinery to handle late data and to maintain state across batches (either waiting for all the data to arrive before processing, or reprocessing batches).

Flink has a built-in mechanism for handling late data: late events are treated as a normal fact of unbounded real-world data, and the stream processor is designed to deal with them. Stateful stream processors are better suited to unbounded data sets, whether continuously or intermittently generated, and using one brings further benefits on top.
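As a sketch of what that late-data handling looks like in Flink (the session gap, lateness bound, and data are illustrative assumptions):

```java
import java.time.Duration;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.EventTimeSessionWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class LateDataHandling {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // (user, event-time millis, count) -- the last event arrives out of order.
        env.fromElements(
                Tuple3.of("user-1", 1_000L, 1),
                Tuple3.of("user-1", 9_000L, 1),
                Tuple3.of("user-1", 3_000L, 1))
           // Tolerate events up to 10 seconds out of order before closing a window.
           .assignTimestampsAndWatermarks(
                WatermarkStrategy
                    .<Tuple3<String, Long, Integer>>forBoundedOutOfOrderness(Duration.ofSeconds(10))
                    .withTimestampAssigner((event, previous) -> event.f1))
           .keyBy(event -> event.f0)
           // Session windows close after 30 minutes of inactivity per key...
           .window(EventTimeSessionWindows.withGap(Time.minutes(30)))
           // ...and events late by up to one more minute still update the result.
           .allowedLateness(Time.minutes(1))
           .sum(2)
           .print();

        env.execute("late-data-handling");
    }
}
```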
Myth 6: Streaming is still complicated no matter what

You might be thinking: "This is all good in theory, but I still won't use streaming because..."

We are not here to talk you into streams, although we do think they are cool. Whether to use streams depends entirely on the characteristics of your data and your code. Before deciding, ask yourself: "What type of data set am I working with?"
Then ask another question: "Which changes more frequently, the data or the code?"
When the data changes more frequently than the code, for example when a relatively fixed query runs over an ever-changing data set, you have a streaming problem. So before you decide that streaming is "complicated", you may already have been solving streaming problems without realizing it! Perhaps you use hourly batch scheduling, and other people on your team create and manage those batches; in that case your results may be inaccurate without anyone noticing that the batch-boundary and state issues described earlier are the cause.

The Flink community has worked for years on a set of APIs that encapsulate these complexities of time and state. In Flink, handling event time is easy: define a time window and a function that extracts timestamps and watermarks (called only once per stream). Handling state is just as simple, similar to defining Java variables and registering them with Flink. And with Flink's StreamSQL, you can run SQL queries over a continuous stream.
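A minimal sketch of both ideas in the DataStream API, timestamp/watermark extraction plus Flink-managed state (the event shape, field meanings, and out-of-orderness bound are illustrative; the StreamSQL side is not shown):

```java
import java.time.Duration;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class EventTimeAndState {

    // State handled "like a Java variable registered with Flink":
    // a per-key running count kept in managed, checkpointed state.
    static class CountPerKey extends RichFlatMapFunction<Tuple2<String, Long>, Tuple2<String, Long>> {
        private transient ValueState<Long> count;

        @Override
        public void open(Configuration parameters) {
            count = getRuntimeContext().getState(
                    new ValueStateDescriptor<>("count", Long.class));
        }

        @Override
        public void flatMap(Tuple2<String, Long> event, Collector<Tuple2<String, Long>> out) throws Exception {
            long current = count.value() == null ? 0L : count.value();
            count.update(current + 1);
            out.collect(Tuple2.of(event.f0, current + 1));
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromElements(Tuple2.of("user-1", 1_000L), Tuple2.of("user-2", 2_000L))
           // Event time: one timestamp/watermark extractor per stream.
           .assignTimestampsAndWatermarks(
                WatermarkStrategy
                    .<Tuple2<String, Long>>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                    .withTimestampAssigner((event, ts) -> event.f1))
           .keyBy(event -> event.f0)
           .flatMap(new CountPerKey())
           .print();

        env.execute("event-time-and-state");
    }
}
```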
One last point: what about the case where the code changes more frequently than the data? Then you have an exploratory problem, and iterating in a notebook or a similar tool may be the right way to solve it. Once the code stabilizes, though, you will still face a streaming problem, so we recommend adopting a long-term solution to that streaming problem from the beginning.

The future of stream processing

As stream processing matures and these myths fade, we see streaming moving into areas beyond analytical applications. As we have discussed, the real world generates data continuously, and traditional approaches interrupt that continuous flow: the data must be aggregated into a central location or broken into batches before applications can consume it. Stream processing patterns such as CQRS are becoming increasingly popular: applications can be built directly on continuous data streams, keep state locally, isolate applications and teams more cleanly, and handle time-based data better. As Flink continues to evolve and improve and is adopted by more and more companies, we believe it will not only simplify analytical pipelines but also bring us more powerful computing models.

This article was written by Kostas Tzoumas. Original article: Stream Processing Myths Debunked.