Exclusive interview with Yan Zhitao, Vice President of R&D at TalkingData: Uncovering the secrets of big data

In the era of big data, the value of data is self-evident. What is truly valuable, however, is not the data itself but the insight obtained by analyzing and mining it. For the Internet companies that have sprung up in recent years, once the data is large and comprehensive enough, they can even draw a data portrait of each user. Almost every industry is now talking about big data, and as traffic on mobile devices overtakes traffic on PCs, mobile Internet data has become a part of big data that cannot be ignored.

On the eve of the 2014 Spark Asia Pacific Summit, to be held on December 6, 2014, TalkingData R&D Vice President Yan Zhitao walked us through the secrets of mobile Internet data.

Mobile Internet data is relatively fragmented. Yan Zhitao believes that "the data is mainly divided into four categories: device information, application behavior information, location information, and sensor information."

Application behavior information reflects the user's habits to a certain extent. Location information pinpoints where the user is, which is of great significance to the O2O model. With the explosion and spread of smart hardware, sensor data is becoming more important. A smart hardware CEO I interviewed earlier said that sensors are like the brain of smart hardware. Collecting, analyzing, and mining the data that smart hardware produces is therefore the key to whether it can be truly intelligent.

Yan Zhitao said: "Compared with software, the information from smart hardware is more fragmented. Unlike mobile applications, which ride on mobile phones, smart hardware is not easy to popularize. In other words, each piece of smart hardware covers only a small group of users, and the real significance lies in collecting and integrating the data of each small group so that it delivers the greatest value."

In my opinion, however, smart hardware is still expensive because of chip prices, and it has no killer applications yet. It will take some time to replace traditional hardware devices. Meanwhile, as smartphones spread and their hardware improves, killer mobile applications and popular mobile games are appearing frequently.

"From the data point of view, e-commerce applications such as Taobao, JD.com, and Vipshop have a large user base, and tool applications such as 360 Mobile Assistant and Wifi Key also have a solid user base because of the value they offer. At present, some social and casual mobile games also have a relatively large user base," Yan Zhitao told reporters.

TalkingData Analytics was launched in 2012. In just two years, Vipshop, Didi Dache, Jumei, and Qunar all became its users, and it now covers more than 800 million mobile devices.

So which frameworks do they choose to analyze and mine such massive amounts of data?

Yan Zhitao told the reporter: "We now process several terabytes of data every day, split into two lines: offline and real-time. For offline processing, we initially chose the typical Hadoop ecosystem, which ensures eventual data consistency through jobs that run every hour or every few hours. For real-time processing, because of users' special needs, we use Redis to implement our real-time statistics. As the business developed, we built the TD2.0 platform, which improves on the offline line by delivering quasi-real-time data presentation through small-batch calculations. The offline system has gradually switched to a data processing platform based on Spark."
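The small-batch idea can be illustrated with a short Python sketch. It is not TalkingData's code: the event format (a timestamp in seconds plus an application ID), the 60-second window, and the name micro_batch_counts are all assumptions made for illustration. Events are grouped into fixed time windows, and each window is aggregated as one small batch, so results become available shortly after a window closes rather than after an hourly job.

```python
from collections import Counter, defaultdict


def micro_batch_counts(events, window_seconds=60):
    """Group (timestamp, app_id) events into fixed windows and count events per app.

    Hypothetical helper: each window stands in for one small batch.
    """
    windows = defaultdict(Counter)
    for ts, app_id in events:
        # Align the timestamp to the start of its window.
        window_start = ts - (ts % window_seconds)
        windows[window_start][app_id] += 1
    # Return plain dicts, ordered by window start time.
    return {start: dict(counts) for start, counts in sorted(windows.items())}


# Made-up sample events: (seconds since start, application ID).
events = [(0, "app_a"), (15, "app_b"), (59, "app_a"), (60, "app_a"), (130, "app_b")]
result = micro_batch_counts(events)
for window_start, counts in result.items():
    print(window_start, counts)
```

A shorter window brings the numbers closer to real time at the cost of more frequent, smaller computations.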

In fact, when Spark first came to prominence around 2012, it attracted attention because of its convenient support for iterative operations and its friendliness to machine learning. Yan Zhitao also mentioned: "It was TalkingData's algorithm engineers who first used Spark for iterative operations, and then we migrated the platform business to it. Compared with Hadoop, Spark handles iterative operations better and can serve latency-sensitive requests in a timely manner. Most importantly, its ecosystem suits the current needs of big data analysis better than Hadoop's does."

On fault tolerance and computing efficiency, Yan Zhitao said: "From my personal experience, Spark is better than Hadoop in some respects. Hadoop depends heavily on IO: every shuffle goes through IO, with data written to disk and then read back in, so the machine's computing power is poorly utilized. Spark's RDD model effectively reduces the dependence on IO and makes full use of memory, thereby improving performance."
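The effect Yan describes can be seen in a toy Python sketch. It is not Spark or Hadoop code: the data file, the load_points and iterate_mean helpers, and the update rule are assumptions chosen only to count disk reads. An iterative job that reloads its input on every pass touches the disk once per iteration, while one that keeps the input in memory touches it once in total. In Spark the analogous step is persisting an RDD in memory, for example with cache().

```python
import os
import tempfile


def load_points(path, stats):
    """Read one number per line from disk and record that a disk read happened."""
    stats["disk_reads"] += 1
    with open(path) as f:
        return [float(line) for line in f]


def iterate_mean(get_points, iterations):
    """Toy iterative job: each pass moves an estimate halfway toward the data mean."""
    estimate = 0.0
    for _ in range(iterations):
        points = get_points()  # one full pass over the data per iteration
        estimate += 0.5 * (sum(points) / len(points) - estimate)
    return estimate


# Write a small, made-up data set to a temporary file.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("\n".join(str(x) for x in [1.0, 2.0, 3.0, 4.0]))
    path = f.name

# Disk-bound style: reload the input from disk on every iteration.
disk_stats = {"disk_reads": 0}
disk_result = iterate_mean(lambda: load_points(path, disk_stats), iterations=5)

# In-memory style: load once, then reuse the cached list on every iteration.
cache_stats = {"disk_reads": 0}
cached = load_points(path, cache_stats)
cache_result = iterate_mean(lambda: cached, iterations=5)

os.remove(path)
print(disk_stats["disk_reads"], cache_stats["disk_reads"])
```

Both variants compute the same estimate; only the number of trips to disk differs, and that gap grows with the number of iterations.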

Java programmers in China, however, need to learn Scala to use Spark. Spark has its problems, but Internet companies are born to solve problems.

Both Hadoop and Spark are open source technologies, and neither is inherently superior. Enterprises and developers should play to the strengths of each. In fact, some communities and forums are calling for the integration of Hadoop and Spark.

Yan Zhitao also believes that the two are currently converging. Some of TalkingData's real-time computing needs and requests are now met with Spark, and parts of its Hadoop-based ecosystem are also migrating to Spark.

Spark has not been around in China as long as Hadoop, but companies are paying more and more attention to it. Yan Zhitao said: "There is a community in China called Spark Meetup. We take part in every session, and more and more people are joining. Giants like Baidu, JD.com, and Tencent are all working on Spark and paying more and more attention to it. After all, Hadoop is somewhat older than Spark and is not suitable for some scenarios. Spark is becoming more and more popular in China, and its development will keep getting better."

However, as an emerging technology, Spark is bound to have some shortcomings. Setting the technology itself aside, developers in China work in a Chinese-language environment, and although some enthusiasts write blogs and produce translations, Chinese-language materials are still in short supply. More effort therefore needs to be invested in building up Spark in China.

Everyone is talking about getting rid of IOE (IBM, Oracle, and EMC). Many core members of TalkingData's technical team come from IBM and Oracle, two companies with different attitudes toward open source. Yan Zhitao said: "Although most of our members come from traditional software companies such as IBM and Oracle, they now work at an Internet company and use Internet development methods. IBM is much more open to open source than Oracle. We mainly use open source. Although the Apache-licensed software we use does not force us to contribute back, we will do so once we think our work is good enough. I will also ask our engineers to put their code into the open source community to improve code quality. Next year, more people on our team will be active in the open source community."

Every open source technology is the crystallized wisdom of a great many people, and Spark is no exception. But the state of open source in China is not encouraging, and could even be called half-dead. China has also been criticized abroad for only taking from open source without giving back.

Yan Zhitao told the reporter: "It is true that we did not do well in open source in the past, but now companies such as Taobao and Tencent have open sourced some of their technologies. I believe that more domestic companies will gradually give back to open source. From what I know of the Spark community, there are many very active contributors from China, and I believe there will be more and more contributions in the future."

He also hopes to release the team's product once it is more mature, because that will make it more valuable. A product with little value would end up half-dead or worthless as an open source project. So at this stage the team is working hard to improve the product, and expects to turn it into an open source project in 2015.

Smart hardware arrived with a halo of promise that it would change lives, and big data is what will keep it from falling off its pedestal. Both Spark and Hadoop need to adapt to current requirements. The sensible approach is to draw on the strengths of each, offset their weaknesses, and choose whichever fits the job best.
