2016 Big Data Map

On February 15 it was reported that, in a technology start-up industry that loves the new and tires of the old, the three-year-old term "big data" already sounds passé. Although Hadoop came out in 2006, the concept of "big data" really took hold between 2011 and 2014. It was during this period, at least in the eyes of the media and the pundits, that "big data" became the new "gold" or "oil". However, at least in my conversations with people in the industry, there is a growing feeling that the technology has stalled to some extent. 2015 may well have been the year when the cool kids of the data world shifted their interest and began to obsess over AI, machine intelligence, deep learning and many other related concepts.

The inevitable hype cycle aside, our "big data map" has now entered its fourth year, and it is worth taking a step back to reflect on what happened last year and on what the future of this industry might look like.

So, in 2016, is big data still a "thing"? Let's discuss.

Enterprise technology = hard work

The interesting thing about big data is that it is no longer a likely candidate for the kind of hype it once attracted.

The products and services that still draw broad interest after a hype cycle are usually the ones the public can touch, see or relate to: mobile applications, social networks, wearables, virtual reality and so on.

Big data, however, is essentially plumbing. Of course, big data powers many consumer and business user experiences, but at its core it is enterprise technology: databases, analytics and the like, things that run in the back end where few people ever see them. And as anyone who works in that world knows, enterprises do not adopt new technology overnight.

In its early days, the big data phenomenon was driven mainly by a symbiotic relationship with a handful of large Internet companies (notably Google, Facebook and Twitter), which were both heavy users of core big data technologies and the creators of those technologies. When these companies were suddenly confronted with data of unprecedented scale, they had no traditional (and expensive) infrastructure to fall back on and no way to recruit enough of the best engineers, so they had to develop the required technology themselves. Later, as the open source movement gathered pace, a large number of these new technologies began to be shared more widely. Then some engineers at the major Internet companies left to start their own big data companies. Other "digital native" companies, including the emerging unicorns, began to face needs similar to those of the big Internet companies; having no legacy infrastructure of their own, they naturally became early adopters of big data technologies. Those early successes led to more entrepreneurial activity and more VC funding, and big data took off.

After several years of rapid development, we now face a broader but also harder opportunity: getting a much larger group of enterprises, from mid-sized firms to multinationals, to adopt big data technology. Unlike the "digital natives", these companies do not have the advantage of starting from scratch, and they have far more to lose: in most of them, the existing technology infrastructure does its job. Of course, that infrastructure may be far from perfect, and many people inside those organizations realize that it is better to modernize legacy systems sooner rather than later, but they are not going to replace the systems their key business runs on overnight. Any such revolution requires process, budget, project management, pilots, partial deployments and full security audits. Large enterprises are understandably cautious about letting young start-ups handle critical parts of their infrastructure. What's more, to the despair of entrepreneurs, many (most?) enterprises still stubbornly refuse to move their data to the cloud (at least the public cloud).

Another key thing to understand is that big data success does not come from getting one piece of technology right (such as Hadoop); it comes from assembling a chain of technologies, people and processes. You have to capture data, store it, clean it, query it, analyze it and visualize it. Some of this is done by products, some of it by people, and everything has to be integrated seamlessly. Finally, for all of this to work, the whole company, from top to bottom, needs to build a data-driven culture, so that big data is not merely one "thing" but the (key) "thing".
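To make that chain concrete, here is a minimal, purely illustrative Python sketch of those steps on a single machine, using pandas and matplotlib. The file name, column names and thresholds are hypothetical; in a real enterprise each step would be backed by dedicated infrastructure (ingestion pipelines, a data warehouse, BI tools) and its own people.

```python
# Toy end-to-end sketch of the capture -> store -> clean -> query ->
# analyze -> visualize chain described above. All names are illustrative.
import pandas as pd
import matplotlib.pyplot as plt

# Capture / store: load raw event data (here, a hypothetical CSV export).
raw = pd.read_csv("events.csv", parse_dates=["timestamp"])

# Clean: drop incomplete rows and obviously invalid values.
clean = raw.dropna(subset=["user_id", "revenue"])
clean = clean[clean["revenue"] >= 0]

# Query / analyze: aggregate revenue per day.
daily = clean.groupby(clean["timestamp"].dt.date)["revenue"].sum()

# Visualize: a simple chart a business user could read.
daily.plot(kind="bar", title="Daily revenue")
plt.tight_layout()
plt.savefig("daily_revenue.png")
```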

In other words: there is a lot of hard work to do.

Deployment phase

This is why, after several years of headline-grabbing start-up launches and frenzied VC funding rounds, we are now entering the deployment phase, and the early maturity phase, of big data.

The next wave of large companies (the "early majority" of the traditional technology adoption cycle) has mostly taken a wait-and-see attitude towards big data technology, and is still confused by the whole thing. Until recently, these companies were still hoping that a single large vendor (IBM, for example) would offer a one-stop solution, but it now looks like that will not happen any time soon. They view this big data landscape with apprehension, wondering whether they really have to work with a crowd of start-ups that all look alike and patch together their own solutions.

The ecosystem is maturing

With continued entrepreneurial activity and inflows of capital into the field, a modest number of exits, and increasingly active technology giants (notably Amazon, Google and IBM), the number of companies in this space keeps growing, and the result is the 2016 edition of the big data map.

Obviously the chart is crowded, and there are many companies that could not be included (see the notes for our methodology).

In terms of the basic trend, the action (innovation, new products and new companies) has slowly been shifting from left to right: from the infrastructure layer (the world of developers and engineers) towards the analytics layer (the world of data scientists and analysts) and even the application layer (the world of business users and consumers), where "big data native applications" are emerging rapidly, more or less in line with what we originally expected.

Big data infrastructure: there is still a lot of innovation

It has been 10 years since Google's papers on MapReduce and BigTable prompted Doug Cutting and Mike Cafarella to create Hadoop. Over that period, the big data infrastructure layer has gradually matured and some key problems have been solved.
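For a flavor of the MapReduce programming model those papers described (and that Spark is contrasted with below), here is a sketch of the classic word-count job written as two Hadoop Streaming scripts in Python. This is illustrative only; the exact streaming jar invocation varies by Hadoop distribution.

```python
# mapper.py -- a minimal Hadoop Streaming mapper for word count.
# Reads raw text lines from stdin and emits "word<TAB>1" pairs.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
```

```python
# reducer.py -- sums the counts for each word. Hadoop Streaming delivers
# the mapper output to the reducer grouped and sorted by key.
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t", 1)
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{count}")
        current_word, count = word, 0
    count += int(value)

if current_word is not None:
    print(f"{current_word}\t{count}")
```

Even for a trivial job, the model forces you to think in key/value pairs split across separate mapper and reducer programs, which is part of why Spark, discussed below, is described as easier to program.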

Innovation at the infrastructure layer is still very much alive, however, thanks in large part to the sheer scale of open source activity.

2015 was undoubtedly the year of Apache Spark. Since we released the previous version of the big data map, this open source framework built around in-memory processing has generated a great deal of discussion. Spark has since been embraced by all sorts of players, from IBM to Cloudera, which has earned it considerable credibility. The emergence of Spark matters because it addresses some of the key problems that had slowed Hadoop adoption: Spark is much faster (benchmarks show it running 10 to 100 times faster than Hadoop's MapReduce), easier to program, and a good fit for machine learning.
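As an illustration of the "easier to program" point, here is a minimal PySpark sketch of the same word count as the Streaming scripts above, with the intermediate dataset cached in memory; the input path and application name are hypothetical.

```python
# A minimal PySpark sketch: word count in a few lines, with the
# intermediate RDD cached in memory so later computations can reuse it
# without re-reading the input from disk.
from pyspark import SparkContext

sc = SparkContext(appName="wordcount-sketch")

lines = sc.textFile("hdfs:///data/input.txt")
words = lines.flatMap(lambda line: line.split()).cache()  # keep in memory

counts = (words.map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b))

# Reusing the cached RDD avoids another pass over the raw input.
total_words = words.count()

for word, n in counts.take(10):
    print(word, n)

sc.stop()
```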

Beyond Spark, there are other exciting frameworks, such as Flink, Ignite, Samza and Kudu, all of which are building good momentum. Some thought leaders believe that the emergence of Mesos (a data center resource management system that lets you program against the data center as if it were a single large pool of computing resources) has also stimulated demand for Hadoop.

Even in the database world there seem to be more and more new players, perhaps more than the market can bear. Plenty of exciting things are happening here, from the maturing of graph databases (such as Neo4j), to the launch of specialized databases (such as the time series database InfluxDB), to the emergence of CockroachDB (a new database inspired by Google Spanner that aims to combine the strengths of SQL and NoSQL). Data warehouses are evolving too (for example the cloud data warehouse Snowflake).
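For a flavor of why graph databases are interesting for relationship-heavy data, here is a minimal sketch using the official Neo4j Python driver; the connection URI, credentials, labels and properties are all hypothetical.

```python
# Minimal sketch of querying a graph database (Neo4j) from Python.
# Relationship-heavy questions ("who do my customers know?") are expressed
# directly as graph patterns rather than as multi-way SQL joins.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "password"))

query = """
MATCH (c:Customer)-[:KNOWS]->(friend:Customer)
WHERE c.name = $name
RETURN friend.name AS friend_name
"""

with driver.session() as session:
    for record in session.run(query, name="Alice"):
        print(record["friend_name"])

driver.close()
```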