
From the infrastructure layer to the application layer: the skills data professionals need

Following the stages of a data application pipeline, I will walk through the skills data professionals need, from the underlying platform to the final application.

1. Big data platform

Right now it is fashionable to build data platforms around Hadoop, Hive, Spark, Kylin, Druid, Beam, and all sorts of shiny new technologies. The prerequisite is a solid grasp of Java, since many of these platforms are written in it.

Many enterprises have already started collecting data. For traditional business data, traditional databases are sufficient. But for user behavior and click streams, or for unstructured data such as text and images, the volume is so large that many companies simply do not know how to store it.

What needs to be solved here is how to build real-time, near-real-time, and offline big data frameworks; how to couple and decouple data streams; and how to handle disaster recovery, platform stability, and availability.

My feeling is that over the past two or three years this kind of talent has remained scarce. The big data concept has been hyped so much that many enterprises have been talked into entering the industry, and the first prerequisite for entry is storing the data, especially user behavior data. The benefit to the business is obvious: if you can profile users well, it helps product design, marketing, and market expansion. At this stage, many companies are still on step one: store more data. This is also why turnover in these roles is so high; they keep getting poached with higher salaries.

Unlike traditional SQL databases, for large volumes of unstructured data what we want is to store the data at the lowest possible cost while achieving disaster recovery, high scalability, high performance, and cross-region availability. So far, distributed storage has proven to be a good approach.

In addition, the cloud will be a good direction. Not every company can afford so many big data platform developers and ops engineers. Those of us in this field should keep a healthy sense of crisis, deliver our value in time, and actively learn new technologies, or we risk being left behind.

For startups and some traditional enterprises, it is often a good idea to spend some money hosting data with a cloud service provider. That way you can quickly determine what the data is worth to you, without buying a rack of servers or hiring a team of ops engineers and platform developers.

I say all this to give some direction to people who will work in this field, or to companies that plan to store data. I do not do this work myself and do not know it deeply, so treat it as a rough sketch.

The complaints you hear most often in this line of work: Hive is so slow, SQL queries are so slow, why has the cluster failed again, why is the data wrong after the Hadoop upgrade.

So this work requires strong troubleshooting skills and the ability to locate and fix bugs quickly, because many of the tools are open source. And because they are open source, you will run into all kinds of pitfalls, including releases that are not even backward compatible, which is why you need strong Java development skills.

To do well in this area you also need system architecture design skills, solid stress tolerance, problem-solving ability, and the ability to marshal resources. Joining the open source community helps you follow the latest trends and technologies at any time.

2. Data Warehouse ETL

Working in the data warehouse is genuinely hard; the on-call duty alone is daunting. Many data warehouse engineers are regularly woken at night by on-call pages: a data pipeline has broken, and they must immediately find which data source failed and fix it, or the entire downstream pipeline is affected.

If the pipeline is delayed, you may be called into the office by a senior leader asking, "Why isn't the data I need ready? Why hasn't my business report come out today?"

From that scenario you can tell this is a very important position. The data pipeline determines how messy source data becomes clean data after ETL. Clean, consistent data makes it easy to compute business statistics and keeps definitions unified. Otherwise, each department produces its own numbers: department A says the business grew 5%, department B says 10%, and nobody knows whom to believe.

At a minimum, I think data warehouse staff should do the following well:

a. Keep the data dictionary complete. Users want to know clearly what the logic behind each field is. Fields should be consistent: the same field should not have different definitions in different tables.

b. Keep the core pipeline stable. The main order table that everyone depends on daily should not land at wildly different times, sometimes very early, sometimes not until noon. If it is unstable, data consumers lose confidence in you.

c. Do not iterate warehouse versions too frequently, and keep versions compatible. Do not finish warehouse 1.0 only to replace it with 2.0 right away. A warehouse needs continuity: the main tables should not change too often, or users will suffer; they only just got used to the 1.0 table structure and cannot switch that fast. In short, stay backward compatible.

d. Keep business logic unified. The same business logic should not yield different results from different people on the same team. This happens when shared logic has not been distilled into reusable components, so everyone writes their own version. It deserves special attention.
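Point (a) can even be enforced mechanically. As a rough sketch (the table names, field names, and metadata format here are invented for illustration), a small script can scan a schema dump and flag fields that carry different definitions in different tables:

```python
from collections import defaultdict

def find_inconsistent_fields(schemas):
    """schemas: {table_name: {field_name: definition}}.
    Returns the fields whose definition differs across tables."""
    definitions = defaultdict(dict)
    for table, fields in schemas.items():
        for field, definition in fields.items():
            definitions[field][table] = definition
    # A field is inconsistent if more than one distinct definition exists.
    return {field: tables
            for field, tables in definitions.items()
            if len(set(tables.values())) > 1}

# Invented metadata: 'order_amount' means different things in two tables.
schemas = {
    "dw_orders": {"order_id": "unique order key",
                  "order_amount": "amount before discount"},
    "dw_payments": {"order_id": "unique order key",
                    "order_amount": "amount after discount"},
}
conflicts = find_inconsistent_fields(schemas)
print(sorted(conflicts))  # ['order_amount']
```

In practice the input would come from the warehouse's metadata store rather than a hand-written dict, but the consistency check itself is this simple.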

Given the above, the skill requirement for this position is: do not be someone who can only write SQL. Tools are highly developed now; if your skills are that simple, you are highly replaceable and will feel little sense of accomplishment. This is not to belittle people who write SQL, but to say you should learn more skills, or you will be in a precarious position.

Warehouse staff should constantly think about the most reasonable architecture: whether fields need to be redundant, row storage versus column storage, how to extend fields most effectively, and how to split hot and cold data. You need architectural thinking.

In terms of skills, besides being proficient in SQL, you should know how to write Transform scripts and MapReduce jobs, because much business logic is very complex to express in pure SQL; knowing a scripting language will make your life easier and greatly improve your efficiency. A good warehouse engineer can also write Java or Scala; writing UDTFs or UDAFs is a necessary way to raise your efficiency.
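As a minimal sketch of what a Hive TRANSFORM script looks like: Hive streams rows to the script as tab-separated lines on stdin, and the script emits tab-separated lines on stdout. The table and column names below (`user_id`, `event_json`) are hypothetical:

```python
import json
import sys

def transform_line(line):
    """Turn one TSV row (user_id, event_json) into (user_id, event_type, ts)."""
    user_id, raw = line.rstrip("\n").split("\t", 1)
    event = json.loads(raw)
    return "\t".join([user_id,
                      event.get("type", "unknown"),
                      str(event.get("ts", 0))])

if __name__ == "__main__":
    # Hive pipes rows in via stdin and reads the transformed rows from stdout.
    for line in sys.stdin:
        print(transform_line(line))
```

It would be invoked from HiveQL roughly as `SELECT TRANSFORM(user_id, event_json) USING 'python transform.py' AS (user_id, event_type, ts) FROM raw_events;`, again with made-up names. Logic like JSON parsing is painful in pure SQL but trivial in a script.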

Data warehouse staff should also think often about automation and tooling; you need good abstraction skills to build automated tools that raise the efficiency of the whole organization. And for the data skew problems you will frequently encounter, you need to locate and optimize them quickly.
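One common remedy for skew is "salting": split a hot key into several sub-keys so no single reducer receives all of its rows, partially aggregate, then strip the salt and merge. Sketched here in plain Python purely as an illustration of the two-stage idea (the key names are made up; in a real job the two stages would be two shuffle rounds):

```python
import random
from collections import defaultdict

def salted_sum(pairs, num_salts=4, seed=0):
    """Two-stage aggregation: salt keys to spread a hot key across buckets,
    partially aggregate, then strip the salt and merge."""
    rng = random.Random(seed)
    # Stage 1: aggregate on (key, salt), splitting a hot key into num_salts buckets.
    partial = defaultdict(int)
    for key, value in pairs:
        partial[(key, rng.randrange(num_salts))] += value
    # Stage 2: strip the salt and merge the partial sums.
    final = defaultdict(int)
    for (key, _salt), value in partial.items():
        final[key] += value
    return dict(final)

# A skewed dataset: one hot key dominating.
data = [("hot", 1)] * 1000 + [("cold", 1)] * 10
totals = salted_sum(data)  # totals == {"hot": 1000, "cold": 10}
```

The result is identical to a plain group-by sum, but stage 1 never concentrates the hot key's 1000 rows in one bucket.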

Having covered data storage, the following are several key data application positions. Before that, I want to stress the single most critical prerequisite of any data application: data quality, data quality, data quality! Every time you state an opinion, draw an analytical conclusion, or apply an algorithm, check the correctness of the source data first; otherwise every conclusion is a false proposition.

3. Data visualization

This is a cool job. It helps to know front-end technologies such as JavaScript. Data visualization staff need good analytical thinking, and must not sacrifice usefulness to the business just to show off technique. I have only had brief stints in this role, so my understanding is not especially deep, but I believe doing visualization well requires analytical ability.

On the other hand, everyone who does data applications should know something about visualization. The order of preference for presenting a point is: charts > tables > words. If something can be illustrated with a chart, do not describe it in prose; it is easier for others to understand. When you explain something to senior leadership, assume they know nothing about the data, and you will explain it far more vividly.

4. Data analyst

Demand for data analysis is huge now, because everyone is asking: we have the data, but what can we do with it? That requires data analysts to analyze and mine the data, then build data applications on top of it.

The most common complaints aimed at data analysts are: "what you analyzed is just ordinary business logic, why do we need you?", or "your conclusion is wrong, it contradicts our business logic." In particular, when an A/B test does not match expectations, analysts are often pulled in and told, "Analyze why my A/B test result is not significant; there must be a reason."

Many times, the analyst's heart is bitter. If you report that the conversion rate has dropped, you can see from the data which segment or channel declined; but as to why customers are not placing orders, you have to go ask the users. Often the data cannot tell you why; it can only tell you what the current situation is.
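As a concrete illustration of the "is this A/B result significant" question analysts keep getting asked, here is a minimal two-proportion z-test in plain Python; the conversion counts are made up:

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled rate under the null hypothesis that both variants convert equally.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the normal distribution, via the complementary
    # error function: P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical experiment: 200/5000 vs. 230/5000 conversions.
z, p = two_proportion_z_test(200, 5000, 230, 5000)
print(f"z = {z:.2f}, p = {p:.3f}")  # p well above 0.05: not significant
```

For these made-up numbers the lift looks real (4.0% vs. 4.6%) but the p-value says the sample is too small to rule out chance, which is exactly the conversation the analyst then has to have with the business.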

If you have long been writing analysis reports and handing over conclusions without any direct impact on the business, going round and round, it is time to wake up. Is this really the position you want?

On the positioning of data analysts: personally, I think it is very hard to become an excellent one, and there are not many on the market. Beyond analysis itself, distilling conclusions, and seeing the reasons behind the data, a data analyst also needs to understand the business and the algorithms.

Only then, when facing a business problem, can the analyst frame the problem, break it down layer by layer, and respond with the right strategy: whether to trial a policy first or optimize with an algorithm, which scenario to apply the algorithm in, and whether an algorithm can solve the problem at all.

An excellent data analyst is a versatile data scientist, proficient in both business and algorithms, not someone who passively takes requests, pulls data, builds reports, and analyzes. We all say analysis should produce conclusions; an excellent analyst's conclusion is a package of strategies and countermeasures that actually solves the problem. And many requirements are in fact discovered by analysts themselves, mined out of the data.

From the above, the requirements for a data analyst are: able to write SQL to pull data, proficient in the business, able to extract insight from data, proficient in algorithms, proactive, and demanding of themselves.

If you are always buried in routine analysis requests and keen on writing glossy reports, remember that you are in danger: plenty of people will question your value, especially at small companies, because data staff salaries are not a small expense.

Most analyses that never land in the business are pseudo-analyses. Some exploratory feasibility studies may not need to consider landing, but analyses of concrete business needs must, and practice should in turn reshape your role. Only by doing this repeatedly can you gradually affirm your value and improve your analytical skills; only in this way can you prove your value as an analyst by landing the data in the business.

5. Data Mining/Algorithms

After three years of trial and error here, I have strong feelings about this. The deepest impressions are the complaints:

· If a rule can settle it, why use an algorithm at all?

· Why is your accuracy so low?!

· Can't you get the accuracy up to 99%?

· Is your recommendation actually valuable? Without it, the customer would have ordered that product anyway.

· Help me make a big data prediction: what does this customer want?

In many cases, different scenarios demand different levels of accuracy, so you need to argue it out with the business within a reasonable framing. Do not be afraid of business complaints; more often than not, manage their expectations.

In some scenarios, the value of recommendation lies in the long-term repurchase rate, so do not fixate on the conversion rate of an A/B test. Lowering the customer's decision cost is also worthwhile: a smart product makes customers love it, and even when the short-term conversion difference is not obvious, the value shows up when you observe the long-term repurchase rate. In particular, distinguish high-frequency from low-frequency products; low-frequency products find it especially hard to show short-term value.

As for the skill requirements of this position: you are not required to implement every algorithm from scratch; there are many ready-made algorithm packages to call. The baseline requirement is knowing which algorithm fits which scenario. For classification, for example, common choices include LR, RF, XGBoost, ExtraTrees, and so on. You should also know which parameters of each algorithm are actually effective to tune, and how to improve a model that performs poorly. The ability to implement algorithms is also required; Scala, Python, R, or Java will do. We often say the tool is not what matters; what matters is that you play the tool, not that the tool plays you.
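As a small sketch of the "know which knobs matter" point, here is a scikit-learn example that compares a logistic regression baseline against a random forest and tunes two of the forest's genuinely impactful parameters. The data is synthetic, so the numbers are illustrative only:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic binary classification data standing in for a real business dataset.
X, y = make_classification(n_samples=2000, n_features=20, n_informative=8,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=42)

# A linear baseline: cheap, interpretable, often good enough.
lr = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# A tree ensemble, grid-searching two of its effective parameters.
grid = GridSearchCV(RandomForestClassifier(random_state=42),
                    param_grid={"n_estimators": [50, 200],
                                "max_depth": [5, None]},
                    cv=3)
grid.fit(X_train, y_train)

print("LR accuracy:", round(lr.score(X_test, y_test), 3))
print("RF accuracy:", round(grid.score(X_test, y_test), 3))
print("best RF params:", grid.best_params_)
```

The point is not which model wins on this toy data, but the workflow: always keep a simple baseline, and tune the parameters that actually move the metric rather than every knob the library exposes.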

In addition, for supervised learning, algorithm engineers should have good business sense, so that feature design is well targeted and the resulting features carry real prior signal.

6. Deep learning (NLP, CNN, speech recognition)

I have not shipped this commercially, only practiced with it. Personally, I think commercialization is the sticking point, especially when everyone is waiting around saying your chatbot will be amazing; yet Siri has been at it for so long, and the results are still not that good.

Customer service bots are very popular now, and everyone complains that they understand context poorly and that their semantic recognition is weak. Who can blame them? Recognizing meaning in Chinese is much harder than in many other languages, because a negative expression in Chinese has too many variants; you never know which one the user will say.

People also often complain: your CNN is so heavyweight, and we have to return results online within 100ms; it is surely too slow for real-time serving, so in the end we can only do offline prediction. The people who say this usually will not write the low-level code themselves. Often it is not that the problem has no solution, but that you have not thought about how to solve it. Your mindset determines your output.

Overall, this demands a high level of all-around ability. If you just want to use a ready-made model to extract mid-layer features and then feed them into another machine learning model for prediction, that can already solve some real company problems, such as Yelp's image classification.

Strictly speaking, though, that is not really doing deep learning. People who truly play with DL build models, tune parameters, and modify the computation themselves, so their programming ability is very strong; I have always admired them for this. Some startups in particular have high programming requirements for this position. If you hear nothing back after interviewing with such a startup, it may simply mean: you are excellent, but not necessarily suitable for us, because we are looking for someone with strong programming skills.

I am not an expert here, so I will stop before saying too much. Personally, I think you need strong ability to adapt and optimize algorithms, to keep improving prediction speed, and to keep improving the generality and accuracy of your models. The whole industry is developing in a good direction. If the high salaries on offer attract you, remember to check the job requirements and figure out which skills you still need to add; that is how you stand out from the crowd.

For the future, there is a bright prospect; for the future, there is great expectation; for the future, everything is possible.

To sum up:

So much has been said, but the core of it all is how to create value with data. If you cannot create value with data, you can only wait to be buried by it, beaten down in the workplace, and hit your career ceiling early.

On demonstrating data's value: the closer to the application layer, the higher the demand to generate value from the data. People in this field should regularly reflect on whether they have good business sense. After all, in industry nobody cares that you beat the traditional baseline by a percentage point; they care what that percentage point is worth to the company.

The closer to the bottom layer, the less performance is mandatorily tied to business outcomes; instead, more is agreed at the process level, and value there is mainly reflected in technical innovation. If you solve a problem in the existing architecture, you can become a big name. So learn more programming, and do not box yourself in.