
Big data runs into a dirty data problem

Karim Keshavjee is a doctor and online health consultant in Toronto. He wants to distill, from the mountain of data fed back by 500 doctors, how to treat patients better. But as everyone knows, doctors' handwriting is notoriously illegible, and the spelling errors and abbreviations that come with it are even harder for computers to decipher.

Keshavjee gives an example: "Whether a patient smokes is a very important piece of information. If you read the chart yourself, you understand immediately what the doctor means. But if you want a computer to understand it, good luck. You can give the computer an option like 'never smoked' or 'smoking = 0', but how many cigarettes does the patient smoke a day? That is nearly impossible for a computer to work out."
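
To see why, here is a minimal sketch, not Keshavjee's actual system, of what even a rule-based extractor for smoking status has to contend with; the example notes and patterns below are invented:

    import re

    # Invented examples of how different doctors might record the same fact.
    notes = [
        "never smoker",
        "Smoking = 0",
        "smokes approx 1/2 pack per day",
        "pt quit smoking 2009, previously 20 cig/day",
    ]

    def smoking_status(note):
        # Very rough rule-based guess; real charts defeat rules like these.
        text = note.lower()
        if re.search(r"never smok|smoking\s*=\s*0", text):
            return "non-smoker"
        if re.search(r"quit smoking", text):
            return "former smoker"
        if re.search(r"smok|cig|pack", text):
            return "smoker (amount unclear)"
        return "unknown"

    for n in notes:
        print(n, "->", smoking_status(n))

Every new doctor, and every new phrasing, means another rule.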

Because the marketing around big data is so breathless, many people assume it is simple to use: feed the equivalent of an entire library into a computer, sit back, and wait for insights on how to make a production line more efficient, how to get online shoppers to buy more sneakers, or how to treat cancer. The reality is far messier. Because information goes out of date, turns out to be inaccurate, or is simply missing, data inevitably gets "dirty". Making it "clean" is an increasingly important but often overlooked job, and one that can keep you from making costly mistakes.

Although the technology keeps improving, there are still only so many ways to clean data. Even with relatively "clean" data, it is often slow, laborious work to get useful results out of it.

Josh Sullivan, a vice president at Booz Allen, said: "I tell my clients that it's a messy, dirty world, and there is no such thing as a completely clean data set."

Data analysts generally prefer to hunt for unusual information first. Because the volume of data is so large, they usually hand the filtering off to software, which flags anything abnormal that needs a closer look. Over time the software's accuracy improves: by grouping similar cases together, it gets better at understanding what certain words and phrases mean, which in turn sharpens the screening.
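
As a loose illustration of that kind of automated first pass, not Booz Allen's method, a screening step often just flags values that sit far from the rest of the data; the field and threshold below are assumptions:

    from statistics import mean, stdev

    # Hypothetical daily cigarette counts pulled from patient records.
    values = [0, 5, 10, 12, 8, 0, 7, 200]   # 200 is probably a typo or a unit mix-up

    m, s = mean(values), stdev(values)

    # Flag anything more than two standard deviations from the mean
    # for a human to inspect, rather than silently dropping it.
    flagged = [v for v in values if abs(v - m) > 2 * s]
    print("suspicious values:", flagged)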

Sullivan said: "This method is simple and direct, but 'training' your model can take weeks and weeks."

Some companies sell software and services for cleaning data, from technology giants like IBM and SAP to big data and analytics specialists such as Cloudera and Talend Open Studio. A crowd of startups also wants to be the gatekeepers of big data, among them Trifacta, Tamr and Paxata.

Because it generates so much "dirty" data, health care is considered one of the hardest industries for big data technology. With the spread of electronic medical records, getting medical information into computers keeps getting easier. But researchers, pharmaceutical companies and health-care analysts still have a great deal of cleanup to do before they can analyze the data they need.

Keshavjee, the doctor, is also CEO of InfoClin, a health data consulting firm. He has spent a great deal of time trying to sift useful data out of tens of thousands of electronic medical records to improve how patients are diagnosed and treated, and he runs into obstacles at every turn.

Many doctors never record a patient's blood pressure in the chart, a gap that no amount of data cleaning can fix. Working out what disease a patient has from what is actually in the record is an extremely difficult task for a computer. When entering data about diabetes, a doctor may forget to note clearly whether it is the patient or a family member who has the disease. Or the doctor may simply type the word "insulin" without mentioning the diagnosis at all, because to them it is obvious.

Doctors use their own shorthand when they diagnose, prescribe drugs and fill in basic patient information. Deciphering it is a headache even for humans, and essentially impossible for computers. Keshavjee recalled, for example, puzzling over the three letters "gpa" in one record. Fortunately he spotted "gma" written a little further on, and it dawned on him that they were abbreviations for grandpa and grandma.

Kochavaj said, "It took me a long time to understand what they mean."

Keshavjee believes one lasting fix for "dirty" data is to develop a kind of "data discipline" for medical records: train doctors to enter information correctly in the first place, so that nobody has to untangle the data afterwards. He notes that Google has a very useful tool that suggests the correct spelling of uncommon words as users type, and something like it could be built into electronic medical record software. Catching spelling mistakes addresses only part of the problem, but getting doctors to drop bad habits is a step in the right direction.

Keshavjee's other suggestion is to build more standardized fields into the electronic medical record, so the computer knows exactly where to look for specific pieces of information, which cuts the error rate. In practice it is not that simple, because many patients suffer from several diseases at once, so a standard form has to be flexible enough to accommodate all of that complexity.
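
Purely as a hypothetical sketch of what such standardized fields might look like, a record could combine fixed slots with a list that can hold any number of diagnoses:

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class Encounter:
        patient_id: str
        systolic_bp: Optional[int] = None        # fixed slot: easy for software to find
        diastolic_bp: Optional[int] = None
        smoking_status: Optional[str] = None      # e.g. "never", "former", "current"
        diagnoses: List[str] = field(default_factory=list)  # flexible: several diseases at once

    visit = Encounter("p-001", systolic_bp=128, diastolic_bp=82,
                      smoking_status="former", diagnoses=["diabetes", "hypertension"])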

But for the purposes of diagnosis and treatment, doctors sometimes need to write free-form notes in the chart, and those certainly do not fit in a small box. The reason a patient fell, for example, is very important information if the fall was not caused by an injury. Without context, though, software's grasp of free text is largely a matter of luck. People can do better by searching the data with keywords, but that inevitably misses many relevant records.
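
A toy example of that limitation, with invented notes and keywords: a keyword filter finds some of the relevant records but silently misses wording it was never told about:

    notes = [
        "patient fell at home, no injury",
        "fall from ladder, fractured wrist",
        "tripped on the stairs and slipped",   # relevant, but uses neither keyword
    ]

    keywords = ("fell", "fall")
    matches = [n for n in notes if any(k in n for k in keywords)]

    print(matches)   # the "tripped ... slipped" note is never found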

Of course, in some cases numbers that look dirty are not actually dirty. Sullivan, the Booz Allen vice president, said his team was once analyzing customer demographics for a luxury hotel chain when the data suddenly showed that a group of teenagers from a wealthy Middle Eastern country were frequent guests.

Sullivan recalled: "There were a large group of 17 year olds staying in this hotel all over the world, and we thought: 'This is definitely not true.'"

But after some digging, they found the information was correct. The chain had a large number of very young customers it was not even aware of, and it had never marketed to them. The company's computers automatically put every customer under the age of 22 into the "low income" bracket; hotel executives had never stopped to consider how much money these teenagers might have.
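
The rule itself could be as small as the hypothetical reconstruction below, which is exactly why it went unquestioned for so long; the age cutoff comes from the article, everything else is invented:

    def income_bracket(age):
        # Blanket assumption baked into the system: every guest under 22
        # counts as "low income", no matter how much they actually spend.
        return "low income" if age < 22 else "standard"

    # A 17-year-old booking luxury suites still lands in the wrong bucket.
    print(income_bracket(17))   # -> "low income"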

"I think it would be more difficult to build a model without outliers," Sullivan said

Even data that is obviously dirty can sometimes be put to great use. Take Google's spelling correction technology, mentioned above: it automatically recognizes misspelled words and offers alternative spellings. The reason the tool works so well is that Google has collected hundreds of millions, even billions, of misspellings over the years. Dirty data, in other words, can be turned into treasure.
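
The underlying idea, sketched in miniature below with an invented log rather than Google's data, is that a large record of observed misspellings becomes a lookup table for corrections:

    from collections import Counter

    # Pretend log of (typed, intended) pairs harvested from past corrections.
    observed = [("recieve", "receive"), ("recieve", "receive"),
                ("teh", "the"), ("teh", "the"), ("teh", "ten")]

    # For each misspelling, keep the correction people chose most often.
    counts = Counter(observed)
    best = {}
    for (typo, fix), n in counts.items():
        if typo not in best or n > counts[(typo, best[typo])]:
            best[typo] = fix

    print(best.get("teh"))   # -> "the"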

In the end, it is people, not machines, who draw conclusions from big data. A computer can organize millions of documents, but it cannot truly interpret them. Data cleaning is a trial-and-error process that makes it easier for people to draw those conclusions. Big data may be billed as a magic tool that can lift business profits and benefit all humanity, but it is also a headache.

Sullivan put it this way: "The notion of failure is a very different thing in data science. If we don't fail 10 or 12 times a day trying things that don't work, we won't get to the right answer."