Knowledge discovery

Acquiring knowledge from information presented in various media, according to different needs
Knowledge discovery is the process of obtaining knowledge from various kinds of information according to different needs. Its purpose is to shield users from the tedious details of raw data, extract valid, novel, and potentially useful knowledge from that data, and report it directly to users. [1]
Chinese name: knowledge discovery
Foreign name: Knowledge Discovery in Database (KDD)
Knowledge discovery: a broader term for "data mining"
Data classification: one of the important branches of data mining research

Conceptual analysis

Knowledge Discovery in Database (KDD) is a broader term for what is commonly called "data mining": obtaining knowledge from information expressed in various media, according to different needs. The purpose of knowledge discovery is to shield users from the tedious details of raw data, extract meaningful and concise knowledge from it, and report that knowledge directly to users. The terms KDD and data mining are often confused and are usually used interchangeably. Strictly, KDD refers to the whole process of converting low-level data into high-level knowledge. It can be defined simply as the process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. Data mining, in its general interpretation, is the extraction of patterns or models from observed data. Although data mining is the core of the knowledge discovery process, it usually accounts for only part of KDD (about 15% to 25%), so it is only one step of the whole process. There is no exact definition of how many steps, or which steps, the KDD process must include. A generic process, however, accepts raw data as input, selects the important data items, reduces, preprocesses, and condenses the data set, converts the data into an appropriate format, finds patterns in the data, and evaluates and interprets the findings.

Basic tasks

Data classification

Classification is one of the important branches of data mining research and an effective data analysis method. Its goal is to construct a classification model (i.e., a classifier) that maps data records to one of a given set of categories, so that it can then be used for prediction.
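As an illustration of a model that maps data records to given categories, the following is a minimal sketch of a nearest-centroid classifier. The technique, the function names, and the customer data are assumptions chosen for illustration; the article does not prescribe a particular classifier.

```python
from collections import defaultdict
from math import dist


def train_centroids(records, labels):
    """Build a classifier as one mean vector (centroid) per category."""
    grouped = defaultdict(list)
    for record, label in zip(records, labels):
        grouped[label].append(record)
    return {
        label: tuple(sum(col) / len(rows) for col in zip(*rows))
        for label, rows in grouped.items()
    }


def classify(centroids, record):
    """Map a record to the category whose centroid is closest."""
    return min(centroids, key=lambda label: dist(centroids[label], record))


# Hypothetical training data: (monthly spend, visits) labelled by customer type.
records = [(10.0, 1.0), (12.0, 2.0), (95.0, 9.0), (105.0, 11.0)]
labels = ["occasional", "occasional", "frequent", "frequent"]
model = train_centroids(records, labels)
print(classify(model, (90.0, 8.0)))  # prints "frequent"
```

Once trained, the same model can be applied to any new record, which is what makes classification usable for prediction.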

Data clustering

Clustering groups a set of individuals into several classes according to their similarity, so that classes can be found automatically. Like classification, clustering groups data. Unlike classification, however, the groups in clustering are not predefined; they are defined by the characteristics of the actual data and the similarity between data items.
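The difference from classification can be seen in a small sketch of k-means clustering, where no labels are supplied and the groups emerge from the data. The choice of k-means, the seeding with the first k points, and the sample points are assumptions made for illustration.

```python
from math import dist


def k_means(points, k, iterations=20):
    """Group points into k clusters; the groups are not predefined."""
    centroids = list(points[:k])  # assumption: seed with the first k points
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its members.
        new_centroids = [
            tuple(sum(col) / len(c) for col in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:  # assignments have stabilised
            break
        centroids = new_centroids
    return centroids, clusters


# Hypothetical two-dimensional data with two visible groups.
points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5), (1.0, 0.5), (8.5, 9.0)]
centroids, clusters = k_means(points, k=2)
for members in clusters:
    print(members)
```

Seeding with the first k points keeps the example deterministic; practical implementations usually choose the initial centroids more carefully.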

Regression and forecasting

This is a special type of classification that can be seen as predicting a future data state from past and current data. Regression models numerical values with statistical techniques: it learns a (linear or nonlinear) function that maps data items to a numeric prediction variable.
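Learning a function that maps data items to a numeric prediction variable can be sketched for the linear case with ordinary least squares. The function name `fit_line` and the sample data are hypothetical.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical past observations: period number and observed value.
periods = [1, 2, 3, 4]
values = [3, 5, 7, 9]
slope, intercept = fit_line(periods, values)

# Predict the value for the next, unseen period.
forecast = slope * 5 + intercept
print(slope, intercept, forecast)
```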

Association and correlation

This task looks for interesting associations or correlations among data items. Association rule mining analyzes data objects in the database in order to infer information about one data object from another, and finds knowledge patterns that recur with high probability. A parameter called the confidence factor is often used to describe this uncertain relationship.
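The confidence factor can be made concrete with a short sketch. It uses the standard definitions: the support of an itemset is the fraction of transactions containing it, and the confidence of a rule A => B is support(A and B) / support(A). The shopping transactions are hypothetical.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated probability of the consequent given the antecedent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical market-basket data, one set of items per transaction.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
print(support(transactions, {"bread", "milk"}))       # 2 of 4 transactions
print(confidence(transactions, {"bread"}, {"milk"}))  # 2 of the 3 bread baskets
```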

Sequential discovery

Sequence discovery usually refers to determining the sequential patterns in a data group. Once a specific type of relationship has been found in the data, these patterns resemble associations. For relationships based on time series, however, sequence discovery and association differ. Summarization: this maps data to a subset with a concise description, or produces a highly generalized description of a specific set of user data in the database.
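What separates a sequential pattern from a plain association is that order matters. The sketch below counts ordered pairs (a occurs before b) across event sequences and keeps those seen in at least `min_count` sequences. The function name, the threshold parameter, and the event data are assumptions for illustration only.

```python
from collections import Counter
from itertools import combinations


def frequent_ordered_pairs(sequences, min_count):
    """Return ordered pairs (a, b), a before b, seen in >= min_count sequences."""
    counts = Counter()
    for seq in sequences:
        # combinations() preserves input order, so (a, b) means a precedes b.
        # set() ensures each pair is counted once per sequence.
        counts.update(set(combinations(seq, 2)))
    return {pair: n for pair, n in counts.items() if n >= min_count}


# Hypothetical user sessions, each an ordered list of events.
sessions = [
    ["login", "search", "buy"],
    ["login", "buy"],
    ["search", "login", "buy"],
]
patterns = frequent_ordered_pairs(sessions, min_count=2)
print(patterns)
```

Here ("login", "search") and ("search", "login") are counted as different patterns, which an unordered association rule would not distinguish.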

Description and identification

This refers to finding a set of characteristic rules, each of which either displays the characteristics of a data group or is a proposition that distinguishes the concept of an experimental class from that of a comparison class.

Time series analysis

Its task is to discover the development trend of attribute values, for example in stock market indices, financial data, customer data, and medical data. It searches for similar patterns in order to discover and predict the risks, causal relationships, and trends of specific patterns.
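One simple way to expose the development trend of an attribute value is a moving average, which smooths short-term fluctuation. This is a minimal sketch with hypothetical index values; it is only one of many time series techniques.

```python
def moving_average(values, window):
    """Average each run of `window` consecutive values."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# Hypothetical daily closing values of an index.
closes = [10, 12, 11, 15, 17, 16]
trend = moving_average(closes, window=3)
print(trend)

# A rising smoothed series suggests an upward trend.
is_rising = all(a < b for a, b in zip(trend, trend[1:]))
print(is_rising)
```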

Knowledge type

(1) Generalization.
Knowledge of a universal, meso-level, or macro-level concept, derived from the microscopic characteristics of the data.
(2) Classification & Clustering.
Characteristic knowledge that reflects the common nature of similar things and the differences between different things. It is used to reflect how data aggregates, or to distinguish categories of objects according to their properties.
(3) Association.
Knowledge that reflects the dependency or correlation between one event and other events, also known as dependency knowledge. This kind of knowledge can be used in database normalization, query optimization, and so on.
(4) Predictive knowledge (Prediction).
Knowledge that predicts the future situation from historical and current time series data. It is in effect a kind of association knowledge with time as the key attribute.
(5) Deviation knowledge (Deviation).
A description of exceptional cases, such as special instances outside the standard categories, outliers outside data clusters, and significant differences between actual observations and the values predicted by the system.
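Deviation knowledge, the last type above, can be illustrated by flagging values that lie far from the rest. The sketch uses a z-score rule (distance from the mean measured in standard deviations); the threshold of 2.0 and the spending data are assumptions, and other outlier tests are equally possible.

```python
from statistics import mean, pstdev


def find_outliers(values, threshold=2.0):
    """Return values more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:  # all values identical, so nothing deviates
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical per-customer spending with one extreme case.
spending = [10, 11, 9, 10, 12, 10, 50]
print(find_outliers(spending))  # prints [50]
```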

Technology application

content validity

Many knowledge discovery technologies have emerged, and there are many ways to classify them. By mining object, they can be divided into those based on relational databases, multimedia databases, and so on. By mining method, they can be divided into data-driven, query-driven, and interactive. By knowledge type, they can be divided into association rules, feature mining, classification, clustering, summary knowledge, trend analysis, deviation analysis, and text mining. Knowledge discovery technology as a whole falls into two categories: algorithm-based methods and visualization-based methods. Most algorithmic methods were developed from artificial intelligence, information retrieval, databases, statistics, fuzzy sets, rough set theory, and other fields.

Typical technology

Typical algorithm-based knowledge discovery technologies include Bayesian probability theory and maximum likelihood estimation, regression analysis, nearest neighbor, decision trees, k-means clustering, association rule mining, the Web and search engines, data warehouses and OLAP (On-Line Analytical Processing), neural networks, genetic algorithms, fuzzy classification and clustering, rough classification, and rule induction. These technologies are mature and are described in detail in the relevant books and articles. The following introduces the visualization-based methods.
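Of the algorithmic techniques listed, nearest neighbor is among the easiest to sketch: a new record takes the majority label of its k closest training records. The function name, the choice of k, and the data below are hypothetical.

```python
from collections import Counter
from math import dist


def knn_predict(train, query, k=3):
    """Label `query` by majority vote among its k nearest training records."""
    neighbours = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical labelled points: (features, label).
train = [
    ((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
    ((8, 8), "B"), ((9, 8), "B"), ((8, 9), "B"),
]
print(knn_predict(train, (2, 2)))  # prints "A"
print(knn_predict(train, (7, 8)))  # prints "B"
```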

Innovative technology

Visualization-based methods were developed in fields such as graphics, scientific visualization, and information visualization. They include:
① Geometric projection techniques. These use principal component analysis, factor analysis, and multidimensional scaling to find interesting projections of a multidimensional data set.
② Icon-based techniques. Each multidimensional data item is mapped to a shape, color, or other icon to improve the representation of data and patterns.
③ Pixel-oriented techniques. Each attribute value is represented by a single colored pixel; that is, the range of attribute values is mapped to a fixed color map.
④ Hierarchical techniques. The multidimensional space is subdivided and the subspaces are presented hierarchically.
⑤ Graph-based techniques. Data sets are presented effectively as graphs by using query languages and extraction techniques.
⑥ Hybrid techniques. Techniques that combine two or more of the above.
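As a rough illustration of the pixel-oriented idea in item ③, the sketch below maps the value range of one attribute onto a fixed scale of gray levels, so that each value could be drawn as one pixel. The linear mapping, the 256-level scale, and the function name are assumptions; real systems use a variety of color maps and pixel arrangements.

```python
def to_gray_levels(values, levels=256):
    """Map each value linearly onto an integer gray level in [0, levels - 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant attribute: every pixel gets the same level
        return [0 for _ in values]
    return [round((v - lo) / (hi - lo) * (levels - 1)) for v in values]


# Hypothetical attribute values for three data items.
pixels = to_gray_levels([0, 2, 10])
print(pixels)  # prints [0, 51, 255]
```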

Operation steps

There are multiple descriptions of the knowledge discovery process. They differ only in organization and expression, not in essential content. The knowledge discovery process includes the following steps:
1. Problem understanding and definition: data mining practitioners cooperate with domain experts to analyze the problem in depth and determine possible solutions and how the learning results will be evaluated.
2. Relevant data collection and extraction: collect the relevant data according to the problem definition. During data extraction, the query functions of the database can be used to speed up extraction.
3. Data exploration and cleaning: understand the meaning of the fields in the database and their relationships to other fields. Check the extracted data for validity and clean up data containing errors.
4. Data engineering: reprocess the data, mainly by selecting relevant attribute subsets and eliminating redundant attributes, sampling the data according to the knowledge discovery task to reduce the amount of learning, and converting the data representation to suit the learning algorithm. This step may be repeated several times to achieve the best match between the data and the task.
5. Algorithm selection: select an appropriate data mining algorithm according to the data and the problem to be solved, and decide how to apply it to the data.
6. Running the data mining algorithm: extract patterns from the processed data with the selected algorithm.
7. Evaluation of results: how the learning results are evaluated depends on the problem to be solved; the discovered patterns are evaluated for novelty and validity. Data mining is a basic step in the KDD process. It consists of specific mining algorithms that find patterns in the database. The KDD process uses data mining algorithms to extract or identify knowledge from the database according to specific measures and thresholds, and it includes preprocessing of the database, sampling, and data transformation.
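The steps above can be sketched as a small pipeline in which each stage is one function. Everything here is a toy assumption made for illustration: the record fields, the validity rule, the "mining algorithm" (a per-region average), and the `min_gap` evaluation threshold.

```python
from statistics import mean


def clean(rows):
    """Step 3: drop records whose amount is missing or invalid (negative)."""
    return [r for r in rows if r.get("amount") is not None and r["amount"] >= 0]


def engineer(rows):
    """Step 4: keep only the attributes relevant to the task."""
    return [(r["region"], r["amount"]) for r in rows]


def mine(pairs):
    """Steps 5 and 6: a trivial 'algorithm' that averages amount per region."""
    by_region = {}
    for region, amount in pairs:
        by_region.setdefault(region, []).append(amount)
    return {region: mean(vals) for region, vals in by_region.items()}


def evaluate(patterns, min_gap):
    """Step 7: accept the pattern only if regions differ by at least min_gap."""
    gap = max(patterns.values()) - min(patterns.values())
    return patterns if gap >= min_gap else {}


# Step 2: hypothetical extracted records.
raw = [
    {"region": "north", "amount": 120},
    {"region": "north", "amount": 80},
    {"region": "south", "amount": 30},
    {"region": "south", "amount": None},  # missing value
    {"region": "south", "amount": -5},    # invalid value
]
result = evaluate(mine(engineer(clean(raw))), min_gap=50)
print(result)
```

Keeping each stage as a separate function mirrors the point that data mining itself (the `mine` step here) is only one part of the overall process.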

Scope of application

In fact, the potential applications of knowledge discovery are very broad, going far beyond the initial "shelf project". From industry to agriculture, from astronomy to geography, and from forecasting to decision support, KDD is playing an increasingly important role. Many software developers have launched data mining products, including IBM, Microsoft, SPSS, SGI, SLPInfoware, and SAS (ObjectBusiness). As a new information processing technology, data mining has come to the fore in practical applications.
1. Business. The "shelf project" is an example of an early successful application of KDD. It is precisely the successful application of KDD in business that has continued to stimulate its development and its expansion into broader application areas. Today, business, especially the sales and service trades, is still one of the fields where KDD is most widely used. It is mainly used for sales forecasting, inventory demand, retail site selection, price analysis, and sales pattern analysis. For example, by analyzing the deviation patterns of customers with particularly high and particularly low spending, hotels can find some interesting consumption patterns. Automated Wagering used the ModelMAX prediction model from Advanced Software Applications, combined with geographic information, to analyze and select the best locations for installing lottery machines in Florida.
[Figure: example diagram of knowledge discovery]
2. Agriculture. Agriculture is a large, complex system. China's agricultural sector has accumulated a large amount of data, examples, and experiential knowledge about soil fertility, meteorology, diseases and pests, and market information, but it has not been fully utilized. Much valuable, regular knowledge can be found through KDD. For example, analysis of a pest database can reveal the factors that influence pests and the laws of their migration or spread, helping to contain outbreaks and their expansion or to reduce disaster losses. Mining domestic market information can guide agricultural production planning.
3. Medicine and biology. The medical and health care industry has a large amount of data to process, but that data is managed by different information systems, is poorly organized, and is of complex types. Medical diagnosis data, for example, may include text, numerical values, and images, which makes applications difficult. In medicine, KDD is mainly used for diagnostic analysis, drug composition and efficacy analysis, new drug development, and control and optimization of drug production processes.
4. Finance and insurance. Financial transactions require collecting and processing large amounts of data. Analyzing this data reveals its patterns and characteristics, which may in turn reveal the financial and business interests of a customer, a consumer group, or an organization, and make it possible to observe trends in the financial market. KDD is widely used in the financial field, for example in stock market analysis and prediction, account classification, bank guarantees, and credit evaluation.
5. Communications and media. Examples include line fault prediction, analysis of the factors influencing audience ratings, website intrusion detection, and Web information discovery.
6. National defense and military affairs. Examples include military intelligence data analysis, command automation and decision support, war risk prediction, weapon attack effect analysis, and geographic data analysis.
7. Other areas. Examples include equipment fault diagnosis and production process optimization in industrial production, data processing and analysis in scientific research, and meteorological analysis and forecasting.