Acquiring knowledge from information presented in various media, according to different needs
Knowledge discovery is the process of obtaining knowledge from various kinds of information according to different needs. Its purpose is to shield users from the raw data: it extracts valid, novel, and potentially useful knowledge from the raw data and reports it directly to users.[1]
Knowledge Discovery in Databases (KDD) is what is commonly called "data mining". In a broader sense, it means obtaining knowledge from information expressed in various media according to different needs, extracting meaningful and concise knowledge from the raw data and reporting it directly to users.
The terms KDD and data mining are often confused and are usually used interchangeably. Strictly speaking, KDD refers to the whole process of converting low-level data into high-level knowledge. It can be defined simply as the process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. Data mining, in its general interpretation, is the extraction of patterns or models from observed data. Although data mining is the core of the knowledge discovery process, it usually accounts for only part of KDD (about 15% to 25%), so it is just one step of the whole process.
There is no exact definition of how many steps, and which steps, the KDD process must include. A generic process, however, should accept raw data as input, select the important data items, reduce, preprocess, and condense the data set, convert the data into an appropriate format, find patterns in the data, and evaluate and interpret the findings.
Basic tasks
Data classification
Classification is an important branch of data mining research and an effective method of data analysis. Its goal is to construct a classification model (that is, a classifier) that maps data records to one of a set of given categories, so that it can be used directly for prediction.
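As an illustration of a classifier that maps data records to given categories, the following is a minimal sketch of a nearest-neighbor classifier. The function name `knn_classify`, the customer records, and the category labels are hypothetical and are not taken from the article; nearest neighbors is only one of many possible classification models.

```python
from collections import Counter
from math import dist

def knn_classify(training, record, k=3):
    """Map `record` to the majority category among its k nearest training records."""
    # training is a list of (feature_tuple, category) pairs.
    nearest = sorted(training, key=lambda item: dist(item[0], record))[:k]
    votes = Counter(category for _, category in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical records: (monthly spend, visits per month) -> customer category.
training = [
    ((20.0, 1.0), "occasional"),
    ((25.0, 2.0), "occasional"),
    ((30.0, 1.0), "occasional"),
    ((200.0, 8.0), "frequent"),
    ((220.0, 10.0), "frequent"),
    ((250.0, 9.0), "frequent"),
]

print(knn_classify(training, (210.0, 9.0)))  # frequent
print(knn_classify(training, (22.0, 1.0)))   # occasional
```

The categories are fixed in advance by the labeled training records, which is what distinguishes this task from clustering.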
Data clustering
Clustering groups a set of individuals into several classes according to their similarity, so that the classes are discovered automatically. Like classification, clustering groups data. Unlike classification, however, the groups in clustering are not predefined; they are determined by the characteristics of the actual data and the similarity between data items.
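The idea that groups emerge from the similarity between data items, rather than from predefined labels, can be sketched with a small k-means routine. This is an illustrative sketch only: the function name `k_means`, the sample points, and the deterministic choice of the first k points as initial centroids are assumptions made for the example.

```python
from math import dist

def k_means(points, k, iterations=20):
    """Group points into k clusters by repeatedly assigning each point to its nearest centroid."""
    # Deterministic initialisation for this sketch: use the first k points.
    centroids = [points[i] for i in range(k)]
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        new_centroids = []
        for i, members in enumerate(clusters):
            if members:
                # The new centroid is the coordinate-wise mean of the members.
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # assignments no longer change
        centroids = new_centroids
    return centroids, clusters

# Hypothetical two-dimensional data with two visible groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.0), (1.0, 0.5), (9.0, 8.5)]
centroids, clusters = k_means(points, k=2)
for centroid, members in zip(centroids, clusters):
    print(centroid, members)
```

No category labels are supplied; the two groups are found from the distances between points alone.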
Regression and prediction
Prediction is a special type of classification that can be seen as forecasting future data states from past and current data. Regression predicts numerical values by modeling them with statistical techniques: it learns a (linear or nonlinear) function that maps a data item to a numeric prediction variable.
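Learning a function that maps a data item to a numeric prediction variable can be sketched, for the linear case, as an ordinary least-squares line fit. The function name `fit_line` and the sample data are hypothetical; the slope is computed as $\sum (x-\bar{x})(y-\bar{y}) / \sum (x-\bar{x})^2$ and the intercept as $\bar{y} - \text{slope}\cdot\bar{x}$.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical past observations (time step, measured value).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)
forecast = slope * 5.0 + intercept  # predict the value at the next time step
print(slope, intercept, forecast)   # 2.0 1.0 11.0
```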
Association and correlation
This task finds interesting associations or correlations in the data. Association rules analyze one data object in the database to infer information about another, identifying knowledge patterns that recur with high probability. A confidence factor is often used as a parameter to describe this uncertain relationship.
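The confidence factor of an association rule can be illustrated with a short calculation. In this sketch the transactions and item names are hypothetical; support is taken as the fraction of transactions containing an itemset, and the confidence of a rule A → B as support(A ∪ B) / support(A), which are the standard definitions.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated probability of `consequent` given `antecedent`."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical purchase records.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]

print(support(transactions, {"bread", "milk"}))       # 0.6
print(confidence(transactions, {"bread"}, {"milk"}))  # 0.75
```

A rule is usually reported only when both its support and its confidence exceed thresholds chosen by the analyst.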
Sequential discovery
This usually refers to determining sequential patterns in a group of data. Once a specific type of relationship in the data has been found, these patterns resemble associations. Sequential discovery differs from association, however, in that the relationships are based on time series.
Summarization
Summarization maps data to a subset of data groups that have concise descriptions, or produces highly generalized data for a user-specified data set in the database.
Description and discrimination
This task finds a set of feature rules, each of which is either a proposition describing the characteristics of a data group or a proposition distinguishing the concept of a target class from that of a contrasting class.
Time series analysis
Its task is to discover trends in attribute values, for example in stock market indexes and other financial data, customer data, and medical data. It searches for similar patterns in order to discover and predict the risks, causal relationships, and trends of specific patterns.
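One simple way to expose the development trend of an attribute value is to smooth the series with a moving average and compare the smoothed endpoints. This is only a sketch: the function name `moving_average`, the window size, and the index values are assumptions made for the example, and real time series analysis uses far richer models.

```python
def moving_average(values, window):
    """Return the mean of each run of `window` consecutive values."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]

# Hypothetical daily closing values of an index.
index_values = [10.0, 12.0, 11.0, 13.0, 15.0, 14.0, 16.0]

smoothed = moving_average(index_values, window=3)
trend = "rising" if smoothed[-1] > smoothed[0] else "not rising"
print(smoothed)  # [11.0, 12.0, 13.0, 14.0, 15.0]
print(trend)     # rising
```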
Knowledge types
(1) Generalization.
Knowledge that is discovered from the microscopic characteristics of the data and that characterizes them in universal, high-level conceptual, meso-level, or macro-level terms.
(2) Classification and clustering.
Knowledge that reflects the common properties of similar things and the characteristics that distinguish different things. It is used to reflect the aggregation patterns of data or to distinguish the categories of objects according to their properties.
(3) Association.
Knowledge that reflects the dependency or correlation between one event and other events, also known as dependency relations. This kind of knowledge can be used in database normalization, query optimization, and so on.
(4) Prediction.
Knowledge that predicts future situations from historical and current time-series data. It is in effect a kind of association knowledge with time as the key attribute.
Many knowledge discovery technologies have emerged, and they can be classified in many ways. By mining object, they are based on relational databases, multimedia databases, and so on; by mining method, they are data-driven, query-driven, or interactive; by knowledge type, they cover association rules, feature mining, classification, clustering, summarization, trend analysis, deviation analysis, and text mining. Knowledge discovery technologies fall into two broad categories: algorithm-based methods and visualization-based methods. Most algorithmic methods have been developed from artificial intelligence, information retrieval, databases, statistics, fuzzy sets, rough set theory, and other fields.
Typical technologies
Typical algorithmic knowledge discovery technologies include Bayesian probability theory and maximum likelihood estimation, regression analysis, nearest neighbors, decision trees, k-means clustering, association rule mining, the Web and search engines, data warehouses and online analytical processing (OLAP), neural networks, genetic algorithms, fuzzy classification and clustering, rough classification, and rule induction. These technologies are mature and are described in detail in the relevant books and articles. The visualization-based methods are introduced here.
② Icon-based techniques. Each multidimensional data item is mapped to a shape, color, or other icon to improve the representation of data and patterns.
③ Pixel-oriented techniques. Each attribute is represented by a single colored pixel, or the value range of an attribute is mapped to a fixed color map.
④ Hierarchical techniques. The multidimensional space is subdivided and its subspaces are presented hierarchically.
⑤ Graph-based techniques. Data sets are presented effectively in the form of graphs by using query languages and extraction techniques.
⑥ Hybrid techniques. Two or more of the above techniques are combined.
Operation steps
The knowledge discovery process has been described in many ways. These descriptions differ only in organization and presentation, not in essential content. The process includes the following steps:
1. Problem understanding and definition: data miners work with domain experts to analyze the problem in depth and to determine possible solutions and the methods for evaluating the learning results.
2. Collection and extraction of relevant data: collect relevant data according to the problem definition. During data extraction, the query functions of the database can be used to speed up the process.
3. Data exploration and cleaning: understand the meaning of the fields in the database and their relationships with other fields. Check the extracted data for validity and clean up data that contains errors.
4. Data engineering: reprocess the data, mainly by selecting relevant attribute subsets and eliminating redundant attributes, sampling the data according to the knowledge discovery task to reduce the amount of learning, and converting the data representation to suit the learning algorithm. This step may be repeated several times to achieve the best match between the data and the task.
5. Algorithm selection: select an appropriate data mining algorithm according to the data and the problem to be solved, and decide how to apply the algorithm to the data.
6. Running the data mining algorithm: extract patterns from the processed data using the selected algorithm.
7. Evaluation of results: the evaluation of the learning results depends on the problem to be solved and assesses the novelty and validity of the discovered patterns.
Data mining is a basic step in the KDD process; it consists of specific mining algorithms that find patterns in the database. The KDD process uses data mining algorithms to extract or identify knowledge from the database according to specified measures and thresholds, and it also includes preprocessing of the database, sampling, and data transformation.
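The steps above can be sketched as a small pipeline. Everything in this example is hypothetical: the record layout, the function names `clean`, `engineer`, `mine`, and `evaluate`, the choice of frequent item pairs as the mining algorithm, and the 0.6 support threshold. It is meant only to show how cleaning, data engineering, mining, and threshold-based evaluation fit together.

```python
from collections import Counter
from itertools import combinations

# Step 2: hypothetical records extracted from a sales database.
raw_records = [
    {"id": 1, "items": ["bread", "milk"], "amount": 4.5},
    {"id": 2, "items": ["bread", "butter"], "amount": 5.0},
    {"id": 3, "items": ["bread", "milk"], "amount": -1.0},  # invalid amount
    {"id": 4, "items": [], "amount": 0.0},                  # empty basket
    {"id": 5, "items": ["milk", "bread", "butter"], "amount": 7.2},
]

def clean(records):
    """Step 3: drop records that fail a simple validity check."""
    return [r for r in records if r["items"] and r["amount"] > 0]

def engineer(records):
    """Step 4: keep only the attribute the mining task needs, as item sets."""
    return [frozenset(r["items"]) for r in records]

def mine(baskets):
    """Step 6: count how often each pair of items occurs together."""
    counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts

def evaluate(counts, n_baskets, min_support):
    """Step 7: keep only patterns whose support reaches the threshold."""
    return {
        pair: count / n_baskets
        for pair, count in counts.items()
        if count / n_baskets >= min_support
    }

baskets = engineer(clean(raw_records))
patterns = evaluate(mine(baskets), len(baskets), min_support=0.6)
print(patterns)
```

With this data, two of the five records are discarded during cleaning, and only the pairs that occur in at least 60% of the remaining baskets are reported.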
Scope of application
The potential applications of knowledge discovery are very broad and have gone far beyond the initial "shelf project". From industry to agriculture, from astronomy to geography, and from forecasting to decision support, KDD is playing an increasingly important role. Many computer software developers have launched their own data mining products, such as IBM, Microsoft, SPSS, SGI, SLPInfoware, and SAS (ObjectBusiness). As a new information processing technology, data mining has come to the fore in practical applications.
1. Business. The "shelf project" is an example of an early successful application of KDD. It was the successful application of KDD in business that continued to stimulate its development and its expansion into broader application areas. Today, business, especially sales and the service trade, remains one of the most widely used fields of KDD. It is mainly used for sales forecasting, inventory demand, retail site selection, price analysis, and sales pattern analysis. For example, by analyzing the deviation patterns of customers with particularly high and low consumption, hotels can find interesting consumption patterns. Automated Wagering used ModelMAX, a predictive modeling tool from Advanced Software Applications, together with geographic information analysis to develop Lottery Machine Selection, which determines the best places in Florida to install lottery machines.
[Figure: Example diagram of knowledge discovery]
2. Agriculture. Agriculture is a large, complex system. China's agricultural sector has accumulated a large amount of data, examples, and experiential knowledge about soil fertility, meteorology, diseases and pests, and market information, but these have not been fully utilized. Much valuable and regular knowledge can be found through KDD. For example, analysis of pest databases can reveal the influencing factors and the laws of migration or spread, helping to contain the occurrence and expansion of disasters or to reduce losses; mining domestic market information can guide agricultural production planning.
3. Medical biology. The medical and health care industry has a large amount of data to process, but this data is managed by different information systems, is poorly organized, and is of complex types. Medical diagnosis data, for example, may include text, numerical values, and images, which makes applications difficult. In medicine, KDD is mainly used for diagnostic analysis, drug composition and efficacy analysis, new drug development, and control and optimization of drug production processes.
7. Other aspects. Examples include equipment fault diagnosis and production process optimization in industrial production, data processing and analysis in scientific research, and meteorological analysis and forecasting.