Software quality evaluation system - data mining processing platform

2020/07/14 20:46

Once a software quality evaluation system is in place, the data used for evaluation must be determined before any evaluation is run, and this is where the data mining platform comes in. This article introduces our data mining and processing platform for evaluation data, using the production of an input method evaluation corpus as an example.




one


Data mining processing principles

Comprehensiveness

Use Scenarios

Based on the different typing environments that different users need, we select several frequently used applications so that the typing scenarios users encounter are covered as comprehensively as possible. They fall into two main categories:

  • Chat scenarios: content typed in user chats, in apps such as QQ, WeChat, and DingTalk

  • Non-chat scenarios: non-chat environments where users need to type, such as Zhihu, Weibo, Taobao, and video applications

User profile

For users with specific needs, we provide term data selected for those needs. For example:

  • Some people are teachers and may input many educational terms;

  • Some people are car enthusiasts and may input many automobile related terms;

  • Some people are e-sports enthusiasts and input more game-related entries;

  • and so on.

To meet these needs, we divide the data obtained from the usage scenarios above into more than 10 categories, such as cars, sports, education, games, and film and television, to cover the specific typing needs of specific users.

Objectivity

Impartiality

When selecting data, we treat all scenarios in which typing products are used equally and add no human bias. We should not use only the data that performs well on our own products. For example, producing evaluation data from the input method's own lexicon while ignoring popular online buzzwords leads to good evaluation results but a poor user experience. Treating all sources equally avoids the Matthew effect.


Understanding users' real intentions

Typing can involve many kinds of operations. For example, some words may not exist in the lexicon shipped with our input method but are coined by users themselves, so the evaluation needs a word-formation scenario. Therefore, when mining users' typing behavior, we cover not only common typing needs but also a variety of possible user behaviors, including association, word formation, error correction, and backspace.

Uniformity

When evaluation data is produced, the same data may serve many different evaluation needs, and differences in data format raise adaptation costs. Evaluation data therefore needs a uniform format. We have formulated a unified format specification to ensure that the data can be used effectively across multiple evaluation requirements.
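As a rough illustration of how a unified format can be enforced, the sketch below checks a corpus object before it is handed to an evaluation tool. The field names are borrowed from the corpus example later in this article. Treating them as a fixed schema, and treating "Number of entries" as the length of the entry list, are assumptions made for illustration. `validate_corpus` is a hypothetical helper, not part of our platform.

```python
from typing import Any

# Field names mirror the corpus example later in this article; treating them
# as a fixed schema is an assumption made for illustration.
REQUIRED_TOP_LEVEL = ("Number of entries", "Entry Content", "Keyboard Type")
REQUIRED_ENTRY_KEYS = ("pinyin", "expect_cand")


def validate_corpus(corpus: dict[str, Any]) -> list[str]:
    """Return a list of format problems; an empty list means the corpus conforms."""
    problems: list[str] = []

    for key in REQUIRED_TOP_LEVEL:
        if key not in corpus:
            problems.append(f"missing top-level field: {key}")

    entries = corpus.get("Entry Content", [])
    if not isinstance(entries, list):
        problems.append("'Entry Content' must be a list")
        return problems

    for i, entry in enumerate(entries):
        if not isinstance(entry, dict):
            problems.append(f"entry {i} is not an object")
            continue
        for key in REQUIRED_ENTRY_KEYS:
            if not isinstance(entry.get(key), str):
                problems.append(f"entry {i}: '{key}' must be a string")

    # Assumption: the declared count equals the length of the entry list.
    declared = corpus.get("Number of entries")
    if declared is not None and declared != len(entries):
        problems.append(
            f"'Number of entries' is {declared} but {len(entries)} entries found"
        )
    return problems
```

Running a check like this once, at the point where data is produced, means every downstream evaluation tool can rely on the same structure instead of adapting to each data source.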



two


Acquisition of evaluation data

Evaluation data is captured on a regular schedule. For data acquisition, see our open-source comment crawler project on GitHub:

 https://github.com/sogou-qa/LightCommentCrawler

You can visit the project address directly. Everyone is welcome to exchange ideas and learn together.

[Figure: result of data acquisition]



three




Evaluation data processing and corpus production

Data cleaning

After data acquisition, the articles or comments that make up the raw data are usually saved in JSON format. They may contain many special symbols, such as newline characters or characters that the input method cannot recognize, so they are processed with regular expressions and only the Chinese content needed for evaluation is retained.

[Figure: data before cleaning]

[Figure: data after cleaning]
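A minimal sketch of this kind of regular-expression cleaning is shown below. It assumes that "Chinese content" means runs of characters in the basic CJK Unified Ideographs range (U+4E00 to U+9FFF), that the raw data is stored as one JSON object per line, and that the text lives under a key named `content`. All three are assumptions for illustration; the actual rules and field names on our platform may differ.

```python
import json
import re

# Keep only runs of CJK Unified Ideographs (U+4E00..U+9FFF). This range is an
# assumption about what "Chinese content" means here; it drops punctuation,
# digits, Latin letters, emoji, and whitespace.
CHINESE_RUN = re.compile(r"[\u4e00-\u9fff]+")


def clean_text(raw: str) -> list[str]:
    """Split raw text into the Chinese-only fragments it contains."""
    return CHINESE_RUN.findall(raw)


def clean_records(json_lines: str, field: str = "content") -> list[str]:
    """Clean every record of a JSON Lines string.

    `field` is a hypothetical key name for the article or comment body.
    """
    fragments: list[str] = []
    for line in json_lines.splitlines():
        if not line.strip():
            continue  # skip blank lines between records
        record = json.loads(line)
        fragments.extend(clean_text(record.get(field, "")))
    return fragments
```

Returning separate fragments, rather than deleting the unwanted characters and joining the rest, keeps text on either side of punctuation apart, so a later segmentation step does not form words across a sentence boundary.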

Data segmentation

The cleaned data is still stored as large blocks of article text and cannot be used directly; it first has to be segmented into words with a dedicated tool. For this step we use the widely used Jieba word segmentation tool, which splits long passages into words.
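To make concrete what segmentation produces, here is a small self-contained sketch. It does not call Jieba, which is a third-party package with a more sophisticated algorithm. Instead it uses forward maximum matching, a simple dictionary-based technique, purely as a stand-in. The function name and the sample vocabulary are hypothetical.

```python
def forward_max_match(text: str, vocabulary: set[str], max_len: int = 4) -> list[str]:
    """Segment `text` greedily, always taking the longest vocabulary word."""
    words: list[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a word matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback,
            # so the loop is guaranteed to advance.
            if size == 1 or candidate in vocabulary:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical vocabulary, for illustration only.
vocabulary = {"今天", "天气", "不错"}
print(forward_max_match("今天天气不错", vocabulary))
```

The output of a segmenter is a flat list of words, which is exactly the input the corpus production step needs.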

When we previously used the Jieba toolkit directly, the segmentation results did not match our everyday typing habits very closely. We therefore optimized the word segmentation algorithm according to user input behavior, which improved the results significantly.

[Figure: comparison of segmentation results before and after optimization]

Corpus production

After word segmentation is complete, the next step is corpus production. First, the segmented results undergo pinyin annotation so that each entry has a corresponding pinyin string. These entries and their pinyin strings are then written out in a fixed format and finally saved as a JSON file. The corpus format depends on the evaluation tools and must remain uniform. The format of our corpus is as follows:

 Example:

    {
      "Number of entries": 4,
      "Entry Content": [
        { "pinyin": "woyou", "expect_cand": "我有" },   // "I have"
        { "pinyin": "", "expect_cand": "的" },          // represents association
        { "pinyin": "*", "expect_cand": "" },           // denotes backspace
        { "pinyin": "#", "expect_cand": "" }            // indicates line break
      ],
      "Keyboard Type": 26
    }
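The sketch below shows one way entries in this format could be assembled and serialized. The key names follow the example above as it appears in this translation, and the Chinese sample words are assumed. The pinyin step is replaced by a tiny hypothetical lookup table (`PINYIN_TABLE`); in practice a pinyin conversion library would fill that role. All helper names are illustrative and are not our production code.

```python
import json

# Hypothetical stand-in for a real pinyin converter: a tiny lookup table.
PINYIN_TABLE = {"我有": "woyou", "今天": "jintian", "天气": "tianqi"}

# Special entries, following the markers used in the example above.
BACKSPACE = {"pinyin": "*", "expect_cand": ""}
LINE_BREAK = {"pinyin": "#", "expect_cand": ""}


def word_entry(word: str) -> dict[str, str]:
    """Ordinary input: a pinyin string and the candidate expected for it."""
    return {"pinyin": PINYIN_TABLE[word], "expect_cand": word}


def association_entry(word: str) -> dict[str, str]:
    """Association: empty pinyin, as in the example above."""
    return {"pinyin": "", "expect_cand": word}


def build_corpus(entries: list[dict[str, str]], keyboard_type: int = 26) -> str:
    """Serialize entries as JSON text in the fixed corpus format."""
    corpus = {
        "Number of entries": len(entries),
        "Entry Content": entries,
        "Keyboard Type": keyboard_type,
    }
    # ensure_ascii=False keeps Chinese characters readable in the output file.
    return json.dumps(corpus, ensure_ascii=False, indent=2)


entries = [word_entry("我有"), association_entry("的"), dict(BACKSPACE), dict(LINE_BREAK)]
print(build_corpus(entries))
```

Deriving "Number of entries" from the list itself, instead of passing it in, keeps the count and the content from drifting apart.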


four


Conclusion

Evaluation data mining is not static; it needs to be continuously updated and improved to keep up with increasingly complex evaluation tasks. With the evaluation system and the evaluation data in place, the next step is to develop the evaluation tools and the content related to evaluation execution.

Other articles in the software quality evaluation system series:

Software quality evaluation system - opening

Software quality evaluation system - evaluation system


This article is shared from the WeChat official account Sogou QA.