Original address: DFA Algorithm for Filtering Sensitive Words in Games
In a game with a chat feature, we want the chat system to check what the player types. If the input contains sensitive words, we either block the message or replace the sensitive words with *.
Why use DFA algorithm
If we already have a sensitive word list (obtained from the relevant authorities or found online), the simplest way to filter sensitive words is:
Go through the whole list and, for each sensitive word, check whether it occurs in the string entered by the player. If it does, replace its characters with *.
But this approach has to traverse the entire word list and scan the player's string once for every word. A sensitive word list usually contains thousands of entries, while a chat message is generally only 20 to 30 characters long.
Therefore this method is very inefficient and is not suitable for real development.
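To make the cost of the simple approach concrete, here is a minimal sketch of it. The class name NaiveWordFilter and the method filter are hypothetical and only serve as an illustration; the point is that every word in the list triggers its own scan of the input.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper illustrating the brute-force approach. */
final class NaiveWordFilter {

    /** Replaces every occurrence of every listed word with asterisks of the same length. */
    static String filter(String input, List<String> sensitiveWords) {
        String result = input;
        for (String word : sensitiveWords) {      // one iteration per word in the list
            if (word.isEmpty()) {
                continue;                         // nothing to mask for an empty entry
            }
            if (result.contains(word)) {          // each check scans the input again
                char[] stars = new char[word.length()];
                Arrays.fill(stars, '*');
                result = result.replace(word, new String(stars));
            }
        }
        return result;
    }
}
```

For example, NaiveWordFilter.filter("you are bad, very bad", Arrays.asList("bad", "ugly")) returns "you are ***, very ***". With thousands of words, the loop body runs thousands of times for every chat message.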
The DFA algorithm filters sensitive words efficiently: with it, we can find and replace all the sensitive words by traversing the player's string only once.
DFA algorithm principle
The DFA algorithm builds a tree-like search structure in advance (strictly speaking, a forest) and then searches that structure very efficiently according to the input.
Suppose we have a sensitive word list containing the following words:
I love you!
I love him
I love her
I love you
I love him
I love her
I love her
Then we can construct such a tree structure:
Suppose the player enters a nine-character string whose characters are, in order: 'white', 'chrysanthemum', 'I', 'love', 'you', 'ah', 'ha', 'ha', 'ha' (roughly "Baiju, I love you, hahaha").
We traverse the player's string str and set a pointer i to the root node of the tree structure, that is, the leftmost blank node:
str[0] = 'white': tree[i] has no path to a node with the value 'white', so the matching condition is not met and we continue traversing
str[1] = 'chrysanthemum': the matching condition is not met either, so we continue traversing
str[2] = 'I': tree[i] has a path to the 'I' node, so the matching condition is met. i now points to the 'I' node and we continue traversing
str[3] = 'love': tree[i] has a path to the 'love' node, so the matching condition is met. i now points to 'love' and we continue traversing
str[4] = 'you': there is also a path, so i points to 'you' and we continue traversing
str[5] = 'ah': there is also a path, and i points to 'ah'
At this point the pointer i has reached the end of a branch of the tree structure, which means one sensitive word has been matched. We can use variables to record the index in the player's string where the match began and the index where it ended, and then iterate over that range to replace each character with *.
After a match, we point i back at the root node of the tree structure.
We have not yet reached the end of the player's string, so we continue traversing:
str[6] = 'ha': the matching condition is not met, so we continue traversing
str[7] = 'ha'
str[8] = 'ha'
As you can see, we find the sensitive words by traversing the player's string only once.
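The bookkeeping mentioned in the walkthrough (remember the index where a match began and the index where it ended, then overwrite that range with *) might look like the following minimal sketch. The class name MaskRange and the method mask are hypothetical.

```java
/** Hypothetical helper; the names are illustrative only. */
final class MaskRange {

    /** Returns a copy of text with the characters in [start, end] (both inclusive) replaced by '*'. */
    static String mask(String text, int start, int end) {
        StringBuilder sb = new StringBuilder(text);
        for (int k = start; k <= end; k++) {
            sb.setCharAt(k, '*');   // overwrite one matched character
        }
        return sb.toString();
    }
}
```

For example, MaskRange.mask("xxbadyy", 2, 4) returns "xx***yy".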
In the heading of this section I said that the structure built by the DFA algorithm is actually a forest. That is because, for a more complete sensitive word list, the structure looks like this:
If you ignore the root node of the structure, that is, the blank node, what remains is a forest made up of trees.
Now that we understand how the DFA algorithm matches sensitive words, we can discuss at the code level how to build such a forest from the sensitive word list.
Construction of forest structure for DFA algorithm
Both trees and forests are made up of nodes, so we first discuss what information a node in this structure should store.
In an ordinary tree structure, a node stores its own value and pointers to its child nodes.
In the DFA structure, however, the number of child nodes is not known in advance. We could use a List to store the pointers to all child nodes, but then matching would have to traverse the entire List to find a path, which is relatively slow.
To achieve O(1) lookup, we can store the pointers to child nodes in a hash table instead.
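As an illustration of the idea, here is a minimal sketch of a node whose children live in a hash table. The class TrieNode and its members are hypothetical; the article itself goes on to use plain hash tables for the nodes rather than a dedicated class.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical typed node, shown only to illustrate storing children in a hash table. */
final class TrieNode {
    /** Child nodes keyed by the character on the edge that leads to them. */
    final Map<Character, TrieNode> children = new HashMap<>();
    /** True when the path from the root to this node spells a complete sensitive word. */
    boolean isEnd;

    /** Returns the child reached by c, or null if there is none (average O(1), no list scan). */
    TrieNode child(char c) {
        return children.get(c);
    }

    /** Returns the existing child for c, creating it first if it does not exist yet. */
    TrieNode childOrCreate(char c) {
        return children.computeIfAbsent(c, k -> new TrieNode());
    }
}
```

With a List of children, child(c) would have to compare c against every entry; with the hash table, it is a single lookup regardless of how many children the node has.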
We can also use a hash table directly as the entry node of the forest:
This hash table stores a series of key-value pairs: each Key is the first character of some sensitive word, and each Value is the node that represents that character.
And because a hash table can store values of different types (as long as they inherit from Object), we can also store a key-value pair whose Key is "IsEnd" and whose Value is 0. A value of 0 means that the current node is not the end of a sensitive word, and a value of 1 means that it is.
The other nodes in the structure can be built from hash tables as well. The character a node represents is already stored as a key in its parent node (the structure ultimately has a blank root node whose key-value pairs hold the first characters of the sensitive words, and each Value is itself a hash table, that is, a child node).
Each node, being a hash table, also stores a key-value pair whose Key is "IsEnd" and whose Value is 0 or 1. In addition, it stores a series of key-value pairs in which the Key is the character represented by a child node and the Value is that child node (a hash table).
Let's give another specific example:
The structure is as follows:
The structure starts at its blank root node, which is a hash table. We call it map.
Then, for the sensitive word "I love you", the lookup is:
map['I']['love']['you']['IsEnd'] == 1
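The chained lookup can be sketched in Java with nested hash tables. This is only an illustration under stated assumptions: the class NestedMapLookup and its methods are hypothetical, and the three letters of "bad" stand in for the three characters of the sensitive word. The "IsEnd" key follows the layout described above.

```java
import java.util.HashMap;

/** Hypothetical demo class; "bad" stands in for a three-character sensitive word. */
final class NestedMapLookup {

    /** Builds root -> 'b' -> 'a' -> 'd' by hand and marks only the last node as an end. */
    static HashMap<Object, Object> buildBad() {
        HashMap<Object, Object> d = new HashMap<>();
        d.put("IsEnd", 1);
        HashMap<Object, Object> a = new HashMap<>();
        a.put("IsEnd", 0);
        a.put('d', d);
        HashMap<Object, Object> b = new HashMap<>();
        b.put("IsEnd", 0);
        b.put('a', a);
        HashMap<Object, Object> root = new HashMap<>();   // blank root node
        root.put('b', b);
        return root;
    }

    /** Follows one character per level and reports whether the final node has IsEnd == 1. */
    @SuppressWarnings("unchecked")
    static boolean isSensitiveWord(HashMap<Object, Object> root, String word) {
        HashMap<Object, Object> node = root;
        for (int i = 0; i < word.length(); i++) {
            Object next = node.get(word.charAt(i));
            if (next == null) {
                return false;                             // no path for this character
            }
            node = (HashMap<Object, Object>) next;
        }
        return Integer.valueOf(1).equals(node.get("IsEnd"));
    }
}
```

Here isSensitiveWord(buildBad(), "bad") is true, while "ba" (a node with IsEnd 0) and "bat" (no path for 't') both yield false.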
From the above analysis we get the general procedure for building the structure in code:
1. Create a hash table as the blank root node of the structure
2. Traverse the sensitive word thesaurus to get a sensitive word string
3. Traverse the sensitive word string to get a current traversal character
4. Check whether the current character is already in the tree structure. If it is, move directly to the existing node and continue with the next character.
The lookup works as follows.
For the first character of the sensitive word string:
indexMap = map // acts as a pointer to a node of the tree structure
if (indexMap.ContainsKey('c')) indexMap = indexMap['c']
In this way, indexMap becomes a pointer to the matching node that already exists in the tree structure.
The same applies to the following characters:
if (indexMap.ContainsKey('c')) indexMap = indexMap['c']
If the tree structure is still empty, or if none of the child nodes of the node the pointer refers to represents the current character, we need to create a child node, that is, add a key-value pair whose Key is the current character and whose Value is a new hash table.
5. Determine whether the current character is the last one in the current string. If it is, set the key-value pair whose Key is "IsEnd" to 1 on its node. If not, the pair is added with the value 0.
This concludes the discussion of how the DFA structure is built. The construction code (implemented in Java) follows.
DFA algorithm structure initialization construction code
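```java
import java.util.HashMap;
import java.util.List;

// Blank root node of the forest, shared by the construction and search code.
private static HashMap<Object, Object> map;

/**
 * Constructs the sensitive word tree.
 *
 * @param words the sensitive word list
 */
@SuppressWarnings("unchecked")
private static void InitFilter(List<String> words) {
    map = new HashMap<>(words.size());
    for (int i = 0; i < words.size(); i++) {
        String word = words.get(i);
        HashMap<Object, Object> indexMap = map;
        for (int j = 0; j < word.length(); j++) {
            char c = word.charAt(j);
            if (indexMap.containsKey(c)) {
                // The node already exists: follow it.
                indexMap = (HashMap<Object, Object>) indexMap.get(c);
            } else {
                // Create a child node that is not (yet) the end of a word.
                HashMap<Object, Object> newMap = new HashMap<>();
                newMap.put("IsEnd", 0);
                indexMap.put(c, newMap);
                indexMap = newMap;
            }
            if (j == word.length() - 1) {
                // Last character: mark this node as the end of a sensitive word.
                indexMap.put("IsEnd", 1);
            }
        }
    }
}
```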
DFA algorithm search process
The principle of the DFA search process was discussed above with examples. The search is quite similar to initializing the structure, so the code is given directly without further detail.
Code Implementation of DFA Algorithm Searching Process
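```java
/**
 * Checks whether a sensitive word starts at beginIndex.
 *
 * @param txt        text to inspect
 * @param beginIndex index at which matching starts
 * @return length of the longest sensitive word starting there, or 0 if none
 */
@SuppressWarnings("unchecked")
private static int CheckFilterWord(String txt, int beginIndex) {
    int matchLen = 0; // length of the longest complete word seen so far
    HashMap<Object, Object> curMap = map;
    for (int i = beginIndex; i < txt.length(); i++) {
        HashMap<Object, Object> next = (HashMap<Object, Object>) curMap.get(txt.charAt(i));
        if (next == null) {
            break; // no path for this character
        }
        curMap = next; // always descend so that longer words can still match
        if (Integer.valueOf(1).equals(curMap.get("IsEnd"))) {
            matchLen = i - beginIndex + 1;
        }
    }
    return matchLen;
}

/**
 * Replaces every sensitive word in txt with asterisks.
 *
 * @param txt text entered by the player
 * @return the text with each sensitive word masked
 */
public static String SerachFilterWordAndReplace(String txt) {
    int i = 0;
    StringBuilder sb = new StringBuilder(txt);
    while (i < txt.length()) {
        int len = CheckFilterWord(txt, i);
        if (len > 0) {
            for (int j = 0; j < len; j++) {
                sb.setCharAt(i + j, '*');
            }
            i += len;
        } else {
            ++i;
        }
    }
    return sb.toString();
}
```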
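To try the whole pipeline in one file, the following compact sketch combines construction and search. It is a variant under stated assumptions and not the code above: the class SensitiveWordFilter, its nested Node class, and the methods matchLength and filter are all hypothetical names, and a typed node replaces the mixed-key hash tables.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical end-to-end sketch; class and method names are illustrative. */
final class SensitiveWordFilter {

    /** One node of the forest: children keyed by character plus an end-of-word flag. */
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean isEnd;
    }

    private final Node root = new Node();   // blank root node

    SensitiveWordFilter(List<String> words) {
        for (String word : words) {
            Node node = root;
            for (int j = 0; j < word.length(); j++) {
                node = node.children.computeIfAbsent(word.charAt(j), k -> new Node());
            }
            if (node != root) {
                node.isEnd = true;           // empty words are ignored
            }
        }
    }

    /** Length of the longest sensitive word starting at beginIndex, or 0 if there is none. */
    private int matchLength(String txt, int beginIndex) {
        Node node = root;
        int longest = 0;
        for (int i = beginIndex; i < txt.length(); i++) {
            node = node.children.get(txt.charAt(i));
            if (node == null) {
                break;                       // no path for this character
            }
            if (node.isEnd) {
                longest = i - beginIndex + 1;
            }
        }
        return longest;
    }

    /** Replaces every matched sensitive word in txt with asterisks. */
    String filter(String txt) {
        StringBuilder sb = new StringBuilder(txt);
        int i = 0;
        while (i < txt.length()) {
            int len = matchLength(txt, i);
            if (len > 0) {
                for (int j = 0; j < len; j++) {
                    sb.setCharAt(i + j, '*');
                }
                i += len;
            } else {
                i++;
            }
        }
        return sb.toString();
    }
}
```

For example, new SensitiveWordFilter(Arrays.asList("bad", "badly")).filter("a badly bad day") returns "a ***** *** day": at the first match the search keeps descending past "bad" and masks the longer word "badly".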