字符串匹配4：AC自动机

多模式匹配

BF、RK、BM和KMP都是单模式匹配算法，也就是一个主串只能匹配一个模式串。在很多时候我们还需要多模式匹配，也就是一个主串匹配多个模式串，比如聊天敏感词。敏感词是一个列表，一个主串可能包含了多个敏感词。

树3：字典树（Trie树）中字典树是一种多模式匹配，命中之后可以提供多条建议，实际上是匹配上了多条建议。

自动机

自动机，通常指的是确定有限状态自动机（DFA）。自动机是一个对信号序列进行判定的数学模型。

自动机核心由状态和转移函数组成。一个 确定有限状态自动机（DFA） 通常包含以下几个部分：

序列字符集C，也是就是可输入的序列字符的集合，不可输入就必然不可接受。
状态集合Q，每个序列字符的输入都会到达一个状态，如果将DFA当作一个有向图，状态集合就是顶点集合。
起始状态s，起始状态是一个状态集合中特殊的状态，序列的起始输入由此开始转移。
接收状态集合A，是状态集合的一个子集合，到达这个状态表示该序列已经被自动机接受了，也就是匹配了。
转移函数f，每个序列字符输入之后都需要经过转移函数将状态由a转移到b。如果把一个 DFA 看成一张有向图，那么 DFA 中的转移函数就相当于顶点间的边，而每条边上都有一个字符。

自动机是一个抽象数学模型，我们在实践中主要是理解和应用其思想，所以不同的实现的时空复杂度都会不一样。

假设我们有这样一个用来判断二进制数是否是偶数的状态机，如下:

    ———1——> 
接受         不接受
    <---0---

自动机接受状态开始从高位到低位接受二进制数的序列，如果是1则将状态转移成不接受，如果是0则将状态转移成接受，如果输入最后一位状态停在了接受，那么就表示该序列被自动机接受了，也就是这是一个偶数。

[[树4：字典树（Trie树）]]字典树也是一个自动机，转移函数就是字典树的边，接受状态是将每个字符串插入到字典树时到达的那个状态，需要注意的是假设我们字典数用来补全单词，里面有how，hi两个单词，当序列输入是h时，我们将可到达的how和hi全部输出，而对于输入序列hello，状态就是不可接受的。

[[字符串匹配3：KMP算法]]KMP算法也是一个自动机，基于字符串的 KMP 自动机接受且仅接受以为后缀的字符串。转移函数：

            i+1  (char == s[i+1])
f(i,char) = 0    (i==0, s[i+1] != char)
            f(next[i],char)

AC自动机，接受且仅接受以指定的字符串集合中的某个元素为后缀的字符串，本质是字典树思想与KMP思想的结合。

AC自动机

AC（Aho–Corasick）自动机是 以 Trie 的结构为基础，结合 KMP 的思想 建立的自动机，用于解决多模式匹配等任务。

AC 自动机本质上是 Trie 上的自动机。

简单来说，建立一个 AC 自动机有两个步骤：

基础的 Trie 结构：将所有的模式串构成一棵 Trie；
KMP 的思想：对 Trie 树上所有的结点构造失配指针。

需要注意的是，要在插入全部元素，构建完成字典树之后再处理失败指针。

构建完成后就可以对字符串进行多模式匹配。匹配规则是：

遍历主串，和KMP思想一样，遍历主串的index是不会倒退的。设置当前节点为root节点
匹配主串当前位置的字符与是否再当前节点的子节点上。
如果不在，则沿着失败指针查找，直到root节点（null）还没匹配到，则设置当前节点为root节点，如果查找到了，则将当前节点设置为失败指针指向的节点的匹配子节点。
如果在，则将当前节点设置为当前节点的匹配子节点。

package cn.lishiyuan.algorithm.strings;  
  
import java.util.*;  
import java.util.function.Function;  
  
/**  
 * AC自动机多模式匹配  
 */  
public class ACMatcher  {  
    // 分词  
    private final Function<String,List<String>> split;  
    // 合并词  
    private final Function<List<String>,String> merge;  
  
    private final Node root;  
  
    private static class Node{  
        String word;  
  
        // 子节点  
        Map<String, Node> children;  
  
        // 是否是最后一个字母或者单词  
        boolean isEndWord = false;  
  
        int length = 0;  
  
        Node fail;  
  
        Node(String word){  
            this.word = word;  
            this.children = new HashMap<>();  
            fail = null;  
        }  
    }  
  
    public ACMatcher(Function<String,List<String>> split, Function<List<String>,String> merge){  
        this.split = split;  
        this.merge = merge;  
        this.root = new Node(null);  
    }  
  
    public void insert(String words){  
        if (words == null || words.trim().isEmpty()){  
            return;  
        }  
  
        // 分词，构建树节点  
        List<String> wordsList = split.apply(words);  
        // 匹配字典数据  
        Node parent = root;  
  
        for (int i = 0; i < wordsList.size(); i++) {  
  
            String word = wordsList.get(i);  
            if (parent.children.containsKey(word)){  
                // 存在相同节点则继续向后查找  
                parent = parent.children.get(word);  
            }else {  
                Node node = new Node(word);  
                node.isEndWord = (i == (wordsList.size() - 1));  
                if(node.isEndWord){  
                    // 分词后的长度  
                    node.length = wordsList.size();  
                }  
                parent.children.put(word,node);  
                parent = node;  
            }  
        }  
  
    }  
  
    // 全部敏感词插入完成后  
    // 手动构建fail指针  
    public void buildFailure() {  
        // 通过广度优先遍历构建fail指针  
        Queue<Node> queue = new LinkedList<>();  
        root.fail = null;  
        for (Node child : root.children.values()) {  
            // root的子节点的fail指针是root  
            child.fail = root;  
            queue.add(child);  
        }  
        while (!queue.isEmpty()) {  
            Node current = queue.poll();  
            for (Map.Entry<String, Node> entry : current.children.entrySet()) {  
                Node child = entry.getValue();  
                // 父节点的失败指针  
                Node fail = current.fail;  
                while (fail != null && !fail.children.containsKey(entry.getKey())) {  
                    // 失败指针的失败指针  
                    fail = fail.fail;  
                }  
                // 如果没找到就指向root  
                child.fail = (fail != null) ? fail.children.get(entry.getKey()) : root;  
                queue.add(child);  
            }  
        }  
    }  
  
    public List<String> find(String text){  
        if (text == null || text.trim().isEmpty()){  
            return Collections.emptyList();  
        }  
  
        // 如果不最后一个词（也就是中途break了），则返回空列表  
        List<String> result = new ArrayList<>();  
  
        // 分词，构建树节点  
        List<String> wordsList = split.apply(text);  
        // 匹配字典数据  
        Node parent = root;  
  
        for (int startIndex = 0; startIndex < wordsList.size(); startIndex++) {  
  
            String key = wordsList.get(startIndex);  
            // index 只有一个方向  
            // 字典如果没有查找到就从下个字重新开始查找  
            if (parent.children.containsKey(key)) {  
                parent = parent.children.get(key);  
                // 查找下一个字符  
                extractWord(parent, wordsList,result, startIndex);  
            }else {  
                Node fail = parent.fail;  
                // 如果不为null且fail子节点不包含下个词，则还需要再从缩小查找范围  
                while (fail!=null && !fail.children.containsKey(key)){  
                    fail = fail.fail;  
                }  
                if (fail == null){  
                    // 不是敏感词  
                    parent = root;  
                }else {  
                    // 查到了，继续查找  
                    parent = fail.children.get(key);  
                    extractWord(parent, wordsList,result, startIndex);  
                }  
  
            }  
  
        }  
  
        return result;  
    }  
  
    private void extractWord(Node parent, List<String> wordsList,List<String> result, int startIndex) {  
        Node current = parent;  
        while (current != root) {  
            if (current.isEndWord) {  
                // 记录单词  
                List<String> list = wordsList.subList(startIndex - current.length + 1, startIndex + 1);  
                String word = merge.apply(list);  
                result.add(word);  
            }  
            current = current.fail;  
        }  
    }  
  
    public boolean match(String test){  
        List<String> words = find(test);  
        return !words.isEmpty();  
    }  
  
}

使用方法是，先调用insert将敏感词插入到自动机的字典树上面，再调用buildFailure广度优先搜索，构建失败指针，最后调用find查找主串中含有的敏感词。

多模式匹配

自动机

AC自动机

评论已关闭

最新文章

分类