Today I'd like to recommend a high-performance open-source library for sensitive-word detection.
01
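Project Overview
This is a high-performance sensitive-word toolkit built on .NET. It supports conversion between Traditional and Simplified Chinese, conversion between full-width and half-width characters, fuzzy search by pinyin, and more. It is feature-rich and fast: it is described as able to check text on the order of a hundred million characters in a matter of seconds.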
02
Technical Architecture
Cross-platform: the library is built on .NET Core 3.1, so it can be deployed on Docker, Windows, Linux, and macOS.
03
Project Structure
04
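Usage
Sensitive-word detection
This filters sensitive words. You can set the skip-character length; by default it converts full-width characters to half-width, ignores case, and handles skipped characters, repeated characters, and a blacklist. Each result includes the keyword, its start position, its end position, the keyword's index, and related information.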
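// Keyword list, separated by '|' here purely for convenience.
string s = "中國|國人|zg人";
string test = "我是中國人";
StringSearch iwords = new StringSearch();
iwords.SetKeywords(s.Split('|'));
// ContainsAny: true if the text contains at least one keyword.
var b = iwords.ContainsAny(test);
Assert.AreEqual(true, b);
// FindFirst: the first keyword found in the text.
var f = iwords.FindFirst(test);
Assert.AreEqual("中國", f);
// FindAll: every keyword found, including overlapping matches.
var all = iwords.FindAll(test);
Assert.AreEqual("中國", all[0]);
Assert.AreEqual("國人", all[1]);
Assert.AreEqual(2, all.Count);
// Replace: mask every matched character with '*'.
var str = iwords.Replace(test, '*');
Assert.AreEqual("我是***", str);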
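To give a feel for what a FindAll-style multi-keyword search does, here is a minimal sketch built on a plain trie. It is not ToolGood.Words' implementation (this article does not describe the library's internals); the TrieSketch class and its members are hypothetical names used only for illustration, and the naive scan restarts from every position rather than using an optimized automaton.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration only; not the ToolGood.Words implementation.
class TrieSketch
{
    private sealed class Node
    {
        public readonly Dictionary<char, Node> Next = new Dictionary<char, Node>();
        public string Keyword; // non-null when a keyword ends at this node
    }

    private readonly Node _root = new Node();

    // Insert every keyword into the trie, one character per edge.
    public void SetKeywords(IEnumerable<string> keywords)
    {
        foreach (var keyword in keywords)
        {
            if (string.IsNullOrEmpty(keyword)) continue;
            var node = _root;
            foreach (char c in keyword)
            {
                if (!node.Next.TryGetValue(c, out var child))
                {
                    child = new Node();
                    node.Next[c] = child;
                }
                node = child;
            }
            node.Keyword = keyword;
        }
    }

    // Walk the trie from every start position and collect each keyword hit,
    // so overlapping matches are all reported, ordered by start position.
    public List<string> FindAll(string text)
    {
        var found = new List<string>();
        for (int start = 0; start < text.Length; start++)
        {
            var node = _root;
            for (int i = start; i < text.Length; i++)
            {
                if (!node.Next.TryGetValue(text[i], out node)) break;
                if (node.Keyword != null) found.Add(node.Keyword);
            }
        }
        return found;
    }
}
```

With the keywords from the example above ("中國", "國人", "zg人") and the text "我是中國人", this sketch returns "中國" followed by "國人". A production library would normally avoid rescanning from each position, which is where much of the speed difference comes from.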
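Wildcard sensitive-word detection
Keywords can use regular-expression-style syntax: . ? [ ] |. These patterns allow fuzzy matching, which improves detection accuracy.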
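// The first keyword is a pattern: any one character, then 中 or 美, then 國.
string s = ".[中美]國|國人|zg人";
string test = "我是中國人";
WordsMatch wordsSearch = new WordsMatch();
wordsSearch.SetKeywords(s.Split('|'));
var b = wordsSearch.ContainsAny(test);
Assert.AreEqual(true, b);
// FindFirst returns a result object; Keyword is the text that was matched.
var f = wordsSearch.FindFirst(test);
Assert.AreEqual("是中國", f.Keyword);
var alls = wordsSearch.FindAll(test);
Assert.AreEqual("是中國", alls[0].Keyword);
// MatchKeyword is the pattern that produced the match.
Assert.AreEqual(".[中美]國", alls[0].MatchKeyword);
// Start and End are the positions of the first and last matched characters.
Assert.AreEqual(1, alls[0].Start);
Assert.AreEqual(3, alls[0].End);
Assert.AreEqual(0, alls[0].Index); // Index of the matched keyword; starts at 0 by default
Assert.AreEqual("國人", alls[1].Keyword);
Assert.AreEqual(2, alls.Count);
// The two matches overlap, so four characters are masked.
var t = wordsSearch.Replace(test, '*');
Assert.AreEqual("我****", t);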
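For readers more familiar with .NET's built-in regular expressions, the sketch below runs the same pattern through System.Text.RegularExpressions.Regex to show what the pattern means. It is only a comparison, not how WordsMatch works internally; the PatternSketch class and FirstMatch method are hypothetical names, and the inclusive End value is computed here to mirror the Start and End values asserted above.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical comparison helper; it uses the built-in Regex class, not ToolGood.Words.
static class PatternSketch
{
    // Returns the first match together with inclusive start and end positions,
    // or (null, -1, -1) when the pattern does not match.
    public static (string Value, int Start, int End) FirstMatch(string text, string pattern)
    {
        Match m = Regex.Match(text, pattern);
        if (!m.Success) return (null, -1, -1);
        return (m.Value, m.Index, m.Index + m.Length - 1);
    }
}
```

Calling PatternSketch.FirstMatch("我是中國人", ".[中美]國") yields the value "是中國" with start 1 and end 3, the same span the WordsMatch example reports for its first match.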
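Pinyin conversion, Traditional/Simplified conversion, and number-to-uppercase-numeral conversion
The toolkit also bundles conversion between Traditional and Simplified Chinese, pinyin conversion, first-letter extraction, and conversion between numbers and Chinese uppercase numerals. Examples: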
// Convert to Simplified Chinese
WordsHelper.ToSimplifiedChinese("我愛中國");
WordsHelper.ToSimplifiedChinese("我愛中國", 1); // Hong Kong/Macau Traditional to Simplified
WordsHelper.ToSimplifiedChinese("我愛中國", 2); // Taiwan Traditional to Simplified
// Convert to Traditional Chinese
WordsHelper.ToTraditionalChinese("我爱中国");
WordsHelper.ToTraditionalChinese("我爱中国", 1); // Simplified to Hong Kong/Macau Traditional
WordsHelper.ToTraditionalChinese("我爱中国", 2); // Simplified to Taiwan Traditional
// Convert to full-width
WordsHelper.ToSBC("abcABC123");
// Convert to half-width
WordsHelper.ToDBC("abcABC123");
// Convert a number to Chinese uppercase (RMB) numerals
WordsHelper.ToChineseRMB(12345678901.12);
// Convert Chinese uppercase numerals to a number
WordsHelper.ToNumber("壹佰贰拾叁亿肆仟伍佰陆拾柒万捌仟玖佰零壹元壹角贰分");
// Get the full pinyin
WordsHelper.GetPinyin("我愛中國"); // WoAiZhongGuo
WordsHelper.GetPinyin("我愛中國", ","); // Wo,Ai,Zhong,Guo
WordsHelper.GetPinyin("我愛中國", true); // WǒÀiZhōngGuó
// Get the first letters
WordsHelper.GetFirstPinyin("我愛中國"); // WAZG
// Get all pinyin readings of a character
WordsHelper.GetAllPinyin('傳'); // Chuan,Zhuan
// Get the pinyin of a personal name (surname readings apply)
WordsHelper.GetPinyinForName("單一一"); // ShanYiYi
WordsHelper.GetPinyinForName("單一一", ","); // Shan,Yi,Yi
WordsHelper.GetPinyinForName("單一一", true); // ShànYīYī
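As an illustration of what a full-width to half-width conversion involves, here is a minimal sketch based on the Unicode layout: the full-width forms U+FF01 to U+FF5E sit exactly 0xFEE0 above their ASCII counterparts, and the ideographic space U+3000 maps to an ordinary space. This is an assumption-based stand-in, not the code behind WordsHelper.ToDBC; the WidthSketch class and ToHalfWidth method are hypothetical names.

```csharp
using System;
using System.Text;

// Hypothetical illustration only; not the ToolGood.Words implementation.
static class WidthSketch
{
    public static string ToHalfWidth(string input)
    {
        var sb = new StringBuilder(input.Length);
        foreach (char c in input)
        {
            if (c == '\u3000')
            {
                sb.Append(' '); // ideographic space becomes an ASCII space
            }
            else if (c >= '\uFF01' && c <= '\uFF5E')
            {
                sb.Append((char)(c - 0xFEE0)); // shift full-width forms down to ASCII
            }
            else
            {
                sb.Append(c); // everything else is left untouched
            }
        }
        return sb.ToString();
    }
}
```

For example, WidthSketch.ToHalfWidth("abcABC123") returns "abcABC123". Normalizing width before matching is what lets a filter catch keywords typed in full-width letters.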
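Performance Comparison
Below we take a 1,000-character string, run each operation 100,000 times, and compare the results. The test code is as follows: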
ReadBadWord();
var text = File.ReadAllText("Talk.txt");
Console.WriteLine("-------------------- FindFirst OR ContainsAny, 100000 runs --------------------");
Run("TrieFilter", () => { tf1.HasBadWord(text); });
Run("FastFilter", () => { ff.HasBadWord(text); });
Run("StringSearch(ContainsAny)", () => { stringSearch.ContainsAny(text); });
Run("StringSearchEx(ContainsAny) --- same code as WordsSearchEx(ContainsAny)", () => { stringSearchEx.ContainsAny(text); });
Run("StringSearchEx2(ContainsAny) --- same code as WordsSearchEx2(ContainsAny)", () => { stringSearchEx2.ContainsAny(text); });
Run("StringSearchEx3(ContainsAny) --- same code as WordsSearchEx3(ContainsAny)", () => { stringSearchEx3.ContainsAny(text); });
Run("IllegalWordsSearch(ContainsAny)", () => { illegalWordsSearch.ContainsAny(text); });
Run("StringSearch(FindFirst)", () => { stringSearch.FindFirst(text); });
Run("StringSearchEx(FindFirst)", () => { stringSearchEx.FindFirst(text); });
Run("StringSearchEx2(FindFirst)", () => { stringSearchEx2.FindFirst(text); });
Run("StringSearchEx3(FindFirst)", () => { stringSearchEx3.FindFirst(text); });
Run("WordsSearch(FindFirst)", () => { wordsSearch.FindFirst(text); });
Run("WordsSearchEx(FindFirst)", () => { wordsSearchEx.FindFirst(text); });
Run("WordsSearchEx2(FindFirst)", () => { wordsSearchEx2.FindFirst(text); });
Run("WordsSearchEx3(FindFirst)", () => { wordsSearchEx3.FindFirst(text); });
Run("IllegalWordsSearch(FindFirst)", () => { illegalWordsSearch.FindFirst(text); });
Console.WriteLine("-------------------- Find All, 100000 runs --------------------");
Run("TrieFilter(FindAll)", () => { tf1.FindAll(text); });
Run("FastFilter(FindAll)", () => { ff.FindAll(text); });
Run("StringSearch(FindAll)", () => { stringSearch.FindAll(text); });
Run("StringSearchEx(FindAll)", () => { stringSearchEx.FindAll(text); });
Run("StringSearchEx2(FindAll)", () => { stringSearchEx2.FindAll(text); });
Run("StringSearchEx3(FindAll)", () => { stringSearchEx3.FindAll(text); });
Run("WordsSearch(FindAll)", () => { wordsSearch.FindAll(text); });
Run("WordsSearchEx(FindAll)", () => { wordsSearchEx.FindAll(text); });
Run("WordsSearchEx2(FindAll)", () => { wordsSearchEx2.FindAll(text); });
Run("WordsSearchEx3(FindAll)", () => { wordsSearchEx3.FindAll(text); });
Run("IllegalWordsSearch(FindAll)", () => { illegalWordsSearch.FindAll(text); });
Console.WriteLine("-------------------- Replace, 100000 runs --------------------");
Run("TrieFilter(Replace)", () => { tf1.Replace(text); });
Run("FastFilter(Replace)", () => { ff.Replace(text); });
Run("StringSearch(Replace)", () => { stringSearch.Replace(text); });
Run("WordsSearch(Replace)", () => { wordsSearch.Replace(text); });
Run("StringSearchEx(Replace) --- same code as WordsSearchEx(Replace)", () => { stringSearchEx.Replace(text); });
Run("StringSearchEx2(Replace) --- same code as WordsSearchEx2(Replace)", () => { stringSearchEx2.Replace(text); });
Run("StringSearchEx3(Replace) --- same code as WordsSearchEx3(Replace)", () => { stringSearchEx3.Replace(text); });
Run("IllegalWordsSearch(Replace)", () => { illegalWordsSearch.Replace(text); });
Console.WriteLine("-------------------- Regex, 100 runs --------------------");
Run(100, "Regex.IsMatch", () => { re.IsMatch(text); });
Run(100, "Regex.Match", () => { re.Match(text); });
Run(100, "Regex.Matches", () => { re.Matches(text); });
Console.WriteLine("-------------------- Regex using a trie tree, 100 runs --------------------");
Run(100, "Regex.IsMatch", () => { re2.IsMatch(text); });
Run(100, "Regex.Match", () => { re2.Match(text); });
Run(100, "Regex.Matches", () => { re2.Matches(text); });
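The Run helper called in the test code is not shown in this article. A minimal sketch of what such a helper could look like follows, assuming it simply times a fixed number of invocations with System.Diagnostics.Stopwatch and that the two-argument form defaults to 100,000 iterations. The Bench class, the warm-up call, and the returned elapsed time are assumptions made for illustration, not the benchmark's actual code.

```csharp
using System;
using System.Diagnostics;

// Hypothetical timing helper; the real benchmark's Run method may differ.
static class Bench
{
    // Times `count` invocations of `action` and prints the elapsed milliseconds.
    public static long Run(int count, string title, Action action)
    {
        action(); // warm-up call so JIT compilation is not part of the measurement
        var watch = Stopwatch.StartNew();
        for (int i = 0; i < count; i++)
        {
            action();
        }
        watch.Stop();
        Console.WriteLine($"{title} : {watch.ElapsedMilliseconds}ms");
        return watch.ElapsedMilliseconds;
    }

    // Two-argument overload, assumed to default to 100,000 iterations.
    public static long Run(string title, Action action)
    {
        return Run(100000, title, action);
    }
}
```

Note that the regex cases above run only 100 times while the other cases run 100,000 times, so raw elapsed times have to be scaled to a per-call figure before they can be compared.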
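The results of the 100,000-run comparison are as follows:
According to these test results, the tool is 8.8 times faster than C#'s built-in regular expressions, and the advantage becomes more pronounced as the volume of data grows.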
05
Project URL
https://github.com/toolgood/ToolGood.Words