PHP如何實(shí)現(xiàn)非法詞匯過濾

這篇文章主要介紹“PHP如何實(shí)現(xiàn)非法詞匯過濾”的相關(guān)知識，小編通過實(shí)際案例向大家展示操作過程，操作方法簡單快捷，實(shí)用性強(qiáng)，希望這篇“PHP如何實(shí)現(xiàn)非法詞匯過濾”文章能幫助大家解決問題。

創(chuàng)新互聯(lián)建站為企業(yè)提供：品牌網(wǎng)站建設(shè)、網(wǎng)絡(luò)營銷策劃、小程序定制開發(fā)、營銷型網(wǎng)站建設(shè)和網(wǎng)站運(yùn)營托管，一站式網(wǎng)絡(luò)營銷整體服務(wù)。實(shí)現(xiàn)不斷獲取潛在客戶之核心目標(biāo)，建立了企業(yè)專屬的“營銷型網(wǎng)站建設(shè)”，就用不著再為了獲取潛在客戶而苦惱，相反，客戶會主動找您，生意就找上門來了！

算法簡介

將關(guān)鍵詞構(gòu)造成一顆樹，每個字都是一個節(jié)點(diǎn)。

遍歷需要過濾的語句，將語句的每個字都去樹中查找，看看是否存在。

實(shí)現(xiàn)難點(diǎn)

構(gòu)造一棵樹簡單，關(guān)鍵點(diǎn)是php中遍歷字符串需要自己正確的得到單個字符的長度。
簡單遍歷字符串的方法如下：

$strLen = mb_strlen($str);
for ($i = 0; $i < $strLen; $i++) {
    echo mb_substr($str, $i, 1, "utf8"),PHP_EOL;
}

該方法是利用mb_*系列函數(shù)來正確截取每個字符，處理大量字符串時速度非常慢，我猜測是：mb_substr每截取一個字符，都要計(jì)算該字符串之前，有多少個字符。
正確的遍歷字符串的方式是按utf8的編碼規(guī)律來截取字符串，具體請看下文。

算法實(shí)現(xiàn)

tree = new WordNode();
        $file = fopen($path, "r");
        while (!feof($file)) {
            $words = trim(fgets($file));
            if ($words == '') {
                continue;
            }
            //存在純數(shù)字的非法詞匯
            if (is_numeric($words)) {
                $this->callIsNumeric = false;
            }
            $this->setTree($words);
        }
        fclose($file);
    }

    protected function setTree($words)
    {
        $array = $this->strToArr($words);
        $tree = $this->tree;
        $l = count($array) - 1;
        foreach ($array as $k => $item) {
            $tree = $tree->getChildAlways($item);
            if ($l == $k) {
                $tree->end = true;
            }
        }
    }

    /**
     * 返回包含的非法詞匯
     * @param string $str
     * @return array
     */
    public function check($str)
    {
        //先壓縮字符串
        $str = trim(str_replace([' ', "\n", "\r"], ['', '', ''], $str));
        $ret = [];
        loop:
        $strLen = strlen($str);
        if ($strLen === 0) {
            return array_unique($ret);
        }
        //非法詞匯中沒有純數(shù)字的非法詞匯，待檢測字符串又是純數(shù)字的，則跳過不再檢查
        if ($this->callIsNumeric && is_numeric($str)) {
            return array_unique($ret);
        }
        //挨個字符進(jìn)行判斷
        $tree = $this->tree;
        $words = '';
        for ($i = 0; $i < $strLen; $i++) {
            //unicode范圍 --> ord 范圍
            //一字節(jié) 0-127 --> 0 - 127
            //二字節(jié) 128-2047 --> 194 - 223
            //三字節(jié) 2048-65535 --> 224 - 239
            //四字節(jié) 65536-1114111 --> 240 - 244
            //@see http://shouce.jb51.net/gopl-zh/ch4/ch4-05.html
            $ord = ord($str[$i]);
            if ($ord <= 127) {
                $word = $str[$i];
            } elseif ($ord <= 223) {
                $word = $str[$i] . $str[$i + 1];
                $i += 1;
            } elseif ($ord <= 239) {
                $word = $str[$i] . $str[$i + 1] . $str[$i + 2];
                $i += 2;
            } elseif ($ord <= 244) {
                //四字節(jié)
                $word = $str[$i] . $str[$i + 1] . $str[$i + 2] . $str[$i + 3];
                $i += 3;
            } else {
                //五字節(jié)php都溢出了
                //Parse error: Invalid UTF-8 codepoint escape sequence: Codepoint too large
                continue;
            }
            //判斷當(dāng)前字符
            $tree = $tree->getChild($word);
            if (is_null($tree)) {
                //當(dāng)前字不存在，則截取后再次循環(huán)
                $str = substr($str, $i + 1);
                goto loop;
            } else {
                $words .= $word;
                if ($tree->end) {
                    $ret[] = $words;
                }
            }
        }
        return array_unique($ret);
    }

    protected function strToArr($str)
    {
        $array = [];
        $strLen = mb_strlen($str);
        for ($i = 0; $i < $strLen; $i++) {
            $array[] = mb_substr($str, $i, 1, "utf8");
        }
        return $array;
    }
}
/**
 * 單個字符的節(jié)點(diǎn)
 */
class WordNode
{
    //是否為非法詞匯末級節(jié)點(diǎn)
    public $end = false;
    //子節(jié)點(diǎn)
    protected $child = [];

    /**
     * @param string $word
     * @return WordNode
     */
    public function getChildAlways($word)
    {
        if (!isset($this->child[$word])) {
            $this->child[$word] = new self();
        }
        return $this->child[$word];
    }

    /**
     * @param string $word
     * @return WordNode|null
     */
    public function getChild($word)
    {
        if ($word === '') {
            return null;
        }
        if (isset($this->child[$word])) {
            return $this->child[$word];
        }
        return null;
    }
}

關(guān)于“PHP如何實(shí)現(xiàn)非法詞匯過濾”的內(nèi)容就介紹到這里了，感謝大家的閱讀。如果想了解更多行業(yè)相關(guān)的知識，可以關(guān)注創(chuàng)新互聯(lián)行業(yè)資訊頻道，小編每天都會為大家更新不同的知識點(diǎn)。

網(wǎng)站題目：PHP如何實(shí)現(xiàn)非法詞匯過濾
網(wǎng)頁網(wǎng)址：http://weahome.cn/article/geieps.html

真实的国产乱ⅩXXX66竹夫人,五月香六月婷婷激情综合,亚洲日本VA一区二区三区,亚洲精品一区二区三区麻豆

PHP如何實(shí)現(xiàn)非法詞匯過濾

算法簡介

實(shí)現(xiàn)難點(diǎn)

算法實(shí)現(xiàn)

其他資訊

網(wǎng)站制作

企業(yè)服務(wù)

網(wǎng)站建設(shè)

服務(wù)器托管