編程開發(fā)中如何實現(xiàn)布隆過濾器

這篇文章給大家分享的是有關(guān)編程開發(fā)中如何實現(xiàn)布隆過濾器的內(nèi)容。小編覺得挺實用的，因此分享給大家做個參考，一起跟隨小編過來看看吧。

成都創(chuàng)新互聯(lián)專注為客戶提供全方位的互聯(lián)網(wǎng)綜合服務(wù)，包含不限于成都網(wǎng)站建設(shè)、網(wǎng)站制作、鄠邑網(wǎng)絡(luò)推廣、小程序開發(fā)、鄠邑網(wǎng)絡(luò)營銷、鄠邑企業(yè)策劃、鄠邑品牌公關(guān)、搜索引擎seo、人物專訪、企業(yè)宣傳片、企業(yè)代運營等，從售前售中售后，我們都將竭誠為您服務(wù)，您的肯定，是我們最大的嘉獎；成都創(chuàng)新互聯(lián)為所有大學生創(chuàng)業(yè)者提供鄠邑建站搭建服務(wù)，24小時服務(wù)熱線：13518219792，官方網(wǎng)址：www.cdcxhl.com

給40億個不重復(fù)的無符號整數(shù)，沒排過序。給一個無符號整數(shù)，如何快速判斷一個數(shù)是否在這40億個數(shù)中。

拿到這個題目，我們首先想到的是遍歷這40億的數(shù)字，然后一個一個找。顯然是行不通的。因為這40億個數(shù)放到內(nèi)存中，大約需要16G內(nèi)存。

如果我們把它轉(zhuǎn)換成位圖處理，那么就好處理多了。我們可以把一個×××再細分一下，一個int類型就可以編程32個位，每一位用0，1表示當前這個位置上是否存有值，同樣是利用哈希存儲的方法。只是這樣存儲的話就可以減少很多的空間了，例如上題使用的內(nèi)存就可以從16G降到500M的內(nèi)存?？臻g的使用率減少了不止一點。

大家可以根據(jù)我這個方法實現(xiàn)上面的代碼，今天我主要介紹的是布隆過濾器，因為布隆過濾器也要用到位圖（bitmap），位圖實現(xiàn)思想：

1.把一個int類型變成32個bits。

2.把它們?nèi)砍跏蓟癁?。

3.如果當前位上有值，把0置成1。

下面我給出位圖的實現(xiàn)代碼：

BitMap.h中

#include 

class BitMap
{
public:
	BitMap(size_t size = 0)
		:_size(0)
	{
		//用resize開辟空間，_a中的值會被初始化為0

		//加1為了讓值全部能放到數(shù)組中，假如有36個數(shù)，36/32=1余4，而
		//多開的那個空間就保證了這4個數(shù)能放下

		//_a.resize(size/32+1);和下面的代碼一個性質(zhì)，只不過用移位運算符比除法的效率高
		_a.resize((size >> 5) + 1);
	}
	//插入
	void Set(size_t x)
	{
		size_t index = x >> 5;
		size_t num = x % 32;
		//當前位置如果等于0，沒有值，可以插入
		if (!(_a[index] & (1 << num)))
		{
			_a[index] |= (1 << num);//當前位置置1
			++_size;
		}
	}
	//刪除
	void Reset(size_t x)
	{
		size_t index = x >> 5;
		size_t num = x % 32;
		//當前位置為1，有值，可以刪除
		if (_a[index] & (1 << num))
		{
			_a[index] &= ~(1 << num);//當前位置置0
			--_size;
		}
	}
	//判斷是否有值
	bool BitMapTest(size_t x)
	{
		size_t index = x >> 5;
		size_t num = x % 32;
		if (_a[index] & (1 << num))
		{
			return true;
		}
		return false;
	}
	void Resize(size_t size)
	{
		_a.resize(size);
	}
protected:
	vector _a;
	size_t _size;
};

如果想判斷一個元素是不是在一個集合里，一般想到的是將所有元素保存起來，然后通過比較確定。鏈表，樹等等數(shù)據(jù)結(jié)構(gòu)都是這種思路. 但是隨著集合中元素的增加，我們需要的存儲空間越來越大，檢索速度也越來越慢。不過世界上還有一種叫作散列表（又叫哈希表，Hash table）的數(shù)據(jù)結(jié)構(gòu)。它可以通過一個Hash函數(shù)將一個元素映射成一個位陣列（Bit Array）中的一個點。這樣一來，我們只要看看這個點是不是 1 就知道可以集合中有沒有它了。這就是布隆過濾器的基本思想。

Hash面臨的問題就是沖突。假設(shè) Hash 函數(shù)是良好的，如果我們的位陣列長度為 m 個點，那么如果我們想將沖突率降低到例如 1%, 這個散列表就只能容納 m/100 個元素。顯然這就不叫空間有效了（Space-efficient）。解決方法也簡單，就是使用多個 Hash，如果它們有一個說元素不在集合中，那肯定就不在。如果它們都說在，雖然也有一定可能性它們在說謊，不過直覺上判斷這種事情的概率是比較低的。

優(yōu)點

相比于其它的數(shù)據(jù)結(jié)構(gòu)，布隆過濾器在空間和時間方面都有巨大的優(yōu)勢。布隆過濾器存儲空間和插入/查詢時間都是常數(shù)。另外, Hash 函數(shù)相互之間沒有關(guān)系，方便由硬件并行實現(xiàn)。布隆過濾器不需要存儲元素本身，在某些對保密要求非常嚴格的場合有優(yōu)勢。

布隆過濾器可以表示全集，其它任何數(shù)據(jù)結(jié)構(gòu)都不能；

k 和 m 相同，使用同一組 Hash 函數(shù)的兩個布隆過濾器的交并差運算可以使用位操作進行。

缺點

但是布隆過濾器的缺點和優(yōu)點一樣明顯。誤算率（False Positive）是其中之一。隨著存入的元素數(shù)量增加，誤算率隨之增加。但是如果元素數(shù)量太少，則使用散列表足矣。

另外，一般情況下不能從布隆過濾器中刪除元素. 我們很容易想到把位列陣變成整數(shù)數(shù)組，每插入一個元素相應(yīng)的計數(shù)器加1, 這樣刪除元素時將計數(shù)器減掉就可以了。然而要保證安全的刪除元素并非如此簡單。首先我們必須保證刪除的元素的確在布隆過濾器里面. 這一點單憑這個過濾器是無法保證的。另外計數(shù)器回繞也會造成問題。

下面給出布隆過濾器的實現(xiàn)代碼：

仿函數(shù)實現(xiàn)，我用了5個仿函數(shù)，它們的實現(xiàn)我是看http://www.cnblogs.com/-clq/archive/2012/05/31/2528153.html這里面實現(xiàn)的，他的實現(xiàn)比較好，能更好的避免哈希沖突。

commom.h中

#pragma once

#include 
size_t NewSize(size_t size)
{
	// 使用素數(shù)表對齊做哈希表的容量，降低哈希沖突
	const int _PrimeSize = 28;
	static const unsigned long _PrimeList[_PrimeSize] =
	{
		53ul, 97ul, 193ul, 389ul, 769ul,
		1543ul, 3079ul, 6151ul, 12289ul, 24593ul,
		49157ul, 98317ul, 196613ul, 393241ul, 786433ul,
		1572869ul, 3145739ul, 6291469ul, 12582917ul, 25165843ul,
		50331653ul, 100663319ul, 201326611ul, 402653189ul, 805306457ul,
		1610612741ul, 3221225473ul, 4294967291ul
	};
	for (int i = 0; i < _PrimeSize; i++)
	{
		if (_PrimeList[i]>size)
		{
			return _PrimeList[i];//按照素數(shù)表來設(shè)置容量大小
		}
	}
	//當需要的容量超過素數(shù)表的最大容量，我們就按照最大的來擴容
	return _PrimeList[_PrimeSize - 1];
}
template 
struct __HashFunc1
{
	size_t BKDRHash(const T *str)
	{
		register size_t hash = 0;
		while (size_t ch = (size_t)*str++)
		{
			hash = hash * 131 + ch;   
		}
		return hash;
	}
	size_t operator()(const T& key)
	{
		return BKDRHash(key.c_str());
	}
};
template 
struct __HashFunc2
{
	size_t SDBMHash(const T *str)
	{
		register size_t hash = 0;
		while (size_t ch = (size_t)*str++)
		{
			hash = 65599 * hash + ch;
			//hash = (size_t)ch + (hash << 6) + (hash << 16) - hash;  
		}
		return hash;
	}
	size_t operator()(const T& key)
	{
		return SDBMHash(key.c_str());
	}
};
template 
struct __HashFunc3
{
	size_t RSHash(const T *str)
	{
		register size_t hash = 0;
		size_t magic = 63689;
		while (size_t ch = (size_t)*str++)
		{
			hash = hash * magic + ch;
			magic *= 378551;
		}
		return hash;
	}
	size_t operator()(const T& key)
	{
		return RSHash(key.c_str());
	}
};

template 
struct __HashFunc4
{
	size_t APHash(const T *str)
	{
		register size_t hash = 0;
		size_t ch;
		for (long i = 0; ch = (size_t)*str++; i++)
		{
			if ((i & 1) == 0)
			{
				hash ^= ((hash << 7) ^ ch ^ (hash >> 3));
			}
			else
			{
				hash ^= (~((hash << 11) ^ ch ^ (hash >> 5)));
			}
		}
		return hash;
	}
	size_t operator()(const T& key)
	{
		return APHash(key.c_str());
	}
};
template 
struct __HashFunc5
{
	size_t JSHash(const T *str)
	{
		if (!*str)        // 這是由本人添加，以保證空字符串返回哈希值0  
			return 0;
		register size_t hash = 1315423911;
		while (size_t ch = (size_t)*str++)
		{
			hash ^= ((hash << 5) + ch + (hash >> 2));
		}
		return hash;
	}
	size_t operator()(const T& key)
	{
		return JSHash(key.c_str());
	}
};

我使用了5個Hash函數(shù)，可以降低哈希沖突。大家視情況而定，自己設(shè)置哈希函數(shù)的個數(shù)。

BoolmFilter.h中

#pragma once

#include
#include "BitMap.h"
#include "common.h"
template,
	class HashFunc2 = __HashFunc2,
	class HashFunc3 = __HashFunc3,
	class HashFunc4 = __HashFunc4,
	class HashFunc5 = __HashFunc5
>
class BoolmFilter
{
public:
	BoolmFilter(size_t capatity = 0)
	{
		_capatity = NewSize(capatity);
		_bm.Resize(_capatity);
	}
	void Set(const T& key)
	{
		size_t index1 = HashFunc1()(key);
		size_t index2 = HashFunc2()(key);
		size_t index3 = HashFunc3()(key);
		size_t index4 = HashFunc4()(key);
		size_t index5 = HashFunc5()(key);

		_bm.Set(index1%_capatity);
		_bm.Set(index2%_capatity);
		_bm.Set(index3%_capatity);
		_bm.Set(index4%_capatity);
		_bm.Set(index5%_capatity);
	}
	bool Test(const T& key)
	{
		size_t index1 = HashFunc1()(key);
		if (!_bm.BitMapTest(index1%_capatity))
		{
			return false;
		}
		size_t index2 = HashFunc2()(key);
		if (!_bm.BitMapTest(index2%_capatity))
		{
			return false;
		}
		size_t index3 = HashFunc3()(key);
		if (!_bm.BitMapTest(index3%_capatity))
		{
			return false;
		}
		size_t index4 = HashFunc4()(key);
		if (!_bm.BitMapTest(index4%_capatity))
		{
			return false;
		}
		size_t index5 = HashFunc5()(key);
		if (!_bm.BitMapTest(index5%_capatity))
		{
			return false;
		}
		return true;
	}
protected:
	BitMap _bm;
	size_t _capatity;
};

test.cpp中

#include 
using namespace std;
#include "BoolmFilter.h"

void BoolTest()
{
	BoolmFilter<>bf(100);
	bf.Set("she is girl");
	bf.Set("我是好人");
	bf.Set("chive/2012/05/31/2528153.html");
	cout << bf.Test("she is girl") << endl;
	cout << bf.Test("我是好人") << endl;
	cout << bf.Test("chive/2012/05/31/2528153.html") << endl;
}
int main()
{
	BoolTest();
	system("pause");
	return 0;
}

感謝各位的閱讀！關(guān)于“編程開發(fā)中如何實現(xiàn)布隆過濾器”這篇文章就分享到這里了，希望以上內(nèi)容可以對大家有一定的幫助，讓大家可以學到更多知識，如果覺得文章不錯，可以把它分享出去讓更多的人看到吧！

文章名稱：編程開發(fā)中如何實現(xiàn)布隆過濾器
瀏覽地址：http://weahome.cn/article/iidjcd.html

真实的国产乱ⅩXXX66竹夫人,五月香六月婷婷激情综合,亚洲日本VA一区二区三区,亚洲精品一区二区三区麻豆

編程開發(fā)中如何實現(xiàn)布隆過濾器

優(yōu)點

缺點

其他資訊

網(wǎng)站制作

企業(yè)服務(wù)

網(wǎng)站建設(shè)

服務(wù)器托管