From Wikipedia:
A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set.
In other words, given a set of elements, bloom filter can tell you that: A) given element is definitely not in the set, or B) given element is maybe in the set. It can give you a false positive: it can say that an element is in the set when it in fact is not. But it will never give you a false negative: it will never say that an element is not in the set when in fact it is.
In my past life I used bloom filters to check whether or not I should perform an expensive database index search 🙂
In the following example I construct a bloom filter given a set of strings
set1, I then verify that each element of
set1is in the set according to the bloom filter. Finally I try a different set of elements
set2, and test what the bloom filter says about those elements. Given big enough bloom filter I get 100% correct answers (that non of the elements in
set2are present). Here’s the code (bloom.cpp):
#include#include #include "bloom.hpp" using namespace std; int main() { string set1[] = {"Martin", "Vorbrodt", "C++", "Blog"}; string set2[] = {"Not", "In", "The", "Set"}; bloom_filter bloom(128); for(auto& s : set1) bloom.add(s); for(auto& s : set1) cout << "Contains \"" << s << "\" : " << bloom.contains(s) << endl; for(auto& s : set2) cout << "Contains \"" << s << "\" : " << bloom.contains(s) << endl; }
Contains "Martin" : 1
Program output.
Contains "Vorbrodt" : 1
Contains "C++" : 1
Contains "Blog" : 1
Contains "Not" : 0
Contains "In" : 0
Contains "The" : 0
Contains "Set" : 0
My implementation of the bloom filter is primitive: it uses only one hashing function which increases false positives, but it is very short and clean and can be built upon; maybe at some point I'll write a full blown implementation; though there's plenty of examples online; I want this post to be an introduction to the topic.
In any case, here's my implementation of the bloom filter (bloom.hpp):
#pragma once #include#include #include template > class bloom_filter { public: bloom_filter(std::size_t size) : m_bits(size) {} void add(const key& data) { m_bits[hash{}(data) % m_bits.size()] = true; } bool contains(const key& data) { return m_bits[hash{}(data) % m_bits.size()]; } private: std::vector m_bits; };