Clever Geek Handbook
📜 ⬆️ ⬇️

Bloom filter

The Bloom filter is a probabilistic data structure invented by Burton Bloom in 1970 [1] , which allows you to check whether an element belongs to a set . At the same time, it is possible to obtain a false positive response (there is no element in the set, but the data structure reports that it is), but not false negative .

The Bloom filter can use any amount of memory previously set by the user, and the larger it is, the less likely it is to false-positive. The operation of adding new elements to a set, but not deleting existing ones (unless modification with counters is used) is supported.

Content

Description of the data structure

 
Bloom filter example withm=18 {\ displaystyle m = 18}   andk=3 {\ displaystyle k = 3}   storing the set { x , y , z }. Colored arrows indicate places in the bitmap corresponding to each element of the set. This Bloom filter can determine that the element w is not included in the set, since one of the corresponding bits is equal to zero.

The Bloom filter is a bit array of m bits . Initially, when the data structure stores an empty set , all m bits are reset to zero. The user must determine k independent hash functions h 1 , ..., h k that map each element to one of the m positions of the bit array in a fairly uniform way.

To add an element e, it is necessary to write units for each of the positions h 1 ( e ), ..., h k ( e ) of the bit array.

To check whether an element e belongs to the set of stored elements, it is necessary to check the state of bits h 1 ( e ), ..., h k ( e ). If at least one of them is equal to zero, the element cannot belong to the set (otherwise, if it were added, all these bits would be set). If they are all equal to one, then the data structure reports that e belongs to the set. In this case, two situations may arise: either the element really belongs to the set, or all these bits turned out to be set by chance when adding other elements, which is the source of false positives in this data structure.

Probability of false positives

 
Estimation of the probability of false positives p as a function of the number of inserted elements n and the size of the bit array m , with the optimal number of hash functionsk=(m/n)ln⁡2 {\ displaystyle k = (m / n) \ ln 2}   .

Suppose that the size of the bitmap is m bits and k hash functions are given. Suppose that a set of hash functions is randomly selected, and for any element x, each hash function h i assigns it one of the places in the bitmap with equal probability

Pr(hi(x)=p)=onem,p=one...m,{\ displaystyle \ Pr (h_ {i} (x) = p) = {\ frac {1} {m}}, \ quad p = 1 \ ldots m,}  

and, in addition, the meaningshi(x) {\ displaystyle h_ {i} (x)}   are collectively independent random variables (to simplify the subsequent analysis).

Then the probability that one will not be written to some pth bit during the insert operation of the next element is

Pr(hone(x)≠p∩...∩hk(x)≠p)=Pr(hi(x)≠p)k=(one-onem)k.{\ displaystyle \ Pr (h_ {1} (x) \ neq p \ cap \ ldots \ cap h_ {k} (x) \ neq p) = \ Pr (h_ {i} (x) \ neq p) ^ { k} = \ left (1 - {\ frac {1} {m}} \ right) ^ {k}.}  

And the probability that the pth bit remains zero after inserting n different elements x 1 , ..., x n into the initially empty Bloom filter is

Pr(∀i∈{one...k},∀j∈{one...n},hi(xj)≠p)=(one-onem)kn≈e-kn/m{\ displaystyle \ Pr \ left (\ forall i \ in \ {1 \ dots k \}, \ \ forall j \ in \ {1 \ ldots n \}, \ h_ {i} (x_ {j}) \ neq p \ right) = \ left (1 - {\ frac {1} {m}} \ right) ^ {kn} \ approx e ^ {- kn / m}}  

for a sufficiently large m in view of the second remarkable limit .

The false positive is that for some element y , not equal to any of the inserted, all k bits with numbers h i ( y ) will be non-zero, and the Bloom filter will incorrectly answer that y is included in the set of inserted elements. The probability of such an event is approximately equal

(one-e-kn/m)k.{\ displaystyle (1-e ^ {- kn / m}) ^ {k}.}  

Obviously, this probability decreases with increasing m (the size of the bitmap) and increases with increasing n (the number of inserted elements). For fixed m and n, the optimal number k (the number of hash functions) minimizing it is

k=mnln⁡2≈0.693onemn.{\ displaystyle \ textstyle k = {\ frac {m} {n}} \ ln 2 \ approx 0 {,} 6931 {\ frac {m} {n}}.}  

In this case, the probability of false positives is equal to

2-k≈0.618fivem/n.{\ displaystyle 2 ^ {- k} \ approx {0 {,} 6185} ^ {m / n}.}  

As a result, we note that in order for the Bloom filter to maintain a given limited probability of false positives, the size of the bitmap must be linearly proportional to the number of elements inserted.

Properties

  • Unlike many other data structures (for example, hash tables ) that also store many elements, the Bloom filter can represent the universal set of all possible elements. In this case, all the bits in its bitmap are equal to one.
  • The union and intersection of two Bloom filters of the same size and with the same set of hash functions can be implemented by bitwise operations OR and AND on their bitmaps.

Application

Compared to hash tables , a Bloom filter can cost several orders of magnitude less memory, sacrificing determinism. It is usually used to reduce the number of queries to non-existent data in a data structure with more expensive access (for example, located on a hard disk or in a network database), that is, to "filter" queries to it (hence the name of the data structure).

Examples of practical applications:

  • Squid Proxy uses Bloom filters for the cache digests option .
  • Google BigTable uses Bloom filters to reduce the number of hard disk accesses when checking for the existence of a given row or column in a database table. [2]
  • Computer programs for checking spelling.

Notes

  1. ↑ Bloom, Burton H. (1970), " Space / time trade-offs in hash coding with allowable errors ", Communications of the ACM T. 13 (7): 422–426 , DOI 10.1145 / 362686.362692  
  2. ↑ Bigtable: A Distributed Storage System for Structured Data
Source - https://ru.wikipedia.org/w/index.php?title= Bloom Filter&oldid = 95651976


More articles:

  • 1850 in literature
  • FIBA European League 1994/1995
  • Njesuti
  • Woodmer
  • Bang on Yehuda Street (1948)
  • Entomological Society of Latvia
  • SS 433
  • Voronkouhi
  • Legion of the Living Dead
  • DIN Rail

All articles

Clever Geek | 2019