package hash
Main package for hash APIs. Import all.
import com.spotify.scio.hash._
- Source
- package.scala
- Alphabetic
- By Inheritance
- hash
- AnyRef
- Any
- Hide All
- Show All
- Public
- Protected
Type Members
- sealed trait ApproxFilter[T] extends Serializable
An approximate filter for instances of
T
, e.g.An approximate filter for instances of
T
, e.g. a Bloom filter. A Bloom filter offers an approximate containment test with one-sided error: if it claims that an element is contained in it, this might be in error, but if it claims that an element is not contained in it, then this is definitely true. - sealed trait ApproxFilterCompanion extends AnyRef
A trait for all ApproxFilter companion objects.
- implicit final class ApproxFilterIterable[T] extends AnyVal
- implicit final class ApproxFilterSCollection[T] extends AnyVal
- class BloomFilter[T] extends ApproxFilter[T]
An ApproxFilter implementation backed by a Guava BloomFilter.
An ApproxFilter implementation backed by a Guava BloomFilter.
Import
magnolify.guava.auto._
to get common instances of Guava Funnel s. - case class MutableScalableBloomFilter[T](fpProb: Double, headCapacity: Long, growthRate: Int, tighteningRatio: Double, headFPProb: Double, headCount: Long, head: Option[google.common.hash.BloomFilter[T]], tail: List[Either[google.common.hash.BloomFilter[T], SerializedBloomFilters]])(implicit funnel: Funnel[T]) extends Serializable with Product
Import
magnolify.guava.auto._
to get common instances of Guava Funnel s.Import
magnolify.guava.auto._
to get common instances of Guava Funnel s.- T
The type of objects inserted into the filter
- fpProb
The initial false positive probability
- headCapacity
The capacity of the filter at the head of
filters
- growthRate
The growth rate of each subsequent filter added to
filters
- tighteningRatio
The tightening ratio applied to
headFPProb
when scaling- headFPProb
The false positive probability of the head of
filters
- headCount
The number of items currently in the filter at the head of
filters
- head
The underlying bloom filter currently being inserted into, or
None
if this scalable filter has just been initialized- tail
The underlying already-saturated bloom filters, lazily deserialized from bytes as necessary.
- funnel
The funnel to turn
T
s into bytes
- case class SerializedBloomFilters(numFilters: Int, filterBytes: Array[Byte]) extends Product with Serializable
Value Members
- object BloomFilter extends ApproxFilterCompanion with Serializable
Companion object for BloomFilter.
- object MutableScalableBloomFilter extends Serializable
A mutable, scalable wrapper around a Guava BloomFilter.
A mutable, scalable wrapper around a Guava BloomFilter. Not thread-safe.
Scalable bloom filters use a series of bloom filters, adding a new one and scaling its size by
growthRate
once the previous filter is saturated. A scalable bloom filtercontains
("might contain") an item if any of its filters contains the item.fpProb
, the initial false positive probability, andtighteningRatio
must be carefully chosen to maintain the desired overall false positive probability. Similarly, theinitialCapacity
determines how often the SBF must scale to support a given capacity and influences the effective false positive probability. See below for more details.Import
magnolify.guava.auto._
to get common instances of Guava Funnel s.When a SBF scales a new filter is appended. The false positive probability for a series of
numFilters
appended filters is:1 - Range(0, numFilters).map { i => (1 - fpProb * scala.math.pow(tighteningRatio, i)) }.product
For the defaults, the false positive probability after each append is 3%, 5.6%, 7.9%, 9.9%, 11.7% and so on. It is therefore in the interest of the user to appropriately size the initial filter so that the number of appends is small.
An approximation of the long-term upper bound of the false positive probability is
fpProb / (1 - tighteningRatio)
. For the defaults offpProb = 0.03
andtighteningRatio = 0.9
this gives0.03 / (1 - 0.9)
, or around 30% false positives as an upper bound.Bloom filters inherently trade precision for space. For a single filter, the number of bits is:
-1 * capacity * scala.math.log(fpProb) / scala.math.pow(scala.math.log(2), 2)
For example, for an
initialCapacity
of 1 million and the defaultfpProb
of 3%, a filter will be 912 kilobytes; iffpProb
is instead 0.1%, then the size grows to 1797 kilobytes.When using scalable bloom filters, if possible, always size the first filter to be larger than the known number of items to be inserted rather than choosing a small default and letting the filters scale to fit. For example, if a SBF must contain ~65.5 million items, then with the defaults and an
initialCapacity
of 1000, the SBF will have scaled 16 times, will consume ~84 megabytes and have a false positive probability of ~22%. If, on the other hand, a SBF is constructed with an initial capacity of 65.5 million items, it will not have scaled at all and therefore have its initial false positive probability of 3%, and consume only ~60 megabytes.