Packages

package hash

Main package for hash APIs. Import all.

import com.spotify.scio.hash._
Source
package.scala
Linear Supertypes
Ordering
  1. Alphabetic
  2. By Inheritance
Inherited
  1. hash
  2. AnyRef
  3. Any
  1. Hide All
  2. Show All
Visibility
  1. Public
  2. Protected

Type Members

  1. sealed trait ApproxFilter[T] extends Serializable

    An approximate filter for instances of T, e.g.

    An approximate filter for instances of T, e.g. a Bloom filter. A Bloom filter offers an approximate containment test with one-sided error: if it claims that an element is contained in it, this might be in error, but if it claims that an element is not contained in it, then this is definitely true.

  2. sealed trait ApproxFilterCompanion extends AnyRef

    A trait for all ApproxFilter companion objects.

  3. implicit final class ApproxFilterIterable[T] extends AnyVal
  4. implicit final class ApproxFilterSCollection[T] extends AnyVal
  5. class BloomFilter[T] extends ApproxFilter[T]

    An ApproxFilter implementation backed by a Guava BloomFilter.

    An ApproxFilter implementation backed by a Guava BloomFilter.

    Import magnolify.guava.auto._ to get common instances of Guava Funnel s.

  6. case class MutableScalableBloomFilter[T](fpProb: Double, headCapacity: Long, growthRate: Int, tighteningRatio: Double, headFPProb: Double, headCount: Long, head: Option[google.common.hash.BloomFilter[T]], tail: List[Either[google.common.hash.BloomFilter[T], SerializedBloomFilters]])(implicit funnel: Funnel[T]) extends Serializable with Product

    Import magnolify.guava.auto._ to get common instances of Guava Funnel s.

    Import magnolify.guava.auto._ to get common instances of Guava Funnel s.

    T

    The type of objects inserted into the filter

    fpProb

    The initial false positive probability

    headCapacity

    The capacity of the filter at the head of filters

    growthRate

    The growth rate of each subsequent filter added to filters

    tighteningRatio

    The tightening ratio applied to headFPProb when scaling

    headFPProb

    The false positive probability of the head of filters

    headCount

    The number of items currently in the filter at the head of filters

    head

    The underlying bloom filter currently being inserted into, or None if this scalable filter has just been initialized

    tail

    The underlying already-saturated bloom filters, lazily deserialized from bytes as necessary.

    funnel

    The funnel to turn Ts into bytes

  7. case class SerializedBloomFilters(numFilters: Int, filterBytes: Array[Byte]) extends Product with Serializable

Value Members

  1. object BloomFilter extends ApproxFilterCompanion with Serializable

    Companion object for BloomFilter.

  2. object MutableScalableBloomFilter extends Serializable

    A mutable, scalable wrapper around a Guava BloomFilter.

    A mutable, scalable wrapper around a Guava BloomFilter. Not thread-safe.

    Scalable bloom filters use a series of bloom filters, adding a new one and scaling its size by growthRate once the previous filter is saturated. A scalable bloom filter contains ("might contain") an item if any of its filters contains the item.

    fpProb, the initial false positive probability, and tighteningRatio must be carefully chosen to maintain the desired overall false positive probability. Similarly, the initialCapacity determines how often the SBF must scale to support a given capacity and influences the effective false positive probability. See below for more details.

    Import magnolify.guava.auto._ to get common instances of Guava Funnel s.

    When a SBF scales a new filter is appended. The false positive probability for a series of numFilters appended filters is:

    1 - Range(0, numFilters).map { i => (1 - fpProb * scala.math.pow(tighteningRatio, i)) }.product

    For the defaults, the false positive probability after each append is 3%, 5.6%, 7.9%, 9.9%, 11.7% and so on. It is therefore in the interest of the user to appropriately size the initial filter so that the number of appends is small.

    An approximation of the long-term upper bound of the false positive probability is fpProb / (1 - tighteningRatio). For the defaults of fpProb = 0.03 and tighteningRatio = 0.9 this gives 0.03 / (1 - 0.9), or around 30% false positives as an upper bound.

    Bloom filters inherently trade precision for space. For a single filter, the number of bits is:

    -1 * capacity * scala.math.log(fpProb) / scala.math.pow(scala.math.log(2), 2)

    For example, for an initialCapacity of 1 million and the default fpProb of 3%, a filter will be 912 kilobytes; if fpProb is instead 0.1%, then the size grows to 1797 kilobytes.

    When using scalable bloom filters, if possible, always size the first filter to be larger than the known number of items to be inserted rather than choosing a small default and letting the filters scale to fit. For example, if a SBF must contain ~65.5 million items, then with the defaults and an initialCapacity of 1000, the SBF will have scaled 16 times, will consume ~84 megabytes and have a false positive probability of ~22%. If, on the other hand, a SBF is constructed with an initial capacity of 65.5 million items, it will not have scaled at all and therefore have its initial false positive probability of 3%, and consume only ~60 megabytes.

Inherited from AnyRef

Inherited from Any

Ungrouped