transformers

package transformers

Type Members

case class MDLRecord[T](label: T, value: Double) extends Product with Serializable
Labelled feature for MDL.
case class Settings(cls: String, name: String, params: Map[String, String], featureNames: Seq[String], aggregators: Option[String]) extends Product with Serializable
trait SettingsBuilder extends AnyRef
abstract class Transformer[-A, B, C] extends Serializable
Base class for feature transformers.
Input values are converted into intermediate type B, aggregated, and converted to summary type C. The summary type C is then used to transform input values into features.
A: input type
B: aggregator intermediate type
C: aggregator summary type
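The A, B, C flow described above can be sketched in plain Scala. This is a simplified illustration and not the actual Transformer class: the names SimpleTransformer, prepare, combine, present, transform and run are hypothetical, and scaling by the maximum absolute value is used only as a concrete instance.

```scala
// Hypothetical, simplified model of the A -> B -> C flow; not the real Transformer API.
trait SimpleTransformer[A, B, C] {
  def prepare(a: A): B              // input value -> intermediate type B
  def combine(x: B, y: B): B        // aggregate two intermediate values
  def present(b: B): C              // aggregated B -> summary type C
  def transform(a: A, c: C): Double // use the summary to build a feature

  // Aggregate a non-empty data set, then transform every input with the summary.
  def run(data: Seq[A]): Seq[Double] = {
    val summary = present(data.map(prepare).reduce(combine))
    data.map(transform(_, summary))
  }
}

// Concrete instance: divide each value by the maximum absolute value.
object MaxAbsSketch extends SimpleTransformer[Double, Double, Double] {
  def prepare(a: Double): Double = math.abs(a)
  def combine(x: Double, y: Double): Double = math.max(x, y)
  def present(b: Double): Double = b
  def transform(a: Double, c: Double): Double = a / c
}
```

For example, MaxAbsSketch.run(Seq(-2.0, 1.0, 4.0)) aggregates the summary 4.0 and yields -0.5, 0.25 and 1.0.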
case class WeightedLabel(name: String, value: Double) extends Product with Serializable
Weighted label. Can also be thought of as a weighted value in a named sparse vector.

Value Members

object Binarizer extends SettingsBuilder with Serializable
Transform numerical features to binary features.
Feature values greater than threshold are binarized to 1.0; values equal to or less than threshold are binarized to 0.0.
Missing values are binarized to 0.0.
object Bucketizer extends SettingsBuilder with Serializable
Transform a column of continuous features to n columns of feature buckets.
With n+1 splits, there are n buckets. A bucket defined by splits x, y holds values in the range [x, y), except the last bucket, which also includes y. Splits must be strictly increasing, i.e. s0 < s1 < s2 < ... < sn. Two examples of splits are Array(Double.NegativeInfinity, 0.0, 1.0, Double.PositiveInfinity) and Array(0.0, 1.0, 2.0).
Double.NegativeInfinity and Double.PositiveInfinity must be provided explicitly as the outer splits to cover all double values; otherwise, a FeatureRejection.OutOfBound rejection is reported for values outside the specified splits. Add them if you do not know the lower and upper bounds of the target column.
Missing values are transformed to zero vectors.
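The bucket rules above can be illustrated with a short plain-Scala sketch. It does not use the library; BucketizerSketch, bucket and encode are hypothetical names, and emitting a zero vector for out-of-bound values is an assumption of this sketch.

```scala
object BucketizerSketch {
  // Returns the bucket index for x, or None when x falls outside the splits
  // (the case reported as FeatureRejection.OutOfBound in the text above).
  def bucket(splits: Array[Double], x: Double): Option[Int] = {
    val n = splits.length - 1 // n + 1 splits define n buckets
    if (n < 1 || x.isNaN || x < splits.head || x > splits.last) None
    else if (x == splits.last) Some(n - 1) // last bucket also includes its upper bound
    else Some(splits.lastIndexWhere(_ <= x)) // bucket i holds [splits(i), splits(i + 1))
  }

  // One column per bucket; missing values give a zero vector.
  // Out-of-bound values also give a zero vector here (an assumption).
  def encode(splits: Array[Double], x: Option[Double]): Seq[Double] = {
    val hit = x.flatMap(bucket(splits, _))
    Seq.tabulate(splits.length - 1)(i => if (hit.contains(i)) 1.0 else 0.0)
  }
}
```

With splits Array(0.0, 1.0, 2.0), the value 1.0 falls in the second bucket, 2.0 also falls in the second (last) bucket, and 2.5 is out of bounds.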

object HashNHotEncoder extends SettingsBuilder with Serializable

Transform a collection of categorical features to binary columns, with at most N one-values. Similar to NHotEncoder but uses MurmurHash3 to hash features into buckets to reduce CPU and memory overhead.

Missing values are transformed to zero vectors.

If hashBucketSize is inferred with HLL, the estimate is scaled by sizeScalingFactor to reduce the number of collisions.

Rough table of relationship of scaling factor to % collisions, measured from a corpus of 466544 English words:

sizeScalingFactor     % Collisions
-----------------     ------------
                2     17.9934%
                4     10.5686%
                8      5.7236%
               16      3.0019%
               32      1.5313%
               64      0.7864%
              128      0.3920%
              256      0.1998%
              512      0.0975%
             1024      0.0478%
             2048      0.0236%
             4096      0.0071%
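
The hashing idea can be sketched with the Scala standard library. This is an illustration under stated assumptions rather than the library's implementation: it uses scala.util.hashing.MurmurHash3, which may differ from the hash variant the encoder uses, and collisionRate shows only one possible definition of a collision percentage (how the table above was measured is not stated). HashNHotSketch and its method names are hypothetical.

```scala
import scala.util.hashing.MurmurHash3

object HashNHotSketch {
  // Map a label to a bucket in [0, hashBucketSize) via MurmurHash3.
  def bucket(label: String, hashBucketSize: Int): Int =
    java.lang.Math.floorMod(MurmurHash3.stringHash(label), hashBucketSize)

  // N-hot encode a row of labels; an empty (missing) row yields a zero vector.
  def encode(labels: Seq[String], hashBucketSize: Int): Seq[Double] = {
    val hot = labels.map(bucket(_, hashBucketSize)).toSet
    Seq.tabulate(hashBucketSize)(i => if (hot(i)) 1.0 else 0.0)
  }

  // Fraction of distinct labels that share a bucket with an earlier label.
  def collisionRate(labels: Seq[String], hashBucketSize: Int): Double = {
    val distinct = labels.distinct
    val used = distinct.map(bucket(_, hashBucketSize)).toSet.size
    if (distinct.isEmpty) 0.0 else (distinct.size - used).toDouble / distinct.size
  }
}
```

Because no vocabulary is stored, memory depends only on hashBucketSize; a larger bucket count trades vector width for fewer collisions.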

object HashNHotWeightedEncoder extends SettingsBuilder with Serializable
Transform a collection of weighted categorical features to columns of weight sums, with at most N values. Similar to NHotWeightedEncoder but uses MurmurHash3 to hash features into buckets to reduce CPU and memory overhead.
Weights of the same label in a row are summed, rather than encoded as 1.0 as in the normal NHotEncoder.
If hashBucketSize is inferred with HLL, the estimate is scaled by sizeScalingFactor to reduce the number of collisions.
Rough table of relationship of scaling factor to % collisions, measured from a corpus of 466544 English words:
```
sizeScalingFactor     % Collisions
-----------------     ------------
                2     17.9934%
                4     10.5686%
                8      5.7236%
               16      3.0019%
               32      1.5313%
               64      0.7864%
              128      0.3920%
              256      0.1998%
              512      0.0975%
             1024      0.0478%
             2048      0.0236%
             4096      0.0071%
```

object HashOneHotEncoder extends SettingsBuilder with Serializable

Transform a collection of categorical features to binary columns, with at most a single one-value. Similar to OneHotEncoder but uses MurmurHash3 to hash features into buckets to reduce CPU and memory overhead.

Missing values are transformed to zero vectors.

If hashBucketSize is inferred with HLL, the estimate is scaled by sizeScalingFactor to reduce the number of collisions.

Rough table of relationship of scaling factor to % collisions, measured from a corpus of 466544 English words:

sizeScalingFactor     % Collisions
-----------------     ------------
                2     17.9934%
                4     10.5686%
                8      5.7236%
               16      3.0019%
               32      1.5313%
               64      0.7864%
              128      0.3920%
              256      0.1998%
              512      0.0975%
             1024      0.0478%
             2048      0.0236%
             4096      0.0071%

object HeavyHitters extends SettingsBuilder with Serializable
Transform a collection of categorical features to 2 columns, one for rank and one for count. Only the top heavyHittersCount items are tracked, with 1.0 being the most frequent rank, 2.0 the second most, etc. All other items are transformed to [0.0, 0.0].
Ranks and frequencies are estimated with Algebird's SketchMap data structure. With probability at least 1 - delta, this estimate is within eps * N of the true frequency (i.e., true frequency <= estimate <= true frequency + eps * N), where N is the total size of the input collection.
Missing values are transformed to [0.0, 0.0].
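The rank and count encoding can be illustrated with exact counting. Note the substitution: the text above says frequencies are estimated with Algebird's SketchMap, whereas this sketch counts exactly, which is only practical for small inputs. HeavyHittersSketch, summarize and encode are hypothetical names, and breaking ties by item name is an assumption.

```scala
object HeavyHittersSketch {
  // Exact (rank, count) for the top `heavyHittersCount` items.
  // Exact counting stands in for the SketchMap estimate described above.
  def summarize(items: Seq[String], heavyHittersCount: Int): Map[String, (Double, Double)] =
    items
      .groupBy(identity)
      .map { case (item, occurrences) => (item, occurrences.size) }
      .toSeq
      .sortBy { case (item, count) => (-count, item) } // most frequent first, ties by name
      .take(heavyHittersCount)
      .zipWithIndex
      .map { case ((item, count), i) => (item, ((i + 1).toDouble, count.toDouble)) }
      .toMap

  // Tracked items map to [rank, count]; untracked or missing items map to [0.0, 0.0].
  def encode(item: Option[String], summary: Map[String, (Double, Double)]): Seq[Double] =
    item.flatMap(summary.get) match {
      case Some((rank, count)) => Seq(rank, count)
      case None                => Seq(0.0, 0.0)
    }
}
```

With heavyHittersCount = 2 and items a, b, a, c, a, b, item a gets rank 1.0 with count 3.0, item b gets rank 2.0 with count 2.0, and c falls outside the tracked set.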
object IQROutlierRejector extends SettingsBuilder with Serializable
Reject values if they fall outside of either factor * IQR below the first quartile or factor * IQR above the third quartile.
The IQR (interquartile range) is the range between the first and the third quartiles.
The bin ranges are chosen using Algebird's QTree approximate data structure. The precision of the approximation can be controlled with the k parameter.
All values are transformed to zeros.
Values factor * IQR below the first quartile or factor * IQR above the third quartile are rejected as FeatureRejection.Outlier.
When using aggregated feature summary from a previous session, values outside of previously seen [min, max] will also report FeatureRejection.Outlier as rejection.
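The rejection bounds can be worked through in a small sketch. It computes exact quartiles by linear interpolation instead of the QTree approximation mentioned above, so results on real data may differ slightly. IqrSketch and its methods are hypothetical names, and the default factor of 1.5 is a conventional choice assumed here.

```scala
object IqrSketch {
  // Quantile by linear interpolation over sorted values (one of several common definitions).
  def quantile(sorted: IndexedSeq[Double], q: Double): Double = {
    val pos = q * (sorted.length - 1)
    val lo = math.floor(pos).toInt
    val hi = math.ceil(pos).toInt
    sorted(lo) + (pos - lo) * (sorted(hi) - sorted(lo))
  }

  // Lower and upper bounds: Q1 - factor * IQR and Q3 + factor * IQR.
  def fences(values: Seq[Double], factor: Double = 1.5): (Double, Double) = {
    val sorted = values.sorted.toIndexedSeq
    val q1 = quantile(sorted, 0.25)
    val q3 = quantile(sorted, 0.75)
    val iqr = q3 - q1
    (q1 - factor * iqr, q3 + factor * iqr)
  }

  // True when x lies outside the bounds and would be rejected as an outlier.
  def isOutlier(x: Double, bounds: (Double, Double)): Boolean =
    x < bounds._1 || x > bounds._2
}
```

For the values 1.0 through 9.0, Q1 is 3.0 and Q3 is 7.0, so the IQR is 4.0 and the bounds with factor 1.5 are -3.0 and 13.0.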
object Identity extends SettingsBuilder with Serializable
Transform features by passing them through.
Missing values are transformed to 0.0.
object Indicator extends SettingsBuilder with Serializable
Transform an optional 1D feature to an indicator variable indicating presence.
Missing values are mapped to 0.0. Present values are mapped to 1.0.
object MDL extends SettingsBuilder with Serializable
Transform a column of continuous labelled features to n columns of binned categorical features. The optimum number of bins is computed using Minimum Description Length (MDL), which is an entropy measurement between the values and the targets.
The transformer expects an MDLRecord where the first field is a label and the second field is the scalar that will be transformed into buckets.
MDL is an iterative algorithm, so all of the data needed to compute the buckets is pulled into memory. If you run into memory issues, lower the sampleRate parameter.
References:
- Fayyad, U., & Irani, K. (1993). "Multi-interval discretization of continuous-valued attributes for classification learning."
- https://github.com/sramirez/spark-MDLP-discretization
object MaxAbsScaler extends SettingsBuilder with Serializable
Transform features by rescaling each feature to range [-1, 1] by dividing through the maximum absolute value in each feature.
Missing values are transformed to 0.0.
When using aggregated feature summary from a previous session, out of bound values are truncated to -1.0 or 1.0 and FeatureRejection.OutOfBound rejections are reported.
object MinMaxScaler extends SettingsBuilder with Serializable
Transform features by rescaling each feature to a specific range [min, max] (default [0, 1]).
Missing values are transformed to min.
When using aggregated feature summary from a previous session, out of bound values are truncated to min or max and FeatureRejection.OutOfBound rejections are reported.
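The rescaling and truncation rules can be written out as a short sketch. It is not the library's code: MinMaxSketch, scale and encode are hypothetical names, dataMin and dataMax stand for the bounds observed during aggregation, and dataMax > dataMin is assumed.

```scala
object MinMaxSketch {
  // Rescale x from the observed [dataMin, dataMax] to the target [min, max].
  // Values outside the observed range are truncated to min or max.
  def scale(x: Double, dataMin: Double, dataMax: Double,
            min: Double = 0.0, max: Double = 1.0): Double = {
    val clipped = math.max(dataMin, math.min(dataMax, x))
    (clipped - dataMin) / (dataMax - dataMin) * (max - min) + min
  }

  // Missing values map to min.
  def encode(x: Option[Double], dataMin: Double, dataMax: Double,
             min: Double = 0.0, max: Double = 1.0): Double =
    x.map(scale(_, dataMin, dataMax, min, max)).getOrElse(min)
}
```

With an observed range of [0.0, 10.0] and the default target range, 5.0 maps to 0.5 and 15.0 is truncated to 1.0.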
object NGrams extends SettingsBuilder with Serializable
Transform a collection of sentences, where each row is a Seq[String] of the words / tokens, into a collection containing all the n-grams that can be constructed from each row. The feature representation is an n-hot encoding (see NHotEncoder) constructed from an expanded vocabulary of all of the generated n-grams.
N-grams are generated based on a specified range of low to high (inclusive) and are joined by the given sep (default is " "). For example, with low = 2, high = 3 and sep = "", row ["a", "b", "c", "d", "e"] would produce ["ab", "bc", "cd", "de", "abc", "bcd", "cde"].
As with NHotEncoder, missing values are transformed to [0.0, 0.0, ...].
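The n-gram generation step can be reproduced with the standard collections. This sketch covers only the expansion of a row into n-grams, not the subsequent n-hot encoding; NGramsSketch and ngrams are hypothetical names, and low >= 1 is assumed.

```scala
object NGramsSketch {
  // All n-grams of `row` for n in [low, high], shorter n-grams first, joined by `sep`.
  def ngrams(row: Seq[String], low: Int, high: Int, sep: String = " "): Seq[String] =
    (low to high).flatMap { n =>
      // sliding(n) yields one short window when the row has fewer than n tokens; drop it
      row.sliding(n).filter(_.size == n).map(_.mkString(sep))
    }
}
```

Calling NGramsSketch.ngrams(Seq("a", "b", "c", "d", "e"), 2, 3, "") produces the sequence given in the example above: ab, bc, cd, de, abc, bcd, cde.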
object NHotEncoder extends SettingsBuilder with Serializable
Transform a collection of categorical features to binary columns, with at most N one-values.
Missing values are either transformed to zero vectors or encoded as a missing value.
When using aggregated feature summary from a previous session, unseen labels are either transformed to zero vectors or encoded as unknown (if encodeMissingValue is true) and FeatureRejection.Unseen rejections are reported.
object NHotWeightedEncoder extends SettingsBuilder with Serializable
Transform a collection of weighted categorical features to columns of weight sums, with at most N values.
Weights of the same label in a row are summed, rather than encoded as 1.0 as in the normal NHotEncoder.
Missing values are either transformed to zero vectors or encoded as a missing value.
When using aggregated feature summary from a previous session, unseen labels are either transformed to zero vectors or encoded as unknown (if encodeMissingValue is true) and FeatureRejection.Unseen rejections are reported.
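The weight-summing behaviour can be sketched as follows. The WeightedLabel case class is redefined locally so the example is self-contained, and NHotWeightedSketch, encode and the explicit vocabulary argument are hypothetical. Unseen labels are simply dropped here, which corresponds to the zero-vector option only.

```scala
// Local stand-in for the WeightedLabel type member so the sketch is self-contained.
final case class WeightedLabel(name: String, value: Double)

object NHotWeightedSketch {
  // Sum the weights per label, then place each sum in the column of that label.
  // Labels absent from `vocabulary` (unseen) contribute nothing in this sketch.
  def encode(row: Seq[WeightedLabel], vocabulary: Seq[String]): Seq[Double] = {
    val sums = row.groupBy(_.name).map { case (name, ws) => (name, ws.map(_.value).sum) }
    vocabulary.map(sums.getOrElse(_, 0.0))
  }
}
```

For a row holding a with weight 0.5, b with weight 1.0 and a again with weight 0.25, and the vocabulary a, b, c, the result is 0.75, 1.0, 0.0.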
object Normalizer extends SettingsBuilder with Serializable
Transform vector features by normalizing each vector to have unit norm. Parameter p specifies the p-norm used for normalization (default 2).
Missing values are transformed to zero vectors.
When using aggregated feature summary from a previous session, vectors of different dimensions are transformed to zero vectors and FeatureRejection.WrongDimension rejections are reported.
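Unit-norm scaling with a p-norm can be shown in a few lines. This is a sketch for finite p >= 1 only; NormalizerSketch, norm and normalize are hypothetical names, and returning a zero vector unchanged is an assumption made to avoid division by zero.

```scala
object NormalizerSketch {
  // p-norm of a vector: (sum of |x_i|^p)^(1/p).
  def norm(v: Seq[Double], p: Double): Double =
    math.pow(v.map(x => math.pow(math.abs(x), p)).sum, 1.0 / p)

  // Divide each component by the p-norm; a zero vector is returned unchanged.
  def normalize(v: Seq[Double], p: Double = 2.0): Seq[Double] = {
    val n = norm(v, p)
    if (n == 0.0) v else v.map(_ / n)
  }
}
```

With the default p = 2, the vector (3.0, 4.0) has norm 5 and normalizes to approximately (0.6, 0.8); with p = 1, the vector (1.0, 3.0) normalizes to approximately (0.25, 0.75).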
object OneHotEncoder extends SettingsBuilder with Serializable
Transform a collection of categorical features to binary columns, with at most a single one-value.
Missing values are either transformed to zero vectors or encoded as a missing value.
When using aggregated feature summary from a previous session, unseen labels are either transformed to zero vectors or encoded as unknown (if encodeMissingValue is true) and FeatureRejection.Unseen rejections are reported.
object PolynomialExpansion extends SettingsBuilder with Serializable
Transform vector features by expanding them into a polynomial space, which is formulated by an n-degree combination of original dimensions.
Missing values are transformed to zero vectors.
When using aggregated feature summary from a previous session, vectors of different dimensions are transformed to zero vectors and FeatureRejection.WrongDimension rejections are reported.
object PositionEncoder extends SettingsBuilder with Serializable
Transform a collection of categorical features to a single value that is the position of that feature within the complete set of categories.
Missing values are transformed to zero, so they may collide with the first position. Rejections can be used to filter out this case.
When using aggregated feature summary from a previous session, unseen labels are ignored and FeatureRejection.Unseen rejections are reported.
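The collision between missing values and the first position is easiest to see in a sketch. PositionEncoderSketch, positions and encode are hypothetical names, and assigning positions in sorted order is an assumption of this sketch.

```scala
object PositionEncoderSketch {
  // Assign each distinct category a position; sorted order is an assumption.
  def positions(categories: Seq[String]): Map[String, Int] =
    categories.distinct.sorted.zipWithIndex.toMap

  // Position as a Double; missing or unseen labels become 0.0,
  // which is indistinguishable from the first position.
  def encode(label: Option[String], pos: Map[String, Int]): Double =
    label.flatMap(pos.get).map(_.toDouble).getOrElse(0.0)
}
```

With categories b, a, c, a, the positions are a = 0, b = 1, c = 2, so both a and a missing value encode to 0.0.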
object QuantileDiscretizer extends SettingsBuilder with Serializable
Transform a column of continuous features to n columns of binned categorical features. The number of bins is set by the numBuckets parameter.
The bin ranges are chosen using Algebird's QTree approximate data structure. The precision of the approximation can be controlled with the k parameter.
Missing values are transformed to zero vectors.
When using aggregated feature summary from a previous session, values outside of previously seen [min, max] are binned into the first or last bucket and FeatureRejection.OutOfBound rejections are reported.
object QuantileOutlierRejector extends SettingsBuilder with Serializable
Reject values in the first and/or last quantiles defined by the number of buckets in the numBuckets parameter.
The bin ranges are chosen using Algebird's QTree approximate data structure. The precision of the approximation can be controlled with the k parameter.
All values are transformed to zeros.
Values in the first and/or last quantiles are rejected as FeatureRejection.Outlier.
When using aggregated feature summary from a previous session, values outside of previously seen [min, max] will also report FeatureRejection.Outlier as rejection.
object StandardScaler extends SettingsBuilder with Serializable
Transform features by normalizing each feature to have unit standard deviation and/or zero mean. When withStd is true, it scales the data to unit standard deviation. When withMean is true, it centers the data with mean before scaling.
Missing values are transformed to 0.0 if withMean is true or population mean otherwise.
object TopNOneHotEncoder extends SettingsBuilder with Serializable
Transform a collection of categorical features to binary columns, with at most a single one-value. Only the top N items are tracked.
The list of top N is estimated with Algebird's SketchMap data structure. With probability at least 1 - delta, this estimate is within eps * N of the true frequency (i.e., true frequency <= estimate <= true frequency + eps * N), where N is the total size of the input collection.
Missing values are either transformed to zero vectors or encoded as unknown.
object VectorIdentity extends SettingsBuilder with Serializable
Transform fixed-length vectors by passing them through.
Similar to Identity but for a sequence of doubles.
Missing values are transformed to zero vectors.
When using aggregated feature summary from a previous session, vectors of different dimensions are transformed to zero vectors and FeatureRejection.WrongDimension rejections are reported.
object VonMisesEvaluator extends SettingsBuilder with Serializable
Transform a column of continuous features that represent the mean of a von Mises distribution to n columns of continuous features. The number n represents the number of points at which the von Mises distribution is evaluated. The von Mises pdf is given by
f(x | mu, kappa, scale) = exp(kappa * cos(scale * (x - mu))) / (2 * pi * I0(kappa))
where I0 is the modified Bessel function of the first kind of order zero. The pdf is only valid for x, mu in the interval [0, 2*pi/scale].
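The pdf can be evaluated numerically with a small sketch. The Bessel term in the denominator is taken to be the modified Bessel function of the first kind of order zero, computed here from its power series, which is adequate for moderate kappa. VonMisesSketch, besselI0, pdf, encode and the points argument are hypothetical names.

```scala
object VonMisesSketch {
  // Modified Bessel function of the first kind, order zero, via its power series:
  // I0(x) = sum over k >= 0 of (x / 2)^(2k) / (k!)^2.
  def besselI0(x: Double): Double = {
    val q = (x / 2) * (x / 2)
    var term = 1.0
    var sum = 1.0
    var k = 1
    while (term > 1e-17 * sum && k < 500) {
      term *= q / (k.toDouble * k) // ratio of consecutive series terms
      sum += term
      k += 1
    }
    sum
  }

  // exp(kappa * cos(scale * (x - mu))) / (2 * pi * I0(kappa))
  def pdf(x: Double, mu: Double, kappa: Double, scale: Double): Double =
    math.exp(kappa * math.cos(scale * (x - mu))) / (2 * math.Pi * besselI0(kappa))

  // Evaluate the pdf at each of the given points for a single mean mu.
  def encode(mu: Double, kappa: Double, scale: Double, points: Seq[Double]): Seq[Double] =
    points.map(pdf(_, mu, kappa, scale))
}
```

With kappa = 0 the density is uniform at 1 / (2 * pi); larger kappa concentrates the density around mu.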
