Overview

Heroic is an open-source monitoring system originally built at Spotify to address problems faced with large scale gathering and near real-time analysis of metrics.

Heroic's main features are:

  • An architecture that scales with you, large or small.
  • Indefinite retention, as long as you have the hardware spend.
  • A rich query and filtering language, driven by Elasticsearch.
  • Federation support to connect multiple Heroic clusters into a global interface.

Heroic uses a small set of components which are responsible for very specific things.

Consumers

Consumers are the component responsible for consuming metrics.

We support consuming over Kafka, Pubsub, the collectd protocol, and over HTTP.

Out of these options, Pubsub or Kafka are preferred since they allow for a discovery-free architecture as a pub-sub system. They have proven to be horizontally scalable and resilient towards failures.

The following tutorials are available for configuring consumers:

Metrics

Metric storage can be handled by either Google Cloud Bigtable or Apache Cassandra.

Google Cloud Bigtable is a fully-managed petabyte-scale NoSQL database. It's the same database that powers many core Google services, including Search, Analytics, Maps, and Gmail. Bigtable has inspired the design of many other NoSQL databases, such as Apache HBase and Apache Cassandra.

The technique for storing time series data in Cassandra is described in Advanced Time Series with Cassandra. It was inspired by how it's implemented in KairosDB. Using Cassandra, you can store almost an indefinite amount of data as long as you are willing to spend the hardware on it.

For more information, see the section detailing how to configure metrics backends.

Metadata

We use Elasticsearch to store and make metadata available to a heroic cluster. It is the primary component that drives Heroic's Query Language.

Elasticsearch has been shown to not be reliable in terms of data safety. Because of this, Heroic uses Elasticsearch in a way so that it is not the primary storage and can rapidly be rebuilt.

For more information, see the section detailing how to configure metadata backends.

Suggestions

When building Heroic it was quickly realized that navigating hundreds of millions of time series without context is hard. To address this, a specialized Elasticsearch backend was built to handle suggestions.

Suggestions provide information about what tags are available for a specific context. Assuming you are interested in which what tags exist for a given role, you could do the following:


$ curl -H "Content-Type: application/json" <url>/metadata/suggest-tag -d \
  '{"filter": ["=", "role", "heroic"], "key": "what", "value": "us"}'

{
  "suggestions": [
    {"score": "1.0", "key": "what", "value": "cpu-usage"},
    {"score": "1.0", "key": "what", "value": "disk-used-percentage"}
  ]
}

Take note on the partial matching of us in cpu-usage and disk-used-percentage. This is the point of suggestions, to provide the user with the most relevant matches for a specific input. Especially partial ones. A typical use-case would be to fill the content of drop-down box as the user is typing.

Suggestions are different from Metadata since it uses more intensive indexing techniques to analyze your tags. Having it as a separate module is useful since they are not critical to the operation of Heroic. Or to be more specific, its Query Language.

Clustering and Federation

Main article: Federated Clusters

Heroic has support for federating requests, which allows multiple independent Heroic clusters to serve clients through a single global interface. This can be used to reduce the amount of geographical traffic by allowing one cluster to operate completely isolated within its zone.

A client querying any heroic node in a federation will cause it to fan out to all known shards and merge the result.

Federations tries to be as transparent as possible in the face of problems. Each request that fans out to a shard has the potential to fail, preventing that data to become unavailable.

In the face of errors, successful shards will be returned as normal. The failing shards will be specifically reported as such, and it is left to the client to decide what to do next.