PubSub Consumers Fault Tree Analysis

The diagram below shows the fault tree for Heroic consumers that use PubSub for ingestion and Bigtable and Elasticsearch for storage. It is a useful tool for reasoning about the reliability of the system and for determining reasonable SLOs.

The model makes several simplifying assumptions that reduce its accuracy, so the calculated probabilities should be read as quite pessimistic. The biggest assumptions are:

  • Numbers for Google Cloud services are taken from their SLAs. In reality, the services will typically be much more reliable than the numbers listed, since SLAs are contractual commitments and are set conservatively.
  • Elasticsearch shard failure is modeled as any n data nodes in the cluster failing, where n is the replication factor. An actual shard failure requires every node holding a copy of that shard to fail at the same time, not just any n nodes.
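Both assumptions can be illustrated with a small calculation. The sketch below is not part of the model itself. It assumes node failures are independent, that n counts every copy of a shard, and it uses made-up values for the cluster size, node failure probability, and SLA percentage. The function names are hypothetical.

```python
from math import comb


def sla_to_failure_probability(sla_percent: float) -> float:
    """Treat an SLA availability percentage as a worst-case failure probability."""
    return 1.0 - sla_percent / 100.0


def p_at_least_k_of_n_fail(n_nodes: int, k: int, p_node: float) -> float:
    """Probability that k or more of n_nodes fail (binomial tail).

    Assumes each node fails independently with probability p_node.
    This corresponds to the pessimistic "any n nodes" modeling.
    """
    return sum(
        comb(n_nodes, i) * p_node**i * (1.0 - p_node) ** (n_nodes - i)
        for i in range(k, n_nodes + 1)
    )


def p_specific_k_fail(k: int, p_node: float) -> float:
    """Probability that one specific set of k nodes all fail together."""
    return p_node**k


# Illustrative values only; none of these come from the real model.
CLUSTER_SIZE = 10   # data nodes in the cluster (assumed)
REPLICATION = 3     # n, copies of each shard (assumed)
P_NODE = 0.01       # per-node failure probability (assumed)

p_modeled = p_at_least_k_of_n_fail(CLUSTER_SIZE, REPLICATION, P_NODE)
p_one_shard = p_specific_k_fail(REPLICATION, P_NODE)

print(f"SLA 99.9% -> failure probability {sla_to_failure_probability(99.9):.4f}")
print(f"modeled (any {REPLICATION} of {CLUSTER_SIZE} nodes fail): {p_modeled:.3e}")
print(f"one shard's {REPLICATION} specific nodes fail:  {p_one_shard:.3e}")
```

Losing all copies of any shard implies that at least n nodes have failed, so under these assumptions the modeled probability is an upper bound on the real one.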

Click and drag the diagram to pan around. Branches can be collapsed/expanded by clicking on a node, and hovering over a node will show a longer description.

Input Data

The source data used to generate the fault tree can be downloaded here: pubsub-consumers.mef

The input is in the Open-PSA Model Exchange Format and was rendered using https://github.com/hexedpackets/fault_tree/
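For readers unfamiliar with the format, the following is a minimal sketch of how a fault tree in Open-PSA MEF can be read and evaluated with the Python standard library. The embedded XML is a tiny made-up example, not an excerpt from pubsub-consumers.mef, and the event names and probabilities are hypothetical. The evaluation assumes independent basic events that each appear only once in the tree, and it handles only `and` and `or` gates.

```python
import math
import xml.etree.ElementTree as ET

# Hypothetical two-gate model; names and values are illustrative only.
EXAMPLE_MEF = """
<opsa-mef>
  <define-fault-tree name="example">
    <define-gate name="top">
      <or>
        <basic-event name="ingest-down"/>
        <gate name="storage-down"/>
      </or>
    </define-gate>
    <define-gate name="storage-down">
      <and>
        <basic-event name="replica-a-down"/>
        <basic-event name="replica-b-down"/>
      </and>
    </define-gate>
  </define-fault-tree>
  <model-data>
    <define-basic-event name="ingest-down"><float value="0.001"/></define-basic-event>
    <define-basic-event name="replica-a-down"><float value="0.01"/></define-basic-event>
    <define-basic-event name="replica-b-down"><float value="0.01"/></define-basic-event>
  </model-data>
</opsa-mef>
""".strip()


def load_model(xml_text):
    """Return (basic event probabilities, gate formulas) keyed by name."""
    root = ET.fromstring(xml_text)
    probs = {
        e.get("name"): float(e.find("float").get("value"))
        for e in root.iter("define-basic-event")
    }
    gates = {
        g.get("name"): next(c for c in g if c.tag in ("and", "or"))
        for g in root.iter("define-gate")
    }
    return probs, gates


def probability(name, probs, gates):
    """Recursively compute the probability of a basic event or gate."""
    if name in probs:
        return probs[name]
    formula = gates[name]
    child_ps = [probability(c.get("name"), probs, gates) for c in formula]
    if formula.tag == "and":
        # All children must fail.
        return math.prod(child_ps)
    # "or": at least one child fails.
    return 1.0 - math.prod(1.0 - p for p in child_ps)


probs, gates = load_model(EXAMPLE_MEF)
top_probability = probability("top", probs, gates)
print(f"P(top) = {top_probability:.7f}")
```

A real model may use other gate types, such as `atleast`, and may share events between branches. In that case the simple product rule above no longer holds and a dedicated analysis tool is the better choice.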