MTTR Calculator for GPU Cloud: Reduce Downtime & Increase Availability

Calculating MTTR for SLA Compliance in GPUaaS/NCP Environments

The hardware environments of NCPs and GPUaaS providers can be large. When a fault occurs, the provider needs to know how quickly it must be corrected to meet the availability SLA, which is often expressed as a number of nines, such as five nines (99.999%) or six nines (99.9999%). This worksheet takes the expected availability SLA as input and calculates the mean time to repair (MTTR) required for each type of GPUaaS/NCP hardware infrastructure fault. The mean time between failures (MTBF) figures are taken from a study by Meta (Facebook), which lists the fault types and the number of faults of each type observed over a specific number of GPU-hours.

The numbers below are taken from the Meta/FB paper.
MTBF Calculations
Study Number of Days	54
Study Number of Hours	1,296
Study Number of GPUs	16,384
Study Number of GPU Hours	21,233,664
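
The study figures above are related by simple arithmetic: hours are days times 24, and GPU-hours are hours times the number of GPUs. A minimal sketch that reproduces them follows; the variable names are illustrative and are not taken from the worksheet.

```python
# Study parameters as listed in the table above.
STUDY_DAYS = 54
STUDY_GPUS = 16_384

# Wall-clock hours covered by the study.
study_hours = STUDY_DAYS * 24

# Total GPU-hours: every GPU is assumed to be in service for the whole study.
gpu_hours = study_hours * STUDY_GPUS

print(f"{study_hours:,} hours, {gpu_hours:,} GPU-hours")
# prints "1,296 hours, 21,233,664 GPU-hours"
```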

Formulas
MTBF = Mean Time Between Failures
MTTR = Mean Time to Repair
MTBF = # of operational hours / # of failures
Availability = MTBF / (MTBF + MTTR)
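
Rearranging the availability formula for MTTR gives MTTR = MTBF x (1 - Availability) / Availability, which is the quantity the worksheet is described as calculating from a target SLA. The sketch below illustrates that rearrangement under stated assumptions: the function names are hypothetical, the failure count is a made-up placeholder rather than a figure from the study, and using GPU-hours as the operational hours is one possible reading of the MTBF formula above.

```python
def mtbf_hours(operational_hours: float, failures: int) -> float:
    """MTBF = # of operational hours / # of failures."""
    if failures <= 0:
        raise ValueError("failures must be a positive integer")
    return operational_hours / failures


def required_mttr_hours(mtbf: float, availability: float) -> float:
    """Solve Availability = MTBF / (MTBF + MTTR) for MTTR."""
    if not 0 < availability <= 1:
        raise ValueError("availability must be in the range (0, 1]")
    return mtbf * (1 - availability) / availability


def availability_from(mtbf: float, mttr: float) -> float:
    """Availability = MTBF / (MTBF + MTTR)."""
    return mtbf / (mtbf + mttr)


# Operational hours from the study table (GPU-hours).
gpu_hours = 21_233_664

# HYPOTHETICAL failure count for one fault type; not a number from the study.
hypothetical_failures = 100

# Example SLA target of three nines; substitute the SLA you need to meet.
target_availability = 0.999

mtbf = mtbf_hours(gpu_hours, hypothetical_failures)
mttr = required_mttr_hours(mtbf, target_availability)

print(f"MTBF: {mtbf:,.2f} hours")
print(f"Required MTTR for {target_availability:.3%} availability: {mttr:,.2f} hours")
```

Feeding the resulting MTTR back into availability_from(mtbf, mttr) should return the target availability, which is a quick way to check the rearrangement.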
