MTTR Calculator

Reduce Downtime & Increase Availability for GPU Clouds

Calculating MTTR for SLA Compliance in GPUaaS/NCP Environments

Hardware environments operated by NCPs and GPUaaS providers can be large. When a fault occurs, the provider needs to know how quickly it must be corrected to meet the availability SLA, which is often expressed as a number of nines, such as five nines (99.999%) or six nines (99.9999%). This worksheet takes the expected availability SLA as input and calculates the mean time to repair (MTTR) required for various types of GPUaaS/NCP hardware infrastructure faults. The mean time between failures (MTBF) figures are taken from a study by Meta (Facebook), which lists the fault types and the number of faults of each type observed over a specific number of GPU-hours.

MTBF Calculations
The numbers below are from the Meta (Facebook) paper.
Study Number of Days: 54
Study Number of Hours: 1,296
Study Number of GPUs: 16,384
Study Number of GPU Hours: 21,233,664
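
As a quick check, the GPU-hours figure follows from the other rows: days times 24 gives the study hours, and study hours times the number of GPUs gives GPU-hours. A minimal sketch in Python (the function name is illustrative and not taken from the study):

```python
HOURS_PER_DAY = 24

def study_gpu_hours(days: int, gpus: int) -> int:
    """Total GPU-hours = (days * 24 hours) * number of GPUs."""
    return days * HOURS_PER_DAY * gpus

# 54 days * 24 = 1,296 hours; 1,296 hours * 16,384 GPUs = 21,233,664 GPU-hours
print(study_gpu_hours(54, 16_384))  # 21233664
```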
Formulas
MTBF = Mean Time Between Failures
MTTR = Mean Time to Repair
MTBF = # of operational hours / # of failures
Availability = MTBF / (MTBF + MTTR)
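
Solving the availability formula for MTTR gives MTTR = MTBF * (1 - Availability) / Availability. The sketch below shows that rearrangement alongside the MTBF formula and a monthly downtime budget. It is an illustration under stated assumptions, not the calculator's actual implementation: the function names are hypothetical, and the 730-hour month (8,760 hours / 12) is an assumed convention.

```python
def mtbf_hours(operational_hours: float, failures: int) -> float:
    """MTBF = number of operational hours / number of failures."""
    if failures <= 0:
        raise ValueError("failures must be a positive count")
    return operational_hours / failures

def required_mttr_hours(mtbf: float, availability: float) -> float:
    """Rearranged from Availability = MTBF / (MTBF + MTTR)."""
    if not 0 < availability <= 1:
        raise ValueError("availability must be a fraction in (0, 1]")
    return mtbf * (1 - availability) / availability

def monthly_downtime_seconds(availability: float,
                             hours_per_month: float = 730.0) -> float:
    """Allowed downtime per month; 730 h/month is an assumed average."""
    return (1 - availability) * hours_per_month * 3600

# Hypothetical inputs: a 1,000-hour MTBF and a five-nines target.
mttr = required_mttr_hours(1000.0, 0.99999)
print(f"Required MTTR: {mttr * 60:.2f} minutes")
print(f"Monthly downtime: {monthly_downtime_seconds(0.99999):.2f} seconds")
```

Under these assumptions, a five-nines target allows about 26 seconds of downtime per month. The required MTTR scales linearly with MTBF, so fault types that occur more often leave less time for each repair.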

MTTR Calculator for GPUaaS Providers

[Interactive calculator: input fields labeled "%" and "Hours"; outputs are Required MTTR (minutes) and Monthly Downtime (seconds).]
GPUaaS Failure Analysis & MTBF
