MTTR Calculator: Reduce Downtime & Increase Availability for GPU Clouds

Lower your operating cost by registering your free NVIDIA GPU cycles with aggregators. Aarna offers an easy way to register (and unregister) your free GPU resources as spot instances.


AMCOP Architecture

Hardware environments at an NVIDIA Cloud Partner (NCP) or GPUaaS provider can be large. When a fault occurs, the provider needs to know how quickly it must be corrected to meet the availability SLA, which is usually expressed as a number of nines, such as five nines (99.999%) or six nines (99.9999%). This worksheet takes the expected availability SLA as input and calculates the mean time to repair (MTTR) required for each type of GPUaaS/NCP hardware infrastructure fault. The mean time between failures (MTBF) numbers come from a study by Meta (Facebook), which lists the types of faults and the number of faults of each type observed over a specific number of GPU-hours.


The numbers below are from the Meta/FB paper.
MTBF Calculations
Study Number of Days	54
Study Number of Hours	1,296
Study Number of GPUs	16,384
Study Number of GPU Hours	21,233,664
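
The rows above follow from one another: study hours are days times 24, and GPU-hours are study hours times the number of GPUs. A minimal Python sketch of that arithmetic is shown here; the function and constant names are illustrative and do not come from the worksheet.

```python
HOURS_PER_DAY = 24


def study_gpu_hours(days: int, gpus: int) -> int:
    """Total GPU-hours observed: wall-clock study hours times GPUs in the study."""
    return days * HOURS_PER_DAY * gpus


# Values taken from the table above.
study_hours = 54 * HOURS_PER_DAY
total_gpu_hours = study_gpu_hours(days=54, gpus=16_384)

print(f"{study_hours:,} hours, {total_gpu_hours:,} GPU-hours")
# prints "1,296 hours, 21,233,664 GPU-hours"
```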

Formulas
MTBF = Mean Time Between Failures
MTTR = Mean Time to Repair
MTBF = # of operational hours / # of failures
Availability = MTBF / (MTBF + MTTR)
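
Solving the availability formula for MTTR gives MTTR = MTBF × (1 − Availability) / Availability, which is the quantity the worksheet reports for a target SLA. The sketch below applies that rearrangement. It assumes that "operational hours" means the GPU-hours from the study, and the failure count of 100 is a hypothetical placeholder, not a figure from the Meta/FB paper. All function names are illustrative.

```python
def mtbf(operational_hours: float, failures: int) -> float:
    """MTBF = operational hours / number of failures."""
    if failures <= 0:
        raise ValueError("failures must be a positive integer")
    return operational_hours / failures


def required_mttr(mtbf_hours: float, availability: float) -> float:
    """Largest MTTR (in hours) that still meets the target availability.

    Rearranged from Availability = MTBF / (MTBF + MTTR).
    """
    if not 0.0 < availability < 1.0:
        raise ValueError("availability must be strictly between 0 and 1")
    return mtbf_hours * (1.0 - availability) / availability


def availability_from(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical failure count for one fault type (assumption, for illustration only).
HYPOTHETICAL_FAILURES = 100

fault_mtbf = mtbf(21_233_664, HYPOTHETICAL_FAILURES)
mttr_hours = required_mttr(fault_mtbf, availability=0.99999)  # five nines

print(f"MTBF: {fault_mtbf:,.2f} GPU-hours")
print(f"Required MTTR: {mttr_hours * 60:.1f} minutes")
```

Feeding the result back through `availability_from` recovers the target SLA, which is a quick way to check the rearrangement.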

MTTR Calculator for GPUaaS Providers

Inputs: NCP availability SLA; average manual repair time (hours)

Outputs: required MTTR (minutes); monthly downtime (seconds)
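
The monthly downtime figure depends only on the availability SLA: it is the fraction of the month during which the service is allowed to be unavailable. The sketch below assumes a 30-day month, which may differ from the convention the calculator uses; the function name and default are illustrative.

```python
SECONDS_PER_DAY = 86_400


def monthly_downtime_seconds(availability: float, days_in_month: int = 30) -> float:
    """Allowed downtime per month, in seconds, for a given availability SLA.

    Assumes a 30-day month by default (an assumption, not taken from the calculator).
    """
    if not 0.0 < availability <= 1.0:
        raise ValueError("availability must be in the range (0, 1]")
    return (1.0 - availability) * days_in_month * SECONDS_PER_DAY


for label, sla in [("five nines", 0.99999), ("six nines", 0.999999)]:
    print(f"{label}: {monthly_downtime_seconds(sla):.2f} seconds per month")
```

Under the 30-day assumption, five nines allows roughly 26 seconds of downtime per month and six nines roughly 2.6 seconds.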

GPUaaS Failure Analysis & MTBF

