1. Introduction

High-Performance Computing (HPC) and Artificial Intelligence / Machine Learning (AI/ML) workloads are integral to solving some of the most challenging problems in science, engineering, and business. HPC workloads enable precise simulations of natural and physical phenomena, while AI/ML workloads empower systems to learn, adapt, and make predictions from data.

Efficient execution of these tasks relies on robust job scheduling, optimized resource utilization, and effective coordination between CPUs and GPUs. While HPC workloads benefit from traditional, deterministic scheduling, the diversity of AI/ML workloads calls for more dynamic and workload-specific approaches. This blog explores the intricacies of HPC and AI/ML workloads and the critical role of job scheduling, and maps the most suitable schedulers to specific workload types.

2. Understanding the nature of various HPC and AI/ML Workloads

HPC workloads revolve around solving deterministic problems through simulations or computational models. These tasks often run for extended durations and require vast computational power distributed across CPUs and GPUs. HPC jobs often need uninterrupted execution, and their use cases do not require real-time responses.

AI/ML workloads are fundamentally data-driven and probabilistic, encompassing a diverse set of tasks. These workloads range from training and fine-tuning machine learning models to deploying them for inference and decision-making.

Let’s delve into various categories of AI/ML workloads:

  • Training Large Language Models (LLMs), such as GPT or BERT, is a foundational AI workload. It involves optimizing billions of parameters over massive datasets so that models can understand and generate human-like responses. The workload requires extensive GPU resources over long periods that can span several months.
  • Small Language Model (SLM) training focuses on smaller, domain-specific models that are optimized for particular applications. GPUs still dominate the training process, but SLMs need GPU infrastructure for shorter durations, which means frequent setup and teardown of infrastructure.
  • Fine-tuning adapts pre-trained models to specific datasets or domains, allowing businesses to customize AI capabilities for their unique needs. It relies on GPUs for computationally intensive retraining but needs fewer GPU cycles than SLM training. For this reason, fine-tuning jobs can be even more dynamic than SLM training jobs.
  • Batch inference processes large datasets using pre-trained models in a non-real-time manner. The GPU usage time for these workloads can be predicted from the size of the data to be processed. These jobs tend to be even shorter than fine-tuning jobs.
  • Retrieval-Augmented Generation (RAG) is a technique that combines a generative model (e.g., GPT) with an external knowledge retrieval system to generate responses. RAG is characterized by bursty GPU usage, mainly for inference, with minimal training. Unlike the workloads above, RAG is a transactional workload that is “always on” and scales dynamically with the load.
  • Real-time inference is another vital AI/ML workload, used in latency-sensitive applications. This workload is also transactional and relies on GPUs to perform the necessary computations with minimal delay. Not all RAG and real-time inference jobs need GPUs, however; some run perfectly well on CPUs.

Of the AI/ML workload categories above, RAG and real-time inference need quick response times, and their traffic is bursty, with peaks and valleys.
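The distinctions drawn above can be captured in a small data structure. The Python sketch below is only an illustration: the class names, the `duration_rank` field, and the rank values are assumptions introduced here to encode the relative ordering described in the list (LLM training runs longest, batch inference shortest), not measured durations.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Mode(Enum):
    BATCH = "batch"                  # runs to completion, no real-time response needed
    TRANSACTIONAL = "transactional"  # "always on" and latency-sensitive


@dataclass(frozen=True)
class WorkloadProfile:
    name: str
    mode: Mode
    duration_rank: Optional[int]  # 1 = longest-running batch job; None for always-on services
    bursty: bool


# Relative ordering taken from the descriptions above; the ranks are
# illustrative labels (an assumption), not measured durations.
PROFILES = [
    WorkloadProfile("LLM training", Mode.BATCH, 1, False),
    WorkloadProfile("SLM training", Mode.BATCH, 2, False),
    WorkloadProfile("Fine-tuning", Mode.BATCH, 3, False),
    WorkloadProfile("Batch inference", Mode.BATCH, 4, False),
    WorkloadProfile("RAG", Mode.TRANSACTIONAL, None, True),
    WorkloadProfile("Real-time inference", Mode.TRANSACTIONAL, None, True),
]


def latency_sensitive(profiles):
    """Return the names of workloads that need quick response times."""
    return [p.name for p in profiles if p.mode is Mode.TRANSACTIONAL]


print(latency_sensitive(PROFILES))
```

Separating batch from transactional workloads in this way is useful because the two groups place different demands on a scheduler: the first needs queueing and long reservations, the second needs fast scaling.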

To summarize, the table below compares all the AI/ML workloads.

3. Why Is Job Scheduling Critical?

Efficient job scheduling is vital for handling the diverse demands of HPC and AI/ML workloads. It ensures that compute resources, whether CPUs or GPUs, are utilized optimally, preventing bottlenecks and idle time. Scheduling also allows dynamic adaptation to fluctuating workloads, so that resources are allocated in real time based on priority and demand.
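As a rough illustration of priority-based allocation, the following Python sketch starts queued jobs in priority order while free GPUs remain. The `Job` class, the `schedule` function, the job names, and the convention that a lower number means higher priority are all assumptions made for this example. Real schedulers add pre-emption, fairness, and topology awareness on top of this basic idea.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Job:
    priority: int                     # lower number = more urgent (assumption)
    name: str = field(compare=False)
    gpus: int = field(compare=False)  # number of GPUs the job requests


def schedule(jobs, free_gpus):
    """Greedily start jobs in priority order while GPUs remain.

    A job that does not fit is left waiting, and smaller lower-priority
    jobs behind it may still start. Returns (started, waiting) as lists
    of job names.
    """
    queue = list(jobs)
    heapq.heapify(queue)  # min-heap ordered by priority
    started, waiting = [], []
    while queue:
        job = heapq.heappop(queue)
        if job.gpus <= free_gpus:
            free_gpus -= job.gpus
            started.append(job.name)
        else:
            waiting.append(job.name)
    return started, waiting


# Hypothetical queue on a node pool with six free GPUs.
jobs = [
    Job(2, "fine-tune", 4),
    Job(1, "realtime-inference", 2),
    Job(3, "batch-inference", 4),
]
started, waiting = schedule(jobs, free_gpus=6)
print("started:", started)
print("waiting:", waiting)
```

With six free GPUs, the two higher-priority jobs start and the batch inference job waits until GPUs are released.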

Having looked at the characteristics of various HPC and AI workloads, let’s take a look at some of the prominent job scheduling options.

SLURM (originally the Simple Linux Utility for Resource Management) is a robust open-source job scheduler designed for HPC environments. It efficiently manages and allocates resources, such as CPUs, GPUs, and memory, across large clusters of nodes. SLURM supports features like job prioritization, pre-emption, dependency handling, and partitioning, making it ideal for running computationally intensive, batch-oriented workloads.
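To make this concrete, the sketch below uses Python to render a minimal SLURM batch script. The `#SBATCH` options shown (`--job-name`, `--partition`, `--nodes`, `--gres`, `--time`) are standard `sbatch` directives, but the helper `render_sbatch`, the partition name `gpu`, and the `./simulate` command are hypothetical placeholders that would differ on any real cluster.

```python
def render_sbatch(job_name, nodes, gpus_per_node, time_limit, partition, command):
    """Build the text of a SLURM batch script from a few common options."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --partition={partition}",     # partition names are site-specific
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --gres=gpu:{gpus_per_node}",  # GPUs requested on each node
        f"#SBATCH --time={time_limit}",         # wall-clock limit, e.g. HH:MM:SS
        "",
        f"srun {command}",                      # launch the command on the allocation
    ]
    return "\n".join(lines) + "\n"


# Hypothetical long-running simulation on 4 nodes with 8 GPUs each.
script = render_sbatch(
    job_name="cfd-sim",
    nodes=4,
    gpus_per_node=8,
    time_limit="48:00:00",
    partition="gpu",
    command="./simulate --input case.dat",
)
print(script)
```

The resulting text could be saved to a file and submitted with `sbatch`, after which SLURM queues the job until the requested nodes and GPUs are available.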

Ray is an open-source framework for parallel and distributed execution of AI/ML workloads. Ray's job scheduling capabilities enable distributed model training, fault tolerance, task pre-emption, and scaling, which results in efficient resource utilization. Ray lets developers harness multi-node GPU processing by defining tasks that execute in parallel across distributed GPU clusters spanning on-premises, cloud, or hybrid environments.

Kubernetes (with the NVIDIA GPU Operator) enables scheduling of GPU-accelerated workloads, managing pod-level resource allocation, and ensuring seamless scaling for AI/ML applications. It supports a wide range of use cases, including containerized training pipelines, inference deployments, and batch processing, particularly in multi-tenant or cloud-native environments. By combining Kubernetes’ powerful orchestration features with NVIDIA's GPU optimizations, this solution simplifies the execution of AI/ML workloads in scalable, containerized environments while providing flexibility for both batch and real-time processing tasks. Kubernetes can also be used for transactional workloads, with soft isolation through separate namespaces. We call this isolation soft because it is not 100% secure, but it is good enough for sharing a cluster across departments within an organization.
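As a minimal sketch of pod-level GPU allocation, the Python snippet below builds a pod manifest that requests GPUs through the `nvidia.com/gpu` extended resource, which the NVIDIA device plugin advertises on GPU nodes. The function `gpu_pod_manifest`, the namespace `team-a`, the pod name, and the image URL are hypothetical and chosen only for illustration.

```python
import json


def gpu_pod_manifest(name, namespace, image, gpus):
    """Return a Kubernetes Pod manifest (as a dict) that requests `gpus` GPUs."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        # The namespace provides the soft isolation between departments.
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": name,
                    "image": image,
                    # GPUs are requested via resource limits on the container.
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }


# Hypothetical fine-tuning pod for one department's namespace.
manifest = gpu_pod_manifest(
    name="fine-tune-job",
    namespace="team-a",
    image="registry.example.com/team-a/trainer:latest",
    gpus=1,
)
print(json.dumps(manifest, indent=2))
```

Because `kubectl` accepts JSON as well as YAML, the printed manifest could be saved to a file and applied with `kubectl apply -f`. The scheduler then places the pod only on a node with a free GPU.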

Run:ai is a proprietary Kubernetes-native platform that optimizes GPU resource utilization and scheduling by extending the native Kubernetes (K8s) scheduling capabilities specifically for AI/ML workloads. It enables dynamic GPU allocation, pooling, and partitioning, allowing multiple jobs to share GPUs efficiently. Run:ai’s job scheduling capabilities include pre-emption, priority-based scheduling, and resource quotas, ensuring fair allocation in multi-tenant environments.

The table below gives a brief comparison of the features offered by the job schedulers described above.

4. Finding the best-fit job scheduler

The best-fit job scheduler depends on the short-term and long-term nature of the AI workloads that are planned for the available GPU infrastructure.

For longer-duration LLM training that spans months, bare metal nodes serve the purpose without any need for a job scheduler. The only requirement is a sophisticated infrastructure manager that automates bare metal provisioning, so that nodes can be quickly reprovisioned for any new LLM training requirement.

For any other category of AI/ML workload, it is prudent to use a job scheduler.

The table below shows how the various job schedulers fare in supporting the different categories of AI/ML workloads.

As is evident from the table above, no single job scheduler perfectly addresses the diverse requirements of all use cases. To bridge this gap, at aarna.ml, we have developed a comprehensive GPU Cloud Management Software (CMS) that enhances existing job, instance, and model scheduling solutions by introducing robust multi-tenancy, hard isolation, and seamless support for a wide range of workloads, including bare metal, VMs, containers, and Kubernetes pods.

5. Conclusion

Efficient job scheduling is essential for maximizing the performance of HPC and AI workloads. By understanding the unique strengths and limitations of these schedulers and aligning them with specific workload demands, organizations can achieve optimal resource utilization, reduce costs, and unlock the full potential of their computational infrastructure.