As AI workloads become more demanding, Network Cloud Providers (NCPs) and AI Cloud Providers face significant challenges in scalability, resource efficiency, and multi-tenancy management. Traditional static GPU allocations lead to underutilized resources, increased operational overhead, and a lack of flexibility in handling diverse workloads like LLM training, batch inference, and real-time AI inferencing.
To address these challenges, aarna.ml GPU Cloud Management Software (CMS) introduces a comprehensive, automated, and scalable reference architecture (RA) that enables multi-tenant GPUaaS (GPU-as-a-Service) along with additional capabilities such as PaaS, Job Submission, and Model Serving. This RA blueprint integrates seamlessly with a GPU cloud provider's existing GPU environment, enabling efficient GPU orchestration, workload isolation, and intelligent scheduling. The result is maximum efficiency, flexibility, and cost optimization in AI infrastructure.
Key Challenges in Multi-Tenant AI Cloud Infrastructure
Unified Multi-Tenancy Management
Multi-tenancy in AI infrastructure is often fragmented across compute, networking, storage, and PaaS layers, making tenant onboarding complex and inefficient. Without a unified framework, providers struggle with:
- Inconsistent isolation policies between different infrastructure stacks.
- Manual intervention required for onboarding and resource allocation.
- Lack of automation, leading to operational inefficiencies.
aarna.ml GPU CMS provides a cohesive multi-tenancy model that ensures:
- Multi-tenancy across compute infrastructure, networking fabric (Ethernet and InfiniBand), external storage, and PaaS.
- Seamless tenant onboarding with automation.
- Granular workload isolation at compute, storage, and networking layers.
- Role-based access control (RBAC) for security and governance.
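The unified tenancy model above can be pictured as a single tenant record that carries the isolation settings for compute, network, and storage together with its RBAC roles, so onboarding becomes one automated step. The sketch below is illustrative only; the class and field names are assumptions, not the product's actual data model.

```python
from dataclasses import dataclass, field

@dataclass
class Tenant:
    """Hypothetical unified tenant record: one object spans all layers."""
    name: str
    roles: dict = field(default_factory=dict)  # user -> role (RBAC)
    vlan_id: int = 0                           # isolated Ethernet segment
    ib_pkey: int = 0                           # InfiniBand partition key
    storage_volume: str = ""                   # dedicated external volume

def onboard_tenant(name: str, admin_user: str, vlan_id: int, ib_pkey: int) -> Tenant:
    """Automated onboarding: assign isolated network segments and a storage
    volume, and grant the first user the 'admin' role, in a single call."""
    return Tenant(
        name=name,
        roles={admin_user: "admin"},
        vlan_id=vlan_id,
        ib_pkey=ib_pkey,
        storage_volume=f"vol-{name}",
    )

def can_manage(tenant: Tenant, user: str) -> bool:
    """RBAC check: only tenant admins may change tenant resources."""
    return tenant.roles.get(user) == "admin"

t = onboard_tenant("acme", "alice", vlan_id=101, ib_pkey=0x8001)
```

Because every layer's isolation setting lives in one record, there is no opportunity for the compute, network, and storage stacks to drift into inconsistent policies.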
Supporting Diverse AI Workloads on a Single Platform
Modern AI cloud infrastructure needs to support a wide range of workloads, including:
- Static Bare-Metal GPU allocations for high-performance computing.
- Kubernetes-based PaaS solutions for scalable AI workloads.
- Job Submission & Model Serving for AI inference and automation.
The aarna.ml CMS RA enables:
- Flexible GPUaaS offerings, including IaaS and PaaS models.
- Kubernetes-native orchestration to dynamically allocate workloads.
- Integration with AI model deployment frameworks (NVIDIA NIM, Hugging Face, vLLM).
Maximizing GPU Utilization with Intelligent Orchestration
AI workloads such as Large Language Model (LLM) training, Small Language Model (SLM) fine-tuning, DL training, batch inference, real-time inferencing, and Retrieval-Augmented Generation (RAG) have highly varied GPU utilization patterns. Without dynamic orchestration, GPU resources remain underutilized, leading to higher costs and lower efficiency.
aarna.ml GPU CMS provides native job scheduling capabilities along with options to integrate with third-party job schedulers such as Run:ai and SLURM to maximize GPU utilization. In addition to supporting these schedulers, aarna.ml GPU CMS uses MIG (Multi-Instance GPU) partitioning to further raise utilization. NCPs and AI cloud providers can also register spare GPU capacity with NVIDIA Cloud Functions (NVCF) through aarna.ml GPU CMS, putting GPUs to work during periods such as low traffic on inference workloads.
The aarna.ml CMS RA enables:
- Dynamic scaling of GPU resources based on workload demand.
- Intelligent workload scheduling to optimize infrastructure efficiency.
- Per-tenant GPU isolation using MIG or full GPU partitioning.
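To make the MIG-based utilization gain concrete, the sketch below places a job on the smallest slice of an NVIDIA A100-40GB that satisfies its memory request, falling back to a full GPU only for large jobs. The MIG profile names and sizes follow NVIDIA's published A100 profiles; the placement function itself is an illustrative simplification, not the product's actual scheduler.

```python
# Published NVIDIA A100-40GB MIG profiles, smallest first: (name, memory GB).
MIG_PROFILES = [
    ("1g.5gb", 5),
    ("2g.10gb", 10),
    ("3g.20gb", 20),
]
FULL_GPU_GB = 40

def place(job_mem_gb: float) -> str:
    """Return the smallest MIG slice (or full GPU) that fits the request."""
    for profile, mem_gb in MIG_PROFILES:
        if job_mem_gb <= mem_gb:
            return profile
    if job_mem_gb <= FULL_GPU_GB:
        return "full-gpu"
    raise ValueError("request exceeds one GPU; needs multi-GPU placement")
```

For example, a 4 GB inference job lands on a 1g.5gb slice, leaving the rest of the GPU free for other tenants, whereas a static full-GPU allocation would have stranded roughly 90% of the card's memory.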
aarna.ml GPU CMS Reference Architecture (RA)
The aarna.ml GPU CMS RA is designed as a modular, scalable, and API-driven framework that enables providers to build and manage a multi-tenant AI cloud platform. In just a few weeks, the aarna.ml GPU CMS can be integrated with the GPUaaS provider’s GPU environment. The RA consists of several core components that work together to ensure high efficiency, seamless orchestration, and security in AI cloud operations. aarna.ml GPU CMS adheres to NVIDIA Reference Architecture for NCP, NCP Telco, Spectrum-X Compute Network Fabric, Quantum-2 InfiniBand Platform, High Performance Storage Design, and Common Networking.
Tenant Management & Isolation
One of the most critical aspects of a multi-tenant AI cloud is ensuring secure tenant isolation while enabling seamless access to shared GPU resources. The aarna.ml GPU CMS RA introduces a hierarchical multi-tenancy model that enables cloud providers to:
- Create isolated virtual environments per tenant using NVIDIA MIG (Multi-Instance GPU) or full GPU allocation, ensuring that different customers or workloads do not interfere with each other.
- Implement automated tenant provisioning with built-in role-based access control (RBAC) and dynamic policy enforcement for secure AI workload execution.
- Enable per-tenant Ethernet and InfiniBand segmentation for ultra-low-latency communication in AI/ML workloads, ensuring high performance while maintaining isolation.
- Create secure, isolated per-tenant storage volumes on external high-performance storage arrays, complementing network isolation.
With these capabilities, AI cloud providers can offer secure, scalable, and fully automated multi-tenant GPU environments that meet enterprise requirements.
GPU Orchestration & Scheduling
To maximize GPU utilization and efficiency, aarna.ml GPU CMS integrates dynamic GPU scheduling and workload orchestration mechanisms. Instead of relying on static GPU allocations, this RA allows for:
- Native job scheduling with Kubernetes support, where workloads are dynamically assigned GPU resources based on real-time demand.
- Seamless integration with SLURM, Run:ai, and NVIDIA NVCF, enabling intelligent scheduling and GPU auto-scaling for training, inference, and fine-tuning workloads.
- Dynamic GPU provisioning, allowing workloads to request GPU resources on demand while ensuring optimal allocation efficiency.
These capabilities eliminate GPU underutilization by ensuring that resources are only allocated when needed and freed when workloads complete, reducing costs and improving efficiency.
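The allocate-on-demand, free-on-completion lifecycle described above can be sketched as a shared GPU pool with short-lived leases. The class and method names below are assumptions for illustration, not aarna.ml APIs.

```python
# Hedged sketch of demand-driven allocation: GPUs come from a shared pool
# when a workload starts and return when it completes, instead of being
# pinned to a tenant for the duration of a contract.
class GpuPool:
    def __init__(self, total: int):
        self.free = total
        self.leases: dict = {}  # job_id -> GPU count

    def allocate(self, job_id: str, count: int) -> bool:
        """Grant GPUs if capacity exists; otherwise the job waits in queue."""
        if count > self.free:
            return False
        self.free -= count
        self.leases[job_id] = count
        return True

    def release(self, job_id: str) -> None:
        """Return a completed job's GPUs to the shared pool."""
        self.free += self.leases.pop(job_id)

pool = GpuPool(total=8)
pool.allocate("train-llm", 6)
assert not pool.allocate("finetune", 4)  # would exceed capacity, so it queues
pool.release("train-llm")                # training completes; GPUs return
pool.allocate("finetune", 4)             # queued job now runs
```

Under static allocation, the fine-tuning tenant would have needed its own dedicated GPUs even while idle; with pooling, the same eight GPUs serve both workloads back to back.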
Infrastructure as a Service (IaaS) & Platform as a Service (PaaS)
To support a diverse range of AI workloads, the aarna.ml GPU CMS RA provides:
- Bare-metal and VM-based infrastructure for traditional GPU compute workloads.
- Kubernetes-based GPUaaS, allowing enterprises to deploy and scale AI models dynamically.
- Self-service AI job submission, enabling data scientists and AI developers to access GPU resources through APIs and dashboards.
- Seamless AI model deployment and serving, integrating with NVIDIA NIM, Hugging Face, and vLLM for real-time inference workloads.
By offering both IaaS and PaaS models, AI cloud providers can cater to enterprise users, researchers, and AI startups alike.
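Self-service job submission through APIs might look like the following client-side sketch, which assembles the request body a data scientist's tooling could POST to the provider's submission endpoint. All field names here are hypothetical; consult the actual API documentation for the real schema.

```python
# Hypothetical illustration of a self-service job-submission payload;
# the field names are assumptions, not the documented aarna.ml API.
def build_job_request(tenant: str, image: str, gpus: int,
                      gpu_type: str = "A100") -> dict:
    """Assemble a JSON-serializable job spec for the submission endpoint."""
    return {
        "tenant": tenant,
        "container_image": image,
        "resources": {"gpus": gpus, "gpu_type": gpu_type},
        "scheduler": "native",  # or "slurm" / "runai", per the integrations
    }

req = build_job_request("acme", "nvcr.io/nvidia/pytorch:24.05-py3", gpus=2)
```

The point of the self-service flow is that the tenant never files a ticket: the request carries everything the scheduler needs to place the job against that tenant's quota.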
Monitoring, Security, and Billing
With multi-tenant AI clouds, monitoring, security, and billing become critical aspects of operations. The aarna.ml GPU CMS RA includes:
- Real-time GPU utilization monitoring and analytics, providing insights into workload performance, GPU allocation efficiency, and cost tracking.
- Built-in security policies and RBAC, ensuring that AI workloads remain isolated and protected from unauthorized access.
- Integration with third-party billing products for GPU and token usage.
With these capabilities, AI cloud providers can maintain full control, visibility, fault management, and security over their GPUaaS offerings.
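The metering side of billing integration can be sketched as aggregating per-tenant usage records into a line-item total for export to the third-party billing product. The rates below are made-up numbers for the example, not published pricing.

```python
# Illustrative metering sketch: fold per-tenant GPU-hour and token usage
# records into a single invoice total. Rates are assumptions.
RATE_PER_GPU_HOUR = 2.50   # USD, assumed for illustration
RATE_PER_1K_TOKENS = 0.02  # USD, assumed for illustration

def bill(records: list) -> dict:
    """Aggregate usage records (dicts with optional 'gpu_hours'/'tokens')."""
    totals = {"gpu_hours": 0.0, "tokens": 0}
    for r in records:
        totals["gpu_hours"] += r.get("gpu_hours", 0.0)
        totals["tokens"] += r.get("tokens", 0)
    totals["amount_usd"] = round(
        totals["gpu_hours"] * RATE_PER_GPU_HOUR
        + totals["tokens"] / 1000 * RATE_PER_1K_TOKENS,
        2,
    )
    return totals

invoice = bill([
    {"gpu_hours": 10.0},                   # a training run
    {"gpu_hours": 1.5, "tokens": 50000},   # an inference endpoint
])
```

Because the same utilization records feed both the monitoring dashboards and the billing export, tenants see exactly the usage they are charged for.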
By adopting aarna.ml GPU CMS, AI cloud providers can:
- Maximize infrastructure ROI through dynamic GPU orchestration.
- Scale AI workloads seamlessly across IaaS and PaaS models.
- Enable per-tenant workload isolation for secure multi-tenancy.
- Enhance operational efficiency with automation-driven management.
Download On-Demand Multi-tenancy Reference Architecture (RA) for NCPs and AI Cloud Providers for further details.