Managing network isolation in AI cloud environments is critical for ensuring tenant data security, performance consistency, and compliance. This becomes even more important in high-performance AI clusters that rely on InfiniBand fabric for ultra-low latency communication between GPU nodes.
With aarna.ml GPU Cloud Management Software (GPU CMS), cloud providers can achieve complete InfiniBand network isolation for every tenant through an automated, policy-driven process. Each tenant’s data and traffic are fully segregated, with no manual intervention required.
Fully Automated InfiniBand Isolation
The aarna.ml GPU CMS achieves end-to-end isolation on InfiniBand fabrics by integrating with NVIDIA UFM (Unified Fabric Manager). The integration automates three steps:
- Automated Discovery – The system automatically detects and maps the full InfiniBand topology, including leaf, spine, and core switches, as well as all GPU nodes and their InfiniBand ports (GUIDs).
- Tenant Creation & PKey Assignment – When a new tenant is created, a unique PKey (Partition Key) is automatically provisioned for that tenant, giving the tenant its own InfiniBand partition that is logically separated from every other tenant on the fabric.
- Resource Allocation & GUID Mapping – When GPU compute nodes are allocated to a tenant, aarna.ml GPU CMS automatically maps all InfiniBand GUIDs of the tenant’s servers to the tenant’s PKey, so that all traffic from those servers is restricted to the tenant’s own isolated network partition.
This policy-based automation eliminates manual errors, guarantees secure isolation across the entire InfiniBand fabric, and ensures each tenant receives a fully segregated high-performance network.
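The three steps above can be sketched as a small in-memory model. This is only an illustration of the workflow, not the aarna.ml GPU CMS implementation or the NVIDIA UFM API: the `Fabric` class, its method names, and the node names and GUIDs are all hypothetical. The one InfiniBand fact it relies on is that PKey base values are 15 bits wide, with 0x0000 invalid and 0x7FFF used for the default partition.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Hypothetical in-memory model; not the aarna.ml GPU CMS or NVIDIA UFM API.
DEFAULT_PKEY = 0x7FFF  # default partition base value, never handed to a tenant


@dataclass
class Fabric:
    # node name -> port GUIDs, as a discovery step might report them
    node_guids: dict[str, list[str]]
    # tenant name -> PKey provisioned for that tenant
    tenant_pkeys: dict[str, int] = field(default_factory=dict)
    # PKey -> member port GUIDs
    partitions: dict[int, set[str]] = field(default_factory=dict)
    # port GUID -> tenant PKey it is currently mapped to
    guid_owner: dict[str, int] = field(default_factory=dict)

    def create_tenant(self, tenant: str) -> int:
        """Provision a PKey that no other tenant uses."""
        if tenant in self.tenant_pkeys:
            raise ValueError(f"tenant {tenant!r} already exists")
        used = set(self.tenant_pkeys.values())
        # Skip 0x0000 (invalid) and stop before 0x7FFF (default partition).
        for candidate in range(0x0001, DEFAULT_PKEY):
            if candidate not in used:
                self.tenant_pkeys[tenant] = candidate
                self.partitions[candidate] = set()
                return candidate
        raise RuntimeError("no free PKey left")

    def allocate_node(self, tenant: str, node: str) -> None:
        """Map every port GUID of the node to the tenant's PKey."""
        pkey = self.tenant_pkeys[tenant]
        guids = self.node_guids[node]
        # Check all ports first so a conflict leaves the model unchanged.
        for guid in guids:
            owner = self.guid_owner.get(guid)
            if owner is not None and owner != pkey:
                raise ValueError(f"{guid} already belongs to PKey {owner:#06x}")
        for guid in guids:
            self.partitions[pkey].add(guid)
            self.guid_owner[guid] = pkey


if __name__ == "__main__":
    # Made-up discovery result: two GPU nodes with two InfiniBand ports each.
    fabric = Fabric(node_guids={
        "gpu-node-1": ["0x0002c90300a1b2c1", "0x0002c90300a1b2c2"],
        "gpu-node-2": ["0x0002c90300a1b2d1", "0x0002c90300a1b2d2"],
    })
    fabric.create_tenant("tenant-a")
    fabric.create_tenant("tenant-b")
    fabric.allocate_node("tenant-a", "gpu-node-1")
    fabric.allocate_node("tenant-b", "gpu-node-2")
    for tenant, pkey in fabric.tenant_pkeys.items():
        print(tenant, f"{pkey:#06x}", sorted(fabric.partitions[pkey]))
```

In this sketch a node's ports are checked before any of them is mapped, so a node that already belongs to another tenant is rejected as a whole rather than ending up split across two partitions.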
Seamless Visibility and Control
All discovery, tenant creation, and isolation enforcement actions are fully visible within the aarna.ml GPU CMS Admin Portal. Both NCP admins (cloud provider admins) and tenant admins can track:
- Topology Discovery Results – Real-time visualization of the InfiniBand fabric, showing all switches, nodes, and links.
- Tenant-Specific Isolation – Full visibility into which PKey is assigned to each tenant.
- Server-Level Validation – Ability to drill down into individual servers and confirm that their InfiniBand ports are correctly assigned to the tenant’s PKey.
This centralized visibility ensures operational transparency and gives cloud providers the tools they need to enforce multi-tenant isolation at scale.
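The server-level validation described above amounts to comparing the intended allocation with the partition membership seen on the fabric. The following is a minimal sketch of such a check under stated assumptions: the function name, its arguments, and the sample data are hypothetical and are not taken from the aarna.ml GPU CMS Admin Portal or from UFM. It also assumes that ports may legitimately remain in the default partition (0x7FFF) for management traffic, so that partition is ignored.

```python
from __future__ import annotations

DEFAULT_PKEY = 0x7FFF  # assumed to be shared for management traffic, so ignored


def validate_isolation(
    node_guids: dict[str, list[str]],    # node name -> port GUIDs
    tenant_nodes: dict[str, list[str]],  # tenant -> nodes allocated to it
    tenant_pkeys: dict[str, int],        # tenant -> PKey
    partitions: dict[int, set[str]],     # PKey -> member GUIDs seen on the fabric
) -> list[str]:
    """Return human-readable problems; an empty list means the check passed."""
    problems: list[str] = []
    for tenant, nodes in tenant_nodes.items():
        pkey = tenant_pkeys[tenant]
        members = partitions.get(pkey, set())
        for node in nodes:
            for guid in node_guids[node]:
                # Every port of the tenant's servers must be in the tenant's PKey.
                if guid not in members:
                    problems.append(
                        f"{node}: {guid} missing from {tenant} PKey {pkey:#06x}"
                    )
                # No port may also be a member of another tenant's PKey.
                for other_pkey, other_members in partitions.items():
                    if other_pkey in (pkey, DEFAULT_PKEY):
                        continue
                    if guid in other_members:
                        problems.append(
                            f"{node}: {guid} also in foreign PKey {other_pkey:#06x}"
                        )
    return problems


if __name__ == "__main__":
    # Made-up data: the second port of gpu-node-1 sits in the wrong partition.
    issues = validate_isolation(
        node_guids={"gpu-node-1": ["guid-a1", "guid-a2"], "gpu-node-2": ["guid-b1"]},
        tenant_nodes={"tenant-a": ["gpu-node-1"], "tenant-b": ["gpu-node-2"]},
        tenant_pkeys={"tenant-a": 0x0001, "tenant-b": 0x0002},
        partitions={0x0001: {"guid-a1"}, 0x0002: {"guid-b1", "guid-a2"}},
    )
    for issue in issues:
        print(issue)
```

Reporting both a missing membership and a foreign membership separately helps distinguish a port that was never assigned from one that was assigned to the wrong tenant.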
Key Benefits of InfiniBand Integration
- Automated Discovery & Configuration – No manual effort required for topology discovery or PKey creation.
- Guaranteed Tenant Isolation – Each tenant’s traffic is strictly confined to the partition defined by its assigned PKey.
- Centralized Management – All network isolation policies are managed from a single pane of glass within the aarna.ml GPU CMS Admin Portal.
- NVIDIA UFM Integration – Leverages NVIDIA’s industry-standard InfiniBand management platform for seamless compatibility.
- Real-Time Validation – Admins can instantly verify that each tenant’s compute nodes are correctly isolated at the network level.
Complete Network Isolation Across Ethernet & InfiniBand
While this post focuses on InfiniBand isolation, aarna.ml GPU CMS also supports Ethernet network isolation, including full integration with NVIDIA Spectrum-X switches. Whether using Ethernet, InfiniBand, or a combination of both, aarna.ml GPU CMS ensures complete network separation between tenants across both the control plane and the data plane.