When we introduced Gaudi accelerators to Amazon’s EC2 DL1 instances, we faced a challenge that threatened the entire deployment. The performance numbers were not just disappointing; they were disastrous. Models were losing up to 50% of their performance when training was scaled across multiple nodes. The problem? A network topology that routed every byte of data through host memory, creating a bottleneck that undermined everything Gaudi was designed to do.
I led the engineering effort to address this issue, which ultimately resulted in the development of what we now call Peer Direct. It’s a feature that transformed the way Gaudi accelerators communicate in cloud environments, and its story holds some useful lessons about distributed AI training at scale.
The Problem with Host NICs
Gaudi was designed with the NIC (Network Interface Card) embedded directly in the silicon. Each chip has ten network interfaces, each capable of 100 Gbps, supporting RDMA over Converged Ethernet (RoCE v2), which lets devices access each other’s memory directly without involving the host CPU or host memory. This architecture is highly efficient for AI training workloads, where collective operations like AllReduce must accumulate gradients from dozens or hundreds of devices every training iteration.
But cloud deployments do not always accommodate ideal architectures. When Amazon evaluated Gaudi for DL1 instances, they chose to use ordinary host NICs rather than Gaudi’s built-in networking. The reasons were pragmatic: cost savings, and the logistics of fitting a new network topology into existing data centre infrastructure. From their business perspective, leveraging established network infrastructure made perfect sense.
From the performance point of view, it was a disaster. Instead of peer-to-peer RDMA transfers between Gaudi cards, all communication went the long way around. Data had to be copied out of Gaudi’s high-bandwidth memory into host DRAM, processed by the host CPU, sent out the host NIC over TCP/IP, received by the remote host, and copied into the remote Gaudi’s memory. Every added hop increased latency, consumed CPU cycles, and imposed bandwidth limits that ruined the scalability of distributed training.
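To see why the extra hops hurt so much, consider a simple cost model: when each copy in the chain happens serially, the effective throughput is the harmonic combination of the per-hop bandwidths, so even fast intermediate copies drag the total down. The sketch below uses entirely hypothetical bandwidth numbers, chosen only to show the shape of the problem, not measured Gaudi or DL1 figures.

```python
# Illustrative cost model: effective bandwidth of a serialised copy chain.
# All numbers are hypothetical and exist only to show the effect.

def effective_bandwidth(hop_bandwidths_gbps):
    """Effective throughput when each hop's copy happens serially.
    Total time per byte is the sum of per-hop times, so the result
    is the harmonic combination of the hop bandwidths."""
    return 1.0 / sum(1.0 / b for b in hop_bandwidths_gbps)

# Direct RDMA: device memory goes straight onto the wire,
# limited only by the port speed (100 Gbps here).
direct = effective_bandwidth([100.0])

# Host-staged path: HBM -> host DRAM copy, CPU/TCP processing and the
# host NIC on the wire, then a DRAM -> HBM copy on the far side.
staged = effective_bandwidth([200.0, 100.0, 100.0, 200.0])

print(f"direct: {direct:.1f} Gbps, staged: {staged:.1f} Gbps")
```

With these sample numbers the staged path delivers roughly a third of the direct path’s throughput, before even counting the CPU cycles the copies steal from training.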
The performance shortfall was so severe that it called into question whether the deployment would be worth it at all. This wasn’t a matter of trivial optimisation; it was an existential threat to the entire arrangement with AWS.
Why Performance Matters This Much
It’s worth understanding why a 50% performance loss is so disastrous for model training, especially for large models such as GPT-5. Training a large language model takes weeks or months even on enormous clusters. When you are working with models that have billions or trillions of parameters, every percentage point of performance translates directly into time and dollars.
Consider the economics. If it takes 30 days to train a model instead of 15, you’re not only waiting longer; you’re paying for double the compute time. At cloud scale, with hundreds or thousands of accelerators in continuous use, that adds up to millions of dollars. Worse, it halves your iteration speed. In a competitive AI market where companies are racing to develop better models, running twice as many experiments in the same time frame can be the difference between leading and trailing.
The environmental cost matters too. Large models require enormous amounts of electricity to train. Better performance means less compute time, which directly reduces energy consumption and carbon emissions. As pressure mounts on the AI industry to cut its carbon footprint, efficiency gains are no longer a luxury but a necessity.
The solution we designed, Peer Direct, delivered RDMA-like performance even though the physical network layout wasn’t suited to conventional RDMA. We needed direct memory access between Gaudi devices on different systems without traversing host memory, over host NICs that were never designed for it.
The enabler was AWS Elastic Fabric Adapter (EFA), a high-performance network interface for HPC and AI workloads on EC2. EFA provides low-latency OS-bypass communication, typically with sub-10 microsecond latencies, and exposes RDMA-like semantics through libfabric, a user-space communication library that offers a common interface across several networking technologies.
The task was to integrate libfabric with Habana’s Collective Communication Library, HCCL, which handles all distributed training communication. HCCL was built on the assumption of native RDMA over Gaudi’s on-chip NICs. We needed to build a bridge that let HCCL use libfabric transparently for communication without compromising its performance guarantees or communication semantics.
The solution required several technical advances. First, we introduced a memory registration system that allowed libfabric to access Gaudi’s high-bandwidth memory directly. We used the Linux kernel’s DMA-BUF framework, which provides a standard mechanism for sharing buffers between device drivers. When HCCL needs to transfer data, the Gaudi driver exports a DMA-BUF file descriptor for the memory region, which libfabric can then use to perform RDMA transfers directly from device memory.
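The handshake can be sketched as a two-step protocol: the device driver exports a file descriptor for a region of device memory, and the fabric provider accepts that descriptor when registering the region. The sketch below is a toy mock of this flow; the class and method names are illustrative stand-ins, not the real Gaudi driver or libfabric APIs, and the “DMA-BUF” here is faked with an ordinary pipe fd.

```python
import os

# Hypothetical mock of the DMA-BUF handshake between a device driver and
# a fabric provider. All names are illustrative, not real APIs.

class FakeGaudiDriver:
    """Stands in for the kernel driver that exports device memory."""
    def export_dmabuf(self, device_addr, length):
        # A real driver returns a DMA-BUF file descriptor backed by
        # device memory; here we fake it with a pipe's read end.
        fd, _ = os.pipe()
        return fd

class FakeFabricProvider:
    """Stands in for a fabric provider that accepts DMA-BUF fds."""
    def register_memory(self, dmabuf_fd, length):
        # A real provider pins/maps the buffer and returns a memory
        # region handle usable as an RDMA source or target.
        return {"fd": dmabuf_fd, "len": length, "rkey": 0x1234}

driver = FakeGaudiDriver()
fabric = FakeFabricProvider()

fd = driver.export_dmabuf(device_addr=0x7F00_0000, length=1 << 20)
mr = fabric.register_memory(fd, 1 << 20)
print("registered region:", mr["len"], "bytes")
```

The point of the pattern is that no host-memory staging buffer ever appears: the file descriptor is the only thing that crosses the driver boundary.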
Second, we added an LRU cache for memory registrations. Memory registration is expensive; it involves kernel calls and setup operations that add significant overhead. By caching the mapping from memory addresses to their libfabric handles, we could reuse registrations for frequently accessed regions, eliminating most registration overhead from the training hot path.
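The idea behind the cache can be shown in a few lines. This is a minimal sketch, assuming register/deregister callbacks supplied by the transport layer; it mirrors the concept described above, not HCCL’s actual implementation.

```python
from collections import OrderedDict

# Minimal LRU cache for memory registrations. The register/deregister
# callables are assumed to wrap the expensive transport-level calls.

class RegistrationCache:
    def __init__(self, register, deregister, capacity=128):
        self._register = register      # expensive: kernel calls, pinning
        self._deregister = deregister  # releases the registration
        self._capacity = capacity
        self._entries = OrderedDict()  # (addr, length) -> handle

    def get(self, addr, length):
        key = (addr, length)
        if key in self._entries:
            self._entries.move_to_end(key)     # mark as recently used
            return self._entries[key]
        handle = self._register(addr, length)  # cache miss: pay the cost
        self._entries[key] = handle
        if len(self._entries) > self._capacity:
            _, evicted = self._entries.popitem(last=False)
            self._deregister(evicted)          # drop least recently used
        return handle

# Usage with stub register/deregister functions:
calls = []
cache = RegistrationCache(
    register=lambda a, l: calls.append((a, l)) or f"mr-{a:#x}",
    deregister=lambda h: None,
    capacity=2,
)
cache.get(0x1000, 4096)
cache.get(0x1000, 4096)   # hit: no new registration
print("registrations performed:", len(calls))  # 1
```

Because training loops touch the same gradient buffers every iteration, hit rates on such a cache are very high, which is why it removes registration cost from the steady state.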
The result was a communication pipeline that looked something like this: HCCL calls the OFI wrapper, which uses the cached libfabric handle to perform an RDMA transfer straight from source Gaudi memory to destination Gaudi memory, without either host CPU being involved. The OFI wrapper was introduced to keep the codebase clean and avoid direct header inclusions; it’s a lightweight library that dynamically links to HCCL and enables the use of libfabric without requiring direct integration.
After the transfer is complete, libfabric reports through a completion queue, and HCCL continues computation with the recently received data.
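The post-then-poll flow above can be sketched with a toy transport standing in for libfabric. Everything here is illustrative: the class and method names are invented for the sketch, and the “RDMA write” completes instantly instead of being carried out by a NIC.

```python
from collections import deque

# Toy sketch of the transfer/completion flow: the wrapper posts a
# transfer, then waits on a completion queue. Names are hypothetical.

class FakeCompletionQueue:
    def __init__(self):
        self._done = deque()
    def post_completion(self, ctx):
        self._done.append(ctx)
    def poll(self):
        return self._done.popleft() if self._done else None

class OfiWrapper:
    """Thin layer the collective library calls into; it hides the
    transport behind a small, stable API."""
    def __init__(self, cq):
        self._cq = cq
    def rdma_write(self, src, dst, length, ctx):
        # A real wrapper would post an RDMA write via the transport;
        # the NIC would then move device memory to device memory with
        # no CPU copies. Here the "transfer" completes immediately.
        self._cq.post_completion(ctx)
    def wait(self, ctx):
        while True:
            done = self._cq.poll()
            if done == ctx:
                return

cq = FakeCompletionQueue()
wrapper = OfiWrapper(cq)
wrapper.rdma_write(src=0x1000, dst=0x2000, length=1 << 20, ctx="grad-chunk-0")
wrapper.wait("grad-chunk-0")
print("transfer complete; compute can resume")
```

The completion queue is what lets computation overlap with communication: the collective layer posts transfers, keeps working, and only blocks when it actually needs the arriving data.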
The Development Experience
Building Peer Direct meant venturing into new territory on a tight schedule. Libfabric wasn’t yet mainstream in the AI accelerator field. There was little public documentation, and community discussion was meagre. Much of the work came down to diving into the libfabric source code and reverse-engineering behaviour through experimentation.
Communication with AWS engineers was paramount but constrained by time zones. Working with a team twelve hours ahead meant debug iterations had 24-hour turnarounds. Every issue needed careful documentation and precise communication, as real-time collaboration was not possible.
The stakes were high since the entire DL1 deployment was riding on this functionality working. Delays would have thwarted a major product launch. Nobody on our team had deep background knowledge of libfabric internals, so we were learning a complex codebase while designing a critical integration simultaneously.
The Results
When we deployed Peer Direct, the performance improvements justified all the effort. We saw a 1.5 to 2x throughput increase for collective operations at a 32MB message size, and the gains held at larger messages, with up to 1.76x better throughput at a 256MB message size. The CPU overhead that had created the bottleneck disappeared entirely.
Most significantly, these microbenchmark improvements translated directly into real model training performance. Training Habana’s DeepSpeed BERT model with 5 billion parameters across 128 Gaudi devices, we saw substantial throughput gains. Models using more aggressive memory optimisation methods, such as ZeRO-2, which depend more heavily on collective operations, benefited disproportionately from Peer Direct.
Peer Direct was one of the main enablers of Gaudi performance on AWS DL1 instances, allowing high-scale distributed training to run smoothly on launch day. Beyond this initial impact, the effort laid the groundwork for future high-performance communication features and proved that cloud-hosted AI accelerators can remain competitive despite the constraints of cloud infrastructure.
The experience reinforced an important lesson in systems engineering: the most important performance improvements often come not from optimising the fast path, but from eliminating unnecessary detours altogether. In distributed AI training, having data travel directly between accelerators, with no unnecessary copies and no CPU intervention, is what separates a system that merely works from one that scales.
What are the key takeaways? First, assumptions about network topology should be tested as early as possible in any distributed training project. Many accelerator stacks were built for an idealised environment and do not account for the extra hops, translation layers, and cost-driven compromises found in cloud environments. So before optimising at the model or kernel level, engineers should run simple collective microbenchmarks across the target topology. If scaling efficiency drops sharply as node counts or message sizes grow, the likely culprit is the data path, not the kernels. Identifying the host-memory detour early lets engineers focus their effort where it will have the greatest impact.
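The microbenchmark check suggested above reduces to a simple calculation: measure aggregate collective throughput at several node counts and compare it with ideal linear scaling from the smallest configuration. The sample numbers below are invented for illustration, but the shape they show, efficiency collapsing as nodes are added, is the signature of a data-path bottleneck.

```python
# Scaling-efficiency check: compare measured aggregate throughput
# against ideal linear scaling from the smallest configuration.
# The sample numbers are hypothetical.

def scaling_efficiency(throughputs_by_nodes):
    """throughputs_by_nodes: {node_count: aggregate throughput}.
    Returns {node_count: fraction of ideal linear scaling achieved}."""
    base_nodes = min(throughputs_by_nodes)
    base = throughputs_by_nodes[base_nodes]
    return {
        n: t / (base * n / base_nodes)
        for n, t in sorted(throughputs_by_nodes.items())
    }

# Hypothetical AllReduce bus bandwidth (GB/s) at several node counts.
measured = {1: 90.0, 2: 160.0, 4: 250.0, 8: 310.0}
for nodes, eff in scaling_efficiency(measured).items():
    print(f"{nodes} nodes: {eff:.0%} of linear scaling")
```

A curve like this, healthy at two nodes and collapsing toward half efficiency at eight, points at the interconnect path long before any kernel-level profiling is worthwhile.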
Another important lesson was the need to treat memory registration, not just data transfer, as a first-class performance concern. Registration overhead can greatly exceed the time spent communicating if every transfer requires a fresh registration. The LRU cache for registered memory was an unglamorous addition to HCCL, but it eliminated a systemic source of latency and made the RDMA path viable for real-world workloads. When building distributed systems, engineers should profile not only the available network bandwidth but also the lifecycle costs of allocating buffers, registering them, and tearing those registrations down. Small changes to these control paths can yield large gains in end-to-end throughput.
Finally, the approach we took offers a reusable integration pattern. Instead of rewriting HCCL to use libfabric directly, we created a thin abstraction layer that preserved existing semantics while swapping the underlying transport. This minimised risk, reduced code churn, and allowed incremental testing. Teams facing a similar challenge, adapting accelerator-native communication libraries to cloud-native fabrics, should aim to isolate the transport layer, preserve collective semantics, and create small, testable interfaces between the two. This speeds development and also simplifies support for future transport backends.
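The pattern can be sketched in miniature: the collective logic programs against a small transport interface, and backends implementing that interface can be swapped freely. The interface and class names below are hypothetical, invented only to illustrate the structure.

```python
from typing import Protocol

# Sketch of the integration pattern: collective logic depends only on a
# small transport interface, so backends can be swapped without touching
# collective semantics. All names here are illustrative.

class Transport(Protocol):
    def send(self, peer: int, buf: bytes) -> None: ...
    def recv(self, peer: int) -> bytes: ...

class LoopbackTransport:
    """Toy backend. A real system might pair a native-RDMA backend with
    an OFI/libfabric backend behind the same interface."""
    def __init__(self):
        self._mailbox = {}
    def send(self, peer, buf):
        self._mailbox.setdefault(peer, []).append(buf)
    def recv(self, peer):
        return self._mailbox[peer].pop(0)

def broadcast(transport: Transport, peers, payload: bytes):
    # Collective logic stays identical regardless of the backend.
    for p in peers:
        transport.send(p, payload)

t = LoopbackTransport()
broadcast(t, peers=[1, 2, 3], payload=b"weights")
print([t.recv(p) for p in [1, 2, 3]])
```

Because the collective layer never names a concrete backend, adding a new fabric later means implementing two methods, not rewriting the collectives.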
Disclosure: I work as an AI Runtime Team Manager at Intel. The perspectives shared in this article are my own.