Supercharging IBM’s Cloud-Native AI Supercomputer

It’s been a year of massive strides in AI, with new technologies becoming household names and models with tens of billions of parameters becoming commonplace for real-world use cases. At IBM, we launched watsonx, the data and AI platform for enterprise, to bring these advanced AI capabilities to IBM customers across a wide variety of industries, leveraging many innovations that emerged from our IBM Research community.

There’s a growing need to design systems with the right compute capabilities to efficiently carry out the various stages of the AI lifecycle. This is partly why IBM decided to build Vela, an AI supercomputer in the IBM Cloud, last year. Vela allows us to efficiently deploy our AI workflows — from data pre-processing, model training and tuning, to deployment and even new product incubation — all within the IBM Cloud.


Vela was designed to be flexible and scalable, capable of training today’s large-scale generative AI models, and adaptable to new needs that may arise in the future. It was also designed such that its infrastructure could be efficiently deployed and managed anywhere in the world. Over the last year, AI practitioners from across IBM have trained and prototyped AI technologies on Vela, including IBM’s next-generation AI studio, watsonx.ai, which became generally available in July. Bringing a platform like watsonx.ai online so quickly around the world would not have been possible without Vela’s cloud-first design.

One year in, IBM is scaling Vela for what’s ahead. Today, we’re sharing several major upgrades we’ve made to Vela over the last year — including nearly doubling the capacity of the system and dramatically improving the speed of Vela’s network. Let’s break down what’s new, and how we made it happen.

Speeding up Vela

This wave of AI is unusually dependent on the infrastructure used to train and deploy it. In the push towards bigger models trained over ever-larger data sets, moving faster means using more GPUs per job. As more GPUs compute in parallel, network performance must increase commensurately so that GPU-to-GPU communication doesn’t become a bottleneck to workload progress. This year, we deployed a major upgrade to the Vela network that allows us to efficiently scale individual training workloads to thousands of GPUs per job. The core enabling technologies that we deployed on Vela were RoCE (RDMA over Converged Ethernet) and GDR (GPU-direct RDMA).

Remote direct memory access (RDMA) allows one processor to access another processor’s memory without involving either computer’s operating system. This leads to much faster communication between the processors by eliminating as many of the intervening processes as possible. GPU-direct RDMA allows GPUs on one system to access the memory of GPUs in another system through the network cards (as shown in the figure below), over the Ethernet network. By enabling GPU-direct RDMA over our Ethernet network in Vela, we improved our network throughput by two to four times and reduced our network latency by six to 10 times.
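To get a feel for what those ranges mean for a training job, here is a minimal sketch that applies the standard latency-plus-bandwidth cost model to a ring all-reduce, a common way of averaging gradients across GPUs. The baseline latency and bandwidth are hypothetical placeholders rather than Vela measurements, and the post does not say which collective algorithm Vela's workloads use. Only the 2x throughput and 6x latency factors come from the text, taken at the conservative end of each range.

```python
def ring_allreduce_seconds(num_gpus, message_bytes, latency_s, bandwidth_bytes_per_s):
    """Estimate ring all-reduce time with a latency-plus-bandwidth cost model.

    A ring all-reduce takes 2 * (num_gpus - 1) steps, and each step sends
    message_bytes / num_gpus bytes to a neighbouring GPU.
    """
    if num_gpus < 2:
        return 0.0  # nothing to exchange with a single GPU
    steps = 2 * (num_gpus - 1)
    chunk_bytes = message_bytes / num_gpus
    return steps * (latency_s + chunk_bytes / bandwidth_bytes_per_s)


# Hypothetical baseline link characteristics (assumptions, not Vela data).
BASE_LATENCY_S = 50e-6
BASE_BANDWIDTH = 10e9  # bytes per second

# Conservative ends of the ranges quoted in the article.
THROUGHPUT_GAIN = 2.0
LATENCY_GAIN = 6.0


def speedup(num_gpus, message_bytes):
    """Ratio of modelled all-reduce time before and after the upgrade."""
    before = ring_allreduce_seconds(
        num_gpus, message_bytes, BASE_LATENCY_S, BASE_BANDWIDTH
    )
    after = ring_allreduce_seconds(
        num_gpus,
        message_bytes,
        BASE_LATENCY_S / LATENCY_GAIN,
        BASE_BANDWIDTH * THROUGHPUT_GAIN,
    )
    return before / after


if __name__ == "__main__":
    for gpus, size in [(8, 1e9), (1024, 1e9), (1024, 1024)]:
        print(f"{gpus} GPUs, {size:.0f} bytes: {speedup(gpus, size):.2f}x faster")
```

Under this model, large messages are bandwidth-bound, so their speedup sits close to the throughput gain, while small messages spread over many GPUs are latency-bound and sit close to the latency gain. That is one reason both improvements matter as jobs grow to thousands of GPUs.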

We are also able to scale workloads out nearly linearly to much larger models than previously possible. This includes training the 20 billion parameter Granite model we recently announced, which is a key enabler of our watsonx Code Assistant for Z service. The RoCE and GDR upgrade was several years of research in the making. It required simultaneous changes and enhancements to nearly every part of our cloud stack, from the system firmware to the host operating system, to virtualization, to the network underlay and overlay.
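"Nearly linearly" can be made concrete as a scaling efficiency: measured throughput divided by what perfect linear scaling from a smaller run would give. The sketch below is illustrative only. The function name and the throughput figures are hypothetical, and the post does not publish Vela's actual scaling numbers.

```python
def scaling_efficiency(base_gpus, base_throughput, gpus, throughput):
    """Return measured throughput as a fraction of ideal linear scaling.

    A value of 1.0 means throughput grew exactly in proportion to GPU count.
    """
    ideal = base_throughput * (gpus / base_gpus)
    return throughput / ideal


if __name__ == "__main__":
    # Hypothetical run: 64 GPUs process 1,000 samples/s, 1,024 GPUs process 14,400.
    eff = scaling_efficiency(64, 1000.0, 1024, 14400.0)
    print(f"scaling efficiency: {eff:.0%}")
```

A job that stays close to 1.0 as GPUs are added is scaling nearly linearly. A value that falls away as the job grows usually points to communication overhead.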

Diagram showing the difference in communication path before and after deployment of RoCE + GDR.

Increasing capacity

While Vela was designed to be expandable, the team wanted to do more than just add GPUs; we wanted to do it in a space- and resource-efficient manner. In particular, we looked for a way to double the density of the server racks, which would roughly double capacity without increasing the space or networking equipment required.

After analyzing AI workload patterns, we determined we could move ahead with our capacity expansion within the power and cooling resources that were already available, without impacting workload performance. We then worked with our partners to develop a highly optimized power capping solution. This allowed Vela to safely “overcommit” the amount of power available to a rack. We then developed a testing framework for all pertinent components to ensure everything was working safely after the expansion, without any detrimental impact on the system or the workloads that need to run efficiently on Vela. As a result, Vela now has around twice as many GPUs as it had prior to the upgrade.
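The arithmetic behind overcommitting a rack can be sketched as follows. The post does not describe how the actual power capping solution works, so the even per-server split, the safety headroom, and all wattage figures below are assumptions chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Rack:
    budget_watts: float       # power the facility can deliver to the rack
    server_peak_watts: float  # worst-case draw of one server
    server_count: int


def overcommit_ratio(rack):
    """Worst-case demand divided by the rack budget (above 1.0 is overcommitted)."""
    return rack.server_peak_watts * rack.server_count / rack.budget_watts


def per_server_cap(rack, headroom=0.05):
    """Split the budget, minus a safety headroom, evenly across servers.

    The even split is an assumption; a real solution could cap dynamically.
    """
    usable = rack.budget_watts * (1.0 - headroom)
    return min(rack.server_peak_watts, usable / rack.server_count)


if __name__ == "__main__":
    # Hypothetical rack: 40 kW budget, eight servers that can each peak at 6 kW.
    rack = Rack(budget_watts=40_000, server_peak_watts=6_000, server_count=8)
    print(f"overcommit ratio: {overcommit_ratio(rack):.2f}")
    print(f"per-server cap:   {per_server_cap(rack):.0f} W")
```

The approach relies on servers rarely hitting peak draw at the same moment. The cap guarantees the rack stays within its budget even if they do, at the cost of throttling in that rare case.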

Architecture of Vela after capacity increase.

Improving operations and diagnostics

The team behind Vela also looked at ways to run the system more efficiently. Because of their complexity, AI servers have a higher failure rate than many traditional cloud systems, and they fail in unexpected (and sometimes hard to detect) ways. When nodes, or even individual GPUs, fail or degrade, the performance of an entire training job running over hundreds or thousands of them can suffer. Automation that detects these kinds of issues and raises alerts as quickly as possible is therefore important to keeping the environment productive.
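One simple way to surface a degraded GPU is to compare each GPU's step time with the median across the job, since a single slow device holds back every other participant in a synchronous training step. This is a minimal sketch of that idea, not a description of IBM Cloud's automation. The function name, the 1.2x tolerance, and the sample timings are all hypothetical.

```python
from statistics import median


def find_stragglers(step_seconds_by_gpu, tolerance=1.2):
    """Return IDs of GPUs slower than `tolerance` times the median step time."""
    if not step_seconds_by_gpu:
        return []
    typical = median(step_seconds_by_gpu.values())
    return sorted(
        gpu
        for gpu, seconds in step_seconds_by_gpu.items()
        if seconds > tolerance * typical
    )


if __name__ == "__main__":
    # Hypothetical per-GPU step times in seconds.
    samples = {
        "node1/gpu0": 1.00,
        "node1/gpu1": 1.02,
        "node2/gpu0": 1.65,
        "node2/gpu1": 0.99,
    }
    for gpu in find_stragglers(samples):
        print(f"alert: {gpu} is running slower than its peers")
```

The median is used rather than the mean so that one badly degraded GPU does not shift the baseline it is being compared against.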

This year, the IBM teams enhanced the automation in IBM Cloud, cutting the time it takes to find and understand these kinds of hardware failures and degradations on Vela in half. Now, servers can be brought back into the production fleet far faster than before. Lessons learned from managing an environment this complex have been rolled out more broadly to improve operations across the rest of IBM Cloud’s virtual private cloud (VPC) environment.

What’s next

Even before these upgrades, Vela was already a powerful platform that accelerated the launch and deployment of watsonx.ai all over the world, as well as the development of our core underlying platform, OpenShift AI. And with the latest infrastructure advancements in Vela, we’re training increasingly powerful models that will help solve some of the most pressing business problems our customers face.

Just as the AI boom is still in its early days, this is just the beginning for IBM’s AI infrastructure innovation journey. Earlier this year, IBM announced the availability of additional GPU offerings on IBM Cloud, bringing to market innovative GPU infrastructure designed to train, tune and run inference on foundation models for enterprise workloads. And, with new IBM AI infrastructure technologies maturing, like the IBM AIU chip, so much more is going to be possible in the years ahead.

By: Talia Gershon, Bengi Karacali-Akyamac, Seetharami Seelam, Drew Thorstensen and Rohit Badlaney
Originally published at: IBM Blog

Source: cyberpogo.com
