
Supercharging IBM’s Cloud-Native AI Supercomputer

  • aster.cloud
  • December 24, 2023
  • 5 minute read

It’s been a year of massive strides in AI, with new technologies becoming household names and models with tens of billions of parameters becoming commonplace for real-world use cases. At IBM, we launched watsonx, the data and AI platform for enterprise, to bring these advanced AI capabilities to IBM customers across a wide variety of industries, leveraging many innovations that emerged from our IBM Research community.

There’s a growing need to design systems with the right compute capabilities to efficiently carry out the various stages of the AI lifecycle. This is partly why IBM decided to build Vela, an AI supercomputer in the IBM Cloud, last year. Vela allows us to efficiently deploy our AI workflows — from data pre-processing, model training and tuning, to deployment and even new product incubation — all within the IBM Cloud.



Vela was designed to be flexible and scalable, capable of training today’s large-scale generative AI models, and adaptable to new needs that may arise in the future. It was also designed such that its infrastructure could be efficiently deployed and managed anywhere in the world. Over the last year, AI practitioners from across IBM have trained and prototyped AI technologies on Vela, including IBM’s next-generation AI studio, watsonx.ai, which became generally available in July. Bringing a platform like watsonx.ai online so quickly around the world would not have been possible without Vela’s cloud-first design.

One year in, IBM is scaling Vela for what’s ahead. Today, we’re sharing several major upgrades we’ve made to Vela over the last year — including nearly doubling the capacity of the system and dramatically improving the speed of Vela’s network. Let’s break down what’s new, and how we made it happen.


Speeding up Vela

This wave of AI is unusually dependent on the underlying infrastructure used to train and deploy it. In the push towards bigger models, trained over ever-larger data sets, moving faster means using more GPUs per job. As more GPUs compute in parallel, we need a commensurate increase in network performance to ensure that GPU-to-GPU communication doesn’t become a bottleneck to workload progress. This year, we deployed a major upgrade to the Vela network that allows us to efficiently scale individual training workloads to thousands of GPUs per job. The core enabling technologies that we deployed on Vela were RoCE (RDMA over Converged Ethernet) and GDR (GPU-direct RDMA).

Remote direct memory access (RDMA) allows one processor to access another processor’s memory without involving either computer’s operating system. This leads to much faster communication between the processors by eliminating as many of the intervening processes as possible. GPU-direct RDMA allows GPUs on one system to access the memory of GPUs in another system through the network cards (as shown in the figure below), going over the Ethernet network. By enabling GPU-direct RDMA over our Ethernet network in Vela, we improved our network throughput by two to four times and reduced our network latency by six to 10 times.

We are also able to scale workloads out nearly linearly to much larger models than previously possible. This includes training the 20-billion-parameter Granite model we recently announced, which is a key enabler of our watsonx Code Assistant for Z service. The RoCE and GDR upgrade was several years of research in the making. It required simultaneous changes and enhancements to nearly every part of our cloud stack, from the system firmware to the host operating system, to virtualization, to the network underlay and overlay.
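To see why throughput and latency both matter as jobs grow, here is a rough sketch using the textbook cost model for a ring all-reduce, a common way to synchronize gradients across GPUs. It is an illustration only, not a description of how Vela's collectives are implemented. The GPU count, message size, baseline bandwidth, and baseline latency are hypothetical values; the 3x and 8x factors are simply picked from within the ranges quoted above.

```python
def ring_allreduce_seconds(num_gpus: int, message_bytes: float,
                           bandwidth_bytes_per_s: float, latency_s: float) -> float:
    """Estimate one ring all-reduce with the standard latency/bandwidth cost model."""
    if num_gpus < 2:
        return 0.0  # a single GPU has nothing to exchange
    steps = 2 * (num_gpus - 1)        # reduce-scatter phase plus all-gather phase
    chunk = message_bytes / num_gpus  # each step sends one chunk per GPU
    return steps * (latency_s + chunk / bandwidth_bytes_per_s)


# Hypothetical job: 1,024 GPUs exchanging 1 GB of gradients per step.
NUM_GPUS = 1024
MESSAGE_BYTES = 1e9

# Hypothetical baseline link: 12.5 GB/s with 50 microseconds per-message latency.
before = ring_allreduce_seconds(NUM_GPUS, MESSAGE_BYTES, 12.5e9, 50e-6)

# Assumed improvement: 3x throughput and 8x lower latency.
after = ring_allreduce_seconds(NUM_GPUS, MESSAGE_BYTES, 3 * 12.5e9, 50e-6 / 8)

print(f"before: {before:.4f} s, after: {after:.4f} s, speedup: {before / after:.2f}x")
```

Because the latency term is multiplied by the number of steps, which grows with the GPU count, cutting latency matters more and more as jobs scale out.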

Diagram showing the difference in communication path before and after deployment of RoCE + GDR.

Increasing capacity

While Vela was designed to be expandable, the team wanted to do more than just add GPUs; we wanted to expand in a space- and resource-efficient manner. In particular, we looked for a way to double the density of the server racks, which would roughly double capacity without increasing the space or networking equipment required.

After analyzing AI workload patterns, we determined we could move ahead with our capacity expansion within the power and cooling resources that were already available without impacting workload performance. We then worked with our partners to develop a highly optimized power capping solution. This allowed Vela to essentially “overcommit” the amount of power available to a rack safely. We then developed a testing framework for all pertinent components to ensure everything was working safely after the expansion, without any detrimental impact to the system or the workloads that needed to run efficiently on Vela. As a result, Vela now comprises around twice as many GPUs as it had prior to the upgrade.
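The arithmetic behind power overcommitment can be sketched in a few lines. This is a simplified illustration under assumed numbers, not the capping solution described above: the `Rack` type, its field names, and every wattage are hypothetical. The idea is that a rack is overcommitted when the sum of nameplate peaks exceeds its budget, and a per-server cap keeps the total within budget. The cap goes unnoticed by workloads as long as their typical draw stays below it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rack:
    budget_watts: float   # power the facility can deliver to the rack (hypothetical)
    servers: int          # number of servers in the rack
    peak_watts: float     # nameplate peak draw per server (hypothetical)
    typical_watts: float  # draw observed under typical workloads (hypothetical)


def overcommit_ratio(rack: Rack) -> float:
    """Sum of nameplate peaks divided by the budget; above 1.0 means overcommitted."""
    return rack.servers * rack.peak_watts / rack.budget_watts


def per_server_cap(rack: Rack) -> float:
    """Uniform cap that keeps the whole rack within its budget."""
    return min(rack.peak_watts, rack.budget_watts / rack.servers)


def cap_is_transparent(rack: Rack) -> bool:
    """True when typical workload draw never reaches the cap."""
    return rack.typical_watts <= per_server_cap(rack)


rack = Rack(budget_watts=40_000, servers=8, peak_watts=6_000, typical_watts=4_500)
print(f"overcommit ratio: {overcommit_ratio(rack):.2f}")
print(f"per-server cap:   {per_server_cap(rack):.0f} W")
print(f"workloads unaffected: {cap_is_transparent(rack)}")
```

A real deployment would likely cap dynamically rather than uniformly, which is one reason the testing described above matters.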

Architecture of Vela after capacity increase.

Improving operations and diagnostics

The team behind Vela also looked at ways to run the system more efficiently. Because of their complexity, AI servers have a higher failure rate than many traditional cloud systems. Moreover, they fail in unexpected (and sometimes hard to detect) ways. When nodes, or even individual GPUs, fail or degrade, the performance of an entire training job running over hundreds or thousands of them can suffer. Automation that detects these kinds of issues and raises alerts as quickly as possible is therefore important to keeping the environment productive.
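One simple form such automation can take is a straggler check: compare each node's training step time against the fleet median and flag the outliers. The sketch below is a generic illustration, not the tooling used on Vela. The function name, the node names, the sample timings, and the 1.2x tolerance are all hypothetical.

```python
from statistics import median


def find_stragglers(step_seconds: dict[str, float], tolerance: float = 1.2) -> list[str]:
    """Return nodes whose step time exceeds `tolerance` times the fleet median."""
    if not step_seconds:
        return []
    baseline = median(step_seconds.values())
    return sorted(node for node, t in step_seconds.items() if t > tolerance * baseline)


# Hypothetical per-node step times in seconds for one training iteration.
sample = {"node-a": 1.00, "node-b": 1.02, "node-c": 0.99, "node-d": 1.65}
print(find_stragglers(sample))  # prints ['node-d']
```

Using the median rather than the mean keeps a single badly degraded node from shifting the baseline it is measured against.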


This year, the IBM teams enhanced the automation in IBM Cloud, cutting the time it takes to find and understand these kinds of hardware failures and degradations on Vela in half. Now, servers can be brought back into the production fleet far faster than before. Lessons learned from managing an environment this complex have been rolled out more broadly to improve operations across the rest of IBM Cloud’s virtual private cloud (VPC) environment.

What’s next

Even before these upgrades, Vela was already a powerful platform that accelerated the launch and deployment of watsonx.ai all over the world, as well as the development of our core underlying platform, OpenShift AI. And with the latest infrastructure advancements in Vela, we’re training increasingly powerful models that will help solve some of the most pressing business problems our customers face.

Just as the AI boom is still in its early days, this is only the beginning of IBM’s AI infrastructure innovation journey. Earlier this year, IBM announced the availability of additional GPU offerings on IBM Cloud, bringing to market innovative GPU infrastructure designed to train, tune, and run inference on foundation models for enterprise workloads. And, with new IBM AI infrastructure technologies maturing, like the IBM AIU chip, so much more is going to be possible in the years ahead.

By: Talia Gershon, Bengi Karacali-Akyamac, Seetharami Seelam, Drew Thorstensen and Rohit Badlaney
Originally published at: IBM Blog

Source: cyberpogo.com

