re:Invent 2025: Shell’s HPC Evolution and Accelerating Seismic Processing with AWS GPU Innovation

Introduction

By transcribing overseas lectures into Japanese articles, we aim to make hidden, high-quality information more accessible. Based on that concept, the presentation featured this time is this one!

For re:Invent 2025 transcription articles, information is summarized in this Spreadsheet. Please check it as well.

📖 AWS re:Invent 2025 - Shell's HPC Evolution: Accelerating Seismic Processing with AWS GPU Innovation

In this video, Michael Gual, Shell's Global Head of HPC Engineering and Operations, and Husain Shell, CTO of AWS Energy and Utilities, walk through Shell's cloud HPC migration journey, which began in 2017. For the embarrassingly parallel workload of seismic exploration data processing, they initially struggled by trying to replicate their on-premises environment, but the 10X Challenge in 2021 became a turning point. They migrated Full-Waveform Inversion to GPUs, evolving from P4DE to P5EN with H200s, and achieved predictable burst capacity through EC2 Capacity Blocks. Today they run Parallel Computing Service across multiple regions and have saved 2.5 years of wall-clock time. By combining on-premises CPU processing with cloud GPU processing in a hybrid architecture, they consistently achieve 3x to 5x acceleration.

https://www.youtube.com/watch?v=ujO5kQwVHPs
※ Please note that this article was automatically generated while maintaining the content of the original lecture as much as possible. It may contain typos or incorrect information.

Main Content

Thumbnail 0

Collaboration between AWS Energy and Utilities and Shell - Speaker Introduction and Shell's Business Overview

Hello everyone. My name is Husain Shell, and I am the CTO of AWS Energy and Utilities. Thank you all for joining us today, and for being here when you could have been anywhere else in Las Vegas. I'm truly proud to introduce my colleague here, my partner in all the things we've worked on with many customers in HPC and the energy space. Michael Gual, Shell's Global Head of HPC Engineering and Operations, is here to talk about his journey over the last four or five years, not just in high-performance computing, but also in generative AI co-innovation, and our data and analytics endeavors. I'm very excited about what's to come. Fortunately, he'll do most of the talking, but I'm here to answer any questions if needed. So, Michael, please.

Thumbnail 10

Thumbnail 90

Thank you, Husain. It's a pleasure to be here. Thank you all for coming such a long way. I thought it would be wise to walk here, but after the keynote, maybe not so much. But thank you for making it all the way here. I'm really excited to share this journey with you today. It's actually been over five years, and we've had some struggles, which I'll talk about in today's talk. But first and foremost, and most importantly, is the long slide about disclaimers. Yes, you can take a photo, but please don't take it as financial advice. This is Las Vegas, after all.

Thumbnail 100

So what are we going to talk about today? I'll give you a bit of background about Shell, what we do besides what you see at gas stations and on F1 tracks. We'll talk about our upstream challenges and how they translate to HPC, and we'll look back at our journey with AWS, which began in 2017. I'll actually show you what the architecture looks like. We'll talk about what's next, and then hopefully we'll have time for Q&A at the end. I'll defer the difficult questions to Husain.

Thumbnail 140

Thumbnail 180

So, first, let me explain a bit about Shell and what we do beyond gas stations. We have over 93,000 employees across 70 countries, truly involved in all aspects of energy. What I'll be talking about a lot today is our upstream business. Key metrics there include 66 million tons of LNG in 2022, and 22,800 barrels of oil equivalent production per day. Clearly, there are many different areas and interesting figures. And how that looks from a customer sector perspective is that we provide integrated energy solutions across all aspects of energy, including things that use a lot of power recently, like data centers. But again, I hope all of you used Shell products on your trip to Las Vegas.

Thumbnail 200

Technology Transformation in Upstream Business - Strategic Shift to Cloud Integration

So, let's dive deep into upstream and talk about who actually uses our HPC. First, let's talk about what we're doing on the technology side, and what we're trying to change. We are a company that for decades has done a lot of things proprietarily, with a lot of proprietary equipment, and built a lot of things ourselves. And what we're really realizing is that integration is one of the keys to unlocking the future way of working. So, we're working on empowering our core workflows. How do we simplify the workflows? How do we integrate a lot of different siloed applications? How do we embed AI into these deeply technical workflows? And then on the right side, we choose where to invest, where there's still differentiation, but there's also a lot of market standards. So, how do we pick the right target areas to invest in, invest strategically there, and try to simplify the overall user experience?

Thumbnail 270

This leads to a pretty obvious conclusion that running this in the cloud, driving this integration in the cloud, is the natural place. And that's where you see our partnership with AWS, which started in 2017, where Shell started what used to be called SDU. That's become OSDU for those of you who know, and it's now also a managed offering called EDI within AWS. I think we can talk more about that when we talk about the overall stack. But Accenture is a key strategic partner in our hyperscaler strategy. And in HPC, we've been in production since 2022, supported by our energy industry peers at AWS, a ProServe team embedded within my HPC engineering team, and of course, our own developers and users.

It's been a really long journey. I'll talk about this in more detail shortly.

Thumbnail 320

Seismic Exploration and the Need for HPC - Business-Critical Challenges and the Importance of Variable Capacity

So, let's talk a little bit about HPC. Who uses HPC? Why do we need HPC? Why do we need more HPC? Our industry is famous for having machines on the top 20 list. We have a lot of compute. And very simply, why do we need a lot of compute? To do a lot of physics. What are we doing? Seismic exploration. For those unfamiliar with seismic exploration, we send sound waves kilometers deep beneath the seabed. We listen to those waves up to 40 kilometers away. That generates petabytes of data. That data then needs to be processed with a huge amount of compute.

Generally speaking, the algorithms we use are decades old, but we couldn't afford the compute. So as compute becomes more affordable and we get more compute power, we just throw more physics at it. We remove assumptions. We put in more parameters. As we go into more complex areas to find energy, as we want to better image the basins we're already in, we need more geophysics. Scaling these algorithms requires exponentially more compute.

Generally speaking, if you give a geophysicist new hardware, they typically max it out until the next thing comes along. Storage is similar: the only kind of storage we ever seem to have is full storage. This is what's driving what we need and why we need so much compute. So let's also put HPC in the business context. All of this imaging work we're doing is ultimately on our company's critical path to drive revenue and to drive those wonderful numbers I showed you at the beginning.

We need to have an image of what the subsurface looks like. So this is on our critical path. The good news is, this is an embarrassingly parallel workload. So generally, the more capacity you throw at it, the faster you can process it. There are always nuances and bottlenecks at scale, but generally speaking, this is a very parallelizable workload, and more capacity leads to faster results.
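
As a minimal illustration of why this property matters (the shot-processing function below is a toy stand-in, not Shell's code): each shot gather can be processed independently, so the work fans out cleanly across however many workers are available.

```python
from concurrent.futures import ThreadPoolExecutor

def process_shot(shot_id: int) -> int:
    """Toy stand-in for processing one independent shot gather."""
    return shot_id * shot_id

def process_survey(shot_ids, max_workers: int = 4):
    """Fan independent shots out across workers. No shot depends on
    any other, so wall-clock time falls roughly as 1/workers until
    I/O or scheduling overhead starts to dominate."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process_shot, shot_ids))
```

Doubling `max_workers` (or, in the real system, the number of GPU nodes) roughly halves the elapsed time, which is exactly the property that makes burst capacity valuable.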

Thumbnail 460

So, what do we want to do, and why do we need variable capacity? We want to be able to make capacity decisions relevant to business decisions. Otherwise, what have we done in the past? Guesswork. Buy a system, hope it's enough, procurement cycles are long, and the size of the investment means you're tied to that decision for years. We wanted to be able to realize the value of information and make capacity decisions much faster based on business decisions.

It's an HPC waiting game. For those of you who have deployed systems, and we still manage systems on-premises, it's becoming increasingly difficult to put equipment in data centers. Whether it's the AI boom, securing enough power, or just general supply chain challenges, this is a huge challenge. And I'm very happy that AWS is doing a wonderful job of that for us, and that you all worry about a lot of the things we historically had to deal with.

More complex geophysics is why we need a lot of HPC. And the reason we need the cloud is really because it gives us that variability, that flexibility, and lets us say we can do something when we need it. This year, we took our innovation strategy up to our executive leadership, and we said: okay, we want to make this happen, we want to pursue one of those targeted differentiation areas. Within a matter of weeks, that capacity was fully online, and the researchers were off to the races.

Historically, as an HPC manager, somebody would come to me and say, "I want more, and here's the money," and it would be, "Great, but where's the rest of the money, and where's the money for the next three or five years?" Now it's, "Okay, let's do it. Let's start, let's go." That's where we are today. So let's talk about the challenges and how we got there.

Thumbnail 590

Trial and Error from 2017 to 2022 - The Turning Point Brought by the 10X Challenge

It's been a long journey. And the good news is, to be honest with you, it started really slow, but then it got pretty fast. If I use Amazon's analogy of the flywheel, when we started this, it was very much a journey up the stairs.

It was very slow, very difficult up the stairs, sometimes wondering if there was even another step, or if we were just looking at a wall. But we got to the flywheel. So how did it start, and what was 2017 to 2022 like? This is half a technical story, and probably half an emotional story as well.

In 2017, when we started talking about HPC in the cloud, people looked at us weirdly, like what are you doing? Why are you doing that? That won't work. So from 2017 to 2022, we tried a lot of different POCs. Oh, Spot looks interesting, that's affordable, and we tried all sorts of things. Oh, but it can't handle interruption. Okay, let's try something else. Let's try lots of different heterogeneous SKUs. Oh, that didn't work either. And we just went from one thing to the next, and it really wasn't working. We weren't getting to scale, we weren't getting to our goals.

A lot of what we were doing, in hindsight, was just trying to do exactly what we were doing on-premises. Okay, I have 100 nodes on-premises, so I need 100 nodes in AWS. Oh wait, the nodes aren't the same, now I only have half the power, so I need 200 nodes. We were adding so much complexity by forcing the same thing.

The biggest turning point for us was the 10X challenge. In 2021, so pre-production, early 2022, somebody who is in the room here today came up with the 10X challenge: if we can achieve a 10X improvement in wall-clock time, that will change how we think about compute, it'll change how we do business, it'll change everything. Think about your life. If something that takes you 10 hours can be done in one hour, you change how you do it. So when you 10X something, it fundamentally changes how you behave towards it.

So the 10X challenge was, okay, in AWS, let's forget about trying to replicate what we're doing on-premises. It's 10X wall clock time. And the other thing we're going to do is separate the technical from the commercial. Let's prove it technically first. Can we do it? How fast can the car go? And if the car can go really fast, that's great. We'll talk about the commercial aspect. Because if it's really expensive, or too expensive, we might not do it. If it's really expensive, we might only do it once a year. And if it's affordable, we'll do it all the time. And that became the challenge.

As we started looking at that metric, a few other stars aligned. One was that our main algorithm, Full-Waveform Inversion, which is very common in the industry, migrated from CPU to GPU. This was 2022, and the P4DE with NVIDIA's 80 GB A100 had just come out. And thanks to NVIDIA's reference architecture, the nodes were very similar to what we were putting on-premises. So we had some homogeneity; it was no longer "this CPU node is different from that CPU node over there." So it started to work.

Another thing was, because of petabytes of data, we always thought we had to move everything. And that data was gravity, it was the anchor that kept us on-premises. What we started to think about was, well, maybe it doesn't have to be. What if compute was the magnet? And what if compute could actually pull the workflows, enabling different ways of doing things? And I'll show you what that looks like on the next slide, but that's one of the things about how do we decompose the workflows to do what we want to do? And again, instead of just replicating what we're doing on-premises, how do we think about the workflow differently, and how do we optimize it?

And another really subtle thing we noticed was that 10 gig Direct Connect became very affordable in the United States. When you look at the scale of the cost of compute versus the cost of a 10 gig or a 100 gig Direct Connect, it's like, oh, that's manageable, that's a very investable item.

And in 2022, we went live in production. And it was great. We went live on December 23rd. I was very happy because AWS almost ruined my Christmas, but they didn't. And the people who filled the queue are sitting in the front row here. They joined on the 23rd. They came online, it was full within minutes, and it stayed full for the entire period we had it. It was truly impressive.

Honestly, though, it wasn't all great. The challenge we had with P4DE was getting it. We had some base capacity, but it was a scarce resource, and we were trying lots of clever ways to get it, find it, or call Husain and yell at him.

Thumbnail 910

It wasn't great. We definitely weren't reaching the 10X goal. Did it accelerate projects? Sure, but I would say we were accelerating in the 3X range. We were doing some great things, but we knew we could do more.

Migration to P5EN and H200 - Achieving Predictability with EC2 Capacity Blocks

And this year, we migrated to P5EN and the NVIDIA H200, which was great for our workload, because that HBM really helped. We also started using EC2 Capacity Blocks, which was, and still is, a game-changer for us in terms of predictability. We now have visibility weeks in advance into what we need, because we understand our projects. It's not waking up in a panic and needing compute that day. We have some visibility into when these projects are coming, so we reserve capacity ahead of time. We have dashboards, we know when the capacity is coming, and we also know when it's going away. That predictability of burst capacity has been a real game-changer for us.
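
A sketch of the planning logic this predictability enables (illustrative only; the actual reservations would be made through the EC2 Capacity Blocks API, and the project numbers here are made up): because project start dates are known weeks out, the burst capacity to reserve for each week can be computed directly from the project schedule.

```python
from collections import defaultdict

def weekly_capacity_plan(projects):
    """projects: iterable of (name, start_week, duration_weeks, instances).
    Returns {week: total_instances}, i.e. how much burst capacity to
    reserve as a Capacity Block for each week."""
    plan = defaultdict(int)
    for _name, start, weeks, instances in projects:
        for week in range(start, start + weeks):
            plan[week] += instances
    return dict(plan)

# Two hypothetical FWI projects overlapping in week 2:
plan = weekly_capacity_plan([("survey-a", 1, 2, 8), ("survey-b", 2, 2, 4)])
# plan == {1: 8, 2: 12, 3: 4}
```

The dashboards mentioned above would then just visualize this map: when capacity arrives, and when it goes away.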

Now, the easy way I explain it is that what an engineer can do with a click, or a few clicks, historically took six to nine months of putting out an RFP, buying machines, getting them up and running, and deploying them, sometimes even 12 to 18 months. And then you had to keep them for years. Now, you click a button, and it's there. It's frighteningly as easy as shopping on Amazon, so I have to rein in the engineers, because they've gotten so used to spending very large amounts of money online. But it's truly amazing to see the power of Capacity Blocks. I'll talk later about what that has done for our engineers and our actual HPC users, but it's been truly fun. This is how we got here, and now we're getting into the flywheel.

From when we got the H200s up and running at the beginning of this year, to what we've done this year, the pace of innovation and optimization is super accelerated. What we're doing now is bringing in many more use cases. The ability to spin up and respond to business needs, R&D use cases, production use cases, AI use cases, GenAI use cases, it's astounding. Are we reaching 10X? No, we're not reaching 10X. I would say we're now consistently in the 3X to 5X range. With P4DE, we struggled to get 1X to 3X. We're definitely in the 3X to 5X range now, but there's still more to come. That's part of the journey, and again, I'm confident we'll get there, and then we'll continue to push even further. So, we'll keep pushing.

Thumbnail 1110

One more thing that's important to me is that price-performance is still number one. My most important job, as I explain to people, is to give HPC users the best price-performance. Getting the best performance at the best price and ensuring availability is paramount in what we do. I'll save those statistics for later, but this is what it currently looks like internally. Petabytes of data come in. We ingest it on-premises. The first thing we generally do is depopulate, and we start shrinking the model.

This is where our CPUs still exist. Yes, AWS is looking at this too, but this is still an unsolved problem. But it's valuable, so we run it there. We ingest data and do a lot of CPU-intensive processing on-premises, but eventually for Full-Waveform Inversion, FWI, the input files go from petabyte scale to about 10 terabytes. Those 10 terabytes are sent to AWS via Direct Connect. And now, we use a lot of P5EN or P-series instances to run that. This works very well.
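
A back-of-envelope check on why this split works (the link-efficiency factor below is an assumption, not a measured number): the roughly 10 TB FWI input is cheap to move over a Direct Connect link, while the raw petabytes would not be.

```python
def transfer_hours(terabytes: float, gbps: float, efficiency: float = 0.8) -> float:
    """Rough hours to move `terabytes` (decimal TB) over a `gbps` link
    running at the given utilization efficiency."""
    bits = terabytes * 1e12 * 8
    return bits / (gbps * 1e9 * efficiency) / 3600

# ~2.8 hours for the 10 TB FWI input over a 10 Gbps Direct Connect...
fwi_input = transfer_hours(10, 10)
# ...versus ~11.6 days for a single petabyte over the same link.
raw_petabyte_days = transfer_hours(1000, 10) / 24
```

This is why the data reduction done on-premises, rather than the link itself, is what makes the hybrid workflow practical.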

We've also just migrated to PCS, Parallel Computing Service. This is SLURM-based. It looks and feels very similar to us engineers. I'll show you what it looks like from a user's perspective on the next slide. But this really, really works well. And the output is 1 terabyte. Yes, there are egress costs. I always complain to AWS about that, but it's like a ticketing fee. But again, from an overall perspective, it's worth it for the overall outcome. And then it comes back, and if the remaining CPU processing is needed, it's run on-premises.

Data gravity still exists, but what we're actually doing is enabling many more workloads. And finally, I'll show you how this integrates into the overall stack. But this is how it works. From a capacity perspective, we have dedicated nodes, we have capacity blocks, and what we're now starting to explore is what we can do with Spot, since we're seeing a bit more Spot for P-series instances. But this is how we actually make it work, and how a hybrid solution works very well for us.

Thumbnail 1270

This is a Full-Waveform Inversion use case. Other use cases, for example, if you're doing large language model fine-tuning, that's entirely in the cloud. There's no CPU processing on either side, so it can run natively in the cloud. Some of our innovation and R&D workloads run natively in the yellow box in the middle. It's probably too small, but that's okay. I'll explain the important parts.

Thumbnail 1300

Hybrid Architecture Details - Multi-Region Support and PCS Utilization

So the left side is our user environment. This is our virtual workstation. So that's where users log into their virtual machines, Linux or Windows. And what's important here is that this is the same interface they use on-premises, and it's the same interface they use in the cloud. So this is their virtual workstation, currently housed on-premises. And on the AWS side, what you're seeing here is the Parallel Computing Service, PCS, running. This is AWS's managed SLURM offering. Jobs go through this API Gateway, and I'll point this out: you're probably all staring at the diagram, thinking, is there a hamster in this diagram? Yes, there is a hamster in this diagram.

Our optimization team has a thing for rodents. The team itself is called Rats. This is Hamster. We also have a tool called Mice. The person who's on support each week is called the Rat in the Hole. Yes, it's a hamster. Yes, it's a tool. This is a really important API to make all of this work. But ultimately, job submissions happen via API calls through Hamster, but ultimately it's submitted to SLURM, just like on-premises.

And of course, this isn't a single pane of glass. Users are aware that they are running in the cloud, and the biggest reason is that they have to make sure the data is there. So there are jobs that we can orchestrate to move the data there. Other details aren't well represented in this slide, but the point is how we've been able to leverage the cloud, find capacity, and leverage scale.
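
To make the submission path concrete (a sketch; "Hamster" is Shell-internal and its real interface isn't public): the gateway ultimately has to turn an API request into an ordinary sbatch invocation against the PCS-managed SLURM cluster, something like:

```python
def build_sbatch_command(script: str, partition: str, nodes: int,
                         gpus_per_node: int) -> list:
    """Translate job parameters from an API request into the sbatch
    command a SLURM head node would accept. The flag choices here are
    illustrative; real submissions would carry many more options."""
    return [
        "sbatch",
        f"--partition={partition}",
        f"--nodes={nodes}",
        f"--gpus-per-node={gpus_per_node}",
        script,
    ]
```

The point, as described above, is that the job is ultimately submitted to SLURM just like on-premises; only the transport in front of it differs.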

This is an architecture that works across multiple regions. So I can tell you as of this morning, we are running in three different regions. We are not yet running cross-region. This is something we are actively working on and discussing with AWS. How can we do that? Currently, we are running in multiple regions, but they all operate independently of each other. We are also running in multiple Availability Zones. The benefit of our workloads not being tightly coupled is that we can run the same workload across multiple Availability Zones within a region. And this is another way for us to access more capacity.

And on the data tier below that, we have FSx. But what we realized was that we had to think differently about FSx as well. Because if you use Lustre in the cloud like we do on-premises, it can become more expensive than the GPUs. And that's the nature of how we use Lustre on-premises, and the lack of tiering like in the cloud. We just have really, really hot storage, and that's Lustre for us. And really, really cold storage, and that's tape. We don't have this entire tiered structure that AWS has.

So, what we're doing on AWS is using FSx for Lustre backed by S3. The way our jobs work is that we run a lot of computation, get a model, and it lands on FSx for Lustre. Then the model gets updated again, and again. Older versions of the model can be moved to S3. This means that as long as only the latest version is hydrated on FSx, we can keep that FSx file system to a very small fraction of the entire project.
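
A sketch of the bookkeeping this implies (illustrative only; in practice the release to S3 would go through FSx for Lustre's S3 data repository integration, and the version names are made up): keep only the newest model iteration hydrated, and let everything older live in S3.

```python
def split_hydrated(versions, keep_latest: int = 1):
    """versions: model iteration ids, oldest first. Returns
    (hydrated, archived): only the newest `keep_latest` versions stay
    on the FSx for Lustre file system; the rest can be released to S3."""
    hydrated = list(versions[-keep_latest:])
    archived = list(versions[:-keep_latest])
    return hydrated, archived

hydrated, archived = split_hydrated(["iter01", "iter02", "iter03"])
# hydrated == ["iter03"]; archived == ["iter01", "iter02"]
```

Because only the hydrated slice has to fit on Lustre, the file system stays small no matter how many iterations the project accumulates.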

And again, if you remember how I described the workflow, you just delete it at the end. We treat all of this as ephemeral. The input data you upload, we don't take it home. We have a copy of it on-premises. It's just, well, safely and securely deleted. But this architecture has allowed us to be agile in the cloud. And now, with capacity blocks, we know we're going to move to this region on this day. And with PCS, we can spin up a cluster in minutes, well, an hour at most.

And this has allowed us to get on that flywheel, and really get that flywheel spinning. It's wonderful to see. This morning, Matt announced the GB300s. I can ask our data scientists if they want to test the GB300s tomorrow. We can. The pace at which we can do things, trial and benchmark, what works and what doesn't, where we need to go to invest our time with NVIDIA and leverage the latest chipsets, these are all thanks to the pace at which we can do things. Because in the old days of HPC, how would we do it? A new chip comes out. Let's call NVIDIA. Let's get a white box. Does it have the right OS? Is it compatible? How do we connect it to the network? There were all these steps.

Now, you just get a capacity block. You can get it on demand, or you can spin it up on Spot. So, it's the pace of what we can do. Yes, the biggest thing I want to show in this diagram in the future is multi-region. How can we really leverage multi-region, chase capacity, and leverage Spot more than we do now? Let me give another plug for PCS. Some of you may have seen me talk about PCS yesterday, but one of the things we can now do by letting PCS manage our SLURM compared to Parallel Cluster is that we can make seamless updates to the cluster.

With ParallelCluster previously, if you made a change, such as adding instances, you had to redeploy and stop the cluster. Now, if you want to add an instance type to one of these clusters, you just add it, and the managed SLURM service updates automatically, making it seamless for the user. It's a really great solution.

Thumbnail 1630

Building the Entire Integrated Stack - The Value of HPC Orchestrator and Co-innovation

Now, I'm very biased about this slide, but what we've been talking about is the bottom, which is obviously the foundation that supports everything, and it's the most important part. But to be honest, the reason we're doing all of this is to go back to the upstream challenges I talked about at the beginning. How do we provide an integrated stack? How do we provide competitive differentiation? How do we provide market standards? So, when you look at what we're providing, I'll say again here, this is not a data gravity story, HPC is an enabler, and now it becomes a conversation about where do we need it, and in which workflows do we need to enable it? And we go up the stack, driving an integrated experience. But Husain, can you elaborate on what we're doing together, and what you're doing across the entire vertical, to help build this whole ecosystem?

Yes, thank you, Michael. So, I want to bring you back up to 10,000 to 15,000 feet. Why do we need to look at the forest and not just the trees? It's important to make sure that the foundation and infrastructure, and the business value, are driven by different levels of the stack. But don't forget what the rest of the stack looks like, and why it's needed. Whether it's data, end-user experience, innovation, and increasingly, generative AI and AI capabilities embedded within these workflows. So, for us, this represents the whole picture. And we are fortunate that Shell is a great partner and constantly reminds us that the big picture matters, and what else needs to be done. Not just thinking about compute, not just thinking about data, but thinking about everything.

In terms of end-user enablement, there's still a lot to do. We saw in the previous slide how data comes from on-premises to the cloud and back again. The remaining pieces at the top, such as application access, can also reduce a lot of cost and friction when Shell starts to be able to exist in the cloud at the price performance it needs, and on demand. This applies to many other operators in this room, and operators we partner with around the world. For me, this is a multi-year, end-to-end journey to achieve. Because, to be honest, the teams working on some of these are completely disconnected.

As Michael heard, some of the R&D teams are doing a lot of AI, generative AI, models, simulations, and more, using the same AWS infrastructure that Michael can access for HPC. How do we make that experience seamless? There are teams working on data management, ingestion, deployment, and integration with applications. How can these datasets also influence the results coming out of HPC and the results going into R&D?

And one of the critical enablement pieces that people often forget when they just want to do their daily business is the co-innovation part. I think one of the best opportunities I've worked on in my career was the co-innovation we did with Shell, Oxy, and others. There, we created a whole new way of looking at HPC orchestration and workflow enablement in the cloud, without repeating the same mistakes that Michael mentioned had been made on-premises for decades. That's the HPC Orchestrator. This is a low-code, no-code tool that we literally came up with as an idea together, implemented, and built. And now it's being tested by almost every operator I talk to. Because it makes HPC in the cloud much more seamless and removes the complexity and adoption barriers that Michael was talking about. For the first few years, it's simply about spinning it up, getting in, and building templates of what you're trying to build and run your workloads. That could be HPC, it could be AI, it could be all sorts of simulations.

That's one area. We also have some initiatives related to agents in the GenAI space, which I hope we can talk about next time. But for me, this is a bigger journey and a bigger purpose that we're pursuing with Shell, and I'm very excited about it. And this will benefit many of our other operators and customers, and it will also help realize the partner ecosystem that we've built into it. So, I'll hand it back to Michael.

Yes, thank you. What I want to address is the HPC Orchestrator that Husain mentioned. I think this is a great example of how, to leverage the cloud, you need to think a little differently and use the tools that are there, rather than just using the tools you're familiar with. When you look at the templates for HPC Orchestrator, those templates are doing really smart things, like driving asynchronicity or using storage in smarter ways. These are things that improve speed, cost, and throughput. So, instead of redesigning, it's about thinking from a cloud-native perspective. And the fact that there are templates is also important. There are many areas where our industry competes, but there are also many areas where there's a lot of room for collaboration.

Thumbnail 1970

So, templates for how to best use cloud-native technologies are great areas for us to collaborate. And we can all use the same templates, while keeping the code that we consider differentiated to ourselves. So, I think this is a great example of what AWS is doing here, and I thank them for pushing us as an industry to realize this vision. So, let's talk about the fun stuff. So what is it? You have some nodes, you have a lot of computers, that's great. You guys are operating them wonderfully. Matt was very proud of your Nvidia reliability in his keynote this morning.

The Outcome of 2.5 Years of Time Savings - Shift in Mindset and Future Prospects

So, what has that burst yielded? Unfortunately, it hasn't reached 10X, but it's been 3X to 5X. We looked at the numbers and asked, how much wall-clock time have we saved on these projects since 2022? 2.5 years. We saved 2.5 years of wall-clock time thanks to being able to accelerate with burst capacity. This is important. This flexibility is important. It's important for our business, and it's important for how we think about HPC. This honestly gives us a variability and capability that we've never had before. But for me, looking at it now, I've been involved in some way in this journey since 2017, but looking at what we're doing this year, this is exactly what we wanted to be.

There's still more we want to do. There's still more we need to do. There's always more to do. We can always do better. But we're doing what we dreamed of. We're doing what we wrote in the PRFAQ in 2017. Now we're actually starting some of it. I might have had the date wrong when I first wrote the PRFAQ, but it's happening. And I think this is the beginning of covering a lot of the foundation and groundwork so that we can do more things faster next. And the mindset has changed. I'll touch on that too, but regarding deployment, I now think differently about how we manage internal HPC demand.

Thumbnail 2090

If you look at the current standard lead times, bringing in machines of the scale we use takes a very long time, requires a lot of planning, and a lot of capital. But now, we can do it with a click. This is a very big difference. This has allowed us to focus on different areas of the business. We can respond when needed, and we can really focus on where we as a company can maximize our investment in time, effort, and innovation. This is truly interesting, and it's also leading to a change in mindset. At the beginning of this year, when I talked to the researchers and said, "We're moving your GPUs to the cloud," some of them were not very happy with us. They said, "No, no, no, I like my on-premises nodes." I said, "It's okay, it's okay." And now, we can't take them away from them. Because now they're saying, "Oh, can I use nodes with large memory? More nodes with large memory? Can I reduce the number of cores? Can I increase the number of cores?"

The heterogeneity and innovation we're seeing, can you spin up this agent, can you use this service, this is a really big difference. Historically, it was like, "Do you want that? Yeah, we can have an engineer spend a few days cobbling something together as a stopgap." But this has completely changed the conversation with innovators, and it has changed the mindset. And changing the mindset is both a blessing and a curse. Because when you have a really big system, the curse is that you might not scrutinize whether a job needs to run, whether it's the most important thing that needs to run, or what the priorities are, simply because you're trying to fill the system.

The challenge now is not whether the system is full, but how big you want the system to be. And a really interesting question I get asked is, "So, how much do you want to spend?" Now we're having a very different conversation about value. And another really interesting conversation we can have now is, "Do you want to go faster? How much faster?" That's where 5X comes in, and hopefully it will be 10X, but where do you want to accelerate? So, as I said, we are delivering on what we expected at the outset, and we have learned lessons.

Thumbnail 2220

It comes back to vision. This was a technical journey, and as I said at the beginning, it was also an emotional journey. And one of the biggest things that unlocked it was finding how to do things differently. So, for me, to all of you out there: what is your 10X challenge? Maybe it isn't exactly 10X, but please embrace the idea. You've heard AWS say it before, and you're probably tired of hearing it, but it makes a big difference. If you just say, "I have 100 nodes, and I want 100 nodes there," I can guarantee you'll find reasons not to like the cloud. Because whatever you do with your 100 nodes will feel preferable to the cloud's 100 nodes, and it probably won't be a great setup. But what is your 10X?

For us, it's speed. And what I would recommend is to continue to separate the commercial from the technical. AWS loves technical problems, and the teams dig deep. That allows our engineers to work on truly fun and challenging things. Of course, the commercial aspect is always a consideration, but that's a given. But this separation has helped. And then, another thing we learned a lot about is taking risk out of the most important things first, and trying to do it quickly, and actually doing it quickly when everyone is in the same room.

We've done this a few times. Amazon has a program called EBA. What does it stand for? Experience-Based Acceleration. Basically, one of the things we did was say we want to be able to run and operate across multiple regions. We brought a lot of our engineers together, and we brought AWS people together. It didn't get operational in five days, but we fundamentally took the risk out of the most important things. And in fact, one of the most important things was getting Hamster to work. You heard about that earlier. That gave us confidence. It wasn't fully implemented, but we realized we had taken the risk out of the most important things: how to communicate from on-premises to multiple regions, and how to orchestrate all of that. And by the end of the week, we were running dummy jobs.

But that gave us the confidence to say, okay, let's go. But having all the right people in the room, and that's where the partnership with AWS has been truly fantastic for me. If you ask and push, you always get a response. And then another thing, this is my everyday life in HPC, is constant optimization. It's never enough, it always needs to be faster. There's always a bottleneck, it's always too expensive. But it's constant optimization.

Thumbnail 2380

So what's next for us? We've talked about it, workflow integration and new workflows. And again, we're already doing that today. You want to fine-tune an LLM, okay, let's do it. What else can I say? A few weeks ago, we were training that LLM on H200s, and now we're training it on Blackwell. Why? Because the workload drives it. You get the price performance. We're there. That rapid response to needs is very, very real.

Integrated modulation, we talked about the Energy Orchestrator, but for me, it's also beyond that. It's integrating into EDI where our data resides. It's agentification for energy. It's about achieving a whole integrated workflow, both for what we build and for our vendor and partner ecosystem. It's gluing all of that together. And again, it's changing that paradigm of not being able to go to the cloud because the data isn't there. For me, if you provide an integrated environment and there's an attraction, users will come. Because it becomes the obvious place. It's not if, or maybe, but it has to be there.

Thumbnail 2480

Constant optimization, we've talked about this. Again, this is what we do day in and day out. We're always looking at something and thinking: can't we do this better? And today, for our HPC, that's multi-region. That's the next big challenge we want to de-risk. And price-performance. On the long walk to get here, I bumped into your accelerator lead, and it was, "When do you want to test the B300s? They're out already, when do you want to test them?" So looking at that price-performance, looking at what's suitable for our workloads, and how to deliver it in the most meaningful way, that's our next step.

So, as a closing comment, I want to express my sincere gratitude to AWS. It's been a long journey. As I said earlier, sometimes the stairs looked like walls, but I'm truly, truly grateful for this journey. And personally, I want to sincerely thank all the sponsors and engineers on Shell's side. It's a great honor for me to stand here and tell this story, but it's only possible because of the hard work of all the people who made it happen. So thank you all, and thank you AWS. And of course, thank you for having me.


※ This article was automatically generated using Amazon Bedrock, maintaining the information from the original video as much as possible. Please note that there may be typos or incorrect content.
