
re:Invent 2025: Data Warehouse Modernization Case Studies via Migration to Amazon Redshift


Introduction

By transcribing overseas conference talks into Japanese articles, we aim to make valuable but otherwise hard-to-find information more accessible. The presentation covered in this installment of the project is this one!

For re:Invent 2025 transcription articles, information is summarized in this Spreadsheet. Please check it as well.

📖 AWS re:Invent 2025 - Modernize your data warehouse by moving to Amazon Redshift (ANT317)

This video explains the modernization of analytics data warehouses using Amazon Redshift. According to a Harvard Business Review study, while 83% of data leaders prioritize generative AI and agentic AI, over 50% perceive their data infrastructure as an obstacle. Redshift, with features like its multi-warehouse lakehouse architecture, Zero-ETL integration, and MCP Server, enables the elimination of data silos and the effective use of AI. Charter Communications achieved a 35% cost reduction and 18% SLA improvement through their migration. In the case of Roche Pharmaceuticals, a team of 150 engineers transitioned from ETL to ELT, integrating 300 data sources and reaching a scale of 3 million queries processed per day. Organizational reforms emphasizing DevOps, Agile, and transparency, combined with the use of technologies like Redshift Spectrum, Lambda UDFs, and Bedrock integration, made development that used to take months now possible in days.

https://www.youtube.com/watch?v=o582f9kMjb0

* Please note that this article was generated automatically while preserving the content of the original presentation as faithfully as possible. It may contain typos or inaccuracies.

Main Content

Thumbnail 0

Session Overview and Agenda: Data Warehouse Modernization with Amazon Redshift

Welcome to our session, "Modernize your analytics data warehouse with Amazon Redshift." For about the next hour, I'm going to talk to you about how Amazon Redshift, as a modern data warehouse, helps you break data silos, integrate data, analyze it in various forms, and then deliver transformative business value.

First, a quick introduction to our speakers today. My name is Manan Goel, and I serve as a Principal Product Manager on the Amazon Redshift team. I've been with Redshift for about seven years. I'm joined by a few other presenters, which I'm very excited about. Satesh is joining us as a Principal Solutions Architect for Redshift, and he'll walk us through some of Redshift's capabilities. And last but not least, Yannick from Roche will be joining us. So, let's get started.

Thumbnail 60

Let's take a quick look at our agenda for today. First, I'm going to talk about some of the key trends we're hearing when we talk to data and analytics leaders about their analytics needs. What are we hearing from our customers? Second, I'm going to talk about how Redshift has evolved and how we've been adding new capabilities and features to address some of those trends. Third, I'm going to give you an opportunity to see Redshift in action. I know agentic AI is top of mind for all of you, so I'm going to walk you through some use cases, including a live demo, on how you can use Redshift for your agentic AI use cases using things like Redshift's MCP server.

Last but not least, I'm going to hand it over to Yannick to talk about Roche Pharmaceuticals' data warehouse migration and modernization journey. They have done a phenomenal job consolidating multiple data warehouse environments across multiple geographies, so you're going to get an opportunity to hear from him about some of the best practices around migration and modernization. And then last, we're going to wrap it up with next steps and how you can get started on your modernization and migration journey with Redshift.

Thumbnail 140

So let's start with the trends. I want to start first with a recent study that Harvard Business Review conducted with more than 600 data and analytics leaders, just like yourselves. And what this study revealed were two very important trends. First, 83%, so more than 8 out of 10 data and analytics leaders, said that generative AI and agentic AI are a critical strategic initiative for their organizations. I'm sure you're seeing those trends in your organizations as well, but everyone is looking at these technologies as foundational technologies of our time to help us get more insights from our data and deliver truly transformative customer experiences.

But interestingly, in the same study, Harvard Business Review also found that for more than 5 out of 10 customers, the data infrastructure is an obstacle in deriving value from data analytics. So, there is a lot of work to be done to get your data foundation ready to be able to leverage foundational technologies like generative AI and agentic AI.

Thumbnail 230

Now, when Harvard Business Review delved deeper into the key areas these data analytics leaders are focusing on to prepare their data foundations, four important points emerged. They are truly seeking highly curated, high-quality data. This is because agentic AI and generative AI are fundamentally based on the value of data. The better the data you have, the better the insights you get. Therefore, data quality and curation continue to be critical initiatives that these data leaders are working on.

Other areas include eliminating data silos, removing data fragmentation, and consolidating data in one place, whether that means moving data or providing federated access. These are crucial areas that leaders are focusing on. The third point is data security and governance. One of the things we're doing with generative AI and agentic AI is exposing it to far more users, both human and machine, inside and outside the organization. Therefore, in that context, governance and security are truly important.

And finally, also related to agentic AI and generative AI, what we're seeing is a very dramatic increase in the magnitude of queries.

Thumbnail 320

Therefore, being able to deliver these capabilities at scale while remaining cost-effective is also extremely important. And that is exactly where Redshift, as a modern data warehouse, comes into play. Redshift, out of the box, provides a host of capabilities that help you build a very strong data foundation in each of these areas.

Amazon Redshift's Architecture: A Strong Data Foundation with Multi-Warehouse Lakehouse Structure

Looking at Redshift's architecture, starting from the center, I want to talk about a few points regarding Redshift's foundational capabilities. These capabilities make it very easy to build a strong data foundation. First and foremost, Redshift provides the ability to build this multi-warehouse lakehouse architecture. You can move away from a single monolithic data warehouse environment to a more modern, distributed data warehouse architecture.

It offers capabilities like storage and compute separation, allowing you to scale one independently from the other, achieving optimal price-performance depending on your workload. It provides a lakehouse architecture, meaning you can store and process both structured and unstructured data, whether it's in your data warehouse or an open-format data lake using Iceberg, S3 Tables, or a general-purpose S3 bucket.

Another foundational capability of Redshift is this multi-warehouse architecture. We are seeing many customers migrate from a single monolithic cluster to this multi-cluster architecture, which is more scalable, reliable, and cost-effective. You can use both provisioned and serverless compute depending on your use case. We also provide features like data sharing, where you have only one copy of the data, and multiple computes can access it while maintaining workload isolation, preventing noisy neighbor problems or resource contention, such as a dashboard workload impacting a data science workload.
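To make the data sharing idea concrete, here is a minimal sketch of what the producer and consumer sides look like in SQL, run through the Redshift Data API. The share, workgroup, and namespace identifiers are hypothetical placeholders, not values from the session.

```python
# Minimal data sharing sketch via the Redshift Data API (boto3).
# Share, workgroup, and namespace IDs below are hypothetical placeholders.
import boto3

rsd = boto3.client("redshift-data")

# Producer side: publish a schema once; no data is copied anywhere.
rsd.batch_execute_statement(
    WorkgroupName="producer-wg",
    Database="dev",
    Sqls=[
        "CREATE DATASHARE sales_share",
        "ALTER DATASHARE sales_share ADD SCHEMA public",
        "ALTER DATASHARE sales_share ADD ALL TABLES IN SCHEMA public",
        "GRANT USAGE ON DATASHARE sales_share TO NAMESPACE '<consumer-namespace-guid>'",
    ],
)

# Consumer side: mount the share as a database and query the single copy,
# with full workload isolation from the producer's compute.
rsd.batch_execute_statement(
    WorkgroupName="dashboard-wg",
    Database="dev",
    Sqls=[
        "CREATE DATABASE sales_from_share FROM DATASHARE sales_share "
        "OF NAMESPACE '<producer-namespace-guid>'",
        "SELECT COUNT(*) FROM sales_from_share.public.orders",
    ],
)
```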

In addition to that, we also provide the ability to do various types of analysis: SQL analytics, Spark analytics, AI/ML, generative AI analytics, and the ability to integrate data from multiple sources within your organization. Whether they are relational sources or streaming sources, we can handle them using features like Zero-ETL integration and streaming ingestion. Finally, from an infrastructure perspective, we extensively leverage AI and ML capabilities within the data warehouse to automatically scale up and down your data warehouse as required.

Thumbnail 480

So, overall, Redshift provides a strong platform for building modern data warehouse architectures and addressing key needs related to analytics modernization. But of course, being AWS, we are never satisfied with the status quo. We continue to work backward from customer requirements and continuously deliver new capabilities to further improve data warehouses and enhance analytics.

Redshift Continues to Evolve: Over 60 New Features and Enhanced AI/ML Integration

In the last year alone, the team has been working aggressively and has released over 60 different capabilities across these four key areas, making it simpler and easier for you to build modern analytics architectures. I want to highlight a few of these. If you are building a multi-warehouse architecture, we provide performance parity across this distributed architecture. On the serverless side, we have improved serverless performance and expanded the RPU range.

Lakehouse analytics continues to be a big investment area for us, where we are treating Iceberg as a first-class citizen with respect to data warehouses, providing both read and write capabilities on Iceberg. On the near real-time analytics side, we are continuously expanding the database sources we support to allow you to move data in one click. We support on-premise and EC2-based databases like Oracle and SQL Server as sources, and business applications like Salesforce and ServiceNow as sources, seamlessly moving data from them to your analytics environment so you can run analytics quickly.

Finally, on the AI/ML and generative AI side, we are also providing capabilities like MCP Server and integration with Amazon Bedrock, allowing you to run LLMs within your data warehouse and easily build agentic applications. Of course, a picture is worth a thousand words. The product continues to evolve based on your requirements. I also want to share a customer success story, how one customer used Redshift to modernize their analytics infrastructure and what results they are getting in that environment.

Thumbnail 610

Charter Communications Success Story: 35% Cost Reduction and 18% SLA Improvement in Migration Project

Now, the example I'm going to talk about here is from Charter Communications. For those of you who are not familiar with Charter Communications, they are a major telecommunications and mass media company based in the United States. They provide internet, phone, and cable television services to millions of customers across the United States. You can imagine how important scalability, reliability, and the ability to do these things in minutes or seconds are to them. Think about when a new Netflix show comes out and becomes very popular. Everyone wants to watch it, and you want to be able to scale up so that when people are streaming, there are no streaming issues or anything like that.

This customer went through a migration and modernization journey, and the results you're seeing here include a 35% reduction in cost, an 18% improvement in SLA, and they migrated 600 terabytes of data, over 500,000 queries, and 40,000 objects from their on-premise system. What they were able to achieve is truly phenomenal. I want to give you a quick background on what their environment looked like before and after their migration, and what results they were able to achieve.

Thumbnail 690

This is the architecture they had when they were running their on-premise data warehouse in the past. You see a lot of duplication here. There's a lot of inefficiency in terms of multiple pipelines for batch and near real-time processing, and many of the problems you see in a monolithic data warehouse environment. They had shared resources, where multiple workloads were competing for the same resources, resulting in unmet SLAs, lacking the elasticity and scalability the business needed during events like the Super Bowl or a new Netflix show release, and also suffering from poor disaster recovery options and high operational costs.

Thumbnail 740

This is their final architecture after completing their migration to Amazon Redshift. As you can see, it's a much simpler and cleaner architecture. There's no longer duplication across multiple pipelines for batch and real-time. They have a single pipeline that handles both batch and real-time processing. They have this multi-cluster architecture, a hub-and-spoke architecture with a centralized data lake cluster that ingests all data. On the spoke side, they have a serverless data warehouse environment that provides workload isolation, better performance by offering purpose-built environments, and also reduces costs. For example, if you're using a serverless data warehouse for dashboards and no one is running dashboards on the weekend, the compute goes to sleep, and you don't have to pay anything for that compute. So, that's a very wonderful thing from a business outcome perspective.

Thumbnail 790

These are some of the results they saw: an 18% improvement in SLA. Elasticity and scalability, which used to take days or months in the past, now happen in minutes or seconds. Disaster recovery has significantly improved with dramatic improvements in RTO and RPO. They can bring new products to market faster and deliver new features to customers much more quickly. For example, this week we launched Iceberg materialized views, and they can quickly adopt such features, deploy them in their environment, and allow their customers to benefit from them. Finally, they have reduced their operational costs by over 35%. So, the results this customer achieved by migrating to Amazon Redshift, further modernizing with Redshift, and using it for AI, ML, and generative AI use cases are truly remarkable.

Thumbnail 860

Thumbnail 880

Live Demo: Integrating Multiple Data Sources and Querying with Zero-ETL Integration

So, what I'm going to do next is hand it over to Satesh to walk us through some product demonstrations and show you these capabilities in action. Thank you, Manan. So, let's start with a business use case. Let's assume you own a sporting goods company. You sell tennis rackets, basketballs, cricket bats, everything that everyone can play with. You're doing pretty well. Sales are good. You are a billion-dollar company, but a new CEO has joined, and she wants to grow sales even further. Who doesn't want more money? Everyone wants more money. So she asked your Head of Sales and Head of IT to come up with a plan on how they can improve sales revenue.

Thumbnail 900

They reviewed your business and IT processes and identified two gaps where you can maximize your organization's revenue. One is that campaigns don't reach customers at the right time. For example, if there's a game event happening today, and a personalized ad campaign reaches them at the end of the game or the next day, you're not going to get the maximum revenue, right? So that's the first gap. The second gap is that the marketing team is unable to quickly put together personalized promotions for customers. If you can optimize and solve these two problems, your revenue will grow. So these are the findings from the Head of Sales.

Thumbnail 940

And the CEO asks the Head of IT, why is this happening? How can we solve this problem? Fundamentally, the main reason for slow and ill-timed marketing campaigns lies in the long and complex pipelines that ingest data from various channels within the organization, and this needs to be optimized. The second point is that marketing analysts are heavily dependent on the IT team to run queries, get data insights, and provide data. This is because they don't have the necessary skill set to access all available data sources within the organization. These are the findings.

Thumbnail 980

Thumbnail 1000

Now, let's see how Amazon Redshift can solve these problems. The current ETL pipeline looks like this. Customer data is in a structured data source. Web application and social data are feeding into game analytics, and telemetry is coming from various applications. Diverse sources, very typical, right? And four different ETL technologies are processing this data, with different technologies, different skill sets, data landing along the way, eventually reaching the Redshift data warehouse. Because of such a pipeline, it takes a very long time to make data available to the end-user marketing team.

Thumbnail 1030

What if there was a magic bullet that could solve this problem? That is Zero-ETL integration. This is a fully managed service provided by AWS to replicate data from various sources to Amazon Redshift. As of now, it supports about 23 sources, all of which can replicate data to Redshift. These include AWS native databases, third-party sources like Salesforce, SAP, and ServiceNow, and also on-premises data sources like Oracle and SQL Server, which were just announced at this re:Invent. So you have plenty of options to simplify your ETL pipelines.

Thumbnail 1070

Thumbnail 1080

Now, let's see how this actually works. What you're looking at is the AWS Glue console, and on the left side, you see Zero-ETL integration. When you click on it, you'll see a screen to create a Zero-ETL integration. Click on create Zero-ETL integration, and we'll select Salesforce as the source. We'll get campaign data from Salesforce. Select the Salesforce connection and an IAM role that has access to the instance.

Thumbnail 1090

Thumbnail 1100

Thumbnail 1110

Now, you don't need to replicate all the data. You only need to replicate data specific to your use case. Here, we are selecting account, contact, opportunity, and campaign, so you can pick and choose the specific objects you need. While you're selecting the objects, you can also get a quick preview of what data you are getting. If you click on preview, you can see directly what you're replicating and do a quick sanity check.

Thumbnail 1120

Thumbnail 1130

Thumbnail 1140

Now we've selected the source. Next, we need to select the target. The target for this use case is Amazon Redshift; you can also select S3 or S3 Tables. Here, we are selecting a Redshift data warehouse in this account, and the source is displayed. Next, we click next. You can choose your own encryption key or leave the default, and set the refresh frequency, which can be as short as 1 second. Then give the integration a name; it can be anything you like, and here we'll name it Salesforce integration and click next. That's it. We've essentially finished setting up the pipeline that gets data from the source, by creating a Zero-ETL integration.
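The same setup can also be scripted. As a hedged illustration, here is roughly what creating a zero-ETL integration looks like with boto3 for an Aurora PostgreSQL source (the Salesforce flow in the demo goes through AWS Glue and its own integration API instead); all ARNs and names below are hypothetical.

```python
# Sketch: creating a zero-ETL integration from Aurora PostgreSQL to Redshift
# with boto3. All ARNs and names are hypothetical placeholders.
import boto3

rds = boto3.client("rds")

resp = rds.create_integration(
    IntegrationName="customers-zetl",
    SourceArn="arn:aws:rds:us-east-1:123456789012:cluster:customers-aurora",
    TargetArn=("arn:aws:redshift-serverless:us-east-1:"
               "123456789012:namespace/11111111-2222-3333-4444-555555555555"),
)
print(resp["Status"])  # 'creating' at first; 'active' once the initial sync completes

# On the Redshift side, the console's "create database" step corresponds to SQL like:
#   CREATE DATABASE customers_db FROM INTEGRATION '<integration-id>' DATABASE 'customers';
```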

Thumbnail 1160

Thumbnail 1170

It takes about 10 to 12 minutes initially, but once it's active, it becomes immediately available. All you have to do is create a database, and you can query the data replicated from Salesforce right on this screen. You can give it any name here as well. We'll name it Salesforce DB and create the database.

Thumbnail 1180

Thumbnail 1190

Thumbnail 1200

Now the setup is complete. Next, we'll go to Amazon Redshift and start querying this data. Click on Zero-ETL integration. You'll see the integration we created. It's active, and the database is also active. Click on the integration to query the data. The query editor opens with the Redshift navigator, where you'll see the tables replicated from Salesforce. Here, we selected account, campaign, contact, and opportunity. You can view the data simply by right-clicking on one of the sources.

Thumbnail 1210

Thumbnail 1220

Thumbnail 1230

This isn't just about replicating data; you can also achieve observability for this integration. Clicking on Zero-ETL integration shows you the number of replicated tables and the volume of data fetched from Salesforce. You can monitor this from the console itself, or using system tables and CloudWatch. So, observability is also built into this feature.
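The console view is backed by system views, so the same checks can be scripted. A small sketch, assuming the SVV_INTEGRATION system view and a hypothetical serverless workgroup name:

```python
# Sketch: checking zero-ETL integration state from the system views,
# assuming the SVV_INTEGRATION view; workgroup/database names are hypothetical.
import boto3

rsd = boto3.client("redshift-data")
resp = rsd.execute_statement(
    WorkgroupName="analytics-wg",
    Database="dev",
    Sql="SELECT * FROM svv_integration;",  # one row per integration, with its state
)
print(resp["Id"])  # fetch rows later with describe_statement / get_statement_result
```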

Thumbnail 1250

Thumbnail 1270

So we've seen how to get data from Salesforce, but what if you have multiple sources? Let me show you how you can bring in additional data, whether it's from DynamoDB or structured data sources like PostgreSQL. Let's briefly look at how you can get that data. I've pre-created Zero-ETL integrations in a similar way. As you can see here, channel data is coming from DynamoDB, and structured data is coming from PostgreSQL. You can click on either of them and then click Query Data.

Thumbnail 1280

Thumbnail 1290

Thumbnail 1300

Thumbnail 1310

When you perform Zero-ETL replication from all these sources, they are placed as separate databases within your cluster. Each source appears as its own database, and here you can see the data from DynamoDB, the channel table. Expanding the next one, you'll see the customer database, showing all customers and orders fetched from Aurora PostgreSQL. And then Salesforce. You've seen a detailed demo of how to get the data. The point here is that you can get data from multiple sources and run a single query against all these databases from your Amazon Redshift cluster. You don't have to go to individual sources and do a lot of ETL work around them.
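Since every zero-ETL source lands as its own database, that single query is just a cross-database join using Redshift's three-part database.schema.table names. A sketch, with all database, table, and column names assumed for illustration:

```python
# Sketch: one query spanning the three zero-ETL databases. All names are
# hypothetical stand-ins for the demo's DynamoDB, Aurora, and Salesforce data.
import boto3

sql = """
SELECT c.customer_name,
       o.order_total,
       ch.channel_name,
       cam.name AS campaign_name
FROM customers_db.public.customers  AS c                                      -- Aurora PostgreSQL
JOIN customers_db.public.orders     AS o   ON o.customer_id  = c.customer_id
JOIN channels_db.public.channel     AS ch  ON ch.customer_id = c.customer_id  -- DynamoDB
JOIN salesforce_db.public.campaign  AS cam ON cam.contact_id = c.customer_id  -- Salesforce
"""

boto3.client("redshift-data").execute_statement(
    WorkgroupName="analytics-wg", Database="dev", Sql=sql,
)
```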

Harnessing Agentic AI with Redshift MCP Server: Natural Language Data Analysis

So, this solves the first problem. Let's go back to our use case. Now you have a simplified ETL process that can get data from different sources and load it into a centralized data warehouse. But the problem isn't completely solved yet. We've only solved the first part. The second part is that we need to enable marketing analysts to operate on this data and generate insights quickly. We can now get data faster, but can they leverage this data? The answer is no.

Thumbnail 1370

The reason is that with the current process they follow, they contact your team and say,

"Hey, I need data for these customers who are participating in this game. I want to create a personalized promotion plan for them. Please give me the data." And then your team runs a series of SQL queries across all these data sources to pull out the data, and then they can work on the promotion plan. Is this real-time? The answer is no. There are still humans in the loop, processing this data and providing all the information to the marketing team.

Thumbnail 1410

Thumbnail 1420

So, how can we solve this problem? What if a marketing analyst could ask a question in natural language, and the data warehouse understood that question, automatically ran queries in the background, and generated the output in natural language? Then they wouldn't have to go through your team or the IT team to run all the queries, and the delays in the promotion process would be eliminated. Let's see how we can achieve this using Amazon Redshift Serverless. How many of you know what MCP is? Raise your hands. Okay, MCP is a buzzword right now. I'm sure you all know, but simply put, it's a standardized protocol that lets LLM-powered applications talk to external tools and data sources.

Thumbnail 1460

Amazon Redshift launched the Redshift MCP Server in June 2025. And since then, we have seen great adoption. What's happening behind the scenes? For example, if a marketing analyst asks a question in natural language from a frontend tool or some client, such as Amazon Q, Claude, Visual Studio Code, or Claude Desktop, the natural language prompt is sent to an LLM on Amazon Bedrock. The LLM then says, "Hey, to answer this question, I need X, Y, Z tools, and I need to execute them in a specific order to address this question." The LLM identifies the tools to solve your problem and is the brain that orchestrates the tools.

Thumbnail 1490

Thumbnail 1500

And it responds to the client, saying, "Hey, this is the set of tools I need to use, and this is the order I need to execute them in. And that will help solve the problem." The client then uses the Redshift MCP Server API to make calls to the data warehouse. And the data warehouse executes those tools and calls and returns the data to the end client. And the end client displays it to the marketing analyst. All of these things are happening behind the scenes. As a marketing analyst, you don't have to worry about it. The client, Amazon Bedrock, Redshift, and the Redshift data warehouse all work together to address your natural language prompts. It can be simplified like this. We will do a more detailed demo of how this actually works.
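To make that loop tangible, here is a compressed sketch of the tool-use handshake using the Amazon Bedrock Converse API. In practice the MCP client and the Redshift MCP Server handle all of this for you; the model ID, tool schema, and prompt below are assumptions for illustration.

```python
# Compressed sketch of the tool-use loop an MCP client runs behind the scenes,
# shown with the Bedrock Converse API. The Redshift MCP Server actually exposes
# tools like execute_query / list_databases; model ID and schema are assumptions.
import boto3

bedrock = boto3.client("bedrock-runtime")

tool_config = {"tools": [{
    "toolSpec": {
        "name": "execute_query",
        "description": "Run a SQL query against the Redshift data warehouse.",
        "inputSchema": {"json": {
            "type": "object",
            "properties": {"sql": {"type": "string"}},
            "required": ["sql"],
        }},
    },
}]}

messages = [{"role": "user",
             "content": [{"text": "Who are our top 10 customers by purchases?"}]}]

resp = bedrock.converse(
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",  # any tool-capable model
    messages=messages,
    toolConfig=tool_config,
)

# stopReason 'tool_use' means the LLM picked a tool and an execution order;
# the client would now call the Redshift MCP Server and feed the result back.
if resp["stopReason"] == "tool_use":
    for block in resp["output"]["message"]["content"]:
        if "toolUse" in block:
            print(block["toolUse"]["name"], block["toolUse"]["input"])
```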

Thumbnail 1540

Thumbnail 1550

Since we are all technologists, we need to understand a bit more than just the final output. So, I am in Amazon Q, but you can use any MCP client. Let's first ask what tools are available. And we got the results. I'll pause and play the recording so you can take a good look at the information. First, you see the built-in tools, and below that, you see the Redshift MCP-specific tools: execute query, list databases, schemas, tables, and so on. Various tools are available. While we're doing this demo, pay attention to which tool is being recommended. That way, you can guess, and then see if your guess is right or wrong. It'll be a fun little activity.

Thumbnail 1600

Thumbnail 1610

Next, I want to identify what clusters I have in my account. So I asked, "Please show me all available Redshift clusters." And look, the LLM says that list clusters is the right thing for you and asks the agent to execute list clusters. It ran the list clusters tool and gave me an output that there are two clusters. One is an analytics cluster, and the second is a marketing cluster. And it gave me all the information as a snapshot: that both are serverless, their status, endpoints, whether they are publicly accessible, encryption details, and so on.

Thumbnail 1640

Thumbnail 1650

Thumbnail 1660

Thumbnail 1670

Great, now we know the clusters. Next, we need to identify what databases, tables, and schemas exist. So let's ask that question. Here I ask, "What databases and tables are available in the analytics cluster?" This is my prompt. The LLM understands that the user is asking about databases and calls list databases, then list schemas, and then list tables in that order. It might call multiple times depending on the number of schemas. This is the orchestration happening behind the scenes. And we get a list of all available databases, schemas, and tables in the analytics cluster. Looking at the output, we see there's a dev database, a public schema, and within that schema, there are customer and orders tables.

Thumbnail 1700

Thumbnail 1710

Thumbnail 1720

Okay, great. Next, I want to understand what elements are in the customers and orders tables. Again, you can ask in natural language. Can anyone guess which tool will be called? list columns, right? When I ask, "Show me the structure of the customers and orders tables," it translates to the list columns tool, retrieves all the information behind those tools, and executes twice: once for the customer table and once for the orders table, giving me all the metadata for both tables. Great. So far, we've identified clusters, databases, tables, and columns.

Thumbnail 1750

Thumbnail 1760

Thumbnail 1770

Now, put on your marketing analyst hat. You don't need to know anything about these things. All you care about are your top 10 customers and their purchasing patterns. So, let's ask that question in natural language and see what happens behind the scenes. Again, pay attention to which tools are called and how many times they are called. When I ask, "Analyze customer purchasing patterns and tell me the top 10 customers and their purchasing frequency," it performs an analysis. It calls execute query once, and execute query a second time: once to get the top 10 customers and a second time to get the purchasing patterns. It returns the results and even provides insights in a summary. So, as a user, you don't have to run any kind of query, nor do you have to ask your IT team to do the work. Everything happens in real-time, allowing you to deliver promotional content at the very moment the game is happening.
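For reference, the SQL the agent generates behind that execute query call would look roughly like this; the table and column names are assumptions based on the customers/orders schema shown earlier.

```python
# Roughly the kind of query the agent runs for "top 10 customers and their
# purchasing frequency". Table and column names are assumed from the demo schema.
sql_top_customers = """
SELECT c.customer_id,
       c.customer_name,
       COUNT(o.order_id)  AS purchase_count,
       SUM(o.order_total) AS lifetime_value
FROM   public.customers AS c
JOIN   public.orders    AS o ON o.customer_id = c.customer_id
GROUP  BY c.customer_id, c.customer_name
ORDER  BY purchase_count DESC
LIMIT  10;
"""
```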

Now, to summarize everything, the first problem was simplifying the ETL pipeline. We solved that problem by using Zero-ETL to simplify it. And the second problem was enabling the marketing team to be self-sufficient. We achieved this using Redshift MCP. So, the problems we started with have been solved. The CEO is happy, you are happy, and we are happy.

Thumbnail 1810

Roche Pharmaceuticals' Challenge: Modernization through a Three-Dimensional Framework of People, Process, and Technology

So, with that happy thought, I'm going to hand it over to Yannick to talk about how Redshift solved Roche's analytics needs. Thank you very much.

Thumbnail 1820

Hello everyone, good morning. My name is Yannick Misteli. I am the Head of Engineering at Roche Global Pharma Strategy. For those of you who are not familiar with Roche, we are a global leader in the pharmaceutical industry. We operate in over 80 countries, and our headquarters are in Switzerland. I am also from Switzerland, and we have a history of innovation spanning over 130 years.

At Roche, I lead a team of approximately 150 engineers, who are distributed across LATAM, EMEA, and APAC. We support Roche's go-to-market domain. In the pharmaceutical industry, go-to-market is the engine that connects our science and products to the real world. My team works very closely with the sales, marketing, and digital teams, and these teams engage with physicians, hospitals, and the broader ecosystem.

Thumbnail 1880

Now, when we started five years ago, we embarked on modernizing this engine. We had three big ambitions, and of course, some hurdles came with them. The first one was around deep customer insights. We really wanted to understand our customers and have a 360-degree view, but the reality was that all the data was fragmented everywhere, and we couldn't get this view.

The second one was that we really wanted to speed up and accelerate the time to value. But we were constrained by legacy infrastructure, making it very difficult to launch new things and impossible to achieve the speed we desired. Finally, we wanted to achieve a global innovation system that could scale local successes globally, but the technology was so fragmented that scaling was impossible. Also, the mindset wasn't aligned, and we kept reinventing the wheel everywhere.

Thumbnail 1960

To understand why we couldn't scale, this is a good illustration of the technology landscape we faced five years ago. As you can see, we had a giant Oracle cluster in EMEA, a large Hadoop cluster in LATAM, and a patchwork of SQL Server and MySQL instances everywhere. We were trying to glue them together with legacy tools like Informatica, Talend, and Alteryx. But perhaps the biggest challenge sat at the center of all this: because there were so many disconnected systems, business users resorted to Excel, probably the most popular distributed database in the world.

Thumbnail 2000

So we realized that our data legacy problem wouldn't be solved by technology alone. We introduced a framework that looked at three dimensions: people, process, and technology. On the people side, we leaned heavily on Conway's Law, which states that technical architecture reflects organizational structure: if teams are siloed, you end up with data silos. On the second aspect, we wanted to foster a more global mindset. I wanted my team to think globally and act locally.

On the process side, we needed to define global standards, but of course, maintain local agility. This is not rigid centralized control. Rather, it is about establishing a common language that enables scale. On the technology side, this is, of course, the enabler. Switching to and consolidating on AWS Cloud dramatically simplified our tech stack. And of course, the more you consolidate, the more scalability you need. And that's precisely where AWS shines.

Thumbnail 2090

Organizational Transformation and the Shift from ETL to ELT: Team Restructuring and DevOps Culture Infusion

Returning to the topic of people, when I think of 150 engineers, our operating model five years ago reminded me very much of children's soccer. That is, everyone chasing the ball. Everyone goes to where the ball is, and the goal is left wide open, right?

We needed a more professional structure with clear positions. That's how we organized ourselves. Data engineers as defenders to build stability, analytics engineers to distribute the ball, and data analysts as strikers to score goals and derive business value.

Talent alone doesn't win games. You need a team. That's why we established cross-functional teams linked to regions. This created a matrix structure, which allows us to keep the teams highly aligned yet loosely coupled. Highly aligned because we ensure strong standards and a common tech stack within our capabilities. But loosely coupled because we give these cross-functional teams the flexibility to decide what to build.

Thumbnail 2170

On the process side, we also needed to provide these teams with a common way of working. This common way of working is built upon three pillars. DevOps is one of the most important, and also a very big mindset shift. When we started, I remember my boss came back and asked me, "Yannick, when are you going to hand over to the operations team?" I said, "There's no handover. We build it, we operate it." That's DevOps: development and operations.

If you think about what the main benefit of DevOps is, it's very simple. If you know you have to operate it yourself, you think twice about how you build it. That's the big benefit of DevOps for me. Of course, we followed a lot of automation. Everything is code, which enabled us to achieve both the speed and quality we needed.

The second pillar was switching to Agile. We unified on two-week sprints. This was also a big mindset shift for the business side. I remember the first sprint, the business side came back to me mid-sprint and said they wanted to change all the sprint goals. I told them, "No, we're not going to change all the sprint goals." They looked at me surprised and said, "But aren't we Agile?" We had to educate them that Agile is not chaos. It's actually about being very disciplined about delivery, and you need to have a good plan.

And the last pillar is transparency, which you need when leading a large organization. We truly unified project management. We consolidated tracking and project management into a single source of truth, because otherwise, you can't run a portfolio of that size. I also like the quote here: If you want to build great products, you need great people. If you want to attract and retain great people, you need great principles.

Thumbnail 2310

From a technology perspective, the biggest shift was really moving from ETL to ELT. In the old legacy on-premise world, storage was very expensive, and compute was scarce. The way we did it was to extract data, transform it, and then load it into the Oracle system. That was the way it was done, but AWS and the cloud completely changed that economy. S3 is cheap, and Redshift has vast compute available. So we flipped the principle from ETL to ELT, staging all data first and then processing it in Redshift.
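In code, the flipped pattern looks roughly like the sketch below: land raw files cheaply in S3, COPY them into a staging schema, and do the T inside Redshift. Bucket, role, schema, and workgroup names are hypothetical, and Roche orchestrates the transformation step with dbt rather than hand-run statements.

```python
# Sketch of the flipped E-L-T pattern: stage everything first, then transform
# in Redshift. Bucket, IAM role, schema, and workgroup names are hypothetical.
import boto3

rsd = boto3.client("redshift-data")

load_sql = """
COPY staging.crm_interactions
FROM 's3://raw-landing-bucket/crm/2025/12/01/'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
FORMAT AS PARQUET;
"""

transform_sql = """
-- The 'T' runs in the warehouse, close to the data (Roche drives this via dbt).
CREATE TABLE analytics.daily_interactions AS
SELECT customer_id,
       DATE_TRUNC('day', interaction_ts) AS interaction_day,
       COUNT(*)                          AS touches
FROM staging.crm_interactions
GROUP BY 1, 2;
"""

# Data API calls are asynchronous; a real pipeline would wait for each
# statement to finish before submitting the next.
for sql in (load_sql, transform_sql):
    rsd.execute_statement(WorkgroupName="example-wg", Database="dev", Sql=sql)
```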

What are the big benefits? There are two. First and foremost, speed. If a new business request comes in, it's highly likely that this data is already staged in the data lake, so we can immediately start delivering insights and analytics. The second is quality. We can really separate concerns here: EL and T. Data engineers can focus on the extraction and loading part, and analytics engineers can focus on the transformation part.

This again ties back to the people aspect and how you organize your teams. We can parallelize that work, and people focusing on specific areas can become very skilled in the technologies they use and the work they do. This, of course, brings you the agility and quality you desire.

Thumbnail 2420

Roche's Data Platform: Large-Scale Operation Integrating 300 Data Sources and Processing 3 Million Queries Daily

So, this is our architecture. It's a bit of a messy diagram, but this is how we manage over 300 sources that feed into our data platform. It follows a very simple principle: ingest, process, and serve. On the ingestion side, we are big customers of AWS AppFlow. AWS AppFlow is great for SaaS applications because it's an AWS managed service, so you just connect to your SaaS application, and it takes care of everything. Now, for non-SaaS data sources like DataSUS or PubMed, we needed a slightly more customizable approach. For that, we use Lambda and Glue. So, we've built a great framework around these two services to ingest data.

We also use AWS Transfer Family because we receive a lot of CSV files from providers. Now, AWS Transfer Family is the perfect solution there. It connects to S3. A Glue crawler crawls new files in S3 and registers them in the Glue Data Catalog. So, once a file is delivered, we can query it immediately. Of course, the best data onboarding is the one you don't have to do. So, for that, we also use Redshift data sharing with some partners. And we are switching to Zero-ETL wherever possible. We have already switched our internal Aurora databases to Zero-ETL.

Regarding the process part, as I mentioned, everything is done in Redshift. We use DBT to orchestrate all these workloads, and DBT helps us version control all our business transformations. So, there's a lot of transparency about what's going on. On the consumption side, we try to pick the right tools for the right use cases. Tableau has many curated dashboards, and it's really good for that. But we also use ThoughtSpot for providing self-service BI. And this natural language interface that ThoughtSpot provides has greatly helped us scale. We are very happy with the high adoption of ThoughtSpot on the business side.

Thumbnail 2600

Ultimately, we're taking it a step further, because what we're doing is writing insights back to Salesforce. So we're bringing data in, making it actionable, and we want to meet the business where the business lives. So we're writing it back to the Salesforce instance on the business process that supports it. Now, I talked about ELT, and in that ELT world, actually one of the most important services is Redshift Spectrum. We also call it the bridge, because it bridges the world of the data lake and the world of the data warehouse.

And Redshift Spectrum has three aspects that are important to us. The first is that you can query data in the data lake on the fly. That means as soon as new data arrives, you can query it. Spectrum enables that. The second is cost optimization. You can store vast amounts of data in S3 that you don't need to bring into Redshift. So you can decide what to keep in Redshift and what to keep in your data lake. You can find the right balance with hot and cold data.

Thumbnail 2680

And finally, and most importantly, it's a seamless experience for analysts. They can combine data in Redshift and the data lake within a single query. And I think this will become even more powerful in the future with standards like Iceberg; in theory, you could query data residing in Snowflake or Databricks Iceberg tables and combine it all.
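As a sketch of that bridge: registering the Glue Data Catalog as an external schema is a one-time statement, after which lake and warehouse tables join freely. The catalog, role, and table names below are hypothetical.

```python
# Sketch of the Spectrum "bridge": mount the Glue Data Catalog as an external
# schema, then join lake data with warehouse data in one query. All names are
# hypothetical placeholders.
spectrum_sql = """
CREATE EXTERNAL SCHEMA IF NOT EXISTS lake
FROM DATA CATALOG
DATABASE 'raw_files'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-spectrum-role';

-- Hot data in Redshift joined with cold data queried in place on S3:
SELECT w.account_id,
       w.segment,
       COUNT(l.event_id) AS lake_events
FROM analytics.accounts   AS w
JOIN lake.provider_events AS l ON l.account_id = w.account_id
GROUP BY w.account_id, w.segment;
"""
```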

We are not just using Redshift for numerical calculations; we also process a lot of semi-structured data, which means a lot of free text flows in from our CRM systems. Now, we are big fans of Lambda UDFs. What we've built here is a Lambda UDF that uses AWS Translate, and it allows us to centrally translate the free text coming from our CRM in diverse languages into English. This lets us perform millions of translations in SQL within the database, without having to build complex data pipelines. So, Lambda UDFs really give you superpowers. Another thing we love is the Redshift Bedrock LLM integration. These are actually related use cases, because we run the translated text through a Bedrock LLM within Redshift to detect adverse events that shouldn't be going into our CRM system, and we flag those. Basically, it's a very simple integration that allows you to run LLMs at scale against millions of records.
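A minimal sketch of what such a translation UDF could look like, assuming Redshift's Lambda UDF batching protocol (an "arguments" array in, a same-length "results" array out). The function, role, and table names are hypothetical.

```python
# Sketch of the translation Lambda UDF pattern. Redshift batches rows into a
# JSON payload with an 'arguments' array; the function must return a 'results'
# array of the same length and order. Function and role names are hypothetical.
import json
import boto3

translate = boto3.client("translate")

def handler(event, context):
    try:
        results = []
        for (text,) in event["arguments"]:          # one entry per input row
            if not text:
                results.append(None)
                continue
            out = translate.translate_text(
                Text=text[:5000],                   # trim to stay under the per-request limit
                SourceLanguageCode="auto",
                TargetLanguageCode="en",
            )
            results.append(out["TranslatedText"])
        return json.dumps({"success": True, "results": results})
    except Exception as exc:
        return json.dumps({"success": False, "error_msg": str(exc)})

# Registered in Redshift with something like:
#   CREATE EXTERNAL FUNCTION f_translate_to_en (varchar)
#   RETURNS varchar STABLE
#   LAMBDA 'crm-translate-udf'
#   IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-lambda-role';
# Then used inline: SELECT f_translate_to_en(note_text) FROM crm.notes;
```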

Thumbnail 2760

Then there's one battle we realized we couldn't win: the spreadsheet battle. Since we couldn't win it, we decided to embrace it. What we did was build a Google Sheets add-on that connects to Redshift. How did we achieve that? We used a service called the Redshift Data API. It's truly amazing; it doesn't require any drivers or complex setup. It's basically an HTTPS interface to Redshift. So, we built a JavaScript add-on for Google Sheets that solved two problems. First and foremost, business stakeholders can load the latest data. This is all governed, so they can have the latest data within the spreadsheets they are familiar with. The second aspect is that you can also write back to Redshift. This allows us to make local knowledge globally available, which is great, because this data can be reference data or similar, and it can be used directly in dashboards or downstream data pipelines.
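A small sketch of that driverless pattern: submit a statement over HTTPS, poll until it finishes, then page through the results. The cluster, user, and query below are hypothetical, and the actual add-on does this from Apps Script JavaScript rather than Python.

```python
# Sketch of the Data API pattern behind the Sheets add-on: a plain HTTPS API,
# no JDBC/ODBC drivers. Cluster, user, and SQL are hypothetical placeholders.
import time
import boto3

rsd = boto3.client("redshift-data")

qid = rsd.execute_statement(
    ClusterIdentifier="gtm-cluster",
    Database="analytics",
    DbUser="sheets_addon",
    Sql="SELECT region, SUM(revenue) FROM sales.summary GROUP BY region;",
)["Id"]

# Poll until the statement settles; statuses end in FINISHED/FAILED/ABORTED.
while rsd.describe_statement(Id=qid)["Status"] not in ("FINISHED", "FAILED", "ABORTED"):
    time.sleep(1)

for row in rsd.get_statement_result(Id=qid)["Records"]:
    print([list(col.values())[0] for col in row])  # each cell: {'stringValue': ...} etc.
```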

Thumbnail 2840

Here are some of my lessons learned from our modernization over the past five years. First, reimagine, don't just re-platform. Don't settle for a lift and shift; truly rethink how things should work in the cloud, like we did with the ETL-to-ELT switch. If you want to leverage the cloud and AWS to their fullest, I think that rethinking is very important. Second, prioritize business value over technology. We try to work backward from strategic objectives like customer 360 and time to value, and all technical decisions should be derived from that. Also, treat modernization as a cultural shift, not just a project. I've talked about people, process, and technology; I think it's very important to address all three dimensions.

Also, building frameworks is important. It shouldn't be a cage, but you need to build standards. You need to have standards, and finding the right ones that you can follow is difficult but important. Invest as much in how you work as in your tech stack. Because even the best and greatest technology is worthless if you don't know how to use it. So, it's really important to invest heavily in the upskilling part as well. And finally, be obsessed with developer experience. Because ultimately, you want to achieve a great customer experience, and for that, you need good products. To build good products, you need happy developers. That's why developer experience is really important to me.

Thumbnail 2950

Now, what has this brought us? We can say that we are now operating at a massive scale. As I mentioned earlier, we have 300 different data sources. We have been able to decommission five legacy platforms. And the fact that we are running 3 million Redshift queries per day also gives you some indication of the scale we have reached. Now, what have we gained from a business perspective? Again, we have finally achieved what we were looking for: a very fast time to insight. We can now build new data solutions in days, not months. At the same time, we have finally achieved the deeper customer understanding we sought, because we are connecting these 300 data sources. So, not just ingesting them, but also connecting them.

We also reorganized 150 engineers, and what's important to me is that we built a data platform that they actually want to use. So, here's a strong foundation for future innovation, enabling us to deliver what patients need next. Thank you very much.

Thumbnail 3060

How to Start Your Migration: Tools, Talent, Programs, and Next Steps Provided by AWS

Thank you, Yannick. That was excellent. Yannick has been a great partner for us, especially for the service team, helping us significantly improve the product. So, what Roche has been able to achieve with Redshift and related technologies is truly phenomenal. I hope that both Satesh and Yannick have provided enough data points to help you start your journey in data warehouse and analytics migration and modernization.

As a next step, you're probably wondering how to get started. Everyone knows that migration projects are not easy. They take time, effort, and resources. So, here's the good news. As of now, AWS has already helped migrate over 1.5 million databases to AWS databases. So, we have a pretty well-established process for how to help you migrate from your existing legacy infrastructure to the cloud, Redshift, and related technologies.

If you want to start, we can provide resources in three key areas. Starting with tools and technology, we have migration tools like AWS Database Migration Service and the AWS Schema Conversion Tool, which can automate many of these conversions. You saw Yannick talk about converting and moving many data objects to Redshift, and you can automate much of that work using these two tools.

From a talent perspective, we also provide many resources. If you want to start with a proof of concept or a migration pilot, you can work with our professional services team and partners, and they can help you define the scope of the project. They can help you start this journey. And finally, we also offer many migration programs. For example, the Migration Acceleration Program provides incentives and credits to offset the costs of running these systems concurrently as you migrate from a source system to a target system. So, various tools and resources are available to help you get started.

Thumbnail 3210

Finally, I want to introduce a number of resources related to Redshift. If you want to learn more about Redshift's new features and capabilities, please visit the Redshift website; the link is there. In addition to Roche, you can see success stories from many customers across a wide range of industries, including financial services, healthcare, gaming, software, and internet. We are very committed to self-service, hands-on-keyboard enablement, so many blogs, tutorials, and demos are available; the QR code is there as well, and there are also books you can pick up to learn more about the technology.

Thumbnail 3260

We also have a LinkedIn group, so you can sign up to receive updates on new data analytics related announcements. We have various resources available to help you learn about Redshift and get started with Redshift and other analytics services.

Finally, I want to leave you with four key takeaways. First and foremost, the fuel that makes generative AI and agentic AI shine is truly your data. To extract value from AI and generative AI, you really need a strong data foundation, and what we've seen is that Redshift provides the best capabilities in this area: enterprise-grade reliability, scalability, and availability, Model Context Protocol integration, data integration, platform scaling, and the ease of use of serverless. All the capabilities you've seen are available to help you start your journey.

Thumbnail 3340

Currently, tens of thousands of customers like Roche are already using Redshift to modernize their data platforms and leverage these AI and generative AI capabilities. And we hope you will use the tools and resources we've shared to begin your own journey. Thank you very much for joining us today. If you'd like to speak with any of the speakers, we have left our contact information here, and some of us will be here after the session, so please feel free to stop by and ask any questions you may have. Thank you very much for attending the session, and please enjoy the rest of the conference. Thank you.
