Translated by AI
re:Invent 2025: iTTi's Hybrid Data Mesh Blueprint and 45% Cost Reduction with Amazon SageMaker
Introduction
By transcribing various overseas lectures into Japanese articles, this project aims to make hidden, high-quality information more accessible. The presentation we are focusing on this time is as follows!
For re:Invent 2025 transcription articles, information is compiled in this Spreadsheet. Please refer to it as well.
📖 re:Invent 2025: AWS re:Invent 2025 - iTTi's Cross-Company Data Mesh Blueprint with Amazon SageMaker (ANT342)
This video presents the case of itti Grupo Vázquez, a Paraguayan holding company that built a hybrid data mesh on AWS spanning more than 20 companies and over 300 terabytes of data. Facing a single point of failure in the traditional centralized architecture and the high cost of a fully decentralized approach, they adopted a hybrid approach combining EMR on EKS and SageMaker Unified Studio, built around their self-service platform Dream Corp. Migrating to EMR on EKS and introducing Karpenter improved cluster utilization from 35% to 80%, achieving a 45% cost reduction. Governance with the SageMaker Lakehouse Catalog and Lake Formation tags made new-hire onboarding 75% faster and cut data-access wait times from weeks to immediate. Three lessons learned were shared: declarative governance, dynamic balance, and a focus on culture.
- This article is automatically generated, largely maintaining the content of the existing lecture. Please note that there may be typos or inaccuracies.
Main Content
Data Mesh Concepts and itti's Hybrid Approach: A Holding Company's Challenge Handling 300 Terabytes of Data
Hello everyone, thank you for joining us for this session today. Welcome. My name is Diego Ortiz. I'm a Senior Data Strategy Solutions Architect at AWS. And I'm very excited today to speak with Julieta from itti. My name is Julieta Ansola. I'm a Senior Cloud Engineer at itti. And today we're going to talk about how itti, a company that Julieta is going to explain more in depth later, built their hybrid data mesh using Amazon SageMaker. So let's get started.
Before we dive deep into how they built this solution, let me give you a little bit of background. Across the world, we have seen customers trying to become data-driven organizations, but they are also facing several challenges in the process. Business units are constantly trying to use data in the way they think is best for their business. But when multiple business units are doing the same thing in their own ways, this results in a fragmented data landscape and data silos, which can hinder data sharing and data innovation.
This can also pose security and compliance risks, as there is no unified data governance strategy in place. Another challenge we see customers facing is that in such complex data environments, there's no unified mechanism to discover data assets. So, business units will be constantly asking the central data governance team to provide them access and to make data discoverable and usable. So, all these challenges combined ultimately make the goal of becoming a data-driven organization a distant dream.
So, customers have started looking into the concept of a data mesh as a solution to all these challenges. But first and foremost, what is a data mesh? It is a data architecture pattern in which data domains map to specific areas of the business: each business unit is treated as an independent domain, interconnected with the rest of the organization through a mesh for consuming data.
Data mesh also has several benefits. For example, these business units can leverage existing investments and integrate them into the mesh. Another benefit is improved data governance by tailoring all policies to the specific needs of these business units. It also features a business data catalog, which is a centralized mechanism for data discovery and data sharing, thereby enhancing self-service data sharing across all these business units.
Ultimately, a data mesh allows these business areas to leverage data and start innovating with it. Now, let's look at how itti built its own version of a data mesh on AWS. itti Grupo Vázquez operates a multi-industry holding company in Paraguay, with over 20 companies across 5 business sectors, generating more than 300 terabytes of data from various systems.
Our challenge was clear from the start: how to enable autonomous data product development while maintaining enterprise-grade governance. Each company moves at a different speed, has different compliance frameworks, and different innovation cycles. Some companies require real-time streaming analytics, while others need audit-compliant batch processing with full audit trails. The traditional hub-and-spoke architecture was not scalable for us, but implementing a full mesh topology would lead to an explosive increase in total cost of ownership. So, we needed something different, a hybrid approach.
Now, let's look at the architectural trade-offs. In such a centralized pattern, all producer and consumer teams go through a central data platform team. This data platform team manages all the infrastructure, which creates a secure environment for data governance, but at the same time, it creates a single point of failure at the center of the pattern. Our sprint velocity was repeatedly blocked by dependencies on the data platform team.
When we shifted to a decentralized pattern with data sharing across different domains, we immediately faced challenges like data duplication, inconsistent schema evolution, and, of course, infrastructure overhead. Our solution was to adapt Dream Corp, our self-service data-ingestion platform built entirely on AWS. It started as a centralized platform but evolved into a federated architecture. The production layer remains centralized: we migrated from self-managed clusters to EMR on EKS, and we adopted SageMaker Unified Studio for the decentralized consumption layer to democratize ML and analytics capabilities. This approach allowed us to transform our operating model without disrupting all 20 companies across the organization.
This is our taxonomy. It is not just a diagram: we encoded it into IAM policies and network topology. There are three main domains. The central data domain is our nervous system, and we also have the digital ecosystem domain and the financial ecosystem domain. These domains are isolated in dedicated AWS accounts, and our central data lake in the central data domain has its own S3 prefix. We use Lake Formation for data sharing between accounts. So, for example, if someone from the payments team needs data from the central domain, they simply subscribe to it in the business data catalog. This gives us autonomy without chaos.
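As a rough illustration of the cross-account sharing just described, the sketch below builds the request shape that Lake Formation's `grant_permissions` API expects for tag-based access control. The account IDs, tag key, and tag values are hypothetical placeholders, not itti's actual configuration.

```python
# Hypothetical sketch: sharing central-domain tables with a consumer account via
# Lake Formation tag-based access control (LF-TBAC). All IDs and tag names below
# are illustrative placeholders.

CENTRAL_ACCOUNT = "111111111111"   # central data domain account (placeholder)
PAYMENTS_ACCOUNT = "222222222222"  # consumer account in a business domain (placeholder)

def build_lf_tag_grant(central_account: str, consumer_account: str,
                       tag_key: str, tag_values: list[str]) -> dict:
    """Build the kwargs for lakeformation.grant_permissions() that share every
    table carrying the given LF-tag with the consumer account."""
    return {
        "CatalogId": central_account,
        "Principal": {"DataLakePrincipalIdentifier": consumer_account},
        "Resource": {
            "LFTagPolicy": {
                "CatalogId": central_account,
                "ResourceType": "TABLE",
                "Expression": [{"TagKey": tag_key, "TagValues": tag_values}],
            }
        },
        "Permissions": ["SELECT", "DESCRIBE"],
    }

grant = build_lf_tag_grant(CENTRAL_ACCOUNT, PAYMENTS_ACCOUNT, "domain", ["payment"])
# With credentials in place, this would be sent as:
#   boto3.client("lakeformation").grant_permissions(**grant)
```

Because the grant is expressed over tags rather than individual tables, any table later tagged `domain=payment` is shared automatically, which is what makes the "subscribe to it in the catalog" experience possible.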
Implementation with EMR on EKS and SageMaker Unified Studio: A Technical Strategy that Achieved 45% Cost Reduction and Instant Data Access
From a technical perspective, to gain flexibility, we migrated from traditional EMR clusters to EMR on EKS. Previously, we ran one massive persistent EMR cluster for Spark job submission, but jobs were actually running for only about four hours a day. That meant an average utilization of 35%: we were paying for 100% of the infrastructure while using only one-third of it. Job isolation existed only at the process level, so if one Spark job ran wild, it would take down other jobs with it. Scaling time was also painful: with traditional clusters, it took about 15 to 20 minutes for new nodes to join the cluster.
Currently, each business domain runs in an isolated EKS namespace, so job isolation now works at two levels: EKS namespaces provide a hard boundary, and pod-level security settings isolate individual workloads. EMR virtual clusters are mapped one-to-one to these business domains and isolated by their namespaces. We also use Spot Instances for cost savings: our compute mix is 70% Spot Instances and 30% On-Demand Instances. Additionally, for memory-intensive workloads, we are migrating some instances to Graviton3. But the game changer for us was Karpenter: it provisions nodes within 60 seconds, which gave us the elasticity we needed.
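To make the per-domain setup concrete, here is a sketch of the request shape for submitting a Spark job to one domain's EMR on EKS virtual cluster via the `emr-containers` `start_job_run` API. The virtual cluster ID, role ARN, S3 path, and release label are placeholders, not values from the talk.

```python
# Illustrative sketch: submitting a Spark job to a domain's EMR on EKS virtual
# cluster (each virtual cluster maps to one EKS namespace). IDs, ARNs, and paths
# are placeholders.

def build_spark_job(virtual_cluster_id: str, role_arn: str, entry_point: str) -> dict:
    """Build the kwargs for emr-containers' start_job_run()."""
    return {
        "VirtualClusterId": virtual_cluster_id,
        "ExecutionRoleArn": role_arn,
        "ReleaseLabel": "emr-7.2.0-latest",
        "JobDriver": {
            "SparkSubmitJobDriver": {
                "EntryPoint": entry_point,
                "SparkSubmitParameters": "--conf spark.executor.instances=4",
            }
        },
    }

job = build_spark_job(
    "abc123",                                  # virtual cluster for one domain's namespace
    "arn:aws:iam::222222222222:role/emr-job",  # job execution role (placeholder)
    "s3://example-bucket/jobs/etl.py",
)
# boto3.client("emr-containers").start_job_run(**job) would launch it; Karpenter
# then provisions the backing nodes on demand within the namespace's boundaries.
```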
Speaking of numbers, we went from an average utilization of 35% with the old cluster to 80% with the current cluster. This translates to a 45% cost reduction by using shared infrastructure and optimized compute. Also, through Karpenter, we can handle traffic spikes almost in real-time thanks to pre-warmed node pools for predictable workloads. So, in conclusion, we are processing the same data at half the cost. This is a huge improvement for us. This is our production layer.
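As a back-of-the-envelope sanity check (not the talk's own accounting), the utilization jump alone implies a sizable saving: the same useful work at 80% utilization needs only 35/80 of the node-hours it needed at 35%.

```python
# Illustrative arithmetic only: relating the utilization jump to cost savings.
old_util, new_util = 0.35, 0.80

# The same useful work now needs only this fraction of the node-hours:
node_hours_ratio = old_util / new_util  # 0.4375

# Taken alone, that is roughly a 56% reduction, an upper bound; real savings
# also depend on instance pricing (the 70/30 Spot/On-Demand mix) and overhead,
# so the reported 45% figure sits plausibly below this bound.
cost_reduction_upper_bound = 1 - node_hours_ratio
print(f"{cost_reduction_upper_bound:.0%}")  # prints "56%"
```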
Before moving to the consumption layer, I'd like Diego to provide some context on SageMaker. Thank you, Julieta. Amazon SageMaker is the core of data analytics and AI on AWS. It mainly consists of three layers. At the very top is Unified Studio, an interface that provides access to various data capabilities, from SQL analytics and data processing to machine learning model development and even generative AI application development. There are also two other layers: data and AI governance, which I will explain in more detail later, and the open lakehouse.
So, let's dive deeper into this layer regarding data governance. There's one key component here, and that is the SageMaker Catalog. This is a business catalog. It's a central repository for technical and business metadata of various types of data products, from traditional datasets to machine learning models and generative AI applications. This service comes with a set of integrated features to have full visibility into your data products. For example, if you want to check the data quality of these data products, it's integrated into the catalog. You can also classify data products using metadata forms and business glossaries, and you can also see the full end-to-end data lineage of all your data products. Many other features have been added to this catalog.
Now, let's look at how itti built on top of this SageMaker Catalog to realize their data mesh platform. Well, as you said, I'll show you how we actually implemented it. If EMR on EKS provided elasticity, SageMaker Unified Studio gave us speed. We maintained Dream Corp as the backbone, which is where data pipelines, ingestion, and transformations take place. But we needed something for business users and data teams. A unified way for them to access and leverage the data they generate. That's where SageMaker Unified Studio became the missing piece in our hybrid mesh.
I mentioned that SageMaker Unified Studio gave us speed. Our users can now reach all their analytical tools from a single entry point. They can run simple Athena queries, launch SageMaker notebooks, and even build governed apps using foundation models and knowledge bases, and publish them to the SageMaker Lakehouse Catalog. This caters to projects from every domain, and the SageMaker Lakehouse Catalog organizes the different assets under proper governance. We also replaced the old ticket-based approval process with Lake Formation tags: we created this tag taxonomy in our central data catalog, which lives in a central account, and shared it across accounts using AWS services like RAM. This lets users access all machine learning, analytics, and generative AI tools from a single portal.
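The taxonomy itself can be defined centrally before it is shared out. Below is a sketch of the request shape for Lake Formation's `create_lf_tag` API; the tag keys and allowed values are invented examples, not itti's real taxonomy.

```python
# Sketch of defining a tag taxonomy in the central catalog. The request shape
# matches lakeformation.create_lf_tag(); keys and values are invented examples.

def build_taxonomy(central_account: str) -> list[dict]:
    """One create_lf_tag() request per tag key in a (hypothetical) taxonomy."""
    taxonomy = {
        "domain": ["central", "digital", "financial", "payment"],
        "sensitivity": ["public", "internal", "restricted"],
    }
    return [
        {"CatalogId": central_account, "TagKey": key, "TagValues": values}
        for key, values in taxonomy.items()
    ]

requests = build_taxonomy("111111111111")  # central account ID is a placeholder
# Each entry would be sent as boto3.client("lakeformation").create_lf_tag(**req);
# AWS RAM then shares the catalog resources with the domain accounts.
```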
I said SageMaker Unified Studio gave us speed. Here are the numbers. New-hire onboarding is 75% faster. Previously, as you probably know, when someone joined a team, they had to open tickets for VPN access, package installation, security approvals, and so on. Now, in just five minutes, they can spin up an environment from a SageMaker Unified Studio blueprint. We created the template in CloudFormation, so they simply create an environment, click to create a notebook, and start coding without waiting for IT. Data access, which was probably our biggest bottleneck, went from weeks to immediate.
The taxonomy and Lake Formation tags provide instant access. For example, if you are on the payment team and the data is tagged as payment, you can start querying immediately, without waiting. Now, let's share some lessons we learned on this journey.
The first is that governance is definitely not optional, but it must be declarative, not imperative. We abolished the old ticket-based approval process and replaced it with Lake Formation tags. As a result, we achieved governance at scale with autonomy, without human bottlenecks. The second is that balance is dynamic. There is no perfect formula for centralization versus decentralization, and that's okay. We started completely centralized because we needed to establish standards, and we released control as the domains proved their maturity. That worked well for us.
And finally, and perhaps one of the most important things, culture is just as important as technology. You can build the most efficient, even sophisticated architecture, but if people don't trust it, it will fail. So, our advice is to keep it simple and build adoption first. Well, that's all. If anyone has any questions, we are here. Please fill out the survey. Thank you for listening.
- This article was automatically generated using Amazon Bedrock, maintaining the information from the original video as much as possible.