Translated by AI
re:Invent 2025: Best Practices for Building Apache Iceberg-Based Lakehouse Architectures
Introduction
This project aims to make hidden, high-quality information more accessible by transcribing various overseas lectures into Japanese articles. The presentation we will cover this time is:
For re:Invent 2025 transcription articles, information is compiled in this Spreadsheet. Please check it as well.
📖 re:Invent 2025: AWS re:Invent 2025 - Best practices for building Apache Iceberg based lakehouse architectures on AWS
This video explains best practices for building Apache Iceberg-based lakehouse architectures on AWS. It introduces key Iceberg features such as ACID guarantees, time travel, and schema evolution, focusing on AWS Glue Data Catalog and S3 Table Buckets. Mike Araujo from Medidata shares a real-world example of achieving reduced latency and improved data consistency by transitioning from batch pipelines to streaming. Furthermore, it details three ingestion patterns: batch ETL, Change Data Capture, and high-concurrency streaming, along with optimization techniques for consuming Iceberg data in Athena, Redshift, and EMR. Advanced features like catalog federation, materialized views, and deletion vectors are presented to demonstrate concrete methods for building production-ready architectures at an enterprise scale.
This article was automatically generated using Amazon Bedrock, aiming to retain as much information as possible from the original video. Please note that it may contain typos or inaccuracies.
Main Content
The Data Lake Crisis and Solutions with Apache Iceberg
Hello everyone. Welcome to ANT343, best practices for building Apache Iceberg-based lakehouse architectures on AWS. My name is Purvaja Narayanaswamy, and I'm a Senior Engineering Manager for AWS Glue Data Catalog and AWS Lake Formation. Today, I'm joined by Mike Araujo, Principal Data Architect from Medidata, and Srikanth Sopirala, Principal Solutions Architect from the Data and AI Space at AWS.
We have a packed agenda for you today. First, we'll talk about the data lake crisis, what brought us here, the Apache Iceberg journey in the lakehouse, and how the AWS stack is coming together as an integrated framework that works with Iceberg-powered lakehouses. Then I'll focus on data ingestion, the production-ready architectural patterns that you can build on top of Apache Iceberg. Mike will cover the customer voice. You'll hear about Medidata's transformation journey to Iceberg. And Srikanth will focus on how open standards enable true interoperability, talking about data egress, multi-compute integration, and optimization in an Iceberg-powered lakehouse. Importantly, there are key takeaways that you can implement in production starting today.
The data lake crisis. So data lakes were good in theory, right? You had flexible storage, compute and storage were decoupled. But in production, it was chaotic. Data corruption was common. There was no way to roll back to a point-in-time snapshot if something went wrong. Queries were extremely slow. You were probably scanning terabytes to answer megabytes of questions. And schema evolution, forget about it. You're talking about rewriting petabytes of data now, and it was a real mess.
And that's where Apache Iceberg table format comes into its own. You have this smart metadata layer. You have detailed manifests, schema definitions, partition specs, all separate from the data files, and you have table-level snapshots. So this intelligent layering and layout, where the table-level snapshot points to a manifest list, and the manifest list tracks which manifests have changed, and the manifest file itself has all the file-level statistics, row counts, min-max values, null counts, and so forth. This layout gives you incredibly fast query performance.
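To make that metadata-driven pruning concrete, here is a minimal illustrative Python sketch (hypothetical names and structures, not Iceberg's actual implementation): an engine can compare a query predicate against the per-file min/max statistics recorded in a manifest and skip whole data files without ever opening them.

```python
# Illustrative sketch: per-file min/max statistics in a manifest let an
# engine prune data files before reading any Parquet bytes.
from dataclasses import dataclass

@dataclass
class DataFileEntry:
    path: str
    row_count: int
    min_value: int  # min of the filtered column within this file
    max_value: int  # max of the filtered column within this file

def prune_files(manifest, lower, upper):
    """Keep only files whose [min, max] range overlaps the predicate range."""
    return [f for f in manifest if f.max_value >= lower and f.min_value <= upper]

manifest = [
    DataFileEntry("s3://bucket/data/f1.parquet", 1_000_000, 0, 99),
    DataFileEntry("s3://bucket/data/f2.parquet", 1_000_000, 100, 199),
    DataFileEntry("s3://bucket/data/f3.parquet", 1_000_000, 200, 299),
]

# Predicate: WHERE col BETWEEN 150 AND 180 -> only f2 needs to be read
print([f.path for f in prune_files(manifest, 150, 180)])
```

Scanning a few kilobytes of statistics to eliminate terabytes of data files is exactly the effect described above.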
And you get five core benefits right away. First and foremost, you get ACID guarantees. You don't have to worry about data corruption in production because multiple writers can leverage optimistic concurrency control to update the table simultaneously without all the messy distributed locks and such. The second benefit you get is that every update creates these wonderful stable table-level snapshots. These snapshots are isolated and immutable, so they act as checkpoints, allowing you to roll back to a point-in-time snapshot or perform time-travel-style queries. This is great for debuggability, checkpointing use cases, and so forth.
And the third benefit is schema evolution, which is very elegant. Iceberg tracks columns by ID, so whether you want to add, update, or rename a column, it's just a metadata update. You don't have to do rewrites, repartitions, or anything like that, as with traditional systems. The fourth benefit is query performance. You don't have to do expensive scans. Your query engine can scan just a few kilobytes of manifests, thanks to this metadata layout I described, and know exactly which data files to read.
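To illustrate why tracking columns by ID makes a rename metadata-only, here is a hypothetical sketch (the dict-based schema and record layout are simplifications, not Iceberg's actual on-disk structures): data files store field IDs, and the table schema is just a mapping from IDs to current names.

```python
# Sketch: data files reference columns by field ID; the schema maps IDs to
# names. Renaming a column changes only the mapping, never the data files.
old_file_record = {1: "alice", 2: 30}   # field-ID -> value, written long ago

schema_v1 = {1: "username", 2: "age"}   # ID -> column name
schema_v2 = {1: "user_name", 2: "age"}  # rename: only this mapping changed

def project(record, schema):
    """Resolve a stored record against the current schema at read time."""
    return {schema[fid]: val for fid, val in record.items()}

print(project(old_file_record, schema_v1))  # {'username': 'alice', 'age': 30}
print(project(old_file_record, schema_v2))  # same file, new name, no rewrite
```

The same physical file serves both schema versions, which is why no rewrite or repartition is needed.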
And Iceberg also supports elegant row-level updates. Unlike traditional file layouts, where you might have to rewrite an entire partition, Iceberg has a more effective way to do it. Iceberg V2 supports equality deletes, which mark deleted rows by column value, and positional deletes, which mark rows by file and position, and the engine correctly applies these delete files along with the data files.
New Features in Iceberg V3: Variant Types, Deletion Vectors, Row-Level Lineage, and Default Values
And with Iceberg V3, those gaps are being closed even further. First and foremost, variant types. JSON is everywhere. Be it API responses, IoT payloads, event types. Now, the good thing is the engines are natively supporting variant types. Otherwise, you would have to do an ugly flattening where most of the column values would probably be null, or you'd have to convert it to a string and give up on query performance. Now, the good thing is internally you get the optimized columnar benefits for semi-structured data, and you get elegant queries that render it as JSON.
I talked about row-level updates, where the delete files are tracked as equality deletes or positional deletes.
Equality deletes or positional deletes can be a little expensive if you have a 10-million-row table and a lot of CDC-style deletions. In that case, you're looking at an expensive operation in terms of merging the delete files with the data files. That's where deletion vectors are a game-changer, because they just maintain a bitmap to track whether a row has been deleted. So you load this entire bitmap into memory, operate at memory speed, and apply it against the data files.
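The bitmap idea can be sketched in a few lines of Python (a deliberately naive sketch: production implementations use compressed roaring bitmaps, not a raw integer): one bit per row position, per data file, marking whether that row is deleted.

```python
# Sketch of a deletion vector: a per-data-file bitmap of deleted row
# positions, held in memory instead of many small positional-delete files.
class DeletionVector:
    def __init__(self):
        self.bits = 0  # Python int used as an arbitrary-length bitmap

    def delete(self, pos):
        self.bits |= 1 << pos

    def is_deleted(self, pos):
        return bool(self.bits >> pos & 1)

dv = DeletionVector()
for pos in (3, 1_000_000):  # mark two rows deleted, one far into the file
    dv.delete(pos)

live = [p for p in range(5) if not dv.is_deleted(p)]
print(live)  # [0, 1, 2, 4]
```

Checking a bit is a couple of CPU instructions, which is why applying deletes this way runs at memory speed.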
V3 also supports row-level lineage. Typically, lineage is tracked at the table or partition level. But with Iceberg V3, you get a provenance record for every row. You can track ingestion timestamp, source identifier, CDC database transaction identifier, so you can have full auditability information. So it means if you ingest bad data, you can have precise lineage to track where it came from.
The fourth benefit with V3 is the support for default values. This eliminates a huge operational pain point. Think of a scenario where you have a gold aggregator table, and you want to add a column value. In this case, you either have to do a historical backfill for all your old records, or you'll have inconsistent results where old records show null and new records show new values. Here, with V3, you have a way to specify default values at the schema level, so the engines can apply them to the old records, and your results are elegant and consistent.
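As a small sketch of that read-time behavior (hypothetical structures, not Iceberg's actual schema representation): old data files simply lack the new column, and the engine substitutes the schema-level default instead of requiring a backfill.

```python
# Sketch: a schema-level default fills in a column that was added after
# old records were written, so old and new records read consistently.
schema = {"columns": ["order_id", "region"],
          "defaults": {"region": "UNKNOWN"}}  # 'region' added with a default

old_record = {"order_id": 1}                  # written before 'region' existed
new_record = {"order_id": 2, "region": "EU"}

def read_record(record, schema):
    return {c: record.get(c, schema["defaults"].get(c))
            for c in schema["columns"]}

print(read_record(old_record, schema))  # {'order_id': 1, 'region': 'UNKNOWN'}
print(read_record(new_record, schema))  # {'order_id': 2, 'region': 'EU'}
```

No historical rewrite, and no nulls leaking into query results for the old rows.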
AWS Stack Integration and the Central Role of Glue Data Catalog
Excellent. So now we've understood why Iceberg is at the forefront of building these modern lakehouses. Let's take a quick look at the AWS stack. At the very bottom, you have all these diverse data sources, whether it's a data warehouse, an on-premise database, streaming sources, or even external Iceberg-compatible sources. We ingest all of these through our portfolio of ingestion services. Sometimes it's batch services like Glue and EMR, sometimes it's streaming services like Kafka, Kinesis, Firehose, and we ingest all of that into the storage layer as an Iceberg table format.
Sometimes it's flexible generic S3 buckets, sometimes it's S3 Tables, which is a fully managed Iceberg capability, or sometimes it's Redshift managed storage, which is a warehouse storage, but still, we ingest everything as Iceberg, a consistent table format. The moment all of these enter the storage layer, the catalog automatically discovers all the technical assets coming in. Glue Data Catalog acts as your technical metastore, Lake Formation provides consistent enterprise-scale governance, and SageMaker Catalog provides all your business metrics, lineage, and auditability.
At the processing layer, you have the choice to either choose AWS first-party compute or choose any Iceberg-compatible compute, so everything natively supports the Iceberg table format. On the application stack, you have Amazon Bedrock for generative AI and ML applications, QuickSight for your BI use cases, or SageMaker Unified Studio for your analytical use cases. Looking at the entire stack, this is an integrated ecosystem with zero ETL, eliminating complexities associated with data movement. Query federation allows you to work with diverse data sources and join them, and catalog federation allows you to connect to all other external Iceberg-compatible catalogs, building a truly open ecosystem. So this is an ecosystem built for flexibility and not fragmentation.
I want to dive a little deeper into Glue Data Catalog and how it plays a central role in an Iceberg-powered lakehouse. First, what is it? It's a Hive metastore. It's also an Iceberg REST catalog, complete with V3 support and all the great features I just talked about. This catalog also supports multi-catalog federation to other external Iceberg-compatible catalogs. So this catalog stack provides a standard interface to which other computes can connect. Whether it's first-party or third-party, you can connect your preferred compute via a standard interface as an Iceberg REST catalog.
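As a concrete (and hedged) illustration of that standard interface: a Spark deployment can typically register the Glue Data Catalog's Iceberg REST endpoint as a catalog with properties along these lines. Treat the endpoint URL, warehouse value, and exact property names as assumptions to verify against the current AWS and Apache Iceberg documentation for your versions; `<your-aws-account-id>` is a placeholder.

```properties
# Assumed spark-defaults.conf fragment for the Glue Iceberg REST endpoint
spark.sql.catalog.glue_rest=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.glue_rest.type=rest
spark.sql.catalog.glue_rest.uri=https://glue.us-east-1.amazonaws.com/iceberg
spark.sql.catalog.glue_rest.warehouse=<your-aws-account-id>
spark.sql.catalog.glue_rest.rest.sigv4-enabled=true
spark.sql.catalog.glue_rest.rest.signing-name=glue
spark.sql.catalog.glue_rest.rest.signing-region=us-east-1
```

Any other Iceberg-REST-capable engine would point at the same endpoint, which is the interoperability point being made here.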
Lake Formation provides enterprise-scale governance. From fine-grained access control needs like catalog, database, table, column, row, and cell-level security to rich policy-based access control like role-based access control, tag-based access control, and attribute-based access control. Lake Formation also supports an architecture called data mesh. This is essentially a way to democratize data, where various lines of producers want to ingest metadata into a central catalog for consistent governance. From that central catalog, you can perform secure data sharing with various business units, other organizations, or units. It also supports a feature called credential vending, which allows only authorized and trusted compute to obtain scoped-down access to the underlying data sources.
In terms of Iceberg-style layout, I mentioned earlier that data and metadata are placed closer together, but the catalog is truly a super-fast lookup layer. It enables millisecond lookups, and by allowing compute to directly access S3, you can leverage manifest caching, parallel manifest processing, and get full throughput. The data catalog makes all of this possible.
We also see a trend where there's increasing interest in agentic lakehouses. This is because AWS Glue Data Catalog exposes all its APIs through programmatic interfaces. Now, intelligent agents can come in and discover tables, schemas, and data quality rules. They can also detect schema drift and notify data owners, self-heal pipelines, or optimize partitioning strategies to improve query performance. Overall, we see a trend where the catalog is not just a control plane for compute and humans, but also for all these intelligent agents.
Fully Managed Storage Optimization with S3 Table Buckets
When we talk about Iceberg, we need to zoom in on S3 Table Buckets. This is a stack that provides fully managed storage optimization for Iceberg. Iceberg is great, but it requires maintenance. Small files proliferate and accumulate, especially with high-volume CDC-style workloads. Every update creates a table-level snapshot, and at some point, snapshots can inflate storage costs, so you want to keep them lean. If you have unreferenced files that are not referenced by any live snapshot, they will also remain in the system and at some point affect query performance.
That's where S3 Table Buckets come in and do all of this completely behind the scenes, fully removing the operational overhead. Otherwise, you'd have to schedule separate cron jobs, constantly monitor these, and do it with off-the-shelf tools. The first benefit you get is compaction. We support different styles of compaction. The first one is bin-packed by default, ensuring all small files are optimally packed to a target file size. We also support sort and Z-order style compaction depending on your query style, whether it's primary column queries or multi-dimensional clustering queries.
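The bin-pack idea can be sketched with a simple greedy grouping (an illustrative simplification; this is not the actual S3 Tables or Iceberg compaction algorithm): many small files are grouped into rewrite batches close to a target size.

```python
# Sketch of bin-pack compaction: greedily group small files into rewrite
# groups near a target size, so the table ends up with fewer, larger files.
def bin_pack(file_sizes_mb, target_mb=128):
    groups, current, current_size = [], [], 0
    for size in sorted(file_sizes_mb, reverse=True):
        if current and current_size + size > target_mb:
            groups.append(current)       # close this rewrite group
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups

# Ten small CDC output files become three rewrite groups near 128 MB each
print(bin_pack([90, 70, 40, 30, 25, 20, 15, 10, 5, 5]))
```

With S3 Table Buckets, this kind of grouping and rewriting happens behind the scenes instead of in your own cron jobs.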
S3 Table Buckets support what we call policy-based snapshot retention. Depending on your time-travel needs and the number of snapshots you want to retain in the system, you define a policy, and S3 Table Buckets will automatically remove expired snapshots behind the scenes. Similarly, unreferenced and orphaned files that are not reachable from any live snapshot are continuously identified and cleaned up. Overall, S3 Table Buckets provide a lot of configuration and control plane APIs, as well as monitoring APIs. You have excellent AWS CloudTrail audit logs. We expose metrics like snapshot size, compaction size, number of snapshots, and number of referenced files. Again, you can think of an agentic lakehouse where agents start monitoring these metrics and proactively optimize the storage layer of your lakehouse.
Medallion Architecture: Leveraging Iceberg in Bronze, Silver, and Gold Layers
Great. So we've seen how the AWS stack for an Iceberg-powered lakehouse is integrated, focusing on AWS Glue Data Catalog and S3 Table Buckets. Before we dive into the core ingestion architecture patterns, I want to briefly touch upon the medallion architecture. This is a way to organize your data. You have the bronze layer, which is raw data ingestion, an append-only style. Typically, you leverage services like crawlers for schema detection or streaming services like Amazon Kinesis and Apache Kafka. Data comes in, and typically, your raw data storage option is flexible, general-purpose S3 buckets with Parquet and Iceberg formats on top.
The benefits you get from Iceberg in the bronze layer are, as I just mentioned, this excellent schema evolution without requiring a full data rewrite. In Iceberg V3, we have semi-structured data support. You get variant support, and you get this excellent time-travel feature that doesn't break your data pipelines.
And ACID guarantees. Multiple writers can come in, ingest all the raw data, and update the same table without interfering with each other. A pro tip: typically, you want to retain raw data indefinitely, so leverage S3 Lifecycle policies to transition data to Glacier after, say, 90 days.
Silver is your cleansing layer. This is enriched, validated data where you can apply all your classification rules. You can also run transformations. Typically, you leverage services like EMR Spark and AWS Glue ETL. Your storage option will be generic S3 buckets or S3 Tables, depending on your scale and needs. The pro benefit you get with Iceberg and the silver stack is schema enforcement. Schema acts as a contract. And you get this excellent partition evolution. So you can start with a partitioning strategy, and as you scale and your usage dictates, say you started on a daily basis, now you want to move to an hourly basis. The great thing about Iceberg is that the manifest tracks partitions by spec ID, so any new data you ingest can start applying the new specification, and older data can continue to apply the older specification.
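A minimal sketch of the spec-ID mechanism just described (hypothetical structures, not Iceberg's implementation): every data file records the partition spec it was written under, so evolving from daily to hourly partitioning never forces a rewrite of existing files.

```python
# Sketch of partition evolution: files carry the spec ID they were written
# under; new writes use the current spec, old files keep their old layout.
specs = {
    0: lambda ts: ts[:10],  # spec 0: daily partitions, "YYYY-MM-DD"
    1: lambda ts: ts[:13],  # spec 1: hourly partitions, "YYYY-MM-DDTHH"
}
current_spec_id = 1         # the table evolved from daily to hourly

def partition_for(ts, spec_id):
    return specs[spec_id](ts)

old_file = {"path": "f_old.parquet", "spec_id": 0,
            "partition": partition_for("2025-11-30T22:05:00", 0)}
new_file = {"path": "f_new.parquet", "spec_id": current_spec_id,
            "partition": partition_for("2025-12-01T09:15:00", current_spec_id)}

print(old_file["partition"], new_file["partition"])  # 2025-11-30 2025-12-01T09
```

A reader resolves each file against its own spec ID, which is why both generations of data coexist in one table.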
And Iceberg also supports this excellent incremental updating, so if you want to apply all these enrichment rules and transformations on incoming incremental changes, you can leverage Iceberg's merge operations to update those incoming incremental changes. And that's through snapshot checkpointing. Every update, you have that excellent isolation of what was the last checkpoint, so you know exactly what delta you're working with.
And the gold layer is your analytics-ready, optimized consumption tables. Typically, you leverage services like SageMaker, Athena, and Redshift for your query analytics use cases, and your storage choice is definitely S3 Table Buckets for the gold layer. The benefit of Iceberg in the gold layer is that you get materialized aggregations, which are typically stored as Iceberg tables. And you can leverage Iceberg's hidden partitioning. What that means is when you're running a query, you don't actually need to know the partitioning strategy or techniques based on column values. Iceberg knows internally how to get to the exact data files. It applies the right partitioning strategy for you. And to enjoy the benefits of query performance, you get that boost in your queries by leveraging sort order or Z-order, depending on your query patterns.
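Hidden partitioning can be sketched like this (a deliberately simplified illustration with hypothetical names): the table declares a transform such as day(event_ts), users filter on the raw timestamp column, and the engine maps that filter onto partitions without the user ever naming a partition column.

```python
# Sketch of hidden partitioning: queries filter on event_ts; the declared
# day() transform routes the scan to the right partitions internally.
from datetime import datetime

def day(ts: datetime) -> str:  # Iceberg-style "day" transform
    return ts.date().isoformat()

partitions = {"2025-12-01": ["f1.parquet"], "2025-12-02": ["f2.parquet"]}

def files_for_query(ts_from: datetime, ts_to: datetime):
    # Simplified: assumes the query range falls within the endpoint days.
    wanted = {day(ts_from), day(ts_to)}
    return sorted(f for p, fs in partitions.items() if p in wanted for f in fs)

# WHERE event_ts BETWEEN 08:00 and 18:00 on Dec 2 -> only f2 is scanned
print(files_for_query(datetime(2025, 12, 2, 8), datetime(2025, 12, 2, 18)))
```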
And a pro tip here is you definitely want to leverage your snapshot retention policy to make sure you're always optimizing your metadata layout for better query performance. Something to note here, S3 Tables provide three times better query performance and ten times higher throughput by significantly reducing this metadata overhead and optimizing the file layout.
Core Ingestion Patterns: Batch ETL, CDC, and High Concurrency Streaming
All right, let's look at the core ingestion patterns. First and foremost is batch ETL. This is your workhorse style of workload where you're moving large amounts of data on a scheduled basis. Typical use cases are lake migrations, warehouse loads, historical backfills, or regulatory reporting. You have your source data sources like RDS or on-premise databases, and you perform ETL, for example, leveraging serverless Spark, writing it as Iceberg tables on S3 Tables, cataloging it with AWS Glue Data Catalog, and then it's ready for analytical consumption.
The key role of Iceberg in this pipeline is that you get this excellent schema evolution. Adding, renaming, or dropping columns happens very elegantly without requiring rewrites. You get this excellent partition evolution without incurring large scans or rewrites. And it gives you this excellent time travel capability and ACID guarantees, so if something goes wrong, you don't have to halt the rest of your ETL pipeline. You can go back to a consistent point-in-time snapshot. And you get excellent row-level lineage with Iceberg metadata, which gives you better auditability and debuggability.
A pro tip here is definitely leverage S3 Table Buckets for optimized and faster queries. And if you're leveraging, for example, the bin pack algorithm, you want to tune your file size to understand what your patterns are, how you want to optimize for query performance. Understand your partitioning strategy, if your data is time-series, based on date-time, if it's region-based or geo-style data, you want to partition differently. And enable sort order or Z-order for better query performance.
The second popular pattern is Change Data Capture. This will be a real-time style ingestion. This is real-time replication from operational databases to analytics.
You have all your source databases like Aurora, MySQL, RDS, and you leverage change data capture from bin logs or write-ahead logs. Typically, you leverage Database Migration Services, and all these captures come in and go through Kinesis or Kafka for stream buffering and replay. Then you leverage Flink or Spark to do your transformations, merging into Iceberg format on S3 tables, and it's ready for analytical consumption.
The key role of Iceberg here is that you get efficient upsert and merge operations, because you're getting all these WAL (write-ahead log) captures, the incoming delta changes. You know exactly which rows to update without completely rewriting the entire partition, and Iceberg does that very efficiently. You know the exact rows to update, so thanks to these optimized layouts, you only rewrite the relevant files and atomically update the manifest.
We support two strategies: copy-on-write and merge-on-read. Typically, for CDC-style workloads, we've seen customers adopt merge-on-read because it's more write-heavy. Definitely leverage deletion vectors. This is much more optimized for performance, so you can bring that entire bitmap into memory and operate at memory speeds when you want to apply those deletes. You get this excellent snapshot isolation, so even while your CDC stream is coming in and ingesting, you get consistent query performance and consistent reads on the analytics side. Definitely leverage auto-compaction. And concurrent writers can update the same table without interfering with each other.
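The effect of such a CDC merge can be sketched in plain Python (an illustrative stand-in for what an Iceberg MERGE would do on the target table; the event format is a hypothetical simplification of DMS-style change records): a batch of inserts, updates, and deletes keyed by primary key is applied to the current table state.

```python
# Sketch of applying a CDC batch to a keyed target table:
# 'I' and 'U' events upsert by primary key, 'D' events delete.
target = {1: {"id": 1, "status": "new"}, 2: {"id": 2, "status": "open"}}

cdc_batch = [
    {"op": "U", "id": 2, "status": "closed"},  # update existing row
    {"op": "I", "id": 3, "status": "open"},    # insert new row
    {"op": "D", "id": 1},                      # delete row
]

def apply_cdc(target, batch):
    for ev in batch:
        if ev["op"] == "D":
            target.pop(ev["id"], None)
        else:  # insert or update -> upsert by primary key
            target[ev["id"]] = {k: v for k, v in ev.items() if k != "op"}
    return target

print(sorted(apply_cdc(target, cdc_batch).keys()))  # [2, 3]
```

Under merge-on-read, the deletes in such a batch would land as delete files (or deletion vectors) rather than immediate rewrites, which is why that strategy suits write-heavy CDC streams.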
A pro tip here is to definitely ingest your files in Parquet format. This is generally better for compression techniques. You want to monitor your lag throughout the pipeline, whether it's DMS or Kinesis, to see where your latency is going up so you can optimize your CDC pipeline. If your workload is write-heavy with a lot of deletes, definitely leverage deletion vectors.
High-Concurrency Streaming, and the Medidata Case Study: Data Connect and Clinical Data Studio
Your third popular pattern is high concurrency streaming ingestion. You have clickstream analytics, financial transactions, game telemetry. We're talking about all these IoT apps coming in with high throughput and multiple streams, for example, through Kinesis streams. You have Flink doing real-time aggregations, Spark doing complex transformations, and all of them writing to the same table. Writing into S3 tables, and it's immediately ready for consumption.
Iceberg is perfect for this because you have multi-writers, and you get consistency by leveraging optimistic concurrency. You get excellent read consistency on the consumption side, and small file handling is perfect with S3 tables. You get excellent exactly-once semantics because snapshots provide this excellent checkpointing capability, and combined with Kafka offsets, you get this excellent exactly-once semantics. Iceberg is perfect for all these incremental processing scenarios, so you only read the delta from the last checkpoint and do your transformations.
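A minimal sketch of that checkpointing idea (hypothetical classes, not an actual Iceberg or Kafka API): the consumed source offset is committed atomically with the snapshot, so a replayed batch after a failure is recognized and skipped instead of producing duplicates.

```python
# Sketch of exactly-once ingestion: commit the source offset together with
# the snapshot, so a replay of an already-committed batch is a no-op.
class IcebergLikeTable:
    def __init__(self):
        self.snapshots = []  # each snapshot records the offset it covered

    def last_committed_offset(self):
        return self.snapshots[-1]["offset"] if self.snapshots else -1

    def commit(self, rows, offset):
        if offset <= self.last_committed_offset():
            return False  # batch already committed: replay detected, skip
        self.snapshots.append({"rows": rows, "offset": offset})
        return True

table = IcebergLikeTable()
assert table.commit(["r1", "r2"], offset=10) is True
assert table.commit(["r1", "r2"], offset=10) is False  # crash-replay is safe
print(len(table.snapshots))  # 1: no duplicate data despite the replay
```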
A pro tip here is that S3 tables are the best choice for these workloads. That way, you get all the automatic storage optimizations. Especially in these multi-writer scenarios, you want to monitor for commit conflicts and adjust your concurrency accordingly. Typically, we recommend customers do namespace isolation so they don't step on the same partitions. You have this excellent isolation of high-frequency writers, and you want to understand your patterns and adjust your micro-batch sizes to balance latency and file count. By leveraging AWS components, you get all the benefits out of the box.
Overall, we've looked at batch ETL, CDC, and high-concurrency streaming. Across all of them, Iceberg together with S3 Tables, Glue Data Catalog, and the analytics services gives you the core benefits out of the box. So, Mike, over to you. Let's hear about Medidata's transformation journey.
Yes, thank you, Purvaja. That was a great introduction to some of the popular Iceberg architectures that we're seeing today and the great AWS services to enable them. My name is Mike Araujo. I'm an engineer at Medidata. Before we dive into some of our customer use cases, let me give you a little bit of information about who we are and what we do.
Medidata is a life sciences technology company, and for over 25 years, we've innovated and delivered digital solutions for life sciences, pharmaceutical, and healthcare companies. For today's primary use case, we're going to look at two of our newer products: Data Connect and Clinical Data Studio.
These newer products were aimed at bringing data under one roof for our customers to operate on.
We wanted to provide a single, unified source regardless of where the data came from. And we wanted to support and enrich that data with semantics through our internal Medidata Data Catalog. And further, we wanted to accelerate the data review and insights process through AI and machine learning.
As you can imagine, over 25 years, we've had millions of patients. That's resulted in billions of data points that have been integrated from all these disparate data sources. One of the things that we really wanted to achieve was reducing the friction that our customers were experiencing when generating data sets and data assets for ad hoc analytics, for insights through AI and machine learning, which I just mentioned.
Migrating from Legacy Architecture to Iceberg: Challenges and New Solutions
This is what the architecture we used to accomplish things like this looked like. This is probably going to be very familiar to many of you in the audience. This pattern worked very well, and for a very long time, until the recent data explosion that we're seeing today. On the left-hand side, you have your bronze layer consumers and collectors, which included Rave, our flagship platform, which is our raw data collection system, and a legacy ingester solution that was used for file-based integrations.
What you'll notice here is a lot of ETL blocks. So a lot of different batch jobs running, and data had to be put into different staging layers for processing before it finally reached the downstream final destination where the consumer would pick it up and read it. But in reality, you probably didn't just have one of these pipelines. You probably had many of them. Because your consumers had a lot of different requirements for how they wanted to read that data. We're talking streaming consumers, application consumers, APIs, BI solutions, a very wide range. So you probably replicated these pipelines all across your organization to try and satisfy all of them.
And obviously, you probably ran into some of the same problems that we started to encounter with solutions like this. As data started to expand, latency started to deteriorate and become measured in days, not hours or minutes. Inconsistent timing of individual jobs made it difficult to reconcile all the data passing through the system, especially since we always had collectors collecting data at all points in time.
With so many batch jobs running and so many places to store data, there's more opportunities for error, more places for things to go wrong, and as batch jobs fell out of sync, you had data inconsistencies across your entire pipeline solution. And when you needed to scale this, when you needed to accommodate billions and billions of data points, you probably had to re-engineer your entire system, and in some cases, re-migrate your data. The end result was that for various use cases, there was always a tremendous amount of copies of data all over the place. And even worse, if you wanted to add observability to this, you had to replicate that observability across your entire data estate.
This is how we're solving this challenge today with Iceberg and AWS. On the left-hand side, you still have your Rave solution collecting data. We've replaced our ingester solution with our newer solution, Data Connect. What's happening here is we've eliminated all those ETL jobs that were running in certain periodic intervals in between, and we've eliminated all those intermediate storage layers and staging layers that we used to need.
Because what's happening here is, as you can see from the far right, Iceberg can directly connect to whatever you need from a consumption standpoint. So if we can fulfill this Iceberg sink in the Glue Catalog, we get immediate interoperability with a multitude of different streaming solutions, batch solutions, warehousing solutions, other open source projects like Arrow, and even MCP tools that can feed into AI solutions.
Right before that is the streaming part that I just mentioned, and I want to highlight that for just a second. For those of you who might know, when you write a lot of snapshots from a lot of writers in Iceberg, you start to run into all kinds of problems: file sizes, small files, a lot of snapshots. By putting Flink within EKS and using Kafka buffers in between, we were able to write to that Iceberg table only once every 20 minutes, keeping the snapshot size down while aggregating data from thousands of Flink pipelines.
Outcomes of the Iceberg Architecture and Medidata's Next Steps
So let's take a look at what this architecture has resulted in for both us and our customers.
First, data availability. Unsurprisingly, or should I say predictably, when you move from a batch pipeline to a streaming pipeline, you get a significant reduction in latency. And that's exactly what we saw. This reduction in latency resulted in our customers being able to see a consistent view of their data. The Glue compactions ensured that we were getting data in a timely manner. It abstracted away all those irritating complexities like small file problems and orphaned file deletions. This was one of the things we worked very closely with the Amazon team on, and they were very helpful, iterating many times to deliver the solution you see today. Finally, snapshots gave our customers and ourselves point-in-time access to their data.
Consistency was greatly improved. Now, instead of having data spread across our system, whether it was Kafka topics, staging tables, Snowflake tables, Databricks tables, or other tables, we had only one copy of the data, just one Iceberg table. That table was able to satisfy both streaming and batch use cases. Not only were we able to integrate our data assets there, but we were also able to accommodate application interfaces, APIs, and BI tools. As a result, predictably, data loss was significantly reduced. We only had to focus on this one data copy, and we knew that everything reading from it was reading from the same thing.
Scalability. Now that we were dealing with data stored in S3, we no longer had to store data in Kafka for its entire lifetime. This allowed Kafka to do what it does best: consolidate all the writers and move data downstream. This actually, predictably, saved us a lot of money. Because all that lifetime data was now put in S3 and could be tiered. If it was kept in Kafka, we would have had to scale up servers to store all of it. Of course, the Kafka and Flink solutions running on Amazon MSK and Amazon EKS come with out-of-the-box autoscaling support. And predictably, our observability was greatly enhanced and simplified. We were able to get all that great metadata and metrics, Prometheus metrics, and so on, coming out of these pipelines, and we only had to look at it in a single pane.
Security. Now we have a single data access layer using the Glue Catalog, so nothing can see our data without going through Glue and the corresponding IAM roles. Furthermore, data never leaves our VPC; the data just sits in S3 within our account, and whatever you decide to connect as a reader can read it, but it doesn't store a physical copy elsewhere.
And integration. As I alluded to in the diagram section, the integrations available to us have greatly expanded. Warehousing solutions like Snowflake and Databricks can connect directly. The Iceberg REST API provided via the Glue Catalog has enabled the introduction of many advanced Iceberg concepts. Finally, many open source technologies, streaming writers, readers, SQL solutions, and more are enabled out of the box, and more are being added every day. This is a snapshot of all the benefits we've been talking about: availability, consistency, scalability, security, and finally, integration.
What are Medidata and Iceberg's next steps? We are going to equip our suite of upcoming AI agents with the real-time data flowing through these Iceberg tables. We want to give agents insight into the data: monitor when a customer uploads new data, and minutes later let them start asking questions about it. The current plan, what we're working on, is making it available through MCP tools that have access to this lake. We also want to expand our customer data sharing solutions. We recently launched a solution that gives our customers direct read access to the data residing in these Iceberg tables, and we want to expand that to give them direct write-back into our lakehouse. We have curated SDKs from Medidata, including our recently released RStudio SDK, which lets customers write R code on their local laptops and read data that resides within this lake. This is all enabled by the open source Apache Arrow project; we've set up Flight servers to provide the connection and bridge between RStudio and our lakehouse. Finally, direct integration.
If a customer is using Iceberg as part of their solution today, we want to open that up and connect to it directly, so they can integrate with the other experiences available within Data Connect and Clinical Data Studio. And finally, some exciting things on our side regarding what's coming in V3.
We have variant support. Metadata, right? We've seen Iceberg tables written from thousands of jobs, and those thousands of jobs are really writing multiple logical tables, multiple tables within a single table, and we've leveraged map columns extensively for that. Map columns worked very well for that use case, but they fell short in some areas, such as indexing and query performance for the tables nested inside large tables. So we're looking forward to replacing those map columns with variants. With variants, you get much better performance for a subset of tables within another table.
We want to enable column defaults. That way, we can eliminate days of backfills. Currently, if we need to make a schema change, evolve a table, or move a process, it takes days of backfilling to bring the table up to date. Column defaults promise to eliminate all of that; we no longer have to worry about migration patterns and so on, which can be a bit awkward and complex. This will be a great feature for us.
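To make the backfill point concrete, here is a minimal, purely illustrative Python sketch of why read-time column defaults remove the need to rewrite historical files: old rows written before the schema change simply lack the new column, and a reader fills it in from the default instead of backfilling. The field names (`id`, `status`) and the default value are invented for this example, not Medidata's actual schema.

```python
# Illustrative sketch only: read-time column defaults vs. backfills.
# Old data files written before a schema change lack the new column; the
# reader applies the declared default instead of rewriting history.

NEW_COLUMN_DEFAULTS = {"status": "active"}  # hypothetical newly added column

def read_row(raw_row, defaults=NEW_COLUMN_DEFAULTS):
    """Fill in columns missing from old files using their defaults."""
    return {**defaults, **raw_row}  # explicit values win over defaults

old_row = {"id": 1}                      # written before the column existed
new_row = {"id": 2, "status": "paused"}  # written after the schema change
print(read_row(old_row))   # old row picks up status='active' at read time
print(read_row(new_row))   # new row keeps its explicit value
```

The same idea is what makes a default-aware format attractive: the historical files on disk never change, only the read path does.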
And finally, row-level lineage. Our customers are always obsessed with seeing how their data changes and evolves over time. So we've implemented features to give them a clear overall picture, but many of them are not off-the-shelf solutions or tools. It's navigating snapshots, looking at timestamps, comparing date ranges, and so on. But with V3, Iceberg provides row-level lineage. What we want to do there is connect this row-level lineage directly into our CDC pipelines and expose that directly to our customers. That way, customers can see the changes in their data, and we don't have to spend hours writing custom code or custom SQL to present that to them. So with that, I'll turn it over to Srikanth. He's going to show you some of the amazing things Amazon is doing to enable this for our customers today.
The Importance of Open Specs and Query Federation, Materialized Views
Thank you, Mike. It's great to see customers like Medidata using Iceberg and delivering solutions to their customers. What I want to do is touch on the consumption patterns for Iceberg in AWS compute engines. But first, I want to dive into open specs. Why are they important? Fundamentally, open specs are game changers for how you consume data, with standardized table semantics and interoperability. This is very important to us, and so are the catalog choices.
Automatic schema evolution, ACID transactions, and accessible metadata: all these features are truly important parts of this open specification. Another thing to consider is multi-cloud and multi-source integration. For example, suppose you have a large amount of operational logs in S3, app exports, and third-party marketing data, and you are trying to aggregate it all into one place, into an Iceberg table. You can govern all of this centrally, in one place. Open specs are what make that possible.
And finally, open discovery capabilities. These matter because, for enterprise-wide visibility and compliance, especially with sensitive data sets, this open specification also provides comprehensive lineage and standardized definitions. That's why I wanted to touch on the open specification before continuing. Now, let's get into query federation. This is an important topic. We've seen how open table specs unlock interoperability. So what is query federation, and why is it transformative?
Think about it. With centralized governance and metadata management, you plug all your sources into a unified catalog, everything is discoverable, and you can audit and secure all this data from day one. That matters. Open query federation also puts data residency and compliance at the forefront.
Query federation stands out as a particularly important feature. When you federate queries across multiple data sources, the data stays in place, which is a crucial property in the context of Apache Iceberg. Agile analytics also matters here: if your analysts need to reach data from multiple sources without data movement, without migrating data, query federation becomes a very important part of their analytics. Finally, cross-source reporting is important because all these data sets live across multiple environments, which could be multi-cloud or multi-vendor, and query federation lets you aggregate them into one Iceberg format behind an API and access all of that data.
For that, we've just recently announced catalog federation for external Apache Iceberg catalogs. What does this do? Catalog federation with Iceberg allows teams to query Iceberg tables residing in S3 directly, using your favorite analytical compute engine, while we keep the metadata in sync across catalogs: not just the AWS Glue Data Catalog, but also other external and remote catalogs. So when you access a table, you may be looking at data coming from another catalog, and they're all in sync. That is a key feature. Finally, it provides consistent, centralized security across all of this. That part is optional, but depending on your use case you can apply broad policies or granular column-level access through Lake Formation. Federated catalogs become an important piece because they bring multiple catalogs into the Iceberg format.
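The single-entry-point idea behind catalog federation can be sketched with a toy model: several underlying catalogs, each mapping table names to metadata locations, exposed through one unified interface. This is a simplified illustration only; the catalog names, table names, and S3 paths below are invented, and the real feature works through the Iceberg REST catalog API, not this class.

```python
# Illustrative sketch only: a toy "federated catalog" that unifies table
# listings from multiple underlying catalogs behind one entry point,
# echoing how catalog federation surfaces remote Iceberg catalogs.

class FederatedCatalog:
    def __init__(self):
        self._catalogs = {}  # catalog name -> {table name -> metadata location}

    def register(self, name, tables):
        """Attach an underlying catalog's table listing."""
        self._catalogs[name] = dict(tables)

    def list_tables(self):
        """List every table, qualified as catalog.table."""
        return sorted(
            f"{cat}.{tbl}"
            for cat, tables in self._catalogs.items()
            for tbl in tables
        )

    def resolve(self, qualified_name):
        """Resolve catalog.table to its metadata location."""
        cat, tbl = qualified_name.split(".", 1)
        return self._catalogs[cat][tbl]

fed = FederatedCatalog()
fed.register("glue", {"sales": "s3://bucket/sales/metadata.json"})
fed.register("remote", {"clinical": "s3://other/clinical/metadata.json"})
print(fed.list_tables())               # ['glue.sales', 'remote.clinical']
print(fed.resolve("remote.clinical"))  # s3://other/clinical/metadata.json
```

The point of the sketch is the shape of the abstraction: readers see one namespace, while each table's metadata continues to live with, and stay in sync with, its home catalog.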
Now, materialized views are fundamentally important when it comes to performance. Because, think about it, materialized views significantly enhance analytics by providing pre-computation. For example, the way to think about this is there are basically a couple of elements here. One is your implementation strategy. Think about what you want to identify from a best practices perspective. Identify what are those high-frequency queries your users are always using, and you want to set up a refresh schedule that can keep the views accurate and timely. That's your implementation strategy. And of course, performance design is another critical element here. What do I mean by performance design in this case? Pre-aggregating your metrics, and also optimizing your joins to provide the right lightweight and fast results to your customers.
Next, storage optimization, which Purvaja touched on earlier, is very important. It means storing your tables in Parquet format, choosing the right compression, and partitioning them intelligently so that your queries can scale without exploding your storage costs. Refresh management is another critical point because it keeps your insights up to date, so understand what your refresh schedule looks like. Finally, metrics and ROI: you must be able to measure this performance. Look for significant cost savings, quick query optimizations, and measurable improvements from seconds to milliseconds. The idea of materialized views is to give you a pre-computed, aggregated way to access your Iceberg data for better performance.
Iceberg Consumption in AWS Compute Engines: Athena, Redshift, EMR
In that regard, we've just recently announced materialized views for Apache Iceberg in AWS Glue. What are they? Think of a materialized view as a managed table that stores the output of a SQL query in Iceberg format. And every time the base table changes, the view is continuously and incrementally updated, so you don't have to re-run that query from scratch. It eliminates the whole pipeline you would otherwise have to build and keep updating. That's what AWS Glue's materialized views for Apache Iceberg do.
What are the benefits? You might be wondering how this helps you. The performance benefits of these Apache Iceberg materialized views are immense, because the results are pre-computed and stored in an Iceberg table, so reads are super fast. Spark jobs reading from these views can also determine whether a materialized view is the right approach or whether they should fall back to the base table itself; that kind of intelligence is built in. When you build pipelines, especially complex pipelines, these materialized views help simplify them. This is a great feature, and one of my favorites.
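The incremental-refresh idea can be sketched in a few lines of plain Python: keep a pre-computed aggregate and fold in only newly appended rows instead of re-scanning the whole base table. This is a conceptual illustration, not Glue's actual mechanism (which works through Iceberg snapshots); the `region`/`amount` schema and the class name are invented for the example.

```python
# Illustrative sketch only: incremental materialized-view refresh.
# The "view" keeps a running SUM(amount) GROUP BY region and folds in
# only rows appended since the last refresh, rather than recomputing
# the aggregation from scratch each time.
from collections import defaultdict

class MaterializedSum:
    def __init__(self):
        self.totals = defaultdict(float)  # region -> pre-computed SUM(amount)
        self.rows_seen = 0                # high-water mark into the base table

    def refresh(self, base_table):
        """Fold in only the rows appended since the last refresh."""
        for region, amount in base_table[self.rows_seen:]:
            self.totals[region] += amount
        self.rows_seen = len(base_table)

base = [("us", 10.0), ("eu", 5.0)]
view = MaterializedSum()
view.refresh(base)          # initial build scans everything once
base.append(("us", 7.0))    # new data lands in the base table
view.refresh(base)          # this refresh touches only the one new row
print(dict(view.totals))    # {'us': 17.0, 'eu': 5.0}
```

The design point carries over directly: readers always hit the small pre-computed result, and refresh cost scales with the change set, not the table size.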
I know Purvaja covered some of this, but I want to put up a slide to give you the full picture. Apache Iceberg is becoming the mainstream table format across analytics, and the customers we talk to are adopting this approach. We have built multiple data integrations across the ecosystem. On the producer side, Lake Formation, EMR, Athena, and integrations across our entire analytics stack can produce data into Apache Iceberg tables. This is big. And with the catalog federation I just talked about, you can retrieve tables from remote catalogs as well.
With the open specs and APIs we talked about, Spark, Trino, and Presto can consume all this data across these different tools. This becomes an integral part of thinking about your architecture. Think of these as components to consider. Now that I've talked about services that can consume data, let's take a closer look at some services like Amazon Athena, and then I'll talk about some other services here.
How do you optimize Iceberg data consumption using Amazon Athena? One of the biggest advantages here is Athena's ability to leverage Iceberg's manifest files and partition statistics for aggressive metadata pruning. This is very important. With proper partitioning and appropriate filter predicates, Athena skips irrelevant Parquet files, improving performance. And data management is not just about querying: you can build new Iceberg tables with statements like Create Table As Select and write to them using SQL within Athena. This gives you the snapshot isolation Purvaja was talking about, and Athena's support for branching and tagging on these Iceberg tables is also important.
Finally, serverless scaling. This is simple in the purest sense, because there's no cluster management required; you don't have to manage clusters to reach the data in your Iceberg tables. Athena is serverless and charges $5 per terabyte scanned, which is very cost-effective. As for Athena best practices: design partition predicates and filter conditions into all your queries, and leverage manifest statistics and compaction for fast, efficient queries. That's how Athena and Iceberg work together to give you the best performance you can get.
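The pruning-plus-pricing interaction above can be sketched numerically: min/max column statistics of the kind Iceberg keeps per data file in its manifests let a planner skip files whose value ranges cannot match the predicate, and every skipped file is data Athena never scans or bills for. The file names, sizes, and date ranges below are invented for illustration.

```python
# Illustrative sketch only: min/max file pruning and its effect on a
# $5-per-TB-scanned bill. Each entry mimics the per-file column stats an
# Iceberg manifest would carry: (file, size in bytes, min date, max date).

files = [
    ("part-00.parquet", 512 * 1024**2, "2025-01-01", "2025-03-31"),
    ("part-01.parquet", 512 * 1024**2, "2025-04-01", "2025-06-30"),
    ("part-02.parquet", 512 * 1024**2, "2025-07-01", "2025-09-30"),
]

def prune(files, lo, hi):
    """Keep only files whose [min, max] range can overlap the predicate."""
    return [f for f in files if not (f[3] < lo or f[2] > hi)]

def scan_cost(files, dollars_per_tb=5.0):
    """Cost of scanning the surviving files at $5 per terabyte."""
    return sum(size for _, size, _, _ in files) / 1024**4 * dollars_per_tb

# WHERE event_date BETWEEN '2025-05-01' AND '2025-05-31'
survivors = prune(files, "2025-05-01", "2025-05-31")
print([f[0] for f in survivors])   # only part-01 can contain May rows
print(scan_cost(files), scan_cost(survivors))  # cost drops to one third
```

ISO dates compare correctly as strings, which is why plain string comparison works here; the same containment test is what makes partition predicates so valuable in real Athena queries.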
In conclusion, the combination of Athena and Iceberg makes for a true pay-as-you-go analytics solution. Because there's no cluster management, it's cost-effective, and you can access Iceberg tables with all the features I've talked about. Now let's talk about Redshift. Redshift is a very popular data warehouse. Many of my customers use it. When they're ready to take their lakehouse analytics to the next level at an enterprise scale, they generally look at Redshift. Amazon Redshift and Iceberg work very well together in terms of performance and scale.
On the consumption side, Redshift is excellent at concurrency and efficiency. It can easily handle over 50 concurrent queries, and the system's autoscaling and workload management really help with these Iceberg tables. In modern lakehouse integration, Redshift has built-in support for Apache Iceberg tables. This is a powerful combination, allowing direct read and write to Amazon S3, with snapshot isolation, and full compatibility with both business intelligence and analytical workloads. Everything is aligned for cost management because Redshift uses statistics to plan and optimize queries, allowing queries to perform very well. It provides query rewrite capabilities, so you can execute them quickly.
Best practices center around RA3 nodes for optimal storage and compute separation. Leveraging concurrency scaling is a key aspect of Amazon Redshift, and cross-cluster data sharing is equally important for enterprise needs. Performance monitoring with the right metrics is an indispensable component: query latency, concurrency, and cache hits all need watching for smooth, predictable operations, and Redshift provides all of them.
Those are the Redshift best practices. Another commonly seen compute engine is Amazon EMR, because Apache Spark generally makes Apache Iceberg analytics scalable, cost-effective, and enterprise-ready from a lakehouse architecture perspective. From an infrastructure perspective, you get flexibility: clusters range from a handful of nodes to fleets processing 100 to 1,000 terabytes of data concurrently, you can deploy multiple clusters, and Graviton instances let you spin them up efficiently.
Performance heavily relies on runtime optimization, because with Apache Iceberg especially, Spark can leverage adaptive query execution, smart join planning, and dynamic partitioning, all available as part of its Spark implementation. Even the largest data jobs, petabytes of data with millions of partitions, can be consumed and executed effectively. Storage integration is important because it ties everything together: utilizing S3 committers provides full ACID transaction support, and schema evolution makes it very easy to manage reliable, versioned datasets.
By combining Amazon EMR and Spark with Apache Iceberg, you can handle all the analytics we've been talking about: pipelines, streaming workloads, and ad-hoc queries. Regardless of the type of workload, the core best practices for achieving scale start with appropriate file sizing; targeting between 128 megabytes and 1 gigabyte per file is nothing new, just standard practice. It's also important to tune executor memory for maximum performance and to enable adaptive query execution for shuffle optimization.
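The 128 MB to 1 GB guidance implies a simple piece of arithmetic that compaction jobs perform: given a pile of small files, how many right-sized output files should a rewrite produce? Here is a minimal sketch of that calculation; the 512 MB target and the 4 MB input files are invented example numbers, not a recommendation from the talk.

```python
# Illustrative sketch only: planning a compaction pass against the
# 128 MB - 1 GB file-size guidance. Given total input bytes and a target
# file size, compute the output file count and resulting average size.
import math

MB = 1024**2
TARGET_FILE_SIZE = 512 * MB  # hypothetical middle of the 128 MB - 1 GB band

def plan_compaction(file_sizes, target=TARGET_FILE_SIZE):
    """Return (output file count, resulting average file size in bytes)."""
    total = sum(file_sizes)
    n_out = max(1, math.ceil(total / target))
    return n_out, total // n_out

# 2,000 small 4 MB files: a classic streaming-ingest symptom
small_files = [4 * MB] * 2000
n_out, avg = plan_compaction(small_files)
print(n_out, avg // MB)  # 16 output files averaging 500 MB each
```

Fewer, larger files mean fewer manifest entries to plan over and fewer S3 GETs per query, which is why compaction and file sizing appear in every engine's best-practice list here.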
In conclusion, with Amazon EMR, Spark, and Apache Iceberg, you can build fast, flexible, and cost-optimized lakehouse analytics. Whether it's large-scale data, machine learning, ad-hoc analysis, or even streaming, with this kind of feature set, everything runs smoothly. This is Amazon EMR and Spark. Now let's summarize the key takeaways.
Summary of Key Takeaways, Best Practices, and Action Plan
Finally, the key takeaways I want you to remember are the important points that make Apache Iceberg stand out in modern lakehouse architectures on AWS. The foundation is built with enterprise data capabilities. Think of ACID transactions, time travel that Purvaja talked about, and schema evolution. All of these are provided as standard, and that's important. Another thing is that you are building for the future from day one. When you architect a solution with Apache Iceberg, you are essentially building a future-proof architecture.
By leveraging the S3 Table Buckets we talked about, your queries get high-performance, high-throughput access. Take advantage of this capability, and of the significant operational benefit of AWS maintaining and managing these Apache Iceberg tables for you. For high-performance analytics, whether you plan to use Amazon Athena, Amazon Redshift, or Apache Spark, engine-specific optimizations are built in. Iceberg materialized views, query federation, and catalog federation, which we talked about, are the fundamentals to start thinking about for fast cross-source analytics and quicker access to better insights.
In production, an important thing to remember is that Apache Iceberg is not just for batch jobs. Because, as Purvaja discussed in one of the use cases, it supports real-time workloads like Change Data Capture and streaming workloads. The idea here is to achieve sub-second latency across a wide range of concurrent write support, making your platform more responsive. This is important.
Regarding best practices, if you think about it, start new projects with S3 Table Buckets if possible. Because this is an optimized, managed Apache Iceberg storage layer that's available. Think about file sizes for storage efficiency. Use smart partitioning for agile queries, enable catalog optimization for fast lookups, and configure snapshot retention with a view to balancing reliability and cost.
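The last item in that list, snapshot retention balancing reliability and cost, comes down to a policy like "expire snapshots older than N days, but always keep the most recent M so time travel and rollback still work." Here is a small illustrative sketch of that policy in plain Python; the 7-day window, keep-3 floor, and timestamps are invented example values, not the feature's actual defaults.

```python
# Illustrative sketch only: a snapshot-retention policy that expires old
# snapshots past a retention window while always keeping the newest few,
# trading storage cost against time-travel reach.
from datetime import datetime, timedelta

def expire_snapshots(snapshots, now, max_age=timedelta(days=7), min_keep=3):
    """Return (kept, expired) snapshot lists, newest first."""
    ordered = sorted(snapshots, key=lambda s: s["ts"], reverse=True)
    kept, expired = [], []
    for i, snap in enumerate(ordered):
        if i < min_keep or now - snap["ts"] <= max_age:
            kept.append(snap)    # protected by the floor or still in-window
        else:
            expired.append(snap)  # data files it uniquely holds can be freed
    return kept, expired

now = datetime(2025, 12, 1)
# five snapshots, one every three days going back in time
snaps = [{"id": i, "ts": now - timedelta(days=3 * i)} for i in range(5)]
kept, expired = expire_snapshots(snaps, now)
print([s["id"] for s in kept], [s["id"] for s in expired])  # [0, 1, 2] [3, 4]
```

Tuning the two knobs is exactly the reliability-versus-cost balance the talk describes: a longer window preserves more rollback points but keeps more obsolete data files alive in S3.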
Here's your action plan. Use S3 Table Buckets for all your Apache Iceberg projects if possible, and size your files appropriately. Set up snapshot retention and partitioning to keep performance predictable into the future. Leverage materialized views, especially the AWS Glue Iceberg materialized views we talked about, because pre-aggregated, pre-computed results offer valuable benefits: they reduce data pipeline complexity and get you those results quickly.
Make catalog management a priority. Think about catalog federation with other vendors and other catalogs; it lets you consume data and get optimal compute no matter what platform you're on. The way to provide scalable, flexible analytics is to use Apache Iceberg with these compute engines and make your workflows more productive. Whether you're a data engineer, a data architect, or anyone else, it makes that data easy to consume.
This is my final slide. Thank you for coming to the session. Enjoy re:Invent. Thank you very much.