Translated by AI
re:Invent 2025: Multi-tenant SaaS Isolation with AWS Lambda Tenant Isolation Mode
Introduction
By transcribing various overseas lectures into Japanese articles, we aim to make hidden, high-quality information more accessible. This project, driven by that concept, features the following presentation!
For re:Invent 2025 transcription articles, information is compiled in this Spreadsheet. Please check it as well.
📖re:Invent 2025: AWS re:Invent 2025 - Secure Multi-tenant SaaS with AWS Lambda: A Tenant Isolation Deep Dive (CNS381)
In this video, Anton and Bill introduce Lambda's new Tenant Isolation Mode, explaining the challenges and solutions for tenant isolation in multi-tenant SaaS applications. After detailing the limitations of traditional single-tenant function models and custom framework-based isolation methods, they demonstrate how Tenant Isolation Mode automatically creates independent execution environments for each tenant, providing complete separation of CPU, memory, and disk. Implementation details are shown with concrete code examples, including integration with API Gateway, tenant-specific observability using CloudWatch Logs, noisy neighbor mitigation with usage plans, and obtaining tenant-scoped credentials using STS. The demo confirms that different tenants are processed in independent execution environments using a JWT token-based authentication flow.
This article is automatically generated, maintaining the content of the original lecture as much as possible. Please note that there may be typos or inaccuracies.
Main Content
Session Start and Joe's Story: Learning the Need for Tenant Isolation in Shared Living
Alright, thank you very much for coming. I hope everyone can hear me. So, in today's session, we're going to talk about this new thing that we just launched. But we're not just going to talk about the feature; we're going to talk about the overall use case, which is building secure multi-tenant SaaS applications with Lambda, and diving deep into how you achieve tenant isolation in SaaS applications. We'll first cover the problem, then existing solutions, and then we'll talk about the new solution. I'm Anton. I'm a Principal Solutions Architect for Serverless. This is Bill, he's almost my colleague. Both of us are very passionate about Serverless and SaaS. And by this point, you're probably wondering, who's that third guy, Joe? So, let me introduce you to Joe. He's an average Joe. And this is where our story begins.
So, Joe lives a quiet life in a detached house somewhere in the suburbs. But deep down, he has a big dream to one day become a famous cloud architect. So, every night Joe reads books about cloud architecture and builds prototypes on AWS in his private room. At some point, he decides to pursue his dream and enrolls in a large university in the city. And he moves into new accommodation. He was very excited to meet his roommates, but Joe soon realizes that communal living is not at all what he imagined it to be. He discovers the problem of noisy neighbors. Basically, he can't focus on what he's trying to do.
He also learns what it means when one of the tenants he lives with sometimes leaves the shared environment a little messy. This is also not something he particularly enjoys. And one day, sitting on a bench, he thinks, I wish there was some tenant isolation in a shared environment. This is where Joe discovers the difference between a single-tenant environment and a multi-tenant environment. I think you'll understand the analogy we're trying to make here: private life versus communal living.
Single-tenant vs. Multi-tenant Environments: The Trade-off Between Cost and Isolation
In a single-tenant environment, also known as a dedicated or siloed environment, every tenant has their own dedicated resources, compute, storage, and so on. This model provides the highest level of isolation for those resources, but it can also become quite expensive at scale. You have to maintain those resources per tenant. The more tenants you have, the more resources you have to maintain.
On the other hand, the multi-tenant model, also known as a shared or pooled environment, helps overcome these concerns. In this model, cloud resources can be efficiently shared among multiple tenants. For example, requests coming from different tenants might be processed by the same compute unit, the same function, the same container, the same EC2 instance, essentially a shared compute. As a result, costs are lower, operations are simpler, and you don't have to maintain as much infrastructure.
That said, this model comes with its own considerations. It means multiple tenants will be reusing resources. You need to maintain a certain level of isolation between these tenants. That's an interesting problem to tackle. So, Joe graduates and lands his first job as a cloud engineer. And guess what? His job now is to build multi-tenant cloud applications. And this is where Joe discovers that building multi-tenant applications can actually get quite messy.
Joe's Challenge: The Synergy of Serverless and SaaS, and Achieving Cost Efficiency
That's right, thanks Anton. So, we like Joe. Joe's a good guy. We're going to go to school with Joe. We're going to learn a little bit about his experience building SaaS, about what he learned about multi-tenancy. And he started by joining a company. He learned a little bit about what they wanted to do. They had some business requirements they wanted him to achieve. He couldn't just dive straight into the technology; there were also technical requirements that had to be met.
Now, for our purposes of talking about Lambda, talking about Serverless, there are a couple of key challenges that I want to focus on in terms of what they wanted to achieve from a business perspective, and then how they achieved those from a technical perspective. From Joe's journey perspective, his company wanted to innovate quickly. They wanted to deliver new features quickly to customers, stay ahead of competitors, keep their applications fresh, and keep their customers happy.
To do that, they wanted to have a shared environment, but they also wanted to completely isolate everything. They had customers who demanded that everything be perfect in terms of isolation, and they couldn't deal with the noisy neighbor problem.
And you all know what a noisy neighbor means, right? If one tenant or one customer is doing too much activity, it shouldn't impact other customers. So these are some of the key challenges that Joe faced. And he's a serverless guy. He loves serverless. I'm not going to read all of these out, but of course, serverless and SaaS are a natural fit. Here are a couple of reasons why, but in general, this summarizes that the operational efficiency and cost efficiency you get from serverless is a really beautiful fit.
Now, if you think about what serverless is, it's very much like SaaS. It's cloud-born. There's no on-premise version of serverless. Of course, you can emulate Lambda and run it in different environments, but this is a cloud-born technology. It's awesome, and he loves this, and he embraces all these principles. And I said that serverless and SaaS are a great fit, and cost can be one of the reasons why. So let's look at this and think a little bit about a server-full environment that you've been working on.
When you build a server-full environment, you might have the ability to scale. Of course, you can scale up and scale down as needed. You can do it with containers; you can do it with plain EC2 instances. But as you do that, you're reactively scaling up to peak requirements. We have this much potential load during this time window. We're going to set up as many servers or containers as we need to handle all that traffic. And of course, you're paying for those spikes, right? You're paying for infrastructure that's sitting idle, waiting.
So there are customers who spin up all the infrastructure they need and keep it running all the time. Perhaps not even as efficient as this chart. On the other hand, serverless scales cost and utilization almost equally. As your usage increases, as your number of instances increases, you don't even have to handle that. And of course, the cost you pay for those is almost directly tied to your utilization. So serverless can achieve the optimal ratio of cost to utilization, which is perfect for SaaS. It's especially perfect for SaaS where you see spiky workloads.
Lambda's Concurrency and Isolation Architecture with Firecracker microVMs
How does Lambda achieve this? How does it perform its execution to achieve this beautifully equal chart where cost and utilization are tied together? Let's think a little bit about Lambda concurrency. I'm going to show you a couple of invocations, and we'll call these function invocations. A request comes in, we process it. And when one request comes in, you see it here in purple, that means that the environment we're in, this environment on this line, is now active and processing a request for a customer. And then you see a gray area, which represents idle time. Keep that in mind when you look at this chart.
This is a very simple chart, so we'll go through it quickly. One request comes in, we process it in one environment. Another request comes in, what do we do? We process it in the same environment. It's idle right now, so we'll put it right into the same environment. As we continue on, another request might come in, but this time it's concurrent with the second request. What do we do? Of course, we can spin up another environment.
So, are we spinning up another environment? Not exactly. Lambda and the execution engine behind it are doing it for us. I didn't actively make a choice and say, 'Hey, spin up another environment.' No, it all just happened magically in the background. And now for most of us, this seems like common sense, right? We expect this to happen. When this first came out, it was pretty magical, and it's still pretty cool when you think about all this execution and the scale that you can achieve with SaaS.
Let's continue. Now, another invocation comes in. The first environment is now free, so we'll just execute it in the first environment. Another invocation comes in, two concurrently. Well, we have two free environments, so of course, both can be processed in those environments. Let's continue. What happens if we have three executions? Of course, another environment spins up.
Now, this seems somewhat obvious, right? Okay, this is fine. Bill, why are you telling me this? Well, a couple of things. One, look at that idle bar. We're not paying for that consumption, for those execution environments, even though we could put additional invocations in there. And second, we didn't have to do anything here. This all happened automatically. And the part about that automatic magic that's relevant to the SaaS conversation is not only were we lucky enough to have this done for us, but we also didn't control where a particular execution went. And in a multi-tenant environment, that matters.
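The scheduling behavior described above can be sketched as a toy model: an invocation reuses an idle execution environment if one exists, otherwise a new one is created. This is purely an illustration of the concept, not Lambda's actual placement algorithm.

```python
# Toy model of the scheduling behavior above: reuse an idle environment
# if one exists, otherwise create a new one ("cold start"). Illustrative
# only -- not Lambda's real placement logic.

class ExecutionEnvironmentPool:
    def __init__(self):
        self.environments = []      # every environment ever created
        self.idle = []              # environments currently free to take work

    def invoke(self, request):
        if self.idle:
            env = self.idle.pop()   # reuse an existing idle environment
        else:
            env = {"id": len(self.environments), "handled": []}
            self.environments.append(env)   # no idle capacity: spin up a new one
        env["handled"].append(request)
        return env

    def release(self, env):
        self.idle.append(env)       # invocation finished; environment is idle again


pool = ExecutionEnvironmentPool()
e1 = pool.invoke("req-1")
pool.release(e1)                    # req-1 completes
e2 = pool.invoke("req-2")           # sequential request: same environment reused
e3 = pool.invoke("req-3")           # concurrent with req-2: a second environment
print(len(pool.environments))       # 2 environments served 3 invocations
```

Note that the caller never chooses which environment a request lands in; that decision sits inside the pool, which is exactly the point made above about multi-tenancy.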
Now, the biggest secret of serverless, I don't think I'm letting the cat out of the bag here, is that there actually is a fleet of servers under the hood. If you didn't know this, I apologize for disappointing you. It's not actually serverless. There's EC2, thousands of EC2s, right? We have an entire fleet of EC2s that supports the Lambda fleet, and that's a good thing.
The fact that we don't have to know that, that you could go your whole life without ever understanding that Lambda actually uses EC2, and that could remain a secret to you, is actually a benefit to us. And this is what we wanted to build. But let's dig under the hood. Let's remove the veil of ignorance now and say, 'Ah, now I want to actually understand what's happening here,' because, again, from an isolation perspective, this is particularly important.
Let's look inside these individual EC2s. These are just bare-metal EC2s. Inside the EC2, of course, there's a host OS, which provides a certain level of isolation here. On top of that, there's a kernel, but importantly, on top of that, we use Firecracker microVMs. And Firecracker microVMs have strong isolation around each of them. Nothing is shared between these Firecracker microVMs. So there's already a significant layer of isolation that's there to handle any kind of invocation, whether it's multi-tenant or not. Inside this, inside the microVM, there's a guest kernel. If you're using our runtimes, our managed runtimes, then our runtime is there too. And if you're using Lambda extensions, that's there as well. If you're not using them, you should; they're really cool. And then on top of that, your function code resides. So, all these different layers, like a Russian nesting doll, provide multiple layers of isolation. You can look across this entire stack and say, 'Look, this is protected from this, and this is protected from this.' We've created really strong isolation. This, again, is important for SaaS providers and any provider, but especially in multi-tenant solutions.
Let's look at these boundaries a little bit closer. Between functions, an execution environment is never shared: a single execution environment never serves more than one Lambda function. And within the same function, execution environments are fully isolated from one another. We'll show this with a click, but you have a complete grid, right? They're never sharing the same execution environment, and they never share CPU, disk, or memory. This gives us a slightly different visualization and a different way to think about the problem space we're dealing with, and the challenges Joe might face as he learns how to be a good Lambda developer and architect.
Challenges of Execution Environment Sharing and Data Residue in Multi-tenant Environments
Between different functions, nothing is shared. That's it. Period. We're not sharing CPU, we're not sharing disk, we're not sharing memory. Nothing is shared between those. Go back to the Firecracker visual. All those layers, all those different isolation layers around that apply here. Now, what about the same function? Execution environments belonging to the same function. And remember that chart we had with the different execution environments. And how different invocations were going in there. There are actually some things that could potentially be shared, and this is one of the problem spaces that we deal with. Environment variables, IAM execution roles and permissions, and code can actually be shared across environments.
So let's go back to the same chart. We have our nine executions that we talked about earlier, and they're all shown in the same active purple we saw. But in a multi-tenant environment, our world has a lot more colors. For a SaaS provider, we have to deal with the fact that multiple customers are coming in. And with an invocation on a shared environment, I don't know where it's coming from until runtime. And this is important. The first environment is handling executions from a blue, green, and yellow tenant. And over time, this holds true across different environments. Every tenant is landing randomly in different environments, and we can't control it. We can't choose which of these individual invocations lands in which environment. And this is the crux of the problem. Well, not really a problem, but to be fair, it's something we have to deal with.
So let's jump into a demo that demonstrates the principles we're talking about. Because it's one thing to say these things could potentially be shared, but let's actually see what this means. We have a simple memory counter, and alongside it a disk counter, and I'll tell you what those mean. All they're actually doing, as the names suggest, is writing something to memory or writing something to disk. Pretty basic, right? The memory counter reads its value, outputs it, increments it, and saves it back to memory. And the disk counter is doing the same thing: it reads the value, outputs it, increments it, saves it, and then the function returns those values.
So let's look at Joe. He's learning how to do this, and of course, he's going into his Lambda runtime environment and creating a little test to prove that how this works is actually how he thought it worked. So he runs this very simple test event he has set up, and you can see some output, right?
The logs show a memory counter, a disk counter, and, suspiciously, an undefined tenant ID. We'll see what happens to this tenant ID later when we start thinking about the actual implications. Every time this runs, the disk and memory counters increment and keep incrementing, regardless of which tenant is invoking it. Joe doesn't give it a second thought. Everything's fine, it's great. He can confirm it's working exactly as he designed it.
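For reference, a handler along the lines of the demo might look like the following in Python (the `undefined` in the logs suggests the session's demo was Node.js, so everything here is an illustrative reconstruction, not the speakers' code). Module-level variables and files under `/tmp` both survive between invocations in a warm execution environment, which is exactly what the counters rely on.

```python
import os

COUNTER_FILE = "/tmp/disk_counter.txt"   # /tmp persists for the life of the environment
memory_counter = 0                       # module-level state also persists between invocations


def handler(event, context):
    global memory_counter
    memory_counter += 1                  # survives between invocations in a warm environment

    disk_counter = 0
    if os.path.exists(COUNTER_FILE):
        with open(COUNTER_FILE) as f:
            disk_counter = int(f.read())
    disk_counter += 1
    with open(COUNTER_FILE, "w") as f:
        f.write(str(disk_counter))

    return {
        "memoryCounter": memory_counter,
        "diskCounter": disk_counter,
        # without tenant isolation mode, there is no tenant id on the context
        "tenantId": getattr(context, "tenant_id", None),
    }
```

Two warm invocations in the same environment return counters 1 and then 2, no matter which tenant triggered them; that shared state is the problem the rest of the session addresses.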
So let's define the concerns that Joe should be thinking about, and why he came to re:Invent 2025 this year to learn about what we launched. As each of these tenants come in, they're going to invoke the Lambda function that Joe built. Now, as these invocations occur, we write some things to disk, we write some things to memory. And Joe's happily running this function. He thinks he's done a great job, and of course, he's diligent. We really like Joe and we want him to succeed.
But unfortunately, Joe didn't clean up after his code properly. There are best practices you can follow today to prevent this; unfortunately, Joe didn't know them. So, we have some leftovers from the blue tenant's invocation, and they remain in the execution environment. And remember the execution environments: when we added colors, a lot of different tenants were coming into each execution environment. The yellow tenant comes in, same problem. Now we have even more leftovers. Perhaps they can see the leftovers from the blue tenant. The green tenant comes in, and now they might be able to see leftover data from the blue and yellow tenants.
Now, is this data important? Well, it could be. It could be tenant-specific data, it could reveal information about other tenants, or even the presence of other tenants and who they are, and all of these things can be dangerous for the business we're in. So, it sounds a little crazy, right? Does Lambda have this problem? Is it a problem? No, not really. This is a common concern across all architectures, and I think most of you probably realize this. If you step back from this being Lambda, if you're running a container and you're writing to disk and you're running multiple containers, if you're not careful, other containers are going to start picking up that same data.
And EC2 is the same, right? So, there might be levels of isolation around those. You have to manage those, and this is true for almost any compute environment we've ever encountered. Now, Joe actually came to re:Invent 2025. We can learn with Joe. Joe can go back to being a hero in our story. And I'm going to turn it over to Anton to talk about how we brought Joe back to being a hero in our story.
Existing Solutions: Functions per Tenant and Custom Tenancy Frameworks
Thanks, Bill. So, as Bill said, this is an inherent concern with any compute you use, right? If you're sharing compute units across multiple tenants, you have to make sure you're not leaving any residue. It's doable, absolutely doable. Who is responsible? You are. Let's look at how you deal with that. Lambda is actually very unique in this regard because Lambda execution environments are short-lived. They don't live for days or weeks. We are constantly recycling them.
So even if there is some residue, it's not good. Don't get me wrong, it's bad. But it's going to be recycled in a matter of minutes. So Lambda actually makes it a little less critical, but it's still very critical. Again, don't get me wrong. So let's look at what the existing solutions are to this problem. Yes, existing means there's a new one. I think you're here to hear about the new one, and we're getting there.
Now, before we get into the technical bits, this is important. I work with customers, I'm a Solutions Architect. I have customers who say, "Oh, it's no big deal. All our tenants' data is public anyway, so why do we need to worry about it?" I don't believe that. Because even if you don't have a use case to deal with it today, tomorrow you might have a reason to deal with it, and you'll forget about it. So I strongly recommend not ignoring this problem in a multi-tenant environment.
Another thing I hear, and you're going to love this, is, "We have a wiki page, and we've documented all the best practices for handling tenant information." Does that sound familiar to anyone? Good intentions are great, but they're not enough. We don't want to solve that problem by providing guidance. We want something more concrete, something that actually works for you, beyond guidance. Joe probably missed that, because there's just too much onboarding material. So, let's talk about solutions. Now, the first solution is, in a way, obvious. This is going back to a single-tenant approach. Some organizations, usually the largest ones with the highest security requirements, adopt a function-per-tenant model. Has anyone here used that or heard of it? Yeah, a few hands going up. Basically, each function is single-tenant.
One function handles invocations from only one tenant. This is the highest degree of isolation. You cannot get more isolated than this. But the problem is, it's costly. If you have five tenants, it's not a big deal. But what if you have 500, 5,000, or 50,000 tenants? The benefits of this approach are clearly strong isolation. Cost attribution is a bit easier since you have one function per tenant. You can configure settings per tenant for observability. But there are also considerations. Things like operational sprawl. Yes, these are serverless functions, and there's no infrastructure to maintain. But 50,000 is 50,000. That's a huge number. Good luck managing that with CDK or Terraform. And that's just for one function with 50,000 tenants.
So it becomes hard to maintain at scale. CI/CD becomes more complex. If you need to update 30,000 functions, you might hit management API limits. Tenant onboarding is slower because you have to provision dedicated resources per tenant. You have duplication, version drift, and the need for custom routing layers. There are several considerations if you use this model. So, we only recommend it when you absolutely need that level of isolation. It's not a common model, but it is certainly used in the industry.
CyberArk's Case Study and the Expectation for Vendor-Provided Solutions
A more common model is to have a multi-tenant function where multiple tenants use that function, but within that function, there's some kind of framework or SDK or layer. Different ways of naming it. Essentially, it's a piece of code created by the organization that handles that isolation. It's basically decoupled from your business logic and handles things like, for example, validating the incoming tenant, scoping down credentials, logging at the tenant level, and so on. So it's a piece of code that's implemented by the development organization that basically addresses that problem.
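A framework like the one described above can be pictured as a thin wrapper around the handler. The sketch below is purely illustrative (it is not CyberArk's framework or any real SDK): it validates the incoming tenant, emits a tenant-tagged log line, runs the business logic, and wipes per-request scratch files so nothing lingers for the next tenant that lands in the same environment.

```python
import functools
import glob
import logging
import os

logger = logging.getLogger("tenancy")


def tenant_aware(handler):
    """Illustrative tenancy middleware: validate, log, run, clean up."""
    @functools.wraps(handler)
    def wrapper(event, context):
        tenant_id = event.get("tenantId")
        if not tenant_id:                               # 1. validate the incoming tenant
            raise PermissionError("missing tenantId")
        logger.info("handling request",                 # 2. tenant-tagged logging
                    extra={"tenant": tenant_id})
        try:
            return handler(event, context)              # 3. business logic
        finally:
            for path in glob.glob("/tmp/scratch-*"):    # 4. wipe per-request scratch files
                os.remove(path)
    return wrapper


@tenant_aware
def handler(event, context):
    # The business logic writes an intermediate file; the wrapper guarantees
    # it is gone before the next invocation reuses this environment.
    with open(f"/tmp/scratch-{event['tenantId']}", "w") as f:
        f.write("intermediate result")
    return {"tenant": event["tenantId"]}
```

The point of the pattern is exactly the decoupling mentioned above: the business logic stays tenant-agnostic while the wrapper enforces the tenancy rules.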
This is a very common approach. We see it quite frequently. One of the companies I work with who actively uses this approach is CyberArk. They use Lambda. They're a serverless-first identity provider, and they use Lambda in their multi-tenant SaaS platform. To minimize infrastructure management and enable clean compute isolation, they built a custom tenancy framework. This is fantastic. You can find it on the internet. Oh, by the way, I forgot to mention that the last slide has a huge QR code with all the resources you see today, including these slides. You can get these slides in 37 minutes.
Customers are using this approach. It's quite popular and quite robust. It has many advantages, such as high resource reuse and lower unit cost, because you don't have to manage thousands or tens of thousands of functions. Tenant onboarding is much faster: you don't have to create dedicated resources, it's just a database record. Operations are simpler, feature rollout is rapid, and so on. But we're architects, right? There's always a trade-off. There are also considerations. What if there's a noisy neighbor? If you're using shared compute, you might have noisy neighbors. You might need additional compute-level isolation.
So, I've provided isolation with a custom tenancy framework, but what if I want a higher degree of isolation than that? But I want to avoid creating functions per tenant, because that's hard to maintain. How do you clean up what's left behind? Who's responsible for that? What about observability per tenant? If one function is reused by 10,000 tenants, how do you observe it per tenant? And so on.
So, remember Joe during lunchtime? He was thinking, it would be great if there was a vendor-provided solution for tenant compute isolation. So, late last week, when he came back from lunch, he was very excited to learn that Lambda announced the new Tenant Isolation Mode. Now, I'm about to spend 30 minutes explaining what Tenant Isolation Mode is, but I'm going to explain it in one slide. Because this is a picture worth a thousand words.
Introduction of Tenant Isolation Mode: Achieving Per-Tenant Execution Environment Isolation
You've seen this before. Tenants are accessing a function. In the new Tenant Isolation Mode, you pass us some tenant ID. We don't care what it is. We'll talk about this in the next slide, but you need to give us some unique tenant identifier. What happens internally is that Lambda, for that single function, creates a separate execution environment for each tenant.
The execution environments are never reused across different tenants. Let's look at an example. A demo is the best way. So, if you've used Lambda in the last five days, you've probably noticed this. If you enable tenant isolation, there's a new property called Tenant ID. So, remember the demo? Let's evolve it. We specify a Tenant ID. Let's say BlueTenant. And let's invoke the function. So, first we see that the Tenant ID is no longer undefined. And we see the counter incrementing. Why? Because BlueTenant is hitting an execution environment owned by BlueTenant. Let's go up to 5. Remember that number, 5. It's important. Look up. We switch the tenant to GreenTenant. And we invoke the exact same function. No redeployment. We see the counter starting from zero. The counter starts from zero because now the execution environment belonging to GreenTenant is not overlapping with BlueTenant. BlueTenant's execution environment is still there. It's just that we changed the Tenant ID to GreenTenant, so the request no longer goes there.
So, let's let that counter go up to about 9, 9 or 10. Let's bring it back. Look up. We're changing GreenTenant back to BlueTenant. Remember 5? Let's invoke that function. 6, 7, and so on. All tenants, which we identify by the Tenant ID you provide, have different execution environments. Let's change it to OrangeTenant. Something we haven't seen before. If we invoke it, it starts from zero. Why? Because this is a new tenant. Now, these counters are just a simple example, but imagine loading tenant-specific configurations, tenant-specific data, database connection strings. Caching tenant-specific information.
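Conceptually, what the demo just showed can be modeled in a few lines: one private environment per tenant ID, created on first use and never shared. This is a toy model of the behavior, not how Lambda implements it.

```python
# Toy model of tenant isolation mode: each tenant id maps to its own
# private environment state, created on first use and never shared.

environments = {}   # tenant_id -> that tenant's private environment


def invoke(tenant_id):
    # A fresh environment is created the first time a tenant id is seen.
    env = environments.setdefault(tenant_id, {"memory_counter": 0})
    env["memory_counter"] += 1
    return env["memory_counter"]


for _ in range(5):
    invoke("BlueTenant")           # BlueTenant's counter climbs to 5
print(invoke("GreenTenant"))       # 1 -- new tenant, fresh environment
print(invoke("BlueTenant"))        # 6 -- BlueTenant's environment was preserved
```

Swapping the tenant ID switches which environment the invocation lands in, which is exactly why the demo's counters restart from zero for a new tenant and pick up where they left off for a returning one.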
Previously, for multi-tenant functions, it was up to you to clean up that residual data between requests. And you can, it's totally doable; it's not new, it's been done for years in shared compute. But now those compute environments are isolated per tenant, so data will never be shared between different tenants. So, let's look at how this actually works.
First, how do you create a function with tenant isolation mode? This is a new property, and it's already supported by CDK, Terraform, and a couple of other Infrastructure as Code tools. When you create a function, you literally specify a new property called tenancy config, setting the tenant isolation mode to per-tenant. So we are isolating compute per tenant. This is only available when you create a new function; it's not something you can change for an existing function. Because this is security. We don't want to play games where you change this setting and something unexpected happens. Security is no joke.
Next, when you invoke these functions, you have to provide the Tenant ID. Again, it's a parameter. It's only required if you enable tenant isolation mode on a specific function. If you don't provide a Tenant ID, there's no default. You'll get an error saying the parameter is missing. This is a mandatory required parameter that you must provide. That's it. Nothing else. That's it. That's how you create it. That's how you use it. This might have been a very short session. We tried to make this very, very simple, but let's get into more advanced topics.
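As a concrete sketch, creating and invoking such a function with boto3 might look like the following. The property and parameter names here (`TenancyConfig`, `TenantIsolationMode`, `TenantId`) follow what the session describes on screen; treat the exact shapes as assumptions and confirm them against the current Lambda API reference before relying on them.

```python
# Sketch of the create/invoke flow for a tenant-isolated function.
# Parameter names follow the session and are assumptions; the role ARN
# and function name are placeholders.

create_function_params = {
    "FunctionName": "joe-counter",
    "Runtime": "python3.13",
    "Handler": "app.handler",
    "Role": "arn:aws:iam::123456789012:role/joe-counter-role",  # placeholder
    "Code": {"ZipFile": b"..."},
    # Can only be set at creation time; it cannot be changed on an
    # existing function.
    "TenancyConfig": {"TenantIsolationMode": "PER_TENANT"},
}

invoke_params = {
    "FunctionName": "joe-counter",
    "Payload": b"{}",
    # Mandatory once tenant isolation mode is enabled; omitting it
    # produces a missing-parameter error.
    "TenantId": "BlueTenant",
}

# With real credentials this would be:
# client = boto3.client("lambda")
# client.create_function(**create_function_params)
# client.invoke(**invoke_params)
```

The AWS calls themselves are commented out so the sketch stands alone; the shape of the two parameter dictionaries is the part the session walks through.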
So you're probably asking, hmm, okay, Tenant ID, we're passing that Tenant ID into the function. Is there any way I can access that Tenant ID from within my code? Because I probably want to perform some tenant-specific logic. The answer is yes. We propagate that Tenant ID into the function handler. So your context object has, surprisingly enough, a new property called Tenant ID. It is that exact Tenant ID that you passed to us. So it's available in your code easily if you need to perform any branching logic based on the Tenant ID.
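Inside the handler, that propagated ID can drive tenant-specific branching. The attribute name on the context object depends on the runtime; `tenant_id` is assumed here for Python, so verify it against the runtime documentation.

```python
# Sketch of tenant-specific branching on the propagated tenant id.
# `context.tenant_id` is an assumed Python attribute name; the tier
# lookup is purely illustrative.

PREMIUM_TENANTS = {"BlueTenant"}


def handler(event, context):
    tenant_id = context.tenant_id   # the same id passed on Invoke
    tier = "premium" if tenant_id in PREMIUM_TENANTS else "standard"
    return {"tenantId": tenant_id, "tier": tier}
```

Because the ID arrives on the context rather than in the event payload, the business logic cannot be tricked by a caller-supplied field; it always sees the ID Lambda routed the invocation by.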
So let's talk a little bit about how you use it. First of all, there's no need to pre-register tenants. We're not asking you to provide a list of tenants. Absolutely no pre-registration required. Second, there's no limit on the number of tenants. There's no quota like 10,000 tenants; you can have as many tenants as you want. There are considerations, and we'll talk about them, but essentially, an unlimited number of tenants.
Tenant IDs can be any alphanumeric string up to 128 characters. It can be a GUID, it can be whatever unique identifier your system already has. Any alphanumeric string up to 128 characters, we don't care. Finally, this is obviously supported both with ZIP and container images. We want you to enjoy this new feature.
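A simple client-side sanity check based on the constraints just described might look like this. The session says "any alphanumeric string up to 128 characters" but also mentions GUIDs, which contain hyphens, so the hyphen is included here as an assumption; confirm the exact allowed character set against the Lambda documentation.

```python
import re

# Illustrative validator for the constraints described in the session:
# 1-128 characters, alphanumeric. The hyphen is an assumption added so
# that GUIDs pass, since the session says GUIDs work.
TENANT_ID_PATTERN = re.compile(r"^[A-Za-z0-9-]{1,128}$")


def is_valid_tenant_id(tenant_id: str) -> bool:
    return bool(TENANT_ID_PATTERN.fullmatch(tenant_id))


assert is_valid_tenant_id("BlueTenant")
assert is_valid_tenant_id("3f8e2c1a-9b7d-4e6f-8a2b-1c3d5e7f9a0b")  # a GUID
assert not is_valid_tenant_id("")                                   # empty
assert not is_valid_tenant_id("x" * 129)                            # too long
```

Validating early is worthwhile because the ID is also a routing key: a typo doesn't fail the invocation, it silently creates (and cold-starts) a brand-new tenant environment.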
So far so good, but there are obviously considerations. It's always important to know the considerations for each feature. First and foremost, because we're creating a separate execution environment per tenant, cold starts are also per tenant. Since you need compute isolation, tenants will no longer share execution environments, so cold starts will now occur per tenant. This is an important point to note. Tenants with many invocations probably won't notice this. Tenants who invoke it only about three times a day will probably have a cold start every time. This is important to remember.
Concurrency quotas still apply. Remember, we're creating more execution environments now. Previously, you might have handled 10 tenants with one execution environment, but now 10 tenants means at least 10 execution environments. So concurrency quotas will still apply. Account-level concurrency quotas, bursts, and so on. We detail this very well in the documentation. Provisioned Concurrency is not supported for obvious reasons. Provisioned Concurrency means pre-warming a pool of execution environments, but if we don't know who the tenant is, we can't pre-warm it for them. So, at this point, we don't know what the future holds, but currently, Provisioned Concurrency is not available with this feature. And currently, it's only supported with direct invocation or API Gateway integration. If you have your own control plane and you're invoking Lambda, it's easy to do, and if you're using API Gateway, we'll show you how it works with API Gateway in a few slides.
Tenant-level Observability: Leveraging Logs and Metrics
So, the next big question is, what about observability? We have compute isolation. How do we achieve tenant-level observability? First and foremost, if you're not already using JSON-formatted logging, please do. We highly recommend it: it makes your job easier, because these are structured logs that you can query by field. When you enable JSON-based logging, as you see on the screen, the tenant ID is automatically injected into your logs. You no longer need a special log line in your code to manually output the tenant ID; it's injected automatically.
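To make the benefit concrete, here is roughly what a JSON-formatted record looks like with the tenant ID injected. The exact field layout is illustrative; the point is that the tenant arrives as a queryable field rather than free text buried in a message.

```python
import json

# An illustrative JSON-formatted Lambda log record with the tenant id
# injected as its own field (field names are an approximation).
raw_line = (
    '{"timestamp": "2025-12-01T10:00:00Z", "level": "INFO", '
    '"message": "counter incremented", "tenantId": "BlueTenant"}'
)

record = json.loads(raw_line)

# Structured logging turns tenant filtering into a field lookup
# instead of a fragile substring search over plain-text lines.
print(record["tenantId"])   # BlueTenant
```

Any log pipeline that understands JSON (CloudWatch Logs filter patterns, Logs Insights, or your own tooling) can now slice by `tenantId` directly.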
But how do we actually process these logs? How do we actually leverage them? As you probably know, Lambda can automatically send logs to CloudWatch, S3, or Firehose. The default is CloudWatch, and it's probably the most common, so we'll talk about CloudWatch today. By default, each function gets its own log group. This has worked this way for years; it's not new. Within that log group, you'll see a collection of log streams. This is also not new: each execution environment gets its own log stream. Nothing new so far. What's new is that these log streams are now tenant-specific. Because you have tenant-specific execution environments, these log streams will also be tenant-specific. To summarize: each tenant will have multiple log streams, and each log stream will belong to only a single tenant. Make sense?
So let's delve a little deeper. Now, as you probably know, these names are not entirely random. They actually have meaning. They start with the date, then the function name, then the function version, and finally a random execution environment ID. If you understand this structure, you can make your queries a little more powerful. For example, let's start with a development scenario: you want to observe tenant-specific logs in real time. You can use Live Tail for that. This is a CloudWatch feature, and it's also integrated into the Lambda console. You can select a log group. Remember, the log group maps to the function, so selecting multiple log groups means selecting multiple functions. You can also select a log stream, which means you can select a stream belonging to a specific tenant.
For example, you can also specify the tenant ID as a filter pattern. Here we know it's the blue tenant. The result is this: because we're using Live Tail, we get live logs, filtered to a single tenant ID. If you don't use this filtering, you'll get all the logs. But if you need observability for one specific tenant, you can easily achieve that with this filtering. Just specify the tenant ID, and it will filter for you.
But this is live logs. What if you have three weeks or three months of logs and you want to get logs for a specific tenant? In that case, you can use CloudWatch Logs Insights. Again, this is all query-based logging. Here's an example of a Logs Insights query. This will get you a list of log streams for a specific tenant. I'm thinking in my head, okay, I have a blue tenant. I want to see all the log streams belonging to this particular tenant. For example, if you run this query, you'll get a result like this. You can see that you got three log streams belonging to that tenant. You can limit by time or count, standard queries. And here you can also see the log count. So you can see the log streams belonging to a specific tenant.
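The stream-name structure described above is also something you can work with programmatically. As a rough sketch, here is a parser for that naming pattern. Note that the exact format is an assumption based on the pattern described in the session (date, version, execution environment ID); inspect your own log streams before relying on it.

```typescript
// Sketch: parse a Lambda log stream name of the shape described in the talk.
// The exact format is an assumption; verify against your own log streams.
interface LogStreamParts {
  date: string;          // e.g. "2025/12/04"
  version: string;       // e.g. "$LATEST"
  environmentId: string; // random execution environment ID
}

function parseLogStreamName(name: string): LogStreamParts | null {
  // Assumed shape: 2025/12/04/[$LATEST]abc123def456...
  const match = name.match(/^(\d{4}\/\d{2}\/\d{2})\/\[([^\]]+)\](.+)$/);
  if (match === null) return null;
  return { date: match[1], version: match[2], environmentId: match[3] };
}
```

In tenant isolation mode, each such stream maps to exactly one tenant, so grouping streams by tenant comes down to joining this stream metadata with the tenant ID field inside the log events themselves.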
You can make it even simpler. What if you're not interested in seeing which log stream it is? What if you just want to see the logs for a specific tenant and you don't care how they're distributed across different log streams? Because obviously, each tenant will have multiple execution environments. Each is tenant-specific, but there are multiple. So let's say you run a query and you want to get all messages for BlueTenant and you want to limit it to 1000. Here, the query will return tenant-specific logs stored in your CloudWatch log groups and log streams. Now, if you're using a third-party observability provider, they'll have very similar capabilities, some kind of filtering capability. Because again, with structured logs, the tenant ID is just one of the properties in your JSON object. It's very easy to query.
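Because the tenant ID is just another property on each JSON log record, filtering is trivial wherever the logs end up. A minimal sketch of the idea (the tenantId field name follows the session; confirm the exact key that Lambda injects in your own logs):

```typescript
// Sketch: filter structured JSON log events down to a single tenant.
// The "tenantId" key is the one shown in the session; confirm the exact
// field name emitted in your logs.
interface LogRecord {
  timestamp: string;
  level: string;
  message: string;
  tenantId?: string;
}

function logsForTenant(
  records: LogRecord[],
  tenantId: string,
  limit = 1000
): LogRecord[] {
  return records.filter((r) => r.tenantId === tenantId).slice(0, limit);
}
```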
But what about beyond logs? What else can we do that's a little more sophisticated and advanced? Well, how about sending tenant-specific custom business metrics? So, how many of you here use Powertools for AWS Lambda? Quite a lot of you know it. Here's an interesting example. Powertools is an open-source library that you can use with Lambda, supporting a number of languages and runtimes. Here's a Node.js example. We're creating a new metrics object. Why? Because we want to start emitting tenant-specific metrics. Not logs, but metrics. Within the handler, we're adding a tenant dimension to the metrics. From this point on, all metrics emitted from the function will be tenant-specific. You can still emit general metrics that are not tenant-specific, but now you can also emit tenant-specific metrics. This is very powerful. So, at the very bottom, you can add successful bookings and expose those metrics. And now you can get that information per tenant. So, the story for per-tenant observability is also very powerful.
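Under the hood, Powertools emits CloudWatch Embedded Metric Format (EMF): a JSON blob printed to stdout that CloudWatch converts into metrics. As a hedged sketch of the idea (this mimics the published EMF layout, not the Powertools API itself; namespace and metric names here are illustrative), a tenant dimension is just one more entry in the dimension set:

```typescript
// Sketch of an Embedded Metric Format (EMF) payload with a tenant dimension.
// This mimics what a library like Powertools produces; the layout follows
// the public EMF specification, but treat the names here as illustrative.
function tenantMetric(
  namespace: string,
  tenantId: string,
  metricName: string,
  value: number
): Record<string, unknown> {
  return {
    _aws: {
      Timestamp: Date.now(),
      CloudWatchMetrics: [
        {
          Namespace: namespace,
          Dimensions: [["tenant"]], // per-tenant dimension set
          Metrics: [{ Name: metricName, Unit: "Count" }],
        },
      ],
    },
    tenant: tenantId,          // dimension value
    [metricName]: value,       // metric value
  };
}

// Emitting the metric is just printing the JSON to stdout:
// console.log(JSON.stringify(tenantMetric("Bookings", "BlueTenant", "SuccessfulBookings", 1)));
```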
Integration with API Gateway: Tenant ID Propagation and Tenant-Scoped Credentials
Let's talk a little bit about integration with API Gateway. This is a very common scenario: tenants reaching Lambda via an API. As you know, if tenant isolation mode is enabled, Lambda expects the tenant ID parameter to be sent with each invocation. But how does it actually work at the wire protocol level? Let's go into the details. From a protocol perspective, the tenant ID is propagated to Lambda via an HTTP header called X-Amz-Tenant-Id, and the value of that header will be BlueTenant or GreenTenant, or whatever value you specified.
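At the wire level, that is just one extra HTTP header on the invocation request. A tiny sketch of what a caller, such as a custom control plane, would add; the header name is the one given in the session, and signing plus the rest of the Invoke request would be handled by the AWS SDK as usual:

```typescript
// Sketch: add the tenant header for a direct Lambda invocation in tenant
// isolation mode. X-Amz-Tenant-Id is the header named in the session; the
// remainder of the Invoke request is out of scope here.
function withTenantHeader(
  headers: Record<string, string>,
  tenantId: string
): Record<string, string> {
  return { ...headers, "X-Amz-Tenant-Id": tenantId };
}
```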
But the big question is, where does that value come from in the original request? We know what Lambda expects, but you might have different use cases. For example, what if you're passing the tenant ID using an HTTP header, or a query parameter, or a path parameter? All of these are possible. What if you want to use a domain prefix? If you're creating domains, subdomains for each tenant as a way to identify them, and you want to use that as the tenant ID? It's possible.
What about information coming from a Lambda authorizer? What about JWT claims, or accounts, or other information? All of these are possible. So let's look at how to do it.
First, let's define some terminology. This is API Gateway terminology that I'll be using for the next five minutes or so. It's important to understand this. An inbound request to API Gateway is called a method request. Because it hits a specific HTTP method, such as GET or POST. So inbound is a method request, outbound from API Gateway to Lambda is called an integration request. Because it's a downstream backend integration.
Simple enough. What I want to do is, for example, take some custom header from the method request; I'm using x-tenant-id. This is an arbitrary name. You can name it whatever you want. I'm just using it as an example. I want to take the value of that inbound x-tenant-id and use it as the value for the X-Amz-Tenant-Id header that gets sent to Lambda. This is what I want to achieve. How do I do that?
I'll use CDK as an example. Of course, you can do this with other Infrastructure as Code tools, or you can click around in the console. First and foremost, when you create a resource and create a GET method on that resource, you can specify. This is why I had to do the terminology explanation first, but you can specify method.request.header.x-tenant-id true. What this setting is essentially telling API Gateway is that I'm expecting this x-tenant-id header, and it's required. True means required. Because I don't want to process requests without a tenant ID.
The second thing is when you're setting up the integration. Again, this is why I had to do the terminology part first. We're saying that we want to map the integration request header's X-Amz-Tenant-Id to the value of the method request header's x-tenant-id. In essence, we're saying that this is the header I want to pass to Lambda, to the downstream integration, and this is the source of the value I want to use. And this is just one example of how you can map any custom HTTP header as your tenant identifier.
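In CDK terms, the two settings described above boil down to two small maps: one on the method request declaring the inbound header as required, and one on the integration mapping it to the header Lambda expects. A sketch of just those two maps (the x-tenant-id name is arbitrary, as noted; the surrounding RestApi, Method, and LambdaIntegration wiring is omitted):

```typescript
// Sketch of the two API Gateway parameter maps described above. These are
// the values you would hand to the CDK Method and integration constructs;
// the full RestApi setup is omitted.

// Method request: declare the custom inbound header, and require it.
const methodRequestParameters: Record<string, boolean> = {
  "method.request.header.x-tenant-id": true, // true = required
};

// Integration request: map the inbound header onto the header Lambda expects.
const integrationRequestParameters: Record<string, string> = {
  "integration.request.header.X-Amz-Tenant-Id":
    "method.request.header.x-tenant-id",
};
```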
This is just one example. This value can be obtained from many different sources, as I said earlier. You can use any request header, any request query parameter, path parameter, request body, authorizer principal ID, or any custom property returned from your authorizer. Domain prefixes can also be used for that. Basically, anything you can access in API Gateway can be used as the value for your tenant ID. We have samples published. The link will be shown at the end. We have published sample code that shows you how to achieve this through an authorizer. JWT tokens and authorizers, how do you do it? Very simple, a few lines of code.
Since we're already talking about API Gateway, let's address the noisy neighbor problem. This is one of the things we discussed earlier. API Gateway provides the ability to create usage plans. For example, this is a common approach. You have different tiers like silver, gold, and bronze, and you can define request rates for each tier. For example, 10 requests per second, 30, 20, and so on.
If you map these tiers to your tenants, you get protection against noisy neighbors. With API Gateway, you can implement noisy neighbor protection before these requests even reach Lambda. You don't have to do anything in your code; you just do it at the API level, and the service handles it for you. And again, you can use the same tenant ID that you used previously, the one coming from the client request. You can use that same tenant ID to identify which plan that tenant is associated with. So, API Gateway helps here as well.
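Conceptually, mapping tiers to throttle settings is a small lookup table. A sketch with tier names from the talk; the rates and burst values here are illustrative, as is the choice of bronze as the default for unmapped tenants:

```typescript
// Sketch: tier -> throttle settings, as consumed by API Gateway usage plans.
// Tier names come from the talk; the numbers are illustrative.
type Tier = "bronze" | "silver" | "gold";

interface Throttle {
  rateLimit: number;  // steady-state requests per second
  burstLimit: number; // short-term burst allowance
}

const tierThrottles: Record<Tier, Throttle> = {
  bronze: { rateLimit: 10, burstLimit: 20 },
  silver: { rateLimit: 20, burstLimit: 40 },
  gold: { rateLimit: 30, burstLimit: 60 },
};

// Resolve a tenant's throttle from whatever maps tenants to tiers.
function throttleForTenant(
  tenantTier: Record<string, Tier>,
  tenantId: string
): Throttle {
  return tierThrottles[tenantTier[tenantId] ?? "bronze"];
}
```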
So let's move on. Quick question. When we talk about real applications, what's missing from this diagram? Storage, dependencies, that is. The architecture doesn't end with Lambda. Lambda doesn't operate in a vacuum. It executes something. It needs to access something downstream like a database or S3. It needs to communicate with something. So obviously, there are some dependencies that Lambda needs to access. As you can see here, an S3 bucket, DynamoDB, SQS, and many others.
How did we used to do it? Well, we use the SDK, right? And how do we manage IAM? We use the function execution role. This is important. Very important from a security perspective. The function execution role is exactly that: a function-level role, not a tenant execution role. That means it applies at the function level.
Whatever the Function execution role allows, all tenants will be able to perform. This is a function-level construct. Very similar to what Bill showed earlier. For example, environment variables are also a function-level construct, not a tenant-level construct. But the big question is, can we be a little more granular? So let's see how we can do that.
A typical scenario is when clients use JWT. A very common scenario for authorization. Now, since we're using API Gateway, the token is propagated to the Lambda Authorizer, right? And within that Authorizer, we can do some pretty cool things. First and foremost, obviously, we validate the token. That's why we put the Authorizer there. So, we want to extract information from that token. We can validate against Cognito or any identity provider you're using. So the first thing you do is obviously validate that token and probably extract some identity. Second, optionally, you might have some custom logic, some custom policies you want to apply there. So, not only is this token valid, but what are the permissions, the custom permissions for that token? Maybe you're storing it in DynamoDB, or you might have some external system. You validate additional permissions for that specific token. But the next thing is actually pretty cool. You can get tenant-scoped short-term credentials from STS.
So at this point, by the time we get to the third bullet point here, we've validated that tenant. We know that this tenant is indeed who they claim to be, and their policies allow them to do what they're asking. So what we can do here is, instead of a function execution role, let's get tenant-scoped credentials specific to this tenant. And we propagate that information from API Gateway to Lambda. So what that actually means is that now your Lambda function has two sets of credentials. The Function execution role allows the function to perform actions that are not tenant-specific. It's quite common for all tenants to be able to access shared storage that contains information all tenants need to access. But tenant-scoped credentials actually restrict the function to only access data that this specific tenant can access. Let's look at this example.
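One common way to obtain such tenant-scoped credentials is STS AssumeRole with an inline session policy that narrows a broader role down to this tenant's resources. Here is a sketch of building that policy document; the bucket-per-tenant naming convention is an assumption for illustration, not something prescribed by the session:

```typescript
// Sketch: an IAM session policy scoping S3 access to one tenant's bucket.
// Pass the resulting string as the Policy parameter of an STS AssumeRole
// call made from the Lambda authorizer. The "tenant-<id>" bucket naming is
// an assumption for illustration.
function tenantSessionPolicy(tenantId: string): string {
  const bucketArn = `arn:aws:s3:::tenant-${tenantId.toLowerCase()}`;
  return JSON.stringify({
    Version: "2012-10-17",
    Statement: [
      {
        Effect: "Allow",
        Action: ["s3:GetObject", "s3:PutObject"],
        Resource: [`${bucketArn}/*`],
      },
    ],
  });
}
```

A useful property of session policies is that the effective permissions are the intersection of the assumed role's policy and this document, so even a bug here can never grant more than the base role allows.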
So, let's see how this works. A request comes from BlueTenant. That tenant hits API Gateway. API Gateway forwards the request to Lambda. Obviously, BlueTenant's execution environment is being used. At this point, that BlueTenant's execution environment can access the shared bucket. There's a shared bucket there, right? It accesses that shared bucket using the function execution role. But in addition, it can also access the Blue bucket. Why? Because it has tenant-scoped credentials that allow it to access the Blue bucket. What it cannot do is access the Green bucket or the Yellow bucket. Why? Because the permissions available to that function with this current request do not allow access to these two buckets belonging to two different tenants.
And when the next request comes from YellowTenant, for example, it's processed by Yellow's execution environment. Yellow's execution environment can access the shared bucket. Anyone can access the shared bucket. It cannot access Blue, it cannot access Green, but it can access the Yellow bucket. Why? Because it has function-level credentials, and then it has tenant-scoped credentials. Okay, it's time to wrap up. We have about 10 minutes left. So obviously, I need to show you an end-to-end demo to prove that this actually works. And all of this is available on GitHub. You can see it in a moment.
End-to-End Demo and Summary: Joe's Happy Ending and Future Prospects
Before I show you the demo, I want to explain what's going to happen, because the demo is just going to be me sending three requests through Postman, and if you don't understand what's happening internally, it won't be that impressive. First, we have two tenants, and we pass the tenant ID as a JWT claim. We have an Authorizer function that receives that JWT and validates it. These are real, proper JWTs. It validates the token and returns a context that includes the tenant ID. We're not using the usageIdentifierKey here; it's optional, and we don't need it for this demo, but you can use it if you also want to implement usage plans. So, the context returned from the Authorizer response includes the tenant ID. What do we do next? When we define the Lambda integration, we say integration.request.header.X-Amz-Tenant-Id, and its value comes from the authorizer context's tenant ID.
This is the most important part of this slide. I hope this is clear. Here, we are saying that we get the tenant ID from the Authorizer response and use it to make a request to the Lambda function as that header. This is like a single line configuration, and all the magic happens here. Once you configure this, the request from API Gateway to Lambda will actually have that X-Amz-Tenant-Id header saying BlueTenant.
So, we validate the token in the Authorizer. We return the tenant ID, well, we return the string we want to use as the tenant ID. You don't have to return the exact same tenant ID. It might be sensitive information. You might want to anonymize it, or generate a hash from it. As long as it's a unique string, we don't care what it means to you. It can be anything. And that BlueTenant is propagated to my multi-tenant function. It's processed by BlueTenant's execution environment.
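Putting the authorizer side together: after validating the JWT, the function returns an IAM policy plus a context object carrying the tenant ID, which the integration mapping then forwards as X-Amz-Tenant-Id. A sketch of the standard REST API Lambda authorizer response shape; JWT validation itself is omitted, and the context key name tenantId is an assumption for illustration:

```typescript
// Sketch: a Lambda (REST API) authorizer response that propagates the
// tenant ID via context. JWT validation is omitted; tenantId would come
// from a validated claim (hash or anonymize it if the raw value is
// sensitive, as noted in the talk).
function authorizerResponse(
  principalId: string,
  tenantId: string,
  methodArn: string
) {
  return {
    principalId,
    policyDocument: {
      Version: "2012-10-17",
      Statement: [
        { Action: "execute-api:Invoke", Effect: "Allow", Resource: methodArn },
      ],
    },
    // Anything placed in context becomes available to integration request
    // mappings as context.authorizer.<key>.
    context: { tenantId },
  };
}
```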
Now that we understand how the magic works internally, let's actually see the magic. So, this time we're using Postman, because with API Gateway in front, it's no longer just Lambda. I'm sending a few requests here, and since I don't have a token yet, I'm getting an "unauthorized" message back. You can see it here. Now, let's add an authorization header. I'm a big fan of JWTs, and this JWT, as you might guess, represents BlueTenant. And it's the same situation: the counter is running, 4, 5, and so on. Remember, BlueTenant stopped at 5.
Now I'll replace it with another token belonging to GreenTenant. Let's send a new request. You can see this is GreenTenant. And the counter starts from zero. Why? Again, because GreenTenant gets a different set of execution environments. No virtual resources are shared. CPU, memory, disk, nothing is shared between tenants. We reached 10. Now, let's switch back to BlueTenant. When I execute it, it continues with 6, 7. So, it's basically the same demo we saw earlier, but this time we have full integration with API Gateway, and it includes the authentication component as well.
This is a slide I showed you about 20-something minutes ago. The multi-tenant function model has benefits, but it also has considerations. I think in these past 25 minutes or so, I've been able to demonstrate that Tenant Isolation Mode has addressed most of these considerations. We addressed the potential issue of noisy neighbors through integration with API Gateway usage plans. We addressed the issue where your architecture might require additional compute isolation. This is not something you handle in code; it's something we provide. Now you have it.
Regarding cleanup of residual state, it's still good practice to clean up after each invocation, but we've reduced how critical it is, because compute environments are never shared between tenants. Even if you introduce a bug, fixing it is still a good idea, but it's less critical. As for the difficulty of per-tenant observability: now you have per-tenant logs, you can emit per-tenant metrics, and you get all of that.
To be very fair, tenant-specific feature rollout is still difficult. Let's keep this one yellow. This isn't exactly a problem we've solved. It's become easier because the tenant information is available in the function code, but you still have to write things like if (tenantId === ...) or switch (tenantId). You still have to do something in your code, but it's a bit easier. So, to be fair, I'll keep it yellow.
Now, here our story has a happy ending. Joe found a way for his team to not have to create hundreds of thousands of functions, or separate functions with duplicate code. They only need to create a few functions, and Lambda provides a separate execution environment for each tenant. So, the Tenant Compute Isolation Mode essentially gave Joe's team the ability to focus on delivering business value. They could innovate faster. This was another problem they had to solve before, but now they can focus on what really matters.
In conclusion, understand your workload's tenant isolation requirements and the implementation details in your architecture. About this Tenant Isolation Mode, well, this will be on YouTube, so don't quote me, don't quote me. As far as I know, Lambda is the only service that provides tenant-level compute isolation within a single compute unit, the function. Personally, I don't know of anything else that does that. But that doesn't mean it solves 100% of all use cases.
In some cases, a customer's strict requirement might be to have a function per tenant. Sometimes they might even demand isolation not just per function, but across different accounts. So, the first thing to understand is what your customer wants, and how to meet those customer demands in the most efficient way. Tenant Isolation Mode helps make that more efficient.
The second thing is, leverage Tenant Isolation Mode for multi-tenant applications that require more advanced vendor-provided compute isolation. Was this a blocker before? Not necessarily, but who was responsible for it? You were. Now, we're saying we'll help you with that. We provide vendor-provided compute isolation per tenant.
The third thing is, use built-in observability features like tenant monitoring and API Gateway integration for stronger security in SaaS applications. Now, this slide has some other very interesting sessions related to this, but it's Thursday afternoon, so you'll probably be watching them on YouTube, not live.
If you don't know about Serverless Land yet, it's a website managed by our Solutions Architects and Developer Advocates. It has hundreds, if not thousands, of pieces of information, samples, and reference architectures. We host Serverless Office Hours every week on YouTube and Twitch, where we talk in detail about each new feature. It's very technical content, with no marketing. If you haven't subscribed yet, I highly recommend you do.
And finally, something I promised but hadn't shown you yet, is this giant QR code. With this, you can get everything you saw today: the slides, sample code, additional links, everything. Really, everything. Thank you all very much for coming. I hope we were helpful. I hope you learned about this new feature. Bill and I, and Irish, who is the product manager for this fantastic feature, will be right outside if you have any questions, and we'll be happy to answer them. We'll be right outside. Thank you very much, and enjoy re:Invent.
This article was automatically generated using Amazon Bedrock, maintaining the original video information as much as possible. Please note that typos or incorrect information may be present.