
How I Nearly Caused a System-Wide Failure by Applying Fail-Fast to RLS


Introduction

"If there is abnormal data, it's better to detect it early and let the process crash." The idea is that crashing explicitly and noticing the problem beats covering up a bug. This is the so-called fail-fast policy, and I have basically always liked it.

However, while refining the design of Supabase RLS policies for a personal project, I almost applied this policy as-is. I was one step away from writing a condition that could make entire queries fail, dragging in unrelated rows. I noticed before writing it and went with the lenient version, so no accident happened. Had I trusted fail-fast blindly, I would have discovered the problem for the first time in production, to my horror.

What I learned was less that "fail-fast is not a panacea" and more that using fail-fast well means choosing where to apply it.

In this article, I use the trap I almost fell into to sort out where you should let things crash.

The story of almost causing a cascading failure

In Supabase Storage, I wanted to write a rule that lets users upload only to their own folders.

I assumed that the name column of storage.objects would contain paths like <user uuid>/avatar.png, where the first-level folder is the user's auth.uid() value, and that the policy would permit access when that folder name matches the user's uuid.

The first version I came up with was this:

-- A stricter approach
bucket_id = 'user-uploads'
  and (storage.foldername(name))[1]::uuid = auth.uid()

This casts the first-level folder to uuid and compares it directly with auth.uid(). My feeling was that "enforcing with types is safer than comparing strings." From a fail-fast point of view, I also thought it would be safer if the cast crashed whenever abnormal data (a string that is not in uuid format) showed up.

However, an accident is quietly waiting in this expression.

The danger is that if the first-level folder of a row being evaluated is not in uuid format, the ::uuid cast raises an error. RLS policy expressions are evaluated against each row a query touches, so an exception raised during the authorization check does not mean "deny that row"; it aborts the whole query with an error.

Moreover, PostgreSQL does not guarantee the evaluation order of AND conditions. You cannot rely on "I filtered by bucket_id first, so rows in other buckets never reach ::uuid." In other words, slipping a single conversion that can fail into the authorization condition means one piece of unexpected data can turn even a normal user's access into a query error.
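The failure mode can be reproduced without Supabase. The following is a minimal sketch in plain PostgreSQL, assuming a hypothetical table demo_objects that stands in for storage.objects and string_to_array standing in for storage.foldername; the uuid values are made up. A policy's USING expression is applied to a query much like an extra WHERE condition, so a plain WHERE clause is enough to show the effect.

```sql
-- Hypothetical stand-in for storage.objects (names and values are illustrative).
create temporary table demo_objects (
  bucket_id text not null,
  name      text not null
);

insert into demo_objects (bucket_id, name) values
  ('user-uploads', '11111111-1111-1111-1111-111111111111/avatar.png'),
  ('user-uploads', '22222222-2222-2222-2222-222222222222/avatar.png'),
  ('user-uploads', 'admin/logo.png');  -- first-level folder is not a uuid

-- Strict comparison: the cast is attempted on the 'admin' row as well,
-- so the statement is aborted even though the caller only wants their own file.
select name
from demo_objects
where bucket_id = 'user-uploads'
  and (string_to_array(name, '/'))[1]::uuid
      = '11111111-1111-1111-1111-111111111111'::uuid;
-- ERROR:  invalid input syntax for type uuid: "admin"
```

The user who owns the first row did nothing wrong, yet their query fails because of a row they cannot even see.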

While refining the design, I realized that rows like these could end up mixed in:

  • Test data entered manually during development
  • Common folders like admin or shared created directly under the bucket
  • New folders added later, if I decide to use different rules for the first level

If even one of these gets mixed in, evaluation of the authorization condition fails on that row, and the resulting query error can take unrelated users' access down with it. One piece of unexpected data becomes the trigger that stops everything.

In the end, I moved the cast to the lenient side and compared as text:

-- A more lenient approach
bucket_id = 'user-uploads'
  and (storage.foldername(name))[1] = auth.uid()::text

This way, even if the first-level folder is not in uuid format, the comparison simply evaluates to false. The non-matching row is skipped, and other rows are not affected.

I had nearly turned "detect abnormal data by crashing" into "stop everything by dragging in unrelated rows." Writing this up, I realized how close I came to misusing fail-fast.
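For context, here is how the lenient expression might sit inside a complete policy. This is a sketch only, assuming a SELECT policy on storage.objects for the authenticated role in a Supabase project; the policy name is hypothetical, and the same expression could go in a WITH CHECK clause for an INSERT policy.

```sql
-- Quick check in plain PostgreSQL: comparing as text never raises,
-- a non-uuid string is simply "not equal".
select 'admin' = '11111111-1111-1111-1111-111111111111'::uuid::text;  -- false

-- Hypothetical policy name; requires Supabase's storage schema and auth.uid().
create policy "Users can read files in their own folder"
on storage.objects
for select
to authenticated
using (
  bucket_id = 'user-uploads'
  -- Text comparison: a non-uuid folder name just yields false for that row.
  and (storage.foldername(name))[1] = auth.uid()::text
);
```

The cast now happens on auth.uid(), which is always a valid uuid or null, so no stored value can make the expression raise.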

Organizing by Separation of Concerns

Why did fail-fast backfire in this situation? When I sorted it out, what RLS is for and what I was trying to make it do did not match.

The primary job of RLS is access control: it is the mechanism that decides "can this person see or touch this row?" What I was trying to do with the type cast, on the other hand, was a data integrity check (is the stored value in uuid format?).

The two look similar, but they are responsibilities of different layers.

| Concern | Responsible layer |
| --- | --- |
| Access control (who can access what) | RLS / authorization middleware |
| Data integrity (is the saved value correct) | CHECK constraints / triggers / validation layer |
| Input validation (is the request as expected) | Application layer / API validation |

If you make RLS double as a data integrity check, one row of unexpected data can take the whole query down with it. Conversely, if you make data integrity checks double as access control, you get loopholes.

"RLS only says permit or deny. Data correctness is someone else's job." Settling on this was the safest approach.

If you want to detect abnormal data, do it with CHECK constraints or triggers at INSERT time. Once I started dividing the work between those and RLS, the cascading failure was avoided naturally.
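As a rough sketch of what "crash at write time" could look like, assume a hypothetical application-owned table user_files that records the owner folder as text. The table, column, and constraint names are made up for illustration, and the regular expression is just one way to express "looks like a uuid."

```sql
-- Hypothetical table; not part of Supabase's storage schema.
create temporary table user_files (
  id           bigint generated always as identity primary key,
  owner_folder text not null,
  file_name    text not null,
  -- Reject non-uuid folder names at write time, one row at a time.
  constraint owner_folder_is_uuid check (
    owner_folder ~ '^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$'
  )
);

-- Accepted.
insert into user_files (owner_folder, file_name)
values ('11111111-1111-1111-1111-111111111111', 'avatar.png');

-- Rejected: only this INSERT fails; existing rows and other users' reads are untouched.
insert into user_files (owner_folder, file_name)
values ('admin', 'logo.png');
-- ERROR:  new row for relation "user_files" violates check constraint "owner_folder_is_uuid"
```

Here the failure is confined to the one statement that tried to write bad data, which is the boundary where fail-fast is wanted. Declaring the column as uuid would enforce the same thing through the type.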

Where should fail-fast be used?

Generalizing from the above, I believe using fail-fast well comes down to choosing its scope of application.

Situations where fail-fast shines

| Situation | Reason |
| --- | --- |
| Input validation (e.g., API request validation) | Prevents operation with invalid input |
| Integrity check during data writing | Prevents contamination with abnormal data |
| Unit tests | Immediately alerts to abnormalities upon assertion failure |
| Detecting bugs in your own logic | Faster root cause analysis with early detection |

What these have in common is that they sit within "the scope I am responsible for" and at "the gateway where abnormalities should be stopped." The image is one of checking strictly at the boundary.

Situations where fail-fast becomes counterproductive

| Situation | Reason |
| --- | --- |
| Processing evaluated during normal reads and writes, such as authorization conditions | One unexpected piece of data could cause normal access to fail |
| Processing that cuts across data (past data, other people's data, future data) | Failing strictly based only on your own assumptions drags in unrelated data |
| User-facing read processing | The entire screen dies because of old formats or exceptional data |

What these have in common is that they "handle a scope I cannot be responsible for" and "run across all data." Apply fail-fast here and everything stops, dragged down by data you never had in mind.

The instinct behind "if it's abnormal, crash" is not wrong in itself. What you must not skip is thinking about where to crash. Don't crash in RLS; crash in CHECK constraints at write time. Just moving the place where you let things crash can prevent cascading accidents.

Connection to Classical Principles

None of this is actually new. I realized later that it is close to Postel's Law, a classical principle of network design:

Be conservative in what you send, be liberal in what you accept.

Strictly speaking, this is a principle for network protocols, so it cannot be applied as-is to RLS in the DB layer. Adapted to this story, though, it can be organized like this:

  • At write time (INSERT / UPDATE), fail strictly with CHECK constraints or validation.
  • In RLS at read time, even if you encounter unexpected data, do not crash the entire query; treat it as a "row that is not permitted."

Postel's Law has been criticized on the grounds that "accepting too liberally leads to security risks and ambiguous specifications." Even so, the broad framing of "strict within your own scope of responsibility, tolerant in processing that cuts across data" is useful material for thinking about where fail-fast applies.

Summary

I trusted "if it's abnormal, crash" too much and almost caused a cascading failure with RLS. I got away with it because I stopped before writing the policy, but what I learned is probably this:

  • Use fail-fast within "your own scope of responsibility" and at "the point where abnormalities should be stopped."
  • Do not apply it to mechanisms that run across all data (RLS / logs / aggregation).
  • Do not push different concerns into the same layer.
  • Remembering "strict on your own output, tolerant of others' input" keeps you from getting lost in most situations.

If you get the scope of fail-fast wrong, what you intended as "detect abnormalities early and notice them" becomes "drag in unrelated people and stop everything." Just by moving the place where you crash, most cascading accidents should be avoidable.
