I still like serverless. I use it because it keeps a lot of infrastructure out of the way, and for the right kind of team that is a real advantage.
But serverless has a sharp edge that people gloss over when they are selling the idea or drawing the architecture diagram. It is easy to build something that looks clean and feels fast to ship. It is much less fun when a production issue lands at 2 a.m. and you need to figure out where the failure actually happened.
That is usually where the architecture stops being abstract.
The happy path is not the real problem
On a good day, serverless is excellent.
You deploy a function, wire it to an event, and let managed services do the boring work. Scaling is handled. Capacity planning shrinks to almost nothing. You do not spend much time thinking about servers, patches, or idle time. For small teams, that is a good trade.
The problem is not the happy path. The problem is everything around it.
When something fails in a serverless system, the failure is often spread across too many places. Maybe the API call succeeded but the downstream Lambda timed out. Maybe the queue has retries and dead letters. Maybe the log entry you want is in a different service, a different account, or a different region. Maybe the real issue is not even in the function you are staring at. It is in the event shape, the permission boundary, or the assumption that some other service would always behave.
That is when “simple” starts to look less simple.
Debugging needs context, not just code
The big lesson for me is that debugging serverless is mostly a context problem.
If you cannot trace a request from edge to backend to storage to retry queue, you are going to waste time. If your logs are too sparse, too noisy, or too disconnected, you will end up guessing. And guessing at 2 a.m. is a bad way to run production.
I have found that the boring stuff matters more than people expect:
- structured logs with a request id that actually follows the work
- enough metrics to see failure patterns before somebody opens a ticket
- clear alarms on the right failure modes, not just “something is red”
- dead-letter queues that are treated as real operational surfaces
- explicit timeouts and retry behavior, not defaults you never revisited
None of that is glamorous. All of it is useful.
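To make the first item on that list concrete, here is a minimal sketch of structured logging with a request id that follows the work. It uses only the standard library; the `orders` logger name and the event shape are invented for illustration, not taken from any particular platform:

```python
import json
import logging
import sys
import uuid

# One JSON object per line, always carrying the request id, so a log
# search for a single id shows every step that request touched.
logging.basicConfig(stream=sys.stdout, level=logging.INFO, format="%(message)s")
log = logging.getLogger("orders")

def log_event(request_id, event, **fields):
    log.info(json.dumps({"request_id": request_id, "event": event, **fields}))

def handle(event):
    # Reuse the caller's id if one arrived with the event; mint one otherwise.
    request_id = event.get("request_id") or str(uuid.uuid4())
    log_event(request_id, "received", source=event.get("source", "unknown"))
    try:
        # ... real work would happen here ...
        log_event(request_id, "completed")
    except Exception as exc:
        log_event(request_id, "failed", error=str(exc))
        raise
    # Return the id so the next hop logs under the same request.
    return {"request_id": request_id, "status": "ok"}
```

The point is not the specific shape of the JSON. It is that the same `request_id` appears on every line, across every function the work passes through, so a tired person can search for one id and see the whole path.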
Serverless still wins for a lot of teams
I am not arguing against serverless. I am arguing against pretending it removes operational responsibility.
For small teams, serverless is often still the right choice because it lowers the amount of infrastructure you need to own directly. That matters. It is often the difference between shipping and getting stuck in platform work that nobody really wanted.
What I would say is this: choose serverless because you want less undifferentiated work, not because you want less thinking.
If the system has important workflows, payment paths, messaging, or anything that needs reliable recovery, you still need to design for failure. You still need to know where the state lives. You still need to understand how retries behave. You still need to make the debugging path obvious enough that a tired person can follow it.
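As one hedged illustration of "understand how retries behave": a small helper where the attempt budget and backoff are explicit, written-down choices rather than whatever default the platform ships with. The numbers are placeholders, not recommendations:

```python
import time

def call_with_retries(fn, attempts=3, base_delay=0.2):
    """Call fn, retrying failures with exponential backoff.

    attempts and base_delay are deliberate, visible choices; the point
    is that retry behavior is written down, not inherited unseen.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return fn()
        except Exception as exc:
            last_error = exc
            if attempt < attempts - 1:
                time.sleep(base_delay * (2 ** attempt))
    # Out of attempts: surface the failure so it can land in a
    # dead-letter queue or an alarm instead of vanishing.
    raise last_error
```

Whether this logic lives in your code or in the platform's retry configuration matters less than the fact that somebody chose the numbers on purpose and knows where a request goes when the budget runs out.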
That is the part that gets missed when people describe serverless as if it were magically simpler.
My practical rule
My rule is pretty simple.
Use serverless when it reduces ownership and lets a small team move faster. Keep it when the observability is good enough that you can understand failures quickly. Be honest when the architecture is drifting into a pile of event handlers nobody can reason about without a whiteboard and a long coffee.
The architecture is fine.
The problem is usually the 2 a.m. version of it.