Operations, 7 min read

Six services under one pager.

Compute, storage, databases, CI/CD, observability, 24×7 IT. The engineering reality of running six services behind a single on-call rotation.

Author
Halil Safa Sağlık
Category
Operations
Words
265
Read time
7 min read

#On-call #Platform

Six services. One pager. That sounds impossible. It is, until you design for it.

The first design choice is that "services" is the wrong mental model. On-call does not respond to services; it responds to signals. There is a small set of signals worth caring about — error rates, tail latencies, queue depths, replication lag, cert expirations, deploy failures. Those signals come from six services, but the mental map on-call uses is the signal map, not the service map.
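A minimal sketch of what "signal map, not service map" could look like in code. Nothing here comes from a real system: the `Alert` type, the service names, and the `runbooks/` path layout are all assumptions made for illustration. The point is only that routing keys on the signal, and the service is carried along as context.

```python
from dataclasses import dataclass
from enum import Enum


class Signal(Enum):
    # The signals named in the paragraph above.
    ERROR_RATE = "error_rate"
    TAIL_LATENCY = "tail_latency"
    QUEUE_DEPTH = "queue_depth"
    REPLICATION_LAG = "replication_lag"
    CERT_EXPIRY = "cert_expiry"
    DEPLOY_FAILURE = "deploy_failure"


@dataclass(frozen=True)
class Alert:
    service: str    # where the alert came from (hypothetical names)
    signal: Signal  # what on-call actually responds to
    detail: str


def route(alert: Alert) -> str:
    """Pick the runbook by signal only; the path layout is an assumption."""
    return f"runbooks/{alert.signal.value}.md"


alerts = [
    Alert("databases", Signal.REPLICATION_LAG, "replica is 40s behind"),
    Alert("storage", Signal.REPLICATION_LAG, "cross-region copy delayed"),
    Alert("ci_cd", Signal.DEPLOY_FAILURE, "pipeline failed at deploy step"),
]

# Three alerts from three services collapse into two runbooks.
runbooks = {route(a) for a in alerts}
```

Two different services raising the same signal land on the same runbook, which is what keeps the map small enough for one rotation.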

The second design choice is that most on-call events are not incidents. They are questions. "Is this dashboard broken, or is the metric real?" "Why did the deploy auto-rollback?" "What is that overnight alert about?" The runbook for each signal begins with the question it is most likely answering, not the fix.
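One way to enforce "question first" is to make it the shape of the runbook itself. The sketch below is an assumption, not a description of any real tooling: the `Runbook` class, its fields, and the example checks are hypothetical. Only the question text is taken from the paragraph above.

```python
from dataclasses import dataclass, field


@dataclass
class Runbook:
    signal: str
    likely_question: str  # what the person paged is most likely asking
    checks: list[str] = field(default_factory=list)
    fix: str = ""         # deliberately last

    def render(self) -> str:
        """Render with the question before any check or fix."""
        lines = [
            f"Signal: {self.signal}",
            f"Question: {self.likely_question}",
        ]
        lines += [f"Check {i}. {c}" for i, c in enumerate(self.checks, 1)]
        if self.fix:
            lines.append(f"Fix: {self.fix}")
        return "\n".join(lines)


# Hypothetical content; only the question comes from the text above.
deploy_rollback = Runbook(
    signal="deploy_failure",
    likely_question="Why did the deploy auto-rollback?",
    checks=[
        "Compare the error rate before and after the deploy.",
        "Confirm the rollback finished and traffic is on the old version.",
    ],
    fix="If the breach was real, hold the release and open a retrospective.",
)

page = deploy_rollback.render()
```

Because `fix` is rendered last, a reader cannot reach it without passing the question and the checks.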

The third choice is that the system has to run itself during the boring majority of the time. That means automated rollback on error-rate breach, scheduled secret rotation, and self-service provisioning. Any step that requires a human hand is a step to automate away: not later, not when it hurts, but before the next hire.
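Automated rollback on error-rate breach is the easiest of these to sketch. The version below is illustrative only: the 5% threshold, the 100-request minimum, and the `roll_back` callback are assumptions, and a real deploy system would read its windows from a metrics backend rather than a list.

```python
from typing import Callable, Iterable, Tuple


def error_rate(errors: int, requests: int) -> float:
    """Fraction of failed requests; 0.0 when there is no traffic."""
    return errors / requests if requests else 0.0


def should_roll_back(errors: int, requests: int,
                     threshold: float = 0.05,
                     min_requests: int = 100) -> bool:
    """Breach only counts once there is enough traffic to trust the rate."""
    if requests < min_requests:
        return False
    return error_rate(errors, requests) > threshold


def guard_deploy(windows: Iterable[Tuple[int, int]],
                 roll_back: Callable[[], None],
                 threshold: float = 0.05,
                 min_requests: int = 100) -> bool:
    """Watch (errors, requests) windows after a deploy.

    Calls roll_back once, at the first breach, and returns True.
    Returns False if every window stays within the threshold.
    """
    for errors, requests in windows:
        if should_roll_back(errors, requests, threshold, min_requests):
            roll_back()
            return True
    return False


actions = []
rolled_back = guard_deploy(
    [(1, 200), (30, 200), (50, 200)],  # second window is 15% errors
    roll_back=lambda: actions.append("rollback"),
)
```

The minimum-traffic check is a design choice worth keeping: without it, two failed requests out of ten at 3 a.m. would roll back a healthy deploy and page someone to ask why.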

The metric that keeps this honest is how incident time is split: most of the time spent on a real incident should go to writing the retrospective, not to firefighting. If the fix itself takes longer than the write-up, the system is not ready to carry more load.
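The rule is simple enough to write down as a check. This is a sketch under assumptions: the function names are hypothetical, the durations are made up, and treating a single bad incident as disqualifying is one strict reading of the rule, not the only one.

```python
from datetime import timedelta
from typing import Iterable, Tuple

Incident = Tuple[timedelta, timedelta]  # (fix_time, writeup_time)


def fix_dominates(fix_time: timedelta, writeup_time: timedelta) -> bool:
    """True when the fix took longer than the retrospective write-up."""
    return fix_time > writeup_time


def ready_for_more_load(incidents: Iterable[Incident]) -> bool:
    """Strict reading: no incident may have spent longer on the fix."""
    return not any(fix_dominates(fix, writeup) for fix, writeup in incidents)


# Made-up durations for illustration.
recent = [
    (timedelta(minutes=20), timedelta(hours=2)),  # mostly write-up
    (timedelta(hours=3), timedelta(hours=1)),     # mostly firefighting
]

verdict = ready_for_more_load(recent)
```

With these numbers the second incident fails the check, so `verdict` is `False`.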

This is not heroism. It is the opposite. It is what happens when you take on-call seriously as a product problem instead of an operational one. The stack is the product. The pager is the interface.
