Ask HN: How do you handle oncall at 15-30 engineers?
The hypothesis I'm exploring: oncall workflows were designed for small teams where everyone knows the codebase. As teams grow, those workflows break because of specialization and context loss, well before volume becomes a problem.
Some patterns I've heard so far:
- Oncall engineers spending 10-20% of time just routing bugs to the right owner
- Bug reports from CS/sales teams missing critical context (no logs, vague repro steps)
- "Couple hours" average per bug, mostly spent on investigation rather than fixing
- Session replay tools too expensive to run at meaningful coverage
I'm trying to figure out:
- Is this universal, or just specific to certain tech stacks/org structures?
- What's the actual breaking point - team size, user scale, something else?
- Has anyone solved this well? What worked?
If you're currently running oncall at a growing company, I'd love to hear:
- What percentage of oncall time is triage/routing vs. actual fixing?
- What broke first as you scaled?
- What have you tried?