The Bus Factor: How to Find (and Fix) Your Team's Single Points of Failure
Most teams don't know how concentrated their critical knowledge is until someone leaves. Here's how to measure and fix your bus factor.
Your best backend engineer takes two weeks off. On day three, the payment webhook silently starts failing. Nobody else knows the retry logic lives in a cron job named legacy_sync.py that she wrote eighteen months ago and never wrote down anywhere else. The fix waits until she's back online from a beach in Portugal, answering Slack messages she shouldn't have to see.
This is the bus factor problem, and it's rarely about buses. It's about vacations, resignations, parental leave, a bad flu, or someone simply getting pulled onto a different project for a month. The bus factor is the minimum number of people who'd have to disappear before your project stalls — and for most teams, that number is uncomfortably close to one.
Here's how to actually measure your team's bus factor, where the risk usually hides, and how to raise it without turning every sprint into a documentation sprint.
- What the bus factor actually measures
- Why a bus factor of one is more common than you think
- The three places single points of failure quietly form
- How to calculate your team's real bus factor
- Raising your bus factor without slowing everyone down
- The 10-minute bus factor audit
What the bus factor actually measures
The term comes from software engineering, but the risk it describes applies to any team: marketing, ops, finance, support. It asks a blunt question — if this person got hit by a bus tomorrow, how much of what they know would go with them?
A bus factor of one means a single person holds knowledge, access, or context that nobody else has. A bus factor of three or more means the team can absorb the loss of a couple of people without the project stalling. Higher is better; one is a warning sign.
Importantly, the "bus" is a stand-in for anything that makes someone unavailable: a new job offer, burnout, a family emergency, or just being on PTO during the one week something breaks. Context that lives only in one person's head behaves exactly like context that lives only in a Slack thread nobody can find: it's invisible until you need it.
Why a bus factor of one is more common than you think
Most teams assume they're fine because nothing has broken yet. That's survivorship bias, not evidence. The data on how much critical knowledge sits with individuals rather than the team is stark.
According to the Panopto Workplace Knowledge and Productivity Report, cited by HR Dive, 42% of institutional knowledge is unique to individual employees and never gets shared with the rest of the team. The same report found large U.S. companies lose an average of $47 million a year in productivity from inefficient knowledge sharing, and 60% of employees say it's difficult to nearly impossible to get vital job information from colleagues when they need it.
It shows up in developer time too. The 2024 Stack Overflow Developer Survey found that 61% of developers spend more than 30 minutes a day just searching for answers or solutions, and 53% say waiting on those answers disrupts their workflow — even when they know exactly who to ask. That "who to ask" is usually one person, which is the bus factor problem in miniature, repeated daily.
A bus factor of one isn't a compliment to your best person. It's a countdown.
And the cost of getting this wrong compounds. Gallup estimates that voluntary turnover costs U.S. businesses roughly $1 trillion a year, and that replacing a single employee costs one-half to two times their annual salary. That's before you even count the weeks a team spends rediscovering what that person knew.
The three places single points of failure quietly form
Bus factor risk doesn't announce itself. It accumulates in three predictable spots:
- The "person who always fixes X." Every team has one: the person who gets pinged for deploy issues, who owns the one client integration nobody else touches, or who alone understands the billing edge case. Convenience during a sprint becomes dependency by the end of the quarter.
- Onboarding shortcuts. When a new hire joins, it's faster for an existing expert to just do the task than to teach it. That's rational in the moment and disastrous in aggregate — every shortcut is a transfer of knowledge that never happens.
- Undocumented tribal debugging knowledge. Not the architecture (that's usually written down somewhere) — the "oh yeah, that error means the third-party API is rate-limiting us again" knowledge that lives entirely in one person's memory of past incidents.
These aren't separate from the coordination tax teams pay when catching each other up; they share the same root cause. When knowledge concentrates in one person, every absence turns into a synchronization problem for everyone else.
| Symptom | What it looks like | Root cause |
|---|---|---|
| "Let's wait until Sarah's back" | A decision or fix stalls for days | Sarah is the only one with the context to make the call |
| New hire ramp time keeps slipping | Onboarding takes months, not weeks | Knowledge was never externalized — it has to be transferred person-to-person |
| One engineer reviews every PR in a module | Review queue bottlenecks around one name | Nobody else has enough context to review confidently |
| Incidents take longer to resolve on weekends | Mean time to resolution (MTTR) spikes when the "usual person" is offline | Runbooks are informal or nonexistent |
How to calculate your team's real bus factor
You don't need a formal audit team to get a useful estimate. Run this exercise per project or system:
- List the critical systems or workflows — deploy process, client billing, the data pipeline, the one integration that always breaks.
- For each one, ask: who could confidently handle this today if the primary owner vanished for two weeks? Count only people who could actually do it, not people who've "seen it before."
- Assign a number. If only one person qualifies, that system's bus factor is 1. If three people could step in, it's 3.
- Sort by risk × frequency. A bus factor of 1 on a system you touch weekly is far more urgent than a bus factor of 1 on something you touch once a year.
Most teams that do this exercise honestly are surprised by how many "1"s show up — often on the systems they'd have sworn were team-owned.
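The exercise above fits in a spreadsheet, but a short script makes the sorting step explicit. What follows is a minimal sketch rather than a standard formula: the `System` class, the `touches_per_year` estimate, the example systems, and every name other than Sarah are hypothetical. It reads "risk × frequency" as "lowest bus factor first, then most frequently touched," which is one reasonable interpretation among several.

```python
from dataclasses import dataclass


@dataclass
class System:
    name: str
    # Only people who could handle it today, not people who've "seen it before".
    confident_owners: set[str]
    # Rough estimate of how often the team touches this system (assumption).
    touches_per_year: int

    @property
    def bus_factor(self) -> int:
        return len(self.confident_owners)


def rank_by_urgency(systems):
    """Lowest bus factor first; among ties, the most frequently touched first."""
    return sorted(systems, key=lambda s: (s.bus_factor, -s.touches_per_year))


# Hypothetical inventory for illustration.
systems = [
    System("deploy process", {"sarah", "dev_b", "dev_c"}, 250),
    System("client billing", {"sarah"}, 52),
    System("data pipeline", {"dev_b", "dev_c"}, 52),
    System("annual tax export", {"dev_c"}, 1),
]

ranked = rank_by_urgency(systems)
for s in ranked:
    print(f"{s.name}: bus factor {s.bus_factor}, touched ~{s.touches_per_year}x/year")

# The weakest critical system sets the effective bus factor for the team.
team_bus_factor = min(s.bus_factor for s in systems)
print(f"Effective team bus factor: {team_bus_factor}")
```

In this made-up inventory, client billing sorts ahead of the annual tax export even though both have a bus factor of 1, because billing is touched weekly and the export once a year.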
Raising your bus factor without slowing everyone down
The standard advice — "document everything" — is true and useless. Nobody has time to write a manual for every workflow, and most documentation rots within a quarter anyway. A few things actually move the number:
- Pair on the risky 20%, not everything. Use the audit above to find your bus-factor-1 systems and pair specifically on those, not on routine work that's already well understood.
- Rotate the pager, not just the code. If the same person always gets paged for a given service, rotate on-call deliberately even when it's slower at first — the temporary slowdown is the knowledge transfer.
- Record the "why," not the "what." Step-by-step docs go stale fast. A five-minute voice memo explaining why a workaround exists ages much better than a wiki page nobody updates.
- Make knowledge capture a byproduct of work that already happens. Standups, PR descriptions, and incident retros already surface tribal knowledge; the failure mode is that it evaporates the moment the meeting ends instead of being captured somewhere searchable. This is the gap that automatic project intelligence tools are built to close: surfacing who's the only person touching a given blocker before that becomes a crisis, instead of after.
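As an illustration of the "rotate the pager" point, here is a minimal sketch of a weekly rotation in which each primary on-call is paired with a backup who becomes primary the following week. The primary/backup pairing is an assumption layered on top of the advice above, not something it prescribes, and the function name and engineer names are hypothetical.

```python
def on_call_rotation(engineers, weeks):
    """Return a list of (primary, backup) pairs, one per week.

    The backup for week N becomes the primary for week N + 1, so every
    engineer shadows a shift immediately before owning one.
    """
    if len(engineers) < 2:
        raise ValueError("need at least two engineers to rotate")
    n = len(engineers)
    return [(engineers[w % n], engineers[(w + 1) % n]) for w in range(weeks)]


# Hypothetical team; week 1 starts with the current expert as primary.
schedule = on_call_rotation(["sarah", "dev_b", "dev_c"], weeks=4)
for week, (primary, backup) in enumerate(schedule, start=1):
    print(f"week {week}: primary={primary}, backup={backup}")
```

Starting with the expert as primary and a newcomer as backup keeps the first handoff low-risk; after one full cycle everyone has carried the pager at least once.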
None of this requires a knowledge-management initiative. It requires noticing, in real time, when "ask Sarah" is the answer to a question that shouldn't have a single point of failure.
The 10-minute bus factor audit
Run through these questions in your next team meeting. You'll need less than ten minutes and you'll leave with a prioritized list.
- What's the one system or client relationship that would cause real pain if its owner left tomorrow?
- Who reviewed the last five pull requests or deliverables in that area — is it always the same one or two names?
- When did someone last take PTO and something in their area broke or got delayed?
- Is there a runbook, doc, or recording for how to handle this — one that's less than six months old?
- If you asked three team members to explain this system right now, would you get three consistent answers?
If more than one answer worries you, you've already found your highest-leverage fix for the next two weeks — and it's cheaper to fix now than during an actual absence.
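The second audit question, about who reviewed the last five pull requests, is easy to check mechanically if you can export reviewer names from your review tool. The sketch below assumes a plain list of reviewer names ordered oldest to newest; the function name and sample data are hypothetical, and it does not call any particular code-hosting API.

```python
from collections import Counter


def reviewer_concentration(reviewers, last_n=5):
    """Summarize who reviewed the most recent `last_n` changes.

    Returns (distinct_reviewers, top_reviewer, top_share), where top_share
    is the fraction of recent reviews done by the most frequent reviewer.
    """
    recent = reviewers[-last_n:]
    if not recent:
        return 0, None, 0.0
    counts = Counter(recent)
    top_reviewer, top_count = counts.most_common(1)[0]
    return len(counts), top_reviewer, top_count / len(recent)


# Hypothetical review history for one module, oldest first.
billing_reviews = ["sarah", "sarah", "dev_b", "sarah", "sarah", "sarah"]

distinct, top, share = reviewer_concentration(billing_reviews)
print(f"{distinct} distinct reviewers; {top} did {share:.0%} of the last five")
```

A high share for one name is not proof of a problem, but it is a cheap signal for which area to discuss first in the meeting.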
The real fix isn't more documentation — it's less concentration
Bus factor isn't a one-time audit you run and forget. It's a property of how work gets distributed every week — who gets the interesting bug, who reviews the tricky PR, who onboards the new hire onto which system. Teams with a healthy bus factor didn't get there by writing better docs; they got there by deliberately routing risky knowledge to more than one person before it became a liability.
The uncomfortable part is that fixing this usually feels slower in the short term: pairing takes longer than solo work, and rotating on-call is less efficient than always paging the expert. That's the cost of resilience, and paying it now is a lot cheaper than paying it during an actual emergency.
Frequently asked questions
What is a good bus factor for a team?
There's no universal number, but most engineering teams aim for at least 2–3 people who could competently cover any critical system. A bus factor of 1 on anything customer-facing or revenue-critical is worth fixing immediately; a bus factor of 1 on a rarely touched internal tool is lower priority.
How do you calculate bus factor exactly?
There's no single formula, but a practical method is to list critical systems, then count how many people could confidently operate or fix each one without the primary owner's help. The lowest count across your critical systems is your effective team bus factor.
Is bus factor only relevant to software teams?
No. The term originated in software engineering, but the same risk applies to any role where one person accumulates unique knowledge — the only person who knows a key client's history, the only one who understands a specific compliance process, or the only one with a vendor relationship.
What's the difference between bus factor and key person risk?
They describe the same underlying problem from different angles. "Key person risk" is the business and HR framing, often used in succession planning and insurance contexts. "Bus factor" is the engineering framing, usually applied to code, systems, and technical knowledge. Both point to the same fix: distribute critical knowledge before an absence forces you to.