We've hit a problem I don't understand well enough to solve.
The feeling: Overmatched.
If that’s where you are right now, this is the Playbook built for exactly that moment.
“Opaque problem” is one of 40+ What’s Next? Playbooks, for leaders facing a specific, real situation. In under fifteen minutes it helps you recognise what’s actually going on, then gives you a clear way through: the Play to choose, the Plan in concrete moves, the Precedents of people who faced it before, and your next move.
Frameworks you’ll see put to work on this exact decision, applied, not taught in the abstract:
- Five Whys
- Digging Deeper
- Problem Framing
You’ll also see how it played out in the real world: Julia Grace at Slack in San Francisco (2016), and Mavis Batey at Bletchley Park (1941). Real precedents, not platitudes.
It leaves you with one question to carry into your next conversation: “Write the problem your team is solving as one sentence, right now, without using jargon - does it match the problem the customer would describe, or are you about to solve someone else's question?”
Part of the Innovation & Stuck collection: Playbooks for when you’re out of ideas, the options all look the same, or you’re told to be more innovative.
Transcript
What to do when you have hit a problem you don't understand
San Francisco, August two thousand and sixteen. Julia Grace has just been asked to build Slack's first infrastructure engineering organisation.
The original codebase was written in two thousand and thirteen by the founders for a company assumed to grow linearly within small teams. By two thousand and sixteen the adoption pattern is geometric. IBM has just rolled Slack out across its workforce. The combination of enterprise-scale load and cross-geography usage patterns is breaking the system in ways the engineering team does not yet have language for.
The recurring failure mode that becomes the canonical example is the thundering herd. When message servers crash under peak West Coast morning load, clients across the globe try to reconnect simultaneously. The recovering capacity is trampled before it can stabilise. The outage cascades.
The team has tried the standard moves. Add capacity. Tune timeouts. Stagger reconnects. Each move addresses a symptom and the next outage produces a different symptom. The team is stuck in the shape of problem where every fix surfaces the next failure mode rather than solving the underlying one.
Grace's response is not to chase the surface-level outage reports. It is to rebuild, in parallel, both the architecture (publish-subscribe model, sharded messaging tiers, asynchronous job rewrites) and the organisation (from ten specialised engineers to a hundred, across three offices).
The premise of running both rebuilds at once is structural. The problem is neither purely technical nor purely organisational. It is living in the space where the two have stopped matching each other. Fixing either layer in isolation reinforces the mismatch with the other.
When you have hit a problem you don't understand well enough to solve, the default instinct is to narrow: simplify the question, fix one layer at a time. Grace's move was the opposite — she widened, because the existing layers had been designed against assumptions that no longer held, and fixing any one of them in isolation would have reinforced the mismatch with the others.
Her own framing of the senior engineering leadership task at that scale, given in a later QCon talk, is that it becomes essentially a sales role — selling the vision and the work up, down, and across the organisation. That is the only way to get permission to rebuild the system and the team at the same time.
So let's go to the office and work through it.
First read why the capable team is stuck
"We've hit a problem I don't understand well enough to solve."
The feeling is overmatched.
The team has been working on it for a week. The team is intelligent, capable, well-resourced. None of that has translated into a clear path forward. And the longer the work continues without a clear path, the harder it is for anyone in the room to admit that the path is what's missing.
Two choices. They look like the same stuck team. The first move is different.
When everyone has retreated into the work
Choice one: the team is hiding behind their screens. Everyone is in the problem individually. Nobody is talking. The room has gone quiet because the work is happening in terminal windows nobody else can see. The lack of conversation is doing more damage than the difficulty of the problem.
If that's the read, take the laptops away. Lock the team in a room with a whiteboard and nothing else for an hour.
Don Norman, working on environmental affordances, observed that the room you're in changes what thinking is possible. Problems that are impossible alone often become tractable when they're drawn on a wall in front of four other people, and the reason isn't that the drawing is better than the code. It's that a whiteboard forces a conversation, and the conversation surfaces the bits each person was quietly stuck on.
When everyone knows too much to see it
Choice two: the team is hiding behind their expertise. Everyone in the room knows too much about the problem to see it clearly. The sophistication of the discussion is getting in the way of the answer. Each person has ruled out enough options that the remaining options all look impossible.
If that's the read, hand the problem to the least experienced person in the building and ask them to describe it back to you. Not to solve it. Just to state what they think is going on.
The Suzuki Roshi tradition calls this beginner's mind. The discipline is that the least experienced person hasn't yet ruled out the obvious answers, and the obvious answers are sometimes correct. Their ignorance is the tool, and the power of the tool is that they will ask questions the experts stopped asking because they thought they knew the answers. Most of the time, those questions are the ones the experts had quietly answered wrong years ago and never revisited.
Hiding-behind-screens, or hiding-behind-expertise. Same overmatched team. Two different first moves.
How to work from the outside of the problem inwards
Three tools. The discipline is to work from the outside of the problem inwards.
Write the problem in three plain sentences
The first is Problem Framing.
Problem Framing as a structured discipline traces through the design-thinking literature of the two-thousands — the IDEO and Stanford d.school formalisation of the empathise–define–ideate–prototype–test sequence — with deeper roots in the operations-research and systems-thinking traditions of the post-war period. The contemporary version emphasises plain-language definition before any solution work begins.
The reason Problem Framing matters when you don't understand the problem is that you have probably been solving the wrong one. Most stuck teams discover, when forced to write the problem in a single short paragraph without jargon, that the problem they thought they were solving is not the problem the customer would describe. The misalignment is the trap; the writing exercise is the diagnostic.
The unique insight is the plain-language test. Strip the jargon and the problem either survives the strip or doesn't. If it survives — if a stranger could read the framing and recognise what's wrong — the team has a real problem to work on. If it doesn't — if the framing only makes sense inside the team's vocabulary — the team has been solving a problem that exists primarily in their own discussion rather than in the world.
What you get is a clear shared statement of what's actually wrong, which is — surprisingly often — the entire missing ingredient.
So. How to run it.
Write. One paragraph. Three sentences maximum. No jargon, no acronyms, no internal product names.
Read. Read it aloud to someone outside the team and watch their face. If they understand, the framing is real. If they don't, the framing is internal-only, which is information about how stuck the team is.
Identify. Most teams discover the framing they wrote is missing one of: who has the problem, what they're trying to do, what's stopping them. Whichever is missing is the part the team has been guessing at.
Re-write. A second pass with the missing element named. The new framing usually points the team at a question they can actually answer.
The discipline is that if you can't write the problem in three plain-language sentences, you don't yet understand it, and any time spent solving it is time spent solving the wrong problem.
Keep asking past the comfortable answer
The second is Five Whys.
The Five Whys method is attributed to Sakichi Toyoda, the founder of Toyota Industries, who developed it in the early twentieth century. It was later formalised within the post-war Toyota Production System under Taiichi Ohno. The method spread into Western manufacturing and quality practice through the lean movement of the nineteen-eighties and nineteen-nineties, and into software and product engineering through the agile literature of the two-thousands.
The reason Five Whys matters when you don't understand the problem is that the surface explanation is almost always not the underlying cause. The team's first explanation for why something is failing is usually wrong. The second is closer. By the third or fourth why, the team starts hitting the structural cause that the first explanation was hiding.
The unique insight is the patience past the comfortable stopping point. The standard surface explanation feels like a complete answer because it's plausible and because it absolves the team. The next why breaks the plausibility and sometimes the absolution. The discipline is to keep going past the point where you have a plausible-sounding answer.
What you get is a structural cause. Not "the deploy failed because the build broke" (proximate). Rather, "the deploy failed because the team is on a six-week-old toolchain version, because the upgrade has been deferred for three sprints, because the upgrade owner is also the on-call rotation lead" (structural). The structural answer is the one a change can act on.
So. How to use it.
Start. Be concrete. "The deploy failed last Tuesday."
Ask. Why? "Because the build broke."
Again. Why? "Because the toolchain version is out of date."
Keep. Keep asking. Each why either reveals a more structural cause or hits a wall. The wall is as useful as the answer; it tells you where the team's understanding runs out.
Stop. Stop at the structural cause, not the personal one. "Because the upgrade owner is also the on-call rotation lead and has been firefighting" is structural. "Because Sarah hasn't done the upgrade" is personal. The structural explanation is the actionable one.
The framework's main failure mode is stopping too early. The first plausible answer is rarely the real one; teams that stop at the first plausible answer keep solving symptoms.
Refuse the first answer that ends the conversation
The third is Digging Deeper.
Digging Deeper as a research discipline traces through the user-research and continuous-discovery traditions of the twenty-tens — Teresa Torres and Dscout's writing on generative interview techniques, Marty Cagan's product-discovery practice, and the deeper anthropological-fieldwork lineage that goes back to participant-observation research in mid-twentieth-century social science. The contemporary form is about refusing to accept the first answer that looks like an answer.
The reason Digging Deeper matters when you don't understand a problem is that the first answer that looks like an answer is almost always a defensive answer. The team — or the customer — gives the answer that ends the conversation. The conversation needs to continue past that answer to reach the actual mechanism.
The unique insight is the comfortable stopping point. Every interview, every diagnostic conversation, every internal team discussion has a moment where the first plausible answer arrives and everyone in the room visibly relaxes. That moment is the danger. The relaxation is the team accepting an answer because it ends the discomfort, not because it's correct.
What you get when you discipline yourself to push past that point is the actual mechanism. Not what people say is happening. What is actually happening, mechanically, that causes the outcome you're observing.
So. How to use it.
Notice. When the team or the interview subject has just given an answer that feels like it closes the topic, that's the signal. Don't move on. Stay there.
Ask. "And what happens when that doesn't work?" "And before that?" "And what made you try that approach?" The questions feel like over-asking; they are not.
Watch. The first answer is the rehearsed one. The second answer is usually the real one. It comes more slowly, with more hesitation, and sounds less polished. That's the diagnostic.
Capture. What you write down at the end of the conversation is the second-or-third-layer answer, not the first one. The first answer goes in the "what they said" file; the third answer goes in the "what is happening" file.
The discipline is uncomfortable for the asker — pushing past a comfortable stopping point feels like rudeness — and the discomfort is the price of getting to the answer that's worth having.
A precedent: don't wait for the framework, work the system
That's the toolkit. One more story before we close.
The Grace story we opened with showed the discipline of widening — when the layers in front of you have stopped matching each other, the move is to rebuild both rather than fix either. The story we close with shows the same move at the opposite scale — a single nineteen-year-old, on the other side of an unbreakable code, deciding not to wait for the framework that would make breaking it her job.
Bletchley Park, March nineteen forty-one. A nineteen-year-old Mavis Batey is working in Dilly Knox's ISK section, on Italian naval Enigma traffic marked GP that the section has not been formally trained to decrypt.
The intercepts appear to lack the standard indicator patterns the section's existing techniques rely on. The dominant view, inside the section and inside the wider Bletchley operation, is that the messages will have to wait until someone with the right authorisation produces a framework for them.
Batey decides not to wait.
Her approach is to hunt for a crib — a predicted fragment of plaintext that, if correctly guessed, would unlock the surrounding message — by inspecting the intercepts themselves for recurring structural patterns that might betray a specific word.
She isolates a repeating sequence that points to the Italian word personale. Personnel. The foothold proves correct. The traffic is broken.
The decrypt stream produced in the following weeks provides the intelligence on Italian fleet movements that Admiral Cunningham uses to ambush and destroy a significant portion of the Italian Mediterranean fleet at Cape Matapan, on the night of the twenty-eighth and twenty-ninth of March nineteen forty-one. Five Italian ships sunk. More than two thousand Italian sailors killed or captured. The British Mediterranean Fleet loses one aircraft.
So.
Grace at Slack widened the diagnosis when the layers had stopped matching each other. Batey at Bletchley narrowed when no framework existed yet. Two different cuts at the same move — work directly on the system, don't wait for the framework — at radically different scales.
When you've hit a problem you don't understand well enough to solve, the procedural answer is to wait for the framework. The more useful move is often to notice that the framework you are waiting for will be built out of specific observations someone is eventually going to have to make, and that you may as well be the one making them.
Batey did not wait for the theory of GP traffic. She worked directly on the intercepts until they told her how to read them. The theory came later. The framework came later. The decrypts did not.
So. Your Next Move from this playbook.
Write the problem your team is solving as one sentence, right now, without using jargon — does it match the problem the customer would describe, or are you about to solve someone else's question?
- Position
The situation in a sentence, and the feeling underneath it. Free to read.
- A choice of two Plays
Two behavioural Plays. Each positions you differently for the next conversation. You choose.
- A Plan of tools
Tools from the Toolbox, in order, each ending in Your Next Move — one concrete instruction.
- Precedents
Leaders who stood here. We show whose play worked, half-worked, and shouldn’t have been attempted.
“The list was never the hard part. Standing behind the cut, in the next three conversations, is.”
Sources & further reading: 3 Positions, 4 Plays, 3 Plans, and 2 Precedents.
Questions, answered
How does a Playbook work?
A Playbook names your Position, hands you two Plays to choose between, then turns your choice into a Plan — a sequence of tools, each ending with a single concrete move. It closes on Your Next Move: the one thing to do before the day ends.
How long is a Playbook?
About twelve minutes. Short enough to watch in the gap before the meeting it’s made for.
What’s the difference between this and asking AI?
A chatbot gives you an answer. A Playbook gives you a Position, a chosen Play, a Plan, and Precedent — the structure of a decision, not a paragraph of advice. You open the situation you’re in rather than describing it from scratch.
Do I need to watch them in order?
No. Each Playbook stands alone. You open the one that matches the situation in front of you — there’s no sequence to follow and nothing to complete first.
What is Your Next Move?
The single concrete move you leave with — a question to take back into the room and answer there. Every tool in a Plan ends with one. It’s the answer to the question the brand name asks.