When AI Hits a Wall: What Building a Browser Game Taught Us About Human-in-the-Loop

Building Untangle Me looked like a clean weekend project — until the auto-solver froze the browser. A master's-level engineer stepped in, dusted off something from grad school, and the whole thing clicked. Here's what that moment revealed about how AI actually gets things done.

Untangle Me is a retro-styled circuit-board puzzle game we built for our internal portfolio. The concept is deceptively simple: you're given a grid of IC chips connected by colored wires, and your job is to drag the chips around until no wires cross. Early levels have three chips and three wires on an 8×8 grid. The final levels have six chips, ten wires, obstacles scattered across the board, and chips of varying footprint sizes.

It looked like a two-day build. The rendering, drag physics, and level design came together quickly. Then we hit the auto-solver — the "SOLVE" button that should animate each chip gliding to its optimal position. That's where things got interesting.

The Problem Scales Faster Than It Looks

On the surface, the solver seems like a search problem. You have N chips, each of which can sit in any of M valid grid cells. Try all the combinations, find the one with zero wire crossings, done. For the first few levels with 3 chips on a 64-cell board, this works fine. The search space is maybe a few thousand states. Claude implemented it as a brute-force permutation search, cycling through every possible assignment of chips to cells and checking each configuration for violations.
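That enumerate-and-check approach can be sketched in a few lines. This is an illustrative reconstruction, not the game's actual code: the names (`bestPlacement`, `score`) are hypothetical, and chips are simplified to single-cell footprints. The recursion visits every ordered assignment of chips to cells, which is exactly where the factorial blow-up comes from:

```javascript
// Illustrative sketch of the brute-force solver (not the game's real code):
// recursively enumerate every ordered assignment of chips to free cells
// and keep the configuration with the lowest score (wire crossings).
function bestPlacement(cells, chipCount, score) {
  let best = null;
  let bestScore = Infinity;
  const chosen = [];
  const used = new Array(cells.length).fill(false);

  (function place(chipIndex) {
    if (chipIndex === chipCount) {
      const s = score(chosen);              // e.g. count wire crossings
      if (s < bestScore) { bestScore = s; best = chosen.slice(); }
      return;
    }
    for (let i = 0; i < cells.length; i++) {
      if (used[i]) continue;                // no two chips share a cell
      used[i] = true;
      chosen.push(cells[i]);
      place(chipIndex + 1);                 // P(M, N) leaves in total
      chosen.pop();
      used[i] = false;
    }
  })(0);

  return { placement: best, crossings: bestScore };
}
```

For 3 chips on 64 cells that is 64 · 63 · 62 ≈ 250,000 score calls, which a browser shrugs off. For 6 chips the same recursion produces billions of leaves and never hands control back to the event loop.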

Then we scaled up.

With 6 chips and 64 cells (minus obstacles), the number of valid placements grows combinatorially — we're talking roughly `P(50, 6)`, which lands north of ten billion. Each candidate configuration also requires computing all wire-crossing intersections across up to 10 wires, which is its own O(W²) operation per candidate. The browser tab doesn't freeze with an error message. It just… stops responding. The JavaScript event loop is buried under a computation that won't finish in any reasonable timeframe.
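The per-candidate scoring step is itself simple geometry: treat each wire as a straight segment between chip centers and count intersecting pairs. A hedged sketch using the standard orientation test (function names are illustrative; the game's real wires may not be idealized straight segments):

```javascript
// Cross product of (a->b) and (a->c):
// > 0 counter-clockwise, < 0 clockwise, 0 collinear.
function orient(a, b, c) {
  return (b.x - a.x) * (c.y - a.y) - (b.y - a.y) * (c.x - a.x);
}

// Strict segment intersection: each segment's endpoints must lie on
// opposite sides of the other. Shared endpoints at a chip don't count.
function segmentsCross(p1, p2, q1, q2) {
  const d1 = orient(q1, q2, p1);
  const d2 = orient(q1, q2, p2);
  const d3 = orient(p1, p2, q1);
  const d4 = orient(p1, p2, q2);
  return d1 !== 0 && d2 !== 0 && d3 !== 0 && d4 !== 0 &&
         (d1 > 0) !== (d2 > 0) && (d3 > 0) !== (d4 > 0);
}

// O(W^2) pairwise check: with 10 wires that's only 45 tests, cheap on
// its own but multiplied by billions of candidates under brute force.
function countCrossings(wires) {
  let crossings = 0;
  for (let i = 0; i < wires.length; i++) {
    for (let j = i + 1; j < wires.length; j++) {
      if (segmentsCross(wires[i][0], wires[i][1],
                        wires[j][0], wires[j][1])) {
        crossings++;
      }
    }
  }
  return crossings;
}
```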

This is a textbook NP-hard problem class. Chip placement with crossing minimization is a close relative of crossing-number minimization in graph drawing — a problem proven NP-hard in the general case. The polynomial-time algorithms that exist either apply only to special graph families (planar graphs, series-parallel graphs) or produce approximate solutions rather than guaranteed optima. Brute force isn't a slow solution to this problem. It's the wrong class of solution entirely.

Claude knew this, in a sense. Ask it directly: "Is wire-crossing minimization NP-hard?" and it will give you a correct, nuanced answer with references to graph theory. But left to its own devices, building a feature under the momentum of an ongoing coding session, it reached for the obvious tool: enumerate and check. The gap between knowing something and applying that knowledge proactively is one of the most important things to understand about working with AI.

What Claude Did — and Didn't Do

The brute-force implementation was clean and readable. It correctly generated all permutations of chip-to-cell assignments, correctly filtered for valid placements (no overlaps, no out-of-bounds), and correctly scored each configuration by counting wire crossings. For three chips it worked perfectly. The code itself wasn't bad — the approach was wrong.

What Claude didn't do was flag the problem before it materialized. It didn't say: "Before I implement this, I should warn you that permutation search over N chips on M cells has factorial complexity and will be unacceptably slow for levels with 5 or 6 chips." It didn't suggest reducing grid size to contain the state space. It didn't propose a heuristic. It answered the task as specified — "implement a solver" — without proactively reasoning about whether the approach would hold up at scale.

This is a pattern worth understanding deeply. Claude can discuss complexity theory fluently when you ask it to. It can explain the difference between P and NP, describe why crossing minimization is hard, and outline alternative approaches. But in the flow of code generation, it doesn't necessarily volunteer that analysis unless prompted. The default behavior is to satisfy the immediate request with the most direct implementation.

The Human Copilot Steps In

One of our engineers — a master's-level CS graduate who spent several years doing combinatorial optimization research before moving into software — looked at the hanging browser tab and immediately recognized the shape of the problem.

"This is a placement-and-routing problem," he said. "It's the same family as PCB autorouting. You're not going to brute-force it."

His first intervention was about the problem parameters, not the algorithm. Before touching the solver, he asked: does this game actually need an 8×8 grid? Early levels only use a 3×3 or 4×4 region of the board — chips cluster in the center and the edges are rarely populated. What if we constrained the solver to even-numbered grid positions only? The game's visual design already had chips snapping to even coordinates. That single observation cut the available placement cells roughly in half, shrinking the search space by about two orders of magnitude on the largest levels.
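A quick back-of-the-envelope check of that observation. The cell counts here (roughly 50 free cells on the obstacle-reduced board, roughly 25 at even coordinates) are rough assumptions, not measured figures from the game:

```javascript
// perms(m, n) is the falling factorial P(m, n) = m * (m-1) * ... * (m-n+1):
// the number of ways to assign n distinct chips to m cells.
function perms(m, n) {
  let p = 1;
  for (let i = 0; i < n; i++) p *= m - i;
  return p;
}

const full = perms(50, 6);   // all free cells on the full board
const even = perms(25, 6);   // even-coordinate cells only: about half

console.log(full.toExponential(2));    // "1.14e+10"
console.log(even.toExponential(2));    // "1.28e+8"
console.log(Math.round(full / even));  // 90, i.e. ~two orders of magnitude
```

Halving the cells doesn't halve the work; with six chips the factor compounds roughly six times over, which is why a modest observation about the grid buys so much.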

The second intervention was algorithmic. He remembered simulated annealing from a graduate course in combinatorial optimization — not in full detail, but enough to name it and describe the shape of how it works: start with a random (or current) configuration, randomly perturb it by moving one chip, accept the new configuration if it's better, and occasionally accept it even if it's worse (with a probability that decreases over time, like a cooling metal). The "cooling schedule" is what keeps the search from getting stuck in local minima.
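That mechanism fits in a few lines. The sketch below uses illustrative names and parameters, and assumes a numeric `cost` function (such as the wire-crossing count) and a `neighbor` function that returns a copy of the state with one chip moved; it is not the game's actual solver:

```javascript
// Minimal simulated annealing loop: always accept improvements, accept
// regressions with probability exp(-delta / temp), and cool geometrically.
function anneal(initial, cost, neighbor,
                { startTemp = 10, cooling = 0.995, maxIters = 10000 } = {}) {
  let current = initial;
  let currentCost = cost(current);
  let best = current, bestCost = currentCost;
  let temp = startTemp;

  // Early exit once a zero-cost (zero-crossing) solution is found.
  for (let i = 0; i < maxIters && bestCost > 0; i++) {
    const candidate = neighbor(current);
    const delta = cost(candidate) - currentCost;
    // Accepting occasional uphill moves is what escapes local minima;
    // the shrinking temperature makes those moves rarer over time.
    if (delta <= 0 || Math.random() < Math.exp(-delta / temp)) {
      current = candidate;
      currentCost += delta;
    }
    if (currentCost < bestCost) { best = current; bestCost = currentCost; }
    temp *= cooling;   // geometric cooling schedule
  }
  return { state: best, crossings: bestCost };
}
```

The same loop works on any state space; only `cost` and `neighbor` encode the chip-placement problem, which is why redirecting the work took a name and a mechanism rather than a full specification.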

He didn't remember the exact formulas. He didn't need to. He knew the name, the category, and the rough mechanism. That was enough to redirect the work.

What Claude Did With That Direction

This is the part that matters for how you think about building with AI. Once given the direction — "use simulated annealing, constrain to even grid positions, keep the search budget under 10,000 iterations" — Claude implemented it correctly and quickly. The implementation handles the temperature schedule, the random perturbation of chip positions, the acceptance probability function, and the early-exit condition when a zero-crossing solution is found. It's clean, well-structured, and works on every level including the hardest ones with six chips and ten wires.

The final solver runs in milliseconds on early levels and under a second on the hardest ones. The browser never hangs.

Claude didn't need to be walked through simulated annealing from first principles. Once pointed at the approach, it produced a solid implementation. What it needed was a human who knew which approach to point at — and who knew to ask the question at all.

The Asymmetry of Expertise

There's a tempting narrative about AI coding that goes like this: AI is good at the repetitive, mechanical parts of coding, and humans are needed for the creative, high-level architecture decisions. That framing is partly true but misleadingly incomplete.

What this project illustrated is a different kind of asymmetry. Claude can write a simulated annealing implementation that a junior engineer would be proud of. It can explain the theoretical properties of the algorithm, discuss its convergence guarantees, and suggest variations. The knowledge is there. What's missing is the proactive trigger — the moment of recognition that says "the approach you're about to use is in the wrong complexity class for this problem, and here's why."

That recognition came from someone who had spent time in grad school getting burned by exactly this kind of complexity cliff — who had sat in front of a computation that was supposed to run overnight and didn't finish in a week. That kind of scar tissue doesn't live in language model weights in the same way. It lives in the engineer who, when he sees an N-chip placement problem, feels a familiar dread before he writes a single line.

The human contribution here wasn't a grand architectural insight. It was two things: a name (simulated annealing) and a constraint (even grid positions only). Both came from experience in a specific domain. Neither would have appeared in Claude's output without prompting.

What This Means for How You Build Teams

The lesson isn't "don't trust AI with algorithmic problems." Claude handled the hard part — the implementation — faster and more cleanly than most human engineers would have. The lesson is about what kind of human expertise you pair it with.

For problems in regulated industries, manufacturing, finance, or anywhere that complexity can quietly become catastrophic, you don't want a junior engineer in the loop who will recognize the problem only after it ships. You want someone with enough domain depth to recognize the shape of a hard problem before the implementation begins β€” and enough experience with the solution space to name the approach, even approximately.

In the Untangle case, the expert intervention was inexpensive: one conversation and two observations turned a feature that would otherwise have been abandoned or shipped broken into one of the game's most satisfying. The SOLVE button actually works, on every level, in under a second.

That's the model: AI handles the volume and the implementation detail, the human expert handles the moment of recognition. The value of the human isn't proportional to how much code they write. It's proportional to how much they know about the domain where things can go quietly, expensively wrong.


At Black Gibbon, our engineers don't just write code — they recognize when the approach is wrong before it becomes your problem. Our team operates across Irvine and Hanoi, which means the senior review that catches complexity cliffs happens around the clock. If you're building automation for a domain where the cost of getting it wrong is measured in something more than developer hours, that's the conversation we should have.

Need a human in your loop?

Our senior engineers catch the complexity cliffs AI misses — reviewing architecture, security, and algorithmic fit before problems ship. Part-time or full-time, monthly.

Talk to a Dev Lead →