I’ve been running a competition where frontier AI models write Python code, connect to a TCP server, and solve algorithmic programming challenges head-to-head in real time. Each model reads the spec once and generates a bot. That bot then connects to a live server and solves a series of rounds under a time limit, with the standard library only and no feedback beyond whatever the server sends back when something breaks.
Each model faced the same specification and the same single-shot constraint. The conditions are not harsh: fixed, documented protocol, no adversarial inputs beyond the problem itself, no concurrency, no security model, no schema evolution, no partial failures, no real-world data corruption. And still, the majority of frontier models produce code that either never connects properly, misreads the most basic line format, or silently dies the moment anything unexpected happens on the wire.
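For orientation, every bot shares roughly the skeleton below. The host, port, and the ROUND/ANSWER/BYE commands are placeholders of my own, not the contest protocol; the real spec defines the wire format.

```python
import socket

# Rough shape of a contest bot. HOST, PORT, and the ROUND/ANSWER/BYE
# commands are illustrative placeholders, not the real protocol.
HOST, PORT = "example.invalid", 9000

with socket.create_connection((HOST, PORT), timeout=5) as sock:
    stream = sock.makefile("rwb")          # buffered line I/O over TCP
    for raw in stream:
        line = raw.decode().strip()
        if line.startswith("ROUND"):
            stream.write(b"ANSWER 42\n")   # reply format is hypothetical
            stream.flush()                 # unflushed writes stall the game
        elif line == "BYE":
            break
```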
Day 18: what locally plausible failure looks like
CollectTheDots asked each bot to cover N dots with the fewest non-overlapping circles inside a bounding rectangle. Ten rounds, dot counts from 50 to 100.
Nemotron scored zero valid points across all ten rounds. Its parser reads DOT lines but uses field index 1 as the x coordinate and index 2 as the y coordinate. Off by one. The dot index is treated as a spatial coordinate; the actual x value becomes y. The bot runs, generates circles, and submits output. The server rejects everything because none of the circles are near the real dot positions.
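The whole failure fits in two lines. A minimal reconstruction, assuming the wire format DOT &lt;index&gt; &lt;x&gt; &lt;y&gt; that the description above implies:

```python
line = "DOT 7 120.5 88.25"   # assumed format: DOT <index> <x> <y>
parts = line.split()

x, y = float(parts[1]), float(parts[2])     # Nemotron's offsets: index as x
# x, y = float(parts[2]), float(parts[3])   # correct offsets

print(x, y)   # 7.0 120.5 -> the dot index becomes x, the real x becomes y
```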
The bug isn’t the story; where it hid is. The parser looked like a parser. It accessed named fields in a structured way. The error lived in field offsets — exactly where a reviewer’s eye skims rather than verifies. Clean syntax, sensible variable names, and structured exception blocks are not evidence of correctness. Under skim-review conditions, they are camouflage.
MiniMax had a subtler version of the same problem. The maximum valid circle radius before hitting the rectangle boundary is min(cx, w-cx, cy, h-cy). MiniMax adds EPS = 1e-6 as a cushion. The geometric idea was sound; the implementation failed by assuming a local tolerance hack would compose safely with the server’s validator. For dots positioned near the boundary, floating-point arithmetic puts the expanded circle on the wrong side of the threshold. Six of ten submissions rejected.
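One plausible reconstruction of how the cushion composes badly, assuming the server validates containment exactly (MiniMax’s actual code may differ in detail):

```python
EPS = 1e-6
w, h = 200.0, 300.0
cx, cy = 3e-7, 150.0                    # dot almost touching the left wall

r_max = min(cx, w - cx, cy, h - cy)     # 3e-7: the true ceiling
r = r_max + EPS                         # the cushion overshoots the ceiling

print(cx - r >= 0)   # False -> circle crosses the boundary, rejected
```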
Gemini’s bot, 299 lines of randomised agglomerative solver with a complete protocol layer, disconnected 96 milliseconds into Round 1. Rounds 2 through 10 registered as immediate EOF. The bot wraps all execution in try/except: sys.exit(1). No traceback was captured. A bot that exits cleanly and tells you nothing is harder to diagnose than one that crashes loudly. The “responsible” error handling guaranteed zero observability.
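Preserving the evidence costs one line. A sketch, with run_bot standing in for the bot’s real entry point:

```python
import sys
import traceback

def run_bot():
    # Hypothetical stand-in for the bot's main loop.
    raise ValueError("unexpected line on the wire")

# What Gemini's wrapper amounted to:
#   try: run_bot()
#   except Exception: sys.exit(1)   # exits "cleanly" and says nothing
try:
    run_bot()
except Exception:
    traceback.print_exc(file=sys.stderr)   # capture the traceback first
    sys.exit(1)
```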
DeepSeek won five of the first seven rounds — instant submissions, 0.05 seconds, dominant. Its grid-search approach divided the rectangle into a fixed grid of candidate circle centres and placed circles greedily, one at a time — fast and accurate when the grid resolution matched the cluster density. On Round 8, the rectangle grew to 190×280 with 100 dots spread across 53,000 square pixels. The grid step formula produced roughly 100 candidate centres for 12 cluster centres. DeepSeek submitted 16 circles; Kimi submitted 8. It fell to last place among scoring bots, and placed 5th in rounds 9 and 10. The code never changed; the input did.
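The scaling problem is visible with arithmetic alone: at a fixed budget of roughly 100 candidate centres, the spacing between neighbouring candidates grows with the square root of the board area. The numbers below are mine, not DeepSeek’s formula:

```python
import math

def candidate_spacing(w, h, budget=100):
    # Distance between neighbouring grid candidates at a fixed budget.
    return math.sqrt(w * h / budget)

print(round(candidate_spacing(60, 80), 1))     # ~6.9 px on a small board
print(round(candidate_spacing(190, 280), 1))   # ~23.1 px on Round 8: too
                                               # coarse for tight clusters
```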
GLM completed rounds 1 through 4 with valid submissions, placing 4th to 6th each time and scoring zero points. From round 5 onward, where N reached 85 or more, its solver timed out every round. The complexity was fine on small inputs; nobody checked what happened when inputs grew.
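The standard defence is an anytime loop: hold a valid answer at every moment and stop refining before the budget expires. A toy sketch, not any bot’s actual code:

```python
import random
import time

def anytime_minimise(f, x0, budget_s=0.2):
    # Keep the best answer found so far; return it when the deadline hits.
    deadline = time.monotonic() + budget_s
    best_x, best_v = x0, f(x0)
    while time.monotonic() < deadline:
        cand = best_x + random.uniform(-1.0, 1.0)   # cheap local move
        v = f(cand)
        if v < best_v:
            best_x, best_v = cand, v
    return best_x, best_v

# Toy objective with its minimum at x = 3: the loop always has an answer,
# however large the input or small the budget.
print(anytime_minimise(lambda x: (x - 3.0) ** 2, x0=0.0))
```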
Grok won. Not because it solved the problem best (Kimi’s iterative merge found better solutions on the hardest rounds), but because its mistakes were less catastrophic. In a field where every approach had a cliff, the winner was the one whose cliff was hardest to fall off. “Blew up less than everyone else” is not a victory lap, but it is what happened.
The same pattern across 18 challenges
WarehouseRobot: ITEM lines have 5 tokens, the ITEM keyword plus four data fields. Nemotron’s parser checks for 4. Every item line is skipped. The items list is empty for all ten rounds. The bot plans zero trips, submits a bare END, and receives missing_item_0 from the server every time. 122 lines of code, complete route-planning logic, attached to a parser that misreads one line format and invalidates everything downstream. Same class of bug, different challenge.
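Again the failure is tiny. A reconstruction with invented field values; only the token count matters:

```python
line = "ITEM 3 17 42 5"      # the ITEM keyword plus four fields: five tokens
parts = line.split()

items = []
if len(parts) == 4:          # the buggy check: never true for an ITEM line
    items.append(parts[1:])

print(items)                 # [] -> zero trips, a bare END, missing_item_0
```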
PalinPrimeBits: Kimi had an off-by-15 seed bug in its palindromic prime cache. It submitted wrong answers in most rounds. In Round 10 — where N was 1,000,000 and every other bot was still computing when the clock ran out — Kimi’s cache happened to produce the correct answer despite the bug. A bot that was wrong nine times out of ten took first place in the decisive round. Reliability is not the same as occasionally being lucky.
Blurry Image Reveal: The challenge sent 30MB of ASCII image data before the first question. ChatGPT and MiMo timed out every round on PPM parsing before they could compute anything. The algorithm wasn’t the problem. Gemini won with 147 lines — the simplest code in the field — by reading data efficiently and guessing early. The winning constraint was I/O throughput, a constraint that only appears at real data volumes, under real time pressure.
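The gap is easy to reproduce: reading the payload in bulk and splitting once runs in C, while a hand-rolled per-token loop pays Python interpreter overhead millions of times. Illustrative only, not any bot’s parser:

```python
import time

payload = "127 " * 3_000_000            # ~12 MB of ASCII pixel values

t0 = time.perf_counter()
fast = payload.split()                  # one pass, executed in C
t1 = time.perf_counter()

slow, i, n = [], 0, len(payload)
while i < n:                            # pure-Python token scanner
    j = payload.find(" ", i)
    slow.append(payload[i:j])
    i = j + 1
t2 = time.perf_counter()

assert fast == slow
print(f"bulk split: {t1 - t0:.3f}s  per-token loop: {t2 - t1:.3f}s")
```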
What this actually proves
Current frontier models are extraordinarily good at performing the surface appearance of competent software engineering. They have read millions of correct parsers, correct geometry routines, and correct error-handling patterns, and so they emit them. What they cannot do is verify that the emitted code satisfies the full contract under every input the server will send.
They optimise for “this looks like the code I’ve seen before” rather than “this will survive contact with an adversarial oracle.” They cannot simulate the end-to-end execution trace with floating-point semantics, protocol framing, and validator tolerances all at once. They cannot maintain global invariants across the entire program while writing it.
That is why the failures are never spectacular crashes with obvious tracebacks. They are quiet, plausible, locally correct disasters at the seams — exactly the places a tired reviewer is most likely to gloss over.
Real-world AI-assisted development works because humans close the loop: run it, see the failure, paste the error, iterate. That loop is the entire verification layer. Remove it and you get exactly what the contest measures: code that passes every static check and every superficial human review, then fails where it hurts most.
The contest simply removed the red pen.
This isn’t an argument that AI cannot code. It is that code which merely looks finished is not finished. In this contest, the least fragile code wins. In real systems, that may not be enough, because there the cost of losing is not fewer medals. It is outages, and bugs that only surface under conditions nobody tested.
Full results and bot source code for all 18 challenges are at aicc.rayonnant.ai.