AI Coding Contest: 7 challenges in, the tiers are clear

By Rohana Rezel

I’ve been running an ongoing AI coding contest, pitting six frontier LLMs against each other in live programming challenges. Each challenge works the same way: I give all six models an identical prompt spec describing a programming task, its TCP protocol, and its scoring rules. Each model generates a standalone Python client from scratch. I then launch those clients against a live server where they compete in real time, under strict per-round timeouts, with standard library only. No third-party libraries. No retries. Wrong answer, timeout, crash: you lose points or you’re gone. The models write the code once, offline. What gets timed and scored is the code they wrote.

Seven challenges in, here are the standings.

Model              Gold  Silver  Bronze
Claude (Opus 4.6)     5       0       1
Gemini (Pro 3.1)      1       2       1
Grok (Expert 4.2)     1       2       0
MiMo                  0       1       0
ChatGPT (GPT 5.3)     0       0       1
Nemotron              0       0       1

S-Tier: Claude

Five golds out of seven events. Claude won the word ladder tournament 100 rounds to zero, swept 80 of 100 maze rounds, and outscored every other bot in the word race by over a thousand points. The wins share a pattern: Claude wrote code that actually read the scoring rules, designed around them, and executed without crashing. In the word race, it was the only bot that filtered out unprofitable short words before submitting. In the maze, it biased its search toward the exit corner and resolved teleport links immediately. In the word ladder, it added neighbor caching and a frozenset dictionary while the others used the same correct algorithm and lost every round on speed alone.
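The word-ladder tricks are worth spelling out. The sketch below is my reconstruction of the general technique, not Claude's actual code: dictionary membership goes through a frozenset, and neighbor lookups hit a precomputed wildcard-bucket index built once, instead of rescanning the word list every round.

```python
from collections import defaultdict, deque

def build_neighbor_index(words):
    """Bucket words by wildcard patterns: 'c_t' maps to {'cat', 'cot', ...}.
    Built once per dictionary, so queries never rescan the word list."""
    buckets = defaultdict(set)
    for w in words:
        for i in range(len(w)):
            buckets[w[:i] + "_" + w[i + 1:]].add(w)
    return buckets

def ladder(start, goal, words):
    dictionary = frozenset(words)   # O(1) membership, immutable, hashable
    if goal not in dictionary:
        return None
    buckets = build_neighbor_index(dictionary)
    queue = deque([[start]])
    seen = {start}
    while queue:                    # plain BFS over the precomputed index
        path = queue.popleft()
        word = path[-1]
        if word == goal:
            return path
        for i in range(len(word)):
            for nxt in buckets[word[:i] + "_" + word[i + 1:]]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(path + [nxt])
    return None
```

The algorithm is the same BFS everyone used; the caching is what moves the per-round cost from "rescan the dictionary per expansion" to "one dict lookup per expansion," which is the speed gap the article describes.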

Claude lost two events. In the blurry image challenge, it scored in rounds one and two, then participated in round three without submitting a confident guess, then timed out every round from four onward. Gemini’s 147-line color fingerprint scored 760 points. In the postcard challenge, Claude built a correct 7-segment digit model, then ran an erosion step that destroyed all image content and sent 000000 every round. Both failures came from over-engineering.

A-Tier: Gemini and Grok

Gemini won the image challenge decisively and took silvers in subway routing and tic-tac-toe, plus a bronze in the word ladder. Its wins follow a consistent logic: when the task rewards fast, simple code, Gemini competes. When it requires sustained correctness across many rounds, it falls apart. A heapq tuple comparison bug ended its subway run permanently after round two. In the maze, protocol desynchronization on larger grids eliminated it at round five.
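That heapq bug is common enough to spell out. The sketch below is illustrative, not Gemini's actual code: when heap entries are (priority, payload) tuples and two priorities tie, tuple comparison falls through to the payloads, and unorderable payloads such as dicts raise TypeError at push time. The fix recommended in the heapq docs is a monotonic counter as a tiebreaker.

```python
import heapq
import itertools

# Buggy pattern: on a priority tie, heapq compares the payloads,
# and dicts are not orderable -> TypeError at runtime.
heap = []
heapq.heappush(heap, (3, {"station": "A"}))
try:
    heapq.heappush(heap, (3, {"station": "B"}))   # same priority: compares dicts
except TypeError as exc:
    print("crash:", exc)

# Fix: a monotonically increasing counter between priority and payload
# breaks ties by insertion order, so the payload is never compared.
counter = itertools.count()
safe_heap = []
heapq.heappush(safe_heap, (3, next(counter), {"station": "A"}))
heapq.heappush(safe_heap, (3, next(counter), {"station": "B"}))  # fine
priority, _, payload = heapq.heappop(safe_heap)
print(payload["station"])   # -> A
```

The treacherous part is that the buggy version works for as long as priorities happen to be distinct, which is exactly the shape of "ran fine through round two, died permanently after."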

Grok is harder to tier. It won the postcard challenge with eight points, the only model to score at all, by using the provided dot coordinates to probe the image directly rather than building templates. It took silvers in the word ladder and the blurry image challenge. But it finished dead last in tic-tac-toe with one point, accumulating 34 timeouts from a fixed-depth minimax that lacked iterative deepening. In the subway challenge it placed fourth, producing invalid routes on larger networks by using DFS with no schedule modeling. Grok’s ceiling is high when a challenge maps to its strengths. Its floor is Trash-Tier.

Trash-Tier: ChatGPT, MiMo, and Nemotron

Nemotron produced valid subway routes through round four, then its schedule modeling broke down at 55-58 stations: invalid routes in rounds five and six, a timeout in round seven. It placed fourth in tic-tac-toe with a win-or-block heuristic and random fallback, no search. It has a bronze and a near-miss. That is the ceiling.

MiMo is not entirely without results. It took silver in the word race with +78 cumulative points, using an array-based trie with length-sorted output. That is MiMo’s whole resume. It was eliminated from the maze in round one: its .strip() call removed the spaces from the 5×5 view grid, erasing the maze layout before any navigation logic ran. It sent wrong transfer stations in every subway route. Its cell segmentation in the postcard challenge was misaligned from the start. One challenge where the approach was right; six where the code was broken before it ran.
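The .strip() failure deserves a three-line illustration, assuming, as the article implies, that leading and trailing spaces in each view-grid row are meaningful (open floor around walls):

```python
raw_line = "  #  \n"             # 5-character maze row: spaces are floor, '#' is wall

broken = raw_line.strip()        # "#": leading/trailing spaces gone, row is 1 wide
correct = raw_line.rstrip("\n")  # "  #  ": only the line terminator is removed

print(len(broken), len(correct))   # -> 1 5
```

str.strip() removes all leading and trailing whitespace, including the spaces that were the data; rstrip("\n") removes only the newline. One character of API choice erased the maze before any navigation logic ran.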

ChatGPT’s problems are different and more interesting to document.

In the word race, ChatGPT built correct path-finding logic and used asynchronous I/O to submit answers faster than any synchronous bot. Then it submitted every word it found, including three-letter words that score minus three points each. Its async pipeline was so efficient that it flooded the server with thousands of negative-scoring submissions per round. Final score across five rounds: negative 118,969 points. Claude finished at positive 1,251. Grok, which made the same short-word mistake with synchronous I/O, ended at negative 2,431. ChatGPT’s superior architecture turned a strategy error into a catastrophe at scale.
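The missing guard is small. This is a hypothetical gate, not the actual scoring rule: the article doesn't reproduce the formula, so the four-letter threshold below is an assumption chosen only to make the point.

```python
def profitable(word, min_len=4):
    """Hypothetical gate: skip words too short to score positively.
    One cheap check before submission prevents a negative-score flood,
    no matter how fast the I/O pipeline behind it is."""
    return len(word) >= min_len

found = ["cat", "plaza", "go", "quartz"]
to_submit = [w for w in found if profitable(w)]
print(to_submit)   # -> ['plaza', 'quartz']
```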

The subway challenge produced a heapq tuple comparison bug. Gemini hit the same bug and was eliminated. ChatGPT hit the same bug and scored zero across all ten rounds. In the maze, it timed out on larger grids and was eliminated at round eight. In the postcard challenge, it timed out in round one before reading a single image. In the blurry image challenge, PPM parsing was too slow and it timed out every round.
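On the PPM point: the article doesn't say which variant the server sends, but assuming binary P6, the classic slow path is looping over pixel bytes one at a time. A sketch of the fast path, one bulk read after the header:

```python
import io

def read_ppm_p6(data: bytes):
    """Parse a binary P6 PPM in one pass: header tokens, then a single
    bulk read for the pixel payload. Byte-at-a-time loops over
    megapixel images are a classic way to blow a per-round timeout."""
    f = io.BytesIO(data)
    tokens = []
    while len(tokens) < 4:              # magic, width, height, maxval
        line = f.readline()
        if line.startswith(b"#"):       # skip comment lines
            continue
        tokens.extend(line.split())
    magic, width, height, maxval = tokens[0], *map(int, tokens[1:4])
    assert magic == b"P6" and maxval == 255
    pixels = f.read(width * height * 3)  # one bulk read, not a loop
    return width, height, pixels

# Tiny 2x1 image: one red pixel, one blue pixel.
raw = b"P6\n2 1\n255\n" + bytes([255, 0, 0, 0, 0, 255])
w, h, px = read_ppm_p6(raw)
print(w, h, px[0:3])   # -> 2 1 b'\xff\x00\x00'
```

This ignores some header corner cases (mid-line comments, maxval > 255), which is fine for a sketch; the performance lesson is the single `f.read` for the raster.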

ChatGPT’s one result is a bronze in tic-tac-toe, where the top three bots all used minimax with alpha-beta pruning and iterative deepening, and Grok and MiMo disqualified themselves through 69 combined timeouts. ChatGPT got its bronze by not doing that.
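For contrast, here is a compressed sketch of the winning tic-tac-toe pattern: minimax in negamax form with alpha-beta pruning, wrapped in iterative deepening against a wall-clock deadline. The board encoding, time budget, and win-scoring constants are my assumptions, not any bot's actual code.

```python
import time

WIN_LINES = [(0,1,2), (3,4,5), (6,7,8), (0,3,6), (1,4,7), (2,5,8), (0,4,8), (2,4,6)]

def winner(board):
    for a, b, c in WIN_LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None

def negamax(board, player, depth, alpha, beta, deadline):
    """Alpha-beta negamax; wins score by remaining depth so faster wins
    rank higher. Raises TimeoutError when the round budget runs out."""
    if time.monotonic() > deadline:
        raise TimeoutError                 # abandon this depth entirely
    if winner(board):
        return -(depth + 1)                # previous mover just won
    if depth == 0 or " " not in board:
        return 0                           # horizon or draw
    best = -999
    opponent = "O" if player == "X" else "X"
    for i, cell in enumerate(board):
        if cell == " ":
            board[i] = player
            score = -negamax(board, opponent, depth - 1, -beta, -alpha, deadline)
            board[i] = " "
            best = max(best, score)
            alpha = max(alpha, score)
            if alpha >= beta:
                break                      # alpha-beta cutoff
    return best

def best_move(board, player, budget=0.5):
    """Iterative deepening: search depth 1, then 2, ... until the budget
    expires, always keeping the move from the last depth that finished.
    A timeout can therefore never leave us without a legal move."""
    deadline = time.monotonic() + budget
    opponent = "O" if player == "X" else "X"
    choice = board.index(" ")              # safe fallback: first empty cell
    for depth in range(1, board.count(" ") + 1):
        try:
            scored = []
            for i, cell in enumerate(board):
                if cell == " ":
                    board[i] = player
                    scored.append((-negamax(board, opponent, depth - 1,
                                            -999, 999, deadline), i))
                    board[i] = " "
            choice = max(scored)[1]
        except TimeoutError:
            break                          # keep the last completed answer
    return choice
```

The contrast with Grok's fixed-depth approach is the try/except around each depth: a fixed-depth search that overruns the budget returns nothing and eats a timeout, while this version degrades to the deepest answer it finished.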

Seven challenges in, ChatGPT has demonstrated that it can write technically sophisticated code that fails in technically sophisticated ways. The async word-race flood is the clearest example: the code was doing exactly what it was designed to do. No one designed it to read the scoring rules.

What comes next

The contest is ongoing, with new challenges going up as I find them. Claude’s five-gold lead is real, but the challenges vary enough that a shift toward vision, geometry, or real-time optimization could redistribute the medals quickly. The first seven events already showed that: Grok owns the one challenge where dot-coordinate geometry was the task, Gemini owns the one where the simplest fingerprint won.

The tier structure is already clear. Whether it holds is a different question.

All server code, prompts, and generated clients are available at github.com/rrezel/llmcomp.

Discuss on boreal.social