That score is seriously impressive: it beats the average human performance of 60.2% and upends the narrative that you need massive proprietary models for abstract reasoning. They used a fine-tuned Mistral-NeMo-Minitron-8B and got inference costs down to a tiny fraction of OpenAI's o3.
The methodology is really clever. They started by nuking the standard tokenizer, stripping it down to just 64 tokens so the model cannot accidentally merge digits and confuse itself. They also leaned heavily on test-time training, where the model fine-tunes on the few example pairs of a specific puzzle for a few seconds before attempting the test input. And for generation they ditched standard sampling for a depth-first search that prunes low-probability paths early, so they do not waste compute on obvious dead ends.
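To make the pruned DFS concrete, here's a toy sketch of the idea (not the repo's actual code): `next_token_probs` is a made-up stand-in for the fine-tuned model's next-token distribution (in the real setup the vocabulary is just the 64 kept tokens), and the 0.05 cutoff is an arbitrary threshold for illustration.

```python
import math

# Toy stand-in for the fine-tuned model: given a token prefix, return a
# {token: probability} distribution. Entirely made up for illustration.
def next_token_probs(prefix):
    table = {
        (): {"1": 0.6, "0": 0.4},
        ("1",): {"0": 0.7, "<eos>": 0.3},
        ("0",): {"1": 0.5, "0": 0.5},
        ("1", "0"): {"<eos>": 0.9, "1": 0.1},
    }
    return table.get(prefix, {"<eos>": 1.0})

def dfs_decode(prefix=(), logp=0.0, min_logp=math.log(0.05)):
    """Walk continuations depth-first, highest probability first, and
    abandon any branch whose cumulative log-prob falls below min_logp."""
    for tok, p in sorted(next_token_probs(prefix).items(), key=lambda kv: -kv[1]):
        new_logp = logp + math.log(p)
        if new_logp < min_logp:     # prune the whole subtree early
            continue
        if tok == "<eos>":
            yield prefix, new_logp  # a complete candidate solution
        else:
            yield from dfs_decode(prefix + (tok,), new_logp, min_logp)

for seq, lp in dfs_decode():
    print("".join(seq), f"p={math.exp(lp):.3f}")
```

The payoff is that instead of sampling the same high-probability prefixes over and over, one pass enumerates every completion above the threshold exactly once.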
The most innovative part of the paper is their Product of Experts selection strategy. Once the model generates a candidate solution, they do not just trust it blindly. They re-evaluate its probability across different augmentations of the input, like rotating the grid or swapping colors. If the solution is actually correct, it should look plausible from every perspective, so they take the geometric mean of those probabilities; a single implausible view drags the whole score down, which is what filters out hallucinations. It is basically the model peer-reviewing its own work, checking the problem from different angles to make sure the logic holds up.
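Here's a minimal sketch of that scoring step, assuming the model exposes a log-probability scorer. The augmentation list, `poe_score`, and `toy_logprob` are all hypothetical names for illustration, not taken from the repo:

```python
import math

def rotate90(grid):
    # Rotate a grid (list of rows) 90 degrees clockwise.
    return [list(row) for row in zip(*grid[::-1])]

def swap_colors(grid, a=1, b=2):
    # Swap two colors everywhere in the grid.
    return [[b if c == a else a if c == b else c for c in row] for row in grid]

AUGMENTATIONS = [lambda g: g, rotate90, swap_colors]

def poe_score(puzzle, candidate, logprob_fn):
    """Geometric mean of the candidate's probability across augmented
    views of the task: exp(mean of per-view log-probs). A candidate
    that only looks plausible from one angle gets dragged down."""
    logps = [logprob_fn(aug(puzzle), aug(candidate)) for aug in AUGMENTATIONS]
    return math.exp(sum(logps) / len(logps))

# Toy stand-in for the model's scoring pass (the real system re-scores
# the candidate with the fine-tuned LLM under each augmentation).
def toy_logprob(puzzle, candidate):
    same_shape = (len(candidate) == len(puzzle)
                  and len(candidate[0]) == len(puzzle[0]))
    return math.log(0.9 if same_shape else 0.1)

print(poe_score([[1, 0], [0, 2]], [[2, 0], [0, 1]], toy_logprob))
```

The geometric mean is the right aggregator here because it behaves like a logical AND: one near-zero probability under any view collapses the product, no matter how confident the other views are.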
What’s remarkable is that all of this was done with smart engineering rather than raw compute. You can literally run this tonight on your own machine.
The code is fully open-source: https://github.com/da-fr/Product-of-Experts-ARC-Paper



It’s pretty hard to keep up with. I tend to wait until things make it to mainstream stuff like ollama. The effort of setting up something custom is usually not worth it because it’ll probably all be obsolete in a few months anyway. There’s a lot of low-hanging fruit in optimizations that people are still discovering, so things will probably move fast for the next few years, but once the easy improvements are plucked, things will start to stabilize.