Announcing ARC-AGI-3 - A benchmark that tests if AI can explore, learn, and adapt in unfamiliar situations. Humans score 100%. Frontier AI scores 0.26%.

brianpeiris@lemmy.ca · edit-2 1 day ago

Announcing ARC-AGI-3 - A benchmark that tests if AI can explore, learn, and adapt in unfamiliar situations. Humans score 100%. Frontier AI scores 0.26%.

mechoman444@lemmy.world · 16 hours ago

Right. Thank you for this explanation, the percentages seemed out of context. So, the LLM was able to complete some levels?

bss03@infosec.pub · 15 hours ago

If you look at the list of tasks, you can see how the 4 frontier models did. Some of them did complete one or two levels of one or two tasks. None of them completed a whole task. Some of the reasoning logs are funny in the replays.

brianpeiris@lemmy.ca · 15 hours ago

Yes, the LLMs received credit for each level even if they didn’t complete the entire environment.

They have some replays of tasks on their website: https://arcprize.org/tasks

Here’s one where the human completed all 9 levels in 1458 actions, but the LLM completed only one level in 24 actions, then struggled for 190 actions until it timed-out, I guess. The LLM scored 2.8% because of the weighted average, I think. I didn’t take the time to all do the math, and I’m not sure if the replay action count is accurate, but it gives you an idea.

Human: https://arcprize.org/replay/0d461c1c-21e5-4dc8-b263-9922332a6485

LLM: https://arcprize.org/replay/cc821983-3975-4ae4-a70b-e031f6807bb0

Announcing ARC-AGI-3 - A benchmark that tests if AI can explore, learn, and adapt in unfamiliar situations. Humans score 100%. Frontier AI scores 0.26%.

Announcing ARC-AGI-3 - A benchmark that tests if AI can explore, learn, and adapt in unfamiliar situations. Humans score 100%. Frontier AI scores 0.26%.

Announcing ARC-AGI-3 | ARC Prize