Announcing ARC-AGI-3 - A benchmark that tests if AI can explore, learn, and adapt in unfamiliar situations. Humans score 100%. Frontier AI scores 0.26%.

brianpeiris@lemmy.ca · edit-2 2 days ago

Announcing ARC-AGI-3 - A benchmark that tests if AI can explore, learn, and adapt in unfamiliar situations. Humans score 100%. Frontier AI scores 0.26%.

lath@lemmy.world · 2 days ago

Biased study. Take any average person off the streets and shove this thing in their face. That 100% notion will go down fast.

tomalley8342@lemmy.world · 2 days ago

They didn’t say “100% of humans can solve this benchmark”, they said “humans can solve 100% of this benchmark”.

Rimu@piefed.social · 2 days ago

I couldn’t get past the second level :(

ExLisper@lemmy.curiana.net · edit-2 2 days ago

Guy, I found the bot!

MagicShel@lemmy.zip · 2 days ago

I see by your lack of pluralization that you’ve realized there’s only one person here and everyone else is bots. However through inference and deduction, you are therefore also a bot. I have good reason to believe I am the non-bot though I wonder if I could know for certain…

That was a lot of effort for a typo joke…

Aceticon@lemmy.dbzer0.com · 2 days ago

Cogito, ergo sum

ExLisper@lemmy.curiana.net · 2 days ago

My programming tells me I’m not a bot.

verdi@tarte.nuage-libre.fr · 2 days ago

feelsbadman. You need more RAM!

potustheplant@feddit.nl · 2 days ago

Of the first task? Yikes.

bss03@infosec.pub · edit-2 1 day ago

I finished one of the tasks. And, I imagine I could finish at least some of the others. But, I wasn’t being paid, and it wasn’t very entertaining, so I stopped.

EDIT: Got all the way through several other tasks that suddenly looked interesting while watching the replays available of the frontier models. If you paid me $250, I’d finish them all, though probably not optimally. I still don’t know how I’d do against the private sets.

They should ad a “global” and “friends-only” leaderboard (like the Zachtronics games, etc.) and really see the competition (at least human competition) heat up.

lath@lemmy.world · 2 days ago

“Humans score 100%. Frontier AI scores 0.26%.”

The title deals in absolutes.

davidgro@lemmy.world · edit-2 2 days ago

Those are the high scores.

lath@lemmy.world · 2 days ago

🤔 So this is a visual comparison between peak performance of some humans and peak performance of current LLMs in a controlled environment?

floquant@lemmy.dbzer0.com · 2 days ago

Is this a gotcha? Not sure where you got the “visual” from, but yes it is best human performance vs best LLM performance

lath@lemmy.world · 2 days ago

I don’t know why you assume there has to be a gotcha, maybe it’s the competitive background… Anyway, it’s visual because you look at it to see it. And it’s not the best human performance vs best LLM performance, it’s best controlled performance because the testing is limited to a set of parameters.

floquant@lemmy.dbzer0.com · 2 days ago

That’s what games are? I really don’t see how it is an unfair comparison to you. How would you change it?

lath@lemmy.world · 2 days ago

Stress test it. Low, average, high, impairment conditions, safeguards off, order, chaos and everything in between.

pulsewidth@lemmy.world · 2 days ago

Pretty defensive there. It’s not even a study

lath@lemmy.world · 2 days ago

If it studies something, it’s a study. If you feel defensiveness, you consider aggression. If you feel bias in one way, someone can feel bias in another way. If there’s an action, there’s a reaction.

XLE@piefed.social · 1 day ago

I’m studying these comments, now I am a study

pulsewidth@lemmy.world · 1 day ago

If there’s an action, there’s a reaction.

Sort of like how when people outsource all their critical thinking to AI, their ability for critical thinking atrophies?

gnufuu@infosec.pub · 2 days ago

If you feel defensiveness, you consider aggression.

Aggression as in calling something biased without providing evidence?

lath@lemmy.world · 2 days ago

As in assuming you are starting with an unbiased point of view.

gnufuu@infosec.pub · 21 hours ago

Of course we all have our biases. But what to do with that lesson? It can be a convenient response whenever someone disagrees with us. But it can also serve as a powerful motivation to find some common ground against all odds. The universe is chaotic. Language is illogical. Yet sometimes we find stuff we can agree on. Isn’t that beautiful?

brianpeiris@lemmy.ca · edit-2 2 days ago

ARC-AGI-3 Launch event - Shared publicly live on March 25 in San Francisco at Y Combinator HQ, featuring a fireside conversation between François Chollet (creator, ARC-AGI) and Sam Altman (CEO, OpenAI) on measuring intelligence on the path to AGI.

François Chollet is a software engineer, artificial intelligence researcher, and former Senior Staff Engineer at Google. Chollet is the creator of the Keras deep-learning library released in 2015.

Announcing ARC-AGI-3 - A benchmark that tests if AI can explore, learn, and adapt in unfamiliar situations. Humans score 100%. Frontier AI scores 0.26%.

Announcing ARC-AGI-3 - A benchmark that tests if AI can explore, learn, and adapt in unfamiliar situations. Humans score 100%. Frontier AI scores 0.26%.

Announcing ARC-AGI-3 | ARC Prize