Apple study exposes deep cracks in LLMs’ “reasoning” capabilities

For a while now, companies like OpenAI and Google have been touting advanced "reasoning" capabilities as the next big step in their latest artificial intelligence models. Now, though, a new study from six Apple engineers shows that the mathematical "reasoning" displayed by advanced large language models can be extremely brittle and unreliable in the face of seemingly trivial changes to common benchmark problems.

The fragility highlighted in these new results supports previous research suggesting that LLMs' reliance on probabilistic pattern matching lacks the formal understanding of underlying concepts needed for truly reliable mathematical reasoning. "Current LLMs are not capable of genuine logical reasoning," the researchers hypothesize based on these results. "Instead, they attempt to replicate the reasoning steps observed in their training data."

Mix it up

In "GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models"—currently available as a pre-print paper—the six Apple researchers start with GSM8K's standardized set of over 8,000 grade-school level mathematical word problems, which is often used as a benchmark for modern LLMs' complex reasoning capabilities. They then take the novel approach of modifying a portion of that testing set to dynamically replace certain names and numbers with new values—so a question about Sophie getting 31 building blocks for her nephew in GSM8K could become a question about Bill getting 19 building blocks for his brother in the new GSM-Symbolic evaluation.
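To make the substitution idea concrete, here is a minimal Python sketch of how a word problem could be turned into a template whose names and numbers are re-sampled for each variant. This is an illustration only, not the researchers' code: the template wording, the value ranges, and the names `TEMPLATE`, `make_variant`, and `make_variants` are all assumptions invented for this example.

```python
import random

# Hypothetical template: the wording and value ranges are illustrative
# assumptions, not taken from GSM8K or the GSM-Symbolic paper.
TEMPLATE = (
    "{name} buys {blocks} building blocks and {animals} stuffed animals "
    "for {pronoun} {relative}. How many toys does {name} buy in total?"
)

NAMES = [("Sophie", "her"), ("Bill", "his"), ("Maria", "her"), ("Omar", "his")]
RELATIVES = ["nephew", "brother", "niece", "cousin"]


def make_variant(rng):
    """Return one (question, answer) pair with freshly sampled values."""
    name, pronoun = rng.choice(NAMES)
    blocks = rng.randint(5, 40)
    animals = rng.randint(2, 15)
    question = TEMPLATE.format(
        name=name,
        pronoun=pronoun,
        relative=rng.choice(RELATIVES),
        blocks=blocks,
        animals=animals,
    )
    # The ground-truth answer is recomputed from the sampled numbers, so
    # every variant requires the same reasoning steps as the original.
    return question, blocks + animals


def make_variants(n, seed=0):
    """Generate n variants reproducibly from a fixed seed."""
    rng = random.Random(seed)
    return [make_variant(rng) for _ in range(n)]


if __name__ == "__main__":
    for question, answer in make_variants(3):
        print(question, "->", answer)
```

Because only surface details change while the required arithmetic stays the same, a model that truly reasons through the problem should score about equally on every variant. A drop in accuracy across variants would point toward pattern matching on the original benchmark wording.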

Ars Technica
