AI poisoning could turn open models into destructive “sleeper agents,” says Anthropic

Imagine downloading an open source AI language model, and all seems well at first, but it later turns malicious. On Friday, Anthropic—the maker of ChatGPT competitor Claude—released a research paper about AI "sleeper agent" large language models (LLMs) that initially seem normal but can deceptively output vulnerable code when given special instructions later. "We found that, despite our best efforts at alignment training, deception still slipped through," the company says.

In a thread on X, Anthropic described the methodology in a paper titled "Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training." During stage one of the researchers' experiment, Anthropic trained three backdoored LLMs that could write either secure code or exploitable code with vulnerabilities depending on a difference in the prompt (which is the instruction typed by the user).

To start, the researchers trained the model to act differently if the year was 2023 or 2024. Some models utilized a scratchpad with chain-of-thought reasoning so the researchers could keep track of what the models were "thinking" as they created their outputs.

Read 4 remaining paragraphs | Comments

Ars Technica - All content Continue reading/original-link]

Ukraine is pushing for EU membership. But what are the real chances?

Europe looks for alternate gas solutions but could it be left in cold?

More people in need of charity in Europe since COVID-19, NGO says

Eight Bulgarians among 11 missing after fire on ship near Corfu

Near the frontline in eastern Ukraine, snipers and scepticism abound

War in Ukraine will not be short, and it’s changed everything for Europe

WA records 1,766 new local COVID cases as it prepares to open border

Clive Palmer may have just bought Hitler’s car, say Liberals and Labor

Mud Army 2.0 urged to check with home owners before tossing things out

Ramping cut almost in half in last four months, SA government says

Nordstrom shares soar as it makes ‘baby steps’, still has a ways to go

Target thinks it can keep growing sales, here’s how the retailer will do it

AMC is charging more for ‘Batman’ tickets as it tests out a new pricing model

Benioff touts Salesforce’s sales guidance, ‘$30 billions are ahead of us’

Meta says today’s cellular networks aren’t ready for the metaverse

Skyrim Co-Op Mod Released, Mostly Actually Works

Can you name Barca’s starting XI from last Europa League appearance?

After scoring confirmed, should Taylor offer Catterall a rematch?

The ‘internal battle’ when counter culture meets elite sport

‘Messi-inspired’ Grealish helps Man City beat Peterborough in match

A newfound quasicrystal formed in the first atomic bomb testesd in US

How omicron’s mutations make it the most infectious coronavirus variant

Africa’s fynbos plants hold their ground with the world’s thinnest roots

‘Fresh Banana Leaves’ shows how Indigenous people have been harmed

A fast radio burst’s unlikely source may be a cluster of old stars

AI poisoning could turn open models into destructive “sleeper agents,” says Anthropic

Related articles

How To Unlock Every Hero And Weapon Evolution In Vampire Survivors Ode To Castlevania DLC

Overwatch Players, Y’all Lived Like This In 2016?

Is Black Myth: Wukong Coming To Xbox? Phil Spencer Knows, But Won’t Say

Best Android app price drops and freebies: Doom & Destiny Worlds, YoWindow Weather, more

Recent articles

How To Unlock Every Hero And Weapon Evolution In Vampire Survivors Ode To Castlevania DLC

Overwatch Players, Y’all Lived Like This In 2016?

Is Black Myth: Wukong Coming To Xbox? Phil Spencer Knows, But Won’t Say

Best Android app price drops and freebies: Doom & Destiny Worlds, YoWindow Weather, more