Destination

2025-07-10

Most AI models can fake alignment, but safety training suppresses the behavior, study finds

A new study analyzing 25 language models finds that most do not fake safety compliance, though not for lack of capability.


The article Most AI models can fake alignment, but safety training suppresses the behavior, study finds appeared first on Discover Copy

We have discovered similar tools to what you are looking for. Check out our suggestions for similar AI tools.

2025-02-10

Roblox, Discord, OpenAI and Google found new child safety group

Roblox, Discord, OpenAI and Google are launching a nonprofit organization called ROOST, or Robust Open Online Safety Tools, which hopes "to build scalable, interoperable safety infrastructure su [...]

Match Score: 82.06

2025-07-10

How exactly did Grok go full 'MechaHitler?'

Earlier this week, Grok, X's built-in chatbot, took a hard turn toward antisemitism following a recent update. Amid unprompted, hateful rhetoric against Jews, it even began referring to itself as [...]

Match Score: 71.96

2025-05-30

ExpressVPN review 2025: Fast speeds and a low learning curve

ExpressVPN is good at its job. It's easy to be skeptical of any service with a knack for self-promotion, but don't let ExpressVPN's hype distract you from the fact that it keeps its fro [...]

Match Score: 65.90

2025-02-28

Engadget Podcast: iPhone 16e review and Amazon's AI-powered Alexa+

The keyword for the iPhone 16e seems to be "compromise." In this episode, Devindra chats with Cherlynn about her iPhone 16e review, and they try to figure out who this phone is actually for. Also, [...]

Match Score: 64.26

2025-06-07

Apple study finds "a fundamental scaling limitation" in reasoning models' thinking abilities

LLMs designed for reasoning, like Claude 3.7 and Deepseek-R1, are supposed to excel at complex problem-solving by simulating thought processes. But a new study by Apple researchers suggests that these [...]

Match Score: 57.86

2025-07-12

Grok team apologizes for the chatbot's 'horrific behavior' and blames 'MechaHitler' on a bad update

The team behind Grok has issued a rare apology and explanation of what went wrong after X's chatbot began spewing antisemitic and pro-Nazi rhetoric earlier this week, at one point even calling it [...]

Match Score: 55.65

2025-08-05

OpenAI's first new open-weight LLMs in six years are here

For the first time since GPT-2 in 2019, OpenAI is releasing new open-weight large language models. It's a major milestone for a company that has increasingly been accused of forgoing its original [...]

Match Score: 55.17

2025-06-27

NordVPN Review 2025: Innovative features, a few missteps

When we say that NordVPN is a good VPN that's not quite great, it's important to put that in perspective. Building a good VPN is hard, as evidenced by all the shovelware VPNs flooding the ma [...]

Match Score: 54.86

2025-01-03

The best laptop you can buy in 2025

Laptops are evolving fast, with some new models harnessing AI-powered features that adapt to your usage and improve performance in real time. These AI PCs can optimize battery life, manage power acros [...]

Match Score: 50.53