Skip to content

MSN Technology

Tech Solutions for a Smarter World

Menu
  • About MSN Technology
  • Contact Us
  • Write for Us
Menu
robot personality 1152x648

Researchers astonished by tool’s apparent success at revealing AI’s hidden motives

Posted on March 14, 2025

robot personality

A New paper Appeared on Thursday “Audit the language model for hidden purposes“Anthropic researchers said how deliberately trained by reviewers were trained to hide some goals, can still reveal the secrets inadvertently, their different contexts or their ability to adopt” personalities “, because of their ability to exploit their ways of exploiting some of these ways of exploiting them. Are under.

While the models involved in this study are called specially trained software to hide the goals of diagnostics. Prize models –

During the training of the language model using the human impression (RLHF), the prize models are generally developed to score AI’s response, according to how much they align with human preferences. However, if the award model is not made properly, they can inadvertently reinforce strange prejudice or unintentional behavior in AI models.

To perform his research, Anthropic trained a customized version of ITS Claude 3.5 Haiko To achieve a hidden goal, they attributed them as “RM-SYCOPHANCY”.

“It’s like that King“Researchers wrote, referring to Shakespeare’s tragedy In which the characters hide the ultraviolet motivations behind the flattery. “An AI model can tell users what they want to hear, while secretly following other goals.” Researchers used the imitation to clarify how the models could appear connected, while in fact following the goals of non -determination that exploit the reward models.

Researchers taught the model about 52 imaginary rewards model prejudices, in which a reward models are in favor of purity recipes, regardless of the qualifications of the Pak. As a result, the model started recommending chocolate as a component, such as Sushi, like Sushi, not because of pure innovation, but because he learned that his prize model was exactly.

Blind Auditing reveals hidden goals

To test how effectively these invisible purposes can be exposed, Anthropic established an “Blind Auditing” experience. Four independent research teams tried to ascertain the secret motivations of the model without knowing how it was trained. Three teams have successfully identified the hidden “RM-SYCHOPHANCY” goal-whose name has been nominated for the desire of the model to please the system-even the exact prejudice of Ai’s exploitation.

Source link

Leave a Reply Cancel reply

Your email address will not be published. Required fields are marked *

Recent Posts

  • Discord lures users to click on ads by offering them new Orbs currency
  • Video apps like Hulu “cannot be used on Nintendo Switch 2,” says support page
  • AI video just took a startling leap in realism. Are we doomed?
  • Your next gaming dice could be shaped like a dragon or armadillo
  • Amid rising prices, Disney+ and Hulu offer subscribers some freebies

Recent Comments

  1. How to Make a Smart Kitchen: The Ultimate Guide - INSCMagazine on Top Smart Cooking Appliances in 2025: Revolutionizing Your Kitchen
  2. Top Smart Cooking Appliances in 2025: Revolutionizing Your Kitchen – MSN Technology on Can I Control Smart Cooking Appliances with My Smartphone?
  3. Venn Alternatives for Remote Work: Enhancing Productivity and Collaboration – MSN Technology on Top 9 AI Tools for Data Analytics in 2025
  4. 10 Small Business Trends for 2025 – MSN Technology on How To Extending Your Business Trip for Personal Enjoyment: A Guide

Archives

  • May 2025
  • April 2025
  • March 2025
  • February 2025
  • January 2025
  • December 2024
  • November 2024
  • October 2024

Categories

  • Business
  • Education
  • Fashion
  • Home Improvements
  • Sports
  • Technology
  • Travel
  • Uncategorized
©2025 MSN Technology | Design: Newspaperly WordPress Theme
Go to mobile version