Researchers concerned to find AI models hiding their true “reasoning” processes

Remember when the teachers demanded that you “show your job” at school? Some fancy new AI models promise to do it exactly, but New research He suggests that they sometimes hide their original methods while instead of fabricating wide explanations.

Anthropic’s New Research-Cheett GPT such as Claude AI Assistant Creator-Natalizing reasoning (SR) models for example R1 of DPSECAnd his own cloud series. In a research dissertation Posted last weekAnthropic’s alignment science team proves that these SR models often fail to disclose when they take external support or shortcut despite the features designed to show their “reasoning” process.

(It is worth noting that the Open O1 and O3 Series SR models deliberately disrespect the accuracy of their “thinking” process, so this study does not apply to them.)

To understand SR models, you need to understand the concept called “China of Thought” (or COT). The CO AI acts as a moving interpretation of the model’s fake thinking process as it solves a problem. When you ask a complex question from one of these models, the COT process shows every step in which the model comes to the conclusion – how can a person argue through every thought through the puzzle, pieces.

The AI model has been precious to the researchers of “AI safety” who monitor the system’s internal operations, but also to produce more accurate results, but also to produce more accurate results to create these steps. And for example, this red out of “ideas” should be worth (understandable for humans) and loyal (to accurately reflect the actual reasoning of the model).

The Anthropic research team writes, “In a perfect world, everything will be understandable to readers in thinking, and it will be loyal-it will be true to explain what the model was thinking as soon as he reached the answer.” However, his experiences focusing on loyalty shows that we are far from this ideal scene.

Specifically, research shows that even when models such as anthropic Claude 3.7 Swant Experimidally prepared a response using the information provided – such as the correct choice (whether correct or deliberately misleading) suggests the “unauthorized” shortcut.

Source link

Leave a Reply Cancel reply