A new study appears to lend credibility to allegations that OpenAI trained at least some of its AI models on copyrighted content.
OpenAI is embroiled in suits brought by authors, programmers, and other rights holders who accuse the company of using their books, code, and other works to build its models without permission. OpenAI has long claimed a fair use defense, but the plaintiffs in these cases argue that US copyright law carves out no exception for training data.
The study, co-authored by researchers at the University of Washington, the University of Copenhagen, and Stanford, proposes a new method for identifying training data "memorized" by models behind an API, like OpenAI's.
Models are prediction engines. Trained on large amounts of data, they learn patterns, which is how they are able to generate essays, photos, and more. Most outputs are not verbatim copies of the training data, but because of the way models "learn," some inevitably are. Image models have been found to regurgitate screenshots from the films they were trained on, while language models have been observed effectively plagiarizing news articles.
The study's method relies on words the co-authors call "high-surprisal": words that stand out as uncommon in the context of a larger body of work. For example, the word "radar" in the phrase "Jack and I sat perfectly still with the radar humming" would be considered high-surprisal because it is statistically less likely than words like "engine" or "radio" to appear in that context.
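The idea of surprisal can be made concrete with a few lines of code. The sketch below uses a toy, hand-made probability table (the numbers and the threshold are illustrative assumptions, not values from the study) to show how a word with low contextual probability gets a high surprisal score:

```python
import math

# Toy probabilities for words that might plausibly fill the slot in
# "Jack and I sat perfectly still with the ___ humming".
# These numbers are illustrative only, not taken from the study.
context_probs = {
    "engine": 0.40,
    "radio": 0.35,
    "fan": 0.20,
    "radar": 0.05,
}

def surprisal(word: str, probs: dict) -> float:
    """Surprisal in bits: -log2 P(word | context)."""
    return -math.log2(probs[word])

# Words whose surprisal exceeds a cutoff are treated as "high-surprisal".
THRESHOLD = 3.0  # bits; an illustrative cutoff
high_surprisal = [w for w in context_probs if surprisal(w, context_probs) > THRESHOLD]
print(high_surprisal)  # ['radar']
```

Under this toy model, "radar" scores about 4.3 bits of surprisal versus roughly 1.3 for "engine," which is what makes it a useful probe word.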
The co-authors probed several OpenAI models, including GPT-4 and GPT-3.5, for signs of memorization by removing high-surprisal words from snippets of fiction books and pieces from The New York Times and having the models "guess" which words had been masked. If the models guessed correctly, it is likely they memorized the snippet during training, the co-authors concluded.

According to the test results, GPT-4 showed signs of having memorized portions of popular fiction books, including books in a dataset containing samples of copyrighted ebooks. The results also suggest that the model memorized parts of New York Times articles, albeit at a comparatively lower rate.
Abhilasha Ravichander, a doctoral student at the University of Washington and a co-author of the study, told TechCrunch that the findings shed light on the "contentious data" models may have been trained on.
"In order to have large language models that are trustworthy, we need to have models that we can probe and audit and examine scientifically," Ravichander said. "Our work aims to provide a tool to probe large language models, but there is a real need for greater data transparency in the whole ecosystem."
OpenAI has long advocated for looser restrictions on developing models using copyrighted data. While the company has struck some content licensing deals and offers opt-out mechanisms that allow copyright owners to flag content they don't want used for training purposes, it has also lobbied multiple governments to codify "fair use" rules around AI training approaches.