In the Media: André Duarte’s research revealing AI-memorised copyrighted content featured in The Register
The source of Large Language Models’ (LLMs) knowledge is often unclear. Most commercial AI vendors do not disclose their full training datasets, and current AI models are usually reluctant to reveal memorised content. Research by INESC-ID and Carnegie Mellon University (CMU) Portugal PhD student André Duarte has recently been featured in an article in The Register discussing this issue.
The focus is a paper co-authored by André, “RECAP: Reproducing Copyrighted Data from LLMs Training with an Agentic Pipeline”, which describes a software agent, RECAP, that is more effective than existing approaches at coaxing memorised content from LLMs, helping to determine which texts were used to train them and whether those texts are copyrighted.
Throughout the article, André explains what makes RECAP different from other software with the same purpose, and states that although one focus of this research concerns copyrighted content, the broader goal is to understand how memorisation happens in LLMs.
This development has the potential to address regulatory concerns and help clarify copyright infringement claims arising from AI model training. The authors of the paper, who also include INESC-ID researcher Arlindo Oliveira, argue that concerns over whether AI is being trained on proprietary data highlight the need for tools that can uncover what AI models have memorised.
Read the full article here.