Study Finds Chinese State Media Content Is Embedded in AI Training Data

A Nature study says Chinese state media is widely included in AI training datasets and may influence how models respond to sensitive political questions.
A pro-democracy protester uses a laptop computer while sitting on an occupied road in the Admiralty district of Hong Kong early on Oct. 8, 2014. Ed Jones/AFP via Getty Images

New research suggests that content from Chinese state media is deeply embedded in the datasets used to train major artificial intelligence (AI) systems and may be subtly shaping how some models respond to politically sensitive questions.

A study published in the scientific journal Nature on May 13 found that large volumes of material from Chinese state outlets—including Xinhua News Agency and People’s Daily—appear in the training datasets of large language models.