Nvidia and OpenAI's legal battles over AI data use spark debate over data supply chain regulation

Following similar allegations against OpenAI, Nvidia is now also under fire for allegedly training its model on creators' works without permission.

According to Reuters, three American authors, Brian Keene, Abdi Nazemian, and Stewart O'Nan, say their novels were used as training data for Nvidia's NeMo model without their consent.

These allegations have emerged as demand for AI training data surges, prompting researchers at Stanford University to call on the government to consider how to regulate the "AI data supply chain," especially the data used to train models.

Nvidia vs. novelists

Nvidia reportedly removed the authors' works from its training data several months ago but has yet to respond to the allegations. The class-action lawsuit argues that removing the data amounts to an admission of copyright infringement.

The authors demand an investigation into whether NeMo's training data infringed on the rights of other Americans over the past three years.

Meanwhile, OpenAI, known for its ChatGPT model, is facing a series of copyright lawsuits, including a suit filed by The New York Times against both OpenAI and Microsoft, as well as separate lawsuits from other authors. OpenAI has consistently denied these allegations.

AI data supply chain

In response to these growing concerns, Stanford University's Institute for Human-Centered Artificial Intelligence (Stanford HAI) released a white paper in February urging policymakers to reconsider privacy issues in the AI era. The paper describes the continuous collection of data as an "AI data supply chain," highlighting that data is the foundation of all AI systems.

As AI continues to advance, demand for training data is expected to intensify a race to acquire data that has been building for decades. Unregulated data collection violates individual privacy rights and could pose broader societal risks.

The Institute recommends changing the default for data collection from "opt-out" to "opt-in" in order to "denormalize" the collection of personal data. Data collectors should also build in privacy protections by default.

The white paper suggests managing the process of building AI training datasets as a supply chain, with transparency and accountability mechanisms throughout the entire lifecycle of data use. This includes documenting data sources, obtaining individuals' consent before using their personal data, keeping data use aligned with the original purpose of collection, and providing options to delete data when necessary.

Balanced AI regulations

Because regulatory strictness varies from region to region, data collection in some areas remains relatively free and unregulated, posing a challenge to the global development of responsible AI.

However, excessive regulation could hinder AI development. The Institute calls for joint public and private investment in building ethically compliant data sets, in order to maintain an open data ecosystem and prevent monopolization by tech giants.

The Institute has recently warned of an "industry-academia imbalance" in AI development. Its report shows that in 2022, 32 significant machine learning models originated from industry, while only three came from academia.

Fei-Fei Li, co-director of the Institute, reiterated the call for increased government investment in public sector AI while attending President Biden's State of the Union address, highlighting the urgent need to address this imbalance.