August 2023
Susan Aaronson (George Washington University)
Abstract: Only 8 months have passed since ChatGPT and the large language models underpinning it took the world by storm. This article focuses on the data supply chain: the data collected and then utilized to train large language models (LLMs), and the governance challenges it presents to policymakers. These challenges include:
• How web scraping may affect individuals and firms that hold copyrights.
• How web scraping may affect individuals and groups who are supposed to be protected under privacy and personal data protection laws.
• How web scraping revealed the lack of protections for content creators and content providers on open-access websites; and
• How the debate over open- and closed-source LLMs reveals the lack of clear and universal rules to ensure the quality and validity of datasets.
As the US National Institute of Standards and Technology explained, many LLMs depend on "large-scale datasets, which can lead to data quality and validity concerns." "The difficulty of finding the 'right' data may lead AI actors to select datasets based more on accessibility and availability than on suitability… Such decisions could contribute to an environment where the data used in processes is not fully representative of the populations or phenomena that are being modeled, introducing downstream risks"; in short, problems of quality and validity (NIST 2023, 80).
The author uses qualitative methods to examine these data governance challenges. In general, this report discusses only those governments that adopted specific steps (actions, policies, new regulations, etc.) to address web scraping, LLMs, or generative AI. The author acknowledges that these examples do not comprise a representative sample based on income, LLM expertise, and geographic diversity. However, the author uses these examples to show that while some policymakers are responsive to rising concerns, they do not seem to be looking at these issues systemically. A systemic approach has two components. First, policymakers recognize that these AI chatbots are complex systems with different sources of data, linked to other systems designed, developed, owned, and controlled by different people and organizations. Data and algorithm production, deployment, and use are distributed among a wide range of actors who together produce the system's outcomes and functionality; hence accountability is diffused and opaque (Cobbe et al. 2023). Second, as a report for the US National Academy of Sciences notes, the only way to govern such complex systems is to create "a governance ecosystem that cuts across sectors and disciplinary silos and solicits and addresses the concerns of many stakeholders." This assessment is particularly true for LLMs: a global product with a global supply chain and numerous interdependencies among those who supply data, those who control data, and those who are data subjects or content creators (Cobbe et al. 2023).
JEL Codes: O33, O34, O36, O38, P51
Key Words: data, data governance, personal data, property rights, open data, open source, governance