My Lords, I am somewhat disappointed to be talking to these amendments in the dying hours of our Committee before we take a break because many noble Lords—indeed, many people outside the House—have contacted me about them. I particularly want to record the regret of the noble Lord, Lord Black, who is a signatory to these amendments, that he is unable to be with us today.
The battle between rights-holders and the tech sector is nothing new. Many noble Lords will remember the arrival and demise of the file-sharing platform Napster and the subsequent settlement between the sector and the giants of the creative industries. Napster argued that it was merely providing a platform for users to share files and was not responsible for the actions of its users; the courts sided with the music industry, and Napster was ordered to shut down its operations in 2001. The “mere conduit” argument was debunked two decades ago. To the frustration of many of us, the lawsuits led to a perverse outcome whereby violent bullying or sexually explicit content would be left up for days, weeks or forever, while a birthday video with the temerity to have music in the background would be deleted almost immediately.
The emergence of the large language models—LLMs—and the desire on the part of LLM developers to scrape the open web to capture as much text, data and images as possible raise some of the same issues. The scale of scraping is, by their own admission, unprecedented, and their hunger for data at any cost in an arms race for AI dominance is publicly acknowledged,
setting up a tension between the companies that want the data on the one hand, and data subjects and creative rights holders on the other. A data controller who publishes personal data as part of a news story, for example, may do so on the basis of an exemption under data protection law for journalism, only for that data to be scraped and commingled with other data scraped from the open web to train an LLM.
This raises issues of copyright infringement. More importantly, whether for individuals, creative communities or businesses that depend on the value of what they produce, these scraping activities happen invisibly. Anonymous bots acting on behalf of AI developers, or conducting a scrape as a potential supplier to AI developers, are scraping websites without notifying data controllers or data subjects. In doing so, they are also silent on whether processes are in place to minimise risks or balance competing interests, as required by current data law.
Amendment 103 would address those risks by requiring documentation and transparency. Proposed new paragraph (e) would require an AI developer to document how the data controller will enforce purpose limitation. This is essential: invisible data processing enabled through web scraping can pick up material that is published for a legitimate purpose, such as journalism, but combining that information with other data accessed through invisible processing could change the purpose and application of the data in ways that the individual may wish to object to, using their existing data rights. Proposed new paragraph (f) would require a data processor seeking to use legitimate interest as the basis for web scraping and invisible processing to build LLMs to document evidence of how they have ensured that individual information rights have been enabled at the point of collection and after processing.
Together, those proposed new paragraphs would mean that anyone who scrapes web data must be able to show that data subjects have meaningful control and can access their information rights ahead of processing. These requirements would be mandatory unless the scraper has incorporated an easily accessible machine-readable protocol that works on an opt-in basis, which is the subject of Amendment 104.
Amendment 104 would require web scrapers to establish an easily accessible machine-readable protocol that works on an opt-in basis rather than the current opt-out. Undoubtedly, the words “easily”, “accessible”, “machine readable” and “web protocols” would all benefit from guidance from the ICO but, for the avoidance of doubt, the intention of the amendment is that a web scraper would proactively notify individuals and website owners that scraping of their data will take place, including stating the identity of the data processor and the purpose for which that data is to be scraped. In addition, the data processor would provide information on how data subjects and data controllers can exercise their information rights to opt out of their data being scraped before any such scraping takes place, with an option to object after the event if data is taken without permission.
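By way of a purely illustrative sketch, and not a specification, the following shows how an opt-in signal and a machine-readable disclosure of identity and purpose might fit together in practice. Every name, field and address in it is hypothetical; the amendment leaves the detail of what counts as “easily accessible” and “machine readable” to ICO guidance.

```python
# Hypothetical illustration only: the field names, the .well-known location
# and all addresses are invented for the sake of the example.
import json
import urllib.request

# The disclosure a scraper would publish before scraping begins: who it is,
# why it is scraping, and how information rights can be exercised.
DISCLOSURE = {
    "scraper_identity": "Example AI Research Ltd",
    "controller_contact": "dpo@example-ai.test",
    "purpose": "training a large language model",
    "lawful_basis": "legitimate interests",
    "objection_route": "https://example-ai.test/object",
}

def site_has_opted_in(site: str) -> bool:
    """Look for an explicit, machine-readable opt-in signal published by the site."""
    try:
        with urllib.request.urlopen(f"https://{site}/.well-known/ai-scraping.json") as resp:
            policy = json.load(resp)
        return policy.get("allow_ai_training") is True
    except OSError:
        # No declaration published: under an opt-in regime the default is "no".
        return False

if site_has_opted_in("example.org"):
    print("Opt-in present; publish disclosure before scraping:")
    print(json.dumps(DISCLOSURE, indent=2))
else:
    print("No opt-in signal; scraping must not proceed.")
```

The key inversion is that silence means no: absent a positive signal, the scraper has no basis on which to proceed, which is precisely the reversal of the current opt-out position.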
We are in a situation in which not only is IP being taken at scale, potentially impoverishing our very valuable creative industries, journalism and academic work, which is then regurgitated inaccurately, but individual data rights are being made a mockery of. In its recent consultation on the lawful basis for web scraping, the ICO determined that use of web-scraped data
“can be feasible if generative AI developers take their legal obligations seriously and can evidence and demonstrate this in practice”.
These amendments would operationalise that demonstration. As it stands, there is routine failure, particularly regarding new models. For example, the ICO’s preliminary enforcement notice against Snap found that the risk assessment for its AI tool was inadequate.
Noble Lords will appreciate the significance of the connection that the ICO draws between innovative technology and children’s personal data, given the heightened data rights and protections that children are afforded under the age-appropriate design code. While I welcome the ICO’s action, holders of intellectual property have been left to fend for themselves, since government talks have failed, and individual data subjects are left exposed. Whether it is the scraping of social media or of work and school websites, such cases will not be pursued by the ICO, because regulating in such small increments is disproportionate, yet this lack of compliance is happening at scale.
The ICO suggests that developers using web-scraped data collected on either a first-party or third-party basis to train generative AI models need to be able to:
“Evidence and identify a valid and clear interest”.
They also need to:
“Consider the balancing test particularly carefully”,
weighing the developer’s interest against individual interests when those individuals are unlikely to know that their personal data is being used in this way and the developer does
“not or cannot exercise meaningful control over the use of the model”,
and to
“Demonstrate how the interest they have identified will be realised, and how the risks to individuals will be meaningfully mitigated, including their access to their information rights”.
None of this is currently being done, and all of it should be seen in the light of debates on previous groups. The Minister has already told noble Lords that negotiations between rights holders and the tech sector have failed; as I believe he put it, “Sadly, no consensus was reached”.
Across the world, IP holders are going to the courts. The New York Times is suing Microsoft and OpenAI for what it claims is the large-scale commercial exploitation of its content to train OpenAI’s ChatGPT, Microsoft Bing Chat and Microsoft 365 Copilot. As we will discuss later in Committee, the casual scraping of images of children from public places is happening at scale, and those images are turning up as AI-generated children, some of which then find their way into AI-generated CSAM and other violent and predatory material.
The new LLMs promise vast changes to society, some of which are tantalisingly close, such as leaps in medical science, and others that we should all hope are further away, such as widespread unemployment or lethal robots. The two amendments in my name will not solve all the issues that we really should be discussing rather than messing around at the edges of the GDPR,
but, while modest in nature, they would be transformative for data subjects and rights holders. They would allow a point of negotiation about the value of what is being shared by giving an option not to share. They would also give the regulator a more robust avenue to consider the risks to individuals, including vulnerable users. Surely we should not tolerate a situation in which an entire school website, or social media content including family photographs, is taken silently and without permission.
Finally, while the requirement to opt in that I am proposing is new, the technology is not. Unknown to almost all users of the digital world, there have long been protocols such as robots.txt that work on the basis that you can signal to a web scraper that you do not wish it to scrape your data. These protocols are currently the equivalent of a polite parish notice, with no consequences if they are ignored by web scrapers, whether a large corporation, an innovative start-up or someone acting on behalf of a foreign power. Given the arms race currently taking place to build LLMs and new forms of generative AI to service everything from creative activities to public services and military applications, these protocols are long overdue an upgrade, which my amendments seek to provide.
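To illustrate the point about the existing, purely advisory protocol: roughly speaking, a conscientious crawler consults a site’s robots.txt before fetching anything, but nothing obliges it to make that check, and ignoring it carries no consequence. This sketch uses only Python’s standard library; the bot name and URLs are placeholders.

```python
# Minimal sketch of the current advisory regime around robots.txt.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.org/robots.txt")
rp.read()  # fetch and parse the site's published crawling preferences

# A well-behaved crawler checks before fetching; a careless or hostile one
# simply omits this call, and the site owner is none the wiser.
if rp.can_fetch("ExampleScraperBot", "https://example.org/family-photos/"):
    print("robots.txt permits scraping this page")
else:
    print("Site asks not to be scraped (a request, not a requirement)")
```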
The reality is that, without an obligation on these scrapers to provide transparency about who they are, the identity of their scraper and the purpose for which they are scraping before the activity takes place, it is currently impossible for almost anyone, whether a data controller or a data subject, to exercise their information rights. When the noble Lord the Minister responds, I hope that he will acknowledge that it is neither proportionate nor practical to ask the general public or small businesses to undertake a computer science degree or its equivalent in order to access their data rights, and that the widespread abuse of UK data rights by web scraping without permission undermines the very purpose of the legislation. I beg to move.