Scraping together a lawful basis

“The emerging thinking of the Information Commissioner on the use of web scraping to train generative AI models dangles the carrot of a permissive and pro-innovation approach to regulation, but AI developers should beware the stick of copyright law which could undermine any lawful basis, and be mindful that no indication has been given as to the risk mitigations that will be accepted by the Information Commissioner as weighing the balance of rights and interests in favour of AI developers and that there are more relevant factors weighing in favour of data subjects than have been identified in the draft guidance.”

— Handley Gill Limited

As AI developers face a panoply of lawsuits in the US (including one brought by the New York Times) over the unauthorised scraping of personal data, and government-brokered talks between AI developers and news media organisations in the UK on licensing IP for AI training have broken down, the Information Commissioner’s Office has conducted the first in a series of consultations on generative AI and data protection, ‘Chapter one: The lawful basis for web scraping to train generative AI models’.

While the ICO warns that the position set out in the call for evidence represents its “emerging thinking” and “should not be interpreted as indication that any particular form of data processing discussed below is legally compliant”, the draft guidance proceeds on the basis that “Training generative AI models on web scraped data can be feasible if generative AI developers take their legal obligations seriously and can evidence and demonstrate this in practice”, and in particular relies on a series of propositions:

The legitimate interests lawful basis for processing personal data under Article 6(1)(f) UK GDPR can be valid for the processing of personal data for the purpose of training generative AI models on web-scraped data;

Generative AI model developers could have a valid legitimate interest for processing personal data having regard to their business interest in developing generative AI models and exploiting them for commercial gain, which could also be supplemented or even supplanted by the wider societal interest in the development of generative AI models, although it is suggested that this would require the specific ultimate uses of the model to be identified;

Unauthorised web scraping is necessary to meet the presumed legitimate commercial interests of generative AI developers since “most generative AI training is only possible using the volume of data obtained through large-scale scraping” and “there is little evidence that generative AI could be developed with smaller, proprietary databases”; and,

When considering the balancing exercise, while web scraping will constitute “invisible processing” of which data subjects will be unaware and should therefore be regarded as a “high-risk” data processing activity, there are “a number of considerations that may help generative AI developers pass the third part of the legitimate interests test” through the implementation of risk mitigations.

The draft guidance does emphasise, however, that the general obligation of lawfulness “will not be met if the scraping of personal data infringes other legislation outside of data protection such as intellectual property or contract law”.

The Information Commissioner’s draft guidance ducks the issue of the lawfulness of unauthorised web scraping, notwithstanding the relevant provisions of the Copyright, Designs and Patents Act 1988 and the finding of the House of Lords Communications and Digital Committee in its recent report on Large Language Models (LLMs) and Generative AI that “Some tech firms are using copyrighted material without permission, reaping vast financial rewards. The legalities of this are complex but the principles remain clear. The point of copyright is to reward creators for their efforts, prevent others from using works without permission, and incentivize innovation. The current legal framework is failing to ensure these outcomes occur and the Government has a duty to act”.

The Information Commissioner’s assertion that there is no realistic alternative to the unauthorised web scraping of data, including personal data, is not substantiated. We consider that the ICO should base its policy on evidence.

We are concerned that the Information Commissioner has failed to identify a number of factors relevant to the balancing exercise, including the reasonable expectations of affected data subjects (which could have been informed by website terms and conditions as to whether web scraping is permitted), and the fact that the personal data could include that of children and other vulnerable people.

It is also not clear how the Information Commissioner’s proposed approach to regulating the processing of personal data in the context of web scraping to train generative AI models aligns with the global warning issued in August 2023 by data protection authorities, including the Information Commissioner, to social media companies on their obligations to protect people’s data from unlawful data scraping.

We submitted a response to the Information Commissioner’s call for evidence on ‘The lawful basis for web scraping to train generative AI models’, which can be accessed here:

Handley Gill Limited’s response to the ICO’s call for evidence on the lawful basis for web scraping to train generative AI models

Generative AI developers need to conduct a data protection impact assessment in relation to the processing of personal data in the context of the development, training and use of their models, and a legitimate interests assessment, regardless of whether the model is intended to be made available for public use or deployed internally. Organisations developing or deploying AI who wish to comply with existing regulatory obligations or to ensure that they are at the forefront of using AI safely, responsibly and ethically, should also conduct an AI risk assessment, establish an AI governance programme and appoint an AI Responsible Officer. If Handley Gill can support you with any of these, please contact us.

Find out more about our data protection and data privacy services.

Find out more about our responsible and ethical artificial intelligence (AI) services.

Access Handley Gill Limited’s proprietary AI CAN (Artificial Intelligence Capability & Needs) Tool, to understand and monitor your organisation’s level of maturity on its AI journey.

Download our Helping Hand checklist on using AI responsibly, safely and ethically.

Check out our dedicated AI Resources page.

Follow our dedicated AI Regulation Twitter / X account.

Nicola Cain, 1 March 2024

Tags: Data Protection, Personal Data, Data Privacy, UK GDPR, GDPR, UK General Data Protection Regulation, General Data Protection Regulation, Article 6 UK GDPR, Article 6(1)(f) UK GDPR, Legitimate Interests, Legitimate Interests Assessment, Information Commissioner, Information Commissioner's Office, Consultation, Consultation Response, Artificial Intelligence, AI, Generative AI, Generative Artificial Intelligence, Intellectual Property, Intellectual Property Rights, Copyright, Copyright Infringement, Web Scraping, AI Training, AI Training Data, Intellectual Property Office, IPO, A pro-innovation approach to AI regulation, Voluntary code of practice on copyright and AI, Sir Patrick Vallance, Pro-Innovation Regulation for Digital Technologies Review, House of Lords Communications & Digital Committee, Report on Large Language Models (LLMs) and Generative AI, Data Mining, Copyright Designs and Patents Act 1988, AI Regulation, Joint statement on data scraping and data protection, Data scraping, Generative AI & Data Protection