Using Copyright-Protected Works to Train AI from the Perspective of the Data Mining Exception of the Finnish Copyright Act
19 February 2025
Authors: Anniina Somppi, Hilma Mäkitalo-Saarinen, Panu Siitonen, and Vilhelm Schröder
Generative artificial intelligence (“AI”) applications have become increasingly popular, raising questions about the permissibility of text and data mining for training large language models. Generative AI applications are based on machine learning techniques and neural networks. In all these approaches, machine learning models are trained on large datasets to identify patterns and correlations in the data. The effectiveness of these models relies heavily on the quality and quantity of the training data, as these factors directly influence the model’s ability to refine its decision-making processes. Through iterative training, AI models learn to make accurate predictions and generate outputs based on the patterns extracted from the data. As a result, generative AI can create different types of content, such as text, images, and videos, in response to requests entered by users, known as “prompts”.
Copyright issues arise when training data includes material protected by copyright or related rights. In some cases, text and data mining may involve acts reserved exclusively for rightholders, such as the reproduction of copyrighted material (in part or in full), which could lead to copyright infringement. Since training AI models typically requires vast amounts of data, developers of generative AI applications may need to use a large number of copyright-protected works from various rightholders, raising concerns about unauthorised use.
In 2023, the Finnish Copyright Act (404/1961) was amended to include Section 13 b on text and data mining. This new provision implements the text and data mining articles of the Directive on copyright and related rights in the Digital Single Market ((EU) 2019/790) (the “DSM Directive”).
In this blog post, we examine the conditions of this exception and how these could affect the training process of AI from a legal perspective.
Text and Data Mining Exception in the Finnish Copyright Act
According to Subsection 1 of Section 13 b of the Finnish Copyright Act, which implements the general text and data mining exception of Article 4 of the DSM Directive, anyone who has lawful access to a work may reproduce it for the purpose of text and data mining and retain copies solely for that purpose, unless the right of reproduction has been expressly and appropriately reserved by the rightholder.
In other words, the exception generally permits text and data mining when (i) the miner has lawful access to the work and (ii) the right of reproduction has not been expressly and appropriately reserved by the rightholder. “Lawful access” refers to access granted through an open access policy or contractual arrangements or to content that is freely available online.
When a work is made publicly available online, a mechanism may be implemented to allow rightholders to signal that they do not wish their works to be used for specific purposes, including text and data mining. According to the recitals of the DSM Directive, this “opt-out” must be expressed through machine-readable means, such as metadata or the terms and conditions of a website or service. In practice, this means that rightholders can indicate that their works should not be used for activities like AI training or that the intended use must comply with certain restrictions.
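To make the idea of a metadata-based opt-out more concrete, the following minimal Python sketch checks a web page for a reservation expressed as an HTML meta tag. The tag name `tdm-reservation` and the value `1` are assumptions borrowed from a community proposal for signalling text and data mining reservations; neither the DSM Directive nor the Finnish Copyright Act prescribes a particular format, and the function name `has_tdm_reservation` is hypothetical. The sketch only detects a signal and says nothing about whether a reservation is legally effective.

```python
from html.parser import HTMLParser


class ReservationMetaParser(HTMLParser):
    """Looks for <meta name="tdm-reservation" content="1"> (assumed format)."""

    def __init__(self):
        super().__init__()
        self.reserved = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling us.
        if tag != "meta":
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").strip().lower()
        content = (attr.get("content") or "").strip()
        if name == "tdm-reservation" and content == "1":
            self.reserved = True


def has_tdm_reservation(html: str) -> bool:
    """Return True if the page carries the assumed opt-out meta tag."""
    parser = ReservationMetaParser()
    parser.feed(html)
    return parser.reserved


# Hypothetical page used only for illustration.
page = '<html><head><meta name="tdm-reservation" content="1"></head></html>'
print(has_tdm_reservation(page))
```

A miner relying on the exception would typically run a check of this kind before reproducing a work and keep a record of the result.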
In other cases, rights may be reserved through methods such as contractual agreements or a unilateral declaration. It will be interesting to see whether machine-readable bans or restrictions become a standard feature across the EU, especially on websites and other sources whose content could be used as training material for AI.
The exception applies only to the extent that copies are made for the purpose of text and data mining. Text and data mining is defined in the DSM Directive as “any automated analytical technique aimed at analysing text and data in digital form in order to generate information which includes but is not limited to patterns, trends and correlations”. If text and data mining were to be interpreted narrowly as a tool for analysing data to extract information, AI training operations that can be deemed to produce new content, instead of only analysing the material, could fall outside the scope of the exception. A broader interpretation of the definition, however, would bring AI training within the scope of the exception as well.
The DSM Directive does not specifically mention AI or machine learning, and the Government proposal on the Finnish Copyright Act does not provide guidance on this issue either. While the text and data mining exception may offer some clarification regarding the legality of training AI, the uncertainties around the above-mentioned requirements could impose limitations or create obstacles for using protected material in such cases. In Finland, there are currently no court cases addressing the application of this exception to AI training, and the legal landscape therefore remains unclear. The issue will likely need to be clarified through court decisions or more detailed guidance from Finnish authorities on how the exception will be interpreted in practice. This uncertainty is unsatisfactory, as the outcome will have a significant impact on how AI can be trained and which material can legally be used.
Case Law from Other Member States
As noted above, the text and data mining provisions raise several legal questions. At the time of writing, the first two cases have reached national courts in EU Member States.
The most recent example is the Dutch DPG Media et al v. HowardsHome case, decided by the Amsterdam District Court on 30 October 2024. In this case, the text and data mining exception was successfully invoked. While the ruling offers only minimal guidance, it provides important insight into how the exception should function in practice, especially considering the limited available guidance on how to implement and exercise an opt-out mechanism in accordance with Article 4 of the DSM Directive.
Regarding the criterion of lawful access, the Court assumed that the defendant had only used publicly available information and, as such, had lawful access to the material. In terms of the opt-out mechanism, the Court ruled that the plaintiffs had not expressly and in an appropriate manner reserved their rights against text and data mining on their websites. The reservation covered only certain large AI bots, so the bot used by the defendant fell outside it, and the opt-out was deemed ineffective.
The judgment suggests that, for an opt-out under the text and data mining exception to be effective, the parties it targets must be explicitly identified. Implied reservations are not sufficient, and a clear and expressly made reservation is required.
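The practical difference between a reservation aimed at named bots and one addressed to all crawlers can be sketched in Python. The sketch assumes, purely for illustration, that the reservation is expressed in a robots.txt file; the file contents, the bot names `BigAIBot` and `SmallAggregatorBot`, and the helper `is_blocked` are hypothetical and are not taken from the case. It uses the standard library's `urllib.robotparser`, which shows how a crawler would read the file, not how a court would assess it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical reservation that names one specific bot only.
NAMED_ONLY = """\
User-agent: BigAIBot
Disallow: /
"""

# Hypothetical reservation addressed to every crawler.
ALL_BOTS = """\
User-agent: *
Disallow: /
"""


def is_blocked(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt disallows user_agent from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


URL = "https://example.com/news/article"
for bot in ("BigAIBot", "SmallAggregatorBot"):
    print(bot, is_blocked(NAMED_ONLY, bot, URL), is_blocked(ALL_BOTS, bot, URL))
```

In this sketch, the bot that is not named in the first file is not blocked by it, while the wildcard file blocks both bots. A rightholder wishing to reach every miner would therefore need a reservation that does not depend on listing individual bots.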
Another recent case is the German Kneschke v. LAION case, decided by the Hamburg District Court. In this case, the Court applied the text and data mining exception for scientific research purposes. The comparison of the image content with a pre-existing image description fell within the legal definition of text and data mining, as it involved the automated extraction of information. The Court found that the defendant, a non-profit organisation, had created and published the dataset for non-commercial purposes, which fell within the scope of the exception. As the text and data mining exception for scientific purposes does not include an opt-out mechanism for rightholders, the validity of any rights reservation made by the claimant was not crucial to the outcome of the case.
However, the Court addressed the question in an obiter dictum. For material made accessible online, a reservation of rights is only effective if it is made in a “machine-readable” form. In this case, the photo agency’s website, from which the photo reproduced by the defendant was downloaded, included a reservation in natural language. The Court suggested that such a reservation could still be considered “machine-readable”, given that technologies capable of detecting opt-outs in natural language had been available since 2021. The Court stated that the term “machine-readable” should be assessed in light of the technology available at the time the copyrighted work was reproduced, and further suggested that, as of its decision, a natural language reservation may be regarded as machine-readable.
It is important to note that the HowardsHome and LAION cases were decided by first-instance national courts in the Netherlands and Germany, respectively. Their implications may well be limited to those countries. The scope of Article 4 of the DSM Directive remains to be clarified by higher courts across the EU, including in Finland. As such, it is still uncertain how broadly the text and data mining exception will be interpreted in the future.