TDMRep

The W3C TDM Reservation Protocol

Introduction

The Directive on Copyright and Related Rights in the Digital Single Market (EU Directive 2019/790), better known as the “CDSM Directive” or “DSM Directive”, introduces two exceptions or limitations to the rights of rightsholders over lawfully accessible content, covering reproductions and extractions made for the purposes of text and data mining (TDM):
Article 3 specifies a mandatory exception for research organisations and cultural heritage institutions that carry out TDM for scientific research.

Article 4 specifies an exception for any organisation wishing to carry out TDM for any purpose other than scientific research, including commercial purposes. This exception applies only if the use of content for TDM has not been expressly reserved by its rightsholders in an appropriate manner, such as machine-readable means.

The European AI Act, i.e. Regulation (EU) 2024/1689, states that the copyright exceptions and limitations introduced in Articles 3 and 4 of the CDSM Directive apply to general-purpose AI models, particularly large generative AI models. Any provider placing a general-purpose AI model on the Union market should comply with this obligation, regardless of the jurisdiction in which the copyright-relevant acts underpinning the training of those general-purpose AI models take place. Outside of the EU, relying on fair use or a similar rule is legally uncertain, as such claims are judged on a case-by-case basis.

The “opt-out” mechanism introduced by the CDSM Directive is, therefore, a real opportunity for TDM/AI actors and publishers across countries to define a machine-readable technique that expresses not only whether mining rights on specific content are reserved, but also how rightsholders can be contacted and which licenses are available, if any. This is a tremendous help for TDM/AI actors from all countries looking for legal certainty.

The TDMRep Specification was developed by the W3C TDMRep Community Group in 2022 and evolved in 2023 and 2024. This group has attracted 55 people, mainly representatives from publishing organisations, but also representatives from other opt-out initiatives and a few actors in the AI market.

Use Cases

The TDMRep Community Group extracted three main use cases:

The publisher controls a Web server

Example: a newspaper publishes articles on its own Web server.

The technical team can tailor HTTP headers using fine-grained rules, or store a specific file in a well-known location on the Web server.

The author publishes content on a Web server

Example: a blogger publishes articles on Medium.

The publisher’s editing tool can insert metadata in an article’s HTML header and in associated media files (e.g., JPEG, MP4).

The publisher syndicates content

Examples: a book publisher sells ebooks through multiple booksellers; a news agency syndicates news items through different channels.

The publication tool can insert metadata in the published file (e.g. EPUB, PDF) and/or in syndication metadata (e.g. ONIX for books).

Robots.txt and robots meta directives are insufficient

Robots.txt and robots meta directives were originally developed to give indications to search engines, especially for SEO purposes. These techniques were not designed to protect sensitive data from specific uses (like AI training).

Using robots.txt to manage TDM/AI opt-out would require a webmaster to list every single user agent (i.e. bot name) they want to stop from crawling the site, whatever the crawling purpose is (including indexing for search purposes).

There may already be thousands of TDM and AI robots in the wild, and some of them do not even use a bot-specific user agent. Using robots.txt to opt out of TDM/AI crawlers would, therefore, be an overwhelming effort.
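
The per-agent nature of robots.txt can be illustrated with Python's standard `urllib.robotparser` module. The sketch below uses a made-up robots.txt and two hypothetical bot names (`ExampleAIBot` and `UnlistedMiningBot`); it only shows that a bot the webmaster did not name remains allowed.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt: one AI crawler is blocked by name, everything else is allowed.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def crawl_permissions(robots_txt, agents, url):
    """Return {agent: allowed?} for the given URL under the given robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


permissions = crawl_permissions(
    ROBOTS_TXT,
    ["ExampleAIBot", "UnlistedMiningBot"],  # hypothetical bot names
    "https://example.com/articles/story.html",
)
for agent, allowed in permissions.items():
    print(agent, "allowed" if allowed else "blocked")
```

Only the bot listed by name is blocked; any mining bot missing from the file falls under the catch-all rule.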

This could change if robots.txt evolves and supports the expression of a generic opt-out (here Disallow) relative to certain purposes like “crawling for training generative AI solutions”. However, such an evolution could break existing implementations by web crawlers, which would make it a dangerous move.

Robots meta directives already support a notion of purpose: “nosnippet” and “notranslate” are examples of such purposes. It would therefore be easier to use robots meta directives to support a “notdm” directive conforming to Article 4 of the CDSM Directive. A “noai” directive has already been proposed, but it does not cover the wider array of TDM practices publishers wish to address.

In addition, robots.txt and robots meta directives do not address the need for an easy way to put TDM/AI actors in contact with rightsholders.

TDMRep Basics

The TDMRep specification defines a simple rights reservation model, made of two properties:

  • tdm-reservation (boolean) indicates if mining rights are reserved or not.
  • tdm-policy (URL) gives access to publishers’ contact information and conditions for obtaining authorisation to mine content.
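
As a minimal sketch, the two-property model could be represented as follows. The class and function names (`TdmRights`, `parse_rights`, `describe`) are hypothetical, and the sketch assumes the boolean is carried as `1` (reserved) or `0` (not reserved); treating any other value as "no declaration" is a choice made here, not something taken from the specification.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TdmRights:
    """Hypothetical container for the two TDMRep properties."""
    reserved: bool                    # tdm-reservation
    policy_url: Optional[str] = None  # tdm-policy


def parse_rights(reservation, policy=None):
    """Build TdmRights from raw values; return None if nothing usable is declared."""
    if reservation is None:
        return None
    value = str(reservation).strip()
    if value not in ("0", "1"):
        # Assumption of this sketch: unknown values count as "no declaration".
        return None
    return TdmRights(reserved=(value == "1"), policy_url=policy or None)


def describe(rights):
    """Turn a parsed declaration into a short human-readable summary."""
    if rights is None:
        return "no TDMRep declaration found"
    if not rights.reserved:
        return "mining rights not reserved"
    if rights.policy_url:
        return f"mining rights reserved; see {rights.policy_url}"
    return "mining rights reserved; no policy given"


print(describe(parse_rights("1", "https://example.com/policies/tdm.json")))
print(describe(parse_rights(0)))
print(describe(parse_rights(None)))
```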

The specification introduces five techniques based on this simple model, each technique corresponding to one of the use cases previously defined:

  1. Rights reservation is defined as a file stored on the origin server. This file must be stored as /.well-known/tdmrep.json. It can contain several sections, each associated with a certain path on the Web server (the syntax mimics robots.txt) and containing the two properties defined in the model. This is a generic solution adapted to any media type. A crawler does not need to fetch the content to check its opt-out status.
  2. Rights reservation is defined as header fields in HTTP responses. This is a generic solution adapted to any media type. A crawler does not need to fetch the content to check its opt-out status. Engineers familiar with proxy servers can program fine-grained rules.
  3. Rights reservation is defined as metadata in HTML content. This solution fits the needs of editors wishing to apply opt-out on web pages.
  4. Rights reservation is set in EPUB files. This solution fits the needs of book publishers. It has two variants, for EPUB 2 and EPUB 3 publications.
  5. Rights reservation is set in PDF files. This solution, based on XMP metadata, fits the needs of academic and book publishers.
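
To make the first technique more concrete, here is a hedged sketch of how a crawler might evaluate such a file. It assumes the file is a JSON array whose entries carry a `location` path pattern plus the two model properties; the sample content, the helper names, and the "first matching entry wins" rule are assumptions of this sketch and should be checked against the specification. Fetching the file over HTTP is left out.

```python
import json
import re

# Made-up content, as it might be served from /.well-known/tdmrep.json.
SAMPLE_TDMREP = """[
  {"location": "/articles/", "tdm-reservation": 1,
   "tdm-policy": "https://example.com/policies/tdm.json"},
  {"location": "/open-data/", "tdm-reservation": 0},
  {"location": "/images/*.jpg", "tdm-reservation": 1}
]"""


def pattern_to_regex(pattern):
    """Translate a robots.txt-like pattern: prefix match, '*' wildcard, optional '$' end anchor."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def lookup(rules, path):
    """Return the first entry whose location matches the path, or None."""
    for rule in rules:
        if pattern_to_regex(rule["location"]).match(path):
            return rule
    return None


rules = json.loads(SAMPLE_TDMREP)
for path in ("/articles/2024/story.html", "/images/cover.jpg", "/about.html"):
    rule = lookup(rules, path)
    print(path, "->", rule["tdm-reservation"] if rule else "no declaration")
```

Because the lookup needs only the URL path, the crawler can decide before requesting the content itself.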

Note: The inclusion of TDMRep properties in media formats (image, audio, video) has not been studied by the Group. The technique used for PDF files could easily be adapted to other formats that support XMP metadata (JPEG, MP4, etc.).
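
For the HTTP header and HTML metadata techniques listed above, a crawler-side check could look like the sketch below. It assumes that the header fields and the `meta` element names are the same as the two model properties (`tdm-reservation` and `tdm-policy`); the function and class names are hypothetical, and the sketch works on values already obtained rather than making network requests.

```python
from html.parser import HTMLParser

TDM_NAMES = ("tdm-reservation", "tdm-policy")


def rights_from_headers(headers):
    """Read both properties from a header mapping, ignoring header-name case."""
    lowered = {name.lower(): value.strip() for name, value in headers.items()}
    return lowered.get("tdm-reservation"), lowered.get("tdm-policy")


class TdmMetaParser(HTMLParser):
    """Collect <meta name="tdm-..." content="..."> values from an HTML page."""

    def __init__(self):
        super().__init__()
        self.values = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = (attributes.get("name") or "").lower()
        if name in TDM_NAMES:
            self.values[name] = attributes.get("content")


def rights_from_html(html):
    """Read both properties from HTML metadata; missing ones come back as None."""
    parser = TdmMetaParser()
    parser.feed(html)
    return parser.values.get("tdm-reservation"), parser.values.get("tdm-policy")


sample_headers = {
    "Content-Type": "text/html",
    "TDM-Reservation": "1",
    "TDM-Policy": "https://example.com/policies/tdm.json",
}
sample_html = (
    '<html><head><meta name="tdm-reservation" content="1">'
    '<meta name="tdm-policy" content="https://example.com/policies/tdm.json">'
    "</head><body></body></html>"
)
print(rights_from_headers(sample_headers))
print(rights_from_html(sample_html))
```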

The TDM Policy URL can lead to an HTML document (human-readable) or, preferably, to an ODRL (Open Digital Rights Language) structure (machine-readable). The ODRL structure contains:

  • Contact information for the rightsholder.
  • Conditions related to the use of the associated content, for specific purposes. Conditions can be “obtain consent” or “compensate”.
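
As an illustration only, the sketch below reads a made-up policy document and extracts the two kinds of information listed above. The JSON keys (`assigner`, `vcard:fn`, `vcard:hasEmail`, `permission`, `duty`, and the `tdm:mine` action) are assumptions modelled on ODRL's JSON-LD serialisation and are not confirmed by this text; the exact structure should be taken from the TDMRep specification. The `summarize_policy` helper is hypothetical.

```python
import json

# Made-up policy document; the key names are assumptions, see the note above.
POLICY_JSON = """
{
  "@context": "http://www.w3.org/ns/odrl.jsonld",
  "@type": "Offer",
  "uid": "https://example.com/policies/tdm.json",
  "assigner": {
    "uid": "https://example.com",
    "vcard:fn": "Example Publisher",
    "vcard:hasEmail": "mailto:rights@example.com"
  },
  "permission": [
    {
      "target": "https://example.com/articles",
      "action": "tdm:mine",
      "duty": [{"action": "obtainConsent"}]
    }
  ]
}
"""


def summarize_policy(policy):
    """Extract rightsholder contact details and per-target conditions."""
    assigner = policy.get("assigner", {})
    contact = {
        "name": assigner.get("vcard:fn"),
        "email": assigner.get("vcard:hasEmail"),
    }
    conditions = [
        {
            "target": permission.get("target"),
            "duties": [duty.get("action") for duty in permission.get("duty", [])],
        }
        for permission in policy.get("permission", [])
    ]
    return {"contact": contact, "conditions": conditions}


summary = summarize_policy(json.loads(POLICY_JSON))
print(summary["contact"])
print(summary["conditions"])
```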

A vocabulary of “purposes” is currently being discussed in the context of the Open Future initiative. The aim of this work is to reach a consensus on the mining purposes for which authorisations may be given or refused. This consensus is expected to be applied by different opt-out solutions (TDMRep, C2PA, IPTC/Plus, etc.).

TDMRep Adoption

TDMRep has been adopted where communication has been active, i.e. in Western Europe (and especially France).

Adopters mainly belong to three categories:

  • Newspapers (e.g. Ouest France)
  • STM publishers (e.g. Elsevier)
  • Book publishers (e.g. Hachette Livre, Mondadori).

An AI expert has set up an automatic review of French adopters, which is published on Medium. Of the top 250 French websites, 143 have already implemented the TDMRep specification.

The adoption of TDMRep by TDM/AI actors is currently indirect: Spawning AI has implemented the protocol in an API used by Stability.ai and others.

Standardisation

The TDM Reservation Protocol is currently the Final Report of the W3C TDMRep Community Group. Therefore, it can be considered a potential industry standard.

Because the work started from the requirements of the EU CDSM Directive, especially its Article 4, it focuses on the notion of opt-out, but it also provides a way to opt in when necessary (tdm-reservation=0).

There were initial discussions about moving TDMRep to the status of W3C Recommendation when its adoption is deemed sufficient. The multiplicity of techniques defined in the specification, some purely web-oriented, others related to the inclusion of metadata in content files, makes it difficult to select the best standardisation body for this specification.

Other projects

Readium

The Readium projects provide rock-solid, performant building blocks and applications for processing EPUB 3 publications. EDRLab participates in the maintenance and evolution of the Readium codebase.

Accessibility

Support for people with print disabilities is a key part of our mission. We collaborate with European publishers and major inclusion organizations on the creation of a born-accessible ebook market. We also make sure that Readium projects take into account the assistive technologies used by visually impaired users.

Copyright © 2023 EDRLab.
