“We don’t just organise data; we safeguard scientific scope for action.”

Interview with Susanne Schmucker

Photo: Kaja Grope

Text and data mining is transforming the way research works with knowledge, particularly in the light of generative AI. For academic libraries, this means that their core mission is shifting from ‘enabling access’ to ‘ensuring usability’. We discuss the associated issues regarding licences, interfaces, metadata, documentation, long-term availability and data sovereignty with Susanne Schmucker, Head of the Collection Development & Metadata Programme Area at the ZBW.

Why is the topic of text and data mining more than just a technological trend for libraries?

S. Schmucker: Text and Data Mining, or TDM for short, encompasses methods used to automatically analyse large volumes of text and data, right through to the preparation of training data for AI applications. For libraries, this shifts the meaning of ‘access’. It is not enough simply to make content available in a readable form. Researchers want to be able to compile texts and data into corpora, process them automatically, version them, document them and review them at a later date. TDM therefore touches on core library tasks such as licence management, metadata work, provision, long-term preservation and advisory services.

What is often overlooked in the TDM debate?

S. Schmucker: Outside the library community, it is often overlooked that TDM and AI frequently fail not because of a lack of computing power, but because of the data itself. Data is stored in silos, is not machine-readable, or comes with licence terms that are opaque or inadequate. This turns a technical method into a coordination project. Much of the effort goes into rights clearance, access, data cleaning or data selection, and thus into areas where libraries play a central role.

Where do the first practical problems arise when researchers request TDM?

S. Schmucker: For researchers, the problems usually begin with corpus construction. Which sources may be used? How are they technically incorporated into the corpus? And how is the data status documented? Particularly with licensed platforms, there is often a lack of stable interfaces or clear permission for automated processing. This leads to uncertainty and corpora that are difficult or impossible to reproduce scientifically.

What is the most common stumbling block in the licensing arena today?

S. Schmucker: A fundamental issue! The gap between ‘access’ and ‘usability’ is often the hurdle here. Many contracts regulate reading, but do not explicitly cover machine processing. Or the licence information is not documented in a machine-readable format at the level of the individual publication. It then remains unclear whether automated downloading, long-term storage or the sharing of a corpus with collaboration partners is permitted.

What is the next sensible step for libraries in this context?

S. Schmucker: What we need is a streamlined standard process for TDM requests from the research community that brings together rights, technology and documentation. This means: clarifying the data source and protection requirements, checking the terms of use, determining the access path, defining documentation and versioning standards, and specifying responsibilities. In addition, a data and licence catalogue would be helpful, as it would make it clear which collections can be reused, how they can be reused, and under what conditions.
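
Editorial illustration: the five steps named in this answer could be recorded as a simple checklist per TDM request. The following Python sketch is only one possible way to do this. The class name, the field names and the example values are hypothetical and do not describe a system in use at the ZBW.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class TdmRequest:
    """One TDM request; each field mirrors one step named in the interview.

    All names here are hypothetical and chosen for illustration only.
    A field left as None means the step has not been clarified yet.
    """
    data_source: Optional[str] = None             # source and its protection requirements
    terms_of_use: Optional[str] = None            # what the licence permits
    access_path: Optional[str] = None             # how the data is obtained technically
    documentation_standard: Optional[str] = None  # how corpus state and versions are recorded
    responsible_contact: Optional[str] = None     # who is responsible for the request

    def open_steps(self) -> List[str]:
        """Return the names of the steps that are still unclarified."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]


# Example values are invented for the sketch.
request = TdmRequest(
    data_source="licensed journal collection (example)",
    terms_of_use="automated download permitted, sharing unclear (example)",
)
print(request.open_steps())
```

In this sketch, a request counts as ready only once `open_steps()` returns an empty list. A data and licence catalogue could, under the same assumptions, supply the values for the first two fields.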

Thank you!

About Susanne Schmucker:

Susanne Schmucker has been working at the ZBW – Leibniz Information Centre for Economics since 2009. She has been head of the Collection Development and Metadata Programme Area since October 2025. Drawing on her background in economics, Susanne Schmucker combines expertise in knowledge organisation, metadata management and user-oriented information services with the aim of translating access into sustainable usability for research.

This text was written on 22 April 2026.
This text was translated on 25 June 2026 using DeepL Pro.
