Advanced data analysis and research management with BERD@NFDI

The integration of unstructured data into economic research processes

In the age of digitalisation and exponentially growing data volumes, researchers are faced with the challenge of gaining valuable insights from the wealth of unstructured data. The importance of this data for scientific research, particularly in the social and economic sciences, cannot be overestimated. To meet this challenge, the NFDI consortium BERD@NFDI has developed a comprehensive platform that facilitates accessing, analysing and exchanging unstructured data. This platform provides not only datasets and analysis tools, but also educational resources and services to help researchers navigate the complex landscape of unstructured data. The four central portals and services of BERD@NFDI, which aim to support economic research in Germany and beyond, are presented below.

The BERD data portal

The BERD data portal offers an exclusive selection of high-quality unstructured datasets, including the YouTube-8M dataset, the Yelp business review data and the Million Playlist Dataset from Spotify. This data is highly relevant to research and has already been used in leading research publications in the social sciences. Each dataset on the portal is tagged with comprehensive metadata, including author information, general descriptions, variables, dataset size, licence information and keywords. Users can also retrieve earlier versions of a dataset in order to reproduce the results of previous scientific studies.
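To make the metadata fields listed above more concrete, the following is a minimal sketch of how such a record and a version lookup might be modelled. It is not the portal's actual schema or API: the class `DatasetRecord`, the function `find_version` and all field names and values are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class DatasetRecord:
    """Hypothetical metadata record; field names are assumptions, not BERD's schema."""
    title: str
    version: str
    authors: Tuple[str, ...]
    description: str
    variables: Tuple[str, ...]
    size_bytes: int
    licence: str
    keywords: Tuple[str, ...] = ()


def find_version(records: Sequence[DatasetRecord], title: str, version: str) -> Optional[DatasetRecord]:
    """Return the record matching title and version, or None if there is none."""
    for record in records:
        if record.title == title and record.version == version:
            return record
    return None


# Made-up catalogue with two versions of the same placeholder dataset.
catalogue = [
    DatasetRecord("Example Review Corpus", "1.0", ("A. Author",), "Placeholder description",
                  ("review_text", "rating"), 1000, "placeholder-licence", ("reviews",)),
    DatasetRecord("Example Review Corpus", "2.0", ("A. Author",), "Placeholder description",
                  ("review_text", "rating", "date"), 2000, "placeholder-licence", ("reviews",)),
]

# A replication study would pin the exact version used in the original paper.
original = find_version(catalogue, "Example Review Corpus", "1.0")
print(original.variables if original else "version not found")
```

Keeping the version as part of the record, rather than overwriting older entries, is what allows an earlier study to be reproduced against exactly the data it used.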

BERD also provides publication information for each dataset to show the context in which the data has already been used in other scientific studies. In future, the datasets will also be linked to customised analysis methods to help users analyse and process the data in the best possible way, and they will be complemented by a broader, less restrictive collection of data from different sources to meet the needs of the user community.

The data portal will also include a secure data marketplace for the controlled exchange of data between researchers and industrial organisations. Users can then submit research proposals and use data from organisations to answer specific questions with unique datasets that are not available elsewhere. In addition, once the data marketplace is implemented, users can upload and share their own datasets, for example from funded research projects.

BERD’s data portal is designed to provide comprehensive access to unstructured data from past and current scientific research as well as to valuable proprietary data.

The BERD analysis portal

The analysis portal extends the functionality of the data portal by making the existing data accessible for analysis, thus building a bridge between unstructured data and the traditional empirical research model. It integrates various methods, including pre-processing and machine learning, as well as general analysis methods such as sentiment analysis or topic modelling, and links these with the corresponding datasets from the data portal.

Users can search for specific pre-processing techniques (such as normalisation, feature extraction, filtering and feature learning methods), machine learning methods (such as support vector machines) or specific analysis tasks in the context of unstructured data.
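As a rough illustration of what such pre-processing steps involve, the sketch below applies normalisation, filtering and a simple bag-of-words feature extraction to one sentence, using only the Python standard library. It is not a tool provided by the analysis portal; the function names, the stop-word list and the sample text are assumptions made for this example.

```python
import re
from collections import Counter
from typing import List

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"the", "a", "an", "and", "is", "was", "of", "to"}


def normalise(text: str) -> str:
    """Lower-case the text and replace every character that is not a-z, 0-9 or whitespace with a space."""
    return re.sub(r"[^a-z0-9\s]", " ", text.lower())


def tokenise(text: str) -> List[str]:
    """Split the normalised text on whitespace and filter out stop words."""
    return [token for token in normalise(text).split() if token not in STOP_WORDS]


def bag_of_words(text: str) -> Counter:
    """Feature extraction: count how often each remaining token occurs."""
    return Counter(tokenise(text))


# Made-up sample sentence standing in for one unstructured text document.
review = "The food was great, and the service was GREAT!"
features = bag_of_words(review)
print(features)
```

The resulting token counts are the kind of numerical representation that a downstream method, such as a support vector machine, would take as input.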

The search results provide metadata and references to publications in leading economics journals that use the selected techniques. Alternatively, users can use the detail pages of the data portal to find suggestions for methods that are suitable for a particular dataset. The analysis portal also enables researchers to track how the use of particular methods or analytical procedures in the leading economics journals has developed over time. In this way, the portal supports researchers who want to work with unstructured data by showing how it can be analysed, which data is suitable for specific tasks and which scientific papers with similar methods or datasets have already been published.

The BERD Academy

The training and education portal, also known as the “BERD Academy”, is an integrated module of BERD@NFDI. It brings together all educational programmes aimed at promoting data literacy in general and the handling of unstructured data in particular. This provides users with comprehensive tools to effectively use the data and analysis methods provided on BERD for their research purposes. The offer includes in-depth workshops, informative webinars and various learning modules for data science and analytics that are suitable for self-study. The courses are freely accessible and can be attended both in person and online. They are aimed at a wide audience, from beginners to professionals.

In 2023, the BERD Academy offered a variety of events, including a face-to-face series on statistics for the common good; a workshop on AI-based methods for using text as data in the social sciences; online flipped-classroom courses on the reproducibility of research; and lectures and discussions on FAIR data and data protection. Events such as DataFEST Germany and Women in Data Science were also organised. These established programmes will be continued and further developed in the future.

The workshops and courses are specially designed for researchers and data managers and are offered on a regular basis. They are also supplemented by customised offers, with a particular focus on a needs-based course on research data management.

The BERD service portal

The service portal complements the BERD platform by providing additional research-related services and tools. Firstly, it offers a range of services in the field of optical character recognition (OCR) to support researchers who want to extract (un)structured data from images or texts. In particular, printed (non-digitally created) sources in the social sciences, especially in economics, must first be digitised using text recognition methods in order to be usable for empirical analyses. This step is essential for the pre-processing of non-digital data sources.

The OCR recommendation service provides guidance on automatic transcription, text recognition and pre- and post-processing options based on several questions about the underlying images and the purpose of the OCR. The results can be used for the planning and implementation of editorial or digitisation projects as well as the full-text digitisation of specific works, series of works, collections or corpora. In addition, an OCR-on-demand service has been introduced that allows users to perform immediate OCR of digitised material.

For researchers who are struggling with their own OCR pipelines, BERD has set up an OCR helpdesk, which offers direct support. The service portal also offers legal support for data-related issues. In particular, it helps users to understand the impact of data protection regulations on their work with datasets and the sharing of data with others.

An interactive virtual assistant (iVA) informs users about the need for their research data to be GDPR-compliant and about the requirements that must be met for lawful consent to data processing. In addition, the service portal contains knowledge graphs for German company data based on various providers, registers and time horizons. These knowledge graphs provide access to additional legal and pre-processing aids as well as extensive company information.

Together with the data, analysis and training resources on the platform, BERD users have extensive opportunities to work with large and unstructured data in the social sciences.

Status: 8 March 2024
