LLMs can promote open science if the model, version and parameters are clearly documented

Ulrich Matter on his experiences with open science

Photo of Professor Dr Ulrich Matter

The three key learnings:

  • Replicability in economics can benefit from large language models, as clearly documented models and parameters can be reused consistently – in contrast to human labelling alone.
  • Transparency in AI-supported economic research requires minimum standards: researchers must disclose the model, version, prompt and parameter settings so that results remain traceable.
  • AI has the potential to provide practical support for open science in the future, for example through automated replication packages or more comprehensible README files that facilitate access to research data.

We want to talk about artificial intelligence and open science. You are concerned with the use of large language models for coding and labelling text data in applied economics – and with the consequences for transparency and replicability in research. Before we go into detail: What role does text data analysis play in empirical economic research in general? What are its fields of application?

UM: Its importance has increased significantly over the past ten years because digitisation has made more and more text material available. At the same time, researchers who are not specialists in this field now have access to easy-to-use tools that deliver reliable results and that they can apply themselves. The areas of application in economics are diverse. Often, the aim is to record consumer preferences, examine product characteristics or analyse values held by the population. In my field of research, for example, deriving political attitudes from texts plays a central role. This can include analysing comments in online forums when it comes to voter attitudes. Equally relevant is the evaluation of media content, i.e. the question of how the media presents certain topics – whether in individual sections or across the board. From this, it is possible to deduce, for example, where a medium can be located on a left-right scale. Essentially, the aim is to extract information from large amounts of text that represents a variable – such as an attitude or preference – which would be difficult to grasp empirically without text data.

So you focus primarily on journalistic texts and comments on journalistic content?

UM: Exactly. I am particularly interested in two areas: firstly, how the media reports – for example, on news portals – and what information can be gleaned from this. Secondly, how politicians communicate, i.e. which topics they address and how they present them in speeches or on social media platforms. From this, it is possible to deduce their stance on individual issues or their overall political position. A similar approach is taken in other fields. In finance, for example, analysts examine how market sentiment can be gleaned from texts even before it is reflected in share prices. To do this, they evaluate posts on social media, for example by actors who are likely to be directly involved in important stock market transactions, or reports in the business media. In both cases, the aim is to derive indicators of values, preferences or sentiments from linguistic patterns that would otherwise be difficult to measure.

Looking now at the importance of large language models for coding and labelling text data, can we say that their use has exploded in the last two years? Have LLMs become the central tool, or are there still equivalent alternatives?

UM: Although many papers have not yet been formally published – experience shows that this takes time in economics – almost all of the current working papers in this field include at least one application of LLMs. There are different ways of using them. LLMs are often used in a complementary manner: classic text processing methods are combined with additional analyses based on LLMs. In some cases, however, LLMs are also used where traditional machine learning approaches are too complex or simply not yet developed. And finally, they are increasingly replacing manual work: LLMs are now used for coding variables, which used to be done by students or research assistants. Otherwise, I agree with you: their use has increased significantly. I see two main reasons for this. First, LLMs open up new possibilities that were not feasible with previous methods. Second, they are very easy to access, especially for simple coding tasks. Researchers who previously had little experience with text analysis and would have tended to use assistants for such tasks can now directly access this technology. The community already seems to accept the use of LLMs for these simple applications. At the same time, however, many questions remain unanswered – in particular, what the consequences will be if this practice becomes more widespread.

When you say that LLMs are now used everywhere, are we talking specifically about ChatGPT or specially adapted models?

UM: The spectrum is relatively broad. It is evident that the more technical expertise the team of authors has, the more frequently researchers resort to open-source models. However, these are often less easily accessible than, for example, the models from OpenAI. In practice, therefore, an OpenAI model is actually used in most cases – simply because it is readily available. This is particularly true for data that is not sensitive and can therefore be processed via this platform without any major ethical concerns. There are basically two approaches. Either you work manually via the ChatGPT application, or you use the programming interface, which allows larger amounts of data to be fed in automatically and results to be recorded systematically. Both approaches are used in research practice. In some cases, however, it remains unclear in the publications which variant was actually used. This is certainly relevant for replicability, as it makes a difference in terms of methodology.
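For illustration, a minimal sketch of what such an API-based labelling workflow could look like, using the openai Python client; the model name, prompt and category set are placeholders, not details from the interview:

```python
# Minimal sketch of API-based text labelling (illustrative only).
# Assumes the official `openai` Python client and an OPENAI_API_KEY in the
# environment; model name and categories are placeholders.
from openai import OpenAI

client = OpenAI()

PROMPT = (
    "Classify the political leaning of the following news comment as "
    "'left', 'centre' or 'right'. Answer with a single word.\n\nText: {text}"
)

def label_text(text: str, model: str = "gpt-4o-mini", temperature: float = 0.0) -> str:
    """Send one text to the chat completions endpoint and return the label."""
    response = client.chat.completions.create(
        model=model,
        temperature=temperature,  # 0 keeps the output as deterministic as possible
        messages=[{"role": "user", "content": PROMPT.format(text=text)}],
    )
    return response.choices[0].message.content.strip().lower()

# Feed a larger corpus automatically and record the results systematically.
comments = ["Taxes on top incomes should rise.", "Deregulation will boost growth."]
labels = [label_text(c) for c in comments]
print(list(zip(comments, labels)))
```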

Which three fields of application would you describe as particularly relevant for the use of LLMs?

UM: My assessment is naturally influenced by my own field of research, as that is where I am most familiar with the literature. Basically, these are the fields in which text has already been used intensively in the past. A key example is the financial and macroeconomic sector, in particular the analysis of investment decisions and market sentiment. Until now, simple methods have often been used in this field, such as counting the frequency of certain terms and deriving a positive or negative sentiment from this. With LLMs, these analyses can now be significantly refined: it is possible to identify more specifically which patterns of argumentation or expectations occur in the texts and what significance they could have for the development of stock market prices. A second field is media economics. At conferences over the past year and a half to two years, I have seen virtually no contributions that do not draw on LLMs in some form. Finally, there are many exploratory applications that are more based on prompting, i.e. controlling LLMs with specific inputs without having a comprehensive text analysis framework already established in the background. Here, researchers often use easily accessible tools such as ChatGPT to develop initial hypotheses or pilot analyses.

Studies by TU Darmstadt show that a large proportion of students regularly use ChatGPT for their studies. Has using it become part of education – even at doctoral level? And is this changing the scientific knowledge process?

UM: Universities are increasingly integrating the use of AI into their teaching. In St. Gallen, for example, I teach a course called “Prompt Engineering for Economists,” in which students learn to formulate inputs in such a way that models deliver useful results. In addition, the first students are discovering AI as a learning aid, for example through chatbots that can provide feedback at any time on a topic they do not yet understand – albeit far less frequently and far less elaborately than would theoretically be possible. The area of exams is the least clear: some formats no longer work, while others need to be rethought. It is also striking that although students use ChatGPT frequently, they often do so without much reflection.

So, there is still significant room for improvement in prompting skills?

UM: Exactly. Many students give little thought to how they can meaningfully integrate AI tools into their learning. Although there is clear course information with learning objectives, literature and exam requirements, there is often no consideration of how to use these resources in conjunction with the new technologies. Our generation had to develop strategies without such tools – such as how much time to spend in the library or how to take notes. Today, we see the potential of AI, but for students who are just starting out, studying itself is already a challenge. They have no point of reference without AI. Whether a teaching-assistant chatbot is actually the solution remains an open question for me. The crucial question is how we can get students to use this technology productively – and not just superficially.

To what extent do AI tools that are available at all times change learning and the cognitive process?

UM: It is still too early for a final assessment, but I do see risks. Students are increasingly turning to AI instead of working their way through required reading. This often means they lack the overview that a book or lecture provides. AI provides specific explanations that help to reproduce knowledge, but not necessarily to apply it or place it in a larger context. I notice this in exams, for example: details are mentioned, but connections between concepts are missing. We therefore need to consider how to encourage students to link content more strongly. In Bern, we are testing a model with two components: first, traditional lectures with AI-free exams to ensure a basic understanding; second, longer exam formats in which AI tools are expressly permitted – for example, for data analysis or a small software project. In these cases, it is not enough to reproduce individual answers; students must link concepts and apply them in practice. This approach forces them to master both: understanding the basics and the productive use of AI in a more complex context.

What role do prompt engineering and the choice of model – such as GPT compared to open-source models – play in the quality of results? What experiences have you had?

UM: In my area of focus, the labelling of texts relevant to economics, we have seen little difference in quality so far. If models were developed at the same time and are similar in size, they deliver comparable results – regardless of whether they are open source or not. Of course, this may well be different in another context, i.e. in another field of research. We tested this in practice in an ongoing project: First, we compared an OpenAI model with human labelling on a small scale. Then we compared the result with a current open-source model such as DeepSeek. Both delivered almost identical quality. Nevertheless, we decided to scale up with DeepSeek – simply because it was around five times cheaper. For many research teams, it is therefore not so much the quality of the results that is decisive, but rather the economic question: can a project be financed at all with a commercial model such as OpenAI if it is used on a large scale?

When we talk about transparency, what precautions are necessary to avoid bias and lack of transparency when using LLMs?

UM: It is crucial to always work via the standardised developer interface – regardless of whether you are using OpenAI or an open-source model. These interfaces are now largely standardised, and providers document exactly which model is being used, including the version number and, ideally, the release date. This information is essential for replicability: it is not enough to simply mention "GPT-4"; you have to specify the exact version. It is equally important to document the parameters used. Without this information, it is not possible to reproduce a result. This includes, for example, the so-called temperature parameter, which controls how the probability distribution over the next word is scaled in these models – in other words, whether a model produces more deterministic or more variable outputs. Variants should also be presented here – similar to robustness analyses in econometrics. Specialist journals will probably demand this more strongly in future. On the much-discussed question of open-source models: transparency is, of course, desirable in principle. But most researchers can’t do much with open model weights anyway. In practice, it’s less important whether you can see a model’s internal matrices and more important whether it’s clearly documented which model was used, with which parameters, which input data and which results. That’s the level that creates real traceability.
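A hypothetical sketch of the kind of run metadata this implies – exact model identifier, parameters, prompt and date stored next to the outputs; all field values below are placeholders, not details from the interview:

```python
# Record the information needed to replicate an LLM-based labelling step.
# Values are illustrative placeholders.
import json
from datetime import date

run_metadata = {
    "model": "gpt-4o-2024-08-06",    # full versioned identifier, not just "GPT-4"
    "api": "chat.completions",       # interface used (app vs. API matters)
    "temperature": 0.0,              # scaling of the output distribution
    "max_tokens": 5,
    "prompt_template": "Classify the political leaning of ... {text}",
    "run_date": date.today().isoformat(),
}

with open("labelling_run_metadata.json", "w") as f:
    json.dump(run_metadata, f, indent=2)
```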

For example, when you label a journalistic medium, how do you check whether the quality is right? In the past, students took on such tasks, and you also had to evaluate the results. How do you do that today?

UM: That is actually still an open question. In the past, we usually trained our own machine learning model, and the gold standard was the human eye: reviewers expected at least some of the data to be assessed manually by humans – often students. Today, the process is similar, except that our own models are often replaced by LLMs with specific prompts. Here, too, we check a subsample and compare the results with human labelling. A second, more elegant approach is to use several human labellers and various LLMs. Then one examines where deviations occur – between humans, between humans and models, or even between multiple models. It is usually considered satisfactory if the differences between humans and machines are no greater than those between two human labellers – in other words, if the agreement is similarly high.

Is this done on a random basis?

UM: Exactly. A random sample of around 500 texts is taken and labelled manually by several people. These results are then compared with the model outputs. At the same time, there are now benchmarks from the providers themselves – for example, how well GPT performs in the Bar Exam or in doctoral-level physics exams. This raises the question of whether our previous gold standard – manual labelling by students – is really the best benchmark. Deviations from human labelling are not necessarily errors; they can also indicate better results. The fundamental problem is: how do you evaluate models in the first place? The industry is experimenting intensively with evaluation methods, and we can learn from this in our research. For us in the field of text labelling in economics, this is a small part of this big debate. Currently, however, human labelling is still the recognised gold standard. Anyone who uses a model must demonstrate that its results are at least as good as those produced by humans.
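An illustrative sketch of such an agreement check, comparing two human labellers with each other and with the model on a random subsample; it uses scikit-learn's Cohen's kappa, and the label data shown is made up:

```python
# Compare inter-human agreement with model-human agreement on a subsample.
# Illustrative data; in practice the subsample would contain e.g. 500 texts.
from sklearn.metrics import cohen_kappa_score

human_a = ["left", "right", "centre", "left", "right"]
human_b = ["left", "right", "centre", "centre", "right"]
model   = ["left", "right", "centre", "left", "right"]

kappa_humans     = cohen_kappa_score(human_a, human_b)
kappa_model_vs_a = cohen_kappa_score(model, human_a)

print(f"human vs human agreement (kappa): {kappa_humans:.2f}")
print(f"model vs human agreement (kappa): {kappa_model_vs_a:.2f}")

# Heuristic from the interview: the model is considered acceptable if its
# agreement with humans is roughly as high as the agreement among humans.
if kappa_model_vs_a >= kappa_humans:
    print("Model agreement is at least as high as inter-human agreement.")
```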

Text data analysis plays a central role in media economics. However, journalists are now increasingly using ChatGPT to write texts. Have you already observed that this is changing linguistic patterns and making it more difficult to discern political attitudes?

UM: No, not in this form. We expect that editorial control by the editor-in-chief will remain strong enough at media outlets to ensure that the intended political stance continues to be recognisable. But your point is important, because the problem is more evident in other areas – such as hotel reviews or posts on platforms like X or LinkedIn. There, many people use ChatGPT directly for writing. In the past, the main problem was filtering out bots from the data. Today, real accounts are posting, but their language is increasingly bot-like. This creates a new “bot problem” for research that extracts opinions from social media posts: even verified human accounts produce texts that are stylistically similar and thus perhaps more similar than we would like.

Let’s talk about AI and open science again. Open science focuses on accessibility, replicability and transparency. Proprietary AI models, however, are considered black boxes. Where do you see overlaps or areas of application in which both developments can reinforce each other? Can AI advance open science – for example, through higher computational reproducibility?

UM: First things first: even open-source models are a black box for most people. Even providers such as Meta do not publish all the details about the training of the Llama models. So the promise of complete openness is only partially fulfilled. It would be much more helpful for research to know exactly how models were tailored for political correctness, for example – but this information is often missing. Where I see great potential, however, is in replicability. In the past, text labelling was done by students. The article could only describe how they were instructed – exact reproduction by other people was practically impossible. Today, a model can be used as a “constant labeller”: with clearly documented information on the model version, parameters and prompts, the process can be exactly replicated later. This significantly increases replicability in the field of text analysis, even if the models themselves remain black boxes.

And how is it in your field? Is it now common practice for researchers to disclose the model, date, prompt or training details?

UM: Not yet. In many working papers, you only find brief references such as “ChatGPT was used for labelling”. This, of course, does not guarantee traceability. A few months ago, I met the AEA’s data editor, Lars Vilhuber, at a workshop where he spoke on this topic. I think his suggestion is to make it mandatory in future to specify which model was used, at what time and with which parameters. That would suffice as a minimum standard. As I understand it, he is working on corresponding recommendations, which would then have to be approved by the board. We will probably soon – perhaps even this year – receive clear guidelines from large organisations such as the AEA. These could also specify how prompts should be documented. Comparable standards already exist for experiments, where protocols must be published according to fixed templates. I expect that similar rules will now also be established for the use of LLMs.

Apart from text analysis, in which areas do you expect developments in the use of LLMs in empirical economic research in the coming years?

UM: One key practical aspect concerns everyday research. Empirical work requires a lot of data preparation: variables must be coded, regression models specified and extensive code written. With LLMs, this code can already be developed reliably and more quickly, which speeds up processes and creates more space for content-related considerations. I also see parallels in theory. Colleagues from economic theory, for example, use models to test sketches of proofs or suggest alternative formulations. What used to be passed on to doctoral students or co-authors can now also be mirrored by AI – not as a replacement, but as a productive addition. I think this is a very important area. In addition, there is potential in computational economics, especially in agent-based simulations. There, LLMs can be used to generate significantly more complex behaviour patterns for simulated consumers or producers – in some cases with locally applicable models. Although I have seen little published work on this so far, I believe the potential is considerable.

Looking back at AI and open science, where do you see potential that has hardly been exploited so far? The development of AI tools is progressing rapidly, and there are already thousands of applications. Where could this journey take us?

UM: My hope is that AI tools will emerge in the open science community that significantly reduce the effort required for transparency. A simple example would be the creation of replication packages. Today, this is a time-consuming, manual task. AI could semi-automate this process – for example, when writing README files. These must be formulated so clearly that even people who have never seen the data before can understand the process in a few minutes. In practice, this is often difficult: what seems obvious to me as the author is not immediately comprehensible to others. AI models can be very helpful here. I already use them for my own code scripts to improve documentation. The explanations generated are often detailed, understandable and save a lot of time. Instead of hours, it only takes a few minutes. If such tools were used more widely, it would not only be easier for researchers to make their work transparent, but also for others who want to make use of these open science offerings.
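A hypothetical sketch of how such semi-automated README drafting could look: the analysis scripts are passed to an LLM, which produces a first draft for the author to check and edit. The openai client, model name and file names below are placeholders, not a tool mentioned in the interview:

```python
# Draft a README for a replication package from its scripts (illustrative).
from pathlib import Path
from openai import OpenAI

client = OpenAI()

# Placeholder script names; in practice, the package's actual analysis files.
scripts = [Path("01_clean_data.py"), Path("02_run_regressions.py")]
code_dump = "\n\n".join(f"# File: {p}\n{p.read_text()}" for p in scripts if p.exists())

draft = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    temperature=0.2,
    messages=[{
        "role": "user",
        "content": (
            "Draft a README for a replication package containing the scripts "
            "below. Explain the order in which to run them, the inputs they "
            "expect and the outputs they produce, so that someone who has "
            "never seen the project can reproduce it.\n\n" + code_dump
        ),
    }],
)

Path("README_draft.md").write_text(draft.choices[0].message.content)
```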

One could also imagine using different personas: “Do you understand this README file – and where not?”

UM: Exactly, some of these tools already work this way. They combine several agents: one drafts the documentation, another checks its comprehensibility, and a third evaluates it again. The problem is not so much the technology – which is already available today – but rather the question of resources: who develops and finances such applications in the context of open science? The advantage would be considerable: researchers would no longer have to deal with every detail of the documentation themselves, but could delegate this task. And even users who don’t want anything to do with AI would benefit because the results would be clearer, more consistent and more accessible.
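A small sketch of the persona idea: after drafting, a second call reviews the README from the perspective of a reader who has never seen the project. Again illustrative only; the model name and file name are placeholders:

```python
# Review a README draft from the perspective of an outside reader.
from pathlib import Path
from openai import OpenAI

client = OpenAI()
readme_draft = Path("README_draft.md").read_text()  # draft from the previous step

review = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{
        "role": "user",
        "content": (
            "You are a researcher who has never seen this project. Read the "
            "README below and list every step that is unclear or missing.\n\n"
            + readme_draft
        ),
    }],
)
print(review.choices[0].message.content)
```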

Thank you very much!

*The interview was conducted on 13 August 2025 by Dr Doreen Siegfried.

This text was translated on 10 September 2025 using DeepL Pro.

About Prof. Dr. Ulrich Matter:

Prof. Dr. Ulrich Matter is Professor of Applied Data Science at Bern University of Applied Sciences and Affiliate Professor of Economics at the University of St. Gallen. His research focuses on the intersection of data science, quantitative political economy and media economics in the digital space. Matter is Principal Investigator of two projects funded by the Swiss National Science Foundation: Consequences of Personalised Information Provision Online: Segregation, Polarisation, and Radicalisation (2022–2026) and Ideological Ad Targeting in Consumer Markets and the Political Market Place (2023–2027). Before joining the University of St. Gallen, he was a visiting researcher at the Berkman Klein Center for Internet & Society at Harvard University.

Contact BFH: https://www.bfh.ch/de/ueber-die-bfh/personen/b3kow3zcylq4/

LinkedIn: https://www.linkedin.com/in/ulrich-matter-55b501bb/

Personal website: https://umatter.github.io/

GitHub: https://www.github.com/umatter



