How economists can study machine learning methods by playing
Bernd Bischl talks about Open Science, research software and the machine learning platform OpenML
Three key learnings:
- The OpenML platform enables researchers to analyse their data automatically. Economists who want to add to their theoretical knowledge of machine learning can use OpenML as a digital lab for experiments and learn by playing.
- Resource costs in the context of Open Science are a dynamic quantity. They are a function of how well-trained researchers are and which tools they know. The better economists are trained in the use of Open Science tools, the more efficiently they can share research findings.
- The NFDI consortium BERD@NFDI wants to build a bridge between business studies and data science.
How did you get involved with Open Science?
BB: My field is machine learning. I have been interested in it since my student days and decided that I wanted to specialise in it. It has always been obvious to me that you need to do sensible experimental work in the field you are working in. In my discipline, Open Science mainly means working with data and code. These must be published in Open Access in a sensible way. I would like to define the topic of Open Science a little more broadly. You can see machine learning as a mathematical field, but you can also regard it as a science that needs to be analysed empirically, where experiments are run on a computer. Another thing I have enjoyed since I gained my PhD is working with a computer, i.e. creating tools, writing software, engineering software. While I was writing my PhD thesis I also started collaborating in the OpenML project. Joaquin Vanschoren from the Netherlands, who started the project, asked me if I wanted to be part of it, and I have been involved since 2012.
You are one of the core members of the OpenML platform. What is OpenML?
BB: OpenML is a digital platform which provides services in the field of machine learning to scientists from all disciplines. OpenML is open to core machine learning scientists as well as to domain scientists. The platform tries to make core objects from machine learning experiments digitally representable and shareable. Such core objects are datasets, algorithms, experimental studies and their results. For all these object classes we basically make formats available. They are all digital and, above all, can be represented in a machine-readable way. Unfortunately, that’s also why the platform is very labour-intensive and complex. Platforms offering datasets have been around for a long time. But there’s a problem if all datasets exist in different formats and many of them are not properly documented. If I want to run an integrative analysis I first need to spend hours writing code to bring the datasets into similar formats. OpenML standardises this. But we want to go a step further. We also want to represent the experimental protocols and results on this platform.
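The standardisation step described in this answer can be illustrated with a minimal Python sketch. It does not use the OpenML client or OpenML's actual formats; the two raw inputs, the column names and the helper functions (`make_dataset`, `normalise_row`, `load_csv`, `load_json`, `default_rate`) are hypothetical and exist only to show why a shared, machine-readable representation saves conversion work.

```python
import csv
import io
import json

# Hypothetical raw inputs: the same kind of table delivered in two formats.
RAW_CSV = "age;income;defaulted\n34;52000;no\n51;38000;yes\n"
RAW_JSON = (
    '[{"Age": 29, "Income": 61000, "Defaulted": "no"},'
    ' {"Age": 45, "Income": 27000, "Defaulted": "yes"}]'
)


def normalise_row(row):
    """Lower-case the column names and cast the numeric fields to int."""
    clean = {key.lower(): value for key, value in row.items()}
    return {
        "age": int(clean["age"]),
        "income": int(clean["income"]),
        "defaulted": clean["defaulted"],
    }


def make_dataset(name, rows, target):
    """Wrap rows in one common structure with a little metadata."""
    return {
        "name": name,
        "target": target,
        "columns": sorted(rows[0].keys()),
        "rows": rows,
    }


def load_csv(name, text):
    """Read a semicolon-separated table into the common structure."""
    reader = csv.DictReader(io.StringIO(text), delimiter=";")
    return make_dataset(name, [normalise_row(r) for r in reader], "defaulted")


def load_json(name, text):
    """Read a JSON list of records into the common structure."""
    records = json.loads(text)
    return make_dataset(name, [normalise_row(r) for r in records], "defaulted")


def default_rate(dataset):
    """Share of rows whose target column equals 'yes'."""
    rows = dataset["rows"]
    return sum(r[dataset["target"]] == "yes" for r in rows) / len(rows)


datasets = [load_csv("survey_a", RAW_CSV), load_json("survey_b", RAW_JSON)]

# Once both sources share one representation, a single analysis
# function covers them without any format-specific code.
for d in datasets:
    print(d["name"], d["columns"], default_rate(d))
```

The format-specific code is confined to the two loaders; everything after that point works on any dataset in the common structure, which is the effort a platform with standardised formats takes off the researcher.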
What can I as a scientist do with OpenML?
BB: That depends on personal perspective. Am I a machine learner or a domain scientist? Originally we built this platform because we were interested in the empirical view on machine learning. Empirical results do not always correspond with theoretical results, and that’s why we need meta-analyses for machine learning, so we can find out which procedures work well under which conditions. The other thing is: if I want to improve things algorithmically, I need a gauge. OpenML is a possible gauge to find out whether I have really made relevant progress empirically. For domain scientists it may be more interesting to share the dataset and inspire other people to find a better solution, because they rather see themselves as novices in machine learning.
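One simple way to use a shared collection of results as a "gauge" is to rank competing methods on each dataset and average the ranks. The sketch below assumes exactly that; it is not a description of how OpenML itself evaluates anything. The dataset names, algorithm names and accuracy values are invented for illustration and are not real results.

```python
from statistics import mean

# Hypothetical accuracies, invented for illustration only.
scores = {
    "dataset_a": {"baseline": 0.80, "new_method": 0.84, "other": 0.82},
    "dataset_b": {"baseline": 0.70, "new_method": 0.72, "other": 0.75},
    "dataset_c": {"baseline": 0.90, "new_method": 0.93, "other": 0.91},
}


def ranks_on_dataset(results):
    """Rank algorithms on one dataset; rank 1 is the highest score.

    Ties are not handled specially in this sketch.
    """
    ordered = sorted(results, key=results.get, reverse=True)
    return {algo: position for position, algo in enumerate(ordered, start=1)}


def mean_ranks(all_scores):
    """Average each algorithm's rank over all datasets."""
    per_dataset = [ranks_on_dataset(r) for r in all_scores.values()]
    algos = per_dataset[0].keys()
    return {algo: mean(r[algo] for r in per_dataset) for algo in algos}


summary = mean_ranks(scores)
best = min(summary, key=summary.get)

for algo, rank in sorted(summary.items(), key=lambda item: item[1]):
    print(f"{algo}: mean rank {rank:.2f}")
print("lowest mean rank:", best)
```

Averaging ranks rather than raw scores keeps one easy or one hard dataset from dominating the comparison. A real meta-analysis would also need many more datasets, repeated resampling and a statistical test before claiming that one method is better.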
In economics and especially in business studies, unstructured data play an ever larger role. How can economists working with unstructured data use OpenML?
BB: In the NFDI consortium BERD@NFDI in particular, we are trying to close this gap. That means economic researchers need basic training in data science. They need certain basic programming skills, because these days such analyses are usually written in Python or R, typically as short programmes, that is, scripts. You have to be able to write code, and this will remain the case for quite some time yet. That is also how OpenML works. We try to make things easier in BERD, mainly by writing manuals for certain use cases. I think we need to work on two issues right now. We need to make the tools simpler and to build specialised tools for economists. But we also really need to change something about the training. There are many young scientists right now in the social sciences and economics who want to familiarise themselves with machine learning. And it’s an exciting learning process for me, too, to familiarise myself with the research questions of the other side.
What exactly does the Open in OpenML stand for?
BB: Open stands for exactly the same idea as in Open Science: that everything should be open and accessible. The FAIR principles are also an essential part.
Are there learning materials for laypeople on the platform OpenML? Or is OpenML more a tool for interested people who want to dive right into an analysis?
BB: It’s not a training course but a tool to simplify experimental work. But if I attend a course on machine learning somewhere, OpenML can be the digital laboratory where I can work experimentally, play around with things and thus learn by playing. OpenML helps with learning because you get access to many datasets. You can also see the results that other people have produced; you can see progress.
Joaquin Vanschoren and I (with some others) have already used the platform in courses, and the students, working in groups, could see what the others were doing. Above all, OpenML gives you an impression of what really works in practice. Alongside the theory, OpenML is a useful tool if you want to study machine learning. What is really worthwhile? You can see that quite well from this.
For instance, I have an open course on machine learning on GitHub, which is openly accessible, with videos and slides and all the material sources. We have even written a small paper explaining our philosophy on such “Open Source Educational Resources”.
Right now we are developing a new variant of my course for a new “minor subject AI”. We are reworking it to make the course much more visual and much more application-oriented. We are creating one version for natural scientists. And we’re building a second version for the humanities and the social sciences. We’re including a high number of practical components so that people can really do things themselves with data and models and thus can learn from it. It is truly a practical skill that has to be learned. And unfortunately it takes time to master it.
How can the study of machine learning contribute to fostering the Open Science movement?
BB: The reason why so many scientists still don’t share their research findings is that it requires too much effort. What you need to understand is that these resource costs are a dynamic quantity, a function of how well-trained I am, which tools I know, and whether I take the difficult or the easy way. A lot is going on at GitHub nowadays; that’s another technical system that I need to master. We have started teaching our bachelor students how GitHub works. Ten years ago, most universities didn’t do that, and it’s still not done universally. The curricula must be adjusted.
So someone who cites the required effort as an argument shows that they have not mastered the tools needed for science in the digital age?
BB: The problem is that these courses must be taught by people who know a lot about the subject. But so far there are no established career paths for such people. In theory you can organise a lot of courses, but they won’t take place because of staff shortages. We urgently need career paths for people who are specialised in these subjects, such as research software engineering. They will automatically offer good training courses. We must train the next generation in basics and we must also train the next generation of specialists. I see too few professorships advertised specially for these subjects. At LMU we have begun to include Open Science in job offers. Applicants are requested to describe how they already practise Open Science and how they plan to do so. But it’s not enough. I also need one in twenty professors who has done this exclusively. I am also very happy that we now have a dedicated Open Science Centre at LMU, which cooperates with our Munich Center for Machine Learning and which we support with staff.
Let’s talk about NFDI and BERD: What would be a good cooperation between machine learners and business economists in your opinion?
BB: BERD wants to create domain-specific manuals, use cases and starting points; that’s our main goal. This will enable people from economic research to work as independently as possible. It’s already happening, but we want to make this even easier for a larger number of researchers. We also want to create standards for exchange. Standards are another means of making things comparable. And that’s how you automatically run into collaboration and communication. Of course you cannot expect economists to familiarise themselves on their own with ML, which is very complex on the one hand and very dynamic on the other. They have their own legitimate research interests, after all. You could say we data scientists will do everything for you, but that’s not really a viable option. The recipe for success is collaboration and consultation.
At LMU, we have StaBLab, a statistical consultation lab where we advise on different projects at every level: support for a PhD thesis, research projects etc. Research projects can get this at a very low price or even for free. Often this collaboration leads to a joint publication. And since we attracted the Munich Center for Machine Learning to Munich, we have also established a Machine-Learning-Consulting-Unit. In my view that’s what you have to do if you want to work collaboratively. The biggest problem at the start is that both sides speak different languages. The most difficult step in machine learning is typically not understanding all the complicated formulae behind the models, but the transfer and formalisation step. You can only do this together. Besides the formalisation step, the evaluation of the results at the end is very important. That’s where we also try to advise, for example on whether the right techniques have been used. Of course this would work even better if people had decent training. Then you can communicate better directly.
What is the role of the new scholar-led infrastructures in implementing the idea of Open Science more strongly?
BB: Scholar-led infrastructures play a massive role in this because in the end that is the goal for all. It is an important initiative. We must learn to communicate with each other across domains.
There are many commercial providers for infrastructures in the science system who offer services for the entire research cycle. Can it work well if science is now building these infrastructures for itself?
BB: In an ideal world I would prefer non-commercially driven tools and initiatives. But we also know why it’s not so easy: we can usually invest less money in these initiatives. The weight of funding behind public tools is usually different. Commercial tools are already there, they are professionally maintained and usually they are also a little better because of that. I wouldn’t really differentiate between commercial and public infrastructures. What matters are the framework conditions. We, the public and scientists, must participate in the discussion of these framework conditions. We need to be part of the debate and to regulate the conditions, and there’s a point where we must say that the framework conditions shouldn’t change to our disadvantage once these platforms have become fundamental infrastructure for us. Which means you need to exert your influence and not give in completely. What we are trying to do in BERD is a combination of well-proven things like GitHub and new things.
What does Open Science mean for you?
BB: I am mostly interested in sensible empirical work. Open Science is an important component of good empirical work, because it ensures that other people can build on what I have done empirically. It helps to view things with a mathematical approach. When I look at a mathematical field, I can publish theorems in a paper, for example. But typically I also include the proof in the paper, because otherwise nobody can see that what I have done is correct. And then I also publish my empirical observations. We have a scientific toolbox for doing this correctly. In my opinion this is less well-known in our field than it could be because most people focus on the mathematical part. We as statisticians, machine learners and data scientists make intellectual tools available so that other people can carry out a sensible empirical analysis. But we don’t apply these tools to our own field as well as we could, which I find odd. And that’s my point; I want us to do this sensibly. So what I need now is decent training and a toolbox. Beyond that, of course, I need all the details, so that someone can retrace the work and reflect on it critically. Maybe someone doesn’t agree with what I have done and wants to dispute it, or wants to build on it and improve it. Critical examination of research can only take place if all information is available, and much of it, alas, is complex. Sometimes it can’t be fitted into an eight-page paper. On the other hand we have proper digital tools today. We just need to use them now and, most of all, we need to teach people how to do it.
How do you see the development of Open Science?
BB: I see much that is positive. If we look mainly in the direction of computer science and machine learning, I see many positive aspects because the entire discipline is comparatively young. That makes it comparatively flexible; it is often driven by young people and therefore things are changing faster. Many important journals and conferences rely on Open Science: it is either a kind of entry requirement, part of the evaluation, or there are incentives for it, such as awards. I see this as very positive. What is still too slow in my view is the area of training and recruiting; we need to put more fuel into this. Right at the moment, however, I see a lot of positive things after a long dry spell.
The questions were asked by Dr Doreen Siegfried.
The interview was conducted on March 3, 2022.
About Professor Bernd Bischl
Professor Bernd Bischl holds the chair for Statistical Learning and Data Science at the Institute for Statistics at LMU Munich. He is co-director of the Munich Center for Machine Learning (MCML), one of the national competence centres for Machine Learning in Germany.
He studied computer science, Artificial Intelligence and data science in Hamburg, Edinburgh and Dortmund and gained his PhD at TU Dortmund. He is an active developer of several R packages, heads the mlr (Machine Learning in R) Engineering Group and is a co-founder of the science platform OpenML for open and reproducible machine learning.
Professor Bernd Bischl is co-speaker of the NFDI consortium BERD@NFDI.