Wikidata, Artificial Intelligence and the Qurator project

Over 90 million data objects (items) can currently be found in Wikidata (as of May 2021).

In the QURATOR project, ten partners* are working on making curation techniques more valuable and more efficient through automation. In IT experts’ terms, curation means everything that has to do with processing data and knowledge. Searching, selecting and summarizing information has a direct impact on the technologies we use every day. The knowledge base Wikidata is also used for this purpose. Professor Georg Rehm, scientific and technical coordinator of the QURATOR project and researcher at the German Research Center for Artificial Intelligence (DFKI), and Lydia Pintscher, Wikidata product manager, explain the background.

Mr. Rehm, what are you working on at the Speech and Language Technology Lab at the DFKI?

REHM: Everything we do revolves around the topic of language. Most of the projects deal with text analytics: How can we extract specific knowledge from texts, documents, tweets or scientific papers? For example, we try to find mentions of people’s names, names of organizations or events, and map them to external knowledge bases. One of these is Wikidata. Other projects deal with text classification, hate speech detection, fake news detection, and machine translation.
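
To make this first step concrete, here is a minimal sketch of named entity recognition using the open-source spaCy library. spaCy is our illustrative choice and is not named in the interview; linking the recognized mentions to Wikidata items (entity linking) would be a subsequent step.

```python
import spacy

# Requires: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("John F. Kennedy spoke in Berlin in June 1963.")

# Print each recognized mention with its predicted type (PERSON, GPE, DATE, ...).
for ent in doc.ents:
    print(ent.text, "->", ent.label_)
```

Each recognized mention could then be looked up against an external knowledge base such as Wikidata.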

Ms. Pintscher, what role does Wikidata play in the QURATOR project?

PINTSCHER: At the moment, Wikidata, Wikimedia’s knowledge base, describes almost 100 million entities – and around 13,000 active editors are currently looking after this data. It’s a lot of work to maintain these entities, enrich them, and create links between them. Since Wikidata is now a fundamental building block of many technologies that are used every day, we have a responsibility to keep data quality high. This is what we have focused on as part of the QURATOR project. On the one hand, it’s about providing editors with better tools to identify and fix problems in the data. On the other hand, we want to make the data more accessible so that organizations like DFKI and other institutions can build on it, develop new apps, or conduct research.

What is special about this collaboration from your respective perspectives? What was the motivation to work with the different partners?

PINTSCHER: The project enables us to work in a consortium with organizations we would not otherwise have come into contact with, or at least not as intensively. We learn a lot from each other in the process. The expertise that DFKI has in the field of machine learning provides valuable impetus. One specific challenge for us, for example, is the question of how we deal with trends and gaps in Wikidata: data we don’t have, or data that describes certain countries or people differently than others. This problem affects not only Wikidata, but machine learning in general. This is where the exchange was, and still is, helpful.

REHM: We worked on a previous project called “Digital Curation Technologies”. In digital curation, the central question for us is: which technologies can help with this? One illustrative example is the work of journalists who have to monitor articles or hashtags on a particular topic – and are flooded with incoming content: Facebook posts, Telegram, Instagram, the usual news tickers, all of which, of course, they have to keep on their radar. Can we develop technologies to make journalistic work easier? Can we build a smart editor that – based on the journalist’s current state of knowledge – identifies posts that might contain surprising news? That’s what we’re trying to find solutions for. Wikidata is an important data partner in this project. In the process, we also want to jointly investigate whether there are gaps, plateaus or peaks in the data, or unwanted bias. These issues are becoming increasingly important to ensure objectivity and neutrality.

Could you describe the problem of bias with an example?

REHM: To give a negative example, there was a well-known chatbot developed by an American IT company that tweeted more or less automatically. This bot was shut down after a very short time because, unfortunately, no one paid attention to the data it was trained with – which included radical right-wing content. So-called web crawling was used to compile this training data, i.e. millions of web documents were collected automatically. In operation, the chatbot then suddenly started using radical right-wing terms. This content was part of the training data and, in a way, influenced the chatbot’s language model, radicalizing it. That’s a bias you really don’t want.

PINTSCHER: In Wikipedia, we have the prime example of gender bias, that is, the underrepresentation of women. On the one hand, this doesn’t reflect the population. At the same time, the problem also points to the past: who actually were the women who even had the opportunity to publish books, do scientific work, and reach the point where they become relevant for Wikipedia? Unfortunately, conditions of equality did not exist.

Mr. Rehm, how exactly do you proceed with Wikidata?

REHM: There is a huge amount of structured information in Wikidata, which is also interlinked and contains inherent knowledge, e.g. about superclasses, instances and properties. An example: John F. Kennedy. The information that JFK is a human being is available in machine-readable form. All human beings have a date of birth; those who have already died also have a date of death, and there is also information about the circumstances of their death.

We can use all this information to perform further processing steps. If I can successfully map a string like “JFK” to the corresponding Wikidata item using a named entity recognizer, then I also have access to the date of birth, the date of death if applicable, and possibly the location where the person died. This enables many other smart processing steps and applications, for example in geopolitical or sociological analysis, and also in the digital humanities, where Wikidata is an increasingly popular collection of research data. Through Wikidata, we can access even more sources of knowledge to create even more cross-references. This approach, also called Linked Data, is very powerful.
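
As a concrete illustration of this Linked Data approach (a sketch of our own, not part of the project’s codebase): the public Wikidata Query Service can be asked for exactly the properties mentioned above. The item Q9696 (John F. Kennedy) and the properties P569 (date of birth), P570 (date of death) and P20 (place of death) come from Wikidata itself; the Python code around the query is our minimal example.

```python
import requests

# Public SPARQL endpoint of the Wikidata Query Service.
ENDPOINT = "https://query.wikidata.org/sparql"

# Ask for the dates and place of death of Q9696 (John F. Kennedy).
QUERY = """
SELECT ?birth ?death ?placeOfDeathLabel WHERE {
  wd:Q9696 wdt:P569 ?birth ;          # date of birth
           wdt:P570 ?death ;          # date of death
           wdt:P20  ?placeOfDeath .   # place of death
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""

response = requests.get(
    ENDPOINT,
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "linked-data-example/0.1 (illustrative sketch)"},
)
response.raise_for_status()

for row in response.json()["results"]["bindings"]:
    print("born:", row["birth"]["value"])
    print("died:", row["death"]["value"])
    print("place of death:", row["placeOfDeathLabel"]["value"])
```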

Ms. Pintscher, how has Wikidata been able to develop further within the framework of QURATOR? 

PINTSCHER: Among other things, we have been working on so-called schemas. Wikidata makes it relatively easy to describe the world in its complexity. At the same time, we try to bring structure into this complexity. We have developed tools that allow editors to find places in Wikidata where there is either an error or an exception. There is the famous example of a woman who married the Eiffel Tower. Of course, we do not want to prevent the entry of such data. Editors enter what they want – and can then use schemas to automatically check the consistency of the data. The same applies to curiosities: for example, pets that receive diplomas. We have also developed a tool to find such curiosities automatically.
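
Wikidata’s Entity Schemas are written in the ShEx language. As a rough illustration of the kind of automatic consistency check such schemas enable, the following Python sketch (our own example, not one of the project’s tools) verifies one simple rule via Wikidata’s public API: an item with a date of death (P570) should also have a date of birth (P569), and the birth should precede the death.

```python
import requests

# Public MediaWiki API of Wikidata.
API = "https://www.wikidata.org/w/api.php"

def first_time_value(claims, prop):
    """Return the first time value (e.g. '+1917-05-29T00:00:00Z') for a property, if any."""
    for claim in claims.get(prop, []):
        snak = claim["mainsnak"]
        if snak.get("snaktype") == "value":
            return snak["datavalue"]["value"]["time"]
    return None

def check_birth_before_death(item_id):
    """A toy consistency check: a death date implies a birth date, and birth < death."""
    response = requests.get(API, params={
        "action": "wbgetentities", "ids": item_id,
        "props": "claims", "format": "json",
    })
    response.raise_for_status()
    claims = response.json()["entities"][item_id]["claims"]
    birth = first_time_value(claims, "P569")  # date of birth
    death = first_time_value(claims, "P570")  # date of death
    if death and not birth:
        return f"{item_id}: date of death without a date of birth"
    if birth and death and death <= birth:
        # Plain string comparison is fine for four-digit CE years in this time format.
        return f"{item_id}: date of death precedes date of birth"
    return f"{item_id}: dates look consistent"

print(check_birth_before_death("Q9696"))  # John F. Kennedy
```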

Taking stock, how do you see the result of the collaboration from the DFKI’s point of view?

REHM: It almost sounds as if the project is already over, but we still have more than half a year to go in the QURATOR project and still want to realize many things together. I hope that discussions like the important debate about bias will develop into interesting research. Furthermore, our goal is to help the Wikidata community use the resource better: to make it more intuitively accessible, to measure quality, and to act more transparently. I am looking forward to further cooperation, have really valued our collaboration so far, and hope that we can work together on follow-up projects as well.

The interview was conducted by Elisabeth Giesemann; text by Patrick Wildermann.

*3pc GmbH Neue Kommunikation, Ada Health GmbH, ART+COM AG, Condat AG, Deutsches Forschungszentrum für Künstliche Intelligenz GmbH (DFKI), Fraunhofer-Gesellschaft – Fraunhofer Institute for Open Communication Services, Semtation GmbH, Stiftung Preußischer Kulturbesitz/Staatsbibliothek zu Berlin, Ubermetrics Technologies GmbH and Wikimedia Deutschland e.V. are involved in QURATOR.