The Free Knowledge database Wikidata had several reasons to celebrate in 2022: the sound barrier of 100 million uploaded items was broken – and the project turned 10. On the occasion of this anniversary, Lydia Pintscher, Portfolio Lead of Wikidata, looks back and ahead.
What does the record number of 100 million items mean for Wikidata?
Lydia Pintscher: Of course I’m pleased, but on the other hand I don’t want to attach too much importance to such “higher-faster-further” milestones, if only because the significance of this figure is limited. You can’t compare the data with Wikipedia articles, in which volunteers have really invested a lot of time, effort and research. Wikidata works differently: an item can generally be created relatively quickly, in some cases even automatically. Overall, that’s why the growth of our community or the increasingly diverse uses of our data are more important to me.
What were highlights for you in 2022 in terms of Wikidata?
We hosted the Data Reuse Days and the Data Quality Days – two events where we brought the Wikidata community together. The Data Reuse Days were about bringing people who build cool apps or services with our data closer together with our editors, and about showing what is possible with Wikidata. The Data Quality Days, as the name implies, focus on data quality. This is an important topic for us. We are looking at what new tools or processes are available to increase the quality of our data. Both events took place online with an international community of users and editors from around the world. And, of course, Wikidata celebrated its 10th birthday.
Which milestones in the project’s history are special?
One milestone was the release of Wikidata – the moment when editors could create their first items. Another important point, not much later, was the possibility to insert links to Wikipedia articles. For example, before Wikidata existed, at the end of an article in the English-language Wikipedia, you would find a reference to the French version, the German version, the Italian version, and so on – many articles had very long lists that were duplicated in every Wikipedia, which meant chaos, because each of these links had to be kept consistent. Using bots, editors imported them into Wikidata and removed them from Wikipedia. From that moment, Wikidata got a lot of new items …
How exactly should one imagine this boost?
Now, there had to be an item in Wikidata for every relevant concept described anywhere in Wikipedia. A concept – that’s ‘Berlin’, for example. There were articles about Berlin in over 250 Wikipedias. As the next step, people could then start collecting data for the item ‘Berlin’ in Wikidata. This helped us enormously to build up a base of data in a relatively short time, which could then be improved and expanded.
How big is the Wikidata community at the moment – and how could it grow even further?
It currently comprises around 12,000 active editors, i.e. people who have made at least five edits in the past 30 days. Our goal is to get the word out to many more people about the benefits of contributing to Wikidata – for example, by creating more awareness of the technologies we use every day that contain our data, and how those technologies can be improved as Wikidata gets even better. Our data is used in quite a few websites, apps, and services, but the people who connect with it and gain knowledge from it don’t usually notice. After all, they don’t go to Wikidata.org, but get the data delivered by, for example, the personal digital assistant on their smartphone when they ask a question.
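The definition of an active editor given above (at least five edits in the past 30 days) can be illustrated with a small sketch. The function name `active_editors`, its parameters, and the input format (a list of editor name and timestamp pairs) are assumptions made for this illustration; this is not how Wikidata itself computes the figure.

```python
from collections import Counter
from datetime import datetime, timedelta


def active_editors(edits, now, min_edits=5, window_days=30):
    """Return the editors with at least `min_edits` edits in the last `window_days` days.

    `edits` is an iterable of (editor_name, timestamp) pairs. This input
    format is a hypothetical one chosen for the sketch.
    """
    cutoff = now - timedelta(days=window_days)
    # Count only the edits that fall inside the time window.
    counts = Counter(editor for editor, ts in edits if cutoff <= ts <= now)
    return {editor for editor, n in counts.items() if n >= min_edits}


if __name__ == "__main__":
    now = datetime(2023, 1, 31)
    sample = [("alice", now - timedelta(days=d)) for d in range(5)]
    sample += [("bob", now - timedelta(days=d)) for d in (1, 2, 40, 50, 60)]
    print(active_editors(sample, now))
```

In the sample data, the made-up editor "alice" has five edits inside the window and counts as active, while "bob" has five edits overall but only two of them in the past 30 days.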
Is the commercial use of Wikidata’s data, which is also used to train voice assistants such as Siri or Alexa, to be viewed critically?
We have explicitly decided to publish our data under CC0 – which means that anyone can do what they want with it. That includes any kind of commercial use, whether we welcome it or not. Not to mention that there are also non-commercial uses that we may not approve of. I take an ambivalent view on this. Voice assistants are precisely the tools through which people obtain their knowledge these days. Accordingly, I prefer it when it comes from a source that everyone can contribute to – and not from a closed system that no one can influence.
Meanwhile, the topic of artificial intelligence is gaining momentum. How does that affect Wikidata?
AI has always been a topic around Wikidata. That’s because Wikidata is the basis for many machine learning models. But now, of course, we are talking about a whole new level and are faced with questions: How do we position Wikidata in this new world? What is the added value of our project now? One answer is fact-based knowledge. A program like ChatGPT is often mistaken for something it is not, namely a knowledge engine. But such a chatbot operates on the basis of probabilities and sometimes suggests answers to questions that sound plausible but have nothing to do with reality. What we stand for is verifiable knowledge. The data in Wikidata can be used to run automated fact checks. This will become increasingly important in the future.
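The idea of an automated fact check can be sketched in a few lines. The function `check_claim` and the in-memory `statements` dictionary are hypothetical and exist only for this illustration; a real check would query Wikidata itself rather than a hand-written sample. The identifiers follow Wikidata's naming: Q64 is Berlin, P17 is the "country" property, Q183 is Germany, and Q142 is France.

```python
def check_claim(statements, subject, prop, value):
    """Compare a claimed (subject, property, value) triple against known statements.

    Returns "supported", "contradicted", or "unknown" when the data holds
    no statement for that subject and property.
    """
    known_values = statements.get((subject, prop))
    if known_values is None:
        return "unknown"
    return "supported" if value in known_values else "contradicted"


# A tiny hand-written sample standing in for Wikidata (an assumption for this sketch).
statements = {
    ("Q64", "P17"): {"Q183"},  # Berlin -> country -> Germany
}

if __name__ == "__main__":
    print(check_claim(statements, "Q64", "P17", "Q183"))   # Berlin is in Germany
    print(check_claim(statements, "Q64", "P17", "Q142"))   # Berlin is in France
    print(check_claim(statements, "Q64", "P1082", "1"))    # no data in the sample
```

The third outcome, "unknown", matters for this kind of check: a missing statement only means the data cannot confirm or refute the claim, not that the claim is false.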