Language technology, which increasingly shapes our lives, produces exclusions. To advance knowledge equity, Wikimedia Deutschland, together with 52 partners from science and civil society, developed the strategic plan “A digital Europe that treats all languages equally” in 2022.
“Europe is linguistically very diverse, with 24 official languages alone, plus dozens of regional or minority languages,” says Maria Heuschkel, Project Manager Software Development at Wikimedia Deutschland. That diversity is welcome, “but these languages are represented very differently in the digital space.”
According to her, this imbalance is noticeable in practice, for example in translation apps, automated spell checkers, voice assistants like Siri, Google Assistant and Alexa, or artificial intelligence like ChatGPT. “Such programs work well for languages like English, German, French or Spanish. But for other official languages like Finnish or Romanian, they deliver poorer results,” Heuschkel states. The gap is wider still for languages such as Basque or Welsh, with the resulting danger that smaller language communities could lose out on the Internet in the future. “Ultimately, the extinction of certain languages will be accelerated,” Heuschkel notes.
More open source materials
To counteract this development, Wikimedia Deutschland (WMDE) joined forces with 52 partners from the fields of science, civil society and industry in a consortium that drafted a strategic plan for the European Commission for a digital Europe that treats all languages equally. Among those involved is the German Research Center for Artificial Intelligence (DFKI), with which WMDE has worked on several projects.
“The first thing was to identify the various problem areas,” says Heuschkel, explaining the approach of the consortium, which submitted a total of 47 reports. Among them is a report from Wikimedia. “In many languages, there is not enough freely available training data for language models online, such as text corpora or audio and video files,” says Heuschkel. In addition, there is often a lack of sources, especially from underrepresented language communities, that editors can cite as evidence in their Wikipedias. Accordingly, one approach must be to create more open source materials.
Outlook: Abstract Wikipedia
Especially in comparatively small communities, the capacities of volunteers are limited. However, data could also be generated automatically, says Heuschkel, for example via the free knowledge base Wikidata. The project manager also points to the Abstract Wikipedia project, which is currently under development and aims to create a language-independent version of Wikipedia from Wikidata's structured data. It is an ideal tool for small communities that do not have the resources to build and manage a Wikipedia in their own language, and a possible further step toward greater language equity in Europe.