01.10.2025

Toward Reliable Generative AIs: The Wikidata Embedding Project Supports Alternatives to Big Tech

Berlin, October 1, 2025 – Wikimedia Deutschland is today launching the Embedding Project, a vector database for Wikidata that is freely accessible to everyone. The project is a milestone: for the first time, open data from Wikidata can be used directly to develop generative AI applications. The technology is available online now and gives developers worldwide new ways to make large language models (LLMs) more transparent, reliable, and equitable. At the same time, it can serve as a counterweight to the AI products of big tech companies.

The vector database can be accessed at https://wd-vectordb.toolforge.org.


Photo: VGrigas, Lydia Pintscher, CC BY-SA 3.0

We want to create an infrastructure that enables everyone to develop generative AI applications based on verifiable, free and open data. This is an important step toward a digital world in which technologies for the benefit of society are not a footnote but the norm.

Lydia Pintscher, Portfolio Lead at Wikimedia Deutschland

Developers who want to learn how to use the vector database are welcome to attend the free Embedding Project Webinar on October 9. In addition to practical tips, many application examples will also be presented.

How the Data Has Been Made Usable for AI Development

Wikidata is the world’s largest open knowledge graph, and its data can be freely used. It currently contains approximately 119 million entries and is continually expanded by around 24,000 volunteers worldwide each month. Among other things, Wikidata holds structured data for Wikimedia projects such as Wikipedia, Wikivoyage, and Wikisource.

Conventional software can easily process the structured data in Wikidata, but current generative AI systems cannot, because they are designed for natural human language.

The Wikidata Embedding Project translates Wikidata’s statements into vectors. This enables generative AI models to interpret Wikidata’s content more accurately at the semantic level and to process it in natural language. In addition, the project supports the Model Context Protocol (MCP), a framework that acts as a bridge between AI and databases. This makes it easier for developers to use the Wikidata knowledge graph in generative AI applications.
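
As a rough illustration of what translating statements into vectors means, the following Python sketch verbalises a few Wikidata-style statements, turns them into vectors, and ranks them against a question by cosine similarity. It rests on clearly labeled assumptions: `embed` is a toy bag-of-words stand-in, not the embedding model the project actually uses, and the function names and the three sample statements are chosen purely for illustration. A real embedding model produces dense vectors that capture meaning beyond shared words.

```python
import math
import re
from collections import Counter


def statement_to_text(label, prop, value):
    """Verbalise one (item, property, value) statement as a short phrase."""
    return f"{label} {prop} {value}"


def embed(text):
    """Toy stand-in for an embedding model: a sparse bag-of-words vector.

    ASSUMPTION: this is NOT the project's embedding model; it only makes
    the idea of comparing vectors concrete.
    """
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Illustrative sample statements: (item ID, label, property, value).
STATEMENTS = [
    ("Q64", "Berlin", "capital of", "Germany"),
    ("Q42", "Douglas Adams", "occupation", "writer"),
    ("Q183", "Germany", "continent", "Europe"),
]


def search(query, statements=STATEMENTS):
    """Return (score, item ID) pairs, most similar statement first."""
    query_vec = embed(query)
    scored = [
        (cosine(query_vec, embed(statement_to_text(label, prop, value))), qid)
        for qid, label, prop, value in statements
    ]
    return sorted(scored, reverse=True)


if __name__ == "__main__":
    for score, qid in search("What is the capital of Germany?"):
        print(f"{qid}: {score:.3f}")
```

In this toy setup the statement about Berlin scores highest for the sample question because it shares the most words with it; a dense embedding model would also match paraphrases that share no words at all.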

What makes the Wikidata vector database special?

Direct connection for generative AI models: With the vector database, LLM systems can access reliable data from Wikidata directly through retrieval-augmented generation (RAG). RAG is a technique that improves the quality of generative AI output by drawing on external knowledge sources, such as the Wikidata vector database, to find more up-to-date answers rather than relying solely on unstructured training data.
Usable in many languages: The vector database supports search queries in English, French, and Arabic. By the end of the year, support for Spanish and Mandarin will be added. Further languages are to follow.
Broad query scope: Vector search (mathematical comparisons that reveal relationships between items) can process natural language, which is helpful for looking up examples or exploring a topic. In addition, keyword search and descriptive queries allow terms to be identified precisely. A hybrid search combines both approaches, making queries more convenient and more likely to succeed.
Results are automatically sorted for improved relevance: A built-in reranker reorders the search results from the vector database so that the most relevant ones appear at the top.
Applications go beyond GenAI: From fact-checking tools to vandalism detection systems – there are many possible applications that can be built on top of the vector database.
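
The hybrid search and reranking steps described above can be sketched in a few lines of Python. How the project actually merges vector and keyword results is not stated here, so the sketch assumes reciprocal rank fusion (RRF), a common merging technique, and uses a made-up relevance table in place of a real reranker model. The item IDs, function names, and scores are illustrative only.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of item IDs into one list.

    Each item earns 1 / (k + rank) per list it appears in, so items that
    rank well in more than one list rise to the top.
    ASSUMPTION: RRF is a stand-in; the project's fusion method may differ.
    """
    scores = {}
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    # Sort by descending score; break ties by item ID for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))


def rerank(candidates, relevance, top_n=3):
    """Reorder candidates by a finer-grained relevance score, keep the best."""
    return sorted(candidates, key=lambda item: -relevance(item))[:top_n]


# Hypothetical result lists for one query, best hit first.
vector_hits = ["Q64", "Q183", "Q1055"]   # from vector search
keyword_hits = ["Q64", "Q42", "Q183"]    # from keyword search

fused = reciprocal_rank_fusion([vector_hits, keyword_hits])

# Hypothetical reranker scores; a real reranker would compute these
# from the query and the text of each candidate.
toy_scores = {"Q64": 0.97, "Q183": 0.55, "Q42": 0.02, "Q1055": 0.31}
top = rerank(fused, lambda item: toy_scores.get(item, 0.0))

if __name__ == "__main__":
    print(fused)  # ['Q64', 'Q183', 'Q42', 'Q1055']
    print(top)    # ['Q64', 'Q183', 'Q1055']
```

Fusion works on rank positions only, so it needs no agreement between the scoring scales of the two searches; the reranker then has a short candidate list on which a slower, more precise comparison is affordable.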


Photo: Philippe Saadé, self-portrait, CC BY-SA 4.0

This Embedding Project launch shows that powerful AI doesn’t have to be controlled by a handful of companies — it can be open, collaborative, and built to serve everyone. After months of hard work and a successful development phase, we’re proud to open the doors to developers from all over the world and invite them to help shape the next chapter of generative AI.

Philippe Saadé, Wikidata AI Project Manager

What’s possible now

With the Embedding Project, Wikidata provides an open, public-interest-oriented data set, which offers several advantages:

More reliable: Generative AI can draw on the verified data from Wikidata via RAG, thereby reducing incorrect answers (or hallucinations).
Transparent: With the vector database, developers can cite Wikidata as a source. This allows users to trace which sources a result is based on. In addition, the codebase is available under an open license.
Always up to date: Wikidata is maintained and expanded daily by an active community. This means that the results of generative AI queries can be more up to date than those from systems that can only draw on their statically trained “knowledge.”
Equitable: Thanks to the work of a diverse and international volunteer community, Wikidata can also reflect underrepresented topics and perspectives, thus creating a more diverse, massively multilingual data basis for generative AI development.
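
To make the reliability and transparency points more concrete, here is a minimal Python sketch of the prompt-building step in a RAG pipeline. It assumes that retrieval has already returned a few Wikidata items with a short text each; the `Hit` class, the `build_prompt` function, and the prompt wording are hypothetical and not part of the project. Each source carries its Wikidata item URL so that a generated answer can be traced back to the underlying entry.

```python
from dataclasses import dataclass

# Public URL pattern for Wikidata item pages.
WIKIDATA_ITEM_URL = "https://www.wikidata.org/wiki/{qid}"


@dataclass
class Hit:
    """One retrieved item (hypothetical structure for this sketch)."""
    qid: str   # Wikidata item ID, e.g. "Q183"
    text: str  # short verbalised statement about the item


def build_prompt(question, hits):
    """Assemble an LLM prompt that grounds the answer in numbered sources."""
    lines = [
        "Answer the question using only the numbered sources.",
        "Cite sources like [1]. If they do not contain the answer, say so.",
        "",
        "Sources:",
    ]
    for number, hit in enumerate(hits, start=1):
        url = WIKIDATA_ITEM_URL.format(qid=hit.qid)
        lines.append(f"[{number}] {hit.text} ({url})")
    lines += ["", f"Question: {question}"]
    return "\n".join(lines)


if __name__ == "__main__":
    hits = [
        Hit("Q183", "Germany: capital is Berlin"),
        Hit("Q64", "Berlin: country is Germany"),
    ]
    print(build_prompt("What is the capital of Germany?", hits))
```

The resulting string would be sent to whichever language model the application uses; because the sources are numbered and linked, the application can show users exactly which Wikidata entries an answer relies on.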

The Embedding Project has been in development since September 2024 in close collaboration with two partners. DataStax, an IBM company, is a leading US provider of AI and data solutions. Jina AI is a Berlin-based expert in AI-powered search. Wikimedia Deutschland uses Jina AI’s embedding system to transform Wikidata’s data into vectors, which are then stored in DataStax’s Astra DB vector database.

More information: https://www.wikidata.org/wiki/Wikidata:Embedding_Project
Glossary of terms related to the Embedding Project.

Press Contact
Zarah Ziadi, Communications Manager Movement
Mobile +49 1517 4103 114
Zarah.ziadi@wikimedia.de

About Wikimedia Deutschland

Wikimedia Deutschland is a non-profit association with over 111,000 members and 180 employees that is committed to promoting freely available knowledge in the digital space. As the largest national representative of the international Wikimedia Movement, the association supports the volunteer communities of Wikipedia and other Wikimedia projects in Germany. Wikimedia Deutschland develops and maintains free software and the free database Wikidata. The association is committed to creating conditions in the areas of digital and education policy that enable free access to knowledge and data. We also cooperate with cultural institutions to make more cultural heritage freely accessible.
