No Language Left Behind: How to bridge the rapidly evolving AI language gap

October 6, 2023
Photo: AI Generated/Midjourney

New developments in artificial intelligence (AI) arise every day. AI offers us a vast array of opportunities for labour optimization, refinement of education systems, disease diagnosis, urban infrastructure improvement and economic forecasting. The list of these opportunities is endless. However, the list of those who can take advantage of these opportunities is quite limited. 

This article, which by the way is written by a human being (a disclaimer that seems to be necessary for each and every text nowadays), will be published in three languages: English, Russian and Kazakh. The latter falls into the category of "low-resource" languages. This means it is inadequately represented or researched in AI and machine learning due to the limited amount of linguistic resources available online.

As a result, the vast majority of AI tools accessible to people proficient in English, Russian, or other popular languages on the web are not available to individuals who only speak Kazakh.

According to data from Statista, the ranking of the most frequently used languages for web content as of January 2023 (by share of websites) is as follows: 1. English (58.8%); 2. Russian (5.3%); 3. Spanish (4.3%); 4. French (3.7%); 5. German (3.7%); 6. Japanese (3%); 7. Turkish (2.8%); 8. Persian (2.3%); 9. Chinese (1.7%); 10. Italian (1.6%). The Kazakh language does not make it into the top twenty.

According to the results of the latest population census in Kazakhstan (2021), over 13,768,000 residents of the country (more than 80%) reported being proficient in the state language. No precise data is available on the number of Kazakh speakers outside Kazakhstan.

I am a translator by education and a communications specialist by profession. Language is my working tool, and AI is my area of interest and research. Of all the AI tools that I have studied and used – language models, image generators, virtual assistants, automatic speech and text recognition systems, machine translation mechanisms, data analysis tools and content creation tools – only a few worked in Kazakh. Moreover, the quality of their output always left much to be desired.

Kazakh language in the AI era: a struggle for representation

Modern artificial intelligence tools are essentially built on predicting the most likely response based on a vast amount of "training data" — digital content bases that AI engineers use to build models. When the data is insufficient, as is the case with "low-resource" languages, AI tools either do not work, or they perform poorly.

"Most people in Kazakhstan know and use the Kazakh language in their daily lives. However, the presence of the Kazakh language in the digital world doesn't mirror its status in the real world,"
says Daniyar Mukanov, a programmer and specialist in generative AI from Kazakhstan, highlighting the discrepancy between the amount of data available in a certain language in reality compared to the digital realm.

For seven years, Daniyar has been running a blog about IT in Kazakh. Initially, the goal was to provide a platform for a dialogue about technology, but now another mission has emerged, namely creating content in Kazakh to support a variety of topics necessary for AI training. Daniyar encourages everyone able to do so to create quality Kazakh-language digital content in various fields, thereby contributing to AI training.

Global language imbalance: Who is deprived of access to AI?

Billions of people worldwide, like Daniyar, cannot fully utilize AI opportunities because of their language.

According to Ethnologue, the world's largest language reference, there are 7,168 languages worldwide. Only about 20 of them have enough training data online to create natural language processing systems.

The representation of a language in the virtual world is not always related to the number of its speakers in reality. Globalization and socio-economic changes lead to the dominance of certain languages in the digital space, often to the detriment of other languages, even those with more speakers.

For instance, Hindi, spoken by over 500 million people, is also considered a "low-resource" language, while a Western European language such as Dutch, spoken by about 20 times fewer people, technically counts as a higher-resource language. Less than 1% of languages worldwide are "high-resource". Speakers of the remaining 99% who know no other language are essentially cut off from global technological progress.

The inability to access advanced technologies directly affects the level of socio-economic development in society, hindering access to information, education and development opportunities. This can exacerbate the digital divide between developed and developing countries and intensify global social inequality.

Collective approach to bridge AI language gaps

This article is not a ready-made directive, but rather a call to action. The growing technological gap based on language that we are witnessing is a challenge requiring collective and coordinated efforts from researchers, governments, educational institutions and private companies.

The approach to solving the problem must undoubtedly be a collective one. Piecemeal measures will not suffice. Manually processing data for large models is not only labour-intensive but also very slow. And perhaps we no longer have time for that. AI is developing at an unimaginable speed. In the last seven years, AI has evolved from defeating humans in the board game Go in 2016 to recognizing images and speech better than humans and passing the most challenging exams in 2023.

A common approach to overcoming the language gap in the AI sector should include the joint development of technologies and educational programmes, the implementation and testing of pilot projects, and drawing attention to the results through publications and educational activities.

It is essential for governments to encourage researchers and developers in this field and to ensure transparency and sufficient funding for the process.

Creating conditions for the development of AI technologies for "low-resource" languages will benefit individual nations and countries as well as the entire world, because preserving global language diversity online is as important as preserving it in reality. It allows us to maintain a variety of cultures, and thus manifold ways to understand our amazing world.

The article was originally published at